These are my personal notes from RStudio::conf2021.
Keynotes
Maintaining the house the tidyverse built
Hadley Wickham, RStudio
Abstract
Hadley will talk about how the tidyverse has evolved since its creation (just five years ago!). You’ll learn about our greatest successes, learn from our biggest failures, and get some hints of what’s coming down the pipeline for the future.
Notes
- Code that worked perfectly before may stop working later, even though you haven’t changed it
- Updating packages
- If you use code from a package, it may change when you update the package
- Any update may break code
- Tidyverse functions:
- Experimental
- “Out of warranty”
- New functions
- Deprecated
- “Out of warranty”
- Was a bad idea; stop using this function
- It still works, and does what it used to do
- Will be removed in the future
- Stable
- “In warranty”
- Most functions are in this category
- Superseded
- “In warranty”
- It works, but better alternatives have been developed
- Will not be removed, but also not updated
- When updating old projects, only update superseded functions if needed. If not, the code will still work as it did.
- Breaking changes
- Removing functions
- Removing arguments
- Restricting allowed inputs
- Changing the output
- Non-breaking changes
- Adding functions
- Adding arguments
- Expanding allowed inputs
- “Off-label” use of functions
- The original author has no way of anticipating how you will use the function, which increases the risk of your code breaking
- Are you using the function because it does what it says it does, or because it happens to do what you want (possibly as an unintended side effect)?
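The difference between non-breaking and breaking changes, and a deprecated function that still works but warns, can be sketched in base R. The `centre_*` functions are hypothetical and stand in for successive versions of one package function. Base R's `.Deprecated()` is used here only as a stand-in for the tidyverse's own lifecycle tooling, which the talk did not show in these notes.

```r
# Hypothetical version 1 of a package function.
centre_v1 <- function(x) {
  mean(x)
}

# Non-breaking change: a new argument with a default, so old calls behave as before.
centre_v2 <- function(x, na.rm = FALSE) {
  mean(x, na.rm = na.rm)
}

# Breaking change: the output is now a list, so code expecting a number fails.
centre_v3 <- function(x, na.rm = FALSE) {
  list(centre = mean(x, na.rm = na.rm))
}

# Deprecated: still does what it used to do, but warns and points to a replacement.
centre_old <- function(x) {
  .Deprecated("centre_v2")
  centre_v2(x)
}

x <- c(1, 2, 3, 4)

# Old calls give the same result after the non-breaking change.
same_result <- identical(centre_v1(x), centre_v2(x))

# Downstream arithmetic would break on the version 3 output.
still_numeric <- is.numeric(centre_v3(x))

# Capture the text of the deprecation warning instead of printing it.
msg <- tryCatch(centre_old(x), warning = function(w) conditionMessage(w))
```

Here `same_result` is `TRUE` and `still_numeric` is `FALSE`, which is the practical meaning of "any update may break code": only the change to the output forces callers to rewrite anything.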
Reporting on and visualising the pandemic
John Burn-Murdoch, Financial Times
Abstract
John will discuss the lessons he’s learned reporting on and visualising the pandemic, including the world of difference between making charts for a technical audience and making charts for a mass audience. You’ll learn from his experience navigating the highly personal and political context within which people consume and evaluate graphics and data, and how that can help us better design and communicate with visualisations.
Notes
- To be able to make effective visualizations, you need to understand how people consume charts
- Eye tracking experiment
- People’s eyes are first drawn to the title of the chart
- You can have everything else right, but still have an ineffective plot if the title is not good
- When presenting a plot, there’s lots of assumed knowledge (for example, what information a given type of graph conveys)
- Including an active title, text and annotation, telling the main message of the plot, helps non-plot people to understand what’s going on
- If people are confused by your chart, that’s on you, not them
- Visualization of information is personal and political, which we need to keep in mind to minimize the risk of bad reactions
- Animations can be really effective to tell a story, compared to static charts
- Focus on getting the core message across in a way that is as easy as possible to understand. Making the plot as close to the truth as possible doesn’t matter as much as we think it does
- Make graphics not just for graphics editors and chart-people, but for everyone.
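The point about active titles and annotations can be sketched with ggplot2. This is an illustration on the built-in `mtcars` data, not a chart from the talk; the titles and the annotation text are my own wording.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()

# Descriptive title: names the variables but leaves the reader to find the message.
descriptive <- base_plot +
  labs(title = "Fuel economy vs. weight")

# Active title: states the main message. The subtitle, axis labels and
# annotation spell out knowledge a chart person would otherwise assume.
active <- base_plot +
  labs(
    title = "Heavier cars travel fewer miles per gallon",
    subtitle = "Each point is one car model in the mtcars dataset",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon"
  ) +
  annotate(
    "text", x = 5, y = 18,
    label = "The heaviest cars\nmanage about 10 to 15 mpg"
  )
```

Both plots show the same points; only the second tells a reader who skims the title what to take away.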
Your Public Garden
Vicky Boykis, Automattic
Abstract
Vicky will discuss how, as people who can write code and analyze data, we have a lot of input and power over what our digital and work worlds look like, and therefore can act as agents of change and repair.
Talks
Always look on the bright side of plots
Kara Woo, Sage Bionetworks
Abstract
Everyone who creates visualizations in R is bound to make mistakes that prevent their plots from looking as they should. Sometimes, these mistakes create beautiful “accidental aRt”, though other times they’re just plain frustrating. Either way, however, there’s something to be learned. This talk will draw on years of watching both the ggplot2 issue tracker and the @accidental__aRt twitter account to highlight some common plot foibles and explain what they can teach us about how ggplot2 works.
Notes
- ggplot2 is a data-first approach to building visualizations
- Mapping mishaps
- aesthetics map visual elements to variables in the data
- When adding text with `geom_text()`, x, y and label should only be mapped with aesthetics if they correspond to variables. If you add a specific text element by hand, mapping it with `aes()` repeats the text element `nrow(data)` times, making the plot slow and the text pixelated. `annotate()` is built to add single text elements to the plot.
- Scale snafus
- Setting limits on the scale vs. the coordinate system.
- When setting scale limits, data outside the limits are set to `NA` and not included in summaries.
- Setting the limits on the coordinate system only zooms in on the plot
- Theme threats
- The most specific theme element wins. Example: `axis.text.y.right` overrides `axis.text.y`, which overrides `axis.text`
- If a more specific theme element is specified in the theme, for example `axis.text.y.right`, then setting `axis.text.y` will not affect the right y-axis text, which in that case needs to be set specifically.
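The three kinds of mistake above can be reproduced in a few lines of ggplot2. This is a minimal sketch on the built-in `mtcars` data; the plots, label text and colours are my own choices, not examples from the talk.

```r
library(ggplot2)

# Mapping mishaps: a hand-written label goes in annotate(), which draws it once.
# The same constant inside geom_text(aes(...)) would be drawn once per row.
p_text <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  annotate("text", x = 4.5, y = 30, label = "One label, drawn once")

# Scale snafus: scale limits turn outside values into NA before the boxplot
# statistics are computed; coordinate limits only zoom the finished plot.
p_box <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot()
p_scale <- p_box + scale_y_continuous(limits = c(15, 25))
p_zoom <- p_box + coord_cartesian(ylim = c(15, 25))

# Theme threats: the most specific element wins, so the right-hand y-axis
# text is blue, the left-hand y-axis text is red and the x-axis text is grey.
p_theme <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  scale_y_continuous(sec.axis = dup_axis()) +
  theme(
    axis.text = element_text(colour = "grey40"),
    axis.text.y = element_text(colour = "red"),
    axis.text.y.right = element_text(colour = "blue")
  )
```

Printing `p_scale` gives a warning about removed rows and boxes computed only from values between 15 and 25, while `p_zoom` shows the original boxes cropped to the same range.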
Art Lessons: One Year as RStudio’s Artist-in-Residence
Allison Horst, RStudio
Abstract
Art can be a welcoming bridge for learners and users to engage with and learn tools and skills in R. As RStudio’s first Artist-in-Residence, my goal has been to make the R landscape more welcoming for a broader community of users through engaging, didactic artwork. In this R, art, and heart-filled talk, I’ll share the motivation behind my R artwork and some lessons learned over the past year as Artist-in-Residence, including:
- Learning to embrace cute and credible artwork
- Art to help students engage with, learn and remember R skills
- Art for community building and support
I hope this talk inspires viewers to use, create and share more artwork, so that together we can make the R landscape feel even brighter.
Notes
- Art can help learners remember new functions
What’s new in tidymodels?
Max Kuhn, RStudio
Abstract
Tidymodels is a collection of packages for modeling using a tidy interface. In the last year there have been numerous improvements and extensions. This talk gives an overview of additional tuning methods, new extension packages for models and recipes, and other features.
Using R to Up Your Experimentation Game
Shirbi Ish-Shalom, Intuit
Abstract
Have you ever cut an A/B test short? Maybe because of traffic constraints, your antsy boss, or early successful results. In reality, cutting your test short can be catastrophic, making your business decision no better than a coin flip. Learn some R-driven tips & tricks to get meaningful results quickly with a statistically rigorous methodology called sequential testing, an A/B testing enhancement my team employs at Intuit.
Key Takeaways.
- What is sequential testing and how to use it.
- How to learn (and fail!) quickly by taking big metric swings
- How I used R to share my learnings & make them useful for anyone (even non-data scientists!) at my company.
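Why stopping a test early is risky can be seen in a small base R simulation. This is not the sequential testing method from the talk, only a sketch of the problem it addresses: both arms have the same true conversion rate, so every "significant" result is a false positive. The conversion rate, the look schedule and the helper names are assumptions made for the illustration.

```r
set.seed(42)

n_sims <- 1000
looks <- seq(100, 1000, by = 100)  # users per arm at each interim look

# Two-sided z-test for a difference in proportions with equal arm sizes.
p_value <- function(conv_a, conv_b, n) {
  pooled <- (conv_a + conv_b) / (2 * n)
  se <- sqrt(2 * pooled * (1 - pooled) / n)
  z <- (conv_a / n - conv_b / n) / se
  2 * pnorm(-abs(z))
}

# One A/A experiment: no true difference between the arms.
one_experiment <- function() {
  a <- rbinom(max(looks), 1, 0.1)
  b <- rbinom(max(looks), 1, 0.1)
  p <- sapply(looks, function(n) p_value(sum(a[1:n]), sum(b[1:n]), n))
  c(
    peeking = any(p < 0.05, na.rm = TRUE),  # stop at the first "win"
    fixed = isTRUE(p[length(p)] < 0.05)     # test once, at the planned end
  )
}

results <- replicate(n_sims, one_experiment())
false_positive_rate <- rowMeans(results)
```

`false_positive_rate["fixed"]` should sit near the nominal 5%, while `false_positive_rate["peeking"]` is noticeably higher because each interim look is another chance to see a spurious difference.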
Resources
Fairness and Data Science: Failures, Factors, and Futures
Grant Fleming, Elder Research
Abstract
In recent years, numerous highly publicized failures in data science have made evident that biases or issues of fairness in training data can sneak into, and be magnified by, our models, leading to harmful, incorrect predictions being made once the models are deployed into the real world. But what actually constitutes an unfair or biased model, and how can we diagnose and address these issues within our own work? In this talk, I will present a framework for better understanding how issues of fairness overlap with data science as well as how we can improve our modeling pipelines to make them more interpretable, reproducible, and fair to the groups that they are intended to serve. We will explore this new framework together through an analysis of ProPublica’s COMPAS recidivism dataset using the tidymodels, drake, and iml packages.
tidymodels/stacks, Or, In Preparation for Pesto: A Grammar for Stacked Ensemble Modeling
Simon Couch, Reed College
Abstract
Through a community survey conducted over the summer, the RStudio tidymodels team learned that users felt the #1 priority for future development in the tidymodels package ecosystem should be ensembling, a statistical modeling technique involving the synthesis of multiple learning algorithms to improve predictive performance. This December, we were delighted to announce the initial release of stacks, a package for tidymodels-aligned ensembling. A particularly statistically-involved pesto recipe will help us get a sense for how the package works and how it advances the tidymodels package ecosystem as a whole.
Using Guided Simulation Exercises to Teach Data Science with R
Chelsea Parlett-Pelleriti, Chapman University
Abstract
With more learning occurring virtually or in hybrid mode, hands-on ways to remotely teach DS are invaluable. Guided simulation exercises in R allow learners to explore concepts deeply, on their own time, and with others. They can also experiment with the simulations, try out edge cases, and challenge their assumptions, leading to more fruitful discussions. The comparison between coefficient estimates in regular, LASSO, and RIDGE regression, or how PCA performs when data are related are great examples of concepts where guided simulations can encourage learners to build intuitive knowledge. This talk explores how to use simulation exercises in R to help learners explore DS concepts and provides examples.
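A guided simulation of the kind the abstract describes might look like the following base R sketch, which compares ordinary least squares with ridge regression as the penalty grows. It is my own example, not one from the talk: the true coefficients, the penalty values and the `ridge_coef()` helper are assumptions, and ridge is fitted with its closed-form solution rather than a modelling package.

```r
set.seed(123)

n <- 100
p <- 5
X <- scale(matrix(rnorm(n * p), n, p))  # standardised predictors
beta <- c(3, -2, 0, 0, 1)               # true coefficients, two of them zero
y <- as.vector(X %*% beta + rnorm(n))
y <- y - mean(y)                        # centre y so no intercept is needed

# Closed-form ridge estimate; lambda = 0 gives ordinary least squares.
ridge_coef <- function(X, y, lambda) {
  as.vector(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

lambdas <- c(0, 1, 10, 100)
coefs <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)

# Overall size of the coefficient vector at each penalty.
coef_norms <- colSums(coefs^2)
round(coefs, 2)
```

Learners can change `lambdas`, `beta` or `n` and watch how the columns of `coefs` shrink towards zero, which is the kind of edge-case exploration the abstract encourages.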
How I became a Data Composer – examples of simulated datasets that bring value to a data-driven company
Richard Vogg
Abstract
How can I get the buy-in from business partners to use more advanced techniques? What can I do to make a data project involving several teams more efficient? And how can I train analysts who do not (yet) have access to sensitive data? A good data composer is skilled at creating suitable data quickly and efficiently. R has many functions and packages that help with simulating independent variables and composing those in a meaningful way. In this talk, I will share how I started creating data and how this skill helped me with solving some of the issues described above. Showing a few examples of small, medium-sized, and large data composition, I want to encourage attendees to simulate data and enrich their data skillset.
Resources
Video
Blog
Notes
- Started by combining simple simulated variables into small datasets
- You don’t only want independent variables, but variables that make sense together
- Being able to simulate realistic data allows for:
- Explaining concepts without having to rely on sensitive real data
- Starting to work on the data before it is collected, as analyses can be built on simulated data mimicking the expected data
- Composing data
- distribution functions
- correlate simulated data
- Full packages: wakefield, rcorpora, charlatan, fabricatr…
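The first two composing steps, drawing from distribution functions and then tying the variables together, can be sketched in base R. The `customers` dataset, its column names and all the numbers are made up for the illustration and do not come from the talk.

```r
set.seed(2021)
n <- 500

# Step 1: independent variables drawn from distribution functions.
customers <- data.frame(
  age = round(runif(n, min = 18, max = 80)),
  segment = sample(
    c("basic", "plus", "premium"), n,
    replace = TRUE, prob = c(0.6, 0.3, 0.1)
  )
)

# Step 2: variables that make sense together. Income rises with age plus noise.
customers$income <- 20000 + 600 * customers$age + rnorm(n, sd = 8000)

# Churn becomes less likely as income rises (logistic link).
churn_prob <- plogis(1 - customers$income / 40000)
customers$churned <- rbinom(n, size = 1, prob = churn_prob)

age_income_cor <- cor(customers$age, customers$income)
head(customers)
```

Because `income` is built from `age`, the two columns are clearly correlated, unlike columns simulated independently of each other.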
Categorical Embeddings: New Ways to Simplify Complex Data
Alan Feder, Invesco
Abstract
When building a predictive model in R, many of the functions (such as lm(), glm(), randomForest, xgboost, or neural networks in keras) require that all input variables are numeric. If your data has categorical variables, you may have to choose between ignoring some of your data and creating too many new columns.
Categorical embeddings are a relatively new method, building on techniques popularized in Natural Language Processing, that helps models solve this problem and can help you understand more about the categories themselves.
While there are a number of online tutorials on how to use Keras (usually in Python) to create these embeddings, this talk will use embed::step_embed(), an extension of the recipes package, to create the embeddings.
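The "too many new columns" problem is easy to reproduce with base R's `model.matrix()`. The sketch below shows only the problem, not `embed::step_embed()` itself; the `zip` variable and its 200 possible codes are invented for the illustration.

```r
set.seed(1)

# A hypothetical high-cardinality categorical variable: up to 200 distinct codes.
df <- data.frame(
  zip = factor(sample(sprintf("%05d", 1:200), 1000, replace = TRUE))
)

# Indicator (one-hot) encoding creates one numeric column per level.
one_hot <- model.matrix(~ zip - 1, data = df)

n_levels <- nlevels(df$zip)
n_columns <- ncol(one_hot)
```

A single categorical column turns into `n_columns` numeric columns, one per level. An embedding would instead represent each level with a small, fixed number of learned numeric columns.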