These are my personal notes from RStudio::conf 2019. Links to the slides point to the presenters' personal sites, and the video links point to RStudio.com.
Shiny is a web framework for R, a language not traditionally known for web frameworks, to say the least. As such, Shiny has always faced questions about whether it can or should be used “in production”. In this talk we’ll explore what “production” even means, review some of the historical obstacles and objections to using Shiny for production purposes, and discuss practices and tools that can help your Shiny apps flourish.
In education, there is and has always been debate about how to teach. One of these debates centers on the role of the teacher: should their role be minimal, allowing students to find and classify knowledge independently, or should the teacher be in charge of what happens in the classroom, explaining to students all they need to know? These forms of teaching have many names, but the most common are exploratory learning and direct instruction, respectively. While the debate is not settled, researchers are presenting more and more evidence that explicit direct instruction is more effective than exploratory learning in teaching language, mathematics, and science. These findings raise the question of whether the same might be true for programming education. This is especially of interest since programming education is deeply rooted in the constructionist philosophy, leading many programmers to follow exploratory learning methods, often without being aware of it. This talk outlines the history of programming education and the additional beliefs in programming that lead to the prevalence of exploratory forms of teaching. We also explain the didactic principles of direct instruction, explore them in the context of programming, and hypothesize what direct instruction might look like for programming.
“Everyone should learn programming”
— Every programmer ever
In this talk, I’ll lay out the reasons that blogging, open source contribution, and other forms of public work are a critical part of a data science career. For beginners, a blog is a great accompaniment to data science coursework and tutorials, since it gives you experience applying practical data science skills to real problems. For data scientists at any stage of their careers, open source development offers practice in collaboration, documentation, and interface design that complement other kinds of software development. And for data scientists more advanced in their careers, writing a book is a great way to crystallize your expertise and ensure others can build on it. All of these practices build skills in communication and collaboration that form an essential component of data science work. Each also lets you build a public portfolio of your skills, get feedback from your peers, and network with the larger data science community.
Have you ever had a challenging time cloning someone’s data analysis repo and easily re-running the analysis without fiddling with missing packages, mismatched versions, external dependencies, unavailable data or a whole host of other issues? Would you like your own work to be reproducible where someone else can access your data, code, workflow, models and provenance and easily re-create your results without consulting you? Then this is the talk for you.
Make a research compendium!
While teaching a course using “R for Data Science”, I wrote a complete set of solutions to its exercises and posted them on GitHub. Then other people started finding them. And now I’m here. In this talk, I’ll discuss why I did it, and what I learned from the process, both what I learned about the tidyverse itself, and what I learned from teaching it.
Come on a journey through pull request #2196. What started as a seemingly simple fix for a bug in ggplot2’s box plots developed into an entirely new placement algorithm for ggplot2 geoms. This talk will cover tips and techniques for debugging, testing, and not smashing your computer when dealing with tricky bugs.
- reprex - minimal reproducible example
- debug() - step through what happens when a function is called (see the sketch below)
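A hypothetical sketch of the two tools mentioned above (the snippet and the helper function are my own illustration, not from the talk):

```r
# reprex: wrap a small, self-contained snippet so it renders (code plus output)
# ready to paste into a GitHub issue or Stack Overflow question.
# reprex::reprex({
#   library(ggplot2)
#   ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()
# })

# debug(): flag a function so the next call drops into the interactive debugger,
# letting you step through it line by line.
center <- function(x) (x - mean(x)) / sd(x)
debug(center)
center(c(1, 5, 3))   # opens the debugger and steps through center()
undebug(center)
```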
Categorical data, called "factor" data in R, presents unique challenges in data wrangling. R users often look down on tools like Excel for automatically coercing variables to incorrect datatypes, but factor data in R can produce very similar issues. The stringsAsFactors=HELLNO movement and standard tidyverse defaults have moved us away from the use of factors, but they are sometimes still necessary for analysis. This talk will outline common problems arising from categorical variable transformations in R, and show strategies to avoid them, using both base R and the tidyverse (particularly, dplyr and forcats functions).
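A minimal sketch of the kind of factor handling the talk describes, using forcats (the example data is my own):

```r
library(forcats)

x <- factor(c("low", "high", "medium", "high"))

levels(x)                                     # alphabetical by default: "high" "low" "medium"
x <- fct_relevel(x, "low", "medium", "high")  # impose a meaningful level order
fct_recode(x, lo = "low", hi = "high")        # rename levels without silent coercion
fct_count(x)                                  # tabulate levels as a tibble
```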
The first iteration of the R4DS Online Learning Community was created as an online space for learners and mentors to gather and work through the “R for Data Science” text in a collaborative and supportive environment. The creation of this group was inspired by my own success in transitioning to a career in data science coupled with the resources that I wanted to see in the R programming space. This talk will go through the learnings of creating an online learning space focused on R programming for data science, and how future iterations of similar groups can more proactively center on bringing about diversity, equity, and inclusion to data science spaces.
Of the many coding puzzles on the web, few focus on the programming skills needed for handling untidy data. During my summer internship at RStudio, I worked with Jenny Bryan to develop a series of data science puzzles known as the “Tidies of March.” These puzzles isolate data wrangling tasks into bite-sized pieces to nurture core data science skills such as importing, reshaping, and summarizing data. We also provide access to puzzles and puzzle data directly in R through an accompanying Tidies of March package. I will show how this package models best practices for both data wrangling and project management.
In this talk, we will present our approach to incorporating R and RStudio into a 10-week introductory statistics course for non-majors at Cal Poly. Our primary contribution will be to share a series of Shiny Apps, created to ease students with no statistical or coding background into the philosophy of using programming tools to explore data. Our program was recently used in 3 sections of 35 students each this Fall, during which students were surveyed regularly for their reactions to the approach. We will demonstrate our new tools, discuss our successes and failures, share student-generated output, and summarize the results of our Fall survey.
The Carpentries is an open, global community teaching researchers the skills to turn data into knowledge. Since 2012 we have taught 700+ R workshops & trained 1600+ volunteer instructors. Our workshops use evidence-based teaching, focus on foundational and relevant skills and create an inclusive environment. Teaching the tidyverse allows learners to start working with data quickly, and keeps them motivated to begin and sustain their learning. Our assessments show that these approaches have been successful in attracting diverse learners, building confidence & increasing coding usage. Through our train-the-trainer model and open, collaborative lessons, this approach scales globally to reach more learners and further democratize data.
Time series can be frustrating to work with, particularly when processing raw data into model-ready data. This work presents two new packages that address a gap in existing methodology for time series analysis (raised in rstudio::conf 2018). The tsibble package supports organizing and manipulating modern time series, leveraging tidy data principles along with contextual semantics: index and key. The tsibble data structure seamlessly flows into forecasting routines. The fable package is a tidy renovation of the forecast package. It promotes transparent forecasting practices and concise model representations, to empower analysts tackling a broad domain of forecasting problems. This collection of packages forms the tidyverts, which facilitates a fluent and fluid workflow for analyzing time series.
3 main ideas:

- has_gaps(), scan_gaps(), count_gaps() and fill_gaps() explore missing observations (see the sketch after this list)
- index_by()
- slide(), tile() and stretch() create rolling functions, structured like purrr functions
  - slide(), slide2(), pslide(), and type-stable suffixes like slide_dbl(), slide_chr(), etc.
- tidy(), glance() and augment()
- Key is the key to extract values across tsibble, mable and fable
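A minimal sketch of the gap-handling and aggregation verbs above, assuming a recent tsibble API and made-up daily data:

```r
library(tsibble)
library(dplyr)

sales <- tibble(
  date  = as.Date("2019-01-01") + c(0, 1, 3, 4, 6),  # two days are missing
  store = "A",
  units = c(10, 12, 9, 14, 11)
) %>%
  as_tsibble(key = store, index = date)

has_gaps(sales)                 # does each key contain implicit gaps?
count_gaps(sales)               # where are they, and how long?
sales_full <- fill_gaps(sales)  # turn implicit gaps into explicit NA rows

# Aggregate to a coarser time index with index_by()
sales_full %>%
  index_by(week = yearweek(date)) %>%
  summarise(weekly_units = sum(units, na.rm = TRUE))
```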
Is there ever a place for the third dimension in visualizing data? Is the 3D pie chart truly bad, or just misunderstood? In this talk, I will show you how you can create beautiful 3D maps and visualizations with the rayshader package. In addition, I will talk about the value of 3D plotting, how interactions with the R community helped drive the development of rayshader, and how writing/blogging about your projects can vastly improve your code. And, of course—lots of beautiful 3D maps and figures.
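A small rayshader sketch using the built-in volcano elevation matrix (my own example, assuming the sphere_shade()/plot_map()/plot_3d() interface):

```r
library(rayshader)

# Shade the built-in volcano elevation matrix and draw a 2D hillshaded map
shaded <- sphere_shade(volcano, texture = "desert")
plot_map(shaded)

# The same surface rendered in 3D (opens an rgl window):
# plot_3d(shaded, volcano, zscale = 3)
```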
Animation of data visualisation is becoming increasingly popular, both as an attention grabber on social media and as a way to tell small data stories. gganimate is a package that extends ggplot2 for making animations and provides a grammar of animation on top of the grammar of graphics. This talk will quickly introduce gganimate, and then dive into a series of different animations, showing how they were made and how they could be changed or expanded.
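A small gganimate sketch (my own example using the built-in Orange data, not from the talk):

```r
library(ggplot2)
library(gganimate)

# Animate tree growth over time: transition_time() maps a variable to frames,
# shadow_mark() keeps earlier frames visible as a faded trail.
p <- ggplot(Orange, aes(age, circumference, colour = Tree)) +
  geom_point(size = 3) +
  transition_time(age) +
  shadow_mark(alpha = 0.3)

# animate(p, nframes = 50)
```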
The R objects used to represent model fits are notoriously inconsistent, making data analysis inconvenient and frustrating. The broom package resolves this issue by defining a consistent way to represent model fits. By summarizing essential information about fits in tidy tibbles, broom makes it easy to programmatically work with model objects. Combining broom with list-columns results in an especially powerful way to work with many model fits at once. This talk will feature several case studies demonstrating how broom resolves common problems in data analysis.
- tidy() summarizes information about fit components
- glance() reports goodness-of-fit values, always as a one-row summary
- augment() adds information about observations, like predictions etc.
- With tidy() and knitr::kable(), you can produce nice and quick model output
- With map_df(), a list of model fits and glance(), you can easily compare different model fits
- Combine nest(), tidy()/glance() and unnest() before passing the result to ggplot() (see the sketch after this list)
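A minimal sketch tying these pieces together (my own example with mtcars):

```r
library(broom)
library(dplyr)
library(tidyr)
library(purrr)

fit <- lm(mpg ~ wt, data = mtcars)

tidy(fit)     # one row per term: estimates, standard errors, p-values
glance(fit)   # one-row model summary: R^2, AIC, ...
augment(fit)  # original data plus fitted values and residuals

# Many models at once: one fit per number of cylinders
mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(fit    = map(data, ~ lm(mpg ~ wt, data = .x)),
         tidied = map(fit, tidy)) %>%
  select(cyl, tidied) %>%
  unnest(tidied)
```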
parsnip is a new tidymodels package that generalizes model interfaces across packages. The idea is to have a single function interface for types of specific models (e.g. logistic regression) that lets the user choose the computational engine for training. For example, logistic regression could be fit with several R packages, Spark, Stan, and Tensorflow. parsnip also standardizes the return objects and sets up some new features for some upcoming packages.
Workflow:

- set_engine(), e.g. "lm", and parsnip knows how to translate your input into the arguments of the model
- multi_predict() returns a list column!
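A minimal sketch of this single-interface workflow, assuming parsnip's linear_reg()/set_engine()/fit() functions (my own example):

```r
library(parsnip)
library(magrittr)  # for %>%

# One model specification, several possible computational engines
spec <- linear_reg()

fit_lm <- spec %>% set_engine("lm") %>% fit(mpg ~ wt + hp, data = mtcars)
# fit_stan <- spec %>% set_engine("stan") %>% fit(mpg ~ wt + hp, data = mtcars)

predict(fit_lm, new_data = mtcars[1:3, ])  # standardized tibble of .pred values
```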
Uncertainty is a key component of statistical inference. However, uncertainty is not easy to convey effectively in data visualizations. For example, viewers have a tendency to interpret visualizations of the most likely outcome as the only possible one. Viewers may also misjudge the likelihood of different possible outcomes or the extent to which moderately rare outcomes may deviate from the expectation. One way in which we can help the viewer grasp the amount of uncertainty present in a dataset is by showing a variety of different possible modeling outcomes at once. For example, in a linear regression, we could plot a number of different regression lines with slopes and intercepts drawn from the range of likely values, as determined by the variation in the data. Such visualizations are called Hypothetical Outcomes Plots (HOPs). HOPs can be made in static form, showing the various hypothetical outcomes all at once, or preferably in an animated form, where the display cycles between the different hypothetical outcomes. With recent progress in ggplot2-based animation, via gganimate, as well as packages such as tidybayes that make it easy to generate hypothetical outcomes, we can easily produce animated HOPs in a few lines of R code. This presentation will cover the key concepts, packages, and techniques to generate such visualizations.
- bootstrapify(n) as a dplyr verb
- sampler(n) functions in the same way as bootstrapper()
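A static HOPs sketch using plain bootstrap resampling and ggplot2 (my own illustration of the idea, not the ungeviz/gganimate workflow from the talk):

```r
library(ggplot2)
library(dplyr)
library(purrr)

set.seed(42)

# Fit a regression line to each of 20 bootstrap resamples of mtcars
hops <- map_df(1:20, function(i) {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  fit  <- lm(mpg ~ wt, data = boot)
  tibble(draw = i, intercept = coef(fit)[1], slope = coef(fit)[2])
})

# Static HOPs: show all hypothetical regression lines at once
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_abline(data = hops, aes(intercept = intercept, slope = slope), alpha = 0.3)
```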
How can you tell that your scripts, applications, and package functions are working as expected? Are you sure that when you make changes in one part of the code, it won’t break something in another part? Have you thought deeply about how the consumers of your code (including Future You) will use it, maintain it, fix it, and improve it? Code quality is essential not only for reliable results but also for your script’s maintainability and your users’ satisfaction. Quality can be measured in part with targeted testing, and fortunately, there are several effective and easy-to-use code testing tools available in R. This talk will discuss some of the most useful testing packages, covering both concepts and examples.
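As a concrete illustration (my own sketch; the talk covers testing tools more broadly), a small unit test with testthat:

```r
library(testthat)

# A small helper we want to be confident about
celsius_to_fahrenheit <- function(c) c * 9 / 5 + 32

test_that("celsius_to_fahrenheit converts known values", {
  expect_equal(celsius_to_fahrenheit(0), 32)
  expect_equal(celsius_to_fahrenheit(100), 212)
})
```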
The main goals of pkgman are to make package installation fast and more reliable. This allows new, simpler and safer workflows, such as separate package libraries for projects. In this talk, we will show the features that make pkgman fast, convenient and reliable.

Features that make pkgman fast:

- Concurrency: pkgman performs all downloads, package builds and installations concurrently by default.
- Metadata and package cache: pkgman caches all metadata and all downloaded and locally built packages.
- Laziness: pkgman only downloads and installs packages if needed.

Features that make pkgman convenient:

- BioC and GitHub packages are supported seamlessly.
- Informative UI: pkgman can lay out the installation/update plan for the user to confirm, and it returns data about downloads, builds, installations, etc.

Features that make pkgman reliable:

- Dependency solver: pkgman makes sure that you end up with a consistent, working set of dependencies.
- Private library: pkgman's own dependencies do not affect your regular package library, and vice versa. pkgman does not load any packages from your regular library.
Software dependencies can often be a double-edged sword. On one hand, they let you take advantage of others’ work, giving your software marvelous new features and reducing bugs. On the other hand, they can change, causing your software to break unexpectedly and increasing your maintenance burden. These problems occur everywhere, in R scripts, R packages, Shiny applications and deployed ML pipelines. So when should you take a dependency and when should you avoid them? Well, it depends! This talk will show ways to weigh the pros and cons of a given dependency and provide tools for calculating the weights for your project. It will also provide strategies for dealing with dependency changes, and if needed, removing them. We will demonstrate these techniques with some real-life cases from packages in the tidyverse and r-lib.
What does it mean to say software is, to quote one Twitter user, ‘so f***ing magical!’? In the context of our popular community hobby of rating and sharing R packages, the term ‘magic’ seems reserved for our most powerful expressions of visceral approval. Why is this? And what does it say about how we value software? Can this magical quality be quantified? We will consider these questions in examination of magical specimens, and in the process reveal the surprising depths at which notions of magic are embedded in the R zeitgeist.
Statistics has made science resemble math, so much so that we’ve begun to conflate p-values with mathematical proofs. We need to return to evaluating a scientific discovery by its reproducibility, which will require a change in how we report scientific results. This change will be a windfall to commercial data scientists because reproducible means repeatable, automatable, parameterizable, and schedulable.
The traditional way to beautiful PDFs is often through LaTeX or Word, but have you ever thought of printing a web page to PDF? Web technologies (HTML/CSS/JavaScript) are becoming more and more amazing. It is entirely possible to create high-quality PDFs through Google Chrome or Chromium now. Web pages are usually single-page documents, but they can be paginated thanks to the JavaScript library Paged.js, so that you can have elements like headers, footers, and page margins for printing purposes. In this talk, we introduce a new R package, pagedown, to create PDF documents based on R Markdown and Paged.js. Applications of pagedown include, but are not limited to, books, articles, posters, resumes, letters, and business cards. With the power of CSS and JavaScript, you can typeset your documents with amazing elegance (e.g., a single line of CSS, “tr:nth-child(even) { background: #eee; }”, will give you a striped table, and “border-radius: 50%;” gives you a circular element) and power (e.g., HTML Widgets).
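A minimal sketch of the pagedown workflow (the file names are hypothetical):

```r
library(pagedown)

# Render an R Markdown document to paged HTML, then print it to PDF
# with headless Chrome/Chromium via chrome_print().
html_file <- rmarkdown::render("resume.Rmd")   # e.g. using pagedown::html_resume
chrome_print(html_file, output = "resume.pdf")
```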
- pagedown::chrome_print() prints pages, but is not yet working very well

With the gt package, anyone can make great-looking display tables. Though the package is still early in development, you can do some really great things with it right now! I’ll walk through a few examples that touch upon the more common table-making use cases. These will include features like adding table parts, integrating footnotes, styling/transforming table cells, using tables in R Markdown documents, and even including gt tables in email messages.
| function | explanation |
|---|---|
| gt() | turns your table data into a gt object |
| tab_*() functions | add parts to a gt table |
| fmt_*() functions | format parts of a table; can specify both columns and rows |
| info_*() functions | show information tables for the different formatting options |
| tab_options() | general options for table appearance |
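A tiny gt sketch using the function families listed above (my own example):

```r
library(gt)
library(dplyr)

head(mtcars) %>%
  gt() %>%                                    # turn the data into a gt object
  tab_header(title = "A first gt table") %>%  # add a table part: the header
  tab_options(table.font.size = "small")      # tweak overall appearance
```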
My brain is lazy, shallow and easily distracted. Learn how I use notebooks to keep my present-self organised, my future-self up to speed with what I was thinking months ago, and also how I use parameterised reports to share results with both quantitative and non-quantitative audiences across multiple endpoints. I can update and render a variety of outputs from a single markdown notebook or report. I’ll show you how I organise my work using the tidyverse, use child documents with parameterisation and also how this is served out to my colleagues via RStudio Connect.
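A hedged sketch of the parameterised-report idea (the file name and parameters are hypothetical):

```r
# The Rmd declares its parameters in the YAML header, e.g.
#
#   params:
#     region: "all"
#     detail: "full"
#
# and the same notebook can then be rendered for different audiences:
rmarkdown::render("monthly_report.Rmd",
                  params      = list(region = "EMEA", detail = "summary"),
                  output_file = "monthly_report_emea.html")
```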
Over the past eight years of doing data science, I’ve made plenty of mistakes, and I’d love to share them with you – including what I’ve learned and what I’d do differently with some hindsight. This talk will cover mistakes made during analyses (including communication when delivering results), team and infrastructure mistakes, plus some advice for incoming data scientists.
There is no doubt that RStudio has had an impact on how introductory statistics is taught in colleges today. When we consider the sheer dominance that giants like Texas Instruments, IBM, and Pearson Publishing have had in academic curriculum development it’s no small wonder that tools like R and Python have been able to gain a foothold. Projects like DataCamp, ModernDive.com, “Introductory Statistics with Randomization and Simulation” courtesy of openintro.org, Wickham’s “R for Data Science” and Peng’s “R Programming for Data Science” are great resources for the student who has already some fundamental math or statistical background and has become comfortable around computing and applications-driven computational exercises. But many of us know that Data Science cannot simply be relegated to the privileged few that stumble into it by virtue of circumstance. My passion, and the purpose of my talk, is to provide educators with a digestible guidebook that would be appropriate for introduction to statistical concepts in high school, college, and under-resourced schools looking for ways to increase diversity in STEM. Organized in small, adaptable activities designed to be the amuse-esprit enticing both the timid and the skeptical to the proverbial banquet table that is RStudio, this exploration into the world of statistics education should be of interest to a wide audience. My hope is to increase data literacy in real world context – with primary emphasis on descriptive statistics and distributions.
vctrs is a new package that provides tools (cognitive and computational) to ensure that functions behave consistently with respect to inputs of varying length and type. The end goal of vctrs is to be invisible to the end user of the tidyverse (simply enabling their predictions about function outputs to be more correct), but will help developers write functions that “just work”.
|  | Logical | Integer | Double | Character |
|---|---|---|---|---|
| Logical | Logical | Integer | Double | Character |
| Integer | Integer | Integer | Double | Character |
| Double | Double | Double | Double | Character |
| Character | Character | Character | Character | Character |
- vec_c() gives you an error, unless you specifically tell the function that you want, e.g., a character by setting the argument .ptype = character()
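A minimal sketch of the difference between base coercion (the table above) and vctrs:

```r
library(vctrs)

c(TRUE, "yes")       # base R silently coerces to character: "TRUE" "yes"
vec_c(TRUE, 1L)      # logical + integer -> integer: 1 1
vec_c(1L, 2.5)       # integer + double  -> double:  1.0 2.5
# vec_c(TRUE, "yes") # error: no common type for logical and character
```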
The “tidy eval” framework is implemented in the rlang package and is rolling out in packages across the tidyverse and beyond. There is a lively conversation these days, as people come to terms with tidy eval and share their struggles and successes with the community. Why is this such a big deal? For starters, never before have so many people engaged with R’s lazy evaluation model and been encouraged and/or required to manipulate it. I’ll cover some background fundamentals that provide the rationale for tidy eval and that equip you to get the most from other talks.
- dplyr
- rlang
- enquo() and !!, and maybe := if you want to name things

In practice there are two main flavors of tidy eval functions: functions that select columns, such as dplyr::select(), and functions that operate on columns, such as dplyr::mutate(). While sharing a common tidy eval foundation, these functions have distinct properties, good practices, and available tooling. In this talk, you’ll learn your way around selecting and doing tidy eval style.
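A minimal tidy eval sketch of the capture-and-splice pattern from the notes above (the wrapper function is my own example):

```r
library(dplyr)
library(rlang)

# Capture a user-supplied column with enquo(), splice it back in with !!,
# and name the result with := .
summarise_mean <- function(data, var, name = "mean") {
  var <- enquo(var)
  data %>%
    summarise(!!name := mean(!!var, na.rm = TRUE))
}

mtcars %>%
  group_by(cyl) %>%
  summarise_mean(mpg, name = "mean_mpg")
```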