Repeatable Data Science: A Demo
Data science is repeatable if results can be reproduced on demand. This is a core tenet of good science; without repeatability it’s unclear whether changes will make things better, worse, or have no effect at all!
Reproducing results relies on two attributes of an analysis:
- that the same inputs will yield the same outputs every time they are applied,
- that inputs are known and accessible.
Simply doing data science in R or Python vastly increases the likelihood that the same inputs will yield the same outputs (as long as you’ve been careful about random number generation). However, ensuring that inputs are recoverable and shareable can be much harder.
There are a variety of strategies for increasing input availability, that is, making the data in an analysis centrally available and accessible to anyone who needs it while keeping it secure.
Thanks to the pins package, it’s easier than ever to have repeatable data.
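As a small illustration of being careful about random number generation, the sketch below fixes the seed before drawing random values in base R. The seed value 2024 and the variable names are arbitrary choices made for this example.

```r
# Fix the seed so the random draws can be replayed later.
set.seed(2024)          # the seed value itself is arbitrary
first_run <- rnorm(5)

# Resetting the same seed replays exactly the same draws.
set.seed(2024)
second_run <- rnorm(5)

identical(first_run, second_run)  # TRUE
```

Without the calls to set.seed(), the two vectors would almost certainly differ, and any result built on them would not be reproducible.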
What is Pins?
Pins is an R package that makes it possible to remotely save (“pin”) any object serializable by R, like a data frame or model object. These objects are saved to a “board” such as Connect. Once a pin is deployed to Connect, you can use the standard Connect access controls to share it with others.
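A minimal sketch of pinning and retrieving an object is shown below. It assumes the current pins API (pin_write() and pin_read(), available in pins 1.0 and later) and uses board_temp(), a throwaway local board, so that it runs without a Connect server. The data frame and the pin name are made up for illustration.

```r
library(pins)

# board_temp() creates a temporary local board. It stands in for Connect here;
# against a real server the board would come from board_connect() instead.
board <- board_temp()

# A small, hypothetical data frame standing in for real analysis data.
station_info <- data.frame(
  station_id = c(1L, 2L),
  name = c("Station A", "Station B")
)

# Pin the object under a name, then read it back from the board.
pin_write(board, station_info, name = "station_info", type = "rds")
restored <- pin_read(board, "station_info")
```

Anyone with access to the board can call pin_read() with the same name to get the same object.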
Pins is particularly useful for sharing R objects when the objects are:
- Relatively small (a few hundred megabytes at most)
- Reused across multiple pieces of content
- Only needed in their most current form
Things like an auxiliary data frame for an analysis or a statistical model are particularly good candidates for pins.
The Bike Prediction App
The Bike Prediction app displays the number of bikes predicted to be at the various docks of Washington DC’s bikeshare program in the near future.
In this app, the user can click a dock on the map (built using the leaflet package), and the predicted number of bikes at that station appears in the bottom half of the page.
Using Pins with Repeatable Data
The Bike Prediction App uses pins in two ways.
Metadata File
The app uses a metadata file for the stations in the bikeshare system. The station info data frame maps the numeric station IDs to their names, latitudes, and longitudes. The data frame is stored as a pin on Connect and is updated every week by a scheduled R Markdown document on Connect. The app reads the pin using a Connect API key stored as an environment variable on Connect, which keeps the key out of the app’s source code.
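Reading a pin with credentials taken from environment variables might look like the sketch below. It is not the app's actual code: read_station_info() is a hypothetical helper, the pin name in the commented call is a placeholder, and it assumes pins 1.1 or later, where board_connect(auth = "envvar") reads the server address and key from the CONNECT_SERVER and CONNECT_API_KEY environment variables.

```r
library(pins)

# Hypothetical helper: reads the station metadata pin from a board.
# The default board authenticates with CONNECT_SERVER and CONNECT_API_KEY,
# so no credentials are written into the script. The default is only
# evaluated when the function is called without an explicit board.
read_station_info <- function(pin_name,
                              board = board_connect(auth = "envvar")) {
  pin_read(board, pin_name)
}

# Example call (placeholder pin name, requires a reachable Connect server):
# station_info <- read_station_info("some.user/bike_station_info")
```

Passing the board as an argument also makes it possible to try the helper against a local board such as board_temp() before pointing it at Connect.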
One of the nice features of data pins on Connect is that users can see a rendering of the data in the Connect UI.
Additionally, access to the pin can be controlled just like for any other asset on Connect.
A Model
The Bike Prediction App also uses pins to save the current version of the model. The model is automatically re-trained every morning and pinned onto Connect.
This pinned model is then used by both the model assessment script and the plumber API to assess model quality and serve the predictions.
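The retrain-then-consume pattern can be sketched as follows. This is an illustration only: the lm() fit on the built-in mtcars data is a placeholder for the app's real model, "bike_model" is a made-up pin name, and board_temp() stands in for a Connect board so the example runs locally.

```r
library(pins)

board <- board_temp()  # stand-in for a Connect board

# Training job: fit a model and pin the fitted object.
# lm() on mtcars is only a placeholder for the real bike model.
model <- lm(mpg ~ wt, data = mtcars)
pin_write(board, model, name = "bike_model", type = "rds")

# Consumer (for example an assessment script or an API):
# read the current model from the board and use it for prediction.
current_model <- pin_read(board, "bike_model")
prediction <- predict(current_model, newdata = data.frame(wt = 3))
```

Because every consumer reads the same pin, the assessment script and the API are always working from the same model object.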
Using pins, it’s easy to recover the current state of the data or model that’s needed to make a particular analysis work, making your data science work much more repeatable.