Best practices
Creating Visualizations
Typically, a function that produces a plot in R performs the data crunching and the graphical rendering. For example, geom_histogram()
calculates the bin sizes and the count per bin, and then it renders the plot. Plotting functions usually require that 100% of the data be passed to them. This is a problem when working with a database. The best approach is to move the data transformation to the database, and then use a graphing function to render the results.
Enterprise-ready dashboards
A few principles to keep in mind when developing an enterprise level dashboard: Push as much of the calculations of the dashboard back to the database - The time it takes for a dashboard to load, and respond, will become the most important aspect of its design. For dashboards, the expected time to load and response is a few seconds. Give the end-user with “train of thought” paths - These paths are commonly provided by way of a drill down from within the dashboard.
Making scripts portable
When we share R scripts, RNotebooks or other kind of content with others, and want the code to also execute in their environment, then we need to take steps to make the code portable. There are two primary considerations when sharing content that contain database connections: Ensuring the connection details for the development and deployment environments are in sync
Run Queries Safely
We will review four options to run SQL commands safely using the DBI package: Parameterised queries Using glue_sql
, Interpolation by “hand”, Manual escaping, SQL Injection Attack. The dbGetQuery()
command allows us to write queries and retrieve the results. The query has to be written using the SQL syntax that matches to the database type.
Schema selection
It is common for enterprise databases to use multiple schemata to partition the data, it is either separated by business domain or some other context. This is especially true for Data warehouses. It is rare when the default schema is going to have all of the data needed for an analysis. For analyses using dplyr, the in_schema()
function should cover most of the cases when the non-default schema needs to be accessed.
Securing Credentials
As with every programming language, it is important to avoid publishing code with your credentials in plain text. There are several options to protect your credentials in R. In order of preference, here are the methods that we will cover: Integrated security with DSN, Integrated security without DSN, Encrypt credentials with the keyring package, Use a configuration file with the config package, Environment variables.
Securing Deployed Content
There are multiple ways to secure access to a database in deployed content. The types of content that we can deploy are: Shiny Apps, flexdashboards, RMarkdown documents. It is typical for shiny applications and R Markdown reports to provide insight from data that is not directly accessible by the content audience. In these 1-to-many cases, it is common to define service accounts that access the database on behalf of the content audience.
Selecting a database interface
Connecting to a database with a method that supports the DBI package provides the following advantages: A consistent set of functions that work across all connections, Makes using dplyr as a front-end possible. Mostly based on how difficult is to setup, the order of preference of connection methods is: Native database driver implemented in database package (e.g. RPostgresSQL). To see the list of known database packages check out the Databases page.
Setting up ODBC Drivers
Using Posit Professional Drivers When working with databases on RStudio Desktop Pro and other Posit professional products, it is strongly recommended to use the Posit Professional Drivers. Not only these come with full support, but also they simplify the installation and configuration process is most cases, not requiring many of the steps detailed below.