6. Data Access

In this section, you will learn:

How to enable content to connect to data sources
Different data access strategies
How to create service accounts
Methods for securing credentials
Troubleshooting data source connection errors

Data access from Connect

Content on Connect usually relies on one or more data sources. That data can exist inside or outside a content bundle.

Data inside the content bundle

When deploying content, developers can include data inside a content bundle. The main limitation here is that your developers must deploy everything (data and code) at the same time for it to be updated. If they need the data to be refreshed more often than the code is deployed, then a different data access method should be explored. Bundling data also increases the amount of server disk space the bundle files take up.

Data outside the content bundle

Data outside the content bundle can be stored in a database, a flat file on the Connect server filesystem, a flat file on a mounted file share, a pin on the server with the pins package, blog storage and other types of interfaces.

Any system requirements for accessing that data will need to be satisfied. For example, database drivers would need to be installed and configured and any file shares would need to be mounted.

Databases

Databases may use many different methodologies for data storage, such as:

File-based (e.g., SQLite, DuckDB)
Relational Database Management Systems (e.g., PostgresSQL, MariaDB, MicrosoftSQL)
NoSQL and Document Stores (e.g., MongoDB, Cassadra)

These databases use different wire protocols for connections, however, standardization across database vendors permits a smoother experience for administrators and users. Key standards to be aware of include:

ODBC, a cross-language standard protocol defining how external tools and systems can submit queries to databases. ODBC database drivers use the ODBC protocol for database calls. The drivers translate the ODBC interface to the database-specific functions.
Similar to the ODBC protocol, JDBC is a standard protocol for database queries, however, this is built around the Java ecosystem. We do not recommend JDBC database connections due to challenges in managing Java.
DBI, an R package and front-end for making database connections from the R language. This provides a consistent interface independent of the database. DBI is coupled with a back-end DBI-compliant R package such as odbc (note: lower-case. This is an R package) to make a database connection from R. Some database types have specific DBI-compliant packages available for making this front-end connection, such as RPostgres, RMariaDB, bigrquery, or RSQLite.
PEP-249 is a proposed standard in Python for making database connectors. This standard is used by packages such as pyodbc for making ODBC connections to databases.

As an administrator, you will need to install (preferably) ODBC drivers for relevant databases on Connect. Users can then use the appropriate R or Python packages to interact with these drivers to make their database connections.

Posit provides professionally-supported ODBC drivers for several databases for your use with Posit professional tooling. Installation is documented here. To configure the Posit Professional Drivers, or a driver supplied from another source, specify the driver location and attributes in /etc/odbcinst.ini.

Example /etc/odbcinst.ini file

[PostgreSQL]
Driver=/opt/rstudio-drivers/postgresql/bin/lib/libpostgresqlodbc_sb64.so
RStudioVersion=2021.10.0
Version=1.4.41.1001
Installer=RStudio Pro Drivers
UsageCount=1

Persistent Storage

One potential difficulty in using data from a mounted file share is that the path could change between the content running on Posit Workbench and Posit Connect. If the absolute path and files located there are identical on the Workbench and Connect servers, then the code will work on both servers (for example a /data directory). If there are differences in path, or if it is common for developers to be using a local IDE rather than Workbench, using a tool such as the config package can help with path management.

Data access strategies

Content on Connect runs differently than how scripts are executed on Workbench. By default, content is run by a service account that has minimal permissions and access. What this means is that access to data will need to be set up intentionally by the developer.

External data connections are handled by the R or Python code, never by Workbench or Connect itself. Developers will be able to connect to any external data source from Connect which they can connect to from R or Python (running on Linux).

There are different strategies for managing the authentication portions of the data connections, whether at the developer or server level, as well as for managing credentials:

Developer handled
Data Source Name (DSN)
Environment variables server-wide
Service accounts
Current user execution

An important note on security is that as Connect, by default, shares connections across an R or Python process data can be exposed to multiple connections. Often this is intended behavior for improving both app and server performance. However, in cases where there are security concerns around data access then forcing each connection to a single process by changing the runtime setting Max connections per process to 1 should be considered.

Developer handled

In this scenario, the developer is completely responsible for managing connections to the relevant data sources. Any server dependencies will need to be installed and configured on their behalf. For example database drivers, mounted file shares, and permissions for any designated directories inside Connect itself.

Consider recommending leveraging environment variables for storing any username and password information to avoid exposing secrets in the code. These variables are encrypted on disk for security.

Data Source Name (DSN)

DSNs provide a symbolic name that combines all the parameters required to make a database connection. This makes connections simpler for users to manage. Database credentials (typically a service account) can be stored in the DSN definition. If the database credentials are omitted here, users will need to include this in their database connection script. DSNs are configured in /etc/odbc.ini or ~/.odbc.ini.

For more details refer to the Installing Posit Professional Drivers for Posit Connect section.

Example odbc.ini file

[MyPostgresDB]
Driver   = "PostgreSQL"
Server   = "[your server's path]"
Database = "[your database's name]"
UID      = "[user]"
PWD      = "[password]"
Port     = 5432

Environment variables server-wide

Environment variables can be added server-wide for all pieces of content using a supervisor script, as detailed in the Process Management Section of the Connect Admin Guide. These environment variables can be connection strings, secrets, a DSN (see above), etc.

Service accounts

By default content on Connect runs as the default service account user rstudio-connect. Other service account users can be set to run pieces of content. A Linux user must be created and be a member of the primary group of the default run-as user (by default rstudio-connect and rstudio-connect).

After a service account user is created and configured, then credentials can be attached. One recommended way to do this is to create a user-level DSN with the database connections and service account credentials that are read-only for the service account to access. As the service account users are provisioned with home directories the DSN can be stored there.

For more information refer to the User Account for Processes section of the Admin Guide.

Current user execution

Content can be set to run as the currently logged-in user accessing an application, instead of the default rstudio-connect or other service accounts by running content using local Unix accounts for each user. This works for Shiny apps, Shiny documents, Dash, Streamlit and Bokeh applications and requires using PAM user authentication.

For more details on the configuration options available refer to the Current user execution section of the Admin Guide.

Credential management

As with every programming language, it is important to avoid publishing code with credentials in plain text. There are several options to protect credentials from being exposed:

DSNs defined in odbc.ini
keyring package
Content environment variables
config package
Prompt for credentials

For more information, refer to the Solutions article on managing credentials.

Connect API

The Connect API can do several content-level environment variable tasks including listing, setting and updating variables. Refer to the Content sections of the API documentation for more details on the endpoints available.

Troubleshooting

When in doubt, troubleshoot data access first from R or Python directly on the Connect Server. Instructions on how to utilize ODBC drivers in R and Python can be found in the Posit Pro Drivers Usage section of the documentation. This can help to distinguish between issues (driver, application code, and authentication issues). For troubleshooting, databases refer to the troubleshooting databases article.

The installation directories for R and Python can be searched to find all installed versions. Following Posit’s instructions for installing R and installing Python, each version will be listed in /opt/R and /opt/python. If R or Python has been installed in a different location then adjust the path accordingly for the examples that follow.

Executing R code

Check available language versions by running:

$ ls -1d /opt/R/*
sudo /opt/R/4.2.0
sudo /opt/R/4.1.3

Using the path found above, an interactive R session can be started with:

sudo /opt/R/4.2.0/bin/R

Commands can also be run from a single line. In this example the png package is being installed:

sudo /opt/R/4.2.0/bin/Rscript -e 'install.packages("png", repos = "http://cran.us.r-project.org")'

Executing Python code

Check available language versions by running:

$ ls -1d /opt/python/*
/opt/python/3.9.14
/opt/python/3.10.8

Using the path found above, an interactive Python session can be started with:

sudo /opt/python/3.10.8/bin/python3

Commands can also be run from a single line. In this example the pandas package is being installed:

sudo /opt/python/3.10.8/bin/python -m pip install pandas

Exercise

Tip

🚀 Launch the exercise environment!

In the exercise environment you will get experience:

Installing Posit Pro Drivers
Configuring DSNs and testing connections
Testing database connections using R and Python
Setting environment variables for content
Creating and running content from a service account.

Go to: 7. Managing Resources