4. Data Access


In this section, you will learn:


Your team’s data scientists will need access to their data, which could be stored in various locations. Administrator action is required to configure data access from the following locations:

Connections to data sources are handled in the developer’s script in R or Python, not by Workbench or Connect itself. If one can connect to a data source from R or Python running directly on the Linux server, one can also make that connection from within Workbench or Connect.

Directory on Workbench

Developers can store data in their code project directory or their home directory. You must use Linux file permissions to set access to files. For the RStudio Pro IDE, you can also restrict which files a user can view inside the IDE. See the Directory Management section of the Workbench Admin Guide.

An admin can create an additional shared directory that allows a team to upload/download/share data (e.g. /projects/project_id). After provisioning the directory, set access controls so that the appropriate team has read/write access.

Recommendations:

  • Provision access to the directory using groups in order to have auto-provisioning when new users are added.
  • Add appropriate conditions on the shared directory so that newly created content inherits the permissions of its parent folder.
  • Consider setting multiple levels of access so that there are different groups of users with read/write access and read-only access.

Mounted file share

Mounting an external file share containing the desired data is a common pattern for data access. This administrator task is done at the operating system level and will look different depending on the underlying file storage architecture. NFS is recommended due to native support with Linux. With additional configuration, you can also mount other architectures.

NFS

NFS file shares can be automatically and persistently mounted to the desired location by specifying the mount details in /etc/fstab. NFS supports Access Control Lists (ACLs) for granular permissions management. Access is managed with a configuration file. By default, NFS uses the User ID. With NFSv4, Username can be used instead. When defining access by User ID, ensure that User IDs are in sync across all machines.

SMB/CIFS

The simplest method to mount SMB/CIFS shares is to specify the mount in /etc/fstab. However, you will need to provide a set of credentials when using this method. These credentials will be used for all users’ access, which may be acceptable if only read-access is needed, however, this could be problematic if users should perform actions on the mount using their own credentials, for example, in write operations.

There are two ways to avoid mounting with a given set of credentials:

  • Mount the drive using initial, low-level credentials and specify the multiuser option. The drive will mount, but any user action will prompt a credential check from the user. The user inputs their credentials using either a Kerberos ticket or a CIFS-specific credential store, such as cifscreds.
  • Mount the share inside the user’s home directory (not globally) and use a PAM module to pass in the user’s password. PAM can be configured to contact the SMB server to mount the share with the appropriate user’s credentials. For an example of this approach refer to this support article on mounting Windows SMB/CIFS share via PAM.

Databases

Databases may use many different methodologies for data storage, such as:

  • File-based (e.g., SQLite, DuckDB)
  • Relational Database Management Systems (e.g., PostgreSQL, MariaDB, Microsoft SQL)
  • NoSQL and Document Stores (e.g., MongoDB, Cassandra)

These databases use different wire protocols for connections, however standardization across database vendors permits a smoother experience for administrators and users. Key standards to be aware of include:

  • ODBC, a cross-language standard protocol defining how external tools and systems can submit queries to databases. ODBC database drivers use the ODBC protocol for database calls. The drivers translate the ODBC interface to the database-specific functions.
  • Similar to the ODBC protocol, JDBC is a standard protocol for database queries, however this is built around the Java ecosystem. We do not recommend JDBC database connections due to challenges in managing Java.
  • DBI, an R package and front-end for making database connections from the R language. This provides a consistent interface independent of the database. DBI is coupled with a back-end DBI-compliant R package such as odbc (note: lower-case. This is an R package) to make a database connection from R. Some database types have specific DBI-compliant packages available for making this front end connection, such as RPostgres, RMariaDB, bigrquery, or RSQLite.
  • PEP-249 is a proposed standard in Python for making database connectors. This standard is used by packages such as pyodbc for making ODBC connections to databases.

As an administrator, you will need to install (preferably) ODBC drivers for relevant databases on Workbench and Connect. Users can then use the appropriate R or Python packages to interact with these drivers to make their database connections.

Professional Drivers installation

Posit provides professionally-supported ODBC drivers for a number of databases for your use with Posit professional tooling. Installation is documented here. To configure the Posit Professional Drivers, or a driver supplied from another source, specify the driver location and attributes in /etc/odbcinst.ini.

Example /etc/odbcinst.ini file
[PostgreSQL]
Driver=/opt/rstudio-drivers/postgresql/bin/lib/libpostgresqlodbc_sb64.so
RStudioVersion=2021.10.0
Version=1.4.41.1001
Installer=RStudio Pro Drivers
UsageCount=1

Data Source Name (DSN)

DSNs provide a symbolic name that combines all the parameters required to make a database connection. This makes connections simpler for users to manage. Database credentials (typically a service account) can be stored in the DSN definition. If the database credentials are omitted here, users will need to include this in their database connection script. DSNs are configured in /etc/odbc.ini or ~/.odbc.ini.

Example odbc.ini file
[MyPostgresDB]
Driver   = "PostgreSQL"
Server   = "[your server's path]"
Database = "[your database's name]"
UID      = "[user]"
PWD      = "[password]"
Port     = 5432

Troubleshooting database connections

Always test and confirm database connections at the Linux command line using isql first before troubleshooting database connections from within Workbench or Connect.

  • Ensure there is a DSN for the database defined in odbc.ini
  • Use isql -v <DSN> to test the connection
  • Test connection from R or Python only after isql can successfully connect

Typical problems with database connections are:

  • Dynamic libraries not found
  • MicrosoftSQL & Kerberos: Service Principal Name (SPN) does not match

See additional troubleshooting information in the Professional Drivers documentation here.

Data Sources Lab

🚀 Launch the exercise environment!

In the exercise environment you will get experience:

  • Installing Posit Professional Database Drivers

  • Configuring a DSN

  • Connecting to a database from R and Python in Workbench


Go to: 5. Monitoring

Back to top