Moving your Workbench Environment to RStudio on SageMaker

Introduction

This article will guide you in moving your existing data and workflows into Amazon SageMaker. It applies if you want to move to SageMaker from:

  • open source RStudio Desktop
  • open source RStudio Server
  • Posit Workbench running on an EC2 instance or on-premises

When you transition to SageMaker, you will want to reproduce your workflows and environment seamlessly, including your:

  • Environment variables and secrets
  • R/Python versions
  • Package libraries and dependencies
  • Project files
  • Data sources and database connections

SageMaker environment overview

RStudio on SageMaker is a service managed by Amazon. This means that, unlike when running the RStudio IDE on your local computer or on a server, the user or system administrator does not have root access to the instance running the RStudio application and cannot modify any underlying configuration files or defaults. The service hands management of the environment over to AWS and allows you, the data scientist, to focus on writing code.

Refer to Architecture for RStudio on SageMaker for more details around this implementation, including differences from a self-managed Posit Workbench implementation.

User permissions

When setting up RStudio on SageMaker, your AWS administrator will create a Domain and a user for you. This user will have a certain level of access and permissions based on an execution role, such as what EC2 instance types are available for sessions, and permissions available for accessing S3 buckets. See SageMaker Roles. If you do not have the permissions that you need, work with your AWS administrator.

Modifying the default environment

RStudio on SageMaker uses a default image owned by AWS to create sessions. If you need to change any underlying defaults for this environment, for example the version of R or Python, you can do so by bringing your own image (BYOI) to your Domain. You can develop the image yourself, but your AWS administrator must add it to the SageMaker Domain for it to be available for use. Supplying your own custom Docker image increases your flexibility but also makes you responsible for managing the environment and keeping custom images up to date.

Integrated Development Environments (IDEs)

RStudio on SageMaker brings the RStudio IDE to the SageMaker managed environment. Its architecture differs from that of a Posit Workbench implementation installed on your own, self-managed infrastructure. Notably, RStudio on SageMaker only launches RStudio sessions; SageMaker Studio is responsible for launching Jupyter sessions. If you had existing Jupyter Notebooks in your prior Posit Workbench environment, transfer these files to SageMaker following the guidelines below, and open a Jupyter session from SageMaker Studio.

Setting up your work environment

When you start working in a brand new RStudio on SageMaker environment, you may have configurations saved in your former environment that would be useful to move to SageMaker. The sections below cover bringing over your files, restoring your package libraries, and reconnecting to your data.

Tip: if your organization does not have a private Package Manager instance, you can still enjoy the benefits of precompiled binaries and repository snapshots with Public Package Manager.

Accessing files, packages, and project environments

Bringing your existing project files and required libraries with you will help you transition comfortably to your new RStudio on SageMaker Domain.

Adding Files to your home directory

Clone your repositories

Using a version control system (e.g., Git) is always recommended, and it will help you move your code into your new environment. Using either the terminal or the Git integration within the RStudio IDE, clone your target repositories to get access to all of your code from your previous environment.

Manual upload

As a supplement or alternative to cloning your repositories, you may upload files from your previous environment. Use the Upload button in the Files pane of the RStudio IDE.

File upload in RStudio on SageMaker

You may upload files selectively, or upload multiple files or directories by creating a zip file to upload. The zip will automatically be expanded after upload.

Recall that your home directory is persistent across all sessions in RStudio on SageMaker and SageMaker Studio. Files that you upload from RStudio will be available to sessions running within SageMaker Studio as well.

Synchronize files with rsync

If you have SSH access to your previous Workbench environment, you can use the Linux utility rsync to synchronize files or directories from your previous home directory to your home directory in the new SageMaker environment. Run rsync from a terminal in SageMaker.

Restoring Package Libraries

Because packages are compiled against specific R versions, operating system versions, and system dependencies, it is not appropriate to copy package libraries from one environment to another.

To repopulate your package libraries in RStudio on SageMaker, reinstall your required packages with install.packages() in the RStudio console. Packages are downloaded from a package repository and installed into a library, where they can then be loaded for use (see footnote 1).

Tips

For best results, set your package repository to either the Public Package Manager or your organization’s private Package Manager instance. This allows you to download pre-compiled binaries rather than build packages from source, which can significantly reduce the time it takes to install packages and helps avoid build failures caused by missing system prerequisites.

  • Check your package repository by running options("repos") in your RStudio console
  • Modify your package repository by running options("repos" = c("<REPO-NAME>" = "https://your-repository-url.com")) in your RStudio console
  • See the Package Manager User Guide section Obtain a repository URL if you do not know where to locate your repository URL.
  • If you are not sure which distribution of Linux you are using, you can find it by typing cat /etc/*-release in your terminal.
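
The tips above can be sketched in a few lines of R. This is a minimal illustration, not a required configuration: the repository label PPM is arbitrary, and the URL is an assumed Public Package Manager address for an Ubuntu "jammy" image. Replace it with the URL you obtain from Package Manager for your own distribution.

```r
# Show the repositories that install.packages() currently uses.
getOption("repos")

# Assumption: "PPM" is an arbitrary label, and this URL is a placeholder for
# the repository URL you obtained from Package Manager for your distribution.
options(repos = c(PPM = "https://packagemanager.posit.co/cran/__linux__/jammy/latest"))

# Confirm the change took effect for this session.
getOption("repos")

# Packages are now installed from the repository set above, for example:
# install.packages("dplyr")
```

A setting made with options() lasts only for the current R session. Because your home directory persists, placing the options() call in ~/.Rprofile is one way to keep it across sessions.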

In RStudio on SageMaker, package libraries should exist locally at the user home directory level, or as a best practice, at the individual project level with the help of the renv package. The renv package promotes isolation between project environments and aids in reproducibility and collaboration. Recall that your user home directory is persistent in SageMaker, so once you install a package in a local library, it will be available to load in future sessions.
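
As a rough sketch of the project-level approach, the snippet below rebuilds a project library from an renv.lock file. It assumes the project was snapshotted with renv in the previous environment and that the lockfile came across with the project files. The helper name restore_project_library and the example path are hypothetical.

```r
# Hypothetical helper: rebuild a project's private library from its renv.lock.
restore_project_library <- function(project = ".") {
  # Install renv into the user library if it is not already available.
  if (!requireNamespace("renv", quietly = TRUE)) {
    install.packages("renv")
  }
  # Reinstall the package versions recorded in renv.lock into the
  # project-local library; prompt = FALSE skips the confirmation question.
  renv::restore(project = project, prompt = FALSE)
}

# In the previous environment, record the packages the project uses:
# renv::snapshot()
#
# In RStudio on SageMaker, after cloning or uploading the project:
# restore_project_library("~/my-project")  # hypothetical path
```

Restoring from a lockfile reinstalls packages for the new R version and operating system rather than copying compiled libraries, which is consistent with the guidance above.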

A note on pre-installing packages: A small number of packages are included in the default RStudio image, installed at the system level. You may optionally include additional packages that you want available to all users in a custom image used to launch RStudio. However, including a large number of packages can inflate the image size, which can make the image slow to load. Instead, ensure that custom images contain the necessary system dependencies and that users download binaries from Package Manager to a local library. This combination helps keep on-demand package installation fast and reliable.

Should you need to install system dependencies, note that users have root access on session containers. This allows you to manage system dependencies in real time for the current session, but any changes will not persist across sessions. For persistence, incorporate the system dependencies into a custom image.

Accessing your data

Data can be stored in a variety of locations, including databases, mounted network drives, and cloud storage such as AWS S3, Google Drive, Azure Storage, OneDrive, and Microsoft 365.

Accessing via Packages

You can access the vast majority of your data directly through your R code with packages such as paws (Package for Amazon Web Services), googledrive, or Microsoft365R. Accessing data via code from within RStudio on SageMaker is unchanged from how you may have accessed the same data outside of RStudio on SageMaker.

Database Connections

RStudio on SageMaker is configured with Posit’s Professional Database Drivers. These drivers enable ODBC connections to numerous database types, including Athena, PostgreSQL, Redshift, and Snowflake.

You can see what drivers are available by running odbc::odbcListDrivers() in your RStudio console. You can then access your data using your credentials with DBI::dbConnect(). For additional details, see the example Connecting to a Database in R and Python in our Database Best Practices resource.
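
A connection along these lines might look like the sketch below. The helper name connect_warehouse, the environment variable names, the PostgreSQL driver name, and the port are all assumptions for illustration. Use a driver name that actually appears in the output of odbc::odbcListDrivers() and the connection arguments your database requires.

```r
# Hypothetical helper: open an ODBC connection with credentials taken from
# environment variables rather than hard-coded in the script.
connect_warehouse <- function(driver = "PostgreSQL") {
  DBI::dbConnect(
    odbc::odbc(),
    Driver   = driver,                 # must match a name from odbc::odbcListDrivers()
    Server   = Sys.getenv("DB_HOST"),  # hypothetical environment variable names
    Database = Sys.getenv("DB_NAME"),
    UID      = Sys.getenv("DB_USER"),
    PWD      = Sys.getenv("DB_PASSWORD"),
    Port     = 5432                    # assumed default PostgreSQL port
  )
}

# List the installed driver names, then connect and run a trivial query:
# unique(odbc::odbcListDrivers()$name)
# con <- connect_warehouse()
# DBI::dbGetQuery(con, "SELECT 1")
# DBI::dbDisconnect(con)
```

Reading credentials from environment variables keeps secrets out of project files, which matters if those files are shared through version control.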

The default RStudio on SageMaker implementation does not have the ability to set up system-wide DSNs. However, there are two alternatives:

  • Define your DSN at the user level and save it in your home directory at ~/.odbc.ini. Ensure that the Driver setting in the DSN refers to the corresponding driver name or path listed in the output of odbc::odbcListDrivers()
  • Configure the DSN as part of a custom image supplied to the RStudio session
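
For the user-level option, one possible way to create the DSN entry from R is sketched below. The helper add_user_dsn, the DSN name warehouse, and every setting shown are hypothetical. The keys a DSN needs depend on the driver, so treat this as an outline rather than a working configuration.

```r
# Hypothetical helper: append a DSN entry to a user-level ODBC configuration file.
add_user_dsn <- function(name, settings, path = "~/.odbc.ini") {
  entry <- c(
    sprintf("[%s]", name),                                  # section header, e.g. [warehouse]
    sprintf("%s = %s", names(settings), unlist(settings)),  # one "key = value" line per setting
    ""                                                      # ends the last line with a newline
  )
  # append = TRUE keeps any DSNs already defined in the file.
  cat(entry, file = path.expand(path), sep = "\n", append = TRUE)
  invisible(entry)
}

# Hypothetical DSN; the Driver value must match odbc::odbcListDrivers().
# add_user_dsn(
#   "warehouse",
#   list(Driver = "PostgreSQL", Server = "db.example.com",
#        Port = 5432, Database = "analytics")
# )
# con <- DBI::dbConnect(odbc::odbc(), dsn = "warehouse",
#                       UID = Sys.getenv("DB_USER"), PWD = Sys.getenv("DB_PASSWORD"))
```

Because the file lives in your persistent home directory, a DSN defined this way remains available in later sessions.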

Accessing S3

For access to S3 please ensure that your AWS administrator has provisioned your user with the appropriate permissions to use S3 buckets. Permissions are controlled by the SageMaker execution role applied at user setup. See SageMaker Roles.
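
Once the permissions are in place, reading from S3 with the paws package might look like the following sketch. The helper name download_s3_object and the bucket and key names are placeholders, and the code assumes that paws can obtain credentials from the session environment (for example, through the execution role).

```r
# Hypothetical helper: download one S3 object into the home directory.
download_s3_object <- function(bucket, key, dest = file.path("~", basename(key))) {
  # Assumption: credentials are available to paws from the session environment.
  s3 <- paws::s3()
  obj <- s3$get_object(Bucket = bucket, Key = key)
  # The object's contents are returned as a raw vector in obj$Body.
  writeBin(obj$Body, path.expand(dest))
  invisible(dest)
}

# List a few keys, then fetch one (bucket and key names are placeholders):
# s3 <- paws::s3()
# keys <- vapply(s3$list_objects_v2(Bucket = "my-bucket", MaxKeys = 10)$Contents,
#                function(x) x$Key, character(1))
# download_s3_object("my-bucket", "data/sales.csv")
```

If a call fails with an access error, that usually points back to the execution role rather than to the R code, so it is worth checking with your AWS administrator first.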

External File Systems

You may have had external file systems (e.g., EFS, NFS, CIFS) in your previous environment that hold data you would like to use in your new environment. It is not currently possible to define Volume Mounts in custom images.

Note that it is possible to mount the SageMaker Domain EFS onto an EC2 instance external to SageMaker. However, this is a complex procedure that requires the support of your AWS administrator, and the utility of this mount is limited. SageMaker users can only access data in their respective home directories, so any transfer of data to the SageMaker EFS would need to be directed into the appropriate user’s home directory.

Footnotes

  1. For a more detailed discussion, see https://solutions.posit.co/envs-pkgs/repos_and_libs/