Options for Scheduling Data Science Tasks
After creating some amazing artifact, it is very common for data scientists to worry about how to keep it updated. Dashboards and reports need to show the latest data, models need to be retrained, and sometimes end users will even request regular notifications, such as emails.
The good news is that data scientists work in code, so it is possible to automate most of these update tasks. A naive approach is to manually re-run a script every time an update is needed. To avoid that manual labor, a variety of solutions exist. These solutions fall on a spectrum, from simple tools with limited capabilities to more robust tools that offer flexibility but come with a steeper learning curve. This article outlines a number of options.
Scheduling on Connect
Connect sits in a sweet spot for data science scheduling. It provides flexibility and robustness while remaining easy to use. Data scientists can write code in R Markdown documents, Quarto documents, or Jupyter notebooks, and then publish and schedule them on Connect, all while satisfying authentication and security requirements.
See the user guide for scheduling on Connect.
During the publication process, Connect automatically handles package dependencies. Once scheduled, Connect ensures the content is run and handles logs, sends emails on errors, and maintains a versioned history of prior runs. Additionally, because the scheduled code is within a notebook, it is easy to document in place the purpose of the scheduled code. Data scientists can even customize email notifications if they would like to receive updates, such as a summary of new processed data.
View the sample projects and code for more details.
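For example, email customization from a scheduled R Markdown report is often done with the blastula package. The snippet below is only a rough sketch under that assumption; the subject line and message are hypothetical, and the Connect user guide describes the full pattern.
# inside a scheduled R Markdown report published to Connect
library(blastula)

# compose a short custom email about this run (contents are hypothetical)
email <- compose_email(
  body = md("The nightly report finished successfully and the dashboard data is refreshed.")
)

# ask Connect to send this custom email in place of the default notification
attach_connect_email(email, subject = "Nightly report updated")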
Pros
- Automatically handles package dependencies, logs, custom email notifications, error emails, and versioned history
- Highly accessible to data scientists through one-click publishing or Git-backed deployment
- Service accounts for accessing data sources can be configured through the user interface and used to execute the code
- Environment variables can be added to a project's execution environment and are encrypted on disk, making this an excellent option for storing database credentials so they aren't exposed in the code itself (see the sketch after the Cons list)
Cons
- Scheduled notebooks are independent of one another; it is not possible to define a sequence of scheduled tasks
- Execution is limited to the servers where Connect is installed
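To illustrate the environment-variable approach from the Pros list, a scheduled report on Connect can read database credentials at run time rather than hard-coding them. This is a minimal sketch; the variable names and data source name are placeholders, not Connect defaults.
# read credentials stored as encrypted environment variables on Connect
# (DB_USER, DB_PASSWORD, and the DSN "warehouse" are hypothetical placeholders)
con <- DBI::dbConnect(
  odbc::odbc(),
  dsn = "warehouse",
  uid = Sys.getenv("DB_USER"),
  pwd = Sys.getenv("DB_PASSWORD")
)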
Desktop Schedulers
Scheduling content to run on your desktop is the easiest approach for using resources you already have access to. On Windows machines, the Windows Task Scheduler can be used to execute code on a schedule or following an event. The taskscheduleR package provides an R-specific wrapper, and in Python, pywin32 provides an interface to the Task Scheduler. On Mac and Linux, the cron utility can be used, and the cronR and python-crontab packages provide helpful wrappers.
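As a quick sketch, registering a daily job with cronR might look like the following; the script path, time, and id are hypothetical.
# schedule an R script with cronR on Mac or Linux (path and id are placeholders)
library(cronR)
cmd <- cron_rscript("/path/to/update.R")   # build the Rscript command to run
cron_add(command = cmd, frequency = "daily", at = "11AM",
         id = "daily-update", description = "Refresh the report data")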
The main limitation to scheduling on a laptop or desktop is downtime. Most people do not leave their laptops running indefinitely, and often local workstations are restarted for updates. These interruptions can conflict with scheduled tasks.
While scheduling on a desktop is widely available, it requires significant work to monitor schedules, ensure the correct software and packages are available, and capture success or failure logs. It also runs everything under that user's credentials, which can cause problems at scale; for example, many simultaneous jobs connecting to a database under one set of credentials can mimic a denial-of-service attack and trigger denial-of-service errors.
Pros
- Widely available and easy to get started
Cons
- Laptops/desktops are frequently off or offline, interrupting schedules
- All environment setup and logging must be built out manually
- All content runs under the desktop owner's credentials, which can result in scaling issues
- Access to the desktop requires the owner's personal credentials, so sharing the scheduled jobs with others is challenging without exposing those credentials
Using cron on a Server
A step up from scheduling tasks on a local machine is scheduling them on a server. For Linux servers, the cron utility is widely available and very flexible. Schedules are defined in a crontab file, and typically the schedule will instruct the server to execute a shell script. These shell scripts provide total flexibility, allowing you to set up an environment, execute code, log side effects, and more. cron allows you to specify where log output and errors should go.
# crontab entry: run the script run.sh every day at 11am (use the full path)
00 11 * * * /path/to/run.sh

# sample run.sh
#!/bin/bash
/opt/R/3.6.3/bin/R -f update.R
A common gotcha for running R and Python jobs on a server is package management. If you set up a script to run on a schedule, it is important that the script have access to the correct packages. It can be very easy to forget about these scripts when updating packages for other projects, leading to unexpected errors. Or, you can find yourself in a situation where one script requires specific versions of a package that differ from those needed by another script. While there are workarounds for these problems, such as the renv package for R or virtualenv for Python, it is critical to consider package dependencies for long-term stability.
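For R workflows, a minimal renv setup for a scheduled script might look like the sketch below, run interactively in the project before the schedule goes live.
# run once inside the project that contains the scheduled script
renv::init()       # create a project-specific library, lockfile, and .Rprofile
renv::snapshot()   # record the exact package versions in renv.lock

# because renv::init() writes a .Rprofile that activates the project library,
# the scheduled script will use the pinned packages automatically;
# renv::restore() can rebuild that library on another machine if needed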
Pros
- Widely available on most Linux servers
- Very flexible
- Servers, unlike local workstations, tend to have higher up-time and more robust processes for downtime
Cons
- Requires the user to handle everything: logging, environment setup, error handling
- Requires technical know-how to use effectively
Using an External Scheduler
On the far end of the spectrum is the category of dedicated scheduling software. Examples include tools like Luigi, Airflow, Oozie, Jenkins, and many others. These tools vary in their features and intent.
Most of these options require a dedicated application support team to ensure they are correctly configured and regularly updated, though cloud vendors often offer these tools as hosted services.
Most of these tools have robust support for scheduling directed acyclic graphs (DAGs), which allow users to specify dependencies or sequences of tasks. Often these tools take advantage of caching and allow for seamless re-runs or backfills.
Many of these tools offer support for flexible scaling, coordinating execution across multiple servers or in environments like Hadoop or Kubernetes.
In addition to learning these tools, data scientists will often need to account for and manage package dependencies themselves, especially for R workflows.
Pros
- Most flexible and complete feature set, including support for DAGs, multiple execution backends, re-runs, backfills, and more
Cons
- Typically require dedicated application support
- Steep learning curve
- Manual package management strategy for R workflows