Installing Python Packages

Overview

Once the versions of Python you need are installed, the next step is to install packages. One important difference between R and Python is that R packages are typically installed within an active R session, for example:

R Console

> install.packages("dplyr")

By contrast, Python packages are usually installed from the command line using a module[1] called pip.

Terminal

$ python -m pip install pandas

Note

Many Python packages are installed under one name but referenced in code under another. For example, python -m pip install python-dotenv installs the package python-dotenv, which in code is referenced via import dotenv. Try searching on PyPI if you’re unsure about a package name.

As with installing Python itself, you want to install Python packages in a way that makes it easy to work on different projects concurrently. Before you install any packages, the first step is to create a “virtual environment”.

The Iron Law of Python Management

Create a virtual environment for every project.

You can do this by running python -m venv .venv. This executes the Python module venv, which creates a virtual environment in the folder .venv/.
.venv is one of a few conventional names that are given to directories containing virtual environments. These directories contain links to a python executable, a copy of pip, and activation scripts.

If Python has been installed according to the Python installation directions, you can use the versions in /opt/python to create a virtual environment for your project:

Terminal

rstudio@e6a5639b8fca:~$ mkdir data-science-project
rstudio@e6a5639b8fca:~$ cd data-science-project
rstudio@e6a5639b8fca:~/data-science-project$ /opt/python/3.7.7/bin/python -m venv .venv
rstudio@e6a5639b8fca:~/data-science-project$ tree -aL 3
.
`-- .venv
    |-- bin
    |   |-- activate
    |   |-- activate.csh
    |   |-- activate.fish
    |   |-- easy_install
    |   |-- easy_install-3.7
    |   |-- pip
    |   |-- pip3
    |   |-- pip3.7
    |   |-- python -> /opt/python/3.7.7/bin/python
    |   `-- python3 -> python
    |-- include
    |-- lib
    |   `-- python3.7
    |-- lib64 -> lib
    `-- pyvenv.cfg

Terminal

WDAGUtilityAccount@mvp MINGW64 ~/Documents
$ mkdir data-science-project

WDAGUtilityAccount@mvp MINGW64 ~/Documents
$ cd data-science-project

WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ python -m venv .venv

WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ tree -aL 3
.
`-- .venv
    |-- Include
    |-- Lib
    |   `-- site-packages
    |-- Scripts
    |   |-- Activate.ps1
    |   |-- activate
    |   |-- activate.bat
    |   |-- deactivate.bat
    |   |-- easy_install-3.9.exe
    |   |-- easy_install.exe
    |   |-- pip.exe
    |   |-- pip3.9.exe
    |   |-- pip3.exe
    |   |-- python.exe
    |   `-- pythonw.exe
    `-- pyvenv.cfg
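Both listings end with a pyvenv.cfg file, which records a few settings of the environment (such as the home directory of the Python it was created from) as simple key = value lines. As a rough sketch, and assuming that simple format, the file can be read like this; the function name read_pyvenv_cfg is hypothetical.

```python
from pathlib import Path


def read_pyvenv_cfg(env_dir: str) -> dict[str, str]:
    """Parse the 'key = value' lines of <env_dir>/pyvenv.cfg into a dict."""
    settings = {}
    for line in (Path(env_dir) / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank lines and lines without '='
            settings[key.strip()] = value.strip()
    return settings
```

Calling read_pyvenv_cfg(".venv") from the project directory returns the settings of that project's environment, which can help when you are unsure which Python installation an environment was built from.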

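Once your virtual environment is created, you must then activate it to isolate your project. The activation script is in .venv/bin on Linux and macOS and in .venv/Scripts on Windows (for example, in Git Bash), so run whichever of the following commands matches your system: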

$ source .venv/bin/activate
$ source .venv/Scripts/activate

Your shell may indicate that you are working in a virtual environment by adding (.venv) to the prompt.[2] Some IDEs may detect that you have created a virtual environment and activate it for you. When your virtual environment is active, which python should return a path inside your project’s .venv directory. You can call deactivate to return to your shell’s default version of Python.

Terminal

rstudio@e6a5639b8fca:~/data-science-project$ source .venv/bin/activate

(.venv) rstudio@e6a5639b8fca:~/data-science-project$ which python
/home/rstudio/data-science-project/.venv/bin/python

(.venv) rstudio@e6a5639b8fca:~/data-science-project$ deactivate
rstudio@e6a5639b8fca:~/data-science-project$ which python

Terminal

WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ source .venv/Scripts/activate

(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ which python
/c/Users/WDAGUtilityAccount/Documents/data-science-project/\Users\WDAGUtilityAccount\Documents\data-science-project\.venv/Scripts/python

(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ deactivate

WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ which python
/c/Users/WDAGUtilityAccount/scoop/apps/pyenv/current/pyenv-win/shims/python
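The same check can be made from inside Python, which is handy in scripts or notebooks where the shell prompt is not visible. A minimal sketch, assuming an environment created with the venv module: inside such an environment sys.prefix differs from sys.base_prefix. The function name in_virtualenv is illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

sys.executable plays the role of which python here: it shows the interpreter that is actually running the code.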

Virtual environment directories should not be checked into version control, so add the location of your virtual environment to your .gitignore.
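Adding the entry by hand is a one-line edit, but if you set up projects from a script, something like the following sketch could do it. It assumes the environment is named .venv and that the .gitignore lives in the project root; the helper name ignore_venv is made up for this example.

```python
from pathlib import Path


def ignore_venv(project_dir: str, env_name: str = ".venv") -> bool:
    """Add '<env_name>/' to the project's .gitignore; return True if a line was added."""
    gitignore = Path(project_dir) / ".gitignore"
    entry = f"{env_name}/"
    text = gitignore.read_text() if gitignore.exists() else ""
    if entry in (line.strip() for line in text.splitlines()):
        return False  # already ignored, nothing to do
    # Start on a fresh line if the existing file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with gitignore.open("a") as handle:
        handle.write(prefix + entry + "\n")
    return True
```

The function is safe to call repeatedly: it only appends the entry when no matching line is present.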

Installing Python packages

Once your virtual environment is active, you can begin installing packages. It can sometimes be helpful to start by upgrading pip, along with setuptools and wheel, whose job is to help build and install packages:

WDAGUtilityAccount@mvp MINGW64 ~/Documents 
$ mkdir data-science-project

WDAGUtilityAccount@mvp MINGW64 ~/Documents 
$ cd data-science-project/

WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project 
$ python -m venv .venv 

WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project 
$ source .venv/Scripts/activate 

(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project (master)
$ python -m pip install -U pip setuptools wheel
...
Successfully installed pip-21.1.2 setuptools-57.0.0 wheel-0.36.2

After that, you can install data science packages:

(.venv) 
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project 
$ python -m pip install pandas 
Collecting pandas 
  Downloading pandas-1.2.4-cp39-cp39-win_amd64.whl (9.3 MB) 
Collecting python-dateutil>=2.7.3 
  Downloading python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB) 
Collecting numpy>=1.16.5 
  Downloading numpy-1.20.3-cp39-cp39-win_amd64.whl (13.7 MB) 
Collecting pytz>=2017.3 
  Downloading pytz-2021.1-py2.py3-none-any.whl (510 kB) 
Collecting six>=1.5 
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB) 
Installing collected packages: six, pytz, python-dateutil, numpy, pandas 
Successfully installed numpy-1.20.3 pandas-1.2.4 python-dateutil-2.8.1 pytz-2021.1 six-1.16.0 

You can print a table showing the packages installed in the active virtual environment.

(.venv) 
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project 
$ pip list 
Package         Version 
--------------- ------- 
numpy           1.20.3 
pandas          1.2.4 
pip             21.1.2 
python-dateutil 2.8.1 
pytz            2021.1 
setuptools      57.0.0 
six             1.16.0 
wheel           0.36.2 

You can also produce a machine-readable version of this list:

(.venv) 
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project 
$ pip freeze 
numpy==1.20.3 
pandas==1.2.4 
python-dateutil==2.8.1 
pytz==2021.1 
six==1.16.0 
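A similar list can be produced from within Python using the standard library. This is only a sketch of the idea, not a reimplementation of pip freeze: unlike the output above, it also includes packaging tools such as pip itself, and the function name freeze_lines is hypothetical.

```python
from importlib.metadata import distributions


def freeze_lines() -> list[str]:
    """Return sorted 'name==version' strings for every installed distribution."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


for line in freeze_lines():
    print(line)
```

Because it inspects whichever interpreter runs it, the result reflects the active virtual environment only when the script is run with that environment's python.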

You can redirect[3] this machine-readable version to a requirements.txt file.

(.venv) 
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project 
$ pip freeze > requirements.txt 

(.venv) 
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project 
$ cat requirements.txt 
numpy==1.20.3 
pandas==1.2.4 
python-dateutil==2.8.1 
pytz==2021.1 
six==1.16.0 

You should then commit requirements.txt to version control.
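Because each line of a frozen requirements.txt has the form name==version, the file is easy to read back programmatically. The sketch below handles only such exact pins and ignores everything else a requirements file may contain (options, URLs, version ranges); parse_pins is an illustrative name, and the sample text is the file shown above.

```python
def parse_pins(requirements_text: str) -> dict[str, str]:
    """Map package name to pinned version for each 'name==version' line."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # not an exact pin; this sketch ignores it
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


sample = """\
numpy==1.20.3
pandas==1.2.4
python-dateutil==2.8.1
pytz==2021.1
six==1.16.0
"""
print(parse_pins(sample)["pandas"])  # prints 1.2.4
```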

You can follow the same steps when collaborating on Python projects.

Clone the project and set up a virtual environment in the project directory:

WDAGUtilityAccount@mvp MINGW64 ~/Documents
$ git clone https://github.com/sol-eng/python-examples
Cloning into 'python-examples'...
...
Resolving deltas: 100% (351/351), done.

WDAGUtilityAccount@mvp MINGW64 ~/Documents
$ cd python-examples/flask-restx/

WDAGUtilityAccount@mvp MINGW64 ~/Documents/python-examples/flask-restx (master)
$ python -m venv .venv

WDAGUtilityAccount@mvp MINGW64 ~/Documents/python-examples/flask-restx (master)
$ source .venv/Scripts/activate

Then use pip to install the dependencies listed in the requirements.txt file:

(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/python-examples/flask-restx (master)
$ pip install -r requirements.txt
Collecting flask-restx
  Downloading flask_restx-0.4.0-py2.py3-none-any.whl (5.3 MB)
Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting Flask<2.0.0,>=0.8
  Downloading Flask-1.1.4-py2.py3-none-any.whl (94 kB)
...
Successfully installed Flask-1.1.4 Jinja2-2.11.3 MarkupSafe-2.0.1 aniso8601-9.0.1 attrs-21.2.0 click-7.1.2 flask-restx-0.4.0 itsdangerous-1.1.0 joblib-1.0.1 jsonschema-3.2.0 numpy-1.20.3 pyrsistent-0.17.3 pytz-2021.1 scikit-learn-0.24.2 scipy-1.6.3 six-1.16.0 sklearn-0.0 threadpoolctl-2.1.0 werkzeug-1.0.1
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/python-examples/flask-restx (master)
$ pip list
Package       Version
------------- -------
aniso8601     9.0.1
attrs         21.2.0
click         7.1.2
Flask         1.1.4
flask-restx   0.4.0
itsdangerous  1.1.0
Jinja2        2.11.3
joblib        1.0.1
jsonschema    3.2.0
MarkupSafe    2.0.1
numpy         1.20.3
pip           20.2.3
pyrsistent    0.17.3
pytz          2021.1
scikit-learn  0.24.2
scipy         1.6.3
setuptools    49.2.1
six           1.16.0
sklearn       0.0
threadpoolctl 2.1.0
Werkzeug      1.0.1
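The collaboration steps above (create the environment, then install from requirements.txt) can be captured in a small script. The following sketch only builds the commands rather than running them; it assumes the environment is called .venv and that a python executable is on the PATH, and the names venv_python and setup_commands are made up for illustration.

```python
import os
from pathlib import Path


def venv_python(project_dir: str, env_name: str = ".venv") -> Path:
    """Path to the environment's interpreter: Scripts/python.exe on Windows, bin/python elsewhere."""
    env = Path(project_dir) / env_name
    if os.name == "nt":
        return env / "Scripts" / "python.exe"
    return env / "bin" / "python"


def setup_commands(project_dir: str, base_python: str = "python") -> list[list[str]]:
    """Commands that create the environment and install requirements.txt into it."""
    project = Path(project_dir)
    return [
        # Step 1: create the virtual environment inside the project directory.
        [base_python, "-m", "venv", str(project / ".venv")],
        # Step 2: install the pinned dependencies with the environment's own pip.
        [str(venv_python(project_dir)), "-m", "pip", "install", "-r",
         str(project / "requirements.txt")],
    ]


for command in setup_commands("python-examples/flask-restx"):
    print(" ".join(command))
```

Each command could be passed to subprocess.run(command, check=True). Calling the environment's interpreter by its full path installs into that environment without needing to activate it first.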

If you run pip freeze and see a number of Python dependencies that you don’t remember installing and that have nothing to do with your project, you have probably forgotten to activate the virtual environment for your project.
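One way to make this check less dependent on memory is to compare the frozen list against the project's requirements. The sketch below assumes both inputs are lists of name==version lines and compares names only, normalizing them so that differences in case or in -, _ and . do not matter; the function names are hypothetical.

```python
import re


def normalize(name: str) -> str:
    """Normalize a package name so that 'Flask_RESTX' and 'flask-restx' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unexpected_packages(frozen: list[str], required: list[str]) -> list[str]:
    """Names pinned in `frozen` that do not appear in `required`."""
    def names(lines: list[str]) -> set[str]:
        return {normalize(line.split("==", 1)[0].strip()) for line in lines if line.strip()}

    return sorted(names(frozen) - names(required))


frozen = ["numpy==1.20.3", "pandas==1.2.4", "Flask==1.1.4"]
required = ["numpy==1.20.3", "pandas==1.2.4"]
print(unexpected_packages(frozen, required))  # prints ['flask']
```

A long list of unexpected names is a hint that the output came from a different environment than the one the project expects.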

Where should I put my virtual environment?

Different Python tools offer different options here, but I recommend:

  1. always put your virtual environment in the same directory as the project
  2. always call your virtual environment the same thing

An advantage of always placing your virtual environment in the project directory is that it consolidates the state of the project in one location. If you make some unrecoverable error inside a project and want to erase it and restore from some known good state, removing the project directory will also erase the virtual environment. Not all Python tools necessarily cooperate with this assumption[4], but it’s a good place to start.
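If only the environment is broken, it can be rebuilt without touching the rest of the project. Here is a minimal sketch using the standard library's venv module, assuming the environment lives at .venv inside the project; the function name recreate_venv is illustrative.

```python
import venv
from pathlib import Path


def recreate_venv(project_dir: str, env_name: str = ".venv", with_pip: bool = True) -> Path:
    """Delete and rebuild the project's virtual environment; return its path."""
    env_dir = Path(project_dir) / env_name
    # clear=True deletes the contents of an existing environment directory
    # before the environment is created again.
    venv.create(env_dir, clear=True, with_pip=with_pip)
    return env_dir
```

The rebuilt environment starts out empty, so the project's packages need to be reinstalled afterwards, for example from requirements.txt.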

Similarly, calling each environment the same thing makes it easy to ignore virtual environments globally in Git, so you don’t accidentally commit them to version control.

A closing xkcd

Virtual environments are like git: if you make a mistake, you can always start over.
A comic from XKCD that suggests blowing away a virtual environment when things break or get too complicated.


Footnotes

  1. You may also see this written as simply pip install pandas.

  2. Installing a helper program like starship can make it easier to keep track of whether a virtual environment is active.

  3. https://devhints.io/bash#redirection

  4. Jupyter, for example, stores state in a number of different locations. Run jupyter --data-dir, jupyter --config-dir, or jupyter --runtime-dir for more information.