Installing Python Packages
Overview
Once the versions of Python you need are installed, the next step is to install packages. One important difference between R and Python is that R packages are typically installed from within an active R session, for example:
R Console
> install.packages("dplyr")
By contrast, Python packages are usually installed from the command line using a module [1] called pip.
Terminal
$ python -m pip install pandas
Many Python packages are installed under one name but referenced in code under another. For example, python -m pip install python-dotenv installs a package whose importable module is named dotenv, so in code it is referenced via import dotenv. Try searching on PyPI if you’re unsure about a package name.
As with installing Python, when installing Python packages, you want to do so in a way that makes it easy to work on different projects concurrently. Before you install any packages, the first step is to create a “virtual environment”.
The Iron Law of Python Management
Create a virtual environment for every project.
You can do this by running python -m venv .venv. This executes the Python module venv, which creates a virtual environment in the folder .venv/.
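The venv module can also be called from Python code, which is occasionally handy in setup scripts. The sketch below is one way to do that under stated assumptions: the function name create_project_venv and the with_pip=False demo setting are choices made for this example, not part of the workflow described here.

```python
import tempfile
import venv
from pathlib import Path


def create_project_venv(project_dir, with_pip=True):
    """Create a virtual environment in <project_dir>/.venv and return its path."""
    env_dir = Path(project_dir) / ".venv"
    # venv.create is a convenience wrapper around venv.EnvBuilder,
    # the same machinery that `python -m venv` uses.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


if __name__ == "__main__":
    # Demonstrate in a throwaway directory; with_pip=False skips bootstrapping pip.
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = create_project_venv(tmp, with_pip=False)
        print(sorted(p.name for p in env_dir.iterdir()))
```

For everyday use the command-line form is simpler; the function form mainly helps when environment creation is one step of a larger script.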
.venv is one of a few conventional names that are given to directories containing virtual environments. These directories contain links to a python executable, a copy of pip, and activation scripts.
If Python has been installed according to the Python installation directions, you can use the versions in /opt/python to create a virtual environment for your project:
Terminal
rstudio@e6a5639b8fca:~$ mkdir data-science-project
rstudio@e6a5639b8fca:~$ cd data-science-project
rstudio@e6a5639b8fca:~/data-science-project$ /opt/python/3.7.7/bin/python -m venv .venv
rstudio@e6a5639b8fca:~/data-science-project$ tree -aL 3
.
`-- .venv
    |-- bin
    |   |-- activate
    |   |-- activate.csh
    |   |-- activate.fish
    |   |-- easy_install
    |   |-- easy_install-3.7
    |   |-- pip
    |   |-- pip3
    |   |-- pip3.7
    |   |-- python -> /opt/python/3.7.7/bin/python
    |   `-- python3 -> python
    |-- include
    |-- lib
    |   `-- python3.7
    |-- lib64 -> lib
    `-- pyvenv.cfg
Terminal
WDAGUtilityAccount@mvp MINGW64 ~/Documents
$ mkdir data-science-project
WDAGUtilityAccount@mvp MINGW64 ~/Documents
$ cd data-science-project
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ python -m venv .venv
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ tree -aL 3
.
`-- .venv
    |-- Include
    |-- Lib
    |   `-- site-packages
    |-- Scripts
    |   |-- Activate.ps1
    |   |-- activate
    |   |-- activate.bat
    |   |-- deactivate.bat
    |   |-- easy_install-3.9.exe
    |   |-- easy_install.exe
    |   |-- pip.exe
    |   |-- pip3.9.exe
    |   |-- pip3.exe
    |   |-- python.exe
    |   `-- pythonw.exe
    `-- pyvenv.cfg
Once your virtual environment is created, you must activate it to isolate your project. On Linux or macOS:
$ source .venv/bin/activate
On Windows (from Git Bash):
$ source .venv/Scripts/activate
Your shell may indicate that you are working in a virtual environment by adding (.venv) to the prompt. [2] Some IDEs may detect that you have created a virtual environment and activate it for you. When your virtual environment is active, which python should return the path to the python executable inside your project’s .venv directory. You can call deactivate to return to your shell’s default version of Python.
Terminal
rstudio@e6a5639b8fca:~/data-science-project$ source .venv/bin/activate
(.venv) rstudio@e6a5639b8fca:~/data-science-project$ which python
/home/rstudio/data-science-project/.venv/bin/python
(.venv) rstudio@e6a5639b8fca:~/data-science-project$ deactivate
rstudio@e6a5639b8fca:~/data-science-project$ which python
Terminal
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ source .venv/Scripts/activate
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ which python
/c/Users/WDAGUtilityAccount/Documents/data-science-project/\Users\WDAGUtilityAccount\Documents\data-science-project\.venv/Scripts/python
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ deactivate
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ which python
/c/Users/WDAGUtilityAccount/scoop/apps/pyenv/current/pyenv-win/shims/python
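The same check can be made from inside Python rather than with which python. This is a small sketch, assuming the environment was created with venv; the function name in_virtual_environment is illustrative.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix points at the environment, while sys.base_prefix
    # points at the Python installation the environment was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtual_environment())
```

Printing sys.executable alongside the result shows which interpreter is actually running, which is the Python-side equivalent of the which python output above.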
Virtual environment directories should not be checked into version control, so add the location of your virtual environment to your .gitignore.
Installing Python packages
Once your virtual environment is active, you can begin installing packages. It can sometimes be helpful to start by updating pip, along with other packages whose job is to help install packages:
WDAGUtilityAccount@mvp MINGW64 ~/Documents
$ mkdir data-science-project
WDAGUtilityAccount@mvp MINGW64 ~/Documents
$ cd data-science-project/
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ python -m venv .venv
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ source .venv/Scripts/activate
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project (master)
$ python -m pip install -U pip setuptools wheel
...
Successfully installed pip-21.1.2 setuptools-57.0.0 wheel-0.36.2
After that, you can install data science packages:
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ python -m pip install pandas
Collecting pandas
Downloading pandas-1.2.4-cp39-cp39-win_amd64.whl (9.3 MB)
Collecting python-dateutil>=2.7.3
Downloading python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting numpy>=1.16.5
Downloading numpy-1.20.3-cp39-cp39-win_amd64.whl (13.7 MB)
Collecting pytz>=2017.3
Downloading pytz-2021.1-py2.py3-none-any.whl (510 kB)
Collecting six>=1.5
Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: six, pytz, python-dateutil, numpy, pandas
Successfully installed numpy-1.20.3 pandas-1.2.4 python-dateutil-2.8.1 pytz-2021.1 six-1.16.0
You can print a table showing the packages installed in the active virtual environment.
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ pip list
Package Version
--------------- -------
numpy 1.20.3
pandas 1.2.4
pip 21.1.2
python-dateutil 2.8.1
pytz 2021.1
setuptools 57.0.0
six 1.16.0
wheel 0.36.2
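A similar table can be produced from Python itself, which can help when you want to inspect an environment from a script or notebook. The sketch below uses importlib.metadata from the standard library (Python 3.8 or later); installed_packages is a name chosen for this example, and the output format only approximates what pip list prints.

```python
from importlib.metadata import distributions


def installed_packages():
    """Return sorted (name, version) pairs for distributions visible to this interpreter."""
    pairs = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries whose metadata lacks a name
            pairs.add((name, dist.version))
    # Sort case-insensitively by name, as the table above appears to be.
    return sorted(pairs, key=lambda pair: pair[0].lower())


if __name__ == "__main__":
    for name, version in installed_packages():
        print(f"{name:<20} {version}")
```

Because it reports on whichever interpreter runs it, the result reflects the active virtual environment only if that environment's python is the one executing the script.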
You can also produce a machine-readable version of this list:
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ pip freeze
numpy==1.20.3
pandas==1.2.4
python-dateutil==2.8.1
pytz==2021.1
six==1.16.0
You can redirect [3] this machine-readable version to a requirements.txt file.
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ pip freeze > requirements.txt
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/data-science-project
$ cat requirements.txt
numpy==1.20.3
pandas==1.2.4
python-dateutil==2.8.1
pytz==2021.1
six==1.16.0
You should then commit requirements.txt to version control.
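Because each line of the frozen output has the form name==version, the file is easy to read back programmatically. The following is a rough sketch of that idea; parse_pins is a hypothetical helper, and it deliberately handles only simple pins, not the full requirements file syntax (extras, environment markers, URLs, or hashes).

```python
def parse_pins(text):
    """Parse 'name==version' lines into a dict, ignoring blanks and comments."""
    pins = {}
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or "==" not in line:
            continue  # not a simple pin; skipped in this sketch
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


# The contents of requirements.txt shown above.
REQUIREMENTS = """\
numpy==1.20.3
pandas==1.2.4
python-dateutil==2.8.1
pytz==2021.1
six==1.16.0
"""

if __name__ == "__main__":
    print(parse_pins(REQUIREMENTS))
```

Exact pins like these are what let a collaborator recreate the same set of package versions later.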
You can follow the same steps when collaborating on Python projects.
Clone the project and set up a virtual environment in the project directory:
WDAGUtilityAccount@mvp MINGW64 ~/Documents
$ git clone https://github.com/sol-eng/python-examples
Cloning into 'python-examples'...
...
Resolving deltas: 100% (351/351), done.
WDAGUtilityAccount@mvp MINGW64 ~/Documents
$ cd python-examples/flask-restx/
WDAGUtilityAccount@mvp MINGW64 ~/Documents/python-examples/flask-restx (master)
$ python -m venv .venv
WDAGUtilityAccount@mvp MINGW64 ~/Documents/python-examples/flask-restx (master)
$ source .venv/Scripts/activate
Then use pip install to install the dependencies from the requirements.txt file:
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/python-examples/flask-restx (master)
$ pip install -r requirements.txt
Collecting flask-restx
Downloading flask_restx-0.4.0-py2.py3-none-any.whl (5.3 MB)
Collecting sklearn
Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting Flask<2.0.0,>=0.8
Downloading Flask-1.1.4-py2.py3-none-any.whl (94 kB)
...
Successfully installed Flask-1.1.4 Jinja2-2.11.3 MarkupSafe-2.0.1 aniso8601-9.0.1 attrs-21.2.0 click-7.1.2 flask-restx-0.4.0 itsdangerous-1.1.0 joblib-1.0.1 jsonschema-3.2.0 numpy-1.20.3 pyrsistent-0.17.3 pytz-2021.1 scikit-learn-0.24.2 scipy-1.6.3 six-1.16.0 sklearn-0.0 threadpoolctl-2.1.0 werkzeug-1.0.1
(.venv)
WDAGUtilityAccount@mvp MINGW64 ~/Documents/python-examples/flask-restx (master)
$ pip list
Package Version
------------- -------
aniso8601 9.0.1
attrs 21.2.0
click 7.1.2
Flask 1.1.4
flask-restx 0.4.0
itsdangerous 1.1.0
Jinja2 2.11.3
joblib 1.0.1
jsonschema 3.2.0
MarkupSafe 2.0.1
numpy 1.20.3
pip 20.2.3
pyrsistent 0.17.3
pytz 2021.1
scikit-learn 0.24.2
scipy 1.6.3
setuptools 49.2.1
six 1.16.0
sklearn 0.0
threadpoolctl 2.1.0
Werkzeug 1.0.1
If you run pip freeze and see a number of Python dependencies that you don’t remember installing and that have nothing to do with your project, you have probably forgotten to activate the virtual environment for your project.
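One way to make that symptom concrete is to compare the output of pip freeze with the project's requirements.txt and list anything unexpected. The sketch below assumes both inputs contain only simple name==version lines and that requirements.txt was itself produced by pip freeze; the function names and the sample freeze output are hypothetical.

```python
def pinned_names(text):
    """Collect lower-cased package names from 'name==version' lines."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "==" in line:
            names.add(line.split("==", 1)[0].strip().lower())
    return names


def unexpected_packages(freeze_output, requirements_text):
    """Return names present in the freeze output but absent from requirements.txt."""
    return sorted(pinned_names(freeze_output) - pinned_names(requirements_text))


if __name__ == "__main__":
    requirements = "numpy==1.20.3\npandas==1.2.4\n"
    # Hypothetical freeze output from an interpreter outside the project environment.
    freeze = "numpy==1.20.3\npandas==1.2.4\nrequests==2.25.1\n"
    print(unexpected_packages(freeze, requirements))
```

A long list of unexpected names is a hint, not proof, that the wrong environment is active; a short list may simply mean requirements.txt is out of date.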
Where should I put my virtual environment?
Different Python tools have different options here, but I recommend:
- always put your virtual environment in the same directory as the project
- always call your virtual environment the same thing
An advantage of always placing your virtual environment in the project directory is that it consolidates the state of the project in one location. If you make some unrecoverable error inside a project and want to erase it and restore from some known good state, removing the project directory will also erase the virtual environment. Not all Python tools necessarily cooperate with this assumption [4], but it’s a good place to start.
Similarly, calling each environment the same thing makes it easy to ignore virtual environments through a global gitignore file, so you don’t accidentally commit them to version control.
A closing xkcd
Virtual environments are like git: if you make a mistake, you can always start over.
Footnotes
[1] You may also see this written as simply pip install pandas. ↩︎
[2] Installing a helper program like starship can make it easier to keep track of whether a virtual environment is active. ↩︎
[3] https://devhints.io/bash#redirection ↩︎
[4] Jupyter, for example, stores state in a number of different locations. Run jupyter --data-dir, jupyter --config-dir, or jupyter --runtime-dir for more information. ↩︎