For better reproducibility and mental health
Researchers often use multiple languages and packages with ever-changing versions. Because the algorithms and functions in a package may change over time, code written today may yield slightly different results, or not run at all, tomorrow. For this reason, it’s often no longer enough to simply provide your data and code for replication. Virtual environments offer an easy solution by creating computing environments that can be shared and reproduced. The best part, though, is that they are more than just an extra step to help researchers replicate your work. Virtual environments will make your life easier and save you time and headaches as you update packages and move from project to project. Old scripts don’t break, dependency conflicts between projects largely disappear, and code is more portable between local machines and the cloud.
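In this tutorial I'm going to cover what virtual environments are, how to create and activate one, how to install packages, how to use an IDE inside an environment, and how to share environments with others.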
Think of virtual environments as isolated research labs within your computer. Each lab maintains its own equipment, and when you want to work on a specific project, you go to the lab associated with that project. In the same way, each virtual environment is an isolated space that maintains its own copies of languages and packages. As the researcher, you can switch between these environments quickly and easily depending on what you’re working on. For example, let’s say you’re working on some original research and you create your visualizations with ggplot2 3.3. Simultaneously, you’re working on a replication project that uses ggplot2 2.2. Unfortunately, the visualizations written for ggplot2 2.2 don’t work properly when you run the code with ggplot2 3.3. Troubleshooting other people’s code can be frustrating and time-consuming. So rather than trying to update the 2.2 code so it runs with 3.3 packages, you can spin up a new virtual environment, install ggplot2 2.2, and run the code as is.
As a rule of thumb, everything you do in Python and R should be inside a virtual environment. This may sound complicated, but it’s incredibly easy if you use the right tools!
There are several language-specific environment managers, such as virtualenv for Python and renv for R, but I highly recommend using Anaconda’s conda environment manager. Conda is more intuitive than the alternatives, makes maintenance a breeze, and is widely adopted, so that's what I'll use in this tutorial.
Before you start, you’re going to need Anaconda installed. If you don’t already have it, head over to the Anaconda website and follow the directions for your operating system. If you already have Anaconda installed, update conda to version 4.6 or later (conda update conda); some of the commands won't work otherwise.
It’s handy to create an environment for miscellaneous tasks not attached to any project. I call it my sandbox. It has everything I regularly use, and I spend no effort worrying about compatibility. If anything breaks, I just delete it and create a fresh one. Let’s create a sandbox to get started. In your terminal type:
conda create --name sandbox
"sandbox" can be any name you want to give your environment. That’s it! Note that without any packages listed, conda creates an empty environment, so you install the languages you need afterwards. If you want to use a specific version of Python (installing R is covered below), you can specify it when creating the environment:
conda create --name sandbox python=3.6
To work within your environment, you need to activate it. To do so, enter in the terminal:
conda activate sandbox
To exit the environment, enter:
conda deactivate
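If you want to confirm from inside Python which environment a script is actually running in, a minimal sketch like the following can help. It assumes the interpreter was started from a terminal where conda activate was run; the function name active_environment is just an illustrative choice.

```python
import os
import sys


def active_environment():
    """Return (name, prefix) describing where this interpreter is running."""
    # `conda activate` sets CONDA_DEFAULT_ENV; it is absent outside conda.
    name = os.environ.get("CONDA_DEFAULT_ENV", "(no conda environment active)")
    # sys.prefix is the directory the running interpreter lives in,
    # which for a conda environment is that environment's folder.
    return name, sys.prefix


name, prefix = active_environment()
print(f"environment: {name}")
print(f"interpreter prefix: {prefix}")
```

If the name printed is not the one you expected, the script was probably launched from a terminal where a different environment (or none) was active.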
To install the latest version of a package or language to your environment, use the conda install command. For example, to install the latest version of the pandas package for Python, type:
conda install pandas
R packages use the format "r-packagename". For example, to install the R language to your environment enter:
conda install r-base
To install a specific version of a package or language, specify the version when installing:
conda install r-ggplot2=2.2
If conda does not have a package you need, you can install it the usual way, such as pip install for Python or the install.packages() function in R. Always try conda first, though, to avoid potential compatibility issues. Additionally, creating and deleting environments and installing new packages can leave artifacts on your drive that take up space but aren't doing anything. To clean these up, run the following two commands every once in a while (the first removes cached package tarballs, the second removes unused packages from the cache):
conda clean -t
conda clean -p
Most of us use an IDE such as Jupyter or RStudio when programming. You can use these in your environment as well! Applications you open from within your virtual environment will use the packages and versions installed in that environment. To do so, activate your environment and then launch your IDE from that terminal. For example, if you have RStudio installed on your computer, you can launch it in your environment by typing “rstudio” in the terminal. If you use JupyterLab, you can either launch it by entering “jupyter lab” within your virtual environment, or you can select your environment from the Jupyter launcher if you add the environment’s kernel to Jupyter.
Environments are shared by exporting a list of packages to a text file, which is then used to install those packages automatically. There are two formats for doing this. The first is the YML file, which is used when reinstalling an environment on your own system or sharing an environment with someone using the same operating system. To create a YML file, use the following command (the --name flag selects the environment to export):
conda env export --name sandbox > sandbox.yml
To install an environment from a YML file, navigate your terminal to the directory containing the file and use the following command:
conda env create --file sandbox.yml
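The exported file pins each conda package as name=version=build and ends with a prefix line pointing at a folder on your machine. Both are tied to your system, which is part of why the file travels poorly. As a rough sketch, the helper below drops those parts from the exported text. The function name and the package versions and build strings in the example are made up for illustration; conda also has a --no-builds option for conda env export that is worth checking in its documentation.

```python
def strip_build_strings(yml_text):
    """Drop build strings and the prefix line from exported environment text.

    '  - pandas=1.1.3=py38he6710b0_0' becomes '  - pandas=1.1.3'.
    pip-style pins ('name==version') and all other lines are left unchanged.
    """
    cleaned = []
    for line in yml_text.splitlines():
        stripped = line.lstrip()
        # The prefix line records a path on the exporting machine only.
        if stripped.startswith("prefix:"):
            continue
        # conda dependency lines look like "- name=version=build".
        is_conda_pin = (
            stripped.startswith("- ")
            and stripped.count("=") == 2
            and "==" not in stripped
        )
        if is_conda_pin:
            line = line.rsplit("=", 1)[0]
        cleaned.append(line)
    return "\n".join(cleaned)


# Hypothetical export; names, versions, and build strings are illustrative.
example = """name: sandbox
channels:
  - defaults
dependencies:
  - pandas=1.1.3=py38he6710b0_0
  - r-base=3.6.1=haffb61f_2
  - pip:
    - requests==2.25.1
prefix: /home/user/anaconda3/envs/sandbox
"""

print(strip_build_strings(example))
```

Even with build strings removed, some packages simply don't exist for every operating system, so treat the result as a starting point rather than a guarantee.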
YML files represent a best case scenario because they can install both Python and R packages from a single file. When sharing environments with users on a different operating system, those users can either open the YML file and manually install the specific package versions, or use a language-specific package manager. For Python, environments can be exported to a requirements.txt file with pip:
pip freeze > requirements.txt
To install an environment from a requirements.txt file use:
pip install -r requirements.txt
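After installing from a shared file, it can be reassuring to check that what ended up in the environment matches what was pinned. The sketch below is one way to do that, assuming Python 3.8 or later (for importlib.metadata) and a file made of name==version lines. The function name find_mismatches is an illustrative choice, not part of pip or conda.

```python
from importlib.metadata import PackageNotFoundError, version


def find_mismatches(requirement_lines):
    """Compare 'name==version' pins against the active environment.

    Returns {name: (pinned, installed)} for every pin that does not match;
    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for raw in requirement_lines:
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an exact pin
        # (for example the 'name @ file:///...' lines pip can emit).
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, pinned = (part.strip() for part in line.split("==", 1))
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Usage with a real file (path assumed to be in the current directory):
# with open("requirements.txt") as handle:
#     print(find_mismatches(handle))

# Self-contained demo with a deliberately nonexistent package name.
print(find_mismatches(["# example pin", "surely-not-a-real-package-name==1.0"]))
```

An empty dictionary means every exact pin in the file matches the active environment.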
If you need an R-specific environment, I recommend either a manual install from the YML file or using a package called renv. renv handles environments a little differently, so I recommend reading its documentation if this is what you need.
Keep in mind this is a rough guide. There’s more you can do with environments, and there are multiple platforms outside of conda for managing environments. You’ll undoubtedly encounter some errors while you learn. Once you figure it out, though, environments are a breeze to work with and make your life easier!