Managing Collaborative and Reproducible Data Analysis Using Git
Data Analysis helps businesses answer critical questions and solve pressing problems. Data can guide decision-making and inform actions such as increasing customer loyalty, decreasing costs, or optimizing performance. To produce impactful insights, a data analyst needs to manage input data, processing scripts, models, and output files such as presentations, figures, and tables.
Using a version control system helps to increase productivity:
- Managing code changes. A version control system allows you to keep track of the changes you’ve made to your work over time. It creates a history of changes and lets you revert to earlier versions of the code.
- Ease of collaboration. It lets you and others work together on projects from anywhere, with multiple people working on different, or the same, aspects of an analysis. This makes it possible to parallelize analysis steps and facilitates experimentation and exploration.
- Enhanced reproducibility. It is easier to repeat the same data analysis in the future, or have another person run the data analysis and produce the same results because scripts are stored centrally. This helps to create transparency and makes dependencies more visible.
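To make the first point concrete, here is a minimal command-line sketch of keeping a history and going back to an earlier version. It runs in a throwaway directory; the file name `analysis.R`, its contents, and the identity values are hypothetical placeholders, not part of the project described in this post.

```shell
set -e

# Work in a throwaway directory so nothing on your machine is touched.
demo="$(mktemp -d)"
cd "$demo"
git init -q .
# Identity is set for this repository only; name and email are placeholders.
git config user.name "Demo Analyst"
git config user.email "analyst@example.com"

# First version of a (hypothetical) analysis script.
echo 'threshold <- 10' > analysis.R
git add analysis.R
git commit -q -m "Add analysis script with threshold 10"

# Second version changes the threshold.
echo 'threshold <- 50' > analysis.R
git commit -q -am "Raise threshold to 50"

# Show the history of changes, newest first.
git --no-pager log --oneline

# Restore the file as it was one commit earlier and record that as a new commit.
git checkout HEAD~1 -- analysis.R
git commit -q -m "Restore threshold of 10"

cat analysis.R
```

Restoring the old file in a new commit keeps the full history intact, so the change to 50 and the decision to go back both remain visible.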
Using GitHub and RStudio to Run Data Analysis Projects and Tasks
GitHub is a cloud-based hosting service that lets you manage Git repositories and offers a free tier. It is one of the most popular and well-known hosting platforms for Git, which is itself a widely used version control system. RStudio is one of the most popular IDEs for data analysis with R, and it integrates with both Git and GitHub. Both R and RStudio are also open source and free to use.
To link GitHub to RStudio, you create an RStudio project that is based on a version-controlled repository. RStudio projects make it straightforward to divide your work into multiple contexts, each with its own working directory, workspace, history, and source documents.
As Data Analysis projects and tasks grow more complex, the number of files, scripts, and outputs usually grows with them. Depending on the project's needs and constraints, Git can be used in various ways to manage aspects of your Data Analysis workflows.
To create reproducible results, we use Git in the following ways:
- Hosting input data. Storing the input data in the repository makes it available for reproducing results from anywhere.
- Managing processing code. Data may be preprocessed, cleaned, checked, transformed, combined, analyzed, and reported. You can save all those steps and scripts under one roof.
Our Data Analysis use case consists of exploring museum data visually to identify the most popular museums. We are going to use a small dataset containing information on museums, zoos, and aquariums in the US, including geolocation (longitude and latitude) as well as revenue. We are only interested in museums in Washington, DC and want to show their locations on a map, with each marker sized by revenue. Once we have created the map, we want to provide it to other analysts so that they can build upon it.
The dataset we are using is available on Kaggle; you can find it here.
Incorporating Git into the Data Analysis Workflow
The following prerequisites are necessary to incorporate GitHub into your Data Analysis workflow when using RStudio:
- Have R/RStudio installed. Get started with R here and RStudio here.
- Have Git installed. Get started here.
- Have a local Git client installed. We are using GitKraken; get started here.
- Have a GitHub account to clone a remote repository from and push changes to. Sign up here.
Once you have installed all the necessary applications, start by opening up RStudio and creating a new project based on version control:
In RStudio, go to “File > New Project”
Click on “Version Control: Checkout a project from a version control repository”
Click on “Git: Clone a project from a repository”
Fill in the info:
- URL: use the repository's HTTPS address
- Create as a subdirectory of: Browse to where you would like to create this folder
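For readers who prefer the command line, the dialog above corresponds roughly to a single `git clone`. The sketch below is an illustration under assumptions: so that it runs without a network connection, it uses a local bare repository as a stand-in for the GitHub repository, and the folder name `museum-analysis` is made up. With a real project you would pass the repository's HTTPS address instead.

```shell
set -e

work="$(mktemp -d)"

# Stand-in for the GitHub repository: a local, empty bare repository.
# With a real project, skip this step and use the HTTPS URL instead.
git init -q --bare "$work/remote.git"

# Rough equivalent of RStudio's "Clone a project from a repository".
# First argument: repository URL. Second argument: folder to create.
git clone -q "$work/remote.git" "$work/museum-analysis"

# Confirm that the clone points back at the repository it came from.
git -C "$work/museum-analysis" remote -v
```

Because the stand-in repository is empty, Git prints a warning that you cloned an empty repository; this is harmless here.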
Now that we have cloned the remote repository and have RStudio linked to it, we can write our processing script. We start by installing and loading the necessary libraries. Among others, we install the package “RCurl,” which lets us interact with our input data hosted within our remote GitHub repository.
We write a short processing script that limits the data to museums in DC and visualizes their geolocations using the “leaflet” package. The revenue variable determines the size of each circle: the larger the circle, the more revenue the museum has generated. This lets us explore the data set visually and quickly see which museums in DC are most popular, using revenue as the measure.
Right now, the script runs on the local machine and is stored only in our local repository. To collaborate with colleagues, we want to publish this code, so we are going to upload the script to the remote GitHub repository.
Changes have to be committed to the local repository before they can be pushed to the remote repository.
We start in the local Git client by creating a new branch:
Using GitKraken, go to Branch and type in the name of the new branch.
Click on “View Changes” where it says a file has changed.
Click on the script to see the changes. Git gives you the ability to see which lines have been added and deleted compared to the previous version.
Stage the changes and write a meaningful commit message reflecting the purpose of the change. Then click commit.
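The same steps can also be done with the Git command line. The following is a minimal sketch in a throwaway repository, not the exact project from this post: the script name `museums.R`, its contents, the branch name `dc-museum-map`, and the identity values are hypothetical.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init -q .
git config user.name "Demo Analyst"           # placeholder identity
git config user.email "analyst@example.com"

# Starting point: a script that is already committed.
echo 'library(leaflet)' > museums.R
git add museums.R
git commit -q -m "Add museum mapping script"

# 1. Create a new branch and switch to it.
git checkout -q -b dc-museum-map

# 2. Change the script, then inspect what changed.
echo '# keep only museums in DC' >> museums.R
git status --short        # lists files that have changed
git --no-pager diff       # added lines start with "+", deleted lines with "-"

# 3. Stage the change and commit it with a meaningful message.
git add museums.R
git commit -q -m "Filter museums to Washington, DC"

git --no-pager log --oneline
```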
Once you have committed the changes to the local repository, you can push them to the remote repository on GitHub:
Go to Push and choose your local branch to push to the “master” remote branch.
A message will notify you that the changes have been pushed to the remote repository.
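On the command line, pushing a local branch to the remote “master” branch can be sketched as follows. As an assumption for illustration, a local bare repository again stands in for GitHub so the example runs offline, and the file and branch names are hypothetical.

```shell
set -e

work="$(mktemp -d)"

# Stand-in for GitHub: a local bare repository (use your HTTPS URL in practice).
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/local"
cd "$work/local"
git config user.name "Demo Analyst"           # placeholder identity
git config user.email "analyst@example.com"

# Commit the script on a local branch.
git checkout -q -b dc-museum-map
echo 'library(leaflet)' > museums.R
git add museums.R
git commit -q -m "Add museum mapping script"

# Push the local branch to the remote "master" branch.
# The refspec reads <local branch>:<remote branch>.
git push -q origin dc-museum-map:master

# The remote now lists the pushed commit under refs/heads/master.
git ls-remote --heads origin
```

Pushing straight to “master” mirrors the steps in this post. Teams that review changes before merging usually push the branch under its own name and open a pull request on GitHub instead.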
Navigate to the remote GitHub repository to view your newly uploaded file.
Integrating Git into the Regular Data Analysis Workflow
To get the most out of a version control system for Data Analysis, it’s best to bake the processes into your everyday Data Analysis workflow. Commit and push changes to the remote Git repository regularly to get into the habit of storing your work in a secure and remotely accessible location. At first, it might feel like the effort is not worth the time, but the benefits will outweigh the initial hurdle of getting accustomed to it. Not all of the functionality of Git and GitHub will seem relevant to the workflow of a Data Analyst. Still, using the parts that do make sense can improve your workflow management and lets you adopt relevant software development best practices.
Now that our processing script and input data are available on GitHub, we can manage our code more effectively, collaborate with colleagues, and reproduce the analysis at a later time.
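To illustrate the collaboration side, here is a small sketch of one analyst publishing changes and a colleague picking them up. It is an assumption-laden demo rather than the workflow of this post's repository: a local bare repository stands in for GitHub, and the analyst names, file name, and file contents are hypothetical.

```shell
set -e

work="$(mktemp -d)"

# Stand-in for GitHub: a local bare repository.
git init -q --bare "$work/remote.git"

# Analyst A publishes a first version of the script.
git clone -q "$work/remote.git" "$work/analyst-a"
cd "$work/analyst-a"
git config user.name "Analyst A"              # placeholder identity
git config user.email "a@example.com"
echo 'library(leaflet)' > museums.R
git add museums.R
git commit -q -m "Add museum mapping script"
git push -q origin HEAD:master

# Analyst B clones the repository and receives the same script.
git clone -q -b master "$work/remote.git" "$work/analyst-b"

# Analyst A publishes an update ...
echo '# keep only museums in DC' >> museums.R
git commit -q -am "Filter museums to Washington, DC"
git push -q origin HEAD:master

# ... and Analyst B picks it up with a pull.
cd "$work/analyst-b"
git pull -q origin master

cat museums.R
```

After the pull, both working copies contain the same script and the same two commits, which is what makes it possible to rerun the analysis elsewhere and get the same result.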
You can find all the material from this post at the Excella Labs open-source GitHub repository.