Data Workflows

Reproducible Research for Biologists using Git


Ever tried to run your amazing data analysis six months later? Or tried to share your code with a colleague? Was it a smooth and beautiful transition? Or did you spend hours trying to figure out what you did and why it doesn't work anymore? If so, this course is for you.

This course will introduce you to several tools that will up your data engineering game and make your research reproducible, both for future you and for your collaborators. We will cover the basics of version control with Git, how to use GitHub to manage your code and data and to collaborate with others, how to use Python* tools to make your work more reproducible, and how to use GitHub Actions to automate your workflow.

By the end of the course you will have a solid understanding of how to make your work more reproducible and save yourself future headaches.

* Other programming languages can also be supported.

What does the course cover?

Git and GitHub

We start the course by covering basic shell usage in your computer's terminal. We then move on to the Git CLI (command-line interface), introducing the basic concepts of version control as we go, including pushing, pulling and merging changes.

Once we're familiar with the Git CLI basics, we will move into the popular GitHub web interface, where we will cover the basics of creating and managing repositories, as well as the GitHub workflow using feature branches, pull requests and code reviews. We will also cover the concept of rebasing and how to use it to keep your Git history clean and easy to follow. Finally, we will demonstrate the power of GitHub Issues and Projects as a way to manage your work and collaborate with others.

Reproducibility (with Python)

The second part of the course will focus on some key ideas that you can use to improve the reproducibility of your work. This part of the course is usually done with a Python-first mentality, but other programming languages can also be supported (see prerequisites).

We will cover virtual environments and reproducible dependency management (with Pipenv), environment and configuration files, and automated testing (with pytest) as a way to make sure you don't break your code. Finally, we will go back into GitHub and show you how GitHub Actions can help you automate this and so much more.
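To give a flavour of what automated testing looks like, here is a minimal sketch of a pytest-style test file. The gc_content function and the test names are hypothetical illustrations, not actual course material. pytest collects functions whose names start with test_ and reports any failing assert.

```python
# test_gc_content.py
# Hypothetical example: a small analysis function plus pytest-style tests.


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


def test_gc_content_of_balanced_sequence():
    # Two of the four bases are G or C.
    assert gc_content("ATGC") == 0.5


def test_gc_content_is_case_insensitive():
    assert gc_content("atgc") == gc_content("ATGC")


def test_gc_content_rejects_empty_sequence():
    # Plain try/except keeps this sketch free of any pytest import.
    try:
        gc_content("")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an empty sequence")
```

Assuming pytest is installed in your virtual environment, running pytest in the folder containing this file should discover and run the three tests. If a later change breaks gc_content, the failing test tells you straight away.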

Who is the course for?

Scientists at any career stage (including students) in biology and related areas of science and medicine who have some experience of coding in Python or another programming language (see prerequisites) and wish to expand their skills in the area of data engineering and reproducibility.

If you have no experience programming in any language then you may find our Python for Biologists course more appropriate.

Attendees usually work on their own laptops and are expected to install some programs before the course. Any laptop or operating system is suitable.

A note on prerequisites

We expect attendees to have previous experience in Python, R or any other programming language. Your programming skills should be equivalent to having completed our Data Exploration course and then built upon those skills in your language of choice. If you are using any language other than Python, please make this clear when you organise or sign up for the course so we can be sure to show you the correct tools for your language.

Attendees should be comfortable installing applications on their own laptops and will be required to do so before the course. Installation details will be provided closer to the time. Any laptop or operating system is suitable.