Data Exploration

Analysing and visualising data with Python for Biologists


Trapped in a jungle of data? Data analysis skills can help and are an essential part of the modern scientist's toolkit. We are working with increasingly large and complex datasets, and programming provides us with tools that can automate data analysis and make it easier for us all. Data analysis in Python leverages both well-established and state-of-the-art statistical and modelling methods, some of which are available as soon as a paper is published!

From mining data lakes with automated analysis through to sharing data and statistical results with colleagues, programming is an essential data analysis skill for the modern biologist. This intensive, hands-on, one-day course will teach you the basics of data exploration, analysis and visualisation using Python and is a natural follow-on for those who have done our Python for Biologists course (see prerequisites below).

What does the course cover?

Data Exploration and Hypothesis Testing

In the first part of the course we'll walk through our example datasets and introduce you to some key Python modules for quick and easy data exploration.

In groups we will clean our datasets, introducing data engineering approaches for repeatable analyses. We will then explore common statistical hypothesis tests and identify the correct tests for our example datasets and hypotheses. We'll explore different Python modules for these tests and use them on our example datasets.

As part of this section we will demonstrate generalised linear models.

Visualisation, Reproducibility and Sharing

In the second part, we'll focus on visualising data by building interactive visualisations for data exploration and presenting the results of our earlier hypothesis testing.

We'll then discuss different aspects of reproducibility and of sharing data, code and results.

Finally, we'll package up the day's activities in an easy-to-share format that colleagues without coding experience can run or use as interactive widgets to see your results.

Who is the course for?

Scientists at any career stage (including students) in biology and related areas of science and medicine who have some experience of coding in Python (see prerequisites) and wish to expand their skills in the area of data analysis and visualisation. Some statistics knowledge is assumed (see notes on prerequisites).

If you have no experience programming in any language then you may find our Python for Biologists course more appropriate.

Attendees usually work on their own laptops and are expected to install some programs before the course. Any laptop or operating system is suitable.

A note on prerequisites

We expect attendees to have previous Python skills equivalent to having done our Python for Biologists course and to have built upon those skills since.

This means that participants should be comfortable:

  • installing applications and Python modules on their own laptops
  • reading data into pandas dataframes and writing it back out
  • manipulating pandas dataframes to filter data
  • using numpy, scipy and pandas for simple data analysis, e.g. calculating a mean or running a linear regression
  • using seaborn and matplotlib for visualising data
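
As a rough self-check of the level described above, the short sketch below reads a small table into a pandas dataframe, filters it, calculates a mean and fits a line of best fit. It is not course material: the column names and measurements are invented purely for illustration, and plotting with seaborn or matplotlib is left out to keep it short.

```python
import io

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical measurements, invented for illustration only.
csv_text = """sample,treatment,length_mm,mass_g
s1,control,10.0,2.0
s2,control,12.0,2.5
s3,treated,14.0,3.0
s4,treated,16.0,3.5
s5,treated,18.0,4.0
"""

# Read the data into a pandas dataframe (a file path works the same way).
df = pd.read_csv(io.StringIO(csv_text))

# Filter rows with a boolean mask.
treated = df[df["treatment"] == "treated"]

# Simple summary statistic with numpy.
mean_length = np.mean(treated["length_mm"])

# Linear regression (line of best fit) with scipy.
fit = stats.linregress(df["length_mm"], df["mass_g"])

# Write the filtered rows back out as CSV text.
treated_csv = treated.to_csv(index=False)

print(f"mean treated length: {mean_length:.1f} mm")
print(f"slope: {fit.slope:.3f}, intercept: {fit.intercept:.3f}")
```

If each step here reads as familiar, the Python prerequisites are probably met.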

This course also requires some basic statistical know-how, including:

  • understanding the difference between samples and populations
  • median, mean, standard deviation / variance and interquartile ranges
  • at least one form of regression analysis, e.g. linear regression for line of best fit
  • at least one form of hypothesis testing, e.g. a Student's t-test or similar
  • understanding the concept of distributions, e.g. the normal distribution (bell curve)
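
The statistical ideas listed above can be sketched in a few lines of Python. This is only an illustration of the expected background, not part of the course: the two samples are made up, and the use of an independent two-sample t-test from scipy is an assumption chosen for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical samples drawn from two populations (invented values).
control = np.array([4.1, 3.9, 4.3, 4.0, 4.2])
treated = np.array([4.8, 5.1, 4.9, 5.2, 5.0])

# Descriptive statistics for one sample.
control_mean = control.mean()
control_median = np.median(control)
# ddof=1 gives the sample (not population) standard deviation.
control_sd = control.std(ddof=1)
control_var = control.var(ddof=1)

# Interquartile range: distance between the 25th and 75th percentiles.
q1, q3 = np.percentile(control, [25, 75])
iqr = q3 - q1

# Student's t-test for two independent samples (assumes equal variances).
result = stats.ttest_ind(control, treated)

print(f"mean={control_mean:.2f}, median={control_median:.2f}, "
      f"sd={control_sd:.3f}, IQR={iqr:.2f}")
print(f"t={result.statistic:.2f}, p={result.pvalue:.2g}")
```

Being able to say what each printed number means, and when a t-test would or would not be appropriate, is roughly the level of statistics assumed.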