Machine Learning

Understanding Machine Learning for Biologists (in Python)


Machine learning, a branch of artificial intelligence, is quickly becoming as essential to research as computers themselves became at the beginning of the century. Working with such algorithms directly through a programming language allows us to automate and improve our scientific research in ways that are not possible for mere "consumers" of those tools.

In this course you will learn the foundations of machine learning, from linear regression to deep neural networks. We will provide applied examples and programming tasks in the Jupyter Notebook environment to help you start building your intuition around the topic. This will become invaluable as more and more of your research comes to rely on machine learning tools and data engineering. This course is a natural follow-on for those who have done our Data Exploration course (see prerequisites).

What does the course cover?

Basic Predictive Models

Starting with simple linear regression, we will look at familiar concepts from statistics from a slightly different angle. We will examine how linear and logistic regression are applied to predictive tasks, and how gradient descent is used to optimise models. It will surprise you how much of machine learning you already know (though under different names!). In the first set of practical exercises you will familiarise yourself with scikit-learn, Python's widely used machine learning library.
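To give a flavour of this session, here is a minimal sketch (not taken from the course materials) that fits a straight line in two ways: by hand-written gradient descent on the mean squared error, and with scikit-learn's LinearRegression. The synthetic data, the helper name fit_line_gradient_descent, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_line_gradient_descent(x, y, learning_rate=0.1, n_steps=2000):
    """Fit y ~ slope * x + intercept by minimising the mean squared error.

    Hypothetical helper for illustration; the defaults are not tuned.
    """
    slope, intercept = 0.0, 0.0
    for _ in range(n_steps):
        residual = slope * x + intercept - y
        # Gradients of the mean squared error with respect to each parameter
        grad_slope = 2.0 * np.mean(residual * x)
        grad_intercept = 2.0 * np.mean(residual)
        # Step downhill along the gradient
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up data: a noisy straight line with true slope 3 and intercept 1
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, size=200)

gd_slope, gd_intercept = fit_line_gradient_descent(x, y)

# scikit-learn expects a 2D feature matrix, hence the reshape
model = LinearRegression().fit(x.reshape(-1, 1), y)
sk_slope, sk_intercept = model.coef_[0], model.intercept_

print(f"gradient descent: slope={gd_slope:.3f}, intercept={gd_intercept:.3f}")
print(f"scikit-learn:     slope={sk_slope:.3f}, intercept={sk_intercept:.3f}")
```

Both approaches should arrive at very similar parameters, which is the point: the "model fitting" of statistics and the "training" of machine learning are often the same optimisation under different names.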

Unsupervised Learning

Unsupervised learning techniques allow you to simplify and understand your unlabelled data. In this session we will cover popular techniques used to identify patterns and subgroups in data (k-means and DBSCAN clustering, alongside the neighbour-based k-NN method), practise segmentation on images (with Gaussian Mixture Models), and visualise and interpret high-dimensional data with Principal Component Analysis (PCA).
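As a rough illustration of what this looks like in practice, the sketch below clusters unlabelled data with k-means and projects it onto two principal components for plotting. It is only a sketch under stated assumptions: the data are synthetic (generated with scikit-learn's make_blobs), and the number of clusters, features and components are arbitrary choices rather than anything prescribed by the course.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for unlabelled measurements: 300 samples, 10 features.
# The true group labels returned by make_blobs are deliberately discarded.
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

# Group the samples into 3 clusters (assumed here; real data needs exploring)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Reduce 10 features to 2 principal components for visualisation
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

cluster_sizes = [int((cluster_labels == k).sum()) for k in range(3)]
explained = float(pca.explained_variance_ratio_.sum())

print("cluster sizes:", cluster_sizes)
print(f"variance explained by 2 components: {explained:.2f}")
```

The array X_2d can be passed straight to a scatter plot, coloured by cluster_labels, to inspect how well the clusters separate.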

Supervised Learning

In the first afternoon session we will focus on powerful predictive models that use labelled data to automate the classification of a wider dataset. We will learn the inner workings of models already widely used in scientific research, such as Random Forests, Support Vector Machines and Neural Networks. In the practical exercise you will benchmark the performance of your models to identify the most suitable approach and to prevent under- and overfitting.
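A benchmark of this kind might look like the following sketch, which compares a Random Forest with a Support Vector Machine using 5-fold cross-validation. The choice of scikit-learn's bundled breast cancer dataset, the default hyperparameters and the number of folds are assumptions made for illustration, not the course's actual exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small labelled dataset shipped with scikit-learn (no download needed)
X, y = load_breast_cancer(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # SVMs are sensitive to feature scale, so standardise inside the pipeline
    "support vector machine": make_pipeline(StandardScaler(), SVC()),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    train_acc = float(scores["train_score"].mean())
    test_acc = float(scores["test_score"].mean())
    results[name] = (train_acc, test_acc)
    print(f"{name}: train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

A training accuracy far above the held-out accuracy is a sign of overfitting, while low accuracy on both suggests underfitting.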

Deep Learning

In the last part we will review the current landscape of the most advanced models and frameworks available to researchers. Starting with deep learning techniques in image analysis and computer vision, we will look at tools for automated counting, tracking, pose estimation and behaviour classification. We will take a look at how Large Language Models (LLMs) can help you accelerate your coding, writing and scientific reasoning. We will also discuss the ethical implications of using such models and how to avoid common pitfalls.

Who is the course for?

Scientists at any career stage (including students) in biology and related areas of science and medicine who have some experience of coding in Python (see prerequisites) and wish to expand their skills in the area of data engineering. Some statistics knowledge is assumed (see notes on prerequisites).

If you have no experience programming in any language then you may find our Python for Biologists course more appropriate.

Attendees usually work on their own laptops and are expected to install some software before the course. Any laptop or operating system is suitable.

A note on prerequisites

We expect attendees to have Python skills equivalent to having completed our Data Exploration course, and to have built upon those skills since.

Basic knowledge of statistics and mathematics (equivalent to a foundation course) is also required to fully appreciate the technical underpinnings of machine learning, though it is not necessary for understanding the key concepts at a high level.