Machine Learning and Healthcare: Breast Cancer Diagnosis, Part I

Part 1: Introduction to ML and Cancer Diagnosis

This is Part 1 of a two-part series on Machine Learning and Healthcare, exploring breast cancer diagnosis as an example application.

We are living in a time where machine learning algorithms are quickly infiltrating every aspect of our lives, helping to make everyday tasks easier, faster and more consistent.

  • Accident on your usual route home? Waze will apply algorithms to vast quantities of location and user reported data to suggest a better route.
  • Too lazy to go to the bank to cash a check? Use your banking app to send a photo, and computer vision will decipher the handwriting so you no longer need to leave your home.
  • Need to send an urgent message while you’re driving? Just say “Ok Google” or “Hey Siri” and natural language processing algorithms will be called upon to translate your speech to text.

With so many applications of Machine Learning (ML) and Artificial Intelligence (AI) assisting us in our daily lives, it is no stretch to imagine the impact they can have on many industries.

This series of blog posts will explore how Machine Learning is influencing the Healthcare industry and demonstrate breast cancer diagnosis as an application for this technology.

What is Machine Learning?

Lately, machine learning, algorithms and big data seem to be among the hottest buzzwords thrown around in the tech world. But beyond the buzz, let’s take a step back to look at what these terms mean.

Algorithm: “A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.” – Oxford pocket dictionary

Machine Learning: “Machine learning is the science of getting computers to act without being explicitly programmed.” – Stanford

Put simply, machine learning goes beyond traditional, rule-based programming and allows computers to develop their own solutions to problems. As an example, imagine a program tasked with classifying photos of cats and dogs. In traditional programming, a developer would hand-write a list of criteria to differentiate the two groups: dogs have floppy ears and long tongues, cats have pointy ears and whiskers, for example. In machine learning, labelled data (in this case, photos known to be of cats or of dogs) is used to train an algorithm to classify future data by identifying distinguishing features from within the data itself. This brings us to our final buzzword definition:

Big Data: “An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” – Wikipedia

By applying ML to big data we can find patterns buried in the data without relying on a human to specify them. Using ML, the cat/dog classification algorithm can pick up on abstract features that a hand-written list of rules would miss, allowing photos of cats and dogs to be classified faster and more accurately than with a rule-based approach.
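
The contrast between the two approaches can be made concrete with a toy sketch. Everything in it is an assumption introduced for illustration: the single "ear pointiness" score, the made-up labelled examples, the hand-picked 0.75 cut-off and the function names. Real image classifiers learn from far richer features. The sketch places a hand-written rule next to a cut-off chosen automatically from labelled data.

```python
# Toy illustration only: the "ear pointiness" scores and labels are made up.
labelled_photos = [
    (0.9, "cat"), (0.8, "cat"), (0.7, "cat"), (0.6, "cat"),
    (0.4, "dog"), (0.3, "dog"), (0.2, "dog"), (0.1, "dog"),
]


def rule_based_classify(ear_pointiness):
    """Traditional programming: a human picks the rule and the cut-off."""
    return "cat" if ear_pointiness > 0.75 else "dog"


def count_errors(threshold, examples):
    """Count how many labelled examples a given cut-off gets wrong."""
    return sum(
        ("cat" if score > threshold else "dog") != label
        for score, label in examples
    )


def learn_threshold(examples):
    """Machine learning in miniature: choose the cut-off that makes
    the fewest mistakes on the labelled examples."""
    scores = sorted(score for score, _ in examples)
    # Candidate cut-offs sit halfway between neighbouring scores.
    candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    return min(candidates, key=lambda t: count_errors(t, examples))


learned_threshold = learn_threshold(labelled_photos)


def learned_classify(ear_pointiness):
    """Classify using the cut-off learned from the data."""
    return "cat" if ear_pointiness > learned_threshold else "dog"


rule_errors = sum(
    rule_based_classify(score) != label for score, label in labelled_photos
)
learned_errors = count_errors(learned_threshold, labelled_photos)
print(f"hand-written rule errors: {rule_errors}")
print(f"learned cut-off errors:   {learned_errors}")
```

On this toy data the hand-picked rule mislabels the two cats with the least pointy ears, while the learned cut-off separates every labelled example. The point is not the specific numbers but that the second cut-off came from the data rather than from a person.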

Part 2 of this blog series will look at an example of how an ML algorithm is trained and evaluated.

Machine Learning and Healthcare

Machine Learning and AI are disrupting the healthcare industry in a number of ways, from personalized medicine to drug development, smart electronic health records and beyond. Among their many uses, disease identification and diagnosis are at the forefront of ML research and applications for big data in healthcare.

Cancer is the second leading cause of death globally, responsible for approximately 8.8 million deaths each year. Beyond the obvious physical and emotional impacts of the disease, cancer has significant impacts on the economy, costing the United States around $80.2 billion each year.

Enter ML and its potential to improve early detection, helping to save countless lives and billions of dollars. Already, ML has been considered a successful tool for cancer diagnosis in a number of research studies:

  • A team at Stanford created a model using deep neural networks to diagnose skin cancer at accuracy levels consistent with expert dermatologists.
  • An automatic detection model for lymph node metastasis developed by researchers from MIT and Harvard Medical showed the potential to reduce the human error rate in diagnosis by 85%.
  • The Beth Israel Deaconess Medical Centre is using deep learning to integrate speech and image recognition to diagnose tumours.



ML has the potential to speed up diagnosis and improve the accuracy with which cancer is identified. Where humans look at a relatively small number of easily quantified features for diagnosis, ML can unlock deeper trends and patterns within test results. This has the potential to improve patient outcomes through better treatment matching, to reduce the anxiety associated with misdiagnosis, and to lessen the burden on healthcare resources through diagnosis automation. Perhaps most importantly, ML can play a role in increasing the number of patients who are diagnosed early, which has been found to as much as triple cancer survival rates.

Breast Cancer Diagnosis

After a breast tissue irregularity has been identified through mammography, a fine needle aspiration (FNA) biopsy may be performed as a method to diagnose the mass as cancerous or benign. A sample of cells is removed from the mass and smeared on a glass slide to be examined by specialists in a pathology laboratory. An example of such a slide can be seen in the figure below. The specialist performs a visual inspection and records cell features such as texture, size, symmetry and more to diagnose the nature of the mass and recommend treatment.


Figure 1: Fine needle aspiration slide showing cancerous cells (left) and non-cancerous cells (right).




Dataset for Breast Cancer Machine Learning Exercise

The dataset that will be explored in Part 2 of this blog series is the Breast Cancer Wisconsin (Diagnostic) Data Set, publicly available from the UCI Machine Learning Repository. The data consists of features computed from digitized images of FNA slides for 569 patients: 212 with cancer and 357 with benign masses. It will be split into training and test sets to develop and evaluate an ML model.
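
As a rough preview of what such a split can look like, here is a minimal sketch using only the class counts given above. The 80/20 ratio, the stratification by class, the fixed random seed and the helper name `stratified_split` are assumptions made for illustration and are not necessarily what Part 2 uses. The labels are placeholders built from the counts, not the real patient records.

```python
import random

# Placeholder labels built from the stated class counts.
# 1 = cancerous, 0 = benign. These are NOT the real records.
labels = [1] * 212 + [0] * 357


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class
    separately so both sets keep a similar class balance."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


train_idx, test_idx = stratified_split(labels)

print(f"training samples: {len(train_idx)}")
print(f"test samples:     {len(test_idx)}")
print(f"cancerous cases in test set: {sum(labels[i] for i in test_idx)}")
```

With these assumed settings, 456 samples go to training and 113 to testing, and the test set holds 42 cancerous and 71 benign cases. Splitting each class separately keeps the roughly 37% share of cancerous cases similar in both sets, which matters when one class is less common than the other.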