Machine Learning with CMS Public Healthcare Dataset, Part I


The term “Machine Learning” can seem daunting to anyone unfamiliar with what it actually means, or so it seemed to me when I was first assigned to develop a use case for one of these algorithms during my time here at RxDataScience. Put simply, Machine Learning (ML) is a branch of Artificial Intelligence (AI) that allows a system to learn and improve automatically from past and present data, without being explicitly programmed for the task, in order to predict certain outcomes [1]. The following video provides a gentle introduction to what ML is all about:


What is Machine Learning

Seems simple, right?

The next question I asked myself was, “Why does the pharmaceutical industry need Machine Learning? What is there to gain from using this technology at this point in time?” The answers are clear. AI and ML are redefining the way businesses operate across a multitude of industries, from financial markets to manufacturing to sales. Organizations are realizing that old and seemingly useless data can be repurposed to provide new business insights and to predict the likely outcomes of new ventures and investments. This reprocessed data can be used to make intelligent, well-informed business decisions. By harnessing the combined power of Big Data and ML, companies can provide better healthcare to patients while reducing the time and money spent trying to gain the same insights through hundreds of painstaking man-hours. It’s for these reasons that the pharmaceutical industry is now adopting this technology.


With these improved efficiencies, healthcare companies have the potential to save more lives. Organizations will be able to determine which patients are at higher risk of developing a certain disease. Post-discharge outcomes can be monitored more closely, and the number of re-admissions can be reduced. Diagnoses can also be made faster, so patients learn sooner what they are suffering from and what action they need to take next. These benefits alone are enough to make the application of ML in the pharmaceutical industry worth further investigation and implementation.

The following TEDx talk by Suchi Saria presents an extremely interesting use case of ML in healthcare, and offers an insight into how far ML can drive innovation in this industry to deliver better healthcare for patients:


Over the course of these blogs, I will demonstrate the benefits of applying ML algorithms to pharmaceutical data-sets, and the predictive insights we can extract from them to improve patient care and guard against financial malpractice.

For this exercise I will be exploring the CMS Open Payments data-set, which I will use with my selected Machine Learning model to provide financial and pharma analytics. But first, what exactly is CMS Open Payments?


CMS Open Payments is a federal program, required by the Affordable Care Act, that collects information about the payments that manufacturers of drugs and healthcare devices, and group purchasing organizations (GPOs), make to physicians and teaching hospitals. These payments can include travel expenses, the cost of food and beverages consumed during meetings, gifts, and speaking fees. The aim of the program is to make the financial ties between manufacturers and healthcare professionals transparent, so that there are no inappropriate influences on research, education and clinical decision making that could create biases towards particular manufacturers or prevent patients from receiving the best possible care. Open Payments is not solely a tool for identifying possible malpractice; it is also a useful way of identifying relationships between these two groups that could lead to the development of new healthcare technologies and help prevent wasteful financial practices.

Now you’re probably wondering, “Why use this data-set in particular? What insights can it provide when processed by an ML model?” Using the information collected through Open Payments, we can analyse the monetary relationships between manufacturers/GPOs and physicians/teaching hospitals. We can predict the amount of money likely to be spent on these interactions based on the product being sold, which helps manufacturers and sales reps target the appropriate customers. The same predictions can also act as a regulatory benchmark for monitoring the amounts spent on particular physicians or teaching hospitals, to ensure there is no bias across the market towards certain manufacturers.
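To make the idea of predicting spend from the product a little more concrete, here is a minimal sketch of the simplest possible baseline: predicting the average past payment for each product. This is only an illustration and not the model applied later in this series. The records, the field names (`product`, `amount_usd`) and the helper functions are all hypothetical and do not reflect the real Open Payments column names.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical payment records: the field names and values are invented
# for illustration and are NOT taken from the real Open Payments data.
payments = [
    {"product": "Drug A", "amount_usd": 120.0},
    {"product": "Drug A", "amount_usd": 80.0},
    {"product": "Device B", "amount_usd": 1500.0},
    {"product": "Device B", "amount_usd": 2500.0},
    {"product": "Drug C", "amount_usd": 40.0},
]


def fit_product_baseline(records):
    """Return (mean payment per product, overall mean payment)."""
    amounts_by_product = defaultdict(list)
    for record in records:
        amounts_by_product[record["product"]].append(record["amount_usd"])
    per_product = {p: mean(a) for p, a in amounts_by_product.items()}
    overall = mean(record["amount_usd"] for record in records)
    return per_product, overall


def predict_amount(product, per_product, overall):
    """Predict the product's mean payment, or the overall mean if unseen."""
    return per_product.get(product, overall)


per_product, overall = fit_product_baseline(payments)
print(predict_amount("Drug A", per_product, overall))     # seen product
print(predict_amount("New Drug", per_product, overall))   # unseen product
```

A payment far above the predicted figure for its product is the kind of case that the regulatory benchmark described above might flag for a closer look. A real ML model would aim to improve on this baseline by using more of the available fields.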


In the next blog I will discuss which Machine Learning models I will be applying to the CMS Open Payments data-set, the theory and model application process, and previous use cases where ML has been applied in the healthcare industry.

About the Author

Jonathan is a graduate of Software & Electronic Systems Engineering from Queen’s University Belfast who recently joined the technical team as a go-getting and determined software engineer and data scientist. Working in such a technical environment alongside RxDataScience, he has gained an appreciation for all things machine learning and aspires to learn a great deal in this ever-growing field, in the hope of helping to revolutionize the healthcare industry with this game-changing technology.



