The Tip of the Iceberg

Huge Potential Lies in Wait in Pharma Data

Machine Learning in Pharma has largely been confined to well-known use cases such as image recognition, disease prevention, clinical trials and other areas closely aligned with R&D. The application of ML to real-world evidence data or, more generally, to commercial market research has been relatively limited. This by no means indicates a dearth of such use cases.

Instead, Pharma, with its wealth of data, especially in the United States, is well positioned to apply Machine Learning to gain insight into problems that have been considered intractable. For instance, Real World Data, such as insurance claims records from vendors such as Truven, IMS, Flatiron, Pharmetrics and others, provides an unparalleled opportunity to identify rare disease patients, to analyse and make predictions using longitudinal patient data, to target physicians based on prescribing behaviour and patient characteristics, and much more. Yet, despite these opportunities, such analyses have been conducted by very few organisations in Pharma and have been an afterthought, if considered at all, within market research.

There are various reasons for this phenomenon. First, statistical analysis in Pharmaceutical organisations has been largely limited to biostatistics. Biostatistics answers questions such as whether a drug is effective in treating patients, whether environmental factors affect the spread of diseases, and whether there is an interaction between demographics and prevalence. It deals with odds ratios, propensity scores and survey bias, topics that are typical of the field and extremely familiar to its students. These certainly have a deserved and very critical place in Pharma. Even market research has been dominated by experts trained in these common areas of Pharma. It makes sense. A prospective candidate for a Pharma position, whether in commercial research or health outcomes, is expected to have skills and suitable knowledge in disciplines related to healthcare.

Machine Learning, by contrast, is a broader subject. Although it is related to ‘statistics’, it uses a fundamentally different approach. Hypothesis testing is replaced by Cross-Validation. P-values are replaced with F-Scores. Instead of Sensitivity and Specificity, ML experts debate Precision and Recall. Machine Learning, as the name suggests, is the practice of training machines, programmatically and with algorithms, to understand patterns and make predictions on new datasets for which the outcome is unknown. However, in a highly regulated market such as healthcare and pharmaceuticals, which carries a high degree of liability, and with no industry-wide consensus on the reliability of predictive results, the scope to apply Machine Learning has been limited. While other industries, most prominently social media and the so-called sharing economy, have progressed dramatically, in Pharma the topic has been an ‘afterthought’.
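To make the vocabulary above concrete, the following is a minimal sketch in Python that computes Precision, Recall, Specificity and the F1 score from a set of predictions. The labels are invented purely for illustration (think of 1 as "patient has the condition"), and the function names are hypothetical. Note that Recall is the same quantity that biostatisticians call Sensitivity, so the two vocabularies overlap more than they first appear to.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Return precision, recall (= sensitivity), specificity and F1."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical labels: 4 true positives in the data, 6 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In practice these figures would be computed on held-out data, for example within each fold of a cross-validation, rather than on the data used to train the model.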

This aversion to Machine Learning, coupled with hiring strategies at most Pharmaceutical firms that have not prioritised machine learning or data science degrees, has resulted in a scarcity of talent and, consequently, of successful applications of ML in healthcare.

There is a certain degree of charm in aiming for lofty goals: the self-driving car, the humanoid robot and the augmented reality physician are terms that evoke Asimovian ambitions. But the road to realising such goals is uncharted territory, made no less chaotic by the relentless publications on data science in popular blogs and magazines that leave the reader with aspirations, but not the means to achieve them.

Machine Learning projects that generally succeed at the start of a new Data Science initiative usually have four distinct characteristics: 

  1. Use Cases: The use cases are intuitive and can be easily explained
  2. Measurable Outcomes: They have a concrete and quantitative, i.e., measurable outcome
  3. Fast Results: Their results can be obtained in a relatively short period of time
  4. Data: Relevant datasets are available that can be used to create machine learning models

Uniquely Equipped to Yield Quick Results

At RxDataScience, our team members have deployed numerous machine learning projects that have not only been successful, but have found a permanent place in the day-to-day operations of organisations. We achieved this by not restricting ourselves to lofty ambitions, but by focusing equally on the less ambitious, but nonetheless impactful, use cases in Machine Learning. The latter had the effect of fostering management confidence and buy-in for the other, more ambitious, but longer-term projects. In other words, short-term machine learning projects with clearly demonstrable benefits help build the foundation for long-term machine learning projects.

The benefits of most machine learning projects are generally realised over relatively long periods of time, oftentimes more than a year. Further, the outcome and the true business benefits of such projects are often not well articulated and, as such, are difficult to quantify precisely. Whether a target list generated using a machine learning approach is truly superior to one created using the standard approach is subject to interpretation and competing opinions.

Lastly, the problem statement of machine learning projects can often appear cryptic. A project to “differentiate Pre-ictal state from Intra-ictal, Ictal and Post-ictal states” given a dataset of EEG recordings for epilepsy detection would appear perfectly normal to a practitioner in the domain, but incomprehensible to anyone outside it. As a result, even if the project is a sound engagement, it lacks the simplicity that would make it accessible to an average user.

The cumulative effect of these three factors is that machine learning projects in Pharma seldom achieve the degree of success they may have been otherwise capable of.

Use Cases

A Few Examples to Get You Started

The list below highlights some of the potential use cases of machine learning in Pharma. The examples shown are real-world use cases that have been conducted at RxDataScience and elsewhere.

  1. Rare Disease Patient Finder: Finding patients with rare diseases can be an extremely challenging ordeal. Pharmaceutical data vendors routinely sell databases that contain insurance claims data covering up to 200 million patient lives. IMS APLD, Truven MarketScan and other databases are often well suited for such endeavours. The wealth of information available in such databases makes them very valuable for finding patients with rare diseases. Using a combination of inpatient, outpatient and lab records of patients with a certain rare disease, machine learning practitioners build models that are able to identify other patients who exhibit similar patterns. The process is quite intensive and not trivial. Nevertheless, the fact that one can find such patients using a computational or algorithmic approach instead of more conventional outreach is commendable, beneficial for patients and consequently holds significant commercial value for pharmaceutical companies;
  2. Treatment Identification: Using clustering and scoring models to assess which treatments and/or drugs should be recommended for patients based on historical outcomes and success rates of treatment pathways;
  3. Physician Identification: Databases such as IMS Xponent provide extensive information on the prescribing history of physicians. The information can be further supplemented with auxiliary databases such as NPI, AMA and others to create a holistic view of the physician, the characteristics of their patients and other related information. The data can then be used to build predictive models for determining how physicians will respond to promotional materials, what medications they may be willing to try and so on, using data from physicians with similar characteristics. In this regard, the use of unsupervised learning techniques such as k-means to create clusters of physicians with similar characteristics is often very effective;
  4. Matching physicians & patients across datasets: Using data disambiguation techniques to identify and correlate disparate sources of information to create richer physician and patient profiles;
  5. Clustering Physicians and Clinics: Clustering techniques are used to quantitatively identify clinics and physicians that are most similar to one another to identify potential outliers and new sales targets; and
  6. Sales Field Messaging: Pharma sales reps frequently document their discussions with physicians during sales calls. These conversations or notes are generally stored in CRM systems. Using Natural Language Processing (NLP), it is possible to analyse these conversations and, in conjunction with the physician’s sales records, create machine learning models to build more effective sales field messaging.
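As an illustration of the k-means approach mentioned under Physician Identification, here is a minimal sketch in Python. The physician names and the two features (a scaled prescription volume and a brand-share ratio) are entirely hypothetical, and the initialisation simply takes the first k points, which is a simplification. A real project would more likely rely on an established library implementation with proper feature scaling and initialisation.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm.

    Simplification: the initial centroids are the first k points.
    Returns (assignments, centroids).
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical physicians: (scaled monthly prescription volume, brand share).
physicians = {
    "dr_a": (0.90, 0.80),
    "dr_b": (0.10, 0.20),
    "dr_c": (0.85, 0.75),
    "dr_d": (0.15, 0.10),
    "dr_e": (0.95, 0.90),
    "dr_f": (0.20, 0.15),
}

labels, centroids = kmeans(list(physicians.values()), k=2)
for name, label in zip(physicians, labels):
    print(f"{name} -> cluster {label}")
```

With this toy data the high-volume, high-brand-share physicians fall into one cluster and the remainder into the other. Physicians within a cluster can then serve as the "similar physicians" whose observed behaviour informs predictions for one another.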

| RXDS Analytics Platform | Competing products (Teradata, Spark, etc.) |
| --- | --- |
| Cost as low as $50,000 for commercial claims analytics | On average, production deployments of Teradata, Spark and other solutions cost at least $250,000 |
| Implements multiple aspects of NoSQL: in-memory, columnar, key-value, sharding, etc. | Usually implement only 2-3 aspects of NoSQL |
| True in-memory system: tables are stored physically in memory | In-memory databases are sometimes not ‘true’ in-memory in that the memory is only used for caching results |
| Implements map-reduce with no coding requirements. Queries are automatically parallelised | Most NoSQL systems do not implement ‘map-reduce’ as such. Further, parallel operations can be complex and time-intensive, and require a fair degree of coding effort |
| Available as an on-demand model | Not always available as on-demand |
| Production 64-bit version is available at no charge for permitted use cases. This free version can be used efficiently for longitudinal patient analysis | The freely available versions of commercial systems are generally extremely limited. For example, it would be impractical to perform longitudinal patient analysis using the free versions of most of the offerings |
| | Not always available as a cloud-ready tool in the AWS marketplace |
| Transparent hardware architecture | Hardware architecture can be complicated if deployed on-premises |
| Resource needs are minimal | Demanding resource needs: requires dedicated DBAs in addition to analytics professionals |
| Low TCO (Total Cost of Ownership): Due to the low cost of initial deployment, reduced resource needs and 100% support provisions from RxDataScience, the overall TCO is extremely low | High TCO: By the time an organisation deploys a fully functional system, the total cost of ownership, which includes the cost of hardware resources and other factors, can be 2x to 3x the originally allotted budget |
| Dedicated support: Over 2,000 consultants have been trained on the RXDS Analytics Platform and are dedicated to providing commercial value-added services | Support contracts can run into tens of thousands of dollars per year |
| Easy to change business logic: hours or days | Changing business logic usually takes weeks as developers have to modify complex SQL code |
| Minimal latency: The entire database binary can fit in the CPU cache | High latency overhead due to multiple abstractions |
| Minimum tuning: As the RXDS Platform comes pre-tuned for AWS, minimal client-side tuning is necessary | Some NoSQL solutions can require extensive tuning to get optimal performance benefits |
| Cloud support: Supported on AWS, Azure, Google Cloud and even IBM Cloud | Cloud support varies by vendor |
| Maturity: Has been battle-tested on Wall Street, where it has been used for over 20 years | Most NoSQL solutions are at most 4-5 years old. The underlying database of the RXDS Analytics Platform, by contrast, has been in use for over 20 years |