29/06/2023

Advancing Drug Safety: How the SafePolyMed Project Is Using Machine Learning to Predict Adverse Drug Reactions

by Alexandros Kanterakis and Kleanthi Voutsadaki

It is common knowledge in healthcare that there is no such thing as a completely harmless drug. It is equally well known that drugs are painstakingly checked for safety before they can be prescribed to individuals. Through these checks, any undesired health effect of the drug is studied, assessed, and, most importantly, documented. This information appears in the leaflet included in the drug's packaging, is available online, and forms part of the standard education that all medical professionals and pharmacists receive.

However, approximately 8% of all primary care visits are attributed to harm from a prescribed drug. [1] In the US alone, 142,000 deaths were attributed to this cause in 2013. [2] These unexpected occurrences can cost up to $30.1 billion annually in the US. [3] So how does this happen? Besides the well-known and well-studied side effects that physicians are familiar with, a drug might lead to the development of an Adverse Drug Reaction (ADR). ADRs are unexpected negative health effects that arise from the administration of a drug. ADRs have their etiology in the very complex biological mechanisms involved in drug metabolism. These mechanisms, combined with our genetic makeup, demographic profile, lifestyle, and medical history, result in a unique drug-reaction profile that even the most experienced physicians cannot predict. As a result, one of the most active research areas in medical informatics focuses on developing models that can assess the risk of ADRs prior to drug administration. Research consortia like SafePolyMed are gathering data from large biobanks, with sample counts in the order of millions. This data includes DNA sequences, medication history, prescription history, demographic profile and, at least in the case of an Estonian biobank, clinicians’ handwritten notes.

With this wealth of information, a sufficiently powerful computer can detect patterns in the data that indicate specific types of prescription events that are more prone to ADRs. But how does this work? The answer is Machine Learning (ML): a type of computer algorithm designed to locate interesting patterns in vast amounts of seemingly chaotic data.

ML has been in the media spotlight recently, mainly due to the extraordinary abilities of artificial intelligence chatbots like ChatGPT. However, it’s important to note that ML is a research area with a history that goes back to the origins of computer science itself. ML analysis methods seek to identify differences between data that belong to two (e.g., disease/healthy) or more categories. Differences in data can be quantified through many approaches. For example, we can treat each individual described by 3 attributes (e.g., cholesterol, age, sex) as a point in a 3-dimensional space. Each point has a colour depending on the individual's disease status (e.g., red for disease and blue for healthy). We can easily imagine a set of patients as a set of coloured 3D points right in front of us. Next, we can imagine a 2D surface, akin to a tablecloth, moving freely. What would be the optimal position of this tablecloth so that most red points are on one side of it and most blue points on the other? Assuming we have found this position, we can now make a decision whenever an individual with an unknown status (disease or healthy?) appears: we simply check on which side of the tablecloth the new point falls. While this is surely a simplification, it captures the core idea behind many ML classification algorithms.
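The tablecloth picture can be sketched in a few lines of code. The example below uses the classic perceptron rule, chosen here only because it is one of the simplest ways to position a flat separator; it is not a description of the methods used in SafePolyMed. The patient values, the function names and the +1/-1 label encoding are all invented for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (cholesterol mg/dL, age in years, sex as 0/1)

# Hypothetical toy data, invented for illustration only.
# Label +1 = disease ("red" point), -1 = healthy ("blue" point).
PATIENTS: List[Tuple[Point, int]] = [
    ((260.0, 64.0, 1.0), 1),
    ((245.0, 58.0, 0.0), 1),
    ((280.0, 71.0, 1.0), 1),
    ((230.0, 66.0, 0.0), 1),
    ((180.0, 34.0, 0.0), -1),
    ((170.0, 29.0, 1.0), -1),
    ((195.0, 41.0, 0.0), -1),
    ((160.0, 25.0, 1.0), -1),
]


def scale(point: Sequence[float], lows: Sequence[float], highs: Sequence[float]) -> List[float]:
    """Rescale each attribute to the 0..1 range seen in the training data."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(point, lows, highs)]


def train_perceptron(samples, max_epochs: int = 1000):
    """Move the 'tablecloth' (a plane: weights . x + bias = 0) until it
    separates the two colours, or until max_epochs is reached."""
    columns = list(zip(*(p for p, _ in samples)))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for point, label in samples:
            x = scale(point, lows, highs)
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if label * score <= 0:  # point is on the wrong side: nudge the plane
                weights = [w + label * v for w, v in zip(weights, x)]
                bias += label
                mistakes += 1
        if mistakes == 0:  # every training point is on its correct side
            break
    return weights, bias, lows, highs


def predict(point: Point, model) -> int:
    """Return +1 (disease side) or -1 (healthy side) for a new individual."""
    weights, bias, lows, highs = model
    x = scale(point, lows, highs)
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score > 0 else -1


model = train_perceptron(PATIENTS)
new_patient = (270.0, 67.5, 1.0)  # hypothetical individual with unknown status
print(predict(new_patient, model))
```

Real patient data are rarely this cleanly separable, which is why practical methods settle for a separator that puts most, rather than all, points on the correct side.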

Today, we have numerous methods for quantifying the distances between data, borrowing concepts from diverse disciplines such as geometry (as illustrated in our example), information theory, probability theory, and statistics. Similarly, we have methods for locating optimal data separators (like the aforementioned freely moving tablecloth) that borrow concepts from areas such as linear algebra and calculus, although some of the most efficient methods rely on simple heuristics and intuition. Finally, a considerable part of ML engineering is dedicated to addressing practical issues that arise when dealing with data that are far from ideal. These issues include missing examples, noisy data, underrepresentation of a specific class of samples (e.g., disease samples are far fewer than healthy ones) and complex data types (e.g., cardiograms, CT scans).
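As one small illustration of the underrepresentation problem, a common heuristic is to give each class a weight inversely proportional to its frequency, so that mistakes on the rare class count for more during training. The sketch below assumes the widely used "balanced" formula, n_samples / (n_classes * class_count); the function name and the 90/10 split are invented for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical cohort: 90 healthy individuals and 10 with the disease.
labels = ["healthy"] * 90 + ["disease"] * 10
weights = balanced_class_weights(labels)
print(weights)  # the rare "disease" class receives the larger weight
```

With these weights, each disease sample counts nine times as much as a healthy one, which compensates for the nine-to-one imbalance in the cohort.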

As an internet user, you have been repeatedly exposed to data generated, selected or refined by ML algorithms. Every time a streaming platform suggests a film, a social media platform recommends a friend, or a search engine returns a personalised result, an ML algorithm has been trained beforehand and executed behind the scenes in response to your request. ML methods are limited mainly by two factors: the availability of data and the computing resources at hand. Recent advances, such as ChatGPT, show that when both factors are in place, ML can be put to tremendous practical use, whether the task is prediction, classification, or generation. Projects like SafePolyMed strive to bring these two key factors together, combining detailed prescription data for over one million patients with state-of-the-art High-Performance Computing environments from leading research institutes in Europe. The ultimate aim is to provide physicians and pharmacists with timely and secure warnings about the possible occurrence of an ADR before medication is prescribed, and to save lives by doing so.

References