The combination of the environment an individual experiences and their genetic predispositions determines most of their risk for various diseases. Large national efforts, such as the UK Biobank, have created vast public resources to better understand the links between environment, genetics and disease. This has the potential to help people better understand how to stay healthy, doctors to treat diseases, and scientists to develop new medicines.
One challenge in this process is how to make sense of the vast number of clinical measurements: the UK Biobank has many petabytes of imaging, metabolic tests and medical records covering 500,000 people. To best use this data, we must be able to represent the information present as short, informative labels about significant diseases and traits, a process called phenotyping. That’s where we can use the ability of ML models to detect subtle and intricate patterns in large amounts of data.
We previously demonstrated the ability to use ML models to quickly phenotype at scale for retinal diseases. However, these models were trained using labels based on clinician judgment, and access to clinical-grade labels is a limiting factor due to the time and cost required to create them.
In “Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models”, published in Nature Genetics, we are pleased to highlight a method for training accurate ML models for genetic discovery of diseases, even when using noisy and unreliable labels. We demonstrate the ability to train ML models that can phenotype directly from raw clinical measurements and unreliable medical record information. This reduced reliance on medical domain experts for labeling greatly expands the range of applications of our technique to a panoply of diseases and has the potential to improve their prevention, diagnosis and treatment. We showcase this method with ML models that can better characterize lung function and chronic obstructive pulmonary disease (COPD). Furthermore, we show the utility of these models by demonstrating a better ability to identify genetic variants associated with COPD, an improved understanding of the biology behind the disease, and successful prediction of COPD-associated outcomes.
ML for a deeper understanding of the exhale
For this demonstration, we focus on COPD, the third leading cause of death worldwide in 2019, in which airway inflammation and airflow obstruction can progressively reduce lung function. Lung function for COPD and other diseases is measured by recording the volume an individual exhales over time (the recording is called a spirogram; see an example below). Although there are guidelines (called GOLD) for determining COPD status from exhalation, they use only a few specific data points on the curve and apply fixed thresholds to those values. Much of the rich data in these spirograms is discarded in this analysis of lung function.
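To make the point-based approach concrete, here is a minimal Python sketch of the kind of measure described above. It reads FEV1 (the volume exhaled in the first second) and FVC (the total volume exhaled) off a volume-time curve and compares their ratio to a fixed cutoff. The exponential curves, the helper names and the 0.70 cutoff are assumptions chosen for illustration; this is not a full implementation of the GOLD guidelines and none of the numbers come from the paper.

```python
import math

def volume_at(times, volumes, t):
    """Linearly interpolate exhaled volume (litres) at time t (seconds)."""
    if t <= times[0]:
        return volumes[0]
    for i in range(1, len(times)):
        if t <= times[i]:
            t0, t1 = times[i - 1], times[i]
            v0, v1 = volumes[i - 1], volumes[i]
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return volumes[-1]

def fev1_fvc_ratio(times, volumes):
    """Return (FEV1, FVC, FEV1/FVC) from a volume-time spirogram."""
    fev1 = volume_at(times, volumes, 1.0)  # volume exhaled in the first second
    fvc = max(volumes)                     # total volume exhaled
    return fev1, fvc, fev1 / fvc

def make_curve(fvc, rate, duration=6.0, step=0.05):
    """Synthetic curve V(t) = fvc * (1 - exp(-rate * t)); purely illustrative."""
    n = int(round(duration / step)) + 1
    times = [i * step for i in range(n)]
    volumes = [fvc * (1.0 - math.exp(-rate * t)) for t in times]
    return times, volumes

FIXED_RATIO_CUTOFF = 0.70  # assumed fixed-ratio cutoff, for illustration only

healthy = fev1_fvc_ratio(*make_curve(fvc=4.5, rate=2.0))
obstructed = fev1_fvc_ratio(*make_curve(fvc=3.5, rate=0.8))
for name, (fev1, fvc, ratio) in [("healthy", healthy), ("obstructed", obstructed)]:
    below = ratio < FIXED_RATIO_CUTOFF
    print(f"{name}: FEV1={fev1:.2f} L, FVC={fvc:.2f} L, ratio={ratio:.2f}, below cutoff={below}")
```

Only two numbers per curve survive this summary, which is the information loss the paragraph above describes: everything about the shape of the curve between and beyond those points is ignored.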
We reasoned that ML models trained to classify spirograms could use the rich data present more fully and yield more accurate and comprehensive measures of lung function and disease, similar to what we have seen in other classification tasks such as mammography or histology. We trained ML models to predict whether a person has COPD using the full spirograms as inputs.
The common method of training models for this problem, supervised learning, requires samples to be associated with labels. Determining those labels can require the effort of very time-constrained experts. For this work, to show that we do not necessarily need medically graded labels, we decided to use a variety of widely available sources of medical record information to create those labels without medical expert review. These labels are less reliable and noisy for two reasons. First, there are gaps in people’s medical records because they use multiple health services. Second, COPD often goes undiagnosed, which means that many people with the disease will not be labeled as having it, even if we compile complete medical records. Nevertheless, we trained a model to predict these noisy labels from the spirogram curves and treated the model predictions as a quantitative COPD liability or risk score.
Noisy COPD status labels are derived from various medical record sources (clinical data). A COPD liability model is then trained to predict COPD status from raw flow-volume spirograms.
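The label construction can be sketched in a few lines of Python. The example below marks a person as a COPD case if any available record source reports the disease. The three source names, the "any source" rule and the toy cohort are hypothetical stand-ins, not the exact sources or logic used in the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecordEvidence:
    """COPD evidence for one person from several record sources (hypothetical).

    None means the source holds no data for this person, which is one of the
    reasons the resulting labels are noisy.
    """
    self_report: Optional[bool]
    hospital_record: Optional[bool]
    primary_care_record: Optional[bool]

def noisy_copd_label(evidence: RecordEvidence) -> int:
    """Return 1 if any available source reports COPD, otherwise 0."""
    sources = (
        evidence.self_report,
        evidence.hospital_record,
        evidence.primary_care_record,
    )
    return int(any(source is True for source in sources))

cohort = [
    RecordEvidence(True, None, False),    # a single source reports COPD
    RecordEvidence(False, False, False),  # every source is negative
    RecordEvidence(None, None, None),     # no records: labeled 0 even if COPD is present
    RecordEvidence(None, True, True),     # two sources agree
]
labels = [noisy_copd_label(person) for person in cohort]
print(labels)
```

The third person shows how both kinds of noise described above enter the labels: missing records and undiagnosed disease both end up as a 0. These 0/1 labels would then serve as training targets for a model that takes the spirogram as input.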
Predicting COPD outcomes
We then investigated whether the risk scores produced by our model could better predict a variety of binary COPD outcomes (for example, an individual’s COPD status, whether they were hospitalized for COPD or whether they died from it). For comparison, we benchmarked the model against the expert-defined measurements required to diagnose COPD, specifically FEV1/FVC, which combines specific points on the spirogram curve in a simple mathematical ratio. We observed an improvement in the ability to predict these outcomes, as seen in the precision-recall curves below.
Precision-recall curves for COPD status and outcomes for our ML model (green) compared to traditional measures. Confidence intervals are shown with lighter shading.
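For readers unfamiliar with precision-recall curves, the following Python sketch shows how such a comparison can be computed. It ranks individuals by a score, records precision and recall as the threshold moves down the ranking, and summarizes each curve by its average precision. The eight-person dataset is invented toy data, not results from the paper, and ties between scores are not handled specially. Because a lower FEV1/FVC ratio indicates worse obstruction, the ratio is negated so that a higher score always means higher risk.

```python
def precision_recall_points(labels, scores):
    """Return (precision, recall) after each individual, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(labels)
    true_pos = false_pos = 0
    points = []
    for i in order:
        if labels[i] == 1:
            true_pos += 1
        else:
            false_pos += 1
        points.append((true_pos / (true_pos + false_pos), true_pos / total_positives))
    return points

def average_precision(labels, scores):
    """Summarize the curve as the sum of precision times the gain in recall."""
    total, previous_recall = 0.0, 0.0
    for precision, recall in precision_recall_points(labels, scores):
        total += precision * (recall - previous_recall)
        previous_recall = recall
    return total

# Toy data (invented): 1 = outcome occurred, 0 = it did not.
outcome = [1, 1, 1, 0, 0, 0, 0, 0]
ml_risk = [0.9, 0.8, 0.5, 0.4, 0.6, 0.2, 0.1, 0.3]
fev1_fvc = [0.55, 0.76, 0.60, 0.68, 0.75, 0.80, 0.82, 0.78]

ml_ap = average_precision(outcome, ml_risk)
baseline_ap = average_precision(outcome, [-ratio for ratio in fev1_fvc])
print(f"ML risk score AP: {ml_ap:.3f}")
print(f"FEV1/FVC AP:      {baseline_ap:.3f}")
```

A curve that sits higher (and therefore has a larger average precision) identifies more true cases for the same number of false alarms.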
We also found that separating populations by their COPD model score predicted all-cause mortality. This graph suggests that people at higher risk of COPD are more likely to die earlier from any cause, and the risk likely has implications beyond COPD.
Survival analysis of a cohort of UK Biobank individuals stratified by their COPD model’s predicted risk quartile. The decline of each curve indicates individuals in the cohort dying over time. For example, p100 represents the 25% of the cohort with the highest predicted risk, while p50 represents the second quartile.
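A stratified survival analysis of this kind can be sketched with a hand-rolled Kaplan-Meier estimator. The example below assigns each individual to a risk quartile (using the p25 to p100 naming from the caption) and estimates a survival curve per quartile. The Kaplan-Meier estimator is an assumption about how such curves are typically produced, and the eight-person cohort is invented toy data rather than UK Biobank data.

```python
def quartile_labels(scores):
    """Assign p25, p50, p75 or p100 by rank; p100 is the highest-risk quartile."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    names = ["p25", "p50", "p75", "p100"]
    labels = [None] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = names[min(3, rank * 4 // len(scores))]
    return labels

def kaplan_meier(durations, died):
    """Return [(time, survival probability)] at each time where a death occurs.

    Individuals with died=False are treated as censored at their duration.
    """
    event_times = sorted({t for t, d in zip(durations, died) if d})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for u in durations if u >= t)
        deaths = sum(1 for u, d in zip(durations, died) if d and u == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Toy cohort (invented): risk score, years of follow-up, death observed.
risk_scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.80, 0.95]
followup_years = [12, 12, 12, 10, 12, 8, 6, 3]
died = [False, False, False, True, False, True, True, True]

labels = quartile_labels(risk_scores)
for name in ["p25", "p50", "p75", "p100"]:
    members = [i for i, label in enumerate(labels) if label == name]
    curve = kaplan_meier(
        [followup_years[i] for i in members],
        [died[i] for i in members],
    )
    print(name, curve)
```

An empty curve means no deaths were observed in that group, so its estimated survival stays at 1.0 throughout follow-up.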
Identifying genetic links to COPD
Since the goal of large-scale biobanks is to bring together large amounts of both phenotypic and genetic data, we also performed a test called a genome-wide association study (GWAS) to identify genetic links to COPD and genetic predisposition. A GWAS measures the strength of the statistical association between a given genetic variant (a change at a specific position in DNA) and an observation (e.g., COPD) across a cohort of cases and controls. Genetic associations discovered in this way can inform the development of drugs that modify the activity or products of a gene, as well as broaden our understanding of the biology of a disease.
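The core of a GWAS is a statistical test repeated at every variant. As a simplified illustration, the Python sketch below runs an allelic chi-square test for a single variant: it counts alternate and reference alleles in cases and controls and asks whether the allele is over-represented in cases. This is a stand-in chosen for brevity; real analyses typically use regression models with covariates, and the genotypes here are invented toy data. The 5e-8 threshold is the commonly used genome-wide significance convention, included as an assumption rather than taken from the paper.

```python
import math

def allele_counts(genotypes):
    """Count (alternate, reference) alleles from genotypes coded 0, 1 or 2."""
    alternate = sum(genotypes)
    return alternate, 2 * len(genotypes) - alternate

def allelic_association(case_genotypes, control_genotypes):
    """Chi-square test (1 degree of freedom) on the 2x2 allele count table.

    Returns (chi-square statistic, p-value, odds ratio). Assumes every cell
    of the table is non-zero.
    """
    a, b = allele_counts(case_genotypes)     # cases: alternate, reference
    c, d = allele_counts(control_genotypes)  # controls: alternate, reference
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio

GENOME_WIDE_THRESHOLD = 5e-8  # conventional cutoff; an assumption here

# Toy genotypes (invented): number of alternate alleles per person.
cases = [2, 1, 1, 2, 1, 0, 1, 2]
controls = [0, 1, 0, 0, 1, 0, 0, 1]

chi2, p_value, odds_ratio = allelic_association(cases, controls)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, odds ratio={odds_ratio:.2f}")
print("genome-wide significant:", p_value < GENOME_WIDE_THRESHOLD)
```

With only sixteen people the association cannot reach genome-wide significance, which illustrates why cohorts the size of the UK Biobank matter: the threshold is strict because the test is repeated across millions of variants.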
Using our ML phenotyping method, we not only rediscovered almost all known COPD variants found by manual phenotyping, but also found many novel genetic variants significantly associated with COPD. Furthermore, we see good agreement on effect sizes for the variants discovered by both our ML approach and the manual one (R² = 0.93), which provides strong evidence for the validity of the newly found variants.
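An agreement statistic of this kind can be computed as follows, assuming that R² here denotes the squared Pearson correlation between the per-variant effect sizes from the two approaches (the source does not spell out the definition). The effect sizes in this Python sketch are invented toy values, not results from the paper.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)

# Toy effect sizes (invented) for variants found by both approaches.
manual_effects = [0.10, -0.05, 0.08, 0.02, -0.07, 0.12]
ml_effects = [0.11, -0.04, 0.06, 0.03, -0.08, 0.10]

print(f"R^2 = {r_squared(manual_effects, ml_effects):.2f}")
```

A value close to 1 means the two phenotyping approaches estimate similar effects for the shared variants, which is the reasoning behind treating the agreement as support for the new findings.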
Finally, our collaborators from Harvard Medical School and Brigham and Women’s Hospital further examined the plausibility of these findings by providing insights into the possible biological role of the new variants in the development and progression of COPD (see further discussion of these insights in the paper).
Conclusion
We show that our previous methods for ML phenotyping can be extended to a wide range of diseases and can provide new and valuable insights. We made two key observations by using this approach to predict COPD from spirograms and discover new genetic insights. First, domain knowledge was not required to make predictions from raw medical data. Interestingly, we show that raw medical data is likely underutilized and that the ML model can find patterns in it that are not captured by expert-defined measurements. Second, we don’t need medically graded labels; instead, noisy labels defined from widely available medical records can be used to generate clinically predictive and genetically informative risk scores. We hope this work greatly expands the field’s ability to use noisy labels and improves our collective understanding of lung function and disease.
Acknowledgements
This work is the combined result of multiple collaborators and institutions. We are grateful to all contributors: Justin Cosentino, Babak Alipanahi, Zachary R. McCaw, Cory Y. McLean, Farhad Hormozdiari (Google), Davin Hill (Northeastern University), Tae-Hwi Schwantes-An and Dongbing Lai (Indiana University), Brian D. Hobbs and Michael H. Cho (Brigham and Women’s Hospital and Harvard Medical School). We also thank Ted Yun and Nick Furlotte for reviewing the manuscript, Greg Corrado and Shravya Shetty for their support, and Howard Yang, Kavita Kulkarni, and Tammi Huynh for helping with publication logistics.