Earlier this year, Apple hosted the Workshop on Machine Learning for Health. This two-day hybrid event brought together Apple, the academic research community, and clinicians to discuss state-of-the-art machine learning (ML) research in health.
In this post we share highlights from these discussions and recordings of select workshop talks.
Translating ML Research to Clinical Practice
A major issue with translating research to clinical practice is the long feedback cycle. Identifying the problem, gathering data, implementing a solution, and safely deploying it in the clinic can be daunting and time-consuming.
Workshop attendee and New York University Langone assistant professor Dr. Yindalon (Yin) Aphinyanaphongs described his experience accelerating this cycle as agile data science. The aim is to identify and mitigate bottlenecks to quickly process relevant data, develop models, and reintegrate predictions into clinical systems. Such efforts are already enabling the study and incorporation of ML systems ranging from administrative to clinical care, and using methods ranging from simple statistics to foundation models trained on health record data, as referenced in Dr. Aphinyanaphongs’s papers Health System-Scale Language Models Are All-Purpose Prediction Engines and A Validated, Real-Time Prediction Model for Favorable Outcomes in Hospitalized COVID-19 Patients.
A common theme at the workshop was that traditional model-comparison metrics, like the area under the receiver operating characteristic curve, are useful academically but do not necessarily indicate success in the field. The real arbiter of success is the benefit to end users: patients, care providers, and administration. A high area under the curve does not always translate into real health benefits. This difficulty was discussed by a number of speakers, but particularly highlighted by Dr. Ziad Obermeyer, workshop attendee, associate professor at University of California, Berkeley, and coauthor of Solving Medicine’s Data Bottleneck: Nightingale Open Science. Dr. Obermeyer discussed an application of ML that predicts sudden cardiac death. He touched on difficulties throughout the study: confirming outcomes and causes with death certificates, comparing predictors from electronic health records to those from waveforms, and identifying the performance gap when generalizing to new healthcare systems. These issues highlight the significant benefit of maintaining easy-to-use and accessible health data for developing algorithms and assessing performance.
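To make the metric under discussion concrete, here is a minimal sketch of the area under the receiver operating characteristic curve, computed as the probability that a randomly chosen positive case is scored above a randomly chosen negative case. The function name `auroc` and the toy labels and scores are illustrative assumptions, not data from any study mentioned here.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from any study): 1 = event, 0 = no event.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

value = auroc(labels, scores)

# Any order-preserving rescaling of the scores leaves the metric unchanged.
rescaled_value = auroc(labels, [s ** 3 for s in scores])
```

Because the metric depends only on how cases are ranked, it is unchanged by rescaling the scores and says nothing about which decision threshold is used or what happens to patients after a prediction is made. That is one reason it can look strong on paper without implying a benefit in the clinic.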
Fairness and Robustness in Data Collection and Model Training
Fairness and robustness are critical in ML for health, from problem selection to data collection to model training and deployment.
Many datasets used in training and developing models are collected in only one country or a small number of countries, predominantly from high-income countries and populations. Training on homogeneous datasets can result in models that do not generalize well across diverse countries and demographic factors. A number of presenters addressed this topic, including EPFL and IDIAP Professor Daniel Gatica-Perez and Dr. Leo Anthony Celi, senior research scientist at the Massachusetts Institute of Technology. Dr. Celi described his efforts to increase participation in model development and data sharing with global partners. Professor Gatica-Perez worked with partners across Europe, Asia, and Latin America to collect a multicountry mobile-sensing dataset with university students.
ML models trained on datasets that do not capture diverse populations and signals learn biases that may not be apparent to downstream users. Dr. Celi presented an example using a large language model (LLM) for treatment recommendations, showing that the probability of the model recommending a CT scan was biased by race, according to the work Coding Inequity: Assessing GPT-4’s Potential for Perpetuating Racial and Gender Biases in Healthcare. Professor Gatica-Perez showed that models trained to infer mood on data from one country did not generalize well to other countries, and that partly personalized models trained on larger, multicountry datasets did not always perform as well as partly personalized models trained on smaller country-specific data, as seen in the work Generalization and Personalization of Mobile Sensing-Based Mood Inference Models: An Analysis of College Students in Eight Countries. Highlighting the need for diversity in data collection to reduce gaps in model performance across countries and cultures, he also discussed how some models may benefit from country-level generalization before individual-level personalization.
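One simple way to surface the kind of performance gap described above is to report a metric separately for each country or demographic group instead of a single pooled number. The sketch below is an illustration only: the helper `accuracy_by_group`, the group names, and the records are hypothetical and are not taken from the cited studies.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group from (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical predictions from one model evaluated in two countries.
records = [
    ("country_a", 1, 1), ("country_a", 0, 0), ("country_a", 1, 1), ("country_a", 0, 1),
    ("country_b", 1, 0), ("country_b", 0, 0), ("country_b", 1, 0), ("country_b", 0, 1),
]

per_group = accuracy_by_group(records)

# The spread between the best and worst group is hidden by the pooled accuracy.
gap = max(per_group.values()) - min(per_group.values())
pooled = sum(int(t == p) for _, t, p in records) / len(records)
```

In this toy example the pooled accuracy sits halfway between the two groups, so reporting it alone would conceal that the model performs much worse in one country than in the other.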
Workshop participants also discussed the need for diverse perspectives when designing systems and algorithms. Professor Gatica-Perez said that he worked with communities that emphasize community-based health and share information and tools within the community. Dr. Shrikanth (Shri) Narayanan, University of Southern California professor, had similar observations in his work on healthy aging in India, where he has observed a need for intergenerational design aspects for health tools.
Modeling strategies can improve model fairness and robustness to distribution shifts between training and deployment. Workshop attendee and Apple ML researcher Dr. Arno Blaas presented a method for improving model robustness to distribution shifts caused by variables that causally influence both model input signals and outcomes. In Considerations for Distribution Shift Robustness in Health, Dr. Blaas and collaborators showed, on both synthetic and real data, that including the causal relationship between a model’s outcomes and covariates can enhance model robustness.
Dr. Irene Chen, assistant professor at University of California, Berkeley and San Francisco, presented methods for modeling access to care in disease phenotyping by including access to care as a latent variable in a deep generative model that could handle multimodal data and intermittent sampling. When applied to electrocardiogram data for heart failure from the Beth Israel Deaconess Medical Center, the algorithm recreated known clinical findings and identified a potential new subtype for heart failure, as seen in the paper Clustering Interval-Censored Time-Series for Disease Phenotyping.
Safety and Quality Goals for ML in Health
Learning how goals differ across individuals requires working with a large volume of data. In her talk “Challenges in Menstrual and Reproductive Health,” workshop attendee and Apple obstetrician-gynecologist Dr. Chris Curry provided more background about these challenges. Menstrual health is a system that involves coordination of the central nervous system, ovaries, uterus, and hormones, in combination with direct influences from external factors (such as sleep and stress) and internal factors (such as diseases). Menstrual health manifests in a large set of nonspecific symptoms. Perturbation in menstruation can be a sign of disease, but given the lack of a single definition of a so-called normal menstruation cycle at the population level, distinguishing the normal from the truly abnormal is difficult. Individual differences also affect menstrual health, and success in tracking and predicting elements of menstrual cycles differs from individual to individual, and sometimes over time for the same individual.
Dr. Curry noted that the value an individual places on the accuracy of the fertile window may differ depending on their intent around pregnancy, and the value they place on precision in period predictions may depend on their access to menstrual hygiene products. One approach to address individual differences is building an ML system that can learn and adapt to individual patterns and objectives. This typically relies on large-volume, longitudinal data. Dr. Curry introduced the Apple Women’s Health Study (AWHS), which is designed to collect data from a prospective longitudinal digital cohort on the relationship among menstrual cycles, health, and behavior.
Research methodologies and decision criteria are critical to the safety and overall quality of ML applications in health. Machine intelligence techniques can help create new tools for assessing and detecting health conditions. Workshop attendee Dr. Shrikanth (Shri) Narayanan, professor at the University of Southern California, discussed how his team applied machine intelligence techniques to analyze variations in speech and language development in children with autism spectrum disorder (ASD). See the paper presented at Interspeech 2023, Understanding Spoken Language Development of Children with ASD Using Pre-trained Speech Embeddings. Dr. Narayanan explained why traditional assessment methodologies, such as caregiver reports, are inadequate for the requisite behavioral phenotyping. He described how automated assessment of natural language samples can complement clinically meaningful benchmarks for ascertaining spoken-language capabilities in children with ASD, at scale.
Privacy and ML for Health
The values of privacy and utility can sometimes conflict in ML for health. Workshop attendee and professor at Vanderbilt University Brad Malin summarized the tradeoff between privacy and utility, saying that the more detail provided in the data, the greater the chance that the individuals to whom the data corresponds could have their privacy intruded upon. However, Professor Malin emphasized that re-identification can often be harder than it is portrayed, as discussed in his paper Re-identification of Individuals in Genomic Datasets Using Public Face Images. Professor Malin also discussed risk mitigation strategies that can be employed to share data while preserving privacy. For instance, tiered access to datasets can mitigate risk by employing different levels of protection to different data elements, depending on the data sensitivity, as discussed in his paper Managing Re-identification Risks While Providing Access to the All of Us Research Program.
Professor Nita Farahany, workshop attendee and Duke University professor, discussed the impact of new neural-sensing technology on individual privacy at the workshop and through her book, The Battle for Your Brain: Defending the Right to Think Freely in the Age of Neurotechnology. Professor Farahany detailed a long list of existing applications of brain-sensing technology that impact the self-determination, mental privacy, and freedom of thought of users, all important considerations as innovative technology is developed and deployed. Her talk crescendoed to a call for an explicit fundamental right, the right to cognitive liberty, as a guiding principle for research and a core value in commercial applications of new technology. The inference of mental states is an active area of research in ML and health, with the potential to positively impact people’s lives, and Professor Farahany’s talk highlighted the need to keep users’ considerations front and center throughout the process.
Applications of ML in Cardiology
Cardiology is one of the largest areas for applications of ML in health. As of October 2022, it ranked second only to radiology among medical specialties in the number of AI algorithms cleared by the U.S. Food and Drug Administration. ML is well suited to finding patterns in high-dimensional data used for diagnostics, such as medical imaging and electrocardiography, and such information is abundant in cardiology. Workshop presenters spoke about many ML applications and diverse use cases.
Randomized controlled trial validates that ML improves the efficiency of sonographers. Dr. David Ouyang, workshop attendee and assistant professor at Cedars-Sinai Medical Center, discussed a blinded prospective randomized trial evaluating the impact of ML in cardiology, specifically in the interpretation of echocardiography, according to the Safety and Efficacy Study of AI LVEF (EchoNet-RCT). The trial compared ML-guided assessments of left ventricular ejection fraction (LVEF) with assessments made by sonographers. The results showed that ML was noninferior to the sonographer assessment, and that this ML-guided workflow saved time for both sonographers and cardiologists.
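For readers unfamiliar with the term, a noninferiority result means that the upper confidence bound on how much worse the new method could be stays below a margin chosen in advance. The sketch below illustrates the idea for a difference in event rates using a simple Wald interval. The function name `noninferiority_check`, the counts, and the margin are hypothetical values chosen for illustration; they are not the EchoNet-RCT data, and the trial's actual endpoint and statistical analysis may differ.

```python
import math


def noninferiority_check(events_new, n_new, events_ref, n_ref, margin, z=1.96):
    """Compare event rates (lower is better) of a new method against a reference.

    Returns (difference, upper_bound, is_noninferior), where upper_bound is the
    upper end of a two-sided 95% Wald interval for rate_new - rate_ref (z=1.96).
    """
    rate_new = events_new / n_new
    rate_ref = events_ref / n_ref
    diff = rate_new - rate_ref
    std_err = math.sqrt(
        rate_new * (1 - rate_new) / n_new + rate_ref * (1 - rate_ref) / n_ref
    )
    upper = diff + z * std_err
    # Noninferior if even the pessimistic bound is within the allowed margin.
    return diff, upper, upper < margin


# Hypothetical counts: an "event" is an initial assessment that a reviewer
# later changes substantially. The margin allows at most 5 percentage points.
diff, upper, is_noninferior = noninferiority_check(
    events_new=200, n_new=1000, events_ref=250, n_ref=1000, margin=0.05
)
```

The margin has to be fixed before looking at the data. A conclusion of noninferiority says the new method is not unacceptably worse, which is a weaker claim than showing it is better.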
ML for large-scale screening of left ventricular dysfunction using wearables. Dr. Zachi Attia, workshop attendee and codirector of artificial intelligence in cardiology at Mayo Clinic, presented a talk titled Prospective Evaluation of Smartwatch-Enabled Detection of Left Ventricular Dysfunction, based on a 2022 Nature Medicine paper of the same title. The study enrolled 2,454 patients who sent 125,610 electrocardiograms (ECGs) from their smartwatches to a secure data platform. The ML algorithm demonstrated high diagnostic utility, detecting patients with low ejection fraction (EF) with an area under the curve (AUC) of 0.885. The study showcased the potential of ML applied to consumer watch ECGs in nonclinical settings, enabling effective identification of left ventricular dysfunction in a geographically dispersed population. The findings highlight the opportunity for remote care and for large-scale screening and monitoring of life-threatening cardiac conditions.
Physiology-inspired ML for cardiovascular monitoring. Workshop attendee Ramakrishna Mukkamala, professor at the University of Pittsburgh, spoke on the use of physiology-inspired ML for cardiovascular monitoring. Professor Mukkamala shared that the Cardiovascular Health Tech Lab at the University of Pittsburgh collaborates with clinicians to collect large-scale, high-fidelity patient data and develop ML tools for accurate cardiovascular monitoring. Projects discussed included converting smartphones into cuffless blood pressure sensors, using physiology-based features of arterial waveforms for aortic aneurysm screening, and transforming standard cuff devices into multiparameter hemodynamic monitors. The research aims to improve hypertension awareness and control, diagnose aortic aneurysms, and guide therapy to improve patient outcomes. Patient studies are ongoing to train and test ML models for these applications.
Workshop Resources
Related Videos
Challenges in Menstrual and Reproductive Health by Dr. Chris Curry (Apple)
Modeling Access to Healthcare in Disease Phenotyping by Dr. Irene Chen (University of California, Berkeley)
Modeling Heart Rate Response to Exercise with Wearable Data by Andy Miller (Apple)
Pre-trained Model Representations and Their Robustness Against Noise for Speech Emotion Analysis by Vikram Mitra (Apple)
Prospective Evaluation of Smartwatch-Enabled Detection of Left Ventricular Dysfunction by Dr. Zachi Attia (Mayo Clinic)
Towards Increasing Diversity in Mobile Sensing Research by Professor Daniel Gatica-Perez (IDIAP-EPFL)
Web3 and Decentralized AI by Ramesh Raskar (MIT)
Related Work
Apple Women’s Health Study by Harvard T. H. Chan School of Public Health
Blinded, Randomized Trial of Sonographer Versus AI Cardiac Function Assessment by Bryan He, Alan C. Kwan, Jae Hyung Cho, Neal Yuan, Charles Pollick, Takahiro Shiota, Joseph Ebinger, et al.
Considerations for Distribution Shift Robustness in Health by Arno Blaas, Andrew C. Miller, Luca Zappella, Jörn-Henrik Jacobsen, and Christina Heinze-Deml
Safety and Efficacy Study of AI LVEF (EchoNet-RCT), sponsored by Cedars-Sinai Medical Center
Acknowledgments
Many people contributed to this workshop, including Matt Bianchi, Arno Blaas, Lauren Cheung, Chris Curry, Greg Darnell, Joe Futoma, Agni Kumar, Andy Miller, Vikram Mitra, Jaya Narain, Steve Waydo, and Shunan Zhang.