Cameron Po-Hsuan Chen
Authored Publications
Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study
Terry Spitz
Malcolm Chelliah
Heather Cole-Lewis
Donald Martin
Tiam Jaroensri
Geoff Keeling
Stephanie Farquhar
Qinghan Xue
Jenna Lester
Cían Hughes
Patricia Strachan
Fraser Tan
Peggy Bui
Craig Mermel
Lily Peng
Sunny Virmani
Ivor Horn
The Lancet eClinicalMedicine (2024)
Background
Artificial intelligence (AI) has repeatedly been shown to encode historical inequities in healthcare. We aimed to develop a framework to quantitatively assess the performance equity of health AI technologies and to illustrate its utility via a case study.
Methods
Here, we propose a methodology to assess whether health AI technologies prioritise performance for patient populations experiencing worse outcomes, that is complementary to existing fairness metrics. We developed the Health Equity Assessment of machine Learning performance (HEAL) framework designed to quantitatively assess the performance equity of health AI technologies via a four-step interdisciplinary process to understand and quantify domain-specific criteria, and the resulting HEAL metric. As an illustrative case study (analysis conducted between October 2022 and January 2023), we applied the HEAL framework to a dermatology AI model. A set of 5420 teledermatology cases (store-and-forward cases from patients of 20 years or older, submitted from primary care providers in the USA and skin cancer clinics in Australia), enriched for diversity in age, sex and race/ethnicity, was used to retrospectively evaluate the AI model's HEAL metric, defined as the likelihood that the AI model performs better for subpopulations with worse average health outcomes as compared to others. The likelihood that AI performance was anticorrelated to pre-existing health outcomes was estimated using bootstrap methods as the probability that the negated Spearman's rank correlation coefficient (i.e., “R”) was greater than zero. Positive values of R suggest that subpopulations with poorer health outcomes have better AI model performance. Thus, the HEAL metric, defined as p(R > 0), measures how likely the AI technology is to prioritise performance for subpopulations with worse average health outcomes as compared to others (presented as a percentage below). Health outcomes were quantified as disability-adjusted life years (DALYs) when grouping by sex and age, and years of life lost (YLLs) when grouping by race/ethnicity. AI performance was measured as top-3 agreement with the reference diagnosis from a panel of 3 dermatologists per case.
Findings
Across all dermatologic conditions, the HEAL metric was 80.5% for prioritizing AI performance of racial/ethnic subpopulations based on YLLs, and 92.1% and 0.0% respectively for prioritizing AI performance of sex and age subpopulations based on DALYs. Certain dermatologic conditions were significantly associated with greater AI model performance compared to a reference category of less common conditions. For skin cancer conditions, the HEAL metric was 73.8% for prioritizing AI performance of age subpopulations based on DALYs.
Interpretation
Analysis using the proposed HEAL framework showed that the dermatology AI model prioritised performance for race/ethnicity, sex (all conditions) and age (cancer conditions) subpopulations with respect to pre-existing health disparities. More work is needed to investigate ways of promoting equitable AI performance across age for non-cancer conditions and to better understand how AI models can contribute towards improving equity in health outcomes.
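The HEAL metric described in the Methods above, p(R > 0) with R the negated Spearman rank correlation between subgroup health outcomes and subgroup AI performance, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy data, the convention that a higher health score means better health, and the choice to resample cases within each subgroup are all assumptions made for illustration.

```python
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat an undefined correlation as zero
    return cov / (vx * vy) ** 0.5


def heal_metric(cases_by_group, health_score, n_boot=2000, seed=0):
    """Estimate p(R > 0) by bootstrap.

    cases_by_group: hypothetical dict, subgroup -> list of 0/1 outcomes
        (1 = the model's top-3 matched the reference diagnosis).
    health_score: hypothetical dict, subgroup -> score, higher = better health.
    """
    rng = random.Random(seed)
    groups = sorted(cases_by_group)
    health = [health_score[g] for g in groups]
    positive = 0
    for _ in range(n_boot):
        perf = []
        for g in groups:
            outcomes = cases_by_group[g]
            # Assumption: resample with replacement within each subgroup.
            resampled = [outcomes[rng.randrange(len(outcomes))] for _ in outcomes]
            perf.append(sum(resampled) / len(resampled))
        r = -spearman(health, perf)  # negated correlation, as in the text
        if r > 0:
            positive += 1
    return positive / n_boot


# Toy data: subgroup "A" has the worst health score and the best performance.
toy_cases = {
    "A": [1] * 90 + [0] * 10,
    "B": [1] * 70 + [0] * 30,
    "C": [1] * 50 + [0] * 50,
}
toy_health = {"A": 1.0, "B": 2.0, "C": 3.0}
heal = heal_metric(toy_cases, toy_health)
print(f"HEAL metric: {heal:.1%}")
```

In this toy setup the subgroup with the poorest health score has the highest accuracy, so the estimate comes out close to 100%; reversing the health scores drives it towards 0%.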
Domain-specific optimization and diverse evaluation of self-supervised models for histopathology
Jeremy Lai
Faruk Ahmed
Supriya Vijay
Tiam Jaroensri
Jessica Loo
Saurabh Vyawahare
Saloni Agarwal
Fayaz Jamil
arXiv (2023)
Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types and spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods followed by further evaluation on held out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology, and establish a set of high quality foundation models to enable further research across diverse applications.
Pathologist Validation of a Machine Learning–Derived Feature for Colon Cancer Risk Stratification
Vincenzo L’Imperio
Markus Plass
Heimo Müller
Nicolò Tamini
Luca Gianotti
Nicola Zucchini
Robert Reihs
Lily Peng
Marialuisa Lavitrano
David F. Steiner
Kurt Zatloukal
Fabio Pagni
JAMA Network Open (2023)
Importance: Identifying new prognostic features in colon cancer has the potential to refine histopathologic review and inform patient care. Although prognostic artificial intelligence systems have recently demonstrated significant risk stratification for several cancer types, studies have not yet shown that the machine learning–derived features associated with these prognostic artificial intelligence systems are both interpretable and usable by pathologists.
Objective: To evaluate whether pathologist scoring of a histopathologic feature previously identified by machine learning is associated with survival among patients with colon cancer.
Design, Setting, and Participants: This prognostic study used deidentified, archived colorectal cancer cases from January 2013 to December 2015 from the University of Milano-Bicocca. All available histologic slides from 258 consecutive colon adenocarcinoma cases were reviewed from December 2021 to February 2022 by 2 pathologists, who conducted semiquantitative scoring for tumor adipose feature (TAF), which was previously identified via a prognostic deep learning model developed with an independent colorectal cancer cohort.
Main Outcomes and Measures: Prognostic value of TAF for overall survival and disease-specific survival as measured by univariable and multivariable regression analyses. Interpathologist agreement in TAF scoring was also evaluated.
Results: A total of 258 colon adenocarcinoma histopathologic cases from 258 patients (138 men [53%]; median age, 67 years [IQR, 65-81 years]) with stage II (n = 119) or stage III (n = 139) cancer were included. Tumor adipose feature was identified in 120 cases (widespread in 63 cases, multifocal in 31, and unifocal in 26). For overall survival analysis after adjustment for tumor stage, TAF was independently prognostic in 2 ways: TAF as a binary feature (presence vs absence: hazard ratio [HR] for presence of TAF, 1.55 [95% CI, 1.07-2.25]; P = .02) and TAF as a semiquantitative categorical feature (HR for widespread TAF, 1.87 [95% CI, 1.23-2.85]; P = .004). Interpathologist agreement for widespread TAF vs lower categories (absent, unifocal, or multifocal) was 90%, corresponding to a κ metric at this threshold of 0.69 (95% CI, 0.58-0.80).
Conclusions and Relevance: In this prognostic study, pathologists were able to learn and reproducibly score for TAF, providing significant risk stratification on this independent data set. Although additional work is warranted to understand the biological significance of this feature and to establish broadly reproducible TAF scoring, this work represents the first validation to date of human expert learning from machine learning in pathology. Specifically, this validation demonstrates that a computationally identified histologic feature can represent a human-identifiable, prognostic feature with the potential for integration into pathology practice.
Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging
Laura Anne Culp
Jan Freyberg
Basil Mustafa
Sebastien Baur
Simon Kornblith
Ting Chen
Patricia MacWilliams
Sara Mahdavi
Megan Zoë Walker
Aaron Loh
Scott Mayer McKinney
Jim Winkens
Zach William Beaver
Fiona Keleher Ryan
Justin David Krogue
Mozziyar Etemadi
Umesh Telang
Lily Hao Yi Peng
Geoffrey Everest Hinton
Neil Houlsby
Mohammad Norouzi
Nature Biomedical Engineering (2023)
Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates such ‘out of distribution’ performance problem and that improves model robustness and training efficiency. The strategy, which we named REMEDIS (for ‘Robust and Efficient Medical Imaging with Self-supervision’), combines large-scale supervised transfer learning on natural images and intermediate contrastive self-supervised learning on medical images and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1–33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging.
Deep Learning Detection of Active Pulmonary Tuberculosis at Chest Radiography Matched the Clinical Performance of Radiologists
Sahar Kazemzadeh
Jin Yu
Shahar Jamshy
Rory Pilgrim
Zaid Nabulsi
Christina Chen
Neeral Beladia
Chuck Lau
Scott Mayer McKinney
Thad Hughes
Atilla Peter Kiraly
Sreenivasa Raju Kalidindi
Monde Muyoyeta
Jameson Malemela
Ting Shih
Lily Hao Yi Peng
Kat Chou
Krish Eswaran
Shravya Ramesh Shetty
Radiology (2022)
Background: The World Health Organization (WHO) recommends chest radiography to facilitate tuberculosis (TB) screening. However, chest radiograph interpretation expertise remains limited in many regions. Purpose: To develop a deep learning system (DLS) to detect active pulmonary TB on chest radiographs and compare its performance to that of radiologists. Materials and Methods: A DLS was trained and tested using retrospective chest radiographs (acquired between 1996 and 2020) from 10 countries. To improve generalization, large-scale chest radiograph pretraining, attention pooling, and semisupervised learning (“noisy-student”) were incorporated. The DLS was evaluated in a four-country test set (China, India, the United States, and Zambia) and in a mining population in South Africa, with positive TB confirmed with microbiological tests or nucleic acid amplification testing (NAAT). The performance of the DLS was compared with that of 14 radiologists. The authors studied the efficacy of the DLS compared with that of nine radiologists using the Obuchowski-Rockette-Hillis procedure. Given WHO targets of 90% sensitivity and 70% specificity, the operating point of the DLS (0.45) was prespecified to favor sensitivity. Results: A total of 165 754 images in 22 284 subjects (mean age, 45 years; 21% female) were used for model development and testing. In the four-country test set (1236 subjects, 17% with active TB), the receiver operating characteristic (ROC) curve of the DLS was higher than those for all nine India-based radiologists, with an area under the ROC curve of 0.89 (95% CI: 0.87, 0.91). Compared with these radiologists, at the prespecified operating point, the DLS sensitivity was higher (88% vs 75%, P < .001) and specificity was noninferior (79% vs 84%, P = .004). Trends were similar within other patient subgroups, in the South Africa data set, and across various TB-specific chest radiograph findings. 
In simulations, the use of the DLS to identify likely TB-positive chest radiographs for NAAT confirmation reduced the cost by 40%–80% per TB-positive patient detected. Conclusion: A deep learning method was found to be noninferior to radiologists for the determination of active tuberculosis on digital chest radiographs.
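The abstract above reports sensitivity and specificity at a prespecified operating point (0.45). As a rough illustration of what that means, the sketch below thresholds model scores and computes both quantities. The scores and labels are made-up toy values, and treating a score at or above the threshold as positive is an assumption, not something the abstract states.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Compute sensitivity and specificity at a fixed operating point.

    scores: model outputs in [0, 1]; labels: 1 = disease present, 0 = absent.
    Assumption: a score >= threshold counts as a positive call.
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example with hypothetical scores, evaluated at the 0.45 operating point.
toy_scores = [0.9, 0.5, 0.3, 0.6, 0.1]
toy_labels = [1, 1, 1, 0, 0]
sens, spec = sensitivity_specificity(toy_scores, toy_labels, threshold=0.45)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Lowering the threshold trades specificity for sensitivity, which is why an operating point chosen to favor sensitivity is fixed before evaluation rather than tuned on the test set.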
Deep learning models for histologic grading of breast cancer and association with disease prognosis
Tiam Jaroensri
Trissia Brown
Isabelle Flament
Fraser Tan
Yuannan Cai
Kunal Nagpal
Emad Rakha
David J. Dabbs
Niels Olson
James H. Wren
Elaine E. Thompson
Erik Seetao
Carrie Robinson
Melissa Miao
Fabien Beckers
Lily Hao Yi Peng
Craig Mermel
npj Breast Cancer (2022)
Histologic grading of breast cancer involves review and scoring of three well-established morphologic features: mitotic count, nuclear pleomorphism, and tubule formation. Taken together, these features form the basis of the Nottingham Grading System which is used to inform breast cancer characterization and prognosis. In this study, we developed deep learning models to perform histologic scoring of all three components using digitized hematoxylin and eosin-stained slides containing invasive breast carcinoma. We then evaluated the prognostic potential of these models using an external test set and progression free interval as the primary outcome. The individual component models performed at or above published benchmarks for algorithm-based grading approaches and achieved high concordance rates in comparison to pathologist grading. Prognostic performance of histologic scoring provided by the deep learning-based grading was on par with that of pathologists performing review of matched slides. Additionally, by providing scores for each component feature, the deep-learning based approach provided the potential to identify the grading components contributing most to prognostic value. This may enable optimized prognostic models as well as opportunities to improve access to consistent grading and better understand the links between histologic features and clinical outcomes in breast cancer.
Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge
Wouter Bulten
Kimmo Kartasalo
Peter Ström
Hans Pinckaers
Kunal Nagpal
Yuannan Cai
Hester van Boven
Robert Vink
Christina Hulsbergen-van de Kaa
Jeroen van der Laak
Mahul B. Amin
Andrew J. Evans
Theodorus van der Kwast
Robert Allan
Peter A. Humphrey
Henrik Grönberg
Hemamali Samaratunga
Brett Delahunt
Toyonori Tsuzuki
Tomi Häkkinen
Lars Egevad
Maggie Demkin
Sohier Dane
Fraser Tan
Masi Valkonen
Lily Peng
Craig H. Mermel
Pekka Ruusuvuori
Geert Litjens
Martin Eklund
the PANDA challenge consortium
Nature Medicine, 28 (2022), pp. 154-163
Artificial intelligence (AI) has shown promise for diagnosing prostate cancer in biopsies. However, results have been limited to individual studies, lacking validation in multinational settings. Competitions have been shown to be accelerators for medical imaging innovations, but their impact is hindered by lack of reproducibility and independent validation. With this in mind, we organized the PANDA challenge—the largest histopathology competition to date, joined by 1,290 developers—to catalyze development of reproducible AI algorithms for Gleason grading using 10,616 digitized prostate biopsies. We validated that a diverse set of submitted algorithms reached pathologist-level performance on independent cross-continental cohorts, fully blinded to the algorithm developers. On United States and European external validation sets, the algorithms achieved agreements of 0.862 (quadratically weighted κ, 95% confidence interval (CI), 0.840–0.884) and 0.868 (95% CI, 0.835–0.900) with expert uropathologists. Successful generalization across different patient populations, laboratories and reference standards, achieved by a variety of algorithmic approaches, warrants evaluating AI-based Gleason grading in prospective clinical trials.
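The agreement figures above are quadratically weighted κ values. For readers unfamiliar with the statistic, the following is a minimal sketch of the standard formula, κ = 1 − Σ w·O / Σ w·E with weights w = (i − j)² / (k − 1)², where O is the observed confusion matrix and E the matrix expected from the two raters' marginals. The function name and the toy grade lists are hypothetical, and this is not the challenge's evaluation code.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Quadratically weighted Cohen's kappa for integer grades 0..k-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = num_categories
    if k < 2:
        raise ValueError("need at least two categories")
    n = len(rater_a)

    # Observed confusion matrix: observed[i][j] counts grade i vs grade j.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters graded independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - weighted_observed / weighted_expected


# Toy example: hypothetical grades on a 0-5 scale from an algorithm and a pathologist.
algorithm_grades = [0, 1, 2, 3, 4, 5, 2, 3]
pathologist_grades = [0, 1, 2, 3, 5, 5, 1, 3]
kappa = quadratic_weighted_kappa(algorithm_grades, pathologist_grades, num_categories=6)
print(f"quadratically weighted kappa: {kappa:.3f}")
```

Because the weights grow with the square of the grade difference, a disagreement of two grades is penalised four times as heavily as a disagreement of one grade, which suits ordinal scales such as grade groups.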
Interpretable Survival Prediction for Colorectal Cancer using Deep Learning
Melissa Moran
Markus Plass
Robert Reihs
Fraser Tan
Isabelle Flament
Trissia Brown
Peter Regitnig
Apaar Sadhwani
Bob MacDonald
Benny Ayalew
Lily Hao Yi Peng
Heimo Mueller
Zhaoyang Xu
Martin Stumpe
Kurt Zatloukal
Craig Mermel
npj Digital Medicine (2021)
Deriving interpretable prognostic features from deep-learning-based prognostic histopathology models remains a challenge. In this study, we developed a deep learning system (DLS) for predicting disease-specific survival for stage II and III colorectal cancer using 3652 cases (27,300 slides). When evaluated on two validation datasets containing 1239 cases (9340 slides) and 738 cases (7140 slides), respectively, the DLS achieved a 5-year disease-specific survival AUC of 0.70 (95% CI: 0.66–0.73) and 0.69 (95% CI: 0.64–0.72), and added significant predictive value to a set of nine clinicopathologic features. To interpret the DLS, we explored the ability of different human-interpretable features to explain the variance in DLS scores. We observed that clinicopathologic features such as T-category, N-category, and grade explained a small fraction of the variance in DLS scores (R² = 18% in both validation sets). Next, we generated human-interpretable histologic features by clustering embeddings from a deep-learning-based image-similarity model and showed that they explained the majority of the variance (R² of 73–80%). Furthermore, the clustering-derived feature most strongly associated with high DLS scores was also highly prognostic in isolation. With a distinct visual appearance (poorly differentiated tumor cell clusters adjacent to adipose tissue), this feature was identified by annotators with 87.0–95.5% accuracy. Our approach can be used to explain predictions from a prognostic deep learning model and uncover potentially-novel prognostic features that can be reliably identified by people for future validation studies.
Preview abstract
Rapid progress has been made in artificial intelligence (AI) models for medical applications, especially over the past 5 years, with substantial efforts focusing on diagnosis from medical images. An essential aspect of evaluating the performance of AI models and their potential clinical utility is the rigor of the reference standard. A reference standard is “the best available method for establishing the presence or absence of the target condition”, and is thus equivalent to what is commonly referred to as the ground truth in AI literature. Determination of what constitutes a reference standard is established by “opinion and practice within the medical, laboratory, and regulatory community”. The reference standard can either be a widely agreed-upon gold standard or, in its absence, a proxy that is highly correlated with the clinical outcome. Although a non-reference standard can also be used, correctness claims such as accuracy, sensitivity, and specificity should be dropped in favour of agreement with a comparative method.
Deep learning for distinguishing normal versus abnormal chest radiographs and generalization to two unseen diseases tuberculosis and COVID-19
Zaid Nabulsi
Shahar Jamshy
Charles Lau
Eddie Santos
Atilla Peter Kiraly
Jie Yang
Rory Pilgrim
Sahar Kazemzadeh
Jin Yu
Lily Hao Yi Peng
Krish Eswaran
Neeral Beladia
Shravya Ramesh Shetty
Scientific Reports (2021)
Chest radiography (CXR) is the most widely-used thoracic clinical imaging modality and is crucial for guiding the management of cardiothoracic conditions. The detection of specific CXR findings has been the main focus of several artificial intelligence (AI) systems. However, the wide range of possible CXR abnormalities makes it impractical to detect every possible condition by building multiple separate systems, each of which detects one or more pre-specified conditions. In this work, we developed and evaluated an AI system to classify CXRs as normal or abnormal. For training and tuning the system, we used a de-identified dataset of 248,445 patients from a multi-city hospital network in India. To assess generalizability, we evaluated our system using 6 international datasets from India, China, and the United States. Of these datasets, 4 focused on diseases that the AI was not trained to detect: 2 datasets with tuberculosis and 2 datasets with coronavirus disease 2019. Our results suggest that the AI system trained using a large dataset containing a diverse array of CXR abnormalities generalizes to new patient populations and unseen diseases. In a simulated workflow where the AI system prioritized abnormal cases, the turnaround time for abnormal cases reduced by 7–28%. These results represent an important step towards evaluating whether AI can be safely used to flag cases in a general setting where previously unseen abnormalities exist. Lastly, to facilitate the continued development of AI models for CXR, we release our collected labels for the publicly available dataset.