AI reading of H&E slides is often biased by linked biomarkers, grade and mutation burden
Mar 3rd 2026
A multicohort study of 8,221 cancer patients shows that deep learning models predicting molecular biomarkers from routine H&E whole-slide images frequently learn composite signals tied to other mutations, tumor grade or mutation burden. The result is biased subgroup performance and limited generalizability, prompting a call for bias-aware evaluation and causal methods.
- Models can show high overall accuracy, yet their AUROC often drops substantially within subgroups defined by co-occurring or mutually exclusive biomarkers.
- Interdependencies among biomarkers, histological grade and tumor mutational burden act as confounders that models exploit as proxies for the target biomarker.
- Association patterns between biomarkers and confounders vary across datasets, so external validation can be misleading about real-world generalizability.
- For several targets, a simple classifier based on pathologist-assigned grade approaches the performance of the deep learning models, limiting their added clinical value.
- The authors propose bias-aware evaluation using stratified metrics, permutation testing and comparisons with clinical baselines to detect reliance on confounders.
- Until dependency-aware, causal approaches and richer datasets are developed, WSI-based predictors are best used for triage, screening or research support rather than replacing molecular tests.
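The bias-aware evaluation the authors propose can be illustrated in code. The sketch below is not the study's actual pipeline; function names and the within-stratum permutation scheme are illustrative assumptions. The idea: compute AUROC separately within confounder-defined subgroups (e.g. tumor grade), and run a permutation test that shuffles labels only inside each stratum, so any apparent skill that rides entirely on the confounder does not survive.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as 0.5
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def stratified_auroc(y_true, y_score, strata):
    """AUROC within each confounder-defined subgroup (hypothetical helper).
    Skips strata that do not contain both classes."""
    y_true, y_score, strata = map(np.asarray, (y_true, y_score, strata))
    out = {}
    for g in np.unique(strata):
        m = strata == g
        if len(np.unique(y_true[m])) == 2:
            out[g] = auroc(y_true[m], y_score[m])
    return out

def within_stratum_permutation_pvalue(y_true, y_score, strata,
                                      n_perm=1000, seed=0):
    """Permute labels WITHIN each stratum. If the pooled AUROC is driven
    only by the confounder encoded in `strata`, permuted AUROCs will match
    the observed one often, yielding a large p-value."""
    y_true, y_score, strata = map(np.asarray, (y_true, y_score, strata))
    rng = np.random.default_rng(seed)
    observed = auroc(y_true, y_score)
    count = 0
    for _ in range(n_perm):
        y_perm = y_true.copy()
        for g in np.unique(strata):
            idx = np.where(strata == g)[0]
            y_perm[idx] = rng.permutation(y_perm[idx])
        if auroc(y_perm, y_score) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one correction
```

Comparing the pooled AUROC against the stratified values exposes the pattern described above: a model may look strong overall while performing near chance inside each grade subgroup, which is exactly the signature of confounder reliance.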