AI reading of H&E slides is often biased by linked biomarkers, grade and mutation burden
Mar 3rd 2026
A multicohort study of 8,221 cancer patients shows that deep learning models predicting molecular biomarkers from routine H&E whole-slide images frequently learn composite signals tied to other mutations, tumor grade or mutation burden. The result is biased subgroup performance and limited generalizability, prompting a call for bias-aware evaluation and causal methods.
- Models can show high overall accuracy, yet their AUROC often drops substantially within subgroups defined by co-occurring or mutually exclusive biomarkers.
- Interdependencies among biomarkers, histological grade and tumor mutational burden act as confounders that models exploit as proxies for the target biomarker.
- Association patterns between biomarkers and confounders vary across datasets, so external validation can be misleading about real-world generalizability.
- For several targets, a simple classifier based on pathologist-assigned grade approaches the performance of the deep learning models, limiting their added clinical value.
- The authors propose bias-aware evaluation using stratified metrics, permutation testing and comparisons with clinical baselines to detect reliance on confounders.
- Until dependency-aware, causal approaches and richer datasets are developed, WSI-based predictors are best used for triage, screening or research support rather than replacing molecular tests.
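The bias-aware evaluation the authors propose can be illustrated in code. The sketch below is not the study's actual pipeline; function names and the within-stratum permutation scheme are illustrative assumptions. The idea: compute AUROC separately within confounder-defined subgroups (e.g. tumor grade), and run a permutation test that shuffles labels only inside each stratum, so any apparent skill that rides entirely on the confounder does not survive.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as 0.5
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def stratified_auroc(y_true, y_score, strata):
    """AUROC within each confounder-defined subgroup (hypothetical helper).
    Skips strata that do not contain both classes."""
    y_true, y_score, strata = map(np.asarray, (y_true, y_score, strata))
    out = {}
    for g in np.unique(strata):
        m = strata == g
        if len(np.unique(y_true[m])) == 2:
            out[g] = auroc(y_true[m], y_score[m])
    return out

def within_stratum_permutation_pvalue(y_true, y_score, strata,
                                      n_perm=1000, seed=0):
    """Permute labels WITHIN each stratum. If the pooled AUROC is driven
    only by the confounder encoded in `strata`, permuted AUROCs will match
    the observed one often, yielding a large p-value."""
    y_true, y_score, strata = map(np.asarray, (y_true, y_score, strata))
    rng = np.random.default_rng(seed)
    observed = auroc(y_true, y_score)
    count = 0
    for _ in range(n_perm):
        y_perm = y_true.copy()
        for g in np.unique(strata):
            idx = np.where(strata == g)[0]
            y_perm[idx] = rng.permutation(y_perm[idx])
        if auroc(y_perm, y_score) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one correction
```

Comparing the pooled AUROC against the stratified values exposes the pattern described above: a model may look strong overall while performing near chance inside each grade subgroup, which is exactly the signature of confounder reliance.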