The Use of Machine Learning in Health Care: No Shortcuts on the Long Road to Evidence-based Precision Health
Two recent systematic reviews reveal the high risk of bias present in randomized controlled trials (RCTs) and observational studies based on machine learning and artificial intelligence.
Digitization of health data holds profound potential to change the way we collect information and interact with the health care system. Today, a growing volume of health-related data is generated from sources such as biosensors, health data registries, genome sequencing, and electronic health records. Making use of these data requires closer integration with computers capable of analyzing complex data with the assistance of artificial intelligence (AI).
One of the most common forms of AI applied to health care is machine learning (ML), a statistical technique for training algorithms to learn from and make predictions using data. ML may improve patient care by deriving novel insights from large data sets to predict which interventions are more likely to succeed for an individual patient.
A more recent branch of ML is deep learning (DL), which allows computational models to extract high-level representations of data through multiple layers of data processing. DL has been applied across a variety of medical fields, driven by specialties that use magnetic resonance imaging, echocardiography, mammography, and computed tomography scans and therefore have large data sets of annotated images available. For example, researchers have applied DL to electrocardiogram data to detect abnormalities with accuracy rates comparable to those of cardiologists. However, despite substantial advances of DL across many domains, a key concern is the nature of these algorithms, which are often referred to as “black box models”, alluding to the difficulty of understanding the inner workings by which these algorithms determine an outcome.
Across numerous health-related applications, ML algorithms have shown remarkable ability to surpass current standard care performance. The number of FDA-approved technologies that are based on AI/ML algorithms has increased several-fold over the last several years. Yet the field is still in its infancy; it was only in the early 2010s that DL achieved acceptance as a form of AI.
For ML algorithms to fulfill their promise to improve health care, it is important to evaluate the methodological quality and the risk of bias arising from shortcomings in design, conduct, and analysis in studies using ML techniques. Two recent systematic reviews tackle these important issues.
In the first paper, Zhou et al. conducted a systematic review of RCTs that included interventions using traditional statistical methods, ML, and DL tools. A total of 65 RCTs conducted within the past decade were identified. Most trials were designed to assist in treatment decisions, diagnosis, or risk stratification. Top disease categories included cancer and other chronic diseases. The authors found that, for a large proportion of the trials, quality was suboptimal with regard to pre-estimation of sample size, randomization, and masking, among other factors. This finding is consistent with prior studies assessing the quality of RCTs in medicine. Specifically, over 35% of RCTs were considered to have a high overall risk of bias, 38% raised some concerns about bias, and only 26% were found to have a low overall risk of bias. Notably, nearly 40% of the trial interventions failed to show clinical benefit compared to standard care.
In the second paper, Navarro et al. conducted a systematic review assessing the risk of bias in observational studies that used ML-based prediction models across medical specialties. The authors identified a total of 152 prediction model studies, almost 40% of which were diagnostic (predicting whether an outcome is present), while the other 60% were prognostic (predicting whether an outcome will develop). The most common algorithms used in the studies were classification and regression trees, support vector machines, and random forests. Clinical specialties associated with the most studies included oncology, surgery, and neurology. The authors found that 87% of analyses had a high risk of bias, largely due to small study size, susceptibility to overfitting, and poor handling of missing data.
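To make the link between small study size and overfitting concrete, here is a minimal simulation sketch in Python. It is an illustration only and is not taken from either review: the data are synthetic, the labels are deliberately unrelated to the features, and every name in it (`simulate`, `nearest_neighbor_predict`, the sample sizes) is a hypothetical choice. A one-nearest-neighbor model is used because it memorizes its training data, so its performance on that data looks perfect even when there is nothing real to learn.

```python
import random


def nearest_neighbor_predict(train_x, train_y, point):
    """Return the label of the training example closest to `point` (1-NN)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    return train_y[best]


def accuracy(train_x, train_y, eval_x, eval_y):
    """Fraction of evaluation examples whose label is predicted correctly."""
    hits = sum(
        nearest_neighbor_predict(train_x, train_y, x) == y
        for x, y in zip(eval_x, eval_y)
    )
    return hits / len(eval_x)


def simulate(n_train=30, n_test=1000, n_features=20, seed=0):
    """Compare apparent (training) and held-out accuracy on pure-noise data."""
    rng = random.Random(seed)

    def draw(n):
        xs = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
        ys = [rng.randint(0, 1) for _ in range(n)]  # labels ignore the features
        return xs, ys

    train_x, train_y = draw(n_train)
    test_x, test_y = draw(n_test)
    apparent = accuracy(train_x, train_y, train_x, train_y)
    held_out = accuracy(train_x, train_y, test_x, test_y)
    return apparent, held_out


apparent, held_out = simulate()
print(f"apparent accuracy: {apparent:.2f}, held-out accuracy: {held_out:.2f}")
```

Under these assumptions, the model scores perfectly on the 30 patients it was trained on, while accuracy on new data stays near chance. This is why performance reported only on the development data, without internal or external validation, can be a sign of high risk of bias.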
The two systematic reviews provide valuable insights into the current limitations of studies using ML-based models. Given that methodological challenges and risk of bias in ML-based models can arise at different development stages, such as data curation, model selection and implementation, and validation, there is a need for broad discussion of possible solutions. Both reviews recommend that researchers follow standardized reporting guidelines to better determine the risk of bias and to improve assessment of methodological quality. By addressing the limitations of ML-based algorithms and working to mitigate the risk of bias at each stage of development, ML and other advanced technologies can get us closer to a new era of precision health.