Deep learning is revolutionizing medicine. Algorithms are increasingly doing everything from triaging medical imagery to predicting treatment outcomes. Yet as hospitals undergo the same AI revolution sweeping other fields, the dangers of AI bias and error, combined with the life-or-death consequences of medicine, lend unique risk to these experiments and suggest caution.
One of the fastest-growing uses of AI in medicine today is the analysis of medical imagery. Human analysis of imagery is slow, difficult to scale and error-prone. Replacing or augmenting human analysis with algorithmic analysis could even eventually allow medical imaging devices to diagnose patients in real time as they are being imaged, directing technicians to collect additional imagery to narrow the diagnosis while the patient is still lying in the imaging system.
The problem is that today’s correlative deep learning systems require vast amounts of extremely diverse training imagery, which can be hard to acquire in hospital settings where there may be more uniformity in patient conditions, demographics and imaging systems. Most dangerously, AI algorithms can easily learn characteristics unrelated to the disease itself, leading to false positives and negatives that can cause adverse patient outcomes or even death.
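This failure mode can be illustrated with a toy sketch (hypothetical features and numbers, not a real imaging pipeline): a naive learner is shown cases where a scanner artifact happens to correlate with the diagnosis more strongly than the actual disease signal, picks the artifact, and then collapses on data from a hospital where that correlation does not hold.

```python
import random
random.seed(0)

def make_cases(n, artifact_corr):
    """Each case is (disease_signal, scanner_artifact, label).
    The disease signal weakly tracks the true label; the scanner
    artifact tracks it with probability artifact_corr (a confound)."""
    cases = []
    for _ in range(n):
        label = random.randint(0, 1)
        signal = label if random.random() < 0.75 else 1 - label
        artifact = label if random.random() < artifact_corr else 1 - label
        cases.append((signal, artifact, label))
    return cases

def accuracy(cases, feature):
    """Accuracy of predicting the label directly from one feature."""
    return sum(c[feature] == c[2] for c in cases) / len(cases)

# Training hospital: the artifact is almost perfectly confounded with the label.
train = make_cases(5000, artifact_corr=0.98)

# A naive learner keeps whichever feature best predicts the training labels.
best = max([0, 1], key=lambda f: accuracy(train, f))

# A new hospital with different scanners: the confound is broken.
new_hospital = make_cases(5000, artifact_corr=0.50)

print("chose:", "scanner artifact" if best == 1 else "disease signal")
print("training accuracy:", round(accuracy(train, best), 2))
print("new-hospital accuracy:", round(accuracy(new_hospital, best), 2))
```

The learner looks superb on its home data and near-random elsewhere, which is exactly why uniform training populations are dangerous.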
Driverless cars are able to use simulators to generate the vast reams of scenarios they are unlikely to experience in real life, but to date medical systems have largely been trained on real-world data rather than imaging simulations.
Deep learning algorithms today are incredibly brittle black boxes, offering little insight into the reasons behind their decisions. Most importantly, it is nearly impossible to determine the boundaries of their learning and the edge conditions under which they will fail. This means doctors have little to go on when estimating whether a given automated diagnosis lies solidly within the algorithm’s learned sweet spot or at the edge of its abilities, where the risk of error is greater.
Today’s automated assessment experiments are just that: experiments. Using AI algorithms to assess medical imagery is still performed primarily in a research context, with the machine’s diagnoses used only to evaluate its performance, rather than augment or replace human experts.
Over time, however, these algorithms will find increasing use in production scenarios.
Early adoption of these algorithms will almost certainly involve human augmentation, in which the machine merely provides suggestions for human review. Unfortunately, such systems tend to devolve rapidly. In augmentation workflows, human analysts typically come to trust their automated counterparts more than they trust themselves. At first they may scrutinize the automated results more closely than they would check even a human colleague, but over time they become complacent. Careful verification gives way to casual scrutiny and then to brief randomized spot checks.
As the machines yield a high success rate and scrutiny and caution lessen, human analysts will be assigned an ever-greater volume of content to verify, leaving them less and less time to check each individual image. The overworked analysts will default to assuming the machine is right, stopping to check only extreme cases.
Most dangerously, over time those human analysts will begin to trust the machine over their own experience and intuition when there are disagreements. Confronted with an edge case where the result is unclear, humans are more likely to defer to the algorithm under the false assumption that its computerized precision has allowed it to see a pattern or artifact invisible to the human eye.
There are myriad ways to counter these effects, such as inserting randomized known-answer images to test inter- and intra-coder reliability over time. The simple fact, however, is that more and more of the medical diagnostic world will be turned over to brittle and unpredictable machines that work flawlessly until they fail in the most unexpected ways, often with severe harm or even death to the patient.
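The reliability checks mentioned above have a standard statistical core. As a minimal sketch (the seeded images and labels here are invented for illustration), Cohen's kappa measures how much an analyst's agreement with known-answer "seed" images exceeds chance agreement, and can be tracked over time to detect creeping complacency:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences:
    observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's label marginals.
    pe = 0.0
    for cls in set(rater_a) | set(rater_b):
        pe += (rater_a.count(cls) / n) * (rater_b.count(cls) / n)
    return (po - pe) / (1 - pe)

# Hypothetical known-answer images slipped into an analyst's queue
# (1 = disease present, 0 = absent):
truth   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
analyst = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(truth, analyst), 2))  # → 0.6
```

A kappa trending downward on seeded images would flag that spot checks have become rubber stamps, long before a patient-harming miss surfaces.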
Driverless cars have adopted a hybrid approach in which real-world training data is augmented with simulator-derived examples that generate coverage of the scenarios unlikely to have sufficient physical instantiations. Yet even all of this data is ultimately coupled with hand-coded rulesets that govern the most important life-and-death situations like stopping at stop signs. That deep learning algorithms are still wrapped within hand-coded rulesets to ensure the reliability of their most important behaviors reminds us that for all its hype and hyperbole, deep learning is still in its infancy and is not mature enough to take over such tasks in their entirety with sufficient robustness when lives are on the line.
Putting this all together, the future of medicine will be increasingly automated. The only question is how to address the severe weaknesses of today’s correlative deep learning algorithms when it comes to the life-and-death scenarios of medicine.
In the end, an AI algorithm that makes a bad prediction of what movie we should stream next has little consequence. An AI algorithm that recommends what treatment we should receive has our life resting on its accuracy.