Preparing cancer patients for difficult decisions is an oncologist's job. They don't always remember to do it, though. At the University of Pennsylvania Health System, doctors get a nudge to talk about a patient's treatment and end-of-life preferences from an artificially intelligent algorithm that predicts the chances of death.
But it’s far from a set-it-and-forget-it tool. A routine technology review revealed that the algorithm deteriorated during the covid-19 pandemic, getting 7 percentage points worse at predicting who would die, according to a 2022 study.
There were probably real-life impacts. Ravi Parikh, an Emory University oncologist who was the study's lead author, told KFF Health News that the tool failed hundreds of times to prompt doctors to initiate this important discussion (one that can head off unnecessary chemotherapy) with patients who needed it.
He believes that several algorithms designed to improve medical care were weakened during the pandemic, not just Penn Medicine's. "Many institutions don't routinely monitor the performance" of their products, Parikh said.
Algorithm glitches are one facet of a dilemma that computer scientists and doctors have long recognized but that is beginning to baffle hospital executives and researchers: AI systems require constant monitoring and staffing to get them up and running and to keep them running smoothly.
Bottom line: You need people, and more machines, to make sure new tools don’t break.
“Everybody thinks that AI is going to help us with our access and capacity and improve care and so on,” said Nigam Shah, chief data scientist at Stanford Health Care. “That’s all nice and well, but if it increases the cost of care by 20%, is it viable?”
Government officials worry that hospitals lack the resources to implement these technologies. “I’ve looked far and wide,” FDA Commissioner Robert Califf said at a recent agency panel on AI. “I don’t think there is a single health system, in the United States, that is capable of validating an AI algorithm that has been put into place in a clinical care system.”
AI is already widespread in health care. Algorithms are used to predict patients' risk of death or deterioration, to suggest diagnoses or triage patients, to record and summarize visits to save doctors work, and to approve insurance claims.
If the tech evangelists are right, technology will become ubiquitous and profitable. Investment firm Bessemer Venture Partners has identified about 20 health-focused AI startups on track to earn $10 million in revenue each within a year. The FDA has approved nearly a thousand artificially intelligent products.
Assessing whether these products work is challenging. Assessing whether they still work, or have developed the software equivalent of a blown gasket or a leaky engine, is even trickier.
Take a recent study at Yale Medicine evaluating six "early warning systems," which alert doctors when patients are likely to deteriorate rapidly. A supercomputer ran the data for several days, said Dana Edelson, a doctor at the University of Chicago and co-founder of a company that provided an algorithm for the study. The process was fruitful, showing large differences in performance among the six products.
It is not easy for hospitals and providers to select the best algorithms for their needs. The average doctor doesn't have a supercomputer sitting around, and there is no Consumer Reports for AI.
“We don’t have standards,” said Jesse Ehrenfeld, immediate past president of the American Medical Association. “There’s nothing that can tell you today that there’s a standard for how you evaluate, monitor, observe the performance of an algorithmic model, AI-enabled or not, when it’s deployed.”
Perhaps the most common AI product in doctors' offices is ambient documentation, a tech-enabled assistant that listens to and summarizes patient visits. So far this year, investors tracked by Rock Health have put $353 million into these documentation companies. But, Ehrenfeld said, "there is no standard right now to compare the output of these tools."
And that's a problem when even small mistakes can be devastating. A team at Stanford University tried using large language models, the technology behind popular AI tools like ChatGPT, to summarize patients' medical histories. They compared the results with what a doctor would write.
"Even in the best case, the models had a 35 percent error rate," Stanford's Shah said. In medicine, "when you're writing a summary and you forget a word, like 'fever,' I mean, that's a problem, right?"
Sometimes the reasons algorithms fail are fairly logical. For example, changes to the underlying data can erode an algorithm's effectiveness, such as when a hospital switches lab providers.
Sometimes, however, the pitfalls open up for no apparent reason.
Sandy Aronson, chief technology officer of the personalized medicine program at Mass General Brigham in Boston, said that when his team tested an app intended to help genetic counselors locate relevant literature on DNA variants, the product suffered from "nondeterminism": asked the same question several times in a short period, it gave different results.
Aronson is excited about the potential of large language models to summarize knowledge for overworked genetic counselors, but "the technology needs to improve."
If metrics and standards are scarce and errors can arise for strange reasons, what are institutions to do? Invest a lot of resources. At Stanford, Shah said, it took eight to 10 months and 115 person-hours just to audit two models for fairness and reliability.
Experts interviewed by KFF Health News floated the idea of AI monitoring AI, with some (human) data geeks overseeing both. All acknowledged that this would require organizations to spend even more money, a tough ask given the realities of hospital budgets and the limited supply of AI technology specialists.
"It's great to have a vision where we're melting icebergs in order to have a model monitoring their model," Shah said. "But is that really what I wanted? How many more people are we going to need?"
KFF Health News is a national newsroom that produces in-depth journalism on health issues and is one of the core operating programs of KFF — the independent source of health policy research, polling and journalism.