Moneyball Medicine: Interview with Dr. Ethan Halm

Dr. Ethan Halm and his colleagues use data from electronic health records to predict who is more likely to be at risk for bad things. Then they intervene early to reduce the risk of the bad things happening.

By Daniel Oppenheimer
Editor, Texas Health Journal


Although the algorithms that Dr. Ethan Halm and his colleagues deploy to help patients are complex, the premise is simple: use big data sets to predict who is more likely to be at risk for bad things. Then intervene early to reduce the risk of the bad things happening.

We spoke to Dr. Halm, who is Chief of the William T. and Gay F. Solomon Division of General Internal Medicine and Director of the Center for Patient-Centered Outcomes Research at The University of Texas Southwestern Medical Center, about his work with electronic health records, predictive algorithms, and negotiating the right balance between man and machine.

Halm received his B.A. from Wesleyan University, M.D. from the Yale School of Medicine, and M.P.H. from Harvard University. Following his residency at the University of California, San Francisco, he completed a general medicine/clinical research fellowship at the Massachusetts General Hospital. His research has been funded by the National Institutes of Health, Agency for Healthcare Research and Quality, and Robert Wood Johnson Foundation, among others.

Texas Health Journal: I’m interested in the details of the studies you’ve done, and interventions you’ve designed, but first give me the big picture. What is the elevator pitch?


Ethan Halm: We’re using electronic data, primarily information from electronic health records (EHRs), to build models of the risk of bad things happening, whether that’s readmissions after heart failure hospitalizations, cardiac arrest while in the hospital, or the development of diabetes. We’re validating those models using follow-up data or data from other patient populations, and then refining the models to make them more accurate and comprehensive. The final step is using the models to design interventions to decrease the chance of bad things happening. We don’t want to just predict who is likely to get hit by a bus and where. We want to stop people getting hit by a bus by putting in a stop light at a dangerous intersection. The goal is to come up with actionable intelligence.

What’s an example of what this looks like in practice?

A good example is the work we did in conjunction with the Parkland Center for Clinical Innovation predicting the risk that non-ICU patients in the hospital will suffer unexpected cardiac arrest or death. Working with EHR data from Parkland Hospital, which is the safety net hospital for Dallas, we developed an algorithm that automatically flags patients at high risk for these outcomes. We found that the algorithm identified people on average almost six hours before the rapid response team would otherwise have been called in by a clinician, and sixteen hours before the actual event. The rapid response team now follows up on these patients quickly, without waiting for a doctor or nurse to call them, a call that may come only when people are so sick that it might be too late to turn things around. This can save lives and reduce the severity of cardiac events.

We have also done a lot of work developing computerized models to predict the risk of a hospitalized patient needing to be readmitted in the 30 days after discharge. When we implemented use of an algorithm to identify and intervene on high risk patients hospitalized with heart failure at Parkland Hospital, it reduced readmissions by 27% and saved roughly $1 million in the first year. Parkland was so impressed by the results, they decided to greatly expand the program. They’re now deploying a multi-condition risk algorithm perfected by the Parkland Center for Clinical Innovation to automatically flag high risk medical patients regardless of which diseases they have or the reason they were admitted. They have a transition-of-care team that uses that algorithm to really focus on high risk patients and figure out what they can do to mitigate the risks. It’s not even a research intervention anymore. It’s just how they do business.

What is it the algorithms are catching when they flag someone? What kinds of data are they using to identify high risk patients?

It depends on the condition or event, but a lot of it is what you would expect. The computers can only assess the variables that we are documenting. So that means things like: abnormal lab values, vital signs, diagnostic codes, and demographics like age, gender, and insurance status. There are also other types of things like “level of consciousness” (LOC) abnormalities, which can often be found in the nurses’ notes. In the study I mentioned above, we hypothesized that patients who were sicker in subtler ways would be more likely to be admitted to certain non-ICU medical floors, so we classified those wards as “high risk floors.”

The other thing we have done that has added predictive power is bring in some electronic proxies for the social determinants of health. We might look, for instance, at the number of emergency contacts that are listed in the EHR, and use that as a proxy for social support. If no one is listed, that person is more likely to be on their own, as opposed to someone who lists a spouse or children or friends. Lack of social support increases the risk of readmission.

We have also tapped into the Dallas-Fort Worth Hospital Council data set, which is an all-payer health information exchange that provides information on things like the number of visits someone has had to any hospital or to the ER in North Texas in the past 12 months. That kind of data can increase your precision in assessing their readmission risk.
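
As a rough illustration of how proxies like these might be derived, here is a minimal Python sketch. The `PatientRecord` structure, its field names, and the `social_and_utilization_features` helper are hypothetical simplifications made up for this example; they are not the actual EHR schema or the features used in the Parkland models.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class PatientRecord:
    """Hypothetical, simplified stand-in for an EHR extract."""
    emergency_contacts: List[str] = field(default_factory=list)
    # Dates of visits to any hospital or ER in the region (assumed available).
    visit_dates: List[date] = field(default_factory=list)


def social_and_utilization_features(record: PatientRecord, as_of: date) -> Dict[str, object]:
    """Derive simple proxy features from one record as of a given date."""
    window_start = as_of - timedelta(days=365)  # roughly the past 12 months
    recent_visits = [d for d in record.visit_dates if window_start <= d <= as_of]
    return {
        # Proxy for social support: how many emergency contacts are listed.
        "emergency_contact_count": len(record.emergency_contacts),
        "no_emergency_contact": len(record.emergency_contacts) == 0,
        # Proxy for prior utilization: visits within the look-back window.
        "visits_past_12_months": len(recent_visits),
    }


# Made-up record, for illustration only.
example = PatientRecord(
    emergency_contacts=[],
    visit_dates=[date(2023, 1, 10), date(2023, 11, 5), date(2021, 6, 1)],
)
features = social_and_utilization_features(example, as_of=date(2023, 12, 1))
print(features)
```

Features like these would then be fed into a risk model alongside lab values, vital signs, and demographics.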

Once a patient has been flagged, what then? What do you do?

When someone is flagged as high risk, a human comes in to investigate more and optimize the care they are getting. That might mean more care during the hospitalization, or more planning for discharge and follow up services. We are hoping to get to the point where our algorithms can tell us not just that someone is at higher risk, but what to do to reduce the risk. Right now, however, the algorithms mostly tell us to bring in the clinicians and transition of care team. Then it’s up to them to use their expertise. It is a way of maximizing the allocation of scarce resources to those patients most likely to benefit from them, rather than taking a standard, one-size-fits-all approach.

The next step will be to give the intervention team information about why specifically someone is at high risk. Some of that will come from better models, but a lot of it likely will come from collecting more data on other key risk factors. This is particularly true when you’re thinking about the social determinants of health. There is typically a lot of information in that realm that is gathered, but much of it is in free text form. It is not uniformly defined or recorded and is often entered into a part of the medical chart not regularly reviewed by doctors. We have to create more structured fields in the EHR to capture important social determinants of health and then harness that data better so it’s not out of sight, out of mind, and the whole care team can act on it.

Isn’t this kind of triaging something that clinicians have always done? Aren’t they always assessing patients, and using the available data to predict who is likely to have the highest risk of various things? How is this different from that?

Clinicians have always had a gestalt feel for who they worry the most about, and data-based protocols to rely on. Certainly if a doctor thinks someone is at high risk or is unstable, they will focus more on that person. But the kind of algorithms we are developing can help complement that approach and compensate for blind spots in human judgement. There are individuals who just don’t fit the profile of a classic high-risk person or aren’t flagged by existing protocols. They’re flying under the radar.

Right now, my colleagues and I have a project in process called “man vs. machine,” in which we are asking the treating doctors in the hospital to give us their sense about the risk of readmission, and also seeing what the computerized algorithm says about the risk of readmission. Then we’re comparing those to the actual outcomes. What we have found is that there are areas where the computers are better, and some where the people are better. The combination of human and big data judgement provides you the best of both worlds.
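
One way a comparison like this could be scored is with the c-statistic: the share of (readmitted, not readmitted) patient pairs in which the readmitted patient was given the higher predicted risk. The interview does not say which measure the study uses, so the Python sketch below is only an assumption about how such a comparison might look. All numbers are invented to show the mechanics and are not study results; averaging the two estimates is likewise just one simple, hypothetical way to combine them.

```python
from typing import List


def c_statistic(predicted_risks: List[float], outcomes: List[int]) -> float:
    """Fraction of (event, non-event) pairs ranked correctly; ties count as half."""
    events = [r for r, y in zip(predicted_risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(predicted_risks, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("Need at least one event and one non-event.")
    concordant = 0.0
    for event_risk in events:
        for non_event_risk in non_events:
            if event_risk > non_event_risk:
                concordant += 1.0
            elif event_risk == non_event_risk:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Synthetic data for six imaginary patients (1 = readmitted within 30 days).
outcomes = [1, 0, 1, 0, 0, 1]
clinician_risk = [0.6, 0.2, 0.3, 0.4, 0.1, 0.7]   # made-up clinician estimates
algorithm_risk = [0.5, 0.3, 0.6, 0.2, 0.4, 0.4]   # made-up model estimates
# Hypothetical combination: a plain average of the two estimates.
combined_risk = [(c + a) / 2 for c, a in zip(clinician_risk, algorithm_risk)]

for label, risks in [
    ("clinician", clinician_risk),
    ("algorithm", algorithm_risk),
    ("combined", combined_risk),
]:
    print(label, round(c_statistic(risks, outcomes), 3))
```

A value of 0.5 means the ranking is no better than chance, and 1.0 means every readmitted patient was ranked above every patient who was not readmitted.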

This is all sounding very Moneyball to me. I’m thinking of the Oakland A’s using quantitative methods to identify potential recruits who would be valuable to the team who wouldn’t tend to be singled out by their human scouts. Is that a good analogy?

Absolutely. The scout vs. the quant. There are certain players or patients who everyone can see are outliers, the five-tool players or high risk patients, but there are lots of disconnects as well, and that is the Moneyball story. There are medical versions of the on-base percentage, which was one of the stats that the scouts tended to undervalue. We found, for instance, that any abnormality in the vital signs in the 24 hours before discharge is predictive of higher risk of a patient being readmitted, even after adjusting for many other things. However, one abnormal value is the kind of thing a physician may not see as a major problem, especially if that value is normal at other points within the 24 hours. Vital signs are vital. They can tell us a lot if we pay attention to them.

Sometimes it is not just one thing, but a combination of indicators. There are a lot of different risk predictors that may not affect the risk that much on their own, but you roll them all together and the risk compounds. That multi-factorial complexity is something that might be hard for clinicians to fully sense, especially when they are very busy.
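
To see how modest factors can add up, consider a logistic regression, where each factor that is present multiplies a patient's odds of the outcome. The Python sketch below assumes that kind of model. The factor names, the odds ratios, and the 15% baseline are all hypothetical values chosen for illustration, not figures from any of the studies discussed here.

```python
from typing import Dict, List

# Hypothetical odds ratios, for illustration only.
ODDS_RATIOS: Dict[str, float] = {
    "abnormal_vital_sign_at_discharge": 1.4,
    "no_emergency_contact": 1.3,
    "prior_er_visit_past_year": 1.5,
    "abnormal_lab_value": 1.3,
}


def readmission_probability(baseline_probability: float, present_factors: List[str]) -> float:
    """Combine risk factors on the odds scale, as a logistic regression does."""
    odds = baseline_probability / (1.0 - baseline_probability)
    for factor in present_factors:
        odds *= ODDS_RATIOS[factor]  # each factor multiplies the odds
    return odds / (1.0 + odds)  # convert odds back to a probability


baseline = 0.15  # assumed baseline readmission risk
one_factor = readmission_probability(baseline, ["no_emergency_contact"])
all_factors = readmission_probability(baseline, list(ODDS_RATIOS))
print(f"one factor: {one_factor:.2f}, all four factors: {all_factors:.2f}")
```

With these made-up numbers, any single factor moves the risk only a few percentage points, while all four together more than double it.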

In some machine learning-based models, one can end up with predictors or insights that the human programmers involved don’t themselves understand. Is that happening here? Do you always understand why your algorithms are flagging the patients they’re flagging?

Most of what we are doing takes a traditional statistical approach, using regression analysis, to identify predictors. But we have also had our analysts take more of that machine learning approach, to see if the computer can come up with unexpected combinations of things that we have not thought about, to see if those models are any better. And we have ended up with some of those black box types of findings.

Do you use them?

We are cautious about this. We have to ask if the increase in complexity, and decrease in our ability to explain the models, is worth the improvement. From the standpoint of producing actionable intelligence, we are hoping to produce signals that may guide the intervention, that provide some information about why a patient is high risk. If we don’t know why the algorithm is flagging someone, we don’t have as much actionable intelligence. If the gain in accuracy was large enough, it might be worth including anyway, but we have not seen too many examples of breakthroughs like this yet.

On that topic, how much better do you think we’re likely to get in terms of our predictive powers? What’s the science fiction version of this, 20 or 50 years down the road? Will we have everyone’s fully sequenced genome on file, and all their data from Google and Facebook, and be able to instantly integrate that into a profile that allows for near-perfect predictions?

As we have added more and more complexity, and pulled in more and more data elements, we have seen diminishing returns on our predictive ability. I think we can continue to improve our collection of data and our models, but there seems to be a limit. One reason is there is a lot of variation that is simply not explained and may not be explainable. There are so many things that go into why someone does badly. As Yogi Berra said, predictions are tricky, especially about the future.

Another thing to keep in mind is that we may discover predictors of risk that don’t point to an obvious solution. As we get better, for instance, at having the social determinants of health collected in a more granular, consistent way, that may explain some of the variance, but many of those determinants may not be easily influenceable. We can’t always fix it. If someone is homeless, people have created “housing-as-medicine” interventions, providing short term housing for someone so they can recover from something like pneumonia in a stable environment. But that doesn’t solve the underlying problem of homelessness in the long term. If someone is socially isolated, alone, you can check up on them periodically. You can help with Lyft or Uber transportation credits so they can get to a medical appointment, but they are still alone most of the time.

The flip side is to not let the perfect be the enemy of the good. We are not at the point where a computer does a risk assessment and sends a drone out to do the work. However, a lot of good can be done now by combining the best of what computers and humans can uniquely do: the quant and the scout. We can get better. We are getting better.