Deploying data against disease
Like so many college students in 2020, Orediggers lived and studied through an unprecedented spring semester.
With COVID-19 spreading rampantly through the country, Mines closed campus the week before spring break, then decided to keep it closed for the remainder of the spring semester. Laura Albrecht, a PhD student in applied mathematics and statistics, was using data science to study blood coagulation throughout her semester in lockdown, but as data science emerged as essential behind-the-scenes work during the pandemic, she found herself wishing she could contribute directly to the emerging body of knowledge in her field.
An opportunity appeared for the statistician at the start of summer: The American Institute of Mathematics announced a workshop on modeling data-driven solutions to the coronavirus pandemic. Albrecht quickly applied and was soon on a team of students and faculty from universities around the country, working in four time zones, brainstorming problems they could solve using mathematical and statistical models.
“It’s been hard, as someone who works in a semi-related field, to see all these papers and data coming out and not feel like I had the skills or time to do anything with it,” Albrecht said. “So it’s been nice to immerse myself in that.”
Data science is at the heart of countless decisions officials are making during the pandemic to help keep people safe. Predictions based on hospitalization and testing rates have informed decisions on the scale of lockdowns, mask-up orders and school reopenings. Employment, business and tax data have fueled battles in the halls of Congress over stimulus funding.
And data’s also in play on a smaller scale: Mines has partnered with COVIDCheck Colorado to launch voluntary “surveillance” testing on campus for those who don’t have symptoms (and who want to contribute to science) to minimize spread. “We will be using the testing data in mathematical and statistical models to simulate the disease spread on campus to determine which groups are most at risk and should be included in the testing pool and determine how frequently people should be tested,” said Albrecht, who was tapped to work on the program this fall.
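To give a feel for the kind of simulation Albrecht describes, here is a minimal sketch of a textbook compartment model with routine testing layered on top. It is not the model used at Mines: the function name, the population size and every rate below are hypothetical values chosen only for illustration, and testing is approximated crudely as isolating a fixed share of infected people each day.

```python
def simulate_campus(population=5000, initial_infected=5, beta=0.3, gamma=0.1,
                    test_interval_days=7, days=120):
    """Discrete-time SIR-style model with routine testing.

    All parameter values are hypothetical. Testing everyone once every
    `test_interval_days` days is approximated as isolating a fraction
    1 / test_interval_days of currently infected people per day.
    Pass None (or 0) for no testing.
    Returns the cumulative number of people ever infected.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    total_infected = float(initial_infected)
    isolation_rate = 1.0 / test_interval_days if test_interval_days else 0.0

    for _ in range(days):
        # New infections depend on contact between susceptible and infected people.
        new_infections = min(susceptible, beta * susceptible * infected / population)
        # People leave the infected pool by recovering or by being isolated after a test.
        removed = min(infected, (gamma + isolation_rate) * infected)
        susceptible -= new_infections
        infected += new_infections - removed
        total_infected += new_infections
    return total_infected


# Compare hypothetical testing schedules.
for interval in (3, 7, 14, None):
    label = "no testing" if interval is None else f"every {interval} days"
    print(label, round(simulate_campus(test_interval_days=interval)))
```

Running the comparison shows how such a model could be used to weigh testing frequency: under these assumed rates, more frequent testing removes infected people faster and lowers the cumulative count.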
Data science is also being used to tackle novel research questions, such as whether there are racial disparities in COVID-19 infection rates and outcomes (there are), but it’s also humming along in the background elsewhere. Companies that had already invested in data analysis and AI have made quick pivots to the rapid changes the pandemic brought, even in hard-hit sectors of the economy.
“When COVID hit, everyone started producing less oil,” said Nick Sellers ’18, a database developer at Engage, an oil-and-gas production services platform. “Our system just started calculating everything automatically.” Their system uses predictive calculations to determine whether there’s enough oil at a particular site to pick up or whether a worker needs to go to a site to service equipment. The quick data analysis prevented surpluses from getting stuck in one place when demand suddenly shifted, a scenario that temporarily sank oil into negative pricing. “We were able to get right ahead of that,” he said.
While data science is being wielded by experts to draw useful predictions and reveal real-time outcomes, large-scale data has been accessible to everyone, every day throughout the pandemic. Open-source projects, such as the COVID Tracking Project, have democratized COVID-19 data, providing an easy resource for data scientists to draw from to inform states on how to scale reopening (even the White House uses its data).
“It’s a great way just to keep track of the pulse of the pandemic,” said Doug Nychka, professor of applied math and statistics and co-director of Mines’ new graduate program in data science. “And the transparency is impressive—I can go to the New York Times and get the counts county by county across the U.S., and that’s fantastic.”
“That said, there’s a lot behind these numbers,” he continued. “For example, there might be delays in reporting. If these are test results that have been reported a week ago, that is not a snapshot of what is happening right now and may not give you a handle to get ahead of the infection. You are always playing catch-up. And there can be other less obvious problems and biases by taking the raw data at face value.”
That line of thinking is just the start of what happens when data scientists dig into the numbers—and the pandemic is producing a deluge of information. However, for a data scientist, the question is not whether the data set is big or small but whether it can be used to answer important questions.
“Our faculty at Mines deal with data issues like this all the time,” Nychka said. “For example, for a grad student collecting data off an instrument, there is always the issue of what does it mean? Beyond the data collection, there is always a next step of interpretation and modeling to make sense of the results. You use the raw data, of course, but you also apply a modeling framework to interpret it. And that’s a hook for why we need data science. Rarely is data by itself informative—it requires some analysis and assumptions to be useful.”
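Nychka’s earlier point about reporting delays can be made concrete with a few lines of code. The sketch below assumes, purely for illustration, that every result arrives exactly a fixed number of days late and that true daily cases double each day. Both the function name and the numbers are hypothetical, not real data.

```python
def reported_counts(actual, delay_days):
    """Return the series a dashboard would show if every count arrived delay_days late."""
    shown = actual[:max(len(actual) - delay_days, 0)]
    # Days with nothing reported yet appear as zero.
    return ([0] * delay_days + shown)[:len(actual)]


# Hypothetical true daily cases, doubling each day for ten days.
actual = [2 ** day for day in range(10)]
reported = reported_counts(actual, delay_days=7)

print("actual today:  ", actual[-1])
print("reported today:", reported[-1])
```

In this toy case the dashboard’s latest figure reflects conditions from a week earlier, which is why taking raw counts at face value can understate a fast-growing outbreak.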
Working the COVID-19 data
Data scientists can develop models and algorithms to attempt to answer questions in virtually any field, which means they often end up working in tandem with experts in other scientific disciplines. “There’s a very strong thread in our profession that we come into interesting research problems by working with people outside of the field,” said Nychka, who has worked on projects as varied as measles outbreaks, climate and transportation.
At AIM’s COVID-19 data workshop, participants worked alongside disease-modeling experts. “We spent a few weeks looking at just what mathematical and statistical models we could use for COVID,” Albrecht said. They split into groups, and her group dove into two projects.
They first looked at the relationship between air quality and COVID-19 transmission. “You have to control for how much lockdown is going on versus how bad is the air quality,” she said. “In the U.S., that’s a difficult thing to do right now, because it varies from region to region.” To account for this, her team decided to use data from one region of Italy. “Italy had a uniform response across the country. Essentially, they locked down their country at the same time.”
The team found a paper that previously had developed a similar model to look at the relationship between the flu and air quality and pulled in data on fine particulates and other factors that affect air quality. “We built a statistical model controlling for all these other things—temperature, humidity,” she said. “We used Google mobility data to find out how much people have been at home.”
Their results aren’t finalized, Albrecht said, but “it does seem like if the air quality gets bad enough, there is an increased risk. If you’re in the normal range of air quality, it’s not a risk.”
Theoretically, hospitals could use this sort of analysis to plan for a spike in new COVID-19 patients after an extended period of poor air quality.
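The idea of “controlling for” other factors in the team’s air-quality project can be illustrated with an ordinary least-squares regression. This is a minimal sketch, not the team’s model: the data are synthetic, the variable names and effect sizes are invented for the example, and a simple linear relationship is assumed even though Albrecht’s preliminary findings suggest the risk may only appear once air quality gets bad enough.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for daily observations (all values are made up).
pm25 = rng.uniform(5, 80, n)           # fine particulates
temperature = rng.normal(15, 8, n)     # degrees Celsius
humidity = rng.uniform(20, 90, n)      # percent
mobility = rng.uniform(-80, 0, n)      # percent change from a baseline

# Assumed "true" relationship used to generate the synthetic outcome.
true_pm25_effect = 0.02
log_cases = (1.0 + true_pm25_effect * pm25 - 0.01 * temperature
             + 0.005 * humidity + 0.03 * mobility
             + rng.normal(0, 0.3, n))

# Including the other factors as columns means the pm25 coefficient is
# estimated while holding temperature, humidity and mobility fixed.
X = np.column_stack([np.ones(n), pm25, temperature, humidity, mobility])
coef, *_ = np.linalg.lstsq(X, log_cases, rcond=None)
pm25_coef = coef[1]

print("estimated air-quality effect:", round(float(pm25_coef), 4))
```

Because the synthetic outcome was generated with a known effect, the estimate can be checked against it. With real data there is no such answer key, which is one reason the choice of control variables matters.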
For their second project, her team looked at the decrease in emissions during lockdown. “We looked at satellite data and a few countries—the biggest carbon emitters. We haven’t come to a conclusion yet, but we’re trying to see if we can quantify [emissions reductions] based on particular lockdown measures.”
This question has practical applications as well—policymakers could use the analysis to determine whether there’s a long-term sustainable solution amid the closing of workplaces and schools, such as having a certain percentage of the workforce stay home if it reduces emissions in a meaningful way.
Data challenges during the pandemic
When making predictions amid a deadly pandemic, the cost of an error in the numbers could be high. “I think that’s one of the things that makes working in COVID right now so difficult,” Albrecht said. “There are not a lot of consistencies across data sources, even city to city or country to country. You kind of have to be making large error bounds on your predictions. I don’t really know what the answer is to get better data integrity at the moment, but it’s definitely one of the biggest issues in trying to work with this right now.”
Quality control starts with the data and runs through the modeling. Good analysis is reproducible, Nychka said. “One thing that a data analysis should have is a trail of breadcrumbs from the original source of the data all the way to the figures or tables in the conclusions. This will allow someone else to reproduce all of their work. Reproducibility builds trust and objectivity in the conclusions, because then there’s nothing mysterious about what the person did, and they can see all the choices they made along the way.”
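One small way to leave the breadcrumbs Nychka describes is to record exactly which data and which settings produced a result. The sketch below is only one possible approach, with hypothetical function and field names: it stores a checksum of the raw input alongside the analysis parameters, so someone else can confirm they are starting from the same file.

```python
import hashlib
import json


def provenance_record(raw_bytes, source, parameters):
    """Describe where the data came from and how it was analyzed."""
    return {
        "source": source,
        # The checksum changes if even one byte of the raw data changes.
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "parameters": parameters,
    }


# Hypothetical raw data and analysis settings, for illustration only.
raw = b"date,cases\n2020-09-01,12\n2020-09-02,15\n"
record = provenance_record(
    raw,
    source="hypothetical county dashboard export",
    parameters={"smoothing_days": 7},
)

# Saved next to the figures or tables, this record documents the choices made.
print(json.dumps(record, indent=2, sort_keys=True))
```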
Mines’ new graduate program in data science, which launched this fall, will give students the skills to tackle complex data problems in many different areas of application. Though the program is a “big tent” that includes faculty across many different disciplines (a unique feature of the Mines program compared to other data science programs), Nychka said the goal was to zero in on data science as its own area of study and help students develop expertise in the specialized skills this field requires.
Mines added data science as a focus for undergraduates in computer science several years ago as well, and it’s already serving those who graduated from the program well. “I switched in 2017 as soon as they opened the new data science branch, so I graduated with that,” said Mady Deeter ’19. She found her job as a machine learning engineer at defense contractor CACI relatively quickly, she said. “It’s very versatile. Even other people I graduated with in computer science, we all work in very different places.”
As recent Mines graduates who are now working in data science, neither Deeter nor Sellers has been surprised to see data science emerge as essential in the pandemic and well beyond.
“There is no avoiding that data science will break into pretty much every field in the next few years, decades,” Sellers said. “Data science is going to drive a lot of our day-to-day lives, whether we realize it or recognize it or not.”