Which animal viruses could infect humans? Computers are racing to find out.

Colin Carlson, a biologist at Georgetown University, has started to worry about mousepox.

The virus, discovered in 1930, spreads among mice, killing them with ruthless efficiency. But scientists have never considered it a potential threat to humans. Now Dr. Carlson, his colleagues and their computers are not so sure.

Using a technique known as machine learning, researchers have spent the past few years programming computers to teach themselves about viruses that can infect human cells. The computers have combed through vast amounts of information about the biology and ecology of the animal hosts of these viruses, as well as the genomes and other characteristics of the viruses themselves. Over time, the computers came to recognize certain factors that predict whether a virus has the potential to spill over into humans.

Once the computers had proved themselves on viruses that scientists had already studied intensively, Dr. Carlson and his colleagues applied them to the unknown, eventually producing a short list of animal viruses with the potential to jump the species barrier and cause outbreaks in humans.

In recent runs, the algorithms have unexpectedly placed the mousepox virus among the top-ranked risky pathogens.

“Every time we run this model, it comes up very high,” Dr. Carlson said.

Puzzled, Dr. Carlson and his colleagues dug through the scientific literature. They came across documentation of a long-forgotten outbreak in 1987 in rural China. Schoolchildren there came down with an infection that caused sore throats and inflammation in the arms and legs.

Years later, a team of scientists ran tests on throat swabs that had been collected during the outbreak and placed in storage. These samples, the group reported in 2012, contained mousepox DNA. But their study received little attention, and a decade later, mousepox is still not considered a threat to humans.

If the computer programmed by Dr. Carlson and his colleagues is correct, the virus deserves a new look.

“It’s just insane that this gets lost in a huge pile of stuff that public health has to sift through,” he said. “It actually changes the way we think about this virus.”

Scientists have identified about 250 human diseases that arose when an animal virus jumped the species barrier. HIV jumped from chimpanzees, for example, and the new coronavirus originated in bats.

Ideally, scientists would like to recognize the next spillover virus before it has started infecting humans. But there are far too many animal viruses for virologists to study. Scientists have identified more than 1,000 viruses in mammals, but that is most likely a tiny fraction of the true number. Some researchers suspect that mammals carry tens of thousands of viruses, while others put the number in the hundreds of thousands.

To identify potential new spillovers, researchers like Dr. Carlson are using computers to spot hidden patterns in scientific data. The machines can zero in on viruses that are particularly likely to cause human disease, for example, and can also predict which animals are most likely to harbor dangerous viruses that we don’t yet know about.

“It feels like you have a new set of eyes,” said Barbara Han, a disease ecologist at the Cary Institute of Ecosystem Studies in Millbrook, N.Y., who collaborates with Dr. Carlson. “You just can’t see in as many dimensions as the model can.”

Dr. Han first encountered machine learning in 2010. Computer scientists had been developing the technique for decades and were starting to build powerful tools with it. These days, machine learning allows computers to detect fraudulent credit charges and recognize people’s faces.

But few researchers had applied machine learning to diseases. Dr. Han wondered if she could use it to answer open questions, such as why fewer than 10 percent of rodent species harbor pathogens known to infect humans.

She fed a computer information about various rodent species from an online database, everything from their age at weaning to their population density. The computer then looked for features shared by the rodents known to harbor large numbers of species-jumping pathogens.

Once the computer had created a model, she tested it on another group of rodent species to see how well it could guess which ones carried such pathogens. Eventually, the model reached an accuracy of 90 percent.
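The train-then-test procedure described above can be illustrated with a toy sketch. The article does not say which algorithm was used, so a single lifespan cutoff stands in for the real model here. Every species record, the names `Species`, `fit_lifespan_threshold` and `accuracy`, and the numbers are invented purely for illustration.

```python
# Toy illustration only: every record below is invented, not real data,
# and a one-feature cutoff stands in for whatever model was actually used.
from dataclasses import dataclass


@dataclass
class Species:
    name: str
    max_lifespan_years: float
    carries_pathogen: bool  # the label the model tries to guess


def predicts_carrier(cutoff: float, species: Species) -> bool:
    """Rule: a species living no longer than the cutoff is predicted a carrier."""
    return species.max_lifespan_years <= cutoff


def fit_lifespan_threshold(training: list[Species]) -> float:
    """Return the cutoff that classifies the most training species correctly."""
    best_cutoff, best_correct = 0.0, -1
    for cutoff in sorted({s.max_lifespan_years for s in training}):
        correct = sum(
            predicts_carrier(cutoff, s) == s.carries_pathogen for s in training
        )
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


def accuracy(cutoff: float, held_out: list[Species]) -> float:
    """Fraction of unseen species whose label the rule guesses correctly."""
    hits = sum(predicts_carrier(cutoff, s) == s.carries_pathogen for s in held_out)
    return hits / len(held_out)


# Species used to build the rule (hypothetical).
training = [
    Species("rodent A", 1.5, True),
    Species("rodent B", 2.0, True),
    Species("rodent C", 2.5, True),
    Species("rodent D", 3.0, False),
    Species("rodent E", 6.0, False),
    Species("rodent F", 8.0, False),
]

# A separate group the rule never saw while being built (hypothetical).
held_out = [
    Species("rodent G", 1.0, True),
    Species("rodent H", 7.0, False),
    Species("rodent I", 4.0, True),  # a long-lived carrier the rule misses
    Species("rodent J", 9.0, False),
]

cutoff = fit_lifespan_threshold(training)
held_out_accuracy = accuracy(cutoff, held_out)
print(f"cutoff = {cutoff} years, held-out accuracy = {held_out_accuracy:.0%}")
```

The point of scoring the rule on a separate group is that accuracy measured on the training species alone would overstate how well it generalizes to species no one has examined yet.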

Dr. Han then turned to rodents that had yet to be examined for spillover pathogens and compiled a list of high-priority species. She and her colleagues predicted that species such as the montane vole and the northern grasshopper mouse of western North America would be particularly likely to carry worrisome pathogens.

Of all the traits that Dr. Han and her colleagues gave to their computer, the one that mattered most was the life span of the rodents. Species that die young turn out to carry more pathogens, perhaps because evolution led them to put more of their resources into reproducing than into building a strong immune system.

These results took years of painstaking research, during which Dr. Han and her colleagues combed ecological databases and scientific studies for useful data. More recently, researchers have sped up this work by building databases expressly designed to teach computers about viruses and their hosts.

In March, for example, Dr. Carlson and his colleagues unveiled an open-access database called VIRION, which holds half a million pieces of information about 9,521 viruses and their 3,692 animal hosts, and is still growing.

Databases like VIRION now make it possible to ask more focused questions about new pandemics. When the Covid pandemic struck, it soon became clear that it was caused by a new virus called SARS-CoV-2. Dr. Carlson, Dr. Han and their colleagues created programs to identify the animals most likely to harbor relatives of the new coronavirus.

SARS-CoV-2 belongs to a group of species called betacoronaviruses, which also includes the viruses that caused the SARS and MERS epidemics in humans. For the most part, betacoronaviruses infect bats. When SARS-CoV-2 was discovered in January 2020, 79 species of bats were known to carry betacoronaviruses.

But scientists have not systematically searched all 1,447 bat species for betacoronaviruses, and such a project would take many years to complete.

By feeding biological data about the various types of bats (their diet, the length of their wings and so on) into their computer, Dr. Carlson, Dr. Han and their colleagues created a model that could predict which bats are most likely to harbor betacoronaviruses. They found more than 300 species that fit the bill.

Since that prediction in 2020, researchers have indeed found betacoronaviruses in 47 bat species, all of which were on the prediction lists from some of the computer models they had created for their study.

Daniel Becker, a disease ecologist at the University of Oklahoma who also worked on the betacoronavirus study, said it was striking how simple traits such as body size could lead to powerful predictions about viruses. “A lot of it is the low-hanging fruit of comparative biology,” he said.

Dr. Becker is now following up on the list of potential betacoronavirus hosts from his own backyard. It turns out that some bats in Oklahoma are predicted to harbor the viruses.

If Dr. Becker does find a backyard betacoronavirus, he won’t be able to say right away that it poses an imminent threat to humans. Scientists would first have to carry out painstaking experiments to judge the risk.

Pranav Pandit, an epidemiologist at the University of California, Davis, cautions that these models are still very much a work in progress. When tested on well-studied viruses, they do substantially better than random chance, but they could do better.

“It’s not at the stage where we can just take these results and create an alert to start telling the world, ‘This is a zoonotic virus,’” he said.

Nardus Mollentze, a computational virologist at the University of Glasgow, and his colleagues have pioneered a method that could markedly improve the models’ accuracy. Rather than looking at a virus’s hosts, their models look at its genes. A computer can be taught to recognize subtle features in the genes of viruses that can infect humans.

In their first report on this method, Dr. Mollentze and his colleagues developed a model that could correctly recognize human-infecting viruses more than 70 percent of the time. Dr. Mollentze can’t yet say why his gene-based model worked, but he has some ideas. Our cells can recognize foreign genes and send an alarm signal to the immune system. Viruses that can infect our cells may have the ability to mimic our own DNA as a kind of viral camouflage.

When they applied the model to animal viruses, they came up with a list of 272 species at high risk of spilling over. That is too many for virologists to study in any depth.

“You can only work on so many viruses,” said Emmie de Wit, a virologist at Rocky Mountain Laboratories in Hamilton, Mont., who oversees research on the new coronavirus, influenza and other viruses. “On our end, we would really need to narrow it down.”

Dr. Mollentze acknowledged that he and his colleagues need to find a way to pinpoint the worst of the worst among animal viruses. “This is only a start,” he said.

To follow up on his initial study, Dr. Mollentze is working with Dr. Carlson and his colleagues to merge data about the genes of viruses with data about the biology and ecology of their hosts. The researchers are getting promising results from this approach, including the tantalizing mousepox lead.

Other kinds of data may make the predictions even better. One of the most important features of a virus, for example, is the coating of sugar molecules on its surface. Different viruses end up with different patterns of sugar molecules, and that arrangement can have a huge impact on their success. Some viruses can use this molecular coating to hide from their host’s immune system. In other cases, a virus can use its sugar molecules to latch onto new cells, triggering a new infection.

This month, Dr. Carlson and his colleagues posted a commentary online arguing that machine learning could gain a lot of insight from the sugar coating of viruses and their hosts. Scientists have already gathered much of that knowledge, but it has yet to be put into a form that computers can learn from.

“It seems to me that we know much more than we think,” said Dr. Carlson.

Dr. de Wit said that machine learning models could someday guide virologists like her toward the animal viruses most worth studying. “It will definitely be of great use,” she said.

But she noted that the models so far have focused mainly on a pathogen’s potential to infect human cells. Before causing a new human disease, a virus also has to spread from one person to another and cause serious symptoms along the way. She is waiting for a new generation of machine learning models that can make those predictions, too.

“We really want to know not necessarily what viruses can infect people, but what viruses can cause an outbreak,” she said. “So that’s really the next step that we need to think about.”