Researchers at the University of São Paulo in Brazil fed data for different coronaviruses into a machine learning model. The results reinforced the role of flying mammals as the first reservoirs of the virus that caused the COVID-19 pandemic. The tool can be used in future emergencies.
Published on 08/22/2022
By André Julião | Agência FAPESP – A mathematical model developed at the University of São Paulo (USP) in Brazil with FAPESP’s support has confirmed that bats are the most probable hosts of SARS-CoV-2. The model can be used to predict which animals are likely to be infected by newly emerging coronaviruses.
An article on the study is published in the journal Scientific Reports.
“We fed spike protein data for different coronaviruses into a machine learning model and arrived at the bat as the most probable first host for SARS-CoV-2,” said Irina Yuri Kawashima, first author of the article. The study was part of her doctoral research in the graduate program in bioinformatics at the Institute of Mathematics and Statistics (IME-USP).
According to the researchers, the model is applicable to newly emerging viruses in the same family since it was able to identify hosts for SARS-CoV and MERS, which caused outbreaks in 2003 and 2012 respectively.
“The results serve as a warning of the need for tighter surveillance to assure early detection of novel viruses. Deforestation and climate change, among other factors, expose humanity to infection by viruses that already infect animals,” said Ronaldo Fumio Hashimoto, a professor at IME-USP supported by FAPESP and last author of the article.
Hashimoto pointed out that the results are corroborated by other recent research, including a study published last year in PLOS Pathogens, where bats were also found to be the most probable initial hosts for the coronavirus that started the COVID-19 pandemic.
“We expected the model to point first to humans as the initial hosts because the SARS-CoV-2 samples we used were mostly isolated from humans, but coronaviruses have coexisted with bats for a long time. Even when they jump to a new host, it takes a long time for the new host to become their main reservoir,” Kawashima said.
Novel method
Inspiration for the mathematical model came from an older article, published in 2015, on the hosts of other coronaviruses that emerged long before the COVID-19 pandemic began.
The USP group first applied the 2015 model to SARS-CoV-2, but it failed to predict a host consistent with the data then available.
“When we used the model that then existed, it concluded that the hosts were birds, and we realized we had to create our own tool,” said Marielton dos Passos Cunha, penultimate author of the recent article. Cunha is currently a postdoctoral researcher with the Pasteur-USP Scientific Platform (SPPU).
The USP group then changed the type of biological information used as the starting point for the analysis. While the 2015 model used dinucleotides (pairs of adjacent nucleotides) from spike protein sequences, the new model is based on a measure called relative synonymous codon usage (RSCU).
Codons are combinations of three nucleotides that encode a specific amino acid in viral RNA. Synonymous codons are different codons that encode the same amino acid. Every organism has a “preference” for certain synonymous codons over others. Known as codon usage bias and quantified by RSCU, this preference is a critical factor in determining gene expression and cellular function. It can be measured in viral RNA to obtain patterns linked to a virus’s ability to adapt to a host.
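To make the measure concrete, here is a minimal sketch of how RSCU can be computed for a single coding sequence. It uses the standard textbook definition (a codon's observed count divided by the count expected if all synonymous codons for that amino acid were used equally) and the standard genetic code. The function name `rscu` and the toy sequence are illustrative assumptions, not the authors' pipeline, which the article does not describe in detail.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(sequence):
    """Return {codon: RSCU} for one in-frame coding sequence (hypothetical helper).

    RSCU = observed count * number of synonymous codons / total count
    for that amino acid. Values above 1 mean the codon is used more often
    than expected under equal usage; values below 1 mean less often.
    """
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA letters
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group synonymous codons by the amino acid they encode (stops excluded).
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from the sequence: RSCU undefined
        for c in family:
            values[c] = counts[c] * len(family) / total
    return values


# Toy example: four leucine codons, three of them CUG and one UUA.
example = rscu("CUGCUGCUGUUA")
print(example["CTG"], example["TTA"])
```

In a host-prediction setting, a vector of such values per viral sequence could serve as the input features for a classifier; which classifier and which sequences to use are choices the sketch leaves open.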
When the researchers fed all the available data into the model, it accurately predicted the natural hosts for SARS-CoV-2 and other coronaviruses, and they expect it to be similarly capable of identifying the hosts for newly emerging viruses in the same family (Coronaviridae).
“Knowing the first host of the virus is very interesting to help focus the most basic research and for novel virus surveillance,” Cunha said. “Machine learning models work well and are relatively inexpensive to produce but depend heavily on the available data, so it’s vital to invest in research on novel viruses, collect samples from wild animals, sequence viral genomes and make all this available from public databases.”
The article “SARS-CoV-2 host prediction based on virus-host genetic features” is at: www.nature.com/articles/s41598-022-08350-6.
Source: https://agencia.fapesp.br/39414