SeqScreen: Accurate and Sensitive Functional Screening of Pathogenic Sequences via Ensemble Learning

2021 ◽  
Author(s):  
Advait Balaji ◽  
Bryce Kille ◽  
Anthony Kappell ◽  
Gene D. Godbold ◽  
Madeline Diep ◽  
...  

Modern benchtop DNA synthesis techniques and increased concern about emerging pathogens have elevated the importance of screening oligonucleotides for pathogens of concern. However, accurate and sensitive characterization of oligonucleotides is an open challenge for many of the current techniques and ontology-based tools. To address this gap, we have developed a novel software tool, SeqScreen, that can accurately and sensitively characterize short DNA sequences using a set of curated Functions of Sequences of Concern (FunSoCs), novel functional labels specific to microbial pathogenesis which describe the pathogenic potential of individual proteins. We show that our ensemble machine learning model, after training on these curations, can label sequences with FunSoCs via an imbalanced multi-class and multi-label classification task with high accuracy. In summary, SeqScreen represents a first step towards a novel paradigm of functionally informed pathogen characterization from genomic and metagenomic datasets. SeqScreen is open-source and freely available for download at: https://www.gitlab.com/treangenlab/seqscreen.

2021 ◽  
Vol 17 (3) ◽  
pp. e1009315
Author(s):  
Marylee L. Kapuscinski ◽  
Nicholas A. Bergren ◽  
Brandy J. Russell ◽  
Justin S. Lee ◽  
Erin M. Borland ◽  
...  

Bunyaviruses (Negarnaviricota: Bunyavirales) are a large and diverse group of viruses that include important human, veterinary, and plant pathogens. The rapid characterization of known and new emerging pathogens depends on the availability of comprehensive reference sequence databases that can be used to match unknowns, infer evolutionary and pathogenic potential, and make response decisions in an evidence-based manner. In this study, we determined the coding-complete genome sequences of 99 bunyaviruses in the Centers for Disease Control and Prevention’s Arbovirus Reference Collection, focusing on orthonairoviruses (family Nairoviridae), orthobunyaviruses (Peribunyaviridae), and phleboviruses (Phenuiviridae) that either completely or partially lacked genome sequences. These viruses had been collected over 66 years from 27 countries from vertebrates and arthropods representing 37 genera. Many of the viruses had been characterized serologically and through experimental infection of animals but were isolated in the pre-sequencing era. We took advantage of our unusually large sample size to systematically evaluate genomic characteristics of these viruses, including reassortment and co-infection. We corroborated our findings using several independent molecular and virologic approaches, including Sanger sequencing of 197 genome segments and plaque isolation of viruses from putative co-infected virus stocks. This study contributes to the described genetic diversity of bunyaviruses and will enhance the capacity to characterize emerging human pathogenic bunyaviruses.


2021 ◽  
Author(s):  
Jakub M Bartoszewicz ◽  
Ferdous Nasri ◽  
Melania Nowicka ◽  
Bernhard Y Renard

Background: Emerging pathogens are a growing threat, but large data collections and approaches for predicting the risk associated with novel agents are limited to bacteria and viruses. Pathogenic fungi, which also pose a constant threat to public health, remain understudied. Relevant, curated data remains comparatively scarce and scattered among many different sources, hindering the development of sequencing-based detection workflows for novel fungal pathogens. No prediction method working for agents across all three groups is available, even though the cause of an infection is often difficult to identify from symptoms alone. Results: We present a curated collection of fungal host range data, comprising records on human, animal and plant pathogens, as well as other plant-associated fungi, linked to publicly available genomes. We show that the resulting database can be used to predict the pathogenic potential of novel fungal species directly from DNA sequences with either sequence homology or deep learning. We develop learned, numerical representations of the collected genomes and show that the human pathogens are separable from non-human pathogens. Finally, we train multi-class models predicting if next-generation sequencing reads originate from novel fungal, bacterial or viral threats. Conclusions: The presented data collection enables accurate detection of novel pathogens from sequencing data. It is also a comprehensive resource that can find use beyond this particular task. This can include possible applications in proteomics and genomics, employing both machine learning and direct sequence comparison. Availability: The database and models are hosted at https://zenodo.org/record/5711852 and https://zenodo.org/record/5711877. Source code is available at https://gitlab.com/dacs-hpi/deepac.


2021 ◽  
Author(s):  
Gene D. Godbold ◽  
Anthony D. Kappell ◽  
Danielle S. LeSassier ◽  
Todd J. Treangen ◽  
Krista L. Ternus

To identify sequences with a role in microbial pathogenesis, we assessed the adequacy of their annotation by existing controlled vocabularies and sequence databases. Our goal was to regularize descriptions of microbial pathogenesis for improved integration with bioinformatic applications. Here we review the challenges of annotating sequences for pathogenic activity. We relate the categorization of more than 2750 sequences of pathogenic microbes through a controlled vocabulary called Functions of Sequences of Concern (FunSoCs). These allow for an ease of description by both humans and machines. We provide a subset of 220 fully annotated sequences in the supplementary material as examples. The use of this compact (∼30 terms) controlled vocabulary has potential benefits for research in microbial genomics, public health, biosecurity, biosurveillance, and the characterization of new and emerging pathogens.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Ryan Feehan ◽  
Meghan W. Franklin ◽  
Joanna S. G. Slusky

Metalloenzymes are 40% of all enzymes and can perform all seven classes of enzyme reactions. Because of the physicochemical similarities between the active sites of metalloenzymes and inactive metal binding sites, it is challenging to differentiate between them. Yet distinguishing these two classes is critical for the identification of both native and designed enzymes. Because of similarities between catalytic and non-catalytic metal binding sites, finding physicochemical features that distinguish these two types of metal sites can indicate aspects that are critical to enzyme function. In this work, we develop the largest structural dataset of enzymatic and non-enzymatic metalloprotein sites to date. We then use a decision-tree ensemble machine learning model to classify metals bound to proteins as enzymatic or non-enzymatic with 92.2% precision and 90.1% recall. Our model scores electrostatic and pocket lining features as more important than pocket volume, despite the fact that volume is the most quantitatively different feature between enzyme and non-enzymatic sites. Finally, we find our model has overall better performance in a side-to-side comparison against other methods that differentiate enzymatic from non-enzymatic sequences. We anticipate that our model’s ability to correctly identify which metal sites are responsible for enzymatic activity could enable identification of new enzymatic mechanisms and de novo enzyme design.
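For readers less familiar with the two metrics quoted in this abstract, the following minimal sketch shows how precision and recall are computed from confusion counts for a binary classifier (here, enzymatic versus non-enzymatic sites). The counts are hypothetical placeholders chosen for easy arithmetic, not values from the study, and the function name is an illustrative assumption.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts."""
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of actual positives that are found
    return precision, recall


# Hypothetical counts, for illustration only (not taken from the paper).
tp, fp, fn = 90, 10, 30
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A model can trade one metric against the other by moving its decision threshold, which is why abstracts of this kind usually report both.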


1987 ◽  
Vol 13 (6) ◽  
pp. 609-619 ◽  
Author(s):  
A. V. Gudkov ◽  
O. B. Chernova ◽  
A. R. Kazarov ◽  
B. P. Kopnin

2014 ◽  
Vol 80 (16) ◽  
pp. 4958-4967 ◽  
Author(s):  
Marjolaine Martin ◽  
Sophie Biver ◽  
Sébastien Steels ◽  
Tristan Barbeyron ◽  
Murielle Jam ◽  
...  

A metagenomic library was constructed from microorganisms associated with the brown alga Ascophyllum nodosum. Functional screening of this library revealed 13 novel putative esterase loci and two glycoside hydrolase loci. Sequence and gene cluster analysis showed the wide diversity of the identified enzymes and gave an idea of the microbial populations present during the sample collection period. Lastly, an endo-β-1,4-glucanase having less than 50% identity to sequences of known cellulases was purified and partially characterized, showing activity at low temperature and after prolonged incubation in concentrated salt solutions.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Anthony P. West ◽  
Joel O. Wertheim ◽  
Jade C. Wang ◽  
Tetyana I. Vasylyeva ◽  
Jennifer L. Havens ◽  
...  

Wide-scale SARS-CoV-2 genome sequencing is critical to tracking viral evolution during the ongoing pandemic. We develop the software tool, Variant Database (VDB), for quickly examining the changing landscape of spike mutations. Using VDB, we detect an emerging lineage of SARS-CoV-2 in the New York region that shares mutations with previously reported variants. The most common sets of spike mutations in this lineage (now designated as B.1.526) are L5F, T95I, D253G, E484K or S477N, D614G, and A701V. This lineage was first sequenced in late November 2020. Phylodynamic inference confirmed the rapid growth of the B.1.526 lineage. In concert with other variants, like B.1.1.7, the rise of B.1.526 appears to have extended the duration of the second wave of COVID-19 cases in NYC in early 2021. Pseudovirus neutralization experiments demonstrated that B.1.526 spike mutations adversely affect the neutralization titer of convalescent and vaccinee plasma, supporting the public health relevance of this lineage.
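The kind of query this abstract describes, finding genomes whose spike protein carries a given set of mutations, can be sketched with plain set operations. This is an illustration only: it does not use VDB's actual query syntax or data format, the genome identifiers and their mutation sets are invented placeholders, and treating the listed mutations as a strict "all of these, plus E484K or S477N" pattern is a simplifying assumption. Only the mutation names come from the abstract.

```python
# Assumed pattern: every mutation in REQUIRED, plus at least one from EITHER.
REQUIRED = {"L5F", "T95I", "D253G", "D614G", "A701V"}
EITHER = {"E484K", "S477N"}


def matches_pattern(mutations):
    """True if a genome's spike mutation set fits the assumed pattern."""
    return REQUIRED <= mutations and bool(EITHER & mutations)


# Hypothetical genomes, for illustration only.
genomes = {
    "genome_A": {"L5F", "T95I", "D253G", "E484K", "D614G", "A701V"},
    "genome_B": {"L5F", "T95I", "D253G", "S477N", "D614G", "A701V"},
    "genome_C": {"D614G"},
    "genome_D": {"L5F", "T95I", "D253G", "D614G", "A701V"},
}

hits = sorted(name for name, muts in genomes.items() if matches_pattern(muts))
print(hits)
```

Here `genome_D` is excluded because it has neither E484K nor S477N, and `genome_C` because it lacks most of the required mutations.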


1987 ◽  
Vol 7 (5) ◽  
pp. 1776-1781
Author(s):  
M Fukui ◽  
T Yamamoto ◽  
S Kawai ◽  
F Mitsunobu ◽  
K Toyoshima

Results of previous studies have shown that a raf-related transforming DNA sequence is present in NIH 3T3 transformants that are derived from GL-5-JCK human glioblastoma DNA transfection. The transforming DNA was molecularly cloned by using cosmid vector pJB8 to determine its structure and origin. Analyses of selected clones revealed that the transforming DNA consisted of three portions of human DNA sequences, with the 3' half of the c-raf-1 gene as its middle portion. This raf region was about 20 kilobases long and contained exons 8 to 17 and the poly(A) addition site. RNA blot analysis showed that the raf-related transforming DNA was transcribed into 5.3-, 4.8-, and 2.5-kilobase mRNAs; the 2.5-kilobase transcript was thought to be the major transcript. Immunoprecipitation analyses revealed that a 44-kilodalton raf-related protein was specifically expressed in the NIH 3T3 transformants. The raf-related transforming DNA was considered to be activated when its amino-terminal sequence was truncated and the DNA was coupled with a foreign promoter sequence. On hybridization analysis of the original GL-5-JCK glioblastoma DNA, no rearrangement of c-raf-1 was detectable in the tumor DNA. The rearrangement of c-raf-1 may have occurred during transfection or may have been present in a small population of the original tumor cells as a result of tumor progression.

