Deep embeddings to comprehend and visualize microbiome protein space

2021 ◽  
Author(s):  
Krzysztof Odrzywolek ◽  
Zuzanna Karwowska ◽  
Jan Majta ◽  
Aleksander Byrski ◽  
Kaja Milanowska-Zabel ◽  
...  

Understanding the function of microbial proteins is essential to reveal the clinical potential of the microbiome. The application of high-throughput sequencing technologies allows for fast and increasingly cheap acquisition of data from microbial communities. However, many of the inferred protein sequences are novel and not catalogued, hence the possibility of predicting their function through conventional homology-based approaches is limited. Here, we leverage a deep-learning-based representation of proteins to assess its utility in alignment-free analysis of microbial proteins. We trained a language model on the Unified Human Gastrointestinal Protein catalogue and validated the resulting protein representation on the bacterial part of the SwissProt database. Finally, we present a use case on proteins involved in short-chain fatty acid (SCFA) metabolism. Results indicate that our model (ArdiMiPE) accurately represents features related to protein structure and function, allowing for alignment-free protein analyses. Technologies such as ArdiMiPE that contextualize metagenomic data are a promising direction for understanding the microbiome in depth.

Minerals ◽  
2018 ◽  
Vol 8 (12) ◽  
pp. 596 ◽  
Author(s):  
Shuang Zhou ◽  
Min Gan ◽  
Jianyu Zhu ◽  
Xinxing Liu ◽  
Guanzhou Qiu

It is widely known that bioleaching microorganisms must cope with a complex, extreme environment in which microbial ecology, in terms of community structure and function, varies across environmental types. However, analysis of the microbial ecology of bioleaching bacteria remains a challenge. To address this challenge, numerous technologies have been developed. In recent years, high-throughput sequencing technologies, which bring comprehensive sequencing analysis of cellular RNA and DNA within the reach of most laboratories, have been added to the toolbox of microbial ecology. Next-generation sequencing of DNA can produce draft genomic sequences for more bioleaching bacteria, which provides the opportunity to build predictive models of their genetic and metabolic potential and ultimately deepens our understanding of bioleaching microorganisms. High-throughput sequencing that targets the phylogenetic marker 16S rRNA has been effectively applied to characterize community diversity in ore leaching environments. RNA-seq, another application of high-throughput sequencing that profiles RNA, can be used for both mapping and quantifying the transcriptome and has demonstrated high efficiency in quantifying the changing expression level of each transcript under different conditions. It has proven to be a powerful tool for dissecting the relationship between genotype and phenotype, helping to interpret functional elements of the genome and to reveal molecular mechanisms of adaptation. This review describes the high-throughput sequencing approach for bioleaching environmental microorganisms, focusing on its applications and the associated challenges.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Chong Chu ◽  
Rebeca Borges-Monroy ◽  
Vinayak V. Viswanadham ◽  
Soohyun Lee ◽  
Heng Li ◽  
...  

Abstract Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.


Water ◽  
2021 ◽  
Vol 13 (22) ◽  
pp. 3155
Author(s):  
Shumin Liu ◽  
Fengbin Zhao ◽  
Xin Fang

Phytoplankton and bacterioplankton play a vital role in the structure and function of aquatic ecosystems, and their activity is closely linked to water eutrophication. However, few researchers have considered the temporal and spatial succession of phytoplankton and bacterioplankton, and their responses to environmental factors. The temporal and spatial succession of bacterioplankton and their ecological interaction with phytoplankton and water quality were analyzed using 16S rDNA high-throughput sequencing for their identification, and the functions of bacterioplankton were predicted. The results showed that the dominant classes of bacterioplankton in the Qingcaosha Reservoir were Gammaproteobacteria, Alphaproteobacteria, Actinomycetes, Acidimicrobiia, and Cyanobacteria. In addition, the Shannon diversity indices were compared; they showed significant temporal differences based on monthly averaged values but no significant spatial differences. The community structure was found to be mainly influenced by phytoplankton density and biomass, dissolved oxygen, and electrical conductivity. The presence of Pseudomonas and Legionella was positively correlated with that of Pseudanabaena sp., and Sphingomonas and Paragonimus with Melosira granulata. On the contrary, the presence of Planctomycetes was negatively correlated with Melosira granulata, as was Deinococcus-Thermus with Cyclotella sp. The relative abundance of denitrifying bacteria decreased from April to December, while the abundance of nitrogen-fixing bacteria increased. This study provides a scientific basis for understanding the ecological interactions between bacteria, algae, and water quality in reservoir ecosystems.
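
The Shannon diversity index used in the abstract above is conventionally defined as H' = -Σ p_i ln p_i, where p_i is the relative abundance of taxon i. The following is a minimal sketch of that calculation, not the authors' pipeline: the function name `shannon_index` and the per-month read counts are hypothetical and invented purely for illustration.

```python
import math


def shannon_index(counts):
    """Return the Shannon diversity H' = -sum(p_i * ln(p_i)).

    `counts` is a sequence of non-negative abundances (e.g. reads per taxon).
    Taxa with zero counts contribute nothing to the sum.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


if __name__ == "__main__":
    # Hypothetical read counts per taxon for two sampling months
    # (illustrative values only, not data from the study).
    monthly_counts = {
        "April": [120, 80, 40, 10],
        "December": [200, 30, 15, 5],
    }
    for month, counts in monthly_counts.items():
        print(month, round(shannon_index(counts), 3))
```

A more even community yields a higher H', so comparing the index across months or sites gives the kind of temporal and spatial contrast the abstract reports.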


Viruses ◽  
2021 ◽  
Vol 13 (10) ◽  
pp. 2006
Author(s):  
Anna Y Budkina ◽  
Elena V Korneenko ◽  
Ivan A Kotov ◽  
Daniil A Kiselev ◽  
Ilya V Artyushin ◽  
...  

According to various estimates, only a small percentage of existing viruses have been discovered, and even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, enabling large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms have yet to be assigned specific loci for identification. This problem particularly impedes viral screening because of the vast heterogeneity of viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, it was shown that alignment-based methods were unable to identify the taxon for a large proportion of reads, and we additionally applied other approaches, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity and therefore necessitates the use of combined approaches, including those based on machine learning methods.


2020 ◽  
Vol 21 (22) ◽  
pp. 8774
Author(s):  
Natalia Komarova ◽  
Daria Barkova ◽  
Alexander Kuznetsov

Aptamers are nucleic acid ligands that bind specifically to a target of interest. Aptamers have gained in popularity due to their high potential for different applications in analysis, diagnostics, and therapeutics. The procedure called systematic evolution of ligands by exponential enrichment (SELEX) is used for aptamer isolation from large nucleic acid combinatorial libraries. The huge number of unique sequences implemented in the in vitro evolution in the SELEX process imposes the necessity of performing extensive sequencing of the selected nucleic acid pools. High-throughput sequencing (HTS) meets this demand of SELEX. Analysis of the data obtained from sequencing of the libraries produced during and after aptamer isolation provides an informative basis for precise aptamer identification and for examining the structure and function of nucleic acid ligands. This review discusses the technical aspects and the potential of the integration of HTS with SELEX.


2020 ◽  
Vol 36 (11) ◽  
pp. 3365-3371
Author(s):  
Yaxin Xue ◽  
Anders Lanzén ◽  
Inge Jonassen

Abstract Motivation: Technological advances in meta-transcriptomics have enabled a deeper understanding of the structure and function of microbial communities. ‘Total RNA’ meta-transcriptomics, sequencing of total reverse transcribed RNA, provides a unique opportunity to investigate both the structure and function of active microbial communities from all three domains of life simultaneously. A major step of this approach is the reconstruction of full-length taxonomic marker genes such as the small subunit ribosomal RNA. However, current tools for this purpose are mainly targeted towards analysis of amplicon and metagenomic data and thus lack the ability to handle the massive and complex datasets typically resulting from total RNA experiments. Results: In this work, we introduce MetaRib, a new tool for reconstructing ribosomal gene sequences from total RNA meta-transcriptomic data. MetaRib is based on the popular rRNA assembly program EMIRGE, together with several improvements. We address the challenge posed by large complex datasets by integrating sub-assembly, dereplication and mapping in an iterative approach, with additional post-processing steps. We applied the method to both simulated and real-world datasets. Our results show that MetaRib can handle larger datasets and recover more rRNA genes, achieving around a 60-fold speedup and a higher F1 score than EMIRGE on simulated datasets. On the real-world dataset, it shows similar trends but recovers more contigs than a previous analysis based on random sub-sampling, while enabling the comparison of individual contig abundances across samples for the first time. Availability and implementation: The source code of MetaRib is freely available at https://github.com/yxxue/MetaRib. Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Yang Young Lu ◽  
Jiaxing Bai ◽  
Yiwen Wang ◽  
Ying Wang ◽  
Fengzhu Sun

Abstract Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high-throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10²–10⁴-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/. Supplementary information: Supplementary data are available at Bioinformatics online.
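
To make the memory problem described above concrete, here is a minimal sketch of plain k-mer frequency comparison, the baseline technique the abstract refers to. It is not CRAFT's learned embedding; the helper names `kmer_counts`, `cosine_distance`, and `dense_vector_size` are hypothetical, and cosine distance is only one of many alignment-free measures.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in a DNA sequence, skipping any k-mer
    that contains a character other than A, C, G, or T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("both sequences must contain at least one valid k-mer")
    return 1.0 - dot / (norm_a * norm_b)


def dense_vector_size(k):
    """Number of entries in a dense DNA k-mer frequency vector (4**k)."""
    return 4 ** k


if __name__ == "__main__":
    x = kmer_counts("ACGTACGTAC", 3)
    y = kmer_counts("ACGTTCGTAC", 3)
    print(cosine_distance(x, y))
    # The dense vector grows exponentially with k.
    for k in (6, 12, 16):
        print(k, dense_vector_size(k))
```

Because a dense vector has 4^k entries, moving from k = 6 to k = 16 multiplies the storage per sequence by roughly a million, which is the cost that compact embeddings aim to avoid.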


2021 ◽  
Author(s):  
Jaspreet Singh ◽  
Thomas Litfin ◽  
Jaswinder Singh ◽  
Kuldip Paliwal ◽  
Yaoqi Zhou

Motivation: Accurate prediction of protein contact maps is essential for accurate protein structure and function prediction. As a result, many methods have been developed for protein contact map prediction. However, most contact map prediction methods rely on protein sequence evolutionary information, which may not exist for many proteins due to lack of sequence homology. Moreover, generating evolutionary profiles is computationally intensive and time consuming. Therefore, we developed a contact map predictor utilizing the output of a pre-trained language model, ESM-1B, as an input along with a large training set and an ensemble of residual neural networks. Results: We showed that the proposed method makes a significant improvement over the single-sequence-based predictor SSCpred, with a 15% improvement in the F1-score for the independent CASP14-FM test set. It also outperforms the evolutionary-profile-based methods TrRosetta and SPOT-Contact, with 48.7% and 48.5% respective improvements in the F1-score on the proteins in the SPOT-2018 set without homologs (Neff=1). The new method provides a much faster and reasonably accurate alternative to profile-based methods, which is particularly useful for large-scale prediction.
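
The F1-score reported above is the harmonic mean of precision and recall. As a minimal sketch of how it could be computed for a contact map, assume predicted and true contacts are each given as a collection of residue index pairs; the function name `contact_f1` and the example pairs are hypothetical, and the paper's exact evaluation protocol (distance cutoff, sequence separation) is not reproduced here.

```python
def contact_f1(predicted, true):
    """Return the F1-score between predicted and true residue contacts.

    Each argument is an iterable of (i, j) residue index pairs. Pairs are
    treated as unordered, so (3, 9) and (9, 3) denote the same contact.
    """
    pred_set = {tuple(sorted(pair)) for pair in predicted}
    true_set = {tuple(sorted(pair)) for pair in true}
    true_positives = len(pred_set & true_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical contacts, for illustration only.
    predicted = [(1, 5), (2, 8), (3, 9)]
    true = [(5, 1), (2, 8), (4, 10), (6, 12)]
    print(contact_f1(predicted, true))
```

In this toy case two of three predictions are correct (precision 2/3) and two of four true contacts are found (recall 1/2), giving F1 = 4/7.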


2020 ◽  
Vol 78 (Supplement_3) ◽  
pp. 75-78
Author(s):  
Niv Zmora

Abstract Precision medicine has become the mainstay of modern therapeutics, especially for neoplastic disease, but this paradigm does not commonly prevail in dietary planning. Compelling evidence suggests that individual features, including the structure and function of the gut microbiota, contribute to harvesting and metabolizing energy from food, and thereby modulate the host metabolic phenotype and glucose homeostasis. Here, the concept of precision to dietary planning is highlighted by demonstrating the role of the microbiota in glucose intolerance in response to noncaloric artificial sweeteners, and by linking the microbiota and other host features to postprandial increases in blood glucose. These findings highlight the heterogeneity that exists among humans, which translates into divergent metabolic responses to similar food and warrants the adoption of next-generation sequencing technologies and advanced bioinformatics to revolutionize nutrition studies, laying the groundwork for an individually focused tailor-made practice.


Antibodies ◽  
2019 ◽  
Vol 8 (2) ◽  
pp. 29 ◽  
Author(s):  
Lefranc ◽  
Lefranc

At the 10th Human Genome Mapping (HGM10) Workshop, in New Haven, for the first time, immunoglobulin (IG) or antibody and T cell receptor (TR) variable (V), diversity (D), joining (J), and constant (C) genes were officially recognized as ‘genes’, as were the conventional genes. Under these HGM auspices, IMGT®, the international ImMunoGeneTics information system® (http://www.imgt.org), was created in June 1989 at Montpellier (University of Montpellier and CNRS). The creation of IMGT® marked the birth of immunoinformatics, a new science, at the interface between immunogenetics and bioinformatics. The accuracy and the consistency between genes and alleles, sequences, and three-dimensional (3D) structures are based on the IMGT Scientific chart rules generated from the IMGT-ONTOLOGY axioms and concepts: IMGT standardized keywords (IDENTIFICATION), IMGT gene and allele nomenclature (CLASSIFICATION), IMGT standardized labels (DESCRIPTION), IMGT unique numbering and IMGT Collier de Perles (NUMEROTATION). These concepts provide IMGT® immunoinformatics insights for antibody V and C domain structure and function, used for the standardized description in IMGT® web resources, databases and tools, immune repertoires analysis, single cell and/or high-throughput sequencing (HTS, NGS), antibody humanization, and antibody engineering in relation with effector properties.

