A Preliminary Metagenome Analysis Based on a Combination of Protein Domains

Metagenomic data have mainly been analyzed by showing the composition of organisms based on a small, well-examined part of the genomic sequence, such as ribosomal RNA genes or mitochondrial DNA. In contrast, whole metagenomic data obtained by shotgun sequencing have rarely been fully analyzed through homology searches because the genomic data available in databases for living organisms on earth are insufficient. To complement the results obtained through homology-search-based methods with shotgun metagenome data, we focused on the composition of protein domains deduced from the sequences of genomes and metagenomes and used it to characterize them. First, we compared the relationships based on similarities in protein domain composition with the relationships based on sequence similarity. We identified the protein domains of 325 bacterial species using the Pfam database. Next, we examined the correlation coefficients of protein domain compositions between every pair of bacteria. Every pairwise genetic distance was also calculated from 16S rRNA or DNA gyrase subunit B. We compared the results of these methods and found a moderate correlation between them. Essentially the same results were obtained when we used random partial 100 bp DNA sequences from the bacterial genomes, which simulated raw sequence data obtained from short-read next-generation sequencers. We then applied the method to actual environmental data obtained by shotgun sequencing. The method showed that a transition of the microbial phase occurred because of the seasonal change in water temperature. These results demonstrate the usability of the method for characterizing metagenomic data based on protein domain compositions.
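
The pairwise comparison of protein domain compositions described in this abstract can be illustrated with a minimal sketch. The abstract does not say which correlation coefficient was used, so Pearson correlation is assumed here. The function name `domain_correlation` is hypothetical, and the profiles are assumed to be dictionaries mapping a Pfam accession to the number of times that domain was detected in a genome.

```python
from math import sqrt


def domain_correlation(counts_a, counts_b):
    """Pearson correlation between two domain-count profiles.

    Each profile is a dict mapping a domain identifier to its count.
    A domain missing from one profile is treated as a count of zero.
    Pearson correlation is an assumption; the abstract only says
    "correlation coefficients".
    """
    # Use the union of domains so both vectors have the same axes.
    domains = sorted(set(counts_a) | set(counts_b))
    xs = [counts_a.get(d, 0) for d in domains]
    ys = [counts_b.get(d, 0) for d in domains]

    n = len(domains)
    if n == 0:
        raise ValueError("both profiles are empty")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)
```

Applying such a function to every pair of genomes would give a similarity matrix that could then be compared with pairwise genetic distances, which is the kind of comparison the abstract reports.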

Protecting Genomic Sequence Anonymity with Generalization Lattices

Methods of Information in Medicine ◽

10.1055/s-0038-1634025 ◽

2005 ◽

Vol 44 (05) ◽

pp. 687-692 ◽

Cited By ~ 18

Author(s):

B. A. Malin

Keyword(s):

DNA Sequences ◽

Genomic Sequence ◽

Sequence Data ◽

Personal Information ◽

Control Technique ◽

Single Nucleotide ◽

Specific Data ◽

Disclosure Control ◽

Genomic Privacy ◽

Nucleotide Region

Summary. Objectives: Current genomic privacy technologies assume the identity of genomic sequence data is protected if personal information, such as demographics, is obscured, removed, or encrypted. While demographic features can directly compromise an individual’s identity, recent research demonstrates such protections are insufficient because sequence data itself is susceptible to re-identification. To counteract this problem, we introduce an algorithm for anonymizing a collection of person-specific DNA sequences. Methods: The technique is termed DNA lattice anonymization (DNALA), and is based upon the formal privacy protection schema of k-anonymity. Under this model, it is impossible to observe or learn features that distinguish one genetic sequence from k-1 other entries in a collection. To maximize information retained in protected sequences, we incorporate a concept generalization lattice to learn the distance between two residues in a single nucleotide region. The lattice provides the most similar generalized concept for two residues (e.g., adenine and guanine are both purines). Results: The method is tested and evaluated with several publicly available human population datasets ranging in size from 30 to 400 sequences. Our findings imply the anonymization schema is feasible for the protection of sequence privacy. Conclusions: The DNALA method is the first computational disclosure control technique for general DNA sequences. Given the computational nature of the method, guarantees of anonymity can be formally proven. There is room for improvement and validation, though this research provides the groundwork from which future researchers can construct genomics anonymization schemas tailored to specific data-sharing scenarios.
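
The generalization lattice mentioned in this abstract can be illustrated with a small sketch. This is not the DNALA algorithm itself. It only shows the lookup of the most specific shared concept for two residues, assuming a three-level lattice (individual base, then purine or pyrimidine, then any base) written with the IUPAC codes R, Y, and N. The function names `generalize` and `generalize_pair` are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def generalize(a, b):
    """Return the most specific concept covering residues a and b.

    Assumed lattice: base -> R (purine) or Y (pyrimidine) -> N (any base).
    """
    if a not in BASES or b not in BASES:
        raise ValueError("residues must be one of A, C, G, T")
    if a == b:
        return a  # identical residues need no generalization
    if {a, b} <= PURINES:
        return "R"  # adenine and guanine are both purines
    if {a, b} <= PYRIMIDINES:
        return "Y"  # cytosine and thymine are both pyrimidines
    return "N"  # a purine and a pyrimidine only share "any base"


def generalize_pair(seq1, seq2):
    """Generalize two aligned, equal-length sequences position by position."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return "".join(generalize(a, b) for a, b in zip(seq1, seq2))
```

Replacing both sequences in a pair with their shared generalization makes them indistinguishable from each other, which corresponds to the k = 2 case of k-anonymity. How pairs are chosen to minimize information loss is the part of the method this sketch leaves out.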

Searching more genomic sequence with less memory for fast and accurate metagenomic profiling

10.1101/036681 ◽

2016 ◽

Author(s):

Shea N Gardner ◽

Sasha K Ames ◽

Maya B Gokhale ◽

Tom R Slezak ◽

Jonathan Allen

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Low Cost ◽

False Negative ◽

Human Microbiome ◽

Human Microbiome Project ◽

Metagenomic Data ◽

Reference Database ◽

Metagenomic Sequence

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large-scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library), which can be run with 24 GB of DRAM, an amount available on many clusters, or with 16 GB of DRAM plus a 24 GB low-cost commodity flash drive (NVRAM), a cost-effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial, or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real-world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.

Bovine Genome Analysis to Unravel the Location and Feature of Target Sites of RNA-Guided Hyperactivated Recombinase Gin with Spacer Length Six

Indian Journal of Animal Research ◽

10.18805/ijar.b-4693 ◽

2022 ◽

Author(s):

Shalu Kumari Pathak ◽

Arvind Sonwane ◽

Subodh Kumar

Keyword(s):

DNA Sequences ◽

Genomic Sequence ◽

Sequence Data ◽

Bovine Genome ◽

Search Pattern ◽

Spacer Length ◽

Guide RNA ◽

Target Sites ◽

EMBOSS Package ◽

Programmable Nucleases

Background: Programmable nucleases are very promising tools for genome editing (GE), but they suffer from limitations, including a potential risk of genotoxicity, which led to the exploration of a safer approach to GE based on the RNA-guided recombinase (RGR) platform. The RGR platform operates on a typical recognition or target site comprising the minimal pseudo-core recombinase site with a 5 to 6-base pair spacer flanking it; this whole central region is flanked by two guide RNA-specified DNA sequences, or Cas9 binding sites, followed by protospacer adjacent motifs (PAMs). Methods: The current study analyzes the entire cattle genome to prepare a detailed map of target sites for RNA-guided hyperactivated recombinase Gin with spacer length six. For this, chromosome-wise whole genomic sequence data were retrieved from Ensembl. A search pattern for recombinase Gin with spacer length six was then designed. Using this search pattern, RGR target sites were located with the dreg program of the EMBOSS package. Result: A total of 677 RGR target sites with spacer length six were identified in the bovine genome for recombinase Gin. We also investigated whether these RGR target sites are present within any gene and found that they lie in both genic and intergenic regions. In addition, descriptions of the genes associated with these target sites were identified.

Protein Domain Analysis of Genomic Sequence Data Reveals Regulation of LRR Related Domains in Plant Transpiration in Ficus

PLoS ONE ◽

10.1371/journal.pone.0108719 ◽

2014 ◽

Vol 9 (9) ◽

pp. e108719 ◽

Cited By ~ 3

Author(s):

Tiange Lang ◽

Kangquan Yin ◽

Jinyu Liu ◽

Kunfang Cao ◽

Charles H. Cannon ◽

...

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

Domain Analysis ◽

Protein Domain ◽

Plant Transpiration

Identification of Essential Protein Domains From High-density Transposon Insertion Sequencing

10.21203/rs.3.rs-589027/v1 ◽

2021 ◽

Author(s):

A.S.M. Zisanur Rahman ◽

Lukas Timmerman ◽

Flyn Gallardo ◽

Silvia T. Cardona

Keyword(s):

Unknown Function ◽

Plant Pathogens ◽

Bacterial Species ◽

Protein Domains ◽

Essential Genes ◽

High Density ◽

Protein Domain ◽

Burkholderia Cenocepacia ◽

Data Set ◽

Essential Protein

Abstract. A first clue to gene function can be obtained by examining whether a gene is required for life in certain standard conditions, that is, whether a gene is essential. In bacteria, essential genes are usually identified by high-density transposon mutagenesis followed by sequencing of insertion sites (Tn-seq). These studies assign the term “essential” to whole genes rather than the protein domain sequences that confer the essential functions. However, genes can code for multiple protein domains that evolve their functions independently. Therefore, when an essential gene codes for more than one protein domain, it is possible that only one of them is essential. In this study, we defined this subset of genes as “essential domain-containing” (EDC) genes. Using a Tn-seq data set built in Burkholderia cenocepacia K56-2, we developed an in silico pipeline to identify EDC genes and the essential protein domains they encode. We found forty candidate EDC genes and demonstrated growth defect phenotypes using CRISPR interference (CRISPRi). This analysis included two knockdowns of genes encoding the protein domains of unknown function DUF2213 and DUF4148. These essential domains are conserved in more than two hundred bacterial species, including human and plant pathogens. Together, our study suggests that essentiality should be assigned to individual protein domains rather than genes, contributing to a first functional characterization of protein domains of unknown function.

A Vector Representation of DNA Sequences Using Locality Sensitive Hashing

10.1101/726729 ◽

2019 ◽

Cited By ~ 1

Author(s):

Lizhen Shi ◽

Bo Chen

Keyword(s):

Natural Language ◽

Language Processing ◽

DNA Sequences ◽

Genomic Sequence ◽

Sequence Data ◽

Error Rates ◽

Locality Sensitive Hashing ◽

Alternative Methods ◽

Sequencing Error ◽

Training Time

Abstract. Drawing on the analogy between natural language and "genomic sequence language", we explored the applicability of word embeddings from natural language processing (NLP) to representing DNA reads in metagenomics studies. Here, the k-mer is the equivalent of a word in NLP, and it has been widely used in analyzing sequence data. However, directly replacing word embedding with k-mer embedding is problematic for two reasons. First, the number of k-mers is many times the number of words in NLP, making the model too big to be useful. Second, sequencing errors create many rare k-mers (noise), making the model hard to train. In this work, we leverage Locality Sensitive Hashing (LSH) to overcome these challenges. We then adopt the skip-gram with negative sampling model to learn k-mer embeddings. Experiments on labeled metagenomic datasets demonstrated that LSH can not only shorten training time and reduce the memory required to store the model, but also achieve higher accuracy than alternative methods. Finally, we demonstrate that the trained low-dimensional k-mer embeddings can potentially be used for accurate metagenomic read clustering and for predicting the taxonomy of reads, and that this method is robust to reads with high sequencing error rates (12-22%).
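
The idea of hashing k-mers into a smaller set of buckets can be illustrated with a minimal sketch. The abstract does not describe which LSH scheme the authors used, so random-hyperplane hashing (a SimHash-style scheme) over one-hot encoded k-mers is assumed here as a stand-in. All function names and parameters below are hypothetical.

```python
import random

BASES = "ACGT"


def one_hot(kmer):
    """Encode a k-mer as a flat one-hot vector of length 4 * k."""
    vec = []
    for base in kmer:
        if base not in BASES:
            raise ValueError("k-mer must contain only A, C, G, T")
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def make_hyperplanes(k, n_bits, seed=0):
    """Draw n_bits random hyperplanes for k-mers of length k."""
    rng = random.Random(seed)  # fixed seed keeps bucket ids reproducible
    return [[rng.gauss(0.0, 1.0) for _ in range(4 * k)] for _ in range(n_bits)]


def lsh_bucket(kmer, hyperplanes):
    """Map a k-mer to a bucket id in the range [0, 2 ** n_bits).

    Each hyperplane contributes one bit: the side of the plane on which
    the encoded k-mer falls. K-mers that differ in few positions have
    similar encodings and so tend to share more bits.
    """
    vec = one_hot(kmer)
    bucket = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bucket = (bucket << 1) | (1 if dot >= 0 else 0)
    return bucket
```

With bucket ids in place of raw k-mers, the vocabulary that an embedding model has to learn is capped at 2 ** n_bits entries regardless of k, and k-mers that differ only by a sequencing error have a chance of landing in the same bucket. Bucket ids could then be fed to a skip-gram model in place of words.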

Quantifying uncertainty of taxonomic placement in DNA barcoding and metabarcoding

10.1101/070573 ◽

2016 ◽

Cited By ~ 3

Author(s):

Panu Somervuo ◽

Douglas Yu ◽

Charles Xu ◽

Yinqiu Ji ◽

Jenni Hultman ◽

...

Keyword(s):

DNA Barcoding ◽

DNA Sequences ◽

Sequence Data ◽

Environmental Data ◽

Reference Sequence ◽

Future Research ◽

List Type ◽

Reference Databases ◽

Sequence Databases ◽

Taxonomic Groups

Abstract. A crucial step in the use of DNA markers for biodiversity surveys is the assignment of Linnaean taxonomies (species, genus, etc.) to sequence reads. This allows the use of all the information known based on the taxonomic names. Taxonomic placement of DNA barcoding sequences is inherently probabilistic because DNA sequences contain errors, because there is natural variation among sequences within a species, and because reference databases are incomplete and can have false annotations. However, most existing bioinformatics methods for taxonomic placement either exclude uncertainty, or quantify it using metrics other than probability. In this paper we evaluate the performance of a recently proposed probabilistic taxonomic placement method PROTAX by applying it to both annotated reference sequence data as well as unknown environmental data. Our four case studies include contrasting taxonomic groups (fungi, bacteria, mammals, and insects), variation in the length and quality of the barcoding sequences (from individually Sanger-sequenced sequences to short Illumina reads), variation in the structures and sizes of the taxonomies (from 800 to 130 000 species), and variation in the completeness of the reference databases (representing 15% to 100% of the species). Our results demonstrate that PROTAX yields essentially unbiased assessment of probabilities of taxonomic placement, and thus that its quantification of species identification uncertainty is reliable. As expected, the accuracy of taxonomic placement increases with increasing coverage of taxonomic and reference sequence databases, and with increasing ratio of genetic variation among taxonomic levels over within taxonomic levels. Our results show that reliable species-level identification from environmental samples is still challenging, and thus neglecting identification uncertainty can lead to spurious inference. A key aim for future research is the completion and pruning of taxonomic and reference sequence databases, and making these two types of data compatible.

Protein domain analysis from genomic sequence data revealed the regulation of LRR related domains in plant transpiration in Ficus

Journal of Bioequivalence & Bioavailability ◽

10.4172/0975-0851.s1.011 ◽

2013 ◽

Vol s4 (01) ◽

Author(s):

Jinyu Kunfang Cao

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

Domain Analysis ◽

Protein Domain ◽

Plant Transpiration

Comparative genomics identifies thousands of candidate structured RNAs in human microbiomes

Genome Biology ◽

10.1186/s13059-021-02319-w ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Brayon J. Fremin ◽

Ami S. Bhatt

Keyword(s):

Comparative Genomics ◽

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Human Microbiome ◽

Automated Analysis ◽

Metagenomic Data ◽

Coding Regions ◽

Intergenic Regions ◽

Computationally Intensive

Abstract. Background: Structured RNAs play varied bioregulatory roles within microbes. To date, hundreds of candidate structured RNAs have been predicted using informatic approaches that search for motif structures in genomic sequence data. The human microbiome contains thousands of species and strains of microbes. Yet, much of the metagenomic data from the human microbiome remains unmined for structured RNA motifs primarily due to computational limitations. Results: We sought to apply a large-scale, comparative genomics approach to these organisms to identify candidate structured RNAs. With a carefully constructed, though computationally intensive, automated analysis, we identify 3161 conserved candidate structured RNAs in intergenic regions, as well as 2022 additional candidate structured RNAs that may overlap coding regions. We validate the RNA expression of 177 of these candidate structures by analyzing small fragment RNA-seq data from four human fecal samples. Conclusions: This approach identifies a wide variety of candidate structured RNAs, including tmRNAs, antitoxins, and likely ribosome protein leaders, from a wide variety of taxa. Overall, our pipeline enables conservative predictions of thousands of novel candidate structured RNAs from human microbiomes.

Faculty Opinions recommendation of A likelihood ratio test of speciation with gene flow using genomic sequence data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.3540959.3240060 ◽

2010 ◽

Author(s):

Nicolas Galtier ◽

Julien Dutheil

Keyword(s):

Gene Flow ◽

Likelihood Ratio ◽

Likelihood Ratio Test ◽

Genomic Sequence ◽

Sequence Data ◽

Ratio Test
