DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection

2018 ◽  
Vol 35 (14) ◽  
pp. 2498-2500 ◽  
Author(s):  
Ehsaneddin Asgari ◽  
Philipp C Münch ◽  
Till R Lesker ◽  
Alice C McHardy ◽  
Mohammad R K Mofrad

Abstract. Summary: Identifying distinctive taxa for microbiome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free, subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, substitutes standard operational taxonomic unit (OTU) clustering by segmenting 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to the state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively with the k-mer-based state-of-the-art approach in phenotype prediction while outperforming the OTU-based state-of-the-art approach in finding biomarkers, in both resolution and coverage, evaluated over known links from literature and synthetic benchmark datasets. Availability and implementation: DiTaxa is available under the Apache 2 license at http://llp.berkeley.edu/ditaxa. Supplementary information: Supplementary data are available at Bioinformatics online.

2018 ◽  
Author(s):  
Ehsaneddin Asgari ◽  
Philipp C. Münch ◽  
Till R. Lesker ◽  
Alice C. McHardy ◽  
Mohammad R.K. Mofrad

Abstract. Identifying combinations of taxa distinctive for microbiome-associated diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype classification and biomarker detection. This method and software, called DiTaxa, substitutes standard OTU clustering or sequence-level analysis by segmenting 16S rRNA reads into the most frequent variable-length subsequences. These subsequences are then used as the data representation for downstream phenotype prediction, biomarker detection and taxonomic analysis. Our proposed sequence segmentation, called nucleotide-pair encoding (NPE), is an unsupervised data-driven segmentation inspired by byte-pair encoding, a data compression algorithm. The identified subsequences represent commonly occurring sequence portions, which we found to be distinctive for taxa at varying evolutionary distances and highly informative for predicting host phenotypes. We compared the performance of DiTaxa to the state-of-the-art methods in disease phenotype prediction and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa identified 17 out of 29 taxa with confirmed links to periodontitis (recall = 0.59), relative to 3 out of 29 taxa (recall = 0.10) by the state-of-the-art method. On synthetic benchmark data, DiTaxa obtained full precision and recall in biomarker detection, compared to 0.91 and 0.90, respectively. In addition, machine-learning classifiers trained to predict host disease phenotypes based on the NPE representation performed competitively with the state of the art using OTUs or k-mers. For the rheumatoid arthritis dataset, DiTaxa substantially outperformed OTU features with a macro-F1 score of 0.76 compared to 0.65. Owing to its alignment- and reference-free nature, DiTaxa can run efficiently on large datasets. The full analysis of a large 16S rRNA dataset of 1359 samples required ≈1.5 hours on 20 cores, while the standard pipeline needed ≈6.5 hours in the same setting. Availability: An implementation of our method called DiTaxa is available under the Apache 2 licence at http://llp.berkeley.edu/ditaxa.
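The abstract describes NPE only as a segmentation inspired by byte-pair encoding, so the following is a minimal sketch of generic byte-pair-style merging applied to nucleotide strings, not the DiTaxa implementation. The function names (`learn_merges`, `apply_merge`, `segment`), the toy reads and the stopping rule are all assumptions made for illustration.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_merges(reads, num_merges):
    """Learn up to `num_merges` merges by repeatedly joining the most frequent adjacent pair."""
    segmented = [list(read) for read in reads]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in segmented:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best_pair, best_count = pair_counts.most_common(1)[0]
        if best_count < 2:  # hypothetical stopping rule: nothing repeats any more
            break
        merges.append(best_pair)
        segmented = [apply_merge(symbols, best_pair) for symbols in segmented]
    return merges, segmented

def segment(read, merges):
    """Segment a new read by replaying the learned merges in order."""
    symbols = list(read)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Toy reads, invented for illustration only.
reads = ["ACGTACGT", "ACGTTT", "GGACGT"]
merges, segmented = learn_merges(reads, num_merges=3)
print(merges)
print(segment("ACGTACGT", merges))
```

Frequent subsequences end up as single variable-length symbols, and their per-sample counts could then serve as features for a downstream classifier, which is the role the abstract assigns to the NPE representation.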


Author(s):  
Kexin Huang ◽  
Tianfan Fu ◽  
Lucas M Glass ◽  
Marinka Zitnik ◽  
Cao Xiao ◽  
...  

Abstract. Summary: Accurate prediction of drug–target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models have shown promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. Availability and implementation: https://github.com/kexinhuang12345/DeepPurpose. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


2016 ◽  
Vol 2016 ◽  
pp. 1-7 ◽  
Author(s):  
R. Alan Harris ◽  
Rajesh Shah ◽  
Emily B. Hollister ◽  
Rune Rose Tronstad ◽  
Nils Hovdenak ◽  
...  

Epigenetic and microbiome changes during pediatric development have been implicated as important elements in the developmental origins of inflammatory bowel diseases (IBDs) including Crohn’s disease (CD) and ulcerative colitis (UC), which are linked to early onset colorectal cancer (CRC). Colonic mucosal samples from 22 control children between 3.5 and 17.5 years of age were studied by Infinium HumanMethylation450 BeadChips and, in 10 cases, by 454 pyrosequencing of the bacterial 16S rRNA gene. Intercalating age-specific DNA methylation and microbiome changes were identified, which may have significant translational relevance in the developmental origins of IBD and CRC.


2008 ◽  
Vol 57 (12) ◽  
pp. 1569-1576 ◽  
Author(s):  
Tanja Kuehbacher ◽  
Ateequr Rehman ◽  
Patricia Lepage ◽  
Stephan Hellmig ◽  
Ulrich R. Fölsch ◽  
...  

TM7 is a recently described subgroup of Gram-positive uncultivable bacteria originally found in natural environmental habitats. An association of the TM7 bacterial division with the inflammatory pathogenesis of periodontitis has been previously shown. This study investigated TM7 phylogenies in patients with inflammatory bowel diseases (IBDs). The mucosal microbiota of patients with active Crohn's disease (CD; n=42) and ulcerative colitis (UC; n=31) was compared with that of controls (n=33). TM7 consortia were examined using molecular techniques based on 16S rRNA genes, including clone libraries, sequencing and in situ hybridization. TM7 molecular signatures could be cloned from mucosal samples of both IBD patients and controls, but the composition of the clone libraries differed significantly. Taxonomic analysis of the sequences revealed a higher diversity of TM7 phylotypes in CD (23 different phylotypes) than in UC (10) and non-IBD controls (12). All clone libraries showed a high number of novel sequences (21 for controls, 34 for CD and 29 for UC). A highly atypical base substitution for bacterial 16S rRNA genes associated with antibiotic resistance was detected in almost all sequences from CD (97.3 %) and UC (100 %) patients compared to only 65.1 % in the controls. TM7 bacteria might play an important role in IBD similar to that previously described in oral inflammation. The alterations of TM7 bacteria and the genetically determined antibiotic resistance of TM7 species in IBD could be a relevant part of a more general alteration of bacterial microbiota in IBD as recently found, e.g. as a promoter of inflammation at early stages of disease.


2020 ◽  
Vol 36 (18) ◽  
pp. 4675-4681 ◽  
Author(s):  
Yuansheng Liu ◽  
Limsoon Wong ◽  
Jinyan Li

Abstract. Motivation: A maximal match between two genomes is a contiguous non-extendable subsequence common to the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The coding cost of using these broken segments for reference-based genome compression is much higher than that of using a maximal match which is allowed to contain mutations. Results: We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme. The method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches to form long MCMs. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes better by 50%. Moreover, memRGC uses much less memory and decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission. Availability and implementation: https://github.com/yuansliu/memRGC. Supplementary information: Supplementary data are available at Bioinformatics online.
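The idea that a single mutation splits one long match into shorter segments, and that a mutation-containing match can stitch them back together, can be illustrated with a toy sketch. It assumes two already-aligned, equal-length strings and substitutions only; it is not memRGC's k-mer sampling search, and the names `match_runs`, `merge_runs` and `max_gap` are hypothetical.

```python
def match_runs(ref, tgt):
    """Return half-open (start, end) runs where two aligned strings agree."""
    if len(ref) != len(tgt):
        raise ValueError("this sketch assumes equal-length, pre-aligned strings")
    runs, start = [], None
    for i, (a, b) in enumerate(zip(ref, tgt)):
        if a == b:
            if start is None:
                start = i          # a new run of matching bases begins
        elif start is not None:
            runs.append((start, i))  # a mismatch ends the current run
            start = None
    if start is not None:
        runs.append((start, len(ref)))
    return runs

def merge_runs(runs, tgt, max_gap=1):
    """Join neighbouring runs separated by at most `max_gap` substituted bases.

    Each merged match keeps a list of (position, target base) pairs so the
    target can still be reconstructed exactly from the reference.
    """
    merged = []
    for start, end in runs:
        if merged and start - merged[-1]["end"] <= max_gap:
            prev = merged[-1]
            for pos in range(prev["end"], start):
                prev["mismatches"].append((pos, tgt[pos]))
            prev["end"] = end
        else:
            merged.append({"start": start, "end": end, "mismatches": []})
    return merged

# Toy example: one substitution (A -> T) at position 4.
ref = "ACGTACGTAC"
tgt = "ACGTTCGTAC"
runs = match_runs(ref, tgt)
mcms = merge_runs(runs, tgt)
print(runs)   # two broken segments
print(mcms)   # one match that records the substitution
```

Encoding one long match plus a short list of substitutions needs fewer records than encoding each broken segment separately, which is the cost argument the abstract makes.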


Author(s):  
Shubham Chandak ◽  
Kedar Tatwawadi ◽  
Srivatsan Sridhar ◽  
Tsachy Weissman

Abstract. Motivation: Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has potential to significantly reduce space requirements without adversely impacting performance of downstream applications. Results: We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35–50% further reduction in compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required for reaching a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications. Availability and implementation: The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation. Supplementary information: Supplementary data are available at Bioinformatics online.
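The size-versus-error tradeoff of lossy compression on a noisy integer signal can be sketched with a plain uniform quantizer followed by a general-purpose lossless compressor. This is a generic illustration, not either of the lossy time-series compressors evaluated in the paper; the synthetic signal, the `step` value and the use of `zlib` are assumptions chosen only to keep the example self-contained.

```python
import random
import struct
import zlib

def quantize(signal, step):
    """Map each integer sample to the index of its nearest multiple of `step`."""
    return [(x + step // 2) // step for x in signal]

def dequantize(indices, step):
    """Reconstruct samples; each differs from the original by at most step / 2."""
    return [q * step for q in indices]

def compressed_size(values):
    """Size in bytes after packing as 16-bit integers and compressing with zlib."""
    raw = struct.pack(f"<{len(values)}h", *values)
    return len(zlib.compress(raw, 9))

# Synthetic noisy signal, invented for illustration (not nanopore data).
random.seed(0)
signal = [500 + random.randint(-40, 40) for _ in range(10_000)]

step = 8  # hypothetical quantization step; larger means smaller output, larger error
indices = quantize(signal, step)
restored = dequantize(indices, step)

max_error = max(abs(a - b) for a, b in zip(signal, restored))
print("max absolute error:", max_error)
print("lossless bytes:", compressed_size(signal))
print("lossy bytes:", compressed_size(indices))
```

The quantizer bounds the per-sample error by half the step, so the step is the single knob trading reconstruction error against compressed size; whether a given error level is acceptable has to be judged on downstream accuracy, as the abstract does for basecalling and consensus.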


2020 ◽  
Author(s):  
Shubham Chandak ◽  
Kedar Tatwawadi ◽  
Srivatsan Sridhar ◽  
Tsachy Weissman

Abstract. Motivation: Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has potential to significantly reduce space requirements without adversely impacting performance of downstream applications. Results: We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35–50% further reduction in compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required for reaching a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications. Availability: The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation. Supplementary information: Supplementary data are available at Bioinformatics [email protected]

