high throughput sequencing data Latest Research Papers

High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in solving the experiment discovery problem and building compressed representations of annotated de Bruijn graphs where k-mer sets can be efficiently indexed and interactively queried. However, approaches for representing and retrieving other quantitative attributes such as gene expression or genome positions in a general manner have yet to be developed. In this work, we propose the concept of Counting de Bruijn graphs generalizing the notion of annotated (or colored) de Bruijn graphs. Counting de Bruijn graphs supplement each node-label relation with one or many attributes (e.g., a k-mer count or its positions in the genome). To represent them, we first observe that many schemes for the representation of compressed binary matrices already support the rank operation on the columns or rows, which can be used to define an inherent indexing of any additional quantitative attributes. Based on this property, we generalize these schemes and introduce a new approach for representing non-binary sparse matrices in compressed data structures. Finally, we notice that relation attributes are often easily predictable from a node's local neighborhood in the graph. Notable examples are genome positions shifting by 1 for neighboring nodes in the graph, or expression levels that are often shared across neighbors. We exploit this regularity of graph annotations and apply an invertible delta-like coding to achieve better compression. We show that Counting de Bruijn graphs index k-mer counts from 2,652 human RNA-Seq read sets in representations over 8-fold smaller and yet faster to query compared to state-of-the-art bioinformatics tools.
Furthermore, Counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip -9 for human Illumina RNA-Seq and 57% smaller for PacBio HiFi sequencing of viral samples. A complete joint searchable index of all viral PacBio SMRT reads from NCBI's SRA (152,884 read sets, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, they generate a lossless and fully queryable index that is 4.4-fold smaller compared to the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools employing de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed and fully searchable graph-based sequence indexes.
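
The two ideas in this abstract, using rank on a binary column to index extra attributes and delta-like coding of predictable attributes, can be illustrated with a toy sketch. Everything below is a hypothetical simplification and not the authors' implementation: `CountedColumn`, `delta_encode`, and `delta_decode` are illustrative names, a sorted Python list stands in for a compressed bit vector, and the delta coder assumes the nodes are visited along a single path where each position is predicted as the previous position plus 1.

```python
from bisect import bisect_left


class CountedColumn:
    """One annotation column: the nodes carrying a label, plus a count for each."""

    def __init__(self, counts_by_node):
        # Sorted node ids play the role of the set bits of a binary column.
        self.nodes = sorted(counts_by_node)
        # Counts are stored densely, in the same order as the set bits.
        self.counts = [counts_by_node[n] for n in self.nodes]

    def rank(self, node):
        """Number of set bits strictly before `node`."""
        return bisect_left(self.nodes, node)

    def contains(self, node):
        r = self.rank(node)
        return r < len(self.nodes) and self.nodes[r] == node

    def count(self, node):
        """Count attached to (node, label), or 0 if the relation is absent."""
        # rank(node) is exactly the index of this node's count in the dense array.
        return self.counts[self.rank(node)] if self.contains(node) else 0


def delta_encode(path_positions):
    """Replace each position by its deviation from 'previous position + 1'."""
    residuals = []
    predicted = 0
    for pos in path_positions:
        residuals.append(pos - predicted)
        predicted = pos + 1
    return residuals


def delta_decode(residuals):
    """Invert delta_encode."""
    positions = []
    predicted = 0
    for res in residuals:
        pos = predicted + res
        positions.append(pos)
        predicted = pos + 1
    return positions


column = CountedColumn({3: 7, 10: 2, 42: 5})
print(column.count(10), column.count(11))
print(delta_encode([100, 101, 102, 250, 251]))
```

In the encoded list most residuals are zero, which is what makes the annotation cheap to compress afterwards; the decoder recovers the original positions exactly.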

ChiTaH: a fast and accurate tool for identifying known human chimeric sequences from high-throughput sequencing data

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab112 ◽

2021 ◽

Vol 3 (4) ◽

Author(s):

Rajesh Detroja ◽

Alessandro Gorohovski ◽

Olawumi Giwa ◽

Gideon Baum ◽

Milana Frenkel-Morgenstern

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Complex Disease ◽

Single Cells ◽

Reference Database ◽

Sequencing Data ◽

Sequencing Technologies ◽

High Throughput Sequencing Data ◽

Chimeric Rnas ◽

Sensitivity Specificity

Abstract Fusion genes or chimeras typically comprise sequences from two different genes. The chimeric RNAs of such joined sequences often serve as cancer drivers. Identifying such driver fusions in a given cancer or complex disease is important for diagnosis and treatment. The advent of next-generation sequencing technologies, such as DNA-Seq or RNA-Seq, together with the development of suitable computational tools, has made the global identification of chimeras in tumors possible. However, the testing of over 20 computational methods showed these to be limited in terms of chimera prediction sensitivity, specificity, and accurate quantification of junction reads. These shortcomings motivated us to develop the first ‘reference-based’ approach termed ChiTaH (Chimeric Transcripts from High-throughput sequencing data). ChiTaH uses 43,466 non-redundant known human chimeras as a reference database to map sequencing reads and to accurately identify chimeric reads. We benchmarked ChiTaH and four other methods to identify human chimeras, leveraging both simulated and real sequencing datasets. ChiTaH was found to be the most accurate and fastest method for identifying known human chimeras from simulated and real sequencing datasets. Moreover, ChiTaH uncovered heterogeneity of the BCR-ABL1 chimera in both bulk and single cells of the K-562 cell line, which was confirmed experimentally.
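
To make the reference-based idea concrete, the following toy sketch counts reads that span the junction of a known chimera. It is only an illustration under stated assumptions and not ChiTaH's actual pipeline: exact substring matching stands in for a real read mapper, and the `Chimera` record, the `junction_reads` function, the `min_anchor` parameter, and the sequences are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chimera:
    name: str
    sequence: str
    junction: int  # index of the first base contributed by the second gene


def junction_reads(chimera, reads, min_anchor=5):
    """Return reads that match the chimera and cover the junction
    with at least `min_anchor` bases on each side."""
    hits = []
    for read in reads:
        start = chimera.sequence.find(read)
        while start != -1:
            end = start + len(read)
            spans_left = start <= chimera.junction - min_anchor
            spans_right = end >= chimera.junction + min_anchor
            if spans_left and spans_right:
                hits.append(read)
                break
            # Try the next occurrence of the read in the reference.
            start = chimera.sequence.find(read, start + 1)
    return hits


# Hypothetical reference entry: 10 bases from one gene, then 10 from another.
toy = Chimera(name="toy_chimera", sequence="ACGTACGTAC" + "TTGGCCAATT", junction=10)
reads = ["TACGTACTTGGCC", "ACGTACGT", "GGCCAATT", "GGGG"]
print(junction_reads(toy, reads))
```

Only the first read is reported, because the others either fall entirely on one side of the junction or do not match the reference at all. Requiring an anchor on both sides is one simple way to avoid counting reads that merely touch the junction.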

Productive visualization of high-throughput sequencing data using the SeqCode open portable platform

Scientific Reports ◽

10.1038/s41598-021-98889-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Enrique Blanco ◽

Mar González-Ramírez ◽

Luciano Di Croce

Keyword(s):

High Throughput ◽

Large Scale ◽

High Throughput Sequencing ◽

Graphical Analysis ◽

Sequencing Data ◽

Efficient Manner ◽

Link Type ◽

High Throughput Sequencing Data ◽

Almost All ◽

User Friendly

Abstract Large-scale sequencing techniques to chart genomes are entirely consolidated. Stable computational methods to perform primary tasks such as quality control, read mapping, peak calling, and counting are likewise available. However, there is a lack of uniform standards for graphical data mining, which is also of central importance. To fill this gap, we developed SeqCode, an open suite of applications that analyzes sequencing data in an elegant but efficient manner. Our software is a portable resource written in ANSI C that can be expected to work for almost all genomes in any computational configuration. Furthermore, we offer a user-friendly front-end web server that integrates SeqCode functions with other graphical analysis tools. Our analysis and visualization toolkit represents a significant improvement in terms of performance and usability compared to other existing programs. Thus, SeqCode has the potential to become a key multipurpose instrument for high-throughput professional analysis; further, it provides an extremely useful open educational platform for the worldwide scientific community. The SeqCode website is hosted at http://ldicrocelab.crg.eu, and the source code is freely distributed at https://github.com/eblancoga/seqcode.

HTSQualC is a flexible and one-step quality control software for high-throughput sequencing data analysis

Scientific Reports ◽

10.1038/s41598-021-98124-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Renesh Bedre ◽

Carlos Avila ◽

Kranthi Mandadi

Keyword(s):

Quality Control ◽

High Throughput ◽

High Throughput Sequencing ◽

Science Research ◽

Control Analysis ◽

Sequencing Data ◽

Quality Control Analysis ◽

High Throughput Sequencing Data ◽

One Step ◽

Automated Quality Control

Abstract Use of high-throughput sequencing (HTS) has become indispensable in life science research. Raw HTS data contains several sequencing artifacts, and as a first step it is imperative to remove the artifacts for reliable downstream bioinformatics analysis. Although there are multiple stand-alone tools available that can perform the various quality control steps separately, the availability of an integrated tool that allows one-step, automated quality control analysis of HTS datasets would significantly ease the handling of large numbers of samples in parallel. Here, we developed HTSQualC, a stand-alone, flexible, and easy-to-use software for one-step quality control analysis of raw HTS data. HTSQualC can evaluate HTS data quality and perform filtering and trimming analysis in a single run. We evaluated the performance of HTSQualC for conducting batch analysis of HTS datasets with 322 samples with an average of ~1 M (paired-end) sequence reads per sample. HTSQualC accomplished the QC analysis in ~3 h in distributed mode and ~31 h in shared mode, thus underscoring its utility and robust performance. In addition to command-line execution, we integrated HTSQualC into the free, open-source CyVerse cyberinfrastructure resource as a GUI, for wider access to experimental biologists who have limited computational resources and/or programming abilities.
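
What "filtering and trimming in a single run" can look like is sketched below. This is a generic illustration, not HTSQualC's code or command-line interface: the function names, the thresholds (`min_q`, `min_len`, `min_mean_q`, `max_n_fraction`), and the assumption of Phred+33 quality strings are all choices made for the example.

```python
def phred33(quality_string):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_3prime(seq, qual, min_q=20):
    """Drop trailing bases whose quality is below `min_q`."""
    scores = phred33(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(seq, qual, min_len=5, min_mean_q=20, max_n_fraction=0.1):
    """Keep reads that are long enough, good on average, and mostly called."""
    if len(seq) < max(min_len, 1):
        return False
    scores = phred33(qual)
    if sum(scores) / len(scores) < min_mean_q:
        return False
    return seq.count("N") / len(seq) <= max_n_fraction


def quality_control(records):
    """Trim, then filter, each (name, sequence, quality) record in one pass."""
    kept = []
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual)
        if passes_filter(seq, qual):
            kept.append((name, seq, qual))
    return kept


records = [
    ("r1", "ACGTACGT", "IIIIII##"),  # low-quality tail gets trimmed
    ("r2", "ACGTAC", "######"),      # entirely low quality, dropped
    ("r3", "NNNNACGT", "IIIIIIII"),  # too many uncalled bases, dropped
]
print(quality_control(records))
```

Trimming before filtering means a read with a poor tail is judged on what remains of it rather than being discarded outright.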

Inline Index Helped in Cleaning up Data Contamination Generated During Library Preparation and the Subsequent Steps

10.21203/rs.3.rs-840856/v1 ◽

2021 ◽

Author(s):

Ying Wang ◽

Hao Yuan ◽

Junman Huang ◽

Chenhong Li

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Cross Contamination ◽

Library Preparation ◽

Sequencing Data ◽

Index Method ◽

Data Contamination ◽

High Throughput Sequencing Data ◽

Ligation Step ◽

Contamination Problem

Abstract High-throughput sequencing involves library preparation and amplification steps, which may induce contamination across samples or between samples and the environment. We tested the effect of applying an inline-index strategy, in which DNA indices of 6 bp were added to both ends of the inserts at the ligation step of library prep, for resolving the data contamination problem. Our results showed that the contamination ranged from 0.29–1.25% in one experiment and from 0.83–27.01% in the other. We also found that contamination could come from the environment or from reagents, in addition to cross-contamination between samples. The inline-index method is a useful experimental design to clean up the data and address the contamination problem that has been plaguing high-throughput sequencing data in many applications.
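
The clean-up step that an inline index enables can be sketched as follows. This is a hypothetical illustration rather than the authors' procedure: it assumes paired-end reads in which each mate begins with the 6 bp inline index, requires an exact match to the index expected for the library, and uses made-up names (`classify_pair`, `clean_library`) and sequences.

```python
INDEX_LEN = 6  # inline index length stated in the abstract


def classify_pair(fwd, rev, expected_index):
    """Return the pair with indices stripped, or None if either end
    does not carry the inline index expected for this library."""
    if fwd[:INDEX_LEN] != expected_index or rev[:INDEX_LEN] != expected_index:
        return None
    return fwd[INDEX_LEN:], rev[INDEX_LEN:]


def clean_library(pairs, expected_index):
    """Split read pairs into kept inserts and a contamination rate."""
    kept = []
    flagged = 0
    for fwd, rev in pairs:
        result = classify_pair(fwd, rev, expected_index)
        if result is None:
            flagged += 1
        else:
            kept.append(result)
    rate = flagged / len(pairs) if pairs else 0.0
    return kept, rate


pairs = [
    ("AACCGGTTTT", "AACCGGCCCC"),  # both ends match
    ("AACCGGAAAA", "TTGGCCGGGG"),  # reverse mate carries a different index
    ("AACCGGACGT", "AACCGGTGCA"),  # both ends match
    ("GGGGGGACGT", "AACCGGTGCA"),  # forward mate carries a different index
]
kept, rate = clean_library(pairs, "AACCGG")
print(kept, rate)
```

Requiring the index on both ends is the stricter choice; a real pipeline might also tolerate a mismatch in the index to allow for sequencing errors, which this sketch does not attempt.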

KARGA: Multi-platform Toolkit for k-mer-based Antibiotic Resistance Gene Analysis of High-throughput Sequencing Data

2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI) ◽

10.1109/bhi50953.2021.9508479 ◽

2021 ◽

Author(s):

Mattia Prosperi ◽

Simone Marini

Keyword(s):

Antibiotic Resistance ◽

Resistance Gene ◽

High Throughput ◽

High Throughput Sequencing ◽

Antibiotic Resistance Gene ◽

Gene Analysis ◽

Sequencing Data ◽

High Throughput Sequencing Data

PEGR: a flexible management platform for reproducible epigenomic and genomic research

10.1101/2021.07.26.453821 ◽

2021 ◽

Author(s):

Danying Shao ◽

Gretta Kellogg ◽

Ali Nematbakhsh ◽

Prashant Kuntala ◽

Shaun Mahony ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Genomic Research ◽

Sequencing Data ◽

Web Based ◽

Management Platform ◽

Significant Challenge ◽

High Throughput Sequencing Data ◽

The Galaxy ◽

Flexible Management

Reproducibility is a significant challenge in (epi)genomic research due to the complexity of experiments composed of traditional biochemistry and informatics. Recent advances have exacerbated this challenge as high-throughput sequencing data is generated at an unprecedented pace. Here we report on our development of a Platform for Epi-Genomic Research (PEGR), a web-based project management platform that tracks and quality controls experiments from conception to publication-ready figures, compatible with multiple assays and bioinformatic pipelines. It supports rigor and reproducibility for biochemists working at the wet bench, while continuing to fully support reproducibility and reliability for bioinformaticians through integration with the Galaxy platform.

Response of Growth Performance, Blood Biochemistry Indices, and Rumen Bacterial Diversity in Lambs to Diets Containing Supplemental Probiotics and Chinese Medicine Polysaccharides

Frontiers in Veterinary Science ◽

10.3389/fvets.2021.681389 ◽

2021 ◽

Vol 8 ◽

Author(s):

Huan Chen ◽

Beibei Guo ◽

Mingrui Yang ◽

Junrong Luo ◽

Yiqing Hu ◽

...

Keyword(s):

Chinese Medicine ◽

Growth Performance ◽

Relative Abundance ◽

High Throughput Sequencing ◽

Rumen Fluid ◽

Average Daily Gain ◽

Sequencing Data ◽

Blood Indices ◽

High Throughput Sequencing Data ◽

Daily Gain

This study aims to investigate the effects of probiotics and Chinese medicine polysaccharides (CMPs) on growth performance, blood indices, rumen fermentation, and bacteria composition in lambs. Forty female lambs were randomly divided into four groups as follows: control, probiotics, CMP, and compound (probiotics + CMP) groups. The results showed that probiotics treatment increased the concentrations of blood glucose (GLU) and immunoglobulin G (IgG) and enhanced rumen microbial protein contents but lowered the pH of rumen fluid compared with the control (P < 0.05). Furthermore, supplementation with CMP enhanced the average daily gain (ADG) and the contents of IgA, IgG, and IgM in the serum but decreased the F:G ratio compared with the control (P < 0.05). In addition, both CMP and compound (probiotics + CMP) treatments decreased the ratio of acetic acid to propionic acid compared with the control (P < 0.05). High-throughput sequencing data showed that at the genus level, the relative abundance of Veillonellaceae_UCG-001 in the probiotics group was increased, the relative abundances of Succiniclasticum and norank_f__Muribaculaceae in the CMP group were increased, and the relative abundance of Ruminococcaceae_UCG-002 in the compound group was increased compared with the control (P < 0.05). In summary, supplementation with probiotics can promote rumen protein fermentation but decrease the diversity of bacteria in rumen fluid; however, CMP treatment increased the relative abundance of Fibrobacteria, changed the rumen microbial fermentation mode, enhanced immune function, and ultimately improved growth performance.

high throughput sequencing data
Recently Published Documents

An association study on imputed whole‐genome resequencing from high‐throughput sequencing data for body traits in crossbred pigs

An In Silico Detection of a Citrus Viroid from Raw High-Throughput Sequencing Data

Lossless Indexing with Counting de Bruijn Graphs

ChiTaH: a fast and accurate tool for identifying known human chimeric sequences from high-throughput sequencing data

Productive visualization of high-throughput sequencing data using the SeqCode open portable platform

HTSQualC is a flexible and one-step quality control software for high-throughput sequencing data analysis

Inline Index Helped in Cleaning up Data Contamination Generated During Library Preparation and the Subsequent Steps

KARGA: Multi-platform Toolkit for k-mer-based Antibiotic Resistance Gene Analysis of High-throughput Sequencing Data

PEGR: a flexible management platform for reproducible epigenomic and genomic research

Response of Growth Performance, Blood Biochemistry Indices, and Rumen Bacterial Diversity in Lambs to Diets Containing Supplemental Probiotics and Chinese Medicine Polysaccharides
