high throughput sequencing data
Recently Published Documents


Total documents: 327 (five years: 112)

H-index: 37 (five years: 5)

2021 ◽  
pp. 275-283
Author(s):  
Tyler Dang ◽  
Andres Espindola ◽  
Georgios Vidalakis ◽  
Kitty Cardwell

2021 ◽  
Author(s):  
Mikhail Karasikov ◽  
Harun Mustafa ◽  
Gunnar Rätsch ◽  
André Kahles

High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in solving the experiment discovery problem and building compressed representations of annotated de Bruijn graphs where k-mer sets can be efficiently indexed and interactively queried. However, approaches for representing and retrieving other quantitative attributes such as gene expression or genome positions in a general manner have yet to be developed. In this work, we propose the concept of Counting de Bruijn graphs generalizing the notion of annotated (or colored) de Bruijn graphs. Counting de Bruijn graphs supplement each node-label relation with one or many attributes (e.g., a k-mer count or its positions in genome). To represent them, we first observe that many schemes for the representation of compressed binary matrices already support the rank operation on the columns or rows, which can be used to define an inherent indexing of any additional quantitative attributes. Based on this property, we generalize these schemes and introduce a new approach for representing non-binary sparse matrices in compressed data structures. Finally, we notice that relation attributes are often easily predictable from a node's local neighborhood in the graph. Notable examples are genome positions shifting by 1 for neighboring nodes in the graph, or expression levels that are often shared across neighbors. We exploit this regularity of graph annotations and apply an invertible delta-like coding to achieve better compression. We show that Counting de Bruijn graphs index k-mer counts from 2,652 human RNA-Seq read sets in representations over 8-fold smaller and yet faster to query compared to state-of-the-art bioinformatics tools. 
Furthermore, Counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip -9 for human Illumina RNA-Seq and 57% smaller for PacBio HiFi sequencing of viral samples. A complete joint searchable index of all viral PacBio SMRT reads from NCBI's SRA (152,884 read sets, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, they generate a lossless and fully queryable index that is 4.4-fold smaller compared to the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools employing de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed and fully searchable graph-based sequence indexes.
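
The two ideas in this abstract, using rank on a binary column to index extra attributes and delta-coding attributes that are predictable from neighboring nodes, can be illustrated with a minimal sketch. This is not the authors' implementation: the class `RankedColumn`, the plain prefix-sum rank, and the functions `delta_encode` and `delta_decode` are hypothetical names chosen for illustration, and a real index would use succinct bit vectors rather than Python lists.

```python
class RankedColumn:
    """One annotation column (hypothetical layout): a bit vector marking
    which graph nodes carry the label, plus a dense array holding one
    attribute (e.g. a k-mer count) per set bit."""

    def __init__(self, num_nodes, counts):
        # counts: dict mapping node index -> attribute value (assumed input format)
        self.bits = [0] * num_nodes
        for node in counts:
            self.bits[node] = 1
        # prefix[i] = number of set bits in bits[0:i], so rank is a lookup
        self.prefix = [0]
        for bit in self.bits:
            self.prefix.append(self.prefix[-1] + bit)
        # Values are stored in node order, so rank(node) is their index
        self.values = [counts[node] for node in sorted(counts)]

    def rank(self, i):
        """Number of set bits strictly before position i."""
        return self.prefix[i]

    def get(self, node):
        """Attribute of `node`, or 0 if the node does not carry the label."""
        if not self.bits[node]:
            return 0
        return self.values[self.rank(node)]


def delta_encode(positions):
    """Store each value as its deviation from 'previous value + 1',
    the shift expected between neighboring nodes on a path."""
    encoded = positions[:1]
    for prev, cur in zip(positions, positions[1:]):
        encoded.append(cur - (prev + 1))
    return encoded


def delta_decode(encoded):
    """Invert delta_encode exactly."""
    decoded = encoded[:1]
    for delta in encoded[1:]:
        decoded.append(decoded[-1] + 1 + delta)
    return decoded
```

For example, `RankedColumn(6, {1: 7, 4: 3}).get(4)` looks up the second stored value because one set bit precedes node 4, and `delta_encode([10, 11, 12, 40])` turns the run of consecutive positions into zeros, which compress well.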


2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Rajesh Detroja ◽  
Alessandro Gorohovski ◽  
Olawumi Giwa ◽  
Gideon Baum ◽  
Milana Frenkel-Morgenstern

Fusion genes or chimeras typically comprise sequences from two different genes. The chimeric RNAs of such joined sequences often serve as cancer drivers. Identifying such driver fusions in a given cancer or complex disease is important for diagnosis and treatment. The advent of next-generation sequencing technologies, such as DNA-Seq or RNA-Seq, together with the development of suitable computational tools, has made the global identification of chimeras in tumors possible. However, testing of over 20 computational methods showed them to be limited in terms of chimera prediction sensitivity, specificity, and accurate quantification of junction reads. These shortcomings motivated us to develop the first 'reference-based' approach, termed ChiTaH (Chimeric Transcripts from High-throughput sequencing data). ChiTaH uses 43,466 non-redundant known human chimeras as a reference database to map sequencing reads and to accurately identify chimeric reads. We benchmarked ChiTaH and four other methods to identify human chimeras, leveraging both simulated and real sequencing datasets. ChiTaH was found to be the most accurate and fastest method for identifying known human chimeras from simulated and real sequencing datasets. Moreover, ChiTaH uncovered heterogeneity of the BCR-ABL1 chimera in both bulk and single cells of the K-562 cell line, which was confirmed experimentally.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Enrique Blanco ◽  
Mar González-Ramírez ◽  
Luciano Di Croce

Large-scale sequencing techniques to chart genomes are now well established. Stable computational methods to perform primary tasks such as quality control, read mapping, peak calling, and counting are likewise available. However, there is a lack of uniform standards for graphical data mining, which is also of central importance. To fill this gap, we developed SeqCode, an open suite of applications that analyzes sequencing data in an elegant but efficient manner. Our software is a portable resource written in ANSI C that can be expected to work for almost all genomes in any computational configuration. Furthermore, we offer a user-friendly front-end web server that integrates SeqCode functions with other graphical analysis tools. Our analysis and visualization toolkit represents a significant improvement in terms of performance and usability compared to other existing programs. Thus, SeqCode has the potential to become a key multipurpose instrument for high-throughput professional analysis; further, it provides an extremely useful open educational platform for the worldwide scientific community. The SeqCode website is hosted at http://ldicrocelab.crg.eu, and the source code is freely distributed at https://github.com/eblancoga/seqcode.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Renesh Bedre ◽  
Carlos Avila ◽  
Kranthi Mandadi

Use of high-throughput sequencing (HTS) has become indispensable in life science research. Raw HTS data contains several sequencing artifacts, and as a first step it is imperative to remove these artifacts for reliable downstream bioinformatics analysis. Although multiple stand-alone tools can perform the various quality control steps separately, an integrated tool that allows one-step, automated quality control analysis of HTS datasets would significantly ease the handling of large numbers of samples in parallel. Here, we developed HTSQualC, a stand-alone, flexible, and easy-to-use software tool for one-step quality control analysis of raw HTS data. HTSQualC can evaluate HTS data quality and perform filtering and trimming analysis in a single run. We evaluated the performance of HTSQualC for conducting batch analysis of HTS datasets with 322 samples with an average of ~1 M (paired-end) sequence reads per sample. HTSQualC accomplished the QC analysis in ~3 h in distributed mode and ~31 h in shared mode, thus underscoring its utility and robust performance. In addition to command-line execution, we integrated HTSQualC into the free, open-source CyVerse cyberinfrastructure resource as a GUI, for wider access to experimental biologists who have limited computational resources and/or programming abilities.
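
As a rough illustration of what "filtering and trimming in a single run" can mean for one read, the sketch below trims low-quality bases from the 3' end and then applies a mean-quality filter. It is not HTSQualC's algorithm or interface: the function names, the thresholds (`min_q`, `min_mean_q`, `min_len`), and the trimming rule are assumptions made for the example. The only fixed convention used is the standard Phred+33 quality encoding of FASTQ files.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 by default)."""
    return [ord(char) - offset for char in quality_string]


def trim_3prime(seq, qual, min_q=20):
    """Drop trailing bases whose Phred score is below min_q (assumed rule)."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(seq, qual, min_mean_q=20, min_len=30):
    """Keep a read only if it is long enough and its mean quality is adequate."""
    if not seq or len(seq) < min_len:
        return False
    scores = phred_scores(qual)
    return sum(scores) / len(scores) >= min_mean_q


def quality_control(seq, qual, min_q=20, min_mean_q=20, min_len=30):
    """Trim, then filter; returns the trimmed read or None if it is rejected."""
    seq, qual = trim_3prime(seq, qual, min_q)
    if passes_filter(seq, qual, min_mean_q, min_len):
        return seq, qual
    return None
```

With `min_len=4`, the call `quality_control("ACGTAC", "IIII##", min_len=4)` trims the two low-quality trailing bases and keeps the remaining four.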


2021 ◽  
Author(s):  
Ying Wang ◽  
Hao Yuan ◽  
Junman Huang ◽  
Chenhong Li

High-throughput sequencing involves library preparation and amplification steps, which may induce contamination across samples or between samples and the environment. We tested the effect of applying an inline-index strategy, in which DNA indices of 6 bp were added to both ends of the inserts at the ligation step of library preparation, to resolve the data contamination problem. Our results showed that contamination ranged from 0.29–1.25% in one experiment and from 0.83–27.01% in the other. We also found that contamination could come from the environment or from reagents, in addition to cross-contamination between samples. The inline-index method is a useful experimental design to clean up the data and address the contamination problem that has been plaguing high-throughput sequencing data in many applications.
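
A minimal sketch of how inline indices could be checked computationally: a read counts as clean only if the 6 bp at both ends match the indices expected for its sample. The 6 bp length comes from the abstract; everything else is an assumption for illustration, including exact matching, the representation of a read as a single string with both indices in the same orientation, and the function names `split_read`, `classify`, and `contamination_rate`.

```python
INDEX_LEN = 6  # inline index length stated in the abstract


def split_read(read):
    """Split a read into (left index, insert, right index)."""
    return read[:INDEX_LEN], read[INDEX_LEN:-INDEX_LEN], read[-INDEX_LEN:]


def classify(read, expected_left, expected_right):
    """Label a read 'clean', 'contaminant', or 'too_short' (assumed scheme)."""
    if len(read) <= 2 * INDEX_LEN:
        return "too_short", ""
    left, insert, right = split_read(read)
    if left == expected_left and right == expected_right:
        return "clean", insert
    return "contaminant", insert


def contamination_rate(reads, expected_left, expected_right):
    """Fraction of scorable reads whose indices do not match the sample."""
    labels = [classify(r, expected_left, expected_right)[0] for r in reads]
    scored = [label for label in labels if label != "too_short"]
    if not scored:
        return 0.0
    return scored.count("contaminant") / len(scored)
```

Requiring a match at both ends is the point of the design: a read carrying a foreign index on either side is flagged even if the other side looks correct.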


2021 ◽  
Author(s):  
Danying Shao ◽  
Gretta Kellogg ◽  
Ali Nematbakhsh ◽  
Prashant Kuntala ◽  
Shaun Mahony ◽  
...  

Reproducibility is a significant challenge in (epi)genomic research due to the complexity of experiments composed of traditional biochemistry and informatics. Recent advances have exacerbated this challenge as high-throughput sequencing data is generated at an unprecedented pace. Here we report on our development of a Platform for Epi-Genomic Research (PEGR), a web-based project management platform that tracks and quality controls experiments from conception to publication-ready figures, compatible with multiple assays and bioinformatic pipelines. It supports rigor and reproducibility for biochemists working at the wet bench, while continuing to fully support reproducibility and reliability for bioinformaticians through integration with the Galaxy platform.


2021 ◽  
Vol 8 ◽  
Author(s):  
Huan Chen ◽  
Beibei Guo ◽  
Mingrui Yang ◽  
Junrong Luo ◽  
Yiqing Hu ◽  
...  

This study aims to investigate the effects of probiotics and Chinese medicine polysaccharides (CMPs) on growth performance, blood indices, rumen fermentation, and bacterial composition in lambs. Forty female lambs were randomly divided into four groups as follows: control, probiotics, CMP, and compound (probiotics + CMP) groups. The results showed that probiotics treatment increased the concentrations of blood glucose (GLU) and immunoglobulin G (IgG) and enhanced rumen microbial protein contents but lowered the pH of rumen fluid compared with the control (P < 0.05). Furthermore, supplementation with CMP enhanced the average daily gain (ADG) and the contents of IgA, IgG, and IgM in the serum but decreased the F:G ratio compared with the control (P < 0.05). In addition, both CMP and compound (probiotics + CMP) treatments decreased the ratio of acetic acid to propionic acid compared with the control (P < 0.05). High-throughput sequencing data showed that at the genus level, the relative abundance of Veillonellaceae_UCG-001 in the probiotics group was increased, the relative abundances of Succiniclasticum and norank_f__Muribaculaceae in the CMP group were increased, and the relative abundance of Ruminococcaceae_UCG-002 in the compound group was increased compared with the control (P < 0.05). In summary, supplementation with probiotics can promote rumen protein fermentation but decrease the diversity of bacteria in rumen fluid; however, CMP treatment increased the relative abundance of Fibrobacteria, changed the rumen microbial fermentation mode, enhanced immune function, and ultimately improved growth performance.

