StoatyDive: Evaluation and Classification of Peak Profiles for Sequencing Data

The prediction of binding sites (peak calling) is a common task in the analysis of data from methods such as crosslinking or chromatin immunoprecipitation combined with high-throughput sequencing (CLIP-Seq, ChIP-Seq). The predicted binding sites are often analyzed further, for example to predict sequence motifs or structure patterns. However, the obtained peaks can vary in profile shape because of the peak caller used, different binding domains of the protein, protocol biases, or other factors. A tool that evaluates and classifies the predicted peaks based on their shapes has so far been missing. We present StoatyDive, a tool that can be used to filter for specific peak profile shapes in sequencing data such as CLIP and ChIP. StoatyDive thereby fine-tunes downstream analysis steps such as structure or sequence motif prediction and acts as a quality control. With StoatyDive we were able to classify distinct peak profile shapes from CLIP-seq data of the histone stem-loop-binding protein (SLBP). We show the potential of StoatyDive as a quality control tool and as a filter to pick different shapes based on biological or methodological questions. StoatyDive is open source and freely available under GPL-3 at https://github.com/BackofenLab/StoatyDive and on Bioconda at https://anaconda.org/bioconda/stoatydive.

PathoQC: Computationally Efficient Read Preprocessing and Quality Control for High-Throughput Sequencing Data Sets

Cancer Informatics ◽

10.4137/cin.s13890 ◽

2014 ◽

Vol 13s1 ◽

pp. CIN.S13890 ◽

Cited By ~ 1

Author(s):

Changjin Hong ◽

Solaiappan Manimaran ◽

William Evan Johnson

Keyword(s):

Quality Control ◽

High Throughput ◽

High Performance ◽

High Throughput Sequencing ◽

Next Generation Sequencing Data ◽

Data Sets ◽

Sequencing Data ◽

Computationally Efficient ◽

High Throughput Sequencing Data ◽

Downstream Analysis

Quality control and read preprocessing are critical steps in the analysis of data sets generated from high-throughput genomic screens. In the most extreme cases, improper preprocessing can negatively affect downstream analyses and may lead to incorrect biological conclusions. Here, we present PathoQC, a streamlined toolkit that seamlessly combines the benefits of several popular quality control software approaches for preprocessing next-generation sequencing data. PathoQC provides a variety of quality control options appropriate for most high-throughput sequencing applications. PathoQC is primarily developed as a module in the PathoScope software suite for metagenomic analysis. However, PathoQC is also available as an open-source Python module that can run as a stand-alone application or can be easily integrated into any bioinformatics workflow. PathoQC achieves high performance by supporting parallel computation and is an effective tool that removes technical sequencing artifacts and facilitates robust downstream analysis. The PathoQC software package is available at http://sourceforge.net/projects/PathoScope/ .
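To make the kind of read preprocessing described above concrete, here is a minimal sketch of 3' quality trimming followed by a length filter. It is not PathoQC's code or API: the function names, the Phred+33 encoding, and the thresholds `min_q` and `min_len` are illustrative assumptions.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, quality: str, min_q: int = 20) -> tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def preprocess(reads, min_q: int = 20, min_len: int = 4):
    """Trim each (name, seq, quality) read and drop reads that end up too short."""
    kept = []
    for name, seq, qual in reads:
        trimmed_seq, trimmed_qual = trim_3prime(seq, qual, min_q)
        if len(trimmed_seq) >= min_len:
            kept.append((name, trimmed_seq, trimmed_qual))
    return kept


# Hypothetical toy reads: 'I' encodes Phred 40, '#' encodes Phred 2.
reads = [
    ("read1", "ACGTACGT", "IIIIII##"),  # two low-quality bases at the 3' end
    ("read2", "ACGT", "I###"),          # only one base survives trimming
]
print(preprocess(reads))
```

In this toy input, `read1` loses its last two bases and `read2` is discarded because it falls below the assumed minimum length.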

Integration of viral transcriptome sequencing with structure and sequence motifs predicts novel regulatory elements in SARS-CoV-2

10.1101/2020.06.24.169144 ◽

2020 ◽

Author(s):

Brian J. Cox

Keyword(s):

Regulatory Elements ◽

Viral Gene ◽

Template Switching ◽

Sequence Motif ◽

Sequence Motifs ◽

Human Pathogens ◽

Sequencing Data ◽

Stem Loop ◽

Conserved Sequence ◽

Splice Junctions

Summary: In the last twenty years, three separate coronaviruses have left their typical animal hosts and become human pathogens. An area of research interest is coronavirus transcription regulation, which uses an RNA-RNA mediated template-switching mechanism. It is not known how different transcriptional stoichiometries of each viral gene are generated. Analysis of SARS-CoV-2 RNA sequencing data from whole RNA transcriptomes identified TRS-dependent and TRS-independent transcripts. Integration of transcripts and 5’-UTR sequence motifs showed that the pentaloop and stem-loop 3 were also located upstream of spliced genes. TRS-independent transcripts were detected as likely non-polyadenylated. Additionally, a novel conserved sequence motif was discovered at either end of the TRS-independent splice junctions. While both SARS viruses generated similar TRS-independent transcripts, these were more abundant in SARS-CoV-2. TRS-independent gene regulation requires investigation to determine its relationship to viral pathogenicity.

StoatyDive: Evaluation and classification of peak profiles for sequencing data

GigaScience ◽

10.1093/gigascience/giab045 ◽

2021 ◽

Vol 10 (6) ◽

Author(s):

Florian Heyl ◽

Rolf Backofen

Keyword(s):

Quality Control ◽

High Throughput ◽

Binding Sites ◽

Binding Proteins ◽

RNA Binding ◽

RNA Binding Proteins ◽

Typical Result ◽

Sequence Motif ◽

Peak Shape ◽

Sequencing Data

Background: The prediction of binding sites (peak-calling) is a common task in the data analysis of methods such as cross-linking immunoprecipitation in combination with high-throughput sequencing (CLIP-Seq). The predicted binding sites are often further analyzed to predict sequence motifs or structure patterns. When looking at a typical result of such high-throughput experiments, the obtained peak profiles differ largely on a genomic level. Thus, a tool is missing that evaluates and classifies the predicted peaks on the basis of their shapes. We hereby present StoatyDive, a tool that can be used to filter for specific peak profile shapes of sequencing data such as CLIP. Findings: With StoatyDive we are able to classify peak profile shapes from CLIP-seq data of the histone stem-loop-binding protein (SLBP). We compare the results to existing tools and show that StoatyDive finds more distinct peak shape clusters for CLIP data. Furthermore, we present StoatyDive’s capabilities as a quality control tool and as a filter to pick different shapes based on biological or technical questions for other CLIP data from different RNA binding proteins with different biological functions and numbers of RNA recognition motifs. We finally show that proteins involved in splicing, such as RBM22 and U2AF1, have potentially sharper-shaped peaks than other RNA binding proteins. Conclusion: StoatyDive finally fills the demand for a peak shape clustering tool for CLIP-Seq data that fine-tunes downstream analysis steps such as structure or sequence motif predictions and that acts as a quality control.
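As a rough illustration of what scoring a peak by its shape can look like, the sketch below computes the coefficient of variation of a per-nucleotide coverage profile and uses it to separate sharp from broad peaks. This is an assumed, simplified score for illustration only: the abstract does not specify how StoatyDive evaluates shapes, and the function names, the threshold, and the toy profiles are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(profile):
    """Population standard deviation divided by the mean of a coverage profile."""
    mu = mean(profile)
    if mu == 0:
        return 0.0  # an empty (all-zero) peak has no shape to score
    return pstdev(profile) / mu


def label_peak(profile, threshold=1.0):
    """Label a peak 'sharp' if its coverage is concentrated, else 'broad'.

    The threshold of 1.0 is an arbitrary illustrative choice.
    """
    return "sharp" if coefficient_of_variation(profile) >= threshold else "broad"


# Hypothetical read coverage along two peaks of equal length.
sharp_peak = [0, 0, 1, 20, 1, 0, 0]
broad_peak = [5, 6, 6, 7, 6, 6, 5]

for name, profile in [("sharp_peak", sharp_peak), ("broad_peak", broad_peak)]:
    print(name, round(coefficient_of_variation(profile), 2), label_peak(profile))
```

A score of this kind could serve as a filter before motif or structure prediction, which is the use case the abstract describes.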

Complete mitochondrial genome sequence of Labriocimbex sinicus, a new genus and new species of Cimbicidae (Hymenoptera) from China

PeerJ ◽

10.7717/peerj.7853 ◽

2019 ◽

Vol 7 ◽

pp. e7853 ◽

Cited By ~ 1

Author(s):

Yuchen Yan ◽

Gengyun Niu ◽

Yaoyao Zhang ◽

Qianying Ren ◽

Shiyu Du ◽

...

Keyword(s):

Mitochondrial Genome ◽

New Genus ◽

High Throughput Sequencing ◽

Phylogenetic Analyses ◽

Complete Mitochondrial Genome ◽

Sister Group ◽

Morphological Characters ◽

tRNA Genes ◽

Sequencing Data ◽

Link Type

Labriocimbex sinicus Yan & Wei gen. et sp. nov. of Cimbicidae is described. The new genus is similar to Praia Andre and Trichiosoma Leach. A key to extant Holarctic genera of Cimbicinae is provided. To identify the phylogenetic placement of Cimbicidae, the mitochondrial genome of L. sinicus was annotated and characterized using high-throughput sequencing data. The complete mitochondrial genome of L. sinicus was obtained with a length of 15,405 bp (GenBank: MH136623; SRA: SRR8270383) and a typical set of 37 genes (22 tRNAs, 13 PCGs, and two rRNAs). The results demonstrated that all PCGs were initiated by ATN codons and ended with TAA or T stop codons. The study reveals that all tRNA genes have a typical clover-leaf secondary structure, except for trnS1. Remarkably, the secondary structures of the rrnS and rrnL of L. sinicus differed considerably from those of Corynis lateralis. Phylogenetic analyses verified the monophyly and positions of the three Cimbicidae species within the superfamily Tenthredinoidea and demonstrated a relationship of (Tenthredinidae + Cimbicidae) + (Argidae + Pergidae) with strong nodal support. Furthermore, we found that the generic relationships of Cimbicidae revealed by the phylogenetic analyses based on COI genes agree closely with the systematic arrangement of the genera based on morphological characters. Phylogenetic trees based on two methods show that L. sinicus is the sister group of Praia with high support values. We suggest that Labriocimbex belongs to the tribe Trichiosomini of Cimbicinae based on adult morphology and molecular data. In addition, we suggest promoting the subgenus Asitrichiosoma to a valid genus.
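The start and stop codon pattern reported for the protein-coding genes (PCGs) can be checked with a few lines of code. The sketch below is illustrative only: the function names and the toy sequences are hypothetical, and it assumes each coding sequence is given 5' to 3' and in frame from its first base.

```python
def has_atn_start(cds: str) -> bool:
    """True if the coding sequence begins with an ATN codon (ATA, ATC, ATG, ATT)."""
    cds = cds.upper()
    return len(cds) >= 3 and cds[:2] == "AT" and cds[2] in "ACGT"


def stop_type(cds: str):
    """Classify the stop signal as complete 'TAA', incomplete 'T', or None.

    An incomplete stop is taken here to be a single trailing T left over
    after the last full codon (sequence length is 1 modulo 3).
    """
    cds = cds.upper()
    if len(cds) % 3 == 0 and cds.endswith("TAA"):
        return "TAA"
    if len(cds) % 3 == 1 and cds.endswith("T"):
        return "T"
    return None


# Hypothetical toy sequences, not taken from the L. sinicus mitogenome.
examples = {"geneA": "ATGAAATAA", "geneB": "ATTAAAT", "geneC": "GTGAAATAG"}
for name, cds in examples.items():
    print(name, has_atn_start(cds), stop_type(cds))
```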

hypeR: An R Package for Geneset Enrichment Workflows

10.1101/656637 ◽

2019 ◽

Cited By ~ 1

Author(s):

Anthony Federico ◽

Stefano Monti

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Wide Audience ◽

Popular Method ◽

Link Type ◽

High Throughput Sequencing Data ◽

One Stop ◽

Recent Version

Summary: Geneset enrichment is a popular method for annotating high-throughput sequencing data. Existing tools fall short in providing the flexibility to tackle the varied challenges researchers face in such analyses, particularly when analyzing many signatures across multiple experiments. We present a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization, and sharing methods in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting. hypeR is a one-stop solution to performing geneset enrichment for a wide audience and range of use cases. Availability and implementation: The most recent version of the package is available at https://github.com/montilab/hypeR. Supplementary information: Comprehensive documentation and tutorials are available at https://montilab.github.io/hypeR-docs.
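For readers unfamiliar with geneset enrichment, one common formulation is the hypergeometric over-representation test. The sketch below is written in Python rather than R and does not use or mirror the hypeR API. The abstract does not say which tests hypeR implements, so treat the choice of test, the function names, and the toy gene identifiers as assumptions.

```python
from math import comb


def hypergeom_pvalue(overlap, signature_size, geneset_size, background_size):
    """P(X >= overlap) when signature_size genes are drawn without replacement
    from a background that contains geneset_size members of the geneset."""
    max_overlap = min(signature_size, geneset_size)
    total = comb(background_size, signature_size)
    tail = sum(
        comb(geneset_size, k)
        * comb(background_size - geneset_size, signature_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def enrich(signature, geneset, background):
    """Return (overlap, p-value) for one signature against one geneset."""
    background = set(background)
    signature = set(signature) & background
    geneset = set(geneset) & background
    overlap = len(signature & geneset)
    pvalue = hypergeom_pvalue(
        overlap, len(signature), len(geneset), len(background)
    )
    return overlap, pvalue


# Hypothetical toy data: 20 background genes, 5 in the signature.
background = [f"g{i}" for i in range(1, 21)]
signature = ["g1", "g2", "g3", "g4", "g5"]
geneset = ["g1", "g2", "g3", "g4", "g10"]
print(enrich(signature, geneset, background))
```

Analyzing many signatures across multiple experiments, as the abstract describes, would amount to calling a function like `enrich` in a loop and correcting the resulting p-values for multiple testing.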

Natrix: a Snakemake-based workflow for processing, clustering, and taxonomically assigning amplicon sequencing reads

BMC Bioinformatics ◽

10.1186/s12859-020-03852-4 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Marius Welzel ◽

Anja Lange ◽

Dominik Heider ◽

Michael Schwarz ◽

Bernd Freisleben ◽

...

Keyword(s):

High Throughput Sequencing ◽

Workflow Management ◽

Amplicon Sequencing ◽

Version Control ◽

Marker Genes ◽

Sequencing Data ◽

Taxonomic Assignment ◽

Ecological Processes ◽

Link Type ◽

User Friendly

Background: Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high-throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. Results: We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, and sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control. Thus, Snakemake ensures reproducibility and Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies supports hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix) or as a Docker container on DockerHub (https://hub.docker.com/r/mw55/natrix). Conclusion: Natrix is a user-friendly and highly extensible workflow for processing Illumina amplicon data.
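Of the workflow steps listed above, dereplication is the easiest to show in isolation: identical reads are collapsed into one representative with an abundance count. The following sketch is a generic illustration, not code from Natrix or from the tools it wraps, and the function name and toy reads are hypothetical.

```python
from collections import Counter


def dereplicate(sequences):
    """Collapse identical sequences (case-insensitive) into (sequence, count) pairs.

    Pairs are sorted by decreasing abundance, then alphabetically, so the
    output order is deterministic.
    """
    counts = Counter(seq.upper() for seq in sequences)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy amplicon reads.
reads = ["ACGT", "acgt", "TTGA", "ACGT", "TTGA", "GGCC"]
print(dereplicate(reads))
```

Later steps such as chimera detection and OTU or ASV assignment typically operate on these unique sequences and their counts, which keeps the amount of data to process small.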

re-Searcher: GUI-based bioinformatics tool for simplified genomics data mining of VCF files

PeerJ ◽

10.7717/peerj.11333 ◽

2021 ◽

Vol 9 ◽

pp. e11333

Author(s):

Daniyar Karabayev ◽

Askhat Molkenov ◽

Kaiyrgali Yerulanuly ◽

Ilyas Kabimoldayev ◽

Asset Daniyarov ◽

...

Keyword(s):

Web Application ◽

High Throughput Sequencing ◽

Sequencing Data ◽

Data Types ◽

Standard Format ◽

Standard Data ◽

Additional Information ◽

Link Type ◽

Sequencing Platforms ◽

User Friendly

Background: High-throughput sequencing platforms generate massive, high-dimensional genomic datasets that are available for analysis. Modern, user-friendly bioinformatics tools for the analysis and interpretation of genomics data have become essential for working with sequencing data. Different standard data types and file formats have been developed to store and analyze sequence and genomics data. Variant Call Format (VCF) is the most widespread genomics file type and standard format containing genomic information and variants of sequenced samples. Results: Existing tools for processing VCF files usually lack an intuitive graphical interface and offer only a command-line interface, which may be challenging to use for the broader biomedical community interested in genomics data analysis. re-Searcher addresses this problem and pre-processes VCF files in chunks so as not to overload the computer's RAM. The tool can be used as a standalone, user-friendly, multiplatform GUI application as well as a web application (https://nla-lbsb.nu.edu.kz). The software, including source code, tested VCF files, and additional information, is publicly available in the GitHub repository (https://github.com/LabBandSB/re-Searcher).
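The idea of processing a VCF file in chunks can be sketched as a generator that never holds more than a fixed number of records in memory. This is an illustrative sketch under stated assumptions, not re-Searcher's implementation: the function names and the chunk size are hypothetical, and only the tab-separated layout with `#`-prefixed header lines is taken from the VCF format itself.

```python
import io
from itertools import islice


def iter_vcf_chunks(handle, chunk_size=1000):
    """Yield lists of parsed data records, at most chunk_size records per list.

    Header and meta lines (starting with '#') and blank lines are skipped.
    """
    records = (
        line.rstrip("\n").split("\t")
        for line in handle
        if line.strip() and not line.startswith("#")
    )
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            return
        yield chunk


def count_variants_per_chrom(handle, chunk_size=1000):
    """Count records per chromosome (first VCF column) one chunk at a time."""
    counts = {}
    for chunk in iter_vcf_chunks(handle, chunk_size):
        for fields in chunk:
            counts[fields[0]] = counts.get(fields[0], 0) + 1
    return counts


# A tiny in-memory VCF stands in for a large file opened with open(path).
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\n"
    "chr2\t300\t.\tG\tA\t50\tPASS\t.\n"
)
print(count_variants_per_chrom(example, chunk_size=2))
```

Because the file handle is read lazily, peak memory use depends on `chunk_size` rather than on the size of the VCF file.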

Rqc: A Bioconductor Package for Quality Control of High-Throughput Sequencing Data

Journal of Statistical Software ◽

10.18637/jss.v087.c02 ◽

2018 ◽

Vol 87 (Code Snippet 2) ◽

Cited By ~ 2

Author(s):

Wélliton de Souza ◽

Benilton de Sá Carvalho ◽

Iscia Lopes-Cendes

Keyword(s):

Quality Control ◽

High Throughput ◽

High Throughput Sequencing ◽

Bioconductor Package ◽

Sequencing Data ◽

High Throughput Sequencing Data

miQC: An adaptive probabilistic framework for quality control of single-cell RNA-sequencing data

10.1101/2021.03.03.433798 ◽

2021 ◽

Author(s):

Ariel A. Hippen ◽

Matias M. Falco ◽

Lukas M. Weber ◽

Erdogan Pekcan Erkan ◽

Kaiyang Zhang ◽

...

Keyword(s):

Quality Control ◽

Single Cell ◽

RNA Sequencing ◽

Data Driven ◽

Probabilistic Framework ◽

Sequencing Data ◽

Link Type ◽

Tumor Tissues ◽

Single Cell RNA Sequencing ◽

Different Types

Motivation: Single-cell RNA-sequencing (scRNA-seq) has made it possible to profile gene expression in tissues at high resolution. An important preprocessing step prior to performing downstream analyses is to identify and remove cells with poor or degraded sample quality using quality control (QC) metrics. Two widely used QC metrics to identify a ‘low-quality’ cell are (i) if the cell includes a high proportion of reads that map to mitochondrial DNA (mtDNA) encoded genes and (ii) if a small number of genes are detected. Current best practices use these QC metrics independently with either arbitrary, uniform thresholds (e.g. 5%) or biological context-dependent (e.g. species) thresholds, and fail to jointly model these metrics in a data-driven manner. Current practices are often overly stringent and especially untenable on lower-quality tissues, such as archived tumor tissues. Results: We propose a data-driven QC metric (miQC) that jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to predict the low-quality cells in a given dataset. We demonstrate how our QC metric easily adapts to different types of single-cell datasets to remove low-quality cells while preserving high-quality cells that can be used for downstream analyses. Availability: Software is available at https://github.com/greenelab/miQC. The code used to download datasets, perform the analyses, and reproduce the figures is available at https://github.com/greenelab/mito-filtering. Contact: Stephanie C. Hicks and Anna Vähärautio.
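The two QC metrics named above are straightforward to compute from a count matrix. The sketch below derives them per cell and then applies the independent, uniform thresholds that the abstract describes as current practice. It does not implement miQC's mixture model. The function names, the `MT-` gene prefix (a human gene naming convention), the `min_genes` cutoff, and the toy counts are assumptions for illustration.

```python
def qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Return (mitochondrial read fraction, number of detected genes) per cell.

    counts is a list of per-cell count lists aligned with gene_names.
    """
    is_mito = [name.upper().startswith(mito_prefix) for name in gene_names]
    metrics = []
    for cell in counts:
        total = sum(cell)
        mito = sum(count for count, flag in zip(cell, is_mito) if flag)
        detected = sum(1 for count in cell if count > 0)
        mito_fraction = mito / total if total else 0.0
        metrics.append((mito_fraction, detected))
    return metrics


def threshold_filter(metrics, max_mito=0.05, min_genes=2):
    """Keep a cell only if it passes both cutoffs independently."""
    return [
        mito_fraction <= max_mito and detected >= min_genes
        for mito_fraction, detected in metrics
    ]


# Hypothetical toy count matrix: 3 cells x 4 genes.
genes = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH"]
cells = [
    [1, 0, 60, 39],   # 1% mitochondrial reads, 3 genes detected
    [40, 30, 30, 0],  # 70% mitochondrial reads, 3 genes detected
    [0, 0, 5, 0],     # no mitochondrial reads, 1 gene detected
]
metrics = qc_metrics(cells, genes)
print(metrics)
print(threshold_filter(metrics))
```

A joint model of the kind the abstract proposes would take the same two metrics as input and replace the fixed cutoffs with a per-cell probability of being low quality.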

NASQAR: A web-based platform for high-throughput sequencing data analysis and visualization

10.1101/709980 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ayman Yousif ◽

Nizar Drou ◽

Jillian Rowe ◽

Mohammed Khalfan ◽

Kristin C Gunsalus

Keyword(s):

New York ◽

Data Analysis ◽

Open Source ◽

High Throughput ◽

High Throughput Sequencing ◽

Web Applications ◽

RNA Seq ◽

Sequencing Data ◽

Web Based ◽

Link Type

Background: As high-throughput sequencing applications continue to evolve, the rapid growth in quantity and variety of sequence-based data calls for the development of new software libraries and tools for data analysis and visualization. Often, effective use of these tools requires computational skills beyond those of many researchers. To ease this computational barrier, we have created a dynamic web-based platform, NASQAR (Nucleic Acid SeQuence Analysis Resource). Results: NASQAR offers a collection of custom and publicly available open-source web applications that make extensive use of a variety of R packages to provide interactive data analysis and visualization. The platform is publicly accessible at http://nasqar.abudhabi.nyu.edu/. Open-source code is on GitHub at https://github.com/nasqar/NASQAR, and the system is also available as a Docker image at https://hub.docker.com/r/aymanm/nasqarall. NASQAR is a collaboration between the core bioinformatics teams of the NYU Abu Dhabi and NYU New York Centers for Genomics and Systems Biology. Conclusions: NASQAR empowers non-programming experts with a versatile and intuitive toolbox to easily and efficiently explore, analyze, and visualize their transcriptomics data interactively. Popular tools for a variety of applications are currently available, including Transcriptome Data Preprocessing, RNA-seq Analysis (including Single-cell RNA-seq), Metagenomics, and Gene Enrichment.
