SECAPR—a bioinformatics pipeline for the rapid and user-friendly processing of targeted enriched Illumina sequences, from raw reads to alignments

SECAPR - A bioinformatics pipeline for the rapid and user-friendly alignment of hybrid enrichment sequences, from raw reads to alignments

10.7287/peerj.preprints.26477v2 ◽

2018 ◽

Author(s):

Tobias Andermann ◽

Angela Cano ◽

Alexander Zizka ◽

Christine Bacon ◽

Alexandre Antonelli

Keyword(s):

Evolutionary Biology ◽

Sequence Data ◽

Model Organisms ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequence Capture ◽

Sequencing Platforms ◽

User Friendly

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing platforms such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (= hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.

SECAPR - A bioinformatics pipeline for the rapid and user-friendly processing of Illumina sequences, from raw reads to alignments

10.7287/peerj.preprints.26477 ◽

2018 ◽

Author(s):

Tobias Andermann ◽

Angela Cano ◽

Alexander Zizka ◽

Christine Bacon ◽

Alexandre Antonelli

Keyword(s):

Evolutionary Biology ◽

Sequence Data ◽

Model Organisms ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequence Capture ◽

Sequencing Platforms ◽

User Friendly

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing platforms such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive empirical data tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (synonyms: target or hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical sequence capture dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.

SECAPR - A bioinformatics pipeline for the rapid and user-friendly alignment of hybrid enrichment sequences, from raw reads to alignments

10.7287/peerj.preprints.26477v1 ◽

2018 ◽

Author(s):

Tobias Andermann ◽

Angela Cano ◽

Alexander Zizka ◽

Christine Bacon ◽

Alexandre Antonelli

Keyword(s):

Evolutionary Biology ◽

Sequence Data ◽

Model Organisms ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequence Capture ◽

Sequencing Platforms ◽

User Friendly

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing platforms such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (= hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.

SECAPR - A bioinformatics pipeline for the rapid and user-friendly processing of Illumina sequences, from raw reads to alignments

10.7287/peerj.preprints.26477v3 ◽

2018 ◽

Cited By ~ 1

Author(s):

Tobias Andermann ◽

Angela Cano ◽

Alexander Zizka ◽

Christine Bacon ◽

Alexandre Antonelli

Keyword(s):

Evolutionary Biology ◽

Sequence Data ◽

Model Organisms ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequence Capture ◽

Sequencing Platforms ◽

User Friendly

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing platforms such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive empirical data tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (synonyms: target or hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical sequence capture dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.

iCOMIC: a graphical interface-driven bioinformatics pipeline for analyzing cancer omics data

10.1101/2021.09.18.460896 ◽

2021 ◽

Author(s):

Anjana Anilkumar Sithara ◽

Devi Priyanka Maripuri ◽

Keerthika Moorthy ◽

Sai Sruthi Amirtha Ganesh ◽

Philge Philip ◽

...

Keyword(s):

Data Analysis ◽

Workflow Management ◽

Human Monocyte ◽

Complex Data ◽

Omics Data ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Sequencing Technologies ◽

Fastq Format ◽

User Friendly

Despite the tremendous increase in omics data generated by modern sequencing technologies, their analysis can be tricky and often requires substantial expertise in bioinformatics. To address this concern, we have developed a user-friendly pipeline to analyze (cancer) genomic data that takes in raw sequencing data (FASTQ format) as input and outputs insightful statistics on the nature of the data. Our iCOMIC toolkit pipeline can analyze whole-genome and transcriptome data and is embedded in the popular Snakemake workflow management system. iCOMIC is characterized by a user-friendly GUI that offers several advantages, including executing analyses with minimal steps, eliminating the need for complex command-line arguments. The toolkit features many independent core workflows for both whole genomic and transcriptomic data analysis. Even though all the necessary, well-established tools are integrated into the pipeline to enable "out-of-the-box" analysis, we provide the user with the means to replace modules or alter the pipeline as needed. Notably, we have integrated algorithms developed in-house for predicting driver and passenger mutations based on mutational context and tumor suppressor genes and oncogenes from somatic mutation data. We benchmarked our tool against Genome In A Bottle (GIAB) benchmark dataset (NA12878) and got the highest F1 score of 0.971 and 0.988 for indels and SNPs, respectively, using the BWA MEM - GATK HC DNA-Seq pipeline. Similarly, we achieved a correlation coefficient of r=0.85 using the HISAT2-StringTie-ballgown and STAR-StringTie-ballgown RNA-Seq pipelines on the human monocyte dataset (SRP082682). Overall, our tool enables easy analyses of omics datasets, with minimal steps, significantly ameliorating complex data analysis pipelines. Availability: https://github.com/RamanLab/iCOMIC
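The F1 scores quoted in this abstract are the harmonic mean of precision and recall over called variants. As a minimal sketch of that arithmetic (the function name and the counts below are hypothetical illustrations, not taken from iCOMIC or the GIAB benchmark):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score from true-positive, false-positive and false-negative counts."""
    # Precision: fraction of called variants that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of true variants that were called.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example = f1_score(tp=90, fp=10, fn=10)
print(round(example, 3))
```

With equal precision and recall, as in these made-up counts, F1 equals both of them; it drops towards the lower of the two when they diverge.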

BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data

10.1101/2021.10.02.462868 ◽

2021 ◽

Author(s):

Jacob L. Steenwyk ◽

Thomas J. Buida ◽

Carla Goncalves ◽

Dayna C. Goltz ◽

Grace H Morales ◽

...

Keyword(s):

Codon Usage ◽

Sequence Data ◽

Synonymous Codon ◽

Synonymous Codon Usage ◽

Relative Synonymous Codon Usage ◽

Summary Statistics ◽

Sequencing Data ◽

Sequence Alignments ◽

Multiple Sequence ◽

Genome Assemblies

Bioinformatic analysis - such as genome assembly quality assessment, alignment summary statistics, relative synonymous codon usage, paired-end aware quality trimming and filtering of sequencing reads, file format conversion, and processing and analysis - is integrated into diverse disciplines in the biological sciences. Several command-line pieces of software have been developed to conduct some of these individual analyses; however, the lack of a unified toolkit that conducts all these analyses can be a barrier in workflows. To address this obstacle, we introduce BioKIT, a versatile toolkit for the UNIX shell environment with 40 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more. To demonstrate the utility of BioKIT, we assessed the quality and characteristics of 901 eukaryotic genome assemblies, calculated alignment summary statistics for 10 phylogenomic data matrices, determined relative synonymous codon usage across 171 fungal genomes including those that use alternative genetic codes, and demonstrate that a novel metric, gene-wise relative synonymous codon usage, can accurately estimate gene-wise codon optimization. BioKIT will be helpful in facilitating and streamlining sequence analysis workflows. BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPi (https://pypi.org/project/biokit), and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).
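Relative synonymous codon usage (RSCU), mentioned in this abstract, is conventionally the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below illustrates that standard definition only; it does not use BioKIT's own interface, and the two-amino-acid codon table and the function name are assumptions made for brevity.

```python
from collections import Counter

# Hypothetical, truncated codon table: two amino acids only, for illustration.
SYNONYMOUS = {
    "F": ["TTT", "TTC"],
    "K": ["AAA", "AAG"],
}


def rscu(cds: str, families: dict = SYNONYMOUS) -> dict:
    """Compute RSCU for each codon in the given synonymous families."""
    # Split the coding sequence into complete codons, ignoring a trailing partial codon.
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    result = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        expected = total / len(family)  # count under equal usage
        for codon in family:
            result[codon] = counts[codon] / expected
    return result


# Toy sequence: TTT TTT TTC AAA AAG
values = rscu("TTTTTTTTCAAAAAG")
print(values)
```

A value above 1 means a codon is used more often than expected under equal usage, and a value below 1 means less often.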

A resource for improved predictions of Trypanosoma and Leishmania protein three-dimensional structure

10.1101/2021.09.02.458674 ◽

2021 ◽

Author(s):

Richard John Wheeler

Keyword(s):

Protein Structure ◽

Protein Sequence ◽

Structure Prediction ◽

Sequence Data ◽

Three Dimensional ◽

Model Organisms ◽

Dimensional Structure ◽

Sequence Alignments ◽

High Quality ◽

Multiple Sequence

AlphaFold2 and RoseTTAfold represent a transformative advance for predicting protein structure. They are able to make very high-quality predictions given a high-quality alignment of the protein sequence with related proteins. These predictions are now readily available via the AlphaFold database of predicted structures and AlphaFold/RoseTTAfold Colaboratory notebooks for custom predictions. However, predictions for some species tend to be lower confidence than model organisms. This includes Trypanosoma cruzi and Leishmania infantum: important unicellular eukaryotic human parasites in an early-branching eukaryotic lineage. The cause appears to be due to poor sampling of this branch of life in the protein sequences databases used for the AlphaFold database and ColabFold. Here, by comprehensively gathering openly available protein sequence data for species from this lineage, significant improvements to AlphaFold2 protein structure prediction over the AlphaFold database and ColabFold are demonstrated. This is made available as an easy-to-use tool for the parasitology community in the form of Colaboratory notebooks for generating multiple sequence alignments and AlphaFold2 predictions of protein structure for Trypanosoma, Leishmania and related species.

A resource for improved predictions of Trypanosoma and Leishmania protein three-dimensional structure

PLoS ONE ◽

10.1371/journal.pone.0259871 ◽

2021 ◽

Vol 16 (11) ◽

pp. e0259871

Author(s):

Richard John Wheeler

Keyword(s):

Protein Structure ◽

Protein Sequence ◽

Structure Prediction ◽

Sequence Data ◽

Three Dimensional ◽

Model Organisms ◽

Dimensional Structure ◽

Sequence Alignments ◽

High Quality ◽

Multiple Sequence

AlphaFold2 and RoseTTAfold represent a transformative advance for predicting protein structure. They are able to make very high-quality predictions given a high-quality alignment of the protein sequence with related proteins. These predictions are now readily available via the AlphaFold database of predicted structures and AlphaFold or RoseTTAfold Colaboratory notebooks for custom predictions. However, predictions for some species tend to be lower confidence than model organisms. Problematic species include Trypanosoma cruzi and Leishmania infantum: important unicellular eukaryotic human parasites in an early-branching eukaryotic lineage. The cause appears to be due to poor sampling of this branch of life (Discoba) in the protein sequences databases used for the AlphaFold database and ColabFold. Here, by comprehensively gathering openly available protein sequence data for Discoba species, significant improvements to AlphaFold2 protein structure prediction over the AlphaFold database and ColabFold are demonstrated. This is made available as an easy-to-use tool for the parasitology community in the form of Colaboratory notebooks for generating multiple sequence alignments and AlphaFold2 predictions of protein structure for Trypanosoma, Leishmania and related species.

Rapture-ready darters: choice of reference genome and genotyping method (whole-genome or sequence capture) influence population genomic inference in Etheostoma

10.1101/2020.05.21.108274 ◽

2020 ◽

Author(s):

Brendan N. Reid ◽

Rachel L. Moran ◽

Christopher J. Kopack ◽

Sarah W. Fitzpatrick

Keyword(s):

Reference Genome ◽

Sequence Data ◽

Low Cost ◽

Read Depth ◽

Model Organisms ◽

Whole Genome ◽

Reduced Representation ◽

Sequence Capture ◽

Population Genomic ◽

The Impact

Researchers studying non-model organisms have an increasing number of methods available for generating genomic data. However, the applicability of different methods across species, as well as the effect of reference genome choice on population genomic inference, are still difficult to predict in many cases. We evaluated the impact of data type (whole-genome vs. reduced representation) and reference genome choice on data quality and on population genomic and phylogenomic inference across several species of darters (subfamily Etheostomatinae), a highly diverse radiation of freshwater fish. We generated a high-quality reference genome and developed a hybrid RADseq/sequence capture (Rapture) protocol for the Arkansas darter (Etheostoma cragini). Rapture data from 1900 individuals spanning four darter species showed recovery of most loci across darter species at high depth and consistent estimates of heterozygosity regardless of reference genome choice. Loci with baits spanning both sides of the restriction enzyme cut site performed especially well across species. For low-coverage whole-genome data, choice of reference genome affected read depth and inferred heterozygosity. For similar amounts of sequence data, Rapture performed better at identifying fine-scale genetic structure compared to whole-genome sequencing. Rapture loci also recovered an accurate phylogeny for the study species and demonstrated high phylogenetic informativeness across the evolutionary history of the genus Etheostoma. Low cost and high cross-species effectiveness regardless of reference genome suggest that Rapture and similar sequence capture methods may be worthwhile choices for studies of diverse species radiations.

VisFeature: a stand-alone program for visualizing and analyzing statistical features of biological sequences

Bioinformatics ◽

10.1093/bioinformatics/btz689 ◽

2019 ◽

Cited By ~ 3

Author(s):

Jun Wang ◽

Pu-Feng Du ◽

Xin-Yu Xue ◽

Guang-Ping Li ◽

Yuan-Ke Zhou ◽

...

Keyword(s):

Sequence Data ◽

Software Tool ◽

Data Retrieval ◽

Supplementary Information ◽

Statistical Features ◽

Biological Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Source Codes ◽

Multiple Sequence Alignments

Summary: Many efforts have been made in developing bioinformatics algorithms to predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and to understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we developed VisFeature, which aims to be a helpful software tool that allows the users to intuitively visualize and analyze statistical features of all types of biological sequence, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignments and statistical feature generation functions. Availability and implementation: VisFeature is a desktop application that is implemented using JavaScript/Electron and R. The source codes of VisFeature are freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact: [email protected] or [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
