2FAST2Q: A general-purpose sequence search and counting program for FASTQ files

2021 ◽  
Author(s):  
Afonso Bravo ◽  
Athanasios Typas ◽  
Jan-Willem Veening

The increasingly widespread use of next-generation sequencing protocols has created a need for user-friendly raw data processing tools. Here, we present 2FAST2Q, a versatile and intuitive standalone program capable of extracting and counting feature occurrences in FASTQ files. 2FAST2Q can be used in any experimental setup that requires feature extraction from raw reads, and quickly handles mismatch alignments, nucleotide-wise Phred score filtering, custom read trimming, and sequence searching within a single program. Using published CRISPRi datasets in which Escherichia coli and Mycobacterium tuberculosis gene essentiality, as well as host-cell sensitivity towards SARS-CoV-2 infectivity, were tested, we demonstrate that 2FAST2Q efficiently reproduces the read counts per provided feature obtained with traditional pipelines. Moreover, we show how different FASTQ read filtering parameters impact downstream analysis, and suggest a default usage protocol. 2FAST2Q has a familiar user interface and uses a custom sequence mismatch search algorithm that takes advantage of the JIT runtime speeds of Python's numba module. It is thus easier to use and faster than currently available tools, efficiently processing large CRISPRi-Seq or random-barcode sequencing datasets on any up-to-date laptop. 2FAST2Q is available as an executable file for all current operating systems without installation and as a Python3 module on the PyPI repository (available at https://veeninglab.com/2fast2q). We expect that 2FAST2Q will be useful not only for people working in microbiology but also for other fields in which amplicon sequencing data is generated.
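The combination of read trimming, nucleotide-wise Phred filtering, and mismatch-tolerant feature counting described above can be illustrated with a minimal sketch. This is not the 2FAST2Q implementation: the function names, the fixed trimming window, the Phred+33 encoding, and the rule of discarding ambiguous mismatch hits are all assumptions made for illustration, and the sketch uses plain Python rather than numba.

```python
from collections import Counter


def hamming_within(a, b, max_mismatches):
    """Return True if equal-length strings differ at no more than max_mismatches positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def passes_phred(quality, min_phred, offset=33):
    """Return True if every base in the quality string meets the Phred cutoff (Phred+33 assumed)."""
    return all(ord(ch) - offset >= min_phred for ch in quality)


def count_features(records, features, start, length, min_phred=30, max_mismatches=1):
    """Count feature occurrences in (sequence, quality) pairs.

    features maps a feature sequence to a feature name (hypothetical layout).
    Each read is trimmed to [start, start + length) before filtering and matching.
    """
    counts = Counter()
    for seq, qual in records:
        window = seq[start:start + length]
        window_qual = qual[start:start + length]
        if len(window) < length or not passes_phred(window_qual, min_phred):
            continue  # read too short or contains a low-quality base
        if window in features:
            counts[features[window]] += 1  # exact match
            continue
        hits = [name for fseq, name in features.items()
                if hamming_within(window, fseq, max_mismatches)]
        if len(hits) == 1:  # assumption: ambiguous mismatch hits are discarded
            counts[hits[0]] += 1
    return counts


features = {"ACGTACGT": "geneA", "TTTTCCCC": "geneB"}
records = [
    ("ACGTACGT", "IIIIIIII"),  # exact match to geneA
    ("ACGTACGA", "IIIIIIII"),  # one mismatch to geneA
    ("TTTTCCCC", "IIII!III"),  # one base with Phred 0, filtered out
    ("GGGGGGGG", "IIIIIIII"),  # matches no feature
]
counts = count_features(records, features, start=0, length=8)
```

In this toy input, two reads are assigned to geneA and the geneB read is rejected by the quality filter, which shows how the choice of Phred cutoff and mismatch tolerance changes the final counts.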

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Marius Welzel ◽  
Anja Lange ◽  
Dominik Heider ◽  
Michael Schwarz ◽  
Bernd Freisleben ◽  
...  

Background: Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. Results: We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control. Thus, Snakemake ensures reproducibility and Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies supports hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix) or as a Docker container on DockerHub (https://hub.docker.com/r/mw55/natrix). Conclusion: Natrix is a user-friendly and highly extensible workflow for processing Illumina amplicon data.


2018 ◽  
Author(s):  
Kendell Clement ◽  
Rick Farouni ◽  
Daniel E. Bauer ◽  
Luca Pinello

Motivation: Unique molecular identifiers (UMIs) are added to DNA fragments before PCR amplification to discriminate between alleles arising from the same genomic locus and sequencing reads produced by PCR amplification. While computational methods have been developed to take into account UMI information in genome-wide and single-cell sequencing studies, they are not designed for modern amplicon-based sequencing experiments, especially in cases of high allelic diversity. Importantly, no guidelines are provided for the design of optimal UMI length for amplicon-based sequencing experiments. Results: Based on the total number of DNA fragments and the distribution of allele frequencies, we present a model for the determination of the minimum UMI length required to prevent UMI collisions and reduce allelic distortion. We also introduce a user-friendly software tool called AmpUMI to assist in the design and the analysis of UMI-based amplicon sequencing studies. AmpUMI provides quality control metrics on frequency and quality of UMIs, and trims and deduplicates amplicon sequences with user-specified parameters for use in downstream analysis. AmpUMI is open-source and freely available at http://github.com/pinellolab/[email protected]
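To make the idea of a minimum UMI length concrete, the sketch below uses a simplified birthday-problem calculation. It is not the AmpUMI model, which also accounts for the distribution of allele frequencies. It assumes that every one of the 4^L possible UMIs of length L is equally likely and that a collision means any two molecules sharing a UMI. The function names and the `max_length` search bound are hypothetical.

```python
import math


def collision_probability(n_molecules, umi_length):
    """Probability that at least two of n_molecules share a UMI.

    Assumes UMIs are drawn uniformly from all 4**umi_length sequences.
    """
    n_umis = 4 ** umi_length
    if n_molecules > n_umis:
        return 1.0  # more molecules than UMIs: a collision is certain
    # Sum logs of (1 - i / n_umis) to avoid underflow in the product.
    log_p_unique = sum(math.log1p(-i / n_umis) for i in range(n_molecules))
    return 1.0 - math.exp(log_p_unique)


def min_umi_length(n_molecules, max_collision_prob, max_length=30):
    """Smallest UMI length whose collision probability is at most max_collision_prob."""
    for length in range(1, max_length + 1):
        if collision_probability(n_molecules, length) <= max_collision_prob:
            return length
    raise ValueError("no UMI length up to max_length satisfies the threshold")


# Two molecules with a 1-nt UMI (4 possibilities) collide with probability 1/4.
p_two_molecules = collision_probability(2, 1)
shortest = min_umi_length(2, 0.1)
```

Under these assumptions the required length grows roughly with the logarithm of the squared molecule count, which is why even modest increases in input DNA can demand several extra UMI bases.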


2020 ◽  
Author(s):  
Marius Welzel ◽  
Anja Lange ◽  
Dominik Heider ◽  
Michael Schwarz ◽  
Bernd Freisleben ◽  
...  

Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control. Thus, Snakemake ensures reproducibility and Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies supports hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix).


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0243241
Author(s):  
Sebastian Hupfauf ◽  
Mohammad Etemadi ◽  
Marina Fernández-Delgado Juárez ◽  
María Gómez-Brandón ◽  
Heribert Insam ◽  
...  

In recent years, there has been a veritable boost in next-generation sequencing (NGS) of gene amplicons in biological and medical studies. Huge amounts of data are produced and need to be analyzed adequately. Various online and offline analysis tools are available; however, most of them require extensive expertise in computer science or bioinformatics, and often a Linux-based operating system. Here, we introduce “CoMA–Comparative Microbiome Analysis” as a free and intuitive analysis pipeline for amplicon-sequencing data, compatible with any common operating system. Moreover, the tool offers various useful services including data pre-processing, quality checking, clustering to operational taxonomic units (OTUs), taxonomic assignment, data post-processing, data visualization, and statistical appraisal. The workflow results in highly esthetic and publication-ready graphics, as well as output files in standardized formats (e.g. tab-delimited OTU-table, BIOM, NEWICK tree) that can be used for more sophisticated analyses. The CoMA output was validated by a benchmark test, using three mock communities with different sample characteristics (primer set, amplicon length, diversity). The performance was compared with that of Mothur, QIIME and QIIME2-DADA2, popular packages for NGS data analysis. Furthermore, the functionality of CoMA is demonstrated on a practical example, investigating microbial communities from three different soils (grassland, forest, swamp). All tools performed well in the benchmark test and were able to reveal the majority of all genera in the mock communities. Also for the soil samples, the results of CoMA were congruent to those of the other pipelines, in particular when looking at the key microbial players.


2021 ◽  
Author(s):  
Daniel Loos ◽  
Lu Zhang ◽  
Christine Beemelmanns ◽  
Oliver Kurzai ◽  
Gianni Panagiotou

Trillions of microbes representing all kingdoms of life are resident in, and on, humans, holding essential roles for host development and physiology. Over the last decade, more than a dozen online tools and servers, accessible via the public domain, have been developed for the analysis of bacterial sequences; however, the analysis of fungi is still in its infancy. Here we present a web server dedicated to the comprehensive analysis of the human mycobiome for (i) translating raw sequencing reads to data tables and high-standard figures; (ii) integrating statistical analysis and machine learning with a manually curated relational database; (iii) comparing the user’s uploaded datasets with publicly available ones from the Sequence Read Archive. Using 2,048 publicly available ITS samples, we demonstrated the utility of the DAnIEL web server on large-scale datasets and show the differences in fungal communities between human gut, skin, nasopharynx, and oral body sites.


2020 ◽  
Author(s):  
Christina Weißbecker ◽  
Beatrix Schnabel ◽  
Anna Heintz-Buschart

Background: Amplicon sequencing of phylogenetic marker genes, e.g. 16S, 18S or ITS rRNA sequences, is still the most commonly used method to determine the composition of microbial communities. Microbial ecologists often have expert knowledge on their biological question and data analysis in general, and most research institutes have computational infrastructures to employ the bioinformatics command line tools and workflows for amplicon sequencing analysis, but requirements of bioinformatics skills often limit the efficient and up-to-date use of computational resources. Results: dadasnake wraps pre-processing of sequencing reads, delineation of exact sequence variants using the favorably benchmarked, widely used DADA2 algorithm, taxonomic classification and post-processing of the resultant tables, and hand-off in standard formats, into a user-friendly, one-command Snakemake pipeline. The suitability of the provided default configurations is demonstrated using mock-community data from bacteria and archaea, as well as fungi. Conclusions: By use of Snakemake, dadasnake makes efficient use of high-performance computing infrastructures. Easy user configuration guarantees flexibility of all steps, including the processing of data from multiple sequencing platforms. dadasnake facilitates easy installation via conda environments. dadasnake is available at https://github.com/a-h-b/dadasnake.


2021 ◽  
Vol 12 ◽  
Author(s):  
Daniel Loos ◽  
Lu Zhang ◽  
Christine Beemelmanns ◽  
Oliver Kurzai ◽  
Gianni Panagiotou

Trillions of microbes representing all kingdoms of life are resident in, and on, humans, holding essential roles for host development and physiology. Over the last decade, more than a dozen online tools and servers, accessible via the public domain, have been developed for the analysis of bacterial sequences; however, the analysis of fungi is still in its infancy. Here, we present a web server dedicated to the comprehensive analysis of the human mycobiome for (i) translating raw sequencing reads to data tables and high-standard figures, (ii) integrating statistical analysis and machine learning with a manually curated relational database and (iii) comparing the user’s uploaded datasets with publicly available ones from the Sequence Read Archive. Using 1,266 publicly available internal transcribed spacer (ITS) samples, we demonstrated the utility of the DAnIEL web server on large-scale datasets and show the differences in fungal communities between human skin and soil sites.


2021 ◽  
Author(s):  
Héctor Rodriguez-Perez ◽  
Laura Ciuffreda ◽  
Carlos Flores

The study of microbial communities and their applications has been advanced by progress in sequencing techniques and bioinformatics tools. Oxford Nanopore Technologies long-read nanopore sequencing provides a portable and cost-efficient platform for sequencing assays, opening the possibility of its application outside specialized environments and of real-time data analysis. To complement the existing efficient library preparation protocol with a streamlined analytic workflow, here we present NanoRTax, a Nextflow pipeline for nanopore 16S rRNA amplicon data that features state-of-the-art taxonomic classification tools and real-time capability. The pipeline is paired with a web-based visual interface to enable user-friendly inspection of the experiment in progress.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Oksana Kutsyr ◽  
Lucía Maestre-Carballa ◽  
Mónica Lluesma-Gomez ◽  
Manuel Martinez-Garcia ◽  
Nicolás Cuenca ◽  
...  

The gut microbiome is known to influence the pathogenesis and progression of neurodegenerative diseases. However, there has been relatively little focus upon the implications of the gut microbiome in retinal diseases such as retinitis pigmentosa (RP). Here, we investigated changes in gut microbiome composition linked to RP, by assessing both retinal degeneration and gut microbiome in the rd10 mouse model of RP as compared to control C57BL/6J mice. In rd10 mice, retinal responsiveness to flashlight stimuli and visual acuity were deteriorated with respect to those observed in age-matched control mice. This functional decline in dystrophic animals was accompanied by photoreceptor loss, morphologic anomalies in photoreceptor cells and retinal reactive gliosis. Furthermore, 16S rRNA gene amplicon sequencing data showed a microbial gut dysbiosis with differences in alpha and beta diversity at the genera, species and amplicon sequence variant (ASV) levels between dystrophic and control mice. Remarkably, four fairly common ASVs in the healthy gut microbiome belonging to Rikenella spp., Muribaculaceae spp., Prevotellaceae UCG-001 spp., and Bacilli spp. were absent in the gut microbiome of retinal disease mice, while Bacteroides caecimuris was significantly enriched in mice with RP. The results indicate that retinal degenerative changes in RP are linked to relevant gut microbiome changes. The findings suggest that microbiome shifting could be considered as a potential biomarker and therapeutic target for retinal degenerative diseases.
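For readers unfamiliar with alpha and beta diversity, the sketch below computes one common measure of each from ASV count vectors. The abstract does not say which indices were used, so the choice of the Shannon index (alpha) and Bray-Curtis dissimilarity (beta) is an assumption, and the counts are invented toy values, not data from the study.

```python
import math


def shannon_index(counts):
    """Shannon alpha diversity (natural log) of one sample's ASV counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two samples over the same ASVs."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same ASVs")
    total = sum(sample_a) + sum(sample_b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(sample_a, sample_b)) / total


# Toy ASV counts (hypothetical): two samples sharing no ASVs.
control = [10, 10, 0, 0]
dystrophic = [0, 0, 10, 10]
alpha_control = shannon_index(control)
beta = bray_curtis(control, dystrophic)
```

Alpha diversity summarizes richness and evenness within a single sample, whereas beta diversity compares composition between samples; two samples with no ASVs in common have a Bray-Curtis dissimilarity of 1.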


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Caitlin M. Singleton ◽  
Francesca Petriglieri ◽  
Jannie M. Kristensen ◽  
Rasmus H. Kirkegaard ◽  
Thomas Y. Michaelsen ◽  
...  

Microorganisms play crucial roles in water recycling, pollution removal and resource recovery in the wastewater industry. The structure of these microbial communities is increasingly understood based on 16S rRNA amplicon sequencing data. However, such data cannot be linked to functional potential in the absence of high-quality metagenome-assembled genomes (MAGs) for nearly all species. Here, we use long-read and short-read sequencing to recover 1083 high-quality MAGs, including 57 closed circular genomes, from 23 Danish full-scale wastewater treatment plants. The MAGs account for ~30% of the community based on relative abundance, and meet the stringent MIMAG high-quality draft requirements including full-length rRNA genes. We use the information provided by these MAGs in combination with >13 years of 16S rRNA amplicon sequencing data, as well as Raman microspectroscopy and fluorescence in situ hybridisation, to uncover abundant undescribed lineages belonging to important functional groups.

