Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files

FASTA and FASTQ are the most widely used biological data formats and have become the de facto standard for exchanging sequence data between bioinformatics tools. With the avalanche of next-generation sequencing data, the amount of sequence data deposited and accessed in FASTA/Q formats is increasing dramatically. However, existing tools are very inefficient at random retrieval of subsequences because they must load the entire index into memory. In addition, most existing tools cannot build an index for large FASTA/Q files because of limited memory. Furthermore, these tools do not support random access to sequences in gzip-compressed FASTA/Q files, even though gzip is widely used by public databases to save storage. In this study, we developed pyfastx, a versatile Python package with commonly used command-line tools, to overcome these limitations. Compared with other tools, pyfastx achieved the highest performance in index building and random access to sequences, particularly for large FASTA/Q files with hundreds of millions of sequences. A key advantage of pyfastx over other tools is that it offers an efficient way to randomly extract subsequences directly from gzip-compressed FASTA/Q files without decompressing them first. Pyfastx can easily be installed from PyPI (https://pypi.org/project/pyfastx) and the source code is freely available at https://github.com/lmdu/pyfastx.
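To make the idea of index-based random access concrete, here is a minimal sketch using only the Python standard library. It is not pyfastx's implementation or API: the function names `build_index`, `fetch` and `demo` are hypothetical, the index is a plain in-memory dictionary of byte offsets, and only uncompressed FASTA is handled. The sketch also assumes every header line carries a name.

```python
import os
import tempfile


def build_index(path):
    """Map each sequence name to (start, stop) byte offsets of its sequence lines.

    Assumption: every header line has a non-empty name after '>'.
    """
    index = {}
    name = None
    start = 0
    pos = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if name is not None:
                    index[name] = (start, pos)  # close the previous record
                name = line[1:].split()[0].decode()
                start = pos + len(line)  # sequence begins after the header
            pos += len(line)
    if name is not None:
        index[name] = (start, pos)  # close the last record
    return index


def fetch(path, index, name, begin=None, end=None):
    """Seek straight to one record and slice it (0-based, end-exclusive)."""
    start, stop = index[name]
    with open(path, "rb") as handle:
        handle.seek(start)
        raw = handle.read(stop - start)
    seq = b"".join(raw.split()).decode()  # drop line breaks
    return seq[begin:end]


def demo():
    """Write a tiny FASTA file, index it, and fetch two pieces."""
    text = ">chr1 test\nACGTAC\nGTTT\n>chr2\nGGGCCC\n"
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
        tmp.write(text)
    try:
        idx = build_index(tmp.name)
        return fetch(tmp.name, idx, "chr1", 4, 8), fetch(tmp.name, idx, "chr2")
    finally:
        os.remove(tmp.name)
```

This sketch still reads a whole record before slicing it. A production tool would typically also store the line length so that it can seek directly to the requested subsequence, and would keep the index on disk rather than in memory.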


Exogene: A performant workflow for detecting viral integrations from paired-end next-generation sequencing data

10.1101/2021.04.19.440427 ◽

2021 ◽

Author(s):

Jean-Pierre Kocher ◽

Zachary Stephens ◽

Daniel O'Brien ◽

Mrunal Dehankar ◽

Lewis Roberts ◽

...

Keyword(s):

Next Generation Sequencing ◽

Sequence Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Long Read ◽

Breakpoint Detection ◽

Targeted Capture ◽

Genome Heterogeneity ◽

Generation Sequencing

The integration of viruses into the human genome is known to be associated with tumorigenesis in many cancers, but the accurate detection of integration breakpoints from short read sequencing data is made difficult by human-viral homologies, viral genome heterogeneity, coverage limitations, and other factors. To address this, we present Exogene, a sensitive and efficient workflow for detecting viral integrations from paired-end next generation sequencing data. Exogene's read filtering and breakpoint detection strategies yield integration coordinates that are highly concordant with those found in long read validation sets. We demonstrate this concordance across 6 TCGA Hepatocellular carcinoma (HCC) tumor samples, identifying integrations of hepatitis B virus that are validated by long reads. Additionally, we applied Exogene to targeted capture data from 426 previously studied HCC samples, achieving 98.9% concordance with existing methods and identifying 238 high-confidence integrations that were not previously reported. Exogene is applicable to multiple types of paired-end sequence data, including genome, exome, RNA-Seq or targeted capture.


Exogene: A performant workflow for detecting viral integrations from paired-end next-generation sequencing data

PLoS ONE ◽

10.1371/journal.pone.0250915 ◽

2021 ◽

Vol 16 (9) ◽

pp. e0250915

Author(s):

Zachary Stephens ◽

Daniel O’Brien ◽

Mrunal Dehankar ◽

Lewis R. Roberts ◽

Ravishankar K. Iyer ◽

...

Keyword(s):

Next Generation Sequencing ◽

Sequence Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Long Read ◽

Breakpoint Detection ◽

Targeted Capture ◽

Genome Heterogeneity ◽

Generation Sequencing

The integration of viruses into the human genome is known to be associated with tumorigenesis in many cancers, but the accurate detection of integration breakpoints from short read sequencing data is made difficult by human-viral homologies, viral genome heterogeneity, coverage limitations, and other factors. To address this, we present Exogene, a sensitive and efficient workflow for detecting viral integrations from paired-end next generation sequencing data. Exogene’s read filtering and breakpoint detection strategies yield integration coordinates that are highly concordant with long read validation. We demonstrate this concordance across 6 TCGA Hepatocellular carcinoma (HCC) tumor samples, identifying integrations of hepatitis B virus that are also supported by long reads. Additionally, we applied Exogene to targeted capture data from 426 previously studied HCC samples, achieving 98.9% concordance with existing methods and identifying 238 high-confidence integrations that were not previously reported. Exogene is applicable to multiple types of paired-end sequence data, including genome, exome, RNA-Seq and targeted capture.


easyfm : An easy software suite for file manipulation of Next Generation Sequencing data on desktops

10.1101/2021.09.29.462291 ◽

2021 ◽

Author(s):

Hyungtaek Jung ◽

Brendan Jeon ◽

Daniel Ortiz-Barrientos

Keyword(s):

Next Generation Sequencing ◽

High Performance ◽

Biological Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

File Formats ◽

Biological Data Analysis ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Storing and manipulating Next Generation Sequencing (NGS) file formats is an essential but difficult task in biological data analysis. The easyfm (easy file manipulation) toolkit (https://github.com/TaekAndBrendan/easyfm) makes manipulating commonly used NGS files more accessible to biologists. It enables them to perform end-to-end reproducible data analyses using a free standalone desktop application (available on Windows, Mac and Linux). Unlike existing tools (e.g. Galaxy), the Graphical User Interface (GUI)-based easyfm does not depend on any high-performance computing (HPC) system and can be operated without an internet connection. This specific benefit allows easyfm to seamlessly integrate visual and interactive representations of NGS files, supporting a wider scope of bioinformatics applications in the life sciences.


Using a fast clustering method for viral segment lineage determination, applied to the H9 influenza hemagglutinin.

10.7287/peerj.preprints.3166 ◽

2017 ◽

Author(s):

Andrew Dalby ◽

Lorna Tinworth ◽

Joshua Sealy ◽

Munir Iqbal

Keyword(s):

Phylogenetic Trees ◽

Sequence Data ◽

Next Generation Sequencing Data ◽

Viral Sequence ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

Influenza Hemagglutinin ◽

Tree Construction ◽

Alternative Approach ◽

Generation Sequencing

Lineage determination is an important part of the analysis of viral sequence data. Previously this has depended on phylogenetic analysis in order to identify distinct clades within the phylogenetic trees. This method is time consuming and dependent on a set of empirical rules for clade identification. An alternative approach is to use clustering. Clustering is commonly used to identify operational taxonomic units in next generation sequencing data. In this paper we use clustering in order to rapidly identify viral segment lineages and clades without the need for tree construction.
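The abstract does not name the clustering algorithm, so the following is only an illustrative sketch of how identity-based clustering can group sequences without building a tree. It uses a simple greedy centroid scheme, which is an assumption and not necessarily the authors' method. The function names, the toy sequences and the 0.85 threshold are all hypothetical, and the sequences are assumed to be pre-aligned to equal length.

```python
def identity(a, b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first cluster whose centroid it matches.

    seqs: dict mapping name -> aligned sequence.
    Returns a list of clusters; each cluster is a list of names and its
    first member acts as the centroid.
    """
    clusters = []
    for name, seq in seqs.items():
        for cluster in clusters:
            centroid = seqs[cluster[0]]
            if identity(seq, centroid) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing centroid is similar enough: start a new cluster.
            clusters.append([name])
    return clusters


# Toy example with two obvious groups (hypothetical data).
toy = {
    "a": "ACGTACGTAC",
    "b": "ACGTACGTAA",
    "c": "TTTTGGGGCC",
    "d": "TTTTGGGGCA",
}
groups = greedy_cluster(toy, threshold=0.85)
```

With the toy data, sequences "a" and "b" differ at one of ten positions, as do "c" and "d", so each pair falls into its own cluster. The result of a greedy scheme depends on input order, which is one reason real tools often sort sequences before clustering.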


Using a fast clustering method for viral segment lineage determination, applied to the H9 influenza hemagglutinin.

10.7287/peerj.preprints.3166v1 ◽

2017 ◽

Author(s):

Andrew Dalby ◽

Lorna Tinworth ◽

Joshua Sealy ◽

Munir Iqbal

Keyword(s):

Phylogenetic Trees ◽

Sequence Data ◽

Next Generation Sequencing Data ◽

Viral Sequence ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

Influenza Hemagglutinin ◽

Tree Construction ◽

Alternative Approach ◽

Generation Sequencing


Compression of Next-Generation Sequencing Data and of DNA Digital Files

Algorithms ◽

10.3390/a13060151 ◽

2020 ◽

Vol 13 (6) ◽

pp. 151

Author(s):

Bruno Carpentieri

Keyword(s):

Next Generation Sequencing ◽

Dna Sequences ◽

Network Traffic ◽

Large Scale ◽

Genomic Data ◽

Biological Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

The memory and network traffic consumed by newly sequenced biological data have grown sharply in recent years. Genomic projects such as HapMap and 1000 Genomes have contributed to the very large rise in databases and network traffic related to genomic data and to the development of new efficient technologies. The large-scale sequencing of DNA samples has attracted new attention and produced new research, and the scientific community's interest in genomic data has greatly increased. In a very short time, researchers have developed hardware tools, analysis software, algorithms, private databases, and infrastructures to support research in genomics. In this paper, we analyze different approaches for compressing digital files that are generated by Next-Generation Sequencing tools and contain nucleotide sequences, and we discuss and evaluate the compression performance of generic compression algorithms by comparing them with Quip, a system designed by Jones et al. specifically for genomic file compression. Moreover, we present a simple but effective technique for the compression of DNA sequences in which we consider only the relevant DNA data, and we experimentally evaluate its performance.
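As a rough illustration of the kind of comparison described above, the sketch below measures generic compressors from the Python standard library against a fixed 2-bit packing of the four nucleotides. The 2-bit packing is one common way to exploit the small DNA alphabet and is an assumption here: it is not Quip, and it is not necessarily the technique the paper evaluates. The names `pack`, `unpack` and `compressed_sizes` are hypothetical, and the input is assumed to contain only A, C, G and T.

```python
import bz2
import lzma
import zlib

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Invert pack(); length is needed to drop padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def compressed_sizes(seq):
    """Return the size in bytes of each representation of seq."""
    raw = seq.encode("ascii")
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw, 9)),
        "bz2": len(bz2.compress(raw, 9)),
        "lzma": len(lzma.compress(raw)),
        "2bit": len(pack(seq)),
    }
```

Calling `compressed_sizes` on a real sequence gives a quick view of how the generic algorithms compare with the fixed 2 bits per base. Real sequencing files also carry headers, quality scores and ambiguous bases such as N, none of which this sketch handles.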


Training workshop on Mycobacterium whole genome sequence data analysis

EMBnet journal ◽

10.14806/ej.24.0.920 ◽

2019 ◽

Vol 24 ◽

Author(s):

Yonas Kassahun Hirutu ◽

Mesert D Bayeleygne ◽

Adey F Desta ◽

Tewodros Tariku ◽

Markos Abebe

Keyword(s):

Life Science ◽

Sequence Data ◽

Addis Ababa ◽

Whole Genome Sequence ◽

Next Generation Sequencing Data ◽

Whole Genome ◽

Training Workshop ◽

Sequencing Data ◽

Training Initiative ◽

Generation Sequencing

A basic bioinformatics training workshop was conducted at the Armauer Hansen Research Institute (AHRI), Addis Ababa, Ethiopia. This report describes a bioinformatics training initiative started at AHRI that aims to support life science researchers and postgraduates in handling next-generation sequencing data.


grabseqs: simple downloading of reads and metadata from multiple next-generation sequencing data repositories

Bioinformatics ◽

10.1093/bioinformatics/btaa167 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3607-3609

Author(s):

Louis J Taylor ◽

Arwa Abbas ◽

Frederic D Bushman

Keyword(s):

High Throughput Sequencing ◽

Source Code ◽

Supplementary Information ◽

Metagenomic Data ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Data Repositories ◽

Sequence Read Archive ◽

Python Package ◽

Generation Sequencing

Summary: High-throughput sequencing is a powerful technique for addressing biological questions. Grabseqs streamlines access to publicly available metagenomic data by providing a single, easy-to-use interface to download data and metadata from multiple repositories, including the Sequence Read Archive, the Metagenomics Rapid Annotation through Subsystems Technology server and iMicrobe. Users can download data and metadata in a standardized format from any number of samples or projects from a given repository with a single grabseqs command. Availability and implementation: Grabseqs is an open-source tool implemented in Python and licensed under the MIT license. The source code is freely available at https://github.com/louiejtaylor/grabseqs, the Python Package Index and Anaconda Cloud repository. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
