Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files

Author(s):  
Lianming Du ◽  
Qin Liu ◽  
Zhenxin Fan ◽  
Jie Tang ◽  
Xiuyue Zhang ◽  
...  

Abstract FASTA and FASTQ are the most widely used biological data formats and have become the de facto standard for exchanging sequence data between bioinformatics tools. With the avalanche of next-generation sequencing data, the amount of sequence data being deposited and accessed in FASTA/Q formats is increasing dramatically. However, existing tools are very inefficient at random retrieval of subsequences because they must load the entire index into memory. In addition, most existing tools cannot build an index for large FASTA/Q files because of limited memory. Furthermore, these tools do not support random access to sequences in FASTA/Q files compressed with gzip, which most public databases use to save storage. In this study, we developed pyfastx as a versatile Python package with commonly used command-line tools to overcome the above limitations. Compared to other tools, pyfastx yielded the highest performance in index building and random access to sequences, particularly when dealing with large FASTA/Q files containing hundreds of millions of sequences. A key advantage of pyfastx over other tools is that it offers an efficient way to randomly extract subsequences directly from gzip-compressed FASTA/Q files without decompressing them beforehand. Pyfastx can easily be installed from PyPI (https://pypi.org/project/pyfastx) and the source code is freely available at https://github.com/lmdu/pyfastx.
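The idea of index-based random access mentioned in the abstract can be illustrated with a minimal standard-library sketch. This is not pyfastx's implementation or API: the function names `build_index`, `fetch` and `write_demo_fasta` are hypothetical, the index layout merely follows the familiar offset-and-line-width scheme, and the sketch assumes a plain (uncompressed) FASTA file whose records have a uniform line width.

```python
import os
import tempfile


def build_index(path):
    """Scan a plain FASTA file once and record where each sequence starts.

    Returns {name: (seq_len, offset, line_bases, line_bytes)}.
    Assumption: every line of a record except the last has the same width.
    """
    index = {}
    name = None
    pos = 0  # byte offset of the line currently being read
    with open(path, "rb") as fh:
        for raw in fh:
            if raw.startswith(b">"):
                name = raw[1:].split()[0].decode()
                # Sequence data begins right after the header line.
                index[name] = [0, pos + len(raw), 0, 0]
            elif name is not None and raw.strip():
                entry = index[name]
                bases = len(raw.rstrip(b"\r\n"))
                if entry[2] == 0:  # first sequence line sets the line width
                    entry[2] = bases
                    entry[3] = len(raw)
                entry[0] += bases
            pos += len(raw)
    return {key: tuple(value) for key, value in index.items()}


def fetch(path, index, name, start, end):
    """Return bases [start, end) (0-based) by seeking to the needed bytes."""
    seq_len, offset, line_bases, line_bytes = index[name]
    end = min(end, seq_len)
    if start >= end:
        return ""
    # Byte position of a base = record offset + full lines before it + column.
    first = offset + (start // line_bases) * line_bytes + start % line_bases
    last = offset + ((end - 1) // line_bases) * line_bytes + (end - 1) % line_bases
    with open(path, "rb") as fh:
        fh.seek(first)
        chunk = fh.read(last - first + 1)
    # Drop the line breaks that fall inside the requested span.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


def write_demo_fasta():
    """Write a tiny illustrative FASTA file and return its path."""
    data = b">chr1 demo\nACGTACGTAC\nGGGTTTAAAC\nCC\n>chr2\nTTTT\n"
    with tempfile.NamedTemporaryFile(delete=False, suffix=".fa") as tmp:
        tmp.write(data)
        return tmp.name


if __name__ == "__main__":
    demo_path = write_demo_fasta()
    try:
        demo_index = build_index(demo_path)
        print(fetch(demo_path, demo_index, "chr1", 8, 13))
    finally:
        os.remove(demo_path)
```

Only the bytes covering the requested span are read, so the cost of a lookup does not depend on file size. Doing the same on gzip-compressed input, as the abstract describes, needs an additional index into the compressed stream, which this sketch does not attempt.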

2021 ◽  
Author(s):  
Jean-Pierre Kocher ◽  
Zachary Stephens ◽  
Daniel O'Brien ◽  
Mrunal Dehankar ◽  
Lewis Roberts ◽  
...  

The integration of viruses into the human genome is known to be associated with tumorigenesis in many cancers, but the accurate detection of integration breakpoints from short read sequencing data is made difficult by human-viral homologies, viral genome heterogeneity, coverage limitations, and other factors. To address this, we present Exogene, a sensitive and efficient workflow for detecting viral integrations from paired-end next generation sequencing data. Exogene's read filtering and breakpoint detection strategies yield integration coordinates that are highly concordant with those found in long read validation sets. We demonstrate this concordance across 6 TCGA Hepatocellular carcinoma (HCC) tumor samples, identifying integrations of hepatitis B virus that are validated by long reads. Additionally, we applied Exogene to targeted capture data from 426 previously studied HCC samples, achieving 98.9% concordance with existing methods and identifying 238 high-confidence integrations that were not previously reported. Exogene is applicable to multiple types of paired-end sequence data, including genome, exome, RNA-Seq or targeted capture.


PLoS ONE ◽  
2021 ◽  
Vol 16 (9) ◽  
pp. e0250915
Author(s):  
Zachary Stephens ◽  
Daniel O’Brien ◽  
Mrunal Dehankar ◽  
Lewis R. Roberts ◽  
Ravishankar K. Iyer ◽  
...  

The integration of viruses into the human genome is known to be associated with tumorigenesis in many cancers, but the accurate detection of integration breakpoints from short read sequencing data is made difficult by human-viral homologies, viral genome heterogeneity, coverage limitations, and other factors. To address this, we present Exogene, a sensitive and efficient workflow for detecting viral integrations from paired-end next generation sequencing data. Exogene’s read filtering and breakpoint detection strategies yield integration coordinates that are highly concordant with long read validation. We demonstrate this concordance across 6 TCGA Hepatocellular carcinoma (HCC) tumor samples, identifying integrations of hepatitis B virus that are also supported by long reads. Additionally, we applied Exogene to targeted capture data from 426 previously studied HCC samples, achieving 98.9% concordance with existing methods and identifying 238 high-confidence integrations that were not previously reported. Exogene is applicable to multiple types of paired-end sequence data, including genome, exome, RNA-Seq and targeted capture.


2021 ◽  
Author(s):  
Hyungtaek Jung ◽  
Brendan Jeon ◽  
Daniel Ortiz-Barrientos

Storing and manipulating Next Generation Sequencing (NGS) file formats is an essential but difficult task in biological data analysis. The easyfm (easy file manipulation) toolkit (https://github.com/TaekAndBrendan/easyfm) makes manipulating commonly used NGS files more accessible to biologists. It enables them to perform end-to-end reproducible data analyses using a free standalone desktop application (available on Windows, Mac and Linux). Unlike existing tools (e.g. Galaxy), the Graphical User Interface (GUI)-based easyfm is not dependent on any high-performance computing (HPC) system and can be operated without an internet connection. This specific benefit allows easyfm to seamlessly integrate visual and interactive representations of NGS files, supporting a wider scope of bioinformatics applications in the life sciences.


2017 ◽  
Author(s):  
Andrew Dalby ◽  
Lorna Tinworth ◽  
Joshua Sealy ◽  
Munir Iqbal

Lineage determination is an important part of the analysis of viral sequence data. Previously this has depended on phylogenetic analysis in order to identify distinct clades within the phylogenetic trees. This method is time consuming and dependent on a set of empirical rules for clade identification. An alternative approach is to use clustering. Clustering is commonly used to identify operational taxonomic units in next generation sequencing data. In this paper we use clustering in order to rapidly identify viral segment lineages and clades without the need for tree construction.


Algorithms ◽  
2020 ◽  
Vol 13 (6) ◽  
pp. 151
Author(s):  
Bruno Carpentieri

The memory and network traffic consumed by newly sequenced biological data have grown sharply in recent years. Genomic projects such as HapMap and 1000 Genomes have contributed to the very large rise of databases and network traffic related to genomic data and to the development of new efficient technologies. The large-scale sequencing of DNA samples has brought new attention and produced new research, and the scientific community's interest in genomic data has greatly increased. In a very short time, researchers have developed hardware tools, analysis software, algorithms, private databases, and infrastructures to support research in genomics. In this paper, we analyze different approaches for compressing digital files generated by Next-Generation Sequencing tools containing nucleotide sequences, and we discuss and evaluate the compression performance of generic compression algorithms by comparing them with Quip, a system designed by Jones et al. specifically for genomic file compression. Moreover, we present a simple but effective technique for the compression of DNA sequences in which we only consider the relevant DNA data, and we experimentally evaluate its performance.
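As a rough illustration of the kind of comparison this abstract describes, the sketch below keeps only the nucleotide lines of a FASTQ text and measures their size under three general-purpose compressors from the Python standard library. It is not the paper's method and does not involve Quip: the helper names (`make_fastq`, `sequence_lines`, `compressed_sizes`) are hypothetical, the reads are synthetic, and the input is assumed to use strict four-line FASTQ records.

```python
import bz2
import gzip
import lzma
import random


def make_fastq(n_reads=200, read_len=50, seed=0):
    """Build synthetic FASTQ text for illustration only (not real data)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FGHI#") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records)


def sequence_lines(fastq_text):
    """Keep only the nucleotide line (2nd of every 4) of each record.

    Assumption: records are exactly four lines, with no wrapped sequences.
    """
    lines = fastq_text.splitlines()
    return "\n".join(lines[1::4])


def compressed_sizes(data):
    """Size in bytes of `data`, raw and under three generic compressors."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
        "lzma": len(lzma.compress(data)),
    }


if __name__ == "__main__":
    fastq = make_fastq()
    print("full file:     ", compressed_sizes(fastq.encode()))
    print("sequences only:", compressed_sizes(sequence_lines(fastq).encode()))
```

Random synthetic reads compress far less well than real genomes, so the printed numbers say nothing about real-world ratios. The sketch only shows how such a measurement can be set up.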


2019 ◽  
Vol 24 ◽  
Author(s):  
Yonas Kassahun Hirutu ◽  
Mesert D Bayeleygne ◽  
Adey F Desta ◽  
Tewodros Tariku ◽  
Markos Abebe

Basic bioinformatics training workshop conducted at Armauer Hansen Research Institute (AHRI), Addis Ababa, Ethiopia. This report describes a bioinformatics training initiative started at AHRI aiming to support life science researchers and postgraduates in handling next-generation sequencing data.


2020 ◽  
Vol 36 (11) ◽  
pp. 3607-3609
Author(s):  
Louis J Taylor ◽  
Arwa Abbas ◽  
Frederic D Bushman

Abstract Summary: High-throughput sequencing is a powerful technique for addressing biological questions. Grabseqs streamlines access to publicly available metagenomic data by providing a single, easy-to-use interface to download data and metadata from multiple repositories, including the Sequence Read Archive, the Metagenomics Rapid Annotation through Subsystems Technology server and iMicrobe. Users can download data and metadata in a standardized format from any number of samples or projects from a given repository with a single grabseqs command. Availability and implementation: Grabseqs is an open-source tool implemented in Python and licensed under the MIT license. The source code is freely available at https://github.com/louiejtaylor/grabseqs, the Python Package Index and Anaconda Cloud repository. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

