BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data

2021 ◽  
Author(s):  
Jacob L. Steenwyk ◽  
Thomas J. Buida ◽  
Carla Goncalves ◽  
Dayna C. Goltz ◽  
Grace H Morales ◽  
...  

Bioinformatic analysis - such as genome assembly quality assessment, alignment summary statistics, relative synonymous codon usage, paired-end aware quality trimming and filtering of sequencing reads, file format conversion, and processing and analysis - is integrated into diverse disciplines in the biological sciences. Several command-line tools have been developed to conduct some of these individual analyses; however, the lack of a unified toolkit that conducts all of them can be a barrier in workflows. To address this obstacle, we introduce BioKIT, a versatile toolkit for the UNIX shell environment with 40 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more. To demonstrate the utility of BioKIT, we assessed the quality and characteristics of 901 eukaryotic genome assemblies, calculated alignment summary statistics for 10 phylogenomic data matrices, determined relative synonymous codon usage across 171 fungal genomes including those that use alternative genetic codes, and demonstrated that a novel metric, gene-wise relative synonymous codon usage, can accurately estimate gene-wise codon optimization. BioKIT will be helpful in facilitating and streamlining sequence analysis workflows. BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPI (https://pypi.org/project/biokit), and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).

2018 ◽  
Author(s):  
Tobias Andermann ◽  
Angela Cano ◽  
Alexander Zizka ◽  
Christine Bacon ◽  
Alexandre Antonelli

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing platforms such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (= hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.


2005 ◽  
Vol 03 (01) ◽  
pp. 157-168 ◽  
Author(s):  
PETER L. MEINTJES ◽  
ALLEN G. RODRIGO

Mutation in Human Immunodeficiency Virus type-1 (HIV-1) is extremely rapid, a consequence of a low-fidelity viral reverse transcription process. The envelope gene has been shown to accumulate substitutions at a rate of approximately 1% per year and can frequently spend a long time in the host (approximately 10 years). The relative synonymous codon usage (RSCU) in HIV-1 is known to be different from that of the human host. However, by reengineering the protein coding sequences of HIV-1 to reflect the RSCU patterns observed in humans, a large increase in protein expression is observed. It is reasonable to suggest that within a host there may be a selective drive for change in the RSCU of HIV-1 towards human RSCU. To test this hypothesis we analyzed HIV-1 partial envelope sequences from eight patients sampled serially in time. For each sequence, an RSCU table was constructed. Sequences were labelled as "early" or "late" depending on whether they were sampled before or after the mid-point of the study. Using the RSCU values as descriptor variables, a Principal Components Analysis (PCA) was performed. The first three components clearly discriminated between early and late sequences. We also constructed pooled groupwise RSCU tables for early and late sequences. The viral RSCU values of each of the groups were correlated with human RSCU. If there is selection for host-adaptation in RSCU, we expect that "late" viral RSCUs would tend to be more highly correlated with human RSCU than "early" viral RSCUs. In fact, tests of significance suggest that this is the case. However, closer examination of the data revealed that the apparent trend towards human RSCU can be attributed to the homogenization of the codon usage by mutation pressure rather than host adaptation.
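
As a rough illustration of the two steps this abstract describes (building an RSCU table for a sequence, then correlating two tables), here is a minimal Python sketch. It is not the authors' code. It assumes the standard genetic code, the usual RSCU definition (observed codon count divided by the count expected if all synonymous codons for that amino acid were used equally), and a plain Pearson correlation; the function names `rscu` and `rscu_correlation` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group sense codons into synonymous families, one family per amino acid.
FAMILIES = {}
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        FAMILIES.setdefault(_aa, []).append(_codon)


def rscu(cds):
    """Return {codon: RSCU} for an in-frame coding sequence.

    Amino acids that never occur in the sequence are left out of the table.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    table = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        for c in family:
            # observed count / (family total / number of synonymous codons)
            table[c] = counts[c] * len(family) / total
    return table


def rscu_correlation(table_a, table_b):
    """Pearson correlation of RSCU values over codons present in both tables."""
    shared = sorted(set(table_a) & set(table_b))
    xs = [table_a[c] for c in shared]
    ys = [table_b[c] for c in shared]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

With tables built this way for pooled "early" and "late" viral sequences and for a human reference set, the comparison described above amounts to checking which viral table has the higher `rscu_correlation` with the human table.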


2018 ◽  
Author(s):  
Tobias Andermann ◽  
Angela Cano ◽  
Alexander Zizka ◽  
Christine Bacon ◽  
Alexandre Antonelli

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing platforms such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive empirical data tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (synonyms: target or hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical sequence capture dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.


2021 ◽  
Author(s):  
Chao Xu ◽  
Wen B. Bao ◽  
Sheng L. Wu ◽  
Zheng C. Wu ◽  
...  

Enterotoxigenic E. coli is an important zoonotic pathogen causing diarrhea in humans and newborn animals. α-(1,2)-fucosyltransferase 2 (FUT2) is closely associated with the formation of pathogenic receptors of Enterotoxigenic E. coli. Codon usage bias analysis can help to better understand the molecular mechanisms and evolutionary relationships of a particular gene. To understand the codon usage pattern of the FUT2 gene, FUT2 coding sequences of nine species were selected from the GenBank database, and the nucleotide composition (GC content) and genetic indices, including the effective number of codons, relative synonymous codon usage, and relative codon usage bias, were calculated using R software to analyze codon usage bias and base composition in the FUT2 gene across species. The results showed that the codon usage of the FUT2 gene in different species was affected by GC bias, especially GC frequency at the third codon position (GC3). Most of the optimal codons were biased towards the G/C-ending types. GCC, CUG, UCC, GUG and AUC showed the highest relative synonymous codon usage values among the different species and were thus the most dominant codons. The codon usage characteristics of the FUT2 gene in Sus scrofa were similar to those of Bos taurus, and those of Homo sapiens were similar to those of Pan troglodytes. The effective number of codons was significantly and negatively correlated with GC3, and the relatively higher frequency of optimal codons implied that FUT2 genes from different species had a strong bias in codon usage.
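
The GC and GC3 measures mentioned in this abstract can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' R code. It assumes the simplest definition of GC3 (the fraction of G or C among the third positions of all complete codons, without excluding stop, Met, or Trp codons, as some definitions do); the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T, U) in seq."""
    bases = [b for b in seq.upper() if b in "ACGTU"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def gc3(cds):
    """GC fraction at third codon positions of an in-frame coding sequence."""
    # Slicing from index 2 in steps of 3 picks the third base of each complete
    # codon; a trailing partial codon has no third base and is skipped.
    return gc_content(cds[2::3])
```

For example, the sequence `ATGGCCCTG` has third-position bases G, C, and G, so `gc3` returns 1.0 while the overall `gc_content` is 6/9.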


2009 ◽  
Vol 2 (3) ◽  
pp. 133-141
Author(s):  
Tangjie Zhang ◽  
Hong Chang ◽  
Yuzhi Liu ◽  
Huifang Li ◽  
Kuanwei Chen

Codon usage in mitochondrial genes of 11 Gallus gallus and two Anatidae species was analysed to determine the general patterns in codon choice of Gallus gallus species. C3 contents were higher in Gallus gallus than in mammalian mitochondrial genomes that encode protein codon positions. The high C3 contents of Gallus gallus might be the result of relatively strong mutational bias that occurred in the lineage of the Gallus gallus species. A- and C-ending codons were detected as the “preferred” codons in Gallus gallus and Anatidae. The NNR codon families are dominated by the A-ending codons, the NNY codon families are dominated by the C-ending codons and the NNN codon families are dominated by the A-ending or the C-ending codons. A comparison of the relative synonymous codon usage (RSCU) and synonymous codon families (SCF) of tRNA and proteins was made, and two groups can be classified by SCF. The codon usage in Gallus gallus species indicates that codons containing A or C at the third position are used preferentially, regardless of whether corresponding tRNAs are encoded in the mtDNA. In both Gallus gallus and Anatidae species mtDNA, codon usage biases are highly related to CC-ending binucleotide codons.

