AMAS: a fast tool for alignment manipulation and computing of summary statistics

10.7287/peerj.preprints.1355v1 ◽

2015 ◽

Author(s):

Marek L Borowiec

Keyword(s):

Amino Acid ◽

Source Code ◽

Data Sets ◽

Command Line ◽

Summary Statistics ◽

Computationally Efficient ◽

Python Package ◽

Alignment Length ◽

Amino Acid Alphabet ◽

GC Contents

The amount of data used in phylogenetics has grown explosively in recent years, and many phylogenies are now inferred with hundreds or even thousands of loci and many taxa. These modern phylogenomic studies often entail separate analyses of each locus in addition to multiple analyses of subsets of genes or concatenated sequences. Computationally efficient tools for handling and computing properties of thousands of single-locus or large concatenated alignments are needed. Here I present AMAS (Alignment Manipulation And Summary), a tool that can be used either as a stand-alone command-line utility or as a Python package. AMAS works on amino acid and nucleotide alignments and combines sequence manipulation capabilities with a function that calculates basic statistics. The manipulation functions include conversions among popular formats, concatenation, extracting sites and splitting according to a pre-defined partitioning scheme, and creation of replicate data sets. The statistics calculated include the number of taxa, alignment length, total count of matrix cells, overall number of undetermined characters, percent of missing data, AT and GC contents (for DNA alignments), count and proportion of variable sites, count and proportion of parsimony informative sites, and counts of all characters relevant for a nucleotide or amino acid alphabet. AMAS is particularly suitable for very large alignments with hundreds of taxa and thousands of loci. It performs better at concatenation and summarizing alignments than other popular tools. AMAS is a Python 3 program that relies solely on Python’s core modules. AMAS source code and manual can be downloaded from http://github.com/marekborowiec/AMAS/.
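To make the concatenation step concrete, here is a minimal sketch of how per-locus alignments might be joined into one supermatrix with a matching partition scheme. It is not AMAS code: the function name `concatenate`, the dictionary-based input, and the choice of `?` as padding for taxa absent from a locus are all assumptions made for illustration.

```python
def concatenate(alignments, missing="?"):
    """Concatenate alignments given as {locus_name: {taxon: sequence}}.

    Returns (supermatrix, partitions), where partitions maps each locus
    name to its 1-based inclusive (start, end) columns in the supermatrix.
    The input layout and the padding symbol are illustrative assumptions.
    """
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = {}
    position = 0
    for locus, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {locus!r} differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this locus are padded with the missing symbol.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions[locus] = (position + 1, position + length)
        position += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Hypothetical toy data: two loci with partly overlapping taxon sets.
loci = {
    "locus1": {"taxonA": "ACGT", "taxonB": "ACGA"},
    "locus2": {"taxonA": "GG-", "taxonC": "GGC"},
}
matrix, parts = concatenate(loci)
for taxon, sequence in matrix.items():
    print(taxon, sequence)
print(parts)
```

Building each row from a list of pieces and joining once at the end avoids repeated string copying, which matters when thousands of loci are concatenated.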


AMAS: a fast tool for alignment manipulation and computing of summary statistics

PeerJ ◽

10.7717/peerj.1660 ◽

2016 ◽

Vol 4 ◽

pp. e1660 ◽

Cited By ~ 163

Author(s):

Marek L. Borowiec

Keyword(s):

Amino Acid ◽

Source Code ◽

Data Sets ◽

Command Line ◽

Summary Statistics ◽

Computationally Efficient ◽

Python Package ◽

Alignment Length ◽

Amino Acid Alphabet ◽

GC Contents

The amount of data used in phylogenetics has grown explosively in recent years, and many phylogenies are now inferred with hundreds or even thousands of loci and many taxa. These modern phylogenomic studies often entail separate analyses of each locus in addition to multiple analyses of subsets of genes or concatenated sequences. Computationally efficient tools for handling and computing properties of thousands of single-locus or large concatenated alignments are needed. Here I present AMAS (Alignment Manipulation And Summary), a tool that can be used either as a stand-alone command-line utility or as a Python package. AMAS works on amino acid and nucleotide alignments and combines sequence manipulation capabilities with a function that calculates basic statistics. The manipulation functions include conversions among popular formats, concatenation, extracting sites and splitting according to a pre-defined partitioning scheme, creation of replicate data sets, and removal of taxa. The statistics calculated include the number of taxa, alignment length, total count of matrix cells, overall number of undetermined characters, percent of missing data, AT and GC contents (for DNA alignments), count and proportion of variable sites, count and proportion of parsimony informative sites, and counts of all characters relevant for a nucleotide or amino acid alphabet. AMAS is particularly suitable for very large alignments with hundreds of taxa and thousands of loci. It is computationally efficient, utilizes parallel processing, and performs better at concatenation than other popular tools. AMAS is a Python 3 program that relies solely on Python’s core modules and needs no additional dependencies. AMAS source code and manual can be downloaded from http://github.com/marekborowiec/AMAS/ under the GNU General Public License.
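Several of the statistics listed in this abstract can be illustrated with a short sketch. The code below is not taken from AMAS; the function name `summarize`, the treatment of only `?` and `-` as undetermined characters, and the definition of AT and GC content as a fraction of A, C, G, and T counts are assumptions. It also assumes a DNA alignment without IUPAC ambiguity codes.

```python
from collections import Counter

MISSING = set("?-")  # assumption: only '?' and '-' count as undetermined


def summarize(alignment):
    """Compute basic statistics for a DNA alignment given as {taxon: sequence}."""
    seqs = [seq.upper() for seq in alignment.values()]
    if not seqs or not seqs[0]:
        raise ValueError("alignment is empty")
    n_taxa, length = len(seqs), len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("sequences differ in length")
    cells = n_taxa * length
    counts = Counter("".join(seqs))
    missing = sum(counts[char] for char in MISSING)
    at = counts["A"] + counts["T"]
    gc = counts["G"] + counts["C"]
    variable = informative = 0
    for column in zip(*seqs):
        states = Counter(char for char in column if char not in MISSING)
        if len(states) > 1:
            variable += 1
        # Parsimony informative: at least two states, each seen at least twice.
        if sum(1 for n in states.values() if n >= 2) >= 2:
            informative += 1
    return {
        "taxa": n_taxa,
        "length": length,
        "cells": cells,
        "missing": missing,
        "missing_percent": 100 * missing / cells,
        # Assumed denominator: unambiguous nucleotides only.
        "at_content": at / (at + gc) if at + gc else 0.0,
        "gc_content": gc / (at + gc) if at + gc else 0.0,
        "variable_sites": variable,
        "variable_proportion": variable / length,
        "informative_sites": informative,
        "informative_proportion": informative / length,
    }


# Hypothetical toy alignment.
stats = summarize({
    "taxonA": "ACGTAC",
    "taxonB": "ACGTTC",
    "taxonC": "AC-TTG",
    "taxonD": "ATGAAG",
})
for name, value in stats.items():
    print(name, value)
```

In this toy alignment, columns 2 and 4 are variable but not parsimony informative because the minority state appears in only one taxon, whereas columns 5 and 6 each have two states shared by two taxa.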


aCLImatise: automated generation of tool definitions for bioinformatics workflows

Bioinformatics ◽

10.1093/bioinformatics/btaa1033 ◽

2020 ◽

Author(s):

Michael Milton ◽

Natalie Thorne

Keyword(s):

Source Code ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Automated Generation ◽

Base Camp ◽

Python Package ◽

Bioinformatics Workflow ◽

Bioinformatics Workflows

Summary: aCLImatise is a utility for automatically generating tool definitions compatible with bioinformatics workflow languages, by parsing command-line help output. aCLImatise also has an associated database called the aCLImatise Base Camp, which provides thousands of pre-computed tool definitions. Availability and implementation: The latest aCLImatise source code is available within a GitHub organisation, under the GPL-3.0 license: https://github.com/aCLImatise. In particular, documentation for the aCLImatise Python package is available at https://aclimatise.github.io/CliHelpParser/, and the aCLImatise Base Camp is available at https://aclimatise.github.io/BaseCamp/. Supplementary information: Supplementary data are available at Bioinformatics online.
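The general idea of deriving a tool definition from help output can be sketched with a regular expression over an argparse-style help message. This is not aCLImatise's parser or API: the pattern `FLAG_LINE`, the function `parse_help`, the output dictionary layout, and the tool name `mytool` are hypothetical, and the sketch only recognises single-line option entries whose description is separated by at least two spaces.

```python
import re

# Assumed layout: "  -o, --output FILE     description"
FLAG_LINE = re.compile(
    r"^\s+(?P<names>-{1,2}[\w-]+(?:,\s*-{1,2}[\w-]+)*)"  # -o, --output
    r"(?:[ =](?P<metavar>[A-Z][A-Z_]*))?"                # optional FILE
    r"\s{2,}(?P<help>\S.*)$"                             # description
)


def parse_help(text):
    """Extract option definitions from help text (illustrative only)."""
    options = []
    for line in text.splitlines():
        match = FLAG_LINE.match(line)
        if match is None:
            continue  # usage lines, headings, positionals, wrapped text
        options.append({
            "names": re.split(r",\s*", match["names"]),
            "takes_value": match["metavar"] is not None,
            "metavar": match["metavar"],
            "help": match["help"].strip(),
        })
    return options


# Hypothetical help output for an imaginary tool.
HELP_TEXT = """\
usage: mytool [-h] [-o FILE] [--threads N] [-v] input

positional arguments:
  input                 alignment file to read

options:
  -h, --help            show this help message and exit
  -o, --output FILE     path of the output file
  --threads N           number of worker threads
  -v                    print progress messages
"""

options = parse_help(HELP_TEXT)
for option in options:
    print(option)
```

A structured record like this could then be serialised into a workflow language's tool description; that conversion step is not shown here.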


Knot_pull—python package for biopolymer smoothing and knot detection

Bioinformatics ◽

10.1093/bioinformatics/btz644 ◽

2019 ◽

Cited By ~ 1

Author(s):

Aleksandra I Jarmolinska ◽

Anna Gambin ◽

Joanna I Sulkowska

Keyword(s):

Learning Curve ◽

Source Code ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Steep Learning Curve ◽

Independent Source ◽

Python Package

Summary: The biggest hurdle in studying topology in biopolymers is the steep learning curve for actually seeing the knots in structure visualization. Knot_pull is a command line utility designed to simplify this process: it presents the user with a smoothing trajectory for provided structures (any number and length of protein, RNA or chromatin chains in PDB, CIF or XYZ format), and calculates the knot type (including presence of any links, and slipknots when a subchain is specified). Availability and implementation: Knot_pull works under Python >=2.7 and is system independent. Source code and documentation are available at http://github.com/dzarmola/knot_pull under the GNU GPL license and also include a wrapper script for PyMOL for easier visualization. Examples of smoothing trajectories can be found at: https://www.youtube.com/watch?v=IzSGDfc1vAY. Supplementary information: Supplementary data are available at Bioinformatics online.


pyMAP: a Python package for small and large scale analysis of Illumina 450k methylation platform

10.1101/078048 ◽

2016 ◽

Cited By ~ 1

Author(s):

Amin Mahpour

Keyword(s):

Large Scale ◽

Source Code ◽

Command Line ◽

Scale Analysis ◽

Illumina 450K ◽

Large Scale Analysis ◽

Python Scripting ◽

Scripting Language ◽

Python Package ◽

450K Methylation

PyMAP is a native Python module for analysis of the 450k methylation platform and is freely available for public use. The package can be easily deployed to cloud platforms that support the Python scripting language for large-scale methylation studies. By implementing fast parsing functionality, this module can be used to analyze large-scale methylation datasets. Additionally, command-line executables shipped with the module can be used to perform common analysis tasks on personal computers. Availability and implementation: PyMAP is implemented in Python and the source code is available under the GPL v2 license from http://aminmahpour.github.io/PyMAP/.


Introducing Python Programming into Undergraduate Biology

The American Biology Teacher ◽

10.1525/abt.2021.83.1.33 ◽

2021 ◽

Vol 83 (1) ◽

pp. 33-41

Author(s):

Andrew A. David

Keyword(s):

Biological Sciences ◽

Best Practice ◽

Line Graphs ◽

Data Sets ◽

Command Line ◽

Summary Statistics ◽

Undergraduate Biology ◽

Python Programming Language ◽

Student’s t ◽

Python Programming

The rise of “big data” within the biological sciences has resulted in an urgent demand for coding skills in the next generation of scientists. To address this issue, several institutions and departments across the country have incorporated coding into their curricula. I describe a coding module developed and deployed in an undergraduate parasitology course, with the overarching goal of familiarizing students with the Python programming language. The module, which was completed over four days, aimed to help students become comfortable with the command line; execute summary statistics and Student’s t-tests through coding; create simple bar and line graphs using code; and parse, handle, and analyze imported data sets. There is currently no standard “best practice” for teaching coding skills to biology majors, but this module can serve as a template to ease students into coding, and can then be modified and built out for teaching more advanced skills.
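As an illustration of the kind of exercise such a module might contain, the sketch below computes summary statistics and a two-sample Student's t statistic (pooled variance) using only the Python standard library. It is not taken from the course materials; the function name `student_t` and the two samples are invented for the example.

```python
from math import sqrt
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes both groups share a variance.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(sample_a)
              + (n_b - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


# Invented measurements for two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [3, 4, 5, 6, 7]

print("mean A:", mean(group_a), "variance A:", variance(group_a))
print("mean B:", mean(group_b), "variance B:", variance(group_b))
t_value, df = student_t(group_a, group_b)
print("t =", t_value, "df =", df)
```

Turning the t statistic into a p-value requires the t distribution, which the standard library does not provide; a statistics package would normally be used for that step.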


Online Judging Platform Utilizing Dynamic Plagiarism Detection Facilities

Computers ◽

10.3390/computers10040047 ◽

2021 ◽

Vol 10 (4) ◽

pp. 47

Author(s):

Fariha Iffath ◽

A. S. M. Kayes ◽

Md. Tahsin Rahman ◽

Jannatul Ferdows ◽

Mohammad Shamsul Arefin ◽

...

Keyword(s):

Source Code ◽

Large Data ◽

Large Data Sets ◽

Detection Technique ◽

Data Sets ◽

Plagiarism Detection ◽

Source Codes ◽

Efficient Detection ◽

Mathematical Problems ◽

Automatic Scoring

A programming contest generally involves the host presenting a set of logical and mathematical problems to the contestants, who are required to write computer programs capable of solving these problems. An online judge is a system that automates the judging of submitted programs and is designed for the reliable evaluation of users' source code. Traditional online judging platforms are not well suited to programming labs, as they do not support partial scoring or efficient detection of plagiarized code. To address this, we present an online judging framework that scores code automatically by efficiently detecting plagiarized content and measuring the level of accuracy of the code. Our system detects plagiarism by extracting fingerprints of programs and comparing those fingerprints instead of the whole files. We used winnowing to select fingerprints among the k-gram hash values of a source code, which were generated by the Rabin–Karp algorithm. The proposed system is compared with existing online judging platforms to show its superiority in terms of time efficiency, correctness, and feature availability. In addition, we evaluated our system using large data sets and compared its run time with that of MOSS, a widely used plagiarism detection tool.
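The fingerprinting approach described here can be sketched as follows: hash every k-gram with a Rabin–Karp rolling hash, keep the minimum hash in each sliding window (winnowing), and compare the resulting fingerprint sets. This is an illustrative sketch, not the authors' implementation; the parameter values, the whitespace and case normalisation, and the use of Jaccard similarity as the comparison score are assumptions.

```python
def kgram_hashes(text, k, base=257, mod=(1 << 61) - 1):
    """Rabin-Karp rolling hashes of every k-gram in text."""
    if len(text) < k:
        return []
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    for i in range(k, len(text)):
        # Drop the oldest character, shift, and append the new one.
        h = ((h - ord(text[i - k]) * high) * base + ord(text[i])) % mod
        hashes.append(h)
    return hashes


def winnow(hashes, window):
    """Keep the minimum hash of each window of consecutive k-gram hashes."""
    if len(hashes) < window:
        return set(hashes[:1] and [min(hashes)])
    return {min(hashes[start:start + window])
            for start in range(len(hashes) - window + 1)}


def fingerprint(source, k=5, window=4):
    # Assumed normalisation: ignore case and all whitespace.
    normalized = "".join(source.lower().split())
    return winnow(kgram_hashes(normalized, k), window)


def similarity(source_a, source_b, k=5, window=4):
    """Jaccard similarity of two fingerprint sets (an assumed score)."""
    fp_a, fp_b = fingerprint(source_a, k, window), fingerprint(source_b, k, window)
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


original = "for i in range(10): total += i"
reformatted = "for i in range(10):\n    total   +=   i"
unrelated = "while True: print('hello world')"

print(similarity(original, reformatted))
print(similarity(original, unrelated))
```

Because only one hash per window is retained, the fingerprint set is much smaller than the full list of k-gram hashes, which is what makes pairwise comparison of many submissions cheaper than comparing whole files.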


KEC: unique sequence search by K-mer exclusion

Bioinformatics ◽

10.1093/bioinformatics/btab196 ◽

2021 ◽

Author(s):

Pavel Beran ◽

Dagmar Stehlíková ◽

Stephen P Cohen ◽

Vladislav Čurn

Keyword(s):

Amino Acid ◽

Nucleic Acid ◽

Source Code ◽

Unique Sequence ◽

Supplementary Information ◽

Supplementary Data ◽

Laptop Computers ◽

Sequence Search ◽

Target Sequences ◽

Cross Reference

Summary: Searching for amino acid or nucleic acid sequences unique to one organism may be challenging depending on the size of the available datasets. K-mer elimination by cross-reference (KEC) allows users to quickly and easily find unique sequences by providing target and non-target sequences. Due to its speed, it can be used for datasets of genomic size and can be run on desktop or laptop computers with modest specifications. Availability and implementation: KEC is freely available for non-commercial purposes. Source code and executable binary files compiled for Linux, Mac and Windows can be downloaded from https://github.com/berybox/KEC. Supplementary information: Supplementary data are available at Bioinformatics online.
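The idea of excluding k-mers by cross-reference can be sketched in a few lines: collect every k-mer found in the non-target sequences, then report stretches of the target whose k-mers are all absent from that set. This is not KEC's code; the function names, the merging of adjacent unique k-mers into regions, and the omission of reverse complements are assumptions made to keep the example short.

```python
def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_regions(target, non_targets, k):
    """Find stretches of target covered only by k-mers absent from non-targets.

    Returns a list of (start, end, subsequence) with 0-based, end-exclusive
    coordinates. Reverse complements are not considered in this sketch.
    """
    excluded = set()
    for sequence in non_targets:
        excluded |= kmers(sequence, k)

    regions = []
    start, end = None, 0
    for i in range(len(target) - k + 1):
        if target[i:i + k] not in excluded:
            if start is None:
                start = i          # a new run of unique k-mers begins
            end = i + k            # exclusive end of the current run
        elif start is not None:
            regions.append((start, end, target[start:end]))
            start = None
    if start is not None:
        regions.append((start, end, target[start:end]))
    return regions


# Hypothetical toy sequences.
target = "AAACCCGGGTTT"
non_targets = ["AAACCC", "GGGTTT"]
regions = unique_regions(target, non_targets, k=4)
print(regions)
```

Here the only 4-mers of the target that never occur in a non-target sequence are those spanning the junction between the two halves, so a single region is reported. Holding the excluded k-mers in a hash set keeps each lookup constant-time on average, which is the property that lets this style of search scale to large inputs.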


BiSulfite Bolt: A bisulfite sequencing analysis platform

GigaScience ◽

10.1093/gigascience/giab033 ◽

2021 ◽

Vol 10 (5) ◽

Author(s):

Colin Farrell ◽

Michael Thompson ◽

Anela Tosevska ◽

Adewale Oyetunde ◽

Matteo Pellegrini

Keyword(s):

Data Aggregation ◽

Bisulfite Sequencing ◽

Low Complexity ◽

Sequencing Analysis ◽

Command Line ◽

Sequencing Data ◽

Bisulfite Sequencing Data ◽

Analysis Platform ◽

Python Package ◽

Bisulfite Sequencing Analysis

Background: Bisulfite sequencing is commonly used to measure DNA methylation. Processing bisulfite sequencing data is often challenging owing to the computational demands of mapping a low-complexity, asymmetrical library and the lack of a unified processing toolset to produce an analysis-ready methylation matrix from read alignments. To address these shortcomings, we have developed BiSulfite Bolt (BSBolt), a fast and scalable bisulfite sequencing analysis platform. BSBolt performs a pre-alignment sequencing read assessment step to improve efficiency when handling asymmetrical bisulfite sequencing libraries. Findings: We evaluated BSBolt against simulated and real bisulfite sequencing libraries. We found that BSBolt provides accurate and fast bisulfite sequencing alignments and methylation calls. We also compared BSBolt to several existing bisulfite alignment tools and found BSBolt outperforms Bismark, BSSeeker2, BISCUIT, and BWA-Meth based on alignment accuracy and methylation calling accuracy. Conclusion: BSBolt offers streamlined processing of bisulfite sequencing data through an integrated toolset that offers support for simulation, alignment, methylation calling, and data aggregation. BSBolt is implemented as a Python package and command line utility for flexibility when building informatics pipelines. BSBolt is available at https://github.com/NuttyLogic/BSBolt under an MIT license.


Codon-Substitution Models for Heterogeneous Selection Pressure at Amino Acid Sites

Genetics ◽

10.1093/genetics/155.1.431 ◽

2000 ◽

Vol 155 (1) ◽

pp. 431-449 ◽

Cited By ~ 41

Author(s):

Ziheng Yang ◽

Rasmus Nielsen ◽

Nick Goldman ◽

Anne-Mette Krabbe Pedersen

Keyword(s):

Amino Acid ◽

Positive Selection ◽

Selective Pressure ◽

Acid Sites ◽

Data Sets ◽

Protein Coding ◽

Important Indicator ◽

Diversifying Selection ◽

Codon Substitution ◽

Neutral Mutations

Comparison of relative fixation rates of synonymous (silent) and nonsynonymous (amino acid-altering) mutations provides a means for understanding the mechanisms of molecular sequence evolution. The nonsynonymous/synonymous rate ratio (ω = dN/dS) is an important indicator of selective pressure at the protein level, with ω = 1 meaning neutral mutations, ω < 1 purifying selection, and ω > 1 diversifying positive selection. Amino acid sites in a protein are expected to be under different selective pressures and have different underlying ω ratios. We develop models that account for heterogeneous ω ratios among amino acid sites and apply them to phylogenetic analyses of protein-coding DNA sequences. These models are useful for testing for adaptive molecular evolution and identifying amino acid sites under diversifying selection. Ten data sets of genes from nuclear, mitochondrial, and viral genomes are analyzed to estimate the distributions of ω among sites. In all data sets analyzed, the selective pressure indicated by the ω ratio is found to be highly heterogeneous among sites. Previously unsuspected Darwinian selection is detected in several genes in which the average ω ratio across sites is <1, but in which some sites are clearly under diversifying selection with ω > 1. Genes undergoing positive selection include the β-globin gene from vertebrates, mitochondrial protein-coding genes from hominoids, the hemagglutinin (HA) gene from human influenza virus A, and HIV-1 env, vif, and pol genes. Tests for the presence of positively selected sites and their subsequent identification appear quite robust to the specific distributional form assumed for ω and can be achieved using any of several models we implement. However, we encountered difficulties in estimating the precise distribution of ω among sites from real data sets.
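A small numerical sketch shows why a gene-wide average ω below 1 can hide sites under diversifying selection. The site classes, their proportions, and their ω values below are invented for illustration and are not estimates from the paper; the function names are likewise hypothetical.

```python
def classify_omega(omega, tolerance=1e-9):
    """Interpret a dN/dS ratio using the thresholds stated in the abstract."""
    if abs(omega - 1.0) <= tolerance:
        return "neutral"
    return "purifying" if omega < 1.0 else "diversifying"


def mean_omega(site_classes):
    """Average omega over discrete site classes given as (proportion, omega)."""
    total = sum(proportion for proportion, _ in site_classes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    return sum(proportion * omega for proportion, omega in site_classes)


# Invented three-class mixture: most sites are strongly conserved,
# some evolve neutrally, and a small fraction has omega > 1.
site_classes = [(0.80, 0.05), (0.15, 1.0), (0.05, 3.5)]

average = mean_omega(site_classes)
print("average omega:", round(average, 3), "->", classify_omega(average))
for proportion, omega in site_classes:
    print(f"{proportion:.0%} of sites: omega = {omega} ->", classify_omega(omega))
```

With these numbers the average is 0.8 × 0.05 + 0.15 × 1.0 + 0.05 × 3.5 = 0.365, which would be read as purifying selection overall even though 5% of sites have ω = 3.5. This is the situation that models allowing ω to vary among sites are designed to detect.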
