SANS serif: alignment-free, whole-genome based phylogenetic reconstruction

2021 ◽  
Author(s):  
Andreas Rempel ◽  
Roland Wittler

Abstract. Summary: SANS serif is a novel software for alignment-free, whole-genome based phylogeny estimation that follows a pangenomic approach to efficiently calculate a set of splits in a phylogenetic tree or network. Availability and Implementation: Implemented in C++ and supported on Linux, macOS, and Windows. The source code is freely available for download at https://gitlab.ub.uni-bielefeld.de/gi/[email protected]
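The idea of deriving splits from shared sequence content can be illustrated with a minimal sketch. It assumes that every k-mer present in some but not all genomes supports the bipartition "genomes containing it" versus "all others", and it simply counts supporting k-mers. The function names and the plain counting scheme are illustrative assumptions, not the weighting or data structures used by SANS serif.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_splits(genomes, k):
    """Count, for each bipartition of the genomes, the k-mers supporting it.

    A k-mer supports the split {genomes containing it} | {the rest}.
    Splits are keyed by an unordered pair of frozensets, so a presence
    pattern and its complement map to the same split.
    Hypothetical helper: plain counts, not the SANS serif weighting.
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)

    everyone = frozenset(genomes)
    splits = defaultdict(int)
    for names in owners.values():
        side = frozenset(names)
        if side and side != everyone:  # skip k-mers shared by all genomes
            splits[frozenset({side, everyone - side})] += 1
    return dict(splits)


# Toy input: three short "genomes" (made-up sequences).
genomes = {"A": "ACGTAC", "B": "ACGTTC", "C": "GGGTTC"}
splits = count_splits(genomes, 3)
for split, support in splits.items():
    print(sorted(sorted(side) for side in split), support)
```

With only three genomes every split separates a single genome from the rest; larger inputs also yield splits with several genomes on each side, which is what makes a tree or network reconstruction possible.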

2021 ◽  
Author(s):  
Nader R. Abdelsalam ◽  
Mohamed E. Hasan ◽  
Samar M.A. Rabie ◽  
Houssam El-Din M.F. El-wakeel ◽  
Amera F. Zaitoun ◽  
...  

DNA barcodes have been considered a tool to facilitate species identification because of their simplicity and high accuracy compared with the complexity and subjective biases of morphological identification of taxa. The chloroplast maturase K gene (“MatK”), which is involved in group II intron splicing, plays a crucial role in plants. The main objective of this study is to determine the relative utility of the chloroplast “MatK” gene for barcoding fifteen legume trees using both single-region and multiregional approaches. The chloroplast “MatK” gene sequences were submitted to GenBank, and accession numbers (GenBank: LC602060, LC602154, LC602263, LC603347, LC603655, LC603845, LC603846, LC603847, LC604717, LC604718, LC605994, LC604799, LC605995, LC606468, LC606469) were obtained, with sequence lengths ranging from 730 to 1545 nucleotides. These DNA sequences were aligned with database sequences using the PROMALS server, the Clustal Omega server and the BioEdit program. The maximum likelihood and neighbor-joining algorithms in the MEGA-X program were also employed for phylogenetic reconstruction. Overall, the results indicated that the phylogenetic trees and evolutionary distances of the individual dataset of each species agreed with a phylogenetic tree of all species together, which consisted of two clades. The first clade comprised (Enterolobium contortisiliquum, Albizia lebbek), Acacia saligna, Leucaena leucocephala, Dichrostachys cinerea, (Delonix regia, Parkinsonia aculeata), (Senna surattensis, Cassia fistula, Cassia javanica) and Schotia brachypetala, which were more closely related to each other, respectively. The remaining four species, Erythrina humeana and (Sophora secundiflora, Dalbergia sissoo, Tipuana tipu), constituted the second clade. Therefore, the MatK gene is considered a promising candidate for DNA barcoding in the plant family Fabaceae and provides a clear relationship between the families.
Moreover, its sequences could be successfully utilized in single nucleotide polymorphism (SNP) analysis, or part of the sequence could be used for DNA fragment analysis utilizing polymerase chain reaction (PCR), in plant systematics.


2016 ◽  
Author(s):  
Rudy Arthur ◽  
Ole Schulz-Trieglaff ◽  
Anthony J. Cox ◽  
Jared Michael O’Connell

Abstract. Ancestry and Kinship Toolkit (AKT) is a statistical genetics tool for analysing large cohorts of whole-genome sequenced samples. It can rapidly detect related samples, characterise sample ancestry, calculate correlation between variants, check Mendel consistency and perform data clustering. AKT brings together the functionality of many state-of-the-art methods, with a focus on speed and a unified interface. We believe it will be an invaluable tool for the curation of large WGS data-sets. Availability: The source code is available at https://illumina.github.io/[email protected], [email protected]


2016 ◽  
Author(s):  
Fanny-Dhelia Pajuste ◽  
Lauris Kaplinski ◽  
Märt Möls ◽  
Tarmo Puurand ◽  
Maarja Lepamets ◽  
...  

We have developed a computational method that counts the frequencies of unique k-mers in FASTQ-formatted genome data and uses this information to infer the genotypes of known variants. FastGT can detect the variants in a 30x genome in less than 1 hour using ordinary low-cost server hardware. The overall concordance with the genotypes of two Illumina “Platinum” genomes is 99.96%, and the concordance with the genotypes of the Illumina HumanOmniExpress is 99.82%. Our method provides a k-mer database that can be used for the simultaneous genotyping of approximately 30 million single nucleotide variants (SNVs), including >23,000 SNVs from the Y chromosome. The source code of the FastGT software is available on GitHub (https://github.com/bioinfo-ut/GenomeTester4/).
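The general principle of k-mer based genotyping can be sketched as follows, assuming each allele of a known variant is tagged by a k-mer that is unique to it. The function names, the toy reads, and the fixed count threshold are hypothetical; the abstract does not describe FastGT's statistical model, and this sketch also ignores reverse-complement k-mers and sequencing errors.

```python
from collections import Counter


def count_target_kmers(reads, k, targets):
    """Count occurrences of the target k-mers across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in targets:
                counts[kmer] += 1
    return counts


def call_genotype(ref_count, alt_count, min_count=2):
    """Call a diploid genotype from allele-specific k-mer counts.

    Placeholder rule (an assumption): an allele is considered present
    when its k-mer was seen at least min_count times.
    """
    has_ref = ref_count >= min_count
    has_alt = alt_count >= min_count
    if has_ref and has_alt:
        return "0/1"
    if has_alt:
        return "1/1"
    if has_ref:
        return "0/0"
    return "./."  # too little evidence for either allele


# Toy data: one variant whose alleles are tagged by two 5-mers.
ref_kmer, alt_kmer = "ACGTA", "ACCTA"
reads = ["TTACGTAGG", "GACGTAC", "TACCTAG", "CACCTAGT"]
counts = count_target_kmers(reads, 5, {ref_kmer, alt_kmer})
genotype = call_genotype(counts[ref_kmer], counts[alt_kmer])
print(genotype)
```

Restricting the counting to a precomputed set of target k-mers is what keeps this kind of approach fast: no read has to be aligned, only looked up.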


2019 ◽  
Author(s):  
Tasfia Zahin ◽  
Md. Hasin Abrar ◽  
Mizanur Rahman ◽  
Tahrina Tasnim ◽  
Md. Shamsuzzoha Bayzid ◽  
...  

Abstract. Phylogenetic analysis, i.e., the construction of an accurate phylogenetic tree from genomic sequences of a set of species, is one of the main challenges in bioinformatics. The popular approaches to this require aligning each pair of sequences to calculate pairwise distances or aligning all the sequences to construct a multiple sequence alignment. The computational complexity and the difficulty of obtaining accurate alignments have led to the development of alignment-free methods to estimate phylogenies. However, the alignment-free approaches focus on computing distances between species and do not utilize statistical approaches for phylogeny estimation. Herein, we present a simple alignment-free method for phylogeny construction based on contiguous sub-sequences of length k, termed k-mers. The presence or absence of these k-mers is used to construct a phylogeny using a maximum likelihood approach. The results suggest our method is competitive with other alignment-free approaches, while outperforming them in some cases.
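A presence/absence encoding of k-mers might look like the sketch below, where each species becomes a string of 0/1 characters with one column per k-mer. The function names and toy sequences are assumptions for illustration; the abstract does not specify how the authors build or filter their matrix.

```python
def kmer_set(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_absence_matrix(seqs, k):
    """Encode each sequence as a binary string over the pooled k-mers.

    Column j of every row refers to the j-th k-mer in sorted order:
    '1' if the sequence contains it, '0' otherwise.
    """
    sets = {name: kmer_set(seq, k) for name, seq in seqs.items()}
    columns = sorted(set().union(*sets.values()))
    matrix = {
        name: "".join("1" if kmer in present else "0" for kmer in columns)
        for name, present in sets.items()
    }
    return columns, matrix


# Toy input (made-up sequences).
seqs = {"sp1": "ACGT", "sp2": "ACGA"}
columns, matrix = presence_absence_matrix(seqs, 2)
print(columns)
for name, row in matrix.items():
    print(name, row)
```

A binary character matrix of this kind can then be passed to a maximum likelihood program that supports binary character data, in place of a multiple sequence alignment.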


2017 ◽  
Author(s):  
Igor Mandric ◽  
Sergey Knyazev ◽  
Alex Zelikovsky

Abstract. Summary: Genomic sequences are assembled into a variable but large number of contigs that should be scaffolded (ordered and oriented) to facilitate comparative or functional analysis. Finding a scaffolding is computationally challenging due to misassemblies, inconsistent coverage across the genome, and long repeats. An accurate assessment of scaffolding tools should take into account multiple locations of the same contig on the reference scaffolding rather than matching a repeat to a single best location. This makes mapping of inferred scaffoldings onto the reference a computationally challenging problem. This paper formulates the repeat-aware scaffolding evaluation problem, which is to find a mapping of the inferred scaffolding onto the reference that maximizes the number of correct links, and proposes a scalable algorithm capable of handling large whole-genome datasets. Our novel scaffolding validation pipeline has been applied to assess most state-of-the-art scaffolding tools on a representative subset of the GAGE datasets. Availability: The source code of this evaluation framework is available at https://github.com/mandricigor/repeat-aware. The documentation is hosted at https://mandricigor.github.io/repeat-aware.


2021 ◽  
Vol 12 ◽  
Author(s):  
Yao-Qun Wu ◽  
Zu-Guo Yu ◽  
Run-Bin Tang ◽  
Guo-Sheng Han ◽  
Vo V. Anh

Alignment-based methods are at a disadvantage in sequence comparison and phylogeny reconstruction because of their high computational cost in both time and space. Alignment-free methods, on the other hand, incur low computational costs and have recently gained popularity in the field of bioinformatics. Here we propose a new alignment-free method for phylogenetic tree reconstruction based on whole genome sequences. A key component is a measure called the information-entropy position-weighted k-mer relative measure (IEPWRMkmer), which combines the position-weighted measure of k-mers proposed by our group and the information entropy of the frequency of k-mers. The Manhattan distance is used to calculate the pairwise distance between species. Finally, we use the Neighbor-Joining method to construct the phylogenetic tree. To evaluate the performance of this method, we perform phylogenetic analysis on two datasets used by other researchers. The results demonstrate that the IEPWRMkmer method is efficient and reliable. The source code of our method is provided at https://github.com/wuyaoqun37/IEPWRMkmer.
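The distance step can be illustrated with a small sketch that computes the Manhattan distance between k-mer feature vectors. For simplicity the features here are plain relative k-mer frequencies, which is a stand-in assumption and not the IEPWRMkmer measure itself; the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return relative frequencies of all |alphabet|^k k-mers, in fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Toy input with k = 1, so the vectors are just base compositions.
u = kmer_frequencies("AACC", 1)  # [0.5, 0.5, 0.0, 0.0]
v = kmer_frequencies("AAAA", 1)  # [1.0, 0.0, 0.0, 0.0]
distance = manhattan(u, v)
print(distance)
```

Computing this distance for every pair of species gives a distance matrix, which is the input that a Neighbor-Joining implementation expects.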


2021 ◽  
Vol 22 (10) ◽  
Author(s):  
ABDUL BASITH ◽  
Abinawanto Abinawanto ◽  
ENI KUSRINI ◽  
YASMAN YASMAN

Abstract. Basith A, Abinawanto, Kusrini E, Yasman. 2021. Genetic diversity analysis and phylogenetic reconstruction of groupers Epinephelus spp. from Madura Island, Indonesia based on partial sequence of CO1 gene. Biodiversitas 22: 4282-4290. Grouper populations in Indonesia, particularly those from Madura Island, East Java, are reported to be over-fished, so collecting more accurate data on their genetic resources is an important step for grouper conservation. A total of 14 samples of Epinephelus groupers were obtained from the fish landing port on Madura Island. A 617 bp CO1 gene sequence was used for genetic diversity analysis and phylogenetic tree reconstruction. Genetic diversity was assessed from the values of haplotype diversity (Hd) and nucleotide diversity (π). The phylogenetic tree was reconstructed with neighbor-joining (NJ) under the K2P substitution model and with maximum likelihood (ML) under the HKY+G+I substitution model, both evaluated with 1000 bootstrap replications. Analysis of genetic distance between species indicated that the farthest distance, between E. heniochus and E. fasciatus, was 0.189, while the closest distance, between E. erythrurus and E. ongus, was 0.099. Intrapopulation genetic diversity was high, with Hd=0.978 and π=0.12107. Furthermore, the NJ and ML phylogenetic trees showed a similar topology, with the observed Epinephelus spp. from Madura Island grouped into 7 clades, namely Epinephelus coioides, E. bleekeri, E. areolatus, E. erythrurus, E. heniochus, E. fasciatus, and E. ongus.
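The two kinds of quantities reported above can be computed with the standard textbook formulas, sketched here under stated assumptions. The K2P (Kimura two-parameter) distance is d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), with P and Q the proportions of transitions and transversions, and haplotype diversity uses the commonly used estimator Hd = n/(n-1) (1 - sum of squared haplotype frequencies). The abstract does not state which software or estimator variants were used, and the function names and toy sequences are illustrative.

```python
import math
from collections import Counter

PURINES = {"A", "G"}


def k2p_distance(s1, s2):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Sites where either sequence has a character other than A, C, G, T
    (for example a gap) are skipped.
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(s1, s2) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    n = len(pairs)
    # Transition: purine<->purine or pyrimidine<->pyrimidine change.
    transitions = sum(
        1 for a, b in pairs if a != b and (a in PURINES) == (b in PURINES)
    )
    # Transversion: change between a purine and a pyrimidine.
    transversions = sum(
        1 for a, b in pairs if a != b and (a in PURINES) != (b in PURINES)
    )
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def haplotype_diversity(haplotypes):
    """Hd = n/(n-1) * (1 - sum p_i^2) over the observed haplotype labels."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two samples")
    freqs = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))


# Toy data: one A->G transition in 10 sites; four samples, all distinct.
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
hd = haplotype_diversity(["h1", "h2", "h3", "h4"])
print(round(d, 4), hd)
```

An Hd close to 1, as reported for the Madura Island samples, means that nearly every sampled individual carries a different haplotype.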


2017 ◽  
Author(s):  
Mickael Silva ◽  
Miguel Machado ◽  
Diogo N. Silva ◽  
Mirko Rossi ◽  
Jacob Moran-Gilad ◽  
...  

Abstract. Gene-by-gene approaches are becoming increasingly popular in bacterial genomic epidemiology and outbreak detection. However, there is a lack of open-source scalable software for schema definition and allele calling for these methodologies. The chewBBACA suite was designed to assist users in the creation and evaluation of novel whole-genome or core-genome gene-by-gene typing schemas and subsequent allele calling in bacterial strains of interest. The software can run on a laptop or on high-performance clusters, making it useful for both small laboratories and large reference centers. chewBBACA is available at https://github.com/B-UMMI/chewBBACA or as a docker image at https://hub.docker.com/r/ummidock/chewbbaca/. Data Summary: Assembled genomes used for the tutorial were downloaded from NCBI in August 2016 by selecting those submitted as Streptococcus agalactiae taxon or sub-taxa. All the assemblies have been deposited as a zip file in FigShare (https://figshare.com/s/9cbe1d422805db54cd52), where a file with the original ftp link for each NCBI directory is also available. Code for the chewBBACA suite is available at https://github.com/B-UMMI/chewBBACA while the tutorial example is found at https://github.com/B-UMMI/chewBBACA_tutorial. I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. ⊠ Impact Statement: The chewBBACA software offers a computational solution for the creation, evaluation and use of whole genome (wg) and core genome (cg) multilocus sequence typing (MLST) schemas. It allows researchers to develop wg/cgMLST schemes for any bacterial species from a set of genomes of interest. The alleles identified by chewBBACA correspond to potential coding sequences, possibly offering insights into the correspondence between the genetic variability identified and phenotypic variability.
The software performs allele calling in a matter of seconds to minutes per strain on a laptop but is easily scalable to the analysis of large datasets of hundreds of thousands of strains using multiprocessing options. The chewBBACA software thus provides an efficient and freely available open-source solution for gene-by-gene methods. Moreover, the ability to perform these tasks locally is desirable when the submission of raw data to a central repository or web services is hindered by data protection policies or ethical or legal concerns.


2020 ◽  
Author(s):  
Xun Zhu ◽  
Ti-Cheng Chang ◽  
Richard Webby ◽  
Gang Wu

Abstract. idCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data based on a selected clade-defining marker list. Using a public dataset, we show that idCOV, working directly from user-uploaded FastQ files, makes calls equivalent to those annotated by Nextstrain.org in all three common clade systems. Web and equivalent command-line interfaces are available. It can be deployed in any Linux environment, including personal computers, HPC systems and the cloud. The source code is available at https://github.com/xz-stjude/idcov. Installation documentation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.


2020 ◽  
Author(s):  
N Goonasekera ◽  
A Mahmoud ◽  
J Chilton ◽  
E Afgan

Abstract. Summary: The existence of more than 100 public Galaxy servers with service quotas is indicative of the need for an increased availability of compute resources for Galaxy to use. The GalaxyCloudRunner enables a Galaxy server to easily expand its available compute capacity by sending user jobs to cloud resources. User jobs are routed to the acquired resources based on a set of configurable rules and the resources can be dynamically acquired from any of 4 popular cloud providers (AWS, Azure, GCP, or OpenStack) in an automated fashion. Availability and implementation: GalaxyCloudRunner is implemented in Python and leverages Docker containers. The source code is MIT licensed and available at https://github.com/cloudve/galaxycloudrunner. The documentation is available at http://gcr.cloudve.org/. Contact: Enis Afgan ([email protected]). Supplementary information: None.

