Phylogenomics of 42 tomato chloroplasts using assembly and alignment-free method

2017 ◽  
Author(s):  
Raúl Amado Cattáneo ◽  
Luis Diambra ◽  
Andrés Norman McCarthy

Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on the comparison of single DNA sequences, or a concatenation of a number of these. However, with the advent of next-generation DNA sequencing technologies, approaches that consider large genomic data sets are of growing importance for the elucidation of evolutionary relationships among species. Among these approaches, assembly and alignment-free methods, which allow efficient distance computation and phylogeny reconstruction, are of great importance. However, it is not yet clear under what conditions of data quality and abundance such methods are able to infer phylogenies accurately. In the present study we assess the method originally proposed by Fan et al. for whole-genome data in the elucidation of tomato chloroplast phylogenetics using short-read sequences. We find that this assembly and alignment-free method is capable of reproducing previous results under conditions of high coverage, provided that low-frequency k-mers (i.e. error-prone data) are effectively filtered out. Finally, we present a complete chloroplast phylogeny for the best data quality candidates of the recently published 360 tomato genomes.
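The role of the low-frequency k-mer filter can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `min_count` threshold, and the shared-k-mer distance formula are assumptions chosen for illustration, and reverse complements are ignored for brevity.

```python
from collections import Counter
from math import log


def count_kmers(reads, k):
    """Count every k-mer of length k across a collection of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def filtered_kmer_set(reads, k, min_count=2):
    """Keep only k-mers seen at least min_count times (hypothetical threshold).

    K-mers seen very rarely in high-coverage data are more likely to
    come from sequencing errors than from the underlying genome.
    """
    counts = count_kmers(reads, k)
    return {kmer for kmer, n in counts.items() if n >= min_count}


def kmer_distance(set_a, set_b, k):
    """Illustrative distance from the fraction of shared k-mers.

    Assumed form: -ln(shared / size of the smaller set) / k.
    Returns infinity when the two samples share no k-mers.
    """
    shared = len(set_a & set_b)
    smaller = min(len(set_a), len(set_b))
    if shared == 0 or smaller == 0:
        return float("inf")
    return -log(shared / smaller) / k
```

A pairwise matrix of such distances over all samples could then be passed to any distance-based tree-building method; with low coverage, the filter would discard genuine k-mers as well, which is consistent with the abstract's emphasis on high coverage.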


2019 ◽  
Vol 9 (10) ◽  
pp. 3101-3104 ◽  
Author(s):  
Johnathan Lo ◽  
Michelle M. Jonika ◽  
Heath Blackmon

Microsatellites are repetitive DNA sequences usually found in non-coding regions of the genome. Their quantification and analysis have applications in fields from population genetics to evolutionary biology. As genome assemblies become commonplace, the need for software that can facilitate analyses has never been greater. R packages that can analyze genomic data are especially important, since R is one of the most popular software environments for biologists. We created an R package, micRocounter, to quantify microsatellites. We have optimized our package for speed, accessibility, and portability, making the automated analysis of large genomic data sets feasible. Computationally intensive algorithms were built in C++ to increase speed. Tests using benchmark datasets show a 200-fold improvement in speed over existing software. A moderately sized genome of 500 Mb can be processed in under 50 sec. Results are output as an R object, increasing accessibility and flexibility for practitioners.
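To make the idea of quantifying microsatellites concrete, here is a small Python sketch based on a regular expression. It does not reproduce micRocounter's algorithm or interface; the function name and the default unit and repeat thresholds are assumptions made for this example.

```python
import re


def find_microsatellites(sequence, min_unit=2, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for each tandem repeat found.

    A repeat is a unit of min_unit..max_unit bases occurring at least
    min_repeats times in a row. All thresholds are hypothetical defaults.
    """
    sequence = sequence.upper()
    # Lazy unit match so the shortest repeating unit is tried first;
    # the backreference \1 then requires further copies of that unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        # Ignore homopolymer runs such as AAAAAA reported as unit "AA".
        if len(set(unit)) == 1:
            continue
        repeats = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, repeats))
    return hits
```

For example, `find_microsatellites("TTGACACACACGGT")` reports a single dinucleotide repeat of unit `AC` starting at position 3. A regex scan like this is far slower than compiled code on whole genomes, which is the kind of cost the C++ implementation described above is meant to avoid.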


GigaScience ◽  
2020 ◽  
Vol 9 (6) ◽  
Author(s):  
Stefan Prost ◽  
Sven Winter ◽  
Jordi De Raad ◽  
Raphael T F Coimbra ◽  
Magnus Wolf ◽  
...  

Abstract Recent advances in genome sequencing technologies have simplified the generation of genome data and reduced the costs for genome assemblies, even for complex genomes like those of vertebrates. More practically oriented genomic courses can prepare university students for the increasing importance of genomic data used in biological and medical research. Low-cost third-generation sequencing technology, along with publicly available data, can be used to teach students how to process genomic data, assemble full chromosome-level genomes, and publish the results in peer-reviewed journals, or preprint servers. Here we outline experiences gained from 2 master's-level courses and discuss practical considerations for teaching hands-on genome assembly courses.


2020 ◽  
Author(s):  
Yang Young Lu ◽  
Jiaxing Bai ◽  
Yiwen Wang ◽  
Ying Wang ◽  
Fengzhu Sun

Abstract Motivation: Rapid developments in sequencing technologies have greatly increased the volume of sequence data being generated. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast queries on gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
pp. 1-9 ◽  
Author(s):  
Alida Palmisano ◽  
Yingdong Zhao ◽  
Richard M. Simon

Purpose: Advances in next-generation sequencing technologies have led to a reduction in sequencing costs, which has increased the availability of genomic data sets to many laboratories. Increasing amounts of sequencing data require effective analysis tools to use genomic data for biologic discovery and patient management. Available packages typically require advanced programming knowledge and system administration privileges, or they are Web services that force researchers to work on outside servers. Methods: To support the interactive exploration of genomic data sets on local machines with no programming skills required, we developed D3Oncoprint, a standalone application to visualize and dynamically explore annotated genomic mutation files. D3Oncoprint provides links to curated variants lists from CIViC, My Cancer Genome, OncoKB, and Food and Drug Administration–approved drugs to facilitate the use of genomic data for biomedical discovery and application. D3Oncoprint also includes curated gene lists from BioCarta pathways and FoundationOne cancer panels to explore commonly investigated biologic processes. Results: This software provides a flexible environment to dynamically explore one or more variant mutation profiles provided as input. The focus on interactive visualization with biologic and medical annotation significantly lowers the barriers between complex genomics data and biomedical investigators. We describe how D3Oncoprint helps researchers explore their own data without the need for an extensive computational background. Conclusion: D3Oncoprint is free software for noncommercial use. It is available for download from the Web site of the Biometric Research Program of the Division of Cancer Treatment and Diagnosis at the National Cancer Institute ( https://brb.nci.nih.gov/d3oncoprint ). We believe that this tool provides an important means of empowering researchers to translate information from collected data sets to biologic insights and clinical development.


2018 ◽  
Author(s):  
Jerome Kelleher ◽  
Yan Wong ◽  
Patrick K. Albers ◽  
Anthony W. Wohns ◽  
Gil McVean

Abstract A central problem in evolutionary biology is to infer the full genealogical history of a set of DNA sequences. This history contains rich information about the forces that have influenced a sexually reproducing species. However, existing methods are limited: the most accurate is unable to cope with more than a few dozen samples. With modern genetic data sets rapidly approaching millions of genomes, there is an urgent need for efficient inference methods to exploit such rich resources. We introduce an algorithm to infer whole-genome history which has comparable accuracy to the state-of-the-art but can process around four orders of magnitude more sequences. Additionally, our method results in an “evolutionary encoding” of the original sequence data, enabling efficient access to genealogies and calculation of genetic statistics over the data. We apply this technique to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the genealogies we estimate are both rich in biological signal and efficient to process.


Author(s):  
Guangying Wang ◽  
Xiaocui Chai ◽  
Jing Zhang ◽  
Wentao Yang ◽  
Chuanqi Jiang ◽  
...  

ABSTRACT Ciliates contain two kinds of nuclei in a single cell: the germline micronucleus (MIC) and the somatic macronucleus (MAC). The MAC usually has fragmented chromosomes. These fragmented chromosomes, capped with telomeres at both ends, range from gene-sized to several megabases in length among different ciliate species. So far, no telomere-to-telomere assembly of an entire MAC genome has been completed for any ciliate species. The development of third-generation sequencing technologies allows the generation of sequencing reads up to megabases in length that could possibly span an entire MAC chromosome. Taking advantage of ultra-long Nanopore reads, we established a simple strategy for the complete assembly of ciliate MAC genomes. Using this strategy, we assembled the complete MAC genomes of two ciliate species, Tetrahymena thermophila and Tetrahymena shanghaiensis, composed of 181 and 214 telomere-to-telomere chromosomes, respectively. The established strategy as well as the high-quality genome data will provide a useful approach for ciliate genome assembly, and a valuable community resource for further biological, evolutionary and population genomic studies.


ALGAE ◽  
2021 ◽  
Vol 36 (4) ◽  
pp. 333-340
Author(s):  
Seongmin Cheon ◽  
Sung-Gwon Lee ◽  
Hyun-Hee Hong ◽  
Hyun-Gwan Lee ◽  
Kwang Young Kim ◽  
...  

Phylotranscriptomics is the study of phylogenetic relationships among taxa based on their DNA sequences derived from transcriptomes. Because of the relatively low cost of transcriptome sequencing compared with genome sequencing and the fact that phylotranscriptomics is almost as reliable as phylogenomics, phylotranscriptomic analysis has recently emerged as the preferred method for studying evolutionary biology. However, it is challenging to perform transcriptomic and phylogenetic analyses together without programming expertise. This study presents a protocol for phylotranscriptomic analysis to aid marine biologists unfamiliar with the UNIX command-line interface and bioinformatics tools. Here, we used transcriptomes to reconstruct a molecular phylogeny of dinoflagellate protists, a diverse and globally abundant group of marine plankton organisms whose large and complex genomic sequences have impeded conventional phylogenetic analysis based on genomic data. We hope that our proposed protocol may serve as practical and helpful information for the training and education of novice phycologists.


2020 ◽  
Vol 20 (2) ◽  
pp. e11
Author(s):  
Vicente Enrique Machaca Arceda

Viral subtype classification is highly relevant for the appropriate diagnosis and treatment of illnesses. The most widely used tools are alignment-based; however, they are becoming too slow as the volume of genomic data increases. For that reason, alignment-free methods have emerged as an alternative. In this work, we analyzed four alignment-free algorithms: two methods use k-mer frequencies (Kameris and Castor-KRFE); the third uses a frequency chaos game representation of DNA with CNNs; and the last processes DNA sequences as a digital signal (ML-DSP). In the comparison, Kameris and Castor-KRFE outperformed the rest, followed by the method based on CNNs.
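As a rough illustration of what a frequency chaos game representation is, the Python sketch below arranges k-mer counts on a 2^k by 2^k grid that a CNN could take as an image. It is not taken from any of the compared tools: the corner assignment and the choice to treat the first base of each k-mer as the most significant bit are assumptions, and published implementations use differing conventions.

```python
def fcgr(sequence, k):
    """Build a 2**k by 2**k grid of k-mer counts in a chaos-game layout.

    Each base contributes one bit to the column index and one bit to the
    row index, so every k-mer maps to exactly one cell of the grid.
    """
    # Assumed corner assignment as (column bit, row bit).
    bits = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Skip k-mers containing ambiguous symbols such as N.
        if any(base not in bits for base in kmer):
            continue
        col = row = 0
        for base in kmer:
            col_bit, row_bit = bits[base]
            col = (col << 1) | col_bit
            row = (row << 1) | row_bit
        grid[row][col] += 1
    return grid
```

Flattening the same grid row by row gives a plain k-mer frequency vector, which is the kind of input the abstract attributes to the k-mer based methods.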


2019 ◽  
Author(s):  
F. Menardo ◽  
S. Duchêne ◽  
D. Brites ◽  
S. Gagneux

Abstract The molecular clock and its phylogenetic applications to genomic data have changed how we study and understand one of the major human pathogens, Mycobacterium tuberculosis (MTB), the causal agent of tuberculosis. Genome sequences of MTB strains sampled at different times are increasingly used to infer when a particular outbreak began, when a drug-resistant clone appeared and expanded, or when a strain was introduced into a specific region. Despite the growing importance of the molecular clock in tuberculosis research, there is a lack of consensus as to whether MTB displays clocklike behavior and about its rate of evolution. Here we performed a systematic study of the MTB molecular clock on a large genomic data set (6,285 strains), covering most of the global MTB diversity and representing different epidemiological settings. We found wide variation in the degree of clocklike structure among data sets, indicating that sampling times are sometimes insufficient to calibrate the clock of MTB. For data sets with temporal structure, we found that MTB genomes accumulate between 1×10^-8 and 5×10^-7 nucleotide changes per site per year, which corresponds to 0.04 to 2.2 SNPs per genome per year. Contrary to expectations, these estimates were not dependent on the time of the calibration points, as they did not change significantly when we used epidemiological isolates (sampled in the last 40 years) or ancient DNA samples (about 1,000 years old) to calibrate the tree. Additionally, the uncertainty and the discrepancies in the results of different methods were often large, highlighting the importance of using different methods, and of considering carefully their assumptions and limitations.
Significance Statement One of the major recent advances in evolutionary biology is the development of statistical methods to infer the past evolutionary history of species and populations with genomic data. In the last five years, many researchers have used the molecular clock to study the evolution of Mycobacterium tuberculosis, a bacterial pathogen that causes tuberculosis and is responsible for millions of human deaths every year. The application of the molecular clock to tuberculosis is extremely useful to understand the evolution of drug resistance, the spread of different strains and the origin of the disease. Since some of these studies found contrasting results, we performed a systematic analysis of the molecular clock of MTB. This study will provide an important guideline for future analyses of tuberculosis and other organisms.
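The correspondence between the per-site and per-genome rates quoted above can be checked with a short Python calculation. The genome length used here is an assumption not stated in the abstract: it is the length of the H37Rv reference genome, about 4.4 Mb.

```python
# Assumed genome length: the M. tuberculosis H37Rv reference, in base pairs.
MTB_GENOME_SIZE_BP = 4_411_532


def snps_per_genome_per_year(rate_per_site_per_year,
                             genome_size_bp=MTB_GENOME_SIZE_BP):
    """Convert a per-site substitution rate into expected SNPs per genome per year."""
    return rate_per_site_per_year * genome_size_bp


low = snps_per_genome_per_year(1e-8)   # lower bound quoted in the abstract
high = snps_per_genome_per_year(5e-7)  # upper bound quoted in the abstract
print(f"{low:.2f} to {high:.1f} SNPs per genome per year")
```

Under that assumed genome length, the two bounds come out at roughly 0.04 and 2.2, matching the range reported in the abstract.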

