Chromosome assembly of large and complex genomes using multiple references

2016 ◽  
Author(s):  
Mikhail Kolmogorov ◽  
Joel Armstrong ◽  
Brian J. Raney ◽  
Ian Streeter ◽  
Matthew Dunn ◽  
...  

Abstract Despite the rapid development of sequencing technologies, assembly of mammalian-scale genomes into complete chromosomes remains one of the most challenging problems in bioinformatics. To help address this difficulty, we developed Ragout, a reference-assisted assembly tool that now works for large and complex genomes. Taking one or more target assemblies (generated by an NGS assembler) and one or more related reference genomes, Ragout infers the evolutionary relationships between the genomes and builds the final assemblies using a genome rearrangement approach. Using Ragout, we transformed NGS assemblies of 15 Mus musculus genomes and one Mus spretus genome into sets of complete chromosomes, leaving less than 5% of sequence unlocalized per set. Various benchmarks, including PCR testing and realignment of long PacBio reads, suggest only a small number of structural errors in the final assemblies, comparable to direct assembly approaches. Additionally, we applied Ragout to the Mus caroli and Mus pahari genomes, which exhibit karyotype-scale variations compared to other genomes from the Muridae family. Chromosome color maps confirmed most of the large-scale rearrangements that Ragout detected.
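The abstract describes the approach only at a high level; as a rough, hypothetical illustration of reference-guided ordering of target contigs (not Ragout's actual algorithm), the Python sketch below places contigs along one reference using shared, signed synteny-block identifiers. The function and variable names (order_contigs, reference_order, blocks_per_contig) are assumptions for illustration.

# Minimal sketch (not Ragout itself): order and orient target contigs along
# one reference genome using shared, signed synteny-block identifiers.
def order_contigs(reference_order, blocks_per_contig):
    # Position and strand of every block on the reference.
    ref_pos = {abs(b): (i, 1 if b > 0 else -1) for i, b in enumerate(reference_order)}

    placements = []
    for contig, blocks in blocks_per_contig.items():
        shared = [b for b in blocks if abs(b) in ref_pos]
        if not shared:
            continue  # contig stays unlocalized
        first = shared[0]
        pos, ref_strand = ref_pos[abs(first)]
        contig_strand = 1 if first > 0 else -1
        # Flip the contig if its block sign disagrees with the reference strand.
        orientation = "+" if contig_strand == ref_strand else "-"
        placements.append((pos, contig, orientation))

    return [(contig, orientation) for _, contig, orientation in sorted(placements)]

if __name__ == "__main__":
    reference_order = [1, -2, 3, 4]          # signed block IDs along one chromosome
    blocks_per_contig = {"ctg_a": [3, 4], "ctg_b": [2, -1]}
    print(order_contigs(reference_order, blocks_per_contig))
    # [('ctg_b', '-'), ('ctg_a', '+')]

A full rearrangement-based approach additionally reconciles adjacencies across several references and resolves conflicts between them, which this toy ordering does not attempt.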


BMC Genomics ◽  
2019 ◽  
Vol 20 (S10) ◽  
Author(s):  
Tao Tang ◽  
Yuansheng Liu ◽  
Buzhong Zhang ◽  
Benyue Su ◽  
Jinyan Li

Abstract Background The rapid development of Next-Generation Sequencing technologies enables sequencing genomes at low cost. The dramatically increasing amount of sequencing data has raised a crucial need for efficient compression algorithms. Reference-based compression algorithms have exhibited outstanding performance on compressing single genomes. However, for the more challenging and more useful problem of compressing a large collection of n genomes, straightforward application of these reference-based algorithms suffers from a series of issues such as difficult reference selection and remarkable performance variation. Results We propose an efficient clustering-based reference selection algorithm for reference-based compression within separate clusters of the n genomes. This method clusters the genomes into subsets of highly similar genomes using MinHash sketch distance, and uses the centroid sequence of each cluster as the reference genome for reference-based compression of the remaining genomes in that cluster. A final reference is then selected from these cluster references for the compression of the remaining reference genomes. Our method significantly improved the performance of state-of-the-art compression algorithms on large-scale human and rice genome databases containing thousands of genome sequences. The compression ratio gain reaches 20-30% in most cases for the datasets from NCBI, the 1000 Human Genomes Project and the 3000 Rice Genomes Project. The best improvement boosts the performance from a 351.74-fold to a 443.51-fold compression ratio. Conclusions The compression ratio of reference-based compression on large-scale genome datasets can be improved via reference selection by applying appropriate data preprocessing and clustering methods. Our algorithm provides an efficient way to compress large genome databases.
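As a minimal sketch of the MinHash-plus-centroid idea described above (assuming bottom-k sketches, greedy single-linkage clustering and an illustrative distance threshold; none of these details are taken from the paper), one could write:

# Minimal sketch (not the authors' implementation): estimate pairwise genome
# similarity with MinHash sketches of k-mers, cluster by a distance threshold,
# and pick each cluster's centroid as its compression reference.
import hashlib

def minhash_sketch(seq, k=21, size=200):
    """Keep the `size` smallest hashed k-mers as the sketch of a sequence."""
    hashes = {int(hashlib.md5(seq[i:i + k].encode()).hexdigest(), 16)
              for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:size]

def sketch_distance(s1, s2):
    """1 - estimated Jaccard similarity of the two k-mer sets."""
    merged = sorted(set(s1) | set(s2))[:max(len(s1), len(s2))]
    shared = len(set(merged) & set(s1) & set(s2))
    return 1.0 - shared / len(merged)

def cluster_genomes(genomes, threshold=0.3):
    """Greedy single-linkage clustering; returns (clusters, sketches)."""
    sketches = {name: minhash_sketch(seq) for name, seq in genomes.items()}
    clusters = []
    for name in genomes:
        for cluster in clusters:
            if any(sketch_distance(sketches[name], sketches[m]) < threshold
                   for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters, sketches

def centroid(cluster, sketches):
    """Member with the smallest total distance to the rest of its cluster."""
    return min(cluster, key=lambda g: sum(sketch_distance(sketches[g], sketches[m])
                                          for m in cluster))

Each cluster's centroid then serves as the reference for compressing that cluster's members, and one of the centroids is chosen as the final reference for the centroids themselves.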



2018 ◽  
Author(s):  
Jose V. Die ◽  
Moamen Mahmoud Elmassry ◽  
Kimberly Hathaway LeBlanc ◽  
Olaitan I. Awe ◽  
Allissa Dillman ◽  
...  

Abstract During the last decade, plant biotechnological laboratories have sparked a monumental revolution with the rapid development of next-generation sequencing technologies at affordable prices. Soon, these sequencing technologies and the assembly of whole genomes will extend beyond plant computational biologists and become commonplace within the plant biology disciplines. The current availability of large-scale genomic resources for non-traditional plant model systems (the so-called 'orphan crops') is enabling the construction of high-density integrated physical and genetic linkage maps with potential applications in plant breeding. The newly available fully sequenced plant genomes represent an incredible opportunity for comparative analyses that may reveal new aspects of genome biology and evolution. Analysis of the expansion and evolution of gene families across species is a common approach to infer biological functions. To date, the extent and role of gene families in plants has only been partially addressed and many gene families remain to be investigated. Manual identification of gene families is highly time-consuming and laborious, requiring an iterative process of manual and computational analysis to identify members of a given family, typically combining numerous BLAST searches and manual data cleaning. Given the increasing abundance of genome sequences and the agronomic interest in plant gene families, the field needs a clear, automated annotation tool. Here, we present GeneHummus, a step-by-step R-based pipeline for the identification, characterization and expression analysis of plant gene families. The impact of this pipeline comes from a reduction in hands-on annotation time combined with high specificity and sensitivity in extracting only the relevant proteins from the RefSeq database and providing their conserved domain architectures based on SPARCLE. As a case study we focused on the auxin receptor factor (ARF) gene family in Cicer arietinum (chickpea) and other legumes. We anticipate that our pipeline should be suitable for any plant gene family, and likely for gene families in other organisms, vastly improving the speed and ease of genomic data processing.
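GeneHummus itself is an R pipeline; as a hedged, language-agnostic illustration of the kind of automated RefSeq retrieval such a pipeline replaces, the Python sketch below queries NCBI's protein database through Biopython's Entrez module. The query string, filter term and the family_members helper are assumptions for illustration, not GeneHummus code or defaults.

# Hedged illustration only: retrieve candidate RefSeq proteins for a gene
# family via Biopython's NCBI Entrez interface. Query terms are examples.
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI requires a contact address

def family_members(family_term, organism, retmax=500):
    """Search the protein database for RefSeq records matching a family term."""
    # The "refseq[Filter]" term restricts hits to RefSeq records (assumed filter).
    query = f'{family_term}[Title] AND {organism}[Organism] AND refseq[Filter]'
    handle = Entrez.esearch(db="protein", term=query, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

ids = family_members("auxin response factor", "Cicer arietinum")
print(f"{len(ids)} candidate RefSeq proteins")

The real pipeline goes further, filtering candidates by SPARCLE conserved domain architectures rather than by title keywords alone.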



2015 ◽  
Author(s):  
Peter Menzel ◽  
Kim Lee Ng ◽  
Anders Krogh

The constantly decreasing cost and increasing output of current sequencing technologies enable large-scale metagenomic studies of microbial communities from diverse habitats. Therefore, fast and accurate methods for taxonomic classification are needed that can operate on increasingly large datasets and reference databases. Recently, several fast metagenomic classifiers have been developed that are based on comparison of genomic k-mers. However, nucleotide comparison using a fixed k-mer length often lacks the sensitivity to overcome the evolutionary distance between sampled species and the genomes in the reference database. Here, we present the novel metagenome classifier Kaiju for fast assignment of reads to taxa. Kaiju finds maximum exact matches at the protein level using the Burrows-Wheeler transform, and can optionally allow amino acid substitutions in the search using a greedy heuristic. We show in a genome exclusion study that Kaiju can classify more reads with higher sensitivity and similar precision compared to fast k-mer-based classifiers, especially in genera that are underrepresented in reference databases. We also demonstrate that Kaiju classifies more than twice as many reads in ten real metagenomes compared to programs based on genomic k-mers. Kaiju can process millions of reads per minute, and its memory footprint is below 6 GB of RAM, allowing analysis on a standard PC. The program is available under the GPL3 license at: http://bioinformatics-centre.github.io/kaiju
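Kaiju's actual index is an FM-index over the reference proteins; the sketch below only illustrates the underlying idea of protein-level maximum exact matches, using six-frame translation of a read and a deliberately naive substring search. Helper names and the toy search are assumptions, not Kaiju's implementation.

# Naive illustration of Kaiju's core idea (maximum exact matches at the
# protein level); the real tool uses a Burrows-Wheeler transform / FM-index.
from Bio.Seq import Seq

def six_frame_fragments(read):
    """Translate a read in all six frames and split at stop codons."""
    fragments = []
    for seq in (Seq(read), Seq(read).reverse_complement()):
        for frame in range(3):
            trimmed = seq[frame:frame + 3 * ((len(seq) - frame) // 3)]
            fragments.extend(str(trimmed.translate()).split("*"))
    return [f for f in fragments if f]

def longest_exact_match(read, proteins):
    """Length of the longest fragment substring found in any reference protein."""
    best = 0
    for frag in six_frame_fragments(read):
        for length in range(len(frag), best, -1):
            if any(frag[i:i + length] in p
                   for i in range(len(frag) - length + 1)
                   for p in proteins):
                best = length
                break
    return best

A read would then be assigned to the taxon of the protein yielding the longest match, with ties resolved by the lowest common ancestor of the matching taxa.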



Author(s):  
Leily Rabbani ◽  
Jonas Müller ◽  
Detlef Weigel

Abstract Motivation New DNA sequencing technologies have enabled the rapid analysis of many thousands of genomes from a single species. At the same time, the conventional approach of mapping sequencing reads against a single reference genome sequence is no longer adequate. However, even where multiple high-quality reference genomes are available, the problem remains of how to integrate results from pairwise analyses. Results To overcome the limits imposed by mapping sequence reads against a single reference genome, or serially mapping them against multiple reference genomes, we have developed the MGR method, which allows simultaneous comparison against multiple high-quality reference genomes, removing the bias that comes from using only a single reference genome and simplifying downstream analyses. To this end, we present the MGR algorithm, which creates a graph (the MGR graph) as a multi-genome reference. To reduce the size and complexity of the multi-genome reference, highly similar orthologous and paralogous regions are collapsed while more substantial differences are retained. To evaluate the performance of our model, we have developed a genome compression tool, which can be used to estimate the amount of shared information between genomes. Availability https://github.com/LeilyR/ Contact [email protected]
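The abstract does not detail how the graph is built; as a toy, hypothetical illustration of collapsing identical regions from multiple genomes into shared nodes of a multi-genome reference while keeping per-genome adjacency information, one could sketch it as follows. The function name, fixed segment size and data layout are assumptions, not the MGR construction.

# Toy illustration (not the MGR algorithm): collapse identical fixed-length
# segments shared between genomes into single graph nodes; edges record which
# genome supports which segment adjacency.
from collections import defaultdict

def build_multi_genome_graph(genomes, segment=50):
    """genomes: dict of name -> sequence. Returns (nodes, edges).

    nodes: segment string -> set of genomes containing it (collapsed regions)
    edges: (segment, next_segment) -> set of genomes with that adjacency
    """
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for name, seq in genomes.items():
        chunks = [seq[i:i + segment] for i in range(0, len(seq), segment)]
        for chunk in chunks:
            nodes[chunk].add(name)
        for a, b in zip(chunks, chunks[1:]):
            edges[(a, b)].add(name)
    return nodes, edges

# Segments present in every genome end up collapsed into one node each,
# while genome-specific segments remain as private nodes:
# shared = [s for s, members in nodes.items() if len(members) == len(genomes)]

A real multi-genome reference would collapse highly similar (not only identical) orthologous and paralogous regions, which requires alignment rather than exact segment matching.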



2018 ◽  
Author(s):  
Lizhen Shi ◽  
Xiandong Meng ◽  
Elizabeth Tseng ◽  
Michael Mascagni ◽  
Zhong Wang

Abstract Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and can be optimized for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieves near-linear scalability with respect to input data size and the number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems. The software is available under the Apache 2.0 license at https://bitbucket.org/LizhenShi/sparc.
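The abstract does not spell out the clustering algorithm; the PySpark sketch below illustrates one simplified way to group reads that share k-mers, in the spirit of partitioning reads by molecule of origin. Function and variable names are hypothetical, and the final union-find step runs on the driver for brevity, which a production tool would instead distribute.

# Simplified PySpark sketch: connect reads that share a k-mer, then merge the
# resulting read pairs into clusters with a small union-find on the driver.
# This illustrates the idea only; it is not the SpaRC implementation.
from pyspark import SparkContext

def kmers(read_id, seq, k=21):
    return [(seq[i:i + k], read_id) for i in range(len(seq) - k + 1)]

def cluster_reads(sc, reads, k=21):
    """reads: list of (read_id, sequence). Returns read_id -> cluster root."""
    pairs = (sc.parallelize(reads)
               .flatMap(lambda r: kmers(r[0], r[1], k))   # (kmer, read_id)
               .groupByKey()                              # kmer -> read_ids
               .flatMap(lambda kv: [(min(kv[1]), rid) for rid in kv[1]])
               .distinct()
               .collect())                                # (anchor, read_id) edges

    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union the two clusters
    return {rid: find(rid) for rid, _ in reads}

if __name__ == "__main__":
    sc = SparkContext(appName="read-clustering-sketch")
    demo = [("r1", "ACGTACGTACGTACGTACGTAC"), ("r2", "CGTACGTACGTACGTACGTACG")]
    print(cluster_reads(sc, demo))
    sc.stop()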



2018 ◽  
Author(s):  
Nathan LaPierre ◽  
Rob Egan ◽  
Wei Wang ◽  
Zhong Wang

Abstract Long-read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. Many methods for resolving these errors require access to reference genomes or high-fidelity short reads, which are often not available. De novo error correction modules are available, often as part of assembly tools, but large-scale errors still remain in the resulting assemblies, motivating further innovation in this area. We developed a novel Convolutional Neural Network (CNN) based method, called MiniScrub, for de novo identification and subsequent "scrubbing" (removal) of low-quality Nanopore read segments. MiniScrub first generates read-to-read alignments with MiniMap, then encodes the alignments into images, and finally builds CNN models to predict low-quality segments that can be scrubbed based on a customized quality cutoff. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer misassemblies and large indel errors. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub
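The abstract only outlines the model; as a hedged sketch of the general approach (a small CNN that maps an alignment-image window for a read segment to a quality score), the PyTorch code below shows a made-up minimal architecture. It is not MiniScrub's actual network, and the channel layout, window size and cutoff are illustrative assumptions.

# Minimal, hypothetical CNN for scoring read windows encoded as small images
# (channels = pileup-style features such as match/mismatch/coverage).
import torch
import torch.nn as nn

class WindowQualityCNN(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, x):
        # x: (batch, channels, height, width) alignment-image windows.
        return self.head(self.features(x))  # predicted quality in [0, 1]

if __name__ == "__main__":
    model = WindowQualityCNN()
    windows = torch.rand(8, 3, 32, 48)        # 8 fake pileup windows
    scores = model(windows).squeeze(1)
    keep = scores > 0.5                       # windows below the cutoff get scrubbed
    print(scores.shape, keep.sum().item())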



2020 ◽  
Vol 17 ◽  
Author(s):  
Perumal Subramanian ◽  
Jaime Jacqueline Jayapalan ◽  
Puteri Shafinaz Abdul-Rahman

A proteome is the functional rendition of a genome and directly influences various cancer processes. The molecular mechanisms of several cancer processes have been unraveled by proteomic approaches. Thus far, numerous tumors of diverse status have been investigated by two-dimensional electrophoresis. Numerous biomarkers have been recognized, and precise categorization of apparent lesions has led to the timely detection of various cancers in persons at risk. Currently used pioneering approaches and technologies in proteomics have enabled highly sensitive assays of cancer biomarkers and improved the early diagnosis of various cancers. The discovery of novel and specific biomarker signatures has further widened our understanding of the disease, and novel potent drugs for efficient and targeted therapeutic outcomes in persistent cancers have emerged. However, a major limitation of proteomics, even today, is resolving and quantifying proteins of low abundance. Despite the rapid development of proteomic technologies and their applications in cancer management, overcoming the shortcomings of present proteomic technologies and developing better methods remain desirable. The main objectives of this review are to discuss the developing aspects, merits and demerits of pharmacoproteomics, redox proteomics, and novel approaches and therapies being used for various types of cancer based on proteome studies.



2021 ◽  
Author(s):  
Parsoa Khorsand ◽  
Fereydoun Hormozdiari

Abstract Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second- and third-generation whole-genome sequencing technologies. However, genotyping these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, uses changes in k-mer counts to predict the genotype of structural variants. We show that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping structural variants, but also has accuracy comparable to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
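The abstract states only that genotypes are predicted from changes in k-mer counts; the sketch below shows one naive way such a mapping-free genotyper could work, counting k-mers unique to the variant allele in the raw reads and thresholding against the expected sequencing depth. All names and thresholds are illustrative assumptions, not Nebula's model.

# Naive illustration of k-mer-count genotyping (not Nebula's actual model):
# k-mers spanning the variant allele but absent from the reference are counted
# in the reads; their depth relative to expected coverage suggests a genotype.
from collections import Counter

def kmer_set(seq, k=31):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def genotype_variant(alt_allele_context, reference, reads, coverage, k=31):
    """Return '0/0', '0/1' or '1/1' from counts of alt-specific k-mers."""
    signature = kmer_set(alt_allele_context, k) - kmer_set(reference, k)
    if not signature:
        return None  # no informative k-mers for this event

    counts = Counter()
    for read in reads:
        for kmer in kmer_set(read, k):
            if kmer in signature:
                counts[kmer] += 1

    mean_depth = sum(counts.values()) / len(signature)
    if mean_depth < 0.15 * coverage:       # illustrative thresholds
        return "0/0"
    if mean_depth < 0.75 * coverage:
        return "0/1"
    return "1/1"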



Genetics ◽  
2001 ◽  
Vol 159 (4) ◽  
pp. 1765-1778
Author(s):  
Gregory J Budziszewski ◽  
Sharon Potter Lewis ◽  
Lyn Wegrich Glover ◽  
Jennifer Reineke ◽  
Gary Jones ◽  
...  

Abstract We have undertaken a large-scale genetic screen to identify genes with a seedling-lethal mutant phenotype. From screening ~38,000 insertional mutant lines, we identified >500 seedling-lethal mutants, completed cosegregation analysis of the insertion and the lethal phenotype for >200 mutants, molecularly characterized 54 mutants, and provided a detailed description for 22 of them. Most of the seedling-lethal mutants seem to affect chloroplast function because they display altered pigmentation and affect genes encoding proteins predicted to have chloroplast localization. Although a high level of functional redundancy in Arabidopsis might be expected because 65% of genes are members of gene families, we found that 41% of the essential genes identified in this study are members of Arabidopsis gene families. In addition, we isolated several interesting classes of mutants and genes. We found three mutants in the recently discovered nonmevalonate isoprenoid biosynthetic pathway and mutants disrupting genes similar to Tic40 and tatC, which are likely to be involved in chloroplast protein translocation. Finally, we directly compared T-DNA and Ac/Ds transposon mutagenesis methods in Arabidopsis on a genome scale. In each population, we found that only about one-third of the insertion mutations cosegregated with a mutant phenotype.



2021 ◽  
Author(s):  
Cong Wang ◽  
Zehao Song ◽  
Pei Shi ◽  
Lin Lv ◽  
Houzhao Wan ◽  
...  

With the rapid development of portable electronic devices, electric vehicles and large-scale grid energy storage, there is a need to improve the specific energy and specific power of the related electrochemical devices to meet...


