A Fast Adaptive Algorithm for Computing Whole-Genome Homology Maps

2018 ◽  
Author(s):  
Chirag Jain ◽  
Sergey Koren ◽  
Alexander Dilthey ◽  
Adam M. Phillippy ◽  
Srinivas Aluru

Abstract Motivation Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes, and identifying repeats. However, for large plant and animal genomes, this task remains compute and memory intensive. Results We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using k-mer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher scoring alignment intervals, we develop a plane-sweep based filtering technique which is theoretically optimal and practically efficient. Implementation of these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. As a result, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about one minute total execution time and < 4 GB memory using 8 CPU threads, achieving significant performance improvement over competing methods. Recall accuracy of computed alignment boundaries was consistently found to be > 97% on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥ 1 Kbp and ≥ 90% identity. The reported output achieves good recall and covers 5% more bases than the current UCSC genome browser's segmental duplication annotation. Availability https://github.com/marbl/[email protected], [email protected]
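The abstract above says identity is estimated from k-mer-based statistics but does not name the estimator. As a rough illustration only, the sketch below uses one common choice: the Jaccard similarity of two k-mer sets, converted to an identity estimate with the Mash-style transform 1 + (1/k) ln(2j / (1 + j)). The function names, the use of exact (unsampled) k-mer sets, and the clamping to zero are assumptions made for this example, not details taken from the paper.

```python
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity of the k-mer sets of sequences a and b."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0  # neither sequence is long enough to contain a k-mer
    return len(ka & kb) / len(union)


def identity_estimate(j, k):
    """Map a Jaccard value j to an approximate identity in [0, 1].

    Uses the Mash-style transform 1 + (1/k) * ln(2j / (1 + j)),
    clamped at zero (the clamp is an assumption of this sketch).
    """
    if j <= 0.0:
        return 0.0
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)
```

For example, `identity_estimate(jaccard(a, b, 16), 16)` gives a quick identity guess for two sequences `a` and `b`. A practical mapper would sample k-mers rather than compare full sets, since exact sets are too large for whole genomes.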

BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ai Sasou ◽  
Yoshikazu Yuki ◽  
Ayaka Honma ◽  
Kotomi Sugiura ◽  
Koji Kashima ◽  
...  

Abstract Background We have previously developed a rice-based oral vaccine against cholera diarrhea, MucoRice-CTB. Using Agrobacterium-mediated co-transformation, we produced the selection marker–free MucoRice-CTB line 51A, which has three copies of the cholera toxin B subunit (CTB) gene and two copies of an RNAi cassette inserted into the rice genome. We determined the sequence and location of the transgenes on rice chromosomes 3 and 12. The expression of alpha-amylase/trypsin inhibitor, a major allergen protein in rice, is lower in this line than in wild-type rice. Line 51A was self-pollinated for five generations to fix the transgenes, and the seeds of the sixth generation produced by T5 plants were defined as the master seed bank (MSB). T6 plants were grown from part of the MSB seeds and were self-pollinated to produce T7 seeds (next seed bank; NSB). NSB was examined and its whole genome and proteome were compared with those of MSB. Results We re-sequenced the transgenes of NSB and MSB and confirmed the positions of the three CTB genes inserted into chromosomes 3 and 12. The DNA sequences of the transgenes were identical between NSB and MSB. Using whole-genome sequencing, we compared the genome sequences of three NSB samples with those of three MSB samples, and evaluated the effects of SNPs and genomic structural variants by clustering. No functionally important mutations (SNPs, translocations, deletions, or inversions of genic regions on chromosomes) between NSB and MSB samples were detected. Analysis of salt-soluble proteins from NSB and MSB samples by shotgun MS/MS detected no considerable differences in protein abundance. No difference in the expression pattern of storage proteins and CTB in mature seeds of NSB and MSB was detected by immunofluorescence microscopy. Conclusions All analyses revealed no considerable differences between NSB and MSB samples. Therefore, NSB can be used to replace MSB in the near future.


2018 ◽  
Vol 34 (17) ◽  
pp. i748-i756 ◽  
Author(s):  
Chirag Jain ◽  
Sergey Koren ◽  
Alexander Dilthey ◽  
Adam M Phillippy ◽  
Srinivas Aluru

2019 ◽  
Author(s):  
DJ Darwin R. Bandoy ◽  
B Carol Huang ◽  
Bart C. Weimer

Abstract Taxonomic classification is an essential step in the analysis of microbiome data that depends on a reference database of whole genome sequences. Taxonomic classifiers are built on established reference species, such as those in the Human Microbiome Project database, which is growing rapidly. While constructing a population-wide pangenome of the bacterium Hungatella, we discovered that the Human Microbiome Project reference species Hungatella hathewayi (WAL 18680) was significantly different from other members of this genus. Specifically, the reference lacked the core genome as compared to the other members. Further analysis, using average nucleotide identity (ANI) and 16S rRNA comparisons, indicated that WAL 18680 was misclassified as Hungatella. The error in classification is being amplified in the taxonomic classifiers and will have a compounding effect as microbiome analyses are done, resulting in inaccurate assignment of community members and leading to fallacious conclusions and possibly inappropriate treatment. As automated genome homology assessment expands for microbiome analysis and outbreak detection, and as public health reliance on whole genomes increases, this issue will likely occur at an increasing rate. These observations highlight the need for developing reference-free methods for epidemiological investigation using whole genome sequences and the criticality of accurate reference databases.


2020 ◽  
Vol 36 (14) ◽  
pp. 4130-4136
Author(s):  
David J Burks ◽  
Rajeev K Azad

Abstract Motivation Alignment-free, stochastic models derived from k-mer distributions representing reference genome sequences have a rich history in the classification of DNA sequences. In particular, the variants of Markov models have previously been used extensively. Higher-order Markov models have been used with caution, perhaps sparingly, primarily because of the lack of enough training data and computational power. Advances in sequencing technology and computation have enabled exploitation of the predictive power of higher-order models. We, therefore, revisited higher-order Markov models and assessed their performance in classifying metagenomic sequences. Results Comparative assessment of higher-order models (HOMs, 9th order or higher) with interpolated Markov model, interpolated context model and lower-order models (8th order or lower) was performed on metagenomic datasets constructed using sequenced prokaryotic genomes. Our results show that HOMs outperform other models in classifying metagenomic fragments as short as 100 nt at all taxonomic ranks, and at lower ranks when the fragment size was increased to 250 nt. HOMs were also found to be significantly more accurate than local alignment which is widely relied upon for taxonomic classification of metagenomic sequences. A novel software implementation written in C++ performs classification faster than the existing Markovian metagenomic classifiers and can therefore be used as a standalone classifier or in conjunction with existing taxonomic classifiers for more robust classification of metagenomic sequences. Availability and implementation The software has been made available at https://github.com/djburks/SMM. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
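To make the idea of classifying fragments with a fixed-order Markov model concrete, here is a minimal sketch. It is not the SMM implementation linked above: the function names, the add-one pseudocount, and the toy training sequences in the usage note are all assumptions chosen for illustration. Each reference genome gets a table of counts for "next base given the preceding `order` bases", and a fragment is assigned to the reference whose table gives it the highest log-likelihood.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train(seq, order):
    """Count, for every context of length `order`, how often each base follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    return counts


def log_likelihood(fragment, counts, order, pseudo=1.0):
    """Log-probability of `fragment` under the model, with additive smoothing.

    The pseudocount (an assumption of this sketch) keeps unseen
    context/base pairs from producing log(0).
    """
    total = 0.0
    for i in range(order, len(fragment)):
        ctx = counts.get(fragment[i - order:i], {})
        num = ctx.get(fragment[i], 0) + pseudo
        den = sum(ctx.values()) + pseudo * len(ALPHABET)
        total += math.log(num / den)
    return total


def classify(fragment, models, order):
    """Return the name of the model that scores `fragment` highest."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name], order))
```

With two toy references, `models = {"at_rich": train("AT" * 32, 2), "gc_rich": train("GC" * 32, 2)}`, the call `classify("ATATATAT", models, 2)` picks `"at_rich"`. The number of possible contexts grows as 4^order, which is why, as the abstract notes, higher-order models need much more training data and computation.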


2019 ◽  
Vol 8 (42) ◽  
Author(s):  
Gabriela Vuletin Selak ◽  
Marina Raboteg ◽  
Audrey Dubost ◽  
Danis Abrouk ◽  
Katja Žanić ◽  
...  

Here, we present the total genome sequence of Pantoea sp. strain paga, a plant-associated bacterium isolated from knots present on olive trees grown on the Adriatic Coast. The genome size of Pantoea sp. paga is 5.08 Mb, with a G+C content of 54%. The genome contains 4,776 predicted coding DNA sequences (CDSs), including 70 tRNA genes and 1 ribosomal operon. Obtained genome sequence data will provide insight on the physiology, ecology, and evolution of Pantoea spp.


2016 ◽  
Vol 4 (4) ◽  
Author(s):  
U. Garza-Ramos ◽  
J. Silva-Sánchez ◽  
J. Catalán-Nájera ◽  
H. Barrios ◽  
N. Rodríguez-Medina ◽  
...  

A clinical isolate of extended-spectrum-β-lactamase-producing Klebsiella quasipneumoniae subsp. similipneumoniae 06-219 with hypermucoviscosity phenotypes obtained from a urine culture of an adult patient was used for whole-genome sequencing. Here, we report the draft genome sequences of this strain, consisting of 53 contigs with an ~5.6-Mb genome size and an average G+C content of 57.36%. The annotation revealed 6,622 coding DNA sequences and 77 tRNA genes.


1995 ◽  
Vol 11 (2) ◽  
pp. 147-153 ◽  
Author(s):  
Kun-Mao Chao ◽  
Jinghui Zhang ◽  
James Ostell ◽  
Webb Miller

Blood ◽  
2008 ◽  
Vol 112 (11) ◽  
pp. 5120-5120
Author(s):  
Brian GM Durie ◽  
Julia Beck ◽  
Sundar Jagannath ◽  
Howard B. Urnovitz ◽  
Ekkehard Schutz

Abstract There is no single sequence or breakpoint region linked to all patients with myeloma. It was thus elected to extract, amplify and fully sequence all (total) DNA present in circulating blood (CNA) using pyrosequencing with the Roche/454 sequencer (GSFLX). The origin of circulating DNA was investigated by local alignment analyses using BLAST and compared with the circulating DNA of 47 healthy individuals, aged 18 to 64 years. Thirty-one samples were subjected to total CNA sequencing during the course of the disease for a patient with lambda Bence Jones myeloma sequentially monitored from relapse, through complete response for 38 months and then with development of subsequent relapse. Conventional monitoring included both serum free light chain analyses and whole body CT-PET imaging. For comparative purposes the CNA was categorized by origin, functionality, chromosomal localization and genes, based on public databases (human reference genome build 36.2 and corresponding annotation file seq_gene.md as downloaded from NCBI (ftp.ncbi.nih.gov)). The average length of sequenced reads was 185.7 nucleotides and the total number of nucleotides sequenced per sample was 1.2 – 2.6 million. CNA sequences occurring with a frequency significantly greater or less than those observed for healthy controls were evaluated in detail. The next steps were to examine correlations with initial active myeloma, response to treatment, the complete remission state off all therapy, and subsequent relapse. The presence of 10 DNA sequences was highly correlated with the disease course: ZMYM2; TSPAN5; TLL1; PRKD1; ANTXR1; SYT14; PRKCH; MBNL1; EGFR and EDG7. Of these, ZMYM2 was highly correlated with disease reactivation and relapse off therapy. TSPAN5 showed a clear pattern with reduction to background levels during remission, but increase at the very earliest sign of relapse evident on CT-PET. Conversely, other sequences (GRID1; KIF16B) increased substantially during the peak impact of successful therapy. Multivariate analyses were used to identify the best combinations of gene sequences predictive of disease course and outcome. The combination of four genes (GRID1; PRKD1; ANTXR1; GAB1) was sufficient to define the early active disease data points vs the normal controls (odds-ratio OR: 135 [10.6 to 172]; p<.0001). When the resulting model was kept and used as a search engine for the whole 2-year course, the time course followed the treatment success (OR: 54; p=.0017) and failure as well as the final relapse (OR: 13.3; p=.004). In conclusion: Total sequencing of DNA has proved promising in providing personalized molecular monitoring for myeloma. The details of associated DNA sequences provide insights into the underlying molecular pathophysiology linked to myeloma disease progression, response to successful therapy as well as subsequent relapse. Further testing is ongoing.

