SWALO: scaffolding with assembly likelihood optimization

10.1101/081786 ◽

2016 ◽

Cited By ~ 1

Author(s):

Atif Rahman ◽

Lior Pachter

Keyword(s):

Maximum Likelihood ◽

Genome Assembly ◽

Statistical Models ◽

Substantial Improvement ◽

Generative Models ◽

Large Datasets ◽

Maximum Likelihood Estimates ◽

Link Type ◽

Genome Assemblies

Abstract Scaffolding, i.e., ordering and orienting contigs, is an important step in genome assembly. We present a method for scaffolding based on likelihoods of genome assemblies. Generative models for sequencing are used to obtain maximum likelihood estimates of gaps between contigs and to estimate whether linking contigs into scaffolds would lead to an increase in the likelihood of the assembly. We then link contigs if they can be unambiguously joined or if the corresponding increase in likelihood is substantially greater than that of other possible joins of those contigs. The method is implemented in a tool called Swalo with approximations to make it efficient and applicable to large datasets. Analysis on real and simulated datasets reveals that it consistently makes more or a similar number of correct joins compared with other scaffolders while linking very few contigs incorrectly, thus outperforming other scaffolders and demonstrating that substantial improvement in genome assembly may be achieved through the use of statistical models. Swalo is freely available for download at https://atifrahman.github.io/SWALO/.
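The linking rule described in this abstract (join contigs when the join is unambiguous, or when its likelihood gain clearly exceeds that of competing joins) can be sketched roughly as follows. This is a minimal illustration and not Swalo's actual implementation: the function name `select_joins`, the `margin` threshold, and the representation of candidate joins as `(contig_a, contig_b, gain)` tuples are all assumptions made for the example, and contig ends and orientations are ignored for brevity.

```python
from collections import defaultdict


def select_joins(candidates, margin=5.0):
    """Pick joins that are unambiguous or clearly better than the alternatives.

    candidates: list of (contig_a, contig_b, gain) tuples, where gain is a
        hypothetical increase in assembly log-likelihood if the join is made.
    margin: assumed minimum gap between the best and second-best gain.
    """
    # Collect, for every contig, all partners that would increase the likelihood.
    by_contig = defaultdict(list)
    for a, b, gain in candidates:
        if gain <= 0:
            continue
        by_contig[a].append((gain, b))
        by_contig[b].append((gain, a))

    def is_clear_best(contig, partner):
        options = sorted(by_contig[contig], reverse=True)
        if options[0][1] != partner:
            return False
        # Either the only option (unambiguous) or well ahead of the runner-up.
        return len(options) == 1 or options[0][0] - options[1][0] >= margin

    accepted = []
    for a, b, gain in candidates:
        # A join is kept only if it is the clear best choice for both contigs.
        if gain > 0 and is_clear_best(a, b) and is_clear_best(b, a):
            accepted.append((a, b))
    return accepted


if __name__ == "__main__":
    example = [
        ("c1", "c2", 12.0),  # clearly better than c1-c3
        ("c1", "c3", 3.0),
        ("c4", "c5", 2.0),   # the only option for both contigs
        ("c6", "c7", 4.0),   # too close to c6-c8, so neither is accepted
        ("c6", "c8", 3.5),
    ]
    print(select_joins(example))
```

Requiring the join to be the clear best choice for both contigs is one conservative way to keep incorrect links rare; the tool itself may apply a different criterion.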

RefKA: A fast and efficient long-read genome assembly approach for large and complex genomes

10.1101/2020.04.17.035287 ◽

2020 ◽

Author(s):

Yuxuan Yuan ◽

Philipp E. Bayer ◽

Robyn Anderson ◽

HueyTyng Lee ◽

Chon-Kit Kenneth Chan ◽

...

Keyword(s):

Genome Assembly ◽

Chinese Spring ◽

Complete Genome ◽

Reference Genome ◽

Computing Time ◽

Link Type ◽

Recent Advances ◽

Long Read ◽

Genome Assemblies

Abstract Recent advances in long-read sequencing have the potential to produce more complete genome assemblies using sequence reads which can span repetitive regions. However, overlap-based assembly methods routinely used for this data require significant computing time and resources. Here, we have developed RefKA, a reference-based approach for long-read genome assembly. This approach relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel followed by a final bin-stitching step. During benchmarking, we assembled the wheat Chinese Spring (CS) genome using publicly available PacBio reads in parallel in 168 wall hours on a 250 CPU system. The maximum RAM used was 300 Gb and the computing time was 42,000 CPU hours. The approach opens applications for the assembly of other large and complex genomes with much-reduced computing requirements. The RefKA pipeline is available at https://github.com/AppliedBioinformatics/RefKA
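The benchmark figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below assumes that CPU hours are counted as wall-clock hours multiplied by the number of CPUs in use; the helper name `cpu_hours` is hypothetical.

```python
def cpu_hours(wall_hours, n_cpus):
    """CPU hours consumed if all n_cpus are busy for the whole wall-clock time."""
    return wall_hours * n_cpus


if __name__ == "__main__":
    total = cpu_hours(wall_hours=168, n_cpus=250)
    print(total)     # CPU hours at full utilization
    print(168 / 24)  # the same run expressed in days of wall-clock time
```

Under that assumption, 168 wall hours on 250 CPUs corresponds to the reported 42,000 CPU hours, which suggests the reported figure assumes all CPUs were occupied for the entire run.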

Mixtures and products in two graphical models

Journal of Algebraic Statistics ◽

10.18409/jas.v9i1.90 ◽

2018 ◽

Vol 9 (1) ◽

pp. 1-20 ◽

Cited By ~ 1

Author(s):

Anna Seigal ◽

Guido Montufar

Keyword(s):

Maximum Likelihood ◽

Closed Form ◽

Graphical Models ◽

Statistical Models ◽

The Other ◽

Maximum Likelihood Estimates ◽

Boltzmann Machine ◽

Algebraic Description ◽

Binary Random Variables ◽

Probability Simplex

We compare two statistical models of three binary random variables. One is a mixture model and the other is a product of mixtures model called a restricted Boltzmann machine. Although the two models we study look different from their parametrizations, we show that they represent the same set of distributions on the interior of the probability simplex, and are equal up to closure. We give a semi-algebraic description of the model in terms of six binomial inequalities and obtain closed form expressions for the maximum likelihood estimates. We briefly discuss extensions to larger models.

Genome Warehouse: A Public Repository Housing Genome-scale Data

10.1101/2021.02.10.430367 ◽

2021 ◽

Author(s):

Meili Chen ◽

Yingke Ma ◽

Song Wu ◽

Xinchang Zheng ◽

Hongen Kang ◽

...

Keyword(s):

Genome Assembly ◽

Public Repository ◽

Genome Sequences ◽

Link Type ◽

Genome Data ◽

Research Activities ◽

Wide Range ◽

Genome Assemblies ◽

Genome Scale ◽

Scale Data

Abstract The Genome Warehouse (GWH) is a public repository housing genome assembly data for a wide range of species and delivering a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB, https://bigd.big.ac.cn/), GWH accepts both full genome and partial genome (chloroplast, mitochondrion, and plasmid) sequences with different assembly levels, as well as updates of existing genome assemblies. For each assembly, GWH collects detailed genome-related metadata including biological project and sample, and genome assembly information, in addition to genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH is equipped with a uniform and standardized procedure for quality control. Besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with JBrowse. By December 2020, GWH had received 17,264 direct submissions covering a diversity of 949 species, and had released 3,370 of them. Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://bigd.big.ac.cn/gwh/.

On Negative Heritability and Negative Estimates of Heritability

Genetics ◽

10.1534/genetics.120.303161 ◽

2020 ◽

Vol 215 (2) ◽

pp. 343-357 ◽

Cited By ~ 1

Author(s):

David Steinsaltz ◽

Andy Dahl ◽

Kenneth W. Wachter

Keyword(s):

Genetic Variation ◽

Maximum Likelihood ◽

Statistical Models ◽

Biological Process ◽

Random Noise ◽

Statistical Procedure ◽

Maximum Likelihood Estimates ◽

Physical Feature ◽

Additive Genetic Variation ◽

Heritability Estimates

We consider the problem of interpreting negative maximum likelihood estimates of heritability that sometimes arise from popular statistical models of additive genetic variation. These may result from random noise acting on estimates of genuinely positive heritability, but we argue that they may also arise from misspecification of the standard additive mechanism that is supposed to justify the statistical procedure. Researchers should be open to the possibility that negative heritability estimates could reflect a real physical feature of the biological process from which the data were sampled.

Tapestry: validate and edit small eukaryotic genome assemblies with long reads

10.1101/2020.04.24.059402 ◽

2020 ◽

Author(s):

John W. Davey ◽

Seth J. Davis ◽

Jeremy C. Mottram ◽

Peter D. Ashton

Keyword(s):

Genome Assembly ◽

Source Code ◽

Gc Content ◽

Eukaryotic Genome ◽

Link Type ◽

Long Reads ◽

Genome Assemblies

Abstract Summary: Small eukaryotic genome assemblies based on long reads are often close to complete, but still require validation and editing. Tapestry produces an interactive report which can be used to validate, sort and filter the contigs in a raw genome assembly, taking into account GC content, telomeres, read depths, contig alignments and read alignments. The report can be shared with collaborators and included as supplemental material in publications. Availability: Source code is freely available at https://github.com/johnomics/tapestry. Package is freely available in Bioconda (https://anaconda.org/bioconda/tapestry).

On Negative Heritability and Negative Estimates of Heritability

10.1101/232843 ◽

2017 ◽

Cited By ~ 2

Author(s):

David Steinsaltz ◽

Andy Dahl ◽

Kenneth W. Wachter

Keyword(s):

Genetic Variation ◽

Maximum Likelihood ◽

Statistical Models ◽

Biological Process ◽

Random Noise ◽

Statistical Procedure ◽

Maximum Likelihood Estimates ◽

Physical Feature ◽

Additive Genetic Variation ◽

Heritability Estimates

Abstract We consider the problem of interpreting negative maximum likelihood estimates of heritability that sometimes arise from popular statistical models of additive genetic variation. These may result from random noise acting on estimates of genuinely positive heritability, but we argue that they may also arise from misspecification of the standard additive mechanism that is supposed to justify the statistical procedure. Researchers should be open to the possibility that negative heritability estimates could reflect a real physical feature of the biological process from which the data were sampled.

446. Evaluation of Carbon Tetrachloride Area Sampling Data Using a Computer Program to Produce Maximum Likelihood Estimates of the Summary Statistics

10.3320/1.2763315 ◽

1999 ◽

Author(s):

J. Robertson ◽

W. Galke

Keyword(s):

Carbon Tetrachloride ◽

Maximum Likelihood ◽

Computer Program ◽

Maximum Likelihood Estimates ◽

Summary Statistics ◽

Area Sampling

Directional Selection and the Site-Frequency Spectrum

Genetics ◽

10.1093/genetics/159.4.1779 ◽

2001 ◽

Vol 159 (4) ◽

pp. 1779-1788 ◽

Cited By ~ 2

Author(s):

Carlos D Bustamante ◽

John Wakeley ◽

Stanley Sawyer ◽

Daniel L Hartl

Keyword(s):

Maximum Likelihood ◽

High Power ◽

Negative Selection ◽

Likelihood Estimation ◽

Directional Selection ◽

Likelihood Ratio Tests ◽

Maximum Likelihood Estimates ◽

Likelihood Methods ◽

Ancestral States ◽

Asymptotic Variances

Abstract In this article we explore statistical properties of the maximum-likelihood estimates (MLEs) of the selection and mutation parameters in a Poisson random field population genetics model of directional selection at DNA sites. We derive the asymptotic variances and covariance of the MLEs and explore the power of the likelihood ratio tests (LRT) of neutrality for varying levels of mutation and selection as well as the robustness of the LRT to deviations from the assumption of free recombination among sites. We also discuss the coverage of confidence intervals on the basis of two standard-likelihood methods. We find that the LRT has high power to detect deviations from neutrality and that the maximum-likelihood estimation performs very well when the ancestral states of all mutations in the sample are known. When the ancestral states are not known, the test has high power to detect deviations from neutrality for negative selection but not for positive selection. We also find that the LRT is not robust to deviations from the assumption of independence among sites.
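For readers unfamiliar with likelihood ratio tests, the general recipe can be sketched on a much simpler problem than the one studied in this article. The example below tests a fixed Poisson rate against the maximum-likelihood rate for a list of counts; it is not the Poisson random field model of the paper, and the function names and the one-degree-of-freedom chi-square approximation are assumptions of this illustration.

```python
import math


def poisson_loglik(counts, rate):
    """Log-likelihood of independent Poisson counts with a common rate."""
    return sum(x * math.log(rate) - rate - math.lgamma(x + 1) for x in counts)


def lrt_poisson_rate(counts, null_rate):
    """Likelihood ratio test of H0: rate == null_rate against a free rate.

    Returns (mle_rate, statistic, p_value). The p-value uses the asymptotic
    chi-square distribution with one degree of freedom.
    """
    if not counts or sum(counts) == 0:
        raise ValueError("need at least one non-zero count")
    mle_rate = sum(counts) / len(counts)  # the MLE of a Poisson rate is the mean
    stat = 2.0 * (poisson_loglik(counts, mle_rate) - poisson_loglik(counts, null_rate))
    stat = max(stat, 0.0)  # guard against tiny negative values from rounding
    # Survival function of chi-square with 1 d.f.: P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return mle_rate, stat, p_value


if __name__ == "__main__":
    counts = [3, 5, 4, 6, 2]
    print(lrt_poisson_rate(counts, null_rate=4.0))
    print(lrt_poisson_rate(counts, null_rate=1.0))
```

The chi-square approximation is only asymptotic, so for small samples the reported p-value should be treated as approximate.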

An Integrated Framework for the Inference of Viral Population History From Reconstructed Genealogies

Genetics ◽

10.1093/genetics/155.3.1429 ◽

2000 ◽

Vol 155 (3) ◽

pp. 1429-1437

Author(s):

Oliver G Pybus ◽

Andrew Rambaut ◽

Paul H Harvey

Keyword(s):

Maximum Likelihood ◽

Sequence Data ◽

Demographic History ◽

Population History ◽

Maximum Likelihood Estimates ◽

Viral Population ◽

True Parameter ◽

Subtype B ◽

Exponential Growth Model ◽

Parameter Values

Abstract We describe a unified set of methods for the inference of demographic history using genealogies reconstructed from gene sequence data. We introduce the skyline plot, a graphical, nonparametric estimate of demographic history. We discuss both maximum-likelihood parameter estimation and demographic hypothesis testing. Simulations are carried out to investigate the statistical properties of maximum-likelihood estimates of demographic parameters. The simulations reveal that (i) the performance of exponential growth model estimates is determined by a simple function of the true parameter values and (ii) under some conditions, estimates from reconstructed trees perform as well as estimates from perfect trees. We apply our methods to HIV-1 sequence data and find strong evidence that subtypes A and B have different demographic histories. We also provide the first (albeit tentative) genetic evidence for a recent decrease in the growth rate of subtype B.
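The abstract does not spell out how the skyline plot is computed. As commonly described, the classic skyline estimate for an interval of a genealogy with k lineages and duration t is t * k * (k - 1) / 2, giving a piecewise-constant estimate of scaled population size through time. The sketch below follows that description; the function name `classic_skyline` and the input format are assumptions of this example, and branch lengths are taken as given.

```python
def classic_skyline(intervals):
    """Piecewise-constant population size estimates from coalescent intervals.

    intervals: list of (n_lineages, duration) pairs, ordered from the tips of
        the genealogy toward the root. Durations are in whatever units the
        branch lengths use, and the estimates are scaled accordingly.
    Returns a list of (start_time, end_time, estimate) tuples.
    """
    steps = []
    elapsed = 0.0
    for k, duration in intervals:
        if k < 2:
            raise ValueError("each interval needs at least two lineages")
        # With k lineages there are k * (k - 1) / 2 pairs that could coalesce.
        estimate = duration * k * (k - 1) / 2.0
        steps.append((elapsed, elapsed + duration, estimate))
        elapsed += duration
    return steps


if __name__ == "__main__":
    # A hypothetical four-tip genealogy: intervals with 4, 3, then 2 lineages.
    for start, end, size in classic_skyline([(4, 0.1), (3, 0.2), (2, 0.5)]):
        print(f"{start:.2f} to {end:.2f}: {size:.2f}")
```

Because each interval is estimated from a single coalescent event, the resulting plot is noisy, which is consistent with the abstract's description of it as a nonparametric, graphical summary.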

Maximum likelihood estimates of diffusion coefficients from single-particle tracking experiments

The Journal of Chemical Physics ◽

10.1063/5.0038174 ◽

2021 ◽

Vol 154 (23) ◽

pp. 234105

Author(s):

Jakob Tómas Bullerjahn ◽

Gerhard Hummer

Keyword(s):

Maximum Likelihood ◽

Particle Tracking ◽

Single Particle ◽

Diffusion Coefficients ◽

Single Particle Tracking ◽

Maximum Likelihood Estimates
