A pipeline for local assembly of minisatellite alleles from single-molecule sequencing data

To date, little is known about the mechanism underlying the genotype dependence of embryonic callus (EC) induction, which has severely inhibited the development of maize genetic engineering. Here, we report the genome sequence and annotation of A188, a maize inbred line with a high EC induction ratio, assembled from single-molecule sequencing and optical genome mapping. We assembled a 2,210 Mb genome with a scaffold N50 of 11.61 million bases (Mb), compared with 9.73 Mb for B73 and 10.2 Mb for Mo17. Comparative analysis revealed that ~30% of the predicted A188 genes had large structural variations relative to the B73, Mo17 and W22 genomes, which caused considerable protein divergence and might lead to phenotypic variations among the four inbred lines. Combining our new A188 genome, previously reported QTLs and RNA sequencing data, we reveal 8 genes with large structural variations and 4 differentially expressed genes that potentially play roles in EC induction.

Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads

10.1101/345983 ◽

2018 ◽

Cited By ~ 2

Author(s):

Huilong Du ◽

Chengzhi Liang

Keyword(s):

Single Molecule ◽

High Efficiency ◽

Reference Genome ◽

Repetitive Sequences ◽

Sequencing Data ◽

High Quality ◽

Single Molecule Sequencing ◽

Genome Maps ◽

Long Reads ◽

Novel Method

Abstract Due to the large number of repetitive sequences in complex eukaryotic genomes, fragmented and incompletely assembled genomes lose value as reference sequences, often due to short contigs that cannot be anchored or are mispositioned on chromosomes. Here we report a novel method, Highly Efficient Repeat Assembly (HERA), which includes a new concept called a connection graph as well as algorithms for constructing the graph. HERA resolves repeats at high efficiency with single-molecule sequencing data, and enables the assembly of chromosome-scale contigs by further integrating genome maps and Hi-C data. We tested HERA with the genomes of rice R498, maize B73, human HX1 and Tartary buckwheat Pinku1. HERA can correctly assemble most of the tandemly repetitive sequences in rice using single-molecule sequencing data only. Using the same maize and human sequencing data published by Jiao et al. (2017) and Shi et al. (2016), respectively, we dramatically improved the sequence contiguity compared with the published assemblies, increasing the contig N50 from 1.3 Mb to 61.2 Mb in the maize B73 assembly and from 8.3 Mb to 54.4 Mb in the human HX1 assembly. We provide a high-quality maize reference genome with 96.9% of the gaps filled (only 76 gaps left) and several incorrectly positioned sequences fixed compared with the B73 RefGen_v4 assembly. Comparisons between the HERA assembly of HX1 and the human GRCh38 reference genome showed that many gaps in GRCh38 could be filled, and that GRCh38 contained some potential errors that could be fixed. We assembled the Pinku1 genome into 12 scaffolds with a contig N50 size of 27.85 Mb. HERA serves as a new genome assembly/phasing method to generate high-quality sequences for complex genomes and as a curation tool to improve the contiguity and completeness of existing reference genomes, including the correction of assembly errors in repetitive regions.
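
The contig N50 figures quoted in this abstract follow the usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The sketch below illustrates that definition only; the function name `n50` and the toy lengths are assumptions made for illustration and are not taken from HERA or from any assembly listed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of the
        # total assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths in Mb (illustrative values only).
fragmented = [2, 3, 4, 5, 6]
# Joining the three shortest contigs (2 + 3 + 4) into one raises the N50
# even though the total assembly size is unchanged.
joined = [9, 5, 6]

print(n50(fragmented), n50(joined))
```

Because the total length stays the same while contigs are merged, N50 rises as repeats are resolved and neighbouring contigs are joined, which is why it is commonly reported as a contiguity measure.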

Critical assessment of bioinformatics methods for the characterization of pathological repeat expansions with single-molecule sequencing data

Briefings in Bioinformatics ◽

10.1093/bib/bbz099 ◽

2019 ◽

Vol 21 (6) ◽

pp. 1971-1986 ◽

Cited By ~ 1

Author(s):

Matteo Chiara ◽

Federico Zambelli ◽

Ernesto Picardi ◽

David S Horner ◽

Graziano Pesole

Keyword(s):

Single Molecule ◽

Tandem Repeats ◽

Simulated Data ◽

Detailed Comparison ◽

Sequencing Data ◽

Single Molecule Sequencing ◽

Sequencing Technologies ◽

Repeat Expansions

Abstract Over the last 5 years, a number of studies have reported the successful application of single-molecule sequencing technologies to the determination of the size and sequence of pathological expanded microsatellite repeats. However, different custom bioinformatics pipelines were employed in each study, preventing meaningful comparisons and somewhat limiting the reproducibility of the results. In this review, we provide a brief summary of state-of-the-art methods for the characterization of expanded repeat alleles, along with a detailed comparison of bioinformatics tools for the determination of repeat length and sequence, using both real and simulated data. Our reanalysis of publicly available human genome sequencing data suggests a modest, but statistically significant, increase in the error rate of single-molecule sequencing technologies at genomic regions containing short tandem repeats. However, we observe that all the methods tested here, irrespective of the strategy used for the analysis of the data (based on either the alignment or the assembly of the reads), show high levels of sensitivity in both the detection of expanded tandem repeats and the estimation of the expansion size, suggesting that approaches based on single-molecule sequencing technologies are highly effective for the detection and quantification of tandem repeat expansions and contractions.
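
As a rough illustration of what "estimation of the expansion size" involves, the sketch below counts the longest uninterrupted run of a repeat motif in each read and takes the median across reads. It is a deliberately naive approach based on exact string matching, and it is not one of the tools compared in the review: the function names, the motif and the reads are all hypothetical, and a single sequencing error splits a run, which is one reason the real tools rely on alignment or assembly.

```python
import re
from statistics import median


def longest_repeat_run(read, motif):
    """Number of motif copies in the longest uninterrupted run within a read."""
    if not motif:
        raise ValueError("motif must be non-empty")
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    best = 0
    for match in pattern.finditer(read):
        best = max(best, len(match.group(0)) // len(motif))
    return best


def estimate_repeat_count(reads, motif):
    """Median of the per-read longest runs, as a crude allele-size estimate."""
    return median(longest_repeat_run(read, motif) for read in reads)


# Hypothetical reads spanning a CAG repeat with unique flanking sequence.
reads = [
    "ACGT" + "CAG" * 12 + "TTGA",
    "ACGT" + "CAG" * 12 + "TTGA",
    "ACGT" + "CAG" * 5 + "CAA" + "CAG" * 6 + "TTGA",  # one substitution error
]

print([longest_repeat_run(r, "CAG") for r in reads])
print(estimate_repeat_count(reads, "CAG"))
```

The third read shows the weakness of exact matching: one substituted base turns a 12-copy allele into runs of 5 and 6, so the median across reads is doing the work of absorbing that error.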

Evaluation of Single-Molecule Sequencing Technologies for Structural Variant Detection in Two Swedish Human Genomes

Genes ◽

10.3390/genes11121444 ◽

2020 ◽

Vol 11 (12) ◽

pp. 1444

Author(s):

Nazeefa Fatima ◽

Anna Petri ◽

Ulf Gyllensten ◽

Lars Feuk ◽

Adam Ameur

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Single Molecule ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Structural Variations ◽

Single Molecule Sequencing ◽

Human Samples

Long-read single molecule sequencing is increasingly used in human genomics research, as it allows accurate detection of large-scale DNA rearrangements such as structural variations (SVs) at high resolution. However, few studies have evaluated the performance of different single molecule sequencing platforms for SV detection in human samples. Here we performed Oxford Nanopore Technologies (ONT) whole-genome sequencing of two Swedish human samples (average 32× coverage) and compared the results to previously generated Pacific Biosciences (PacBio) data for the same individuals (average 66× coverage). Our analysis inferred an average of 17k and 23k SVs from the ONT and PacBio data, respectively, with a majority of them overlapping with an available multi-platform SV dataset. When comparing the SV calls in the two Swedish individuals, we find a higher concordance between ONT and PacBio SVs detected in the same individual than between SVs detected by the same technology in different individuals. Downsampling of PacBio reads, performed to obtain similar coverage levels for all datasets, resulted in 17k SVs per individual and improved overlap with the ONT SVs. Our results suggest that ONT and PacBio have a similar performance for SV detection in human whole genome sequencing data, and that both technologies are feasible for population-scale studies.
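
The concordance between two SV call sets can be pictured with a small sketch like the one below. It is not the procedure used in the study: the `SV` record, the 500 bp position tolerance and the toy calls are assumptions chosen for illustration, and real comparisons typically also take SV size or reciprocal overlap into account.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    pos: int
    svtype: str


def matches(a, b, tolerance=500):
    """Two calls match if chromosome and type agree and positions are close."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.pos - b.pos) <= tolerance
    )


def concordance(calls_a, calls_b, tolerance=500):
    """Fraction of calls in calls_a that have a matching call in calls_b."""
    if not calls_a:
        return 0.0
    shared = sum(
        1 for a in calls_a if any(matches(a, b, tolerance) for b in calls_b)
    )
    return shared / len(calls_a)


# Hypothetical call sets for one individual (not real data).
ont_calls = [
    SV("chr1", 1000, "DEL"),
    SV("chr1", 50000, "INS"),
    SV("chr2", 7000, "DEL"),
    SV("chr3", 100, "DUP"),
]
pacbio_calls = [
    SV("chr1", 1200, "DEL"),
    SV("chr1", 50010, "INS"),
    SV("chr2", 9000, "DEL"),
]

print(concordance(ont_calls, pacbio_calls))
print(concordance(pacbio_calls, ont_calls))
```

The measure is asymmetric: the fraction depends on which call set is used as the denominator, so both directions are usually reported.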

Read Mapping Algorithms for Single Molecule Sequencing Data

Lecture Notes in Computer Science - Algorithms in Bioinformatics ◽

10.1007/978-3-540-87361-7_4 ◽

2008 ◽

pp. 38-49 ◽

Cited By ~ 2

Author(s):

Vladimir Yanovsky ◽

Stephen M. Rumble ◽

Michael Brudno

Keyword(s):

Single Molecule ◽

Sequencing Data ◽

Read Mapping ◽

Mapping Algorithms ◽

Single Molecule Sequencing

Clair: Exploring the limit of using a deep neural network on pileup data for germline variant calling

10.1101/865782 ◽

2019 ◽

Cited By ~ 5

Author(s):

Ruibang Luo ◽

Chak-Lim Wong ◽

Yat-Sing Wong ◽

Chi-Ian Tang ◽

Chi-Man Liu ◽

...

Keyword(s):

Single Molecule ◽

Deep Neural Network ◽

Deep Neural Networks ◽

New Technologies ◽

Variant Calling ◽

Epigenetic Mark ◽

Sequencing Data ◽

Single Molecule Sequencing ◽

Sequencing Technologies ◽

Complex Genome

Abstract Single-molecule sequencing technologies have emerged in recent years and revolutionized structural variant calling, complex genome assembly, and epigenetic mark detection. However, the lack of a highly accurate small variant caller has kept the new technologies from being more widely used. In this study, we present Clair, the successor to Clairvoyante, a program for fast and accurate germline small variant calling using single molecule sequencing data. For ONT data, Clair achieves the best precision, recall and speed compared with several competing programs, including Clairvoyante, Longshot and Medaka. By studying the missed variants and benchmarking intentionally overfitted models, we found that Clair may be approaching the limit of possible accuracy for germline small variant calling using pileup data and deep neural networks. Clair requires only a conventional CPU for variant calling and is an open source project available at https://github.com/HKU-BAL/Clair.

Two Color Single Molecule Sequencing on GenoCare 1600 Platform to Facilitate Clinical Applications

10.1101/2020.09.28.20203455 ◽

2020 ◽

Author(s):

Fang Chen ◽

Bin Liu ◽

Meirong Chen ◽

Zefei Jiang ◽

Zhiliang Zhou ◽

...

Keyword(s):

Single Molecule ◽

Rapid Development ◽

Clinical Applications ◽

Library Preparation ◽

Sequencing Data ◽

Single Molecule Sequencing ◽

Sequencing Platform ◽

E Coli ◽

Simple Instrument ◽

Clear Information

With the rapid development of the precision medicine industry, DNA sequencing is becoming increasingly important as a research and diagnostic tool. For clinical applications, medical professionals require a platform that is fast, easy to use, and presents clear information relevant to definitive diagnosis. We have developed a single molecule desktop sequencing platform, GenoCare 1600. Fast library preparation (without amplification) and simple instrument operation make it friendlier for clinical use. Here we present sequencing data from an E. coli sample on GenoCare 1600, with consensus accuracy reaching 99.99%. We also demonstrate sequencing of microbial mixtures and COVID-19 samples from throat swabs. Our data show accurate quantitation of microbes, sensitive identification of the SARS-CoV-2 virus and detection of variants confirmed by Sanger sequencing.

Tigmint: Correcting Assembly Errors Using Linked Reads From Large Molecules

10.1101/304253 ◽

2018 ◽

Cited By ~ 2

Author(s):

Shaun D Jackman ◽

Lauren Coombe ◽

Justin Chu ◽

Rene L Warren ◽

Benjamin P Vandervalk ◽

...

Keyword(s):

Single Molecule ◽

Sequencing Data ◽

Long Distance ◽

Distance Information ◽

Single Molecule Sequencing ◽

Large Molecules ◽

Mate Pair ◽

A Genome ◽

Original Genome

Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity, and assembly errors are common. These misassemblies may be identified by comparing the sequencing data to the assembly, and by looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembly. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint for this purpose. To demonstrate the effectiveness of Tigmint, we corrected assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate its usefulness in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing. The source code of Tigmint is available for download from https://github.com/bcgsc/tigmint, and is distributed under the GNU GPL v3.0 license.
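
One way to picture "looking for discrepancies" between large-molecule data and an assembly is to ask how many molecules span each window of a contig: a correctly assembled region should be bridged by several molecules, whereas a misjoin tends to leave a window that no molecule crosses. The sketch below illustrates that idea only and should not be read as Tigmint's implementation; the function name, the window size, the threshold and the molecule coordinates are all assumptions.

```python
def low_support_windows(molecules, contig_length, window=1000, min_spanning=2):
    """Return start coordinates of windows spanned by too few molecules.

    molecules is a list of (start, end) extents on the contig. A molecule
    spans a window only if it covers the window completely.
    """
    flagged = []
    for start in range(0, contig_length - window + 1, window):
        end = start + window
        spanning = sum(
            1 for m_start, m_end in molecules if m_start <= start and m_end >= end
        )
        if spanning < min_spanning:
            flagged.append(start)
    return flagged


# Hypothetical molecule extents on a 10 kb contig: two groups of molecules
# that never bridge the region between 5,000 and 6,000.
molecules = [
    (0, 5000), (0, 4800), (200, 5000),
    (6000, 10000), (6000, 10000), (5900, 10000),
]

print(low_support_windows(molecules, contig_length=10000))
```

A flagged window is a candidate cut point rather than proof of a misassembly. Contig ends and regions of low sequencing depth also show few spanning molecules, so the threshold needs to be chosen with the data's coverage in mind.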
