Critical assessment of bioinformatics methods for the characterization of pathological repeat expansions with single-molecule sequencing data

Matteo Chiara; Federico Zambelli; Ernesto Picardi; David S Horner; Graziano Pesole

doi:10.1093/bib/bbz099

Briefings in Bioinformatics ◽

2019 ◽

Vol 21 (6) ◽

pp. 1971-1986 ◽

Cited By ~ 1

Author(s):

Matteo Chiara ◽

Federico Zambelli ◽

Ernesto Picardi ◽

David S Horner ◽

Graziano Pesole

Keyword(s):

Single Molecule ◽

Tandem Repeats ◽

Simulated Data ◽

Detailed Comparison ◽

Sequencing Data ◽

Single Molecule Sequencing ◽

Sequencing Technologies ◽

Repeat Expansions

Abstract Over the last 5 years, a number of studies have reported the successful application of single-molecule sequencing technologies to determining the size and sequence of pathologically expanded microsatellite repeats. However, each study employed a different custom bioinformatics pipeline, preventing meaningful comparisons and somewhat limiting the reproducibility of the results. In this review, we provide a brief summary of state-of-the-art methods for the characterization of expanded repeat alleles, along with a detailed comparison of bioinformatics tools for the determination of repeat length and sequence, using both real and simulated data. Our reanalysis of publicly available human genome sequencing data suggests a modest, but statistically significant, increase in the error rate of single-molecule sequencing technologies at genomic regions containing short tandem repeats. However, we observe that all the methods tested here, irrespective of the strategy used to analyze the data (either alignment or assembly of the reads), show high levels of sensitivity in both the detection of expanded tandem repeats and the estimation of the expansion size, suggesting that approaches based on single-molecule sequencing technologies are highly effective for the detection and quantification of tandem repeat expansions and contractions.

ExpansionHunter Denovo: A computational method for locating known and novel repeat expansions in short-read sequencing data

10.1101/863035 ◽

2019 ◽

Author(s):

Egor Dolzhenko ◽

Mark F. Bennett ◽

Phillip A. Richmond ◽

Brett Trost ◽

Sai Chen ◽

...

Keyword(s):

Tandem Repeats ◽

Simulated Data ◽

Computational Method ◽

Detection Methods ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Monogenic Disorders ◽

Genome Wide ◽

Repeat Expansions

Abstract Expansions of short tandem repeats are responsible for over 40 monogenic disorders, and undoubtedly many more pathogenic repeat expansions (REs) remain to be discovered. Existing methods for detecting REs in short-read sequencing data require predefined repeat catalogs. However, recent discoveries have emphasized the need for detection methods that do not require candidate repeats to be specified in advance. To address this need, we introduce ExpansionHunter Denovo, an efficient catalog-free method for genome-wide detection of REs. Analysis of real and simulated data shows that our method can identify large expansions of 41 out of 44 pathogenic repeats, including nine recently reported non-reference REs not discoverable via existing methods. ExpansionHunter Denovo is freely available at https://github.com/Illumina/ExpansionHunterDenovo

Tandem repeats structure of gel-forming mucin domains could be revealed by SMRT sequencing data

10.21203/rs.3.rs-112828/v1 ◽

2020 ◽

Author(s):

Tiange Lang

Keyword(s):

Single Molecule ◽

Tandem Repeats ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Coding Region ◽

Smrt Sequencing ◽

Single Nucleotide ◽

Sequencing Technologies ◽

Long Reads ◽

Great Complexity

Abstract Background. Gel-forming mucin domains of mucin genes show great complexity with tandem repeats (TRs), making their sequences difficult to study. Methods. With the advent of single molecule real-time (SMRT) sequencing technologies, we present the sequence structure of mucin domains via SMRT long reads for MUC2, MUC5AC, MUC5B and MUC6. Results. Our study shows that, across different individuals, single nucleotide polymorphisms (SNPs) can be found in the mucin domains of MUC2, MUC5AC, MUC5B and MUC6, while differing numbers of tandem repeats can be found in the mucin domains of MUC2 and MUC6. Conclusions. This information will provide new insights into obtaining the sequence of tandem repeat regions located in coding regions.

Deciphering Neurodegenerative Diseases Using Long-Read Sequencing

Neurology ◽

10.1212/wnl.0000000000012466 ◽

2021

Author(s):

Yun Su ◽

Liyuan Fan ◽

Changhe Shi ◽

Tai Wang ◽

Huimin Zheng ◽

...

Keyword(s):

Neurodegenerative Diseases ◽

Single Molecule ◽

Direct Detection ◽

Gc Content ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Repeat Expansions ◽

Genomic Regions

Neurodegenerative diseases exhibit chronic progressive lesions in the central and peripheral nervous systems with unclear causes. The search for pathogenic mutations in human neurodegenerative diseases has benefited from massively parallel short-read sequencers. However, some genomic regions, including repetitive elements and especially those with very high or low GC content, are far beyond the capability of conventional approaches. Recently, long-read single-molecule DNA sequencing technologies have emerged and enabled researchers to study genomes, transcriptomes, and metagenomes at unprecedented resolutions. The identification of novel mutations in unresolved neurodegenerative disorders, the characterization of causative repeat expansions, and the direct detection of epigenetic modifications on native DNA by virtue of long-read sequencers will further expand our understanding of neurodegenerative diseases. In this paper, we review and compare two prevailing long-read sequencing technologies, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), and discuss their applications in neurodegenerative diseases.

Reference-free reconstruction and quantification of transcriptomes from Nanopore long-read sequencing

10.1101/2020.02.08.939942 ◽

2020 ◽

Author(s):

Ivan de la Rubia ◽

Joel A. Indi ◽

Silvia Carbonell-Sala ◽

Julien Lagarde ◽

M Mar Albà ◽

...

Keyword(s):

Single Molecule ◽

Reference Genome ◽

Simulated Data ◽

Cost Effective ◽

Dna Assembly ◽

Sequencing Data ◽

Consensus Sequences ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read

Abstract Single-molecule long-read sequencing with Nanopore provides an unprecedented opportunity to measure transcriptomes from any sample [1–3]. However, current analysis methods rely on the comparison with a reference genome or transcriptome [2,4,5], or the use of multiple sequencing technologies [6,7], thereby precluding cost-effective studies in species with no genome assembly available, in individuals underrepresented in the existing reference, and for the discovery of disease-specific transcripts not directly identifiable from a reference genome. Methods for DNA assembly [8–10] cannot be directly transferred to transcriptomes since their consensus sequences lack the required interpretability for genes with multiple transcript isoforms. To address these challenges, we have developed RATTLE, the first tool to perform reference-free reconstruction and quantification of transcripts from Nanopore long reads. Using simulated data, isoform spike-ins, and sequencing data from tissues and cell lines, we demonstrate that RATTLE accurately determines transcript sequence and abundance, is comparable to reference-based methods, and shows saturation in the number of predicted transcripts with increasing number of input reads.

Clair: Exploring the limit of using a deep neural network on pileup data for germline variant calling

10.1101/865782 ◽

2019 ◽

Cited By ~ 5

Author(s):

Ruibang Luo ◽

Chak-Lim Wong ◽

Yat-Sing Wong ◽

Chi-Ian Tang ◽

Chi-Man Liu ◽

...

Keyword(s):

Single Molecule ◽

Deep Neural Network ◽

Deep Neural Networks ◽

New Technologies ◽

Variant Calling ◽

Epigenetic Mark ◽

Sequencing Data ◽

Single Molecule Sequencing ◽

Sequencing Technologies ◽

Complex Genome

Abstract Single-molecule sequencing technologies have emerged in recent years and revolutionized structural variant calling, complex genome assembly, and epigenetic mark detection. However, the lack of a highly accurate small variant caller has limited the new technologies from being more widely used. In this study, we present Clair, the successor to Clairvoyante, a program for fast and accurate germline small variant calling, using single molecule sequencing data. For ONT data, Clair achieves the best precision, recall and speed as compared to several competing programs, including Clairvoyante, Longshot and Medaka. Through studying the missed variants and benchmarking intentionally overfitted models, we found that Clair may be approaching the limit of possible accuracy for germline small variant calling using pileup data and deep neural networks. Clair requires only a conventional CPU for variant calling and is an open source project available at https://github.com/HKU-BAL/Clair.

Tandem repeats structure of gel-forming mucin domains could be revealed by SMRT sequencing data

10.21203/rs.3.rs-112828/v2 ◽

2021 ◽

Author(s):

Tiange Lang

Keyword(s):

Single Molecule ◽

Tandem Repeats ◽

The Body ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Coding Region ◽

Smrt Sequencing ◽

Epithelial Surface ◽

Sequencing Technologies ◽

Long Reads

Abstract Mucins are large glycoproteins that cover and protect the epithelial surfaces of the body. Gel-forming mucin domains of mucin genes are rich in proline, threonine, and serine residues that are heavily glycosylated. These domains show great complexity with tandem repeats (TRs), making their sequences difficult to study. With the advent of single molecule real-time (SMRT) sequencing technologies, we present the sequence structure of mucin domains via SMRT long reads for the gel-forming mucins MUC2, MUC5AC, MUC5B and MUC6. Our study shows that, across different individuals, single nucleotide polymorphisms (SNPs) can be found in the mucin domains of MUC2, MUC5AC, MUC5B and MUC6, while differing numbers of tandem repeats can be found in the mucin domains of MUC2 and MUC6. Furthermore, we obtain the sequences of the MUC2, MUC5AC, and MUC5B mucin domains in a Chinese individual with a possible maximum accuracy of 99.98%, 99.93%, and 99.76%, respectively. We report a new method to obtain the DNA sequence of gel-forming mucin domains. This method will provide new insights into obtaining the sequence of tandem repeat regions located in coding regions. The sequences obtained with this method offer further information for studying gel-forming mucin domains.

Sequoia: an interactive visual analytics platform for interpretation and feature extraction from nanopore sequencing datasets

BMC Genomics ◽

10.1186/s12864-021-07791-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Ratanond Koonchanok ◽

Swapna Vidhur Daulatabad ◽

Quoseena Mir ◽

Khairi Reda ◽

Sarath Chandra Janga

Keyword(s):

Single Molecule ◽

Visual Analytics ◽

Visual Analysis ◽

Direct Sequencing ◽

Visual Exploration ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Rna Sequences ◽

Sequencing Technologies ◽

Signal Features

Abstract Background Direct-sequencing technologies, such as Oxford Nanopore’s, are delivering long RNA reads with great efficacy and convenience. These technologies afford an ability to detect post-transcriptional modifications at a single-molecule resolution, promising new insights into the functional roles of RNA. However, realizing this potential requires new tools to analyze and explore this type of data. Result Here, we present Sequoia, a visual analytics tool that allows users to interactively explore nanopore sequences. Sequoia combines a Python-based backend with a multi-view visualization interface, enabling users to import raw nanopore sequencing data in a Fast5 format, cluster sequences based on electric-current similarities, and drill-down onto signals to identify properties of interest. We demonstrate the application of Sequoia by generating and analyzing ~ 500k reads from direct RNA sequencing data of human HeLa cell line. We focus on comparing signal features from m6A and m5C RNA modifications as the first step towards building automated classifiers. We show how, through iterative visual exploration and tuning of dimensionality reduction parameters, we can separate modified RNA sequences from their unmodified counterparts. We also document new, qualitative signal signatures that characterize these modifications from otherwise normal RNA bases, which we were able to discover from the visualization. Conclusions Sequoia’s interactive features complement existing computational approaches in nanopore-based RNA workflows. The insights gleaned through visual analysis should help users in developing rationales, hypotheses, and insights into the dynamic nature of RNA. Sequoia is available at https://github.com/dnonatar/Sequoia.

Genome assembly of the maize inbred line A188 provides a new reference genome for functional genomics

10.1101/2021.03.15.435372 ◽

2021 ◽

Author(s):

Fei Ge ◽

Jingtao Qu ◽

Peng Liu ◽

Lang Pan ◽

Chaoying Zou ◽

...

Keyword(s):

Single Molecule ◽

Inbred Line ◽

Genome Mapping ◽

Maize Inbred Line ◽

Sequencing Data ◽

Structural Variations ◽

Single Molecule Sequencing ◽

Maize Genetic ◽

Induction Ratio ◽

Phenotypic Variations

To date, little is known about the mechanism underlying the genotype dependence of embryonic callus (EC) induction, a gap that has severely inhibited the development of maize genetic engineering. Here, we report the genome sequence and annotation of A188, a maize inbred line with a high EC induction ratio, assembled from single-molecule sequencing and optical genome mapping data. We assembled a 2,210 megabase (Mb) genome with a scaffold N50 of 11.61 Mb, compared with 9.73 Mb for B73 and 10.2 Mb for Mo17. Comparative analysis revealed that ~30% of the predicted A188 genes had large structural variations relative to the B73, Mo17 and W22 genomes, which caused considerable protein divergence and might lead to phenotypic variations among the four inbred lines. Combining our new A188 genome with previously reported QTLs and RNA sequencing data, we reveal 8 genes with large structural variations and 4 differentially expressed genes that potentially play roles in EC induction.

A Transposon Story: From TE Content to TE Dynamic Invasion of Drosophila Genomes Using the Single-Molecule Sequencing Technology from Oxford Nanopore

Cells ◽

10.3390/cells9081776 ◽

2020 ◽

Vol 9 (8) ◽

pp. 1776

Author(s):

Mourdas Mohamed ◽

Nguyet Thi-Minh Dang ◽

Yuki Ogyama ◽

Nelly Burlet ◽

Bruno Mugat ◽

...

Keyword(s):

Single Molecule ◽

Wild Type ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

In The Wild ◽

Successive Generations ◽

Type Strains

Transposable elements (TEs) are the main components of genomes. However, due to their repetitive nature, they are very difficult to study using data obtained with short-read sequencing technologies. Here, we describe an efficient pipeline to accurately recover TE insertion (TEI) sites and sequences from long reads obtained by Oxford Nanopore Technology (ONT) sequencing. With this pipeline, we could precisely describe the landscapes of the most recent TEIs in wild-type strains of Drosophila melanogaster and Drosophila simulans. Their comparison suggests that this subset of TE sequences is more similar than previously thought in these two species. The chromosome assemblies obtained using this pipeline also allowed recovering piRNA cluster sequences, which was impossible using short-read sequencing. Finally, we used our pipeline to analyze ONT sequencing data from a D. melanogaster unstable line in which LTR transposition was derepressed for 73 successive generations. We could rely on single reads to identify new insertions with intact target site duplications. Moreover, the detailed analysis of TEIs in the wild-type strains and the unstable line did not support the trap model claiming that piRNA clusters are hotspots of TE insertions.

298. Characterization of an AAV Capsid Library Using PacBio CCS Single Molecule Sequencing

Molecular Therapy ◽

10.1016/s1525-0016(16)35311-4 ◽

2014 ◽

Vol 22 ◽

pp. S115

Keyword(s):

Single Molecule ◽

Single Molecule Sequencing
