HapIso: An Accurate Method for the Haplotype-Specific Isoforms Reconstruction from Long Single-Molecule Reads

2016 ◽  
Author(s):  
Serghei Mangul ◽  
Harry (Taegyun) Yang ◽  
Farhad Hormozdiari ◽  
Elizabeth Tseng ◽  
Alex Zelikovsky ◽  
...  

Abstract Sequencing of RNA makes it possible to study an individual’s transcriptome landscape and determine allelic expression ratios. Single-molecule protocols generate multi-kilobase reads, longer than most transcripts, allowing complete haplotype isoforms to be sequenced and the reads to be partitioned into the two parental haplotypes. While the reads of single-molecule protocols are long, their relatively high error rate limits the ability to accurately detect genetic variants and assemble them into haplotype-specific isoforms. In this paper, we present HapIso (Haplotype-specific Isoform Reconstruction), a method able to tolerate the relatively high error rate of the single-molecule platform and partition the isoform reads into the parental alleles. Phasing the reads according to their allele of origin allows our method to efficiently distinguish read errors from true biological mutations. HapIso uses a k-means clustering algorithm that groups the reads into two clusters, maximizing the similarity of reads within a cluster and minimizing the similarity of reads from different clusters; each cluster corresponds to a parental haplotype. We use family pedigree information to evaluate our approach. Experimental validation suggests that HapIso tolerates the relatively high error rate and accurately partitions the reads into the parental alleles of the isoform transcripts. Furthermore, our method is the first able to reconstruct haplotype-specific isoforms from long single-molecule reads. The open source Python implementation of HapIso is freely available for download at https://github.com/smangul1/HapIso/
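
The clustering step described in the abstract can be illustrated with a minimal sketch: each read is encoded as a vector over candidate variant positions and the reads are partitioned into two groups with k-means (k = 2). The 0/1 encoding, the helper function, and the use of scikit-learn are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): partition isoform reads into two
# putative parental haplotypes by k-means clustering over candidate SNV columns.
import numpy as np
from sklearn.cluster import KMeans

def partition_reads(read_matrix):
    """read_matrix: reads x candidate-variant positions, encoded 0/1
    (0 = reference base, 1 = alternative base). Returns a cluster
    label (0 or 1) for every read."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    return km.fit_predict(read_matrix)

# Toy example: six error-containing reads over four candidate positions.
reads = np.array([
    [0, 0, 0, 1],   # haplotype A read with one sequencing error
    [0, 0, 0, 0],
    [0, 1, 0, 0],   # haplotype A read with one sequencing error
    [1, 1, 1, 1],
    [1, 1, 0, 1],   # haplotype B read with one sequencing error
    [1, 1, 1, 1],
])
print(partition_reads(reads))  # e.g. [0 0 0 1 1 1]; cluster labels are arbitrary
```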

2016 ◽  
Author(s):  
Alexander Artyomenko ◽  
Nicholas C Wu ◽  
Serghei Mangul ◽  
Eleazar Eskin ◽  
Ren Sun ◽  
...  

Abstract As a result of a high rate of mutations and recombination events, an RNA virus exists as a heterogeneous “swarm” of mutant variants. The long read length offered by single-molecule sequencing technologies allows each mutant variant to be sequenced in a single pass. However, the high error rate limits the ability to reconstruct a heterogeneous viral population composed of rare, related mutant variants. In this paper, we present 2SNV, a method able to tolerate the high error rate of the single-molecule protocol and reconstruct mutant variants. 2SNV uses linkage between single nucleotide variations to efficiently distinguish them from read errors. To benchmark the sensitivity of 2SNV, we performed a single-molecule sequencing experiment on a sample containing a titrated level of known viral mutant variants. Our method accurately reconstructs a clone with a frequency of 0.2% and distinguishes clones that differ in only two nucleotides distantly located on the genome. 2SNV outperforms existing methods for full-length viral mutant reconstruction. The open source implementation of 2SNV is freely available for download at http://alan.cs.gsu.edu/NGS/?q=content/2snv
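
The linkage idea can be sketched roughly as follows: two minor alleles that co-occur on the same reads far more often than independent sequencing errors would predict are evidence for a real mutant variant. The counts, the assumed per-base error rate, and the use of a simple binomial test are illustrative assumptions and not the 2SNV algorithm itself.

```python
# Illustrative sketch of the linkage intuition (not the 2SNV implementation):
# among reads carrying a minor allele at one position, test whether the minor
# allele at a second position appears more often than errors alone would explain.
from scipy.stats import binomtest

def linkage_pvalue(reads_with_first, reads_with_both, per_base_error=0.05):
    """per_base_error: assumed rate of miscalling a specific alternative base."""
    return binomtest(reads_with_both, reads_with_first,
                     per_base_error, alternative="greater").pvalue

# Toy counts: 520 reads carry the first minor allele; 45 of them also carry the
# second, whereas independent errors would explain only about 26 of the 520.
print(linkage_pvalue(520, 45))  # small p-value -> the two variants look linked
```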


1993 ◽  
Vol 41 (6) ◽  
pp. 852-863 ◽  
Author(s):  
Q. Wang ◽  
G. Li ◽  
V.K. Bhargava ◽  
L.J. Mason

2021 ◽  
Author(s):  
Dan Levy ◽  
Zihua Wang ◽  
Andrea Moffitt ◽  
Michael H. Wigler

Replication of tandem repeats of simple sequence motifs, also known as microsatellites, is error prone, and variable lengths frequently arise during population expansions. Microsatellite length variations could therefore serve as markers for cancer. However, accurate, error-free quantitation of microsatellite lengths is difficult with current methods because of the high error rate during amplification and sequencing. We solve this problem by using partial mutagenesis to disrupt enough of the repeat structure that it can replicate faithfully, yet not so much that the flanking regions cannot be reliably identified. In this work we use bisulfite mutagenesis to convert C to U, which is later read as T. Compared to untreated templates, we achieve a three-orders-of-magnitude reduction in the error rate per round of replication. By requiring two independent first copies of an initial template, we reach error rates below one in a million. We discuss potential clinical applications of this method.
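
The arithmetic behind the "two independent first copies" requirement can be sketched as follows; the per-copy error rate used here is an assumed round number chosen only to match the orders of magnitude quoted in the abstract.

```python
# Back-of-the-envelope arithmetic (assumed numbers, for illustration only):
# the abstract reports a ~1,000-fold error reduction per round of replication
# and a final error rate below one in a million.
per_copy_error = 1e-3          # assumed error rate of a single first copy after mutagenesis
# A length call is accepted only if two independently made first copies agree,
# so a wrong call requires both copies to err in (roughly) the same way:
consensus_error = per_copy_error ** 2
print(f"consensus error rate ~ {consensus_error:.0e}")  # ~1e-06, below one in a million
```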


Author(s):  
Serghei Mangul ◽  
Harry Yang ◽  
Farhad Hormozdiari ◽  
Elizabeth Tseng ◽  
Alex Zelikovsky ◽  
...  

1984 ◽  
Vol 20 (23) ◽  
pp. 986 ◽  
Author(s):  
M. Moeneclaey ◽  
H. Bruneel

2019 ◽  
Vol 97 (Supplement_3) ◽  
pp. 39-40
Author(s):  
Pattarapol Sumreddee ◽  
Sajjad Toghiani ◽  
Andrew J Roberts ◽  
El H Hay ◽  
Samuel E Aggrey ◽  
...  

Abstract Pedigree information was traditionally used to assess inbreeding. The availability of high-density marker panels provides an alternative, particularly in the presence of incomplete and error-prone pedigrees. Assessment of autozygosity across chromosomal segments using runs of homozygosity (ROH) is emerging as a valuable tool to estimate inbreeding due to its general flexibility and its ability to quantify the chromosomal contribution to genome-wide inbreeding. Unfortunately, identifying ROH segments is sensitive to the parameters used during the search process, and these parameters are set heuristically, leading to significant variation in the results. The minimum length required to identify a ROH segment has major effects on the estimation of inbreeding, yet it is set arbitrarily. Understanding the rise, purging, and effects of deleterious mutations requires the ability to discriminate between ancient and recent inbreeding, but thresholds to discriminate between short and long ROH segments are largely unknown. To address these questions, an inbred Hereford cattle population of 785 animals genotyped for 30,220 SNPs was used. A search algorithm approximating mutation loads was used to determine the minimum length of ROH segments; it consisted of finding genome segments with significant differences in trait means between animals with high and low autozygosity in intervals at certain threshold values. The minimum length was around 1 Mb for weaning weight, yearling weight, and ADG, and 2.5 Mb for birth weight. Using a model-based clustering algorithm, a mixture of three Gaussian distributions was clearly separable, resulting in three classes of short (<6.16 Mb), medium (6.16–12.57 Mb), and long (>12.27 Mb) ROH segments, representing ancient, intermediate, and recent inbreeding. The contributions of ancient, intermediate, and recent inbreeding to genome-wide inbreeding were 37.4%, 40.1%, and 22.5%, respectively. Inbreeding depression analyses showed a greater damaging effect of recent inbreeding, likely due to purging of old, highly deleterious haplotypes.
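
The model-based clustering of ROH lengths can be illustrated with the sketch below: a three-component Gaussian mixture separates short, medium, and long segments, and each class's share of genome-wide inbreeding is tallied. The log-length transform, the scikit-learn mixture model, and the assumed genome length are illustrative choices, not the authors' pipeline.

```python
# Illustrative sketch (not the authors' pipeline): classify ROH segment lengths
# with a three-component Gaussian mixture and report each class's contribution
# to genome-wide inbreeding (F_ROH).
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_roh(lengths_mb, genome_length_mb=2500.0):
    """lengths_mb: ROH segment lengths in Mb for one animal.
    genome_length_mb: assumed autosomal length covered by the SNP panel."""
    lengths = np.asarray(lengths_mb, dtype=float)
    X = np.log(lengths).reshape(-1, 1)                 # lengths are right-skewed
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    labels = gmm.predict(X)
    order = np.argsort(gmm.means_.ravel())             # 0 = short, 1 = medium, 2 = long
    rank = {comp: r for r, comp in enumerate(order)}
    classes = np.array([rank[c] for c in labels])
    f_roh = lengths.sum() / genome_length_mb           # genome-wide inbreeding coefficient
    shares = [lengths[classes == k].sum() / lengths.sum() for k in range(3)]
    return f_roh, shares

# Toy example: a handful of ROH segments spanning short to long lengths.
print(classify_roh([0.8, 1.2, 2.0, 7.5, 9.0, 15.0, 22.0]))
```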


Genes ◽  
2018 ◽  
Vol 9 (9) ◽  
pp. 460 ◽  
Author(s):  
Yuta Suzuki ◽  
Yunhao Wang ◽  
Kin Au ◽  
Shinichi Morishita

We address the problem of observing personal diploid methylomes, i.e., CpG methylome pairs of homologous chromosomes that are distinguishable with respect to phased heterozygous variants (PHVs), which is challenging due to the scarcity of PHVs in personal genomes. Single molecule real-time (SMRT) sequencing is promising because it outputs long reads carrying CpG methylation information, but a serious concern is whether reliable PHVs are available in erroneous SMRT reads with an error rate of ∼15%. To overcome this issue, we propose a statistical model that reduces the error rate of phasing CpG sites to 1%, thereby calling CpG hypomethylation in each haplotype with >90% precision and sensitivity. Using our statistical model, we examined the GNAS complex locus, known for a combination of maternally, paternally, or biallelically expressed isoforms, and observed allele-specific methylation patterns almost perfectly reflecting their respective allele-specific expression status, demonstrating the merit of elucidating comprehensive personal diploid methylomes and transcriptomes.
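
The basic phasing step can be sketched as a vote over the PHVs covered by a read, with a margin so that sporadic base errors rarely flip the assignment. The helper, its parameters, and the margin rule below are assumptions for illustration, not the statistical model proposed in the paper.

```python
# Illustrative sketch of read phasing with phased heterozygous variants (PHVs);
# not the paper's statistical model.
def phase_read(read_alleles, phv_hap1, min_margin=2):
    """read_alleles: {position: base} observed on one SMRT read.
    phv_hap1: {position: base} giving the haplotype-1 allele at each PHV.
    Returns 1, 2, or None (unassigned) using a simple majority vote with a
    margin, so that sporadic ~15% base errors rarely flip the assignment."""
    votes_h1 = sum(1 for pos, base in read_alleles.items()
                   if pos in phv_hap1 and base == phv_hap1[pos])
    votes_h2 = sum(1 for pos, base in read_alleles.items()
                   if pos in phv_hap1 and base != phv_hap1[pos])
    if votes_h1 - votes_h2 >= min_margin:
        return 1
    if votes_h2 - votes_h1 >= min_margin:
        return 2
    return None   # too ambiguous; discard rather than risk mis-phasing

# Toy example: a read matching haplotype 1 at three of four PHVs (one error).
print(phase_read({100: "A", 250: "C", 900: "G", 1500: "T"},
                 {100: "A", 250: "C", 900: "G", 1500: "A"}))   # -> 1
```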


1981 ◽  
Vol 38 (9) ◽  
pp. 1168-1170 ◽  
Author(s):  
Harold E. Welch ◽  
Kenneth H. Mills

Fish can be permanently marked by scarring soft fin rays. Advantages over existing marking methods include rapidity of application, permanence, individual identification, low costs, and lack of adverse effects caused by the mark. Disadvantages include lack of recognition by untrained observers and a relatively high error rate when reading marks.

Key words: fish marking, fin rays

