HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

2016 ◽  
Vol 27 (5) ◽  
pp. 801-812 ◽  
Author(s):  
Peter Edge ◽  
Vineet Bafna ◽  
Vikas Bansal
2020 ◽  
Author(s):  
Mohammad Hossein Olyaee ◽  
Alireza Khanteymoori ◽  
Khosrow Khalifeh

Abstract: The decreasing cost of high-throughput DNA sequencing technologies provides a huge amount of data that enables researchers to determine haplotypes for diploid and polyploid organisms. Although various methods have been developed to reconstruct diploid haplotypes, achieving high accuracy remains challenging. In addition, most current methods cannot be applied to polyploids. In this paper, an iterative method is proposed that employs a hypergraph to reconstruct haplotypes and then uses a chaotic viewpoint to refine them. First, a haplotype set is randomly generated as an initial estimate, and its consistency with the input fragments is described by constructing a weighted hypergraph. Partitioning the hypergraph identifies the positions in the haplotype set that need to be corrected. This procedure is repeated until no further improvement can be achieved. Each element of the finalized haplotype set is then mapped to a line by chaos game representation, and a coordinate series is defined based on the positions of the mapped points. Positions with low quality can then be assessed by applying a local projection. Experimental results on both simulated and real datasets demonstrate that this method outperforms most other approaches and is promising for haplotype assembly.
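The chaos game step can be illustrated with a minimal sketch. It assumes a binary haplotype (alleles coded 0/1) and the standard one-dimensional chaos game iteration on the unit interval, in which each point moves halfway toward the endpoint named by the next allele. The abstract does not spell out the authors' exact mapping, so the function name `cgr_line` and the starting point 0.5 are illustrative assumptions, not the published method.

```python
from typing import List, Sequence


def cgr_line(haplotype: Sequence[int], start: float = 0.5) -> List[float]:
    """Map a binary haplotype to points on [0, 1] via a 1-D chaos game.

    Each allele (0 or 1) names an endpoint of the unit interval; the
    current point moves halfway toward that endpoint. (Hypothetical
    helper; the starting point 0.5 is an assumption.)
    """
    points = []
    x = start
    for allele in haplotype:
        if allele not in (0, 1):
            raise ValueError("alleles must be coded as 0 or 1")
        x = 0.5 * (x + allele)
        points.append(x)
    return points


coords = cgr_line([1, 0, 1, 1])
print(coords)  # [0.75, 0.375, 0.6875, 0.84375]
```

Under this mapping, the most recent alleles dominate the current coordinate, so two haplotypes that agree over their latest positions land close together on the line.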


2020 ◽  
Author(s):  
Ziqi Ke ◽  
Haris Vikalo

Abstract: Haplotype assembly and viral quasispecies reconstruction are challenging tasks concerned with the analysis of genomic mixtures using sequencing data. High-throughput sequencing technologies generate enormous amounts of short fragments (reads) which essentially oversample components of a mixture; the representation redundancy enables reconstruction of the components (haplotypes, viral strains). The reconstruction problem, known to be NP-hard, boils down to grouping together reads originating from the same component in a mixture. Existing methods struggle to solve this problem with the required level of accuracy and low runtimes; the problem is becoming increasingly challenging as the number and length of the components increase. This paper proposes a read clustering method based on a convolutional auto-encoder designed to first project sequenced fragments to a low-dimensional space and then estimate the probability of the read origin using learned embedded features. The components are reconstructed by finding consensus sequences that agglomerate reads from the same origin. Mini-batch stochastic gradient descent and dimension reduction of reads allow the proposed method to efficiently deal with massive numbers of long reads. Experiments on simulated, semi-experimental and experimental data demonstrate the ability of the proposed method to accurately reconstruct haplotypes and viral quasispecies, often demonstrating superior performance compared to state-of-the-art methods.
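The final consensus step can be sketched as follows, assuming the cluster labels have already been produced by some clustering stage (the auto-encoder itself is not shown). The read encoding (equal-length strings over the variant positions, with `-` for uncovered positions), the per-position majority vote, and the name `consensus_by_cluster` are all assumptions made for illustration; the abstract does not state how the authors compute consensus.

```python
from collections import Counter
from typing import Dict, List


def consensus_by_cluster(reads: List[str], labels: List[int], gap: str = "-") -> Dict[int, str]:
    """Build one consensus string per cluster by per-position majority vote.

    `reads` are equal-length strings over the variant positions, with `gap`
    marking positions a read does not cover. Positions with no coverage in
    a cluster stay as `gap`.
    """
    if len(reads) != len(labels):
        raise ValueError("reads and labels must have the same length")
    clusters: Dict[int, List[str]] = {}
    for read, label in zip(reads, labels):
        clusters.setdefault(label, []).append(read)

    consensus = {}
    for label, members in clusters.items():
        columns = []
        for column in zip(*members):
            counts = Counter(base for base in column if base != gap)
            # Take the most frequent base; fall back to the gap symbol
            # when no read in the cluster covers this position.
            columns.append(counts.most_common(1)[0][0] if counts else gap)
        consensus[label] = "".join(columns)
    return consensus


reads = ["AC-T", "ACG-", "-CGT", "TG-A", "TGC-", "-GCA"]
labels = [0, 0, 0, 1, 1, 1]
print(consensus_by_cluster(reads, labels))  # {0: 'ACGT', 1: 'TGCA'}
```

Each read covers only part of the region, but the redundancy across reads in a cluster is enough to recover a full-length component, which is the oversampling idea the abstract describes.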


2020 ◽  
Author(s):  
Sina Majidian ◽  
Mohammad Hossein Kahaei ◽  
Dick de Ridder

Background: Haplotype information is essential for many genetic and genomic analyses, including genotype-phenotype associations in humans, animals and plants. Haplotype assembly is a method for reconstructing haplotypes from DNA sequencing reads. With the advent of new sequencing technologies, new algorithms are needed to ensure long and accurate haplotypes. While a few linked-read haplotype assembly algorithms are available for diploid genomes, there are no algorithms yet for polyploids. Results: The first haplotyping algorithm designed for 10X linked reads generated from a polyploid genome is presented, built on a typical short-read haplotyping method, SDhaP. Using the input aligned reads and called variants, the haplotype-relevant information is extracted. Next, reads with the same barcodes are combined to produce molecule-specific fragments. Then, these fragments are clustered into strongly connected components, which are used as input to a haplotype assembly core in order to estimate accurate and long haplotypes. Conclusions: Hap10 is a novel algorithm for haplotype assembly of polyploid genomes using linked reads. The performance of the algorithm is evaluated in a number of simulation scenarios, and its applicability is demonstrated on a real dataset of sweet potato.
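The barcode-merging step can be sketched in a few lines. This is a simplified illustration, not Hap10's implementation: it assumes each read has already been reduced to a mapping from variant index to observed allele, treats one barcode as one molecule (in real linked-read data a barcode can tag more than one molecule), and resolves conflicts by majority vote with ties dropped. The name `merge_by_barcode` and the data layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Read = Tuple[str, Dict[int, int]]  # (barcode, {variant index: allele})


def merge_by_barcode(reads: Iterable[Read]) -> Dict[str, Dict[int, int]]:
    """Combine reads sharing a barcode into one fragment per barcode.

    The allele at each variant index is chosen by majority among the reads
    with that barcode; positions with a tied vote are dropped as ambiguous.
    """
    # votes[barcode][position][allele] -> number of supporting reads
    votes = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for barcode, alleles in reads:
        for pos, allele in alleles.items():
            votes[barcode][pos][allele] += 1

    fragments: Dict[str, Dict[int, int]] = {}
    for barcode, per_pos in votes.items():
        fragment = {}
        for pos in sorted(per_pos):
            ranked = sorted(per_pos[pos].items(), key=lambda kv: kv[1], reverse=True)
            # Keep the position only if there is a strict winner.
            if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
                fragment[pos] = ranked[0][0]
        fragments[barcode] = fragment
    return fragments


reads = [
    ("BC1", {0: 0, 1: 1}),
    ("BC1", {1: 1, 2: 0}),
    ("BC1", {2: 1}),          # conflicts with the previous read at index 2
    ("BC2", {0: 1, 3: 2}),
]
print(merge_by_barcode(reads))  # {'BC1': {0: 0, 1: 1}, 'BC2': {0: 1, 3: 2}}
```

The merged fragments span more variants than any single short read, which is what lets a downstream assembler link distant positions.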


2020 ◽  
Vol 20 ◽  
Author(s):  
Md. Sahab Uddin ◽  
Sharifa Hasana ◽  
Md. Farhad Hossain ◽  
Md. Siddiqul Islam ◽  
Tapan Behl ◽  
...  

Alzheimer’s disease (AD) is the most common form of dementia in the elderly, and this complex disorder is associated with environmental as well as genetic components. Early-onset AD (EOAD) and late-onset AD (LOAD, more common) are the major identified types of AD. The genetics of EOAD is extensively understood, with variants in three genes, APP, PSEN1, and PSEN2, leading to disease. On the other hand, some common alleles, including APOE, have been associated with LOAD, but the genetics of LOAD is not clear to date. It has been reported that about 5% to 10% of EOAD patients can be explained by mutations in the three familiar genes of EOAD. The APOE ε4 allele augmented the severity of EOAD risk in carriers, and the APOE ε4 allele was considered a hallmark of EOAD. The large number of EOAD patients who are not genetically explained indicates that the disease-triggering genes cannot yet be identified. However, several genes, including SORL1, TYROBP, and NOTCH3, have been identified in EOAD families using next-generation sequencing. A number of TYROBP variants were identified through exome sequencing in EOAD patients, and these TYROBP variants may increase the pathogenesis of EOAD. The existence of the ε4 allele is responsible for increasing the severity of EOAD. However, several ε4 allele carriers live into their 90s, which suggests the presence of other LOAD genetic as well as environmental risk factors that are not yet identified. It is urgent to find the missing genetics of EOAD and LOAD etiology to discover new potential genetic facets that will help to understand the pathological mechanism of AD. These investigations should contribute to developing a new therapeutic candidate for alleviating, reversing, and preventing AD. This article, based on current knowledge, presents an overview of the susceptibility genes of EOAD and LOAD. Next, we present the probable molecular mechanisms that might elucidate the genetic etiology of AD and highlight the role of massively parallel sequencing technologies for novel gene discoveries.


Author(s):  
Rini Pauly ◽  
Catherine A. Ziats ◽  
Ludovico Abenavoli ◽  
Charles E. Schwartz ◽  
Luigi Boccuto

Background: Autism spectrum disorder (ASD) is a complex neurodevelopmental condition that poses several challenges in terms of clinical diagnosis and investigation of molecular etiology. The lack of knowledge on the pathogenic mechanisms underlying ASD has hampered the clinical trials that so far have tried to target ASD behavioral symptoms. In order to improve our understanding of the molecular abnormalities associated with ASD, a deeper and more extensive genetic profiling of targeted individuals with ASD was needed. Methods: The recent availability of new and more powerful sequencing technologies (third-generation sequencing) has made it possible to develop novel strategies for characterizing comprehensive genetic profiles of individuals with ASD. In particular, this review describes integrated approaches based on the combination of various omics technologies that will lead to a better stratification of targeted cohorts for the design of clinical trials in ASD. Results: In order to analyze the big data collected by assays such as whole genome, epigenome, transcriptome, and proteome, it is critical to develop an efficient computational infrastructure. Machine learning models are instrumental in identifying non-linear relationships between the omics technologies and therefore in establishing a functional informative network among the different data sources. Conclusion: The potential advantage provided by these new integrated omics-based strategies is to better characterize the genetic background of ASD cohorts, identify novel molecular targets for drug development, and ultimately offer a more personalized approach in the design of clinical trials for ASD.


2019 ◽  
Vol 14 (2) ◽  
pp. 157-163
Author(s):  
Majid Hajibaba ◽  
Mohsen Sharifi ◽  
Saeid Gorgin

Background: One of the pivotal challenges in today's genomic research is the fast processing of voluminous data such as that generated by high-throughput Next-Generation Sequencing technologies. On the other hand, BLAST (Basic Local Alignment Search Tool), a long-established and renowned tool in Bioinformatics, has been shown to be incredibly slow in this regard. Objective: To improve the performance of BLAST on voluminous data, we have applied a novel memory-aware technique to BLAST for faster parallel processing. Method: We have used a master-worker model alongside a memory-aware technique in which the master partitions the whole data into equal chunks, one chunk for each worker, and each worker then further splits and formats its allocated chunk according to the size of its memory. Each worker searches each of its splits, one by one, against a list of queries. Results: We have chosen a list of queries with different lengths to run insensitive searches in a huge database called UniProtKB/TrEMBL. Our experiments show a 20 percent improvement in performance when workers used our proposed memory-aware technique compared to when they were not memory aware. Comparatively, experiments show an even higher performance improvement, approximately 50 percent, when we applied our memory-aware technique to mpiBLAST. Conclusion: We have shown that memory-awareness in formatting a bulky database, when running BLAST, can improve performance significantly, while preventing unexpected crashes in low-memory environments. Even though distributed computing attempts to mitigate search time by partitioning and distributing database portions, our memory-aware technique alleviates the negative effects of page faults on performance.
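The two-level partitioning described in the Method can be sketched as index arithmetic. This is a minimal illustration under simplifying assumptions: records are treated as fixed-size (real sequence databases have variable-length records), and the function names `partition_equal` and `split_for_memory` are hypothetical. No BLAST or mpiBLAST calls are shown.

```python
from typing import List


def partition_equal(n_records: int, n_workers: int) -> List[range]:
    """Master step: split record indices into near-equal contiguous chunks."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    base, extra = divmod(n_records, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        # The first `extra` workers take one additional record each.
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def split_for_memory(chunk: range, record_bytes: int, memory_bytes: int) -> List[range]:
    """Worker step: split a chunk into pieces that each fit in memory."""
    per_piece = memory_bytes // record_bytes
    if per_piece <= 0:
        raise ValueError("memory too small for a single record")
    return [
        range(i, min(i + per_piece, chunk.stop))
        for i in range(chunk.start, chunk.stop, per_piece)
    ]


chunks = partition_equal(10, 3)
print(chunks)  # [range(0, 4), range(4, 7), range(7, 10)]
print(split_for_memory(chunks[0], record_bytes=100, memory_bytes=250))
# [range(0, 2), range(2, 4)]
```

Because each piece is sized to the worker's own memory, a worker with less RAM simply produces more, smaller pieces, rather than formatting one piece that spills into swap.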


2020 ◽  
Vol 110 (1) ◽  
pp. 106-120 ◽  
Author(s):  
Avijit Roy ◽  
Andrew L. Stone ◽  
Gabriel Otero-Colina ◽  
Gang Wei ◽  
Ronald H. Brlansky ◽  
...  

The genus Dichorhavirus contains viruses with bipartite, negative-sense, single-stranded RNA genomes that are transmitted by flat mites to hosts that include orchids, coffee, the genus Clerodendrum, and citrus. A dichorhavirus infecting citrus in Mexico is classified as a citrus strain of orchid fleck virus (OFV-Cit). We previously used RNA sequencing technologies on OFV-Cit samples from Mexico to develop an OFV-Cit–specific reverse transcription PCR (RT-PCR) assay. During assay validation, OFV-Cit–specific RT-PCR failed to produce an amplicon from some samples with clear symptoms of OFV-Cit. Characterization of this virus revealed that dichorhavirus-like particles were found in the nucleus. High-throughput sequencing of small RNAs from these citrus plants revealed a novel citrus strain of OFV, OFV-Cit2. Sequence comparisons with known orchid and citrus strains of OFV showed variation in the protein products encoded by genome segment 1 (RNA1). Strains of OFV clustered together based on host of origin, whether orchid or citrus, and were clearly separated from other dichorhaviruses described from infected citrus in Brazil. The variation in RNA1 between the original (now OFV-Cit1) and the new (OFV-Cit2) strain was not observed with genome segment 2 (RNA2), but instead, a common RNA2 molecule was shared among strains of OFV-Cit1 and -Cit2, a situation strikingly similar to OFV infecting orchids. We also collected mites at the affected groves, identified them as Brevipalpus californicus sensu stricto, and confirmed that they were infected by OFV-Cit1 or with both OFV-Cit1 and -Cit2. OFV-Cit1 and -Cit2 have coexisted at the same site in Toliman, Queretaro, Mexico since 2012. OFV strain-specific diagnostic tests were developed.


2009 ◽  
Vol 11 (1) ◽  
pp. 31-46 ◽  
Author(s):  
Michael L. Metzker
