Hi-C Resolution Enhancement with Genome Sequence Data

2021
Author(s):  
Dmitrii Kriukov ◽  
Nikita Koritskiy ◽  
Igor Kozlovskii ◽  
Mark Zaretckii ◽  
Mariia Bazarevich ◽  
...  

The increasing interest in chromatin conformation inside the nucleus and the availability of genome-wide experimental data make it possible to develop computational methods that can increase the quality of the data and thus overcome the limitations imposed by high experimental costs. Here we develop a deep-learning approach for increasing Hi-C data resolution by incorporating additional information about the genome sequence. In this approach, we utilize two different deep-learning algorithms: an image-to-image model, which enhances Hi-C resolution by itself, and a sequence-to-image model, which uses additional information about the underlying genome sequence for further resolution improvement. Both models are combined with a simple head model that provides a more accurate enhancement of the initial low-resolution Hi-C data. The code is freely available in a GitHub repository: https://github.com/koritsky/DL2021_HI-C
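A minimal sketch of the two-branch design described above: an image-to-image branch enhances the low-resolution Hi-C map, a sequence-to-image branch contributes a prediction from the underlying DNA sequence, and a small head model fuses both. All module names, shapes, and layer sizes here are illustrative assumptions, not the authors' exact code.

```python
# Hypothetical fusion head combining two single-channel branch outputs.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuses the image-to-image and sequence-to-image predictions."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, img_branch_out, seq_branch_out):
        # Stack the two predictions channel-wise and fuse them.
        x = torch.cat([img_branch_out, seq_branch_out], dim=1)
        return self.conv(x)

# Toy inputs: one 256x256 Hi-C window predicted by each branch.
img_pred = torch.randn(1, 1, 256, 256)   # image-to-image output
seq_pred = torch.randn(1, 1, 256, 256)   # sequence-to-image output
enhanced = FusionHead()(img_pred, seq_pred)
print(enhanced.shape)  # torch.Size([1, 1, 256, 256])
```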

2019
Vol 6 (1)
Author(s):  
Alejandra Vergara-Lope ◽  
M. Reza Jabalameli ◽  
Clare Horscroft ◽  
Sarah Ennis ◽  
Andrew Collins ◽  
...  

Abstract
Quantification of linkage disequilibrium (LD) patterns in the human genome is essential for genome-wide association studies, selection signature mapping and studies of recombination. Whole genome sequence (WGS) data provide an optimal source for this quantification, as they are free from biases introduced by the design of array genotyping platforms. The Malécot-Morton model of LD allows the creation of a cumulative map for each chromosome, analogous to an LD form of a linkage map. Here we report LD maps generated from WGS data for a large population of European ancestry, as well as populations of Baganda, Ethiopian and Zulu ancestry. We achieve high average genetic marker densities of 2.3–4.6/kb. These maps show good agreement with prior, low-resolution maps and are consistent between populations. Files are provided in BED format to allow researchers to readily utilise this resource.
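A sketch of how the released BED files might be used: load a per-chromosome LD map and interpolate the cumulative LD-unit (LDU) position at an arbitrary base-pair coordinate. The file name and the assumed column layout (chrom, start, end, cumulative LDU) are illustrative, not the published schema.

```python
# Interpolate cumulative LD units from an assumed BED-format LD map.
import numpy as np
import pandas as pd

ldmap = pd.read_csv(
    "chr1_ldmap.bed", sep="\t",
    names=["chrom", "start", "end", "ldu"],
)

def bp_to_ldu(position_bp: int) -> float:
    """Linearly interpolate the cumulative LDU at a bp position."""
    return float(np.interp(position_bp, ldmap["start"], ldmap["ldu"]))

# Distance in LDU between two markers, analogous to cM on a linkage map.
delta = bp_to_ldu(2_500_000) - bp_to_ldu(1_000_000)
print(f"LDU distance: {delta:.2f}")
```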


Geophysics
2021
pp. 1-64
Author(s):  
Xintao Chai ◽  
Genyang Tang ◽  
Kai Lin ◽  
Zhe Yan ◽  
Hanming Gu ◽  
...  

Sparse-spike deconvolution (SSD) is an important method for seismic resolution enhancement. Given the wavelet, many trace-by-trace SSD methods have been proposed for extracting an estimate of the reflection-coefficient series from stacked traces. The main drawbacks of the trace-by-trace methods are that they neither use information from adjacent seismograms nor take full advantage of the inherent spatial continuity of the seismic data. Although several multitrace methods have consequently been proposed, these methods generally rely on different assumptions and theories and require different parameter settings for different data applications. The traditional methods therefore demand intensive human-computer interaction, which does not fit the current trend toward intelligent seismic exploration. We have therefore developed a deep learning (DL)-based multitrace SSD approach. The approach transforms the input 2D/3D seismic data into the corresponding SSD result by training end-to-end encoder-decoder-style 2D/3D convolutional neural networks (CNNs). Our key motivations are that DL is effective for mining complicated relations from data, that 2D/3D CNNs naturally take multitrace information into account, that the additional information contributes to an SSD result with better spatial continuity, and that parameter tuning is not necessary for CNN predictions. We report the significance of the learning rate for the convergence of the training process. Benchmarking tests on field 2D/3D seismic data confirm that the approach yields accurate high-resolution results that are mostly in agreement with the well logs; the DL-based multitrace SSD results generated by the 2D/3D CNNs are better than the trace-by-trace SSD results; and the 3D CNN outperforms the 2D CNN for 3D data applications.
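A minimal sketch of an encoder-decoder-style 2D CNN of the kind described above, mapping a stacked seismic section to a reflection-coefficient section. The 2D convolutions see neighboring traces, which is how multitrace information enters naturally. Layer counts and channel sizes are illustrative assumptions; the authors' exact architecture and training setup are not reproduced here.

```python
# Illustrative 2D encoder-decoder network for multitrace SSD.
import torch
import torch.nn as nn

class SSDNet2D(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, seismic):
        return self.decoder(self.encoder(seismic))

# One 128-sample x 64-trace stacked section; the kernels span
# adjacent traces, so spatial continuity informs each output spike.
section = torch.randn(1, 1, 128, 64)
reflectivity = SSDNet2D()(section)
print(reflectivity.shape)  # torch.Size([1, 1, 128, 64])
```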


2020
Author(s):  
Indhu-Shree Rajan-Babu ◽  
Junran Peng ◽  
Readman Chiu ◽  
Arezoo Mohajeri ◽  
Egor Dolzhenko ◽  
...  

Abstract
Short tandem repeat (STR) expansions cause several neurological and neuromuscular disorders. Screening for STR expansions in genome-wide (exome and genome) sequencing data can enable diagnosis, optimal clinical management/treatment, and accurate genetic counselling of patients with repeat expansion disorders. We assessed the performance of lobSTR, HipSTR, RepeatSeq, ExpansionHunter, TREDPARSE, GangSTR, STRetch, and exSTRa, bioinformatics tools that have been developed to detect and/or genotype STR expansions, on experimental and simulated genome sequence data with known STR expansions, aligned using two different aligners, Isaac and BWA. We then adjusted the parameter settings to optimize the sensitivity and specificity of the STR tools and fed the optimized results into a machine-learning decision tree classifier to determine the best combination of tools for detecting full-mutation expansions with high diagnostic sensitivity and specificity. The decision tree model supported using ExpansionHunter's full-mutation calls together with those of either STRetch or exSTRa to detect full mutations with precision, recall, and F1-score of 90%, 100%, and 95%, respectively.

We used this pipeline to screen the BWA-aligned exome or genome sequence data of 306 families of children with suspected genetic disorders for pathogenic expansions at known disease STR loci. We identified 27 samples: 17 with an apparent full-mutation expansion of the AR, ATXN1, ATXN2, ATXN8, DMPK, FXN, HTT, or TBP locus; nine with an intermediate or premutation allele in the FMR1 locus; and one with a borderline allele in the ATXN2 locus. We report the concordance between our bioinformatics findings and the clinical PCR results in a subset of these samples. Implementation of our bioinformatics workflow can improve the detection of disease STR expansions in exome and genome sequence diagnostics and enhance clinical outcomes for patients with repeat expansion disorders.
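A hedged sketch of the tool-combination step: a decision-tree classifier trained over per-locus full-mutation calls from several STR tools. The feature layout and the toy labels are illustrative assumptions, not the authors' data.

```python
# Toy decision tree combining binary full-mutation calls from
# ExpansionHunter, STRetch, and exSTRa (one row per sample-locus pair).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support

X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 0, 0],
              [1, 1, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 1, 0])  # truth: known full-mutation expansions

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
pred = clf.predict(X)
p, r, f1, _ = precision_recall_fscore_support(y, pred, average="binary")
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

A shallow tree like this is easy to read off as a rule ("ExpansionHunter call AND (STRetch OR exSTRa call)"), which matches the kind of combination the abstract reports.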


2010
Vol 13 (05)
pp. 805-811
Author(s):  
Adrian Zett ◽  
Mike Webster ◽  
Chris Davies ◽  
Pinggang Zhang ◽  
Parijat Mukerji

Summary
A key factor in managing mature fields is establishing adequate surveillance in each phase of their life. The complexity increases when the field is developed with horizontal wells. Differences in data quality and resolution should be taken into consideration when planning such surveillance. Current uncertainties in the Harding field relate to unreliable well-conformance data from conventional production logs (PLs) and to assumptions in the reservoir description that are below seismic resolution. We describe the learning from a horizontal well in Harding, where appropriate surveillance enhanced reservoir understanding and the quality of decision making. Based on the initial understanding from the reservoir model, an insert-string well work option was proposed to reduce water cut. Historically in this field, conventional PLs provided unreliable well-conformance data in horizontal multiphase flow. To improve characterization at the well scale, an array PL was deployed for the first time on this field. The flowing results revealed that the insert-string solution was inappropriate and would result in lost oil production. The shut-in data identified crossflow between two zones separated by a shale section. In the initial model, this shale was mapped only at a local level. Post-surveillance, it was remapped on seismic as an extensive baffle affecting an area with more mobile oil to recover. There is a potential upside, with a new infill target identified toward the toe of this well. Most of the initial decisions about the insert string were based on seismic and modeling work. The new array PL data brought additional information into the model, increasing confidence in the results. Data resolution at the well level matters, and this highlights the need to take more PL measurements to calibrate the seismic response and improve the reservoir model.


2017
Author(s):  
Wouter Van Rheenen ◽  
Sara L. Pulit ◽  
Annelot M. Dekker ◽  
Ahmad Al Khleifat ◽  
...  

Abstract
The most recent genome-wide association study in amyotrophic lateral sclerosis (ALS) demonstrates a disproportionate contribution from low-frequency variants to the genetic susceptibility of the disease. We have therefore begun Project MinE, an international collaboration that seeks to analyse whole-genome sequence data from at least 15,000 ALS patients and 7,500 controls. Here, we report on the design of Project MinE and pilot analyses of newly whole-genome sequenced 1,264 ALS patients and 611 controls drawn from the Netherlands. As has become characteristic of sequencing studies, we find an abundance of rare genetic variation (minor allele frequency < 0.1%), the vast majority of which is absent from public data sets. Principal component analysis reveals local geographical clustering of these variants within the Netherlands. We use the whole-genome sequence data to explore the implications of poor geographical matching of cases and controls in a sequence-based disease study and to investigate how ancestry-matched, externally sequenced controls can induce false-positive associations. We have also publicly released genome-wide minor allele counts in cases and controls, as well as results from genic burden tests.
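An illustrative sketch of the principal component analysis step: PCA on a samples-by-variants genotype matrix of minor-allele counts, of the kind used to check geographical clustering and case-control matching. The random matrix stands in for real genotype data.

```python
# PCA on a toy genotype matrix (0/1/2 minor-allele counts per variant).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(200, 5000)).astype(float)

# Standardize each variant, as is conventional before genotype PCA.
genotypes -= genotypes.mean(axis=0)
std = genotypes.std(axis=0)
genotypes /= np.where(std > 0, std, 1.0)

pcs = PCA(n_components=2).fit_transform(genotypes)
print(pcs.shape)  # (200, 2): coordinates used to inspect clustering
```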


Author(s):  
Raveendra Gudodagi ◽  
Rayapur Venkata Siva Reddy ◽  
Mohammed Riyaz Ahmed

Owing to the substantial volume of human genome sequence data files (ranging from 30 to 200 GB), genomic data compression has received considerable traction, and storage costs are one of the major problems faced by genomics laboratories. This calls for modern data compression technology that reduces storage requirements without compromising the reliability of the operation. There have been few attempts to solve this problem independently of both hardware and software. A systematic analysis of associations between genes provides techniques for the recognition of operative connections among genes and their respective products, as well as insights into essential biological events that are most important for understanding health and disease phenotypes. This research proposes a reliable and efficient deep learning system for learning embedded projections that combine gene interactions and gene expression, comparing the predictions from deep embeddings to strong baselines. In this paper we perform data processing operations, predict gene function, reconstruct gene ontology, and predict gene interactions. The three major steps of genomic data compression are extraction, storage, and retrieval of data. Hence, we propose a deep learning approach based on computational optimization techniques that is efficient in all three stages of data compression.
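A minimal sketch, under assumed inputs, of learning embedded projections that combine gene expression with gene-interaction features for a prediction task such as gene function. The two-branch architecture, feature dimensions, and output are illustrative assumptions, not the authors' exact model.

```python
# Toy two-branch embedder: project expression and interaction features
# into a shared space, then predict a binary gene-function label.
import torch
import torch.nn as nn

class GeneEmbedder(nn.Module):
    def __init__(self, n_expr=50, n_interact=100, dim=32):
        super().__init__()
        self.expr_proj = nn.Linear(n_expr, dim)       # expression branch
        self.inter_proj = nn.Linear(n_interact, dim)  # interaction branch
        self.classifier = nn.Linear(2 * dim, 1)       # function prediction

    def forward(self, expr, interact):
        z = torch.cat([torch.relu(self.expr_proj(expr)),
                       torch.relu(self.inter_proj(interact))], dim=-1)
        return torch.sigmoid(self.classifier(z))

model = GeneEmbedder()
expr = torch.randn(8, 50)       # expression profiles for 8 genes
interact = torch.randn(8, 100)  # interaction-derived features
scores = model(expr, interact)  # predicted probability per gene
print(scores.shape)  # torch.Size([8, 1])
```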

