scholarly journals HiTea: a computational pipeline to identify non-reference transposable element insertions in Hi-C data

2020 ◽  
Author(s):  
Dhawal Jain ◽  
Chong Chu ◽  
Burak Han Alver ◽  
Soohyun Lee ◽  
Eunjung Alice Lee ◽  
...  

AbstractHi-C is a common technique for assessing three-dimensional chromatin conformation. Recent studies have shown that long-range interaction information in Hi-C data can be used to generate chromosome-length genome assemblies and identify large-scale structural variations. Here, we demonstrate the use of Hi-C data in detecting mobile transposable element (TE) insertions genome-wide. Our pipeline HiTea (Hi-C based Transposable element analyzer) capitalizes on clipped Hi-C reads and is aided by a high proportion of discordant read pairs in Hi-C data to detect insertions of three major families of active human TEs. Despite the uneven genome coverage in Hi-C data, HiTea is competitive with the existing callers based on whole genome sequencing (WGS) data and can supplement the WGS-based characterization of the TE insertion landscape. We employ the pipeline to identify TE insertions from human cell-line Hi-C samples. HiTea is available at https://github.com/parklab/HiTea and as a Docker image.

Author(s):  
Dhawal Jain ◽  
Chong Chu ◽  
Burak Han Alver ◽  
Soohyun Lee ◽  
Eunjung Alice Lee ◽  
...  

ABSTRACT   Hi-C is a common technique for assessing 3D chromatin conformation. Recent studies have shown that long-range interaction information in Hi-C data can be used to generate chromosome-length genome assemblies and identify large-scale structural variations. Here, we demonstrate the use of Hi-C data in detecting mobile transposable element (TE) insertions genome-wide. Our pipeline Hi-C-based TE analyzer (HiTea) capitalizes on clipped Hi-C reads and is aided by a high proportion of discordant read pairs in Hi-C data to detect insertions of three major families of active human TEs. Despite the uneven genome coverage in Hi-C data, HiTea is competitive with the existing callers based on whole-genome sequencing (WGS) data and can supplement the WGS-based characterization of the TE-insertion landscape. We employ the pipeline to identify TE-insertions from human cell-line Hi-C samples. Availability and implementation HiTea is available at https://github.com/parklab/HiTea and as a Docker image. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 49 (D1) ◽  
pp. D38-D46
Author(s):  
Kyukwang Kim ◽  
Insu Jang ◽  
Mooyoung Kim ◽  
Jinhyuk Choi ◽  
Min-Seo Kim ◽  
...  

Abstract Three-dimensional (3D) genome organization is tightly coupled with gene regulation in various biological processes and diseases. In cancer, various types of large-scale genomic rearrangements can disrupt the 3D genome, leading to oncogenic gene expression. However, unraveling the pathogenicity of the 3D cancer genome remains a challenge since closer examinations have been greatly limited due to the lack of appropriate tools specialized for disorganized higher-order chromatin structure. Here, we updated a 3D-genome Interaction Viewer and database named 3DIV by uniformly processing ∼230 billion raw Hi-C reads to expand our contents to the 3D cancer genome. The updates of 3DIV are listed as follows: (i) the collection of 401 samples including 220 cancer cell line/tumor Hi-C data, 153 normal cell line/tissue Hi-C data, and 28 promoter capture Hi-C data, (ii) the live interactive manipulation of the 3D cancer genome to simulate the impact of structural variations and (iii) the reconstruction of Hi-C contact maps by user-defined chromosome order to investigate the 3D genome of the complex genomic rearrangement. In summary, the updated 3DIV will be the most comprehensive resource to explore the gene regulatory effects of both the normal and cancer 3D genome. ‘3DIV’ is freely available at http://3div.kr.


2019 ◽  
Author(s):  
Sushant Kumar ◽  
Arif Harmanci ◽  
Jagath Vytheeswaran ◽  
Mark B. Gerstein

AbstractA rapid decline in sequencing cost has made large-scale genome sequencing studies feasible. One of the fundamental goals of these studies is to catalog all pathogenic variants. Numerous methods and tools have been developed to interpret point mutations and small insertions and deletions. However, there is a lack of approaches for identifying pathogenic genomic structural variations (SVs). That said, SVs are known to play a crucial role in many diseases by altering the sequence and three-dimensional structure of the genome. Previous studies have suggested a complex interplay of genomic and epigenomic features in the emergence and distribution of SVs. However, the exact mechanism of pathogenesis for SVs in different diseases is not straightforward to decipher. Thus, we built an agnostic machine-learning-based workflow, called SVFX, to assign a “pathogenicity score” to somatic and germline SVs in various diseases. In particular, we generated somatic and germline training models, which included genomic, epigenomic, and conservation-based features for SV call sets in diseased and healthy individuals. We then applied SVFX to SVs in six different cancer cohorts and a cardiovascular disease (CVD) cohort. Overall, SVFX achieved high accuracy in identifying pathogenic SVs. Moreover, we found that predicted pathogenic SVs in cancer cohorts were enriched among known cancer genes and many cancer-related pathways (including Wnt signaling, Ras signaling, DNA repair, and ubiquitin-mediated proteolysis). Finally, we note that SVFX is flexible and can be easily extended to identify pathogenic SVs in additional disease cohorts.


Genes ◽  
2021 ◽  
Vol 12 (9) ◽  
pp. 1336
Author(s):  
Azamat Totikov ◽  
Andrey Tomarovsky ◽  
Dmitry Prokopov ◽  
Aliya Yakupova ◽  
Tatiana Bulyonkova ◽  
...  

Genome assemblies are in the process of becoming an increasingly important tool for understanding genetic diversity in threatened species. Unfortunately, due to limited budgets typical for the area of conservation biology, genome assemblies of threatened species, when available, tend to be highly fragmented, represented by tens of thousands of scaffolds not assigned to chromosomal locations. The recent advent of high-throughput chromosome conformation capture (Hi-C) enables more contiguous assemblies containing scaffolds spanning the length of entire chromosomes for little additional cost. These inexpensive contiguous assemblies can be generated using Hi-C scaffolding of existing short-read draft assemblies, where N50 of the draft contigs is larger than 0.1% of the estimated genome size and can greatly improve analyses and facilitate visualization of genome-wide features including distribution of genetic diversity in markers along chromosomes or chromosome-length scaffolds. We compared distribution of genetic diversity along chromosomes of eight mammalian species, including six listed as threatened by IUCN, where both draft genome assemblies and newer chromosome-level assemblies were available. The chromosome-level assemblies showed marked improvement in localization and visualization of genetic diversity, especially where the distribution of low heterozygosity across the genomes of threatened species was not uniform.


Author(s):  
Kazuaki Yamaguchi ◽  
Mitsutaka Kadota ◽  
Osamu Nishimura ◽  
Yuta Ohishi ◽  
Yuki Naito ◽  
...  

Recent development of ecological studies has been fueled by the introduction of massive information based on chromosome-scale genome sequences, even for species whose genetic linkage was previously not accessible. This was enabled mainly by the application of Hi-C, a method for genome-wide chromosome conformation capture which was originally developed for investigating long-range interaction of chromatins. Performing genomic scaffolding using Hi-C data is highly resource-demanding in elaborate laboratory steps for sequencing sample preparation, building primary genome sequence assembly as an input, and computation for genome scaffolding using Hi-C data, followed by careful validation. This article summarizes existing solutions for these steps and provides a test case of its application to a reptile species, the Madagascar ground gecko (Paroedura picta). Among frequently exerted metrics for evaluating scaffolding results, we investigate the validity of completeness assessment using single-copy reference orthologs and report problems with the widely used program pipeline BUSCO.


2018 ◽  
Author(s):  
Jeramiah J. Smith ◽  
Nataliya Timoshevskaya ◽  
Vladimir A. Timoshevskiy ◽  
Melissa C. Keinath ◽  
Drew Hardy ◽  
...  

ABSTRACTThe axolotl (Ambystoma mexicanum) provides critical models for studying regeneration, evolution and development. However, its large genome (~32 gigabases) presents a formidable barrier to genetic analyses. Recent efforts have yielded genome assemblies consisting of thousands of unordered scaffolds that resolve gene structures, but do not yet permit large scale analyses of genome structure and function. We adapted an established mapping approach to leverage dense SNP typing information and for the first time assemble the axolotl genome into 14 chromosomes. Moreover, we used fluorescence in situ hybridization to verify the structure of these 14 scaffolds and assign each to its corresponding physical chromosome. This new assembly covers 27.3 gigabases and encompasses 94% of annotated gene models on chromosomal scaffolds. We show the assembly’s utility by resolving genome-wide orthologies between the axolotl and other vertebrates, identifying the footprints of historical introgression events that occurred during the development of axolotl genetic stocks, and precisely mapping several phenotypes including a large deletion underlying the cardiac mutant. This chromosome-scale assembly will greatly facilitate studies of the axolotl in biological research.


2021 ◽  
Author(s):  
Bo Wang ◽  
Yinping Jiao ◽  
Kapeel Chougule ◽  
Andrew Olson ◽  
Jian Huang ◽  
...  

ABSTRACTSorghum bicolor, one of the most important grass crops around the world, harbors a high degree of genetic diversity. We constructed chromosome-level genome assemblies for two important sorghum inbred lines, Tx2783 and RTx436. The final high-quality reference assemblies consist of 19 and 18 scaffolds, respectively, with contig N50 values of 25.6 and 20.3 Mb. Genes were annotated using evidence-based and de novo gene predictors, and RAMPAGE data demonstrate that transcription start sites were effectively captured. Together with other public sorghum genomes, BTx623, RTx430, and Rio, extensive structural variations (SVs) of various sizes were characterized using Tx2783 as a reference. Genome-wide scanning for disease resistance (R) genes revealed high levels of diversity among these five sorghum accessions. To characterize sugarcane aphid (SCA) resistance in Tx2783, we mapped the resistance region on chromosome 6 using a recombinant inbred line (RIL) population and found a SV of 191 kb containing a cluster of R genes in Tx2783. Using Tx2783 as a backbone, along with the SVs, we constructed a pan-genome to support alignment of resequencing data from 62 sorghum accessions, and then identified core and dispensable genes using this population. This study provides the first overview of the extent of genomic structural variations and R genes in the sorghum population, and reveals potential targets for breeding of SCA resistance.


2020 ◽  
Author(s):  
Lauren J. Mills ◽  
Milcah C. Scott ◽  
Pankti Shah ◽  
Anne R. Cunanan ◽  
Archana Deshpande ◽  
...  

AbstractOsteosarcoma is an aggressive tumor of the bone that primarily affects young adults and adolescents. Osteosarcoma is characterized by genomic chaos and heterogeneity. While inactivation of tumor suppressor p53 TP53 is nearly universal other high frequency mutations or structural variations have not been identified. Despite this genomic heterogeneity, key conserved transcriptional programs associated with survival have been identified across human, canine and induced murine osteosarcoma. The epigenomic landscape, including DNA methylation, plays a key role in establishing transcriptional programs in all cell types. The role of epigenetic dysregulation has been studied in a variety of cancers but has yet to be explored at scale in osteosarcoma. Here we examined genome-wide DNA methylation patterns in 24 human and 44 canine osteosarcoma samples identifying groups of highly correlated DNA methylation marks in human and canine osteosarcoma samples. We also link specific DNA methylation patterns to key transcriptional programs in both human and canine osteosarcoma. Building on previous work, we built a DNA methylation-based measure for the presence and abundance of various immune cell types in osteosarcoma. Finally, we determined that the underlying state of the tumor, and not changes in cell composition, were the main driver of differences in DNA methylation across the human and canine samples.SignificanceThis is the first large scale study of DNA methylation in osteosarcoma and lays the ground work for the exploration of DNA methylation programs that help establish conserved transcriptional programs in the context of different genomic landscapes.


2020 ◽  
Author(s):  
Martin Johnsson ◽  
Andrew Whalen ◽  
Roger Ros-Freixedes ◽  
Gregor Gorjanc ◽  
Ching-Yi Chen ◽  
...  

AbstractBackgroundIn this paper, we estimated recombination rate variation within the genome and between individuals in the pig using multiocus iterative peeling for 150,000 pigs across nine genotyped pedigrees. We used this to estimate the heritability of recombination and perform a genome-wide association study of recombination in the pig.ResultsOur results confirmed known features of the pig recombination landscape, including differences in chromosome length, and marked sex differences. The recombination landscape was repeatable between lines, but at the same time, the lines also showed differences in average genome-wide recombination rate. The heritability of genome-wide recombination was low but non-zero (on average 0.07 for females and 0.05 for males). We found three genomic regions associated with recombination rate, one of them harbouring the RNF212 gene, previously associated with recombination rate in several other species.ConclusionOur results from the pig agree with the picture of recombination rate variation in vertebrates, with low but nonzero heritability, and a major locus that is homologous to one detected in several other species. This work also highlights the utility of using large-scale livestock data to understand biological processes.


2019 ◽  
Vol 476 (15) ◽  
pp. 2141-2156 ◽  
Author(s):  
Isobel A. MacGregor ◽  
Ian R. Adams ◽  
Nick Gilbert

Abstract The spatial configuration of chromatin is fundamental to ensure any given cell can fulfil its functional duties, from gene expression to specialised cellular division. Significant technological innovations have facilitated further insights into the structure, function and regulation of three-dimensional chromatin organisation. To date, the vast majority of investigations into chromatin organisation have been conducted in interphase and mitotic cells leaving meiotic chromatin relatively unexplored. In combination, cytological and genome-wide contact frequency analyses in mammalian germ cells have recently demonstrated that large-scale chromatin structures in meiotic prophase I are reminiscent of the sequential loop arrays found in mitotic cells, although interphase-like segmentation of transcriptionally active and inactive regions are also evident along the length of chromosomes. Here, we discuss the similarities and differences of such large-scale chromatin architecture, between interphase, mitotic and meiotic cells, as well as their functional relevance and the proposed modulatory mechanisms which underlie them.


Sign in / Sign up

Export Citation Format

Share Document