HiTea: a computational pipeline to identify non-reference transposable element insertions in Hi-C data

Author(s):  
Dhawal Jain ◽  
Chong Chu ◽  
Burak Han Alver ◽  
Soohyun Lee ◽  
Eunjung Alice Lee ◽  
...  

ABSTRACT   Hi-C is a common technique for assessing 3D chromatin conformation. Recent studies have shown that long-range interaction information in Hi-C data can be used to generate chromosome-length genome assemblies and identify large-scale structural variations. Here, we demonstrate the use of Hi-C data in detecting mobile transposable element (TE) insertions genome-wide. Our pipeline Hi-C-based TE analyzer (HiTea) capitalizes on clipped Hi-C reads and is aided by a high proportion of discordant read pairs in Hi-C data to detect insertions of three major families of active human TEs. Despite the uneven genome coverage in Hi-C data, HiTea is competitive with the existing callers based on whole-genome sequencing (WGS) data and can supplement the WGS-based characterization of the TE-insertion landscape. We employ the pipeline to identify TE-insertions from human cell-line Hi-C samples. Availability and implementation HiTea is available at https://github.com/parklab/HiTea and as a Docker image. Supplementary information Supplementary data are available at Bioinformatics online.
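The abstract above notes that HiTea capitalizes on clipped Hi-C reads. As a rough illustration of what a "clipped" read looks like in alignment data, the following sketch parses a SAM CIGAR string and reports how many bases are soft-clipped at each end. This is not HiTea's implementation: the function names and the `min_clip` threshold of 20 bases are hypothetical, chosen only for this example.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_lengths(cigar):
    """Return (left, right) soft-clipped base counts for a SAM CIGAR string.

    Hard clips (H) are ignored because the clipped bases are absent from
    the read sequence. An unavailable CIGAR ("*") yields (0, 0).
    """
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return (0, 0)
    left = ops[0][0] if ops[0][1] == "S" else 0
    # Require more than one operation so a lone "S" is not counted twice.
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)


def is_clipped(cigar, min_clip=20):
    """Hypothetical filter: True if either end has at least min_clip soft-clipped bases."""
    left, right = clipped_lengths(cigar)
    return max(left, right) >= min_clip
```

For example, `clipped_lengths("30S70M")` reports 30 clipped bases on the left end only; a candidate-insertion caller would typically go on to realign such clipped sequence against TE consensus sequences, a step not shown here.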

2020 ◽  
Author(s):  
Dhawal Jain ◽  
Chong Chu ◽  
Burak Han Alver ◽  
Soohyun Lee ◽  
Eunjung Alice Lee ◽  
...  

Abstract Hi-C is a common technique for assessing three-dimensional chromatin conformation. Recent studies have shown that long-range interaction information in Hi-C data can be used to generate chromosome-length genome assemblies and identify large-scale structural variations. Here, we demonstrate the use of Hi-C data in detecting mobile transposable element (TE) insertions genome-wide. Our pipeline HiTea (Hi-C based Transposable element analyzer) capitalizes on clipped Hi-C reads and is aided by a high proportion of discordant read pairs in Hi-C data to detect insertions of three major families of active human TEs. Despite the uneven genome coverage in Hi-C data, HiTea is competitive with the existing callers based on whole genome sequencing (WGS) data and can supplement the WGS-based characterization of the TE insertion landscape. We employ the pipeline to identify TE insertions from human cell-line Hi-C samples. HiTea is available at https://github.com/parklab/HiTea and as a Docker image.


2018 ◽  
Vol 35 (14) ◽  
pp. 2512-2514 ◽  
Author(s):  
Bongsong Kim ◽  
Xinbin Dai ◽  
Wenchao Zhang ◽  
Zhaohong Zhuang ◽  
Darlene L Sanchez ◽  
...  

Abstract Summary We present GWASpro, a high-performance web server for the analyses of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data, coupled with complex replicated experimental designs such as found in plant science investigations and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times, can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. Availability and implementation GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. Supplementary information Supplementary data are available at Bioinformatics online.


Genes ◽  
2021 ◽  
Vol 12 (9) ◽  
pp. 1336
Author(s):  
Azamat Totikov ◽  
Andrey Tomarovsky ◽  
Dmitry Prokopov ◽  
Aliya Yakupova ◽  
Tatiana Bulyonkova ◽  
...  

Genome assemblies are in the process of becoming an increasingly important tool for understanding genetic diversity in threatened species. Unfortunately, due to limited budgets typical for the area of conservation biology, genome assemblies of threatened species, when available, tend to be highly fragmented, represented by tens of thousands of scaffolds not assigned to chromosomal locations. The recent advent of high-throughput chromosome conformation capture (Hi-C) enables more contiguous assemblies containing scaffolds spanning the length of entire chromosomes for little additional cost. These inexpensive contiguous assemblies can be generated using Hi-C scaffolding of existing short-read draft assemblies, where N50 of the draft contigs is larger than 0.1% of the estimated genome size and can greatly improve analyses and facilitate visualization of genome-wide features including distribution of genetic diversity in markers along chromosomes or chromosome-length scaffolds. We compared distribution of genetic diversity along chromosomes of eight mammalian species, including six listed as threatened by IUCN, where both draft genome assemblies and newer chromosome-level assemblies were available. The chromosome-level assemblies showed marked improvement in localization and visualization of genetic diversity, especially where the distribution of low heterozygosity across the genomes of threatened species was not uniform.
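The abstract above states a rule of thumb: Hi-C scaffolding is suggested for draft assemblies whose contig N50 exceeds 0.1% of the estimated genome size. A minimal sketch of that check follows, assuming the standard definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length). The function names are hypothetical and not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length


def meets_hic_threshold(lengths, genome_size, fraction=0.001):
    """True if contig N50 is larger than `fraction` (default 0.1%) of genome_size."""
    return n50(lengths) > fraction * genome_size
```

With contigs of lengths 10, 8, 6, 4 and 2 (total 30), the two longest contigs already cover 18 of the 30 bases, so the N50 is 8. For a 2.5 Gb genome the same rule would require a contig N50 above 2.5 Mb.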


Author(s):  
Kazuaki Yamaguchi ◽  
Mitsutaka Kadota ◽  
Osamu Nishimura ◽  
Yuta Ohishi ◽  
Yuki Naito ◽  
...  

Recent development of ecological studies has been fueled by the introduction of massive information based on chromosome-scale genome sequences, even for species whose genetic linkage was previously not accessible. This was enabled mainly by the application of Hi-C, a method for genome-wide chromosome conformation capture which was originally developed for investigating long-range chromatin interactions. Performing genomic scaffolding using Hi-C data is highly resource-demanding: it involves elaborate laboratory steps for sequencing sample preparation, building a primary genome sequence assembly as an input, and computation for genome scaffolding using Hi-C data, followed by careful validation. This article summarizes existing solutions for these steps and provides a test case of its application to a reptile species, the Madagascar ground gecko (Paroedura picta). Among frequently used metrics for evaluating scaffolding results, we investigate the validity of completeness assessment using single-copy reference orthologs and report problems with the widely used program pipeline BUSCO.


2019 ◽  
Vol 35 (19) ◽  
pp. 3576-3583 ◽  
Author(s):  
Chong Wu ◽  
Wei Pan

Abstract Motivation Most trait-associated genetic variants identified in genome-wide association studies (GWASs) are located in non-coding regions of the genome and thought to act through their regulatory roles. Results To account for enriched association signals in DNA regulatory elements, we propose a novel and general gene-based association testing strategy that integrates enhancer-target gene pairs and methylation quantitative trait locus data with GWAS summary results; it aims to both boost statistical power for new discoveries and enhance mechanistic interpretability of any new discovery. By reanalyzing two large-scale schizophrenia GWAS summary datasets, we demonstrate that the proposed method could identify some significant and novel genes (containing no genome-wide significant SNPs nearby) that would have been missed by other competing approaches, including the standard and some integrative gene-based association methods, such as one incorporating enhancer-target gene pairs and one integrating expression quantitative trait loci. Availability and implementation Software: wuchong.org/egmethyl.html Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Jeramiah J. Smith ◽  
Nataliya Timoshevskaya ◽  
Vladimir A. Timoshevskiy ◽  
Melissa C. Keinath ◽  
Drew Hardy ◽  
...  

ABSTRACT The axolotl (Ambystoma mexicanum) provides critical models for studying regeneration, evolution and development. However, its large genome (~32 gigabases) presents a formidable barrier to genetic analyses. Recent efforts have yielded genome assemblies consisting of thousands of unordered scaffolds that resolve gene structures, but do not yet permit large-scale analyses of genome structure and function. We adapted an established mapping approach to leverage dense SNP typing information and for the first time assemble the axolotl genome into 14 chromosomes. Moreover, we used fluorescence in situ hybridization to verify the structure of these 14 scaffolds and assign each to its corresponding physical chromosome. This new assembly covers 27.3 gigabases and encompasses 94% of annotated gene models on chromosomal scaffolds. We show the assembly’s utility by resolving genome-wide orthologies between the axolotl and other vertebrates, identifying the footprints of historical introgression events that occurred during the development of axolotl genetic stocks, and precisely mapping several phenotypes including a large deletion underlying the cardiac mutant. This chromosome-scale assembly will greatly facilitate studies of the axolotl in biological research.


2017 ◽  
Author(s):  
Florian Privé ◽  
Hugues Aschard ◽  
Michael G.B. Blum

Abstract Motivation: Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge severely slowing down genomic analyses. Specialized software for every part of the analysis pipeline has been developed to handle large genomic data. However, combining all of this software into a single data analysis pipeline might be technically difficult. Results: Here we present two R packages, bigstatsr and bigsnpr, allowing for management and analysis of large-scale genomic data to be performed within a single comprehensive framework. To address large data size, the packages use memory-mapping for accessing data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the tools that are commonly used, either through transparent system calls to existing software, or through updated or improved implementation of existing methods. In particular, the packages implement a fast derivation of Principal Component Analysis, functions to remove SNPs in Linkage Disequilibrium, and algorithms to learn Polygenic Risk Scores on millions of SNPs. We illustrate applications of the two R packages by analysing a case-control genomic dataset for celiac disease, performing an association study and computing Polygenic Risk Scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500,000 individuals and 1 million markers on a single desktop computer. Availability: https://privefl.github.io/bigstatsr/ & https://privefl.github.io/bigsnpr/ Contact: [email protected] & [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Bo Wang ◽  
Yinping Jiao ◽  
Kapeel Chougule ◽  
Andrew Olson ◽  
Jian Huang ◽  
...  

ABSTRACT Sorghum bicolor, one of the most important grass crops around the world, harbors a high degree of genetic diversity. We constructed chromosome-level genome assemblies for two important sorghum inbred lines, Tx2783 and RTx436. The final high-quality reference assemblies consist of 19 and 18 scaffolds, respectively, with contig N50 values of 25.6 and 20.3 Mb. Genes were annotated using evidence-based and de novo gene predictors, and RAMPAGE data demonstrate that transcription start sites were effectively captured. Together with other public sorghum genomes, BTx623, RTx430, and Rio, extensive structural variations (SVs) of various sizes were characterized using Tx2783 as a reference. Genome-wide scanning for disease resistance (R) genes revealed high levels of diversity among these five sorghum accessions. To characterize sugarcane aphid (SCA) resistance in Tx2783, we mapped the resistance region on chromosome 6 using a recombinant inbred line (RIL) population and found an SV of 191 kb containing a cluster of R genes in Tx2783. Using Tx2783 as a backbone, along with the SVs, we constructed a pan-genome to support alignment of resequencing data from 62 sorghum accessions, and then identified core and dispensable genes using this population. This study provides the first overview of the extent of genomic structural variations and R genes in the sorghum population, and reveals potential targets for breeding of SCA resistance.


2020 ◽  
Author(s):  
Lauren J. Mills ◽  
Milcah C. Scott ◽  
Pankti Shah ◽  
Anne R. Cunanan ◽  
Archana Deshpande ◽  
...  

Abstract Osteosarcoma is an aggressive tumor of the bone that primarily affects young adults and adolescents. Osteosarcoma is characterized by genomic chaos and heterogeneity. While inactivation of the tumor suppressor p53 (TP53) is nearly universal, other high-frequency mutations or structural variations have not been identified. Despite this genomic heterogeneity, key conserved transcriptional programs associated with survival have been identified across human, canine and induced murine osteosarcoma. The epigenomic landscape, including DNA methylation, plays a key role in establishing transcriptional programs in all cell types. The role of epigenetic dysregulation has been studied in a variety of cancers but has yet to be explored at scale in osteosarcoma. Here we examined genome-wide DNA methylation patterns in 24 human and 44 canine osteosarcoma samples, identifying groups of highly correlated DNA methylation marks in human and canine osteosarcoma samples. We also link specific DNA methylation patterns to key transcriptional programs in both human and canine osteosarcoma. Building on previous work, we built a DNA methylation-based measure for the presence and abundance of various immune cell types in osteosarcoma. Finally, we determined that the underlying state of the tumor, and not changes in cell composition, was the main driver of differences in DNA methylation across the human and canine samples. Significance This is the first large-scale study of DNA methylation in osteosarcoma and lays the groundwork for the exploration of DNA methylation programs that help establish conserved transcriptional programs in the context of different genomic landscapes.


2020 ◽  
Author(s):  
Martin Johnsson ◽  
Andrew Whalen ◽  
Roger Ros-Freixedes ◽  
Gregor Gorjanc ◽  
Ching-Yi Chen ◽  
...  

Abstract Background In this paper, we estimated recombination rate variation within the genome and between individuals in the pig using multilocus iterative peeling for 150,000 pigs across nine genotyped pedigrees. We used this to estimate the heritability of recombination and perform a genome-wide association study of recombination in the pig. Results Our results confirmed known features of the pig recombination landscape, including differences in chromosome length, and marked sex differences. The recombination landscape was repeatable between lines, but at the same time, the lines also showed differences in average genome-wide recombination rate. The heritability of genome-wide recombination was low but non-zero (on average 0.07 for females and 0.05 for males). We found three genomic regions associated with recombination rate, one of them harbouring the RNF212 gene, previously associated with recombination rate in several other species. Conclusion Our results from the pig agree with the picture of recombination rate variation in vertebrates, with low but non-zero heritability, and a major locus that is homologous to one detected in several other species. This work also highlights the utility of using large-scale livestock data to understand biological processes.

