circHiC: circular visualization of Hi-C data and integration of genomic data


10.1101/2020.08.13.249110 ◽

2020 ◽

Author(s):

Ivan Junier ◽

Nelle Varoquaux

Keyword(s):

Genomic Data ◽

Bacterial Chromosome ◽

Link Type ◽

Genome Wide ◽

Bacterial Chromosomes ◽

Wide Contact ◽

Linear Chromosomes

Summary: Genome-wide contact frequencies obtained using Hi-C-like experiments have raised novel challenges in terms of visualization and rationalization of chromosome structuring phenomena. In bacteria, the display of Hi-C data should be congruent with the circularity of chromosomes. However, standard representations in the form of square matrices or horizontal bands are not adapted to periodic conditions such as those imposed by (most) bacterial chromosomes. Here, we fill this gap and propose a Python library, built upon the widely used Matplotlib library, to display Hi-C data in circular strips, together with the possibility to overlay genomic data. The proposed tools are light and fast, aiming to facilitate the exploration and understanding of bacterial chromosome structuring data. The library further includes the possibility to handle linear chromosomes, providing a fresh way to display and explore eukaryotic data. Availability and implementation: The package runs under Python 3 and is freely available at https://github.com/TrEE-TIMC/circHiC. The documentation can be found at https://tree-timc.github.io/circhic/; images obtained in different organisms are provided in the gallery section and are accompanied with [email protected], [email protected]
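The periodic-boundary issue described in this summary can be illustrated without the library itself. What follows is a minimal sketch in plain Python, not circHiC's API: the helper names `circular_distance` and `to_circular_strip` are hypothetical, and the toy contact counts are made up. It re-indexes a square contact matrix so that each row of the result holds the contacts at a fixed genomic offset, with indices wrapped modulo the chromosome length.

```python
def circular_distance(i, j, n_bins):
    """Shortest separation between bins i and j on a circular chromosome."""
    d = abs(i - j) % n_bins
    return min(d, n_bins - d)


def to_circular_strip(contacts, max_offset):
    """Re-index a square contact matrix as a strip (hypothetical helper).

    strip[k][i] holds the contact count between bin i and bin (i + k) mod n,
    so bins near the end of the matrix wrap around to the origin instead of
    being cut off at the matrix border.
    """
    n = len(contacts)
    return [
        [contacts[i][(i + k) % n] for i in range(n)]
        for k in range(max_offset + 1)
    ]


# Toy 6-bin circular chromosome: counts decay with circular distance.
n_bins = 6
contacts = [
    [10 - 3 * circular_distance(i, j, n_bins) for j in range(n_bins)]
    for i in range(n_bins)
]
strip = to_circular_strip(contacts, max_offset=2)
```

Each row of `strip` could then be drawn as one ring on a Matplotlib polar axis, which is one plausible way to obtain a circular strip; how the library actually renders it is documented at the URLs given in the summary.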


Efficient management and analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr

10.1101/190926 ◽

2017 ◽

Cited By ~ 1

Author(s):

Florian Privé ◽

Hugues Aschard ◽

Michael G.B. Blum

Keyword(s):

Data Analysis ◽

Large Scale ◽

Genomic Data ◽

Supplementary Information ◽

Risk Scores ◽

Analysis Pipeline ◽

Polygenic Risk ◽

Link Type ◽

Genome Wide ◽

R Packages

Abstract. Motivation: Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge, severely slowing down genomic analyses. Specialized software for every part of the analysis pipeline has been developed to handle large genomic data. However, combining all this software into a single data analysis pipeline might be technically difficult. Results: Here we present two R packages, bigstatsr and bigsnpr, allowing for management and analysis of large-scale genomic data to be performed within a single comprehensive framework. To address large data size, the packages use memory-mapping for accessing data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the tools that are commonly used, either through transparent system calls to existing software, or through updated or improved implementations of existing methods. In particular, the packages implement a fast derivation of Principal Component Analysis, functions to remove SNPs in Linkage Disequilibrium, and algorithms to learn Polygenic Risk Scores on millions of SNPs. We illustrate applications of the two R packages by analysing a case-control genomic dataset for celiac disease, performing an association study and computing Polygenic Risk Scores. Finally, we demonstrate the scalability of the R packages by analysing a simulated genome-wide dataset including 500,000 individuals and 1 million markers on a single desktop computer. Availability: https://privefl.github.io/bigstatsr/ & https://privefl.github.io/bigsnpr/ Contact: [email protected] & [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
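The memory-mapping idea mentioned in the Results can be sketched as follows. This is an analogy in Python using only the standard library, not the implementation used by bigstatsr or bigsnpr (which are R packages); the file layout, the helper names and the toy genotype values are all assumptions made for illustration. The point is that a statistic on one column is computed by reading only the bytes of that column from the file on disk, rather than loading the whole matrix into RAM.

```python
import mmap
import os
import struct
import tempfile

N_ROWS, N_COLS = 4, 3          # toy matrix: 4 individuals x 3 variants
ITEM = struct.Struct("<d")     # one little-endian float64 per entry


def write_matrix(path, rows):
    """Write a row-major float64 matrix to disk."""
    with open(path, "wb") as fh:
        for row in rows:
            for value in row:
                fh.write(ITEM.pack(value))


def column_mean(path, col, n_rows, n_cols):
    """Average one column by unpacking only its entries from the mapped file."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            total = 0.0
            for row in range(n_rows):
                offset = (row * n_cols + col) * ITEM.size
                total += ITEM.unpack_from(mm, offset)[0]
    return total / n_rows


# Made-up genotype counts (0, 1 or 2 copies of an allele).
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 0], [1, 2, 0]]
path = os.path.join(tempfile.mkdtemp(), "genotypes.bin")
write_matrix(path, genotypes)
mean_first_variant = column_mean(path, 0, N_ROWS, N_COLS)
```

The operating system pages in only the regions of the file that are touched, which is why this access pattern scales to matrices larger than available memory.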


Genome analysis reveals that the correct name of type strain Adlercreutzia caecicola DSM 22242T is Parvibacter caecicola Clavel et al. 2013

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijsem.0.004814 ◽

2021 ◽

Vol 71 (5) ◽

Author(s):

Dominic A. Stoll ◽

Nicolas Danylec ◽

Christina Grimmler ◽

Sabine E. Kulling ◽

Melanie Huch

Keyword(s):

Type Strain ◽

Type Species ◽

Draft Genome ◽

Amino Acid Identity ◽

Average Nucleotide Identity ◽

Content Type ◽

Link Type ◽

Genome Wide ◽

Average Amino Acid Identity ◽

Biochemical Analyses

The strain Adlercreutzia caecicola DSM 22242T (=CCUG 57646T=NR06T) was taxonomically described in 2013 and named as Parvibacter caecicola Clavel et al. 2013. In 2018, the name of the strain DSM 22242T was changed to Adlercreutzia caecicola (Clavel et al. 2013) Nouioui et al. 2018 due to taxonomic investigations of the closely related genera Adlercreutzia, Asaccharobacter and Enterorhabdus within the phylum Actinobacteria. However, the first whole draft genome of strain DSM 22242T was published by our group in 2019. Therefore, the genome was not available within the study of Nouioui et al. (2018). The results of the polyphasic approach within this study, including phenotypic and biochemical analyses and genome-based taxonomic investigations [genome-wide average nucleotide identity (gANI), alignment fraction (AF), average amino acid identity (AAI), percentage of orthologous conserved proteins (POCP) and genome blast distance phylogeny (GBDP) tree], indicated that the proposed change of the name Parvibacter caecicola to Adlercreutzia caecicola was not correct. Therefore, it is proposed that the correct name of Adlercreutzia caecicola (Clavel et al. 2013) Nouioui et al. 2018 strain DSM 22242T is Parvibacter caecicola Clavel et al. 2013.


The theory on and software simulating large-scale genomic data for genotype-by-environment interactions

BMC Genomics ◽

10.1186/s12864-021-08191-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Xiujin Li ◽

Hailiang Song ◽

Zhe Zhang ◽

Yunmao Huang ◽

Qin Zhang ◽

...

Keyword(s):

Large Scale ◽

Simulated Data ◽

Genomic Data ◽

Efficient Tool ◽

Phenotypic Data ◽

Genotype By Environment Interactions ◽

Genotype By Environment ◽

Threshold Trait ◽

Genome Wide ◽

Increasing Demand

Abstract. Background: With the emphasis on analysing genotype-by-environment interactions within the framework of genomic selection and genome-wide association analysis, there is an increasing demand for reliable tools that can be used to simulate large-scale genomic data in order to assess related approaches. Results: We proposed a theory to simulate large-scale genomic data on genotype-by-environment interactions and added this new function to our developed tool GPOPSIM. Additionally, a simulated threshold trait with large-scale genomic data was also added. The validation of the simulated data indicated that GPOPSIM2.0 is an efficient tool for mimicking the phenotypic data of quantitative traits, threshold traits, and genetically correlated traits with large-scale genomic data while taking genotype-by-environment interactions into account. Conclusions: This tool is useful for assessing methods for genotype-by-environment interactions and threshold traits.


Genetic effect estimates in case-control studies when a continuous variable is omitted from the model

10.1101/756015 ◽

2019 ◽

Author(s):

Ying Sheng ◽

Chiung-Yu Huang ◽

Siarhei Lobach ◽

Lydia Zablotska ◽

Iryna Lobach ◽

...

Keyword(s):


Alzheimer's Disease ◽

Large Scale ◽

False Positive Rate ◽

Continuous Variable ◽

Genetic Effects ◽

Data Availability ◽

Conditional Density ◽

Link Type ◽

Genome Wide

Abstract: Large-scale genome-wide analysis scans provide massive volumes of genetic variants on large numbers of cases and controls that can be used to estimate the genetic effects. Yet, the sets of non-genetic variables available in publicly available databases are often brief. It is known that omitting a continuous variable from a logistic regression model can result in biased estimates of odds ratios (OR) (e.g., Gail et al (1984), Neuhaus et al (1993), Hauck et al (1991), Zeger et al (1988)). We are interested in assessing what information is needed to recover the bias in the OR estimate of genotype due to omitting a continuous variable in settings where the actual values of the omitted variable are not available. We derive two estimating procedures that can recover the degree of bias based on a conditional density of the omitted variable or on knowing the distribution of the omitted variable. Importantly, our derivations show that omitting a continuous variable can result in either under- or over-estimation of the genetic effects. We performed extensive simulation studies to examine bias, variability, false positive rate, and power in the model that omits a continuous variable. We show the application to two genome-wide studies of Alzheimer’s disease. Data Availability Statement: The data that support the findings of this study are openly available in the Database of Genotypes and Phenotypes at [https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000372.v1.p1], reference number [phs000372.v1.p1] and at the Alzheimer’s Disease Neuroimaging Initiative http://adni.loni.usc.edu/.
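The effect of omitting a continuous variable from a logistic model can be made concrete with a small numerical sketch. It is not the authors' estimating procedure and it does not model case-control sampling: it assumes a prospective model logit P(Y=1 | G, X) = b0 + b_g G + b_x X with X ~ N(0, 1) independent of G, and all coefficient values are made up. Under those assumptions, integrating X out shrinks the log odds ratio for G toward zero, even though X is not a confounder.

```python
import math


def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def marginal_log_or(b0, b_g, b_x, n_steps=4000, half_width=8.0):
    """Log odds ratio for G after averaging over an omitted X ~ N(0, 1).

    Assumes logit P(Y=1 | G, X) = b0 + b_g * G + b_x * X with X independent
    of G, and integrates X out with a normalized midpoint rule.
    """
    step = 2.0 * half_width / n_steps
    xs = [-half_width + (k + 0.5) * step for k in range(n_steps)]
    weights = [math.exp(-0.5 * x * x) for x in xs]
    norm = sum(weights)

    def prob(g):
        acc = sum(w * expit(b0 + b_g * g + b_x * x) for w, x in zip(weights, xs))
        return acc / norm

    p1, p0 = prob(1), prob(0)
    return math.log(p1 / (1.0 - p1)) - math.log(p0 / (1.0 - p0))


conditional = 1.0  # log OR for G when X is included in the model
marginal = marginal_log_or(b0=-1.0, b_g=conditional, b_x=2.0)
```

Here `marginal` comes out smaller than `conditional`, which corresponds to under-estimation in this particular setting; the abstract states that in the settings the authors study, the bias can go in either direction.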


Lentzea tibetensis sp. nov., a novel Actinobacterium with antimicrobial activity isolated from soil of the Qinghai–Tibet Plateau

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijsem.0.004976 ◽

2021 ◽

Vol 71 (8) ◽

Author(s):

Jiao Huang ◽

Ying Huang

Keyword(s):

Antimicrobial Activity ◽

Type Species ◽

Sequence Similarity ◽

Tibet Plateau ◽

Rrna Gene ◽

Content Type ◽

Link Type ◽

Genome Wide ◽

Pr China ◽

Qinghai Tibet Plateau

A novel filamentous Actinobacterium, designated strain FXJ1.1311T, was isolated from soil collected in Ngari (Ali) Prefecture, Qinghai-Tibet Plateau, western PR China. The strain showed antimicrobial activity against Gram-positive bacteria and Fusarium oxysporum. Results of phylogenetic analysis based on 16S rRNA gene sequences indicated that strain FXJ1.1311T belonged to the genus Lentzea and showed the highest sequence similarity to Lentzea guizhouensis DHS C013T (98.04%). Morphological and chemotaxonomic characteristics supported its assignment to the genus Lentzea. The genome-wide average nucleotide identity between strain FXJ1.1311T and L. guizhouensis DHS C013T as well as other Lentzea type strains was <82.2%. Strain FXJ1.1311T also formed a monophyletic line distinct from the known Lentzea species in the phylogenomic tree. In addition, physiological and chemotaxonomic characteristics allowed phenotypic differentiation of the novel strain from L. guizhouensis. Based on the evidence presented here, strain FXJ1.1311T represents a novel species of the genus Lentzea, for which the name Lentzea tibetensis sp. nov. is proposed. The type strain is FXJ1.1311T (=CGMCC 4.7383T=DSM 104975T).


Qtlizer: comprehensive QTL annotation of GWAS results

Scientific Reports ◽

10.1038/s41598-020-75770-7 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Matthias Munz ◽

Inken Wohlers ◽

Eric Simon ◽

Tobias Reinberger ◽

Hauke Busch ◽

...

Keyword(s):

Association Studies ◽

Housekeeping Genes ◽

R Package ◽

Genome Wide Association Studies ◽

Protein Abundance ◽

Base Pairs ◽

Link Type ◽

Genome Wide ◽

Wide Range ◽

Distance Limit

Abstract: Exploration of genetic variant-to-gene relationships by quantitative trait loci such as expression QTLs is a frequently used tool in genome-wide association studies. However, the wide range of public QTL databases and the lack of batch annotation features complicate a comprehensive annotation of GWAS results. In this work, we introduce the tool “Qtlizer” for annotating lists of variants in human with associated changes in gene expression and protein abundance using an integrated database of published QTLs. Features include incorporation of variants in linkage disequilibrium and reverse search by gene names. Analyzing the database for base pair distances between best significant eQTLs and their affected genes suggests that the commonly used cis-distance limit of 1,000,000 base pairs might be too restrictive, implying a substantial number of wrongly assigned and as yet undetected eQTLs. We also ranked genes with respect to the maximum number of tissue-specific eQTL studies in which a most significant eQTL signal was consistent. For the top 100 genes we observed the strongest enrichment with housekeeping genes (P = 2 × 10⁻⁶) and with the 10% highest expressed genes (P = 0.005) after grouping eQTLs by r² > 0.95, underlining the relevance of LD information in eQTL analyses. Qtlizer can be accessed via https://genehopper.de/qtlizer or by using the respective Bioconductor R-package (https://doi.org/10.18129/B9.bioc.Qtlizer).


Integrative Bayesian Network Analysis of Genomic Data

Cancer Informatics ◽

10.4137/cin.s13786 ◽

2014 ◽

Vol 13s2 ◽

pp. CIN.S13786 ◽

Cited By ~ 1

Author(s):

Yang Ni ◽

Francesco C. Stingo ◽

Veerabhadran Baladandayuthapani

Keyword(s):

Bayesian Network ◽

Cancer Progression ◽

Rapid Development ◽

Genomic Data ◽

Biological Knowledge ◽

Network Approach ◽

Computationally Efficient ◽

Efficient Manner ◽

New Genes ◽

Genome Wide

Rapid development of genome-wide profiling technologies has made it possible to conduct integrative analysis on genomic data from multiple platforms. In this study, we develop a novel integrative Bayesian network approach to investigate the relationships between genetic and epigenetic alterations as well as how these mutations affect a patient's clinical outcome. We take a Bayesian network approach that admits a convenient decomposition of the joint distribution into local distributions. Exploiting the prior biological knowledge about regulatory mechanisms, we model each local distribution as a linear regression. This allows us to analyze multi-platform genome-wide data in a computationally efficient manner. We illustrate the performance of our approach through simulation studies. Our methods are motivated by and applied to a multi-platform glioblastoma dataset, from which we reveal several biologically relevant relationships that have been validated in the literature as well as new genes that could potentially be novel biomarkers for cancer progression.
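The decomposition of a joint distribution into local linear regressions can be sketched in a few lines. This is a generic linear Gaussian Bayesian network, not the authors' model or code: the three-node chain, the node names, the coefficients and the residual standard deviations below are all hypothetical. Each node's density is a normal distribution whose mean is a linear function of its parents, and the joint log density is the sum of these local terms.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def joint_logpdf(sample, network):
    """Sum the local log densities of a linear Gaussian Bayesian network.

    `network` maps each node to (intercept, {parent: coefficient}, residual_sd);
    the node names and numbers used below are illustrative only.
    """
    total = 0.0
    for node, (intercept, coefs, sd) in network.items():
        mean = intercept + sum(c * sample[parent] for parent, c in coefs.items())
        total += gaussian_logpdf(sample[node], mean, sd)
    return total


# Hypothetical three-node chain: methylation -> expression -> outcome.
network = {
    "methylation": (0.0, {}, 1.0),
    "expression": (1.0, {"methylation": -0.5}, 1.0),
    "outcome": (0.0, {"expression": 2.0}, 0.5),
}
sample = {"methylation": 0.2, "expression": 0.9, "outcome": 1.8}
log_density = joint_logpdf(sample, network)
```

Because each local term involves only a node and its parents, every regression can be fitted separately, which is one reason this kind of factorization is computationally convenient.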


Author Correction: Genome-wide association study of self-reported walking pace suggests beneficial effects of brisk walking on health and survival

Communications Biology ◽

10.1038/s42003-020-01447-6 ◽

2020 ◽

Vol 3 (1) ◽

Author(s):

Iain R. Timmins ◽

Francesco Zaccardi ◽

Christopher P. Nelson ◽

Paul W. Franks ◽

Thomas Yates ◽

...

Keyword(s):

Association Study ◽

Genome Wide Association Study ◽

Genome Wide Association ◽

Brisk Walking ◽

Beneficial Effects ◽

Link Type ◽

Genome Wide

A Correction to this paper has been published: 10.1038/s42003-020-01447-6.


Dbf4-Dependent Kinase (DDK)-Mediated Proteolysis of CENP-A Prevents Mislocalization of CENP-A in Saccharomyces cerevisiae

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401131 ◽

2020 ◽

Vol 10 (6) ◽

pp. 2057-2068 ◽

Cited By ~ 3

Author(s):

Jessica R. Eisenstatt ◽

Lars Boeckmann ◽

Wei-Chun Au ◽

Valerie Garcia ◽

Levi Bursch ◽

...

Keyword(s):

Dna Replication ◽

Replication Initiation ◽

Centromeric Chromatin ◽

Dna Replication Initiation ◽

Link Type ◽

Genome Wide ◽

A Genome ◽

Evolutionarily Conserved ◽

Histone H3 Variant

The evolutionarily conserved centromeric histone H3 variant (Cse4 in budding yeast, CENP-A in humans) is essential for faithful chromosome segregation. Mislocalization of CENP-A to non-centromeric chromatin contributes to chromosomal instability (CIN) in yeast, fly, and human cells and CENP-A is highly expressed and mislocalized in cancers. Defining mechanisms that prevent mislocalization of CENP-A is an area of active investigation. Ubiquitin-mediated proteolysis of overexpressed Cse4 (GALCSE4) by E3 ubiquitin ligases such as Psh1 prevents mislocalization of Cse4, and psh1Δ strains display synthetic dosage lethality (SDL) with GALCSE4. We previously performed a genome-wide screen and identified five alleles of CDC7 and DBF4 that encode the Dbf4-dependent kinase (DDK) complex, which regulates DNA replication initiation, among the top twelve hits that displayed SDL with GALCSE4. We determined that cdc7-7 strains exhibit defects in ubiquitin-mediated proteolysis of Cse4 and show mislocalization of Cse4. Mutation of MCM5 (mcm5-bob1) bypasses the requirement of Cdc7 for replication initiation and rescues replication defects in a cdc7-7 strain. We determined that mcm5-bob1 does not rescue the SDL and defects in proteolysis of GALCSE4 in a cdc7-7 strain, suggesting a DNA replication-independent role for Cdc7 in Cse4 proteolysis. The SDL phenotype, defects in ubiquitin-mediated proteolysis, and the mislocalization pattern of Cse4 in a cdc7-7 psh1Δ strain were similar to those of cdc7-7 and psh1Δ strains, suggesting that Cdc7 regulates Cse4 in a pathway that overlaps with Psh1. Our results define a DNA replication initiation-independent role of DDK as a regulator of Psh1-mediated proteolysis of Cse4 to prevent mislocalization of Cse4.


When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data

Genome Biology ◽

10.1186/s13059-019-1809-x ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 13

Author(s):

Will P. M. Rowe

Keyword(s):

Genomic Data ◽

The Past ◽

Link Type ◽

Practical Guide ◽

Current State ◽

Great Utility ◽

State Of The Field

Abstract: Considerable advances in genomics over the past decade have resulted in vast amounts of data being generated and deposited in global archives. The growth of these archives exceeds our ability to process their content, leading to significant analysis bottlenecks. Sketching algorithms produce small, approximate summaries of data and have shown great utility in tackling this flood of genomic data, while using minimal compute resources. This article reviews the current state of the field, focusing on how the algorithms work and how genomicists can utilize them effectively. References to interactive workbooks for explaining concepts and demonstrating workflows are included at https://github.com/will-rowe/genome-sketching.
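As one concrete instance of a "small, approximate summary", the sketch below implements bottom-k MinHash over k-mers using only the Python standard library. It is an illustration written for this listing, not code from the article or its workbooks; the function names, the choice of SHA-1 as the hash, and the default k-mer and sketch sizes are assumptions. Two sequences are each reduced to their smallest hash values, and the overlap of those small sets estimates the Jaccard similarity of the full k-mer sets.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer using a stable hash."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def minhash_sketch(sequence, k=5, size=16):
    """Bottom-`size` MinHash sketch: the smallest hash values of the k-mer set."""
    return sorted(hash_kmer(km) for km in kmers(sequence, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Made-up sequences for illustration.
seq_a = "ACGTTGCATGCCGATAGCTAGGCTTACGGATCCA"
seq_b = "ACGTTGCATGCCGATAGCTAGGCTTACGGTTCCA"
similarity = jaccard_estimate(minhash_sketch(seq_a), minhash_sketch(seq_b))
```

The sketch size trades accuracy for space: a larger `size` gives a tighter estimate at the cost of storing more hash values per sequence.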
