DBnorm as an R package for the comparison and selection of appropriate statistical methods for batch effect correction in metabolomic studies

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nasim Bararpour ◽  
Federica Gilardi ◽  
Cristian Carmeli ◽  
Jonathan Sidibe ◽  
Julijana Ivanisevic ◽  
...  

Abstract: As a powerful phenotyping technology, metabolomics provides new opportunities in biomarker discovery through metabolome-wide association studies (MWAS) and the identification of metabolites that have a regulatory effect in various biological processes. While mass spectrometry-based (MS) metabolomics assays offer high throughput and sensitivity, MWAS require long-term data acquisition, which generates analytical signal drift over time that can hinder the uncovering of real, biologically relevant changes. We developed “dbnorm”, a package in the R environment, which allows for an easy comparison of the performance of advanced statistical tools commonly used in metabolomics to remove batch effects from large metabolomics datasets. “dbnorm” integrates advanced statistical tools to inspect the dataset structure not only at the macroscopic (sample batches) scale, but also at the microscopic (metabolic features) level. To compare model performance on data correction, “dbnorm” assigns a score that helps users identify the best-fitting model for each dataset. In this study, we applied “dbnorm” to two large-scale metabolomics datasets as a proof of concept. We demonstrate that “dbnorm” allows for the accurate selection of the most appropriate statistical tool to efficiently remove the signal drift over time and to focus on the relevant biological components of complex datasets.
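The idea of scoring a correction method can be illustrated with a minimal sketch. It assumes a deliberately simple correction (per-batch mean centering) and an R²-style score (the fraction of a feature's variance explained by batch membership). The function names and data below are hypothetical and are not the dbnorm API; the abstract does not specify how the package computes its score.

```python
from statistics import mean


def batch_r2(values, batches):
    """Fraction of a feature's variance explained by batch (one-way ANOVA R^2)."""
    grand = mean(values)
    total = sum((v - grand) ** 2 for v in values)
    if total == 0:
        return 0.0
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return between / total


def center_by_batch(values, batches):
    """Subtract each batch's mean, then restore the grand mean (illustrative correction)."""
    grand = mean(values)
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    batch_mean = {b: mean(g) for b, g in groups.items()}
    return [v - batch_mean[b] + grand for v, b in zip(values, batches)]


# Hypothetical intensities of one metabolic feature measured in two batches.
intensity = [10.0, 11.0, 12.0, 20.0, 21.0, 22.0]
batch = ["A", "A", "A", "B", "B", "B"]

raw_score = batch_r2(intensity, batch)
corrected = center_by_batch(intensity, batch)
corrected_score = batch_r2(corrected, batch)
print(f"batch R^2 before: {raw_score:.3f}, after: {corrected_score:.3f}")
```

A lower score after correction indicates that less of the feature's variance is attributable to batch. Comparing such scores across several candidate correction models is the kind of selection the abstract describes.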

2020 ◽  
Author(s):  
Nasim Bararpour ◽  
Federica Gilardi ◽  
Cristian Carmeli ◽  
Jonathan Sidibe ◽  
Julijana Ivanisevic ◽  
...  

Abstract: As a powerful phenotyping technology, metabolomics provides new opportunities in biomarker discovery through metabolome-wide association studies (MWAS) and the identification of metabolites that have a regulatory effect in various biological processes. While MS-based metabolomics assays offer high throughput and sensitivity, large-scale MWAS require long-term data acquisition, which generates analytical signal drift over time that can hinder the uncovering of true, biologically relevant changes. We developed “dbnorm”, a package in the R environment, which allows visualization and removal of signal heterogeneity from large metabolomics datasets. “dbnorm” integrates advanced statistical tools to inspect dataset structure at both macroscopic (sample batch) and microscopic (metabolic features) scales. To compare model performance on data correction, “dbnorm” assigns a score, which allows the straightforward identification of the best-fitting model for each dataset. Herein, we show how “dbnorm” efficiently removes signal drift among batches to capture the true biological heterogeneity of data in two large-scale metabolomics studies.


Author(s):  
Xiuwen Zheng ◽  
J Wade Davis

Abstract
Summary: Phenome-wide association studies (PheWASs) are known to be a powerful tool for discovery and replication in genetic association studies. To reduce the computational burden of PheWAS in large cohorts, such as the UK Biobank, the SAIGE method has been proposed to control for case–control imbalance and sample relatedness in a tractable manner. However, SAIGE is still computationally intensive when used to analyze the associations of thousands of ICD10-coded phenotypes with whole-genome imputed genotype data. Here, we present a new high-performance statistical R package (SAIGEgds) for large-scale PheWAS using generalized linear mixed models. The package implements the SAIGE method in optimized C++ code, taking advantage of sparse genotype dosages and integrating the efficient genomic data structure file format. Benchmarks using the UK Biobank White British genotype data (N ≈ 430 K) with coronary heart disease and simulated cases show that the implementation in SAIGEgds is 5–6 times faster than the SAIGE R package. When used in conjunction with high-performance computing clusters, SAIGEgds provides an efficient analysis pipeline for biobank-scale PheWAS.
Availability and implementation: https://bioconductor.org/packages/SAIGEgds; vignettes included.
Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Tianye Jia ◽  
Congying Chu ◽  
Yun Liu ◽  
Jenny van Dongen ◽  
Evangelos Papastergios ◽  
...  

Abstract: DNA methylation, which is modulated by both genetic factors and environmental exposures, may offer a unique opportunity to discover novel biomarkers of disease-related brain phenotypes, even when measured in tissues other than the brain, such as blood. A few studies with small sample sizes have revealed associations between blood DNA methylation and neuropsychopathology; however, large-scale epigenome-wide association studies (EWAS) are needed to investigate the utility of DNA methylation profiling as a peripheral marker for the brain. Here, in an analysis of eleven international cohorts, totalling 3337 individuals, we report epigenome-wide meta-analyses of blood DNA methylation with volumes of the hippocampus, thalamus and nucleus accumbens (NAcc), three subcortical regions selected for their associations with disease and for their heritability and volumetric variability. Analyses of individual CpGs revealed genome-wide significant associations with hippocampal volume at two loci. No significant associations were found for analyses of thalamus and nucleus accumbens volumes. Cluster-based analyses revealed additional differentially methylated regions (DMRs) associated with hippocampal volume. DNA methylation at these loci affected expression of proximal genes involved in learning and memory, stem cell maintenance and differentiation, fatty acid metabolism and type-2 diabetes. These DNA methylation marks, their interaction with genetic variants and their impact on gene expression offer new insights into the relationship between epigenetic variation and brain structure and may provide the basis for biomarker discovery in neurodegeneration and neuropsychiatric conditions.


Author(s):  
F.H. Acuña ◽  
A.C. Excoffon ◽  
L. Ricci

This study analyses the possible relationships between body size and length of cnidae from different tissues of the sea anemone Oulactis muscosa. We describe the cnidom, providing new qualitative and quantitative data. Our description adds spirocysts for tentacles and acrorhagi, and is more precise about the ranges and types of basitrichs, microbasic b-mastigophores, and holotrichs. We distinguish two types of holotrichs in the acrorhagi, and differentiate between microbasic b-mastigophores and basitrichs in the actinopharynx and mesenterial filaments. A relationship between cnida length and body weight was not demonstrated. The results are based on a complete account of cnida types from all tissues, and considering the great number of capsules measured (5400) and the modern statistical tools employed, we think that a normal distribution of cnida lengths is uncommon, perhaps refuted. This finding is very important when a quantitative analysis of cnidae is necessary and an adequate statistical tool must be used. We have shown that generalized linear models are an alternative and therefore analyses can be done with parametric methods despite the non-normal distribution of cnida size. The use of these statistical tools should become more widespread, since appropriate packages for the analyses (such as those in R) are available from the web and the results obtained are robust and powerful.


2011 ◽  
Vol 30 (3) ◽  
pp. 213-223 ◽  
Author(s):  
Miroslava Janković

Glycans as Biomarkers: Status and Perspectives

Protein glycosylation is a ubiquitous and complex co- and post-translational modification leading to glycan formation, i.e. oligosaccharide chains covalently attached to peptide backbones. The significance of changes in glycosylation for the beginning, progress and outcome of different human diseases is widely recognized. Thus, glycans are considered as unique structures to diagnose, predict susceptibility to and monitor the progression of disease. In the »omics« era, the glycome, a glycan analogue of the proteome and genome, holds considerable promise as a source of new biomarkers. In the design of a strategy for biomarker discovery, new principles and platforms for the analysis of relatively small amounts of numerous glycoproteins are needed. Emerging glycomics technologies comprising different types of mass spectrometry and affinity-based arrays are next in line to deliver new analytical procedures in the field of biomarkers. Screening different types of glycomolecules, selection of differentially expressed components, their enrichment and purification or identification are the most challenging parts of experimental and clinical glycoproteomics. This requires large-scale technologies enabling high sensitivity, proper standardization and validation of the methods to be used. Further progress in the field of applied glycoscience requires an integrated systematic approach in order to explore properly all opportunities for disease diagnosis.


Author(s):  
Zachary B Abrams ◽  
Caitlin E Coombes ◽  
Suli Li ◽  
Kevin R Coombes

Abstract
Summary: Unsupervised machine learning provides tools for researchers to uncover latent patterns in large-scale data, based on calculated distances between observations. Methods to visualize high-dimensional data based on these distances can elucidate subtypes and interactions within multi-dimensional and high-throughput data. However, researchers can select from a vast number of distance metrics and visualizations, each with their own strengths and weaknesses. The Mercator R package facilitates selection of a biologically meaningful distance from 10 metrics, together appropriate for binary, categorical and continuous data, and visualization with 5 standard and high-dimensional graphics tools. Mercator provides a user-friendly pipeline for informaticians or biologists to perform unsupervised analyses, from exploratory pattern recognition to production of publication-quality graphics.
Availability and implementation: Mercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html).
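Why the choice of distance matters can be seen in a small sketch for binary data. The two metrics below (Jaccard and normalized Hamming) are chosen purely for illustration; the abstract does not list which 10 metrics Mercator offers, and this code is not the package's interface.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| over features present (1) in either sample."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def hamming_distance(a, b):
    """Fraction of positions at which the two samples differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


# Hypothetical binary feature vectors (1 = feature present) for two samples.
sample_a = [1, 1, 0, 0, 0, 0, 0, 0]
sample_b = [1, 0, 0, 0, 0, 0, 0, 0]

jaccard = jaccard_distance(sample_a, sample_b)
hamming = hamming_distance(sample_a, sample_b)
print(f"Jaccard: {jaccard}, Hamming: {hamming}")
```

Jaccard ignores features absent from both samples, while Hamming counts shared absences as agreement, so the same pair looks far apart under one metric and close under the other. Whether shared absences are biologically meaningful is the kind of judgment the abstract leaves to the analyst.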


Author(s):  
Felipe De Mendiburu ◽  
Reinhard Simon

Plant breeders and educators working with the International Potato Center (CIP) needed freely available statistical tools. In response, we first created a set of scripts for specific tasks using the open source statistical software R. Based on this we eventually compiled the R package agricolae, as it covered a niche. Here we describe for the first time its main functions in the form of an article. We also review its reception using download statistics, citation data, and feedback from a user survey. We highlight usage in our extended network of collaborators. The package has found applications beyond agriculture in fields like aquaculture, ecology, biodiversity, conservation biology and cancer research. In summary, the package agricolae is a well-established statistical toolbox based on R with a broad range of applications in the design and analysis of experiments, including in the wider biological community.


2018 ◽  
Author(s):  
Mykyta Artomov ◽  
Alexander A. Loboda ◽  
Maxim N. Artyomov ◽  
Mark J. Daly

Abstract: Acquiring a sufficiently powered cohort of control samples can be time consuming or, sometimes, impossible. Accordingly, an ability to leverage control samples that were already collected and sequenced elsewhere could dramatically improve power in all genetic association studies. However, since the majority of the human DNA samples genotyped and sequenced to date are subject to strict data sharing regulations, large-scale sharing of control samples in particular is extremely challenging. Using insights from image recognition, we developed a method allowing selection of the best-matching controls in an external pool of samples that is compliant with personal genotype data protection restrictions. Our approach uses singular value decomposition of the matrix of case genotypes to rank controls in another study by similarity to cases. We demonstrate that this recovers an accurate case-control association analysis for both ultra-rare and common variants, and we implement and provide online access to a library of ~17,000 controls that enables association studies for case cohorts lacking control subjects.
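The ranking idea can be sketched under strong simplifying assumptions. The abstract does not give the details of the method, so this example assumes that controls are ranked by their residual distance from the leading right singular direction of the case genotype matrix. Power iteration stands in for a full singular value decomposition, and all data and names are hypothetical.

```python
import math


def top_right_singular_vector(rows, iterations=200):
    """Approximate the leading right singular vector via power iteration on A^T A."""
    n_cols = len(rows[0])
    # For non-negative genotype counts, a uniform start has positive overlap
    # with the leading direction, so the iteration does not stall.
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        u = [sum(a * b for a, b in zip(row, v)) for row in rows]        # u = A v
        w = [sum(row[j] * u_i for row, u_i in zip(rows, u))             # w = A^T u
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def residual_norm(sample, direction):
    """Length of the part of `sample` not explained by `direction`."""
    coeff = sum(s * d for s, d in zip(sample, direction))
    return math.sqrt(sum((s - coeff * d) ** 2 for s, d in zip(sample, direction)))


# Hypothetical genotype counts (0/1/2 copies of an allele) at four variants.
cases = [[0, 1, 2, 0], [0, 1, 2, 1], [0, 2, 2, 0], [0, 1, 1, 0]]
controls = {
    "similar": [0, 1, 2, 0],
    "mixed": [1, 1, 1, 1],
    "dissimilar": [2, 0, 0, 2],
}

direction = top_right_singular_vector(cases)
ranking = sorted(controls, key=lambda name: residual_norm(controls[name], direction))
print(ranking)
```

In this sketch only the singular direction derived from the cases is needed to score the external controls. A realistic version would retain several singular vectors rather than one.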


2021 ◽  
Author(s):  
Héctor Climente-González ◽  
Chloé-Agathe Azencott

Abstract: Systems biology shows that genes related to the same phenotype are often functionally related. We can take advantage of this to discover new genes that affect a phenotype. However, the natural unit of analysis in genome-wide association studies (GWAS) is not the gene, but the single nucleotide polymorphism, or SNP. We introduce martini, an R package to build SNP co-function networks and use them to conduct GWAS. In SNP networks, two SNPs are connected if there is evidence they jointly contribute to the same biological function. By leveraging such information in GWAS, we search for SNPs that are not only strongly associated with a phenotype, but also functionally related. This, in turn, boosts discovery and interpretability. Martini builds such networks using three sources of information: genomic position, gene annotations, and gene-gene interactions. The resulting SNP networks involve hundreds of thousands of nodes and millions of edges, making their exploration computationally intensive. Martini implements two network-guided biomarker discovery algorithms based on graph cuts that can handle such large networks: SConES and SigMod. They both seek a small subset of SNPs that have high association scores with the phenotype of interest and are densely interconnected in the network. Both algorithms use parameters that control the relative importance of the SNPs’ association scores, the number of SNPs selected, and their interconnection. Martini includes a cross-validation procedure to set these parameters automatically. Lastly, martini includes tools to visualize the selected SNPs’ network and association properties. Martini is available on GitHub (hclimente/martini) and Bioconductor (martini).


Author(s):  
Tadeusz Trzaskalik ◽  
Slawomir Jarek

Here we discuss the issue of planning a telemarketing campaign which will promote new services and use databases of the current customers. The databases can contain data on anywhere from hundreds of thousands to a few million customers. To ensure the best possible efficiency of the marketing action, certain conditions have to be satisfied: among other things, the campaign has to be planned so as to present at most one offer to each prospective customer. The problem discussed here can be treated as a vector maximization problem. The consecutive components of the vector criterion function are the numbers of services sold of each kind. The campaign may have as its objective the creation of a plan which maximizes the profit or the total number of services sold. The problem can also be regarded as hierarchical; one can also apply other known scalarization approaches, which will be described here. The problems to be solved are large-scale binary linear programming problems. We present a possible solution of these problems using the R package.
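A toy instance makes the formulation concrete. The sketch below assumes a weighted-sum scalarization of the vector criterion and adds a hypothetical per-service call capacity; the abstract mentions only the one-offer-per-customer condition explicitly. Exhaustive enumeration is used for clarity and would not scale to the problem sizes described, where a dedicated solver is needed. All numbers are invented.

```python
from itertools import product


def best_campaign(sale_prob, weights, capacity):
    """Enumerate plans in which each customer receives at most one offer.

    A plan is a tuple with one entry per customer: a service index, or None
    for no offer. This encodes binary variables x[c][s] with sum_s x[c][s] <= 1.
    """
    n_services = len(weights)
    options = [None] + list(range(n_services))
    best_value, best_plan = -1.0, None
    for plan in product(options, repeat=len(sale_prob)):
        # Hypothetical side constraint: limited number of calls per service.
        if any(plan.count(s) > capacity[s] for s in range(n_services)):
            continue
        # Weighted-sum scalarization of the expected number of sales per service.
        value = sum(weights[s] * sale_prob[c][s]
                    for c, s in enumerate(plan) if s is not None)
        if value > best_value:
            best_value, best_plan = value, plan
    return best_plan, best_value


# Hypothetical probability that customer c buys service s when offered it.
sale_prob = [[0.9, 0.2], [0.8, 0.3], [0.7, 0.6], [0.1, 0.5]]
weights = [1.0, 1.0]   # equal weights: maximize the total expected number of sales
capacity = [2, 2]      # at most two offers per service

plan, value = best_campaign(sale_prob, weights, capacity)
print(plan, round(value, 2))
```

Replacing the equal weights with per-service profits turns the same scalarized objective into profit maximization, the other objective the abstract mentions.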

