Gene set enrichment for reproducible science: comparison of CERNO and eight other algorithms

2019 ◽  
Vol 35 (24) ◽  
pp. 5146-5154 ◽  
Author(s):  
Joanna Zyla ◽  
Michal Marczyk ◽  
Teresa Domaszewska ◽  
Stefan H E Kaufmann ◽  
Joanna Polanska ◽  
...  

Motivation: Analysis of gene set (GS) enrichment is an essential part of functional omics studies. Here, we complement the established evaluation metrics of GS enrichment algorithms with a novel approach to assess the practical reproducibility of scientific results obtained from GS enrichment tests when applied to related data from different studies. Results: We evaluated eight established algorithms and one novel algorithm for reproducibility, sensitivity, prioritization, false positive rate and computational time. In addition to the eight established algorithms, we included Coincident Extreme Ranks in Numerical Observations (CERNO), a flexible and fast algorithm based on modified Fisher P-value integration. Using real-world datasets, we demonstrate that CERNO is robust to ranking metrics, as well as to sample and GS size. CERNO had the highest reproducibility while remaining sensitive, specific and fast. In the overall ranking, Pathway Analysis with Down-weighting of Overlapping Genes, CERNO and over-representation analysis performed best, while CERNO and GeneSetTest scored high in terms of reproducibility. Availability and implementation: The tmod package implementing the CERNO algorithm is available from CRAN (cran.r-project.org/web/packages/tmod/index.html) and an online implementation can be found at http://tmod.online/. The datasets analyzed in this study are widely available in the KEGGdzPathwaysGEO and KEGGandMetacoreDzPathwaysGEO R packages and the GEO repository. Supplementary information: Supplementary data are available at Bioinformatics online.
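As a rough illustration of what Fisher-style P-value integration over ranks can look like, the sketch below assumes the commonly described form of the CERNO statistic: F = -2 Σ ln(r_i / N), compared against a chi-squared distribution with 2k degrees of freedom, where r_i are the ranks of the k gene-set members among N ranked genes. The function name and the toy gene labels are hypothetical, and this is not the tmod implementation.

```python
import math


def cerno_pvalue(ranked_genes, gene_set):
    """Return (statistic, p_value) for a CERNO-style rank test.

    Assumption: statistic = -2 * sum(ln(rank / N)) over gene-set members,
    referred to a chi-squared distribution with 2k degrees of freedom.
    """
    n = len(ranked_genes)
    # 1-based ranks of the genes that belong to the gene set.
    ranks = [i for i, g in enumerate(ranked_genes, start=1) if g in gene_set]
    k = len(ranks)
    if k == 0:
        return 0.0, 1.0  # no members ranked: nothing to test

    stat = -2.0 * sum(math.log(r / n) for r in ranks)

    # Chi-squared survival function for even degrees of freedom (2k)
    # has the closed form exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    half = stat / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return stat, math.exp(-half) * total


# Toy example: 100 hypothetical genes, ordered from most to least differential.
genes = [f"g{i}" for i in range(1, 101)]
top_stat, top_p = cerno_pvalue(genes, {"g1", "g2", "g3"})
bottom_stat, bottom_p = cerno_pvalue(genes, {"g98", "g99", "g100"})
```

Because the statistic depends only on ranks, any ranking metric can feed it, which is consistent with the abstract's remark that the method is robust to the choice of ranking metric.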

Author(s):  
Shashidhara Bola

A new method is proposed to classify lung nodules as benign or malignant. The method is based on analysis of lung nodule shape, contour, and texture for better classification. The data set consists of 39 lung nodules from 39 patients, of which 19 are benign and 20 are malignant. Lung regions are segmented using morphological operators, and lung nodules are detected using shape and area features. The proposed algorithm was tested on LIDC (Lung Image Database Consortium) datasets and the results were found to be satisfactory. The performance of the method in distinguishing benign from malignant nodules was evaluated by receiver operating characteristic (ROC) analysis. The method achieved an area under the ROC curve of 0.903, which reduces the false positive rate.


2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Gabriele Valvano ◽  
Gianmarco Santini ◽  
Nicola Martini ◽  
Andrea Ripoli ◽  
Chiara Iacconi ◽  
...  

Clusters of microcalcifications can be an early sign of breast cancer. In this paper, we propose a novel approach based on convolutional neural networks for the detection and segmentation of microcalcification clusters. We used 283 mammograms to train and validate our model, obtaining an accuracy of 99.99% on microcalcification detection and a false positive rate of 0.005%. Our results show that deep learning could be an effective tool to support radiologists during mammogram examination.


2004 ◽  
Author(s):  
Hesamoddin Jahanian ◽  
Hamid Soltanian-Zadeh ◽  
Gholam-Ali Hossein-Zadeh

2010 ◽  
Vol 08 (01) ◽  
pp. 99-115 ◽  
Author(s):  
JIYUAN AN ◽  
KWOK PUI CHOI ◽  
CHRISTINE A. WELLS ◽  
YI-PING PHOEBE CHEN

Background: Current miRNA target prediction tools share the problem that their false positive rate is high. This renders identification of co-regulating groups of miRNAs and target genes unreliable. In this study, we describe a procedure to identify highly probable co-regulating miRNAs and the corresponding co-regulated gene groups. Our procedure involves a sequence of statistical tests: (1) identify genes that are highly probable miRNA targets; (2) determine, for each such gene, the minimum number of miRNAs that co-regulate it with high probability; (3) find, for each such gene, the combination of miRNAs of the determined minimum size that co-regulates it with the lowest p-value; and (4) discover, for each such combination of miRNAs, the group of genes that are co-regulated by these miRNAs with the lowest p-value, computed from the GO term annotations of the genes. Results: Our method identifies 4-, 3- and 2-term miRNA groups that co-regulate gene groups of size at least 3 in human. Our results suggest some interesting hypotheses on the functional role of several miRNAs through "guilt by association" reasoning. For example, miR-130, miR-19 and miR-101 are known to be associated with neurodegenerative diseases. Our 3-term miRNA table shows that miR-130/19/101 form a co-regulating group of rank 22 (p-value = 1.16 × 10⁻²). Since miR-144 co-regulates with miR-130, miR-19 and miR-101 in a group of rank 4 (p-value = 1.16 × 10⁻²) in our 4-term miRNA table, hsa-miR-144 may also be a neurodegenerative disease-related miRNA. Conclusions: This work identifies highly probable co-regulating miRNAs, which are refined from the predictions of computational tools using (1) signal-to-noise ratio to obtain highly accurate regulating miRNAs for every gene, and (2) Gene Ontology to obtain functionally related co-regulating miRNA groups. Our results have been partly supported by biological experiments.
Based on predictions by TargetScanS, we identified highly probable target gene groups, which are listed in the Supplementary Information. This result might help biologists to find a small set of miRNAs for genes of interest rather than a very large one.


2017 ◽  
Author(s):  
Michele B. Nuijten ◽  
Marcel A. L. M. van Assen ◽  
Chris Hubertus Joseph Hartgerink ◽  
Sacha Epskamp ◽  
Jelte M. Wicherts

The R package “statcheck” (Epskamp & Nuijten, 2016) is a tool to extract statistical results from articles and check whether the reported p-value matches the accompanying test statistic and degrees of freedom. A previous study showed high interrater reliabilities (between .76 and .89) between statcheck and manual coding of inconsistencies (Nuijten, Hartgerink, Van Assen, Epskamp, & Wicherts, 2016). Here we present an additional, detailed study of the validity of statcheck. In Study 1, we calculated its sensitivity and specificity. We found that statcheck’s sensitivity (true positive rate) and specificity (true negative rate) were high: between 85.3% and 100%, and between 96.0% and 100%, respectively, depending on the assumptions and settings. The overall accuracy of statcheck ranged from 96.2% to 99.9%. In Study 2, we investigated statcheck’s ability to deal with statistical corrections for multiple testing or violations of assumptions in articles. We found that the prevalence of corrections for multiple testing or violations of assumptions in psychology was higher than we initially estimated in Nuijten et al. (2016). Although we found numerous reporting inconsistencies in results corrected for violations of the sphericity assumption, we demonstrate that inconsistencies associated with statistical corrections are not what is causing the high estimates of the prevalence of statistical reporting inconsistencies in psychology.
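The kind of consistency check described above can be sketched for the simplest case, a two-sided z test, where the p-value follows directly from the test statistic. This is a simplified Python illustration rather than statcheck's actual R code: the function names and the rounding tolerance are assumptions, and a real check must also handle t, F, chi-squared and r statistics together with their degrees of freedom.

```python
import math


def recomputed_p_from_z(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_consistent(z, reported_p, decimals=2):
    """Hypothetical check: does the reported p-value agree with the
    recomputed one once rounding to `decimals` places is allowed for?"""
    tolerance = 0.5 * 10 ** (-decimals)
    return abs(recomputed_p_from_z(z) - reported_p) <= tolerance


# "z = 1.96, p = .05" agrees with the recomputed value; "p = .01" does not.
ok = is_consistent(1.96, 0.05)
flagged = not is_consistent(1.96, 0.01)
```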


2019 ◽  
Author(s):  
L Cao ◽  
C Clish ◽  
FB Hu ◽  
MA Martínez-González ◽  
C Razquin ◽  
...  

Motivation: Large-scale untargeted metabolomics experiments lead to the detection of thousands of novel metabolic features as well as false positive artifacts. With the incorporation of pooled QC samples and corresponding bioinformatics algorithms, those measurement artifacts can be well controlled. However, it is impracticable for all studies to apply such an experimental design. Results: We introduce a post-alignment quality control method called genuMet, which relies solely on the injection order of biological samples to identify potentially false metabolic features. In terms of the missing pattern of metabolic signals, genuMet can reach a true negative rate of over 95% and a true positive rate of 85% with suitable parameters, compared with the algorithm utilizing pooled QC samples. genuMet makes it possible for studies without pooled QC samples to reduce false metabolic signals and perform robust statistical analysis. Availability and implementation: genuMet is implemented in an R package and available at https://github.com/liucaomics/genuMet under the GPL-v2 license. Contact: Liming Liang: [email protected] Supplementary information: Supplementary data are available at ….


2018 ◽  
Author(s):  
Farhad Maleki ◽  
Anthony J. Kusalik

Gene set analysis methods are widely used to analyze data from high-throughput “omics” technologies. One drawback of these methods is their low specificity or high false positive rate. Over-representation analysis is one of the most commonly used gene set analysis methods. In this paper, we propose a systematic approach to investigate the hypothesis that gene set overlap is an underlying cause of low specificity in over-representation analysis. We quantify gene set overlap and show that it is a ubiquitous phenomenon across gene set databases. Statistical analysis indicates a strong negative correlation between gene set overlap and the specificity of over-representation analysis. We conclude that gene set overlap is an underlying cause of the low specificity. This result highlights the importance of considering gene set overlap in gene set analysis and explains the lack of specificity of methods that ignore gene set overlap. This research also establishes the direction for developing new gene set analysis methods.
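For readers unfamiliar with the two quantities discussed here, the sketch below shows a textbook over-representation test (a one-sided hypergeometric tail probability) and one possible overlap measure between two gene sets. The abstract does not say which overlap measure the authors use, so the Jaccard index is an assumption, as are all function and variable names.

```python
from math import comb


def ora_pvalue(universe_size, set_size, n_selected, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric: the chance of seeing at
    least n_overlap gene-set members among n_selected genes drawn from a
    universe that contains set_size members."""
    total = comb(universe_size, n_selected)
    upper = min(set_size, n_selected)
    tail = sum(
        comb(set_size, x) * comb(universe_size - set_size, n_selected - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total


def jaccard_overlap(set_a, set_b):
    """Assumed overlap measure: |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy numbers: 10 genes in total, a gene set of 5, 5 genes selected,
# all 5 selected genes fall inside the gene set.
p_all_hits = ora_pvalue(10, 5, 5, 5)
overlap = jaccard_overlap({"a", "b", "c"}, {"b", "c", "d"})
```

Two gene sets with a high overlap share most of their members, so a hit in one tends to produce a hit in the other, which is the mechanism the abstract links to reduced specificity.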


2019 ◽  
Vol 35 (23) ◽  
pp. 4871-4878
Author(s):  
Peng Jiang ◽  
Jie Luo ◽  
Yiqi Wang ◽  
Pingji Deng ◽  
Bertil Schmidt ◽  
...  

Motivation: K-mers, along with their frequencies, have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters itself is large; very often, it is too large to fit into main memory, which severely limits its usability. Results: We introduce a novel idea of encoding k-mers as well as their frequencies, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure to encode counted k-mers by coupled-bit arrays: one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 compared with raw input, with 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. Availability and implementation: The source code of our algorithm is available at github.com/lzhLab/kmcEx. Supplementary information: Supplementary data are available at Bioinformatics online.
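To make the "Bloom filter-like" idea concrete, here is a minimal sketch of an ordinary Bloom filter that stores k-mer membership only. It is not the coupled-bit-array structure of kmcEx and does not encode frequencies; the class name, bit-array size and hashing scheme are assumptions chosen for illustration, with the 7 hash functions taken from the abstract.

```python
import hashlib


class KmerBloomFilter:
    """Plain Bloom filter for k-mer membership (illustrative only)."""

    def __init__(self, n_bits=1 << 16, n_hashes=7):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive n_hashes bit positions by salting one hash function.
        data = kmer.encode("ascii")
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(kmer)
        )


def kmers(sequence, k=31):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


sequence = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGGATCCGATTACA"
bloom = KmerBloomFilter()
for km in kmers(sequence):
    bloom.add(km)
```

Lookups cost a fixed number of hash evaluations regardless of how many k-mers are stored, which matches the constant retrieval time mentioned in the abstract; the trade-off is a nonzero false-positive rate that depends on the bit-array size and the number of hash functions.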


2020 ◽  
Vol 12 (3) ◽  
pp. 442 ◽  
Author(s):  
Jesús Balado ◽  
Elena González ◽  
Pedro Arias ◽  
David Castro

Traffic signs are a key element in driver safety. Governments invest a great amount of resources in keeping traffic signs in good condition, for which a correct inventory is necessary. This work presents a novel method for mapping traffic signs based on data acquired with an MMS (Mobile Mapping System): images and point clouds. On the one hand, images are faster to process, and artificial intelligence techniques, specifically Convolutional Neural Networks, are better optimized for images than for point clouds. On the other hand, point clouds allow more exact positioning than the exclusive use of images. The false positive rate per image is only 0.004. First, traffic signs are detected in the images obtained by the 360° camera of the MMS through RetinaNet and they are classified by their corresponding InceptionV3 network. The signs are then positioned in the georeferenced point cloud by means of a projection from the images according to the pinhole model. Finally, duplicate geolocalized signs detected in multiple images are filtered. The method has been tested in two real case studies with 214 images, where 89.7% of the signs have been correctly detected, of which 92.5% have been correctly classified and 97.5% have been located with an error of less than 0.5 m. This sequence, which uses images for detection and classification and then point clouds for georeferencing, optimizes processing time and allows the method to be included in a company’s production process. The method runs automatically and takes advantage of the strengths of each data type.
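The pinhole model mentioned above relates a 3D point in the camera frame to a pixel. The sketch below shows only the standard forward projection, assuming an ideal camera without lens distortion; the intrinsic parameters (fx, fy, cx, cy) and the example values are hypothetical, and the paper's handling of 360° imagery and georeferencing is not reproduced.

```python
def project_pinhole(point_cam, fx, fy, cx, cy):
    """Project a 3D point (x, y, z), given in camera coordinates with z
    pointing forward, to pixel coordinates (u, v) with an ideal pinhole
    model: u = fx * x / z + cx, v = fy * y / z + cy."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics: 1000 px focal length, principal point (640, 360).
centre = project_pinhole((0.0, 0.0, 5.0), 1000.0, 1000.0, 640.0, 360.0)
offset = project_pinhole((1.0, 2.0, 10.0), 1000.0, 1000.0, 640.0, 360.0)
```

Going from a detected sign in the image to a position in the point cloud amounts to finding the cloud points whose projection falls inside the detection box, which is one plausible reading of the positioning step.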


2019 ◽  
Vol 35 (17) ◽  
pp. 3046-3054 ◽  
Author(s):  
Anastasia Gurinovich ◽  
Harold Bae ◽  
John J Farrell ◽  
Stacy L Andersen ◽  
Stefano Monti ◽  
...  

Motivation: Over the last decade, more diverse populations have been included in genome-wide association studies. If a genetic variant has a varying effect on a phenotype in different populations, genome-wide association studies applied to a dataset as a whole may not pinpoint such differences. It is especially important to be able to identify population-specific effects of genetic variants in studies that would eventually lead to development of diagnostic tests or drug discovery. Results: In this paper, we propose PopCluster: an algorithm to automatically discover subsets of individuals in which the genetic effects of a variant are statistically different. PopCluster provides a simple framework to directly analyze genotype data without prior knowledge of subjects’ ethnicities. PopCluster combines logistic regression modeling, principal component analysis, hierarchical clustering and a recursive bottom-up tree parsing procedure. The evaluation of PopCluster suggests that the algorithm has a stable low false positive rate (∼4%) and high true positive rate (>80%) in simulations with large differences in allele frequencies between cases and controls. Application of PopCluster to data from genetic studies of longevity discovers ethnicity-dependent heterogeneity in the association of rs3764814 (USP42) with the phenotype. Availability and implementation: PopCluster was implemented using the R programming language, PLINK and Eigensoft software, and can be found at the following GitHub repository: https://github.com/gurinovich/PopCluster with instructions on its installation and usage. Supplementary information: Supplementary data are available at Bioinformatics online.

