StatBreak: Identifying “Lucky” Data Points Through Genetic Algorithms

2020 ◽  
Vol 3 (2) ◽  
pp. 216-228
Author(s):  
Hannes Rosenbusch ◽  
Leon P. Hilbert ◽  
Anthony M. Evans ◽  
Marcel Zeelenberg

Sometimes interesting statistical findings are produced by a small number of “lucky” data points within the tested sample. To address this issue, researchers and reviewers are encouraged to investigate outliers and influential data points. Here, we present StatBreak, an easy-to-apply method, based on a genetic algorithm, that identifies the observations that most strongly contributed to a finding (e.g., effect size, model fit, p value, Bayes factor). Within a given sample, StatBreak searches for the largest subsample in which a previously observed pattern is not present or is reduced below a specifiable threshold. Thus, it answers the following question: “Which (and how few) ‘lucky’ cases would need to be excluded from the sample for the data-based conclusion to change?” StatBreak consists of a simple R function and flags the luckiest data points for any form of statistical analysis. Here, we demonstrate the effectiveness of the method with simulated and real data across a range of study designs and analyses. Additionally, we describe StatBreak’s R function and explain how researchers and reviewers can apply the method to the data they are working with.
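The question StatBreak answers can be illustrated with a small genetic search in Python. This is a hedged sketch of the general idea only, not the authors' R implementation: the data, threshold, and GA settings below are invented for illustration. Candidate exclusion sets are encoded as boolean masks, and fitness rewards masks that push the observed correlation below a threshold while excluding as few points as possible.

```python
import random

def corr(xs, ys):
    # Pearson correlation of two equal-length lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def statbreak_like(xs, ys, threshold=0.1, pop=40, gens=60, seed=1):
    """Search for a small exclusion set pushing |r| below `threshold`."""
    rng = random.Random(seed)
    n = len(xs)

    def fitness(mask):
        keep = [i for i in range(n) if not mask[i]]
        if len(keep) < 3:
            return float("inf")
        r = abs(corr([xs[i] for i in keep], [ys[i] for i in keep]))
        # Below threshold: prefer few exclusions; otherwise push |r| down.
        return sum(mask) if r < threshold else 1000 * r + sum(mask)

    population = [[rng.random() < 0.1 for _ in range(n)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness)
        parents = population[: pop // 2]          # elitist selection
        children = []
        for _ in range(pop - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(n)
            child = a[:cut] + b[cut:]             # one-point crossover
            j = rng.randrange(n)
            child[j] = not child[j]               # single-bit mutation
            children.append(child)
        population = parents + children
    best = min(population, key=fitness)
    return [i for i in range(n) if best[i]]

# Six points with zero correlation plus two "lucky" leverage points
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 11.0]
ys = [2.0, 1.0, 3.0, 3.0, 1.0, 2.0, 10.0, 11.0]
lucky = statbreak_like(xs, ys)
print(lucky)   # indices of the flagged points
```

On this toy sample, the two leverage points at indices 6 and 7 are the ones whose removal collapses the correlation, so any mask meeting the threshold must exclude both.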

2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Chen Li ◽  
Gong Zeng-tai ◽  
Duan Gang

Fuzzy measures and fuzzy integrals have been successfully used in many real applications, but determining fuzzy measures is a very difficult problem in these applications. Although several methodologies exist for solving this problem, such as genetic algorithms, gradient descent algorithms, neural networks, and particle swarm optimization, it is hard to say which is more appropriate and more feasible; each method has its advantages. Most existing work can only deal with data consisting of crisp (classical) numbers, which limits practical applications: it is not reasonable to assume that all elicited data will be crisp, and fuzzy data do arise, for example in pharmacological, financial, and sociological applications. We therefore attempt to determine a more generalized type of general fuzzy measure from fuzzy data by means of genetic algorithms and Choquet integrals. In this paper, we make the first effort to define the σ-λ rules. Furthermore, we define and characterize the Choquet integrals of interval-valued functions and fuzzy-number-valued functions based on σ-λ rules. In addition, we design a special genetic algorithm to determine a type of general fuzzy measure from fuzzy data.
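For readers unfamiliar with the machinery, the discrete Choquet integral with respect to a Sugeno λ-measure can be sketched in a few lines of Python. This is a simplified, crisp-valued illustration with invented numbers; the paper's interval-valued and fuzzy-number-valued integrals and the genetic algorithm for learning measures are beyond this sketch.

```python
def lambda_measure(subset, densities, lam):
    """Sugeno λ-measure of a subset of criteria indices.
    For λ = 0 the measure is simply additive."""
    if lam == 0:
        return sum(densities[i] for i in subset)
    prod = 1.0
    for i in subset:
        prod *= 1.0 + lam * densities[i]
    return (prod - 1.0) / lam

def choquet(values, densities, lam=0.0):
    """Discrete Choquet integral of non-negative `values` w.r.t. a λ-measure."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    total = 0.0
    for k in range(n):
        cur = values[order[k]]
        nxt = values[order[k + 1]] if k + 1 < n else 0.0
        # Level difference times the measure of the top-k criteria
        total += (cur - nxt) * lambda_measure(order[: k + 1], densities, lam)
    return total

# With an additive measure (λ = 0) the Choquet integral reduces to a
# weighted mean: 0.4*0.5 + 0.9*0.3 + 0.1*0.2 = 0.49
print(choquet([0.4, 0.9, 0.1], [0.5, 0.3, 0.2]))
```

With λ ≠ 0 the measure becomes non-additive, which is what lets the integral model interaction between criteria.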


2019 ◽  
Vol 35 (22) ◽  
pp. 4837-4839 ◽  
Author(s):  
Hanna Julienne ◽  
Huwenbo Shi ◽  
Bogdan Pasaniuc ◽  
Hugues Aschard

Abstract
Motivation: Multi-trait analyses using public summary statistics from genome-wide association studies (GWASs) are becoming increasingly popular. A constraint of multi-trait methods is that they require complete summary data for all traits. Although methods for the imputation of summary statistics exist, they lack precision for genetic variants with small effect sizes. This is benign for univariate analyses, where only variants with large effect sizes are selected a posteriori, but it can lead to strong p-value inflation in multi-trait testing. Here we present a new approach that improves on existing imputation methods and reaches a precision suitable for multi-trait analyses.
Results: We fine-tuned parameters to obtain very high imputation accuracy from summary statistics. We demonstrate this accuracy for variants of all effect sizes on real data from 28 GWASs. We implemented the resulting methodology in a Python package specially designed to efficiently impute multiple GWASs in parallel.
Availability and implementation: The Python package is available at https://gitlab.pasteur.fr/statistical-genetics/raiss; its accompanying documentation is accessible at http://statistical-genetics.pages.pasteur.fr/raiss/.
Supplementary information: Supplementary data are available at Bioinformatics online.
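The core of such summary-statistic imputation is the conditional mean of a multivariate normal under the local linkage-disequilibrium (LD) structure: z_untyped ≈ LD_uo · (LD_oo + εI)⁻¹ · z_obs. The pure-Python sketch below illustrates that formula only; the LD values, ridge default, and helper names are assumptions for illustration, not RAISS's actual code or tuned parameters.

```python
def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def impute_z(z_obs, ld_oo, ld_uo, ridge=0.0):
    """Conditional-mean imputation of untyped z-scores from typed ones:
    z_u ≈ LD_uo · (LD_oo + ridge·I)⁻¹ · z_o.  Real methods regularise the
    inversion with ridge > 0; 0 is used here for the exact toy case."""
    n = len(z_obs)
    a = [[ld_oo[i][j] + (ridge if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    w = solve(a, z_obs)
    return [sum(ld_uo[u][j] * w[j] for j in range(n)) for u in range(len(ld_uo))]

# Untyped variant in LD 0.6 / 0.8 with two independent typed variants
print(impute_z([2.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [[0.6, 0.8]]))
```

A variant in perfect LD (r = 1) with a single typed variant simply inherits its z-score, which is a useful sanity check on any implementation of this formula.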


Author(s):  
Salem Alawbathani ◽  
Mehreen Batool ◽  
Jan Fleckhaus ◽  
Sarkawt Hamad ◽  
Floyd Hassenrück ◽  
...  

Abstract
A poor understanding of statistical analysis has been proposed as a key reason for the lack of replicability of many studies in experimental biomedicine. While several authors have demonstrated the fickleness of calculated p values through simulations, we have found that such simulations are difficult for many biomedical scientists to understand and often do not lead to a sound grasp of the role of between-sample variability in statistical analysis. Therefore, as trainees and trainers in a statistics course for biomedical scientists, we used real data from a large published study to develop a tool that allows scientists to experience the fickleness of p values directly. The tool, based on a commonly used software package, draws random samples from real data; it is described here and, together with the underlying database, made available. It has been tested successfully in multiple other groups of biomedical scientists, and it also lets trainees experience the impact of randomness, sample size, and the choice of statistical test on measured p values. We propose that live exercises based on real data will be more impactful in training biomedical scientists in statistical concepts.
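The kind of exercise the tool supports can be approximated in a few lines of Python. This is a hedged analogue of the idea (repeated random samples from the same populations), not the published tool, which is built on a commonly used software package; the normal-approximation p value and the population parameters are assumptions made for brevity.

```python
import random
import statistics
from statistics import NormalDist

def approx_p(a, b):
    # Two-sided p for a mean difference, large-sample normal approximation
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

rng = random.Random(7)
pvals = []
for _ in range(20):
    # Same two populations every time: the true effect is d = 0.5
    a = [rng.gauss(0.0, 1.0) for _ in range(30)]
    b = [rng.gauss(0.5, 1.0) for _ in range(30)]
    pvals.append(approx_p(a, b))
print(f"p ranges from {min(pvals):.4f} to {max(pvals):.4f} across 20 samples")
```

Although every sample comes from the same pair of populations, the 20 p values scatter widely, which is exactly the "fickleness" trainees are meant to experience.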


2020 ◽  
Author(s):  
Gustavo Pacheco ◽  
Eduardo Palmeira ◽  
Keiji Yamanaka

Keyboards are currently the most common means of communicating with computers. Although QWERTY is the most widely used keyboard layout, various concerns have been raised about its effectiveness: it is not efficient for English (its target language) or, in fact, for other languages. This paper therefore presents the development of a genetic algorithm intended to generate a more adequate and coherent layout for Brazilian Portuguese, with a focus on ergonomics and user productivity. Using five ergonomic criteria and a statistical analysis of the most frequently used characters and character pairs in Brazilian Portuguese, a layout approximately 53% better than QWERTY by these criteria was obtained.
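The structure of such an optimization can be sketched generically: encode a layout as a permutation of letters over key positions and let a genetic algorithm minimize a frequency-weighted effort cost. The frequencies, effort scores, and GA settings below are invented for illustration, not the paper's five ergonomic criteria or measured Brazilian-Portuguese statistics.

```python
import random

# Toy instance: 5 letters on 5 keys, each key with an assumed effort score
# (lower = easier to reach). Frequencies are illustrative, not measured.
FREQS = {"a": 14.6, "e": 12.6, "o": 10.7, "s": 7.8, "r": 6.5}
EFFORT = [1.0, 1.2, 1.7, 2.3, 3.0]

def cost(layout):
    # layout[i] = letter assigned to key i; lower total effort is better
    return sum(FREQS[ch] * EFFORT[i] for i, ch in enumerate(layout))

def evolve(pop_size=20, gens=200, seed=3):
    rng = random.Random(seed)
    letters = list(FREQS)
    population = []
    for _ in range(pop_size):
        p = letters[:]
        rng.shuffle(p)
        population.append(p)
    for _ in range(gens):
        population.sort(key=cost)
        parents = population[: pop_size // 2]     # elitist selection
        children = []
        for _ in range(pop_size - len(parents)):
            child = rng.choice(parents)[:]
            i, j = rng.sample(range(len(child)), 2)
            child[i], child[j] = child[j], child[i]   # swap mutation
            children.append(child)
        population = parents + children
    return min(population, key=cost)

best = evolve()
print(best, cost(best))
```

On this separable toy cost the optimum is simply the most frequent letters on the cheapest keys; a real layout optimizer adds bigram (same-finger, hand-alternation) terms, which is where genetic search genuinely pays off.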


Author(s):  
Firas Shawkat Hamid

Multivariate data analysis includes principal component analysis, a common technique that converts a large number of correlated variables into a smaller number of uncorrelated components. When anomalous values are present, which can be detected in several ways, basing the analysis on the ordinary variance-covariance or correlation matrix can lead to misleading results in principal component analysis. Many phenomena involve a large set of variables that is difficult to handle directly, and interpreting these variables becomes a complex process; reducing them to a smaller set that is easier to work with is the aspiration of every researcher working in principal component or factor analysis. Motivated by technological developments that allow simultaneous audio and video interaction, this research collected multivariate data to study and evaluate the efficiency of e-learning. The real data were analyzed with factor analysis via the principal component method, one of the techniques used to summarize and condense data, using the SPSS (Statistical Package for the Social Sciences) software. The paper thus also touches on the concept of data mining, which was then implemented with genetic algorithms in the simulation program MATLAB, and multiple linear regression was used to rank the independent variables by calculating each variable's weight.
Overall results were obtained for the eigenvalues of the correlation matrix and for the rotated factor matrix. The study required carrying out the statistical analysis in this way, reducing the number of variables without losing much of the information in the original variables, with the aim of simplifying their understanding and revealing their structure and interpretation. A set of conclusions is discussed in detail, together with important recommendations.
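The eigenvalue computation at the heart of this kind of analysis can be sketched without any statistics package: build the correlation matrix and extract its leading eigenvalue by power iteration. The data below are invented and deliberately near-collinear; this is a minimal illustration, not the study's SPSS or MATLAB workflow.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def first_component(cols, iters=200):
    """Leading eigenvalue/eigenvector of the correlation matrix by power
    iteration -- the first principal component that PCA extracts."""
    p = len(cols)                       # cols: list of variables (columns)
    R = [[pearson(cols[i], cols[j]) for j in range(p)] for i in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(R[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(R[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v

cols = [[1.0, 2.0, 3.0, 4.0, 5.0],
        [2.1, 3.9, 6.2, 8.1, 9.9],
        [0.9, 2.1, 2.9, 4.2, 5.1]]
lam, v = first_component(cols)
print(lam, lam / 3.0)   # eigenvalue and its share of total variance
```

Because the trace of a correlation matrix equals the number of variables, lam / p is directly the proportion of variance explained by the first component, the quantity reported alongside the eigenvalue table.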


2019 ◽  
Vol 28 (4) ◽  
pp. 468-485 ◽  
Author(s):  
Paul HP Hanel ◽  
David MA Mehler

Transparent communication of research is key to fostering understanding within and beyond the scientific community. An increased focus on reporting effect sizes, in addition to p value–based significance statements or Bayes factors, may improve scientific communication with the general public. Across three studies (N = 652), we compared subjective informativeness ratings for five effect sizes, a Bayes factor, and commonly used significance statements. Results showed that Cohen’s U3 was rated as most informative. For example, 440 participants (69%) found U3 more informative than Cohen’s d, while 95 (15%) found d more informative than U3, and 99 participants (16%) found both effect sizes equally informative. This effect was not moderated by level of education. We therefore suggest that, in general, Cohen’s U3 be used when scientific findings are communicated, although the choice of effect size may vary depending on what a researcher wants to highlight (e.g. differences or similarities).
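Under the usual assumption of normal distributions with equal variances, Cohen's U3 is simply the standard normal CDF evaluated at d, so converting a reported d to the more interpretable U3 is a one-liner (a sketch of the standard conversion, not the studies' materials):

```python
from statistics import NormalDist

def u3_from_d(d):
    """Cohen's U3: the proportion of one group that lies above the other
    group's mean, assuming normal distributions with equal variances."""
    return NormalDist().cdf(d)

# A "large" effect of d = 0.8 means ~79% of one group exceeds the
# other group's mean
print(u3_from_d(0.8))   # ≈ 0.788
```

This is why U3 tends to be rated as more informative: "79% of the treatment group scored above the control mean" is a statement lay readers can picture, whereas d = 0.8 is not.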


2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Riko Kelter

Abstract
Objectives: The data presented herein represent the simulated datasets of a recently conducted larger study that investigated the behaviour of Bayesian indices of significance and effect size as alternatives to traditional p-values. The study considered the setting of Student’s and Welch’s two-sample t-tests, often used in medical research, and investigated the influence of sample size, noise, the selected prior hyperparameters and the sensitivity to type I errors. The posterior indices used included the Bayes factor, the region of practical equivalence, the probability of direction, the MAP-based p-value and the e-value in the Full Bayesian Significance Test. The simulation study was conducted in the statistical programming language R.
Data description: The R script files for simulating the datasets used in the study are presented in this article. These script files can both simulate the raw datasets and run the analyses. As researchers may face different effect sizes, noise levels or priors in their domain than the ones studied in the original paper, the scripts extend the original results by allowing all analyses of interest to be recreated in different contexts. They should therefore be relevant to other researchers.
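Two of the indices compared, the probability of direction and the proportion of the posterior inside a region of practical equivalence (ROPE), have simple closed forms for a normal posterior. The Python sketch below states the definitions only; the posterior parameters and ROPE bounds are illustrative assumptions, not values from the article's R scripts.

```python
from statistics import NormalDist

def posterior_indices(mean, sd, rope=(-0.1, 0.1)):
    """Probability of direction and %-in-ROPE for an effect with a normal
    posterior δ ~ N(mean, sd); an analytic stand-in for the sample-based
    indices compared in the study."""
    post = NormalDist(mean, sd)
    p_pos = 1.0 - post.cdf(0.0)
    pd = max(p_pos, 1.0 - p_pos)                  # probability of direction
    in_rope = post.cdf(rope[1]) - post.cdf(rope[0])
    return pd, in_rope

pd, rope_frac = posterior_indices(0.4, 0.2)
print(pd, rope_frac)
```

For a posterior centred at 0.4 with sd 0.2, pd ≈ 0.977 (the effect is almost certainly positive) while only about 6% of the posterior mass falls inside the ±0.1 ROPE, so the two indices answer different questions: direction versus practical relevance.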


2018 ◽  
Author(s):  
Hanna Julienne ◽  
Huwenbo Shi ◽  
Bogdan Pasaniuc ◽  
Hugues Aschard

Abstract
Motivation: Multi-trait analyses using public summary statistics from genome-wide association studies (GWAS) are becoming increasingly popular. A constraint of multi-trait methods is that they require complete summary data for all traits. While methods for the imputation of summary statistics exist, they lack precision for genetic variants with small effect sizes. This is benign for univariate analyses, where only variants with large effect sizes are selected a posteriori, but it can lead to strong p-value inflation in multi-trait testing. Here we present a new approach that improves on existing imputation methods and reaches a precision suitable for multi-trait analyses.
Results: We fine-tuned parameters to obtain very high imputation accuracy from summary statistics. We demonstrate this accuracy for small-effect variants on real data of 28 GWAS. We implemented the resulting methodology in a Python package specially designed to efficiently impute multiple GWAS in parallel.
Availability: The Python package is available at https://gitlab.pasteur.fr/statistical-genetics/raiss; its accompanying documentation is accessible at http://statistical-genetics.pages.pasteur.fr/raiss/.
Contact: [email protected]


2012 ◽  
Vol 17 (4) ◽  
pp. 241-244
Author(s):  
Cezary Draus ◽  
Grzegorz Nowak ◽  
Maciej Nowak ◽  
Marcin Tokarski

Abstract The ability to obtain a desired product color and to ensure its repeatability in the production process is highly desirable in many industries, such as the printing, automobile, dyeing, textile, cosmetics and plastics industries. So far, most companies have traditionally used a "manual" method, relying on the intuition and experience of a colorist. However, the manual preparation of multiple samples and their correction can be very time-consuming and expensive. Computer technology has allowed the development of software to support the color-matching process. Nowadays, color formulation is done with appropriate equipment (colorimeters, spectrophotometers, computers) and dedicated software. Computer-aided formulation is much faster and cheaper than manual formulation, because fewer corrective iterations have to be carried out to achieve the desired result. Moreover, the colors are analyzed with regard to metamerism, and the best recipe can be chosen according to specific criteria (price, quantity, availability). The optimization problem of color formulation can be solved in many different ways; the authors decided to apply genetic algorithms in this domain.


2018 ◽  
Author(s):  
Steen Lysgaard ◽  
Paul C. Jennings ◽  
Jens Strabo Hummelshøj ◽  
Thomas Bligaard ◽  
Tejs Vegge

A machine learning model is used as a surrogate fitness evaluator in a genetic algorithm (GA) optimization of the atomic distribution of Pt-Au nanoparticles. The machine learning accelerated genetic algorithm (MLaGA) yields a 50-fold reduction of required energy calculations compared to a traditional GA.
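The economics of a surrogate-assisted GA can be sketched generically: a cheap model, fitted on every candidate evaluated so far, screens offspring so that only the most promising receive the expensive evaluation. Everything below is an invented toy (a quadratic stand-in for the energy, a nearest-neighbour surrogate, made-up settings), not the authors' Pt-Au model or the actual MLaGA code.

```python
import random

def expensive_energy(x):
    # Stand-in for an expensive energy calculation (toy quadratic bowl)
    return sum((xi - 0.3) ** 2 for xi in x)

def surrogate_predict(x, archive):
    # 1-nearest-neighbour surrogate fitted on all evaluated candidates
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(archive, key=lambda rec: dist(rec[0], x))[1]

def surrogate_ga(dim=4, pop=20, gens=15, screen=3, seed=5):
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(dim)] for _ in range(pop)]
    archive = [(x, expensive_energy(x)) for x in population]  # initial evals
    for _ in range(gens):
        candidates = []
        for _ in range(pop):                       # breed offspring cheaply
            a, b = rng.sample(population, 2)
            cut = rng.randrange(dim)
            child = a[:cut] + b[cut:]
            child[rng.randrange(dim)] += rng.gauss(0, 0.1)
            candidates.append(child)
        # Spend expensive evaluations only on the surrogate's top picks
        candidates.sort(key=lambda x: surrogate_predict(x, archive))
        for x in candidates[:screen]:
            archive.append((x, expensive_energy(x)))
        population = [x for x, _ in sorted(archive, key=lambda r: r[1])[:pop]]
    best = min(archive, key=lambda r: r[1])
    return best, len(archive)

(best_x, best_e), n_calls = surrogate_ga()
print(best_e, n_calls)
```

Here 20 + 15×3 = 65 expensive calls replace the 300 a plain generational GA of the same size would need, which is the kind of reduction (50-fold in the paper, on real energy calculations) that makes the approach attractive.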

