Detection of Non-Normality in Data Sets and Comparison between Different Normality Tests

Author(s):  
Biu O. Emmanuel ◽  
Nwakuya T. Maureen ◽  
Nduka Wonu

The paper compares five tests of data normality at different sample sizes: the Shapiro-Wilk (SW), Anderson-Darling (AD), Kolmogorov-Smirnov (KS), Ryan-Joiner (RJ), and Jarque-Bera (JB) tests. These tests were applied to two secondary data sets, one large (n = 155) and one small (n = 40), and then to simulated standard normal N(0,1) data sets, with large samples of sizes 150, 140, 130, 130, 110, and 100 and small samples of sizes 40, 35, 30, 25, 20, 15, and 10, at two levels of significance (5% and 10%). The aim of the paper is to detect non-normality and compare the performance of the different normality tests considered. For the simulated large samples at the 5% level of significance, the KS test was the most powerful, as it detected that the data do not follow a normal distribution; for the simulated small samples at the 5% level, the JB test was the most powerful. The paper therefore recommends the JB test for normality testing when the sample size is small and the KS test when the sample size is large, at the 5% level of significance.
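
Most of the tests named above are available in common statistics libraries. A minimal sketch of such a comparison on simulated N(0,1) samples, assuming SciPy's `shapiro`, `kstest`, and `jarque_bera` (the Ryan-Joiner test has no standard SciPy implementation, and SciPy's Anderson-Darling test reports critical values rather than a p-value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def normality_pvalues(x):
    """p-values of the SW, KS (parameters estimated from the sample)
    and JB normality tests for a 1-D sample."""
    x = np.asarray(x)
    return {
        "SW": stats.shapiro(x).pvalue,
        "KS": stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))).pvalue,
        "JB": stats.jarque_bera(x).pvalue,
    }

# A few sample sizes taken from the paper's simulation scenarios
for n in (150, 100, 40, 10):
    x = rng.standard_normal(n)
    decisions = {t: ("reject" if p < 0.05 else "keep")
                 for t, p in normality_pvalues(x).items()}
    print(n, decisions)
```

Note that plugging estimated parameters into the plain KS test is a common but biased usage; the Lilliefors correction discussed in a later entry addresses this.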

2014 ◽  
Vol 2 (2) ◽  
Author(s):  
Jorge Mario Insignares Movilla

In situations where the size of the sample data set is relatively small, assuming a normal distribution introduces some uncertainties. One mistake is to use random sampling; the other is the small sample size itself. For that reason, this paper begins with the story of the generalities behind the t distribution that solved this problem, then covers the topics that support it, and finally gives a detailed analysis of some of its relationships with other distributions, without ignoring its importance for hypothesis testing in statistical inference when means are contrasted.


2011 ◽  
Vol 6 (2) ◽  
pp. 252-277 ◽  
Author(s):  
Stephen T. Ziliak

Abstract Student's exacting theory of errors, both random and real, marked a significant advance over ambiguous reports of plant life and fermentation asserted by chemists from Priestley and Lavoisier down to Pasteur and Johannsen, working at the Carlsberg Laboratory. One reason seems to be that William Sealy Gosset (1876–1937) aka “Student” – he of Student's t-table and test of statistical significance – rejected artificial rules about sample size, experimental design, and the level of significance, and took instead an economic approach to the logic of decisions made under uncertainty. In his job as Apprentice Brewer, Head Experimental Brewer, and finally Head Brewer of Guinness, Student produced small samples of experimental barley, malt, and hops, seeking guidance for industrial quality control and maximum expected profit at the large scale brewery. In the process Student invented or inspired half of modern statistics. This article draws on original archival evidence, shedding light on several core yet neglected aspects of Student's methods, that is, Guinnessometrics, not discussed by Ronald A. Fisher (1890–1962). The focus is on Student's small sample, economic approach to real error minimization, particularly in field and laboratory experiments he conducted on barley and malt, 1904 to 1937. Balanced designs of experiments, he found, are more efficient than random and have higher power to detect large and real treatment differences in a series of repeated and independent experiments. Student's world-class achievement poses a challenge to every science. Should statistical methods – such as the choice of sample size, experimental design, and level of significance – follow the purpose of the experiment, rather than the other way around? (JEL classification codes: C10, C90, C93, L66)


2020 ◽  
Vol 16 (3) ◽  
pp. 1061-1074 ◽  
Author(s):  
Jörg Franke ◽  
Veronika Valler ◽  
Stefan Brönnimann ◽  
Raphael Neukom ◽  
Fernando Jaume-Santero

Abstract. Differences between paleoclimatic reconstructions are caused by two factors: the method and the input data. While many studies compare methods, we will focus in this study on the consequences of the input data choice in a state-of-the-art Kalman-filter paleoclimate data assimilation approach. We evaluate reconstruction quality in the 20th century based on three collections of tree-ring records: (1) 54 of the best temperature-sensitive tree-ring chronologies chosen by experts; (2) 415 temperature-sensitive tree-ring records chosen less strictly by regional working groups and statistical screening; (3) 2287 tree-ring series that are not screened for climate sensitivity. The three data sets cover the range from small sample size, small spatial coverage and strict screening for temperature sensitivity to large sample size and spatial coverage but no screening. Additionally, we explore a combination of these data sets plus screening methods to improve the reconstruction quality. A large, unscreened collection generally leads to a poor reconstruction skill. A small expert selection of extratropical Northern Hemisphere records allows for a skillful high-latitude temperature reconstruction but cannot be expected to provide information for other regions and other variables. We achieve the best reconstruction skill across all variables and regions by combining all available input data but rejecting records with insignificant climatic information (p value of regression model >0.05) and removing duplicate records. It is important to use a tree-ring proxy system model that includes both major growth limitations, temperature and moisture.
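
The screening step described above, rejecting records whose regression against an instrumental target is insignificant (p > 0.05), can be sketched as follows; the record names and synthetic series are illustrative stand-ins, not the actual tree-ring data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
years = 100
temperature = rng.standard_normal(years)  # instrumental target series

# Synthetic "tree-ring" records: some temperature-sensitive, one pure noise
records = {
    "sensitive_1": 0.8 * temperature + 0.3 * rng.standard_normal(years),
    "sensitive_2": 0.5 * temperature + 0.5 * rng.standard_normal(years),
    "noise_1": rng.standard_normal(years),
}

def screen(records, target, alpha=0.05):
    """Keep only records whose linear regression on the target
    has a significant slope (p <= alpha)."""
    kept = {}
    for name, series in records.items():
        result = stats.linregress(target, series)
        if result.pvalue <= alpha:
            kept[name] = series
    return kept

kept = screen(records, temperature)
print(sorted(kept))
```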


2018 ◽  
Author(s):  
Arghavan Bahadorinejad ◽  
Ivan Ivanov ◽  
Johanna W Lampe ◽  
Meredith AJ Hullar ◽  
Robert S Chapkin ◽  
...  

Abstract We propose a Bayesian method for the classification of 16S rRNA metagenomic profiles of bacterial abundance, by introducing a Poisson-Dirichlet-Multinomial hierarchical model for the sequencing data, constructing a prior distribution from sample data, calculating the posterior distribution in closed form, and deriving an Optimal Bayesian Classifier (OBC). The proposed algorithm is compared to state-of-the-art classification methods for 16S rRNA metagenomic data, including Random Forests and the phylogeny-based Metaphyl algorithm, for varying sample size, classification difficulty, and dimensionality (number of OTUs), using both synthetic and real metagenomic data sets. The results demonstrate that the proposed OBC method, with either noninformative or constructed priors, is competitive or superior to the other methods. In particular, in the case where the ratio of sample size to dimensionality is small, the proposed method can vastly outperform the others. Author summary: Recent studies have highlighted the interplay between host genetics, gut microbes, and colorectal tumor initiation/progression. The characterization of microbial communities using metagenomic profiling has therefore received renewed interest. In this paper, we propose a method for classification, i.e., prediction of different outcomes, based on 16S rRNA metagenomic data. The proposed method employs a Bayesian approach, which is suitable for data sets with a small ratio of the number of available instances to the dimensionality. Results using both synthetic and real metagenomic data show that the proposed method can outperform other state-of-the-art metagenomic classification algorithms.
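
As a rough illustration only (not the authors' OBC, which uses a Poisson-Dirichlet-Multinomial hierarchy and constructed priors), a minimal Dirichlet-multinomial classifier with the closed-form marginal likelihood might look like this; all names and the toy data are assumptions:

```python
import numpy as np
from scipy.special import gammaln

def dm_log_marginal(x, alpha):
    """log P(x | alpha) under the Dirichlet-multinomial distribution."""
    n = x.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()
            + gammaln(alpha.sum()) - gammaln(alpha.sum() + n)
            + (gammaln(alpha + x) - gammaln(alpha)).sum())

def fit_class_alphas(X_by_class, prior=1.0):
    """Posterior Dirichlet per class: prior pseudo-counts + pooled counts."""
    return {c: prior + X.sum(axis=0) for c, X in X_by_class.items()}

def classify(x, alphas):
    """Assign x to the class with the highest marginal likelihood."""
    scores = {c: dm_log_marginal(x, a) for c, a in alphas.items()}
    return max(scores, key=scores.get)

# Toy example: two classes of count profiles over 4 "OTUs"
rng = np.random.default_rng(0)
X0 = rng.multinomial(100, [0.6, 0.2, 0.1, 0.1], size=20)
X1 = rng.multinomial(100, [0.1, 0.1, 0.2, 0.6], size=20)
alphas = fit_class_alphas({0: X0, 1: X1})
print(classify(rng.multinomial(100, [0.55, 0.25, 0.1, 0.1]), alphas))
```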


Galaxies ◽  
2020 ◽  
Vol 8 (4) ◽  
pp. 70
Author(s):  
Dimitris M. Christodoulou ◽  
Silas G. T. Laycock ◽  
Rigel Cappallo ◽  
Ankur Roy ◽  
Sayantan Bhattacharya  ◽  
...  

We carry out a meta-analysis of ultraluminous X-ray (ULX) sources that show large variabilities (by factors of >10) between their highest and lowest emission states in the X-ray energy range of 0.3–10 keV. We are guided by a recent stringent compilation of 25 such X-ray sources by Song et al. We examine the relation of log N versus log S_max, where N is the number of sources radiating above the maximum-flux level S_max. We find a strong deviation from all previously determined slopes in various high-mass X-ray binary (HMXB) samples. In fact, the ULX data clearly show a slope of −0.91. Thus, ULX sources do not appear to be uniform and isotropic in our Universe. We compare the ULX results against the local X-ray luminosity function of HMXBs in the Small Magellanic Cloud (SMC) constructed from our latest library that includes 41 Chandra 0.3–8 keV sources and 56 XMM-Newton 0.2–12 keV sources. The ULX data are not drawn from the same continuous distribution as the SMC data (the ULX data peak at the low tails of the SMC distributions), and none of our data sets is drawn from a normal distribution or from a log-normal distribution (they all show marked excesses at both tails). At a significance level of α = 0.05 (2σ), the two-sample Kolmogorov–Smirnov (KS) test gives p = 4.7 × 10⁻³ < α for the ULX versus the small Chandra sample and p = 1.1 × 10⁻⁵ ≪ α for the ULX versus the larger XMM-Newton sample. This adds to the evidence that ULX sources are not simply the higher end of the known local Be/X-ray pulsar distribution, but they represent a class of X-ray sources different from the young sources found in the SMC and in individual starburst galaxies. On the other hand, our two main SMC data sets are found to be statistically consistent, as they are drawn from the same continuous parent distribution (null hypothesis H0): at the α = 0.05 significance level, the two-sample KS test shows an asymptotic p-value of 0.308 > α, which tells us to accept H0.
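
The two-sample KS comparison reported above can be reproduced with SciPy's `ks_2samp`; the samples below are synthetic stand-ins, not the actual ULX and SMC flux data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
ulx_like = rng.lognormal(mean=0.0, sigma=1.0, size=25)  # stand-in sample
smc_like = rng.lognormal(mean=1.0, sigma=1.0, size=41)  # shifted stand-in

# Two-sample KS test: are the samples drawn from the same
# continuous parent distribution?
res = stats.ks_2samp(ulx_like, smc_like)
alpha = 0.05
print(f"D = {res.statistic:.3f}, p = {res.pvalue:.4g}")
print("reject H0" if res.pvalue < alpha else "accept H0")
```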


2007 ◽  
Vol 135 (3) ◽  
pp. 1151-1157 ◽  
Author(s):  
Dag J. Steinskog ◽  
Dag B. Tjøstheim ◽  
Nils G. Kvamstø

Abstract The Kolmogorov–Smirnov goodness-of-fit test is used in many applications for testing normality in climate research. This note shows that the test usually leads to systematic and drastic errors. When the mean and the standard deviation are estimated, it is much too conservative in the sense that its p values are strongly biased upward. One may think that this is a small sample problem, but it is not. There is a correction of the Kolmogorov–Smirnov test by Lilliefors, which is in fact sometimes confused with the original Kolmogorov–Smirnov test. Both the Jarque–Bera and the Shapiro–Wilk tests for normality are good alternatives to the Kolmogorov–Smirnov test. A power comparison of eight different tests has been undertaken, favoring the Jarque–Bera and the Shapiro–Wilk tests. The Jarque–Bera and the Kolmogorov–Smirnov tests are also applied to a monthly mean dataset of geopotential height at 500 hPa. The two tests give very different results and illustrate the danger of using the Kolmogorov–Smirnov test.
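
The upward bias described in the note is easy to demonstrate by Monte Carlo: under H0 a correct test produces uniform p-values, but plugging estimated parameters into the plain KS test does not. A minimal sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 50, 500
pvals = np.empty(reps)
for i in range(reps):
    x = rng.standard_normal(n)
    # Wrong usage: plugging estimated parameters into the plain KS test
    pvals[i] = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))).pvalue

# Under a correct test, ~5% of p-values would fall below 0.05; here far
# fewer do, so the test almost never rejects (it is far too conservative).
print("fraction of p < 0.05:", (pvals < 0.05).mean())
print("mean p-value:", pvals.mean())
```

The Lilliefors correction mentioned in the note supplies the appropriate smaller critical values for this estimated-parameter case.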


2014 ◽  
Vol 14 (2) ◽  
pp. 391-403
Author(s):  
Zuzana Schubertová ◽  
Juraj Candrák

Abstract The aim of this study was to verify a newly proposed transformation of penalty points and rankings of showjumping horses for the purpose of genetic evaluation. Genomic information was also used in the transformation of the input data. Data from the Global Champions Tour showjumping competition were used. Penalty points were transformed using the Blom formula (taking into account the height of obstacles, and the height of obstacles together with a single nucleotide polymorphism, SNP, effect), but a non-normal distribution was obtained. The rankings of sport horses in competitions were transformed using the Blom formula (height of obstacles taken into account) to a normal distribution (normality tests: Kolmogorov-Smirnov (KS) Pr>D, D 0.011, P>0.150; Cramer-von Mises (CM) Pr>W-Sq, W-Sq 0.039, P>0.250; Anderson-Darling (AD) Pr>A-Sq, A-Sq 0.638, P<0.097). A better-distributed ranking variable was obtained when the Blom formula took into account both the height of obstacles and the SNP effect (KS Pr>D, D 0.004, P>0.150; CM Pr>W-Sq, W-Sq 0.004, P>0.250; AD Pr>A-Sq, A-Sq 0.062, P>0.250). A model applying all fixed effects without any combination of the effects was tested (R² 0.54), with the ranking variable transformed to a normal score by the Blom formula (height of obstacles taken into account). In a following model, some effects were included in the form of quadratic regression (R² 0.61), with the ranking variable transformed to a normal score as in the previous model. In the last model, the ranking variable was transformed to a normal score by the Blom formula taking into account the height of obstacles and the SNP effect, with the same effects as in the previous model (R² 0.60).
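
The Blom transformation used throughout this study maps ranks to normal quantiles. A minimal sketch of the bare formula (the height-of-obstacles and SNP adjustments of the study are omitted):

```python
import numpy as np
from scipy import stats

def blom_scores(values):
    """Blom rank-based normal scores:
    z_i = Phi^-1((r_i - 3/8) / (n + 1/4)), with r_i the rank of value i."""
    values = np.asarray(values, dtype=float)
    ranks = stats.rankdata(values)  # average ranks for ties
    n = len(values)
    return stats.norm.ppf((ranks - 0.375) / (n + 0.25))

# Example: penalty points from a single (hypothetical) competition
penalties = [0, 4, 8, 1, 12, 0, 4, 16, 20, 4]
print(np.round(blom_scores(penalties), 3))
```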


2021 ◽  
Vol 4 (1) ◽  
pp. 403-409
Author(s):  
Hengki Mangiring Parulian Simarmata ◽  
◽  
Doris Yolanda Saragih ◽  
Nora Januarti Panjaitan ◽  
◽  
...  

Work discipline can improve company progress. This study aims to determine the influence of employee work discipline on employee performance at PT Bridgestone Pondok Bandar Jambu, Simalungun Regency. This research is a quantitative study using all employees as the sample. Data collection was done by distributing questionnaires, observation, and documentation. The data were processed using SPSS and tested with a validity test, reliability test, Kolmogorov-Smirnov normality test, and simple linear analysis. From the research results, there is a significant positive effect of work discipline on employee performance, where the magnitude of the influence of the discipline variable is 51.6%, while the remaining 48.4% is influenced by other variables not examined in this study, such as work environment, leadership style, and others. Keywords: work discipline, employee performance
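
The 51.6% figure reads as an R-squared from the simple linear regression. A hedged sketch of that step with synthetic stand-in scores (the study's questionnaire data are not reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 60
discipline = rng.normal(70, 10, n)                    # questionnaire score
performance = 0.8 * discipline + rng.normal(0, 8, n)  # correlated outcome

# Simple linear regression: R^2 is the share of performance
# variance explained by discipline
res = stats.linregress(discipline, performance)
r_squared = res.rvalue ** 2
print(f"slope = {res.slope:.2f}, R^2 = {r_squared:.3f}, p = {res.pvalue:.3g}")
```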


2019 ◽  
Author(s):  
Lara Nonell ◽  
Juan R González

Abstract DNA methylation plays an important role in the development and progression of disease. Beta-values are the standard methylation measures. Different statistical methods have been proposed to assess differences in methylation between conditions, but most of them do not completely account for the distribution of beta-values. The simplex distribution can accommodate beta-value data, and we hypothesize that the simplex is a flexible distribution able to model methylation data. To test our hypothesis, we conducted several analyses using four real data sets obtained from microarray and sequencing technologies. Standard data distributions were studied and modelled in comparison to the simplex. In addition, simulations were conducted in different scenarios encompassing several distribution assumptions, regression models, and sample sizes. Finally, we compared DNA methylation between females and males in order to benchmark the assessed methodologies under different scenarios. According to the results obtained from the simulations and real data analyses, DNA methylation data are concordant with the simplex distribution in many situations. Simplex regression models work well in small sample size data sets; however, when the sample size increases, other models such as beta regression or even linear regression can be employed to assess group comparisons and obtain unbiased results. Based on these results, we can provide some practical recommendations when analyzing methylation data: 1) use data sets of at least 10 samples per studied condition for microarray data sets or 30 for NGS data sets, 2) apply a simplex or beta regression model for microarray data, 3) apply a linear model in any other case.
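
Beta-values live on (0, 1); while the simplex distribution has no standard SciPy implementation, the closely related beta distribution can be fitted directly, as a rough sketch of the kind of modelling discussed above (the data are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Synthetic beta-values for a mostly methylated CpG site
beta_values = rng.beta(a=8.0, b=2.0, size=300)

# Fit a beta distribution with the support fixed to (0, 1)
a_hat, b_hat, _, _ = stats.beta.fit(beta_values, floc=0, fscale=1)
print(f"fitted a = {a_hat:.2f}, b = {b_hat:.2f}")

# Goodness of fit: KS test against the fitted beta
ks = stats.kstest(beta_values, "beta", args=(a_hat, b_hat))
print(f"KS p-value = {ks.pvalue:.3f}")
```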


2022 ◽  
Vol 13 (1) ◽  
pp. 250-254
Author(s):  
Maftuhatur Rizkiyah Putri ◽  
Almira Disya Salsabil ◽  
I Made Agus Dwipayana ◽  
Widati Fatmaningrum

Introduction: The COVID-19 pandemic has harmed various fields, and people's activities cannot run as usual. Preventing the transmission of COVID-19 is very important in everyday life. Washing hands with soap or hand sanitizer is an easy and inexpensive preventive measure, but many people still practice it incorrectly. More counseling and education are therefore needed to increase public knowledge about handwashing and hand sanitizer. Method: This research is an analytic study with a one-group pretest-posttest design, using 31 respondents from Taro villagers who attended the counseling. Data were analyzed using the Paired Sample T-test, with the Kolmogorov-Smirnov test for normality. Result: The average knowledge score before counseling was 53.8710, while after counseling it was 82.9677. The Paired Sample T-test gave a significance value of 0.000, so a significant difference (<0.005) was found between the values before and after counseling. Conclusion: There is a significant difference in the level of knowledge before and after handwashing and hand sanitizer counseling.
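
The analysis described in the Method can be sketched as follows; the scores below are synthetic stand-ins, not the study's data (which had mean 53.87 before and 82.97 after, n = 31):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 31
pretest = rng.normal(54, 10, n)
posttest = pretest + rng.normal(29, 8, n)  # consistent improvement

# Normality check on the paired differences (the quantity the t-test assumes)
diff = posttest - pretest
ks_p = stats.kstest((diff - diff.mean()) / diff.std(ddof=1), "norm").pvalue

# Paired-sample t-test on pre/post scores
t_res = stats.ttest_rel(pretest, posttest)
print(f"normality p = {ks_p:.3f}, paired t p = {t_res.pvalue:.3g}")
```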

