Preparation of collected primary data for statistical analysis

Author(s):  
Alan J. Silman ◽  
Gary J. Macfarlane ◽  
Tatiana Macfarlane

Although epidemiological studies are increasingly based on the analysis of existing data sets (including linked data sets), many studies still require primary data collection. Such data may come from patient questionnaires, interviews, abstraction from records, and/or the results of tests and measures such as weight or blood test results. The next stage is to analyse the data gathered from individual subjects to provide the answers required. Before commencing with the statistical analysis of any data set, the data themselves must be prepared in a format so that the detailed statistical analysis can achieve its goals. Items to be considered include the format the data are initially collected in and how they are transferred to an appropriate electronic form. This chapter explores how errors are minimized and the quality of the data set ensured. These tasks are not trivial and need to be planned as part of a detailed study methodology.

2019 ◽  
Vol 118 (1) ◽  
pp. 14-19
Author(s):  
Boo-Gil Seok ◽  
Hyun-Suk Park

Background/Objectives: The purpose of this study is to examine the effects of exercise commitment, as facilitated by the service quality of smartphone exercise Apps, on continued exercise intention, and to provide primary data for developing and/or improving smartphone exercise Apps. Methods/Statistical analysis: A questionnaire survey was conducted among college students who exercise regularly and have experience using exercise App(s). The questionnaire is composed of four parts: service quality, exercise commitment, and continued exercise intention, each measured on a 5-point Likert scale, plus demographics. Frequency analysis, factor analysis, correlation analysis, and regression analysis were carried out with PASW 18.0 to analyze the obtained data.


2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for the clustering, analysis and visualization of mixed data (continuous/binary). The weights and prototypes are learned simultaneously, ensuring an optimized clustering of the data. The higher the weight of a variable, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process for the different variables, which computes weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set and three other mixed data sets. The results show a good quality of topological ordering and homogeneous clustering.


2017 ◽  
Vol 6 (3) ◽  
pp. 71 ◽  
Author(s):  
Claudio Parente ◽  
Massimiliano Pepe

The purpose of this paper is to investigate the impact of weights in pan-sharpening methods applied to satellite images. Different data sets of weights have been considered and compared in the IHS and Brovey methods. The first data set contains the same weight for each band, while the second takes into account the weights obtained from the spectral radiance response; these two data sets are the most common in pan-sharpening applications. The third data set results from a new method, which consists of computing the first-order moment of inertia of each band, taking into account the spectral response. To test the impact of the weights of the different data sets, WorldView-3 satellite images have been considered. In particular, two different scenes (the first in an urban landscape, the second in a rural landscape) have been investigated. The quality of the pan-sharpened images has been analysed with three different quality indexes: Root mean square error (RMSE), Relative average spectral error (RASE) and Erreur Relative Global Adimensionnelle de Synthèse (ERGAS).
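The three quality indexes named in this abstract can be sketched in a few lines. This is a minimal illustration using the commonly cited definitions of RMSE, RASE and ERGAS, not the authors' implementation. The function names, the representation of each band as a flat list of pixel values, and the `resolution_ratio` parameter (pixel size of the panchromatic image divided by that of the multispectral image) are assumptions made for the example.

```python
import math


def rmse(fused, reference):
    """Root mean square error between one fused band and its reference band."""
    n = len(reference)
    return math.sqrt(sum((f - r) ** 2 for f, r in zip(fused, reference)) / n)


def rase(fused_bands, reference_bands):
    """Relative average spectral error, in percent, over all bands."""
    n_bands = len(reference_bands)
    # Mean radiance of the reference image, averaged over bands.
    mean_radiance = sum(sum(b) / len(b) for b in reference_bands) / n_bands
    mean_sq_error = sum(
        rmse(f, r) ** 2 for f, r in zip(fused_bands, reference_bands)
    ) / n_bands
    return 100.0 / mean_radiance * math.sqrt(mean_sq_error)


def ergas(fused_bands, reference_bands, resolution_ratio):
    """ERGAS: each band's squared RMSE is normalised by its squared mean."""
    n_bands = len(reference_bands)
    total = 0.0
    for f, r in zip(fused_bands, reference_bands):
        band_mean = sum(r) / len(r)
        total += rmse(f, r) ** 2 / band_mean ** 2
    return 100.0 * resolution_ratio * math.sqrt(total / n_bands)
```

For all three indexes lower values indicate a fused image closer to the reference, and an identical image scores zero.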


1997 ◽  
Vol 3 (S2) ◽  
pp. 931-932 ◽  
Author(s):  
Ian M. Anderson ◽  
Jim Bentley

Recent developments in instrumentation and computing power have greatly improved the potential for quantitative imaging and analysis. For example, products are now commercially available that allow the practical acquisition of spectrum images, where an EELS or EDS spectrum can be acquired from a sequence of positions on the specimen. However, such data files typically contain megabytes of information and may be difficult to manipulate and analyze conveniently or systematically. A number of techniques are being explored for the purpose of analyzing these large data sets. Multivariate statistical analysis (MSA) provides a method for analyzing the raw data set as a whole. The basis of the MSA method has been outlined by Trebbia and Bonnet. MSA has a number of strengths relative to other methods of analysis. First, it is broadly applicable to any series of spectra or images. Applications include characterization of grain boundary segregation (position-), of channeling-enhanced microanalysis (orientation-), or of beam damage (time-variation of spectra).
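As a rough illustration of analyzing a series of spectra as a whole, the sketch below extracts the dominant component of variation by power iteration on the mean-centered data, which is a basic form of principal component analysis. It is a simplified stand-in and not the specific MSA formulation of Trebbia and Bonnet. The function name and the list-of-lists layout (one spectrum per row, one channel per column) are assumptions.

```python
import math


def first_principal_component(spectra, iterations=100):
    """Return (loading, scores) for the dominant component of the spectra."""
    n, m = len(spectra), len(spectra[0])
    # Center each channel on its mean over the series.
    mean = [sum(row[j] for row in spectra) / n for j in range(m)]
    centered = [[row[j] - mean[j] for j in range(m)] for row in spectra]

    # Deterministic, non-uniform start vector (an arbitrary choice).
    v = [float(j + 1) for j in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(iterations):
        # One power-iteration step: v <- X^T (X v), then renormalize.
        proj = [sum(row[j] * v[j] for j in range(m)) for row in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left along the current direction
        v = [x / norm for x in w]

    scores = [sum(row[j] * v[j] for j in range(m)) for row in centered]
    return v, scores
```

The loading vector shows which channels vary together, and the scores show how strongly that pattern is expressed at each position, orientation or time step in the series.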


2005 ◽  
Vol 5 (7) ◽  
pp. 1835-1841 ◽  
Author(s):  
S. Noël ◽  
M. Buchwitz ◽  
H. Bovensmann ◽  
J. P. Burrows

Abstract. A first validation of water vapour total column amounts derived from measurements of the SCanning Imaging Absorption spectroMeter for Atmospheric CHartographY (SCIAMACHY) in the visible spectral region has been performed. For this purpose, SCIAMACHY water vapour data have been determined for the year 2003 using an extended version of the Differential Optical Absorption Spectroscopy (DOAS) method, called Air Mass Corrected (AMC-DOAS). The SCIAMACHY results are compared with corresponding water vapour measurements by the Special Sensor Microwave Imager (SSM/I) and with model data from the European Centre for Medium-Range Weather Forecasts (ECMWF). In confirmation of previous results it could be shown that SCIAMACHY-derived water vapour columns are typically slightly lower than both SSM/I and ECMWF data, especially over ocean areas. However, these deviations are much smaller than the observed scatter of the data, which is caused by the different temporal and spatial sampling and resolution of the data sets. For example, the overall difference with ECMWF data is only -0.05 g/cm² whereas the typical scatter is on the order of 0.5 g/cm². Both values show almost no variation over the year. In addition, first monthly means of SCIAMACHY water vapour data have been computed. The quality of these monthly means is currently limited by the availability of calibrated SCIAMACHY spectra. Nevertheless, first comparisons with ECMWF data show that SCIAMACHY (and similar instruments) are able to provide a new independent global water vapour data set.


2019 ◽  
Vol 2 (2) ◽  
pp. 169-187 ◽  
Author(s):  
Ruben C. Arslan

Data documentation in psychology lags behind not only many other disciplines, but also basic standards of usefulness. Psychological scientists often prefer to invest the time and effort that would be necessary to document existing data well in other duties, such as writing and collecting more data. Codebooks therefore tend to be unstandardized and stored in proprietary formats, and they are rarely properly indexed in search engines. This means that rich data sets are sometimes used only once—by their creators—and left to disappear into oblivion. Even if they can find an existing data set, researchers are unlikely to publish analyses based on it if they cannot be confident that they understand it well enough. My codebook package makes it easier to generate rich metadata in human- and machine-readable codebooks. It uses metadata from existing sources and automates some tedious tasks, such as documenting psychological scales and reliabilities, summarizing descriptive statistics, and identifying patterns of missingness. The codebook R package and Web app make it possible to generate a rich codebook in a few minutes and just three clicks. Over time, its use could lead to psychological data becoming findable, accessible, interoperable, and reusable, thereby reducing research waste and benefiting both its users and the scientific community as a whole.


2016 ◽  
Vol 33 (S1) ◽  
pp. S379-S379
Author(s):  
I. Hamilton ◽  
P. Galdas ◽  
H. Essex

Introduction: Despite recent findings pointing toward cannabis psychosis as one area where gender differences may exist, there has been a widespread lack of attention paid to gender as a determinant of health in both psychiatric services and within the field of addiction. Objectives: To explore gender differences in treatment presentations for people with cannabis psychosis. Aims: To use national data sets to investigate gender differences. Methods: British Crime Survey data and a Hospital Episode Statistics data set were analysed in combination with data from previously published epidemiological studies to compare gender differences. Results: Male cannabis users outnumber female users by 2:1; a similar gender ratio is found for those admitted to hospital with a diagnosis of schizophrenia or psychosis. However, this ratio increases significantly for those admitted to hospital with a diagnosis of cannabis psychosis, with males outnumbering females by 4:1. Conclusions: This research brings into focus the marked gender differences in cannabis psychosis. Attending to gender is important for research and treatment with the aim of improving understanding and providing gender-sensitive services. Disclosure of interest: The authors have not supplied their declaration of competing interest.


Author(s):  
MUSTAPHA LEBBAH ◽  
YOUNÈS BENNANI ◽  
NICOLETA ROGOVSCHI

This paper introduces a probabilistic self-organizing map for topographic clustering, analysis and visualization of multivariate binary data or categorical data using binary coding. We propose a probabilistic formalism dedicated to binary data in which cells are represented by a Bernoulli distribution. Each cell is characterized by a prototype with the same binary coding as used in the data space and the probability of being different from this prototype. The learning algorithm that we propose, Bernoulli on self-organizing map, is an application of the standard EM algorithm. We illustrate the power of this method with six data sets taken from a public data set repository. The results show a good quality of topological ordering and homogeneous clustering.


2016 ◽  
Vol 25 (3) ◽  
pp. 431-440 ◽  
Author(s):  
Archana Purwar ◽  
Sandeep Kumar Singh

Abstract. Ensuring the quality of data is an important task in data mining. The validity of mining algorithms is reduced if the data are not of good quality. The quality of data can be assessed in terms of missing values (MV) as well as the noise present in the data set. Various imputation techniques have been studied in MV research, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel groups objects according to their density in spatial databases. The high-density regions are known as clusters, and the low-density regions refer to the noise objects in the data set. Numerous experiments have been performed on the Iris data set from the life science domain and Jain’s (2D) data set from the shape data sets. The performance of the proposed method is evaluated using root mean square error (RMSE) and compared with existing K-means imputation (KMI). Results show that our method is more noise resistant than KMI on the data sets used in the study.
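The general idea of imputing missing values from density-based clusters while ignoring noise can be sketched as follows. This is a simplified illustration and not the DBSCANI algorithm itself, whose details are not given in the abstract. The imputation rule used here (fill each missing feature with the mean of the cluster whose member is nearest on the observed features) and all function and parameter names are assumptions.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or -1 for noise."""
    labels = [None] * len(points)  # None means not yet visited

    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def impute(record, complete, labels):
    """Fill None entries of record from the nearest non-noise cluster."""
    observed = [k for k, value in enumerate(record) if value is not None]
    best = None
    for point, label in zip(complete, labels):
        if label == -1:
            continue  # noise objects never contribute to imputation
        d = math.sqrt(sum((record[k] - point[k]) ** 2 for k in observed))
        if best is None or d < best[0]:
            best = (d, label)
    if best is None:
        return list(record)  # no clusters found, leave the gaps in place
    members = [p for p, lab in zip(complete, labels) if lab == best[1]]
    return [value if value is not None
            else sum(m[k] for m in members) / len(members)
            for k, value in enumerate(record)]
```

Excluding points labelled as noise from both the nearest-cluster search and the cluster means is what would make such a scheme less sensitive to noise than a centroid-based approach that assigns every point to a cluster.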


2016 ◽  
Author(s):  
Brecht Martens ◽  
Diego G. Miralles ◽  
Hans Lievens ◽  
Robin van der Schalie ◽  
Richard A. M. de Jeu ◽  
...  

Abstract. The Global Land Evaporation Amsterdam Model (GLEAM) is a set of algorithms dedicated to the estimation of terrestrial evaporation and root-zone soil moisture from satellite data. Ever since its development in 2011, the model has been regularly revised aiming at the optimal incorporation of new satellite-observed geophysical variables, and improving the representation of physical processes. In this study, the next version of this model (v3) is presented. Key changes relative to the previous version include: (1) a revised formulation of the evaporative stress, (2) an optimized drainage algorithm, and (3) a new soil moisture data assimilation system. GLEAM v3 is used to produce three new data sets of terrestrial evaporation and root-zone soil moisture, including a 35-year data set spanning the period 1980–2014 (v3.0a, based on satellite-observed soil moisture, vegetation optical depth and snow water equivalents, reanalysis air temperature and radiation, and a multi-source precipitation product), and two fully satellite-based data sets. The latter two share most of their forcing, except for the vegetation optical depth and soil moisture products, which are based on observations from different passive and active C- and L-band microwave sensors (European Space Agency Climate Change Initiative data sets) for the first data set (v3.0b, spanning the period 2003–2015) and observations from the Soil Moisture and Ocean Salinity satellite in the second data set (v3.0c, spanning the period 2011–2015). These three data sets are described in detail, compared against analogous data sets generated using the previous version of GLEAM (v2), and validated against measurements from 64 eddy-covariance towers and 2338 soil moisture sensors across a broad range of ecosystems. 
Results indicate that the quality of the v3 soil moisture is consistently better than that of v2: average correlations against in situ surface soil moisture measurements increase from 0.61 to 0.64 in the case of the v3.0a data set, and the representation of soil moisture in the second layer improves as well, with correlations increasing from 0.47 to 0.53. Similar improvements are observed for the two fully satellite-based data sets. Despite regional differences, the quality of the evaporation fluxes remains overall similar to that obtained using the previous version of GLEAM, with average correlations against eddy-covariance measurements between 0.78 and 0.80 for the three different data sets. These global data sets of terrestrial evaporation and root-zone soil moisture are now openly available at http://GLEAM.eu and may be used for large-scale hydrological applications, climate studies and research on land-atmosphere feedbacks.

