Detecting Subgroups in Survey Research

1986 ◽  
Vol 59 (2) ◽  
pp. 751-760
Author(s):  
Todd McLin Davis

A problem often not detected in the interpretation of survey research is the potential interaction between subgroups within the sample and aspects of the survey. Potentially interesting interactions are commonly obscured when data are analyzed using descriptive and univariate statistical procedures. This paper suggests the use of cluster analysis as a tool for interpretation of data, particularly when such data take the form of coded categories. An example of the analysis of two data sets with known properties, one random and the other contrived, is presented to illustrate the application of cluster procedures to survey research data.
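As a rough illustration of the idea (not the author's original procedure), the sketch below clusters hypothetical coded-category responses with an off-the-shelf hierarchical method and then cross-tabulates the clusters against each item; the data, items, and distance choice are assumptions for illustration only.

# Minimal sketch (hypothetical data and parameter choices): hierarchical
# clustering of coded-category survey responses, followed by cluster-by-item
# cross-tabulations to surface subgroup interactions that univariate
# summaries can hide.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical survey: each column is a coded-category item.
responses = pd.DataFrame({
    "q1": ["agree", "agree", "disagree", "disagree", "agree", "disagree"],
    "q2": ["yes", "yes", "no", "no", "no", "yes"],
    "q3": ["often", "rarely", "rarely", "often", "often", "rarely"],
})

# One-hot encode the coded categories so a simple set-overlap distance applies.
dummies = pd.get_dummies(responses).astype(bool)

# Average-linkage clustering on pairwise Jaccard distances between respondents.
dist = pdist(dummies.values, metric="jaccard")
tree = linkage(dist, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")

# Cross-tabulate cluster membership against each item: large differences
# between clusters point to subgroup-by-item interactions.
for col in responses.columns:
    print(pd.crosstab(labels, responses[col]), end="\n\n")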

Author(s):  
Richard Johnston

Survey research and empirical political science grew up together. Although the bills for commercial survey fieldwork are mainly paid for nonpolitical purposes, early surveys were justified publicly for their contribution to a deepened understanding of the electorate. Even today, polls on political questions are the loss leader for many high-profile firms. On the academic side, systematic quantitative investigation of political phenomena began with the Erie County Study (Lazarsfeld, et al. 1968, cited under Based on Purpose-Built Academic Data Sets), and academic and commercial practices intersected with controversies over quota versus probability sampling in the 1940s (Converse 1987). Survey research on public opinion and elections was the central force in shaping empirical methods for the discipline as a whole. Whereas survey research was initially a path along which insights from sociology and psychology were imported into political science, in time political scientists came to dominate the trade. Also with time, survey analysts were forced to acknowledge the limitations of their own method, for causal inference in general but also for historical and institutional nuance. As an expression of a scientific temperament, survey research thus yielded ground to other techniques, most notably statistical analysis of archival data on one hand and experimentation on the other. But these challenges arguably have forced the sample survey to reveal its versatility. Cross-level analyses are increasingly common—all the more so as our understanding of the statistical foundations of multilevel modeling has grown. In addition, surveys are serving increasingly as vehicles for experimentation, a way of recruiting subjects outside the laboratory and off-campus and of linking random selection of subjects to random assignment to experimental treatment or control. The current period is one of massive flux and, possibly, rapid obsolescence. On the one hand, target populations are growing less compliant with surveys, even as the bases for survey coverage become more uncertain. On the other hand, new techniques have emerged, often linked to new funding models. Most critical is the World Wide Web. Ironically, the emergence of the web as a survey platform has revived controversies, seemingly settled in the 1940s, over the requirement for probability samples. Through all of this, concern has grown about the very meaning of survey response and its relation to public opinion—indeed, if such a thing as public opinion exists.


Author(s):  
Thomas W. Shattuck ◽  
James R. Anderson ◽  
Neil W. Tindale ◽  
Peter R. Buseck

Individual particle analysis involves the study of tens of thousands of particles using automated scanning electron microscopy and elemental analysis by energy-dispersive X-ray emission spectroscopy (EDS). EDS produces large data sets that must be analyzed using multivariate statistical techniques. A complete study uses cluster analysis, discriminant analysis, and factor or principal components analysis (PCA). The three techniques are used in the study of particles sampled during the FeLine cruise to the mid-Pacific Ocean in the summer of 1990. The mid-Pacific aerosol provides information on long-range particle transport, iron deposition, sea salt ageing, and halogen chemistry.

Aerosol particle data sets suffer from a number of difficulties for pattern recognition using cluster analysis. There is a great disparity in the number of observations per cluster and in the range of the variables in each cluster. The variables are not normally distributed, they are subject to considerable experimental error, and many values are zero because of finite detection limits. Many of the clusters show considerable overlap because of natural variability, agglomeration, and chemical reactivity.
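A minimal sketch of this kind of workflow, under assumed data and parameter choices (it is not the authors' pipeline), is given below: soften the skew and zeros with a log-type transform, standardize, reduce with PCA, and cluster the particles.

# Minimal sketch (assumed workflow and synthetic data, not the authors'
# exact pipeline): PCA followed by k-means clustering of per-particle
# elemental compositions.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: rows are particles, columns are elemental intensities
# (e.g. Na, Mg, Al, Si, S, Cl, K, Ca, Fe); many entries are exactly zero
# because of finite detection limits.
X = rng.gamma(shape=0.5, scale=1.0, size=(5000, 9))
X[rng.random(X.shape) < 0.3] = 0.0

# log1p reduces the influence of a few very large values; zeros stay zero.
Z = StandardScaler().fit_transform(np.log1p(X))

# Keep enough components to cover ~90% of the variance, then cluster.
scores = PCA(n_components=0.9).fit_transform(Z)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(scores)

# Cluster sizes are typically very uneven for aerosol data.
print(np.bincount(labels))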


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Manfred Berres ◽  
Andreas U. Monsch ◽  
René Spiegel

Abstract
Background: The Placebo Group Simulation Approach (PGSA) aims at partially replacing randomized placebo-controlled trials (RPCTs), making use of data from historical control groups in order to decrease the needed number of study participants exposed to lengthy placebo treatment. PGSA algorithms to create virtual control groups were originally derived from mild cognitive impairment (MCI) data of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. To produce more generalizable algorithms, we aimed to compile five different MCI databases in a heuristic manner to create a “standard control algorithm” for use in future clinical trials.
Methods: We compared data from two North American cohort studies (n=395 and 4328, respectively), one company-sponsored international clinical drug trial (n=831) and two convenience patient samples, one from Germany (n=726) and one from Switzerland (n=1558).
Results: Despite differences between the five MCI samples regarding inclusion and exclusion criteria, their baseline demographic and cognitive performance data varied less than expected. However, the five samples differed markedly with regard to their subsequent cognitive performance and clinical development: (1) MCI patients from the drug trial did not deteriorate on verbal fluency over 3 years, whereas patients in the other samples did; (2) relatively few patients from the drug trial progressed from MCI to dementia (about 10% after 4 years), in contrast to the other four samples with progression rates over 30%.
Conclusion: Conventional MCI criteria were insufficient to allow for the creation of well-defined and internationally comparable samples of MCI patients. More recently published criteria for MCI or “MCI due to AD” are unlikely to remedy this situation. The Alzheimer scientific community needs to agree on a standard set of neuropsychological tests including appropriate selection criteria to make MCI a scientifically more useful concept. Patient data from different sources would then be comparable, and the scientific merits of algorithm-based study designs such as the PGSA could be properly assessed.
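To convey the flavour of the cohort comparison, the sketch below contrasts MCI-to-dementia progression proportions across cohorts with a chi-square test of homogeneity; the split counts are invented placeholders chosen only to be consistent with the reported sample sizes and approximate rates, not the study data.

# Minimal sketch with made-up counts (not the study data): comparing
# MCI-to-dementia progression proportions across cohorts.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts per cohort: [progressed, did not progress].
counts = np.array([
    [80, 751],     # drug trial (~10% progression)
    [140, 255],    # cohort study A
    [1400, 2928],  # cohort study B
    [260, 466],    # German convenience sample
    [560, 998],    # Swiss convenience sample
])

chi2, p, dof, _ = chi2_contingency(counts)
rates = counts[:, 0] / counts.sum(axis=1)
print("progression rates:", np.round(rates, 2))
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.3g}")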


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Hossein Ahmadvand ◽  
Fouzhan Foroutan ◽  
Mahmood Fathy

Abstract
Data variety is one of the most important features of Big Data. It is the result of aggregating data from multiple sources and of the uneven distribution of those data, and it causes high variation in the consumption of processing resources such as the CPU. This issue has been overlooked in previous work. To overcome it, in the present work we use Dynamic Voltage and Frequency Scaling (DVFS) to reduce the energy consumption of computation. To this end, we consider two types of deadlines as constraints. Before applying the DVFS technique to the compute nodes, we estimate the processing time and the frequency needed to meet the deadline. In the evaluation phase, we use a set of data sets and applications. The experimental results show that our proposed approach, DV-DVFS, outperforms the other scenarios when processing real data sets, achieving up to a 15% improvement in energy consumption.
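The core of the approach is choosing a CPU frequency low enough to save energy but high enough to meet the deadline, given the estimated processing time. A minimal sketch of that selection step, with assumed frequency levels and work estimates (not the paper's DV-DVFS algorithm), follows.

# Minimal sketch (assumed interface and numbers, not the paper's DV-DVFS
# algorithm): pick the lowest CPU frequency that still meets the deadline,
# given an estimate of the remaining work in cycles.
from typing import Sequence

def pick_frequency(remaining_cycles: float,
                   deadline_s: float,
                   available_freqs_hz: Sequence[float]) -> float:
    """Return the lowest frequency f such that remaining_cycles / f <= deadline_s."""
    feasible = [f for f in sorted(available_freqs_hz)
                if remaining_cycles / f <= deadline_s]
    # If no available frequency can meet the deadline, fall back to the maximum.
    return feasible[0] if feasible else max(available_freqs_hz)

# Example: 3e9 cycles of estimated work, a 2-second deadline, and DVFS
# steps of 1.2, 1.8 and 2.4 GHz -> 1.8 GHz is the most economical choice.
print(pick_frequency(3e9, 2.0, [1.2e9, 1.8e9, 2.4e9]))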


Mathematics ◽  
2021 ◽  
Vol 9 (16) ◽  
pp. 1850
Author(s):  
Rashad A. R. Bantan ◽  
Farrukh Jamal ◽  
Christophe Chesneau ◽  
Mohammed Elgarhy

Unit distributions are commonly used in probability and statistics to describe useful quantities with values between 0 and 1, such as proportions, probabilities, and percentages. Some unit distributions are defined in a natural analytical manner, while others are derived through the transformation of an existing distribution defined on a larger domain. In this article, we introduce the unit gamma/Gompertz distribution, founded on the inverse-exponential scheme and the gamma/Gompertz distribution. The gamma/Gompertz distribution is known to be a very flexible three-parameter lifetime distribution, and we aim to transpose this flexibility to the unit interval. First, we check this aspect through the analytical behavior of the primary functions. It is shown that the probability density function can be increasing, decreasing, “increasing-decreasing” or “decreasing-increasing”, with flexible asymmetric properties. On the other hand, the hazard rate function can have monotonically increasing, decreasing, or constant shapes. We complete the theoretical part with some propositions on stochastic ordering, moments, quantiles, and the reliability coefficient. Practically, to estimate the model parameters from unit data, the maximum likelihood method is used. We present some simulation results to evaluate this method. Two applications using real data sets, one on trade shares and the other on flood levels, demonstrate the importance of the new model when compared to other unit models.
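A small simulation-and-fit sketch is given below. It assumes the Bemmaor-Glady form of the gamma/Gompertz CDF, F(x) = 1 - β^s / (β - 1 + e^{bx})^s, and reads the inverse-exponential scheme as the transform Y = exp(-X); both are assumptions made for illustration, not the authors' exact definitions or code.

# Minimal sketch under two stated assumptions: the gamma/Gompertz CDF is
# taken as F(x) = 1 - beta^s / (beta - 1 + exp(b*x))^s, and the unit variable
# is Y = exp(-X). The code only illustrates simulation plus maximum
# likelihood estimation on the unit scale.
import numpy as np
from scipy.optimize import minimize

def sample_unit_gg(n, b, s, beta, rng):
    """Draw Y = exp(-X) with X ~ gamma/Gompertz(b, s, beta) via the inverse CDF."""
    u = rng.random(n)
    x = np.log1p(beta * ((1.0 - u) ** (-1.0 / s) - 1.0)) / b
    return np.exp(-x)

def neg_loglik(params, y):
    b, s, beta = np.exp(params)  # optimize on the log scale to keep parameters positive
    t = beta - 1.0 + y ** (-b)
    ll = (np.log(b) + np.log(s) + s * np.log(beta)
          - (b + 1.0) * np.log(y) - (s + 1.0) * np.log(t))
    return -np.sum(ll)

rng = np.random.default_rng(1)
y = sample_unit_gg(2000, b=1.5, s=0.8, beta=2.0, rng=rng)

fit = minimize(neg_loglik, x0=np.zeros(3), args=(y,), method="Nelder-Mead")
print(np.exp(fit.x))  # (b, s, beta) estimates; should land near (1.5, 0.8, 2.0) up to sampling error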


2021 ◽  
pp. 016555152199863
Author(s):  
Ismael Vázquez ◽  
María Novo-Lourés ◽  
Reyes Pavón ◽  
Rosalía Laza ◽  
José Ramón Méndez ◽  
...  

Current research has evolved in such a way that scientists must not only adequately describe the algorithms they introduce and the results of their application, but also ensure that those results can be reproduced and compared with those obtained through other approaches. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study has analysed a significant number of CS/ML (Computer Science/Machine Learning) research data repositories and data sets and detected some limitations that hamper their utility. In particular, we identify and discuss the following in-demand functionalities for repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques that use dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities and (4) providing protection mechanisms for licencing issues and user rights. To demonstrate these functionalities, we created the STRep (Spam Text Repository) web application, which implements our recommendations adapted to the field of spam text repositories. In addition, we launched an instance of STRep at https://rdata.4spam.group to facilitate understanding of this study.
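As an illustration of functionalities (2)-(4), the hypothetical sketch below (not STRep's actual API) records a pre-processing pipeline as plain data so the same steps can be re-run without the repository, with a licence field carried alongside.

# Hypothetical sketch (not STRep's actual API): a pre-processing pipeline
# described as plain data, so experiments can be reproduced outside the
# repository and licence information travels with the data set description.
import json

pipeline = {
    "dataset": "example-spam-corpus",  # hypothetical data set name
    "steps": [
        {"op": "lowercase"},
        {"op": "strip_html"},
        {"op": "tokenize", "params": {"scheme": "whitespace"}},
        {"op": "remove_stopwords", "params": {"lang": "en"}},
    ],
    "licence": "CC-BY-4.0",            # recorded for functionality (4)
}

def apply_pipeline(text: str, spec: dict) -> list:
    """Re-apply the declared steps without any repository code (toy operations)."""
    ops = {
        "lowercase": lambda t: t.lower(),
        "strip_html": lambda t: t.replace("<br>", " "),  # toy placeholder
        "tokenize": lambda t: t.split(),
        "remove_stopwords": lambda toks: [w for w in toks if w not in {"the", "a", "an"}],
    }
    out = text
    for step in spec["steps"]:
        out = ops[step["op"]](out)
    return out

print(json.dumps(pipeline, indent=2))
print(apply_pipeline("The cheap<br>offer is a SCAM", pipeline))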


2013 ◽  
Vol 11 (3) ◽  
pp. 157-157
Author(s):  
L. McFarland ◽  
J. Richter ◽  
C. Bredfeldt

2007 ◽  
Vol 56 (6) ◽  
pp. 75-83 ◽  
Author(s):  
X. Flores ◽  
J. Comas ◽  
I.R. Roda ◽  
L. Jiménez ◽  
K.V. Gernaey

The main objective of this paper is to present the application of selected multivariate statistical techniques to the analysis of plant-wide wastewater treatment plant (WWTP) control strategies. In this study, cluster analysis (CA), principal component analysis/factor analysis (PCA/FA) and discriminant analysis (DA) are applied to the evaluation matrix data set obtained by simulation of several control strategies applied to the plant-wide IWA Benchmark Simulation Model No. 2 (BSM2). These techniques make it possible to (i) determine natural groups or clusters of control strategies with similar behaviour, (ii) find and interpret hidden, complex and causal relationships in the data set and (iii) identify the important discriminant variables within the groups found by the cluster analysis. This study illustrates the usefulness of multivariate statistical techniques for both the analysis and the interpretation of complex multicriteria data sets, enabling an improved use of the information for effective evaluation of control strategies.
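A minimal sketch of this analysis flow on a made-up evaluation matrix (assumed criteria and cluster count; not the authors' code) is given below: scale the matrix, cluster the strategies, project them with PCA, and use linear discriminant analysis to see which criteria separate the clusters.

# Minimal sketch (made-up evaluation matrix and assumed settings, not the
# authors' code): cluster analysis, PCA and discriminant analysis applied
# to a control-strategy evaluation matrix.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical evaluation matrix: rows are simulated control strategies,
# columns are evaluation criteria (cost, effluent quality, violations, ...).
X = rng.normal(size=(30, 6))
Z = StandardScaler().fit_transform(X)

# (i) natural groups of strategies with similar behaviour
labels = AgglomerativeClustering(n_clusters=3).fit_predict(Z)

# (ii) hidden structure among the criteria
scores = PCA(n_components=2).fit_transform(Z)

# (iii) criteria that discriminate between the groups found in (i)
lda = LinearDiscriminantAnalysis().fit(Z, labels)

print("cluster sizes:", np.bincount(labels))
print("first strategy in PC space:", scores[0])
print("LDA coefficients per criterion:\n", lda.coef_)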

