Efficient Sampling and Handling of Variance in Tuning Data Mining Models

Author(s):  
Patrick Koch ◽  
Wolfgang Konen
2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Mahboubeh Parsaeian ◽  
Mahdi Mahdavi ◽  
Mojdeh Saadati ◽  
Parinaz Mehdipour ◽  
Ali Sheidaei ◽  
...  

Abstract
Background: Sampling a small number of participants from an entire country is not straightforward. Researchers often fall back on sampling from a single setting or a few settings, which limits the generalizability of findings. There is therefore a need for an efficient sampling method for small-sample surveys that can produce generalizable results at the country level.
Methods: The data comprised twenty proxy variables measuring health service demands, structures, and outcomes in 413 districts of Iran. We used two data mining methods, the hierarchical clustering method (HCM) and the model-based clustering method (MCM), to create homogeneous groups of districts, i.e., strata, based on these variables. We compared the internal validity and stability of the two methods using statistical indices. An expert group checked the face validity of the methods, particularly regarding the total number of strata and the combination of districts in each stratum. The efficiency of the selected method, measured as the inverse of the variance, was compared with that of simple random sampling (SRS) through simulation. The sampling design was tested in a national study in Iran that aimed to evaluate the quality and costs of medical care for eight selected diseases by recruiting only 300 participants per disease at the country level.
Results: MCM and HCM divided the districts into eight and two clusters, respectively. The measures of internal validity and stability showed that the clusters created by MCM were more separated, compact, and stable, and these clusters therefore formed our optimum strata. The probability of death from stroke, chronic obstructive pulmonary disease, and in-hospital mortality rate were the most important indicators that distinguished the eight strata. Based on the simulation results, MCM increased the efficiency of the sampling design by up to 1.7 times compared with SRS.
Conclusions: The use of data mining improved sampling efficiency by up to 1.7 times relative to SRS and markedly reduced the number of strata, to eight for the entire country. The proposed sampling design also identified key variables that could be used to classify districts in Iran when sampling from these target populations in future studies.
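
The comparison of a stratified design against SRS by inverse variance can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: the population is synthetic, the eight strata have equal size and hypothetical means, and all function names are illustrative. It estimates the ratio Var(SRS mean) / Var(stratified mean), which equals the ratio of the efficiencies when efficiency is defined as the inverse of the variance.

```python
import random
import statistics

def make_population(rng, strata_means, stratum_size, sd=1.0):
    """Build synthetic strata: one list of outcome values per stratum."""
    return [[rng.gauss(mu, sd) for _ in range(stratum_size)] for mu in strata_means]

def srs_mean(rng, strata, n):
    """Sample mean from a simple random sample of size n over the pooled population."""
    pooled = [y for stratum in strata for y in stratum]
    return statistics.fmean(rng.sample(pooled, n))

def stratified_mean(rng, strata, n):
    """Weighted mean with n // k units drawn per stratum.
    With equal-size strata (assumed here) this is proportional allocation."""
    total = sum(len(stratum) for stratum in strata)
    per_stratum = n // len(strata)
    return sum(
        (len(stratum) / total) * statistics.fmean(rng.sample(stratum, per_stratum))
        for stratum in strata
    )

def relative_efficiency(strata, n, reps=2000, seed=0):
    """Var(SRS estimate) / Var(stratified estimate), estimated over repeated draws."""
    rng = random.Random(seed)
    srs = [srs_mean(rng, strata, n) for _ in range(reps)]
    strat = [stratified_mean(rng, strata, n) for _ in range(reps)]
    return statistics.variance(srs) / statistics.variance(strat)

# Hypothetical population: 8 strata of 50 units whose means differ strongly.
rng = random.Random(42)
strata = make_population(rng, strata_means=[0, 1, 2, 3, 4, 5, 6, 7], stratum_size=50)
ratio = relative_efficiency(strata, n=40)
print(f"relative efficiency (stratified vs SRS): {ratio:.2f}")
```

A ratio above 1 means the stratified design yields a lower-variance estimate for the same sample size. The gain depends on how much of the outcome variance lies between strata, so the value from this synthetic setup is not comparable to the 1.7 reported in the abstract.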


2020 ◽  
Author(s):  
Mohammed J. Zaki ◽  
Wagner Meira, Jr

2010 ◽  
Vol 24 (2) ◽  
pp. 112-119 ◽  
Author(s):  
F. Riganello ◽  
A. Candelieri ◽  
M. Quintieri ◽  
G. Dolce

The purpose of the study was to identify significant changes in heart rate variability (HRV), an emerging descriptor of emotional conditions, concomitant with complex auditory stimuli of emotional value (music). In healthy controls, patients with traumatic brain injury (TBI), and subjects in the vegetative state (VS), the heartbeat was continuously recorded while the subjects passively listened to each of four music samples of different authorship. The parametric and nonparametric heart rate frequency spectra were computed, and the spectral descriptors were processed by data-mining procedures. Data mining identified nu_lf (the normalized unit of the low-frequency range of the spectrum) as the significant descriptor by which the HRV responses to music of healthy controls, TBI patients, and VS subjects could be clustered into classes matching those defined by the subjective reports of the controls and TBI patients. These findings support the potential of HRV to reflect complex emotional stimuli and suggest that residual emotional reactions continue to occur in VS. HRV descriptors and data mining appear applicable in research on brain function in the absence of consciousness.
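
The abstract does not state how nu_lf was computed. As a rough sketch, the low-frequency power in normalized units is commonly taken as LF / (LF + HF) × 100, with LF covering 0.04–0.15 Hz and HF covering 0.15–0.40 Hz. The band limits, the formula, and the function names below are assumptions based on that convention, not details taken from the study.

```python
def band_power(freqs, psd, lo, hi):
    """Trapezoidal integral of the spectrum over grid segments lying inside [lo, hi]."""
    total = 0.0
    for i in range(len(freqs) - 1):
        f0, f1 = freqs[i], freqs[i + 1]
        if f0 >= lo and f1 <= hi:
            total += 0.5 * (psd[i] + psd[i + 1]) * (f1 - f0)
    return total

def lf_normalized_units(freqs, psd, lf=(0.04, 0.15), hf=(0.15, 0.40)):
    """LF power as a percentage of LF + HF power (assumed definition of nu_lf)."""
    lf_power = band_power(freqs, psd, *lf)
    hf_power = band_power(freqs, psd, *hf)
    return 100.0 * lf_power / (lf_power + hf_power)

# Hypothetical flat spectrum sampled every 0.01 Hz from 0 to 0.5 Hz.
freqs = [i / 100 for i in range(51)]
psd = [1.0] * len(freqs)
nu_lf = lf_normalized_units(freqs, psd)
print(f"nu_lf = {nu_lf:.2f}")
```

For a flat spectrum the result depends only on the band widths, which makes it a convenient sanity check before applying the function to a real heart rate spectrum.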


Author(s):  
Kiran Kumar S V N Madupu

Big Data has a substantial influence on scientific discovery and on value creation. This paper presents approaches to data mining and modern technologies in Big Data. The difficulties of data mining, and of data mining with big data, are discussed. Some technological developments in data mining, and in data mining with big data, are also presented.


2020 ◽  
Vol 3 (3) ◽  
pp. 187-201
Author(s):  
Sufajar Butsianto ◽  
Nindi Tya Mayangwulan

Car use in Indonesia increases every year, which drives automotive companies to compete to increase their sales. The aim of this study is to group sales data into clusters using the K-Means clustering data mining algorithm. The sales data are grouped by similarity, so that records with the same characteristics fall into the same cluster. The attributes used are brand and sales. The K-Means clustering process produced three clusters: Cluster 0 with 235 members (26%), categorized as Selling Well; Cluster 1 with 604 members (67%), categorized as Selling Poorly; and Cluster 2 with 61 members (7%), categorized as Best Selling. Validation of this clustering gave a Davies-Bouldin Index (DBI) of 0.341.
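
To make the clustering and validation steps concrete, here is a minimal sketch of K-Means followed by the Davies-Bouldin Index in plain Python. It is an illustration under stated assumptions, not the study's pipeline: the data are synthetic two-dimensional points rather than brand and sales records, the centroids are initialized with a deterministic farthest-first rule (chosen here for reproducibility, in place of the usual random initialization), and all function names are hypothetical.

```python
import math
import random

def farthest_first(points, k):
    """Deterministic initialization: start at points[0], then repeatedly add
    the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids

def davies_bouldin(points, labels, centroids):
    """Mean over clusters of the worst (s_i + s_j) / d(c_i, c_j) ratio, where s is
    the average distance of a cluster's members to its centroid. Lower is better.
    Assumes every cluster is non-empty and centroids are distinct."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        scatter.append(sum(math.dist(p, centroids[j]) for p in members) / len(members))
    worst = [max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

# Synthetic data: three well-separated groups of 30 points each.
rng = random.Random(1)
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(30)]
labels, centroids = kmeans(points, k=3)
dbi = davies_bouldin(points, labels, centroids)
print(f"cluster sizes: {[labels.count(j) for j in range(3)]}, DBI = {dbi:.3f}")
```

A DBI close to 0 indicates compact, well-separated clusters. The value printed here reflects the synthetic data only and is unrelated to the 0.341 reported for the sales data.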

