Performances of Clustering Methods Considering Data Transformation and Sample Size: An Evaluation with Fisheries Survey Data

2020 ◽  
Vol 19 (3) ◽  
pp. 659-668
Author(s):  
Jia Wo ◽  
Chongliang Zhang ◽  
Binduo Xu ◽  
Ying Xue ◽  
Yiping Ren
Circulation ◽  
2018 ◽  
Vol 138 (Suppl_1) ◽  
Author(s):  
Anshul Saxena ◽  
Emir Veledar

Introduction: Every year more than 75 million adults are diagnosed with hypertension (HTN) in the US. Despite spending of $50 billion/year to lower HTN-related mortality, only about 54% of patients have the condition managed. In 2014, mortality due to HTN or related complications was 410,000. Acute myocardial infarction (AMI) is a more severe manifestation of coronary artery disease, with 735,000 people diagnosed yearly. It is estimated that hospitals lose $4493-$7940 per patient due to AMI. We studied costs for events associated with HTN and AMI in the Medical Expenditure Panel Survey (MEPS). Methods: Individuals aged ≥20 in MEPS (2014) were included. Hierarchical clustering was used for analysis. Age, sex, education status, race, Hispanic ethnicity, US citizenship status, family income, insurance, and reported HTN and/or AMI events were entered as dimensions. Cluster and descriptive analyses were adjusted for survey weights. Results: About 236 million weighted individuals were eligible for the analysis (Female: 51.7%; White: 79.1%; High school or above: 61%; Any private insurance: 66%; and Income ≥ 400% of poverty line: 41%). Of these, about 4.9 million participants reported costs/events related to AMI and 61.8 million related to HTN. Five groups were identified based on similarity within each cluster (Table 1). Cluster 1 contained the most participants (weighted n = 101 million) and the largest number of participants who reported HTN (21.5%) among all clusters. In general, cluster 5 had the lowest annual total of direct health care payments and events. Selected medical history is described in Table 2. Conclusion: The results give insight into the cost of HTN and AMI events and their relationship with other comorbid conditions. Being computationally demanding, clustering methodologies are rarely used in survey data analysis. This could be the first example of a clustering approach applied to MEPS data. Once clustering methods for weighted survey data improve, estimates will be more reliable.
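The abstract notes that descriptive analyses were adjusted for survey weights. As a rough illustration only, a weighted proportion can be sketched as below. The function name and the toy numbers are assumptions made for this example; they are not MEPS data and not the authors' code, and a full survey analysis would also account for strata and sampling units when estimating variance.

```python
def weighted_proportion(indicator, weights):
    """Share of the weighted population for which the indicator is true.

    indicator: sequence of booleans (or 0/1), one per respondent.
    weights:   sequence of non-negative survey weights, one per respondent.
    """
    if len(indicator) != len(weights):
        raise ValueError("indicator and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Each respondent counts for as many people as their weight represents.
    flagged_weight = sum(w for flag, w in zip(indicator, weights) if flag)
    return flagged_weight / total_weight


# Toy data (hypothetical): four respondents, two of whom report the condition.
reported_condition = [True, False, True, False]
survey_weights = [100.0, 300.0, 50.0, 50.0]
share = weighted_proportion(reported_condition, survey_weights)
```

In this toy case the unweighted share is one half, but the weighted share is lower because the respondents who report the condition carry smaller weights.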


2014 ◽  
Vol 92 (2) ◽  
pp. 485-497 ◽  
Author(s):  
P. Boddhireddy ◽  
M. J. Kelly ◽  
S. Northcutt ◽  
K. C. Prayaga ◽  
J. Rumph ◽  
...  

2016 ◽  
Vol 32 (3) ◽  
pp. 643-660 ◽  
Author(s):  
Samuel De Haas ◽  
Peter Winker

Abstract: Falsified interviews represent a serious threat to empirical research based on survey data. The identification of such cases is important to ensure data quality. Applying cluster analysis to a set of indicators helps to identify suspicious interviewers when a substantial share of all of their interviews are complete falsifications, as shown by previous research. This analysis is extended to the case when only a share of questions within all interviews provided by an interviewer is fabricated. The assessment is based on synthetic datasets with a priori set properties. These are constructed from a unique experimental dataset containing both real and fabricated data for each respondent. Such a bootstrap approach makes it possible to evaluate the robustness of the method when the share of fabricated answers per interview decreases. The results indicate a substantial loss of discriminatory power in the standard cluster analysis if the share of fabricated answers within an interview becomes small. Using a novel cluster method which allows imposing constraints on cluster sizes, performance can be improved, in particular when only a few falsifiers are present. This new approach will help to increase the robustness of survey data by detecting potential falsifiers more reliably.


2003 ◽  
Vol 33 (1) ◽  
pp. 160-162

The following data come from three surveys conducted among Palestinian refugees in the West Bank and Gaza (January 2003), Jordan (May 2003), and Lebanon (June 2003). The sample size was 4,506, distributed among the three areas almost equally (about 1,500 interviews in each area). Earlier Palestinian Center for Policy and Survey Research (PCPSR) surveys cited in the press release found that “the overwhelming majority of the refugees (more than 95 percent) insist on maintaining the ‘right of return’ as a sacred right that can never be given up.” On this basis, the press release states that the goal of the new surveys was to find out “how refugees would behave once they have obtained that right and how they would react under various likely conditions and circumstances of the permanent settlement.” More specifically, the surveys (the questionnaire for which was prepared with PLO and PA bodies concerned with negotiations and refugee affairs) aimed at finding out refugee preferences in a permanent settlement and at making estimates, for planning purposes, of how many refugees might opt to live in the Palestinian state. In addition to the sections on refugee views and preferences (the press release summaries of which are reproduced below), the survey data also covered socioeconomic data. The final results and analysis of the surveys are expected to be available toward the end of the year; the full press release can be obtained from the PCPSR Web site at www.pcpsr.org.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Xiao Liu ◽  
Weichen Song ◽  
Brandon Y. Wong ◽  
Ting Zhang ◽  
Shunying Yu ◽  
...  

Abstract. Background: With the expanding applications of mass cytometry in medical research, a wide variety of clustering methods, both semi-supervised and unsupervised, have been developed for data analysis. Selecting the optimal clustering method can accelerate the identification of meaningful cell populations. Results: To address this issue, we compared three classes of performance measures, “precision” as external evaluation, “coherence” as internal evaluation, and stability, of nine methods based on six independent benchmark datasets. Seven unsupervised methods (Accense, Xshift, PhenoGraph, FlowSOM, flowMeans, DEPECHE, and kmeans) and two semi-supervised methods (Automated Cell-type Discovery and Classification and linear discriminant analysis (LDA)) are tested on six mass cytometry datasets. We compute and compare all defined performance measures against random subsampling, varying sample sizes, and the number of clusters for each method. LDA reproduces the manual labels most precisely but does not rank top in internal evaluation. PhenoGraph and FlowSOM perform better than other unsupervised tools in precision, coherence, and stability. PhenoGraph and Xshift are more robust when detecting refined sub-clusters, whereas DEPECHE and FlowSOM tend to group similar clusters into meta-clusters. The performances of PhenoGraph, Xshift, and flowMeans are impacted by increased sample size, but FlowSOM is relatively stable as sample size increases. Conclusion: All the evaluations, including precision, coherence, stability, and clustering resolution, should be considered together when choosing an appropriate tool for cytometry data analysis. Thus, we provide decision guidelines based on these characteristics for the general reader to more easily choose the most suitable clustering tools.
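External evaluation compares a clustering against reference labels such as manual gating. The abstract does not say which statistic underlies its “precision” measure, so the sketch below uses the adjusted Rand index (ARI) purely as one common external measure; the function name and the toy labels are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a reference labeling and a clustering.

    Returns 1.0 for identical partitions (up to label renaming), values
    near 0.0 for chance-level agreement, and negative values for
    agreement worse than chance.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two items to compare partitions")

    # Contingency counts: items sharing each (reference, predicted) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Numbers of item pairs placed together in both, in the reference
    # partition, and in the predicted partition, respectively.
    together_both = sum(comb(c, 2) for c in pair_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions are a single cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy labels (hypothetical): same grouping, different cluster names.
manual_labels = [0, 0, 1, 1, 2, 2]
cluster_labels = [1, 1, 0, 0, 2, 2]
ari_same = adjusted_rand_index(manual_labels, cluster_labels)
```

Because the index depends only on which items are grouped together, renaming the clusters does not change the score. Repeating such a comparison on random subsamples of different sizes is one way to probe the kind of stability the abstract describes.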


Author(s):  
Aslı Suner

Abstract: A number of specialized clustering methods have been developed so far for the accurate analysis of single-cell RNA-sequencing (scRNA-seq) expression data, and several reports have been published documenting the performance measures of these clustering methods under different conditions. However, to date, there are no available studies regarding the systematic evaluation of the performance measures of the clustering methods taking into consideration the sample size and cell composition of a given scRNA-seq dataset. Herein, a comprehensive performance evaluation study of 11 selected scRNA-seq clustering methods was performed using synthetic datasets with known sample sizes and number of subpopulations, as well as varying levels of transcriptome complexity. The results indicate that the overall performance of the clustering methods under study is highly dependent on the sample size and complexity of the scRNA-seq dataset. In most cases, better clustering performance was obtained as the number of cells in a given expression dataset increased. The findings of this study also highlight the importance of sample size for the successful detection of rare cell subpopulations with an appropriate clustering tool.


2019 ◽  
Vol 2 (1) ◽  
pp. 281
Author(s):  
Irfandi Buamonabot ◽  
Nurlaila Nurlaila ◽  
Nurdin Nurdin

Education plays an important role in the development of a nation and is an important instrument for ensuring the sustainability of individuals and society. Education therefore has considerable influence on whether a nation advances. The purpose of this study was to examine the effect of college attributes on satisfaction with the choice of college. Purposive sampling was used, restricted to students who had already registered, which yielded a sample of 85 respondents, all students of the Faculty of Economics and Business, Khairun University. The survey data were analyzed using simple regression. The results showed that college attributes had an effect on satisfaction with the choice of college. This is consistent with the view of Oliver (1997) that satisfaction is the customer's response to the fulfillment of their needs: a judgment that a feature of a good or service, or the good or service itself, provides a level of comfort associated with fulfilling a need, whether the fulfillment falls short of or exceeds expectations.
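The study reports a simple regression of satisfaction on college attributes. A minimal sketch of fitting such a one-predictor model by ordinary least squares follows; the function name, variable names, and data points are hypothetical and chosen only to show the calculation, and they do not reproduce the study's data or results.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must vary for the slope to be defined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up scores (hypothetical): attribute rating vs. satisfaction rating.
attribute_score = [1.0, 2.0, 3.0, 4.0]
satisfaction_score = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_regression(attribute_score, satisfaction_score)
```

A positive slope in such a fit would correspond to the direction of effect the abstract reports; judging whether the effect is statistically distinguishable from zero would additionally require a standard error for the slope.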

