ABOUT PARAMETRIZATION OF SELECTION OF SIGNIFICANT CLUSTERS

One of the fundamental tasks of cluster analysis is partitioning multidimensional data samples into clusters: groups of objects that are close in the sense of some given measure of similarity. In some problems the number of clusters is set a priori, but more often it must be determined in the course of the clustering itself. With a large number of clusters, especially if the data is “noisy,” the result becomes difficult for experts to analyze, so the number of clusters under consideration is artificially reduced. Formal means of merging “neighboring” clusters are considered, creating the basis for parameterizing the number of significant clusters in the “natural” clustering model [1].
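The idea of merging “neighboring” clusters under a tunable parameter can be illustrated with a minimal sketch. This is not the method of the paper: it assumes clusters are summarized by centroids, uses Euclidean distance, and merges any centroids linked by a chain of distances below a hypothetical threshold `eps`. The function name and the sample centroids are illustrative assumptions.

```python
import math


def merge_close_clusters(centroids, eps):
    """Group centroid indices that are linked by pairwise distances below eps.

    Assumption: single-linkage style merging with a union-find structure;
    eps plays the role of the parameter that controls how many
    "significant" clusters remain.
    """
    parent = list(range(len(centroids)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if math.dist(centroids[i], centroids[j]) < eps:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(centroids)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Illustrative centroids (made up for this sketch).
centroids = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.4, 5.0), (10.0, 0.0)]
merged = merge_close_clusters(centroids, eps=1.0)
print(len(merged), merged)
```

Raising `eps` reduces the number of remaining groups and lowering it increases the number, which is one simple way to expose the count of significant clusters as a parameter.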

Cluster Analysis of Antigenic Profiles of Tumors: Selection of Number of Clusters Using Akaike’s Information Criterion

Methods of Information in Medicine ◽

10.1055/s-0038-1634783 ◽

1990 ◽

Vol 29 (03) ◽

pp. 200-204 ◽

Cited By ~ 7

Author(s):

J. A. Koziol

Keyword(s):

Cluster Analysis ◽

Basic Problem ◽

Information Criterion ◽

Akaike's Information Criterion ◽

Cell Surface Antigens ◽

Number Of Clusters ◽

Multinomial Data ◽

Tumor Types ◽

Selection Of

Abstract A basic problem of cluster analysis is the determination or selection of the number of clusters evinced in any set of data. We address this issue with multinomial data using Akaike’s information criterion and demonstrate its utility in identifying an appropriate number of clusters of tumor types with similar profiles of cell surface antigens.
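As a rough illustration of selecting the number of clusters with Akaike’s information criterion, the sketch below computes AIC = 2p − 2 log L for each candidate k and picks the minimum. It assumes a finite mixture of multinomials over m categories, which may not be the model used in the paper; the log-likelihood values are hypothetical numbers invented for the example, and all function names are illustrative.

```python
def multinomial_mixture_params(k, m):
    """Free parameters of a k-component mixture of multinomials over m categories:
    (k - 1) mixing weights plus (m - 1) free probabilities per component."""
    return (k - 1) + k * (m - 1)


def aic(log_likelihood, n_params):
    """Akaike's information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def select_k(log_likelihoods, m):
    """log_likelihoods maps each candidate k to its maximized log-likelihood.
    Returns the k with the smallest AIC together with all scores."""
    scores = {
        k: aic(ll, multinomial_mixture_params(k, m))
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical maximized log-likelihoods, invented for illustration only.
example_loglik = {1: -520.0, 2: -480.0, 3: -474.0, 4: -472.5}
best_k, scores = select_k(example_loglik, m=5)
print(best_k, scores)
```

The log-likelihood always improves as k grows, so the penalty term 2p is what stops the criterion from favoring the largest k.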

Automatic selection of the number of clusters in multidimensional data problems

Proceedings of 3rd IEEE International Conference on Image Processing ◽

10.1109/icip.1996.560574 ◽

2002 ◽

Cited By ~ 3

Author(s):

A. Marazzi ◽

P. Gamba ◽

A. Mecocci ◽

A. Semboloni

Keyword(s):

Multidimensional Data ◽

Number Of Clusters ◽

Automatic Selection ◽

Selection Of

Multi-Attribute Utility Theory Based K-Means Clustering Applications

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.2017040101 ◽

2017 ◽

Vol 13 (2) ◽

pp. 1-12 ◽

Cited By ~ 2

Author(s):

Jungmok Ma

Keyword(s):

Cluster Analysis ◽

Utility Theory ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

User Preferences ◽

Number Of Clusters ◽

Clustering Problem ◽

Multi Attribute Utility Theory ◽

Systematic Framework ◽

Selection Of

One of the major obstacles in applying the k-means clustering algorithm is the selection of the number of clusters k. A multi-attribute utility theory (MAUT)-based k-means clustering algorithm is proposed to tackle the problem by incorporating user preferences. Using MAUT, the decision maker's value structure for the number of clusters and other attributes can be quantitatively modeled and used as the objective function of k-means. A target clustering problem for a military targeting process is used to demonstrate the MAUT-based k-means and provide a comparative study. The results show that existing clustering algorithms do not necessarily reflect user preferences, while the MAUT-based k-means provides a systematic framework for preference modeling in cluster analysis.

Influence of meteorological input data on backtrajectory cluster analysis – a seven-year study for southeastern Spain

Advances in Science and Research ◽

10.5194/asr-2-65-2008 ◽

2008 ◽

Vol 2 (1) ◽

pp. 65-70 ◽

Cited By ~ 12

Author(s):

M. Cabello ◽

J. A. G. Orza ◽

V. Galiano ◽

G. Ruiz

Keyword(s):

Cluster Analysis ◽

Input Data ◽

Meteorological Data ◽

Data Sets ◽

Number Of Clusters ◽

Southeastern Spain ◽

Meteorological Input ◽

Initial Selection ◽

Southeast Spain ◽

Selection Of

Abstract. Backtrajectory differences and clustering sensitivity to the meteorological input data are studied. Trajectories arriving in Southeast Spain (Elche) at 3000, 1500 and 500 m for the 7-year period 2000–2006 have been computed employing two widely used meteorological data sets: the NCEP/NCAR Reanalysis and the FNL data sets. Differences between trajectories grow linearly at least up to 48 h and grow faster after 72 h. A k-means cluster analysis performed on each set of trajectories shows differences in the identified clusters (main flows), partially because the number of clusters in each clustering solution differs for the trajectories arriving at 3000 and 1500 m. Trajectory membership in the identified flows is in general more sensitive to the input meteorological data than to the initial selection of cluster centroids.

Choosing the Number of Clusters, Subset Selection of Variables, and Outlier Detection in the Standard Mixture-Model Cluster Analysis

New Approaches in Classification and Data Analysis - Studies in Classification, Data Analysis, and Knowledge Organization ◽

10.1007/978-3-642-51175-2_19 ◽

1994 ◽

pp. 169-177 ◽

Cited By ~ 7

Author(s):

Hamparsum Bozdogan

Keyword(s):

Cluster Analysis ◽

Outlier Detection ◽

Mixture Model ◽

Subset Selection ◽

Standard Mixture ◽

Number Of Clusters ◽

Model Cluster ◽

Selection Of Variables ◽

Selection Of

Estimating the number of clusters via a corrected clustering instability

Computational Statistics ◽

10.1007/s00180-020-00981-5 ◽

2020 ◽

Vol 35 (4) ◽

pp. 1879-1894

Author(s):

Jonas M. B. Haslbeck ◽

Dirk U. Wulff

Keyword(s):

Cluster Analysis ◽

R Package ◽

Time Model ◽

Number Of Clusters ◽

Model Based ◽

Model Free ◽

Current Instability ◽

First Time ◽

Selection Of

Abstract We improve instability-based methods for the selection of the number of clusters k in cluster analysis by developing a corrected clustering distance that corrects for the unwanted influence of the distribution of cluster sizes on cluster instability. We show that our corrected instability measure outperforms current instability-based measures across the whole sequence of possible k, overcoming limitations of current instability-based methods for large k. We also compare, for the first time, model-based and model-free approaches to determining cluster instability and find their performance to be comparable. We make our method available in the R-package .
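Instability-based methods rest on a distance between two clusterings of the same objects. A minimal sketch of one common, uncorrected form of such a distance is given below: the fraction of object pairs that are grouped together in one clustering but apart in the other. The correction for cluster-size distribution described in the abstract is not reproduced here, and the function name is an illustrative assumption.

```python
from itertools import combinations


def clustering_distance(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings disagree.

    A pair disagrees when it shares a cluster in one labeling but not in
    the other. Label values themselves do not matter, only co-membership.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two objects are required")

    disagreements = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a != same_b:
            disagreements += 1
    return disagreements / (n * (n - 1) / 2)


# Identical partitions with swapped label names have distance 0.
print(clustering_distance([0, 0, 1, 1], [1, 1, 0, 0]))
# Two crossing partitions of four objects disagree on 4 of 6 pairs.
print(clustering_distance([0, 0, 1, 1], [0, 1, 0, 1]))
```

In an instability procedure, this distance would be averaged over clusterings fitted to many perturbed samples for each candidate k, and the k with the lowest average would be preferred.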

Innovative Approach to Information Search by Example of a Patent Analysis of an Important Substitution Plan

Экономическая наука современной России ◽

10.33293/1609-1442-2020-1(88)-143-157 ◽

2020 ◽

pp. 143-157

Author(s):

Maria A. Milkova

Keyword(s):

Information Search ◽

Topic Modeling ◽

Cognitive Biases ◽

A Priori ◽

Import Substitution ◽

Innovative Approach ◽

Iterative Search ◽

Comprehensive Picture ◽

Priori Information ◽

Selection Of

Information now accumulates so rapidly that the usual concept of iterative search requires revision. In a world oversaturated with information, comprehensively covering and analyzing the problem under study places high demands on search methods. An innovative approach to search should flexibly take into account the large amount of already accumulated knowledge and the a priori requirements for results. The results, in turn, should immediately provide a roadmap of the area being studied, with as much detail as possible. Search based on topic modeling, the so-called topic search, meets all these requirements and thereby streamlines work with information, increases the efficiency of knowledge production, and avoids cognitive biases in the perception of information, which is important at both the micro and macro levels. To demonstrate topic search, the article considers the task of analyzing an import substitution program based on patent data. The program includes plans for 22 industries and contains more than 1,500 products and technologies proposed for import substitution. Patent search based on topic modeling makes it possible to search directly by blocks of a priori information (the terms of the industrial import substitution plans) and to obtain as output a selection of relevant documents for each industry. This approach not only provides a comprehensive picture of the effectiveness of the program as a whole, but also gives a more detailed visual picture of which groups of products and technologies have been patented.

Using clusterization principles for the selection of repair sections of main oil pipelines based on diagnostic data

Proceedings of the Mavlyutov Institute of Mechanics ◽

10.21662/uim2011.1.019 ◽

2011 ◽

Vol 8 (1) ◽

pp. 201-210

Author(s):

R.M. Bogdanov

Keyword(s):

Cluster Analysis ◽

Defect Density ◽

Density Variation ◽

Distance Functions ◽

Oil Pipeline ◽

Oil Pipelines ◽

Classification Of Images ◽

Main Oil Pipeline ◽

Selection Of

The problem of determining the repair sections of a main oil pipeline is solved on the basis of image classification using distance functions and the clustering principle. The criteria characterizing a cluster are defined by given values, and a defect is assigned to a cluster by comparison with these values. Procedures for redistributing defects among cluster zones are provided, and the parameters of the cluster zones are updated accordingly. Calculations demonstrate the range of defect density variation across pipeline sections and the general capability of cluster analysis to configure linear objects with arbitrary density.

Incorporating radiomics into clinical trials: expert consensus on considerations for data-driven compared to biologically driven quantitative biomarkers

European Radiology ◽

10.1007/s00330-020-07598-8 ◽

2021 ◽

Author(s):

Laure Fournier ◽

Lena Costaridou ◽

Luc Bidaut ◽

Nicolas Michoux ◽

Frederic E. Lecouvet ◽

...

Keyword(s):

Clinical Trials ◽

Quantitative Imaging ◽

A Priori ◽

External Validation ◽

Data Driven ◽

Imaging Biomarkers ◽

Clinical Validation ◽

Biological Processes ◽

Imaging Data ◽

Selection Of

Abstract Existing quantitative imaging biomarkers (QIBs) are associated with known biological tissue characteristics and follow a well-understood path of technical, biological and clinical validation before incorporation into clinical trials. In radiomics, novel data-driven processes extract numerous visually imperceptible statistical features from the imaging data with no a priori assumptions on their correlation with biological processes. The selection of relevant features (radiomic signature) and incorporation into clinical trials therefore requires additional considerations to ensure meaningful imaging endpoints. Also, the number of radiomic features tested means that power calculations would result in sample sizes impossible to achieve within clinical trials. This article examines how the process of standardising and validating data-driven imaging biomarkers differs from those based on biological associations. Radiomic signatures are best developed initially on datasets that represent diversity of acquisition protocols as well as diversity of disease and of normal findings, rather than within clinical trials with standardised and optimised protocols as this would risk the selection of radiomic features being linked to the imaging process rather than the pathology. Normalisation through discretisation and feature harmonisation are essential pre-processing steps. Biological correlation may be performed after the technical and clinical validity of a radiomic signature is established, but is not mandatory. Feature selection may be part of discovery within a radiomics-specific trial or represent exploratory endpoints within an established trial; a previously validated radiomic signature may even be used as a primary/secondary endpoint, particularly if associations are demonstrated with specific biological processes and pathways being targeted within clinical trials. 
Key Points
• Data-driven processes like radiomics risk false discoveries due to high-dimensionality of the dataset compared to sample size, making adequate diversity of the data, cross-validation and external validation essential to mitigate the risks of spurious associations and overfitting.
• Use of radiomic signatures within clinical trials requires multistep standardisation of image acquisition, image analysis and data mining processes.
• Biological correlation may be established after clinical validation but is not mandatory.
