An Improved Genetic Algorithm for Text Clustering

2014 ◽  
Vol 989-994 ◽  
pp. 1853-1856
Author(s):  
Shi Dong Yu ◽  
Yuan Ding ◽  
Xi Cheng Ma ◽  
Jian Sun

The genetic algorithm (GA) is a self-adaptive probabilistic search method for solving optimization problems and has been applied widely in science and engineering. In this paper, we propose an improved variable string length genetic algorithm (IVGA) for text clustering. The algorithm automatically evolves the optimal number of clusters while producing a proper clustering of the data set. Each chromosome is encoded with special indices that indicate the location of each gene. A more effective set of evolutionary steps automatically balances population diversity against selective pressure across generations. The superiority of the improved genetic algorithm over the conventional variable string length genetic algorithm (VGA) is demonstrated by the quality of the resulting text clusterings.
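As a rough illustration only (not the authors' encoding), the sketch below shows the generic idea of a variable-string-length chromosome for clustering: each chromosome carries a variable number of candidate centroids, and fitness rewards compact clusters. The random document vectors, the fitness function, and all parameters are placeholder assumptions.

```python
# Minimal sketch of a variable-string-length chromosome for clustering.
# Assumptions (not from the paper): genes are cluster centroids in a document
# vector space, fitness is negative within-cluster distance; the paper's
# special index encoding and adaptive operators are not reproduced here.
import numpy as np

def random_chromosome(data, k_min=2, k_max=10):
    """A chromosome is a variable-length array of centroids sampled from the data."""
    k = np.random.randint(k_min, k_max + 1)
    idx = np.random.choice(len(data), k, replace=False)
    return data[idx].copy()

def fitness(chromosome, data):
    """Negative sum of distances from each point to its nearest centroid."""
    d = np.linalg.norm(data[:, None, :] - chromosome[None, :, :], axis=2)
    return -d.min(axis=1).sum()

# Usage with random vectors standing in for TF-IDF document vectors.
docs = np.random.rand(100, 20)
population = [random_chromosome(docs) for _ in range(30)]
best = max(population, key=lambda c: fitness(c, docs))
print(len(best), "clusters in the best initial chromosome")
```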

2021 ◽  
Vol 26 (2) ◽  
pp. 27
Author(s):  
Alejandro Castellanos-Alvarez ◽  
Laura Cruz-Reyes ◽  
Eduardo Fernandez ◽  
Nelson Rangel-Valdez ◽  
Claudia Gómez-Santillán ◽  
...  

Most real-world problems require the simultaneous optimization of multiple objective functions, which can conflict with each other. The environment of these problems usually involves imprecise information derived from inaccurate measurements or from variability in decision-makers' (DMs') judgments and beliefs, which can lead to unsatisfactory solutions. The imperfect knowledge can be present in the objective functions, the constraints, or the decision-maker's preferences. These optimization problems have been solved using various techniques such as multi-objective evolutionary algorithms (MOEAs). This paper proposes a new MOEA called NSGA-III-P (non-dominated sorting genetic algorithm III with preferences). The main characteristic of NSGA-III-P is an ordinal multi-criteria classification method for preference integration that guides the algorithm toward the region of interest defined by the decision-maker's preferences. In addition, the use of interval analysis allows preferences to be expressed with imprecision. The experiments contrasted several versions of the proposed method with the original NSGA-III to analyze the different selective pressures induced by the DM's preferences. In these experiments, the algorithms solved three-objective DTLZ instances. The results showed a better approximation to the region of interest for a DM when their preferences are considered.
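For readers unfamiliar with interval-valued preferences, the following sketch illustrates one way imprecise aspiration levels could be screened during selection. The Interval class, the possibility-degree rule, and the alpha threshold are illustrative assumptions; this is not the paper's ordinal multi-criteria classification method.

```python
# Minimal sketch of interval-based preference screening, loosely inspired by
# the idea of expressing DM preferences with imprecision. The thresholds and
# the possibility-degree rule below are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float
    def possibly_leq(self, other: "Interval") -> float:
        """Possibility degree that self <= other (between 0 and 1)."""
        if self.hi <= other.lo:
            return 1.0
        if self.lo >= other.hi:
            return 0.0
        span = (self.hi - self.lo) + (other.hi - other.lo)
        return (other.hi - self.lo) / span if span > 0 else 0.5

def in_region_of_interest(objectives, aspirations, alpha=0.6):
    """Keep a solution if every objective possibly meets the DM's aspiration level."""
    return all(f.possibly_leq(a) >= alpha for f, a in zip(objectives, aspirations))

# Usage: two objectives with imprecise values and imprecise aspiration levels.
sol = [Interval(0.2, 0.3), Interval(0.5, 0.7)]
asp = [Interval(0.25, 0.35), Interval(0.6, 0.8)]
print(in_region_of_interest(sol, asp))
```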


2014 ◽  
Vol 31 (8) ◽  
pp. 1778-1789
Author(s):  
Hongkang Lin

Purpose – The clustering/classification method proposed in this study, designated the PFV-index method, solves the following problems for a data set characterized by imprecision and uncertainty: first, discretizing the continuous values of all individual attributes within the data set; second, evaluating the optimality of the discretization results; third, determining the optimal number of clusters per attribute; and fourth, improving the classification accuracy (CA) of data sets characterized by uncertainty. The paper aims to discuss these issues. Design/methodology/approach – The proposed PFV-index method combines a particle swarm optimization algorithm, the fuzzy C-means method, variable precision rough set theory, and a new cluster validity index function. Findings – The method clusters the values of the individual attributes within the data set and achieves both the optimal number of clusters and the optimal CA. Originality/value – The validity of the proposed approach is investigated by comparing the classification results obtained for UCI data sets with those obtained using supervised classification methods, namely back-propagation neural networks (BPNN) and decision trees.
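The following sketch shows, in minimal form, how fuzzy C-means can cluster a single continuous attribute and how a validity index can pick the number of clusters. The tiny FCM implementation and the partition coefficient used as the index are stand-ins; they are not the PFV index or the PSO/VPRS components of the paper.

```python
# Minimal sketch: fuzzy C-means on one attribute plus a validity index to
# choose the number of clusters. All parameters are illustrative assumptions.
import numpy as np

def fcm(x, k, m=2.0, iters=50):
    """Tiny fuzzy C-means on 1-D attribute values; returns centers and memberships."""
    x = np.asarray(x, float).reshape(-1, 1)
    u = np.random.dirichlet(np.ones(k), size=len(x))       # membership matrix
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0)[:, None]     # weighted cluster centers
        d = np.abs(x - centers.T) + 1e-12                  # distances to centers
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
    return centers.ravel(), u

def partition_coefficient(u):
    """A classic fuzzy validity index (higher is better); a stand-in for the PFV index."""
    return (u ** 2).sum() / len(u)

values = np.concatenate([np.random.normal(0, .3, 50), np.random.normal(5, .3, 50)])
best_k = max(range(2, 6), key=lambda k: partition_coefficient(fcm(values, k)[1]))
print("optimal number of clusters for this attribute:", best_k)
```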


2012 ◽  
Vol 3 (1) ◽  
pp. 1-20
Author(s):  
Amit Banerjee

In this paper, a multi-objective genetic algorithm for data clustering based on the robust fuzzy least trimmed squares estimator is presented. The proposed clustering methodology addresses two critical issues in unsupervised data clustering: the ability to produce a meaningful partition of noisy data, and the requirement that the number of clusters be known a priori. The multi-objective genetic-algorithm-driven clustering technique optimizes the number of clusters as well as the cluster assignments and cluster prototypes. A two-parameter, mapped, fixed-point coding scheme is used to represent the assignment of data into the true retained set and the noisy trimmed set, together with the optimal number of clusters in the retained set. A three-objective criterion is used as the minimization functional for the multi-objective genetic algorithm. Results on well-known data sets from the literature suggest that the proposed methodology is superior to conventional fuzzy clustering algorithms that assume a known value for the optimal number of clusters.
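The core robustness idea, least trimmed squares, is easy to show in isolation: only the best-fitting fraction of points (the retained set) contributes to the clustering cost, so gross outliers in the trimmed set cannot distort the prototypes. The sketch below is a hedged illustration of that single ingredient; the fuzzy memberships, coding scheme, and three-objective criterion are omitted.

```python
# Minimal sketch of a least-trimmed-squares clustering cost; parameters and
# data are illustrative assumptions, not the paper's formulation.
import numpy as np

def trimmed_clustering_cost(data, prototypes, trim_fraction=0.1):
    """Sum of squared distances to the nearest prototype over the retained set only."""
    d2 = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    h = int(np.ceil((1 - trim_fraction) * len(data)))   # size of retained set
    return np.sort(d2)[:h].sum()

# Usage: two well-separated clusters plus a few outliers.
rng = np.random.default_rng(0)
clean = np.vstack([rng.normal(0, .2, (50, 2)), rng.normal(3, .2, (50, 2))])
noisy = np.vstack([clean, rng.uniform(-10, 10, (10, 2))])
protos = np.array([[0.0, 0.0], [3.0, 3.0]])
print(trimmed_clustering_cost(noisy, protos))
```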


Geophysics ◽  
2004 ◽  
Vol 69 (1) ◽  
pp. 212-221 ◽  
Author(s):  
Kevin P. Dorrington ◽  
Curtis A. Link

Neural-network prediction of well-log data using seismic attributes is an important reservoir characterization technique because it allows extrapolation of log properties throughout a seismic volume. The strength of neural networks in pattern recognition is key to their success in delineating the complex nonlinear relationship between seismic attributes and log properties. We have found that good neural-network generalization of well-log properties can be accomplished using a small number of seismic attributes. This study presents a new method for seismic attribute selection using a genetic-algorithm approach. The genetic-algorithm attribute selection uses neural-network training results to choose the optimal number and type of seismic attributes for porosity prediction. We apply the genetic-algorithm attribute-selection method to the C38 reservoir in the Stratton field 3D seismic data set. Eleven wells with porosity logs are used to train neural networks on the genetic-algorithm-selected attribute combinations. A histogram of 50 genetic-algorithm attribute-selection runs indicates that amplitude-based attributes are the best porosity predictors for this data set. On average, the genetic algorithm selected four attributes for optimal porosity-log prediction, although the number of attributes chosen ranged from one to nine. A predicted porosity volume was generated using the genetic-algorithm attribute combination with the best average cross-validation correlation coefficient. This volume suggested a network of channel sands within the C38 reservoir.
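A hedged sketch of the general workflow (not the authors' code or data): chromosomes are bitmasks over candidate attributes, and fitness is the cross-validated score of a small neural network trained on the selected attributes. The synthetic attributes, network size, and GA operators are all assumptions.

```python
# Minimal sketch of genetic-algorithm attribute selection for porosity
# prediction; synthetic data stands in for the Stratton field attributes.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 12))                  # 12 candidate seismic attributes
porosity = 0.5 * X[:, 0] + 0.3 * X[:, 3] + rng.normal(scale=0.1, size=150)

def fitness(mask):
    """Cross-validated R^2 of a small network trained on the selected attributes."""
    if not mask.any():
        return -1.0
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    return cross_val_score(net, X[:, mask], porosity, cv=3, scoring="r2").mean()

pop = rng.integers(0, 2, size=(12, 12)).astype(bool)
for _ in range(3):                              # a few GA generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-6:]]      # truncation selection
    children = parents[rng.integers(0, 6, size=6)].copy()
    flips = rng.random(children.shape) < 0.1    # bit-flip mutation
    children[flips] = ~children[flips]
    pop = np.vstack([parents, children])
best = pop[np.argmax([fitness(m) for m in pop])]
print("selected attribute indices:", np.flatnonzero(best))
```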


2021 ◽  
Vol 12 (1) ◽  
pp. 407
Author(s):  
Tianshan Dong ◽  
Shenyan Chen ◽  
Hai Huang ◽  
Chao Han ◽  
Ziqi Dai ◽  
...  

Truss size and topology optimization problems have recently been solved mainly with metaheuristic methods, which usually require a large number of structural analyses because of their population-based evolution. A branched multipoint approximation technique has previously been introduced to reduce the number of structural analyses by substituting approximate functions for structural analyses in a genetic algorithm (GA) that handles continuous size variables and discrete topology variables. For large-scale trusses with many design variables, large changes in the topology variables during the GA cause a loss of approximation accuracy and make convergence difficult. In this paper, a technique named the label–clip–splice method is proposed to address this problem in the hybrid method. It gradually reduces the GA's search domain by clipping and splicing labeled variables from the chromosomes, and it optimizes the mixed-variable model efficiently with the approximation technique for large-scale trusses. The number of structural analyses required by the proposed method is greatly reduced compared with single metaheuristic methods. Numerical examples are presented to verify the efficacy and advantages of the proposed technique.
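As a loose illustration of the clip-and-splice idea, the sketch below removes topology genes that have effectively converged across the population so that subsequent generations search a smaller domain. The agreement threshold and the binary-gene representation are assumptions; the labeling rules and the branched multipoint approximation itself are not reproduced.

```python
# Minimal sketch: clip converged topology genes out of the chromosomes and
# keep evolving only the remaining (spliced) genes. Threshold is an assumption.
import numpy as np

def clip_and_splice(population, agreement=0.95):
    """Return the reduced population, the clipped gene indices, and their fixed values."""
    freq = population.mean(axis=0)                        # fraction of 1s per topology gene
    fixed = (freq >= agreement) | (freq <= 1 - agreement)
    fixed_values = (freq >= agreement).astype(int)[fixed]
    return population[:, ~fixed], np.flatnonzero(fixed), fixed_values

# Usage: 8 binary topology variables; genes 0 and 5 have converged.
pop = np.random.randint(0, 2, size=(30, 8))
pop[:, 0], pop[:, 5] = 1, 0
reduced, clipped_idx, clipped_vals = clip_and_splice(pop)
print(reduced.shape, clipped_idx, clipped_vals)
```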


2005 ◽  
Vol 15 (05) ◽  
pp. 391-401 ◽  
Author(s):  
DIMITRIOS S. FROSSYNIOTIS ◽  
CHRISTOS PATERITSAS ◽  
ANDREAS STAFYLOPATIS

A multi-clustering fusion method is presented that combines several runs of a clustering algorithm into a common partition. More specifically, the results of several independent runs of the same clustering algorithm are appropriately combined to obtain a partition of the data that is not affected by initialization and overcomes the instabilities of clustering methods. Subsequently, a fusion procedure is applied to the clusters generated during the previous phase to determine the optimal number of clusters in the data set according to predefined criteria.
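One common way to realize such a combination is a co-association matrix: points that end up together in many runs are fused into the same final cluster. The sketch below uses that device with an average-linkage cut as a stand-in for the paper's specific fusion procedure and criteria.

```python
# Minimal sketch of combining several runs of the same clustering algorithm
# via a co-association matrix; the final cut and its threshold are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, .3, (40, 2)), rng.normal(3, .3, (40, 2))])

runs = [KMeans(n_clusters=4, n_init=1, random_state=s).fit_predict(X) for s in range(10)]
co = np.mean([np.equal.outer(r, r) for r in runs], axis=0)   # co-association matrix

# Fuse: points that co-cluster in most runs end up in the same final cluster.
dist = squareform(1 - co, checks=False)
labels = fcluster(linkage(dist, method="average"), t=0.5, criterion="distance")
print("number of fused clusters:", len(set(labels)))
```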


2021 ◽  
Vol 9 (1) ◽  
pp. e001889
Author(s):  
Rodrigo M Carrillo-Larco ◽  
Manuel Castillo-Cara ◽  
Cecilia Anza-Ramirez ◽  
Antonio Bernabé-Ortiz

Introduction – We aimed to identify clusters of people with type 2 diabetes mellitus (T2DM) and to assess whether the frequency of these clusters was consistent across selected countries in Latin America and the Caribbean (LAC). Research design and methods – We analyzed 13 population-based national surveys in nine countries (n=8361). We used k-means to develop a clustering model; predictors were age, sex, body mass index (BMI), waist circumference (WC), systolic/diastolic blood pressure (SBP/DBP), and T2DM family history. The training data set included all surveys, and the clusters were then predicted in each country-year data set. We used Euclidean distance, elbow and silhouette plots to select the optimal number of clusters and described each cluster according to the underlying predictors (means and proportions). Results – The optimal number of clusters was 4. Cluster 0 grouped more men and those with the highest mean SBP/DBP. Cluster 1 had the highest mean BMI and WC, as well as the largest proportion of T2DM family history. We observed the smallest values of all predictors in cluster 2. Cluster 3 had the highest mean age. When we reflected the four clusters in each country-year data set, a different distribution was observed. For example, cluster 3 was the most frequent in the training data set, and so it was in 7 out of the 13 country-year data sets. Conclusions – Using unsupervised machine learning algorithms, it was possible to cluster people with T2DM from the general population in LAC; the clusters showed unique profiles that could be used to identify the underlying characteristics of the T2DM population in LAC.
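A minimal sketch of the described workflow, with synthetic values standing in for the survey predictors: standardize the variables, fit k-means for several values of k, and choose the number of clusters via the silhouette score (the paper also used elbow plots); none of this reproduces the actual survey data or results.

```python
# Minimal sketch: k-means with silhouette-based selection of the number of
# clusters. The placeholder matrix stands in for age, sex, BMI, WC, SBP/DBP,
# and family history from the surveys.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 7))                       # placeholder predictor matrix
Xs = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
    scores[k] = silhouette_score(Xs, labels)
best_k = max(scores, key=scores.get)
print("optimal number of clusters:", best_k)
```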


Author(s):  
Yong Wang

The traveling salesman problem (TSP) is one of the best-known discrete optimization problems. The genetic algorithm is improved with mixed heuristics to solve the TSP. The first heuristic is the four-vertices-and-three-lines inequality, which is applied to 4-vertex paths to generate shorter Hamiltonian cycles (HCs). The second, a local heuristic, reverses i-vertex paths with more than two vertices, which also generates shorter HCs. The two heuristics must coordinate with each other in the optimization process. The time complexities of the first and second heuristics are O(n) and O(n³), respectively. The two heuristics are merged into the original genetic algorithm. Computational results show that the improved genetic algorithm with the mixed heuristics finds better solutions than the original GA under the same conditions.
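The reversal heuristic is closely related to the familiar segment-reversal (2-opt style) move, which the sketch below illustrates on random points: a sub-path is reversed only when the tour becomes shorter. The four-vertices-and-three-lines inequality and the GA integration are not reproduced here.

```python
# Minimal sketch of a segment-reversal local move for the TSP; the acceptance
# rule (keep only improving reversals) is the generic 2-opt-style idea.
import math, random

def tour_length(tour, pts):
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def reverse_if_shorter(tour, pts, i, j):
    """Reverse the sub-path tour[i:j] and keep the change only if the tour shortens."""
    candidate = tour[:i] + tour[i:j][::-1] + tour[j:]
    return candidate if tour_length(candidate, pts) < tour_length(tour, pts) else tour

random.seed(4)
pts = [(random.random(), random.random()) for _ in range(30)]
tour = list(range(30))
for _ in range(2000):
    i, j = sorted(random.sample(range(30), 2))
    tour = reverse_if_shorter(tour, pts, i, j + 1)
print("tour length after local reversals:", round(tour_length(tour, pts), 3))
```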


Author(s):  
M. Arif Wani ◽  
Romana Riyaz

Purpose – The most commonly used approaches for cluster validation are based on indices, but the majority of existing cluster validity indices do not work well on data sets of different complexities. The purpose of this paper is to propose a new cluster validity index (the ARSD index) that works well on all types of data sets. Design/methodology/approach – The authors introduce a new compactness measure that captures the typical behaviour of a cluster, where more points are located around the centre and fewer points towards the outer edge of the cluster. A novel penalty function is proposed for determining the distinctness measure of clusters. A random linear search algorithm is employed to evaluate and compare the performance of five commonly known validity indices and the proposed validity index. The values of the six indices are computed for every candidate number of clusters nc from nc_min to nc_max to obtain the optimal number of clusters present in a data set. The data sets used in the experiments include shaped, Gaussian-like and real data sets. Findings – Through an extensive experimental study, the proposed validity index is found to be more consistent and reliable in indicating the correct number of clusters than the other validity indices. This is demonstrated experimentally on 11 data sets, where the proposed index achieved better results. Originality/value – The originality of the paper lies in proposing a novel cluster validity index that is used to determine the optimal number of clusters present in data sets of different complexities.
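The way any validity index, including the proposed one, is typically applied can be sketched generically: cluster the data for every candidate nc in [nc_min, nc_max], score each partition with the index, and keep the best nc. The Calinski–Harabasz score below is only a stand-in for the ARSD index, and the clustering algorithm and data are assumptions.

```python
# Minimal sketch of selecting the number of clusters with a validity index;
# calinski_harabasz_score is a stand-in for the proposed ARSD index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def optimal_nc(X, nc_min=2, nc_max=10):
    scores = {}
    for nc in range(nc_min, nc_max + 1):
        labels = KMeans(n_clusters=nc, n_init=10, random_state=0).fit_predict(X)
        scores[nc] = calinski_harabasz_score(X, labels)
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0, 4, 8)])
best, _ = optimal_nc(X)
print("optimal number of clusters:", best)
```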

