An Improved Genetic Algorithm for Text Clustering

2014 ◽  
Vol 989-994 ◽  
pp. 1853-1856
Author(s):  
Shi Dong Yu ◽  
Yuan Ding ◽  
Xi Cheng Ma ◽  
Jian Sun

The genetic algorithm (GA) is a self-adaptive probabilistic search method for solving optimization problems and has been applied widely in science and engineering. In this paper, we propose an improved variable string length genetic algorithm (IVGA) for text clustering. The algorithm automatically evolves the optimal number of clusters while producing a proper clustering of the data set. Each chromosome is encoded with special indices that indicate the location of each gene. A more effective set of evolutionary steps automatically balances population diversity against selective pressure across generations. The superiority of the improved genetic algorithm over the conventional variable string length genetic algorithm (VGA) is demonstrated by the quality of the resulting text clusterings.
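As a rough illustration only (not the authors' encoding), the sketch below shows the generic idea of a variable-string-length chromosome for clustering: each chromosome carries a variable number of candidate centroids, and fitness rewards compact clusters. The random document vectors, the fitness function, and all parameters are placeholder assumptions.

```python
# Minimal sketch of a variable-string-length chromosome for clustering.
# Assumptions (not from the paper): genes are cluster centroids in a document
# vector space, fitness is negative within-cluster distance; the paper's
# special index encoding and adaptive operators are not reproduced here.
import numpy as np

def random_chromosome(data, k_min=2, k_max=10):
    """A chromosome is a variable-length array of centroids sampled from the data."""
    k = np.random.randint(k_min, k_max + 1)
    idx = np.random.choice(len(data), k, replace=False)
    return data[idx].copy()

def fitness(chromosome, data):
    """Negative sum of distances from each point to its nearest centroid."""
    d = np.linalg.norm(data[:, None, :] - chromosome[None, :, :], axis=2)
    return -d.min(axis=1).sum()

# Usage with random vectors standing in for TF-IDF document vectors.
docs = np.random.rand(100, 20)
population = [random_chromosome(docs) for _ in range(30)]
best = max(population, key=lambda c: fitness(c, docs))
print(len(best), "clusters in the best initial chromosome")
```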

2021 ◽  
Vol 26 (2) ◽  
pp. 27
Author(s):  
Alejandro Castellanos-Alvarez ◽  
Laura Cruz-Reyes ◽  
Eduardo Fernandez ◽  
Nelson Rangel-Valdez ◽  
Claudia Gómez-Santillán ◽  
...  

Most real-world problems require the simultaneous optimization of multiple objective functions, which can conflict with each other. The environment of these problems usually involves imprecise information derived from inaccurate measurements or from variability in decision-makers' (DMs') judgments and beliefs, which can lead to unsatisfactory solutions. The imperfect knowledge can be present in the objective functions, the constraints, or the decision-maker's preferences. These optimization problems have been solved using various techniques such as multi-objective evolutionary algorithms (MOEAs). This paper proposes a new MOEA called NSGA-III-P (non-dominated sorting genetic algorithm III with preferences). The main characteristic of NSGA-III-P is an ordinal multi-criteria classification method for preference integration that guides the algorithm toward the region of interest defined by the decision-maker's preferences. In addition, the use of interval analysis allows preferences to be expressed with imprecision. The experiments contrasted several versions of the proposed method with the original NSGA-III to analyze the different selective pressures induced by the DM's preferences. In these experiments, the algorithms solved three-objective DTLZ instances. The results showed a better approximation to the region of interest for a DM when their preferences are considered.
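For readers unfamiliar with interval-valued preferences, the following sketch illustrates one way imprecise aspiration levels could be screened during selection. The Interval class, the possibility-degree rule, and the alpha threshold are illustrative assumptions; this is not the paper's ordinal multi-criteria classification method.

```python
# Minimal sketch of interval-based preference screening, loosely inspired by
# the idea of expressing DM preferences with imprecision. The thresholds and
# the possibility-degree rule below are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float
    def possibly_leq(self, other: "Interval") -> float:
        """Possibility degree that self <= other (between 0 and 1)."""
        if self.hi <= other.lo:
            return 1.0
        if self.lo >= other.hi:
            return 0.0
        span = (self.hi - self.lo) + (other.hi - other.lo)
        return (other.hi - self.lo) / span if span > 0 else 0.5

def in_region_of_interest(objectives, aspirations, alpha=0.6):
    """Keep a solution if every objective possibly meets the DM's aspiration level."""
    return all(f.possibly_leq(a) >= alpha for f, a in zip(objectives, aspirations))

# Usage: two objectives with imprecise values and imprecise aspiration levels.
sol = [Interval(0.2, 0.3), Interval(0.5, 0.7)]
asp = [Interval(0.25, 0.35), Interval(0.6, 0.8)]
print(in_region_of_interest(sol, asp))
```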


2014 ◽  
Vol 31 (8) ◽  
pp. 1778-1789
Author(s):  
Hongkang Lin

Purpose – The clustering/classification method proposed in this study, designated the PFV-index method, solves the following problems for a data set characterized by imprecision and uncertainty: first, discretizing the continuous values of all individual attributes within the data set; second, evaluating the optimality of the discretization results; third, determining the optimal number of clusters per attribute; and fourth, improving the classification accuracy (CA) of data sets characterized by uncertainty. The paper aims to discuss these issues. Design/methodology/approach – The proposed PFV-index method combines a particle swarm optimization algorithm, the fuzzy C-means method, variable precision rough set theory, and a new cluster validity index function. Findings – The method clusters the values of the individual attributes within the data set and achieves both the optimal number of clusters and the optimal CA. Originality/value – The validity of the proposed approach is investigated by comparing the classification results obtained for UCI data sets with those obtained using supervised classification methods, namely back-propagation neural networks (BPNN) and decision trees.
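The following sketch shows, in minimal form, how fuzzy C-means can cluster a single continuous attribute and how a validity index can pick the number of clusters. The tiny FCM implementation and the partition coefficient used as the index are stand-ins; they are not the PFV index or the PSO/VPRS components of the paper.

```python
# Minimal sketch: fuzzy C-means on one attribute plus a validity index to
# choose the number of clusters. All parameters are illustrative assumptions.
import numpy as np

def fcm(x, k, m=2.0, iters=50):
    """Tiny fuzzy C-means on 1-D attribute values; returns centers and memberships."""
    x = np.asarray(x, float).reshape(-1, 1)
    u = np.random.dirichlet(np.ones(k), size=len(x))       # membership matrix
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0)[:, None]     # weighted cluster centers
        d = np.abs(x - centers.T) + 1e-12                  # distances to centers
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
    return centers.ravel(), u

def partition_coefficient(u):
    """A classic fuzzy validity index (higher is better); a stand-in for the PFV index."""
    return (u ** 2).sum() / len(u)

values = np.concatenate([np.random.normal(0, .3, 50), np.random.normal(5, .3, 50)])
best_k = max(range(2, 6), key=lambda k: partition_coefficient(fcm(values, k)[1]))
print("optimal number of clusters for this attribute:", best_k)
```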


2012 ◽  
Vol 3 (1) ◽  
pp. 1-20
Author(s):  
Amit Banerjee

In this paper, a multi-objective genetic algorithm for data clustering based on the robust fuzzy least trimmed squares estimator is presented. The proposed clustering methodology addresses two critical issues in unsupervised data clustering: the ability to produce a meaningful partition of noisy data, and the requirement that the number of clusters be known a priori. The multi-objective genetic-algorithm-driven clustering technique optimizes the number of clusters as well as the cluster assignments and cluster prototypes. A two-parameter, mapped, fixed-point coding scheme is used to represent the assignment of data into the true retained set and the noisy trimmed set, together with the optimal number of clusters in the retained set. A three-objective criterion is used as the minimization functional for the multi-objective genetic algorithm. Results on well-known data sets from the literature suggest that the proposed methodology is superior to conventional fuzzy clustering algorithms that assume a known value for the optimal number of clusters.
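The core robustness idea, least trimmed squares, is easy to show in isolation: only the best-fitting fraction of points (the retained set) contributes to the clustering cost, so gross outliers in the trimmed set cannot distort the prototypes. The sketch below is a hedged illustration of that single ingredient; the fuzzy memberships, coding scheme, and three-objective criterion are omitted.

```python
# Minimal sketch of a least-trimmed-squares clustering cost; parameters and
# data are illustrative assumptions, not the paper's formulation.
import numpy as np

def trimmed_clustering_cost(data, prototypes, trim_fraction=0.1):
    """Sum of squared distances to the nearest prototype over the retained set only."""
    d2 = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    h = int(np.ceil((1 - trim_fraction) * len(data)))   # size of retained set
    return np.sort(d2)[:h].sum()

# Usage: two well-separated clusters plus a few outliers.
rng = np.random.default_rng(0)
clean = np.vstack([rng.normal(0, .2, (50, 2)), rng.normal(3, .2, (50, 2))])
noisy = np.vstack([clean, rng.uniform(-10, 10, (10, 2))])
protos = np.array([[0.0, 0.0], [3.0, 3.0]])
print(trimmed_clustering_cost(noisy, protos))
```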


Geophysics ◽  
2004 ◽  
Vol 69 (1) ◽  
pp. 212-221 ◽  
Author(s):  
Kevin P. Dorrington ◽  
Curtis A. Link

Neural-network prediction of well-log data using seismic attributes is an important reservoir characterization technique because it allows extrapolation of log properties throughout a seismic volume. The strength of neural networks in pattern recognition is key to their success in delineating the complex nonlinear relationship between seismic attributes and log properties. We have found that good neural-network generalization of well-log properties can be accomplished using a small number of seismic attributes. This study presents a new method for seismic attribute selection using a genetic-algorithm approach. The genetic-algorithm attribute selection uses neural-network training results to choose the optimal number and type of seismic attributes for porosity prediction. We apply the genetic-algorithm attribute-selection method to the C38 reservoir in the Stratton field 3D seismic data set. Eleven wells with porosity logs are used to train neural networks on the genetic-algorithm-selected attribute combinations. A histogram of 50 genetic-algorithm attribute-selection runs indicates that amplitude-based attributes are the best porosity predictors for this data set. On average, the genetic algorithm selected four attributes for optimal porosity-log prediction, although the number of attributes chosen ranged from one to nine. A predicted porosity volume was generated using the genetic-algorithm attribute combination with the best average cross-validation correlation coefficient. This volume suggested a network of channel sands within the C38 reservoir.
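A hedged sketch of the general workflow (not the authors' code or data): chromosomes are bitmasks over candidate attributes, and fitness is the cross-validated score of a small neural network trained on the selected attributes. The synthetic attributes, network size, and GA operators are all assumptions.

```python
# Minimal sketch of genetic-algorithm attribute selection for porosity
# prediction; synthetic data stands in for the Stratton field attributes.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 12))                  # 12 candidate seismic attributes
porosity = 0.5 * X[:, 0] + 0.3 * X[:, 3] + rng.normal(scale=0.1, size=150)

def fitness(mask):
    """Cross-validated R^2 of a small network trained on the selected attributes."""
    if not mask.any():
        return -1.0
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    return cross_val_score(net, X[:, mask], porosity, cv=3, scoring="r2").mean()

pop = rng.integers(0, 2, size=(12, 12)).astype(bool)
for _ in range(3):                              # a few GA generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-6:]]      # truncation selection
    children = parents[rng.integers(0, 6, size=6)].copy()
    flips = rng.random(children.shape) < 0.1    # bit-flip mutation
    children[flips] = ~children[flips]
    pop = np.vstack([parents, children])
best = pop[np.argmax([fitness(m) for m in pop])]
print("selected attribute indices:", np.flatnonzero(best))
```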


2021 ◽  
Vol 12 (1) ◽  
pp. 407
Author(s):  
Tianshan Dong ◽  
Shenyan Chen ◽  
Hai Huang ◽  
Chao Han ◽  
Ziqi Dai ◽  
...  

Truss size and topology optimization problems have recently been solved mainly with metaheuristic methods, which usually require a large number of structural analyses because of their population-based evolution. A branched multipoint approximation technique has previously been introduced to reduce the number of structural analyses by substituting approximate functions for structural analyses in a genetic algorithm (GA) that handles continuous size variables and discrete topology variables. For large-scale trusses with many design variables, large changes in the topology variables during the GA cause a loss of approximation accuracy and make convergence difficult. In this paper, a technique named the label–clip–splice method is proposed to address this problem in the hybrid method. It gradually reduces the GA's search domain by clipping and splicing labeled variables from the chromosomes, and it optimizes the mixed-variable model efficiently with the approximation technique for large-scale trusses. The number of structural analyses required by the proposed method is greatly reduced compared with single metaheuristic methods. Numerical examples are presented to verify the efficacy and advantages of the proposed technique.
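As a loose illustration of the clip-and-splice idea, the sketch below removes topology genes that have effectively converged across the population so that subsequent generations search a smaller domain. The agreement threshold and the binary-gene representation are assumptions; the labeling rules and the branched multipoint approximation itself are not reproduced.

```python
# Minimal sketch: clip converged topology genes out of the chromosomes and
# keep evolving only the remaining (spliced) genes. Threshold is an assumption.
import numpy as np

def clip_and_splice(population, agreement=0.95):
    """Return the reduced population, the clipped gene indices, and their fixed values."""
    freq = population.mean(axis=0)                        # fraction of 1s per topology gene
    fixed = (freq >= agreement) | (freq <= 1 - agreement)
    fixed_values = (freq >= agreement).astype(int)[fixed]
    return population[:, ~fixed], np.flatnonzero(fixed), fixed_values

# Usage: 8 binary topology variables; genes 0 and 5 have converged.
pop = np.random.randint(0, 2, size=(30, 8))
pop[:, 0], pop[:, 5] = 1, 0
reduced, clipped_idx, clipped_vals = clip_and_splice(pop)
print(reduced.shape, clipped_idx, clipped_vals)
```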


2005 ◽  
Vol 15 (05) ◽  
pp. 391-401 ◽  
Author(s):  
DIMITRIOS S. FROSSYNIOTIS ◽  
CHRISTOS PATERITSAS ◽  
ANDREAS STAFYLOPATIS

A multi-clustering fusion method is presented that combines several runs of a clustering algorithm into a common partition. More specifically, the results of several independent runs of the same clustering algorithm are appropriately combined to obtain a partition of the data that is not affected by initialization and overcomes the instabilities of clustering methods. Subsequently, a fusion procedure is applied to the clusters generated during the previous phase to determine the optimal number of clusters in the data set according to predefined criteria.
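One common way to realize such a combination is a co-association matrix: points that end up together in many runs are fused into the same final cluster. The sketch below uses that device with an average-linkage cut as a stand-in for the paper's specific fusion procedure and criteria.

```python
# Minimal sketch of combining several runs of the same clustering algorithm
# via a co-association matrix; the final cut and its threshold are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, .3, (40, 2)), rng.normal(3, .3, (40, 2))])

runs = [KMeans(n_clusters=4, n_init=1, random_state=s).fit_predict(X) for s in range(10)]
co = np.mean([np.equal.outer(r, r) for r in runs], axis=0)   # co-association matrix

# Fuse: points that co-cluster in most runs end up in the same final cluster.
dist = squareform(1 - co, checks=False)
labels = fcluster(linkage(dist, method="average"), t=0.5, criterion="distance")
print("number of fused clusters:", len(set(labels)))
```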


2021 ◽  
Vol 9 (1) ◽  
pp. e001889
Author(s):  
Rodrigo M Carrillo-Larco ◽  
Manuel Castillo-Cara ◽  
Cecilia Anza-Ramirez ◽  
Antonio Bernabé-Ortiz

Introduction – We aimed to identify clusters of people with type 2 diabetes mellitus (T2DM) and to assess whether the frequency of these clusters was consistent across selected countries in Latin America and the Caribbean (LAC). Research design and methods – We analyzed 13 population-based national surveys in nine countries (n=8361). We used k-means to develop a clustering model; predictors were age, sex, body mass index (BMI), waist circumference (WC), systolic/diastolic blood pressure (SBP/DBP), and T2DM family history. The training data set included all surveys, and the clusters were then predicted in each country-year data set. We used Euclidean distance, elbow and silhouette plots to select the optimal number of clusters and described each cluster according to the underlying predictors (means and proportions). Results – The optimal number of clusters was 4. Cluster 0 grouped more men and those with the highest mean SBP/DBP. Cluster 1 had the highest mean BMI and WC, as well as the largest proportion of T2DM family history. We observed the smallest values of all predictors in cluster 2. Cluster 3 had the highest mean age. When we reflected the four clusters in each country-year data set, a different distribution was observed. For example, cluster 3 was the most frequent in the training data set, and so it was in 7 out of the 13 country-year data sets. Conclusions – Using unsupervised machine learning algorithms, it was possible to cluster people with T2DM from the general population in LAC; the clusters showed unique profiles that could be used to identify the underlying characteristics of the T2DM population in LAC.
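A minimal sketch of the described workflow, with synthetic values standing in for the survey predictors: standardize the variables, fit k-means for several values of k, and choose the number of clusters via the silhouette score (the paper also used elbow plots); none of this reproduces the actual survey data or results.

```python
# Minimal sketch: k-means with silhouette-based selection of the number of
# clusters. The placeholder matrix stands in for age, sex, BMI, WC, SBP/DBP,
# and family history from the surveys.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 7))                       # placeholder predictor matrix
Xs = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
    scores[k] = silhouette_score(Xs, labels)
best_k = max(scores, key=scores.get)
print("optimal number of clusters:", best_k)
```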


Author(s):  
Yong Wang

The traveling salesman problem (TSP) is one of the best-known discrete optimization problems. The genetic algorithm is improved with mixed heuristics to solve the TSP. The first heuristic is the four-vertices-and-three-lines inequality, which is applied to 4-vertex paths to generate shorter Hamiltonian cycles (HCs). The second, a local heuristic, reverses i-vertex paths with more than two vertices, which also generates shorter HCs. The two heuristics must coordinate with each other in the optimization process. The time complexities of the first and second heuristics are O(n) and O(n³), respectively. The two heuristics are merged into the original genetic algorithm. Computational results show that the improved genetic algorithm with the mixed heuristics finds better solutions than the original GA under the same conditions.
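The reversal heuristic is closely related to the familiar segment-reversal (2-opt style) move, which the sketch below illustrates on random points: a sub-path is reversed only when the tour becomes shorter. The four-vertices-and-three-lines inequality and the GA integration are not reproduced here.

```python
# Minimal sketch of a segment-reversal local move for the TSP; the acceptance
# rule (keep only improving reversals) is the generic 2-opt-style idea.
import math, random

def tour_length(tour, pts):
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def reverse_if_shorter(tour, pts, i, j):
    """Reverse the sub-path tour[i:j] and keep the change only if the tour shortens."""
    candidate = tour[:i] + tour[i:j][::-1] + tour[j:]
    return candidate if tour_length(candidate, pts) < tour_length(tour, pts) else tour

random.seed(4)
pts = [(random.random(), random.random()) for _ in range(30)]
tour = list(range(30))
for _ in range(2000):
    i, j = sorted(random.sample(range(30), 2))
    tour = reverse_if_shorter(tour, pts, i, j + 1)
print("tour length after local reversals:", round(tour_length(tour, pts), 3))
```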


Author(s):  
M. Arif Wani ◽  
Romana Riyaz

Purpose – The most commonly used approaches for cluster validation are based on indices, but the majority of existing cluster validity indices do not work well on data sets of different complexities. The purpose of this paper is to propose a new cluster validity index (the ARSD index) that works well on all types of data sets. Design/methodology/approach – The authors introduce a new compactness measure that captures the typical behaviour of a cluster, where more points are located around the centre and fewer points towards the outer edge of the cluster. A novel penalty function is proposed for determining the distinctness measure of clusters. A random linear search algorithm is employed to evaluate and compare the performance of five commonly known validity indices and the proposed validity index. The values of the six indices are computed for every candidate number of clusters nc from nc_min to nc_max to obtain the optimal number of clusters present in a data set. The data sets used in the experiments include shaped, Gaussian-like and real data sets. Findings – Through an extensive experimental study, the proposed validity index is found to be more consistent and reliable in indicating the correct number of clusters than the other validity indices. This is demonstrated experimentally on 11 data sets, where the proposed index achieved better results. Originality/value – The originality of the paper lies in proposing a novel cluster validity index that is used to determine the optimal number of clusters present in data sets of different complexities.
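The way any validity index, including the proposed one, is typically applied can be sketched generically: cluster the data for every candidate nc in [nc_min, nc_max], score each partition with the index, and keep the best nc. The Calinski–Harabasz score below is only a stand-in for the ARSD index, and the clustering algorithm and data are assumptions.

```python
# Minimal sketch of selecting the number of clusters with a validity index;
# calinski_harabasz_score is a stand-in for the proposed ARSD index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def optimal_nc(X, nc_min=2, nc_max=10):
    scores = {}
    for nc in range(nc_min, nc_max + 1):
        labels = KMeans(n_clusters=nc, n_init=10, random_state=0).fit_predict(X)
        scores[nc] = calinski_harabasz_score(X, labels)
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0, 4, 8)])
best, _ = optimal_nc(X)
print("optimal number of clusters:", best)
```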

