Development of Research Proposal Selection Based on Domain Ontology using K-Means Categorical Clustering

With rapid research progress across many fields, the selection of research proposals has become an important process in many research funding agencies and organizations. When few research proposals are received, they are easy to cluster and selection is unproblematic. As the number of proposals grows, clustering and selecting them becomes complicated. In the current system, proposals are grouped manually or by their similarity in subject discipline, which yields irrelevant results in some cases. The main goal of this research work is to develop an enhanced system for selecting research proposals based on a domain ontology, where the ontology acts as a search criterion for the topics of research proposals. The proposed system will help to select research proposal topics systematically, without manual intervention. In this paper, an algorithm called Scikit-learn K-means Multiclass Document Clustering (SKMDC) is proposed to group each subject discipline according to its sub-topics and sub-domains. Here, the k-means clustering technique is applied to categorical data. Because categorical data cannot be used directly in the k-means clustering algorithm, the LabelEncoder method is used to encode the text data as numerical values, and the dimensionality of the dataset is reduced using Principal Component Analysis. This paper also overcomes a weakness of the k-means technique, namely that the number of clusters must be specified at the initial stage. The optimal number of clusters is determined using the Elbow Curve method and cross-validated through Silhouette Score analysis.
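The encode-then-cluster idea in this abstract can be illustrated with a minimal sketch. It is not the SKMDC algorithm itself: it uses only the Python standard library, omits the PCA step, and the proposal records, the helper names (`label_encode`, `kmeans_inertia`) and the deterministic initialization are all assumptions made for illustration. Note that integer label codes impose an arbitrary ordering on the categories, which is a known limitation of this kind of encoding.

```python
def label_encode(column):
    """Map each distinct category to an integer code, in sorted order."""
    codes = {value: code for code, value in enumerate(sorted(set(column)))}
    return [codes[value] for value in column]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_inertia(points, k, iterations=20):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    # Deterministic start (an assumption, for reproducibility):
    # use the first k distinct points as initial centers.
    centers = []
    for point in points:
        if point not in centers:
            centers.append(point)
        if len(centers) == k:
            break
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(point, centers[i]))
            clusters[nearest].append(point)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else center
            for center, members in zip(centers, clusters)
        ]
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical proposals described by two categorical attributes.
disciplines = ["biology", "biology", "physics", "physics", "computing", "computing"]
subtopics = ["genetics", "genetics", "optics", "optics", "ontology", "ontology"]
points = list(zip(label_encode(disciplines), label_encode(subtopics)))

# Inertia for k = 1, 2, 3; the "elbow" is where the decrease levels off.
inertias = [kmeans_inertia(points, k) for k in (1, 2, 3)]
print(inertias)
```

Plotting `inertias` against k gives the elbow curve; on real data the initial centers would normally be chosen at random or with a seeding scheme rather than taken from the first records.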

2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Ziqi Jia ◽  
Ling Song

The k-prototypes algorithm is a hybrid clustering algorithm that can process both categorical and numerical data. In this study, the method of initial cluster center selection was improved and a new hybrid dissimilarity coefficient was proposed. Based on this coefficient, a weighted k-prototypes clustering algorithm (WKPCA) was proposed. The WKPCA algorithm not only improves the selection of initial cluster centers, but also provides a new method for calculating the dissimilarity between data objects and cluster centers. A real dataset from UCI was used to test the WKPCA algorithm. Experimental results show that the WKPCA algorithm is more efficient and robust than other k-prototypes algorithms.
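For orientation, the dissimilarity used by the classic k-prototypes algorithm can be sketched as below. This is the standard mixed measure (squared Euclidean distance on numerical attributes plus a weighted count of categorical mismatches), not the hybrid dissimilarity coefficient proposed in the paper, whose definition is not given in this abstract. The function name and the weight `gamma` are illustrative.

```python
def mixed_dissimilarity(obj_num, obj_cat, center_num, center_cat, gamma=1.0):
    """Classic k-prototypes dissimilarity between an object and a cluster center.

    obj_num / center_num: numerical attribute values.
    obj_cat / center_cat: categorical attribute values.
    gamma: weight balancing the categorical part against the numerical part.
    """
    # Numerical part: squared Euclidean distance.
    numeric = sum((x - c) ** 2 for x, c in zip(obj_num, center_num))
    # Categorical part: simple matching, one unit per mismatched attribute.
    categorical = sum(1 for x, c in zip(obj_cat, center_cat) if x != c)
    return numeric + gamma * categorical


# One numerical attribute differs by 1.0 and one categorical attribute mismatches.
d = mixed_dissimilarity((1.0, 2.0), ("red", "small"),
                        (0.0, 2.0), ("red", "large"), gamma=0.5)
print(d)
```

The choice of `gamma` controls how strongly categorical mismatches count relative to numerical distance, which is the kind of trade-off a weighted variant has to settle.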


2021 ◽  
Author(s):  
Congming Shi ◽  
Bingtao Wei ◽  
Shoulin Wei ◽  
Wen Wang ◽  
Hai Liu ◽  
...  

Abstract: Clustering, a traditional machine learning method, plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, clusters are usually unpredictable. Although the Elbow method is one of the most commonly used methods for determining the optimal cluster number, it depends on the manual identification of elbow points on the visualization curve. Thus, even experienced analysts cannot clearly identify the elbow point when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed that yields a statistical metric for estimating the optimal cluster number when clustering a dataset. First, the average degree of distortion obtained by the Elbow method is normalized to the range of 0 to 10. Second, the normalized results are used to calculate the cosines of the intersection angles between elbow points. Third, these cosines and the arccosine function are used to compute the intersection angles between elbow points. Finally, the index of the minimal computed intersection angle is used as the estimated optimal cluster number. Experimental results based on simulated datasets and a well-known public dataset (the Iris dataset) demonstrate that the optimal cluster number estimated by the proposed method is better than that of the widely used Silhouette method.
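The four steps above can be sketched roughly as follows. This is one reading of the abstract, not the authors' implementation: it assumes the angle at each candidate k is the one formed at that point of the curve by its left and right neighbours, with k on the horizontal axis and the normalized distortion on the vertical axis. The function name and the distortion values are hypothetical.

```python
import math


def estimate_k_by_elbow_angle(ks, distortions):
    """Return (k, angle in degrees) for the sharpest bend of the elbow curve."""
    lo, hi = min(distortions), max(distortions)
    # Step 1: normalize distortions to the range 0 to 10 (assumes hi > lo).
    scaled = [10.0 * (d - lo) / (hi - lo) for d in distortions]
    best_k, best_angle = None, math.inf
    for i in range(1, len(ks) - 1):
        # Vectors from point i to its left and right neighbours.
        ax, ay = ks[i - 1] - ks[i], scaled[i - 1] - scaled[i]
        bx, by = ks[i + 1] - ks[i], scaled[i + 1] - scaled[i]
        # Step 2: cosine of the angle between the two vectors.
        cosine = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        # Step 3: recover the angle with arccosine (clamped against rounding error).
        angle = math.acos(max(-1.0, min(1.0, cosine)))
        # Step 4: keep the k with the smallest angle, i.e. the sharpest elbow.
        if angle < best_angle:
            best_k, best_angle = ks[i], angle
    return best_k, math.degrees(best_angle)


# Hypothetical average distortions for k = 1..5.
best_k, angle = estimate_k_by_elbow_angle([1, 2, 3, 4, 5], [100.0, 40.0, 10.0, 8.0, 7.0])
print(best_k, round(angle, 1))
```

Because the first and last k have only one neighbour, this sketch can only select an interior value of k.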


Author(s):  
U.A. Nuralieva ◽  
A.A. Baisabyrova ◽  
G.A. Moldakhmetova ◽  
K.A. Temirbayeva ◽  
R.Zh. Shimelkova ◽  
...  

One way to intensify the production of beekeeping products is selection. Bee breeding is not only one of the most important methods but also the most economically efficient way to increase the productivity of bee colonies. Thus, the selection of bees and the introduction of its achievements into production are among the most important and effective directions for intensifying beekeeping. The research was carried out under a program-targeted financing project of the Ministry of Education and Science of the Republic of Kazakhstan on the topic "Development of technologies for effective management of the selection process in beekeeping." This article examines the morphometric indicators of honeybees in the Almaty region of the Republic of Kazakhstan. The research material consisted of worker bee specimens from apiaries located in the Almaty region: the Devochkin farm, the Panov farm, Kalinin Individual Entrepreneur, Adilgazy Individual Entrepreneur, and the Kashkimbaev farm. For the study, 35 bee samples were processed according to the method of A.B. Kartashev. Changes in wing parameters, including the cubital index, the dumbbell index, and discoidal displacement, are considered by bee species: Central Russian, Carpathian, Italian, and Carniolan honey bee. It was found that in Kalinin's apiary the average value of the cubital index was 2.787%. As a result, the morphometric indices for the cubital index in bees of the IP Kalinin apiary were 2.777%. In the other farms, the average value was significantly lower for all indicators. Accordingly, the percentage of the cubital index was 7.42-17.36%, the dumbbell index was 6.77-11.81%, and the discoidal displacement was 32.91-47.37%. On all indicators, the Kalinin Individual Entrepreneur's bee farm is clearly superior to the other bee farms in terms of morphometric data. This is due to the isolation of the apiary, which is out of reach of other bees, thus ensuring a low level of hybridization. The analysis of the species composition of the entire apiary, together with economically useful traits, can significantly increase the efficiency of selection work in beekeeping.


This research work proposes an integrated approach using fuzzy clustering to discover the optimal number of clusters. The proposed technique is an innovative clustering algorithm for marketing and could be used to determine the best groups of customers, similar items, and products. The new approach can independently determine the initial distribution of cluster centers. The task of finding the number of clusters is converted into the task of determining the size of a neural network, which is later translated into identifying the optimal groups of clusters. The approach has been tested on four business data sets and shows outstanding results compared to traditional approaches. The proposed method is able to find the expected exact number of clusters without any significant error. Further, we believe that this work has business value in increasing market efficiency by finding out which group of clusters is more cost-effective.


2020 ◽  
Author(s):  
Congming Shi ◽  
Bingtao Wei ◽  
Shoulin Wei ◽  
Wen Wang ◽  
Hai Liu ◽  
...  

Abstract: Clustering, as a traditional machine learning method, still plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, clusters are usually unpredictable. Although the elbow method is one of the most commonly used methods for determining the optimal cluster number, it depends on manual identification of the elbow points on the visualization curve, so even experienced analysts cannot clearly identify the elbow point when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed that works out a statistical metric for estimating the optimal cluster number when clustering a dataset. First, the average degree of distortion obtained by the elbow method is normalized to the range of 0 to 10. Second, the normalized results are used to calculate the cosines of the intersection angles between elbow points. Third, these cosines and the arccosine function are used to compute the intersection angles between elbow points. Finally, the index of the minimal computed intersection angle is used as the estimated optimal cluster number. Experimental results based on simulated datasets and a well-known public dataset (the Iris dataset) demonstrate that the optimal cluster number estimated by the proposed method is better than that of the widely used Silhouette method.
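The Silhouette method used as the comparison baseline scores a clustering by how much closer each point is to its own cluster than to the nearest other cluster. A minimal standard-library sketch is given below; the function name and the toy points are illustrative, and it assumes at least two clusters and no set of fully identical points.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(label, []).append(math.dist(p, q))
        if own not in by_cluster:
            # A point alone in its cluster gets a score of 0 by convention.
            scores.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])  # mean intra-cluster distance
        b = min(sum(d) / len(d)                          # mean distance to nearest other cluster
                for label, d in by_cluster.items() if label != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters that mix distant points
print(round(good, 3), round(bad, 3))
```

To choose a cluster number with this score, one would cluster the data for several values of k and keep the k with the highest mean silhouette.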


Author(s):  
Shashwati Mishra ◽  
Mrutyunjaya Panda

Features play a very important role in the analysis and prediction of data, as they carry the most valuable information about the data. This data may be in a structured or an unstructured format. The feature engineering process is used to extract features from the data. Feature selection is one of the crucial steps in this process. Feature selection can follow four different approaches and, on that basis, can be classified into four basic categories: filter methods, wrapper methods, embedded methods, and hybrid methods. This chapter discusses the techniques that fall under these four categories, along with research work on feature selection.


Author(s):  
MIIN-SHEN YANG ◽  
CHIH-YING LIN ◽  
YI-CHENG TIAN

In 1993, Yang first extended the classification maximum likelihood (CML) to a so-called fuzzy CML by combining fuzzy c-partitions with the CML function. Fuzzy c-partitions are generally an extension of hard c-partitions. It was claimed that this was more robust. However, the fuzzy CML still lacks some robustness as a clustering algorithm, such as its inability to detect clusters of different volumes, its heavy dependence on parameter initialization, and the need to provide an a priori cluster number. In this paper, we construct a robust fuzzy CML clustering framework. The eigenvalue decomposition of a covariance matrix is first considered using the fuzzy CML model. The Bayesian information criterion (BIC) is then used for model selection, in order to choose the best model with the optimal number of clusters. The proposed framework is therefore robust to parameter initialization, robust with respect to the cluster number, and capable of detecting clusters of different volumes. Numerical examples and real data applications with comparisons are provided, which demonstrate the effectiveness and superiority of the proposed method.
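The BIC-based model selection step can be sketched as follows. This assumes the common convention BIC = p ln(n) - 2 ln(L), where lower is better; some authors use the opposite sign and maximize instead. The log-likelihoods and parameter counts below are hypothetical values, not results from the paper.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion, in the 'lower is better' convention."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def select_cluster_number(candidates, n_samples):
    """Pick the cluster number whose fitted model has the smallest BIC.

    candidates maps a cluster number to (maximized log-likelihood, free parameters).
    """
    return min(candidates, key=lambda c: bic(*candidates[c], n_samples))


# Hypothetical fits for 2, 3 and 4 clusters on 100 samples.
candidates = {2: (-250.0, 5), 3: (-230.0, 8), 4: (-228.0, 11)}
best = select_cluster_number(candidates, n_samples=100)
print(best)
```

In this example the fourth cluster improves the likelihood only slightly, so the penalty term on the extra parameters outweighs the gain.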


2011 ◽  
Vol 2011 ◽  
pp. 1-21 ◽  
Author(s):  
Yanfei Zhong ◽  
Liangpei Zhang

A new fuzzy clustering algorithm based on clonal selection theory from artificial immune systems (AIS), namely FCSA, is proposed to obtain the optimal clustering result for land cover classification without a priori assumptions on the number of clusters. FCSA can adaptively find the optimal number of clusters and is designed as a two-layer system: the classification layer and the optimization layer. The classification layer of FCSA, inspired by clonal selection theory, generates the optimal classification result for a fixed cluster number by using the clone, mutation, and selection immune operators. The optimization layer of FCSA evaluates the optimal solutions according to performance measures for cluster validity and then adjusts the cluster number to output the final optimal cluster number. Two experiments with different types of images show that FCSA not only finds the optimal number of clusters, but also consistently outperforms traditional clustering algorithms such as K-means and Fuzzy C-means. Hence, FCSA provides an effective option for land cover classification.

