The Improvement of K-Medoids Clustering Algorithm under Semantic Web

2013 ◽  
Vol 380-384 ◽  
pp. 1286-1289
Author(s):  
Wen Tian Ji ◽  
Qing Ju Guo ◽  
Sheng Zhong

The K-medoids clustering algorithm is an efficient algorithm for partitioning data into clusters. Based on an analysis of the algorithm, this paper first improves the selection of the K center points and then builds a web model of ontology data set objects, with the aim of demonstrating through experimental evaluation that the improved algorithm can greatly enhance the accuracy of clustering results under the semantic web.
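For orientation, here is a minimal sketch of the plain alternating K-medoids procedure that such work starts from. It is not the authors' improved center selection, which the abstract does not detail. The function names, the `init` parameter, and the toy points are illustrative assumptions, and the points are assumed to be distinct.

```python
import math
import random

def assign(points, medoids, dist):
    """Map each medoid index to the indices of the points nearest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: dist(p, points[m]))
        clusters[nearest].append(i)
    return clusters

def k_medoids(points, k, dist=math.dist, init=None, max_iter=100, seed=0):
    """Plain alternating K-medoids (hypothetical helper, not the paper's variant)."""
    rng = random.Random(seed)
    # Assumption: random initial medoids unless the caller supplies `init`.
    medoids = list(init) if init is not None else rng.sample(range(len(points)), k)
    clusters = assign(points, medoids, dist)
    for _ in range(max_iter):
        # Update step: each cluster's new medoid is the member with the
        # smallest total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(dist(points[c], points[j]) for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if set(new_medoids) == set(medoids):
            break  # medoids unchanged, so the assignment is stable
        medoids = new_medoids
        clusters = assign(points, medoids, dist)
    return sorted(medoids), clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clusters = k_medoids(points, k=2, init=[0, 1])
print(medoids, clusters)
```

The result depends on the starting medoids: on this toy data, starting from indices 1 and 2 leaves the procedure stuck with both medoids in the same group, which is why the choice of the K center points is worth improving.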

2013 ◽  
Vol 380-384 ◽  
pp. 1290-1293
Author(s):  
Qing Ju Guo ◽  
Wen Tian Ji ◽  
Sheng Zhong

Many research findings on clustering algorithms have been reported at home and abroad in recent years. This paper analyzes the advantages and disadvantages of the traditional partition clustering method, the K-means algorithm, and combines it with an ontology-based data set to establish a semantic web model. It improves the existing clustering algorithm under various constraint conditions, with the aim of demonstrating that the improved algorithm has better efficiency and accuracy under the semantic web.


2016 ◽  
Vol 9 (1) ◽  
pp. 152
Author(s):  
Burak Omer Saracoglu

Purpose: Electricity demand in Turkey has been increasing for some time, and hydropower is one of the major generation types used to meet this demand. Private investors (domestic and foreign) in the hydropower generation sector have been looking for the most appropriate and satisfactory new private hydropower investment (PHPI) options and opportunities in Turkey. This study aims to present a qualitative multi-attribute decision making (MADM) model that is easy, straightforward, and fast for selecting the most satisfactory reasonable PHPI options during the very early investment stages, when data and information on projects are scarce.

Design/methodology/approach: The data and information on the PHPI options were gathered from official records on official websites. A wide and deep literature review was conducted for the MADM models and for the hydropower industry. The attributes of the model were identified, selected, clustered and evaluated using expert decision maker (EDM) opinion and with the help of an open source search results clustering engine (Carrot2), which also aided comprehension. The PHPI options were clustered according to their main property, installed capacity, to analyze the options in the most appropriate, decidable, informative, understandable and meaningful way. A simple clustering algorithm for the PHPI options was executed in the current study. A template model for the selection of the most satisfactory PHPI options was built in the DEXi (Decision EXpert for Education) and DEXiTree software.

Findings: The basic attributes for the selection of the PHPI options were presented, and afterwards the aggregate attributes were defined by bottom-up structuring for the early investment stages. The attributes were also analyzed with the help of Carrot2. The most satisfactory PHPI options in Turkey in the big options data set were selected for each PHPI options cluster by the EDM evaluations in DEXi.

Originality/value: The main contribution is the recommended DEXi PHPI selection model, supported by the search results clustering engine and demonstrated in a country-wise case, which offers the possibility of easy, meaningful and satisfying continental or worldwide applications for private investors and international financial institutions such as the African Development Bank or the World Bank.


2019 ◽  
Vol 48 (4) ◽  
pp. 673-681
Author(s):  
Shufen Zhang ◽  
Zhiyu Liu ◽  
Xuebin Chen ◽  
Changyin Luo

To address the difficulties the traditional K-Means clustering algorithm has with large-scale data sets, a Hadoop K-Means (referred to as HKM) clustering algorithm is proposed. First, the algorithm uses sample density to eliminate the effects of noise points in the data set. Second, it optimizes the selection of the initial center points using the max-min distance idea. Finally, it uses the MapReduce programming model to parallelize the computation. Experimental results show that the proposed algorithm not only achieves high accuracy and stability in its clustering results, but also solves the scalability problems that traditional clustering algorithms encounter with large-scale data.
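The max-min distance idea for choosing initial centers can be sketched as follows. This is only the common farthest-point heuristic under stated assumptions: the first center is chosen arbitrarily (index 0 here), and the density-based noise removal and MapReduce parallelization described in the abstract are not reproduced. The name `max_min_centers` and the sample points are hypothetical.

```python
import math

def max_min_centers(points, k, first=0):
    """Pick k initial centers: each new center is the point whose distance
    to its nearest already-chosen center is largest (max-min distance)."""
    centers = [first]  # assumption: arbitrary starting point
    # min_d[i] holds the distance from point i to its nearest chosen center
    min_d = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=min_d.__getitem__)
        centers.append(nxt)
        # Refresh nearest-center distances with the newly added center
        min_d = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return centers

points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
centers = max_min_centers(points, k=3)
print(centers)  # [0, 3, 4]
```

Because the heuristic favors far-away points, it tends to pick outliers as centers, which may be why a noise-elimination step precedes it in HKM.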


2018 ◽  
Vol 3 (1) ◽  
pp. 001
Author(s):  
Zulhendra Zulhendra ◽  
Gunadi Widi Nurcahyo ◽  
Julius Santony

This study uses a data mining technique, namely K-Means clustering. Data mining can be used to analyze fairly large data sets; here it aims to enable Indocomputer to identify and classify service data based on customer complaints, using the Weka software. The K-Means clustering algorithm is used to predict or classify complaints about hardware damage at Indocomputer Payakumbuh. It can also identify the laptop brands most often brought in for service at Indocomputer Payakumbuh, as one of the recommendations to consumers for selecting a laptop.


2021 ◽  
pp. 016555152110184
Author(s):  
Gunjan Chandwani ◽  
Anil Ahlawat ◽  
Gaurav Dubey

Document retrieval plays an important role in knowledge management as it helps us discover the relevant information in existing data. This article proposes a cluster-based inverted indexing algorithm for document retrieval. First, pre-processing removes unnecessary and redundant words from the documents. Then, the documents are indexed by the cluster-based inverted indexing algorithm, which is developed by integrating the piecewise fuzzy C-means (piFCM) clustering algorithm and inverted indexing. After the documents are indexed, user queries are matched using the Bhattacharyya distance. Finally, query optimisation is done with the Pearson correlation coefficient, and the relevant documents are retrieved. The performance of the proposed algorithm is analysed on the WebKB data set and the Twenty Newsgroups data set. The analysis shows that the proposed algorithm offers high performance, with a precision of 1, recall of 0.70 and F-measure of 0.8235. The proposed document retrieval system retrieves the most relevant documents and speeds up the storing and retrieval of information.
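The reported figures are mutually consistent if the F-measure is taken to be the balanced F1 score, the harmonic mean of precision and recall. That reading is an assumption, since the abstract does not name the variant. A quick check, with `f_measure` as a hypothetical helper:

```python
def f_measure(precision, recall, beta=1.0):
    """General F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_measure(1.0, 0.70)
print(round(f1, 4))  # 0.8235
```

Here 2 x 1 x 0.70 / (1 + 0.70) = 1.4 / 1.7, which rounds to 0.8235.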


Genetics ◽  
2001 ◽  
Vol 159 (2) ◽  
pp. 699-713
Author(s):  
Noah A Rosenberg ◽  
Terry Burke ◽  
Kari Elo ◽  
Marcus W Feldman ◽  
Paul J Freidlin ◽  
...  

We tested the utility of genetic cluster analysis in ascertaining population structure of a large data set for which population structure was previously known. Each of 600 individuals representing 20 distinct chicken breeds was genotyped for 27 microsatellite loci, and individual multilocus genotypes were used to infer genetic clusters. Individuals from each breed were inferred to belong mostly to the same cluster. The clustering success rate, measuring the fraction of individuals that were properly inferred to belong to their correct breeds, was consistently ~98%. When markers of highest expected heterozygosity were used, genotypes that included at least 8–10 highly variable markers from among the 27 markers genotyped also achieved >95% clustering success. When 12–15 highly variable markers and only 15–20 of the 30 individuals per breed were used, clustering success was at least 90%. We suggest that in species for which population structure is of interest, databases of multilocus genotypes at highly variable markers should be compiled. These genotypes could then be used as training samples for genetic cluster analysis and to facilitate assignments of individuals of unknown origin to populations. The clustering algorithm has potential applications in defining the within-species genetic units that are useful in problems of conservation.


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the accuracy and efficiency of the C-K-means algorithm outperform the existing algorithms under both sequential and parallel conditions.
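The second phase, the Lloyd iteration, can be sketched as below. The covering algorithm of the first phase is not reproduced, so the initial centers are simply passed in as an assumption standing in for its output. The function name `lloyd` and the sample data are illustrative only.

```python
import math

def lloyd(points, centers, max_iter=100):
    """Standard Lloyd iteration: assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    centers = [tuple(c) for c in centers]
    groups = []
    for _ in range(max_iter):
        # Assignment step
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[idx].append(p)
        # Update step: coordinate-wise mean; an empty group keeps its old center
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    # groups reflects the last assignment step that was run
    return centers, groups

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
# Hypothetical initial centers, standing in for the output of the first phase
centers, groups = lloyd(points, [(0, 0), (10, 0)])
print(centers)  # [(0.0, 1.0), (10.0, 1.0)]
```

Since the number of centers passed in fixes k, a first phase that determines both k and the starting centers removes the two manual choices that plain K-means requires.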


2018 ◽  
Vol 143 (5) ◽  
pp. 587-592 ◽  
Author(s):  
Pieter J. Slootweg ◽  
Edward W. Odell ◽  
Daniel Baumhoer ◽  
Roman Carlos ◽  
Keith D. Hunter ◽  
...  

A data set has been developed for the reporting of excisional biopsies and resection specimens for malignant odontogenic tumors by members of an expert panel working on behalf of the International Collaboration on Cancer Reporting, an international organization established to unify and standardize reporting of cancers. Odontogenic tumors are rare, which limits evidence-based support for designing a scientifically sound data set for reporting them. Thus, the selection of reportable elements within the data set and considering them as either core or noncore is principally based on evidence from malignancies affecting other organ systems, limited case series, expert opinions, and/or anecdotal reports. Nevertheless, this data set serves as the initial step toward standardized reporting on malignant odontogenic tumors that should evolve over time as more evidence becomes available and functions as a prompt for further research to provide such evidence.


2021 ◽  
Vol 11 (22) ◽  
pp. 10596
Author(s):  
Chung-Hong Lee ◽  
Hsin-Chang Yang ◽  
Yenming J. Chen ◽  
Yung-Lin Chuang

Recently, detecting real-time world events from Twitter messages through algorithmic computation has emerged as a new paradigm in data science applications. During a high-impact event, people may want the latest information about how the event is developing so that they can better understand the situation and its possible trends when making decisions. In emergencies, however, governments and enterprises are often unable to notify people in time for early warning and risk avoidance. A sensible solution is to integrate real-time event monitoring and intelligence gathering functions into their decision support system. Such a system can provide real-time event summaries, which are updated whenever important new events are detected. Therefore, in this work, we combine a developed Twitter-based real-time event detection algorithm with pre-trained language models for summarizing emergent events. We used an online text-stream clustering algorithm and a self-adaptive method to gather the Twitter data for detection of emerging events. Subsequently, we used the XSum data set with a pre-trained language model, namely the T5 model, to train the summarization model. The ROUGE metrics were used to compare the summary performance of various models. We then used the trained model to summarize the incoming Twitter data set for experimentation. In particular, we provide a real-world case study, namely the COVID-19 pandemic, to verify the applicability of the proposed method. Finally, we conducted a survey with human judges to assess the quality of the generated summaries. The case study and experimental results demonstrate that our summarization method gives users a feasible way to quickly understand updates in specific event intelligence based on the real-time summary of the event story.

