Measurement of clustering effectiveness for document collections

Author(s):  
Meng Yuan ◽  
Justin Zobel ◽  
Pauline Lin

Abstract Clustering of the contents of a document corpus is used to create sub-corpora that are intended to consist of documents related to each other. However, while clustering is used in a variety of ways in document applications such as information retrieval, and a range of methods have been applied to the task, there has been relatively little exploration of how well it works in practice. Indeed, given the high dimensionality of the data, it is possible that clustering may not always produce meaningful outcomes. In this paper we use a well-known clustering method to explore a variety of techniques, existing and novel, to measure clustering effectiveness. Results with our new, extrinsic techniques based on relevance judgements or retrieved documents demonstrate that retrieval-based information can be used to assess the quality of clustering, and also show that clustering can succeed to some extent at gathering together similar material. Further, they show that intrinsic clustering techniques that have been shown to be informative in other domains do not work for information retrieval. Whether clustering is sufficiently effective to have a significant impact on practical retrieval is unclear, but, as the results show, our measurement techniques can effectively distinguish between clustering methods.
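
To make the idea of an extrinsic, relevance-based measure concrete, here is a minimal sketch. It is an illustration only and not one of the measures defined in the paper: the function name `relevant_concentration` and the scoring rule (the share of a query's judged-relevant documents that land in the single most popular cluster) are assumptions chosen for simplicity.

```python
from collections import Counter


def relevant_concentration(cluster_of, relevant_docs):
    """Share of a query's relevant documents found in their most common cluster.

    cluster_of: dict mapping document id -> cluster id (hypothetical input format).
    relevant_docs: iterable of document ids judged relevant to one query.
    Returns a value in [0, 1]; 1.0 means all relevant documents share a cluster.
    """
    # Count how many relevant documents fall into each cluster,
    # ignoring documents that were never clustered.
    counts = Counter(cluster_of[d] for d in relevant_docs if d in cluster_of)
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


# Example: three of the four relevant documents sit in cluster 0.
clusters = {"a": 0, "b": 0, "c": 1, "d": 0, "e": 2}
print(relevant_concentration(clusters, ["a", "b", "c", "d"]))  # 0.75
```

Averaging such a score over many queries would give one number per clustering, which is the kind of comparison between clustering methods the abstract describes.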

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Wenjia Chen ◽  
Jinlin Li

Abstract Background To enhance teleconsultation management, demands can be classified into different patterns, and the service for each demand pattern can be improved. Methods For effective teleconsultation classification, a novel ensemble hierarchical clustering method is proposed in this study. In the proposed method, individual clustering results are first obtained by different hierarchical clustering methods and then ensembled by one-hot encoding, the calculation and division of cosine similarity, and network graph representation. In the network graph built from high cosine similarities, connected demand series are categorized into one pattern. For verification, 43 teleconsultation demand series are used as sample data, and the efficiency and quality of teleconsultation services are analyzed before and after the demand classification. Results The teleconsultation demands are classified into three categories: erratic, lumpy, and slow. Under fixed strategies, the service analysis after demand classification reveals the deficiencies of teleconsultation services, whereas the analysis before demand classification cannot. Conclusion The proposed ensemble hierarchical clustering method can effectively categorize teleconsultation demands, and effective demand categorization can enhance teleconsultation management.
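
The ensemble steps named in the Methods (one-hot encoding, cosine similarity, a graph of highly similar series, connected components as patterns) can be sketched as follows. This is a simplified reading under stated assumptions, not the authors' implementation: the base clusterings are taken as given label lists, and the function names and the `threshold` value are hypothetical.

```python
from itertools import combinations
from math import sqrt


def one_hot_rows(labelings):
    """Concatenate the one-hot encodings of each item's label across base clusterings."""
    n = len(labelings[0])
    rows = [[] for _ in range(n)]
    for labels in labelings:
        classes = sorted(set(labels))
        for i, lab in enumerate(labels):
            rows[i].extend(1.0 if lab == c else 0.0 for c in classes)
    return rows


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def ensemble_patterns(labelings, threshold=0.6):
    """Group items whose encodings are linked by cosine similarity >= threshold."""
    rows = one_hot_rows(labelings)
    n = len(rows)
    # Build an undirected graph with an edge for every highly similar pair.
    adjacency = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if cosine(rows[i], rows[j]) >= threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    # Each connected component of the graph becomes one pattern.
    seen, patterns = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        patterns.append(sorted(component))
    return patterns


# Three hypothetical base clusterings of five demand series.
base = [[0, 0, 1, 1, 2], [0, 0, 0, 1, 1], [1, 1, 0, 0, 2]]
print(ensemble_patterns(base, threshold=0.6))  # [[0, 1], [2, 3], [4]]
```

Because every encoded row contains exactly one 1 per base clustering, the cosine similarity of two rows equals the fraction of base clusterings that place the two series together, so the threshold acts as a required level of agreement.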


2020 ◽  
Vol 25 (2) ◽  
pp. 87-104
Author(s):  
Satinder Bal Gupta ◽  
Rajkumar Yadav ◽  
Shivani Gupta

Abstract Clustering has now become a very important tool for managing data in many areas, such as pattern recognition, machine learning, and information retrieval. Databases are growing day by day, and so data must be maintained in such a manner that useful information can easily be extracted and used accordingly. In this process, clustering plays an important role, as it forms clusters of the data on the basis of similarity in the data. There are more than a hundred clustering methods and algorithms that can be used for mining data, but all these algorithms do not provide models for their clusters, and thus it becomes difficult to categorise all of them. This paper describes the most commonly used and popular clustering techniques and also compares them on the basis of their merits, demerits, and time complexity.


Author(s):  
Andi Setiawan ◽  
Sri Nining ◽  
Tri Ginanjar Laksana

The distribution of midwife practices under the pomegranate (quality of service) scheme in Cirebon is hard to determine, and the practices are difficult to locate because of the vast area of Cirebon. As a result, a number of pregnant women do not get help quickly (giving birth without medical assistance) because they do not know the location of the nearest pomegranate (quality of service) midwife practice. In addition, a number of pomegranate (quality of service) midwives have not yet cooperated with the BPJS insurance to perform payment transactions. This study uses a clustering method to segment the data and so facilitate information retrieval about pomegranate (quality of service) midwives. The clustering method has the following stages: pattern representation, selection of traits or characteristics, pattern proximity, and distance measurement. The data were obtained from IBI (Indonesian Midwives Association), and the tools used were phpMyAdmin, Notepad++, XAMPP, the Google Maps API, and Dreamweaver. The system is expected to map the locations of pomegranate (quality of service) midwife practices in the district of Cirebon, find the nearest pomegranate (quality of service) midwife, and find pomegranate midwives who work with BPJS to perform payment transactions. It is hoped that this will help people attend to pregnant women rapidly and reduce maternal and child mortality.


2007 ◽  
Vol 06 (03) ◽  
pp. 181-188 ◽  
Author(s):  
Jiaming Zhan ◽  
Han Tong Loh

Document clustering is a significant research issue in information retrieval and text mining. Traditionally, most clustering methods were based on the vector space model, which has a few limitations such as high dimensionality and weakness in handling synonymy and polysemy. Latent semantic indexing (LSI) is able to deal with such problems to some extent. Previous studies have shown that using LSI could reduce the time needed to cluster a large document set while having little effect on clustering accuracy. However, when clustering a small document set, accuracy is of greater concern than efficiency. In this paper, we demonstrate that LSI can improve the clustering accuracy of a small document set, and we also recommend the dimensions needed to achieve the best clustering performance.


2015 ◽  
Vol 2015 ◽  
pp. 1-16
Author(s):  
Xiao Sun ◽  
Tongda Zhang ◽  
Yueting Chai ◽  
Yi Liu

Most popular clustering methods make some strong assumptions about the dataset. For example, k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we propose a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm, which uses a new isolation criterion called centroid distance. Compared with other density-based isolation criteria, our proposed centroid distance isolation criterion addresses the problems caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method in separating naturally isolated clusters but can also identify clusters that are adjacent, overlapping, or under background noise. Finally, we compare our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behavioral information. The results show that the LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it.
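
The k-means assumption mentioned above is visible in the algorithm itself. The following is a minimal sketch of standard Lloyd-style k-means (not the LASS algorithm, whose details are not given in the abstract); the function names and the fixed iteration count are illustrative choices. Every point is assigned to its nearest centroid by squared Euclidean distance, which weights all directions equally and therefore suits roughly spherical clusters of similar spread.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given initial centroids.

    Returns the final centroids and the groups from the last assignment step.
    """
    groups = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: squared_distance(p, centroids[k]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else centroids[k]
            for k, group in enumerate(groups)
        ]
    return centroids, groups


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(groups[0])  # [(0, 0), (0, 1), (1, 0)]
```

On elongated or unevenly dense clusters the same nearest-centroid rule can split or merge groups incorrectly, which is the weakness the abstract sets out to address.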


Technologies are changing day by day, and the Internet of Things (IoT) produces data worldwide that may be of great business importance to various users. To make sense of such data, adaptive and K-medoid clustering techniques are mostly employed in data mining. This research work focuses on comparing adaptive, K-medoid, and a novel clustering technique on IoT data collected in intelligent traffic systems (ITSs). The traffic data stream is collected from an online site and consists of 30,000 instances with 9 attributes; the clusters are formed, and the number of clusters is identified, after the evaluation. The proposed technique is significantly simpler than some other clustering techniques with respect to the computation, recall, and precision parameters. In traffic databases, the outcome depends on data separation and cluster enhancement, that is, on the quality of the clusters. The aim is to resolve the major issues that overload the systems or centres in IoT as a consequence of the huge amount of data on the internet. A set of experiments is evaluated using token and manufactured data from a traffic use case, in which the traffic observations come from city monitoring. The comparison of clustering methods helps in determining a suitable clustering approach for the given IoT database, one that results in optimal performance metrics.


Author(s):  
Hilton H. Mollenhauer

Many factors (e.g., resolution of microscope, type of tissue, and preparation of sample) affect electron microscopical images and alter the amount of information that can be retrieved from a specimen. Of interest in this report are those factors associated with the evaluation of epoxy embedded tissues. In this context, information retrieval is dependent, in part, on the ability to “see” sample detail (e.g., contrast) and, in part, on the quality of sample preservation. Two aspects of this problem will be discussed: 1) epoxy resins and their effect on image contrast, information retrieval, and sample preservation; and 2) the interaction between some stains commonly used for enhancing contrast and information retrieval.


2021 ◽  
Vol 10 (3) ◽  
pp. 161
Author(s):  
Hao-xuan Chen ◽  
Fei Tao ◽  
Pei-long Ma ◽  
Li-na Gao ◽  
Tong Zhou

Spatial analysis is an important means of mining floating car trajectory information, and clustering and density analysis are among its common methods. The choice of clustering method affects the accuracy and time efficiency of the analysis results. Therefore, clarifying the principles and characteristics of each method is the primary prerequisite for problem solving. Four representative spatial analysis methods are taken as examples: KMeans, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Clustering by Fast Search and Find of Density Peaks (CFSFDP), and Kernel Density Estimation (KDE). Applying them to the hotspot spatiotemporal mining problem of taxi trajectories, quantitative analysis and experimental verification show that the DBSCAN and KDE algorithms have strong hotspot discovery capabilities, but the shape of the heat regions found by DBSCAN is relatively more robust. DBSCAN and CFSFDP can achieve high spatial accuracy in calculating the entrance and exit position of a Point of Interest (POI). KDE and DBSCAN are more suitable for the classification of heat index. When the dataset scale is similar, KMeans has the highest operating efficiency, while CFSFDP and KDE are inferior. This paper resolves to a certain extent the lack of scientific basis for selecting spatial analysis methods in current research. The conclusions drawn in this paper can provide technical support and act as a reference for the selection of methods to solve the taxi trajectory mining problem.


2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset is to build prediction models (in the form of approximated functional relationships) without using the entire large data set. Such smaller subsets of data are often required in the exploratory analysis stages of studies that involve resource-consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method that requires only a single parameter to be specified, yet is shown to be as effective as SCM. A method to find suitable values for the parameter is also proposed. Because it has only a single parameter, the proposed clustering method is shown to be orders of magnitude more efficient to use than SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied to data extraction are identified, and the proposed method is shown to be a solution for them.


2006 ◽  
Vol 25 (2) ◽  
pp. 78 ◽  
Author(s):  
Marcia D. Kerchner

In the early years of modern information retrieval, the fundamental way in which we understood and evaluated search performance was by measuring precision and recall. In recent decades, however, models of evaluation have expanded to incorporate the information-seeking task and the quality of its outcome, as well as the value of the information to the user. We have developed a systems engineering-based methodology for improving the whole search experience. The approach focuses on understanding users’ information-seeking problems, understanding who has the problems, and applying solutions that address these problems. This information is gathered through ongoing analysis of site-usage reports, satisfaction surveys, Help Desk reports, and a working relationship with the business owners.
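
For readers unfamiliar with the two classic measures mentioned above, here is a small sketch using their standard set-based definitions. The function name and the document identifiers are hypothetical and do not come from the article.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, two of them among the three relevant ones.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(precision)  # 0.5
```

Both numbers describe only the result list, which is why the broader evaluation models described in the abstract also consider the user's task and the value of the information found.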

