Type2 IFC with SOA for Topic Detection and Document Clustering Analysis

Author(s):  
Perumal P ◽  
Mathivanan B

Abstract Automatic document clustering and topic extraction from a corpus are essential requirements in many real-time applications, where they are used to locate data quickly. Hence, in this paper, Type 2 Intuitionistic Fuzzy Clustering and Seagull Optimization Algorithm (Type 2 IFCSOA) is developed for document clustering and topic detection. The Type 2 IFCSOA is used to cluster the documents, and an ensemble approach is used to identify the topics from the clustered documents. In the proposed methodology, pre-processing (tokenization, stop-word removal and stemming) is first applied to remove unwanted information from the documents. The proposed method is then used to cluster the documents, and the clustered documents are labeled on the basis of their clusters. After that, to achieve topic detection, the ensemble approach is applied with feature extraction phases such as Term Frequency-Inverse Document Frequency (TF-IDF), Mutual Information (MI), the TextRank algorithm and keyword extraction from Co-occurrence Statistical Information (CSI). The proposed methodology is implemented in MATLAB and its performance is evaluated with statistical measurements such as precision, recall, accuracy, sensitivity, purity and entropy. The proposed method is compared with conventional methods such as Fuzzy C-Means clustering (FCM), FCM-Particle Swarm Optimization (PSO), FCM-Genetic Algorithm (GA) and K-means clustering.
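The abstract above names purity and entropy among its evaluation measures without defining them. The following is a minimal Python sketch of the commonly used definitions (purity as the fraction of documents falling in the majority class of their cluster, entropy as the size-weighted class entropy per cluster). The function name, the toy labels and the choice of base-2 logarithm are assumptions for illustration; the paper's MATLAB implementation and exact formulas may differ.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, true_labels):
    """Return (purity, entropy) of a clustering against reference labels.

    Hypothetical helper: uses the common textbook definitions, which are
    assumed here and not taken from the paper itself.
    """
    # Group the reference labels by the cluster each document was assigned to.
    clusters = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        clusters[cid].append(label)

    n = len(true_labels)
    purity = 0.0
    entropy = 0.0
    for members in clusters.values():
        counts = Counter(members)
        size = len(members)
        # Purity: share of all documents that sit in their cluster's majority class.
        purity += max(counts.values()) / n
        # Entropy of the class distribution inside this cluster (base 2).
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        # Weight each cluster's entropy by its relative size.
        entropy += (size / n) * h
    return purity, entropy


# Toy data (made up): six documents, two clusters, two reference topics.
cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = ["sports", "sports", "politics", "politics", "politics", "politics"]
purity, entropy = purity_and_entropy(cluster_ids, true_labels)
print(f"purity={purity:.3f} entropy={entropy:.3f}")
```

Higher purity and lower entropy both indicate clusters that agree more closely with the reference topics.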

Author(s):  
Yuchi Kanzawa ◽  

In this study, a maximizing model of Bezdek-like spherical fuzzy c-means clustering is proposed, which is based on the regularization of the maximizing model of spherical hard c-means. Such a maximizing model has been unclear for the Bezdek-like method, whereas other types of methods have been well investigated in both minimizing and maximizing models. Using theoretical analysis and numerical experiments, the classification rule of the proposed method is shown. Using numerical examples, the proposed method is shown to be valid for document clustering, because documents are represented as spherical data via term frequency-inverse document frequency weighting and normalization processing.
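To illustrate why weighted and normalized documents count as spherical data, here is a small Python sketch: each document is weighted by TF-IDF and then scaled to unit Euclidean length, so every vector lies on the unit sphere and cosine similarity reduces to a dot product. The natural-log IDF variant, the function names and the toy documents are assumptions; the abstract does not state which weighting variant the study uses.

```python
import math
from collections import Counter


def tfidf_unit_vectors(docs):
    """Map tokenized documents to L2-normalized sparse TF-IDF vectors.

    Assumed weighting: raw term frequency times log(N / document frequency).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Scale to unit length; a document with all-zero weights stays empty.
        vectors.append({t: w / norm for t, w in weights.items()} if norm > 0 else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit vectors is simply their dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Toy corpus (made up for illustration).
docs = [
    ["fuzzy", "clustering", "model"],
    ["fuzzy", "clustering", "documents"],
    ["poisson", "mixture", "model"],
]
vectors = tfidf_unit_vectors(docs)
norms = [math.sqrt(sum(w * w for w in v.values())) for v in vectors]
print(norms)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Spherical clustering methods then compare documents by this dot product and do not use Euclidean distance on raw counts.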


Author(s):  
Pedram Vahdani Amoli ◽  
Omid Sojoodi Sh.

In this paper a novel method is proposed for scientific document clustering. The proposed method is a summarization-based hybrid algorithm which comprises a preprocessing phase. In the preprocessing phase, unimportant words which are frequently used in the text are removed. This process reduces the amount of data for the clustering purpose. Furthermore, frequent items cause overlapping between the clusters, which leads to inefficient cluster separation. After the preprocessing phase, Term Frequency/Inverse Document Frequency (TFIDF) is calculated for all words and stems over the document to score them in the document. Text summarization is then performed at the sentence level. Document clustering is finally done according to the calculated TFIDF scores. The hybrid process of the proposed scheme, from the preprocessing phase to document clustering, yields a rapid and efficient clustering method, which is evaluated on 400 English texts extracted from scientific databases covering 11 different topics. The proposed method is compared with the CSSA, SMTC and Max-Capture methods. The results demonstrate the proficiency of the proposed scheme in terms of computation time and efficiency using the F-measure criterion.


2010 ◽  
Vol 29-32 ◽  
pp. 2620-2626
Author(s):  
Jing Li Zhou ◽  
Xue Jun Nie ◽  
Lei Hua Qin ◽  
Jian Feng Zhu

This paper proposes a novel fuzzy similarity measure based on the relationships between terms and categories. A term-category matrix is presented to represent such relationships and each element in the matrix denotes a membership degree that a term belongs to a category, which is computed using term frequency inverse document frequency and fuzzy relationships between documents and categories. Fuzzy similarity takes into account the situation that one document belongs to multiple categories and is computed using fuzzy operators. The experimental results show that the proposed fuzzy similarity surpasses other common similarity measures not only in the reliable derivation of document clustering results, but also in document clustering accuracies.


Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Peter Brown ◽  
Aik-Choon Tan ◽  
Mohamed A El-Esawi ◽  
Thomas Liehr ◽  
Oliver Blanck ◽  
...  

Abstract Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency–Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.


1995 ◽  
Vol 1 (2) ◽  
pp. 163-190 ◽  
Author(s):  
Kenneth W. Church ◽  
William A. Gale

Abstract Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a “bag-of-words” assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ. φ is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Γ distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).
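As a rough illustration of the quantities in this abstract, the Python sketch below compares a single Poisson with a Negative Binomial (a Poisson whose rate follows a Gamma distribution) at the same mean rate θ. It computes the IDF each model predicts, -log2 Pr(x ≥ 1), and the adaptation Pr(x ≥ 2 | x ≥ 1). The Gamma parameterization (shape k, mean θ), the function names and the sample values are assumptions for illustration and are not taken from the paper.

```python
import math


def _summaries(p0, p1):
    """Turn Pr(x = 0) and Pr(x = 1) into (predicted IDF, adaptation)."""
    present = 1.0 - p0                      # Pr(x >= 1): the word occurs in the document
    idf = -math.log2(present)               # predicted inverse document frequency
    adaptation = (present - p1) / present   # Pr(x >= 2 | x >= 1)
    return idf, adaptation


def poisson_stats(theta):
    """Single Poisson with mean theta occurrences per document."""
    p0 = math.exp(-theta)
    p1 = theta * p0
    return _summaries(p0, p1)


def negative_binomial_stats(theta, k):
    """Poisson mixed over a Gamma with shape k and mean theta (assumed form)."""
    base = 1.0 + theta / k
    p0 = base ** (-k)
    p1 = theta * base ** (-(k + 1))
    return _summaries(p0, p1)


# Hypothetical word averaging 0.1 occurrences per document.
theta, k = 0.1, 0.5
idf_p, adapt_p = poisson_stats(theta)
idf_nb, adapt_nb = negative_binomial_stats(theta, k)
print(f"Poisson:           IDF={idf_p:.3f} adaptation={adapt_p:.3f}")
print(f"Negative Binomial: IDF={idf_nb:.3f} adaptation={adapt_nb:.3f}")
```

With the rate allowed to vary across documents, the mixture concentrates the same number of occurrences in fewer documents, so it predicts both a higher IDF and a higher chance of seeing the word a second time once it has appeared.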


Author(s):  
Saud Altaf ◽  
Sofia Iqbal ◽  
Muhammad Waseem Soomro

This paper focuses on capturing the meaning of Natural Language Understanding (NLU) text features to detect duplicate unsupervised features. The NLU features are compared with lexical approaches to identify the most suitable classification technique. A transfer-learning approach is used to train the feature extraction on the Semantic Textual Similarity (STS) task. All features are evaluated on two types of datasets, drawn from Bosch bug reports and Wikipedia articles. This study aims to structure recent research efforts by comparing NLU concepts for representing the semantics of text and applying them to information retrieval (IR). The main contribution of this paper is a comparative study of semantic similarity measurements. The experimental results demonstrate the Term Frequency–Inverse Document Frequency (TF-IDF) feature results on both datasets with reasonable vocabulary size. They also indicate that the Bidirectional Long Short Term Memory (BiLSTM) can learn the structure of a sentence to improve the classification.


Author(s):  
Mariani Widia Putri ◽  
Achmad Muchayan ◽  
Made Kamisutara

Recommendation systems are currently a trend. People today rely increasingly on online transactions for a variety of personal reasons. Recommendation systems offer an easier and faster way to shop, so users do not need to spend much time finding the items they want. Competition among businesses has also changed, so they must change their approach in order to reach prospective customers. A system that can support this is therefore needed. In this study, the authors build a product recommendation system using the Content-Based Filtering method and Term Frequency Inverse Document Frequency (TF-IDF) from the Information Retrieval (IR) model, in order to obtain results that are efficient and that meet the needs of a solution for improving Customer Relationship Management (CRM). The recommendation system is built and deployed as a solution to increase customers' brand awareness and to minimize failed transactions caused by a lack of information that can be conveyed directly or offline. The data consist of 258 product codes, each of which has eight categories and 33 constituent keywords in accordance with the company's product knowledge. The TF-IDF calculation yields a weight of 13.854 when displaying the top-ranked product recommendation, and achieves an accuracy of 96.5% in recommending pens.

