Maximum Frequent Item Set based Clustering Algorithm for Big Text Data

Due to fast growth of internet and continuous expansion of World Wide Web like digital libraries, online news contributes to massive amount of electronic unstructured text documents on the web. Although lot traditional techniques are available to extract the knowledge from large collection of text documents, still to improve precision of the web search retrieval and to find most appropriate documents from huge text collections proficiently is a big challenge. Clustering techniques helps the search engine to retrieve the documents. The proposed system overcomes existing problems using bivariate n-gram frequent item clustering algorithm by concept of maximum frequent set which maintain the sequence and meaning of sentence in order to reduce huge dimension and and frequent item sets finds similarity. Then based on maximum document occurrence we cluster the documents. Thus our method obtains quality of clusters when compared with existing methodologies and improves the efficiency. The experiment is shown for sample Newsgroup dataset for existing K-Mean and FICMDO (Frequent item clustering method based on maximum document occurrence) and proved the f-measure is higher for our algorithm. Since the f-measure increases, obtains efficient clusters. Hence it is faster and efficient big data method which improves the performance when compared with vector space model like K-Means algorithm.

Download Full-text

A Roadmap to Integrate Document Clustering in Information Retrieval

Information Retrieval Methods for Multidisciplinary Applications ◽

10.4018/978-1-4666-3898-3.ch003 ◽

2013 ◽

pp. 31-45

Author(s):

R. Subhashini ◽

V.Jawahar Senthil Kumar

Keyword(s):

Information Retrieval ◽

Search Engines ◽

World Wide ◽

Clustering Algorithm ◽

Web Search ◽

Full Potential ◽

Digital Information ◽

Search Results ◽

The World ◽

The Web

The World Wide Web is a large distributed digital information space. The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential. Information Retrieval (IR) plays an important role in search engines. Today’s most advanced engines use the keyword-based (“bag of words”) paradigm, which has inherent disadvantages. Organizing web search results into clusters facilitates the user’s quick browsing of search results. Traditional clustering techniques are inadequate because they do not generate clusters with highly readable names. This paper proposes an approach for web search results in clustering based on a phrase based clustering algorithm. It is an alternative to a single ordered result of search engines. This approach presents a list of clusters to the user. Experimental results verify the method’s feasibility and effectiveness.

Download Full-text

The Effect of Stemming on Arabic Text Classification

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2011070104 ◽

2011 ◽

Vol 1 (3) ◽

pp. 54-70 ◽

Cited By ~ 11

Author(s):

Abdullah Wahbeh ◽

Mohammed Al-Kabi ◽

Qasem Al-Radaideh ◽

Emad Al-Shawakfa ◽

Izzat Alsmadi

Keyword(s):

Text Classification ◽

Digital Libraries ◽

Arabic Language ◽

Support Vector ◽

Svm Classifier ◽

Arabic Text ◽

Text Documents ◽

Information Retrieval Systems ◽

Arabic Text Classification ◽

The Web

The information world is rich of documents in different formats or applications, such as databases, digital libraries, and the Web. Text classification is used for aiding search functionality offered by search engines and information retrieval systems to deal with the large number of documents on the web. Many research papers, conducted within the field of text classification, were applied to English, Dutch, Chinese, and other languages, whereas fewer were applied to Arabic language. This paper addresses the issue of automatic classification or classification of Arabic text documents. It applies text classification to Arabic language text documents using stemming as part of the preprocessing steps. Results have showed that applying text classification without using stemming; the support vector machine (SVM) classifier has achieved the highest classification accuracy using the two test modes with 87.79% and 88.54%. On the other hand, stemming has negatively affected the accuracy, where the SVM accuracy using the two test modes dropped down to 84.49% and 86.35%.

Download Full-text

Clustering of the Web Search Results in Educational Recommender Systems

Educational Recommender Systems and Technologies ◽

10.4018/978-1-61350-489-5.ch007 ◽

2012 ◽

pp. 154-181 ◽

Cited By ~ 12

Author(s):

Constanta-Nicoleta Bodea ◽

Maria-Iuliana Dascalu ◽

Adina Lipai

Keyword(s):

Recommender Systems ◽

Clustering Algorithm ◽

Web Search ◽

Web Pages ◽

Lexical Database ◽

Assessment Task ◽

Search Results ◽

Meta Search ◽

Search Approach ◽

The Web

This chapter presents a meta-search approach, meant to deliver bibliography from the internet, according to trainees’ results obtained at an e-assessment task. The bibliography consists of web pages related to the knowledge gaps of the trainees. The meta-search engine is part of an education recommender system, attached to an e-assessment application for project management knowledge. Meta-search means that, for a specific query (or mistake made by the trainee), several search mechanisms for suitable bibliography (further reading) could be applied. The lists of results delivered by the standard search mechanisms are used to build thematically homogenous groups using an ontology-based clustering algorithm. The clustering process uses an educational ontology and WordNet lexical database to create its categories. The research is presented in the context of recommender systems and their various applications to the education domain.

Download Full-text

GOW-Stream: A novel approach of graph-of-words based mixture model for semantic-enhanced text stream clustering

Intelligent Data Analysis ◽

10.3233/ida-205443 ◽

2021 ◽

Vol 25 (5) ◽

pp. 1211-1231

Author(s):

Tham Vo ◽

Phuc Do

Keyword(s):

Social Networks ◽

Rapid Change ◽

Online News ◽

Text Documents ◽

Existing Problems ◽

Stream Clustering ◽

Evaluation Approach ◽

Novel Approach ◽

Independent Evaluation ◽

Real World Datasets

Recently, rapid growth of social networks and online news resources from Internet have made text stream clustering become an insufficient application in multiple domains (e.g.: text retrieval diversification, social event detection, text summarization, etc.) Different from traditional static text clustering approach, text stream clustering task has specific key challenges related to the rapid change of topics/clusters and high-velocity of coming streaming document batches. Recent well-known model-based text stream clustering models, such as: DTM, DCT, MStream, etc. are considered as word-independent evaluation approach which means largely ignoring the relations between words while sampling clusters/topics. It definitely leads to the decrease of overall model accuracy performance, especially for short-length text documents such as comments, microblogs, etc. in social networks. To tackle these existing problems, in this paper we propose a novel approach of graph-of-words (GOWs) based text stream clustering, called GOW-Stream. The application of common GOWs which are generated from each document batch while sampling clusters/topics can support to overcome the word-independent evaluation challenge. Our proposed GOW-Stream is promising to significantly achieve better text stream clustering performance than recent state-of-the-art baselines. Extensive experiments on multiple benchmark real-world datasets demonstrate the effectiveness of our proposed model in both accuracy and time-consuming performances.

Download Full-text

The Effect of Stemming on Arabic Text Classification

Information Retrieval Methods for Multidisciplinary Applications ◽

10.4018/978-1-4666-3898-3.ch013 ◽

2013 ◽

pp. 207-225 ◽

Cited By ~ 3

Author(s):

Abdullah Wahbeh ◽

Mohammed Al-Kabi ◽

Qasem Al-Radaideh ◽

Emad Al-Shawakfa ◽

Izzat Alsmadi

Keyword(s):

Text Classification ◽

Digital Libraries ◽

Arabic Language ◽

Support Vector ◽

Svm Classifier ◽

Arabic Text ◽

Text Documents ◽

Information Retrieval Systems ◽

Arabic Text Classification ◽

The Web

Download Full-text

Retriever

Managing Business with Electronic Commerce ◽

10.4018/978-1-930708-12-9.ch004 ◽

2002 ◽

pp. 59-81 ◽

Cited By ~ 7

Author(s):

Anupam Joshi ◽

Zhihua Jiang

Keyword(s):

Vector Space ◽

Search Engine ◽

Search Engines ◽

Web Search ◽

State Of The Art ◽

Dissimilarity Matrix ◽

Fuzzy Algorithm ◽

Web Search Engines ◽

N Gram ◽

The Web

Web search engines have become increasingly ineffective as the number of documents on the Web have proliferated. Typical queries retrieve hundreds of documents, most of which have no relation with what the user was looking for. The chapter describes a system named Retriever that uses a recently proposed robust fuzzy algorithm RFCMdd to cluster the results of a query from a search engine into groups. These groups and their associated keywords are presented to the user, who can then look into the URLs for the group(s) that s/he finds interesting. This application requires clustering in the presence of a significant amount of noise, which our system can handle efficiently. N-Gram and Vector Space methods are used to create the dissimilarity matrix for clustering. We discuss the performance of our system by comparing it with other state-of-the-art peers, such as Husky search, and present the results from analyzing the effectiveness of the N-Gram and Vector Space methods during the generation of dissimilarity matrices.

Download Full-text

Classification of means and methods of the Web semantic retrieval

PROBLEMS IN PROGRAMMING ◽

10.15407/pp2017.01.030 ◽

2017 ◽

pp. 030-050

Author(s):

J.V. Rogushina ◽

Keyword(s):

Search Engines ◽

Domain Knowledge ◽

Information Needs ◽

Web Search ◽

User Interaction ◽

Query Languages ◽

Semantic Search ◽

Semantic Retrieval ◽

The Web

Problems associated with the improve ment of information retrieval for open environment are considered and the need for it’s semantization is grounded. Thecurrent state and prospects of development of semantic search engines that are focused on the Web information resources processing are analysed, the criteria for the classification of such systems are reviewed. In this analysis the significant attention is paid to the semantic search use of ontologies that contain knowledge about the subject area and the search users. The sources of ontological knowledge and methods of their processing for the improvement of the search procedures are considered. Examples of semantic search systems that use structured query languages (eg, SPARQL), lists of keywords and queries in natural language are proposed. Such criteria for the classification of semantic search engines like architecture, coupling, transparency, user context, modification requests, ontology structure, etc. are considered. Different ways of support of semantic and otology based modification of user queries that improve the completeness and accuracy of the search are analyzed. On base of analysis of the properties of existing semantic search engines in terms of these criteria, the areas for further improvement of these systems are selected: the development of metasearch systems, semantic modification of user requests, the determination of an user-acceptable transparency level of the search procedures, flexibility of domain knowledge management tools, increasing productivity and scalability. In addition, the development of means of semantic Web search needs in use of some external knowledge base which contains knowledge about the domain of user information needs, and in providing the users with the ability to independent selection of knowledge that is used in the search process. There is necessary to take into account the history of user interaction with the retrieval system and the search context for personalization of the query results and their ordering in accordance with the user information needs. All these aspects were taken into account in the design and implementation of semantic search engine "MAIPS" that is based on an ontological model of users and resources cooperation into the Web.

Download Full-text

A Survey on Improving the Web Search Ranking by User Behavior Information

SSRN Electronic Journal ◽

10.2139/ssrn.3419718 ◽

2010 ◽

Author(s):

Mohamed Husain ◽

Amarjeet Singh ◽

Manoj Kumar ◽

Rakesh Ranjan

Keyword(s):

Web Search ◽

User Behavior ◽

The Web ◽

Web Search Ranking

Download Full-text

Fuzzy Set Based Clustering Algorithm of Web Text

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.678.19 ◽

2014 ◽

Vol 678 ◽

pp. 19-22

Author(s):

Hong Xin Wan ◽

Yun Peng

Keyword(s):

Key Words ◽

Fuzzy Set ◽

Clustering Algorithm ◽

Text Clustering ◽

Classification Methods ◽

Comparative Experiment ◽

Fuzzy Algorithm ◽

Pattern Clustering ◽

The Web ◽

Computing Accuracy

Web text exists non-certain and non-structure contents ,and it is difficult to cluster the text by normal classification methods. We propose a web text clustering algorithm based on fuzzy set to increase the computing accuracy with the web text. After abstracting the key words of the text, we can look it as attributes and design the fuzzy algorithm to decide the membership of the words. The algorithm can improve the algorithm complexity of time and space, increase the robustness comparing to the normal algorithm. To test the accuracy and efficiency of the algorithm, we take the comparative experiment between pattern clustering and our algorithm. The experiment shows that our method has a better result.

Download Full-text

A MODEL BASED ON FUZZY LINGUISTIC INFORMATION TO EVALUATE THE QUALITY OF DIGITAL LIBRARIES

International Journal of Information Technology & Decision Making ◽

10.1142/s0219622010003907 ◽

2010 ◽

Vol 09 (03) ◽

pp. 455-472 ◽

Cited By ~ 31

Author(s):

F. J. CABRERIZO ◽

J. LÓPEZ-GIJÓN ◽

A. A. RUÍZ ◽

E. HERRERA-VIEDMA

Keyword(s):

Digital Libraries ◽

User Satisfaction ◽

Information Access ◽

Great Influence ◽

Linguistic Information ◽

Model Based ◽

Global Quality ◽

Fuzzy Linguistic ◽

The Web

The Web is changing the information access processes and it is one of the most important information media. Thus, the developments on the Web are having a great influence over the developments on others information access instruments as digital libraries. As the development of digital libraries is to satisfy user need, user satisfaction is essential for the success of a digital library. The aim of this paper is to present a model based on fuzzy linguistic information to evaluate the quality of digital libraries. The quality evaluation of digital libraries is defined using users' perceptions on the quality of digital services provided through their Websites. We assume a fuzzy linguistic modeling to represent the users' perception and apply automatic tools of fuzzy computing with words based on the LOWA and LWA operators to compute global quality evaluations of digital libraries. Additionally, we show an example of application of this model where three Spanish academic digital libraries are evaluated by fifty users.

Download Full-text