Information Retrieval based on Cluster Analysis Approach

International Journal of Computer Science and Information Technology ◽

10.5121/ijcsit.2021.13502 ◽

2021 ◽

Vol 13 (5) ◽

pp. 21-29

Author(s):

Orabe Almanaseer

Keyword(s):

Information Retrieval ◽

Phase 1 ◽

Evaluation Criteria ◽

Retrieval Process ◽

Text Documents ◽

Data Set ◽

Analysis Process ◽

High Utility ◽

High Utility Patterns ◽

User Queries

The huge volume of text documents available on the internet has made it difficult for users to find the information that is valuable to them, so efficient applications for extracting knowledge of interest from textual documents are vitally important. This paper addresses the problem of responding to user queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a cluster-based information retrieval framework is proposed, with the aim of designing and developing a system for analysing and extracting useful patterns from text documents. In this approach, a preprocessing step is first performed to find frequent and high-utility patterns in the data set. A Vector Space Model (VSM) is then used to represent the data set. The system was implemented in two main phases. In phase 1, a clustering analysis process groups the documents into several clusters; in phase 2, an information retrieval process ranks the clusters according to the user query and retrieves the relevant documents from the clusters deemed relevant to that query. The results were evaluated using recall and precision (P@5, P@10) of the retrieved results: P@5 was 0.660 and P@10 was 0.655.
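The vector space representation and the P@k metric mentioned in this abstract can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and cosine similarity; the paper's actual term weighting is not stated in the abstract, and the toy documents, the query, and the relevance judgments below are invented for illustration only.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed weighting, not the paper's)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant (P@k)."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

# Hypothetical toy collection and relevance judgments.
docs = {
    "d1": "cluster analysis groups similar documents",
    "d2": "football match results",
    "d3": "document clustering for retrieval",
    "d4": "cooking pasta recipes",
    "d5": "retrieval of clustered documents",
}
relevant = {"d1", "d3", "d5"}

query = tf_vector("documents retrieval")
ranked = sorted(docs, key=lambda d: cosine(query, tf_vector(docs[d])), reverse=True)
print(ranked, precision_at_k(ranked, relevant, 3), precision_at_k(ranked, relevant, 5))
```

In a cluster-based system the same similarity would first be applied to cluster representatives, so that only documents from the best-matching clusters are ranked.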

DOCUMENT CLUSTERING BY DYNAMIC HIERARCHICAL ALGORITHM BASED ON FUZZY SET TYPE-II FROM FREQUENT ITEMSET

Jurnal Ilmu Komputer dan Informasi ◽

10.21609/jiki.v9i2.383 ◽

2016 ◽

Vol 9 (2) ◽

pp. 88

Author(s):

Saiful Bahri Musa ◽

Andi Baso Kaswar ◽

Supria Supria ◽

Susiana Sari

Keyword(s):

Information Retrieval ◽

Fuzzy Set ◽

Document Clustering ◽

Static Method ◽

Frequent Itemset ◽

Type II ◽

Retrieval Process ◽

Text Documents ◽

F Measure

One way to facilitate information retrieval is to cluster the collection of existing documents. Existing text documents are often unstructured: their forms vary and their groupings are ambiguous, which makes the information retrieval process difficult. Moreover, new documents emerge every second and need to be clustered. Static document clustering methods generally cluster the documents only after the whole collection has been gathered, so re-clustering all documents whenever a new document arrives is inefficient. In this paper, we propose a new method for document clustering with a dynamic hierarchical algorithm based on type-II fuzzy sets from frequent itemsets. The method has three main phases: determination of key terms, extraction of candidate clusters, and construction of the cluster hierarchy. In the experiments, it achieved F-measure values of 0.40 for Newsgroup, 0.62 for Classic and 0.38 for Reuters. Meanwhile, the computation time when a new document is added is lower than that of the previous static method. The results show that this method is suitable for producing hierarchical clustering solutions effectively and efficiently in a dynamic environment. The method also gives accurate clustering results.
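The abstract reports F-measure values without stating the formula. One widely used definition for clustering matches each true class to its best-scoring cluster and weights the result by class size. The sketch below implements that definition as an assumption; it may differ from the exact variant the authors used, and the labels are invented.

```python
from collections import Counter

def clustering_f_measure(true_labels, cluster_ids):
    """Class-size-weighted best-match F-measure for a flat clustering.

    For class i and cluster j: precision = n_ij / n_j, recall = n_ij / n_i.
    Each class contributes the best F1 it achieves over all clusters.
    """
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total

# Hypothetical labels: a perfect clustering and a degenerate single cluster.
perfect = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
single = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 0])
print(perfect, single)
```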

Using Inverted Index for Fingerprint Search

Journal of Information and Data Management ◽

10.5753/jidm.2021.1918 ◽

2021 ◽

Vol 12 (5) ◽

Author(s):

Johnny Marcos S. Soares ◽

Luciano Barbosa ◽

Paulo Antonio Leal Rego ◽

Regis Pires Magalhães ◽

Jose Antônio F. de Macêdo

Keyword(s):

Information Retrieval ◽

Penetration Rate ◽

Locality Sensitive Hashing ◽

Inverted Index ◽

Text Documents ◽

Data Set ◽

Textual Information ◽

Data Indexing ◽

Biometric Information ◽

Fingerprint Data

Fingerprints are the most widely used biometric information for identifying people. As the volume of fingerprint data grows, indexing techniques are essential for efficient search. In this work, we devise a solution that applies the traditional inverted index, widely used in textual information retrieval, to fingerprint search. It first converts fingerprints to text documents using techniques such as Minutia Cylinder-Code and Locality-Sensitive Hashing, and then indexes them in inverted files. In the experimental evaluation, our approach obtained an error rate of 0.42% at a penetration rate of 10% on the FVC2002 DB1a data set, surpassing some established methods.
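The core idea of searching with an inverted index can be sketched as follows. The sketch assumes each fingerprint has already been converted into a list of hash tokens (the Minutia Cylinder-Code and LSH steps are not reproduced here); the token strings, the identifiers, and the simple vote-counting ranking are hypothetical.

```python
from collections import Counter, defaultdict

def build_inverted_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in documents.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

def search(index, query_tokens, top_n=1):
    """Rank candidates by the number of distinct query tokens they share."""
    votes = Counter()
    for token in set(query_tokens):
        for doc_id in index.get(token, ()):
            votes[doc_id] += 1
    return [doc_id for doc_id, _ in votes.most_common(top_n)]

# Hypothetical tokenised fingerprints (tokens stand in for hash values).
fingerprints = {
    "fp_a": ["h17", "h42", "h99"],
    "fp_b": ["h03", "h42", "h58"],
    "fp_c": ["h17", "h23", "h99"],
}
index = build_inverted_index(fingerprints)
best = search(index, ["h17", "h99", "h42"])
print(best)
```

Only fingerprints that share at least one token with the query are touched, which is what keeps the fraction of the database examined (the penetration rate) low.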

Cluster-based information retrieval using pattern mining

Applied Intelligence ◽

10.1007/s10489-020-01922-x ◽

2020 ◽

Author(s):

Youcef Djenouri ◽

Asma Belhadi ◽

Djamel Djenouri ◽

Jerry Chun-Wei Lin

Keyword(s):

Information Retrieval ◽

Pattern Mining ◽

Spatial Clustering ◽

Clustering Algorithms ◽

User Query ◽

High Quality Information ◽

High Utility ◽

Score Pattern ◽

Mining Algorithms ◽

User Queries

This paper addresses the problem of responding to user queries by fetching the most relevant object from a clustered set of objects. It addresses the common drawbacks of cluster-based approaches and targets fast, high-quality information retrieval. For this purpose, a novel cluster-based information retrieval approach is proposed, named Cluster-based Retrieval using Pattern Mining (CRPM). This approach integrates various clustering and pattern mining algorithms. First, it generates clusters of objects that contain similar objects. Three clustering algorithms based on k-means, DBSCAN (Density-based spatial clustering of applications with noise), and Spectral are suggested to minimize the number of shared terms among the clusters of objects. Second, frequent and high-utility pattern mining algorithms are performed on each cluster to extract the pattern bases. Third, the clusters of objects are ranked for every query. In this context, two ranking strategies are proposed: i) Score Pattern Computing (SPC), which calculates a score representing the similarity between a user query and a cluster; and ii) Weighted Terms in Clusters (WTC), which calculates a weight for every term and uses the relevant terms to compute the score between a user query and each cluster. Irrelevant information derived from the pattern bases is also used to deal with unexpected user queries. To evaluate the proposed approach, extensive experiments were carried out on two use cases: the documents and tweets corpus. The results showed that the designed approach outperformed traditional and cluster-based information retrieval approaches in terms of the quality of the returned objects while being very competitive in terms of runtime.
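The third step, ranking clusters against a query using their pattern bases, can be illustrated with a small sketch. The scoring rule here is an assumption made for illustration (sum the weights of the mined patterns fully covered by the query terms); it is not the SPC or WTC formula from the paper, which the abstract does not give. The cluster names, patterns, and weights are invented.

```python
def score_cluster(query_terms, pattern_base):
    """Sum the weights of patterns whose terms all appear in the query.

    pattern_base maps frozenset-of-terms -> weight (e.g. support or utility).
    This scoring rule is illustrative, not the paper's SPC definition.
    """
    query = set(query_terms)
    return sum(weight for pattern, weight in pattern_base.items() if pattern <= query)

def rank_clusters(query_terms, pattern_bases):
    """Return cluster names ordered from best to worst match for the query."""
    return sorted(
        pattern_bases,
        key=lambda name: score_cluster(query_terms, pattern_bases[name]),
        reverse=True,
    )

# Hypothetical pattern bases mined from two clusters.
pattern_bases = {
    "sports": {
        frozenset({"football"}): 0.8,
        frozenset({"football", "league"}): 0.5,
    },
    "science": {
        frozenset({"cluster"}): 0.6,
        frozenset({"cluster", "retrieval"}): 0.7,
    },
}
query = ["cluster", "retrieval", "query"]
ranking = rank_clusters(query, pattern_bases)
print(ranking)
```

Once the clusters are ranked, only the objects in the top-ranked clusters need to be compared with the query, which is where the runtime benefit of a cluster-based approach comes from.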

Comparison of Spectroscopic Techniques Combined with Chemometrics for Cocaine Powder Analysis

Journal of Analytical Toxicology ◽

10.1093/jat/bkaa101 ◽

2020 ◽

Vol 44 (8) ◽

pp. 851-860

Author(s):

Joy Eliaerts ◽

Natalie Meert ◽

Pierre Dardenne ◽

Vincent Baeten ◽

Juan-Antonio Fernandez Pierna ◽

...

Keyword(s):

Gas Chromatography ◽

Near Infrared ◽

Evaluation Criteria ◽

Classification Model ◽

Support Vector ◽

Spectroscopic Techniques ◽

Data Set ◽

Promising Tool ◽

Powder Analysis ◽

MIR Spectra

Spectroscopic techniques combined with chemometrics are a promising tool for analysis of seized drug powders. In this study, the performance of three spectroscopic techniques [Mid-InfraRed (MIR), Raman and Near-InfraRed (NIR)] was compared. In total, 364 seized powders were analyzed and consisted of 276 cocaine powders (with concentrations ranging from 4 to 99 w%) and 88 powders without cocaine. A classification model (using Support Vector Machines [SVM] discriminant analysis) and a quantification model (using SVM regression) were constructed with each spectral dataset in order to discriminate cocaine powders from other powders and quantify cocaine in powders classified as cocaine positive. The performances of the models were compared with gas chromatography coupled with mass spectrometry (GC–MS) and gas chromatography with flame-ionization detection (GC–FID). Different evaluation criteria were used: number of false negatives (FNs), number of false positives (FPs), accuracy, root mean square error of cross-validation (RMSECV) and determination coefficients (R²). Ten colored powders were excluded from the classification data set due to fluorescence background observed in Raman spectra. For the classification, the best accuracy (99.7%) was obtained with MIR spectra. With Raman and NIR spectra, the accuracy was 99.5% and 98.9%, respectively. For the quantification, the best results were obtained with NIR spectra. The cocaine content was determined with an RMSECV of 3.79% and an R² of 0.97. The performance of MIR and Raman to predict cocaine concentrations was lower than NIR, with RMSECV of 6.76% and 6.79%, respectively, and both with an R² of 0.90. The three spectroscopic techniques can be applied for both classification and quantification of cocaine, but some differences in performance were detected. The best classification was obtained with MIR spectra. For quantification, however, the RMSECV of MIR and Raman was twice as high in comparison with NIR. Spectroscopic techniques combined with chemometrics can reduce the workload for confirmation analysis (e.g., chromatography based) and therefore save time and resources.
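The two quantification criteria, RMSECV and R², can be sketched in a few lines. This is an illustration under stated assumptions: a univariate least-squares line stands in for the SVM regression used in the study, leave-one-out is assumed as the cross-validation scheme, R² is computed as 1 − SS_res/SS_tot, and the data points are invented.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loo_predictions(xs, ys):
    """Predict each sample from a model trained on all the other samples."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds

def rmse(y_true, y_pred):
    """Root mean square error; on held-out predictions this is the RMSECV."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical calibration data (signal vs. reference concentration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
held_out = loo_predictions(xs, ys)
rmsecv = rmse(ys, held_out)
r2 = r_squared(ys, held_out)
print(round(rmsecv, 3), round(r2, 3))
```

Because every prediction comes from a model that never saw that sample, RMSECV is a less optimistic error estimate than the error on the training data.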

Online information retrieval behaviour and economics of attention

Online Information Review ◽

10.1108/oir-05-2015-0139 ◽

2015 ◽

Vol 39 (6) ◽

pp. 779-794 ◽

Cited By ~ 2

Author(s):

Mustafa Utku Özmen

Keyword(s):

Information Retrieval ◽

Information Provision ◽

General Information ◽

Survival Duration ◽

Online Information ◽

Data Set ◽

Content Type ◽

Factors Affecting ◽

Quasi Experimental ◽

Information Providers

Purpose – The purpose of this paper is to analyse users’ attitudes towards online information retrieval and processing. The aim is to identify the characteristics of information that better capture the attention of the users and to provide evidence for the information retrieval behaviour of the users by studying online photo archives as information units. Design/methodology/approach – The paper analyses a unique quasi-experimental data set of photo archive access counts collected by the author from an online newspaper. In addition to the access counts of each photo in 500 randomly chosen photo galleries, the characteristics of the photo galleries are also recorded. Survival (duration) analysis is used to analyse the factors affecting the share of the photo gallery viewed by a certain proportion of the initial number of viewers. Findings – The results of the survival analysis indicate that users are impatient in the case of longer photo galleries; they lose attention faster and stop viewing earlier when gallery length is uncertain; they are attracted by keywords and initial presentation; and they give more credit to specific rather than general information categories. Practical implications – The results of the study offer applicable implications for information providers, especially in the online domain. In order to attract more attention, entities can engage in targeted information provision by taking into account people’s attitudes towards information retrieval and processing as presented in this paper. Originality/value – This paper uses a unique data set in a quasi-experimental setting in order to identify the characteristics of online information that users are attracted to.

Sustainable Tourism Concept to Revitalize Ocarina Area in Batam City, Indonesia

Journal of Architectural Design and Urbanism ◽

10.14710/jadu.v3i1.7592 ◽

2020 ◽

Vol 3 (1) ◽

pp. 10-19

Author(s):

Helen Cia ◽

I Gusti Ngurah Anom Gunawan ◽

Hendro Murtiono

Keyword(s):

Public Space ◽

Qualitative Method ◽

Sustainable Tourism ◽

Phase 1 ◽

Field Analysis ◽

Tourist Destinations ◽

Analysis Process ◽

Commercial Area ◽

Strategic Location ◽

To Come

The purpose of this research is to explore the concept of revitalizing a coastal tourism area using a sustainable tourism approach. The Ocarina area is one of the tourist destinations in Batam city. It is strategically located in the center of Batam and is surrounded by areas with different functions, including housing (Regata, Monde Residence, Avante, Monde Signature, etc.), a school (Mondial school), and commercial areas (Pasir Putih shops, Mahkota Raya shops); it is also close to the international ferry. This tourist area was built and has been managed for a long time, but visitor numbers have declined, so it needs to be revitalized using the concept of sustainable tourism. Despite its strategic location, the Ocarina area has not developed successfully as a public space offering a variety of game facilities and culinary venues. This creates the need for both physical and economic revitalization measures so that the Ocarina area can attract visitors to come and enjoy its facilities. The research used a qualitative method based on direct observation in the field. The analysis identifies the problems that currently exist in the field, especially in Ocarina Phase 1, so that Ocarina Phase 2 can help turn Ocarina Phase 1 into a revitalized area and one of Batam city's sustainable tourism destinations that can improve the city's economy. Keywords: revitalization, sustainable tourism, visitor

Intelligent Information Retrieval Using Hybrid of Fuzzy Set and Trust

Oriental journal of computer science and technology ◽

10.13005/ojcst/10.02.09 ◽

2017 ◽

Vol 10 (2) ◽

pp. 311-325

Author(s):

Suruchi Chawla

Keyword(s):

Information Retrieval ◽

Fuzzy Set ◽

Query Expansion ◽

Information Need ◽

Data Set ◽

Search Results ◽

Intelligent Information Retrieval ◽

Main Challenge ◽

Fuzzy Query ◽

Intelligent Information

The main challenge for effective web Information Retrieval (IR) is to infer the information need from the user’s query and retrieve relevant documents. The precision of search results is low because user queries are vague and imprecise, so too few relevant documents are retrieved. Fuzzy set based query expansion deals with imprecise and vague queries by inferring the user’s information need. Trust based web page recommendation retrieves search results according to the user’s information need. In this paper, an algorithm is designed for Intelligent Information Retrieval using a hybrid of fuzzy sets and trust in web query session mining: fuzzy query expansion infers the user’s information need, and trust is used to recommend web pages according to that need. An experiment was performed on a data set collected in the domains Academics, Entertainment and Sports, and the search results confirm the improvement in precision.

Indexing and Abstracting as Tools for Information Retrieval in Digital Libraries

Library Science and Administration ◽

10.4018/978-1-5225-3914-8.ch027 ◽

2018 ◽

pp. 573-595

Author(s):

Olaronke O. Fagbola

Keyword(s):

Information Retrieval ◽

Literature Review ◽

Digital Libraries ◽

Information Resource ◽

Potential User ◽

Retrieval Process ◽

Road Signs ◽

Retrieval Systems ◽

Knowledge Organisation ◽

Information Retrieval Systems

Indexing and abstracting are like Siamese twins in the information retrieval process. They are the two approaches to distilling information content into an abbreviated but comprehensive representation of an information resource. They are knowledge organisation tools that usually provide detailed and accurate maps and road signs on the information superhighway. Digital libraries are characterised by an electronic stock of information that can be accessed via computers, and are extensions and augmentations of physical libraries in digital form. They are information retrieval systems (a device interposed between a potential user of information and the information itself) which provide opportunities to access and retrieve information that is often accessible for a variety of reasons. This chapter presents a literature review on indexing and abstracting, the information retrieval process, and digital libraries, pointing out the importance of indexing and abstracting in the information retrieval process and highlighting the roles they play as tools for information retrieval in digital libraries. The chapter posits that indexing and abstracting play a significant role as information retrieval tools in digital libraries.

A New LCS-Neutrosophic Similarity Measure for Text Information Retrieval

Neutrosophic Sets in Decision Analysis and Operations Research - Advances in Logistics, Operations, and Management Science ◽

10.4018/978-1-7998-2555-5.ch012 ◽

2020 ◽

pp. 258-280

Author(s):

Misturah Adunni Alaran ◽

AbdulAkeem Adesina Agboola ◽

Adio Taofiki Akinwale ◽

Olusegun Folorunso

Keyword(s):

Information Retrieval ◽

Similarity Measure ◽

Information Search ◽

Longest Common Subsequence ◽

Data Set ◽

String Similarity ◽

True Match ◽

Neutrosophic Logic ◽

Common Subsequence ◽

Text Information

The reality of human existence and of people's interactions with the things that surround them reveals that the world is imprecise, incomplete, vague, and sometimes even indeterminate. Neutrosophic logic is the only theory that attempts to unify all previous logics in the same global theoretical framework. Extracting data from such an environment is becoming a problem as the volume of data keeps growing day by day. This chapter proposes a new neutrosophic string similarity measure based on the longest common subsequence (LCS) to address uncertainty in string information search. The new method has been compared with four existing classical string similarity measures using a word list as the data set. The analyses show that the proposed neutrosophic similarity measure performs better than the existing measures in the information retrieval task, with the evaluation based on precision, recall, highest false match, lowest true match, and separation.
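The classical building block behind this measure, the longest common subsequence, can be sketched as below. This shows only the standard dynamic-programming LCS and one common normalisation, 2·LCS/(|a|+|b|), which is an assumption; the neutrosophic extension proposed in the chapter is not described in the abstract and is not reproduced here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (row-wise DP)."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def lcs_similarity(a, b):
    """Classical LCS similarity in [0, 1]; the normalisation is an assumed choice."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))

print(lcs_length("retrieval", "retrieve"), lcs_similarity("retrieval", "retrieve"))
```

A subsequence, unlike a substring, does not need to be contiguous, which makes the measure tolerant of inserted or dropped characters in a search term.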

Incorporating Text OLAP in Business Intelligence

Business Intelligence Applications and the Web - Advances in Business Information Systems and Analytics ◽

10.4018/978-1-61350-038-5.ch004 ◽

2011 ◽

pp. 77-101 ◽

Cited By ~ 1

Author(s):

Byung-Kwon Park ◽

Il-Yeol Song

Keyword(s):

Information Retrieval ◽

Text Mining ◽

Business Intelligence ◽

Multidimensional Analysis ◽

Web Pages ◽

Data Types ◽

Text Documents ◽

Text Data ◽

Platform Architecture ◽

Unstructured Text

As the amount of data inside and outside an enterprise grows rapidly, it is becoming important to analyze all of it seamlessly for total business intelligence. The data can be classified into two categories, structured and unstructured, and both need to be analyzed together. In particular, because most business data are unstructured text documents, including Web pages on the Internet, we need a Text OLAP solution that performs multidimensional analysis of text documents in the same way as for structured relational data. We first survey representative works that demonstrate how text mining and information retrieval, the major technologies for handling text data, can be applied to multidimensional analysis of text documents. We then survey representative works that demonstrate how unstructured text documents and structured relational data can be associated and consolidated to obtain total business intelligence. Finally, we present a future business intelligence platform architecture as well as related research topics. We expect that the proposed total heterogeneous business intelligence architecture, which integrates information retrieval, text mining, and information extraction technologies together with relational OLAP technologies, will make a better platform for total business intelligence.
