The Automated VSMs to Categorize Arabic Text Data Sets

Mamoun Suleiman Al Rababaa; Essam Said Hanandeh

doi:10.24297/ijct.v13i1.2925

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v13i1.2925 ◽

2014 ◽

Vol 13 (1) ◽

pp. 4074-4081

Author(s):

Mamoun Suleiman Al Rababaa ◽

Essam Said Hanandeh

Keyword(s):

Data Mining ◽

Information Retrieval ◽

Vector Space ◽

Text Categorization ◽

Experimental Results ◽

Data Sets ◽

Arabic Text ◽

Text Data ◽

Evaluation Measures ◽

Vector Space Models

Text categorization is one of the most important tasks in information retrieval and data mining. This paper investigates different variations of vector space models (VSMs) using the KNN algorithm. We used 242 Arabic abstract documents that were used by Hmeidi & Kanaan (1997). Our comparison is based on the most popular text evaluation measures: recall, precision, and F1. The experimental results on the Saudi data sets reveal that Cosine outperformed the Dice and Jaccard coefficients.
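To make the comparison in this abstract concrete, the following is a minimal sketch of the three coefficients over term-weight vectors, plus a plain majority-vote KNN that takes the coefficient as a parameter. The function names, the toy vectors, and the vote rule are assumptions for illustration; the abstract does not describe the paper's actual weighting scheme or implementation.

```python
import math

def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    # dot(x, y) / (|x| * |y|)
    denom = math.sqrt(_dot(x, x)) * math.sqrt(_dot(y, y))
    return _dot(x, y) / denom if denom else 0.0

def dice(x, y):
    # 2 * dot(x, y) / (|x|^2 + |y|^2)
    denom = _dot(x, x) + _dot(y, y)
    return 2 * _dot(x, y) / denom if denom else 0.0

def jaccard(x, y):
    # dot(x, y) / (|x|^2 + |y|^2 - dot(x, y))
    denom = _dot(x, x) + _dot(y, y) - _dot(x, y)
    return _dot(x, y) / denom if denom else 0.0

def knn_label(query, training, k, sim):
    """Majority vote among the k training vectors most similar to query.

    training is a list of (vector, label) pairs (hypothetical layout).
    """
    ranked = sorted(training, key=lambda item: sim(query, item[0]), reverse=True)
    votes = {}
    for _, label in ranked[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

if __name__ == "__main__":
    train = [([1, 0, 1], "a"), ([1, 0, 0], "a"), ([0, 1, 0], "b")]
    for sim in (cosine, dice, jaccard):
        print(sim.__name__, knn_label([1, 0, 1], train, k=2, sim=sim))
```

Swapping the `sim` argument is the only change needed to compare the coefficients under the same classifier, which mirrors the kind of comparison the abstract reports.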

Issues and Considerations for Effective Text Data Retrieval

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a4236.119119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 1442-1445

Keyword(s):

Data Mining ◽

Data Retrieval ◽

Text Retrieval ◽

Experimental Results ◽

Free Text ◽

Data Sets ◽

Retrieval Process ◽

Text Data ◽

Text Information ◽

Major Domain

Text data retrieval is one of the major domains for extracting knowledge from stored data sets. Within text information, meaningful numerical codes are extracted from unstructured information so that free text, with its unstructured nature, can be handled by data mining in a different stream. A number of procedures have been constructed to perform these operations most effectively. This paper focuses on one of the text retrieval processes; experimental results verify that the proposed methods work well with most documents.

The Performance of Boolean Retrieval and Vector Space Model in Textual Information Retrieval

CommIT (Communication and Information Technology) Journal ◽

10.21512/commit.v11i1.2108 ◽

2017 ◽

Vol 11 (1) ◽

pp. 33 ◽

Cited By ~ 1

Author(s):

Budi Yulianto ◽

Widodo Budiharto ◽

Iman Herwidiana Kartowisastro

Keyword(s):

Information Retrieval ◽

Vector Space ◽

Vector Space Model ◽

Experimental Results ◽

Inverted Index ◽

Exact Results ◽

Textual Information ◽

Space Model ◽

Corpus Data

Boolean Retrieval (BR) and Vector Space Model (VSM) are very popular methods in information retrieval for creating an inverted index and querying terms. The BR method returns exact matches for a textual query without ranking the results. The VSM method searches and ranks the results. This study empirically compares the two methods. The research utilizes a sample of the corpus data obtained from Reuters. The experimental results show that the times required to produce an inverted index by the two methods are nearly the same. However, a difference exists in querying the index. The results also show that the number of generated indexes, the sizes of the generated files, and the duration of reading and searching an index are proportional to the number of files in the corpus and the file size.
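The contrast between the two methods can be sketched as follows, assuming a toy in-memory corpus: both queries read the same inverted index, but the Boolean query returns an unranked set of exact matches while the VSM-style query returns a ranked list. The scoring here is a simplified sum of tf × idf weights rather than a full length-normalized cosine, and all names and documents are hypothetical, not taken from the study.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def boolean_and(index, terms):
    """Unranked set of documents containing every query term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

def vsm_rank(index, terms, n_docs):
    """Documents ordered by summed tf * idf over the query terms."""
    scores = defaultdict(float)
    for t in terms:
        postings = index.get(t, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

if __name__ == "__main__":
    docs = {1: "oil price rises", 2: "oil oil oil exports", 3: "gold price"}
    index = build_index(docs)
    print(boolean_and(index, ["oil", "price"]))
    print(vsm_rank(index, ["oil", "price"], len(docs)))
```

Index construction is identical for both queries in this sketch, which is consistent with the abstract's observation that indexing times are nearly the same while querying differs.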

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

Local and Global Latent Semantic Analysis for Text Categorization

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2014070101 ◽

2014 ◽

Vol 4 (3) ◽

pp. 1-13

Author(s):

Khadoudja Ghanem

Keyword(s):

Information Retrieval ◽

High Precision ◽

Latent Semantic Analysis ◽

Classification Accuracy ◽

Text Categorization ◽

Semantic Analysis ◽

Experimental Results ◽

Semantic Approach ◽

Document Categorization ◽

Second Use

In this paper the authors propose a semantic approach to document categorization. The idea is to create for each category a semantic index (representative term vector) by performing a local Latent Semantic Analysis (LSA) followed by a clustering process. A second use of LSA (Global LSA) is applied to a term-class matrix in order to retrieve the class most similar to the query (the document to classify), in the same way that LSA is used in Information Retrieval to retrieve the documents most similar to a query. The proposed system is evaluated on a popular dataset, the 20 Newsgroups corpus. The results show the effectiveness of the method compared with the classic KNN and SVM classifiers as well as with methods presented in the literature. Experimental results show that the new method has high precision and recall rates and that classification accuracy is significantly improved.

Information retrieval from heterogeneous data sets using moderated IDF-cosine similarity in vector space model

2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS) ◽

10.1109/icecds.2017.8390174 ◽

2017 ◽

Cited By ~ 1

Author(s):

Bhagyashree Pathak ◽

Niranjan Lal

Keyword(s):

Information Retrieval ◽

Vector Space ◽

Vector Space Model ◽

Heterogeneous Data ◽

Cosine Similarity ◽

Data Sets ◽

Space Model

DEVELOPING A PARALLEL CLASSIFIER FOR MINING IN BIG DATA SETS

IIUM Engineering Journal ◽

10.31436/iiumej.v22i2.1541 ◽

2021 ◽

Vol 22 (2) ◽

pp. 119-134

Author(s):

Ahad Shamseen ◽

Morteza Mohammadi Zanjireh ◽

Mahdi Bahaghighat ◽

Qin Xin

Keyword(s):

Data Mining ◽

Big Data ◽

Decision Tree ◽

Main Memory ◽

Experimental Results ◽

Primary Data ◽

Data Sets ◽

Decision Tree Classifier ◽

Vast Amount ◽

Tree Classifier

Data mining is the extraction of information and its roles from a vast amount of data. This topic is one of the most important topics these days. Nowadays, massive amounts of data are generated and stored each day. This data contains useful information in different fields that attracts programmers' and engineers' attention. One of the primary data mining classification algorithms is the decision tree. Decision tree techniques have several advantages but also present drawbacks. One of the main drawbacks is that the data must reside in main memory. SPRINT is one of the decision tree builder classifiers that has proposed a fix for this problem. In this paper, our research developed a new parallel decision tree classifier by working on SPRINT results. Our experimental results show considerable improvements in terms of runtime and memory requirements compared to the SPRINT classifier. Our proposed classifier algorithm could be implemented in serial and parallel environments and can deal with big data.

A Technique to Choose the Proper Vector Space Models of Semantics in Case of Automatic Text Categorization

International Journal of Modern Education and Computer Science ◽

10.5815/ijmecs.2012.04.05 ◽

2012 ◽

Vol 4 (4) ◽

pp. 36-42

Author(s):

Sukanya Ray ◽

Nidhi Chandra

Keyword(s):

Vector Space ◽

Text Categorization ◽

Vector Space Models ◽

Automatic Text

Image Substance Extraction using Data Mining Clustering Method

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b6605.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 2735-2739

Keyword(s):

Data Mining ◽

Accurate Result ◽

Image Data ◽

Data Recovery ◽

Data Sets ◽

Text Data ◽

Data Set ◽

Normal Text ◽

Additional Burden ◽

Using Data

Data retrieval is one of the key challenges today, because the volume of data sets increases every year due to various factors. Information extraction from image data sets is far more multifaceted than normal text data recovery. An image data set consists of different attributes, and those attribute sets are normalized before extraction from the stored database. This places an additional burden on the user who wishes to extract any information from these data sets. These key challenges invite more researchers into the field of image data mining. Today many data sets are in the form of images, which give more accurate results and more outputs. For extracting any image data, image attributes are properly trained for better results. The proposed work is based on grouping the data sets using image attributes. The entire process is divided into two major separate operations. Experiments were done against various data sets, and the outputs verify that the proposed work gives more accurate results than the existing techniques.

A new similarity measure for vector space models in text classification and information retrieval

Journal of Information Science ◽

10.1177/0165551520968055 ◽

2020 ◽

pp. 016555152096805

Author(s):

Mete Eminagaoglu

Keyword(s):

Information Retrieval ◽

Vector Space ◽

Similarity Measure ◽

Text Classification ◽

Pearson Correlation ◽

Clustering Algorithms ◽

Similarity Measures ◽

Manhattan Distance ◽

Vector Space Models ◽

Classification Information

There are various models, methodologies and algorithms that can be used today for document classification, information retrieval and other text mining applications and systems. Among them are vector space–based models, where distance metrics or similarity measures lie at the core. A vector space–based model is one of the fast and simple alternatives for the processing of textual data; however, its accuracy, precision and reliability still need significant improvements. In this study, a new similarity measure is proposed, which can be effectively used for vector space models and related algorithms such as k-nearest neighbours (k-NN) and Rocchio, as well as some clustering algorithms such as K-means. The proposed similarity measure is tested with some universal benchmark data sets in Turkish and English, and the results are compared with other standard metrics such as Euclidean distance, Manhattan distance, Chebyshev distance, Canberra distance, Bray–Curtis dissimilarity, Pearson correlation coefficient and Cosine similarity. Some successful and promising results have been obtained, which show that the proposed similarity measure could be used as an alternative within all suitable algorithms and models for information retrieval, document clustering and text classification.
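For reference, the baseline metrics named in this abstract can be sketched in a few lines using their textbook definitions. The abstract does not define the proposed measure, so it is not reproduced here; the function names and the handling of zero denominators (skipping 0/0 terms in Canberra, returning 0.0 for degenerate inputs) are assumptions made for this illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    # Terms where both components are zero are skipped (assumed convention).
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y) if a or b)

def bray_curtis(x, y):
    denom = sum(abs(a + b) for a, b in zip(x, y))
    return manhattan(x, y) / denom if denom else 0.0

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

if __name__ == "__main__":
    x, y = [1, 2, 3], [2, 4, 6]
    for f in (euclidean, manhattan, chebyshev, canberra, bray_curtis, pearson):
        print(f.__name__, f(x, y))
```

Note that the first five are distances (smaller means more similar) while Pearson correlation is a similarity (larger means more similar), so a k-NN or K-means implementation has to sort in the matching direction.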

Arabic Text Data Mining: a Root-Based Hierarchical Indexing Model

International Journal of Modelling and Simulation ◽

10.1080/02286203.2003.11442267 ◽

2003 ◽

Vol 23 (3) ◽

pp. 158-166 ◽

Cited By ~ 7

Author(s):

T.M. Eldos

Keyword(s):

Data Mining ◽

Arabic Text ◽

Text Data ◽

Text Data Mining ◽

Indexing Model
