Efficiency Considerations for Vertical kNN Text Categorisation

The importance of text mining stems from the availability of huge text databases holding a wealth of valuable information that needs to be mined. Text mining is a broad area encompassing many finer branches, one of which is text categorisation or text classification. Text categorisation is the process of assigning class labels to documents based entirely on their textual contents: given a document d, we are asked to find its subject matter or class label, Ci. In this paper, we propose an optimised k-Nearest Neighbours classifier that uses discretisation, P-tree technology, and dimensionality reduction to achieve a high degree of accuracy, space utilisation and time efficiency. One of the fundamental contributions of this work is that, as new samples arrive, the proposed classifier can find the k nearest neighbours of the new sample in the training space without a single database scan.
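For orientation, the baseline that such a classifier improves on can be sketched as a brute-force kNN text categoriser. This is a minimal illustration and not the P-tree method of the paper: the bag-of-words representation, the cosine similarity measure, the function names and the toy training set are all assumptions made for the example. Note that it scans every training document for each new sample, which is exactly the cost the abstract says the proposed classifier avoids.

```python
from collections import Counter
import math


def vectorise(text):
    """Bag-of-words term counts for a lower-cased, whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(training, document, k=3):
    """Label `document` by majority vote among its k most similar training docs.

    `training` is a list of (text, label) pairs. Every pair is scanned,
    so the cost grows linearly with the size of the training set.
    """
    query = vectorise(document)
    scored = [(cosine(query, vectorise(text)), label) for text, label in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set, for illustration only.
training = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
    ("the match ended in a draw", "sport"),
    ("the bank raised interest rates", "finance"),
    ("shares fell as markets closed", "finance"),
    ("the central bank cut rates", "finance"),
]
label = knn_classify(training, "the bank cut interest rates again", k=3)
print(label)
```

With k=3, two of the three nearest neighbours of the query are finance documents, so the majority vote assigns the finance label.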


BiLabel-Specific Features for Multi-Label Classification

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3458283 ◽

2021 ◽

Vol 16 (1) ◽

pp. 1-23

Author(s):

Min-Ling Zhang ◽

Jun-Peng Fang ◽

Yi-Bo Wang

Keyword(s):

Predictive Models ◽

Comparative Studies ◽

State Of The Art ◽

Classification Model ◽

Generation Process ◽

Prototype Selection ◽

Class Label ◽

Benchmark Datasets ◽

Label Correlations ◽

Class Labels

In multi-label classification, the task is to induce predictive models which can assign a set of relevant labels to an unseen instance. The strategy of label-specific features has been widely employed in learning from multi-label examples, where the classification model for predicting the relevancy of each class label is induced based on its tailored features rather than the original features. Existing approaches work by generating a group of tailored features for each class label independently, so label correlations are not fully considered in the label-specific feature generation process. In this article, we extend the existing strategy by proposing a simple yet effective approach based on BiLabel-specific features. Specifically, a group of tailored features is generated for a pair of class labels with heuristic prototype selection and embedding. Thereafter, predictions of classifiers induced by BiLabel-specific features are ensembled to determine the relevancy of each class label for an unseen instance. To thoroughly evaluate the BiLabel-specific features strategy, extensive experiments are conducted over a total of 35 benchmark datasets. Comparative studies against state-of-the-art label-specific features techniques clearly validate the superiority of utilizing BiLabel-specific features to yield stronger generalization performance for multi-label classification.


Knowledge based dimensionality reduction for technical text mining

2014 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2014.7004466 ◽

2014 ◽

Cited By ~ 1

Author(s):

Walid Shalaby ◽

Wlodek Zadrozny ◽

Sean Gallagher

Keyword(s):

Text Mining ◽

Dimensionality Reduction ◽

Knowledge Based


Suitability of Naïve Bayesian Methods for Paragraph Level Text Classification in the Kannada Language using Dimensionality Reduction Technique

International Journal of Artificial Intelligence & Applications ◽

10.5121/ijaia.2013.4509 ◽

2013 ◽

Vol 4 (5) ◽

pp. 121-131 ◽

Cited By ~ 1

Author(s):

Jayashree R ◽

Srikantamurthy K ◽

Basavaraj S Anami

Keyword(s):

Dimensionality Reduction ◽

Text Classification ◽

Bayesian Methods ◽

Reduction Technique ◽

Naive Bayesian ◽

Naïve Bayesian ◽

Dimensionality Reduction Technique ◽

Kannada Language


Exploring Automated Text Classification to Improve Keyword Corpus Search Results for Bioinspired Design

Journal of Mechanical Design ◽

10.1115/1.4028167 ◽

2014 ◽

Vol 136 (11) ◽

Cited By ~ 8

Author(s):

Michael W. Glier ◽

Daniel A. McAdams ◽

Julie S. Linsey

Keyword(s):

Text Mining ◽

Text Classification ◽

Keyword Search ◽

Idea Generation ◽

Support Vector ◽

Biological Knowledge ◽

Svm Classifier ◽

Search Results ◽

Bioinspired Design ◽

Mining Algorithms

Bioinspired design is the adaptation of methods, strategies, or principles found in nature to solve engineering problems. One formalized approach to bioinspired solution seeking is the abstraction of the engineering problem into a functional need and then seeking solutions to this function using a keyword type search method on text based biological knowledge. These function keyword search approaches have shown potential for success, but as with many text based search methods, they produce a large number of results, many of little relevance to the problem in question. In this paper, we develop a method to train a computer to identify text passages more likely to suggest a solution to a human designer. The work presented examines the possibility of filtering biological keyword search results by using text mining algorithms to automatically identify which results are likely to be useful to a designer. The text mining algorithms are trained on a pair of surveys administered to human subjects to empirically identify a large number of sentences that are, or are not, helpful for idea generation. We develop and evaluate three text classification algorithms, namely, a Naïve Bayes (NB) classifier, a k nearest neighbors (kNN) classifier, and a support vector machine (SVM) classifier. Of these methods, the NB classifier generally had the best performance. Based on the analysis of 60 word stems, a NB classifier's precision is 0.87, recall is 0.52, and F score is 0.65. We find that word stem features that describe a physical action or process are correlated with helpful sentences. Similarly, we find biological jargon feature words are correlated with unhelpful sentences.
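The reported NB figures can be checked against each other. The sketch below assumes the "F score" is the balanced F1 measure, i.e. the harmonic mean of precision and recall; the abstract does not state which variant was used, and the function name and `beta` parameter are illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Precision and recall as reported for the NB classifier in the abstract.
f1 = f_score(0.87, 0.52)
print(round(f1, 2))
```

Under that assumption, a precision of 0.87 and a recall of 0.52 give 2 × 0.87 × 0.52 / (0.87 + 0.52) ≈ 0.65, which agrees with the stated F score.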


Comparative Analysis for Topic Classification in Juz Al-Baqarah

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v12.i1.pp406-411 ◽

2018 ◽

Vol 12 (1) ◽

pp. 406-411

Author(s):

Mohamad Izzuddin Rahman ◽

Noor Azah Samsudin ◽

Aida Mustapha ◽

Adeleke Abdullahi

Keyword(s):

Text Mining ◽

Text Categorization ◽

Arabic Language ◽

Support Vector ◽

Original Text ◽

Research Project ◽

Computational Environment ◽

Nearest Neighbours ◽

Association Discovery ◽

Relationship Of

In Islam, the Quran is the holy book that was revealed to the Prophet Muhammad. It functions as a complete code of life for Muslims. It contains more than 77,000 words, conveyed from Allah through the Prophet Muhammad to mankind over 23 years, starting in 610 CE. The Quran is divided into 114 chapters, and its original text is in Arabic. Muslims across the world need to find the meaning of the Quran in order to understand its content. Understanding the Quran is of interest to Muslims and also attracts the attention of millions of people of other faiths. Over the generations, much content related to the Quran has been disseminated by Muslim scholars in the form of tafsirs, translations and books of hadith. The current problem is that most Muslims in Malaysia do not understand sentences in the Quran because of the language barrier. The purpose of this research is to classify the topic of each verse of the Quran according to its specific theme. It involves text mining objectives that are based on linguistic information and domain. The use of a corpus helps to perform various data mining tasks, including information extraction, text categorization, the relationship of concepts, association discovery, and pattern evaluation and assessment. The classification experiment uses the Support Vector Machine to find themes in Juz’ Baqarah. The SVM performance is then compared against other classification algorithms such as Naive Bayes, J48 Decision Tree and K-Nearest Neighbours. This research project aims to create an enabling computational environment for text mining the Qur’an and to help users understand every verse in Juz’ Baqarah.


Uncorrelated Local Maximum Margin Criterion: An Efficient Dimensionality Reduction Method for Text Classification

Procedia Technology ◽

10.1016/j.protcy.2012.05.057 ◽

2012 ◽

Vol 4 ◽

pp. 370-374 ◽

Cited By ~ 4

Author(s):

Koushik Mallick ◽

Siddhartha Bhattacharyya

Keyword(s):

Dimensionality Reduction ◽

Text Classification ◽

Reduction Method ◽

Local Maximum ◽

Maximum Margin ◽

Maximum Margin Criterion ◽

Dimensionality Reduction Method


Eliminating High-Degree Biased Character Bigrams for Dimensionality Reduction in Chinese Text Categorization

Lecture Notes in Computer Science - Advances in Information Retrieval ◽

10.1007/978-3-540-24752-4_15 ◽

2004 ◽

pp. 197-208

Author(s):

Dejun Xue ◽

Maosong Sun

Keyword(s):

Dimensionality Reduction ◽

Chinese Text ◽

Text Categorization ◽

High Degree


Automatic Genre-Specific Text Classification

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch020 ◽

2011 ◽

pp. 120-127

Author(s):

Xiaoyan Yu ◽

Manas Tungare ◽

Weiguo Fan ◽

Manuel Pérez-Quiñones ◽

Edward A. Fox ◽

...

Keyword(s):

Text Mining ◽

Text Classification ◽

Information Needs ◽

Question Answering ◽

Class Schedule ◽

Semistructured Documents ◽

Linkage Information ◽

Filter Noise ◽

Topic Tracking ◽

Course Syllabus

Starting with a vast number of unstructured or semistructured documents, text mining tools analyze and sift through them to present to users more valuable information specific to their information needs. The technologies in text mining include information extraction, topic tracking, summarization, categorization/classification, clustering, concept linkage, information visualization, and question answering [Fan, Wallace, Rich, & Zhang, 2006]. In this chapter, we share our hands-on experience with one specific text mining task: text classification [Sebastiani, 2002]. Information occurs in various formats, and some formats have a specific structure or specific information that they contain: we refer to these as ‘genres’. Examples of information genres include news items, reports, academic articles, etc. In this chapter, we deal with a specific genre type, the course syllabus. A course syllabus is such a genre, with the following commonly-occurring fields: title, description, instructor’s name, textbook details, class schedule, etc. In essence, a course syllabus is the skeleton of a course. Free and fast access to a collection of syllabi in a structured format could have a significant impact on education, especially for educators and life-long learners. Educators can borrow ideas from others’ syllabi to organize their own classes. It also will be easy for life-long learners to find popular textbooks and even important chapters when they would like to learn a course on their own. Unfortunately, searching for a syllabus on the Web using Information Retrieval [Baeza-Yates & Ribeiro-Neto, 1999] techniques employed by a generic search engine often yields too many non-relevant search result pages (i.e., noise): some of these only provide guidelines on syllabus creation; some only provide a schedule for a course event; some have outgoing links to syllabi (e.g. a course list page of an academic department).
Therefore, a well-designed classifier for the search results is needed, that would help not only to filter noise out, but also to identify more relevant and useful syllabi.


Detection of Economy-Related Turkish Tweets Based on Machine Learning Approaches

10.4018/978-1-7998-8413-2.ch008 ◽

2022 ◽

pp. 171-195

Author(s):

Jale Bektaş

Keyword(s):

Machine Learning ◽

Text Mining ◽

Text Classification ◽

Integration Method ◽

Classification Problem ◽

Feature Representation ◽

Learning Approaches ◽

Machine Learning Methods ◽

Linguistic Approach ◽

Turkish Language

Conducting NLP for Turkish is considerably harder than for other Latin-based languages such as English. In this study, text mining techniques are used to build a pre-processing framework in which TF-IDF values are calculated in accordance with a linguistic approach on 7,731 tweets shared by 13 famous economists in Turkey, retrieved from Twitter. Then, the classification results of four common machine learning methods (SVM, Naive Bayes, LR, and the integration of LR with SVM) are compared. The features represented by TF-IDF are tested with different N-grams. The findings show that the success of a text classification problem depends on the feature representation method, and that SVM outperforms the other ML methods with unigram feature representation. The best results are obtained via the integration of SVM with LR, with an accuracy of 82.9%. These results show that these methodologies are satisfactory for the Turkish language.
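The TF-IDF and N-gram feature representation mentioned above can be illustrated with a small sketch. It is an assumption-laden example, not the study's pipeline: the abstract does not give the TF-IDF variant, so the code uses relative term frequency multiplied by log(N / document frequency), word-level n-grams, whitespace tokenisation and no linguistic pre-processing. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Contiguous word n-grams, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(documents, n=1):
    """One sparse TF-IDF vector (a dict) per document, over word n-grams.

    TF is the relative frequency of the n-gram within the document;
    IDF is log(number of documents / number of documents containing it).
    """
    tokenised = [word_ngrams(doc.lower().split(), n) for doc in documents]
    doc_freq = Counter()
    for grams in tokenised:
        doc_freq.update(set(grams))
    total = len(documents)
    vectors = []
    for grams in tokenised:
        counts = Counter(grams)
        length = len(grams)
        vectors.append({
            gram: (count / length) * math.log(total / doc_freq[gram])
            for gram, count in counts.items()
        })
    return vectors


# Hypothetical sample documents, for illustration only.
docs = ["interest rates rise", "interest rates fall", "football season starts"]
unigram = tfidf(docs, n=1)
bigram = tfidf(docs, n=2)
print(sorted(bigram[0]))
```

In the unigram vectors, a term that occurs in only one document ("rise") receives a higher weight than one shared by two documents ("interest"), which is the behaviour a downstream classifier such as an SVM relies on when the representation is switched between unigrams and larger n-grams.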


A Review on Dimensionality Reduction in Fuzzy- and SVM-Based Text Classification Strategies

Advances in Intelligent Systems and Computing - Congress on Intelligent Systems ◽

10.1007/978-981-33-6984-9_49 ◽

2021 ◽

pp. 613-631

Author(s):

Shalini Puri

Keyword(s):

Dimensionality Reduction ◽

Text Classification
