A Review on Knowledge Discovery using Text Classification Techniques in Text Mining

Aim/Purpose: The aim of this paper is to analyze the text classification techniques employed in practice, together with their strengths and weaknesses, in order to improve awareness of the knowledge extraction possibilities in the field of data mining.
Background: Artificial Intelligence (AI) is reshaping text classification techniques to better acquire knowledge. However, in spite of the growth and spread of AI in all fields of research, its role with respect to text mining is not yet well understood.
Methodology: For this study, articles written between 2010 and 2017 on “text classification techniques in AI”, selected from leading computer science journals, were analyzed. Each article was read in full. The research problems related to text classification techniques in the field of AI were identified, and the techniques were grouped according to the algorithms involved. These algorithms were then divided according to the learning procedure used. Finally, the findings were plotted as a tree structure to visualize the relationship between learning procedures and algorithms.
Contribution: This paper identifies the strengths, limitations, and current research trends of text classification in an advanced field like AI. This knowledge is crucial for data scientists, who could use the findings of this study to devise customized data models. It also helps industry understand the operational efficiency of text mining techniques, contributes to reducing project costs, and supports effective decision making.
Findings: Studying and understanding the nature of the data before mining it was found to be especially important. Automation of the text classification process is required, given the increasing amount of data and the need for accuracy. Another interesting research opportunity lies in building intricate text data models with deep learning systems, which can execute complex Natural Language Processing (NLP) tasks with semantic requirements.
Recommendations for Practitioners: Frame analysis, deception detection, narrative science (where data expresses a story), healthcare applications for diagnosing illnesses, and conversation analysis are among the areas recommended to practitioners.
Recommendation for Researchers: Developing algorithms that are simpler to code and implement, better approaches for knowledge distillation, multilingual text refining, domain knowledge integration, subjectivity detection, and contrastive viewpoint summarization are some of the areas that researchers could explore.
Impact on Society: Text classification forms the base of data analytics and acts as the engine behind knowledge discovery. It supports state-of-the-art decision making, for example, predicting an event before it actually occurs or classifying a transaction as ‘Fraudulent’. The results of this study could be used to develop applications dedicated to assisting decision-making processes. These informed decisions will help to optimize resources and maximize benefits to society.
Future Research: In the future, better methods for parameter optimization will be identified by selecting parameters that better reflect effective knowledge discovery. The role of streaming data processing is still rarely explored in text classification.

Text Mining for Literature Review and Knowledge Discovery in Cancer Risk Assessment and Research

PLoS ONE ◽

10.1371/journal.pone.0033427 ◽

2012 ◽

Vol 7 (4) ◽

pp. e33427 ◽

Cited By ~ 38

Author(s):

Anna Korhonen ◽

Diarmuid Ó Séaghdha ◽

Ilona Silins ◽

Lin Sun ◽

Johan Högberg ◽

...

Keyword(s):

Risk Assessment ◽

Text Mining ◽

Cancer Risk ◽

Literature Review ◽

Knowledge Discovery ◽

Cancer Risk Assessment

A Comprehensive Study for the Hindi Language to Implement Supervised Text Classification Techniques

10.1109/ispcc53510.2021.9609401 ◽

2021 ◽

Author(s):

Vijay Kumar Soni ◽

Smita Selot

Keyword(s):

Text Classification ◽

Classification Techniques ◽

Hindi Language ◽

Comprehensive Study

Exploring Automated Text Classification to Improve Keyword Corpus Search Results for Bioinspired Design

Journal of Mechanical Design ◽

10.1115/1.4028167 ◽

2014 ◽

Vol 136 (11) ◽

Cited By ~ 8

Author(s):

Michael W. Glier ◽

Daniel A. McAdams ◽

Julie S. Linsey

Keyword(s):

Text Mining ◽

Text Classification ◽

Keyword Search ◽

Idea Generation ◽

Support Vector ◽

Biological Knowledge ◽

SVM Classifier ◽

Search Results ◽

Bioinspired Design ◽

Mining Algorithms

Bioinspired design is the adaptation of methods, strategies, or principles found in nature to solve engineering problems. One formalized approach to bioinspired solution seeking is to abstract the engineering problem into a functional need and then seek solutions to this function using a keyword-type search over text-based biological knowledge. These function keyword search approaches have shown potential for success but, as with many text-based search methods, they produce a large number of results, many of little relevance to the problem in question. In this paper, we develop a method to train a computer to identify text passages that are more likely to suggest a solution to a human designer. The work presented examines the possibility of filtering biological keyword search results by using text mining algorithms to automatically identify which results are likely to be useful to a designer. The text mining algorithms are trained on a pair of surveys administered to human subjects to empirically identify a large number of sentences that are, or are not, helpful for idea generation. We develop and evaluate three text classification algorithms: a Naïve Bayes (NB) classifier, a k-nearest neighbors (kNN) classifier, and a support vector machine (SVM) classifier. Of these methods, the NB classifier generally had the best performance. Based on the analysis of 60 word stems, the NB classifier's precision is 0.87, its recall is 0.52, and its F score is 0.65. We find that word stem features that describe a physical action or process are correlated with helpful sentences. Similarly, we find that biological jargon feature words are correlated with unhelpful sentences.
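The reported F score is consistent with the balanced F1 measure, the harmonic mean of precision and recall. The short sketch below checks this arithmetic. It assumes that F1 (beta = 1) is the variant the authors used, which the abstract does not state; the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean (F1)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract for the NB classifier.
reported = f_score(0.87, 0.52)
print(round(reported, 2))  # 0.65
```

The gap between precision (0.87) and recall (0.52) means the classifier is rarely wrong when it flags a sentence as helpful, but it misses roughly half of the helpful sentences.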

Allerdictor: fast allergen prediction using text classification techniques

Bioinformatics ◽

10.1093/bioinformatics/btu004 ◽

2014 ◽

Vol 30 (8) ◽

pp. 1120-1128 ◽

Cited By ~ 33

Author(s):

Ha X. Dang ◽

Christopher B. Lawrence

Keyword(s):

Text Classification ◽

Classification Techniques

Automatic Genre-Specific Text Classification

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch020 ◽

2011 ◽

pp. 120-127

Author(s):

Xiaoyan Yu ◽

Manas Tungare ◽

Weiguo Fan ◽

Manuel Pérez-Quiñones ◽

Edward A. Fox ◽

...

Keyword(s):

Text Mining ◽

Text Classification ◽

Information Needs ◽

Question Answering ◽

Class Schedule ◽

Semistructured Documents ◽

Linkage Information ◽

Filter Noise ◽

Topic Tracking ◽

Course Syllabus

Starting with a vast number of unstructured or semistructured documents, text mining tools analyze and sift through them to present users with more valuable information specific to their information needs. The technologies in text mining include information extraction, topic tracking, summarization, categorization/classification, clustering, concept linkage, information visualization, and question answering [Fan, Wallace, Rich, & Zhang, 2006]. In this chapter, we share our hands-on experience with one specific text mining task: text classification [Sebastiani, 2002]. Information occurs in various formats, and some formats have a specific structure or contain specific information; we refer to these as ‘genres’. Examples of information genres include news items, reports, and academic articles. In this chapter, we deal with one specific genre, the course syllabus, which has the following commonly occurring fields: title, description, instructor’s name, textbook details, class schedule, etc. In essence, a course syllabus is the skeleton of a course. Free and fast access to a collection of syllabi in a structured format could have a significant impact on education, especially for educators and life-long learners. Educators can borrow ideas from others’ syllabi to organize their own classes. Life-long learners could also easily find popular textbooks, and even important chapters, when they want to study a course on their own. Unfortunately, searching for a syllabus on the Web using the Information Retrieval [Baeza-Yates & Ribeiro-Neto, 1999] techniques employed by a generic search engine often yields too many non-relevant search result pages (i.e., noise): some of these only provide guidelines on syllabus creation; some only provide a schedule for a course event; some have outgoing links to syllabi (e.g., a course list page of an academic department). Therefore, a well-designed classifier for the search results is needed, one that would help not only to filter out noise but also to identify more relevant and useful syllabi.

Detection of Economy-Related Turkish Tweets Based on Machine Learning Approaches

10.4018/978-1-7998-8413-2.ch008 ◽

2022 ◽

pp. 171-195

Author(s):

Jale Bektaş

Keyword(s):

Machine Learning ◽

Text Mining ◽

Text Classification ◽

Integration Method ◽

Classification Problem ◽

Feature Representation ◽

Learning Approaches ◽

Machine Learning Methods ◽

Linguistic Approach ◽

Turkish Language

Conducting NLP for Turkish is considerably harder than for other Latin-script languages such as English. In this study, text mining techniques are used to build a pre-processing framework in which TF-IDF values are calculated, in accordance with a linguistic approach, on 7,731 tweets shared by 13 famous economists in Turkey and retrieved from Twitter. The classification results of four common machine learning methods (SVM, Naive Bayes, LR, and an integration of LR with SVM) are then compared. The TF-IDF features are tested with different N-gram representations. The findings show that success on a text classification problem depends on the feature representation method, and that SVM outperforms the other ML methods with unigram feature representation. The best results are obtained by integrating SVM with LR, with an accuracy (Acc) of 82.9%. These results show that these methodologies are satisfactory for the Turkish language.
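To make the feature representation step concrete, the following is a minimal sketch of TF-IDF weighting over word N-grams, using only the Python standard library. It is not the study's pipeline: the abstract does not give the tokenization or the exact TF-IDF formula, so lowercase whitespace tokenization, length-normalized term frequency, and idf = log(N / df) are assumptions here, and the toy English sentences are invented for illustration rather than taken from the tweet dataset.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Return one {n-gram: weight} dict per document.

    Assumptions (not from the study): lowercase whitespace tokenization,
    tf = count / number of n-grams in the document, idf = log(N / df).
    """
    term_lists = [ngrams(doc.lower().split(), n) for doc in docs]
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))  # document frequency: count each doc once
    n_docs = len(docs)
    weights = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # guard against empty documents
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "inflation rises as interest rates rise",
    "central bank holds interest rates",
    "the match ended in a draw",
]
unigram = tfidf(toy_docs, n=1)
bigram = tfidf(toy_docs, n=2)
print(sorted(bigram[1]))
```

Switching `n` changes only the feature vocabulary, which is the sense in which the same classifier can be compared across unigram and higher-order N-gram representations. The resulting weight dictionaries would then be passed to a classifier such as an SVM.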

Automatic Categorization of PubMed microRNA Target Abstracts Based on Text Classification Techniques

Journal of Applied Bioinformatics & Computational Biology ◽

10.4172/2329-9533.1000138 ◽

2017 ◽

Vol 06 (03) ◽

Author(s):

Malik Yousef ◽

Dawit Nigatu ◽

Loai Abdalla

Keyword(s):

Text Classification ◽

MicroRNA Target ◽

Classification Techniques ◽

Automatic Categorization

An ontology based text mining system for knowledge discovery from the diagnosis data in the automotive domain

Computers in Industry ◽

10.1016/j.compind.2013.03.001 ◽

2013 ◽

Vol 64 (5) ◽

pp. 565-580 ◽

Cited By ~ 31

Author(s):

Dnyanesh G. Rajpathak

Keyword(s):

Text Mining ◽

Knowledge Discovery ◽

Mining System ◽

Text Mining System

Text Mining in the Context of Business Intelligence

Encyclopedia of Information Science and Technology, First Edition ◽

10.4018/978-1-59140-553-5.ch496 ◽

2005 ◽

pp. 2793-2798 ◽

Cited By ~ 1

Author(s):

Hércules Antonio do Prado ◽

José Palazzo Moreira de Oliveira ◽

Edilson Ferneda ◽

Leandro Krug Wives ◽

Edilberto Magalhães Silva ◽

...

Keyword(s):

Data Mining ◽

Text Mining ◽

Knowledge Discovery ◽

Business Intelligence ◽

External Environment ◽

Organizational Processes ◽

External Monitoring ◽

New Applications ◽

Textual Form ◽

Organizational Problems

Information about the external environment and organizational processes is among the most valuable inputs for business intelligence (BI). Nowadays, companies have plenty of information in structured or textual form, either from external monitoring or from corporate systems. In recent years, the structured part of this information stock has been extensively explored by means of data-mining (DM) techniques (Wang, 2003), generating models that enable analysts to gain insights into solutions for organizational problems. On the text-mining (TM) side, the pace of new application development has been slower. In an informal poll carried out in 2002 (Kdnuggets), just 4% of knowledge-discovery-from-databases (KDD) practitioners were applying TM techniques. This fact is as intriguing as it is surprising, considering that 80% of all information available in an organization comes in textual form (Tan, 1999).
