Survey on Text Classification

Author(s):  
Leena Bhuskute ◽  
Satish kumar Varma

Nowadays the World Wide Web is growing rapidly, and the automatic classification of documents has become a core process for organizing information and for knowledge discovery. Classifying e-documents, online news, blogs, e-mails, and digital library content requires text mining, machine learning, and natural language processing techniques to extract meaningful information. Proper classification and information detection from these resources is therefore a fundamental field of study. Text classification is a significant research topic in the area of text mining, in which documents are classified using supervised information. In this paper, various text representation schemes and learning classifiers such as the Naïve Bayes and Decision Tree algorithms are described, with illustrations for predefined classes. The presented approaches are compared and contrasted based on quality assurance parameters.
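As a rough illustration of the kind of pipeline this abstract describes, the sketch below trains the two named classifier families on a toy TF-IDF representation using scikit-learn; the documents, labels, and parameters are placeholders, not the paper's setup.

```python
# A minimal sketch (not the paper's exact setup) comparing the two classifier
# families named in the abstract on a small bag-of-words/TF-IDF representation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

docs = ["stocks rally on strong earnings", "team wins the championship",
        "new phone released today", "league schedules playoff games"]
labels = ["business", "sports", "tech", "sports"]   # toy predefined classes

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0)

vectorizer = TfidfVectorizer()                      # text representation scheme
X_tr = vectorizer.fit_transform(X_train)
X_te = vectorizer.transform(X_test)

for clf in (MultinomialNB(), DecisionTreeClassifier(random_state=0)):
    clf.fit(X_tr, y_train)
    print(type(clf).__name__)
    print(classification_report(y_test, clf.predict(X_te), zero_division=0))
```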

2020 ◽  
pp. 1686-1704
Author(s):  
Emna Hkiri ◽  
Souheyl Mallat ◽  
Mounir Zrigui

The event extraction task consists of detecting and classifying events within open-domain text. It is still new for Arabic, whereas it has reached maturity for languages such as English and French. Event extraction has also been shown to help other natural language processing tasks, such as information retrieval, question answering, text mining, and machine translation, achieve higher performance. In this article, we present an ongoing effort to build a system for event extraction from Arabic texts using the GATE platform and other tools.
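The abstract does not detail the extraction rules, so the sketch below is only a minimal trigger-lexicon illustration of event detection in Python; it is not a reproduction of the authors' GATE-based pipeline, and the trigger words and event types are invented for illustration.

```python
# A minimal rule-based sketch of event trigger detection using a small,
# hypothetical trigger lexicon; the authors' GATE pipeline is not reproduced here.
import re

TRIGGERS = {
    "attack": "Conflict", "explosion": "Conflict",
    "election": "Politics", "meeting": "Contact",
}

def extract_events(text):
    """Return (trigger, event_type, span) tuples found in the text."""
    events = []
    for trigger, etype in TRIGGERS.items():
        for m in re.finditer(rf"\b{trigger}\b", text, flags=re.IGNORECASE):
            events.append((m.group(0), etype, m.span()))
    return events

print(extract_events("An explosion was reported shortly before the election."))
```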


Author(s):  
Sumathi S. ◽  
Indumathi S. ◽  
Rajkumar S.

Text classification in the medical domain can make handling large volumes of medical data much easier. Documents can be segregated by disease type, which can be determined by extracting the decisive key terms from the original document. Because of the many nuances involved in understanding language, algorithms need large volumes of text data to learn patterns properly. The problem with existing systems such as MedScape, MedLinePlus, Wrappin, and MedHunt is that they involve human interaction and consume a great deal of time when handling large volumes of data. By automating this process, much of the manual effort can be removed, which in turn speeds up the classification of medical documents and helps address the shortage of medical technicians in third-world countries.
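One plausible reading of "extracting the decisive key terms" is to rank a document's terms by TF-IDF weight and match them against per-disease keyword lists; the sketch below illustrates that idea only, and the keyword lists and documents are hypothetical rather than taken from the paper.

```python
# A minimal sketch of disease-based segregation: take the highest-weighted
# TF-IDF terms of a document and match them against illustrative keyword lists.
from sklearn.feature_extraction.text import TfidfVectorizer

DISEASE_KEYWORDS = {                      # illustrative lists, not from the paper
    "diabetes": {"insulin", "glucose", "glycemic"},
    "cardiology": {"artery", "cardiac", "infarction"},
}

def segregate(doc, corpus, top_k=5):
    """Assign a disease category by overlap between the document's top TF-IDF
    terms and the per-disease keyword lists."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(corpus + [doc])
    terms = vec.get_feature_names_out()
    weights = X[len(corpus)].toarray().ravel()        # row for the new document
    top_terms = {terms[i] for i in weights.argsort()[::-1][:top_k]}
    scores = {d: len(top_terms & kw) for d, kw in DISEASE_KEYWORDS.items()}
    return max(scores, key=scores.get), top_terms

corpus = ["cardiac arrest after artery blockage", "insulin therapy for glucose control"]
print(segregate("patient shows elevated glucose and needs insulin", corpus))
```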


2019 ◽  
Vol 1 (2) ◽  
pp. 575-589 ◽  
Author(s):  
Blaž Škrlj ◽  
Jan Kralj ◽  
Nada Lavrač ◽  
Senja Pollak

Deep neural networks are becoming ubiquitous in text mining and natural language processing, but semantic resources, such as taxonomies and ontologies, are yet to be fully exploited in a deep learning setting. This paper presents an efficient semantic text mining approach, which converts semantic information related to a given set of documents into a set of novel features that are used for learning. The proposed Semantics-aware Recurrent deep Neural Architecture (SRNA) enables the system to learn simultaneously from the semantic vectors and from the raw text documents. We test the effectiveness of the approach on three text classification tasks: news topic categorization, sentiment analysis, and gender profiling. The experiments show that the proposed approach outperforms the approach without semantic knowledge, with the highest accuracy gain (up to 10%) achieved on short document fragments.
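A simplified two-branch sketch of the general idea (a recurrent branch over raw text combined with a dense branch over semantic feature vectors) is shown below in Keras; it is not the authors' SRNA implementation, and every size and layer choice is a placeholder.

```python
# A simplified two-input sketch: an LSTM branch over token sequences and a dense
# branch over taxonomy-derived semantic vectors, merged before the classifier.
# Not the authors' SRNA; all dimensions below are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, SEQ_LEN, SEM_DIM, CLASSES = 20000, 200, 100, 3

text_in = layers.Input(shape=(SEQ_LEN,), name="tokens")
x = layers.Embedding(VOCAB, 128)(text_in)
x = layers.LSTM(64)(x)                               # recurrent branch over raw text

sem_in = layers.Input(shape=(SEM_DIM,), name="semantic_features")
s = layers.Dense(64, activation="relu")(sem_in)      # branch over semantic vectors

merged = layers.concatenate([x, s])
out = layers.Dense(CLASSES, activation="softmax")(merged)

model = Model(inputs=[text_in, sem_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```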


Author(s):  
Ahed M. F. Al-Sbou

A huge amount of Arabic text is available online and needs to be organized. As a result, many natural language processing (NLP) applications are concerned with text organization; one of them is text classification (TC). TC makes it easier to deal with unorganized text by assigning it to suitable classes or labels. This paper is a survey of Arabic text classification. It also presents a comparison among different methods for classifying Arabic texts, where Arabic is a complex language because of its rich vocabulary. Arabic is one of the richest languages in the world in terms of its linguistic bases, yet research on Arabic language processing is very limited compared to English. These issues pose challenges for the classification and organization of Arabic text. Text classification helps users access documents or information that has already been assigned to one or more classes or categories. In addition, classifying documents helps search engines reduce the number of documents to be searched, making it easier to match documents against queries.


2019 ◽  
Vol 45 (1) ◽  
pp. 11-14
Author(s):  
Zuhair Ali

Automated classification of text into predefined categories has always been considered a vital method in the natural language processing field. In this paper, new methods based on the Radial Basis Function (RBF) and Fuzzy Radial Basis Function (FRBF) are used to solve the problem of text classification: a set of features is extracted for each sentence in the document collection, and these features are fed to the FRBF and RBF networks to classify the documents. The Reuters-21578 dataset is used for the text classification experiments. The results show that FRBF is more effective than RBF.
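The paper does not publish its network configuration, so the sketch below shows one common way to build an RBF-style classifier over TF-IDF features (Gaussian activations around K-means centres feeding a linear classifier); the fuzzy (FRBF) variant and the Reuters-21578 preprocessing are not reproduced, and the toy documents stand in for the real data.

```python
# A minimal RBF-style text classifier sketch: TF-IDF features pass through
# Gaussian activations around K-means centres, then a linear classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

docs = ["grain exports rise", "crude oil prices fall",
        "wheat harvest improves", "oil output cut"]
labels = [0, 1, 0, 1]                         # toy stand-in for Reuters topics

X = TfidfVectorizer().fit_transform(docs).toarray()
centres = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).cluster_centers_

def rbf_layer(X, centres, gamma=1.0):
    """Gaussian basis activations around the centres."""
    d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d)

clf = LogisticRegression().fit(rbf_layer(X, centres), labels)
print(clf.predict(rbf_layer(X, centres)))
```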


2021 ◽  
Vol 13 (19) ◽  
pp. 10856
Author(s):  
I-Cheng Chang ◽  
Tai-Kuei Yu ◽  
Yu-Jie Chang ◽  
Tai-Yi Yu

Facing the big data wave, this study applied artificial intelligence to cite knowledge and to find a feasible process that can play a crucial role in supplying innovative value in environmental education. Intelligent agents and natural language processing (NLP) are two key areas leading the trend in artificial intelligence; this research adopted NLP to analyze the research topics of environmental education research journals in the Web of Science (WoS) database during 2011–2020 and to interpret the categories and characteristics of abstracts of environmental education papers. The corpus data were selected from the abstracts and keywords of research journal papers, which were analyzed with text mining, cluster analysis, latent Dirichlet allocation (LDA), and co-word analysis methods. The classification of feature words was determined and reviewed by domain experts, and the associated TF-IDF weights were calculated for the subsequent cluster analysis, which combined hierarchical clustering and K-means analysis. The hierarchical clustering and LDA determined the number of required categories to be seven, and the K-means cluster analysis classified the overall documents into seven categories. This study used co-word analysis to check the suitability of the K-means classification, analyzed the terms with high TF-IDF weights for distinct K-means groups, and examined the terms for different topics with the LDA technique. A comparison of the results showed that most categories recognized with the K-means and LDA methods were the same and shared similar words; however, two categories showed slight differences. The involvement of field experts helped ensure the consistency and correctness of the classified topics and documents.
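A condensed sketch of some of the described pipeline stages (TF-IDF weighting, K-means clustering, and LDA topics) using scikit-learn is given below; the toy abstracts, cluster count, and topic count are placeholders and do not reflect the study's corpus, the hierarchical clustering step, or the expert-reviewed settings.

```python
# A condensed sketch of TF-IDF + K-means + LDA on a toy corpus; all inputs
# and parameters are placeholders, not the study's data or settings.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "environmental education curriculum for schools",
    "climate change awareness in outdoor learning",
    "teacher training for sustainability programs",
    "student attitudes toward recycling behaviour",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)
print("K-means labels:", kmeans.labels_)

count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = count_vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"topic {i}:", [terms[j] for j in topic.argsort()[-3:]])
```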


2019 ◽  
Vol 9 (17) ◽  
pp. 3617 ◽  
Author(s):  
Fen Zhao ◽  
Penghua Li ◽  
Yuanyuan Li ◽  
Jie Hou ◽  
Yinguo Li

With the rapid development of Internet technology, a growing mass of law cases constantly arises and needs to be dealt with in a timely manner. Automatic classification of law text is the most basic and critical process in an online law advice platform. Deep neural network-based natural language processing (DNN-NLP) is one of the most promising approaches to implementing text classification, and as convolutional neural network-based (CNN-based) methods have developed, CNN-based text classification has achieved impressive results. However, previous work required large amounts of manually annotated data, which increased the labor cost and reduced the adaptability of the approach. Hence, we present a new semi-supervised model to address the data annotation problem. Our method learns the embedding of small text regions from unlabeled data and then integrates the learned embedding into supervised training; more specifically, the region embeddings learned with the two-view-embedding model are used as an additional input to the CNN's convolution layer. In addition, to support multi-task learning, we propose a multi-label classification algorithm that assigns multiple labels to an instance. The proposed method is evaluated on a law case description dataset and on the standard English dataset RCV1. On the Chinese data, the results show that, compared with existing methods such as a linear SVM, our scheme improves precision, recall, F1, and Hamming loss by 7.76%, 7.86%, 9.19%, and 2.96%, respectively. Similarly, compared to a plain CNN, our scheme improves precision, recall, F1, and Hamming loss by 4.46%, 5.76%, 5.14%, and 0.87%, respectively. The robustness of this method makes it suitable and effective for automatic classification of law text, and the underlying design concept is promising for other real-world applications such as news classification and public opinion monitoring.
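A stripped-down multi-label text CNN with sigmoid outputs is sketched below to illustrate the multi-label side of the approach; the paper's two-view region-embedding pretraining on unlabeled data is omitted, and the vocabulary, sequence, and label sizes are assumptions.

```python
# A simplified multi-label text CNN sketch with sigmoid outputs; the two-view
# region-embedding pretraining described in the paper is not included, and all
# sizes below are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, SEQ_LEN, NUM_LABELS = 30000, 400, 20

inp = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(VOCAB, 128)(inp)
x = layers.Conv1D(128, 5, activation="relu")(x)    # convolution over text regions
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(NUM_LABELS, activation="sigmoid")(x)   # one probability per label

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")  # multi-label objective
model.summary()
```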


Author(s):  
Gleb Danilov ◽  
Timur Ishankulov ◽  
Konstantin Kotik ◽  
Yuriy Orlov ◽  
Mikhail Shifrin ◽  
...  

Automated text classification is a natural language processing (NLP) technology that could significantly facilitate scientific literature selection. A specific topical dataset of 630 article abstracts was obtained from the PubMed database. We proposed 27 parametrized variants of the PubMedBERT model and 4 ensemble models to solve a binary classification task on that dataset. Three hundred tests with resamples were performed for each classification approach. The best PubMedBERT model demonstrated an F1-score of 0.857, while the best ensemble model reached an F1-score of 0.853. We concluded that the quality of short scientific text classification might be improved using the latest state-of-the-art approaches.
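A minimal sketch of binary abstract classification with a PubMedBERT-style checkpoint through the Hugging Face transformers API is shown below; the checkpoint name is an assumption, and no fine-tuning loop, parametrization, or ensembling is included, so this is only the skeleton of such an experiment.

```python
# A minimal sketch of binary abstract classification with a PubMedBERT-style
# checkpoint; the checkpoint name is assumed, and the classification head is
# untrained, so fine-tuning on labeled abstracts would still be required.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=2)

abstract = "We evaluated a novel intraoperative imaging protocol ..."
inputs = tokenizer(abstract, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))        # class probabilities (untrained head)
```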


10.28945/4066 ◽  
2018 ◽  
Vol 13 ◽  
pp. 117-135 ◽  
Author(s):  
M. Thangaraj ◽  
M Sivakami

Aim/Purpose: The aim of this paper is to analyze various text classification techniques employed in practice, along with their strengths and weaknesses, to provide improved awareness of the knowledge extraction possibilities in the field of data mining.

Background: Artificial intelligence is reshaping text classification techniques to better acquire knowledge. However, despite the growth and spread of AI in all fields of research, its role with respect to text mining is not yet well understood.

Methodology: For this study, various articles written between 2010 and 2017 on “text classification techniques in AI”, selected from leading journals of computer science, were analyzed. Each article was read in full. The research problems related to text classification techniques in the field of AI were identified, and the techniques were grouped according to the algorithms involved. These algorithms were then divided by the learning procedure used. Finally, the findings were plotted as a tree structure to visualize the relationship between learning procedures and algorithms.

Contribution: This paper identifies the strengths, limitations, and current research trends in text classification in an advanced field like AI. This knowledge is crucial for data scientists, who can use the findings of this study to devise customized data models. It also helps industry understand the operational efficiency of text mining techniques, contributes to reducing project costs, and supports effective decision making.

Findings: It is important to study and understand the nature of data before proceeding to mining. Automation of the text classification process is required as the amount of data and the need for accuracy grow. Another interesting research opportunity lies in building intricate text data models with deep learning systems, which can execute complex natural language processing (NLP) tasks with semantic requirements.

Recommendations for Practitioners: Frame analysis, deception detection, narrative science where data tells a story, healthcare applications that diagnose illnesses, and conversation analysis are some of the applications suggested for practitioners.

Recommendations for Researchers: Developing algorithms that are simpler to code and implement, better approaches for knowledge distillation, multilingual text refining, domain knowledge integration, subjectivity detection, and contrastive viewpoint summarization are some of the areas that researchers could explore.

Impact on Society: Text classification forms the base of data analytics and acts as the engine behind knowledge discovery. It supports state-of-the-art decision making, for example, predicting an event before it actually occurs or classifying a transaction as fraudulent. The results of this study could be used to develop applications dedicated to assisting decision-making processes; such informed decisions help optimize resources and maximize benefits to mankind.

Future Research: In the future, better methods for parameter optimization will be identified by selecting parameters that better reflect effective knowledge discovery. The role of streaming data processing is still rarely explored when it comes to text classification.

