Survey on Text Classification

Author(s):  
Leena Bhuskute ◽  
Satish kumar Varma

Nowadays the World Wide Web is growing rapidly, and the automatic classification of documents has become a core process for organizing information and for knowledge discovery. Classifying e-documents, online news, blogs, e-mails, and digital library content requires text mining, machine learning, and natural language processing techniques to extract meaningful information. Proper classification and information detection from these resources is therefore a fundamental field of study. Text classification is a significant research topic in the area of text mining, in which documents are classified using supervised information. In this paper, various text representation schemes and learning classifiers such as the Naïve Bayes and Decision Tree algorithms are described, with illustrations for predefined classes. The presented approaches are compared and contrasted based on quality assurance parameters.
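As a rough illustration of the kind of pipeline this abstract describes, the sketch below trains the two named classifier families on a toy TF-IDF representation using scikit-learn; the documents, labels, and parameters are placeholders, not the paper's setup.

```python
# A minimal sketch (not the paper's exact setup) comparing the two classifier
# families named in the abstract on a small bag-of-words/TF-IDF representation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

docs = ["stocks rally on strong earnings", "team wins the championship",
        "new phone released today", "league schedules playoff games"]
labels = ["business", "sports", "tech", "sports"]   # toy predefined classes

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0)

vectorizer = TfidfVectorizer()                      # text representation scheme
X_tr = vectorizer.fit_transform(X_train)
X_te = vectorizer.transform(X_test)

for clf in (MultinomialNB(), DecisionTreeClassifier(random_state=0)):
    clf.fit(X_tr, y_train)
    print(type(clf).__name__)
    print(classification_report(y_test, clf.predict(X_te), zero_division=0))
```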

2020 ◽  
pp. 1686-1704
Author(s):  
Emna Hkiri ◽  
Souheyl Mallat ◽  
Mounir Zrigui

The event extraction task consists of detecting and classifying events within open-domain text. It is still new for Arabic, whereas it has reached maturity for languages such as English and French. Event extraction has also been shown to help other natural language processing tasks, such as information retrieval, question answering, text mining, and machine translation, achieve higher performance. In this article, we present an ongoing effort to build a system for event extraction from Arabic texts using the GATE platform and other tools.
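The abstract does not detail the extraction rules, so the sketch below is only a minimal trigger-lexicon illustration of event detection in Python; it is not a reproduction of the authors' GATE-based pipeline, and the trigger words and event types are invented for illustration.

```python
# A minimal rule-based sketch of event trigger detection using a small,
# hypothetical trigger lexicon; the authors' GATE pipeline is not reproduced here.
import re

TRIGGERS = {
    "attack": "Conflict", "explosion": "Conflict",
    "election": "Politics", "meeting": "Contact",
}

def extract_events(text):
    """Return (trigger, event_type, span) tuples found in the text."""
    events = []
    for trigger, etype in TRIGGERS.items():
        for m in re.finditer(rf"\b{trigger}\b", text, flags=re.IGNORECASE):
            events.append((m.group(0), etype, m.span()))
    return events

print(extract_events("An explosion was reported shortly before the election."))
```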


Author(s):  
Sumathi S. ◽  
Indumathi S. ◽  
Rajkumar S.

Text classification in the medical domain can make handling large volumes of medical data much easier. Documents can be segregated by disease type, which can be determined by extracting the decisive key terms from the original document. Because of the many nuances involved in understanding language, algorithms need large volumes of text data to learn patterns properly. The problem with existing systems such as MedScape, MedLinePlus, Wrappin, and MedHunt is that they involve human interaction and consume a great deal of time when handling large volumes of data. By automating this process, much of the manual effort can be removed, which in turn speeds up the classification of medical documents and helps address the shortage of medical technicians in third-world countries.
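One plausible reading of "extracting the decisive key terms" is to rank a document's terms by TF-IDF weight and match them against per-disease keyword lists; the sketch below illustrates that idea only, and the keyword lists and documents are hypothetical rather than taken from the paper.

```python
# A minimal sketch of disease-based segregation: take the highest-weighted
# TF-IDF terms of a document and match them against illustrative keyword lists.
from sklearn.feature_extraction.text import TfidfVectorizer

DISEASE_KEYWORDS = {                      # illustrative lists, not from the paper
    "diabetes": {"insulin", "glucose", "glycemic"},
    "cardiology": {"artery", "cardiac", "infarction"},
}

def segregate(doc, corpus, top_k=5):
    """Assign a disease category by overlap between the document's top TF-IDF
    terms and the per-disease keyword lists."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(corpus + [doc])
    terms = vec.get_feature_names_out()
    weights = X[len(corpus)].toarray().ravel()        # row for the new document
    top_terms = {terms[i] for i in weights.argsort()[::-1][:top_k]}
    scores = {d: len(top_terms & kw) for d, kw in DISEASE_KEYWORDS.items()}
    return max(scores, key=scores.get), top_terms

corpus = ["cardiac arrest after artery blockage", "insulin therapy for glucose control"]
print(segregate("patient shows elevated glucose and needs insulin", corpus))
```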


2019 ◽  
Vol 1 (2) ◽  
pp. 575-589 ◽  
Author(s):  
Blaž Škrlj ◽  
Jan Kralj ◽  
Nada Lavrač ◽  
Senja Pollak

Deep neural networks are becoming ubiquitous in text mining and natural language processing, but semantic resources, such as taxonomies and ontologies, are yet to be fully exploited in a deep learning setting. This paper presents an efficient semantic text mining approach, which converts semantic information related to a given set of documents into a set of novel features that are used for learning. The proposed Semantics-aware Recurrent deep Neural Architecture (SRNA) enables the system to learn simultaneously from the semantic vectors and from the raw text documents. We test the effectiveness of the approach on three text classification tasks: news topic categorization, sentiment analysis, and gender profiling. The experiments show that the proposed approach outperforms the approach without semantic knowledge, with the highest accuracy gain (up to 10%) achieved on short document fragments.
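A simplified two-branch sketch of the general idea (a recurrent branch over raw text combined with a dense branch over semantic feature vectors) is shown below in Keras; it is not the authors' SRNA implementation, and every size and layer choice is a placeholder.

```python
# A simplified two-input sketch: an LSTM branch over token sequences and a dense
# branch over taxonomy-derived semantic vectors, merged before the classifier.
# Not the authors' SRNA; all dimensions below are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, SEQ_LEN, SEM_DIM, CLASSES = 20000, 200, 100, 3

text_in = layers.Input(shape=(SEQ_LEN,), name="tokens")
x = layers.Embedding(VOCAB, 128)(text_in)
x = layers.LSTM(64)(x)                               # recurrent branch over raw text

sem_in = layers.Input(shape=(SEM_DIM,), name="semantic_features")
s = layers.Dense(64, activation="relu")(sem_in)      # branch over semantic vectors

merged = layers.concatenate([x, s])
out = layers.Dense(CLASSES, activation="softmax")(merged)

model = Model(inputs=[text_in, sem_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```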


Author(s):  
Ahed M. F. Al-Sbou

A huge amount of Arabic text is available online and needs to be organized. As a result, many natural language processing (NLP) applications are concerned with text organization; one of them is text classification (TC). TC makes it easier to deal with unorganized text by assigning it to suitable classes or labels. This paper is a survey of Arabic text classification. It also presents a comparison among different methods for classifying Arabic texts, where Arabic is a complex language because of its rich vocabulary. Arabic is one of the richest languages in the world in terms of its linguistic bases, yet research on Arabic language processing is very limited compared to English. These issues pose challenges for the classification and organization of Arabic text. Text classification helps users access documents or information that has already been assigned to one or more classes or categories. In addition, classifying documents helps search engines reduce the number of documents to be searched, making it easier to match documents against queries.


2019 ◽  
Vol 45 (1) ◽  
pp. 11-14
Author(s):  
Zuhair Ali

Automated classification of text into predefined categories has always been considered a vital method in the natural language processing field. In this paper, new methods based on the Radial Basis Function (RBF) and Fuzzy Radial Basis Function (FRBF) are used to solve the problem of text classification: a set of features is extracted for each sentence in the document collection, and these features are fed to the FRBF and RBF networks to classify the documents. The Reuters-21578 dataset is used for the text classification experiments. The results show that FRBF is more effective than RBF.
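The paper does not publish its network configuration, so the sketch below shows one common way to build an RBF-style classifier over TF-IDF features (Gaussian activations around K-means centres feeding a linear classifier); the fuzzy (FRBF) variant and the Reuters-21578 preprocessing are not reproduced, and the toy documents stand in for the real data.

```python
# A minimal RBF-style text classifier sketch: TF-IDF features pass through
# Gaussian activations around K-means centres, then a linear classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

docs = ["grain exports rise", "crude oil prices fall",
        "wheat harvest improves", "oil output cut"]
labels = [0, 1, 0, 1]                         # toy stand-in for Reuters topics

X = TfidfVectorizer().fit_transform(docs).toarray()
centres = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).cluster_centers_

def rbf_layer(X, centres, gamma=1.0):
    """Gaussian basis activations around the centres."""
    d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d)

clf = LogisticRegression().fit(rbf_layer(X, centres), labels)
print(clf.predict(rbf_layer(X, centres)))
```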


2021 ◽  
Vol 13 (19) ◽  
pp. 10856
Author(s):  
I-Cheng Chang ◽  
Tai-Kuei Yu ◽  
Yu-Jie Chang ◽  
Tai-Yi Yu

Facing the big data wave, this study applied artificial intelligence to cite knowledge and to find a feasible process that can play a crucial role in supplying innovative value in environmental education. Intelligent agents and natural language processing (NLP) are two key areas leading the trend in artificial intelligence; this research adopted NLP to analyze the research topics of environmental education research journals in the Web of Science (WoS) database during 2011–2020 and to interpret the categories and characteristics of abstracts of environmental education papers. The corpus data were selected from the abstracts and keywords of research journal papers, which were analyzed with text mining, cluster analysis, latent Dirichlet allocation (LDA), and co-word analysis methods. The classification of feature words was determined and reviewed by domain experts, and the associated TF-IDF weights were calculated for the subsequent cluster analysis, which combined hierarchical clustering and K-means analysis. The hierarchical clustering and LDA determined the number of required categories to be seven, and the K-means cluster analysis classified the overall documents into seven categories. This study used co-word analysis to check the suitability of the K-means classification, analyzed the terms with high TF-IDF weights for distinct K-means groups, and examined the terms for different topics with the LDA technique. A comparison of the results showed that most categories recognized with the K-means and LDA methods were the same and shared similar words; however, two categories showed slight differences. The involvement of field experts helped ensure the consistency and correctness of the classified topics and documents.
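A condensed sketch of some of the described pipeline stages (TF-IDF weighting, K-means clustering, and LDA topics) using scikit-learn is given below; the toy abstracts, cluster count, and topic count are placeholders and do not reflect the study's corpus, the hierarchical clustering step, or the expert-reviewed settings.

```python
# A condensed sketch of TF-IDF + K-means + LDA on a toy corpus; all inputs
# and parameters are placeholders, not the study's data or settings.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "environmental education curriculum for schools",
    "climate change awareness in outdoor learning",
    "teacher training for sustainability programs",
    "student attitudes toward recycling behaviour",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)
print("K-means labels:", kmeans.labels_)

count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = count_vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"topic {i}:", [terms[j] for j in topic.argsort()[-3:]])
```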


2019 ◽  
Vol 9 (17) ◽  
pp. 3617 ◽  
Author(s):  
Fen Zhao ◽  
Penghua Li ◽  
Yuanyuan Li ◽  
Jie Hou ◽  
Yinguo Li

With the rapid development of Internet technology, a growing mass of law cases constantly arises and needs to be dealt with in a timely manner. Automatic classification of law text is the most basic and critical process in an online law advice platform. Deep neural network-based natural language processing (DNN-NLP) is one of the most promising approaches to implementing text classification, and as convolutional neural network-based (CNN-based) methods have developed, CNN-based text classification has achieved impressive results. However, previous work required large amounts of manually annotated data, which increased the labor cost and reduced the adaptability of the approach. Hence, we present a new semi-supervised model to address the data annotation problem. Our method learns the embedding of small text regions from unlabeled data and then integrates the learned embedding into supervised training; more specifically, the region embeddings learned with the two-view-embedding model are used as an additional input to the CNN's convolution layer. In addition, to support multi-task learning, we propose a multi-label classification algorithm that assigns multiple labels to an instance. The proposed method is evaluated on a law case description dataset and on the standard English dataset RCV1. On the Chinese data, the results show that, compared with existing methods such as a linear SVM, our scheme improves precision, recall, F1, and Hamming loss by 7.76%, 7.86%, 9.19%, and 2.96%, respectively. Similarly, compared to a plain CNN, our scheme improves precision, recall, F1, and Hamming loss by 4.46%, 5.76%, 5.14%, and 0.87%, respectively. The robustness of this method makes it suitable and effective for automatic classification of law text, and the underlying design concept is promising for other real-world applications such as news classification and public opinion monitoring.
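A stripped-down multi-label text CNN with sigmoid outputs is sketched below to illustrate the multi-label side of the approach; the paper's two-view region-embedding pretraining on unlabeled data is omitted, and the vocabulary, sequence, and label sizes are assumptions.

```python
# A simplified multi-label text CNN sketch with sigmoid outputs; the two-view
# region-embedding pretraining described in the paper is not included, and all
# sizes below are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, SEQ_LEN, NUM_LABELS = 30000, 400, 20

inp = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(VOCAB, 128)(inp)
x = layers.Conv1D(128, 5, activation="relu")(x)    # convolution over text regions
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(NUM_LABELS, activation="sigmoid")(x)   # one probability per label

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")  # multi-label objective
model.summary()
```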


Author(s):  
Gleb Danilov ◽  
Timur Ishankulov ◽  
Konstantin Kotik ◽  
Yuriy Orlov ◽  
Mikhail Shifrin ◽  
...  

Automated text classification is a natural language processing (NLP) technology that could significantly facilitate scientific literature selection. A specific topical dataset of 630 article abstracts was obtained from the PubMed database. We proposed 27 parametrized variants of the PubMedBERT model and 4 ensemble models to solve a binary classification task on that dataset. Three hundred tests with resamples were performed for each classification approach. The best PubMedBERT model demonstrated an F1-score of 0.857, while the best ensemble model reached an F1-score of 0.853. We concluded that the quality of short scientific text classification might be improved using the latest state-of-the-art approaches.
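A minimal sketch of binary abstract classification with a PubMedBERT-style checkpoint through the Hugging Face transformers API is shown below; the checkpoint name is an assumption, and no fine-tuning loop, parametrization, or ensembling is included, so this is only the skeleton of such an experiment.

```python
# A minimal sketch of binary abstract classification with a PubMedBERT-style
# checkpoint; the checkpoint name is assumed, and the classification head is
# untrained, so fine-tuning on labeled abstracts would still be required.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=2)

abstract = "We evaluated a novel intraoperative imaging protocol ..."
inputs = tokenizer(abstract, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))        # class probabilities (untrained head)
```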


10.28945/4066 ◽  
2018 ◽  
Vol 13 ◽  
pp. 117-135 ◽  
Author(s):  
M. Thangaraj ◽  
M Sivakami

Aim/Purpose: The aim of this paper is to analyze various text classification techniques employed in practice, along with their strengths and weaknesses, to provide improved awareness of the knowledge extraction possibilities in the field of data mining.

Background: Artificial intelligence is reshaping text classification techniques to better acquire knowledge. However, despite the growth and spread of AI in all fields of research, its role with respect to text mining is not yet well understood.

Methodology: For this study, various articles written between 2010 and 2017 on “text classification techniques in AI”, selected from leading journals of computer science, were analyzed. Each article was read in full. The research problems related to text classification techniques in the field of AI were identified, and the techniques were grouped according to the algorithms involved. These algorithms were then divided by the learning procedure used. Finally, the findings were plotted as a tree structure to visualize the relationship between learning procedures and algorithms.

Contribution: This paper identifies the strengths, limitations, and current research trends in text classification in an advanced field like AI. This knowledge is crucial for data scientists, who can use the findings of this study to devise customized data models. It also helps industry understand the operational efficiency of text mining techniques, contributes to reducing project costs, and supports effective decision making.

Findings: It is important to study and understand the nature of data before proceeding to mining. Automation of the text classification process is required as the amount of data and the need for accuracy grow. Another interesting research opportunity lies in building intricate text data models with deep learning systems, which can execute complex natural language processing (NLP) tasks with semantic requirements.

Recommendations for Practitioners: Frame analysis, deception detection, narrative science where data tells a story, healthcare applications that diagnose illnesses, and conversation analysis are some of the applications suggested for practitioners.

Recommendations for Researchers: Developing algorithms that are simpler to code and implement, better approaches for knowledge distillation, multilingual text refining, domain knowledge integration, subjectivity detection, and contrastive viewpoint summarization are some of the areas that researchers could explore.

Impact on Society: Text classification forms the base of data analytics and acts as the engine behind knowledge discovery. It supports state-of-the-art decision making, for example, predicting an event before it actually occurs or classifying a transaction as fraudulent. The results of this study could be used to develop applications dedicated to assisting decision-making processes; such informed decisions help optimize resources and maximize benefits to mankind.

Future Research: In the future, better methods for parameter optimization will be identified by selecting parameters that better reflect effective knowledge discovery. The role of streaming data processing is still rarely explored when it comes to text classification.

