Text Classification of Gujarati Newspaper Headlines

International Journal of Asian Language Processing ◽

10.1142/s2717554520500204 ◽

2021 ◽

pp. 2050020

Author(s):

Stuti Mehta ◽

Suman K. Mitra

Keyword(s):

Feature Extraction ◽

Language Processing ◽

Text Classification ◽

Low Resource ◽

Textual Data ◽

Gujarati Language ◽

News Headlines ◽

Embedding Methods ◽

Insight Into

Text classification is an extremely important area of Natural Language Processing (NLP). This paper studies various methods for embedding and classification in the Gujarati language. The dataset comprises of Gujarati News Headlines classified into various categories. Different embedding methods for Gujarati language and various classifiers are used to classify the headlines into given categories. Gujarati is a low resource language. This language is not commonly worked upon. This paper deals with one of the most important NLP tasks - classification and along with it, an idea about various embedding techniques for Gujarati language can be obtained since they help in feature extraction for the process of classification. This paper first performs embedding to get a valid representation of the textual data and then uses already existing robust classifiers to perform classification over the embedded data. Additionally, the paper provides an insight into how various NLP tasks can be performed over a low resource language like Gujarati. Finally, the research paper carries out a comparative analysis between the performances of various existing methods of embedding and classification to get an idea of which combination gives a better outcome.

Download Full-text

Efficient text feature extraction by integrating the average linkage and K-medoids clustering

Modern Physics Letters B ◽

10.1142/s0217984921501517 ◽

2021 ◽

pp. 2150151

Author(s):

Dasong Sun

Keyword(s):

Feature Extraction ◽

Text Classification ◽

Experimental Results ◽

The Other ◽

Central Feature ◽

Number Of Clusters ◽

Average Linkage ◽

Text Feature

By clustering feature words, we can not only simplify the dimension of feature subsets, but also eliminate the redundancy of the feature. However, for a feature set with very large dimensions, the traditional [Formula: see text]-medoids algorithm is difficult to accurately estimate the value of [Formula: see text]. Moreover, the clustering results of the average linkage (AL) algorithm cannot be divided again, and the AL algorithm cannot be directly used for text classification. In order to overcome the limitations of AL and [Formula: see text]-medoids, in this paper, we combine the two algorithms together so as to be mutually complementary to each other. In particular, in order to meet the purpose of text classification, we improve the AL algorithm and propose the [Formula: see text] testing statistics to obtain the approximate number of clusters. Finally, the central feature words are preserved, and the other feature words are deleted. The experimental results show that the new algorithm largely eliminates the redundancy of the feature. Compared with the traditional TF-IDF algorithms, the performance of the text classification of the new algorithm is improved.

Download Full-text

Medical Reports Analysis Using Natural Language Processing for Disease Classification

Handbook of Research on Applications and Implementations of Machine Learning Techniques - Advances in Computational Intelligence and Robotics ◽

10.4018/978-1-5225-9902-9.ch009 ◽

2020 ◽

pp. 155-172

Author(s):

Sumathi S. ◽

Indumathi S. ◽

Rajkumar S.

Keyword(s):

Natural Language Processing ◽

Language Processing ◽

Text Classification ◽

Medical Data ◽

Disease Classification ◽

Human Interaction ◽

Time Consumption ◽

Medical Reports ◽

Medical Documents

Text classification in medical domain could result in an easier way of handling large volumes of medical data. They can be segregated depending on the type of diseases, which can be determined by extracting the decisive key texts from the original document. Due to various nuances present in understanding language in general, a requirement of large volumes of text-based data is required for algorithms to learn patterns properly. The problem with existing systems such as MedScape, MedLinePlus, Wrappin, and MedHunt is that they involve human interaction and high time consumption in handling a large volume of data. By employing automation in this proposed field, the large involvement of manpower could be removed which in turn speeds up the process of classification of the medical documents by which the shortage of medical technicians in third world countries are addressed.

Download Full-text

A Survey of Arabic Text Classification Models

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v8i6.pp4352-4355 ◽

2018 ◽

Vol 8 (6) ◽

pp. 4352 ◽

Cited By ~ 1

Author(s):

Ahed M. F. Al-Sbou

Keyword(s):

Language Processing ◽

Text Classification ◽

Arabic Language ◽

Arabic Text ◽

Classification Models ◽

Natural Languages ◽

Text Organization ◽

Arabic Text Classification ◽

Arabic Language Processing

<p>There is a huge content of Arabic text available over online that requires an organization of these texts. As result, here are many applications of natural languages processing (NLP) that concerns with text organization. One of the is text classification (TC). TC helps to make dealing with unorganized text. However, it is easier to classify them into suitable class or labels. This paper is a survey of Arabic text classification. Also, it presents comparison among different methods in the classification of Arabic texts, where Arabic text is represented a complex text due to its vocabularies. Arabic language is one of the richest languages in the world, where it has many linguistic bases. The researche in Arabic language processing is very few compared to English. As a result, these problems represent challenges in the classification, and organization of specific Arabic text. Text classification (TC) helps to access the most documents, or information that has already classified into specific classes, or categories to one or more classes or categories. In addition, classification of documents facilitate search engine to decrease the amount of document to, and then to become easier to search and matching with queries.</p>

Download Full-text

Deep learning model for metagenome fragment classification using spaced k-mers feature extraction

Jurnal Teknologi dan Sistem Komputer ◽

10.14710/jtsiskom.2020.13407 ◽

2020 ◽

Vol 8 (3) ◽

pp. 234-238

Author(s):

Nur Choiriyati ◽

Yandra Arkeman ◽

Wisnu Ananta Kusuma

Keyword(s):

Neural Network ◽

Feature Extraction ◽

Deep Learning ◽

Language Processing ◽

Computational Time ◽

Genus Level ◽

Computational Resources ◽

Learning Architectures ◽

Deep Learning Model

An open challenge in bioinformatics is the analysis of the sequenced metagenomes from the various environments. Several studies demonstrated bacteria classification at the genus level using k-mers as feature extraction where the highest value of k gives better accuracy but it is costly in terms of computational resources and computational time. Spaced k-mers method was used to extract the feature of the sequence using 111 1111 10001 where 1 was a match and 0 was the condition that could be a match or did not match. Currently, deep learning provides the best solutions to many problems in image recognition, speech recognition, and natural language processing. In this research, two different deep learning architectures, namely Deep Neural Network (DNN) and Convolutional Neural Network (CNN), trained to approach the taxonomic classification of metagenome data and spaced k-mers method for feature extraction. The result showed the DNN classifier reached 90.89 % and the CNN classifier reached 88.89 % accuracy at the genus level taxonomy.

Download Full-text

TEXT CLASSIFICATION BASED ON FUZZY RADIAL BASIS FUNCTION

Iraqi Journal for Computers and Informatics ◽

10.25195/ijci.v45i1.40 ◽

2019 ◽

Vol 45 (1) ◽

pp. 11-14

Author(s):

Zuhair Ali

Keyword(s):

Radial Basis Function ◽

Language Processing ◽

Text Classification ◽

Basis Function ◽

Automated Classification ◽

New Methods ◽

Radial Basis ◽

Document Collection ◽

Better Than

Automated classification of text into predefined categories has always been considered as a vital method in thenatural language processing field. In this paper new methods based on Radial Basis Function (RBF) and Fuzzy Radial BasisFunction (FRBF) are used to solve the problem of text classification, where a set of features extracted for each sentencein the document collection these set of features introduced to FRBF and RBF to classify documents. Reuters 21578 datasetutilized for the purpose of text classification. The results showed the effectiveness of FRBF is better than RBF.

Download Full-text

IDC: Quantitative Evaluation Benchmark of Interpretation Methods for Deep Text Classification Models

10.21203/rs.3.rs-960359/v1 ◽

2021 ◽

Author(s):

Mohammed Khaleel ◽

Lei Qi ◽

Wallapak Tavanapong ◽

Johnny Wong ◽

Adisak Sukul ◽

...

Keyword(s):

Quantitative Evaluation ◽

Language Processing ◽

Text Classification ◽

Deep Neural Networks ◽

Performance Metrics ◽

Ground Truth ◽

Decision Making Process ◽

Research Attention ◽

Prediction Confidence ◽

Insight Into

Abstract Recent advances in deep neural networks have achieved outstanding success in natural language processing. Due to the success and the black-box nature of the deep models, interpretation methods that provide insight into the decision-making process of the models have received an influx of research attention. However, there is no quantitative evaluation comparing interpretation methods for text classification other than observing classification accuracy or prediction confidence when important word grams are removed. This is due to the lack of interpretation ground truth. Manual labeling of a large interpretation ground truth is time-consuming. We propose IDC, a new benchmark for quantitative evaluation of I nterpretation methods for D eep text C lassification models. IDC consists of three methods that take existing text classification ground truth and generate three corresponding pseudo-interpretation ground truth datasets. We propose to use interpretation recall, interpretation precision, and Cohen’s kappa inter-agreement as performance metrics. We used the pseudo ground truth datasets and the metrics to evaluate six interpretation methods.

Download Full-text

Semi-Supervised Convolutional Neural Network for Law Advice Online

Applied Sciences ◽

10.3390/app9173617 ◽

2019 ◽

Vol 9 (17) ◽

pp. 3617 ◽

Cited By ~ 2

Author(s):

Fen Zhao ◽

Penghua Li ◽

Yuanyuan Li ◽

Jie Hou ◽

Yinguo Li

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Language Processing ◽

Text Classification ◽

Learning Task ◽

Automatic Classification ◽

Labor Cost ◽

Internet Technology ◽

Data Annotation

With the rapid developments of Internet technology, a mass of law cases is constantly occurring and needs to be dealt with in time. Automatic classification of law text is the most basic and critical process in the online law advice platform. Deep neural network-based natural language processing (DNN-NLP) is one of the most promising approaches to implement text classification. Meanwhile, as the convolutional neural network-based (CNN-based) methods developed, CNN-based text classification has already achieved impressive results. However, previous work applied amounts of manually-annotated data, which increased the labor cost and reduced the adaptability of the approach. Hence, we present a new semi-supervised model to solve the problem of data annotation. Our method learns the embedding of small text regions from unlabeled data and then integrates the learned embedding into the supervised training. More specifically, the learned embedding regions with the two-view-embedding model are used as an additional input to the CNN’s convolution layer. In addition, to implement the multi-task learning task, we propose the multi-label classification algorithm to assign multiple labels to an instance. The proposed method is evaluated experimentally subject to a law case description dataset and English standard dataset RCV1 . On Chinese data, the simulation results demonstrate that, compared with the existing methods such as linear SVM, our scheme respectively improves by 7.76%, 7.86%, 9.19%, and 2.96% the precision, recall, F-1, and Hamming loss. Analogously, the results suggest that compared to CNN, our scheme respectively improves by 4.46%, 5.76%, 5.14% and 0.87% in terms of precision, recall, F-1, and Hamming loss. It is worth mentioning that the robustness of this method makes it suitable and effective for automatic classification of law text. Furthermore, the design concept proposed is promising, which can be utilized in other real-world applications such as news classification and public opinion monitoring.

Download Full-text

An efficient technique for hybrid classification and feature extraction using normalization

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.27.14534 ◽

2018 ◽

Vol 7 (2.27) ◽

pp. 156 ◽

Cited By ~ 1

Author(s):

Bipanjyot Kaur ◽

Gourav Bathla

Keyword(s):

Feature Extraction ◽

Feature Selection ◽

Language Processing ◽

Text Classification ◽

Complete Classification ◽

Learning Technique ◽

Hybrid Classification ◽

Classification Evaluation ◽

Class Labels ◽

Evaluation Parameters

Text classification is technique for assigning the class or label to a particular document within predefined class labels. Predefined classes examples are sports, business, technical, education and science etc. Classification is supervised learning technique i.e. these classes are trained with certain features and then document is classified based on similarity measure with these trained document set. Text classification is used in many applications like assigning the label to the documents, separating the spam messages from the genuine one, filtering of text, natural language processing etc. Feature selection, extraction and classification are various phases for assigning label to any document. In this paper, PCA is used for feature extraction, ABC is used for feature selection and SVM is used for classification. PCA is improved by applying normalization-using size of features in our proposed approach. It reduces the redundant features to larger extent. There are very few research works, which have implemented PCA, ABC and SVM for complete classification. Evaluation parameters like accuracy, F-measure and G-mean are calculated to check classifier efficiency. The proposed system is deployed on 20-Newsgroup dataset. Experiment analysis proves that accuracy is improved using our proposed approach as compared to existing approaches.

Download Full-text

The Classification of Short Scientific Texts Using Pretrained BERT Model

Studies in Health Technology and Informatics - Public Health and Informatics ◽

10.3233/shti210125 ◽

2021 ◽

Author(s):

Gleb Danilov ◽

Timur Ishankulov ◽

Konstantin Kotik ◽

Yuriy Orlov ◽

Mikhail Shifrin ◽

...

Keyword(s):

Language Processing ◽

Text Classification ◽

Binary Classification ◽

Scientific Texts ◽

Pubmed Database ◽

Automated Text Classification ◽

Literature Selection ◽

State Of Art ◽

Classification Quality

Automated text classification is a natural language processing (NLP) technology that could significantly facilitate scientific literature selection. A specific topical dataset of 630 article abstracts was obtained from the PubMed database. We proposed 27 parametrized options of PubMedBERT model and 4 ensemble models to solve a binary classification task on that dataset. Three hundred tests with resamples were performed in each classification approach. The best PubMedBERT model demonstrated F1-score = 0.857 while the best ensemble model reached F1-score = 0.853. We concluded that the short scientific texts classification quality might be improved using the latest state-of-art approaches.

Download Full-text

PAAD: POLITICAL ARABIC ARTICLES DATASET FOR AUTOMATIC TEXT CATEGORIZATION

Iraqi Journal for Computers and Informatics ◽

10.25195/ijci.v46i1.246 ◽

2020 ◽

Vol 46 (1) ◽

pp. 1-10

Author(s):

Dhafar Hamed Abd ◽

Ahmed T. Sadiq ◽

Ayad R. Abbas

Keyword(s):

Computational Linguistics ◽

Language Processing ◽

Text Classification ◽

Text Categorization ◽

Political Orientation ◽

Huge Amount ◽

Textual Data ◽

Automatic Text ◽

Excel File ◽

Modern Standard

Now day’s text Classification and Sentiment analysis is considered as one of the popular Natural Language Processing (NLP) tasks. This kind of technique plays significant role in human activities and has impact on the daily behaviours. Each article in different fields such as politics and business represent different opinions according to the writer tendency. A huge amount of data will be acquired through that differentiation. The capability to manage the political orientation of an online article automatically. Therefore, there is no corpus for political categorization was directed towards this task in Arabic, due to the lack of rich representative resources for training an Arabic text classifier. However, we introduce political Arabic articles dataset (PAAD) of textual data collected from newspapers, social network, general forum and ideology website. The dataset is 206 articles distributed into three categories as (Reform, Conservative and Revolutionary) that we offer to the research community on Arabic computational linguistics. We anticipate that this dataset would make a great aid for a variety of NLP tasks on Modern Standard Arabic, political text classification purposes. We present the data in raw form and excel file. Excel file will be in four types such as V1 raw data, V2 preprocessing, V3 root stemming and V4 light stemming.

Download Full-text