Vector representation based on a supervised codebook for Nepali documents classification

2021 ◽  
Vol 7 ◽  
pp. e412
Author(s):  
Chiranjibi Sitaula ◽  
Anish Basnet ◽  
Sunil Aryal

Document representation with outlier tokens degrades classification performance because of the uncertain orientation of such tokens. Most existing document representation methods, in Nepali as in other languages, ignore strategies to filter such tokens out of documents before learning their representations. In this article, we propose a novel document representation method based on a supervised codebook to represent Nepali documents, where our codebook contains only semantic tokens without outliers. Our codebook is domain-specific, as it is based on tokens in a given corpus that have higher similarities with the class labels in the corpus. Our method adopts a simple yet effective representation for each word, called probability-based word embedding. To show the efficacy of our method, we evaluate its performance on the document classification task using a Support Vector Machine and validate it against widely used document representation methods such as Bag of Words, Latent Dirichlet Allocation, Long Short-Term Memory, Word2Vec, and Bidirectional Encoder Representations from Transformers, using four Nepali text datasets (denoted A1, A2, A3 and A4). The experimental results show that our method produces state-of-the-art classification performance (77.46% accuracy on A1, 67.53% on A2, 80.54% on A3 and 89.58% on A4) compared with the widely used existing document representation methods. It yields the best classification accuracy on three datasets (A1, A2 and A3) and comparable accuracy on the fourth (A4). Furthermore, we introduce the largest Nepali document dataset (A4), called the NepaliLinguistic dataset, to the linguistic community.
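
The abstract does not give the exact construction, but its core idea (represent each token by its class-conditional probabilities and keep only tokens that lean strongly toward some class) can be sketched roughly as follows. The whitespace tokenizer, the min_prob threshold, and mean-pooling over token vectors are assumptions of this sketch, not the authors' exact design.

```python
import numpy as np
from collections import defaultdict
from sklearn.svm import SVC

def build_codebook(docs, labels, n_classes, min_prob=0.8):
    """Map each token to P(class | token); keep tokens whose probability mass
    concentrates on one class (a crude stand-in for 'semantic' tokens)."""
    counts = defaultdict(lambda: np.zeros(n_classes))
    for doc, y in zip(docs, labels):
        for tok in doc.split():
            counts[tok][y] += 1
    codebook = {}
    for tok, c in counts.items():
        p = c / c.sum()
        if p.max() >= min_prob:          # filter out 'outlier' tokens
            codebook[tok] = p
    return codebook

def embed(doc, codebook, n_classes):
    """Represent a document as the mean of its tokens' probability vectors."""
    vecs = [codebook[t] for t in doc.split() if t in codebook]
    return np.mean(vecs, axis=0) if vecs else np.zeros(n_classes)

# Toy usage with made-up two-class data
docs = ["khel jit", "khel haar", "sarkar niti", "sarkar chunab"]
labels = [0, 0, 1, 1]
cb = build_codebook(docs, labels, n_classes=2)
X = np.stack([embed(d, cb, 2) for d in docs])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X))
```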

2021 ◽  
Vol 14 ◽  
Author(s):  
Jukka Ranta ◽  
Manu Airaksinen ◽  
Turkka Kirjavainen ◽  
Sampsa Vanhatalo ◽  
Nathan J. Stevenson

Objective: To develop a non-invasive and clinically practical method for long-term monitoring of infant sleep cycling in the intensive care unit.
Methods: Forty-three infant polysomnography recordings were performed at 1–18 weeks of age, including a piezo element bed mattress sensor to record respiratory and gross-body movements. The hypnogram scored from polysomnography signals was used as the ground truth in training sleep classifiers based on 20,022 epochs of movement and/or electrocardiography signals. Three classifier designs were evaluated for the detection of deep sleep (N3 state): support vector machine (SVM), Long Short-Term Memory (LSTM) neural network, and convolutional neural network (CNN).
Results: Deep sleep was accurately distinguished from other states by all classifier variants. The SVM classifier based on a combination of movement and electrocardiography features had the highest performance (AUC 97.6%). An SVM classifier based on movement features alone had comparable accuracy (AUC 95.0%), as did the feature-independent CNN (AUC 93.3%).
Conclusion: Automated non-invasive tracking of sleep state cycling is technically feasible using measurements from a piezo element situated under a bed mattress.
Significance: An open-source infant deep sleep detector of this kind allows quantitative, continuous bedside assessment of an infant's sleep cycling.
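
As an illustration of the evaluation described above, here is a minimal sketch of an SVM scored by AUC on a binary deep-sleep (N3) detection task. The features are synthetic stand-ins; real inputs would be per-epoch movement and/or electrocardiography features.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 12))                 # stand-in per-epoch features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000)) > 0  # stand-in N3 labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
svm = SVC(probability=True).fit(X_tr, y_tr)     # probability=True enables AUC scoring
auc = roc_auc_score(y_te, svm.predict_proba(X_te)[:, 1])
print(f"AUC: {auc:.3f}")
```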


2019 ◽  
Vol 9 (12) ◽  
pp. 2470 ◽  
Author(s):  
Anvarjon Tursunov ◽  
Soonil Kwon ◽  
Hee-Suk Pang

The most widely used and well-known acoustic features of a speech signal, the Mel frequency cepstral coefficients (MFCC), cannot characterize emotions in speech sufficiently when classifying both discrete emotions (i.e., anger, happiness, sadness, and neutral) and emotions in the valence dimension (positive and negative). The main reason is that some discrete emotions, such as anger and happiness, share similar acoustic features in the arousal dimension (high and low) but differ in the valence dimension. Timbre is a sound quality that can discriminate between two sounds even when they have the same pitch and loudness. In this paper, we analyzed timbre acoustic features to improve the classification performance of discrete emotions as well as emotions in the valence dimension. Sequential forward selection (SFS) was used to find the most relevant features among the timbre acoustic features. The experiments were carried out on the Berlin Emotional Speech Database and the Interactive Emotional Dyadic Motion Capture Database. A support vector machine (SVM) and a long short-term memory recurrent neural network (LSTM-RNN) were used to classify emotions. Significant classification performance improvements were achieved by combining the baseline features with the most relevant timbre acoustic features, found by applying SFS to emotion classification on the Berlin Emotional Speech Database. Extensive experiments showed that timbre acoustic features can characterize emotions in speech sufficiently in the valence dimension.
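
Sequential forward selection is available off the shelf; below is a minimal sketch of SFS over candidate acoustic features feeding an SVM, in the spirit of the pipeline above. The feature values and counts are synthetic; real inputs would be MFCC plus timbre descriptors per utterance.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))            # 20 candidate acoustic features
y = rng.integers(0, 4, size=300)          # 4 discrete emotion labels (toy)

# Forward selection: greedily add the feature that most improves CV accuracy
sfs = SequentialFeatureSelector(
    SVC(kernel="rbf"), n_features_to_select=8, direction="forward", cv=5
)
sfs.fit(X, y)
print("selected feature indices:", np.flatnonzero(sfs.get_support()))
```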


Sequence classification is one of the in-demand research areas in the field of Natural Language Processing (NLP). Classifying a set of images or texts into an appropriate category or class is a complex task that many Machine Learning (ML) models fail to accomplish accurately, ending up under-fitting the given dataset. Some of the ML algorithms used in text classification are KNN, Naïve Bayes, Support Vector Machines, Convolutional Neural Networks (CNNs), Recursive CNNs, Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), etc. For this experimental study, LSTM and a few other algorithms were chosen for a comparative study. The dataset used is the SMS Spam Collection Dataset from Kaggle, to which 150 more entries were added from other sources. The two possible class labels for the data points are spam and ham. Each entry consists of the class label and a few sentences of text, followed by a few extraneous features that are eliminated. After converting the text to the required format, the models are run and then evaluated using various metrics. In the experiments, LSTM gives much better classification accuracy than the other machine learning models. F1-scores in the high nineties were achieved using LSTM for classifying the text. The other models showed very low F1-scores and cosine similarities, indicating that they had underperformed on the dataset. Another interesting observation is that LSTM produced fewer false positives and false negatives than any other model.
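
A minimal Keras sketch of an LSTM spam/ham classifier of the kind described above. The vocabulary size, sequence length, layer widths, and toy messages are illustrative choices, not the study's settings.

```python
import tensorflow as tf

texts = ["win a free prize now", "are we meeting tomorrow",
         "urgent claim your reward", "see you at lunch"]
labels = [1, 0, 1, 0]                     # 1 = spam, 0 = ham (toy data)

# Turn raw strings into padded integer sequences
vectorize = tf.keras.layers.TextVectorization(max_tokens=5000,
                                              output_sequence_length=30)
vectorize.adapt(texts)

model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(input_dim=5000, output_dim=32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # P(spam)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(tf.constant(texts), tf.constant(labels), epochs=3, verbose=0)
print(model.predict(tf.constant(["free prize waiting"])))
```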


Author(s):  
Vedika Gupta ◽  
Nikita Jain ◽  
Shubham Shubham ◽  
Agam Madan ◽  
Ankit Chaudhary ◽  
...  

Linguistic resources for commonly used languages such as English and Mandarin Chinese are available in abundance, hence the volume of existing research in these languages. However, there are languages for which linguistic resources are scarce. One of these is Hindi. Despite being the fourth most popular language, Hindi still lacks richly populated linguistic resources, owing to the challenges involved in dealing with the language. This article first explores machine learning-based approaches (Naïve Bayes, Support Vector Machine, Decision Tree, and Logistic Regression) to analyze the sentiment contained in Hindi-language text derived from Twitter. The article then presents lexicon-based approaches (Hindi Senti-WordNet, NRC Emotion Lexicon) for sentiment analysis in Hindi, while also proposing a domain-specific sentiment dictionary. Finally, an integrated convolutional neural network (CNN), Recurrent Neural Network, and Long Short-Term Memory model is proposed to analyze sentiment from Hindi-language tweets: a total of 23,767 tweets classified into positive, negative, and neutral. The proposed CNN approach gives an accuracy of 85%.
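
The exact integrated architecture is not specified in the abstract; a rough Keras sketch of a CNN feeding an LSTM for three-way (positive/negative/neutral) tweet sentiment follows, with all hyperparameters assumed for illustration.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40,)),                # padded token-id sequences
    tf.keras.layers.Embedding(input_dim=20000, output_dim=64),
    tf.keras.layers.Conv1D(64, 5, activation="relu"),  # local n-gram features
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.LSTM(64),                          # sequence-level context
    tf.keras.layers.Dense(3, activation="softmax"),    # positive / negative / neutral
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```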


2016 ◽  
Vol 16 (6) ◽  
pp. 69-82
Author(s):  
Jing Ni ◽  
Ge Gao ◽  
Pengyu Chen

Abstract: The volume of corporate documents has grown enormously in recent years. In this paper we aim to improve classification performance and to support the effective management of massive technical material in domain-specific fields. Taking the field of petro-chemical processes as a case study, we examine in detail the influence of parameters on classification accuracy when using the Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) text auto-classification algorithms. Advantages and disadvantages of the two text classification algorithms in the field of petro-chemical processes are presented. Our tests also show that paying more attention to the professional vocabulary can significantly improve the F1 value of both algorithms. These results provide a reference for future information classification in related industry fields.
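
A minimal sketch of the kind of parameter study described above: TF-IDF features feeding SVM and KNN classifiers, with a small grid search over each algorithm's main parameter and F1 as the score. The corpus, labels, and grids are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

docs = ["catalyst bed pressure drop", "distillation column reflux",
        "pump seal failure report", "reactor feed preheater duty"] * 10
labels = [0, 1, 2, 3] * 10                 # toy document categories

for name, clf, grid in [
    ("SVM", SVC(), {"svc__C": [0.1, 1, 10]}),
    ("KNN", KNeighborsClassifier(),
     {"kneighborsclassifier__n_neighbors": [1, 3, 5]}),
]:
    pipe = make_pipeline(TfidfVectorizer(), clf)
    search = GridSearchCV(pipe, grid, cv=2, scoring="f1_macro")
    search.fit(docs, labels)
    print(name, search.best_params_, round(search.best_score_, 3))
```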


Author(s):  
Sameerchand Pudaruth ◽  
Sunjiv Soyjaudah ◽  
Rajendra Gunputh

Laws are often developed in a piecemeal fashion, and many provisions of a similar nature are found across different legislations. There is therefore a need to classify legislations into legal topics to help legal professionals in their daily activities. In this study, we experimented with various deep learning architectures for the automatic classification of 490 legislations from the Republic of Mauritius into 30 categories. Our results demonstrate that a Deep Neural Network (DNN) with three hidden layers delivered the best performance compared with other architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). A mean classification accuracy of 60.9% was achieved using the DNN, against 56.5% for the CNN and 33.7% for Long Short-Term Memory (LSTM). Comparisons were also made with traditional machine learning classifiers such as support vector machines and decision trees, and the performance of the DNN was found to be superior by at least 10% in all runs. Both general pre-trained word embeddings such as Word2vec and domain-specific word embeddings such as Law2vec were used in combination with the above deep learning architectures, with Word2vec giving the best performance. To our knowledge, this is the first application of deep learning to the categorisation of legislations.
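
A hedged sketch of the best-performing setup reported above: a dense network with three hidden layers over fixed-length document vectors (e.g., averaged Word2vec embeddings), predicting one of 30 legal topics. The layer sizes and input dimensionality are assumptions of this sketch.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(300,)),        # e.g. averaged Word2vec vectors
    tf.keras.layers.Dense(256, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer 2
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 3
    tf.keras.layers.Dense(30, activation="softmax"), # 30 legal categories
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```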


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Jie Zhang ◽  
Pingping Sun ◽  
Feng Zhao ◽  
Qianru Guo ◽  
Yue Zou

The wanton dissemination of network pseudohealth information has brought great harm to people's health, lives, and property, so detecting and identifying such information is important. To this end, this paper defines the concepts of pseudohealth information, data block, and data block integration; designs an architecture that combines the latent Dirichlet allocation (LDA) algorithm with data block update integration; and proposes the corresponding combination algorithm model. In addition, crawler technology is used to collect the pseudohealth information transmitted on the Sina Weibo platform during the "epidemic situation" from February to March 2020 for a simulation test on the experimental case dataset. The results show that (1) the LDA model can deeply mine the semantic information of network pseudohealth information, obtain document-topic distribution features, and use the trained topic features as input variables for classification; (2) the dataset partitioning method can effectively block data according to the text attributes and class labels of network pseudohealth information, and the data block reintegration method can accurately classify and integrate the blocked data; and (3) given that the combination model has certain limitations in detecting network pseudohealth information, the support vector machine (SVM) model can extract the granular content of data blocks in pseudohealth information in real time, greatly improving the recognition performance of the combination model.
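
The LDA-plus-SVM part of the combination can be sketched compactly: LDA turns each document into a document-topic distribution vector, which an SVM then classifies. The toy corpus, topic count, and labels below are illustrative; the block update and reintegration steps are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

docs = ["garlic water cures the virus", "masks reduce droplet spread",
        "drink bleach to disinfect lungs", "vaccines undergo clinical trials"]
labels = [1, 0, 1, 0]                      # 1 = pseudohealth, 0 = legitimate (toy)

bow = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_features = lda.fit_transform(bow)    # document-topic distributions
clf = SVC(kernel="linear").fit(topic_features, labels)
print(clf.predict(topic_features))
```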


2020 ◽  
Vol 12 (2) ◽  
pp. 84-99
Author(s):  
Li-Pang Chen

In this paper, we investigate the analysis and prediction of time-dependent data. We focus on four different stocks selected from the Yahoo Finance historical database. To build models and predict future stock prices, we consider three machine learning techniques: Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and Support Vector Regression (SVR). By treating close price, open price, daily low, daily high, adjusted close price, and volume of trades as predictors in the machine learning methods, we show that prediction accuracy is improved.
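
As a minimal illustration of one of the three techniques, here is an SVR that predicts the next day's close from today's OHLCV-style predictors. The data is synthetic; real inputs would come from the Yahoo Finance history, and the high/low/volume columns below are crude stand-ins.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
close = np.cumsum(rng.normal(size=500)) + 100    # synthetic price series

X = np.column_stack([close[:-1],                 # today's close
                     close[:-1] * 1.01,          # stand-in daily high
                     close[:-1] * 0.99,          # stand-in daily low
                     rng.normal(size=499)])      # stand-in volume signal
y = close[1:]                                    # next day's close

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10))
model.fit(X[:400], y[:400])                      # chronological train/test split
print("test R^2:", round(model.score(X[400:], y[400:]), 3))
```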


2020 ◽  
Author(s):  
Nalika Ulapane ◽  
Karthick Thiyagarajan ◽  
Sarath Kodagoda

Classification has become a vital task in modern machine learning and Artificial Intelligence applications, including smart sensing. Numerous machine learning techniques are available to perform classification. Similarly, numerous practices, such as feature selection (i.e., selection of a subset of descriptor variables that optimally describe the output), are available to improve classifier performance. In this paper, we consider a given supervised learning classification task that must be performed using continuous-valued features. It is assumed that an optimal subset of features has already been selected, so no further feature reduction or feature addition is to be carried out. We then attempt to improve the classification performance by passing the given feature set through a transformation that produces a new feature set, which we have named the "Binary Spectrum". Via a case study on Pulsed Eddy Current sensor data captured from an infrastructure monitoring task, we demonstrate how the classification accuracy of a Support Vector Machine (SVM) classifier increases through the use of this Binary Spectrum feature, indicating the feature transformation's potential for broader usage.
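
The paper's Binary Spectrum transform itself is not reproduced here. As a loose stand-in, the sketch below uses a thermometer-style binarization (each continuous feature compared against a ladder of thresholds to yield binary indicators) and feeds the resulting features to an SVM; the transform, threshold count, and toy labels are all assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVC

def thermometer(X, n_levels=8):
    """Binarize each column of X against n_levels evenly spaced thresholds.
    NOTE: an illustrative stand-in, not the paper's Binary Spectrum."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    levels = np.linspace(0, 1, n_levels + 2)[1:-1]     # interior thresholds
    scaled = (X - lo) / (hi - lo + 1e-12)              # rescale to [0, 1]
    return (scaled[:, :, None] > levels).reshape(len(X), -1).astype(float)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))                # continuous-valued features (toy)
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # toy labels

clf = SVC().fit(thermometer(X), y)
print("train accuracy:", clf.score(thermometer(X), y))
```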

