Text classification to streamline online wildlife trade analyses

PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254007
Author(s):  
Oliver C. Stringham ◽  
Stephanie Moncayo ◽  
Katherine G. W. Hill ◽  
Adam Toomes ◽  
Lewis Mitchell ◽  
...  

Automated monitoring of websites that trade wildlife is increasingly necessary to inform conservation and biosecurity efforts. However, e-commerce and wildlife trading websites can contain a vast number of advertisements, an unknown proportion of which may be irrelevant to researchers and practitioners. Given that many wildlife-trade advertisements have an unstructured text format, automated identification of relevant listings has not traditionally been possible, nor attempted. Other scientific disciplines have solved similar problems using machine learning and natural language processing models, such as text classifiers. Here, we test the ability of a suite of text classifiers to extract relevant advertisements from wildlife trade occurring on the Internet. We collected data from an Australian classifieds website where people can post advertisements of their pet birds (n = 16.5k advertisements). We found that text classifiers can predict, with a high degree of accuracy, which listings are relevant (ROC AUC ≥ 0.98, F1 score ≥ 0.77). Furthermore, in an attempt to answer the question ‘how much data is required to have an adequately performing model?’, we conducted a sensitivity analysis by simulating decreases in sample sizes to measure the subsequent change in model performance. From our sensitivity analysis, we found that text classifiers required a minimum sample size of 33% (c. 5.5k listings) to accurately identify relevant listings (for our dataset), providing a reference point for future applications of this sort. Our results suggest that text classification is a viable tool that can be applied to the online trade of wildlife to reduce time dedicated to data cleaning. However, the success of text classifiers will vary depending on the advertisements and websites, and will therefore be context dependent. Further work to integrate other machine learning tools, such as image classification, may provide better predictive abilities in the context of streamlining data processing for wildlife trade related online data.
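As an illustration of the approach the abstract describes, the following minimal sketch trains a TF-IDF text classifier and runs a sample-size sensitivity loop, scored with ROC AUC and F1. It is a hedged analogue, not the authors' code: the file name, column names, and fractions are hypothetical, and the paper evaluated a suite of classifiers rather than only logistic regression.

```python
# Minimal sketch of a relevance classifier plus a sample-size sensitivity
# loop. 'bird_ads.csv' and its columns 'text'/'relevant' are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

ads = pd.read_csv("bird_ads.csv")  # hypothetical file of advertisements
X_train, X_test, y_train, y_test = train_test_split(
    ads["text"], ads["relevant"], test_size=0.2, random_state=42)

vec = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

# Sensitivity analysis in the paper's spirit: retrain on shrinking
# fractions of the training data and watch performance degrade.
for frac in (1.0, 0.66, 0.33, 0.10):
    n = int(frac * Xtr.shape[0])
    clf = LogisticRegression(max_iter=1000).fit(Xtr[:n], y_train.iloc[:n])
    probs = clf.predict_proba(Xte)[:, 1]
    print(f"{frac:.0%}: ROC AUC={roc_auc_score(y_test, probs):.3f}, "
          f"F1={f1_score(y_test, clf.predict(Xte)):.3f}")
```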


Author(s):  
Muhammad Zulqarnain ◽  
Rozaida Ghazali ◽  
Muhammad Ghulam Ghouse ◽  
Muhammad Faheem Mushtaq

Text classification has become a serious problem for large organizations managing vast amounts of online data, and it has been extensively applied in Natural Language Processing (NLP) tasks. Text classification helps users effectively manage and exploit meaningful information by sorting it into various categories for further use. To classify texts as well as possible, our research develops a deep learning approach that obtains superior text classification performance compared with other RNN approaches. The main challenge in text classification is improving classification accuracy, and data sparsity and the sensitivity of semantics to context often hinder the classification performance of texts. To overcome these weaknesses, in this paper we propose a unified structure to investigate the effects of word embedding and the Gated Recurrent Unit (GRU) on text classification using two benchmark datasets (Google snippets and TREC). The GRU is a well-known type of recurrent neural network (RNN) capable of processing sequential data through its recurrent architecture. Empirically, semantically related words commonly lie near each other in embedding spaces. First, words in posts are converted into vectors via a word embedding technique. Then, the word sequences in sentences are fed to the GRU to extract the contextual semantics between words. The experimental results show that the proposed GRU model can effectively learn word usage in the context of texts given training data; the quantity and quality of the training data significantly affect the performance. We compared the proposed approach with traditional recurrent approaches (RNN, MV-RNN, and LSTM), and the proposed approach obtained better results on the two benchmark datasets in terms of accuracy and error rate.
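To make the architecture concrete, here is a minimal Keras sketch of a word-embedding layer feeding a GRU, as the abstract describes; the vocabulary size, embedding dimension, sequence length, and class count are illustrative assumptions, not values from the paper.

```python
# Word embedding -> GRU -> softmax classifier, as a hedged sketch.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_CLASSES = 20_000, 100, 50, 8  # assumed

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # words in posts become vectors
    layers.GRU(128),                          # GRU reads the word sequence
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.build(input_shape=(None, MAX_LEN))
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```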


Information ◽  
2019 ◽  
Vol 10 (4) ◽  
pp. 150 ◽  
Author(s):  
Kowsari ◽  
Jafari Meimandi ◽  
Heidarysafa ◽  
Mendu ◽  
Barnes ◽  
...  

In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their application in real-world problems are discussed.
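The survey's stages can be lined up in a single scikit-learn pipeline; the sketch below picks one representative from each family it covers (TF-IDF for feature extraction, truncated SVD for dimensionality reduction, a linear SVM as the classifier, a classification report for evaluation). The dataset choice is illustrative, not from the paper.

```python
# One concrete path through the survey's stages, on 20 Newsgroups.
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # text feature extraction
    TruncatedSVD(n_components=300),         # dimensionality reduction (LSA)
    LinearSVC(),                            # classification technique
)
pipe.fit(train.data, train.target)
print(classification_report(test.target, pipe.predict(test.data)))  # evaluation
```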


2020 ◽  
Vol 7 (1) ◽  
pp. 28-32
Author(s):  
Andre Rusli ◽  
Alethea Suryadibrata ◽  
Samiaji Bintang Nusantara ◽  
Julio Christian Young

The advancement of machine learning and natural language processing techniques holds essential opportunities to improve existing software engineering activities, including requirements engineering. Instead of manually reading all submitted user feedback to understand the evolving requirements of their product, developers could use an automatic text classification program to reduce the required effort. Many supervised machine learning approaches have already been used in many fields of text classification and show promising performance. This paper implements NLP techniques for basic text preprocessing, followed by traditional (non-deep-learning) machine learning classification algorithms: Logistic Regression, Decision Tree, Multinomial Naïve Bayes, K-Nearest Neighbors, Linear SVC, and Random Forest. Finally, the performance of each algorithm in classifying the feedback in our dataset into several categories is evaluated using three F1 score metrics: the macro-, micro-, and weighted-average F1 score. Results show that, in general, Logistic Regression is the most suitable classifier, followed by Linear SVC. However, the performance gap is not large, and with different configurations and requirements, other classifiers could perform equally well or even better.
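A compact version of this comparison can be sketched as follows; the feedback texts and labels are hypothetical placeholders, and real feedback datasets would of course be larger and messier.

```python
# Compare the paper's six classical classifiers with macro-, micro-, and
# weighted-average F1. The tiny 'texts'/'labels' dataset is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

texts = ["app crashes on login", "please add dark mode",
         "crash when uploading a photo", "would love offline support"] * 10
labels = ["bug", "feature", "bug", "feature"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)
vec = TfidfVectorizer()
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Multinomial NB": MultinomialNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVC": LinearSVC(),
    "Random Forest": RandomForestClassifier(),
}
for name, clf in classifiers.items():
    pred = clf.fit(Xtr, y_train).predict(Xte)
    scores = {avg: f1_score(y_test, pred, average=avg)
              for avg in ("macro", "micro", "weighted")}
    print(name, {k: round(v, 3) for k, v in scores.items()})
```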


Sequence classification is one of the in-demand research areas in Natural Language Processing (NLP). Classifying a set of images or texts into an appropriate category or class is a complex task that many Machine Learning (ML) models fail to accomplish accurately, ending up under-fitting the given dataset. Some of the ML algorithms used in text classification are KNN, Naïve Bayes, Support Vector Machines, Convolutional Neural Networks (CNNs), Recursive CNNs, Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), etc. For this experimental study, LSTM and a few other algorithms were chosen for comparison. The dataset used is the SMS Spam Collection Dataset from Kaggle, with 150 additional entries added from other sources. The two possible class labels are spam and ham. Each entry consists of the class label and a few sentences of text, followed by a few unused features that are eliminated. After converting the text to the required format, the models are run and then evaluated using various metrics. In these experiments, the LSTM gives much better classification accuracy than the other machine learning models: F1-scores in the high nineties were achieved using the LSTM, while the other models showed very low F1-scores and cosine similarities, indicating that they underperformed on the dataset. Another interesting observation is that the LSTM produced fewer false positives and false negatives than any other model.
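A minimal Keras sketch of such an LSTM spam/ham classifier follows; the two placeholder messages stand in for the SMS Spam Collection data, and all hyperparameters are illustrative assumptions.

```python
# LSTM over tokenized SMS text, with a confusion matrix to inspect
# false positives and false negatives. Data and settings are placeholders.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

texts = np.array(["win a free prize, text now", "are we still meeting at five"])
labels = np.array([1, 0])  # 1 = spam, 0 = ham

vectorize = layers.TextVectorization(max_tokens=10_000,
                                     output_sequence_length=50)
vectorize.adapt(texts)

model = models.Sequential([
    vectorize,                        # raw strings -> integer token sequences
    layers.Embedding(10_000, 64),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(texts, labels, epochs=2, verbose=0)

preds = (model.predict(texts) > 0.5).astype(int).ravel()
print(tf.math.confusion_matrix(labels, preds))  # off-diagonals are FP/FN
```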


2019 ◽  
Author(s):  
Ayoub Bagheri ◽  
Daniel Oberski ◽  
Arjan Sammani ◽  
Peter G.M. van der Heijden ◽  
Folkert W. Asselbergs

Background: With the increasing use of unstructured text in electronic health records, extracting useful related information has become a necessity. Text classification can be applied to extract patients' medical history from clinical notes. However, the sparsity of clinical short notes, that is, excessively small word counts in the text, can lead to large classification errors. Previous studies demonstrated that natural language processing (NLP) can be useful in the text classification of clinical outcomes. We propose incorporating knowledge from unlabeled data, as this may alleviate the problem of short, noisy, sparse text.

Results: The software package SALTClass (short and long text classifier) is a machine learning NLP toolkit. It uses seven clustering algorithms, namely latent Dirichlet allocation, K-Means, MiniBatch K-Means, BIRCH, MeanShift, DBSCAN, and GMM. Smoothing methods are applied to the resulting cluster information to enrich the representation of sparse text. For the subsequent prediction step, SALTClass can be used on either the original document-term matrix or in an enrichment pipeline. To this end, ten different supervised classifiers have also been integrated into SALTClass. We demonstrate the effectiveness of the SALTClass NLP toolkit in the identification of patients' family history in a Dutch clinical cardiovascular text corpus from University Medical Center Utrecht, the Netherlands.

Conclusions: The considerable amount of unstructured short text in healthcare applications, particularly in clinical cardiovascular notes, has created an urgent need for tools that can parse specific information from text reports. Using machine learning algorithms to enrich short text can improve the representation for further applications.

Availability: SALTClass can be downloaded as a Python package from the Python Package Index (PyPI) at https://pypi.org/project/saltclass and from GitHub at https://github.com/bagheria/saltclass.
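The enrichment idea can be sketched independently of the package; the snippet below is not the SALTClass API but a toy illustration of one of its ingredients: clustering a larger unlabeled corpus (here with K-Means) and appending cluster distances to the sparse document-term matrix of short notes. All texts are invented placeholders.

```python
# Toy enrichment of sparse short-text features with cluster information.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

unlabeled = ["father had a myocardial infarction",
             "no family history of heart disease",
             "mother treated for hypertension",
             "patient denies chest pain"]
short_notes = ["family history: father MI", "no cardiac complaints"]

vec = CountVectorizer()
X_unlabeled = vec.fit_transform(unlabeled)
X_notes = vec.transform(short_notes)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unlabeled)
# Distances to cluster centres act as dense smoothing features that
# enrich the otherwise very sparse document-term representation.
enriched = np.hstack([X_notes.toarray(), km.transform(X_notes)])
print(enriched.shape)  # original vocabulary columns + one per cluster
```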


Biostatistics ◽  
2020 ◽  
Author(s):  
W Katherine Tan ◽  
Patrick J Heagerty

Summary: Scalable and accurate identification of specific clinical outcomes has been enabled by machine learning applied to electronic medical record systems. The development of classification models requires the collection of a complete labeled data set, where true clinical outcomes are obtained by human expert manual review. For example, the development of natural language processing algorithms requires the abstraction of clinical text data to obtain outcome information necessary for training models. However, if the outcome is rare, then simple random sampling results in very few cases and insufficient information to develop accurate classifiers. Since large-scale detailed abstraction is often expensive, time-consuming, and not feasible, more efficient strategies are needed. Under such resource-constrained settings, we propose a class of enrichment sampling designs, where selection for abstraction is stratified by auxiliary variables related to the true outcome of interest. Stratified sampling on highly specific variables results in targeted samples that are more enriched with cases, which we show translates to increased model discrimination and better statistical learning performance. We provide mathematical details and simulation evidence that links sampling designs to their resulting prediction model performance. We discuss the impact of our proposed sampling on both model training and validation. Finally, we illustrate the proposed designs for outcome label collection and subsequent machine learning, using radiology report text data from the Lumbar Imaging with Reporting of Epidemiology study.
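The core of the proposed designs can be illustrated with a short simulation; the auxiliary flag, its prevalence, and the abstraction budget below are invented numbers, chosen only to show how stratified selection enriches the abstracted sample with likely cases.

```python
# Enrichment sampling sketch: stratify expensive manual abstraction on a
# highly specific auxiliary variable (e.g. a keyword match in report text).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
reports = pd.DataFrame({"aux_flag": rng.random(n) < 0.03})  # assumed prevalence

budget = 2_000  # records we can afford to abstract by hand
pos = reports[reports["aux_flag"]].sample(budget // 2, random_state=0)
neg = reports[~reports["aux_flag"]].sample(budget // 2, random_state=0)
sample = pd.concat([pos, neg])

# ~50% of the abstracted sample is auxiliary-positive, versus ~3% under
# simple random sampling, so far more true cases reach the labelled set.
print(sample["aux_flag"].mean())
```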


2017 ◽  
Vol 21 (3) ◽  
pp. 766-799 ◽  
Author(s):  
Vladimer B. Kobayashi ◽  
Stefan T. Mol ◽  
Hannah A. Berkers ◽  
Gábor Kismihók ◽  
Deanne N. Den Hartog

Organizations are increasingly interested in classifying texts or parts thereof into categories, as this enables more effective use of their information. Manual procedures for text classification work well for up to a few hundred documents. However, when the number of documents is larger, manual procedures become laborious, time-consuming, and potentially unreliable. Techniques from text mining facilitate the automatic assignment of text strings to categories, making classification expedient, fast, and reliable, which creates potential for its application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps, namely training data preparation, preprocessing, transformation, application of classification techniques, and validation, and provide concrete recommendations at each step. To help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and the associated output.
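The article's tutorial presents its code in R; purely as a hedged analogue, the same sequential steps look like this in Python with scikit-learn, using invented vacancy texts and labels.

```python
# The article's five steps, compressed: training data preparation,
# preprocessing, transformation, classification, validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# 1. Training data preparation: labelled example documents (hypothetical)
docs = ["nurse wanted for night shifts", "senior java developer",
        "junior python engineer", "registered nurse, part time"] * 5
labels = ["healthcare", "software", "software", "healthcare"] * 5

# 2.-3. Preprocessing and transformation: lowercasing, stop words, TF-IDF
# 4. Classification technique: multinomial naive Bayes
pipe = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                     MultinomialNB())

# 5. Validation: k-fold cross-validated accuracy
print(cross_val_score(pipe, docs, labels, cv=5).mean())
```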


2021 ◽  
Author(s):  
Vrushang Patel

Text classification is a classical machine learning application in Natural Language Processing, which aims to assign labels to textual units such as documents, sentences, paragraphs, and queries. Applications of text classification include sentiment classification and news categorization. Sentiment classification identifies the polarity of text, such as positive, negative, or neutral, based on textual features. In this thesis, we implemented a modified form of a tolerance-based algorithm (TSC) to classify the sentiment polarities of tweets as well as news categories from text. TSC is a supervised algorithm designed to perform short-text classification with tolerance near sets (TNS). The proposed TSC algorithm uses vectors from the pre-trained SBERT model to create tolerance classes. The effectiveness of the TSC algorithm has been demonstrated by testing it on ten well-researched datasets. One of the datasets (Covid-Sentiment) was hand-crafted from tweets expressing opinions related to COVID. Experiments demonstrate that TSC outperforms five classical ML algorithms on one dataset and is comparable on all other datasets, using a weighted F1-score measure.
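The tolerance relation at the heart of the approach can be sketched with SBERT embeddings; this is not the authors' TSC implementation, and the model name, tweets, and threshold are assumptions for illustration. Two texts fall into the same tolerance class when their embedding distance is within a chosen epsilon.

```python
# Tolerance relation on SBERT sentence vectors (illustrative, not TSC).
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

tweets = ["vaccines are rolling out fast",
          "the vaccine rollout is speeding up",
          "I cannot stand lockdowns anymore"]  # invented examples
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(tweets, normalize_embeddings=True)

eps = 0.35                    # assumed tolerance threshold
cos_dist = 1.0 - emb @ emb.T  # cosine distance between all pairs
tolerant = cos_dist <= eps    # True where two tweets are 'tolerably' close
print(tolerant)               # the first two tweets should pair up
```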


Author(s):  
Jeow Li Huan ◽  
Arif Ahmed Sekh ◽  
Chai Quek ◽  
Dilip K. Prasad

Text classification is one of the most widely used tasks in natural language processing. State-of-the-art text classifiers use the vector space model for extracting features. Recent progress in deep models, such as recurrent neural networks that preserve the positional relationship among words, achieves higher accuracy. To push text classification accuracy even higher, multi-dimensional document representations, such as vector sequences or matrices combined with document sentiment, should be explored. In this paper, we show that documents can be represented as a sequence of vectors carrying semantic meaning and classified using a recurrent neural network that recognizes long-range relationships. We show that in this representation, additional sentiment vectors can be easily attached as a fully connected layer to the word vectors to further improve classification accuracy. On the UCI sentiment labelled dataset, using the sequence of vectors alone achieved an accuracy of 85.6%, which is better than the 80.7% from a ridge regression classifier, the best among the classical techniques we tested. Additional sentiment information further increases accuracy to 86.3%. On our suicide notes dataset, the best classical technique, the Naïve Bayes Bernoulli classifier, achieves an accuracy of 71.3%, while our classifier, incorporating semantic and sentiment information, exceeds that at 75% accuracy.
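The attachment of sentiment information described above can be sketched with the Keras functional API; the dimensions and the single sigmoid output are illustrative assumptions rather than the paper's exact configuration.

```python
# Recurrent network over a sequence of word vectors, with a per-document
# sentiment vector concatenated before the final dense layer.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, WORD_DIM, SENT_DIM = 100, 300, 4  # assumed sizes

words = layers.Input(shape=(MAX_LEN, WORD_DIM))  # pre-computed word vectors
sentiment = layers.Input(shape=(SENT_DIM,))      # document sentiment features

h = layers.LSTM(128)(words)               # captures long-range relationships
h = layers.Concatenate()([h, sentiment])  # attach sentiment information
out = layers.Dense(1, activation="sigmoid")(h)

model = Model(inputs=[words, sentiment], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```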

