Text classification to streamline online wildlife trade analyses

PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254007
Author(s):  
Oliver C. Stringham ◽  
Stephanie Moncayo ◽  
Katherine G. W. Hill ◽  
Adam Toomes ◽  
Lewis Mitchell ◽  
...  

Automated monitoring of websites that trade wildlife is increasingly necessary to inform conservation and biosecurity efforts. However, e-commerce and wildlife trading websites can contain a vast number of advertisements, an unknown proportion of which may be irrelevant to researchers and practitioners. Given that many wildlife-trade advertisements have an unstructured text format, automated identification of relevant listings has not traditionally been possible, nor attempted. Other scientific disciplines have solved similar problems using machine learning and natural language processing models, such as text classifiers. Here, we test the ability of a suite of text classifiers to extract relevant advertisements from wildlife trade occurring on the Internet. We collected data from an Australian classifieds website where people can post advertisements of their pet birds (n = 16.5k advertisements). We found that text classifiers can predict, with a high degree of accuracy, which listings are relevant (ROC AUC ≥ 0.98, F1 score ≥ 0.77). Furthermore, in an attempt to answer the question ‘how much data is required to have an adequately performing model?’, we conducted a sensitivity analysis by simulating decreases in sample sizes to measure the subsequent change in model performance. From our sensitivity analysis, we found that text classifiers required a minimum sample size of 33% (c. 5.5k listings) to accurately identify relevant listings (for our dataset), providing a reference point for future applications of this sort. Our results suggest that text classification is a viable tool that can be applied to the online trade of wildlife to reduce time dedicated to data cleaning. However, the success of text classifiers will vary depending on the advertisements and websites, and will therefore be context dependent. Further work to integrate other machine learning tools, such as image classification, may provide better predictive abilities in the context of streamlining data processing for wildlife trade related online data.
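As an illustration of the approach the abstract describes, the following minimal sketch trains a TF-IDF text classifier and runs a sample-size sensitivity loop, scored with ROC AUC and F1. It is a hedged analogue, not the authors' code: the file name, column names, and fractions are hypothetical, and the paper evaluated a suite of classifiers rather than only logistic regression.

```python
# Minimal sketch of a relevance classifier plus a sample-size sensitivity
# loop. 'bird_ads.csv' and its columns 'text'/'relevant' are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

ads = pd.read_csv("bird_ads.csv")  # hypothetical file of advertisements
X_train, X_test, y_train, y_test = train_test_split(
    ads["text"], ads["relevant"], test_size=0.2, random_state=42)

vec = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

# Sensitivity analysis in the paper's spirit: retrain on shrinking
# fractions of the training data and watch performance degrade.
for frac in (1.0, 0.66, 0.33, 0.10):
    n = int(frac * Xtr.shape[0])
    clf = LogisticRegression(max_iter=1000).fit(Xtr[:n], y_train.iloc[:n])
    probs = clf.predict_proba(Xte)[:, 1]
    print(f"{frac:.0%}: ROC AUC={roc_auc_score(y_test, probs):.3f}, "
          f"F1={f1_score(y_test, clf.predict(Xte)):.3f}")
```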


Author(s):  
Muhammad Zulqarnain ◽  
Rozaida Ghazali ◽  
Muhammad Ghulam Ghouse ◽  
Muhammad Faheem Mushtaq

Text classification has become a serious problem for large organizations managing vast amounts of online data, and it has been extensively applied in Natural Language Processing (NLP) tasks. Text classification helps users effectively manage and exploit meaningful information by sorting it into various categories for further use. To classify texts as well as possible, our research develops a deep learning approach that obtains superior text classification performance compared with other RNN approaches. The main challenge in text classification is improving classification accuracy, and data sparsity and the sensitivity of semantics to context often hinder the classification performance of texts. To overcome these weaknesses, in this paper we propose a unified structure to investigate the effects of word embedding and the Gated Recurrent Unit (GRU) on text classification using two benchmark datasets (Google snippets and TREC). The GRU is a well-known type of recurrent neural network (RNN) capable of processing sequential data through its recurrent architecture. Empirically, semantically related words commonly lie near each other in embedding spaces. First, words in posts are converted into vectors via a word embedding technique. Then, the word sequences in sentences are fed to the GRU to extract the contextual semantics between words. The experimental results show that the proposed GRU model can effectively learn word usage in the context of texts given training data; the quantity and quality of the training data significantly affect the performance. We compared the proposed approach with traditional recurrent approaches (RNN, MV-RNN, and LSTM), and the proposed approach obtained better results on the two benchmark datasets in terms of accuracy and error rate.
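To make the architecture concrete, here is a minimal Keras sketch of a word-embedding layer feeding a GRU, as the abstract describes; the vocabulary size, embedding dimension, sequence length, and class count are illustrative assumptions, not values from the paper.

```python
# Word embedding -> GRU -> softmax classifier, as a hedged sketch.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_CLASSES = 20_000, 100, 50, 8  # assumed

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # words in posts become vectors
    layers.GRU(128),                          # GRU reads the word sequence
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.build(input_shape=(None, MAX_LEN))
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```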


Information ◽  
2019 ◽  
Vol 10 (4) ◽  
pp. 150 ◽  
Author(s):  
Kowsari ◽  
Jafari Meimandi ◽  
Heidarysafa ◽  
Mendu ◽  
Barnes ◽  
...  

In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their application in real-world problems are discussed.
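The survey's stages can be lined up in a single scikit-learn pipeline; the sketch below picks one representative from each family it covers (TF-IDF for feature extraction, truncated SVD for dimensionality reduction, a linear SVM as the classifier, a classification report for evaluation). The dataset choice is illustrative, not from the paper.

```python
# One concrete path through the survey's stages, on 20 Newsgroups.
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # text feature extraction
    TruncatedSVD(n_components=300),         # dimensionality reduction (LSA)
    LinearSVC(),                            # classification technique
)
pipe.fit(train.data, train.target)
print(classification_report(test.target, pipe.predict(test.data)))  # evaluation
```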


2020 ◽  
Vol 7 (1) ◽  
pp. 28-32
Author(s):  
Andre Rusli ◽  
Alethea Suryadibrata ◽  
Samiaji Bintang Nusantara ◽  
Julio Christian Young

The advancement of machine learning and natural language processing techniques holds essential opportunities to improve existing software engineering activities, including requirements engineering. Instead of manually reading all submitted user feedback to understand the evolving requirements of their product, developers could use an automatic text classification program to reduce the required effort. Many supervised machine learning approaches have already been used in many fields of text classification and show promising performance. This paper implements NLP techniques for basic text preprocessing, followed by traditional (non-deep-learning) machine learning classification algorithms: Logistic Regression, Decision Tree, Multinomial Naïve Bayes, K-Nearest Neighbors, Linear SVC, and Random Forest. Finally, the performance of each algorithm in classifying the feedback in our dataset into several categories is evaluated using three F1 score metrics: the macro-, micro-, and weighted-average F1 score. Results show that, in general, Logistic Regression is the most suitable classifier, followed by Linear SVC. However, the performance gap is not large, and with different configurations and requirements, other classifiers could perform equally well or even better.
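A compact version of this comparison can be sketched as follows; the feedback texts and labels are hypothetical placeholders, and real feedback datasets would of course be larger and messier.

```python
# Compare the paper's six classical classifiers with macro-, micro-, and
# weighted-average F1. The tiny 'texts'/'labels' dataset is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

texts = ["app crashes on login", "please add dark mode",
         "crash when uploading a photo", "would love offline support"] * 10
labels = ["bug", "feature", "bug", "feature"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)
vec = TfidfVectorizer()
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Multinomial NB": MultinomialNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVC": LinearSVC(),
    "Random Forest": RandomForestClassifier(),
}
for name, clf in classifiers.items():
    pred = clf.fit(Xtr, y_train).predict(Xte)
    scores = {avg: f1_score(y_test, pred, average=avg)
              for avg in ("macro", "micro", "weighted")}
    print(name, {k: round(v, 3) for k, v in scores.items()})
```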


Sequence classification is one of the in-demand research areas in Natural Language Processing (NLP). Classifying a set of images or texts into an appropriate category or class is a complex task that many Machine Learning (ML) models fail to accomplish accurately, ending up under-fitting the given dataset. Some of the ML algorithms used in text classification are KNN, Naïve Bayes, Support Vector Machines, Convolutional Neural Networks (CNNs), Recursive CNNs, Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), etc. For this experimental study, LSTM and a few other algorithms were chosen for comparison. The dataset used is the SMS Spam Collection Dataset from Kaggle, with 150 additional entries added from other sources. The two possible class labels are spam and ham. Each entry consists of the class label and a few sentences of text, followed by a few unused features that are eliminated. After converting the text to the required format, the models are run and then evaluated using various metrics. In these experiments, the LSTM gives much better classification accuracy than the other machine learning models: F1-scores in the high nineties were achieved using the LSTM, while the other models showed very low F1-scores and cosine similarities, indicating that they underperformed on the dataset. Another interesting observation is that the LSTM produced fewer false positives and false negatives than any other model.
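A minimal Keras sketch of such an LSTM spam/ham classifier follows; the two placeholder messages stand in for the SMS Spam Collection data, and all hyperparameters are illustrative assumptions.

```python
# LSTM over tokenized SMS text, with a confusion matrix to inspect
# false positives and false negatives. Data and settings are placeholders.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

texts = np.array(["win a free prize, text now", "are we still meeting at five"])
labels = np.array([1, 0])  # 1 = spam, 0 = ham

vectorize = layers.TextVectorization(max_tokens=10_000,
                                     output_sequence_length=50)
vectorize.adapt(texts)

model = models.Sequential([
    vectorize,                        # raw strings -> integer token sequences
    layers.Embedding(10_000, 64),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(texts, labels, epochs=2, verbose=0)

preds = (model.predict(texts) > 0.5).astype(int).ravel()
print(tf.math.confusion_matrix(labels, preds))  # off-diagonals are FP/FN
```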


2019 ◽  
Author(s):  
Ayoub Bagheri ◽  
Daniel Oberski ◽  
Arjan Sammani ◽  
Peter G.M. van der Heijden ◽  
Folkert W. Asselbergs

Background: With the increasing use of unstructured text in electronic health records, extracting useful related information has become a necessity. Text classification can be applied to extract patients' medical history from clinical notes. However, the sparsity of clinical short notes, that is, excessively small word counts in the text, can lead to large classification errors. Previous studies demonstrated that natural language processing (NLP) can be useful in the text classification of clinical outcomes. We propose incorporating knowledge from unlabeled data, as this may alleviate the problem of short, noisy, sparse text.

Results: The software package SALTClass (short and long text classifier) is a machine learning NLP toolkit. It uses seven clustering algorithms, namely latent Dirichlet allocation, K-Means, MiniBatch K-Means, BIRCH, MeanShift, DBSCAN, and GMM. Smoothing methods are applied to the resulting cluster information to enrich the representation of sparse text. For the subsequent prediction step, SALTClass can be used on either the original document-term matrix or in an enrichment pipeline. To this end, ten different supervised classifiers have also been integrated into SALTClass. We demonstrate the effectiveness of the SALTClass NLP toolkit in the identification of patients' family history in a Dutch clinical cardiovascular text corpus from University Medical Center Utrecht, the Netherlands.

Conclusions: The considerable amount of unstructured short text in healthcare applications, particularly in clinical cardiovascular notes, has created an urgent need for tools that can parse specific information from text reports. Using machine learning algorithms to enrich short text can improve the representation for further applications.

Availability: SALTClass can be downloaded as a Python package from the Python Package Index (PyPI) at https://pypi.org/project/saltclass and from GitHub at https://github.com/bagheria/saltclass.
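The enrichment idea can be sketched independently of the package; the snippet below is not the SALTClass API but a toy illustration of one of its ingredients: clustering a larger unlabeled corpus (here with K-Means) and appending cluster distances to the sparse document-term matrix of short notes. All texts are invented placeholders.

```python
# Toy enrichment of sparse short-text features with cluster information.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

unlabeled = ["father had a myocardial infarction",
             "no family history of heart disease",
             "mother treated for hypertension",
             "patient denies chest pain"]
short_notes = ["family history: father MI", "no cardiac complaints"]

vec = CountVectorizer()
X_unlabeled = vec.fit_transform(unlabeled)
X_notes = vec.transform(short_notes)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unlabeled)
# Distances to cluster centres act as dense smoothing features that
# enrich the otherwise very sparse document-term representation.
enriched = np.hstack([X_notes.toarray(), km.transform(X_notes)])
print(enriched.shape)  # original vocabulary columns + one per cluster
```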


Biostatistics ◽  
2020 ◽  
Author(s):  
W Katherine Tan ◽  
Patrick J Heagerty

Summary: Scalable and accurate identification of specific clinical outcomes has been enabled by machine learning applied to electronic medical record systems. The development of classification models requires the collection of a complete labeled data set, where true clinical outcomes are obtained by human expert manual review. For example, the development of natural language processing algorithms requires the abstraction of clinical text data to obtain outcome information necessary for training models. However, if the outcome is rare, then simple random sampling results in very few cases and insufficient information to develop accurate classifiers. Since large-scale detailed abstraction is often expensive, time-consuming, and not feasible, more efficient strategies are needed. Under such resource-constrained settings, we propose a class of enrichment sampling designs, where selection for abstraction is stratified by auxiliary variables related to the true outcome of interest. Stratified sampling on highly specific variables results in targeted samples that are more enriched with cases, which we show translates to increased model discrimination and better statistical learning performance. We provide mathematical details and simulation evidence that links sampling designs to their resulting prediction model performance. We discuss the impact of our proposed sampling on both model training and validation. Finally, we illustrate the proposed designs for outcome label collection and subsequent machine learning, using radiology report text data from the Lumbar Imaging with Reporting of Epidemiology study.
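The core of the proposed designs can be illustrated with a short simulation; the auxiliary flag, its prevalence, and the abstraction budget below are invented numbers, chosen only to show how stratified selection enriches the abstracted sample with likely cases.

```python
# Enrichment sampling sketch: stratify expensive manual abstraction on a
# highly specific auxiliary variable (e.g. a keyword match in report text).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
reports = pd.DataFrame({"aux_flag": rng.random(n) < 0.03})  # assumed prevalence

budget = 2_000  # records we can afford to abstract by hand
pos = reports[reports["aux_flag"]].sample(budget // 2, random_state=0)
neg = reports[~reports["aux_flag"]].sample(budget // 2, random_state=0)
sample = pd.concat([pos, neg])

# ~50% of the abstracted sample is auxiliary-positive, versus ~3% under
# simple random sampling, so far more true cases reach the labelled set.
print(sample["aux_flag"].mean())
```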


2017 ◽  
Vol 21 (3) ◽  
pp. 766-799 ◽  
Author(s):  
Vladimer B. Kobayashi ◽  
Stefan T. Mol ◽  
Hannah A. Berkers ◽  
Gábor Kismihók ◽  
Deanne N. Den Hartog

Organizations are increasingly interested in classifying texts or parts thereof into categories, as this enables more effective use of their information. Manual procedures for text classification work well for up to a few hundred documents. However, when the number of documents is larger, manual procedures become laborious, time-consuming, and potentially unreliable. Techniques from text mining facilitate the automatic assignment of text strings to categories, making classification expedient, fast, and reliable, which creates potential for its application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps, namely training data preparation, preprocessing, transformation, application of classification techniques, and validation, and provide concrete recommendations at each step. To help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and the associated output.
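The article's tutorial presents its code in R; purely as a hedged analogue, the same sequential steps look like this in Python with scikit-learn, using invented vacancy texts and labels.

```python
# The article's five steps, compressed: training data preparation,
# preprocessing, transformation, classification, validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# 1. Training data preparation: labelled example documents (hypothetical)
docs = ["nurse wanted for night shifts", "senior java developer",
        "junior python engineer", "registered nurse, part time"] * 5
labels = ["healthcare", "software", "software", "healthcare"] * 5

# 2.-3. Preprocessing and transformation: lowercasing, stop words, TF-IDF
# 4. Classification technique: multinomial naive Bayes
pipe = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                     MultinomialNB())

# 5. Validation: k-fold cross-validated accuracy
print(cross_val_score(pipe, docs, labels, cv=5).mean())
```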


2021 ◽  
Author(s):  
Vrushang Patel

Text classification is a classical machine learning application in Natural Language Processing, which aims to assign labels to textual units such as documents, sentences, paragraphs, and queries. Applications of text classification include sentiment classification and news categorization. Sentiment classification identifies the polarity of text, such as positive, negative, or neutral, based on textual features. In this thesis, we implemented a modified form of a tolerance-based algorithm (TSC) to classify the sentiment polarities of tweets as well as news categories from text. TSC is a supervised algorithm designed to perform short-text classification with tolerance near sets (TNS). The proposed TSC algorithm uses vectors from the pre-trained SBERT model to create tolerance classes. The effectiveness of the TSC algorithm has been demonstrated by testing it on ten well-researched datasets. One of the datasets (Covid-Sentiment) was hand-crafted from tweets expressing opinions related to COVID. Experiments demonstrate that TSC outperforms five classical ML algorithms on one dataset and is comparable on all other datasets, using a weighted F1-score measure.
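The tolerance relation at the heart of the approach can be sketched with SBERT embeddings; this is not the authors' TSC implementation, and the model name, tweets, and threshold are assumptions for illustration. Two texts fall into the same tolerance class when their embedding distance is within a chosen epsilon.

```python
# Tolerance relation on SBERT sentence vectors (illustrative, not TSC).
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

tweets = ["vaccines are rolling out fast",
          "the vaccine rollout is speeding up",
          "I cannot stand lockdowns anymore"]  # invented examples
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(tweets, normalize_embeddings=True)

eps = 0.35                    # assumed tolerance threshold
cos_dist = 1.0 - emb @ emb.T  # cosine distance between all pairs
tolerant = cos_dist <= eps    # True where two tweets are 'tolerably' close
print(tolerant)               # the first two tweets should pair up
```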


Author(s):  
Jeow Li Huan ◽  
Arif Ahmed Sekh ◽  
Chai Quek ◽  
Dilip K. Prasad

Text classification is one of the most widely used tasks in natural language processing. State-of-the-art text classifiers use the vector space model for extracting features. Recent progress in deep models, such as recurrent neural networks that preserve the positional relationship among words, achieves higher accuracy. To push text classification accuracy even higher, multi-dimensional document representations, such as vector sequences or matrices combined with document sentiment, should be explored. In this paper, we show that documents can be represented as a sequence of vectors carrying semantic meaning and classified using a recurrent neural network that recognizes long-range relationships. We show that in this representation, additional sentiment vectors can be easily attached as a fully connected layer to the word vectors to further improve classification accuracy. On the UCI sentiment labelled dataset, using the sequence of vectors alone achieved an accuracy of 85.6%, which is better than the 80.7% from a ridge regression classifier, the best among the classical techniques we tested. Additional sentiment information further increases accuracy to 86.3%. On our suicide notes dataset, the best classical technique, the Naïve Bayes Bernoulli classifier, achieves an accuracy of 71.3%, while our classifier, incorporating semantic and sentiment information, exceeds that at 75% accuracy.
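The attachment of sentiment information described above can be sketched with the Keras functional API; the dimensions and the single sigmoid output are illustrative assumptions rather than the paper's exact configuration.

```python
# Recurrent network over a sequence of word vectors, with a per-document
# sentiment vector concatenated before the final dense layer.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, WORD_DIM, SENT_DIM = 100, 300, 4  # assumed sizes

words = layers.Input(shape=(MAX_LEN, WORD_DIM))  # pre-computed word vectors
sentiment = layers.Input(shape=(SENT_DIM,))      # document sentiment features

h = layers.LSTM(128)(words)               # captures long-range relationships
h = layers.Concatenate()([h, sentiment])  # attach sentiment information
out = layers.Dense(1, activation="sigmoid")(h)

model = Model(inputs=[words, sentiment], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```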

