Reduction of Training Noises for Text Classifiers

Author(s):  
Rey-Long Liu
2016 ◽  
Vol 100 ◽  
pp. 137-144 ◽  
Author(s):  
Lungan Zhang ◽  
Liangxiao Jiang ◽  
Chaoqun Li ◽  
Ganggang Kong

Author(s):  
Han-joon Kim

This chapter introduces two practical techniques for improving Naïve Bayes text classifiers, which are widely used for text classification. Naïve Bayes has proven to be a practical text classification algorithm because of its simple classification model, reasonable classification accuracy, and ease of updating the model. Many researchers therefore have a strong incentive to improve Naïve Bayes by combining it with meta-learning approaches such as EM (Expectation Maximization) and Boosting. The EM approach combines Naïve Bayes with the EM algorithm, while the Boosting approach uses Naïve Bayes as the base classifier in the AdaBoost algorithm. Both approaches rely on an uncertainty measure tailored to Naïve Bayes learning. Within the Naïve Bayes learning framework, these approaches are expected to be practical solutions to the shortage of training documents in text classification systems.
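A minimal sketch of the Boosting variant, assuming scikit-learn ≥ 1.2 (where AdaBoostClassifier takes the `estimator` keyword) and a toy corpus in place of real training documents. It uses scikit-learn's standard AdaBoost reweighting rather than the chapter's special uncertainty measure, which the abstract does not specify:

```python
# A toy run of the Boosting approach: Naive Bayes as the weak learner inside
# AdaBoost, with bag-of-words features. The corpus and labels are invented.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["cheap meds online now", "meeting agenda attached",
        "win a free prize today", "quarterly report draft"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

boosted_nb = Pipeline([
    ("bow", CountVectorizer()),            # word-count features
    ("ada", AdaBoostClassifier(            # reweights misclassified documents
        estimator=MultinomialNB(),         # Naive Bayes as the base classifier
        n_estimators=50)),
])
boosted_nb.fit(docs, labels)
print(boosted_nb.predict(["free meds prize"]))
```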


2020 ◽  
Vol 11 ◽  
Author(s):  
Maria-Theodora Pandi ◽  
Peter J. van der Spek ◽  
Maria Koromina ◽  
George P. Patrinos

Text mining in biomedical literature is an emerging field that has already found applications in many research areas, including genetics, personalized medicine, and pharmacogenomics. In this study, we describe a novel text-mining approach for extracting pharmacogenomics associations. The code used to this end was implemented in the R programming language, either through custom scripts where needed or through functions from existing libraries. Articles (abstracts or full texts) matching a specified query were retrieved from PubMed, while concept annotations were derived from PubTator Central. Terms denoting a Mutation or a Gene, as well as Chemical terms corresponding to drug compounds, were normalized, and the sentences containing these terms were filtered and preprocessed to create appropriate training sets. Finally, after training and adequate hyperparameter tuning, four text classifiers were created and evaluated (FastText, linear-kernel SVMs, XGBoost, and Lasso and Elastic-Net regularized generalized linear models) with regard to their performance in identifying pharmacogenomics associations. Although further improvements are essential before this text-mining approach can be properly implemented in clinical practice, our study stands as a comprehensive, simplified, and up-to-date approach for identifying and assessing research articles enriched in clinically relevant pharmacogenomics relationships. Furthermore, this work highlights a series of challenges concerning the effective application of text mining to biomedical literature, whose resolution could substantially contribute to the further development of this field.
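The study's pipeline was implemented in R; the following is a hedged Python sketch of the final classification step only, with invented example sentences and labels standing in for the PubTator-derived training sets, and a linear SVM plus an elastic-net logistic regression standing in for the evaluated classifiers:

```python
# Toy comparison of two sentence classifiers for pharmacogenomics associations.
# Sentences and labels are illustrative; the real training sets come from
# PubMed articles annotated by PubTator Central.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "CYP2C19 variants reduce clopidogrel activation.",          # association
    "The gene was sequenced in all study participants.",        # no association
    "VKORC1 polymorphisms alter warfarin dose requirements.",   # association
    "Patients received standard chemotherapy.",                 # no association
]
labels = [1, 0, 1, 0]

candidates = {
    "linear_svm": make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC()),
    "elastic_net_logreg": make_pipeline(   # rough analogue of glmnet's elastic net
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, max_iter=5000)),
}
for name, model in candidates.items():
    print(name, cross_val_score(model, sentences, labels, cv=2, scoring="f1").mean())
```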


2020 ◽  
Vol 110 (3) ◽  
pp. 357-362 ◽  
Author(s):  
Jon-Patrick Allem ◽  
Patricia Escobedo ◽  
Likhit Dharmapuri

Objectives. To use publicly accessible data from people who post to Twitter to rapidly capture and describe the public’s recent experiences with cannabis. Methods. We obtained Twitter posts containing cannabis-related terms from May 1, 2018, to December 31, 2018. We used methods to distinguish posts from social bots from those of nonbots, and we used text classifiers to identify topics in posts (n = 60 861). Results. Prevalent topics included cannabis use, with mentions of cannabis initiation; processed cannabis products; and health and medical themes, with posts suggesting that cannabis could help with cancer, sleep, pain, anxiety, depression, trauma, and posttraumatic stress disorder. Polysubstance use was a common topic, with mentions of cocaine, heroin, ecstasy, LSD, meth, mushrooms, and Xanax alongside cannabis. Social bots regularly made health claims about cannabis. Conclusions. Findings suggest that processed cannabis products, unsubstantiated health claims about cannabis products, and the co-use of cannabis with legal and illicit substances warrant consideration by public health researchers in the future.
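A hedged illustration of the topic-classification step with a simple supervised text classifier; the example posts, topic labels, and model are placeholders, since the abstract does not describe the classifiers actually used:

```python
# Toy topic classifier for cannabis-related posts. Posts and topic labels are
# invented; the study's actual labeling scheme is not described in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "weed helped my anxiety and sleep so much",
    "tried edibles and xanax at the party last night",
    "first time smoking, any tips for a beginner?",
    "cbd oil fixed my back pain, no more meds",
]
topics = ["health_medical", "polysubstance_use",
          "cannabis_initiation", "health_medical"]

topic_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
topic_clf.fit(posts, topics)
print(topic_clf.predict(["does weed help with ptsd?"]))
```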


2016 ◽  
Vol 28 (13) ◽  
pp. 3691-3706 ◽  
Author(s):  
Jiayu Han ◽  
Wanli Zuo ◽  
Lu Liu ◽  
Yuanbo Xu ◽  
Tao Peng

PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254007
Author(s):  
Oliver C. Stringham ◽  
Stephanie Moncayo ◽  
Katherine G. W. Hill ◽  
Adam Toomes ◽  
Lewis Mitchell ◽  
...  

Automated monitoring of websites that trade wildlife is increasingly necessary to inform conservation and biosecurity efforts. However, e-commerce and wildlife-trading websites can contain a vast number of advertisements, an unknown proportion of which may be irrelevant to researchers and practitioners. Given that many wildlife-trade advertisements have an unstructured text format, automated identification of relevant listings has not traditionally been possible, nor attempted. Other scientific disciplines have solved similar problems using machine learning and natural language processing models, such as text classifiers. Here, we test the ability of a suite of text classifiers to extract relevant advertisements from wildlife trade occurring on the Internet. We collected data from an Australian classifieds website where people can post advertisements of their pet birds (n = 16.5k advertisements). We found that text classifiers can predict, with a high degree of accuracy, which listings are relevant (ROC AUC ≥ 0.98, F1 score ≥ 0.77). Furthermore, to answer the question ‘how much data is required to have an adequately performing model?’, we conducted a sensitivity analysis by simulating decreases in sample size and measuring the subsequent change in model performance. From this sensitivity analysis, we found that the text classifiers required a minimum sample size of 33% of the data (c. 5.5k listings) to accurately identify relevant listings for our dataset, providing a reference point for future applications of this sort. Our results suggest that text classification is a viable tool that can be applied to the online trade of wildlife to reduce the time dedicated to data cleaning. However, the success of text classifiers will vary depending on the advertisements and websites, and will therefore be context dependent. Further work to integrate other machine learning tools, such as image classification, may provide better predictive abilities in the context of streamlining data processing for wildlife trade-related online data.
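A hedged sketch of the sample-size sensitivity analysis described above, written in Python with placeholder names (`ads`, `relevant`) and a TF-IDF plus logistic-regression pipeline standing in for whichever classifiers the study actually evaluated:

```python
# Hedged sketch: shrink the labeled training set to fixed fractions, retrain,
# and track how ROC AUC and F1 degrade. `ads` (listing texts) and `relevant`
# (0/1 labels) are placeholders for the real annotated advertisements.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def sensitivity_curve(ads, relevant, fractions=(1.0, 0.66, 0.33, 0.1), seed=0):
    """Return (fraction, ROC AUC, F1) for progressively smaller training sets."""
    X_train, X_test, y_train, y_test = train_test_split(
        ads, relevant, test_size=0.25, stratify=relevant, random_state=seed)
    results = []
    for frac in fractions:
        if frac < 1.0:
            # Stratified subsample so both classes survive at small fractions.
            X_sub, _, y_sub, _ = train_test_split(
                X_train, y_train, train_size=frac,
                stratify=y_train, random_state=seed)
        else:
            X_sub, y_sub = X_train, y_train
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(X_sub, y_sub)
        scores = model.predict_proba(X_test)[:, 1]
        results.append((frac,
                        roc_auc_score(y_test, scores),
                        f1_score(y_test, (scores > 0.5).astype(int))))
    return results

# Example (hypothetical data): sensitivity_curve(ad_texts, is_relevant_labels)
```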

