Text Classification Model Enhanced by Unlabeled Data for LaTeX Formula

2021 ◽  
Vol 11 (22) ◽  
pp. 10536
Author(s):  
Hua Cheng ◽  
Renjie Yu ◽  
Yixin Tang ◽  
Yiquan Fang ◽  
Tao Cheng

Generic language models pretrained on large, unspecific corpora are currently the foundation of NLP. Labeled data are limited in most model training due to the cost of manual annotation, especially in domains with numerous proper nouns, such as mathematics and biology, which affects the accuracy and robustness of model predictions. However, directly applying a generic language model to a specific domain does not work well. This paper introduces a BERT-based text classification model enhanced by unlabeled data (UL-BERT) in the LaTeX formula domain. A two-stage pretraining model based on BERT (TP-BERT) is pretrained on unlabeled data from the LaTeX formula domain. A double-prediction pseudo-labeling (DPP) method is introduced to obtain high-confidence pseudo-labels for unlabeled data by self-training. Moreover, a multi-round teacher-student training approach is proposed for training UL-BERT with few labeled data and a larger amount of pseudo-labeled unlabeled data. Experiments on classification in the LaTeX formula domain show that UL-BERT significantly improves classification accuracy, raising the F1 score by up to 2.76%, while requiring fewer resources for model training. We conclude that our method may be applicable to other specific domains with abundant unlabeled data and limited labeled data.
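The DPP filter and teacher-student loop are only described at a high level in the abstract; a minimal sketch of how a double-prediction filter might be implemented (all names, thresholds, and the agreement criterion are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def select_pseudo_labels(probs_a, probs_b, threshold=0.9):
    """probs_a, probs_b: (n_samples, n_classes) class-probability arrays from
    two prediction passes over the same unlabeled texts (e.g., a teacher model
    run under two different dropout seeds). Keep an example only if both
    passes agree on the label and both are confident."""
    labels_a = probs_a.argmax(axis=1)
    labels_b = probs_b.argmax(axis=1)
    confidence = np.minimum(probs_a.max(axis=1), probs_b.max(axis=1))
    keep = (labels_a == labels_b) & (confidence >= threshold)
    return np.flatnonzero(keep), labels_a[keep]

# Multi-round teacher-student training, schematically:
# for round_ in range(num_rounds):
#     teacher = fine_tune(tp_bert, labeled_data + pseudo_labeled_data)
#     idx, labels = select_pseudo_labels(teacher.predict_proba(unlabeled),
#                                        teacher.predict_proba(unlabeled))
#     pseudo_labeled_data = [(unlabeled[i], labels[j]) for j, i in enumerate(idx)]
```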

2020 ◽  
Author(s):  
Patricia Balthazar ◽  
Scott Jeffery Lee ◽  
Daniel Rubin ◽  
Terry Dessar ◽  
Judy Gichoya ◽  
...  

Transfer learning is a common practice in image classification with deep learning, where the available data are often too limited to train a complex model with millions of parameters. However, transferring language models requires special attention, since cross-domain vocabularies (e.g., news articles versus radiology reports) do not always overlap in the way that pixel intensity ranges largely do for images. We present a concept of similar-domain adaptation in which we transfer an inter-institutional language model between two different modalities (ultrasound to MRI) to capture liver abnormalities. Our experiments show that such a transfer is more effective for the shared target task than transfer from a generic language space. We use MRI screening exam reports for hepatocellular carcinoma as the use case and apply the transfer strategy to automatically label thousands of imaging exams.
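No code accompanies the abstract; a minimal sketch of the transfer step using the Hugging Face API, with a hypothetical checkpoint name standing in for the ultrasound-adapted model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical local checkpoint: a model already fine-tuned on ultrasound reports.
ckpt = "liver-ultrasound-report-model"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

# MRI fine-tuning then proceeds as usual, but starts from the
# ultrasound-adapted weights instead of a generic pretrained checkpoint.
```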


2021 ◽  
Author(s):  
V. S. Martins ◽  
C. D. Silva

Automatic text classification represents a great improvement in legal workflows, mainly in the migration from physical to electronic lawsuits. A systematic review of studies on text classification in the legal area from January 2017 to February 2020 was conducted. The search strategy identified 20 studies, which were analyzed and compared. The review is guided by the following research questions: what are the state-of-the-art language models; how has text classification been applied to English and Brazilian Portuguese legal datasets; are there language models trained on Brazilian Portuguese; and are there datasets for the Brazilian legal area. It concludes that automatic text classification is applied in Brazil, although there is a gap in the use of language models when compared with studies on English-language datasets; that in-domain pre-training of language models is important for improving results; that two studies make Brazilian Portuguese language models available; and that one study introduces a dataset for the Brazilian legal area.


Information ◽  
2021 ◽  
Vol 12 (8) ◽  
pp. 331
Author(s):  
Georgios Alexandridis ◽  
Iraklis Varlamis ◽  
Konstantinos Korovesis ◽  
George Caridakis ◽  
Panagiotis Tsantilas

As the amount of content created on social media constantly increases, more and more opinions and sentiments are expressed by people on various subjects. In this respect, sentiment analysis and opinion mining techniques can be valuable for the automatic analysis of huge textual corpora (comments, reviews, tweets, etc.). Despite the advances in text mining algorithms, deep learning techniques, and text representation models, results in such tasks are very good only for a few high-density languages (e.g., English) that possess large training corpora and rich linguistic resources; there is still room for improvement in the lower-density languages. In this direction, the current work employs various language models for representing social media texts and text classifiers in the Greek language to detect the polarity of opinions expressed on social media. The experimental results on a related dataset collected by the authors of the current work are promising: various classifiers based on the language models (naïve Bayes, random forests, support vector machines, logistic regression, deep feed-forward neural networks) outperform classifiers based on word or sentence embeddings (word2vec, GloVe), achieving a classification accuracy of more than 80%. Additionally, a new language model for Greek social media has been trained on the aforementioned dataset, demonstrating that language models based on domain-specific corpora can improve the performance of generic language models by a margin of 2%. Finally, the resulting models are made freely available to the research community.
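The abstract does not specify implementation details; a minimal sketch of the classifier comparison using scikit-learn, assuming `X` holds precomputed language-model text vectors and `y` the polarity labels (both hypothetical placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def compare_classifiers(X, y):
    """X: (n_texts, dim) language-model embeddings; y: polarity labels."""
    for clf in (GaussianNB(),
                RandomForestClassifier(n_estimators=200),
                LinearSVC(),
                LogisticRegression(max_iter=1000)):
        acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
        print(f"{type(clf).__name__}: {acc:.3f}")  # mean 5-fold accuracy
```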


Author(s):  
Yinghui Yang ◽  
Balaji Padmanabhan

Classification is a form of data analysis that can be used to extract models to predict categorical class labels (Han & Kamber, 2001). Data classification has proven to be very useful in a wide variety of applications. For example, a classification model can be built to categorize bank loan applications as either safe or risky. In order to build a classification model, training data containing multiple independent variables and a dependent variable (the class label) are needed. If a data record has a known value for its class label, the record is termed “labeled”; if the value is unknown, it is “unlabeled”. There are situations with a large amount of unlabeled data and a small amount of labeled data. Using only labeled data to build classification models can ignore useful information contained in the unlabeled data. Furthermore, unlabeled data are often much cheaper and more plentiful than labeled data, so if useful information can be extracted from them that reduces the need for labeled examples, this can be a significant benefit (Balcan & Blum, 2005). The default practice is to use only the labeled data to build a classification model and then assign class labels to the unlabeled data. However, when the amount of labeled data is insufficient, a classification model built only on the labeled data can be biased and far from accurate, and the class labels it assigns to the unlabeled data can then be inaccurate. How to leverage the information contained in the unlabeled data to improve the accuracy of the classification model is an important research question. There are two streams of research that address this challenge; the details are discussed below.
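One common instantiation of this idea is self-training; a minimal sketch, assuming precomputed feature matrices and integer class labels (scikit-learn also ships a `SelfTrainingClassifier` wrapper for the same pattern):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    """Grow the labeled set with confident predictions on unlabeled rows."""
    X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_rounds):
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        if len(pool) == 0:
            break
        probs = clf.predict_proba(pool)
        keep = probs.max(axis=1) >= threshold      # confident rows only
        if not keep.any():
            break
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, clf.classes_[probs[keep].argmax(axis=1)]])
        pool = pool[~keep]                         # shrink the unlabeled pool
    return clf
```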


2020 ◽  
Vol 34 (10) ◽  
pp. 13773-13774
Author(s):  
Shumin Deng ◽  
Ningyu Zhang ◽  
Zhanlin Sun ◽  
Jiaoyan Chen ◽  
Huajun Chen

Text classification tends to be difficult when data are deficient or when the model must adapt to unseen classes. In such challenging scenarios, recent studies have often used meta-learning to simulate the few-shot task, thus neglecting implicit common linguistic features across tasks. This paper addresses such problems using meta-learning and unsupervised language models. Our approach is based on the insight that good generalization from a few examples relies on both a generic model initialization and an effective strategy for adapting this model to newly arising tasks. We show that our approach is not only simple but also produces state-of-the-art performance on a well-studied sentiment classification dataset. This further suggests that pretraining could be a promising solution for few-shot learning of many other NLP tasks. The code and the dataset to replicate the experiments are available at https://github.com/zxlzr/FewShotNLP.
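The episodic adapt-then-evaluate pattern underlying such few-shot approaches can be sketched generically in PyTorch; this is an illustration of the general idea, not the paper's released code (see the repository above for that):

```python
import torch

def adapt_and_eval(model, loss_fn, support, query, inner_steps=5, lr=1e-5):
    """One few-shot episode: adapt a pretrained model on the K-shot support
    set, then report accuracy on the query set. `support` and `query` are
    (inputs, labels) tensor pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    x, y = support
    for _ in range(inner_steps):                  # inner-loop adaptation
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    model.eval()
    with torch.no_grad():                         # query-set evaluation
        xq, yq = query
        pred = model(xq).argmax(dim=-1)
        return (pred == yq).float().mean().item()
```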


2021 ◽  
pp. 016555152098550
Author(s):  
Alaettin Uçan ◽  
Murat Dörterler ◽  
Ebru Akçapınar Sezer

Emotion classification is a research field that aims to detect the emotions in a text using machine learning methods. In traditional machine learning (TML) methods, feature engineering causes the loss of some meaningful information, negatively affecting classification performance. In addition, the success of deep learning (DL) approaches depends on the sample size, and more samples are needed for Turkish due to the unique characteristics of the language; however, emotion classification datasets in Turkish are quite limited. In this study, the pretrained language model approach was used to create a stronger emotion classification model for Turkish. Well-known pretrained language models were fine-tuned for this purpose, and their performance on Turkish emotion classification was comprehensively compared with that of TML and DL methods in experimental studies. The proposed approach provides state-of-the-art performance for Turkish emotion classification.
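The abstract does not name the checkpoints used; a minimal fine-tuning sketch with one publicly available Turkish BERT model (the hyperparameters and the number of emotion classes here are illustrative assumptions):

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

name = "dbmdz/bert-base-turkish-cased"  # one public Turkish checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=4)

args = TrainingArguments(output_dir="emotion-tr", num_train_epochs=3,
                         per_device_train_batch_size=16)
# With tokenized `train_ds` / `val_ds` datasets prepared:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```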


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Mohammed Ali Al-Garadi ◽  
Yuan-Chi Yang ◽  
Haitao Cai ◽  
Yucheng Ruan ◽  
Karen O’Connor ◽  
...  

Abstract
Background: Prescription medication (PM) misuse/abuse has emerged as a national crisis in the United States, and social media has been suggested as a potential resource for performing active monitoring. However, automating a social media-based monitoring system is challenging, requiring advanced natural language processing (NLP) and machine learning methods. In this paper, we describe the development and evaluation of automatic text classification models for detecting self-reports of PM abuse from Twitter.
Methods: We experimented with state-of-the-art bidirectional transformer-based language models, which utilize tweet-level representations that enable transfer learning (e.g., BERT, RoBERTa, XLNet, ALBERT, and DistilBERT), proposed fusion-based approaches, and compared the developed models with several traditional machine learning and deep learning approaches. Using a public dataset, we evaluated the classifiers on their ability to classify the non-majority “abuse/misuse” class.
Results: Our proposed fusion-based model performs significantly better than the best traditional model (F1-score [95% CI]: 0.67 [0.64–0.69] vs. 0.45 [0.42–0.48]). We illustrate, via experimentation with varying training set sizes, that the transformer-based models are more stable and require less annotated data than the other models. The significant improvements achieved by our best-performing classification model over past approaches make it suitable for automated continuous monitoring of nonmedical PM use from Twitter.
Conclusions: BERT, BERT-like and fusion-based models outperform traditional machine learning and deep learning models, achieving substantial improvements over many years of past research on classifying prescription medication misuse/abuse from social media, a task shown to be complex due to the unique ways in which information about nonmedical use is presented. Several challenges associated with the lack of context and the nature of social media language need to be overcome to further improve BERT and BERT-like models. These experimentally driven challenges represent potential future research directions.
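The exact fusion architecture is not detailed in the abstract; one common fusion baseline is averaging predicted probabilities across the fine-tuned transformers, sketched here with hypothetical inputs:

```python
import numpy as np

def fuse_probabilities(prob_list, weights=None):
    """prob_list: list of (n_samples, n_classes) probability arrays, one per
    fine-tuned model (e.g., BERT, RoBERTa, XLNet). Returns fused labels from
    a (optionally weighted) average of the class probabilities."""
    probs = np.average(np.stack(prob_list), axis=0, weights=weights)
    return probs.argmax(axis=1)
```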


2020 ◽  
Author(s):  
Alireza Roshanzamir ◽  
Hamid Aghajan ◽  
Mahdieh Soleymani Baghshah

Abstract
Background: We developed transformer-based deep learning models based on natural language processing for the early diagnosis of Alzheimer’s disease (AD) from the picture description test.
Methods: The lack of large datasets poses the most important limitation to using complex models that do not require feature engineering. Transformer-based pre-trained deep language models have recently made a large leap in NLP research and application. These models are pre-trained on available large datasets to understand natural language texts appropriately, and have been shown to subsequently perform well on classification tasks with small training sets. The overall classification model is a simple classifier on top of the pre-trained deep language model.
Results: The models are evaluated on picture description test transcripts of the Pitt corpus, which contains data from 170 AD patients with 257 interviews and 99 healthy controls with 243 interviews. The large bidirectional encoder representations from transformers (BERT-Large) embedding with a logistic regression classifier achieves a classification accuracy of 88.08%, which improves the state-of-the-art by 2.48%.
Conclusions: Using pre-trained language models can improve AD prediction. This not only addresses the lack of sufficiently large datasets but also reduces the need for expert-defined features.
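The described pipeline (a fixed BERT embedding feeding a logistic regression classifier) can be sketched as follows; the pooling choice ([CLS] vector) and checkpoint name are assumptions, since the abstract does not specify them:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

name = "bert-large-uncased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name).eval()

@torch.no_grad()
def embed(texts, batch_size=8):
    """Return one fixed vector per transcript: the last-layer [CLS] state."""
    vecs = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], padding=True,
                        truncation=True, return_tensors="pt")
        vecs.append(bert(**enc).last_hidden_state[:, 0])
    return torch.cat(vecs).numpy()

# With transcript lists and diagnosis labels prepared:
# clf = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)
```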


2020 ◽  
Vol 2 (4) ◽  
pp. 453-468 ◽  
Author(s):  
Natraj Raman ◽  
Grace Bang ◽  
Armineh Nourbakhsh

The integration of Environmental, Social and Governance (ESG) considerations into business decisions and investment strategies has accelerated over the past few years. It is important to quantify the extent to which ESG-related conversations are carried out by companies so that their impact on business operations can be objectively assessed. However, profiling ESG language is challenging due to its multi-faceted nature and the lack of supervised datasets. This study aims to detect historical trends in ESG discussions by analyzing the transcripts of corporate earnings calls. The proposed solution exploits recent advances in neural language modeling to understand the linguistic structure of ESG discourse. First, we develop a classification model that categorizes the relevance of a text sentence to ESG, fine-tuning a pre-trained language model on a small corporate sustainability reports dataset for this purpose. The semantic knowledge encoded in this classification model is then leveraged by applying it to the sentences in the conference transcripts using a novel distant-supervision approach. Extensive empirical evaluations against various pretraining techniques demonstrate the efficacy of the proposed transfer learning framework. Our analysis indicates that over the last 5 years, nearly 15% of the discussions during earnings calls pertained to ESG, implying that ESG factors are integral to business strategy.
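Applied at the sentence level and aggregated per call, the labeling step might look like the following sketch; `classify` stands in for the fine-tuned relevance model's inference function and is not from the paper:

```python
def esg_share(transcript_sentences, classify, threshold=0.5):
    """Fraction of sentences in one earnings call deemed ESG-relevant.
    `classify(sentence)` is assumed to return P(ESG | sentence)."""
    scores = [classify(s) for s in transcript_sentences]
    relevant = sum(p >= threshold for p in scores)
    return relevant / max(len(transcript_sentences), 1)
```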


Sensors ◽  
2020 ◽  
Vol 21 (1) ◽  
pp. 133
Author(s):  
Marco Pota ◽  
Mirko Ventura ◽  
Rosario Catelli ◽  
Massimo Esposito

Over the last decade, industrial and academic communities have increased their focus on sentiment analysis techniques, especially as applied to tweets. State-of-the-art results have recently been achieved using language models trained from scratch on corpora made up exclusively of tweets, in order to better handle the Twitter jargon. This work introduces a different, two-step approach to Twitter sentiment analysis. First, the tweet jargon, including emojis and emoticons, is transformed into plain text, exploiting procedures that are language-independent or easily applicable to different languages. Second, the resulting tweets are classified with the language model BERT, pre-trained on plain text instead of tweets, for two reasons: (1) pre-trained models on plain text are easily available in many languages, avoiding resource- and time-consuming model training directly on tweets from scratch; (2) available plain-text corpora are larger than tweet-only ones, therefore allowing better performance. A case study describing the application of the approach to Italian is presented, together with a comparison with other existing Italian solutions. The results show the effectiveness of the approach and indicate that, thanks to its general methodological basis, it is also promising for other languages.
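The first step (jargon-to-plain-text normalization) can be sketched with the `emoji` package; the emoticon map below is a tiny illustrative subset, not the paper's full procedure:

```python
import re
import emoji  # pip install emoji

# Hypothetical emoticon-to-text map; a real one would cover far more cases.
EMOTICONS = {":)": "happy face", ":(": "sad face", ":D": "laughing face"}

def normalize_tweet(text):
    """Replace emojis and emoticons with plain-text descriptions."""
    text = emoji.demojize(text, delimiters=(" ", " "))  # e.g. grinning_face
    for emoticon, plain in EMOTICONS.items():
        text = text.replace(emoticon, f" {plain} ")
    return re.sub(r"\s+", " ", text).strip()

print(normalize_tweet("Che bello! 😀 :)"))  # -> "Che bello! grinning_face happy face"
```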

