automated text classification
Recently Published Documents





Sergio Canuto ◽  
Marcos André Gonçalves ◽  
Thierson Couto Rosa

Defining a set of informative features that can represent and discriminate documents is paramount for automatic document classification. In this doctoral dissertation, we present the most comprehensive study so far on the role of meta-features (high-level features built from lower-level ones) as an alternative for representing documents. We start by proposing new sets of (meta-)features that exploit distance measures in the original (bag-of-words) feature space to summarize potentially complex relationships between documents. We then (i) analyze the discriminative power of such meta-features with novel multi-objective feature selection strategies; (ii) provide new GPU implementations to reduce computational time; (iii) enrich distance relationships with labeled or context-specific information; and (iv) adapt the proposed meta-features to tasks as hard as sentiment analysis. Our experimental results show that, by exploiting distances, our meta-features achieve remarkable classification results and are state of the art in many situations and scenarios.
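To make the idea of distance-based meta-features concrete, here is a minimal sketch in Python. It is not the dissertation's method: the abstract does not define the exact features, so the choices below (cosine similarity, the `k` most similar training documents per class, and a per-class centroid similarity) are assumptions made for illustration, and the names `bow`, `cosine`, and `meta_features` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bow(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def meta_features(doc, train, k=2):
    """Return, for each class in sorted order, the k highest similarities
    to training documents of that class followed by the similarity to the
    class centroid. This feature layout is an illustrative assumption."""
    by_class = defaultdict(list)
    for text, label in train:
        by_class[label].append(bow(text))

    x = bow(doc)
    feats = []
    for label in sorted(by_class):
        vecs = by_class[label]
        sims = sorted((cosine(x, v) for v in vecs), reverse=True)[:k]
        sims += [0.0] * (k - len(sims))  # pad classes with fewer than k docs
        centroid = Counter()
        for v in vecs:
            centroid.update(v)  # summed counts; cosine ignores the scale
        feats.extend(sims + [cosine(x, centroid)])
    return feats
```

With two classes and `k=2`, each document is mapped from a sparse vocabulary-sized vector to six dense values, which a downstream classifier could consume in place of, or alongside, the original bag-of-words representation.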

Gleb Danilov ◽  
Timur Ishankulov ◽  
Konstantin Kotik ◽  
Yuriy Orlov ◽  
Mikhail Shifrin

Automated text classification is a natural language processing (NLP) technology that could significantly facilitate scientific literature selection. A specific topical dataset of 630 article abstracts was obtained from the PubMed database. We proposed 27 parametrized variants of the PubMedBERT model and 4 ensemble models to solve a binary classification task on that dataset. Three hundred tests with resampling were performed for each classification approach. The best PubMedBERT model achieved an F1-score of 0.857, while the best ensemble model reached an F1-score of 0.853. We concluded that the classification quality for short scientific texts might be improved using the latest state-of-the-art approaches.
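The F1-score reported above is the harmonic mean of precision and recall for the positive class. The sketch below computes it for a binary task and combines several models by majority vote. The abstract does not say how the ensembles were built, so majority voting is an assumption here, and the function names are hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_vote(model_predictions):
    """Combine binary predictions from several models.

    model_predictions holds one list of 0/1 labels per model. An item is
    labeled 1 only when strictly more than half of the models vote 1
    (an assumed tie-breaking rule)."""
    n_models = len(model_predictions)
    return [
        1 if sum(votes) * 2 > n_models else 0
        for votes in zip(*model_predictions)
    ]
```

Repeating the evaluation over many resampled train/test splits, as the study did, would amount to calling `f1_score` once per split and summarizing the resulting distribution.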

2020 ◽  
Jae Yeon Kim ◽  
Andrew Thompson

In this study, we used a natural experiment and machine learning to examine how threats prompt information seeking among marginalized populations. We traced how the September 11 attacks, an exogenous shock, increased the interest of Arab and Indian Americans in U.S. domestic politics. We classified 5,684 Arab American and Indian American newspaper articles using machine learning and estimated that three more articles on U.S. domestic politics were published daily in the post-9/11 period than in previous years. While the natural experiment design identifies the causal relationship between the intervention and the outcome variation, automated text classification creates the data essential for such causal identification. This project also provides an accompanying R package that makes collecting data from the largest database of ethnic newspapers published in the U.S. easier and faster.

2020 ◽  
Jae Yeon Kim

The voices of racial minority groups have rarely been examined systematically with large-scale text analysis in political science. This study fills that gap by applying an integrated classification framework to analyze the commonalities and differences in political issues that appeared in 78,305 articles from Asian American and African American newspapers from the 1960s to the 1980s. The automated text classification shows that Asian American newspapers focused on promoting collective gains more often than African American newspapers. Conversely, African American newspapers concentrated on preventing collective losses more than Asian American newspapers. The content analysis demonstrates that the issue priorities varied between the corpora, especially with respect to policy contexts. Gaining access to government resources was a more urgent issue for Asian Americans, while reducing or ending state violence, such as police brutality, was a more pressing matter for African Americans. The content analysis also helped avoid extreme interpretations of the machine coding, as the misalignment of political agendas between the two corpora widened by up to 10 times when the training data were measured using the minimum, rather than the maximum, reliability threshold.

2020 ◽  
Vol 44 ◽  
pp. 101060 ◽  
Weili Fang ◽  
Hanbin Luo ◽  
Shuangjie Xu ◽  
Peter E.D. Love ◽  
Zhenchuan Lu

2020 ◽  
Vol 40 (4) ◽  
pp. 465-479 ◽  
Arun Varghese ◽  
George Agyeman-Badu ◽  
Michelle Cawley

A. Adeleke ◽  
N. Samsudin ◽  
A. Mustapha ◽  
S. Ahmad Khalid

Classification of Quranic verses into predefined categories is an essential task in Quranic studies. In recent times, with advances in information technology and machine learning, several classification algorithms have been developed for text classification tasks. Automated text classification (ATC) is a well-known technique in machine learning. It is the task of developing models that can be trained to automatically assign to each text instance a known label from a predefined set. In this paper, four conventional ML classifiers, support vector machine (SVM), naïve Bayes (NB), decision trees (J48), and k-nearest neighbor (k-NN), are used to classify selected Quranic verses into three predefined class labels: faith (iman), worship (ibadah), and etiquettes (akhlak). The Quranic data comprise verses from chapter two (al-Baqara) of the holy scripture. In the results, the classifiers achieved accuracy scores above 80%, with the naïve Bayes (NB) algorithm recording the highest overall scores of 93.9% accuracy and 0.964 AUC.
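As an illustration of the best-performing classifier family named above, the following is a minimal multinomial naïve Bayes sketch with add-one (Laplace) smoothing. It is not the paper's implementation: the tokenization, the smoothing choice, and the handling of unseen words are assumptions, and the `NaiveBayes` class name is hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        tokens = [w for w in text.lower().split() if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in tokens:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A usage example with invented English phrases (placeholders, not Quranic text) and the three class labels from the abstract:

```python
model = NaiveBayes().fit(
    ["believe trust belief", "pray fast prayer", "kindness manners respect"],
    ["faith", "worship", "etiquettes"],
)
label = model.predict("pray prayer")
```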
