Sampling the Web as Training Data for Text Classification

2010 ◽  
Vol 1 (4) ◽  
pp. 24-42 ◽  
Author(s):  
Wei-Yen Day ◽  
Chun-Yi Chi ◽  
Ruey-Cheng Chen ◽  
Pu-Jen Cheng
Author(s):  
Wei-Yen Day ◽  
Chun-Yi Chi ◽  
Ruey-Cheng Chen ◽  
Pu-Jen Cheng

Data acquisition is a major concern in text classification. The excessive human efforts required by conventional methods to build up quality training collection might not always be available to research workers. In this paper, the authors look into possibilities to automatically collect training data by sampling the Web with a set of given class names. The basic idea is to populate appropriate keywords and submit them as queries to search engines for acquiring training data. The first of two methods presented in this paper is based on sampling the common concepts among classes and the other is based on sampling the discriminative concepts for each class. A series of experiments were carried out independently on two different datasets and results show that the proposed methods significantly improve classifier performance even without using manually labeled training data. The authors’ strategy for retrieving Web samples substantially helps in the conventional document classification in terms of accuracy and efficiency.


Author(s):  
Sarmad Mahar ◽  
Sahar Zafar ◽  
Kamran Nishat

Headnotes are the precise explanation and summary of legal points in an issued judgment. Law journals hire experienced lawyers to write these headnotes. These headnotes help the reader quickly determine the issue discussed in the case. Headnotes comprise two parts. The first part comprises the topic discussed in the judgment, and the second part contains a summary of that judgment. In this thesis, we design, develop and evaluate headnote prediction using machine learning, without involving human involvement. We divided this task into a two steps process. In the first step, we predict law points used in the judgment by using text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To achieve this task, we created a Databank by extracting data from different law sources in Pakistan. We labelled training data generated based on Pakistan law websites. We tested different feature extraction methods on judiciary data to improve our system. Using these feature extraction methods, we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy by using Linear Support Vector Classification with tri-gram and without stemmer. Using active learning our system can continuously improve the accuracy with the increased labelled examples provided by the users of the system.


2021 ◽  
Vol 2021 ◽  
pp. 1-16
Author(s):  
Sunil Kumar Prabhakar ◽  
Dong-Ok Won

To unlock information present in clinical description, automatic medical text classification is highly useful in the arena of natural language processing (NLP). For medical text classification tasks, machine learning techniques seem to be quite effective; however, it requires extensive effort from human side, so that the labeled training data can be created. For clinical and translational research, a huge quantity of detailed patient information, such as disease status, lab tests, medication history, side effects, and treatment outcomes, has been collected in an electronic format, and it serves as a valuable data source for further analysis. Therefore, a huge quantity of detailed patient information is present in the medical text, and it is quite a huge challenge to process it efficiently. In this work, a medical text classification paradigm, using two novel deep learning architectures, is proposed to mitigate the human efforts. The first approach is that a quad channel hybrid long short-term memory (QC-LSTM) deep learning model is implemented utilizing four channels, and the second approach is that a hybrid bidirectional gated recurrent unit (BiGRU) deep learning model with multihead attention is developed and implemented successfully. The proposed methodology is validated on two medical text datasets, and a comprehensive analysis is conducted. The best results in terms of classification accuracy of 96.72% is obtained with the proposed QC-LSTM deep learning model, and a classification accuracy of 95.76% is obtained with the proposed hybrid BiGRU deep learning model.


2021 ◽  
Author(s):  
Wilson Wongso ◽  
Henry Lucky ◽  
Derwin Suhartono

Abstract The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.


2011 ◽  
Vol 1 (3) ◽  
pp. 54-70 ◽  
Author(s):  
Abdullah Wahbeh ◽  
Mohammed Al-Kabi ◽  
Qasem Al-Radaideh ◽  
Emad Al-Shawakfa ◽  
Izzat Alsmadi

The information world is rich of documents in different formats or applications, such as databases, digital libraries, and the Web. Text classification is used for aiding search functionality offered by search engines and information retrieval systems to deal with the large number of documents on the web. Many research papers, conducted within the field of text classification, were applied to English, Dutch, Chinese, and other languages, whereas fewer were applied to Arabic language. This paper addresses the issue of automatic classification or classification of Arabic text documents. It applies text classification to Arabic language text documents using stemming as part of the preprocessing steps. Results have showed that applying text classification without using stemming; the support vector machine (SVM) classifier has achieved the highest classification accuracy using the two test modes with 87.79% and 88.54%. On the other hand, stemming has negatively affected the accuracy, where the SVM accuracy using the two test modes dropped down to 84.49% and 86.35%.


2020 ◽  
Vol 2020 ◽  
pp. 1-7 ◽  
Author(s):  
Aboubakar Nasser Samatin Njikam ◽  
Huan Zhao

This paper introduces an extremely lightweight (with just over around two hundred thousand parameters) and computationally efficient CNN architecture, named CharTeC-Net (Character-based Text Classification Network), for character-based text classification problems. This new architecture is composed of four building blocks for feature extraction. Each of these building blocks, except the last one, uses 1 × 1 pointwise convolutional layers to add more nonlinearity to the network and to increase the dimensions within each building block. In addition, shortcut connections are used in each building block to facilitate the flow of gradients over the network, but more importantly to ensure that the original signal present in the training data is shared across each building block. Experiments on eight standard large-scale text classification and sentiment analysis datasets demonstrate CharTeC-Net’s superior performance over baseline methods and yields competitive accuracy compared with state-of-the-art methods, although CharTeC-Net has only between 181,427 and 225,323 parameters and weighs less than 1 megabyte.


2019 ◽  
Vol 9 (8) ◽  
pp. 1725
Author(s):  
Isra Nurul HABIBI ◽  
Abba Suganda GIRSANG

Text classification is one of the ways to classify sentences. The grouped data are comments from social media with training data from sites that provide points /scores for each review given such as tripadvisor.co.id. The word2vec method is used to extract words into numbers so that the machine learning algorithm can be applied to classify data. Word2vec is an unsupervised task that is capable of utilizing unlabeled data to convert a word into its vector representation that can also find the semantic relationship between words by counting their distance. The goal from this paper is that data from social media such as Twitter or Instagram can also quickly find out the total /weight of a tourist place from the comment given. The experiment shows that the result of F1 Score on data without removing stop words and eliminate the train data, give a better result 0,85.


2018 ◽  
Vol 25 (10) ◽  
pp. 1274-1283 ◽  
Author(s):  
Abeed Sarker ◽  
Maksim Belousov ◽  
Jasper Friedrichs ◽  
Kai Hakala ◽  
Svetlana Kiritchenko ◽  
...  

AbstractObjectiveWe executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data.Materials and MethodsWe organized 3 independent subtasks: automatic classification of self-reports of 1) adverse drug reactions (ADRs) and 2) medication consumption, from medication-mentioning tweets, and 3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3); and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks.ResultsAmong 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems.DiscussionAmong individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1).ConclusionsData imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).


Sign in / Sign up

Export Citation Format

Share Document