Go Simple and Pre-Train on Domain-Specific Corpora: On the Role of Training Data for Text Classification

Author(s):  
Aleksandra Edwards ◽  
Jose Camacho-Collados ◽  
Hélène De Ribaupierre ◽  
Alun Preece
Author(s):  
Pratiksha Bongale

Today’s world is largely data-driven, and machine learning and data mining techniques are used to handle the enormous volume of data. Traditional ML approaches assume that a model is tested on data drawn from the same domain as its training data. Some real-world situations, however, require models to perform well with very little domain-specific training data. This creates room for models that can predict accurately after being trained on readily available data. Transfer learning is the key: it applies the knowledge gained while learning one task to another, related task. This article builds a binary classifier that separates spam from non-spam text using BERT’s pre-trained model (bert-base-uncased). That model was pre-trained on Wikipedia and BookCorpus data, and the goal of this paper is to highlight its ability to transfer the knowledge learned during pre-training to the task of separating spam texts from the rest.


2007 ◽  
Author(s):  
P. S. Kavanagh ◽  
G. J. O. Fletcher ◽  
B. J. Ellis

Author(s):  
Annapoorani Gopal ◽  
Lathaselvi Gandhimaruthian ◽  
Javid Ali

Deep neural networks have gained prominence in the biomedical domain and are now among the most commonly used machine learning techniques there. Mammograms can be used to detect breast cancer with high precision with the help of a Convolutional Neural Network (CNN), a deep learning technique. Training a CNN from scratch requires an extensive labeled dataset. This requirement can be eased by deploying a Generative Adversarial Network (GAN), which needs comparatively less training data for mammogram screening. This study extensively reviews the application of GANs to breast density estimation, high-resolution mammogram synthesis for clustered microcalcification analysis, breast tumor segmentation, tumor shape analysis, feature extraction, and image augmentation during mammogram classification.


2017 ◽  
Vol 93 (4) ◽  
pp. 177-202 ◽  
Author(s):  
Emily E. Griffith

ABSTRACT Auditors are more likely to identify misstatements in complex estimates if they recognize problematic patterns among an estimate's underlying assumptions. Rich problem representations aid pattern recognition, but auditors likely have difficulty developing them given auditors' limited domain-specific expertise in this area. In two experiments, I predict and find that a relational cue in a specialist's work highlighting aggressive assumptions improves auditors' problem representations and subsequent judgments about estimates. However, this improvement only occurs when a situational factor (e.g., risk) increases auditors' epistemic motivation to incorporate the cue into their problem representations. These results suggest that auditors do not always respond to cues in specialists' work. More generally, this study highlights the role of situational factors in increasing auditors' epistemic motivation to develop rich problem representations, which contribute to high-quality audit judgments in this and other domains where pattern recognition is important.


2005 ◽  
Vol 20 (1) ◽  
pp. 137-158 ◽  
Author(s):  
Gail M. Gottfried ◽  
Susan A. Gelman

2004 ◽  
Vol 4 (2) ◽  
pp. 373-390 ◽  
Author(s):  
Alain Samson

Abstract In an article aimed at complementing Boyer and Sperber's (relatively structural) views of counter-intuitive concepts and their robustness in the religious domain, Franks (2003) has recently drawn attention to the fact that the tolerance of such conflict or contradiction appears to be less domain-specific in some cultures, such as those found in East Asia. This paper follows up on this important point by highlighting the similarities and differences of the tolerance for contradictions evident in East Asian 'naïve dialecticism' and nonnatural religious representations. It is argued that, despite their dissimilarity with respect to the content represented, both types of tolerances may be structurally similar. Both could also be anchored in intuition, albeit in qualitatively different ways. Given the general tolerance of psychological contradiction among persons of East Asian cultures and the potential role of religion, the question whether there is a place for the study of 'tolerance of contradiction' in cross-cultural psychology and cognitive anthropology is raised.


2021 ◽  
Vol 13 (1) ◽  
pp. 29-46
Author(s):  
Ekkehard König

This paper discusses the role of English as the current lingua franca academica in contrast to a multilingual approach to scientific inquiry on the basis of four perspectives: a cognitive, a typological, a contrastive and a domain-specific one. It is argued that a distinction must be drawn between the natural sciences and the humanities in order to properly assess the potential of either linguistic solution to the problem of scientific communication. To the extent that the results of scientific research are expressed in formal languages and international standardised terminology, the exclusive use of one lingua franca is unproblematic, especially if phenomena of our external world are under consideration. In the humanities, by contrast, especially in the analysis of our non-visible, mental world, a single lingua franca cannot be regarded as a neutral instrument, but may more often than not become a conceptual prison. For the humanities the analysis of the conceptual system of a language provides the most reliable access to its culture. For international exchange of results, however, the humanities too have to rely on a suitable lingua franca as language of description as opposed to the language under description.


Author(s):  
Sarmad Mahar ◽  
Sahar Zafar ◽  
Kamran Nishat

Headnotes are precise explanations and summaries of the legal points in an issued judgment. Law journals hire experienced lawyers to write them, and they help the reader quickly determine the issue discussed in a case. A headnote comprises two parts: the first states the topic discussed in the judgment, and the second contains a summary of that judgment. In this thesis, we design, develop and evaluate headnote prediction using machine learning, without human involvement. We divide the task into a two-step process. In the first step, we predict the law points used in the judgment with text classification algorithms. In the second step, we generate a summary of the judgment using text summarization techniques. To achieve this, we created a databank by extracting data from different law sources in Pakistan, and we labelled the training data generated from Pakistan law websites. We tested different feature extraction methods on judiciary data to improve our system. Using these feature extraction methods, we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy using Linear Support Vector Classification with tri-gram features and no stemming. Using active learning, our system can continuously improve its accuracy as users provide more labelled examples.
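
As a rough illustration of the first step described above, the sketch below extracts word tri-grams without stemming and fits a small linear SVM by subgradient descent on the hinge loss. It is not the authors' implementation: the example sentences, the labels, the function names (`word_trigrams`, `train_linear_svm`, `predict`) and the hyperparameters are all assumptions made for illustration, and a plain-Python trainer stands in for whatever Linear Support Vector Classification library the thesis used.

```python
from collections import Counter


def word_trigrams(text):
    """Count word tri-grams; tokens are lowercased but deliberately not stemmed."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2))


def score(weights, bias, feats):
    """Linear decision function: w . x + b over sparse tri-gram counts."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items()) + bias


def train_linear_svm(samples, labels, epochs=50, lr=0.1, reg=0.01):
    """Subgradient descent on the L2-regularised hinge loss; labels are +1 or -1."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, y in zip(samples, labels):
            margin = y * score(weights, bias, feats)
            # Shrink every weight slightly (the L2 regularisation term).
            for f in weights:
                weights[f] *= 1 - lr * reg
            # Only samples inside the margin contribute a hinge-loss update.
            if margin < 1:
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0.0) + lr * y * v
                bias += lr * y
    return weights, bias


def predict(weights, bias, feats):
    """Return +1 or -1 depending on the sign of the decision function."""
    return 1 if score(weights, bias, feats) >= 0 else -1


# Hypothetical toy data: +1 = contract law point, -1 = criminal law point.
docs = [
    "the contract was breached by the seller",
    "breach of the contract by the buyer",
    "the accused was convicted of theft",
    "the accused was acquitted of murder",
]
labels = [1, 1, -1, -1]

weights, bias = train_linear_svm([word_trigrams(d) for d in docs], labels)
print(predict(weights, bias, word_trigrams("the contract was breached by the tenant")))
```

Because no stemmer is applied, "breach" and "breached" produce unrelated tri-grams here; that is the trade-off implied by the "without stemmer" setting the abstract reports.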


2017 ◽  
Vol 45 (6) ◽  
pp. 523-535 ◽  
Author(s):  
Heike M. Buhl ◽  
Peter Noack ◽  
Baerbel Kracke

This longitudinal study addresses the role of support given by parents and peers during the transition from university to work life. A sample of 64 German university students in their last year at the university completed scales from the Network of Relationships Inventory regarding general support, namely, instrumental aid and intimacy with mothers, fathers, romantic partners, and friends. Four years later, they assessed domain-specific support when looking for work, namely, joint exploration and instrumental support. Participants perceived receiving both types of support from all significant others. However, joint exploration was more important than instrumental support. They felt especially supported by romantic partners. Women received more support than did men. Both types of domain-specific support were explained by general modes of support assessed 4 years earlier. Whether parents, friends, and partners were perceived as helpful during the transition was explained mainly by joint exploration. Again, support from a partner was seen as especially helpful in contrast to help from parents and friends. The special significance of joint exploration underlines the benefit of counseling at the transition from university to work life.


2021 ◽  
Vol 2021 ◽  
pp. 1-16
Author(s):  
Sunil Kumar Prabhakar ◽  
Dong-Ok Won

Automatic medical text classification is a highly useful natural language processing (NLP) technique for unlocking the information present in clinical descriptions. Machine learning techniques are quite effective for medical text classification tasks; however, they require extensive human effort to create labeled training data. For clinical and translational research, a huge quantity of detailed patient information, such as disease status, lab tests, medication history, side effects, and treatment outcomes, has been collected in an electronic format, and it serves as a valuable data source for further analysis. Medical text therefore contains a large quantity of detailed patient information, and processing it efficiently is a considerable challenge. In this work, a medical text classification paradigm using two novel deep learning architectures is proposed to reduce the human effort. The first approach is a quad channel hybrid long short-term memory (QC-LSTM) deep learning model that uses four channels, and the second is a hybrid bidirectional gated recurrent unit (BiGRU) deep learning model with multihead attention. The proposed methodology is validated on two medical text datasets, and a comprehensive analysis is conducted. The best classification accuracy, 96.72%, is obtained with the proposed QC-LSTM deep learning model, and a classification accuracy of 95.76% is obtained with the proposed hybrid BiGRU deep learning model.

