Union model performance improving for text data classification

2021 ◽  
Author(s):  
A. S. Surkova ◽  
S. S. Skorynin ◽  
V. V. Kondratiev ◽  
V. F. Zharinov
Biostatistics ◽  
2020 ◽  
Author(s):  
W Katherine Tan ◽  
Patrick J Heagerty

Summary: Scalable and accurate identification of specific clinical outcomes has been enabled by machine learning applied to electronic medical record systems. The development of classification models requires the collection of a complete labeled data set, where true clinical outcomes are obtained by human expert manual review. For example, the development of natural language processing algorithms requires the abstraction of clinical text data to obtain outcome information necessary for training models. However, if the outcome is rare, then simple random sampling results in very few cases and insufficient information to develop accurate classifiers. Since large-scale detailed abstraction is often expensive, time-consuming, and not feasible, more efficient strategies are needed. Under such resource-constrained settings, we propose a class of enrichment sampling designs, where selection for abstraction is stratified by auxiliary variables related to the true outcome of interest. Stratified sampling on highly specific variables results in targeted samples that are more enriched with cases, which we show translates to increased model discrimination and better statistical learning performance. We provide mathematical details and simulation evidence that link sampling designs to their resulting prediction model performance. We discuss the impact of our proposed sampling on both model training and validation. Finally, we illustrate the proposed designs for outcome label collection and subsequent machine learning, using radiology report text data from the Lumbar Imaging with Reporting of Epidemiology study.
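The enrichment idea in this abstract can be illustrated with a minimal sketch. It assumes each record carries a binary auxiliary flag (for example, a keyword match in the report text) and draws a fixed number of records from the flagged and unflagged strata. The function name `enrichment_sample`, the record layout, and the inverse-probability `weight` field are illustrative assumptions, not the authors' implementation.

```python
import random


def enrichment_sample(records, aux_key, n_flagged, n_unflagged, seed=0):
    """Draw a stratified sample that over-represents the flagged stratum.

    Hypothetical helper: `records` is a list of dicts and `aux_key` names a
    boolean auxiliary variable believed to be related to the true outcome.
    """
    rng = random.Random(seed)
    flagged = [r for r in records if r[aux_key]]
    unflagged = [r for r in records if not r[aux_key]]

    sample = []
    for stratum, n in ((flagged, n_flagged), (unflagged, n_unflagged)):
        k = min(n, len(stratum))
        if k == 0:
            continue
        # Inverse sampling probability, kept so that later estimates can be
        # re-weighted back to the full population (an assumption of this sketch).
        weight = len(stratum) / k
        for record in rng.sample(stratum, k):
            sample.append({**record, "weight": weight})
    return sample


# Toy population: 10 of 100 records carry the auxiliary flag.
records = [{"id": i, "flag": i < 10} for i in range(100)]
sample = enrichment_sample(records, "flag", n_flagged=5, n_unflagged=5)
```

In this toy population, half of the sampled records come from the flagged stratum even though only 10% of the population is flagged, which is the enrichment effect the abstract describes.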


2021 ◽  
Vol 5 (2(61)) ◽  
pp. 6-8
Author(s):  
Olena Hryshchenko ◽  
Vadym Yaremenko

The object of research is fast classification methods for text data. The study is motivated by the rapid growth of textual data in both digital and printed form: such data must be processed by software, since human reviewers cannot process this volume in full. Many data classification approaches have been developed. This research applies three of them, a Bloom filter, a naive Bayes classifier, and neural networks, to a text dataset in order to sort the texts into categories. Each method has advantages and disadvantages, and the paper illustrates the strengths and weaknesses of each on a specific example. The algorithms were compared with one another in terms of speed and efficiency, that is, the accuracy with which a text is assigned to the correct class. Each method was run on the same datasets while varying the amount of training and test data and the number of classification groups. The dataset used contains the following classes: world, business, sports, and science and technology. In real-world classification of such data, the number of categories is much larger than considered in this work, and categories may contain subcategories. In the course of the study, each method was analyzed with different parameter values to obtain the best result. Of the three methods, the neural network gave the best text classification results.
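As an illustration of one of the methods named in this abstract, the following is a minimal sketch of a multinomial naive Bayes text classifier with Laplace smoothing. The class name `NaiveBayesText`, the whitespace tokenizer, and the toy training sentences are assumptions made for this example; the paper's own implementation and dataset are not shown in the abstract.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial naive Bayes over whitespace tokens (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for token in text.lower().split():
                self.word_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
texts = [
    "stocks fall as markets slide",
    "team wins the final match",
    "markets rally on earnings",
    "coach praises team after match",
]
labels = ["business", "sports", "business", "sports"]
model = NaiveBayesText().fit(texts, labels)
```

Working in log space avoids numerical underflow when many token probabilities are multiplied, and the smoothing constant keeps unseen tokens from forcing a zero probability.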


Text data analytics has become an integral part of World Wide Web data management and of Internet-based applications, which are growing rapidly all over the world. E-commerce applications are growing exponentially, and competitors in e-commerce increasingly use machine learning techniques to predict business-related operations, with the aim of increasing product sales as much as possible. Similarity measures are indispensable in modern real-world applications. Cosine similarity plays a dominant role in text data mining applications such as text classification, clustering, querying, and searching. This paper proposes a modified clustering-based cosine similarity measure, called MCS, for data classification. The proposed method is verified experimentally on many UCI machine learning datasets involving categorical attributes. It produces more accurate classification results in the majority of the experiments conducted on these datasets.
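For reference, the standard cosine similarity that MCS builds on can be sketched as follows. This is the plain measure over term-frequency vectors, not the modified MCS measure, whose definition the abstract does not give; the function name and the whitespace tokenization are assumptions for this example.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm_a * norm_b)
```

Texts that share no tokens score 0, and texts with identical term frequencies score 1 up to floating-point rounding.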


Buana Bastra ◽  
2021 ◽  
Vol 5 (1) ◽  
pp. 39-48
Author(s):  
Nilatul Izzah ◽  
Sunu Catur Budiyono

This research aims to describe the moral values of the main character and of the community in the novel Tuhan Ijinkan Aku Menjadi Pelacur. Morality is the quality of human behavior that shows whether one's conduct is right or wrong, good or bad. The analysis uses the theory of Thomas Lickona, which emphasizes three things: moral knowing, moral feeling, and moral behavior. The method is hermeneutic (text interpretation). The data and data source are the moral values in the novel Tuhan Ijinkan Aku Menjadi Pelacur by Muhidin M. Dahlan. Data were collected through repeated reading, note-taking, and data classification. Data analysis consisted of interpretation, explanation, description, and drawing conclusions. The data are divided into subchapters according to the research problem and purpose. The interpreted data are then described in essay form as the result. The research concludes that, in the novel "Tuhan Ijinkan Aku Menjadi Pelacur": 1) moral knowledge comprises moral awareness, knowing moral values, perspective-taking, moral reasoning, and self-knowledge; 2) moral attitude comprises moral feeling, conscience, self-regard, empathy, love of the good, self-control, and humility; 3) moral action comprises interest, desire, and the habits of the main character and the community. All of these were found in the main character and the community of the novel "Tuhan Ijinkan Aku Menjadi Pelacur" by Muhidin M. Dahlan.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-6
Author(s):  
Nada Boudjellal ◽  
Huaping Zhang ◽  
Asif Khan ◽  
Arshad Ahmad ◽  
Rashid Naseem ◽  
...  

The web is loaded daily with a huge volume of data, mainly unstructured text, which significantly increases the need for information extraction and NLP systems. Named-entity recognition is a key step towards efficiently understanding text data and saving time and effort. As a globally used language, English dominates the research conducted in this field, especially in the biomedical domain, whereas Arabic suffers from a lack of resources. This work presents a BERT-based model to identify biomedical named entities in Arabic text (specifically disease and treatment named entities), and investigates how effectively pretraining a monolingual BERT model on a small-scale biomedical dataset improves the model's understanding of Arabic biomedical text. The model's performance was compared with two state-of-the-art models (namely, AraBERT and multilingual BERT cased), and it outperformed both with an F1-score of 85%.
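The F1-score reported here is the harmonic mean of precision and recall. A minimal sketch of an entity-level computation follows, assuming each entity is represented as a hypothetical `(start, end, type)` tuple and that only exact matches count as correct; the abstract does not state which matching scheme the authors used.

```python
def f1_score(gold: set, predicted: set) -> float:
    """Harmonic mean of precision and recall over exact entity matches."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented spans for illustration: one of two predictions matches the gold set.
gold = {(0, 2, "DISEASE"), (5, 6, "TREATMENT")}
predicted = {(0, 2, "DISEASE"), (8, 9, "DISEASE")}
score = f1_score(gold, predicted)
```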


2021 ◽  
Vol 8 (2) ◽  
pp. 33-45
Author(s):  
A.V. Pchelin ◽  
N.A. Kononov ◽  
V.S. Serova ◽  
E.V. Bunova ◽  
...  
