Union model performance improving for text data classification

Cybersecurity Text Data Classification and Optimization for CTI Systems

Advances in Intelligent Systems and Computing - Web, Artificial Intelligence and Network Applications ◽

10.1007/978-3-030-44038-1_37 ◽

2020 ◽

pp. 410-419 ◽

Cited By ~ 1

Author(s):

Ariel Rodriguez ◽

Koji Okamura

Keyword(s):

Data Classification ◽

Text Data

Surrogate-guided sampling designs for classification of rare outcomes from electronic medical records data

Biostatistics ◽

10.1093/biostatistics/kxaa028 ◽

2020 ◽

Author(s):

W Katherine Tan ◽

Patrick J Heagerty

Keyword(s):

Machine Learning ◽

Clinical Outcomes ◽

Language Processing ◽

Large Scale ◽

Model Performance ◽

Learning Performance ◽

Accurate Identification ◽

Text Data ◽

Data Set ◽

The Impact

Summary: Scalable and accurate identification of specific clinical outcomes has been enabled by machine learning applied to electronic medical record systems. The development of classification models requires the collection of a complete labeled data set, where true clinical outcomes are obtained by human expert manual review. For example, the development of natural language processing algorithms requires the abstraction of clinical text data to obtain outcome information necessary for training models. However, if the outcome is rare, then simple random sampling results in very few cases and insufficient information to develop accurate classifiers. Since large-scale detailed abstraction is often expensive, time-consuming, and not feasible, more efficient strategies are needed. Under such resource-constrained settings, we propose a class of enrichment sampling designs, where selection for abstraction is stratified by auxiliary variables related to the true outcome of interest. Stratified sampling on highly specific variables results in targeted samples that are more enriched with cases, which we show translates to increased model discrimination and better statistical learning performance. We provide mathematical details and simulation evidence that link sampling designs to their resulting prediction model performance. We discuss the impact of our proposed sampling on both model training and validation. Finally, we illustrate the proposed designs for outcome label collection and subsequent machine learning, using radiology report text data from the Lumbar Imaging with Reporting of Epidemiology study.
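The enrichment idea in the summary above can be illustrated with a small simulation. This is a minimal sketch, not the authors' design or estimator: the population size, outcome prevalence, surrogate sensitivity and specificity, sample size, and all function names are hypothetical values chosen only to show why stratifying on a surrogate tends to yield more cases than simple random sampling.

```python
import random

def simulate_population(n, prevalence, sensitivity, specificity, seed=0):
    """Return (outcome, surrogate) pairs for a hypothetical population."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        outcome = rng.random() < prevalence
        if outcome:
            # A true case is flagged by the surrogate with probability `sensitivity`.
            surrogate = rng.random() < sensitivity
        else:
            # A non-case is flagged with probability 1 - `specificity`.
            surrogate = rng.random() >= specificity
        population.append((outcome, surrogate))
    return population

def enrichment_sample(population, size, frac_positive, rng):
    """Stratify on the surrogate and oversample the surrogate-positive stratum."""
    positives = [p for p in population if p[1]]
    negatives = [p for p in population if not p[1]]
    n_pos = min(len(positives), round(size * frac_positive))
    n_neg = size - n_pos
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)

def case_count(sample):
    """Number of true cases that would be found by manual abstraction."""
    return sum(1 for outcome, _ in sample if outcome)

# Hypothetical setting: a rare outcome (2%) and a fairly specific surrogate.
population = simulate_population(
    n=20000, prevalence=0.02, sensitivity=0.7, specificity=0.95
)
rng = random.Random(1)
srs = rng.sample(population, 500)  # simple random sampling
enriched = enrichment_sample(population, 500, frac_positive=0.5, rng=rng)

print("cases in simple random sample:", case_count(srs))
print("cases in enriched sample:", case_count(enriched))
```

Because the enriched sample no longer reflects the population's outcome prevalence, any model or validation metric built on it would need to account for the sampling design; the sketch deliberately stops short of that step.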

A comparative analysis of text data classification accuracy and speed using neural networks, Bloom filter and naive Bayes

Technology audit and production reserves ◽

10.15587/2706-5448.2021.237767 ◽

2021 ◽

Vol 5 (2(61)) ◽

pp. 6-8

Author(s):

Olena Hryshchenko ◽

Vadym Yaremenko

Keyword(s):

Neural Networks ◽

Data Classification ◽

Bloom Filter ◽

Data Sets ◽

Classification Problems ◽

Text Data ◽

Textual Data ◽

Fast Classification ◽

Parameter Values

The object of research is fast classification methods for solving text data classification problems. The study is motivated by the rapid growth of textual data, in both digital and printed form. Such data must be processed by software, since human resources cannot process this amount of data in full. A large number of data classification approaches have been developed. This research applies three text classification methods (Bloom filter, naive Bayes classifier, and neural networks) to a set of text data in order to classify the texts into categories. Each method has both advantages and disadvantages, and this paper shows the strengths and weaknesses of each method on a specific example. The algorithms were compared with one another in terms of speed and effectiveness, that is, the accuracy with which a text is assigned to a class. Each method was evaluated on the same data sets while varying the amount of training and test data and the number of classification groups. The dataset used contains the following classes: world, business, sports, and science and technology. In real conditions, the number of categories is much larger than that considered in this work, and categories may contain subcategories. In the course of the study, each method was analyzed using different parameter values to obtain the best result. According to the results obtained, the neural network gave the best text classification results.
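The abstract does not describe how the Bloom filter is turned into a classifier, so the following is only a minimal sketch under an assumed scheme: one Bloom filter per class stores that class's training words, and a new text is assigned to the class whose filter matches the most words. The class `BloomFilter`, the helpers `train` and `classify`, the filter size, and the two toy training texts are all hypothetical.

```python
import hashlib

class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def train(labeled_texts):
    """Build one Bloom filter per class from (label, text) pairs."""
    filters = {}
    for label, text in labeled_texts:
        bloom = filters.setdefault(label, BloomFilter())
        for word in text.lower().split():
            bloom.add(word)
    return filters

def classify(filters, text):
    """Pick the class whose filter reports the most words as (probably) seen."""
    words = text.lower().split()
    scores = {label: sum(w in bloom for w in words) for label, bloom in filters.items()}
    return max(scores, key=scores.get)

# Toy training data, invented for illustration only.
training = [
    ("sports", "team wins final match after late goal"),
    ("business", "shares rise as company reports strong profit"),
]
filters = train(training)
print(classify(filters, "company profit beats forecast"))
```

The appeal of this design is speed and memory: membership tests cost a few hash computations regardless of vocabulary size. The trade-off is that word frequencies and word order are ignored, which is one plausible reason a neural network could reach higher accuracy.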

Modified Cosine Similarity Measure based Data Classification in Data Mining

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e9754.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 649-654

Keyword(s):

Machine Learning ◽

Data Mining ◽

Similarity Measure ◽

Dominant Role ◽

Similarity Measures ◽

Data Classification ◽

Cosine Similarity ◽

Machine Learning Techniques ◽

Text Data ◽

Cosine Similarity Measure

Text data analytics has become an integral part of World Wide Web data management and of Internet-based applications, which are growing rapidly all over the world. E-commerce applications are growing exponentially in the business field, and competitors in e-commerce are increasingly using machine learning techniques to predict business-related operations, with the aim of increasing product sales. The use of similarity measures is unavoidable in modern real-world applications. Cosine similarity plays a dominant role in text data mining applications such as text classification, clustering, querying, and searching. A modified clustering-based cosine similarity measure called MCS is proposed in this paper for data classification. The proposed method is verified experimentally on many UCI machine learning datasets involving categorical attributes. It produces more accurate classification results in the majority of the experiments conducted on these datasets.
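For orientation, the standard cosine similarity that the abstract builds on can be sketched as follows. This is not the MCS measure, whose modification is not described in the abstract; the bag-of-words `vectorize` helper and the nearest-neighbor rule in `nearest_label` are assumptions added only to show how a similarity measure can drive classification.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

def vectorize(text):
    """Bag-of-words term counts (a deliberately simple representation)."""
    return Counter(text.lower().split())

def nearest_label(labeled_texts, query):
    """Return the label of the training text most similar to the query."""
    q = vectorize(query)
    best = max(labeled_texts, key=lambda pair: cosine_similarity(vectorize(pair[1]), q))
    return best[0]

# Toy examples, invented for illustration only.
training = [
    ("electronics", "wireless phone charger fast charging"),
    ("books", "paperback mystery novel bestseller"),
]
print(nearest_label(training, "fast wireless charger"))
```

Cosine similarity depends only on the direction of the vectors, not their length, so a short query and a long document can still score as highly similar when they share the same terms in similar proportions.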

Using Correlation Based Subspace Clustering for Multi-label Text Data Classification

2010 22nd IEEE International Conference on Tools with Artificial Intelligence ◽

10.1109/ictai.2010.115 ◽

2010 ◽

Cited By ~ 3

Author(s):

Mohammad Salim Ahmed ◽

Latifur Khan ◽

Mandava Rajeswari

Keyword(s):

Subspace Clustering ◽

Data Classification ◽

Text Data

Hybrid Technique for Medical Data Classification using Multi-Layer Perceptron with NB Classifier

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.k2179.1081219 ◽

2019 ◽

Vol 8 (12) ◽

pp. 2627-2632

Keyword(s):

Deep Learning ◽

Data Analysis ◽

Image Data ◽

Numerical Data ◽

Data Classification ◽

Heterogeneous Data ◽

Medical Data ◽

Hybrid Technique ◽

Text Data ◽

Medical Data Classification

Medical data analysis has gained increasing interest over the last decade because of its significant advantages. Medical data is heterogeneous: it combines text data, numeric data, and image data. Traditional data analysis mechanisms are inefficient for analyzing such heterogeneous data, and deep learning is an obvious choice for handling it. Deep learning can handle text, numeric, and image data more efficiently than traditional data mining techniques. In this paper we propose a deep-learning-based multilayer perceptron to analyze medical data. The method addresses the text data, image data, and numerical data independently and combines them to perform medical data classification.

An Innovative Research Framework on Intelligent Text Data Classification System Using Genetic Algorithm

International Journal of Artificial Intelligence & Applications ◽

10.5121/ijaia.2016.7605 ◽

2016 ◽

Vol 7 (6) ◽

pp. 57-73

Author(s):

Maheswara Rao V V R ◽

Silpa N ◽

Gadiraju Mahesh

Keyword(s):

Genetic Algorithm ◽

Classification System ◽

Data Classification ◽

Text Data ◽

Research Framework ◽

Innovative Research

KAJIAN MORALITAS DALAM NOVEL TUHAN IJINKAN AKU MENJADI PELACUR KARYA MUHIDIN M DAHLAN (A Study of Morality in the Novel Tuhan Ijinkan Aku Menjadi Pelacur by Muhidin M Dahlan)

Buana Bastra ◽

10.36456/bastra.vol5.no1.a3580 ◽

2021 ◽

Vol 5 (1) ◽

pp. 39-48

Author(s):

Nilatul Izzah ◽

Sunu Catur Budiyono

Keyword(s):

Data Classification ◽

Moral Knowledge ◽

Self Control ◽

Research Purpose ◽

Moral Awareness ◽

Moral Value ◽

Text Data ◽

Self Knowledge ◽

Moral Feeling

This research aims to describe the moral values of the main character and the community in the novel Tuhan Ijinkan Aku Menjadi Pelacur. Morality is the quality of human behavior that shows whether one's behavior is right or wrong, good or bad. The analysis uses the theory of Thomas Lickona, which emphasizes three things: knowing moral values, moral feeling, and moral behavior. The research method is hermeneutic (text interpretation). The data and data source are the moral values in the novel Tuhan Ijinkan Aku Menjadi Pelacur by Muhidin M. Dahlan. Data were collected through repeated reading, note-taking, and data classification. Data were analyzed through interpretation, explanation, description, and drawing conclusions. The data are divided into subchapters based on the problem and the research purpose. The interpreted data are then described in an essay as the result. The conclusions of this research are: 1) moral knowledge comprises moral awareness, knowing moral values, perspective-taking, moral reasoning, and self-knowledge; 2) moral attitude comprises moral feeling, conscience, self-regard, empathy, love of the good, self-control, and humility; 3) moral action comprises interest, desire, and the habits of the main character and the community. All of these were found in the main character and the community of the novel “Tuhan Ijinkan Aku Menjadi Pelacur” by Muhidin M Dahlan.

Mining and Tracking Massive Text Data: Classification, Construction of Tracking Statistics, and Inference Under Misclassification

Technometrics ◽

10.1198/004017006000000471 ◽

2007 ◽

Vol 49 (2) ◽

pp. 116-128 ◽

Cited By ~ 10

Author(s):

Daniel R Jeske ◽

Regina Y Liu

Keyword(s):

Data Classification ◽

Text Data

ABioNER: A BERT-Based Model for Arabic Biomedical Named-Entity Recognition

Complexity ◽

10.1155/2021/6633213 ◽

2021 ◽

Vol 2021 ◽

pp. 1-6

Author(s):

Nada Boudjellal ◽

Huaping Zhang ◽

Asif Khan ◽

Arshad Ahmad ◽

Rashid Naseem ◽

...

Keyword(s):

Named Entity Recognition ◽

Model Performance ◽

Recognition Task ◽

Entity Recognition ◽

Small Scale ◽

Text Data ◽

Named Entities ◽

Named Entity ◽

Textual Data ◽

Biomedical Named Entity Recognition

The web is loaded daily with a huge volume of data, mainly unstructured textual data, which significantly increases the need for information extraction and NLP systems. The named-entity recognition task is a key step towards efficiently understanding text data and saving time and effort. As a globally used language, English dominates most of the research conducted in this field, especially in the biomedical domain, whereas Arabic suffers from a lack of resources. This work presents a BERT-based model to identify biomedical named entities in Arabic text data (specifically disease and treatment named entities). It investigates how effectively pretraining a monolingual BERT model on a small-scale biomedical dataset enhances the model's understanding of Arabic biomedical text. The model's performance was compared with two state-of-the-art models (namely, AraBERT and multilingual BERT cased), and it outperformed both with an F1-score of 85%.
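To make the reported metric concrete, the following sketch computes an entity-level F1-score. The abstract does not say how its F1-score was computed, so exact-match scoring over `(start, end, type)` spans is an assumption, and the function name `entity_f1` and the example spans are hypothetical.

```python
def entity_f1(gold, predicted):
    """Exact-match entity-level F1 over (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Invented example: the second entity is found but given the wrong type,
# so it counts as both a missed gold entity and a spurious prediction.
gold = {(0, 2, "DISEASE"), (5, 6, "TREATMENT"), (9, 10, "DISEASE")}
predicted = {(0, 2, "DISEASE"), (5, 6, "DISEASE"), (9, 10, "DISEASE")}
print(round(entity_f1(gold, predicted), 3))
```

Under exact matching, a span with the right boundaries but the wrong type earns no credit, which makes this a stricter measure than token-level accuracy.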