USING MACHINE LEARNING AND SEMANTIC FEATURES IN INTELLECTUAL ANALYSIS OF TEXT DATA

2020 ◽  
Vol 13 ◽  
Author(s):  
Aleksey Klokov ◽  
Evgenii Slobodyuk ◽  
Michael Charnine

The object of the research is a corpus of text data, collected together with the scientific advisor, and the natural language processing algorithms used to analyze it. A series of hypotheses was tested against computer science publications through simulation experiments described in this dissertation. The subject of the research is the algorithms themselves and their outputs, aimed at predicting promising topics and terms that emerge over time in the scientific community. The result of this work is a set of machine learning models with which experiments were carried out to identify promising terms and semantic relationships in the text corpus. The resulting models can be used for the semantic processing and analysis of other subject areas.
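
The abstract gives no implementation details; purely as a hypothetical illustration of surfacing candidate "promising" terms from a text corpus, the sketch below trains word embeddings on a toy set of abstracts and ranks terms by semantic proximity to a seed word. The corpus, seed word, and hyperparameters are placeholders, not the authors' data or method.

```python
# Hypothetical sketch: rank candidate terms by embedding similarity to a seed.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

abstracts = [
    "deep learning methods for natural language processing",
    "neural networks improve machine translation quality",
    "semantic analysis of scientific text corpora",
    "word embeddings capture semantic relationships between terms",
]
sentences = [simple_preprocess(a) for a in abstracts]

# Tiny toy corpus, so min_count=1; a real corpus would use a higher threshold.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=1, seed=0)

# Terms most similar to a seed topic word serve as candidate emerging terms.
for term, score in model.wv.most_similar("semantic", topn=5):
    print(f"{term}\t{score:.3f}")
```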


2020 ◽  
Author(s):  
Pathikkumar Patel ◽  
Bhargav Lad ◽  
Jinan Fiaidhi

During the last few years, RNN models have been used extensively, and they have proven to perform well on sequence and text data. RNNs have achieved state-of-the-art performance in several applications, such as text classification, sequence-to-sequence modelling, and time series forecasting. In this article we review different machine learning and deep learning based approaches for text data and examine the results obtained with these methods. This work also explores the use of transfer learning in NLP and how it affects model performance on a specific application, sentiment analysis.
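
As a minimal sketch of the kind of RNN text model the article reviews (not the authors' code), the following builds a small LSTM sentiment classifier in Keras; the architecture, hyperparameters, and random stand-in data are assumptions for demonstration only.

```python
# Minimal LSTM binary sentiment classifier sketch (toy data, toy sizes).
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN = 10_000, 200

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),      # token ids -> dense vectors
    layers.LSTM(64),                       # recurrent layer over the sequence
    layers.Dense(1, activation="sigmoid"), # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data: integer-encoded, padded token sequences and 0/1 labels.
x = np.random.randint(0, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, 2, size=(32,))
model.fit(x, y, epochs=1, verbose=0)
```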


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Eyal Klang ◽  
Benjamin R. Kummer ◽  
Neha S. Dangayach ◽  
Amy Zhong ◽  
M. Arash Kia ◽  
...  

Early admission to the neurosciences intensive care unit (NSICU) is associated with improved patient outcomes. Natural language processing offers new possibilities for mining free text in electronic health record data. We sought to develop a machine learning model using both tabular and free text data to identify patients requiring NSICU admission shortly after arrival to the emergency department (ED). We conducted a single-center, retrospective cohort study of adult patients at the Mount Sinai Hospital, an academic medical center in New York City. All patients presenting to our institutional ED between January 2014 and December 2018 were included. Structured (tabular) demographic, clinical, and bed movement record data, as well as free text data from triage notes, were extracted from our institutional data warehouse. A machine learning model was trained to predict the likelihood of NSICU admission at 30 min from arrival to the ED. We identified 412,858 patients presenting to the ED over the study period, of whom 1900 (0.5%) were admitted to the NSICU. The daily median number of ED presentations was 231 (IQR 200–256) and the median time from ED presentation to the decision for NSICU admission was 169 min (IQR 80–324). A model trained only on text data had an area under the receiver operating characteristic curve (AUC) of 0.90 (95% confidence interval (CI) 0.87–0.91). A structured data-only model had an AUC of 0.92 (95% CI 0.91–0.94). A combined model trained on structured and text data had an AUC of 0.93 (95% CI 0.92–0.95). At a false positive rate of 1:100 (99% specificity), the combined model was 58% sensitive for identifying NSICU admission. A machine learning model using structured and free text data can predict NSICU admission soon after ED arrival. This may potentially improve ED and NSICU resource allocation. Further studies should validate our findings.
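
As an illustration of the modelling approach the abstract describes (not the study's actual pipeline), the sketch below combines tabular ED features with free-text triage notes in a single scikit-learn pipeline; the column names, toy rows, and choice of logistic regression are hypothetical.

```python
# Sketch: one way to fuse tabular and free-text features in one model.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy rows standing in for ED presentations.
df = pd.DataFrame({
    "age": [63, 25, 71, 40],
    "heart_rate": [110, 80, 95, 72],
    "triage_note": ["sudden left-sided weakness", "ankle sprain",
                    "worst headache of life", "cough and fever"],
    "nsicu_admit": [1, 0, 1, 0],
})

features = ColumnTransformer([
    ("tabular", "passthrough", ["age", "heart_rate"]),  # structured columns as-is
    ("text", TfidfVectorizer(), "triage_note"),          # bag-of-words for notes
])
clf = Pipeline([("features", features), ("model", LogisticRegression())])
clf.fit(df[["age", "heart_rate", "triage_note"]], df["nsicu_admit"])
print(clf.predict_proba(df[["age", "heart_rate", "triage_note"]])[:, 1])
```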


Author(s):  
Jia Luo ◽  
Dongwen Yu ◽  
Zong Dai

It is not feasible to process the huge amount of structured and semi-structured data with manual methods. This study aims to solve the problem of processing huge volumes of data through machine learning algorithms. We collected text data on the company's public opinion through crawlers, used the Latent Dirichlet Allocation (LDA) algorithm to extract keywords from the text, and used fuzzy clustering to group the keywords into different topics. The topic keywords are then used as a seed dictionary for new word discovery. To verify the efficiency of machine learning in new word discovery, algorithms based on association rules, N-gram, PMI, and Word2vec were used for comparative testing of new word discovery. The experimental results show that the machine learning based Word2vec algorithm achieves the highest accuracy, recall, and F-value.
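
As a rough sketch of the LDA keyword-extraction step described above, the following uses gensim on a toy set of opinion-style documents; the documents, topic count, and hyperparameters are placeholder assumptions, not the study's data.

```python
# Sketch: extract topic keywords with LDA; these would seed new-word discovery.
from gensim import corpora, models
from gensim.utils import simple_preprocess

docs = ["the company product recall hurt consumer trust",
        "strong quarterly earnings lift the company stock",
        "consumers praise the new product design"]
tokens = [simple_preprocess(d) for d in docs]

dictionary = corpora.Dictionary(tokens)
bow = [dictionary.doc2bow(t) for t in tokens]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id in range(2):
    # Top keywords per topic, i.e. candidates for the seed dictionary.
    print(topic_id, lda.show_topic(topic_id, topn=5))
```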


2019 ◽  
Vol 8 (2S8) ◽  
pp. 1331-1337

Research in the area of annotation is growing. Researchers perform annotation tasks on various forms of data, such as text, sound, images, and videos, using a variety of algorithms. The purpose of this survey is to identify the algorithms researchers most often use to perform annotation tasks, especially on text data. The survey covers thirteen research papers on text annotation from the last five years. The results indicate that SVM is the algorithm used across all three annotation methods, manual, automatic, and semi-automatic annotation, with accuracy consistently above 80%. The results of this survey will serve as the basis for the authors' subsequent research, particularly on the semi-automatic annotation method.
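
For orientation, here is a minimal sketch of the SVM-based text annotation setup common to the surveyed papers: TF-IDF features feeding a linear SVM. The labelled examples are toy placeholders, not data from the surveyed studies.

```python
# Sketch: a linear SVM text annotator over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great acting and a moving plot", "dull, slow and predictable",
         "a delightful surprise", "painful to sit through"]
labels = ["positive", "negative", "positive", "negative"]

annotator = make_pipeline(TfidfVectorizer(), LinearSVC())
annotator.fit(texts, labels)
print(annotator.predict(["surprisingly moving"]))  # -> ['positive']
```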


2021 ◽  
Author(s):  
Meng Ji ◽  
Yanmeng Liu ◽  
Tianyong Hao

BACKGROUND: Much current research on health information understandability uses medical readability formulas (MRF) to assess the cognitive difficulty of health education resources. This rests on an implicit assumption that medical domain knowledge, represented by uncommon words or jargon, forms the sole barrier to health information access among the public. Our study challenges this assumption by showing that for readers from non-English-speaking backgrounds with higher educational attainment, it is semantic features of English health texts, rather than medical jargon, that explain the lack of cognitive access to health materials among readers who understand health terms well yet have limited exposure to English health education materials.
OBJECTIVE: Our study explored combining MRF with multidimensional semantic features (MSF) to develop machine learning algorithms that predict the actual level of cognitive accessibility of English health materials on health risks and diseases for specific populations. We compared algorithms for evaluating the cognitive accessibility of specialised health information for non-native English speakers with advanced education levels yet very limited exposure to English health education environments.
METHODS: We used 108 semantic features to measure the content complexity and accessibility of original English resources. Using 1000 English health texts collected from international health organization websites and rated by international tertiary students, we compared machine learning classifiers (decision tree, SVM, discriminant analysis, ensemble tree, and logistic regression) after automatic hyperparameter optimization (grid search for the combination of hyperparameters with minimal classification error). We applied 10-fold cross-validation on the whole dataset for model training and testing, and calculated AUC, sensitivity, specificity, and accuracy as measures of model performance.
RESULTS: Using two sets of predictor features, the widely tested MRF and the MSF proposed in our study, we developed and compared three sets of machine learning algorithms: the first used MRF as predictors only, the second used MSF as predictors only, and the last used both MRF and MSF in integrated models. The integrated models outperformed the others in terms of AUC, sensitivity, accuracy, and specificity.
CONCLUSIONS: Our study shows that the cognitive accessibility of English health texts is not determined solely by the word length and sentence length conventionally measured by MRF. We compared machine learning algorithms combining MRF and MSF to explore the cognitive accessibility of health information from syntactic and semantic perspectives. The results show the strength of integrated models, with statistically significant increases in AUC, sensitivity, and accuracy when predicting health resource accessibility for the target readership, indicating that both MRF and MSF contribute to the comprehension of health information and that, for readers with advanced education, semantic features outweigh syntax and domain knowledge.
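
As a rough illustration of the comparison the abstract describes, the sketch below contrasts 10-fold cross-validated AUC for MRF-only, MSF-only, and integrated feature sets; the feature matrices are random placeholders (standing in for the readability and 108 semantic features), and plain logistic regression replaces the full grid-searched model family.

```python
# Sketch: compare feature sets by 10-fold cross-validated AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
mrf = rng.normal(size=(n, 5))    # placeholder readability-formula features
msf = rng.normal(size=(n, 108))  # placeholder multidimensional semantic features
y = rng.integers(0, 2, size=n)   # placeholder rated-accessibility labels

for name, X in [("MRF", mrf), ("MSF", msf), ("MRF+MSF", np.hstack([mrf, msf]))]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=10, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```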


Author(s):  
Fahd Kalloubi ◽  
El Habib Nfaoui

Twitter is one of the primary online social networks where users share messages and content of interest with those who follow their activities. To effectively categorize their tweets and reach an audience, users try to append appropriate hashtags to their short messages. However, hashtag usage is sparse and heterogeneous, and users may spend a lot of time searching for appropriate hashtags. A system that assists users in this task is therefore important for increasing and homogenizing hashtag usage. In this chapter, the authors present a hashtag recommendation system for microblogging platforms that leverages semantic features. Furthermore, they conduct a detailed study of how the semantic-based model influences the final recommended hashtags under different ranking strategies. Moreover, they propose a linear and a machine learning based combination of these ranking strategies. The experimental results show that their approach improves content-based recommendations, achieving a recall of more than 47% when recommending 5 hashtags.
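
As a rough illustration of the linear combination of ranking strategies and the recall@5 evaluation the chapter reports, consider the sketch below; the candidate hashtags, per-strategy scores, and weights are hypothetical placeholders.

```python
# Sketch: linearly combine two ranking strategies, then compute recall@5.
content_score = {"#nlp": 0.9, "#ml": 0.7, "#data": 0.4, "#ai": 0.6}
semantic_score = {"#nlp": 0.8, "#ml": 0.5, "#data": 0.7, "#ai": 0.9}
w_content, w_semantic = 0.6, 0.4  # hypothetical combination weights

combined = {tag: w_content * content_score[tag] + w_semantic * semantic_score[tag]
            for tag in content_score}
top5 = sorted(combined, key=combined.get, reverse=True)[:5]

gold = {"#nlp", "#ai"}  # hashtags the user actually applied
recall_at_5 = len(gold & set(top5)) / len(gold)
print(top5, recall_at_5)
```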


Author(s):  
Subhadra Dutta ◽  
Eric M. O’Rourke

Natural language processing (NLP) is the field of decoding human written language. This chapter responds to the growing interest in using machine learning–based NLP approaches for analyzing open-ended employee survey responses. These techniques offer scalability and real-time insights, making qualitative data collection equally or more desirable in organizations. The chapter walks through the evolution of text analytics in industrial–organizational psychology and discusses relevant supervised and unsupervised machine learning NLP methods for survey text data, such as latent Dirichlet allocation, latent semantic analysis, sentiment analysis, word relatedness methods, and so on. The chapter also lays out preprocessing techniques and the trade-offs of growing NLP capabilities internally versus externally, points readers to available resources, and ends by discussing the implications and future directions of these approaches.
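
As a small illustration of one of the unsupervised methods the chapter covers, the sketch below applies latent semantic analysis to a handful of toy survey responses; the responses and component count are assumptions for demonstration only.

```python
# Sketch: latent semantic analysis (truncated SVD over TF-IDF) on survey text.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

responses = ["more flexible working hours would help",
             "my manager gives clear and timely feedback",
             "communication between teams needs improvement",
             "I would like remote work options"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(responses)

lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)  # responses mapped into a latent topic space

terms = tfidf.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    top = terms[comp.argsort()[::-1][:4]]  # strongest terms per component
    print(f"component {i}: {', '.join(top)}")
```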


2020 ◽  
Vol 10 (21) ◽  
pp. 7831
Author(s):  
Han Kyul Kim ◽  
Sae Won Choi ◽  
Ye Seul Bae ◽  
Jiin Choi ◽  
Hyein Kwon ◽  
...  

With growing interest in machine learning, text standardization is becoming an increasingly important aspect of data pre-processing within biomedical communities. As the performance of machine learning algorithms is affected by both the amount and the quality of training data, effective data standardization is needed to guarantee consistent data integrity. Furthermore, biomedical organizations, depending on their geographical locations or affiliations, rely on different text standardization conventions in practice. To facilitate easier machine learning-related collaborations between these organizations, an effective yet practical text data standardization method is needed. In this paper, we introduce MARIE (a context-aware term mapping method with string matching and embedding vectors), an unsupervised learning-based tool, to find standardized clinical terminologies for queries such as a hospital's own codes. By incorporating both string matching methods and term embedding vectors generated by BioBERT (bidirectional encoder representations from transformers for biomedical text mining), it utilizes both structural and contextual information to calculate similarity measures between source and target terms. Compared to previous term mapping methods, MARIE shows improved mapping accuracy. Furthermore, it can easily be expanded to incorporate any string matching or term embedding method. Without requiring any additional model training, it is not only an effective but also a practical term mapping method for text data standardization and pre-processing.
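
To illustrate the idea behind MARIE (though not its actual implementation), the sketch below blends a string-matching score with embedding cosine similarity to map a source term onto candidate standardized terms; the embeddings are random placeholders standing in for BioBERT vectors, and the equal weights are an assumption.

```python
# Sketch: combine string similarity and embedding similarity for term mapping.
import difflib
import numpy as np

rng = np.random.default_rng(0)
targets = ["myocardial infarction", "cerebral infarction", "cardiac arrest"]
embed = {t: rng.normal(size=768) for t in targets}  # placeholder "BioBERT" vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(source, source_vec, target):
    string_sim = difflib.SequenceMatcher(None, source, target).ratio()
    context_sim = cosine(source_vec, embed[target])
    return 0.5 * string_sim + 0.5 * context_sim  # assumed equal weighting

source, source_vec = "heart attack", rng.normal(size=768)
best = max(targets, key=lambda t: score(source, source_vec, t))
print(best)
```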

