Study on the semi-supervised learning-based patient similarity from heterogeneous electronic medical records

Abstract Background A new learning-based patient similarity measurement was proposed to measure patients’ similarity for heterogeneous electronic medical records (EMRs) data. Methods We first calculated feature-level similarities according to the features’ attributes. A domain expert provided patient similarity scores of 30 randomly selected patients. These similarity scores and feature-level similarities for 30 patients comprised the labeled sample set, which was used for the semi-supervised learning algorithm to learn the patient-level similarities for all patients. Then we used the k-nearest neighbor (kNN) classifier to predict four liver conditions. The predictive performances were compared in four different situations. We also compared the performances between personalized kNN models and other machine learning models. We assessed the predictive performances by the area under the receiver operating characteristic curve (AUC), F1-score, and cross-entropy (CE) loss. Results As the size of the random training samples increased, the kNN models using the learned patient similarity to select near neighbors consistently outperformed those using the Euclidean distance to select near neighbors (all P values < 0.001). The kNN models using the learned patient similarity to identify the top k nearest neighbors from the random training samples also had a higher best-performance (AUC: 0.95 vs. 0.89, F1-score: 0.84 vs. 0.67, and CE loss: 1.22 vs. 1.82) than those using the Euclidean distance. As the size of the similar training samples increased, which composed the most similar samples determined by the learned patient similarity, the performance of kNN models using the simple Euclidean distance to select the near neighbors degraded gradually. When exchanging the role of the Euclidean distance, and the learned patient similarity in selecting the near neighbors and similar training samples, the performance of the kNN models gradually increased. These two kinds of kNN models had the same best-performance of AUC 0.95, F1-score 0.84, and CE loss 1.22. Among the four reference models, the highest AUC and F1-score were 0.94 and 0.80, separately, which were both lower than those for the simple and similarity-based kNN models. Conclusions This learning-based method opened an opportunity for similarity measurement based on heterogeneous EMR data and supported the secondary use of EMR data.

Download Full-text

A machine-learning approach to predict postprandial hypoglycemia

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-019-0943-4 ◽

2019 ◽

Vol 19 (1) ◽

Cited By ~ 7

Author(s):

Wonju Seo ◽

You-Bin Lee ◽

Seunghyun Lee ◽

Sang-Man Jin ◽

Sung-Min Park

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Characteristic Curve ◽

Artificial Pancreas ◽

Individual Performance ◽

Machine Learning Algorithms ◽

Prediction Algorithm ◽

Support Vector ◽

K Nearest Neighbor

Abstract Background For an effective artificial pancreas (AP) system and an improved therapeutic intervention with continuous glucose monitoring (CGM), predicting the occurrence of hypoglycemia accurately is very important. While there have been many studies reporting successful algorithms for predicting nocturnal hypoglycemia, predicting postprandial hypoglycemia still remains a challenge due to extreme glucose fluctuations that occur around mealtimes. The goal of this study is to evaluate the feasibility of easy-to-use, computationally efficient machine-learning algorithm to predict postprandial hypoglycemia with a unique feature set. Methods We use retrospective CGM datasets of 104 people who had experienced at least one hypoglycemia alert value during a three-day CGM session. The algorithms were developed based on four machine learning models with a unique data-driven feature set: a random forest (RF), a support vector machine using a linear function or a radial basis function, a K-nearest neighbor, and a logistic regression. With 5-fold cross-subject validation, the average performance of each model was calculated to compare and contrast their individual performance. The area under a receiver operating characteristic curve (AUC) and the F1 score were used as the main criterion for evaluating the performance. Results In predicting a hypoglycemia alert value with a 30-min prediction horizon, the RF model showed the best performance with the average AUC of 0.966, the average sensitivity of 89.6%, the average specificity of 91.3%, and the average F1 score of 0.543. In addition, the RF showed the better predictive performance for postprandial hypoglycemic events than other models. Conclusion In conclusion, we showed that machine-learning algorithms have potential in predicting postprandial hypoglycemia, and the RF model could be a better candidate for the further development of postprandial hypoglycemia prediction algorithm to advance the CGM technology and the AP technology further.

Download Full-text

DEVELOPMENT AND COMPARATIVE ANALYSIS OF SEMI-SUPERVISED LEARNING ALGORITHMS ON A SMALL AMOUNT OF LABELED DATA

Bulletin of National Technical University KhPI Series System Analysis Control and Information Technologies ◽

10.20998/2079-0023.2021.01.16 ◽

2021 ◽

pp. 98-103

Author(s):

Klym Yamkovyi

Keyword(s):

Supervised Learning ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Center Of Mass ◽

Unlabeled Data ◽

Learning Approaches ◽

Classification Problems ◽

K Nearest Neighbor ◽

Supervised Learning Algorithms ◽

Label Information

The paper is dedicated to the development and comparative experimental analysis of semi-supervised learning approaches based on a mix of unsupervised and supervised approaches for the classification of datasets with a small amount of labeled data, namely, identifying to which of a set of categories a new observation belongs using a training set of data containing observations whose category membership is known. Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Unlabeled data, when used in combination with a small quantity of labeled data, can produce significant improvement in learning accuracy. The goal is semi-supervised methods development and analysis along with comparing their accuracy and robustness on different synthetics datasets. The proposed approach is based on the unsupervised K-medoids methods, also known as the Partitioning Around Medoid algorithm, however, unlike Kmedoids the proposed algorithm first calculates medoids using only labeled data and next process unlabeled classes – assign labels of nearest medoid. Another proposed approach is the mix of the supervised method of K-nearest neighbor and unsupervised K-Means. Thus, the proposed learning algorithm uses information about both the nearest points and classes centers of mass. The methods have been implemented using Python programming language and experimentally investigated for solving classification problems using datasets with different distribution and spatial characteristics. Datasets were generated using the scikit-learn library. Was compared the developed approaches to find average accuracy on all these datasets. It was shown, that even small amounts of labeled data allow us to use semi-supervised learning, and proposed modifications ensure to improve accuracy and algorithm performance, which was demonstrated during experiments. And with the increase of available label information accuracy of the algorithms grows up. Thus, the developed algorithms are using a distance metric that considers available label information. Keywords: Unsupervised learning, supervised learning. semi-supervised learning, clustering, distance, distance function, nearest neighbor, medoid, center of mass.

Download Full-text

An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records

Artificial Intelligence in Medicine ◽

10.1016/j.artmed.2015.04.007 ◽

2015 ◽

Vol 65 (2) ◽

pp. 155-166 ◽

Cited By ~ 41

Author(s):

Ramakanth Kavuluru ◽

Anthony Rios ◽

Yuan Lu

Keyword(s):

Supervised Learning ◽

Electronic Medical Records ◽

Medical Records ◽

Empirical Evaluation ◽

Learning Approaches ◽

Diagnosis Codes

Download Full-text

The Comparative Experimental Study of Multilabel Classification for Diagnosis Assistant Based on Chinese Obstetric EMRs

Journal of Healthcare Engineering ◽

10.1155/2018/7273451 ◽

2018 ◽

Vol 2018 ◽

pp. 1-9 ◽

Cited By ~ 4

Author(s):

Kunli Zhang ◽

Hongchao Ma ◽

Yueshu Zhao ◽

Hongying Zan ◽

Lei Zhuang

Keyword(s):

Electronic Medical Records ◽

Medical Records ◽

Latent Dirichlet Allocation ◽

Nearest Neighbor ◽

Fertility Level ◽

K Nearest Neighbor ◽

Multilabel Classification ◽

Multilabel Learning ◽

Chief Complaints ◽

Physical Examinations

Obstetric electronic medical records (EMRs) contain massive amounts of medical data and health information. The information extraction and diagnosis assistants of obstetric EMRs are of great significance in improving the fertility level of the population. The admitting diagnosis in the first course record of the EMR is reasoned from various sources, such as chief complaints, auxiliary examinations, and physical examinations. This paper treats the diagnosis assistant as a multilabel classification task based on the analyses of obstetric EMRs. The latent Dirichlet allocation (LDA) topic and the word vector are used as features and the four multilabel classification methods, BP-MLL (backpropagation multilabel learning), RAkEL (RAndom k labELsets), MLkNN (multilabel k-nearest neighbor), and CC (chain classifier), are utilized to build the diagnosis assistant models. Experimental results conducted on real cases show that the BP-MLL achieves the best performance with an average precision up to 0.7413 ± 0.0100 when the number of label sets and the word dimensions are 71 and 100, respectively. The result of the diagnosis assistant can be introduced as a supplementary learning method for medical students. Additionally, the method can be used not only for obstetric EMRs but also for other medical records.

Download Full-text

Analisis Matthew Correlation Coefficient pada K-Nearest Neighbor dalam Klasifikasi Ikan Hias

INFORMAL: Informatics Journal ◽

10.19184/isj.v5i2.18907 ◽

2020 ◽

Vol 5 (2) ◽

pp. 57

Author(s):

Novia Hasdyna ◽

Rozzi Kesuma Dinata

Keyword(s):

Machine Learning ◽

Correlation Coefficient ◽

Euclidean Distance ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Ornamental Fish ◽

Machine Learning Algorithm ◽

K Nearest Neighbor ◽

Matthew Correlation Coefficient

K-Nearest Neighbor (K-NN) is a machine learning algorithm that functions to classify data. This study aims to measure the performance of K-NN algorithm by using Matthew Correlation Coefficient (MCC). The data that used in this study are the ornamental fish which consisting of 3 classes named Premium, Medium, and Low. The analysis results of the Matthew Correlation Coefficient on K-NN using Euclidean Distance obtained the highest MCC value in Medium class which is 0.786542. The second highest MCC value is in Premium class which is 0.567434. The lowest MCC value is in Low class which is 0.435269. Overall, the MCC values is statistically which is 0,596415.

Download Full-text

Data Mining Approach to Identify Disease Cohorts from Primary Care Electronic Medical Records: A Case of Diabetes Mellitus

The Open Bioinformatics Journal ◽

10.2174/1875036201710010016 ◽

2017 ◽

Vol 10 (1) ◽

pp. 16-27 ◽

Cited By ~ 11

Author(s):

Ebenezer S. Owusu Adjah ◽

Olga Montvida ◽

Julius Agbeve ◽

Sanjoy K. Paul

Keyword(s):

Diabetes Mellitus ◽

Primary Care ◽

Electronic Medical Records ◽

Medical Records ◽

Characteristic Curve ◽

Large Population ◽

Classification Problem ◽

True Positive Rate ◽

Population Based ◽

True Negative

Background:Identification of diseased patients from primary care based electronic medical records (EMRs) has methodological challenges that may impact epidemiologic inferences.Objective:To compare deterministic clinically guided selection algorithms with probabilistic machine learning (ML) methodologies for their ability to identify patients with type 2 diabetes mellitus (T2DM) from large population based EMRs from nationally representative primary care database.Methods:Four cohorts of patients with T2DM were defined by deterministic approach based on disease codes. The database was mined for a set of best predictors of T2DM and the performance of six ML algorithms were compared based on cross-validated true positive rate, true negative rate, and area under receiver operating characteristic curve.Results:In the database of 11,018,025 research suitable individuals, 379 657 (3.4%) were coded to have T2DM. Logistic Regression classifier was selected as best ML algorithm and resulted in a cohort of 383,330 patients with potential T2DM. Eighty-three percent (83%) of this cohort had a T2DM code, and 16% of the patients with T2DM code were not included in this ML cohort. Of those in the ML cohort without disease code, 52% had at least one measure of elevated glucose level and 22% had received at least one prescription for antidiabetic medication.Conclusion:Deterministic cohort selection based on disease coding potentially introduces significant mis-classification problem. ML techniques allow testing for potential disease predictors, and under meaningful data input, are able to identify diseased cohorts in a holistic way.

Download Full-text

A Hybrid Semi-supervised Learning Approach to Identifying Protected Health Information in Electronic Medical Records

Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication - IMCOM '16 ◽

10.1145/2857546.2857630 ◽

2016 ◽

Cited By ~ 2

Author(s):

Nguyen Dong Phuong ◽

Vo Thi Ngoc Chau ◽

Ho Tu Bao

Keyword(s):

Supervised Learning ◽

Health Information ◽

Electronic Medical Records ◽

Medical Records ◽

Learning Approach ◽

Protected Health Information

Download Full-text

Similar Pair-free Partial Label Metric Learning

International Journal of Circuits, Systems and Signal Processing ◽

10.46300/9106.2022.16.26 ◽

2022 ◽

Vol 16 ◽

pp. 215-223

Author(s):

Houjie Li ◽

Min Yang ◽

Yu Zhou ◽

Ruirui Zheng ◽

Wenpeng Liu ◽

...

Keyword(s):

Learning Algorithm ◽

Learning Algorithms ◽

Metric Learning ◽

Learning Technology ◽

K Nearest Neighbor ◽

Learning Framework ◽

Training Samples ◽

Lower Accuracy ◽

Partial Label Learning ◽

Similar Pair

Partial label learning is a new weak- ly supervised learning framework. In this frame- work, the real category label of a training sample is usually concealed in a set of candidate labels, which will lead to lower accuracy of learning al- gorithms compared with traditional strong super- vised cases. Recently, it has been found that met- ric learning technology can be used to improve the accuracy of partial label learning algorithm- s. However, because it is difficult to ascertain similar pairs from training samples, at present there are few metric learning algorithms for par- tial label learning framework. In view of this, this paper proposes a similar pair-free partial la- bel metric learning algorithm. The main idea of the algorithm is to define two probability distri- butions on the training samples, i.e., the proba- bility distribution determined by the distance of sample pairs and the probability distribution de- termined by the similarity of candidate label set of sample pairs, and then the metric matrix is ob- tained via minimizing the KL divergence of the two probability distributions. The experimental results on several real-world partial label dataset- s show that the proposed algorithm can improve the accuracy of k-nearest neighbor partial label learning algorithm (PL-KNN) better than the ex- isting partial label metric learning algorithms, up to 8 percentage points.

Download Full-text