Derivation and Validation of a Record Linkage Algorithm between EMS and the Emergency Department

2017 ◽  
Author(s):  
Colby Redfield ◽  
Abdulhakim Tlimat ◽  
Yoni Halpern ◽  
David Schoenfeld ◽  
Edward Ullman ◽  
...  

Abstract. Background: Linking EMS electronic patient care reports (ePCRs) to ED records can provide clinicians access to vital information that can alter management. It can also create rich databases for research and quality improvement. Unfortunately, previous attempts at ePCR-ED record linkage have had limited success. Objective: To derive and validate an automated record linkage algorithm between EMS ePCRs and ED records using supervised machine learning. Methods: All consecutive ePCRs from a single EMS provider between June 2013 and June 2015 were included. A primary reviewer matched ePCRs to a list of ED patients to create a gold standard. Age, gender, last name, first name, social security number (SSN), and date of birth (DOB) were extracted. Data were randomly split into 80%/20% training and test sets. We derived missing indicators, identical indicators, edit distances, and percent differences. A multivariate logistic regression model was trained using 5-fold (label k-fold) cross-validation, L2 regularization, and class re-weighting. Results: A total of 14,032 ePCRs were included in the study. Inter-rater reliability between the primary and secondary reviewers had a kappa of 0.9. The algorithm had a sensitivity of 99.4%, a PPV of 99.9%, and an AUC of 0.99 in both the training and test sets. DOB match had the highest odds ratio (16.9), followed by last name match (10.6). SSN match had an odds ratio of 3.8. Conclusions: We successfully derived and validated a probabilistic record linkage algorithm from a single EMS ePCR provider to our hospital EMR.
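
As a rough illustration of this setup, the sketch below derives the described comparison features (missing indicators, identical indicators, edit distances, and a percent difference) for candidate ePCR/ED pairs and scores an L2-regularized, class-reweighted logistic regression with grouped 5-fold cross-validation; scikit-learn's GroupKFold (formerly named LabelKFold) corresponds to the "label k-fold" splitting mentioned. The `candidate_pairs` table, its column names, and the use of scikit-learn and python-Levenshtein are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch; the candidate_pairs DataFrame and its columns are hypothetical.
import numpy as np
import pandas as pd
from Levenshtein import distance as edit_distance  # python-Levenshtein (assumed installed)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

def pair_features(row):
    """Derive comparison features for one ePCR/ED candidate pair."""
    feats = {}
    for field in ("last_name", "first_name", "ssn", "dob"):
        a, b = row[f"ems_{field}"], row[f"ed_{field}"]
        feats[f"{field}_missing"] = float(pd.isna(a) or pd.isna(b))
        feats[f"{field}_identical"] = float(a == b)
        if isinstance(a, str) and isinstance(b, str):
            feats[f"{field}_edit_dist"] = edit_distance(a, b)
        else:
            feats[f"{field}_edit_dist"] = np.nan
    # Percent difference for the numeric field (age).
    a, b = row["ems_age"], row["ed_age"]
    feats["age_pct_diff"] = abs(a - b) / max(a, b, 1)
    return pd.Series(feats)

X = candidate_pairs.apply(pair_features, axis=1).fillna(-1)
y = candidate_pairs["is_true_match"]   # gold-standard labels from the reviewers
groups = candidate_pairs["epcr_id"]    # keep all pairs of one record in a single fold

# L2-regularized model with class re-weighting, scored with grouped 5-fold CV.
model = LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000)
scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5),
                         groups=groups, scoring="roc_auc")
print("AUC per fold:", scores)
```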

Author(s):  
WASIF AFZAL ◽  
RICHARD TORKAR ◽  
ROBERT FELDT

Given the number of algorithms for classification and prediction in software engineering, there is a need for a systematic way of assessing their performance. Performance assessment is typically done by some form of partitioning or resampling of the original data to alleviate biased estimation. For predictive and classification studies in software engineering, there is a lack of definitive advice on the most appropriate resampling method to use. This is seen as one of the contributing factors for not being able to draw general conclusions on which modeling technique or set of predictor variables is the most appropriate. Furthermore, the use of a variety of resampling methods makes it impossible to perform any formal meta-analysis of the primary study results. It is therefore desirable to examine the influence of various resampling methods and to quantify possible differences. Objective and method: This study empirically compares five common resampling methods (hold-out validation, repeated random sub-sampling, 10-fold cross-validation, leave-one-out cross-validation, and non-parametric bootstrapping) using eight publicly available data sets, with genetic programming (GP) and multiple linear regression (MLR) as software quality classification approaches. The location of (PF, PD) pairs in the ROC (receiver operating characteristic) space and the area under the ROC curve (AUC) are used as accuracy indicators. Results: In terms of the location of (PF, PD) pairs in the ROC space, bootstrapping results are in the preferred region for three of the eight data sets for GP and for four of the eight data sets for MLR. Based on the AUC measure, there are no significant differences between the resampling methods for either GP or MLR. Conclusion: Certain data set properties may be responsible for the insignificant differences between the resampling methods based on AUC; these include imbalanced data sets, insignificant predictor variables, and high-dimensional data sets. With the current selection of data sets and classification techniques, bootstrapping is the preferred method based on the location of (PF, PD) pairs in the ROC space. Hold-out validation is not a good choice for comparatively smaller data sets, where leave-one-out cross-validation (LOOCV) performs better. For comparatively larger data sets, 10-fold cross-validation performs better than LOOCV.
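
The five resampling schemes compared in the study can be sketched in a few lines. The snippet below is a minimal illustration, assuming a feature matrix `X` and binary fault labels `y` (NumPy arrays) are already loaded; scikit-learn's linear regression stands in for the MLR models, with its continuous output used as the ROC score. The split sizes and repetition counts are assumptions, not the study's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     cross_val_predict, train_test_split)

model = LinearRegression()  # MLR; its continuous output serves as the ROC score

# Hold-out validation (a single 2/3 vs. 1/3 split is assumed here).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print("hold-out:", roc_auc_score(y_te, model.fit(X_tr, y_tr).predict(X_te)))

# Repeated random sub-sampling: average AUC over independent random splits.
ss = ShuffleSplit(n_splits=10, test_size=1/3, random_state=0)
aucs = [roc_auc_score(y[te], model.fit(X[tr], y[tr]).predict(X[te]))
        for tr, te in ss.split(X)]
print("sub-sampling:", np.mean(aucs))

# 10-fold CV and LOOCV: pool the held-out predictions, then score once.
for name, cv in [("10-fold", KFold(10, shuffle=True, random_state=0)),
                 ("LOOCV", LeaveOneOut())]:
    preds = cross_val_predict(model, X, y, cv=cv)
    print(name + ":", roc_auc_score(y, preds))

# Non-parametric bootstrap: train on a resample, test on the out-of-bag rows.
rng = np.random.RandomState(0)
aucs = []
for _ in range(200):
    idx = rng.randint(0, len(X), len(X))
    oob = np.setdiff1d(np.arange(len(X)), idx)  # rows not drawn this round
    aucs.append(roc_auc_score(y[oob], model.fit(X[idx], y[idx]).predict(X[oob])))
print("bootstrap:", np.mean(aucs))
```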


2021 ◽  
pp. 13-26
Author(s):  
Felix Kruse ◽  
Jan-Philipp Awick ◽  
Jorge Marx Gómez ◽  
Peter Loos

This paper explores record linkage, a step of the data integration process, focusing on the entity type company. For the integration of company data, the company name is a crucial attribute, and it often includes the legal form. The legal form is not represented concisely and consistently across different data sources, which leads to considerable data quality problems for the subsequent steps in record linkage. To solve these problems, we classify and extract the legal form from the company name attribute. For this purpose, we iteratively developed four different approaches and compared them in a benchmark. The best approach is a hybrid one combining a rule set and a supervised machine learning model. With our hybrid approach, any company data set from research or business can be processed. Thus, the data quality for subsequent data processing steps such as record linkage can be improved. Furthermore, our approach can be adapted to solve the same data quality problems in other attributes.
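
A minimal sketch of the hybrid idea is shown below: a hand-written rule set catches unambiguous legal forms, and a supervised model handles whatever the rules miss. The legal-form variants, the `extract_legal_form` helper, and the scikit-learn-style `fallback_model` interface are all hypothetical; the paper's actual rule set and model are not reproduced here.

```python
import re

# Rule set: canonical legal form -> spelling variants seen in company names.
LEGAL_FORMS = {
    "GmbH": [r"\bgmbh\b", r"\bgesellschaft mit beschränkter haftung\b"],
    "AG":   [r"\bag\b", r"\baktiengesellschaft\b"],
    "Inc":  [r"\binc\.?\b", r"\bincorporated\b"],
    "Ltd":  [r"\bltd\.?\b", r"\blimited\b"],
}

def extract_legal_form(company_name, fallback_model=None):
    """Return (base_name, legal_form); legal_form is None if nothing matched."""
    name = company_name.lower()
    for form, patterns in LEGAL_FORMS.items():
        for pat in patterns:
            if re.search(pat, name):
                base = re.sub(pat, "", name).strip(" ,.-")
                return base, form
    # Fallback: let a supervised model classify names the rules miss
    # (any classifier with a scikit-learn-style predict() would do).
    if fallback_model is not None:
        return name, fallback_model.predict([name])[0]
    return name, None

print(extract_legal_form("Müller & Söhne GmbH"))  # -> ('müller & söhne', 'GmbH')
```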


2018 ◽  
Vol 7 (2.15) ◽  
pp. 136 ◽  
Author(s):  
Rosaida Rosly ◽  
Mokhairi Makhtar ◽  
Mohd Khalid Awang ◽  
Mohd Isa Awang ◽  
Mohd Nordin Abdul Rahman

This paper analyses the performance of classification models using single classifiers and combinations of ensemble methods, with the Breast Cancer Wisconsin and Hepatitis data sets as training data. It presents a comparison of different classifiers based on 10-fold cross-validation using a data mining tool. In this experiment, various classifiers are implemented, including three popular ensemble methods for the combinations: boosting, bagging, and stacking. The results show that for the Breast Cancer Wisconsin data set, the single Naïve Bayes (NB) classifier and the bagging+NB combination displayed the highest accuracy at the same percentage (97.51%) compared to the other combinations of ensemble classifiers. For the Hepatitis data set, the combination of stacking and a Multi-Layer Perceptron (MLP) achieved the highest accuracy, at 86.25%. By using ensemble classifiers, the results may be improved. In future work, a multi-classifier approach will be proposed by introducing fusion at the classification level between these classifiers to obtain classification with higher accuracies.
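
The comparison can be approximated with off-the-shelf components. The sketch below uses scikit-learn stand-ins (rather than the data mining tool used in the paper) to score plain Naïve Bayes, bagging+NB, and a stacking ensemble with an MLP meta-learner under 10-fold cross-validation on the Breast Cancer Wisconsin data; the particular base learners in the stack are an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # Breast Cancer Wisconsin

candidates = {
    "NB": GaussianNB(),
    "bagging+NB": BaggingClassifier(GaussianNB(), n_estimators=10, random_state=0),
    "stacking+MLP": StackingClassifier(
        estimators=[("nb", GaussianNB()),
                    ("tree", DecisionTreeClassifier(random_state=0))],
        final_estimator=MLPClassifier(max_iter=2000, random_state=0)),
}
for name, clf in candidates.items():
    acc = cross_val_score(clf, X, y, cv=10).mean()  # 10-fold cross-validation
    print(f"{name}: {acc:.4f}")
```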


2016 ◽  
Vol 28 (8) ◽  
pp. 1694-1722 ◽  
Author(s):  
Yu Wang ◽  
Jihong Li

In typical machine learning applications such as information retrieval, precision and recall are two commonly used measures for assessing an algorithm's performance. Symmetrical confidence intervals based on K-fold cross-validated t distributions are widely used for the inference of precision and recall measures. As we confirmed through simulated experiments, however, these confidence intervals often exhibit degrees of confidence below the nominal level, which may easily lead to liberal inference results. Thus, it is crucial to construct faithful confidence (credible) intervals for precision and recall with a high degree of confidence and a short interval length. In this study, we propose two posterior credible intervals for precision and recall based on K-fold cross-validated beta distributions. The first credible interval for precision (or recall) is constructed from the beta posterior distribution inferred from all K data sets corresponding to the K confusion matrices of a K-fold cross-validation. Second, considering that each data set corresponding to a confusion matrix from a K-fold cross-validation can be used to infer a beta posterior distribution of precision (or recall), the second proposed credible interval is constructed from the average of these K beta posterior distributions. Experimental results on simulated and real data sets demonstrate that the first credible interval proposed in this study almost always resulted in degrees of confidence greater than 95%. With an acceptable degree of confidence, both of our proposed credible intervals have shorter interval lengths than those based on a corrected K-fold cross-validated t distribution. Meanwhile, the average ranks of these two credible intervals are superior to that of the confidence interval based on a K-fold cross-validated t distribution for the degree of confidence, and superior to that of the confidence interval based on a corrected K-fold cross-validated t distribution for the interval length, in all 27 cases of the simulated and real data experiments. The confidence intervals based on the K-fold and corrected K-fold cross-validated t distributions lie at the two extremes. Thus, when the reliability of the inference for precision and recall is the focus, the proposed methods are preferable, especially the first credible interval.
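
Under a uniform Beta(1, 1) prior, precision with TP true positives and FP false positives has a Beta(TP + 1, FP + 1) posterior, which makes both constructions easy to sketch. The snippet below is a hedged reading of the two intervals, with made-up per-fold counts: the first pools the counts of all K folds into one beta posterior, and the second averages the K per-fold beta posteriors into an equal-weight mixture whose quantiles are found numerically.

```python
import numpy as np
from scipy.stats import beta

tp = np.array([30, 28, 31, 29, 30])   # true positives per fold (example data)
fp = np.array([3, 5, 2, 4, 3])        # false positives per fold (example data)

# Interval 1: one beta posterior from the counts of all K folds pooled.
lo, hi = beta.ppf([0.025, 0.975], 1 + tp.sum(), 1 + fp.sum())
print(f"pooled-posterior 95% interval: ({lo:.3f}, {hi:.3f})")

# Interval 2: average of the K per-fold beta posteriors (an equal-weight
# mixture), with quantiles located numerically on a fine grid.
grid = np.linspace(0, 1, 20001)
mix_cdf = np.mean([beta.cdf(grid, 1 + t, 1 + f) for t, f in zip(tp, fp)], axis=0)
lo2 = grid[np.searchsorted(mix_cdf, 0.025)]
hi2 = grid[np.searchsorted(mix_cdf, 0.975)]
print(f"averaged-posterior 95% interval: ({lo2:.3f}, {hi2:.3f})")
```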


2012 ◽  
Vol 433-440 ◽  
pp. 3959-3963 ◽  
Author(s):  
Bayram Akdemir ◽  
Nurettin Çetinkaya

In distribution systems, load forecasting is one of the major management problems, bearing on energy flow, system protection, and economic operation. To manage the system, the next step of the load characteristic must be inferred from historical data sets. Forecasting draws not only on historical parameters but also on external parameters such as weather conditions, seasons, and population, which matter greatly for predicting the next behavior of the load characteristic. Holidays and weekdays affect energy consumption differently in any country. In this study, the target is to forecast the peak energy level for the next hour and to compare the effects of weekdays and holidays on peak energy needs. Energy consumption data sets have nonlinear characteristics, and it is not easy to fit any curve to them because of this nonlinearity and the large number of parameters. To forecast the peak energy level, an adaptive neuro-fuzzy inference system (ANFIS) is used, and the hourly effects of holidays and weekdays on the peak energy level are examined. The outputs of the artificial intelligence model are evaluated with two-fold cross-validation and the mean absolute percentage error (MAPE). The two-fold cross-validation error, expressed as MAPE, is 3.51%, and the data set that includes holidays is more accurate than the one without holidays; total success increased by 2.4%.
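
The evaluation protocol itself is straightforward to reproduce. The sketch below scores a model with two-fold cross-validation and MAPE, assuming an hourly feature matrix `X` (including a holiday indicator) and a peak-load target `y` are already loaded as NumPy arrays; scikit-learn's MLP regressor is only a stand-in for the ANFIS model used in the study.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor  # stand-in for the ANFIS model

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000, random_state=0)
errors = []
for tr, te in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    model.fit(X[tr], y[tr])
    errors.append(mape(y[te], model.predict(X[te])))
print("two-fold CV MAPE: %.2f%%" % np.mean(errors))
```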


2005 ◽  
Vol 14 (01n02) ◽  
pp. 261-280 ◽  
Author(s):  
JIANG LI ◽  
MICHAEL T. MANRY ◽  
CHANGHUA YU ◽  
D. RANDALL WILSON

Algorithms that reduce the storage requirement of the nearest neighbor classifier (NNC) can be divided into three main categories: fast searching algorithms, instance-based learning algorithms, and prototype-based algorithms. We propose an algorithm, LVQPRU, for pruning NNC prototype vectors, obtaining a compact classifier with good performance. A basic condensing algorithm is applied to the initial prototypes to speed up the learning process, and the learning vector quantization (LVQ) algorithm is used to fine-tune the remaining prototypes during each pruning iteration. We evaluate LVQPRU on several data sets along with 12 other algorithms using ten-fold cross-validation. Simulation results show that the proposed algorithm has high generalization accuracy and good storage reduction ratios.
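
The LVQ fine-tuning step can be illustrated with the classic LVQ1 update rule: the prototype nearest to a training sample is pulled toward it when their classes agree and pushed away otherwise. The sketch below shows one such epoch; it is a generic LVQ1 pass, not the full LVQPRU pruning procedure.

```python
import numpy as np

def lvq1_epoch(prototypes, proto_labels, X, y, lr=0.05):
    """One LVQ1 pass over the training data; prototypes is modified in place."""
    for x, label in zip(X, y):
        dists = np.linalg.norm(prototypes - x, axis=1)
        w = np.argmin(dists)                       # winning (nearest) prototype
        direction = 1.0 if proto_labels[w] == label else -1.0
        prototypes[w] += direction * lr * (x - prototypes[w])
    return prototypes
```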


2019 ◽  
pp. 495-501
Author(s):  
Balasaheb Tarle ◽  
Muddana Akkalaksmi

In medical data classification, when data sets are small and contain multiple missing attribute values, improving classification performance is an important issue. The foremost objective of machine learning research is to improve the classification performance of classifiers, and the number of training instances provided must be sufficient. In the proposed algorithm, we substitute missing attribute values with values from the attribute's available domain, generating additional training tuples alongside the originals. Together, the additional and original training samples provide sufficient data for learning. A neuro-fuzzy classifier is trained on this data set, and its classification performance on test data is obtained using the k-fold cross-validation method. The proposed method attains around 2.8% and 3.61% improvement in classification accuracy for this classifier.
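
One reading of the tuple-expansion step is sketched below: each missing cell is replaced by every value observed in that attribute's domain, yielding extra training tuples alongside the originals. The `expand_missing` helper and its pandas-based handling are illustrative assumptions; in this simple sketch, rows that still contain a missing value after substitution are dropped.

```python
import pandas as pd

def expand_missing(df):
    """Return original rows plus one copy per domain value of each missing cell."""
    extra = []
    for col in df.columns:
        domain = df[col].dropna().unique()          # the attribute's observed domain
        for idx in df.index[df[col].isna()]:
            for value in domain:
                row = df.loc[idx].copy()
                row[col] = value                    # substitute one domain value
                extra.append(row)
    # Keep only complete tuples; rows with several gaps are dropped here.
    return pd.concat([df, pd.DataFrame(extra)], ignore_index=True).dropna()

train = pd.DataFrame({"bp": [120, None, 140], "class": ["a", "b", "a"]})
print(expand_missing(train))  # the None row yields one tuple per observed bp value
```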


2019 ◽  
Vol 21 (4) ◽  
pp. 1425-1436 ◽  
Author(s):  
Xiangxiang Zeng ◽  
Yue Zhong ◽  
Wei Lin ◽  
Quan Zou

Abstract. Identification of disease-associated circular RNAs (circRNAs) is of critical importance, especially given the dramatic increase in the number of known circRNAs. However, the availability of experimentally validated disease-associated circRNAs is limited, which restricts the development of effective computational methods. To our knowledge, systematic approaches for the prediction of disease-associated circRNAs are still lacking. In this study, we propose the use of deep forests combined with positive-unlabeled learning methods to predict potential disease-related circRNAs. In particular, a heterogeneous biological network involving 17,961 circRNAs, 469 miRNAs, and 248 diseases was constructed, and 24 meta-path-based topological features were extracted. We applied 5-fold cross-validation on 15 disease data sets to benchmark the proposed approach against other competitive methods, using Recall@k and PRAUC@k to evaluate their performance. In general, our method performed better than the other methods. In addition, the performance of all methods improved with the accumulation of known positive labels. Our results provide a new framework for investigating the associations between circRNAs and diseases and may improve our understanding of circRNA functions.
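
Of the two ranking metrics, Recall@k has a particularly compact definition: the fraction of a disease's known circRNA associations that appear among the top-k ranked predictions. A small sketch with toy scores follows.

```python
import numpy as np

def recall_at_k(scores, positives, k):
    """scores: predicted score per circRNA; positives: indices of known ones."""
    top_k = np.argsort(scores)[::-1][:k]          # indices of the k highest scores
    return len(set(top_k) & set(positives)) / len(positives)

scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7])      # toy scores for 5 circRNAs
print(recall_at_k(scores, positives=[0, 3], k=3))  # -> 0.5
```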

