Empirical evaluation of feature subset selection based on a real-world data set

2004 ◽  
Vol 17 (3) ◽  
pp. 285-288 ◽  
Author(s):  
Petra Perner ◽  
Chid Apte

2013 ◽  
Vol 47 ◽  
pp. 1-34 ◽  
Author(s):  
G. Wang ◽  
Q. Song ◽  
H. Sun ◽  
X. Zhang ◽  
B. Xu ◽  
...  

Many feature subset selection (FSS) algorithms have been proposed, but not all of them are appropriate for a given feature selection problem, and there is still no reliable way to choose an appropriate FSS algorithm for the problem at hand. Automatic recommendation of FSS algorithms is therefore important and practically useful. In this paper, a meta-learning-based method for automatic FSS algorithm recommendation is presented. The proposed method first identifies the data sets most similar to the one at hand with the k-nearest-neighbor algorithm, where distances between data sets are computed from commonly used data set characteristics. It then ranks all candidate FSS algorithms by their performance on these similar data sets and recommends the best-performing ones. The performance of the candidate FSS algorithms is evaluated with a multi-criteria metric that accounts not only for classification accuracy over the selected features, but also for the runtime of feature selection and the number of selected features. The proposed recommendation method is extensively tested on 115 real-world data sets with 22 well-known and frequently used FSS algorithms and five representative classifiers. The results show the effectiveness of the proposed FSS algorithm recommendation method.
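A minimal sketch of the recommendation step described above, assuming precomputed meta-features for the historical data sets and a precomputed multi-criteria score per (data set, algorithm) pair; the function names and the Euclidean distance choice are our assumptions, not details from the paper:

```python
import numpy as np

def recommend_fss(meta_features, algo_scores, new_meta, k=5, top_n=3):
    """Rank candidate FSS algorithms for a new data set via k-NN
    over data set meta-features.

    meta_features : (n_datasets, n_meta) data set characteristics
    algo_scores   : (n_datasets, n_algos) multi-criteria performance scores
                    (higher is better), combining accuracy, runtime, and
                    number of selected features
    new_meta      : (n_meta,) meta-feature vector of the data set at hand
    """
    # Distance from the new data set to every historical data set
    dists = np.linalg.norm(meta_features - new_meta, axis=1)
    # The k most similar historical data sets
    nearest = np.argsort(dists)[:k]
    # Average each algorithm's score over those neighbors
    mean_scores = algo_scores[nearest].mean(axis=0)
    # Indices of the top_n algorithms, best first
    return np.argsort(mean_scores)[::-1][:top_n]
```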


2019 ◽  
Vol 10 (03) ◽  
pp. 409-420 ◽  
Author(s):  
Steven Horng ◽  
Nathaniel R. Greenbaum ◽  
Larry A. Nathanson ◽  
James C. McClay ◽  
Foster R. Goss ◽  
...  

Objective: Numerous attempts have been made to create a standardized "presenting problem" or "chief complaint" list to characterize the nature of an emergency department visit. Previous attempts failed to gain widespread adoption because they were not freely shareable or did not contain the right level of specificity, structure, and clinical relevance to gain acceptance by the larger emergency medicine community. Using real-world data, we constructed a presenting problem list that addresses these challenges. Materials and Methods: We prospectively captured the presenting problems for 180,424 consecutive emergency department patient visits at an urban, academic, Level I trauma center in the Boston metro area. No patients were excluded. We used a consensus process to iteratively derive our system from real-world data. We used the first 70% of consecutive visits to derive our ontology, followed by a 6-month washout period, and the remaining 30% for validation. All concepts were mapped to the Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT). Results: Our system consists of a polyhierarchical ontology containing 692 unique concepts, 2,118 synonyms, and 30,613 nonvisible descriptions to correct misspellings and nonstandard terminology. Our ontology successfully captured structured data for 95.9% of visits in our validation data set. Discussion and Conclusion: We present the HierArchical Presenting Problem ontologY (HaPPy). This ontology was empirically derived and then iteratively validated by an expert consensus panel. HaPPy contains 692 presenting problem concepts, each mapped to SNOMED CT. This freely shareable ontology can facilitate presenting-problem-based quality metrics, research, and patient care.
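A toy sketch of how such an ontology can be queried: visible synonyms and nonvisible misspelling descriptions all resolve to one canonical concept that carries a SNOMED CT mapping. The entries below are illustrative placeholders, not actual HaPPy content:

```python
# Canonical concepts with their SNOMED CT mappings (illustrative entries).
CONCEPTS = {
    "chest pain": {"snomed_ct": "29857009"},
}

# Both visible synonyms and nonvisible descriptions (misspellings,
# nonstandard terms) normalize to a canonical concept.
LOOKUP = {
    "chest pain": "chest pain",
    "cp": "chest pain",          # visible synonym
    "chest pian": "chest pain",  # nonvisible misspelling description
}

def resolve(text: str):
    """Map a free-text presenting problem to (concept, SNOMED CT code)."""
    concept = LOOKUP.get(text.strip().lower())
    return (concept, CONCEPTS[concept]["snomed_ct"]) if concept else None

print(resolve("Chest Pian"))  # -> ('chest pain', '29857009')
```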


2019 ◽  
Vol 37 (7_suppl) ◽  
pp. 180-180 ◽  
Author(s):  
A. Oliver Sartor ◽  
Sreevalsa Appukkuttan ◽  
Ronald E. Aubert ◽  
Jeffrey Weiss ◽  
Joy Wang ◽  
...  

Background: Radium-223 (Ra-223) is the first FDA-approved targeted alpha therapy that significantly improves overall survival (OS) in patients (pts) with metastatic castration-resistant prostate cancer (mCRPC) and symptomatic bone metastases. There are limited real-world data describing current use of Ra-223. Methods: A retrospective patient chart review was done of men who received at least 1 cycle of Ra-223 for mCRPC in 10 centers throughout the US (4 academic, 6 private practices). All pts had a minimum follow-up of 4 months, or were placed in hospice or died. Descriptive analyses of clinical characteristics and treatment outcomes were performed. Results: Among the 200 pts (mean age 73.6 years, mean Charlson comorbidity index 6.9), Ra-223 was initiated on average 1.6 years from mCRPC diagnosis (first-line (1L) use = 38.5%, 2L = 31.5%, ≥3L = 30%). 78% completed 5-6 cycles of Ra-223, with a mean therapy duration of 4.2 months. Among all pts, 43% received Ra-223 as monotherapy (no overlap with other mCRPC therapies), while 57% had combination therapy with either abiraterone or enzalutamide. Median OS following Ra-223 initiation was 21.2 months (95% CI 19.6-29.2). The table provides Ra-223 utilization by type of clinical practice. Conclusions: Utilization of Ra-223 in this real-world data set was distinct from clinical trial data. Most patients received Ra-223 in combination with abiraterone or enzalutamide, therapies that were unavailable when the pivotal trial was conducted. Median survival was 21.2 months. Real-world use of Ra-223 has evolved as newer agents have become FDA approved in bone-metastatic CRPC. Academic and community patterns of practice were more similar than distinct. [Table: see text]


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. e18725-e18725 ◽  
Author(s):  
Ravit Geva ◽  
Barliz Waissengrin ◽  
Dan Mirelman ◽  
Felix Bokstein ◽  
Deborah T. Blumenthal ◽  
...  

Background: Healthcare data sharing is important for creating diverse and large data sets, supporting clinical decision making, and accelerating efficient research to improve patient outcomes. This is especially vital for real-world data analysis. However, stakeholders are reluctant to share their data without assurance of patients' privacy and proper protection of their data sets and the ways they are used. Homomorphic encryption is a cryptographic capability that can address these issues by enabling computation on encrypted data without ever decrypting it, so analytic results are obtained without revealing the raw data. The aim of this study is to demonstrate the accuracy of analytic results and the practical efficiency of the technology. Methods: A real-world data set of colorectal cancer patients' survival data following two different treatment interventions, including 623 patients and 24 variables and amounting to 14,952 items of data, was encrypted using leveled homomorphic encryption implemented in the PALISADE software library. Statistical analysis of key oncological endpoints was blindly performed on both the raw data and the homomorphically encrypted data using descriptive statistics and survival analysis with Kaplan-Meier curves. Results were then compared with an accuracy goal of two decimal places. Results: For all variables analyzed, the differences between results on the raw data and on the homomorphically encrypted data were within the predetermined accuracy goal; these results, along with the practical efficiency of the encrypted computation measured by runtime, are presented in the table. Conclusions: This study demonstrates that data encrypted with homomorphic encryption can be statistically analyzed with a precision of at least two decimal places, allowing clinical conclusions to be drawn safely while preserving patients' privacy and protecting data owners' data assets. Homomorphic encryption allows efficient computation on encrypted data non-interactively and without requiring decryption during computation. Utilizing the technology will empower large-scale cross-institution and cross-stakeholder collaboration, allowing safe international collaborations. Clinical trial information: 0048-19-TLV. [Table: see text]
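The study used leveled homomorphic encryption via the C++ PALISADE library; as a conceptual stand-in, the sketch below uses the additively homomorphic Paillier scheme from the python-paillier (phe) library to compute a mean over values that are never individually decrypted. The values are toy numbers, not study data:

```python
from phe import paillier  # python-paillier: additively homomorphic encryption

# The data owner keeps the private key; analysts receive only the public key.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

survival_months = [14.2, 21.2, 9.5, 30.1]  # toy values, not study data
encrypted = [public_key.encrypt(v) for v in survival_months]

# The analyst sums ciphertexts without decrypting them: Paillier supports
# addition of ciphertexts and multiplication by plaintext scalars.
encrypted_sum = sum(encrypted[1:], encrypted[0])

# Only the private-key holder can decrypt the aggregate result.
mean_survival = private_key.decrypt(encrypted_sum) / len(survival_months)
print(round(mean_survival, 2))  # 18.75
```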


1985 ◽  
Vol 22 (4) ◽  
pp. 462-467 ◽  
Author(s):  
Dennis H. Gensch

All disaggregate multiattribute choice models contain the assumption that the population is reasonably homogeneous with respect to the aggregate parameters estimated by the model. The author points out that one particular choice model, logit, has a structure that makes it particularly suited to test a data set for possible segments. A real-world data set is used to illustrate a simple procedure for testing the homogeneity assumption. The analysis provides a warning that managers may easily derive suboptimal or counterproductive strategies if they fail to test this assumption.
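A hedged sketch of the idea: fit a pooled logit and segment-specific logits, then compare them with a likelihood-ratio test. This is the standard LR version of a parameter-homogeneity check (shown here for binary logit), not necessarily Gensch's exact procedure:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def lr_homogeneity_test(X, y, segment):
    """Likelihood-ratio test of whether two segments share one set of
    logit parameters.

    X       : (n, k) design matrix (include a constant column)
    y       : (n,) binary choices
    segment : (n,) boolean mask splitting the sample into two groups
    """
    ll_pooled = sm.Logit(y, X).fit(disp=0).llf
    ll_a = sm.Logit(y[segment], X[segment]).fit(disp=0).llf
    ll_b = sm.Logit(y[~segment], X[~segment]).fit(disp=0).llf
    # Twice the log-likelihood gain from segment-specific parameters is
    # chi-squared with k degrees of freedom under homogeneity.
    lr_stat = 2.0 * (ll_a + ll_b - ll_pooled)
    p_value = chi2.sf(lr_stat, df=X.shape[1])
    return lr_stat, p_value
```

A small p-value indicates the homogeneity assumption fails, suggesting segment-level models (and strategies) are warranted.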


2018 ◽  
Vol 210 ◽  
pp. 04019 ◽  
Author(s):  
Hyontai SUG

Recent Go matches between human players and the artificial intelligence AlphaGo demonstrated major advances in machine learning technologies. While AlphaGo was trained on real-world data, AlphaGo Zero was trained on massive amounts of randomly generated data, and the fact that AlphaGo Zero defeated AlphaGo decisively showed that the diversity and size of the training data are important for the performance of machine learning algorithms, especially deep neural networks. Artificial neural networks and decision trees, meanwhile, are widely accepted machine learning algorithms because of their robustness to errors and their comprehensibility, respectively. In this paper, to show empirically that diversity and size of data are important factors for the performance of machine learning algorithms, these two representative algorithms are used in experiments. A real-world data set called breast tissue was chosen because it consists of real-valued attributes, a property well suited to generating artificial random data. The experimental results confirm that the diversity and size of data are very important factors for better performance.
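A minimal sketch of this kind of experiment, assuming the Breast Tissue attributes and classes are already loaded as NumPy arrays; the models, size fractions, and split details here are our assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def compare_by_training_size(X, y, fractions=(0.25, 0.5, 1.0), seed=0):
    """Train a decision tree and a neural network on growing subsets of
    the training data and report test accuracy for each size."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    rng = np.random.default_rng(seed)
    for frac in fractions:
        n = max(1, int(frac * len(X_tr)))
        idx = rng.choice(len(X_tr), size=n, replace=False)
        for name, model in [
            ("decision tree", DecisionTreeClassifier(random_state=seed)),
            ("neural net", MLPClassifier(max_iter=2000, random_state=seed)),
        ]:
            model.fit(X_tr[idx], y_tr[idx])
            acc = accuracy_score(y_te, model.predict(X_te))
            print(f"{name} with {frac:.0%} of training data: {acc:.3f}")
```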


2002 ◽  
Vol 14 (1) ◽  
pp. 21-41 ◽  
Author(s):  
Marco Saerens ◽  
Patrice Latinne ◽  
Christine Decaestecker

It sometimes happens (for instance, in case-control studies) that a classifier is trained on a data set that does not reflect the true a priori probabilities of the target classes on real-world data. This may have a negative effect on the classification accuracy obtained on the real-world data set, especially when the classifier's decisions are based on the a posteriori probabilities of class membership. Indeed, in this case, the trained classifier provides estimates of the a posteriori probabilities that are not valid for this real-world data set (they rely on the a priori probabilities of the training set). Applying the classifier as is (without correcting its outputs with respect to these new conditions) to this new data set may thus be suboptimal. In this note, we present a simple iterative procedure for adjusting the outputs of the trained classifier to these new a priori probabilities without having to refit the model, even when these probabilities are not known in advance. As a by-product, estimates of the new a priori probabilities are also obtained. This iterative algorithm is a straightforward instance of the expectation-maximization (EM) algorithm and is shown to maximize the likelihood of the new data. Thereafter, we discuss a statistical test that can be applied to decide whether the a priori class probabilities have changed from the training set to the real-world data. The procedure is illustrated on different classification problems involving a multilayer neural network, and comparisons with a standard procedure for a priori probability estimation are provided. Our original method, based on the EM algorithm, is shown to be superior to the standard one for a priori probability estimation. Experimental results also indicate that the classifier with adjusted outputs always performs better than the original one in terms of classification accuracy when the a priori probability conditions differ from the training set to the real-world data. The gain in classification accuracy can be significant.
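The iterative procedure described above can be sketched in a few lines of NumPy; this is a minimal version of the EM prior-adjustment scheme, with variable names of our choosing:

```python
import numpy as np

def adjust_priors_em(posteriors, train_priors, n_iter=100, tol=1e-8):
    """EM re-estimation of class priors on a new, unlabeled data set.

    posteriors   : (n, c) trained-classifier outputs p(class | x) on new data
    train_priors : (c,) class priors of the training set
    Returns (estimated new priors, adjusted posteriors).
    """
    new_priors = np.asarray(train_priors, dtype=float).copy()
    for _ in range(n_iter):
        # E-step: reweight each posterior by the ratio of current to
        # training priors, then renormalize per example.
        adjusted = posteriors * (new_priors / train_priors)
        adjusted /= adjusted.sum(axis=1, keepdims=True)
        # M-step: updated priors are the mean adjusted posterior per class.
        updated = adjusted.mean(axis=0)
        if np.allclose(updated, new_priors, atol=tol):
            break
        new_priors = updated
    # Final adjusted posteriors under the converged priors.
    adjusted = posteriors * (new_priors / train_priors)
    adjusted /= adjusted.sum(axis=1, keepdims=True)
    return new_priors, adjusted
```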

