Feature-Weighted Sampling for Proper Evaluation of Classification Models

2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate the model. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how we divide the training and test sets. Therefore, in this study, we proposed an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate cases of train/test sets using the R-value-based sampling method. We evaluated the similarity of distributions of the candidate cases with the whole dataset, and the case with the smallest distribution–difference was selected as the final train/test set. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more proper training and test sets than previous sampling methods, including random and non-random sampling.
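The selection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: candidate splits are generated by plain random shuffling rather than R-value-based sampling, the distance is an importance-weighted L1 difference between per-feature histograms, and the function names, bin count, and weights are all assumptions.

```python
import random

def histogram(values, lo, hi, bins=5):
    """Relative bin frequencies of `values` over the range [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]

def split_distance(rows, subset_idx, weights, bins=5):
    """Importance-weighted L1 distance between subset and whole-data histograms."""
    dist = 0.0
    for j, w in enumerate(weights):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        h_all = histogram(col, lo, hi, bins)
        h_sub = histogram([rows[i][j] for i in subset_idx], lo, hi, bins)
        dist += w * sum(abs(a - b) for a, b in zip(h_all, h_sub))
    return dist

def best_split(rows, weights, test_fraction=0.3, n_candidates=200, bins=5, seed=0):
    """Return (score, train_idx, test_idx) for the candidate closest to the whole dataset."""
    rng = random.Random(seed)
    n = len(rows)  # needs at least two rows
    n_test = min(max(1, int(round(n * test_fraction))), n - 1)
    best = None
    for _ in range(n_candidates):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        score = (split_distance(rows, train_idx, weights, bins)
                 + split_distance(rows, test_idx, weights, bins))
        if best is None or score < best[0]:
            best = (score, train_idx, test_idx)
    return best
```

Here `rows` is a list of numeric feature tuples and `weights` holds one hypothetical importance value per feature, for example normalized importances from any previously fitted model.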

2021 ◽  
Author(s):  
Dongchul Cha ◽  
Chongwon Pae ◽  
Se A Lee ◽  
Gina Na ◽  
Young Kyun Hur ◽  
...  

BACKGROUND Deep learning (DL)–based artificial intelligence may have different diagnostic characteristics than human experts in medical diagnosis. As a data-driven knowledge system, heterogeneous population incidence in the clinical world is considered to cause more bias to DL than clinicians. Conversely, by experiencing limited numbers of cases, human experts may exhibit large interindividual variability. Thus, understanding how the 2 groups classify given data differently is an essential step for the cooperative usage of DL in clinical application. OBJECTIVE This study aimed to evaluate and compare the differential effects of clinical experience in otoendoscopic image diagnosis in both computers and physicians exemplified by the class imbalance problem and guide clinicians when utilizing decision support systems. METHODS We used digital otoendoscopic images of patients who visited the outpatient clinic in the Department of Otorhinolaryngology at Severance Hospital, Seoul, South Korea, from January 2013 to June 2019, for a total of 22,707 otoendoscopic images. We excluded similar images, and 7500 otoendoscopic images were selected for labeling. We built a DL-based image classification model to classify the given image into 6 disease categories. Two test sets of 300 images were populated: balanced and imbalanced test sets. We included 14 clinicians (otolaryngologists and nonotolaryngology specialists including general practitioners) and 13 DL-based models. We used accuracy (overall and per-class) and kappa statistics to compare the results of individual physicians and the ML models. RESULTS Our ML models had consistently high accuracies (balanced test set: mean 77.14%, SD 1.83%; imbalanced test set: mean 82.03%, SD 3.06%), equivalent to those of otolaryngologists (balanced: mean 71.17%, SD 3.37%; imbalanced: mean 72.84%, SD 6.41%) and far better than those of nonotolaryngologists (balanced: mean 45.63%, SD 7.89%; imbalanced: mean 44.08%, SD 15.83%). 
However, ML models suffered from class imbalance problems (balanced test set: mean 77.14%, SD 1.83%; imbalanced test set: mean 82.03%, SD 3.06%). This was mitigated by data augmentation, particularly for low-incidence classes, but rare disease classes still had low per-class accuracies. Human physicians, despite being less affected by prevalence, showed high interphysician variability (ML models: kappa=0.83, SD 0.02; otolaryngologists: kappa=0.60, SD 0.07). CONCLUSIONS Even though ML models deliver excellent performance in classifying ear disease, physicians and ML models have their own strengths. ML models have consistent and high accuracy while considering only the given image and show bias toward prevalence, whereas human physicians have varying performance but do not show bias toward prevalence and may also consider extra information beyond the images. To deliver the best patient care given the shortage of otolaryngologists, our ML model can serve a cooperative role for clinicians with diverse expertise, as long as it is kept in mind that models consider only images and could be biased toward prevalent diseases even after data augmentation.
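The metrics named in this abstract (overall accuracy, per-class accuracy, and kappa) can be computed as in the sketch below. It assumes Cohen's kappa between two raters; the abstract does not say which kappa variant was used, and the function names are illustrative.

```python
from collections import Counter

def overall_accuracy(y_true, y_pred):
    """Fraction of cases where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_accuracy(y_true, y_pred):
    """Accuracy within each true class, independent of class prevalence."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

def cohen_kappa(a, b):
    """Chance-corrected agreement between two raters (undefined if expected == 1)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[c] * cb[c] for c in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)
```

Because overall accuracy is a prevalence-weighted average of the per-class accuracies, a model that does well on common classes can score higher on an imbalanced test set than on a balanced one while still doing poorly on rare classes.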


Author(s):  
André Maletzke ◽  
Waqar Hassan ◽  
Denis dos Reis ◽  
Gustavo Batista

Quantification is a task similar to classification in the sense that it learns from a labeled training set. However, quantification is not interested in predicting the class of each observation, but rather in measuring the class distribution in the test set. The community has developed performance measures and experimental setups tailored to quantification tasks. Nonetheless, we argue that a critical variable, the size of the test sets, remains ignored. Such disregard has three main detrimental effects. First, it implicitly assumes that quantifiers will perform equally well for different test set sizes. Second, it increases the risk of cherry-picking by selecting a test set size for which a particular proposal performs best. Finally, it disregards the importance of designing methods that are suitable for different test set sizes. We discuss these issues with the support of one of the broadest experimental evaluations ever performed, with three main outcomes. (i) We empirically demonstrate the importance of the test set size to assess quantifiers. (ii) We show that current quantifiers generally have a mediocre performance on the smallest test sets. (iii) We propose a metalearning scheme to select the best quantifier based on the test size that can outperform the best single quantification method.
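To illustrate why test set size matters for quantification, the sketch below implements two standard baseline quantifiers, classify-and-count and adjusted classify-and-count, and simulates their error at different test set sizes. The abstract does not name the quantifiers it evaluated; the simulated classifier (fixed true and false positive rates) and all parameter values are assumptions.

```python
import random

def classify_and_count(predictions):
    """Estimate positive-class prevalence as the fraction of positive predictions."""
    return sum(predictions) / len(predictions)

def adjusted_count(predictions, tpr, fpr):
    """Correct the raw estimate using the classifier's known TPR and FPR."""
    raw = classify_and_count(predictions)
    if tpr == fpr:  # the correction is undefined, so fall back to the raw estimate
        return raw
    return min(1.0, max(0.0, (raw - fpr) / (tpr - fpr)))

def mean_abs_error(size, prevalence, tpr, fpr, trials=200, seed=0):
    """Average |estimate - true prevalence| over simulated test sets of one size."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        preds = []
        for _ in range(size):
            positive = rng.random() < prevalence
            rate = tpr if positive else fpr
            preds.append(1 if rng.random() < rate else 0)
        total += abs(adjusted_count(preds, tpr, fpr) - prevalence)
    return total / trials
```

Calling `mean_abs_error` with sizes such as 10, 100, and 1000 shows how much noisier the prevalence estimate becomes on small test sets, even when the underlying classifier is unchanged.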


Molecules ◽  
2019 ◽  
Vol 24 (10) ◽  
pp. 2006 ◽  
Author(s):  
Liadys Mora Lagares ◽  
Nikola Minovski ◽  
Marjana Novič

P-glycoprotein (P-gp) is a transmembrane protein that actively transports a wide variety of chemically diverse compounds out of the cell. It is highly associated with the ADMET (absorption, distribution, metabolism, excretion and toxicity) properties of drugs/drug candidates and contributes to decreasing toxicity by eliminating compounds from cells, thereby preventing intracellular accumulation. Therefore, in the drug discovery and toxicological assessment process it is advisable to pay attention to whether a compound under development could be transported by P-gp or not. In this study, an in silico multiclass classification model capable of predicting the probability that a compound interacts with P-gp was developed using a counter-propagation artificial neural network (CP ANN) based on a set of 2D molecular descriptors, as well as an extensive dataset of 2512 compounds (1178 P-gp inhibitors, 477 P-gp substrates and 857 P-gp non-active compounds). The model provided a good classification performance, producing non-error rate (NER) values of 0.93 for the training set and 0.85 for the test set, while the average precision (AvPr) was 0.93 for the training set and 0.87 for the test set. An external validation set of 385 compounds was used to challenge the model’s performance. On the external validation set the NER and AvPr values were 0.70 for both indices. We believe that this in silico classifier could be effectively used as a reliable virtual screening tool for identifying potential P-gp ligands.
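The two indices reported here can be computed from a multiclass confusion matrix as sketched below. The sketch assumes the definitions common in chemometrics, where NER is the mean of the class sensitivities and AvPr is the mean of the class precisions; the abstract itself does not spell these out, and the function names are illustrative.

```python
def non_error_rate(cm):
    """Mean class sensitivity; cm[i][j] counts true class i predicted as class j."""
    k = len(cm)
    return sum(cm[i][i] / sum(cm[i]) for i in range(k)) / k

def average_precision(cm):
    """Mean class precision (not the area under a precision-recall curve)."""
    k = len(cm)
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    return sum(cm[j][j] / col_totals[j] for j in range(k)) / k
```

Both functions assume that every class has at least one true sample and at least one prediction; otherwise a division by zero occurs and the index is undefined.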


Blood ◽  
2012 ◽  
Vol 120 (21) ◽  
pp. 197-197
Author(s):  
Ricky D Edmondson ◽  
Shweta S. Chavan ◽  
Christoph Heuck ◽  
Bart Barlogie

Abstract 197 We and others have used gene expression profiling to classify multiple myeloma into high and low risk groups; here, we report the first combined GEP and proteomics study of a large number of baseline samples (n=85) of highly enriched tumor cells from patients with newly diagnosed myeloma. Peptide expression levels from MS data on CD138-selected plasma cells from a discovery set of 85 patients with newly diagnosed myeloma were used to identify proteins that were linked to short survival (OS < 3 years vs OS ≥ 3 years). The proteomics dataset consisted of intensity values for 11,006 peptides (representing 2,155 proteins), where intensity is the quantitative measure of peptide abundance. Peptide intensities were normalized by Z-score transformation, and significance analysis of microarrays (SAM) was applied, resulting in the identification of 24 peptides as differentially expressed between the two groups (OS < 3 years vs OS ≥ 3 years), with fold change ≥1.5 and FDR <5%. The 24 peptides mapped to 19 unique proteins, and all were present at higher levels in the group with shorter overall survival than in the group with longer overall survival. An independent SAM analysis with parameters identical to the proteomics analysis (fold change ≥1.5; FDR <5%) was performed with the Affymetrix U133Plus2 microarray chip based expression data. This analysis identified 151 probe sets that were differentially expressed between the two groups; 144 probe sets were present at higher levels and seven at lower levels in the group with shorter overall survival. Comparing the SAM analyses of proteomics and GEP data, we identified nine probe sets, corresponding to seven genes, with increased levels of both protein and mRNA in the short lived group. In order to validate these findings from the discovery experiment we used GEP data from a randomized subset of the TT3 patient population as a training set for determining the optimal cut-points for each of the nine probe sets. 
Thus, the TT3 population was randomized into two sub-populations for the training set (two-thirds of the population; n=294) and test set (one-third of the population; n=147); the Total Therapy 2 (TT2) patient population was used as an additional test set (n=441). A running log rank test was performed on the training set for each of the nine probe sets to determine its optimal gene expression cut-point. The cut-points derived from the training set were then applied to TT3 and TT2 test sets to investigate survival differences for the groups separated by the optimal cutpoint for each probe. The overall survival of the groups was visualized using the method of Kaplan and Meier, and a P-value was calculated (based on log-rank test) to determine whether there was a statistically significant difference in survival between the two groups (P ≤0.05). We performed univariate regression analysis using Cox proportional hazard model with the nine probe sets as variables on the TT3 test set. To identify which of the genes corresponding to these nine probes had an independent prognostic value, we performed a multivariate stepwise Cox regression analysis, wherein CACYBP, FABP5, and IQGAP2 retained significance after competing with the remaining probe sets in the analysis. CACYBP had the highest hazard ratio (HR 2.70, P-value 0.01). We then performed the univariate and multivariate analyses on the TT2 test set where CACYBP, CORO1A, ENO1, and STMN1 were selected by the multivariate analysis, and CACYBP had the highest hazard ratio (HR 1.93, P-value 0.004). CACYBP was the only gene selected by multivariate analyses of both test sets. Disclosures: No relevant conflicts of interest to declare.
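The Kaplan-Meier method used to visualize survival in the two cut-point groups can be sketched in a few lines. This is a generic product-limit estimator under the usual conventions (event flag 1 for death, 0 for censoring); it is not the authors' analysis code, and the function name is illustrative.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where at least one death occurs."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored subjects both leave the risk set
    return curve
```

To compare groups as in the abstract, the patients would first be divided by whether a probe set's expression exceeds its cut-point, and the estimator would then be applied to each group separately.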


2017 ◽  
Vol 35 (15_suppl) ◽  
pp. e15575-e15575
Author(s):  
Brice Jabo ◽  
John W. Morgan ◽  
Mayada A. Aljehani ◽  
Matthew J Selleck ◽  
Albert Y. Lin

e15575 Background: Gastric cancer (GC) mortality remains high, with a 5-year survival of 30 percent. For patients with resectable GC, mortality varies depending on both patient and tumor characteristics. The current study sought to develop a web-based prognostic model to assist patients and health care providers in decision making regarding either surgery-only or adjuvant chemoradiotherapy (CRT). Methods: California SEER data was used and records, including demographic, pathologic, and treatment information, for 2,583 patients diagnosed with stage IB to III GC and treated with either surgery only or adjuvant CRT from 2006 to 2013 were retrieved. Purposeful selection using a Cox regression model was used to identify important mortality predictors. Additionally, with simple random sampling, 70% of the data were assigned to the training set and the remaining 30% were assigned to the test set. Furthermore, a generalized boosted classification model was trained using the training set and validated using the test set. Area under the curve (AUC) of the receiver operating characteristic (ROC), sensitivity, specificity and accuracy were determined for 5- and 10-year mortality. Results: The median survival was 33 months for patients in the training set, and 32 for the test set. Predictors included in the model were age, ethnicity (Asian/other, Hispanic, non-Hispanic black and non-Hispanic white), T-stage, histology (intestinal, diffuse and other), presence of signet ring (yes/no), proximal location (yes/no), lymph node ratio, and CRT following surgery (yes/no). Validation of the model on the test set showed as follows: AUC, sensitivity, specificity and accuracy of 0.78 (95% CI = 0.75, 0.82), 0.75, 0.65 and 0.70 for 5-year survival and 0.77 (95% CI = 0.74, 0.80), 0.79, 0.55 and 0.70 for 10-year survival. 
Conclusions: The proposed web-based prognostic tool using readily available patient and tumor characteristics provides validated and personalized prognostic information to aid clinicians and patients in the GC adjuvant treatment decision process. [Table: see text]
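The evaluation protocol in this abstract (a 70/30 simple random split, then AUC, sensitivity, specificity, and accuracy on the test set) can be sketched as below. The boosted model itself is not reproduced; `scores` stands for any model's predicted probabilities, and the 0.5 threshold and function names are assumptions.

```python
import random

def random_split(records, train_fraction=0.7, seed=0):
    """Simple random sampling into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def auc(scores, labels):
    """ROC AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec_acc(scores, labels, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a fixed decision threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(labels)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a test set of a few hundred patients but would call for a rank-based formula on larger data.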


Author(s):  
Wang Zongbao

The distributed power generation in Gansu Province is dominated by wind power and photovoltaic power. Most of these distributed power plants are located in underdeveloped areas. Due to the weak local consumption capacity, the distributed electricity is mainly transmitted to and consumed in other regions. A key indicator that affects ultra-long-distance power transmission is line loss. This is an important indicator of the economic operation of the power system, and it also comprehensively reflects the planning, design, production and operation level of power companies. However, most current research on line loss focuses on ultra-high-voltage lines (≥110 kV), and less attention has been paid to distributed power generation lines below 110 kV. In this study, taking 35 kV and 110 kV lines as examples and combining existing weather, equipment, operation, power outage, and other data, we first compile an analysis table of line loss impact factors. Secondly, from the perspective of feature relevance and feature importance, we analyze the factors that affect line loss, and obtain data with higher feature relevance and feature importance ranking. In the experiment, these two factors are determined as the final line loss influence factor. Then, based on the conclusion of the line loss influencing factor, the optimized random forest regression algorithm is used to construct the line loss prediction model. The prediction verification results show that the training set error is 0.021 and the test set error is 0.026. The difference between the training set and test set prediction errors is only 0.005. The experimental results show that the optimized random forest algorithm can indeed analyze the line loss of 35 kV and 110 kV lines well, and can also explain the performance of 110-EaR1120 reasonably.
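One simple way to rank candidate factors by relevance to line loss is sketched below. It assumes that "feature relevance" means the absolute Pearson correlation with the target, which the abstract does not confirm; the feature names in the usage note are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)  # undefined if either sequence is constant

def rank_by_relevance(features, target):
    """Feature names sorted by |correlation with the target|, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `rank_by_relevance({"load": [...], "temperature": [...]}, line_loss)` would return the names ordered by linear association with the measured loss. A tree-based importance ranking, as in the abstract, would additionally capture nonlinear effects.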


Author(s):  
Iman Dwi Almunandar ◽  
Nellawaty A. Tewu ◽  
Anshari Al-Ghaniyy

The emergence of academic procrastination behavior among students in Indonesia, especially students of the Faculty of Psychology at YARSI University, has become a habit that should not be underestimated, as it frequently interferes with the effectiveness of the learning process. The lecturers at the Faculty of Psychology have often warned students to complete and submit assignments by the predetermined deadlines. However, students still miss them. According to the researchers, this problem needs to be addressed with proper training to minimize the students' academic procrastination behavior. In this study, the researchers conducted a needs analysis to decide whether the students need training. The study included 30 respondents who were chosen by random sampling. Academic procrastination behavior was measured using the theory of McCloskey (2011), which has six dimensions: Psychological Belief about Abilities, Distractions, Social Factor of Procrastination, Time Management, Personal Initiative, and Laziness. The needs analysis methods were a questionnaire, interviews, observations, focus group discussion (FGD), and intelligence tests. The results of the needs analysis show that psychology students of the 2015 cohort at the Faculty of Psychology, YARSI University, need training on time management. Keywords: Procrastination; College Students; Analyze Needs


Author(s):  
C. Radha

An important problem in pattern recognition is that of pattern classification. The objective of classification is to determine a discriminant function which is consistent with the given training examples and performs reasonably well on an unlabeled test set of examples. The degree of performance of the classifier on the test examples, known as its generalization performance, is an important issue in the design of the classifier. It has been established that a good generalization performance can be achieved by providing the learner with a sufficiently large number of discriminative training examples. However, in many domains, it is infeasible or expensive to obtain a sufficiently large training set. Various mechanisms have been proposed in the literature to combat this problem. Active Learning techniques (Angluin, 1998; Seung, Opper, & Sompolinsky, 1992) reduce the number of training examples required by carefully choosing discriminative training examples. Bootstrapping (Efron, 1979; Hamamoto, Uchimura & Tomita, 1997) and other pattern synthesis techniques generate a synthetic training set from the given training set. We present some of these techniques and propose some general mechanisms for pattern synthesis.
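The simplest form of bootstrapping, resampling the training set with replacement, can be sketched as follows. This illustrates only the basic resampling idea; the pattern synthesis variants cited in the abstract may construct new patterns rather than repeat existing ones, and the function name is illustrative.

```python
import random

def bootstrap_sample(training_set, size=None, seed=0):
    """Draw `size` examples with replacement (default: same size as the input)."""
    rng = random.Random(seed)
    size = len(training_set) if size is None else size
    return [rng.choice(training_set) for _ in range(size)]
```

Repeating the draw with different seeds yields many resampled training sets, which can be used to train an ensemble or to estimate how much a classifier's performance varies with the training data.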


2020 ◽  
Vol 2020 ◽  
pp. 1-6
Author(s):  
Zhehao He ◽  
Wang Lv ◽  
Jian Hu

Background. The differential diagnosis of subcentimetre lung nodules (diameter of less than 1 cm) has always been a problem for imaging doctors and thoracic surgeons. We plan to create a deep learning model for the diagnosis of pulmonary nodules using a simple method. Methods. Image data and pathological diagnosis of patients come from the First Affiliated Hospital of Zhejiang University School of Medicine from October 1, 2016, to October 1, 2019. After data preprocessing and data augmentation, the training set is used to train the model. The test set is used to evaluate the trained model. At the same time, the clinician will also diagnose the test set. Results. A total of 2,295 images of 496 lung nodules and their corresponding pathological diagnosis were selected as a training set and test set. After data augmentation, the number of training set images reached 12,510 images, including 6,648 malignant nodular images and 5,862 benign nodular images. The area under the P-R curve of the trained model is 0.836 in the classification of malignant and benign nodules. The area under the ROC curve of the trained model is 0.896 (95% CI: 78.96%~100.18%), which is higher than that of three doctors. However, the P value is not less than 0.05. Conclusion. With the help of an automatic machine learning system, clinicians can create a deep learning pulmonary nodule pathology classification model without the help of deep learning experts. The diagnostic efficiency of this model is not inferior to that of the clinician.


2021 ◽  
Vol 3 ◽  
Author(s):  
Ram Krishn Mishra ◽  
Siddhaling Urolagin ◽  
J. Angel Arul Jothi ◽  
Ashwin Sanjay Neogi ◽  
Nishad Nawaz

The Covid-19 pandemic has disrupted the world economy and significantly influenced the tourism industry. Millions of people have shared their emotions, views, facts, and circumstances on numerous social media platforms, which has resulted in a massive flow of information. The high-density social media data has drawn many researchers to extract valuable information and understand the user’s emotions during the pandemic time. The research looks at the data collected from the micro-blogging site Twitter for the tourism sector, emphasizing sub-domains hospitality and healthcare. The sentiment of approximately 20,000 tweets has been calculated using the Valence Aware Dictionary and sEntiment Reasoner (VADER) model. Furthermore, topic modeling was used to reveal certain hidden themes and determine the narrative and direction of the topics related to tourism healthcare and hospitality. Topic modeling also helped us to identify inter-cluster similar terms and analyze the flow of information from groups with similar opinions. Finally, a cutting-edge deep learning classification model was used with different epoch sizes of the dataset to anticipate and classify the people’s feelings. The deep learning model has been tested with multiple parameters such as training set accuracy, test set accuracy, validation loss, validation accuracy, etc., and achieved more than 90% training set accuracy; tourism hospitality and healthcare reported test set accuracies of 80.9% and 78.7%, respectively.

