Kernel-Based Discriminant Techniques for Educational Placement

2004 ◽  
Vol 29 (2) ◽  
pp. 219-240 ◽  
Author(s):  
Miao-hsiang Lin ◽  
Su-yun Huang ◽  
Yuan-chin Chang

This article considers the problem of educational placement. Several discriminant techniques are applied to a data set from a survey project on science ability. A profile vector for each student consists of five science-education indicators. Students are to be placed into three reference groups: advanced, regular, and remedial. Various discriminant techniques, including Fisher's discriminant analysis and kernel-based nonparametric discriminant analysis, are compared. The evaluation is based on the leave-one-out misclassification score. Results from the five school data sets and 500 bootstrap samples reveal that the kernel-based nonparametric approach, with bandwidth selected by cross validation, performs reasonably well. The authors regard kernel-based nonparametric procedures as desirable competitors to Fisher's discriminant rule for handling problems of educational placement.
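
For readers who want to experiment with this kind of comparison, here is a minimal Python sketch (not the authors' code) that pits Fisher's LDA against a kernel-density classifier under leave-one-out cross-validation, with the bandwidth chosen by an inner cross-validation as in the article; the five-indicator profiles and three group labels are simulated placeholders.

```python
# Illustrative sketch only: LDA vs. a kernel-density ("KDE per class")
# classifier under leave-one-out cross-validation. Data are synthetic
# stand-ins for the survey profiles, not the authors' data set.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, GridSearchCV, cross_val_score
from sklearn.neighbors import KernelDensity

class KDEClassifier(BaseEstimator, ClassifierMixin):
    """Nonparametric classifier: one Gaussian KDE per class."""
    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = [KernelDensity(bandwidth=self.bandwidth).fit(X[y == c])
                        for c in self.classes_]
        self.priors_ = [np.mean(y == c) for c in self.classes_]
        return self

    def predict(self, X):
        # log p(x | class) + log prior, maximized over classes
        scores = np.array([m.score_samples(X) + np.log(p)
                           for m, p in zip(self.models_, self.priors_)])
        return self.classes_[np.argmax(scores, axis=0)]

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 5))           # five indicator scores per student
y = np.repeat([0, 1, 2], 30)           # advanced / regular / remedial

# Bandwidth selected by cross-validation, in the spirit of the article.
kde = GridSearchCV(KDEClassifier(), {"bandwidth": np.logspace(-1, 1, 10)}, cv=5)
for name, clf in [("LDA", LinearDiscriminantAnalysis()), ("KDE", kde)]:
    acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(f"{name}: leave-one-out misclassification = {1 - acc:.3f}")
```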

2018 ◽  
Vol 7 (2.15) ◽  
pp. 136 ◽  
Author(s):  
Rosaida Rosly ◽  
Mokhairi Makhtar ◽  
Mohd Khalid Awang ◽  
Mohd Isa Awang ◽  
Mohd Nordin Abdul Rahman

This paper analyses the performance of classification models using single classifiers and ensemble combinations, with the Breast Cancer Wisconsin and Hepatitis data sets as training data. It presents a comparison of different classifiers based on 10-fold cross validation using a data mining tool. In this experiment, various classifiers are implemented, including three popular ensemble methods for the combinations: boosting, bagging and stacking. The results show that, for the Breast Cancer Wisconsin data set, the single Naïve Bayes (NB) classifier and the bagging+NB combination displayed the highest accuracy at the same percentage (97.51%) compared with the other ensemble combinations. For the Hepatitis data set, the combination of stacking+Multi-Layer Perceptron (MLP) achieved the highest accuracy at 86.25%. Using ensemble classifiers may therefore improve the results. In future work, a multi-classifier approach will be proposed by introducing fusion at the classification level between these classifiers to obtain higher classification accuracies.
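
The setup is easy to reproduce in outline. The sketch below uses scikit-learn rather than the (unspecified) data mining tool from the paper, and scikit-learn's bundled Breast Cancer Wisconsin data as a stand-in; the model names mirror the combinations discussed above.

```python
# A minimal sketch, not the authors' exact setup: bagging with Naive Bayes
# and stacking with an MLP base learner, scored by 10-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

models = {
    "NB": GaussianNB(),
    "bagging+NB": BaggingClassifier(GaussianNB(), n_estimators=10),
    "stacking+MLP": StackingClassifier(
        estimators=[("nb", GaussianNB()),
                    ("mlp", make_pipeline(StandardScaler(),
                                          MLPClassifier(max_iter=1000)))],
        final_estimator=LogisticRegression()),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: 10-fold CV accuracy = {acc:.4f}")
```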


2012 ◽  
Vol 433-440 ◽  
pp. 3959-3963 ◽  
Author(s):  
Bayram Akdemir ◽  
Nurettin Çetinkaya

In distribution systems, load forecasting is one of the major management problems for maintaining energy flow, protecting the system, and managing it economically. To manage the system, the next step of the load characteristic must be inferred from historical data sets. Forecasting uses not only historical parameters; external parameters such as weather conditions, seasons and population are also important for predicting the next behavior of the load characteristic. Holidays and weekdays affect energy consumption differently in any country. In this study, the target is to forecast the peak energy level for the next hour and to compare the effects of weekdays and holidays on peak energy needs. Energy consumption data sets have nonlinear characteristics, and it is not easy to fit any curve to them because of this nonlinearity and the large number of parameters. To forecast the peak energy level, an adaptive neuro-fuzzy inference system (ANFIS) is used, and the hourly effects of holidays and weekdays on the peak energy level are examined. The model outputs are evaluated with two-fold cross validation and the mean absolute percentage error (MAPE). The two-fold cross-validation error, as MAPE, is 3.51, and the data set that includes holidays is more accurate than the one without holidays: total success increased by 2.4%.
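
The evaluation protocol (two-fold cross-validation scored by MAPE) can be sketched as follows. ANFIS itself is not provided by scikit-learn, so a gradient-boosting regressor stands in for the fuzzy inference model, and the hourly load series is a synthetic placeholder with a holiday/weekday feature.

```python
# Sketch of the evaluation protocol only: two-fold CV scored by MAPE.
# The regressor and the toy hourly load data are stand-ins, not the paper's
# ANFIS model or consumption records.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
hours = np.arange(24 * 365)
# Toy hourly peak-load series: daily cycle + weekend/holiday dip + noise.
is_holiday = (hours // 24) % 7 >= 5
X = np.column_stack([hours % 24, (hours // 24) % 7, is_holiday])
y = 100 + 20 * np.sin(2 * np.pi * (hours % 24) / 24) - 10 * is_holiday \
    + rng.normal(0, 3, size=hours.size)

errors = []
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    model = GradientBoostingRegressor().fit(X[train], y[train])
    errors.append(mean_absolute_percentage_error(y[test], model.predict(X[test])))
print(f"two-fold CV MAPE = {100 * np.mean(errors):.2f}%")
```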


2006 ◽  
Vol 29 (1) ◽  
pp. 153-162
Author(s):  
Pratul Kumar Saraswati ◽  
Sanjeev V Sabnis

Paleontologists use statistical methods for the prediction and classification of taxa. Over the years, statistical analyses of morphometric data have been carried out under the assumption of multivariate normality. In an earlier study, three closely resembling species of the biostratigraphically important genus Nummulites were discriminated by multi-group discriminant analysis. Two discriminant functions that used the diameter and thickness of the tests and the height and length of the chambers in the final whorl accounted for nearly 100% discrimination. In this paper, Classification and Regression Trees (CART), a non-parametric method, is used for classification and prediction on the same data set. In all, 111 iterations of the CART methodology are performed by splitting the data set of 55 observations into training, validation and test sets in varying proportions. In 40% of the iterations the validation data sets are correctly classified, and in 49% only one case of misclassification is noted. As regards the test data sets, nearly 70% contain no misclassified cases, whereas about 25% contain only one case of misclassification. The results suggest that the method is highly successful in assigning an individual to a particular species. The key variables on which the tree models are built are combinations of the thickness of the test (T), the height of the chambers in the final whorl (HL) and the diameter of the test (D). Discriminant analysis and CART thus appear to be comparable in discriminating the three species; however, CART reduces the number of requisite variables without increasing the misclassification error. The method is very useful to professional geologists for the quick identification of species.
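
The resampling scheme is straightforward to emulate. Below is a hedged illustration with a CART-style decision tree over repeated random training/validation/test splits; the morphometric variables and species labels are simulated stand-ins for the Nummulites measurements, and the split proportions are arbitrary choices rather than the paper's.

```python
# Sketch of the repeated-splitting scheme: 111 random train/validation/test
# partitions of 55 observations, a decision tree (CART) fitted each time.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n = 55
X = rng.normal(size=(n, 4))          # stand-ins for D, T, HL, L
y = rng.integers(0, 3, size=n)       # three species

test_errors = []
for i in range(111):                  # 111 iterations, as in the paper
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4,
                                                random_state=i, stratify=y)
    X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                                random_state=i,
                                                stratify=y_tmp)
    tree = DecisionTreeClassifier(min_samples_leaf=2).fit(X_tr, y_tr)
    test_errors.append(np.mean(tree.predict(X_te) != y_te))
print(f"mean test misclassification over 111 splits: "
      f"{np.mean(test_errors):.3f}")
```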


2019 ◽  
Author(s):  
Martin Papenberg ◽  
Gunnar W. Klau

Numerous applications in psychological research require that a pool of elements be partitioned into multiple parts. While many applications seek groups that are well separated, i.e., dissimilar from each other, others require the different groups to be as similar as possible. Examples include the assignment of students to parallel courses, assembling stimulus sets in experimental psychology, splitting achievement tests into parts of equal difficulty, and dividing a data set for cross validation. We present anticlust, an easy-to-use and free software package for solving these problems quickly and in an automated manner. The package anticlust is an open-source extension to the R programming language and implements the methodology of anticlustering. Anticlustering divides elements into similar parts, ensuring similarity between groups by enforcing heterogeneity within groups. Thus, anticlustering is the direct reversal of cluster analysis, which aims to maximize homogeneity within groups and dissimilarity between groups. Our package anticlust implements two anticlustering criteria, reversing the clustering methods k-means and cluster editing, respectively. In a simulation study, we show that anticlustering returns excellent results and outperforms alternative approaches such as random assignment and matching. In three example applications, we illustrate how to apply anticlust to real data sets. We demonstrate how to assign experimental stimuli to equivalent sets based on norming data, how to divide a large data set for cross validation, and how to split a test into parts of equal item difficulty and discrimination.
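
The anticlust package itself is written in R; to convey the underlying idea in this document's code language, here is a minimal Python sketch of a reversed k-means criterion with a simple exchange heuristic. It assumes equal-sized groups (n divisible by the number of groups) and is an illustration of the anticlustering concept, not the package's algorithm.

```python
# Toy anticlustering via pairwise exchanges: swaps between groups are kept
# whenever they pull the group means closer together, i.e. increase
# heterogeneity within groups. Assumes n is divisible by n_groups.
import numpy as np

def anticluster(X, n_groups, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    labels = np.repeat(np.arange(n_groups), n // n_groups)
    rng.shuffle(labels)

    def between_ss(lab):
        # Squared deviations of group means from the grand mean;
        # anticlustering drives this toward zero.
        grand = X.mean(axis=0)
        return sum(np.sum((X[lab == g].mean(axis=0) - grand) ** 2)
                   for g in range(n_groups))

    best = between_ss(labels)
    for _ in range(n_iter):
        i, j = rng.integers(n, size=2)
        if labels[i] == labels[j]:
            continue
        labels[i], labels[j] = labels[j], labels[i]
        new = between_ss(labels)
        if new < best:
            best = new
        else:
            labels[i], labels[j] = labels[j], labels[i]  # undo the swap
    return labels

X = np.random.default_rng(1).normal(size=(60, 2))
groups = anticluster(X, n_groups=3)
# The three group means should come out nearly identical.
print([X[groups == g].mean(axis=0).round(3) for g in range(3)])
```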


Author(s):  
Sarina Sulaiman ◽  
Nor Amalina Abdul Rahim ◽  
Andri Pranolo

The emergence and growth of internet usage has accumulated an extensive amount of data. These data contain a wealth of undiscovered valuable information, and incomplete data sets may lead to observation errors. This research explored a technique for analyzing data that transforms meaningless data into meaningful information. The work focused on Rough Set (RS) theory to deal with incomplete data and rule derivation. Rules with high and low left-hand-side (LHS) support values generated by RS were used as query statements to form clusters of data. The model was tested on an AIDS blog data set consisting of 146 bloggers and an E-Learning@UTM (EL) log data set comprising 23105 URLs. 5-fold and 10-fold cross validation were used to split the data. The Naïve algorithm and the Boolean algorithm were employed as discretization techniques, and Johnson's algorithm (Johnson) and a Genetic algorithm (GA) as reduction techniques, to compare the results. 5-fold cross validation tended to suit the AIDS data well, while 10-fold cross validation was best for the EL data set. Johnson and GA yielded the same number of rules for both data sets. These findings are significant as evidence of the accuracy achieved using the proposed model.
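
The fold-count comparison at the heart of this evaluation is easy to sketch. Rough-set rule induction is not reproduced below; a decision tree stands in for the rule-based classifier, on a bundled scikit-learn data set rather than the AIDS blog or EL logs, so that 5-fold and 10-fold cross-validation can be compared on the same data.

```python
# Sketch of the 5-fold vs. 10-fold comparison only; the classifier and data
# are stand-ins, not the rough-set model or data sets from the paper.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
for k in (5, 10):
    acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=k)
    print(f"{k}-fold CV accuracy: {acc.mean():.4f} ± {acc.std():.4f}")
```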


2016 ◽  
Vol 34 (32) ◽  
pp. 3931-3938 ◽  
Author(s):  
Li-Xuan Qin ◽  
Huei-Chung Huang ◽  
Colin B. Begg

Purpose: Reproducibility of scientific experimentation has become a major concern because of the perception that many published biomedical studies cannot be replicated. In this article, we draw attention to the connection between inflated overoptimistic findings and the use of cross-validation for error estimation in molecular classification studies. We show that, in the absence of careful design to prevent artifacts caused by systematic differences in the processing of specimens, established tools such as cross-validation can lead to a spurious estimate of the error rate in the overoptimistic direction, regardless of the use of data normalization as an effort to remove these artifacts. Methods: We demonstrated this important yet overlooked complication of cross-validation using a unique pair of data sets on the same set of tumor samples. One data set was collected with uniform handling to prevent handling effects; the other was collected without uniform handling and exhibited handling effects. The paired data sets were used to estimate the biologic effects of the samples and the handling effects of the arrays in the latter data set, which were then used to simulate data using virtual rehybridization following various array-to-sample assignment schemes. Results: Our study showed that (1) cross-validation tended to underestimate the error rate when the data possessed confounding handling effects; (2) depending on the relative amount of handling effects, normalization may further worsen the underestimation of the error rate; and (3) balanced assignment of arrays to comparison groups allowed cross-validation to provide an unbiased error estimate. Conclusion: Our study demonstrates the benefits of balanced array assignment for reproducible molecular classification and calls for caution on the routine use of data normalization and cross-validation in such analysis.
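
The core phenomenon can be demonstrated with a toy simulation in the spirit of the study: a batch ("handling") effect is either confounded with the comparison groups or balanced across them, and the cross-validated error is estimated under both designs. Effect sizes and dimensions below are arbitrary illustrative choices, not values from the paper.

```python
# Confounded vs. balanced batch assignment and its effect on cross-validated
# error estimates. With confounding, the batch shift separates the classes
# and CV underestimates the true error; with balancing it does not.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, p = 200, 50
y = np.repeat([0, 1], n // 2)
signal = 0.3 * y[:, None] * (np.arange(p) < 5)    # weak true class signal
X_base = rng.normal(size=(n, p)) + signal

for design in ("confounded", "balanced"):
    if design == "confounded":
        batch = y                                  # batch aligned with class
    else:
        batch = np.tile([0, 1], n // 2)            # batch balanced over classes
    X = X_base + 1.0 * batch[:, None]              # additive handling effect
    err = 1 - cross_val_score(LogisticRegression(max_iter=1000),
                              X, y, cv=5).mean()
    print(f"{design} design: cross-validated error = {err:.3f}")
```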


2009 ◽  
Vol 91 (6) ◽  
pp. 427-436 ◽  
Author(s):  
M. GRAZIANO USAI ◽  
MIKE E. GODDARD ◽  
BEN J. HAYES

Summary: We used a least absolute shrinkage and selection operator (LASSO) approach to estimate marker effects for genomic selection. The least angle regression (LARS) algorithm and cross-validation were used to define the best subset of markers to include in the model. The LASSO–LARS approach was tested on two data sets: a simulated data set with 5865 individuals and 6000 single nucleotide polymorphisms (SNPs); and a mouse data set with 1885 individuals genotyped for 10 656 SNPs and phenotyped for a number of quantitative traits. In the simulated data, three approaches were used to split the reference population into training and validation subsets for cross-validation: random splitting across the whole population, and random sampling of the validation set from the last generation only, either within or across families. The highest accuracy was obtained by random splitting across the whole population. The accuracy of genomic estimated breeding values (GEBVs) in the candidate population obtained by LASSO–LARS was 0·89 with 156 explanatory SNPs. This value was higher than those obtained by Best Linear Unbiased Prediction (BLUP) and a Bayesian method (BayesA), which were 0·75 and 0·84, respectively. In the mouse data, 1600 individuals were randomly allocated to the reference population. The GEBVs for the remaining 285 individuals estimated by LASSO–LARS were more accurate than those obtained by BLUP and BayesA for weight at six weeks, and slightly less accurate for growth rate and body length. It was concluded that the LASSO–LARS approach is a good alternative method for estimating marker effects for genomic selection, particularly when the cost of genotyping can be reduced by using a limited subset of markers.
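
The LASSO-with-LARS idea, including cross-validated selection of the penalty, is available off the shelf in scikit-learn. The sketch below runs it on simulated genotype-like data; the marker counts, QTL count and effect sizes are placeholders, not the paper's simulation design.

```python
# Minimal sketch of LASSO fitted by the LARS algorithm with CV-selected
# penalty (LassoLarsCV), on simulated SNP data.
import numpy as np
from sklearn.linear_model import LassoLarsCV

rng = np.random.default_rng(4)
n_ind, n_snp = 500, 2000
X = rng.integers(0, 3, size=(n_ind, n_snp)).astype(float)  # genotypes 0/1/2
beta = np.zeros(n_snp)
beta[rng.choice(n_snp, 30, replace=False)] = rng.normal(0, 0.5, 30)  # 30 QTL
y = X @ beta + rng.normal(0, 1.0, n_ind)                   # phenotype

model = LassoLarsCV(cv=10).fit(X, y)
print(f"markers with non-zero estimated effect: {np.sum(model.coef_ != 0)}")
print(f"CV-selected penalty alpha: {model.alpha_:.4f}")
```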


Author(s):  
Jing Xu ◽  
Fuyi Li ◽  
André Leier ◽  
Dongxu Xiang ◽  
Hsin-Hui Shen ◽  
...  

Abstract: Antimicrobial peptides (AMPs) are a unique and diverse group of molecules that play a crucial role in a myriad of biological processes and cellular functions. AMP-related studies have become increasingly popular in recent years due to antimicrobial resistance, which is becoming an emerging global concern. Systematic experimental identification of AMPs faces many difficulties due to the limitations of current methods. Given its significance, more than 30 computational methods have been developed for the accurate prediction of AMPs. These approaches differ widely in their data set sizes, data quality, core algorithms, feature extraction and selection techniques, and evaluation strategies. Here, we provide a comprehensive survey of current approaches to AMP identification and point out the differences between these methods. In addition, we evaluate the predictive performance of the surveyed tools on an independent test data set containing 1536 AMPs and 1536 non-AMPs. Furthermore, we construct six validation data sets based on six different common AMP databases and compare the computational methods on these data sets. The results indicate that amPEPpy achieves the best predictive performance and outperforms the other compared methods. As predictive performance is affected by the different data sets used by different methods, we additionally perform a 5-fold cross-validation test to benchmark traditional machine learning methods on the same data set. These cross-validation results indicate that random forest, support vector machine and eXtreme Gradient Boosting achieve comparatively better performance than other machine learning methods and are often the algorithms of choice of multiple AMP prediction tools.
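
A 5-fold benchmarking loop of the kind described above looks as follows. The features are simulated placeholders for real peptide descriptors, and scikit-learn's gradient boosting stands in for XGBoost to keep the example dependency-free.

```python
# Benchmarking sketch: several classifiers scored by 5-fold cross-validation
# on the same (here synthetic) balanced AMP / non-AMP data set.
import numpy as np
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(5)
n = 3072                                   # 1536 AMPs + 1536 non-AMPs
X = rng.normal(size=(n, 40))               # stand-in sequence features
y = np.repeat([0, 1], n // 2)
X[y == 1] += 0.2                           # weak separation between classes

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "gradient boosting": HistGradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: 5-fold CV accuracy = {acc:.3f}")
```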


2021 ◽  
Author(s):  
Georgios Boumis ◽  
Bart van Osnabrugge ◽  
Jan Verkade

Operational near real-time flood forecasting relies heavily on adequate spatial interpolation of precipitation forcing, which has a large impact on the accuracy of hydrologic forecasts. In this study, the generalized REGNIE (genRE) interpolation technique is examined. The genRE approach was shown to enhance the traditional Inverse Distance Weighting (IDW) method with information from existing observed climatological precipitation data sets (Van Osnabrugge, 2017). The successful application of the genRE method with a re-analysis precipitation data set expands the applicability of the method, as detailed re-analysis data sets become more prevalent while high-density observation networks remain scarce.

Here, the approach is extended to use climatological precipitation data from Met Éireann's Re-Analysis (MÉRA). Investigations are carried out using hourly precipitation accumulations for two major flood events induced by Atlantic storms in the Suir River Basin, Ireland. Alongside genRE, the following techniques are comparatively explored: Inverse Distance Weighting (IDW), Ordinary Kriging (OK) and Regression Kriging (RK). Cross-validation is applied to compare the different interpolation methods, while spatial maps and correlation coefficients are used to assess the skill of the interpolators in emulating the climatology of MÉRA. In the process, a preliminary intercomparison between the observed precipitation and MÉRA precipitation for the two events is also made.

In a statistical sense, the cross-validation results verify that genRE performs slightly better than the other three interpolation techniques for both events studied. Overall, OK is found to be the most inadequate approach, specifically in terms of preserving the original variance in observed precipitation. MÉRA reproduces the temporal variations of the observations well for both events, but displays less skill for spatial variations, especially where topography has a major influence. Finally, genRE outperforms all other interpolators in mimicking the climatological conditions of MÉRA for both events.

Van Osnabrugge, B., Weerts, A.H. and Uijlenhoet, R., 2017. genRE: A method to extend gridded precipitation climatology data sets in near real-time for hydrological forecasting purposes. Water Resources Research, 53(11), pp. 9284-9303.
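
The cross-validation comparison of spatial interpolators can be sketched for the simplest of the methods above, IDW; the genRE extension and the kriging variants are not reproduced here, and the gauge coordinates and rainfall amounts are synthetic.

```python
# Leave-one-out cross-validation of inverse distance weighting (IDW):
# each gauge is predicted from all the others and the errors are pooled.
import numpy as np

def idw(xy_obs, z_obs, xy_tgt, power=2.0):
    """Inverse-distance-weighted estimate at target points."""
    d = np.linalg.norm(xy_tgt[:, None, :] - xy_obs[None, :, :], axis=2)
    w = 1.0 / np.maximum(d, 1e-12) ** power
    return (w @ z_obs) / w.sum(axis=1)

rng = np.random.default_rng(6)
xy = rng.uniform(0, 100, size=(40, 2))          # gauge locations (km)
z = 5 + 0.05 * xy[:, 0] + rng.gamma(2, 1, 40)   # hourly rainfall (mm)

errors = [z[i] - idw(np.delete(xy, i, 0), np.delete(z, i), xy[i:i + 1])[0]
          for i in range(len(z))]
print(f"LOOCV RMSE = {np.sqrt(np.mean(np.square(errors))):.3f} mm")
```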


2018 ◽  
Vol 154 (2) ◽  
pp. 149-155
Author(s):  
Michael Archer

1. Yearly records of worker Vespula germanica (Fabricius) taken in suction traps at Silwood Park (28 years) and at Rothamsted Research (39 years) are examined. 2. Using the autocorrelation function (ACF), a significant negative 1-year lag followed by a lesser, non-significant positive 2-year lag was found in all, or parts of, each data set, indicating an underlying population dynamic of a 2-year cycle with a damped waveform. 3. The minimum number of years before the 2-year cycle with damped waveform appeared varied between 17 and 26, or the cycle was not found in some data sets. 4. Ecological factors delaying or preventing the occurrence of the 2-year cycle are considered.
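
The ACF diagnostic described above can be illustrated on a simulated count series with delayed density dependence (not the Silwood Park or Rothamsted trap data); the coefficients below are arbitrary and chosen only to produce a damped 2-year cycle.

```python
# Toy 2-year-cycle series: high years tend to be followed by low years,
# giving a negative lag-1 autocorrelation and a weaker positive lag-2 value.
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(7)
n_years = 39
counts = np.empty(n_years)
counts[0] = 100.0
for t in range(1, n_years):
    counts[t] = max(0.0, 160 - 0.6 * counts[t - 1] + rng.normal(0, 15))

r = acf(counts, nlags=4)
print(f"lag-1 ACF: {r[1]:.2f}, lag-2 ACF: {r[2]:.2f}")
```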

