Monotonic classification: An overview on algorithms, performance measures and data sets

Nowadays, a huge amount of data is generated owing to advances in technology. Various tools are used to explore this data, and these tools include data mining techniques that can be applied to the resulting data sets. Classification is needed to extract useful information from, or to predict outcomes for, these enormous amounts of data, and many classification algorithms exist for this purpose. In this paper, we compare the Naive Bayes, K*, and random forest classification algorithms using the Weka tool. To analyze the performance of these three algorithms, we consider three data sets: diabetes, supermarket, and weather. The analysis is based on the confusion matrix and on performance measures such as RMSE, MAE, and ROC.
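As a rough illustration of the measures named in this abstract, the sketch below derives accuracy, precision and recall from a binary confusion matrix and computes MAE and RMSE from predicted class probabilities. The labels, probabilities and function names are made-up toy values chosen for illustration; they are not taken from the paper or from Weka.

```python
import math

def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def mae(y_true, y_score):
    """Mean absolute error between true labels and predicted probabilities."""
    return sum(abs(t - s) for t, s in zip(y_true, y_score)) / len(y_true)

def rmse(y_true, y_score):
    """Root mean squared error between true labels and predicted probabilities."""
    return math.sqrt(sum((t - s) ** 2 for t, s in zip(y_true, y_score)) / len(y_true))

# Hypothetical toy data: true classes and predicted probability of class 1.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, accuracy, precision, recall)
print(mae(y_true, y_score), rmse(y_true, y_score))
```

The 0.5 decision threshold is an assumption; a different threshold changes the confusion matrix but leaves MAE and RMSE unchanged, since those are computed from the probabilities.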


PERFORMANCE VALIDATION OF PRIOR QUANTIZATION TECHNIQUES IN OUTLIERS CLASSIFICATION USING WDBC DATASET

International Journal of Engineering Technologies and Management Research ◽

10.29121/ijetmr.v5.i4.2018.207 ◽

2020 ◽

Vol 5 (4) ◽

pp. 48-56

Author(s):

D. Rajakumari

Keyword(s):

Data Mining ◽

Decision Making Process ◽

Data Sets ◽

Distance Metrics ◽

Medical Practitioners ◽

Useful Knowledge ◽

Large Size ◽

Performance Validation ◽

Sequential Scanning ◽

Monotonic Classification

Data mining is the process of analyzing very large volumes of data and summarizing them into useful knowledge. The use of data mining approaches is growing quickly, particularly classification techniques, which are an efficient way of classifying data and are important in the decision-making process for medical practitioners. This study presents quantization and validation (OQV) techniques for fast outlier detection in large WDBC data sets. The use of distance metrics makes the algorithm linear for the various objects and ensures sequential scanning. The inclusion of a direct quantization technique and explicit cluster discovery keeps the method simple and economical. A comparative analysis of the proposed OQV techniques against triangular boundary-based classification and Weighing-based Feature Selection and Monotonic Classification (WFSMC), in terms of accuracy, precision, recall, and number of attributes, indicates the effectiveness of OQV for large data sets.


Analysis of Freeway Traffic Incident Conditions by Using Second-Order Spatiotemporal Traffic Performance Measures

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/0361198105192500103 ◽

2005 ◽

Vol 1925 (1) ◽

pp. 20-28 ◽

Cited By ~ 2

Author(s):

Sherif Ishak ◽

Ciprian Alecsandru

Keyword(s):

Performance Measures ◽

Second Order ◽

Operating Conditions ◽

Data Sets ◽

Freeway Traffic ◽

Contour Maps ◽

Traffic Conditions ◽

Statistical Measures ◽

Study Results ◽

Texture Characterization

The characteristics of preincident, postincident, and nonincident traffic conditions on freeways are investigated. The characteristics are defined by second-order statistical measures derived from spatiotemporal speed contour maps. Four performance measures are used to quantify properties such as smoothness, homogeneity, and randomness in traffic conditions in a manner similar to texture characterization of digital images. With real-world incident and traffic data sets, statistical analysis was conducted to seek distinctive characteristics of three groups of traffic operating conditions: preincident, postincident, and nonincident. The study results showed that the spatiotemporal characteristics of each of the three groups were not discernible. Although the distributions of performance measures within each group are statistically different, no consistent pattern was detected to imply that certain characteristics could increase the likelihood of incidents or identify precursory conditions to incidents.
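To make the texture analogy concrete, the sketch below builds a co-occurrence matrix from a small quantized speed contour map and derives four second-order measures from it. The abstract does not name its four measures, so the choice here (energy, homogeneity, contrast, entropy) is an assumption borrowed from common image-texture practice, and the toy maps and function names are hypothetical.

```python
import math

def cooccurrence(level_map, n_levels):
    """Normalized, symmetric co-occurrence matrix of horizontally adjacent cells."""
    counts = [[0] * n_levels for _ in range(n_levels)]
    for row in level_map:
        for a, b in zip(row, row[1:]):
            counts[a][b] += 1
            counts[b][a] += 1  # count each neighbor pair in both directions
    total = sum(map(sum, counts))
    return [[c / total for c in r] for r in counts]

def texture_measures(p):
    """Second-order statistics of a normalized co-occurrence matrix p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "energy": sum(v * v for _, _, v in cells),                          # uniformity
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),    # randomness
    }

# Hypothetical maps: rows are detector locations, columns are time steps,
# values are speeds quantized into levels 0..2.
smooth = [[2, 2, 2], [2, 2, 2]]   # steady free-flow speeds
choppy = [[0, 1, 0, 1]]           # speed alternating between two levels

print(texture_measures(cooccurrence(smooth, 3)))
print(texture_measures(cooccurrence(choppy, 3)))
```

Only horizontal neighbors (consecutive time steps at one location) are paired here; pairing vertical neighbors instead would describe variation along the freeway rather than over time.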


Precision Health–Enabled Machine Learning to Identify Need for Wraparound Social Services Using Patient- and Population-Level Data Sets: Algorithm Development and Validation (Preprint)

10.2196/preprints.16129 ◽

2019 ◽

Author(s):

Suranga N Kasthurirathne ◽

Shaun Grannis ◽

Paul K Halverson ◽

Justin Morea ◽

Nir Menachemi ◽

...

Keyword(s):

Machine Learning ◽

Social Determinants Of Health ◽

Social Services ◽

Performance Measures ◽

Social Determinants ◽

Population Level ◽

Decision Models ◽

Data Sets ◽

Level Data ◽

Precision Health

BACKGROUND Emerging interest in precision health and the increasing availability of patient- and population-level data sets present considerable potential to enable analytical approaches to identify and mitigate the negative effects of social factors on health. These issues are not satisfactorily addressed in typical medical care encounters, and thus, opportunities to improve health outcomes, reduce costs, and improve coordination of care are not realized. Furthermore, methodological expertise on the use of varied patient- and population-level data sets and machine learning to predict need for supplemental services is limited. OBJECTIVE The objective of this study was to leverage a comprehensive range of clinical, behavioral, social risk, and social determinants of health factors in order to develop decision models capable of identifying patients in need of various wraparound social services. METHODS We used comprehensive patient- and population-level data sets to build decision models capable of predicting need for behavioral health, dietitian, social work, or other social service referrals within a safety-net health system using area under the receiver operating characteristic curve (AUROC), sensitivity, precision, F1 score, and specificity. We also evaluated the value of population-level social determinants of health data sets in improving machine learning performance of the models. RESULTS Decision models for each wraparound service demonstrated performance measures ranging between 59.2% and 99.3%. These results were statistically superior to the performance measures demonstrated by our previous models which used a limited data set and whose performance measures ranged from 38.2% to 88.3% (behavioral health: F1 score P<.001, AUROC P=.01; social work: F1 score P<.001, AUROC P=.03; dietitian: F1 score P=.001, AUROC P=.001; other: F1 score P=.01, AUROC P=.02); however, inclusion of additional population-level social determinants of health did not contribute to any performance improvements (behavioral health: F1 score P=.08, AUROC P=.09; social work: F1 score P=.16, AUROC P=.09; dietitian: F1 score P=.08, AUROC P=.14; other: F1 score P=.33, AUROC P=.21) in predicting the need for referral in our population of vulnerable patients seeking care at a safety-net provider. CONCLUSIONS Precision health–enabled decision models that leverage a wide range of patient- and population-level data sets and advanced machine learning methods are capable of predicting need for various wraparound social services with good performance.
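The evaluation measures listed in the methods (AUROC, sensitivity, specificity, precision, F1 score) can be sketched as follows. This is a minimal illustration on invented referral labels and scores, not the study's evaluation code; the function names and the 0.5 threshold are assumptions. AUROC is computed here as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as one half.

```python
def auroc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def threshold_measures(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity, precision and F1 at a fixed decision threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical data: 1 = patient was referred, score = model's predicted probability.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

print(auroc(y_true, y_score))
print(threshold_measures(y_true, y_score))
```

The sketch assumes both classes are present and that at least one case is predicted positive; otherwise the ratios are undefined. AUROC is independent of the threshold, while the other four measures change with it.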


Comparison of marker selection methods for high throughput scRNA-seq data

10.1101/679761 ◽

2019 ◽

Author(s):

Anna C. Gilbert ◽

Alexander Vargo

Keyword(s):

Performance Measures ◽

Synthetic Data ◽

Large Data ◽

Ground Truth ◽

Selection Method ◽

Large Data Sets ◽

Data Sets ◽

Selection Methods ◽

Marker Selection

Here, we evaluate the performance of a variety of marker selection methods on scRNA-seq UMI counts data. We test on an assortment of experimental and synthetic data sets that range in size from several thousand to one million cells. In addition, we propose several performance measures for evaluating the quality of a set of markers when there is no known ground truth. According to these metrics, most existing marker selection methods show similar performance on experimental scRNA-seq data; thus, the speed of the algorithm is the most important consideration for large data sets. With this in mind, we introduce RANKCORR, a fast marker selection method with strong mathematical underpinnings that takes a step towards sensible multi-class marker selection.


Precision Health–Enabled Machine Learning to Identify Need for Wraparound Social Services Using Patient- and Population-Level Data Sets: Algorithm Development and Validation

JMIR Medical Informatics ◽

10.2196/16129 ◽

2020 ◽

Vol 8 (7) ◽

pp. e16129 ◽

Cited By ~ 1

Author(s):

Suranga N Kasthurirathne ◽

Shaun Grannis ◽

Paul K Halverson ◽

Justin Morea ◽

Nir Menachemi ◽

...

Keyword(s):

Machine Learning ◽

Social Determinants Of Health ◽

Social Services ◽

Performance Measures ◽

Social Determinants ◽

Population Level ◽

Decision Models ◽

Data Sets ◽

Level Data ◽

Precision Health

Background Emerging interest in precision health and the increasing availability of patient- and population-level data sets present considerable potential to enable analytical approaches to identify and mitigate the negative effects of social factors on health. These issues are not satisfactorily addressed in typical medical care encounters, and thus, opportunities to improve health outcomes, reduce costs, and improve coordination of care are not realized. Furthermore, methodological expertise on the use of varied patient- and population-level data sets and machine learning to predict need for supplemental services is limited. Objective The objective of this study was to leverage a comprehensive range of clinical, behavioral, social risk, and social determinants of health factors in order to develop decision models capable of identifying patients in need of various wraparound social services. Methods We used comprehensive patient- and population-level data sets to build decision models capable of predicting need for behavioral health, dietitian, social work, or other social service referrals within a safety-net health system using area under the receiver operating characteristic curve (AUROC), sensitivity, precision, F1 score, and specificity. We also evaluated the value of population-level social determinants of health data sets in improving machine learning performance of the models. Results Decision models for each wraparound service demonstrated performance measures ranging between 59.2% and 99.3%. These results were statistically superior to the performance measures demonstrated by our previous models which used a limited data set and whose performance measures ranged from 38.2% to 88.3% (behavioral health: F1 score P<.001, AUROC P=.01; social work: F1 score P<.001, AUROC P=.03; dietitian: F1 score P=.001, AUROC P=.001; other: F1 score P=.01, AUROC P=.02); however, inclusion of additional population-level social determinants of health did not contribute to any performance improvements (behavioral health: F1 score P=.08, AUROC P=.09; social work: F1 score P=.16, AUROC P=.09; dietitian: F1 score P=.08, AUROC P=.14; other: F1 score P=.33, AUROC P=.21) in predicting the need for referral in our population of vulnerable patients seeking care at a safety-net provider. Conclusions Precision health–enabled decision models that leverage a wide range of patient- and population-level data sets and advanced machine learning methods are capable of predicting need for various wraparound social services with good performance.


On-the-fly scheduling versus reservation-based scheduling for unpredictable workflows

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019841681 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1140-1158 ◽

Cited By ~ 3

Author(s):

Ana Gainaru ◽

Hongyang Sun ◽

Guillaume Aupy ◽

Yuankai Huo ◽

Bennett A Landman ◽

...

Keyword(s):

Performance Measures ◽

Data Centers ◽

Large Data ◽

Large Data Sets ◽

System Level ◽

Data Sets ◽

Data Set ◽

System Utilization ◽

Average Stretch ◽

Level Performance

Scientific insights in the coming decade will clearly depend on the effective processing of large data sets generated by dynamic heterogeneous applications typical of workflows in large data centers or of emerging fields like neuroscience. In this article, we show how these big data workflows have a unique set of characteristics that pose challenges for leveraging HPC methodologies, particularly in scheduling. Our findings indicate that execution times for these workflows are highly unpredictable and are not correlated with the size of the data set involved or the precise functions used in the analysis. We characterize this inherent variability and sketch the need for new scheduling approaches by quantifying significant gaps in achievable performance. Through simulations, we show how on-the-fly scheduling approaches can deliver benefits in both system-level and user-level performance measures. On average, we find improvements of up to 35% in system utilization and up to 45% in average stretch of the applications, illustrating the potential of increasing performance through new scheduling approaches.
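The two performance measures quoted at the end of this abstract can be sketched on a toy schedule. The definitions used here are common conventions and are assumptions, since the abstract does not spell them out: a job's stretch is its time in the system (completion minus submission) divided by its execution time, and system utilization is the busy node-time divided by the node-time available over the schedule. The `Job` fields and the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Job:
    submit: float  # time the job entered the queue
    start: float   # time the job began running
    end: float     # time the job finished
    nodes: int     # number of nodes the job occupied

def average_stretch(jobs):
    """Mean of (time in system) / (execution time); 1.0 means no job ever waited."""
    return sum((j.end - j.submit) / (j.end - j.start) for j in jobs) / len(jobs)

def utilization(jobs, total_nodes):
    """Fraction of available node-time spent running jobs."""
    t0 = min(j.submit for j in jobs)
    t1 = max(j.end for j in jobs)
    busy = sum(j.nodes * (j.end - j.start) for j in jobs)
    return busy / (total_nodes * (t1 - t0))

# Hypothetical schedule on a 4-node machine.
jobs = [
    Job(submit=0, start=0, end=4, nodes=2),
    Job(submit=0, start=4, end=6, nodes=4),
    Job(submit=1, start=6, end=8, nodes=2),
]

print(average_stretch(jobs))  # 2.5
print(utilization(jobs, 4))   # 0.625
```

Stretch is a user-level measure (how much longer a job took than it needed), while utilization is a system-level one, which is why a scheduler can improve one without improving the other.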


Real-time interpolation of streaming data

Computer Science ◽

10.7494/csci.2020.21.4.3932 ◽

2020 ◽

Vol 21 (4) ◽

Author(s):

Roman Dębski

Keyword(s):

Time Series ◽

Finite Difference ◽

Real Time ◽

Performance Measures ◽

Spline Interpolation ◽

Streaming Data ◽

Data Sets ◽

Look Ahead ◽

Different Types ◽

Hermite Splines

One of the key elements of real-time $C^1$-continuous cubic spline interpolation of streaming data is an estimator of the first derivative of the interpolated function that is more accurate than the ones based on finite difference schemas. Two such greedy look-ahead heuristic estimators (denoted as MinBE and MinAJ2) based on Calculus of Variations are formally defined (in closed form) together with the corresponding cubic splines they generate, and then comparatively evaluated in a series of numerical experiments involving different types of performance measures. The results presented show that the cubic Hermite splines generated by heuristic MinAJ2 significantly outperformed those based on finite difference schemas in terms of all tested performance measures (including convergence). The proposed approach is quite general. It can be directly applied to streams of univariate functional data like time-series. Multidimensional curves defined parametrically, after splitting, can be handled as well. The streaming character of the algorithm means that it can also be useful in processing data sets that are too large to fit in memory (e.g., edge computing devices, embedded time-series databases).
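For orientation, the sketch below shows the finite-difference baseline that the abstract compares against: a cubic Hermite segment whose end slopes are estimated by finite differences. It does not implement MinBE or MinAJ2, whose definitions are not given in the abstract; the function names and sample data are illustrative assumptions.

```python
def hermite(x0, y0, m0, x1, y1, m1, x):
    """Evaluate the cubic Hermite segment on [x0, x1] with end slopes m0 and m1."""
    h = x1 - x0
    t = (x - x0) / h
    h00 = 2 * t**3 - 3 * t**2 + 1   # weight of y0
    h10 = t**3 - 2 * t**2 + t       # weight of h * m0
    h01 = -2 * t**3 + 3 * t**2      # weight of y1
    h11 = t**3 - t**2               # weight of h * m1
    return h00 * y0 + h10 * h * m0 + h01 * y1 + h11 * h * m1

def fd_slopes(xs, ys):
    """Central differences at interior points, one-sided differences at the ends."""
    n = len(xs)
    slopes = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        slopes.append((ys[hi] - ys[lo]) / (xs[hi] - xs[lo]))
    return slopes

# Hypothetical samples of y = x^2 arriving one at a time.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]
m = fd_slopes(xs, ys)

# Interpolate inside the segment [1, 2] using the estimated slopes at both ends.
print(hermite(xs[1], ys[1], m[1], xs[2], ys[2], m[2], 1.5))
```

In a streaming setting the central difference at the right end of a segment needs the next sample, so each segment can only be emitted after a one-sample look-ahead. Because neighboring segments share the same slope at their common knot, the resulting curve is $C^1$-continuous.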
