A cavernous analytics using advanced machine learning for real world datasets in research implementations

Author(s):  
B. Raju ◽  
Rajitha Bonagiri

2017 ◽  
Vol 27 (1) ◽  
pp. 169-180 ◽  
Author(s):  
Marton Szemenyei ◽  
Ferenc Vajda

Abstract: Dimension reduction and feature selection are fundamental tools for machine learning and data mining. Most existing methods, however, assume that objects are represented by a single vectorial descriptor. In reality, some description methods assign unordered sets or graphs of vectors to a single object, where each vector has the same number of dimensions but is drawn from a different probability distribution. Moreover, some applications (such as pose estimation) may require the recognition of individual vectors (nodes) of an object. In such cases it is essential that the nodes within a single object remain distinguishable after dimension reduction. In this paper we propose new discriminant analysis methods that satisfy two criteria at the same time: separation between classes and separation between the nodes of an object instance. We analyze and evaluate our methods on several synthetic and real-world datasets.
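The two-criterion objective described above lends itself to a scatter-matrix formulation. The following minimal numpy sketch combines a between-class scatter term with a between-node scatter term and solves a generalized eigenproblem; the weighting parameter lam, the use of total scatter as the denominator, and the function name are illustrative assumptions rather than the authors' exact formulation.

    import numpy as np
    from scipy.linalg import eigh

    def joint_discriminant_projection(X, class_labels, node_labels, k, lam=1.0, reg=1e-6):
        """Project vectors to k dims while separating classes AND nodes.

        X            : (n_samples, d) stacked node descriptors
        class_labels : object-class label of each vector
        node_labels  : node identity of each vector within its object
        lam          : weight of the node-separation criterion (assumed)
        """
        d = X.shape[1]
        mu = X.mean(axis=0)

        def between_scatter(labels):
            S = np.zeros((d, d))
            for c in np.unique(labels):
                Xc = X[labels == c]
                diff = (Xc.mean(axis=0) - mu)[:, None]
                S += len(Xc) * diff @ diff.T
            return S

        S_class = between_scatter(np.asarray(class_labels))   # separates object classes
        S_node = between_scatter(np.asarray(node_labels))     # keeps nodes distinguishable
        S_total = np.cov(X, rowvar=False) * (len(X) - 1)      # total scatter as denominator

        # Generalized eigenproblem: maximize (class + lam * node) scatter against total scatter.
        vals, vecs = eigh(S_class + lam * S_node, S_total + reg * np.eye(d))
        return vecs[:, np.argsort(vals)[::-1][:k]]            # top-k projection directions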


2019 ◽  
Vol 2019 (1) ◽  
pp. 26-46 ◽  
Author(s):  
Thee Chanyaswad ◽  
Changchang Liu ◽  
Prateek Mittal

Abstract: A key challenge facing the design of differential privacy in the non-interactive setting is to maintain the utility of the released data. To overcome this challenge, we utilize the Diaconis-Freedman-Meckes (DFM) effect, which states that most projections of high-dimensional data are nearly Gaussian. Hence, we propose the RON-Gauss model that leverages the novel combination of dimensionality reduction via random orthonormal (RON) projection and the Gaussian generative model for synthesizing differentially-private data. We analyze how RON-Gauss benefits from the DFM effect, and present multiple algorithms for a range of machine learning applications, including both unsupervised and supervised learning. Furthermore, we rigorously prove that (a) our algorithms satisfy the strong ɛ-differential privacy guarantee, and (b) RON projection can lower the level of perturbation required for differential privacy. Finally, we illustrate the effectiveness of RON-Gauss under three common machine learning applications – clustering, classification, and regression – on three large real-world datasets. Our empirical results show that (a) RON-Gauss outperforms previous approaches by up to an order of magnitude, and (b) loss in utility compared to the non-private real data is small. Thus, RON-Gauss can serve as a key enabler for real-world deployment of privacy-preserving data release.
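As a rough illustration of the pipeline the abstract describes (a random orthonormal projection followed by a perturbed Gaussian generative model), here is a simplified Python sketch. The noise scales are placeholders and do not reproduce the paper's calibrated sensitivity analysis or its formal ɛ-differential-privacy proof.

    import numpy as np

    def ron_gauss_release(X, epsilon, p, rng=np.random.default_rng(0)):
        """Simplified sketch of a RON-Gauss-style private data release.

        X       : (n, d) real data, assumed pre-normalized so each row has norm <= 1
        epsilon : differential-privacy budget (split between mean and covariance here)
        p       : reduced dimension of the random orthonormal (RON) projection
        """
        n, d = X.shape
        # 1. Random orthonormal projection: Q from the QR factorization of a Gaussian matrix.
        Q, _ = np.linalg.qr(rng.standard_normal((d, p)))
        Z = X @ Q                                    # (n, p), nearly Gaussian by the DFM effect

        # 2. Perturb the sufficient statistics of a Gaussian model with Laplace noise.
        #    The sensitivity constants below are illustrative placeholders, not the
        #    calibrated values derived in the paper.
        mu = Z.mean(axis=0) + rng.laplace(scale=2.0 / (n * epsilon / 2), size=p)
        Sigma = np.cov(Z, rowvar=False) + rng.laplace(scale=2.0 / (n * epsilon / 2), size=(p, p))
        Sigma = (Sigma + Sigma.T) / 2 + 1e-6 * np.eye(p)   # symmetrize the noisy covariance

        # 3. Sample synthetic data from the private Gaussian generative model.
        return rng.multivariate_normal(mu, Sigma, size=n)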


2021 ◽  
Vol 17 (2) ◽  
pp. 1-20
Author(s):  
Zheng Wang ◽  
Qiao Wang ◽  
Tingzhang Zhao ◽  
Chaokun Wang ◽  
Xiaojun Ye

Feature selection, an effective technique for dimensionality reduction, plays an important role in many machine learning systems. Supervised knowledge can significantly improve performance; however, faced with the rapid growth of newly emerging concepts, existing supervised methods can easily suffer from a scarcity of valid labeled training data. In this paper, the authors study the problem of zero-shot feature selection, i.e., building a feature selection model that generalizes well to “unseen” concepts with limited training data of “seen” concepts. Specifically, they adopt class-semantic descriptions (i.e., attributes) as supervision for feature selection, so as to utilize the supervised knowledge transferred from the seen concepts. To obtain more reliable discriminative features, they further propose the center-characteristic loss, which encourages the selected features to capture the central characteristics of seen concepts. Extensive experiments conducted on various real-world datasets demonstrate the effectiveness of the method.
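A toy sketch of what attribute-supervised feature selection with a center-characteristic-style term could look like is given below; the exact loss terms, their weights (alpha, beta), and the optimization scheme are illustrative assumptions, not the authors' formulation.

    import numpy as np

    def zero_shot_feature_scores(X, y, A, alpha=1.0, beta=0.1, lr=1e-2, steps=500):
        """Toy attribute-supervised feature selection with a center-characteristic-style term.

        X : (n, d) training data of *seen* concepts
        y : (n,) seen-concept labels, values 0..C-1
        A : (C, m) class-attribute (semantic) descriptions of the seen concepts
        Returns per-feature importance scores.
        """
        y = np.asarray(y)
        n, d = X.shape
        C, m = A.shape
        centers = np.stack([X[y == c].mean(axis=0) for c in range(C)])   # (C, d) class centers
        rng = np.random.default_rng(0)
        W = rng.standard_normal((d, m)) * 0.01     # maps (weighted) features to attributes
        s = np.ones(d)                             # soft feature-selection scores

        for _ in range(steps):
            Xs = X * s                                         # apply soft selection
            # attribute-regression term: selected features should predict class attributes
            R = Xs @ W - A[y]                                  # (n, m) residuals
            # center-characteristic term: selected features should retain class centers
            Cres = centers * s - centers                       # (C, d) residuals
            grad_W = Xs.T @ R / n
            grad_s = ((R @ W.T) * X).mean(axis=0) \
                     + alpha * (Cres * centers).mean(axis=0) \
                     + beta * np.sign(s)                       # sparsity pressure on the scores
            W -= lr * grad_W
            s -= lr * grad_s
        return np.abs(s)   # rank features by the magnitude of their score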


Algorithms ◽  
2020 ◽  
Vol 13 (1) ◽  
pp. 17 ◽  
Author(s):  
Emmanuel Pintelas ◽  
Ioannis E. Livieris ◽  
Panagiotis Pintelas

Machine learning has emerged as a key factor in many technological and scientific advances and applications. Much research has been devoted to developing high-performance machine learning models that are able to make very accurate predictions and decisions across a wide range of applications. Nevertheless, we still seek to understand and explain how these models work and make decisions. Explainability and interpretability in machine learning are significant issues, since in most real-world problems it is considered essential to understand and explain the model’s prediction mechanism in order to trust it and make decisions on critical issues. In this study, we developed a Grey-Box model based on a semi-supervised methodology utilizing a self-training framework. The main objective of this work is the development of a machine learning model that is both interpretable and accurate, which is a complex and challenging task. The proposed model was evaluated on a variety of real-world datasets from the crucial application domains of education, finance, and medicine. Our results demonstrate the efficiency of the proposed model: it performs comparably to a Black-Box model and considerably outperforms single White-Box models, while remaining as interpretable as a White-Box model.
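A minimal sketch of a self-training grey-box pipeline of the kind described above: a black-box teacher pseudo-labels unlabeled data and an interpretable white-box student is trained on the enlarged set. The specific models (random forest teacher, shallow decision tree student) and the confidence threshold are assumptions for illustration, not the paper's exact configuration.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier   # stand-in black-box teacher
    from sklearn.tree import DecisionTreeClassifier       # interpretable white-box student

    def grey_box_self_training(X_lab, y_lab, X_unlab, conf_threshold=0.9):
        """Self-training: teacher pseudo-labels unlabeled data, student learns from both."""
        teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_lab, y_lab)

        proba = teacher.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= conf_threshold        # keep only confident pseudo-labels
        pseudo_y = teacher.classes_[proba.argmax(axis=1)]

        X_aug = np.vstack([X_lab, X_unlab[confident]])
        y_aug = np.concatenate([y_lab, pseudo_y[confident]])

        student = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_aug, y_aug)
        return student   # interpretable model trained with black-box supervision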


2020 ◽  
Vol 34 (03) ◽  
pp. 2451-2458
Author(s):  
Akansha Bhardwaj ◽  
Jie Yang ◽  
Philippe Cudré-Mauroux

Microblogging platforms such as Twitter are increasingly being used in event detection. Existing approaches mainly use machine learning models and rely on event-related keywords to collect the data for model training. These approaches make strong assumptions on the distribution of the relevant microposts containing the keyword – referred to as the expectation of the distribution – and use it as a posterior regularization parameter during model training. Such approaches are, however, limited as they fail to reliably estimate the informativeness of a keyword and its expectation for model training. This paper introduces a Human-AI loop approach to jointly discover informative keywords for model training while estimating their expectation. Our approach iteratively leverages the crowd to estimate both keyword-specific expectation and the disagreement between the crowd and the model in order to discover new keywords that are most beneficial for model training. These keywords and their expectation not only improve the resulting performance but also make the model training process more transparent. We empirically demonstrate the merits of our approach, both in terms of accuracy and interpretability, on multiple real-world datasets and show that our approach improves the state of the art by 24.3%.
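The keyword-selection step of such a Human-AI loop can be sketched as follows: among candidate keywords, query the crowd about the one where the model's predicted fraction of relevant microposts disagrees most with the crowd-estimated expectation. The scoring rule below, and the assumption of a binary relevance classifier, are deliberate simplifications of the paper's approach.

    def next_keyword_to_crowdsource(candidate_keywords, posts_by_keyword, model, crowd_expectation):
        """Pick the candidate keyword with the largest model-vs-crowd disagreement.

        posts_by_keyword  : dict keyword -> (n_posts, n_features) feature array
        model             : any binary classifier with predict_proba
        crowd_expectation : dict keyword -> crowd-estimated fraction of relevant posts
        """
        def disagreement(kw):
            predicted_rate = model.predict_proba(posts_by_keyword[kw])[:, 1].mean()
            return abs(predicted_rate - crowd_expectation.get(kw, 0.5))
        return max(candidate_keywords, key=disagreement)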


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Sreemoyee Biswas ◽  
Nilay Khare ◽  
Pragati Agrawal ◽  
Priyank Jain

Abstract: With data becoming a salient asset worldwide, dependence among data items keeps growing, and the real-world datasets one works with today are highly correlated. In recent years, researchers have paid attention to this aspect of data and examined the correlation among records. Existing data privacy guarantees were designed under the assumption that records in a dataset are independent, so they cannot deliver the expected level of privacy once correlation is taken into account. There is therefore a pressing need to reconsider privacy algorithms in light of data correlation. Some research has utilized a well-known machine learning concept, Data Correlation Analysis, to better understand the relationships within data, with promising results. Although this body of work is still small, researchers have done a considerable amount of research on correlated data privacy, providing solutions based on probabilistic models, behavioral analysis, sensitivity analysis, information-theoretic models, statistical correlation analysis, exhaustive combination analysis, temporal privacy leakage, and weighted hierarchical graphs. Moreover, the real-world datasets in question are often large (technologically termed big data) and contain a high degree of correlation. The data correlation in big data must first be studied; researchers are exploring different analysis techniques to find the most suitable one, after which they can propose measures that guarantee privacy for correlated big data. This survey paper presents a detailed survey of the methods proposed by different researchers to deal with the problem of correlated data privacy and correlated big data privacy, and highlights the future scope in this area. The quantitative analysis of the reviewed articles suggests that data correlation is a significant threat to data privacy, and this threat is further magnified by big data. When data correlation is explicitly considered and analyzed, metrics such as the maximum number of queries executed and the mean average error show better results compared with other methods. Hence, there is a grave need to understand and propose solutions for correlated big data privacy.
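A toy illustration of the core issue the survey discusses: under differential privacy, if one individual's record is correlated with up to k other records, a counting query's effective sensitivity grows from 1 to k, so more noise is needed to keep the same guarantee. The group-size model of correlation used here is a deliberate simplification, not one of the surveyed mechanisms.

    import numpy as np

    def laplace_count_release(data, epsilon, correlated_group_size=1, rng=np.random.default_rng(0)):
        """Laplace-mechanism count where correlation inflates the effective sensitivity.

        data                  : 0/1 array, one entry per record
        correlated_group_size : k = maximum number of records that change when one
                                individual changes (k = 1 for independent records)
        """
        true_count = np.sum(data)
        sensitivity = correlated_group_size          # grows with correlation
        noise = rng.laplace(scale=sensitivity / epsilon)
        return true_count + noise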


Author(s):  
Jens Agerberg ◽  
Ryan Ramanujam ◽  
Martina Scolamiero ◽  
Wojciech Chachólski

Exciting recent developments in Topological Data Analysis have aimed at combining homology-based invariants with Machine Learning. In this article, we use hierarchical stabilization to bridge between persistence and kernel-based methods by introducing the so-called stable rank kernels. A fundamental property of the stable rank kernels is that they depend on metrics to compare persistence modules. We illustrate their use on artificial and real-world datasets and show that by varying the metric we can improve accuracy in classification tasks.
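A rough sketch of how a stable-rank-based kernel could be computed from persistence diagrams: build the stable rank function (at scale t, the number of bars of length at least t) and take an L2 inner product of two such functions. The choice of contour/metric (here the simplest one) and the integration grid are assumptions; the paper's point is precisely that varying this metric changes the kernel.

    import numpy as np

    def stable_rank(diagram, t_grid):
        """Stable rank under the simplest contour: at scale t, count bars of length >= t."""
        lengths = np.array([d - b for b, d in diagram if np.isfinite(d)])
        return np.array([(lengths >= t).sum() for t in t_grid], dtype=float)

    def stable_rank_kernel(diagram_a, diagram_b, t_max=2.0, n_grid=200):
        """L2 inner product of two stable rank functions on [0, t_max]."""
        t = np.linspace(0.0, t_max, n_grid)
        return np.trapz(stable_rank(diagram_a, t) * stable_rank(diagram_b, t), t)

    # Usage on two toy persistence diagrams given as (birth, death) pairs:
    # print(stable_rank_kernel([(0.0, 1.0), (0.2, 0.5)], [(0.1, 0.9)]))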


Author(s):  
Miran Kim ◽  
Yongsoo Song ◽  
Shuang Wang ◽  
Yuhou Xia ◽  
Xiaoqian Jiang

BACKGROUND: Learning a model without accessing raw data has been an intriguing idea for security and machine learning researchers for years. In an ideal setting, we want to encrypt sensitive data, store them on a commercial cloud, and run analyses without ever decrypting the data, in order to preserve privacy. Homomorphic encryption is a promising candidate for secure data outsourcing, but supporting real-world machine learning tasks with it is very challenging. Existing frameworks can only handle simplified cases with low-degree polynomials, such as the linear means classifier and linear discriminant analysis. OBJECTIVE: The goal of this study is to provide practical support for mainstream learning models (eg, logistic regression). METHODS: We adapted a novel homomorphic encryption scheme optimized for real-number computation. We devised (1) a least squares approximation of the logistic function for accuracy and efficiency (ie, to reduce computation cost) and (2) new packing and parallelization techniques. RESULTS: Using real-world datasets, we evaluated the performance of our model and demonstrated its feasibility in terms of speed and memory consumption. For example, it took approximately 116 minutes to obtain the training model from the homomorphically encrypted Edinburgh dataset, and the model gives fairly accurate predictions on the testing dataset. CONCLUSIONS: We present the first homomorphically encrypted logistic regression outsourcing model, based on the critical observation that the precision loss of classification models is sufficiently small that the decision plane remains effectively unchanged.
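The "least squares approximation of the logistic function" is the ingredient that lets an additions-and-multiplications-only homomorphic scheme evaluate a sigmoid. A minimal numpy sketch of fitting such a low-degree surrogate is shown below; the degree and fitting interval are illustrative choices, not the values used in the paper.

    import numpy as np

    def sigmoid_poly(degree=3, interval=(-8.0, 8.0), n_points=1001):
        """Least squares polynomial surrogate of the logistic (sigmoid) function."""
        x = np.linspace(*interval, n_points)
        sigmoid = 1.0 / (1.0 + np.exp(-x))
        coeffs = np.polynomial.polynomial.polyfit(x, sigmoid, degree)   # least squares fit
        return np.polynomial.Polynomial(coeffs)

    poly = sigmoid_poly()
    print(poly(0.5), 1.0 / (1.0 + np.exp(-0.5)))   # polynomial surrogate vs. exact sigmoid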


2020 ◽  
Vol 19 (2) ◽  
pp. 36-50
Author(s):  
Ryan Beal ◽  
Timothy J. Norman ◽  
Sarvapali D. Ramchurn

Abstract: In this paper, we critically evaluate the performance of nine machine learning classification techniques when applied to the match-outcome prediction problem presented by American Football. Specifically, we implement and test the nine techniques using real-world datasets of 1280 games over 5 seasons from the National Football League (NFL). We test the nine classifier techniques using a total of 42 features for each team and find that the best-performing algorithms are able to improve on previously published work. The algorithms achieve accuracies ranging from 44.64% for a Gaussian Process classifier to 67.53% for a Naïve Bayes classifier. We also test each classifier on a year-by-year basis and compare our results to those of the bookmakers and other leading academic papers.
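A sketch of the kind of comparison reported above, using scikit-learn to cross-validate two of the nine classifiers (Gaussian Process vs. Naïve Bayes) on a games-by-features matrix; the engineering of the 42 per-team features is not shown, and the model settings are library defaults rather than the paper's tuned configurations.

    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def compare_classifiers(X, y):
        """Cross-validated accuracy of two match-outcome classifiers.

        X : (n_games, n_features) matrix, e.g. 42 features per team concatenated
        y : (n_games,) match outcomes (home win / loss)
        """
        models = {
            "naive_bayes": make_pipeline(StandardScaler(), GaussianNB()),
            "gaussian_process": make_pipeline(StandardScaler(), GaussianProcessClassifier()),
        }
        return {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}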


Author(s):  
Christoph Dürr ◽  
Nguyen Kim Thang ◽  
Abhinav Srivastav ◽  
Léo Tible

Many real-world problems can often be cast as the optimization of DR-submodular functions defined over a convex domain. These functions play an important role in many areas of applied mathematics, such as machine learning, computer vision, operations research, communication systems, and economics. In addition, they capture a subclass of non-convex optimization problems that provide both practical and theoretical guarantees. In this paper, we show that, for maximizing non-monotone DR-submodular functions over a general convex set (such as up-closed convex sets, conic convex sets, etc.), the Frank-Wolfe algorithm achieves an approximation guarantee that depends on the convex set. To the best of our knowledge, this is the first such approximation guarantee. Finally, we benchmark our algorithm on problems arising in the machine learning domain using real-world datasets.
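For reference, the Frank-Wolfe template for maximizing a (continuous) DR-submodular function over a convex set looks roughly as follows; the fixed 1/T step size and the linear-maximization oracle interface are common choices in this literature, not necessarily the exact variant analyzed in the paper.

    import numpy as np

    def frank_wolfe_dr_submodular(grad_F, linear_max_oracle, x0, n_iters=100):
        """Frank-Wolfe loop: linearize the objective, maximize the linearization
        over the feasible set, then move a small step toward the maximizer.

        grad_F            : callable x -> gradient of the DR-submodular objective
        linear_max_oracle : callable g -> argmax over the convex set K of <g, v>
        x0                : feasible starting point
        """
        x = np.array(x0, dtype=float)
        for _ in range(n_iters):
            v = linear_max_oracle(grad_F(x))       # linear maximization step
            x = x + (1.0 / n_iters) * (v - x)      # convex-combination update stays in K
        return x

    # Usage on a toy box constraint [0, 1]^d with a concave (hence DR-submodular) objective:
    # grad = lambda x: 1.0 - x                     # gradient of sum_i (x_i - x_i^2 / 2)
    # oracle = lambda g: (g > 0).astype(float)     # box vertex maximizing <g, v>
    # print(frank_wolfe_dr_submodular(grad, oracle, np.zeros(3)))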

