A Novel Approach using Expert Knowledge on Error based Pruning

2012 ◽  
Vol 11 (01) ◽  
pp. 1250007
Author(s):  
Ali Mirza Mahmood ◽  
Mrithyumjaya Rao Kuppa

Many traditional pruning methods assume that all datasets are equally probable and equally important, and therefore apply the same degree of pruning to every dataset. However, in real-world classification problems datasets are not all alike, and applying a uniform pruning rate tends to produce decision trees that are large and have high misclassification rates. In this paper, we present a practical algorithm for data-specific pruning when datasets have different properties. Another key motivation for data-specific pruning in this paper is the trade-off between accuracy and tree size. A new algorithm called Expert Knowledge Based Pruning (EKBP) is proposed to resolve this dilemma. We propose to integrate error rate, missing values, and expert judgment as factors that determine a data-specific pruning level for each dataset. We show by analysis and experiments that this pruning scales both accuracy and generalisation of the resulting tree, and that the method can be very effective for high-dimensional datasets. We conduct an extensive experimental study on 40 openly available real-world datasets from the UCI repository. In all these experiments, the proposed approach yields considerably smaller trees with equal or better accuracy than several benchmark decision tree methods proposed in the literature.
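
As a hedged illustration of how such a data-specific factor might plug into classical error-based pruning: the sketch below adapts a C4.5-style pessimistic error estimate by shrinking the confidence factor for noisy datasets. The weighting scheme and the names data_specific_cf and expert_weight are our assumptions, not the authors' exact formulation.

```python
import math
from statistics import NormalDist

def pessimistic_error(errors, n, cf):
    """C4.5-style upper confidence bound on a leaf's error rate;
    a smaller confidence factor `cf` gives a more pessimistic bound."""
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - cf)   # e.g. cf=0.25 -> z ~ 0.674
    f = errors / n
    return (f + z * z / (2 * n)
            + z * math.sqrt(f * (1 - f) / n + z * z / (4 * n * n))) / (1 + z * z / n)

def data_specific_cf(error_rate, missing_ratio, expert_weight, base_cf=0.25):
    """Assumed weighting: shrink the confidence factor for noisy data
    (high error rate, many missing values) or when the expert prefers
    compact trees, which makes the bound more pessimistic and prunes harder."""
    severity = 0.5 * error_rate + 0.3 * missing_ratio + 0.2 * expert_weight
    return max(0.01, base_cf * (1 - severity))

# Prune a subtree if replacing it with a leaf is no worse (pessimistically)
# than the weighted pessimistic error of its child leaves.
cf = data_specific_cf(error_rate=0.2, missing_ratio=0.3, expert_weight=0.8)
leaf = pessimistic_error(errors=12, n=100, cf=cf)
children = 0.6 * pessimistic_error(5, 60, cf) + 0.4 * pessimistic_error(4, 40, cf)
print(f"cf={cf:.3f}  leaf={leaf:.3f}  children={children:.3f}  prune={leaf <= children}")
```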

Author(s):  
Thelma Dede Baddoo ◽  
Zhijia Li ◽  
Samuel Nii Odai ◽  
Kenneth Rodolphe Chabi Boni ◽  
Isaac Kwesi Nooni ◽  
...  

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies that impute missing data in real-world datasets to establish the accuracy of imputation algorithms for such datasets are lacking. This study investigated how complex a missing-data reconstruction scheme needs to be to obtain relevant results for a real-world single-station streamflow record, so as to facilitate its further use. The investigation applied missing-data mechanisms ranging from univariate algorithms to multiple imputation methods suited to multivariate data, taking time as an explicit variable. The accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing-value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but those that perform best are usually time- and computationally intensive. Multiple imputation algorithms that consider the surrounding observed values and/or can capture the characteristics of the data provide results similar to the univariate algorithms and, in some cases, perform better without the added time and computational cost when time is taken as an explicit variable. Furthermore, the LEM is especially useful when the missing data fall in specific portions of the dataset or where very large gaps of missingness occur. Finally, proper handling of missing values in real-world hydroclimatic datasets depends on an extensive study of the particular dataset to be imputed.
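
As a minimal, hedged sketch of the kind of comparison described: a univariate interpolation of a synthetic single-station flow series versus a multivariate imputer that sees time as an explicit covariate, each scored over all gaps (a stand-in for TEM) and inside one large gap (a stand-in for LEM). The synthetic series, the linear/IterativeImputer choices, and the plain-RMSE error measures are illustrative assumptions, not the paper's exact schemes.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
t = np.arange(365.0)
flow = 50 + 20 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 2, t.size)  # synthetic flow
s = pd.Series(flow.copy())
s.iloc[100:130] = np.nan                                        # one large gap
s.iloc[rng.choice(np.arange(1, 364), 30, replace=False)] = np.nan  # scattered gaps

# Univariate scheme: interpolate the series on its own.
univ = s.interpolate(method="linear").to_numpy()

# Multivariate scheme: the imputer sees time as an explicit covariate.
multi = IterativeImputer(random_state=0).fit_transform(
    np.column_stack([t, s.to_numpy()]))[:, 1]

rmse = lambda a, b: float(np.sqrt(np.mean((a - b) ** 2)))
mask, gap = s.isna().to_numpy(), slice(100, 130)
for name, est in [("univariate", univ), ("multivariate", multi)]:
    print(f"{name}: TEM-like={rmse(est[mask], flow[mask]):.2f}  "
          f"LEM-like={rmse(est[gap], flow[gap]):.2f}")
```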


2020 ◽  
Vol 34 (04) ◽  
pp. 5956-5963
Author(s):  
Xianfeng Tang ◽  
Huaxiu Yao ◽  
Yiwei Sun ◽  
Charu Aggarwal ◽  
Prasenjit Mitra ◽  
...  

Multivariate time series (MTS) forecasting is widely used in various domains, such as meteorology and traffic. Due to limitations on data collection, transmission, and storage, real-world MTS data usually contain missing values, making it infeasible to apply existing MTS forecasting models such as linear regression and recurrent neural networks. Although many efforts have been devoted to this problem, most of them rely solely on local dependencies for imputing missing values, which ignores global temporal dynamics. Local dependencies/patterns become less useful when the missing ratio is high or the data have consecutive missing values, whereas exploring global patterns can alleviate this problem. Thus, jointly modeling local and global temporal dynamics is very promising for MTS forecasting with missing values; however, work in this direction is rather limited. We therefore study the novel problem of MTS forecasting with missing values by jointly exploring local and global temporal dynamics. We propose a new framework that leverages a memory network to explore global patterns given estimations from local perspectives. We further introduce adversarial training to enhance the modeling of the global temporal distribution. Experimental results on real-world datasets show the effectiveness of the proposed framework for MTS forecasting with missing values and its robustness under various missing ratios.
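
As a minimal, hedged numpy sketch of the core idea: refine a local estimate of a missing value by attending over a memory of global temporal patterns. The shapes, the softmax read, and the fixed combination weight beta are our assumptions; the paper's framework learns these components end to end and adds adversarial training, which is omitted here.

```python
import numpy as np

def memory_refine(local_est, memory, beta=0.5):
    """local_est: (d,) local estimate of the d variables at one timestamp.
    memory: (m, d) global temporal pattern prototypes (e.g. from history).
    Returns a convex combination of the local estimate and an
    attention-weighted read from the memory."""
    scores = memory @ local_est              # similarity to each memory slot
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                       # softmax over memory slots
    global_read = attn @ memory              # (d,) global-pattern estimate
    return beta * local_est + (1 - beta) * global_read

rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 4))             # 8 prototypes over 4 variables
local = rng.normal(size=4)                   # e.g. built from nearby observations
print(memory_refine(local, memory))
```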


Author(s):  
Juheng Zhang ◽  
Xiaoping Liu ◽  
Xiao-Bai Li

We study strategically missing data problems in predictive analytics with regression. In many real-world situations, such as financial reporting, college admission, job application, and marketing advertisement, data providers often conceal certain information on purpose in order to gain a favorable outcome. It is important for the decision-maker to have a mechanism to deal with such strategic behaviors. We propose a novel approach to handle strategically missing data in regression prediction. The proposed method derives imputation values for strategically missing data based on support vector regression (SVR) models, and it provides incentives for the data providers to disclose their true information. We show that, under some reasonable conditions, the proposed method minimizes the imputation errors for the missing values. An experimental study on real-world data demonstrates the effectiveness of the proposed approach.
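
A hedged sketch of the imputation step only: fit SVR on the records whose providers disclosed the attribute, then predict it for the providers who concealed it. The synthetic data and model settings are assumptions; the paper's incentive-compatibility analysis is not reproduced here.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                               # disclosed attributes
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 200)  # concealable attribute
hidden = rng.choice(200, 40, replace=False)                 # providers who withheld it
observed = np.setdiff1d(np.arange(200), hidden)

# Learn the disclosed-to-concealed mapping from complete records.
svr = SVR(kernel="rbf", C=10.0).fit(X[observed], y[observed])
y_imputed = y.copy()
y_imputed[hidden] = svr.predict(X[hidden])                  # imputation values
print("mean absolute imputation error:",
      float(np.abs(y_imputed[hidden] - y[hidden]).mean()))
```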


Sensors ◽  
2021 ◽  
Vol 21 (19) ◽  
pp. 6661
Author(s):  
Lars Schmarje ◽  
Johannes Brünger ◽  
Monty Santarossa ◽  
Simon-Martin Schröder ◽  
Rainer Kiko ◽  
...  

Deep learning has been successfully applied to many classification problems, including underwater challenges. However, a long-standing issue with deep learning is the need for large and consistently labeled datasets. Although current approaches in semi-supervised learning can decrease the required amount of annotated data by a factor of 10 or even more, this line of research still assumes distinct classes. For underwater classification, and for uncurated real-world datasets in general, clean class boundaries often cannot be drawn because of the limited information content of the images and the transitional stages of the depicted objects. This leads different experts to different opinions, producing fuzzy labels that could also be considered ambiguous or divergent. We propose a novel framework for semi-supervised classification of such fuzzy labels, based on the idea of overclustering to detect substructures within them. We propose a novel loss to improve the overclustering capability of our framework and show the benefit of overclustering for fuzzy labels. We show that our framework is superior to previous state-of-the-art semi-supervised methods when applied to real-world plankton data with fuzzy labels. Moreover, we obtain 5 to 10% more consistent predictions of substructures.
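
A minimal sketch of the overclustering idea on synthetic data: cluster with several times more clusters than nominal classes so that ambiguous samples can occupy their own substructures, then map each fine-grained cluster to a class via a small labeled subset. The dataset, the factor of 4, and the majority-vote mapping are illustrative assumptions; the paper's framework is a neural network trained with a dedicated overclustering loss.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=600, centers=3, cluster_std=2.5, random_state=0)
n_classes, overcluster_factor = 3, 4
km = KMeans(n_clusters=n_classes * overcluster_factor,
            n_init=10, random_state=0).fit(X)

# Map each fine-grained cluster to the majority class of a small labeled subset.
labeled = np.random.default_rng(0).choice(len(X), 60, replace=False)
mapping = {}
for c in np.unique(km.labels_):
    members = labeled[km.labels_[labeled] == c]
    mapping[c] = (np.bincount(y[members], minlength=n_classes).argmax()
                  if members.size else -1)
pred = np.array([mapping[c] for c in km.labels_])
known = pred >= 0
print("clusters:", km.n_clusters,
      "accuracy on mapped samples:", float((pred[known] == y[known]).mean()))
```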


Author(s):  
Antonios Alexos ◽  
Sotirios Chatzis

In this paper we address the problem of understanding why a deep learning model decides that an individual is or is not eligible for a loan. We propose a novel approach for inferring which attributes matter most to the decision in each specific individual case. Specifically, we leverage concepts from neural attention to devise a novel feature-wise attention mechanism. As we show on real-world datasets, our approach offers unique insights into the importance of the various features by producing a decision explanation for each specific loan case. At the same time, we observe that our mechanism generates decisions that are much closer to those of human experts than the decisions of existing competitors.
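
A hedged numpy sketch of a feature-wise attention mechanism of this general kind: a scorer produces one weight per input attribute, the weights are softmax-normalized, and the decision layer consumes the re-weighted features, so the weights double as a per-case explanation. The single-layer scorer, the linear decision layer, and all dimensions are our assumptions, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                                   # e.g. income, age, debt, ...
W_a = rng.normal(size=(d, d))           # attention scorer parameters (assumed)
w_out = rng.normal(size=d)              # downstream linear decision layer

def forward(x):
    scores = W_a @ x                    # one score per feature
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                  # feature-wise attention weights
    logit = w_out @ (attn * x)          # decide on the attended features
    return 1 / (1 + np.exp(-logit)), attn

x = rng.normal(size=d)                  # one loan applicant
p, attn = forward(x)
print("approval probability:", round(float(p), 3))
print("per-feature importance:", np.round(attn, 3))  # the decision explanation
```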


Author(s):  
MAJURA F. SELEKWA ◽  
VALERIAN KWIGIZILE ◽  
RENATUS N. MUSSA

Many neural network methods used for efficient classification of populations work only when the population is globally separable. In situ classification of highway vehicles is one such problem with a globally nonseparable population. This paper presents a systematic procedure for setting up a probabilistic neural network that can classify the globally nonseparable population of highway vehicles. The method is based on the simple concept that any set of classifiable data can be broken down into subclasses of locally separable data. Hence, if these locally separable data can be identified, the classification problem can be carried out in two hierarchical steps: step one classifies the data into the local subclasses, and step two maps the local subclasses to the global classes. The proposed approach was tested on the problem of classifying highway vehicles according to the US Federal Highway Administration standard, which is normally handled by decision tree methods that use vehicle axle information and a set of IF-THEN rules. On a sample of 3326 vehicles, the proposed method showed improved classification results, with an overall misclassification rate of only 2.9% compared to 9.7% for the decision tree methods. A similar setup can be used with other neural networks, such as recurrent neural networks, but these were not tested in this study, since the focus was on in situ applications where a high learning rate is desired.
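
A minimal sketch of the two-step scheme: step one uses a probabilistic neural network (a Parzen-window classifier) to assign a sample to one of the locally separable subclasses, and step two looks the subclass up in a fixed subclass-to-class table. The synthetic 2-D data, the subclass layout, and the smoothing parameter are illustrative assumptions.

```python
import numpy as np

def pnn_predict(x, train_X, train_sub, sigma=0.5):
    """Step 1: pick the subclass with the highest mean Gaussian-kernel activation."""
    subs = np.unique(train_sub)
    acts = []
    for s in subs:
        d2 = ((train_X[train_sub == s] - x) ** 2).sum(axis=1)
        acts.append(np.exp(-d2 / (2 * sigma ** 2)).mean())
    return int(subs[int(np.argmax(acts))])

# Two global vehicle classes, each split into two locally separable subclasses.
rng = np.random.default_rng(0)
centers = {0: (0, 0), 1: (6, 6), 2: (0, 6), 3: (6, 0)}   # subclass centroids
sub_to_class = {0: 0, 1: 0, 2: 1, 3: 1}                  # step-2 lookup table
train_X = np.vstack([rng.normal(c, 0.8, (50, 2)) for c in centers.values()])
train_sub = np.repeat(list(centers), 50)

x = np.array([5.5, 0.5])
subclass = pnn_predict(x, train_X, train_sub)
print("subclass:", subclass, "-> global class:", sub_to_class[subclass])
```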


2021 ◽  
Author(s):  
Thibaut Pressat-Laffouilhère ◽  
Clément Massonnaud ◽  
Hélène Bréard ◽  
Margaux Lefebvre ◽  
Bruno Falissard ◽  
...  

Abstract

Background: Although data-driven methods for selecting covariates in multivariable models ignore confounding, mediation, and collision, they are still used in causal inference. This study, through three real-world datasets, shows the impact of data-driven methods on causal inference.

Methods: For each of three real-world datasets, a research question leading to a multivariable model was posed. Three covariate selection methods were compared on their ability to answer the question correctly: Augmented Backward Elimination with the BIC criterion and a "change-in-estimate" threshold of 0.05, Backward Elimination with the BIC criterion, and a knowledge-based method relying on causal diagrams. The covariates were classified as indispensable, prohibited, or optional, according to the potential bias they could introduce into the estimate. For each dataset and sample size (N = 75, 300, and 3,000), 10,000 Monte Carlo samples were drawn. The percentage of inclusion of each covariate in the models was computed. Coverage of Wald's 95% confidence interval for the exposure effect was computed against two different theoretical values (that of the analysed method and that of the knowledge-based method).

Results: Even with the largest sample size (n = 3,000), the data-driven methods were not reproducible: 8.6% to 53% of covariates were included in 20% to 80% of the experiments. Prohibited covariates could be included in more than 80% of the experiments, and indispensable covariates missed in more than 80% of the experiments, even with n = 3,000. With the largest sample size, coverage of the theoretical knowledge-based value by the data-driven methods ranged from 0% to 83.7%; coverage of the theoretical value of the same data-driven method ranged from 73.2% to 91.1% and was asymmetrical.

Conclusion: Data-driven methods should not be used for covariate selection in causal inference.
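
To make the experimental design concrete, here is a hedged Monte Carlo sketch in the spirit of the study: data are simulated from a known causal structure containing a prohibited covariate (a collider), BIC-based backward elimination is run on each sample, and the collider's inclusion percentage is counted. The data-generating process, effect sizes, and the simplified elimination routine are illustrative assumptions, not the study's datasets or code.

```python
import numpy as np

def ols_bic(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    return n * np.log(resid @ resid / n) + k * np.log(n)

def backward_eliminate(X, y, names):
    """Drop covariates while BIC improves; intercept and exposure are kept."""
    keep = list(range(X.shape[1]))
    while True:
        base = ols_bic(X[:, keep], y)
        candidates = [i for i in keep if names[i] not in ("intercept", "exposure")]
        if not candidates:
            return [names[i] for i in keep]
        best, drop = min((ols_bic(X[:, [j for j in keep if j != i]], y), i)
                         for i in candidates)
        if best >= base:
            return [names[i] for i in keep]
        keep.remove(drop)

rng = np.random.default_rng(0)
included = 0
for _ in range(1000):
    n = 300
    e = rng.normal(size=n)                 # exposure
    y = 0.5 * e + rng.normal(size=n)       # outcome, true effect 0.5
    col = e + y + rng.normal(size=n)       # collider: a prohibited covariate
    X = np.column_stack([np.ones(n), e, col])
    if "collider" in backward_eliminate(X, y, ["intercept", "exposure", "collider"]):
        included += 1
print(f"prohibited collider included in {included / 10:.1f}% of simulations")
```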


Author(s):  
Qiang Gao ◽  
Fan Zhou ◽  
Kunpeng Zhang ◽  
Goce Trajcevski ◽  
Xucheng Luo ◽  
...  

Understanding human trajectory patterns is an important task in many location-based social network (LBSN) applications, such as personalized recommendation and preference-based route planning. Most existing methods classify a trajectory (or its segments), based on spatio-temporal values and activities, into predefined categories, e.g., walking or jogging. We tackle a novel trajectory classification problem: identifying and linking trajectories to the users who generate them in LBSNs, a problem called Trajectory-User Linking (TUL). Solving the TUL problem is not trivial because (1) the number of classes (i.e., users) is much larger than the number of motion patterns in common trajectory classification problems, and (2) location-based trajectory data, especially check-ins, are often extremely sparse. To address these challenges, we propose a semi-supervised learning model based on recurrent neural networks (RNNs), called TULER (TUL via Embedding and RNN), which exploits spatio-temporal data to capture the underlying semantics of user mobility patterns. Experiments conducted on real-world datasets demonstrate that TULER achieves better accuracy than existing methods.
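
A minimal PyTorch sketch in the spirit of TULER: embed check-in location ids, run the sequence through an RNN, and classify the trajectory into one of many users. The vocabulary size, dimensions, and single-layer GRU are assumptions; the published model's exact configuration may differ.

```python
import torch
import torch.nn as nn

class TulerSketch(nn.Module):
    def __init__(self, n_locations=1000, n_users=200, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(n_locations, emb_dim)   # check-in embedding
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_users)           # one class per user

    def forward(self, traj):            # traj: (batch, seq_len) of location ids
        _, h = self.rnn(self.emb(traj))
        return self.cls(h[-1])          # user logits

model = TulerSketch()
traj = torch.randint(0, 1000, (4, 12))  # 4 sparse check-in sequences
print(model(traj).shape)                # torch.Size([4, 200])
```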


2013 ◽  
Vol 2013 ◽  
pp. 1-7 ◽  
Author(s):  
Semih Dinç ◽  
Abdullah Bal

This paper presents a novel approach to the hyperspectral imagery (HSI) classification problem using the Kernel Fukunaga-Koontz Transform (K-FKT). The kernel-based Fukunaga-Koontz Transform offers higher performance on classification problems thanks to its ability to handle nonlinear data distributions. K-FKT operates in two stages: training and testing. In the training stage, unlike the classical FKT, samples are mapped into a higher-dimensional kernel space, transforming nonlinearly distributed data into linear form and yielding a more effective solution for hyperspectral data classification. In the testing stage, the Fukunaga-Koontz transformation operator is applied to determine the classes of real-world hyperspectral images. In the experimental section, the improved performance of the K-FKT classification technique is demonstrated in comparison with other methods, namely the classical FKT and three types of support vector machines (SVMs).
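
A hedged numpy sketch of the classical (linear) Fukunaga-Koontz Transform that K-FKT builds on: whiten the sum of the two class correlation matrices, then use the shared eigenbasis in which eigenvalues near 1 favor one class and eigenvalues near 0 favor the other. The kernel variant additionally maps samples into a kernel feature space; that mapping, and the synthetic data here, are omitted or assumed for brevity.

```python
import numpy as np

def fkt_operator(XA, XB, eps=1e-8):
    """Rows of XA/XB are samples. Returns the shared FKT basis and the
    class-A eigenvalues (class-B eigenvalues are one minus these)."""
    SA = XA.T @ XA / len(XA)
    SB = XB.T @ XB / len(XB)
    vals, vecs = np.linalg.eigh(SA + SB)
    P_inv_half = vecs @ np.diag(1 / np.sqrt(vals + eps)) @ vecs.T   # whitening
    lamA, V = np.linalg.eigh(P_inv_half @ SA @ P_inv_half)
    return P_inv_half @ V, lamA

rng = np.random.default_rng(0)
XA = rng.normal(size=(100, 5)) @ np.diag([3, 1, 1, 1, 1])  # class A energy on axis 0
XB = rng.normal(size=(100, 5)) @ np.diag([1, 1, 1, 1, 3])  # class B energy on axis 4
basis, lamA = fkt_operator(XA, XB)

x = rng.normal(size=5) * np.array([3, 1, 1, 1, 1])         # unseen class-A-like sample
proj = (basis.T @ x) ** 2                                  # energy per shared eigenvector
print("class A score:", float(proj @ lamA),
      "class B score:", float(proj @ (1 - lamA)))
```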


1985 ◽  
Vol 29 (4) ◽  
pp. 354-357
Author(s):  
Robert Dick ◽  
Renwick Curry

This paper outlines an approach to computer simulation that was used to verify, early in the design process, the operability of selected functions in a complex combat system. The approach emphasizes the establishment of an appropriate real-world context for the simulation and extensive reliance on expert knowledge in all steps leading up to the programming effort.

