A Unified Framework for Knowledge Intensive Gradient Boosting: Leveraging Human Experts for Noisy Sparse Domains

Incorporating richer human inputs, including qualitative constraints such as monotonic and synergistic influences, has long been explored in AI. Inspired by this, we consider the problem of using such influence statements in the successful gradient-boosting framework. We develop a unified framework for both classification and regression settings that can effectively and efficiently incorporate such constraints to accelerate learning and yield a better model. Our results in a large number of standard domains and two particularly novel real-world domains demonstrate the superiority of using domain knowledge rather than treating the human as a mere labeler.

Improving the performance of a radio-frequency localization system in adverse outdoor applications

EURASIP Journal on Wireless Communications and Networking ◽

10.1186/s13638-021-02001-6 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Marcelo N. de Sousa ◽

Ricardo Sant’Ana ◽

Rigel P. Fernandes ◽

Julio Cesar Duarte ◽

José A. Apolinário ◽

...

Keyword(s):

Random Forest ◽

Ray Tracing ◽

Real World ◽

Practical Implication ◽

Real Life ◽

Simulated Data ◽

Real Data ◽

Gradient Boosting ◽

Real World Data ◽

Localization Accuracy

In outdoor RF localization systems, particularly where line of sight cannot be guaranteed or where multipath effects are severe, information about the terrain may improve the position estimate’s performance. Given the difficulties in obtaining real data, a ray-tracing fingerprint is a viable option. Nevertheless, although they present good simulation results, systems trained with simulated features only suffer degraded performance when employed to process real-life data. This work intends to improve the localization accuracy when using ray-tracing fingerprints and a few field data obtained from an adverse environment where a large number of measurements is not an option. We employ a machine learning (ML) algorithm to explore the multipath information. We selected the random forest and gradient boosting algorithms, both considered efficient tools in the literature. In a strict simulation scenario (simulated data for training, validating, and testing), we obtained the same good results found in the literature (error around 2 m). In a real-world system (simulated data for training, real data for validating and testing), both ML algorithms resulted in a mean positioning error of around 100 m. We have also obtained experimental results for noisy (artificially added Gaussian noise) and mismatched (with a null subset of) features. From the simulations carried out in this work, our study revealed that enhancing the ML model with a few real-world data improves localization’s overall performance. Among the ML algorithms employed herein, we also observed that, under noisy conditions, the random forest algorithm achieved a slightly better result than the gradient boosting algorithm. However, they achieved similar results in a mismatch experiment.
This work’s practical implication is that multipath information, once rejected in old localization techniques, now represents a significant source of information whenever we have prior knowledge to train the ML algorithm.
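
The abstract reports accuracy as a mean positioning error in metres. A minimal sketch of how such a metric is commonly computed follows, assuming 2-D coordinates and Euclidean distance; the function name and the sample coordinates are hypothetical and are not taken from the paper.

```python
import math


def mean_positioning_error(estimated, actual):
    """Mean Euclidean distance between estimated and true positions."""
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("need two non-empty position lists of equal length")
    # math.dist gives the Euclidean distance between two points.
    distances = [math.dist(e, a) for e, a in zip(estimated, actual)]
    return sum(distances) / len(distances)


# Hypothetical coordinates in metres, for illustration only.
estimated = [(0.0, 0.0), (10.0, 0.0)]
actual = [(3.0, 4.0), (10.0, 2.0)]
error_m = mean_positioning_error(estimated, actual)
```

With these made-up points the two individual errors are 5 m and 2 m, so `error_m` is their average.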

A Multi-Tiered Framework for Insider Threat Prevention

Electronics ◽

10.3390/electronics10091005 ◽

2021 ◽

Vol 10 (9) ◽

pp. 1005

Author(s):

Rakan A. Alsowail ◽

Taher Al-Shehari

Keyword(s):

Real World ◽

Relevant Literature ◽

Security And Privacy ◽

Insider Threat ◽

Unified Framework ◽

Insider Threats ◽

Public And Private ◽

Private Organizations ◽

Confidential Data ◽

Real Value

As technologies are rapidly evolving and becoming a crucial part of our lives, security and privacy issues have been increasing significantly. Public and private organizations have highly confidential data, such as bank accounts, military and business secrets, etc. Currently, the competition between organizations is significantly higher than before, which prompts sensitive organizations to spend an excessive share of their budget to keep their assets secured from potential threats. Insider threats are more dangerous than external ones, as insiders have legitimate access to their organization’s assets. Previous approaches focused on some individual factors to address insider threat problems (e.g., technical profiling), but a broader integrative perspective is needed. In this paper, we propose a unified framework that incorporates various factors of the insider threat context (technical, psychological, behavioral and cognitive). The framework is based on a multi-tiered approach that encompasses pre-, in- and post-countermeasures to address insider threats from an all-encompassing perspective. It considers multiple factors that surround the lifespan of insiders’ employment, from the pre-joining of insiders to an organization until after they leave. The framework is applied to real-world insider threat cases. It is also compared with previous work to highlight how our framework extends and complements the existing frameworks. The real value of our framework is that it brings together the various aspects of insider threat problems based on real-world cases and relevant literature. This can therefore act as a platform for general understanding of insider threat problems, and pave the way to model a holistic insider threat prevention system.

HyP-ABC: A Novel Automated Hyper-Parameter Tuning Algorithm Using Evolutionary Optimization

10.36227/techrxiv.14714508.v2 ◽

2021 ◽

Author(s):

Leila Zahedi ◽

Farid Ghareh Mohammadi ◽

M. Hadi Amini

Keyword(s):

Parameter Optimization ◽

Real World ◽

Optimization Problems ◽

State Of The Art ◽

Parameter Tuning ◽

Gradient Boosting ◽

Support Vector ◽

Wide Range ◽

Extreme Gradient Boosting ◽

Art Techniques

Machine learning techniques lend themselves as promising decision-making and analytic tools in a wide range of applications. Different ML algorithms have various hyper-parameters. In order to tailor an ML model towards a specific application, a large number of hyper-parameters should be tuned. Tuning the hyper-parameters directly affects the performance (accuracy and run-time). However, for large-scale search spaces, efficiently exploring the ample number of combinations of hyper-parameters is computationally challenging. Existing automated hyper-parameter tuning techniques suffer from high time complexity. In this paper, we propose HyP-ABC, an automatic innovative hybrid hyper-parameter optimization algorithm using the modified artificial bee colony approach, to measure the classification accuracy of three ML algorithms, namely random forest, extreme gradient boosting, and support vector machine. Compared to the state-of-the-art techniques, HyP-ABC is more efficient and has a limited number of parameters to be tuned, making it worthwhile for real-world hyper-parameter optimization problems. We further compare our proposed HyP-ABC algorithm with state-of-the-art techniques. In order to ensure the robustness of the proposed method, the algorithm takes a wide range of feasible hyper-parameter values, and is tested using a real-world educational dataset.

An Interpretable Early Dynamic Sequential Predictor for Sepsis-Induced Coagulopathy Progression in the Real-World Using Machine Learning

Frontiers in Medicine ◽

10.3389/fmed.2021.775047 ◽

2021 ◽

Vol 8 ◽

Author(s):

Ruixia Cui ◽

Wenbo Hua ◽

Kai Qu ◽

Heran Yang ◽

Yingmu Tong ◽

...

Keyword(s):

Machine Learning ◽

Real World ◽

Time Series Data ◽

Time Window ◽

Medical Center ◽

Characteristic Curve ◽

Series Data ◽

Gradient Boosting ◽

Early Management ◽

Extreme Gradient Boosting

Sepsis-associated coagulation dysfunction greatly increases the mortality of sepsis. Irregular clinical time-series data remains a major challenge for AI medical applications. To detect and manage sepsis-induced coagulopathy (SIC) and sepsis-associated disseminated intravascular coagulation (DIC) early, we developed an interpretable real-time sequential warning model for real-world irregular data. Eight machine learning models, including novel algorithms, were devised to detect SIC and sepsis-associated DIC 8n (1 ≤ n ≤ 6) hours prior to their onset. Models were developed on Xi'an Jiaotong University Medical College (XJTUMC) data and verified on Beth Israel Deaconess Medical Center (BIDMC) data. A total of 12,154 SIC and 7,878 International Society on Thrombosis and Haemostasis (ISTH) overt-DIC labels were annotated according to the SIC and ISTH overt-DIC scoring systems in the training set. The area under the receiver operating characteristic curve (AUROC) was used as the model evaluation metric. The eXtreme Gradient Boosting (XGBoost) model can predict SIC and sepsis-associated DIC events up to 48 h earlier with an AUROC of 0.929 and 0.910, respectively, and even reached 0.973 and 0.955 at 8 h earlier, achieving the highest performance to date. The novel ODE-RNN model achieved continuous prediction at arbitrary time points, with an AUROC of 0.962 and 0.936 for SIC and DIC predicted 8 h earlier, respectively. In conclusion, our model can predict the sepsis-associated SIC and DIC onset up to 48 h in advance, which helps maximize the time window for early management by physicians.
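
Since the abstract evaluates its models by AUROC, a small illustration of that metric may help. The sketch below uses the standard rank-based definition: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are hypothetical and unrelated to the study's data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

This pairwise form is quadratic in the number of samples, which is fine for illustration; library implementations typically sort the scores instead.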

Validation-Based Sparse Gaussian Process Classifier Design

Neural Computation ◽

10.1162/neco.2009.03-08-724 ◽

2009 ◽

Vol 21 (7) ◽

pp. 2082-2103 ◽

Cited By ~ 1

Author(s):

Shirish Shevade ◽

S. Sundararajan

Keyword(s):

Bayesian Methods ◽

Real World ◽

Basis Vector ◽

Data Sets ◽

Training Set ◽

Classifier Design ◽

Set Size ◽

Benchmark Data ◽

Regression Problems ◽

Classification And Regression

Gaussian processes (GPs) are promising Bayesian methods for classification and regression problems. Design of a GP classifier and making predictions using it is, however, computationally demanding, especially when the training set size is large. Sparse GP classifiers are known to overcome this limitation. In this letter, we propose and study a validation-based method for sparse GP classifier design. The proposed method uses a negative log predictive (NLP) loss measure, which is easy to compute for GP models. We use this measure for both basis vector selection and hyperparameter adaptation. The experimental results on several real-world benchmark data sets show better or comparable generalization performance over existing methods.
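
The negative log predictive (NLP) loss mentioned above can be illustrated for a binary classifier. This is a minimal sketch that assumes labels in {0, 1}, predictive probabilities for class 1, and averaging over the validation points; the paper's exact formulation (label encoding, summing versus averaging) may differ, and the names here are hypothetical.

```python
import math


def nlp_loss(labels, probs, eps=1e-12):
    """Average negative log predictive probability of the true labels."""
    if len(labels) != len(probs) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(labels)


# An uninformative predictor (p = 0.5 everywhere) gives a loss of log 2.
baseline = nlp_loss([1, 0], [0.5, 0.5])
```

Lower values indicate that the model assigns higher probability to the observed labels, which is what makes the measure usable as a validation criterion.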

Causal Feature Selection with Missing Data

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3488055 ◽

2022 ◽

Vol 16 (4) ◽

pp. 1-24

Author(s):

Kui Yu ◽

Yajing Yang ◽

Wei Ding

Keyword(s):

Feature Selection ◽

Missing Data ◽

Real World ◽

Missing Values ◽

Prediction Models ◽

Causal Structure ◽

Data Imputation ◽

Accurate Data ◽

Unified Framework ◽

Class Variable

Causal feature selection aims at learning the Markov blanket (MB) of a class variable for feature selection. The MB of a class variable implies the local causal structure among the class variable and its MB, and all other features are probabilistically independent of the class variable conditioning on its MB. This enables causal feature selection to identify potential causal features for building robust and physically meaningful prediction models. Missing data, ubiquitous in many real-world applications, remain an open research problem in causal feature selection due to their technical complexity. In this article, we discuss a novel multiple imputation MB (MimMB) framework for causal feature selection with missing data. MimMB integrates Data Imputation with MB Learning in a unified framework to enable the two key components to engage with each other. MB Learning enables Data Imputation in a potentially causal feature space for achieving accurate data imputation, while accurate Data Imputation in turn helps MB Learning identify a reliable MB of the class variable. Then, we further design an enhanced kNN estimator for imputing missing values and instantiate the MimMB. In our comprehensive experimental evaluation, our new approach can effectively learn the MB of a given variable in a Bayesian network and outperforms other rival algorithms on synthetic and real-world datasets.
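
For readers unfamiliar with kNN-based imputation, the sketch below shows the plain, textbook variant: a missing entry is filled with the mean of that column over the k complete rows nearest in the observed columns. It is not the enhanced estimator proposed in the article, whose details are not given here; the function name and the toy data are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries using the mean of the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row")
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(other, row=row, observed=observed):
            # Euclidean distance over the columns observed in this row.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        imputed.append([
            v if v is not None
            else sum(n[j] for n in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ])
    return imputed


# Toy data: the last row is missing its second feature.
rows = [[1.0, 2.0], [2.0, 4.0], [10.0, 20.0], [1.5, None]]
filled = knn_impute(rows, k=2)
```

In the toy data the two nearest complete rows are the first two, so the missing value is replaced by the mean of 2.0 and 4.0.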

Domain Driven Data Mining

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-59904-960-1.ch008 ◽

2008 ◽

pp. 196-223 ◽

Cited By ~ 1

Author(s):

Longbing Cao ◽

Chengqi Zhang

Keyword(s):

Data Mining ◽

Complex Systems ◽

Real World ◽

Domain Knowledge ◽

Pattern Mining ◽

Iterative Refinement ◽

User Preference ◽

Data Driven ◽

Real World Data ◽

Hidden Knowledge

Traditional data mining based on quantitative intelligence is facing grand challenges from real-world enterprise and cross-organization applications. For instance, the usual demonstration of specific algorithms cannot support business users in taking actions that serve their advantage and needs. We attribute this to a data-driven philosophy focused on quantitative intelligence. It either views data mining as an autonomous, data-driven, trial-and-error process, or only analyzes business issues in an isolated, case-by-case manner. Based on experience and lessons learnt from real-world data mining and complex systems, this article proposes a practical data mining methodology referred to as Domain-Driven Data Mining. On top of quantitative intelligence and hidden knowledge in data, domain-driven data mining aims to meta-synthesize quantitative intelligence and qualitative intelligence in mining complex applications in which a human is in the loop. It targets actionable knowledge discovery in a constrained environment for satisfying user preference. Domain-driven methodology consists of key components including understanding the constrained environment, business-technical questionnaire, representing and involving domain knowledge, human-mining cooperation and interaction, constructing next-generation mining infrastructure, in-depth pattern mining and postprocessing, business interestingness and actionability enhancement, and loop-closed human-cooperated iterative refinement. Domain-driven data mining complements the data-driven methodology. The meta-synthesis of qualitative intelligence and quantitative intelligence has the potential to discover knowledge from complex systems and to enhance knowledge actionability for practical use by industry and business.

Learning Geo-Contextual Embeddings for Commuting Flow Prediction

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i01.5425 ◽

2020 ◽

Vol 34 (01) ◽

pp. 808-816

Author(s):

Zhicheng Liu ◽

Fabio Miranda ◽

Weiting Xiong ◽

Junyan Yang ◽

Qiao Wang ◽

...

Keyword(s):

New York ◽

Real World ◽

Policy Development ◽

Contextual Information ◽

Supply And Demand ◽

Gradient Boosting ◽

Spatial Correlations ◽

Commuting Flows ◽

Flow Prediction ◽

Conventional Models

Predicting commuting flows based on infrastructure and land-use information is critical for urban planning and public policy development. However, it is a challenging task given the complex patterns of commuting flows. Conventional models, such as the gravity model, are mainly derived from physics principles and are limited in their predictive power in real-world scenarios where many factors need to be considered. Meanwhile, most existing machine learning-based methods ignore the spatial correlations and fail to model the influence of nearby regions. To address these issues, we propose Geo-contextual Multitask Embedding Learner (GMEL), a model that captures the spatial correlations from geographic contextual information for commuting flow prediction. Specifically, we first construct a geo-adjacency network containing the geographic contextual information. Then, an attention mechanism is proposed based on the framework of graph attention network (GAT) to capture the spatial correlations and encode geographic contextual information into an embedding space. Two separate GATs are used to model supply and demand characteristics. To enhance the effectiveness of the embedding representation, a multitask learning framework is used to introduce stronger restrictions, forcing the embeddings to encapsulate effective representation for flow prediction. Finally, a gradient boosting machine is trained based on the learned embeddings to predict commuting flows. We evaluate our model using a real-world dataset from New York City, and the experimental results demonstrate the effectiveness of our proposed method against the state of the art.
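
The gravity model cited above as a conventional baseline can be sketched in a few lines. This assumes the common textbook form, in which flow grows with the "mass" (for example, population or jobs) of origin and destination and decays with distance; the parameter names, default exponents, and sample numbers are hypothetical and are not taken from the paper.

```python
def gravity_flow(origin_mass, dest_mass, distance,
                 k=1.0, alpha=1.0, beta=1.0, gamma=2.0):
    """Textbook gravity model: k * m_i^alpha * m_j^beta / d_ij^gamma."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return k * origin_mass ** alpha * dest_mass ** beta / distance ** gamma


# Hypothetical regions: 1000 residents, 500 jobs, 10 km apart.
flow = gravity_flow(1000, 500, 10)
```

Because the prediction depends only on two masses and one distance, the model cannot reflect other factors such as land use or the influence of neighbouring regions, which is the limitation the abstract points to.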

Scholarly Rigor

Evidence-Based Faculty Development Through the Scholarship of Teaching and Learning (SoTL) - Advances in Educational Marketing, Administration, and Leadership ◽

10.4018/978-1-7998-2212-7.ch014 ◽

2020 ◽

pp. 262-280

Author(s):

Robert Hallis

Keyword(s):

Instructional Practices ◽

Information Management ◽

Real World ◽

Domain Knowledge ◽

Teaching And Learning ◽

Research Question ◽

Management Skills ◽

Opinion Piece

The Scholarship of Teaching and Learning nurtures an academic discussion of best instructional practices. This case study examines the role domain knowledge plays in determining the extent to which students can effectively analyze an opinion piece from a major news organization, locate a relevant source to support their view of the issue, and reflect on the quality of their work. The goal of analyzing an opinion piece is twofold: it fosters critical thinking in analyzing the strength of an argument, and it promotes information management skills in locating and incorporating relevant sources in a real-world scenario. Students, however, exhibited difficulties in accurately completing the assignment and usually overestimated their expertise. This chapter traces how each step in the process of making this study public clarifies the issues encountered. The focus here, however, centers on the context within which the study was formulated, those issues that contributed to framing the research question, and how the context of inquiry served to deepen insights in interpreting the results.

Development and Analysis of an Enhanced Multi-Expert Knowledge Integration System for Designing Context-Aware Ubiquitous Learning Contents

International Journal of Distance Education Technologies ◽

10.4018/ijdet.2018100103 ◽

2018 ◽

Vol 16 (4) ◽

pp. 31-53 ◽

Cited By ~ 1

Author(s):

Gwo-Haur Hwang ◽

Beyin Chen ◽

Shiau-Huei Huang

Keyword(s):

Knowledge Acquisition ◽

Real World ◽

Domain Knowledge ◽

Knowledge Integration ◽

Expert Knowledge ◽

Context Aware ◽

Ubiquitous Learning ◽

Integration System ◽

Degree Of Acceptance ◽

Multiple Experts

This article describes how in context-aware ubiquitous learning environments, teachers must plan a theme and design learning contents to provide complete knowledge for students. Knowledge acquisition, which is an approach for helping people represent and organize domain knowledge, has been recognized as a potential way of guiding teachers to develop real-world context-related learning contents. However, previous studies failed to address the issue that the learning contents provided by multiple experts or teachers might be redundant or inconsistent; moreover, it is difficult to use the traditional knowledge acquisition method to fully describe the complex real-world contexts and the learning contents. Therefore, in this article, a multi-expert knowledge integration system with an enhanced knowledge representation approach and Delphi method has been developed. From the experimental results, it is found that the teachers involved had a high degree of acceptance of the system. They believe that it can unify the knowledge of many teachers.
