Data Mining and Knowledge Discovery

AbstractThrough the quantification of physical activity energy expenditure (PAEE), health care monitoring has the potential to stimulate vital and healthy ageing, inducing behavioural changes in older people and linking these to personal health gains. To be able to measure PAEE in a health care perspective, methods from wearable accelerometers have been developed, however, mainly targeted towards younger people. Since elderly subjects differ in energy requirements and range of physical activities, the current models may not be suitable for estimating PAEE among the elderly. Furthermore, currently available methods seem to be either simple but non-generalizable or require elaborate (manual) feature construction steps. Because past activities influence present PAEE, we propose a modeling approach known for its ability to model sequential data, the recurrent neural network (RNN). To train the RNN for an elderly population, we used the growing old together validation (GOTOV) dataset with 34 healthy participants of 60 years and older (mean 65 years old), performing 16 different activities. We used accelerometers placed on wrist and ankle, and measurements of energy counts by means of indirect calorimetry. After optimization, we propose an architecture consisting of an RNN with 3 GRU layers and a feedforward network combining both accelerometer and participant-level data. Our efforts included switching mean to standard deviation for down-sampling the input data and combining temporal and static data (person-specific details such as age, weight, BMI). The resulting architecture produces accurate PAEE estimations while decreasing training input and time by a factor of 10. Subsequently, compared to the state-of-the-art, it is capable to integrate longer activity data which lead to more accurate estimations of low intensity activities EE. It can thus be employed to investigate associations of PAEE with vitality parameters of older people related to metabolic and cognitive health and mental well-being.

Download Full-text

Synwalk: community detection via random walk modelling

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00809-w ◽

2022 ◽

Author(s):

Christian Toth ◽

Denis Helic ◽

Bernhard C. Geiger

Keyword(s):

Random Walk ◽

Random Walks ◽

Community Detection ◽

Detection Algorithm ◽

Future Research ◽

Small Communities ◽

Mixing Parameter ◽

Modularity Maximization ◽

Community Detection Algorithm ◽

Empirical Networks

AbstractComplex systems, abstractly represented as networks, are ubiquitous in everyday life. Analyzing and understanding these systems requires, among others, tools for community detection. As no single best community detection algorithm can exist, robustness across a wide variety of problem settings is desirable. In this work, we present Synwalk, a random walk-based community detection method. Synwalk builds upon a solid theoretical basis and detects communities by synthesizing the random walk induced by the given network from a class of candidate random walks. We thoroughly validate the effectiveness of our approach on synthetic and empirical networks, respectively, and compare Synwalk’s performance with the performance of Infomap and Walktrap (also random walk-based), Louvain (based on modularity maximization) and stochastic block model inference. Our results indicate that Synwalk performs robustly on networks with varying mixing parameters and degree distributions. We outperform Infomap on networks with high mixing parameter, and Infomap and Walktrap on networks with many small communities and low average degree. Our work has a potential to inspire further development of community detection via synthesis of random walks and we provide concrete ideas for future research.

Download Full-text

Strengthening ties towards a highly-connected world

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00812-1 ◽

2022 ◽

Author(s):

Antonis Matakos ◽

Aristides Gionis

Keyword(s):

Social Networks ◽

Online Social Networks ◽

Constant Factor ◽

Algorithmic Problem ◽

Full Potential ◽

User Engagement ◽

Scalable Algorithms ◽

Special Cases ◽

Access Information ◽

Flow Of Information

AbstractOnline social networks provide a forum where people make new connections, learn more about the world, get exposed to different points of view, and access information that were previously inaccessible. It is natural to assume that content-delivery algorithms in social networks should not only aim to maximize user engagement but also to offer opportunities for increasing connectivity and enabling social networks to achieve their full potential. Our motivation and aim is to develop methods that foster the creation of new connections, and subsequently, improve the flow of information in the network. To achieve our goal, we propose to leverage the strong triadic closure principle, and consider violations to this principle as opportunities for creating more social links. We formalize this idea as an algorithmic problem related to the densest k-subgraph problem. For this new problem, we establish hardness results and propose approximation algorithms. We identify two special cases of the problem that admit a constant-factor approximation. Finally, we experimentally evaluate our proposed algorithm on real-world social networks, and we additionally evaluate some simpler but more scalable algorithms.

Download Full-text

Dynamic cyber risk estimation with competitive quantile autoregression

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00814-z ◽

2022 ◽

Author(s):

Raisa Dzhamtyrova ◽

Carsten Maple

Keyword(s):

Value At Risk ◽

Arrival Time ◽

Time Series Data ◽

Risk Estimation ◽

Series Data ◽

Cyber Attack ◽

Significance Level ◽

Confidence Levels ◽

Quantile Autoregression ◽

Cyber Risk

AbstractThe increasing value of data held in enterprises makes it an attractive target to attackers. The increasing likelihood and impact of a cyber attack have highlighted the importance of effective cyber risk estimation. We propose two methods for modelling Value-at-Risk (VaR) which can be used for any time-series data. The first approach is based on Quantile Autoregression (QAR), which can estimate VaR for different quantiles, i. e. confidence levels. The second method, we term Competitive Quantile Autoregression (CQAR), dynamically re-estimates cyber risk as soon as new data becomes available. This method provides a theoretical guarantee that it asymptotically performs as well as any QAR at any time point in the future. We show that these methods can predict the size and inter-arrival time of cyber hacking breaches by running coverage tests. The proposed approaches allow to model a separate stochastic process for each significance level and therefore provide more flexibility compared to previously proposed techniques. We provide a fully reproducible code used for conducting the experiments.

Download Full-text

INK: knowledge graph embeddings for node classification

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00806-z ◽

2022 ◽

Author(s):

Bram Steenwinckel ◽

Gilles Vandewiele ◽

Michael Weyns ◽

Terencio Agozzino ◽

Filip De Turck ◽

...

Keyword(s):

Knowledge Graph ◽

Graph Embeddings ◽

Node Classification

Download Full-text

Expected passes

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00810-3 ◽

2022 ◽

Author(s):

Gabriel Anzer ◽

Pascal Bauer

Keyword(s):

Success Probability ◽

Area Under The Curve ◽

Gradient Boosting ◽

Frequent Event ◽

Extreme Gradient Boosting ◽

Potential Applications ◽

Ball Trajectory ◽

Positional Data ◽

Meaningful Research ◽

Equal Difficulty

AbstractPasses are by far football’s (soccer) most frequent event, yet surprisingly little meaningful research has been devoted to quantify them. With the increase in availability of so-called positional data, describing the positioning of players and ball at every moment of the game, our work aims to determine the difficulty of every pass by calculating its success probability based on its surrounding circumstances. As most experts will agree, not all passes are of equal difficulty, however, most traditional metrics count them as such. With our work we can quantify how well players can execute passes, assess their risk profile, and even compute completion probabilities for hypothetical passes by combining physical and machine learning models. Our model uses the first 0.4 seconds of a ball trajectory and the movement vectors of all players to predict the intended target of a pass with an accuracy of $$93.0\%$$ 93.0 % for successful and $$72.0\%$$ 72.0 % for unsuccessful passes much higher than any previously published work. Our extreme gradient boosting model can then quantify the likelihood of a successful pass completion towards the identified target with an area under the curve (AUC) of $$93.4\%$$ 93.4 % . Finally, we discuss several potential applications, like player scouting or evaluating pass decisions.

Download Full-text

Provable randomized rounding for minimum-similarity diversification

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00811-2 ◽

2022 ◽

Author(s):

Bruno Ordozgoiti ◽

Ananth Mahadevan ◽

Antonis Matakos ◽

Aristides Gionis

Keyword(s):

Optimization Problems ◽

Randomized Algorithm ◽

Similarity Function ◽

Randomized Rounding ◽

Cardinality Constraint ◽

Similarity Functions ◽

Penalty Term ◽

Combinatorial Optimization Problems ◽

Benchmark Datasets ◽

Approximation Guarantee

AbstractWhen searching for information in a data collection, we are often interested not only in finding relevant items, but also in assembling a diverse set, so as to explore different concepts that are present in the data. This problem has been researched extensively. However, finding a set of items with minimal pairwise similarities can be computationally challenging, and most existing works striving for quality guarantees assume that item relatedness is measured by a distance function. Given the widespread use of similarity functions in many domains, we believe this to be an important gap in the literature. In this paper we study the problem of finding a diverse set of items, when item relatedness is measured by a similarity function. We formulate the diversification task using a flexible, broadly applicable minimization objective, consisting of the sum of pairwise similarities of the selected items and a relevance penalty term. To find good solutions we adopt a randomized rounding strategy, which is challenging to analyze because of the cardinality constraint present in our formulation. Even though this obstacle can be overcome using dependent rounding, we show that it is possible to obtain provably good solutions using an independent approach, which is faster, simpler to implement and completely parallelizable. Our analysis relies on a novel bound for the ratio of Poisson-Binomial densities, which is of independent interest and has potential implications for other combinatorial-optimization problems. We leverage this result to design an efficient randomized algorithm that provides a lower-order additive approximation guarantee. We validate our method using several benchmark datasets, and show that it consistently outperforms the greedy approaches that are commonly used in the literature.

Download Full-text