Symbolic Regression for Approximating Graph Geodetic Number

2021
Author(s):  
Ahmad T. Anaqreh ◽  
Boglárka G.-Tóth ◽  
Tamás Vinkó

Graph properties are attributes that help make the structure of a graph understandable. Standard methods for computing the exact values of graph properties often cannot cope with the huge computational complexity involved, especially on real-world graphs. Heuristics and metaheuristics are alternatives that have proved able to provide sufficient solutions in a reasonable time. In some cases, however, even heuristics are not efficient enough, as they require global information about the graph that is not easily obtainable. The problem should then be dealt with in a completely different way: find features that relate to the property and, based on these data, build a formula that can approximate the graph property. In this work, symbolic regression with an evolutionary algorithm called Cartesian Genetic Programming is used to derive formulas that approximate the graph geodetic number, i.e., the minimum cardinality of a set of vertices such that all shortest paths between its elements cover every vertex of the graph. Computing the exact value of the geodetic number is known to be NP-hard for general graphs. The obtained formulas are tested on random and real-world graphs. It is demonstrated how different graph properties used as training data lead to diverse formulas with different accuracy. It is also investigated which training data are really related to each property.
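
A minimal sketch of the end product of such an approach: a closed-form formula evaluated over cheap global graph features (here via networkx). The feature set is illustrative and the formula is a hypothetical placeholder, not one actually evolved in the paper.

```python
# Sketch: evaluating a symbolic-regression-style formula for the geodetic
# number from inexpensive graph features. The formula below is a
# hypothetical placeholder, not a result from the paper.
import networkx as nx

def graph_features(G):
    """Collect cheap global features usable as inputs to an evolved formula."""
    return {
        "n": G.number_of_nodes(),
        "density": nx.density(G),
        "max_deg": max(d for _, d in G.degree),
        "clustering": nx.average_clustering(G),
    }

def approx_geodetic(G):
    f = graph_features(G)
    # Hypothetical evolved expression combining the features.
    return max(2.0, f["n"] * (1.0 - f["density"])
               - 0.5 * f["clustering"] * f["max_deg"])

G = nx.erdos_renyi_graph(50, 0.1, seed=1)
print(approx_geodetic(G))
```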

2021
Vol 11 (1)
Author(s):  
Silvia Zaoli ◽  
Piero Mazzarisi ◽  
Fabrizio Lillo

Abstract: Betweenness centrality quantifies the importance of a vertex for the information flow in a network. The standard betweenness centrality applies to static single-layer networks, but many real-world networks are both dynamic and made of several layers. We propose a definition of betweenness centrality for temporal multiplexes. This definition accounts for the topological and temporal structure, and for the duration of paths, in the determination of the shortest paths. We propose an algorithm to compute the new metric using a mapping to a static graph. We apply the metric to a dataset of ∼20k European flights and compare the results with those obtained with static or single-layer metrics. The differences in the airport rankings highlight the importance of considering the temporal multiplex structure and an appropriate distance metric.
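
A toy illustration of the mapping idea: expand a temporal edge list into a static directed graph over (node, time) copies, with waiting arcs between consecutive copies of the same node, then run static betweenness on it. This is a simplified single-layer sketch of the time-expansion technique, not the paper's exact multiplex metric.

```python
# Map temporal edges to a static time-expanded digraph, then compute
# ordinary weighted betweenness on the expanded graph.
import networkx as nx

# (origin, destination, departure time, arrival time) -- toy "flights"
temporal_edges = [("A", "B", 1, 2), ("B", "C", 3, 4), ("A", "C", 2, 5)]

H = nx.DiGraph()
for u, v, dep, arr in temporal_edges:
    H.add_edge((u, dep), (v, arr), weight=arr - dep)  # travel arc

# Waiting arcs connect consecutive time copies of the same node.
for node in {x for u, v, d, a in temporal_edges for x in (u, v)}:
    times = sorted(t for (n, t) in H.nodes if n == node)
    for t1, t2 in zip(times, times[1:]):
        H.add_edge((node, t1), (node, t2), weight=t2 - t1)  # waiting arc

bc = nx.betweenness_centrality(H, weight="weight")
print(sorted(bc.items(), key=lambda kv: -kv[1])[:3])
```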


2021
Vol 17 (2)
pp. 1-20
Author(s):  
Zheng Wang ◽  
Qiao Wang ◽  
Tingzhang Zhao ◽  
Chaokun Wang ◽  
Xiaojun Ye

Feature selection, an effective technique for dimensionality reduction, plays an important role in many machine learning systems, and supervised knowledge can significantly improve its performance. However, faced with the rapid growth of newly emerging concepts, existing supervised methods can easily suffer from the scarcity and questionable validity of labeled training data. In this paper, the authors study the problem of zero-shot feature selection, i.e., building a feature selection model that generalizes well to “unseen” concepts given limited training data of “seen” concepts. Specifically, they adopt class-semantic descriptions (i.e., attributes) as supervision for feature selection, so as to utilize the supervised knowledge transferred from the seen concepts. To obtain more reliable discriminative features, they further propose the center-characteristic loss, which encourages the selected features to capture the central characteristics of seen concepts. Extensive experiments conducted on various real-world datasets demonstrate the effectiveness of the method.
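
A hedged sketch of a center-based regularizer in the spirit of the center-characteristic loss described above: features, after soft selection weights are applied, are pulled toward their class centers. The paper's exact formulation may differ; all names here are illustrative.

```python
# Center-style regularizer: selected (soft-weighted) features are
# encouraged to stay close to their class centers.
import torch

def center_characteristic_loss(X, y, w, centers):
    """X: (batch, d) features; y: (batch,) class labels;
    w: (d,) soft feature-selection weights; centers: (num_classes, d)."""
    sel = X * w                      # apply soft feature selection
    diff = sel - centers[y] * w      # distance to the (selected) class center
    return (diff ** 2).sum(dim=1).mean()

X = torch.randn(8, 5)
y = torch.randint(0, 3, (8,))
w = torch.sigmoid(torch.randn(5, requires_grad=True))  # learnable selection
centers = torch.randn(3, 5, requires_grad=True)        # learnable centers
loss = center_characteristic_loss(X, y, w, centers)
loss.backward()
```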


2021
Vol 14 (6)
pp. 997-1005
Author(s):  
Sandeep Tata ◽  
Navneet Potti ◽  
James B. Wendt ◽  
Lauro Beltrão Costa ◽  
Marc Najork ◽  
...  

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.
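
A hedged sketch of one data-management step of the kind described above: turning a labeled document into (candidate, label) training examples by matching the ground-truth field value against candidate text spans under normalization. The field, the date formats, and the matching rule are hypothetical stand-ins, not Glean's actual implementation.

```python
# Generate binary training examples for a date field by matching candidate
# spans against the labeled ground-truth value after normalization.
from datetime import datetime

def normalize_date(text):
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            pass
    return None

def make_training_examples(candidates, ground_truth):
    """candidates: text spans typed as dates by an annotator;
    ground_truth: the labeled invoice_date string for this document."""
    target = normalize_date(ground_truth)
    examples = []
    for span in candidates:
        # Positive if the span denotes the same date as the label.
        label = int(target is not None and normalize_date(span) == target)
        examples.append((span, label))
    return examples

print(make_training_examples(["01/15/2021", "January 15, 2021", "02/01/2021"],
                             "2021-01-15"))
```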


Author(s):  
Iztok Fister Jr. ◽  
Iztok Fister

For many people, sport is a stress-relieving activity. People involved in sport wish to achieve an attractive shape, lead a healthy lifestyle, lose weight, and so on. There are also people who pursue sport with competitive goals, and in order to fulfill those goals they need to train properly. Even for professionals, planning serious training is very hard. On the other hand, the recent proliferation of smart sport watches and even smartphones allows athletes to train smarter. Over months and years they produce dozens of activity files, and these files offer thousands of opportunities for data mining approaches through which athletes can gain deep insight into their training data. Data mining approaches can extract the habits of athletes, help prevent overtraining syndrome and injuries, cluster similar activities together, and much more. In this chapter, the authors show opportunities for data mining, enumerate recent applications, and outline future potential for research and applications in the real world.
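
A minimal sketch of one application mentioned above, clustering similar training activities by simple summary features. The feature set and cluster count are toy choices, not ones prescribed by the chapter.

```python
# Cluster activity files by summary statistics using k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per activity file: [duration_min, distance_km, avg_heart_rate]
activities = np.array([
    [60, 20.0, 140], [62, 21.5, 145],   # endurance rides
    [30,  8.0, 165], [28,  7.5, 170],   # interval sessions
    [120, 45.0, 130],                   # long ride
])

X = StandardScaler().fit_transform(activities)  # put features on one scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)  # activities sharing a label are similar sessions
```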


2020
pp. 1-27
Author(s):  
M. Virgolin ◽  
T. Alderliesten ◽  
C. Witteveen ◽  
P. A. N. Bosman

The Gene-pool Optimal Mixing Evolutionary Algorithm (GOMEA) is a model-based EA framework that has been shown to perform well in several domains, including Genetic Programming (GP). Unlike traditional EAs, where variation acts blindly, GOMEA learns a model of interdependencies within the genotype, that is, the linkage, to estimate which patterns to propagate. In this article, we study the role of Linkage Learning (LL) performed by GOMEA in Symbolic Regression (SR). We show that the non-uniform distribution of the genotype in GP populations negatively biases LL, and propose a method to correct for this. We also propose approaches to improve LL when ephemeral random constants are used. Furthermore, we adapt to SR a scheme of interleaved runs that alleviates the burden of tuning the population size, a crucial parameter for LL. We run experiments on 10 real-world datasets, enforcing a strict limitation on solution size to enable interpretability. We find that the new LL method outperforms the standard one, and that GOMEA outperforms both traditional and semantic GP. We also find that the small solutions evolved by GOMEA are competitive with tuned decision trees, making GOMEA a promising new approach to SR.
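
A hedged sketch of the raw material of linkage learning: estimating pairwise dependency between genotype positions across the population via mutual information. GOMEA builds its linkage model (e.g., a linkage tree) on top of such a matrix; this sketch only computes the matrix itself, and the population is random stand-in data.

```python
# Estimate a pairwise linkage matrix over discrete genotype positions.
import numpy as np
from collections import Counter

def mutual_information(a, b):
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    pab = Counter(zip(a, b))
    return sum((c / n) * np.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def linkage_matrix(population):
    """population: (pop_size, genotype_len) array of discrete symbols."""
    L = population.shape[1]
    M = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1, L):
            M[i, j] = M[j, i] = mutual_information(
                population[:, i].tolist(), population[:, j].tolist())
    return M

pop = np.random.randint(0, 4, size=(100, 6))  # toy population
print(linkage_matrix(pop).round(3))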


JAMIA Open
2020
Vol 3 (2)
pp. 243-251
Author(s):  
Vincent J Major ◽  
Neil Jethani ◽  
Yindalon Aphinyanaphongs

Abstract
Objective: One primary consideration when developing predictive models is downstream effects on future model performance. We conduct experiments to quantify the effects of experimental design choices, namely cohort selection and internal validation methods, on (estimated) real-world model performance.
Materials and Methods: Four years of hospitalizations are used to develop a 1-year mortality prediction model (a composite of death or initiation of hospice care). Two common methods to select appropriate patient visits from their encounter history (backwards-from-outcome and forwards-from-admission) are combined with 2 testing cohorts (random and temporal validation). Two models are trained under otherwise identical conditions and their performances compared. Operating thresholds are selected in each test set and applied to a “real-world” cohort of labeled admissions from another, unused year.
Results: Backwards-from-outcome cohort selection retains 25% of candidate admissions (n = 23,579), whereas forwards-from-admission selection includes many more (n = 92,148). Both selection methods produce similar performances when applied to a random test set. However, when applied to the temporally defined “real-world” set, forwards-from-admission yields higher areas under the ROC and precision-recall curves (88.3% and 56.5% vs. 83.2% and 41.6%).
Discussion: A backwards-from-outcome experiment manipulates raw training data, simplifying the experiment. This manipulated data no longer resembles real-world data, resulting in optimistic estimates of test set performance, especially at high precision. In contrast, a forwards-from-admission experiment with a temporally separated test set consistently and conservatively estimates real-world performance.
Conclusion: Experimental design choices impose bias upon selected cohorts. A forwards-from-admission experiment, validated temporally, can conservatively estimate real-world performance.
Lay Summary: The routine care of patients stands to benefit greatly from assistive technologies, including data-driven risk assessment. Already, many different machine learning and artificial intelligence applications are being developed from complex electronic health record data. To overcome challenges that arise from such data, researchers often start with simple experimental approaches to test their work. One key component is how patients (and their healthcare visits) are selected for the study from the pool of all patients seen. Another is how the group of patients used to create the risk estimator differs from the group used to evaluate how well it works. These choices complicate how the experimental setting compares to the real-world application to patients. For example, selection approaches that depend on each patient’s future outcome can simplify the experiment but are impractical upon implementation, as these data are unavailable. We show that this kind of “backwards” experiment optimistically estimates how well the model performs. Instead, our results advocate for experiments that select patients in a “forwards” manner, with “temporal” validation that approximates training on past data and implementing on future data. More robust results help gauge the clinical utility of recent works and aid decision-making before implementation into practice.
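
A hedged sketch contrasting the two cohort-selection strategies on a toy encounter table. The column names and the one-year window are illustrative, not taken from the paper.

```python
# Backwards-from-outcome vs. forwards-from-admission cohort selection.
import pandas as pd

visits = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2],
    "admit_date": pd.to_datetime(
        ["2015-01-01", "2015-06-01", "2016-03-01", "2015-02-01", "2016-08-01"]),
    "death_or_hospice_date": pd.to_datetime(
        ["2016-04-01"] * 3 + [pd.NaT, pd.NaT]),
})

# Backwards-from-outcome: keep only visits within 1 year of a known
# outcome -- simple, but impossible to reproduce at deployment time.
backwards = visits.dropna(subset=["death_or_hospice_date"])
backwards = backwards[backwards["death_or_hospice_date"]
                      - backwards["admit_date"] <= pd.Timedelta(days=365)]

# Forwards-from-admission: keep every admission and label it by whether
# the outcome occurs within 1 year -- mirrors real-world use.
forwards = visits.copy()
forwards["label"] = (forwards["death_or_hospice_date"]
                     - forwards["admit_date"]) <= pd.Timedelta(days=365)
# Note: rows with no outcome (NaT) compare False, i.e., negative label.
print(len(backwards), len(forwards))
```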


2020
Vol 34 (01)
pp. 1153-1160
Author(s):  
Xinshi Zang ◽  
Huaxiu Yao ◽  
Guanjie Zheng ◽  
Nan Xu ◽  
Kai Xu ◽  
...  

Using reinforcement learning for traffic signal control has attracted increasing interest recently. Various value-based reinforcement learning methods have been proposed for this classical transportation problem and have achieved better performance than traditional transportation methods. However, current reinforcement learning models rely on tremendous amounts of training data and computational resources, which may have bad consequences (e.g., traffic jams or accidents) in the real world. In traffic signal control, some algorithms have been proposed to enable quick learning from scratch, but little attention has been paid to learning by transferring and reusing learned experience. In this paper, we propose a novel framework, named MetaLight, to speed up the learning process in new scenarios by leveraging the knowledge learned from existing scenarios. MetaLight is a value-based meta-reinforcement learning workflow based on the representative gradient-based meta-learning algorithm MAML, which includes periodically alternating individual-level adaptation and global-level adaptation. Moreover, MetaLight improves the state-of-the-art reinforcement learning model FRAP for traffic signal control by optimizing its model structure and updating paradigm. Experiments on four real-world datasets show that MetaLight not only adapts more quickly and stably in new traffic scenarios, but also achieves better performance.
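
A hedged sketch of the gradient-based meta-learning (MAML-style) loop that MetaLight builds on: adapt per scenario with an inner gradient step (individual-level adaptation), then update the shared initialization from post-adaptation losses (global-level adaptation). The linear model and random data are stand-ins for the Q-network and traffic scenarios.

```python
# Minimal second-order MAML loop with a stand-in model and synthetic tasks.
import torch
import torch.nn as nn

model = nn.Linear(8, 4)                      # stand-in for the Q-network
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.01
loss_fn = nn.MSELoss()

def adapted_forward(x, params):
    return torch.nn.functional.linear(x, params[0], params[1])

for step in range(100):
    meta_loss = 0.0
    for _ in range(4):                       # sample 4 scenarios (tasks)
        xs, ys = torch.randn(16, 8), torch.randn(16, 4)   # support set
        xq, yq = torch.randn(16, 8), torch.randn(16, 4)   # query set
        params = [model.weight, model.bias]
        # Inner (individual-level) adaptation: one gradient step.
        grads = torch.autograd.grad(loss_fn(adapted_forward(xs, params), ys),
                                    params, create_graph=True)
        fast = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer (global-level) loss is computed with adapted parameters.
        meta_loss = meta_loss + loss_fn(adapted_forward(xq, fast), yq)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```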


Complexity
2020
Vol 2020
pp. 1-13
Author(s):  
Zhao Li ◽  
Haobo Wang ◽  
Donghui Ding ◽  
Shichang Hu ◽  
Zhen Zhang ◽  
...  

Nowadays, people have an increasing interest in fresh products such as new shoes and cosmetics. To this end, the e-commerce platform Taobao launched a fresh-item hub page in its recommender system, the New Tendency page, on which customers can freely and exclusively explore and purchase fresh items. In this work, we make a first attempt to tackle the fresh-item recommendation task with two major challenges. First, a fresh-item recommendation scenario usually suffers from highly deficient training data due to low page views. We propose a deep interest-shifting network (DisNet), which transfers knowledge from a huge amount of auxiliary data and then shifts user interests with contextual information; three interpretable interest-shifting operators are also introduced. Second, since the items are fresh, many of them have never been exposed to users, leading to a severe cold-start problem. Although this problem can be alleviated by knowledge transfer, we further support these fully cold-start items with a relational meta-Id-embedding generator (RM-IdEG). Specifically, it trains the item id embeddings in a learning-to-learn manner and integrates relational information for better embedding performance. We conducted comprehensive experiments on both synthetic datasets and a real-world dataset. DisNet and RM-IdEG each significantly outperform state-of-the-art approaches. Empirical results clearly verify the effectiveness of the proposed techniques, which are arguably promising and scalable in real-world applications.
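
A hedged sketch of the idea behind a relational meta-embedding generator: synthesize an id embedding for a fully cold-start item from the embeddings of related (e.g., same-category) warm items. The pooling-plus-MLP architecture here is illustrative, not the paper's actual design.

```python
# Generate a cold-start item id embedding from related items' embeddings.
import torch
import torch.nn as nn

class RelationalIdEmbeddingGenerator(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, related_embeddings):
        """related_embeddings: (num_related, dim) embeddings of warm items
        related to the cold item; returns a generated (dim,) id embedding."""
        pooled = related_embeddings.mean(dim=0)   # aggregate relational info
        return self.mlp(pooled)

gen = RelationalIdEmbeddingGenerator()
warm = torch.randn(5, 32)            # embeddings of 5 related warm items
cold_item_embedding = gen(warm)
print(cold_item_embedding.shape)     # torch.Size([32])
```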


Sensors
2020
Vol 20 (9)
pp. 2639
Author(s):  
Quan T. Ngo ◽  
Seokhoon Yoon

Facial expression recognition (FER) is a challenging problem in the fields of pattern recognition and computer vision. The recent success of convolutional neural networks (CNNs) in object detection and object segmentation tasks has shown promise in building an automatic deep CNN-based FER model. However, in real-world scenarios, performance degrades dramatically owing to the great diversity of factors unrelated to facial expressions, and to a lack of training data and an intrinsic imbalance in the existing facial emotion datasets. To tackle these problems, this paper not only applies deep transfer learning techniques, but also proposes a novel loss function called the weighted-cluster loss, which is used during the fine-tuning phase. Specifically, the weighted-cluster loss function simultaneously improves intra-class compactness and inter-class separability by learning a class center for each emotion class. It also takes the imbalance in a facial expression dataset into account by giving each emotion class a weight based on its proportion of the total number of images. In addition, a recent, successful deep CNN architecture, pre-trained on the task of face identification with the VGGFace2 database from the Visual Geometry Group at Oxford University, is employed and fine-tuned using the proposed loss function to recognize eight basic facial emotions from the AffectNet database of facial expression, valence, and arousal computing in the wild. Experiments on the AffectNet real-world facial dataset demonstrate that our method outperforms baseline CNN models that use either weighted-softmax loss or center loss.
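
A hedged sketch of a weighted-cluster loss as described above: a center loss whose per-sample terms are weighted by inverse class frequency, so that rare emotion classes contribute more. The paper's exact weighting scheme may differ; the class counts below are made up.

```python
# Center loss with inverse-frequency class weights for imbalanced data.
import torch
import torch.nn as nn

class WeightedClusterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, class_counts):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        inv = 1.0 / torch.tensor(class_counts, dtype=torch.float)
        self.weights = inv / inv.sum()       # rare classes get larger weight

    def forward(self, features, labels):
        # Pull each feature toward its class center, weighted per class.
        diff = features - self.centers[labels]
        per_sample = (diff ** 2).sum(dim=1)
        return (self.weights[labels] * per_sample).mean()

loss_fn = WeightedClusterLoss(
    num_classes=8, feat_dim=128,
    class_counts=[5000, 300, 800, 4000, 900, 250, 700, 1200])  # made up
feats = torch.randn(32, 128)
labels = torch.randint(0, 8, (32,))
print(loss_fn(feats, labels))
```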


2019
Vol 63 (4)
pp. 283-293
Author(s):  
Jack Weatheritt ◽  
Richard David Sandberg

A novel data-driven turbulence modeling framework is presented and applied to the problem of junction body flow. In particular, a symbolic regression approach is used to find nonlinear analytical expressions of the turbulent stress-strain coupling that are ready for implementation in computational fluid dynamics (CFD) solvers using Reynolds-averaged Navier-Stokes (RANS) closures. Results from baseline linear RANS closure calculations of a finite square-mounted cylinder, at a Reynolds number based on diameter and freestream velocity, are shown to considerably overpredict the separated flow region downstream of the square cylinder, mainly because of the failure of the model to accurately represent the complex vortex structure generated by the junction flow. In the present study, a symbolic regression tool built on a gene expression programming technique is used to find a nonlinear constitutive stress-strain relationship. In short, the algorithm finds the most appropriate linear combination of basis functions and spatially varying coefficients that approximates the turbulent stress tensor from high-fidelity data. Here, the high-fidelity data, or so-called training data, were obtained from a hybrid RANS/Large Eddy Simulation (LES) calculation, also developed with symbolic regression, that showed excellent agreement with direct numerical simulation data. The present study therefore also demonstrates that the training data required for RANS closure development can be obtained using computationally more affordable approaches, such as hybrid RANS/LES. A procedure is presented to evaluate which of the individual basis functions available for model development are most likely to produce a successful nonlinear closure, and a new model is built using only those basis functions. This new model is then tested, i.e., an actual CFD calculation is performed, on the well-known periodic hills case and produces significantly better results than the linear baseline model, despite this test case being fundamentally different from the training case. Finally, the new model is shown to also improve predictive accuracy over traditional linear RANS closures for a surface-mounted cube placed in a channel at a cube-height Reynolds number.
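
A hedged sketch of the kind of tensor-basis constitutive relation such a symbolic regression searches over: the anisotropy tensor expressed as a linear combination of invariant basis tensors built from the normalized strain- and rotation-rate tensors (in the style of Pope's tensor basis). The coefficients below are placeholders, not the evolved model.

```python
# Nonlinear stress-strain closure as a linear combination of basis tensors.
import numpy as np

def anisotropy(S, R, coeffs=(-0.09, 0.02, 0.01)):
    """S, R: normalized 3x3 strain- and rotation-rate tensors.
    Returns a modeled anisotropy tensor; coeffs are placeholders."""
    I = np.eye(3)
    T1 = S                                          # linear (Boussinesq) term
    T2 = S @ R - R @ S                              # first nonlinear basis
    T3 = S @ S - (np.trace(S @ S) / 3.0) * I        # second nonlinear basis
    return coeffs[0] * T1 + coeffs[1] * T2 + coeffs[2] * T3

S = np.array([[0.0, 0.5, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]])
R = np.array([[0.0, 0.5, 0.0], [-0.5, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(anisotropy(S, R))
```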

