vc dimension Latest Research Papers

<p>Symbolic regression (SR) is a function identification process, the task of which is to identify and express the relationship between the input and output variables in mathematical models. SR is named to emphasise its ability to find the structure and coefficients of the model simultaneously. Genetic Programming (GP) is an attractive and powerful technique for SR, since it does not require any predefined model and has a flexible representation. However, GP based SR generally has a poor generalisation ability which degrades its reliability and hampers its applications to science and real-world modeling. Therefore, this thesis aims to develop new GP approaches to SR that evolve/learn models exhibiting good generalisation ability. This thesis develops a novel feature selection method in GP for high-dimensional SR. Feature selection can potentially contribute not only to improving the efficiency of learning algorithms but also to enhancing the generalisation ability. However, feature selection is seldom considered in GP for high-dimensional SR. The proposed new feature selection method utilises GP’s built-in feature selection ability and relies on permutation to detect the truly relevant features and discard irrelevant/noisy features. The results confirm the superiority of the proposed method over the other examined feature selection methods including random forests and decision trees on identifying the truly relevant features. Further analysis indicates that the models evolved by GP with the proposed feature selection method are more likely to contain only the truly relevant features and have better interpretability. To address the overfitting issue of GP when learning from a relatively small number of instances, this thesis proposes a new GP approach by incorporating structural risk minimisation (SRM), which is a framework to estimate the generalisation performance of models, into GP. The effectiveness of SRM highly depends on the accuracy of the Vapnik-Chervonenkis (VC) dimension measuring model complexity. This thesis significantly extends an experimental method (instead of theoretical estimation) to measure the VC-dimension of a mixture of linear and nonlinear regression models in GP for the first time. The experimental method has been conducted using uniform and non-uniform settings and provides reliable VC-dimension values. The results show that our methods have an impressively better generalisation gain and evolve more compact model, which have a much smaller behavioural difference from the target models than standard GP and GP with bootstrap, The proposed method using the optimised non-uniform setting further improves the one using the uniform setting. This thesis employs geometric semantic GP (GSGP) to tackle the unsatisfied generalisation performance of GP for SR when no overfitting occurs. It proposes three new angle-awareness driven geometric semantic operators (GSO) including selection, crossover and mutation to further explore the geometry of the semantic space to gain a greater generalisation improvement in GP for SR. The angle-awareness brings new geometric properties to these geometric operators, which are expected to provide a greater leverage for approximating the target semantics in each operation, and more importantly, to be resistant to overfitting. The results show that compared with two kinds of state-of-the-art GSOs, the proposed new GSOs not only drive the evolutionary process fitting the target semantics more efficiently but also significantly improve the generalisation performance. A further comparison on the evolved models shows that the new method generally produces simpler models with a much smaller size and containing important building blocks of the target models.</p>

Download Full-text

Improving the Generalisation of Genetic Programming for Symbolic Regression

10.26686/wgtn.17068166 ◽

2021 ◽

Author(s):

◽

Qi Chen

Keyword(s):

Feature Selection ◽

Genetic Programming ◽

Experimental Method ◽

Feature Selection Method ◽

Selection Method ◽

Symbolic Regression ◽

Model Complexity ◽

High Dimensional ◽

Vc Dimension ◽

Generalisation Ability

<p>Symbolic regression (SR) is a function identification process, the task of which is to identify and express the relationship between the input and output variables in mathematical models. SR is named to emphasise its ability to find the structure and coefficients of the model simultaneously. Genetic Programming (GP) is an attractive and powerful technique for SR, since it does not require any predefined model and has a flexible representation. However, GP based SR generally has a poor generalisation ability which degrades its reliability and hampers its applications to science and real-world modeling. Therefore, this thesis aims to develop new GP approaches to SR that evolve/learn models exhibiting good generalisation ability. This thesis develops a novel feature selection method in GP for high-dimensional SR. Feature selection can potentially contribute not only to improving the efficiency of learning algorithms but also to enhancing the generalisation ability. However, feature selection is seldom considered in GP for high-dimensional SR. The proposed new feature selection method utilises GP’s built-in feature selection ability and relies on permutation to detect the truly relevant features and discard irrelevant/noisy features. The results confirm the superiority of the proposed method over the other examined feature selection methods including random forests and decision trees on identifying the truly relevant features. Further analysis indicates that the models evolved by GP with the proposed feature selection method are more likely to contain only the truly relevant features and have better interpretability. To address the overfitting issue of GP when learning from a relatively small number of instances, this thesis proposes a new GP approach by incorporating structural risk minimisation (SRM), which is a framework to estimate the generalisation performance of models, into GP. The effectiveness of SRM highly depends on the accuracy of the Vapnik-Chervonenkis (VC) dimension measuring model complexity. This thesis significantly extends an experimental method (instead of theoretical estimation) to measure the VC-dimension of a mixture of linear and nonlinear regression models in GP for the first time. The experimental method has been conducted using uniform and non-uniform settings and provides reliable VC-dimension values. The results show that our methods have an impressively better generalisation gain and evolve more compact model, which have a much smaller behavioural difference from the target models than standard GP and GP with bootstrap, The proposed method using the optimised non-uniform setting further improves the one using the uniform setting. This thesis employs geometric semantic GP (GSGP) to tackle the unsatisfied generalisation performance of GP for SR when no overfitting occurs. It proposes three new angle-awareness driven geometric semantic operators (GSO) including selection, crossover and mutation to further explore the geometry of the semantic space to gain a greater generalisation improvement in GP for SR. The angle-awareness brings new geometric properties to these geometric operators, which are expected to provide a greater leverage for approximating the target semantics in each operation, and more importantly, to be resistant to overfitting. The results show that compared with two kinds of state-of-the-art GSOs, the proposed new GSOs not only drive the evolutionary process fitting the target semantics more efficiently but also significantly improve the generalisation performance. A further comparison on the evolved models shows that the new method generally produces simpler models with a much smaller size and containing important building blocks of the target models.</p>

Download Full-text

Bounded VC-Dimension Implies the Schur-Erdős Conjecture

COMBINATORICA ◽

10.1007/s00493-021-4530-9 ◽

2021 ◽

Author(s):

Jacob Fox ◽

János Pach ◽

Andrew Suk

Keyword(s):

Vc Dimension ◽

Erdös Conjecture

Download Full-text

Fast Diameter Computation within Split Graphs

Discrete Mathematics & Theoretical Computer Science ◽

10.46298/dmtcs.6422 ◽

2021 ◽

Vol vol. 23, no. 3 (Graph Theory) ◽

Author(s):

Guillaume Ducoffe ◽

Michel Habib ◽

Laurent Viennot

Keyword(s):

Natural Variation ◽

Linear Time ◽

Interval Number ◽

Stable Set ◽

Vc Dimension ◽

Graph Invariants ◽

Split Graph ◽

Bounded Treewidth ◽

Split Graphs ◽

Stabbing Number

When can we compute the diameter of a graph in quasi linear time? We address this question for the class of {\em split graphs}, that we observe to be the hardest instances for deciding whether the diameter is at most two. We stress that although the diameter of a non-complete split graph can only be either $2$ or $3$, under the Strong Exponential-Time Hypothesis (SETH) we cannot compute the diameter of an $n$-vertex $m$-edge split graph in less than quadratic time -- in the size $n+m$ of the input. Therefore it is worth to study the complexity of diameter computation on {\em subclasses} of split graphs, in order to better understand the complexity border. Specifically, we consider the split graphs with bounded {\em clique-interval number} and their complements, with the former being a natural variation of the concept of interval number for split graphs that we introduce in this paper. We first discuss the relations between the clique-interval number and other graph invariants such as the classic interval number of graphs, the treewidth, the {\em VC-dimension} and the {\em stabbing number} of a related hypergraph. Then, in part based on these above relations, we almost completely settle the complexity of diameter computation on these subclasses of split graphs: - For the $k$-clique-interval split graphs, we can compute their diameter in truly subquadratic time if $k={\cal O}(1)$, and even in quasi linear time if $k=o(\log{n})$ and in addition a corresponding ordering of the vertices in the clique is given. However, under SETH this cannot be done in truly subquadratic time for any $k = \omega(\log{n})$. - For the {\em complements} of $k$-clique-interval split graphs, we can compute their diameter in truly subquadratic time if $k={\cal O}(1)$, and even in time ${\cal O}(km)$ if a corresponding ordering of the vertices in the stable set is given. Again this latter result is optimal under SETH up to polylogarithmic factors. Our findings raise the question whether a $k$-clique interval ordering can always be computed in quasi linear time. We prove that it is the case for $k=1$ and for some subclasses such as bounded-treewidth split graphs, threshold graphs and comparability split graphs. Finally, we prove that some important subclasses of split graphs -- including the ones mentioned above -- have a bounded clique-interval number.

Download Full-text

Coresets for the Average Case Error for Finite Query Sets

Sensors ◽

10.3390/s21196689 ◽

2021 ◽

Vol 21 (19) ◽

pp. 6689

Author(s):

Alaa Maalouf ◽

Ibrahim Jubran ◽

Murad Tukan ◽

Dan Feldman

Keyword(s):

Approximation Error ◽

Principal Component ◽

Average Error ◽

Vc Dimension ◽

Worst Case ◽

Average Case ◽

Input Size ◽

Main Technique ◽

Finite Set ◽

The Given

Coreset is usually a small weighted subset of an input set of items, that provably approximates their loss function for a given set of queries (models, classifiers, hypothesis). That is, the maximum (worst-case) error over all queries is bounded. To obtain smaller coresets, we suggest a natural relaxation: coresets whose average error over the given set of queries is bounded. We provide both deterministic and randomized (generic) algorithms for computing such a coreset for any finite set of queries. Unlike most corresponding coresets for the worst-case error, the size of the coreset in this work is independent of both the input size and its Vapnik–Chervonenkis (VC) dimension. The main technique is to reduce the average-case coreset into the vector summarization problem, where the goal is to compute a weighted subset of the n input vectors which approximates their sum. We then suggest the first algorithm for computing this weighted subset in time that is linear in the input size, for n≫1/ε, where ε is the approximation error, improving, e.g., both [ICML’17] and applications for principal component analysis (PCA) [NIPS’16]. Experimental results show significant and consistent improvement also in practice. Open source code is provided.

Download Full-text

On the VC-dimension of half-spaces with respect to convex sets

Discrete Mathematics & Theoretical Computer Science ◽

10.46298/dmtcs.6631 ◽

2021 ◽

Vol vol. 23, no. 3 (Combinatorics) ◽

Author(s):

Nicolas Grelier ◽

Saeed Gh. Ilchi ◽

Tillmann Miltzow ◽

Shakhar Smorodinsky

Keyword(s):

Pairwise Disjoint ◽

Convex Sets ◽

Range Query ◽

Set Cover ◽

Vc Dimension ◽

Set Cover Problem ◽

Structure Problem ◽

Cover Problem ◽

Geometric Set Cover ◽

Disjoint Convex

A family S of convex sets in the plane defines a hypergraph H = (S, E) as follows. Every subfamily S' of S defines a hyperedge of H if and only if there exists a halfspace h that fully contains S' , and no other set of S is fully contained in h. In this case, we say that h realizes S'. We say a set S is shattered, if all its subsets are realized. The VC-dimension of a hypergraph H is the size of the largest shattered set. We show that the VC-dimension for pairwise disjoint convex sets in the plane is bounded by 3, and this is tight. In contrast, we show the VC-dimension of convex sets in the plane (not necessarily disjoint) is unbounded. We provide a quadratic lower bound in the number of pairs of intersecting sets in a shattered family of convex sets in the plane. We also show that the VC-dimension is unbounded for pairwise disjoint convex sets in R^d , for d > 2. We focus on, possibly intersecting, segments in the plane and determine that the VC-dimension is always at most 5. And this is tight, as we construct a set of five segments that can be shattered. We give two exemplary applications. One for a geometric set cover problem and one for a range-query data structure problem, to motivate our findings.

Download Full-text

The VC Dimension of Metric Balls under Fréchet and Hausdorff Distances

Discrete & Computational Geometry ◽

10.1007/s00454-021-00318-z ◽

2021 ◽

Author(s):

Anne Driemel ◽

André Nusser ◽

Jeff M. Phillips ◽

Ioannis Psarros

Keyword(s):

Density Estimation ◽

Lower Bounds ◽

Upper And Lower Bounds ◽

Similarity Metrics ◽

Vc Dimension ◽

Set Systems ◽

Polygonal Curves ◽

The Individual ◽

Curve Similarity ◽

Metric Balls

AbstractThe Vapnik–Chervonenkis dimension provides a notion of complexity for systems of sets. If the VC dimension is small, then knowing this can drastically simplify fundamental computational tasks such as classification, range counting, and density estimation through the use of sampling bounds. We analyze set systems where the ground set X is a set of polygonal curves in $$\mathbb {R}^d$$ R d and the sets $$\mathcal {R}$$ R are metric balls defined by curve similarity metrics, such as the Fréchet distance and the Hausdorff distance, as well as their discrete counterparts. We derive upper and lower bounds on the VC dimension that imply useful sampling bounds in the setting that the number of curves is large, but the complexity of the individual curves is small. Our upper and lower bounds are either near-quadratic or near-linear in the complexity of the curves that define the ranges and they are logarithmic in the complexity of the curves that define the ground set.

Download Full-text