Confidence Decision Trees via Online and Active Learning for Streaming Data

2017 ◽  
Vol 60 ◽  
pp. 1031-1055 ◽  
Author(s):  
Rocco De Rosa ◽  
Nicolò Cesa-Bianchi

Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enough new information has been collected to justify splitting a leaf. Although some of the issues in the statistical analysis of Hoeffding trees have already been clarified, a general and rigorous study of confidence intervals for splitting criteria is missing. We fill this gap by deriving accurate confidence intervals to estimate the splitting gain in decision tree learning with respect to three criteria: entropy, Gini index, and a third index proposed by Kearns and Mansour. We also extend our confidence analysis to a selective sampling setting, in which the decision tree learner adaptively decides which labels to query in the stream. We provide theoretical guarantees bounding the probability that the decision tree learned via our selective sampling strategy suboptimally classifies the next example in the stream. Experiments on real and synthetic data in a streaming setting show that our trees are indeed more accurate than trees with the same number of leaves generated by state-of-the-art techniques. In addition, our active learning module empirically uses fewer labels without significantly hurting the performance.
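The confidence-interval split test mentioned above can be illustrated with the classic Hoeffding-tree rule (a minimal sketch of the standard Hoeffding bound, not the refined intervals this paper derives; the function names and the example numbers are illustrative assumptions):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: with probability at least 1 - delta, the true mean
    of a variable with the given range lies within this epsilon of the
    empirical mean after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split the leaf when the observed gain advantage of the best
    attribute over the runner-up exceeds the confidence-interval width."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)

# Example: after 1000 examples, binary-entropy gains have range 1.0;
# should_split(0.30, 0.18, 1.0, 1e-6, 1000) -> True (0.12 > eps ~= 0.083)
```

The paper's contribution is precisely to replace this generic bound with intervals tailored to the entropy, Gini, and Kearns-Mansour splitting criteria.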

2021 ◽  
Author(s):  
Jaspreet Kaur Bassan

This work proposes a technique for classifying unlabelled streaming data using grammar-based immune programming, a hybrid meta-heuristic where the space of grammar generated solutions is searched by an artificial immune system inspired algorithm. Data is labelled using an active learning technique and is buffered until the system trains adequately on the labelled data. The system is employed in static and in streaming data environments, and is tested and evaluated using synthetic and real-world data. The performances of the system employed in different data settings are compared with each other and with two benchmark problems. The proposed classification system adapted well to the changing nature of streaming data and the active learning technique made the process less computationally expensive by retaining only those instances which favoured the training process.
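The labelling-and-buffering loop described above can be sketched with a generic uncertainty-sampling rule (an illustrative assumption, not the paper's grammar-based immune system; all function names and the confidence threshold are hypothetical):

```python
def process_stream(stream, predict_proba, query_label, buffer,
                   threshold=0.7):
    """Active-learning sketch: query the true label only when the current
    model is unsure about an instance, and buffer the labelled instances
    until enough accumulate for the next training round."""
    for x in stream:
        confidence = max(predict_proba(x))  # top class probability
        if confidence < threshold:
            buffer.append((x, query_label(x)))  # retain informative cases
    return buffer
```

Retaining only low-confidence instances is what keeps the training process less computationally expensive, as the abstract notes.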


2020 ◽  
Vol 68 (3) ◽  
pp. 949-964
Author(s):  
Dimitris Bertsimas ◽  
Bradley Sturt

The bootstrap method is one of the major developments in statistics in the 20th century for computing confidence intervals directly from data. However, the bootstrap method is traditionally approximated with a randomized algorithm, which can sometimes produce inaccurate confidence intervals. In "Computation of Exact Bootstrap Confidence Intervals: Complexity and Deterministic Algorithms," Bertsimas and Sturt present a new perspective on the bootstrap method through the lens of counting integer points in a polyhedron. Through this perspective, the authors develop the first computational complexity results and an efficient deterministic approximation algorithm (a fully polynomial time approximation scheme) for bootstrap confidence intervals, which, unlike traditional methods, has guaranteed bounds on its error. In experiments on real and synthetic data sets from clinical trials, the proposed deterministic algorithms quickly produce reliable confidence intervals, which are significantly more accurate than those from randomization.
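For context, the traditional randomized approximation that the exact method replaces can be sketched as follows (a minimal percentile-bootstrap sketch; the data values and parameter choices are illustrative assumptions):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=10000,
                 alpha=0.05, seed=0):
    """Classical randomized bootstrap: resample with replacement, compute
    the statistic on each resample, and take empirical percentiles.
    Being Monte Carlo, its endpoints vary from run to run -- the
    inaccuracy the exact deterministic method avoids."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci([2.1, 2.4, 1.9, 2.6, 2.2, 2.0, 2.3, 2.5])
```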


Geophysics ◽  
1997 ◽  
Vol 62 (6) ◽  
pp. 1804-1811 ◽  
Author(s):  
Qingbo Liao ◽  
George A. McMechan

The centroid frequency shift method is implemented, tested with synthetic data, and applied to field data from three contiguous crosswell seismic experiments at the Gypsy Pilot in northern Oklahoma. The simultaneous iterative reconstruction technique is used for tomographic estimations of both P‐wave velocity and Q. No amplitude corrections or spreading loss corrections are needed for the Q estimation. The estimated in‐situ velocity and Q distributions correlate well with log data and local lithology. The Q/velocity ratio appears to correlate with the sand/shale ratio (ranging from an average of ∼15 s/km for the sand‐dominated lithologies to an average of ∼8.5 s/km for the shale‐dominated ones), with the result that new information is provided on interwell connectivity.
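The centroid frequency shift idea can be sketched as follows: attenuation preferentially removes high frequencies, lowering the spectral centroid, and for a Gaussian-shaped source spectrum the downshift relates to Q as Q = pi * t * sigma^2 / (f_source - f_received). This is a minimal sketch under that Gaussian-spectrum assumption; the function names and example numbers are illustrative, not the paper's implementation:

```python
import numpy as np

def spectral_centroid(signal, dt):
    """Centroid frequency and spectral variance of a signal's
    amplitude spectrum."""
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), dt)
    fc = np.sum(freqs * spec) / np.sum(spec)
    var = np.sum((freqs - fc) ** 2 * spec) / np.sum(spec)
    return fc, var

def estimate_q(fc_source, var_source, fc_received, traveltime):
    """Q from the centroid downshift (Gaussian source spectrum):
    the centroid drops by pi * t * var / Q over traveltime t."""
    return np.pi * traveltime * var_source / (fc_source - fc_received)

# Example: 100 Hz centroid, variance 400 Hz^2, received centroid 90 Hz,
# traveltime 0.1 s -> Q = pi * 0.1 * 400 / 10 ~= 12.57
```

Because the estimate depends only on spectral shape, not absolute amplitude, no amplitude or spreading-loss corrections are needed, which is the practical advantage the abstract highlights.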


2011 ◽  
Vol 18 (2) ◽  
pp. 153-173
Author(s):  
Dittaya Wanvarie ◽  
Hiroya Takamura ◽  
Manabu Okumura

2015 ◽  
Vol 9 (1) ◽  
pp. 927-973
Author(s):  
G. van der Wel ◽  
H. Fischer ◽  
H. Oerter ◽  
H. Meyer ◽  
H. A. J. Meijer

Abstract. Paleoclimatic information can be retrieved from the diffusion of the stable water isotope signal during firnification of snow. The diffusion length, a measure for the amount of diffusion a layer has experienced, depends on the firn temperature and the accumulation rate. We show that the estimation of the diffusion length using Power Spectral Densities (PSD) of the record of a single isotope species can be biased and is therefore not a reliable proxy for past temperature reconstruction. Using a second water isotope and calculating the difference in diffusion lengths between the two isotopes, this problem is circumvented. We study the PSD method applied to two isotopes in detail and additionally present a new forward diffusion method for retrieving the differential diffusion length based on the Pearson correlation between the two isotope signals. The two methods are discussed and extensively tested on synthetic data generated in a Monte Carlo manner. We show that calibration of the PSD method with this synthetic data is necessary to objectively determine the differential diffusion length. The correlation-based method proves to be a good alternative to the PSD method, as it yields equal or somewhat higher precision than the PSD method. The use of synthetic data also allows us to estimate the accuracy and precision of the two methods and to choose the best sampling strategy to obtain past temperatures with the required precision. In addition to the application to synthetic data, the two methods are tested on stable isotope records from the EPICA ice core drilled in Dronning Maud Land, Antarctica, showing that reliable firn temperatures can be reconstructed with a typical uncertainty of 1.5 and 2 °C for the Holocene period and 2 and 2.5 °C for the last glacial period for the correlation and PSD method, respectively.
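The forward-diffusion, correlation-based idea can be sketched as follows: diffuse the less-diffused isotope record by trial diffusion lengths (Gaussian smoothing in depth) and pick the trial value that maximizes the Pearson correlation with the more-diffused record. This is a minimal sketch of that principle only; the function names, diffusion lengths, and synthetic records are illustrative assumptions, not the paper's firn model:

```python
import numpy as np

def gaussian_diffuse(record, sigma, dz):
    """Forward-diffuse a depth series by convolution with a Gaussian of
    standard deviation sigma (the diffusion length), sample spacing dz."""
    n = len(record)
    z = (np.arange(n) - n // 2) * dz
    kernel = np.exp(-0.5 * (z / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(record, kernel, mode="same")

def differential_diffusion_length(rec_a, rec_b, dz, sigmas):
    """Correlation-based estimate (sketch): diffuse the less-diffused
    record rec_a by each trial sigma and return the sigma that maximizes
    the Pearson correlation with the more-diffused record rec_b."""
    corrs = [np.corrcoef(gaussian_diffuse(rec_a, s, dz), rec_b)[0, 1]
             for s in sigmas]
    return sigmas[int(np.argmax(corrs))]

# Synthetic check: rec_b is rec_a further diffused by sigma = 0.08 m
rng = np.random.default_rng(0)
rec_a = gaussian_diffuse(rng.standard_normal(2000), 0.05, 0.01)
rec_b = gaussian_diffuse(rec_a, 0.08, 0.01)
trial = np.arange(0.02, 0.15, 0.01)
```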

