Big Data Classification Using Distributed Optimized Hoeffding Trees

2017 ◽  
Vol 2 (1) ◽  
pp. 14-20
Author(s):  
Sharmishta Suhas Desai ◽  
S. T. Patil

Heavy use of social media, online shopping and online transactions gives rise to voluminous data. Visual representation and analysis of this large amount of data is one of the major research topics today. Because these data change over time, we need an approach that addresses the velocity of data as well as its volume and variety. In this paper, the authors propose a distributed method that handles all three dimensions of data and gives good results compared with other methods. Traditional algorithms are based on global optima and are essentially memory-resident programs. Our approach, based on an optimized Hoeffding bound, uses a local-optima method and a distributed map-reduce architecture; it does not require copying the whole data set into memory. Because the model being built is updated frequently on multiple nodes concurrently, it is well suited to time-varying data. The Hoeffding bound is particularly suitable for real-time data streams. We propose an efficient distributed map-reduce architecture for implementing the Hoeffding tree, and we use deep learning at the leaf level to optimize the tree. Drift detection is handled by the architecture itself, so no separate provision is required. Experimental results show that our method requires less learning time while achieving higher accuracy. A distributed algorithm for Hoeffding tree implementation is also presented.
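The Hoeffding bound that drives split decisions in such trees has a simple closed form. The sketch below is illustrative only (function names and the tie-threshold heuristic are assumptions, not the authors' distributed implementation): it shows how a leaf decides whether it has seen enough examples to commit to a split.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the observed mean of n
    i.i.d. samples of a variable with range `value_range` lies within
    epsilon of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, value_range, delta, n, tie_threshold=0.05):
    """Split once the gain gap between the two best attributes exceeds the
    bound, or once the bound is so small the choice no longer matters."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain > eps) or (eps < tie_threshold)
```

Because the bound shrinks as `n` grows, each leaf needs only a modest sample to make a statistically safe split decision, which is what makes the tree suitable for unbounded streams.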

Author(s):  
Tomas Grönstedt ◽  
Markus Wallin

Recent work on gas turbine diagnostics based on optimisation techniques advocates two different approaches: 1) stochastic optimisation, including Genetic Algorithm techniques, for its robustness when optimising objective functions with many local optima, and 2) gradient-based methods, mainly for their computational efficiency. For smooth, single-optimum functions, gradient methods are known to provide superior numerical performance. This paper addresses the key issue for method selection, i.e. whether multiple local optima may occur when the optimisation approach is applied to real engine testing. Two performance test data sets for the RM12 low bypass ratio turbofan engine, powering the Swedish Fighter Gripen, have been analysed. One set of data was recorded during performance testing of a highly degraded engine. This engine had been subjected to Accelerated Mission Testing (AMT) cycles corresponding to more than 4000 hours of run time. The other data set was recorded for a development engine with less than 200 hours of operation. The search for multiple optima was performed starting from more than 100 extreme points. Not a single case of multi-modality was encountered, i.e. one unique solution was consistently obtained for each of the two data sets. The RM12 engine cycle is typical of a modern fighter engine, implying that the obtained results can be transferred to, at least, most low bypass ratio turbofan engines. The paper goes on to describe the numerical difficulties that had to be resolved to obtain efficient and robust performance from the gradient solvers. Ill-conditioning and noise may, as illustrated on a model problem, introduce local optima without a correspondence in the gas turbine physics. Numerical methods exploiting the special problem structure represented by a non-linear least squares formulation are given special attention. Finally, a mixed norm allowing for both robustness and numerical efficiency is suggested.
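The structure-exploiting approach the paper favours — treating the matching problem as non-linear least squares rather than generic optimisation — can be sketched with a Gauss-Newton iteration. The residual below is a toy linear stand-in for illustration only, not the RM12 performance model.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20):
    """Minimise 0.5 * ||r(x)||^2 by solving the linearised normal equations
    J^T J dx = -J^T r at each step, exploiting the least-squares structure
    instead of treating the objective as a black box."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)
        J = jacobian(x)
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
    return x

# Toy health-parameter estimation: fit x so that a linear model A @ x
# matches measurements y (both hypothetical placeholders).
y = np.array([2.0, 4.0, 6.0])
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
residual = lambda x: A @ x - y
jacobian = lambda x: A
x_hat = gauss_newton(residual, jacobian, np.zeros(2))
```

On a genuinely non-linear engine model the Jacobian would be re-evaluated at each iterate; the near-quadratic local convergence of this scheme is what gives gradient-type solvers their efficiency edge when the objective is unimodal.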


2018 ◽  
Vol 34 (3) ◽  
pp. 1247-1266 ◽  
Author(s):  
Hua Kang ◽  
Henry V. Burton ◽  
Haoxiang Miao

Post-earthquake recovery models can be used as decision support tools for pre-event planning. However, due to a lack of available data, there have been very few opportunities to validate and/or calibrate these models. This paper describes the use of building damage, permitting, and repair data from the 2014 South Napa Earthquake to evaluate a stochastic process post-earthquake recovery model. Damage data were obtained for 1,470 buildings, and permitting and repair time data were obtained for a subset (456) of those buildings. A “blind” prediction is shown to adequately capture the shape of the recovery trajectory despite overpredicting the overall pace of the recovery. Using the mean time to permit and repair time from the acquired data set significantly improves the accuracy of the recovery prediction. A generalized model is formulated by establishing statistical relationships between key time parameters and endogenous and exogenous factors that have been shown to influence the pace of recovery.
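A stochastic-process recovery model of the kind evaluated here can be illustrated with a minimal Monte Carlo sketch: each damaged building draws a permitting time and then a repair time, and the recovery trajectory is the fraction of buildings restored over time. The exponential distributions and all parameter values below are assumptions for illustration, not the paper's fitted model.

```python
import random

def recovery_trajectory(n_buildings, mean_permit, mean_repair, horizon, step=1.0):
    """Monte Carlo recovery curve: each building finishes after an
    exponential permitting delay plus an exponential repair duration.
    Returns (times, fraction_recovered_at_each_time)."""
    finish = [random.expovariate(1.0 / mean_permit) + random.expovariate(1.0 / mean_repair)
              for _ in range(n_buildings)]
    times = [i * step for i in range(int(horizon / step) + 1)]
    return times, [sum(f <= t for f in finish) / n_buildings for t in times]
```

Calibrating the mean permit and repair times against observed data, as the paper does with the South Napa records, shifts this curve without changing its characteristic S-shape.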


2021 ◽  
Vol 77 (1) ◽  
pp. 19-27
Author(s):  
Hamish Todd ◽  
Paul Emsley

Biological macromolecules have complex three-dimensional shapes that are experimentally examined using X-ray crystallography and electron cryo-microscopy. Interpreting the data that these methods yield involves building 3D atomic models. With almost every data set, some portion of the time put into creating these models must be spent manually modifying the model in order to make it consistent with the data; this is difficult and time-consuming, in part because the data are 'blurry' in three dimensions. This paper describes the design and assessment of CootVR (available at http://hamishtodd1.github.io/cvr), a prototype computer program for performing this task in virtual reality, allowing structural biologists to build molecular models into cryo-EM and crystallographic data using their hands. CootVR was timed against Coot for a very specific model-building task, and was found to give an order-of-magnitude speedup for this task. A from-scratch model build using CootVR was also attempted; from this experience it is concluded that currently CootVR does not give a speedup over Coot overall.


Author(s):  
Rizwan Patan ◽  
Rajasekhara Babu M ◽  
Suresh Kallam

A Big Data Stream Computing (BDSC) platform handles real-time data from applications such as risk management, marketing management and business intelligence. Nowadays, Internet of Things (IoT) deployments are increasing massively in all areas, and these IoT devices generate real-time data for analysis. Existing BDSC platforms are inefficient at handling real-time data streams from IoT devices because such streams are unstructured and have inconstant velocity, which makes them challenging to process. This work proposes a framework that handles real-time data streams through device-control techniques to improve performance. The framework comprises three layers. The first layer deals with Big Data platforms that handle real-time data streams based on area of importance. The second layer is the performance layer, which addresses issues such as low response time and energy efficiency. The third layer applies the developed method to an existing BDSC platform. Experimental results show a performance improvement of 20%-30% for real-time data streams from IoT applications.


Author(s):  
Prasanna Lakshmi Kompalli

Data coming from different sources is referred to as a data stream. Data stream mining is an online learning technique in which each data point must be processed as it arrives and discarded once processing is complete. Advances in technology have made it possible to monitor these data streams in real time, and doing so has created many new challenges for researchers. The main features of this type of data are that it is fast flowing, continuous, large and growing in nature, and that its characteristics may change over time, a phenomenon termed concept drift. This chapter addresses the problems of mining data streams with concept drift. Because the relevant literature is scattered, isolating it can be a grueling task for researchers and practitioners; this chapter aims to provide a solution by consolidating the techniques used for data stream mining under concept drift.
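A common way to handle concept drift is to monitor the online learner's error rate and signal when it rises significantly above its historical minimum. The sketch below follows the spirit of the DDM detector of Gama et al.; the interface and the three-sigma drift level are simplified assumptions, not a specific technique from the chapter.

```python
import math

class DriftDetector:
    """DDM-style drift monitor: track the running error rate p and its
    standard deviation s, remember the minimum p + s observed so far, and
    signal drift when p + s climbs well above that minimum."""

    def __init__(self, drift_level=3.0):
        self.n = 0
        self.p = 0.0                   # running error rate
        self.min_p = self.min_s = 0.0
        self.min_p_s = float("inf")
        self.drift_level = drift_level

    def update(self, error):
        """Feed 1 for a misclassification, 0 for a correct prediction.
        Returns True when drift is signalled."""
        self.n += 1
        self.p += (error - self.p) / self.n           # incremental mean
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.p + s < self.min_p_s:                 # new best concept
            self.min_p_s = self.p + s
            self.min_p, self.min_s = self.p, s
        return self.p + s > self.min_p + self.drift_level * self.min_s
```

On a drift signal the stream miner would typically discard or rebuild its model from recent examples, since the old concept no longer describes the incoming data.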


Author(s):  
Trupti Vishwambhar Kenekar ◽  
Ajay R. Dani

As Big Data is a collection of structured, unstructured and semi-structured data gathered from various sources, it is important to mine it while providing privacy for individual data. Differential privacy is one of the best measures, as it provides a strong privacy guarantee. The chapter proposes differentially private frequent itemset mining using MapReduce, which requires less time for privately mining large datasets. The chapter discusses the problem of preserving data privacy, the challenges of preserving data privacy in a big data environment, data privacy techniques and their applications to unstructured data. Analyses of experimental results on structured and unstructured data sets are also presented.
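The differential privacy guarantee is typically obtained by adding calibrated noise to query results. The classic Laplace mechanism applied to itemset support counts can be sketched as follows; the itemsets and parameters are hypothetical, and this is the textbook mechanism rather than the chapter's exact MapReduce algorithm.

```python
import math
import random

def laplace_noise(scale):
    """Inverse-CDF sample from the Laplace(0, scale) distribution."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_counts(counts, epsilon, sensitivity=1.0):
    """Release support counts under epsilon-differential privacy. Adding or
    removing one transaction changes each count by at most `sensitivity`,
    so Laplace(sensitivity / epsilon) noise per count suffices."""
    scale = sensitivity / epsilon
    return {itemset: count + laplace_noise(scale) for itemset, count in counts.items()}

# Hypothetical supports from a frequent-itemset mining pass.
supports = {("bread",): 420, ("bread", "butter"): 310}
noisy = private_counts(supports, epsilon=0.5)
```

A smaller epsilon gives a stronger privacy guarantee but noisier counts, which is the utility trade-off the chapter's experiments on structured and unstructured data are measuring.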

