Big Data Classification Using Distributed Optimized Hoeffding Trees

2017 ◽  
Vol 2 (1) ◽  
pp. 14-20
Author(s):  
Sharmishta Suhas Desai ◽  
S. T. Patil

Heavy use of social media, online shopping and online transactions gives rise to voluminous data. Visual representation and analysis of this large amount of data is one of the major research topics today. Because these data change over time, we need an approach that addresses the velocity of data as well as its volume and variety. In this paper, the authors propose a distributed method that handles all three dimensions of data and gives good results compared with other methods. Traditional algorithms are based on global optima and are essentially memory-resident programs. Our approach, based on an optimized Hoeffding bound, uses a local-optima method and a distributed map-reduce architecture; it does not require copying the whole data set into memory. Because the model being built is updated frequently on multiple nodes concurrently, it is well suited to time-varying data. The Hoeffding bound is particularly suitable for real-time data streams. We propose an efficient distributed map-reduce architecture for implementing the Hoeffding tree, and we use deep learning at the leaf level to optimize the tree. Drift detection is handled by the architecture itself, so no separate provision is required. Experimental results show that our method requires less learning time while achieving higher accuracy. A distributed algorithm for Hoeffding tree implementation is also presented.
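The Hoeffding bound that drives split decisions in such trees has a simple closed form. The sketch below is illustrative only (function names and the tie-threshold heuristic are assumptions, not the authors' distributed implementation): it shows how a leaf decides whether it has seen enough examples to commit to a split.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the observed mean of n
    i.i.d. samples of a variable with range `value_range` lies within
    epsilon of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, value_range, delta, n, tie_threshold=0.05):
    """Split once the gain gap between the two best attributes exceeds the
    bound, or once the bound is so small the choice no longer matters."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain > eps) or (eps < tie_threshold)
```

Because the bound shrinks as `n` grows, each leaf needs only a modest sample to make a statistically safe split decision, which is what makes the tree suitable for unbounded streams.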

Author(s):  
Tomas Grönstedt ◽  
Markus Wallin

Recent work on gas turbine diagnostics based on optimisation techniques advocates two different approaches: 1) stochastic optimisation, including Genetic Algorithm techniques, for its robustness when optimising objective functions with many local optima, and 2) gradient-based methods, mainly for their computational efficiency. For smooth, single-optimum functions, gradient methods are known to provide superior numerical performance. This paper addresses the key issue for method selection, i.e. whether multiple local optima may occur when the optimisation approach is applied to real engine testing. Two performance test data sets for the RM12 low bypass ratio turbofan engine, powering the Swedish Fighter Gripen, have been analysed. One set of data was recorded during performance testing of a highly degraded engine. This engine had been subjected to Accelerated Mission Testing (AMT) cycles corresponding to more than 4000 hours of run time. The other data set was recorded for a development engine with less than 200 hours of operation. The search for multiple optima was performed starting from more than 100 extreme points. Not a single case of multi-modality was encountered, i.e. one unique solution was consistently obtained for each of the two data sets. The RM12 engine cycle is typical of a modern fighter engine, implying that the obtained results can be transferred to, at least, most low bypass ratio turbofan engines. The paper goes on to describe the numerical difficulties that had to be resolved to obtain efficient and robust performance from the gradient solvers. Ill-conditioning and noise may, as illustrated on a model problem, introduce local optima without a correspondence in the gas turbine physics. Numerical methods exploiting the special problem structure represented by a non-linear least squares formulation are given special attention. Finally, a mixed norm allowing for both robustness and numerical efficiency is suggested.
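The structure-exploiting approach the paper favours — treating the matching problem as non-linear least squares rather than generic optimisation — can be sketched with a Gauss-Newton iteration. The residual below is a toy linear stand-in for illustration only, not the RM12 performance model.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20):
    """Minimise 0.5 * ||r(x)||^2 by solving the linearised normal equations
    J^T J dx = -J^T r at each step, exploiting the least-squares structure
    instead of treating the objective as a black box."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)
        J = jacobian(x)
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
    return x

# Toy health-parameter estimation: fit x so that a linear model A @ x
# matches measurements y (both hypothetical placeholders).
y = np.array([2.0, 4.0, 6.0])
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
residual = lambda x: A @ x - y
jacobian = lambda x: A
x_hat = gauss_newton(residual, jacobian, np.zeros(2))
```

On a genuinely non-linear engine model the Jacobian would be re-evaluated at each iterate; the near-quadratic local convergence of this scheme is what gives gradient-type solvers their efficiency edge when the objective is unimodal.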


2018 ◽  
Vol 34 (3) ◽  
pp. 1247-1266 ◽  
Author(s):  
Hua Kang ◽  
Henry V. Burton ◽  
Haoxiang Miao

Post-earthquake recovery models can be used as decision support tools for pre-event planning. However, due to a lack of available data, there have been very few opportunities to validate and/or calibrate these models. This paper describes the use of building damage, permitting, and repair data from the 2014 South Napa Earthquake to evaluate a stochastic process post-earthquake recovery model. Damage data were obtained for 1,470 buildings, and permitting and repair time data were obtained for a subset (456) of those buildings. A “blind” prediction is shown to adequately capture the shape of the recovery trajectory despite overpredicting the overall pace of the recovery. Using the mean time to permit and repair time from the acquired data set significantly improves the accuracy of the recovery prediction. A generalized model is formulated by establishing statistical relationships between key time parameters and endogenous and exogenous factors that have been shown to influence the pace of recovery.
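A stochastic-process recovery model of the kind evaluated here can be illustrated with a minimal Monte Carlo sketch: each damaged building draws a permitting time and then a repair time, and the recovery trajectory is the fraction of buildings restored over time. The exponential distributions and all parameter values below are assumptions for illustration, not the paper's fitted model.

```python
import random

def recovery_trajectory(n_buildings, mean_permit, mean_repair, horizon, step=1.0):
    """Monte Carlo recovery curve: each building finishes after an
    exponential permitting delay plus an exponential repair duration.
    Returns (times, fraction_recovered_at_each_time)."""
    finish = [random.expovariate(1.0 / mean_permit) + random.expovariate(1.0 / mean_repair)
              for _ in range(n_buildings)]
    times = [i * step for i in range(int(horizon / step) + 1)]
    return times, [sum(f <= t for f in finish) / n_buildings for t in times]
```

Calibrating the mean permit and repair times against observed data, as the paper does with the South Napa records, shifts this curve without changing its characteristic S-shape.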


2021 ◽  
Vol 77 (1) ◽  
pp. 19-27
Author(s):  
Hamish Todd ◽  
Paul Emsley

Biological macromolecules have complex three-dimensional shapes that are experimentally examined using X-ray crystallography and electron cryo-microscopy. Interpreting the data that these methods yield involves building 3D atomic models. With almost every data set, some portion of the time put into creating these models must be spent manually modifying the model in order to make it consistent with the data; this is difficult and time-consuming, in part because the data are 'blurry' in three dimensions. This paper describes the design and assessment of CootVR (available at http://hamishtodd1.github.io/cvr), a prototype computer program for performing this task in virtual reality, allowing structural biologists to build molecular models into cryo-EM and crystallographic data using their hands. CootVR was timed against Coot for a very specific model-building task, and was found to give an order-of-magnitude speedup for this task. A from-scratch model build using CootVR was also attempted; from this experience it is concluded that currently CootVR does not give a speedup over Coot overall.


Author(s):  
Rizwan Patan ◽  
Rajasekhara Babu M ◽  
Suresh Kallam

A Big Data Stream Computing (BDSC) platform handles real-time data from applications such as risk management, marketing management and business intelligence. Nowadays, Internet of Things (IoT) deployments are increasing massively in all areas, and these IoT devices generate real-time data for analysis. Existing BDSC platforms are inefficient at handling real-time data streams from IoT devices because such streams are unstructured and have inconstant velocity, which makes them challenging to process. This work proposes a framework that handles real-time data streams through device-control techniques to improve performance. The framework comprises three layers. The first layer deals with Big Data platforms that handle real-time data streams based on area of importance. The second layer is the performance layer, which addresses issues such as low response time and energy efficiency. The third layer applies the developed method to an existing BDSC platform. Experimental results show a performance improvement of 20%-30% for real-time data streams from IoT applications.


Author(s):  
Prasanna Lakshmi Kompalli

Data coming from different sources is referred to as a data stream. Data stream mining is an online learning technique in which each data point must be processed as it arrives and discarded once processing is complete. Advances in technology have made it possible to monitor these data streams in real time, and doing so has created many new challenges for researchers. The main features of this type of data are that it is fast flowing, continuous, large and growing in nature, and that its characteristics may change over time, a phenomenon termed concept drift. This chapter addresses the problems of mining data streams with concept drift. Because the relevant literature is scattered, isolating it can be a grueling task for researchers and practitioners; this chapter aims to provide a solution by consolidating the techniques used for data stream mining under concept drift.
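A common way to handle concept drift is to monitor the online learner's error rate and signal when it rises significantly above its historical minimum. The sketch below follows the spirit of the DDM detector of Gama et al.; the interface and the three-sigma drift level are simplified assumptions, not a specific technique from the chapter.

```python
import math

class DriftDetector:
    """DDM-style drift monitor: track the running error rate p and its
    standard deviation s, remember the minimum p + s observed so far, and
    signal drift when p + s climbs well above that minimum."""

    def __init__(self, drift_level=3.0):
        self.n = 0
        self.p = 0.0                   # running error rate
        self.min_p = self.min_s = 0.0
        self.min_p_s = float("inf")
        self.drift_level = drift_level

    def update(self, error):
        """Feed 1 for a misclassification, 0 for a correct prediction.
        Returns True when drift is signalled."""
        self.n += 1
        self.p += (error - self.p) / self.n           # incremental mean
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.p + s < self.min_p_s:                 # new best concept
            self.min_p_s = self.p + s
            self.min_p, self.min_s = self.p, s
        return self.p + s > self.min_p + self.drift_level * self.min_s
```

On a drift signal the stream miner would typically discard or rebuild its model from recent examples, since the old concept no longer describes the incoming data.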


Author(s):  
Trupti Vishwambhar Kenekar ◽  
Ajay R. Dani

As Big Data is a collection of structured, unstructured and semi-structured data gathered from various sources, it is important to mine it while providing privacy for individual data. Differential privacy is one of the best measures, as it provides a strong privacy guarantee. The chapter proposes differentially private frequent itemset mining using MapReduce, which requires less time for privately mining large datasets. The chapter discusses the problem of preserving data privacy, the challenges of preserving data privacy in a big data environment, data privacy techniques and their applications to unstructured data. Analyses of experimental results on structured and unstructured data sets are also presented.
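The differential privacy guarantee is typically obtained by adding calibrated noise to query results. The classic Laplace mechanism applied to itemset support counts can be sketched as follows; the itemsets and parameters are hypothetical, and this is the textbook mechanism rather than the chapter's exact MapReduce algorithm.

```python
import math
import random

def laplace_noise(scale):
    """Inverse-CDF sample from the Laplace(0, scale) distribution."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_counts(counts, epsilon, sensitivity=1.0):
    """Release support counts under epsilon-differential privacy. Adding or
    removing one transaction changes each count by at most `sensitivity`,
    so Laplace(sensitivity / epsilon) noise per count suffices."""
    scale = sensitivity / epsilon
    return {itemset: count + laplace_noise(scale) for itemset, count in counts.items()}

# Hypothetical supports from a frequent-itemset mining pass.
supports = {("bread",): 420, ("bread", "butter"): 310}
noisy = private_counts(supports, epsilon=0.5)
```

A smaller epsilon gives a stronger privacy guarantee but noisier counts, which is the utility trade-off the chapter's experiments on structured and unstructured data are measuring.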

