Collaborative and Clustering Based Strategy in Big Data

Web Services ◽  
2019 ◽  
pp. 221-239 ◽  
Author(s):  
Arushi Jain ◽  
Vishal Bhatnagar ◽  
Pulkit Sharma

The amount and volume of data being generated continue to proliferate, a trend that will persist for many years to come. Big data clustering is the exercise of taking a set of objects and dividing them into groups such that objects in the same group are more similar to one another, according to a chosen set of parameters, than to those in other groups. These groups are known as clusters. Cluster analysis is one of the main tasks in data mining and a commonly used technique for statistical analysis of data. Big data collaborative filtering, in turn, is a technique that filters the information and patterns sought by the user by combining multiple data sources, such as viewpoints, multiple agents, and pre-existing data about users' behavior stored in matrices. Collaborative filtering is especially valuable when a huge data set is present.
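
Neither technique is shown in code in the chapter, but both can be illustrated with a short sketch. The Python snippet below (toy data and parameter choices invented for illustration) groups a handful of points into clusters with scikit-learn's KMeans and then makes a simple user-based collaborative-filtering prediction from a small user-item rating matrix using cosine similarity.

```python
# Minimal sketch (not the chapter's implementation): clustering a toy data set
# with k-means and making a simple user-based collaborative-filtering estimate
# from a small user-item rating matrix. All data below are made up.
import numpy as np
from sklearn.cluster import KMeans

# --- clustering: group objects so that members of a cluster are similar ---
points = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print("cluster labels:", labels)          # e.g. [0 0 1 1]

# --- collaborative filtering: predict a missing rating from similar users ---
# rows = users, columns = items; 0 marks an unknown rating
ratings = np.array([[5.0, 4.0, 0.0],
                    [4.0, 5.0, 3.0],
                    [1.0, 2.0, 5.0]])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

target_user, target_item = 0, 2
sims = np.array([cosine(ratings[target_user], ratings[u])
                 for u in range(len(ratings)) if u != target_user])
others = np.array([ratings[u, target_item]
                   for u in range(len(ratings)) if u != target_user])
# similarity-weighted average of the other users' ratings for the item
prediction = float(sims @ others / (np.abs(sims).sum() + 1e-12))
print("predicted rating:", round(prediction, 2))
```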



Author(s):  
Chris Goller ◽  
James Simek ◽  
Jed Ludlow

The purpose of this paper is to present a non-traditional pipeline mechanical damage ranking system using multiple-data-set in-line inspection (ILI) tools. Mechanical damage continues to be a major factor in reportable incidents for hazardous liquid and gas pipelines. While several ongoing programs seek to limit damage incidents through public awareness, encroachment monitoring, and one-call systems, others have focused efforts on quantifying mechanical damage severity through modeling, the use of ILI tools, and subsequent feature assessment at locations selected for excavation. Current-generation ILI tools capable of acquiring multiple data sets in a single survey may provide an improved assessment of the severity of damaged zones using methods developed in earlier research programs as well as currently reported information. For magnetic flux leakage (MFL) tools, using multiple field levels, varied field directions, and high-accuracy deformation sensors enables detection and provides the data necessary for enhanced severity assessments. This paper provides a review of multiple-data-set ILI results from several pipe joints with simulated mechanical damage locations created to mimic right-of-way encroachment events, in addition to field results from ILI surveys using multiple-data-set tools.


2018 ◽  
Vol 11 (7) ◽  
pp. 4239-4260 ◽  
Author(s):  
Richard Anthes ◽  
Therese Rieckh

Abstract. In this paper we show how multiple data sets, including observations and models, can be combined using the “three-cornered hat” (3CH) method to estimate vertical profiles of the errors of each system. Using data from 2007, we estimate the error variances of radio occultation (RO), radiosondes, ERA-Interim, and Global Forecast System (GFS) model data sets at four radiosonde locations in the tropics and subtropics. A key assumption is the neglect of error covariances among the different data sets, and we examine the consequences of this assumption on the resulting error estimates. Our results show that different combinations of the four data sets yield similar relative and specific humidity, temperature, and refractivity error variance profiles at the four stations, and these estimates are consistent with previous estimates where available. These results thus indicate that the correlations of the errors among all data sets are small and the 3CH method yields realistic error variance profiles. The estimated error variances of the ERA-Interim data set are smallest, a reasonable result considering the excellent model and data assimilation system and assimilation of high-quality observations. For the four locations studied, RO has smaller error variances than radiosondes, in agreement with previous studies. Part of the larger error variance of the radiosondes is associated with representativeness differences because radiosondes are point measurements, while the other data sets represent horizontal averages over scales of ∼ 100 km.
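
As a hedged illustration of the 3CH estimator described above (not the authors' processing chain), the following sketch applies the pairwise-difference formula to three synthetic, collocated data sets whose errors are assumed mutually uncorrelated; the noise levels are invented for demonstration.

```python
# Three-cornered hat (3CH) sketch: with three collocated data sets whose errors
# are mutually uncorrelated, each error variance follows from the variances of
# the pairwise differences. Synthetic data stand in for RO/radiosonde/model.
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(size=5000)                         # unknown "true" values
a = truth + rng.normal(scale=0.30, size=truth.size)   # data set A, error sd 0.30
b = truth + rng.normal(scale=0.50, size=truth.size)   # data set B, error sd 0.50
c = truth + rng.normal(scale=0.80, size=truth.size)   # data set C, error sd 0.80

def three_cornered_hat(x, y, z):
    """Estimate error variance of x, assuming x, y, z have uncorrelated errors."""
    return 0.5 * (np.var(x - y) + np.var(x - z) - np.var(y - z))

for name, est in zip("ABC", (three_cornered_hat(a, b, c),
                             three_cornered_hat(b, a, c),
                             three_cornered_hat(c, a, b))):
    print(f"estimated error variance of {name}: {est:.3f}")
# expected values: A ~0.09, B ~0.25, C ~0.64
```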


2021 ◽  
Author(s):  
Huan Chen ◽  
Brian Caffo ◽  
Genevieve Stein-O’Brien ◽  
Jinrui Liu ◽  
Ben Langmead ◽  
...  

Summary: Integrative analysis of multiple data sets has the potential to fully leverage the vast amount of high-throughput biological data being generated. In particular, such analysis will be powerful in making inference from publicly available collections of genetic, transcriptomic and epigenetic data sets that are designed to study shared biological processes but vary in their target measurements, biological variation, unwanted noise, and batch variation. Thus, methods that enable the joint analysis of multiple data sets are needed to gain insights into shared biological processes that would otherwise be hidden by unwanted intra-data-set variation. Here, we propose a method called two-stage linked component analysis (2s-LCA) to jointly decompose multiple biologically related experimental data sets with biological and technological relationships that can be structured into the decomposition. The consistency of the proposed method is established and its empirical performance is evaluated via simulation studies. We apply 2s-LCA to jointly analyze four data sets focused on human brain development and identify meaningful patterns of gene expression in human neurogenesis that have shared structure across these data sets. The code to conduct 2s-LCA has been compiled into an R package, “PJD”, which is available at https://github.com/CHuanSite/PJD.
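
The sketch below illustrates the general idea of extracting structure shared across several data sets by a joint decomposition. It is a generic stacked-SVD illustration on synthetic data, not the 2s-LCA algorithm itself, which is implemented in the authors' R package "PJD".

```python
# Generic joint-decomposition illustration (not 2s-LCA): data sets sharing the
# same feature space (e.g. genes) are standardized per gene, concatenated over
# samples, and a truncated SVD recovers loading patterns shared across them.
import numpy as np

rng = np.random.default_rng(1)
n_genes, k_shared = 200, 3
shared_loadings = rng.normal(size=(n_genes, k_shared))   # common structure

def make_dataset(n_samples, noise):
    scores = rng.normal(size=(k_shared, n_samples))
    return shared_loadings @ scores + noise * rng.normal(size=(n_genes, n_samples))

datasets = [make_dataset(40, 0.5), make_dataset(60, 1.0), make_dataset(30, 0.8)]

# standardize each data set per gene (rows), then concatenate over samples
centered = [(d - d.mean(axis=1, keepdims=True)) / d.std(axis=1, keepdims=True)
            for d in datasets]
stacked = np.hstack(centered)                            # genes x all samples

U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
joint_patterns = U[:, :k_shared]                         # shared gene patterns

# how well the recovered subspace aligns with the true shared loadings
proj = joint_patterns @ (joint_patterns.T @ shared_loadings)
print("subspace recovery (fraction of norm captured):",
      round(np.linalg.norm(proj) / np.linalg.norm(shared_loadings), 3))
```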


Author(s):  
Ping Li ◽  
Hua-Liang Wei ◽  
Stephen A. Billings ◽  
Michael A. Balikhin ◽  
Richard Boynton

A basic assumption on the data used for nonlinear dynamic model identification is that the data points are collected continuously in chronological order. However, there are situations in practice where this assumption does not hold and we end up with an identification problem involving multiple data sets. This paper addresses that problem and proposes a new cross-validation-based orthogonal search algorithm for NARMAX model identification from multiple data sets. The algorithm aims to identify a single model from multiple data sets, extending the applicability of the standard method to cases where the data sets for identification are obtained from multiple tests or a series of experiments, or where the data set is discontinuous because of missing data points. The proposed method can also be viewed as a way to improve the performance of the standard orthogonal search method for model identification by making full use of all the available data segments. Simulated and real data are used to illustrate the operation and to demonstrate the effectiveness of the proposed method.
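
The core idea of identifying a single model from several discontinuous records can be sketched with a much simpler linear-in-parameters example than NARMAX. The snippet below (synthetic AR(2) segments, not the paper's cross-validation-based orthogonal search) builds the lagged-regressor matrix inside each segment separately, so lags never straddle a gap, and then solves one pooled least-squares problem.

```python
# Pooling multiple data segments into one identification problem: regressors
# are built per segment, stacked, and a single model is fitted to all of them.
import numpy as np

rng = np.random.default_rng(2)

def simulate_segment(n):
    """Synthetic AR(2) record: y[t] = 0.6*y[t-1] - 0.3*y[t-2] + noise."""
    y = np.zeros(n)
    for t in range(2, n):
        y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + 0.05 * rng.normal()
    return y

segments = [simulate_segment(n) for n in (80, 120, 60)]   # three separate tests

# build lagged regressors inside each segment, then stack all segments
X_blocks, y_blocks = [], []
for seg in segments:
    X_blocks.append(np.column_stack([seg[1:-1], seg[:-2]]))  # y[t-1], y[t-2]
    y_blocks.append(seg[2:])
X, y = np.vstack(X_blocks), np.concatenate(y_blocks)

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", np.round(coefs, 3))        # close to [0.6, -0.3]
```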


2022 ◽  
Vol 12 (1) ◽  
pp. 0-0

Data mining is an essential task because the digital world creates huge volumes of data daily. Associative classification is a data mining task used to classify data according to the needs of knowledge users. Most associative classification algorithms are not able to analyze big data, which is mostly continuous in nature. This motivates both an analysis of existing discretization algorithms, which convert continuous data into discrete values, and the development of a novel discretizer, the Reliable Distributed Fuzzy Discretizer, for big data sets. Many discretizers suffer from over-splitting of partitions. The proposed method is implemented in a distributed fuzzy environment and avoids over-splitting of partitions by introducing a novel stopping criterion. The proposed discretization method is compared with an existing distributed fuzzy partitioning method and achieves good accuracy in the performance of associative classifiers.
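
As a rough illustration of discretization with a stopping rule that limits over-splitting (not the proposed Reliable Distributed Fuzzy Discretizer, and neither fuzzy nor distributed), the sketch below recursively bisects a continuous attribute and stops whenever a child bin would hold less than a minimum fraction of the data; all thresholds and data are invented.

```python
# Recursive bisection discretizer with a minimum-bin-size stopping criterion.
import numpy as np

rng = np.random.default_rng(3)
values = rng.uniform(0, 100, size=1000)      # a continuous attribute

def split(points, min_frac=0.1, lo=None, hi=None, cuts=None):
    """Recursively bisect [lo, hi]; stop when a child bin would be too small."""
    if cuts is None:
        cuts, lo, hi = [], points.min(), points.max()
    mid = 0.5 * (lo + hi)
    left = points[(points >= lo) & (points < mid)]
    right = points[(points >= mid) & (points <= hi)]
    # stopping criterion: do not create bins holding < min_frac of the data
    if len(left) < min_frac * len(points) or len(right) < min_frac * len(points):
        return cuts
    cuts.append(mid)
    split(points, min_frac, lo, mid, cuts)
    split(points, min_frac, mid, hi, cuts)
    return sorted(cuts)

cut_points = split(values)
print("cut points:", [round(c, 1) for c in cut_points])
bins = np.digitize(values, cut_points)       # discrete label for each value
print("bin sizes:", np.bincount(bins))
```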


Author(s):  
Onur Doğan ◽  
Hakan  Aşan ◽  
Ejder Ayç

In today’s competitive world, organizations need to make the right decisions to prolong their existence. In this competitive arena, non-scientific methods and emotional decision making have given way to scientific methods in the decision-making process. Within this scope, many decision support models are still being developed to assist decision makers and owners of organizations. It is easy for organizations to collect massive amounts of data, but the problem is generally in using this data to achieve economic advances. There is a critical need for specialization and automation to transform the data in big data sets into knowledge. Data mining techniques are capable of providing description, estimation, prediction, classification, clustering, and association. Recently, many data mining techniques have been developed to find hidden patterns and relations in big data sets. It is important to obtain new correlations, patterns, and trends that are understandable and useful to decision makers. Much research and many applications have focused on different data mining techniques and methodologies. In this study, we aim to obtain understandable and applicable results from a large volume of records belonging to a firm active in the meat processing industry by using data mining techniques. In the application part, data cleaning and data integration, the first steps of the data mining process, are first performed on the data in the database. Data cleaning and data integration yield a data set suitable for data mining. Then, various association rule algorithms are applied to this data set. This analysis reveals that finding unexplored patterns in the data would be beneficial for the decision makers of the firm. Finally, many association rules useful to the decision makers of the local firm are obtained.
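
The kind of association-rule output described here can be illustrated with a self-contained sketch. The toy transactions below are invented (they are not the firm's data); the code simply computes support and confidence for one-item antecedent/consequent rules.

```python
# Toy association-rule mining: support and confidence over market baskets.
from itertools import combinations

transactions = [
    {"ground beef", "sausage", "spice mix"},
    {"ground beef", "spice mix"},
    {"sausage", "salami"},
    {"ground beef", "sausage", "salami"},
    {"spice mix", "salami"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

items = sorted(set().union(*transactions))
min_support, min_confidence = 0.3, 0.6

# one-antecedent, one-consequent rules: A -> B
for a, b in combinations(items, 2):
    for lhs, rhs in ((a, b), (b, a)):
        supp = support({lhs, rhs})
        if supp >= min_support:
            conf = supp / support({lhs})
            if conf >= min_confidence:
                print(f"{lhs} -> {rhs}  support={supp:.2f} confidence={conf:.2f}")
```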


2020 ◽  
Vol 493 (1) ◽  
pp. 48-54
Author(s):  
Chris Koen

ABSTRACT Large monitoring campaigns, particularly those using multiple filters, have produced replicated time series of observations for literally millions of stars. The search for periodicities in such replicated data can be facilitated by comparing the periodograms of the various time series. In particular, frequency spectra can be searched for common peaks. The sensitivity of this procedure to various parameters (e.g. the time base of the data, length of the frequency interval searched, number of replicate series, etc.) is explored. Two additional statistics that could sharpen results are also discussed: the closeness (in frequency) of peaks identified as common to all data sets, and the sum of the ranks of the peaks. Analytical expressions for the distributions of these two statistics are presented. The method is illustrated by showing that a ‘dubious’ periodicity in an ‘Asteroid Terrestrial-impact Last Alert System’ data set is highly significant.
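
A hedged sketch of the screening idea (not the paper's test statistics or their distributions): compute a periodogram for each replicated series of the same object and keep only the peak frequencies common to all of them. The signal, noise levels, and thresholds below are invented for illustration.

```python
# Find periodogram peaks common to several replicated time series.
import numpy as np
from scipy.signal import periodogram, find_peaks

rng = np.random.default_rng(4)
fs, n, f_true = 1.0, 2000, 0.037           # sampling rate, length, true frequency
t = np.arange(n) / fs

def replicate(noise_sd):
    """One replicated series: the same sinusoid plus independent noise."""
    return np.sin(2 * np.pi * f_true * t) + noise_sd * rng.normal(size=n)

series = [replicate(s) for s in (1.5, 2.0, 2.5)]   # e.g. different filters

peak_sets = []
for y in series:
    freqs, power = periodogram(y, fs=fs)
    idx, _ = find_peaks(power, height=5 * power.mean())   # crude peak threshold
    peak_sets.append(set(np.round(freqs[idx], 3)))

common = set.intersection(*peak_sets)
print("peak frequencies common to all series:", sorted(common))   # ~ {0.037}
```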


2015 ◽  
Vol 2 (1) ◽  
pp. 205395171558941 ◽  
Author(s):  
Emily Gray ◽  
Will Jennings ◽  
Stephen Farrall ◽  
Colin Hay

A wireless body area network (WBAN) is a self-governing, sensing network used to monitor the activities of a person and to improve people's quality of life, satisfying the requirements of users' needs. In this paper, we propose a big data retrieval unit in WBAN using elliptic curve cryptography. Big data are transmitted through MapReduce and retrieved securely using the ECCDS algorithm. MapReduce is a programming model for processing multiple data sets efficiently on multi-node hardware using a distributed storage process; it groups all the intermediate values associated with the same intermediate key. The extensible CloudSim toolkit is used to enable the modeling and to enhance application provisioning.
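
The MapReduce behaviour referred to above, grouping all intermediate values that share a key before reducing them, can be sketched in a few lines; the sensor names and values below are invented and this is not the paper's WBAN/ECCDS implementation.

```python
# Toy map-reduce pattern: map emits (key, value) pairs, the shuffle step groups
# values that share the same key, and reduce combines each group.
from collections import defaultdict

readings = [("heart_rate", 72), ("temperature", 36.6), ("heart_rate", 75),
            ("temperature", 36.8), ("heart_rate", 70)]

# map: emit (key, value) pairs (identity map in this toy example)
mapped = [(sensor, value) for sensor, value in readings]

# shuffle: group intermediate values associated with the same key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce: combine each group, here by averaging
reduced = {key: sum(values) / len(values) for key, values in groups.items()}
print(reduced)   # {'heart_rate': ~72.3, 'temperature': ~36.7}
```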

