A Comparison of Big Data Frameworks on a Layered Dataflow Model

2017
Vol 27 (01)
pp. 1740003
Author(s):  
Claudia Misale
Maurizio Drocco
Marco Aldinucci
Guy Tremblay

In the world of Big Data analytics, a range of tools aims to simplify the programming of applications to be executed on clusters. Although each tool claims to provide better programming, data, and execution models (for which only informal, and often confusing, semantics are generally provided), all of them share a common underlying model: the Dataflow model. The model we propose shows how the various tools offer the same expressiveness at different levels of abstraction. The contribution of this work is twofold. First, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm), thus making it easier to understand high-level data-processing applications written in such frameworks. Second, we provide a layered model that can represent tools and applications following the Dataflow paradigm, and we show how the analyzed tools fit into each level.
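To make the shared-expressiveness claim concrete, here is a minimal sketch (an illustration under stated assumptions, not code from the paper) of a word count written as a chain of Dataflow operators; Spark, Flink, and Storm each expose an analogous operator chain at their top layer.

```python
# Toy Dataflow-style pipeline (hypothetical; mirrors the operator
# chaining that Spark/Flink/Storm expose at their highest layer).
from itertools import chain

class Pipeline:
    def __init__(self, source):
        self.data = list(source)

    def flat_map(self, fn):      # one-to-many operator
        self.data = list(chain.from_iterable(fn(x) for x in self.data))
        return self

    def map(self, fn):           # one-to-one operator
        self.data = [fn(x) for x in self.data]
        return self

    def reduce_by_key(self, fn): # keyed aggregation operator
        acc = {}
        for k, v in self.data:
            acc[k] = fn(acc[k], v) if k in acc else v
        self.data = list(acc.items())
        return self

lines = ["big data tools", "dataflow tools"]
counts = (Pipeline(lines)
          .flat_map(str.split)
          .map(lambda w: (w, 1))
          .reduce_by_key(lambda a, b: a + b)
          .data)
print(counts)  # [('big', 1), ('data', 1), ('tools', 2), ('dataflow', 1)]
```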

2021
pp. 034-041
Author(s):  
A.Y. Gladun
K.A. Khala
It is becoming clear, as cybersecurity threats grow more complex, that one of the most important resources for combating cyberattacks is the processing of large amounts of data from the cyber environment. To process such huge amounts of data and support decision-making, the tasks of searching, selecting, and interpreting Big Data must be automated in order to solve operational information security problems. Big Data analytics, complemented by semantic technology, can improve cybersecurity and makes it possible to process and interpret large amounts of information from the cyber environment. Semantic modeling methods are needed in Big Data analytics to select and combine heterogeneous Big Data sources and to recognize patterns of network attacks and other cyber threats, which must happen quickly so that countermeasures can be implemented. Therefore, to analyze Big Data metadata, the authors propose pre-processing the metadata at the semantic level. As an analysis tool, they propose building a thesaurus of the problem based on the domain ontology, which should provide a terminological basis for integrating ontologies at different levels. To build the problem thesaurus, they propose using the standards of open information resources, dictionaries, and encyclopedias. Developing an ontology hierarchy formalizes the relationships between data elements, which machine learning and artificial intelligence algorithms can later use to adapt to changes in the environment; this in turn will increase the efficiency of Big Data analytics for the cybersecurity domain.
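As a rough illustration of the proposed semantic pre-processing, the sketch below (the terms, concept names, and mapping are hypothetical, not the authors' implementation) shows a problem thesaurus that maps heterogeneous metadata terms onto shared domain-ontology concepts.

```python
# A minimal sketch (assumed, not the authors' implementation) of a
# problem thesaurus that maps heterogeneous metadata terms onto
# domain-ontology concepts, so records from different Big Data
# sources can be aligned.
THESAURUS = {
    # surface term          -> ontology concept (hypothetical names)
    "port scan":            "cyber:ReconnaissanceActivity",
    "portscan":             "cyber:ReconnaissanceActivity",
    "syn flood":            "cyber:DenialOfService",
    "ddos":                 "cyber:DenialOfService",
    "credential stuffing":  "cyber:BruteForceAttack",
}

def normalize(metadata_terms):
    """Map raw metadata terms to shared ontology concepts; keep unknowns."""
    return {t: THESAURUS.get(t.lower(), "cyber:UnclassifiedEvent")
            for t in metadata_terms}

# Two sources describing the same attack with different vocabulary
# resolve to the same concept, enabling cross-source pattern matching:
print(normalize(["Port Scan", "SYN flood", "beaconing"]))
```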


2020
Vol 34 (28)
pp. 2050311
Author(s):  
Satvik Vats
B. B. Sagar

In the Big Data domain, platform dependency can alter the behavior of a business because of the different kinds (structured, semi-structured, and unstructured) and characteristics of the data. With traditional infrastructure, different kinds of data cannot be processed simultaneously, since each kind is tied to a particular platform for a given task. Therefore, the responsibility for selecting suitable tools lies with the user, yet the variety of data generated by different sources calls for tool selection without human intervention. Further, these tools face resource limitations when dealing with large volumes of data, which affects their performance in terms of execution time. In this work, we propose a model in which different data analytics tools share a common infrastructure, providing data independence and a resource-sharing environment: the model shares a common (hybrid) Hadoop Distributed File System (HDFS) among three Name-Nodes (master nodes), three Data-Nodes, and one Client-Node, operating within a demilitarized zone (DMZ). To realize this model, we implemented Mahout, R-Hadoop, and Splunk sharing a common HDFS. Using the model, we ran k-means clustering, Naïve Bayes, and recommender algorithms on three datasets (movie ratings, newsgroups, and spam SMS), representing structured, semi-structured, and unstructured data, respectively. The model selected the appropriate tool, e.g., Mahout for the newsgroup dataset, which the other tools cannot process, showing that the model provides data independence. Results of the proposed model are further compared with the legacy (individual) model in terms of execution time and scalability. The improved performance of the proposed model supports the hypothesis that it overcomes the resource limitations of the legacy model.
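The tool-selection idea can be pictured as a simple dispatch over the data kind. The mapping below is partly inferred from the abstract (only the newsgroup-to-Mahout pairing is stated) and is illustrative, not the authors' implementation.

```python
# Hypothetical dispatcher over the shared (hybrid) HDFS; only the
# semi-structured -> Mahout pairing is stated in the abstract.
TOOL_BY_KIND = {
    "structured":      "R-Hadoop",  # e.g. movie-rating tables (assumed)
    "semi-structured": "Mahout",    # newsgroup documents (stated)
    "unstructured":    "Splunk",    # spam SMS text (assumed)
}

def select_tool(dataset_kind):
    """Pick an analytics tool without human intervention."""
    try:
        return TOOL_BY_KIND[dataset_kind]
    except KeyError:
        raise ValueError(f"no registered tool for kind: {dataset_kind!r}")

print(select_tool("semi-structured"))  # Mahout
```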


2018
Vol 7 (2.32)
pp. 452
Author(s):  
Anjali Mathur
K Vinitha
R Shubham
K Gowtham

A bank merger is a situation in which two banks, or all branches of a bank, join together to become one bank. The State Bank of India merger was implemented in India on 1 April 2017. A bank merger is a good opportunity to centralize customer data from across the nation; however, this is a difficult task for administrators and technologists. Sophisticated techniques are required to collect data from the bank's branches nationwide and to merge them accordingly. For such huge data, Big Data analysis techniques can be used to manage and access it. Big Data analytics provides algorithms to compare, classify, and cluster data at the local and global levels. This paper proposes Big Data analytics for the education loans provided by the State Bank of India. The loan-granting process became centralized after the merger, which affected loan processing, as it was previously handled branch by branch. The proposed work is a comparative study of the impact of the bank merger on education loans provided by the State Bank of India.
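As one way to picture the comparative study, the sketch below frames the before/after comparison as a grouped aggregation over centralized loan records. Column names and all values are invented for illustration; they are not the paper's data.

```python
# Hypothetical pre/post-merger comparison of education loan records.
import pandas as pd

loans = pd.DataFrame({
    "period":          ["pre-merger"] * 3 + ["post-merger"] * 3,
    "sanction_days":   [34, 41, 29, 22, 25, 19],     # invented values
    "amount_lakh_inr": [7.5, 4.0, 6.0, 8.0, 5.5, 6.5],
})

# Average processing time and loan size before vs. after the merger:
print(loans.groupby("period")[["sanction_days", "amount_lakh_inr"]].mean())
```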


Author(s):  
Jaimin Navinchandra Undavia
Atul Manubhai Patel

Technological advancement has opened up various ways to collect data through automatic mechanisms. One such mechanism collects huge amounts of data without any further maintenance or human intervention. The healthcare sector has been confronted by the need to manage the big data produced by its various sources, which are well known for generating high volumes of heterogeneous data. A high level of sophistication has been incorporated into almost every industry, and healthcare is one of them. The article shows that a huge amount of data exists in the healthcare industry and that this data is neither homogeneous nor of a simple type. The various sources and objectives of the data are also highlighted and discussed. Because the data come from various sources, they are versatile in nature in all respects. So, rightly and meaningfully, big data analytics has penetrated the healthcare industry, and its impact is also highlighted.


2019
Vol 4 (2)
pp. 235
Author(s):  
Firman Arifin
Budi Nur Iman
Elly Purwantini
...  

Understanding public interest and opinion is a necessary task in highly intense political competition. Big data analytics of social media provides an important source of information that candidates can utilize, manage, and even use to engage voters in a targeted political campaigning agenda. One such big data source is social media interaction: social media empowers the public to participate proactively in campaigning activities. This paper examines trends gathered from data analytics of the two contenders' groups in the 2019 Indonesian election. It tracks recent patterns of public engagement via social media analytics, specifically on Twitter. The study develops the analysis into a proposed model based on the observed trends and patterns.
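A minimal sketch of the kind of engagement tracking described (field names, keywords, and records are assumptions for illustration, not the study's pipeline):

```python
# Count mentions plus retweet spread per contender group (toy data).
from collections import Counter

tweets = [
    {"text": "great debate by candidate A", "retweets": 12},
    {"text": "candidate B rally today",     "retweets": 30},
    {"text": "candidate A policy plan",     "retweets": 7},
]

engagement = Counter()
for tw in tweets:
    for group in ("candidate a", "candidate b"):
        if group in tw["text"].lower():
            engagement[group] += 1 + tw["retweets"]  # mention + spread

print(engagement.most_common())  # which group drew more interaction
```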


2019
Vol 12 (1)
pp. 202
Author(s):  
Eun Sun Kim
Yunjeong Choi
Jeongeun Byun

To expand the field of governmental applications of Big Data analytics, this study presents a case of data-driven decision-making using information on research and development (R&D) projects in Korea. The Korean government has continuously expanded the proportion of its R&D investment going to small- and medium-sized enterprises to improve the commercialization performance of national R&D projects. However, the government has struggled with the so-called "Korea R&D Paradox", which refers to how commercialization performance has lagged despite the high level of investment in R&D. Using data from 48,309 national R&D projects carried out by enterprises from 2013 to 2017, we perform a cluster analysis and a decision tree analysis to derive the determinants of their commercialization performance. This study provides government entities with insights into how they might adjust their approach to Big Data analytics to improve the efficiency of R&D investment in small- and medium-sized enterprises.
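The two-step analysis can be sketched as follows. This is a hedged illustration with synthetic features and labels; the paper's actual variables and data differ.

```python
# Cluster projects, then fit a decision tree whose splits expose
# determinants of commercialization success (synthetic data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.random((200, 3))            # e.g. funding, duration, firm size
y = (X[:, 0] > 0.5).astype(int)     # toy "commercialized" label

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    np.column_stack([X, clusters]), y)

# The tree's splits indicate which features drive commercialization:
print(export_text(tree, feature_names=["funding", "duration",
                                       "firm_size", "cluster"]))
```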


Sensors
2021
Vol 21 (21)
pp. 7035
Author(s):  
Dina Fawzy
Sherin Moussa
Nagwa Badr

Enormous heterogeneous sensory data are generated in the Internet of Things (IoT) for various applications. These big data are characterized by additional features related to IoT, including trustworthiness, timing, and spatial features. This reveals more perspectives to consider during processing and poses vast challenges to traditional data fusion methods at the different fusion levels of collection and analysis. In this paper, an IoT-based spatiotemporal data fusion (STDF) approach for low-level, data in-data out fusion is proposed for real-time aggregation of spatial IoT sources. It achieves optimal performance by leveraging traditional data fusion methods with big data analytics while maintaining the expiry, trustworthiness, and spatial and temporal perspectives of IoT data, in addition to their volume and velocity. It applies cluster sampling for data reduction upon acquisition from all IoT sources. For each source, it combines k-means clustering for spatial analysis with Tiny AGgregation (TAG) for temporal aggregation to maintain spatiotemporal data fusion at the processing server. STDF is validated via a public IoT data stream simulator. The experiments examine diverse IoT processing challenges across different datasets; STDF reduces data size by 95% and processing time by 80%, with an accuracy level of up to 90% on the largest dataset used.
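A rough sketch of the three STDF stages as the abstract describes them (record layout, window size, and all parameters are assumptions, not the authors' implementation):

```python
# 1) cluster sampling for reduction, 2) k-means for spatial grouping,
# 3) TAG-style windowed aggregation for the temporal side (toy data).
import random
import numpy as np
from sklearn.cluster import KMeans

# readings: (timestamp_s, latitude, longitude, value)
readings = [(t, 30 + random.random(), 31 + random.random(),
             random.gauss(20, 2)) for t in range(1000)]

# 1) Cluster sampling on acquisition: keep a random subset of time blocks.
blocks = [readings[i:i + 100] for i in range(0, len(readings), 100)]
sample = [r for blk in random.sample(blocks, k=3) for r in blk]

# 2) Spatial analysis: k-means over coordinates of the sampled readings.
coords = np.array([(lat, lon) for _, lat, lon, _ in sample])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)

# 3) TAG-style temporal aggregation: mean value per cluster per 60 s window.
agg = {}
for (t, _, _, v), c in zip(sample, labels):
    agg.setdefault((c, t // 60), []).append(v)
fused = {k: sum(vs) / len(vs) for k, vs in agg.items()}
print(len(sample), "of", len(readings), "readings kept;",
      len(fused), "fused cells")
```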


2017
Vol 21 (1)
pp. 1-6
Author(s):  
David J. Pauleen
William Y.C. Wang

Purpose: This viewpoint study aims to make the case that the field of knowledge management (KM) must respond to the significant changes that big data/analytics is bringing to the operationalization of the production of organizational data and information.
Design/methodology/approach: This study expresses the opinions of the guest editors of "Does Big Data Mean Big Knowledge? Knowledge Management Perspectives on Big Data and Analytics".
Findings: A Big Data/Analytics-Knowledge Management (BDA-KM) model is proposed that illustrates the centrality of knowledge as the guiding principle in the use of big data/analytics in organizations.
Research limitations/implications: This is an opinion piece, and the proposed model still needs to be empirically verified.
Practical implications: The study suggests that academics and practitioners in KM must be capable of controlling the application of big data/analytics, and it calls for further research investigating how KM can conceptually and operationally use and integrate big data/analytics to foster organizational knowledge for better decision-making and organizational value creation.
Originality/value: The BDA-KM model is one of the early models to place knowledge as the primary consideration in the successful organizational use of big data/analytics.

