Analisa Performa Klastering Data Besar pada Hadoop (Performance Analysis of Big Data Clustering on Hadoop)

2021 ◽ Vol 4 (2) ◽ pp. 174-183
Author(s): Hadian Mandala Putra ◽ Taufik Akbar ◽ Ahwan Ahmadi ◽ Muhammad Iman Darmawan ◽ ...

Big Data is a collection of data of large and complex size, consisting of various data types obtained from various sources and growing rapidly. Among the problems that arise when processing big data are the storage and access of data of many types and high complexity, which the relational model cannot handle. One technology that can solve the problem of storing and accessing big data is Hadoop. Hadoop stores and processes big data by distributing it into several data partitions (data blocks). Problems arise when an analysis process requires all the data spread across partitions to be treated as one entity, for example in data clustering. One alternative solution is to analyze the partitions in parallel where they reside, and then perform a centralized analysis of the partial results. This study examines and analyzes two methods, namely K-Medoids with MapReduce and K-Modes without MapReduce. The dataset used is a car dataset consisting of 3.5 million rows (about 400 MB) distributed across a Hadoop cluster (consisting of more than one machine). Hadoop provides MapReduce, which consists of two functions, map and reduce. The map function selects records and emits them as a collection of key-value pairs, and the reduce function then combines the key-value pairs produced by the map functions. Cluster quality is evaluated using the Silhouette Coefficient metric. For the car dataset, the K-Medoids MapReduce algorithm gives a silhouette value of 0.99 with a total of 2 clusters.
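As a rough sketch of the map/reduce flow described above, the following single-process Python example runs one K-Medoids iteration over toy one-dimensional data: the map function assigns each record to its nearest medoid and emits a key-value pair, and the reduce function recomputes each cluster's medoid from the grouped values. The function names, toy data, and distance measure are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def map_fn(record, medoids):
    """Map: emit (index of nearest medoid, record) as a key-value pair."""
    nearest = min(range(len(medoids)), key=lambda i: abs(record - medoids[i]))
    return nearest, record

def reduce_fn(cluster_id, members):
    """Reduce: pick the member minimising total distance to the rest
    as the cluster's new medoid."""
    new_medoid = min(members, key=lambda c: sum(abs(c - p) for p in members))
    return cluster_id, new_medoid

# Driver: group ("shuffle") map outputs by key, then reduce each group.
data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
medoids = [1.2, 7.9]
groups = defaultdict(list)
for record in data:
    key, value = map_fn(record, medoids)
    groups[key].append(value)
print(dict(reduce_fn(k, v) for k, v in groups.items()))
# {0: 1.0, 1: 8.0}
```

In a real Hadoop job the shuffle step, not the driver loop, groups the map outputs by key across machines; the single-process loop here only stands in for that.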

2020 ◽ Vol 13 (4) ◽ pp. 790-797
Author(s): Gurjit Singh Bhathal ◽ Amardeep Singh Dhiman

Background: In the current internet scenario, large amounts of data are generated and processed. The Hadoop framework is widely used to store and process big data in a highly distributed manner, yet it is argued that Hadoop is not mature enough to deal with current cyberattacks on the data.
Objective: The main objective of the proposed work is to provide a complete security approach, comprising authorisation and authentication for users and Hadoop cluster nodes, and to secure the data both at rest and in transit.
Methods: The proposed algorithm uses the Kerberos network authentication protocol to authenticate and authorise users and cluster nodes. Ciphertext-Policy Attribute-Based Encryption (CP-ABE) is used for data at rest and in transit: users encrypt files under their own set of attributes and store them on the Hadoop Distributed File System, and only intended users with matching attributes can decrypt them.
Results: The proposed algorithm was implemented on datasets of different sizes, processed with and without encryption. The results show little difference in processing time: performance was affected in the range of 0.8% to 3.1%, which also includes the impact of other factors, such as system configuration, the number of parallel jobs running, and the virtual environment.
Conclusion: The solutions available for the big data security problems faced in the Hadoop framework are inefficient or incomplete. A complete security framework is proposed for the Hadoop environment, and the solution is experimentally shown to have little effect on system performance for datasets of different sizes.
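The access rule that CP-ABE enforces cryptographically can be illustrated without any cryptography: a ciphertext carries a policy over attributes, a key carries a set of attributes, and decryption succeeds only if the attributes satisfy the policy. The sketch below models only that policy check in plain Python; the attribute names and policy shape are invented for the example, and a real deployment would use an actual CP-ABE library.

```python
# Conceptual model of a CP-ABE access policy (no real cryptography): the
# policy is an AND/OR tree over attribute leaves, and a key "decrypts" only
# if its attribute set satisfies the tree.
def satisfies(policy, attributes):
    """Evaluate a tiny AND/OR policy tree against a set of attributes."""
    op = policy[0]
    if op == "ATTR":
        return policy[1] in attributes
    if op == "AND":
        return all(satisfies(p, attributes) for p in policy[1:])
    if op == "OR":
        return any(satisfies(p, attributes) for p in policy[1:])
    raise ValueError(f"unknown node {op!r}")

policy = ("AND", ("ATTR", "dept:finance"),
                 ("OR", ("ATTR", "role:admin"), ("ATTR", "role:auditor")))
print(satisfies(policy, {"dept:finance", "role:auditor"}))  # True
print(satisfies(policy, {"dept:finance", "role:intern"}))   # False
```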


Author(s): Ying Wang ◽ Yiding Liu ◽ Minna Xia

Big data is characterized by multiple sources and heterogeneity. Based on the Hadoop and Spark big data platforms, a hybrid forest fire analysis system is built in this study. The platform combines big data analysis and processing technology and draws on research results from related technical fields, such as forest fire monitoring. In this system, Hadoop's HDFS stores the various kinds of data, the Spark module provides the big data analysis methods, and visualization tools such as ECharts, ArcGIS, and Unity3D present the analysis results. Finally, an experiment on forest fire point detection is designed to verify the feasibility and effectiveness of the platform and to provide guidance for follow-up research and for establishing a big data platform for forest fire monitoring and visualized early warning. The experiment has two shortcomings, however: more data types should be included, and compatibility would be better if the original data were converted to XML format. These problems are expected to be addressed in follow-up research.
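As a hypothetical illustration of the Spark side of such a platform, the sketch below reads sensor records from HDFS and flags candidate fire points with a naive rule. The file path, column names, and thresholds are assumptions for illustration, not taken from the study.

```python
# Hypothetical PySpark sketch: load observations from HDFS and flag hot, dry
# readings as fire-point candidates. Schema and thresholds are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("forest-fire-detection").getOrCreate()

# Load raw observations stored on HDFS (path and columns are assumptions).
obs = spark.read.csv("hdfs:///forest/sensors.csv", header=True, inferSchema=True)

# A naive rule-based detector: hot and dry readings become candidates.
candidates = obs.filter((F.col("temperature_c") > 45) & (F.col("humidity_pct") < 20))

candidates.select("station_id", "lat", "lon", "temperature_c").show()
```

The filtered coordinates could then be handed to a visualization layer such as ECharts or ArcGIS, as the abstract describes.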


2020 ◽ Vol 4 (2) ◽ pp. 5
Author(s): Ioannis C. Drivas ◽ Damianos P. Sakas ◽ Georgios A. Giannakopoulos ◽ Daphne Kyriaki-Manessi

In the Big Data era, search engine optimization deals with the encapsulation of datasets related to website performance in terms of architecture, content curation, and user behavior, with the purpose of converting them into actionable insights and improving visibility and findability on the Web. In this respect, big data analytics expands the opportunities for developing new methodological frameworks composed of valid, reliable, and consistent analytics that are practically useful for developing well-informed strategies for organic traffic optimization. In this paper, a novel methodology is implemented to increase organic search engine visits based on the impact of multiple SEO factors. To achieve this, the authors examined 171 cultural heritage websites and the analytics retrieved about their performance and on-site user experience. Massive Web-based collections are presented by cultural heritage organizations through their websites. Users interact with these collections, producing behavioral analytics of a variety of data types that come from multiple devices, with high velocity, in large volumes. Nevertheless, prior research indicates that these massive cultural collections are difficult to browse and suffer low visibility and findability in the semantic Web era. Against this backdrop, this paper proposes the computational development of a search engine optimization (SEO) strategy that utilizes the generated big cultural data analytics to improve the visibility of cultural heritage websites. Going one step further, the statistical results of the study are integrated into a two-stage predictive model. First, a fuzzy cognitive mapping process is generated as an aggregated macro-level descriptive model. Second, a micro-level, data-driven agent-based model follows. The purpose of the model is to predict the most effective combinations of factors for achieving enhanced visibility and organic traffic on cultural heritage organizations' websites. To this end, the study contributes to expanding the knowledge of researchers and practitioners in the big cultural analytics sector toward implementing strategies for greater visibility and findability of cultural collections on the Web.
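A fuzzy cognitive map of the kind used in the first modeling stage can be sketched in a few lines: concepts hold activation levels in [0, 1], and each iteration propagates activations through a signed weight matrix and a squashing function. The concepts, weights, and update rule below are illustrative assumptions, not the paper's fitted model.

```python
# Minimal fuzzy cognitive map (FCM) iteration sketch.
import numpy as np

def sigmoid(x, lam=1.0):
    """Squashing function keeping activations in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-lam * x))

# W[j, i] = influence of concept j on concept i.
# Concepts: [content quality, page speed, backlinks, organic traffic]
W = np.array([
    [0.0, 0.0, 0.3, 0.6],
    [0.0, 0.0, 0.0, 0.4],
    [0.2, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],   # organic traffic is treated as an output
])

state = np.array([0.7, 0.5, 0.4, 0.2])  # initial activation levels
for _ in range(20):                      # iterate toward a fixed point
    state = sigmoid(state @ W + state)   # A(t+1) = f(A(t) + A(t) W)
print(state.round(3))
```

Running such a map to convergence under different initial factor activations is one way to compare candidate SEO factor combinations at the macro level.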


2021 ◽ Vol ahead-of-print (ahead-of-print)
Author(s): Zachary A. Collier ◽ Matthew D. Wood ◽ Dale A. Henderson

Purpose: Trust entails the assumption of risk by the trustor to the extent that the trustee may act in a manner unaligned with the trustor's interests. Before a strategic alliance is formed, each firm formulates a subjective assessment of whether the other firm will behave in a trustworthy manner and not act opportunistically. To inform this partner analysis and selection process, the authors leverage the concept of value of information to quantify the benefit of information gathering activities regarding the trustworthiness of a potential trustee.
Design/methodology/approach: In this paper, the authors develop a decision model that explicitly operationalizes trust as the subjective probability that a trustee will act in a trustworthy manner. The authors integrate the concept of value of information related to information gathering activities that would inform a trustor about a trustee's trustworthiness.
Findings: Trust inherently involves some degree of risk, and the authors find that there is practical value in carrying out information gathering activities to facilitate the partner analysis process. The authors present a list of trustworthiness indicators, along with a scoring sheet, to facilitate learning more about a potential strategic alliance partner.
Originality/value: The need for a quantitative model that can support risk-based strategic alliance decision-making for partner analysis represents a research gap in the literature. Modeling strategic alliance partner analysis decisions from a value of information (VOI) perspective adds a contribution to the trust literature.
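The value-of-information idea at the core of the model can be made concrete with a toy expected-value calculation: compare the best decision under the prior belief about trustworthiness with the decision made under perfect information. All numbers below are made-up assumptions for illustration; the paper's own model and indicators are richer.

```python
# Toy value-of-information calculation for partner selection.
p_trust = 0.6              # trustor's prior that the partner is trustworthy
ally_good = 100.0          # payoff: form alliance, partner behaves well
ally_bad = -80.0           # payoff: form alliance, partner acts opportunistically
walk_away = 0.0            # payoff: no alliance

# Best expected payoff when acting on the prior alone.
ev_ally = p_trust * ally_good + (1 - p_trust) * ally_bad
ev_prior = max(ev_ally, walk_away)

# With perfect information, ally only when the partner is trustworthy.
ev_perfect = p_trust * ally_good + (1 - p_trust) * walk_away

evpi = ev_perfect - ev_prior   # expected value of perfect information
print(f"EV(prior)={ev_prior}  EV(perfect)={ev_perfect}  EVPI={evpi}")
```

Under these assumed numbers, perfect information about trustworthiness is worth 32 payoff units, which bounds what the trustor should rationally spend on screening activities such as the scoring sheet the authors propose.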


2021 ◽ Vol 251 ◽ pp. 01030
Author(s): Qinqi Kang ◽ Zhao Kang

With the rapid development of artificial intelligence in the current era of big data, the construction of translation corpora has become a key factor in effectively achieving highly intelligent translation. In the era of big data, the data sources and data types of translation corpora are becoming increasingly diversified, which will inevitably bring about a new revolution in corpus construction. Translation corpus construction in the big data era can draw fully on third-party open-source data, crowd-sourced translation, machine closed-loop processing, human-machine collaboration, and other modes to comprehensively improve the quality of translation corpora and better serve translation practice.


2016 ◽ Vol 16 (3) ◽ pp. 232-256
Author(s): Hans-Jörg Schulz ◽ Thomas Nocke ◽ Magnus Heitzler ◽ Heidrun Schumann

Visualization has become an important ingredient of data analysis, supporting users in exploring data and confirming hypotheses. At the beginning of a visual data analysis process, data characteristics are often assessed in an initial data profiling step. These include, for example, statistical properties of the data and information on the data’s well-formedness, which can be used during the subsequent analysis to adequately parametrize views and to highlight or exclude data items. We term this information data descriptors, which can span such diverse aspects as the data’s provenance, its storage schema, or its uncertainties. Gathered descriptors encapsulate basic knowledge about the data and can thus be used as objective starting points for the visual analysis process. In this article, we bring together these different aspects in a systematic form that describes the data itself (e.g. its content and context) and its relation to the larger data gathering and visual analysis process (e.g. its provenance and its utility). Once established in general, we further detail the concept of data descriptors specifically for tabular data as the most common form of structured data today. Finally, we utilize these data descriptors for tabular data to capture domain-specific data characteristics in the field of climate impact research. This procedure from the general concept via the concrete data type to the specific application domain effectively provides a blueprint for instantiating data descriptors for other data types and domains in the future.
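One plausible way to instantiate data descriptors for tabular data is a small profiling routine that records provenance alongside per-column type, missing-value, and summary-statistic information. The descriptor fields below are one reading of the concept, not the authors' schema.

```python
# A hedged sketch of data descriptors for one CSV table: provenance plus
# per-column inferred type, missing-value count, and simple statistics.
from dataclasses import dataclass, field
import csv
import statistics

@dataclass
class ColumnDescriptor:
    name: str
    inferred_type: str       # "numeric" or "categorical"
    missing: int             # count of empty cells
    stats: dict = field(default_factory=dict)

def profile(path: str, provenance: str) -> dict:
    """Compute simple descriptors for every column of a CSV file."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    columns = []
    for col in (rows[0].keys() if rows else []):
        values = [r[col] for r in rows]
        present = [v for v in values if v not in ("", None)]
        missing = len(values) - len(present)
        try:
            nums = [float(v) for v in present]
            stats = ({"mean": statistics.mean(nums),
                      "min": min(nums), "max": max(nums)} if nums else {})
            columns.append(ColumnDescriptor(col, "numeric", missing, stats))
        except ValueError:
            columns.append(ColumnDescriptor(col, "categorical", missing,
                                            {"distinct": len(set(present))}))
    return {"provenance": provenance, "n_rows": len(rows), "columns": columns}
```

Descriptors gathered this way could then parametrize views, for example by excluding columns whose missing-value counts exceed a threshold.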


Author(s): Nada M. Alhakkak

BigGIS is a new product that resulted from developing GIS in the "Big Data" area; it is used to store and process big geographical data and helps solve its issues. This chapter describes M2BG, an optimized big GIS framework in a MapReduce environment. The suggested framework is integrated into the MapReduce environment in order to solve the storage issues and benefit from the Hadoop environment. M2BG comprises two steps: the Big GIS warehouse and Big GIS MapReduce. The first step contains three main layers: the Data Source and Storage Layer (DSSL), the Data Processing Layer (DPL), and the Data Analysis Layer (DAL). The second step is responsible for clustering, using swarms as inputs for the Hadoop phase. Tasks are then scheduled in the map part using a preemptive priority scheduling algorithm, in which some data types are classified as critical and others as ordinary, while the reduce part uses a merge-sort algorithm. In future work, M2BG should address security and be implemented with real data, first in a simulated environment and later in the real world.
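The preemptive priority idea attributed to the map part can be sketched with a two-level priority queue in which critical data types are dispatched before ordinary ones. The task names and the two priority levels are assumptions for illustration, not the M2BG design.

```python
# Two-level priority dispatch: "critical" map tasks always leave the queue
# before "ordinary" ones; FIFO order is preserved within each level.
import heapq
import itertools

CRITICAL, ORDINARY = 0, 1        # lower value = higher priority
tie = itertools.count()          # tie-breaker for FIFO within a level
queue = []

def submit(name, priority):
    heapq.heappush(queue, (priority, next(tie), name))

submit("map: ordinary raster tile 1", ORDINARY)
submit("map: ordinary raster tile 2", ORDINARY)
submit("map: critical boundary layer", CRITICAL)  # jumps ahead on dispatch

while queue:
    _, _, name = heapq.heappop(queue)
    print("dispatch:", name)
# dispatch: map: critical boundary layer
# dispatch: map: ordinary raster tile 1
# dispatch: map: ordinary raster tile 2
```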


Web Services ◽ 2019 ◽ pp. 1430-1443
Author(s): Louise Leenen ◽ Thomas Meyer

Governments, military forces, and other organisations responsible for cybersecurity deal with vast amounts of data that have to be understood in order to enable intelligent decision making. Because of the vast amounts of information pertinent to cybersecurity, automation is required for processing and decision making, specifically to provide advance warning of possible threats. The ability to detect patterns in vast data sets, and to understand the significance of detected patterns, is essential in the cyber defence domain. Big data technologies supported by semantic technologies can improve cybersecurity, and thus cyber defence, by supporting the processing and understanding of the huge amounts of information in the cyber environment. The term big data analytics refers to advanced analytic techniques such as machine learning, predictive analysis, and other intelligent processing techniques applied to large data sets that contain different data types, with the purpose of detecting patterns, correlations, trends, and other useful information. Semantic technology is a knowledge representation paradigm in which the meaning of data is encoded separately from the data itself. The use of semantic technologies such as logic-based systems to support decision making is becoming increasingly popular. However, most automated systems are currently based on syntactic rules, which are generally not sophisticated enough to deal with the complexity of the decisions that must be made. Incorporating semantic information allows for increased understanding and sophistication in cyber defence systems. This paper argues that both big data analytics and semantic technologies are necessary to provide countermeasures against cyber threats. An overview of the use of semantic technologies and big data technologies in cyber defence is provided, and important areas for future research in the combined domains are discussed.
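The separation of meaning from data can be shown with a few RDF triples: a small schema states that a port scan is a kind of reconnaissance and reconnaissance is a kind of threat, and a SPARQL query then finds a plain port-scan event as a threat via subclass reasoning. The ontology terms are invented for the example; the sketch assumes the rdflib Python library.

```python
# Tiny semantic-technology illustration: schema triples carry the meaning,
# data triples carry the observations, and a query bridges the two.
from rdflib import Graph, Namespace, RDF, RDFS

CYB = Namespace("http://example.org/cyber#")
g = Graph()

# Schema (the "semantics"): port scans are reconnaissance, which is a threat.
g.add((CYB.PortScan, RDFS.subClassOf, CYB.Reconnaissance))
g.add((CYB.Reconnaissance, RDFS.subClassOf, CYB.Threat))

# Data: one observed event, typed against the schema.
g.add((CYB.event42, RDF.type, CYB.PortScan))

# The query finds the event as a Threat via the subclass chain.
q = """
SELECT ?e WHERE {
  ?e a ?c .
  ?c rdfs:subClassOf* <http://example.org/cyber#Threat> .
}"""
for row in g.query(q, initNs={"rdfs": RDFS}):
    print(row.e)   # http://example.org/cyber#event42
```

A purely syntactic rule matching only the literal type PortScan would miss new attack classes added to the schema, which is the gap semantic reasoning is meant to close.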


Web Services ◽ 2019 ◽ pp. 105-126
Author(s): N. Nawin Sona

This chapter aims to give an overview of the wide range of big data approaches and technologies today. The data features of volume, velocity, and variety are examined against new database technologies. It explores the complexity of data types; methodologies of storage, access, and computation; current and emerging trends of data analysis; and methods of extracting value from data. It aims to address the need for clarity regarding the future of RDBMS versus the newer systems, and it highlights methods such as machine learning, data mining, and predictive analytics through which actionable insights can be built in public sector domains.

