Using Apache Hadoop | ScienceGate

Apache Hadoop A Guide for Cluster Configuration and Testing

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v7i4.792796 ◽

2019 ◽

Vol 7 (4) ◽

pp. 792-796

Author(s):

Ankit Shah ◽

Mamta Padole

Keyword(s):

Apache Hadoop ◽

Cluster Configuration

Download Full-text

HISTORICAL ANALYSIS OF MESSAGE CONTENTS TO RECOMMEND ISSUES TO OPEN SOURCE SOFTWARE CONTRIBUTORS

Revista Eletrônica de Sistemas de Informação ◽

10.21529/resi.2014.1302005 ◽

2014 ◽

Vol 13 (2) ◽

Cited By ~ 1

Author(s):

Igor Fabio Steinmacher ◽

Igor S Wiese ◽

Andre Luis Schwerz ◽

Rafael Liberato Roberto ◽

João Eduardo Ferreira ◽

...

Keyword(s):

Open Source ◽

Open Source Software ◽

Naive Bayes ◽

Historical Analysis ◽

Naïve Bayes ◽

Apache Hadoop

Os desenvolvedores de projetos de software livre distribuídos utilizam ferramentas de acompanhamento de pendências para coordenar o seu trabalho. Essas ferramentas armazenam informações importantes, mantendo registro de decisões importantes e soluções para bugs. Decidir sobre que pendências são as mais adequadas para se contribuir pode ser difícil, uma vez que a elevada quantidade de dados aumenta a pressão sobre os desenvolvedores. Este artigo mostra a importância do conteúdo das discussões que ocorrem por meio da ferramenta de acompanhamento de pendências em um projeto de software livre para a construção de um classificador para predizer a participação de um colaborador na solução de um problema. Para projetar este modelo de predição, utilizamos dois algoritmos de aprendizagem de máquina: Naïve Bayes e J48. Utilizamos dados do projeto Apache Hadoop Commons para avaliar o uso dos algoritmos. Aplicando algoritmos de aprendizado de máquina aos dez desenvolvedores mais ativos no projeto, obtivemos uma média de recall de 66,82% para Naïve Bayes e 53,02% usando J48. Obtivemos 64,31% de precisão e 90,27% de acurácia usando o J48. Também realizamos um estudo exploratório com cinco desenvolvedores que participaram na solução de um volume menor de problemas , obtendo 77,41% de precisão, 48% de recall, e 98,84% de acurácia usando o algoritmo J48. Os resultados indicam que o conteúdo dos comentários em pendências/ problemas em projetos de software livre representam um fator relevante com base no qual recomendar pendências aos desenvolvedores que colaboram com o projeto.

Download Full-text

Public Opinion Knowledge (POK) platform based on apache hadoop: To get public opinion from French content published on the Web/CSM

2016 2nd International Conference on Cloud Computing Technologies and Applications (CloudTech) ◽

10.1109/cloudtech.2016.7847689 ◽

2016 ◽

Author(s):

Abdelkader Rhouati ◽

El Hassane Ettifouri ◽

Mohammed Ghaouth Belkasmi ◽

Toumi Bouchentouf

Keyword(s):

Public Opinion ◽

Apache Hadoop ◽

The Web

Download Full-text

Conversion and Display of the Calculated Data of Spectral Remote Sensing Data on the Basis of GeoServer Extensions and Distributed Storage Technologies

PROGRAMMNAYA INGENERIA ◽

10.17587/prin.12.107-112 ◽

2021 ◽

Vol 12 (2) ◽

pp. 107-112

Author(s):

I. E. Kharlampenkov ◽

◽

A. U. Oshchepkov ◽

Keyword(s):

Remote Sensing ◽

Information Technologies ◽

File System ◽

Distributed Storage ◽

Calculated Data ◽

Remote Sensing Data ◽

Distributed Computing Systems ◽

Apache Hadoop ◽

Distributed Information ◽

Sensing Data

The article presents methods for caching and displaying data from spectral satellite images using libraries of distributed computing systems that are part of the Apache Hadoop ecosystem, and GeoServer extensions. The authors gave a brief overview of existing tools that provide the ability to present remote sensing data using distributed information technologies. A distinctive feature is the way to convert remote sensing data inside Apache Parquet files for further display. This approach allows you to interact with the distributed file system via the Kite SDK libraries and switch on additional data processors based on Apache Hadoop technology as external services. A comparative analysis of existing tools, such as: GeoMesa, GeoWawe, etc is performed. The following steps are described: extracting data from Apache Parquet via the Kite SDK, converting this data to GDAL Dataset, iterating the received data, and saving it inside the file system in BIL format. In this article, the BIL format is used for the GeoServer cache. The extension was implemented and published under the Apache License on the GitHub resource. In conclusion, you will find instructions for installing and using the created extension.

Download Full-text

Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark

Journal of Systems and Software ◽

10.1016/j.jss.2016.11.037 ◽

2017 ◽

Vol 125 ◽

pp. 133-151 ◽

Cited By ~ 51

Author(s):

Ilias Mavridis ◽

Helen Karatza

Keyword(s):

Performance Evaluation ◽

Apache Spark ◽

Apache Hadoop ◽

Log File Analysis ◽

Log File

Download Full-text

Key usage patterns for apache Hadoop in the enterprise

2013 IEEE International Conference on Big Data ◽

10.1109/bigdata.2013.6691544 ◽

2013 ◽

Author(s):

Amr Awadallah

Keyword(s):

Apache Hadoop ◽

Usage Patterns

Download Full-text

Efficient Transfer of data from RDBMS to HDFS and conversion to JSON format

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.38710 ◽

2021 ◽

Vol 9 (10) ◽

pp. 1869-1871

Author(s):

Dr. C. K. Gomathy

Keyword(s):

Data Warehouse ◽

Relational Databases ◽

Command Line ◽

Command Line Interface ◽

Apache Hadoop ◽

Enterprise Data Warehouse ◽

Efficient Transfer ◽

Efficient Execution

Abstract: Apache Sqoop is mainly used to efficiently transfer large volumes of data between Apache Hadoop and relational databases. It helps to certain tasks, such as ETL (Extract transform load) processing, from an enterprise data warehouse to Hadoop, for efficient execution at a much less cost. Here first we import the table which presents in MYSQL Database with the help of command-line interface application called Sqoop and there is a chance of addition of new rows and updating new rows then we have to execute the query again. So, with the help of our project there is no need of executing queries again for that we are using Sqoop job, which consists of total commands for import and next after import we retrieve the data from hive using Java JDBC and we convert the data to JSON Format, which consists of data in an organized way and easy to access manner by using GSON Library. Keywords: Sqoop, Json, Gson, Maven and JDBC

Download Full-text

Perbandingan Kinerja Komputasi Hadoop dan Spark untuk Memprediksi Cuaca (Studi Kasus : Storm Event Database)

Repositor ◽

10.22219/repositor.v2i4.93 ◽

2020 ◽

Vol 2 (4) ◽

pp. 463

Author(s):

Rendiyono Wahyu Saputro ◽

Aminuddin Aminuddin ◽

Yuda Munarko

Keyword(s):

Internet Of Things ◽

Distributed Computing ◽

Apache Spark ◽

Data Sources ◽

User Generated Content ◽

Storm Event ◽

Test Results ◽

Process Data ◽

Apache Hadoop ◽

Database Technology

AbstrakPerkembangan teknologi telah mengakibatkan pertumbuhan data yang semakin cepat dan besar setiap waktunya. Hal tersebut disebabkan oleh banyaknya sumber data seperti mesin pencari, RFID, catatan transaksi digital, arsip video dan foto, user generated content, internet of things, penelitian ilmiah di berbagai bidang seperti genomika, meteorologi, astronomi, fisika, dll. Selain itu, data - data tersebut memiliki karakteristik yang unik antara satu dengan lainnya, hal ini yang menyebabkan tidak dapat diproses oleh teknologi basis data konvensional.Oleh karena itu, dikembangkan beragam framework komputasi terdistribusi seperti Apache Hadoop dan Apache Spark yang memungkinkan untuk memproses data secara terdistribusi dengan menggunakan gugus komputer.Adanya ragam framework komputasi terdistribusi, sehingga diperlukan sebuah pengujian untuk mengetahui kinerja komputasi keduanya. Pengujian dilakukan dengan memproses dataset dengan beragam ukuran dan dalam gugus komputer dengan jumlah node yang berbeda. Dari semua hasil pengujian, Apache Hadoop memerlukan waktu yang lebih sedikit dibandingkan dengan Apache Spark. Hal tersebut terjadi karena nilai throughput dan throughput/node Apache Hadoop lebih tinggi daripada Apache Spark.AbstractTechnological developments have resulted in rapid and growing data growth every time. This is due to the large number of data sources such as search engines, RFID, digital transaction records, video and photo archives, user generated content, internet of things, scientific research in areas such as genomics, meteorology, astronomy, physics, In addition, these data have unique characteristics of each other, this is the cause can not be processed by conventional database technology. Therefore, developed various distributed computing frameworks such as Apache Hadoop and Apache Spark that enable to process data in a distributed by using computer cluster.The existence of various frameworks of distributed computing, so required a test to determine the performance of both computing. Testing is done by processing datasets of various sizes and in clusters of computers with different number of nodes. Of all the test results, Apache Hadoop takes less time than the Apache Spark. This happens because the value of throuhgput and throughput / node Apache Hadoop is higher than Apache Spark.

Download Full-text

Sistema de Computação Paralela e Distribuída Utilizando Raspberry Pi e Apache Hadoop

10.5753/epoca.2018.13460 ◽

2018 ◽

Author(s):

Hemerson Pontes ◽

Gilvandro De Medeiros ◽

Joanderson Borges ◽

Helton Maia

Keyword(s):

Big Data ◽

Raspberry Pi ◽

Apache Hadoop

No contexto de Big Data, o grande fluxo e a complexidade dos dados gerados exigem elevado custo computacional para tarefas de processamento e extração de informação, sendo um desafio concluir tais execuções em tempo hábil para tomadas de decisões técnicas ou empresariais. No entanto, em clusters computacionais, pode-se gerenciar e distribuir pacotes de dados entre diferentes unidades de processamento, tornando-se possível e viável trabalhar com um grande volume de dados, processando-os de forma paralela e distribuída. Portanto, o presente trabalho se dispõe a construir a infraestrutura de um cluster e estudar seu funcionamento utilizando, para isso, a ferramenta Apache Hadoop para o processamento distribuído de dados.

Download Full-text

Processing LIDAR Data with Apache Hadoop

Lecture Notes in Geoinformation and Cartography - The Rise of Big Spatial Data ◽

10.1007/978-3-319-45123-7_25 ◽

2016 ◽

pp. 351-358

Author(s):

Jan Růžička ◽

Lukáš Orčík ◽

Kateřina Růžičková ◽

Juraj Kisztner

Keyword(s):

Lidar Data ◽

Apache Hadoop

Download Full-text