Processing Big Data with Apache Hadoop in the Current Challenging Era of COVID-19

2021 ◽  
Vol 5 (1) ◽  
pp. 12
Author(s):  
Otmane Azeroual ◽  
Renaud Fabre

Big data have become a global strategic issue, as increasingly large amounts of unstructured data challenge the IT infrastructure of global organizations and threaten their capacity for strategic forecasting. As experienced in former massive information issues, big data technologies, such as Hadoop, should efficiently tackle the incoming large amounts of data and provide organizations with relevant processed information that was formerly neither visible nor manageable. After having briefly recalled the strategic advantages of big data solutions in the introductory remarks, in the first part of this paper, we focus on the advantages of big data solutions in the currently difficult time of the COVID-19 pandemic. We characterize it as an endemic heterogeneous data context; we then outline the advantages of technologies such as Hadoop and its IT suitability in this context. In the second part, we identify two specific advantages of Hadoop solutions, globality combined with flexibility, and we notice that they are at work with a “Hadoop Fusion Approach” that we describe as an optimal response to the context. In the third part, we justify selected qualifications of globality and flexibility by the fact that Hadoop solutions enable comparable returns in opposite contexts of models of partial submodels and of models of final exact systems. In part four, we remark that in both these opposite contexts, Hadoop’s solutions allow a large range of needs to be fulfilled, which fits with requirements previously identified as the current heterogeneous data structure of COVID-19 information. In the final part, we propose a framework of strategic data processing conditions. To the best of our knowledge, they appear to be the most suitable to overcome COVID-19 massive information challenges.

Author(s):  
Alicia VALDEZ-MENCHACA ◽  
Laura VAZQUEZ-DE LOS SANTOS ◽  
Griselda CORTES-MORALES ◽  
Ana PAIZ-RIVERA

The objective of this research project is the design and implementation of a technological strategy for the use of big data technologies such as Apache Hadoop, together with its supporting software projects, that prepares medium-sized companies for new innovative technologies. The methodology includes an analysis of big data best practices and of the software needed to design and configure a big data environment on a Linux server for the technological proposal. As a first result, a roadmap for installing and configuring Hadoop on a Linux virtual machine has been obtained, as well as the proposed technological strategy, whose main components are: analysis of the technological architecture, selection of the processes or data to be analyzed, and installation of Hadoop, among others.
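As a companion to such a roadmap, a quick way to verify that a newly configured single-node Hadoop installation is usable is to exercise HDFS from a script. The following Python sketch is only a minimal smoke test under the assumptions that the `hdfs` command-line client is on the PATH and the HDFS daemons are running; the test directory is arbitrary.

```python
import subprocess
import tempfile

def run(cmd):
    """Run an HDFS shell command and fail loudly if it errors."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

def smoke_test_hdfs(test_dir="/tmp/hadoop_smoke_test"):
    """Write, list, and read back a small file to confirm HDFS is usable.
    Assumes the `hdfs` binary is on PATH and the daemons are running."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("hello hadoop\n")
        local_path = f.name

    run(["hdfs", "dfs", "-mkdir", "-p", test_dir])            # create a scratch dir
    run(["hdfs", "dfs", "-put", "-f", local_path, test_dir])  # upload the sample file
    run(["hdfs", "dfs", "-ls", test_dir])                     # confirm it is listed
    run(["hdfs", "dfs", "-cat", f"{test_dir}/*"])             # read the contents back
    run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", test_dir]) # clean up

if __name__ == "__main__":
    smoke_test_hdfs()
```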


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Gousiya Begum ◽  
S. Zahoor Ul Huq ◽  
A. P. Siva Kumar

Abstract Extensive use of Internet-based applications in day-to-day life has led to the generation of huge amounts of data every minute. Apart from humans, data is generated by machines such as sensors, satellites, CCTV, etc. This huge collection of heterogeneous data is often referred to as Big Data and can be processed to draw useful insights. Apache Hadoop has emerged as a widely used open-source software framework for Big Data processing; it runs on a cluster of cooperating computers enabling distributed parallel processing. The Hadoop Distributed File System (HDFS) stores data blocks replicated and spanned across different nodes. HDFS applies AES-based cryptographic techniques at block level that are transparent and end-to-end in nature. While cryptography protects the data blocks from unauthorized access, a legitimate user can still harm the data, for example by executing malicious MapReduce jar files that damage the data in HDFS. We developed a mechanism in which every MapReduce jar is tested by our sandbox security layer to ensure it is not malicious, and suspicious jar files are not allowed to process the data in HDFS. This feature is not present in the existing Apache Hadoop framework, and our work is made available on GitHub for consideration and inclusion in future versions of Apache Hadoop.
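The abstract does not describe the internals of the sandbox, so the Python sketch below is only a hypothetical illustration of a static pre-submission gate, not the authors' mechanism: it checksums a MapReduce jar against an operator-maintained allowlist and scans its entries for obviously suspicious contents before the jar would be handed to `hadoop jar`. The allowlist file name, the suspicious-entry heuristic, and the rejection policy are all assumptions.

```python
import hashlib
import sys
import zipfile

ALLOWLIST_FILE = "approved_jars.sha256"      # hypothetical operator-maintained list
SUSPICIOUS_ENTRIES = (".sh", ".exe", ".so")  # payloads a pure MapReduce jar rarely needs

def sha256_of(path):
    """Hash the jar so previously vetted builds can be recognized."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def load_allowlist():
    try:
        with open(ALLOWLIST_FILE) as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def vet_jar(jar_path):
    """Return True if the jar may be submitted to the cluster."""
    if sha256_of(jar_path) in load_allowlist():
        return True                          # previously approved build
    with zipfile.ZipFile(jar_path) as jar:   # a jar is a zip archive
        bad = [n for n in jar.namelist() if n.lower().endswith(SUSPICIOUS_ENTRIES)]
    if bad:
        print(f"Rejected {jar_path}: suspicious entries {bad}")
        return False
    print(f"{jar_path} is not on the allowlist; route it to a sandbox run first.")
    return False

if __name__ == "__main__":
    sys.exit(0 if vet_jar(sys.argv[1]) else 1)
```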


2016 ◽  
Author(s):  
Marco Moscatelli ◽  
Matteo Gnocchi ◽  
Andrea Manconi ◽  
Luciano Milanesi

Motivation Nowadays, advances in technology have resulted in a huge amount of data in both biomedical research and healthcare systems. This growing amount of data gives rise to the need for new research methods and analysis techniques. Analysis of these data offers new opportunities to define novel diagnostic processes. Therefore, greater integration between healthcare and biomedical data is essential to devise novel predictive models in the field of biomedical diagnosis. In this context, the digitalization of clinical exams and medical records is becoming essential to collect heterogeneous information. Analysis of these data by means of big data technologies will allow a more in-depth understanding of the mechanisms leading to diseases, and at the same time it will facilitate the development of novel diagnostics and personalized therapeutics. The recent application of big data technologies in the medical field offers new opportunities to integrate enormous amounts of medical and clinical information from population studies. Therefore, it is essential to devise new strategies for storing and accessing the data in a standardized way. Moreover, it is important to provide suitable methods to manage these heterogeneous data. Methods In this work, we present a new information technology infrastructure devised to efficiently manage huge amounts of heterogeneous data for disease prevention and precision medicine. A test set based on data produced by a clinical and diagnostic laboratory has been built to set up the infrastructure. When working with clinical data, it is essential to ensure the confidentiality of sensitive patient data. Therefore, the setup phase has been carried out using "anonymous data". To this end, specific techniques have been adopted to ensure a high level of privacy in the correlation of the medical records with important secondary information (e.g., date of birth, place of residence). It should be noted that the rigidity of relational databases does not lend itself to the nature of these data. In our opinion, better results can be obtained using non-relational (NoSQL) databases. Starting from these considerations, the infrastructure has been developed on a NoSQL database with the aim of combining scalability and flexibility. In particular, MongoDB [1] has been used as it is better suited to managing different types of data at large scale. In doing so, the infrastructure is able to provide optimized management of huge amounts of heterogeneous data while ensuring high speed of analysis. Results The presented infrastructure exploits big data technologies in order to overcome the limitations of relational databases when working with large and heterogeneous data. The infrastructure implements a set of interface procedures aimed at preparing the metadata for importing data into a NoSQL DB. Abstract truncated at 3,000 characters; the full version is available in the PDF file.
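To make the idea concrete, a minimal sketch of importing a pseudonymized laboratory record into MongoDB is shown below. The collection name, field layout, salt handling, and coarsening rules are illustrative assumptions, not the authors' actual anonymization procedure.

```python
import hashlib
from datetime import date

from pymongo import MongoClient  # pip install pymongo

SALT = "replace-with-a-secret-salt"   # in practice, kept outside the codebase

def pseudonymize(patient_id: str) -> str:
    """One-way pseudonym so records can be correlated without exposing the ID."""
    return hashlib.sha256((SALT + patient_id).encode()).hexdigest()

def import_record(db, raw: dict) -> None:
    """Store a lab record with direct identifiers removed or coarsened."""
    doc = {
        "patient_pseudo_id": pseudonymize(raw["patient_id"]),
        "birth_year": raw["date_of_birth"].year,          # keep year only
        "residence_area": raw["place_of_residence"][:2],  # coarse area code
        "exam_type": raw["exam_type"],
        "results": raw["results"],                        # schema-free payload
    }
    db.lab_records.insert_one(doc)

if __name__ == "__main__":
    client = MongoClient("mongodb://localhost:27017")
    db = client["clinical_test_set"]
    db.lab_records.create_index("patient_pseudo_id")
    import_record(db, {
        "patient_id": "P-000123",
        "date_of_birth": date(1970, 5, 14),
        "place_of_residence": "MI-Milano",
        "exam_type": "blood_panel",
        "results": {"hb": 13.9, "wbc": 6.1},
    })
```

The schema-free "results" field is where a document store pays off: different exam types can carry different result structures without schema migrations.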


Author(s):  
Hana Mallek ◽  
Faiza Ghozzi ◽  
Faiez Gargouri

Big Data emerged after an explosion of data from Web 2.0, digital sensors, and social media applications such as Facebook, Twitter, etc. Amid this constant growth of data, many domains are affected, especially the decision support system domain, where integration processes should be adapted to support this huge amount of data and improve analysis goals. The basic purpose of this research article is to adapt extract-transform-load (ETL) processes to Big Data technologies, in order to support not only this evolution of data but also knowledge discovery. In this article, a new approach called Big Dimensional ETL (BigDimETL) is proposed to handle the basic ETL operations while taking the multidimensional structure into account. To accelerate data handling, the MapReduce paradigm is used to enhance data warehousing capabilities, with HBase as a distributed storage mechanism. Experimental results confirm that the ETL operations perform well, especially with the adapted operations.
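The BigDimETL operators themselves are not detailed in the abstract, so the following Python sketch only illustrates the general pattern of a MapReduce-style ETL step in the Hadoop Streaming idiom: raw order lines are projected onto a (date, product) dimension key and the amount measure is aggregated per key. The input layout, key choice, and file name are assumptions.

```python
#!/usr/bin/env python3
"""Hadoop Streaming sketch of one dimensional ETL step (illustrative only)."""
import sys

def mapper():
    # Extract + transform: raw CSV line -> tab-separated (dimension_key, measure).
    for line in sys.stdin:
        try:
            order_id, order_date, product_id, amount = line.rstrip("\n").split(",")
        except ValueError:
            continue                      # skip malformed records
        print(f"{order_date}|{product_id}\t{amount}")

def reducer():
    # Load-side aggregation: sum the measure for each dimension key
    # (Hadoop Streaming delivers the mapper output sorted by key).
    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```

With Hadoop Streaming, such a script would be passed as both the -mapper and the -reducer of a job, and the aggregated rows could then be bulk-loaded into an HBase fact table keyed by the dimension key; the file name and table design here are hypothetical.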


2021 ◽  
Vol 348 ◽  
pp. 01003
Author(s):  
Abdullayev Vugar Hacimahmud ◽  
Ragimova Nazila Ali ◽  
Khalilov Matlab Etibar

The volume of information in the 21st century is growing at a rapid pace, and big data technologies are used to process modern information. This article discusses the use of big data technologies to implement the monitoring of social processes. Big data has its own characteristics and principles, which are reflected here. In addition, we also discuss big data applications in several areas. Particular attention is paid to the interactions between big data and sociology; for this, digital sociology and computational social science are considered. One of the main objects of study in sociology is social processes, and the article describes the types of social processes and their monitoring. As an example, monitoring of social processes at a university is implemented. The following technologies are used for the realization of social process monitoring: 1010data products (1010edge, 1010connect, 1010reveal, 1010equities), products of the Apache Software Foundation (Apache Hive, Apache Chukwa, Apache Hadoop, Apache Pig), the MapReduce framework, the R language, the Pandas library, NoSQL, etc. In particular, this article examines the use of the MapReduce model for monitoring social processes at the university.
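As an illustration of how a MapReduce job can underpin such monitoring, the sketch below counts university activity events (e.g., library visits, portal logins) per faculty and event type from a line-based log. The log format, the field names, and the use of the mrjob library are assumptions for the example, not the authors' actual setup.

```python
from mrjob.job import MRJob  # pip install mrjob

class MonitorSocialProcesses(MRJob):
    """Count events per (faculty, event_type) from lines like:
    2021-03-01T10:15;computer_science;library_visit;student_4711
    The field layout is an illustrative assumption."""

    def mapper(self, _, line):
        parts = line.strip().split(";")
        if len(parts) != 4:
            return                       # ignore malformed log lines
        _timestamp, faculty, event_type, _student = parts
        yield (faculty, event_type), 1

    def reducer(self, key, counts):
        yield key, sum(counts)           # total events per faculty and type

if __name__ == "__main__":
    # Runs locally by default; "-r hadoop" submits the same job to a Hadoop cluster.
    MonitorSocialProcesses.run()
```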


Author(s):  
Сергей Юрьевич Золотов ◽  
Игорь Юрьевич Турчановский

An experiment on the use of Apache Big Data technologies in climate system research is described. During the experiment, four variants of a test problem were implemented. Speeding up the computations with Apache Big Data technologies is quite achievable, and the most efficient way to do so was found in the fourth variant of the test problem. The essence of the solution comes down to converting the original datasets into a format suitable for storage in a distributed file system and applying Spark SQL from the Apache Big Data stack for parallel data processing on computing clusters. The core of the Apache Big Data stack consists of two technologies: Apache Hadoop for organizing distributed file storage of unlimited capacity and Apache Spark for organizing parallel computing on computing clusters. The combination of Apache Spark and Apache Hadoop is fully applicable for creating big data processing systems. The main idea implemented by Spark is dividing data into separate parts (partitions) and processing these parts in the memory of many computers connected within a network. Data is sent only when needed, and Spark automatically detects when the exchange will take place. For testing, we chose the problem of calculating the monthly, annual, and seasonal trends in the temperature of the atmosphere of our planet for the period from 1960 to 2010 according to the NCEP/NCAR and JRA-55 reanalysis data. During the experiment, four variants of solving the test problem were implemented. The first variant represents the simplest implementation without parallelism. The second implementation variant assumes parallel reading of data from the local file system, aggregation, and calculation of trends. The third variant was the calculation of the test problem on a two-node cluster. NCEP and JRA-55 reanalysis files were placed in their original format in the Hadoop storage (HDFS), which combines the disk subsystems of two computers. The disadvantage of this variant is loading all reanalysis files completely into the random access memory of the worker process. The solution proposed in the fourth variant is to pre-convert the original file format to a form in which reading from HDFS is selective, based on the specified parameters.
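A minimal PySpark sketch of the fourth variant is shown below: reanalysis fields are assumed to have already been converted to a columnar format (Parquet is used here as an assumption) in HDFS, so Spark SQL can read only the required columns and compute least-squares temperature trends per calendar month. The path, column names, and choice of Parquet are illustrative, not taken from the paper.

```python
from pyspark.sql import SparkSession, functions as F

# Assumed pre-converted layout: Parquet files in HDFS with columns
# (year, month, lat, lon, temp); path and names are illustrative.
spark = (SparkSession.builder
         .appName("reanalysis-temperature-trends")
         .getOrCreate())

temps = (spark.read.parquet("hdfs:///data/reanalysis/ncep_parquet")
         .where((F.col("year") >= 1960) & (F.col("year") <= 2010)))

# Global monthly means; with a columnar format only the needed columns are read.
monthly = (temps.groupBy("year", "month")
           .agg(F.avg("temp").alias("mean_temp")))

# Least-squares trend per calendar month: slope = cov(year, T) / var(year),
# scaled to kelvin per decade.
trends = (monthly.groupBy("month")
          .agg((F.covar_pop("year", "mean_temp") / F.var_pop("year") * 10)
               .alias("trend_K_per_decade")))

trends.orderBy("month").show()
spark.stop()
```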


Author(s):  
N.A. Mironov ◽  
E.A. Maryshev ◽  
N.A. Divueva

The article analyzes the issues of information support for expert-analytical studies in the system of preparing scientific and technological documents in the interests of ensuring the country's defense and the security of the state. Recommendations are proposed on the use of big data technologies for the preparation of expert-analytical documents, formed by experts of the Federal Roster of Experts of the Scientific and Technological Sphere, and on the integration of heterogeneous data using the potential of the expert community of the scientific and technological sphere. The methodology of the work is the formation, synthesis, and systematization of scientific, technological, and expert-analytical documents in the interests of ensuring the defense of the country and the security of the state, using big data technologies and the scientific potential of experts from the Federal Roster of Scientific and Technological Experts in remote access mode. The results of the work are informational and analytical materials on priority areas of development of the scientific and technological potential of leading foreign countries in the military-technical security of the state, as well as informational materials for expert and analytical support for the preparation and adoption of decisions on the scientific and technological support of national defense and security. The research results can be used by relevant government departments, leading universities, and enterprises of the military-industrial complex during research and development and the manufacture of products.


Author(s):  
Ebru Aydindag Bayrak ◽  
Pinar Kirci

This article presents a brief introduction to big data and big data analytics, as well as their roles in the healthcare system. A range of scientific research on big data analytics in healthcare systems has been reviewed. The definition of big data, the components of big data, medical big data sources, currently used big data technologies, and big data analytics in healthcare have been examined under different headings. The historical development of big data analytics has also been outlined. As a well-known big data analytics technology, Apache Hadoop and its core components and tools have been explained briefly. Moreover, a glance at some big data analytics tools and platforms outside the Hadoop ecosystem has been given. The main goal is to help researchers and specialists by giving an overview of the rising importance of big data analytics in healthcare systems.

