Big Data and Machine Learning Integration: The Benefits and Research Issues in the Huge Data Processing

The generation of the data from individual member to MNC incurring more burden on the existing architectures. The current requirements of processing and storing huge data may not be suitable to the existing storage and processing techniques. The fundamental issue is kind of the data populated every second in the social media even reaching to peta bytes of the storage the processing of this huge data is another problem. Here the concept of big data comes into the picture,Hadoop is a frame work which is helpful to store huge amounts of the data and to process the data in parallel and distributed mode. The framework is the combination of Hadoop Distributed File System(HDFS) and Map Reduce(MR). HDFS is a distributed storage which allows huge storage capacity solves the issue of abnormal data population, whereas the processing of the data is taken by the Map Reduce which provides a versatile model of processing the huge amounts of the data. The other dimension of the current work is to analyze the huge amounts of the data which is beyond the scope of Hadoop based tools. Machine Learning (ML) is a class of algorithms provides various techniques to analyze the huge data in a better possible way. ML provides classification techniques, clustering mechanisms and Recommender systems to name a few. The importance of the current work is to integrate the Hadoop and R which in turn the combination of Big data and ML. The work provides the key benefits of such integration and future scope of the integration along with possible research constraints in the reality. We believe the work gives a platform to researchers so as to extract the future scope of the integration and difficulties faced in the process.

Download Full-text

Drug Prediction in Healthcare Using Big Data and Machine Learning

Advances in Social Networking and Online Communities - Hidden Link Prediction in Stochastic Social Networks ◽

10.4018/978-1-5225-9096-5.ch005 ◽

2019 ◽

pp. 79-92 ◽

Cited By ~ 1

Author(s):

Mamoon Rashid ◽

Vishal Goyal ◽

Shabir Ahmad Parah ◽

Harjeet Singh

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Analysis ◽

Side Effects ◽

Previous History ◽

Distributed File System ◽

Optimal Drug ◽

Hadoop Distributed File System ◽

History Of ◽

Medicinal Drugs

The healthcare system is literally losing patients due to improper diagnosis, accidents, and infections in hospitals alone. To address these challenges, the authors are proposing the drug prediction model that will act as informative guide for patients and help them for taking right medicines for the cure of particular disease. In this chapter, the authors are proposing use of Hadoop distributed file system for the storage of medical datasets related to medicinal drugs. MLLib Library of Apache Spark is to be used for initial data analysis for drug suggestions related to symptoms gathered from particular user. The model will analyze the previous history of patients for any side effects of the drug to be recommended. This proposal will consider weather and maps API from Google as well so that the patients can easily locate the nearby stores where the medicines will be available. It is believed that this proposal of research will surely eradicate the issues by prescribing the optimal drug and its availability by giving the location of the retailer of that drug near the customer.

Download Full-text

Attribute based honey encryption algorithm for securing big data: Hadoop distributed file system perspective

PeerJ Computer Science ◽

10.7717/peerj-cs.259 ◽

2020 ◽

Vol 6 ◽

pp. e259

Author(s):

Gayatri Kapil ◽

Alka Agrawal ◽

Abdulaziz Attaallah ◽

Abdullah Algarni ◽

Rajeev Kumar ◽

...

Keyword(s):

Big Data ◽

File System ◽

Low Cost ◽

Distributed File System ◽

File Size ◽

System Perspective ◽

Huge Data ◽

Attribute Based Encryption ◽

Hadoop Distributed File System ◽

Encryption Decryption

Hadoop has become a promising platform to reliably process and store big data. It provides flexible and low cost services to huge data through Hadoop Distributed File System (HDFS) storage. Unfortunately, absence of any inherent security mechanism in Hadoop increases the possibility of malicious attacks on the data processed or stored through Hadoop. In this scenario, securing the data stored in HDFS becomes a challenging task. Hence, researchers and practitioners have intensified their efforts in working on mechanisms that would protect user’s information collated in HDFS. This has led to the development of numerous encryption-decryption algorithms but their performance decreases as the file size increases. In the present study, the authors have enlisted a methodology to solve the issue of data security in Hadoop storage. The authors have integrated Attribute Based Encryption with the honey encryption on Hadoop, i.e., Attribute Based Honey Encryption (ABHE). This approach works on files that are encoded inside the HDFS and decoded inside the Mapper. In addition, the authors have evaluated the proposed ABHE algorithm by performing encryption-decryption on different sizes of files and have compared the same with existing ones including AES and AES with OTP algorithms. The ABHE algorithm shows considerable improvement in performance during the encryption-decryption of files.

Download Full-text

A Survey on Accelerated Mapreduce for Hadoop

Oriental journal of computer science and technology ◽

10.13005/ojcst/10.03.07 ◽

2017 ◽

Vol 10 (3) ◽

pp. 597-602

Author(s):

Jyotindra Tiwari ◽

Dr. Mahesh Pawar ◽

Dr. Anjajana Pandey

Keyword(s):

Big Data ◽

Data Storage ◽

Energy Efficient ◽

High Performance ◽

Map Reduce ◽

Efficient Computation ◽

Apache Hadoop ◽

Huge Data ◽

Performance Techniques ◽

Big Data Storage

Big Data is defined by 3Vs which stands for variety, volume and velocity. The volume of data is very huge, data exists in variety of file types and data grows very rapidly. Big data storage and processing has always been a big issue. Big data has become even more challenging to handle these days. To handle big data high performance techniques have been introduced. Several frameworks like Apache Hadoop has been introduced to process big data. Apache Hadoop provides map/reduce to process big data. But this map/reduce can be further accelerated. In this paper a survey has been performed for map/reduce acceleration and energy efficient computation in quick time.

Download Full-text

Sandbox security model for Hadoop file system

Journal Of Big Data ◽

10.1186/s40537-020-00356-z ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Gousiya Begum ◽

S. Zahoor Ul Huq ◽

A. P. Siva Kumar

Keyword(s):

Big Data ◽

File System ◽

Heterogeneous Data ◽

Map Reduce ◽

Security Model ◽

Apache Hadoop ◽

Legitimate User ◽

Hadoop Distributed File System ◽

Block Level ◽

Cryptographic Techniques

Abstract Extensive usage of Internet based applications in day to day life has led to generation of huge amounts of data every minute. Apart from humans, data is generated by machines like sensors, satellite, CCTV etc. This huge collection of heterogeneous data is often referred as Big Data which can be processed to draw useful insights. Apache Hadoop has emerged has widely used open source software framework for Big Data Processing and it is a cluster of cooperative computers enabling distributed parallel processing. Hadoop Distributed File System is used to store data blocks replicated and spanned across different nodes. HDFS uses an AES based cryptographic techniques at block level which is transparent and end to end in nature. However cryptography provides security from unauthorized access to the data blocks, but a legitimate user can still harm the data. One such example was execution of malicious map reduce jar files by legitimate user which can harm the data in the HDFS. We developed a mechanism where every map reduce jar will be tested by our sandbox security to ensure the jar is not malicious and suspicious jar files are not allowed to process the data in the HDFS. This feature is not present in the existing Apache Hadoop framework and our work is made available in github for consideration and inclusion in the future versions of Apache Hadoop.

Download Full-text

Data Mining for the Internet of Things: Literature Review and Challenges

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.37122 ◽

2021 ◽

Vol 9 (VII) ◽

pp. 3351-3362

Author(s):

Dr. Mohd Zuber

Keyword(s):

Data Mining ◽

Big Data ◽

Internet Of Things ◽

The Internet ◽

Mining System ◽

Research Issues ◽

Data Mining Algorithms ◽

Huge Data ◽

Mining Algorithms ◽

The Internet Of Things

The huge data generate by the Internet of Things (IOT) are measured of high business worth, and data mining algorithms can be applied to IOT to take out hidden information from data. In this paper, we give a methodical way to review data mining in knowledge, technique and application view, together with classification, clustering, association analysis and time series analysis, outlier analysis. And the latest application luggage is also surveyed. As more and more devices connected to IOT, huge volume of data should be analyzed, the latest algorithms should be customized to apply to big data. We reviewed these algorithms and discussed challenges and open research issues. At last a suggested big data mining system is proposed.

Download Full-text

Machine learning approaches on map reduce for Big Data analytics

2015 International Conference on Green Computing and Internet of Things (ICGCIoT) ◽

10.1109/icgciot.2015.7380512 ◽

2015 ◽

Author(s):

J V N Lakshmi ◽

Ananthi Sheshasaayee

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Analytics ◽

Big Data Analytics ◽

Map Reduce ◽

Learning Approaches

Download Full-text

PNNCP- Parallel Nearest Neighbor Classification and Prediction for Big Data Application Based on Apache Spark and Machine Learning

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1382.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 2358-2365

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Streams ◽

Nearest Neighbor ◽

Apache Spark ◽

Course Of Action ◽

Huge Data ◽

Data Application ◽

Big Data Application ◽

Neighbor Classification

Right by and by the Colossal Information applications, for case, social orchestrating, helpful human administrations, agribusiness, keeping cash, stock show, direction, Facebook and so forward are making the data with especially tall speed. Volume and Speed of the Immense data plays a fundamental bit interior the execution of Colossal data applications. Execution of the Colossal data application can be affected by distinctive parameters. Quickly watch, capacity and precision are the a significant parcel of the triumphant parameters which impact the by and gigantic execution of any Huge data applications. Due the energize and underhanded affiliation of the qualities of 7Vs of Colossal data, each Colossal Information affiliations expect the tall execution.Tall execution is the foremost obvious test within the display advancing condition. In this paper we propose the parallel course of action way to bargain with speedup the explore for closest neighbor center. k-NN classifier is the preeminent basic and comprehensively utilized method for gathering. In this paper we apply a parallelism thought to k-NN for looking the another closest neighbor. This neighbor center will be utilized for putting lost and execution of the remarkable data streams. This classifier unequivocally overhaul and coordinate of the out of date data streams. We are utilizing the Apache Begin and scattered estimation space affiliation for snappier evaluation.

Download Full-text

Big Data and Shipping-managing vessel performance

JOIV International Journal on Informatics Visualization ◽

10.30630/joiv.2.2.116 ◽

2018 ◽

Vol 2 (2) ◽

pp. 73

Author(s):

Mandeep Virk ◽

Vaishali Chauhan

Keyword(s):

Machine Learning ◽

Decision Making ◽

Big Data ◽

Fault Tolerance ◽

High Rate ◽

Shipping Industry ◽

Large Sets ◽

Huge Data ◽

Reliable Performance ◽

Increasing Demand

Shipping business is staggering the trade by a substantial number which portrays the usage of leading technologies to deliver formative and reliable performance to deal with the increasing demand. Technologies like AIS, machine learning, and IoT are making a shift in shipping industry by introducing robots and more sensor equipped devices. The hitch big data originates as a technology which is proficient for assembling and transforming the colossal and divergent figures of data providing organizations with meaningful insights for better decision-making. The size of data is increasing at a higher rate because of the procreation of peripatitic gadgets and sensors attached. Big data is accustomed to delineate technologies and techniques which are used to store, manage, distribute and analyze huge data sheets with a high rate of data occurrence. This gigantic data is allowing to terminate the business by developing meaningful and valuable insights by processing the data. Hadoop is the fundamental basic for composing big data and furnishes with convenient judgments through analysis. It enables the processing of large sets of data by providing a higher degree of fault-tolerance. Parallelism is adapted to process big size of data in the efficient and inexpensive way. Contending massive bulk of data is a determined and vigorous assignment that needs an enormous crunching armature to guaranty affluent data processing and analysis.

Download Full-text

An Implementation of Genetic Algorithms in Big Data Processing for Medical Data

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.c4852.029320 ◽

2020 ◽

Vol 9 (3) ◽

pp. 2215-2217

Keyword(s):

Genetic Algorithm ◽

Big Data ◽

Data Processing ◽

Medical Data ◽

Sql Server ◽

Big Data Processing ◽

Huge Data ◽

Cloud Server ◽

Processing Techniques ◽

Program Interface

The large amount of real time medical measurement parameters stored in the SQL server needs processing using a specific algorithm. One of the big data processing techniques is available for medical data is Genetic algorithm. The acquired medical parameters are combined together to predict or diagnose the disease using the genetic algorithm. In this paper, the genetic algorithm is used to process the medical measurements data. The medical parameters are posted temporarily in the Representational Structure (REST) Application Program Interface (API) using a gateway protocol MQTT. The genetic algorithm can easily diagnose the disease using the existing stored parameters. The medical parameters of the patient like ECG, Blood pressure and skin temperature are posted frequently in the cloud server for continuous monitoring, and the huge data is also processed using this proposed method.

Download Full-text

Healthcare Prediction and Analysis System with Constant Data Polling

International Journal of Inventive Engineering and Sciences - Regular Issue ◽

10.35940/ijies.k0993.0951120 ◽

2020 ◽

Vol 5 (11) ◽

pp. 1-8

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Extraction ◽

Community Services ◽

Complex Data ◽

Accurate Analysis ◽

Early Disease Detection ◽

Huge Data ◽

Scope Of Application ◽

Analysis System

The applications of Big data and machine learning in the fields of healthcare, bioinformatics and information sciences are the most important things that a researcher takes into consideration when doing predictive analysis. The Data production at this stage has never been higher and it is increasing at an alarming rate. Hence, it is difficult to store, process and visualise this huge data using customary technologies. However, abstract design for a specific massive information application has been restricted. With advancement of big data in the field of biomedical and healthcare domain, accurate analysis of medical data can be proved beneficial for early disease detection, patient care and community services. Machine learning is being used in a wide scope of application domains to discover patterns in huge datasets. Moreover, the results from machine learning drive critical decisions in applications relating healthcare and biomedicine. The transformation of data to actionable insights from complex data remains a key challenge. In this paper we have introduced a new method of polling of data before analysis is conducted on it. This method will be valuable for dealing with the issue of incomplete data and will progressively prompt suitable and more precise data extraction.

Download Full-text