A kind of entity recognition algorithm based on Hadoop for power big data

2018 ◽  
Vol 189 ◽  
pp. 03005
Author(s):  
Jun Qi ◽  
Weichun Ge ◽  
Zhao Li ◽  
Wei Li ◽  
Hongyu Zhang ◽  
...  

With the arrival of the big data era, traditional entity recognition techniques can no longer preprocess data effectively, owing to the large scale of power grid data and its complex volume and type characteristics. Hadoop technologies, which have risen in recent years, are better suited to big data processing. This paper therefore proposes a power big data entity recognition algorithm based on Hadoop. It applies a discretization algorithm to select cut points with higher information accuracy and puts forward a discretization evaluation indicator. Finally, we perform entity recognition on wind turbine monitoring data on the Hadoop platform. Experimental results show that the proposed algorithm performs well in the correctness and breakpoint-number experiments and achieves a good speed-up ratio, so it is applicable to entity recognition of power big data.
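
The abstract does not spell out the discretization evaluation indicator, so the following is only a minimal sketch, assuming an information-entropy criterion of the kind commonly used for cut-point selection; all function names and the toy wind-turbine data are illustrative, not the authors' implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels, cut):
    """Information gain of splitting a numeric attribute at `cut`."""
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    if not left or not right:
        return 0.0
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_cut_point(values, labels):
    """Pick the candidate cut point (midpoint between adjacent sorted values) with the highest gain."""
    pairs = sorted(zip(values, labels))
    candidates = {(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1) if pairs[i][0] != pairs[i + 1][0]}
    return max(candidates, key=lambda c: information_gain(values, labels, c))

# Toy wind-turbine feature (e.g. rotor speed) with two status classes.
speeds = [3.1, 3.4, 5.0, 5.2, 7.8, 8.1, 8.3]
status = ['normal', 'normal', 'normal', 'fault', 'fault', 'fault', 'fault']
print(best_cut_point(speeds, status))  # cut between the last 'normal' and first 'fault' reading
```

In a Hadoop setting, an evaluation of this kind would typically run per attribute inside map tasks, with candidate cut points and their scores aggregated in the reduce phase.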

2017 ◽  
pp. 83-99
Author(s):  
Sivamathi Chokkalingam ◽  
Vijayarani S.

The term Big Data refers to large-scale information management and analysis technologies that exceed the capability of traditional data processing technologies. Big Data is differentiated from traditional technologies in three ways: the volume, velocity and variety of data. Big data analytics is the process of analyzing large data sets that contain a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Since Big Data is a newly emerging field, new technologies and algorithms need to be developed for handling it. The main objective of this paper is to provide knowledge about the various research challenges of Big Data analytics. A brief overview of the different types of Big Data analytics is given; for each type, the paper describes its process steps and tools and gives a banking application. Some research challenges of big data analytics, and possible solutions to them, are also discussed.


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Mohammad Hasan Ansari ◽  
Vahid Tabatab Vakili ◽  
Behnam Bahrak

With the rapid development of smart grids and the increasing volume of data collected in these networks, analyzing this massive data for applications such as marketing, cyber-security, and performance analysis has gained popularity. This paper focuses on the analysis and performance evaluation of big data frameworks proposed for handling smart grid data. Since obtaining large amounts of smart grid data is difficult due to privacy concerns, we propose and implement a large-scale smart grid data generator to produce massive data under conditions similar to those in real smart grids. We use four open-source big data frameworks, namely Hadoop-HBase, Cassandra, Elasticsearch, and MongoDB, in our implementation. Finally, we evaluate the performance of the different frameworks on smart grid big data and present a performance benchmark that includes common data analysis techniques on smart grid data.
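
The paper's data generator is not described in detail in the abstract; the snippet below is only a sketch of what such a generator might look like, assuming 15-minute smart-meter readings with a crude evening-peak load profile. The schema (meter_id, timestamp, kwh) and all parameters are assumptions rather than the authors' format.

```python
import csv
import random
from datetime import datetime, timedelta

def generate_readings(n_meters=100, days=1, interval_minutes=15,
                      out_path="smart_grid_readings.csv"):
    """Write synthetic smart-meter readings (meter id, timestamp, kWh) to a CSV file."""
    start = datetime(2019, 1, 1)
    steps = days * 24 * 60 // interval_minutes
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["meter_id", "timestamp", "kwh"])
        for meter in range(n_meters):
            base_load = random.uniform(0.1, 0.5)           # per-household baseline
            for step in range(steps):
                ts = start + timedelta(minutes=step * interval_minutes)
                # crude daily profile: evening peak plus Gaussian noise
                peak = 0.6 if 17 <= ts.hour <= 21 else 0.0
                kwh = max(0.0, random.gauss(base_load + peak, 0.05))
                writer.writerow([meter, ts.isoformat(), round(kwh, 3)])

generate_readings(n_meters=10, days=1)
```

Output of this kind could then be bulk-loaded into each of the four stores (HBase, Cassandra, Elasticsearch, MongoDB) so that the same queries can be timed against identical data.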


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 409
Author(s):  
Balázs Bohár ◽  
David Fazekas ◽  
Matthew Madgwick ◽  
Luca Csabai ◽  
Marton Olbei ◽  
...  

In the era of Big Data, data collection underpins biological research more than ever before. In many cases this can be as time-consuming as the analysis itself, requiring the download of multiple public databases with different data structures and, in general, days of work before any biological question can be answered. To solve this problem, we introduce an open-source, cloud-based big data platform called Sherlock (https://earlham-sherlock.github.io/). Sherlock provides a gap-filling way for biologists to store, convert, query, share and generate biological data, while ultimately streamlining bioinformatics data management. The Sherlock platform provides a simple interface to leverage big data technologies, such as Docker and PrestoDB. Sherlock is designed to analyse, process, query and extract information from extremely complex and large data sets. Furthermore, Sherlock is capable of handling differently structured data (interaction, localization, or genomic sequence) from several sources and converting it to a common optimized storage format, for example the Optimized Row Columnar (ORC) format. This format facilitates Sherlock's ability to quickly and easily execute distributed analytical queries on extremely large data files, as well as to share datasets between teams. The Sherlock platform is freely available on GitHub, and contains specific loader scripts for structured data sources of genomics, interaction and expression databases. With these loader scripts, users are able to easily and quickly create and work with specific file formats, such as JavaScript Object Notation (JSON) or ORC. For computational biology and large-scale bioinformatics projects, Sherlock provides an open-source platform empowering data management, data analytics, data integration and collaboration through modern big data technologies.
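
As a rough illustration of the JSON-to-ORC conversion step that Sherlock's loader scripts automate, the sketch below uses pyarrow; the file names and the interaction schema are hypothetical, and Sherlock's own loaders define their own formats and paths.

```python
# Illustrative only: Sherlock's loader scripts define their own schemas and paths.
import pyarrow.json as pajson   # reads newline-delimited JSON
import pyarrow.orc as paorc     # requires a pyarrow build with ORC support

# Hypothetical input: one interaction record per line, e.g.
# {"source": "P12345", "target": "Q67890", "score": 0.87}
table = pajson.read_json("interactions.jsonl")

# Write a columnar ORC file that a distributed engine can query in place.
paorc.write_table(table, "interactions.orc")
```

Once such ORC files are in place, a distributed SQL engine such as PrestoDB can be pointed at the directory holding them and queried with ordinary SQL, which is the workflow the platform is built around.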


Author(s):  
Gourav Bathla ◽  
Himanshu Aggarwal ◽  
Rinkle Rani

Clustering is one of the most important applications of data mining. It has attracted the attention of researchers in statistics and machine learning, and is used in many applications such as information retrieval, image processing and social network analytics. It helps the user understand the similarity and dissimilarity between objects, and cluster analysis makes complex and large data sets easier to understand. Different types of clustering algorithms have been analyzed by various researchers. K-means is the most popular partitioning-based algorithm, as it provides good results through accurate calculation on numerical data; however, K-means works well for numerical data only. Big data is a combination of numerical and categorical data. The K-prototype algorithm handles numerical as well as categorical data by combining the distances calculated on the numeric and categorical attributes. With the growth of data from social networking websites, business transactions, scientific computation and so on, there are vast collections of structured, semi-structured and unstructured data, so K-prototype needs to be optimized to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype implemented on MapReduce gives a better performance gain on multiple nodes than on a single node, with CPU execution time and speedup used as evaluation metrics. An intelligent splitter is also proposed, which splits mixed big data into numerical and categorical parts. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
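
The abstract does not give the implementation details of the intelligent splitter or the MapReduce job, but the mixed-type dissimilarity that K-prototype relies on is the standard one (Huang's measure). A minimal sketch, with an assumed gamma weight for the categorical part:

```python
def kprototype_distance(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Huang-style mixed dissimilarity: squared Euclidean distance on the
    numeric part plus gamma times the number of categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric + gamma * categorical

# Toy record vs. cluster prototype: (age, income) numeric, (city, plan) categorical.
print(kprototype_distance([34, 52.0], ["delhi", "gold"],
                          [30, 50.0], ["delhi", "silver"], gamma=0.5))
```

Splitting each record into its numeric and categorical parts up front, as the proposed splitter does, lets the two terms of this distance be computed independently inside map tasks before cluster assignments are aggregated.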


2016 ◽  
Vol 8 (4) ◽  
pp. 50-69 ◽  
Author(s):  
Mahfoud Bala ◽  
Omar Boussaid ◽  
Zaia Alimazighi

Due to their widespread use, the Internet, Web 2.0 and digital sensors create data in non-traditional volumes (at the terabyte and petabyte scale). Big data, characterized by the four V's, has brought new challenges given the limited capabilities of traditional computing systems. This paper aims to provide solutions that can cope with very large data in Decision-Support Systems (DSSs). In the data integration phase, specifically, the authors propose a conceptual modeling approach for parallel and distributed Extracting-Transforming-Loading (ETL) processes. Among the complexity dimensions of big data, this study focuses on data volume to ensure good ETL performance. The authors' approach allows parallelization and distribution issues to be anticipated at an early stage of Data Warehouse (DW) projects. They have implemented an ETL platform called Parallel-ETL (P-ETL for short) and conducted experiments. Their performance analysis reveals that the proposed approach can speed up ETL processes by up to 33%, with the improvement rate scaling linearly.
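
P-ETL's operators and configuration are not described in the abstract; the sketch below only illustrates the general idea of a partition-parallel transform step, with the function names, the toy transformation and the partition count all being assumptions.

```python
# Illustrative partition-parallel ETL step, not P-ETL's actual implementation.
from multiprocessing import Pool

def transform(row):
    """Toy transformation: cleanse a customer name and derive a taxed amount."""
    customer, amount = row
    return customer.strip().upper(), round(float(amount) * 1.2, 2)

def etl(rows, partitions=4):
    # Extraction is assumed done; split the input into chunks and transform them in parallel.
    with Pool(partitions) as pool:
        transformed = pool.map(transform, rows,
                               chunksize=max(1, len(rows) // partitions))
    # Load: return the rows here; a real pipeline would bulk-insert them into the DW.
    return transformed

if __name__ == "__main__":
    source = [(" alice ", "10.0"), ("bob", "20.5"), ("carol ", "7.25")]
    print(etl(source, partitions=2))
```

Deciding how the source is partitioned and how many workers transform it is exactly the kind of choice the authors argue should be modeled at the conceptual stage rather than bolted on later.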


2019 ◽  
Vol 8 (2) ◽  
pp. 3000-3003

Nowadays the number of cellular phones is increasing rapidly, and network capacity must grow to meet the demands of user equipment (UE), which has driven the evolution of cellular and communication networks. Device-to-Device (D2D) communication is a usage technology that offers substantial features, can be incorporated into LTE, and is considered a key technological component, especially for the 5G network. 5G wireless networks are being introduced to improve present technology and meet future demands with efficient and reliable solutions. D2D communication can be established within LTE, limited to devices in close proximity, and brings various advantages such as increased spectral efficiency, energy efficiency, reduced transmission delay, efficient traffic offloading, and less congestion in the cellular network. This paper deals with D2D entities, including user behaviors, content deliveries and their characteristics, on a big data platform that enables large-scale data to be shared accurately and effectively. Beyond D2D itself, the proposed work integrates big data analytics with D2D to improve content delivery while offloading large data sets. The present work also discusses big data predictive analysis for users based on D2D network services, which can guide further work.
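
The abstract does not specify which predictive analytics are used; as a minimal illustration of pairing request analytics with D2D content delivery, the sketch below counts requests in a hypothetical log and picks the most popular items to pre-cache on D2D helper devices.

```python
# Hypothetical request log and cache size; the paper's analytics are not given.
from collections import Counter

request_log = [  # (user_id, content_id) pairs, e.g. collected by the operator
    ("u1", "video42"), ("u2", "video42"), ("u3", "song7"),
    ("u4", "video42"), ("u5", "song7"), ("u6", "clip9"),
]

def contents_to_precache(log, cache_slots=2):
    """Rank content by request count and return the top items to push to D2D helpers."""
    counts = Counter(content for _, content in log)
    return [content for content, _ in counts.most_common(cache_slots)]

print(contents_to_precache(request_log))   # -> ['video42', 'song7']
```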


Galaxies ◽  
2018 ◽  
Vol 6 (4) ◽  
pp. 120 ◽  
Author(s):  
Jamie Farnes ◽  
Ben Mort ◽  
Fred Dulwich ◽  
Stef Salvini ◽  
Wes Armour

The Square Kilometre Array (SKA) will be both the largest radio telescope ever constructed and the largest Big Data project in the known Universe. The first phase of the project will generate on the order of five zettabytes of data per year. A critical task for the SKA will be its ability to process data for science, which will need to be conducted by science pipelines. Together with polarization data from the LOFAR Multifrequency Snapshot Sky Survey (MSSS), we have been developing a realistic SKA-like science pipeline that can handle the large data volumes generated by LOFAR at 150 MHz. The pipeline uses task-based parallelism to image, detect sources and perform Faraday tomography across the entire LOFAR sky. The project thereby provides a unique opportunity to contribute to the technological development of the SKA telescope, while simultaneously enabling cutting-edge scientific results. In this paper, we provide an update on current efforts to develop a science pipeline that can enable tight constraints on the magnetised large-scale structure of the Universe.
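
The pipeline's actual code and framework are not given here; the sketch below merely illustrates task-based parallelism for a per-field imaging, source-finding and Faraday-tomography chain, using dask.delayed as a stand-in scheduler, with all function names hypothetical.

```python
# Minimal sketch of task-based parallelism; not the SKA/LOFAR pipeline code.
import dask
from dask import delayed

@delayed
def make_image(field):
    return f"image({field})"               # stand-in for imaging a sky field

@delayed
def detect_sources(image):
    return f"sources({image})"             # stand-in for source finding

@delayed
def faraday_tomography(image):
    return f"rm_cube({image})"             # stand-in for RM synthesis on the field

fields = ["field_001", "field_002", "field_003"]
tasks = []
for field in fields:
    img = make_image(field)
    tasks.append(detect_sources(img))      # independent tasks form a graph that the
    tasks.append(faraday_tomography(img))  # scheduler can run in parallel across workers

results = dask.compute(*tasks)
print(results)
```

Because each field's imaging, source finding and tomography are independent tasks in the graph, a scheduler can spread them over many workers, which is the property a task-based pipeline relies on to keep up with LOFAR-scale (and eventually SKA-scale) data volumes.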

