A kind of entity recognition algorithm based on Hadoop for power big data

2018 ◽  
Vol 189 ◽  
pp. 03005
Author(s):  
Jun Qi ◽  
Weichun Ge ◽  
Zhao Li ◽  
Wei Li ◽  
Hongyu Zhang ◽  
...  

With the arrival of the big data era, traditional entity recognition techniques can no longer preprocess data effectively, owing to the large scale of power grid data and its complex volume and type characteristics. Hadoop technologies, which have risen in recent years, are better suited to big data processing. This paper therefore proposes a power big data entity recognition algorithm based on Hadoop. It applies a discretization algorithm to select cut points with higher information accuracy and puts forward a discretization evaluation indicator. Finally, we perform entity recognition on wind turbine monitoring data on the Hadoop platform. Experimental results show that the proposed algorithm performs well in the correctness and breakpoint-number experiments and achieves a good speed-up ratio, so it is applicable to entity recognition of power big data.
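
The abstract does not spell out the discretization evaluation indicator, so the following is only a minimal sketch, assuming an information-entropy criterion of the kind commonly used for cut-point selection; all function names and the toy wind-turbine data are illustrative, not the authors' implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels, cut):
    """Information gain of splitting a numeric attribute at `cut`."""
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    if not left or not right:
        return 0.0
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_cut_point(values, labels):
    """Pick the candidate cut point (midpoint between adjacent sorted values) with the highest gain."""
    pairs = sorted(zip(values, labels))
    candidates = {(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1) if pairs[i][0] != pairs[i + 1][0]}
    return max(candidates, key=lambda c: information_gain(values, labels, c))

# Toy wind-turbine feature (e.g. rotor speed) with two status classes.
speeds = [3.1, 3.4, 5.0, 5.2, 7.8, 8.1, 8.3]
status = ['normal', 'normal', 'normal', 'fault', 'fault', 'fault', 'fault']
print(best_cut_point(speeds, status))  # cut between the last 'normal' and first 'fault' reading
```

In a Hadoop setting, an evaluation of this kind would typically run per attribute inside map tasks, with candidate cut points and their scores aggregated in the reduce phase.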

2017 ◽  
pp. 83-99
Author(s):  
Sivamathi Chokkalingam ◽  
Vijayarani S.

The term Big Data refers to large-scale information management and analysis technologies that exceed the capability of traditional data processing technologies. Big Data is differentiated from traditional technologies in three ways: the volume, velocity and variety of data. Big data analytics is the process of analyzing large data sets that contain a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Since Big Data is a newly emerging field, new technologies and algorithms need to be developed for handling it. The main objective of this paper is to provide knowledge about the various research challenges of Big Data analytics. A brief overview of the different types of Big Data analytics is given; for each type, the paper describes its process steps and tools and gives a banking application. Some research challenges of big data analytics, and possible solutions to them, are also discussed.


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Mohammad Hasan Ansari ◽  
Vahid Tabatab Vakili ◽  
Behnam Bahrak

With the rapid development of smart grids and the increasing volume of data collected in these networks, analyzing this massive data for applications such as marketing, cyber-security, and performance analysis has gained popularity. This paper focuses on the analysis and performance evaluation of big data frameworks proposed for handling smart grid data. Since obtaining large amounts of smart grid data is difficult due to privacy concerns, we propose and implement a large-scale smart grid data generator to produce massive data under conditions similar to those in real smart grids. We use four open-source big data frameworks, namely Hadoop-HBase, Cassandra, Elasticsearch, and MongoDB, in our implementation. Finally, we evaluate the performance of the different frameworks on smart grid big data and present a performance benchmark that includes common data analysis techniques on smart grid data.
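
The paper's data generator is not described in detail in the abstract; the snippet below is only a sketch of what such a generator might look like, assuming 15-minute smart-meter readings with a crude evening-peak load profile. The schema (meter_id, timestamp, kwh) and all parameters are assumptions rather than the authors' format.

```python
import csv
import random
from datetime import datetime, timedelta

def generate_readings(n_meters=100, days=1, interval_minutes=15,
                      out_path="smart_grid_readings.csv"):
    """Write synthetic smart-meter readings (meter id, timestamp, kWh) to a CSV file."""
    start = datetime(2019, 1, 1)
    steps = days * 24 * 60 // interval_minutes
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["meter_id", "timestamp", "kwh"])
        for meter in range(n_meters):
            base_load = random.uniform(0.1, 0.5)           # per-household baseline
            for step in range(steps):
                ts = start + timedelta(minutes=step * interval_minutes)
                # crude daily profile: evening peak plus Gaussian noise
                peak = 0.6 if 17 <= ts.hour <= 21 else 0.0
                kwh = max(0.0, random.gauss(base_load + peak, 0.05))
                writer.writerow([meter, ts.isoformat(), round(kwh, 3)])

generate_readings(n_meters=10, days=1)
```

Output of this kind could then be bulk-loaded into each of the four stores (HBase, Cassandra, Elasticsearch, MongoDB) so that the same queries can be timed against identical data.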


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 409
Author(s):  
Balázs Bohár ◽  
David Fazekas ◽  
Matthew Madgwick ◽  
Luca Csabai ◽  
Marton Olbei ◽  
...  

In the era of Big Data, data collection underpins biological research more than ever before. In many cases this can be as time-consuming as the analysis itself, requiring the download of multiple public databases with different data structures and, in general, days of work before any biological question can be answered. To solve this problem, we introduce an open-source, cloud-based big data platform called Sherlock (https://earlham-sherlock.github.io/). Sherlock provides a gap-filling way for biologists to store, convert, query, share and generate biological data, while ultimately streamlining bioinformatics data management. The Sherlock platform provides a simple interface to leverage big data technologies, such as Docker and PrestoDB. Sherlock is designed to analyse, process, query and extract information from extremely complex and large data sets. Furthermore, Sherlock is capable of handling differently structured data (interaction, localization, or genomic sequence) from several sources and converting it to a common optimized storage format, for example the Optimized Row Columnar (ORC) format. This format facilitates Sherlock's ability to quickly and easily execute distributed analytical queries on extremely large data files, as well as to share datasets between teams. The Sherlock platform is freely available on GitHub, and contains specific loader scripts for structured data sources of genomics, interaction and expression databases. With these loader scripts, users are able to easily and quickly create and work with specific file formats, such as JavaScript Object Notation (JSON) or ORC. For computational biology and large-scale bioinformatics projects, Sherlock provides an open-source platform empowering data management, data analytics, data integration and collaboration through modern big data technologies.
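
As a rough illustration of the JSON-to-ORC conversion step that Sherlock's loader scripts automate, the sketch below uses pyarrow; the file names and the interaction schema are hypothetical, and Sherlock's own loaders define their own formats and paths.

```python
# Illustrative only: Sherlock's loader scripts define their own schemas and paths.
import pyarrow.json as pajson   # reads newline-delimited JSON
import pyarrow.orc as paorc     # requires a pyarrow build with ORC support

# Hypothetical input: one interaction record per line, e.g.
# {"source": "P12345", "target": "Q67890", "score": 0.87}
table = pajson.read_json("interactions.jsonl")

# Write a columnar ORC file that a distributed engine can query in place.
paorc.write_table(table, "interactions.orc")
```

Once such ORC files are in place, a distributed SQL engine such as PrestoDB can be pointed at the directory holding them and queried with ordinary SQL, which is the workflow the platform is built around.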


Author(s):  
Gourav Bathla ◽  
Himanshu Aggarwal ◽  
Rinkle Rani

Clustering is one of the most important applications of data mining. It has attracted the attention of researchers in statistics and machine learning, and is used in many applications such as information retrieval, image processing and social network analytics. It helps the user understand the similarity and dissimilarity between objects, and cluster analysis makes complex and large data sets easier to understand. Different types of clustering algorithms have been analyzed by various researchers. K-means is the most popular partitioning-based algorithm, as it provides good results through accurate calculation on numerical data; however, K-means works well for numerical data only. Big data is a combination of numerical and categorical data. The K-prototype algorithm handles numerical as well as categorical data by combining the distances calculated on the numeric and categorical attributes. With the growth of data from social networking websites, business transactions, scientific computation and so on, there are vast collections of structured, semi-structured and unstructured data, so K-prototype needs to be optimized to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype implemented on MapReduce gives a better performance gain on multiple nodes than on a single node, with CPU execution time and speedup used as evaluation metrics. An intelligent splitter is also proposed, which splits mixed big data into numerical and categorical parts. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
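
The abstract does not give the implementation details of the intelligent splitter or the MapReduce job, but the mixed-type dissimilarity that K-prototype relies on is the standard one (Huang's measure). A minimal sketch, with an assumed gamma weight for the categorical part:

```python
def kprototype_distance(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Huang-style mixed dissimilarity: squared Euclidean distance on the
    numeric part plus gamma times the number of categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric + gamma * categorical

# Toy record vs. cluster prototype: (age, income) numeric, (city, plan) categorical.
print(kprototype_distance([34, 52.0], ["delhi", "gold"],
                          [30, 50.0], ["delhi", "silver"], gamma=0.5))
```

Splitting each record into its numeric and categorical parts up front, as the proposed splitter does, lets the two terms of this distance be computed independently inside map tasks before cluster assignments are aggregated.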


2016 ◽  
Vol 8 (4) ◽  
pp. 50-69 ◽  
Author(s):  
Mahfoud Bala ◽  
Omar Boussaid ◽  
Zaia Alimazighi

Due to their widespread use, the Internet, Web 2.0 and digital sensors create data in non-traditional volumes (at the terabyte and petabyte scale). Big data, characterized by the four V's, has brought new challenges given the limited capabilities of traditional computing systems. This paper aims to provide solutions that can cope with very large data in Decision-Support Systems (DSSs). In the data integration phase, specifically, the authors propose a conceptual modeling approach for parallel and distributed Extracting-Transforming-Loading (ETL) processes. Among the complexity dimensions of big data, this study focuses on data volume to ensure good ETL performance. The authors' approach allows parallelization and distribution issues to be anticipated at an early stage of Data Warehouse (DW) projects. They have implemented an ETL platform called Parallel-ETL (P-ETL for short) and conducted experiments. Their performance analysis reveals that the proposed approach can speed up ETL processes by up to 33%, with the improvement rate scaling linearly.
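
P-ETL's operators and configuration are not described in the abstract; the sketch below only illustrates the general idea of a partition-parallel transform step, with the function names, the toy transformation and the partition count all being assumptions.

```python
# Illustrative partition-parallel ETL step, not P-ETL's actual implementation.
from multiprocessing import Pool

def transform(row):
    """Toy transformation: cleanse a customer name and derive a taxed amount."""
    customer, amount = row
    return customer.strip().upper(), round(float(amount) * 1.2, 2)

def etl(rows, partitions=4):
    # Extraction is assumed done; split the input into chunks and transform them in parallel.
    with Pool(partitions) as pool:
        transformed = pool.map(transform, rows,
                               chunksize=max(1, len(rows) // partitions))
    # Load: return the rows here; a real pipeline would bulk-insert them into the DW.
    return transformed

if __name__ == "__main__":
    source = [(" alice ", "10.0"), ("bob", "20.5"), ("carol ", "7.25")]
    print(etl(source, partitions=2))
```

Deciding how the source is partitioned and how many workers transform it is exactly the kind of choice the authors argue should be modeled at the conceptual stage rather than bolted on later.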


2019 ◽  
Vol 8 (2) ◽  
pp. 3000-3003

Nowadays the number of cellular phones is increasing rapidly, and network capacity must grow to meet the demands of user equipment (UE), which has driven the evolution of cellular and communication networks. Device-to-Device (D2D) communication is a usage technology that offers substantial features, can be incorporated into LTE, and is considered a key technological component, especially for the 5G network. 5G wireless networks are being introduced to improve present technology and meet future demands with efficient and reliable solutions. D2D communication can be established within LTE, limited to devices in close proximity, and brings various advantages such as increased spectral efficiency, energy efficiency, reduced transmission delay, efficient traffic offloading, and less congestion in the cellular network. This paper deals with D2D entities, including user behaviors, content deliveries and their characteristics, on a big data platform that enables large-scale data to be shared accurately and effectively. Beyond D2D itself, the proposed work integrates big data analytics with D2D to improve content delivery while offloading large data sets. The present work also discusses big data predictive analysis for users based on D2D network services, which can guide further work.
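
The abstract does not specify which predictive analytics are used; as a minimal illustration of pairing request analytics with D2D content delivery, the sketch below counts requests in a hypothetical log and picks the most popular items to pre-cache on D2D helper devices.

```python
# Hypothetical request log and cache size; the paper's analytics are not given.
from collections import Counter

request_log = [  # (user_id, content_id) pairs, e.g. collected by the operator
    ("u1", "video42"), ("u2", "video42"), ("u3", "song7"),
    ("u4", "video42"), ("u5", "song7"), ("u6", "clip9"),
]

def contents_to_precache(log, cache_slots=2):
    """Rank content by request count and return the top items to push to D2D helpers."""
    counts = Counter(content for _, content in log)
    return [content for content, _ in counts.most_common(cache_slots)]

print(contents_to_precache(request_log))   # -> ['video42', 'song7']
```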


Galaxies ◽  
2018 ◽  
Vol 6 (4) ◽  
pp. 120 ◽  
Author(s):  
Jamie Farnes ◽  
Ben Mort ◽  
Fred Dulwich ◽  
Stef Salvini ◽  
Wes Armour

The Square Kilometre Array (SKA) will be both the largest radio telescope ever constructed and the largest Big Data project in the known Universe. The first phase of the project will generate on the order of five zettabytes of data per year. A critical task for the SKA will be its ability to process data for science, which will need to be conducted by science pipelines. Together with polarization data from the LOFAR Multifrequency Snapshot Sky Survey (MSSS), we have been developing a realistic SKA-like science pipeline that can handle the large data volumes generated by LOFAR at 150 MHz. The pipeline uses task-based parallelism to image, detect sources and perform Faraday tomography across the entire LOFAR sky. The project thereby provides a unique opportunity to contribute to the technological development of the SKA telescope, while simultaneously enabling cutting-edge scientific results. In this paper, we provide an update on current efforts to develop a science pipeline that can enable tight constraints on the magnetised large-scale structure of the Universe.
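
The pipeline's actual code and framework are not given here; the sketch below merely illustrates task-based parallelism for a per-field imaging, source-finding and Faraday-tomography chain, using dask.delayed as a stand-in scheduler, with all function names hypothetical.

```python
# Minimal sketch of task-based parallelism; not the SKA/LOFAR pipeline code.
import dask
from dask import delayed

@delayed
def make_image(field):
    return f"image({field})"               # stand-in for imaging a sky field

@delayed
def detect_sources(image):
    return f"sources({image})"             # stand-in for source finding

@delayed
def faraday_tomography(image):
    return f"rm_cube({image})"             # stand-in for RM synthesis on the field

fields = ["field_001", "field_002", "field_003"]
tasks = []
for field in fields:
    img = make_image(field)
    tasks.append(detect_sources(img))      # independent tasks form a graph that the
    tasks.append(faraday_tomography(img))  # scheduler can run in parallel across workers

results = dask.compute(*tasks)
print(results)
```

Because each field's imaging, source finding and tomography are independent tasks in the graph, a scheduler can spread them over many workers, which is the property a task-based pipeline relies on to keep up with LOFAR-scale (and eventually SKA-scale) data volumes.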

