Multi-Objective Big Data View Materialization using MOGA

2022 ◽  
Vol 13 (1) ◽  
pp. 0-0

The COVID-19 pandemic has resulted in the large-scale generation of Big data. This Big data is heterogeneous and includes data on people infected with the coronavirus, their contacts, the demographics of infected persons, data on coronavirus testing, huge amounts of GPS data on people's locations, and a large volume of unstructured data about the prevention and treatment of COVID-19. Thus, the pandemic has produced several zettabytes of structured, semi-structured and unstructured data. The challenge is to process this Big data, which is characterized by very large volume, a brisk rate of generation and modification, and high data redundancy, in a time-bound manner so that timely predictions and decisions can be made. Materialization of Big data views is one way to enhance the efficiency of processing this data. In this paper, the Big data view selection problem is addressed as a bi-objective optimization problem using a multi-objective genetic algorithm.
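As a minimal illustration of how such a bi-objective formulation can be set up, the sketch below encodes a view selection as a 0/1 vector and evaluates it on two competing objectives, query evaluation cost and view maintenance cost; all cost figures are hypothetical placeholders standing in for estimates derived from an actual workload.

```python
import random

# Hypothetical per-view costs; in practice these would be estimated from the
# query workload and update frequencies over the Big data store.
QUERY_SAVING = [40, 25, 60, 15, 35, 50]   # query-cost reduction if view i is materialized
UPDATE_COST  = [12, 5, 20, 3, 9, 14]      # maintenance cost of keeping view i fresh
TOTAL_QUERY_COST = 300                    # baseline query cost with no materialized views

def evaluate(selection):
    """Return (query evaluation cost, update processing cost) for a 0/1 view selection."""
    query_cost = TOTAL_QUERY_COST - sum(s * q for s, q in zip(selection, QUERY_SAVING))
    update_cost = sum(s * u for s, u in zip(selection, UPDATE_COST))
    return query_cost, update_cost

def dominates(a, b):
    """True if selection a Pareto-dominates b: no worse in both objectives, better in at least one."""
    fa, fb = evaluate(a), evaluate(b)
    return all(x <= y for x, y in zip(fa, fb)) and any(x < y for x, y in zip(fa, fb))

# Random population of candidate view selections, and its non-dominated subset.
population = [[random.randint(0, 1) for _ in QUERY_SAVING] for _ in range(20)]
pareto = [p for p in population if not any(dominates(q, p) for q in population if q is not p)]
print(pareto)
```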

2021 ◽  
Vol 34 (2) ◽  
pp. 1-28
Author(s):  
Akshay Kumar ◽  
T. V. Vijay Kumar

Big data views, in the context of a distributed file system (DFS), are defined over voluminous structured, semi-structured and unstructured data, with the purpose of reducing the response time of queries over Big data. As the size of semi-structured and unstructured data in Big data is very large compared to structured data, a framework based on query attributes over Big data can be used to identify Big data views. Materializing Big data views can improve query response time and facilitate efficient distribution of data over the DFS-based application. Since not all Big data views can be materialized, a subset of them should be selected for materialization. The purpose of view selection for materialization is to improve query response time subject to resource constraints. The Big data view materialization problem is defined as a bi-objective problem with two objectives, minimization of the query evaluation cost and minimization of the update processing cost, subject to a constraint on the total size of the materialized views. This problem is addressed in this paper using the multi-objective genetic algorithm NSGA-II. The experimental results show that the proposed NSGA-II based Big data view selection algorithm is able to select reasonably good-quality views for materialization.
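A hedged sketch of how this constrained bi-objective selection could be run with NSGA-II is shown below, using the DEAP library's selNSGA2 operator. The view savings, update costs, sizes and storage limit are illustrative placeholders, and the size constraint is handled with a simple penalty rather than the paper's exact formulation.

```python
import random
from deap import base, creator, tools

# Hypothetical view statistics; real values would come from the query-attribute
# framework built over the DFS-based application.
QUERY_SAVING = [40, 25, 60, 15, 35, 50]
UPDATE_COST  = [12, 5, 20, 3, 9, 14]
VIEW_SIZE    = [30, 10, 55, 8, 22, 40]
SIZE_LIMIT   = 100
BASE_COST    = 300

creator.create("FitnessMin2", base.Fitness, weights=(-1.0, -1.0))
creator.create("Individual", list, fitness=creator.FitnessMin2)

def evaluate(ind):
    """Objectives: (query evaluation cost, update processing cost); oversize selections are penalised."""
    query = BASE_COST - sum(s * q for s, q in zip(ind, QUERY_SAVING))
    update = sum(s * u for s, u in zip(ind, UPDATE_COST))
    excess = max(0, sum(s * v for s, v in zip(ind, VIEW_SIZE)) - SIZE_LIMIT)
    return query + 1000 * excess, update + 1000 * excess

toolbox = base.Toolbox()
toolbox.register("bit", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.bit, len(VIEW_SIZE))
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.1)
toolbox.register("select", tools.selNSGA2)

pop = toolbox.population(n=40)
for ind in pop:
    ind.fitness.values = toolbox.evaluate(ind)

for _ in range(50):                                      # generations
    offspring = [toolbox.clone(ind) for ind in pop]
    for c1, c2 in zip(offspring[::2], offspring[1::2]):  # crossover and mutation
        toolbox.mate(c1, c2)
        toolbox.mutate(c1)
        toolbox.mutate(c2)
        del c1.fitness.values, c2.fitness.values
    for ind in offspring:
        ind.fitness.values = toolbox.evaluate(ind)
    pop = toolbox.select(pop + offspring, k=len(pop))    # NSGA-II elitist selection

pareto_front = tools.sortNondominated(pop, len(pop), first_front_only=True)[0]
print([(list(ind), ind.fitness.values) for ind in pareto_front])
```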


2017 ◽  
pp. 83-99
Author(s):  
Sivamathi Chokkalingam ◽  
Vijayarani S.

The term Big Data refers to large-scale information management and analysis technologies that exceed the capability of traditional data processing technologies. Big Data is differentiated from traditional technologies in three ways: the volume, velocity and variety of data. Big data analytics is the process of analyzing large data sets that contain a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Since Big Data is a new and emerging field, there is a need to develop new technologies and algorithms for handling it. The main objective of this paper is to provide knowledge about the various research challenges of Big Data analytics. A brief overview of the various types of Big Data analytics is given, and for each type the paper describes the process steps and tools and provides a banking application. Some of the research challenges of Big Data analytics and possible solutions to those challenges are also discussed.


2013 ◽  
Vol 14 (1) ◽  
pp. 51-61 ◽  
Author(s):  
Fabian Fischer ◽  
Johannes Fuchs ◽  
Florian Mansmann ◽  
Daniel A Keim

The enormous growth of data in the last decades has led to a wide variety of database technologies. Nowadays, we are capable of storing vast amounts of structured and unstructured data. To address the challenge of exploring and making sense of big data using visual analytics, tight integration of such backend services is needed. In this article, we introduce BANKSAFE, which was built for the VAST Challenge 2012 and won the outstanding comprehensive submission award. BANKSAFE is based on modern database technologies and is capable of visually analyzing vast amounts of monitoring data and security-related datasets of large-scale computer networks. To better describe and demonstrate the visualizations, we use the Visual Analytics Science and Technology (VAST) Challenge 2012 as a case study. Additionally, we discuss lessons learned during the design and development of BANKSAFE, which are also applicable to other visual analytics applications for big data.


Author(s):  
Jamie Farnes ◽  
Ben Mort ◽  
Fred Dulwich ◽  
Stef Salvini ◽  
Wes Armour

The Square Kilometre Array (SKA) will be both the largest radio telescope ever constructed and the largest Big Data project in the known Universe. The first phase of the project will generate on the order of 5 zettabytes of data per year. A critical task for the SKA will be its ability to process data for science, which will need to be conducted by science pipelines. Together with polarization data from the LOFAR Multifrequency Snapshot Sky Survey (MSSS), we have been developing a realistic SKA-like science pipeline that can handle the large data volumes generated by LOFAR at 150 MHz. The pipeline uses task-based parallelism to image, detect sources, and perform Faraday Tomography across the entire LOFAR sky. The project thereby provides a unique opportunity to contribute to the technological development of the SKA telescope, while simultaneously enabling cutting-edge scientific results. In this paper, we provide an update on current efforts to develop a science pipeline that can enable tight constraints on the magnetised large-scale structure of the Universe.
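The task-based parallelism described above can be imitated, in a very reduced form, by treating each sky field as an independent task. The sketch below uses Python's standard ProcessPoolExecutor, with placeholder functions standing in for the real imaging, source-finding and Faraday tomography stages of the pipeline.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical stage functions standing in for the real imaging, source-finding
# and Faraday-tomography steps of an SKA-like science pipeline.
def image_field(field_id):
    return f"image-{field_id}"

def detect_sources(image):
    return [f"{image}-src{i}" for i in range(3)]

def faraday_tomography(source):
    return f"RM-cube({source})"

def process_field(field_id):
    """One independent task: image a sky field, detect sources, run tomography."""
    image = image_field(field_id)
    return [faraday_tomography(src) for src in detect_sources(image)]

if __name__ == "__main__":
    fields = range(8)                      # independent patches of the sky
    with ProcessPoolExecutor() as pool:    # task-based parallelism across fields
        results = list(pool.map(process_field, fields))
    print(results)
```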


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 409
Author(s):  
Balázs Bohár ◽  
David Fazekas ◽  
Matthew Madgwick ◽  
Luca Csabai ◽  
Marton Olbei ◽  
...  

In the era of Big Data, data collection underpins biological research more than ever before. In many cases it can be as time-consuming as the analysis itself, requiring the download of multiple public databases with different data structures and, in general, days of work before any biological question can be answered. To solve this problem, we introduce an open-source, cloud-based big data platform called Sherlock (https://earlham-sherlock.github.io/). Sherlock provides a gap-filling way for biologists to store, convert, query, share and generate biological data, while ultimately streamlining bioinformatics data management. The Sherlock platform provides a simple interface to leverage big data technologies, such as Docker and PrestoDB. Sherlock is designed to analyse, process, query and extract information from extremely complex and large data sets. Furthermore, Sherlock is capable of handling different structured data (interaction, localization, or genomic sequence) from several sources and converting them to a common optimized storage format, for example the Optimized Row Columnar (ORC) format. This format facilitates Sherlock's ability to quickly and easily execute distributed analytical queries on extremely large data files, as well as to share datasets between teams. The Sherlock platform is freely available on GitHub and contains loader scripts for structured data sources of genomics, interaction and expression databases. With these loader scripts, users can easily and quickly create and work with specific file formats, such as JavaScript Object Notation (JSON) or ORC. For computational biology and large-scale bioinformatics projects, Sherlock provides an open-source platform empowering data management, data analytics, data integration and collaboration through modern big data technologies.
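The JSON-to-ORC conversion step can be sketched roughly as follows, assuming a recent pyarrow build with ORC write support; the records and file names are toy placeholders and do not reproduce Sherlock's actual loader scripts.

```python
import json
import pyarrow as pa
import pyarrow.orc as orc    # ORC writing requires a recent pyarrow version

# Toy interaction records; Sherlock's real loader scripts target specific
# genomics/interaction/expression sources, which are not reproduced here.
records = [
    {"source": "P12345", "target": "Q67890", "db": "toy_interactions"},
    {"source": "P11111", "target": "Q22222", "db": "toy_interactions"},
]

# JSON lines -> Arrow table -> ORC file: the kind of conversion performed
# before exposing data to distributed analytical queries via PrestoDB.
json_lines = [json.dumps(r) for r in records]
table = pa.Table.from_pylist([json.loads(line) for line in json_lines])
orc.write_table(table, "interactions.orc")

# Read the ORC file back to verify the round trip.
print(orc.ORCFile("interactions.orc").read().to_pydict())
```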


Author(s):  
Gourav Bathla ◽  
Himanshu Aggarwal ◽  
Rinkle Rani

Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications, such as information retrieval, image processing and social network analytics, and it helps the user understand the similarity and dissimilarity between objects. Cluster analysis helps users understand complex and large data sets more clearly. Different types of clustering algorithms have been analyzed by various researchers. Kmeans is the most popular partitioning-based algorithm, as it provides good results through accurate calculations on numerical data; however, Kmeans works well for numerical data only. Big data is a combination of numerical and categorical data. The Kprototype algorithm is used to deal with both numerical and categorical data, combining the distances calculated from numeric and categorical attributes. With the growth of data from social networking websites, business transactions, scientific calculations and other sources, there is a vast collection of structured, semi-structured and unstructured data, so Kprototype needs to be optimized to analyze these varieties of data efficiently. In this paper, the Kprototype algorithm is implemented on MapReduce. Experiments show that Kprototype implemented on MapReduce gives a better performance gain on multiple nodes than on a single node; CPU execution time and speedup are used as evaluation metrics for comparison. An intelligent splitter is also proposed, which splits mixed big data into numerical and categorical data. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
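A minimal sketch of the core Kprototype idea, combining a numeric distance with a categorical mismatch count, together with a toy stand-in for the proposed splitter, is given below; the record fields, weights and centroid are hypothetical.

```python
import numpy as np

def kprototype_distance(x_num, x_cat, c_num, c_cat, gamma=1.0):
    """Combined distance used by Kprototype-style clustering: squared Euclidean distance
    on the numeric part plus gamma times the number of mismatching categorical attributes."""
    numeric = float(np.sum((np.asarray(x_num) - np.asarray(c_num)) ** 2))
    categorical = sum(a != b for a, b in zip(x_cat, c_cat))
    return numeric + gamma * categorical

def split_record(record, numeric_keys, categorical_keys):
    """Toy stand-in for the proposed intelligent splitter: separate a mixed record into
    its numerical and categorical parts before distributing work to mappers."""
    return ([record[k] for k in numeric_keys],
            [record[k] for k in categorical_keys])

# Hypothetical mixed record and cluster centroid.
record = {"age": 34, "income": 52000, "city": "Delhi", "segment": "retail"}
x_num, x_cat = split_record(record, ["age", "income"], ["city", "segment"])
centroid_num, centroid_cat = [30, 48000], ["Delhi", "corporate"]
print(kprototype_distance(x_num, x_cat, centroid_num, centroid_cat, gamma=0.5))
```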


2018 ◽  
Vol 189 ◽  
pp. 03005
Author(s):  
Jun Qi ◽  
Weichun Ge ◽  
Zhao Li ◽  
Wei Li ◽  
Hongyu Zhang ◽  
...  

With the arrival of the big data era, traditional entity recognition technologies have been unable to complete data preprocessing effectively due to the large scale of power grid data and its complex type features. The rise of Hadoop technologies in recent years enables better handling of big data processing. Therefore, this paper proposes a power big data entity recognition algorithm based on Hadoop. It applies a discretization algorithm to select discretization points with higher information accuracy and puts forward a discretization evaluation indicator. Finally, entity recognition of wind turbine monitoring data is performed on the Hadoop platform. Experimental results show that the proposed algorithm performs well in terms of correctness and the number of breakpoints, and that it has a good speed-up ratio. The proposed algorithm can be applied to power big data entity recognition.
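As an illustration of a discretization step of this kind, the sketch below applies simple equal-frequency breakpoint selection to a synthetic monitoring signal; it is a generic stand-in, not the paper's specific discretization algorithm or evaluation indicator.

```python
import numpy as np

def equal_frequency_breakpoints(values, n_bins):
    """Equal-frequency discretization: a simple stand-in for breakpoint selection
    over a continuous monitoring signal."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]   # interior quantile levels
    return np.quantile(values, qs)

def discretize(values, breakpoints):
    """Map each continuous reading to the index of its discrete interval."""
    return np.digitize(values, breakpoints)

# Toy monitoring signal; the paper's data comes from wind turbines on Hadoop.
signal = np.random.default_rng(0).normal(loc=12.0, scale=3.0, size=1000)
breaks = equal_frequency_breakpoints(signal, n_bins=5)
labels = discretize(signal, breaks)
print(breaks, np.bincount(labels))
```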


2020 ◽  
Vol 16 (8) ◽  
pp. 155014772094914
Author(s):  
Muhammad Sardaraz ◽  
Muhammad Tahir

Recent developments in cloud computing have made it a powerful solution for executing large-scale scientific problems. The complexity of scientific workflows demands efficient utilization of cloud resources to satisfy user requirements. Scheduling scientific workflows in a cloud environment is a challenge for researchers, and the problem is considered NP-hard. Constraints such as a heterogeneous environment, dependencies between tasks, quality of service and user deadlines make it difficult for the scheduler to fully utilize available resources. The problem has been extensively studied in the literature, with different researchers targeting different parameters. This article presents a multi-objective scheduling algorithm for scheduling scientific workflows in cloud computing. The solution is based on a genetic algorithm that targets makespan, monetary cost, and load balance. The proposed algorithm first finds the best solution for each parameter and, based on these solutions, then derives an overall best ("super best") solution across all parameters. The proposed algorithm is evaluated on benchmark datasets, and comparative results with the standard genetic algorithm, particle swarm optimization, and a specialized scheduler are presented. The results show that the proposed algorithm improves makespan and reduces cost while maintaining a well-balanced load.
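A hedged sketch of the kind of multi-objective fitness such a scheduler might evaluate is shown below; the task runtimes, VM speeds and prices are hypothetical, and task dependencies are ignored for brevity.

```python
import random

# Hypothetical cloud model: task runtimes on a baseline VM and per-hour VM prices.
TASK_RUNTIME = {"t1": 4, "t2": 6, "t3": 2, "t4": 8, "t5": 5}   # hours on a baseline VM
VM_SPEED = [1.0, 1.5, 2.0]                                     # relative speeds of 3 VMs
VM_PRICE = [0.10, 0.20, 0.35]                                  # cost per hour of each VM

def evaluate(schedule):
    """Return (makespan, monetary cost, load imbalance) for a task-to-VM mapping."""
    loads = [0.0] * len(VM_SPEED)
    cost = 0.0
    for task, vm in schedule.items():
        runtime = TASK_RUNTIME[task] / VM_SPEED[vm]
        loads[vm] += runtime
        cost += runtime * VM_PRICE[vm]
    makespan = max(loads)
    imbalance = makespan - min(loads)   # simple load-balance measure
    return makespan, cost, imbalance

# A random mapping of tasks to VMs, the kind of individual a genetic algorithm would evolve.
schedule = {task: random.randrange(len(VM_SPEED)) for task in TASK_RUNTIME}
print(schedule, evaluate(schedule))
```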

