ScaDS Dresden/Leipzig – A competence center for collaborative big data research

2018 ◽  
Vol 60 (5-6) ◽  
pp. 327-333 ◽  
Author(s):  
René Jäkel ◽  
Eric Peukert ◽  
Wolfgang E. Nagel ◽  
Erhard Rahm

Abstract The efficient and intelligent handling of large, often distributed and heterogeneous data sets increasingly determines scientific and economic competitiveness in most application areas. Mobile applications, social networks, multimedia collections, sensor networks, data-intensive scientific experiments, and complex simulations nowadays generate a huge data deluge. Processing and analyzing these data sets with innovative methods opens up new opportunities for their exploitation and for new insights. However, the resulting resource requirements usually exceed the capabilities of state-of-the-art methods for the acquisition, integration, analysis, and visualization of data; these challenges are summarized under the term big data. ScaDS Dresden/Leipzig, a Germany-wide competence center for collaborative big data research, bundles efforts to realize data-intensive applications for a wide range of use cases in science and industry. In this article, we present the basic concept of the competence center and give insights into some of its research topics.

Author(s):  
Ganesh Chandra Deka

NoSQL databases are designed to meet the huge data storage requirements of cloud computing and big data processing. They offer many advanced features in addition to conventional RDBMS features, which is why "NoSQL" is popularly read as "Not only SQL". A variety of NoSQL databases with different features for handling exponentially growing, data-intensive applications are available as both open-source and proprietary options. This chapter discusses some of the popular NoSQL databases and their features in the light of the CAP theorem.
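To make the CAP trade-off concrete, here is a minimal, self-contained Python sketch (not taken from the chapter; the class and quorum parameters are hypothetical) of a toy replicated key-value store: with overlapping read/write quorums it returns the latest write, while smaller quorums trade consistency for availability when replicas are unreachable.

```python
# Toy illustration of quorum-based replication (hypothetical, not from the chapter).
# With W + R > N, reads overlap the latest write (consistency); with smaller quorums,
# the store stays available under replica failures but may return stale data.

class ToyReplicatedKV:
    def __init__(self, n_replicas=3, write_quorum=2, read_quorum=2):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.w, self.r = write_quorum, read_quorum
        self.clock = 0  # logical timestamp used to version writes

    def put(self, key, value, alive=None):
        alive = alive if alive is not None else range(len(self.replicas))
        self.clock += 1
        acks = 0
        for i in alive:
            self.replicas[i][key] = (self.clock, value)
            acks += 1
        if acks < self.w:
            raise RuntimeError("write quorum not reached -> unavailable")

    def get(self, key, alive=None):
        alive = alive if alive is not None else range(len(self.replicas))
        versions = [self.replicas[i][key] for i in alive if key in self.replicas[i]]
        if len(versions) < self.r:
            raise RuntimeError("read quorum not reached -> unavailable")
        return max(versions)[1]  # newest version wins


store = ToyReplicatedKV()
store.put("user:42", "Alice")
print(store.get("user:42"))                  # 'Alice'
store.put("user:42", "Bob", alive=[0, 1])    # replica 2 missed the update
print(store.get("user:42", alive=[1, 2]))    # quorum overlap still returns 'Bob'
```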


2021 ◽  
Vol 22 (4) ◽  
pp. 401-412
Author(s):  
Hrachya Astsatryan ◽  
Arthur Lalayan ◽  
Aram Kocharyan ◽  
Daniel Hagimont

The MapReduce framework manages Big Data sets by splitting large datasets into distributed blocks and processing them in parallel. Data compression and in-memory file systems are widely used in Big Data processing to reduce resource-intensive I/O operations and thereby improve the I/O rate. The article presents a robust, modular, and configurable decision-making service that relies on data compression and in-memory data storage indicators. The service consists of a Recommendation module and a Prediction module: it predicts the execution time of a given job based on collected metrics and recommends the configuration parameters that best improve the performance of the Hadoop and Spark frameworks. Several CPU- and data-intensive applications and micro-benchmarks, including Log Analyzer, WordCount, and K-Means, have been evaluated to demonstrate the performance improvement.
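The following Python sketch is only an illustration, assuming a simple linear runtime model and a two-parameter configuration space; it is not the authors' service, but it shows how a Prediction module fitted on past job metrics could drive a Recommendation module that picks the configuration with the lowest predicted execution time.

```python
# Hedged sketch (not the authors' implementation): toy Prediction/Recommendation pair.
# The feature set, candidate grid, and sample metrics below are illustrative assumptions.

from itertools import product
from sklearn.linear_model import LinearRegression

# Observed runs: (input size in GB, compression on/off, in-memory storage on/off) -> seconds
history_X = [[10, 0, 0], [10, 1, 0], [10, 1, 1], [50, 0, 0], [50, 1, 1]]
history_y = [120.0, 95.0, 70.0, 610.0, 340.0]

class PredictionModule:
    def __init__(self):
        # fit a simple regression model on previously collected job metrics
        self.model = LinearRegression().fit(history_X, history_y)

    def predict_runtime(self, size_gb, compression, in_memory):
        return float(self.model.predict([[size_gb, compression, in_memory]])[0])

class RecommendationModule:
    def __init__(self, predictor):
        self.predictor = predictor

    def best_config(self, size_gb):
        # score every candidate configuration and keep the fastest one
        candidates = product([0, 1], [0, 1])  # compression x in-memory storage
        return min(candidates,
                   key=lambda c: self.predictor.predict_runtime(size_gb, *c))

recommender = RecommendationModule(PredictionModule())
print(recommender.best_config(size_gb=30))  # e.g. (1, 1): enable compression and in-memory storage
```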


2017 ◽  
Vol 27 (2) ◽  
pp. 385-399 ◽  
Author(s):  
Laura Vasiliu ◽  
Florin Pop ◽  
Catalin Negru ◽  
Mariana Mocanu ◽  
Valentin Cristea ◽  
...  

Abstract With the rapid evolution of distributed computing in recent years, the amount of data created and processed has quickly grown to the petabyte or even exabyte scale. Such huge data sets require data-intensive computing applications and impose performance requirements on the infrastructures that support them, such as high scalability, large storage, fault tolerance, and also efficient scheduling algorithms. This paper focuses on a hybrid scheduling algorithm for many-task computing that addresses big data environments with few penalties, taking deadlines into consideration and satisfying a data-dependent task model. The hybrid solution combines several heuristics and algorithms (min-min, min-max, and earliest deadline first) to obtain a scheduling algorithm that matches our problem. The experiments are conducted by simulation, and the results show that the proposed hybrid algorithm behaves very well in terms of meeting deadlines.
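As an illustration of how such heuristics can be combined (a generic sketch under simplified assumptions, not the paper's exact hybrid algorithm), the following Python code performs a min-min task-to-machine mapping and breaks ties with earliest deadline first.

```python
# Hedged sketch: plain min-min mapping with an earliest-deadline-first tie-break.
# exec_time[t][m] is the estimated runtime of task t on machine m; deadlines are per task.

def min_min_edf(exec_time, deadlines, n_machines):
    ready = [0.0] * n_machines           # time at which each machine becomes free
    unscheduled = set(range(len(exec_time)))
    schedule = []                         # (task, machine, start, finish)

    while unscheduled:
        # completion time of each remaining task on its best machine
        best = {}
        for t in unscheduled:
            m = min(range(n_machines), key=lambda m: ready[m] + exec_time[t][m])
            best[t] = (ready[m] + exec_time[t][m], m)
        # min-min: pick the task with the smallest completion time,
        # breaking ties by the earliest deadline (EDF)
        task = min(unscheduled, key=lambda t: (best[t][0], deadlines[t]))
        finish, machine = best[task]
        schedule.append((task, machine, ready[machine], finish))
        ready[machine] = finish
        unscheduled.remove(task)
    return schedule

# Three tasks on two machines; task 2 has the tightest deadline.
times = [[4.0, 6.0], [2.0, 3.0], [5.0, 2.0]]
print(min_min_edf(times, deadlines=[20.0, 15.0, 6.0], n_machines=2))
```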


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 246-261
Author(s):  
K.R. Remesh Babu ◽  
K.P. Madhu

The management of big data has become more important due to the widespread adoption of the Internet of Things in various fields. Developments in technology, science, human habits, etc., generate massive amounts of data, so it is increasingly important to store and protect these data from attacks. Big data analytics is now a hot topic. The data storage facilities provided by cloud computing enable business organizations to overcome the burden of huge data storage and maintenance, and several distributed cloud applications support them in analyzing these data to make appropriate decisions. The dynamic growth of data and of data-intensive applications demands an efficient, intelligent storage mechanism for big data. The proposed system analyzes IP packets for vulnerabilities and classifies data nodes as reliable or unreliable for efficient data storage. The proposed Apriori-based method automatically classifies the nodes, providing an intelligent and secure storage mechanism for distributed big data storage.
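A bare-bones Apriori pass, shown below as a hedged Python sketch, indicates how frequent combinations of packet-level vulnerability indicators could be mined per node; the indicator names and the support threshold are illustrative assumptions, not details from the paper.

```python
# Hedged sketch (not the authors' system): bare-bones Apriori over per-node
# vulnerability indicators extracted from IP packet inspection. Frequent indicator
# combinations could then be used to flag nodes as unreliable.

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        k += 1
        # join step: build k-itemset candidates from surviving (k-1)-itemsets
        current = list({a | b for a, b in combinations(survivors, 2) if len(a | b) == k})
    return frequent

packet_observations = [
    {"open_telnet", "weak_tls", "port_scan_seen"},
    {"open_telnet", "weak_tls"},
    {"weak_tls", "port_scan_seen"},
    {"open_telnet", "weak_tls", "port_scan_seen"},
]
for itemset, support in apriori(packet_observations, min_support=0.5).items():
    print(sorted(itemset), round(support, 2))
```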


Author(s):  
Anna Ursyn ◽  
Edoardo L'Astorina

This chapter discusses some possible ways in which professionals, researchers, and users representing various knowledge domains collect and visualize big data sets. First, it describes communication through the senses as a basis for visualization techniques, computational solutions for enhancing the senses, and ways of enhancing the senses through technology. The next part discusses the ideas behind the visualization of data sets and ponders what visualization is and what it is not. Further discussion relates to data visualization through art: as visual solutions to science- and mathematics-related problems, as documentation of objects and events, and as a testimony to thoughts, knowledge, and meaning. Learning and teaching through data visualization is the concluding theme of the chapter. Edoardo L'Astorina provides a visual analysis of best practices in visualization: an overlay of Google Maps that showed, in real time, the arrival times of all the buses in your area based on your location, and a visual representation of all the Tweets in the world about TfL (Transport for London) tube lines, used to predict disruptions.


2017 ◽  
Vol 8 (2) ◽  
pp. 30-43
Author(s):  
Mrutyunjaya Panda

Big Data, due to its complicated and diverse nature, poses many challenges for extracting meaningful observations. This calls for smart and efficient algorithms that can cope with the computational complexity and memory constraints arising from their iterative behavior. The issue can be addressed with parallel computing techniques, where one or more machines perform the work simultaneously by dividing the problem into subproblems and assigning private memory to each subproblem. Clustering analysis has proven useful for handling such huge data in the recent past. Although many investigations into Big Data analysis are ongoing, here Canopy and K-Means++ clustering are used to process large-scale data in a shorter amount of time without memory constraints. To assess the suitability of the approach, several data sets are considered, ranging from small to very large and covering diverse fields of application. The experimental results indicate that the proposed approach is fast and accurate.
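For orientation only, the following scikit-learn snippet shows k-means++ seeding on synthetic data; the data sizes and parameters are assumptions, and the Canopy pre-clustering step used in the paper is not reproduced here.

```python
# Hedged sketch (not the paper's pipeline): k-means++ seeding with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three synthetic blobs standing in for a "large" data set
data = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(10_000, 2)) for c in ((0, 0), (5, 5), (0, 5))
])

# k-means++ picks well-spread initial centers, which typically reduces the number
# of Lloyd iterations needed to converge on large data sets
model = KMeans(n_clusters=3, init="k-means++", n_init=5, random_state=0).fit(data)
print(model.cluster_centers_.round(2))
print(model.inertia_)
```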


2018 ◽  
Vol 7 (2.6) ◽  
pp. 318
Author(s):  
Mandeep Virk ◽  
Vaishali Chauhan ◽  
Urvashi Mittal

Data analysis is one of the most demanding tasks in today's world. The size of data is increasing at a very high rate because of the proliferation of mobile devices and attached sensors, and making that data readable is another challenging task. Effective visualization provides users with better analysis capabilities and helps in deriving evidence from data. Many techniques and tools have been invented to deal with such problems, but adapting these tools remains the main difficulty. Big data originated as a technology capable of assembling and transforming colossal and divergent volumes of data, providing organizations with meaningful insights for deriving improved results. The term big data is used to describe the technologies and techniques employed to store, manage, distribute, and analyze huge data sets. The aim of this research is to make data readable in a more suitable form with less effort. The research mainly emphasizes the use of COGNOS Insight 10.2.2 for visualizing data and presenting the analysis results derived from Hive; the integration between the tools has also been refined in this research.


2018 ◽  
Vol 19 (3) ◽  
pp. 245-258
Author(s):  
Vengadeswaran Shanmugasundaram ◽  
Balasundaram Sadhu Ramakrishnan

In this data era, massive volumes of data are generated every second in a variety of domains such as geoscience, the social web, finance, e-commerce, health care, climate modelling, physics, astronomy, and government sectors. Hadoop is well recognized as the de facto big data processing platform and has been extensively adopted, and is currently widely used, in many application domains for processing big data. Even though it is considered an efficient solution for such complex query processing, it has its own limitations when the data to be processed exhibit interest locality: the data required for query execution follow a grouping behavior wherein only a part of the big data is accessed frequently. In such scenarios, the time taken to execute a query and return results increases exponentially as the amount of data increases, leading to long waiting times for the user. Since the Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior, it does not perform efficiently, resulting in drawbacks such as reduced local map task execution and increased query execution time. Hence, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. In this paper we study the significance of two of the most promising clustering techniques, namely Hierarchical Agglomerative Clustering (HAC) and Markov Clustering (MCL), for grouping-aware data placement for data-intensive applications exhibiting interest locality. Initially, the user access pattern is identified by dynamically analyzing the history log. Then both clustering techniques (HAC and MCL) are separately applied over the access pattern to obtain independent clusters. These clusters are interpreted and validated to extract the Optimal Data Groupings (ODG). Finally, the proposed strategy reorganizes the default data layouts in HDFS based on the ODG to achieve maximum parallel execution per group, subject to the Load Balancer and Rack Awareness. The proposed strategy is tested on a 10-node, multi-rack cluster deployed on a cloud platform with Hadoop installed on every node. It reduces the query execution time, significantly improves data locality, and proves to be more efficient for processing massive datasets in a heterogeneous distributed environment. MCL also shows a marginal performance improvement over HAC for queries exhibiting interest localities.
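As a hedged sketch of the grouping step (not the paper's ODPS implementation), the code below derives file groupings from a toy query access log with hierarchical agglomerative clustering; the log, the Jaccard distance choice, and the cut threshold are illustrative assumptions, and a real strategy would then re-place HDFS blocks according to the resulting groups.

```python
# Hedged sketch: grouping-aware clustering of files from a query access log (HAC).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Each row: which of 6 HDFS files a past query touched (1 = accessed)
access_log = np.array([
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 1],
])

# Distance between files = Jaccard dissimilarity of their access columns, so files
# that are frequently requested together end up close to each other
file_vectors = access_log.T
dist = pdist(file_vectors, metric="jaccard")
tree = linkage(dist, method="average")

groups = fcluster(tree, t=0.5, criterion="distance")
for file_id, group in enumerate(groups):
    print(f"file_{file_id} -> data group {group}")
```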


Author(s):  
Н.Н. Назипова ◽  
N.N. Nazipova

Sequencing of the human genome began in 1994. It took 10 years of collaborative work by many research groups from different countries to produce a draft of the human genome. Modern technologies allow a whole genome to be sequenced in a few days. We discuss here the advances in modern bioinformatics related to the emergence of high-performance sequencing platforms, which not only expanded the capabilities of biology and related sciences, but also gave rise to the phenomenon of Big Data in biology. The necessity of developing new technologies and methods for organizing the storage, management, analysis, and visualization of big data is substantiated. Modern bioinformatics faces not only the problem of processing enormous volumes of heterogeneous data, but also a variety of methods for interpreting and presenting the results, and the simultaneous existence of various software tools and data formats. Ways of solving the arising challenges are discussed, in particular by drawing on experience from other areas of modern life, such as web intelligence and business intelligence. The former is the area of scientific research and development that explores the role of, and makes use of, artificial intelligence and information technology (IT) for new products, services, and frameworks empowered by the World Wide Web; the latter is the domain of IT that addresses decision-making. New database management systems, other than relational ones, will help solve the problem of storing huge data volumes while providing acceptable response times for search queries. New programming technologies, such as generic programming and visual programming, are designed to address the diversity of genomic data formats and to provide the ability to quickly create one's own scripts for data processing.

