Flexible MapReduce Workflows for Cloud Data Analytics

Author(s):  
Carlos Goncalves ◽  
Luis Assuncao ◽  
Jose C. Cunha

Data analytics applications handle large data sets subject to multiple processing phases, some of which can execute in parallel on clusters, grids, or clouds. Such applications can benefit from the MapReduce model, which only requires the end-user to define the application algorithms for input data processing and the map and reduce functions, but this poses a need to install and configure specific frameworks such as Apache Hadoop or Elastic MapReduce in the Amazon cloud. In order to provide more flexibility in defining and adjusting the application configurations, as well as in specifying the composition of the application phases and their orchestration, the authors describe an approach for supporting MapReduce stages as sub-workflows in the AWARD framework (Autonomic Workflow Activities Reconfigurable and Dynamic). The authors discuss how a text mining application is represented as a complex workflow with multiple phases, where individual workflow nodes support MapReduce computations. Access to intermediate data produced during the MapReduce computations is supported by a data sharing abstraction. The authors describe two implementations of this abstraction, one based on a shared tuple space and another based on an in-memory distributed key/value store. The authors describe the implementation of the framework, a set of developed tools, and their experimentation with the execution of the text mining algorithm over multiple Amazon EC2 (Elastic Compute Cloud) instances, and report on the speed-up and size-up results obtained with up to 20 EC2 instances and for different corpus sizes, up to 97 million words.
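As a rough illustration of the user-facing part of such a stage, the sketch below shows a framework-agnostic word-count MapReduce pass in Python, with the map and reduce functions an end-user would supply and an in-memory dictionary standing in for the shared tuple space or key/value store; the function names and the local driver are illustrative and are not the AWARD API.

# Minimal, framework-agnostic sketch of the map and reduce functions an
# end-user might define for the word-count style text-mining phase.
# Function and variable names are illustrative, not the AWARD framework API.
from collections import defaultdict
from itertools import chain

def map_fn(document: str):
    """Map phase: emit (word, 1) pairs for each word in one input split."""
    for word in document.lower().split():
        yield word, 1

def reduce_fn(word: str, counts):
    """Reduce phase: sum the partial counts emitted for one word."""
    return word, sum(counts)

def run_mapreduce(documents):
    """Drive one MapReduce stage locally: map, shuffle by key, reduce."""
    intermediate = defaultdict(list)          # stands in for the shared
    for key, value in chain.from_iterable(    # tuple space / key-value store
            map_fn(doc) for doc in documents):
        intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

if __name__ == "__main__":
    corpus = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_mapreduce(corpus))   # {'the': 3, 'quick': 1, ...}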

Author(s):  
Mykhajlo Klymash ◽  
Olena Hordiichuk-Bublivska ◽  
Ihor Tchaikovskyi ◽  
Oksana Urikova

This article investigates the features of processing large arrays of information in distributed systems. A method of singular value decomposition is used to reduce the amount of data processed by eliminating redundancy. Dependencies of computational efficiency in distributed systems were obtained using the MPI messaging protocol and the MapReduce model of node interaction. The efficiency of each technology was analyzed for different data sizes: non-distributed systems are inefficient for large volumes of information due to low computing performance. It is proposed to use distributed systems that apply the method of singular value decomposition, which reduces the amount of information processed. The study of systems using the MPI protocol and the MapReduce model yielded the dependence of calculation time on the number of processes, which confirms the expediency of using distributed computing when processing large data sets. It was also found that distributed systems using the MapReduce model work much more efficiently than MPI, especially with large amounts of data. MPI makes it possible to perform calculations more efficiently for small amounts of information; as data sets grow, it is advisable to use the MapReduce model.
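As a small numerical illustration of the data-reduction step, the sketch below applies a truncated singular value decomposition to a data block and compares the number of stored values and the reconstruction error; the matrix sizes and the rank are arbitrary choices for the example, not values from the article.

# Illustrative sketch of singular value decomposition (SVD) used to reduce
# the volume of data before it is distributed for processing; rank k and
# matrix sizes are arbitrary, chosen only for the example.
import numpy as np

def truncated_svd(data: np.ndarray, k: int):
    """Keep only the k largest singular values/vectors of the data matrix."""
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 200))       # original data block
    u_k, s_k, vt_k = truncated_svd(data, k=20)
    approx = u_k @ np.diag(s_k) @ vt_k        # low-rank reconstruction
    stored = u_k.size + s_k.size + vt_k.size
    print(f"stored values: {stored} vs original: {data.size}")
    print(f"relative error: {np.linalg.norm(data - approx) / np.linalg.norm(data):.3f}")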


2020 ◽  
Vol 45 (s1) ◽  
pp. 535-559
Author(s):  
Christian Pentzold ◽  
Lena Fölsche

Our article examines how journalistic reports and online comments have made sense of computational politics. It treats the discourse around data-driven campaigns as its object of analysis and codifies four main perspectives that have structured the debates about the use of large data sets and data analytics in elections. We study American, British, and German sources on the 2016 United States presidential election, the 2017 United Kingdom general election, and the 2017 German federal election. There, groups of speakers maneuvered between enthusiastic, skeptical, agnostic, or admonitory stances and so cannot be clearly mapped onto these four discursive positions. Coming along with the inconsistent accounts, public sensemaking was marked by an atmosphere of speculation about the substance and effects of computational politics. We conclude that this equivocality helped journalists and commentators to sideline prior reporting on the issue in order to repeatedly rediscover the practices they had already covered.


2020 ◽  
Vol 20 (6) ◽  
pp. 5-17
Author(s):  
Hrachya Astsatryan ◽  
Aram Kocharyan ◽  
Daniel Hagimont ◽  
Arthur Lalayan

The optimization of large-scale data sets depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed over several machines. Data compression reduces data size and transfer time between disks and memory but requires additional processing. Therefore, finding an optimal tradeoff is a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper aims to present a system enabling the selection of compression tools and the tuning of the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures, based on simulation analyses.
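The tradeoff can be reproduced locally with a short sketch: higher compression levels shrink the bytes that must cross disk and network but cost more CPU time. zlib is used below only as a stand-in for the codecs (e.g. Snappy, LZ4, gzip) that Hadoop and Spark let one configure; the payload and levels are arbitrary.

# A small local sketch of the compression tradeoff the paper studies:
# higher compression levels shrink the data that must cross disk/network
# but cost more CPU time. zlib stands in for the codecs selectable in
# Hadoop/Spark configuration.
import time
import zlib

payload = b"sensor_id,timestamp,value\n" * 200_000   # ~5 MB of sample text

for level in (1, 5, 9):
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"level={level}  ratio={ratio:.1f}x  cpu_time={elapsed*1000:.1f} ms")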


2020 ◽  
Vol 32 (18) ◽  
pp. 14801-14816
Author(s):  
Mateusz Choromański ◽  
Tomasz Grześ ◽  
Piotr Hońko

Attribute reduction, being a complex problem in data mining, has attracted many researchers. The importance of this issue rises due to the ever-growing amount of data to be mined. Together with data growth, the need for speeding up computations increases. The contribution of this paper is twofold: (1) an investigation of breadth search strategies for finding minimal reducts, in order to identify the most promising method for processing large data sets; (2) the development and implementation of the first hardware approach to finding minimal reducts, in order to speed up time-consuming computations. Experimental research showed that for the software implementation the blind breadth search strategy is in general faster than the frequency-based breadth search strategy, not only in finding all minimal reducts but also in finding one of them. An inverse situation was observed for the hardware implementation. In future work, the implemented tool is to be used as a fundamental module in a system to be built for processing large data sets.
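For readers unfamiliar with reducts, the sketch below gives a plain-Python rendering of a blind breadth search: attribute subsets are examined in order of increasing size, and the first cardinality at which consistent subsets appear yields the minimal reducts. The toy decision table and the consistency test are simplified illustrations, not the paper's software or hardware implementation.

# Blind breadth search for minimal reducts: candidate attribute subsets are
# examined in order of increasing size; the first size at which consistent
# subsets appear yields the minimal (minimum-cardinality) reducts.
from itertools import combinations

def is_consistent(rows, decisions, subset):
    """A subset is a (super-)reduct if objects agreeing on it never
    carry different decisions."""
    seen = {}
    for row, dec in zip(rows, decisions):
        key = tuple(row[a] for a in subset)
        if seen.setdefault(key, dec) != dec:
            return False
    return True

def minimal_reducts(rows, decisions, n_attributes):
    """Blind breadth search: stop at the first cardinality with any reduct."""
    for size in range(1, n_attributes + 1):
        found = [subset for subset in combinations(range(n_attributes), size)
                 if is_consistent(rows, decisions, subset)]
        if found:
            return found
    return []

if __name__ == "__main__":
    # toy decision table: 4 condition attributes, 1 decision column
    rows = [(1, 0, 1, 0), (1, 1, 0, 0), (0, 1, 1, 1), (0, 0, 0, 1)]
    decisions = [0, 0, 1, 1]
    print(minimal_reducts(rows, decisions, n_attributes=4))  # [(0,), (3,)]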


2017 ◽  
Vol 25 (4) ◽  
pp. 251-254 ◽  
Author(s):  
Sylwia Gierej

This article reviews selected issues related to the use of Big Data in industry. The aim is to define the potential scope and forms of using large data sets in manufacturing companies. By systematically reviewing scientific and professional literature, selected issues related to the use of mass data analytics in production were analyzed. A definition of Big Data was presented, detailing its main attributes. The importance of mass data processing technology in the development of the Industry 4.0 concept was highlighted. Subsequently, attention was paid to issues such as production process optimization, decision making, and mass production individualisation, and the potential of large volumes of data in these areas was indicated. As a result, conclusions were drawn regarding the potential of using Big Data in industry.


Author(s):  
Abhishek Bajpai ◽  
Dr. Sanjiv Sharma

As the volume of data produced in our society increases day by day, the exploration of big data in healthcare is growing at an unprecedented rate. Nowadays, big data is a very popular concept in various areas. This paper makes an effort to establish that even the healthcare industry is stepping into the big data pool to take advantage of its advanced tools and technologies. It reviews research across the healthcare realm that uses big data approaches and methodologies. Big data methodologies can be applied to healthcare data analytics (characterized by the four V's) to support better decisions that accelerate business profit and customer affection, to acquire a better understanding of market behaviours and trends, and to provide e-health services using Digital Imaging and Communications in Medicine (DICOM). Big data techniques such as MapReduce and machine learning can be applied to develop systems for early diagnosis of disease, i.e. analysis of chronic diseases such as heart disease, diabetes and stroke. The analysis of the data is performed using the big data analytics framework Hadoop, which is used to process large data sets. Further, the paper presents various big data tools, challenges and opportunities, and various hurdles, followed by the conclusion.
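As an illustration of the MapReduce style mentioned above, the sketch below is a minimal Hadoop Streaming-like mapper and reducer in Python that aggregate patient records by diagnosis code; the CSV layout (patient_id, age, diagnosis_code) and the job wiring are invented for illustration and are not from the paper.

# Hadoop Streaming-style sketch: the mapper emits a (diagnosis_code, 1)
# pair per patient record and the reducer sums the counts. The record
# layout is invented for illustration.
import sys
from itertools import groupby

def mapper(lines):
    """Map: read raw CSV records from stdin, emit 'diagnosis<TAB>1' lines."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) >= 3:
            print(f"{fields[2]}\t1")

def reducer(lines):
    """Reduce: sum counts per diagnosis from the sorted mapper output."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for diagnosis, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{diagnosis}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # run as:  cat records.csv | python job.py map | sort | python job.py reduce
    if sys.argv[1:] == ["map"]:
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)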


2017 ◽  
Vol 7 (1) ◽  
pp. 183-195
Author(s):  
Sasikala V

Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. The analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations and other business benefits.


Author(s):  
Caroline Chibelushi ◽  
Bernadette Sharp ◽  
Hanifa Shah

The advancement of multimedia and communication systems has not only provided faster and better communication facilities but has also facilitated easier means for organized crime. Concern about national security has increased significantly in recent years due to the increase in organized crime, leading to increasing amounts of data available for investigation by criminal analysts. The opportunity to analyze this data to determine patterns of criminal behavior and to monitor and predict criminal activities coexists with the threat of information overload. A large amount of information, stored in textual and unstructured form, constitutes a valuable untapped source of data. Data mining and text mining are two key technologies suited to the discovery of underlying patterns in large data sets. This chapter reviews the use of text mining techniques in crime detection projects and describes in detail the text mining approach used in the proposed ASKARI project.


Big Data analytics and Deep Learning should not be seen as two entirely different concepts. Big Data refers to extremely large data sets that can be analyzed to find patterns and trends. Deep Learning is one technique that can be applied in such analysis to help uncover abstract patterns in Big Data. By applying Deep Learning to Big Data, we can find unknown and useful patterns that were previously impossible to extract. With the help of Deep Learning, AI is getting smarter. There is a hypothesis in this regard: the more data, the more abstract the knowledge that can be learned. A concise survey of Big Data, Deep Learning, and the application of Deep Learning to Big Data is therefore necessary.

