Efficient storage, retrieval and analysis of poker hands: An adaptive data framework

2017 ◽  
Vol 27 (4) ◽  
pp. 713-726
Author(s):  
Marcin Gorawski ◽  
Michal Lorek

Abstract In online gambling, poker hands are one of the most popular and fundamental units of the game state and can be considered objects comprising all the events that pertain to a single hand played. In a situation where tens of millions of poker hands are produced daily and need to be stored and analysed quickly, the use of relational databases no longer provides high scalability and performance stability. The purpose of this paper is to present an efficient way of storing and retrieving poker hands in a big data environment. We propose a new, read-optimised storage model that offers significant data access improvements over traditional database systems as well as the existing Hadoop file formats such as ORC, RCFile or SequenceFile. Through index-oriented partition elimination, our file format reduces the number of file splits that need to be accessed and improves query response times by up to three orders of magnitude in comparison with other approaches. In addition, our file format supports a range of new indexing structures to facilitate fast row retrieval at the split level. Both index types operate independently of the Hive execution context and allow other big data computational frameworks such as MapReduce or Spark to benefit from the optimised data access path to the hand information. Moreover, we present a detailed analysis of our storage model and its supporting index structures, and how they are organised in the overall data framework. We also describe in detail how predicate-based expression trees are used to build effective file-level execution plans. Our experimental tests, conducted on a production cluster holding nearly 40 billion hands spanning over 4000 partitions, show that our multi-way partition pruning outperforms the existing file formats, resulting in faster query execution times and better cluster utilisation.
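The predicate-driven partition elimination described above can be illustrated with a small sketch. The class and field names below (Partition, Predicate, TsRange, And) are invented for illustration and do not reproduce the paper's actual format or API; the sketch only shows how a predicate expression tree can be evaluated against per-partition min/max statistics to discard file splits before any data is read.

```python
# Illustrative sketch only: a tiny predicate expression tree used to eliminate
# partitions by their min/max statistics. All names (Partition, Predicate,
# TsRange, And) are invented for this example, not the paper's actual API.

from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    path: str
    min_ts: int   # minimum hand timestamp stored in this partition
    max_ts: int   # maximum hand timestamp stored in this partition


class Predicate:
    def may_match(self, p: Partition) -> bool:
        raise NotImplementedError


@dataclass
class TsRange(Predicate):
    lo: int
    hi: int

    def may_match(self, p: Partition) -> bool:
        # Keep the partition only if its [min_ts, max_ts] range overlaps the query range.
        return not (p.max_ts < self.lo or p.min_ts > self.hi)


@dataclass
class And(Predicate):
    children: List[Predicate]

    def may_match(self, p: Partition) -> bool:
        return all(c.may_match(p) for c in self.children)


def prune(partitions: List[Partition], pred: Predicate) -> List[Partition]:
    """Return only the partitions (file splits) that can possibly satisfy the predicate."""
    return [p for p in partitions if pred.may_match(p)]


if __name__ == "__main__":
    parts = [Partition("day=2016-01-01", 100, 199),
             Partition("day=2016-01-02", 200, 299)]
    query = And([TsRange(lo=150, hi=180)])
    print([p.path for p in prune(parts, query)])   # only day=2016-01-01 survives
```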

Author(s):  
J. Boehm ◽  
K. Liu ◽  
C. Alis

In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore attractive to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data is typically encoded in specialised binary file formats, which are not naturally supported by the existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of a cluster. We test the capabilities of the proposed method to load billions of points into a commodity hardware compute cluster and we discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
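A minimal sketch of the ingestion pattern the abstract describes, written with PySpark: the driver distributes the list of input files and each worker runs an ordinary single-CPU point cloud reader on its share. It assumes the laspy library (and its laspy.read API) is installed on every worker node; the file paths are placeholders.

```python
# Sketch of distributed point cloud ingestion: parallelize the file list, then
# let each worker run a single-CPU reader library on its own subset of files.

from pyspark.sql import SparkSession


def read_points(path):
    """Read one LAS/LAZ file on a worker and yield (x, y, z) tuples."""
    import laspy  # imported inside the function so it is resolved on the executors
    las = laspy.read(path)
    for x, y, z in zip(las.x, las.y, las.z):
        yield (float(x), float(y), float(z))


if __name__ == "__main__":
    spark = SparkSession.builder.appName("pointcloud-ingest").getOrCreate()
    files = ["/data/tiles/tile_000.las", "/data/tiles/tile_001.las"]  # placeholder paths

    points = (spark.sparkContext
              .parallelize(files, numSlices=len(files))   # distribute the file list
              .flatMap(read_points))                       # each worker parses its files

    df = points.toDF(["x", "y", "z"])
    print(df.count())
    spark.stop()
```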


Big data is one of the most influential technologies of the modern era. However, supporting the maturity of big data systems requires the development and sustenance of heterogeneous environments, which in turn requires the integration of technologies as well as concepts. Computing and storage are the two core components of any big data system. That said, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution, which brings the facet of big data file formats into the picture. This paper classifies the available big data file formats into five categories, namely text-based, row-based, column-based, in-memory and data storage services. It also compares the advantages, shortcomings and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Lastly, it discusses the trade-offs that must be considered while choosing a file format for a big data system, providing a framework for creating file format selection criteria.
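As a rough illustration of how these categories differ in practice, the sketch below writes the same Spark DataFrame in a text-based and two column-based formats; paths are placeholders, and row-based formats such as Avro may require an additional package depending on the Spark distribution.

```python
# Writing the same data in different Hadoop file formats from Spark.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()
df = spark.createDataFrame(
    [(1, "alice", 3.2), (2, "bob", 1.7)],
    ["id", "name", "score"],
)

df.write.mode("overwrite").csv("/tmp/demo_csv")          # text-based: human readable, no embedded schema
df.write.mode("overwrite").json("/tmp/demo_json")        # text-based: self-describing rows, verbose
df.write.mode("overwrite").orc("/tmp/demo_orc")          # column-based: good compression, predicate pushdown
df.write.mode("overwrite").parquet("/tmp/demo_parquet")  # column-based: wide ecosystem support
# Row-based formats such as Avro use df.write.format("avro"),
# which may need the external spark-avro package.

spark.stop()
```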


2020 ◽  
Vol 17 (1) ◽  
pp. 513-518
Author(s):  
Shashi Pal Singh ◽  
Ajai Kumar ◽  
Rachna Awasthi ◽  
Neetu Yadav ◽  
Shikha Jain

Today, data exists in many sources and file formats, with different structures and types, and a huge part of it is unstructured content from the internet and social media. This gives rise to the categorization of data as unstructured, semi-structured and structured. Data that exists in an irregular manner, without any particular schema, is referred to as unstructured data and is very difficult to process because of its irregularities and ambiguities. We therefore focus on an Intelligent Processing Unit that converts unstructured big data into intelligent, meaningful information. Intelligent text extraction is a technique that automatically identifies and extracts text from a file format. The system consists of several stages, including pre-processing, key-phrase extraction and transformation, to extract text and retrieve structured data from unstructured data; combining multiple methods/approaches gives better results. We currently work with various file formats, converting each file into DOCX: the file arrives in unstructured form and is obtained in structured form with the help of intelligent pre-processing. The pre-processing stages turn the unstructured data/corpus into structured, meaningful data. In the initial stage the system removes stop words, unwanted symbols, noisy data and line spacing. In the second stage, data is extracted from the various source files into properly formatted plain text. In the third stage, the data is transformed from one format to another so that the user can understand it. The final step rebuilds the file in its original format while maintaining the file's tags. Large files are divided into smaller sub-files so that parallel processing algorithms can process them quickly; parallel processing is a very important concept for text extraction, as breaking a big file into small files improves the result. Extraction is performed bilingually and represents the most relevant information contained in the document. Key-phrase extraction is an important problem in data mining, knowledge retrieval and natural language processing, and keyword extraction techniques are used to abstract keywords that uniquely identify a document. Rebuilding is an important part of this project: the entire concept is applied to each file format, and at the end the output must be in the same format as the input file. Although this concept is widely used, little work has been done on developing many of these functionalities under one tool, which motivates a tool that can easily and efficiently convert unstructured files into structured ones.
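The authors' actual pipeline is not reproduced here; the following is a minimal sketch, under simplified assumptions, of two of the steps described above: cleaning with stop-word removal, and splitting a large file into chunks that are processed in parallel. The stop-word list, cleaning rules and chunk size are placeholders.

```python
# Simplified sketch: clean text chunks (stop words, noisy symbols, spacing)
# and process a large file in parallel by splitting it into smaller pieces.

import re
from multiprocessing import Pool

STOP_WORDS = {"the", "a", "an", "is", "of", "and"}   # tiny placeholder list


def clean_chunk(text: str) -> str:
    """First stage: remove unwanted symbols, extra spacing and stop words."""
    text = re.sub(r"[^\w\s]", " ", text)              # drop noisy symbols
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


def process_file(path: str, chunk_size: int = 100_000, workers: int = 4) -> str:
    """Split a large plain-text file into chunks and clean them in parallel."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    with Pool(workers) as pool:
        cleaned = pool.map(clean_chunk, chunks)
    return " ".join(cleaned)


if __name__ == "__main__":
    print(process_file("example.txt"))                # placeholder input file
```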


Author(s):  
Lucas M. Ponce ◽  
Walter dos Santos ◽  
Wagner Meira ◽  
Dorgival Guedes ◽  
Daniele Lezzi ◽  
...  

Abstract High-performance computing (HPC) and massive data processing (Big Data) are two trends that are beginning to converge. In that process, aspects of hardware architectures, systems support and programming paradigms are being revisited from both perspectives. This paper presents our experience on this path of convergence with the proposal of a framework that addresses some of the programming issues derived from such integration. Our contribution is the development of an integrated environment that combines (i) COMPSs, a programming framework for the development and execution of parallel applications for distributed infrastructures; (ii) Lemonade, a data mining and analysis tool; and (iii) HDFS, the most widely used distributed file system for Big Data systems. To validate our framework, we used Lemonade to create COMPSs applications that access data through HDFS, and compared them with equivalent applications built with Spark, a popular Big Data framework. The results show that the HDFS integration benefits COMPSs by simplifying data access and by rearranging data transfer, reducing execution time. The integration with Lemonade facilitates the use of COMPSs and may help its popularization in the Data Science community, by providing efficient algorithm implementations for experts from the data domain who want to develop applications at a higher level of abstraction.
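A hedged sketch of the programming pattern described above: PyCOMPSs tasks that each read one file from HDFS, with the main program aggregating the partial results. The paper's own COMPSs-HDFS integration API is not reproduced here; the generic Python hdfs client is used as a stand-in, and the endpoint and paths are placeholders.

```python
# Sketch: COMPSs tasks that fetch data from HDFS and a main program that
# synchronises and aggregates their results.

from pycompss.api.task import task
from pycompss.api.api import compss_wait_on


@task(returns=int)
def count_lines(hdfs_url: str, path: str) -> int:
    """Runs as a COMPSs task on a worker: fetch one file and count its lines."""
    from hdfs import InsecureClient           # stand-in HDFS client, not the paper's API
    client = InsecureClient(hdfs_url)
    with client.read(path, encoding="utf-8") as reader:
        return reader.read().count("\n")


def main():
    hdfs_url = "http://namenode:9870"          # placeholder WebHDFS endpoint
    paths = ["/data/part-0000", "/data/part-0001"]
    partial = [count_lines(hdfs_url, p) for p in paths]   # tasks may run in parallel
    partial = compss_wait_on(partial)                     # synchronise the futures
    print("total lines:", sum(partial))


if __name__ == "__main__":
    main()
```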


Author(s):  
Madhavi Tota

Big Data is one of the most dynamic issues of recent years, enabling computing resources and data to be provided as Information Technology services with high efficiency and effectiveness. The amount of data in the world is growing day by day, and it is growing very rapidly because of the use of the internet, smartphones and social networks; data sizes are now measured in petabytes and exabytes. Traditional database systems are not able to capture, store and analyze this large amount of data. In the digital and computing world, information is generated and collected at a rate that rapidly exceeds the limits of traditional systems. In the current scenario, the growth of such large data creates a number of challenges, such as the fast growth of data, access speed, diverse data, and security. This paper presents the fundamental concepts of Big Data, along with the privacy threats and security methods used in Big Data. With the development of research applications and resources on the Internet/Mobile Internet, social networks and the Internet of Things, big data has become a very important research topic across the world; at the same time, big data carries security risks and requires privacy protection during different stages such as collecting, storing, analyzing and utilizing data. This paper introduces security measures for big data and then proposes technology to address the security threats.


2020 ◽  
Author(s):  
Fernando de Assis Rodrigues ◽  
Pedro Henrique Santos Bisi ◽  
Ricardo César Gonçalves Sant’Ana

The goal of this study is to identify semantic characteristics of datasets, at the moment of data collection, from the dataset structures found in the export interfaces of user-interaction analysis tools, Internet communication channels, and statistical data access tools involved in the management of a scientific journal, through the application of data analysis and data modeling techniques. The research universe was delimited to exportable dataset structures found in journal publishing systems, online social network statistics, search engines, and web analytics tools. The sample analyzed was restricted to dataset structures available in reports from Open Journal Systems (OJS), Google Analytics, Google Search Console, Twitter Analytics, and Facebook Insights; none of these resources presented any version control numbering, except OJS (2.6). The data was collected in September 2017 from the accounts of the "Electronic Journal Digital Skills for Family Farming". An exploratory analysis methodology was adopted to identify how data are made available and structured in those resources, systematically describing the datasets, entities, and attributes related to the interaction between users and the communication channels of a scientific journal. A total of 255 exportable datasets were found, distributed across 5 file formats: Comma-Separated Values (CSV) (82), Google Docs Spreadsheet File Format (69), Excel Microsoft Office Open XML Format Spreadsheet file (50), Portable Document Format (50), and Excel Binary File Format (3). Except for CSV, all other file formats were discarded, mainly because CSV is a machine-readable, open file format available in every export interface analyzed. The 82 CSV datasets were collected from Google Analytics (50), Google Search (20), Open Journal Systems (7), Facebook Insights (3), and Twitter Analytics (2). To systematize the analysis, concepts from the Entity-Relationship (ER) Model (Silberschatz, Korth, & Sudarshan, 2010) were applied, with entities to store data collected from i) services, ii) resources available in the services, iii) datasets available in the resources, and iv) attributes available in the datasets. Two auxiliary tables were also developed: i) format, to store the file format types available in the datasets, and ii) data type, to store data types: "a named (and in practice finite) set of values" (Date, 2016, p. 228). This ER model provides a structure to store the entities and attributes of each dataset. Applying this ER structure to the data collected in this study, it was possible to identify 82 entities and 2280 attributes, with a subset of 1342 unique attribute labels. The ER structure and data were stored in a Google Spreadsheet file, which was then uploaded to a DataBase Management System (DBMS) for further analysis. A Python script was developed to reorder the data stored in the DBMS into a new data structure, adopting the Online Analytical Processing (OLAP) cube as representation, with Service (s), Entity (e), and Attribute (a) data used as dimensions (Gray, Bosworth, Layman, & Pirahesh, 1996; Inmon, 1996; Kimball & Ross, 2011). The collected data was reordered into the OLAP cube dimensions by a pivot table process (Cornell, 2005). The intention was to observe, in the intersections of the OLAP cube, the characteristics shared internally and externally by services, entities, and attributes that can affect semantic aspects of data collection.
The results show that 88.69% of the attributes are not related to any description of their content. In addition, all attributes that share equal labels across distinct services came without a description at collection time. This subset of attributes is significantly important for the interoperability of those datasets, given its capability to distinguish the context of the collecting process and to form part of a group of potential primary keys or unique fields, helping to build relationships between data from these sources, or even to support geographic, temporal or linguistic determination.
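The pivot-table reordering into an OLAP-style cube described above can be sketched with pandas as follows; the rows are invented examples, not the study's data, and the aggregation simply counts attributes at each (service, entity) intersection.

```python
# Minimal sketch of building an OLAP-style cube of services, entities and
# attributes via a pivot table; the records below are invented examples.

import pandas as pd

records = pd.DataFrame([
    {"service": "Google Analytics",  "entity": "Pages",  "attribute": "Pageviews"},
    {"service": "Google Analytics",  "entity": "Pages",  "attribute": "Bounce Rate"},
    {"service": "Twitter Analytics", "entity": "Tweets", "attribute": "Impressions"},
])

# Count attributes on the (service, entity) intersections of the cube.
cube = pd.pivot_table(records,
                      index="service",
                      columns="entity",
                      values="attribute",
                      aggfunc="count",
                      fill_value=0)
print(cube)
```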


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 226380-226396
Author(s):  
Diana Martinez-Mosquera ◽  
Rosa Navarrete ◽  
Sergio Lujan-Mora

2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Mahdi Torabzadehkashi ◽  
Siavash Rezaei ◽  
Ali HeydariGorji ◽  
Hosein Bobarshad ◽  
Vladimir Alves ◽  
...  

Abstract In the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes a data movement cost on the system. This cost is directly related to the distance between the processing engines and the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move processing closer to the data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for the applications. Thus, a vast spectrum of applications can be ported to run on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place without any modifications to the underlying distributed processing framework. As a proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks and investigate the benefits of deploying Catalina CSDs in distributed processing environments. The experimental results show up to a 2.2× improvement in performance and a 4.3× reduction in energy consumption when running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved by up to 5.4× and 8.9×, respectively.
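For context, the kind of job the abstract says runs unmodified on a CSD-equipped cluster is an ordinary Hadoop MapReduce (here, Hadoop Streaming) program such as the word-count sketch below; nothing in it is specific to Catalina.

```python
# Ordinary Hadoop Streaming word count: a mapper emitting (word, 1) pairs and a
# reducer summing counts over the key-sorted input Hadoop delivers on stdin.

import sys


def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    # Invoke as `python wordcount.py map` or `python wordcount.py reduce`
    # from the hadoop-streaming command line.
    mapper() if sys.argv[1] == "map" else reducer()
```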

