Identifying semantic characteristics of user interaction datasets through application of a data analysis

2020 ◽  
Author(s):  
Fernando de Assis Rodrigues ◽  
Pedro Henrique Santos Bisi ◽  
Ricardo César Gonçalves Sant’Ana

The goal of this study is to identify semantic characteristics of datasets, at the moment of data collection, from the dataset structures found on the export interfaces of user-interaction analysis tools, Internet communication channels, and statistical data access tools involved in managing a scientific journal, through the application of data analysis and data modelling techniques. The research universe was delimited to the exportable dataset structures found in journal publishing systems, online social network statistics, search engines, and web analytics tools. The sample analyzed was restricted to the dataset structures available in reports from Open Journal Systems (OJS), Google Analytics, Google Search Console, Twitter Analytics, and Facebook Insights. None of these resources presented version control numbering, except OJS (2.6). The data were collected in September 2017 from the accounts of the "Electronic Journal Digital Skills for Family Farming". An exploratory analysis methodology was adopted to identify how data are made available and structured in these resources, systematically describing the datasets, entities, and attributes related to the interaction between users and the communication channels of a scientific journal. A total of 255 exportable datasets were found, distributed across 5 file formats: Comma-Separated Values (CSV) (82), Google Docs Spreadsheet File Format (69), Excel Microsoft Office Open XML Format Spreadsheet (50), Portable Document Format (50), and Excel Binary File Format (3). All formats except CSV were discarded, mainly because CSV is an open, machine-readable format available in every export interface analyzed. In total, 82 CSV datasets were collected from Google Analytics (50), Google Search (20), Open Journal Systems (7), Facebook Insights (3), and Twitter Analytics (2). To systematize the analysis, concepts from the Entity-Relationship (ER) Model (Silberschatz, Korth, & Sudarshan, 2010) were applied, with entities storing the data collected from i) services, ii) resources available in the services, iii) datasets available in the resources, and iv) attributes available in the datasets. Two auxiliary tables were also developed: i) format, to store the file format types available for the datasets, and ii) data type, to store data types: "a named (and in practice finite) set of values" (Date, 2016, p. 228). This ER model provides a structure to store data about the entities and attributes of each dataset. Applying this ER structure to the data collected in this study, it was possible to identify 82 entities and 2280 attributes, with a subset of 1342 unique attribute labels. The ER structure and data were stored in a Google Spreadsheet file, which was then uploaded to a DataBase Management System (DBMS) for further analysis. A Python script was developed to reorder the data stored in the DBMS into a new data structure, adopting the Online Analytical Processing (OLAP) cube as representation, with Service (s), Entity (e), and Attribute (a) used as dimensions (Gray, Bosworth, Layman, & Pirahesh, 1996; Inmon, 1996; Kimball & Ross, 2011). The collected data were reordered into the OLAP cube dimensions through a pivot table process (Cornell, 2005). The intention was to observe, at the intersections of the OLAP cube, the characteristics shared internally and externally by services, entities, and attributes that can affect semantic aspects of data collection.
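To illustrate the pivot-table step used to reorder the collected records into the Service × Entity × Attribute cube, the sketch below is a minimal, hypothetical version in Python with pandas; the paper's actual script is not reproduced here, and all sample rows and field names are assumptions made for demonstration only.

```python
import pandas as pd

# Hypothetical records of the form produced by the ER structure:
# one row per attribute found in a dataset exported by a service.
records = pd.DataFrame([
    {"service": "Google Analytics", "entity": "Pageviews", "attribute": "Page", "described": False},
    {"service": "Google Analytics", "entity": "Pageviews", "attribute": "Pageviews", "described": False},
    {"service": "Twitter Analytics", "entity": "Tweets", "attribute": "impressions", "described": True},
])

# Pivot the flat records into a Service x Entity view, counting attributes
# at each intersection (one face of the Service/Entity/Attribute cube).
cube_face = records.pivot_table(
    index="service",
    columns="entity",
    values="attribute",
    aggfunc="count",
    fill_value=0,
)
print(cube_face)

# Share of attributes without any content description, per service.
print(records.groupby("service")["described"].apply(lambda s: 1 - s.mean()))
```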
The results show that 88.69% of the attributes are not accompanied by any description of their content. In addition, all attributes that share identical labels across distinct services came without a description at collection time. This subset of attributes is of significant importance for the interoperability of these datasets, given its capacity to distinguish the context of the collection process and to form part of a group of potential primary keys or unique fields, helping to build relationships between data from these sources, or even to determine geographic, temporal, or linguistic scope.

2017 ◽  
Vol 27 (4) ◽  
pp. 713-726
Author(s):  
Marcin Gorawski ◽  
Michal Lorek

In online gambling, poker hands are one of the most popular and fundamental units of the game state and can be considered objects comprising all the events that pertain to a single hand played. In a situation where tens of millions of poker hands are produced daily and need to be stored and analysed quickly, the use of relational databases no longer provides high scalability and performance stability. The purpose of this paper is to present an efficient way of storing and retrieving poker hands in a big data environment. We propose a new, read-optimised storage model that offers significant data access improvements over traditional database systems as well as existing Hadoop file formats such as ORC, RCFile or SequenceFile. Through index-oriented partition elimination, our file format reduces the number of file splits that need to be accessed, and improves query response time by up to three orders of magnitude in comparison with other approaches. In addition, our file format supports a range of new indexing structures to facilitate fast row retrieval at the split level. Both index types operate independently of the Hive execution context and allow other big data computational frameworks such as MapReduce or Spark to benefit from the optimised data access path to the hand information. Moreover, we present a detailed analysis of our storage model and its supporting index structures, and how they are organised in the overall data framework. We also describe in detail how predicate-based expression trees are used to build effective file-level execution plans. Our experimental tests, conducted on a production cluster holding nearly 40 billion hands spanning over 4000 partitions, show that multi-way partition pruning outperforms other existing file formats, resulting in faster query execution times and better cluster utilisation.
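To make the index-oriented partition elimination idea concrete, below is a minimal Python sketch of how per-split min/max statistics for an indexed column (here a hypothetical hand_id) could be used to prune file splits before reading them; the data structures, file names, and value ranges are assumptions for illustration, not the paper's actual file format.

```python
from dataclasses import dataclass

@dataclass
class SplitIndex:
    """Hypothetical per-split metadata: the min/max of an indexed column."""
    path: str
    min_hand_id: int
    max_hand_id: int

# Index entries that a reader would consult instead of scanning every split.
split_indexes = [
    SplitIndex("part-0000", 1, 1_000_000),
    SplitIndex("part-0001", 1_000_001, 2_000_000),
    SplitIndex("part-0002", 2_000_001, 3_000_000),
]

def prune_splits(indexes, lo, hi):
    """Keep only splits whose hand_id range overlaps the predicate
    lo <= hand_id <= hi; all other splits are eliminated without being read."""
    return [ix.path for ix in indexes if ix.max_hand_id >= lo and ix.min_hand_id <= hi]

# A query such as "WHERE hand_id BETWEEN 1500000 AND 1600000"
# then touches a single split instead of all three.
print(prune_splits(split_indexes, 1_500_000, 1_600_000))
```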


2014 ◽  
Vol 940 ◽  
pp. 433-436 ◽  
Author(s):  
Ying Zhang ◽  
Xin Shi

Based on a detailed analysis of the STL file format, Visual C++ 6.0 was used to extract the information contained in STL ASCII and binary files, while OpenGL triangle-drawing techniques were used for the graphical representation of the STL model, with rendering functions such as materials, coordinate transformations, and lighting, finally realizing the loading and three-dimensional display of both the STL ASCII and binary file formats.
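For reference, the binary STL layout referred to above is simple: an 80-byte header, a 32-bit triangle count, and 50 bytes per triangle (a normal vector, three vertices, and a 2-byte attribute field). The following minimal reader sketches the parsing step in Python, for consistency with the other sketches in this listing rather than the VC++ used by the authors; the file path is a placeholder.

```python
import struct

def read_binary_stl(path):
    """Read a binary STL file: 80-byte header, uint32 triangle count,
    then 50 bytes per triangle (normal, 3 vertices, uint16 attribute)."""
    triangles = []
    with open(path, "rb") as f:
        header = f.read(80)                       # free-form header, often ignored
        (count,) = struct.unpack("<I", f.read(4))
        for _ in range(count):
            data = struct.unpack("<12fH", f.read(50))
            normal = data[0:3]
            vertices = (data[3:6], data[6:9], data[9:12])
            triangles.append((normal, vertices))
    return header, triangles

# Usage (placeholder path): header, tris = read_binary_stl("model.stl")
```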


2021 ◽  
Vol 8 (1) ◽  
pp. 15
Author(s):  
Eko Nur Hermansyah ◽  
Danny Manongga ◽  
Ade Iriani

<p align="center"><strong>Abstrak</strong></p><p>Intansi Kearsipan memiliki berbagai pengetahuan yang digunakan untuk pengelolaan arsip yang dimilikinya, <em>knowledge management</em> digunakan untuk mengumpulkan, mengelola, dan menyebarluaskan pengetahuan yang dimiliki, sehingga pengetahuan yang dimiliki oleh instansi kearsipan dapat digunakan untuk kemajuan intansi dan tidak hilang. Penelitian ini dilakukan di Dinas Perpustakaan dan Kearsipan Kota Salatiga. Pengumpulan data dilakukan dengan wawancara petugas kearsipan untuk mengumpulkan data tentang pengetahuan yang dimiliki dan cara penyimpanan serta penyebarluasan yang diterapkan di intansi kearsipan. Analisis data dilakukan dengan mengelompokkan pengetahuan yang dimiliki oleh intansi kearsipan sesuai dengan model <em>Choo-Sense Making</em>, untuk kemudian diterapkan di <em>Confluence</em> sesuai dengan hasil dari pengolahan data dengan model <em>Choo-Sense Making</em>. Hasil dari penelitian ini untuk model <em>Choo-Sense Making</em> pengetahuan di intansi kearsipan dibagi atas 3 tahap yaitu <em>Sense Making</em> yang berisi tentang pengetahuan yang berasal dari luar intansi dibuatkan wadah sebagai media diskusi, <em>chatting,</em> <em>Knowledge Creating</em> berisi tentang pengetahuan-pengetahuan yang dimiliki intansi kearsipan yang telah di dokumentasikan diubah dalam betuk <em>softfile</em> kemudian diunggah kedalam <em>space</em> untuk memudahkan penyimpanan serta penyebarluasan pengetahuan yang dimiliki, dan <em>Decision Making</em> yang berisi tentang jadwal-jadwal intansi dan evaluasi yang dilakukan intansi kearsipan. Hasil dari model <em>Choo-Sense Making</em> dimasukan ke <em>Confluence</em>, memperoleh hasil <em>space</em> yang dapat memudahkan menyimpan pengetahuan yang dimiliki berupa file aplikasi, <em>softfile</em>, serta memudahkan dalam pencarian kembali dan penyerluasan pengetahuan yang dimiliki. Penerapan <em>Choo-Sense Making</em> selain untuk mempermudah penyimpanan dan penyerbaluasan serta komunikasi, dapat mengurangi resiko kehilangan pengetahuan yang dimiliki oleh intansi kearsipan.<strong> </strong></p><p><strong>Kata kunci<em>: </em></strong><em>Knowledge Management, Model Choo-Sense Making, Confluence</em>, Perpustakaan dan Arsip</p><p align="center"><em>Abstract</em></p><p><em>Archival Agency has several knowledge that are used to manage the owned archive, knowledge management is used to collect, manage and disseminate the owned knowledge so that the knowledge that the archival agency has can be used for the agency progress and it will not missing. The research is conducted in Dinas Perpustakaan dan Kearsipan Kota Salatiga. Data collecting is conducted by interviewing the archival officer to gather data related to its knowledge, the storage system and dissemination applied in this archival agency. Data analysis is conducted by categorizing the agency knowledge according to Choo-Sense Making model and then it is applied in Confluence in accordance with the result of the data analysis from the Choo-Sense Making model. The result of this research, for Choo-Sense Making model, the knowledge in the archival agency is divided into 3 steps; Sense Making, Knowledge Creating and Decision Making. Sense Making contains knowledge coming from the outside of the agency that has forum as discussion media, chatting. Knowledge Creating contains knowledge that owned by the archival agency that has been documented and changed in the form of softfile then uploaded into space to ease the storage and the knowledge dissemination. 
Decision Making is about agency schedules and evaluation toward the activity in this archival agency. The result of Choo-Sense Making Model is input into Confluence, get space result that ease to save the knowledge in the form of application file, softfile, and ease to search and disseminate the owned knowledge. The application of Choo-Sense Making eases the storage system, dissemination, and communication. It also reduces the risk of losing knowledge owned by the archival agency.</em></p><p><em> </em></p><p><strong>Keywords</strong>: <em>Knowledge Management, Model Choo-Sense Making, Confluence, Library and Archive</em></p>


Author(s):  
Sulastri Rini Rindrayani

This quantitative study aims to determine the effect of micro teaching and of guidance from the trainee teacher on the teaching ability of apprentice students of the Economics Department of STKIP PGRI Tulungagung. A questionnaire was used for data collection. The results of data analysis using multiple regression show that (1) there is a significant positive effect of micro teaching on the apprentice students' teaching ability, (2) there is a significant positive effect of guidance from the trainee teacher on the apprentice students' teaching ability, and (3) there is a positive joint effect of micro teaching and guidance from the trainee teacher on the apprentice students' teaching ability. The magnitude of the effect is 44.2%, while the remaining 55.8% is influenced by other factors.
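The 44.2% figure in abstracts of this kind is typically the coefficient of determination (R²) of the regression; as a purely illustrative sketch under that assumption, and with synthetic data rather than the study's questionnaire scores, a two-predictor multiple regression could be fitted as follows.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic illustration only: scores for micro teaching (X1), trainee-teacher
# guidance (X2), and teaching ability (y) for 60 hypothetical students.
rng = np.random.default_rng(0)
X1 = rng.normal(70, 10, 60)
X2 = rng.normal(65, 12, 60)
y = 0.4 * X1 + 0.3 * X2 + rng.normal(0, 8, 60)

X = sm.add_constant(np.column_stack([X1, X2]))  # intercept plus two predictors
model = sm.OLS(y, X).fit()

print(model.params)     # intercept and the two regression coefficients
print(model.pvalues)    # significance of each predictor
print(model.rsquared)   # proportion of variance explained (cf. the reported 44.2%)
```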


2020 ◽  
Vol 245 ◽  
pp. 06042
Author(s):  
Oliver Gutsche ◽  
Igor Mandrichenko

A columnar data representation is known to be an efficient way to store data, specifically in cases where the analysis is often based on only a small fragment of the available data structures. A data representation like Apache Parquet goes a step beyond a plain columnar representation by also splitting data horizontally, which allows for easy parallelization of data analysis. Based on the general idea of columnar data storage, and working on the [LDRD Project], we have developed a striped data representation which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approach allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We present an implementation and some performance characteristics of such a data representation mechanism using a distributed no-SQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple enough to allow a common data model and common APIs for a wide range of underlying storage mechanisms, such as distributed no-SQL databases and local file systems. Striped storage adopts Numpy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service which hides the server implementation details from the end user, easily exposes data to WAN users, and makes it possible to utilize well-known and mature data caching solutions to further increase data access efficiency. We consider the Striped Data Server to be the core of an enterprise-scale data analysis platform for High Energy Physics and similar areas of data processing. We have been testing this architecture with a 2 TB dataset from a CMS dark matter search and plan to expand it to multiple 100 TB or even PB scale. We present the striped format, the Striped Data Server architecture, and performance test results.
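As a rough illustration of the striped idea, and not the project's actual implementation, the sketch below stores each column of a dataset as a sequence of fixed-size Numpy "stripes" that can be reduced independently and in parallel; the column names, values, and stripe size are arbitrary assumptions.

```python
import numpy as np

STRIPE_SIZE = 4  # rows per stripe; real deployments would use far larger stripes

def to_stripes(column, stripe_size=STRIPE_SIZE):
    """Split a 1-D column array into fixed-size stripes."""
    return [column[i:i + stripe_size] for i in range(0, len(column), stripe_size)]

# Hypothetical event data: two columns stored column-wise.
pt  = np.array([12.1, 45.3, 8.7, 30.2, 55.0, 22.4, 9.9, 41.8, 17.3])
eta = np.array([0.5, -1.2, 2.1, 0.3, -0.8, 1.7, -2.3, 0.1, 0.9])

stripes = {"pt": to_stripes(pt), "eta": to_stripes(eta)}

# Each stripe can be reduced independently (e.g. by separate workers),
# and the partial results combined afterwards.
partial_sums = [s.sum() for s in stripes["pt"]]
print(sum(partial_sums), pt.sum())  # same total, computed stripe by stripe
```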


2018 ◽  
Author(s):  
Kimberly Megan Scott ◽  
Melissa Kline

As more researchers make their datasets openly available, the potential of secondary data analysis to address new questions increases. However, the distinction between primary and secondary data analysis is unnecessarily confounded with the distinction between confirmatory and exploratory research. We propose a framework, akin to library book checkout records, for logging access to datasets in order to support confirmatory analysis where appropriate. This would support a standard form of preregistration for secondary data analysis, allowing authors to demonstrate that their plans were registered prior to data access. We discuss the critical elements of such a system, its strengths and limitations, and potential extensions.


2021 ◽  
Author(s):  
Ane van Schalkwyk ◽  
Sara Grobbelaar ◽  
Euodia Vermeulen

BACKGROUND: There is growing interest in the potential benefits and application of log data for the evaluation of mHealth apps. However, the process by which insights may be derived from log data remains unstructured, resulting in underutilisation of mHealth data.
OBJECTIVE: Through a scoping review, we aimed to understand how log data analysis can be used to generate valuable insights in support of realistic evaluations of mobile apps. This understanding is delineated according to publication trends, associated concepts and characteristics of log data, the frameworks or processes required to develop insights from log data, and how these insights may be utilised in the evaluation of apps.
METHODS: The PRISMA-ScR guidelines for a scoping review were followed. The Scopus database, the Journal of Medical Internet Research (JMIR), and grey literature (through a Google search) yielded 105 articles, of which 33 were retained in the sample for analysis and synthesis.
RESULTS: A definition of log data is developed from its characteristics and articulated as: anonymous records of users' real-time interactions with the application, collected objectively or automatically and often accessed from cloud-based storage. Publications of theoretical and empirical work on log data analysis increased between 2010 and 2021 (100% and 95%, respectively). The research approach is distributed between inductive (43%), deductive (30%), and hybrid (27%). Research methods include mixed methods (73%) and quantitative only (27%), although mixed methods have dominated since 2018. Only 30% of studies articulated the use of a framework or model to perform the log data analysis. Four main focus areas for log data analysis are identified: usability (40%), engagement (15%), effectiveness (15%), and adherence (15%). On average, one year of log data is used for analysis, with an average of three years between the launch of the app and the analysis. Collected indicators include user events or clicks made, the specific features of the app used, and the timestamps of clicks. The main calculated indicators are features used or not used (24/33), frequency (21/33), and duration (18/33). Reporting the calculated indicators per user or user group was the most common reference point.
CONCLUSIONS: Standardised terminology, processes, frameworks, and explicit benchmarks for utilising log data are lacking in the literature. This establishes the need for a conceptual framework able to standardise the log analysis of mobile apps; we provide a summary of concepts towards such a framework.
CLINICALTRIAL: NA
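To make the calculated indicators concrete, here is a minimal sketch, our illustration rather than an approach taken from any reviewed study, of deriving per-user frequency and duration indicators from raw event logs with pandas; the column names, sample events, and the 30-minute session threshold are assumptions.

```python
import pandas as pd

# Hypothetical raw log: one row per user event, as typically exported
# from an mHealth app's analytics backend.
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "feature": ["diary", "diary", "reminders", "diary", "goals"],
    "timestamp": pd.to_datetime([
        "2021-03-01 08:00", "2021-03-01 08:05", "2021-03-02 19:30",
        "2021-03-01 12:00", "2021-03-05 12:10",
    ]),
}).sort_values(["user_id", "timestamp"])

# Frequency: number of events per user and feature (features used / not used).
frequency = events.groupby(["user_id", "feature"]).size().unstack(fill_value=0)

# Duration: only gaps under 30 minutes are counted as time spent in a session.
events["gap"] = events.groupby("user_id")["timestamp"].diff()
in_session = events[events["gap"] < pd.Timedelta(minutes=30)]
duration = in_session.groupby("user_id")["gap"].sum()

print(frequency)
print(duration)
```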


2018 ◽  
pp. 218-233
Author(s):  
Mayank Yuvaraj

When planning an institutional repository, a digital library collection, or a digital preservation service, it is inevitable to draft file format policies in order to ensure long-term digital preservation, accessibility, and compatibility. Sincere efforts have been made to encourage the adoption of standard formats, yet digital preservation policies vary from library to library. Against this background, the present paper aims to give the digital preservation community a common understanding of the file formats commonly used in digital libraries and institutional repositories. The paper discusses both open and proprietary file formats for several media types.


2019 ◽  
Vol 2 (1) ◽  
pp. 45-54 ◽  
Author(s):  
Kimberly M. Scott ◽  
Melissa Kline

As more researchers make their data sets openly available, the potential of secondary data analysis to address new questions increases. However, the distinction between primary and secondary data analysis is unnecessarily confounded with the distinction between confirmatory and exploratory research. We propose a framework, akin to library-book checkout records, for logging access to data sets in order to support confirmatory analysis when appropriate. This system would support a standard form of preregistration for secondary data analysis, allowing authors to demonstrate that their plans were registered prior to data access. We discuss the critical elements of such a system, its strengths and limitations, and potential extensions.
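As a purely illustrative sketch (the authors describe a framework, not a specific implementation), a data-set "checkout" log could be as simple as append-only records like the following, letting a later reader verify that a preregistration predates the first data access; all field names and values here are hypothetical.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class CheckoutRecord:
    """One append-only entry in a data-set access log."""
    dataset_id: str
    researcher: str
    accessed_at: str          # ISO-8601 timestamp of first data access
    preregistration_url: str  # plan registered before the access, if any

log = [
    CheckoutRecord(
        dataset_id="example-archive/dataset-001",
        researcher="j.doe@example.org",
        accessed_at=datetime.now(timezone.utc).isoformat(),
        preregistration_url="https://osf.io/xxxxx",
    )
]

# Confirmatory claims can then be checked against the ordering of the
# registration time versus accessed_at for the relevant data set.
print(json.dumps([asdict(r) for r in log], indent=2))
```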


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Kyle Ellrott ◽  
Alex Buchanan ◽  
Allison Creason ◽  
Michael Mason ◽  
Thomas Schaffter ◽  
...  

Challenges are achieving broad acceptance for addressing many biomedical questions and enabling tool assessment. However, ensuring that the methods evaluated are reproducible and reusable is complicated by the diversity of software architectures, input and output file formats, and computing environments. To mitigate these problems, some challenges have leveraged new virtualization and compute methods, requiring participants to submit cloud-ready software packages. We review recent data challenges with innovative approaches to model reproducibility and data sharing, and outline key lessons for improving quantitative biomedical data analysis through crowd-sourced benchmarking challenges.

