Capturing and querying fine-grained provenance of preprocessing pipelines in data science

2020 ◽  
Vol 14 (4) ◽  
pp. 507-520
Author(s):  
Adriane Chapman ◽  
Paolo Missier ◽  
Giulia Simonelli ◽  
Riccardo Torlone

Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models' accuracy and performance, as well as on other properties, such as model fairness. It is therefore important to provide developers with the means to gain an in-depth understanding of how the pipeline steps affect the data, from the raw input to training sets ready to be used for learning. While other efforts track creation and changes of pipelines of relational operators, in this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, and the definition of provenance patterns for each of them, and (ii) a prototype implementation of an application-level provenance capture library that works alongside Python. We report on provenance processing and storage overhead and scalability experiments, carried out over both real ML benchmark pipelines and over TPC-DI, and show how the resulting provenance can be used to answer a suite of provenance benchmark queries that underpin some of the developers' debugging questions, as expressed on the Data Science Stack Exchange.
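To make "provenance at the level of individual elements" concrete, the following is a minimal sketch in plain Python, not the authors' library. It assumes a dataset represented as a list of row dictionaries and a single illustrative operator (mean imputation); the names `impute_mean` and `cell_provenance` and the record fields are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_mean(rows, column):
    """Return new rows with missing values in `column` replaced by the column mean."""
    present = [r[column] for r in rows if not is_missing(r[column])]
    mean = sum(present) / len(present)
    return [{**r, column: mean if is_missing(r[column]) else r[column]} for r in rows]


def cell_provenance(before, after, operation):
    """Emit one record per cell whose value differs between `before` and `after`.

    Assumes both datasets have the same number of rows in the same order,
    which holds for value-level operators such as imputation.
    """
    records = []
    for row_index, (old_row, new_row) in enumerate(zip(before, after)):
        for column, new in new_row.items():
            old = old_row.get(column)
            if is_missing(old) and is_missing(new):
                continue  # still missing: nothing was derived
            if old != new:
                records.append({
                    "operation": operation,
                    "row": row_index,
                    "column": column,
                    "old": old,
                    "new": new,
                })
    return records


raw = [
    {"age": 30.0, "city": "Rome"},
    {"age": None, "city": "Leeds"},
    {"age": 50.0, "city": "Rome"},
]
clean = impute_mean(raw, "age")
records = cell_provenance(raw, clean, "impute_mean(age)")
print(records)
```

Each record ties one output cell to the operation that produced it, which is the kind of information a query such as "which values in the training set were imputed?" would need. A real capture library would also have to handle operators that add, drop or reorder rows and columns, which this sketch does not attempt.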

Author(s):  
Manjunatha R C ◽  
Rekha K R ◽  
Nataraj K R

Wireless sensor networks are usually left unattended and operate in hostile environments, so they can easily be compromised. With compromised nodes, an attacker can conduct several inside and outside attacks. The node replication attack is one of them, and it can cause severe damage to a wireless sensor network if left undetected. This paper presents a fuzzy-based simulation framework for the detection and revocation of compromised nodes in wireless sensor networks. Our proposed scheme uses PDR statistics and neighbor reports to determine the probability of a cluster being compromised. Nodes in a compromised cluster are then revoked and software attestation is performed. Simulation is carried out in MATLAB 2010a, and the performance of the proposed scheme is compared with conventional algorithms on the basis of communication and storage overhead. Simulation results show that the proposed scheme requires less communication and storage overhead than conventional algorithms.


2019 ◽  
Vol 214 ◽  
pp. 04010
Author(s):  
Álvaro Fernández Casaní ◽  
Dario Barberis ◽  
Javier Sánchez ◽  
Carlos García Montoro ◽  
Santiago González de la Hoz ◽  
...  

The ATLAS EventIndex currently runs in production in order to build a complete catalogue of events for experiments with large amounts of data. The current approach is to index all final produced data files at CERN Tier0, and at hundreds of grid sites, with a distributed data collection architecture using Object Stores to temporarily maintain the conveyed information, with references to them sent with a Messaging System. The final backend of all the indexed data is a central Hadoop infrastructure at CERN; an Oracle relational database is used for faster access to a subset of this information. In the future of ATLAS, instead of files, the event should be the atomic information unit for metadata, in order to accommodate future data processing and storage technologies. Files will no longer be static quantities: they may aggregate data dynamically and allow event-level granularity processing in heavily parallel computing environments. This also simplifies the handling of loss and/or extension of data. In this sense the EventIndex may evolve towards a generalized whiteboard, with the ability to build collections and virtual datasets for end users. These proceedings describe the current Distributed Data Collection Architecture of the ATLAS EventIndex project, with details of the Producer, Consumer and Supervisor entities, and the protocol and information temporarily stored in the ObjectStore. They also show the data flow rates and performance achieved since the new approach, which uses the Object Store as a temporary store, was put in production in July 2017. We review the challenges imposed by the expected increasing rates that will reach 35 billion new real events per year in Run 3, and 100 billion new real events per year in Run 4. For simulated events the numbers are even higher, with 100 billion events/year in Run 3, and 300 billion events/year in Run 4. We also outline the challenges we face in order to accommodate future use cases in the EventIndex.


2012 ◽  
Vol 49 (No. 10) ◽  
pp. 389-399 ◽  
Author(s):  
M. Trckova ◽  
L. Matlova ◽  
L. Dvorska ◽  
I. Pavlik

Feeding kaolin as a supplement to pigs for prevention of diarrheal diseases has been introduced on some farms in the Czech Republic. Peat was used in the 1990s for a similar purpose; however, most farmers ceased feeding peat as a supplement because of its frequent contamination with conditionally pathogenic mycobacteria, esp. with Mycobacterium avium subsp. hominissuis. The aim of the present paper is to review available literature from the standpoint of the advantages and disadvantages related to feeding kaolin as a supplement to animals. Its positive effects exerted through the diet consist primarily in its adsorbent capability, which may be useful for detoxification of the organism and for prevention of diarrheal diseases in pigs. Because the mechanism of action of kaolin fed as a supplement is unknown, there is a risk related to its potential interactions with other nutrient compounds of the diet. Therefore, it is necessary to investigate the effectiveness and safety of feeding kaolin in detail with regard to the health status and performance of each farm animal species. The disadvantage of kaolin use is its potential toxicity if it has been mined from an environment with natural or anthropogenic occurrence of toxic compounds. Another risk factor is the potential contamination of originally sterile kaolin with conditionally pathogenic mycobacteria from surface water, dust, soil, and other constituents of the environment in the mines during kaolin extraction, processing and storage.


Entropy ◽  
2020 ◽  
Vol 22 (12) ◽  
pp. 1438
Author(s):  
Carlos Alberto de Braganca Pereira ◽  
Adriano Polpo ◽  
Agatha Sacramento Rodrigues

With the increase in data processing and storage capacity, a large amount of data is available [...]


2019 ◽  
Vol 9 (1) ◽  
pp. 33-49 ◽  
Author(s):  
Kuldeep Sambrekar ◽  
Vijay S. Rajpurohit

Agriculture and its related industries are the backbone of many countries' economic growth. To achieve an efficient agricultural management system, remote sensing forecasting and GIS technology provide information to users and stakeholders of various agricultural applications. This information is huge in size and is stored in a cloud computing storage environment. Minimizing data access and storage costs in such an environment is desirable. To achieve fine-grained role-based access control mechanisms, researchers are now focusing on ensuring that such roles are enforced correctly. Existing models, though they use role-based access control at various levels, face challenges such as high computation rates and storage overhead. Existing systems currently use XML and UML for role and user creation. To address these research challenges, this article presents a model, Fast and Efficient Multi View Access Control (FEMVAC), using the Amazon S3 public cloud environment for agriculture. The model minimizes storage overhead by adopting a binarization method instead of the UML/XML method. The experimental outcome shows that the FEMVAC method is efficient compared with existing models.


Author(s):  
Frank Theo Moerman ◽  
Kostadin Fikiin

Proper control and performance of evaporators in food refrigeration facilities are vital to provide a suitable temperature regime, safety, quality and wholesomeness of refrigerated products at minimum electricity costs. When humid air passes along the surfaces of a low-temperature evaporator, frost is usually formed, which decreases the heat transfer efficiency. Frosting and defrosting phenomena have been extensively investigated for different industrial scenarios and extensive literature exists on the matter. However, no studies have been published so far to address in a comprehensive way the methods and patterns of evaporator defrosting as affected by hygienic design implications and criteria. This book chapter is intended to fill this gap by enforcing hygienic imperatives in the evaporator design. Various design solutions and conditions of operation are considered as decisive in determining the amount, thickness and structure of the frost build-up. Advantages and drawbacks of diverse defrost methods are outlined with regard to contamination risks in refrigeration facilities.


2014 ◽  
Vol 2 (1) ◽  
pp. 8-11
Author(s):  
Gregor Krajčovič ◽  
Silvia Horňáková

This article describes an information system concept designed to provide large amounts of geospatial data to the public. It addresses specific software design challenges, including wide computational power and storage scalability and unclear definition of stakeholders. It defines functional, security and performance requirements and provides a conceptual system design proposal.


2014 ◽  
Vol 644-650 ◽  
pp. 1923-1926
Author(s):  
Shao Min Zhang ◽  
Yan Chao Xu ◽  
Bao Yi Wang ◽  
Jin Xiao ◽  
Rui Niu

To address data integrity protection problems in the cloud, a remote data integrity verification scheme is proposed. Firstly, the data integrity verification is constructed based on homomorphic identification and a data fragment structure. Secondly, public verification is realized by introducing a random mask, and dynamic verification is supported by building an index-hash table (IHT). Finally, MapReduce is used for parallel computing, which reduces computation and storage overhead. The security and performance analyses show that our proposed scheme is secure and reliable.
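The role of an index-hash table in supporting dynamic updates can be illustrated with a small Python sketch. This is an assumption-laden illustration, not the scheme from the paper: it substitutes a plain SHA-256 digest for the homomorphic tags, omits the random mask and MapReduce, and the class and method names (`IndexHashTable`, `modify`, `insert`, `delete`, `verify`) are hypothetical. The idea shown is that each entry keeps a stable block identifier and a version number, so inserting or deleting a block shifts positions in the table without forcing the digests of other blocks to be recomputed.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entry:
    block_id: int   # stable identifier, never reused
    version: int    # incremented on every modification
    digest: str     # stand-in for a verification tag


def make_digest(block_id, version, data):
    """Bind the block contents to its identifier and version."""
    header = f"{block_id}|{version}|".encode()
    return hashlib.sha256(header + data).hexdigest()


class IndexHashTable:
    def __init__(self, blocks):
        self.entries = [Entry(i, 1, make_digest(i, 1, b)) for i, b in enumerate(blocks)]
        self.next_id = len(blocks)

    def modify(self, position, data):
        entry = self.entries[position]
        entry.version += 1
        entry.digest = make_digest(entry.block_id, entry.version, data)

    def insert(self, position, data):
        block_id = self.next_id
        self.next_id += 1
        self.entries.insert(position, Entry(block_id, 1, make_digest(block_id, 1, data)))

    def delete(self, position):
        del self.entries[position]

    def verify(self, position, data):
        entry = self.entries[position]
        return make_digest(entry.block_id, entry.version, data) == entry.digest


table = IndexHashTable([b"a", b"b", b"c"])
table.modify(1, b"B")      # block at position 1 now has version 2
table.insert(1, b"x")      # new block gets a fresh identifier; others keep theirs
print(table.verify(1, b"x"), table.verify(2, b"B"), table.verify(2, b"b"))
```

Because the digest covers the version number, a stale copy of a modified block fails verification even if it was once valid. In a publicly verifiable scheme the verifier would check a proof computed by the server rather than hold the data itself, which this sketch does not model.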


2017 ◽  
Vol 2017 ◽  
pp. 1-7 ◽  
Author(s):  
Kai Fan ◽  
Panfei Song ◽  
Yintang Yang

As one of the core techniques in 5G, the Internet of Things (IoT) is attracting increasing attention. Meanwhile, as an important part of IoT, Near Field Communication (NFC) is widely used on mobile devices and makes it possible to use NFC systems for mobile payment and for reading merchandise information. But with the development of NFC, its problems are increasingly exposed, especially the security and privacy of authentication. Many NFC authentication protocols have been proposed to address this; some of them only improve function and performance without considering security and privacy, and most of the protocols are heavyweight. In order to overcome these problems, this paper proposes an ultralightweight mutual authentication protocol, named ULMAP. ULMAP uses only Bit and XOR operations to complete the mutual authentication and prevent the denial of service (DoS) attack. In addition, it uses a subkey and a subindex number in its key update process to achieve forward security. Most importantly, the computation and storage overhead of ULMAP is low. Compared with some traditional schemes, our scheme is lightweight, economical, practical, and resistant to synchronization attacks.

