MAGICPL: A Generic Process Description Language for Distributed Pseudonymization Scenarios

Author(s):  
Galina Tremper ◽  
Torben Brenner ◽  
Florian Stampe ◽  
Andreas Borg ◽  
Martin Bialke ◽  
...  

Abstract Objectives: Pseudonymization is an important aspect of projects dealing with sensitive patient data. Most projects build their own specialized, hard-coded solutions. However, these overlap in many aspects of their functionality. As any re-implementation binds resources, we propose a solution that facilitates and encourages the reuse of existing components. Methods: We analyzed already-established data protection concepts to gain insight into their common features and the ways in which their components were linked together. We found that we could represent these pseudonymization processes with a simple descriptive language, which we have called MAGICPL, plus a relatively small set of components. We designed MAGICPL as an XML-based language to make it human-readable and accessible to nonprogrammers. Additionally, a prototype implementation of the components was written in Java. MAGICPL makes it possible to reference the components by their class names, making it easy to extend or exchange the component set. Furthermore, a simple HTTP application programming interface (API) runs the tasks and allows other systems to communicate with the pseudonymization process. Results: MAGICPL has been used in at least three projects, including the re-implementation of the pseudonymization process of the German Cancer Consortium, clinical data flows in a large-scale translational research network (National Network Genomic Medicine), and our own institute's pseudonymization service. Conclusions: Putting our solution into productive use at our own institute and at our partner sites has reduced the time and effort required to build pseudonymization pipelines in medical research.
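To illustrate how an external system might hand a record to a running pseudonymization task over such an HTTP API, here is a minimal Python sketch. The base URL, endpoint path, and payload fields are purely illustrative assumptions, not MAGICPL's documented interface.

```python
import requests

# Hypothetical example only: the base URL, endpoint path and payload fields
# are illustrative assumptions, not MAGICPL's documented API.
MAGICPL_BASE_URL = "http://localhost:8080"

def submit_pseudonymization_task(task_name: str, record: dict) -> dict:
    """Send a record to a named pseudonymization task and return the result."""
    response = requests.post(
        f"{MAGICPL_BASE_URL}/tasks/{task_name}",
        json=record,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = submit_pseudonymization_task(
        "createPseudonym",
        {"firstName": "Jane", "lastName": "Doe", "birthDate": "1970-01-01"},
    )
    print(result)
```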

Author(s):  
D. C. Price ◽  
C. Flynn ◽  
A. Deller

Abstract Galactic electron density distribution models are crucial tools for estimating the impact of the ionised interstellar medium on the impulsive signals from radio pulsars and fast radio bursts (FRBs). The two prevailing Galactic electron density models (GEDMs) are YMW16 (Yao et al. 2017, ApJ, 835, 29) and NE2001 (Cordes & Lazio 2002, arXiv e-prints, astro-ph/0207156). Here, we introduce a software package, PyGEDM, which provides a unified application programming interface (API) for these models and the YT20 (Yamasaki & Totani 2020, ApJ, 888, 105) model of the Galactic halo. We use PyGEDM to compute all-sky maps of Galactic dispersion measure (DM) for YMW16 and NE2001 and compare the large-scale differences between the two. In general, YMW16 predicts higher DM values towards the Galactic anticentre and at low Galactic latitudes, whereas NE2001 predicts higher DMs in most other directions. We identify lines of sight for which the models are most discrepant, using pulsars with independent distance measurements. YMW16 performs better on average than NE2001, but both models show significant outliers. We suggest that future campaigns to determine pulsar distances should focus on targets where the models show large discrepancies, so that future models can use those measurements to better estimate distances along those lines of sight. We also suggest that the Galactic halo should be considered as a component in future GEDMs, to avoid overestimating the Galactic DM contribution for extragalactic sources such as FRBs.
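A minimal sketch of comparing the two models along a single line of sight with PyGEDM is shown below; the function name and argument order follow the package's documentation as we understand it and should be checked against the installed release.

```python
import pygedm  # pip install pygedm

# Sketch of a model comparison along one line of sight. The function name and
# argument conventions are assumptions based on PyGEDM's documentation and
# should be verified against the installed version.
gl, gb = 45.0, 5.0   # Galactic longitude/latitude in degrees
dist_pc = 5000.0     # assumed pulsar distance in parsecs

for method in ("ymw16", "ne2001"):
    dm, tau_sc = pygedm.dist_to_dm(gl, gb, dist_pc, method=method)
    print(f"{method}: DM = {dm}, scattering timescale = {tau_sc}")
```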


F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 1374 ◽  
Author(s):  
Minh-Son Phan ◽  
Anatole Chessel

The advent of large-scale fluorescence and electron microscopy techniques, along with maturing image analysis, is giving the life sciences a deluge of geometrical objects in 2D/3D(+t) to deal with. These objects take the form of large-scale, localised, precise, single-cell, quantitative data such as cells’ positions, shapes, trajectories or lineages, axon traces in whole-brain atlases, or varied intracellular protein localisations, often in multiple experimental conditions. The data mining of those geometrical objects requires a variety of mathematical and computational tools of diverse accessibility and complexity. Here we present a new Python library for quantitative 3D geometry called GeNePy3D, which helps handle and mine information and knowledge from geometric data, providing a unified application programming interface (API) to methods from several domains including computational geometry, scale-space methods and spatial statistics. By framing this library as generically as possible, and by linking it to as many state-of-the-art reference algorithms and projects as needed, we help render those often specialist methods accessible to a larger community. We exemplify the usefulness of the GeNePy3D toolbox by re-analysing a recently published whole-brain zebrafish neuronal atlas, with other applications and examples available online. Along with open-source, documented and exemplified code, we release reusable containers to allow for convenient and wide usability and increased reproducibility.
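To make the kind of geometric data mining described above concrete, the sketch below computes nearest-neighbour distances for a synthetic 3D point cloud of cell positions using SciPy directly. It illustrates the family of spatial-statistics operations a unified API such as GeNePy3D wraps; the code deliberately does not use GeNePy3D's own function names.

```python
import numpy as np
from scipy.spatial import cKDTree

# Illustrative only: a plain-SciPy version of one spatial-statistics task
# (nearest-neighbour distances for 3D cell positions) of the kind a unified
# geometry API such as GeNePy3D is designed to expose.
rng = np.random.default_rng(0)
cell_positions = rng.uniform(0, 100, size=(1000, 3))  # stand-in for real data

tree = cKDTree(cell_positions)
# k=2 because the closest neighbour of each point is the point itself.
distances, _ = tree.query(cell_positions, k=2)
nearest_neighbour = distances[:, 1]

print(f"mean NN distance: {nearest_neighbour.mean():.2f} "
      f"(std {nearest_neighbour.std():.2f})")
```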


2018 ◽  
Author(s):  
Alex Nunes ◽  
Damian Lidgard ◽  
Franziska Broell

In 2015, as part of the Ocean Tracking Network’s bioprobe initiative, 20 grey seals (Halichoerus grypus) were tagged on Sable Island for 6 months with a high-resolution (> 30 Hz) inertial tag, a depth-temperature satellite tag (0.1 Hz), and an acoustic transceiver. As with similar large-scale studies in movement ecology, the unprecedented size of the data collected by these instruments (gigabytes for a single seal) raises new challenges in efficient database management. Here we propose using Postgres and netCDF for storing the biotelemetry data and associated metadata. While it was possible to write the lower-resolution (acoustic and satellite) data to a Postgres database, netCDF was chosen as the format for the high-resolution movement (acceleration and inertial) records. Even without access to cluster computing, data could be written efficiently in terms of CPU time: 920 million records were written in < 3 hours. ERDDAP was used to access and link the different data streams through a user-friendly Application Programming Interface (API). This approach compresses the data to a fifth of its original size, and storing the data in a tree-like structure enables easy access and visualization for the end user.
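A minimal sketch of the netCDF side of such a pipeline, written with the netCDF4 Python library, is shown below. The variable names, units, and compression settings are illustrative assumptions rather than the schema used in the study.

```python
import numpy as np
from netCDF4 import Dataset

# Illustrative only: writing high-rate accelerometer records for one seal to a
# compressed netCDF file. Variable names and attributes are assumptions, not
# the schema used in the study.
n_samples = 30 * 3600          # one hour of 30 Hz samples
time = np.arange(n_samples) / 30.0
accel = np.random.normal(size=(n_samples, 3)).astype("f4")  # stand-in data

with Dataset("seal_001_accel.nc", "w") as nc:
    nc.createDimension("time", n_samples)
    nc.createDimension("axis", 3)

    t = nc.createVariable("time", "f8", ("time",), zlib=True)
    t.units = "seconds since deployment"
    t[:] = time

    a = nc.createVariable("acceleration", "f4", ("time", "axis"),
                          zlib=True, complevel=4)
    a.units = "g"
    a[:, :] = accel
```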


2018 ◽  
Author(s):  
Soohyun Lee ◽  
Jeremy Johnson ◽  
Carl Vitzthum ◽  
Koray Kırlı ◽  
Burak H. Alver ◽  
...  

Abstract Summary: We introduce Tibanna, an open-source software tool for automated execution of bioinformatics pipelines on Amazon Web Services (AWS). Tibanna accepts reproducible and portable pipeline standards including Common Workflow Language (CWL), Workflow Description Language (WDL) and Docker. It adopts a strategy of isolation and optimization of individual executions, combined with a serverless scheduling approach. Pipelines are executed and monitored using local commands or the Python Application Programming Interface (API), and cloud configuration is handled automatically. Tibanna is well suited for projects with a range of computational requirements, including those with large and widely fluctuating loads. Notably, it has been used to process terabytes of data for the 4D Nucleome (4DN) Network. Availability: Source code is available on GitHub at https://github.com/4dn-dcic/tibanna.
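A hedged sketch of launching and monitoring a run through the Python API follows; the module path, class, and method names are assumptions based on Tibanna's documentation and should be verified against the installed version.

```python
# Sketch only: module path, class and method names are assumptions based on
# Tibanna's documentation and should be checked against the installed version.
from tibanna.core import API

api = API()

# A run is described by a JSON document pairing workflow arguments
# (e.g. a CWL/WDL file and input files on S3) with an EC2 configuration.
api.run_workflow(input_json="my_run.json")

# Monitor previously launched runs from the same machine.
api.stat()
```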


2021 ◽  
pp. 146144482110494
Author(s):  
Thomas Smits ◽  
Ruben Ros

How do digital media impact the meaning of iconic photographs? Recent studies have suggested that online circulation, especially in a memeified form, might lead to the erosion, fracturing, or collapsing of the original contextual meaning of iconic pictures. Introducing a distant reading methodology to the study of iconic photographs, we apply the Google Cloud Vision Application Programming Interface (GCV API) to retrieve 940,000 online circulations of 26 iconic images between 1995 and 2020. We use document embeddings, a Natural Language Processing technique, to map the contexts in which iconic photographs circulate online. The article demonstrates that constantly changing configurations of contextual imagetexts, self-referential image-texts, and non-referential image/texts shape the online life of iconic photographs: ebbs and flows of slowly disappearing, suddenly resurfacing, and newly found meanings. While iconic photographs might not need captions to speak, this article argues that a large-scale analysis of texts can help us better grasp what they say.
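The retrieval step can be sketched with the Cloud Vision client library's web-detection feature, which returns web pages containing (near-)matching copies of an image. This is an illustrative sketch of that one step, not the authors' full pipeline.

```python
from google.cloud import vision

# Illustrative sketch of image-based retrieval with the Cloud Vision API:
# web detection lists pages on which (near-)matching copies of an image appear.
client = vision.ImageAnnotatorClient()  # requires Google Cloud credentials

with open("iconic_photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.web_detection(image=image)

for page in response.web_detection.pages_with_matching_images:
    # Page URLs and titles supply the textual context used for embeddings.
    print(page.url, page.page_title)
```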


Author(s):  
Donald Sturgeon

Abstract This article presents technical approaches and innovations in digital library design developed during the design and implementation of the Chinese Text Project, a widely used, large-scale full-text digital library of premodern Chinese writing. By leveraging a combination of domain-optimized Optical Character Recognition, a purpose-designed crowdsourcing system, and an Application Programming Interface (API), this project simultaneously provides a sustainable transcription system, search interface, and reading environment, as well as an extensible platform for transcribing and working with premodern Chinese textual materials. By means of the API, intentionally loosely integrated text mining tools are used to extend the platform, while also remaining reusable independently with materials from other sources and in other languages.
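As a sketch of how a loosely coupled tool might pull a text from the platform over the API, the snippet below requests one chapter via HTTP; the endpoint name, parameter, and URN are assumptions based on the public API documentation and should be verified at ctext.org.

```python
import requests

# Illustrative sketch: the endpoint name, parameter and URN format are
# assumptions based on the public ctext.org API documentation.
API_URL = "https://api.ctext.org/gettext"

response = requests.get(API_URL, params={"urn": "ctp:analects/xue-er"}, timeout=30)
response.raise_for_status()
data = response.json()

# The response is expected to contain the transcribed text of the chapter,
# which downstream text-mining tools can consume independently of the website.
print(data)
```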


2021 ◽  
pp. 1-19
Author(s):  
Kiet Van Nguyen ◽  
Nhat Duy Nguyen ◽  
Phong Nguyen-Thuan Do ◽  
Anh Gia-Tuan Nguyen ◽  
Ngan Luu-Thuy Nguyen

Machine Reading Comprehension has attracted significant interest in research on natural language understanding, and large-scale datasets and neural network-based methods have been developed for this task. However, most developments of resources and methods in machine reading comprehension have focused on two resource-rich languages, English and Chinese. This article proposes a system called ViReader for open-domain machine reading comprehension in Vietnamese, using Wikipedia as the textual knowledge source, where the answer to any particular question is a textual span derived directly from texts on Vietnamese Wikipedia. Our system combines a sentence retriever component, based on information retrieval techniques, to extract the relevant sentences, with a transfer learning-based answer extractor trained to predict answers from Wikipedia texts. Experiments on multiple datasets for machine reading comprehension in Vietnamese and other languages demonstrate that (1) our ViReader system is highly competitive with prevalent machine learning-based systems, and (2) combining the sentence retriever and the answer extractor yields an effective end-to-end reading comprehension system. The sentence retriever component retrieves the sentences most likely to contain the answer to the given question. The transfer learning-based answer extractor then reads the document from which the sentences have been retrieved, predicts the answer, and returns it to the user. The ViReader system achieves new state-of-the-art performance, with 70.83% EM (exact match) and 89.54% F1, outperforming the BERT-based system by 11.55% and 9.54%, respectively. It also obtains state-of-the-art performance on UIT-ViNewsQA (another Vietnamese dataset consisting of online health-domain news) and BiPaR (a bilingual dataset of English and Chinese novel texts). Compared with the BERT-based system, our system achieves significant improvements in F1 of 7.65% for English and 6.13% for Chinese on the BiPaR dataset. Furthermore, we build a ViReader application programming interface (API) that programmers can employ in Artificial Intelligence applications.
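The retrieve-then-extract architecture can be sketched generically as follows, using a TF-IDF retriever and an off-the-shelf extractive QA model; this is an illustration of the pipeline shape, not the authors' ViReader implementation, and it uses English text with the default English model so that it runs as-is.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# Illustrative sketch of a retriever + answer-extractor pipeline only; it is
# not the authors' ViReader code. ViReader itself works on Vietnamese
# Wikipedia with a transfer-learning-based extractor.
sentences = [
    "Hanoi is the capital of Vietnam.",
    "Ho Chi Minh City is the largest city in Vietnam.",
]
question = "What is the capital of Vietnam?"

# 1) Sentence retriever: rank candidate sentences against the question.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(sentences + [question])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best_sentence = sentences[int(scores.argmax())]

# 2) Answer extractor: predict an answer span from the retrieved sentence.
qa = pipeline("question-answering")
print(qa(question=question, context=best_sentence))
```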


2021 ◽  
Author(s):  
Sowmiya K ◽  
Supriya S ◽  
R. Subhashini

Analysis of structured data has seen tremendous success in the past. However, large-scale analysis of unstructured data, such as video, remains a challenging area. YouTube, a Google company, has over a billion users and generates billions of views. Since YouTube data is created in huge volumes and at great speed, there is a strong demand to store it, process it, and carefully study it in order to make this large amount of data usable. The project utilizes the YouTube Data API (Application Programming Interface), which allows applications or websites to incorporate functions that YouTube itself uses to fetch and display information. The Google Developers Console is used to generate a unique access key, which is required to fetch data from public YouTube channels. The data are then processed and finally stored in AWS. The project extracts meaningful output that management can use for analysis, and these methodologies are helpful for business intelligence.
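A minimal sketch of fetching public channel statistics with the YouTube Data API v3 via the official Python client is shown below; the API key and channel ID are placeholders to be supplied by the user.

```python
from googleapiclient.discovery import build

# Sketch of fetching public channel statistics with the YouTube Data API v3.
# DEVELOPER_KEY and CHANNEL_ID are placeholders supplied by the user.
DEVELOPER_KEY = "YOUR_API_KEY"
CHANNEL_ID = "UC_x5XG1OV2P6uZZ5FSM9Ttw"  # example public channel ID

youtube = build("youtube", "v3", developerKey=DEVELOPER_KEY)

request = youtube.channels().list(part="snippet,statistics", id=CHANNEL_ID)
response = request.execute()

for channel in response.get("items", []):
    stats = channel["statistics"]
    # Print the channel title with its total view and subscriber counts.
    print(channel["snippet"]["title"],
          stats.get("viewCount"),
          stats.get("subscriberCount"))
```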

