From source to sink - Sustainable and reproducible data pipelines with SaQC

Author(s):  
David Schäfer ◽  
Bert Palm ◽  
Lennart Schmidt ◽  
Peter Lünenschloß ◽  
Jan Bumberger

The number of sensors used in the environmental system sciences is increasing rapidly. While this trend undoubtedly provides great potential to broaden the understanding of complex spatio-temporal processes, it comes with its own set of new challenges. The flow of data from source to sink, from sensors to databases, involves many, usually error-prone, intermediate steps: data acquisition with its specific scientific and technical challenges, data transfer from often remote locations, and the final data processing all carry great potential to introduce errors and disturbances into the actual environmental signal.

Quantifying these errors becomes a crucial part of the later evaluation of all measured data. While many large environmental observatories are moving from manual to more automated forms of data processing and quality assurance, these systems are usually highly customized and hand-written. This approach is non-ideal in several ways: first, it wastes resources, as the same algorithms are implemented over and over again; second, it poses great challenges to reproducibility. If the relevant programs are made available at all, they expose all the problems of software reuse: correctness of the implementation, readability and comprehensibility for future users, and transferability between different computing environments. Beside these general software-development problems, another crucial factor comes into play: the end product, a processed and quality-controlled data set, is closely tied to the current version of the programs in use. Even small changes to the source code can lead to vastly differing results. If this is not approached responsibly, data and programs will inevitably fall out of sync.

The presented software, the 'System for automated Quality Control' (SaQC, www.ufz.git.de/rdm-software/saqc), helps to solve, or at least massively simplify, the presented challenges. As a mainly no-code platform with a large set of implemented functionality, SaQC lowers the entry barrier for the non-programming scientific practitioner without sacrificing fine-grained adaptation to project-specific needs. The text-based configuration allows easy integration into version control systems and thus opens the opportunity to use well-established software for data lineage. We give a short overview of the program's unique features and showcase possibilities to build reliable and reproducible processing and quality assurance pipelines for real-world data from a spatially distributed, heterogeneous sensor network.
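To make the text-based configuration idea concrete, here is a minimal, self-contained sketch. It is not the actual SaQC configuration format or API (the semicolon-separated table, the test names and the driver function are invented for this example); it only illustrates how a small, version-controllable text file can drive per-variable quality checks on a pandas DataFrame. For the real syntax and function set, see the linked repository.

```python
# Minimal illustration of a text-based QC configuration (not the actual SaQC
# format or API): a semicolon-separated table maps variables to quality tests,
# and a small driver applies the tests to a pandas DataFrame of sensor data.
import io
import numpy as np
import pandas as pd

CONFIG = """\
varname;test;kwargs
temperature;range;min=-30, max=50
conductivity;missing;
"""

def flag_range(series, min=-np.inf, max=np.inf):
    """Flag values outside the allowed interval."""
    return (series < min) | (series > max)

def flag_missing(series):
    """Flag missing observations."""
    return series.isna()

TESTS = {"range": flag_range, "missing": flag_missing}

def parse_kwargs(text):
    """Turn 'min=-30, max=50' into {'min': -30.0, 'max': 50.0}."""
    if not isinstance(text, str) or not text.strip():
        return {}
    return {k.strip(): float(v) for k, v in
            (item.split("=") for item in text.split(","))}

def run_pipeline(data, config_text):
    """Apply every configured test and collect boolean flags per variable."""
    config = pd.read_csv(io.StringIO(config_text), sep=";")
    flags = pd.DataFrame(False, index=data.index, columns=data.columns)
    for _, row in config.iterrows():
        test = TESTS[row["test"]]
        flags[row["varname"]] |= test(data[row["varname"]],
                                      **parse_kwargs(row["kwargs"]))
    return flags

if __name__ == "__main__":
    data = pd.DataFrame({
        "temperature": [12.3, 99.0, np.nan, 14.1],
        "conductivity": [450.0, np.nan, 460.0, 455.0],
    })
    print(run_pipeline(data, CONFIG))
```

Because the configuration is plain text, it can be committed alongside the data-processing code, which is exactly what enables the version-control-based data lineage mentioned above.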


2012 ◽  
pp. 875-897
Author(s):  
Guo-Zheng Li

This chapter introduces the major challenges and novel machine learning techniques employed in clinical data processing. It argues that novel machine learning techniques, including support vector machines, ensemble learning, feature selection, feature reuse through multi-task learning, and multi-label learning, provide potentially more substantive solutions for decision support and clinical data analysis. The authors demonstrate the generalization performance of these techniques on real-world data sets, including a brain glioma data set, a coronary heart disease data set from Chinese medicine, and several microarray tumor data sets. More and more machine learning techniques will be developed to improve the analysis precision of clinical data sets.
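As an illustration of two of the listed techniques, the following scikit-learn sketch (not taken from the chapter) combines univariate feature selection with a support vector machine and evaluates the pipeline by cross-validation on a synthetic stand-in for a clinical data set.

```python
# Sketch: univariate feature selection feeding an SVM, evaluated by
# cross-validation on synthetic data standing in for a clinical data set.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic "clinical" data: 200 patients, 50 measured features, few informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),               # SVMs are sensitive to feature scales
    ("select", SelectKBest(f_classif, k=10)),  # keep the 10 most relevant features
    ("svm", SVC(kernel="rbf", C=1.0)),
])

scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```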


Author(s):  
Michael Schatz ◽  
Joachim Jäger ◽  
Marin van Heel

Lumbricus terrestris erythrocruorin is a giant oxygen-transporting macromolecule in the blood of the common earthworm (worm "hemoglobin"). In our current study, we use specimens (kindly provided by Drs W.E. Royer and W.A. Hendrickson) embedded in vitreous ice (1) to avoid artefacts encountered with the negative-stain preparation technique used in previous studies (2-4). Although the molecular structure is well preserved in vitreous ice, the low contrast and high noise level in the micrographs represent a serious problem in image interpretation. Moreover, in this type of preparation the molecules can exhibit many different orientations relative to the object plane of the microscope. Existing techniques of analysis, which require alignment of the molecular views relative to one or more reference images, thus often yield unsatisfactory results. We use a new method in which rotation-, translation- and mirror-invariant functions (5) are first derived from the large set of input images; these functions are subsequently classified automatically using multivariate statistical techniques (6). The different molecular views in the data set can thereby be found without bias (5). Within each class, all images are aligned relative to the member of the class that contributes least to the class's internal variance (6). This reference image is thus the most typical member of the class. Finally, the aligned images from each class are averaged, resulting in molecular views with enhanced statistical resolution.
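The classification idea can be sketched schematically. The following is not the authors' procedure (which uses invariant functions and multivariate statistical analysis): a rotation-invariant radial intensity profile stands in for the invariant functions, k-means stands in for the automatic classification, and class members are simply averaged without the intra-class alignment step.

```python
# Schematic stand-in for reference-free classification and class averaging:
# compute a rotation-invariant descriptor per image, cluster the descriptors,
# then average the members of each class.
import numpy as np
from sklearn.cluster import KMeans

def radial_profile(img, n_bins=32):
    """Rotation-invariant descriptor: mean intensity per radial bin."""
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2)
    bins = np.minimum((r / r.max() * n_bins).astype(int), n_bins - 1)
    return np.array([img[bins == b].mean() for b in range(n_bins)])

def classify_and_average(images, n_classes=4):
    """Cluster images by their invariant descriptors and average each class."""
    feats = np.array([radial_profile(im) for im in images])
    labels = KMeans(n_clusters=n_classes, n_init=10,
                    random_state=0).fit_predict(feats)
    return [np.mean([im for im, l in zip(images, labels) if l == k], axis=0)
            for k in range(n_classes)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy_views = rng.normal(size=(40, 64, 64))   # stand-in for noisy micrographs
    class_averages = classify_and_average(noisy_views)
    print(len(class_averages), class_averages[0].shape)
```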


Biomimetics ◽  
2021 ◽  
Vol 6 (2) ◽  
pp. 32
Author(s):  
Tomasz Blachowicz ◽  
Jacek Grzybowski ◽  
Pawel Steblinski ◽  
Andrea Ehrmann

Computers nowadays have separate components for data storage and data processing, making data transfer between these units a bottleneck for computing speed. Therefore, so-called cognitive (or neuromorphic) computing approaches try to combine both tasks, as is done in the human brain, to make computing faster and less energy-consuming. One possible way to prepare new hardware for neuromorphic computing is offered by nanofiber networks, as they can be prepared by diverse methods, from lithography to electrospinning. Here, we show results of micromagnetic simulations of three coupled semicircle fibers in which domain walls are excited by rotating magnetic fields (inputs), leading to different output signals that can be used for stochastic data processing, mimicking biological synaptic activity and thus being suitable as artificial synapses in artificial neural networks.
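As a purely conceptual illustration of how such stochastic outputs could serve as synapses (no micromagnetics involved, and not the authors' simulation), the toy model below updates a bounded synaptic weight probabilistically, with the update probability standing in for the fiber network's stochastic output signal.

```python
# Conceptual toy model: a stochastic synapse whose potentiation probability
# is set by the input signal, mimicking probabilistic hardware outputs.
import numpy as np

rng = np.random.default_rng(1)

def stochastic_synapse_update(weight, presynaptic_rate, learning_rate=0.1):
    """Potentiate or depress the weight with a probability set by the input rate."""
    if rng.random() < presynaptic_rate:           # input acts probabilistically
        weight += learning_rate * (1.0 - weight)  # potentiation, bounded at 1
    else:
        weight -= learning_rate * weight          # depression, bounded at 0
    return weight

w = 0.5
for step in range(1000):
    w = stochastic_synapse_update(w, presynaptic_rate=0.8)
print(f"steady-state weight after strong input: {w:.2f}")
```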


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Sven Lißner ◽  
Stefan Huber

Abstract Background: GPS-based cycling data are increasingly available for traffic planning these days. However, the recorded data often contain more information than just bicycle trips: tracks recorded while using modes of transport other than the bike, or long stationary periods at work locations while tracking continues, are only two examples. Collected bicycle GPS data therefore need to be processed adequately before they can be used for transportation planning. Results: The article presents a multi-level approach to bicycle-specific data processing. The data processing model contains several processing steps (data filtering, smoothing, trip segmentation, transport mode recognition, driving mode detection) to finally obtain a data set that contains only bicycle trips. The validation reveals a sound accuracy of the model at its current state (82–88%).
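Such a multi-level pipeline can be sketched as a chain of small, testable functions. The sketch below is an illustrative stand-in, not the authors' model: the thresholds, the rolling-median smoothing, the time-gap trip segmentation and the speed-based mode heuristic are all assumptions chosen for the example.

```python
# Illustrative multi-level GPS processing: filter -> smooth -> segment trips
# -> crude transport-mode heuristic, on a table of GPS fixes.
import pandas as pd

def filter_fixes(df, max_speed_kmh=70.0):
    """Drop implausible fixes, e.g. speeds far above what a cyclist can reach."""
    return df[df["speed_kmh"] <= max_speed_kmh].copy()

def smooth(df, window=5):
    """Rolling-median smoothing of the speed signal."""
    df["speed_kmh"] = df["speed_kmh"].rolling(window, min_periods=1,
                                              center=True).median()
    return df

def segment_trips(df, max_gap_s=300):
    """Start a new trip whenever the time gap between fixes exceeds max_gap_s."""
    gaps = df["timestamp"].diff().dt.total_seconds().fillna(0)
    df["trip_id"] = (gaps > max_gap_s).cumsum()
    return df

def classify_mode(df, min_kmh=6.0, max_kmh=35.0):
    """Label a trip 'bike' if its mean speed is plausible for cycling."""
    means = df.groupby("trip_id")["speed_kmh"].mean()
    df["mode"] = df["trip_id"].map(
        lambda t: "bike" if min_kmh <= means[t] <= max_kmh else "other")
    return df

def process(df):
    return classify_mode(segment_trips(smooth(filter_fixes(df))))

if __name__ == "__main__":
    fixes = pd.DataFrame({
        "timestamp": pd.date_range("2021-06-01 08:00", periods=8, freq="30s"),
        "speed_kmh": [15, 18, 120, 20, 17, 2, 1, 22],
    })
    print(process(fixes)[["timestamp", "speed_kmh", "trip_id", "mode"]])
```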


2021 ◽  
pp. 1-13
Author(s):  
Hailin Liu ◽  
Fangqing Gu ◽  
Zixian Lin

Transfer learning methods exploit similarities between different datasets to improve the performance of the target task by transferring knowledge from source tasks to the target task. “What to transfer” is a main research issue in transfer learning. Existing transfer learning methods generally need to acquire the shared parameters by integrating human knowledge. However, in many real applications, it is unknown beforehand which parameters can be shared. A transfer learning model is essentially a special multi-objective optimization problem. Consequently, this paper proposes a novel auto-sharing parameter technique for transfer learning based on multi-objective optimization and solves the optimization problem with a multi-swarm particle swarm optimizer. Each task objective is optimized by its own sub-swarm. The current best particle of the target task's sub-swarm is used to guide the search of the source-task particles, and vice versa. The target and source tasks are thus solved jointly by sharing the information of the best particles, which acts as an inductive bias. Experiments on several synthetic data sets and two real-world data sets (a school data set and a landmine data set) show that the proposed algorithm is effective.
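The cross-swarm guidance idea can be sketched compactly. The code below is not the paper's exact algorithm: it is a toy two-swarm particle swarm optimizer in which each swarm's velocity update also includes an attraction term towards the other swarm's best particle; the objectives, swarm sizes and the coefficient c3 are assumptions.

```python
# Toy multi-swarm PSO with cross-swarm guidance: each swarm optimizes its own
# task but is also attracted to the other swarm's best particle.
import numpy as np

rng = np.random.default_rng(0)

def target_loss(x):   # stand-in objective for the target task
    return np.sum((x - 1.0) ** 2)

def source_loss(x):   # stand-in objective for the source task
    return np.sum((x - 1.2) ** 2)

def make_swarm(n, dim):
    pos = rng.uniform(-3, 3, (n, dim))
    return {"pos": pos, "vel": np.zeros((n, dim)),
            "pbest": pos.copy(), "pbest_val": np.full(n, np.inf)}

def step(swarm, loss, own_gbest, other_gbest, w=0.7, c1=1.4, c2=1.4, c3=0.6):
    n, dim = swarm["pos"].shape
    r1, r2, r3 = rng.random((3, n, dim))
    swarm["vel"] = (w * swarm["vel"]
                    + c1 * r1 * (swarm["pbest"] - swarm["pos"])
                    + c2 * r2 * (own_gbest - swarm["pos"])
                    + c3 * r3 * (other_gbest - swarm["pos"]))  # cross-swarm term
    swarm["pos"] += swarm["vel"]
    vals = np.array([loss(p) for p in swarm["pos"]])
    improved = vals < swarm["pbest_val"]
    swarm["pbest"][improved] = swarm["pos"][improved]
    swarm["pbest_val"][improved] = vals[improved]

dim = 5
swarms = {"target": make_swarm(20, dim), "source": make_swarm(20, dim)}
losses = {"target": target_loss, "source": source_loss}
for _ in range(100):
    gbest = {k: s["pbest"][np.argmin(s["pbest_val"])] for k, s in swarms.items()}
    step(swarms["target"], losses["target"], gbest["target"], gbest["source"])
    step(swarms["source"], losses["source"], gbest["source"], gbest["target"])
print({k: round(s["pbest_val"].min(), 4) for k, s in swarms.items()})
```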


Author(s):  
Shaoqiang Wang ◽  
Shudong Wang ◽  
Song Zhang ◽  
Yifan Wang

Abstract The aim is to automatically detect dynamic EEG signals and thereby reduce the time cost of epilepsy diagnosis. In the recognition of electroencephalogram (EEG) signals of epilepsy, traditional machine learning and statistical methods require manual feature engineering to show excellent results on a single data set, and the manually selected features may carry a bias and cannot guarantee validity and transferability to real-world data. In practical applications, deep learning methods can, to a certain extent, release people from feature engineering: as long as data quality and quantity are expanded, the model can learn automatically and keep improving. In addition, deep learning can extract many features that are difficult for humans to perceive, making the algorithm more robust. Based on the design idea of the ResNeXt deep neural network, this paper designs a Time-ResNeXt network structure suitable for time-series EEG epilepsy detection. The accuracy of Time-ResNeXt in the detection of EEG epilepsy reaches 91.50%. The Time-ResNeXt network structure achieves highly competitive performance on the benchmark Berne-Barcelona dataset and has great potential for improving clinical practice.
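The kind of building block such an architecture relies on can be sketched in PyTorch. The code below is not the paper's Time-ResNeXt; it is a generic 1D ResNeXt-style bottleneck with grouped convolutions (the cardinality) and a residual connection, with invented layer sizes.

```python
# Generic 1D ResNeXt-style bottleneck for time-series (e.g. EEG) inputs.
import torch
import torch.nn as nn

class ResNeXtBlock1D(nn.Module):
    """Bottleneck: 1x1 reduce -> grouped 3-conv (cardinality paths) -> 1x1 expand."""
    def __init__(self, channels, cardinality=8, bottleneck=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm1d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv1d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),   # parallel grouped paths
            nn.BatchNorm1d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv1d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual connection

if __name__ == "__main__":
    eeg = torch.randn(4, 32, 512)             # batch, channels, time samples
    block = ResNeXtBlock1D(channels=32)
    print(block(eeg).shape)                   # torch.Size([4, 32, 512])
```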


2018 ◽  
Vol 8 (11) ◽  
pp. 2216
Author(s):  
Jiahui Jin ◽  
Qi An ◽  
Wei Zhou ◽  
Jiakai Tang ◽  
Runqun Xiong

Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. This problem is exacerbated in multicore server-based clusters, where multiple tasks running on the same server compete for the server’s network bandwidth. Existing approaches solve this problem by scheduling computational tasks near the input data and considering the server’s free time, data placements, and data transfer costs. However, such approaches usually set identical values for data transfer costs, even though a multicore server’s data transfer cost increases with the number of data-remote tasks. As a result, data-processing time is minimized ineffectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although scheduling under DynDL is NP-complete (nondeterministic polynomial-complete), we prove that the offline algorithm runs in quadratic time and generates optimal results for DynDL’s specific uses. Using a series of simulations and real-world executions, we show that our algorithms reduce data-processing time by 30% compared with algorithms that do not consider dynamic data transfer costs. Moreover, they can adaptively adjust data localities based on the server’s free time, data placement, and network bandwidth, and schedule tens of thousands of tasks within subseconds or seconds.
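The cost-model idea, a per-server transfer cost that grows with the number of data-remote tasks already placed on that server, can be illustrated with a simple greedy scheduler. The sketch below is not the DynDL online or offline algorithm; the shape of the cost function and the unit compute time are assumptions made for the example.

```python
# Greedy placement under a non-decreasing, per-server remote-transfer cost.
from collections import defaultdict

def transfer_cost(n_remote_tasks):
    """Non-decreasing cost of adding one more data-remote task to a server."""
    return 1.0 + 0.5 * n_remote_tasks   # assumed shape; any non-decreasing f works

def greedy_schedule(tasks, servers):
    """tasks: {task: server_holding_its_input}; servers: iterable of server ids."""
    load = defaultdict(float)        # accumulated processing time per server
    remote = defaultdict(int)        # data-remote tasks already placed per server
    placement = {}
    for task, data_server in tasks.items():
        def finish_time(s):
            penalty = 0.0 if s == data_server else transfer_cost(remote[s])
            return load[s] + 1.0 + penalty        # unit compute time + transfer
        best = min(servers, key=finish_time)
        placement[task] = best
        load[best] = finish_time(best)
        if best != data_server:
            remote[best] += 1
    return placement

if __name__ == "__main__":
    tasks = {f"t{i}": f"s{i % 2}" for i in range(8)}   # inputs live on s0/s1
    print(greedy_schedule(tasks, ["s0", "s1", "s2"]))
```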


Materials ◽  
2019 ◽  
Vol 12 (5) ◽  
pp. 791 ◽  
Author(s):  
Peter Gamnitzer ◽  
Martin Drexel ◽  
Andreas Brugger ◽  
Günter Hofstetter

Hygro-thermo-chemo-mechanical modelling of time-dependent concrete behavior requires the accurate determination of a large set of parameters. In this paper, the parameters of a multiphase model are calibrated based on a comprehensive set of experiments for a particular concrete of grade C30/37. The experiments include a calorimetry test, tests for age-dependent mechanical properties, tests for determining the water desorption isotherm, shrinkage tests, and compressive creep tests. The latter two were performed on sealed and unsealed specimens with accompanying mass water content measurements. The multiphase model is based on an effective stress formulation. It features a porosity-dependent desorption isotherm, taking into account the time-dependency of the desorption properties. The multiphase model is shown to yield excellent results for the evolutions of the mechanical parameters. The evolution of the autogenous shrinkage strain and the evolutions of the creep compliances for loading at concrete ages of 2 days, 7 days, and 28 days are well predicted, together with the respective mass water content evolution. This also holds for the evolution of the drying shrinkage strain, at least for moderate drying up to one year. However, it will be demonstrated that for longer drying times further conceptual work on the coupled representation of shrinkage and creep is required.
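The calibration step itself, fitting model parameters to measured evolutions, can be illustrated generically. The sketch below fits an assumed logarithmic creep-compliance law to made-up compliance values with SciPy's least-squares curve fitting; it is not the multiphase model and does not use the paper's data.

```python
# Generic parameter calibration: least-squares fit of an assumed
# creep-compliance law J(t) = J0 + a*log(1 + t/tau) to measured values.
import numpy as np
from scipy.optimize import curve_fit

def compliance(t, J0, a, tau):
    """Assumed creep compliance, t in days, J in 1/GPa."""
    return J0 + a * np.log1p(t / tau)

# Stand-in measurements: compliance at several load durations (days).
t_days = np.array([1, 2, 7, 28, 90, 180, 365], dtype=float)
J_meas = np.array([35.0, 38.1, 44.0, 51.2, 58.5, 62.3, 66.0])

params, _ = curve_fit(compliance, t_days, J_meas,
                      p0=[30.0, 5.0, 1.0], bounds=(0, np.inf))
J0, a, tau = params
print(f"J0={J0:.1f} 1/GPa, a={a:.2f}, tau={tau:.2f} d")
```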

