From source to sink - Sustainable and reproducible data pipelines with SaQC

Author(s):  
David Schäfer ◽  
Bert Palm ◽  
Lennart Schmidt ◽  
Peter Lünenschloß ◽  
Jan Bumberger

The number of sensors used in the environmental system sciences is increasing rapidly. While this trend undoubtedly provides great potential to broaden the understanding of complex spatio-temporal processes, it comes with its own set of new challenges. The flow of data from source to sink, from sensors to databases, involves many, usually error-prone, intermediate steps: data acquisition with its specific scientific and technical challenges, data transfer from often remote locations, and the final data processing all carry great potential to introduce errors and disturbances into the actual environmental signal.

Quantifying these errors becomes a crucial part of the later evaluation of all measured data. While many large environmental observatories are moving from manual to more automated forms of data processing and quality assurance, these systems are usually highly customized and hand-written. This approach is non-ideal in several ways: first, it wastes resources, as the same algorithms are implemented over and over again; second, it poses great challenges to reproducibility. If the relevant programs are made available at all, they expose all the problems of software reuse: correctness of the implementation, readability and comprehensibility for future users, and transferability between different computing environments. Beside these general software-development problems, another crucial factor comes into play: the end product, a processed and quality-controlled data set, is closely tied to the current version of the programs in use. Even small changes to the source code can lead to vastly differing results. If this is not approached responsibly, data and programs will inevitably fall out of sync.

The presented software, the 'System for automated Quality Control' (SaQC, www.ufz.git.de/rdm-software/saqc), helps to solve, or at least massively simplify, the presented challenges. As a mainly no-code platform with a large set of implemented functionality, SaQC lowers the entry barrier for the non-programming scientific practitioner without sacrificing fine-grained adaptation to project-specific needs. The text-based configuration allows easy integration into version control systems and thus opens the opportunity to use well-established software for data lineage. We give a short overview of the program's unique features and showcase possibilities to build reliable and reproducible processing and quality assurance pipelines for real-world data from a spatially distributed, heterogeneous sensor network.
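To make the text-based configuration idea concrete, here is a minimal, self-contained sketch. It is not the actual SaQC configuration format or API (the semicolon-separated table, the test names and the driver function are invented for this example); it only illustrates how a small, version-controllable text file can drive per-variable quality checks on a pandas DataFrame. For the real syntax and function set, see the linked repository.

```python
# Minimal illustration of a text-based QC configuration (not the actual SaQC
# format or API): a semicolon-separated table maps variables to quality tests,
# and a small driver applies the tests to a pandas DataFrame of sensor data.
import io
import numpy as np
import pandas as pd

CONFIG = """\
varname;test;kwargs
temperature;range;min=-30, max=50
conductivity;missing;
"""

def flag_range(series, min=-np.inf, max=np.inf):
    """Flag values outside the allowed interval."""
    return (series < min) | (series > max)

def flag_missing(series):
    """Flag missing observations."""
    return series.isna()

TESTS = {"range": flag_range, "missing": flag_missing}

def parse_kwargs(text):
    """Turn 'min=-30, max=50' into {'min': -30.0, 'max': 50.0}."""
    if not isinstance(text, str) or not text.strip():
        return {}
    return {k.strip(): float(v) for k, v in
            (item.split("=") for item in text.split(","))}

def run_pipeline(data, config_text):
    """Apply every configured test and collect boolean flags per variable."""
    config = pd.read_csv(io.StringIO(config_text), sep=";")
    flags = pd.DataFrame(False, index=data.index, columns=data.columns)
    for _, row in config.iterrows():
        test = TESTS[row["test"]]
        flags[row["varname"]] |= test(data[row["varname"]],
                                      **parse_kwargs(row["kwargs"]))
    return flags

if __name__ == "__main__":
    data = pd.DataFrame({
        "temperature": [12.3, 99.0, np.nan, 14.1],
        "conductivity": [450.0, np.nan, 460.0, 455.0],
    })
    print(run_pipeline(data, CONFIG))
```

Because the configuration is plain text, it can be committed alongside the data-processing code, which is exactly what enables the version-control-based data lineage mentioned above.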


2012 ◽  
pp. 875-897
Author(s):  
Guo-Zheng Li

This chapter introduces the major challenges and novel machine learning techniques employed in clinical data processing. It argues that novel machine learning techniques, including support vector machines, ensemble learning, feature selection, feature reuse through multi-task learning, and multi-label learning, provide potentially more substantive solutions for decision support and clinical data analysis. The authors demonstrate the generalization performance of these techniques on real-world data sets, including a brain glioma data set, a coronary heart disease data set from Chinese medicine, and several microarray tumor data sets. More and more machine learning techniques will be developed to improve the analysis precision of clinical data sets.
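As an illustration of two of the listed techniques, the following scikit-learn sketch (not taken from the chapter) combines univariate feature selection with a support vector machine and evaluates the pipeline by cross-validation on a synthetic stand-in for a clinical data set.

```python
# Sketch: univariate feature selection feeding an SVM, evaluated by
# cross-validation on synthetic data standing in for a clinical data set.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic "clinical" data: 200 patients, 50 measured features, few informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),               # SVMs are sensitive to feature scales
    ("select", SelectKBest(f_classif, k=10)),  # keep the 10 most relevant features
    ("svm", SVC(kernel="rbf", C=1.0)),
])

scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```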


Author(s):  
Michael Schatz ◽  
Joachim Jäger ◽  
Marin van Heel

Lumbricus terrestris erythrocruorin is a giant oxygen-transporting macromolecule in the blood of the common earthworm (worm "hemoglobin"). In our current study, we use specimens (kindly provided by Drs W.E. Royer and W.A. Hendrickson) embedded in vitreous ice (1) to avoid artefacts encountered with the negative-stain preparation technique used in previous studies (2-4). Although the molecular structure is well preserved in vitreous ice, the low contrast and high noise level in the micrographs represent a serious problem in image interpretation. Moreover, in this type of preparation the molecules can exhibit many different orientations relative to the object plane of the microscope. Existing techniques of analysis, which require alignment of the molecular views relative to one or more reference images, thus often yield unsatisfactory results. We use a new method in which rotation-, translation- and mirror-invariant functions (5) are first derived from the large set of input images; these functions are subsequently classified automatically using multivariate statistical techniques (6). The different molecular views in the data set can thereby be found without bias (5). Within each class, all images are aligned relative to the member of the class that contributes least to the class's internal variance (6). This reference image is thus the most typical member of the class. Finally, the aligned images from each class are averaged, resulting in molecular views with enhanced statistical resolution.
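The classification idea can be sketched schematically. The following is not the authors' procedure (which uses invariant functions and multivariate statistical analysis): a rotation-invariant radial intensity profile stands in for the invariant functions, k-means stands in for the automatic classification, and class members are simply averaged without the intra-class alignment step.

```python
# Schematic stand-in for reference-free classification and class averaging:
# compute a rotation-invariant descriptor per image, cluster the descriptors,
# then average the members of each class.
import numpy as np
from sklearn.cluster import KMeans

def radial_profile(img, n_bins=32):
    """Rotation-invariant descriptor: mean intensity per radial bin."""
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2)
    bins = np.minimum((r / r.max() * n_bins).astype(int), n_bins - 1)
    return np.array([img[bins == b].mean() for b in range(n_bins)])

def classify_and_average(images, n_classes=4):
    """Cluster images by their invariant descriptors and average each class."""
    feats = np.array([radial_profile(im) for im in images])
    labels = KMeans(n_clusters=n_classes, n_init=10,
                    random_state=0).fit_predict(feats)
    return [np.mean([im for im, l in zip(images, labels) if l == k], axis=0)
            for k in range(n_classes)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy_views = rng.normal(size=(40, 64, 64))   # stand-in for noisy micrographs
    class_averages = classify_and_average(noisy_views)
    print(len(class_averages), class_averages[0].shape)
```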


Biomimetics ◽  
2021 ◽  
Vol 6 (2) ◽  
pp. 32
Author(s):  
Tomasz Blachowicz ◽  
Jacek Grzybowski ◽  
Pawel Steblinski ◽  
Andrea Ehrmann

Computers nowadays have separate components for data storage and data processing, making data transfer between these units a bottleneck for computing speed. Therefore, so-called cognitive (or neuromorphic) computing approaches try to combine both tasks, as is done in the human brain, to make computing faster and less energy-consuming. One possible way to prepare new hardware for neuromorphic computing is offered by nanofiber networks, as they can be prepared by diverse methods, from lithography to electrospinning. Here, we show results of micromagnetic simulations of three coupled semicircle fibers in which domain walls are excited by rotating magnetic fields (inputs), leading to different output signals that can be used for stochastic data processing, mimicking biological synaptic activity and thus being suitable as artificial synapses in artificial neural networks.
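As a purely conceptual illustration of how such stochastic outputs could serve as synapses (no micromagnetics involved, and not the authors' simulation), the toy model below updates a bounded synaptic weight probabilistically, with the update probability standing in for the fiber network's stochastic output signal.

```python
# Conceptual toy model: a stochastic synapse whose potentiation probability
# is set by the input signal, mimicking probabilistic hardware outputs.
import numpy as np

rng = np.random.default_rng(1)

def stochastic_synapse_update(weight, presynaptic_rate, learning_rate=0.1):
    """Potentiate or depress the weight with a probability set by the input rate."""
    if rng.random() < presynaptic_rate:           # input acts probabilistically
        weight += learning_rate * (1.0 - weight)  # potentiation, bounded at 1
    else:
        weight -= learning_rate * weight          # depression, bounded at 0
    return weight

w = 0.5
for step in range(1000):
    w = stochastic_synapse_update(w, presynaptic_rate=0.8)
print(f"steady-state weight after strong input: {w:.2f}")
```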


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Sven Lißner ◽  
Stefan Huber

Abstract Background: GPS-based cycling data are increasingly available for traffic planning these days. However, the recorded data often contain more information than just bicycle trips: tracks recorded while using modes of transport other than the bike, or long stationary periods at work locations while tracking continues, are only two examples. Collected bicycle GPS data therefore need to be processed adequately before they can be used for transportation planning. Results: The article presents a multi-level approach to bicycle-specific data processing. The data processing model contains several processing steps (data filtering, smoothing, trip segmentation, transport mode recognition, driving mode detection) to finally obtain a data set that contains only bicycle trips. The validation reveals a sound accuracy of the model at its current state (82–88%).
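Such a multi-level pipeline can be sketched as a chain of small, testable functions. The sketch below is an illustrative stand-in, not the authors' model: the thresholds, the rolling-median smoothing, the time-gap trip segmentation and the speed-based mode heuristic are all assumptions chosen for the example.

```python
# Illustrative multi-level GPS processing: filter -> smooth -> segment trips
# -> crude transport-mode heuristic, on a table of GPS fixes.
import pandas as pd

def filter_fixes(df, max_speed_kmh=70.0):
    """Drop implausible fixes, e.g. speeds far above what a cyclist can reach."""
    return df[df["speed_kmh"] <= max_speed_kmh].copy()

def smooth(df, window=5):
    """Rolling-median smoothing of the speed signal."""
    df["speed_kmh"] = df["speed_kmh"].rolling(window, min_periods=1,
                                              center=True).median()
    return df

def segment_trips(df, max_gap_s=300):
    """Start a new trip whenever the time gap between fixes exceeds max_gap_s."""
    gaps = df["timestamp"].diff().dt.total_seconds().fillna(0)
    df["trip_id"] = (gaps > max_gap_s).cumsum()
    return df

def classify_mode(df, min_kmh=6.0, max_kmh=35.0):
    """Label a trip 'bike' if its mean speed is plausible for cycling."""
    means = df.groupby("trip_id")["speed_kmh"].mean()
    df["mode"] = df["trip_id"].map(
        lambda t: "bike" if min_kmh <= means[t] <= max_kmh else "other")
    return df

def process(df):
    return classify_mode(segment_trips(smooth(filter_fixes(df))))

if __name__ == "__main__":
    fixes = pd.DataFrame({
        "timestamp": pd.date_range("2021-06-01 08:00", periods=8, freq="30s"),
        "speed_kmh": [15, 18, 120, 20, 17, 2, 1, 22],
    })
    print(process(fixes)[["timestamp", "speed_kmh", "trip_id", "mode"]])
```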


2021 ◽  
pp. 1-13
Author(s):  
Hailin Liu ◽  
Fangqing Gu ◽  
Zixian Lin

Transfer learning methods exploit similarities between different datasets to improve the performance of the target task by transferring knowledge from source tasks to the target task. “What to transfer” is a main research issue in transfer learning. Existing transfer learning methods generally need to acquire the shared parameters by integrating human knowledge. However, in many real applications, it is unknown beforehand which parameters can be shared. A transfer learning model is essentially a special multi-objective optimization problem. Consequently, this paper proposes a novel auto-sharing parameter technique for transfer learning based on multi-objective optimization and solves the optimization problem with a multi-swarm particle swarm optimizer. Each task objective is optimized by its own sub-swarm. The current best particle of the target task's sub-swarm is used to guide the search of the source-task particles, and vice versa. The target and source tasks are thus solved jointly by sharing the information of the best particles, which acts as an inductive bias. Experiments on several synthetic data sets and two real-world data sets (a school data set and a landmine data set) show that the proposed algorithm is effective.
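The cross-swarm guidance idea can be sketched compactly. The code below is not the paper's exact algorithm: it is a toy two-swarm particle swarm optimizer in which each swarm's velocity update also includes an attraction term towards the other swarm's best particle; the objectives, swarm sizes and the coefficient c3 are assumptions.

```python
# Toy multi-swarm PSO with cross-swarm guidance: each swarm optimizes its own
# task but is also attracted to the other swarm's best particle.
import numpy as np

rng = np.random.default_rng(0)

def target_loss(x):   # stand-in objective for the target task
    return np.sum((x - 1.0) ** 2)

def source_loss(x):   # stand-in objective for the source task
    return np.sum((x - 1.2) ** 2)

def make_swarm(n, dim):
    pos = rng.uniform(-3, 3, (n, dim))
    return {"pos": pos, "vel": np.zeros((n, dim)),
            "pbest": pos.copy(), "pbest_val": np.full(n, np.inf)}

def step(swarm, loss, own_gbest, other_gbest, w=0.7, c1=1.4, c2=1.4, c3=0.6):
    n, dim = swarm["pos"].shape
    r1, r2, r3 = rng.random((3, n, dim))
    swarm["vel"] = (w * swarm["vel"]
                    + c1 * r1 * (swarm["pbest"] - swarm["pos"])
                    + c2 * r2 * (own_gbest - swarm["pos"])
                    + c3 * r3 * (other_gbest - swarm["pos"]))  # cross-swarm term
    swarm["pos"] += swarm["vel"]
    vals = np.array([loss(p) for p in swarm["pos"]])
    improved = vals < swarm["pbest_val"]
    swarm["pbest"][improved] = swarm["pos"][improved]
    swarm["pbest_val"][improved] = vals[improved]

dim = 5
swarms = {"target": make_swarm(20, dim), "source": make_swarm(20, dim)}
losses = {"target": target_loss, "source": source_loss}
for _ in range(100):
    gbest = {k: s["pbest"][np.argmin(s["pbest_val"])] for k, s in swarms.items()}
    step(swarms["target"], losses["target"], gbest["target"], gbest["source"])
    step(swarms["source"], losses["source"], gbest["source"], gbest["target"])
print({k: round(s["pbest_val"].min(), 4) for k, s in swarms.items()})
```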


Author(s):  
Shaoqiang Wang ◽  
Shudong Wang ◽  
Song Zhang ◽  
Yifan Wang

Abstract The aim is to automatically detect dynamic EEG signals and thereby reduce the time cost of epilepsy diagnosis. In the recognition of electroencephalogram (EEG) signals of epilepsy, traditional machine learning and statistical methods require manual feature engineering to show excellent results on a single data set, and the manually selected features may carry a bias and cannot guarantee validity and transferability to real-world data. In practical applications, deep learning methods can, to a certain extent, release people from feature engineering: as long as data quality and quantity are expanded, the model can learn automatically and keep improving. In addition, deep learning can extract many features that are difficult for humans to perceive, making the algorithm more robust. Based on the design idea of the ResNeXt deep neural network, this paper designs a Time-ResNeXt network structure suitable for time-series EEG epilepsy detection. The accuracy of Time-ResNeXt in the detection of EEG epilepsy reaches 91.50%. The Time-ResNeXt network structure achieves highly competitive performance on the benchmark Berne-Barcelona dataset and has great potential for improving clinical practice.
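The kind of building block such an architecture relies on can be sketched in PyTorch. The code below is not the paper's Time-ResNeXt; it is a generic 1D ResNeXt-style bottleneck with grouped convolutions (the cardinality) and a residual connection, with invented layer sizes.

```python
# Generic 1D ResNeXt-style bottleneck for time-series (e.g. EEG) inputs.
import torch
import torch.nn as nn

class ResNeXtBlock1D(nn.Module):
    """Bottleneck: 1x1 reduce -> grouped 3-conv (cardinality paths) -> 1x1 expand."""
    def __init__(self, channels, cardinality=8, bottleneck=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm1d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv1d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),   # parallel grouped paths
            nn.BatchNorm1d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv1d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual connection

if __name__ == "__main__":
    eeg = torch.randn(4, 32, 512)             # batch, channels, time samples
    block = ResNeXtBlock1D(channels=32)
    print(block(eeg).shape)                   # torch.Size([4, 32, 512])
```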


2018 ◽  
Vol 8 (11) ◽  
pp. 2216
Author(s):  
Jiahui Jin ◽  
Qi An ◽  
Wei Zhou ◽  
Jiakai Tang ◽  
Runqun Xiong

Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. This problem is exacerbated in multicore server-based clusters, where multiple tasks running on the same server compete for the server’s network bandwidth. Existing approaches solve this problem by scheduling computational tasks near the input data and considering the server’s free time, data placements, and data transfer costs. However, such approaches usually set identical values for data transfer costs, even though a multicore server’s data transfer cost increases with the number of data-remote tasks. As a result, data-processing time is minimized ineffectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although scheduling under DynDL is NP-complete (nondeterministic polynomial-complete), we prove that the offline algorithm runs in quadratic time and generates optimal results for DynDL’s specific uses. Using a series of simulations and real-world executions, we show that our algorithms reduce data-processing time by 30% compared with algorithms that do not consider dynamic data transfer costs. Moreover, they can adaptively adjust data localities based on the server’s free time, data placement, and network bandwidth, and schedule tens of thousands of tasks within subseconds or seconds.
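The cost-model idea, a per-server transfer cost that grows with the number of data-remote tasks already placed on that server, can be illustrated with a simple greedy scheduler. The sketch below is not the DynDL online or offline algorithm; the shape of the cost function and the unit compute time are assumptions made for the example.

```python
# Greedy placement under a non-decreasing, per-server remote-transfer cost.
from collections import defaultdict

def transfer_cost(n_remote_tasks):
    """Non-decreasing cost of adding one more data-remote task to a server."""
    return 1.0 + 0.5 * n_remote_tasks   # assumed shape; any non-decreasing f works

def greedy_schedule(tasks, servers):
    """tasks: {task: server_holding_its_input}; servers: iterable of server ids."""
    load = defaultdict(float)        # accumulated processing time per server
    remote = defaultdict(int)        # data-remote tasks already placed per server
    placement = {}
    for task, data_server in tasks.items():
        def finish_time(s):
            penalty = 0.0 if s == data_server else transfer_cost(remote[s])
            return load[s] + 1.0 + penalty        # unit compute time + transfer
        best = min(servers, key=finish_time)
        placement[task] = best
        load[best] = finish_time(best)
        if best != data_server:
            remote[best] += 1
    return placement

if __name__ == "__main__":
    tasks = {f"t{i}": f"s{i % 2}" for i in range(8)}   # inputs live on s0/s1
    print(greedy_schedule(tasks, ["s0", "s1", "s2"]))
```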


Materials ◽  
2019 ◽  
Vol 12 (5) ◽  
pp. 791 ◽  
Author(s):  
Peter Gamnitzer ◽  
Martin Drexel ◽  
Andreas Brugger ◽  
Günter Hofstetter

Hygro-thermo-chemo-mechanical modelling of time-dependent concrete behavior requires the accurate determination of a large set of parameters. In this paper, the parameters of a multiphase model are calibrated based on a comprehensive set of experiments for a particular concrete of grade C30/37. The experiments include a calorimetry test, tests for age-dependent mechanical properties, tests for determining the water desorption isotherm, shrinkage tests, and compressive creep tests. The latter two were performed on sealed and unsealed specimens with accompanying mass water content measurements. The multiphase model is based on an effective stress formulation. It features a porosity-dependent desorption isotherm, taking into account the time-dependency of the desorption properties. The multiphase model is shown to yield excellent results for the evolutions of the mechanical parameters. The evolution of the autogenous shrinkage strain and the evolutions of the creep compliances for loading at concrete ages of 2 days, 7 days, and 28 days are well predicted, together with the respective mass water content evolution. This also holds for the evolution of the drying shrinkage strain, at least for moderate drying up to one year. However, it will be demonstrated that for longer drying times further conceptual work on the coupled representation of shrinkage and creep is required.
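The calibration step itself, fitting model parameters to measured evolutions, can be illustrated generically. The sketch below fits an assumed logarithmic creep-compliance law to made-up compliance values with SciPy's least-squares curve fitting; it is not the multiphase model and does not use the paper's data.

```python
# Generic parameter calibration: least-squares fit of an assumed
# creep-compliance law J(t) = J0 + a*log(1 + t/tau) to measured values.
import numpy as np
from scipy.optimize import curve_fit

def compliance(t, J0, a, tau):
    """Assumed creep compliance, t in days, J in 1/GPa."""
    return J0 + a * np.log1p(t / tau)

# Stand-in measurements: compliance at several load durations (days).
t_days = np.array([1, 2, 7, 28, 90, 180, 365], dtype=float)
J_meas = np.array([35.0, 38.1, 44.0, 51.2, 58.5, 62.3, 66.0])

params, _ = curve_fit(compliance, t_days, J_meas,
                      p0=[30.0, 5.0, 1.0], bounds=(0, np.inf))
J0, a, tau = params
print(f"J0={J0:.1f} 1/GPa, a={a:.2f}, tau={tau:.2f} d")
```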

