The Interface between Data Science, Research Assessment and Science Support - Highlights from the German Perspective and Examples from Heidelberg University

Author(s):  
Christoph Siart ◽  
Simon Kopp ◽  
Jochen Apel
Beverages ◽  
2021 ◽  
Vol 7 (1) ◽  
pp. 3
Author(s):  
Zeqing Dong ◽  
Travis Atkison ◽  
Bernard Chen

Although wine has been produced for several thousand years, the ancient beverage has remained popular and even more affordable in modern times. Among all wine-making regions, Bordeaux, France is probably one of the most prestigious wine areas in history. Since hundreds of wines are produced in Bordeaux each year, humans are not likely to be able to examine all wines across multiple vintages to define the characteristics of outstanding 21st century Bordeaux wines. Wineinformatics is a newly proposed data science research area with an application domain in wine, processing large amounts of wine data through the computer. The goal of this paper is to build a high-quality computational model on wine reviews processed by the full power of the Computational Wine Wheel to understand 21st century Bordeaux wines. On top of the 985 binary attributes generated from the Computational Wine Wheel in our previous research, we add further attributes by utilizing CATEGORY and SUBCATEGORY for an additional 14 and 34 continuous attributes, respectively, included in the All Bordeaux (14,349 wines) and the 1855 Bordeaux (1359 wines) datasets. We believe successfully merging the original binary attributes and the new continuous attributes can provide more insights for Naïve Bayes and Support Vector Machine (SVM) classifiers to build models for wine grade category prediction. The experimental results suggest that, for the All Bordeaux dataset, with the additional 14 attributes retrieved from CATEGORY, the Naïve Bayes classification algorithm was able to outperform the existing research results by increasing accuracy by 2.15%, precision by 8.72%, and the F-score by 1.48%. For the 1855 Bordeaux dataset, with the additional attributes retrieved from CATEGORY and SUBCATEGORY, the SVM classification algorithm was able to outperform the existing research results by increasing accuracy by 5%, precision by 2.85%, recall by 5.56%, and the F-score by 4.07%.
The improvements demonstrated in this research show that attributes retrieved from CATEGORY and SUBCATEGORY have the power to provide more information to classifiers for superior model generation. The model built in this research can better distinguish outstanding and classic 21st century Bordeaux wines. This paper provides new directions in Wineinformatics for technical research in data science, such as regression and multi-target classification, and for domain-specific research, including wine region terroir analysis, wine quality prediction, and weather impact examination.
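The grade-category prediction described above can be sketched with a from-scratch Bernoulli Naïve Bayes over binary wine-descriptor attributes. The toy data, descriptor layout, and class labels below are illustrative stand-ins, not the paper's 985-attribute Computational Wine Wheel dataset:

```python
import math

# Toy binary descriptor vectors (hypothetical attributes such as
# "blackberry", "tannin", "oak" - presence/absence per wine review).
X = [
    [1, 1, 1],  # outstanding
    [1, 1, 0],  # outstanding
    [0, 1, 1],  # classic
    [0, 0, 1],  # classic
]
y = ["outstanding", "outstanding", "classic", "classic"]

def train_bernoulli_nb(X, y):
    """Estimate class priors and per-attribute Bernoulli parameters
    with Laplace smoothing to avoid zero probabilities."""
    classes = sorted(set(y))
    n_features = len(X[0])
    priors, theta = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(X)
        theta[c] = [
            (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
    return priors, theta

def predict(x, priors, theta):
    """Pick the class with the highest log posterior."""
    best, best_lp = None, -math.inf
    for c, p in priors.items():
        lp = math.log(p)
        for xj, tj in zip(x, theta[c]):
            lp += math.log(tj if xj else 1 - tj)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

priors, theta = train_bernoulli_nb(X, y)
grade = predict([1, 1, 1], priors, theta)
```

Merging in continuous CATEGORY/SUBCATEGORY counts, as the paper does, would require a Gaussian or multinomial likelihood for those attributes alongside the Bernoulli terms.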


Fermentation ◽  
2021 ◽  
Vol 7 (1) ◽  
pp. 27
Author(s):  
Jared McCune ◽  
Alex Riley ◽  
Bernard Chen

Wineinformatics is a new data science research area that focuses on large amounts of wine-related data. Most current Wineinformatics research focuses on supervised learning to predict wine quality, price, region, and weather. In this research, unsupervised learning using K-means clustering with an optimal-K search and a filtration process is applied to a Bordeaux-region-specific dataset to form clusters and find representative wines in each cluster. The 14,349 wines of the 21st century Bordeaux dataset are clustered into 43 and 13 clusters, with detailed analysis of the number of wines, dominant wine characteristics, average wine grades, and representative wines in each cluster. Similar results are also generated and presented for 435 elite wines (wines that scored 95 points and above on a 100-point scale). The information generated by this research can be beneficial to wine vendors, who must make a selection given the limited number of wines they can realistically offer; to connoisseurs, who can study wines in a target region/vintage/price range through a representative short list; and to wine consumers seeking recommendations. Future studies can adopt the same process to analyze and find representative wines in different wine-making regions/countries, vintages, or pivot points. This paper opens a new door for Wineinformatics in unsupervised learning research.
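The cluster-then-pick-a-representative workflow can be sketched with a plain Lloyd's k-means on toy 2D points standing in for high-dimensional descriptor vectors. Initialization, the K search, and the representative rule below are deliberate simplifications (deterministic first-k seeding instead of k-means++, a raw inertia comparison instead of the paper's optimal-K search and filtration):

```python
# Toy 2D "wine" points: two well-separated groups standing in for
# high-dimensional descriptor vectors.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.2)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(v) / len(pts) for v in zip(*pts))

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm with deterministic initialization
    (first k points; k-means++ would be a better choice in practice)."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    assign = [min(range(k), key=lambda i: dist2(p, centroids[i]))
              for p in points]
    return centroids, assign

def inertia(points, centroids, assign):
    """Within-cluster sum of squared distances, used for the K search."""
    return sum(dist2(p, centroids[i]) for p, i in zip(points, assign))

# K search: inertia drops sharply until the natural number of groups.
scores = {}
for k in (1, 2, 3):
    c, a = kmeans(points, k)
    scores[k] = inertia(points, c, a)

# Representative "wine" of a cluster: the member closest to its centroid.
centroids, assign = kmeans(points, 2)
rep = min((p for p, i in zip(points, assign) if i == 0),
          key=lambda p: dist2(p, centroids[0]))
```

On real review data the points would be the binary/continuous attribute vectors, and the representative member gives the short list the abstract mentions.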


2020 ◽  
Vol 20 (2) ◽  
pp. e08
Author(s):  
Verónica Cuello ◽  
Gonzalo Zarza ◽  
Maria Corradini ◽  
Michael Rogers

The objective of this article is to introduce a comprehensive end-to-end solution aimed at enabling the application of state-of-the-art Data Science and Analytics methodologies to a food science related problem. The problem refers to the automation of loading, homogenization, complex processing and real-time accessibility of low molecular-weight gelator (LMWG) data to gain insights into their assembly behavior, i.e. whether a gel can be mixed with an appropriate solvent or not. Most of the work within the field of Colloidal and Food Science in relation to LMWGs has centered on identifying adequate solvents that can generate stable gels and evaluating how the LMWG characteristics can affect gelation. As a result, extensive databases have been methodically and manually registered, storing results from different laboratory experiments. The complexity of those databases, and the errors caused by manual data entry, can interfere with the analysis and visualization of relations and patterns, limiting the utility of the experimental work. To address these issues, we have proposed a scalable and flexible Big Data solution to enable the unification, homogenization and availability of the data through the application of appropriate tools and methodologies. This approach contributes to optimizing data acquisition during LMWG research and reducing redundant data processing and analysis, while also enabling researchers to explore a wider range of testing conditions and push forward the frontier in Food Science research.
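The homogenization step the abstract describes can be sketched as an alias-driven normalization of heterogeneous lab records into one canonical schema. All field names and values below are hypothetical, not taken from the authors' databases, and unit reconciliation (e.g. mg/mL vs. wt%) is deliberately left out as a separate step:

```python
# Hypothetical raw records as they might arrive from different lab spreadsheets.
raw = [
    {"Gelator": "12-HSA", "Solvent": "toluene",
     "Conc (mg/mL)": "10", "Result": "GEL"},
    {"gelator_name": "12-HSA", "solvent": "Toluene",
     "conc_wt_pct": "1.0", "outcome": "gel"},
]

# Alias table mapping each lab's column names onto one canonical schema.
ALIASES = {
    "gelator": {"Gelator", "gelator_name"},
    "solvent": {"Solvent", "solvent"},
    "outcome": {"Result", "outcome"},
}

def homogenize(record):
    """Map a heterogeneous record onto the canonical schema and
    normalize obvious manual-entry inconsistencies (whitespace, case).
    Concentration columns are skipped here because their units differ
    and would need explicit conversion."""
    clean = {}
    for canonical, names in ALIASES.items():
        for name in names:
            if name in record:
                clean[canonical] = str(record[name]).strip().lower()
    return clean

rows = [homogenize(r) for r in raw]
```

After this pass the two records agree on gelator, solvent, and gel/no-gel outcome, which is what downstream assembly-behavior analysis needs.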


2018 ◽  
Vol 6 (3) ◽  
pp. 669-686 ◽  
Author(s):  
Michael Dietze

Abstract. Environmental seismology is the study of the seismic signals emitted by Earth surface processes. This emerging research field is at the intersection of seismology, geomorphology, hydrology, meteorology, and further Earth science disciplines. It amalgamates a wide variety of methods from across these disciplines and ultimately fuses them in a common analysis environment. This overarching scope of environmental seismology requires coherent yet integrative software that is accepted by many of the involved scientific disciplines. The statistical software R has gained paramount importance in the majority of data science research fields. R has well-justified advantages over other, mostly commercial, software, which makes it the ideal language on which to base a comprehensive analysis toolbox. The article introduces the avenues and needs of environmental seismology, and how these are met by the R package eseis. The conceptual structure, example data sets, and available functions are demonstrated. Worked examples illustrate possible applications of the package and give in-depth descriptions of the flexible use of its functions. The package has a registered DOI, is available under the GPL licence on the Comprehensive R Archive Network (CRAN), and is maintained on GitHub.


Author(s):  
Emily Slade ◽  
Linda P. Dwoskin ◽  
Guo-Qiang Zhang ◽  
Jeffery C. Talbert ◽  
Jin Chen ◽  
...  

Abstract The availability of large healthcare datasets offers the opportunity for researchers to navigate the traditional clinical and translational science research stages in a nonlinear manner. In particular, data scientists can harness the power of large healthcare datasets to bridge from preclinical discoveries (T0) directly to assessing population-level health impact (T4). A successful bridge from T0 to T4 does not bypass the other stages entirely; rather, effective team science makes a direct progression from T0 to T4 impactful by incorporating the perspectives of researchers from every stage of the clinical and translational science research spectrum. In this exemplar, we demonstrate how effective team science overcame challenges and, ultimately, ensured success when a diverse team of researchers worked together, using healthcare big data to test population-level substance use disorder (SUD) hypotheses generated from preclinical rodent studies. This project, called Advancing Substance use disorder Knowledge using Big Data (ASK Big Data), highlights the critical roles that data science expertise and effective team science play in quickly translating preclinical research into public health impact.


2019 ◽  
Vol 37 (6) ◽  
pp. 929-951 ◽  
Author(s):  
Laurent Remy ◽  
Dragan Ivanović ◽  
Maria Theodoridou ◽  
Athina Kritsotaki ◽  
Paul Martin ◽  
...  

Purpose - The purpose of this paper is to boost multidisciplinary research by building an integrated catalogue of research assets metadata. Such an integrated catalogue should enable researchers to solve problems or analyse phenomena that require a view across several scientific domains.
Design/methodology/approach - There are two main approaches for integrating metadata catalogues provided by different e-science research infrastructures (e-RIs): centralised and distributed. The authors decided to implement a central metadata catalogue that describes, provides access to and records actions on the assets of a number of e-RIs participating in the system. The authors chose the CERIF data model for description of assets available via the integrated catalogue. Analysis of popular metadata formats used in e-RIs has been conducted, and mappings between popular formats and the CERIF data model have been defined using an XML-based tool for description and automatic execution of mappings.
Findings - An integrated catalogue of research assets metadata has been created. Metadata from e-RIs supporting Dublin Core, ISO 19139, DCAT-AP, EPOS-DCAT-AP, OIL-E and CKAN formats can be integrated into the catalogue. Metadata are stored in CERIF RDF in the integrated catalogue. A web portal for searching this catalogue has been implemented.
Research limitations/implications - Only five formats are supported at this moment. However, descriptions of mappings between other source formats and the target CERIF format can be defined in the future using the 3M tool, an XML-based tool for describing X3ML mappings that can then be automatically executed on XML metadata records. The approach and best practices described in this paper can thus be applied in future mappings between other metadata formats.
Practical implications - The integrated catalogue is a part of the eVRE prototype, which is a result of the VRE4EIC H2020 project.
Social implications - The integrated catalogue should boost the performance of multidisciplinary research; thus it has the potential to enhance the practice of data science and so contribute to an increasingly knowledge-based society.
Originality/value - A novel approach for creation of the integrated catalogue has been defined and implemented. The approach includes the definition of mappings between various formats. Defined mappings are effective and shareable.
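The mapping idea behind the catalogue can be sketched as a declarative, table-driven transformation of a Dublin Core-style record into a CERIF-like target. The target field names here are illustrative placeholders rather than the real CERIF schema, and the actual system executes X3ML mappings on XML records rather than transforming Python dicts:

```python
# Illustrative mapping table: Dublin Core-style fields -> CERIF-like fields.
DC_TO_CERIF = {
    "dc:title": "name",
    "dc:creator": "creator",
    "dc:date": "date",
}

def map_record(dc_record, mapping):
    """Apply a declarative field mapping, keeping unmapped fields
    aside so nothing is silently lost during integration."""
    mapped, unmapped = {}, {}
    for field, value in dc_record.items():
        if field in mapping:
            mapped[mapping[field]] = value
        else:
            unmapped[field] = value
    return mapped, unmapped

record = {"dc:title": "Seismic catalogue",
          "dc:creator": "Remy",
          "dc:type": "Dataset"}
mapped, unmapped = map_record(record, DC_TO_CERIF)
```

Keeping the mapping as data rather than code is what makes such mappings shareable and extensible to further source formats, as the abstract notes.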


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Kehua Miao ◽  
Jie Li ◽  
Wenxing Hong ◽  
Mingtao Chen

The booming development of data science and big data technology stacks has inspired continuous iterative updates of data science research and working methods. At present, the division of labor between data science and big data work is increasingly fine-grained. Traditional working methods, from building the work infrastructure environment to data modelling and analysis, greatly reduce work and research efficiency. In this paper, we focus on enabling smooth collaboration within data science teams by building a data science and big data analysis application platform based on a microservices architecture for education and nonprofessional research fields. In this microservices-based environment, which facilitates updating individual components, the platform provides a personal code experiment environment integrating JupyterHub on top of Spark and HDFS for multiuser use, and a visual modelling tool that follows the modular design of data science engineering, based on Greenplum in-database analysis. The entire web service system is developed with Spring Boot.


Entropy ◽  
2019 ◽  
Vol 21 (8) ◽  
pp. 763 ◽  
Author(s):  
Alaa Sagheer ◽  
Mohammed Zidan ◽  
Mohammed M. Abdelsamea

Pattern classification represents a challenging problem in machine learning and data science research domains, especially when there is a limited availability of training samples. In recent years, artificial neural network (ANN) algorithms have demonstrated astonishing performance when compared to traditional generative and discriminative classification algorithms. However, due to the complexity of classical ANN architectures, ANNs are sometimes incapable of providing efficient solutions when addressing complex distribution problems. Motivated by the mathematical definition of a quantum bit (qubit), we propose a novel autonomous perceptron model (APM) that can solve the problem of the architecture complexity of traditional ANNs. APM is a nonlinear classification model that has a simple and fixed architecture inspired by the computational superposition power of the qubit. The proposed perceptron is able to construct the activation operators autonomously after a limited number of iterations. Several experiments using various datasets are conducted, where all the empirical results show the superiority of the proposed model as a classifier in terms of accuracy and computational time when it is compared with baseline classification models.
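The qubit definition that motivates APM can be illustrated numerically: a qubit state |psi> = alpha|0> + beta|1> is any amplitude pair with |alpha|^2 + |beta|^2 = 1, and a classical feature vector can be pushed into that form by amplitude normalization, a common quantum-inspired encoding. This is only the underlying mathematical idea; the actual APM construction is not reproduced here:

```python
import math

def amplitude_encode(features):
    """Normalize a feature vector so its squared entries sum to 1,
    mimicking the normalization constraint on quantum amplitudes."""
    norm = math.sqrt(sum(f * f for f in features))
    return [f / norm for f in features]

# Equal superposition |+> = (|0> + |1>) / sqrt(2)
alpha, beta = amplitude_encode([1.0, 1.0])

# Squared amplitudes behave like measurement probabilities for |0>, |1>.
probs = [alpha ** 2, beta ** 2]
```

The superposition property lets a fixed-size state carry a full probability distribution over outcomes, which is the "computational superposition power" the abstract credits for keeping the model architecture simple and fixed.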

