The Interface between Data Science, Research Assessment and Science Support - Highlights from the German Perspective and Examples from Heidelberg University

Author(s):  
Christoph Siart ◽  
Simon Kopp ◽  
Jochen Apel
Beverages ◽  
2021 ◽  
Vol 7 (1) ◽  
pp. 3
Author(s):  
Zeqing Dong ◽  
Travis Atkison ◽  
Bernard Chen

Although wine has been produced for several thousand years, the ancient beverage has remained popular and even more affordable in modern times. Among all wine-making regions, Bordeaux, France is probably one of the most prestigious wine areas in history. Since hundreds of wines are produced in Bordeaux each year, humans are not likely to be able to examine all wines across multiple vintages to define the characteristics of outstanding 21st century Bordeaux wines. Wineinformatics is a newly proposed data science research area with an application domain in wine, processing large amounts of wine data through the computer. The goal of this paper is to build a high-quality computational model on wine reviews processed by the full power of the Computational Wine Wheel to understand 21st century Bordeaux wines. On top of the 985 binary attributes generated from the Computational Wine Wheel in our previous research, we add further attributes by utilizing CATEGORY and SUBCATEGORY for an additional 14 and 34 continuous attributes, respectively, included in the All Bordeaux (14,349 wines) and the 1855 Bordeaux (1359 wines) datasets. We believe successfully merging the original binary attributes and the new continuous attributes can provide more insights for Naïve Bayes and Support Vector Machine (SVM) classifiers to build models for wine grade category prediction. The experimental results suggest that, for the All Bordeaux dataset, with the additional 14 attributes retrieved from CATEGORY, the Naïve Bayes classification algorithm was able to outperform the existing research results by increasing accuracy by 2.15%, precision by 8.72%, and the F-score by 1.48%. For the 1855 Bordeaux dataset, with the additional attributes retrieved from CATEGORY and SUBCATEGORY, the SVM classification algorithm was able to outperform the existing research results by increasing accuracy by 5%, precision by 2.85%, recall by 5.56%, and the F-score by 4.07%.
The improvements demonstrated in this research show that attributes retrieved from CATEGORY and SUBCATEGORY have the power to provide more information to classifiers for superior model generation. The model built in this research can better distinguish outstanding and classic 21st century Bordeaux wines. This paper provides new directions in Wineinformatics for technical research in data science, such as regression and multi-target classification, and for domain-specific research, including wine region terroir analysis, wine quality prediction, and weather impact examination.
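The grade-category prediction described above can be sketched with a from-scratch Bernoulli Naïve Bayes over binary wine-descriptor attributes. The toy data, descriptor layout, and class labels below are illustrative stand-ins, not the paper's 985-attribute Computational Wine Wheel dataset:

```python
import math

# Toy binary descriptor vectors (hypothetical attributes such as
# "blackberry", "tannin", "oak" - presence/absence per wine review).
X = [
    [1, 1, 1],  # outstanding
    [1, 1, 0],  # outstanding
    [0, 1, 1],  # classic
    [0, 0, 1],  # classic
]
y = ["outstanding", "outstanding", "classic", "classic"]

def train_bernoulli_nb(X, y):
    """Estimate class priors and per-attribute Bernoulli parameters
    with Laplace smoothing to avoid zero probabilities."""
    classes = sorted(set(y))
    n_features = len(X[0])
    priors, theta = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(X)
        theta[c] = [
            (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
    return priors, theta

def predict(x, priors, theta):
    """Pick the class with the highest log posterior."""
    best, best_lp = None, -math.inf
    for c, p in priors.items():
        lp = math.log(p)
        for xj, tj in zip(x, theta[c]):
            lp += math.log(tj if xj else 1 - tj)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

priors, theta = train_bernoulli_nb(X, y)
grade = predict([1, 1, 1], priors, theta)
```

Merging in continuous CATEGORY/SUBCATEGORY counts, as the paper does, would require a Gaussian or multinomial likelihood for those attributes alongside the Bernoulli terms.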


Fermentation ◽  
2021 ◽  
Vol 7 (1) ◽  
pp. 27
Author(s):  
Jared McCune ◽  
Alex Riley ◽  
Bernard Chen

Wineinformatics is a new data science research area that focuses on large amounts of wine-related data. Most current Wineinformatics research focuses on supervised learning to predict wine quality, price, region, and weather. In this research, unsupervised learning using K-means clustering with an optimal-K search and a filtration process is applied to a Bordeaux-region-specific dataset to form clusters and find representative wines in each cluster. The 14,349 wines of the 21st century Bordeaux dataset are clustered into 43 and 13 clusters, with detailed analysis of the number of wines, dominant wine characteristics, average wine grades, and representative wines in each cluster. Similar results are also generated and presented for 435 elite wines (wines that scored 95 points and above on a 100-point scale). The information generated by this research can be beneficial to wine vendors, who must make a selection given the limited number of wines they can realistically offer; to connoisseurs, who can study wines in a target region/vintage/price range through a representative short list; and to wine consumers seeking recommendations. Future studies can adopt the same process to analyze and find representative wines in different wine-making regions/countries, vintages, or pivot points. This paper opens a new door for Wineinformatics in unsupervised learning research.
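The cluster-then-pick-a-representative workflow can be sketched with a plain Lloyd's k-means on toy 2D points standing in for high-dimensional descriptor vectors. Initialization, the K search, and the representative rule below are deliberate simplifications (deterministic first-k seeding instead of k-means++, a raw inertia comparison instead of the paper's optimal-K search and filtration):

```python
# Toy 2D "wine" points: two well-separated groups standing in for
# high-dimensional descriptor vectors.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.2)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(v) / len(pts) for v in zip(*pts))

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm with deterministic initialization
    (first k points; k-means++ would be a better choice in practice)."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    assign = [min(range(k), key=lambda i: dist2(p, centroids[i]))
              for p in points]
    return centroids, assign

def inertia(points, centroids, assign):
    """Within-cluster sum of squared distances, used for the K search."""
    return sum(dist2(p, centroids[i]) for p, i in zip(points, assign))

# K search: inertia drops sharply until the natural number of groups.
scores = {}
for k in (1, 2, 3):
    c, a = kmeans(points, k)
    scores[k] = inertia(points, c, a)

# Representative "wine" of a cluster: the member closest to its centroid.
centroids, assign = kmeans(points, 2)
rep = min((p for p, i in zip(points, assign) if i == 0),
          key=lambda p: dist2(p, centroids[0]))
```

On real review data the points would be the binary/continuous attribute vectors, and the representative member gives the short list the abstract mentions.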


2020 ◽  
Vol 20 (2) ◽  
pp. e08
Author(s):  
Verónica Cuello ◽  
Gonzalo Zarza ◽  
Maria Corradini ◽  
Michael Rogers

The objective of this article is to introduce a comprehensive end-to-end solution aimed at enabling the application of state-of-the-art Data Science and Analytics methodologies to a food science related problem. The problem refers to the automation of loading, homogenization, complex processing and real-time accessibility of low molecular-weight gelator (LMWG) data to gain insights into their assembly behavior, i.e. whether a gel can be mixed with an appropriate solvent or not. Most of the work within the field of Colloidal and Food Science in relation to LMWGs has centered on identifying adequate solvents that can generate stable gels and evaluating how the LMWG characteristics can affect gelation. As a result, extensive databases have been methodically and manually registered, storing results from different laboratory experiments. The complexity of those databases, and the errors caused by manual data entry, can interfere with the analysis and visualization of relations and patterns, limiting the utility of the experimental work. To address these issues, we have proposed a scalable and flexible Big Data solution to enable the unification, homogenization and availability of the data through the application of appropriate tools and methodologies. This approach contributes to optimizing data acquisition during LMWG research and reducing redundant data processing and analysis, while also enabling researchers to explore a wider range of testing conditions and push forward the frontier in Food Science research.
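The homogenization step the abstract describes can be sketched as an alias-driven normalization of heterogeneous lab records into one canonical schema. All field names and values below are hypothetical, not taken from the authors' databases, and unit reconciliation (e.g. mg/mL vs. wt%) is deliberately left out as a separate step:

```python
# Hypothetical raw records as they might arrive from different lab spreadsheets.
raw = [
    {"Gelator": "12-HSA", "Solvent": "toluene",
     "Conc (mg/mL)": "10", "Result": "GEL"},
    {"gelator_name": "12-HSA", "solvent": "Toluene",
     "conc_wt_pct": "1.0", "outcome": "gel"},
]

# Alias table mapping each lab's column names onto one canonical schema.
ALIASES = {
    "gelator": {"Gelator", "gelator_name"},
    "solvent": {"Solvent", "solvent"},
    "outcome": {"Result", "outcome"},
}

def homogenize(record):
    """Map a heterogeneous record onto the canonical schema and
    normalize obvious manual-entry inconsistencies (whitespace, case).
    Concentration columns are skipped here because their units differ
    and would need explicit conversion."""
    clean = {}
    for canonical, names in ALIASES.items():
        for name in names:
            if name in record:
                clean[canonical] = str(record[name]).strip().lower()
    return clean

rows = [homogenize(r) for r in raw]
```

After this pass the two records agree on gelator, solvent, and gel/no-gel outcome, which is what downstream assembly-behavior analysis needs.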


2018 ◽  
Vol 6 (3) ◽  
pp. 669-686 ◽  
Author(s):  
Michael Dietze

Abstract. Environmental seismology is the study of the seismic signals emitted by Earth surface processes. This emerging research field is at the intersection of seismology, geomorphology, hydrology, meteorology, and further Earth science disciplines. It amalgamates a wide variety of methods from across these disciplines and ultimately fuses them in a common analysis environment. This overarching scope of environmental seismology requires coherent yet integrative software that is accepted by many of the involved scientific disciplines. The statistical software R has gained paramount importance in the majority of data science research fields. R has well-justified advantages over other, mostly commercial, software, which makes it the ideal language on which to base a comprehensive analysis toolbox. The article introduces the avenues and needs of environmental seismology, and how these are met by the R package eseis. The conceptual structure, example data sets, and available functions are demonstrated. Worked examples illustrate possible applications of the package and give in-depth descriptions of the flexible use of its functions. The package has a registered DOI, is available under the GPL licence on the Comprehensive R Archive Network (CRAN), and is maintained on GitHub.


Author(s):  
Emily Slade ◽  
Linda P. Dwoskin ◽  
Guo-Qiang Zhang ◽  
Jeffery C. Talbert ◽  
Jin Chen ◽  
...  

Abstract The availability of large healthcare datasets offers the opportunity for researchers to navigate the traditional clinical and translational science research stages in a nonlinear manner. In particular, data scientists can harness the power of large healthcare datasets to bridge from preclinical discoveries (T0) directly to assessing population-level health impact (T4). A successful bridge from T0 to T4 does not bypass the other stages entirely; rather, effective team science makes a direct progression from T0 to T4 impactful by incorporating the perspectives of researchers from every stage of the clinical and translational science research spectrum. In this exemplar, we demonstrate how effective team science overcame challenges and, ultimately, ensured success when a diverse team of researchers worked together, using healthcare big data to test population-level substance use disorder (SUD) hypotheses generated from preclinical rodent studies. This project, called Advancing Substance use disorder Knowledge using Big Data (ASK Big Data), highlights the critical roles that data science expertise and effective team science play in quickly translating preclinical research into public health impact.


2019 ◽  
Vol 37 (6) ◽  
pp. 929-951 ◽  
Author(s):  
Laurent Remy ◽  
Dragan Ivanović ◽  
Maria Theodoridou ◽  
Athina Kritsotaki ◽  
Paul Martin ◽  
...  

Purpose - The purpose of this paper is to boost multidisciplinary research by building an integrated catalogue of research assets metadata. Such an integrated catalogue should enable researchers to solve problems or analyse phenomena that require a view across several scientific domains.
Design/methodology/approach - There are two main approaches for integrating metadata catalogues provided by different e-science research infrastructures (e-RIs): centralised and distributed. The authors decided to implement a central metadata catalogue that describes, provides access to and records actions on the assets of a number of e-RIs participating in the system. The authors chose the CERIF data model for description of assets available via the integrated catalogue. Analysis of popular metadata formats used in e-RIs has been conducted, and mappings between popular formats and the CERIF data model have been defined using an XML-based tool for description and automatic execution of mappings.
Findings - An integrated catalogue of research assets metadata has been created. Metadata from e-RIs supporting Dublin Core, ISO 19139, DCAT-AP, EPOS-DCAT-AP, OIL-E and CKAN formats can be integrated into the catalogue. Metadata are stored in CERIF RDF in the integrated catalogue. A web portal for searching this catalogue has been implemented.
Research limitations/implications - Only five formats are supported at this moment. However, descriptions of mappings between other source formats and the target CERIF format can be defined in the future using the 3M tool, an XML-based tool for describing X3ML mappings that can then be automatically executed on XML metadata records. The approach and best practices described in this paper can thus be applied in future mappings between other metadata formats.
Practical implications - The integrated catalogue is a part of the eVRE prototype, which is a result of the VRE4EIC H2020 project.
Social implications - The integrated catalogue should boost the performance of multidisciplinary research; thus it has the potential to enhance the practice of data science and so contribute to an increasingly knowledge-based society.
Originality/value - A novel approach for creation of the integrated catalogue has been defined and implemented. The approach includes the definition of mappings between various formats. Defined mappings are effective and shareable.
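The mapping idea behind the catalogue can be sketched as a declarative, table-driven transformation of a Dublin Core-style record into a CERIF-like target. The target field names here are illustrative placeholders rather than the real CERIF schema, and the actual system executes X3ML mappings on XML records rather than transforming Python dicts:

```python
# Illustrative mapping table: Dublin Core-style fields -> CERIF-like fields.
DC_TO_CERIF = {
    "dc:title": "name",
    "dc:creator": "creator",
    "dc:date": "date",
}

def map_record(dc_record, mapping):
    """Apply a declarative field mapping, keeping unmapped fields
    aside so nothing is silently lost during integration."""
    mapped, unmapped = {}, {}
    for field, value in dc_record.items():
        if field in mapping:
            mapped[mapping[field]] = value
        else:
            unmapped[field] = value
    return mapped, unmapped

record = {"dc:title": "Seismic catalogue",
          "dc:creator": "Remy",
          "dc:type": "Dataset"}
mapped, unmapped = map_record(record, DC_TO_CERIF)
```

Keeping the mapping as data rather than code is what makes such mappings shareable and extensible to further source formats, as the abstract notes.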


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Kehua Miao ◽  
Jie Li ◽  
Wenxing Hong ◽  
Mingtao Chen

The booming development of data science and big data technology stacks has inspired continuous iterative updates of data science research and working methods. At present, the division of labor between data science and big data work is increasingly fine-grained. Traditional working methods, from building the work infrastructure environment to data modelling and analysis, greatly reduce work and research efficiency. In this paper, we focus on enabling smooth collaboration within data science teams by building a data science and big data analysis application platform based on a microservices architecture for education and nonprofessional research fields. In this microservices-based environment, which facilitates updating individual components, the platform provides a personal code experiment environment integrating JupyterHub on top of Spark and HDFS for multiuser use, and a visual modelling tool that follows the modular design of data science engineering, based on Greenplum in-database analysis. The entire web service system is developed with Spring Boot.


Entropy ◽  
2019 ◽  
Vol 21 (8) ◽  
pp. 763 ◽  
Author(s):  
Alaa Sagheer ◽  
Mohammed Zidan ◽  
Mohammed M. Abdelsamea

Pattern classification represents a challenging problem in machine learning and data science research domains, especially when there is a limited availability of training samples. In recent years, artificial neural network (ANN) algorithms have demonstrated astonishing performance when compared to traditional generative and discriminative classification algorithms. However, due to the complexity of classical ANN architectures, ANNs are sometimes incapable of providing efficient solutions when addressing complex distribution problems. Motivated by the mathematical definition of a quantum bit (qubit), we propose a novel autonomous perceptron model (APM) that can solve the problem of the architecture complexity of traditional ANNs. APM is a nonlinear classification model that has a simple and fixed architecture inspired by the computational superposition power of the qubit. The proposed perceptron is able to construct the activation operators autonomously after a limited number of iterations. Several experiments using various datasets are conducted, where all the empirical results show the superiority of the proposed model as a classifier in terms of accuracy and computational time when it is compared with baseline classification models.
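The qubit definition that motivates APM can be illustrated numerically: a qubit state |psi> = alpha|0> + beta|1> is any amplitude pair with |alpha|^2 + |beta|^2 = 1, and a classical feature vector can be pushed into that form by amplitude normalization, a common quantum-inspired encoding. This is only the underlying mathematical idea; the actual APM construction is not reproduced here:

```python
import math

def amplitude_encode(features):
    """Normalize a feature vector so its squared entries sum to 1,
    mimicking the normalization constraint on quantum amplitudes."""
    norm = math.sqrt(sum(f * f for f in features))
    return [f / norm for f in features]

# Equal superposition |+> = (|0> + |1>) / sqrt(2)
alpha, beta = amplitude_encode([1.0, 1.0])

# Squared amplitudes behave like measurement probabilities for |0>, |1>.
probs = [alpha ** 2, beta ** 2]
```

The superposition property lets a fixed-size state carry a full probability distribution over outcomes, which is the "computational superposition power" the abstract credits for keeping the model architecture simple and fixed.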

