small data sets Latest Research Papers

AbstractThis chapter deals with the main theoretical fundamentals and practical issues of using functional regression in the context of genomic prediction. We explain how to represent data in functions by means of basis functions and considered two basis functions: Fourier for periodic or near-periodic data and B-splines for nonperiodic data. We derived the functional regression with a smoothed coefficient function under a fixed model framework and some examples are also provided under this model. A Bayesian version of functional regression is outlined and explained and all details for its implementation in glmnet and BGLR are given. The examples take into account in the predictor the main effects of environments and genotypes and the genotype × environment interaction term. The examples are done with small data sets so that the user can run them on his/her own computer and can understand the implementation process.

Download Full-text

Machine Learning-Assisted QSAR Models on Contaminant Reactivity Toward Four Oxidants: Combining Small Data Sets and Knowledge Transfer

Environmental Science & Technology ◽

10.1021/acs.est.1c04883 ◽

2021 ◽

Author(s):

Shifa Zhong ◽

Yanping Zhang ◽

Huichun Zhang

Keyword(s):

Machine Learning ◽

Knowledge Transfer ◽

Small Data ◽

Data Sets ◽

Qsar Models ◽

Small Data Sets

Download Full-text

A Robust Approach to Variation in Carpathian Rusyn: Resampling-Based Methods for Small Data Sets

Journal of Linguistics/Jazykovedný casopis ◽

10.2478/jazcas-2021-0055 ◽

2021 ◽

Vol 72 (2) ◽

pp. 603-617

Author(s):

Moulay Zaidan Lahjouji-Seppälä ◽

Achim Rabus

Keyword(s):

Large Scale ◽

Small Data ◽

Data Sets ◽

Data Set ◽

Large Scale Data ◽

Dependent Variables ◽

Small Data Sets ◽

The Stability ◽

Sociolinguistic Status ◽

Scale Data

Abstract Quantitative, corpus based research on spontaneous spoken Carpathian Rusyn language can cause several data-related problems: Speakers are using ambivalent forms in different quantities, resulting in a biased data set – while a stricter data-cleaning process would lead to a large scale data loss. On top of that, polytomous categorical dependent variables are hard to analyze due to methodological limitations. This paper provides several approaches to face unbalanced and biased data sets containing variation of conjugational forms of the verb maty ‘to have’ and (po-)znaty ‘to know’ in Carpathian Rusyn language. Using resampling based methods like Cross-Validation, Bootstrapping and Random Forests, we provide a strategy for circumventing possible methodological pitfalls and gaining the most information from our precious data, without trying to p-hack the results. Calculating the predictive power of several sociolinguistic factors on linguistic variation, we can make valid statements about the (sociolinguistic) status of Rusyn and the stability of the old dialect continuum of Rusyn varieties.

Download Full-text

A Manifold Learning Perspective on Representation Learning: Learning Decoder and Representations without an Encoder

Entropy ◽

10.3390/e23111403 ◽

2021 ◽

Vol 23 (11) ◽

pp. 1403

Author(s):

Viktoria Schuster ◽

Anders Krogh

Keyword(s):

Manifold Learning ◽

Representation Learning ◽

Natural Image ◽

Small Data ◽

Data Sets ◽

Dimensional Manifold ◽

Dimensional Representation ◽

Input Space ◽

Training Samples ◽

Small Data Sets

Autoencoders are commonly used in representation learning. They consist of an encoder and a decoder, which provide a straightforward method to map n-dimensional data in input space to a lower m-dimensional representation space and back. The decoder itself defines an m-dimensional manifold in input space. Inspired by manifold learning, we showed that the decoder can be trained on its own by learning the representations of the training samples along with the decoder weights using gradient descent. A sum-of-squares loss then corresponds to optimizing the manifold to have the smallest Euclidean distance to the training samples, and similarly for other loss functions. We derived expressions for the number of samples needed to specify the encoder and decoder and showed that the decoder generally requires much fewer training samples to be well-specified compared to the encoder. We discuss the training of autoencoders in this perspective and relate it to previous work in the field that uses noisy training examples and other types of regularization. On the natural image data sets MNIST and CIFAR10, we demonstrated that the decoder is much better suited to learn a low-dimensional representation, especially when trained on small data sets. Using simulated gene regulatory data, we further showed that the decoder alone leads to better generalization and meaningful representations. Our approach of training the decoder alone facilitates representation learning even on small data sets and can lead to improved training of autoencoders. We hope that the simple analyses presented will also contribute to an improved conceptual understanding of representation learning.

Download Full-text

Data analytics accelerates the experimental discovery of new thermoelectric materials with extremely high figure of merit

10.21203/rs.3.rs-926972/v1 ◽

2021 ◽

Author(s):

Sergey Levchenko ◽

Yaqiong Zhong ◽

Xiaojuan Hu ◽

Debalaya Sarker ◽

Qingrui Xia ◽

...

Keyword(s):

Thermoelectric Materials ◽

Figure Of Merit ◽

Large Scale ◽

Chemical Space ◽

Small Data ◽

Data Sets ◽

High Performing ◽

Learning Framework ◽

Small Data Sets ◽

P Type

Abstract Thermoelectric (TE) materials are among very few sustainable yet feasible energy solutions of present time. This huge promise of energy harvesting is contingent on identifying/designing materials having higher efficiency than presently available ones. However, due to the vastness of the chemical space of materials, only its small fraction was scanned experimentally and/or computationally so far. Employing a compressed-sensing based symbolic regression in an active-learning framework, we have not only identified a trend in materials’ compositions for superior TE performance, but have also predicted and experimentally synthesized several extremely high performing novel TE materials. Among these, we found polycrystalline p-type Cu0.45Ag0.55GaTe2 to possess an experimental figure of merit as high as ~2.8 at 827 K. This is a breakthrough in the field, because all previously known thermoelectric materials with a comparable figure of merit are either unstable or much more difficult to synthesize, rendering them unusable in large-scale applications. The presented methodology demonstrates the importance and tremendous potential of physically informed descriptors in material science, in particular for relatively small data sets typically available from experiments at well-controlled conditions.

Download Full-text

The PPLD has advantages over conventional regression methods in application to moderately sized genome-wide association studies

PLoS ONE ◽

10.1371/journal.pone.0257164 ◽

2021 ◽

Vol 16 (9) ◽

pp. e0257164

Author(s):

Veronica J. Vieland ◽

Sang-Cheol Seok

Keyword(s):

Proportional Hazards ◽

Association Studies ◽

Cox Proportional Hazards ◽

Small Data ◽

Data Sets ◽

Genome Wide Association Studies ◽

Genetic Modifiers ◽

Sample Sizes ◽

Regression Methods ◽

Small Data Sets

In earlier work, we have developed and evaluated an alternative approach to the analysis of GWAS data, based on a statistic called the PPLD. More recently, motivated by a GWAS for genetic modifiers of the X-linked Mendelian disorder Duchenne Muscular Dystrophy (DMD), we adapted the PPLD for application to time-to-event (TE) phenotypes. Because DMD itself is relatively rare, this is a setting in which the very large sample sizes generally assembled for GWAS are simply not attainable. For this reason, statistical methods specially adapted for use in small data sets are required. Here we explore the behavior of the TE-PPLD via simulations, comparing the TE-PPLD with Cox Proportional Hazards analysis in the context of small to moderate sample sizes. Our results will help to inform our approach to the DMD study going forward, and they illustrate several respects in which the TE-PPLD, and by extension the original PPLD, offer advantages over regression-based approaches to GWAS in this context.

Download Full-text

Data Sharing in Distributed Architectures – Concept and Implementation in HiGHmed

10.3233/shti210548 ◽

2021 ◽

Author(s):

Reto Wettstein ◽

Hauke Hund ◽

Christian Fegeler ◽

Oliver Heinze

Keyword(s):

Research Question ◽

Routine Data ◽

Third Party ◽

Small Data ◽

Data Sets ◽

Distributed Data ◽

University Hospitals ◽

Single Location ◽

Open Standard ◽

Small Data Sets

Medical routine data has the potential to benefit research. However, transferring this data into a research context is difficult. For this reason Medical Data Integration Centers are being established in German university hospitals to consolidate data from primary information systems in a single location. But, small data-sets from one organization can be insufficient to answer a research question adequately. In order to obtain larger data-sets, attempts to merge and provide data-sets across institutional boundaries are made. Therefore, this paper proposes a possible process that can extract, merge, pseudonymize and provide distributed data-sets from several organizations conforming to privacy regulations. This process is executed according to the open standard BPMN 2.0, the underlying process data model is based on HL7 FHIR R4. The proposed solution is currently being deployed at eight university hospitals and one Trusted Third Party in the HiGHmed consortium.

Download Full-text

Toward Domain Adaptation for small data sets

Internet of Things ◽

10.1016/j.iot.2021.100458 ◽

2021 ◽

pp. 100458

Author(s):

Maryam Alshehhi ◽

Ernesto Damiani ◽

Di Wang

Keyword(s):

Domain Adaptation ◽

Small Data ◽

Data Sets ◽

Small Data Sets

Download Full-text

small data sets
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Casting plate defect detection using motif discovery with minimal model training and small data sets

Toward Orbital-Free Density Functional Theory with Small Data Sets and Deep Learning

Functional Regression

Machine Learning-Assisted QSAR Models on Contaminant Reactivity Toward Four Oxidants: Combining Small Data Sets and Knowledge Transfer

A Robust Approach to Variation in Carpathian Rusyn: Resampling-Based Methods for Small Data Sets

A Manifold Learning Perspective on Representation Learning: Learning Decoder and Representations without an Encoder

Data analytics accelerates the experimental discovery of new thermoelectric materials with extremely high figure of merit

The PPLD has advantages over conventional regression methods in application to moderately sized genome-wide association studies

Data Sharing in Distributed Architectures – Concept and Implementation in HiGHmed

Toward Domain Adaptation for small data sets

Export Citation Format

small data setsRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Casting plate defect detection using motif discovery with minimal model training and small data sets

Toward Orbital-Free Density Functional Theory with Small Data Sets and Deep Learning

Functional Regression

Machine Learning-Assisted QSAR Models on Contaminant Reactivity Toward Four Oxidants: Combining Small Data Sets and Knowledge Transfer

A Robust Approach to Variation in Carpathian Rusyn: Resampling-Based Methods for Small Data Sets

A Manifold Learning Perspective on Representation Learning: Learning Decoder and Representations without an Encoder

Data analytics accelerates the experimental discovery of new thermoelectric materials with extremely high figure of merit

The PPLD has advantages over conventional regression methods in application to moderately sized genome-wide association studies

Data Sharing in Distributed Architectures – Concept and Implementation in HiGHmed

Toward Domain Adaptation for small data sets

small data sets
Recently Published Documents