Statistical learning of small data with domain knowledge --- Sample size- and pre-notch length- dependent strength of concrete

Author(s):  
Jia-Hao Wang ◽  
Jun-Nan Jia ◽  
Sheng Sun ◽  
Tong-Yi Zhang


2020 ◽  
Author(s):  
Bo Hu ◽  
Lin-Feng Yan ◽  
Yang Yang ◽  
Ying-Zhi Sun ◽  
Cui Yue ◽  
...  

Abstract: Background: The diagnosis of prostate transition zone cancer (PTZC) remains a clinical challenge due to its similarity to benign prostatic hyperplasia (BPH) on MRI. Deep convolutional neural networks (DCNNs) have shown high efficacy in medical imaging but are limited by small data sizes. A transfer learning method was combined with deep learning to overcome this challenge. Methods: A retrospective investigation was conducted on 217 patients enrolled from our hospital database (208 patients) and The Cancer Imaging Archive (9 patients). Based on the T2-weighted images (T2WIs) and apparent diffusion coefficient (ADC) maps of these patients, DCNN models were trained and compared across transfer learning source databases (ImageNet vs. disease-related images) and protocols (training from scratch, fine-tuning, or transductive transfer). Results: PTZC and BPH can be classified with a traditional DCNN. The efficacy of transfer learning from ImageNet was limited but improved by transferring knowledge from disease-related images. Furthermore, transductive transfer learning from disease-related images achieved efficacy comparable to the fine-tuning method. Limitations include the retrospective design and the relatively small sample size. Conclusion: For PTZC with a small sample size, accurate diagnosis can be achieved via deep transfer learning from disease-related images.
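A minimal sketch of the fine-tuning protocol described above, assuming a PyTorch/torchvision setup: an ImageNet-pretrained backbone is partially frozen and its classification head replaced for the two-class PTZC-vs-BPH task. The backbone choice, frozen layers, and training step are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical sketch: fine-tune an ImageNet-pretrained backbone on a small,
# two-class medical imaging dataset (e.g., PTZC vs. BPH image patches).
# Grayscale T2WI/ADC slices would need to be replicated to 3 channels first.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze early layers so the small dataset only updates the final block,
# reducing the risk of overfitting.
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True

# Replace the 1000-class ImageNet head with a 2-class head.
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a mini-batch of (images, labels)."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```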


Author(s):  
Guanjie Zheng ◽  
Chang Liu ◽  
Hua Wei ◽  
Porter Jenkins ◽  
Chacha Chen ◽  
...  

Small data has been a barrier for many machine learning tasks, especially in scientific domains. Fortunately, domain knowledge can be utilized to make up for the lack of data. Hence, in this paper, we propose a hybrid model, KRL, that treats the domain-knowledge model as a weak learner and uses a neural network model to boost it. We prove that KRL is guaranteed to improve over both the pure domain-knowledge model and the pure neural network model under certain loss functions. Extensive experiments have shown the superior performance of KRL over baselines. In addition, several case studies illustrate how domain knowledge can assist prediction.
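The abstract does not spell out how KRL couples the two models, but a common way to boost a fixed domain-knowledge model with a neural network is to fit the network to the knowledge model's residuals and sum the two predictions. The sketch below illustrates that generic pattern; the physics_model formula and network architecture are hypothetical stand-ins, not the authors' KRL algorithm.

```python
import numpy as np
import torch
import torch.nn as nn

def physics_model(x: np.ndarray) -> np.ndarray:
    """Hypothetical domain-knowledge predictor (e.g., an analytical formula)."""
    return 2.0 * x[:, 0] - 0.5 * x[:, 1] ** 2

class ResidualNet(nn.Module):
    """Small network that learns what the domain model misses."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def fit_boosted(x: np.ndarray, y: np.ndarray, epochs: int = 200) -> ResidualNet:
    """Fit the residual network so that prediction = physics + network."""
    residual = y - physics_model(x)            # what the weak learner misses
    xt = torch.tensor(x, dtype=torch.float32)
    rt = torch.tensor(residual, dtype=torch.float32)
    net = ResidualNet(x.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(xt), rt)
        loss.backward()
        opt.step()
    return net

def predict(net: ResidualNet, x: np.ndarray) -> np.ndarray:
    """Combined prediction: domain-knowledge term plus learned correction."""
    with torch.no_grad():
        correction = net(torch.tensor(x, dtype=torch.float32)).numpy()
    return physics_model(x) + correction
```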


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Chi Chen ◽  
Shyue Ping Ong

Abstract: Predicting properties from a material's composition or structure is of great interest for materials design. Deep learning has recently garnered considerable interest in materials predictive tasks, achieving low model errors when dealing with large materials data. However, deep learning models suffer in the small data regime that is common in materials science. Here we develop the AtomSets framework, which combines universal compositional and structural descriptors extracted from pre-trained graph network deep learning models with standard multi-layer perceptrons to achieve consistently high model accuracy for both small compositional data (<400) and large structural data (>130,000). The AtomSets models show lower errors than the graph network models at small data limits and than other non-deep-learning models at large data limits. They also transfer better in a simulated materials discovery process where the targeted materials have property values outside the training data limits. The models require minimal domain knowledge inputs and are free from feature engineering. The presented AtomSets model framework can potentially accelerate machine learning-assisted materials design and discovery with less data restriction.
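A rough sketch of the AtomSets-style workflow, under the assumption that fixed-length descriptors are already available from a pretrained graph-network model: the descriptors are kept frozen and only a standard multi-layer perceptron readout is trained on the small target dataset. The stand-in descriptor data below are synthetic placeholders, not part of the published code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def fit_readout(descriptors: np.ndarray, targets: np.ndarray) -> MLPRegressor:
    """Small-data workflow: frozen pretrained descriptors + a standard MLP readout."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        descriptors, targets, test_size=0.2, random_state=0)
    mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
    mlp.fit(X_tr, y_tr)
    print("held-out R^2:", mlp.score(X_te, y_te))
    return mlp

# Stand-in data: in practice `descriptors` would come from an intermediate
# layer of a pretrained graph-network model evaluated on each structure.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(300, 96))     # 300 materials, 96-dim features
targets = descriptors[:, :3].sum(axis=1) + 0.1 * rng.normal(size=300)
model = fit_readout(descriptors, targets)
```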


2018 ◽  
Author(s):  
Shuilian Xie ◽  
Ulisses M. Braga-Neto

Abstract: Motivation: Precision and recall have become very popular classification accuracy metrics in the statistical learning literature. These metrics are ordinarily defined under the assumption that the data are sampled randomly from the mixture of the populations. However, observational case-control studies for biomarker discovery often collect data that are sampled separately from the case and control populations, particularly in the case of rare diseases. This discrepancy may introduce severe bias in classifier accuracy estimation. Results: We demonstrate, using both analytical and numerical methods, that classifier precision estimates can display strong bias under separate sampling, with the bias magnitude depending on the difference between the case prevalences in the data and in the actual population. We show that this bias is systematic in the sense that it cannot be reduced by increasing sample size. If information about the true case prevalence is available from public health records, then a modified precision estimator is proposed that displays smaller bias, which can in fact be reduced to zero as sample size increases under regularity conditions on the classification algorithm. The accuracy of the theoretical analysis and the performance of the proposed precision estimator under separate sampling are investigated using synthetic and real data from observational case-control studies. The results confirm that the proposed precision estimator indeed becomes unbiased as sample size increases, while the ordinary precision estimator may display large bias, particularly in the case of rare diseases. Availability: Extra plots are available as Supplementary Materials. Author summary: Biomedical data are often sampled separately from the case and control populations, particularly in the case of rare diseases. Precision is a popular classification accuracy metric in the statistical learning literature that implicitly assumes the data are sampled randomly from the mixture of the populations. In this paper we study the bias of precision under separate sampling using theoretical and numerical methods. We also propose a precision estimator for separate sampling in the case when the prevalence is known from public health records. The results confirm that the proposed precision estimator becomes unbiased as sample size increases, while the ordinary precision estimator may display large bias, particularly in the case of rare diseases. In the absence of any knowledge about disease prevalence, precision estimates should be avoided under separate sampling.
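The modified estimator itself is not reproduced in the abstract, but the textbook correction for separate sampling recomputes precision from sensitivity, specificity, and a known population prevalence via Bayes' rule, PPV = sens·π / (sens·π + (1 − spec)·(1 − π)). A minimal sketch of that correction, assuming binary label arrays and an externally supplied prevalence:

```python
import numpy as np

def prevalence_adjusted_precision(y_true: np.ndarray,
                                  y_pred: np.ndarray,
                                  prevalence: float) -> float:
    """Estimate precision (PPV) when cases and controls were sampled separately,
    using a known population prevalence.

    Sensitivity and specificity remain estimable under separate sampling;
    precision is then recovered with Bayes' rule instead of the (biased)
    naive ratio TP / (TP + FP).
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    sensitivity = np.mean(y_pred[y_true])        # P(pred = 1 | case)
    specificity = np.mean(~y_pred[~y_true])      # P(pred = 0 | control)
    num = sensitivity * prevalence
    den = num + (1.0 - specificity) * (1.0 - prevalence)
    return num / den

# Example: a rare disease (prevalence 1%) studied with equal numbers of
# separately sampled cases and controls.
rng = np.random.default_rng(0)
y_true = np.r_[np.ones(500, bool), np.zeros(500, bool)]
y_pred = np.r_[rng.random(500) < 0.9, rng.random(500) < 0.1]  # ~90% sens, ~90% spec
print(prevalence_adjusted_precision(y_true, y_pred, prevalence=0.01))
# ~0.08: even a good classifier has low PPV at 1% prevalence, whereas the
# naive precision computed on the balanced sample would be ~0.9.
```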


2015 ◽  
Vol 14s5 ◽  
pp. CIN.S30804 ◽  
Author(s):  
Amin Zollanvari

High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of classical methods does not live up to their asymptotic premises, in which the sample size grows without bound for a fixed dimensionality of observations. Much work has been done in developing mathematical-statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machinery in the field or are unwilling to try it. The primary goal of this work is to bring together the various machinery of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject.


Author(s):  
Šinkovec ◽  
Geroldinger ◽  
Heinze

The parameters of logistic regression models are usually obtained by the method of maximum likelihood (ML). However, in analyses of small data sets, or of data sets with unbalanced outcomes or exposures, ML parameter estimates may not exist. This situation has been termed 'separation', as the two outcome groups are separated by the values of a covariate or a linear combination of covariates. To overcome the problem of non-existing ML parameter estimates, applying Firth's correction (FC) was proposed. In practice, however, a principal investigator might instead be advised to 'bring more data' in order to resolve a separation issue. We illustrate the problem by means of examples from colorectal cancer screening and ornithology. It is unclear whether such an increasing sample size (ISS) strategy, which keeps sampling new observations until separation is removed, improves estimation compared to applying FC to the original data set. We performed an extensive simulation study whose main focus was to estimate the cost-adjusted relative efficiency of ML combined with ISS compared to FC. FC yielded reasonably small root mean squared errors and proved to be the more efficient estimator. Given our findings, we propose not to adapt the sample size when separation is encountered but to use FC as the default method of analysis whenever the number of observations or outcome events is critically low.
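Firth's correction penalizes the logistic log-likelihood with the Jeffreys prior, which keeps the estimates finite even under complete separation. Below is a bare-bones sketch of the penalized Newton-type fit in Python, with a made-up separated toy dataset; production analyses would normally rely on an established implementation such as R's logistf rather than this illustration.

```python
import numpy as np

def firth_logistic(X: np.ndarray, y: np.ndarray,
                   max_iter: int = 100, tol: float = 1e-8) -> np.ndarray:
    """Minimal Firth-penalized logistic regression (Jeffreys-prior penalty).

    Returns coefficient estimates that stay finite even when the data are
    separated. X should already contain an intercept column.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))            # fitted probabilities
        W = mu * (1.0 - mu)                        # IRLS weights
        XtWX = X.T @ (X * W[:, None])              # Fisher information
        XtWX_inv = np.linalg.inv(XtWX)
        # Diagonal of the hat matrix H = W^(1/2) X (X'WX)^-1 X' W^(1/2)
        h = np.einsum("ij,jk,ik->i", X * W[:, None], XtWX_inv, X)
        # Firth-modified score equation
        score = X.T @ (y - mu + h * (0.5 - mu))
        step = XtWX_inv @ score
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy example with complete separation: plain ML estimates would diverge,
# but the Firth estimates remain finite.
X = np.column_stack([np.ones(8),
                     np.array([-3., -2., -1., -0.5, 0.5, 1., 2., 3.])])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
print(firth_logistic(X, y))
```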


2018 ◽  
Author(s):  
Rhett N. D’souza ◽  
Po-Yao Huang ◽  
Fang-Cheng Yeh

Abstract: Deep neural networks have gained immense popularity on Big Data problems; however, the availability of training samples can be relatively limited in certain application domains, particularly medical imaging, which consequently leads to overfitting. This "Small Data" challenge may require a mindset that is entirely different from the existing Big Data paradigm. Here, under the small data setting, we examined whether the network structure has a substantial influence on performance and whether the optimal structure is predominantly determined by sample size or by the nature of the data. To this end, we listed all possible combinations of layers given an upper bound on the VC dimension to study how structural hyperparameters affect performance. Our results showed that structural optimization improved accuracy by 27.99%, 16.44%, and 13.11% over random selection for sample sizes of 100, 500, and 1,000 on the MNIST dataset, respectively, suggesting that the importance of the network structure increases as the sample size becomes smaller. Furthermore, the optimal network structure was mostly determined by the nature of the data (photographic, calligraphic, or medical images) and less affected by the sample size, suggesting that the optimal network structure is data-driven, not sample-size driven. After network structure optimization, a conventional convolutional neural network could achieve 91.13% accuracy with only 500 samples and 93.66% accuracy with only 1,000 samples on the MNIST dataset, and 94.10% accuracy with only 3,300 samples on the Mitosis (microscopic) dataset. These results indicate the primary importance of the network structure and the nature of the data in facing the Small Data challenge.
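A sketch of the kind of structural search the abstract describes: enumerate a small grid of layer configurations, train each candidate on the limited sample, and keep the structure with the best validation accuracy. The search space, image size, and helper callbacks below are hypothetical and far coarser than the VC-dimension-bounded enumeration used in the paper.

```python
import itertools
import torch.nn as nn

def make_cnn(n_conv: int, n_filters: int, n_dense: int) -> nn.Module:
    """Build a small CNN for 28x28 grayscale images (e.g., MNIST) from three
    structural hyperparameters. The search space is a hypothetical example."""
    layers, in_ch, size = [], 1, 28
    for _ in range(n_conv):
        layers += [nn.Conv2d(in_ch, n_filters, 3, padding=1), nn.ReLU(),
                   nn.MaxPool2d(2)]
        in_ch, size = n_filters, size // 2
    layers.append(nn.Flatten())
    width = in_ch * size * size
    for _ in range(n_dense):
        layers += [nn.Linear(width, 64), nn.ReLU()]
        width = 64
    layers.append(nn.Linear(width, 10))
    return nn.Sequential(*layers)

def best_structure(train_model, validate_model):
    """Try each candidate structure on the small training set (via the
    caller-supplied train_model / validate_model callbacks) and keep the
    configuration with the highest validation accuracy."""
    best, best_acc = None, -1.0
    for n_conv, n_filters, n_dense in itertools.product([1, 2], [8, 16, 32], [0, 1, 2]):
        model = make_cnn(n_conv, n_filters, n_dense)
        train_model(model)
        acc = validate_model(model)
        if acc > best_acc:
            best, best_acc = (n_conv, n_filters, n_dense), acc
    return best, best_acc
```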


2020 ◽  
Vol 4 (1) ◽  
pp. 1-12
Author(s):  
Nderui Ndung’u ◽  
Dr. Susan Were ◽  
Dr. Patrick Mwangangi

Purpose: This research focused on the influence of top management support on the level of procurement regulatory compliance in public universities in Kenya. Methodology: The study used an ex-post facto design; this design was chosen because the study aimed at investigating causal relationships between variables that cannot be controlled by the researcher. The study was informed by the Principal-Agent Theory. The researcher applied a census approach, covering 31 public universities and 333 respondents, because the target population was small. Primary data were collected through questionnaires issued to the procurement staff in the sampled universities. Data collected from the field were coded, cleaned to remove outliers and missing values, and categorized manually according to the questionnaire items using frequency distribution tables and percentages. The researcher used both descriptive and inferential statistics, with the help of the Statistical Package for the Social Sciences (SPSS) version 24, to analyze the data. Findings: Top management support was found to make a significant contribution to the level of procurement regulatory compliance. The p-value of 0.000 was less than the conventional threshold of 0.05 when testing at the 95% level of significance. Motivation and rewards for staff can be used as a strategy to counter personal interests that arise in preference to fulfilling their duties and assignments. Management acts as the framework for the functionality of the procurement department and its activities. Efforts by top management to motivate staff individually lead to exemplary performance. Unique contribution to theory, policy and practice: Deficient monitoring and assessment of an organization's functioning is connected to the absence of a controlled environment. Existing hindrances from top management that prevent staff from engaging in and enhancing the adoption and application of total quality management in procurement should be removed. Disciplinary steps should be taken against staff with unbecoming behavior in an effort to uphold ethical practice. Staff should have the freedom to discharge the duties assigned to them by top management.

