The relationship between journal citation impact and citation sentiment: A study of 32 million citances in PubMed Central

2020 ◽  
pp. 1-11
Author(s):  
Erjia Yan ◽  
Zheng Chen ◽  
Kai Li

Citation sentiment plays an important role in citation analysis and scholarly communication research, but prior citation sentiment studies have used small data sets and relied largely on manual annotation. This paper uses a large data set of PubMed Central (PMC) full-text publications and analyzes citation sentiment in more than 32 million citances within PMC, revealing citation sentiment patterns at the journal and discipline levels. This paper finds a weak relationship between a journal’s citation impact (as measured by CiteScore) and the average sentiment score of citances to its publications. When journals are aggregated into quartiles based on citation impact, we find that journals in higher quartiles are cited more favorably than those in lower quartiles. Further, social science journals are found to be cited with the highest sentiment, followed by engineering and natural science journals and then biomedical journals. This result may be attributed to disciplinary discourse patterns in which social science researchers tend to use more subjective terms to describe others’ work than do natural science or biomedical researchers.
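As an illustration of the journal-level aggregation this abstract describes, the sketch below computes per-journal mean citance sentiment, correlates it with CiteScore, and compares CiteScore quartiles. The file names and column names (citances.csv, citescore.csv, journal, sentiment, citescore) are assumptions for the example, not the authors' actual data schema or pipeline.

```python
# Hypothetical sketch of journal-level sentiment aggregation; input schema is assumed.
import pandas as pd

citances = pd.read_csv("citances.csv")    # one row per citance: journal, sentiment score
journals = pd.read_csv("citescore.csv")   # one row per journal: journal, citescore

per_journal = (citances.groupby("journal")["sentiment"]
               .mean()
               .rename("mean_sentiment")
               .reset_index()
               .merge(journals, on="journal"))

# Journal-level association between citation impact and average citance sentiment
print(per_journal[["citescore", "mean_sentiment"]].corr(method="spearman"))

# Aggregate journals into CiteScore quartiles (Q1 = highest impact) and compare sentiment
per_journal["quartile"] = pd.qcut(per_journal["citescore"], 4,
                                  labels=["Q4", "Q3", "Q2", "Q1"])
print(per_journal.groupby("quartile")["mean_sentiment"].mean())
```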

2008 ◽  
Vol 130 (2) ◽  
Author(s):  
Stuart Holdsworth

The European Creep Collaborative Committee (ECCC) approach to creep data assessment has now been established for almost ten years. The methodology covers the analysis of rupture strength and ductility, creep strain, and stress relaxation data, for a range of material conditions. This paper reviews the concepts and procedures involved. The original approach was devised to determine data sheets for use by committees responsible for the preparation of National and International Design and Product Standards, and the methods developed for data quality evaluation and data analysis were therefore intentionally rigorous. The focus was clearly on the determination of long-time property values from the largest possible data sets involving a significant number of observations in the mechanism regime for which predictions were required. More recently, the emphasis has changed. There is now an increasing requirement for full property descriptions from very short times to very long and hence the need for much more flexible model representations than were previously required. There continues to be a requirement for reliable long-time predictions from relatively small data sets comprising relatively short duration tests, in particular, to exploit new alloy developments at the earliest practical opportunity. In such circumstances, it is not feasible to apply the same degree of rigor adopted for large data set assessment. Current developments are reviewed.
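The ECCC procedures themselves are not reproduced here, but the kind of long-time extrapolation the abstract refers to can be illustrated with a generic time-temperature parameter fit. The sketch below uses the Larson-Miller parameter with an assumed constant C = 20 and invented data; it is only a stand-in for, not a description of, the rigorous ECCC assessment methodology.

```python
# Generic Larson-Miller extrapolation sketch; constants and data are invented,
# and this is NOT the ECCC assessment procedure.
import numpy as np

T = np.array([823.0, 848.0, 873.0, 898.0])          # test temperature, K
t_r = np.array([30000.0, 12000.0, 4000.0, 1200.0])  # rupture time, h
stress = np.array([120.0, 110.0, 100.0, 90.0])      # applied stress, MPa

C = 20.0                                             # assumed Larson-Miller constant
lmp = T * (C + np.log10(t_r))                        # Larson-Miller parameter

# Fit log(stress) as a quadratic in LMP (a typical master-curve form)
coef = np.polyfit(lmp, np.log10(stress), 2)

# Predict the stress corresponding to a 100,000 h rupture life at 873 K
lmp_target = 873.0 * (C + np.log10(1.0e5))
print(10 ** np.polyval(coef, lmp_target))
```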


Author(s):  
Jianping Ju ◽  
Hong Zheng ◽  
Xiaohang Xu ◽  
Zhongyuan Guo ◽  
Zhaohui Zheng ◽  
...  

Although convolutional neural networks have achieved success in the field of image classification, there are still challenges in agricultural product quality sorting, such as machine vision-based detection of jujube defects. The performance of jujube defect detection depends mainly on the feature extraction and the classifier used. Because of the diversity of jujube materials and the variability of the testing environment, traditional manually extracted features often fail to meet the requirements of practical application. In this paper, a jujube sorting model for small data sets, based on a convolutional neural network and transfer learning, is proposed to meet the practical demands of jujube defect detection. First, the original images collected from an actual jujube sorting production line were pre-processed and augmented to establish a data set of five categories of jujube defects. The original CNN model was then improved by embedding an SE module and by replacing the softmax loss function with the triplet loss function and the center loss function. Finally, a model pre-trained on the ImageNet data set was trained on the jujube defects data set, so that the pre-trained parameters could adapt to the distribution of the jujube defect images, completing the transfer of the model and enabling detection and classification of jujube defects. Classification accuracy and confusion matrices were analyzed against comparison models, and the classification results were visualized with heatmaps. The experimental results show that the SE-ResNet50-CL model optimizes the fine-grained classification problem of jujube defect recognition, reaching a test accuracy of 94.15%. The model shows good stability and high recognition accuracy in complex environments.
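A minimal sketch of the transfer-learning setup described above, assuming PyTorch and torchvision (0.13 or later for the weights API). The SE block is shown standalone rather than embedded inside every residual bottleneck, and the center-loss term, data pipeline, and training loop are omitted; this is not the authors' SE-ResNet50-CL implementation.

```python
# Hedged sketch: ImageNet-pretrained ResNet-50 backbone, an SE block, and a
# triplet-loss term, standing in for (not reproducing) the SE-ResNet50-CL model.
import torch
import torch.nn as nn
from torchvision import models

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels by a learned gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # squeeze, then excite
        return x * w

# Start from an ImageNet-pretrained ResNet-50 and add a 5-class defect head
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
embed_dim, num_classes = backbone.fc.in_features, 5
backbone.fc = nn.Identity()                  # expose 2048-d embeddings

head = nn.Linear(embed_dim, num_classes)
triplet = nn.TripletMarginLoss(margin=0.3)   # metric-learning term; center loss omitted

x = torch.randn(8, 3, 224, 224)              # dummy batch standing in for jujube images
emb = backbone(x)
logits = head(emb)
```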


Author(s):  
Jungeui Hong ◽  
Elizabeth A. Cudney ◽  
Genichi Taguchi ◽  
Rajesh Jugulum ◽  
Kioumars Paryani ◽  
...  

The Mahalanobis-Taguchi System is a diagnostic and predictive method for analyzing patterns in multivariate cases. The goal of this study is to compare the ability of the Mahalanobis-Taguchi System and a neural network to discriminate using small data sets. We examine the discriminant ability as a function of data set size using an application area where reliable data are publicly available. The study uses the Wisconsin Breast Cancer data set, which has nine attributes and one class variable.
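For orientation, the sketch below computes the scaled Mahalanobis distance that underlies the Mahalanobis-Taguchi System, using a placeholder "normal" group to define the Mahalanobis space. The orthogonal-array and signal-to-noise variable screening of the full MTS, and the neural-network comparison, are not shown, and the data are synthetic rather than the Wisconsin Breast Cancer set.

```python
# Illustrative Mahalanobis-distance core of MTS; data are synthetic placeholders.
import numpy as np

def mahalanobis_distance(X, mean, cov_inv):
    d = X - mean
    # Scaled squared distance, MD = (x - mu)^T S^-1 (x - mu) / k
    return np.einsum("ij,jk,ik->i", d, cov_inv, d) / X.shape[1]

# "Normal" (e.g., benign) observations define the Mahalanobis space
X_normal = np.random.normal(size=(50, 9))          # placeholder for 9 attributes
mean = X_normal.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X_normal, rowvar=False))

# Distances of new cases; a large MD suggests an abnormal pattern
X_new = np.random.normal(loc=1.0, size=(10, 9))
print(mahalanobis_distance(X_new, mean, cov_inv))
```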


Author(s):  
Kim Wallin

The standard Master Curve (MC) deals only with materials assumed to be homogeneous, but MC analysis methods for inhomogeneous materials have also been developed. The bi-modal and multi-modal analysis methods, in particular, are becoming increasingly standard. Their drawback is that these methods are generally reliable only with sufficiently large data sets (number of valid tests, r ≥ 15–20). Here, the possibility of using the multi-modal analysis method with smaller data sets is assessed, and a new procedure to conservatively account for possible inhomogeneities is proposed.
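As context, the sketch below shows a basic single-mode Master Curve maximum-likelihood fit of the reference temperature T0, using the standard three-parameter Weibull form with shape 4 and a 20 MPa√m threshold. The bi-modal and multi-modal likelihoods and the censoring and validity rules of ASTM E1921 are omitted, and the toughness data are invented.

```python
# Hedged single-mode Master Curve MLE sketch; not the multi-modal procedure
# discussed in the paper, and the data are invented.
import numpy as np
from scipy.optimize import minimize_scalar

T = np.array([-90.0, -70.0, -70.0, -50.0, -50.0, -30.0])   # test temperature, C
KJc = np.array([45.0, 60.0, 95.0, 110.0, 150.0, 180.0])    # fracture toughness, MPa*sqrt(m)

def neg_log_likelihood(T0):
    K0 = 31.0 + 77.0 * np.exp(0.019 * (T - T0))   # Weibull scale vs temperature
    z = (KJc - 20.0) / (K0 - 20.0)                # three-parameter Weibull, shape 4
    log_pdf = np.log(4.0) - np.log(K0 - 20.0) + 3.0 * np.log(z) - z ** 4
    return -np.sum(log_pdf)

res = minimize_scalar(neg_log_likelihood, bounds=(-150.0, 50.0), method="bounded")
print("estimated T0 [C]:", res.x)
```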


2019 ◽  
Vol 65 (8) ◽  
pp. 995-1005 ◽  
Author(s):  
Thomas Røraas ◽  
Sverre Sandberg ◽  
Aasne K Aarsand ◽  
Bård Støve

BACKGROUND: Biological variation (BV) data have many applications for diagnosing and monitoring disease. The standard statistical approaches for estimating BV are sensitive to “noisy data” and assume homogeneity of within-participant CV. Prior knowledge about BV is mostly ignored. The aims of this study were to develop Bayesian models to calculate BV that (a) are robust to “noisy data,” (b) allow heterogeneity in the within-participant CVs, and (c) take advantage of prior knowledge.

METHOD: We explored Bayesian models with different degrees of robustness, using adaptive Student t distributions instead of normal distributions and allowing for heterogeneity of the within-participant CV. Results were compared to more standard approaches using chloride and triglyceride data from the European Biological Variation Study.

RESULTS: Using the most robust Bayesian approach on a raw data set gave results comparable to a standard approach with outlier assessments and removal. The posterior distribution of the fitted model gives access to credible intervals for all parameters that can be used to assess reliability. Reliable and relevant priors proved valuable for prediction.

CONCLUSIONS: The recommended Bayesian approach gives a clear picture of the degree of heterogeneity, and the ability to crudely estimate personal within-participant CVs can be used to explore relevant subgroups. Because BV experiments are expensive and time-consuming, prior knowledge and estimates should be considered of high value and applied accordingly. By including reliable prior knowledge, precise estimates are possible even with small data sets.
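A minimal hierarchical sketch in the spirit of the models described, assuming PyMC; the priors, parameterization, and data below are illustrative placeholders rather than the authors' specification.

```python
# Hedged sketch: Student-t likelihood and per-participant scales give robustness
# to "noisy data" and allow heterogeneous within-participant variation.
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
y = rng.normal(100.0, 5.0, size=(20, 8))          # placeholder: 20 participants x 8 replicates

with pm.Model() as model:
    mu_pop = pm.Normal("mu_pop", 100.0, 20.0)      # population mean (weak prior)
    sigma_between = pm.HalfNormal("sigma_between", 10.0)
    mu_i = pm.Normal("mu_i", mu_pop, sigma_between, shape=y.shape[0])

    # Within-participant SDs drawn from a common distribution -> heterogeneity
    sigma_within_mu = pm.HalfNormal("sigma_within_mu", 10.0)
    sigma_i = pm.HalfNormal("sigma_i", sigma_within_mu, shape=y.shape[0])

    nu = pm.Exponential("nu_raw", 1 / 30) + 1      # adaptive tail heaviness
    pm.StudentT("obs", nu=nu, mu=mu_i[:, None], sigma=sigma_i[:, None], observed=y)

    idata = pm.sample(1000, tune=1000, target_accept=0.9)
```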


Author(s):  
Gary Smith ◽  
Jay Cordes

Patterns are inevitable and we should not be surprised by them. Streaks, clusters, and correlations are the norm, not the exception. In a large number of coin flips, there are likely to be coincidental clusters of heads and tails. In nationwide data on cancer, crime, or test scores, there are likely to be flukey clusters. When the data are separated into smaller geographic units like cities, the most extreme results are likely to be found in the smallest cities. In athletic competitions between well-matched teams, the outcome of a small number of games is almost meaningless. Our challenge is to overcome our inherited inclination to think that all patterns are meaningful; for example, thinking that clustering in large data sets or differences among small data sets must be something real that needs to be explained. Often, it is just meaningless happenstance.
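The small-unit effect mentioned above is easy to reproduce by simulation: with an identical true rate everywhere, the most extreme observed rates still show up in the smallest populations, purely by chance. The numbers below are arbitrary.

```python
# Simulation of the small-town effect: equal true rates, extreme observed rates
# concentrated in the smallest units.
import numpy as np

rng = np.random.default_rng(0)
populations = np.array([500, 5_000, 50_000, 500_000])
true_rate = 0.01

samples = rng.binomial(populations, true_rate, size=(10_000, len(populations)))
rates = samples / populations

for pop, r in zip(populations, rates.T):
    print(f"pop {pop:>7}: min rate {r.min():.4f}, max rate {r.max():.4f}")
```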


1981 ◽  
Vol 35 (1) ◽  
pp. 35-42 ◽  
Author(s):  
J. D. Algeo ◽  
M. B. Denton

A numerical method for evaluating the inverted Abel integral employing cubic spline approximations is described along with a modification of the procedure of Cremers and Birkebak, and an extension of the Barr method. The accuracy of the computations is evaluated at several noise levels and with varying resolution of the input data. The cubic spline method is found to be useful only at very low noise levels, but capable of providing good results with small data sets. The Barr method is computationally the simplest, and is adequate when large data sets are available. For noisy data, the method of Cremers and Birkebak gave the best results.
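The spline-based approach can be sketched as follows: fit a cubic spline to the projected profile F(y), differentiate it, and evaluate the inverse Abel integral f(r) = -(1/π) ∫ F′(y)/√(y² − r²) dy from r to R numerically. This is an illustrative reconstruction under those assumptions, not the authors' algorithm or the Cremers-Birkebak or Barr procedures, and the profile data are synthetic.

```python
# Hedged sketch of a spline-based inverse Abel transform with synthetic data.
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.integrate import quad

# Synthetic projected profile F(y) on a discrete grid (placeholder values)
y_grid = np.linspace(0.0, 1.0, 21)
F = np.exp(-4.0 * y_grid**2) - np.exp(-4.0)       # smooth, vanishes at the edge

spline = CubicSpline(y_grid, F)
dF = spline.derivative()
R = y_grid[-1]

def abel_invert(r):
    # Split off the (y - r)^(-1/2) endpoint singularity via quad's 'alg' weight
    g = lambda yy: dF(yy) / np.sqrt(yy + r)
    val, _ = quad(g, r, R, weight="alg", wvar=(-0.5, 0.0))
    return -val / np.pi

print([round(abel_invert(r), 4) for r in (0.0, 0.25, 0.5)])
```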


2008 ◽  
Vol 08 (04) ◽  
pp. 495-512 ◽  
Author(s):  
PIETRO COLI ◽  
GIAN LUCA MARCIALIS ◽  
FABIO ROLI

The automatic vitality detection of a fingerprint has become an important issue in personal verification systems based on this biometric. It has been shown that fake fingerprints made using materials like gelatine or silicone can deceive commonly used sensors. Recently, the extraction of vitality features from fingerprint images has been proposed to address this problem. Static and dynamic features, among others, have so far been studied separately, so their respective merits are not yet clear, especially because reported results were often obtained with different sensors and small data sets, which could have obscured relative merits because of potential small-sample-size issues. In this paper, we compare several static and dynamic features by experiments on a larger data set, using the same optical sensor for the extraction of both feature sets. We dealt with fingerprint stamps made using liquid silicone rubber. Reported results show the relative merits of static and dynamic features and the performance improvement achievable by using such features together.


2008 ◽  
Vol 11 (03) ◽  
pp. 598-605 ◽  
Author(s):  
Danial Kaviani ◽  
Thang Bui ◽  
Jerry L. Jensen ◽  
Catherine Hanks

Summary: Artificial neural networks (ANNs) have been used widely for prediction and classification problems. In particular, many methods for building ANNs have appeared in the last two decades. One of the continuing important limitations of ANNs, however, is their poor ability to analyze small data sets because of overfitting. Several methods have been proposed in the literature to overcome this problem. On the basis of our study, we can conclude that ANNs that use radial basis functions (RBFs) can decrease the prediction error effectively when there is an underlying relationship between the variables. We have applied this and other methods to determine the factors controlling and related to fracture spacing in the Lisburne formation, northeastern Alaska. By comparing the RBF results with those from other ANN methods, we find that the former method gives a substantially smaller error than many of the alternatives. For example, the errors in predicted fracture spacing for the Lisburne formation with conventional ANN methods are approximately 50 to 200% larger than those obtained with RBFs. With a method that predicts fracture spacing more accurately, we were able to identify more reliably the effects on the spacing of such factors as bed thickness, lithology, structural position, and degree of folding. By comparing the performance of all the methods we tested, we observed that some methods that performed well in one test did not necessarily do as well in another. This suggests that, while RBF can be expected to be among the best methods, there is no "best universal method" for all cases, and testing different methods for each case is required. Nonetheless, through this study, we were able to identify several candidate methods and, thereby, narrow the work required to find a suitable ANN. In petroleum engineering and geosciences, the amount of data is limited in many cases because of expense or logistical limitations (e.g., limited core, poor borehole conditions, or restricted logging suites). Thus, the methods used in this study should be attractive in many petroleum-engineering contexts in which complex, nonlinear relationships need to be modeled by use of small data sets.

Introduction: An ANN is "an information-processing system that has certain performance characteristics in common with biological neural networks" (Fausett 1994). On the basis of the "universal approximation theorem," multilayer neural networks (Fig. 1) with a sufficient number of hidden nodes are able to approximate any unknown function (Haykin 1999). ANNs are widely used in prediction and classification problems and have numerous applications in geosciences and petroleum engineering, including permeability prediction (Aminian et al. 2003), fluid-properties prediction (Sultan and Al-Kaabi 2002), and well-test-data analysis (Osman and Al-Marhoun 2005). Given a basic network structure, there is a wide variety of ANNs that can be produced. For example, different methods or criteria used to train the network produce ANNs that provide different predictions (e.g., the early-stopping and weight-decay methods). Also, two or more neural networks can be combined to produce an ANN with better error performance or other qualities, giving the so-called "ensemble learning methods," a term that covers a large variety of approaches, including stacked generalization and ensemble averaging. An additional problem is introduced when the data sets are small.
This is a common situation in petroleum-engineering and geosciences applications, in which the cost of data or collection logistics may limit the number of measurements. In such instances, the use of ANNs can result in overfitting, where the model is fitted to the training data points but performs poorly for prediction of other points (Fig. 2). In this study, we try to identify—among myriad possibilities—a few ANNs that provide good error performance with limited sample numbers. After a brief review of various types of ANNs, we use a synthetic data set to discuss, apply, and compare the methods that have been proposed in the literature to overcome the small-data-sets problem. Finally, we apply these methods to an actual data set—fracture-spacing data from the Lisburne Group, northeastern Alaska—and evaluate the results.
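To make the RBF idea concrete, the following is a hedged sketch of a small radial-basis-function network: centers chosen by k-means, Gaussian basis functions, and output weights obtained by ridge-regularized least squares. It is not the specific network configuration or data used in the study; the inputs merely stand in for predictors such as bed thickness or structural position.

```python
# Hedged RBF-network sketch with synthetic data; regularization helps limit
# overfitting when samples are few.
import numpy as np
from sklearn.cluster import KMeans

def rbf_features(X, centers, width):
    return np.exp(-np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2
                  / (2.0 * width ** 2))

def fit_rbf(X, y, n_centers=10, width=1.0, ridge=1e-3):
    centers = KMeans(n_clusters=n_centers, n_init=10, random_state=0).fit(X).cluster_centers_
    Phi = rbf_features(X, centers, width)
    # Regularized least squares for the output weights
    w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(n_centers), Phi.T @ y)
    return centers, w

def predict_rbf(X, centers, w, width=1.0):
    return rbf_features(X, centers, width) @ w

# Tiny synthetic example with a small training set
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 3))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=30)
centers, w = fit_rbf(X, y)
print(predict_rbf(X[:5], centers, w))
```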


Author(s):  
D T Pham ◽  
A A Afify

Rule induction as a method for constructing classifiers is particularly attractive in data mining applications, where the comprehensibility of the generated models is very important. Most existing techniques were designed for small data sets and thus are not practical for direct use on very large data sets because of their computational inefficiency. Scaling up rule induction methods to handle such data sets is a formidable challenge. This article presents a new algorithm for rule induction that can efficiently extract accurate and comprehensible models from large and noisy data sets. The algorithm has been tested on several complex data sets, and the results show that it scales up well and is an extremely effective learner.
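As a toy illustration of what a rule inducer produces, the sketch below performs greedy sequential covering with single attribute-value tests for the positive class. Scalable algorithms such as the one described in the article use far more sophisticated search, pruning, and data handling; the example data here are invented.

```python
# Toy sequential-covering rule inducer; not the article's algorithm.
import numpy as np

def induce_rules(X, y, min_precision=0.8):
    """Greedily learn rules of the form IF attr_j == v THEN positive."""
    rules = []
    remaining = np.ones(len(y), dtype=bool)
    while (y[remaining] == 1).any():
        best = None
        for j in range(X.shape[1]):
            for v in np.unique(X[remaining, j]):
                mask = remaining & (X[:, j] == v)
                precision = y[mask].mean()        # fraction of covered examples that are positive
                if best is None or precision > best[0]:
                    best = (precision, j, v, mask)
        precision, j, v, mask = best
        if precision < min_precision:
            break
        rules.append((j, v))
        remaining &= ~mask                        # remove covered examples and repeat
    return rules

X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1]])
y = np.array([1, 0, 1, 0, 1])
print(induce_rules(X, y))   # [(1, 1)] means: IF attribute 1 == 1 THEN positive
```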

