scholarly journals The impact of imbalanced training data on machine learning for author name disambiguation

2018 ◽  
Vol 117 (1) ◽  
pp. 511-526 ◽  
Author(s):  
Jinseok Kim ◽  
Jenna Kim
2021 ◽  
pp. 016555152110181
Author(s):  
Jinseok Kim ◽  
Jenna Kim ◽  
Jinmo Kim

Chinese author names are known to be more difficult to disambiguate than other ethnic names because they tend to share surnames and forenames, thus creating many homonyms. In this study, we demonstrate how using Chinese characters can affect machine learning for author name disambiguation. For analysis, 15K author names recorded in Chinese are transliterated into English and simplified by initialising their forenames to create counterfactual scenarios, reflecting real-world indexing practices in which Chinese characters are usually unavailable. The results show that Chinese author names that are highly ambiguous in English or with initialised forenames tend to become less confusing if their Chinese characters are included in the processing. Our findings indicate that recording Chinese author names in native script can help researchers and digital libraries enhance authority control of Chinese author names that continue to increase in size in bibliographic data.


2020 ◽  
Vol 10 (18) ◽  
pp. 6619
Author(s):  
Po-Jiun Wen ◽  
Chihpin Huang

The noise prediction using machine learning is a special study that has recently received increased attention. This is particularly true in workplaces with noise pollution, which increases noise exposure for general laborers. This study attempts to analyze the noise equivalent level (Leq) at the National Synchrotron Radiation Research Center (NSRRC) facility and establish a machine learning model for noise prediction. This study utilized the gradient boosting model (GBM) as the learning model in which past noise measurement records and many other features are integrated as the proposed model makes a prediction. This study analyzed the time duration and frequency of the collected Leq and also investigated the impact of training data selection. The results presented in this paper indicate that the proposed prediction model works well in almost noise sensors and frequencies. Moreover, the model performed especially well in sensor 8 (125 Hz), which was determined to be a serious noise zone in the past noise measurements. The results also show that the root-mean-square-error (RMSE) of the predicted harmful noise was less than 1 dBA and the coefficient of determination (R2) value was greater than 0.7. That is, the working field showed a favorable noise prediction performance using the proposed method. This positive result shows the ability of the proposed approach in noise prediction, thus providing a notification to the laborer to prevent long-term exposure. In addition, the proposed model accurately predicts noise future pollution, which is essential for laborers in high-noise environments. This would keep employees healthy in avoiding noise harmful positions to prevent people from working in that environment.


Energies ◽  
2020 ◽  
Vol 13 (17) ◽  
pp. 4300
Author(s):  
Kosuke Sasakura ◽  
Takeshi Aoki ◽  
Masayoshi Komatsu ◽  
Takeshi Watanabe

Data centers (DCs) are becoming increasingly important in recent years, and highly efficient and reliable operation and management of DCs is now required. The generated heat density of the rack and information and communication technology (ICT) equipment is predicted to get higher in the future, so it is crucial to maintain the appropriate temperature environment in the server room where high heat is generated in order to ensure continuous service. It is especially important to predict changes of rack intake temperature in the server room when the computer room air conditioner (CRAC) is shut down, which can cause a rapid rise in temperature. However, it is quite difficult to predict the rack temperature accurately, which in turn makes it difficult to determine the impact on service in advance. In this research, we propose a model that predicts the rack intake temperature after the CRAC is shut down. Specifically, we use machine learning to construct a gradient boosting decision tree model with data from the CRAC, ICT equipment, and rack intake temperature. Experimental results demonstrate that the proposed method has a very high prediction accuracy: the coefficient of determination was 0.90 and the root mean square error (RMSE) was 0.54. Our model makes it possible to evaluate the impact on service and determine if action to maintain the temperature environment is required. We also clarify the effect of explanatory variables and training data of the machine learning on the model accuracy.


2020 ◽  
Vol 242 ◽  
pp. 05003
Author(s):  
A.E. Lovell ◽  
A.T. Mohan ◽  
P. Talou ◽  
M. Chertkov

As machine learning methods gain traction in the nuclear physics community, especially those methods that aim to propagate uncertainties to unmeasured quantities, it is important to understand how the uncertainty in the training data coming either from theory or experiment propagates to the uncertainty in the predicted values. Gaussian Processes and Bayesian Neural Networks are being more and more widely used, in particular to extrapolate beyond measured data. However, studies are typically not performed on the impact of the experimental errors on these extrapolated values. In this work, we focus on understanding how uncertainties propagate from input to prediction when using machine learning methods. We use a Mixture Density Network (MDN) to incorporate experimental error into the training of the network and construct uncertainties for the associated predicted quantities. Systematically, we study the effect of the size of the experimental error, both on the reproduced training data and extrapolated predictions for fission yields of actinides.


2021 ◽  
Author(s):  
Elisabeth Pfaehler ◽  
Daniela Euba ◽  
Andreas Rinscheid ◽  
Otto S. Hoekstra ◽  
Josee Zijlstra ◽  
...  

Abstract Background Machine learning studies require a large number of images often obtained on different PET scanners. When merging these images, the use of harmonized images following EARL-standards is essential. However, when including retrospective images, EARL accreditation might not have been in place. The aim of this study was to develop a convolutional neural network (CNN) that can identify retrospectively if an image is EARL compliant and if it is meeting older or newer EARL-standards. Materials and Methods 96 PET images acquired on three PET/CT systems were included in the study. All images were reconstructed with the locally clinically preferred, EARL1, and EARL2 compliant reconstruction protocols. After image pre-processing, one CNN was trained to separate clinical and EARL compliant reconstructions. A second CNN was optimized to identify EARL1 and EARL2 compliant images. The accuracy of both CNNs was assessed using 5-fold cross validation. The CNNs were validated on 24 images acquired on a PET scanner not included in the training data. To assess the impact of image noise on the CNN decision, the 24 images were reconstructed with different scan durations. Results In the cross-validation, the first CNN classified all images correctly. When identifying EARL1 and EARL2 compliant images, the second CNN identified 100% EARL1 compliant and 85% EARL2 compliant images correctly. The accuracy in the independent dataset was comparable to the cross-validation accuracy. The scan duration had almost no impact on the results. Conclusion The two CNNs trained in this study can be used to retrospectively include images in a multi-center setting by e.g. adding additional smoothing. This method is especially important for machine learning studies where the harmonization of images from different PET systems is essential.


2020 ◽  
Vol 124 ◽  
pp. 103611 ◽  
Author(s):  
Elias Martins Guerra Prado ◽  
Carlos Roberto de Souza Filho ◽  
Emmanuel John M. Carranza ◽  
João Gabriel Motta

Author(s):  
Amir Laadhar ◽  
Faiza Ghozzi ◽  
Imen Megdiche ◽  
Franck Ravat ◽  
Olivier Teste ◽  
...  

2002 ◽  
Vol 8 (4) ◽  
pp. 311-325 ◽  
Author(s):  
V. HOSTE ◽  
I. HENDRICKX ◽  
W. DAELEMANS ◽  
A. VAN DEN BOSCH

Various Machine Learning (ML) approaches have been demonstrated to produce relatively successful Word Sense Disambiguation (WSD) systems. There are still unexplained differences among the performance measurements of different algorithms, hence it is warranted to deepen the investigation into which algorithm has the right ‘bias’ for this task. In this paper, we show that this is not easy to accomplish, due to intricate interactions between information sources, parameter settings, and properties of the training data. We investigate the impact of parameter optimization on generalization accuracy in a memory-based learning approach to English and Dutch WSD. A ‘word-expert’ architecture was adopted, yielding a set of classifiers, each specialized in one single wordform. The experts consist of multiple memory-based learning classifiers, each taking different information sources as input, combined in a voting scheme. We optimized the architectural and parametric settings for each individual word-expert by performing cross-validation experiments on the learning material. The results of these experiments show that the variation of both the algorithmic parameters and the information sources available to the classifiers leads to large fluctuations in accuracy. We demonstrate that optimization per word-expert leads to an overall significant improvement in the generalization accuracies of the produced WSD systems.


Sign in / Sign up

Export Citation Format

Share Document