The impact of imbalanced training data on machine learning for author name disambiguation

Chinese author names are known to be more difficult to disambiguate than other ethnic names because they tend to share surnames and forenames, thus creating many homonyms. In this study, we demonstrate how using Chinese characters can affect machine learning for author name disambiguation. For analysis, 15K author names recorded in Chinese are transliterated into English and simplified by initialising their forenames to create counterfactual scenarios, reflecting real-world indexing practices in which Chinese characters are usually unavailable. The results show that Chinese author names that are highly ambiguous in English or with initialised forenames tend to become less confusing if their Chinese characters are included in the processing. Our findings indicate that recording Chinese author names in native script can help researchers and digital libraries enhance authority control of Chinese author names that continue to increase in size in bibliographic data.
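
The loss of distinguishing information described above can be illustrated with a small sketch. The records below are hypothetical and are not taken from the study's data; the helper names `initialise` and `distinct_keys` are likewise assumptions made for illustration. The sketch counts how many distinct name strings survive at each level of simplification.

```python
def initialise(full_name):
    """Reduce 'Surname, Forename' to 'Surname, F' (first initial only)."""
    surname, forename = [part.strip() for part in full_name.split(",", 1)]
    return f"{surname}, {forename[:1]}"


def distinct_keys(records):
    """Count distinct name strings in native script, in English, and initialised."""
    native = {native_name for native_name, _ in records}
    english = {english_name for _, english_name in records}
    initials = {initialise(english_name) for _, english_name in records}
    return len(native), len(english), len(initials)


# Hypothetical (native script, transliteration) pairs for four different authors.
records = [
    ("王伟", "Wang, Wei"),
    ("王薇", "Wang, Wei"),
    ("王文", "Wang, Wen"),
    ("李娜", "Li, Na"),
]

print(distinct_keys(records))
```

In this toy set the four authors remain distinguishable in native script, while transliteration and forename initialisation progressively merge them into fewer name strings.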

Noise Prediction Using Machine Learning with Measurements Analysis

Applied Sciences, 2020, Vol 10 (18), pp. 6619. DOI: 10.3390/app10186619

Author(s): Po-Jiun Wen, Chihpin Huang

Keyword(s): Machine Learning, Noise Exposure, Learning Model, Training Data, Coefficient Of Determination, Gradient Boosting, Noise Prediction, Time Duration, Proposed Model, The Impact

Noise prediction using machine learning has recently received increased attention, particularly for workplaces with noise pollution, where general laborers face increased noise exposure. This study analyzes the noise equivalent level (Leq) at the National Synchrotron Radiation Research Center (NSRRC) facility and establishes a machine learning model for noise prediction. A gradient boosting model (GBM) is used as the learning model, integrating past noise measurement records and many other features to make predictions. The study analyzes the time duration and frequency of the collected Leq and investigates the impact of training data selection. The results indicate that the proposed prediction model works well for almost all noise sensors and frequencies. Moreover, the model performed especially well for sensor 8 (125 Hz), which past noise measurements had identified as a serious noise zone. The results also show that the root-mean-square error (RMSE) of the predicted harmful noise was less than 1 dBA and the coefficient of determination (R²) was greater than 0.7. That is, the proposed method showed favorable noise prediction performance in the working field. This positive result demonstrates the ability of the proposed approach to predict noise and thus to notify laborers in time to prevent long-term exposure. In addition, the proposed model accurately predicts future noise pollution, which is essential for laborers in high-noise environments. This would help keep employees healthy by allowing them to avoid positions with harmful noise levels.
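
The two evaluation metrics named in this abstract, RMSE and the coefficient of determination, can be computed as in the sketch below. The measured and predicted values are hypothetical placeholders in dBA, not data from the study, and the function names are chosen for illustration only.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical Leq readings (dBA) and model predictions.
measured = [60.0, 62.0, 64.0, 66.0]
predicted = [60.5, 61.5, 64.5, 65.5]

print(rmse(measured, predicted), r_squared(measured, predicted))
```

An RMSE below 1 dBA together with R² above 0.7, as reported in the abstract, would correspond to small absolute errors and a model that explains most of the variance in the measured levels.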

Rack Temperature Prediction Model Using Machine Learning after Stopping Computer Room Air Conditioner in Server Room

Energies, 2020, Vol 13 (17), pp. 4300. DOI: 10.3390/en13174300

Author(s): Kosuke Sasakura, Takeshi Aoki, Masayoshi Komatsu, Takeshi Watanabe

Keyword(s): Machine Learning, High Heat, Training Data, Coefficient Of Determination, Gradient Boosting, Air Conditioner, Tree Model, Explanatory Variables, Temperature Environment, The Impact

Data centers (DCs) have become increasingly important in recent years, and highly efficient and reliable operation and management of DCs is now required. The heat density generated by racks and information and communication technology (ICT) equipment is predicted to rise in the future, so it is crucial to maintain an appropriate temperature environment in the server room, where high heat is generated, in order to ensure continuous service. It is especially important to predict changes in rack intake temperature in the server room when the computer room air conditioner (CRAC) is shut down, which can cause a rapid rise in temperature. However, it is quite difficult to predict the rack temperature accurately, which in turn makes it difficult to determine the impact on service in advance. In this research, we propose a model that predicts the rack intake temperature after the CRAC is shut down. Specifically, we use machine learning to construct a gradient boosting decision tree model with data from the CRAC, ICT equipment, and rack intake temperature. Experimental results demonstrate that the proposed method has a very high prediction accuracy: the coefficient of determination was 0.90 and the root mean square error (RMSE) was 0.54. Our model makes it possible to evaluate the impact on service and determine whether action to maintain the temperature environment is required. We also clarify the effect of the explanatory variables and the training data on the model accuracy.
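
The abstract does not state which library, features, or hyperparameters the authors used. As a rough illustration of the general idea behind a gradient boosting decision tree, the sketch below fits depth-1 trees (stumps) to the residuals of a squared-error model on a single hypothetical feature, minutes since CRAC shutdown. All values and function names are assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        preds = [p + learning_rate * (left if x <= threshold else right)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps)


# Hypothetical rack intake temperatures (deg C) by minutes since CRAC shutdown.
minutes = [0, 1, 2, 3, 4, 5]
temps = [20.0, 20.5, 21.5, 23.0, 25.0, 27.5]
model = fit_gbm(minutes, temps)
print([round(predict(model, m), 2) for m in minutes])
```

A production model would use many explanatory variables and deeper trees; the sketch only shows how successive trees reduce the remaining error.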

Quantified uncertainties in fission yields from machine learning

EPJ Web of Conferences, 2020, Vol 242, pp. 05003. DOI: 10.1051/epjconf/202024205003

Author(s): A.E. Lovell, A.T. Mohan, P. Talou, M. Chertkov

Keyword(s): Machine Learning, Nuclear Physics, Experimental Error, Training Data, Learning Methods, Mixture Density, Machine Learning Methods, Predicted Values, The Impact, Fission Yields

As machine learning methods gain traction in the nuclear physics community, especially those methods that aim to propagate uncertainties to unmeasured quantities, it is important to understand how the uncertainty in the training data, coming from either theory or experiment, propagates to the uncertainty in the predicted values. Gaussian Processes and Bayesian Neural Networks are increasingly used, in particular to extrapolate beyond measured data. However, the impact of experimental errors on these extrapolated values is typically not studied. In this work, we focus on understanding how uncertainties propagate from input to prediction when using machine learning methods. We use a Mixture Density Network (MDN) to incorporate experimental error into the training of the network and construct uncertainties for the associated predicted quantities. We systematically study the effect of the size of the experimental error on both the reproduced training data and the extrapolated predictions for fission yields of actinides.
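
The abstract does not describe the network architecture or loss used. In general, a Mixture Density Network outputs the weights, means, and standard deviations of a Gaussian mixture and is trained by minimising the negative log-likelihood of the observed targets; the predictive mean and variance then follow from the mixture parameters. The sketch below shows those two calculations for given mixture parameters, with all numbers being hypothetical.

```python
import math


def gaussian_pdf(y, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_nll(y, weights, means, sigmas):
    """Negative log-likelihood of y under a Gaussian mixture."""
    density = sum(w * gaussian_pdf(y, m, s)
                  for w, m, s in zip(weights, means, sigmas))
    return -math.log(density)


def mixture_mean_var(weights, means, sigmas):
    """Mean and variance of the mixture (the predictive uncertainty)."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (s * s + m * m)
                        for w, m, s in zip(weights, means, sigmas))
    return mean, second_moment - mean * mean


# Hypothetical mixture parameters, as a network might emit for one input.
weights, means, sigmas = [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]
print(mixture_nll(1.0, weights, means, sigmas))
print(mixture_mean_var(weights, means, sigmas))
```

The variance returned here combines the spread within each component with the spread between component means, which is one way a predicted uncertainty can widen away from the training data.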

Convolutional Neural Networks for automatic image quality control and EARL compliance of PET images

2021. DOI: 10.21203/rs.3.rs-964263/v1

Author(s): Elisabeth Pfaehler, Daniela Euba, Andreas Rinscheid, Otto S. Hoekstra, Josee Zijlstra, ...

Keyword(s): Machine Learning, Cross Validation, Training Data, Independent Dataset, Pet Ct, Image Quality Control, The Cross, The Impact, Pet Scanners, Fold Cross Validation

Background: Machine learning studies require a large number of images often obtained on different PET scanners. When merging these images, the use of harmonized images following EARL-standards is essential. However, when including retrospective images, EARL accreditation might not have been in place. The aim of this study was to develop a convolutional neural network (CNN) that can identify retrospectively if an image is EARL compliant and if it is meeting older or newer EARL-standards.

Materials and Methods: 96 PET images acquired on three PET/CT systems were included in the study. All images were reconstructed with the locally clinically preferred, EARL1, and EARL2 compliant reconstruction protocols. After image pre-processing, one CNN was trained to separate clinical and EARL compliant reconstructions. A second CNN was optimized to identify EARL1 and EARL2 compliant images. The accuracy of both CNNs was assessed using 5-fold cross validation. The CNNs were validated on 24 images acquired on a PET scanner not included in the training data. To assess the impact of image noise on the CNN decision, the 24 images were reconstructed with different scan durations.

Results: In the cross-validation, the first CNN classified all images correctly. When identifying EARL1 and EARL2 compliant images, the second CNN identified 100% EARL1 compliant and 85% EARL2 compliant images correctly. The accuracy in the independent dataset was comparable to the cross-validation accuracy. The scan duration had almost no impact on the results.

Conclusion: The two CNNs trained in this study can be used to retrospectively include images in a multi-center setting by, e.g., adding additional smoothing. This method is especially important for machine learning studies where the harmonization of images from different PET systems is essential.
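
The 5-fold cross validation mentioned in this abstract can be sketched as an index split. How the authors actually assigned images to folds (for example per patient or per scanner, with or without shuffling) is not stated, so the contiguous, unshuffled split below is an assumption made purely for illustration, as is the function name.

```python
def k_fold_indices(n_items, k):
    """Split indices 0..n_items-1 into k (train, test) pairs without shuffling."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


# 96 images, as in the abstract, divided into 5 folds.
folds = k_fold_indices(96, 5)
for train, test in folds:
    print(len(train), len(test))
```

Each image appears in exactly one test fold, so every image is used once for evaluation and four times for training.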

Modeling of Cu-Au prospectivity in the Carajás mineral province (Brazil) through machine learning: Dealing with imbalanced training data

Ore Geology Reviews, 2020, Vol 124, pp. 103611. DOI: 10.1016/j.oregeorev.2020.103611. Cited By ~ 1

Author(s): Elias Martins Guerra Prado, Carlos Roberto de Souza Filho, Emmanuel John M. Carranza, João Gabriel Motta

Keyword(s): Machine Learning, Training Data, Carajás Mineral Province, Imbalanced Training Data

AUTHOR NAME DISAMBIGUATION IN ACADEMIC PUBLICATIONS USING METHODS OF MACHINE LEARNING

Vestnik komp'iuternykh i informatsionnykh tekhnologii, 2015, pp. 41-48. DOI: 10.14489/vkit.2015.09.pp.041-048

Author(s): V. A. Zelepukhina

Keyword(s): Machine Learning, Name Disambiguation, Author Name Disambiguation, Academic Publications

The Impact of Imbalanced Training Data on Local Matching Learning of Ontologies

Business Information Systems - Lecture Notes in Business Information Processing, 2019, pp. 162-175. DOI: 10.1007/978-3-030-20485-3_13

Author(s): Amir Laadhar, Faiza Ghozzi, Imen Megdiche, Franck Ravat, Olivier Teste, ...

Keyword(s): Training Data, Imbalanced Training Data, The Impact

Parameter optimization for machine-learning of word sense disambiguation

Natural Language Engineering, 2002, Vol 8 (4), pp. 311-325. DOI: 10.1017/s1351324902003005. Cited By ~ 24

Author(s): V. HOSTE, I. HENDRICKX, W. DAELEMANS, A. VAN DEN BOSCH

Keyword(s): Machine Learning, Parameter Optimization, Information Sources, Word Sense Disambiguation, Training Data, Learning Material, Word Sense, Performance Measurements, Sense Disambiguation, The Impact

Various Machine Learning (ML) approaches have been demonstrated to produce relatively successful Word Sense Disambiguation (WSD) systems. There are still unexplained differences among the performance measurements of different algorithms, hence it is warranted to deepen the investigation into which algorithm has the right ‘bias’ for this task. In this paper, we show that this is not easy to accomplish, due to intricate interactions between information sources, parameter settings, and properties of the training data. We investigate the impact of parameter optimization on generalization accuracy in a memory-based learning approach to English and Dutch WSD. A ‘word-expert’ architecture was adopted, yielding a set of classifiers, each specialized in one single wordform. The experts consist of multiple memory-based learning classifiers, each taking different information sources as input, combined in a voting scheme. We optimized the architectural and parametric settings for each individual word-expert by performing cross-validation experiments on the learning material. The results of these experiments show that the variation of both the algorithmic parameters and the information sources available to the classifiers leads to large fluctuations in accuracy. We demonstrate that optimization per word-expert leads to an overall significant improvement in the generalization accuracies of the produced WSD systems.
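
The abstract does not specify the voting scheme or how settings were compared beyond cross-validation on the learning material. The sketch below assumes a simple majority vote among the classifiers of one word-expert and selects the setting with the highest cross-validated accuracy; the sense labels, setting names, and accuracy values are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


def select_best_setting(cv_accuracy):
    """Pick the parameter setting with the highest cross-validated accuracy."""
    return max(cv_accuracy, key=cv_accuracy.get)


# Hypothetical outputs of three classifiers inside the word-expert for "bank",
# each of which would use a different information source.
votes = ["bank/finance", "bank/river", "bank/finance"]
print(majority_vote(votes))

# Hypothetical cross-validation accuracies for this one word-expert.
cv_accuracy = {"k=1": 0.61, "k=3": 0.68, "k=5": 0.66}
print(select_best_setting(cv_accuracy))
```

Running the selection separately for every word-expert is what makes the optimisation per word rather than global, which is the contrast the abstract draws.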
