Evaluating author name disambiguation for digital libraries: a case of DBLP

Jinseok Kim

doi:10.1007/s11192-018-2824-5

Effect of Chinese characters on machine learning for Chinese author name disambiguation: A counterfactual evaluation

Journal of Information Science ◽

10.1177/01655515211018171 ◽

2021 ◽

pp. 016555152110181

Author(s):

Jinseok Kim ◽

Jenna Kim ◽

Jinmo Kim

Keyword(s):

Machine Learning ◽

Real World ◽

Digital Libraries ◽

Chinese Characters ◽

Name Disambiguation ◽

Authority Control ◽

Author Name Disambiguation ◽

Bibliographic Data ◽

Chinese Author

Chinese author names are known to be more difficult to disambiguate than other ethnic names because they tend to share surnames and forenames, thus creating many homonyms. In this study, we demonstrate how using Chinese characters can affect machine learning for author name disambiguation. For analysis, 15K author names recorded in Chinese are transliterated into English and simplified by initialising their forenames to create counterfactual scenarios, reflecting real-world indexing practices in which Chinese characters are usually unavailable. The results show that Chinese author names that are highly ambiguous in English or with initialised forenames tend to become less confusing if their Chinese characters are included in the processing. Our findings indicate that recording Chinese author names in native script can help researchers and digital libraries enhance authority control of Chinese author names that continue to increase in size in bibliographic data.
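The counterfactual setup described in this abstract can be illustrated with a minimal sketch. It assumes each record carries both a native-script name and a hand-supplied English transliteration; the sample records, `initialise_forename`, and `group_by` are hypothetical and are not taken from the study's data or code. The sketch only counts how many distinct native-script strings collapse onto one key at each level of simplification.

```python
from collections import defaultdict

# Hypothetical records: (native-script name, English transliteration "Surname, Forename").
# The transliterations are written by hand for illustration only.
RECORDS = [
    ("王伟", "Wang, Wei"),
    ("王薇", "Wang, Wei"),
    ("王文", "Wang, Wen"),
    ("李娜", "Li, Na"),
]

def initialise_forename(name: str) -> str:
    """Reduce 'Surname, Forename' to 'Surname, F' (first forename initial only)."""
    surname, forename = [part.strip() for part in name.split(",", 1)]
    return f"{surname}, {forename[0]}"

def group_by(records, key):
    """Map each key to the set of native-script names that share it."""
    groups = defaultdict(set)
    for native, english in records:
        groups[key(native, english)].add(native)
    return dict(groups)

# Three scenarios: native script, full English transliteration, initialised forename.
native_groups = group_by(RECORDS, lambda native, english: native)
english_groups = group_by(RECORDS, lambda native, english: english)
initial_groups = group_by(RECORDS, lambda native, english: initialise_forename(english))

for label, groups in [("native", native_groups),
                      ("english", english_groups),
                      ("initialised", initial_groups)]:
    largest = max(len(names) for names in groups.values())
    print(f"{label}: {len(groups)} distinct keys, largest group = {largest}")
```

In this toy sample the number of distinct keys shrinks as names are simplified, which mirrors the direction of the effect the abstract reports. Distinct native-script strings are not necessarily distinct people, so the counts measure only string-level ambiguity.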

Correction to: Evaluating author name disambiguation for digital libraries: a case of DBLP

Scientometrics ◽

10.1007/s11192-018-2960-y ◽

2018 ◽

Vol 118 (1) ◽

pp. 383-383

Author(s):

Jinseok Kim

Keyword(s):

Digital Libraries ◽

Name Disambiguation ◽

Author Name Disambiguation

Large scale author name disambiguation in digital libraries

2014 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2014.7004487 ◽

2014 ◽

Cited By ~ 10

Author(s):

Madian Khabsa ◽

Pucktada Treeratpituk ◽

C. Lee Giles

Keyword(s):

Digital Libraries ◽

Large Scale ◽

Name Disambiguation ◽

Author Name Disambiguation

Effective self-training author name disambiguation in scholarly digital libraries

Proceedings of the 10th annual joint conference on Digital libraries - JCDL '10 ◽

10.1145/1816123.1816130 ◽

2010 ◽

Cited By ~ 38

Author(s):

Anderson A. Ferreira ◽

Adriano Veloso ◽

Marcos André Gonçalves ◽

Alberto H.F. Laender

Keyword(s):

Digital Libraries ◽

Name Disambiguation ◽

Author Name Disambiguation

A survey of author name disambiguation techniques: 2010–2016

The Knowledge Engineering Review ◽

10.1017/s0269888917000182 ◽

2017 ◽

Vol 32 ◽

Cited By ~ 15

Author(s):

Ijaz Hussain ◽

Sohail Asghar

Keyword(s):

Digital Libraries ◽

Problem Formulation ◽

Quality Of Services ◽

Future Research ◽

Name Disambiguation ◽

Research Directions ◽

Author Name Disambiguation ◽

Abstract Level ◽

Future Research Directions

Digital libraries content and quality of services are badly affected by the author name ambiguity problem in the citations and it is considered as one of the hardest problems faced by the digital library researchers. Several techniques have been proposed in the literature for the author name ambiguity problem. In this paper, we reviewed some recently presented author name disambiguation techniques and give some challenges and future research directions. We analyze the recent advancements in this field and classify these techniques into supervised, unsupervised, semi-supervised, graph-based and heuristic-based techniques according to their problem formulation that is mainly used for the author name disambiguation. A few surveys have been conducted to review different techniques for the author name disambiguation. These surveys highlighted only the methodology adopted for author name disambiguation but did not critically review their shortcomings. This survey provides a detailed review of author name disambiguation techniques available in the literature, makes a comparison of these techniques at an abstract level and discusses their limitations.

Dynamic author name disambiguation for growing digital libraries

Information Retrieval ◽

10.1007/s10791-015-9261-3 ◽

2015 ◽

Vol 18 (5) ◽

pp. 379-412 ◽

Cited By ~ 17

Author(s):

Yanan Qian ◽

Qinghua Zheng ◽

Tetsuya Sakai ◽

Junting Ye ◽

Jun Liu

Keyword(s):

Digital Libraries ◽

Name Disambiguation ◽

Author Name Disambiguation

Ethnicity‐based name partitioning for author name disambiguation using supervised machine learning

Journal of the Association for Information Science and Technology ◽

10.1002/asi.24459 ◽

2021 ◽

Author(s):

Jinseok Kim ◽

Jenna Kim ◽

Jason Owen‐Smith

Keyword(s):

Machine Learning ◽

Supervised Machine Learning ◽

Name Disambiguation ◽

Author Name Disambiguation

Multilayer heuristics based clustering framework (MHCF) for author name disambiguation

Scientometrics ◽

10.1007/s11192-021-04087-7 ◽

2021 ◽

Author(s):

Humaira Waqas ◽

Muhammad Abdul Qadir

Keyword(s):

Name Disambiguation ◽

Author Name Disambiguation

AutoSense Model for Word Sense Induction

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016212 ◽

2019 ◽

Vol 33 ◽

pp. 6212-6219 ◽

Cited By ~ 1

Author(s):

Reinald Kim Amplayo ◽

Seung-won Hwang ◽

Min Song

Keyword(s):

Latent Variable ◽

Word Sense ◽

Name Disambiguation ◽

Variable Model ◽

Fine Grained ◽

Word Sense Induction ◽

Author Name Disambiguation ◽

Competing Models ◽

Word Senses ◽

Better Than

Word sense induction (WSI), or the task of automatically discovering multiple senses or meanings of a word, has three main challenges: domain adaptability, novel sense detection, and sense granularity flexibility. While current latent variable models are known to solve the first two challenges, they are not flexible to different word sense granularities, which differ very much among words, from aardvark with one sense, to play with over 50 senses. Current models either require hyperparameter tuning or nonparametric induction of the number of senses, which we find both to be ineffective. Thus, we aim to eliminate these requirements and solve the sense granularity problem by proposing AutoSense, a latent variable model based on two observations: (1) senses are represented as a distribution over topics, and (2) senses generate pairings between the target word and its neighboring word. These observations alleviate the problem by (a) throwing garbage senses and (b) additionally inducing fine-grained word senses. Results show great improvements over the state-of-the-art models on popular WSI datasets. We also show that AutoSense is able to learn the appropriate sense granularity of a word. Finally, we apply AutoSense to the unsupervised author name disambiguation task where the sense granularity problem is more evident and show that AutoSense is evidently better than competing models. We share our data and code here: https://github.com/rktamplayo/AutoSense.

LUCID: Author name disambiguation using graph Structural Clustering

2017 Intelligent Systems Conference (IntelliSys) ◽

10.1109/intellisys.2017.8324326 ◽

2017 ◽

Author(s):

Ijaz Hussain ◽

Sohail Asghar

Keyword(s):

Name Disambiguation ◽

Author Name Disambiguation ◽

Structural Clustering
