Citation context-based topic models: discovering cited and citing topics from full text

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Lixue Zou ◽  
Xiwen Liu ◽  
Wray Buntine ◽  
Yanli Liu

Purpose Full text of a document is a rich source of information that can be used to provide meaningful topics. The purpose of this paper is to demonstrate how to use citation context (CC) in the full text to identify the cited topics and citing topics efficiently and effectively by employing automatic text analysis algorithms. Design/methodology/approach The authors present two novel topic models, Citation-Context-LDA (CC-LDA) and Citation-Context-Reference-LDA (CCRef-LDA). CC is leveraged to extract the citing text from the full text, which makes it possible to discover topics with accuracy. CC-LDA incorporates CC, citing text, and their latent relationship, while CCRef-LDA incorporates CC, citing text, their latent relationship and reference information in CC. Collapsed Gibbs sampling is used to achieve an approximate estimation. The capacity of CC-LDA to simultaneously learn cited topics and citing topics together with their links is investigated. Moreover, a topic influence measure method based on CC-LDA is proposed and applied to create links between the two-level topics. In addition, the capacity of CCRef-LDA to discover topic influential references is also investigated. Findings The results indicate CC-LDA and CCRef-LDA achieve improved or comparable performance in terms of both perplexity and symmetric Kullback–Leibler (sKL) divergence. Moreover, CC-LDA is effective in discovering the cited topics and citing topics with topic influence, and CCRef-LDA is able to find the cited topic influential references. Originality/value The automatic method provides novel knowledge for cited topics and citing topics discovery. Topic influence learnt by our model can link two-level topics and create a semantic topic network. The method can also use topic specificity as a feature to rank references.
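
The two evaluation measures named in the findings can be illustrated with a minimal sketch. It assumes the common textbook definitions (perplexity as the exponential of the negative mean log-likelihood per token, and sKL as the sum of the two directed KL divergences); the abstract does not state which variants the authors used, and the distributions below are invented for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity of a held-out sample, given the model probability of each token."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrised KL divergence, assumed here to be KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical topic-word distributions over a four-word vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.10, 0.10, 0.40, 0.40]

skl_ab = symmetric_kl(topic_a, topic_b)
skl_ba = symmetric_kl(topic_b, topic_a)
skl_aa = symmetric_kl(topic_a, topic_a)

# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_perplexity = perplexity([0.25] * 8)
```

Lower perplexity indicates a better fit to held-out text, while a larger sKL between two topics indicates that they are more distinct.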

2018 ◽  
Vol 36 (3) ◽  
pp. 400-410 ◽  
Author(s):  
Debin Fang ◽  
Haixia Yang ◽  
Baojun Gao ◽  
Xiaojun Li

Purpose Discovering the research topics and trends from a large quantity of library electronic references is essential for scientific research. Current research of this kind mainly depends on human judgment. The purpose of this paper is to demonstrate how to identify research topics and evolution in trends from library electronic references efficiently and effectively by employing automatic text analysis algorithms. Design/methodology/approach The authors used latent Dirichlet allocation (LDA), a probabilistic generative topic model, to extract latent topics from a large quantity of research abstracts. Then, the authors conducted a regression analysis on the document-topic distributions generated by LDA to identify hot and cold topics. Findings First, this paper discovers 32 significant research topics from the abstracts of 3,737 articles published in the six top accounting journals during the period of 1992-2014. Second, based on the document-topic distributions generated by LDA, the authors identified seven hot topics and six cold topics from the 32 topics. Originality/value The topics discovered by LDA are highly consistent with the topics identified by human experts, indicating the validity and effectiveness of the methodology. Therefore, this paper provides novel knowledge to the accounting literature and demonstrates a methodology and process for topic discovery with lower cost and higher efficiency than the current methods.
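
The regression step described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the regression specification or the criterion for "hot" and "cold", so the sketch fits a simple least-squares line to each topic's mean yearly share and applies an arbitrary slope threshold. The topic names, years and shares are hypothetical.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def classify_topics(yearly_share, threshold=0.005):
    """Label topics by the trend of their mean document-topic share per year.

    yearly_share maps topic name -> {year: mean share}.
    threshold is an illustrative cut-off, not a significance test.
    """
    labels = {}
    for topic, by_year in yearly_share.items():
        years = sorted(by_year)
        trend = slope(years, [by_year[y] for y in years])
        if trend > threshold:
            labels[topic] = "hot"
        elif trend < -threshold:
            labels[topic] = "cold"
        else:
            labels[topic] = "stable"
    return labels


# Hypothetical mean topic shares, as would be averaged from LDA output.
shares = {
    "topic_1": {2010: 0.02, 2011: 0.04, 2012: 0.06, 2013: 0.08},
    "topic_2": {2010: 0.10, 2011: 0.08, 2012: 0.06, 2013: 0.04},
    "topic_3": {2010: 0.05, 2011: 0.05, 2012: 0.05, 2013: 0.05},
}
labels = classify_topics(shares)
```

In practice a significance test on the slope would be a more defensible criterion than a fixed threshold.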


2021 ◽  
Vol 3 (2) ◽  
pp. 1-16
Author(s):  
Kasper Welbers ◽  
Wouter van Atteveldt ◽  
Jan Kleinnijenhuis

Abstract Most common methods for automatic text analysis in communication science ignore syntactic information, focusing on the occurrence and co-occurrence of individual words, and sometimes n-grams. This is remarkably effective for some purposes, but poses a limitation for fine-grained analyses into semantic relations such as who does what to whom and according to what source. One tested, effective method for moving beyond this bag-of-words assumption is to use a rule-based approach for labeling and extracting syntactic patterns in dependency trees. Although this method can be used for a variety of purposes, its application is hindered by the lack of dedicated and accessible tools. In this paper we introduce the rsyntax R package, which is designed to make working with dependency trees easier and more intuitive for R users, and provides a framework for combining multiple rules for reliably extracting useful semantic relations.
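
To make the idea of rule-based extraction from dependency trees concrete, here is a small Python sketch. It does not use or reproduce the rsyntax API (rsyntax is an R package); the `Token` structure, the helper names and the hand-written parse are all hypothetical, and the relation labels follow Universal Dependencies conventions (`nsubj`, `obj`).

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int
    text: str
    pos: str
    relation: str
    parent: int  # 0 marks the root of the sentence


def children(tokens, parent_id, relation):
    """Return the tokens attached to parent_id with the given relation."""
    return [t for t in tokens if t.parent == parent_id and t.relation == relation]


def extract_svo(tokens):
    """Rule: a verb with an nsubj child and an obj child yields (who, does, what)."""
    triples = []
    for verb in tokens:
        if verb.pos != "VERB":
            continue
        for subj in children(tokens, verb.id, "nsubj"):
            for obj in children(tokens, verb.id, "obj"):
                triples.append((subj.text, verb.text, obj.text))
    return triples


# Hand-written parse of "The minister criticized the proposal".
sentence = [
    Token(1, "The", "DET", "det", 2),
    Token(2, "minister", "NOUN", "nsubj", 3),
    Token(3, "criticized", "VERB", "root", 0),
    Token(4, "the", "DET", "det", 5),
    Token(5, "proposal", "NOUN", "obj", 3),
]
triples = extract_svo(sentence)
```

A bag-of-words representation of the same sentence would lose the information about who criticized what, which is the limitation the abstract describes.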


2012 ◽  
Vol 2012 ◽  
pp. 1-15 ◽  
Author(s):  
Anna Divoli ◽  
Preslav Nakov ◽  
Marti A. Hearst

Recent years have shown a gradual shift in the content of biomedical publications that is freely accessible, from titles and abstracts to full text. This has enabled new forms of automatic text analysis and has given rise to some interesting questions: How informative is the abstract compared to the full-text? What important information in the full-text is not present in the abstract? What should a good summary contain that is not already in the abstract? Do authors and peers see an article differently? We answer these questions by comparing the information content of the abstract to that in citances—sentences containing citations to that article. We contrast the important points of an article as judged by its authors versus as seen by peers. Focusing on the area of molecular interactions, we perform manual and automatic analysis, and we find that the set of all citances to a target article not only covers most information (entities, functions, experimental methods, and other biological concepts) found in its abstract, but also contains 20% more concepts. We further present a detailed summary of the differences across information types, and we examine the effects other citations and time have on the content of citances.
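
The comparison between abstract and citances can be pictured as set arithmetic over annotated concepts. The sketch below is an assumption about how such coverage might be computed, not the authors' procedure; the concept lists are invented and the function name is hypothetical.

```python
def compare_concepts(abstract_concepts, citance_concepts):
    """Compare the concepts in an abstract with those in all citances to the article."""
    abstract_set = set(abstract_concepts)
    citance_set = set(citance_concepts)
    if not abstract_set:
        raise ValueError("abstract has no annotated concepts")
    covered = abstract_set & citance_set
    extra = citance_set - abstract_set
    return {
        "coverage": len(covered) / len(abstract_set),  # share of abstract concepts found in citances
        "extra_ratio": len(extra) / len(abstract_set),  # additional concepts relative to the abstract
        "missing": sorted(abstract_set - citance_set),
        "extra": sorted(extra),
    }


# Hypothetical annotations for one target article.
abstract = ["BRCA1", "RAD51", "DNA repair", "phosphorylation", "yeast two-hybrid"]
citances = ["BRCA1", "RAD51", "DNA repair", "phosphorylation", "ubiquitination", "cell cycle"]

report = compare_concepts(abstract, citances)
```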


2019 ◽  
Vol 37 (3) ◽  
pp. 436-455 ◽  
Author(s):  
Chih-Ming Chen ◽  
Yung-Ting Chen ◽  
Chen-Yu Liu

Purpose An automatic text annotation system (ATAS) that can collect resources from different databases through Linked Data (LD) for automatically annotating ancient texts was developed in this study to support digital humanities research. It allows humanists to refer to resources from diverse databases when interpreting ancient texts and provides a friendly text annotation reader for humanists interpreting ancient texts through reading. The paper aims to discuss whether the ATAS helps support digital humanities research. Design/methodology/approach Using a quasi-experimental design, the ATAS developed in this study and the MARKUS semi-ATAS were compared to determine whether they differ significantly in reading effectiveness and technology acceptance when supporting humanists interpreting ancient texts from the Ming dynasty’s collections. Additionally, lag sequential analysis was used to analyze users’ operation behaviors on the ATAS. A semi-structured in-depth interview was also applied to understand users’ opinions and perceptions of using the ATAS to interpret ancient texts through reading. Findings The experimental results reveal that the ATAS has higher reading effectiveness than the MARKUS semi-ATAS, but the difference is not statistically significant. The technology acceptance of the ATAS is significantly higher than that of the MARKUS semi-ATAS. In particular, the function comparison of the two systems shows that the ATAS offers greater perceived ease of use than the MARKUS semi-ATAS for the functions of term search, connection to source websites and adding annotations. Furthermore, the reading interface of the ATAS is simple and understandable and is more suitable for reading than that of the MARKUS semi-ATAS. Among all the considered LD sources, Moedict, which is an online Chinese dictionary, was confirmed as the most helpful one.
Research limitations/implications This study adopted the Jieba Chinese parser to perform the word segmentation process based on a parser lexicon for the ancient Chinese texts of the Ming dynasty’s collections. The accuracy of word segmentation by a lexicon-based Chinese parser is limited because it ignores the grammar and semantics of ancient texts. Moreover, the original parser lexicon used in the Jieba Chinese parser contains only modern words, which reduces the accuracy of word segmentation for ancient Chinese texts. These two limitations, which prevent the Jieba Chinese parser from correctly segmenting ancient Chinese texts, will significantly affect the effectiveness of using the ATAS to support digital humanities research. This study thus proposed a practicable scheme that adds new terms to the parser lexicon based on humanists’ own judgment to improve the accuracy of word segmentation by the Jieba Chinese parser. Practical implications Although some digital humanities platforms have been successfully developed to support digital humanities research, most of them still do not provide a friendly digital reading environment to support humanists in interpreting texts. For this reason, this study developed an ATAS that can automatically retrieve LD sources from different databases on the Internet to supply rich annotation information on reading texts to help humanists interpret texts. This study breaks new ground in digital humanities research. Originality/value This study proposed a novel ATAS that can automatically annotate an ancient text with useful information from LD sources in different databases to increase its readability, thus helping humanists obtain a deeper and broader understanding of the ancient text. Currently, no tool of this kind has been developed for humanists to support digital humanities research.
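
The effect of adding terms to a segmentation lexicon can be shown with a deliberately simplified sketch. It uses forward maximum matching, which is not Jieba's actual algorithm and does not call Jieba at all; the function name, the tiny lexicons and the example phrase (a Ming-era official title) are chosen purely for illustration.

```python
def segment(text, lexicon):
    """Forward maximum matching: take the longest lexicon entry at each position,
    falling back to a single character when nothing matches."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


text = "錦衣衛指揮使"

# Hypothetical lexicon that only knows a modern word.
modern_lexicon = {"指揮"}
before = segment(text, modern_lexicon)

# The same lexicon after a humanist adds two historical terms.
extended_lexicon = modern_lexicon | {"錦衣衛", "指揮使"}
after = segment(text, extended_lexicon)
```

With the modern lexicon the title is broken into fragments, while the extended lexicon recovers the two historical terms as whole tokens, which mirrors the remedy the study proposes.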


2007 ◽  
Vol 26 (2) ◽  
pp. 27
Author(s):  
Sam Brooks ◽  
Mark Herrick

Index Blending is the process of database development whereby various components are merged and refined to create a single encompassing source of information. Once a research need is determined for a given area of study, existing resources are examined for value and possible contribution to the end product. Index Blending focuses on the quality of bibliographic records as the primary factor with the addition of full text to enhance the end user’s research experience as an added convenience. Key examples of the process of Index Blending involve the fields of communication and mass media, hospitality and tourism, as well as computers and applied sciences. When academia, vendors, subject experts, lexicographers, and other contributors are brought together through the various factors associated with Index Blending, relevant discipline-specific research may be greatly enhanced.


2018 ◽  
Vol 37 (6) ◽  
pp. 621-631 ◽  
Author(s):  
Roofia Galeshi ◽  
Jyotsna Sharman ◽  
Jinghong Cai

Purpose The purpose of this paper is to understand the behavioral diversity among subgroups of young millennials in how they seek health-related information. Design/methodology/approach The authors ran several sets of analyses on the 2012–2014 US Program for the International Assessment of Adult Competencies (PIAAC) Data using Stata. The population was stratified into subgroups based on four characteristics: gender, ethnicity (blacks, Hispanics and whites), immigration status and college status (whether they were enrolled in a program of study at the time of the survey). The outcome variables were sources of health information including print (books/magazines/brochures), traditional media (Radio/TV), internet, family/friends/co-workers and health professionals. The independent variables were gender, ethnicity, educational status and immigration status. The authors applied the appropriate sample weights derived by the Organization for Economic Cooperation and Development so that the findings can be generalized to the population. The analysis included several descriptive statistics and χ² tests of independence. Findings Despite similarities, young adults’ health seeking behavior is complex and is influenced by gender, ethnicity, immigration status and education. The results indicated that while the internet is the primary source of health-related information for all young adults, there are subtle differences in utilizing other available resources. For example, while more educated young adults seek help from their family members, their less educated peers use the media to obtain health-related information. Ethnicity also has an effect on young adults’ information seeking behavior. The number of Hispanics and blacks who obtain their information from traditional media is significantly higher than that of their white counterparts. Research limitations/implications This study has several limitations.
First, the authors did not consider the effect of young adults’ digital literacy skills, problem solving skills and numeracy skills on their health seeking approach. Including these cognitive skills could reveal key information about young adults’ approach to information seeking that is not apparent from race, ethnicity and gender alone. Another limitation of this study is the inability to claim causation: PIAAC data are designed strictly for cross-sectional analysis. Practical implications Although behaviors often do not change simply by presenting information, trying to change behavior without improving individuals’ understanding of the issue by providing accurate information is likely to fail. Providing standardized health-related information sources that are accessible to all is vitally important. The results indicate that while the majority of young adults use the internet as their primary source of information, only a small percentage of young adults seek information from health professionals. Consequently, there is a need for an easily accessible and standardized online health-related source of information. Social implications Healthcare facilities and health-related industries have the resources and the ability to develop a reliable infrastructure that could provide reliable information that is easy to understand and navigate for adults with a variety of literacy levels and skills. One option may be to adopt the Universal Design for Learning approach and provide information that is accessible to a variety of individuals regardless of their education, learning skills and language skills. Flexible learning resources provided within a standard infrastructure accessible to all can help individuals find trustworthy and consistent information.
Originality/value Despite the unique characteristics of the millennials and the profound change in the way young adults seek information, there is a paucity of research on the ways young adults seek health-related information. Most existing literature is based on locally developed surveys and convenience sampling with limited reliability and validity information. Consequently, sweeping statements based on their findings risk hasty generalization. The PIAAC, on the other hand, is a nationally representative data set that has been extensively examined for its validity and reliability.
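
The χ² test of independence mentioned in the methodology can be sketched in a few lines. This illustration uses unweighted, invented counts and ignores the survey weights that the authors applied in Stata, so it shows only the basic statistic, not their analysis.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are two subgroups, columns are
# "uses traditional media for health information" (yes, no).
counts = [
    [30, 70],
    [50, 50],
]
statistic, dof = chi_square(counts)

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
reject_independence = dof == 1 and statistic > 3.841
```

With these made-up counts the statistic is about 8.33, which exceeds the 5% critical value for one degree of freedom.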


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Manuela López ◽  
Maria Sicilia ◽  
Peeter W.J. Verlegh

Purpose Opinion leaders are increasingly important as a source of information, with consumers judging them to be more credible than other media and more influential than other consumers. Thus, companies have an interest in engaging opinion leaders to post about products and brands, and the authors analyse different incentives for encouraging them to spread the word on social media (via electronic word-of-mouth [e-WoM]). Design/methodology/approach A 2 × 3 between-subjects experimental design was developed in which 359 technological opinion leaders (bloggers) participated. The authors manipulated the monetary incentive (money vs no money) and non-monetary incentives (information only vs return product vs keep product) offered in exchange for a brand post. Findings Various techniques for approaching opinion leaders are effective, but to differing degrees. Providing a product free of charge increases the likelihood that opinion leaders will post about it, and the highest intention to post is observed when they are allowed to keep the product. In contrast, giving money to opinion leaders could have an indirect negative impact on their intention to post through the expected negative reaction of followers. Originality/value It remains unclear how opinion leaders can best be encouraged to spread e-WoM, as incentives used for consumers may work differently for opinion leaders, who have followers that they want to maintain. The main contribution of this paper lies in its explanation of why opinion leaders react differently to monetary versus non-monetary incentives.
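
The 2 × 3 between-subjects design can be made concrete by enumerating its six cells. The factor levels below come from the abstract; the balanced random assignment is an assumption, since the abstract does not describe how participants were allocated, and the participant identifiers are placeholders.

```python
import itertools
import random
from collections import Counter

MONETARY = ["money", "no money"]
NON_MONETARY = ["information only", "return product", "keep product"]

# Every combination of the two factors is one experimental condition.
conditions = list(itertools.product(MONETARY, NON_MONETARY))


def assign(participants, conditions, seed=0):
    """Shuffle participants and deal them into conditions in turn (balanced assignment)."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {p: conditions[i % len(conditions)] for i, p in enumerate(shuffled)}


# Placeholder identifiers for the 359 participants reported in the abstract.
participants = [f"blogger_{n}" for n in range(359)]
assignment = assign(participants, conditions)
cell_sizes = Counter(assignment.values())
```

Because each participant sees exactly one condition, differences in posting intention between cells can be attributed to the incentive combination rather than to repeated exposure.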


2020 ◽  
Vol 72 (2) ◽  
pp. 262-286
Author(s):  
Jihong Liang ◽  
Hao Wang ◽  
Xiaojing Li

Purpose The purpose of this paper is to explore the task design and assignment of full-text generation on mass Chinese historical archives (CHAs) by crowdsourcing, with special attention paid to how best to divide full-text generation tasks into smaller ones assigned to crowdsourced volunteers and to improve the digitization of mass CHAs and the data-oriented processing of the digital humanities. Design/methodology/approach This paper starts from the complexities of character recognition of mass CHAs, takes the Sheng Xuanhuai archives crowdsourcing project of Shanghai Library as a case study, and makes use of the theories of archival science, including diplomatics of Chinese archival documents, and the historical approach of Chinese archival traditions as the theoretical basis and analysis methods. The results are generated through comprehensive research. Findings This paper points out that volunteer tasks of full-text generation include transcription, punctuation, proofreading, metadata description, segmentation, and attribute annotation in digital humanities and provides a metadata element set for volunteers to use in creating or revising metadata descriptions and also provides an attribute tag set. The two sets can be used across the humanities to construct overall observations about texts and the archives of which they are a part. Along these lines, this paper presents significant insights for application in outlining the principles, methods, activities, and procedures of crowdsourced full-text generation for mass CHAs. Originality/value This study is the first to explore and identify the effective design and allocation of tasks for crowdsourced volunteers completing full-text generation on CHAs in digital humanities.

