Exogenous approach to improve topic segmentation

Author(s):  
Marwa Naili ◽  
Anja Habacha Chaibi ◽  
Henda Hajjami Ben Ghezala

Purpose – Topic segmentation is an active research field in natural language processing, and many topic segmenters have been proposed. The current challenge for researchers is to improve these segmenters by using external resources. The purpose of this paper is therefore to integrate, study and evaluate a new external semantic resource in topic segmentation. Design/methodology/approach – New topic segmenters (TSS-Onto and TSB-Onto) are proposed based on the two well-known segmenters C99 and TextTiling. The proposed segmenters integrate semantic knowledge into the segmentation process by using a domain ontology as an external resource. An evaluation then studies the effect of this resource on the quality of topic segmentation, along with a comparative study with related works. Findings – Based on this study, the authors showed that adding semantic knowledge extracted from a domain ontology improves the quality of topic segmentation. Moreover, TSS-Onto outperforms TSB-Onto in terms of segmentation quality. Research limitations/implications – The main limitation of this study is that the test corpus used for the evaluation is not a benchmark. However, we used a collection of scientific papers from well-known digital libraries (ArXiv and ACM). Practical implications – The proposed topic segmenters can be useful in different NLP applications such as information retrieval and text summarization. Originality/value – The primary original contribution of this paper is the improvement of topic segmentation based on semantic knowledge. This knowledge is extracted from an ontological external resource.
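To make the idea of lexical-cohesion segmentation concrete, here is a minimal Python sketch in the spirit of TextTiling. It is not the authors' TSS-Onto or TSB-Onto: the function names, the window size and the fixed similarity threshold are assumptions made for illustration, and the threshold stands in for the depth scoring that TextTiling itself uses. An ontology-based variant could, for instance, map words to ontology concepts before comparing the two windows; that mapping is not shown here.

```python
import math
import re
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Return each index i where a topic boundary falls before sentences[i].

    A boundary is placed wherever the word overlap between the `window`
    sentences before the gap and the `window` sentences after it drops
    below `threshold` (a simplification; both values are assumptions).
    """
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    boundaries = []
    for i in range(1, len(bags)):
        left = sum(bags[max(0, i - window):i], Counter())
        right = sum(bags[i:i + window], Counter())
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries


if __name__ == "__main__":
    text = [
        "The cat sat on the mat.",
        "The cat chased the mouse.",
        "Stock markets fell sharply today.",
        "Investors sold stock as markets fell.",
    ]
    print(segment(text))
```

In this toy input the two windows share no words only at the gap between the second and third sentences, so that is the only place a boundary is reported.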

2019 ◽  
Vol 34 (4) ◽  
pp. 295-310 ◽  
Author(s):  
Huyen T M Nguyen ◽  
Hung V Nguyen ◽  
Quyen T Ngo ◽  
Luong X Vu ◽  
Vu Mai Tran ◽  
...  

Sentiment analysis is a natural language processing (NLP) task of identifying or extracting the sentiment content of a text unit. This task has been an active research topic since the early 2000s. During the last two editions of the VLSP workshop series, a shared task on Sentiment Analysis (SA) for Vietnamese was organized in order to provide an objective evaluation of the performance (quality) of sentiment analysis tools, to encourage the development of Vietnamese sentiment analysis systems, and to provide benchmark datasets for this task. The first campaign in 2016 focused only on sentiment polarity classification, with a dataset containing reviews of electronic products. The second campaign in 2018 addressed the problem of Aspect Based Sentiment Analysis (ABSA) for Vietnamese, providing two datasets containing reviews in the restaurant and hotel domains. These data are accessible for research purposes via the VLSP website vlsp.org.vn/resources. This paper describes the datasets built as well as the evaluation results of the systems participating in these campaigns.


2020 ◽  
Vol 38 (5/6) ◽  
pp. 905-918
Author(s):  
Ivana Tanasijević ◽  
Gordana Pavlović-Lažetić

Purpose The purpose of this paper is to provide a methodology for automatic annotation of a multimedia collection of intangible cultural heritage, mostly in the form of interviews. Assigned annotations provide a way to search the collection. Design/methodology/approach Annotation is based on automatic extraction of metadata and is conducted by named entity and topic extraction from textual descriptions with a rule-based approach supported by vocabulary resources, a compiled domain-specific classification scheme and domain-oriented corpus analysis. Findings The proposed methodology for automatic annotation of a collection of intangible cultural heritage, applied to the cultural heritage of the Balkans, achieves an F-measure of 0.87 for named entity annotation and 0.90 for topic annotation. The overall methodology enables encapsulating domain-specific and language-specific knowledge into collections of finite state transducers and allows further improvements. Originality/value Although cultural heritage has a significant role in the development of the identity of a group or an individual, it is one of those specific domains that have not yet been fully explored for many languages. A methodology is proposed that can be used for incorporating natural language processing techniques into digital libraries of cultural heritage.


2017 ◽  
Vol 35 (3) ◽  
pp. 398-409
Author(s):  
Gracielle Mendonça Rodrigues Gomes ◽  
Beatriz Valadares Cendon

Purpose The study proposes using the semiotics inspection method (SIM), an interpretative and qualitative method from semiotics engineering (SE) for evaluating the communicability of systems, to evaluate digital libraries and information retrieval systems (IRS). The paper presents the results of applying this method to evaluate the quality of the communicability of the interface and search system of the Coordination for the Improvement of Higher Education Personnel (CAPES) Portal of e-Journals, a major scientific digital library in Brazil. Proposed solutions to improve this system are included. Design/methodology/approach The study used the SIM to evaluate the system. Two evaluators inspected the system. They compared and analysed three types of metamessages (metalinguistic, static and dynamic). The metamessages generated by the evaluators were contrasted to find inconsistencies and ambiguities in the CAPES Portal of e-Journals. Finally, the last step of the method was the final assessment of the inspection. Findings The evaluators identified 52 problems of communicability. These problems were ranked according to severity ratings established by Nielsen (1994). They were grouped into ten types of problems present in the interface and in the search system of the CAPES Portal of e-Journals. Originality/value This research contributes theoretically to the field of information retrieval and to the area of human–computer interaction and, in particular, to the theory of SE by adapting SE methods that allow the evaluation of communicability to the context of the scientific IRS. Results obtained through scientific methods should contribute to the development of the interface and search tools of IRS to better support query formulation and retrieval of relevant information and more efficiently satisfy the information needs of individuals.


2020 ◽  
Vol 38 (02) ◽  
Author(s):  
TẠ DUY CÔNG CHIẾN

Question answering systems have been applied to many different fields in recent years, such as education, business, and surveys. The purpose of these systems is to automatically answer users' questions or queries about some problem. This paper introduces a question answering system that is built on a domain-specific ontology. This ontology, which contains the data and vocabularies related to the computing domain, is built from text documents of the ACM Digital Library. Consequently, the system only answers questions pertaining to information technology domains such as databases, networks, machine learning, etc. We use the methodologies of Natural Language Processing and domain ontology to build this system. In order to increase performance, we use a graph database to store the computing ontology and apply a NoSQL database to query the data of the computing ontology.


2020 ◽  
Vol 38 (1) ◽  
pp. 44-64
Author(s):  
Nikola Nikolić ◽  
Olivera Grljević ◽  
Aleksandar Kovačević

Purpose Student recruitment and retention are important issues for all higher education institutions. Constant monitoring of student satisfaction levels is therefore crucial. Traditionally, students voice their opinions through official surveys organized by the universities. In addition, social media and review websites such as “Rate my professors” are nowadays rich sources of opinions that should not be ignored. Automated mining of students’ opinions can be realized via aspect-based sentiment analysis (ABSA). ABSA is a sub-discipline of natural language processing (NLP) that focusses on the identification of sentiments (negative, neutral, positive) and aspects (sentiment targets) in a sentence. The purpose of this paper is to introduce a system for ABSA of free text reviews expressed in student opinion surveys in the Serbian language. Sentiment analysis was carried out at the finest level of text granularity – the level of sentence segment (phrase and clause). Design/methodology/approach The presented system relies on NLP techniques, machine learning models, rules and dictionaries. The corpora collected and annotated for system development and evaluation comprise students’ reviews of teaching staff at the Faculty of Technical Sciences, University of Novi Sad, Serbia, and a corpus of publicly available reviews from the Serbian equivalent of the “Rate my professors” website. Findings The research results indicate that positive sentiment can successfully be identified with an F-measure of 0.83, while negative sentiment can be detected with an F-measure of 0.94. The F-measure for aspects ranges between 0.49 and 0.89, depending on their frequency in the corpus. Furthermore, the authors have concluded that the quality of ABSA depends on the source of the reviews (official students’ surveys vs review websites).
Practical implications The system for ABSA presented in this paper could improve the quality of service provided by Serbian higher education institutions through a more effective search and summary of students’ opinions. For example, a particular educational institution could very easily find out which aspects of their service the students are not satisfied with and to which aspects of their service more attention should be directed. Originality/value To the best of the authors’ knowledge, this is the first study of ABSA carried out at the level of sentence segment for the Serbian language. The methodology and findings presented in this paper provide a much-needed basis for further work on sentiment analysis for the Serbian language, which is under-resourced and under-researched in this area.
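For readers unfamiliar with the metric quoted in this abstract, the F-measure is the weighted harmonic mean of precision and recall. The sketch below shows the standard computation from true positive, false positive and false negative counts. The function name and the example counts are illustrative assumptions, and the abstract does not say which weighting the authors used, so the balanced F1 (`beta = 1`) is assumed here.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from raw counts; beta=1 gives the balanced F1 score."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was found correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts for one sentiment class, not taken from the paper.
    print(round(f_measure(tp=8, fp=2, fn=2), 2))
```

Because the score is computed per class, a rare aspect with few true positives can score much lower than a frequent one, which is consistent with the frequency dependence the abstract reports.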


2014 ◽  
Vol 32 (1) ◽  
pp. 173-189 ◽  
Author(s):  
Yalan Yan ◽  
Xianjin Zha ◽  
Jinchao Zhang ◽  
Xiaorong Hou

Purpose – In this study, the authors use the term “e-quality” to refer to information quality, system quality and service quality. This study aims to focus on e-quality, exploring and comparing users' perceptions of digital libraries and virtual communities in the hope that the results of this study can help lead to better understanding of the exact nature of e-quality as perceived by users. Design/methodology/approach – A large-scale survey was conducted for data collection. Data collected from 334 users of digital libraries and virtual communities were used for data analysis. Findings – The study finds that users are likely to perceive a higher level of information quality, system quality and service quality of digital libraries than of virtual communities. Practical implications – The authors suggest that librarians do not need to have concerns over the challenge brought by virtual communities, which indeed have an increasing impact on the way a lot of people seek and gather information. Instead, they should encourage their users to use both digital libraries and virtual communities. The authors believe that the usage of these two types of information sources by users can efficiently inform each other, thus facilitating the e-quality of both digital libraries and virtual communities to reach excellence. Originality/value – Building on the information systems (IS) success model, this study explores and compares users' perceptions of digital libraries and virtual communities in terms of e-quality, which the authors think presents a new view for digital library research and practice alike.


2019 ◽  
Vol 35 (1) ◽  
pp. 15-30 ◽  
Author(s):  
Asad Khan ◽  
Mohamad Noorman Masrek ◽  
Khalid Mahmood

Purpose In addition to instrumental assumptions, behavioural researchers suggest the study of individual traits such as personal innovativeness (PI), users’ satisfaction and other theoretical beliefs, for example quality and general usage patterns, as the latent determinants of early and post-adoptions of technological innovations. In the context of the Higher Education Commission digital library of Pakistan, the purpose of this paper is to examine the relationship of PI, quality of digital resources and generic usability of digital libraries (DL) with users’ satisfaction. Design/methodology/approach To guide the conceptual model of this study, five hypothesized relationships were formulated. Adopting a quantitative approach, snowball sampling techniques were used. A total of 464 users of DL enrolled in different programs of study in the universities of Pakistan participated and responded to the survey. For data analyses, partial least squares, a structural equation modeling method, was used. Findings Analyses reveal positive and strong relationships of PI, quality of digital resources and generic usability of DL with users’ satisfaction. Thus, the findings of this study established personal traits as significant determinants of intention to adopt DL. Research limitations/implications The decision of effective adoption is shaped by the extent of users’ willingness (PI), level of satisfaction, the image of quality and users’ past experience with the use of related innovations. Thus, librarians, in addition to the system features, should also focus on individual characteristics and quality of resources that probably influence adequate adoption of DL. Originality/value In the Pakistani context, this study is the first attempt to examine the relationship of PI, the usability of DL and quality of digital resources with users’ satisfaction. The research model of this study can be used in future research. Also, this study extended the scope of theories of adoption towards DL.


2015 ◽  
Vol 1 ◽  
pp. e37 ◽  
Author(s):  
Bahar Sateli ◽  
René Witte

Motivation. Finding relevant scientific literature is one of the essential tasks researchers are facing on a daily basis. Digital libraries and web information retrieval techniques provide rapid access to a vast amount of scientific literature. However, no further automated support is available that would enable fine-grained access to the knowledge ‘stored’ in these documents. The emerging domain of Semantic Publishing aims at making scientific knowledge accessible to both humans and machines, by adding semantic annotations to content, such as a publication’s contributions, methods, or application domains. However, despite the promises of better knowledge access, the manual annotation of existing research literature is prohibitively expensive for wide-spread adoption. We argue that a novel combination of three distinct methods can significantly advance this vision in a fully-automated way: (i) Natural Language Processing (NLP) for Rhetorical Entity (RE) detection; (ii) Named Entity (NE) recognition based on the Linked Open Data (LOD) cloud; and (iii) automatic knowledge base construction for both NEs and REs using semantic web ontologies that interconnect entities in documents with the machine-readable LOD cloud. Results. We present a complete workflow to transform scientific literature into a semantic knowledge base, based on the W3C standards RDF and RDFS. A text mining pipeline, implemented based on the GATE framework, automatically extracts rhetorical entities of type Claims and Contributions from full-text scientific literature. These REs are further enriched with named entities, represented as URIs to the linked open data cloud, by integrating the DBpedia Spotlight tool into our workflow. Text mining results are stored in a knowledge base through a flexible export process that provides for a dynamic mapping of semantic annotations to LOD vocabularies through rules stored in the knowledge base.
We created a gold standard corpus from computer science conference proceedings and journal articles, where Claim and Contribution sentences are manually annotated with their respective types using LOD URIs. The performance of the RE detection phase is evaluated against this corpus, where it achieves an average F-measure of 0.73. We further demonstrate a number of semantic queries that show how the generated knowledge base can provide support for numerous use cases in managing scientific literature. Availability. All software presented in this paper is available under open source licenses at http://www.semanticsoftware.info/semantic-scientific-literature-peerj-2015-supplements. Development releases of individual components are additionally available on our GitHub page at https://github.com/SemanticSoftwareLab.
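As a rough illustration of what querying such a knowledge base can look like, the Python sketch below stores subject-predicate-object triples in a plain list and answers a pattern query over them. It is only a toy stand-in for an RDF store and is not the authors' pipeline: every identifier in it (the `ex:` prefix, the sentence IDs, the `ex:mentions` predicate, and the choice of DBpedia resource names) is a hypothetical placeholder.

```python
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical data: the "ex:" names and sentence IDs are invented for illustration.
KB: List[Triple] = [
    ("ex:doc1#sent12", "rdf:type", "ex:Contribution"),
    ("ex:doc1#sent12", "ex:mentions", "dbpedia:Text_mining"),
    ("ex:doc1#sent40", "rdf:type", "ex:Claim"),
    ("ex:doc2#sent03", "rdf:type", "ex:Contribution"),
    ("ex:doc2#sent03", "ex:mentions", "dbpedia:Ontology_(information_science)"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def contributions_mentioning(triples: Iterable[Triple], entity: str) -> List[str]:
    """Find sentences typed as Contribution that also mention `entity`."""
    triples = list(triples)
    contribs = {t[0] for t in match(triples, p="rdf:type", o="ex:Contribution")}
    hits = match(triples, p="ex:mentions", o=entity)
    return sorted(t[0] for t in hits if t[0] in contribs)


if __name__ == "__main__":
    print(contributions_mentioning(KB, "dbpedia:Text_mining"))
```

A real deployment would more likely hold the triples in an RDF store and express the same join as a SPARQL query; the list-based version just keeps the example self-contained.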


2019 ◽  
Vol 26 (3) ◽  
pp. 259-291
Author(s):  
Nazanin Firoozeh ◽  
Adeline Nazarenko ◽  
Fabrice Alizon ◽  
Béatrice Daille

Due to the considerable growth of the volume of text documents on the Internet and in digital libraries, manual analysis of these documents is no longer feasible. Having efficient approaches to keyword extraction in order to retrieve the ‘key’ elements of the studied documents is now a necessity. Keyword extraction has been an active research field for many years, covering various applications in Text Mining, Information Retrieval, and Natural Language Processing, and meeting different requirements. However, it is not a unified domain of research. In spite of the existence of many approaches in the field, there is no single approach that effectively extracts keywords from different data sources. This shows the importance of having a comprehensive review, which discusses the complexity of the task and categorizes the main approaches of the field based on the features and methods of extraction that they use. This paper presents a general introduction to the field of keyword/keyphrase extraction. Unlike the existing surveys, different aspects of the problem along with the main challenges in the field are discussed. This mainly includes the unclear definition of ‘keyness’, complexities of targeting proper features for capturing desired keyness properties and selecting efficient extraction methods, and also the evaluation issues. By classifying a broad range of state-of-the-art approaches and analysing the benefits and drawbacks of different features and methods, we provide a clearer picture of them. This review is intended to help readers find their way around all the works related to keyword extraction and guide them in choosing or designing a method that is appropriate for the application they are targeting.


Author(s):  
Dharambeer Singh

Digital libraries, designed to serve people and their information needs in the same way as traditional libraries, present distinct advantages over brick and mortar facilities: elimination of physical boundaries, round-the-clock access to information, multiple access points, networking abilities, and extended search functions. As a result, they should be especially well-suited for people with disabilities. However, minorities, those affected by lower income and education status, persons living in rural areas, the physically challenged, and developing countries as a whole consistently suffer from a lack of accessibility to digital libraries. This paper evaluates the effectiveness and relevance of digital libraries currently in place and discusses what could and should be done to improve accessibility to digital libraries for under-graduate students.

