Computational Linguistics for Metadata Building (CLiMB) Text Mining for the Automatic Extraction of Subject Terms for Image Metadata

2020 ◽  
Vol 5 (4) ◽  
pp. 43-55
Author(s):  
Gianpiero Bianchi ◽  
Renato Bruni ◽  
Cinzia Daraio ◽  
Antonio Laureti Palma ◽  
Giulio Perani ◽  
...  

Abstract
Purpose – The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from university websites. The information extracted automatically can be updated more frequently than once per year and is safe from manipulation or misinterpretation. Moreover, this approach gives us flexibility in collecting indicators about the efficiency of university websites and their effectiveness in disseminating key content. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for “profiling” the analyzed universities.
Design/methodology/approach – Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining and web usage mining. The information needed to compute our indicators was extracted from university websites using web scraping and text mining techniques. The scraped information was stored in a NoSQL database in a semi-structured form, allowing efficient retrieval by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data were also collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the web was combined with structural information on universities taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All of the above was used to cluster 79 Italian universities on the basis of structural and digital indicators.
Findings – The main findings of this study concern the evaluation of universities’ potential for digitalization, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of university websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features by applying clustering techniques to the above indicators.
Research limitations – The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.
Practical implications – The approach proposed in this study, illustrated on Italian universities, shows the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.
Originality/value – This work applies, for the first time, to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).
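The pipeline the abstract describes — scrape a page, then derive simple quantitative indicators from its content and link structure — can be illustrated with a minimal sketch. This is not the authors' system: the indicator names and the use of the standard-library `html.parser` are illustrative assumptions (the real study used dedicated scraping, OCR and a NoSQL store).

```python
from html.parser import HTMLParser

class LinkAndTextCounter(HTMLParser):
    """Collects outgoing links and visible text length from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_chars += len(data.strip())

def website_indicators(html: str) -> dict:
    """Derive two toy 'digital indicators' from a scraped page."""
    parser = LinkAndTextCounter()
    parser.feed(html)
    return {
        "outbound_links": len(parser.links),   # proxy for web structure
        "text_volume": parser.text_chars,      # proxy for web content
    }

page = '<html><body><p>Courses</p><a href="/phd">PhD</a><a href="/labs">Labs</a></body></html>'
print(website_indicators(page))  # {'outbound_links': 2, 'text_volume': 14}
```

In the study proper, such per-page measures would be aggregated per university and fed, together with ETER structural data, into the clustering step.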


10.2196/13007 ◽  
2019 ◽  
Vol 21 (4) ◽  
pp. e13007 ◽  
Author(s):  
George Karystianis ◽  
Armita Adily ◽  
Peter Schofield ◽  
Lee Knight ◽  
Clara Galdon ◽  
...  

Author(s):  
Daniel Hardt

This chapter considers approaches to ellipsis within computational linguistics. It begins with the structure of the ellipsis site, a topic that has received little attention in computational linguistics. There are two prominent accounts of the recovery of ellipsis: those of Lappin and McCord (1990) and Dalrymple et al. (1991). The topic of licensing follows. This topic has not been directly addressed in the computational literature, but the chapter covers two related issues of direct computational interest: the identification and generation of ellipsis occurrences. This is followed by three additional topics: identifying the antecedent, dialogue, and text mining.


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Yohanes Sigit Purnomo W.P. ◽  
Yogan Jaya Kumar ◽  
Nur Zareen Zulkarnain

Purpose – Extracting information from unstructured data is a challenging task in computational linguistics. Statements by public figures, attributed by journalists in a story, are one type of information that can be processed into structured data. A knowledge base of such data would therefore be very beneficial for further uses such as opinion mining, claim detection and fact-checking. This study aims to understand statement extraction tasks and the models that have already been applied, in order to formulate a framework for further study.
Design/methodology/approach – This paper presents a literature review of selected previous research that specifically addresses the topics of quotation extraction and quotation attribution. Research on corpus development related to quotation extraction and attribution is also considered. The findings of the review are used as a basis for proposing a framework to direct further research.
Findings – There are three findings in this study. First, the extraction process still consists of two main tasks, namely, the extraction of quotations and the attribution of quotations. Second, most extraction algorithms rely on rule-based approaches or traditional machine learning. Third, available corpora are limited in both quantity and depth. Based on these findings, a statement extraction framework covering Indonesian-language corpus and model development is proposed.
Originality/value – The paper serves as a guideline for formulating a statement extraction framework based on the findings of the literature study. The proposed framework includes corpus development in the Indonesian language and a model for public figure statement extraction. Furthermore, this study could be used as a reference for producing similar frameworks for other languages.
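The rule-based approach that the review finds dominant can be sketched with a single pattern that captures a quoted span followed by a capitalized name and a reporting verb. The pattern and verb list are illustrative assumptions, not any surveyed system; real pipelines handle many more quote styles and attribution cues.

```python
import re

# Illustrative reporting-verb list (assumption, not from the paper).
REPORTING_VERBS = r"(?:said|says|stated|told|added)"

# Rule: "QUOTE" SpeakerName VERB
PATTERN = re.compile(
    r'"(?P<quote>[^"]+)"\s*,?\s*'                      # the quoted span
    r"(?P<speaker>[A-Z][\w.]*(?:\s[A-Z][\w.]*)*)\s+"   # a capitalized name
    + REPORTING_VERBS
)

def extract_statements(text: str):
    """Return (speaker, quote) pairs found by the rule."""
    return [(m.group("speaker"), m.group("quote")) for m in PATTERN.finditer(text)]

sentence = '"The budget will double next year," Maria Santos said on Monday.'
print(extract_statements(sentence))  # [('Maria Santos', 'The budget will double next year,')]
```

A machine-learning attribution model would replace the fixed pattern with learned features, which is why the proposed framework pairs a model with a dedicated Indonesian corpus.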


2016 ◽  
Vol 6 (2) ◽  
pp. 76-95
Author(s):  
Addi Rull ◽  
Tõnu Tamme ◽  
Leo Võhandu

Abstract The authors propose a novel quantitative method to analyse the structure of legal texts. The method makes it possible to determine the grammatical similarity between legal texts. The authors use the external theory of fundamental rights to separate the text of the fundamental rights of the Estonian Constitution into two categories of norms: constitutional rights and restrictions. Grammatical similarity between constitutional rights, restrictions and selected legal acts and case law is measured. The layer of special norms shows the highest grammatical similarity with the text of fundamental rights. The same grammatical similarity tests can be replicated to cover other jurisdictions in the future. The research is experimental, but the authors believe that the method can be utilised in computational linguistics and legal text mining, as well as in research where legal text structures are analysed for various purposes.
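The abstract does not specify the similarity measure, so the following is only a generic sketch of one way to quantify structural similarity between two legal texts: compare their token-bigram profiles with Jaccard similarity. The example sentences and the bigram proxy are assumptions for illustration, not the authors' method.

```python
def bigram_profile(text: str) -> set:
    """Token-bigram profile: a crude proxy for a text's grammatical structure."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def similarity(a: str, b: str) -> float:
    """Jaccard similarity between the bigram profiles of two texts."""
    pa, pb = bigram_profile(a), bigram_profile(b)
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)

norm = "everyone has the right to liberty"
restriction = "the right to liberty may be restricted by law"
print(round(similarity(norm, restriction), 3))  # 0.3
```

In the study's setting, the texts compared would be the constitutional-rights and restriction norms on one side and statutes or case law on the other, with higher scores indicating closer grammatical structure.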


Author(s):  
Cane W.K. Leung

Sentiment analysis is a kind of text classification that classifies texts based on the sentimental orientation (SO) of the opinions they contain. Sentiment analysis of product reviews has recently become very popular in text mining and computational linguistics research. The following example provides an overall idea of the challenge. The sentence below is extracted from a movie review on the Internet Movie Database: “It is quite boring... the acting is brilliant, especially Massimo Troisi.” In the example, the author stated that “it” (the movie) is quite boring but the acting is brilliant. Understanding such sentiments involves several tasks. Firstly, evaluative terms expressing opinions must be extracted from the review. Secondly, the SO, or the polarity, of the opinions must be determined. For instance, “boring” and “brilliant” respectively carry a negative and a positive opinion. Thirdly, the opinion strength, or the intensity, of an opinion should also be determined. For instance, both “brilliant” and “good” indicate positive opinions, but “brilliant” obviously implies a stronger preference. Finally, the review is classified with respect to sentiment classes, such as Positive and Negative, based on the SO of the opinions it contains.
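The four tasks listed above can be sketched with a minimal lexicon-based classifier: extract evaluative terms, look up each term's polarity and strength, and classify the review by the summed score. The tiny lexicon and its weights are illustrative assumptions, not an established resource.

```python
# Toy sentiment lexicon: sign encodes polarity, magnitude encodes strength.
LEXICON = {
    "boring": -1.0,
    "good": 1.0,
    "brilliant": 2.0,   # stronger positive than "good"
}

def classify(review: str) -> str:
    """Sum the SO scores of evaluative terms and map the total to a class."""
    words = review.lower().replace(".", " ").replace(",", " ").split()
    score = sum(LEXICON.get(w, 0.0) for w in words)
    return "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

review = "It is quite boring... the acting is brilliant, especially Massimo Troisi."
print(classify(review))  # "brilliant" (+2.0) outweighs "boring" (-1.0) -> Positive
```

This is deliberately naive: it ignores negation, intensifiers such as "quite", and aspect (the movie vs. the acting), which is precisely what makes review-level sentiment classification challenging.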


Author(s):  
Valérie Saugera

This chapter presents and justifies the use of both a dictionary corpus and a newspaper corpus. The dictionary corpus is used because of the role of the dictionary as linguistic authority in France, and the stable status of the Anglicisms included in it. For the newspaper data, the study benefited from the French text-mining tool named Sulci, originally designed for the corpus and thesaurus analysis of the daily newspaper Libération, which allowed extraction of all the dictionary-unattested forms in one year’s issues. In fact, this study is the first to use an electronic corpus to analyze the influence exerted by English over French. The advantages and disadvantages of the semi-automatic extraction of only dictionary-unsanctioned words of English origin are discussed in detail. The chapter includes the selection criteria for what counts as a borrowed item to be collected in the appended database.
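The core of the semi-automatic extraction described above — keeping only word forms unattested in the reference dictionary — can be sketched in a few lines. The word lists here are illustrative assumptions; the actual study used the Sulci tool on Libération's archives, not this code.

```python
# Toy reference dictionary of attested French forms (illustrative only).
DICTIONARY = {"le", "match", "est", "un", "grand", "succès"}

def unattested(tokens):
    """Return corpus tokens absent from the reference dictionary,
    i.e. candidates for dictionary-unsanctioned borrowings."""
    return [t for t in tokens if t.lower() not in DICTIONARY]

corpus_tokens = ["Le", "match", "est", "un", "grand", "buzz"]
print(unattested(corpus_tokens))  # ['buzz'] -- a candidate Anglicism
```

As the chapter notes, such a filter only produces candidates: manual selection criteria are still needed to decide which unattested forms count as borrowed items for the database.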

