Metadata Enrichment
Recently Published Documents

TOTAL DOCUMENTS: 31 (FIVE YEARS: 9)
H-INDEX: 4 (FIVE YEARS: 2)
2021
Author(s): Ivan Kovačič, David Bajs, Milan Ojsteršek

This paper describes the methodology of data preparation and the analysis of text similarity required for plagiarism detection on the CORE data set. First, we used the Crossref API and the Microsoft Academic Graph data set for metadata enrichment and for the elimination of duplicate documents from the CORE 2018 data set. In the second step, we extracted 4-gram sequences of words from every document and transformed them into SHA-256 hash values. The features retrieved with this hashing algorithm are compared, and the result is a list of document pairs and the percentage of coverage between their features. In the third step, called pairwise feature-based exhaustive analysis, pairs of documents are checked using the longest common substring.
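As a rough illustration of the second step, the following Python sketch (not the authors' implementation) hashes 4-gram word sequences with SHA-256 and reports the percentage of shared hashes between two documents; the function names and the exact coverage measure are illustrative assumptions.

```python
import hashlib
import re

def ngram_hashes(text, n=4):
    """Tokenize a document and hash every n-gram of words with SHA-256."""
    words = re.findall(r"\w+", text.lower())
    return {
        hashlib.sha256(" ".join(words[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(words) - n + 1)
    }

def coverage(doc_a, doc_b, n=4):
    """Percentage of doc_a's n-gram hashes that also occur in doc_b."""
    hashes_a, hashes_b = ngram_hashes(doc_a, n), ngram_hashes(doc_b, n)
    if not hashes_a:
        return 0.0
    return 100.0 * len(hashes_a & hashes_b) / len(hashes_a)

# A high coverage score would flag the pair for the exhaustive
# longest-common-substring check described as the third step.
print(coverage("the quick brown fox jumps over the lazy dog",
               "a quick brown fox jumps over a sleeping dog"))
```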


Author(s): Chaitanya Kanchibhotla, Pruthvi Raj Venkatesh, DVLN Somayajulu, Radhakrishna P, ...
Many industries, such as oil, construction, banking, and insurance, hold substantial historical physical data. Companies store this data in physical warehouses that are geographically distributed and usually managed by record management companies. Storing large volumes of historical physical data poses critical challenges, such as increased maintenance cost, long recovery times, and unsearchable content. To address these challenges, many companies digitize this data and consolidate it into cloud repositories as part of their Digital Transformation (DT) journey. This DT process introduces further technical challenges when dealing with poor scans, huge file sizes, geographically distributed files, and confidential documents. Although there are options to resolve each of these limitations individually, there are no frameworks that address digitization and historical data storage in their entirety, and existing options cannot handle large numbers of documents with variable file sizes. This paper presents a generic cloud-based high-performance computing framework for knowledge extraction, comprising document classification based on neural networks and particle swarm optimization (PSO), data extraction, metadata enrichment, image enhancement using image processing (IP) techniques, and high data availability to users through cloud-based search. The proposed framework is executed on two cloud providers, Azure and AWS, to test its efficacy.
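The paper describes the framework at the architectural level only; purely as a generic illustration of the particle swarm optimization component it mentions, here is a minimal PSO sketch in Python/NumPy that minimizes an arbitrary objective (for instance, a classifier's validation error). The objective, parameter values, and function names are hypothetical and not taken from the paper.

```python
import numpy as np

def pso(objective, dim, n_particles=20, iters=50, bounds=(-1.0, 1.0),
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimization: minimizes `objective` over `dim` dimensions."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_particles, dim))    # particle positions
    vel = np.zeros_like(pos)                               # particle velocities
    pbest = pos.copy()                                     # personal best positions
    pbest_val = np.array([objective(p) for p in pos])      # personal best scores
    gbest = pbest[pbest_val.argmin()].copy()               # global best position

    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Hypothetical usage: treat the 2-D vector as classifier hyperparameters and
# minimize a stand-in validation loss (a simple quadratic here).
best, score = pso(lambda x: np.sum((x - 0.3) ** 2), dim=2)
print(best, score)
```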


Information, 2021, Vol 12 (2), pp. 64
Author(s): Eirini Kaldeli, Orfeas Menis-Mastromichalakis, Spyros Bekiaris, Maria Ralli, Vassilis Tzouvaras, ...

The lack of granular and rich descriptive metadata severely affects the discoverability and usability of cultural heritage collections aggregated and served through digital platforms, such as Europeana, thus compromising the user experience. In this context, metadata enrichment services based on automated analysis and feature extraction, along with crowdsourcing annotation services, offer a great opportunity for improving the metadata quality of digital cultural content in a scalable way, while at the same time engaging different user communities and raising awareness about cultural heritage assets. To address this need, we propose CrowdHeritage, an open enrichment and crowdsourcing ecosystem that supports an end-to-end workflow for the improvement of cultural heritage metadata by employing crowdsourcing and by combining machine and human intelligence to serve the particular requirements of the cultural heritage domain. The proposed solution repurposes, extends, and combines in an innovative way general-purpose state-of-the-art AI tools, semantic technologies, and aggregation mechanisms with a novel crowdsourcing platform, so as to support seamless enrichment workflows for improving the quality of CH metadata in a scalable, cost-effective, and amusing way.


2020, Vol 36 (20), pp. 5120-5121
Author(s): Arjun Magge, Davy Weissenbacher, Karen O’Connor, Tasnia Tahsin, Graciela Gonzalez-Hernandez, ...

Abstract
Summary: We present GeoBoost2, a natural language processing pipeline for extracting the locations of infected hosts to enrich metadata in nucleotide sequence repositories, such as the National Center for Biotechnology Information’s GenBank, for downstream analyses including phylogeography and genomic epidemiology. The increasing number of pathogen sequences requires complementary information extraction methods for focused research, including surveillance within countries and across borders. In this article, we describe the enhancements over our earlier release, including improved end-to-end extraction performance and speed, a fully functional web interface, and state-of-the-art methods for location extraction using deep learning.
Availability and implementation: The application is freely available on the web at https://zodo.asu.edu/geoboost2. Source code, usage examples and annotated data for GeoBoost2 are freely available at https://github.com/ZooPhy/geoboost2.
Supplementary information: Supplementary data are available at Bioinformatics online.
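GeoBoost2 itself relies on deep learning models tailored to GenBank records; purely as a generic illustration of location extraction from free-text sequence metadata, the following sketch applies off-the-shelf spaCy named entity recognition. The model name and the sample record text are assumptions, and the small English model must be installed separately.

```python
# Generic illustration (not the GeoBoost2 pipeline): pull candidate locations
# from free-text sequence metadata with off-the-shelf spaCy NER.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

record_note = ("Influenza A virus isolated from a patient hospitalized "
               "in Phnom Penh, Cambodia, during the 2013 outbreak.")

doc = nlp(record_note)
locations = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
print(locations)  # e.g. ['Phnom Penh', 'Cambodia']
```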


2020
Author(s): Andreas Niekler, Sven Blanck, Marc Kaulisch

With an ever-increasing amount of data, it is essential for many systems that documents can be retrieved efficiently. The process of information retrieval can be supported by metadata enrichment of the documents. The aim of this work is to make scientific publications and project descriptions, consisting of titles, abstracts and bibliographical references, easier to find. We therefore investigate text analytical methods such as keyword extraction algorithms (TFIDF, Log-Likelihood, RAKE, TAKE and KECNW) and classification approaches using an SVM with ensembles of classifier chains (Web of Science and GEPRIS categories as taxonomies) and compare their quality. We present an altered and optimized keyword extraction algorithm and a supervised subject and keyword classification approach which are, to our knowledge, among the first automatic applications of this kind in informetrics and scientific information retrieval. The most promising methods are employed and the extracted information is attached to the documents as metadata. This metadata supports a search query, using pseudo-relevance feedback, to obtain further relevant search results, and can also be used to derive profiles for authors, faculties, etc. The concepts developed here will serve as a basis for the Leipzig University Research Information System.
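As a concrete point of reference for the simplest of the compared methods, here is a minimal TF-IDF keyword extraction sketch using scikit-learn; it is not the authors' altered and optimized algorithm, and the sample documents are invented.

```python
# Minimal TF-IDF keyword extraction with scikit-learn: pick the highest-weighted
# terms of each document as its keywords.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Metadata enrichment improves retrieval of scientific publications.",
    "Keyword extraction and classification support research information systems.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

for i, doc_vector in enumerate(tfidf.toarray()):
    top = doc_vector.argsort()[::-1][:3]  # three highest-weighted terms
    print(f"doc {i}:", [terms[j] for j in top])
```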


Author(s): Ahmad Aghaebrahimian, Andy Stauder, Michael Ustaszewski

Abstract The extraction of large amounts of multilingual parallel text from web resources is a widely used technique in natural language processing. However, automatically collected parallel corpora usually lack precise metadata, which are crucial to accurate data analysis and interpretation. The combination of automated extraction procedures and manual metadata enrichment may help address this issue. Wikipedia is a promising candidate for the exploration of the potential of said combination of methods because it is a rich source of translations in a large number of language pairs and because its open and collaborative nature makes it possible to identify and contact the users who produce translations. This article tests to what extent translated texts automatically extracted from Wikipedia by means of neural networks can be enriched with pertinent metadata through a self-submission-based user survey. Special emphasis is placed on data usefulness, defined in terms of a catalogue of previously established assessment criteria, most prominently metadata quality. The results suggest that from a quantitative perspective, the proposed methodology is capable of capturing metadata otherwise not available. At the same time, the crowd-based collection of data and metadata may face important technical and social limitations.
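As a generic illustration (not the article's specific neural extraction model) of how candidate translation pairs mined from Wikipedia might be scored, the following sketch compares multilingual sentence embeddings with cosine similarity; the chosen model name and the example sentences are assumptions.

```python
# Score candidate translation pairs with multilingual sentence embeddings;
# a high cosine score suggests the pair is a likely translation worth keeping.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english = ["The castle was built in the twelfth century."]
german = ["Die Burg wurde im zwölften Jahrhundert erbaut.",
          "Der Artikel beschreibt ein anderes Thema."]

scores = util.cos_sim(model.encode(english), model.encode(german))
print(scores)
```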


Information, 2019, Vol 10 (4), pp. 149
Author(s): Phivos Mylonas, Yorghos Voutos, Anastasia Sofou

It took some time indeed, but the research evolution and transformations that have occurred in the smart agriculture field over recent years have made it a main topic of interest in the so-called Internet of Things (IoT) domain. Undoubtedly, our era is characterized by the mass production of huge amounts of data, information, and content deriving from many different sources, mostly IoT devices and sensors, but also from environmentalists, agronomists, winemakers, or plain farmers and interested stakeholders themselves. As this is an emerging field, only a small part of this rich content has been aggregated so far in digital platforms that serve as cross-domain hubs, and these typically offer limited usability and accessibility of the actual content because of insufficient data and metadata availability and quality. Building on our recent involvement in a precision viticulture environment, and in an effort to make the notion of smart agriculture in the winery domain more accessible to and reusable by the general public, we introduce herein the model of an aggregation platform that provides enhanced services and enables human-computer collaboration for agricultural data annotation and enrichment. In principle, the proposed architecture goes beyond existing digital content aggregation platforms by advancing digital data through the combination of artificial intelligence automation and creative user engagement, thus facilitating its accessibility, visibility, and re-use. In particular, by using image and free-text analysis methodologies for automatic metadata enrichment, in combination with human expertise for manual enrichment, it offers a cornerstone for future researchers focusing on improving the quality of digital agricultural information analysis and its presentation, thus establishing new ways for its efficient exploitation at a larger scale, with benefits for both the agricultural and the consumer domains.


ICAME Journal, 2019, Vol 43 (1), pp. 83-122
Author(s): Peter Petré, Lynn Anthonissen, Sara Budts, Enrique Manjavacas, Emma-Louise Silva, ...

Abstract The present article provides a detailed description of the corpus of Early Modern Multiloquent Authors (EMMA), as well as two small case studies that illustrate its benefits. As a large-scale specialized corpus, EMMA tries to strike the right balance between big data and sociolinguistic coverage. It comprises the writings of 50 carefully selected authors across five generations, mostly drawn from 17th-century London society. EMMA enables the study of language as both a social and a cognitive phenomenon and allows us to explore the interaction between the individual and aggregate levels. The first part of the article is a detailed description of EMMA’s first release as well as the sociolinguistic and methodological principles that underlie its design and compilation. We cover the conceptual decisions and practical implementations at various stages of the compilation process: from text markup, encoding, and data preprocessing to metadata enrichment and verification. In the second part, we present two small case studies to illustrate how rich contextualization can guide the interpretation of quantitative corpus-linguistic findings. The first case study compares the past-tense formation of strong verbs in writers without access to higher education to that of writers with extensive training in Latin. The second case study relates s/th-variation in the language of a single writer, Margaret Cavendish, to major shifts in her personal life.
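As a toy illustration of the kind of frequency count behind the s/th-variation case study, the following sketch tallies -(e)s versus -(e)th verb endings in a text sample; the heuristics and the sample sentence are simplifying assumptions and do not reflect EMMA's actual annotation pipeline.

```python
# Crude tally of third-person singular -(e)s versus -(e)th endings; a real study
# would use part-of-speech tagging to exclude nouns and other false matches.
import re

def s_th_counts(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    th_forms = sum(1 for t in tokens if t.endswith("eth") and len(t) > 4)
    s_forms = sum(1 for t in tokens if t.endswith("es") and len(t) > 4)
    return s_forms, th_forms

sample = "She writes often, yet he speaketh seldom and believeth little."
s_forms, th_forms = s_th_counts(sample)
print(f"-s forms: {s_forms}, -th forms: {th_forms}")
```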

