User-Defined Semantic Enrichment of Full-Text Documents: Experiences and Lessons Learned

Author(s): Annika Hinze, Ralf Heese, Alexa Schlegel, Markus Luczak-Rösch

Author(s): I. P. Komenda

The publication describes the initial stages of adding bibliographic records for electronic periodicals from the eLIBRARY.RU platform, and for electronic serials subscribed to by the Central Science Library of the NAS of Belarus, to the electronic catalogue. It also considers the work of attaching full-text documents and tables of contents of periodicals to the bibliographic records.


1991, Vol 25 (2), pp. 119-131
Author(s): Suliman Al‐Hawamdeh, Geoff Smith, Peter Willett

2014, Vol 35 (4/5), pp. 293-307
Author(s): Mark Edward Phillips, Daniel Gelaw Alemneh, Brenda Reyes Ayala

Purpose – Increasingly, higher education institutions worldwide accept only electronic versions of their students' theses and dissertations. These electronic theses and dissertations (ETDs) frequently feature embedded URLs in the body, footnotes, and references sections of the document. Additionally, the web as an ETD subject appears to be on an upward trajectory as the web becomes an increasingly important part of everyday life. The paper aims to discuss these issues.
Design/methodology/approach – The authors analyzed URL references in 4,335 ETDs in the UNT ETD collection. Links were extracted from the full-text documents, cleaned and canonicalized, deconstructed into the subparts of a URL, and then indexed with the full-text indexer Solr. Queries were run against the Solr index to aggregate the data and generate overall statistics and trends. The resulting data were analyzed for patterns and trends within a variety of groupings.
Findings – The share of ETDs at the University of North Texas that include URL references has increased over the past 14 years, from 23 percent in 1999 to 80 percent in 2012. URLs are included in ETDs in the majority of cases: 62 percent of the publications analyzed in this work contained URLs.
Originality/value – This research establishes that web resources are widely cited in UNT's ETDs and that citation of these resources is growing. Further, it provides a preliminary framework for technical methods appropriate for analyzing similar data, which may be applicable to other sets of documents or subject areas.
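The abstract names the pipeline stages — extraction, cleaning and canonicalization, deconstruction into URL subparts, indexing — but not the code. Below is a minimal Python sketch of the first three stages using only the standard library; the regular expression, function names, and canonicalization rules are illustrative assumptions rather than the authors' implementation, and the Solr indexing step is omitted.

```python
import re
from urllib.parse import urlsplit

# Illustrative pattern; real extraction from PDF text needs more care.
URL_PATTERN = re.compile(r"https?://[^\s<>'\")\]]+", re.IGNORECASE)

def extract_urls(text):
    """Pull candidate URLs out of extracted full text."""
    return URL_PATTERN.findall(text)

def canonicalize(url):
    """Normalize a URL: strip trailing punctuation, lowercase scheme and host."""
    url = url.rstrip(".,;:")
    parts = urlsplit(url)
    return parts._replace(scheme=parts.scheme.lower(),
                          netloc=parts.netloc.lower()).geturl()

def deconstruct(url):
    """Split a URL into subparts of the kind one would index in Solr."""
    parts = urlsplit(url)
    return {
        "url": url,
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "tld": parts.netloc.rsplit(".", 1)[-1] if "." in parts.netloc else "",
    }

if __name__ == "__main__":
    sample = "See http://Example.COM/archive/page.html and https://unt.edu/etd."
    for u in extract_urls(sample):
        print(deconstruct(canonicalize(u)))
```

Storing the deconstructed subparts as separate fields is what makes the aggregate queries in the paper (by host, by top-level domain, by year) straightforward in a full-text indexer.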


1994, Vol 50 (2), pp. 67-98
Author(s): David Ellis, Jonathan Furner‐Hines, Peter Willett

2021, Vol 14 (4), pp. 1-22
Author(s): Michael J. May, Efrat Kantor, Nissim Zror

Digitizing cemeteries and gravestones aids cultural preservation, genealogical search, dark tourism, and historical analysis. CemoMemo, an app and associated website, enables bottom-up, crowd-sourced digitization of cemeteries, categorizes and indexes gravestone data and metadata, and offers powerful full-text and numerical search. To date, CemoMemo holds nearly 5,000 graves from over 130 cemeteries in 10 countries, the majority being Jewish graves in Israel and the USA. We detail CemoMemo's deployment and component models, technical attributes, and user models. CemoMemo went through two design iterations and architectures; we describe its initial architecture and the reasons that led to the change. To show its utility, we use CemoMemo's data for a historical analysis of two Jewish cemeteries from a similar period, eliciting cultural and ethnological differences between them. We present lessons learned from developing and operating CemoMemo for over a year and point to future directions of development.
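The abstract does not disclose CemoMemo's data model or search implementation. As a purely hypothetical illustration of combining full-text and numerical search over gravestone records, consider the following sketch; every field and function name here is invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class GraveRecord:
    # Hypothetical schema; the actual CemoMemo model is not given in the abstract.
    cemetery: str
    name: str
    inscription: str               # transcribed gravestone text (full-text searchable)
    death_year: int                # numeric field (range searchable)
    tags: list = field(default_factory=list)

def search(records, text=None, year_from=None, year_to=None):
    """Naive combined full-text substring and numeric range search."""
    hits = []
    for r in records:
        if text and text.lower() not in (r.name + " " + r.inscription).lower():
            continue
        if year_from is not None and r.death_year < year_from:
            continue
        if year_to is not None and r.death_year > year_to:
            continue
        hits.append(r)
    return hits
```

A production system would of course delegate both query types to a proper search index; the sketch only shows how the two search modes named in the abstract can coexist over one record type.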


2010, Vol 7 (3), pp. 400-411
Author(s): Artemy Kolchinsky, Alaa Abi-Haidar, Jasleen Kaur, Ahmed Abdeen Hamed, Luis M. Rocha

2020
Author(s): Bernhard Rieder

This chapter investigates early attempts in information retrieval to tackle the full text of document collections. Underpinning a large number of contemporary applications, from search to sentiment analysis, the concepts and techniques pioneered by Hans Peter Luhn, Gerard Salton, Karen Spärck Jones, and others involve particular framings of language, meaning, and knowledge. They also introduce some of the fundamental mathematical formalisms and methods running through information ordering, preparing the extension to digital objects other than text documents. The chapter discusses the considerable technical expressivity that comes out of the sprawling landscape of research and experimentation that characterizes the early decades of information retrieval. This includes the emergence of the conceptual construct and intermediate data structure that is fundamental to most algorithmic information ordering: the feature vector.
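The chapter is historical and conceptual, but the feature vector it foregrounds is easy to make concrete. A minimal sketch, not drawn from the chapter itself, of representing documents as term-frequency vectors over a shared vocabulary:

```python
from collections import Counter

def tokenize(text):
    """Crude tokenization; early IR systems used stemming and stop lists too."""
    return text.lower().split()

def build_vocabulary(documents):
    """Map each distinct term across the collection to a vector dimension."""
    vocab = sorted({term for doc in documents for term in tokenize(doc)})
    return {term: i for i, term in enumerate(vocab)}

def feature_vector(document, vocab):
    """Represent a document as a term-frequency vector over the vocabulary."""
    counts = Counter(tokenize(document))
    return [counts.get(term, 0) for term in vocab]

docs = ["full text retrieval", "text documents and text vectors"]
vocab = build_vocabulary(docs)
print([feature_vector(d, vocab) for d in docs])
```

Weighting schemes such as tf-idf, associated with Salton and Spärck Jones, refine these raw counts; the underlying data structure remains the same vector over vocabulary dimensions.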


2019, Vol 18 (6), pp. 1381-1406
Author(s): Lukáš Bureš, Ivan Gruber, Petr Neduchal, Miroslav Hlaváč, Marek Hrúz

An algorithm, divided into multiple modules, for generating images of full-text documents is presented. These images can be used to train, test, and evaluate models for Optical Character Recognition (OCR). The algorithm is modular; individual parts can be changed and tweaked to generate the desired images. A method for obtaining background images of paper from already digitized documents is described. For this, a novel approach based on a Variational AutoEncoder (VAE) is used to train a generative model, enabling on-the-fly generation of background images similar to the training ones. The module for printing text uses large text corpora, a font, and suitable positional and brightness character noise to obtain believable results resembling natural-looking aged documents. A few types of page layout are supported. The system generates a detailed, structured annotation of the synthesized image. Tesseract OCR is used to compare real-world images with the generated ones; the recognition rates are very similar, indicating that the synthetic images have the proper appearance. Moreover, the errors made by the OCR system are very similar in both cases. From the generated images, a fully convolutional encoder-decoder neural network architecture was trained for semantic segmentation of individual characters. With this architecture, a recognition accuracy of 99.28% is reached on a test set of synthetic documents.
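The printing module is described only at a high level. The following Pillow-based sketch suggests how character-level positional and brightness noise might be applied during rendering; all parameter names and values are illustrative guesses, not the authors' implementation, and the VAE-generated backgrounds are replaced here by a flat paper-like tone.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_noisy_line(text, size=(600, 40), jitter=1, ink=60, ink_noise=40):
    """Render text character by character with positional and brightness noise,
    in the spirit of the paper's printing module (parameters are assumptions)."""
    img = Image.new("L", size, color=235)      # flat stand-in for a VAE background
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    x = 10
    for ch in text:
        dx = random.randint(-jitter, jitter)   # positional noise per character
        dy = random.randint(-jitter, jitter)
        shade = min(255, max(0, ink + random.randint(0, ink_noise)))  # brightness noise
        draw.text((x + dx, 12 + dy), ch, fill=shade, font=font)
        x += draw.textlength(ch, font=font)
    return img

if __name__ == "__main__":
    render_noisy_line("synthetic full-text document line").save("line.png")
```

Because each character is placed individually, the generator can also emit its position as ground-truth annotation, which is what makes the synthesized images usable for training the character-segmentation network described above.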

