Journal of Data Mining & Digital Humanities
Latest Publications


TOTAL DOCUMENTS

76
(FIVE YEARS 39)

H-INDEX

0
(FIVE YEARS 0)

Published By Centre Pour La Communication Scientifique Directe (CCSD)

2416-5999

2021 ◽  
Vol 2021 (HistoInformatics) ◽  
Author(s):  
Pit Schneider

Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques, morphological image operations and horizontal histogram projections. The method was developed to be applied on a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or presence of noise. For that reason, the segmenter in question could be of particular interest for cultural institutions, that want access to robust line bounding boxes for a given historic document. Because of the promising segmentation results that are joined by low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, in the context of the initiative of reprocessing their historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in terms of accuracy and speed, comparing it to the segmentation algorithm bundled with the used open source OCR software.


2021 ◽  
Vol Atelier Digit_Hum (Sciences of Antiquity and...) ◽  
Author(s):  
Julie Giovacchini ◽  
Laurent Capron

Version soumise et acceptée pour publication dans le JDMDH We present here both some of our thoughts on methodology in relation to the specific constraints that complexify the ways of structuring and accessing bibliographical data in the Sciences of Antiquity, and the solutions adopted by the IPhiS-CIRIS project for dealing with these constraints. The project began in 2014 in a general scientific environment that was still being standardised and structured, with digital bibliographical resources in this disciplinary field becoming increasingly numerous, although of uneven quality and hard to access and/or private.


2021 ◽  
Vol Atelier Digit_Hum (Data deluge: which skills for...) ◽  
Author(s):  
Marie-Laure Massot ◽  
Agnès Tricoche

This article presents a study of the French-speaking digital humanities. It is based on the experience of two research engineers from the French National Center for Scientific Research (CNRS) who have been studying these issues for the last ten years. They conducted a survey at the École Normale Supérieure (ENS-Paris) which enabled them to draw up an overview of the transformation of the profession of humanities and social sciences research engineers in the context of the digital humanities. The Digit_Hum initiative, which they run in parallel with their respective activities at the ENS, also provided information for this overview thanks to its role as a space for discussion about the digital humanities along with training and structuring of this field at the ENS and the Université Paris Sciences & Lettres (PSL). Cet article est une réflexion sur les humanités numériques en contexte francophone. Elle s’appuie sur l'expérience de deux ingénieures du Centre National de la Recherche Scientifique travaillant sur ces questions depuis une dizaine d'années. À travers l'enquête qu'elles ont menée à l'École normale supérieure (ENS-Paris), elles dressent un panorama de la transformation du métier d'ingénieur(e) en sciences humaines et sociales dans le contexte des humanités numériques. L'initiative Digit_Hum, qu'elles animent en parallèle de leurs activités respectives à l'École, nourrit également ce témoignage en constituant un espace de discussions, de formations et de structuration des humanités numériques au sein de l'ENS et de l’Université Paris Sciences & Lettres.


2021 ◽  
Vol 2021 ◽  
Author(s):  
Cyprien Plateau-Holleville ◽  
Enzo Bonnot ◽  
Franck Gechter ◽  
Laurent Heyberger

International audience Vital records are rich of meaningful historical data concerning city as well as countryside inhabitants that can be used, among others, to study former populations and then reveal the social, economic and demographic characteristics of those populations. However, these studies encounter a main difficulty for collecting the data needed since most of these records are scanned documents that need a manual transcription step in order to gather all the data and start exploiting it from a historical point of view. This step consequently slows down the historical research and is an obstacle to a better knowledge of the population habits depending on their social conditions. Therefore in this paper, we present a modular and self-sufficient analysis pipeline using state-of-the-art algorithms mostly regardless of the document layout that aims to automate this data extraction process.


2021 ◽  
Vol Special Issue on Data Science... ◽  
Author(s):  
Jacky Akoka ◽  
Isabelle Comyn-Wattiau ◽  
Stéphane LamassÉ ◽  
Cédric Du Mouza

International audience Prosopographic databases, which allow the study of social groups through their bibliography, are used today by a significant number of historians. Computerization has allowed intensive and large-scale exploitation of these databases. The modeling of these proposopographic databases has given rise to several data models. An important problem is to ensure a level of quality of the stored information. In this article , we propose a generic data model allowing to describe most of the existing prosopographic databases and to enrich them by integrating several quality concepts such as uncertainty, reliability, accuracy or completeness.


2021 ◽  
Vol 2021 (Digital humanities in...) ◽  
Author(s):  
Jean-Baptiste Camps ◽  
Simon Gabay ◽  
Paul Fièvre ◽  
Thibault Clérice ◽  
Florian Cafiero

This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and a CRF tagger allows to achieve accuracies beyond the current state-of-the art on the in-domain test, and proves to be robust during out-of-domain tests, i.e.up to 20th c.novels.


2021 ◽  
Vol HistoInformatics (HistoInformatics) ◽  
Author(s):  
Arlene Casey ◽  
Mike Bennett ◽  
Richard Tobin ◽  
Claire Grover ◽  
Iona Walker ◽  
...  

The design of models that govern diseases in population is commonly built on information and data gathered from past outbreaks. However, epidemic outbreaks are never captured in statistical data alone but are communicated by narratives, supported by empirical observations. Outbreak reports discuss correlations between populations, locations and the disease to infer insights into causes, vectors and potential interventions. The problem with these narratives is usually the lack of consistent structure or strong conventions, which prohibit their formal analysis in larger corpora. Our interdisciplinary research investigates more than 100 reports from the third plague pandemic (1894-1952) evaluating ways of building a corpus to extract and structure this narrative information through text mining and manual annotation. In this paper we discuss the progress of our ongoing exploratory project, how we enhance optical character recognition (OCR) methods to improve text capture, our approach to structure the narratives and identify relevant entities in the reports. The structured corpus is made available via Solr enabling search and analysis across the whole collection for future research dedicated, for example, to the identification of concepts. We show preliminary visualisations of the characteristics of causation and differences with respect to gender as a result of syntactic-category-dependent corpus statistics. Our goal is to develop structured accounts of some of the most significant concepts that were used to understand the epidemiology of the third plague pandemic around the globe. The corpus enables researchers to analyse the reports collectively allowing for deep insights into the global epidemiological consideration of plague in the early twentieth century.


2021 ◽  
Vol HistoInformatics (HistoInformatics) ◽  
Author(s):  
Raphaël Barman ◽  
Maud Ehrmann ◽  
Simon Clematide ◽  
Sofia Ares Oliveira ◽  
Frédéric Kaplan

The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research work seeking to automatically process facsimiles and extract information thereby are multiplying with, as a first essential step, document layout analysis. If the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain with, among others, the use of finer-grained segmentation typologies and the consideration of complex, heterogeneous documents such as historical newspapers. Besides, most approaches consider visual features only, ignoring textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among others, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models in comparison to a strong visual baseline, as well as better robustness to high material variance.


2021 ◽  
Vol HistoInformatics (HistoInformatics) ◽  
Author(s):  
Kangying Li ◽  
Biligsaikhan Batjargal ◽  
Akira Maeda

Collector's seals provide important clues about the ownership of a book. They contain much information pertaining to the essential elements of ancient materials and also show the details of possession, its relation to the book, the identity of the collectors and their social status and wealth, amongst others. Asian collectors have typically used artistic ancient characters rather than modern ones to make their seals. In addition to the owner's name, several other words are used to express more profound meanings. A system that automatically recognizes these characters can help enthusiasts and professionals better understand the background information of these seals. However, there is a lack of training data and labelled images, as samples of some seals are scarce and most of them are degraded images. It is necessary to find new ways to make full use of such scarce data. While these data are available online, they do not contain information on the characters' position. The goal of this research is to assist in obtaining more labelled data through user interaction and provide retrieval tools that use only standard character typefaces extracted from font files. In this paper, a character segmentation method is proposed to predict the candidate characters' area without any labelled training data that contain character coordinate information. A retrieval-based recognition system that focuses on a single character is also proposed to support seal retrieval and matching. The experimental results demonstrate that the proposed character segmentation method performs well on Asian collector's seals, with 85% of the test data being correctly segmented.


2021 ◽  
Vol HistoInformatics (HistoInformatics) ◽  
Author(s):  
Eva Pfanzelter ◽  
Sarah Oberbichler ◽  
Jani Marjanen ◽  
Pierre-Carl Langlais ◽  
Stefan Hechl

International audience Many libraries offer free access to digitised historical newspapers via user interfaces. After an initial period of search and filter options as the only features, the availability of more advanced tools and the desire for more options among users has ushered in a period of interface development. However, this raises a number of open questions and challenges. For example, how can we provide interfaces for different user groups? What tools should be available on interfaces and how can we avoid too much complexity? What tools are helpful and how can we improve usability? This paper will not provide definite answers to these questions, but it gives an insight into the difficulties, challenges and risks of using interfaces to investigate historical newspapers. More importantly, it provides ideas and recommendations for the improvement of user interfaces and digital tools.


Sign in / Sign up

Export Citation Format

Share Document