Journal of Data Mining & Digital Humanities

Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

Text line segmentation is one of the preliminary stages of modern optical character recognition systems. The algorithmic approach proposed in this paper was designed for exactly this purpose. Its main characteristic is the combination of two different techniques: morphological image operations and horizontal histogram projections. The method was developed for a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or noise. For that reason, the segmenter could be of particular interest to cultural institutions that want robust line bounding boxes for a given historic document. Because it combines promising segmentation results with low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg as part of the initiative to reprocess its historic newspaper collection. The contribution of this paper is to outline the approach and to evaluate its gains in accuracy and speed compared with the segmentation algorithm bundled with the open-source OCR software used.
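To make the two techniques named in this abstract concrete, the following is a minimal sketch of how a horizontal morphological dilation and a horizontal histogram projection could be combined to find line bounding boxes. It is not the authors' implementation: the function names, the structuring-element width, and the `min_ink` threshold are assumptions chosen for illustration, and the input is assumed to be an already binarised page (True = ink).

```python
import numpy as np


def smear_horizontally(binary, width):
    """Dilate ink along each row with a 1 x width structuring element.

    This is a simple morphological operation that closes gaps between
    words so that each text line forms one connected horizontal band.
    """
    half = width // 2
    padded = np.pad(binary, ((0, 0), (half, half)), mode="constant")
    out = np.zeros_like(binary)
    for shift in range(width):
        out |= padded[:, shift:shift + binary.shape[1]]
    return out


def find_line_bands(binary, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, from the
    horizontal projection profile (ink pixels counted per row)."""
    profile = binary.sum(axis=1)
    is_text = profile >= min_ink
    # Pad with False so that runs touching the page border are closed.
    padded = np.concatenate(([False], is_text, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]
    return list(zip(starts.tolist(), ends.tolist()))


def line_boxes(binary, bands):
    """Turn row bands into (left, top, right, bottom) boxes, with right
    and bottom exclusive, using the ink columns inside each band."""
    boxes = []
    for top, bottom in bands:
        cols = np.flatnonzero(binary[top:bottom].any(axis=0))
        if cols.size:
            boxes.append((int(cols[0]), top, int(cols[-1]) + 1, bottom))
    return boxes


# Hypothetical toy page: two text lines, the first made of two "words".
page = np.zeros((12, 20), dtype=bool)
page[2:5, 3:9] = True
page[2:5, 12:17] = True
page[8:10, 1:6] = True

smeared = smear_horizontally(page, width=5)
bands = find_line_bands(smeared, min_ink=1)
print(line_boxes(page, bands))
```

On real scans, raising `min_ink` above 1 is one way to keep isolated noise pixels from being reported as text rows; the value would need tuning per collection.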

Publishing open-access bibliographical data on Ancient Greek and Latin texts: challenges, constraints, progression

Journal of Data Mining & Digital Humanities, 2021, Vol Atelier Digit_Hum (Sciences of Antiquity and...)

DOI: 10.46298/jdmdh.7530

Author(s): Julie Giovacchini, Laurent Capron

Keyword(s): Open Access, Ancient Greek, Disciplinary Field, General Scientific, Scientific Environment

Version submitted and accepted for publication in JDMDH. We present here some of our thoughts on methodology, in relation to the specific constraints that complicate the structuring of and access to bibliographical data in the Sciences of Antiquity, together with the solutions adopted by the IPhiS-CIRIS project for dealing with these constraints. The project began in 2014 in a general scientific environment that was still being standardised and structured, with digital bibliographical resources in this disciplinary field becoming increasingly numerous, although of uneven quality and hard to access and/or private.

The renewal of the digital humanities. An overview of the transformation of professions in the humanities and social sciences

Journal of Data Mining & Digital Humanities, 2021, Vol Atelier Digit_Hum (Data deluge: which skills for...)

DOI: 10.46298/jdmdh.7552

Author(s): Marie-Laure Massot, Agnès Tricoche

Keyword(s): Social Sciences, Digital Humanities, Scientific Research, French Speaking, Sciences Humaines, Humanities And Social Sciences, Sciences Humaines Et Sociales

This article presents a study of the French-speaking digital humanities. It is based on the experience of two research engineers from the French National Center for Scientific Research (CNRS) who have been studying these issues for the last ten years. They conducted a survey at the École Normale Supérieure (ENS-Paris), which enabled them to draw up an overview of the transformation of the profession of humanities and social sciences research engineers in the context of the digital humanities. The Digit_Hum initiative, which they run in parallel with their respective activities at the ENS, also informed this overview through its role as a space for discussion about the digital humanities, and for training in and structuring of this field at the ENS and the Université Paris Sciences & Lettres (PSL).

French vital records data gathering and analysis through image processing and machine learning algorithms

Journal of Data Mining & Digital Humanities, 2021, Vol 2021

DOI: 10.46298/jdmdh.7327

Author(s): Cyprien Plateau-Holleville, Enzo Bonnot, Franck Gechter, Laurent Heyberger

Keyword(s): Data Extraction, Data Gathering, Extraction Process, Point Of View, Machine Learning Algorithms, The Social, Vital Records, International Audience, Document Layout, Scanned Documents

Vital records are rich in meaningful historical data about city and countryside inhabitants. Among other uses, these data can serve to study former populations and reveal their social, economic and demographic characteristics. However, such studies face a major difficulty in collecting the data they need, since most of these records are scanned documents that require a manual transcription step before the data can be gathered and exploited from a historical point of view. This step slows down historical research and is an obstacle to a better knowledge of how the habits of a population depend on its social conditions. In this paper, we therefore present a modular and self-sufficient analysis pipeline that aims to automate this data extraction process, using state-of-the-art algorithms that are mostly independent of the document layout.

Conceptual modeling of prosopographic databases integrating quality dimensions

Journal of Data Mining & Digital Humanities, 2021, Vol Special Issue on Data Science...

DOI: 10.46298/jdmdh.5078

Author(s): Jacky Akoka, Isabelle Comyn-Wattiau, Stéphane Lamassé, Cédric Du Mouza

Keyword(s): Data Model, Large Scale, Social Groups, Conceptual Modeling, Generic Data, International Audience, Quality Dimensions, Generic Data Model, Stored Information

Prosopographic databases, which allow the study of social groups through their biographies, are used today by a significant number of historians. Computerization has allowed intensive and large-scale exploitation of these databases. The modeling of these prosopographic databases has given rise to several data models. An important problem is ensuring the quality of the stored information. In this article, we propose a generic data model that can describe most existing prosopographic databases and enrich them by integrating several quality concepts such as uncertainty, reliability, accuracy and completeness.

Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

Journal of Data Mining & Digital Humanities, 2021, Vol 2021 (Digital humanities in...)

DOI: 10.46298/jdmdh.6485

Author(s): Jean-Baptiste Camps, Simon Gabay, Paul Fièvre, Thibault Clérice, Florian Cafiero

Keyword(s): Neural Networks, French Literature, State Of The Art, Preliminary Step, Training Models, Pos Tagging, Current State, French Theatre, And Training

This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves to be robust in out-of-domain tests, i.e. on texts up to 20th-century novels.

Plague Dot Text: Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)

Journal of Data Mining & Digital Humanities, 2021, Vol HistoInformatics (HistoInformatics)

DOI: 10.46298/jdmdh.6071

Author(s): Arlene Casey, Mike Bennett, Richard Tobin, Claire Grover, Iona Walker, ...

Keyword(s): Text Mining, Interdisciplinary Research, Character Recognition, Optical Character Recognition, Statistical Data, Formal Analysis, Future Research, Syntactic Category, The Third, Optical Character

Models of how diseases spread in populations are commonly built on information and data gathered from past outbreaks. However, epidemic outbreaks are never captured in statistical data alone but are communicated by narratives, supported by empirical observations. Outbreak reports discuss correlations between populations, locations and the disease to infer insights into causes, vectors and potential interventions. The problem with these narratives is usually the lack of consistent structure or strong conventions, which prohibits their formal analysis in larger corpora. Our interdisciplinary research investigates more than 100 reports from the third plague pandemic (1894-1952), evaluating ways of building a corpus to extract and structure this narrative information through text mining and manual annotation. In this paper we discuss the progress of our ongoing exploratory project, how we enhance optical character recognition (OCR) methods to improve text capture, and our approach to structuring the narratives and identifying relevant entities in the reports. The structured corpus is made available via Solr, enabling search and analysis across the whole collection for future research dedicated, for example, to the identification of concepts. We show preliminary visualisations of the characteristics of causation and differences with respect to gender as a result of syntactic-category-dependent corpus statistics. Our goal is to develop structured accounts of some of the most significant concepts that were used to understand the epidemiology of the third plague pandemic around the globe. The corpus enables researchers to analyse the reports collectively, allowing for deep insights into the global epidemiological consideration of plague in the early twentieth century.

Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

Journal of Data Mining & Digital Humanities, 2021, Vol HistoInformatics (HistoInformatics)

DOI: 10.46298/jdmdh.6107

Author(s): Raphaël Barman, Maud Ehrmann, Simon Clematide, Sofia Ares Oliveira, Frédéric Kaplan

Keyword(s): Predictive Power, Research Work, Semantic Segmentation, Visual Features, Learning Techniques, Document Layout Analysis, Document Layout, Series Of Experiments, Textual Features, Extract Information

The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research work that seeks to automatically process facsimiles and extract information from them is multiplying, with document layout analysis as a first essential step. Although the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain, including the use of finer-grained segmentation typologies and the consideration of complex, heterogeneous documents such as historical newspapers. In addition, most approaches consider visual features only, ignoring the textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among other things, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models in comparison to a strong visual baseline, as well as better robustness to high material variance.

Character Segmentation in Asian Collector's Seal Imprints: An Attempt to Retrieval Based on Ancient Character Typeface

Journal of Data Mining & Digital Humanities, 2021, Vol HistoInformatics (HistoInformatics)

DOI: 10.46298/jdmdh.6102

Author(s): Kangying Li, Biligsaikhan Batjargal, Akira Maeda

Keyword(s): User Interaction, Recognition System, Essential Elements, Character Segmentation, Training Data, Background Information, Segmentation Method, Single Character, Degraded Images, Standard Character

Collector's seals provide important clues about the ownership of a book. They contain much information pertaining to the essential elements of ancient materials and also show the details of possession, its relation to the book, the identity of the collectors and their social status and wealth, amongst others. Asian collectors have typically used artistic ancient characters rather than modern ones to make their seals. In addition to the owner's name, several other words are used to express more profound meanings. A system that automatically recognizes these characters can help enthusiasts and professionals better understand the background information of these seals. However, there is a lack of training data and labelled images, as samples of some seals are scarce and most of them are degraded images. It is necessary to find new ways to make full use of such scarce data. While these data are available online, they do not contain information on the characters' position. The goal of this research is to assist in obtaining more labelled data through user interaction and provide retrieval tools that use only standard character typefaces extracted from font files. In this paper, a character segmentation method is proposed to predict the candidate characters' area without any labelled training data that contain character coordinate information. A retrieval-based recognition system that focuses on a single character is also proposed to support seal retrieval and matching. The experimental results demonstrate that the proposed character segmentation method performs well on Asian collector's seals, with 85% of the test data being correctly segmented.

Digital interfaces of historical newspapers: opportunities, restrictions and recommendations

Journal of Data Mining & Digital Humanities, 2021, Vol HistoInformatics (HistoInformatics)

DOI: 10.46298/jdmdh.6121

Author(s): Eva Pfanzelter, Sarah Oberbichler, Jani Marjanen, Pierre-Carl Langlais, Stefan Hechl

Keyword(s): User Interfaces, Initial Period, Digital Tools, Free Access, Open Questions, International Audience, User Groups, Insight Into

Many libraries offer free access to digitised historical newspapers via user interfaces. After an initial period in which search and filter options were the only features, the availability of more advanced tools and the desire for more options among users have ushered in a period of interface development. However, this raises a number of open questions and challenges. For example, how can we provide interfaces for different user groups? What tools should be available on interfaces and how can we avoid too much complexity? What tools are helpful and how can we improve usability? This paper does not provide definite answers to these questions, but it gives an insight into the difficulties, challenges and risks of using interfaces to investigate historical newspapers. More importantly, it provides ideas and recommendations for the improvement of user interfaces and digital tools.

Journal of Data Mining & Digital Humanities
Latest Publications

Published By Centre Pour La Communication Scientifique Directe (CCSD)

Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

Publishing open-access bibliographical data on Ancient Greek and Latin texts: challenges, constraints, progression

The renewal of the digital humanities. An overview of the transformation of professions in the humanities and social sciences

French vital records data gathering and analysis through image processing and machine learning algorithms

Conceptual modeling of prosopographic databases integrating quality dimensions

Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

Plague Dot Text: Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)

Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

Character Segmentation in Asian Collector's Seal Imprints: An Attempt to Retrieval Based on Ancient Character Typeface

Digital interfaces of historical newspapers: opportunities, restrictions and recommendations
