Text Segmentation for Document Recognition

Marcatori conclusivi del discorso diretto in italiano antico

Romanistisches Jahrbuch ◽

10.1515/roja-2019-0004 ◽

2019 ◽

Vol 70 (1) ◽

pp. 105-122

Author(s):

Davide Mastrantonio

Keyword(s):

Word Reading ◽

Old French ◽

Spoken Word ◽

Text Segmentation ◽

Direct Speech ◽

Unequal Distribution ◽

Specific Subset ◽

Ancient Texts

Abstract: In this paper we deal with a specific subset of direct speech markers, to which little or no attention has been given so far: the expressions that mark the end of direct speech (“marcatori conclusivi del discorso diretto”). We analyse these markers in Old Italian texts, comparing them with their Latin and, in some cases, Old French equivalents. In the introduction (§1), we take into account various general issues related to ancient texts, namely the practice of spoken-word reading and the lack of systematic punctuation to aid text segmentation. After that (§2), we classify the different strategies ancient writers had at their disposal to signal that a stretch of direct speech is over, and hence that what follows has to be interpreted as the narrator's voice; the markers are organized in a range from most explicit to most implicit (disse > quando ebbe detto > a queste parole > allora > [Ø]). Thereafter (§3), we focus on two specific markers, the participial marker (detto questo) and the “connector + finite tense” marker (quando ebbe detto questo), in a corpus of nine texts. Though these two markers are roughly synonymous, their occurrence is not uniform among the analysed texts. The explanation of their unequal distribution is that they belong to different discourse traditions (Diskurstraditionen): “quando + finite tense” is a typical expression attested in Romance narratives (the so-called “quand-Satz”), whereas detto questo appears to be dependent on the Latin tradition.

A Deep Learning Approach for Text Segmentation in Document Analysis

2020 International Conference on Advanced Computing and Applications (ACOMP) ◽

10.1109/acomp50827.2020.00027 ◽

2020 ◽

Author(s):

Van-Linh Pham ◽

Xuan-Phung Pham ◽

Hoai-Nam Tran ◽

Sy-Tuyen Ho ◽

Vinh-Loi Ly ◽

...

Keyword(s):

Deep Learning ◽

Document Analysis ◽

Text Segmentation ◽

Learning Approach

Applying the Bell’s Test to Chinese Texts

Entropy ◽

10.3390/e22030275 ◽

2020 ◽

Vol 22 (3) ◽

pp. 275

Author(s):

Igor A. Bessmertny ◽

Xiaoxi Huang ◽

Aleksei V. Platonov ◽

Chuqiao Yu ◽

Julia A. Koroleva

Keyword(s):

Quantum Entanglement ◽

Chinese Text ◽

Search Engines ◽

Text Processing ◽

Word Segmentation ◽

Significant Problem ◽

Text Segmentation ◽

Text Documents ◽

Segmentation Algorithms ◽

Chinese Texts

Search engines are able to find documents containing patterns from a query. This approach can be used for alphabetic languages such as English. However, Chinese is highly dependent on context. A significant problem in Chinese text processing is the absence of blanks between words, so the text must be segmented into words before any other processing. Algorithms for Chinese text segmentation should consider context; that is, the word segmentation process depends on other ideograms. As the existing segmentation algorithms are imperfect, we have considered an approach to build the context from all possible n-grams surrounding the query words. This paper proposes a quantum-inspired approach to rank Chinese text documents by their relevancy to the query. In particular, this approach uses Bell’s test, which measures the quantum entanglement of two words within the context. The contexts of words are built using the hyperspace analogue to language (HAL) algorithm. Experiments conducted in three domains demonstrated that the proposed approach provides acceptable results.
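
As a rough illustration of the context-building step described in this abstract, the sketch below computes HAL-style co-occurrence weights over character n-grams, so that no prior word segmentation is needed. It assumes the commonly described HAL weighting (a preceding item at distance d inside a window of size w contributes w - d + 1). The function names `char_ngrams` and `hal_weights`, the window size, and the bigram length are hypothetical choices for this example, not details taken from the paper, and the Bell's test ranking itself is not shown.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Split unsegmented text into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hal_weights(tokens, window=2):
    """Accumulate HAL-style weights for each (target, preceding) pair.

    A preceding token at distance d (1 <= d <= window) adds
    window - d + 1 to weights[target][preceding], so closer
    neighbours count more. This weighting is an assumption.
    """
    weights = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        for d in range(1, window + 1):
            j = i - d
            if j < 0:
                break  # ran off the start of the sequence
            weights[target][tokens[j]] += window - d + 1
    return weights


# Character bigrams stand in for words, so no segmenter is required.
bigrams = char_ngrams("北京大学", n=2)
context = hal_weights(bigrams, window=2)
```

Here `context["大学"]["京大"]` holds the weight of the immediately preceding bigram and `context["大学"]["北京"]` the smaller weight of the one before it. A ranking method along the lines of the abstract would then compare such context vectors for the query words.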
