Text Segmentation for Document Recognition

Author(s):  
Nicola Nobile ◽  
Ching Y. Suen
2019 ◽  
Vol 70 (1) ◽  
pp. 105-122
Author(s):  
Davide Mastrantonio

Abstract: In this paper we deal with a specific subset of direct speech markers that has received little or no attention so far: the expressions that encode the end of direct speech (“marcatori conclusivi del discorso diretto”). We analyse these markers in Old Italian texts, comparing them with their Latin and, in some cases, Old French equivalents. In the introduction (§1), we consider several general issues related to ancient texts, namely the practice of reading aloud and the lack of systematic punctuation to aid text segmentation. After that (§2), we classify the different strategies ancient writers had at their disposal to signal that a stretch of direct speech is over, and hence that what follows is to be interpreted as the narrator's voice; the markers are arranged on a scale from most explicit to most implicit (disse > quando ebbe detto > a queste parole > allora > [Ø]). Thereafter (§3), we focus on two specific markers, the participial marker (detto questo) and the “connector + finite tense” marker (quando ebbe detto questo), in a corpus of nine texts. Though these two markers are roughly synonymous, their occurrence is not uniform across the analysed texts. We explain their unequal distribution by the fact that they belong to different discourse traditions (Diskurstraditionen): “quando + finite tense” is a typical expression attested in Romance narrations (the so-called “quand-Satz”), whereas detto questo appears to depend on the Latin tradition.


Author(s):  
Van-Linh Pham ◽  
Xuan-Phung Pham ◽  
Hoai-Nam Tran ◽  
Sy-Tuyen Ho ◽  
Vinh-Loi Ly ◽  
...  

Entropy ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. 275
Author(s):  
Igor A. Bessmertny ◽  
Xiaoxi Huang ◽  
Aleksei V. Platonov ◽  
Chuqiao Yu ◽  
Julia A. Koroleva

Search engines are able to find documents containing patterns from a query. This approach can be used for alphabetic languages such as English. However, Chinese is highly dependent on context. A significant problem in Chinese text processing is the absence of blanks between words, so the text must be segmented into words before any other action. Algorithms for Chinese text segmentation should consider context; that is, the word segmentation process depends on the neighbouring ideograms. As the existing segmentation algorithms are imperfect, we have considered an approach that builds the context from all possible n-grams surrounding the query words. This paper proposes a quantum-inspired approach to rank Chinese text documents by their relevance to the query. In particular, this approach uses Bell’s test, which measures the quantum entanglement of two words within the context. The contexts of words are built using the hyperspace analogue to language (HAL) algorithm. Experiments conducted in three domains demonstrated that the proposed approach provides acceptable results.
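
To make the idea of building context without prior word segmentation more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `surrounding_ngrams` and `hal_weights`, the character-level window, and the default parameters are all hypothetical. The first function collects every character n-gram found next to each occurrence of a query; the second applies the usual HAL-style weighting, in which a neighbour at distance d inside a window of size w contributes w - d + 1. The Bell's test ranking step is not shown.

```python
from collections import Counter


def surrounding_ngrams(text, query, window=2, max_n=2):
    """Count every n-gram (1..max_n characters) lying within `window`
    characters to the left or right of each occurrence of `query`.
    Hypothetical helper: no word segmentation is performed."""
    counts = Counter()
    start = text.find(query)
    while start != -1:
        end = start + len(query)
        left = text[max(0, start - window):start]
        right = text[end:end + window]
        for side in (left, right):
            for n in range(1, max_n + 1):
                for i in range(len(side) - n + 1):
                    counts[side[i:i + n]] += 1
        start = text.find(query, start + 1)
    return counts


def hal_weights(text, query, window=2):
    """HAL-style weighting on characters (an assumption; HAL is usually
    described over words): a neighbour at distance d from the query
    contributes window - d + 1, so closer neighbours weigh more."""
    weights = Counter()
    start = text.find(query)
    while start != -1:
        end = start + len(query)
        for d in range(1, window + 1):
            if start - d >= 0:
                weights[text[start - d]] += window - d + 1
            if end + d - 1 < len(text):
                weights[text[end + d - 1]] += window - d + 1
        start = text.find(query, start + 1)
    return weights


if __name__ == "__main__":
    sample = "我爱北京天安门"
    print(surrounding_ngrams(sample, "北京"))
    print(hal_weights(sample, "北京"))
```

Working on raw characters sidesteps the imperfect segmentation step the abstract mentions, at the cost of including n-grams that are not real words; a ranking stage would have to tolerate that noise.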


1989 ◽  
Vol 2 (2) ◽  
pp. 120-130
Author(s):  
Wei-Chung Lin ◽  
Yu-Jen Eugene Feng

Author(s):  
Arwa Alghamdi ◽  
Dareen Alluhaybi ◽  
Doaa Almehmadi ◽  
Khadijah Alameer ◽  
Sundos Bin Siddeq ◽  
...  

2008 ◽  
Vol 11 (3-4) ◽  
pp. 157-165
Author(s):  
Bernadette Sharp ◽  
Caroline Chibelushi
