text segmentation
Recently Published Documents


TOTAL DOCUMENTS: 374 (FIVE YEARS 70)

H-INDEX: 22 (FIVE YEARS 3)

2021 ◽  
Vol 21 ◽  
Author(s):  
Anna Piechnik

From Formality to Familiarity: Greeting Formulas in Emails Written by Teachers to Parents (original Polish title: Od dystansu do poufałości. Formuły powitalne listów elektronicznych kierowanych przez nauczycieli do rodziców)

The purpose of the article is to analyze greeting formulas in emails written by teachers to parents. The linguistic material comprises 513 email messages sent to parents through an online parent portal (e-dziennik) by teachers from a number of schools in Lesser Poland Province. The formal relationship between the participants and the written form of the exchange traditionally mark the situation as official and therefore raise the expectation of an official style. The analysis shows that greeting formulas used in emails diverge significantly from those used in official paper-based correspondence. Formulas taken from direct verbal communication (Dzień dobry, Witam) clearly dominate over the traditional Szanowni Państwo (Dear Sir or Madam). There is an apparent tendency to reduce distance by using emoticons or forms of address typical of private correspondence between people in a close relationship. The linguistic form and spelling of many greeting formulas (and of entire messages), together with the lack of text segmentation, indicate little care for the visual layout of the messages.


Author(s):  
Oksana Gorban ◽  
Marina Kosova ◽  
Elena Sheptukhina ◽  
...  

The research is motivated by the need to annotate official documents of the Don Cossack Host written in the mid-18th century and kept in the "Mikhailovsky Stanitsa Ataman" archive fund of the State Archive of the Volgograd Region (SAVR, fund 332, inventory 1) in order to compile a linguistic corpus. The authors characterize the problems of dividing these documentary texts into structural units. The difficulties stem from the specifics of the form, the dynamics of genres, and the syntactic peculiarities of business communication in the mid-18th century. The complexity of dividing a documentary text is shown to depend on its degree of narrativity. The authors motivate the choice of a structural-semantic segment, coinciding with one sentence or several closely connected sentences, as the markup unit. A combined method of document segmentation for structural markup is justified, based on genre parameterization of documents and their syntactic segmentation. Segment boundaries can be indicated by a combination of graphic symbols, speech formulas that function as document requisites, and lexical and grammatical means. The study shows that the sequence of segmentation procedures, aimed at identifying the genre and speech organization of a document, makes it possible to present in the diachronic corpus the information that is necessary and sufficient for the user to draw conclusions about the properties of the document text and its units.


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Marko Neumann

The use of punctuation in German incunabula is often described as arbitrary, irregular, and unsystematic (cf. Masalon 2014: 54–56). This concerns the inventory, frequency, and function of punctuation marks as well as pragmatic aspects such as how typesetters treated punctuation in their respective target texts. In this paper, punctuation is not seen as an independent linguistic subsystem, but as a means of text segmentation that, along with other measures (e.g. capital letters, pilcrows, and white space), was used to structure a text with respect to its formal appearance, helping the reader to decode information. This case study is based on a corpus of German pamphlets written by the Bohemian astrologer Wenzel Faber and printed annually beginning in 1481 at various print shops, principally in Leipzig and Nuremberg. The analysis finds significant changes between the editions before and after 1490. These changes include an increasing consistency in the intensity of text segmentation and a shift in the use of capital letters and punctuation marks from a polyfunctional to a monofunctional approach. Finally, different types of text segmentation are proposed, each characterized by a specific relationship between its frequency and its function. Despite this overall tendency, one must still take into account that typesetters followed individual punctuation practices in their search for suitable forms of text segmentation.


2021 ◽  
Vol 12 (5) ◽  
pp. 1-29
Author(s):  
Qiong Wu ◽  
Adam Hare ◽  
Sirui Wang ◽  
Yuwei Tu ◽  
Zhenming Liu ◽  
...  

Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available. In this work, we reexamine the inter-related problems of “topic identification” and “text segmentation” for sparse document learning, when there is a single new text of interest. In developing a methodology to handle single documents, we face two major challenges. The first is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms. The second is significant noise: a considerable portion of words in any single document will produce only noise and not help discern topics or segments. To tackle these issues, we design an unsupervised, computationally efficient methodology called Biclustering Approach to Topic modeling and Segmentation (BATS). BATS leverages three key ideas to simultaneously identify topics and segment text: (i) a new mechanism that uses word order information to reduce sample complexity, (ii) a statistically sound graph-based biclustering technique that identifies latent structures of words and sentences, and (iii) a collection of effective heuristics that remove noise words and award important words to further improve performance. Experiments on six datasets show that our approach outperforms several state-of-the-art baselines when considering topic coherence, topic diversity, segmentation, and runtime comparison metrics.


Author(s):  
Peiwen Gao ◽  
Yana Zhang ◽  
Suya Zhang ◽  
Zeyu Chen

2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Yiqian Li ◽  
Tao Du ◽  
Lianjiang Zhu ◽  
Shouning Qu

Text segmentation of URL domain names is a straightforward and convenient way to analyze users’ online behavior and is crucial for determining their areas of interest. However, popular word segmentation tools perform relatively poorly on domain names because of their unique structure (extremely short length, irregular names, and no contextual relationship). To address this issue, this paper proposes an efficient minimal text segmentation (EMTS) method for URL domain names to achieve efficient adaptive text mining. We first designed a targeted hierarchical task model to reduce noise interference in minimal texts. We then presented a novel method that integrates a conflict game into the bidirectional maximum matching algorithm, so that words with higher weight and greater probability are selected, thereby enhancing recognition accuracy. Next, Chinese Pinyin and English mapping were embedded in the word segmentation rules. In addition, we incorporated a correction factor that accounts for text length into the F1-score to improve the performance evaluation of text segmentation. The experimental results show that EMTS yields an improvement of around 20 percentage points over other word segmentation tools in terms of accuracy and topic extraction, providing high-quality data for subsequent text analysis.
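
As a rough illustration of the bidirectional maximum matching step mentioned in this abstract, the Python sketch below segments a domain-name string with a forward and a backward greedy pass and keeps the better result. It is not the EMTS method: the conflict-game weighting is replaced here by a simple comparison (fewer tokens, then fewer single-character leftovers, then higher summed weight), and the function names, the toy lexicon, and the `weights` parameter are all assumptions made for the example.

```python
def forward_max_match(text, lexicon, max_len):
    """Greedy left-to-right pass: take the longest lexicon word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len):
    """Greedy right-to-left pass: take the longest lexicon word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_max_match(text, lexicon, weights=None):
    """Run both passes and keep the one with the lower score.

    `weights` is a hypothetical word -> weight mapping standing in for the
    weighting described in the abstract; it is only used to break ties.
    """
    weights = weights or {}
    max_len = max((len(word) for word in lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)

    def score(tokens):
        singles = sum(1 for t in tokens if len(t) == 1)
        total_weight = sum(weights.get(t, 0.0) for t in tokens)
        return (len(tokens), singles, -total_weight)

    # Lower score wins; on a full tie the backward result is kept.
    return fwd if score(fwd) < score(bwd) else bwd


if __name__ == "__main__":
    toy_lexicon = {"choose", "chooses", "spain", "pain"}
    print(bidirectional_max_match("choosespain", toy_lexicon))
    print(bidirectional_max_match("choosespain", toy_lexicon,
                                  weights={"chooses": 2.0, "pain": 2.0}))
```

The two passes disagree on the toy input: the forward pass reads `chooses` + `pain`, the backward pass `choose` + `spain`. Without weights the tie falls to the backward result; supplying weights lets the caller steer the choice, which is the role the abstract assigns to word weight and probability.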


2021 ◽  
Author(s):  
Fei Shen ◽  
Wenting Yu ◽  
Chen Min ◽  
Qianying Ye ◽  
Chuanli Xia ◽  
...  

Text mining has been a dominant approach to extracting useful information from massive unstructured data online. However, existing tools for Chinese word segmentation are not ideal for processing social media text data in Cantonese. This project developed CyberCan (https://github.com/shenfei1010/CyberCan), a lexicon of contemporary Cantonese based on more than 100 million pieces of internet texts. We compared the word segmentation performance of CyberCan with that of existing Mandarin and Cantonese lexicons. Findings suggest that CyberCan outperforms all existing lexicons by a considerable margin.
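
The abstract does not say how word segmentation performance was measured. One common choice, shown in the Python sketch below, is word-level precision, recall, and F1 computed over character spans: a predicted word counts as correct only if a gold-standard word covers exactly the same span. The function names and the toy data are assumptions for illustration and are not taken from the CyberCan project.

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    out, start = set(), 0
    for tok in tokens:
        out.add((start, start + len(tok)))
        start += len(tok)
    return out


def segmentation_f1(predicted, gold):
    """Word-level F1: a predicted word is correct if its span matches a gold span."""
    pred_spans, gold_spans = spans(predicted), spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["choose", "spain"]
    # Hypothetical outputs from segmenters backed by two different lexicons.
    for name, predicted in [("lexicon A", ["choose", "spain"]),
                            ("lexicon B", ["choose", "s", "pain"])]:
        print(name, round(segmentation_f1(predicted, gold), 3))
```

Scoring each lexicon's segmenter output against the same hand-segmented sample with a function like this gives one number per lexicon that can be compared directly.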


2021 ◽  
Author(s):  
Xingqian Xu ◽  
Zhifei Zhang ◽  
Zhaowen Wang ◽  
Brian Price ◽  
Zhonghao Wang ◽  
...  
