document segmentation Latest Research Papers

Structural Markup of Official Documents in Diachronic Linguistic Corpus: Problems and Solutions

Vestnik Volgogradskogo gosudarstvennogo universiteta Serija 2 Jazykoznanije ◽

10.15688/jvolsu2.2021.4.1 ◽

2021 ◽

pp. 5-18

Author(s):

Oksana Gorban ◽

Marina Kosova ◽

Elena Sheptukhina ◽

...

Keyword(s):

Business Communication ◽

The State ◽

Complex Method ◽

Text Segmentation ◽

State Archive ◽

Document Segmentation ◽

Graphic Symbols ◽

Volgograd Region ◽

Problems And Solutions ◽

Necessary And Sufficient

The research is motivated by the need to annotate official documents of the Don Cossack Host, written in the middle of the 18th century and kept in the "Mikhailovsky Stanitsa Ataman" archive fund of the State Archive of the Volgograd Region (SAVR, fund 332, inventory 1), in order to compile a linguistic corpus. The authors characterize the problems of dividing these archival documentary texts into structural units. The difficulties arise from the specifics of the form, the dynamics of genres, and the syntactic peculiarities of business communication in the middle of the 18th century. The complexity of dividing a documentary text is shown to depend on its degree of narrativity. The authors motivate the choice of a structural-semantic segment, coinciding with a sentence or several closely connected sentences, as the layout unit. A complex method of document segmentation for structural markup is justified; it is based on genre parameterization of documents and their syntactic segmentation. Segment boundaries can be indicated by a combination of graphic symbols, speech formulas that perform the function of details of payments, and lexical and grammatical means. The study shows that the sequence of procedures used for text segmentation, aimed at identifying the genre and speech organization of the document, makes it possible to present in the diachronic corpus the information that is necessary and sufficient for the user to draw conclusions about the properties of the document text and its units.

DDSnet: A Deep Document Segmentation with Hybrid Blocks Architecture Network

2020 International Symposium on Computer, Consumer and Control (IS3C) ◽

10.1109/is3c50286.2020.00031 ◽

2020 ◽

Author(s):

Jing-Ming Guo ◽

Li-Ying Chang ◽

Hao-Hsuan Lee

Keyword(s):

Document Segmentation

A Joint Model for Document Segmentation and Segment Labeling

10.18653/v1/2020.acl-main.29 ◽

2020 ◽

Cited By ~ 1

Author(s):

Joe Barrow ◽

Rajiv Jain ◽

Vlad Morariu ◽

Varun Manjunatha ◽

Douglas Oard ◽

...

Keyword(s):

Joint Model ◽

Document Segmentation

Synthetic data usage for document segmentation models fine-tuning

Proceedings of the Institute for System Programming of RAS ◽

10.15514/ispras-2020-32(4)-14 ◽

2020 ◽

Vol 32 (4) ◽

pp. 189-202

Author(s):

Oksana Belyaeva ◽

Andrey Perminov ◽

Ilya Kozlov

Keyword(s):

Synthetic Data ◽

Fine Tuning ◽

Document Segmentation ◽

Data Usage ◽

Segmentation Models

Document Segmentation Method Based on Style Feature Fusion

IOP Conference Series Materials Science and Engineering ◽

10.1088/1757-899x/646/1/012044 ◽

2019 ◽

Vol 646 ◽

pp. 012044

Author(s):

Gang Liu ◽

Kai Wang ◽

Wangyang Liu ◽

Xu Cheng ◽

Tao Li

Keyword(s):

Feature Fusion ◽

Segmentation Method ◽

Document Segmentation

Research on Tibetan Document Segmentation Based on GMM+K-means Algorithm

Frontiers in Signal Processing ◽

10.22606/fsp.2019.34006 ◽

2019 ◽

Vol 3 (4) ◽

Author(s):

Chengliang Jiang ◽

Huazhang Wang

Keyword(s):

Document Segmentation

BeamSeg: A Joint Model for Multi-Document Segmentation and Topic Identification

10.18653/v1/k19-1054 ◽

2019 ◽

Author(s):

Pedro Mota ◽

Maxine Eskenazi ◽

Luísa Coheur

Keyword(s):

Joint Model ◽

Topic Identification ◽

Document Segmentation

Document Segmentation and Language Translation Using Tesseract-OCR

2018 IEEE 13th International Conference on Industrial and Information Systems (ICIIS) ◽

10.1109/iciinfs.2018.8721372 ◽

2018 ◽

Cited By ~ 1

Author(s):

Sahil Thakare ◽

Ajay Kamble ◽

Vishal Thengne ◽

U.R. Kamble

Keyword(s):

Language Translation ◽

Document Segmentation

MUSED: A multimedia multi-document dataset for topic segmentation

Natural Language Engineering ◽

10.1017/s1351324918000359 ◽

2018 ◽

Vol 24 (6) ◽

pp. 921-946

Author(s):

Pedro Mota ◽

Maxine Eskenazi ◽

Luísa Coheur

Keyword(s):

State Of The Art ◽

The State ◽

Topic Segmentation ◽

Document Segmentation ◽

Multiple Documents ◽

The Impact ◽

Segmentation Models

Research on topic segmentation has recently focused on segmenting documents by taking advantage of documents covering the same topics. To evaluate such approaches properly, a dataset of related documents is needed. However, existing datasets are limited in the number of related documents per domain. In addition, most of the available datasets do not consider documents from different media sources (PowerPoints, videos, etc.), which pose specific challenges to segmentation. We fill this gap with the MUltimedia SEgmentation Dataset (MUSED), a collection of manually segmented documents from different media sources, in seven different domains, with an average of twenty related documents per domain. In this paper, we describe the process of building MUSED. A multi-annotator study is carried out to determine whether agreement among human judges can be observed and to characterize their disagreement patterns. In addition, we use MUSED to compare state-of-the-art topic segmentation techniques, including those that take advantage of related documents. Moreover, we study the impact of having documents from different media sources in the dataset. To the best of our knowledge, MUSED is the first dataset that allows a straightforward evaluation of both single- and multi-document topic segmentation techniques, as well as a study of how these behave in the presence of documents from different media sources. Results show that some techniques are, indeed, sensitive to different media sources, and also that current multi-document segmentation models do not outperform previous models, pointing to a research line that needs to be boosted.

Helmholtz Principle on word embeddings for automatic document segmentation

Proceedings of the ACM Symposium on Document Engineering 2018 - DocEng '18 ◽

10.1145/3209280.3229103 ◽

2018 ◽

Author(s):

Dominik Krzemiński ◽

Helen Balinsky ◽

Alexander Balinsky

Keyword(s):

Word Embeddings ◽

Document Segmentation ◽

Helmholtz Principle
