Logical structure analysis and generation for structured documents: A syntactic approach

Structured documents are composed of objects that have both content and a logical structure. Effective retrieval of structured documents requires models that support content-based retrieval of objects while taking their logical structure into account, so that the relevance of an object is based not solely on its content but also on the logical structure among objects. This paper proposes a formal model for representing structured documents in which the content of an object is viewed as the knowledge contained in that object, and the logical structure among objects is captured by a process of knowledge augmentation: the knowledge contained in an object is augmented with that of its structurally related objects. The knowledge augmentation process takes into account the fact that knowledge can be incomplete and can become inconsistent.
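The knowledge augmentation idea in this abstract can be illustrated with a minimal sketch. It is not the paper's formal model: the `DocObject` class, the `augment` function, the use of plain true/false propositions, and the `"inconsistent"` marker are all assumptions made for illustration. Propositions missing from an object's dictionary stand for incomplete knowledge, and conflicting values among related objects are flagged as inconsistent.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical marker for a proposition asserted both true and false.
INCONSISTENT = "inconsistent"

@dataclass
class DocObject:
    """A document object: its own knowledge plus its structurally related children."""
    name: str
    knowledge: dict = field(default_factory=dict)   # proposition -> True / False
    children: list = field(default_factory=list)    # nested DocObject instances

def augment(obj: DocObject) -> dict:
    """Return obj's knowledge augmented with that of all its descendants."""
    merged = dict(obj.knowledge)
    for child in obj.children:
        for prop, value in augment(child).items():
            if prop not in merged:
                merged[prop] = value            # fills a gap (incomplete knowledge)
            elif merged[prop] != value:
                merged[prop] = INCONSISTENT     # conflicting knowledge
    return merged

# A section whose own content and first paragraph disagree about "wine".
section = DocObject(
    "section",
    {"wine": True},
    [DocObject("p1", {"salmon": True, "wine": False}),
     DocObject("p2", {"sauce": True})],
)
result = augment(section)
```

Here `result` holds the section's own knowledge together with that of its two paragraphs; a proposition that appears nowhere in the subtree simply stays absent.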


Logical Structure Analysis for Form Images with Arbitrary Layout by Belief Propagation

Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2 ◽ 10.1109/icdar.2007.4377008 ◽ 2007 ◽ Cited By ~ 1

Author(s): A. Minagawa ◽ Y. Fujii ◽ H. Takebe ◽ K. Fujimoto

Keyword(s): Structure Analysis ◽ Belief Propagation ◽ Logical Structure


Logical structure analysis of document images based on emergent computation

Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318) ◽ 10.1109/icdar.1999.791756 ◽ 1999 ◽ Cited By ~ 11

Author(s): Y. Ishitani

Keyword(s): Structure Analysis ◽ Logical Structure ◽ Document Images ◽ Emergent Computation


Text Extraction Algorithm using the HTML Logical Structure Analysis

Journal of Digital Contents Society ◽ 10.9728/dcs.2015.16.3.445 ◽ 2015 ◽ Vol 16 (3) ◽ pp. 445-455 ◽ Cited By ~ 2

Author(s): Hyun-Gee Jeon ◽ Chan KOH

Keyword(s): Structure Analysis ◽ Logical Structure ◽ Text Extraction ◽ Extraction Algorithm


Warehouse Management Using V‐A‐T Logical Structure Analysis

The International Journal of Logistics Management ◽ 10.1108/09574099310804885 ◽ 1993 ◽ Vol 4 (1) ◽ pp. 35-48 ◽ Cited By ~ 4

Author(s): Michael S. Spencer

Keyword(s): Structure Analysis ◽ Logical Structure ◽ Warehouse Management


Logical structure analysis of scientific publications in mathematics

Proceedings of the International Conference on Web Intelligence, Mining and Semantics - WIMS '11 ◽ 10.1145/1988688.1988713 ◽ 2011 ◽ Cited By ~ 9

Author(s): Valery Solovyev ◽ Nikita Zhiltsov

Keyword(s): Structure Analysis ◽ Logical Structure ◽ Scientific Publications


Semi-Structured Document Classification

Encyclopedia of Data Warehousing and Mining, Second Edition ◽ 10.4018/978-1-60566-010-3.ch271 ◽ 2011 ◽ pp. 1779-1786

Author(s): Ludovic Denoyer

Keyword(s): Machine Learning ◽ Information Sources ◽ Major Change ◽ Logical Structure ◽ Document Classification ◽ Document Collections ◽ Heterogeneous Information ◽ Structured Document ◽ Structured Documents ◽ Different Content

Document classification has developed over the last ten years, using techniques originating from the pattern recognition and machine learning communities. All these methods operate on flat text representations in which word occurrences are considered independent. The recent paper (Sebastiani, 2002) gives a very good survey of textual document classification. With the development of structured textual and multimedia documents, and with the increasing importance of structured document formats like XML, the nature of documents is changing. Structured documents usually have a much richer representation than flat ones. They have a logical structure. They are often composed of heterogeneous information sources (e.g. text, image, video, metadata). Another major change with structured documents is the possibility of accessing document elements or fragments. The development of classifiers for structured content is a new challenge for the machine learning and IR communities. A classifier for structured documents should be able to make use of the different content information sources present in an XML document and to classify both full documents and document parts. It should adapt easily to a variety of different sources (e.g. to different Document Type Definitions). It should be able to scale to large document collections.
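The contrast between a flat representation and a structure-aware one can be shown with a small sketch. It is not the classifier described in the entry above: the function names `flat_features` and `structured_features`, the path-prefixed feature scheme, and the sample document are assumptions chosen for illustration. Only Python's standard `xml.etree.ElementTree` and `collections.Counter` are used.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def flat_features(xml_text):
    """Bag of words over all text, ignoring the logical structure."""
    root = ET.fromstring(xml_text)
    return Counter(" ".join(root.itertext()).lower().split())

def structured_features(xml_text):
    """Bag of (element path, word) pairs: the same word in a title
    and in a body is counted as two different features."""
    root = ET.fromstring(xml_text)
    features = Counter()

    def walk(node, path):
        here = f"{path}/{node.tag}"
        for word in (node.text or "").lower().split():
            features[(here, word)] += 1
        for child in node:
            walk(child, here)
            # Text following a child element belongs to the current element.
            for word in (child.tail or "").lower().split():
                features[(here, word)] += 1

    walk(root, "")
    return features

# Hypothetical sample document.
doc = "<article><title>XML retrieval</title><body>retrieval of XML parts</body></article>"
flat = flat_features(doc)
structured = structured_features(doc)
```

In the flat view the word "xml" is a single feature with count 2; in the structured view it appears once under `/article/title` and once under `/article/body`, which is the kind of information a structure-aware classifier could exploit and which also allows features to be computed for a single element rather than the whole document.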


Semi-Structured Document Classification

Encyclopedia of Data Warehousing and Mining ◽ 10.4018/978-1-59140-557-3.ch191 ◽ 2011 ◽ pp. 1015-1021

Author(s): Ludovic Denoyer ◽ Patrick Gallinari

Keyword(s): Machine Learning ◽ Information Sources ◽ Major Change ◽ Logical Structure ◽ Document Classification ◽ Document Collections ◽ Heterogeneous Information ◽ Structured Document ◽ Structured Documents ◽ Different Content

Document classification has developed over the last 10 years, using techniques originating from the pattern recognition and machine-learning communities. All these methods operate on flat text representations, in which word occurrences are considered independent. The recent paper by Sebastiani (2002) gives a very good survey of textual document classification. With the development of structured textual and multimedia documents and with the increasing importance of structured document formats like XML, the nature of documents is changing. Structured documents usually have a much richer representation than flat ones. They have a logical structure. They are often composed of heterogeneous information sources (e.g., text, image, video, metadata). Another major change with structured documents is the possibility of accessing document elements or fragments. The development of classifiers for structured content is a new challenge for the machine-learning and IR communities. A classifier for structured documents should be able to make use of the different content information sources present in an XML document and to classify both full documents and document parts. It should adapt easily to a variety of different sources (e.g., different document type definitions). It should be able to scale to large document collections.
