Semi-Structured Document Classification

Author(s):  
Ludovic Denoyer ◽  
Patrick Gallinari

Document classification has developed over the last 10 years, using techniques originating from the pattern recognition and machine-learning communities. All these methods operate on flat text representations, where word occurrences are considered independent. The recent paper by Sebastiani (2002) gives a very good survey of textual document classification. With the development of structured textual and multimedia documents and the increasing importance of structured document formats like XML, the nature of documents is changing. Structured documents usually have a much richer representation than flat ones. They have a logical structure. They are often composed of heterogeneous information sources (e.g., text, image, video, metadata). Another major change with structured documents is the possibility of accessing document elements or fragments. The development of classifiers for structured content is a new challenge for the machine-learning and IR communities. A classifier for structured documents should be able to make use of the different content information sources present in an XML document and to classify both full documents and document parts. It should adapt easily to a variety of different sources (e.g., different document type definitions). It should be able to scale to large document collections.
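To make the contrast between flat and structured representations concrete, here is a minimal Python sketch. It is not the authors' classifier: the function names, the path-prefixed feature scheme, and the sample tags are all assumptions chosen for illustration. It shows a flat bag of words, a bag of (element path, word) pairs that keeps the logical structure, and per-fragment features that would allow document parts to be classified separately.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

WORD = re.compile(r"[a-z0-9]+")


def flat_features(xml_text):
    """Bag of words over all text, ignoring the element structure."""
    root = ET.fromstring(xml_text)
    return Counter(WORD.findall(" ".join(root.itertext()).lower()))


def structured_features(xml_text):
    """Bag of (element path, word) pairs (a hypothetical feature scheme).

    The same word occurring in different elements is counted separately.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def visit(node, path):
        here = f"{path}/{node.tag}"
        # Text owned directly by this node: its leading text plus the
        # tail text that follows each of its children.
        for chunk in [node.text] + [child.tail for child in node]:
            for word in WORD.findall((chunk or "").lower()):
                counts[(here, word)] += 1
        for child in node:
            visit(child, here)

    visit(root, "")
    return counts


def fragment_features(xml_text, tag):
    """Flat features for each element named `tag`, one Counter per fragment."""
    root = ET.fromstring(xml_text)
    return [
        Counter(WORD.findall(" ".join(element.itertext()).lower()))
        for element in root.iter(tag)
    ]
```

Either kind of feature count could then be fed to any standard text classifier. The structured variant only changes what counts as a feature, so a word in a title and the same word in a body paragraph are no longer pooled.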

Author(s):  
Ludovic Denoyer

Document classification has developed over the last ten years, using techniques originating from the pattern recognition and machine learning communities. All these methods operate on flat text representations, where word occurrences are considered independent. The recent paper (Sebastiani, 2002) gives a very good survey of textual document classification. With the development of structured textual and multimedia documents, and with the increasing importance of structured document formats like XML, the nature of documents is changing. Structured documents usually have a much richer representation than flat ones. They have a logical structure. They are often composed of heterogeneous information sources (e.g. text, image, video, metadata). Another major change with structured documents is the possibility of accessing document elements or fragments. The development of classifiers for structured content is a new challenge for the machine learning and IR communities. A classifier for structured documents should be able to make use of the different content information sources present in an XML document and to classify both full documents and document parts. It should easily adapt to a variety of different sources (e.g. to different Document Type Definitions). It should be able to scale to large document collections.


Author(s):  
M. LALMAS ◽  
T. ROLLEKE

Structured documents are composed of objects with a content and a logical structure. The effective retrieval of structured documents requires models that provide for content-based retrieval of objects while taking their logical structure into account, so that the relevance of an object is based not solely on its content but also on the logical structure among objects. This paper proposes a formal model for representing structured documents in which the content of an object is viewed as the knowledge contained in that object, and the logical structure among objects is captured by a process of knowledge augmentation: the knowledge contained in an object is augmented with that of its structurally related objects. The knowledge augmentation process takes into account the fact that knowledge can be incomplete and can become inconsistent.
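The idea of knowledge augmentation can be illustrated with a toy Python sketch. This is not the paper's formal model: the class `DocObject`, the truth-value labels, and the merge rule are hypothetical simplifications, and only parent-child containment is treated as a structural relation. Each object holds propositions marked true or false; augmentation merges in the knowledge of its descendants, marks a proposition as inconsistent when sources disagree, and leaves a proposition unknown (incomplete knowledge) when no object mentions it.

```python
from dataclasses import dataclass, field

# Hypothetical truth-value labels used only for this illustration.
TRUE, FALSE, INCONSISTENT = "true", "false", "inconsistent"


@dataclass
class DocObject:
    name: str
    knowledge: dict = field(default_factory=dict)  # proposition -> TRUE or FALSE
    children: list = field(default_factory=list)   # structurally related objects


def combine(a, b):
    """Merge two truth values for one proposition; disagreement is inconsistent."""
    return a if a == b else INCONSISTENT


def augment(obj):
    """Return the object's own knowledge augmented with that of its descendants.

    Propositions missing from the result are simply unknown.
    """
    result = dict(obj.knowledge)
    for child in obj.children:
        for prop, value in augment(child).items():
            result[prop] = combine(result[prop], value) if prop in result else value
    return result
```

For example, if one section asserts that a proposition is false and a sibling section asserts that it is true, the augmented knowledge of their parent marks that proposition as inconsistent rather than silently picking one value.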


2020 ◽  
Vol 27 (6) ◽  
pp. 105-113
Author(s):  
T. G. Nekhaeva

The article examines publications of statistical data commemorating the anniversaries of the USSR's Victory in the Great Patriotic War as the most important information sources for an objective analysis of historical events. The reason for writing this article was the release of the Rosstat statistical handbook dedicated to the 75th anniversary of the Great Victory. In the introduction, the author argues that the issues addressed in the article are urgent because of information warfare aimed at distorting the historical truth about the role of our country in the anti-Hitler coalition and the defeat of fascism in World War II. The body of the article describes the concept and content of the anniversary edition. An important point of the article is the analysis of the data sources used in preparing the handbook. The author reviews the structure of the anniversary handbook, which includes a preface and the following sections: Population, Economic, Living conditions, Mobilization of population, Partisan movement, Evacuation during the war, Casualties and losses during the war, Military memorials and cemeteries, State awards, References. It is noted that the handbook maintains the tradition of previous statistical publications dedicated to the anniversaries of the Great Victory. Lastly, the author substantiates the novelty of the data presented in the anniversary handbook and the logical structure of its statistical materials. The author concludes that it is of paramount importance to continue popularizing data on the great exploits of the Soviet people during the war and to introduce new statistical information into scientific circulation, which furthers the understanding of primary information sources about the Great Patriotic War of 1941-1945.


2016 ◽  
Vol 12 (1) ◽  
pp. 70 ◽  
Author(s):  
Farshad Shams ◽  
Paolo Capodieci ◽  
Antonio Cerone ◽  
Romano Fantacci ◽  
Dania Marabissi ◽  
...  
