Document Structure
Recently Published Documents


TOTAL DOCUMENTS: 228 (five years: 33)
H-INDEX: 15 (five years: 3)

Author(s): Jayati Mukherjee, Swapan K. Parui, Utpal Roy

Segmentation of text lines and words in an unconstrained handwritten or machine-printed degraded document is a challenging document analysis problem due to the heterogeneity of document structure. Often there is uneven skew between lines, as well as broken words, in a document. The contribution of this article lies in segmenting a document page image into lines and words. We propose an unsupervised, robust, and simple statistical method to segment a document image that is either handwritten or machine-printed (degraded or otherwise). In our method, segmentation is treated as a two-class classification problem, performed on the distribution of gap sizes (between lines and between words) in a binary page image. The method is simple and easy to implement: other than binarization of the input image, no pre-processing is necessary, and no high computational resources are needed. It is unsupervised in the sense that no annotated document page images are required, so the issue of a training database does not arise; given a document page image, the parameters needed for segmenting text lines and words are learned in an unsupervised manner. We have applied the method to several popular, publicly available handwritten and machine-printed datasets (ISIDDI, IAM-Hist, IAM, PBOK) covering different Indian and other languages and a variety of fonts. Several experimental results are presented to show the effectiveness and robustness of the method. On the ICDAR 2013 handwriting segmentation contest dataset, our method outperforms the winning method. In addition, we suggest a quantitative measure of the level of degradation of a document page image.
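
A minimal sketch of the gap-size idea in Python/NumPy, assuming a binarized page image (foreground = 1): projection-profile gaps are collected and split into two classes with a simple 1-D 2-means, which here merely stands in for the paper's statistical two-class model; skew and touching components are not handled.

```python
import numpy as np

def zero_runs(profile):
    """Maximal runs where the projection profile is zero: (start, length)."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(profile) - start))
    return runs

def two_means_split(sizes, iters=50):
    """1-D 2-means on gap sizes; returns the boundary between the
    'small gap' and 'large gap' classes."""
    t = (min(sizes) + max(sizes)) / 2.0
    for _ in range(iters):
        small = [s for s in sizes if s <= t]
        large = [s for s in sizes if s > t]
        if not small or not large:
            break
        t_new = (np.mean(small) + np.mean(large)) / 2.0
        if abs(t_new - t) < 1e-6:
            break
        t = t_new
    return t

def split_at_large_gaps(img, axis):
    """Cut `img` wherever an interior projection gap falls into the
    'large' class (inter-line or inter-word gaps)."""
    profile = img.sum(axis=axis)
    n = len(profile)
    gaps = [g for g in zero_runs(profile) if g[0] > 0 and g[0] + g[1] < n]
    if len(gaps) < 2:
        return [img]
    t = two_means_split([length for _, length in gaps])
    cuts = [s + l // 2 for s, l in gaps if l > t]
    bounds = [0] + cuts + [n]
    if axis == 1:   # row profile -> horizontal cuts -> text lines
        return [img[a:b, :] for a, b in zip(bounds, bounds[1:])]
    return [img[:, a:b] for a, b in zip(bounds, bounds[1:])]  # words

# Usage: lines first (row profile), then words within each line.
# page = binarized image as a 0/1 NumPy array (foreground = 1)
# words = [w for line in split_at_large_gaps(page, axis=1)
#            for w in split_at_large_gaps(line, axis=0)]
```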


Author(s): Oksana Andriivna Tatarinova, Vladislav Valerievich Ovsyanikov

The problem of computer recognition of both individual printed characters and whole texts, which may contain mathematical formulas, with subsequent saving of the resulting document in LaTeX format, is considered. The developed software can recognize printed Latin, Cyrillic, and Greek letters as well as special mathematical symbols. For this, a multilayer convolutional neural network built with the Keras machine learning library is used, together with additional validation heuristics. To improve recognition quality, a sophisticated image processing pipeline has been developed that removes noise from the image, corrects errors associated with character slant, and repairs character defects caused by the quality of the input image. Mechanisms are also implemented for assembling individual characters into words or mathematical formulas, reproducing the positions of subscripts and superscripts, and forming common fractions and expressions under the root sign. The recognized text is saved to a file while the LaTeX document structure is built simultaneously. To demonstrate the capabilities of the software, a graphical user interface has been added with which the input image can be selected and inspected before recognition starts. During testing, images of several types were recognized: purely textual images, mathematical formulas without text, and mathematical formulas interleaved with blocks of text.
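
A minimal sketch of the kind of multilayer convolutional classifier the abstract describes, using the Keras API; the input size, class count, and layer widths are illustrative assumptions, not the authors' architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 200  # assumption: Latin + Cyrillic + Greek letters + math symbols

def build_symbol_classifier(input_shape=(32, 32, 1)):
    """Small CNN mapping a normalized glyph image to a symbol class."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

model = build_symbol_classifier()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Downstream assembly of symbols into words, formulas, fractions, and index/exponent positions would then operate on the per-glyph predictions plus each glyph's bounding box.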


2021
Author(s): Shubham Pandey, Ayan Chandra, Sudeshna Sarkar, Uday Shankar

The Indian court system generates huge amounts of data relating to administration, pleadings, litigant behaviour, and court decisions on a regular basis. However, the existing judiciary is incapable of managing these vast troves of data efficiently, which causes delays and the pendency of a large volume of cases in the courts. Among the time-consuming tasks involved are case briefing and examining the legal issues, facts, legal principles, observations, and other significant aspects submitted by the contending parties in court. In other words, computational methods that understand the underlying structure of a case document would directly aid lawyers in performing these tasks and improve the overall efficiency of the justice delivery system. Computational techniques such as natural language processing can help gather and sift through these vast troves of information, identify patterns, extract the document structure, draft documents, and make the information available online. Traditionally, lawyers are trained to examine cases using the case law analysis approach to case briefing. In this article, the authors aim to establish the importance and relevance of the automated case analysis problem in the legal domain. They introduce a novel case analysis structure for Supreme Court judgment documents and define twelve case law labels that legal professionals use to identify that structure. Finally, the authors propose a method for automated case analysis, which will directly aid lawyers in preparing speedy and efficient case briefs and drastically reduce the time they spend in litigation.
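
The abstract does not enumerate the twelve labels, so the sketch below uses a purely illustrative label set and a plain TF-IDF classifier just to show the shape of the task: assigning one rhetorical-role label to each sentence of a judgment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical rhetorical-role labels -- the paper defines its own twelve.
train_sentences = [
    "The appellant filed the suit in the year 1998.",
    "Whether the notice served on the respondent was valid in law?",
    "Counsel contended that the contract was void ab initio.",
    "Reliance was placed on an earlier decision of this Court.",
    "Section 11 of the Act governs the limitation period.",
    "The appeal is accordingly dismissed with costs.",
]
train_labels = ["facts", "issue", "argument", "precedent", "statute", "ruling"]

# Sentence-level classification: one rhetorical-role label per sentence.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(train_sentences, train_labels)

print(clf.predict(["The respondent argued that no notice was ever served."]))
```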


2021, Vol 21 (S7)
Author(s): Tao Li, Ying Xiong, Xiaolong Wang, Qingcai Chen, Buzhou Tang

Abstract
Objective: Relation extraction (RE) is a fundamental task of natural language processing that draws plenty of attention from researchers, especially RE at the document level. We aim to explore an effective novel method for document-level medical relation extraction.
Methods: We propose SKEoG, a novel edge-oriented graph neural network based on document structure and external knowledge for document-level medical RE. This network is able to take full advantage of both document structure and external knowledge.
Results: We evaluate SKEoG on two public datasets, the Chemical-Disease Relation (CDR) dataset and the Chemical Reactions (CHR) dataset, comparing it with other state-of-the-art methods. SKEoG achieves the highest F1-score: 70.7 on the CDR dataset and 91.4 on the CHR dataset.
Conclusion: The proposed SKEoG method achieves new state-of-the-art performance. Both document structure and external knowledge bring performance improvements within the EoG framework. Selecting proper methods for knowledge-node representation is also important.
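
A toy NumPy sketch of the edge-oriented idea that SKEoG builds on: relations live on edge representations, which are refined by aggregating two-hop walks through intermediate nodes. The dimensions, gating, and interpolation below are illustrative simplifications; SKEoG additionally wires in document-structure and external-knowledge nodes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 16                       # toy graph: 5 nodes, 16-dim edge vectors
E = rng.normal(size=(N, N, D))     # edge representation e[i, j] for every pair
W = rng.normal(size=(D, D)) / np.sqrt(D)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def walk_aggregation(E, W, beta=0.8):
    """One edge-oriented update: edge (i, j) absorbs information from
    two-hop walks i -> k -> j over all intermediate nodes k, then is
    interpolated with its previous value (a simplification of EoG)."""
    N = E.shape[0]
    E_new = np.empty_like(E)
    for i in range(N):
        for j in range(N):
            walks = sigmoid(E[i] * (E[:, j] @ W))   # combine e[i,k] with e[k,j]
            E_new[i, j] = beta * E[i, j] + (1 - beta) * walks.mean(axis=0)
    return E_new

E = walk_aggregation(E, W)   # stack layers to widen each edge's receptive field
```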


2021, Vol 27 (4)
Author(s): Paul Pluta

A carefully written Validation Plan, with thoughtful discussion in each of its sections, ensures a logical and complete validation project. The following discussion proposes a simple and straightforward Validation Plan document structure that is applicable to all validation/qualification activities at a pharmaceutical manufacturing site.


Author(s): Hugh Cayless, Thibault Clérice, Jonathan Robie

Text Encoding Initiative documents are notoriously heterogeneous in structure, since the Guidelines are intended to permit the encoding of any type of text, from tax receipts written on papyrus to Shakespeare plays or novels. Citation Structures are a new feature in the TEI Guidelines that provides a way for documents to declare their own internal structure, along with a way to resolve citations conforming to that structure. This feature will allow systems like the Distributed Text Services (DTS) API, which process heterogeneous TEI documents, to handle tasks such as automated table-of-contents generation, extraction of structural metadata, and resolution of citations without prior knowledge of document structure.
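
A toy Python resolver illustrating the idea, not the DTS API: the (match, use) pairs below play the role of nested citation-structure declarations in simplified form, and the inline document stands in for a real TEI file, where the declaration would live in the teiHeader.

```python
from lxml import etree

# Toy TEI-like document.
doc = etree.fromstring(
    "<TEI><text><body>"
    "<div n='1'><p>First chapter, first paragraph.</p>"
    "<p>First chapter, second paragraph.</p></div>"
    "<div n='2'><p>Second chapter.</p></div>"
    "</body></text></TEI>")

# One (match, use) pair per citation level, mimicking nested declarations.
STRUCTURE = [("text/body/div", "@n"), ("p", "position()")]

def resolve(node, citation):
    """Resolve a dot-separated citation like '1.2' one level at a time."""
    for (match, use), part in zip(STRUCTURE, citation.split(".")):
        candidates = node.xpath(match)
        if use == "@n":
            node = next(el for el in candidates if el.get("n") == part)
        else:                        # position()-based level
            node = candidates[int(part) - 1]
    return node

print(resolve(doc, "1.2").text)      # -> First chapter, second paragraph.
```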


Author(s): C. M. Sperberg-McQueen

Ariadne is a query language intended to be powerful enough to allow domain experts to find interesting passages in their documents, yet simple enough for them to learn even if XPath and other expression languages are too complex. Its assumptions about document structure (elements have parents and are at least partially ordered) are compatible with XML and the XPath Data Model, but also with many non-XML models of text; Ariadne could thus serve as a query language for documents with overlapping structures.
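
A minimal sketch, in Python, of a document model satisfying just the two stated assumptions (every element has a parent; elements are only partially ordered), which is what leaves room for overlapping structures; the class and function names are illustrative, not part of Ariadne.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """An element under Ariadne's stated assumptions: it has a parent,
    and ordering between elements may be undecided."""
    name: str
    parent: Optional["Node"] = None
    start: Optional[int] = None     # character offsets, possibly unknown
    end: Optional[int] = None

def before(a: Node, b: Node) -> Optional[bool]:
    """Partial order: True/False when the offsets decide it, None when
    the nodes are incomparable (e.g. two overlapping spans)."""
    if a.end is not None and b.start is not None and a.end <= b.start:
        return True
    if b.end is not None and a.start is not None and b.end <= a.start:
        return False
    return None

def ancestors(n: Node):
    """Walk the parent chain, the other structural primitive assumed."""
    while n.parent is not None:
        n = n.parent
        yield n
```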


Author(s): Rachmad Fitriyanto

Information security methods for JPEG/Exif documents generally aim to prevent attacks by protecting documents with passwords and watermarks. Neither method can be used to determine the state of data integrity at the detection stage of the information security cycle. A message digest is a condensed representation of a file that can be used to represent its data integrity. This study aims to compose a message digest for detecting changes in JPEG/Exif documents. The research consists of five stages. In the first stage, the structure of the JPEG/Exif document is identified using the Boyer-Moore string matching algorithm to find the locations of the JPEG/Exif segments. The second stage is segment content acquisition, based on the segment locations and lengths obtained. In the third stage, a message digest is computed for each segment using the SHA-512 hash function. The fourth stage consists of JPEG/Exif document modification experiments to identify the segments affected by each modification. The fifth stage is selecting and combining the segment hash values into a message digest. The results show that the message digest for JPEG/Exif documents is composed of two parts: the hash value of the SOI segment and that of the APP1 segment. The SOI segment hash value is used to detect modifications such as JPEG-to-PNG conversion and image editing; the APP1 hash value is used to detect metadata editing; and the SOF0 hash value is used to detect modifications such as image recoloring, cropping, and resizing.
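
A minimal sketch of per-segment SHA-512 hashing for the SOI, APP1, and SOF0 segments named in the abstract; it walks the JPEG marker chain directly rather than using the paper's Boyer-Moore search, and ignores complications past the start-of-scan marker.

```python
import hashlib

MARKERS = {0xD8: "SOI", 0xE1: "APP1", 0xC0: "SOF0"}  # segments named in the study

def segment_digests(path):
    """SHA-512 digest per selected JPEG segment.

    Simplified marker scan: walks the segment chain up to the
    start-of-scan marker and stops before entropy-coded data.
    """
    data = open(path, "rb").read()
    digests, i = {}, 0
    while i < len(data) - 1:
        if data[i] != 0xFF or data[i + 1] == 0xFF:    # skip non-marker/fill bytes
            i += 1
            continue
        marker = data[i + 1]
        if marker == 0xD8:                            # SOI: bare two-byte marker
            digests["SOI"] = hashlib.sha512(data[i:i + 2]).hexdigest()
            i += 2
            continue
        if marker == 0xDA:                            # SOS: scan data follows
            break
        length = int.from_bytes(data[i + 2:i + 4], "big")
        if marker in MARKERS:
            digests[MARKERS[marker]] = hashlib.sha512(
                data[i:i + 2 + length]).hexdigest()
        i += 2 + length
    return digests

# print(segment_digests("photo.jpg"))  # {'SOI': ..., 'APP1': ..., 'SOF0': ...}
```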

