Engineering Information Into Open Documents
Documents are well suited to information exchange over the Internet. To ensure that there are no misunderstandings, the information embedded in a document needs to be precise and unambiguous. A (de facto) standard data model and conceptual information model ensures that the involved parties agree on what the information means. XML (eXtensible Markup Language) has become the de facto standard format for representing information in documents for document exchange. Many techniques have been proposed for creating XML documents, as well as for validating and transforming them. However, very little has been said about extracting information from non-XML documents and engineering that information into XML documents. The extraction process can be highly labor-intensive if it is done manually; automated tools make it more efficient. In this chapter, the author briefly surveys document engineering techniques for XML documents and then presents two techniques for extracting data from Windows documents into XML documents. These two techniques have been successfully applied in two industrial projects. The author believes that techniques that automate the extraction of data from non-XML documents into XML formats will enhance the use of XML documents.