Unsupervised information extraction from unstructured, ungrammatical data sources on the World Wide Web

Author(s):  
Matthew Michelson ◽  
Craig A. Knoblock
Author(s):  
Jan Korst ◽  
Gijs Geleijnse ◽  
Nick de Jong ◽  
Michael Verschoor

2009 ◽  
pp. 2510-2542
Author(s):  
Xuepeng Yin ◽  
Torben Bach Pedersen

In today’s OLAP systems, physically integrating fast-changing data (e.g., stock quotes) into a cube is complex and time-consuming. The data is likely to be available in XML format on the World Wide Web (WWW); thus, instead of physical integration, making XML data logically federated with OLAP systems is desirable. In this article, we extend previous work on the logical federation of OLAP and XML data sources by presenting simplified query semantics, a physical query algebra, and a robust OLAP-XML query engine, as well as the query evaluation techniques. Performance experiments with a prototypical implementation suggest that query performance for OLAP-XML federations is comparable to that of queries on physically integrated data.


Author(s):  
Xuepeng Yin ◽  
Torben Bach Pedersen

In today’s OLAP systems, physically integrating fast-changing data, for example, stock quotes, into a cube is complex and time-consuming. This data is likely to be available in XML format on the World Wide Web (WWW); thus, instead of physical integration, making XML data logically federated with OLAP systems is desirable. In this chapter, we extend previous work on the logical federation of OLAP and XML data sources by presenting simplified query semantics, a physical query algebra, and a robust OLAP-XML query engine, as well as the query evaluation techniques. Performance experiments with a prototypical implementation suggest that query performance for OLAP-XML federations is comparable to that of queries on physically integrated data.


Author(s):  
Sally Mohamed ◽  
Mahmoud Hussien ◽  
Hamdy M. Mousa

The World Wide Web holds a massive amount of varied information and data, and the number of Arabic users and the volume of Arabic content are growing rapidly. Information extraction is essential for accessing and sorting data on the web, but it is challenging for languages with complex morphology, such as Arabic. Consequently, the current trend is to build new corpora that make information extraction easier and more precise. This paper presents a linguistically analyzed Arabic corpus that includes dependency relations. The collected data covers five domains: sport, religion, weather, news, and biomedicine. The output is in the CoNLL universal lattice file format (CoNLL-UL). The corpus contains an index of the sentences and their linguistic metadata to enable quick mining and search across the corpus. It has seventeen morphological annotations and eight features based on the identification of textual structures, which help to recognize and understand the grammatical characteristics of the text and to derive the dependency relations. Parsing and dependency analysis were conducted with the universal dependency model and corrected manually. The designed Arabic corpus helps to quickly obtain linguistic annotations for a text and makes information extraction techniques easy and clear to learn. The results show an average improvement in the dependency relations of the corpus.


2009 ◽  
Author(s):  
Blair Williams Cronin ◽  
Ty Tedmon-Jones ◽  
Lora Wilson Mau
