Foundation of Keyword Search in XML

Author(s):  
Weidong Yang ◽  
Hao Zhu

It has become desirable to provide a way of keyword search for users to query structured information in an XML database (data-centric retrieval) by combining database and information retrieval techniques. Therefore, the key challenges of keyword search in the XML database are how to define appropriate result models meeting user’s search intents, how to search the results by using efficient algorithms, and how to ranking the results. In this chapter, on one hand, the authors present the foundational knowledge of XML keyword search such as XML data models, XML query languages, inverted index, and Dewey encoding. On the other hand, some existing typical researches of keyword search in XML are presented, including the results models such as Smallest Lowest Common Ancestor (SLCA), Exclusive Lowest Common Ancestor (ELCA), Meaningful Lowest Common Ancestor (MLCA), the related search algorithms, and the ranking approaches.

2011 ◽  
Vol 267 ◽  
pp. 811-815
Author(s):  
Ming Yan Shen ◽  
Xin Li ◽  
Xiang Fu Meng

The XML keyword search has been used widely in the application of XML documents. Most of the XML keyword search approaches are based on the LCA (lowest common ancestor) or its variants, which usually leads to the un-ideal recall and precision. This paper presents a novel XML keyword search method which based on semantic relatives. The method fully considers the semantic characteristics of the XML document structure. Based on the stack, the algorithm is also presented to merge the semantic relative nodes containing the keyword as the results of XML keyword search. The results of experiments have been identified the efficient and efficiency of our method.


2011 ◽  
Vol 1 (1) ◽  
pp. 1-18 ◽  
Author(s):  
Weidong Yang ◽  
Fei Fang ◽  
Nan Li ◽  
Zhongyu (Joan) Lu

Most existing XML stream processing systems adopt full structured query languages, such as XPath or XQuery, but they are difficult for ordinary users to learn and use. Keyword search is a user-friendly information discovery technique that has been extensively studied for text documents. This paper presents an XML stream filter system called XKFitler, which is the first system for supporting keyword search over XML stream. In XKFitler, the concepts of XLCA (eXclusive Lowest Common Ancestor) and XLCA Connecting Tree (XLCACT) are used to define the search semantic and results of keywords, and present an approach to filter XML stream according to keywords. The prototype XKFilter is implemented in the experiments.


Author(s):  
Weidong Yang ◽  
Fei Fang ◽  
Nan Li ◽  
Zhongyu (Joan) Lu

Most existing XML stream processing systems adopt full structured query languages, such as XPath or XQuery, but they are difficult for ordinary users to learn and use. Keyword search is a user-friendly information discovery technique that has been extensively studied for text documents. This paper presents an XML stream filter system called XKFilter, which is the first system for supporting keyword search over XML stream. In XKFilter, the concepts of XLCA (eXclusive Lowest Common Ancestor) and XLCA Connecting Tree (XLCACT) are used to define the search semantic and results of keywords, and present an approach to filter XML stream according to keywords. The prototype XKFilter is implemented in the experiments.


2012 ◽  
Vol 263-266 ◽  
pp. 1553-1558
Author(s):  
Quan Zhu Yao ◽  
Bing Tian ◽  
Wang Yun He

For XML documents, existing keyword retrieval methods encode each node with Dewey encoding, comparing Dewey encodings part by part is necessary in LCA computation. When the depth of XML is large, lots of LCA computations will affect the performance of keyword search. In this paper we propose a novel labeling method called Level-TRaverse (LTR) encoding, combine with the definition of the result set based on Exclusive Lowest Common Ancestor (ELCA),design a query Bottom-Up Level Algorithm(BULA).The experiments demonstrate this method improves the efficiency and the veracity of XML keyword retrieval.


Author(s):  
Weidong Yang ◽  
Hao Zhu

In this chapter, firstly, the LCA-based approaches for XML keyword search are analyzed and compared with each other. Several fundamental flaws of LCA-based models are explored, of which, the most important one is that the search results are eternally determined nonadjustable. Then, the chapter presents a system of adaptive keyword search in XML, called AdaptiveXKS, which employs a novel and flexible result model for avoiding these defects. Within the new model, a scoring function is presented to judge the quality of each result, and the considered metrics of evaluating results are weighted and can be updated as needed. Through the interface, the system administrator or the users can adjust some parameters according to their search intentions. One of three searching algorithms could also be chosen freely in order to catch specific querying requirements. Section 1 describes the Introduction and motivation. Section 2 defines the result model. In section 3 the scoring function is discussed deeply. Section 4 presents the system implementation and gives the detailed keyword search algorithms. Section 5 presents the experiments. Section 6 is the related work. Section 7 is the conclusion of this chapter.


Author(s):  
Weidong Yang ◽  
Hao Zhu

Most existing XML stream processing techniques adopt full structured query languages such as XPath or XQuery, which are difficult for ordinary users to learn and use. This chapter presents an XML stream filter system called XKFitler, which uses keyword to filter XML streams. In XKFitler, we use the concepts of XLCA (eXclusive Lowest Common Ancestor) and XLCA Connecting Tree (XLCACT) to define the search semantic and results of keywords, and present an approach to filter XML stream according to keywords. In section 1, the background of keyword search in XML streams is introduced. Section 2 explains the searching results. In section 3, a stack-based keyword searching algorithm for XML stream filtering without schemas is presented in-depth. Section 4 presents a keyword search over XML streams by using schema information. The system architecture of XKFilter is described in section 5. Section 6 is the experiments to show the performance. Section 7 discusses the related work. Section 8 is the summaries of this chapter.


Author(s):  
Dayananda P. ◽  
Sowmyarani C. N.

The size of semi-structured data is increasing continuously. Handling semi-structured data efficiently is a challenging task. Keyword search is an important task, and required information can be retrieved without having knowledge of data storage hierarchy. There are several challenges in handling XML data. This chapter discusses various challenges in terms of lowest common ancestor (LCA) semantics, processing of queries efficiently, retrieving top-k results for user needed data. The existing approach is defined under many classes based on how the problem and solution are tackled. Analysis of keyword search and ranking techniques for retrieving desired information are discussed in detail.


Genetics ◽  
2001 ◽  
Vol 157 (3) ◽  
pp. 1387-1395 ◽  
Author(s):  
Sudhir Kumar ◽  
Sudhindra R Gadagkar ◽  
Alan Filipski ◽  
Xun Gu

AbstractGenomic divergence between species can be quantified in terms of the number of chromosomal rearrangements that have occurred in the respective genomes following their divergence from a common ancestor. These rearrangements disrupt the structural similarity between genomes, with each rearrangement producing additional, albeit shorter, conserved segments. Here we propose a simple statistical approach on the basis of the distribution of the number of markers in contiguous sets of autosomal markers (CSAMs) to estimate the number of conserved segments. CSAM identification requires information on the relative locations of orthologous markers in one genome and only the chromosome number on which each marker resides in the other genome. We propose a simple mathematical model that can account for the effect of the nonuniformity of the breakpoints and markers on the observed distribution of the number of markers in different conserved segments. Computer simulations show that the number of CSAMs increases linearly with the number of chromosomal rearrangements under a variety of conditions. Using the CSAM approach, the estimate of the number of conserved segments between human and mouse genomes is 529 ± 84, with a mean conserved segment length of 2.8 cM. This length is <40% of that currently accepted for human and mouse genomes. This means that the mouse and human genomes have diverged at a rate of ∼1.15 rearrangements per million years. By contrast, mouse and rat are diverging at a rate of only ∼0.74 rearrangements per million years.


Sign in / Sign up

Export Citation Format

Share Document