The Index-Based XXL Search Engine for Querying XML Data with Relevance Ranking

Author(s):  
Anja Theobald ◽  
Gerhard Weikum
Author(s):  
Kamal Taha

There has been extensive research on XML Keyword-based and Loosely Structured querying. Some frameworks work well for certain types of XML data models but fail for others. The reason is that the proposed techniques overlook the context of elements when building relationships between them. The context of a data element is determined by its parent, because a data element is generally a characteristic of its parent. Overlooking the contexts of elements may produce relationships between elements that are semantically disconnected, which leads to erroneous results. In this chapter we present a context-driven search engine called XTEngine for answering XML Keyword-based and Loosely Structured queries. XTEngine treats each set of elements consisting of a parent and its child data elements as one unified entity, and then uses context-driven search techniques to determine the relationships between the different unified entities. We evaluated XTEngine experimentally and compared it with three other search engines. The results showed marked improvement.


2010 ◽  
Vol 7 (2) ◽  
pp. 1-11 ◽  
Author(s):  
Matthias Lange ◽  
Karl Spies ◽  
Joachim Bargsten ◽  
Gregor Haberhauer ◽  
Matthias Klapperstück ◽  
...  

Summary: Search engines and retrieval systems are popular tools on the life science desktop. Manually inspecting hundreds of database entries that reflect a life science concept or fact is time-intensive daily work. What matters here is not the number of query results but their relevance. In this paper, we present the LAILAPS search engine for life science databases. The concept combines a novel feature model for relevance ranking, a machine learning approach to modelling user relevance profiles, ranking improvement through user feedback tracking, and an intuitive, slim web user interface that estimates relevance rank by tracking user interactions. Queries are formulated as simple keyword lists and are expanded with synonyms. Because it supports a flexible text index and a simple data import format, LAILAPS can easily be used both as a search engine for comprehensive integrated life science databases and for small in-house project databases. From a set of features extracted from each database hit, combined with user relevance preferences, a neural network predicts user-specific relevance scores. Using expert knowledge as training data for a predefined neural network, or using users' own relevance training sets, a reliable relevance ranking of database hits has been implemented. In this paper, we present the LAILAPS system, its concepts, benchmarks and use cases. LAILAPS is publicly available for SWISSPROT data at http://lailaps.ipk-gatersleben.de
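The synonym expansion of keyword queries mentioned in the abstract can be illustrated with a minimal Python sketch. The synonym table, the function name `expand_query`, and the example terms are all hypothetical; the abstract does not say where LAILAPS obtains its synonyms or how it orders the expanded terms.

```python
from typing import Dict, List

# Hypothetical synonym table, for illustration only.
SYNONYMS: Dict[str, List[str]] = {
    "kinase": ["phosphotransferase"],
    "barley": ["hordeum vulgare"],
}


def expand_query(keywords: List[str],
                 synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return each keyword followed by its synonyms, lowercased and de-duplicated."""
    expanded: List[str] = []
    seen = set()
    for word in keywords:
        key = word.lower()
        # The keyword itself comes first, then any synonyms listed for it.
        for term in [key, *synonyms.get(key, [])]:
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return expanded


if __name__ == "__main__":
    print(expand_query(["Kinase", "barley"]))
```

Keywords without an entry in the table pass through unchanged, so the expanded list always contains at least the original query terms.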


Author(s):  
Cláudio Elízio Calazans Campelo ◽  
Cláudio de Souza Baptista ◽  
Ricardo Madeira Fernandes

It is well known that documents available on the Web are extremely heterogeneous in several respects, such as the use of various languages and different formats to represent content, in addition to external factors such as source reputation, refresh frequency, and so forth (Page & Brin, 1998). Together, these factors increase the complexity of Web information retrieval systems. Broadly speaking, the traditional search engines available on the Web today retrieve documents that contain keywords supplied by users. Nevertheless, among the variety of search possibilities, it is evident that users need a process involving more sophisticated analysis; for example, temporal or spatial contextualization might be considered. In these keyword-based search engines, for instance, a Web page containing the phrase “…due to the company arrival in London, a thousand java programming jobs will be open…” would not be found if the submitted search was “jobs programming England,” unless the word “England” appeared elsewhere on the page. The explanation is that the term “London” is treated as just another word, without regard to its geographical position. A spatial search engine would be expected to return the page described in the previous example, since the system would hold information indicating that the term “London” refers to a city located in the country referred to by the term “England.” A traditional search engine could produce this result only if the user repeatedly submitted searches for all possible sub-regions of England (e.g., cities). As the example suggests, it is reasonable to assume that for many user searches the most interesting results are those related to certain geographical regions.
A variety of feature extraction and automatic document classification techniques have been proposed; however, acquiring the geographical features of Web pages involves some particular complexities, such as ambiguity (e.g., many places with the same name, various names for a single place, things with place names, etc.). Moreover, a Web page can refer to a place that contains or is contained by the one given in the user query, which requires knowledge of the different region topologies used by the system. Many features related to geographical context can be added to the process of computing the relevance ranking of returned documents. For example, a document can be more relevant than another if its content refers to a place closer to the user's location. Nonetheless, spatial search engines raise more complex issues because of the spatial dimension involved in ranking. Jones, Alani, and Tudhope (2001) propose combining the Euclidean distance between place centroids with hierarchical distances to generate a hybrid spatial distance that may be used in the relevance ranking of returned documents. Further important issues are the indexing mechanisms and query processing. In general, these solutions try to combine well-known textual indexing techniques (e.g., inverted files) with spatial indexing mechanisms. With regard to the user interface, spatial search engines are more complex, because users need to choose regions of interest, as well as possible spatial relationships, in addition to keywords. To visualize the results, it is helpful to use digital map resources alongside textual information.
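The idea of a hybrid spatial distance can be sketched in Python as follows. This is an illustration under stated assumptions, not the formula of Jones, Alani, and Tudhope: the gazetteer, its planar centroid coordinates, the function names, and the simple weighted sum controlled by `weight` are all hypothetical. The hierarchical part counts the edges between two places in a containment tree, and `is_within` shows the containment lookup that lets a query for “England” match a page mentioning “London.”

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical gazetteer: place -> (containing place, centroid in arbitrary planar units).
GAZETTEER: Dict[str, Tuple[Optional[str], Tuple[float, float]]] = {
    "Europe": (None, (0.0, 0.0)),
    "England": ("Europe", (1.0, 1.0)),
    "London": ("England", (1.0, 2.0)),
    "France": ("Europe", (4.0, 5.0)),
    "Paris": ("France", (4.0, 6.0)),
}


def path_to_root(place: str) -> List[str]:
    """List the place and all regions that contain it, innermost first."""
    path: List[str] = []
    current: Optional[str] = place
    while current is not None:
        path.append(current)
        current = GAZETTEER[current][0]
    return path


def is_within(place: str, region: str) -> bool:
    """True if `region` is the place itself or one of its containing regions."""
    return region in path_to_root(place)


def hierarchical_distance(a: str, b: str) -> int:
    """Number of containment-tree edges between two places."""
    depth_in_b = {p: i for i, p in enumerate(path_to_root(b))}
    for steps_a, p in enumerate(path_to_root(a)):
        if p in depth_in_b:
            return steps_a + depth_in_b[p]
    raise ValueError("places share no common containing region")


def euclidean_distance(a: str, b: str) -> float:
    """Straight-line distance between the two place centroids."""
    return math.dist(GAZETTEER[a][1], GAZETTEER[b][1])


def hybrid_distance(a: str, b: str, weight: float = 0.5) -> float:
    """Assumed combination: weighted sum of centroid and hierarchical distance."""
    return weight * euclidean_distance(a, b) + (1.0 - weight) * hierarchical_distance(a, b)


if __name__ == "__main__":
    print(is_within("London", "England"))
    print(hybrid_distance("London", "Paris"))
```

A ranking function could then order documents by increasing `hybrid_distance` between the place in the query and the places mentioned in each document; in practice the two components would need to be normalized to comparable scales before being combined.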


2010 ◽  
Vol 7 (3) ◽  
Author(s):  
Matthias Lange ◽  
Karl Spies ◽  
Christian Colmsee ◽  
Steffen Flemming ◽  
Matthias Klapperstück ◽  
...  

Summary: Efficient and effective information retrieval in the life sciences is one of the most pressing challenges in bioinformatics. The incredible growth of life science databases into a vast network of interconnected information systems is both a big challenge and a great opportunity for life science research. The knowledge found on the Web, in particular in life science databases, is a valuable major resource. To bring it to the scientist's desktop, well-performing search engines are essential. Here, neither the response time nor the number of results is the decisive factor. The most crucial factor for millions of query results is the relevance ranking. In this paper, we present a feature model for relevance ranking in life science databases and its implementation in the LAILAPS search engine. Motivated by observing how users inspect search engine results, we condensed a set of 9 relevance-discriminating features. These features are used intuitively by scientists who briefly screen database entries for potential relevance. The features are both sufficient to estimate the potential relevance and efficiently quantifiable. Deriving a relevance prediction function that computes the relevance from these features constitutes a regression problem. To solve this problem, we used artificial neural networks that were trained with a reference set of relevant database entries for 19 protein queries. Supporting a flexible text index and a simple data import format, these concepts are implemented in the LAILAPS search engine. It can easily be used both as a search engine for comprehensive integrated life science databases and for small in-house project databases. LAILAPS is publicly available for SWISSPROT data at http://lailaps.ipk-gatersleben.de


2021 ◽  
Vol 13 (2) ◽  
pp. 31
Author(s):  
Cristòfol Rovira ◽  
Lluís Codina ◽  
Carlos Lopezosa

The visibility of academic articles or conference papers depends on their being easily found in academic search engines, above all in Google Scholar. To enhance this visibility, search engine optimization (SEO) has been applied in recent years to academic search engines in order to optimize documents and, thereby, ensure they are better ranked in search pages (i.e., academic search engine optimization or ASEO). To achieve this degree of optimization, we first need to further our understanding of Google Scholar’s relevance ranking algorithm, so that, based on this knowledge, we can highlight or improve those characteristics that academic documents already present and which are taken into account by the algorithm. This study seeks to advance our knowledge in this line of research by determining whether the language in which a document is published is a positioning factor in the Google Scholar relevance ranking algorithm. Here, we employ a reverse engineering research methodology based on a statistical analysis that uses Spearman’s correlation coefficient. The results obtained point to a bias in multilingual searches conducted in Google Scholar, with documents published in languages other than English being systematically relegated to positions that make them virtually invisible. This finding has important repercussions, both for conducting searches and for optimizing positioning in Google Scholar, and is especially critical for articles on subjects that are expressed in the same way in English and other languages, as is the case, for example, with trademarks, chemical compounds, industrial products, acronyms, drugs, diseases, etc.
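For readers unfamiliar with the statistic, Spearman’s correlation coefficient is the Pearson correlation computed on the ranks of two variables, with tied values sharing their average rank. The following Python sketch shows that computation; the function names are illustrative, and nothing here reproduces the study’s data or its exact procedure.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)
```

In a reverse engineering setting, `x` could hold the positions of documents in a results page and `y` the candidate ranking factor; a coefficient near +1 or -1 indicates a strong monotonic relationship, while a value near 0 indicates none.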


2010 ◽  
Author(s):  
Matthias Lange ◽  
Karl Spies ◽  
Christian Colmsee ◽  
Steffen Flemming ◽  
Matthias Klapperstück ◽  
...  

2012 ◽  
Vol 11 (1) ◽  
pp. 62-67
Author(s):  
Sunil Archak ◽  
Vikas Kumar

The National Bureau of Plant Genetic Resources (NBPGR), New Delhi, carries out the registration of unique crop germplasm in order to protect the intrinsic intellectual property as well as to facilitate greater utilization of germplasm in crop improvement programmes. It is therefore imperative to enhance access to information on registered crop germplasm. Here, we present the concept of a search engine that serves the dual functions of a Web-based and a portable search application. The concept entails converting raw data through a series of transformations from Microsoft Excel to the eXtensible Markup Language (XML) data format. The XML data, loaded in a compatible Web browser, are then queried for the search term using looping regular expression matching. The results are then displayed in the browser as tabulated output. The concept is implemented as the ‘Inventory of Registered Crop Germplasm’ on the Web as well as on portable media (compact disc or flash drive). The portable search engine has minimal hardware and software requirements, which enables its widespread use and ensures greater access to information on registered crop germplasm. The portable search engine can be obtained from the NBPGR, New Delhi, and the Web-based search engine can be accessed at http://www.nbpgr.ernet.in/IRCG/index.htm.
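The query step, looping over XML records and testing each field against a regular expression, can be sketched as follows. The sketch is written in Python for brevity, whereas the search engine described above runs in a Web browser; the element names (`inventory`, `germplasm`, `crop`, `trait`) and the sample records are hypothetical and do not come from the NBPGR inventory.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical sample data; the real inventory's schema is not given in the abstract.
XML_DATA = """<inventory>
  <germplasm><crop>Rice</crop><trait>Salt tolerance</trait></germplasm>
  <germplasm><crop>Wheat</crop><trait>Rust resistance</trait></germplasm>
  <germplasm><crop>Rice</crop><trait>Early maturity</trait></germplasm>
</inventory>"""


def search(xml_text: str, term: str) -> List[Dict[str, str]]:
    """Return every record in which any field contains the search term."""
    # Escape the term so user input is matched literally, ignoring case.
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    root = ET.fromstring(xml_text)
    hits: List[Dict[str, str]] = []
    for record in root.findall("germplasm"):
        fields = {child.tag: (child.text or "") for child in record}
        if any(pattern.search(value) for value in fields.values()):
            hits.append(fields)
    return hits


if __name__ == "__main__":
    # Print the matching records as simple tab-separated rows.
    for row in search(XML_DATA, "rice"):
        print("\t".join(row.values()))
```

Because the whole dataset is a single XML document scanned in memory, this approach needs no database server, which is consistent with the minimal hardware and software requirements the abstract emphasizes.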


2003 ◽  
Vol 62 (2) ◽  
pp. 121-129 ◽  
Author(s):  
Astrid Schütz ◽  
Franz Machilek

Research on personal home pages is still rare. Many studies to date are exploratory, and the problem of drawing a sample that reflects the variety of existing home pages has not yet been solved. The present paper discusses sampling strategies and suggests a strategy based on the results retrieved by a search engine. This approach is used to draw a sample of 229 personal home pages that portray private identities. Findings on age and sex of the owners and elements characterizing the sites are reported.

