Evaluation of Connectionist Information Retrieval in a Legal Document Collection

Author(s):  
R. A. Bustos ◽  
T. D. Gedeon
Author(s):  
Stan Ruecker

Everyone who has browsed the Internet is familiar with the problems involved in finding what they want. From the novice to the most sophisticated user, the challenge is the same: how to identify, quickly and reliably, the precise Web sites or other documents they seek from within an ever-growing collection of several billion possibilities? This is not a new problem. Vannevar Bush, the successful Director of the Office of Scientific Research and Development, which included the Manhattan Project, made a famous public call in The Atlantic Monthly in 1945 for the scientific community in peacetime to continue pursuing the style of fruitful collaboration they had experienced during the war (Bush, 1945). Bush advocated this approach to address the central difficulty posed by the proliferation of information beyond what any single expert could manage using contemporary methods of document management and retrieval. Bush’s vision is often cited as one of the early visions of the World Wide Web, with professional navigators trailblazing paths through the literature and leaving sets of linked documents behind them for others to follow.

Sixty years later, we have the professional indexers behind Google, providing the rest of us with a magic window into the data. We can type a keyword or two, pause for reflection, then hit the “I’m feeling lucky” button and see what happens. Technically, even though it often runs in a browser, this task is “information retrieval.” One of its fundamental tenets is that the user cannot manage the data and needs to be guided and protected through the maze by a variety of information hierarchies, taxonomies, indexes, and keywords.

Information retrieval is a complex research domain. The Association for Computing Machinery, arguably the largest professional organization for academic computing scientists, sponsors a periodic contest in information retrieval, where teams compete to see who has the most effective algorithms. The contest organizers choose or create a document collection, such as a set of a hundred thousand newspaper articles in English, and contestants demonstrate their software’s ability to find the most documents most accurately. Two of the measures are precision and recall: both are ratios, and they pull in opposite directions. Precision is the ratio of correctly identified documents to the total number of documents returned by the search. Recall is the ratio of relevant documents retrieved to the total number in the collection that should have been retrieved. It is therefore possible to score 100% on precision: just retrieve one document precisely on topic. However, the corresponding recall score would be a disaster. Similarly, an algorithm can score 100% on recall simply by retrieving every document in the collection. Again, the related precision score would be abysmal.

Fortunately, information retrieval is not the only technology available. For collections that contain only thousands of entries, there is no reason why people should not be allowed to simply browse the entire contents, rather than being limited to carrying out searches. Certainly, retrieval can be part of browsing; the two technologies are not mutually exclusive. However, by embedding retrieval within browsing, the user gains a significant number of perceptual advantages and new opportunities for action.
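Returning to the precision/recall trade-off described above, a minimal Python sketch (with invented document identifiers) makes the two ratios and their tension concrete:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Retrieving a single on-topic document: perfect precision, poor recall.
print(precision_recall(["d1"], ["d1", "d2", "d3", "d4"]))      # (1.0, 0.25)

# Retrieving the whole collection: perfect recall, poor precision.
collection = [f"d{i}" for i in range(1, 101)]
print(precision_recall(collection, ["d1", "d2", "d3", "d4"]))  # (0.04, 1.0)
```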


Author(s):  
Weiguo Fan ◽  
Praveen Pathak

The field of information retrieval deals with finding relevant documents in a large document collection or on the World Wide Web in response to a user’s query seeking relevant information. Ranking functions play a very important role in the retrieval performance of such retrieval systems and search engines. A single ranking function does not perform well across different user queries and document collections; hence it is necessary to “discover” a ranking function for a particular context. Adaptive algorithms like genetic programming (GP) are well suited to such discovery.
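The abstract does not give the authors' actual GP configuration, but the discovery idea can be sketched roughly as follows: candidate ranking functions are expression trees over document features, scored by retrieval quality on training data, with the fittest trees kept and mutated. The feature set, fitness measure, toy data, and mutation-only scheme below are all illustrative assumptions:

```python
import random

# Terminals: per-document features a ranking function can combine.
# This feature set is an assumption, not the authors' setup.
FEATURES = ["tf", "idf", "doclen"]
OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def random_tree(depth=3):
    """Grow a random arithmetic expression tree over the features."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(FEATURES)
    return (random.choice(list(OPS)),
            random_tree(depth - 1), random_tree(depth - 1))

def score(tree, doc):
    """Evaluate a tree as a ranking function on one document's features."""
    if isinstance(tree, str):
        return doc[tree]
    op, left, right = tree
    return OPS[op](score(left, doc), score(right, doc))

def fitness(tree, docs, relevant):
    """Precision at k (k = number of relevant docs) when ranking by the tree."""
    k = len(relevant)
    ranked = sorted(docs, key=lambda d: score(tree, d), reverse=True)
    return len({d["id"] for d in ranked[:k]} & relevant) / k

def mutate(tree):
    """Replace a random subtree with a freshly grown one."""
    if isinstance(tree, str) or random.random() < 0.3:
        return random_tree(depth=2)
    op, left, right = tree
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

# Toy training data: feature values and relevance labels are invented.
docs = [{"id": "d1", "tf": 3.0, "idf": 1.2, "doclen": 0.9},
        {"id": "d2", "tf": 1.0, "idf": 2.5, "doclen": 1.1},
        {"id": "d3", "tf": 5.0, "idf": 0.3, "doclen": 2.0},
        {"id": "d4", "tf": 2.0, "idf": 0.8, "doclen": 1.5}]
relevant = {"d1", "d2"}

# Evolve: keep the fittest trees, refill the population with their mutants.
# (Crossover is omitted for brevity.)
population = [random_tree() for _ in range(30)]
for generation in range(20):
    population.sort(key=lambda t: fitness(t, docs, relevant), reverse=True)
    population = population[:10] + [mutate(random.choice(population[:10]))
                                    for _ in range(20)]

best = max(population, key=lambda t: fitness(t, docs, relevant))
print(best, fitness(best, docs, relevant))
```

In a real system the fitness would be a full retrieval measure such as mean average precision computed over many training queries rather than precision at k on a single toy query.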


Information retrieval is a key technology for accessing the vast amount of data on today’s World Wide Web. Numerous challenges arise at various stages of information retrieval from the web, such as missed relevant documents, static user queries, and an ever-changing, enormous document collection. More powerful strategies are therefore required to search for relevant documents. In this paper, a particle swarm optimization (PSO) methodology hybridized with simulated annealing (SA) is proposed with the aim of optimizing the Web Information Retrieval (WIR) process. The hybridized PSO substantially reduces the query response time of the system and thus improves its efficiency. A novel similarity measure, SMDR, acts as the fitness function in the hybridized PSO-SA algorithm. Evaluation measures such as accuracy, MRR, MAP, DCG, IDCG, F-measure, and specificity are used to measure the effectiveness of the proposed system and to compare it with an existing system. Experiments are carried out extensively on the large RCV1 collection. The achieved precision-recall rates demonstrate the considerably improved effectiveness of the proposed system over the existing one.
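The SMDR measure is not specified in the abstract, so the following minimal sketch of a PSO loop with simulated-annealing acceptance uses a toy stand-in fitness function; the swarm size, coefficients, and cooling schedule are conventional assumed values, not the paper's:

```python
import math
import random

def fitness(position):
    """Stand-in fitness: the paper's SMDR measure is not available here,
    so this toy function just rewards positions near an arbitrary optimum."""
    target = [0.3, 0.7, 0.5]
    return -sum((p - t) ** 2 for p, t in zip(position, target))

DIM, SWARM, ITERS = 3, 10, 50
W, C1, C2 = 0.7, 1.5, 1.5          # standard PSO coefficients (assumed)
temp, cooling = 1.0, 0.95          # simulated-annealing schedule (assumed)

pos = [[random.random() for _ in range(DIM)] for _ in range(SWARM)]
vel = [[0.0] * DIM for _ in range(SWARM)]
pbest = [p[:] for p in pos]
gbest = max(pbest, key=fitness)

for _ in range(ITERS):
    for i in range(SWARM):
        for d in range(DIM):
            r1, r2 = random.random(), random.random()
            vel[i][d] = (W * vel[i][d]
                         + C1 * r1 * (pbest[i][d] - pos[i][d])
                         + C2 * r2 * (gbest[d] - pos[i][d]))
        candidate = [p + v for p, v in zip(pos[i], vel[i])]
        delta = fitness(candidate) - fitness(pos[i])
        # SA-style acceptance: always take improvements, and sometimes
        # accept worse moves, less often as the temperature cools.
        if delta > 0 or random.random() < math.exp(delta / temp):
            pos[i] = candidate
        if fitness(pos[i]) > fitness(pbest[i]):
            pbest[i] = pos[i][:]
    gbest = max(pbest, key=fitness)
    temp *= cooling

print("best position:", gbest, "fitness:", fitness(gbest))
```

The SA acceptance step is what lets particles occasionally escape local optima that plain PSO would settle into.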


Author(s):  
T. D. Gedeon ◽  
R. A. Bustos ◽  
B. J. Briedis ◽  
G. Greenleaf ◽  
A. Mowbray

2016 ◽  
Vol 78 (5-6) ◽  
Author(s):  
Jasman Pardede ◽  
Milda Gustiana Husada

The vector space model (VSM) is an information retrieval (IR) model that represents queries and documents as n-dimensional vectors. GVSM is an extension of VSM that represents documents by the similarity between the query and the minterm vector space of the document collection, where the minterm vectors are defined by the terms of the query; a document can therefore be retrieved on the basis of the meaning of the words in the query. However, documents may express the same information in semantically different words. LSI is a method implemented in IR systems to retrieve documents based on the overall meaning of the user’s query within a document, rather than on a word-by-word match. LSI uses a matrix algebra technique, Singular Value Decomposition (SVD). This study discusses the performance of VSM, GVSM, and LSI implemented in an IR system to retrieve Indonesian-language documents of the .pdf, .doc, and .docx file types, using the Nazief and Adriani stemming algorithm. Each method is implemented both with and without threads; threads are used in preprocessing, when reading each document from the collection, and in stemming both the query and the documents. Retrieval quality is evaluated by response time and by recall, precision, and F-measure. The results show that, for each method, the fastest execution is on .docx files, followed by .doc and .pdf. For the same document collection, LSI has the fastest response time, followed by GVSM and then VSM. The average recall values for VSM, GVSM, and LSI are 82.86%, 89.68%, and 84.93% respectively; the average precision values are 64.08%, 67.51%, and 62.08%; and the average F-measure values are 71.95%, 76.63%, and 71.02%. Multithreaded preprocessing improves the average response time of VSM, GVSM, and LSI by about 30.422%, 26.282%, and 31.821% respectively.
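As a rough illustration of the LSI component only (not the authors' Indonesian-document pipeline; stemming and file parsing are omitted, and the counts are invented), the following numpy sketch decomposes a toy term-document matrix with SVD, folds a query into the latent space, and ranks documents by cosine similarity there:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

# Truncated SVD: keep k latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Fold a query (term-count vector) into the latent space:
# q_hat = S_k^{-1} U_k^T q
q = np.array([1, 0, 1, 0], dtype=float)
q_hat = np.diag(1 / sk) @ Uk.T @ q

# Rank documents by cosine similarity in the latent space.
doc_vecs = Vtk.T                      # each row is a document in k-space
sims = doc_vecs @ q_hat / (np.linalg.norm(doc_vecs, axis=1)
                           * np.linalg.norm(q_hat) + 1e-12)
print("document similarities:", np.round(sims, 3))
```

Because similarity is computed in the reduced space, documents that share no literal query terms can still score highly, which is the semantic-matching behavior the abstract attributes to LSI.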


2014 ◽  
Vol 2014 ◽  
pp. 1-10 ◽  
Author(s):  
A. R. Rivas ◽  
E. L. Iglesias ◽  
L. Borrajo

Information retrieval focuses on finding documents whose content matches a user query in a large document collection. As formulating well-designed queries is difficult for most users, query expansion is necessary to retrieve relevant information. Query expansion techniques are widely applied to improve the effectiveness of textual information retrieval systems. These techniques help overcome vocabulary mismatch by expanding the original query with additional relevant terms and reweighting the terms in the expanded query. In this paper, different text preprocessing and query expansion approaches are combined to improve the set of documents initially retrieved by a query in a scientific document database. A corpus belonging to MEDLINE, called Cystic Fibrosis, is used as the knowledge source. Experimental results show that the proposed combinations of techniques greatly improve on the effectiveness of traditional queries.
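The abstract does not enumerate the specific expansion techniques combined, but the general expand-and-reweight idea can be sketched with a Rocchio-style pseudo-relevance feedback step; the alpha/beta weights, the top_n cut-off, and the toy feedback documents below are all assumptions:

```python
from collections import Counter

def expand_query(query_terms, feedback_docs, alpha=1.0, beta=0.75, top_n=3):
    """Rocchio-style pseudo-relevance feedback: boost the original terms and
    add the most frequent terms from the top-ranked (assumed relevant) docs.
    alpha/beta and top_n are conventional assumed values."""
    weights = Counter({t: alpha for t in query_terms})
    centroid = Counter()
    for doc in feedback_docs:
        centroid.update(doc)
    for term, freq in centroid.most_common(top_n):
        weights[term] += beta * freq / len(feedback_docs)
    return weights

# Toy example: terms from the two documents ranked highest by the initial
# query (the vocabulary nods to the Cystic Fibrosis corpus named above).
feedback = [["cystic", "fibrosis", "gene", "mutation"],
            ["fibrosis", "lung", "gene", "therapy"]]
print(expand_query(["cystic", "fibrosis"], feedback))
```

The returned weights would then replace the original query vector in a second retrieval pass, which is where the reweighting mentioned in the abstract takes effect.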


1999 ◽  
Author(s):  
Kazem Taghva ◽  
Thomas A. Nartker ◽  
Julie Borsack ◽  
Allen Condit

Author(s):  
John Zakos ◽  
Brijesh Verma

In this paper we present a novel technique for determining term importance by exploiting concept-based information found in ontologies. Calculating term importance is a significant and fundamental aspect of most information retrieval approaches, and it is traditionally determined through inverse document frequency (IDF). We propose concept-based term weighting (CBW), a technique that is fundamentally different from IDF in that it calculates term importance by intuitively interpreting the conceptual information in ontologies. We show that when CBW is used in an approach for web information retrieval on benchmark data, it performs comparably to IDF, with only a 3.5% degradation in retrieval accuracy. While this small degradation has been observed, the significance of this technique is that (1) unlike IDF, CBW is independent of document collection statistics, (2) it presents a new way of interpreting ontologies for retrieval, and (3) it introduces an additional source of term importance information that can be used for term weighting.
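The CBW weighting itself depends on ontology structure that the abstract does not detail, but the IDF baseline it is compared against is standard; a minimal sketch, with an invented toy collection:

```python
import math

def idf(term, docs):
    """Classic inverse document frequency: rarer terms score higher.
    idf(t) = log(N / df(t)), where df(t) = number of docs containing t."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

docs = [{"ontology", "retrieval"}, {"retrieval", "web"},
        {"ontology"}, {"web", "query"}]
print(idf("query", docs))      # rare term  -> higher weight (log 4 ≈ 1.386)
print(idf("retrieval", docs))  # common term -> lower weight (log 2 ≈ 0.693)
```

Note the dependence on collection statistics (N and df): this is exactly what CBW avoids by deriving term importance from the ontology instead.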


2006 ◽  
Vol 62 (3) ◽  
pp. 372-387 ◽  
Author(s):  
Tuomas Talvensaari ◽  
Jorma Laurikkala ◽  
Kalervo Järvelin ◽  
Martti Juhola
