Evaluation of Connectionist Information Retrieval in a Legal Document Collection

Author(s):  
R. A. Bustos ◽  
T. D. Gedeon
Author(s):  
Stan Ruecker

Everyone who has browsed the Internet is familiar with the problems involved in finding what they want. From the novice to the most sophisticated user, the challenge is the same: how to identify, quickly and reliably, the precise Web sites or other documents they seek from within an ever-growing collection of several billion possibilities? This is not a new problem. Vannevar Bush, the successful Director of the Office of Scientific Research and Development, which included the Manhattan Project, made a famous public call in The Atlantic Monthly in 1945 for the scientific community in peacetime to continue pursuing the style of fruitful collaboration they had experienced during the war (Bush, 1945). Bush advocated this approach to address the central difficulty posed by the proliferation of information beyond what any single expert could manage using contemporary methods of document management and retrieval. Bush’s vision is often cited as one of the early visions of the World Wide Web, with professional navigators trailblazing paths through the literature and leaving sets of linked documents behind them for others to follow.

Sixty years later, we have the professional indexers behind Google, providing the rest of us with a magic window into the data. We can type a keyword or two, pause for reflection, then hit the “I’m feeling lucky” button and see what happens. Technically, even though it often runs in a browser, this task is “information retrieval.” One of its fundamental tenets is that the user cannot manage the data and needs to be guided and protected through the maze by a variety of information hierarchies, taxonomies, indexes, and keywords.

Information retrieval is a complex research domain. The Association for Computing Machinery, arguably the largest professional organization for academic computing scientists, sponsors a periodic contest in information retrieval, where teams compete to see who has the most effective algorithms. The contest organizers choose or create a document collection, such as a set of a hundred thousand newspaper articles in English, and contestants demonstrate their software’s ability to find the most documents most accurately. Two of the measures are precision and recall: both are ratios, and they pull in opposite directions. Precision is the ratio of correctly identified documents to the total number of documents returned by the search. Recall is the ratio of relevant documents retrieved to the total number in the collection that should have been retrieved. It is therefore possible to score 100% on precision: just retrieve one document precisely on topic. However, the corresponding recall score would be a disaster. Similarly, an algorithm can score 100% on recall simply by retrieving every document in the collection. Again, the related precision score would be abysmal.

Fortunately, information retrieval is not the only technology available. For collections that contain only thousands of entries, there is no reason why people should not be allowed to simply browse the entire contents, rather than being limited to carrying out searches. Certainly, retrieval can be part of browsing; the two technologies are not mutually exclusive. However, by embedding retrieval within browsing, the user gains a significant number of perceptual advantages and new opportunities for action.
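Returning to the precision/recall trade-off described above, a minimal Python sketch (with invented document identifiers) makes the two ratios and their tension concrete:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Retrieving a single on-topic document: perfect precision, poor recall.
print(precision_recall(["d1"], ["d1", "d2", "d3", "d4"]))      # (1.0, 0.25)

# Retrieving the whole collection: perfect recall, poor precision.
collection = [f"d{i}" for i in range(1, 101)]
print(precision_recall(collection, ["d1", "d2", "d3", "d4"]))  # (0.04, 1.0)
```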


Author(s):  
Weiguo Fan ◽  
Praveen Pathak

The field of information retrieval deals with finding relevant documents in a large document collection or on the World Wide Web in response to a user’s query seeking relevant information. Ranking functions play a very important role in the retrieval performance of such retrieval systems and search engines. A single ranking function does not perform well across different user queries and document collections; hence it is necessary to “discover” a ranking function for a particular context. Adaptive algorithms like genetic programming (GP) are well suited to such discovery.
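The abstract does not give the authors' actual GP configuration, but the discovery idea can be sketched roughly as follows: candidate ranking functions are expression trees over document features, scored by retrieval quality on training data, with the fittest trees kept and mutated. The feature set, fitness measure, toy data, and mutation-only scheme below are all illustrative assumptions:

```python
import random

# Terminals: per-document features a ranking function can combine.
# This feature set is an assumption, not the authors' setup.
FEATURES = ["tf", "idf", "doclen"]
OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def random_tree(depth=3):
    """Grow a random arithmetic expression tree over the features."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(FEATURES)
    return (random.choice(list(OPS)),
            random_tree(depth - 1), random_tree(depth - 1))

def score(tree, doc):
    """Evaluate a tree as a ranking function on one document's features."""
    if isinstance(tree, str):
        return doc[tree]
    op, left, right = tree
    return OPS[op](score(left, doc), score(right, doc))

def fitness(tree, docs, relevant):
    """Precision at k (k = number of relevant docs) when ranking by the tree."""
    k = len(relevant)
    ranked = sorted(docs, key=lambda d: score(tree, d), reverse=True)
    return len({d["id"] for d in ranked[:k]} & relevant) / k

def mutate(tree):
    """Replace a random subtree with a freshly grown one."""
    if isinstance(tree, str) or random.random() < 0.3:
        return random_tree(depth=2)
    op, left, right = tree
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

# Toy training data: feature values and relevance labels are invented.
docs = [{"id": "d1", "tf": 3.0, "idf": 1.2, "doclen": 0.9},
        {"id": "d2", "tf": 1.0, "idf": 2.5, "doclen": 1.1},
        {"id": "d3", "tf": 5.0, "idf": 0.3, "doclen": 2.0},
        {"id": "d4", "tf": 2.0, "idf": 0.8, "doclen": 1.5}]
relevant = {"d1", "d2"}

# Evolve: keep the fittest trees, refill the population with their mutants.
# (Crossover is omitted for brevity.)
population = [random_tree() for _ in range(30)]
for generation in range(20):
    population.sort(key=lambda t: fitness(t, docs, relevant), reverse=True)
    population = population[:10] + [mutate(random.choice(population[:10]))
                                    for _ in range(20)]

best = max(population, key=lambda t: fitness(t, docs, relevant))
print(best, fitness(best, docs, relevant))
```

In a real system the fitness would be a full retrieval measure such as mean average precision computed over many training queries rather than precision at k on a single toy query.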


Information retrieval is a key technology for accessing the vast amount of data on today’s World Wide Web. Numerous challenges arise at various stages of information retrieval from the web, such as missed relevant documents, static user queries, and an ever-changing, enormous document collection. More powerful strategies are therefore required to search for relevant documents. In this paper, a particle swarm optimization (PSO) methodology hybridized with simulated annealing (SA) is proposed with the aim of optimizing the Web Information Retrieval (WIR) process. The hybridized PSO substantially reduces the query response time of the system and thus improves its efficiency. A novel similarity measure, SMDR, acts as the fitness function in the hybridized PSO-SA algorithm. Evaluation measures such as accuracy, MRR, MAP, DCG, IDCG, F-measure, and specificity are used to measure the effectiveness of the proposed system and to compare it with an existing system. Experiments are carried out extensively on the large RCV1 collection. The achieved precision-recall rates demonstrate the considerably improved effectiveness of the proposed system over the existing one.
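The SMDR measure is not specified in the abstract, so the following minimal sketch of a PSO loop with simulated-annealing acceptance uses a toy stand-in fitness function; the swarm size, coefficients, and cooling schedule are conventional assumed values, not the paper's:

```python
import math
import random

def fitness(position):
    """Stand-in fitness: the paper's SMDR measure is not available here,
    so this toy function just rewards positions near an arbitrary optimum."""
    target = [0.3, 0.7, 0.5]
    return -sum((p - t) ** 2 for p, t in zip(position, target))

DIM, SWARM, ITERS = 3, 10, 50
W, C1, C2 = 0.7, 1.5, 1.5          # standard PSO coefficients (assumed)
temp, cooling = 1.0, 0.95          # simulated-annealing schedule (assumed)

pos = [[random.random() for _ in range(DIM)] for _ in range(SWARM)]
vel = [[0.0] * DIM for _ in range(SWARM)]
pbest = [p[:] for p in pos]
gbest = max(pbest, key=fitness)

for _ in range(ITERS):
    for i in range(SWARM):
        for d in range(DIM):
            r1, r2 = random.random(), random.random()
            vel[i][d] = (W * vel[i][d]
                         + C1 * r1 * (pbest[i][d] - pos[i][d])
                         + C2 * r2 * (gbest[d] - pos[i][d]))
        candidate = [p + v for p, v in zip(pos[i], vel[i])]
        delta = fitness(candidate) - fitness(pos[i])
        # SA-style acceptance: always take improvements, and sometimes
        # accept worse moves, less often as the temperature cools.
        if delta > 0 or random.random() < math.exp(delta / temp):
            pos[i] = candidate
        if fitness(pos[i]) > fitness(pbest[i]):
            pbest[i] = pos[i][:]
    gbest = max(pbest, key=fitness)
    temp *= cooling

print("best position:", gbest, "fitness:", fitness(gbest))
```

The SA acceptance step is what lets particles occasionally escape local optima that plain PSO would settle into.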


Author(s):  
T. D. Gedeon ◽  
R. A. Bustos ◽  
B. J. Briedis ◽  
G. Greenleaf ◽  
A. Mowbray

2016 ◽  
Vol 78 (5-6) ◽  
Author(s):  
Jasman Pardede ◽  
Milda Gustiana Husada

The vector space model (VSM) is an information retrieval (IR) model that represents queries and documents as n-dimensional vectors. GVSM is an extension of VSM that represents documents by the similarity between the query and the minterm vector space of the document collection, where the minterm vectors are defined by the terms of the query; a document can therefore be retrieved on the basis of the meaning of the words in the query. However, documents may express the same information in semantically different words. LSI is a method implemented in IR systems to retrieve documents based on the overall meaning of the user’s query within a document, rather than on a word-by-word match. LSI uses a matrix algebra technique, Singular Value Decomposition (SVD). This study discusses the performance of VSM, GVSM, and LSI implemented in an IR system to retrieve Indonesian-language documents of the .pdf, .doc, and .docx file types, using the Nazief and Adriani stemming algorithm. Each method is implemented both with and without threads; threads are used in preprocessing, when reading each document from the collection, and in stemming both the query and the documents. Retrieval quality is evaluated by response time and by recall, precision, and F-measure. The results show that, for each method, the fastest execution is on .docx files, followed by .doc and .pdf. For the same document collection, LSI has the fastest response time, followed by GVSM and then VSM. The average recall values for VSM, GVSM, and LSI are 82.86%, 89.68%, and 84.93% respectively; the average precision values are 64.08%, 67.51%, and 62.08%; and the average F-measure values are 71.95%, 76.63%, and 71.02%. Multithreaded preprocessing improves the average response time of VSM, GVSM, and LSI by about 30.422%, 26.282%, and 31.821% respectively.
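As a rough illustration of the LSI component only (not the authors' Indonesian-document pipeline; stemming and file parsing are omitted, and the counts are invented), the following numpy sketch decomposes a toy term-document matrix with SVD, folds a query into the latent space, and ranks documents by cosine similarity there:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

# Truncated SVD: keep k latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Fold a query (term-count vector) into the latent space:
# q_hat = S_k^{-1} U_k^T q
q = np.array([1, 0, 1, 0], dtype=float)
q_hat = np.diag(1 / sk) @ Uk.T @ q

# Rank documents by cosine similarity in the latent space.
doc_vecs = Vtk.T                      # each row is a document in k-space
sims = doc_vecs @ q_hat / (np.linalg.norm(doc_vecs, axis=1)
                           * np.linalg.norm(q_hat) + 1e-12)
print("document similarities:", np.round(sims, 3))
```

Because similarity is computed in the reduced space, documents that share no literal query terms can still score highly, which is the semantic-matching behavior the abstract attributes to LSI.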


2014 ◽  
Vol 2014 ◽  
pp. 1-10 ◽  
Author(s):  
A. R. Rivas ◽  
E. L. Iglesias ◽  
L. Borrajo

Information retrieval focuses on finding documents whose content matches a user query in a large document collection. As formulating well-designed queries is difficult for most users, query expansion is necessary to retrieve relevant information. Query expansion techniques are widely applied to improve the effectiveness of textual information retrieval systems. These techniques help overcome vocabulary mismatch by expanding the original query with additional relevant terms and reweighting the terms in the expanded query. In this paper, different text preprocessing and query expansion approaches are combined to improve the set of documents initially retrieved by a query in a scientific document database. A corpus belonging to MEDLINE, called Cystic Fibrosis, is used as the knowledge source. Experimental results show that the proposed combinations of techniques greatly improve on the effectiveness of traditional queries.
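The abstract does not enumerate the specific expansion techniques combined, but the general expand-and-reweight idea can be sketched with a Rocchio-style pseudo-relevance feedback step; the alpha/beta weights, the top_n cut-off, and the toy feedback documents below are all assumptions:

```python
from collections import Counter

def expand_query(query_terms, feedback_docs, alpha=1.0, beta=0.75, top_n=3):
    """Rocchio-style pseudo-relevance feedback: boost the original terms and
    add the most frequent terms from the top-ranked (assumed relevant) docs.
    alpha/beta and top_n are conventional assumed values."""
    weights = Counter({t: alpha for t in query_terms})
    centroid = Counter()
    for doc in feedback_docs:
        centroid.update(doc)
    for term, freq in centroid.most_common(top_n):
        weights[term] += beta * freq / len(feedback_docs)
    return weights

# Toy example: terms from the two documents ranked highest by the initial
# query (the vocabulary nods to the Cystic Fibrosis corpus named above).
feedback = [["cystic", "fibrosis", "gene", "mutation"],
            ["fibrosis", "lung", "gene", "therapy"]]
print(expand_query(["cystic", "fibrosis"], feedback))
```

The returned weights would then replace the original query vector in a second retrieval pass, which is where the reweighting mentioned in the abstract takes effect.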


1999 ◽  
Author(s):  
Kazem Taghva ◽  
Thomas A. Nartker ◽  
Julie Borsack ◽  
Allen Condit

Author(s):  
John Zakos ◽  
Brijesh Verma

In this paper we present a novel technique for determining term importance by exploiting concept-based information found in ontologies. Calculating term importance is a significant and fundamental aspect of most information retrieval approaches, and it is traditionally determined through inverse document frequency (IDF). We propose concept-based term weighting (CBW), a technique that is fundamentally different from IDF in that it calculates term importance by intuitively interpreting the conceptual information in ontologies. We show that when CBW is used in an approach for web information retrieval on benchmark data, it performs comparably to IDF, with only a 3.5% degradation in retrieval accuracy. While this small degradation has been observed, the significance of this technique is that (1) unlike IDF, CBW is independent of document collection statistics, (2) it presents a new way of interpreting ontologies for retrieval, and (3) it introduces an additional source of term importance information that can be used for term weighting.
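The CBW weighting itself depends on ontology structure that the abstract does not detail, but the IDF baseline it is compared against is standard; a minimal sketch, with an invented toy collection:

```python
import math

def idf(term, docs):
    """Classic inverse document frequency: rarer terms score higher.
    idf(t) = log(N / df(t)), where df(t) = number of docs containing t."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

docs = [{"ontology", "retrieval"}, {"retrieval", "web"},
        {"ontology"}, {"web", "query"}]
print(idf("query", docs))      # rare term  -> higher weight (log 4 ≈ 1.386)
print(idf("retrieval", docs))  # common term -> lower weight (log 2 ≈ 0.693)
```

Note the dependence on collection statistics (N and df): this is exactly what CBW avoids by deriving term importance from the ontology instead.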


2006 ◽  
Vol 62 (3) ◽  
pp. 372-387 ◽  
Author(s):  
Tuomas Talvensaari ◽  
Jorma Laurikkala ◽  
Kalervo Järvelin ◽  
Martti Juhola
