Advances in Data Mining and Database Management - Developing a Keyword Extractor and Document Classifier

Published By IGI Global

9781799837725, 9781799837732

The text-mining process starts with a keyword search over text collections. Current text-processing technology supports search beyond simple Boolean queries by accepting natural language queries. Because search engines can recognize thousands of keywords and phrases but not the concepts behind the text, researchers need an automatic keyword extractor that generates a “Keyword List” for each document. This list can later act as the knowledge base for associating unorganized documents with meaningful classes. Failure to identify the keywords for a concept results in missing values or data for that concept.
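As a rough illustration of what a per-document “Keyword List” could look like, the sketch below ranks words by raw frequency after dropping stop words. It is not the extractor described in the chapter: the stop-word list, the function names `extract_keywords` and `build_keyword_lists`, and the frequency-only ranking are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "are",
    "for", "on", "with", "by", "that", "this", "it", "as",
}


def extract_keywords(text, top_n=5):
    """Return up to top_n (keyword, count) pairs, most frequent first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def build_keyword_lists(documents, top_n=5):
    """Map each document id to its keyword list (assumed input: dict id -> text)."""
    return {
        doc_id: [kw for kw, _ in extract_keywords(text, top_n)]
        for doc_id, text in documents.items()
    }
```

Frequency alone tends to promote common but uninformative words, so a production extractor would likely add weighting or phrase detection on top of this.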



A test blueprint/test template, also known as the table of specifications, represents the structure of a test. Assessment textbooks strongly recommend preparing a test from a test blueprint. This chapter focuses on modeling a dynamic test paper template using a multi-objective optimization algorithm and on using that template to generate examination test papers dynamically. Multi-objective optimization-based models are realistic models for many complex optimization problems. Modeling a dynamic test paper template, like many real-life problems, involves balancing multiple conflicting objectives while satisfying the template specifications.



Keywords can be used as attributes for mining rules or as a basis for measuring the similarity of new (unclassified) documents with existing (classified) ones. The focus is on the problem of extracting keywords from a document collection in order to use them as attributes for document classification. Document classification is a hot topic in machine learning. Typical approaches extract “features,” generally words, from documents and use the feature vectors as input to a machine learning scheme that learns how to classify documents. This “bag of keywords” model neglects keyword order and contextual effects.
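To make the “bag of keywords” idea concrete, the following sketch represents each document as a keyword-count vector and compares a new document with already classified ones by cosine similarity. Cosine similarity and the nearest-document rule are assumptions chosen for illustration; the chapter does not state which similarity measure or learning scheme it uses, and the function names are hypothetical.

```python
import math
from collections import Counter


def keyword_vector(keywords):
    """Bag of keywords: counts only, so keyword order is deliberately discarded."""
    return Counter(keywords)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_class(new_keywords, classified_docs):
    """classified_docs is an assumed list of (label, keywords) pairs."""
    new_vec = keyword_vector(new_keywords)
    best_label, best_score = None, 0.0
    for label, keywords in classified_docs:
        score = cosine_similarity(new_vec, keyword_vector(keywords))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```

Because the vectors hold only counts, `["text", "mining"]` and `["mining", "text"]` produce the same representation, which is exactly the loss of order the paragraph above describes.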



The generated report lists the automatically generated keywords for each document. A document may have any number of keywords. Because keywords can be generated at any pass of the loop, there is no restriction on the width of keywords. A second report lists the class assigned to each document. If a document matches more than one class (overlapping classes), its final class is selected on the basis of the maximum weight of the keywords in each class.
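The tie-breaking rule for overlapping classes can be sketched as follows. The example assumes that “maximum weight” means the largest sum of the weights of the document's keywords within a class; the text does not say whether weights are summed or compared individually, and the data layout and the name `select_class` are hypothetical.

```python
def select_class(doc_keywords, class_keyword_weights):
    """Pick the class whose keywords carry the most total weight for this document.

    class_keyword_weights is an assumed mapping: class name -> {keyword: weight}.
    Returns (best_class, totals), where totals holds every matching class.
    """
    totals = {}
    for cls, weights in class_keyword_weights.items():
        total = sum(weights.get(kw, 0) for kw in doc_keywords)
        if total > 0:  # the document matches this class through at least one keyword
            totals[cls] = total
    if not totals:
        return None, totals
    # Iterate in sorted order so ties are broken alphabetically and deterministically.
    best = max(sorted(totals), key=totals.get)
    return best, totals
```

Returning the full `totals` mapping alongside the winner keeps the overlap visible, which is useful when the second report needs to show why a class was chosen.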



Keywords are defined as phrases that capture the main topics discussed in a document. As they offer a brief yet precise summary of document content, they can be utilized for various applications. In an information retrieval (IR) environment, they indicate document relevance: a user can scan the list of keywords to decide quickly whether a given document is relevant to their interest. As keywords reflect a document's main topics, they can be utilized to classify documents into groups by measuring the overlap between the keywords assigned to them. Keywords are also used proactively in information retrieval (i.e., in indexing).
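One way to measure the overlap between two keyword sets is the Jaccard coefficient (shared keywords divided by all distinct keywords). The sketch below uses it to group documents in a single greedy pass. Both the choice of Jaccard and the threshold of 0.3 are assumptions for illustration, not values taken from the chapter.

```python
def keyword_overlap(keywords_a, keywords_b):
    """Jaccard coefficient between two keyword collections."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_documents(doc_keywords, threshold=0.3):
    """Greedy grouping: join the first group whose first member overlaps enough.

    doc_keywords is an assumed mapping: document id -> list of keywords.
    """
    groups = []
    for doc_id, keywords in doc_keywords.items():
        for group in groups:
            representative = doc_keywords[group[0]]
            if keyword_overlap(keywords, representative) >= threshold:
                group.append(doc_id)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([doc_id])
    return groups
```

The greedy pass depends on document order; a clustering algorithm would be a more robust choice when group quality matters.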



The success of any educational program depends on its evaluation system. Examinations are a part of the learning process and serve as an element of evaluation. A test paper generation process would help universities and academic institutions conduct examinations smoothly. However, examination test paper composition is a multi-constraint concurrent optimization problem. Question selection plays a key role in test paper generation systems and is also the most significant and time-consuming activity. Traditional test paper generation systems handle question selection by using a specified test paper format that lists the weightages to be allotted to each unit/module of the syllabus.



Reforms in the educational system place greater emphasis on continuous assessment. Compared with an objective test paper, a descriptive examination test paper is a better aid in continuous assessment for testing a student's progress at various cognitive levels and at different stages of learning. Unfortunately, instructors find the assessment of descriptive answers tedious and time-consuming because of the increased number of examinations in a continuous assessment system. This chapter attempts to address the problem of automatically evaluating descriptive answers using a vector-based similarity matrix with an order-based word-to-word syntactic similarity measure. The word order similarity measure remains one of the best measures of similarity between sequential words in sentences and is growing in popularity because of its simple interpretation and easy computation.
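A word order similarity measure can be sketched as follows. This uses one commonly cited formulation, in which each sentence is turned into a vector of word positions over the joint vocabulary and the score is 1 - ||r1 - r2|| / ||r1 + r2||. Whether the chapter uses exactly this formula is an assumption; the exact-match lookup, whitespace tokenization, and function names are simplifications introduced for the example.

```python
import math


def word_order_vector(joint_words, sentence_words):
    """Position (1-based) of each joint word in the sentence, or 0 if absent."""
    position = {}
    for idx, word in enumerate(sentence_words, start=1):
        position.setdefault(word, idx)  # keep the first occurrence of repeated words
    return [position.get(word, 0) for word in joint_words]


def word_order_similarity(sentence_1, sentence_2):
    """Return 1 - ||r1 - r2|| / ||r1 + r2|| over the joint word list."""
    words_1 = sentence_1.lower().split()
    words_2 = sentence_2.lower().split()
    # Joint word list: unique words in order of first appearance.
    joint = list(dict.fromkeys(words_1 + words_2))
    r1 = word_order_vector(joint, words_1)
    r2 = word_order_vector(joint, words_2)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    if total == 0:
        return 0.0  # both sentences are empty
    return 1.0 - diff / total
```

Two sentences with the same words in the same order score 1.0, while swapping words lowers the score even though the bag of words is unchanged. That sensitivity to order is what makes the measure useful for comparing a student's answer with a model answer.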



In this chapter, the authors discuss the features of the tool developed using the algorithms designed and implemented as part of their research work. They have named it the test paper generation system (TPGS). In some places, they use question paper generation system (QPGS) instead of its alias TPGS. The main modules of this tool are (1) test paper template generation, (2) question conflict detection, (3) test paper template-based question selection, (4) syllabus coverage evaluator for test paper, and (5) answer paper evaluator.



It is trivial to achieve a recall of 100% by returning all documents in response to any query. Recall alone is therefore not enough; one also needs to measure the number of non-relevant documents retrieved, for example by computing the precision. The analysis was performed on 30 documents to ensure the stability of the precision and recall values. The precision for large documents is observed to be lower than for moderate-length documents, in the sense that some unimportant keywords get extracted. This may be because such words occur frequently yet play an unimportant role in the sentence.
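The trade-off described above can be checked with the standard definitions: precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that are retrieved. The sketch below is a minimal illustration; the function name and the sample document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Returning every document guarantees full recall but drags precision down.
all_documents = [f"d{i}" for i in range(1, 11)]  # hypothetical collection of 10 documents
relevant_documents = ["d1", "d2"]
everything = precision_recall(all_documents, relevant_documents)
```

Here `everything` has a recall of 1.0 but a precision of only 0.2, which is why the two values are reported together. The same function applies to keyword extraction if the extracted keywords are treated as the retrieved set and the manually assigned keywords as the relevant set.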



A syllabus is a detailed instructional plan of materials, resources, teaching methods, and evaluation plans, primarily designed to inform students about the standards, requirements, and learning outcomes expected of them in the course. It also expresses an “informal agreement” between the instructor and the students to complete the delivery of the syllabus content during the course. A syllabus also communicates the coverage of its contents to other educational institutions so that they can determine whether the course is equivalent to a similar one they offer. A modularized syllabus contains weightages assigned to different units/modules of a subject. Different criteria, such as Bloom's taxonomy and learning outcomes, have been used for evaluating the syllabus coverage of a test paper.
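As a simple illustration of checking a test paper against module weightages, the sketch below compares the share of marks each module receives in a paper with the weightage the syllabus assigns to it. The coverage score used here (the sum, over modules, of the smaller of the expected and actual percentages) is a hypothetical measure chosen for the example, not the evaluator described in the chapter, and all names and data layouts are assumptions.

```python
def module_mark_shares(questions):
    """Percentage of total marks per module; questions is a list of (module, marks)."""
    total = sum(marks for _, marks in questions)
    shares = {}
    for module, marks in questions:
        shares[module] = shares.get(module, 0) + marks
    return {module: 100 * marks / total for module, marks in shares.items()}


def syllabus_coverage(syllabus_weightages, questions):
    """Hypothetical coverage score in [0, 100].

    syllabus_weightages maps module -> expected percentage (assumed to sum to 100).
    A module contributes the smaller of its expected and actual percentage, so
    both neglected and over-represented modules reduce the score.
    """
    actual = module_mark_shares(questions)
    return sum(
        min(expected, actual.get(module, 0))
        for module, expected in syllabus_weightages.items()
    )
```

A paper whose mark distribution matches the weightages exactly scores 100, and a paper that ignores a module loses that module's full weightage.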


