Text Mining Methods for Hierarchical Document Indexing

Author(s):  
Han-Joon Kim

We have recently seen a tremendous growth in the volume of online text documents from networked resources such as the Internet, digital libraries, and company-wide intranets. One of the most common and successful methods of organizing such huge amounts of documents is to hierarchically categorize documents according to topic (Agrawal, Bayardo, & Srikant, 2000; Kim & Lee, 2003). The documents indexed according to a hierarchical structure (termed ‘topic hierarchy’ or ‘taxonomy’) are kept in internal categories as well as in leaf categories, in the sense that documents at a lower category have increasing specificity. Through the use of a topic hierarchy, users can quickly navigate to any portion of a document collection without being overwhelmed by a large document space. As is evident from the popularity of Web directories such as Yahoo (http://www.yahoo.com/) and Open Directory Project (http://dmoz.org/), topic hierarchies have increased in importance as a tool for organizing or browsing a large volume of electronic text documents.

2021 ◽  
Author(s):  
Richard Robertson

Research problem: Digital libraries have invested significant resources in digitising and providing access to an increasing number of books. The various approaches taken to visualise digitised books online have the potential to affect the usability and usefulness of the book to the user. Previous usability studies focus on the digital library as a whole; this study narrows the focus to the digitised book, with the intention of identifying usability issues and investigating the effects a visualisation approach may have on users. Methodology: An anonymous survey was conducted, employing the Interaction Triptych Framework (ITF) to frame the relationships between the user and digitised books. Two examples of digitised books from the New Zealand Electronic Text Collection and the Internet Archive were used. Participants from library, archives and history fields, as well as general users, were invited to participate. Results: 132 participants began the survey, with 86 completing all of the required parts. Results suggest a slightly positive attitude towards the usability and usefulness of the examples, with Open Library rated higher for usability and both examples rated similarly for usefulness. Participant comments suggest many users appreciate features analogous to physical books, with regard to aesthetics, learnability and navigation, while for ease of use and reading, rich text appeared to be preferred over digital image based visualisation. Implications: Digital libraries need to continually strive to improve the usability and usefulness of digitised books to satisfy their users. Further research is suggested, creating prototypes and conducting user testing, to gain a deeper understanding of the relationship between users and digitised books online.


2019 ◽  
Vol 8 (2S11) ◽  
pp. 4076-4081

Clustering is the process of grouping similar data items; it is used in data mining to extract information from large datasets. A large volume of text documents containing personal information is generated daily in the form of digital libraries and repositories on the Internet. It is possible to access high-quality educational content and methods in an increasingly convenient manner. Although many smart tools have been applied to educational applications, only limited research demonstrates the educational effectiveness of such tools through experimental studies. Clustering organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters. A clustering method based on the K-Means algorithm is proposed in this paper. K-Means is an unsupervised algorithm that starts from randomly selected initial centroids and is used to cluster a highly unstructured and unlabeled document collection. The system will be evaluated using precision as a measure.
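The K-Means procedure this abstract refers to can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and squared Euclidean distance, neither of which the abstract specifies, and the names `tf_vectors`, `kmeans`, and `DOCS` are hypothetical. It is not the paper's system, only a small instance of the general algorithm.

```python
import random
from collections import Counter

# Toy corpus (an assumption for illustration; not data from the paper).
DOCS = [
    "apple banana apple",
    "banana apple fruit",
    "car engine wheel",
    "engine car road",
]

def tf_vectors(docs):
    """Turn each document into a term-frequency vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, init=None, seed=0, max_iter=100):
    """Cluster vectors into k groups and return one cluster label per vector.

    If init is None, the initial centroids are k randomly chosen documents;
    otherwise init lists the document indices to use as starting centroids.
    """
    if init is None:
        init = random.Random(seed).sample(range(len(vectors)), k)
    centroids = [list(vectors[i]) for i in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

vocab, vectors = tf_vectors(DOCS)
print(kmeans(vectors, k=2))
```

Because the starting centroids are random, different seeds can give different clusterings. That sensitivity is one reason a measure such as precision is needed to evaluate the result.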


Author(s):  
Han-Joon Kim

We have recently seen a tremendous growth in the volume of online text documents from networked resources such as the Internet, digital libraries, and company-wide intranets. One of the most common and successful methods of organizing such huge amounts of documents is to hierarchically categorize documents according to topic (Agrawal, Bayardo & Srikant, 2000; Kim & Lee, 2003). The documents indexed according to a hierarchical structure (termed ‘topic hierarchy’ or ‘taxonomy’) are kept in internal categories as well as in leaf categories, in the sense that documents at a lower category have increasing specificity. Through the use of a topic hierarchy, users can quickly navigate to any portion of a document collection without being overwhelmed by a large document space. As is evident from the popularity of web directories such as Yahoo (http://www.yahoo.com/) and Open Directory Project (http://www.dmoz.org/), topic hierarchies have increased in importance as a tool for organizing or browsing a large volume of electronic text documents. Currently, the topic hierarchies maintained by most information systems are manually constructed and maintained by human editors. The topic hierarchy should be continuously subdivided to cope with the high rate of increase in the number of electronic documents. For example, the topic hierarchy of the Open Directory Project has now reached about 590,000 categories. However, manually maintaining the hierarchical structure incurs several problems. First, such a manual task is prohibitively costly as well as time-consuming. Until now, large search portals such as Yahoo have invested significant time and money into maintaining their taxonomy, but obviously they will not be able to keep up with the pace of growth and change in electronic documents through such manual activity.
Moreover, for a dynamic networked resource (e.g., World Wide Web) that contains highly heterogeneous documents accompanied by frequent content changes, maintaining a ‘good’ hierarchy is fraught with difficulty, and oftentimes is beyond the human experts’ capabilities. Lastly, since human editors’ categorization decisions are not only highly subjective but also vary over time, it is difficult to maintain a reliable and consistent hierarchical structure. The above limitations require information systems that can provide intelligent organization capabilities with topic hierarchies. Related commercial systems include Verity Knowledge Organizer (http://www.verity.com/), Inktomi Directory Engine (http://www.inktomi.com/), and Inxight Categorizer (http://www.inxight.com/), which enable a browsable web directory to be automatically built. However, these systems did not address the (semi-)automatic evolving capabilities of organizational schemes and classification models at all. This is one of the reasons why the commercial taxonomy-based services do not tend to be as popular as their manually constructed counterparts, such as Yahoo.
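The structure described in this abstract, a topic hierarchy that keeps documents in internal categories as well as in leaf categories, can be sketched as a small tree. The class name `Category`, its methods, and the example paths are hypothetical and chosen only for illustration; the abstract does not describe an implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    """One node of a topic hierarchy; internal nodes may hold documents too."""
    name: str
    documents: list = field(default_factory=list)
    children: dict = field(default_factory=dict)

    def add(self, path, doc):
        """File doc under the category reached by following path from this node,
        creating any missing subcategories along the way."""
        node = self
        for name in path:
            if name not in node.children:
                node.children[name] = Category(name)
            node = node.children[name]
        node.documents.append(doc)

    def find(self, path):
        """Navigate to a category by its path; raises KeyError if it is absent."""
        node = self
        for name in path:
            node = node.children[name]
        return node

    def all_documents(self):
        """Documents filed here plus those in every more specific subcategory."""
        docs = list(self.documents)
        for child in self.children.values():
            docs.extend(child.all_documents())
        return docs

root = Category("root")
root.add(["Science"], "intro.txt")                # stored at an internal category
root.add(["Science", "Physics"], "quantum.txt")   # stored at a leaf category
root.add(["Arts"], "painting.txt")
print(root.find(["Science"]).all_documents())
```

Navigating by path touches only the categories along that path, which is what lets a user reach one portion of the collection without scanning the rest. Every category and every filing decision in this sketch is still manual, which is the maintenance burden the abstract describes.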


2011 ◽  
Vol 268-270 ◽  
pp. 697-700
Author(s):  
Rui Xue Duan ◽  
Xiao Jie Wang ◽  
Wen Feng Li

As the volume of online short text documents grows tremendously on the Internet, the task of organizing short texts well becomes increasingly urgent. However, traditional feature selection methods are not suitable for short text. In this paper, we propose a method that incorporates syntactic information for short text. It emphasizes features that have more dependency relations with other words. The SVM classifier and the Weka machine learning environment are used in our experiments. The experimental results show that by incorporating syntactic information into short text, we obtain more powerful features than with traditional feature selection methods such as DF and CHI. The precision of short text classification improved from 86.2% to 90.8%.
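The idea of emphasizing words that take part in more dependency relations can be sketched as follows. The weighting used here (one point per occurrence plus one per dependency arc touching the word) is an assumption made for illustration, since the abstract does not give the paper's formula. The function name `dependency_weights` is hypothetical, and the arcs are written by hand rather than produced by a parser.

```python
from collections import Counter

def dependency_weights(tokens, arcs):
    """Weight each word by its frequency plus its number of dependency relations.

    tokens: list of words in the short text.
    arcs:   list of (head_index, dependent_index) pairs over token positions.
    """
    # Count how many arcs touch each token position, as head or as dependent.
    degree = Counter()
    for head, dependent in arcs:
        degree[head] += 1
        degree[dependent] += 1

    # Each occurrence contributes 1 (plain term frequency) plus its arc count.
    weights = Counter()
    for position, token in enumerate(tokens):
        weights[token.lower()] += 1 + degree[position]
    return dict(weights)

# Hand-written parse of a short query (an assumption, not parser output):
# "flights" governs "cheap" and "Paris"; "Paris" governs "to".
tokens = ["cheap", "flights", "to", "Paris"]
arcs = [(1, 0), (1, 3), (3, 2)]
weights = dependency_weights(tokens, arcs)
print(weights)
```

With plain term frequency, all four words in this query would score the same. Under this sketch, the better-connected words "flights" and "Paris" score higher, and this extra signal is what the abstract credits for the gain in short-text classification.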


Author(s):  
Ranjan Karmakar

This article reports the concept of the digital library (DL), with its definitions, generic architecture, ethics, and the librarianship related to DLs. DLs are created by library professionals, publishers, government initiatives, societies, and higher educational institutions. Different types of files and file formats are created and stored in a DL. Uploading files involves copyright and intellectual property rights (IPR) issues, as one cannot upload someone's file without that person's permission. In the information and communication technology (ICT) environment, the Internet and the Web enable everyone to access e-content from anywhere at any time. Because of this, DL creators cannot take it for granted that they may upload such content and make it available online. IPR issues in the digital environment play a key role in identifying the respective authors, publishers, and content creators and obtaining their permission to upload digital content. DL and IPR issues are discussed together with digital rights issues.


Author(s):  
Iris Xie

For centuries, people have been used to printed materials. The emergence of the Internet brings dramatic changes to millions of people in terms of how they collect, organize, disseminate, access, and use information. Researchers (Chowdhury & Chowdhury, 2003; Lesk, 2005; Witten & Bainbridge, 2003) have identified the following factors that contributed to the birth of digital libraries:
1. Vannevar Bush’s pioneering concept and idea of Memex. Vannevar Bush (1945) wrote a classic article, “As We May Think,” which has had a major impact on the emergence of digital libraries. In the article, he described his Memex device, which was able to organize books, journals, and notes in different places by linked association. This associative linking was similar to what is known today as hypertext.
2. The advancement in computer and communication/network technology. The computer was first used to manage information. In the 1960s, the emergence of remote online information search services changed the way people access and search information. By the 1980s, people could remotely and locally access library catalogues via Online Public Access Catalogues (OPACs). The invention of the CD-ROM made it easy and cheap for users to access electronic information. Most importantly, Web technology started in 1990, and the subsequent appearance of Web browsers has enabled users to access digital information anywhere as long as there is an Internet connection. Web search engines offer an opportunity for millions of people to search full-text documents on the Web.
3. The development of libraries and library access. Since the creation of the Alexandrian library around 300 B.C., the size and number of libraries have grown phenomenally. The library catalogue has evolved from a card catalogue through three generations of online public access catalogues, beginning in the 1980s. Library materials have expanded from mainly printed resources to multimedia collections, such as images, videos, sound files, and so forth.
Simultaneously, the information explosion in the digital age makes it impossible for libraries to collect all of the available materials.


Author(s):  
Cavan McCarthy

Digital libraries (DL) can be characterized as the “high end” of the Internet, digital systems which offer significant quantities of organized, selected materials of the type traditionally found in libraries, such as books, journal articles, photographs and similar documents (Schwartz, 2000). They normally offer quality resources based on the collections of well-known institutions, such as major libraries, archives, historical and cultural associations (Love & Feather, 1998). The field of digital libraries is now firmly established as an area of study, with textbooks (Arms, 2000; Chowdhury & Chowdhury, 2003; Lesk, 1997); electronic journals from the US (D-Lib Magazine: http://www.dlib.org/) and the UK (Ariadne: http://www.ariadne.ac.uk/); even encyclopedia articles (McCarthy, 2004).


2011 ◽  
Vol 1 (3) ◽  
pp. 54-70 ◽  
Author(s):  
Abdullah Wahbeh ◽  
Mohammed Al-Kabi ◽  
Qasem Al-Radaideh ◽  
Emad Al-Shawakfa ◽  
Izzat Alsmadi

The information world is rich in documents in different formats or applications, such as databases, digital libraries, and the Web. Text classification is used to aid the search functionality offered by search engines and information retrieval systems in dealing with the large number of documents on the Web. Many research papers in the field of text classification have been applied to English, Dutch, Chinese, and other languages, whereas fewer have been applied to the Arabic language. This paper addresses the automatic classification of Arabic text documents. It applies text classification to Arabic language text documents using stemming as part of the preprocessing steps. Results showed that, without stemming, the support vector machine (SVM) classifier achieved the highest classification accuracy in the two test modes, with 87.79% and 88.54%. Stemming, on the other hand, negatively affected accuracy: the SVM accuracy in the two test modes dropped to 84.49% and 86.35%.


2013 ◽  
Vol 58 (3) ◽  
pp. 927-930 ◽  
Author(s):  
S. Kluska-Nawarecka ◽  
K. Regulski ◽  
M. Krzyżak ◽  
G. Leśniak ◽  
M. Gurda

This paper presents assumptions for a system of automatic cataloging and semantic searching of text documents. As an example, a document repository for metals processing technology was used. By using an ontological model, the system provides the user with a new approach to exploring database resources: easier and more intuitive information search. In current document storage systems, searching is often based only on keywords and descriptions created manually by the system administrator. The use of text mining methods, especially latent semantic indexing, allows automatic clustering of documents with respect to their content. The result of this clustering is integrated with the ontological model, making navigation through document resources intuitive and removing the need to create directories manually. Such an approach seems to be particularly useful when dealing with large repositories of unstructured documents from sources such as the Internet. This situation is very typical of searches for information and knowledge in the area of metallurgy, for example with regard to innovation and non-traditional suppliers of materials and equipment.

