Smart Learning in Document Categorization using Dynamic Learning

2019 ◽  
Vol 8 (2S11) ◽  
pp. 4076-4081

Clustering is the process of grouping similar data items and is widely used in data mining to extract information from available large datasets. A large volume of text documents containing personal information is generated on the Internet every day in the form of digital libraries and repositories. It is now possible to access high-quality educational content and methods in an increasingly convenient manner; however, although many intelligent tools have been applied to educational applications, only limited research demonstrates the educational effectiveness of such tools through experimental studies. Clustering organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters. This paper proposes a clustering method based on the K-Means algorithm. K-Means is an unsupervised algorithm that starts from randomly selected initial centroids and is used here to cluster a highly unstructured and unlabeled document collection. The system is evaluated using precision as the measure.
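As a rough illustration of the kind of pipeline the abstract describes, the sketch below clusters a handful of toy documents with TF-IDF features and K-Means. It assumes scikit-learn is available; the sample documents and the number of clusters are placeholders rather than details from the paper.

```python
# Minimal sketch of K-Means document clustering on TF-IDF features.
# Sample documents and k are illustrative only, not from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "digital libraries store large text collections",
    "k-means groups similar documents into clusters",
    "repositories on the internet grow every day",
    "unsupervised clustering needs no labeled data",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)            # sparse TF-IDF matrix

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)                     # cluster id per document

for doc, label in zip(documents, labels):
    print(label, doc)
```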

Author(s):  
Han-Joon Kim

We have recently seen a tremendous growth in the volume of online text documents from networked resources such as the Internet, digital libraries, and company-wide intranets. One of the most common and successful methods of organizing such huge amounts of documents is to hierarchically categorize documents according to topic (Agrawal, Bayardo, & Srikant, 2000; Kim & Lee, 2003). The documents indexed according to a hierarchical structure (termed ‘topic hierarchy’ or ‘taxonomy’) are kept in internal categories as well as in leaf categories, in the sense that documents at a lower category have increasing specificity. Through the use of a topic hierarchy, users can quickly navigate to any portion of a document collection without being overwhelmed by a large document space. As is evident from the popularity of Web directories such as Yahoo (http://www.yahoo.com/) and Open Directory Project (http://dmoz.org/), topic hierarchies have increased in importance as a tool for organizing or browsing a large volume of electronic text documents.
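A minimal sketch of how such a topic hierarchy might be represented, assuming a simple tree in which documents can be attached to internal categories as well as leaf categories; the category names and file names below are illustrative only, not taken from the chapter.

```python
# Sketch of a topic hierarchy ("taxonomy") where documents may be indexed
# at internal nodes as well as at leaves; lower categories are more specific.
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    documents: list = field(default_factory=list)   # docs indexed at this node
    children: list = field(default_factory=list)    # sub-categories

    def add_child(self, child):
        self.children.append(child)
        return child

    def all_documents(self):
        """Documents of this category plus all of its descendants."""
        docs = list(self.documents)
        for child in self.children:
            docs.extend(child.all_documents())
        return docs

root = Category("Computing")
ai = root.add_child(Category("Artificial Intelligence"))
ml = ai.add_child(Category("Machine Learning"))

ai.documents.append("survey_of_ai.txt")      # internal category holds a doc
ml.documents.append("kmeans_tutorial.txt")   # leaf category holds a doc

print(root.all_documents())  # browse the whole subtree from the root
```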


Author(s):  
Desislava Paneva-Marinova ◽  
Radoslav Pavlov

This chapter presents solutions for personalized observation and an enhanced learning experience in digital libraries (DLs) through special smart educational nooks. The main factors related to DL user experience and content usability are considered. During the user experience design, the users' needs, goals, preferences, and interests were carefully studied and became the starting point for developing the new DL functionality. The chapter demonstrates several educational nooks or their components, such as learning tools in a digital library for fashion objects, a smart learning corner in an iconographical art digital library, an analysis method based on an ontology of learning, and several educational games for art and culture of which the authors are co-developers.


Author(s):  
Iris Xie

For centuries, people have been accustomed to printed materials. The emergence of the Internet has brought dramatic changes to millions of people in terms of how they collect, organize, disseminate, access, and use information. Researchers (Chowdhury & Chowdhury, 2003; Lesk, 2005; Witten & Bainbridge, 2003) have identified the following factors that contributed to the birth of digital libraries: 1. Vannevar Bush’s pioneering concept of the Memex. Vannevar Bush (1945) wrote a classic article, “As We May Think,” which has had a major impact on the emergence of digital libraries. In the article, he described his Memex device, which was able to organize books, journals, and notes kept in different places by linked association. This associative linking was similar to what is known today as hypertext. 2. The advancement of computer and communication/network technology. The computer was first used to manage information. In the 1960s, the emergence of remote online information search services changed the way people access and search for information. By the 1980s, people could remotely and locally access library catalogues via Online Public Access Catalogues (OPACs). The invention of the CD-ROM made it easy and cheap for users to access electronic information. Most importantly, Web technology started in 1990, and the subsequent appearance of Web browsers has enabled users to access digital information anywhere, as long as there is an Internet connection. Web search engines offer millions of people the opportunity to search full-text documents on the Web. 3. The development of libraries and library access. Since the creation of the Alexandrian library around 300 B.C., the size and number of libraries have grown phenomenally. Library catalogues have evolved from card catalogues to three generations of online public access catalogues, beginning in the 1980s. Library materials have expanded from mainly printed resources to multimedia collections, including images, videos, sound files, and so forth. At the same time, the information explosion of the digital age makes it impossible for libraries to collect all of the available materials.


2011 ◽  
Vol 1 (3) ◽  
pp. 54-70 ◽  
Author(s):  
Abdullah Wahbeh ◽  
Mohammed Al-Kabi ◽  
Qasem Al-Radaideh ◽  
Emad Al-Shawakfa ◽  
Izzat Alsmadi

The information world is rich in documents in different formats and applications, such as databases, digital libraries, and the Web. Text classification is used to aid the search functionality offered by search engines and information retrieval systems in dealing with the large number of documents on the web. Much research within the field of text classification has been applied to English, Dutch, Chinese, and other languages, whereas less has been applied to the Arabic language. This paper addresses the issue of automatic classification of Arabic text documents, applying text classification to Arabic-language documents with stemming as part of the preprocessing steps. Results showed that when text classification was applied without stemming, the support vector machine (SVM) classifier achieved the highest classification accuracy in the two test modes, with 87.79% and 88.54%. Stemming, on the other hand, negatively affected accuracy: SVM accuracy in the two test modes dropped to 84.49% and 86.35%.
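The sketch below shows the general shape of the two setups that are compared: a TF-IDF plus linear SVM classifier, run once without stemming and once with a stemming step in preprocessing. It assumes scikit-learn; the toy English documents, labels, and the naive suffix-stripping "stemmer" are illustrative placeholders, not the Arabic corpus or stemmer used in the paper.

```python
# Sketch of comparing SVM text classification with and without stemming.
# The documents, labels, and naive_stem() are placeholders; a real system
# would use a language-appropriate (e.g. Arabic) stemmer and corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

docs = [
    "markets and trading news", "stock prices rise sharply",
    "the team won the final match", "players scored in the game",
] * 5
labels = ["economy", "economy", "sports", "sports"] * 5

def naive_stem(text):
    # placeholder stemmer: strips a common suffix before vectorization
    return " ".join(w[:-1] if w.endswith("s") else w for w in text.split())

for name, preprocessor in [("no stemming", None), ("stemming", naive_stem)]:
    model = make_pipeline(
        TfidfVectorizer(preprocessor=preprocessor),
        LinearSVC(),
    )
    acc = cross_val_score(model, docs, labels, cv=2).mean()
    print(f"{name}: accuracy = {acc:.2f}")
```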


2014 ◽  
Vol 35 (4/5) ◽  
pp. 293-307
Author(s):  
Mark Edward Phillips ◽  
Daniel Gelaw Alemneh ◽  
Brenda Reyes Ayala

Purpose – Increasingly, higher education institutions worldwide are accepting only electronic versions of their students’ theses and dissertations. These electronic theses and dissertations (ETDs) frequently feature embedded URLs in the body, footnotes, and references sections of the document. Additionally, the web as a subject of ETDs appears to be on an upward trajectory as the web becomes an increasingly important part of everyday life. The paper aims to discuss these issues. Design/methodology/approach – The authors analyzed URL references in 4,335 ETDs in the UNT ETD collection. Links were extracted from the full-text documents, cleaned and canonicalized, deconstructed into the subparts of a URL, and then indexed with the full-text indexer Solr. Queries to aggregate and generate overall statistics and trends were run against the Solr index. The resulting data were analyzed for patterns and trends within a variety of groupings. Findings – The share of ETDs at the University of North Texas that include URL references has increased over the past 14 years, from 23 percent in 1999 to 80 percent in 2012. URLs are included in the majority of cases: 62 percent of the publications analyzed in this work contained URLs. Originality/value – This research establishes that web resources are being widely cited in UNT's ETDs and that growth in citing these resources has been observed. Further, it provides a preliminary framework of technical methods appropriate for analyzing similar data, which may be applicable to other sets of documents or subject areas.
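A minimal sketch of the kind of processing described in the methodology: extract URLs from full text, canonicalize them, and break them into subparts. The regular expression, sample text, and cleanup rules are illustrative assumptions; the study's actual pipeline went on to index the parts with Solr.

```python
# Sketch of URL extraction, canonicalization, and decomposition into parts.
# The regex and cleanup rules are illustrative, not the study's exact rules.
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def canonicalize(url):
    url = url.rstrip(".,;")                  # strip trailing punctuation
    parts = urlsplit(url.lower())            # lowercasing is a simplification
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return parts._replace(netloc=host, fragment="").geturl()

text = """See http://www.Example.org/thesis/chapter1#top, and the dataset
at https://data.example.edu/sets/42."""

for raw in URL_RE.findall(text):
    url = canonicalize(raw)
    parts = urlsplit(url)
    print({"url": url, "scheme": parts.scheme,
           "host": parts.netloc, "path": parts.path})
```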


Author(s):  
Nilupulee Nathawitharana ◽  
Damminda Alahakoon ◽  
Sumith Matharage

Humans are used to expressing themselves in written language, and language provides a medium with which we can describe our experiences in detail, incorporating individuality. Even though documents provide a rich source of information, it becomes very difficult to identify, extract, summarize, and search them when vast amounts of documents are collected, especially over time. Document clustering is a technique that has been widely used to group documents based on the similarity of their content, represented by the words used. Once key groups are identified, further drill-down into sub-groupings is facilitated by hierarchical clustering. Clustering and hierarchical clustering are very useful when applied to numerical and categorical data, and cluster accuracy and purity measures exist to evaluate the outcomes of a clustering exercise. Although the same measures have been applied to text clustering, text clusters are based on words or terms which can be repeated across documents associated with different topics. Therefore, text data cannot be considered a direct ‘coding’ of a particular experience or situation, in contrast to numerical and categorical data, and term overlap is a very common characteristic in text clustering. In this paper we propose a new technique and methodology for capturing term overlap from text documents, highlight the different situations such overlap can signify, and discuss why such understanding is important for obtaining value from text clustering. Experiments were conducted using a widely used text document collection, where the proposed methodology allowed us to explore the term diversity of a given document collection and obtain clusters with minimal term overlap.
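To make the notion of term overlap concrete, the sketch below computes one simple overlap measure: the Jaccard overlap between the top TF-IDF terms of each cluster. This is an illustrative measure under assumed toy data, not the specific technique proposed in the paper.

```python
# Sketch of quantifying term overlap between text clusters via Jaccard
# similarity of each cluster's top TF-IDF terms. Toy data, illustrative only.
from itertools import combinations
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "bank interest rates and loans", "loans and mortgage interest",
    "river bank flooding after rain", "heavy rain floods the river bank",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
terms = np.array(vec.get_feature_names_out())

def top_terms(cluster, k=5):
    centroid = X[labels == cluster].mean(axis=0).A1   # mean TF-IDF weights
    return set(terms[np.argsort(centroid)[::-1][:k]])

for a, b in combinations(sorted(set(labels)), 2):
    ta, tb = top_terms(a), top_terms(b)
    jaccard = len(ta & tb) / len(ta | tb)
    print(f"clusters {a} vs {b}: shared terms {ta & tb}, overlap = {jaccard:.2f}")
```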


2020 ◽  
Vol 38 (02) ◽  
Author(s):  
TẠ DUY CÔNG CHIẾN

Question answering systems have been applied to many different fields in recent years, such as education, business, and surveys. The purpose of these systems is to automatically answer users' questions or queries about some problem. This paper introduces a question answering system built on a domain-specific ontology. The ontology, which contains the data and vocabularies related to the computing domain, is built from text documents in the ACM Digital Library. Consequently, the system answers only questions pertaining to information technology domains such as databases, networks, machine learning, etc. We use natural language processing methodologies and the domain ontology to build this system. To increase performance, we store the computing ontology in a graph database and use a NoSQL database for querying the ontology's data.
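The sketch below shows one way ontology concepts could be stored in a graph database and consulted to answer a simple relational question. Neo4j, its Python driver, and the Cypher queries are an assumed choice for illustration; the paper only states that a graph database and a NoSQL query layer were used, and the connection details and concepts are placeholders.

```python
# Sketch of storing ontology concepts in a graph database and answering a
# toy question by graph lookup. Neo4j/Cypher is an assumed choice here;
# the connection URI, credentials, and concepts are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # store two ontology concepts and a typed relation between them
    session.run(
        "MERGE (a:Concept {name: $s}) "
        "MERGE (b:Concept {name: $o}) "
        "MERGE (a)-[:REL {type: $r}]->(b)",
        s="machine learning", r="is_a", o="artificial intelligence",
    )
    # a toy "question": what concepts is 'machine learning' related to?
    result = session.run(
        "MATCH (a:Concept {name: $c})-[r:REL]->(b) "
        "RETURN r.type AS rel, b.name AS target",
        c="machine learning",
    )
    for record in result:
        print(record["rel"], "->", record["target"])

driver.close()
```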


Nowadays, clustering plays a vital role in big data, where it is very difficult to analyze and cluster large volumes of data. Clustering is a procedure for grouping similar data objects of a data set; a good clustering ensures high similarity within each cluster and low similarity between clusters. Clustering is used in statistical analysis, geographical mapping, biological cell analysis, and Google Maps. The various approaches to clustering include grid-based clustering, density-based clustering, hierarchical methods, and partitioning approaches. In this survey paper we focus on these algorithms for large datasets such as big data and report a comparison among them, using time complexity as the main metric for differentiating the algorithms.
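As a small illustration of the kind of comparison such a survey makes, the sketch below times a partitioning method (k-means), a density-based method (DBSCAN), and a hierarchical method (agglomerative clustering) on the same synthetic dataset. It assumes scikit-learn, and the dataset size and parameters are illustrative only.

```python
# Sketch of timing three clustering approaches on one synthetic dataset.
# Sizes and parameters are illustrative; real big-data comparisons would
# use far larger datasets and distributed implementations.
import time
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

X, _ = make_blobs(n_samples=5000, centers=5, random_state=0)

algorithms = {
    "k-means (partitioning)": KMeans(n_clusters=5, n_init=10, random_state=0),
    "DBSCAN (density-based)": DBSCAN(eps=0.5, min_samples=5),
    "agglomerative (hierarchical)": AgglomerativeClustering(n_clusters=5),
}

for name, algo in algorithms.items():
    start = time.perf_counter()
    algo.fit(X)
    print(f"{name}: {time.perf_counter() - start:.3f} s")
```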


Data Mining ◽  
2011 ◽  
pp. 199-219 ◽  
Author(s):  
Hsin-Chang Yang ◽  
Chung-Hong Lee

Recently, many approaches have been devised for mining various kinds of knowledge from texts. One important application of text mining is to identify themes and the semantic relations among these themes for text categorization. Traditionally, these themes were arranged in a hierarchical manner to achieve effective searching and indexing as well as easy comprehension for human beings, and the determination of category themes and their hierarchical structure was mostly done by human experts. In this work, we developed an approach to automatically generate category themes and reveal the hierarchical structure among them, and we used the generated structure to categorize text documents. A self-organizing map was trained on the document collection to form two feature maps; we then analyzed these maps to obtain the category themes and their structure. Although the test corpus contains documents written in Chinese, the proposed approach can be applied to documents written in any language, provided such documents can be transformed into a list of separated terms.
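The sketch below shows the basic mechanics of training a self-organizing map on TF-IDF document vectors and reading off which map node each document falls on, which is the raw material for deriving category themes. The `minisom` package is an assumed library choice, not the implementation used by the authors, and the documents are illustrative placeholders.

```python
# Sketch of training a self-organizing map (SOM) on document vectors.
# `minisom` is an assumed library; documents are placeholders.
from minisom import MiniSom
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "neural networks for text categorization",
    "self organizing maps cluster documents",
    "stock markets and financial news",
    "interest rates affect the economy",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

som = MiniSom(x=3, y=3, input_len=X.shape[1], sigma=1.0,
              learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, num_iteration=500)

# documents mapped to the same node suggest a shared category theme
for doc, vec in zip(docs, X):
    print(som.winner(vec), doc)
```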


1998 ◽  
Vol 3 (2) ◽  
pp. 90-95
Author(s):  
Kayvon Safavi ◽  
Martin A. Weinstock

Background: Increasingly, large collections of pre-existing data are being used to analyze the occurrence, burden, and health care resources directed to the management of various skin diseases. Objective: This article discusses a number of different types of large datasets along with their common uses. Various concerns about the use of this information are also discussed. Conclusion: Although large datasets provide significant statistical power with readily available data, there are significant concerns, particularly regarding data quality and statistical analysis. Readers need to be aware of how an investigator has addressed these issues. Furthermore, the profession needs to be cognizant of very legitimate public concerns regarding confidentiality of personal information.

