A Machine Learning Based Framework for Enterprise Document Classification

Author(s):  
Juris Rāts ◽  
Inguna Pede ◽  
Tatjana Rubina ◽  
Gatis Vītols

Keywords can be used as attributes for mining rules or as a basis for measuring the similarity of new (unclassified) documents to existing (classified) ones. The focus is on the problem of extracting keywords from a document collection in order to use them as attributes for document classification. Document classification is a hot topic in machine learning. Typical approaches extract “features,” generally words, from documents, and use the feature vectors as input to a machine learning scheme that learns how to classify documents. This “bag of keywords” model neglects keyword order and contextual effects.
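As a rough illustration of the "bag of keywords" idea, the following sketch builds keyword count vectors and labels a new document by its cosine similarity to already classified ones. It is not the framework described in the abstract: the vocabulary, the sample documents, the labels, and the function names (`keyword_vector`, `cosine_similarity`, `classify`) are all hypothetical and chosen only for the example.

```python
import math
import re
from collections import Counter

# Hypothetical keyword vocabulary and labelled documents (illustration only).
VOCABULARY = {"invoice", "payment", "contract", "party", "amount", "clause"}
LABELLED_DOCS = [
    ("Invoice for payment of the outstanding amount", "finance"),
    ("Contract clause binding each party", "legal"),
]


def keyword_vector(text, vocabulary):
    """Count vocabulary keywords in the text; word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in vocabulary)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, labelled_docs, vocabulary):
    """Return the label and score of the most similar classified document."""
    query = keyword_vector(text, vocabulary)
    best_label, best_score = None, -1.0
    for doc_text, label in labelled_docs:
        score = cosine_similarity(query, keyword_vector(doc_text, vocabulary))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


label, score = classify("Payment amount due on invoice", LABELLED_DOCS, VOCABULARY)
print(label, round(score, 3))
```

Because the vectors hold only counts, "payment invoice" and "invoice payment" map to the same vector, which is the loss of keyword order that the abstract points out.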


2020 ◽  
Vol 38 (02) ◽  
Author(s):  
TẠ DUY CÔNG CHIẾN

Ontologies have been applied in many areas in recent years, such as information retrieval, information extraction, and text document classification. The purpose of a domain-specific ontology is to enrich the identification of concepts and their interrelationships. In our research, we use an ontology to specify a set of generic subjects (concepts) that characterize the domain, as well as their definitions and interrelationships. This paper introduces a system for labeling the subjects of text documents based on the differential layers of a domain-specific ontology, which contains the information and vocabularies related to the computer domain. A document can contain several subjects, such as data science, databases, and machine learning. The subjects in text document classification are determined based on the differential layers of the domain-specific ontology. We combine Natural Language Processing methods with the domain ontology to determine the subjects in a text document. To increase performance, we use a graph database to store and access the ontology. In addition, the paper evaluates our proposed algorithm against several other methods. Experimental results show that our proposed algorithm yields performance significantly


The number of e-learning websites and the volume of e-content have been increasing exponentially over the years, and it often becomes cumbersome for a learner to find e-content suitable for learning, as the learner is overwhelmed by the sheer amount of content available. The proposed work focuses on evaluating the efficiency of different classification algorithms for identifying e-learning content by difficulty level. The data is collected from many e-learning websites through web scraping. The web scraper downloads the web pages and parses them into text files. The text files are run through many machine learning classification algorithms to find the classification model that achieves the highest score with minimum training and testing time. This method helps to understand the performance of different text classification algorithms on e-learning content and identifies the classifier with high accuracy for document classification.
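The comparison of classifiers by score and by training and testing time can be sketched as a small benchmark harness. The abstract does not name the algorithms or the dataset, so everything below is an assumption made for illustration: the two toy classifiers (`MajorityClassifier`, `WordOverlapClassifier`), the `benchmark` function, and the sample texts with their difficulty labels are hypothetical stand-ins, not the models evaluated in the paper.

```python
import time
from collections import Counter, defaultdict


class MajorityClassifier:
    """Baseline: always predicts the most frequent training label."""

    def fit(self, texts, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.label for _ in texts]


class WordOverlapClassifier:
    """Predicts the label whose training words overlap most with the text."""

    def fit(self, texts, labels):
        self.words = defaultdict(set)
        for text, label in zip(texts, labels):
            self.words[label].update(text.lower().split())
        return self

    def predict(self, texts):
        predictions = []
        for text in texts:
            tokens = set(text.lower().split())
            best = max(self.words, key=lambda lab: len(self.words[lab] & tokens))
            predictions.append(best)
        return predictions


def benchmark(classifiers, train, test):
    """Record accuracy, training time, and testing time for each classifier."""
    (train_x, train_y), (test_x, test_y) = train, test
    results = {}
    for name, clf in classifiers.items():
        start = time.perf_counter()
        clf.fit(train_x, train_y)
        fit_s = time.perf_counter() - start

        start = time.perf_counter()
        predicted = clf.predict(test_x)
        predict_s = time.perf_counter() - start

        accuracy = sum(p == y for p, y in zip(predicted, test_y)) / len(test_y)
        results[name] = {"accuracy": accuracy, "fit_s": fit_s, "predict_s": predict_s}
    return results


# Hypothetical scraped texts with difficulty labels (illustration only).
TRAIN = (
    ["what is a variable", "how to print hello world", "introduction to loops",
     "proof of convergence for stochastic gradient descent",
     "amortized analysis of splay trees"],
    ["beginner", "beginner", "beginner", "advanced", "advanced"],
)
TEST = (
    ["print a variable in loops", "analysis of gradient descent convergence"],
    ["beginner", "advanced"],
)

results = benchmark(
    {"majority": MajorityClassifier(), "overlap": WordOverlapClassifier()},
    TRAIN,
    TEST,
)
for name, row in results.items():
    print(name, row)
```

Any object with `fit` and `predict` methods can be dropped into the dictionary passed to `benchmark`, so real text classifiers could be timed the same way.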


Classification is one of the most important techniques for supervised and semi-supervised machine learning tasks. Many classification algorithms have already been introduced for existing systems. Class-label classification is an important machine learning task wherein one assigns a subset of candidate labels to an unlabeled object. Existing techniques surveyed in the literature classify documents based on short text, metadata, and heading levels. Reading and processing the whole data can take considerable time, which increases the time complexity of the entire system. We propose a new document classification method based on deep learning, using NLP and a machine learning approach. The system has several attractive properties: it first captures metadata from the entire abstract section and builds the training set. Once all documents have been processed, it applies an optimization algorithm. A Recurrent Neural Network is used to categorize individual objects according to their weights, and it provides the final class label for the entire test dataset. Based on various experimental analyses, the system provides improved classification accuracy as well as lower time complexity than classical machine learning algorithms.

