Cross-Testing a Genre Classification Model for the Web

Author(s):  
Marina Santini
2021 ◽  
Author(s):  
Serge Sharoff

This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as enTenTen, ukWac or ruWac, by means of automatic text classification. First, the paper presents a Deep Learning model suitable for classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting. Second, it describes the results of applying the automatic classification model to these corpora and compares their composition. Finally, the paper introduces a framework for interpreting the results of automatic genre classification using linguistic features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across languages.


2020 ◽  
Vol 10 (18) ◽  
pp. 6527 ◽  
Author(s):  
Omar Sharif ◽  
Mohammed Moshiul Hoque ◽  
A. S. M. Kayes ◽  
Raza Nowrozy ◽  
Iqbal H. Sarker

Due to the substantial growth in the number of internet users and their ready access via electronic devices, the amount of electronic content has grown enormously in recent years through instant messaging, social networking posts, blogs, online portals and other digital platforms. Unfortunately, the misuse of technology has increased with this rapid growth of online content, leading to a rise in suspicious activities. People misuse web media to disseminate malicious material, carry out illegal activities, abuse other people, and publicize suspicious content on the web. Suspicious content is usually available in the form of text, audio, or video, with text being used in most cases. Thus, one of the most challenging issues for NLP researchers is to develop a system that can efficiently identify suspicious text within specific content. In this paper, a Machine Learning (ML)-based classification model (hereafter called STD) is proposed to classify Bengali text into non-suspicious and suspicious categories based on its original content. A set of ML classifiers with various features was applied to our developed corpus of 7000 Bengali text documents, of which 5600 were used for training and 1400 for testing. The performance of the proposed system is compared with the human baseline and existing ML techniques. The SGD classifier with tf-idf features over a combination of unigrams and bigrams achieves the highest accuracy, 84.57%.
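To make the feature setup in this abstract concrete, the following is a minimal sketch of tf-idf weighting over combined unigrams and bigrams, followed by a linear classifier trained with stochastic gradient descent on the hinge loss. It is not the authors' implementation: the function names, the whitespace tokenisation, the learning rate, the number of epochs, and the +1/-1 label encoding are all assumptions made for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs, n_max=2):
    """Map each whitespace-tokenised document to a {feature: tf-idf weight} dict."""
    term_counts = [Counter(ngrams(doc.split(), n_max)) for doc in docs]
    # Document frequency: in how many documents each feature occurs.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    n_docs = len(docs)
    vectors = []
    for counts in term_counts:
        total = sum(counts.values())
        vectors.append({term: (count / total) * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return vectors

def train_sgd(vectors, labels, epochs=20, lr=0.5):
    """Train a linear classifier with SGD on the hinge loss (labels are +1 or -1)."""
    weights = Counter()
    bias = 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            score = bias + sum(weights[t] * v for t, v in vec.items())
            if y * score < 1:  # margin violated: take a gradient step
                for t, v in vec.items():
                    weights[t] += lr * y * v
                bias += lr * y
    return weights, bias

def predict(weights, bias, vec):
    """Return +1 or -1 for a feature dict produced by tfidf_vectors."""
    score = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
    return 1 if score >= 0 else -1
```

In this sketch a feature that occurs in every document receives a weight of zero, so only features that distinguish documents contribute to the decision.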


Mathematics ◽  
2021 ◽  
Vol 9 (18) ◽  
pp. 2274
Author(s):  
Lvyang Qiu ◽  
Shuyu Li ◽  
Yunsick Sung

With unlabeled music data widely available, it is necessary to build an unsupervised latent music representation extractor to improve the performance of classification models. This paper proposes an unsupervised latent music representation learning method based on a deep 3D convolutional denoising autoencoder (3D-DCDAE) for music genre classification, which aims to learn common representations from a large amount of unlabeled data to improve the performance of music genre classification. Specifically, unlabeled MIDI files are applied to 3D-DCDAE to extract latent representations by denoising and reconstructing input data. Next, a decoder is utilized to assist the 3D-DCDAE in training. After 3D-DCDAE training, the decoder is replaced by a multilayer perceptron (MLP) classifier for music genre classification. Through the unsupervised latent representations learning method, unlabeled data can be applied to classification tasks so that the problem of limiting classification performance due to insufficient labeled data can be solved. In addition, the unsupervised 3D-DCDAE can consider the musicological structure to expand the understanding of the music field and improve performance in music genre classification. In the experiments, which utilized the Lakh MIDI dataset, a large amount of unlabeled data was utilized to train the 3D-DCDAE, obtaining a denoising and reconstruction accuracy of approximately 98%. A small amount of labeled data was utilized for training a classification model consisting of the trained 3D-DCDAE and the MLP classifier, which achieved a classification accuracy of approximately 88%. The experimental results show that the model achieves state-of-the-art performance and significantly outperforms other methods for music genre classification with only a small amount of labeled data.


2019 ◽  
Vol 8 (2S11) ◽  
pp. 2011-2016

With the boom in the number of web pages, it is very hard to find the desired information easily and quickly among the large number of web pages retrieved by a search engine. There is an increasing need for automatic classification techniques with higher classification accuracy. There are situations today in which it is vital to have an efficient and reliable classification of a web page from the information contained in the URL (Uniform Resource Locator) only, without the need to visit the page itself. For various reasons, we want to know whether the URL can be used without having to look at and visit the page. Fetching the web page content and sorting it to discover the genre of the web page is very time consuming and requires the user to know the structure of the web page to be classified. To avoid this time-consuming process, we propose an alternative method that helps obtain the genre of the entered URL based on the URL itself and the metadata, i.e., the description and keywords used in the website along with the title of the website. This approach does not rely only on the URL but also on content from the web application. The proposed system can be evaluated using various available datasets.
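One way to picture the input described in this abstract is as a bag of tokens drawn from the URL and from the page metadata (title, description, keywords). The sketch below is only an illustration under that assumption; the function names, the field prefixes, and the tokenisation rule are hypothetical and are not taken from the paper.

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split a URL's host and path into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    text = f"{parsed.netloc} {parsed.path}"
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def page_features(url, title="", description="", keywords=()):
    """Combine URL tokens with metadata tokens, prefixing each by its source field."""
    features = [f"url:{tok}" for tok in url_tokens(url)]
    for field, value in (("title", title), ("desc", description)):
        features += [f"{field}:{tok}" for tok in re.findall(r"[a-z0-9]+", value.lower())]
    # Keywords are kept whole (after trimming) rather than split into words.
    features += [f"kw:{kw.strip().lower()}" for kw in keywords if kw.strip()]
    return features
```

Prefixing each token with its source field lets a downstream classifier weight a word in the URL differently from the same word in the title.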


The number of e-learning websites and the amount of e-content have been increasing exponentially over the years, and it often becomes cumbersome for a learner to find e-content suitable for learning, as the learner is overwhelmed by the sheer amount of content available. The proposed work focuses on evaluating the efficiency of different classification algorithms for identifying e-learning content by difficulty level. The data is collected from many e-learning websites through web scraping. The web scraper downloads the web pages and parses them into text files. The text files are run through many machine learning classification algorithms to find the classification model that achieves the highest score with minimum training and testing time. This method helps to understand the performance of different text classification algorithms on e-learning content and identifies the classifier with high accuracy for document classification.


Corpora ◽  
2018 ◽  
Vol 13 (1) ◽  
pp. 65-95 ◽  
Author(s):  
Serge Sharoff

This paper presents an approach to classifying large web corpora into genres by means of Functional Text Dimensions (FTDs). This offers a topological approach to text typology in which the texts are described in terms of their similarity to prototype genres. The suggested set of categories is designed to be applicable to any text on the web and to be reliable in annotation practice. Interannotator agreement results show that the suggested categories produce Krippendorff's α above 0.76. In addition to the functional space of eighteen dimensions, similarity between annotated documents can be described visually within a space of reduced dimensions obtained through t-distributed Stochastic Neighbour Embedding. Reliably annotated texts also provide the basis for automatic genre classification, which can be done in each FTD, as well as within the space of reduced dimensions. An example comparing texts from the Brown Corpus, the BNC and ukWac, a large web corpus, is provided.
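For readers unfamiliar with the agreement measure cited in this abstract, here is a minimal sketch of Krippendorff's α for nominal labels, computed from the coincidence matrix as α = 1 − D_o/D_e. The abstract does not say which distance metric the study used, so the nominal metric here is an assumption, as are the function name and the input layout (one list of labels per annotated item).

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha with the nominal metric.

    units: list of lists, each holding the labels assigned to one item.
    Items with fewer than two labels are skipped because they form no pairs.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from distinct annotators, weighted by 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when fewer than two categories are used")
    return 1 - (n - 1) * observed / expected
```

A value of 1 indicates perfect agreement, and a value near 0 indicates agreement no better than chance.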


Author(s):  
Piyush Bansal and Saurabh Gautam

Image classification is the task of identifying what an image represents. An Android image classification model is trained to recognize various classes of images. For example, we may train a model to recognize photos representing three different types of animals: rabbits, hamsters, and dogs. TensorFlow Lite provides optimized pre-trained models that we can deploy in our mobile applications. Simple Machine Learning (ML) algorithms in Python make it relatively easy to start exploring datasets and make some first predictions. We can make these trained models useful in the real world by making them available for predictions on either the Web or portable devices.


The classical Web search engines focus on satisfying the information need of users by retrieving relevant Web documents corresponding to the user query. A Web document contains information on different Web objects such as authors, automobiles, political parties, etc. When a user accesses a Web document to obtain information about a specific Web object, the remaining information in the document [2-6] is redundant for that user. If the Web document is large and the user's information requirement is a small fraction of it, the user has to invest effort in locating the required information inside the document. It would be much more convenient if the user were provided with only the required Web object information located inside the Web documents. Web object search engines provide a Web search facility through vertical search on Web objects. The main goal of this paper is to extract the object information present in different documents and integrate it into an object repository, over which the Web object search facility is built.


2019 ◽  
Vol 6 (2) ◽  
pp. 125-133
Author(s):  
Ismail Yusuf Panessai ◽  
Muhammad Modi Lakulu ◽  
Mohd Hishamuddin Abdul Rahman ◽  
Noor Anida Zaria Mohd Noor ◽  
Nor Syazwani Mat Salleh ◽  
...  

PSAP: Improving Accuracy of Students' Final Grade Prediction using ID3 and C4.5

This study aimed to improve the performance of the Predicting Student Academic Performance (PSAP) system by developing a web application that can be used to analyze student performance during the current semester. Development of the web-based application followed the evolutionary prototyping model. The study also analyses the accuracy of the classifier constructed for the prediction features in the web application. A qualitative approach based on a user evaluation questionnaire was used. A small number of expert users, lecturers from Universiti Pendidikan Sultan Idris, were chosen as respondents. Each respondent was asked to answer a total of 27 questions regarding the respondent’s background and the web application design. The accuracy of the classifier for the prediction features was tested with a confusion matrix on a test set of 24 rows. The findings show the respondents' views on the interface design, functionality, navigation, and reliability of the web-based application. The results also show that the accuracy of the classifier constructed using the ID3 classification model (C4.5) is 79.18%, the highest compared with the Naïve Bayes and Generalized Linear classification models.
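As a reminder of how accuracy is read off a confusion matrix, the sketch below divides the diagonal (correct predictions) by the total number of test rows. The function names and the 3-class matrix in the usage note are hypothetical; the matrix merely sums to 24 rows and does not reproduce the study's results.

```python
def accuracy(confusion):
    """Overall accuracy from a square confusion matrix.

    confusion[i][j] is the number of test rows whose true class is i
    and whose predicted class is j.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

def per_class_recall(confusion):
    """Fraction of each true class that was predicted correctly (None if no rows)."""
    return [row[i] / sum(row) if sum(row) else None
            for i, row in enumerate(confusion)]
```

For a hypothetical matrix such as `[[8, 1, 0], [1, 6, 1], [0, 1, 6]]`, the diagonal holds 20 of the 24 rows, so `accuracy` returns 20/24.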

