Functional Text Dimensions for the annotation of web corpora

This paper presents an approach to classifying large web corpora into genres by means of Functional Text Dimensions (FTDs). This offers a topological approach to text typology in which the texts are described in terms of their similarity to prototype genres. The suggested set of categories is designed to be applicable to any text on the web and to be reliable in annotation practice. Interannotator agreement results show that the suggested categories produce Krippendorff's α above 0.76. In addition to the functional space of eighteen dimensions, similarity between annotated documents can be described visually within a space of reduced dimensions obtained through t-distributed Stochastic Neighbour Embedding (t-SNE). Reliably annotated texts also provide the basis for automatic genre classification, which can be done in each FTD as well as within the space of reduced dimensions. An example comparing texts from the Brown Corpus, the BNC and ukWac, a large web corpus, is provided.
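The agreement figure quoted above is a Krippendorff's α value. As a rough illustration of how such a coefficient is computed, the sketch below implements the nominal-data variant of α using only the Python standard library. It is a simplification and not the paper's procedure: the abstract does not say which distance metric was used for the FTD scores, and the function name and input layout are assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list with one entry per annotated item; each entry is the
    list of labels given to that item (None marks a missing rating).
    This input layout is an assumption made for the example.
    """
    # Coincidence matrix: each ordered pair of ratings within an item
    # contributes 1 / (m - 1), where m is the number of ratings for the item.
    coincidences = Counter()
    for ratings in units:
        labels = [r for r in ratings if r is not None]
        m = len(labels)
        if m < 2:
            continue  # an item with fewer than two ratings is not pairable
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("no item has two or more ratings")

    # Nominal metric: any two different labels count as full disagreement.
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected
```

Values close to 1 indicate agreement well beyond chance, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.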


Effective Genre Classification - Understanding Url And Webpage Attributes For Classification

International Journal of Recent Technology and Engineering - 2
10.35940/ijrte.b1191.0982s1119
2019, Vol 8 (2S11), pp. 2011-2016

Keyword(s): The Internet, Web Pages, Web Page, Exchange Method, Genre Classification, Internet Application, Reliable Classification, Time Eating, The Web

With the rapid growth in the number of web pages, it is hard to find the desired information quickly and easily among the many pages retrieved by a search engine. There is an increasing need for automatic classification techniques with higher classification accuracy. There are situations today in which it is important to have an efficient and reliable classification of a web page from the information contained in the URL (Uniform Resource Locator) only, without the need to visit the page itself. For several reasons, we want to know whether the URL can be used without having to fetch and inspect the page. Retrieving the page content and sorting it to determine the genre of the web page is very time-consuming and requires the user to know the structure of the page that is to be classified. To avoid this time-consuming process, we propose an alternative method that determines the genre of an entered URL based on the URL itself and the metadata, i.e., the description and keywords used in the website along with the title of the website. This approach does not rely only on the URL but also uses content from the internet application. The proposed system can be evaluated using several available datasets.
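The kind of input described above (the URL plus the page title, description and keywords) can be turned into classifier features roughly as follows. This is a minimal sketch using only the Python standard library; the function names, the tokenisation rules and the example URL are illustrative assumptions, since the abstract does not specify how the features are built or which classifier consumes them.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (hypothetical helper)."""
    parsed = urlparse(url)
    # Use host, path and query string; the scheme carries no genre signal.
    raw = " ".join([parsed.netloc, parsed.path, parsed.query])
    tokens = re.split(r"[^a-z0-9]+", raw.lower())
    # Drop empty strings and the uninformative "www" prefix.
    return [t for t in tokens if t and t != "www"]


def page_features(url, title="", description="", keywords=""):
    """Combine URL tokens with tokens from the page metadata.

    The metadata fields mirror those named in the abstract; how they are
    weighted or fed to a classifier is left open here.
    """
    text = " ".join([title, description, keywords]).lower()
    return {
        "url": url_tokens(url),
        "meta": re.findall(r"[a-z0-9]+", text),
    }


features = page_features(
    "https://www.example.com/news/2019/sports-results.html?id=7",
    title="Sports Results",
    description="Latest scores",
    keywords="football, scores",
)
```

The two token lists could then be passed to any bag-of-words text classifier; keeping them separate lets the URL and the metadata be weighted differently.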


Cross-Testing a Genre Classification Model for the Web

Text, Speech and Language Technology - Genres on the Web
10.1007/978-90-481-9178-9_5
2010, pp. 87-128
Cited By ~ 10

Author(s): Marina Santini

Keyword(s): Classification Model, Genre Classification, The Web


A Topological Approach of the Web Classification

Lecture Notes in Computer Science - Theoretical Aspects of Computing - ICTAC 2006
10.1007/11921240_6
2006, pp. 80-92
Cited By ~ 3

Author(s): Gabriel Ciobanu, Dănuţ Rusu

Keyword(s): Topological Approach, Web Classification, The Web


Genre annotation for the Web

10.1075/rs.19015.sha
2021

Author(s): Serge Sharoff

Keyword(s): Classification Model, Linguistic Features, General Reference, Digital Curation, Communicative Functions, Genre Classification, Automatic Text Classification, Deep Learning Model, Automatic Text, The Web

Abstract: This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as enTenTen, ukWac or ruWac, by means of automatic text classification. First, the paper presents a Deep Learning model suitable for classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting. Second, it describes the results of applying the automatic classification model to these corpora and compares their composition. Finally, the paper introduces a framework for interpreting the results of automatic genre classification using linguistic features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across languages.
