Genre annotation for the Web

2021 ◽  
Author(s):  
Serge Sharoff

Abstract This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as enTenTen, ukWac or ruWac, by means of automatic text classification. First, the paper presents a Deep Learning model suitable for classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting. Second, it describes the results of applying the automatic classification model to these corpora and compares their composition. Finally, the paper introduces a framework for interpreting the results of automatic genre classification using linguistic features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across languages.
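The abstract does not spell out the model architecture, but a functional-genre classifier of this kind can be sketched as a pretrained multilingual transformer with a small classification head over communicative-function labels. The label set, model name, and helper below are illustrative assumptions, not the paper's implementation; the classification head initialised this way would still need fine-tuning on genre-annotated data before its predictions are meaningful.

```python
# Minimal sketch (assumed setup, not the paper's model): a multilingual
# transformer with a classification head over a few communicative functions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["Argumentation", "Reporting", "Instruction", "Promotion"]  # assumed label subset

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS)
)
model.eval()  # head is randomly initialised; fine-tune on annotated data first

def classify(text: str) -> str:
    """Assign the most probable communicative function to a Web document."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(classify("The committee therefore concludes that the proposal should be rejected."))
```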

2015 ◽  
Vol 713-715 ◽  
pp. 1830-1834
Author(s):  
Rong Chen ◽  
Feng Chen ◽  
Yi Sun

We consider how to efficiently perform text classification over all pairs of documents. Such classification can be applied to information retrieval, digital libraries, information filtering, and search engines, among other settings. This paper describes a text classification model based on the KNN algorithm. Because the standard text feature extraction scheme, TF-IDF, can lose information about the relationships between text features, an improved ITF-IDF algorithm is presented to overcome this limitation. Our experiments show that the proposed algorithm outperforms the alternatives.
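As a point of reference, the KNN-over-TF-IDF baseline that the paper improves upon can be sketched in a few lines with scikit-learn. The ITF-IDF weighting itself is not reproduced here, and the documents and labels are toy placeholders.

```python
# Minimal sketch of the baseline: k-nearest-neighbour classification over
# TF-IDF vectors. The improved ITF-IDF weighting from the paper is not shown.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_docs = ["stock markets fell sharply", "the team won the final match"]  # toy data
train_labels = ["finance", "sports"]

knn = make_pipeline(
    TfidfVectorizer(),                                      # documents -> sparse TF-IDF vectors
    KNeighborsClassifier(n_neighbors=1, metric="cosine"),   # vote among nearest neighbours
)
knn.fit(train_docs, train_labels)
print(knn.predict(["bond yields rose after the announcement"]))
```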


Author(s):  
Dirk Snyman ◽  
Gerhard Van Huyssteen ◽  
Walter Daelemans

When working in the terrain of text processing, metadata about a particular text plays an important role. Metadata is often generated using automatic text classification systems, which classify a text into one or more predefined classes or categories based on its content. One of the dimensions by which a text can be classified is its genre. This study concerns the development of an automatic genre classification system in a resource-scarce environment. It investigates the techniques and approaches generally used for automatic genre classification systems and identifies the best approach for Afrikaans (a resource-scarce language). When developing an automatic genre classification system, a set of variables must be considered because they influence the performance of machine learning approaches (i.e., the algorithm used, the amount of training data, and the representation of the data as features). If these variables are handled correctly, an optimal combination of them can be identified to successfully develop a genre classification system. In this article a genre classification system is developed using the following approach: a multinomial naïve Bayes (MNB) algorithm with a bag-of-words feature set. The resulting system achieves an f-score (performance measure) of 0.929.
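The reported setup, a multinomial naive Bayes classifier over bag-of-words features, maps directly onto standard scikit-learn components. The sketch below is a generic illustration with placeholder texts and genre labels, not the Afrikaans data or evaluation from the study.

```python
# Minimal sketch of the reported setup: multinomial naive Bayes over a
# bag-of-words feature set. Training texts and genre labels are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["once upon a time in a far-away land", "the minister announced new measures today"]
train_genres = ["fiction", "news"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())  # bag-of-words counts -> MNB
clf.fit(train_texts, train_genres)

print(clf.predict(["a dragon lived in the mountains"]))   # genre prediction for an unseen text
```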


Corpora ◽  
2013 ◽  
Vol 8 (2) ◽  
pp. 209-234 ◽  
Author(s):  
Yan Cao ◽  
Richard Xiao

This article takes the multi-dimensional (MD) analysis approach to explore the textual variations between native and non-native English abstracts on the basis of a balanced corpus containing English abstracts written by native English and native Chinese writers from twelve academic disciplines. A total of 47 out of 163 linguistic features are retained after factor analysis, which underlies a seven-dimension framework representing seven communicative functions. The results show that the two types of abstracts differ significantly on five of the seven dimensions. More specifically, native English writers display a more active involvement and commitment in presenting their ideas than Chinese writers, and they use intensifying devices more frequently. In contrast, Chinese writers show stronger preferences for conceptual elaboration, passives and abstract noun phrases, whether the two types of data are examined as a whole or variation across disciplines is taken into account. The results are discussed in relation to their possible causes, together with suggestions for English abstract writing in China. Methodologically, this study innovatively expands on Biber's (1988) MD analytical framework by integrating colligation in addition to grammatical and semantic features.
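For readers unfamiliar with the MD methodology, the core computational step is a factor analysis of a text-by-feature frequency matrix, with the resulting factors interpreted as functional dimensions. The sketch below is schematic only: the feature matrix is randomly generated, whereas the study used counts of 163 linguistic (including colligational) features drawn from real abstracts.

```python
# Schematic sketch of the factor-analysis step behind MD analysis.
# The feature matrix is random stand-in data, not the study's corpus.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(200, 163)).astype(float)   # 200 texts x 163 feature counts
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)          # standardise each feature

fa = FactorAnalysis(n_components=7, random_state=0)        # seven dimensions, as in the study
scores = fa.fit_transform(X)    # per-text dimension scores (used to compare groups)
loadings = fa.components_       # feature loadings, interpreted functionally

print(scores.shape, loadings.shape)   # (200, 7) (7, 163)
```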


Author(s):  
Horacio Saggion

Over the past decades, information has been made available to a broad audience thanks to the availability of texts on the Web. However, understanding the wealth of information contained in texts can pose difficulties for a number of people, including those with poor literacy, cognitive or linguistic impairment, or limited knowledge of the language of the text. Text simplification was initially conceived as a technology for simplifying sentences so that they would be easier to process by natural-language processing components such as parsers. Nowadays, however, automatic text simplification is conceived as a technology for transforming a text into an equivalent that is easier to read and to understand by a target user. Text simplification concerns both the modification of the vocabulary of the text (lexical simplification) and the modification of the structure of its sentences (syntactic simplification). In this chapter, after briefly introducing the topic of text readability, we give an overview of past and recent methods for addressing these two problems. We also describe simplification applications and full systems, and outline language resources and evaluation approaches.
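As a concrete illustration of the lexical-simplification idea mentioned above, one simple family of approaches replaces low-frequency words with higher-frequency synonyms. The toy sketch below uses hand-written frequency and synonym tables purely for illustration; real systems draw on large frequency lists, paraphrase resources, or neural substitution models.

```python
# Toy sketch of frequency-based lexical simplification: replace rare words
# with a more frequent synonym. The tables below are illustrative only.
WORD_FREQ = {"use": 900, "utilise": 12, "buy": 800, "purchase": 60, "help": 950, "assist": 70}
SYNONYMS = {"utilise": ["use"], "purchase": ["buy"], "assist": ["help"]}

def simplify(sentence: str, freq_threshold: int = 100) -> str:
    out = []
    for word in sentence.lower().split():
        if WORD_FREQ.get(word, 0) < freq_threshold and word in SYNONYMS:
            # substitute the most frequent known synonym
            out.append(max(SYNONYMS[word], key=lambda w: WORD_FREQ.get(w, 0)))
        else:
            out.append(word)
    return " ".join(out)

print(simplify("please utilise the form to purchase tickets"))
# -> "please use the form to buy tickets"
```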


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Dapeng Lang ◽  
Deyun Chen ◽  
Ran Shi ◽  
Yongjun He

Deep learning has been widely used in image classification and image recognition and has achieved positive practical results. In recent years, however, a number of studies have found that the accuracy of deep learning classification models drops sharply when only subtle changes are made to the original examples, thereby realising an attack on the model. The main existing methods are as follows: adjusting the pixels of attack examples in ways invisible to the human eye so as to induce the model into a wrong classification; or adding an adversarial patch to the detection target to guide and deceive the classification model into misclassification. These methods involve considerable randomness and are of limited use in practical applications. Unlike previous perturbations of traffic signs, this paper proposes a method that can successfully hide and misclassify vehicles in complex contexts. The method takes complex real-world scenarios into account and can perturb pictures taken by a camera or mobile phone so that a detector based on a deep learning model either fails to detect the vehicle or misclassifies it. To improve robustness, the position and size of the adversarial patch are adjusted for different detection models by introducing an attachment mechanism. Tests against different detectors show that a patch generated for a single-target detection algorithm can also attack other detectors and transfers well. Experiments show that the proposed algorithm significantly lowers the accuracy of the detector. Under real-world conditions such as varying distance, lighting, angle, and resolution, misclassification of the target is achieved by reducing the confidence of the target and of its background, which greatly perturbs the detection results of the target detector. On the COCO 2017 dataset, the success rate of the algorithm reaches 88.7%.
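The paper's detector attack cannot be reproduced from the abstract alone, but the underlying adversarial-patch idea can be illustrated on a plain image classifier: a small patch is optimised by gradient descent so that pasting it onto an image drives down the model's confidence in the true class. The model choice, patch size, paste location, and class index below are illustrative assumptions, not the paper's configuration.

```python
# Simplified adversarial-patch sketch on a classifier (not the paper's
# detector attack). Patch size, position, model and class are assumptions.
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)               # only the patch is optimised

image = torch.rand(1, 3, 224, 224)        # stand-in for a real vehicle photo
true_class = torch.tensor([817])          # example ImageNet class index
patch = torch.rand(1, 3, 50, 50, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.05)

for step in range(200):
    patched = image.clone()
    patched[:, :, 80:130, 80:130] = patch.clamp(0, 1)        # paste patch at a fixed spot
    loss = -F.cross_entropy(model(patched), true_class)      # maximise loss on the true class
    opt.zero_grad()
    loss.backward()
    opt.step()
```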


Linguistics ◽  
2021 ◽  

Register research has been approached from differing theoretical and methodological approaches, resulting in different definitions of the term register. In the text-linguistic approach, which is the primary focus of this bibliography, register refers to text varieties that are defined by their situational characteristics, such as the purpose of writing and the mode of communication, among others. Texts that are similar in their situational characteristics also tend to share similar linguistic profiles, as situational characteristics motivate or require the use of specific linguistic features. Text-linguistic research on register tends to focus on two aspects: attempts to describe a register, or attempts to understand patterns of register variation. This research happens via comparative analyses, specific examinations of single linguistic features or situational parameters, and often via examinations of co-occurrence of linguistic features that are analyzed from a functional perspective. That is, certain lexico-grammatical features co-occur in a given text because they together serve important communicative functions that are motivated by the situational characteristics of the text (e.g., communicative purpose, mode, setting, interactivity). Furthermore, corpus methods are often relied upon in register studies, which allows for large-scale examinations of both general and specialized registers. Thus, the bibliography gives priority to research that uses corpus tools and methods. Finally, while the broadest examinations on register focus on the distinction between written and spoken domains, additional divisions of register studies fall under the categories of written registers, spoken registers, academic registers, historical registers, and electronic/online registers. This bibliography primarily introduces some of the key resources on English registers, a decision that was made to reach a broader audience.

