Corpus Selection Approaches for Multilingual Parsing from Raw Text to Universal Dependencies

2017 ◽  
Author(s):  
Ryan Hornby ◽  
Clark Taylor ◽  
Jungyeul Park
Author(s):  
Mubin Shoukat Tamboli ◽  
Rajesh Prasad

Authorship attribution is the task of identifying the writer of an unknown text by assigning it to one of a set of known authors. Each author's writing style is distinctive and can be used for discrimination, and several parameters capture such differences. When the writing samples collected for an author come from a short period of time, they can be used effectively to identify an unknown sample. In this paper we consider an author identification problem in which writing samples are not available from the same time period; the evidence is instead collected over a long span of time. Character n-gram, word n-gram, and POS n-gram features are used to build the model, since they capture both the content and the statistical characteristics of an author's writing style. We apply a support vector machine algorithm for classification, and the experiments produce effective results. While discriminating among multiple authors, corpus selection and construction were the most tedious tasks, and they were implemented effectively. We observe that accuracy varies with feature type: word and character n-grams show better accuracy than POS n-grams.
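A minimal sketch of the kind of n-gram plus SVM pipeline this abstract describes, using scikit-learn. The texts, labels, and n-gram ranges are illustrative assumptions rather than the authors' exact configuration; POS n-grams are omitted because they would additionally require a part-of-speech tagger.

```python
# Sketch of the feature setup described above: word and character n-grams
# feeding a linear SVM. Texts, labels and n-gram ranges are illustrative
# assumptions, not the authors' exact configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

documents = ["first writing sample by author A ...",
             "second writing sample by author B ..."]  # texts of known authorship
authors = ["author_a", "author_b"]                      # one label per document

features = FeatureUnion([
    # word n-grams capture content-level habits of a writer
    ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    # character n-grams capture sub-word stylistic habits
    ("char_ngrams", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    # POS n-grams would additionally require a part-of-speech tagger; omitted here
])

model = Pipeline([("features", features), ("svm", LinearSVC())])
model.fit(documents, authors)
print(model.predict(["an unseen text of disputed authorship ..."]))
```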


2020 ◽  
Vol 29 (1) ◽  
pp. 19-42 ◽  
Author(s):  
Pablo Barberá ◽  
Amber E. Boydstun ◽  
Suzanna Linn ◽  
Ryan McMahon ◽  
Jonathan Nagler

Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.
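A minimal sketch of keyword-based corpus selection, the approach the abstract advocates over predefined archive subject categories. The keyword list and article records are illustrative assumptions, not the authors' actual query.

```python
# Sketch of keyword-based corpus selection. The keyword list and the toy
# article records are illustrative assumptions, not the authors' actual query.
import re

ECONOMY_KEYWORDS = [r"\beconomy\b", r"\bunemployment\b", r"\binflation\b",
                    r"\bgross domestic product\b", r"\bGDP\b"]
pattern = re.compile("|".join(ECONOMY_KEYWORDS), flags=re.IGNORECASE)

articles = [
    {"id": 1, "text": "Unemployment fell sharply last quarter as hiring picked up."},
    {"id": 2, "text": "The team won its third championship in five years."},
]

# Keep only articles matching at least one keyword, rather than relying on the
# archive's predefined subject categories.
corpus = [article for article in articles if pattern.search(article["text"])]
print([article["id"] for article in corpus])  # -> [1]
```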


2019 ◽  
Vol 4 (1) ◽  
pp. 36-53 ◽  
Author(s):  
Giovanni Colavizza ◽  
Matteo Romanello

Even large citation indexes such as the Web of Science, Scopus or Google Scholar cover only a small fraction of the literature in the humanities, and this coverage decreases markedly going backwards in time. Citation mining of humanities publications, defined as an instance of bibliometric data mining and as a means to the end of building comprehensive citation indexes, remains an open problem. In this contribution we discuss the results of two recent projects in this area: Cited Loci and Linked Books. The former focused on the domain of classics, using journal articles in JSTOR as a corpus; the latter considered the historiography on Venice and a novel corpus of journals and monographs. Both projects attempted to mine citations of all kinds (abbreviated and not, to all types of sources, including primary sources) and considered a wide time span (19th to 21st century). We first discuss the current state of research in citation mining of humanities publications. We then present the various steps involved in this process, from corpus selection to data publication, discussing the peculiarities of the humanities. The approaches taken by the two projects are compared, allowing us to highlight disciplinary differences and commonalities, as well as challenges shared by historiography and classics in this respect. The resulting picture portrays humanities citation mining as a field with great, yet mostly untapped, potential and a few still open challenges. The potential lies in using citations as a means to interconnect digitized collections at a large scale, by making explicit the linking function of bibliographic citations. As for the open challenges, a key issue is the need for an integrated metadata infrastructure and an appropriate legal framework to facilitate citation mining in the humanities.
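An illustrative sketch of one early step in such a pipeline: spotting candidate reference strings, both abbreviated references to primary sources and author-date references to secondary literature, in running text. The regular expressions below are assumptions for demonstration only; the projects discussed use considerably more robust extraction methods.

```python
# Naive pattern-based detection of candidate citation strings in running text.
# Illustrative assumption only; real citation mining uses more robust methods.
import re

# "Hom. Il. 1.234"-style abbreviated references to primary sources
PRIMARY = re.compile(r"\b[A-Z][a-z]+\.\s+[A-Z][a-z]+\.\s+\d+(?:\.\d+)*")
# "Surname 1977, 12"-style author-date references to secondary literature
SECONDARY = re.compile(r"\b[A-Z][a-z]+\s+\d{4},\s*\d+")

text = "As noted in Hom. Il. 1.234 and discussed by Momigliano 1977, 12, ..."
candidates = [m.group(0) for p in (PRIMARY, SECONDARY) for m in p.finditer(text)]
print(candidates)  # -> ['Hom. Il. 1.234', 'Momigliano 1977, 12']
```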


2016 ◽  
Vol 5 (2) ◽  
pp. 75
Author(s):  
Elvira Yakovlevna Sokolova

The article analyses the significance of developing students' terminological literacy in the process of teaching English for Specific Purposes at a technical university and describes the conditions that make this process effective. The author emphasizes the need to create a terminological thesaurus that includes the most frequent, systematized, discipline-specific vocabulary and specialized subject-area word families required for professionally oriented oral and written communication, and that provides logical-semantic orientation in the text. The author specifies the factors underlying terminological corpus selection, shows the procedure of thesaurus creation, and describes the activities used to teach terminological literacy. Keywords: terminological literacy, English for Specific Purposes, terminological thesaurus, specialized discipline-specific vocabulary, professionally oriented oral and written communication.
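A minimal sketch of one way to ground terminological corpus selection in frequency: comparing term frequencies in specialized texts against a general corpus. The corpora and threshold are illustrative assumptions, not the author's actual selection procedure.

```python
# Frequency-based selection of discipline-specific terms for a terminological
# thesaurus. Corpora and threshold are illustrative assumptions only.
from collections import Counter

specialized_corpus = "the turbine rotor converts kinetic energy as the rotor spins".split()
general_corpus = "the cat sat on the mat and the dog slept".split()

spec_freq = Counter(specialized_corpus)
gen_freq = Counter(general_corpus)

# Keep words that recur in the specialized texts but are absent from general usage.
candidate_terms = [w for w, f in spec_freq.items() if f >= 2 and gen_freq[w] == 0]
print(candidate_terms)  # -> ['rotor']
```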


Target ◽  
1995 ◽  
Vol 7 (2) ◽  
pp. 245-260 ◽  
Author(s):  
Luc van Doorslaer

Abstract Although research procedures for translation analysis and comparison are being adapted to the principles of induction and deduction which are necessary in intersubjective research, criteria for corpus selection are often not explicitly motivated. Since hypotheses depend for their reliability on the corpus selected, attention should be paid to the relationship between exhaustiveness and representativeness. Criteria for corpus selection are often either random or textually motivated, while exceptions and deviations in translation often require a qualitative refinement of these criteria such as that obtained from extra-textual information.


2020 ◽  
Vol 6 (3) ◽  
pp. 205630512094069
Author(s):  
Janna Joceli Omena ◽  
Elaine Teixeira Rabello ◽  
André Goes Mintz

This article seeks to contribute to the field of digital research by critically accounting for the relationship between hashtags and their forms of grammatization, the platform techno-materialization process of online activity. We approach hashtags as sociotechnical formations that serve social media research not only as criteria for corpus selection but also as a means of displaying the complexity of online engagement and its entanglement with the technicity of web platforms. The study of hashtag engagement therefore requires grasping the functioning of the platform itself (its technicity) along with the platform's grammatization. In this respect, we propose a three-layered (3L) perspective for addressing hashtag engagement. The first layer contemplates potential differences between high-visibility and ordinary hashtag usage cultures, their related actors, and content. The second focuses on hashtagging activity and the repurposing of how hashtags can be differently embedded into social media databases. The last layer looks particularly into the images and texts to which hashtags are brought into relation. To operationalize the 3L framework, we draw on the case of the "impeachment-cum-coup" of Brazilian president Dilma Rousseff. When cross-read, the three layers add value to one another, also providing different views of the high-visibility and ordinary groups.
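A minimal sketch of the first layer of the 3L perspective only: separating high-visibility from ordinary hashtag usage by frequency of use. The posts and the cut-off are illustrative assumptions, not the authors' operationalization.

```python
# Splitting hashtags into high-visibility and ordinary usage by frequency.
# Posts and cut-off are illustrative assumptions, not the authors' method.
from collections import Counter

posts = [
    {"id": 1, "hashtags": ["impeachment", "foracorrupcao"]},
    {"id": 2, "hashtags": ["impeachment"]},
    {"id": 3, "hashtags": ["vemprarua"]},
]

counts = Counter(tag for post in posts for tag in post["hashtags"])
cutoff = 2  # hypothetical threshold between high-visibility and ordinary usage

high_visibility = {tag for tag, n in counts.items() if n >= cutoff}
ordinary = set(counts) - high_visibility
print(sorted(high_visibility), sorted(ordinary))
# -> ['impeachment'] ['foracorrupcao', 'vemprarua']
```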


2006 ◽  
Vol 41 (3) ◽  
pp. 9-19
Author(s):  
Michel Murat

The thesis developed in this article is that the practice dominating university study, that of "the analysis of poetry", is doomed to ineffectiveness by the very fact that poetry is not read like a novel or an essay, and that it requires different procedures of appropriation. Teaching is one such procedure, satisfying for us but impossible to transmit as it stands. To have students practise poetry, two avenues are therefore proposed: the making of anthologies (constitution of a corpus, selection and arrangement of the collection, page layout and production of the book) and oral declamation, possibly "staged", a practice better suited to long poems. Poetry must also be part of the teaching of literature. On this level it is suggested, without giving up stylistics and the rhetoric of figures, to emphasize the transmission and transformation of topics; to work on variants, drawing on recent "multi-version" editions; and to take an interest in the careers of poets and the activity of journals. As for university research, a survey of the recent production of doctoral theses in France shows that it is locked inside the circle of literariness. It is urgent to open it up to a history that would be at once that of poetry, as a repertoire of themes and forms, and that of poets considered as men of letters.

