Corpus Selection Approaches for Multilingual Parsing from Raw Text to Universal Dependencies

2017 ◽  
Author(s):  
Ryan Hornby ◽  
Clark Taylor ◽  
Jungyeul Park
Author(s):  
Mubin Shoukat Tamboli ◽  
Rajesh Prasad

Authorship attribution is the task of identifying the writer of an unknown text by assigning it to one of a set of known authors. Each author's writing style is distinctive and can be used for discrimination, and several parameters capture such differences. When the writing samples collected for an author come from a short period of time, they can be used effectively to identify an unknown sample. In this paper we consider an author identification problem in which writing samples are not available from the same time period; the evidence is instead collected over a long span of time. Character n-gram, word n-gram, and POS n-gram features are used to build the model, since they capture both the content and the statistical characteristics of an author's writing style. We apply a support vector machine algorithm for classification, and the experiments produce effective results. While discriminating among multiple authors, corpus selection and construction were the most tedious tasks, and they were implemented effectively. We observe that accuracy varies with feature type: word and character n-grams show better accuracy than POS n-grams.
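A minimal sketch of the kind of n-gram plus SVM pipeline this abstract describes, using scikit-learn. The texts, labels, and n-gram ranges are illustrative assumptions rather than the authors' exact configuration; POS n-grams are omitted because they would additionally require a part-of-speech tagger.

```python
# Sketch of the feature setup described above: word and character n-grams
# feeding a linear SVM. Texts, labels and n-gram ranges are illustrative
# assumptions, not the authors' exact configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

documents = ["first writing sample by author A ...",
             "second writing sample by author B ..."]  # texts of known authorship
authors = ["author_a", "author_b"]                      # one label per document

features = FeatureUnion([
    # word n-grams capture content-level habits of a writer
    ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    # character n-grams capture sub-word stylistic habits
    ("char_ngrams", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    # POS n-grams would additionally require a part-of-speech tagger; omitted here
])

model = Pipeline([("features", features), ("svm", LinearSVC())])
model.fit(documents, authors)
print(model.predict(["an unseen text of disputed authorship ..."]))
```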


2020 ◽  
Vol 29 (1) ◽  
pp. 19-42 ◽  
Author(s):  
Pablo Barberá ◽  
Amber E. Boydstun ◽  
Suzanna Linn ◽  
Ryan McMahon ◽  
Jonathan Nagler

Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.
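A minimal sketch of keyword-based corpus selection, the approach the abstract advocates over predefined archive subject categories. The keyword list and article records are illustrative assumptions, not the authors' actual query.

```python
# Sketch of keyword-based corpus selection. The keyword list and the toy
# article records are illustrative assumptions, not the authors' actual query.
import re

ECONOMY_KEYWORDS = [r"\beconomy\b", r"\bunemployment\b", r"\binflation\b",
                    r"\bgross domestic product\b", r"\bGDP\b"]
pattern = re.compile("|".join(ECONOMY_KEYWORDS), flags=re.IGNORECASE)

articles = [
    {"id": 1, "text": "Unemployment fell sharply last quarter as hiring picked up."},
    {"id": 2, "text": "The team won its third championship in five years."},
]

# Keep only articles matching at least one keyword, rather than relying on the
# archive's predefined subject categories.
corpus = [article for article in articles if pattern.search(article["text"])]
print([article["id"] for article in corpus])  # -> [1]
```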


2019 ◽  
Vol 4 (1) ◽  
pp. 36-53 ◽  
Author(s):  
Giovanni Colavizza ◽  
Matteo Romanello

Even large citation indexes such as the Web of Science, Scopus or Google Scholar cover only a small fraction of the literature in the humanities, and this coverage decreases markedly going backwards in time. Citation mining of humanities publications, defined as an instance of bibliometric data mining and as a means to the end of building comprehensive citation indexes, remains an open problem. In this contribution we discuss the results of two recent projects in this area: Cited Loci and Linked Books. The former focused on the domain of classics, using journal articles in JSTOR as a corpus; the latter considered the historiography on Venice and a novel corpus of journals and monographs. Both projects attempted to mine citations of all kinds (abbreviated and not, to all types of sources, including primary sources) and considered a wide time span (19th to 21st century). We first discuss the current state of research in citation mining of humanities publications. We then present the various steps involved in this process, from corpus selection to data publication, discussing the peculiarities of the humanities. The approaches taken by the two projects are compared, allowing us to highlight disciplinary differences and commonalities, as well as challenges shared by historiography and classics in this respect. The resulting picture portrays humanities citation mining as a field with great, yet mostly untapped, potential and a few still open challenges. The potential lies in using citations as a means to interconnect digitized collections at a large scale, by making explicit the linking function of bibliographic citations. As for the open challenges, a key issue is the need for an integrated metadata infrastructure and an appropriate legal framework to facilitate citation mining in the humanities.
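An illustrative sketch of one early step in such a pipeline: spotting candidate reference strings, both abbreviated references to primary sources and author-date references to secondary literature, in running text. The regular expressions below are assumptions for demonstration only; the projects discussed use considerably more robust extraction methods.

```python
# Naive pattern-based detection of candidate citation strings in running text.
# Illustrative assumption only; real citation mining uses more robust methods.
import re

# "Hom. Il. 1.234"-style abbreviated references to primary sources
PRIMARY = re.compile(r"\b[A-Z][a-z]+\.\s+[A-Z][a-z]+\.\s+\d+(?:\.\d+)*")
# "Surname 1977, 12"-style author-date references to secondary literature
SECONDARY = re.compile(r"\b[A-Z][a-z]+\s+\d{4},\s*\d+")

text = "As noted in Hom. Il. 1.234 and discussed by Momigliano 1977, 12, ..."
candidates = [m.group(0) for p in (PRIMARY, SECONDARY) for m in p.finditer(text)]
print(candidates)  # -> ['Hom. Il. 1.234', 'Momigliano 1977, 12']
```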


2016 ◽  
Vol 5 (2) ◽  
pp. 75
Author(s):  
Elvira Yakovlevna Sokolova

The article analyses the significance of developing students' terminological literacy in the process of teaching English for Specific Purposes at a technical university and describes the conditions that make this process effective. The author emphasizes the need to create a terminological thesaurus that includes the most frequent, systematized, discipline-specific vocabulary and specialized subject-area word families required for professionally oriented oral and written communication, and that provides logical-semantic orientation in the text. The author specifies the factors underlying terminological corpus selection, shows the procedure of thesaurus creation, and describes the activities used to teach terminological literacy. Keywords: terminological literacy, English for Specific Purposes, terminological thesaurus, specialized discipline-specific vocabulary, professionally oriented oral and written communication.
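A minimal sketch of one way to ground terminological corpus selection in frequency: comparing term frequencies in specialized texts against a general corpus. The corpora and threshold are illustrative assumptions, not the author's actual selection procedure.

```python
# Frequency-based selection of discipline-specific terms for a terminological
# thesaurus. Corpora and threshold are illustrative assumptions only.
from collections import Counter

specialized_corpus = "the turbine rotor converts kinetic energy as the rotor spins".split()
general_corpus = "the cat sat on the mat and the dog slept".split()

spec_freq = Counter(specialized_corpus)
gen_freq = Counter(general_corpus)

# Keep words that recur in the specialized texts but are absent from general usage.
candidate_terms = [w for w, f in spec_freq.items() if f >= 2 and gen_freq[w] == 0]
print(candidate_terms)  # -> ['rotor']
```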


Target ◽  
1995 ◽  
Vol 7 (2) ◽  
pp. 245-260 ◽  
Author(s):  
Luc van Doorslaer

Abstract Although research procedures for translation analysis and comparison are being adapted to the principles of induction and deduction which are necessary in intersubjective research, criteria for corpus selection are often not explicitly motivated. Since hypotheses depend for their reliability on the corpus selected, attention should be paid to the relationship between exhaustiveness and representativeness. Criteria for corpus selection are often either random or textually motivated, while exceptions and deviations in translation often require a qualitative refinement of these criteria such as that obtained from extra-textual information.


2020 ◽  
Vol 6 (3) ◽  
pp. 205630512094069
Author(s):  
Janna Joceli Omena ◽  
Elaine Teixeira Rabello ◽  
André Goes Mintz

This article seeks to contribute to the field of digital research by critically accounting for the relationship between hashtags and their forms of grammatization, the platform techno-materialization process of online activity. We approach hashtags as sociotechnical formations that serve social media research not only as criteria for corpus selection but also as a means of displaying the complexity of online engagement and its entanglement with the technicity of web platforms. The study of hashtag engagement therefore requires grasping the functioning of the platform itself (its technicity) along with the platform's grammatization. In this respect, we propose a three-layered (3L) perspective for addressing hashtag engagement. The first layer contemplates potential differences between high-visibility and ordinary hashtag usage cultures, their related actors, and content. The second focuses on hashtagging activity and the repurposing of how hashtags can be differently embedded into social media databases. The last layer looks particularly into the images and texts to which hashtags are brought into relation. To operationalize the 3L framework, we draw on the case of the "impeachment-cum-coup" of Brazilian president Dilma Rousseff. When cross-read, the three layers add value to one another, also providing different views of the high-visibility and ordinary groups.
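A minimal sketch of the first layer of the 3L perspective only: separating high-visibility from ordinary hashtag usage by frequency of use. The posts and the cut-off are illustrative assumptions, not the authors' operationalization.

```python
# Splitting hashtags into high-visibility and ordinary usage by frequency.
# Posts and cut-off are illustrative assumptions, not the authors' method.
from collections import Counter

posts = [
    {"id": 1, "hashtags": ["impeachment", "foracorrupcao"]},
    {"id": 2, "hashtags": ["impeachment"]},
    {"id": 3, "hashtags": ["vemprarua"]},
]

counts = Counter(tag for post in posts for tag in post["hashtags"])
cutoff = 2  # hypothetical threshold between high-visibility and ordinary usage

high_visibility = {tag for tag, n in counts.items() if n >= cutoff}
ordinary = set(counts) - high_visibility
print(sorted(high_visibility), sorted(ordinary))
# -> ['impeachment'] ['foracorrupcao', 'vemprarua']
```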


2006 ◽  
Vol 41 (3) ◽  
pp. 9-19
Author(s):  
Michel Murat

The thesis developed in this article is that the practice dominating university study, that of "the analysis of poetry", is doomed to ineffectiveness by the very fact that poetry is not read like a novel or an essay, and that it requires different procedures of appropriation. Teaching is one such procedure, satisfying for us but impossible to transmit as it stands. To have students practise poetry, two avenues are therefore proposed: the making of anthologies (constitution of a corpus, selection and arrangement of the collection, page layout and production of the book) and oral declamation, possibly "staged", a practice better suited to long poems. Poetry must also be part of the teaching of literature. On this level it is suggested, without giving up stylistics and the rhetoric of figures, to emphasize the transmission and transformation of topics; to work on variants, drawing on recent "multi-version" editions; and to take an interest in the careers of poets and the activity of journals. As for university research, a survey of the recent production of doctoral theses in France shows that it is locked inside the circle of literariness. It is urgent to open it up to a history that would be at once that of poetry, as a repertoire of themes and forms, and that of poets considered as men of letters.

