corpus creation
Recently Published Documents


Total documents: 55 (five years: 28)
H-index: 5 (five years: 1)

2022 ◽  
Author(s):  
Sebastião Pais ◽  
João Cordeiro ◽  
Muhammad Jamil

Abstract The use of language corpora has increased significantly across many purposes. General corpora exist for numerous languages, but research often needs more specialized corpora. The Web’s rapid growth has significantly improved access to thousands of online documents in electronic form, including highly specialized texts and comparable texts on the same subject in several languages. However, research has continued to concentrate on corpus annotation rather than on corpus creation tools. Consequently, many researchers create their own corpora, solve the same problems independently, and build project-specific systems. Corpus construction supports many NLP applications, including machine translation, information retrieval, and question answering. This paper presents a new NLP Corpus and Services in the Cloud called HULTIG-C. HULTIG-C covers various languages and includes unique annotations such as a keyword set, a sentence set, a named entity recognition set, and a multiword set. HULTIG-C is processed by a framework that incorporates the main components for license detection, language identification, boilerplate removal, and document deduplication. The paper also presents some potential issues related to constructing multilingual corpora from the Web.
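As an illustration of the document deduplication component named in the abstract above, here is a minimal sketch. It is not the HULTIG-C implementation: the abstract does not say how deduplication is done. The sketch assumes exact-duplicate detection by hashing text after case and whitespace normalization, and the names `normalize` and `deduplicate` are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization.

    Assumption: two documents count as duplicates only when their
    normalized text is identical (no near-duplicate detection).
    """
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

A Web-scale pipeline would more likely need near-duplicate detection (for example shingling with MinHash), since crawled pages often differ only in small details; exact hashing is used here only to keep the sketch short.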


2021 ◽  
pp. 026765832110306
Author(s):  
Rosamond Mitchell

A major rationale for study abroad (SA) from the perspective of second language acquisition (SLA) is the presumed opportunity available to sojourners for naturalistic second language (L2) “immersion”. However, such opportunities are affected by variations in the linguistic, institutional and social affordances of SA in different settings. They are also affected by the varying agency and motivation of sojourners in seeking L2 engagement. For example, many sojourners prioritize mastering informal L2 speech, while others prioritize academic and professional registers, including writing. Most will operate multilingually, using their home language, a local language, and/or English as a lingua franca for different purposes, and the types of input they seek out, and the language practices they enter into, vary accordingly. Consequently, while researchers have developed varied approaches to documenting L2 engagement, and have tried to relate these to measures of L2 development, these efforts have so far seen somewhat mixed success. This article reviews different approaches to documenting SA input and interaction, beginning with participant self-report using questionnaires, interviews, journals, or language logs. Particular attention is paid to the popular Language Contact Profile (LCP), and to approaches drawing on Social Network Analysis. The limitations of all forms of self-report are acknowledged. The article also examines the contribution of direct observation and recording of L2 input and interaction during SA. This is a significant alternative approach for the study of acquisition, but one which poses theoretical, ethical and practical challenges. Researchers have increasingly enlisted participants as research collaborators who create small corpora through self-recording with L2 interlocutors. Analyses in this tradition have so far prioritized interactional, pragmatic and sociocultural development in learner corpora over other dimensions of SLA. The theoretical and practical challenges of corpus creation in SA settings, and the wider use of such corpora to promote understanding of informal L2 learning, are discussed.


Author(s):  
Prasanta Mandal ◽  
Apurbalal Senapati

A corpus is a large collection of machine-readable texts that should ideally be representative of a language. Corpora play an important role in linguistic study and in several natural language processing (NLP) tasks, such as part-of-speech (POS) tagging, parsing, semantic tagging, and work with parallel corpora. Developing a corpus is itself a substantial contribution to resource building for language processing. The literature contains numerous corpora for different languages, and most of them were created for a specific purpose. Hence a researcher cannot use just any corpus for a particular task. This paper also focuses on an automated technique to create a COVID-19 corpus dedicated to research on the linguistic aspects of the pandemic situation.


Information ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 199
Author(s):  
Alexander König ◽  
Jennifer-Carmen Frey ◽  
Egon W. Stemle

To date, research in various educational and linguistic domains, such as learner corpus research, writing research, and second language acquisition, has produced a substantial amount of research data in the form of L1 and L2 learner corpora. However, the multitude of individual solutions, combined with domain-inherent obstacles to data sharing, has so far hampered the comparability, reusability and reproducibility of data and research results. In this article, we present work on creating a digital infrastructure for L1 and L2 learner corpora and populating it with data collected in the past. We embed our infrastructure efforts in the broader field of infrastructures for scientific research, drawing on technical solutions and frameworks from research data management, among them the FAIR guiding principles for data stewardship. We share our experiences from integrating some L1 and L2 learner corpora from concluded projects into the infrastructure while trying to ensure compliance with the FAIR principles and the standards we established for reproducibility, and we discuss how far research data collected in the past can be made comparable, reusable and reproducible. Our results show that some basic needs for providing comparable and reusable data are covered by existing general infrastructure solutions, which can be exploited for domain-specific infrastructures such as the one presented in this article. Other aspects need genuinely domain-driven approaches. The solutions found for the corpora in the presented infrastructure can only be a preliminary attempt, and further community involvement would be needed to provide templates and models acknowledged and promoted by the community. Furthermore, forward-looking data management would be needed from the beginning of new corpus creation projects to ensure that all requirements for FAIR data can be met.


2021 ◽  
Author(s):  
Kusampudi Siva Subrahamanyam Varma ◽  
Anudeep Chaluvadi ◽  
Radhika Mamidi ◽  
...  

2021 ◽  
Author(s):  
Kamal Kumar Gupta ◽  
Soumya Chennabasavaraj ◽  
Nikesh Garera ◽  
Asif Ekbal
