Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora

Information ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 199
Author(s):  
Alexander König ◽  
Jennifer-Carmen Frey ◽  
Egon W. Stemle

To date, research in various educational and linguistic domains, such as learner corpus research, writing research, or second language acquisition, has produced a substantial amount of research data in the form of L1 and L2 learner corpora. However, the multitude of individual solutions combined with domain-inherent obstacles in data sharing has so far hampered comparability, reusability and reproducibility of data and research results. In this article, we present our work on creating a digital infrastructure for L1 and L2 learner corpora and populating it with data collected in the past. We embed our infrastructure efforts in the broader field of infrastructures for scientific research, drawing from technical solutions and frameworks from research data management, among them the FAIR guiding principles for data stewardship. We share our experiences from integrating some L1 and L2 learner corpora from concluded projects into the infrastructure while trying to ensure compliance with the FAIR principles and the standards we established for reproducibility, discussing how far research data that has been collected in the past can be made comparable, reusable and reproducible. Our results show that some basic needs for providing comparable and reusable data are covered by existing general infrastructure solutions and can be exploited for domain-specific infrastructures such as the one presented in this article. Other aspects need genuinely domain-driven approaches. The solutions found for the corpora in the presented infrastructure can only be a preliminary attempt, and further community involvement would be needed to provide templates and models acknowledged and promoted by the community. Furthermore, forward-looking data management would be needed starting from the beginning of new corpus creation projects to ensure that all requirements for FAIR data can be met.

2019 ◽  
Vol 8 (1) ◽  
pp. 40-52 ◽  
Author(s):  
Sarah W. Kansa ◽  
Levent Atici ◽  
Eric C. Kansa ◽  
Richard H. Meadow

With the advent of the Web, increased emphasis on “research data management,” and innovations in reproducible research practices, scholars have more incentives and opportunities to document and disseminate their primary data. This article seeks to guide archaeologists in data sharing by highlighting recurring challenges in reusing archived data gleaned from observations on workflows and reanalysis efforts involving datasets published over the past 15 years by Open Context. Based on our findings, we propose specific guidelines to improve data management, documentation, and publishing practices so that primary data can be more efficiently discovered, understood, aggregated, and synthesized by wider research communities.


2010 ◽  
Vol 1 ◽  
Author(s):  
John A. Hawkins ◽  
Paula Buttery

One of the major goals of the Cambridge English Profile Programme is to identify ‘criterial features’ for each of the Common European Framework of Reference (CEFR) proficiency levels as they apply to English, and to assess the impact of different first languages on these features (through ‘transfer’ effects). The present paper defines what is meant by criterial features and proposes an initial taxonomy of four types. Numerous illustrations are given from our collaborative research to date on the Cambridge Learner Corpus. The benefits and challenges posed by these features for corpus linguistics and for theories of second language acquisition are briefly outlined, as are the benefits and challenges for language assessment practices and for publishing ventures that make use of them as supplements to the current CEFR descriptors.


2019 ◽  
Vol 39 ◽  
pp. 74-92 ◽  
Author(s):  
Tony McEnery ◽  
Vaclav Brezina ◽  
Dana Gablasova ◽  
Jayanti Banerjee

In this article we explore the relationship between learner corpus and second language acquisition research. We begin by considering the origins of learner corpus research, noting its roots in smaller scale studies of learner language. This development of learner corpus studies is considered in the broader context of the development of corpus linguistics. We then consider the aspirations that learner corpus researchers have had to engage with second language acquisition research and explore why, to date, the interaction between the two fields has been minimal. By exploring some of the corpus building practices of learner corpus research, and the theoretical goals of second language acquisition studies, we identify reasons for this lack of interaction and make proposals for how this situation could be fruitfully addressed.


2015 ◽  
Vol 11 (1) ◽  
pp. 1-18 ◽  
Author(s):  
Sabine De Knop ◽  
Fanny Meunier

The introductory chapter of this special issue on ‘Learner Corpus Research, Cognitive Linguistics and Second Language Acquisition’ addresses the strengths, weaknesses, opportunities and potential threats of using both learner corpora and Cognitive Linguistics to research second language acquisition. We also discuss some terminological issues related to the notion of second language acquisition. Finally, we present the various chapters included in the volume and explain how each of them concretely articulates the connections between the three disciplines under analysis.


Languages ◽  
2021 ◽  
Vol 6 (2) ◽  
pp. 61
Author(s):  
Lisa Kornder ◽  
Ineke Mennen

The purpose of this investigation was to trace first (L1) and second language (L2) segmental speech development in the Austrian German–English late bilingual Arnold Schwarzenegger over a period of 40 years, which makes it the first study to examine a bilingual’s speech development over several decades in both their languages. To this end, acoustic measurements of voice onset time (VOT) durations of word-initial plosives (Study 1) and formant frequencies of the first and second formant of Austrian German and English monophthongs (Study 2) were conducted using speech samples collected from broadcast interviews. The results of Study 1 showed a merging of Schwarzenegger’s German and English voiceless plosives in his late productions as manifested in a significant lengthening of VOT duration in his German plosives, and a shortening of VOT duration in his English plosives, closer to L1 production norms. Similar findings were evidenced in Study 2, revealing that some of Schwarzenegger’s L1 and L2 vowel categories had moved closer together in the course of L2 immersion. These findings suggest that both a bilingual’s first and second language accent is likely to develop and reorganize over time due to dynamic interactions between the first and second language system.


Author(s):  
Marco Angrisani ◽  
Anya Samek ◽  
Arie Kapteyn

The number of data sources available for academic research on retirement economics and policy has increased rapidly in the past two decades. Data quality and comparability across studies have also improved considerably, with survey questionnaires progressively converging towards common ways of eliciting the same measurable concepts. Probability-based Internet panels have become a more accepted and recognized tool to obtain research data, allowing for fast, flexible, and cost-effective data collection compared to more traditional modes such as in-person and phone interviews. In an era of big data, academic research has also increasingly been able to access administrative records (e.g., Kostøl and Mogstad, 2014; Cesarini et al., 2016), private-sector financial records (e.g., Gelman et al., 2014), and administrative data married with surveys (Ameriks et al., 2020), to answer questions that could not be successfully tackled otherwise.


GigaScience ◽  
2020 ◽  
Vol 9 (10) ◽  
Author(s):  
Daniel Arend ◽  
Patrick König ◽  
Astrid Junker ◽  
Uwe Scholz ◽  
Matthias Lange

Background: The FAIR data principle as a commitment to support long-term research data management is widely accepted in the scientific community. Although the ELIXIR Core Data Resources and other established infrastructures provide comprehensive and long-term stable services and platforms for FAIR data management, a large quantity of research data is still hidden or at risk of getting lost. Currently, high-throughput plant genomics and phenomics technologies are producing research data in abundance, the storage of which is not covered by established core databases. This concerns the data volume, e.g., time series of images or high-resolution hyper-spectral data; the quality of data formatting and annotation, e.g., with regard to structure and annotation specifications of core databases; uncovered data domains; or organizational constraints prohibiting primary data storage outside institutional boundaries. Results: To share these potentially dark data in a FAIR way and master these challenges, the ELIXIR Germany/de.NBI service Plant Genomic and Phenomics Research Data Repository (PGP) implements a “bring the infrastructure to the data” approach, which allows research data to be kept in place and wrapped in a FAIR-aware software infrastructure. This article presents new features of the e!DAL infrastructure software and the PGP repository as a best practice on how to easily set up FAIR-compliant and intuitive research data services. Furthermore, the integration of the ELIXIR Authentication and Authorization Infrastructure (AAI) and data discovery services are introduced as means to lower technical barriers and to increase the visibility of research data. Conclusion: The e!DAL software has matured into a powerful and FAIR-compliant infrastructure, while keeping the focus on flexible setup and integration into existing infrastructures and into the daily research process.


Author(s):  
Fabian Cremer ◽  
Silvia Daniel ◽  
Marina Lemaire ◽  
Katrin Moeller ◽  
Matthias Razum ◽  
...  
