Effects of task type on morphosyntactic complexity across proficiency

2019 ◽  
Vol 3 (2) ◽  
Author(s):  
Marije Michel ◽  
Akira Murakami ◽  
Theodora Alexopoulou ◽  
Detmar Meurers

This study investigates the effect of instructional design on (morpho)syntactic complexity in second language (L2) writing development. We operationalised instructional design in terms of task type and empirically based the investigation on a large subcorpus (669,876 writings by 119,960 learners from 128 tasks at all Common European Framework of Reference for Languages levels) of the EF-Cambridge Open Language Database (EFCAMDAT; Geertzen, Alexopoulou and Korhonen 2014). First, the 128 task prompts were manually categorised for task type (e.g. argumentation, description). Next, developmental trajectories of syntactic complexity from A1 to C2 were established using a variety of global (e.g. mean length of clause) and specific (e.g. non-third person singular present tense verbs) measures extracted using natural language processing techniques. The effects of task type were analysed using the categorisation from the first step. Finally, tasks that showed atypical behaviour for a measure given their task type were explored qualitatively. Our results partially confirm earlier experimental and corpus-based studies (e.g. subordination associated with argumentative tasks). Going beyond, our large-scale data-driven analysis made it possible to identify specific measures that were naturally prompted by instructional design (e.g. narrations eliciting wh-phrases). We discuss which measures typically align with certain task types and highlight how instructional design relates to L2 developmental trajectories over time.
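As a hedged illustration of how a global measure such as mean length of clause can be extracted automatically from learner writing, the sketch below uses spaCy and a simple finite-clause heuristic; both the library and the clause-counting rule are assumptions for illustration, not the study's actual pipeline.

```python
# Illustrative sketch: mean length of clause (MLC) for a learner text.
# Assumes spaCy with an English model; the clause heuristic (count verbs
# heading clause-level dependencies) is a simplification, not the study's
# actual operationalisation.
import spacy

nlp = spacy.load("en_core_web_sm")

CLAUSAL_DEPS = {"ROOT", "ccomp", "advcl", "relcl", "acl", "conj", "xcomp"}

def mean_length_of_clause(text: str) -> float:
    doc = nlp(text)
    words = [t for t in doc if not t.is_punct and not t.is_space]
    # Approximate clause count: verbs/auxiliaries heading clausal dependencies.
    clauses = [t for t in doc
               if t.pos_ in ("VERB", "AUX") and t.dep_ in CLAUSAL_DEPS]
    return len(words) / max(len(clauses), 1)

print(mean_length_of_clause("Although it rained, we went out because we were bored."))
```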

2021 ◽  
Author(s):  
R. Salter ◽  
Quyen Dong ◽  
Cody Coleman ◽  
Maria Seale ◽  
Alicia Ruvinsky ◽  
...  

The Engineer Research and Development Center, Information Technology Laboratory’s (ERDC-ITL’s) Big Data Analytics team specializes in the analysis of large-scale datasets with capabilities across four research areas that require vast amounts of data to inform and drive analysis: large-scale data governance, deep learning and machine learning, natural language processing, and automated data labeling. Unfortunately, data transfer between government organizations is a complex and time-consuming process requiring coordination of multiple parties across multiple offices and organizations. Past successes in large-scale data analytics have placed a significant demand on ERDC-ITL researchers, highlighting that few individuals fully understand how to successfully transfer data between government organizations; future project success therefore depends on a small group of individuals to efficiently execute a complicated process. The Big Data Analytics team set out to develop a standardized workflow for the transfer of large-scale datasets to ERDC-ITL, in part to educate peers and future collaborators on the process required to transfer datasets between government organizations. Researchers also aim to increase workflow efficiency while protecting data integrity. This report provides an overview of the created Data Lake Ecosystem Workflow by focusing on the six phases required to efficiently transfer large datasets to supercomputing resources located at ERDC-ITL.


Author(s):  
Giuseppe Guglielmi ◽  
Stefano D’Errico ◽  
Cristoforo Pomara ◽  
Vittorio Fineschi

Imaging techniques (plain radiographs, multi-slice computed tomography (MSCT), and magnetic resonance imaging (MRI)) are being increasingly implemented in forensic pathology. These methods may serve as an adjuvant to classic forensic medical diagnosis and as support to forensic autopsies. It is well noted that various post-processing techniques can provide strong forensic evidence for use in legal proceedings. This chapter reviews the application of vertebral morphometry in forensics, specifically for semi-automatic digital recognition of vertebral heights in fractures by means of vertebral shape analysis. This analysis relies on six or more points positioned over the margins of each vertebra from T5 to L4, used to calculate anterior, medial, and posterior heights together with statistical shape models. The approach is quantitative, more reproducible, and more feasible for large-scale data analysis, as in drug trials, where assessment may be performed by a variety of clinicians with different levels of experience. As a result, a number of morphometric methodologies for the characterisation of osteoporosis have been developed. Current morphometric methodologies have the drawback of relying upon manual annotations. The manual placement of morphometric points on the vertebrae is time consuming, requiring more than 10 minutes per radiograph, and can be quite subjective. Several semi-automated software packages have been produced to overcome this problem, but they are mainly applicable to dual X-ray absorptiometry (DXA) scans. Furthermore, this chapter aims to verify, by means of an experimental model, whether the technique could contribute, now or in the future, to investigating the modality of traumatic vertebral injuries that may explain the manner of death.
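A hedged sketch of the height calculation behind six-point vertebral morphometry follows; the point layout and names are illustrative assumptions, not the chapter's exact protocol. Each vertebral body is marked at anterior, middle, and posterior positions on both the superior and inferior endplates, and the three heights are the distances between corresponding point pairs.

```python
# Illustrative six-point vertebral morphometry: anterior (ha), middle (hm)
# and posterior (hp) heights from points placed on the superior and
# inferior endplates. The point naming is a hypothetical convention.
from math import dist  # Python 3.8+

def vertebral_heights(points):
    """points: dict of (x, y) coordinates for keys
    'ant_sup', 'ant_inf', 'mid_sup', 'mid_inf', 'post_sup', 'post_inf'."""
    ha = dist(points["ant_sup"], points["ant_inf"])
    hm = dist(points["mid_sup"], points["mid_inf"])
    hp = dist(points["post_sup"], points["post_inf"])
    # A common wedge-deformity index compares anterior to posterior height.
    return {"anterior": ha, "middle": hm, "posterior": hp, "ha_hp_ratio": ha / hp}

example = {
    "ant_sup": (10.0, 52.0), "ant_inf": (10.5, 30.0),
    "mid_sup": (25.0, 53.0), "mid_inf": (25.2, 29.5),
    "post_sup": (40.0, 54.0), "post_inf": (40.1, 29.0),
}
print(vertebral_heights(example))
```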


2019 ◽  
Vol 8 (3) ◽  
pp. 134 ◽  
Author(s):  
Alessandro Crivellari ◽  
Euro Beinat

The rapid growth of positioning technology allows tracking motion between places, making trajectory recordings an important source of information about place connectivity, as they map the routes that people commonly take. In this paper, we utilize users' motion traces to construct a behavioral representation of places based on how people move between them, ignoring geographical coordinates and spatial proximity. Inspired by natural language processing techniques, we generate and explore vector representations of locations, traces and visitors, obtained through an unsupervised machine learning approach, which we generically named motion-to-vector (Mot2vec), trained on large-scale mobility data. The algorithm consists of two steps: trajectory pre-processing and Word2vec-based model building. First, mobility traces are converted into sequences of locations that unfold in fixed time steps; then, a Skip-gram Word2vec model is used to construct the location embeddings. Trace and visitor embeddings are finally created by combining the location vectors belonging to each trace or visitor. Mot2vec provides a meaningful representation of locations, based on the motion behavior of users, defining a direct way of comparing locations' connectivity and providing analogous similarity distributions for places of the same type. In addition, it defines a metric of similarity for traces and visitors beyond their spatial proximity and identifies common motion behaviors between different categories of people.
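A minimal sketch of the two Mot2vec steps as described, discretising traces into fixed-time-step location sequences and training a Skip-gram Word2vec model on them; gensim, NumPy, and all parameter values are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of Mot2vec-style location embeddings: (1) convert raw traces into
# sequences of location IDs at fixed time steps, (2) train a Skip-gram
# Word2vec model on those sequences. Library and parameters are assumptions.
import numpy as np
from gensim.models import Word2Vec

def discretise(trace, step_seconds=300):
    """trace: list of (timestamp_seconds, location_id) sorted by time.
    Returns the location occupied at each fixed time step."""
    sequence, idx = [], 0
    start, end = trace[0][0], trace[-1][0]
    for t in range(int(start), int(end) + 1, step_seconds):
        while idx + 1 < len(trace) and trace[idx + 1][0] <= t:
            idx += 1
        sequence.append(trace[idx][1])
    return sequence

traces = [
    [(0, "home"), (600, "cafe"), (1800, "office")],
    [(0, "home"), (900, "gym"), (2400, "office")],
]
sentences = [discretise(tr) for tr in traces]

# sg=1 selects the Skip-gram architecture, as in Mot2vec.
model = Word2Vec(sentences, vector_size=64, window=5, min_count=1, sg=1)

# A trace (or visitor) embedding can then be an aggregate of its location vectors.
trace_vec = np.mean([model.wv[loc] for loc in sentences[0]], axis=0)
```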


Author(s):  
Lewis Mitchell ◽  
Joshua Dent ◽  
Joshua Ross

It is widely accepted that different online social media platforms produce different modes of communication; however, the ways in which these modalities are shaped by the constraints of a particular platform remain difficult to quantify. On 7 November 2017, Twitter doubled its character limit to 280 characters, presenting a unique opportunity to study the response of this population to an exogenous change in the communication medium. Here we analyse a large dataset comprising 387 million English-language tweets (10% of all public tweets) collected over the September 2017 to January 2018 period to quantify and explain large-scale changes in individual behaviour and communication patterns precipitated by the character-length change. Using statistical and natural language processing techniques, we find that linguistic complexity increased after the change, with individuals writing at a significantly higher reading level. However, we find that some textual properties, such as the statistical language distribution, remain invariant across the change and are no different from writing in other online media. By fitting a generative mathematical model to the data, we find a surprisingly slow response of the Twitter population to this exogenous change, with many users taking several weeks to adjust to the new medium. In the talk we describe the model and the Bayesian parameter estimation techniques used to make these inferences. Furthermore, we argue for mathematical models as an alternative exploratory methodology for "Big" social media datasets, empowering the researcher to make inferences about the human behavioural processes that underlie large-scale patterns and trends.
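One of the linguistic-complexity measures reported, reading level, can be approximated with a standard readability formula; the sketch below uses the Flesch-Kincaid grade formula as a common stand-in, not necessarily the exact metric the authors used, and a rough syllable heuristic.

```python
# Illustrative reading-level estimate for a tweet using the Flesch-Kincaid
# grade formula. The syllable counter is a rough vowel-group heuristic.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = max(1, len(words))
    return 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59

print(flesch_kincaid_grade(
    "Twitter doubled the character limit, so we can finally finish our sentences."))
```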


Author(s):  
Mario Fernando Jojoa Acosta ◽  
Begonya Garcia-Zapirain ◽  
Marino J. Gonzalez ◽  
Bernardo Perez-Villa ◽  
Elena Urizar ◽  
...  

A review of previous work shows that this study is the first attempt to analyse the lockdown effect using natural language processing techniques, particularly sentiment analysis methods applied at large scale. It is also the first of its kind to analyse the impact of COVID-19 on the university community, covering staff and students jointly and from a multi-country perspective. The main overall findings of this work show that the most frequently associated words were family, anxiety, house and life. It was also shown that staff have a slightly less negative perception of the consequences of COVID-19 on their daily life. We used artificial intelligence models, namely Swivel embeddings and a multilayer perceptron, as classification algorithms, reaching accuracies of 88.8% and 88.5% for students and staff, respectively. The main conclusion of our study is that higher education institutions and policymakers around the world may benefit from these findings while formulating policy recommendations and strategies to support students during this and any future pandemics.
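A hedged sketch of the classification setup described, a pre-trained Swivel text-embedding layer feeding a small multilayer perceptron; the TensorFlow Hub handle, layer sizes, and training details below are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative sentiment classifier: a pre-trained Swivel text-embedding
# layer from TensorFlow Hub followed by a small multilayer perceptron.
# The Hub handle and hyperparameters are assumptions, not the study's setup.
import tensorflow as tf
import tensorflow_hub as hub

embedding = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
    input_shape=[], dtype=tf.string, trainable=False)

model = tf.keras.Sequential([
    embedding,
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # negative vs. non-negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy examples standing in for survey free-text responses.
texts = tf.constant(["I feel anxious at home", "Online classes are going fine"])
labels = tf.constant([1.0, 0.0])
model.fit(texts, labels, epochs=3, verbose=0)
```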


2021 ◽  
Vol 16 ◽  
Author(s):  
Brahim Matougui ◽  
Abdelbasset Boukelia ◽  
Hacene Belhadef ◽  
Clovis Galiez ◽  
Mohamed Batouche

Background: Metagenomics is the study of genomic content sampled in bulk from an environment of interest, such as the human gut or soil. Taxonomy, one of the most important fields of metagenomics, is the science of defining and naming groups of microbial organisms that share the same characteristics. The taxonomic classification problem is the identification and quantification of microbial species, or higher-level taxa, sampled by high-throughput sequencing.

Objective: Although many methods exist to deal with the taxonomic classification problem, assignment to low taxonomic ranks remains an important challenge for binning methods, as does scalability to the Gb-sized datasets generated with deep sequencing techniques.

Methods: In this paper, we introduce NLP-MeTaxa, a novel composition-based method for taxonomic binning that relies on word embeddings and a deep learning architecture. The proposed approach is word-based: metagenomic DNA fragments are processed as sets of overlapping words and vectorized with the word2vec model in order to feed the deep learning model. NLP-MeTaxa output is visualized as an NCBI taxonomy tree, a representation that helps to show the connections between the predicted taxonomic identifiers. NLP-MeTaxa was trained on large-scale data from NCBI RefSeq, comprising more than 14,000 complete microbial genomes. The NLP-MeTaxa code is available at https://github.com/padriba/NLP_MeTaxa/

Results: We evaluated NLP-MeTaxa on real and simulated metagenomic datasets and compared our results to those of other tools. The experimental results show that our method outperforms the other methods, especially for classification at low taxonomic ranks such as species and genus.

Conclusion: In summary, our new method may provide novel insight for understanding a microbial community through the identification of the organisms it contains.
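A minimal sketch of the word-based preprocessing described, splitting a DNA fragment into overlapping k-mer "words" and vectorizing them with word2vec; the k-mer length, the gensim library, and the mean-pooled fragment representation are illustrative assumptions rather than the NLP-MeTaxa implementation.

```python
# Sketch of NLP-MeTaxa-style preprocessing: a DNA fragment becomes a
# sentence of overlapping k-mers, which a word2vec model turns into
# vectors for a downstream deep learning classifier. k and the library
# choice (gensim) are assumptions.
import numpy as np
from gensim.models import Word2Vec

def kmers(sequence: str, k: int = 6) -> list[str]:
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

fragments = [
    "ATGCGTACGTTAGCATGCGT",
    "TTGCATGCGTACGTTAGCAA",
]
sentences = [kmers(f) for f in fragments]

w2v = Word2Vec(sentences, vector_size=32, window=5, min_count=1, sg=1)

# One simple fragment-level representation: the mean of its k-mer vectors,
# which could then feed a neural classifier over taxonomic labels.
fragment_vectors = np.array([
    np.mean([w2v.wv[km] for km in sent], axis=0) for sent in sentences
])
print(fragment_vectors.shape)  # (2, 32)
```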


2021 ◽  
Author(s):  
Yavuz Melih Özgüven ◽  
Utku Gönener ◽  
Süleyman Eken

The revolution of big data has also affected the area of sports analytics. Many big companies have started to see the benefits of combining sports analytics and big data to make a profit. Aggregating and processing big sports data from different sources becomes challenging if we rely on central processing techniques, which hurts the accuracy and timeliness of the information. Distributed systems come to the rescue as a solution to these problems, and the MapReduce paradigm is promising for large-scale data analytics. In this study, we present a big data architecture based on Docker containers and Apache Spark. We demonstrate the architecture on four data-intensive case studies in sports analytics, covering structured analysis, streaming, machine learning methods, and graph-based analysis, and show its ease of use.
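A hedged sketch of the kind of Spark job such an architecture would run for the structured-analysis case; the dataset schema and file name are hypothetical and do not reproduce the paper's actual case studies. In practice a job like this would be submitted to the containerised cluster with spark-submit against the cluster's master URL.

```python
# Illustrative PySpark job of the kind that would run inside a Dockerised
# Spark cluster: structured analysis over match-event data. The schema and
# input file name are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sports-analytics-demo").getOrCreate()

events = spark.read.json("match_events.json")  # hypothetical input

# Shots and goals per player, ordered by conversion rate.
summary = (events
           .filter(F.col("event_type").isin("shot", "goal"))
           .groupBy("player")
           .agg(F.count("*").alias("attempts"),
                F.sum(F.when(F.col("event_type") == "goal", 1)
                       .otherwise(0)).alias("goals"))
           .withColumn("conversion", F.col("goals") / F.col("attempts"))
           .orderBy(F.col("conversion").desc()))

summary.show()
spark.stop()
```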


Author(s):  
Tim Althoff ◽  
Kevin Clark ◽  
Jure Leskovec

Mental illness is one of the most pressing public health issues of our time. While counseling and psychotherapy can be effective treatments, our knowledge about how to conduct successful counseling conversations has been limited due to lack of large-scale data with labeled outcomes of the conversations. In this paper, we present a large-scale, quantitative study on the discourse of text-message-based counseling conversations. We develop a set of novel computational discourse analysis methods to measure how various linguistic aspects of conversations are correlated with conversation outcomes. Applying techniques such as sequence-based conversation models, language model comparisons, message clustering, and psycholinguistics-inspired word frequency analyses, we discover actionable conversation strategies that are associated with better conversation outcomes.
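A minimal sketch of one of the simpler techniques mentioned, a psycholinguistics-inspired word-frequency comparison between conversations with better and worse outcomes; the outcome grouping and the smoothed log-odds measure below are generic illustrative choices, not the authors' exact statistic.

```python
# Illustrative word-frequency comparison: which words are relatively more
# common in counseling conversations with positive outcomes? The smoothed
# log-odds ratio here is a generic choice, not the paper's exact method.
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

positive_convos = ["you are not alone, we can work through this together"]
negative_convos = ["ok", "i see"]

pos = Counter(w for c in positive_convos for w in tokenize(c))
neg = Counter(w for c in negative_convos for w in tokenize(c))
vocab = set(pos) | set(neg)
n_pos, n_neg = sum(pos.values()), sum(neg.values())

def log_odds(word: str, alpha: float = 0.5) -> float:
    p = (pos[word] + alpha) / (n_pos + alpha * len(vocab))
    q = (neg[word] + alpha) / (n_neg + alpha * len(vocab))
    return math.log(p / q)

for w in sorted(vocab, key=log_odds, reverse=True)[:5]:
    print(w, round(log_odds(w), 2))
```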

