Language Adaptation for Extending Post-Editing Estimates for Closely Related Languages

2016 ◽  
Vol 106 (1) ◽  
pp. 181-192 ◽  
Author(s):  
Miguel Rios ◽  
Serge Sharoff

Abstract
This paper presents an open-source toolkit for predicting human post-editing effort for closely related languages. At the moment, training resources for the Quality Estimation task are available for very few language directions and domains. Available resources can be expanded on the assumption that MT errors and the amount of post-editing required to correct them are comparable across related languages, even if the feature frequencies differ. In this paper we report a toolkit for achieving language adaptation, based on learning a new feature representation with transfer learning methods. In particular, we report the performance of a method based on Self-Taught Learning, which adapts the English-Spanish pair to produce Quality Estimation models for translation from English into Portuguese, Italian and other Romance languages, using the publicly available Autodesk dataset.
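
As a rough, self-contained illustration of the Self-Taught Learning idea described above, the Python sketch below learns a sparse dictionary from unlabelled target-pair feature vectors and re-encodes labelled source-pair data with it before training a standard QE regressor. The feature dimensionality, data, and model choices are illustrative assumptions, not the toolkit's actual API.

```python
# A minimal sketch of language adaptation via Self-Taught Learning for
# Quality Estimation. All arrays are random stand-ins for real QE features;
# dimensions and hyperparameters are assumptions for illustration only.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Labelled source pair (e.g. English-Spanish) and unlabelled target pair
# (e.g. English-Portuguese): same feature space, different distributions.
X_src = rng.normal(size=(500, 17))                   # source QE feature vectors
y_src = rng.uniform(0, 1, size=500)                  # post-editing effort labels
X_tgt_unlabelled = rng.normal(loc=0.3, size=(2000, 17))

# 1. Learn a sparse dictionary (the new feature representation) from the
#    unlabelled target-pair data.
dico = DictionaryLearning(n_components=50, alpha=1.0,
                          transform_algorithm="lasso_lars", random_state=0)
dico.fit(X_tgt_unlabelled)

# 2. Re-encode the labelled source data with that dictionary and train a
#    standard QE regressor on the transferred representation.
model = SVR().fit(dico.transform(X_src), y_src)

# 3. At test time, target-pair sentences are encoded the same way.
X_tgt_test = rng.normal(loc=0.3, size=(10, 17))
print(model.predict(dico.transform(X_tgt_test)))
```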

2019 ◽  
Author(s):  
Derek Howard ◽  
Marta M Maslej ◽  
Justin Lee ◽  
Jacob Ritchie ◽  
Geoffrey Woollard ◽  
...  

BACKGROUND: Mental illness affects a significant portion of the worldwide population. Online mental health forums can provide a supportive environment for those afflicted, and they also generate a large amount of data that can be mined to predict mental health states using machine learning methods.

OBJECTIVE: This study aimed to benchmark multiple methods of text feature representation for social media posts and compare their downstream use with automated machine learning (AutoML) tools. We tested on datasets that contain posts labeled for perceived suicide risk or moderator attention in the context of self-harm. Specifically, we assessed the ability of the methods to prioritize posts that a moderator would identify for immediate response.

METHODS: We used 1588 labeled posts from the Computational Linguistics and Clinical Psychology (CLPsych) 2017 shared task, collected from the Reachout.com forum. Posts were represented using lexicon-based tools, including Valence Aware Dictionary and sEntiment Reasoner, Empath, and Linguistic Inquiry and Word Count, and also using pretrained artificial neural network models, including DeepMoji, Universal Sentence Encoder, and Generative Pretrained Transformer-1 (GPT-1). We used the Tree-based Pipeline Optimization Tool (TPOT) and Auto-Sklearn as AutoML tools to generate classifiers to triage the posts.

RESULTS: The top-performing system used features derived from the GPT-1 model, which was fine-tuned on over 150,000 unlabeled posts from Reachout.com. Our top system had a macroaveraged F1 score of 0.572, providing a new state-of-the-art result on the CLPsych 2017 task. This was achieved without additional information from metadata or preceding posts. Error analyses revealed that this top system often misses expressions of hopelessness. In addition, we present visualizations that aid in understanding the learned classifiers.

CONCLUSIONS: In this study, we found that transfer learning is an effective strategy for predicting risk with relatively little labeled data, and we noted that fine-tuning of pretrained language models provides further gains when large amounts of unlabeled text are available.
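
The triage step lends itself to a short sketch. The code below runs TPOT over precomputed post embeddings, optimising macro-averaged F1 as in the CLPsych evaluation; the embeddings here are random stand-ins (in the study they would come from, e.g., the Universal Sentence Encoder or fine-tuned GPT-1), and the TPOT settings are illustrative, not the authors' configuration.

```python
# A minimal sketch of AutoML triage over post embeddings. The feature matrix
# is a random placeholder for real sentence embeddings; labels stand in for
# the four CLPsych triage categories.
import numpy as np
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1588, 512))        # one embedding per forum post
y = rng.integers(0, 4, size=1588)       # triage labels (green .. crisis)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# TPOT searches preprocessing + classifier pipelines with a genetic
# algorithm, optimising macro-averaged F1.
automl = TPOTClassifier(generations=5, population_size=20,
                        scoring="f1_macro", random_state=0, verbosity=2)
automl.fit(X_tr, y_tr)
print("held-out macro-F1:", automl.score(X_te, y_te))
```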


Author(s):  
Jiajia Luo ◽  
Wei Wang ◽  
Hairong Qi

Multi-view human action recognition has gained a lot of attention in recent years for its superior performance compared to single-view recognition. In this paper, we propose a new framework for the real-time realization of human action recognition in distributed camera networks (DCNs). We first present a new feature descriptor (Mltp-hist) that is tolerant to illumination change, robust in homogeneous regions and computationally efficient. Taking advantage of the proposed Mltp-hist, the noninformative 3-D patches generated from the background can be removed automatically, which effectively highlights the foreground patches. Next, a new feature representation method based on sparse coding is presented to generate the histogram representation of local videos to be transmitted to the base station for classification. Due to the sparse representation of extracted features, the approximation error is reduced. Finally, at the base station, a probability model is produced to fuse the information from various views and a class label is assigned accordingly. Compared to existing algorithms, the proposed framework has three advantages while requiring less memory and bandwidth: 1) no preprocessing is required; 2) communication among cameras is unnecessary; and 3) positions and orientations of cameras do not need to be fixed. We further evaluate the proposed framework on the most popular multi-view action dataset, IXMAS. Experimental results indicate that our proposed framework consistently achieves state-of-the-art results when various numbers of views are tested. In addition, our approach is tolerant to various combinations of views and benefits from introducing more views at the testing stage. In particular, our results remain satisfactory even when large misalignment exists between the training and testing samples.
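
The sparse-coding histogram step can be sketched briefly. Assuming local Mltp-hist descriptors have already been extracted from a video's 3-D patches, the code below learns a codebook, encodes each descriptor sparsely against it, and pools the codes into one fixed-length histogram per view; the descriptor dimensionality and codebook size are assumptions, and the pooling choice is illustrative rather than the paper's exact formulation.

```python
# A minimal sketch of the sparse-coding histogram representation. The patch
# descriptors are random placeholders for real Mltp-hist features.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
patches = rng.normal(size=(5000, 128))   # local descriptors from one video

# Learn a codebook over descriptors, then encode each descriptor sparsely.
codebook = MiniBatchDictionaryLearning(n_components=256, alpha=1.0,
                                       batch_size=64, random_state=0)
codes = codebook.fit_transform(patches)  # (n_patches, 256), mostly zeros

# Max-pool absolute activations over the video's patches: one compact
# histogram per camera view, cheap to transmit to the base station.
histogram = np.abs(codes).max(axis=0)    # shape (256,)
print(histogram.shape, np.count_nonzero(codes) / codes.size)
```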


Author(s):  
Liang Lan ◽  
Yu Geng

Factorization Machines (FMs), a general predictor that can efficiently model high-order feature interactions, have been widely used for regression, classification and ranking problems. However, despite many successful applications of FMs, they have two main limitations: (1) FMs model feature interactions among input features using only polynomial expansion, which fails to capture complex nonlinear patterns in data; and (2) existing FMs do not provide interpretable predictions to users. In this paper, we present a novel method named Subspace Encoding Factorization Machines (SEFM) to overcome these two limitations by using non-parametric subspace feature mapping. Due to the high sparsity of the new feature representation, our proposed method achieves the same time complexity as standard FMs but can capture more complex nonlinear patterns. Moreover, since the prediction score of our proposed model for a sample is a sum of the contribution scores of the bins and grid cells in which the sample lies in low-dimensional subspaces, it works similarly to a scoring system that only involves data binning and score addition. Therefore, our proposed method naturally provides interpretable predictions. Our experimental results demonstrate that our proposed method efficiently provides accurate and interpretable predictions.
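
The binning intuition behind SEFM can be sketched with standard tools. The code below discretises each feature into a one-hot grid and fits a linear model on the sparse encoding, so the prediction decomposes into additive per-bin contribution scores; this illustrates the binning-plus-score-addition idea only, not the paper's exact SEFM model with interaction terms.

```python
# A minimal sketch of subspace encoding via data binning. The data and bin
# counts are illustrative assumptions.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
# A target that is nonlinear in each feature but additive across features.
y = ((np.sin(3 * X[:, 0]) + X[:, 1] ** 2) > 1).astype(int)

# Discretise each feature into quantile bins and one-hot encode: the new
# representation is very sparse (exactly one active bin per feature).
binner = KBinsDiscretizer(n_bins=10, encode="onehot", strategy="quantile")
X_bins = binner.fit_transform(X)

clf = LogisticRegression(max_iter=1000).fit(X_bins, y)

# Each learned weight is the contribution score of one bin, so a sample's
# score is simply the sum of the scores of the bins it falls into.
print("train accuracy:", clf.score(X_bins, y))
```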


Author(s):  
Donatien Koulla Moulla ◽  
Alain Abran ◽  
Kolyang

For software organizations that rely on Open Source Software (OSS) to develop customer solutions and products, it is essential to accurately estimate how long it will take to deliver the expected functionalities. While OSS is supported by government policies around the world, most of the research on software project estimation has focused on conventional projects with commercial licenses. OSS effort estimation is challenging since OSS participants do not record effort data in OSS repositories. However, OSS data repositories do contain the dates of participants' contributions, and these can be used for duration estimation. This study analyses historical data on the WordPress and Swift projects to estimate OSS project duration using either commits or lines of code (LOC) as the independent variable. The study first proposes an improved classification of contributors based on each contributor's number of active days within a release's development period. For the WordPress and Swift OSS project environments, the results indicate that duration estimation models using the number of commits as the independent variable perform better than those using LOC. The estimation model for full-time contributors gives an estimate of the total duration, while the models for part-time and occasional contributors lead to better estimates of project duration, for both the commits data and the LOC data.
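
As a toy illustration of the regression setting, the sketch below fits duration (in days) against the number of commits per release; the data points are invented placeholders, not the WordPress or Swift measurements.

```python
# A minimal sketch of duration estimation with commits as the independent
# variable. The release data below is fabricated for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

commits = np.array([[120], [340], [560], [890], [1500]])  # commits per release
duration_days = np.array([35, 70, 95, 150, 240])          # observed durations

model = LinearRegression().fit(commits, duration_days)
print(f"days per commit: {model.coef_[0]:.3f}")
print(f"predicted duration for 700 commits: {model.predict([[700]])[0]:.0f} days")
```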


2018 ◽  
Author(s):  
Jiayi Wang ◽  
Kai Fan ◽  
Bo Li ◽  
Fengming Zhou ◽  
Boxing Chen ◽  
...  

2021 ◽  
Author(s):  
Sophie B. Cowling ◽  
Hamidreza Soltani ◽  
Sean Mayes ◽  
Erik H. Murchie

Abstract
Stomata are dynamic structures that control the gaseous exchange of CO2 between the external and internal environments and water loss through transpiration. The density and morphology of stomata have important consequences for crop productivity and water use efficiency, both of which are integral considerations when breeding climate change resilient crops. The phenotyping of stomata is a slow manual process and presents a substantial bottleneck when characterising phenotypic and genetic variation for crop improvement. There are currently no open-source methods to automate stomatal counting. We used 380 human-annotated micrographs of O. glaberrima and O. sativa at x20 and x40 objectives for testing and training. Training was completed using transfer learning for deep neural networks and an R-CNN object detection model. At a x40 objective our method was able to accurately detect stomata (n = 540, r = 0.94, p < 0.0001), with an overall similarity of 99% between human and automated counting methods. Our method can batch-process large sets of images. As proof of concept, we characterised the stomatal density in a population of 155 O. glaberrima accessions, using 13,100 micrographs. Here, we present Stomata Detector: an open-source software tool for the plant science community that can accurately identify stomata in Oryza spp., and potentially other monocot species.
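
The transfer-learning step can be sketched with a standard detection library. The code below fine-tunes a COCO-pretrained torchvision Faster R-CNN for a two-class (background and stoma) detection task; the dataset loading is omitted, and the dummy micrograph, box, and hyperparameters are assumptions, not the paper's training setup.

```python
# A minimal sketch of transfer learning for stomata detection with a
# torchvision Faster R-CNN. One fabricated training step is shown.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a detector pretrained on COCO and replace its box predictor
# head for the two classes: background and stoma.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

model.train()
# One illustrative step on a dummy micrograph with a single stoma box.
images = [torch.rand(3, 512, 512)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 160.0, 180.0]]),
            "labels": torch.tensor([1])}]
optimizer.zero_grad()
loss_dict = model(images, targets)    # dict of classification/box losses
sum(loss_dict.values()).backward()
optimizer.step()
```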


2021 ◽  
Author(s):  
Gabriel Borrageiro ◽  
Nick Firoozye ◽  
Paolo Barucca

We explore online inductive transfer learning, with a feature representation transfer from a radial basis function network, formed of Gaussian mixture model hidden processing units, to a direct, recurrent reinforcement learning agent. This agent is put to work in an experiment, trading the major spot market currency pairs, where we accurately account for transaction and funding costs. These sources of profit and loss, including the price trends that occur in the currency markets, are made available to the agent via a quadratic utility, and it learns to target a position directly. We improve upon earlier work by learning to target a risk position in an online transfer learning context. Our agent achieves an annualised portfolio information ratio of 0.52 with a compound return of 9.3%, net of execution and funding cost, over a seven-year test set; this is despite forcing the model to trade at the close of the trading day at 5 pm EST, when trading costs are statistically at their most expensive.
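
A compressed sketch of the feature-transfer idea follows: a Gaussian mixture fitted to market features supplies radial-basis activations, and a direct reinforcement learning trader maps them to a target position, updating its weights online against a quadratic utility. The data, utility parameters, and gradient update are illustrative assumptions, not the paper's model.

```python
# A minimal sketch of RBF feature transfer into a direct RL trader. Returns
# are simulated noise; cost and risk-aversion values are placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
returns = rng.normal(0, 1e-3, size=2000)
features = np.column_stack([returns[:-2], returns[1:-1]])  # two lagged returns
target = returns[2:]                                       # next-period return

# Source task: fit the RBF hidden layer (GMM responsibilities) on history.
gmm = GaussianMixture(n_components=8, random_state=0).fit(features[:1000])

w = np.zeros(8)                       # policy weights on transferred features
risk_aversion, lr, cost = 1.0, 0.1, 1e-4
position = 0.0
for i in range(1000, len(features)):
    phi = gmm.predict_proba(features[i:i + 1])[0]  # RBF activations
    new_position = np.tanh(w @ phi)                # target position in [-1, 1]
    r = target[i]
    # Period P&L net of transaction cost, fed through a quadratic utility.
    pnl = new_position * r - cost * abs(new_position - position)
    # Online gradient ascent on U = pnl - 0.5 * risk_aversion * pnl**2,
    # chained through pnl and the tanh position function.
    dU_dpnl = 1.0 - risk_aversion * pnl
    dpnl_dpos = r - cost * np.sign(new_position - position)
    w += lr * dU_dpnl * dpnl_dpos * (1.0 - new_position ** 2) * phi
    position = new_position
print("final weights:", w)
```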


2019 ◽  
Vol 52 (1/2) ◽  
pp. 353
Author(s):  
Bernhard Hurch

Among the founding factors of modernism in philology and linguistics were two important media innovations that came as a consequence of industrialization: new printing technologies and the installation of a general mail system. This contribution explores their catalyzing effects on the basis of Schuchardt's bascology, showing that the (pre-)academic network of epistolary exchange, reviewing and other types of contact created by those innovative scholars is at least as important for the development of the humanities as publications in the strict sense. Moreover, this paper illustrates the modern techniques applied in the digital, open-source «Hugo Schuchardt Archiv», which virtually reconstructs the network that existed in the 19th century. The archive currently publishes, among other things, nearly 8,000 letters with originals, transcriptions and comments (around 1,700 of which are of bascological interest).

