language model
Recently Published Documents





Guirong Bai, Shizhu He, Kang Liu, Jun Zhao

Active learning is an effective method to substantially alleviate the problem of expensive annotation cost for data-driven models. Recently, pre-trained language models have been demonstrated to be powerful for learning language representations. In this article, we demonstrate that the pre-trained language model can also utilize its learned textual characteristics to enrich criteria of active learning. Specifically, we provide extra textual criteria with the pre-trained language model to measure instances, including noise, coverage, and diversity. With these extra textual criteria, we can select more efficient instances for annotation and obtain better results. We conduct experiments on both English and Chinese sentence matching datasets. The experimental results show that the proposed active learning approach can be enhanced by the pre-trained language model and obtain better performance.

Shu Jiang, Zuchao Li, Hai Zhao, Bao-Liang Lu, Rui Wang

In recent years, research on dependency parsing has focused on improving accuracy on domain-specific (in-domain) test datasets and has made remarkable progress. However, innumerable real-world scenarios are not covered by these datasets; such data are out-of-domain. As a result, parsers that perform well on in-domain data usually suffer significant performance degradation on out-of-domain data. Therefore, to adapt existing high-performing in-domain parsers to a new domain, cross-domain transfer learning methods are essential. This paper examines two scenarios for cross-domain transfer learning: semi-supervised and unsupervised. Specifically, we adopt the pre-trained language model BERT for training on the source-domain (in-domain) data at the subword level and introduce self-training methods derived from tri-training for these two scenarios. The evaluation results on the NLPCC-2019 shared task and the universal dependency parsing task indicate the effectiveness of the adopted approaches for cross-domain transfer learning and show the potential of self-learning for cross-lingual transfer learning.

Xianwen Liao, Yongzhong Huang, Peng Yang, Lei Chen

By defining the computable word segmentation unit and studying its probability characteristics, we establish an unsupervised statistical language model (SLM) for a new pre-trained sequence labeling framework in this article. The proposed SLM is an optimization model, and its objective is to maximize the total binding force of all candidate word segmentation units in sentences under the condition of no annotated datasets and vocabularies. To solve SLM, we design a recursive divide-and-conquer dynamic programming algorithm. By integrating SLM with the popular sequence labeling models, Vietnamese word segmentation, part-of-speech tagging and named entity recognition experiments are performed. The experimental results show that our SLM can effectively promote the performance of sequence labeling tasks. Just using less than 10% of training data and without using a dictionary, the performance of our sequence labeling framework is better than the state-of-the-art Vietnamese word segmentation toolkit VnCoreNLP on the cross-dataset test. SLM has no hyper-parameter to be tuned, and it is completely unsupervised and applicable to any other analytic language. Thus, it has good domain adaptability.

Deepang Raval, Vyom Pathak, Muktan Patel, Brijesh Bhatt

We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning-based approach that includes Convolutional Neural Network, Bi-directional Long Short Term Memory layers, Dense layers, and Connectionist Temporal Classification as a loss function. To improve the performance of the system with the limited size of the dataset, we present a combined language model (Word-level language Model and Character-level language model)-based prefix decoding technique and Bidirectional Encoder Representations from Transformers-based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we used the inferences from the system and proposed different analysis methods. These insights help us in understanding and improving the ASR system as well as provide intuition into the language used for the ASR system. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.87% decrease in Word Error Rate (WER) with respect to base-model WER.

Н.Н. Ефремов

The article discusses a model of spatial sentences in the Yakut language that describe the relations of adlocation and directive-finish. Such sentences are formed by constructions whose predicate is expressed by verbs of directed motion oriented toward an end point, and whose localizing actant is expressed by synthetic and analytic forms. Their typical model has three structural variants: a case variant, a variant with the postposition dieki, and a variant with adverbs and nouns. Depending on the lexical and grammatical nature of the motion verbs acting as predicates, and of the nouns and adverbs functioning as localizers, the variants take on various semantic modifications. The case variant has a relatively large number of modifications owing to their combinatorial potential, which allows them to be regarded as the core means of expressing the relations analyzed.

2022, Vol 5 (1), pp. 13
Barakat AlBadani, Ronghua Shi, Jian Dong

Twitter sentiment detectors (TSDs) provide a better solution to evaluate the quality of service and product than other traditional technologies. The classification accuracy and detection performance of TSDs depend heavily on the classification techniques used and on the quality of the input features provided. However, the time required is a major problem for existing machine learning methods, which poses a challenge for enterprises that aim to move their business processes to automated workflows. Deep learning techniques have been utilized in several real-world applications in different fields such as sentiment analysis. Deep learning approaches use different algorithms to obtain information from raw data such as texts or tweets and represent them in certain types of models. These models are used to infer information about new datasets that have not been modeled yet. We present a new effective method of sentiment analysis using deep learning architectures by combining the “universal language model fine-tuning” (ULMFiT) with support vector machine (SVM) to increase the detection efficiency and accuracy. The method introduces a new deep learning approach for Twitter sentiment analysis to detect the attitudes of people toward certain products based on their comments. The extensive results on three datasets show that our model achieves state-of-the-art results on all of them. For example, the accuracy is 99.78% on the Twitter US Airlines dataset.

2022, Vol 12 (2), pp. 804
Pau Baquero-Arnal, Javier Jorge, Adrià Giménez, Javier Iranzo-Sánchez, Alejandro Pérez

This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting of building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, following a configuration similar to that of the primary system but with a smaller context window of 0.6 s, scored 16.9% WER on the same test set, with a measured empirical latency of 0.81 ± 0.09 s (mean ± stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER, respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.
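The "6% relative" degradation quoted above follows from the two reported WER figures: (16.9 − 16.0) / 16.0 ≈ 0.056, which rounds to 6%. A minimal sketch of that arithmetic, together with a textbook word-level edit-distance WER, is given below. The function names and the example sentences are illustrative assumptions and are not taken from the challenge's scoring tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_degradation(baseline_wer: float, other_wer: float) -> float:
    """Relative WER increase of `other_wer` over `baseline_wer`."""
    return (other_wer - baseline_wer) / baseline_wer


# Figures reported in the abstract: 16.0% (primary) and 16.9% (c2 system).
degradation = relative_degradation(16.0, 16.9)

# Hypothetical sentences: one substitution and one deletion over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Relative degradation is the usual way to compare systems whose absolute WERs differ, since an absolute gap of 0.9 points means more at 16% WER than at 40% WER.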

2022
Chris Haffenden, Elena Fano, Martin Malmsten, Love Börjeson

How can novel AI techniques be made and put to use in the library? Combining methods from data and library science, this article focuses on Natural Language Processing technologies, especially in national libraries. It explains how the National Library of Sweden’s collections enabled the development of a new BERT language model for Swedish. It also outlines specific use cases for the model in the context of academic libraries, detailing strategies for how such a model could make digital collections available for new forms of research: from automated classification to enhanced searchability and improved OCR cohesion. Highlighting the potential for cross-fertilizing AI with libraries, the conclusion suggests that while AI may transform the workings of the library, libraries can also have a key role to play in the future development of AI.

2022, Vol 23 (1)
Atul Sharma, Pranjal Jain, Ashraf Mahgoub, Zihan Zhou, Kanak Mahadik

Background: Sequencing technologies are prone to errors, making error correction (EC) necessary for downstream applications. EC tools need to be manually configured for optimal performance. We find that the optimal parameters (e.g., k-mer size) are both tool- and dataset-dependent. Moreover, evaluating the performance (i.e., Alignment-rate or Gain) of a given tool usually relies on a reference genome, but quality reference genomes are not always available. We introduce Lerna for the automated configuration of k-mer-based EC tools. Lerna first creates a language model (LM) of the uncorrected genomic reads, and then, based on this LM, calculates a metric called the perplexity metric to evaluate the corrected reads for different parameter choices. Next, it finds the one that produces the highest alignment rate without using a reference genome. The fundamental intuition of our approach is that the perplexity metric is inversely correlated with the quality of the assembly after error correction. Therefore, Lerna leverages the perplexity metric for automated tuning of k-mer sizes without needing a reference genome. Results: First, we show that the best k-mer value can vary for different datasets, even for the same EC tool. This motivates our design that automates k-mer size selection without using a reference genome. Second, we show the gains of our LM using its component attention-based transformers. We show the model’s estimation of the perplexity metric before and after error correction. The lower the perplexity after correction, the better the k-mer size. We also show that the alignment rate and assembly quality computed for the corrected reads are strongly negatively correlated with the perplexity, enabling the automated selection of k-mer values for better error correction, and hence, improved assembly quality. We validate our approach on both short and long reads. Additionally, we show that our attention-based models have significant runtime improvement for the entire pipeline: 18× faster than previous works, due to parallelizing the attention mechanism and the use of JIT compilation for GPU inferencing. Conclusion: Lerna improves de novo genome assembly by optimizing EC tools. Our code is made available in a public repository at:
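The selection rule described in this abstract (fit a language model on the uncorrected reads, score the corrected reads produced for each k-mer size, prefer the lowest perplexity) can be sketched in a few lines. The sketch below is not Lerna's implementation: it swaps the attention-based transformer for a small character trigram model with add-one smoothing, and the function names, the `^` padding symbol, the four-letter alphabet, and the toy reads are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_char_ngram(reads, order=3):
    """Count character n-grams over the reads (stand-in for the LM)."""
    ngram_counts = Counter()
    context_counts = Counter()
    for read in reads:
        padded = "^" * (order - 1) + read  # "^" marks the read start
        for i in range(order - 1, len(padded)):
            context = padded[i - order + 1:i]
            ngram_counts[(context, padded[i])] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def perplexity(reads, model, order=3, alphabet_size=4):
    """Per-base perplexity of the reads under the add-one-smoothed model."""
    ngram_counts, context_counts = model
    log_prob = 0.0
    n_bases = 0
    for read in reads:
        padded = "^" * (order - 1) + read
        for i in range(order - 1, len(padded)):
            context = padded[i - order + 1:i]
            # Add-one smoothing keeps unseen n-grams from having zero probability.
            p = (ngram_counts[(context, padded[i])] + 1) / (
                context_counts[context] + alphabet_size
            )
            log_prob += math.log(p)
            n_bases += 1
    return math.exp(-log_prob / n_bases)


def select_k(corrected_by_k, model, order=3):
    """Return the k whose corrected reads have the lowest perplexity."""
    return min(
        corrected_by_k,
        key=lambda k: perplexity(corrected_by_k[k], model, order),
    )


# Toy data (hypothetical): reads before correction, and the output of an
# EC tool run with two different k-mer sizes.
uncorrected = ["ACGTACGTACGT", "ACGTACGTACGA"]
corrected_by_k = {15: ["ACGTACGTACGT"], 21: ["ACGTTTTTACGT"]}

model = train_char_ngram(uncorrected)
best_k = select_k(corrected_by_k, model)
```

Because the score depends only on the reads themselves, no reference genome enters the loop, which is the property the abstract emphasizes.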

2022
John Caskey, Iain L McConnell, Madeline Oguss, Dmitriy Dligach, Rachel Kulikoff

Background: In Wisconsin, COVID-19 case interview forms contain free text fields that need to be mined to identify potential outbreaks for targeted policy making. We developed an automated pipeline to ingest the free text into a pre-trained neural language model to identify businesses and facilities as outbreaks. Objective: We aim to examine the performance of our pipeline. Methods: Data on cases of COVID-19 were extracted from the Wisconsin Electronic Disease Surveillance System (WEDSS) for Dane County between July 1, 2020, and June 30, 2021. Features from the case interview forms were fed into a Bidirectional Encoder Representations from Transformers (BERT) model that was fine-tuned for named entity recognition (NER). We also developed a novel location mapping tool to provide addresses for relevant NERs. The pipeline was validated against known outbreaks that were already investigated and confirmed. Results: There were 46,898 cases of COVID-19 with 4,183,273 total BERT tokens and 15,051 unique tokens. The recall and precision of the NER tool were 0.67 (95% CI: 0.66-0.68) and 0.55 (95% CI: 0.54-0.57), respectively. For the location mapping tool, the recall and precision were 0.93 (95% CI: 0.92-0.95) and 0.93 (95% CI: 0.92-0.95), respectively. Across monthly intervals, the NER tool identified more potential clusters than were confirmed in the WEDSS system. Conclusions: We developed a novel pipeline of tools that identified existing outbreaks and novel clusters with associated addresses. Our pipeline ingests data from a statewide database and may be deployed to assist local health departments for targeted interventions. Clinical trial: Not applicable.
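For readers less familiar with the two metrics reported for the NER tool, the sketch below shows how recall and precision are derived from entity-level counts. The counts are hypothetical, chosen only so that the ratios land near the reported 0.67 and 0.55; the abstract does not give the underlying counts.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of predicted entities that are correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of gold entities that the tool found."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts, not from the study: 67 correct entities,
# 55 spurious predictions, 33 missed entities.
tp, fp, fn = 67, 55, 33

ner_precision = precision(tp, fp)  # 67 / 122
ner_recall = recall(tp, fn)        # 67 / 100
```

A recall higher than precision, as here, means the tool over-generates candidates, which is consistent with the abstract's remark that it flagged more potential clusters than were confirmed.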
