Effective and practical neural ranking

Supervised machine learning methods that use neural networks ("deep learning") have yielded substantial improvements to a multitude of Natural Language Processing (NLP) tasks in the past decade. Improvements to Information Retrieval (IR) tasks, such as ad-hoc search, lagged behind those in similar NLP tasks, despite considerable community efforts. Although there are several contributing factors, I argue in this dissertation that early attempts were not more successful because they did not properly consider the unique characteristics of IR tasks when designing and training ranking models. I first demonstrate this by showing how large-scale datasets containing weak relevance labels can successfully replace training on in-domain collections. This technique improves the variety of queries encountered when training and helps mitigate concerns of over-fitting particular test collections. I then show that dataset statistics available in specific IR tasks can be easily incorporated into neural ranking models alongside the textual features, resulting in more effective ranking models. I also demonstrate that contextualized representations, particularly those from transformer-based language models, considerably improve neural ad-hoc ranking performance. I find that this approach is limited neither to the task of ad-hoc ranking (as demonstrated by ranking clinical reports) nor to English content (as shown by training effective cross-lingual neural rankers). These efforts demonstrate that neural approaches can be effective for ranking tasks. However, I observe that these techniques are impractical due to their high query-time computational costs. To overcome this, I study approaches for offloading computational cost to index-time, substantially reducing query-time latency. These techniques make neural methods practical for ranking tasks. Finally, I analyze in depth the linguistic biases of the methods I propose in comparison with contemporary and traditional approaches. The findings from this analysis highlight potential pitfalls of recent methods and provide a way to measure progress in this area going forward.

Efficient Embedded Decoding of Neural Network Language Models in a Machine Translation System

International Journal of Neural Systems ◽

10.1142/s0129065718500077 ◽

2018 ◽

Vol 28 (09) ◽

pp. 1850007

Author(s):

Francisco Zamora-Martinez ◽

Maria Jose Castro-Bleda

Keyword(s):

Neural Network ◽

Machine Translation ◽

Language Processing ◽

Traditional Approach ◽

Computational Cost ◽

Integrated Approach ◽

Language Models ◽

Translation System ◽

Neural Net ◽

Network Language

Neural Network Language Models (NNLMs) are a successful approach to Natural Language Processing tasks such as Machine Translation. In this work we introduce a Statistical Machine Translation (SMT) system that fully integrates NNLMs in the decoding stage, breaking with the traditional approach based on N-best list rescoring. The neural net models (both language models (LMs) and translation models) are fully coupled in the decoding stage, allowing them to influence translation quality more strongly. Computational issues were solved by using a novel idea based on memorization and smoothing of the softmax constants to avoid their computation, which introduces a trade-off between LM quality and computational cost. These ideas were studied in a machine translation task with different combinations of neural networks used both as translation models and as target LMs, comparing phrase-based and N-gram-based systems. The results suggest that the integrated approach is more promising for N-gram-based systems, even with non-full-quality NNLMs.
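
The trade-off described above can be illustrated with a minimal sketch. It assumes that "memorization" means storing the softmax normalization constant for a chosen set of contexts, and that "smoothing" means falling back to an averaged constant for contexts that were not stored. The class name, the scoring function and the fallback rule are all hypothetical and are not taken from the paper.

```python
import math


class CachedSoftmaxLM:
    """Toy LM that looks up softmax normalizers instead of recomputing them."""

    def __init__(self, score_fn, vocab, known_contexts):
        self.score_fn = score_fn
        self.vocab = vocab
        # Memoize log Z(h) once for every context we expect to see often.
        self.log_z = {h: self._log_normalizer(h) for h in known_contexts}
        # Assumed smoothing rule: use the mean stored constant for unseen contexts.
        self.fallback_log_z = sum(self.log_z.values()) / len(self.log_z)

    def _log_normalizer(self, context):
        # The expensive part: a sum over the whole vocabulary.
        return math.log(sum(math.exp(self.score_fn(w, context)) for w in self.vocab))

    def log_prob(self, word, context):
        # Cheap at decoding time: one score plus a dictionary lookup.
        log_z = self.log_z.get(context, self.fallback_log_z)
        return self.score_fn(word, context) - log_z


def toy_score(word, context):
    # Hypothetical scorer: favors repeating the last context word.
    return 1.0 if context and word == context[-1] else 0.0


vocab = ["a", "b", "c"]
lm = CachedSoftmaxLM(toy_score, vocab, known_contexts=[("a",), ("b",)])
exact_mass = sum(math.exp(lm.log_prob(w, ("a",))) for w in vocab)
approx_mass = sum(math.exp(lm.log_prob(w, ())) for w in vocab)
```

For a stored context the probabilities sum to one. For the unseen empty context they do not, which is the loss in LM quality that is traded for speed.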

A fast method for statistical grammar induction

Natural Language Engineering ◽

10.1017/s1351324998001983 ◽

1998 ◽

Vol 4 (3) ◽

pp. 191-209 ◽

Cited By ~ 5

Author(s):

WIDE R. HOGENHOUT ◽

YUJI MATSUMOTO

Keyword(s):

Computational Complexity ◽

Language Processing ◽

Large Scale ◽

Learning Algorithm ◽

Computational Cost ◽

Computational Effort ◽

Fast Method ◽

Statistical Induction ◽

Stochastic Context Free Grammars ◽

Context Free

The statistical induction of stochastic context-free grammars from bracketed corpora with the Inside-Outside algorithm is an appealing method for grammar learning, but the computational complexity of this algorithm has made it impossible to generate a large-scale grammar. Researchers from natural language processing and speech recognition have suggested various methods to reduce the computational complexity and, at the same time, guide the learning algorithm towards a solution by, for example, placing constraints on the grammar. We suggest a method that strongly reduces the computational cost of the algorithm without placing constraints on the grammar. This method can in principle be combined with any of the constraints on grammars that have been suggested in earlier studies. We show that it is feasible to achieve results equivalent to earlier research, but with much lower computational effort. After a small grammar has been created, it is grown incrementally while rules that have become obsolete are removed at the same time. We explain the modifications to the algorithm, give results of experiments and compare these to results reported in other publications.
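
For orientation, the inside pass that the Inside-Outside algorithm builds on can be sketched as follows. This is the standard textbook computation for a grammar in Chomsky normal form, with spans that cross a corpus bracket skipped. It is not the authors' faster variant, whose details the abstract does not give. The grammar encoding and function names are assumptions made for illustration.

```python
from collections import defaultdict


def crosses(span, brackets):
    """True if the half-open span (i, j) partially overlaps any bracket."""
    i, j = span
    return any((i < a < j < b) or (a < i < b < j) for a, b in brackets)


def inside_probabilities(words, lexical, binary, brackets=()):
    """Inside pass for a PCFG in Chomsky normal form.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    Returns beta with beta[(i, j)][A] = P(A derives words[i:j]).
    """
    n = len(words)
    beta = defaultdict(lambda: defaultdict(float))
    # Width-1 spans come straight from the lexical rules.
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                beta[(i, i + 1)][a] += p
    # Wider spans combine two adjacent sub-spans via a binary rule.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            if crosses((i, j), brackets):
                continue  # incompatible with the corpus bracketing
            for k in range(i + 1, j):
                for (a, b, c), p in binary.items():
                    left = beta[(i, k)].get(b, 0.0)
                    right = beta[(k, j)].get(c, 0.0)
                    if left and right:
                        beta[(i, j)][a] += p * left * right
    return beta


lexical = {("S", "a"): 0.5}
binary = {("S", "S", "S"): 0.5}
free = inside_probabilities(["a", "a", "a"], lexical, binary)
bracketed = inside_probabilities(["a", "a", "a"], lexical, binary, brackets=[(0, 2)])
```

In this toy grammar the bracket (0, 2) rules out one of the two parses, so the bracketed sentence probability is half the unconstrained one. The triple loop over spans, split points and rules is the cost that makes large grammars expensive.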

Lexical predictability during natural reading: Effects of surprisal and entropy reduction

10.31234/osf.io/6f4wq ◽

2017 ◽

Author(s):

Matthew Lowder ◽

Wonil Choi ◽

Fernanda Ferreira ◽

John Henderson

Keyword(s):

Language Processing ◽

Sentence Processing ◽

Large Scale ◽

Language Models ◽

Information Complexity ◽

Complexity Metrics ◽

Processing Times ◽

Entropy Reduction ◽

Theory Of Language ◽

Word Predictability

What are the effects of word-by-word predictability on sentence processing times during the natural reading of a text? Although information-complexity metrics such as surprisal and entropy reduction have been useful in addressing this question, these metrics tend to be estimated using computational language models, which require some degree of commitment to a particular theory of language processing. Taking a different approach, the current study implemented a large-scale cumulative cloze task to collect word-by-word predictability data for 40 passages and compute surprisal and entropy reduction values in a theory-neutral manner. A separate group of participants read the same texts while their eye movements were recorded. Results showed that increases in surprisal and entropy reduction were both associated with increases in reading times. Further, these effects did not depend on the global difficulty of the text. The findings suggest that surprisal and entropy reduction independently contribute to variation in reading times, as these metrics seem to capture different aspects of lexical predictability.
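
As a rough illustration of how the two metrics can be derived from cloze responses without a language model, consider the sketch below. It uses the common definitions (surprisal as the negative log probability of the actual word, entropy reduction as the non-negative drop in entropy from one word position to the next). Whether the study used exactly these formulas, and how it handled words that no participant produced, is not stated in the abstract, so the `floor` parameter is an assumption.

```python
import math
from collections import Counter


def cloze_distribution(responses):
    """Relative frequency of each word that participants guessed."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def surprisal(dist, actual_word, floor=1e-3):
    """-log2 p(actual word); `floor` is an assumed guard against p = 0."""
    p = max(dist.get(actual_word, 0.0), floor)
    return -math.log2(p)


def entropy(dist):
    """Shannon entropy (bits) of the cloze response distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def entropy_reduction(entropy_before, entropy_after):
    """Non-negative decrease in uncertainty between two word positions."""
    return max(0.0, entropy_before - entropy_after)


guesses = ["dog", "dog", "cat", "bird"]  # hypothetical cloze responses
dist = cloze_distribution(guesses)
s = surprisal(dist, "dog")
h = entropy(dist)
```

With these four hypothetical responses, "dog" has probability 0.5, so its surprisal is 1 bit, and the distribution's entropy is 1.5 bits.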

A Pre-Training Technique to Localize Medical BERT and to Enhance Biomedical BERT

10.21203/rs.3.rs-103477/v1 ◽

2020 ◽

Author(s):

Shoya Wada ◽

Toshihiro Takeda ◽

Shiro Manabe ◽

Shozo Konishi ◽

Jun Kamohara ◽

...

Keyword(s):

Language Processing ◽

High Performance ◽

Large Scale ◽

Language Models ◽

Free Text ◽

Medical Databases ◽

Large Size ◽

Training Technique ◽

Medical Domain ◽

Medical Document

Background: Pre-training large-scale neural language models on raw texts has been shown to make a significant contribution to a strategy for transfer learning in natural language processing (NLP). With the introduction of transformer-based language models, such as Bidirectional Encoder Representations from Transformers (BERT), the performance of information extraction from free text by NLP has significantly improved for both the general domain and the medical domain; however, for languages with few publicly available medical databases of high quality and large size, it is difficult to train medical BERT models that perform well. Method: We introduce a method to train a BERT model using a small medical corpus both in English and in Japanese. Our proposed method consists of two interventions: simultaneous pre-training, which is intended to encourage masked language modeling and next-sentence prediction on the small medical corpus, and amplified vocabulary, which helps suit the small corpus when building the customized corpus by byte-pair encoding. Moreover, we used whole PubMed abstracts and developed a high-performance BERT model, Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT), in English via our method. We then evaluated the performance of our BERT models and publicly available baselines and compared them. Results: We confirmed that our Japanese medical BERT outperforms conventional baselines and the other BERT models on the medical-document classification task and that our English BERT pre-trained using both the general and medical domain corpora performs sufficiently well for practical use on the biomedical language understanding evaluation (BLUE) benchmark. Moreover, ouBioBERT's total score on the BLUE benchmark is 1.1 points above that of BioBERT and 0.3 points above that of the ablation model trained without our proposed method. Conclusions: Our proposed method makes it feasible to construct a practical medical BERT model in both Japanese and English, and it has the potential to produce higher-performing models for biomedical shared tasks.

What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations

Frontiers in Artificial Intelligence ◽

10.3389/frai.2021.767971 ◽

2021 ◽

Vol 4 ◽

Author(s):

Nikolai Ilinykh ◽

Simon Dobnik

Keyword(s):

Neural Networks ◽

Language Processing ◽

Large Scale ◽

Visual Representations ◽

Language Models ◽

Visual Stream ◽

Visual Knowledge ◽

Language And Vision ◽

The Impact ◽

Image Descriptions

Neural networks have proven to be very successful in automatically capturing the composition of language and different structures across a range of multi-modal tasks. Thus, an important question to investigate is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers where linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their respective interplay, and the task's effect on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream. Our results indicate that the information about different relations between objects in the visual stream is hierarchical and ranges from a local to a global object-level understanding of the image. In particular, while visual representations in the first layers encode the knowledge of relations between semantically similar object detections, often constituting neighbouring objects, deeper layers expand their attention across more distant objects and learn global relations between them. We also show that globally attended objects in deeper layers can be linked with entities described in image descriptions, indicating a critical finding: the indirect effect of language on visual representations. In addition, we highlight how object-based input representations affect the structure of learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question that we investigate is whether the insights from cognitive science echo the structure of representations that the current neural architecture learns. The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve on such problems as pre-training of large-scale multi-modal architectures, multi-modal information fusion and probing of attention weights. In general, we contribute to explainable multi-modal natural language processing and to the currently shallow understanding of how the input representations and the structure of the multi-modal transformer affect visual representations.

Designing production-friendly machine learning

Proceedings of the VLDB Endowment ◽

10.14778/3484224.3484241 ◽

2021 ◽

Vol 14 (13) ◽

pp. 3420-3420

Author(s):

Matei Zaharia

Keyword(s):

Machine Learning ◽

Open Source ◽

Large Scale ◽

Question Answering ◽

Failure Modes ◽

Computational Cost ◽

Language Models ◽

Software Systems ◽

Resource Cost ◽

Low Computational Cost

Building production ML applications is difficult because of their resource cost and complex failure modes. I will discuss these challenges from two perspectives: the Stanford DAWN Lab and experience with large-scale commercial ML users at Databricks. I will then present two emerging ideas to help address these challenges. The first is "ML platforms", an emerging class of software systems that standardize the interfaces used in ML applications to make them easier to build and maintain. I will give a few examples, including the open-source MLflow system from Databricks [3]. The second idea is models that are more "production-friendly" by design. As a concrete example, I will discuss retrieval-based NLP models such as Stanford's ColBERT [1, 2] that query documents from an updateable corpus to perform tasks such as question-answering, which gives multiple practical advantages, including low computational cost, high interpretability, and very fast updates to the model's "knowledge". These models are an exciting alternative to large language models such as GPT-3.
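
The retrieval step behind a ColBERT-style model can be sketched with its late-interaction score: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The hand-written two-dimensional vectors below are stand-ins for embeddings that a trained encoder would produce, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: sum over query tokens of the best-matching doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs, corpus):
    """Return document ids ordered from highest to lowest score."""
    return sorted(
        corpus,
        key=lambda doc_id: maxsim_score(query_vecs, corpus[doc_id]),
        reverse=True,
    )


# Toy unit-length token embeddings (assumed, not produced by a real encoder).
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "d1": [[1.0, 0.0]],
    "d2": [[0.6, 0.8]],
}
before = rank(query, corpus)

# Updating the model's "knowledge" is a corpus edit, not a retraining run.
corpus["d3"] = [[1.0, 0.0], [0.0, 1.0]]
after = rank(query, corpus)
```

Because document embeddings are computed independently of the query, a new document only needs to be encoded and added to the index, which is one way to read the abstract's point about very fast updates.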

Image Classification With Convolutional Neural Networks In MapReduce

10.21203/rs.3.rs-1131730/v1 ◽

2021 ◽

Author(s):

Min Chen

Keyword(s):

Neural Networks ◽

Image Classification ◽

Convolutional Neural Networks ◽

Language Processing ◽

Large Scale ◽

Data Science ◽

Fault Tolerant ◽

Computational Cost ◽

Data Intensive ◽

Computationally Intensive

Deep learning (DL) techniques, more specifically Convolutional Neural Networks (CNNs), have become increasingly popular in advancing the field of data science and have had great successes in a wide array of applications including computer vision, speech, natural language processing and more. However, the training process of CNNs is computationally intensive and costly, especially when the dataset is huge. To overcome these obstacles, this paper takes advantage of distributed frameworks and cloud computing to develop a parallel CNN algorithm. MapReduce is a scalable and fault-tolerant data processing tool that was developed to provide significant improvements in large-scale data-intensive applications in clusters. A MapReduce-based CNN (MCNN) is developed in this work to tackle the task of image classification. In addition, the proposed MCNN adopts the idea of adding dropout layers to the networks to tackle the overfitting problem. The implementation of MCNN and the way the proposed algorithm accelerates learning are examined closely and demonstrated through experiments. Results reveal high classification accuracy and significant improvements in speedup, scaleup and sizeup compared to the standard algorithms.
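
The general pattern of training in a map/reduce style can be sketched as data-parallel gradient averaging: each mapper computes a partial gradient on its data shard, and the reducer combines them into one update. The abstract does not describe how MCNN itself splits the work, so this is only an illustration of the pattern, with a one-parameter linear model standing in for the CNN and all names hypothetical.

```python
from functools import reduce


def map_gradient(shard, w):
    """Map step: summed squared-error gradient for y = w * x on one shard."""
    grad = sum(2 * (w * x - y) * x for x, y in shard)
    return grad, len(shard)


def reduce_gradients(a, b):
    """Reduce step: add partial gradient sums and example counts."""
    return a[0] + b[0], a[1] + b[1]


def train(shards, w=0.0, lr=0.05, steps=200):
    """Gradient descent where every step is one map/reduce round."""
    for _ in range(steps):
        partials = [map_gradient(shard, w) for shard in shards]
        grad_sum, n = reduce(reduce_gradients, partials)
        w -= lr * grad_sum / n
    return w


# Toy data generated from y = 3x, split across two "workers".
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]]
w_fit = train(shards)
```

In a real cluster the list comprehension would be replaced by mappers running on separate nodes. The update is identical to single-machine gradient descent because the reducer recovers the full-batch mean gradient.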

LogEvent2vec: LogEvent-to-Vector Based Anomaly Detection for Large-Scale Logs in Internet of Things

Sensors ◽

10.3390/s20092451 ◽

2020 ◽

Vol 20 (9) ◽

pp. 2451 ◽

Cited By ~ 4

Author(s):

Jin Wang ◽

Yangning Tang ◽

Shiming He ◽

Changqing Zhao ◽

Pradip Kumar Sharma ◽

...

Keyword(s):

Feature Extraction ◽

Internet Of Things ◽

Anomaly Detection ◽

Language Processing ◽

Large Scale ◽

Naive Bayes ◽

Computational Cost ◽

Computational Time ◽

Transformation Methods

Log anomaly detection is an efficient method to manage modern large-scale Internet of Things (IoT) systems. A growing number of works apply natural language processing (NLP) methods, and in particular word2vec, to log feature extraction. Word2vec can extract the relevance between words and vectorize the words. However, the computing cost of training word2vec is high. Anomalies in logs depend not only on an individual log message but also on the log message sequence. Therefore, the word vectors from word2vec cannot be used directly; they need to be transformed into vectors of log events and further transformed into vectors of log sequences. To reduce computational cost and avoid multiple transformations, in this paper we propose an offline feature extraction model, named LogEvent2vec, which takes the log event as the input of word2vec to extract the relevance between log events and vectorize log events directly. LogEvent2vec can work with any coordinate transformation method and anomaly detection model. After obtaining the log event vectors, we transform them into log sequence vectors by bary or tf-idf, and three kinds of supervised models (Random Forests, Naive Bayes, and Neural Networks) are trained to detect the anomalies. We have conducted extensive experiments on a real public log dataset from BlueGene/L (BGL). The experimental results demonstrate that, compared with word2vec, LogEvent2vec can reduce computational time by a factor of 30 and improve accuracy. LogEvent2vec with bary and Random Forest achieves the best F1-score, and LogEvent2vec with tf-idf and Naive Bayes needs the least computational time.
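
The step from log event vectors to a log sequence vector can be sketched as below. The sketch assumes that "bary" denotes the barycenter (plain mean) of the event vectors in a sequence and that the tf-idf variant is a tf-idf weighted sum. The event vectors are hand-written placeholders for what word2vec trained on event IDs would produce, and every function name is hypothetical.

```python
import math
from collections import Counter


def bary_vector(sequence, event_vecs):
    """Assumed 'bary': the mean of the event vectors in the sequence."""
    dim = len(next(iter(event_vecs.values())))
    total = [0.0] * dim
    for event in sequence:
        for k, value in enumerate(event_vecs[event]):
            total[k] += value
    return [t / len(sequence) for t in total]


def tfidf_vector(sequence, event_vecs, all_sequences):
    """Sum of event vectors weighted by tf-idf.

    Assumes `sequence` is one of `all_sequences`, so every event it
    contains has a document frequency of at least one.
    """
    dim = len(next(iter(event_vecs.values())))
    total = [0.0] * dim
    for event, count in Counter(sequence).items():
        tf = count / len(sequence)
        df = sum(1 for seq in all_sequences if event in seq)
        idf = math.log(len(all_sequences) / df)
        for k, value in enumerate(event_vecs[event]):
            total[k] += tf * idf * value
    return total


# Placeholder event embeddings and two toy log sequences.
event_vecs = {"E1": [1.0, 0.0], "E2": [0.0, 1.0]}
sequences = [["E1", "E1", "E2"], ["E1"]]
bary = bary_vector(sequences[0], event_vecs)
weighted = tfidf_vector(sequences[0], event_vecs, sequences)
```

In the toy data, event E1 occurs in every sequence, so its idf is zero and only the rarer E2 contributes to the tf-idf vector. Either vector could then be passed to a supervised classifier as the feature representation of the sequence.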

Understanding Smart City—A Data-Driven Literature Review

Sustainability ◽

10.3390/su12208460 ◽

2020 ◽

Vol 12 (20) ◽

pp. 8460 ◽

Cited By ~ 1

Author(s):

Johannes Stübinger ◽

Lucas Schneider

Keyword(s):

Time Series ◽

Language Processing ◽

Smart City ◽

Ad Hoc ◽

Waste Heat ◽

Current Trend ◽

Data Driven ◽

Heat Output ◽

Deep Dive ◽

Scientific Methods

This paper systematically reviews the top 200 Google Scholar publications in the area of smart city with the aid of data-driven methods from the fields of natural language processing and time series forecasting. Specifically, our algorithm crawls the textual information of the considered articles and uses the created ad-hoc database to identify the most relevant streams: “smart infrastructure”, “smart economy & policy”, “smart technology”, “smart sustainability”, and “smart health”. Next, we automatically assign each manuscript to one of these subject areas by means of several interdisciplinary scientific methods. Each stream is evaluated in a deep-dive analysis by (i) creating a word cloud to find the most important keywords, (ii) examining the main contributions, and (iii) applying time series methodologies to determine the past and future relevance. Because of the large scale of the literature considered, an in-depth evaluation of each stream is possible, which ultimately reveals strengths and weaknesses. We conclude that smart sustainability will come to the fore in the coming years; this confirms the current trend, as minimizing the required input of energy, water, food, waste, heat output and air pollution is becoming increasingly important.

The great transformer: Examining the role of large language models in the political economy of AI

Big Data & Society ◽

10.1177/20539517211047734 ◽

2021 ◽

Vol 8 (2) ◽

pp. 205395172110477

Author(s):

Dieuwertje Luitse ◽

Wiebke Denkena

Keyword(s):

Political Economy ◽

Language Processing ◽

Large Scale ◽

Language Models ◽

The Political ◽

Public Controversy ◽

Corporate Influence ◽

Technical Developments ◽

Environmental Footprints ◽

Ai Ethics

In recent years, AI research has become more and more computationally demanding. In natural language processing (NLP), this tendency is reflected in the emergence of large language models (LLMs) like GPT-3. These powerful neural network-based models can be used for a range of NLP tasks and their language generation capacities have become so sophisticated that it can be very difficult to distinguish their outputs from human language. LLMs have raised concerns over their demonstrable biases, heavy environmental footprints, and future social ramifications. In December 2020, critical research on LLMs led Google to fire Timnit Gebru, co-lead of the company’s AI Ethics team, which sparked a major public controversy around LLMs and the growing corporate influence over AI research. This article explores the role LLMs play in the political economy of AI as infrastructural components for AI research and development. Retracing the technical developments that have led to the emergence of LLMs, we point out how they are intertwined with the business model of big tech companies and further shift power relations in their favour. This becomes visible through the Transformer, which is the underlying architecture of most LLMs today and started the race for ever bigger models when it was introduced by Google in 2017. Using the example of GPT-3, we shed light on recent corporate efforts to commodify LLMs through paid API access and exclusive licensing, raising questions around monopolization and dependency in a field that is increasingly divided by access to large-scale computing power.
