FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining

Author(s):  
Zhuang Liu ◽  
Degen Huang ◽  
Kaiyu Huang ◽  
Zhuang Li ◽  
Jun Zhao

There is growing interest in financial text mining. Over the past few years, Natural Language Processing (NLP) based on deep learning has advanced rapidly, and deep learning models have shown promising results on financial text mining tasks. However, because NLP models require large amounts of labeled training data, applying deep learning to financial text mining often fails due to the scarcity of labeled data in financial fields. To address this issue, we present FinBERT (BERT for Financial Text Mining), a domain-specific language model pre-trained on large-scale financial corpora. Unlike BERT, FinBERT is trained with six pre-training tasks covering additional knowledge, simultaneously on general and financial-domain corpora, which enables the model to better capture language knowledge and semantic information. The results show that our FinBERT outperforms all current state-of-the-art models. Extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.
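A minimal sketch of how such a domain-specific checkpoint could be applied to a downstream financial sentiment task, assuming a Hugging Face-style checkpoint; the checkpoint path and the three-way label set are placeholders, since the abstract does not name them:

```python
# Minimal sketch: loading a BERT-style financial checkpoint and scoring a
# sentence. "path/to/finbert-checkpoint" is a placeholder, not an official
# model identifier from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("path/to/finbert-checkpoint")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/finbert-checkpoint", num_labels=3)  # e.g. negative / neutral / positive

texts = ["Quarterly revenue rose 12%, beating analyst expectations."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits
print(logits.softmax(dim=-1))  # per-class probabilities for each sentence
```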

Author(s):  
Nukabathini Mary Saroj Sahithya ◽  
Manda Prathyusha ◽  
Nakkala Rachana ◽  
Perikala Priyanka ◽  
P. J. Jyothi

Product reviews are valuable for prospective buyers, helping them make decisions. To this end, different opinion mining techniques have been proposed, where judging a review sentence's orientation (e.g. positive or negative) is one of the key challenges. Recently, deep learning has emerged as an effective means of solving sentiment classification problems. Deep learning is a class of machine learning algorithms that learn in supervised and unsupervised manners; a neural network intrinsically learns a useful representation automatically, without human effort. However, the success of deep learning relies heavily on large-scale training data. We propose a novel deep learning framework for product review sentiment classification which employs widely available ratings as weak supervision signals. The framework consists of two steps: (1) learning a high-level representation (an embedding space) that captures the general sentiment distribution of sentences through rating information; (2) adding a category layer on top of the embedding layer and using labelled sentences for supervised fine-tuning. We explore two kinds of low-level network structure for modelling review sentences, namely convolutional feature extractors and long short-term memory (LSTM) networks. The convolutional layer is the core building block of a CNN and consists of kernels; typical applications include image and video recognition, image classification, and natural language processing.
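A minimal PyTorch sketch of the two-step framework under stated assumptions: a small convolutional sentence encoder is first paired with a rating head (step 1), then a category layer is added and fine-tuned on labelled sentences (step 2). All layer sizes are illustrative; an LSTM encoder could be substituted for the CNN in the same way.

```python
# Two-step weakly supervised sentiment framework (sketch).
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Small convolutional feature extractor over word embeddings."""
    def __init__(self, vocab_size, emb_dim=128, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)         # (batch, emb, seq)
        h = torch.relu(self.conv(x)).max(dim=2).values  # max-over-time pooling
        return h                                        # (batch, hidden)

encoder = CNNEncoder(vocab_size=20000)
rating_head = nn.Linear(100, 5)      # step 1: predict the 1-5 star rating
sentiment_head = nn.Linear(100, 2)   # step 2: positive/negative category layer

# Step 1: train encoder + rating_head on rating-supervised sentences.
# Step 2: keep (or continue training) the encoder and fine-tune
#         sentiment_head on the smaller set of explicitly labelled sentences.
```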


Author(s):  
Mohammed Boukabous ◽  
Mostafa Azizi

Deep learning (DL) approaches use multiple processing layers to learn hierarchical representations of data. Recently, many natural language processing (NLP) methods and model designs have advanced significantly, especially in text mining and analysis. For learning vector-space representations of text, there are well-known models such as Word2vec, GloVe, and fastText. NLP took a further step forward when BERT and, more recently, GPT-3 came out. In this paper, we highlight the most important language representation learning models in NLP and provide insight into their evolution. We also summarize, compare, and contrast these models on sentiment analysis, and discuss their main strengths and limitations. Our results show that BERT is the best-performing language representation learning model.
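For illustration, a rough sketch of the kind of comparison described: the same downstream classifier trained on averaged static word vectors versus contextual BERT sentence features. The model names are standard public checkpoints, and dataset loading is left as a placeholder.

```python
# Sketch: two sentence representations feeding the same linear classifier.
import numpy as np
import torch
import gensim.downloader as api
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

def static_features(texts, name="glove-wiki-gigaword-100"):
    """Average pretrained GloVe vectors over the words of each sentence."""
    wv = api.load(name)
    feats = []
    for t in texts:
        vecs = [wv[w] for w in t.lower().split() if w in wv]
        feats.append(np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size))
    return np.stack(feats)

def bert_features(texts, name="bert-base-uncased"):
    """Use the [CLS] vector of a BERT encoder as the sentence representation."""
    tok, enc = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch).last_hidden_state[:, 0]
    return out.numpy()

# texts, labels = load_sentiment_dataset()   # placeholder for a labelled corpus
# for featurize in (static_features, bert_features):
#     clf = LogisticRegression(max_iter=1000).fit(featurize(texts), labels)
```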


Author(s):  
Juntao Li ◽  
Ruidan He ◽  
Hai Ye ◽  
Hwee Tou Ng ◽  
Lidong Bing ◽  
...  

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements across various cross-lingual and low-resource tasks. By training on one hundred languages and terabytes of text, cross-lingual language models have proven effective at leveraging high-resource languages to improve low-resource language processing, and they outperform monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting, in which a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting.
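A very rough sketch of the decomposition idea under stated assumptions: two projection heads split a pretrained representation into a domain-invariant part and a domain-specific part, and a MINE-style statistics network estimates the mutual information between them so that it can be minimized during training. The dimensions and the estimator are illustrative, not the paper's exact formulation.

```python
# Feature decomposition with a mutual-information estimate (sketch).
import torch
import torch.nn as nn

class Decomposer(nn.Module):
    def __init__(self, dim=768, part=256):
        super().__init__()
        self.invariant = nn.Linear(dim, part)   # domain-invariant projection
        self.specific = nn.Linear(dim, part)    # domain-specific projection
        self.stat = nn.Sequential(              # MINE-style statistics network
            nn.Linear(2 * part, 128), nn.ReLU(), nn.Linear(128, 1))

    def mi_estimate(self, a, b):
        # Donsker-Varadhan lower bound on I(a; b); the encoder minimizes it,
        # while the statistics network is trained to maximize it (adversarial).
        joint = self.stat(torch.cat([a, b], dim=1)).mean()
        shuffled = b[torch.randperm(b.size(0))]
        marginal = torch.logsumexp(self.stat(torch.cat([a, shuffled], dim=1)), dim=0) \
                   - torch.log(torch.tensor(float(b.size(0))))
        return joint - marginal

    def forward(self, h):                        # h: pretrained sentence encoding
        inv, spec = self.invariant(h), self.specific(h)
        return inv, spec, self.mi_estimate(inv, spec)
```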


2021 ◽  
Vol 8 (2) ◽  
pp. 205316802110222
Author(s):  
Hannah Béchara ◽  
Alexander Herzog ◽  
Slava Jankin ◽  
Peter John

Topic models are widely used in natural language processing, allowing researchers to estimate the underlying themes in a collection of documents. Most topic models require the additional step of attaching meaningful labels to estimated topics, a process that is not scalable, suffers from human bias, and is difficult to replicate. We present a transfer topic labeling method that seeks to remedy these problems, using domain-specific codebooks as the knowledge base to automatically label estimated topics. We demonstrate our approach with a large-scale topic model analysis of the complete corpus of UK House of Commons speeches from 1935 to 2014, using the coding instructions of the Comparative Agendas Project to label topics. We evaluated our results using human expert coding and compared our approach with state-of-the-art neural methods. Our approach was simple to implement, compared favorably to expert judgments, and outperformed the neural network models for a majority of the topics we estimated.
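A toy sketch of transfer topic labelling under simple assumptions: each estimated topic is represented by its top words, each codebook category by its coding instructions, and a topic receives the label of its most similar codebook entry. TF-IDF with cosine similarity stands in for whatever matching the paper actually uses, and the codebook entries below are hypothetical.

```python
# Label estimated topics with the nearest codebook category (sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

codebook = {  # hypothetical Comparative Agendas Project-style entries
    "Health": "health care hospital patient insurance treatment drug regulation",
    "Defence": "armed forces military army navy troops procurement veterans",
}

topics = ["nhs hospital patient doctor waiting treatment",   # top words of topic 0
          "army navy troops deployment equipment budget"]    # top words of topic 1

vec = TfidfVectorizer()
matrix = vec.fit_transform(list(codebook.values()) + topics)
sims = cosine_similarity(matrix[len(codebook):], matrix[:len(codebook)])
labels = [list(codebook)[i] for i in sims.argmax(axis=1)]
print(labels)   # expected: ['Health', 'Defence']
```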


2021 ◽  
Vol 13 (3) ◽  
pp. 364
Author(s):  
Han Gao ◽  
Jinhui Guo ◽  
Peng Guo ◽  
Xiuwan Chen

Recently, deep learning has become the most innovative trend for a variety of high-spatial-resolution remote sensing imaging applications. However, large-scale land cover classification via traditional convolutional neural networks (CNNs) with sliding windows is computationally expensive and produces coarse results. Additionally, although such supervised learning approaches have performed well, collecting and annotating datasets for every task is extremely laborious, especially for fully supervised cases where the pixel-level ground-truth labels are dense. In this work, we propose a new object-oriented deep learning framework that leverages residual networks with different depths to learn adjacent feature representations by embedding a multibranch architecture in the deep learning pipeline. The idea is to exploit limited training data at different neighboring scales to make a tradeoff between weak semantics and strong feature representations for operational land cover mapping tasks. We draw on established geographic object-based image analysis (GEOBIA) as an auxiliary module to reduce the computational burden of spatial reasoning and optimize the classification boundaries. We evaluated the proposed approach on two subdecimeter-resolution datasets involving both urban and rural landscapes. It achieved better classification accuracy (88.9%) than traditional object-based deep learning methods and an excellent inference time (11.3 s/ha).
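An illustrative sketch of using object-based segmentation as an auxiliary module: pixel-wise class scores from a CNN are aggregated within each image object, so classification boundaries follow the segmentation rather than a sliding window. SLIC from scikit-image stands in for the GEOBIA segmentation here; this is not the paper's exact pipeline.

```python
# Aggregate per-pixel CNN scores inside image objects (sketch).
import numpy as np
from skimage.segmentation import slic

def object_based_labels(image, pixel_scores):
    """image: (H, W, 3) array; pixel_scores: (H, W, n_classes) CNN output."""
    segments = slic(image, n_segments=500, start_label=0)
    labels = np.zeros(image.shape[:2], dtype=int)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        # average the CNN scores over the object, then pick the best class
        labels[mask] = pixel_scores[mask].mean(axis=0).argmax()
    return labels
```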


AI ◽  
2021 ◽  
Vol 2 (1) ◽  
pp. 1-16
Author(s):  
Juan Cruz-Benito ◽  
Sanjay Vishwakarma ◽  
Francisco Martin-Fernandez ◽  
Ismael Faro

In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures such as Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformers, using transfer learning and different forms of tokenization, to see how they behave in building language models on a Python dataset for code generation and fill-mask tasks. Considering the results, we discuss each approach's different strengths and weaknesses and what gaps we found in evaluating the language models or applying them in a real programming context.
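A small sketch of one of the preprocessing choices discussed, namely how Python source can be tokenized before being fed to any of the compared language models; the standard-library tokenizer and a character-level split are shown as two illustrative options.

```python
# Two ways to tokenize Python source for a language model (sketch).
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"

# token-level view (keeps identifiers and operators intact)
tokens = [tok.string for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.string.strip()]
print(tokens)   # ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']

# character-level view (smaller vocabulary, much longer sequences)
chars = list(source)
```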


2020 ◽  
Vol 14 (4) ◽  
pp. 471-484
Author(s):  
Suraj Shetiya ◽  
Saravanan Thirumuruganathan ◽  
Nick Koudas ◽  
Gautam Das

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates, resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing, such as embeddings, to encode the strings and use them to train a model. While this is an improvement over traditional approaches, there remains considerable scope for improvement. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to 'abc' than 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.
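A sketch of a selectivity-aware embedding objective in the spirit of the 'ab' / 'abc' / 'abd' example: strings whose prefix frequencies are close should end up near each other in the embedding space. The character-level encoder and the triplet formulation are illustrative assumptions, not Astrid's exact training recipe.

```python
# Selectivity-aware string embeddings via a triplet objective (sketch).
import torch
import torch.nn as nn

class StringEncoder(nn.Module):
    def __init__(self, n_chars=128, dim=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, s):
        ids = torch.tensor([[ord(c) % 128 for c in s]])  # character ids
        _, h = self.gru(self.emb(ids))
        return h[-1]                                     # (1, dim) string embedding

enc = StringEncoder()
loss_fn = nn.TripletMarginLoss(margin=1.0)
# Prefix frequencies: 'ab' -> 1000, 'abc' -> 800, 'abd' -> 100, so 'abc' is
# the positive and 'abd' the negative for the anchor 'ab'.
loss = loss_fn(enc("ab"), enc("abc"), enc("abd"))
loss.backward()
```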


2016 ◽  
Vol 42 (3) ◽  
pp. 391-419 ◽  
Author(s):  
Weiwei Sun ◽  
Xiaojun Wan

From the perspective of structural linguistics, we explore paradigmatic and syntagmatic lexical relations for Chinese POS tagging, an important and challenging task for Chinese language processing. Paradigmatic lexical relations are explicitly captured by word clustering on large-scale unlabeled data and are used to design new features to enhance a discriminative tagger. Syntagmatic lexical relations are implicitly captured by syntactic parsing in the constituency formalism, and are utilized via system combination. Experiments on the Penn Chinese Treebank demonstrate the importance of both paradigmatic and syntagmatic relations. Our linguistically motivated, hybrid approaches yield a relative error reduction of 18% in total over state-of-the-art baselines. Despite their effectiveness in boosting accuracy, computationally expensive parsers make hybrid systems impractical for many realistic NLP applications. In this article, we are also concerned with improving tagging efficiency at test time. In particular, we explore unlabeled data to transfer the predictive power of hybrid models to simple sequence models. Specifically, hybrid systems are utilized to create large-scale pseudo training data for cheap models. Experimental results illustrate that the re-compiled models not only achieve high accuracy with respect to per-token classification, but also serve well as a front-end to a parser.
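A schematic sketch of the knowledge-transfer step described above: the accurate but slow hybrid system tags a large pool of unlabeled sentences, and the resulting pseudo-annotations train a fast sequence tagger for test time. Both model classes are placeholders here, not references to the paper's actual implementations.

```python
# Distill a hybrid tagger into a cheap sequence model via pseudo data (sketch).
def distill(hybrid_tagger, SimpleTagger, unlabeled_sentences, gold_data):
    # 1. create large-scale pseudo training data with the expensive hybrid model
    pseudo_data = [(sent, hybrid_tagger(sent)) for sent in unlabeled_sentences]
    # 2. train the cheap sequence model on the gold data plus the pseudo data
    student = SimpleTagger()
    student.train(gold_data + pseudo_data)
    return student
```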


2019 ◽  
Author(s):  
Mojtaba Haghighatlari ◽  
Gaurav Vishwakarma ◽  
Mohammad Atif Faiz Afzal ◽  
Johannes Hachmann

We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the remoter areas of the RI distribution: We bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. By adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds, and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions by model augmentation in the extrapolation region on such a large scale. These results underscore the tremendous potential of machine learning in facilitating molecular (hyper)screening approaches on a massive scale and in accelerating the discovery of new compounds and materials, such as organic molecules with high RI for applications in opto-electronics.
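The abstract does not name the ranking metric; the sketch below assumes an NDCG-style score (a metric popular in web search) to illustrate how the ranking of top candidates could be evaluated, using synthetic refractive-index values.

```python
# NDCG-style evaluation of a top-k ranking (sketch on synthetic data).
import numpy as np

def ndcg_at_k(true_values, predicted_values, k=1000):
    order = np.argsort(predicted_values)[::-1][:k]       # model's top-k candidates
    gains = true_values[order]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))        # positions 1..k
    dcg = np.sum(gains * discounts)
    ideal = np.sum(np.sort(true_values)[::-1][:k] * discounts)
    return dcg / ideal

rng = np.random.default_rng(0)
true_ri = rng.normal(1.5, 0.1, size=100_000)              # synthetic "true" RIs
pred_ri = true_ri + rng.normal(0, 0.05, size=100_000)     # noisy model predictions
print(ndcg_at_k(true_ri, pred_ri))
```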


2018 ◽  
Author(s):  
Andre Lamurias ◽  
Luka A. Clarke ◽  
Francisco M. Couto

Abstract
Recent studies have proposed deep learning techniques, namely recurrent neural networks, to improve biomedical text mining tasks. However, these techniques rarely take advantage of existing domain-specific resources, such as ontologies. In the Life and Health Sciences there is a vast and valuable set of such resources publicly available, which are continuously being updated. Biomedical ontologies are nowadays a mainstream approach to formalizing existing knowledge about entities such as genes, chemicals, phenotypes, and disorders. These resources contain supplementary information that may not yet be encoded in training data, particularly in domains with limited labeled data.

We propose a new model, BO-LSTM, that takes advantage of domain-specific ontologies by representing each entity as the sequence of its ancestors in the ontology. We implemented BO-LSTM as a recurrent neural network with long short-term memory units, using an open biomedical ontology, which in our case study was Chemical Entities of Biological Interest (ChEBI). We assessed the performance of BO-LSTM on detecting and classifying drug-drug interactions in a publicly available corpus from an international challenge, composed of 792 drug descriptions and 233 scientific abstracts. By using the domain-specific ontology in addition to word embeddings and WordNet, BO-LSTM improved both the F1-score of the detection and the classification of drug-drug interactions, particularly in a document set with a limited number of annotations. Our findings demonstrate that, besides the high performance of current deep learning techniques, domain-specific ontologies can still be useful to mitigate the lack of labeled data.

Author summary
A high quantity of biomedical information is only available in documents such as scientific articles and patents. Due to the rate at which new documents are produced, we need automatic methods to extract useful information from them. Text mining is a subfield of information retrieval which aims at extracting relevant information from text. Scientific literature is a challenge to text mining because of the complexity and specificity of the topics approached. In recent years, deep learning has obtained promising results in various text mining tasks by exploring large datasets. On the other hand, ontologies provide a detailed and sound representation of a domain and have been developed for diverse biomedical domains. We propose a model that combines deep learning algorithms with biomedical ontologies to identify relations between concepts in text. We demonstrate the potential of this model to extract drug-drug interactions from abstracts and drug descriptions. This model can be applied to other biomedical domains, using an annotated corpus of documents and an ontology related to that domain to train a new classifier.
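A minimal sketch of the core BO-LSTM idea under stated assumptions: an entity is replaced by the sequence of its ancestors in the ontology, and that sequence is encoded with an LSTM. The toy parent map below stands in for ChEBI, and the layer sizes are illustrative.

```python
# Encode an entity by the sequence of its ontology ancestors (sketch).
import torch
import torch.nn as nn

parent = {"caffeine": "purine alkaloid",        # toy is-a hierarchy standing in
          "purine alkaloid": "alkaloid",        # for a real ChEBI subgraph
          "alkaloid": "chemical entity"}

def ancestor_sequence(entity):
    seq = [entity]
    while seq[-1] in parent:
        seq.append(parent[seq[-1]])
    return seq                                  # entity -> ... -> root

vocab = {name: i for i, name in enumerate(set(parent) | set(parent.values()))}
emb = nn.Embedding(len(vocab), 32)
lstm = nn.LSTM(32, 64, batch_first=True)

ids = torch.tensor([[vocab[a] for a in ancestor_sequence("caffeine")]])
_, (h, _) = lstm(emb(ids))
entity_vector = h[-1]          # (1, 64) ontology-aware representation of the entity
```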

