Let the Algorithm Speak: How to Use Neural Networks for Automatic Item Generation in Psychological Scale Development

2021
Author(s):  
Friedrich Martin Götz ◽  
Rakoen Maertens ◽  
Sander van der Linden

Measurement is at the heart of scientific research. As many—perhaps most—psychological constructs cannot be directly observed, there is a steady demand for sound self-report scales to assess such latent constructs. However, scale development is a tedious process that requires researchers to produce good items in large quantities. In the current tutorial, we introduce, explain, and apply the Psychometric Item Generator (PIG), an open-source, free-to-use, self-sufficient natural language processing algorithm that produces large-scale, human-like, customised text output within a few mouse clicks. The PIG is based on GPT-2, a powerful generative language model, and runs on Google Colaboratory—an interactive virtual notebook environment that executes code on state-of-the-art virtual machines at no cost. We demonstrate that, based on an input of three sentences, the PIG produces 65 items that pass initial face validity checks within a single iteration of code and a runtime of less than one minute. The PIG does not require any prior coding skills or access to computational resources and can be easily tailored to any desired context by simply switching out short linguistic prompts in a single line of code. Additionally, the PIG can be used as a bottom-up tool to expand and diversify the conceptual understanding of a construct or derive hypotheses about its relationships to other, existing constructs. In short, we present an effective, novel machine learning solution to an old psychological challenge. As such, the PIG will not require you to learn a new language—but will speak yours.
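As a concrete illustration of this kind of workflow, the following minimal sketch uses the Hugging Face transformers text-generation pipeline to produce candidate items from three seed sentences. It is not the PIG's own notebook code; the construct, the prompt sentences, and the sampling parameters are illustrative assumptions.

```python
# Minimal sketch of prompt-based item generation with GPT-2 via the
# Hugging Face transformers library. This is NOT the PIG's own code;
# the prompt sentences and sampling parameters are illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Three seed sentences describing the target construct (here: extraversion).
prompt = (
    "I am the life of the party. "
    "I feel comfortable around people. "
    "I start conversations."
)

outputs = generator(
    prompt,
    max_length=60,          # keep generated items short
    num_return_sequences=5, # how many candidate continuations to draw
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)

for i, out in enumerate(outputs, 1):
    # Strip the prompt and keep only the newly generated continuation.
    candidate = out["generated_text"][len(prompt):].strip()
    print(f"Candidate item {i}: {candidate}")
```

Swapping the three prompt sentences for sentences describing another construct is all that is needed to retarget the generation, which mirrors the single-line customisation described above.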

Author(s):  
Valentin Tablan ◽  
Ian Roberts ◽  
Hamish Cunningham ◽  
Kalina Bontcheva

Cloud computing is increasingly being regarded as a key enabler of the ‘democratization of science’, because on-demand, highly scalable cloud computing facilities enable researchers anywhere to carry out data-intensive experiments. In the context of natural language processing (NLP), algorithms tend to be complex, which makes their parallelization and deployment on cloud platforms a non-trivial task. This study presents a new, unique, cloud-based platform for large-scale NLP research—GATECloud.net. It enables researchers to carry out data-intensive NLP experiments by harnessing the vast, on-demand compute power of the Amazon cloud. The platform handles important infrastructural issues completely transparently for the researcher: load balancing, efficient data upload and storage, deployment on the virtual machines, security, and fault tolerance. We also include a cost–benefit analysis and usage evaluation.


Author(s):  
Aleksandra Kostic-Ljubisavljevic ◽  
Branka Mikavica

All vertically integrated participants in the content provisioning process are influenced by bandwidth requirements. Provisioning self-owned resources to satisfy peak bandwidth demand leads to network underutilization and is cost ineffective, while under-provisioning leads to rejection of customers' requests. Vertically integrated providers therefore need to consider cloud migration in order to minimize costs and improve the quality of service and quality of experience of their customers. Cloud providers maintain large-scale data centers to offer storage and computational resources in the form of virtual machine instances, under different pricing plans: reservation, on-demand, and spot pricing. To obtain an optimal integration charging strategy, revenue sharing, cost sharing, or wholesale pricing is frequently applied. The vertically integrated content provider's incentives for cloud migration can introduce significant complexity into integration contracts and, consequently, yield improvements in costs and in the request rejection rate.
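A back-of-the-envelope sketch of the trade-off between the three pricing plans mentioned above follows; all prices and the hourly demand profile are invented for illustration and do not come from the chapter.

```python
# Back-of-the-envelope comparison of reservation, on-demand, and spot pricing
# for a fluctuating demand profile. All prices and the hourly demand profile
# are assumed values for illustration only.

hourly_demand = [4, 4, 6, 10, 14, 12, 8, 5]  # VM instances needed per hour

ON_DEMAND_PRICE = 0.10   # $ per instance-hour, assumed
RESERVED_PRICE = 0.06    # $ per instance-hour, assumed (paid for provisioned capacity)
SPOT_PRICE = 0.03        # $ per instance-hour, assumed (may be interrupted)

# On-demand: pay exactly for what is used each hour.
on_demand_cost = sum(hourly_demand) * ON_DEMAND_PRICE

# Reservation: provision for peak demand and pay for it every hour.
peak = max(hourly_demand)
reserved_cost = peak * len(hourly_demand) * RESERVED_PRICE

# Hybrid: reserve the baseline load, cover bursts with spot instances.
baseline = min(hourly_demand)
hybrid_cost = baseline * len(hourly_demand) * RESERVED_PRICE + sum(
    (d - baseline) * SPOT_PRICE for d in hourly_demand
)

print(f"On-demand only:                 ${on_demand_cost:.2f}")
print(f"Reservation for peak demand:    ${reserved_cost:.2f}")
print(f"Reserved baseline + spot bursts: ${hybrid_cost:.2f}")
```

Under these assumed numbers the mixed plan is cheapest, which illustrates why a vertically integrated provider has an incentive to combine plans rather than provision self-owned resources for the peak.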


2021
Vol 22 (1)
Author(s):  
Gezheng Xu ◽  
Wenge Rong ◽  
Yanmeng Wang ◽  
Yuanxin Ouyang ◽  
Zhang Xiong

Abstract Background Biomedical question answering (QA) is a domain-specific sub-task of natural language processing which aims to answer a question in the biomedical field based on one or more related passages, and can provide people with accurate healthcare-related information. Recently, many approaches based on neural networks and large-scale pre-trained language models have substantially improved its performance. However, considering the lexical characteristics of biomedical corpora and the small scale of available datasets, there is still much room for improvement in biomedical QA tasks. Results Inspired by the importance of syntactic and lexical features in the biomedical corpus, we propose a new framework that extracts external features, such as part-of-speech and named-entity recognition tags, and fuses them with the original text representation encoded by the pre-trained language model to enhance biomedical question answering performance. Our model achieves an overall improvement on all three metrics of the BioASQ 6b, 7b, and 8b factoid question answering tasks. Conclusions The experiments on the BioASQ question answering dataset demonstrate the effectiveness of our external feature-enriched framework and show that external lexical and syntactic features can improve a pre-trained language model's performance on the biomedical question answering task.
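A minimal PyTorch sketch of the general idea of fusing part-of-speech and named-entity tag embeddings with the token representations of a pretrained biomedical encoder is shown below; the model name, tag vocabulary sizes, and concatenation-based fusion are assumptions for illustration, not the authors' exact architecture.

```python
# Minimal sketch of fusing lexical/syntactic tag features with a pretrained
# encoder's token representations. Tag vocabularies, sizes, and the simple
# concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class FeatureEnrichedEncoder(nn.Module):
    def __init__(self, model_name="dmis-lab/biobert-base-cased-v1.1",
                 n_pos_tags=50, n_ner_tags=20, tag_dim=32):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.pos_emb = nn.Embedding(n_pos_tags, tag_dim)  # part-of-speech tags
        self.ner_emb = nn.Embedding(n_ner_tags, tag_dim)  # named-entity tags
        # Project the concatenated representation back to the hidden size.
        self.fuse = nn.Linear(hidden + 2 * tag_dim, hidden)

    def forward(self, input_ids, attention_mask, pos_ids, ner_ids):
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        fused = torch.cat(
            [tokens, self.pos_emb(pos_ids), self.ner_emb(ner_ids)], dim=-1)
        # (batch, seq_len, hidden); a downstream QA head would consume this.
        return self.fuse(fused)
```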


Author(s):  
Zhuang Liu ◽  
Degen Huang ◽  
Kaiyu Huang ◽  
Zhuang Li ◽  
Jun Zhao

There is growing interest in the tasks of financial text mining. Over the past few years, deep-learning-based natural language processing (NLP) has advanced rapidly and shown promising results on financial text mining. However, because NLP models require large amounts of labeled training data, which are scarce in the financial domain, applying deep learning to financial text mining is often unsuccessful. To address this issue, we present FinBERT (BERT for Financial Text Mining), a domain-specific language model pre-trained on large-scale financial corpora. Unlike BERT, FinBERT is trained with six pre-training tasks covering more knowledge, simultaneously on general corpora and financial-domain corpora, which enables the model to better capture language knowledge and semantic information. The results show that our FinBERT outperforms all current state-of-the-art models, and extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.
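A hedged sketch of domain-adaptive pre-training on a financial corpus with the Hugging Face Trainer follows; FinBERT itself combines six pre-training tasks, whereas this sketch covers only the standard masked-language-model objective, and the corpus path and hyperparameters are placeholders.

```python
# Sketch of continued masked-language-model pre-training on a financial
# corpus. FinBERT uses six pre-training tasks; only the plain MLM objective
# is shown here. The corpus file and hyperparameters are placeholders.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# "financial_corpus.txt" is a stand-in for a large financial-domain corpus.
corpus = load_dataset("text", data_files={"train": "financial_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finbert-mlm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```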


Author(s):  
Tim Althoff ◽  
Kevin Clark ◽  
Jure Leskovec

Mental illness is one of the most pressing public health issues of our time. While counseling and psychotherapy can be effective treatments, our knowledge about how to conduct successful counseling conversations has been limited by the lack of large-scale data with labeled conversation outcomes. In this paper, we present a large-scale, quantitative study on the discourse of text-message-based counseling conversations. We develop a set of novel computational discourse analysis methods to measure how various linguistic aspects of conversations are correlated with conversation outcomes. Applying techniques such as sequence-based conversation models, language model comparisons, message clustering, and psycholinguistics-inspired word frequency analyses, we discover actionable conversation strategies that are associated with better conversation outcomes.
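A toy sketch in the spirit of the paper's psycholinguistics-inspired word frequency analyses is given below, comparing word usage in conversations with better and worse outcomes; the two miniature corpora and the smoothed log-ratio measure are placeholders, not the study's data or exact method.

```python
# Toy word-frequency comparison between counseling conversations with
# positive and negative outcomes. The corpora and the simple smoothed
# log-ratio measure are placeholders, not the study's data or method.
import math
from collections import Counter

positive_outcome = [
    "we can work through this together maybe try one small step",
    "that sounds hard what would help you feel safer tonight",
]
negative_outcome = [
    "you should just stop worrying about it",
    "i understand i understand ok",
]

def word_counts(conversations):
    counts = Counter()
    for text in conversations:
        counts.update(text.split())
    return counts

pos, neg = word_counts(positive_outcome), word_counts(negative_outcome)
vocab = set(pos) | set(neg)

# Smoothed log-ratio: positive scores lean toward better-outcome conversations.
scores = {w: math.log((pos[w] + 1) / (neg[w] + 1)) for w in vocab}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{word:>10s}  {score:+.2f}")
```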


Author(s):  
Juntao Li ◽  
Ruidan He ◽  
Hai Ye ◽  
Hwee Tou Ng ◽  
Lidong Bing ◽  
...  

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements over various cross-lingual and low-resource tasks. Through training on one hundred languages and terabytes of texts, cross-lingual language models have proven to be effective in leveraging high-resource languages to enhance low-resource language processing and outperform monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting when a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting.
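A minimal sketch of the decomposition idea follows: two projection heads split a pooled cross-lingual representation into domain-invariant and domain-specific parts. The dimensions are arbitrary, and the mutual information objectives from the paper are only indicated in comments rather than implemented.

```python
# Minimal sketch of decomposing a pretrained cross-lingual representation
# into domain-invariant and domain-specific parts with two projection heads.
# Dimensions are arbitrary; the mutual-information terms used in the paper
# are only indicated in comments, not implemented here.
import torch
import torch.nn as nn

class FeatureDecomposer(nn.Module):
    def __init__(self, hidden=768, part=256):
        super().__init__()
        self.invariant_head = nn.Linear(hidden, part)  # shared across domains
        self.specific_head = nn.Linear(hidden, part)   # captures domain signal

    def forward(self, pooled):
        # pooled: (batch, hidden), e.g. the pooled output of a cross-lingual
        # encoder such as XLM-R.
        invariant = self.invariant_head(pooled)
        specific = self.specific_head(pooled)
        return invariant, specific

# Training would encourage `invariant` to be predictive of the task while
# carrying little information about the domain label, and the reverse for
# `specific`, using a mutual information estimator (e.g. MINE or InfoNCE).
```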


2021
Author(s):  
Dezhou Shen

Abstract Recent work in language modeling has shown that training large-scale Transformer models has driven the latest developments in natural language processing applications. However, there is very little work on unifying the currently effective models. In this work, we use the currently effective model structures to launch a set of models built with the most mainstream current technology, which we believe will become the basic models in the future. For Chinese, using the GPT-2 [9] model, a 10.3 billion parameter language model was trained on a Chinese dataset, and, in particular, a 2.9 billion parameter language model based on dialogue data was trained; the BERT model was trained on a Chinese dataset with 495 million parameters; the Transformer model was trained as a language model with 5.6 billion parameters on a Chinese dataset. Corresponding training work has also been done for English. Using the GPT-2 model, a language model with 6.4 billion parameters was trained on an English dataset; the BERT [3] model was trained as a language model with 1.24 billion parameters on an English dataset, and, in particular, a 688 million parameter language model was trained using single-card training technology; the Transformer model was trained as a language model with 5.6 billion parameters on an English dataset. In the TNEWS classification task evaluated by CLUE [13], the BERT-C model exceeded the 59.46% accuracy of ALBERT-xxlarge with an accuracy of 59.99%, an increase of 0.53%. In the QQP classification task evaluated by GLUE [11], its accuracy of 78.95% surpassed the 72.1% accuracy of BERT-Large, an increase of 6.85%, and exceeded the 75.2% accuracy of ERNIE, currently first place in the GLUE evaluation, by 3.75%.


2021
Vol 55 (1)
pp. 1-2
Author(s):  
Bhaskar Mitra

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve efficiently from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018]. Our key contribution towards improving the effectiveness of deep ranking models is the development of the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework to incorporate query term independence [Mitra et al., 2019] into any arbitrary deep model, which enables large-scale precomputation and the use of the inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
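A toy sketch of the intuition behind the Duet principle is shown below, combining an exact-term-match signal with a similarity between learned latent representations; the hand-crafted features and fixed linear mixing are illustrative stand-ins, not Mitra et al.'s learned architecture.

```python
# Toy sketch of the Duet intuition: score a query-document pair with both an
# exact-term-match ("local") signal and a latent-representation similarity
# ("distributed") signal. The features and fixed mixing weight are
# illustrative only; the real model learns both parts jointly.
import numpy as np

def exact_match_score(query_terms, doc_terms):
    # Fraction of query terms that appear verbatim in the document.
    return len(set(query_terms) & set(doc_terms)) / max(len(set(query_terms)), 1)

def latent_similarity(query_vec, doc_vec):
    # Cosine similarity between (pretend) learned embeddings.
    return float(np.dot(query_vec, doc_vec) /
                 (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec) + 1e-9))

def duet_style_score(query_terms, doc_terms, query_vec, doc_vec, alpha=0.5):
    return (alpha * exact_match_score(query_terms, doc_terms)
            + (1 - alpha) * latent_similarity(query_vec, doc_vec))

rng = np.random.default_rng(0)
q_vec, d_vec = rng.normal(size=64), rng.normal(size=64)
print(duet_style_score(["acme", "x200", "battery"],
                       ["battery", "life", "of", "the", "acme", "x200"],
                       q_vec, d_vec))
```

The exact-match term is what keeps rare query terms (a person's name, a product model number) from being lost when they fall outside the learned representation's vocabulary.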

