Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling

AbstractPurposeAutomatic keyphrase extraction (AKE) is an important task for grasping the main points of the text. In this paper, we aim to combine the benefits of sequence labeling formulation and pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research.Design/methodology/approachWe regard AKE from Chinese text as a character-level sequence labeling task to avoid segmentation errors of Chinese tokenizer and initialize our model with pretrained language model BERT, which was released by Google in 2018. We collect data from Chinese Science Citation Database and construct a large-scale dataset from medical domain, which contains 100,000 abstracts as training set, 6,000 abstracts as development set and 3,094 abstracts as test set. We use unsupervised keyphrase extraction methods including term frequency (TF), TF-IDF, TextRank and supervised machine learning methods including Conditional Random Field (CRF), Bidirectional Long Short Term Memory Network (BiLSTM), and BiLSTM-CRF as baselines. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models.FindingsCompared with character-level BiLSTM-CRF, the best baseline model with F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains F1 score of 59.80%, getting 9.64% absolute improvement.Research limitationsWe just consider automatic keyphrase extraction task rather than keyphrase generation task, so only keyphrases that are occurred in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases.Practical implicationsWe make our character-level IOB format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefits of research community, which is available at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.Originality/valueBy designing comparative experiments, our study demonstrates that character-level formulation is more suitable for Chinese automatic keyphrase extraction task under the general trend of pretrained language models. And our proposed dataset provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.

Download Full-text

Information Extraction from Electronic Medical Records Using Multitask Recurrent Neural Network with Contextual Word Embedding

Applied Sciences ◽

10.3390/app9183658 ◽

2019 ◽

Vol 9 (18) ◽

pp. 3658 ◽

Cited By ~ 6

Author(s):

Jianliang Yang ◽

Yuenan Liu ◽

Minghui Qian ◽

Chenghua Guan ◽

Xiangfei Yuan

Keyword(s):

Electronic Medical Records ◽

Medical Records ◽

Large Scale ◽

Short Term Memory ◽

Conditional Random Field ◽

Named Entity Recognition ◽

Recognition Task ◽

Entity Recognition ◽

Language Models ◽

Named Entity

Clinical named entity recognition is an essential task for humans to analyze large-scale electronic medical records efficiently. Traditional rule-based solutions need considerable human effort to build rules and dictionaries; machine learning-based solutions need laborious feature engineering. For the moment, deep learning solutions like Long Short-term Memory with Conditional Random Field (LSTM–CRF) achieved considerable performance in many datasets. In this paper, we developed a multitask attention-based bidirectional LSTM–CRF (Att-biLSTM–CRF) model with pretrained Embeddings from Language Models (ELMo) in order to achieve better performance. In the multitask system, an additional task named entity discovery was designed to enhance the model’s perception of unknown entities. Experiments were conducted on the 2010 Informatics for Integrating Biology & the Bedside/Veterans Affairs (I2B2/VA) dataset. Experimental results show that our model outperforms the state-of-the-art solution both on the single model and ensemble model. Our work proposes an approach to improve the recall in the clinical named entity recognition task based on the multitask mechanism.

Download Full-text

Global soil moisture data derived through machine learning trained with in-situ measurements

Scientific Data ◽

10.1038/s41597-021-00964-1 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Sungmin O. ◽

Rene Orth

Keyword(s):

Machine Learning ◽

Soil Moisture ◽

Large Scale ◽

Short Term Memory ◽

Temporal Dynamics ◽

Soil Moisture Data ◽

Wide Range ◽

Global Soil

AbstractWhile soil moisture information is essential for a wide range of hydrologic and climate applications, spatially-continuous soil moisture data is only available from satellite observations or model simulations. Here we present a global, long-term dataset of soil moisture derived through machine learning trained with in-situ measurements, SoMo.ml. We train a Long Short-Term Memory (LSTM) model to extrapolate daily soil moisture dynamics in space and in time, based on in-situ data collected from more than 1,000 stations across the globe. SoMo.ml provides multi-layer soil moisture data (0–10 cm, 10–30 cm, and 30–50 cm) at 0.25° spatial and daily temporal resolution over the period 2000–2019. The performance of the resulting dataset is evaluated through cross validation and inter-comparison with existing soil moisture datasets. SoMo.ml performs especially well in terms of temporal dynamics, making it particularly useful for applications requiring time-varying soil moisture, such as anomaly detection and memory analyses. SoMo.ml complements the existing suite of modelled and satellite-based datasets given its distinct derivation, to support large-scale hydrological, meteorological, and ecological analyses.

Download Full-text

Financial Context News Sentiment Analysis for the Lithuanian Language

Applied Sciences ◽

10.3390/app11104443 ◽

2021 ◽

Vol 11 (10) ◽

pp. 4443

Author(s):

Rokas Štrimaitis ◽

Pavel Stefanovič ◽

Simona Ramanauskaitė ◽

Asta Slotkienė

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Short Term Memory ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Experimental Investigations ◽

Support Vector ◽

Applied Machine Learning ◽

Bayes Algorithm ◽

Website Content

Financial area analysis is not limited to enterprise performance analysis. It is worth analyzing as wide an area as possible to obtain the full impression of a specific enterprise. News website content is a datum source that expresses the public’s opinion on enterprise operations, status, etc. Therefore, it is worth analyzing the news portal article text. Sentiment analysis in English texts and financial area texts exist, and are accurate, the complexity of Lithuanian language is mostly concentrated on sentiment analysis of comment texts, and does not provide high accuracy. Therefore in this paper, the supervised machine learning model was implemented to assign sentiment analysis on financial context news, gathered from Lithuanian language websites. The analysis was made using three commonly used classification algorithms in the field of sentiment analysis. The hyperparameters optimization using the grid search was performed to discover the best parameters of each classifier. All experimental investigations were made using the newly collected datasets from four Lithuanian news websites. The results of the applied machine learning algorithms show that the highest accuracy is obtained using a non-balanced dataset, via the multinomial Naive Bayes algorithm (71.1%). The other algorithm accuracies were slightly lower: a long short-term memory (71%), and a support vector machine (70.4%).

Download Full-text

A Deep Machine Learning Method for Concurrent and Interleaved Human Activity Recognition

Sensors ◽

10.3390/s20205770 ◽

2020 ◽

Vol 20 (20) ◽

pp. 5770 ◽

Cited By ~ 1

Author(s):

Keshav Thapa ◽

Zubaer Md. Abdullah Al ◽

Barsha Lamichhane ◽

Sung-Hyun Yang

Keyword(s):

Machine Learning ◽

Random Field ◽

Activity Recognition ◽

Human Activity ◽

Short Term Memory ◽

Conditional Random Field ◽

Human Activity Recognition ◽

Second Phase ◽

Complex Activity ◽

Two Phase

Human activity recognition has become an important research topic within the field of pervasive computing, ambient assistive living (AAL), robotics, health-care monitoring, and many more. Techniques for recognizing simple and single activities are typical for now, but recognizing complex activities such as concurrent and interleaving activity is still a major challenging issue. In this paper, we propose a two-phase hybrid deep machine learning approach using bi-directional Long-Short Term Memory (BiLSTM) and Skip-Chain Conditional random field (SCCRF) to recognize the complex activity. BiLSTM is a sequential generative deep learning inherited from Recurrent Neural Network (RNN). SCCRFs is a distinctive feature of conditional random field (CRF) that can represent long term dependencies. In the first phase of the proposed approach, we recognized the concurrent activities using the BiLSTM technique, and in the second phase, SCCRF identifies the interleaved activity. Accuracy of the proposed framework against the counterpart state-of-art methods using the publicly available datasets in a smart home environment is analyzed. Our experiment’s result surpasses the previously proposed approaches with an average accuracy of more than 93%.

Download Full-text

A Machine Learning Pipeline for Demand Response Capacity Scheduling

Energies ◽

10.3390/en13071848 ◽

2020 ◽

Vol 13 (7) ◽

pp. 1848 ◽

Cited By ~ 1

Author(s):

Gautham Krishnadas ◽

Aristides Kiprakis

Keyword(s):

Machine Learning ◽

Energy Balance ◽

Smart Grid ◽

Demand Response ◽

Large Scale ◽

Performance Metrics ◽

Supervised Machine Learning ◽

Algorithm Selection ◽

Load Forecast ◽

Forecast Models

Demand response (DR) is an integral component of smart grid operations that offers the necessary flexibility to support its decarbonisation. In incentive-based DR programs, deviations from the scheduled DR capacity affect the grid’s energy balance and result in revenue losses for the DR participants. This issue aggravates with increasing DR delivery from participants such as large consumer buildings who have limited standard methods to follow for DR capacity scheduling. Load curtailment based DR capacity availability from such consumers can be forecasted reliably with the help of supervised machine learning (ML) models. This study demonstrates the development of data-driven ML based total and flexible load forecast models for a retail building. The ML model development tasks such as data pre-processing, training-testing dataset preparation, cross-validation, algorithm selection, hyperparameter optimisation, feature ranking, model selection and model evaluation are guided by deployment-centric design criteria such as reliability, computational efficiency and scalability. Based on the selected performance metrics, the day-ahead and week-ahead ML based load forecast models developed for the retail building are shown to outperform the timeseries persistence models used for benchmarking. Furthermore, the deployment of these models for DR capacity scheduling is proposed as an ML pipeline that can be realised with the help of ML workflows, computational resources as well as systems for monitoring and visualisation. The ML pipeline ensures faster, cost-effective and large-scale deployment of forecast models that support reliable DR capacity scheduling without affecting the grid’s energy balance. Minimisation of revenue losses encourages increased DR participation from large consumer buildings, ensuring further flexibility in the smart grid.

Download Full-text

Pushdown Automata in Statistical Machine Translation

Computational Linguistics ◽

10.1162/coli_a_00197 ◽

2014 ◽

Vol 40 (3) ◽

pp. 687-723 ◽

Cited By ~ 3

Author(s):

Cyril Allauzen ◽

Bill Byrne ◽

Adrià de Gispert ◽

Gonzalo Iglesias ◽

Michael Riley

Keyword(s):

Machine Translation ◽

Large Scale ◽

Complexity Analysis ◽

Statistical Machine Translation ◽

Language Model ◽

General Purpose ◽

Language Models ◽

Experimental Conditions ◽

Context Free ◽

Pushdown Automata

This article describes the use of pushdown automata (PDA) in the context of statistical machine translation and alignment under a synchronous context-free grammar. We use PDAs to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence. General-purpose PDA algorithms for replacement, composition, shortest path, and expansion are presented. We describe HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms. We contrast the complexity of this decoder with a decoder based on a finite state automata representation, showing that PDAs provide a more suitable framework to achieve exact decoding for larger synchronous context-free grammars and smaller language models. We assess this experimentally on a large-scale Chinese-to-English alignment and translation task. In translation, we propose a two-pass decoding strategy involving a weaker language model in the first-pass to address the results of PDA complexity analysis. We study in depth the experimental conditions and tradeoffs in which HiPDT can achieve state-of-the-art performance for large-scale SMT.

Download Full-text

The performance of LSTM models from basin to continental scales

10.5194/egusphere-egu2020-8855 ◽

2020 ◽

Author(s):

Frederik Kratzert ◽

Daniel Klotz ◽

Günter Klambauer ◽

Grey Nearing ◽

Sepp Hochreiter

Keyword(s):

Machine Learning ◽

Large Scale ◽

Short Term Memory ◽

Regional Scale ◽

Data Driven ◽

Hydrological Models ◽

Short Term ◽

Term Memory ◽

Basin Scale ◽

Long Short Term Memory

Simulation accuracy among traditional hydrological models usually degrades significantly when going from single basin to regional scale. Hydrological models perform best when calibrated for specific basins, and do worse when a regional calibration scheme is used.&#160;One reason for this is that these models do not (have to) learn hydrological processes from data. Rather, they have a predefined model structure and only a handful of parameters adapt to specific basins. This often yields less-than-optimal parameter values when the loss is not determined by a single basin, but by many through regional calibration.The opposite is true for data driven approaches where models tend to get better with more and diverse training data. We examine whether this holds true when modeling rainfall-runoff processes with deep learning, or if, like their process-based counterparts, data-driven hydrological models degrade when going from basin to regional scale.Recently, Kratzert et al. (2018) showed that the Long Short-Term Memory network (LSTM), a special type of recurrent neural network, achieves comparable performance to the SAC-SMA at basin scale. In follow up work Kratzert et al. (2019a) trained a single LSTM for hundreds of basins in the continental US, which outperformed a set of hydrological models significantly, even compared to basin-calibrated hydrological models. On average, a single LSTM is even better in out-of-sample predictions (ungauged) compared to the SAC-SMA in-sample (gauged) or US National Water Model (Kratzert et al. 2019b).LSTM-based approaches usually involve tuning a large number of hyperparameters, such as the number of neurons, number of layers, and learning rate, that are critical for the predictive performance. Therefore, large-scale hyperparameter search has to be performed to obtain a proficient LSTM network.&#160;&#160;However, in the abovementioned studies, hyperparameter optimization was not conducted at large scale and e.g. in Kratzert et al. (2018) the same network hyperparameters were used in all basins, instead of tuning hyperparameters for each basin separately. It is yet unclear whether LSTMs follow the same trend of traditional hydrological models to degrade performance from basin to regional scale.&#160;In the current study, we performed a computational expensive, basin-specific hyperparameter search to explore how site-specific LSTMs differ in performance compared to regionally calibrated LSTMs. We compared our results to the mHM and VIC models, once calibrated per-basin and once using an MPR regionalization scheme. These benchmark models were calibrated individual research groups, to eliminate bias in our study. We analyse whether differences in basin-specific vs regional model performance can be linked to basin attributes or data set characteristics.References:Kratzert, F., Klotz, D., Brenner, C., Schulz, K., and Herrnegger, M.: Rainfall&#8211;runoff modelling using Long Short-Term Memory (LSTM) networks, Hydrol. Earth Syst. Sci., 22, 6005&#8211;6022, https://doi.org/10.5194/hess-22-6005-2018, 2018.&#160;Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089&#8211;5110, https://doi.org/10.5194/hess-23-5089-2019, 2019a.&#160;Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., & Nearing, G. S.: Toward improved predictions in ungauged basins: Exploiting the power of machine learning. Water Resources Research, 55. https://doi.org/10.1029/2019WR026065, 2019b.

Download Full-text

COVID-19 sentiment analysis via deep learning during the rise of novel cases

PLoS ONE ◽

10.1371/journal.pone.0255615 ◽

2021 ◽

Vol 16 (8) ◽

pp. e0255615 ◽

Cited By ~ 1

Author(s):

Rohitash Chandra ◽

Aswin Krishna

Keyword(s):

Deep Learning ◽

Sentiment Analysis ◽

Short Term Memory ◽

Language Model ◽

Deep Understanding ◽

Language Models ◽

Catastrophic Events ◽

Social Scientists ◽

Significant Group ◽

Global Vector

Social scientists and psychologists take interest in understanding how people express emotions and sentiments when dealing with catastrophic events such as natural disasters, political unrest, and terrorism. The COVID-19 pandemic is a catastrophic event that has raised a number of psychological issues such as depression given abrupt social changes and lack of employment. Advancements of deep learning-based language models have been promising for sentiment analysis with data from social networks such as Twitter. Given the situation with COVID-19 pandemic, different countries had different peaks where rise and fall of new cases affected lock-downs which directly affected the economy and employment. During the rise of COVID-19 cases with stricter lock-downs, people have been expressing their sentiments in social media. This can provide a deep understanding of human psychology during catastrophic events. In this paper, we present a framework that employs deep learning-based language models via long short-term memory (LSTM) recurrent neural networks for sentiment analysis during the rise of novel COVID-19 cases in India. The framework features LSTM language model with a global vector embedding and state-of-art BERT language model. We review the sentiments expressed for selective months in 2020 which covers the major peak of novel cases in India. Our framework utilises multi-label sentiment classification where more than one sentiment can be expressed at once. Our results indicate that the majority of the tweets have been positive with high levels of optimism during the rise of the novel COVID-19 cases and the number of tweets significantly lowered towards the peak. We find that the optimistic, annoyed and joking tweets mostly dominate the monthly tweets with much lower portion of negative sentiments. The predictions generally indicate that although the majority have been optimistic, a significant group of population has been annoyed towards the way the pandemic was handled by the authorities.

Download Full-text

Exploring Eating Disorder Topics on Twitter: Machine Learning Approach (Preprint)

10.2196/preprints.18273 ◽

2020 ◽

Author(s):

Sicheng Zhou ◽

Yunpeng Zhao ◽

Jiang Bian ◽

Ann F Haynos ◽

Rui Zhang

Keyword(s):

Machine Learning ◽

Topic Modeling ◽

Short Term Memory ◽

Topic Model ◽

Modeling Method ◽

Mental Illnesses ◽

Computational Method ◽

Supervised Machine Learning ◽

Support Vector ◽

Domain Expert

BACKGROUND Eating disorders (EDs) are a group of mental illnesses that have an adverse effect on both mental and physical health. As social media platforms (eg, Twitter) have become an important data source for public health research, some studies have qualitatively explored the ways in which EDs are discussed on these platforms. Initial results suggest that such research offers a promising method for further understanding this group of diseases. Nevertheless, an efficient computational method is needed to further identify and analyze tweets relevant to EDs on a larger scale. OBJECTIVE This study aims to develop and validate a machine learning–based classifier to identify tweets related to EDs and to explore factors (ie, topics) related to EDs using a topic modeling method. METHODS We collected potential ED-relevant tweets using keywords from previous studies and annotated these tweets into different groups (ie, ED relevant vs irrelevant and then promotional information vs laypeople discussion). Several supervised machine learning methods, such as convolutional neural network (CNN), long short-term memory (LSTM), support vector machine, and naïve Bayes, were developed and evaluated using annotated data. We used the classifier with the best performance to identify ED-relevant tweets and applied a topic modeling method—Correlation Explanation (CorEx)—to analyze the content of the identified tweets. To validate these machine learning results, we also collected a cohort of ED-relevant tweets on the basis of manually curated rules. RESULTS A total of 123,977 tweets were collected during the set period. We randomly annotated 2219 tweets for developing the machine learning classifiers. We developed a CNN-LSTM classifier to identify ED-relevant tweets published by laypeople in 2 steps: first relevant versus irrelevant (F1 score=0.89) and then promotional versus published by laypeople (F1 score=0.90). A total of 40,790 ED-relevant tweets were identified using the CNN-LSTM classifier. We also identified another set of tweets (ie, 17,632 ED-relevant and 83,557 ED-irrelevant tweets) posted by laypeople using manually specified rules. Using CorEx on all ED-relevant tweets, the topic model identified 162 topics. Overall, the coherence rate for topic modeling was 77.07% (1264/1640), indicating a high quality of the produced topics. The topics were further reviewed and analyzed by a domain expert. CONCLUSIONS A developed CNN-LSTM classifier could improve the efficiency of identifying ED-relevant tweets compared with the traditional manual-based method. The CorEx topic model was applied on the tweets identified by the machine learning–based classifier and the traditional manual approach separately. Highly overlapping topics were observed between the 2 cohorts of tweets. The produced topics were further reviewed by a domain expert. Some of the topics identified by the potential ED tweets may provide new avenues for understanding this serious set of disorders.

Download Full-text

A deep learning and novelty detection framework for rapid phenotyping in high-content screening

10.1101/134627 ◽

2017 ◽

Cited By ~ 2

Author(s):

Christoph Sommer ◽

Rudolf Hoefler ◽

Matthias Samwer ◽

Daniel W. Gerlich

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Large Scale ◽

Novelty Detection ◽

A Priori ◽

Mitotic Cell ◽

Supervised Machine Learning ◽

High Content Screening ◽

Data Sets ◽

User Training

AbstractSupervised machine learning is a powerful and widely used method to analyze high-content screening data. Despite its accuracy, efficiency, and versatility, supervised machine learning has drawbacks, most notably its dependence on a priori knowledge of expected phenotypes and time-consuming classifier training. We provide a solution to these limitations with CellCognition Explorer, a generic novelty detection and deep learning framework. Application to several large-scale screening data sets on nuclear and mitotic cell morphologies demonstrates that CellCognition Explorer enables discovery of rare phenotypes without user training, which has broad implications for improved assay development in high-content screening.

Download Full-text