A machine learning approach to open public comments for policymaking

In this paper, the author argues that the conflict between the copious amount of digital data processed by public organisations and the need for policy-relevant insights to aid public participation constitutes a ‘public information paradox’. Machine learning (ML) approaches may offer one solution to this paradox through algorithms that transparently collect data and use statistical modelling to provide insights for policymakers. This paper tests such an approach by applying unsupervised machine learning, in the form of latent Dirichlet allocation (LDA), to thousands of public comments submitted to the United States Transportation Security Administration (TSA) on a 2013 proposed regulation for the use of new full body imaging scanners in airport security terminals. The analysis yields salient topic clusters that policymakers could use to understand large amounts of text, such as the submissions to an open public comments process. The results are compared with the actual final proposed TSA rule, and the author reflects on new questions that the implementation of ML in open rule-making processes raises for transparency.
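
To make the LDA step concrete, here is a minimal sketch of topic extraction by collapsed Gibbs sampling, written in plain Python. It is not the author's implementation: the function names `fit_lda` and `top_words`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy comments are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenised documents.

    Returns one word distribution (dict word -> probability) per topic.
    alpha and beta are illustrative smoothing values, not tuned ones.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # total tokens in topic k

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic given all the others.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of each topic's distribution over the vocabulary.
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]


def top_words(phi, n=3):
    """Return the n most probable words of each topic."""
    return [sorted(dist, key=dist.get, reverse=True)[:n] for dist in phi]


# Hypothetical, already tokenised comments (not taken from the TSA docket).
comments = [
    ["scanner", "privacy", "image", "privacy"],
    ["privacy", "image", "scanner", "rights"],
    ["radiation", "health", "risk", "health"],
    ["health", "risk", "radiation", "scanner"],
]
topics = fit_lda(comments, num_topics=2, iters=50)
print(top_words(topics))
```

On a real corpus the tokens would first be lower-cased and stripped of stop words, and the number of topics would be chosen by inspection or a coherence measure; the sampler itself stays the same.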

Twitter Discussions and Emotions About the COVID-19 Pandemic: Machine Learning Approach (Preprint)

10.2196/preprints.20550 ◽

2020 ◽

Author(s):

Jia Xue ◽

Junxiang Chen ◽

Ran Hu ◽

Chen Chen ◽

Chengda Zheng ◽

...

Keyword(s):

Public Health ◽

Machine Learning ◽

Latent Dirichlet Allocation ◽

The United States ◽

Response Monitoring ◽

Learning Approach ◽

Learning Approaches ◽

Public Response ◽

The Public ◽

Machine Learning Approach

BACKGROUND It is important to measure the public response to the COVID-19 pandemic. Twitter is an important data source for infodemiology studies involving public response monitoring. OBJECTIVE The objective of this study is to examine COVID-19–related discussions, concerns, and sentiments using tweets posted by Twitter users. METHODS We analyzed 4 million Twitter messages related to the COVID-19 pandemic using a list of 20 hashtags (eg, “coronavirus,” “COVID-19,” “quarantine”) from March 7 to April 21, 2020. We used a machine learning approach, Latent Dirichlet Allocation (LDA), to identify popular unigrams and bigrams, salient topics and themes, and sentiments in the collected tweets. RESULTS Popular unigrams included “virus,” “lockdown,” and “quarantine.” Popular bigrams included “COVID-19,” “stay home,” “corona virus,” “social distancing,” and “new cases.” We identified 13 discussion topics and categorized them into 5 different themes: (1) public health measures to slow the spread of COVID-19, (2) social stigma associated with COVID-19, (3) COVID-19 news, cases, and deaths, (4) COVID-19 in the United States, and (5) COVID-19 in the rest of the world. Across all identified topics, the dominant sentiments for the spread of COVID-19 were anticipation that measures can be taken, followed by mixed feelings of trust, anger, and fear related to different topics. The public tweets revealed a significant feeling of fear when people discussed new COVID-19 cases and deaths compared to other topics. CONCLUSIONS This study showed that Twitter data and machine learning approaches can be leveraged for an infodemiology study, enabling research into evolving public discussions and sentiments during the COVID-19 pandemic. As the situation rapidly evolves, several topics are consistently dominant on Twitter, such as confirmed cases and death rates, preventive measures, health authorities and government policies, COVID-19 stigma, and negative psychological reactions (eg, fear). 
Real-time monitoring and assessment of Twitter discussions and concerns could provide useful data for public health emergency responses and planning. Pandemic-related fear, stigma, and mental health concerns are already evident and may continue to influence public trust when a second wave of COVID-19 occurs or there is a new surge of the current pandemic.
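
The unigram and bigram counts reported above can be illustrated with a short sketch. The tokeniser below is an assumption (lower-casing and splitting on anything that is not a letter or digit), and `tokenize`, `ngram_counts`, and the sample tweets are hypothetical names and data, not the study's pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(texts, n):
    """Count n-grams (as tuples of tokens) across a collection of texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Hypothetical tweets for illustration only.
tweets = ["Stay home, stay safe", "Please stay home", "COVID-19 new cases today"]
unigrams = ngram_counts(tweets, 1)
bigrams = ngram_counts(tweets, 2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Because this tokeniser splits on the hyphen, "COVID-19" is counted as the bigram ("covid", "19"); a tokeniser that kept hyphenated terms whole would count it as a unigram instead.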

Twitter Discussions and Emotions About the COVID-19 Pandemic: Machine Learning Approach

Journal of Medical Internet Research ◽

10.2196/20550 ◽

2020 ◽

Vol 22 (11) ◽

pp. e20550

Author(s):

Jia Xue ◽

Junxiang Chen ◽

Ran Hu ◽

Chen Chen ◽

Chengda Zheng ◽

...

Keyword(s):

Public Health ◽

Machine Learning ◽

Latent Dirichlet Allocation ◽

The United States ◽

Response Monitoring ◽

Learning Approach ◽

Learning Approaches ◽

Public Response ◽

The Public ◽

Machine Learning Approach

Background It is important to measure the public response to the COVID-19 pandemic. Twitter is an important data source for infodemiology studies involving public response monitoring. Objective The objective of this study is to examine COVID-19–related discussions, concerns, and sentiments using tweets posted by Twitter users. Methods We analyzed 4 million Twitter messages related to the COVID-19 pandemic using a list of 20 hashtags (eg, “coronavirus,” “COVID-19,” “quarantine”) from March 7 to April 21, 2020. We used a machine learning approach, Latent Dirichlet Allocation (LDA), to identify popular unigrams and bigrams, salient topics and themes, and sentiments in the collected tweets. Results Popular unigrams included “virus,” “lockdown,” and “quarantine.” Popular bigrams included “COVID-19,” “stay home,” “corona virus,” “social distancing,” and “new cases.” We identified 13 discussion topics and categorized them into 5 different themes: (1) public health measures to slow the spread of COVID-19, (2) social stigma associated with COVID-19, (3) COVID-19 news, cases, and deaths, (4) COVID-19 in the United States, and (5) COVID-19 in the rest of the world. Across all identified topics, the dominant sentiments for the spread of COVID-19 were anticipation that measures can be taken, followed by mixed feelings of trust, anger, and fear related to different topics. The public tweets revealed a significant feeling of fear when people discussed new COVID-19 cases and deaths compared to other topics. Conclusions This study showed that Twitter data and machine learning approaches can be leveraged for an infodemiology study, enabling research into evolving public discussions and sentiments during the COVID-19 pandemic. As the situation rapidly evolves, several topics are consistently dominant on Twitter, such as confirmed cases and death rates, preventive measures, health authorities and government policies, COVID-19 stigma, and negative psychological reactions (eg, fear). 
Real-time monitoring and assessment of Twitter discussions and concerns could provide useful data for public health emergency responses and planning. Pandemic-related fear, stigma, and mental health concerns are already evident and may continue to influence public trust when a second wave of COVID-19 occurs or there is a new surge of the current pandemic.

Assessing the Heterogeneity of Complaints Related to Tinnitus and Hyperacusis from an Unsupervised Machine Learning Approach: An Exploratory Study

Audiology and Neurotology ◽

10.1159/000504741 ◽

2020 ◽

Vol 25 (4) ◽

pp. 174-189 ◽

Cited By ~ 1

Author(s):

Guillaume Palacios ◽

Arnaud Noreña ◽

Alain Londero

Keyword(s):

Machine Learning ◽

Statistical Analysis ◽

Language Processing ◽

Exploratory Study ◽

Latent Dirichlet Allocation ◽

Suicide Attempts ◽

Real Life ◽

Supervised Machine Learning ◽

Learning Approach ◽

Machine Learning Approach

Introduction: Subjective tinnitus (ST) and hyperacusis (HA) are common auditory symptoms that may become incapacitating in a subgroup of patients, who then seek medical advice. Both conditions can result from many different mechanisms, and as a consequence, patients may report a vast repertoire of associated symptoms and comorbidities that can dramatically reduce quality of life and even lead to suicide attempts in the most severe cases. The present exploratory study aims to investigate patients’ symptoms and complaints through an in-depth statistical analysis of patients’ natural narratives in a real-life environment in which, thanks to the anonymization of contributions and the peer-to-peer interaction, the wording is assumed to be free of self-limitation and self-censorship. Methods: We applied a purely statistical, unsupervised machine learning approach to the analysis of patients’ verbatim posts exchanged on an Internet forum. After automated data extraction, the dataset was preprocessed to make it suitable for statistical analysis. We used a variant of the Latent Dirichlet Allocation (LDA) algorithm to reveal clusters of symptoms and complaints of HA patients (topics). Each topic is uniquely characterized by its probability distribution over words. The log-likelihood of the LDA model converged after 2,000 iterations. Several statistical parameters were tested for topic modeling and for the word relevance factor within each topic. Results: Despite a rather small dataset, this exploratory study demonstrates that patients’ free-text posts available on the Internet constitute valuable material for machine learning and statistical analysis aimed at categorizing ST/HA complaints. The LDA model with K = 15 topics seems to be the most relevant in terms of relative weights and correlations, and it is able to single out subgroups of patients displaying specific characteristics. 
The study of the relevance factor may be useful to unveil weak but important signals that are present in patients’ narratives. Discussion/Conclusion: We claim that the unsupervised LDA approach makes it possible to gain knowledge of the patterns of ST- and HA-related complaints and of patient-centered domains of interest. The merits and limitations of the LDA algorithms are compared with other natural language processing methods and with more conventional methods of qualitative analysis of patients’ output. Future directions and research topics emerging from this innovative algorithmic analysis are proposed.
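
The abstract does not define its word relevance factor. One widely used definition, from the LDAvis tool of Sievert and Shirley, blends a word's probability within a topic with its lift over the corpus-wide frequency: relevance = λ·log p(w|topic) + (1 − λ)·log(p(w|topic) / p(w)). The sketch below assumes that definition; the function names, the default λ, and the toy probabilities are illustrative only.

```python
import math


def relevance(p_word_given_topic, p_word, lam=0.6):
    """LDAvis-style relevance of a word to a topic (assumed definition)."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))


def rank_terms(topic_dist, corpus_dist, lam=0.6, n=5):
    """Rank the words of one topic by relevance, highest first."""
    scored = {
        w: relevance(p, corpus_dist[w], lam)
        for w, p in topic_dist.items()
        if p > 0
    }
    return sorted(scored, key=scored.get, reverse=True)[:n]


# Hypothetical probabilities, not taken from the study.
topic = {"noise": 0.50, "ear": 0.40, "misophonia": 0.10}
corpus = {"noise": 0.50, "ear": 0.45, "misophonia": 0.05}
print(rank_terms(topic, corpus, lam=1.0))  # ranks by in-topic probability only
print(rank_terms(topic, corpus, lam=0.0))  # ranks by lift only
```

Lower values of λ favour words that are rare overall but concentrated in one topic, which is one way such a factor could surface weak but important signals.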

Digital Data Forgetting: A Machine Learning Approach

2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT) ◽

10.1109/ismsit.2018.8567046 ◽

2018 ◽

Cited By ~ 2

Author(s):

Melike Gunay ◽

Eyyup Yildiz ◽

Yagiz Nalcakan ◽

Batuhan Asiroglu ◽

Ahmet Zencirli ◽

...

Keyword(s):

Machine Learning ◽

Digital Data ◽

Learning Approach ◽

Machine Learning Approach

Correction: Real-Time Forecasting of the COVID-19 Outbreak in Chinese Provinces: Machine Learning Approach Using Novel Digital Data and Estimates From Mechanistic Models

Journal of Medical Internet Research ◽

10.2196/23996 ◽

2020 ◽

Vol 22 (9) ◽

pp. e23996

Author(s):

Dianbo Liu ◽

Leonardo Clemente ◽

Canelle Poirier ◽

Xiyu Ding ◽

Matteo Chinazzi ◽

...

Keyword(s):

Machine Learning ◽

Real Time ◽

Digital Data ◽

Learning Approach ◽

Mechanistic Models ◽

Machine Learning Approach

Discovering Crash Severity Factors of Grade Crossing With a Machine Learning Approach

2019 Joint Rail Conference ◽

10.1115/jrc2019-1231 ◽

2019 ◽

Cited By ~ 1

Author(s):

Dahye Lee ◽

Jeffery Warner ◽

Curtis Morgan

Keyword(s):

Machine Learning ◽

The United States ◽

Machine Learning Algorithms ◽

Learning Approach ◽

Highway Traffic ◽

Extreme Gradient Boosting ◽

Machine Learning Approach ◽

Grade Crossing ◽

Grade Crossings ◽

Significant Factors

According to the Federal Railroad Administration (FRA) Highway-Rail Grade Crossing Accident/Incident database, more than 12,000 accidents occurred in the United States between 2012 and 2017, with around 3,900 casualties. Despite repeated efforts to fully understand the risk factors that contribute to highway-rail grade crossing collisions, many uncertainties remain. This paper proposes a machine learning approach to identify significant factors, along with their individual impacts on crash severity at grade crossings. One of the most efficient and accurate machine learning algorithms, extreme gradient boosting (XGB or XGBoost), is applied to analyze 21 different accident- and crossing-related characteristics with respect to driver injury severity. The XGB model has been shown in previous studies across many areas of transportation research to outperform other machine learning-based methods and statistical classification methods, such as the multinomial logit model, multiple additive regression trees, decision trees, and random forests, especially in prediction accuracy. Applying the algorithm is therefore expected to provide highly reliable results for identifying important factors that affect injury severity at grade crossings. Such an application will further aid the discovery of potential crossings with significant factors. The FRA’s Highway-Rail Grade Crossing Accident/Incident database from 2012 to 2017 is fused with the FRA Highway-Rail Crossing Inventory database for the analysis. Observations with missing information were removed from the original database. Crossings positioned under or over the railroad, and pedestrians or other types of highway users, were also excluded because they were not specifically of interest in this study. After the database cleaning process, 1,250 accidents remained out of the 12,630 retrieved from the combined database. 
The results show that adjacent highway traffic volume and train speed are the most significant factors in accidents and injury severity, followed by the driver’s age and the estimated vehicle speed. The results also indicate that truck-involved accidents, crossings with gates, flashing lights, and other types of warning devices combined, and male highway users are associated with a higher injury rate. This study can provide guidance to decision-makers in recognizing possible risks at grade crossings that may cause driver casualties.

A Machine Learning Approach to Customer Needs Analysis for Product Ecosystems

Journal of Mechanical Design ◽

10.1115/1.4044435 ◽

2019 ◽

Vol 142 (1) ◽

Author(s):

Feng Zhou ◽

Jackie Ayoub ◽

Qianli Xu ◽

X. Jessie Yang

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Latent Dirichlet Allocation ◽

Marketing Research ◽

Needs Analysis ◽

Learning Approach ◽

Product Reviews ◽

Kano Model ◽

Customer Needs ◽

Machine Learning Approach

Creating product ecosystems has been one of the strategic ways to enhance user experience and business advantages. Customer needs analysis is one of the most challenging tasks in creating a successful product ecosystem, from the perspectives of both marketing research and product development. In this paper, we propose a machine learning approach to customer needs analysis for product ecosystems that examines a large number of online user-generated product reviews within a product ecosystem. First, we separated uninformative reviews from informative ones using a fastText technique. Then, we extracted a variety of topics related to customer needs using a topic modeling technique, latent Dirichlet allocation. In addition, we applied a rule-based sentiment analysis method to predict not only the sentiment of the reviews but also their sentiment intensity values. Finally, we categorized the customer needs related to the extracted topics using an analytic Kano model based on the dissatisfaction-satisfaction pair from the sentiment analysis. A case example of the Amazon product ecosystem was used to illustrate the potential and feasibility of the proposed method.

Identifying predictors of cancer prevalence at the neighborhood level in the United States: A Bayesian machine learning approach

ISEE Conference Abstracts ◽

10.1289/isee.2021.p-284 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Li Niu ◽

Liangyuan Hu ◽

Yan Li ◽

Bian Liu

Keyword(s):

United States ◽

Machine Learning ◽

The United States ◽

Learning Approach ◽

Cancer Prevalence ◽

Machine Learning Approach ◽

Bayesian Machine Learning ◽

Neighborhood Level

Short term Forecast of the COVID-19 Epidemic in top 15 Affected Countries in the World using ARIMA Model with Machine Learning Approach (Preprint)

10.2196/preprints.19711 ◽

2020 ◽

Author(s):

Pavan Kumar ◽

Ranjit Sah ◽

Alfonso J. Rodriguez-Morales ◽

Himangshu Kalita Jr ◽

Akshaya Srikanth Bhagavathula ◽

...

Keyword(s):

Machine Learning ◽

Moving Average ◽

Arima Model ◽

The United States ◽

Learning Approach ◽

Average Model ◽

Machine Learning Approach ◽

The World ◽

Moving Average Model ◽

Auto Regressive

BACKGROUND The COVID-19 pandemic, first recognized in China in December 2019, has reached more than 200 countries and had affected more than 28 lakh (2.8 million) people as of April 26, 2020 (data source: Johns Hopkins Coronavirus Resource Center). OBJECTIVE We predicted trajectories of COVID-19 for the coming days (until July 2, 2020) using the Auto-Regressive Integrated Moving Average (ARIMA) model. METHODS We used the ARIMA model. Mathematical approaches are widely used to infer critical epidemiological transitions and parameters of COVID-19. Methods such as epidemic curve fitting, surveillance data during early transmission (R0), and other epidemic models are frequently applied to generate forecasts of the COVID-19 pandemic across the world. RESULTS Our analysis predicted very alarming outcomes, with conditions worsening in Iran and across Europe, especially in Italy, Spain, and France. South Korea has stabilized after its initial surge, as has China, the country where COVID-19 originated, which shows more recovered cases and is predicted to remain stable. The United States of America (USA) comes as a surprise and is predicted to become the epicenter for new cases during mid-April 2020. CONCLUSIONS Based on our predictions, public health officials should tailor aggressive interventions to address the exponential growth, and rapid infection control measures at the hospital level are urgently needed to curtail the COVID-19 pandemic. This study analyzed data at the global level, extracted using a machine learning approach with artificial intelligence techniques, for the top 10% or 20 countries.
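
As a rough illustration of how an ARIMA forecast is produced, the sketch below fits the simplest non-trivial case, ARIMA(1,1,0): the series is differenced once and an AR(1) model with intercept is fitted to the differences by least squares. The abstract does not state the model order or the estimation method, so the order, the least-squares fit (statistical packages typically use maximum likelihood), the function names, and the toy series are all assumptions.

```python
def fit_arima_110(series):
    """Fit ARIMA(1,1,0) by least squares; return (intercept, AR coefficient)."""
    diffs = [b - a for a, b in zip(series, series[1:])]   # first differences
    x, y = diffs[:-1], diffs[1:]                           # regress d[t] on d[t-1]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx if sxx else 0.0                        # constant differences: no AR term
    intercept = mean_y - phi * mean_x
    return intercept, phi


def forecast(series, steps):
    """Forecast `steps` future values by iterating the fitted model."""
    intercept, phi = fit_arima_110(series)
    level = series[-1]
    last_diff = series[-1] - series[-2]
    predictions = []
    for _ in range(steps):
        last_diff = intercept + phi * last_diff   # predict the next difference
        level += last_diff                        # integrate back to the original scale
        predictions.append(level)
    return predictions


# Hypothetical cumulative case counts, not real surveillance data.
cases = [100, 130, 170, 220, 280, 350, 430]
print(forecast(cases, steps=3))
```

The series needs at least four observations so that there are two pairs of consecutive differences to regress on; a practical forecast would also select the order (p, d, q) by an information criterion and report prediction intervals.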

Content-based features predict social media influence operations

Science Advances ◽

10.1126/sciadv.abb5824 ◽

2020 ◽

Vol 6 (30) ◽

pp. eabb5824 ◽

Cited By ~ 1

Author(s):

Meysam Alizadeh ◽

Jacob N. Shapiro ◽

Cody Buntain ◽

Joshua A. Tucker

Keyword(s):

United States ◽

Machine Learning ◽

Social Media ◽

The United States ◽

User Generated Content ◽

Learning Approach ◽

Monthly Basis ◽

Twitter Data ◽

Random Samples ◽

Machine Learning Approach

We study how easy it is to distinguish influence operations from organic social media activity by assessing the performance of a platform-agnostic machine learning approach. Our method uses public activity to detect content that is part of coordinated influence operations based on human-interpretable features derived solely from content. We test this method on publicly available Twitter data on Chinese, Russian, and Venezuelan troll activity targeting the United States, as well as the Reddit dataset of Russian influence efforts. To assess how well content-based features distinguish these influence operations from random samples of general and political American users, we train and test classifiers on a monthly basis for each campaign across five prediction tasks. Content-based features perform well across period, country, platform, and prediction task. Industrialized production of influence campaign content leaves a distinctive signal in user-generated content that allows tracking of campaigns from month to month and across different accounts.
