Deep learning models have significantly advanced various natural
language processing tasks. However, they are strikingly vulnerable
to adversarial text attacks, even in the black-box setting, where attackers
have no access to the model's internals. Such attacks typically follow a two-phase
framework: 1) a sensitivity estimation phase that measures how strongly each input
element influences the target model's prediction, and 2) a perturbation execution
phase that crafts adversarial examples guided by the estimated sensitivities.
This study explored the connections between local
post-hoc explainable methods for deep learning and black-box adversarial text
attacks and proposed a novel eXplanation-based method for crafting
Adversarial Text Attacks (XATA). XATA leverages local post-hoc explainable
methods (e.g., LIME or SHAP) to measure the sensitivity of input elements and adopts a word-replacement perturbation strategy to
craft adversarial examples. We evaluated the attack performance of the proposed
XATA on three commonly used text-based datasets: IMDB Movie Review, Yelp Reviews-Polarity,
and Amazon Reviews-Polarity. XATA outperformed existing baselines across a range of
target models, including LSTM, GRU, CNN, and BERT. Moreover, we found
that improved local post-hoc explainable methods (e.g., SHAP) lead to more
effective adversarial attacks. These findings showed that as researchers continue to
advance the explainability of deep learning models with local post-hoc
methods, they also provide attackers with sharper tools for crafting more targeted and dangerous adversarial attacks.
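The two-phase framework described above can be sketched in a few lines. Everything here is a hypothetical stand-in: the toy classifier, the leave-one-out sensitivity proxy (used in place of a real LIME/SHAP attribution), and the synonym table are illustrative only, not the paper's implementation.

```python
# Minimal sketch of the two-phase black-box attack framework.
# All names and heuristics here are hypothetical stand-ins; XATA itself
# estimates sensitivity with local post-hoc explainers such as LIME or SHAP.
import math

def toy_sentiment_model(words):
    """Hypothetical black-box target: returns P(positive) for a word list."""
    positive = {"great", "good", "excellent", "love"}
    negative = {"bad", "terrible", "awful", "hate"}
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return 1.0 / (1.0 + math.exp(-score))

def estimate_sensitivity(words, predict):
    """Phase 1: score each word by the prediction drop when it is removed
    (a simple occlusion proxy for a LIME/SHAP attribution)."""
    base = predict(words)
    return [base - predict(words[:i] + words[i + 1:]) for i in range(len(words))]

# Hypothetical replacement table; real attacks draw substitutes from
# embeddings or a thesaurus under semantic-similarity constraints.
SYNONYMS = {"great": "fine", "love": "like", "excellent": "decent"}

def craft_adversarial(words, predict, budget=2):
    """Phase 2: replace the most sensitive words first, within a budget."""
    sensitivity = estimate_sensitivity(words, predict)
    order = sorted(range(len(words)), key=lambda i: sensitivity[i], reverse=True)
    adversarial = list(words)
    for i in order[:budget]:
        adversarial[i] = SYNONYMS.get(adversarial[i], adversarial[i])
    return adversarial

original = "i love this great movie".split()
adversarial = craft_adversarial(original, toy_sentiment_model)
# The positive-sentiment score drops after the targeted replacements.
```

Swapping the occlusion proxy in `estimate_sensitivity` for a SHAP attribution changes only phase 1, which mirrors the paper's observation that better explainers yield stronger attacks.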