Local Post-hoc Explainable Methods for Adversarial Text Attacks
Deep learning models have significantly advanced natural language processing tasks; however, they remain strikingly vulnerable to adversarial text attacks, even in the black-box setting where attackers have no access to model internals. Such attacks typically follow a two-phase framework: 1) a sensitivity estimation phase that evaluates each input element's influence on the target model's prediction, and 2) a perturbation execution phase that crafts adversarial examples based on the estimated sensitivities. This study explored the connection between local post-hoc explainable methods for deep learning and black-box adversarial text attacks and proposed a novel eXplanation-based method for crafting Adversarial Text Attacks (XATA). XATA leverages local post-hoc explainable methods (e.g., LIME or SHAP) to measure the sensitivity of input elements and adopts a word-replacement perturbation strategy to craft adversarial examples. We evaluated the attack performance of XATA on three commonly used text datasets: IMDB Movie Review, Yelp Reviews-Polarity, and Amazon Reviews-Polarity. XATA outperformed existing baselines against various target models, including LSTM, GRU, CNN, and BERT. Moreover, we found that stronger local post-hoc explainable methods (e.g., SHAP) yield more effective adversarial attacks. These findings show that as researchers advance the explainability of deep learning models with local post-hoc methods, they also hand attackers weapons for crafting more targeted and dangerous adversarial attacks.
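The two-phase framework described above can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the keyword-based "black box", the synonym table, and the mask-sampling attribution (a crude stand-in for LIME/SHAP, which fit a local surrogate model rather than averaging raw masks) are all illustrative assumptions.

```python
import random

# Hypothetical toy target model standing in for a black-box classifier:
# returns P(positive) from simple keyword counts.
POS = {"great", "excellent", "love"}
NEG = {"terrible", "awful", "boring"}

def black_box(words):
    score = sum(w in POS for w in words) - sum(w in NEG for w in words)
    return 1.0 if score > 0 else 0.0

def estimate_sensitivity(words, model, n_samples=400, seed=0):
    """Phase 1 (sensitivity estimation): sample random word masks and
    score each word by the average change in the model's output when the
    word is present vs. masked out -- a crude sampling stand-in for the
    LIME/SHAP attributions the paper uses."""
    rng = random.Random(seed)
    present = [[0.0, 0] for _ in words]  # (output sum, count) with word kept
    absent = [[0.0, 0] for _ in words]   # (output sum, count) with word masked
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in words]
        y = model([w for w, keep in zip(words, mask) if keep])
        for i, keep in enumerate(mask):
            bucket = present[i] if keep else absent[i]
            bucket[0] += y
            bucket[1] += 1
    return [
        present[i][0] / max(present[i][1], 1)
        - absent[i][0] / max(absent[i][1], 1)
        for i in range(len(words))
    ]

def attack(sentence, model, synonyms):
    """Phase 2 (perturbation execution): replace the most sensitive words
    with synonyms until the model's prediction flips."""
    words = sentence.split()
    original = model(words)
    scores = estimate_sensitivity(words, model)
    for i in sorted(range(len(words)), key=lambda i: -abs(scores[i])):
        if words[i] in synonyms:
            words[i] = synonyms[words[i]]
            if model(words) != original:
                return " ".join(words)  # successful adversarial example
    return None  # attack failed under this synonym budget
```

For example, with the hypothetical synonym table `{"great": "fine"}`, attacking "the plot was great" identifies "great" as the most sensitive word and swaps it, flipping the toy model's prediction while leaving the rest of the sentence intact.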