Applying NLP techniques to malware detection in a practical environment

Author(s):  
Mamoru Mimura ◽  
Ryo Ito

Abstract
Executable files remain a popular means of compromising endpoint computers. These executables are often obfuscated to evade anti-virus programs, and dynamic analysis of every suspicious file from the Internet takes too much time, so a fast filtering method is required. With recent advances in natural language processing (NLP), printable strings have become more effective features for detecting malware, and the combination of printable strings and NLP techniques can serve as such a filter. In this paper, we apply NLP techniques to malware detection and show that printable strings with NLP techniques are effective for detecting malware in a practical environment. Our dataset consists of more than 500,000 samples obtained from multiple sources. Experimental results demonstrate that our method is effective not only against subspecies of existing malware but also against new malware, and that it remains effective against packed malware and anti-debugging techniques.
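
A minimal sketch of this kind of string-based filter, assuming TF-IDF features and a random-forest classifier (the paper's actual feature extraction and model may differ); the file paths are placeholders:

```python
# Sketch: filter executables by classifying their printable strings.
# The string-extraction regex, TF-IDF features, and random-forest classifier
# are illustrative assumptions, not the exact pipeline used in the paper.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

def printable_strings(path, min_len=4):
    """Extract runs of printable ASCII characters from a binary file."""
    with open(path, "rb") as f:
        data = f.read()
    return b" ".join(re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)).decode("ascii")

# Placeholder paths and labels; in practice these come from a labeled corpus
# of benign (0) and malicious (1) executables.
train_paths = ["benign_sample.exe", "malicious_sample.exe"]
train_labels = [0, 1]

docs = [printable_strings(p) for p in train_paths]
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(docs)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, train_labels)

# Score an unseen executable; high-scoring files are forwarded to dynamic analysis.
score = clf.predict_proba(vectorizer.transform([printable_strings("sample.exe")]))[0, 1]
print(score)
```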

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Lisa Grossman Liu ◽  
Raymond H. Grossman ◽  
Elliot G. Mitchell ◽  
Chunhua Weng ◽  
Karthik Natarajan ◽  
...  

Abstract
The recognition, disambiguation, and expansion of medical abbreviations and acronyms is of utmost importance for preventing medically dangerous misinterpretation in natural language processing. To support recognition, disambiguation, and expansion, we present the Medical Abbreviation and Acronym Meta-Inventory, a deep database of medical abbreviations. A systematic harmonization of eight source inventories across multiple healthcare specialties and settings identified 104,057 abbreviations with 170,426 corresponding senses. Automated cross-mapping of synonymous records using state-of-the-art machine learning reduced redundancy, which simplifies future application. Additional features include semi-automated quality control to remove errors. The Meta-Inventory demonstrated high completeness and coverage of abbreviations and senses in new clinical text, a substantial improvement over the next largest repository (6–14% increase in abbreviation coverage; 28–52% increase in sense coverage). To our knowledge, the Meta-Inventory is the most complete compilation of medical abbreviations and acronyms in American English to date. Its multiple sources and high coverage support application in varied specialties and settings, enabling cross-institutional natural language processing, which previous inventories did not support. The Meta-Inventory is available at https://bit.ly/github-clinical-abbreviations.
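
As a rough illustration of how such an inventory might be consumed, the sketch below loads abbreviation-sense pairs from a CSV file and looks up candidate expansions; the file name and column names are assumptions rather than the repository's actual schema:

```python
# Sketch: look up candidate senses for an abbreviation in an inventory file.
# "meta_inventory.csv", "SF" (short form), and "LF" (long form) are assumed
# names; consult the repository's documentation for the real schema.
import csv
from collections import defaultdict

senses = defaultdict(list)
with open("meta_inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        senses[row["SF"].lower()].append(row["LF"])

# Candidate expansions for an ambiguous abbreviation found in clinical text;
# a downstream disambiguation model would pick the correct sense from context.
print(senses["ra"])   # e.g. ['rheumatoid arthritis', 'right atrium', ...]
```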


Designs ◽  
2021 ◽  
Vol 5 (3) ◽  
pp. 42
Author(s):  
Eric Lazarski ◽  
Mahmood Al-Khassaweneh ◽  
Cynthia Howard

In recent years, disinformation and “fake news” have been spreading throughout the internet at rates never seen before. This has created the need for fact-checking organizations, groups that seek out claims and assess their veracity, which have sprung up worldwide to stem the tide of misinformation. However, even with the many human-powered fact-checking organizations currently in operation, disinformation continues to run rampant throughout the Web, and existing organizations are unable to keep up. This paper discusses in detail recent advances in using natural language processing to automate fact checking. It follows the entire automated fact-checking process, from detecting claims to checking them to outputting results. In summary, automated fact checking works well in some cases, though generalized fact checking still needs improvement before widespread use.
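
The claim-detection stage of such a pipeline can be illustrated with a small sketch; the toy training sentences and logistic-regression model are assumptions, not a system from the surveyed literature:

```python
# Sketch of the claim-detection stage of an automated fact-checking pipeline:
# classify sentences as check-worthy claims or not. The tiny inline training
# set and the logistic-regression model are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Unemployment fell to 3.5 percent last year.",   # check-worthy claim
    "I think the weather is lovely today.",          # not a claim
    "The vaccine was tested on 40,000 volunteers.",  # check-worthy claim
    "Thanks everyone for coming to the event.",      # not a claim
]
labels = [1, 0, 1, 0]

claim_detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
claim_detector.fit(sentences, labels)

# Detected claims would then be passed on to evidence retrieval and verification.
print(claim_detector.predict(["The city spent 2 million dollars on the bridge."]))
```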


Author(s):  
Fredrik Johansson ◽  
Lisa Kaati ◽  
Magnus Sahlgren

The ability to disseminate information instantaneously over vast geographical regions makes the Internet a key facilitator in the radicalisation process and preparations for terrorist attacks. This can be both an asset and a challenge for security agencies. One of the main challenges for security agencies is the sheer amount of information available on the Internet. It is impossible for human analysts to read through everything that is written online. In this chapter we will discuss the possibility of detecting violent extremism by identifying signs of warning behaviours in written text – what we call linguistic markers – using computers, or more specifically, natural language processing.
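
A very rough sketch of marker-based flagging, with invented marker phrases standing in for the linguistic markers discussed in the chapter; real systems rely on far richer lexical, syntactic, and semantic features:

```python
# Sketch: flag texts that contain simple linguistic markers of warning
# behaviours. The marker categories and phrases below are invented placeholders.
markers = {
    "identification": ["we must", "our duty", "true believers"],
    "fixation": ["again and again", "cannot stop thinking"],
    "leakage": ["they will pay", "you will see what happens"],
}

def marker_hits(text):
    """Return which marker categories occur in a text."""
    text = text.lower()
    return {cat: [p for p in phrases if p in text] for cat, phrases in markers.items()}

post = "They will pay for what they did. It is our duty to act."
print({cat: hits for cat, hits in marker_hits(post).items() if hits})
```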


2021 ◽  
Vol 1 (2) ◽  
pp. 18-22
Author(s):  
Strahil Sokolov ◽  
Stanislava Georgieva

This paper presents a new approach to the processing and categorization of text from patient documents in the Bulgarian language using Natural Language Processing and Edge AI. The proposed algorithm contains several phases: personal data anonymization, pre-processing and conversion of text to vectors, model training, and recognition. The experimental results, in terms of achieved accuracy, are comparable with modern approaches.
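
A minimal sketch of such a pipeline, with assumed regular-expression anonymization rules and an assumed TF-IDF plus linear-SVM model rather than the authors' implementation:

```python
# Sketch of the described phases: anonymize personal data, convert text to
# vectors, and train a categorization model. Patterns, documents, and the
# classifier are illustrative assumptions.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def anonymize(text):
    """Replace simple identifiers (ID numbers, dates) with placeholder tokens."""
    text = re.sub(r"\b\d{10}\b", "<ID>", text)                 # national ID number
    text = re.sub(r"\b\d{2}\.\d{2}\.\d{4}\b", "<DATE>", text)  # dd.mm.yyyy dates
    return text

documents = ["Пациентът е приет на 01.02.2021 с диагноза пневмония.",
             "Контролен преглед, без оплаквания."]
categories = ["admission", "follow-up"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit([anonymize(d) for d in documents], categories)
print(model.predict([anonymize("Пациентът е приет с висока температура.")]))
```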


Author(s):  
Rafael Jiménez ◽  
Vicente García ◽  
Abraham López ◽  
Alejandra Mendoza Carreón ◽  
Alan Ponce

The Autonomous University of Ciudad Juárez (UACJ) performs an instructor evaluation each semester to find strengths, weaknesses, and areas of opportunity in the teaching process. In this chapter, the authors show how opinion mining can be useful for labeling student comments as positive or negative. For this purpose, a database was created using real opinions obtained from five professors of the UACJ over the last four years, covering a total of 20 subjects. Natural language processing techniques were used on the database to normalize its data. Experimental results using 1-NN and Bagging classifiers show that it is possible to automatically label positive and negative comments with an accuracy of 80.13%.
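
A small sketch of this labeling setup with the two classifiers named in the abstract (1-NN and Bagging); the example comments are invented placeholders, not the UACJ dataset:

```python
# Sketch: label student comments as positive or negative with 1-NN and Bagging.
# The Spanish comments below are invented placeholders for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline

comments = ["Excelente profesor, explica muy bien.",
            "No responde las dudas de los alumnos.",
            "Las clases son claras y ordenadas.",
            "Llega tarde y no sigue el programa."]
labels = ["positive", "negative", "positive", "negative"]

for name, clf in [("1-NN", KNeighborsClassifier(n_neighbors=1)),
                  ("Bagging", BaggingClassifier(n_estimators=10))]:
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(comments, labels)
    print(name, model.predict(["Muy buen maestro, siempre puntual."]))
```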


2021 ◽  
pp. 1-13
Author(s):  
Deguang Chen ◽  
Ziping Ma ◽  
Lin Wei ◽  
Yanbin Zhu ◽  
Jinlin Ma ◽  
...  

Text-based reading comprehension models have great research significance and market value and are one of the main directions of natural language processing. Reading comprehension models that produce single-span answers have recently attracted more attention and achieved significant results. In contrast, multi-span answer models for reading comprehension have been less investigated and their performance needs improvement. To address this issue, in this paper we propose a text-based multi-span network for reading comprehension, ALBERT_SBoundary, and build a multi-span answer corpus, MultiSpan_NMU. We also conduct extensive experiments on the public multi-span corpus, MultiSpan_DROP, and on our corpus, MultiSpan_NMU, and compare the proposed method with the state of the art. The experimental results show that our method achieves F1 scores of 84.10 and 92.88 on the MultiSpan_DROP and MultiSpan_NMU datasets, respectively, while also having fewer parameters and a shorter training time.
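
As a rough illustration of the multi-span setting, the sketch below decodes several non-overlapping answer spans from per-token start and end scores; the greedy decoding heuristic is an assumption for illustration, not the ALBERT_SBoundary decoder:

```python
# Sketch: decode multiple answer spans from per-token boundary scores.
# A boundary model (e.g., an ALBERT encoder with start/end heads) would
# produce these scores; the greedy non-overlapping decoding below is an
# illustrative assumption, not the authors' exact algorithm.
def decode_spans(start_scores, end_scores, threshold=0.5, max_len=10):
    """Greedily pick non-overlapping (start, end) spans scoring above a threshold."""
    candidates = []
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = (s_score + end_scores[e]) / 2
            if score >= threshold:
                candidates.append((score, s, e))
    candidates.sort(reverse=True)
    chosen, used = [], set()
    for score, s, e in candidates:
        if not any(i in used for i in range(s, e + 1)):
            chosen.append((s, e))
            used.update(range(s, e + 1))
    return sorted(chosen)

# Token positions 2-3 and 7 stand out, so two answer spans are returned.
start = [0.1, 0.1, 0.9, 0.2, 0.1, 0.1, 0.1, 0.8, 0.1]
end   = [0.1, 0.1, 0.2, 0.9, 0.1, 0.1, 0.1, 0.8, 0.1]
print(decode_spans(start, end))   # [(2, 3), (7, 7)]
```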


Author(s):  
Jalal S. Alowibdi ◽  
Abdulrahman A. Alshdadi ◽  
Ali Daud ◽  
Mohamed M. Dessouky ◽  
Essa Ali Alhazmi

People are afraid of COVID-19 and are actively talking about it on social media platforms such as Twitter, openly expressing their emotions in their tweets. It is very important to perform sentiment analysis on these tweets to find COVID-19's impact on people's lives. Natural language processing, text processing, computational linguistics, and biometrics are applied to perform sentiment analysis and to identify and extract the emotions. In this work, sentiment analysis is carried out on a large Twitter dataset of English tweets, and ten emotional themes are investigated. Experimental results show that COVID-19 has spread fear/anxiety, gratitude, happiness and hope, and other mixed emotions among people for different reasons. Specifically, it is observed that positive news from top officials, such as Trump presenting chloroquine as a cure for COVID-19, suddenly lowered fear in the sentiment, while happiness, gratitude, and hope began to rise. However, once the FDA stated that chloroquine is not an effective cure, fear started to rise again.
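
A minimal sketch of theme-based emotion tagging with keyword lexicons; the theme names and keywords are invented placeholders, and the study's actual method may differ:

```python
# Sketch: tag tweets with emotional themes using small keyword lexicons.
# Theme names and keyword lists are invented placeholders for illustration.
themes = {
    "fear/anxiety": {"afraid", "scared", "anxious", "worried"},
    "gratitude":    {"thank", "grateful", "appreciate"},
    "hope":         {"hope", "hopeful", "optimistic"},
}

def tag_themes(tweet):
    """Return the emotional themes whose keywords appear in a tweet."""
    words = [w.strip(".,!?") for w in tweet.lower().split()]
    return [t for t, kw in themes.items() if any(w in kw for w in words)]

tweets = ["So worried about my parents during this pandemic.",
          "Thank you to all the nurses and doctors, truly grateful!",
          "I hope we get through this together."]
for t in tweets:
    print(tag_themes(t), "-", t)
```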


2016 ◽  
Vol 25 (01) ◽  
pp. 234-239 ◽  
Author(s):  
P. Zweigenbaum ◽  
A. Névéol ◽  

Summary
Objective: To summarize recent research and present a selection of the best papers published in 2015 in the field of clinical Natural Language Processing (NLP).
Method: A systematic review of the literature was performed by the two section editors of the IMIA Yearbook NLP section by searching bibliographic databases with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. The section editors first selected a shortlist of candidate best papers that were then peer-reviewed by independent external reviewers.
Results: The clinical NLP best paper selection shows that clinical NLP is making use of a variety of texts of clinical interest to contribute to the analysis of clinical information and the building of a body of clinical knowledge. The full review process highlighted five papers analyzing patient-authored texts or seeking to connect and aggregate multiple sources of information. They contribute to the development of methods, resources, applications, and sometimes a combination of these aspects.
Conclusions: The field of clinical NLP continues to thrive through the contributions of both NLP researchers and healthcare professionals interested in applying NLP techniques to impact clinical practice. Foundational progress in the field makes it possible to leverage a larger variety of texts of clinical interest for healthcare purposes.


2013 ◽  
Vol 21 (1) ◽  
pp. 113-138 ◽  
Author(s):  
MUHUA ZHU ◽  
JINGBO ZHU ◽  
HUIZHEN WANG

Abstract
Shift-reduce parsing has been studied extensively for diverse grammars due to its simplicity and running efficiency. However, in the field of constituency parsing, shift-reduce parsers lag behind state-of-the-art parsers. In this paper we propose a semi-supervised approach for advancing shift-reduce constituency parsing. First, we apply the uptraining approach (Petrov, S. et al. 2010. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cambridge, MA, USA, pp. 705–713) to improve part-of-speech taggers so that they provide better part-of-speech tags to subsequent shift-reduce parsers. Second, we enhance shift-reduce parsing models with novel features defined on lexical dependency information. Both stages depend on the use of large-scale unlabeled data. Experimental results show that the approach achieves overall improvements of 1.5 percent and 2.1 percent on English and Chinese data, respectively. Moreover, the final parsing accuracies reach 90.9 percent and 82.2 percent, respectively, which are comparable with the accuracy of state-of-the-art parsers.
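
The transition system underlying shift-reduce constituency parsing can be sketched briefly; here a fixed gold action sequence stands in for the learned action classifier that the paper's semi-supervised approach improves:

```python
# Sketch of the shift-reduce transition system for constituency parsing:
# a buffer of words, a stack of partial trees, and SHIFT / UNARY / REDUCE
# actions. In a real parser a trained model scores and selects each action;
# the fixed action sequence below is for illustration only.
def shift_reduce(words, actions):
    """Execute a sequence of parser actions and return the resulting tree."""
    buffer = list(words)
    stack = []
    for act in actions:
        if act == "SHIFT":                      # move the next word onto the stack
            stack.append(buffer.pop(0))
        elif act.startswith("UNARY-"):          # e.g. UNARY-NP: wrap the top item
            stack.append((act[6:], [stack.pop()]))
        elif act.startswith("REDUCE-"):         # e.g. REDUCE-VP: combine the top two items
            right, left = stack.pop(), stack.pop()
            stack.append((act[7:], [left, right]))
    return stack[0]

words = ["She", "reads", "books"]
gold = ["SHIFT", "UNARY-NP", "SHIFT", "SHIFT", "UNARY-NP", "REDUCE-VP", "REDUCE-S"]
print(shift_reduce(words, gold))
# ('S', [('NP', ['She']), ('VP', ['reads', ('NP', ['books'])])])
```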

