Java Bytecode Control Flow Classification: Framework for Guiding Java Decompilation

Author(s):  
Siwadol Sateanpattanakul ◽  
Duangpen Jetpipattanapong ◽  
Seksan Mathulaprangsan

Decompilation is an important process in software engineering, especially when a program's lost source code must be recovered. Although decompiling Java bytecode is easier than decompiling native binary code, many Java decompilers cannot recover the original source correctly, especially for selection statements (i.e., if statements). This deficiency directly degrades decompilation performance. In this paper, we propose a methodology for guiding a Java decompiler to deal with the aforementioned problem. In the framework, Java bytecode is transformed into two kinds of features, called the frame feature and the latent semantic feature. The former is extracted directly from the bytecode. The latter is obtained in two steps: the bytecode is first transformed into bigrams and then into term frequency-inverse document frequency (TF-IDF) values. Both feature sets are then fed to a genetic algorithm to reduce their dimensionality. The proposed feature is obtained by converting the selected TF-IDF values into a latent semantic feature and concatenating it with the selected frame feature. Finally, k-nearest neighbors (KNN) is used to classify the proposed feature. The experimental results show a decompilation accuracy of 93.68 percent, clearly better than that of Java Decompiler.
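As a rough illustration of the pipeline sketched in this abstract, the snippet below builds bigrams from opcode sequences, weights them with TF-IDF, and classifies a query with a small KNN. It is a minimal sketch on invented toy data, not the authors' implementation: the opcode sequences and labels are illustrative assumptions, cosine similarity is one reasonable KNN metric, and the genetic-algorithm feature-selection step is omitted.

```python
import math
from collections import Counter

def bigrams(opcodes):
    """Turn an opcode sequence into bigram tokens, e.g. ["a","b","c"] -> ["a_b","b_c"]."""
    return [f"{a}_{b}" for a, b in zip(opcodes, opcodes[1:])]

def tfidf(docs):
    """TF-IDF vectors (sparse dicts) for a list of token lists.
    Terms occurring in every document get weight 0 (log(n/df) == 0)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: (c / len(d)) * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def knn_predict(train_vecs, labels, query, k=3):
    """Classify a query vector by majority vote among the k most cosine-similar neighbours."""
    def cos(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    ranked = sorted(zip(train_vecs, labels), key=lambda p: cos(query, p[0]), reverse=True)
    return Counter(lbl for _, lbl in ranked[:k]).most_common(1)[0][0]
```

A query sequence sharing `ifeq`-style bigrams with the "if" training sequences lands in the "if" class; in a real setting the training corpus and the number of neighbours would be tuned on held-out bytecode.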

1995 ◽  
Vol 1 (2) ◽  
pp. 163-190 ◽  
Author(s):  
Kenneth W. Church ◽  
William A. Gale

Abstract Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a “bag-of-words” assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ. φ is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The negative binomial is a well-known special case where φ is a Γ (Gamma) distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).
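The burstiness argument can be checked numerically. The sketch below, a minimal illustration on invented toy counts rather than the paper's corpora, contrasts the adaptation probability Pr(x ≥ 2 | x ≥ 1) predicted by a single Poisson with the empirical value for bursty per-document counts, and fits the negative binomial special case by the method of moments.

```python
import math
from statistics import mean, pvariance

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def adaptation_poisson(lam):
    """Pr(x >= 2 | x >= 1) under a single Poisson."""
    p0, p1 = poisson_pmf(0, lam), poisson_pmf(1, lam)
    return (1 - p0 - p1) / (1 - p0)

def adaptation_empirical(counts):
    """Pr(x >= 2 | x >= 1) measured directly from per-document counts."""
    at_least_1 = [c for c in counts if c >= 1]
    return sum(c >= 2 for c in at_least_1) / len(at_least_1)

def neg_binomial_mom(counts):
    """Method-of-moments fit of the negative binomial (a Gamma-mixed
    Poisson): returns (r, p) with mean r(1-p)/p and variance r(1-p)/p^2.
    Requires overdispersion (variance > mean)."""
    m, v = mean(counts), pvariance(counts)
    return m * m / (v - m), m / v
```

For a word that is absent from most documents but bursty where it appears (say counts `[0]*8 + [5, 5]`, mean 1.0), the empirical adaptation is 1.0 while a single Poisson with λ = 1 predicts roughly 0.42; the method-of-moments negative binomial reproduces both the mean and the inflated variance.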


Author(s):  
Deepti Dominic ◽  
Jyothsna R

The publication of research articles across various streams of research has grown explosively, and tracking down an appropriate article in a vast research archive is time consuming. Clustering research articles by their respective domains plays an important role in helping researchers retrieve articles faster. Hence a commonly practiced search mechanism, domain-name search, has been applied to retrieve appropriate documents and articles. When new domains of documents are added to the repository, it is necessary to spot keywords and boost the corresponding domains for proper classification. Classification techniques, namely a random forest classifier, SVM and TF-IDF, have been used to classify articles and compare their processing times. TF-IDF (term frequency-inverse document frequency) has further been proposed to transform the corpus into a vector space model. Clustering algorithms, namely K-means and hierarchical clustering, have been used to cluster the articles. The results show that SVM has a better processing time than the random forest classifier and TF-IDF, and that K-means gives a better understanding than the hierarchical algorithm.
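The K-means clustering step described above can be sketched in a few lines. This is a minimal stdlib version on toy 2-D vectors, not the study's setup: a real run would cluster TF-IDF vectors of the articles and would use k-means++ or random restarts rather than the deterministic "first k points" initialisation assumed here.

```python
def kmeans(points, k, iters=20):
    """Minimal K-means on dense vectors (lists of floats).
    Deterministic init from the first k points for reproducibility."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # update step: move each centroid to its cluster mean
        centroids = [[sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```

On two well-separated blobs the loop converges within a handful of iterations; with document vectors, cosine distance is often substituted for the squared Euclidean distance used here.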


Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Peter Brown ◽  
Aik-Choon Tan ◽  
Mohamed A El-Esawi ◽  
Thomas Liehr ◽  
Oliver Blanck ◽  
...  

Abstract Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that covers a variety of research fields, such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article(s). The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency–Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.
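Of the three baselines named above, Okapi BM25 is the easiest to misremember, so a compact reference sketch may help. This is a textbook BM25 scorer on toy token lists, not the consortium's evaluation code; the example documents and the parameter defaults k1 = 1.5, b = 0.75 are conventional assumptions.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document (a token list) for the query.
    Uses the common smoothed idf = log(1 + (n - df + 0.5) / (df + 0.5))."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # term-frequency saturation, normalised by document length
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Unlike raw TF-IDF, the saturation term caps the benefit of repeating a query word, and the length normalisation keeps long documents from dominating, which is one reason the two baselines recommend partly distinct article sets.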


Author(s):  
Saud Altaf ◽  
Sofia Iqbal ◽  
Muhammad Waseem Soomro

This paper focuses on capturing the meaning of text through Natural Language Understanding (NLU) features in order to detect duplicate reports without supervision. The NLU features are compared with lexical approaches to identify the more suitable classification technique. A transfer-learning approach is used to train feature extraction on the Semantic Textual Similarity (STS) task. All features are evaluated on two datasets, Bosch bug reports and Wikipedia articles. This study aims to structure recent research efforts by comparing NLU concepts for representing the semantics of text and applying them to information retrieval (IR). The main contribution of this paper is a comparative study of semantic similarity measurements. The experimental results show that Term Frequency–Inverse Document Frequency (TF-IDF) features perform well on both datasets given a reasonable vocabulary size, and that a Bidirectional Long Short-Term Memory (BiLSTM) network can learn the structure of a sentence to improve the classification.
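The lexical side of the comparison above typically reduces to a bag-of-words cosine similarity with a duplicate threshold. The sketch below shows that baseline in its simplest form; the threshold value of 0.8 is an illustrative assumption, not a figure from the paper, and the NLU/BiLSTM side is beyond a stdlib snippet.

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words token lists."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(c * b[t] for t, c in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_duplicate(a_tokens, b_tokens, threshold=0.8):
    """Flag a report pair as duplicate when lexical similarity exceeds the threshold."""
    return cosine(a_tokens, b_tokens) >= threshold
```

The weakness this baseline exposes is exactly what motivates semantic features: two bug reports describing the same defect in different words score near zero here, whereas an STS-trained encoder can still place them close together.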


Author(s):  
Mariani Widia Putri ◽  
Achmad Muchayan ◽  
Made Kamisutara

Recommendation systems are currently a trend, as people now rely more heavily on online transactions for various personal reasons. A recommendation system offers an easier and faster way to shop, so users do not need to spend too much time finding the items they want. Competition among businesses has also changed, forcing them to adjust their approach in order to reach potential customers, and therefore a system that supports this is needed. In this study, the authors build a product recommendation system using the Content-Based Filtering method and Term Frequency-Inverse Document Frequency (TF-IDF) from the Information Retrieval (IR) model, in order to obtain results that are efficient and that meet the needs of improving Customer Relationship Management (CRM). The recommendation system is built and applied as a solution to increase customers' brand awareness and to minimize failed transactions caused by a lack of information that can be conveyed directly (offline). The data used consist of 258 product codes, each with eight categories and 33 constituent keywords according to the company's product knowledge. The TF-IDF calculation shows a weight of 13.854 when displaying the first-best product recommendation, with an accuracy of 96.5% in recommending pens.
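The content-based TF-IDF recommendation described in this abstract can be sketched as follows: each product is represented by the TF-IDF vector of its keywords, and products are ranked by cosine similarity to the customer's query. The product keywords below are invented stand-ins for the company's product-knowledge data, so this is an illustrative sketch, not the paper's system.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors (sparse dicts) for a list of keyword lists."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: c / len(d) * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def recommend(query_terms, product_keywords, top=3):
    """Content-based filtering: rank product indices by cosine similarity
    between the query and each product's TF-IDF keyword vector."""
    vecs = tfidf_vectors(product_keywords)
    q = Counter(query_terms)
    nq = math.sqrt(sum(c * c for c in q.values()))
    def cos(v):
        dot = sum(v.get(t, 0.0) * c for t, c in q.items())
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nv * nq) if nv and nq else 0.0
    return sorted(range(len(vecs)), key=lambda i: cos(vecs[i]), reverse=True)[:top]
```

A query such as "pen ink" ranks the blue-ink pen ahead of the gel pen and leaves the notebook last, mirroring how the system surfaces the first-best product for a customer profile.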


Author(s):  
Kranti Vithal Ghag ◽  
Ketan Shah

<span>The bag-of-words approach is popularly used for sentiment analysis. It maps the terms in reviews to term-document vectors and thus disrupts the syntactic structure of the sentences in the reviews; associations among terms and the semantic structure of sentences are also not preserved. This research work focuses on classifying sentiments by considering the syntactic and semantic structure of the sentences in a review. To improve accuracy, sentiment classifiers based on relative frequency, average frequency and term frequency-inverse document frequency were proposed. To handle terms with apostrophes, the preprocessing techniques were extended. To focus on opinionated content, subjectivity extraction was performed at the phrase level. Experiments were performed on the Pang &amp; Lee, Kaggle and UCI datasets. The classifiers were also evaluated on UCI's Product and Restaurant datasets. Sentiment classification accuracy improved from 67.9% for a comparable term-weighting technique, Delta TF-IDF, up to 77.2% for the proposed classifiers. The introduction of the proposed concept-based approach, subjectivity extraction and the extensions to the preprocessing techniques improved the accuracy to 93.9%.</span>
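The Delta TF-IDF baseline mentioned above weights each term by how unevenly it is distributed across the positive and negative training corpora. The sketch below is a minimal, smoothed variant of that idea on invented toy reviews, not the authors' proposed classifiers; the smoothing constant of 0.5 is an assumption to avoid log(0).

```python
import math
from collections import Counter

def delta_idf(pos_docs, neg_docs, smooth=0.5):
    """Per-term weight: log ratio of (smoothed) document frequencies in the
    positive vs negative corpora. Terms common in one class and rare in the
    other get large-magnitude weights; class-neutral terms sit near zero."""
    dfp = Counter(t for d in pos_docs for t in set(d))
    dfn = Counter(t for d in neg_docs for t in set(d))
    return {t: math.log((dfp[t] + smooth) / len(pos_docs))
               - math.log((dfn[t] + smooth) / len(neg_docs))
            for t in set(dfp) | set(dfn)}

def classify(doc, weights):
    """Label a token list positive if its tf-weighted score is positive."""
    score = sum(c * weights.get(t, 0.0) for t, c in Counter(doc).items())
    return "pos" if score > 0 else "neg"
```

Because the weights carry a sign, a single linear score separates the classes; the paper's contribution is to move beyond such purely lexical weighting with syntax-aware features and subjectivity extraction.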


2015 ◽  
Vol 117 (11) ◽  
pp. 2831-2848 ◽  
Author(s):  
Arianna Ruggeri ◽  
Anne Arvola ◽  
Antonella Samoggia ◽  
Vaiva Hendrixson

Purpose – At a European level, Italy experiences one of the highest percentages of population at risk of poverty (AROP). However, studies on this consumer segment are scarce. The purpose of this paper is to investigate the food behaviours of Italian female consumers, distinguishing similarities and differences due to age and level of income. Design/methodology/approach – The investigation adopted an inductive approach in order to analyse and confirm the determinants of food behaviours. Data were collected through four focus groups. Data elaboration included content analyses with the term frequency-inverse document frequency index and the multidimensional scaling technique. Findings – The food behaviours of Italian female consumers are based on a common set of semantic categories and theoretical dimensions that are coherent with those applied by previous studies. The age of consumers impacts the relevance attributed to the categories, and income contributes to the explanation of the conceptual relations among the categories that determine food behaviours. The approach to food of younger and mature AROP consumers is strongly driven by constraints such as price and time. The study did not confirm a link between a poor health attitude and low socio-economic status. Research limitations/implications – The outcomes achieved can be strengthened by quantitative analyses to characterise the relations occurring among the factors and dimensions that influence the food behaviours of AROP consumers. Originality/value – The study increases knowledge about Italian female consumers and provides an initial contribution to the analysis of the food behaviour of the AROP population.


2021 ◽  
Vol 63 (1) ◽  
pp. 38-42
Author(s):  
Tomasz Szczepański ◽  
Stanisław Traczyk ◽  
Paweł Dziedziak

The analysis of vibroacoustic signals is one of the more frequently used methods of diagnosing mechanical devices, occurring, among others, in car diagnostics. Often the most important element of the recorded course is the fundamental harmonic frequency of the vibrations: the fundamental frequency indicates the main process related to the operation of the device and allows its course to be followed. The article presents the author's method of determining the fundamental frequency of a signal, which is the subject of a patent application. Its theoretical basis and application examples are discussed, comparing its accuracy with that of other methods. The frequency range in which the method finds application is shown, that is, where its accuracy turns out to be better than that of the popular methods of determining the fundamental harmonic frequency.
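The article does not disclose the patented method itself, so for context here is a sketch of one of the popular baseline approaches such a method would be compared against: fundamental-frequency estimation by autocorrelation. The sine test signal and sample rate are illustrative assumptions; the search starts after the first zero crossing of the autocorrelation to skip the trivial peak at small lags.

```python
import math

def fundamental_frequency(signal, sample_rate):
    """Estimate the fundamental frequency of a sampled signal by picking
    the autocorrelation peak after the first zero crossing; the winning
    lag is the period in samples."""
    n = len(signal)
    mu = sum(signal) / n
    x = [s - mu for s in signal]  # remove DC offset
    r = [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(n // 2)]
    lag = 1
    while lag < len(r) and r[lag] > 0:  # skip the lag-0 lobe
        lag += 1
    if lag >= len(r):
        return 0.0  # no periodicity found within half the window
    best = max(range(lag, len(r)), key=lambda l: r[l])
    return sample_rate / best
```

For a clean 50 Hz sine sampled at 1 kHz this recovers 50 Hz exactly; on noisy vibroacoustic data the lag resolution limits accuracy at high frequencies, which is the regime where the accuracy comparison in the article matters.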

