Research on aviation unsafe incidents classification with improved TF-IDF algorithm

2016 ◽  
Vol 30 (12) ◽  
pp. 1650184 ◽  
Author(s):  
Yanhua Wang ◽  
Zhiyuan Zhang ◽  
Weigang Huo

The text content of Aviation Safety Confidential Reports contains a large amount of valuable information. The term frequency-inverse document frequency (TF-IDF) algorithm is commonly used in text analysis, but it does not take into account the sequential relationship of the words in a text or their role in semantic expression. Working from the seven category labels of civil aviation unsafe incidents, and aiming to address these shortcomings of TF-IDF, this paper improves the TF-IDF algorithm using a co-occurrence network and establishes feature-word extraction and word-order relations for the classified incidents. An aviation-domain lexicon was used to improve classification accuracy. A feature-word network model was designed for multi-document classification of unsafe incidents and was used in the experiments. Finally, the classification accuracy of the improved algorithm was verified experimentally.
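
A minimal sketch of the general idea, in Python: compute standard TF-IDF weights and then rescale each term by its degree in a sliding-window co-occurrence graph, so that terms embedded in richer word-order contexts are promoted. The degree-based rescaling, the window size, and the helper names are illustrative assumptions; the paper's actual co-occurrence weighting is not specified in the abstract.

    import math
    from collections import Counter, defaultdict

    def tf_idf(docs):
        """Standard TF-IDF weights per document (docs = lists of tokens)."""
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        weights = []
        for doc in docs:
            tf = Counter(doc)
            weights.append({t: (c / len(doc)) * math.log(n / df[t])
                            for t, c in tf.items()})
        return weights

    def cooccurrence_degree(doc, window=2):
        """Degree of each term in a sliding-window co-occurrence graph."""
        neighbours = defaultdict(set)
        for i, t in enumerate(doc):
            for u in doc[max(0, i - window): i + window + 1]:
                if u != t:
                    neighbours[t].add(u)
        return {t: len(nb) for t, nb in neighbours.items()}

    def cooccurrence_tf_idf(docs, window=2):
        """Hypothetical variant: scale TF-IDF by (1 + co-occurrence degree)."""
        out = []
        for doc, w in zip(docs, tf_idf(docs)):
            deg = cooccurrence_degree(doc, window)
            out.append({t: v * (1 + deg.get(t, 0)) for t, v in w.items()})
        return out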

2019 ◽  
Author(s):  
Matthew J. Lavin

This lesson focuses on a foundational natural language processing and information retrieval method called Term Frequency-Inverse Document Frequency (tf-idf). It explores the foundations of tf-idf and introduces some of the questions and concepts of computationally oriented text analysis.
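
For readers who want to try the method directly, a minimal sketch using scikit-learn's TfidfVectorizer on a toy corpus (the corpus and parameter choices here are illustrative, not taken from the lesson):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "dogs and cats are common household pets",
        "cats sleep for most of the day",
        "the stock market closed higher today",
    ]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(corpus)            # documents x terms
    terms = vectorizer.get_feature_names_out()

    # Highest-weighted terms for the first document.
    row = matrix[0].toarray().ravel()
    print(sorted(zip(terms, row), key=lambda p: -p[1])[:3])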


Toxic online content (TOC) has become a significant problem in today's world because the internet is used by people of diverse cultural, social, organizational, and industry backgrounds on platforms such as Twitter, Facebook, WhatsApp, Instagram, and Telegram. Much work is still being done on single-label classification for text analysis, to reduce errors and improve efficiency, but in recent years there has been a shift towards multi-label classification, which is applicable to both text and images. Text classification, however, remains less popular among researchers than image classification. In this work we use a dataset of short messages to train and develop a model that can assign multiple labels to each message. Hate speech and offensive language are key challenges in the automatic detection of toxic text content. This paper contributes term frequency-inverse document frequency (Tf-Idf), Random Forest, Support Vector Machine (SVM), and Naïve Bayes classifier approaches for automatically classifying tweets; after tuning, the best model achieves good accuracy on the test data. The work also covers the components that mediate between the user and the Twitter API. Instead of traditional techniques such as bag of words or word counts, Tf-Idf is used to measure similarity: the text is transformed into Tf-Idf vectors, which are used to train the model with supervised learning together with the labels from the dataset. The resulting model achieves good accuracy and efficiency.
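
A hedged sketch of the kind of pipeline described above: Tf-Idf vectors feeding a per-label SVM for multi-label tagging of short messages. The tweets, labels, and hyperparameters below are placeholders, not the authors' dataset or configuration, and a Random Forest or Naïve Bayes classifier could be swapped in for LinearSVC.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    tweets = [
        "example tweet with offensive wording",
        "perfectly ordinary tweet about the weather",
        "another tweet containing hate speech and insults",
    ]
    tags = [["offensive"], [], ["hate", "offensive"]]    # hypothetical labels

    y = MultiLabelBinarizer().fit_transform(tags)        # one binary column per label
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),             # Tf-Idf features
        OneVsRestClassifier(LinearSVC()),                # one SVM per label
    )
    model.fit(tweets, y)
    print(model.predict(["yet another offensive tweet"]))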


Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Peter Brown ◽  
Aik-Choon Tan ◽  
Mohamed A El-Esawi ◽  
Thomas Liehr ◽  
Oliver Blanck ◽  
...  

Abstract Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency–Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.
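
As context for the baselines named above, a small sketch of Okapi BM25 scoring in its standard formulation, with the conventional defaults k1 = 1.5 and b = 0.75; the consortium's exact implementations are not reproduced here.

    import math
    from collections import Counter

    def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
        """Score each tokenized document against a tokenized query."""
        n = len(docs_tokens)
        avgdl = sum(len(d) for d in docs_tokens) / n
        df = Counter(t for d in docs_tokens for t in set(d))
        scores = []
        for doc in docs_tokens:
            tf = Counter(doc)
            s = 0.0
            for q in query_tokens:
                if q not in tf:
                    continue
                idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1)
                norm = tf[q] + k1 * (1 - b + b * len(doc) / avgdl)
                s += idf * tf[q] * (k1 + 1) / norm
            scores.append(s)
        return scores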


1995 ◽  
Vol 1 (2) ◽  
pp. 163-190 ◽  
Author(s):  
Kenneth W. Church ◽  
William A. Gale

Abstract Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and n-grams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a "bag-of-words" assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ. φ is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Γ distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).
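
As a sketch of the mixture named above (assuming, for illustration, that φ is a Gamma density with shape α and rate β; the paper's own parameterization may differ), integrating the Poisson likelihood against φ yields the Negative Binomial, from which document frequency and hence IDF follow:

    \[
    P(k) \;=\; \int_0^\infty \frac{e^{-\theta}\theta^{k}}{k!}\,\varphi(\theta)\,d\theta,
    \qquad
    \varphi(\theta) \;=\; \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\theta^{\alpha-1}e^{-\beta\theta}
    \]
    \[
    \Rightarrow\quad
    P(k) \;=\; \frac{\Gamma(k+\alpha)}{k!\,\Gamma(\alpha)}
    \left(\frac{\beta}{1+\beta}\right)^{\alpha}\left(\frac{1}{1+\beta}\right)^{k},
    \qquad
    \mathrm{IDF} \;=\; -\log_2 \Pr(x \ge 1) \;=\; -\log_2\!\left[1-\left(\frac{\beta}{1+\beta}\right)^{\alpha}\right]
    \]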


Author(s):  
Saud Altaf ◽  
Sofia Iqbal ◽  
Muhammad Waseem Soomro

This paper focuses on capturing the meaning of Natural Language Understanding (NLU) text features to detect duplicates using unsupervised features. The NLU features are compared with lexical approaches to determine the most suitable classification technique. A transfer-learning approach is used to train feature extraction on the Semantic Textual Similarity (STS) task. All features are evaluated on two types of datasets, Bosch bug reports and Wikipedia articles. This study aims to structure recent research efforts by comparing NLU concepts for representing the semantics of text and applying them to information retrieval (IR). The main contribution of this paper is a comparative study of semantic similarity measurements. The experimental results demonstrate the Term Frequency-Inverse Document Frequency (TF-IDF) feature results on both datasets with a reasonable vocabulary size, and indicate that a Bidirectional Long Short-Term Memory (BiLSTM) network can learn the structure of a sentence to improve classification.
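
A sketch of the lexical side of such a comparison: a TF-IDF cosine-similarity baseline scoring sentence pairs, as one might run on an STS-style task. The sentences below are invented placeholders, not the Bosch or Wikipedia data.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    pairs = [
        ("the build fails after the latest commit",
         "the latest commit breaks the build"),
        ("the build fails after the latest commit",
         "the documentation page needs a new screenshot"),
    ]
    vectorizer = TfidfVectorizer().fit([s for pair in pairs for s in pair])
    for a, b in pairs:
        m = vectorizer.transform([a, b])
        score = cosine_similarity(m[0], m[1])[0, 0]
        print(round(score, 3), "|", a, "<->", b)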


Author(s):  
Mariani Widia Putri ◽  
Achmad Muchayan ◽  
Made Kamisutara

Recommendation systems are currently a trend, as people now rely more on online transactions for various personal reasons. A recommendation system offers an easier and faster way for users, so they do not need to spend too much time finding the items they want. Competition among businesses has also changed, forcing them to adjust their approach to reach potential customers, and a system is therefore needed to support this. In this study, the authors build a product recommendation system using the Content-Based Filtering method and Term Frequency-Inverse Document Frequency (TF-IDF) from the Information Retrieval (IR) model, in order to obtain efficient results that meet the need to improve Customer Relationship Management (CRM). The recommendation system is built and applied as a solution to increase customer brand awareness and to minimize failed transactions caused by a lack of information that can be conveyed directly or offline. The data consist of 258 product codes, each with eight categories and 33 constituent keywords according to the company's product knowledge. The TF-IDF calculation yields a weight of 13.854 when displaying the first best product recommendation, and achieves an accuracy of 96.5% in recommending pens.
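
A minimal sketch of content-based filtering with TF-IDF in the spirit of the system described above: each product is represented by its keyword string, and recommendations are ranked by cosine similarity to a query product. The product codes and keywords below are invented placeholders, not the company's 258-item catalog.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    products = {                                   # hypothetical catalog entries
        "P001": "ballpoint pen blue ink office stationery",
        "P002": "gel pen black ink fine tip stationery",
        "P003": "a4 paper ream office printing supplies",
    }
    codes = list(products)
    matrix = TfidfVectorizer().fit_transform(products.values())

    def recommend(code, top_n=2):
        """Return the top_n products most similar to the given code."""
        idx = codes.index(code)
        sims = cosine_similarity(matrix[idx], matrix).ravel()
        ranked = sorted(zip(codes, sims), key=lambda p: -p[1])
        return [c for c, s in ranked if c != code][:top_n]

    print(recommend("P001"))   # the other pen should rank first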


Author(s):  
Maksim Peregudov ◽  
Anatoliy Steshkovoy

Currently, centrally reserved access to the medium in digital radio communication networks of the IEEE 802.11 family of standards is an alternative to random multiple access schemes such as CSMA/CA and is mainly used for transmitting voice and video messages in real time. Centrally reserved medium access is therefore of interest to attackers. However, the effectiveness of centrally reserved access under potential destructive impacts has not been assessed, so the contribution of such impacts to the reduced effectiveness of this access cannot be evaluated. In addition, the stage of establishing centrally reserved access has not previously been taken into account. The goal of this work is the development of an analytical model of centrally reserved medium access under destructive influences in digital radio communication networks of the IEEE 802.11 family of standards. A mathematical model of centrally reserved medium access has been developed that takes into account not only the stage of its operation but also the stage of its establishment under destructive influences by an attacker. In the model, the establishment stage reflects the sequential relationship between such access, the synchronization elements of digital radio communication networks, and CSMA/CA-type random multiple access to the medium. It was established that collisions in the data transmission channel caused by destructive influences can prevent centrally reserved access to the medium even at the establishment stage. The model is applicable to the design of digital radio communication networks of the IEEE 802.11 family of standards, the optimization of the operation of such networks, and the detection of potential destructive actions by an attacker.

