A Web Text De-Noising Algorithm Based on Machine Learning

2014 ◽  
Vol 536-537 ◽  
pp. 516-519
Author(s):  
Xiao Yan Le

The Web has become a huge, distributed information space containing an enormous number of Web documents of various types. Search engines alone find it difficult to meet the differing requirements users place on the refinement of retrieval results. This paper proposes a text de-noising method: it first identifies the noisy page information, then generates de-noising rules through human interaction, and finally locates and removes the noise. Experiments show the method to be effective in improving classification accuracy.
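A minimal sketch of what such rule-based de-noising can look like is shown below, assuming an HTML page as input and two hand-written rules (a blacklist of boilerplate tags and a minimum text-block length); it illustrates the general idea only, not the interactively generated rules used in the paper.

```python
from bs4 import BeautifulSoup

# Hypothetical hand-written de-noising rules: a blacklist of boilerplate tags
# and a minimum length for text blocks that count as real content.
NOISE_TAGS = ["script", "style", "nav", "footer", "aside", "form"]
MIN_TEXT_LENGTH = 30

def denoise(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Rule 1: drop boilerplate tags (menus, scripts, footers) outright.
    for tag in soup(NOISE_TAGS):
        tag.decompose()
    # Rule 2: keep only text blocks long enough to be article content.
    blocks = (t.strip() for t in soup.stripped_strings)
    return "\n".join(b for b in blocks if len(b) >= MIN_TEXT_LENGTH)

page = ("<html><body><nav>Home | About</nav>"
        "<p>" + "This is the main article text of the page. " * 3 + "</p>"
        "<footer>Copyright</footer></body></html>")
print(denoise(page))
```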

Bibliosphere ◽  
2020 ◽  
pp. 78-84
Author(s):  
A. Yu. Gerasimenko

Over the past decade, the uncontrolled growth of unsystematic information on the Internet has remained an acute issue for the scientific community. The problem of finding relevant information, tied to the distributed and autonomous nature of scientific information resources, persists. A priority in providing centralized access to key scientifically significant sources of information is the creation of a united information space (UIS). The study aims to identify the main models for building systems that integrate distributed information resources and, as a result, to determine the structure of UIS formation in a research library. Two models were considered and analyzed: a meta-aggregator and an integrated electronic library. For each model, the analysis reveals the elements, the structure, and the set of functions offered to users and employees of a research library. The study supports the following conclusions:
• The choice of a model for UIS formation depends mostly on the formulation of the tasks the system is meant to solve, as well as on the technological potential of the organizations involved.
• The multifunctionality of the system allows the above-mentioned formation models to be used simultaneously.
• Adding an element of interactivity to the structure of the research library's UIS will allow timely monitoring of changes in scientists' information needs and reduce the time, labor and financial costs of both the library and the user.
The article presents the criteria for choosing a model. For the first time, the optimal effective structure of the UIS in a research library is described.


2020 ◽  
Vol 16 (2) ◽  
pp. 8-22
Author(s):  
Tirath Prasad Sahu ◽  
Sarang Khandekar

Sentiment analysis can be a very useful tool for extracting information from text documents. Its central question is what people think of a particular online review subject, e.g. product reviews, movie reviews, etc. Sentiment analysis is the process of classifying such reviews as positive or negative. The web is enriched with a huge number of reviews that can be analyzed to make them meaningful. This article presents the use of lexicon resources for sentiment analysis of different publicly available reviews. First, the polarity shift caused by negations is handled; intensifiers, punctuation, and acronyms are also taken into consideration during the processing phase. Second, opinion-bearing words are extracted and used to compute a sentiment score. Third, machine learning algorithms are applied, and the experimental results show that the proposed model is effective in identifying the sentiments of reviews and opinions.
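A minimal sketch of the lexicon scoring idea is given below; the tiny polarity lexicon, intensifier weights and negation list are placeholders, not the resources used by the authors.

```python
# Placeholder lexicons for illustration only.
POLARITY = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0, "boring": -1.0}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "slightly": 0.5}
NEGATIONS = {"not", "never", "no", "n't"}

def score(review: str) -> float:
    # Split contracted negations ("wasn't" -> "was n't") before tokenizing.
    tokens = review.lower().replace("n't", " n't").split()
    total, weight, negate = 0.0, 1.0, False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True                  # flip polarity of the next opinion word
        elif tok in INTENSIFIERS:
            weight *= INTENSIFIERS[tok]    # boost or dampen the next opinion word
        elif tok in POLARITY:
            val = POLARITY[tok] * weight
            total += -val if negate else val
            weight, negate = 1.0, False    # reset modifiers after each opinion word
    return total

print(score("The movie was not very good"))     # negative score
print(score("An extremely great performance"))  # strongly positive score
```

The resulting scores can then be used directly, or fed as features to the machine learning classifiers mentioned in the abstract.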


Author(s):  
M. Ilayaraja ◽  
S. Hemalatha ◽  
P. Manickam ◽  
K. Sathesh Kumar ◽  
K. Shankar

Cloud computing is characterized as the provision of resources or services over the web to clients on demand by cloud providers. It delivers everything as a service over the web in response to client requests, for example operating systems, network hardware, storage, computing resources, and software. An Intrusion Detection System (IDS) is a powerful mechanism that allows experts to take action when a system is compromised by an intrusion. Most intrusion detection frameworks are built on machine learning strategies, and the dataset most commonly used in intrusion detection is the Knowledge Discovery in Databases (KDD) dataset. This paper detects and classifies intruded data using Machine Learning (ML) with the MapReduce model. The first phase uses the Hadoop MapReduce model to reduce the size of the database to an optimal weight set for the reducer model, and the second stage uses a Decision Tree (DT) classifier to detect the data. The DT classifier uses an appropriate classifier to decide the class labels for non-homogeneous leaf nodes: the decision tree fragment gives a coarse segment profile, while the leaf-level classifier provides information about the attributes that influence the label within a segment. The proposed approach achieves a detection accuracy of 96.21%, compared with existing classifiers such as Neural Network (NN), Naive Bayes (NB) and K-Nearest Neighbor (KNN).
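A minimal sketch of the decision-tree detection stage using scikit-learn is shown below on synthetic KDD-style records; the Hadoop MapReduce weight-reduction phase and the leaf-level classifiers described in the paper are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in for preprocessed KDD records: 41 numeric features, binary label
# (0 = normal traffic, 1 = intrusion). Real experiments use the KDD dataset.
X, y = make_classification(n_samples=5000, n_features=41,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Decision tree classifier standing in for the paper's DT detection stage.
clf = DecisionTreeClassifier(max_depth=10, random_state=0)
clf.fit(X_train, y_train)
print("detection accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```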


2020 ◽  
Author(s):  
Mikołaj Morzy ◽  
Bartłomiej Balcerzak ◽  
Adam Wierzbicki

BACKGROUND With the rapidly accelerating dissemination of false medical information on the Web, establishing the credibility of online sources of medical information has become a pressing necessity. The sheer number of websites offering questionable medical information presented as reliable and actionable advice, with possibly harmful effects, imposes an additional requirement on potential solutions: they have to scale to the size of the problem. Machine learning is one such solution which, when properly deployed, can be an effective tool in fighting medical disinformation on the Web. OBJECTIVE We present a comprehensive framework for designing and curating machine learning training datasets for online medical information credibility assessment. We show how the annotation process should be constructed and what pitfalls should be avoided. Our main objective is to provide researchers from the medical and computer science communities with guidelines on how to construct datasets for machine learning models across various areas of medical information wars. METHODS The key component of our approach is the active annotation process. We begin by outlining the annotation protocol for the curation of a high-quality training dataset, which can then be augmented and rapidly extended by employing the human-in-the-loop paradigm in machine learning training. To circumvent the cold-start problem of insufficient gold-standard annotations, we propose a pre-processing pipeline consisting of representation learning, clustering, and re-ranking of sentences to accelerate training and to optimize the human resources involved in annotation. RESULTS We collect over 10 000 annotations of sentences related to selected subjects (psychiatry, cholesterol, autism, antibiotics, vaccines, steroids, birth methods, food allergy testing) for less than $7 000, employing 9 highly qualified annotators (certified medical professionals), and we release this dataset to the general public. We develop an active annotation framework for more efficient annotation of non-credible medical statements. The results of the qualitative analysis support our claims about the efficacy of the presented method. CONCLUSIONS A very diverse set of incentives is driving the widespread dissemination of medical disinformation on the Web. An effective strategy for countering this spread is to use machine learning to automatically establish the credibility of online medical information. This, however, requires a thoughtful design of the training pipeline. In this paper we present a comprehensive framework of active annotation. In addition, we publish a large curated dataset of medical statements labelled as credible, non-credible, or neutral.
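A minimal sketch of the cold-start pre-processing idea (embed sentences, cluster them, and rank cluster-central sentences for annotation first) is given below; TF-IDF and k-means are generic stand-ins for the representation learning and clustering components, and the example sentences are invented, not taken from the released dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented stand-in corpus; the released dataset contains >10 000 annotated sentences.
sentences = [
    "Vaccines cause autism in children.",
    "Antibiotics are ineffective against viral infections.",
    "High cholesterol has no link to heart disease.",
    "Steroids are completely safe at any dose.",
]

# Represent sentences (TF-IDF here as a placeholder) and cluster them by topic.
X = TfidfVectorizer().fit_transform(sentences)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Re-rank: sentences closest to their cluster centre go to annotators first,
# so each topic cluster gets gold-standard labels as early as possible.
dist = km.transform(X)[np.arange(len(sentences)), km.labels_]
for idx in np.argsort(dist):
    print(f"cluster {km.labels_[idx]}  dist {dist[idx]:.3f}  {sentences[idx]}")
```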


Author(s):  
Shobhit Mathur ◽  
Pritam Nikam ◽  
Harshita Patidar ◽  
Rohan Bapusaheb Gaikwad ◽  
Preeti Narayan Nayak

Proceedings ◽  
2021 ◽  
Vol 77 (1) ◽  
pp. 17
Author(s):  
Andrea Giussani

In the last decade, advances in statistical modeling and computer science have boosted the production of machine-generated content in different fields: from language to image generation, the quality of the generated outputs is remarkably high, sometimes better than that of content produced by a human being. Modern technological advances such as OpenAI’s GPT-2 (and recently GPT-3) permit automated systems to dramatically alter reality with synthetic outputs, so that humans are not able to distinguish the real copy from its counterfeit. An example is given by an article entirely written by GPT-2, but many other examples exist. In the field of computer vision, Nvidia’s Generative Adversarial Network, commonly known as StyleGAN (Karras et al. 2018), has become the de facto reference point for the production of huge numbers of fake human face portraits; additionally, recent algorithms have been developed to create both musical scores and mathematical formulas. This presentation aims to bring participants up to date with the state-of-the-art results in this field: we cover both GANs and language modeling with recent applications. The novelty here is that we apply a transformer-based machine learning technique, namely RoBERTa (Liu et al. 2019), to the detection of human-produced versus machine-produced text in the context of fake news detection. RoBERTa is a recent algorithm based on the well-known Bidirectional Encoder Representations from Transformers model, known as BERT (Devlin et al. 2018); this is a bidirectional transformer for natural language processing developed by Google and pre-trained on a huge amount of unlabeled textual data to learn embeddings. We then use these representations as the input to our classifier to detect real versus machine-produced text. The application is demonstrated in the presentation.
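A minimal sketch of scoring a text with a RoBERTa sequence classifier via the Hugging Face transformers library is given below; the generic roberta-base checkpoint and the human/machine label scheme are placeholders, since the fine-tuned detector described in the presentation is not reproduced here.

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Generic pretrained RoBERTa with a fresh 2-way classification head (placeholder;
# in practice the head is fine-tuned on labelled human vs. machine text).
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.eval()

text = "Scientists announce a breakthrough that rewrites the laws of physics."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()

# Label meaning (0 = human-written, 1 = machine-generated) is an assumption;
# an untuned head gives near-random scores until fine-tuned.
print({"human": float(probs[0]), "machine": float(probs[1])})
```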


2021 ◽  
Vol 13 (3) ◽  
pp. 80
Author(s):  
Lazaros Vrysis ◽  
Nikolaos Vryzas ◽  
Rigas Kotsakis ◽  
Theodora Saridou ◽  
Maria Matsiola ◽  
...  

Social media services make it possible for an increasing number of people to express their opinion publicly. In this context, large amounts of hateful comments are published daily. The PHARM project aims at monitoring and modeling hate speech against refugees and migrants in Greece, Italy, and Spain. In this direction, a web interface for the creation and querying of a multi-source database containing hate speech-related content is implemented and evaluated. The selected sources include Twitter, YouTube, and Facebook comments and posts, as well as comments and articles from a selected list of websites. The interface allows users to search the existing database, scrape social media using keywords, annotate records through a dedicated platform, and contribute new content to the database. Furthermore, functionality for hate speech detection and sentiment analysis of texts is provided, making use of novel methods and machine learning models. The interface can be accessed online through a graphical user interface compatible with modern internet browsers. For the evaluation of the interface, a multifactor questionnaire was formulated, aiming to record users’ opinions about the web interface and the corresponding functionality.
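A minimal sketch of the kind of text classifier that could back the interface’s hate speech detection functionality is shown below; the toy corpus, label scheme and model choice are assumptions for illustration, not the PHARM project’s actual models or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-made stand-in corpus; real training data would come from the
# annotated multi-source database described in the abstract.
texts = [
    "refugees are welcome here",
    "send them all back",
    "migrants enrich our culture",
    "they should be deported",
]
labels = [0, 1, 0, 1]  # 0 = non-hateful, 1 = hateful (assumed label scheme)

# TF-IDF features plus logistic regression as a simple baseline detector.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["they do not belong here"]))
```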

