Big Data, Big Noise

In this article, we focus on noise in the sense of irrelevant information in a data set as a specific methodological challenge of web research in the era of big data. We empirically evaluate several methods for filtering hyperlink networks in order to reconstruct networks that contain only webpages that deal with a particular issue. The test corpus of webpages was collected from hyperlink networks on the issue of food safety in the United States and Germany. We applied three filtering strategies and evaluated their performance to exclude irrelevant content from the networks: keyword filtering, automated document classification with a machine-learning algorithm, and extraction of core networks with network-analytical measures. Keyword filtering and automated classification of webpages were the most effective methods for reducing noise, whereas extracting a core network did not yield satisfying results for this case.

Download Full-text

Big Data, Big Noise: The Challenge of Finding Issue Networks on the Web

10.31235/osf.io/9etqm ◽

2021 ◽

Author(s):

Annie Waldherr ◽

Daniel Maier ◽

Peter Miltner ◽

Enrico Günther

Keyword(s):

Big Data ◽

Learning Algorithm ◽

The United States ◽

Web Pages ◽

Automated Classification ◽

Core Network ◽

Data Set ◽

Core Networks ◽

Test Corpus ◽

Issue Networks

In this paper, we focus on noise in the sense of irrelevant information in a data set as a specific methodological challenge of web research in the era of big data. We empirically evaluate several methods for filtering hyperlink networks in order to reconstruct networks that contain only web pages that deal with a particular issue. The test corpus of web pages was collected from hyperlink networks on the issue of food safety in the United States and Germany. We applied three filtering strategies and evaluated their performance to exclude irrelevant content from the networks: keyword filtering, automated document classification with a machine-learning algorithm, and extraction of core networks with network-analytical measures. Keyword filtering and automated classification of web pages were the most effective methods for reducing noise whereas extracting a core network did not yield satisfying results for this case.

Download Full-text

Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams

Entropy ◽

10.3390/e23070859 ◽

2021 ◽

Vol 23 (7) ◽

pp. 859

Author(s):

Abdulaziz O. AlQabbany ◽

Aqil M. Azmi

Keyword(s):

Big Data ◽

Random Forest ◽

Real Time ◽

Data Streams ◽

Learning Algorithm ◽

Concept Drift ◽

The United States ◽

Careful Consideration ◽

Data Sets ◽

Stream Data

We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift is a change in the data’s underlying distribution, a significant issue, especially when learning from data streams. It requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming settings of machine learning applications. At the same time, the Adaptive Random Forest (ARF) is a stream learning algorithm that showed promising results in terms of its accuracy and ability to deal with various types of drift. The incoming instances’ continuity allows for their binomial distribution to be approximated to a Poisson(1) distribution. In this study, we propose a mechanism to increase such streaming algorithms’ efficiency by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects in online learning; accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed method of enhancement exhibited considerable improvement in most of the situations.

Download Full-text

ON A KNOWLEDGE-BASED APPROACH TO THE CLASSIFICATION OF MOBILE LASER SCANNING POINT CLOUDS

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-4-343-2018 ◽

2018 ◽

Vol XLII-4 ◽

pp. 343-349 ◽

Cited By ~ 1

Author(s):

M. Lemmens

Keyword(s):

Point Cloud ◽

Laser Scanning ◽

Decision Rules ◽

Point Clouds ◽

Bench Mark ◽

Automated Classification ◽

Initial Experiment ◽

Data Set ◽

Knowledge Based

Abstract. A knowledge-based system exploits the knowledge, which a human expert uses for completing a complex task, through a database containing decision rules, and an inference engine. Already in the early nineties knowledge-based systems have been proposed for automated image classification. Lack of success faded out initial interest and enthusiasm, the same fate neural networks struck at that time. Today the latter enjoy a steady revival. This paper aims at demonstrating that a knowledge-based approach to automated classification of mobile laser scanning point clouds has promising prospects. An initial experiment exploiting only two features, height and reflectance value, resulted in an overall accuracy of 79% for the Paris-rue-Madame point cloud bench mark data set.

Download Full-text

Big Data for Health Care Analytics using Extreme Machine Learning Based on Map Reduce

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.c5808.029320 ◽

2020 ◽

Vol 9 (3) ◽

pp. 2758-2762

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Storage ◽

Clinical Data ◽

Disease Risk ◽

Learning Algorithm ◽

Information Storage ◽

Support Vector ◽

Machine Learning Algorithm ◽

Data Set

A large volume of datasets is available in various fields that are stored to be somewhere which is called big data. Big Data healthcare has clinical data set of every patient records in huge amount and they are maintained by Electronic Health Records (EHR). More than 80 % of clinical data is the unstructured format and reposit in hundreds of forms. The challenges and demand for data storage, analysis is to handling large datasets in terms of efficiency and scalability. Hadoop Map reduces framework uses big data to store and operate any kinds of data speedily. It is not solely meant for storage system however conjointly a platform for information storage moreover as processing. It is scalable and fault-tolerant to the systems. Also, the prediction of the data sets is handled by machine learning algorithm. This work focuses on the Extreme Machine Learning algorithm (ELM) that can utilize the optimized way of finding a solution to find disease risk prediction by combining ELM with Cuckoo Search optimization-based Support Vector Machine (CS-SVM). The proposed work also considers the scalability and accuracy of big data models, thus the proposed algorithm greatly achieves the computing work and got good results in performance of both veracity and efficiency.

Download Full-text

Automated classification of cancerous textures in histology images using quasi-supervised learning algorithm

2010 15th National Biomedical Engineering Meeting ◽

10.1109/biyomut.2010.5479863 ◽

2010 ◽

Cited By ~ 1

Author(s):

Devrim Onder ◽

Sulen Sarioglu ◽

Bilge Karacali

Keyword(s):

Supervised Learning ◽

Learning Algorithm ◽

Automated Classification

Download Full-text

Automated Classification of Mammographic Abnormalities Using Transductive Semi Supervised Learning Algorithm

Lecture Notes in Electrical Engineering - Proceedings of the Mediterranean Conference on Information & Communication Technologies 2015 ◽

10.1007/978-3-319-30298-0_73 ◽

2016 ◽

pp. 657-662 ◽

Cited By ~ 6

Author(s):

Nawel Zemmal ◽

Nabiha Azizi ◽

Mokhtar Sellami ◽

Nilanjan Dey

Keyword(s):

Supervised Learning ◽

Learning Algorithm ◽

Automated Classification

Download Full-text

EFFICIENT CLASSIFICATION OF SCANNED MEDIA USING SPATIAL STATISTICS

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001410008263 ◽

2010 ◽

Vol 24 (06) ◽

pp. 917-946

Author(s):

GOZDE UNAL ◽

GAURAV SHARMA ◽

REINER ESCHBACH

Keyword(s):

Spatial Statistics ◽

High Reliability ◽

Image Data ◽

Training Data ◽

Automated Classification ◽

Data Set ◽

Statistical Measures ◽

Scanned Image ◽

Scanned Images

Photography, lithography, xerography, and inkjet printing are the dominant technologies for color printing. Images produced on these "different media" are often scanned either for the purpose of copying or creating an electronic representation. For an improved color calibration during scanning, a media identification from the scanned image data is desirable. In this paper, we propose an efficient algorithm for automated classification of input media into four major classes corresponding to photographic, lithographic, xerographic and inkjet. Our technique exploits the strong correlation between the type of input media and the spatial statistics of corresponding images, which are observed in the scanned images. We adopt ideas from spatial statistics literature, and design two spatial statistical measures of dispersion and periodicity, which are computed over spatial point patterns generated from blocks of the scanned image, and whose distributions provide the features for making a decision. We utilize extensive training data and determined well separated decision regions to classify the input media. We validate and tested our classification technique results over an independent extensive data set. The results demonstrate that the proposed method is able to distinguish between the different media with high reliability.

Download Full-text

Bees Assorter

International Journal for Modern Trends in Science and Technology - RTT2020 ◽

10.46501/ijmtst061212 ◽

2020 ◽

Vol 6 (12) ◽

pp. 61-65

Author(s):

Himanshu Verma

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Machine Learning Algorithms ◽

Bumble Bee ◽

Wing Size ◽

Qualitative And Quantitative Analysis ◽

Data Set ◽

Qualitative And Quantitative ◽

The Difference

Many attempts were made to classify the bees that is bumble bee or honey bee , there have been such a large amount of researches which were made to seek out the difference between them on the premise of various features like wing size , size of bee , color, life cycle and many more. But altogether the analysis there have been either that specialize in qualitative or quantitative , but to beat this issue , thus researchers came up with an answer which might be both qualitative and quantitative analysis made to classify them. And making use of machine learning algorithm to classify them gives a lift . Now the classification would take less time as these algorithms are pretty fast and accurate . By using machine learning work is made easy . Lots of photographs had to be collected and stored for data set. And by using these machine learning algorithms we would be getting information about the bees which might be employed by researchers in further classification of bees. Manipulation of images had to be done so as on prepare them in such a way that they will be applied to the algorithms and have feature extraction done. As there have been a lot of photographs(data set) which take a lot of space and also the area in which bees were present in these photographs were too small so to accommodate it dimension reduction was done , it might not consider other images like trees , leaves , flowers which were there present in the photograph which we elect as a data set.

Download Full-text

Image Classification using CNN and Machine Learning

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit195298 ◽

2019 ◽

pp. 575-580

Author(s):

G. Keerthi Devipriya ◽

E. Chandana ◽

B. Prathyusha ◽

T. Seshu Chakravarthy

Keyword(s):

Machine Learning ◽

Image Classification ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Data Set ◽

Training Models ◽

Classification Of Images ◽

Respective Category

Here by in this paper we are interested for classification of Images and Recognition. We expose the performance of training models by using a classifier algorithm and an API that contains set of images where we need to compare the uploaded image with the set of images available in the data set that we have taken. After identifying its respective category the image need to be placed in it. In order to classify images we are using a machine learning algorithm that comparing and placing the images.

Download Full-text

Alkalmazott mesterséges intelligencia felhasználási területei és biztonsági kérdései – Mesterséges intelligencia a gyakorlatban

Scientia et Securitas ◽

10.1556/112.2020.00006 ◽

2020 ◽

Vol 1 (1) ◽

pp. 35-42

Author(s):

Péter Ekler ◽

Dániel Pásztor

Keyword(s):

Artificial Intelligence ◽

Big Data ◽

Anomaly Detection ◽

Real Time ◽

Learning Algorithm ◽

Sensor Data ◽

Test Field ◽

Real Time System ◽

Data Set ◽

Wide Range

Összefoglalás. A mesterséges intelligencia az elmúlt években hatalmas fejlődésen ment keresztül, melynek köszönhetően ma már rengeteg különböző szakterületen megtalálható valamilyen formában, rengeteg kutatás szerves részévé vált. Ez leginkább az egyre inkább fejlődő tanulóalgoritmusoknak, illetve a Big Data környezetnek köszönhető, mely óriási mennyiségű tanítóadatot képes szolgáltatni. A cikk célja, hogy összefoglalja a technológia jelenlegi állapotát. Ismertetésre kerül a mesterséges intelligencia történelme, az alkalmazási területek egy nagyobb része, melyek központi eleme a mesterséges intelligencia. Ezek mellett rámutat a mesterséges intelligencia különböző biztonsági réseire, illetve a kiberbiztonság területén való felhasználhatóságra. A cikk a jelenlegi mesterséges intelligencia alkalmazások egy szeletét mutatja be, melyek jól illusztrálják a széles felhasználási területet. Summary. In the past years artificial intelligence has seen several improvements, which drove its usage to grow in various different areas and became the focus of many researches. This can be attributed to improvements made in the learning algorithms and Big Data techniques, which can provide tremendous amount of training. The goal of this paper is to summarize the current state of artificial intelligence. We present its history, introduce the terminology used, and show technological areas using artificial intelligence as a core part of their applications. The paper also introduces the security concerns related to artificial intelligence solutions but also highlights how the technology can be used to enhance security in different applications. Finally, we present future opportunities and possible improvements. The paper shows some general artificial intelligence applications that demonstrate the wide range usage of the technology. Many applications are built around artificial intelligence technologies and there are many services that a developer can use to achieve intelligent behavior. The foundation of different approaches is a well-designed learning algorithm, while the key to every learning algorithm is the quality of the data set that is used during the learning phase. There are applications that focus on image processing like face detection or other gesture detection to identify a person. Other solutions compare signatures while others are for object or plate number detection (for example the automatic parking system of an office building). Artificial intelligence and accurate data handling can be also used for anomaly detection in a real time system. For example, there are ongoing researches for anomaly detection at the ZalaZone autonomous car test field based on the collected sensor data. There are also more general applications like user profiling and automatic content recommendation by using behavior analysis techniques. However, the artificial intelligence technology also has security risks needed to be eliminated before applying an application publicly. One concern is the generation of fake contents. These must be detected with other algorithms that focus on small but noticeable differences. It is also essential to protect the data which is used by the learning algorithm and protect the logic flow of the solution. Network security can help to protect these applications. Artificial intelligence can also help strengthen the security of a solution as it is able to detect network anomalies and signs of a security issue. Therefore, the technology is widely used in IT security to prevent different type of attacks. As different BigData technologies, computational power, and storage capacity increase over time, there is space for improved artificial intelligence solution that can learn from large and real time data sets. The advancements in sensors can also help to give more precise data for different solutions. Finally, advanced natural language processing can help with communication between humans and computer based solutions.

Download Full-text