Big Data, Big Noise: The Challenge of Finding Issue Networks on the Web

2016 ◽  
Vol 35 (4) ◽  
pp. 427-443 ◽  
Author(s):  
Annie Waldherr ◽  
Daniel Maier ◽  
Peter Miltner ◽  
Enrico Günther

In this article, we focus on noise, in the sense of irrelevant information in a data set, as a specific methodological challenge of web research in the era of big data. We empirically evaluate several methods for filtering hyperlink networks in order to reconstruct networks that contain only webpages dealing with a particular issue. The test corpus of webpages was collected from hyperlink networks on the issue of food safety in the United States and Germany. We applied three filtering strategies and evaluated their performance in excluding irrelevant content from the networks: keyword filtering, automated document classification with a machine-learning algorithm, and extraction of core networks with network-analytical measures. Keyword filtering and automated classification of webpages were the most effective methods for reducing noise, whereas extracting a core network did not yield satisfactory results for this case.
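A minimal sketch of two of the three filtering strategies on a networkx hyperlink graph (the machine-learning classifier is omitted); the keywords and the degree threshold k are illustrative assumptions, not the study's actual parameters:

```python
import networkx as nx

# Hypothetical issue keywords; the study's actual keyword lists are not reproduced.
KEYWORDS = {"food safety", "foodborne", "contamination"}

def is_relevant(page_text: str) -> bool:
    """Keyword filter: keep a page if any issue keyword occurs in its text."""
    text = page_text.lower()
    return any(kw in text for kw in KEYWORDS)

def keyword_filter(graph: nx.DiGraph, texts: dict) -> nx.DiGraph:
    """Reduce the hyperlink network to pages passing the keyword filter."""
    relevant = {n for n in graph if is_relevant(texts.get(n, ""))}
    return graph.subgraph(relevant).copy()

def core_network(graph: nx.DiGraph, k: int = 2) -> nx.Graph:
    """Core-network extraction: the k-core keeps nodes with degree >= k."""
    g = graph.to_undirected()
    g.remove_edges_from(nx.selfloop_edges(g))   # k_core requires no self-loops
    return nx.k_core(g, k=k)
```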


Entropy ◽  
2021 ◽  
Vol 23 (7) ◽  
pp. 859
Author(s):  
Abdulaziz O. AlQabbany ◽  
Aqil M. Azmi

We are living in the age of big data, much of which is stream data. Real-time processing of such data requires careful consideration from different perspectives. Concept drift, a change in the data's underlying distribution, is a significant issue, especially when learning from data streams, and it requires learners to adapt to dynamic changes. Random forest is an ensemble approach widely used in classical, non-streaming machine learning settings. The Adaptive Random Forest (ARF) is a stream learning algorithm that has shown promising results in terms of accuracy and its ability to deal with various types of drift. Because instances arrive continuously, the binomial distribution governing resampling can be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase such streaming algorithms' efficiency by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects of online learning: accuracy and execution time. We use six synthetic data sets, each exhibiting a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value of ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test the proposed enhancement and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. The results indicate that the proposed enhancement yields considerable improvement in most situations.
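In the online-bagging scheme that ARF inherits, each incoming instance is replayed k times by each tree, with k drawn from Poisson(λ); λ = 1 recovers classical online bagging, and tuning λ is exactly the knob the study's ρ measure evaluates. A minimal sketch of that mechanism, with the incremental-learner interface and the λ value as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def online_bagging_update(trees, x, y, lam=1.0):
    """One stream step: each tree replays the instance k ~ Poisson(lam) times.

    `trees` is assumed to be a list of incremental learners exposing a
    scikit-learn-style partial_fit. lam=1.0 is classical online bagging;
    ARF is commonly described as using a larger default such as lam=6.
    """
    for tree in trees:
        k = rng.poisson(lam)        # resampling weight for this instance
        for _ in range(k):          # k = 0 means the tree skips the instance
            tree.partial_fit([x], [y])
```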


Large volumes of data are generated and stored across many fields; collectively, this is known as big data. In healthcare, big data comprises the clinical records of every patient, maintained in Electronic Health Record (EHR) systems. More than 80% of clinical data is unstructured and stored in hundreds of formats. The challenge for data storage and analysis is handling such large data sets efficiently and at scale. The Hadoop MapReduce framework stores and processes big data of any kind quickly; it is not solely a storage system but also a platform for data processing, and it is scalable and fault-tolerant. Prediction over these data sets is handled by machine-learning algorithms. This work focuses on the Extreme Learning Machine (ELM), combined with a Cuckoo Search optimization-based Support Vector Machine (CS-SVM), as an optimized way to predict disease risk. The proposed work also considers the scalability and accuracy of big data models; the proposed algorithm handles the computation well and achieves good results in both veracity and efficiency.
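The abstract gives no implementation details. As background, a minimal NumPy sketch of a basic ELM: the hidden-layer weights are random and never trained, and only the output weights are solved in closed form via a pseudo-inverse (the CS-SVM stage and the Hadoop pipeline are omitted; the layer size is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, T, n_hidden=100):
    """Basic ELM: random fixed hidden layer, least-squares output weights.
    X: (n_samples, n_features); T: (n_samples, n_outputs) one-hot targets."""
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights (never trained)
    b = rng.normal(size=n_hidden)                 # random biases
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    beta = np.linalg.pinv(H) @ T                  # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Class scores; take argmax along axis 1 for hard labels."""
    return np.tanh(X @ W + b) @ beta
```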


2020 ◽  
Vol 1 (1) ◽  
pp. 35-42
Author(s):  
Péter Ekler ◽  
Dániel Pásztor

Összefoglalás. Artificial intelligence has undergone enormous development in recent years, and as a result it can now be found in some form in many different fields and has become an integral part of much research. This is mostly due to continually improving learning algorithms and to the Big Data environment, which can supply huge amounts of training data. The goal of this article is to summarize the current state of the technology. It reviews the history of artificial intelligence and a large share of the application areas in which artificial intelligence is a central element. It also points out various security vulnerabilities of artificial intelligence, as well as its applicability in the field of cybersecurity. The article presents a slice of current artificial intelligence applications that illustrate the breadth of its use.

Summary. In recent years artificial intelligence has seen major improvements, which have driven its adoption across many different areas and made it the focus of much research. This can be attributed to improvements in learning algorithms and in Big Data techniques, which can provide tremendous amounts of training data. The goal of this paper is to summarize the current state of artificial intelligence. We present its history, introduce the terminology used, and show technological areas that use artificial intelligence as a core part of their applications. The paper also introduces the security concerns related to artificial intelligence solutions, but also highlights how the technology can be used to enhance security in different applications. Finally, we present future opportunities and possible improvements.

The paper shows some general artificial intelligence applications that demonstrate the technology's wide range of uses. Many applications are built around artificial intelligence technologies, and there are many services a developer can use to achieve intelligent behavior. The foundation of the different approaches is a well-designed learning algorithm, while the key to every learning algorithm is the quality of the data set used during the learning phase. Some applications focus on image processing, such as face detection or gesture detection to identify a person. Other solutions compare signatures, while others detect objects or license plates (for example, the automatic parking system of an office building). Artificial intelligence and accurate data handling can also be used for anomaly detection in real-time systems; for example, there is ongoing research on anomaly detection at the ZalaZone autonomous car test field based on the collected sensor data. There are also more general applications, such as user profiling and automatic content recommendation using behavior-analysis techniques.

However, artificial intelligence technology also has security risks that need to be addressed before an application is deployed publicly. One concern is the generation of fake content, which must be detected with other algorithms that focus on small but noticeable differences. It is also essential to protect the data used by the learning algorithm and to protect the logic flow of the solution; network security can help protect these applications. Artificial intelligence can in turn help strengthen the security of a solution, as it is able to detect network anomalies and other signs of a security issue. The technology is therefore widely used in IT security to prevent different types of attacks. As Big Data technologies, computational power, and storage capacity improve over time, there is room for improved artificial intelligence solutions that can learn from large, real-time data sets. Advances in sensors can also provide more precise data for different solutions. Finally, advanced natural language processing can help with communication between humans and computer-based solutions.


Author(s):  
Ye Ouyang ◽  
M. Hosein Fallah

The current literature provides many practical tools and theoretical methods to design, plan, and dimension Global System for Mobile Communications (GSM) and Universal Mobile Telecommunications System (UMTS) radio networks, but overlooks algorithms for network planning and dimensioning of the core networks of GSM, UMTS, and the IP Multimedia Subsystem (IMS). This chapter introduces an algorithm for traffic, bandwidth, and throughput dimensioning of the network entities in the UMTS core network. The analysis is based on the traffic and throughput generated or absorbed at the interfaces of the network entities in the UMTS core network. Finally, a case study is provided to verify the algorithms created for the UMTS core network. The chapter is aimed at helping UMTS network operators dimension an optimum network size and build an optimum network structure to deliver an optimum quality of service for users. The algorithms developed in the chapter have been successfully applied in dimensioning a nationwide UMTS network in North Africa and were adopted in an optimization tool by a mobile operator in the United States in 2008-09.


Author(s):  
Ye Ouyang ◽  
M. Hosein Fallah

Current literature provides many practical tools and theoretical methods to design, plan, and dimension Global System for Mobile Communications (GSM) and Universal Mobile Telecommunications System (UMTS) radio networks, but overlooks algorithms for network planning and dimensioning of the core networks of GSM, UMTS, and the IP Multimedia Subsystem (IMS). This paper introduces an algorithm for traffic, bandwidth, and throughput dimensioning of the network entities in the UMTS core network, based on the traffic and throughput generated or absorbed at the interfaces of the network entities. A case study is provided to verify the algorithms created for the UMTS core network. The paper helps UMTS network operators dimension and build an optimum network to deliver the best quality of service for users. The algorithms developed in the paper have been successfully applied in dimensioning a nationwide UMTS network in North Africa and were adopted in an optimization tool by a mobile operator in the United States in 2008-2009.
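The abstract does not reproduce the paper's interface-level formulas. As a flavor of this style of dimensioning, a minimal sketch using the standard Erlang-B recursion to size channels for a core-network interface; the busy-hour call attempts, holding time, and blocking target are illustrative assumptions, not the paper's figures:

```python
def erlang_b(traffic_erl: float, channels: int) -> float:
    """Erlang-B blocking probability via the standard recursion."""
    b = 1.0
    for n in range(1, channels + 1):
        b = traffic_erl * b / (n + traffic_erl * b)
    return b

def channels_needed(traffic_erl: float, target_blocking: float = 0.01) -> int:
    """Smallest number of channels keeping blocking below the target."""
    n = 1
    while erlang_b(traffic_erl, n) > target_blocking:
        n += 1
    return n

# Illustrative numbers, not taken from the paper:
bhca = 180_000                            # busy-hour call attempts on one interface
holding_time_s = 90                       # mean call holding time in seconds
offered = bhca * holding_time_s / 3600    # offered traffic in Erlangs
print(offered, channels_needed(offered))  # 4500.0 Erl -> required channel count
```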


Electronics ◽  
2020 ◽  
Vol 9 (7) ◽  
pp. 1140
Author(s):  
Jeong-Hee Lee ◽  
Jongseok Kang ◽  
We Shim ◽  
Hyun-Sang Chung ◽  
Tae-Eung Sung

Building a pattern-detection model with a deep learning algorithm on data collected from manufacturing sites is an effective way to support decision-making and assess business feasibility for enterprises, by providing the results and implications of pattern analysis of the big data generated at those sites. Identifying the threshold of an abnormal pattern normally requires collaboration between data analysts and manufacturing-process experts, which is practically difficult and time-consuming. This paper suggests how to derive the threshold setting for abnormal patterns without manual labelling by process experts, and offers an algorithm that predicts potential future failures in advance using a hybrid Convolutional Neural Network (CNN)-Long Short-Term Memory (LSTM) model together with the Fast Fourier Transform (FFT). We found that after preprocessing the data set with the FFT, it is easier to detect abnormal patterns that cannot be found in the time domain. Both training loss and test loss converged to near zero, the lowest loss rate compared with existing models such as a plain LSTM. Our model and preprocessing method greatly help in understanding the abnormal patterns of unlabeled big data produced at manufacturing sites, and can be a strong foundation for detecting the thresholds of abnormal patterns in big data occurring at manufacturing sites.
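The abstract names the components but not the exact architecture. A minimal Keras sketch of the hybrid, assuming magnitude spectra from NumPy's FFT as input and illustrative layer sizes throughout:

```python
import numpy as np
import tensorflow as tf

def fft_features(x: np.ndarray) -> np.ndarray:
    """Magnitude spectrum per window: (n, seq_len) -> (n, seq_len//2 + 1, 1)."""
    spectra = np.abs(np.fft.rfft(x, axis=1))
    return spectra[..., None]                     # add a channel axis for Conv1D

def build_cnn_lstm(n_bins: int) -> tf.keras.Model:
    """Hybrid CNN-LSTM: convolutions extract local spectral features,
    the LSTM models their sequence, and the head outputs an anomaly score."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_bins, 1)),
        tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1),                 # score/reconstruction output
    ])

# Illustrative usage on random data standing in for sensor windows:
x = np.random.rand(256, 128)                      # 256 windows of 128 samples
model = build_cnn_lstm(n_bins=128 // 2 + 1)
model.compile(optimizer="adam", loss="mse")
model.fit(fft_features(x), np.zeros(256), epochs=1, verbose=0)
```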


Author(s):  
R. Nirmalan ◽  
M. Javith Hussain Khan ◽  
V. Sounder ◽  
A. Manikkaraja

The evolution of modern computing technology produces huge amounts of data. The algorithms traditionally used in machine learning may not support the concept of big data. Here we discuss and implement a solution to this problem for predicting breast cancer from big data. DNA methylation (DM) and gene expression (GE) are the two types of data used for prediction. The main objective is to classify each data set separately; to achieve this, we use the Apache Spark platform. We apply three classification algorithms: decision tree, random forest, and the support vector machine (SVM). These three algorithms produce the models used for breast cancer prediction. We analyze which algorithm produces the best result, with good accuracy and a low error rate. Additionally, the Weka and Spark platforms are compared to find which performs better when dealing with huge data. The results show that the scalable SVM classifier gave the best performance of all the classifiers, achieving the lowest error rate and the highest accuracy on the GE data set.
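The abstract does not include code. A minimal PySpark sketch of the three classifiers evaluated on a feature-vector DataFrame; the input path and column layout are illustrative assumptions, not the study's actual data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import (DecisionTreeClassifier,
                                       RandomForestClassifier, LinearSVC)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("breast-cancer").getOrCreate()

# Hypothetical input: rows with a "features" vector column and a binary "label".
data = spark.read.parquet("gene_expression.parquet")   # illustrative path
train, test = data.randomSplit([0.8, 0.2], seed=42)

models = {
    "decision tree": DecisionTreeClassifier(labelCol="label"),
    "random forest": RandomForestClassifier(labelCol="label", numTrees=100),
    "linear SVM":    LinearSVC(labelCol="label", maxIter=50),
}
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

for name, estimator in models.items():
    accuracy = evaluator.evaluate(estimator.fit(train).transform(test))
    print(f"{name}: accuracy={accuracy:.3f}")
```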


2013 ◽  
Vol 99 (4) ◽  
pp. 40-45 ◽  
Author(s):  
Aaron Young ◽  
Philip Davignon ◽  
Margaret B. Hansen ◽  
Mark A. Eggen

ABSTRACT Recent media coverage has focused on the supply of physicians in the United States, especially the impact of a growing physician shortage and the Affordable Care Act. State medical boards and other entities maintain data on physician licensure and discipline, as well as some biographical data describing their physician populations. However, there are gaps in the workforce information available from these sources. The Federation of State Medical Boards' (FSMB) Census of Licensed Physicians and the AMA Masterfile, for example, offer valuable information, but they provide a limited picture of the physician workforce. Furthermore, they are unable to shed light on some of the nuances of physician availability, such as how much time physicians spend providing direct patient care. In response to these gaps, policymakers and regulators have in recent years discussed the creation of a physician minimum data set (MDS), which would be gathered periodically and would provide key physician workforce information. While proponents of an MDS believe it would benefit a variety of stakeholders, no effort has been made to determine whether state medical boards consider it important to collect physician workforce data and whether they currently collect workforce information from licensed physicians. To learn more, the FSMB surveyed the executive directors of state medical boards to determine their perceptions of collecting workforce data and their current practices in collecting such data. The purpose of this article is to convey the results of this effort. The survey findings indicate that the vast majority of boards view physician workforce information as valuable in determining health care needs within their state, and that various boards already collect some data elements. Analysis of the data confirms the potential benefits of a physician MDS and why state medical boards are in a unique position to collect MDS information from physicians.


2020 ◽  
Author(s):  
Carson Lam ◽  
Jacob Calvert ◽  
Gina Barnes ◽  
Emily Pellegrini ◽  
Anna Lynn-Palevsky ◽  
...  

BACKGROUND: In the wake of COVID-19, the United States developed a three-stage plan outlining the parameters for determining when states may reopen businesses and ease travel restrictions. The guidelines also identify subpopulations of Americans who should continue to stay at home because they are at high risk for severe disease should they contract COVID-19. These guidelines were based on population-level demographics rather than individual-level risk factors. As such, they may misidentify individuals at high risk for severe illness who should therefore not return to work until vaccination or widespread serological testing is available.

OBJECTIVE: This study evaluated a machine learning algorithm for the prediction of serious illness due to COVID-19 using inpatient data collected from electronic health records.

METHODS: The algorithm was trained to identify patients for whom a diagnosis of COVID-19 was likely to result in hospitalization, and was compared against four U.S. policy-based criteria: age over 65; having a serious underlying health condition; age over 65 or having a serious underlying health condition; and age over 65 and having a serious underlying health condition.

RESULTS: The algorithm identified 80% of patients at risk for hospitalization due to COVID-19, versus at most 62% identified by the government guidelines. The algorithm also achieved a high specificity of 95%, outperforming the government guidelines.

CONCLUSIONS: This algorithm may help enable a broad reopening of the American economy while ensuring that patients at high risk for serious disease remain home until vaccination and testing become available.
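The abstract does not name the model. A hedged sketch of the comparison it describes, measuring sensitivity and specificity of a generic classifier against one rule-based criterion; the feature set, model choice, and synthetic data are all illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def sens_spec(y_true, y_pred):
    """Sensitivity and specificity from binary labels and predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / np.sum(y_true == 1), tn / np.sum(y_true == 0)

# Synthetic stand-in for EHR features: [age, n_serious_conditions]; not real data.
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(20, 95, 2000), rng.poisson(1.0, 2000)])
y = (rng.random(2000) < 0.05 + 0.002 * X[:, 0] + 0.05 * X[:, 1]).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Rule-based criterion: age over 65 OR any serious underlying condition.
rule_pred = ((X_te[:, 0] > 65) | (X_te[:, 1] > 0)).astype(int)

model = GradientBoostingClassifier().fit(X_tr, y_tr)
print("rule :", sens_spec(y_te, rule_pred))
print("model:", sens_spec(y_te, model.predict(X_te)))
```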

