Very Large Datasets
Recently Published Documents

TOTAL DOCUMENTS: 152 (five years: 45)
H-INDEX: 17 (five years: 3)
Author(s):  
Sheeba Armoogum ◽  
Nawaz Mohamudally

Voice over Internet Protocol (VoIP) is a relatively recent voice communication technology, and its variety of calling capabilities is expected to fuel further market growth over the next five years. However, there are serious security concerns, since VoIP systems are frequently attacked. According to recent security alliance reports, malicious activity against VoIP and other vulnerable networks increased substantially during the current pandemic. This implies that existing models are not sufficiently reliable, since most do not achieve a one hundred percent detection rate. In this paper, we review our most recently developed Intrusion Detection and Prevention Systems (IDPS) and present a comparative analysis. The final work comprises ten models that address intentional flooding attacks in order to mitigate VoIP attacks. The methodological approaches of the studies included the quantitative and scientific paradigms, using several instruments (comparative analysis and experiments). Six prevention models were developed using three sorting methods combined with either a modified galloping algorithm or an extended quadratic algorithm. The seventh IDPS was designed by improving an existing genetic algorithm (e-GAP), and the eighth model is a novel deep learning method known as the Closest Adjacent Neighbour (CAN). Finally, for a better comparative analysis of AI-based algorithms, a Deep Analysis of the Intruder Tracing (DAIT) model using a bottom-up approach was developed to address processing time, effectiveness, and efficiency, which are challenges when handling very large datasets of incoming messages. This novel method prevents intruders from accessing a system without authorization and avoids anomaly filtering at the firewall, with minimal processing time. Results revealed that the DAIT and e-GAP models are very efficient and outperform the benchmarked models. These two models obtained an F-score of 98.83%, a detection rate of 100%, a false-positive rate of 0%, and an accuracy of 98.7%, with processing times per message of 0.092 ms and 0.094 ms, respectively. Compared with previous models in the literature, which report detection rates of 95.5% and false-positive alarm rates of around 1.8%, with the exception of one recent machine learning-based model with a detection rate of 100% and a processing time of 0.53 ms, the DAIT and e-GAP models give better results.
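The evaluation metrics reported above follow the standard confusion-matrix definitions. As a minimal, illustrative Python sketch (not the authors' code; the counts are hypothetical), the detection rate, false-positive rate, accuracy, and F-score can be computed as follows:

```python
# Illustrative computation of the detection metrics reported above,
# assuming a binary classification of messages (attack vs. benign).
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    detection_rate = tp / (tp + fn)            # a.k.a. recall / true-positive rate
    false_positive_rate = fp / (fp + tn)       # benign messages flagged as attacks
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = (2 * precision * detection_rate / (precision + detection_rate)
               if (precision + detection_rate) else 0.0)
    return {
        "detection_rate": detection_rate,
        "false_positive_rate": false_positive_rate,
        "accuracy": accuracy,
        "f_score": f_score,
    }

# Example with made-up counts, not the paper's data:
print(detection_metrics(tp=990, fp=0, tn=980, fn=0))
```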


2021 ◽  
Vol 17 (12) ◽  
pp. e1009586
Author(s):  
Yanan Long ◽  
Qi Chen ◽  
Henrik Larsson ◽  
Andrey Rzhetsky

The human sex ratio at birth (SRB), defined as the ratio of the number of newborn boys to the total number of newborns, is typically slightly greater than 1/2 (more boys than girls) and tends to vary across geographical regions and time periods. In this large-scale study, we sought to validate previously reported associations and test new hypotheses using statistical analysis of two very large datasets incorporating electronic medical records (EMRs). One of the datasets represents over half (∼150 million) of the US population for over 8 years (IBM Watson Health MarketScan insurance claims), while the other covers the entire Swedish population (∼9 million) for over 30 years (the Swedish National Patient Register). After testing more than 100 hypotheses, we showed that neither dataset supported models in which the SRB changed seasonally or in response to variations in ambient temperature. However, increased levels of a diverse array of air and water pollutants were associated with lower SRBs, as were increased levels of industrial and agricultural activity, which served as proxies for water pollution. Moreover, some exogenous factors generally considered to be environmental toxins were associated with higher SRBs. Finally, we identified new factors with signals for either higher or lower SRBs. In all cases, the effect sizes were modest but highly statistically significant owing to the large sizes of the two datasets. We suggest that, while it is unlikely that the associations arose from sex-specific selection mechanisms, they may still be useful for public health surveillance if they can be corroborated by empirical evidence.
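As a small illustration of the quantity under study (not the authors' analysis; the birth counts are hypothetical), the SRB and a simple binomial test against an even ratio could be computed as:

```python
# Illustrative SRB computation with hypothetical birth counts.
from scipy.stats import binomtest

boys, girls = 51_500, 48_500               # hypothetical counts for one region/period
newborns = boys + girls
srb = boys / newborns                      # sex ratio at birth as defined above
result = binomtest(boys, newborns, p=0.5)  # test departure from an even ratio
print(f"SRB = {srb:.4f}, p-value = {result.pvalue:.3g}")
```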


2021 ◽  
Vol 229 (4) ◽  
pp. 241-244
Author(s):  
Felix Speckmann

Abstract. When people use the Internet, they leave traces of their activities: blog posts, comments, articles, social media posts, etc. These traces represent behavior that psychologists can analyze. A method that makes downloading these sometimes very large datasets feasible is web scraping, which involves writing a program to automatically download specific parts of a website. The obtained data can be used to exploratorily generate new hypotheses, test existing ones, or extend existing research. The present Research Spotlight explains web scraping and discusses the possibilities, limitations, and ethical and legal challenges associated with the approach.
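As a minimal sketch of what such a scraping program might look like, assuming the requests and BeautifulSoup libraries and a placeholder URL and CSS selector (any real use should respect the site's terms of service and robots.txt):

```python
# Minimal web-scraping sketch: download a page and extract post titles.
# The URL and the "article h2" selector are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/blog"
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
titles = [h.get_text(strip=True) for h in soup.select("article h2")]

for title in titles:
    print(title)
```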


2021 ◽  
Vol 11 (22) ◽  
pp. 11033
Author(s):  
Madiha Khalid ◽  
Muhammad Murtaza Yousaf

The emergence of social media, the worldwide web, electronic transactions, and next-generation sequencing not only opens new horizons of opportunity but also leads to the accumulation of massive amounts of data. The rapid growth of digital data generated from diverse sources makes traditional storage, processing, and analysis methods inadequate. These limitations have led to the development of new technologies to process and store very large datasets. As a result, several execution frameworks have emerged for big data processing. Hadoop MapReduce, the pioneering framework, laid the groundwork for subsequent frameworks that improve large-scale data processing and development in many ways. This research focuses on comparing the most prominent and widely used frameworks in the open-source landscape. We identify key requirements of a big data framework and review each of these frameworks from the perspective of those requirements. To enhance the clarity of comparison and analysis, we group logically related features into feature vectors. We design seven feature vectors and present a comparative analysis of the frameworks with respect to those feature vectors. We identify use cases and highlight the strengths and weaknesses of each framework. Moreover, we present a detailed discussion that can serve as a decision-making guide for selecting the appropriate framework for an application.
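To make the MapReduce programming model mentioned above concrete, here is a framework-agnostic word-count sketch in plain Python; it mimics the map, shuffle, and reduce phases in memory and is not tied to Hadoop or any specific engine:

```python
# Word count expressed in the MapReduce style: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(document: str):
    # Emit (word, 1) pairs, as a MapReduce mapper would.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    # Sum the values emitted for one key, as a reducer would.
    return word, sum(counts)

documents = ["big data frameworks", "big data processing frameworks"]

# Shuffle: group all intermediate values by key.
grouped: dict[str, list[int]] = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())
print(word_counts)  # {'big': 2, 'data': 2, 'frameworks': 2, 'processing': 1}
```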


Lubricants ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 109
Author(s):  
Philipp Renhart ◽  
Michael Maier ◽  
Christopher Strablegg ◽  
Florian Summer ◽  
Florian Grün ◽  
...  

The measurement of acoustic emission data in experiments reveals informative details about the tribological contact. The required recording rate for conclusive datasets ranges up to several megahertz, which typically results in very large datasets for long-term measurements. As a consequence, acoustic emissions are mostly acquired at predefined cyclic time intervals, which leads to many blind spots. The following work presents methods for effective postprocessing and a feature-based data acquisition method. Additionally, a two-stage wear mechanism for bearings was found with the described method and substantiated by a numerical simulation.
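A feature-based acquisition scheme of this kind typically stores a raw waveform segment only when a signal feature, for example a windowed RMS level, exceeds a threshold. The sketch below is a generic illustration of that idea rather than the authors' implementation; the window length and threshold are arbitrary:

```python
# Generic feature-based triggering: keep only windows whose RMS exceeds a threshold.
import numpy as np

def feature_triggered_windows(signal: np.ndarray, window: int, rms_threshold: float):
    """Yield (start_index, window_samples) for windows with RMS above the threshold."""
    for start in range(0, len(signal) - window + 1, window):
        chunk = signal[start:start + window]
        rms = np.sqrt(np.mean(chunk ** 2))
        if rms > rms_threshold:
            yield start, chunk

# Synthetic example: low-level noise with one short burst.
rng = np.random.default_rng(0)
sig = rng.normal(0.0, 0.01, 1_000_000)
sig[400_000:400_500] += 0.5 * np.sin(np.linspace(0, 50 * np.pi, 500))

kept = list(feature_triggered_windows(sig, window=1_000, rms_threshold=0.05))
print(f"stored {len(kept)} of {len(sig) // 1_000} windows")
```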


2021 ◽  
Author(s):  
Andrew Melnyk ◽  
Fatemeh Mohebbi ◽  
Sergey Knyazev ◽  
Bikram Sahoo ◽  
Roya Hosseini ◽  
...  

The availability of millions of SARS-CoV-2 sequences in public databases such as GISAID and EMBL-EBI (UK) allows a detailed study of the evolution, genomic diversity, and dynamics of a virus like never before. Here we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intra-host viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies than other methods, and we are able to reduce entropy even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the UK and GISAID datasets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), Gamma, and Zeta (Brazil) variants in the GISAID dataset. Finally, we show that each identified variant has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large datasets.
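The abstract does not give the exact definition of clustering entropy; one common formulation scores each cluster by the Shannon entropy of the variant labels it contains and averages over clusters weighted by cluster size. A sketch under that assumption:

```python
# Weighted average Shannon entropy of label composition per cluster
# (an assumed formulation; the paper may define clustering entropy differently).
import math
from collections import Counter

def clustering_entropy(cluster_ids: list[int], labels: list[str]) -> float:
    clusters: dict[int, list[str]] = {}
    for cid, label in zip(cluster_ids, labels):
        clusters.setdefault(cid, []).append(label)

    total = len(labels)
    entropy = 0.0
    for members in clusters.values():
        counts = Counter(members)
        h = -sum((c / len(members)) * math.log2(c / len(members)) for c in counts.values())
        entropy += (len(members) / total) * h   # weight by cluster size
    return entropy

# Toy example: one pure cluster and one mixed cluster.
print(clustering_entropy([0, 0, 0, 1, 1, 1],
                         ["Alpha", "Alpha", "Alpha", "Beta", "Alpha", "Beta"]))
```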


PLoS Biology ◽  
2021 ◽  
Vol 19 (7) ◽  
pp. e3001347
Author(s):  
Hanxin Zhang ◽  
Atif Khan ◽  
Qi Chen ◽  
Henrik Larsson ◽  
Andrey Rzhetsky

Seasonal affective disorder (SAD) famously follows annual cycles, with incidence elevation in the fall and spring. Should some version of a cyclic annual pattern be expected from other psychiatric disorders? Would annual cycles be similar for distinct psychiatric conditions? This study probes these questions using 2 very large datasets describing the health histories of 150 million unique U.S. citizens and the entire Swedish population. We performed 2 types of analysis, using "uncorrected" and "corrected" observations. The former analysis focused on counts of daily patient visits associated with each disease. The latter analysis instead looked at the proportion of disease-specific visits within the total volume of visits for a time interval. In the uncorrected analysis, we found that the annual patterns of psychiatric disorders were remarkably similar across the studied diseases in both countries, with the magnitude of annual variation significantly higher in Sweden than in the United States for psychiatric, but not infectious, diseases. In the corrected analysis, only 1 group of patients, those 11 to 20 years old, reproduced all the regularities we observed for psychiatric disorders in the uncorrected analysis; the annual healthcare-seeking visit patterns associated with other age groups changed drastically. Analogous analyses of infectious diseases were less divergent across these 2 types of computation. Comparing these 2 sets of results in the context of published psychiatric disorder seasonality studies, we tend to believe that the uncorrected results are more likely to capture the real trends, while the corrected results perhaps mostly reflect artifacts driven by dominant fluctuations in healthcare-seeking visits across a given year. However, the divergent results are ultimately inconclusive; thus, we present both sets of results unredacted and, in the spirit of full disclosure, leave the verdict to the reader.
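The difference between the "uncorrected" and "corrected" analyses is easy to state in code: the former uses raw daily visit counts for a diagnosis, the latter the share of that diagnosis within all visits on the same day. A hypothetical pandas sketch (the column names are assumptions, not the studies' actual schema):

```python
# Hypothetical visit-level table with columns: date, diagnosis.
import pandas as pd

visits = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01"] * 3 + ["2015-01-02"] * 4),
    "diagnosis": ["SAD", "flu", "SAD", "SAD", "flu", "flu", "flu"],
})

# Uncorrected: daily count of visits for one diagnosis.
uncorrected = (visits[visits["diagnosis"] == "SAD"]
               .groupby("date").size().rename("sad_visits"))

# Corrected: the same counts divided by the total visit volume that day.
total = visits.groupby("date").size().rename("all_visits")
corrected = (uncorrected / total).rename("sad_share")

print(pd.concat([uncorrected, total, corrected], axis=1))
```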


Author(s):  
E. Sanchez Castillo ◽  
D. Griffiths ◽  
J. Boehm

Abstract. This paper proposes a semantic segmentation pipeline for terrestrial laser scanning data. We achieve this by combining co-registered RGB and 3D point cloud information. Semantic segmentation is performed by applying a pre-trained, off-the-shelf 2D convolutional neural network to a set of projected images extracted from a panoramic photograph. This allows the network to exploit the visual image features learnt by state-of-the-art segmentation models trained on very large datasets. The study focuses on adopting the spherical information from the laser capture and assessing the results using image classification metrics. The obtained results demonstrate that the approach is a promising alternative for asset identification in laser scanning data. We demonstrate performance comparable to spherical machine learning frameworks while avoiding both the labelling and training effort such approaches require.
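The core step of running an off-the-shelf pre-trained 2D segmentation network on projected images can be sketched with torchvision; this is a generic illustration rather than the authors' pipeline, and the projection from the panorama is assumed to have already produced an ordinary perspective image:

```python
# Run a pre-trained 2D semantic segmentation model on a projected RGB image.
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

# torchvision >= 0.13; older releases use pretrained=True instead of weights="DEFAULT".
model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("projected_view_000.png").convert("RGB")  # placeholder filename
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)["out"]          # shape: [1, num_classes, H, W]
per_pixel_class = logits.argmax(dim=1)    # class index for every pixel
print(per_pixel_class.shape)
```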


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3793
Author(s):  
Nan Wu ◽  
Kazuhiko Kawamoto

Large datasets are often used to improve the accuracy of action recognition. However, very large datasets are problematic because, for example, their annotation is labor-intensive. This has encouraged research in zero-shot action recognition (ZSAR). Presently, most ZSAR methods recognize actions from individual video frames. These methods are affected by lighting, camera angle, and background, and most cannot process time-series data, which reduces their accuracy. In this paper, to solve these problems, we propose a three-stream graph convolutional network that processes both types of data. Our model has two parts: one processes RGB data, which contains extensive useful information, and the other processes skeleton data, which is not affected by lighting and background. By combining these two outputs with a weighted sum, our model predicts the final results for ZSAR. Experiments conducted on three datasets demonstrate that our model achieves greater accuracy than a baseline model. Moreover, we show that our model can learn from human experience, which can make it more accurate.
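The weighted-sum fusion of the two streams is straightforward; a minimal PyTorch sketch in which the stream outputs and the fusion weight are placeholders rather than the paper's architecture:

```python
# Late fusion of two stream outputs by a weighted sum of their class scores.
import torch

num_classes = 51          # hypothetical number of action classes
rgb_scores = torch.randn(8, num_classes)       # stand-in for the RGB-stream output
skeleton_scores = torch.randn(8, num_classes)  # stand-in for the skeleton-stream output

alpha = 0.6               # assumed fusion weight for the RGB stream
fused = alpha * rgb_scores + (1.0 - alpha) * skeleton_scores

predictions = fused.argmax(dim=1)  # predicted class per video clip
print(predictions)
```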


2021 ◽  
Vol 28 (2) ◽  
pp. 181-212
Author(s):  
Jonathan M. Lilly ◽  
Paula Pérez-Brunius

Abstract. A method for objectively extracting the displacement signals associated with coherent eddies from Lagrangian trajectories is presented, refined, and applied to a large dataset of 3770 surface drifters from the Gulf of Mexico. The method, wavelet ridge analysis, is a general method for the analysis of modulated oscillations, here modified to be more suitable to the eddy-detection problem. A means for formally assessing statistical significance is introduced, addressing the issue of false positives arising by chance from an unstructured turbulent background and opening the door to confident application of the method to very large datasets. Significance is measured through a frequency-dependent comparison with a stochastic dataset having statistical and spectral properties that match the original, but lacking organized oscillations due to eddies or waves. The application to the Gulf of Mexico reveals major asymmetries between cyclones and anticyclones, with anticyclones dominating at radii larger than about 50 km, but an unexpectedly rich population of highly nonlinear cyclones dominating at smaller radii. Both the method and the Gulf of Mexico eddy dataset are made freely available to the community for noncommercial use in future research.
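One standard way to build a stochastic comparison dataset that matches the spectral properties of the original series while destroying organized oscillations is Fourier phase randomization; the sketch below illustrates that general idea only and is not the specific stochastic model used in the paper:

```python
# Phase-randomized surrogate of a real-valued time series: same power spectrum,
# randomized phases, so coherent oscillatory events are destroyed.
import numpy as np

def phase_randomized_surrogate(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    phases[0] = 0.0                       # keep the zero-frequency component real
    phases[-1] = 0.0                      # keep the Nyquist bin real for even-length input
    surrogate = np.abs(spectrum) * np.exp(1j * phases)
    return np.fft.irfft(surrogate, n=len(x))

rng = np.random.default_rng(42)
t = np.arange(2048)
signal = np.sin(2 * np.pi * t / 64) + 0.3 * rng.standard_normal(t.size)  # toy oscillatory series
surrogate = phase_randomized_surrogate(signal, rng)
print(signal.std(), surrogate.std())      # similar variance, but no coherent oscillation
```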

