Towards AI in Array Databases

Author(s):  
Otoniel José Campos Escobar ◽  
Peter Baumann

Multi-dimensional arrays (also known as raster data, gridded data, or datacubes) are key, if not essential, in many science and engineering domains. In the Earth sciences, a significant share of the data produced is array data, and the daily volume is enormous, which makes it hard for researchers to analyze it and extract valuable insight. 1-D sensor data, 2-D satellite imagery, 3-D x/y/t image time series and x/y/z subsurface voxel data, and 4-D x/y/z/t atmospheric and ocean data often amount to dozens of Terabytes per day, and the rate is only expected to increase. In response, Array Database systems were designed and built specifically to provide modeling, storage, and processing support for multi-dimensional arrays. They offer a declarative query language for flexible data retrieval, and some, e.g., rasdaman, provide federated processing and standards-based query capabilities compliant with OGC standards such as WCS, WCPS, and WMS. However, despite these advances, the gap between efficient information retrieval and the actual application of this data remains very broad, especially in the domains of artificial intelligence (AI) and machine learning (ML).

In this contribution, we present the state of the art in performing ML through Array Databases. First, a motivating example is introduced from the Deep Rain project, which aims to enhance rainfall prediction accuracy in mountainous areas by implementing ML code on top of an Array Database. Deep Rain also explores novel methods for training prediction models by implementing server-side ML processing inside the database. A brief introduction to the Array Database rasdaman used in this project is also provided, featuring its standards-based query capabilities and the scalable federated processing required for rainfall data processing. Next, the workflow approach for ML and Array Databases employed in the Deep Rain project is described in detail, listing the benefits of using an Array Database with a declarative query language in the machine learning pipeline. A concrete use case illustrates step by step how these tools integrate. Then, an alternative approach is presented in which ML is performed inside the Array Database using user-defined functions (UDFs). Finally, a detailed comparison between the UDF and workflow approaches is presented, explaining their challenges and benefits.
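
A minimal sketch of the workflow approach described above: a WCPS query pulls a spatio-temporal subset of a datacube out of an Array Database, and the result is fed into an ML model on the client side. The endpoint URL, coverage name (`RainfallCube`), and axis names are illustrative assumptions, not the Deep Rain deployment; the request parameters follow the usual rasdaman WCPS conventions.

```python
# Workflow-approach sketch: query a datacube subset via WCPS, then train a model client-side.
# Endpoint, coverage name, and axis names are hypothetical; adapt to the actual server.
import json
import requests
import numpy as np
from sklearn.ensemble import RandomForestRegressor

WCPS_ENDPOINT = "https://example.org/rasdaman/ows"   # hypothetical rasdaman server

# WCPS query: trim the hypothetical rainfall datacube to an area and time range,
# and ask the server to encode the result as JSON.
wcps_query = """
for $c in (RainfallCube)
return encode(
  $c[Lat(47.0:47.5), Long(10.0:10.5), ansi("2015-01-01":"2015-12-31")],
  "json")
"""

resp = requests.get(
    WCPS_ENDPOINT,
    params={"service": "WCS", "version": "2.0.1",
            "request": "ProcessCoverages", "query": wcps_query},
    timeout=120,
)
resp.raise_for_status()
cube = np.array(json.loads(resp.text))   # assumed 3-D result with time as the first axis

# Toy feature/target split: predict the next time step's mean rainfall from the previous grid.
X = cube[:-1].reshape(cube.shape[0] - 1, -1)
y = cube[1:].mean(axis=(1, 2))
model = RandomForestRegressor(n_estimators=100).fit(X, y)
```

In the alternative UDF approach, the training loop would instead run server-side next to the data; the trade-offs between the two are the subject of the comparison mentioned above.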

2021 ◽  
Vol 5 (3) ◽  
pp. 1-30
Author(s):  
Gonçalo Jesus ◽  
António Casimiro ◽  
Anabela Oliveira

Sensor platforms used in environmental monitoring applications are often subject to harsh environmental conditions while monitoring complex phenomena. Therefore, designing dependable monitoring systems is challenging given the external disturbances affecting sensor measurements. Even the apparently simple task of outlier detection in sensor data becomes a hard problem, amplified by the difficulty of distinguishing true data errors due to sensor faults from deviations caused by natural phenomena, which look like data errors. Existing solutions for runtime outlier detection typically assume that the physical processes can be accurately modeled, or that outliers consist of large deviations that are easily detected and filtered by appropriate thresholds. Other solutions assume that it is possible to deploy multiple sensors providing redundant data to support voting-based techniques. In this article, we propose a new methodology for dependable runtime detection of outliers in environmental monitoring systems, aiming to increase data quality by treating detected outliers. We propose the use of machine learning techniques to model each sensor's behavior, exploiting correlated data provided by other related sensors. Using these models, along with knowledge of past processed measurements, it is possible to obtain accurate estimations of the observed environmental parameters and to build failure detectors that use these estimations. When a failure is detected, the estimations also allow one to correct the erroneous measurements and hence improve overall data quality. Our methodology not only distinguishes truly abnormal measurements from deviations due to complex natural phenomena, but also quantifies the quality of each measurement, which is relevant from a dependability perspective. We apply the methodology to real datasets from a complex aquatic monitoring system measuring temperature and salinity, through which we illustrate the process of building the machine learning prediction models using a technique based on Artificial Neural Networks, denoted ANNODE (ANN Outlier Detection). From this application, we also observe the effectiveness of our ANNODE approach for accurate outlier detection in harsh environments. We then validate these positive results by comparing ANNODE with state-of-the-art solutions for outlier detection. The results show that ANNODE improves on existing solutions in terms of outlier detection accuracy.
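
A minimal sketch in the spirit of the ANN-based estimation described above: a neural network predicts one sensor from correlated sensors, and measurements whose residual exceeds a threshold are flagged and corrected with the estimate. The column names, network size, and 3-sigma threshold are assumptions for illustration, not the paper's exact setup.

```python
# ANNODE-style sketch: estimate a sensor from correlated sensors, flag large residuals,
# and substitute the estimate for flagged measurements. Dataset and columns are hypothetical.
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

df = pd.read_csv("aquatic_sensors.csv")          # hypothetical historical dataset
X_cols = ["salinity_s1", "temperature_s2", "temperature_s3"]
target = "temperature_s1"

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
model.fit(df[X_cols], df[target])

# Residual statistics on (assumed fault-free) training data define the detector threshold.
residuals = df[target] - model.predict(df[X_cols])
threshold = 3 * residuals.std()

def check_measurement(correlated_values, measured):
    """Return (is_outlier, estimate) for a single new measurement."""
    x = pd.DataFrame([correlated_values], columns=X_cols)
    estimate = float(model.predict(x)[0])
    return abs(measured - estimate) > threshold, estimate
```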


Atmosphere ◽  
2020 ◽  
Vol 11 (8) ◽  
pp. 870 ◽  
Author(s):  
Chih-Chiang Wei ◽  
Tzu-Hao Chou

Situated on the main tracks of typhoons in the Northwestern Pacific Ocean, Taiwan frequently suffers disasters from heavy rainfall during typhoons. Accurate and timely typhoon rainfall prediction is therefore an imperative topic. The purpose of this study was to develop a Hadoop Spark distributed framework based on big-data technology to accelerate the computation of typhoon rainfall prediction models. This study used deep neural networks (DNNs) and multiple linear regressions (MLRs) to establish rainfall prediction models and evaluate prediction accuracy. The Hadoop Spark distributed cluster-computing framework was the big-data technology used; it consisted of the Hadoop Distributed File System, the MapReduce framework, and Spark, employed as a new-generation technology to improve the efficiency of distributed computing. The research area was Northern Taiwan, with four surface observation stations as the experimental sites. This study collected 271 typhoon events (from 1961 to 2017). The following results were obtained: (1) in the machine-learning computations, prediction errors increased with prediction duration in both the DNN and MLR models; and (2) the Hadoop Spark framework was faster than the standalone systems (single i7 central processing unit (CPU) and single E3 CPU). When complex computation is required in a model (e.g., DNN model parameter calibration), the big-data-based Hadoop Spark framework can be used to establish highly efficient computation environments. In summary, this study successfully used the big-data Hadoop Spark framework with machine learning to develop rainfall prediction models with effectively improved computing efficiency. The proposed system can therefore address real-time typhoon rainfall prediction with high timeliness and accuracy.
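
As a hedged illustration of the distributed-computation idea above, the sketch below fits a multiple linear regression rainfall model on a Spark cluster with pyspark.ml. The input path and column names are placeholders, not the study's dataset; the DNN variant would typically be trained with a separate deep learning library on top of the same Spark-prepared data.

```python
# Spark MLR sketch for typhoon rainfall prediction; paths and columns are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("typhoon-rainfall-mlr").getOrCreate()

df = spark.read.csv("hdfs:///typhoon_events.csv", header=True, inferSchema=True)

# Assemble meteorological predictors into a single feature vector column.
features = ["pressure", "wind_speed", "humidity", "past_rainfall_1h"]   # assumed columns
assembled = VectorAssembler(inputCols=features, outputCol="features").transform(df)

train, test = assembled.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="rainfall_next_1h").fit(train)

print("RMSE on held-out typhoon events:", model.evaluate(test).rootMeanSquaredError)
spark.stop()
```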


Author(s):  
Sofie Reumers ◽  
Feng Liu ◽  
Davy Janssens ◽  
Geert Wets

The aim of this chapter is to evaluate whether GPS data can be annotated, or semantically enriched, with different activity categories, allowing GPS data to be used in future simulation systems. The data in the study stem from a paper-and-pencil activity-travel diary survey and a corresponding survey in which GPS-enabled Personal Digital Assistants (PDAs) were used. A set of new approaches is proposed that is independent of additional sensor data and map information, which significantly reduces additional costs and makes the techniques relatively easy to transfer to other regions. Furthermore, this chapter makes a detailed comparison of different machine learning algorithms for semantically enriching GPS data with activity type information.
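
An illustrative comparison of classifiers for annotating GPS trip episodes with activity types, in the spirit of the chapter's setup (no map matching, no extra sensors). The feature names, activity labels, and dataset file are assumptions for the sake of the example.

```python
# Compare several classifiers on GPS-derived episode features; data and columns are hypothetical.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

data = pd.read_csv("gps_episodes.csv")                 # hypothetical derived dataset
X = data[["duration_min", "start_hour", "distance_from_home_km", "weekday"]]
y = data["activity_type"]                              # e.g. work, shopping, leisure

for name, clf in [("decision tree", DecisionTreeClassifier(max_depth=5)),
                  ("random forest", RandomForestClassifier(n_estimators=200)),
                  ("naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```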


2019 ◽  
Vol 9 (22) ◽  
pp. 4931 ◽  
Author(s):  
Aguasca-Colomo ◽  
Castellanos-Nieves ◽  
Méndez

We present a comparative study of predictive monthly rainfall models for islands of complex orography using machine learning techniques. The models have been developed for the island of Tenerife (Canary Islands). Weather forecasting is influenced both by local geographic characteristics and by the time horizon considered. The accuracy of mid-term rainfall prediction on islands with complex orography is generally low when carried out with atmospheric models. Predictive models based on algorithms such as Random Forest and Extreme Gradient Boosting, among others, were analyzed. The predictors used in the models include weather variables measured at two main meteorological stations, reanalysis predictors from the National Oceanic and Atmospheric Administration, and the global North Atlantic Oscillation predictor, all obtained over a period of more than four decades. When comparing the proposed models, we evaluated accuracy, kappa, and interpretability of the resulting models, as well as the relevance of the predictors used. The results show that global predictors such as the North Atlantic Oscillation Index (NAO) have very low influence, while the local Geopotential Height (GPH) predictor is relatively more important. Machine learning prediction models are a relevant proposition for predicting medium-term precipitation in similar geographical regions.
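
A hedged sketch of the model-comparison setup described above: train a Random Forest on monthly predictors (station observations, reanalysis fields, the NAO index) and report accuracy, Cohen's kappa, and predictor importances. The file and column names are assumptions, not the study's actual data.

```python
# Random Forest rainfall-class model with accuracy, kappa, and feature importances.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("tenerife_monthly.csv")                       # hypothetical dataset
predictors = ["station_temp", "station_pressure", "gph_500hpa", "nao_index", "month"]
X_train, X_test, y_train, y_test = train_test_split(
    df[predictors], df["rain_class"], test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
pred = rf.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("kappa:   ", cohen_kappa_score(y_test, pred))
# Importances hint at the relative relevance of local (e.g. GPH) vs. global (NAO) predictors.
for name, imp in sorted(zip(predictors, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```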


Author(s):  
Dr. Vivek Waghmare

Rainfall prediction is one of the most challenging and uncertain tasks, and it has a profound effect on human society. Timely and accurate forecasting can significantly reduce human and financial losses. This study presents a collection of experiments using conventional machine learning techniques to create rainfall prediction models based on local weather information. The comparative research was conducted with a focus on three aspects: modeling inputs, modeling methods, and prioritization techniques. The results compare the test metrics of these machine learning methods and their reliability in estimating rainfall from weather data. The study seeks a unique and effective machine learning approach for predicting rainfall, and it experimented with different rainfall parameters from various regions in order to assess the efficiency and robustness of the model. Rainfall data are collected, and models are trained and tested to achieve reliable outcomes. The monthly rainfall predictions obtained after training and testing are then compared to real data to verify the accuracy of the model. The results indicate that the model is successful in predicting monthly rainfall and the associated parameters.


2019 ◽  
Author(s):  
Oskar Flygare ◽  
Jesper Enander ◽  
Erik Andersson ◽  
Brjánn Ljótsson ◽  
Volen Z Ivanov ◽  
...  

**Background:** Previous attempts to identify predictors of treatment outcomes in body dysmorphic disorder (BDD) have yielded inconsistent findings. One way to increase precision and clinical utility could be to use machine learning methods, which can incorporate multiple non-linear associations in prediction models. **Methods:** This study used a random forests machine learning approach to test whether it is possible to reliably predict remission from BDD in a sample of 88 individuals who had received internet-delivered cognitive behavioral therapy for BDD. The random forest models were compared to traditional logistic regression analyses. **Results:** Random forests correctly identified 78% of participants as remitters or non-remitters at post-treatment. The accuracy of prediction was lower at subsequent follow-ups (68%, 66%, and 61% correctly classified at 3-, 12-, and 24-month follow-ups, respectively). Depressive symptoms, treatment credibility, working alliance, and initial severity of BDD were among the most important predictors at the beginning of treatment. By contrast, the logistic regression models did not identify consistent and strong predictors of remission from BDD. **Conclusions:** The results provide initial support for the clinical utility of machine learning approaches in the prediction of outcomes of patients with BDD. **Trial registration:** ClinicalTrials.gov ID: NCT02010619.
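
A minimal sketch of the comparison described above: a random forest versus a logistic regression for predicting post-treatment remission from baseline predictors. The variable names and data file are placeholders, not the study's actual dataset.

```python
# Random forest vs. logistic regression for remission prediction; data and columns are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("bdd_trial.csv")                       # hypothetical trial data
X = df[["bdd_severity_baseline", "depressive_symptoms",
        "treatment_credibility", "working_alliance"]]
y = df["remission_post"]                                # 1 = remitter, 0 = non-remitter

for name, clf in [("random forest", RandomForestClassifier(n_estimators=1000, random_state=0)),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy {acc:.2f}")
```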


2020 ◽  
Author(s):  
Sina Faizollahzadeh Ardabili ◽  
Amir Mosavi ◽  
Pedram Ghamisi ◽  
Filip Ferdinand ◽  
Annamaria R. Varkonyi-Koczy ◽  
...  

Several outbreak prediction models for COVID-19 are being used by officials around the world to make informed decisions and enforce relevant control measures. Among the standard models for global COVID-19 pandemic prediction, simple epidemiological and statistical models have received more attention from authorities and are popular in the media. Due to a high level of uncertainty and a lack of essential data, standard models have shown low accuracy for long-term prediction. Although the literature includes several attempts to address this issue, the generalization and robustness of existing models need to be improved. This paper presents a comparative analysis of machine learning and soft computing models to predict the COVID-19 outbreak as an alternative to SIR and SEIR models. Among a wide range of machine learning models investigated, two showed promising results: the multi-layered perceptron (MLP) and the adaptive network-based fuzzy inference system (ANFIS). Based on the results reported here, and given the highly complex nature of the COVID-19 outbreak and the variation in its behavior from nation to nation, this study suggests machine learning as an effective tool for modeling the outbreak. This paper provides an initial benchmark to demonstrate the potential of machine learning for future research. It further suggests that real novelty in outbreak prediction can be realized by integrating machine learning and SEIR models.
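
A hedged sketch of fitting an MLP to cumulative outbreak counts as an alternative to SIR/SEIR curve fitting, as discussed above. The data file and forecast horizon are illustrative assumptions; an ANFIS variant would require a dedicated fuzzy-inference library.

```python
# MLP regression on cumulative case counts; dataset and columns are hypothetical.
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

df = pd.read_csv("covid_country.csv")            # hypothetical: day index, cumulative cases
X = df[["day"]].values
y = np.log1p(df["cumulative_cases"].values)      # log scale tames exponential growth

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000, random_state=1),
)
model.fit(X, y)

future_days = np.arange(df["day"].max() + 1, df["day"].max() + 15).reshape(-1, 1)
forecast = np.expm1(model.predict(future_days))  # two-week-ahead case estimates
```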


2020 ◽  
Author(s):  
Nalika Ulapane ◽  
Karthick Thiyagarajan ◽  
Sarath Kodagoda

Classification has become a vital task in modern machine learning and Artificial Intelligence applications, including smart sensing. Numerous machine learning techniques are available to perform classification. Similarly, numerous practices, such as feature selection (i.e., selection of a subset of descriptor variables that optimally describe the output), are available to improve classifier performance. In this paper, we consider the case of a given supervised learning classification task that has to be performed making use of continuous-valued features. It is assumed that an optimal subset of features has already been selected. Therefore, no further feature reduction, or feature addition, is to be carried out. Then, we attempt to improve the classification performance by passing the given feature set through a transformation that produces a new feature set which we have named the “Binary Spectrum”. Via a case study example done on some Pulsed Eddy Current sensor data captured from an infrastructure monitoring task, we demonstrate how the classification accuracy of a Support Vector Machine (SVM) classifier increases through the use of this Binary Spectrum feature, indicating the feature transformation’s potential for broader usage.


2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade and is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set using machine learning methods. Taking advantage of their ability to find non-intuitive regularities in high-dimensional datasets, machine learning methods can be used to build effective predictive models. A total of 6554 descriptors were collected for each compound, and an efficient descriptor selection method was used to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT), and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best among these methods, with R2 = 0.84, MSE = 0.55 for the training set and R2 = 0.83, MSE = 0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
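
A hedged sketch of the modelling step described above: fit a support vector regression to selected molecular descriptors and report R2 and MSE on the training and test sets. The file and column names are placeholders for the curated descriptor table.

```python
# SVM regression on selected descriptors with train/test R2 and MSE; data layout is assumed.
import pandas as pd
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("thrombin_descriptors.csv")          # hypothetical: selected descriptors + pKi
X, y = df.drop(columns=["pKi"]), df["pKi"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

svm = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svm.fit(X_train, y_train)

for label, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = svm.predict(Xs)
    print(f"{label}: R2={r2_score(ys, pred):.2f}, MSE={mean_squared_error(ys, pred):.2f}")
```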


2020 ◽  
Vol 16 ◽  
Author(s):  
Nitigya Sambyal ◽  
Poonam Saini ◽  
Rupali Syal

Background and Introduction: Diabetes mellitus is a metabolic disorder that has emerged as a serious public health issue worldwide. According to the World Health Organization (WHO), without interventions, the number of diabetes cases is expected to reach at least 629 million by 2045. Uncontrolled diabetes gradually leads to progressive damage to the eyes, heart, kidneys, blood vessels, and nerves. Method: The paper presents a critical review of existing statistical and Artificial Intelligence (AI) based machine learning techniques with respect to diabetes mellitus (DM) complications, namely retinopathy, neuropathy, and nephropathy. The statistical and machine learning analytic techniques are used to structure the subsequent content review. Result: It has been inferred that statistical analysis can help only in inferential and descriptive analysis, whereas AI-based machine learning models can provide actionable prediction models for faster and more accurate diagnosis of complications associated with DM. Conclusion: The integration of AI-based analytics techniques such as machine learning and deep learning in clinical medicine will result in improved disease management through faster disease detection and reduced cost of disease treatment.

