Aggregate-based Training Phase for ML-based Cardinality Estimation

Author(s):  
Lucas Woltmann ◽  
Claudio Hartmann ◽  
Dirk Habich ◽  
Wolfgang Lehner

Abstract Cardinality estimation is a fundamental task in database query processing and optimization. As recent papers have shown, machine learning (ML)-based approaches can deliver more accurate cardinality estimates than traditional approaches. However, learning a data-dependent ML model requires executing a large number of training queries during the model training phase, which makes that phase very time-consuming. Many of these training or example queries use the same base data, have the same query structure, and differ only in their selection predicates. To speed up the model training phase, our core idea is to compute a predicate-independent pre-aggregation of the base data and to execute the example queries over this pre-aggregated data. Based on this idea, we present a specific aggregate-based training phase for ML-based cardinality estimation approaches in this paper. As we show with different workloads in our evaluation, our aggregate-based training phase achieves an average speedup of 90 and thus outperforms indexes.
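To make the core idea concrete, here is a minimal sketch of answering predicate-only-varying training queries from a one-time pre-aggregate instead of the base data. Table name, column names, and the predicate are illustrative assumptions, not the authors' setup:

```python
import pandas as pd

# Hypothetical base table and predicate columns; the approach builds a
# predicate-independent pre-aggregate once, then answers every training
# query against it instead of scanning the (much larger) base data.
base = pd.read_csv("orders.csv")            # hypothetical input file
pred_cols = ["status", "priority"]          # columns used in predicates

# One-time pre-aggregation: group counts over the predicate columns.
agg = base.groupby(pred_cols).size().reset_index(name="cnt")

def true_cardinality(predicate):
    """Answer a training query's COUNT(*) from the aggregate.

    `predicate` is a boolean mask over the small aggregate, e.g.
    lambda df: (df.status == "open") & (df.priority > 2).
    """
    return int(agg[predicate(agg)]["cnt"].sum())

# Example training query: SELECT COUNT(*) FROM orders
#   WHERE status = 'open' AND priority > 2
label = true_cardinality(lambda df: (df.status == "open") & (df.priority > 2))
```

Since every training query with predicates over `pred_cols` hits only the aggregate, generating thousands of cardinality labels no longer touches the base data at all.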

2021 ◽  
Vol 13 (4) ◽  
pp. 94
Author(s):  
Haokun Fang ◽  
Quan Qian

Privacy protection has become an important concern alongside the great success of machine learning. This paper proposes a multi-party privacy-preserving machine learning framework, named PFMLP, based on partially homomorphic encryption and federated learning. The core idea is that all learning parties transmit only gradients encrypted with homomorphic encryption. Experiments show that a model trained with PFMLP achieves almost the same accuracy as one trained without encryption, with a deviation of less than 1%. To address the computational overhead of homomorphic encryption, we use an improved Paillier algorithm that speeds up training by 25–28%. Comparisons regarding encryption key length, learning network structure, number of learning clients, etc., are also discussed in detail in the paper.
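A minimal sketch of the gradient-exchange pattern, using the python-paillier (`phe`) package as a stand-in for the paper's improved Paillier variant; gradient values and key length are illustrative assumptions:

```python
from phe import paillier

# Keypair held by the designated key holder; workers get the public key.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Hypothetical per-party gradients for one model parameter.
party_gradients = [0.12, -0.05, 0.31]

# Each party encrypts locally; only ciphertexts leave the party.
encrypted = [public_key.encrypt(g) for g in party_gradients]

# The aggregator sums ciphertexts without ever seeing a plaintext
# gradient (additive homomorphism of Paillier).
encrypted_sum = encrypted[0]
for c in encrypted[1:]:
    encrypted_sum = encrypted_sum + c

# The key holder decrypts only the aggregate and averages it.
avg_gradient = private_key.decrypt(encrypted_sum) / len(party_gradients)
print(avg_gradient)  # ~0.1267
```

The design point is that individual gradients are never exposed: the aggregator works purely on ciphertexts, and decryption only ever reveals the sum.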


2021 ◽  
Vol 5 (2) ◽  
pp. 312-318
Author(s):  
Rima Dias Ramadhani ◽  
Afandi Nur Aziz Thohari ◽  
Condro Kartiko ◽  
Apri Junaidi ◽  
Tri Ginanjar Laksana ◽  
...  

Waste consists of goods or materials that no longer have value in the scope of production; in some cases it is disposed of carelessly and can damage the environment. In 2019, the Indonesian government recorded 66-67 million tons of waste, up from 64 million tons the previous year. Waste is differentiated by type into organic and inorganic waste. In computer science, the type of waste can be recognized using a camera and the Convolutional Neural Network (CNN) method, a type of neural network that receives input in the form of images. The input is trained using a CNN architecture to produce a model that can recognize the input object. This study optimizes the CNN method to obtain accurate results in identifying types of waste. The optimization adds several hyperparameters to the CNN architecture: with them, the accuracy reaches 91.2%, whereas without them it is only 67.6%. Three hyperparameters are used to increase the model's accuracy: dropout, padding, and stride. A dropout rate of 20% is used to reduce overfitting during training, while padding and stride are used to speed up the model training process.
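A minimal Keras sketch of this kind of architecture: 'same' padding with a stride of 2 shrinks feature maps early (faster training), and 20% dropout curbs overfitting. Layer counts and sizes are illustrative assumptions, not the authors' exact network:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    # Strided convolutions with 'same' padding downsample cheaply.
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dropout(0.2),                    # 20% dropout against overfitting
    layers.Dense(2, activation="softmax"),  # organic vs. inorganic
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```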


Sensors ◽  
2020 ◽  
Vol 20 (7) ◽  
pp. 2007
Author(s):  
Ruizhe Shao ◽  
Chun Du ◽  
Hao Chen ◽  
Jun Li

With the development of unmanned aerial vehicle (UAV) techniques, UAV images are becoming more widely used. However, as an essential step in UAV image applications, stitching remains computationally time-intensive, especially for emergency applications. To address this issue, we propose a novel approach, called FUIS (fast UAV image stitching), that uses the position and pose information of UAV images to speed up image stitching. FUIS stitches images by feature points; however, unlike traditional approaches, it rapidly finds a few anchor matches instead of a large number of feature matches. First, we design a method to select, from a large number of feature points, a small number that are most helpful for stitching as anchor points. Then, we propose a method to match these anchor points more quickly and accurately using position and pose information. Experiments show that our method significantly reduces time consumption compared with state-of-the-art approaches while maintaining accuracy.
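An illustrative sketch (not FUIS itself) of how position priors can narrow matching: GPS offsets between two near-nadir UAV shots predict roughly where an anchor point from image A lands in image B, so matching can search a small window instead of the whole image. The flat-ground assumption, axis convention, and ground sample distance are all simplifying assumptions:

```python
import numpy as np

def predicted_location(pt_a, gps_a, gps_b, ground_sample_distance):
    """Predict the pixel location in image B of a point seen in image A.

    Assumes near-nadir views and flat ground; `ground_sample_distance`
    is metres per pixel, derived from altitude and optics.
    """
    shift_m = np.asarray(gps_b) - np.asarray(gps_a)   # metres east/north
    shift_px = shift_m / ground_sample_distance
    # A camera moving east shifts scene content west in the image.
    return np.asarray(pt_a, dtype=float) - shift_px

# Anchor at pixel (900, 400) in image A; UAV moved 12 m east, 3 m north.
guess = predicted_location((900, 400), (0.0, 0.0), (12.0, 3.0), 0.05)
# Match only within, say, a 50 px radius around `guess` in image B,
# instead of exhaustively matching thousands of descriptors.
```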


2020 ◽  
Vol 34 (2) ◽  
pp. 143-164 ◽  
Author(s):  
Tobias Baur ◽  
Alexander Heimerl ◽  
Florian Lingenfelser ◽  
Johannes Wagner ◽  
Michel F. Valstar ◽  
...  

Abstract In the following article, we introduce a novel workflow, which we subsume under the term “explainable cooperative machine learning”, and show its practical application in a data annotation and model training tool called NOVA. The main idea of our approach is to interactively incorporate the ‘human in the loop’ when training classification models from annotated data. In particular, NOVA offers a collaborative annotation backend where multiple annotators can join their workforce. A key aspect is the possibility of applying semi-supervised active learning techniques already during the annotation process, by pre-labelling data automatically, which drastically accelerates annotation. Furthermore, the user interface implements recent eXplainable AI techniques to provide users with both a confidence value for the automatically predicted annotations and a visual explanation. We show in a use-case evaluation that our workflow speeds up the annotation process, and we further argue that the additional visual explanations help annotators understand both the decision-making process and the trustworthiness of their trained machine learning models.
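A minimal sketch of the pre-labelling idea, assuming scikit-learn: train on the hand-annotated portion, predict the rest, and surface only high-confidence predictions as pre-labels for human review. The classifier and threshold are illustrative assumptions, not NOVA's internals:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def prelabel(X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    """Return predicted labels, per-item confidences, and an
    auto-accept mask for the unlabeled pool."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_labeled, y_labeled)
    proba = clf.predict_proba(X_unlabeled)
    confidence = proba.max(axis=1)            # shown to the annotator
    labels = clf.classes_[proba.argmax(axis=1)]
    confident = confidence >= threshold       # auto-accepted pre-labels
    return labels, confidence, confident
```

Items below the threshold stay in the annotator's queue, so the human effort concentrates on exactly the cases the model is unsure about.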


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Ramón Alain Miranda-Quintana ◽  
Anita Rácz ◽  
Dávid Bajusz ◽  
Károly Héberger

Abstract Despite being a central concept in cheminformatics, molecular similarity has so far been limited to the simultaneous comparison of only two molecules at a time, using one index, generally the Tanimoto coefficient. In a recent contribution we not only introduced a complete mathematical framework for extended similarity calculations (i.e., comparisons of more than two molecules at a time) but also defined a series of novel indices. Part 1 is a detailed analysis of the effects of various parameters on the similarity values calculated by the extended formulas; their features were revealed by sum of ranking differences and ANOVA. Here, in addition to characterizing several important aspects of the newly introduced similarity metrics, we highlight their applicability and utility in real-life scenarios, using datasets with popular molecular fingerprints. Remarkably, for large datasets, the use of extended similarity measures provides an unprecedented speed-up over “traditional” pairwise similarity matrix calculations. We also provide illustrative examples of a more direct algorithm, based on the extended Tanimoto similarity, for selecting diverse compound sets, resulting in much higher levels of diversity than traditional approaches. We discuss the inner and outer consistency of our indices, which are key in practical applications, showing whether the n-ary and binary indices rank the data in the same way. We demonstrate the use of the new n-ary similarity metrics on t-distributed stochastic neighbor embedding (t-SNE) plots of datasets of varying diversity, or corresponding to ligands of different pharmaceutical targets, which show that our indices provide a better measure of set compactness than standard binary measures. We also present a conceptual example of the applicability of our indices in agglomerative hierarchical algorithms. The Python code for calculating the extended similarity metrics is freely available at: https://github.com/ramirandaq/MultipleComparisons
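To illustrate where the speed-up comes from, here is a deliberately simplified n-ary Tanimoto-style index (not the authors' exact definitions; see their repository for those): one pass over the column sums of n fingerprints replaces O(n²) pairwise comparisons:

```python
import numpy as np

def extended_tanimoto(fps):
    """Simplified n-ary Tanimoto analogue.

    fps: (n_molecules, n_bits) binary array. Counts bit positions
    shared by all molecules, relative to positions set in at least one.
    """
    n = fps.shape[0]
    col = fps.sum(axis=0)            # per-bit counts of 1s, one pass
    a = np.sum(col == n)             # bits present in every molecule
    d = np.sum(col == 0)             # bits absent everywhere
    return a / (fps.shape[1] - d)    # agreements over non-trivial bits

fps = np.random.randint(0, 2, size=(100, 2048))   # toy fingerprints
print(extended_tanimoto(fps))
```

For n = 2 this reduces to the familiar binary Tanimoto, which is why such indices can serve as drop-in set-compactness measures.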


2021 ◽  
Author(s):  
Pierstefano Bellani ◽  
Marina Carulli ◽  
Giandomenico Caruso

Abstract The many iteration loops that characterize the design process used to slow down the development of new projects. Since the 70s, the design process has changed thanks to new technologies and tools related to Computer-Aided Design software and Virtual Reality applications, which make almost the whole process digital. However, the concept phase of the design process is still based on traditional approaches, and digital tools are poorly exploited there. In this phase, designers need tools that allow them to rapidly capture and freeze their ideas, such as sketching on paper, which is not integrated into the digital-based process. This paper presents a new gestural interface that supports designers by introducing an effective device for 3D modelling, aiming to improve and speed up the conceptual design process. We designed a set of gestures that allow people from different backgrounds to 3D-model their ideas in a natural way. A testing session with 17 participants allowed us to verify whether the proposed interaction was intuitive. At the end of the tests, all participants had succeeded in 3D modelling a simple shape (a column), built exactly as they expected, using only air gestures and in a relatively short amount of time, confirming the intuitiveness of the proposed interaction.


2021 ◽  
Vol 13 (11) ◽  
pp. 2181
Author(s):  
Svetlana Illarionova  ◽  
Sergey Nesteruk  ◽  
Dmitrii Shadrin ◽  
Vladimir Ignatiev  ◽  
Maria Pukalchik  ◽  
...  

Usage of multispectral satellite imaging data opens vast possibilities for monitoring and quantitatively assessing properties or objects of interest on a global scale. Machine learning and computer vision (CV) approaches show promise as tools for automating satellite image analysis. However, there are limitations in applying CV to satellite data, the crucial one being the amount of data available for model training. This paper presents a novel image augmentation approach called MixChannel that helps address this limitation and improves the accuracy of segmentation and classification tasks on multispectral satellite images. The core idea is to exploit the fact that, in remote sensing tasks, there is usually more than one image of each location, and this extra data can be mixed in to achieve more robust performance of the trained models. The proposed approach substitutes some channels of the original training image with channels from other images of the same location. This augmentation technique preserves the spatial features of the original image and, with some probability, adds natural color variability. We also present an efficient algorithm to tune the channel substitution probabilities. We report that the MixChannel image augmentation method provides a noticeable increase in the performance of all the considered models on the studied forest type classification problem.
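A minimal sketch of the MixChannel idea as described above: with some per-channel probability, swap a channel of the training image for the same channel from another image of the same location. The probabilities here are illustrative; the paper tunes them:

```python
import numpy as np

def mix_channel(image, co_located, probs, rng=None):
    """MixChannel-style augmentation sketch.

    image:      (H, W, C) multispectral training image
    co_located: list of (H, W, C) images of the same location (other dates)
    probs:      per-channel substitution probabilities, length C
    """
    rng = rng or np.random.default_rng()
    out = image.copy()
    for c, p in enumerate(probs):
        if co_located and rng.random() < p:
            donor = co_located[rng.integers(len(co_located))]
            out[..., c] = donor[..., c]   # spatial structure is preserved
    return out
```

Because only whole channels are swapped between co-registered images, the augmented sample keeps the original spatial features while gaining natural spectral variability.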


2021 ◽  
Vol 11 ◽  
Author(s):  
Qingling Hua ◽  
Dejun Zhang ◽  
Yunqiao Li ◽  
Yue Hu ◽  
Pian Liu ◽  
...  

Aims: The survival benefit for liver cancer patients who undergo palliative radiotherapy varies from person to person. The present study aims to identify indicators of survival in advanced liver cancer patients receiving palliative radiotherapy. Patients and Methods: One hundred and fifty-nine patients treated with palliative radiotherapy for advanced liver cancer were retrospectively assessed. Of the 159 patients, 103 were included for prediction model construction in the training phase, while the other 56 were analyzed for external validation in the validation phase. In the model training phase, the clinical characteristics of the included patients were evaluated by Kaplan-Meier curves and the log-rank test. Thereafter, multivariable Cox analysis was performed to further identify characteristics with predictive potential. In the validation phase, a separate dataset of 56 patients was used for external validation. Harrell's C-index and calibration curves were used for model evaluation. Nomograms were plotted based on the multivariable Cox model. Results: Thirty-one patient characteristics were investigated in the model training phase. Based on the Kaplan-Meier plots and log-rank tests, 6 factors were statistically significant. On multivariable Cox regression analysis, bone metastasis (HR = 1.781, P = 0.026), portal vein tumor thrombus (HR = 2.078, P = 0.015), alpha-fetoprotein (HR = 2.098, P = 0.007), and radiation dose (HR = 0.535, P = 0.023) showed significant potential to predict the survival of advanced liver cancer patients treated with palliative radiotherapy. Moreover, nomograms predicting median overall survival and 1- and 2-year survival probability were plotted. Harrell's C-index of the predictive model is 0.709 (95% CI, 0.649-0.769) for the training model and 0.735 (95% CI, 0.666-0.804) for the validation model. Calibration curves for 1- and 2-year overall survival indicate that the predicted probabilities of OS are very close to the actually observed outcomes in both the training and validation phases. Conclusion: Bone metastasis, portal vein tumor thrombus, alpha-fetoprotein, and radiation dose are independent prognostic factors for the survival of advanced liver cancer patients treated with palliative radiotherapy.
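A minimal sketch, using the `lifelines` package, of the kind of multivariable Cox model the study fits. The column names mirror the study's four predictors, but the data file and its encoding are hypothetical:

```python
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("liver_rt_cohort.csv")   # hypothetical cohort file
cols = ["os_months", "event", "bone_metastasis",
        "portal_vein_tumor_thrombus", "afp_elevated", "radiation_dose"]

# Fit a multivariable Cox proportional hazards model.
cph = CoxPHFitter()
cph.fit(df[cols], duration_col="os_months", event_col="event")

cph.print_summary()                # hazard ratios, CIs, and p-values
print(cph.concordance_index_)      # Harrell's C-index on the training data
```

External validation would then score the fitted model on the held-out 56-patient dataset and compare calibration curves, as the study does.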


2019 ◽  
Vol 18 ◽  
pp. 117693511983554 ◽  
Author(s):  
Zhonglin Qu ◽  
Chng Wei Lau ◽  
Quang Vinh Nguyen ◽  
Yi Zhou ◽  
Daniel R Catchpoole

Visual analytics and visualisation can leverage the human perceptual system to interpret and uncover hidden patterns in big data. The advent of next-generation sequencing technologies has allowed the rapid production of massive amounts of genomic data and created a corresponding need for new tools and methods for visualising and interpreting these data. Visualising genomic data requires not only simple plotting of data but also decisions about what message a particular plot should convey; the methodologies used to represent the results must give clinicians, experts, and researchers an easy, clear, and accurate way to interact with the data. Genomic data visual analytics is rapidly evolving in parallel with advances in high-throughput sequencing and in technologies such as artificial intelligence (AI) and virtual reality (VR). Personalised medicine requires new genomic visualisation tools that can efficiently extract knowledge from genomic data and speed up expert decisions about the best treatment for an individual patient's needs. However, meaningful visual analytics of such large genomic data remains a serious challenge. This article provides a comprehensive systematic review and discussion of the tools, methods, and trends for visual analytics of cancer-related genomic data. We review methods for genomic data visualisation, including traditional approaches such as scatter plots, heatmaps, coordinates, and networks, as well as emerging technologies using AI and VR. We also trace the development of genomic data visualisation tools over time and analyse the evolution of visualising genomic data.


2020 ◽  
Vol 34 (05) ◽  
pp. 7179-7186
Author(s):  
Hanpeng Hu ◽  
Dan Wang ◽  
Chuan Wu

Many emerging AI applications require distributed machine learning (ML) among edge systems (e.g., IoT devices and PCs at the edge of the Internet), where data cannot be uploaded to a central venue for model training due to their large volume and/or security/privacy concerns. Edge devices are intrinsically heterogeneous in computing capacity, posing significant challenges to parameter synchronization for parallel training with the parameter server (PS) architecture. This paper proposes ADSP, a parameter synchronization model for distributed ML with heterogeneous edge systems. Eliminating the significant waiting time incurred by existing parameter synchronization models, the core idea of ADSP is to let faster edge devices continue training while committing their model updates at strategically decided intervals. We design algorithms that decide the time points for each worker to commit its model update, and that ensure not only global model convergence but also faster convergence. Our testbed implementation and experiments show that ADSP significantly outperforms existing parameter synchronization models in terms of ML model convergence time, scalability, and adaptability to large heterogeneity.
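An illustrative sketch (not the authors' algorithm) of the commit pattern ADSP enables: each worker trains at its own speed and pushes an update to the shared model every `commit_interval` seconds, instead of waiting at a synchronization barrier. Step sizes, speeds, and intervals are toy assumptions:

```python
import threading
import time

lock = threading.Lock()
global_model = {"w": 0.0}                  # toy one-parameter model

def worker(step_time, commit_interval, steps):
    local_update, last_commit = 0.0, time.time()
    for _ in range(steps):
        time.sleep(step_time)              # heterogeneous compute speed
        local_update += 0.01               # stand-in for a gradient step
        if time.time() - last_commit >= commit_interval:
            with lock:                     # commit without a global barrier
                global_model["w"] += local_update
            local_update, last_commit = 0.0, time.time()
    with lock:                             # flush any leftover update
        global_model["w"] += local_update

# Fast and slow edge devices, each committing on its own schedule.
threads = [threading.Thread(target=worker, args=(s, 0.05, 20))
           for s in (0.005, 0.01, 0.03)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(global_model["w"])                   # 3 workers x 20 steps x 0.01 = 0.6
```

The point of contrast with bulk-synchronous training is that no worker ever blocks on the slowest device; the strategic choice of commit interval is what ADSP's algorithms optimize.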

