Cuckoo Feature Hashing: Dynamic Weight Sharing for Sparse Analytics

Author(s):  
Jinyang Gao ◽  
Beng Chin Ooi ◽  
Yanyan Shen ◽  
Wang-Chien Lee

Feature hashing is widely used to process large-scale sparse features when learning predictive models. Collisions inherently occur in the hashing process and hurt model performance. In this paper, we develop a feature hashing scheme called Cuckoo Feature Hashing (CCFH), based on the principle behind Cuckoo hashing, a hashing scheme designed to resolve collisions. By providing multiple possible hash locations for each feature, CCFH prevents collisions between predictive features by dynamically hashing them into alternative locations during model training. Experimental results on prediction tasks with hundreds of millions of features demonstrate that CCFH can achieve the same level of performance using only 15%-25% of the parameters required by conventional feature hashing.
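The core mechanism can be sketched as a cuckoo-style placement of features into hash buckets. This is an illustrative simplification, not the authors' exact algorithm: the number of hash functions, the eviction policy, and all names below are assumptions.

```python
def candidate_slots(feature, num_slots, d=2):
    """Return d candidate bucket indices for a feature key,
    one per hash function."""
    return [hash((feature, i)) % num_slots for i in range(d)]

def place_features(features, num_slots, max_kicks=50):
    """Greedy cuckoo placement: each feature lands in one of its
    candidate slots, evicting the occupant if all are taken."""
    table = {}  # slot -> feature
    for f in features:
        cur = f
        for _ in range(max_kicks):
            slots = candidate_slots(cur, num_slots)
            free = [s for s in slots if s not in table]
            if free:
                table[free[0]] = cur
                cur = None
                break
            # Evict the occupant of the first candidate slot and
            # try to re-place it elsewhere (the "cuckoo" step).
            s = slots[0]
            cur, table[s] = table[s], cur
        # If cur is still set here, the eviction chain failed and
        # that feature would have to share a slot (collision remains).
    return table
```

In CCFH proper, which features keep private slots is driven by the training signal; the greedy eviction here only illustrates how alternative hash locations resolve collisions.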

2021 ◽  
Vol 23 (Supplement_6) ◽  
pp. vi135-vi136
Author(s):  
Ujjwal Baid ◽  
Sarthak Pati ◽  
Siddhesh Thakur ◽  
Brandon Edwards ◽  
Micah Sheller ◽  
...  

Abstract PURPOSE The robustness and generalizability of artificial intelligence (AI) methods are reliant on training data size and diversity, which are currently hindered in multi-institutional healthcare collaborations by data ownership and legal concerns. To address these, we introduce the Federated Tumor Segmentation (FeTS) Initiative, an international consortium using federated learning (FL) for data-private multi-institutional collaborations, where AI models leverage data at participating institutions without sharing data between them. The initial FeTS use-case focused on detecting brain tumor boundaries in MRI. METHODS The FeTS tool incorporates: 1) MRI pre-processing, including image registration and brain extraction; 2) automatic delineation of tumor sub-regions, by label fusion of pretrained top-performing BraTS methods; 3) tools for manual delineation refinements; 4) model training. 55 international institutions identified local retrospective cohorts of glioblastoma patients. Ground truth was generated using the first three FeTS functionality modes mentioned earlier. Finally, the FL training mode comprises: i) an AI model trained on local data, ii) local model updates shared with an aggregator, which iii) combines updates from all collaborators to generate a consensus model, and iv) circulates the consensus model back to all collaborators for iterative performance improvements. RESULTS The first FeTS consensus model, from 23 institutions with data of 2,200 patients, showed an average improvement of 11.1% in model performance on each collaborator's validation data, compared to a model trained on the publicly available BraTS data (n=231). CONCLUSION Our findings support that increasing data alone can lead to AI performance improvements without any algorithmic development, indicating that model performance would improve further when trained with all 55 collaborating institutions.
FL enables AI model training with knowledge from data of geographically-distinct collaborators, without ever having to share any data, hence overcoming hurdles relating to legal, ownership, and technical concerns of data sharing.
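The aggregation step (ii-iv) can be sketched as a sample-weighted average of local model updates, in the style of federated averaging. The flattened-weights representation and the weighting by local sample count are assumptions for illustration, not details of the FeTS implementation.

```python
def aggregate(updates):
    """Combine local model updates into a consensus model.

    updates: list of (weights, n_samples) pairs, where weights is a
    flat list of floats (a flattened model). Returns the average of
    the weights, weighted by each collaborator's local sample count.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    consensus = [0.0] * dim
    for weights, n in updates:
        for i, w in enumerate(weights):
            consensus[i] += w * n / total
    return consensus
```

In an FL round, each collaborator trains locally, sends its updated weights and sample count to the aggregator, receives the consensus back, and the cycle repeats.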


Author(s):  
Tianyu Gao ◽  
Xu Han ◽  
Zhiyuan Liu ◽  
Maosong Sun

Existing methods for relation classification (RC) primarily rely on distant supervision (DS) because large-scale supervised training datasets are not readily available. Although DS automatically annotates adequate amounts of data for model training, the coverage of this data is still quite limited, and many long-tail relations suffer from data sparsity. Intuitively, people can grasp new knowledge from only a few instances. We thus provide a different view on RC by formalizing it as a few-shot learning (FSL) problem. However, current FSL models mainly focus on low-noise vision tasks, which makes them ill-suited to the diversity and noise of text. In this paper, we propose hybrid attention-based prototypical networks for noisy few-shot RC. We design instance-level and feature-level attention schemes based on prototypical networks to highlight the crucial instances and features respectively, which significantly enhances the performance and robustness of RC models in a noisy FSL scenario. Besides, our attention schemes accelerate the convergence of RC models. Experimental results demonstrate that our hybrid attention-based models require fewer training iterations and outperform the state-of-the-art baseline models. The code and datasets are released at https://github.com/thunlp/HATT-Proto.
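The instance-level attention idea can be sketched as follows: rather than a plain mean, the class prototype is an attention-weighted sum of support embeddings, with weights derived from each instance's similarity to the query, so that noisy support instances contribute less. This is a hedged simplification of the paper's scheme; the similarity function and classification rule below are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attentive_prototype(support, query):
    """support: list of embedding vectors for one class; query: vector.
    Returns the query-conditioned, attention-weighted prototype."""
    weights = softmax([dot(s, query) for s in support])
    dim = len(query)
    return [sum(w * s[i] for w, s in zip(weights, support))
            for i in range(dim)]

def classify(supports_by_class, query):
    """Pick the class whose attentive prototype is closest (squared
    Euclidean distance) to the query embedding."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    protos = {c: attentive_prototype(s, query)
              for c, s in supports_by_class.items()}
    return min(protos, key=lambda c: dist2(protos[c], query))
```

The feature-level attention in the paper additionally reweights embedding dimensions per class; that part is omitted here for brevity.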


2021 ◽  
Author(s):  
Norberto Sánchez-Cruz ◽  
Jose L. Medina-Franco

<p>Epigenetic targets are a significant focus for drug discovery research, as demonstrated by the eight approved epigenetic drugs for treatment of cancer and the increasing availability of chemogenomic data related to epigenetics. These data represent a large amount of structure-activity relationships that have not been exploited thus far for the development of predictive models to support medicinal chemistry efforts. Herein, we report the first large-scale study of 26,318 compounds with a quantitative measure of biological activity for 55 protein targets with epigenetic activity. Through a systematic comparison of machine learning models trained on molecular fingerprints of different design, we built predictive models with high accuracy for the epigenetic target profiling of small molecules. The models were thoroughly validated, showing mean precisions up to 0.952 for the epigenetic target prediction task. Our results indicate that the models reported herein have considerable potential to identify small molecules with epigenetic activity. Therefore, our results were implemented as a freely accessible and easy-to-use web application.</p>
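As a hedged illustration of fingerprint-based activity prediction (not the paper's actual models, which systematically compare several fingerprint designs and machine learning algorithms), here is a nearest-neighbour classifier over binary fingerprints using Tanimoto similarity, a common choice for comparing molecular fingerprints:

```python
def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints, each given
    as the set of its on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def predict_active(fp, actives, inactives):
    """Label a query fingerprint by the label of its most similar
    training fingerprint (1-nearest-neighbour)."""
    best_sim, best_label = -1.0, False
    for label, pool in ((True, actives), (False, inactives)):
        for ref in pool:
            s = tanimoto(fp, ref)
            if s > best_sim:
                best_sim, best_label = s, label
    return best_label
```

In a target-profiling setting, one such classifier (or score) per protein target yields the molecule's predicted epigenetic activity profile.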


2021 ◽  
Vol 13 (11) ◽  
pp. 2074
Author(s):  
Ryan R. Reisinger ◽  
Ari S. Friedlaender ◽  
Alexandre N. Zerbini ◽  
Daniel M. Palacios ◽  
Virginia Andrews-Goff ◽  
...  

Machine learning algorithms are often used to model and predict animal habitat selection, i.e., the relationships between animal occurrences and habitat characteristics. For broadly distributed species, habitat selection often varies among populations and regions; thus, it would seem preferable to fit region- or population-specific models of habitat selection for more accurate inference and prediction, rather than fitting large-scale models using pooled data. However, where the aim is to make range-wide predictions, including areas for which there are no existing data or models of habitat selection, how can regional models best be combined? We propose that ensemble approaches commonly used to combine different algorithms for a single region can be reframed, treating regional habitat selection models as the candidate models. By doing so, we can incorporate regional variation when fitting predictive models of animal habitat selection across large ranges. We test this approach using satellite telemetry data from 168 humpback whales across five geographic regions in the Southern Ocean. Using random forests, we fitted a large-scale model relating humpback whale locations (versus background locations) to 10 environmental covariates, and made a circumpolar prediction of humpback whale habitat selection. We also fitted five regional models, the predictions of which we used as input features for four ensemble approaches: an unweighted ensemble, an ensemble weighted by environmental similarity in each cell, stacked generalization, and a hybrid approach wherein the environmental covariates and regional predictions were used as input features in a new model. We tested the predictive performance of these approaches on an independent validation dataset of humpback whale sightings and whaling catches. These multiregional ensemble approaches resulted in models with higher predictive performance than the naive circumpolar model.
These approaches can be used to incorporate regional variation in animal habitat selection when fitting range-wide predictive models using machine learning algorithms. This can yield more accurate predictions across regions or populations of animals that may show variation in habitat selection.
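Two of the four ensemble schemes can be sketched directly: the unweighted ensemble averages the regional predictions for a cell, while the weighted ensemble scales each regional prediction by the environmental similarity between the cell and that region. The similarity measure below (inverse Euclidean distance in covariate space) is an assumption for illustration.

```python
import math

def unweighted_ensemble(regional_preds):
    """regional_preds: habitat-selection scores for one cell,
    one per regional model."""
    return sum(regional_preds) / len(regional_preds)

def similarity_weighted_ensemble(regional_preds, cell_env, region_envs):
    """Weight each regional prediction by the similarity between the
    cell's environmental covariates and that region's covariates."""
    def similarity(u, v):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        return 1.0 / (1.0 + d)
    weights = [similarity(cell_env, e) for e in region_envs]
    z = sum(weights)
    return sum(w * p for w, p in zip(weights, regional_preds)) / z
```

Stacked generalization and the hybrid approach instead fit a new model on top of the regional predictions (plus, for the hybrid, the raw covariates), so they require a training step rather than a fixed weighting rule.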


2021 ◽  
Vol 9 (6) ◽  
pp. 635
Author(s):  
Hyeok Jin ◽  
Kideok Do ◽  
Sungwon Shin ◽  
Daniel Cox

Coastal dunes are important morphological features for both ecosystems and coastal hazard mitigation. Because understanding and predicting dune erosion is very important, various numerical models have been developed to improve prediction accuracy. In the present study, a process-based model (XBeachX) was tested and calibrated to improve the accuracy of dune erosion simulation for a storm event by adjusting the model coefficients and comparing the results with large-scale experimental data. The breaker slope coefficient was calibrated to predict cross-shore wave transformation more accurately. To improve the prediction of the dune erosion profile, the coefficients related to skewness and asymmetry were adjusted. Moreover, the bermslope coefficient was calibrated to improve the simulation performance of the bermslope near the dune face. Model performance was assessed based on model-data comparisons. The calibrated XBeachX successfully predicted wave transformation and dune erosion phenomena. In addition, results from two other similar dune erosion experiments obtained with the same calibrated set matched the observed wave and profile data well. However, the prediction of underwater sand bar evolution remains a challenge.
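The calibration workflow can be sketched as a parameter sweep that selects the coefficient value minimising error against the flume observations. The stand-in `run_model` callable below is purely illustrative; in the study each evaluation corresponds to an XBeachX run compared against the measured profile.

```python
def calibrate(coeff_values, run_model, observed):
    """Return the coefficient value whose simulated output best
    matches the observations (lowest RMSE).

    coeff_values: candidate values for one model coefficient.
    run_model: callable mapping a coefficient to a simulated series.
    observed: measured series of the same length.
    """
    def rmse(sim, obs):
        return (sum((s - o) ** 2 for s, o in zip(sim, obs))
                / len(obs)) ** 0.5
    return min(coeff_values, key=lambda c: rmse(run_model(c), observed))
```

Calibrating several coefficients (breaker slope, skewness/asymmetry, bermslope) this way, one at a time or jointly, reproduces the study's procedure in miniature.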


Author(s):  
Shivanand M. Teli ◽  
Channamallikarjun S. Mathpati

Abstract The novel rectangular external loop airlift reactor design is at present the most used large-scale reactor for microalgae culture. Its unique feature is a large surface-to-volume ratio, which maximizes exposure to light radiation for the photosynthesis reaction. 3D simulations have been performed in the rectangular EL-ALR. The Eulerian-Eulerian approach has been used with a dispersed gas phase for different turbulence models. The performance and applicability of different turbulence models, i.e., standard K-epsilon, realizable K-epsilon, K-omega, and the Reynolds stress model, are assessed and compared with experimental results. All drag forces and non-drag forces (turbulent dispersion, virtual mass, and lift coefficient) are included in the model. The experimental values of overall gas hold-up and average liquid circulation velocity have been compared with simulation and literature results, showing good agreement. For different elevations in the downcomer section, experimental values of liquid axial velocity, turbulent kinetic energy, and turbulent eddy dissipation have been compared with the different turbulence models. The realizable K-epsilon model gives the best agreement with the experimental results.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Mehdi Srifi ◽  
Ahmed Oussous ◽  
Ayoub Ait Lahcen ◽  
Salma Mouline

Abstract Various recommender systems (RSs) have been developed over recent years, and many of them have concentrated on English content. Thus, the majority of RSs in the literature have been compared on English content. However, research investigating RSs on content in other languages, such as Arabic, is minimal; the field of Arabic RSs remains largely neglected. Therefore, we aim through this study to fill this research gap by leveraging recent advances in the English RS field. Our main goal is to investigate recent RSs in an Arabic context. To that end, we first selected five state-of-the-art RSs devoted originally to English content, and then empirically evaluated their performance on Arabic content. As a result of this work, we first built four publicly available large-scale Arabic datasets for recommendation purposes. Second, various text preprocessing techniques were applied to prepare the constructed datasets. Third, our investigation derived well-argued conclusions about the usage of modern RSs in the Arabic context. The experimental results show that these systems achieve high performance when applied to Arabic content.
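As a hedged sketch of common Arabic text normalisation steps (unifying letter variants, stripping diacritics and tatweel), which are typical of the kind of preprocessing the study applies, though not necessarily the authors' exact pipeline:

```python
import re

def normalize_arabic(text):
    """Apply common Arabic normalisation steps used when preparing
    text corpora for retrieval or recommendation."""
    text = re.sub("[\u0625\u0623\u0622]", "\u0627", text)  # unify alef variants
    text = text.replace("\u0649", "\u064a")                # alef maqsura -> ya
    text = re.sub("[\u064B-\u065F]", "", text)             # strip diacritics
    text = text.replace("\u0640", "")                      # remove tatweel
    return text
```

Further steps in practice often include tokenisation, stop-word removal, and light stemming, all of which are sensitive to Arabic's rich morphology.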


Author(s):  
Cinzia Giannetti ◽  
Aniekan Essien

Abstract Smart factories are intelligent, fully connected and flexible systems that can continuously monitor and analyse data streams from interconnected systems to make decisions and dynamically adapt to new circumstances. The implementation of smart factories represents a leap forward compared to traditional automation. It is underpinned by the deployment of cyber-physical systems that, through the application of Artificial Intelligence, integrate predictive capabilities and foster rapid decision-making. Deep Learning (DL) is a key enabler for the development of smart factories. However, the implementation of DL in smart factories is hindered by its reliance on large amounts of data and extreme computational demand. To address this challenge, Transfer Learning (TL) has been proposed to promote efficient training by enabling the reuse of previously trained models. In this paper, by means of a specific example in aluminium can manufacturing, an empirical study is presented which demonstrates the potential of TL to achieve fast deployment of scalable and reusable predictive models for Cyber Manufacturing Systems. Through extensive experiments, the value of TL is demonstrated for achieving better generalisation and model performance, especially with limited datasets. This research provides a pragmatic approach towards predictive model building for cyber twins, paving the way towards the realisation of smart factories.
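The TL workflow the paper motivates can be sketched on a toy problem: reuse parameters learned on a data-rich source task as the initialisation for a related, data-poor target task, so that a small training budget suffices. The linear model and tasks below are stand-ins, not the paper's deep networks.

```python
def train_w(xs, ys, w0, lr=0.5, epochs=3):
    """Fit y = w*x by gradient descent on squared error,
    starting from the initialisation w0."""
    w = w0
    n = len(xs)
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

# Source task with plentiful data: y = 2x.
src_x = [i / 50 for i in range(51)]
src_w = train_w(src_x, [2 * x for x in src_x], w0=0.0, epochs=100)

# Related target task with only three samples: y = 3x.
tgt_x, tgt_y = [0.0, 0.5, 1.0], [0.0, 1.5, 3.0]
scratch_w = train_w(tgt_x, tgt_y, w0=0.0)      # train from scratch
transfer_w = train_w(tgt_x, tgt_y, w0=src_w)   # start from source weights
```

With the same small training budget, the transferred model starts closer to the target solution and therefore ends closer to it, which is the effect the paper demonstrates at scale with deep predictive models.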


2021 ◽  
Author(s):  
Moctar Dembélé ◽  
Bettina Schaefli ◽  
Grégoire Mariéthoz

The diversity of remotely sensed or reanalysis-based rainfall data steadily increases, which on the one hand opens new perspectives for large-scale hydrological modelling in data-scarce regions, but on the other hand poses challenging questions regarding parameter identification and transferability under multiple input datasets. This study analyzes the variability of hydrological model performance when (1) a set of parameters is transferred from the calibration input dataset to different meteorological datasets and, reversely, when (2) an input dataset is used with a parameter set originally calibrated for a different input dataset.

The research objective is to highlight the uncertainties related to input data and the limitations of hydrological model parameter transferability across input datasets. An ensemble of 17 rainfall datasets and 6 temperature datasets from satellite and reanalysis sources (Dembélé et al., 2020), corresponding to 102 combinations of meteorological data, is used to force the fully distributed mesoscale Hydrologic Model (mHM). The mHM model is calibrated for each combination of meteorological datasets, resulting in 102 calibrated parameter sets, almost all of which give similar model performance. Each of the 102 parameter sets is used to run the mHM model with each of the 102 input datasets, yielding 10,404 scenarios that serve for the transferability tests. The experiment is carried out for the decade from 2003 to 2012 in the large and data-scarce Volta River basin (415,600 km²) in West Africa.

The results show that there is a high variability in model performance for streamflow (mean CV=105%) when the parameters are transferred from the original input dataset to other input datasets (test 1 above). Moreover, the model performance is in general lower and can drop considerably when parameters obtained under all other input datasets are transferred to a selected input dataset (test 2 above). This underlines the need for model performance evaluation when a model is run with input datasets and parameter sets different from those used during calibration. Our results represent a first step towards tackling the question of parameter transferability to climate change scenarios. An in-depth analysis of the results at a later stage will shed light on which model parameterizations might be the main source of performance variability.

Dembélé, M., Schaefli, B., van de Giesen, N., & Mariéthoz, G. (2020). Suitability of 17 rainfall and temperature gridded datasets for large-scale hydrological modelling in West Africa. Hydrology and Earth System Sciences (HESS). https://doi.org/10.5194/hess-24-5379-2020
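The experiment's bookkeeping can be sketched as an evaluation grid crossing every calibrated parameter set with every input dataset, summarised by the coefficient of variation (CV) of each row. The scoring function here is a toy stand-in for an mHM run and its streamflow performance metric.

```python
import statistics

def transfer_matrix(param_sets, datasets, score):
    """score(params, data) -> model performance. Returns the full
    n_params x n_datasets evaluation grid (102 x 102 = 10,404 runs
    in the study)."""
    return [[score(p, d) for d in datasets] for p in param_sets]

def cv_per_parameter_set(matrix):
    """CV (stdev/mean, in %) of each row: the spread of performance
    when one parameter set is transferred across all input datasets."""
    return [100 * statistics.pstdev(row) / statistics.fmean(row)
            for row in matrix]
```

Reading the grid column-wise instead gives test 2: how a fixed input dataset performs under parameter sets calibrated on all the other datasets.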


Author(s):  
Benjamin Wessler ◽  
Christine Lundquist ◽  
Gowri Raman ◽  
Jennifer Lutz ◽  
Jessica Paulus ◽  
...  

Background: Interventions for patients with valvular heart disease (VHD) now include both surgical and percutaneous procedures. As a result, treatments are being offered to increasingly complex patients with a significant burden of non-cardiac comorbid conditions. There is a major gap in our understanding of how various comorbidities relate to prognosis following interventions for VHD. Here we describe how comorbidities are handled in clinical predictive models for patients undergoing interventions for VHD. Methods: We queried the Tufts Predictive Analytics and Comparative Effectiveness (PACE) Clinical Prediction Model (CPM) Registry to identify de novo CPMs for patients undergoing VHD interventions. We systematically extracted information on the non-cardiac comorbidities contained in the CPMs, as well as measures of model performance. Results: From January 1990 to May 2012, there were 12 CPMs predicting measures of morbidity or mortality for patients undergoing interventions for VHD: 2 CPMs predicting outcomes for isolated aortic valve replacement, 3 predicting outcomes for isolated mitral valve surgery, and 7 predicting outcomes for a combination of valve surgery subtypes. Ten of the twelve (83%) CPMs predicted mortality. The median number of non-cardiac comorbidities included in the CPMs was 4 (range 0-7). All of the CPMs predicting mortality included at least 1 comorbid condition. The 3 most common comorbidities included in these CPMs were renal dysfunction (10/12, 83%), prior CVA (7/12, 58%), and measures of BMI/BSA (7/12, 58%). Diabetes was present in only 25% (3/12) of the models and chronic lung disease in only 17% (2/12). Conclusions: Non-cardiac comorbidities are frequently found in CPMs predicting morbidity and mortality following interventions for VHD. There is significant variation in the number and type of specific comorbid conditions included in these CPMs.
More work is needed to understand the directionality, magnitude, and consistency of effect of these non-cardiac comorbid conditions for patients undergoing interventions for VHD.

