data partitioning Latest Research Papers

Since most classifiers are biased toward the dominant class, class imbalance is a challenging problem in machine learning. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. We present a linear time resampling method based on random data partitioning and a majority voting rule to address both concerns, where an imbalanced dataset is partitioned into a number of small subdatasets, each of which must be class balanced. After that, a specific classifier is trained for each subdataset, and the final classification result is established by applying the majority voting rule to the results of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark machine learning class-imbalanced datasets. The classification results produced by the classifiers employed on the generated data by the proposed method were comparable to most of the resampling methods tested, with the exception of SMOTEFUNA, which is an oversampling method that increases the probability of overfitting. The proposed method produced results that were comparable to the Easy Ensemble (EE) undersampling method. As a result, for solving the challenge of machine learning from class-imbalanced datasets, we advocate using either EE or our method.

Download Full-text

Areal Precipitation Coverage Ratio for Enhanced AI Modelling of Monthly Runoff: A New Satellite Data-Driven Scheme for Semi-Arid Mountainous Climate

Remote Sensing ◽

10.3390/rs14020270 ◽

2022 ◽

Vol 14 (2) ◽

pp. 270

Author(s):

Seyyed Hasan Hosseini ◽

Hossein Hashemi ◽

Ahmad Fakheri Fard ◽

Ronny Berndtsson

Keyword(s):

Satellite Data ◽

Precipitation Variability ◽

Data Partitioning ◽

Performance Criteria ◽

Data Driven ◽

Coverage Ratio ◽

Areal Precipitation ◽

Input Variables ◽

Monthly Runoff ◽

New Variables

Satellite remote sensing provides useful gridded data for the conceptual modelling of hydrological processes such as precipitation–runoff relationship. Structurally flexible and computationally advanced AI-assisted data-driven (DD) models foster these applications. However, without linking concepts between variables from many grids, the DD models can be too large to be calibrated efficiently. Therefore, effectively formulized, collective input variables and robust verification of the calibrated models are desired to leverage satellite data for the strategic DD modelling of catchment runoff. This study formulates new satellite-based input variables, namely, catchment- and event-specific areal precipitation coverage ratios (CCOVs and ECOVs, respectively) from the Global Precipitation Mission (GPM) and evaluates their usefulness for monthly runoff modelling from five mountainous Karkheh sub-catchments of 5000–43,000 km2 size in west Iran. Accordingly, 12 different input combinations from GPM and MODIS products were introduced to a generalized deep learning scheme using artificial neural networks (ANNs). Using an adjusted five-fold cross-validation process, 420 different ANN configurations per fold choice and 10 different random initial parameterizations per configuration were tested. Runoff estimates from five hybrid models, each an average of six top-ranked ANNs based on six statistical criteria in calibration, indicated obvious improvements for all sub-catchments using the new variables. Particularly, ECOVs were most efficient for the most challenging sub-catchment, Kashkan, having the highest spacetime precipitation variability. However, better performance criteria were found for sub-catchments with lower precipitation variability. The modelling performance for Kashkan indicated a higher dependency on data partitioning, suggesting that long-term data representativity is important for modelling reliability.

Download Full-text

Real-Time Aerodynamic Modeling of a Folding Wing Vehicle using Automatic Data Partitioning

10.2514/6.2022-1309 ◽

2022 ◽

Author(s):

Joaquim N. Dias

Keyword(s):

Real Time ◽

Data Partitioning ◽

Automatic Data ◽

Aerodynamic Modeling ◽

Folding Wing

Download Full-text

Performance Analysis of Iteratively Decoded Convergent Source Mapping with Sphere Packing-Assisted Differential Space-Time Spreading Technique for Efficient Video Transmission

Complexity ◽

10.1155/2021/5776480 ◽

2021 ◽

Vol 2021 ◽

pp. 1-16

Author(s):

Ishtiaque Ahmed ◽

Nasru Minallah ◽

Jaroslav Frnda ◽

Jan Nedoma

Keyword(s):

Information Transfer ◽

Video Transmission ◽

Hamming Distance ◽

Signal To Noise Ratio ◽

Sphere Packing ◽

Space Time ◽

Data Partitioning ◽

Video Codec ◽

Efficient System ◽

Data Rates

With the substantial growth in number of wireless devices, future communication demands overarching research to design high-throughput and efficient systems. We propose an intelligent Convergent Source Mapping (CSM) approach incorporating Differential Space-Time Spreading (DSTS) technique with Sphere Packing (SP) modulation. The crux of CSM process is assured convergence by attaining an infinitesimal Bit-Error Rate (BER). Data Partitioning (DP) H.264 video codec is deployed to gauge the performance of our intelligent and efficient system. For the purpose of efficient and higher data rates, we have incorporated compression efficient source encoding along with error resiliency and transmission robustness features. The proposed system follows the concept of iterations between the Soft-Bit Source-Decoder (SBSD) and Recursive Systematic Convolutional (RSC) decoder. Simulations of the DSTS-SP-assisted CSM system are presented for the correlated narrowband Rayleigh channel, using different CSM rates but constant overall bit-rate budget. The SP-assisted DSTS systems are mainly useful in decoding algorithms that operate without requiring Channel State Information (CSI). The effects of incorporating redundancy via different CSM schemes on the attainable performance and convergence of the proposed system are investigated using EXtrinsic Information Transfer (EXIT) charts. The effectiveness of the proposed system is demonstrated through IT++ based proof-of-concept simulations. The Peak Signal-to-Noise Ratio (PSNR) analysis shows that using Rate-2/6 CSM with minimum Hamming distance ( d H , min ) of 4 offers about 5 dB gain, compared to an identical overall system code rate but with Rate-2/3 CSM and d H , min of 2. Furthermore, for a consistent value of d H , min and overall rate, the Rate-2/3 CSM scheme beats the Rate-5/6 CSM by about 2 dB at the PSNR degradation point of 2 dB. Moreover, the proposed system with Rate-2/3 CSM scheme furnishes an E b / N 0 gain of 20 dB when compared with the uniform-rate benchmarker. Clearly, we can say that higher d H , min and lower CSM values are favourable for our proposed setup.

Download Full-text

A fast and scalable ensemble of global models with long memory and data partitioning for the M5 forecasting competition

International Journal of Forecasting ◽

10.1016/j.ijforecast.2021.11.004 ◽

2021 ◽

Author(s):

Kasun Bandara ◽

Hansika Hewamalage ◽

Rakshitha Godahewa ◽

Puwasala Gamakumara

Keyword(s):

Long Memory ◽

Data Partitioning ◽

Global Models

Download Full-text

PhyloHerb: A phylogenomic pipeline for processing genome skimming data for plants

10.1101/2021.11.29.470431 ◽

2021 ◽

Author(s):

Liming Cai ◽

Hongrui Zhang ◽

CHARLES C DAVIS

Keyword(s):

Genomic Dna ◽

High Throughput Sequencing ◽

Data Partitioning ◽

Published Data ◽

Herbarium Specimens ◽

Bioinformatic Pipeline ◽

Biodiversity Research ◽

Blast Search ◽

Genome Skimming ◽

Low Coverage

Premise of the study: The application of high throughput sequencing, especially to herbarium specimens, is greatly accelerating biodiversity research. Among various techniques, low coverage Illumina sequencing of total genomic DNA (genome skimming) can simultaneously recover the plastid, mitochondrial, and nuclear ribosomal regions across hundreds of species. Here, we introduce PhyloHerb -- a bioinformatic pipeline to efficiently and effectively assemble phylogenomic datasets derived from genome skimming. Methods and Results: PhyloHerb uses either a built-in database or user-specified references to extract orthologous sequences using BLAST search. It outputs FASTA files and offers a suite of utility functions to assist with alignment, data partitioning, concatenation, and phylogeny inference. The program is freely available at https://github.com/lmcai/PhyloHerb/. Conclusions: Using published data from Clusiaceae, we demonstrated that PhyloHerb can accurately identify genes using highly fragmented assemblies derived from sequencing older herbarium specimens. Our approach is effective at all taxonomic depths and is scalable to thousands of species.

Download Full-text

A Systematic Review of Federated Learning in the Healthcare Area: From the Perspective of Data Properties and Applications

Applied Sciences ◽

10.3390/app112311191 ◽

2021 ◽

Vol 11 (23) ◽

pp. 11191

Author(s):

Prayitno ◽

Chi-Ren Shyu ◽

Karisma Trinanda Putra ◽

Hsing-Chung Chen ◽

Yuan-Yu Tsai ◽

...

Keyword(s):

Deep Learning ◽

Learning Model ◽

Data Partitioning ◽

Data Driven ◽

Future Research ◽

Healthcare Applications ◽

Smart Healthcare ◽

Protection Mechanisms ◽

Benchmark Datasets ◽

Deep Learning Model

Recent advances in deep learning have shown many successful stories in smart healthcare applications with data-driven insight into improving clinical institutions’ quality of care. Excellent deep learning models are heavily data-driven. The more data trained, the more robust and more generalizable the performance of the deep learning model. However, pooling the medical data into centralized storage to train a robust deep learning model faces privacy, ownership, and strict regulation challenges. Federated learning resolves the previous challenges with a shared global deep learning model using a central aggregator server. At the same time, patient data remain with the local party, maintaining data anonymity and security. In this study, first, we provide a comprehensive, up-to-date review of research employing federated learning in healthcare applications. Second, we evaluate a set of recent challenges from a data-centric perspective in federated learning, such as data partitioning characteristics, data distributions, data protection mechanisms, and benchmark datasets. Finally, we point out several potential challenges and future research directions in healthcare applications.

Download Full-text

Stochastic Computing co-processing elements for Evolving Autonomous Data Partitioning

10.1109/dcis53048.2021.9666167 ◽

2021 ◽

Author(s):

Alejandro Moran ◽

Vincent Canals ◽

Plamen P. Angelov ◽

Christian F. Frasser ◽

Erik S. Skibinsky-Gitlin ◽

...

Keyword(s):

Data Partitioning ◽

Processing Elements ◽

Stochastic Computing

Download Full-text

Biotectonics of Sulawesi: Principles, methodology, and area relationships

Zootaxa ◽

10.11646/zootaxa.5068.4.1 ◽

2021 ◽

Vol 5068 (4) ◽

pp. 451-484

Author(s):

BERNARD MICHAUX ◽

VISOTHEARY UNG

Keyword(s):

Historical Biogeography ◽

Data Partitioning ◽

Pacific Region ◽

Consensus Trees ◽

Areas Of Endemism ◽

History Of ◽

Tectonic Development ◽

Phylogenetic Hypotheses ◽

The Relationship

Biotectonics is an approach to historical biogeography based on the analysis of independently derived biological and tectonic data, which we demonstrate using the island of Sulawesi as an example. We describe the tectonic development of Sulawesi and discuss the relationship between tectonic models and phylogenetic hypotheses. We outline the problem of interpreting areagrams based on single phylogenies and stress the importance of combining all available data into a general areagram. We analysed the distributions of Sulawesi area of endemism endemics (AEEs) using 30 published phylogenies, which were converted into paralogy-free taxon-area cladograms using the programme LisBeth (Zaragüeta-Bagalis et al. 2012) from which Adam’s consensus trees were constructed using PAUP (Swofford 2002). The results of our analyses show that the relationship between the areas of endemism is congruent with the terrane history of the island. A further 79 phylogenies of Sulawesi species with extralimital distributions were analysed to determine area relationships of Sulawesi within the broader Indo-Pacific region. We demonstrate the utility of data partitioning when dealing with areas that are geologically and biologically composite by showing that analysing Asian and Australasian elements of the Sulawesi biota separately produced general areagrams that avoid artifice and are interpretable in the light of current tectonic models.

Download Full-text

Developing an Image-Based Deep Learning Framework for Automatic Scoring of The Pentagon Drawing Test

Journal of Alzheimer s Disease ◽

10.3233/jad-210714 ◽

2021 ◽

pp. 1-11

Author(s):

Yike Li ◽

Jiajie Guo ◽

Peikai Yang

Keyword(s):

Deep Learning ◽

Object Detection ◽

Transfer Learning ◽

High Efficiency ◽

Characteristic Curve ◽

Data Partitioning ◽

Training Data ◽

Drawing Test ◽

Automatic Scoring ◽

Efficiency And Reliability

Background: The Pentagon Drawing Test (PDT) is a common assessment for visuospatial function. Evaluating the PDT by artificial intelligence can improve efficiency and reliability in the big data era. This study aimed to develop a deep learning (DL) framework for automatic scoring of the PDT based on image data. Methods: A total of 823 PDT photos were retrospectively collected and preprocessed into black-and-white, square-shape images. Stratified fivefold cross-validation was applied for training and testing. Two strategies based on convolutional neural networks were compared. The first strategy was to perform an image classification task using supervised transfer learning. The second strategy was designed with an object detection model for recognizing the geometric shapes in the figure, followed by a predetermined algorithm to score based on their classes and positions. Results: On average, the first framework demonstrated 62%accuracy, 62%recall, 65%precision, 63%specificity, and 0.72 area under the receiver operating characteristic curve. This performance was substantially outperformed by the second framework, with averages of 94%, 95%, 93%, 93%, and 0.95, respectively. Conclusion: An image-based DL framework based on the object detection approach may be clinically applicable for automatic scoring of the PDT with high efficiency and reliability. With a limited sample size, transfer learning should be used with caution if the new images are distinct from the previous training data. Partitioning the problem-solving workflow into multiple simple tasks should facilitate model selection, improve performance, and allow comprehensible logic of the DL framework.

Download Full-text

data partitioning
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets

Areal Precipitation Coverage Ratio for Enhanced AI Modelling of Monthly Runoff: A New Satellite Data-Driven Scheme for Semi-Arid Mountainous Climate

Real-Time Aerodynamic Modeling of a Folding Wing Vehicle using Automatic Data Partitioning

Performance Analysis of Iteratively Decoded Convergent Source Mapping with Sphere Packing-Assisted Differential Space-Time Spreading Technique for Efficient Video Transmission

A fast and scalable ensemble of global models with long memory and data partitioning for the M5 forecasting competition

PhyloHerb: A phylogenomic pipeline for processing genome skimming data for plants

A Systematic Review of Federated Learning in the Healthcare Area: From the Perspective of Data Properties and Applications

Stochastic Computing co-processing elements for Evolving Autonomous Data Partitioning

Biotectonics of Sulawesi: Principles, methodology, and area relationships

Developing an Image-Based Deep Learning Framework for Automatic Scoring of The Pentagon Drawing Test

Export Citation Format

data partitioningRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets

Areal Precipitation Coverage Ratio for Enhanced AI Modelling of Monthly Runoff: A New Satellite Data-Driven Scheme for Semi-Arid Mountainous Climate

Real-Time Aerodynamic Modeling of a Folding Wing Vehicle using Automatic Data Partitioning

Performance Analysis of Iteratively Decoded Convergent Source Mapping with Sphere Packing-Assisted Differential Space-Time Spreading Technique for Efficient Video Transmission

A fast and scalable ensemble of global models with long memory and data partitioning for the M5 forecasting competition

PhyloHerb: A phylogenomic pipeline for processing genome skimming data for plants

A Systematic Review of Federated Learning in the Healthcare Area: From the Perspective of Data Properties and Applications

Stochastic Computing co-processing elements for Evolving Autonomous Data Partitioning

Biotectonics of Sulawesi: Principles, methodology, and area relationships

Developing an Image-Based Deep Learning Framework for Automatic Scoring of The Pentagon Drawing Test

data partitioning
Recently Published Documents