heterogeneous datasets
Recently Published Documents





Zulqarnain Nazir ◽  
Khurram Shahzad ◽  
Muhammad Kamran Malik ◽  
Waheed Anwar ◽  
Imran Sarwar Bajwa ◽  

Authorship attribution refers to examining the writing style of authors to determine the likelihood of the original author of a document from a given set of potential authors. Due to the wide range of authorship attribution applications, a plethora of studies have been conducted for various Western, as well as Asian, languages. However, authorship attribution research in the Urdu language has just begun, although Urdu is widely acknowledged as a prominent South Asian language. Furthermore, the existing studies on authorship attribution in Urdu have addressed a considerably easier problem of having fewer than 20 candidate authors, which is far from real-world settings. Therefore, the findings from these studies may not be applicable to real-world settings. To that end, we have made three key contributions: First, we have developed a large authorship attribution corpus for Urdu, which is a low-resource language. The corpus is composed of over 2.6 million tokens and 21,938 news articles by 94 authors, which makes it a closer substitute to real-world settings. Second, we have analyzed hundreds of stylometry features used in the literature to identify 194 features that are applicable to the Urdu language and developed a taxonomy of these features. Finally, we have performed 66 experiments using two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show the following: (a) our developed corpus is many times larger than the existing corpora, and it is more challenging than its counterparts for the authorship attribution task, and (b) the Convolutional Neural Network is the most effective technique, as it achieved a nearly perfect F1 score of 0.989 for an existing corpus and 0.910 for our newly developed corpus.
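
For readers unfamiliar with the F1 score reported above, the following is a minimal sketch of how a multi-class F1 could be computed. It assumes macro averaging (the unweighted mean of per-class F1), which the abstract does not specify; the function name `macro_f1` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores.

    Assumption: macro averaging; the abstract does not say which
    averaging scheme the authors used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct attribution for this author
        else:
            fp[pred] += 1    # document wrongly assigned to `pred`
            fn[truth] += 1   # document missed for `truth`

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy example with three candidate authors
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "b", "c", "c", "c"]
score = macro_f1(y_true, y_pred)
```

With many candidate authors (94 in the new corpus), macro averaging weights every author equally regardless of how many articles each contributed, which is one reason it is a common choice for this kind of task.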

2021 ◽  
Vol 13 (24) ◽  
pp. 5138
Seyd Teymoor Seydi ◽  
Mahdi Hasanlou ◽  
Jocelyn Chanussot

Wildfires are one of the most destructive natural disasters that can affect our environment, with significant effects also on wildlife. Recently, climate change and human activities have resulted in higher frequencies of wildfires throughout the world. Timely and accurate detection of the burned areas can help to make decisions for their management. Remote sensing satellite imagery can have a key role in mapping burned areas due to its wide coverage, high-resolution data collection, and low capture times. However, although many studies have reported on burned area mapping based on remote sensing imagery in recent decades, accurate burned area mapping remains a major challenge due to the complexity of the background and the diversity of the burned areas. This paper presents a novel framework for burned area mapping based on Deep Siamese Morphological Neural Network (DSMNN-Net) and heterogeneous datasets. The DSMNN-Net framework is based on change detection through proposing a pre/post-fire method that is compatible with heterogeneous remote sensing datasets. The proposed network combines multiscale convolution layers and morphological layers (erosion and dilation) to generate deep features. To evaluate the performance of the method proposed here, two case study areas in Australian forests were selected. The framework used can better detect burned areas compared to other state-of-the-art burned area mapping procedures, with a performance of >98% for overall accuracy index, and a kappa coefficient of >0.9, using multispectral Sentinel-2 and hyperspectral PRISMA image datasets. The analyses of the two datasets illustrate that the DSMNN-Net is sufficiently valid and robust for burned area mapping, and especially for complex areas.
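
The two metrics quoted above, overall accuracy and the kappa coefficient, can both be derived from a confusion matrix. The sketch below assumes a binary burned/unburned map and Cohen's kappa; the function name and the pixel counts are hypothetical and are not results from the paper.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Overall accuracy and Cohen's kappa from binary confusion-matrix counts.

    tp: burned pixels mapped as burned      fp: unburned pixels mapped as burned
    fn: burned pixels mapped as unburned    tn: unburned pixels mapped as unburned
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    # Observed agreement is the overall accuracy
    observed = (tp + tn) / total

    # Agreement expected by chance, from the row and column marginals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (total * total)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all pixels fall in one class")

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical pixel counts, chosen only to make the arithmetic easy to follow
overall_accuracy, kappa = accuracy_and_kappa(tp=40, fp=10, fn=5, tn=45)
```

Kappa discounts the agreement that would occur by chance, so it is usually lower than overall accuracy when one class (typically unburned background) dominates the scene.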

2021 ◽  
Joaquin Torres-Sospedra ◽  
Ivo Silva ◽  
Lucie Klus ◽  
Darwin Quezada-Gaibor ◽  
Antonino Crivello ◽  

2021 ◽  
Vol 8 (12) ◽  
pp. 193
Andrea Bizzego ◽  
Giulio Gabrieli ◽  
Michelle Jin Yee Neoh ◽  
Gianluca Esposito

Deep learning (DL) has greatly contributed to bioelectric signal processing, in particular to the extraction of physiological markers. However, the efficacy and applicability of the results proposed in the literature are often constrained to the population represented by the data used to train the models. In this study, we investigate the issues related to applying a DL model to heterogeneous datasets. In particular, by focusing on heart beat detection from electrocardiogram (ECG) signals, we show that the performance of a model trained on data from healthy subjects decreases when applied to patients with cardiac conditions and to signals collected with different devices. We then evaluate the use of transfer learning (TL) to adapt the model to the different datasets. In particular, we show that the classification performance is improved, even with datasets with a small sample size. These results suggest that a greater effort should be made towards the generalizability of DL models applied to bioelectric signals, in particular, by retrieving more representative datasets.

2021 ◽  
Vol 21 (1) ◽  
Baochun He ◽  
Dalong Yin ◽  
Xiaoxia Chen ◽  
Huoling Luo ◽  
Deqiang Xiao ◽  

Background: Most existing algorithms have focused on segmentation from several public liver CT datasets scanned regularly (no pneumoperitoneum and horizontal supine position). This study primarily segmented datasets with unconventional liver shapes and intensities deduced by contrast phases, irregular scanning conditions, different scanning objects of pigs and patients with large pathological tumors, which formed the multiple heterogeneity of datasets used in this study. Methods: The multiple heterogeneous datasets used in this paper include: (1) one public contrast-enhanced CT dataset and one public non-contrast CT dataset; (2) a contrast-enhanced dataset that has abnormal liver shape with very long left liver lobes and large-sized liver tumors with abnormal presets deduced by microvascular invasion; (3) one artificial pneumoperitoneum dataset under the pneumoperitoneum and three scanning profiles (horizontal/left/right recumbent position); (4) two porcine datasets of Bama type and domestic type that contain pneumoperitoneum cases but with large anatomy discrepancy with humans. The study aimed to investigate the segmentation performance of 3D U-Net in: (1) generalization ability between multiple heterogeneous datasets by cross-testing experiments; (2) the compatibility when hybrid training all datasets in different sampling and encoder layer sharing schema. We further investigated the compatibility of encoder level by setting separate level for each dataset (i.e., dataset-wise convolutions) while sharing the decoder. Results: Models trained on different datasets have different segmentation performance. The prediction accuracy between the LiTS dataset and the Zhujiang dataset was about 0.955 and 0.958, which shows their good generalization ability because they were all contrast-enhanced clinical patient datasets scanned regularly. For the datasets scanned under pneumoperitoneum, their corresponding datasets scanned without pneumoperitoneum showed good generalization ability. The dataset-wise convolution module at high levels can improve the dataset imbalance problem. The experimental results will help researchers devise solutions when segmenting those special datasets. Conclusions: (1) Regularly scanned datasets generalize well to irregularly scanned ones. (2) Hybrid training is beneficial, but the dataset imbalance problem always exists due to the multi-domain homogeneity. The higher levels encoded more domain-specific information than lower levels and thus were less compatible in terms of our datasets.

2021 ◽  
Esten H Leonardsen ◽  
Han Peng ◽  
Tobias Kaufmann ◽  
Ingrid Agartz ◽  
Ole A Andreassen ◽  

The discrepancy between chronological age and the apparent age of the brain based on neuroimaging data - the brain age delta - has emerged as a reliable marker of brain health. With an increasing wealth of data, approaches to tackle heterogeneity in data acquisition are vital. To this end, we compiled raw structural magnetic resonance images into one of the largest and most diverse datasets assembled (n=53542), and trained convolutional neural networks (CNNs) to predict age. We achieved state-of-the-art performance on unseen data from unknown scanners (n=2553), and showed that higher brain age delta is associated with diabetes, alcohol intake and smoking. Using transfer learning, the intermediate representations learned by our model complemented and partly outperformed brain age delta in predicting common brain disorders. Our work shows we can achieve generalizable and biologically plausible brain age predictions using CNNs trained on heterogeneous datasets, and transfer them to clinical use cases.
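
As a small illustration of the quantity defined above, the sketch below computes a brain age delta per subject. It assumes the common sign convention of predicted age minus chronological age, so that a positive delta means the brain appears older than the person; the function name and the example ages are hypothetical.

```python
def brain_age_delta(predicted, chronological):
    """Per-subject brain age delta, in years.

    Assumption: delta = predicted age - chronological age, so a positive
    value means the brain looks older than the subject actually is.
    """
    if len(predicted) != len(chronological):
        raise ValueError("inputs must have the same length")
    return [p - c for p, c in zip(predicted, chronological)]


# Hypothetical model outputs and true ages for two subjects
deltas = brain_age_delta(predicted=[70.0, 55.5], chronological=[65.0, 60.0])
```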

Ghada Alqubati ◽  
Ghaleb Algaphari

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder. It can have a massive impact on a patient's memory and mobility. As this disease is irreversible, early diagnosis is crucial for delaying the symptoms and adjusting the patient's lifestyle. Many machine learning (ML) and deep learning (DL) based approaches have been proposed to accurately predict AD before the onset of its symptoms. However, finding the most effective approach for early prediction of AD is still challenging. This review explored 24 papers published from 2018 until 2021. These papers have proposed different approaches using state-of-the-art machine learning and deep learning algorithms on different biomarkers for the early detection of AD. The review explored them from different perspectives to derive potential research gaps and draw conclusions and recommendations. It classified these recent approaches in terms of the learning technique used and AD biomarkers. It summarized and compared their findings, and defined their strengths and limitations. It also provided a summary of the common AD biomarkers. From this review, it was found that some approaches strove to increase the prediction accuracy regardless of their complexity, such as using heterogeneous datasets, while others sought to find the most practical and affordable ways to predict the disease and yet achieve good accuracy, such as using audio data. It was also noticed that DL-based approaches with image biomarkers remarkably surpassed ML-based approaches. However, they performed poorly with genetic variants data. Despite the great importance of genetic variants biomarkers, their large variance and complexity could lead to a complex approach or poor accuracy. These data are crucial to discover the underlying structure of AD and detect it at early stages. However, an effective pre-processing approach is still needed to refine these data and employ them efficiently using the powerful DL algorithms.

2021 ◽  
Florian Schmidt ◽  
Alexander Marx ◽  
Nina Baumgarten ◽  
Marie Hebel ◽  
Martin Wegner ◽  

Understanding how epigenetic variation in non-coding regions is involved in distal gene-expression regulation is an important problem. Regulatory regions can be associated with genes using large-scale datasets of epigenetic and expression data. However, for regions of complex epigenomic signals and enhancers that regulate many genes, it is difficult to understand these associations. We present StitchIt, an approach to dissect epigenetic variation in a gene-specific manner for the detection of regulatory elements (REMs) without relying on peak calls in individual samples. StitchIt segments epigenetic signal tracks over many samples to generate the location and the target genes of a REM simultaneously. We show that this approach leads to a more accurate and refined REM detection compared to standard methods even on heterogeneous datasets, which are challenging to model. Also, StitchIt REMs are highly enriched in experimentally determined chromatin interactions and expression quantitative trait loci. We validated several newly predicted REMs using CRISPR-Cas9 experiments, thereby demonstrating the reliability of StitchIt. StitchIt is able to dissect regulation in superenhancers and predicts thousands of putative REMs that go unnoticed using peak-based approaches, suggesting that a large part of the regulome might be uncharted water.

2021 ◽  
Vol 11 (17) ◽  
pp. 8275
Ganesh Kumar ◽  
Shuib Basri ◽  
Abdullahi Abubakar Imam ◽  
Sunder Ali Khowaja ◽  
Luiz Fernando Capretz ◽  

As data size increases drastically, its variety also increases. Investigating such heterogeneous data is one of the most challenging tasks in information management and data analytics. The heterogeneity and decentralization of data sources affect data visualization and prediction, thereby influencing analytical results accordingly. Data harmonization (DH) is a field that unifies the representation of such disparate data. Over the years, multiple solutions have been developed to minimize the heterogeneity aspects and disparity in formats of big-data types. In this study, a systematic review of the literature was conducted to assess the state-of-the-art DH techniques. This study aimed to understand the issues faced due to heterogeneity, the need for DH and the techniques that deal with substantial heterogeneous textual datasets. The process produced 1355 articles, but among them, only 70 articles were found to be relevant after applying inclusion and exclusion criteria. The result shows that the heterogeneity of structured, semi-structured, and unstructured (SSU) data can be managed by using DH and its core techniques, such as text preprocessing, Natural Language Processing (NLP), machine learning (ML), and deep learning (DL). These techniques are applied to many real-world applications centered on the information-retrieval domain. Several assessment criteria were implemented to measure the efficiency of these techniques, such as precision, recall, F-1, accuracy, and time. A detailed explanation of each research question, common techniques, and performance measures is also discussed. Lastly, we present readers with a detailed discussion of the existing work, contributions, and managerial and academic implications, along with the conclusion, limitations, and future research directions.
