initial dataset
Recently Published Documents

Multivariate chemical reaction optimization involving catalytic systems is a non-trivial task due to the high number of tuneable parameters and discrete choices. Closed-loop optimization featuring active Machine Learning (ML) represents a powerful strategy for automating reaction optimization. However, the translation of chemical reaction conditions into a machine-readable format comes with the challenge of finding highly informative features which accurately capture the factors for reaction success and allow the model to learn efficiently. Herein, we compare the efficacy of different calculated chemical descriptors for a high throughput generated dataset to determine the impact on a supervised ML model when predicting reaction yield. Then, the effect of featurization and size of the initial dataset within a closed-loop reaction optimization was examined. Finally, the balance between descriptor complexity and dataset size was considered. Ultimately, tailored descriptors did not outperform simple generic representations, however, a larger initial dataset accelerated reaction optimization.

Download Full-text

Chemometric Tools to Point Out Benchmarks and Chromophores in Pigments through Spectroscopic Data Analyses

Molecules ◽

10.3390/molecules27010163 ◽

2021 ◽

Vol 27 (1) ◽

pp. 163

Author(s):

Giulia Festa ◽

Claudia Scatigno ◽

Francesco Armetta ◽

Maria Luisa Saladino ◽

Veronica Ciaramitaro ◽

...

Keyword(s):

Principal Component ◽

Wall Painting ◽

Spectroscopic Techniques ◽

Systematic Classification ◽

Total Reflectance ◽

Spectral Bands ◽

Initial Dataset ◽

Speed Up ◽

The Impact ◽

Chemometric Tools

Spectral preprocessing data and chemometric tools are analytical methods widely applied in several scientific contexts i.e., in archaeometric applications. A systematic classification of natural powdered pigments of organic and inorganic nature through Principal Component Analysis with a multi-instruments spectroscopic study is presented here. The methodology allows the access to elementary and molecular unique benchmarks to guide and speed up the identification of an unknown pigment and its recipe. This study is conducted on a set of 48 powdered pigments and tested on a real-case sample from the wall painting in S. Maria Delle Palate di Tusa (Messina, Italy). Four spectroscopic techniques (X-ray Fluorescence, Raman, Attenuated Total Reflectance and Total Reflectance Infrared Spectroscopies) and six different spectrometers are tested to evaluate the impact of different setups. The novelty of the work is to use a systematic approach on this initial dataset using the entire spectroscopic energy range without any windows selection to solve problems linked with the manipulation of large analytes/materials to find an indistinct property of one or more spectral bands opening new frontiers in the dataset spectroscopic analyses.

Download Full-text

Evaluation of Diagnostic Workup and Etiology of Hypercalcemia of Malignancy in a Cohort of 167,551 Patients over 20 Years

Journal of the Endocrine Society ◽

10.1210/jendso/bvab157 ◽

2021 ◽

Author(s):

Michael T Sheehan ◽

Ya-Huei Li ◽

Suhail A Doi ◽

Adedayo A Onitilo

Keyword(s):

Vitamin D ◽

Parathyroid Hormone ◽

Medical Records ◽

Laboratory Data ◽

Hypercalcemia Of Malignancy ◽

Patients With Cancer ◽

Initial Dataset ◽

Osteolytic Metastases ◽

The One ◽

Over Time

Abstract Context Hypercalcemia of malignancy (HCM) has not been studied in a fashion to determine all possible mechanisms of hypercalcemia in any given patient. Objective The two objectives were to assess the completeness of evaluation and to determine the distribution of etiologies of HCM in a contemporary cohort of patients. Methods A retrospective analysis was performed of patients with cancer who developed hypercalcemia over 20 years at a single health system. Laboratory data were electronically captured from medical records to identify cases of parathyroid hormone (PTH)-independent hypercalcemia. The records were then manually reviewed to confirm the diagnosis of HCM, document the extent of evaluation, and determine underlying etiology(ies) of HCM in each patient. Results The initial dataset included 167,551 adult patients with malignancy, of which 11,589 developed hypercalcemia. Of these, only a quarter (25.4%) had assessment of PTH with a third of the latter (30.9%) indicating PTH-independent hypercalcemia. Of those with PTH-independent hypercalcemia, a third (31.6%) had assessment of PTH-related peptide (PTHrP) and/or 1,25-dihydroxy vitamin D (1,25-OH vitamin D) and constituted the one hundred and fifty three cases of HCM examined in this study. Eighty three of these patients had an incomplete evaluation of their HCM. The distribution of etiologies of HCM was therefore determined from the remaining 70 patients who had assessment of all three possible etiologies (PTHrP, 1,25-OH vitamin D and skeletal imaging) and was as follows: PTHrP 27%, osteolytic metastases 50% and 1,25-OH vitamin D 39%, with combinations of etiologies being common (approximately 20%). Conclusion HCM is incompletely evaluated in many patients. The distribution of etiologies of HCM in this report differs significantly from the previous literature warranting further study to determine if its causes have indeed changed over time.

Download Full-text

Fake News on the Covid-19 outbreak: a new metadata-based dataset for the analysis of Brazilian and British Twitter posts

10.5753/sbseg.2021.17332 ◽

2021 ◽

Author(s):

Tuany Mariah Lima do Nascimento ◽

Laura Emmanuella Alves dos Santos Santana ◽

Márjory Da Costa Abreu

Keyword(s):

Political Campaigns ◽

Fake News ◽

Life And Death ◽

Political Propaganda ◽

The People ◽

Political Views ◽

Initial Dataset ◽

The Difference ◽

The Uk ◽

The Impact

The dissemination of fake news is a problem that has already been addressed but by no means is solved. After the manipulation made by Cambridge Analytica which was based on classifying users by their political views and targeting specific political propaganda on the Brexit campaign, the Trump election and the Bolsonaro election, there is no doubt this issue can have a real impact on society in ‘normal times’. During a pandemic, any type of fake news can be the difference between life and death when the data shared can directly hurt the people who are believing in it. Moreover, there is also a new trend of using artificial robots to disseminate such news with a special target on Twitter which can be linked with political campaigns. Thus, it is essential that we identify and understand what kind of news is selected to be 'dressed' as fake and how it is disseminated. This paper aims to investigate the dissemination of fake news related with Covid-19 in the UK and Brazil in order to understand the impact of fake news on public sector actions, social isolation and quarantine imposition. Those two case studies are well versed on the fake news dissemination. Our initial dataset of Twitter posts have focused on posts from four different cities (Natal, São Paulo, Sheffield and London) and have shown interesting pointers that will be discussed.

Download Full-text

Serum glycoprotein markers in non-alcoholic steatohepatitis and hepatocellular carcinoma

10.1101/2021.09.30.462486 ◽

2021 ◽

Author(s):

Prasanna Ramachandran ◽

Gege Xu ◽

Hector Han-Li Huang ◽

Rachel Rice ◽

Bo Zhou ◽

...

Keyword(s):

Hepatocellular Carcinoma ◽

Liver Disease ◽

Relative Abundance ◽

Serum Proteins ◽

Independent Set ◽

Diagnostic Tools ◽

Serum Samples ◽

Alcoholic Steatohepatitis ◽

Healthy State ◽

Initial Dataset

AbstractFatty liver disease progresses through stages of fat accumulation and inflammation to non-alcoholic steatohepatitis (NASH), fibrosis and cirrhosis and eventually hepatocellular carcinoma (HCC). Currently available diagnostic tools for HCC lack sensitivity and specificity and deliver little value to patients. In this study, we investigated the use of circulating serum glycoproteins to identify a panel of potential prognostic markers that may be indicative of progression from the healthy state to NASH and further to HCC. Serum samples were processed using a standard pre-analytical sample preparation protocol and were analyzed using a novel high throughput glycoproteomics platform. Relative abundance of 413 glycopeptides, representing 57 abundant serum proteins were determined and compared among the three phenotypes. We used PB-net, a peak picking software built in-house, to quantify area under the peaks. Our initial dataset, containing healthy, NASH, and HCC serum samples, yielded several glycopeptides that demonstrated statistically significant differences in abundances in NASH and HCC compared to controls. We analyzed the relative abundance of common glycoforms and observed higher levels of core-fucosylated, sialylated and branched glycans, in NASH and HCC as compared to controls. We replicated these findings in an independent set of samples of individuals with benign liver conditions and HCC, respectively. Glycoproteomic analysis of serum proteins is a novel source of prognostic biomarkers differentially associated with absence of liver disease vs. NASH vs. HCC, respectively. Our results may be of value in the management of patients with liver disease.

Download Full-text

A curated dataset for data-driven turbulence modelling

Scientific Data ◽

10.1038/s41597-021-01034-2 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Ryley McConkey ◽

Eugene Yee ◽

Fue-Sang Lien

Keyword(s):

Machine Learning ◽

Turbulence Models ◽

Turbulence Modelling ◽

Navier Stokes ◽

Square Duct ◽

Eddy Simulation ◽

Initial Dataset ◽

Diverging Channel ◽

Rans Models ◽

New Feature

AbstractThe recent surge in machine learning augmented turbulence modelling is a promising approach for addressing the limitations of Reynolds-averaged Navier-Stokes (RANS) models. This work presents the development of the first open-source dataset, curated and structured for immediate use in machine learning augmented corrective turbulence closure modelling. The dataset features a variety of RANS simulations with matching direct numerical simulation (DNS) and large-eddy simulation (LES) data. Four turbulence models are selected to form the initial dataset: k-ε, k-ε-ϕt-f, k-ω, and k-ω SST. The dataset consists of 29 cases per turbulence model, for several parametrically sweeping reference DNS/LES cases: periodic hills, square duct, parametric bumps, converging-diverging channel, and a curved backward-facing step. At each of the 895,640 points, various RANS features with DNS/LES labels are available. The feature set includes quantities used in current state-of-the-art models, and additional fields which enable the generation of new feature sets. The dataset reduces effort required to train, test, and benchmark new corrective RANS models. The dataset is available at 10.34740/kaggle/dsv/2637500.

Download Full-text

Parallel Network Analysis and Communities Detection (PANC) Pipeline for the Analysis and Visualization of COVID-19 Data

Parallel Processing Letters ◽

10.1142/s0129626421420020 ◽

2021 ◽

pp. 2142002

Author(s):

Giuseppe Agapito ◽

Marianna Milano ◽

Mario Cannataro

Keyword(s):

Network Analysis ◽

Real Data ◽

Detection Algorithm ◽

Data Sets ◽

Data Set ◽

Italian Regions ◽

Initial Dataset ◽

Parallel Network ◽

Community Detection Algorithm ◽

Similarity Matrices

A new coronavirus, causing a severe acute respiratory syndrome (COVID-19), was started at Wuhan, China, in December 2019. The epidemic has rapidly spread across the world becoming a pandemic that, as of today, has affected more than 70 million people causing over 2 million deaths. To better understand the evolution of spread of the COVID-19 pandemic, we developed PANC (Parallel Network Analysis and Communities Detection), a new parallel preprocessing methodology for network-based analysis and communities detection on Italian COVID-19 data. The goal of the methodology is to analyze set of homogeneous datasets (i.e. COVID-19 data in several regions) using a statistical test to find similar/dissimilar behaviours, mapping such similarity information on a graph and then using community detection algorithm to visualize and analyze the initial dataset. The methodology includes the following steps: (i) a parallel methodology to build similarity matrices that represent similar or dissimilar regions with respect to data; (ii) an effective workload balancing function to improve performance; (iii) the mapping of similarity matrices into networks where nodes represent Italian regions, and edges represent similarity relationships; (iv) the discovering and visualization of communities of regions that show similar behaviour. The methodology is general and can be applied to world-wide data about COVID-19, as well as to all types of data sets in tabular and matrix format. To estimate the scalability with increasing workloads, we analyzed three synthetic COVID-19 datasets with the size of 90.0[Formula: see text]MB, 180.0[Formula: see text]MB, and 360.0[Formula: see text]MB. Experiments was performed on showing the amount of data that can be analyzed in a given amount of time increases almost linearly with the number of computing resources available. Instead, to perform communities detection, we employed the real data set.

Download Full-text

Augmented Data and XGBoost Improvement for Sales Forecasting in the Large-Scale Retail Sector

Applied Sciences ◽

10.3390/app11177793 ◽

2021 ◽

Vol 11 (17) ◽

pp. 7793

Author(s):

Alessandro Massaro ◽

Antonio Panarese ◽

Daniele Giannone ◽

Angelo Galiano

Keyword(s):

Mean Square Error ◽

Large Scale ◽

Learning Algorithm ◽

Training Model ◽

Gradient Boosting ◽

Retail Sector ◽

Mean Square ◽

Extreme Gradient Boosting ◽

Initial Dataset ◽

Order Of Magnitude

The organized large-scale retail sector has been gradually establishing itself around the world, and has increased activities exponentially in the pandemic period. This modern sales system uses Data Mining technologies processing precious information to increase profit. In this direction, the extreme gradient boosting (XGBoost) algorithm was applied in an industrial project as a supervised learning algorithm to predict product sales including promotion condition and a multiparametric analysis. The implemented XGBoost model was trained and tested by the use of the Augmented Data (AD) technique in the event that the available data are not sufficient to achieve the desired accuracy, as for many practical cases of artificial intelligence data processing, where a large dataset is not available. The prediction was applied to a grid of segmented customers by allowing personalized services according to their purchasing behavior. The AD technique conferred a good accuracy if compared with results adopting the initial dataset with few records. An improvement of the prediction error, such as the Root Mean Square Error (RMSE) and Mean Square Error (MSE), which decreases by about an order of magnitude, was achieved. The AD technique formulated for large-scale retail sector also represents a good way to calibrate the training model.

Download Full-text

ARPEGGIO: Automated Reproducible Polyploid EpiGenetic GuIdance workflOw

BMC Genomics ◽

10.1186/s12864-021-07845-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Stefan Milosavljevic ◽

Tony Kuo ◽

Samuele Decarli ◽

Lucas Mohn ◽

Jun Sese ◽

...

Keyword(s):

Dna Methylation ◽

Molecular Mechanisms ◽

Sequence Similarity ◽

Ground Truth ◽

Whole Genome ◽

Initial Dataset ◽

Quality Checks ◽

Living Organisms ◽

Set Up

Abstract Background Whole genome duplication (WGD) events are common in the evolutionary history of many living organisms. For decades, researchers have been trying to understand the genetic and epigenetic impact of WGD and its underlying molecular mechanisms. Particular attention was given to allopolyploid study systems, species resulting from an hybridization event accompanied by WGD. Investigating the mechanisms behind the survival of a newly formed allopolyploid highlighted the key role of DNA methylation. With the improvement of high-throughput methods, such as whole genome bisulfite sequencing (WGBS), an opportunity opened to further understand the role of DNA methylation at a larger scale and higher resolution. However, only a few studies have applied WGBS to allopolyploids, which might be due to lack of genomic resources combined with a burdensome data analysis process. To overcome these problems, we developed the Automated Reproducible Polyploid EpiGenetic GuIdance workflOw (ARPEGGIO): the first workflow for the analysis of epigenetic data in polyploids. This workflow analyzes WGBS data from allopolyploid species via the genome assemblies of the allopolyploid’s parent species. ARPEGGIO utilizes an updated read classification algorithm (EAGLE-RC), to tackle the challenge of sequence similarity amongst parental genomes. ARPEGGIO offers automation, but more importantly, a complete set of analyses including spot checks starting from raw WGBS data: quality checks, trimming, alignment, methylation extraction, statistical analyses and downstream analyses. A full run of ARPEGGIO outputs a list of genes showing differential methylation. ARPEGGIO was made simple to set up, run and interpret, and its implementation ensures reproducibility by including both package management and containerization. Results We evaluated ARPEGGIO in two ways. First, we tested EAGLE-RC’s performance with publicly available datasets given a ground truth, and we show that EAGLE-RC decreases the error rate by 3 to 4 times compared to standard approaches. Second, using the same initial dataset, we show agreement between ARPEGGIO’s output and published results. Compared to other similar workflows, ARPEGGIO is the only one supporting polyploid data. Conclusions The goal of ARPEGGIO is to promote, support and improve polyploid research with a reproducible and automated set of analyses in a convenient implementation. ARPEGGIO is available at https://github.com/supermaxiste/ARPEGGIO.

Download Full-text

Real-World Marine Radar Datasets for Evaluating Target Tracking Methods

Sensors ◽

10.3390/s21144641 ◽

2021 ◽

Vol 21 (14) ◽

pp. 4641

Author(s):

Jaya Shradha Fowdur ◽

Marcus Baum ◽

Frank Heymann

Keyword(s):

Target Tracking ◽

Target Detection ◽

Situation Awareness ◽

Real World ◽

Autonomous Navigation ◽

Automatic Identification ◽

Identification System ◽

Initial Dataset ◽

Marine Radar ◽

German Aerospace

As autonomous navigation is being implemented in several areas including the maritime domain, the need for robust tracking is becoming more important for traffic situation awareness, assessment and monitoring. We present an online repository comprising three designated marine radar datasets from real-world measurement campaigns to be employed for target detection and tracking research purposes. The datasets have their respective reference positions on the basis of the Automatic Identification System (AIS). Together with the methods used for target detection and clustering, a novel baseline algorithm for an extended centroid-based multiple target tracking is introduced and explained. We compare the performance of our algorithm to its standard version on the datasets using the AIS references. The results obtained and some initial dataset specific analysis are presented. The datasets, under the German Aerospace Centre (DLR)’s terms and agreements, can be procured from the company website’s URL provided in the article.

Download Full-text

initial datasetRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

The Effect of Chemical Representation on Active Machine Learning Towards Closed-Loop Optimization

Chemometric Tools to Point Out Benchmarks and Chromophores in Pigments through Spectroscopic Data Analyses

Evaluation of Diagnostic Workup and Etiology of Hypercalcemia of Malignancy in a Cohort of 167,551 Patients over 20 Years

Fake News on the Covid-19 outbreak: a new metadata-based dataset for the analysis of Brazilian and British Twitter posts

Serum glycoprotein markers in non-alcoholic steatohepatitis and hepatocellular carcinoma

A curated dataset for data-driven turbulence modelling

Parallel Network Analysis and Communities Detection (PANC) Pipeline for the Analysis and Visualization of COVID-19 Data

Augmented Data and XGBoost Improvement for Sales Forecasting in the Large-Scale Retail Sector

ARPEGGIO: Automated Reproducible Polyploid EpiGenetic GuIdance workflOw

Real-World Marine Radar Datasets for Evaluating Target Tracking Methods

initial dataset
Recently Published Documents