Employment of Machine Learning Models Yields Highly Accurate Hematological Disease Prediction from Raw Flow Cytometry Matrix Data without the Need for Visualization or Human Intervention

Background: Machine learning (ML) offers automated data processing that can substitute for various analysis steps. So far, it has been applied to flow cytometry (FC) data only after visualization, which may compromise the data by reducing their dimensionality. Automated analysis of FC raw matrix data has not yet been pursued. Aim: To establish, as a proof of concept, an ML-based classifier that processes FC matrix data to predict the correct lymphoma type without the need for visualization or human analysis and interpretation. Methods: A set of 6,393 uniformly analyzed samples (Navios cytometers, Kaluza software, Beckman Coulter, Miami, FL) was used for training (n=5,115) and testing (n=1,278) of different ML models. Entities (training/testing counts) were chronic lymphocytic leukemia (CLL, 1,103/279), monoclonal B-cell lymphocytosis (MBL, 831/203), CLL with increased prolymphocytes (CLL-PL, 649/161), lymphoplasmacytic lymphoma (LPL, 560/159), hairy cell leukemia (HCL, 328/88), mantle cell lymphoma (MCL, 259/53), marginal zone lymphoma (MZL, 90/28), follicular lymphoma (FL, 84/16), and no lymphoma (1,211/291). Three tubes comprising 11 parameters per tube were applied. Besides scatter signals, the analyzed antigens included CD3, CD4, CD5, CD8, CD10, CD11c, CD19, CD20, CD22, CD23, CD25, CD38, CD45, CD56, CD79b, CD103, FMC7, HLA-DR, IgM, Kappa, Lambda. Measurements generated LMD files with 50,000 rows of data for each of the 11 parameters. After removing the saturated values (≥ 1023), we produced binned histograms with 16 predefined frequency bins per parameter. Histograms were converted to cumulative distribution functions (CDF) for the respective parameters and concatenated to produce a 16x11 matrix per tube. Under the assumption that the parameters are independent, this simplification of concatenating CDFs represents the same information as if they were jointly distributed.
The first matrix-based classifier was a decision tree model (DT), the second a deep learning model (DL), and the third an XGBoost (XG) model, an implementation of gradient boosted decision trees ideal for structured tabular data (such as LMD files). The first set of analyses included only three classes which are readily separated by human operators: 1) CLL, 2) HCL, 3) no lymphoma. The second set included all nine entities but grouped into four classes: 1) CD5+ lymphoma (CLL, MBL, CLL-PL, MCL), 2) HCL, 3) other CD5- lymphoma (LPL, MZL, FL), 4) no lymphoma. The third set included each of the nine entities as its own class. Results: Analyzing the three classes from the first set (CLL, HCL, no lymphoma), the models achieved accuracies of 94% (DT), 95% (DL) and 96% (XG) when including all cases. When the analysis was restricted to cases with prediction probabilities above 90%, DT reached 97%, DL 97% and XG 98% accuracy, whilst losing 38%, 8% and 6% of samples, respectively. We further observed that accuracy also depended on the size of the pathologic clone, which is in line with the experience of human experts, for whom very small clones (≤ 0.1% of leukocytes) represent a major challenge for correct classification. Focusing on cases with clones > 0.1% but considering all prediction probabilities, accuracies were 96% (DT), 97% (DL) and 98% (XG), with loss of 5% of samples for each model. Considering only cases with prediction probabilities > 90% and clones > 0.1%, accuracies were 97% (DT), 99% (DL) and 99% (XG), whilst losing 38%, 9% and 9% of samples, respectively. Further analyses were performed applying the best model based on the results above, i.e. XG. Analyzing four classes in the second set of analyses (CD5+ lymphoma, HCL, other CD5- lymphoma, no lymphoma) and considering only cases with prediction probabilities > 95% and clones > 0.1%, accuracy was 96% while losing 28% of samples.
In the third set of analyses, with each entity assigned its own class and again considering only cases with prediction probabilities > 95% and clones > 0.1%, accuracy was 93% while losing 28% of samples. Conclusions: This first ML-based classifier, using the XGBoost model and transforming FC matrix data into concatenated distributions, is capable of correctly assigning the vast majority of lymphoma samples by analyzing FC raw data without visualization or human interpretation. Cases that need further attention by human experts will be flagged but will not account for more than 30% of all cases. These data will be extended in a prospective blinded study (clinicaltrials.gov NCT4466059). Disclosures: Heo: AWS: Current Employment. Wetton: AWS: Current Employment.
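The preprocessing and the probability cut-off described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the 16 bins, the saturation limit of 1023 and the 11 parameters come from the abstract, while the equal-width bin edges and the function names `binned_cdf`, `tube_matrix` and `accuracy_above` are assumptions made for the example.

```python
N_BINS = 16        # bins per parameter (from the abstract)
SATURATION = 1023  # values >= 1023 are dropped as saturated (from the abstract)


def binned_cdf(values, n_bins=N_BINS, saturation=SATURATION):
    """Histogram one parameter into n_bins bins and return its cumulative distribution."""
    kept = [v for v in values if v < saturation]
    width = saturation / n_bins  # assumption: equal-width bins on [0, saturation)
    counts = [0] * n_bins
    for v in kept:
        index = min(max(int(v / width), 0), n_bins - 1)
        counts[index] += 1
    cdf, running = [], 0
    for count in counts:
        running += count
        cdf.append(running / len(kept) if kept else 0.0)
    return cdf


def tube_matrix(events):
    """Turn rows of events (one value per parameter) into an n_bins x n_parameters matrix."""
    n_params = len(events[0])
    columns = [binned_cdf([row[p] for row in events]) for p in range(n_params)]
    return [[column[b] for column in columns] for b in range(N_BINS)]


def accuracy_above(top_probabilities, predicted, actual, threshold):
    """Accuracy on cases whose top class probability exceeds threshold, and the share set aside."""
    kept = [(p, a) for prob, p, a in zip(top_probabilities, predicted, actual)
            if prob > threshold]
    if not kept:
        return None, 1.0
    accuracy = sum(p == a for p, a in kept) / len(kept)
    return accuracy, 1 - len(kept) / len(predicted)
```

Each column of the resulting matrix is a non-decreasing curve ending at 1, so a tabular model sees one fixed-length feature vector per tube regardless of how many events were acquired. The second return value of `accuracy_above` corresponds to the "loss of samples" figures quoted in the Results.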

Machine Learning Models to Predict Cognitive Impairment of Rodents Subjected to Space Radiation

Frontiers in Systems Neuroscience ◽ 10.3389/fnsys.2021.713131 ◽ 2021 ◽ Vol 15

Author(s): Mona Matar ◽ Suleyman A. Gokoglu ◽ Matthew T. Prelich ◽ Christopher A. Gallo ◽ Asad K. Iqbal ◽ ...

Keyword(s): Machine Learning ◽ Cognitive Impairment ◽ Radiation Exposure ◽ Space Radiation ◽ Distribution Functions ◽ Cumulative Distribution ◽ Support Vector ◽ Control Group ◽ Significant Finding ◽ Dose Dependent

This research uses machine-learned computational analyses to predict the cognitive performance impairment of rats induced by irradiation. The experimental data in the analyses is from a rodent model exposed to ≤15 cGy of individual galactic cosmic radiation (GCR) ions: 4He, 16O, 28Si, 48Ti, or 56Fe, expected for a Lunar or Mars mission. This work investigates rats at a subject-based level and uses performance scores taken before irradiation to predict impairment in attentional set-shifting (ATSET) data post-irradiation. Here, the worst performing rats of the control group define the impairment thresholds based on population analyses via cumulative distribution functions, leading to the labeling of impairment for each subject. A significant finding is the exhibition of a dose-dependent increasing probability of impairment for 1 to 10 cGy of 28Si or 56Fe in the simple discrimination (SD) stage of the ATSET, and for 1 to 10 cGy of 56Fe in the compound discrimination (CD) stage. On a subject-based level, implementing machine learning (ML) classifiers such as the Gaussian naïve Bayes, support vector machine, and artificial neural networks identifies rats that have a higher tendency for impairment after GCR exposure. The algorithms employ the experimental prescreen performance scores as multidimensional input features to predict each rodent’s susceptibility to cognitive impairment due to space radiation exposure. The receiver operating characteristic and the precision-recall curves of the ML models show a better prediction of impairment when 56Fe is the ion in question in both SD and CD stages. They, however, do not depict impairment due to 4He in SD and 28Si in CD, suggesting no dose-dependent impairment response in these cases. One key finding of our study is that prescreen performance scores can be used to predict the ATSET performance impairments. 
This result is significant to crewed space missions as it supports the potential of predicting an astronaut’s impairment in a specific task before spaceflight through the implementation of appropriately trained ML tools. Future research can focus on constructing ML ensemble methods to integrate the findings from the methodologies implemented in this study for more robust predictions of cognitive decrements due to space radiation exposure.
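The labeling step, in which the worst-performing control rats define an impairment threshold through a cumulative distribution function, might look like the sketch below. The abstract does not state which percentile was used or the direction of the score, so the `percentile` argument, the convention that a higher score means worse performance, and the function names are all assumptions for illustration.

```python
def empirical_cdf_count(sample, x):
    """Number of observations in sample that are <= x (numerator of the empirical CDF)."""
    return sum(v <= x for v in sample)


def impairment_threshold(control_scores, percentile):
    """Smallest control score at which the empirical CDF reaches `percentile` percent.

    Integer arithmetic is used so the comparison is exact.
    Assumption: higher scores mean worse performance.
    """
    n = len(control_scores)
    for score in sorted(control_scores):
        if empirical_cdf_count(control_scores, score) * 100 >= percentile * n:
            return score
    return max(control_scores)


def label_impaired(scores, threshold):
    """Flag each subject whose score is worse (higher) than the control-derived threshold."""
    return [score > threshold for score in scores]
```

With a hypothetical control group of ten animals and `percentile=80`, the two worst controls sit above the threshold, and irradiated animals are labeled against the same cut-off. Those labels would then serve as targets for the classifiers named in the abstract.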

Machine Learning Models Improve the Diagnostic Yield of Peripheral Blood Flow Cytometry

American Journal of Clinical Pathology ◽ 10.1093/ajcp/aqz150 ◽ 2019 ◽ Vol 153 (2) ◽ pp. 235-242

Author(s): M Lisa Zhang ◽ Alan X Guo ◽ Stephan Kadauke ◽ Anand S Dighe ◽ Jason M Baron ◽ ...

Keyword(s): Machine Learning ◽ Flow Cytometry ◽ Blood Flow ◽ Peripheral Blood ◽ Diagnostic Yield ◽ Absolute Lymphocyte Count ◽ Clinical History ◽ Diagnostic Value ◽ Peripheral Blood Flow ◽ Tree Model

Objectives: Peripheral blood flow cytometry (PBFC) is useful for evaluating circulating hematologic malignancies (HM) but has limited diagnostic value for screening. We used machine learning to evaluate whether clinical history and CBC/differential parameters could improve PBFC utilization. Methods: PBFC cases with concurrent/recent CBC/differential were split into training (n = 626) and test (n = 159) cohorts. We classified PBFC results with abnormal blast/lymphoid populations as positive and used two models to predict results. Results: Positive PBFC results were seen in 58% and 21% of training cases with and without prior HM (P < .001). % neutrophils, absolute lymphocyte count, and % blasts/other cells differed significantly between positive and negative PBFC groups (areas under the curve [AUC] > 0.7). Among test cases, a decision tree model achieved 98% sensitivity and 65% specificity (AUC = 0.906). A logistic regression model achieved 100% sensitivity and 54% specificity (AUC = 0.919). Conclusions: We outline machine learning-based triaging strategies to decrease unnecessary utilization of PBFC by 35% to 40%.
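The sensitivity, specificity and "tests avoided" figures in a triage setting like this one all follow from the four cells of a confusion matrix. The sketch below shows the arithmetic; the counts in the usage note are hypothetical round numbers chosen for illustration and are not taken from the study, and the function name is an assumption.

```python
def triage_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, and the share of tests a triage rule would skip.

    tp/fn: positive PBFC cases the model flags / misses.
    tn/fp: negative PBFC cases the model clears / flags anyway.
    A test is skipped whenever the model predicts negative (tn + fn).
    """
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    skipped_share = (tn + fn) / total
    return sensitivity, specificity, skipped_share
```

For example, `triage_metrics(tp=45, fn=5, tn=60, fp=40)` gives a sensitivity of 0.9 and a specificity of 0.6, with 65 of 150 tests skipped. A triage rule tuned for very high sensitivity accepts a lower specificity so that few positive cases fall into the skipped group.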

Advances in the Prediction of Protein Subcellular Locations with Machine Learning

Current Bioinformatics ◽ 10.2174/1574893614666181217145156 ◽ 2019 ◽ Vol 14 (5) ◽ pp. 406-421 ◽ Cited By ~ 3

Author(s): Ting-He Zhang ◽ Shao-Wu Zhang

Keyword(s): Machine Learning ◽ Feature Fusion ◽ Protein Sequences ◽ Subcellular Location ◽ Automated Analysis ◽ Cellular Level ◽ Machine Learning Algorithms ◽ Feature Representation ◽ Protein Subcellular Location ◽ Protein Subcellular Locations

Background: Revealing the subcellular location of a newly discovered protein can bring insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. Thus, computational methods for efficiently and effectively identifying protein subcellular locations are highly desirable. In particular, the rapidly increasing number of protein sequences entering the genome databases has called for the development of automated analysis methods. Methods: In this review, we describe the recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) Protein subcellular location benchmark dataset construction, ii) Protein feature representation and feature descriptors, iii) Common machine learning algorithms, iv) Cross-validation test methods and assessment metrics, v) Web servers. Result & Conclusion: Given the large number of protein sequences generated by high-throughput technologies, four future directions for predicting protein subcellular locations with machine learning deserve attention. One direction is the selection of novel and effective features (e.g., statistical, physicochemical, evolutionary) from the sequences and structures of proteins. Another is the feature fusion strategy. The third is the design of a powerful predictor, and the fourth is the prediction of proteins with multiple location sites.

OutlierNets: Highly Compact Deep Autoencoder Network Architectures for On-Device Acoustic Anomaly Detection

Sensors ◽ 10.3390/s21144805 ◽ 2021 ◽ Vol 21 (14) ◽ pp. 4805

Author(s): Saad Abbasi ◽ Mahmoud Famouri ◽ Mohammad Javad Shafiee ◽ Alexander Wong

Keyword(s): Machine Learning ◽ Anomaly Detection ◽ Detection Methods ◽ Detection Accuracy ◽ Network Architectures ◽ Design Exploration ◽ Convolutional Autoencoder ◽ Acoustic Anomaly ◽ Human Operators ◽ Computational Resources

Human operators often diagnose industrial machinery via anomalous sounds. Given the new advances in the field of machine learning, automated acoustic anomaly detection can lead to reliable maintenance of machinery. However, deep learning-driven anomaly detection methods often require an extensive amount of computational resources, prohibiting their deployment in factories. Here we explore a machine-driven design exploration strategy to create OutlierNets, a family of highly compact deep convolutional autoencoder network architectures featuring as few as 686 parameters, model sizes as small as 2.7 KB, and as low as 2.8 million FLOPs, with a detection accuracy matching or exceeding published architectures with as many as 4 million parameters. The architectures are deployed on an Intel Core i5 as well as an ARM Cortex A72 to assess performance on hardware that is likely to be used in industry. Experimental results on the model’s latency show that the OutlierNet architectures can achieve as much as 30x lower latency than published networks.
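The quoted parameter count and model size are consistent with each other under a simple storage assumption, which the short check below makes explicit. It assumes each parameter is stored as a 32-bit float with no file overhead; the abstract does not state the storage format, and the function name is hypothetical.

```python
def model_size_bytes(n_params, bytes_per_param=4):
    """Raw weight storage, assuming 4-byte (float32) parameters and no file overhead."""
    return n_params * bytes_per_param


smallest = model_size_bytes(686)          # 686 parameters from the abstract
size_kb = smallest / 1000                 # decimal kilobytes
size_kib = smallest / 1024                # binary kibibytes
reduction = 4_000_000 / 686               # parameter ratio to the largest cited baseline
```

Under that assumption 686 parameters occupy 2,744 bytes, which rounds to 2.7 in either unit convention, and the largest published architecture mentioned has several thousand times as many parameters.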

Do galactic bars depend on environment?: an information theoretic analysis of Galaxy Zoo 2

Monthly Notices of the Royal Astronomical Society ◽ 10.1093/mnras/staa3665 ◽ 2020 ◽ Vol 501 (1) ◽ pp. 994-1001

Author(s): Suman Sarkar ◽ Biswajit Pandey ◽ Snehasish Bhattacharjee

Keyword(s): Spatial Distribution ◽ Mutual Information ◽ Local Density ◽ Statistical Significance ◽ Distribution Functions ◽ Cumulative Distribution ◽ Host Galaxy ◽ Data Sets ◽ Data Set ◽ Information Theoretic

We use an information theoretic framework to analyse data from the Galaxy Zoo 2 project and study whether there are any statistically significant correlations between the presence of bars in spiral galaxies and their environment. We measure the mutual information between the barredness of galaxies and their environments in a volume limited sample (M_r ≤ −21) and compare it with the same in data sets where (i) the bar/unbar classifications are randomized and (ii) the spatial distribution of galaxies is shuffled on different length scales. We assess the statistical significance of the differences in the mutual information using a t-test and find that neither randomization of morphological classifications nor shuffling of spatial distribution alters the mutual information in a statistically significant way. The non-zero mutual information between the barredness and environment arises due to the finite and discrete nature of the data set that can be entirely explained by mock Poisson distributions. We also separately compare the cumulative distribution functions of the barred and unbarred galaxies as a function of their local density. Using a Kolmogorov–Smirnov test, we find that the null hypothesis cannot be rejected even at the 75 per cent confidence level. Our analysis indicates that environments do not play a significant role in the formation of a bar, which is largely determined by the internal processes of the host galaxy.
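The core quantity here, the mutual information between a bar label and an environment label, and the randomization used as a baseline can be sketched as follows. This is a generic plug-in estimator for two discrete variables, not the authors' code; treating the environment as a discrete label (for example a grid cell index) and the function names are assumptions.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two equally long label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    total = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        total += (c / n) * math.log((c * n) / (count_x[x] * count_y[y]))
    return total


def shuffled_mutual_information(xs, ys, rng):
    """Mutual information after randomly permuting xs, which breaks any real association."""
    permuted = list(xs)
    rng.shuffle(permuted)
    return mutual_information(permuted, ys)
```

Calling `shuffled_mutual_information(bars, cells, random.Random(0))` on a finite sample typically returns a small positive value rather than zero. That finite-sample floor is why an observed non-zero mutual information has to be compared with randomized data before it is read as a physical correlation.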

Semi-automated classification of colonial Microcystis by FlowCAM imaging flow cytometry in mesocosm experiment reveals high heterogeneity during seasonal bloom

Scientific Reports ◽ 10.1038/s41598-021-88661-2 ◽ 2021 ◽ Vol 11 (1)

Author(s): Yersultan Mirasbekov ◽ Adina Zhumakhanova ◽ Almira Zhantuyakova ◽ Kuanysh Sarkytbayev ◽ Dmitry V. Malashenkov ◽ ...

Keyword(s): Machine Learning ◽ Flow Cytometry ◽ Spatial Resolution ◽ Mesocosm Experiment ◽ Imaging Flow Cytometry ◽ Leibler Divergence ◽ Temporal And Spatial ◽ High Level ◽ Training Sets

A machine learning approach was employed to detect and quantify Microcystis colonial morphospecies using FlowCAM-based imaging flow cytometry. The system was trained and tested using samples from a long-term mesocosm experiment (LMWE, Central Jutland, Denmark). The statistical validation of the classification approaches was performed using Hellinger distances, Bray–Curtis dissimilarity, and Kullback–Leibler divergence. The semi-automatic classification based on well-balanced training sets from Microcystis seasonal bloom provided a high level of intergeneric accuracy (96–100%) but relatively low intrageneric accuracy (67–78%). Our results provide a proof of concept of how machine learning approaches can be applied to analyze colonial microalgae. This approach allowed us to evaluate Microcystis seasonal bloom in individual mesocosms with a high level of temporal and spatial resolution. The observation that some Microcystis morphotypes completely disappeared and re-appeared along the mesocosm experiment timeline supports the hypothesis of the main transition pathways of colonial Microcystis morphoforms. We demonstrated that significant changes in the training sets of colonial images were required for accurate classification of Microcystis spp. from time points that differed by only two weeks, owing to the high phenotypic heterogeneity of Microcystis during the bloom. We conclude that automatic methods not only can reach the performance level of a human taxonomist, and thus be a valuable time-saving tool in the routine identification of colonial phytoplankton taxa, but also can be applied to increase the temporal and spatial resolution of the study.
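The three validation measures named in this abstract compare two class-abundance profiles, for example manual versus automated counts per morphospecies. The sketch below gives their standard textbook definitions; how the authors binned or normalized their counts is not stated, so the inputs and function names here are illustrative assumptions.

```python
import math


def normalize(counts):
    """Convert raw class counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(a + b for a, b in zip(u, v))


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats; requires q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)
```

Hellinger distance and Bray-Curtis dissimilarity are symmetric and bounded between 0 and 1, whereas Kullback-Leibler divergence is asymmetric and unbounded, so the three give complementary views of how far an automated classification drifts from a reference.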

GammaCHI: a package for the inversion and computation of the gamma and chi-square cumulative distribution functions (central and noncentral). New version announcement

Computer Physics Communications ◽ 10.1016/j.cpc.2021.108083 ◽ 2021 ◽ pp. 108083

Author(s): Amparo Gil ◽ Javier Segura ◽ Nico M. Temme

Keyword(s): Distribution Functions ◽ Cumulative Distribution ◽ Chi Square ◽ Cumulative Distribution Functions

Composite Aerosol Optical Depth Mapping over Northeast Asia from GEO-LEO Satellite Observations

Remote Sensing ◽ 10.3390/rs13061096 ◽ 2021 ◽ Vol 13 (6) ◽ pp. 1096

Author(s): Soi Ahn ◽ Sung-Rae Chung ◽ Hyun-Jong Oh ◽ Chu-Yong Chung

Keyword(s): Aerosol Optical Depth ◽ Optical Depth ◽ Northeast Asia ◽ Low Earth Orbit ◽ Distribution Functions ◽ Cumulative Distribution ◽ Radiometric Calibration ◽ Climate Data ◽ Error Statistics ◽ Spatiotemporal Resolution

This study aimed to generate a near real time composite of aerosol optical depth (AOD) to improve the ability of predictive models and provide current conditions of aerosol spatial distribution and transportation across Northeast Asia. AOD, a proxy for aerosol loading, is estimated remotely by various spaceborne imaging sensors capturing visible and infrared spectra. Nevertheless, differences in satellite-based retrieval algorithms, spatiotemporal resolution, sampling, radiometric calibration, and cloud-screening procedures create significant variability among AOD products. Satellite products, however, can be complementary in terms of their accuracy and spatiotemporal comprehensiveness. Thus, composite AOD products were derived for Northeast Asia based on data from four sensors: Advanced Himawari Imager (AHI), Geostationary Ocean Color Imager (GOCI), Moderate Resolution Imaging Spectroradiometer (MODIS), and Visible Infrared Imaging Radiometer Suite (VIIRS). Cumulative distribution functions were employed to estimate error statistics using measurements from the Aerosol Robotic Network (AERONET). In order to apply the AERONET point-specific error, coefficients of each satellite were calculated using inverse distance weighting. Finally, the root mean square error (RMSE) for each satellite AOD product was calculated based on the inverse composite weighting (ICW). Hourly AOD composites were generated (00:00–09:00 UTC, 2017) using the regression equation derived from the comparison of the composite AOD error statistics to AERONET measurements, and the results showed that the correlation coefficient and RMSE values of the composite were close to those of the low earth orbit satellite products (MODIS and VIIRS). The methodology and the resulting dataset derived here are relevant for the demonstrated successful merging of multi-sensor retrievals to produce long-term satellite-based climate data records.
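Two of the building blocks mentioned in this abstract, inverse distance weighting of point measurements and the root mean square error, have simple generic forms, sketched below. This is not the paper's processing chain: planar coordinates, a distance exponent of 2 and the function names are assumptions made for the example.

```python
import math


def inverse_distance_weighting(target, stations, power=2):
    """Interpolate a value at `target` (x, y) from (x, y, value) station tuples.

    Each station is weighted by 1 / distance**power. Assumption: planar
    coordinates and power=2, a common default rather than a documented setting.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0:
            return value  # target coincides with a station
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def rmse(predicted, observed):
    """Root mean square error between two equally long sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))
```

A point midway between two stations receives the plain average of their values, while a point near one station is dominated by it, which is how a point-specific ground error can be spread over a satellite grid.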

Probabilistic finite element analysis in heat transfer to a nuclear fuel rod bumper support

Proceedings of the Institution of Mechanical Engineers Part C Journal of Mechanical Engineering Science ◽ 10.1243/0954406042690425 ◽ 2004 ◽ Vol 218 (12) ◽ pp. 1499-1505

Author(s): Rama Subba Reddy Gorla

Keyword(s): Heat Transfer ◽ Finite Element ◽ Nuclear Fuel ◽ Cost Effective ◽ Distribution Functions ◽ Cumulative Distribution ◽ Element Analysis ◽ Transfer Rates ◽ Fuel Rod ◽ Design Variables

Heat transfer from a nuclear fuel rod bumper support was computationally simulated by a finite element method and probabilistically evaluated in view of the several uncertainties in the performance parameters. Cumulative distribution functions and sensitivity factors were computed for overall heat transfer rates due to the thermodynamic random variables. These results can be used to identify quickly the most critical design variables in order to optimize the design and to make it cost effective. The analysis leads to the selection of the appropriate measurements to be used in heat transfer and to the identification of both the most critical measurements and the parameters.

Some new Simpson-type inequalities for generalized p-convex function on fractal sets with applications

Advances in Difference Equations ◽ 10.1186/s13662-020-02955-9 ◽ 2020 ◽ Vol 2020 (1)

Author(s): Thabet Abdeljawad ◽ Saima Rashid ◽ Zakia Hammouch ◽ İmdat İşcan ◽ Yu-Ming Chu

Keyword(s): Convex Functions ◽ Fractional Derivatives ◽ Distribution Functions ◽ Cumulative Distribution ◽ Auxiliary Result ◽ Fractal Sets ◽ Different Types ◽ Novel Applications ◽ Class Of Functions ◽ Harmonically Convex Functions

The present article addresses the concept of p-convex functions on fractal sets. We are able to prove a novel auxiliary result. In the application aspect, the fidelity of the local fractional is used to establish the generalization of Simpson-type inequalities for the class of functions whose local fractional derivatives in absolute values at certain powers are p-convex. The method we present is an alternative in showing the classical variants associated with generalized p-convex functions. Some parts of our results cover the classical convex functions and classical harmonically convex functions. Some novel applications in random variables, cumulative distribution functions and generalized bivariate means are obtained to ensure the correctness of the present results. The present approach is efficient and reliable, and it can be used as an alternative to establishing new solutions for different types of fractals in computer graphics.
