Computing Sum of Products about the Mean with Pairwise Algorithms

1997 ◽  
Vol 81 (3_suppl) ◽  
pp. 1387-1391
Author(s):  
J. Gabriel Molina ◽  
Pedro M. Valero ◽  
Jaime Sanmartín

We discuss pairwise algorithms, a class of computational algorithms useful for dynamically updating statistics as new samples of data are collected. Since test data are usually collected over time as individual data sets, these algorithms can be profitably used in computer programs that handle this situation. Pairwise algorithms are presented for calculating the sum of products of deviations about the mean when a sample of data is added to (or removed from) the whole data set.
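The paper's own update formulas are not reproduced here; the sketch below illustrates the combining step such algorithms rely on, in the style of the Chan-Golub-LeVeque formulas for merging the summaries of two samples (function and variable names are our own):

```python
# Illustrative sketch (not the authors' published code): merging two samples'
# summaries updates the sum of products of deviations about the mean without
# revisiting the raw data.

def combine(n_a, mx_a, my_a, sxy_a, n_b, mx_b, my_b, sxy_b):
    """Merge two summaries (n, mean of x, mean of y, sum of products of
    deviations about the mean) into the summary of the pooled sample."""
    n = n_a + n_b
    mx = (n_a * mx_a + n_b * mx_b) / n
    my = (n_a * my_a + n_b * my_b) / n
    sxy = sxy_a + sxy_b + (n_a * n_b / n) * (mx_a - mx_b) * (my_a - my_b)
    return n, mx, my, sxy

def summarize(xs, ys):
    """Naive one-pass summary of a single sample, used here for checking."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return n, mx, my, sxy

# Adding a newly collected sample to an existing data set:
x1, y1 = [1.0, 2.0, 3.0], [2.0, 4.0, 7.0]
x2, y2 = [4.0, 5.0], [8.0, 11.0]
merged = combine(*summarize(x1, y1), *summarize(x2, y2))
assert abs(merged[3] - summarize(x1 + x2, y1 + y2)[3]) < 1e-9
```

Removing a sample works the same way: solve the same identity for the summary of the remaining data instead of the pooled data.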

2021 ◽  
Author(s):  
David Cotton

Introduction

HYDROCOASTAL is a two-year project funded by ESA, with the objective of maximising exploitation of SAR and SARin altimeter measurements in the coastal zone and inland waters, by evaluating and implementing new approaches to process SAR and SARin data from CryoSat-2 and SAR altimeter data from Sentinel-3A and Sentinel-3B. Optical data from the Sentinel-2 MSI and Sentinel-3 OLCI instruments will also be used in generating river discharge products.

New SAR and SARin processing algorithms for the coastal zone and inland waters will be developed, implemented and evaluated through an initial Test Data Set for selected regions. From the results of this evaluation, a processing scheme will be implemented to generate global coastal zone and river discharge data sets. A series of case studies will assess these products in terms of their scientific impacts. All the produced data sets will be available on request to external researchers, and full descriptions of the processing algorithms will be provided.

Objectives

The scientific objectives of HYDROCOASTAL are to enhance our understanding of the interactions between inland waters and the coastal zone, and between the coastal zone and the open ocean, and of the small-scale processes that govern these interactions. The project also aims to improve our capability to characterise the variation, at different time scales, of inland water storage, exchanges with the ocean, and the impact on regional sea-level changes.

The technical objectives are to develop and evaluate new SAR and SARin altimetry processing techniques in support of the scientific objectives, including stack processing, filtering and retracking. An improved Wet Troposphere Correction will also be developed and evaluated.

Project Outline

There are four tasks to the project:
- Scientific Review and Requirements Consolidation: review the current state of the art in SAR and SARin altimeter data processing as applied to the coastal zone and to inland waters.
- Implementation and Validation: new processing algorithms will be implemented to generate a Test Data Set, which will be validated against models, in-situ data and other satellite data sets. Selected algorithms will then be used to generate global coastal zone and river discharge data sets.
- Impacts Assessment: the impact of these global products will be assessed in a series of case studies.
- Outreach and Roadmap: outreach material will be prepared and distributed to engage with the wider scientific community and to provide recommendations for the development of future missions and future research.

Presentation

The presentation will provide an overview of the project, present the different SAR altimeter processing algorithms being evaluated in the first phase of the project, and show early results from the evaluation of the initial test data set.


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: Predicting whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable for improving early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets; additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley values, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set, based on their Logistic Regression (LR) data Shapley values, outperformed the models trained on the entire training data set by 14.13 % (8.27 percentage points) on the independent ADNI test data set; the latter models reached a mean classification accuracy of 58.54 %. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data were associated with the number of ApoEϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
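As a rough illustration of this workflow (a sketch on synthetic data, not the authors' code: the data-Shapley step here is a crude Monte-Carlo estimate with a Logistic Regression utility, and all names and dimensions are our own assumptions):

```python
# Sketch: value training subjects with a crude Monte-Carlo data-Shapley
# estimate, drop the lowest-valued ones, retrain XGBoost, explain with SHAP.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the volumetric MRI features (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Each permutation grows a coalition of subjects and credits each chunk of
# subjects with its marginal gain in validation accuracy.
rng = np.random.default_rng(0)
values, chunk = np.zeros(len(X_tr)), 20
for _ in range(25):
    perm = rng.permutation(len(X_tr))
    prev_util = 0.5                      # chance-level starting utility
    for k in range(chunk, len(perm) + 1, chunk):
        idx = perm[:k]
        if len(np.unique(y_tr[idx])) < 2:
            continue                     # cannot fit on a single class
        util = LogisticRegression(max_iter=1000).fit(
            X_tr[idx], y_tr[idx]).score(X_val, y_val)
        values[perm[k - chunk:k]] += (util - prev_util) / chunk
        prev_util = util

# Exclude the lowest-valued subjects, retrain, and evaluate.
keep = values > np.quantile(values, 0.25)
model = XGBClassifier(n_estimators=200, eval_metric="logloss")
model.fit(X_tr[keep], y_tr[keep])
print("test accuracy:", model.score(X_te, y_te))

# Kernel SHAP for per-subject explanations of the black-box model.
explainer = shap.KernelExplainer(lambda A: model.predict_proba(A)[:, 1],
                                 shap.sample(X_tr[keep], 50))
shap_values = explainer.shap_values(X_te[:5])
```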


Author(s):  
Gihong Kim ◽  
Bonghee Hong

The testing of RFID information services requires a test data set of business events comprising object, aggregation, quantity and transaction events. To generate business events, we need to address the performance issues in creating a large volume of event data. This paper proposes a new model for the tag life cycle and a fast generation algorithm for this model. We present the results of experiments with the generation algorithm, showing that it outperforms previous methods.
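The paper's tag life-cycle model and fast generation algorithm are not reproduced here; as a rough sketch of the kind of test data such a generator emits (the event fields and the stage-to-type mapping are our assumptions, loosely following EPCIS-style event types):

```python
# Minimal sketch of a tag life-cycle event generator (our own illustration,
# not the authors' algorithm).
import itertools

# One plausible mapping of life-cycle stages to business event types.
LIFE_CYCLE = [
    ("commission", "object"),       # tag is written and activated
    ("pack", "aggregation"),        # tagged item is packed into a case
    ("count", "quantity"),          # stock-taking at a location
    ("ship", "transaction"),        # case is shipped against an order
    ("decommission", "object"),     # tag retired at end of life
]

def tag_events(epc, start_time):
    """Yield one business event per life-cycle stage of a single tag."""
    for step, (stage, event_type) in enumerate(LIFE_CYCLE):
        yield {"epc": epc, "stage": stage, "type": event_type,
               "time": start_time + step}

def generate(n_tags):
    """Lazily chain the event streams of n_tags tags; laziness matters when
    generating a large volume of test data."""
    streams = (tag_events(f"urn:epc:id:sgtin:{i:012d}", i) for i in range(n_tags))
    return itertools.chain.from_iterable(streams)

for event in itertools.islice(generate(1_000_000), 5):
    print(event)
```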


1997 ◽  
Vol 53 (5) ◽  
pp. 767-772 ◽  
Author(s):  
T. J. Bartczak ◽  
K. Rachlewicz ◽  
L. Latos-Grażynski

[Ru(TPP)(CS)(EtOH)] crystallizes in the triclinic system. Crystal data: C47H34N4ORuS, Mr = 803.91, a = 10.607 (3), b = 11.308 (5), c = 17.699 (2) Å, α = 77.53 (2), β = 73.17 (1), γ = 69.85 (3)°, V = 1891.6 (10) Å3, space group P1̄ (Ci1, no. 2), Z = 2, F(000) = 824, Dx = 1.410, Dm = 1.39 Mg m−3 (by flotation in aqueous KI), μ(Mo Kα) = 0.512 mm−1, R = 0.094, wR = 0.098, S = 2.28 for 4610 independent reflections with Fo > 5σ(Fo). A second data set was collected using Cu Kα radiation. The structure was refined by standard least-squares and difference-Fourier methods in space groups P1 and P1̄ using both the Mo Kα and Cu Kα data sets. Both data sets favor space group P1̄, the Mo data giving a slightly better result than the Cu data. The two independent Ru atoms lie on the inversion centers ½,0,0 and ½,½,½ of space group P1̄. Consequently, the two independent molecules have crystallographically imposed 1̄ symmetry, the CS and EtOH axial groups are disordered and the RuN4 portions of the molecules are planar. The deviations from planarity of the porphyrinato core are very small. The Ru—C—S groups are essentially linear with an average Ru—C—S bond angle of 174 (1)°. The mean Ru—C(CS) and Ru—O(Et) bond lengths are 1.92 (4) and 2.15 (3) Å, respectively.


2006 ◽  
Vol 58 (4) ◽  
pp. 567-574 ◽  
Author(s):  
M.G.C.D. Peixoto ◽  
J.A.G. Bergmann ◽  
C.G. Fonseca ◽  
V.M. Penna ◽  
C.S. Pereira

Data on 1,294 superovulations of Brahman, Gyr, Guzerat and Nellore females were used to evaluate the effects of breed; herd; year of birth; inbreeding coefficient and age at superovulation of the donor; month, season and year of superovulation; hormone source and dose; and the number of previous treatments on the superovulation results. Four data sets were considered to study the influence of eliminating donors after each consecutive superovulation: each contained only records of the first, the first two, the first three, or all superovulations. The average number of palpated corpora lutea per superovulation varied from 8.6 to 12.6. The total numbers of recovered structures and viable embryos ranged from 7.3 to 13.8 and from 4.1 to 7.3, respectively. Least-squares means of the number of viable embryos at first superovulation were 7.8 ± 6.6 (Brahman), 3.7 ± 4.5 (Gyr), 6.1 ± 5.9 (Guzerat) and 5.2 ± 5.9 (Nellore). The numbers of viable embryos of the second and third superovulations were not different from those of the first superovulation. The mean intervals between the first and second superovulations were 91.8 days for Brahman, 101.8 days for Gyr, 93.1 days for Guzerat and 111.3 days for Nellore donors. Intervals between the second and third superovulations were 134.3, 110.3, 116.4 and 108.5 days for Brahman, Gyr, Guzerat and Nellore donors, respectively. Effects of herd nested within breed and of dose nested within hormone source affected all traits. For some data sets, the effects of month and order of superovulation on three traits were important. The maximum number of viable embryos was observed for 7-8 year-old donors. The best responses for corpora lutea and recovered structures were observed for 4-5 year-old donors. The inbreeding coefficient was positively associated with the number of recovered structures when the data set on all superovulations was considered.


2003 ◽  
Vol 3 (4) ◽  
pp. 3625-3657
Author(s):  
M. Seifert ◽  
J. Ström ◽  
R. Krejci ◽  
A. Minikin ◽  
A. Petzold ◽  
...  

Abstract. In situ measurements of the partitioning of aerosol particles within cirrus clouds were used to investigate aerosol-cloud interactions in ice clouds. The number density of interstitial aerosol particles (non-activated particles in between the cirrus crystals) was compared to the number density of cirrus crystal residuals. The data were obtained during the two INCA (Interhemispheric Differences in Cirrus Properties from Anthropogenic Emissions) campaigns, performed in the Southern Hemisphere (SH) and Northern Hemisphere (NH) midlatitudes. Different aerosol-cirrus interactions can be linked to the different stages of the cirrus lifecycle. Cloud formation is linked to positive correlations between the number density of interstitial aerosol (Nint) and crystal residuals (Ncvi), whereas the correlations are smaller or even negative in a dissolving cloud. Unlike warm clouds, where the number density of cloud droplets is positively related to the aerosol number density, we observed a rather complex relationship when expressing Ncvi as a function of Nint for forming clouds. The two data sets are similar in that both show local maxima in the Nint range 100 to 200 cm−3, with the SH maximum shifted towards the higher value. For lower number densities, Nint and Ncvi are positively related. The slopes emerging from the data suggest that a tenfold increase in the aerosol number density corresponds to a 3- to 4-fold increase in the crystal number density. As Nint increases beyond ca. 100 to 200 cm−3, the mean crystal number density decreases at about the same rate for both data sets. For much higher aerosol number densities, only present in the NH data set, the mean Ncvi remains low. The situation for dissolving clouds presents two alternative interactions between aerosols and cirrus. Either evaporating clouds are associated with a source of aerosol particles, or air pollution (high aerosol number density) retards evaporation rates.
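Read as a log-log slope (our arithmetic under an assumed power-law form, not a fit reported in the paper), that statement corresponds to:

```latex
% Assumed power-law form; the exponent follows from the stated ratios.
N_{\mathrm{cvi}} \propto N_{\mathrm{int}}^{\,k}, \qquad
k = \log_{10} 3 \approx 0.48 \quad \text{to} \quad \log_{10} 4 \approx 0.60
```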


2021 ◽  
Vol 79 (1) ◽  
Author(s):  
Romana Haneef ◽  
Sofiane Kab ◽  
Rok Hrzic ◽  
Sonsoles Fuentes ◽  
Sandrine Fosse-Edorh ◽  
...  

Abstract Background The use of machine learning techniques is increasing in healthcare, allowing health outcomes to be estimated and predicted from large administrative data sets more efficiently. The main objective of this study was to develop a generic machine learning (ML) algorithm to estimate the incidence of diabetes based on the number of reimbursements over the last 2 years. Methods We selected a final data set from a population-based epidemiological cohort (CONSTANCES) linked with the French National Health Database (SNDS). To develop this algorithm, we adopted a supervised ML approach with the following steps: (i) selection of the final data set, (ii) target definition, (iii) coding of variables for a given window of time, (iv) splitting of the final data into training and test data sets, (v) variable selection, (vi) model training, (vii) validation of the model with the test data set, and (viii) selection of the model. We used the area under the receiver operating characteristic curve (AUC) to select the best algorithm. Results The final data set used to develop the algorithm included 44,659 participants from CONSTANCES. Out of the 3468 variables from the SNDS linked to the CONSTANCES cohort that were coded, 23 were selected to train the different algorithms. The final algorithm to estimate the incidence of diabetes was a Linear Discriminant Analysis model based on the number of reimbursements of selected variables related to biological tests, drugs, medical acts and hospitalization without a procedure over the last 2 years. This algorithm has a sensitivity of 62%, a specificity of 67% and an accuracy of 67% [95% CI: 0.66–0.68]. Conclusions Supervised ML is an innovative tool for the development of new methods to exploit large health administrative databases. In the context of the InfAct project, we have developed and applied for the first time a generic ML algorithm to estimate the incidence of diabetes for public health surveillance. The ML algorithm we have developed has moderate performance. The next step is to apply this algorithm to the SNDS to estimate the incidence of type 2 diabetes cases. More research is needed to apply various ML techniques to estimate the incidence of various health conditions.
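Steps (iv) and (vi)-(viii) can be sketched as follows (a sketch on synthetic features standing in for the reimbursement counts; the actual SNDS variables and candidate models are not reproduced):

```python
# Sketch of split / train / AUC-based model selection (synthetic data).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 23 selected variables per participant (assumption).
X, y = make_classification(n_samples=5000, n_features=23, random_state=0)

# (iv) split the final data into training and test data sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# (vi)-(viii) train candidate models and keep the one with the best AUC
candidates = {
    "LDA": LinearDiscriminantAnalysis(),
    "LogReg": LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

best = max(scores, key=scores.get)
print(best, scores)
```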


Genes ◽  
2019 ◽  
Vol 10 (10) ◽  
pp. 778 ◽  
Author(s):  
Liu ◽  
Liu ◽  
Pan ◽  
Li ◽  
Yang ◽  
...  

For cancer diagnosis, many DNA methylation markers have been identified. However, few studies have tried to identify DNA methylation markers to diagnose diverse cancer types simultaneously, i.e., pan-cancers. In this study, we tried to identify DNA methylation markers to differentiate cancer samples from the respective normal samples in pan-cancers. We collected whole-genome methylation data of 27 cancer types, containing 10,140 cancer samples and 3386 normal samples, and divided all samples into five data sets: one training data set, one validation data set and three test data sets. We applied machine learning to identify DNA methylation markers, and specifically, we constructed diagnostic prediction models by deep learning. We identified two categories of markers: 12 CpG markers and 13 promoter markers. Three of the 12 CpG markers and four of the 13 promoter markers are located in cancer-related genes. With the CpG markers, our model achieved an average sensitivity and specificity on the test data sets of 92.8% and 90.1%, respectively. For the promoter markers, the average sensitivity and specificity on the test data sets were 89.8% and 81.1%, respectively. Furthermore, in cell-free DNA methylation data of 163 prostate cancer samples, the CpG markers achieved a sensitivity of 100%, and the promoter markers 92%. For both marker types, the specificity on normal whole blood was 100%. To conclude, we identified methylation markers to diagnose pan-cancers, which might be applied to liquid biopsy of cancers.
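As a schematic of this kind of evaluation (a sketch on synthetic data; the paper's deep-learning architecture is not specified here, so a small feed-forward network stands in):

```python
# Sketch: marker-based cancer-vs-normal classification and the reported
# metrics (synthetic data; not the paper's model).
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 12 features standing in for the 12 CpG markers (assumption).
X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)

# Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))
```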


2020 ◽  
Author(s):  
David Cotton ◽  
Thomas Moreau ◽  
Mònica Roca ◽  
Christine Gommenginger ◽  
Mathilde Cancet ◽  
...  

SCOOP (SAR Altimetry Coastal & Open Ocean Performance) is a project funded under the ESA SEOM (Scientific Exploitation of Operational Missions) Programme Element, to characterise the expected performance of Sentinel-3 SRAL SAR mode altimeter products, and then to develop and evaluate enhancements to the baseline processing scheme in terms of improvements to ocean measurements. Another objective is to develop and evaluate an improved Wet Troposphere Correction (WTC) for Sentinel-3.

The SCOOP studies are based on two 2-year test data sets derived from CryoSat-2 FBR data, produced for 10 regions. The first Test Data Set was processed with algorithms equivalent to the Sentinel-3 baseline, and the second with algorithms expected to provide improved performance.

We present results from the SCOOP project that demonstrate the excellent performance of SRAL at the coast in terms of measurement precision, with noise in 20 Hz Sea Surface Height measurements of less than 5 cm to within 5 km of the coast.

We then report the development and testing of new processing approaches designed to improve performance, including, from FBR to L1B:
- application of zero-padding;
- application of intra-burst Hamming windowing;
- exact beam forming in the azimuthal direction;
- restriction of stack processing to within a specified range of look angles;
- along-track antenna compensation.

And from L1B to L2:
- application of alternative re-trackers for SAR and RDSAR.

Based on the results of this assessment, a second test data set was generated; we present an assessment of the performance of this second Test Data Set and compare it to that of the original.

Regarding the WTC for Sentinel-3A, the correction from the on-board MWR has been assessed by comparison with independent data sets such as the GPM Microwave Imager (GMI), Jason-2, Jason-3 and Global Navigation Satellite System (GNSS) derived WTC at coastal stations. GNSS-derived Path Delay Plus (GPD+) corrections have been derived for S3A. Results indicate good overall performance of the S3A MWR, and improvements of the GPD+ WTC over the MWR-derived WTC, particularly in coastal and polar regions.

Based on the outcomes of this study we provide recommendations for improving SAR mode altimeter processing and priorities for future research.
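Two of the FBR-to-L1B options above, intra-burst Hamming windowing and zero-padding, can be sketched schematically (a numpy illustration with made-up burst dimensions; this is not the SCOOP processing chain):

```python
# Schematic sketch of intra-burst Hamming windowing and zero-padding ahead of
# the azimuth FFT in delay-Doppler processing (dimensions and names are
# illustrative assumptions).
import numpy as np

n_pulses, n_samples = 64, 128              # pulses per burst x range gates
burst = (np.random.randn(n_pulses, n_samples)
         + 1j * np.random.randn(n_pulses, n_samples))   # complex echoes

# Intra-burst Hamming window: tapers the along-track aperture to lower
# azimuth side lobes (at the cost of a wider main lobe).
window = np.hamming(n_pulses)[:, np.newaxis]
weighted = burst * window

# Zero-padding: doubling the FFT length oversamples the Doppler beams.
beams = np.fft.fft(weighted, n=2 * n_pulses, axis=0)
print(beams.shape)   # (128, 128): twice as many Doppler beams
```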


2020 ◽  
Author(s):  
Faezeh Bayat ◽  
Maxwell Libbrecht

Abstract

Motivation: A sequencing-based genomic assay such as ChIP-seq outputs a real-valued signal for each position in the genome that measures the strength of activity at that position. Most genomic signals lack the property of variance stabilization. That is, a difference between 100 and 200 reads usually has a very different statistical importance from a difference between 1,100 and 1,200 reads. A statistical model such as a negative binomial distribution can account for this pattern, but learning these models is computationally challenging. Therefore, many applications, including imputation and segmentation and genome annotation (SAGA), instead use Gaussian models and apply a transformation such as log or inverse hyperbolic sine (asinh) to stabilize variance.

Results: We show here that existing transformations do not fully stabilize variance in genomic data sets. To solve this issue, we propose VSS, a method that produces variance-stabilized signals for sequencing-based genomic signals. VSS learns the empirical relationship between the mean and variance of a given signal data set and produces transformed signals that normalize for this dependence. We show that VSS successfully stabilizes variance and that doing so improves downstream applications such as SAGA. VSS will eliminate the need for downstream methods to implement complex mean-variance relationship models, and will enable genomic signals to be easily understood by eye.

Availability: https://github.com/faezeh-bayat/Variance-stabilized-units-for-sequencing-based-genomic-signals
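The VSS implementation itself lives at the repository above; the core idea can be sketched as follows (our own illustration on synthetic Poisson-noise replicates: bin positions by mean signal, estimate the variance per bin, and integrate 1/sqrt(variance) to build the transform):

```python
# Sketch of empirical variance stabilization (illustrative; not the VSS code).
import numpy as np

rng = np.random.default_rng(0)
mu = rng.gamma(shape=2.0, scale=50.0, size=100_000)   # latent signal strength
rep1, rep2 = rng.poisson(mu), rng.poisson(mu)         # Poisson-noise replicates

# Empirical variance per mean-signal bin (two-replicate variance estimate).
means = (rep1 + rep2) / 2.0
var_est = (rep1 - rep2) ** 2 / 2.0
edges = np.unique(np.quantile(means, np.linspace(0, 1, 51)))
n_bins = len(edges) - 1
idx = np.clip(np.digitize(means, edges) - 1, 0, n_bins - 1)
bin_var = np.array([var_est[idx == b].mean() for b in range(n_bins)])
bin_mid = np.array([means[idx == b].mean() for b in range(n_bins)])

# Variance-stabilizing transform: cumulative integral of 1 / sqrt(variance).
t = np.concatenate([[0.0], np.cumsum(np.diff(bin_mid) / np.sqrt(bin_var[1:]))])
stabilized = np.interp(rep1, bin_mid, t)
# For Poisson noise t(x) should approach 2*sqrt(x), the Anscombe-style result.
```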

