Euclid preparation

2019 ◽  
Vol 627 ◽  
pp. A59 ◽  
Author(s):  
N. Martinet ◽  
T. Schrabback ◽  
H. Hoekstra ◽  
M. Tewes ◽  
...  

In modern weak-lensing surveys, the common approach to correct for residual systematic biases in the shear is to calibrate shape measurement algorithms using simulations. These simulations must fully capture the complexity of the observations to avoid introducing any additional bias. In this paper we study the importance of faint galaxies below the observational detection limit of a survey. We simulate simplified Euclid VIS images including and excluding this faint population, and measure the shift in the multiplicative shear bias between the two sets of simulations. We measure the shear with three different algorithms: a moment-based approach, model fitting, and machine learning. We find that for all methods, a spatially uniform random distribution of faint galaxies introduces a shear multiplicative bias of the order of a few times 10⁻³. This value increases to the order of 10⁻² when including the clustering of the faint galaxies, as measured in the Hubble Space Telescope Ultra-Deep Field. The magnification of the faint background galaxies due to the brighter galaxies along the line of sight is found to have a negligible impact on the multiplicative bias. We conclude that the undetected galaxies must be included in the calibration simulations with proper clustering properties down to magnitude 28 in order to reach a residual uncertainty on the multiplicative shear bias calibration of a few times 10⁻⁴, in line with the 2 × 10⁻³ total accuracy budget required by the scientific objectives of the Euclid survey. We propose two complementary methods for including faint galaxy clustering in the calibration simulations.
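The bias figures above come from comparing two matched simulation sets. A minimal sketch of that comparison, not the Euclid pipeline: fit the standard linear bias model g_obs = (1 + m)·g_true + c on each set and look at the shift in m. All numbers below (input shear range, noise level, the injected biases) are illustrative assumptions.

```python
# Sketch: estimating the shift in multiplicative shear bias m between two
# simulation sets, via the linear model g_obs = (1 + m) * g_true + c.
import numpy as np

def fit_shear_bias(g_true, g_obs):
    """Least-squares fit of g_obs = (1 + m) * g_true + c; returns (m, c)."""
    slope, intercept = np.polyfit(g_true, g_obs, 1)
    return slope - 1.0, intercept

rng = np.random.default_rng(42)
g_true = rng.uniform(-0.06, 0.06, 100_000)        # input shears (assumed range)
noise = rng.normal(0.0, 0.02, g_true.size)        # shape-measurement noise (assumed)

# Hypothetical recovered shears: without and with undetected faint galaxies,
# with injected biases of -2e-3 and -1.2e-2 chosen only for illustration.
g_obs_without = (1 - 0.002) * g_true + noise
g_obs_with = (1 - 0.012) * g_true + noise

m0, _ = fit_shear_bias(g_true, g_obs_without)
m1, _ = fit_shear_bias(g_true, g_obs_with)
print(f"shift in multiplicative bias: {m1 - m0:+.4f}")   # ~ -1e-2
```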

2021 ◽  
Vol 54 (3) ◽  
pp. 1-18
Author(s):  
Petr Spelda ◽  
Vit Stritecky

As our epistemic ambitions grow, common and scientific endeavours alike are becoming increasingly dependent on machine learning (ML). The field rests on a single experimental paradigm, which consists of splitting the available data into training and testing sets and using the latter to measure how well the trained ML model generalises to unseen samples. If the model reaches acceptable accuracy, an a posteriori contract comes into effect between humans and the model, supposedly allowing its deployment to target environments. Yet the latter part of the contract depends on human inductive predictions or generalisations, which infer a uniformity between the trained ML model and the targets. The article asks how we justify this contract between human and machine learning. It is argued that the justification becomes a pressing issue when we use ML to reach “elsewhere” in space and time or deploy ML models in non-benign environments. The article argues that the only viable version of the contract can be based on optimality (rather than on reliability, which cannot be justified without circularity) and aligns this position with Schurz's optimality justification. It is shown that when dealing with inaccessible or unstable ground truths (“elsewhere” and non-benign targets), the optimality justification undergoes a slight change, which should make us reflect critically on our epistemic ambitions. The study of ML robustness should therefore involve not only heuristics that lead to acceptable accuracies on testing sets, but also the justification of human inductive predictions or generalisations about the uniformity between ML models and targets. Without it, the assumptions about inductive risk minimisation in ML are not addressed in full.
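For readers less familiar with the paradigm the article examines, a minimal sketch of the train/test contract, with a standard toy dataset and an assumed acceptance threshold standing in for the deployment decision:

```python
# Sketch of the single experimental paradigm: split, train, measure test
# accuracy, and let that accuracy license deployment (the "contract").
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# The a posteriori contract: deploy only if accuracy is acceptable -- an
# inductive step that presumes the target environment resembles the test set.
ACCEPTABLE = 0.9  # assumed threshold, for illustration only
print(f"test accuracy = {test_accuracy:.3f}; deploy = {test_accuracy >= ACCEPTABLE}")
```

The article's point is precisely that the final line hides the unjustified step: nothing in the test score itself certifies the uniformity between the testing set and the deployment target.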


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Runzhi Zhang ◽  
Alejandro R. Walker ◽  
Susmita Datta

Abstract Background Composition of microbial communities can be location-specific, and the differential abundance of taxa within locations could help us unravel city-specific signatures and accurately predict the origin locations of samples. In this study, whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world and samples from another 8 cities were provided as the main and mystery datasets, respectively, as part of the CAMDA 2019 MetaSUB “Forensic Challenge”. Feature selection, normalization, three machine learning methods, PCoA (Principal Coordinates Analysis), and ANCOM (Analysis of Composition of Microbiomes) were applied to both the main and mystery datasets. Results Feature selection, combined with the machine learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. Average error rates of 11.93% and 30.37% across the three machine learning methods were obtained for the main and mystery datasets, respectively. Using the samples from the main dataset to predict the labels of samples from the mystery dataset, nearly 89.98% of the test samples could be correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, the separation of some cities was evident in the PCoA. The results of ANCOM, combined with importance scores from the Random Forest, indicated that the common “family”- and “order”-level taxa of the main dataset and the common “order”-level taxa of the mystery dataset provided the most efficient information for prediction. Conclusions The results of the classification suggest that the composition of the microbiomes was distinctive across cities, which could be used to identify the sample origins. This was also supported by the results from ANCOM and the importance scores from the RF. In addition, the accuracy of the prediction could be improved with more samples and greater sequencing depth.
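A minimal sketch, not the authors' pipeline, of the classification step: a Random Forest predicting origin city from a samples × taxa abundance table, with a cross-validated error rate and importance scores. The data are synthetic stand-ins.

```python
# Sketch: city-origin prediction from taxa abundances with a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_taxa, n_cities = 320, 200, 16
cities = rng.integers(0, n_cities, n_samples)          # origin labels

# Synthetic abundances with a weak city-specific signature on a few taxa.
X = rng.lognormal(0.0, 1.0, (n_samples, n_taxa))
X[np.arange(n_samples), cities % n_taxa] *= 5.0
X /= X.sum(axis=1, keepdims=True)                      # normalise to proportions

rf = RandomForestClassifier(n_estimators=500, random_state=0)
accuracy = cross_val_score(rf, X, cities, cv=5).mean()
print(f"cross-validated error rate: {1 - accuracy:.2%}")

# Importance scores, used in the study together with ANCOM to single out
# the most informative common taxa.
rf.fit(X, cities)
top_taxa = np.argsort(rf.feature_importances_)[::-1][:10]
print("most informative taxa (indices):", top_taxa)
```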


2021 ◽  
pp. 1-47
Author(s):  
Yang Trista Cao ◽  
Hal Daumé

Abstract Correctly resolving textual mentions of people fundamentally entails making inferences about those people. Such inferences raise the risk of systematic biases in coreference resolution systems, including biases that can harm binary and non-binary trans and cis stakeholders. To better understand such biases, we foreground nuanced conceptualizations of gender from sociology and sociolinguistics, and investigate where in the machine learning pipeline such biases can enter a coreference resolution system. We inspect many existing datasets for trans-exclusionary biases, and develop two new datasets for interrogating bias in both crowd annotations and in existing coreference resolution systems. Through these studies, conducted on English text, we confirm that without acknowledging and building systems that recognize the complexity of gender, we will build systems that fail in terms of quality of service, stereotyping, and over- or under-representation, especially for binary and non-binary trans users.
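One simple way to interrogate such a system, in the spirit of the templated probes the new datasets enable (a sketch, not the authors' datasets): hold a sentence fixed, vary only the pronoun, and check whether the resolved antecedent changes. The `resolve` callable below is a hypothetical stand-in for any real coreference system.

```python
# Sketch: a pronoun-swap probe for coreference systems. A system whose
# resolution flips merely because the pronoun changed (e.g. on singular
# "they" or neopronouns) exhibits the failure modes discussed above.
TEMPLATE = "The doctor asked the nurse a question because {pron} wanted a second opinion."
PRONOUNS = ["he", "she", "they", "xe", "ze"]  # binary, singular they, neopronouns

def probe(resolve):
    """Print the antecedent the (hypothetical) resolver picks for each pronoun."""
    for pron in PRONOUNS:
        sentence = TEMPLATE.format(pron=pron)
        antecedent = resolve(sentence, pron)   # hypothetical resolver interface
        print(f"{pron!r:7} -> {antecedent}")

if __name__ == "__main__":
    probe(lambda sentence, pron: "doctor")     # trivial stand-in resolver
```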


2021 ◽  
Author(s):  
Bezuayehu Gutema Asefa ◽  
Legesse Hagos ◽  
Tamirat Kore ◽  
Shimelis Admassu Emire

Abstract A rapid method based on digital image analysis and machine learning is proposed for the detection of milk adulteration with water. Several machine learning algorithms were compared, and SVM performed best, with 89.48% overall accuracy and 95.10% precision. Classification performance was higher for the extreme classes. Better quantitative determination of the extraneous water was achieved using SVMR, with R²(CV) and R²(P) of 0.65 and 0.71, respectively. The proposed technique can be used to screen raw milk based on the level of added extraneous water without the need for any additional reagent.
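A minimal sketch of the approach under stated assumptions: mean RGB values extracted from milk images serve as features, an SVM classifies the adulteration level, and an SVM regressor estimates it. The colour-shift model and class levels below are synthetic placeholders, not the paper's data.

```python
# Sketch: SVM classification/regression of added-water level from simple
# image colour features (synthetic stand-in data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(1)
levels = np.repeat([0, 10, 20, 30, 40], 60)        # % added water (assumed classes)

# Assumed feature extraction: mean R, G, B per image, drifting with dilution.
X = np.column_stack([200 - 0.8 * levels, 200 - 0.5 * levels, 190 - 0.3 * levels])
X += rng.normal(0, 6, X.shape)                     # imaging noise

X_tr, X_te, y_tr, y_te = train_test_split(X, levels, test_size=0.25, random_state=1)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
print(f"classification accuracy: {clf.score(X_te, y_te):.2%}")

reg = make_pipeline(StandardScaler(), SVR(kernel="rbf")).fit(X_tr, y_tr.astype(float))
print(f"regression R^2: {reg.score(X_te, y_te):.2f}")   # cf. the paper's SVMR
```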


Materials ◽  
2022 ◽  
Vol 15 (2) ◽  
pp. 643
Author(s):  
Paul Meißner ◽  
Jens Winter ◽  
Thomas Vietor

A neural network (NN)-based method is presented in this paper which allows the identification of parameters for material cards used in finite element simulations. In contrast to the conventional, computationally intensive material parameter identification (MPI) by numerical optimization with internal or commercial software, a machine learning (ML)-based method saves time when used repeatedly. Within this article, a self-developed ML-based Python framework is presented, which offers advantages especially in the development of structural components in early development phases. In this procedure, different machine learning methods are used and adapted to the specific MPI problem considered herein. Using the developed NN-based method and the common optimization-based method with LS-OPT, the material parameters of the LS-DYNA material card MAT_187_SAMP-1 and the failure model GISSMO were calibrated for a virtually generated test dataset. Parameters for the description of elasticity, plasticity, tension–compression asymmetry, variable plastic Poisson's ratio (VPPR), strain rate dependency, and failure were taken into account. The focus of this paper is a comparative study of the two MPI methods with varying settings (algorithms, hyperparameters, etc.). Furthermore, the applicability of the NN-based procedure for the specific usage of both material cards was investigated. The studies reveal its general applicability for the calibration of a complex material card, demonstrated with MAT_187_SAMP-1.
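The time saving comes from inverting the direction of the computation: instead of optimizing parameters against repeated FE runs, a network is trained once to map a discretised response curve back to the parameters that produced it. A minimal sketch under strong simplifications; the two-parameter elastic-plastic "simulator" below is a toy stand-in for an LS-DYNA run, not the authors' framework:

```python
# Sketch: NN-based inverse identification -- learn (response curve -> material
# parameters) from sampled forward simulations, then identify in one pass.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
strain = np.linspace(0.0, 0.2, 50)

def simulate(E, sigma_y):
    """Toy bilinear elastic-plastic curve standing in for an FE simulation."""
    return np.minimum(E * strain, sigma_y)

# Sample the parameter space and build the (curve -> parameters) dataset.
params = np.column_stack([rng.uniform(500.0, 2000.0, 2000),   # Young's modulus
                          rng.uniform(20.0, 80.0, 2000)])     # yield stress
curves = np.array([simulate(E, sy) for E, sy in params])

X_tr, X_te, y_tr, y_te = train_test_split(curves, params, random_state=7)
nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                  random_state=7).fit(X_tr, y_tr)

# Identification of a new curve is now a single forward pass; this is where
# the repeated-use time saving over iterative optimization comes from.
print("identified parameters:", nn.predict(X_te[:1]).round(1))
print("true parameters:      ", y_te[:1].round(1))
```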


2020 ◽  
Author(s):  
Runzhi Zhang ◽  
Alejandro R. Walker ◽  
Susmita Datta

Abstract Background Composition of microbial communities can be location-specific, and the differential abundance of taxa within locations could help us unravel city-specific signatures and accurately predict the origin locations of samples. In this study, whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world and samples from another 8 cities were provided as the main and mystery datasets, respectively, as part of the CAMDA 2019 MetaSUB “Forensic Challenge”. Feature selection, normalization, three machine learning methods, PCoA (Principal Coordinates Analysis), and ANCOM (Analysis of Composition of Microbiomes) were applied to both the main and mystery datasets. Results Feature selection, combined with the machine learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. Average error rates of 11.6% and 30.0% across the three machine learning methods were obtained for the main and mystery datasets, respectively. Using the samples from the main dataset to predict the labels of samples from the mystery dataset, nearly 89.98% of the test samples could be correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, the separation of some cities was evident in the PCoA. The results of ANCOM, combined with importance scores from the Random Forest, indicated that the common “family”- and “order”-level taxa of the main dataset and the common “order”-level taxa of the mystery dataset provided the most efficient information for prediction. Conclusions The results of the classification suggest that the composition of the microbiomes was distinctive across cities, which was also supported by the results from ANCOM and the importance scores from the RF. The analysis utilized in this study can be of great help in the field of forensic science for efficiently predicting the origin of samples, and the accuracy of the prediction could be improved with more samples and greater sequencing depth.
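Complementing the classifier sketch above, here is a minimal, self-contained version of the PCoA step: Bray-Curtis distances between samples, classical scaling, and the share of variability carried by the first two axes (the quantity the abstract places near 60%). The abundance table is synthetic.

```python
# Sketch: PCoA (classical multidimensional scaling) on Bray-Curtis distances.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(3)
abundances = rng.lognormal(0.0, 1.0, (100, 50))       # synthetic samples x taxa
abundances /= abundances.sum(axis=1, keepdims=True)

D = squareform(pdist(abundances, metric="braycurtis"))

# Double-centre the squared distance matrix, then eigendecompose.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # descending order

explained = eigvals[:2].sum() / eigvals[eigvals > 0].sum()
coords = eigvecs[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0.0))  # 2-D ordination
print(f"variability explained by the first two axes: {explained:.1%}")
```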


2020 ◽  
Vol 633 ◽  
pp. A154
Author(s):  
C. H. A. Logan ◽  
S. Fotopoulou

Context. Classification will be an important first step for upcoming surveys aimed at detecting billions of new sources, such as LSST and Euclid, as well as DESI, 4MOST, and MOONS. The application of traditional methods of model fitting and colour-colour selections will face significant computational constraints, while machine-learning methods offer a viable approach to tackle datasets of that volume. Aims. While supervised learning methods can prove very useful for classification tasks, the creation of representative and accurate training sets is a task that consumes a great deal of resources and time. We present a viable alternative using an unsupervised machine learning method to separate stars, galaxies, and QSOs using photometric data. Methods. The heart of our work uses Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to find the star, galaxy, and QSO clusters in a multidimensional colour space. We optimized the hyperparameters and input attributes of three separate HDBSCAN runs, each to select a particular object class, and thus treated the output of each run as a binary classifier. We subsequently consolidated the outputs to give our final classifications, optimized on the basis of their F1 scores. We explored the use of Random Forest and PCA as part of the pre-processing stage for feature selection and dimensionality reduction. Results. Using our dataset of ∼50 000 spectroscopically labelled objects, we obtain F1 scores of 98.9, 98.9, and 93.13, respectively, for star, galaxy, and QSO selection with our unsupervised learning method. We find that careful attribute selection is a vital part of accurate classification with HDBSCAN. We applied our classification to a subset of the SDSS spectroscopic catalogue and demonstrated the potential of our approach for correcting misclassified spectra, which is useful for DESI and 4MOST. Finally, we created a multiwavelength catalogue of 2.7 million sources using the KiDS, VIKING, and ALLWISE surveys and published the corresponding classifications and photometric redshifts.
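A minimal sketch of the one-vs-rest use of HDBSCAN described in the Methods, on toy colour-space data rather than the survey photometry: run the clusterer, take the cluster that best matches one class as that class's selection, and score the run as a binary classifier with F1.

```python
# Sketch: HDBSCAN run treated as a binary "star" classifier (synthetic data).
import numpy as np
import hdbscan                                   # pip install hdbscan
from sklearn.metrics import f1_score

rng = np.random.default_rng(5)
# Toy 2-D colour space: three blobs standing in for stars, galaxies, QSOs.
centres = np.array([[0.0, 0.0], [2.5, 1.0], [1.0, 3.0]])
labels_true = np.repeat([0, 1, 2], 400)          # 0 = star, 1 = galaxy, 2 = QSO
X = centres[labels_true] + rng.normal(0.0, 0.35, (1200, 2))

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=10).fit(X)

# Take the cluster holding the most true stars as the star selection
# (label -1 is HDBSCAN noise and is never selected).
star_cluster = max(
    set(clusterer.labels_) - {-1},
    key=lambda c: np.sum((clusterer.labels_ == c) & (labels_true == 0)))
pred = (clusterer.labels_ == star_cluster).astype(int)
print(f"star-selection F1: {f1_score((labels_true == 0).astype(int), pred):.3f}")
```

In the paper this per-class optimization is repeated for galaxies and QSOs, and the three binary outputs are consolidated into the final classification.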

