A deep learning framework for characterization of genotype data

2020
Author(s):  
Kristiina Ausmees ◽  
Carl Nettelblad

ABSTRACT

Dimensionality reduction is a data transformation technique widely used in various fields of genomics research, with principal component analysis one of the most frequently employed methods. Applying principal component analysis to genotype data is known to capture genetic similarity between individuals, and it is used for visualization of genetic variation, identification of population structure, and ancestry mapping. However, the method is based on a linear model that is sensitive to characteristics of the data, such as correlation between single-nucleotide polymorphisms due to linkage disequilibrium, which limits its ability to capture complex population structure.

Deep learning models are a class of nonlinear machine learning methods in which the features used in data transformation are learned by the model in a data-driven manner, rather than specified by the researcher, and they have been shown to be a promising alternative to traditional statistical methods for various applications in omics research. In this paper, we propose a deep learning model based on a convolutional autoencoder architecture for dimensionality reduction of genotype data.

Using a highly diverse cohort of human samples, we demonstrate that the model can identify population clusters and provide richer visual information than principal component analysis, as well as yield a more accurate population classification model. We also discuss the use of the methodology for more general characterization of genotype data, showing that models of a similar architecture can be used as a genetic clustering method, and comparing results to the ADMIXTURE software frequently used in population genetic studies.
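As an illustration of the kind of architecture the paper describes, the following is a minimal sketch of a convolutional autoencoder for genotype dimensionality reduction in PyTorch. The layer sizes, kernel widths, two-dimensional latent space, and 0/1/2 allele-count encoding are assumptions made for illustration, not the authors' exact configuration.

```python
# Sketch of a convolutional autoencoder for genotype data (hypothetical
# hyperparameters; genotypes assumed encoded as 0/1/2 allele counts).
import torch
import torch.nn as nn

class GenotypeAutoencoder(nn.Module):
    def __init__(self, n_snps: int, latent_dim: int = 2):
        super().__init__()
        # 1-D convolutions exploit local LD structure along the genome.
        self.encoder_conv = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5, stride=2, padding=2), nn.ELU(),
            nn.Conv1d(8, 8, kernel_size=5, stride=2, padding=2), nn.ELU(),
        )
        self.conv_len = n_snps // 4  # length after two stride-2 convolutions
        self.to_latent = nn.Linear(8 * self.conv_len, latent_dim)
        self.from_latent = nn.Linear(latent_dim, 8 * self.conv_len)
        self.decoder_conv = nn.Sequential(
            nn.ConvTranspose1d(8, 8, kernel_size=5, stride=2,
                               padding=2, output_padding=1), nn.ELU(),
            nn.ConvTranspose1d(8, 1, kernel_size=5, stride=2,
                               padding=2, output_padding=1),
        )

    def forward(self, x):                  # x: (batch, 1, n_snps)
        h = self.encoder_conv(x)
        z = self.to_latent(h.flatten(1))   # 2-D embedding, analogous to PC1/PC2
        h = self.from_latent(z).view(-1, 8, self.conv_len)
        return self.decoder_conv(h), z

# Usage: minimize reconstruction error, then plot z per sample to
# visualize population structure.
model = GenotypeAutoencoder(n_snps=1024)
x = torch.randint(0, 3, (16, 1, 1024)).float()  # toy batch of genotypes
recon, embedding = model(x)
loss = nn.functional.mse_loss(recon, x)
```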

Author(s):  
А.О. Алексанян ◽  
С.О. Старков ◽  
К.В. Моисеев

The study addresses face recognition for identification purposes, where the input data for subsequent classification are feature vectors produced by a deep learning network. Few existing algorithms can perform open-set classification with sufficiently high reliability. The common approach to classification is a threshold-based classifier. This approach has several significant disadvantages, which cause the low quality of open-set classification. The key disadvantages are as follows. First, there is no fixed threshold: it is impossible to find a universal threshold suitable for every face. Second, raising the threshold lowers the quality of classification. Third, with threshold classification a single face can match many classes at once. For this reason, we propose applying principal component analysis as an additional dimensionality reduction step, on top of the key facial features extracted by the deep learning network, for the subsequent classification of the feature vectors. In geometric terms, applying principal component analysis to the feature vectors and then classifying them amounts to searching for a lower-dimensional space in which the projections of the source vectors are well separated. The idea of dimensionality reduction follows from the assumption that not all components of the N-dimensional feature vectors contribute meaningfully to the description of a human face, and that only some components account for most of the variance. Therefore, by selecting only the significant components of the feature vectors, we can separate the classes on the basis of the most variable features, without examining the less informative data and without comparing vectors in a high-dimensional space.
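A minimal sketch of this pipeline is shown below: PCA compresses deep-network face embeddings before classification. The 512-dimensional embeddings, the 95% variance threshold, and the nearest-neighbor classifier are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: PCA on deep-learning face embeddings before classification
# (synthetic data stands in for real network outputs).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 512))   # e.g. 512-D vectors from a CNN
labels = rng.integers(0, 10, size=500)     # identities in the gallery set

# Keep only the components carrying most of the variance across faces.
pca = PCA(n_components=0.95)               # retain 95% of total variance
reduced = pca.fit_transform(embeddings)
print(reduced.shape)                       # far fewer than 512 dimensions

# Classify in the reduced space, where projections separate more cleanly
# than raw threshold matching in the original high-dimensional space.
clf = KNeighborsClassifier(n_neighbors=5).fit(reduced, labels)
query = pca.transform(rng.normal(size=(1, 512)))
print(clf.predict(query))
```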


Author(s):  
Waqar Qureshi ◽  
Francesca Cura ◽  
Andrea Mura

Fretting wear is a quasi-static process in which repeated relative surface movement of components results in wear and fatigue. Fretting wear is particularly significant for spline couplings, which are frequently used in the aircraft industry to transfer torque and power. Fretting wear depends on materials, pressure distribution, torque, rotational speed, lubrication, surface finish, misalignment between spline shafts, and other factors. The presence of so many factors makes it difficult to design experiments that yield better models of fretting wear, as is the case whenever a mathematical model is sought from experimental data prone to noisy measurements, outliers, and redundant variables. This work develops a principal component analysis based method, using a criterion that is insensitive to outliers, to better design and interpret fretting wear experiments. The proposed method can be extended to other cases as well.
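One possible realization of an outlier-insensitive criterion, sketched below purely for illustration and not necessarily the authors' method, is to center and scale each variable by its median and median absolute deviation (MAD) before PCA, so that single aberrant test runs carry less weight than under mean/standard-deviation scaling.

```python
# Sketch: robustly standardized PCA for experimental wear data
# (toy data; column meanings are hypothetical).
import numpy as np
from sklearn.decomposition import PCA

def robust_standardize(X):
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)  # median absolute deviation
    return (X - med) / (1.4826 * mad)         # 1.4826: consistency with sigma

rng = np.random.default_rng(1)
# Toy design matrix: columns = pressure, torque, speed, finish, misalignment
X = rng.normal(size=(60, 5))
X[0] += 25.0                                  # one grossly outlying test run

pca = PCA(n_components=2).fit(robust_standardize(X))
print(pca.explained_variance_ratio_)          # dominant directions of variation
```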


2022
pp. 146808742110707
Author(s):  
Aran Mohammad ◽  
Reza Rezaei ◽  
Christopher Hayduk ◽  
Thaddaeus Delebinski ◽  
Saeid Shahpouri ◽  
...  

The development of internal combustion engines is shaped by exhaust gas emissions legislation and the drive to increase performance. This calls for engine-out emission models that can be used for engine optimization and real driving emissions control. The prediction capability of both physics-based and data-driven engine-out emission models is influenced by the system inputs, which are specified by the user; accuracy can improve as the number of inputs increases. However, irrelevant inputs then become more likely; these bear little functional relation to the emissions and can lead to overfitting. Alternatively, data-driven methods can be used to detect irrelevant and redundant inputs. In this work, thermodynamic states are modeled based on 772 stationary measured test bench data points from a commercial vehicle diesel engine. Afterward, 37 measured and modeled variables are passed into a data-driven dimensionality reduction. For this purpose, supervised learning approaches, such as lasso regression and the linear support vector machine, and unsupervised learning methods, such as principal component analysis and factor analysis, are applied to select and extract the relevant features. The selected and extracted features are used for regression by the support vector machine and a feedforward neural network to model the NOx, CO, HC, and soot emissions. This enables an evaluation of the modeling accuracy that results from the dimensionality reduction. Using these methods, the 37 variables are reduced to 25, 22, 11, and 16 inputs for NOx, CO, HC, and soot emission modeling, respectively, while maintaining accuracy. The features selected with the lasso algorithm yield more accurate regression models than the features extracted through principal component analysis and factor analysis. This results in test errors RMSE_Te of 19.22 ppm, 6.46 ppm, 1.29 ppm, and 0.06 FSN for the NOx, CO, HC, and soot emission models, respectively.
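The lasso-plus-regression stage described above can be sketched as follows. The synthetic data, the choice of SVR hyperparameters, and the 5-fold cross-validation are assumptions for illustration; only the dataset size (772 points, 37 inputs) comes from the abstract.

```python
# Sketch: lasso-based input selection followed by support vector
# regression for one emission target, e.g. NOx (synthetic data).
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(772, 37))            # 772 operating points, 37 inputs
y = X[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=772)  # toy target

# Step 1: lasso drives coefficients of irrelevant inputs to exactly zero.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)
selected = np.flatnonzero(lasso[-1].coef_)
print(f"{selected.size} of 37 inputs retained")

# Step 2: train the nonlinear regression model on the reduced input set.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
svr.fit(X[:, selected], y)
rmse = np.sqrt(np.mean((svr.predict(X[:, selected]) - y) ** 2))
print(f"training RMSE: {rmse:.3f}")
```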

