Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources

2015 ◽  
Vol 22 (5) ◽  
pp. 993-1000 ◽  
Author(s):  
Sheng Yu ◽  
Katherine P Liao ◽  
Stanley Y Shaw ◽  
Vivian S Gainer ◽  
Susanne E Churchill ◽  
...  

Abstract Objective Analysis of narrative (text) data from electronic health records (EHRs) can improve population-scale phenotyping for clinical and genetic research. Currently, selection of text features for phenotyping algorithms is slow and laborious, requiring extensive and iterative involvement by domain experts. This paper introduces a method to develop phenotyping algorithms in an unbiased manner by automatically extracting and selecting informative features, which can be comparable to expert-curated ones in classification accuracy. Materials and methods Comprehensive medical concepts were collected from publicly available knowledge sources in an automated, unbiased fashion. Natural language processing (NLP) revealed the occurrence patterns of these concepts in EHR narrative notes, which enabled selection of informative features for phenotype classification. When combined with additional codified features, a penalized logistic regression model was trained to classify the target phenotype. Results The authors applied this method to develop algorithms to identify rheumatoid arthritis (RA) cases and coronary artery disease (CAD) cases among patients with RA from a large multi-institutional EHR. The areas under the receiver operating characteristic curve (AUCs) for classifying RA and CAD using models trained with automated features were 0.951 and 0.929, respectively, compared to AUCs of 0.938 and 0.929 for models trained with expert-curated features. Discussion Models trained with NLP text features selected through an unbiased, automated procedure achieved comparable or slightly higher accuracy than those trained with expert-curated features. The majority of the selected model features were interpretable. Conclusion The proposed automated feature extraction method, generating highly accurate phenotyping algorithms with improved efficiency, is a significant step toward high-throughput phenotyping.
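To make the final modelling step concrete, here is a minimal sketch of training an L1-penalized logistic regression over NLP concept counts and codified features, in the spirit of the penalized model described above. The cohort size, feature pool, labels, and regularization strength are hypothetical placeholders, and scikit-learn's LASSO-penalized logistic regression stands in for the authors' exact model.

```python
# Minimal sketch: L1-penalized logistic regression over NLP concept counts
# plus codified features, as a stand-in for the paper's penalized model.
# All data below are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_patients, n_features = 500, 200          # hypothetical cohort and feature pool
X = rng.poisson(1.0, size=(n_patients, n_features)).astype(float)  # NLP/ICD counts
X = np.log1p(X)                            # log-transform counts, as is common
y = rng.integers(0, 2, size=n_patients)    # gold-standard labels from chart review

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000)
print("CV AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())

model.fit(X, y)
selected = np.flatnonzero(model.coef_[0])  # features kept by the L1 penalty
print(f"{selected.size} of {n_features} features retained")
```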

2016 ◽  
Vol 24 (e1) ◽  
pp. e143-e149 ◽  
Author(s):  
Sheng Yu ◽  
Abhishek Chakrabortty ◽  
Katherine P Liao ◽  
Tianrun Cai ◽  
Ashwin N Ananthakrishnan ◽  
...  

Objective: Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical record systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method, which improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy. Methods: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype’s International Classification of Diseases, Ninth Revision (ICD-9) code counts and natural language processing (NLP) concept counts, acting as noisy surrogates for the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features. Results: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn’s disease, and ulcerative colitis using various numbers of labels, to compare the performance of features selected by SAFE, by a previously published automated feature extraction for phenotyping (AFEP) procedure, and by domain experts. The out-of-sample area under the receiver operating characteristic curve and F-score from SAFE algorithms were remarkably higher than those from the other two, especially at small label sizes. Conclusion: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces overfitting and the number of gold-standard labels needed. SAFE also potentially identifies important features missed by AFEP or by experts.
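The surrogate idea can be illustrated with a rough sketch: derive silver-standard labels from the extremes of the main ICD-9 and NLP counts, then keep the candidate features that a sparse regression finds predictive of those silver labels. This is only an illustration on assumed data; SAFE's actual estimation procedure is more involved than the few lines below.

```python
# Rough sketch of the surrogate idea behind SAFE: use main ICD-9 and NLP counts
# as noisy "silver" labels and keep candidate features most predictive of them.
# This is an illustration, not the authors' exact estimation procedure.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 2000, 300                                  # hypothetical unlabeled cohort
icd_count = rng.poisson(2.0, size=n)              # main ICD-9 code count (surrogate)
nlp_count = rng.poisson(2.0, size=n)              # main NLP mention count (surrogate)
X = np.log1p(rng.poisson(0.5, size=(n, p))).astype(float)  # candidate concept counts
X[:, 0] += 0.2 * icd_count                        # make a couple of features informative
X[:, 1] += 0.2 * nlp_count

# Silver-standard labels: extremes of the surrogate counts.
silver_pos = (icd_count >= 4) & (nlp_count >= 4)
silver_neg = (icd_count == 0) & (nlp_count == 0)
keep = silver_pos | silver_neg
y_silver = silver_pos[keep].astype(float)

# Sparse regression of silver labels on candidate features; nonzero coefficients
# define the feature set passed on to supervised training with gold labels.
lasso = Lasso(alpha=0.01).fit(X[keep], y_silver)
selected = np.flatnonzero(lasso.coef_)
print(f"SAFE-style selection kept {selected.size} of {p} candidate features")
```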


2020 ◽  
Vol 7 (4) ◽  
pp. 745
Author(s):  
Rizka Indah Armianti ◽  
Achmad Fanany Onnilita Gaffar ◽  
Arief Bramanto Wicaksono Putra

An object is considered to be moving if its dimensional position changes in each frame. The movement of an object causes it to have a different pattern shape in each frame. The frame that has the best pattern among the other frames is called the dominant frame. This study aims to select the dominant frame from a sequence of frames by applying the K-means clustering method to obtain the dominant centroid (the centroid with the highest value), which is used as the basis for dominant frame selection. Selecting the dominant frame involves four main stages: data acquisition, determination of object patterns, feature extraction, and selection. The data used are video data, on which object patterns are determined using digital image processing operations; the result is an RGB object pattern, from which NTSC-based features are extracted using a first-order statistical measure, the mean. Feature extraction yields 93 frame records, which are then grouped into 3 clusters using the K-means method. From the clustering results, the dominant centroid lies in cluster 3, with a centroid value of 0.0177 and 41 frame records. The distance of all cluster 3 records to the centroid is then measured; the record with the smallest distance to the centroid is the dominant frame. The results of dominant frame selection are shown by the distances between the centroid and the cluster members: of all 41 frame records, the three best distances obtained are 0.0008, 0.0010, and 0.0010, belonging to the 59th, 36th, and 35th frames.
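As an illustration of the selection step described above, the following sketch clusters per-frame mean features with K-means, takes the cluster with the highest centroid value, and returns the frames closest to that centroid. The feature values are synthetic placeholders for the NTSC-based first-order (mean) features of the 93 frames.

```python
# Minimal sketch of the dominant-frame selection step: cluster per-frame mean
# features with K-means, take the cluster whose centroid value is highest, and
# pick the frames closest to that centroid. Feature values here are synthetic
# placeholders for the NTSC-based first-order (mean) features of 93 frames.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
features = rng.uniform(0.0, 0.03, size=(93, 1))   # one mean value per frame

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
dominant_cluster = int(np.argmax(kmeans.cluster_centers_[:, 0]))  # highest centroid

members = np.flatnonzero(kmeans.labels_ == dominant_cluster)
dists = np.abs(features[members, 0] - kmeans.cluster_centers_[dominant_cluster, 0])
best = members[np.argsort(dists)[:3]]              # three frames nearest the centroid
print("Dominant frames (0-indexed):", best, "distances:", np.sort(dists)[:3])
```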


2020 ◽  
Author(s):  
Margaret R. Krause ◽  
Suchismita Mondal ◽  
José Crossa ◽  
Ravi P. Singh ◽  
Francisco Pinto ◽  
...  

Abstract Breeding programs for wheat and many other crops require one or more generations of seed increase before replicated yield trials can be sown. Extensive phenotyping at this stage of the breeding cycle is challenging due to the small plot size and large number of lines under evaluation. Therefore, breeders typically rely on visual selection of small, unreplicated seed increase plots for the promotion of breeding lines to replicated yield trials. With the development of aerial high-throughput phenotyping technologies, breeders now have the ability to rapidly phenotype thousands of breeding lines for traits that may be useful for indirect selection of grain yield. We evaluated early generation material in the irrigated bread wheat (Triticum aestivum L.) breeding program at the International Maize and Wheat Improvement Center to determine if aerial measurements of vegetation indices assessed on small, unreplicated plots were predictive of grain yield. To test this approach, two sets of 1,008 breeding lines were sown both as replicated yield trials and as small, unreplicated plots during two breeding cycles. Vegetation indices collected with an unmanned aerial vehicle in the small plots were observed to be heritable and moderately correlated with grain yield assessed in replicated yield trials. Furthermore, vegetation indices were more predictive of grain yield than univariate genomic selection, while multi-trait genomic selection approaches that combined genomic information with the aerial phenotypes were found to have the highest predictive abilities overall. A related experiment showed that selection approaches for grain yield based on vegetation indices could be more effective than visual selection; however, selection on the vegetation indices alone would have also driven a directional response in phenology due to confounding between those traits. A restricted selection index was proposed for improving grain yield without affecting the distribution of phenology in the breeding population. The results of these experiments provide a promising outlook for the use of aerial high-throughput phenotyping traits to improve selection at the early-generation seed-limited stage of wheat breeding programs.
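The restricted-index idea can be made concrete with a small numerical sketch using the classical Kempthorne-Nordskog formulation: build an index on the vegetation index and days to heading that targets gain in grain yield while forcing the expected genetic change in phenology to zero. The (co)variance values below are hypothetical, and the paper's own index construction may differ in detail.

```python
# Minimal sketch of a Kempthorne-Nordskog restricted selection index: select on
# vegetation index (VI) and days to heading (DH) to improve grain yield while
# forcing zero expected genetic change in DH. All (co)variances are hypothetical.
import numpy as np

P = np.array([[1.00, 0.30],        # phenotypic (co)variance of index traits [VI, DH]
              [0.30, 1.00]])
G_yield = np.array([[0.40],        # genetic covariance of [VI, DH] with grain yield
                    [0.25]])
G_dh = np.array([[0.20],           # genetic covariance of [VI, DH] with DH itself
                 [0.60]])
a = np.array([[1.0]])              # economic weight on grain yield

P_inv = np.linalg.inv(P)

# Unrestricted index weights: b = P^-1 G a
b_unres = P_inv @ G_yield @ a

# Restricted weights: project out the component that would change DH
proj = np.eye(2) - P_inv @ G_dh @ np.linalg.inv(G_dh.T @ P_inv @ G_dh) @ G_dh.T
b_res = proj @ P_inv @ G_yield @ a

print("unrestricted weights:", b_unres.ravel())
print("restricted weights:  ", b_res.ravel())
print("expected DH response is proportional to:", float(G_dh.T @ b_res))  # ~0
```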


2021 ◽  
Vol 1 (1) ◽  
pp. 453-478
Author(s):  
Heriyanto Heriyanto ◽  
Herlina Jayadianti ◽  
Juwairiah Juwairiah

There are two approaches to Qur’an recitation, namely talaqqi and qira'ati. Both approaches use the science of recitation, which contains the rules and procedures for reading the Qur’an properly. Talaqqi requires the teacher and students to sit facing each other, while qira'ati is recitation of the Qur’an with rhythms and tones. Many studies have developed automatic speech recognition systems for Qur’an recitation to help the learning process. Common feature extraction models are the Mel Frequency Cepstral Coefficient (MFCC) and Linear Predictive Coding (LPC). The MFCC method has an accuracy of 50% to 60%, while the accuracy of LPC is only 45% to 50%, so the non-linear MFCC method has higher accuracy than the linear approach. The cepstral coefficient features used range from coefficient 0 to 23, i.e., 24 cepstral coefficients. The frames taken range from frame 0 to frame 10, i.e., eleven frames. Voting over 300 recorded voice samples was tested against 200 voice recordings, of both male and female voices. The sampling rate used was 44.1 kHz, 16-bit stereo. This study aims to obtain good accuracy by selecting the right cepstral coefficient feature from MFCC feature extraction, and to match accuracy through cepstral coefficient feature selection with Dominant Weight Normalization (NBD), at TPA Nurul Huda Plus Purbayan. The results showed that the MFCC method with selection of the 23rd cepstral coefficient has a higher accuracy rate, 90.2%, than the other configurations. It can be concluded that selecting the right feature, the 23rd cepstral coefficient, affects the accuracy of recognizing the voice of Qur’an recitation.
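A minimal sketch of the described feature selection is shown below, assuming librosa as the MFCC implementation (a stand-in for whatever toolchain the authors used): compute 24 cepstral coefficients per frame and keep the 23rd coefficient over the first eleven frames. The voting and NBD matching steps are omitted, and the file name is a placeholder.

```python
# Minimal sketch of the described feature selection: compute 24 MFCCs (indices
# 0-23) per frame and keep the 23rd coefficient over the first eleven frames.
# The file name is a hypothetical placeholder; librosa handles loading and MFCCs.
import librosa
import numpy as np

y, sr = librosa.load("recitation_sample.wav", sr=44100)   # 44.1 kHz recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24)         # shape: (24, n_frames)

coef_23 = mfcc[23, :11]        # 23rd cepstral coefficient, frames 0..10
feature_vector = coef_23       # feature used for matching against stored templates
print("selected feature vector:", np.round(feature_vector, 3))
```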


Author(s):  
M. Herrero-Huerta ◽  
K. M. Rainey

Abstract. Nowadays, an essential tool for improving the efficiency of crop genetics is automated, precise, and cost-effective phenotyping of plants. The aim of this study is to develop a methodology for high-throughput phenotyping of the physiological growth dynamics of soybeans by UAS-based 3D modelling. During the 2018 growing season, a soybean experiment was performed at the Agronomy Center for Research and Education (ACRE) in West Lafayette (Indiana, USA). Periodic images were acquired with a Canon G9 X compact digital camera on board a senseFly eBee. The study area was reconstructed in 3D by image-based modelling. Algorithms and techniques were combined to analyse the growth dynamics of the crop via height variations and to quantify biomass. The results provide practical information for the selection of phenotypes for breeding.
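A minimal sketch of one common way to turn the 3D reconstruction into height-based growth metrics is shown below: subtract a bare-ground terrain model from the crop surface model and summarize canopy height per plot. The rasters and the plot mask are synthetic placeholders rather than outputs of the authors' pipeline.

```python
# Minimal sketch of extracting canopy height from UAS-derived surface models:
# subtract a bare-ground terrain model (DTM) from the crop surface model (DSM)
# and summarize heights per plot. The rasters and plot mask here are synthetic
# placeholders for models produced by image-based 3D reconstruction.
import numpy as np

rng = np.random.default_rng(3)
dtm = 200.0 + rng.normal(0.0, 0.02, size=(100, 100))   # bare-ground elevation (m)
dsm = dtm + rng.uniform(0.2, 0.9, size=(100, 100))     # crop surface elevation (m)

chm = dsm - dtm                                         # canopy height model (m)
plot_mask = np.zeros((100, 100), dtype=bool)
plot_mask[10:40, 10:30] = True                          # pixels of one plot

plot_height = np.percentile(chm[plot_mask], 90)         # robust plot-level height
print(f"estimated plot height: {plot_height:.2f} m")
```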


Author(s):  
Marwa Ben Salah ◽  
Ameni Yengui ◽  
Mahmoud Neji

In this paper, we present two steps in the process of automatic annotation of archaeological images: feature extraction and feature selection. We focus our research on archaeological images, which are widely studied today. These two steps are the most important in the process of automatically annotating an image. Feature extraction techniques are applied to obtain the features that will be used for classifying and recognizing the images, while feature selection reduces the number of uninformative features. We review various feature extraction techniques for analyzing archaeological images; each feature corresponds to one or more feature descriptors of the archaeological images. We focus on the shape descriptor of the archaeological objects extracted from the images, using contour-based shape recognition of the monuments. The feature selection stage then serves to retain the most informative features in order to improve classification accuracy. In the feature selection section, we present a comparative study of feature selection techniques and then propose how to apply these methods to archaeological images. Finally, we evaluate the performance of the two steps already mentioned: feature extraction and feature selection.
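To make the two steps concrete, the sketch below extracts contour-based shape descriptors (Hu moments, one plausible choice of contour-derived descriptor) with OpenCV and then applies a univariate feature selection step. The synthetic rectangle and circle shapes are placeholders for segmented monument images, not the authors' data.

```python
# Minimal sketch of the two steps discussed: contour-based shape feature
# extraction (Hu moments via OpenCV) followed by univariate feature selection.
# Synthetic binary shapes stand in for segmented archaeological monuments.
import cv2
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def shape_features(binary_img):
    """Return the 7 Hu moment invariants of the largest contour."""
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)
    hu = cv2.HuMoments(cv2.moments(largest)).ravel()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)   # standard log scaling

images, labels = [], []
for i in range(10):                                       # class 0: rectangles ("walls")
    img = np.zeros((128, 128), np.uint8)
    cv2.rectangle(img, (20, 30), (100 + i, 80), 255, -1)
    images.append(img)
    labels.append(0)
for i in range(10):                                       # class 1: circles ("columns")
    img = np.zeros((128, 128), np.uint8)
    cv2.circle(img, (64, 64), 30 + i, 255, -1)
    images.append(img)
    labels.append(1)

X = np.array([shape_features(im) for im in images])
selector = SelectKBest(score_func=f_classif, k=3).fit(X, labels)  # keep 3 best descriptors
print("selected Hu-moment indices:", selector.get_support(indices=True))
```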

