Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources

2015 ◽  
Vol 22 (5) ◽  
pp. 993-1000 ◽  
Author(s):  
Sheng Yu ◽  
Katherine P Liao ◽  
Stanley Y Shaw ◽  
Vivian S Gainer ◽  
Susanne E Churchill ◽  
...  

Abstract Objective Analysis of narrative (text) data from electronic health records (EHRs) can improve population-scale phenotyping for clinical and genetic research. Currently, selection of text features for phenotyping algorithms is slow and laborious, requiring extensive and iterative involvement by domain experts. This paper introduces a method to develop phenotyping algorithms in an unbiased manner by automatically extracting and selecting informative features, which can be comparable to expert-curated ones in classification accuracy. Materials and methods Comprehensive medical concepts were collected from publicly available knowledge sources in an automated, unbiased fashion. Natural language processing (NLP) revealed the occurrence patterns of these concepts in EHR narrative notes, which enabled selection of informative features for phenotype classification. When combined with additional codified features, a penalized logistic regression model was trained to classify the target phenotype. Results The authors applied this method to develop algorithms to identify rheumatoid arthritis (RA) cases and coronary artery disease (CAD) cases among patients with RA from a large multi-institutional EHR. The areas under the receiver operating characteristic curve (AUCs) for classifying RA and CAD using models trained with automated features were 0.951 and 0.929, respectively, compared to AUCs of 0.938 and 0.929 for models trained with expert-curated features. Discussion Models trained with NLP text features selected through an unbiased, automated procedure achieved comparable or slightly higher accuracy than those trained with expert-curated features. The majority of the selected model features were interpretable. Conclusion The proposed automated feature extraction method, generating highly accurate phenotyping algorithms with improved efficiency, is a significant step toward high-throughput phenotyping.
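To make the final modelling step concrete, here is a minimal sketch of training an L1-penalized logistic regression over NLP concept counts and codified features, in the spirit of the penalized model described above. The cohort size, feature pool, labels, and regularization strength are hypothetical placeholders, and scikit-learn's LASSO-penalized logistic regression stands in for the authors' exact model.

```python
# Minimal sketch: L1-penalized logistic regression over NLP concept counts
# plus codified features, as a stand-in for the paper's penalized model.
# All data below are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_patients, n_features = 500, 200          # hypothetical cohort and feature pool
X = rng.poisson(1.0, size=(n_patients, n_features)).astype(float)  # NLP/ICD counts
X = np.log1p(X)                            # log-transform counts, as is common
y = rng.integers(0, 2, size=n_patients)    # gold-standard labels from chart review

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000)
print("CV AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())

model.fit(X, y)
selected = np.flatnonzero(model.coef_[0])  # features kept by the L1 penalty
print(f"{selected.size} of {n_features} features retained")
```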

2016 ◽  
Vol 24 (e1) ◽  
pp. e143-e149 ◽  
Author(s):  
Sheng Yu ◽  
Abhishek Chakrabortty ◽  
Katherine P Liao ◽  
Tianrun Cai ◽  
Ashwin N Ananthakrishnan ◽  
...  

Objective: Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical record systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method, which improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy. Methods: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype’s International Classification of Diseases, Ninth Revision (ICD-9) code counts and natural language processing (NLP) concept counts, acting as noisy surrogates for the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features. Results: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn’s disease, and ulcerative colitis using various numbers of labels, to compare the performance of features selected by SAFE, by a previously published automated feature extraction for phenotyping (AFEP) procedure, and by domain experts. The out-of-sample area under the receiver operating characteristic curve and F-score from SAFE algorithms were remarkably higher than those from the other two, especially at small label sizes. Conclusion: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces overfitting and the number of gold-standard labels needed. SAFE also potentially identifies important features missed by AFEP or by experts.
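The surrogate idea can be illustrated with a rough sketch: derive silver-standard labels from the extremes of the main ICD-9 and NLP counts, then keep the candidate features that a sparse regression finds predictive of those silver labels. This is only an illustration on assumed data; SAFE's actual estimation procedure is more involved than the few lines below.

```python
# Rough sketch of the surrogate idea behind SAFE: use main ICD-9 and NLP counts
# as noisy "silver" labels and keep candidate features most predictive of them.
# This is an illustration, not the authors' exact estimation procedure.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 2000, 300                                  # hypothetical unlabeled cohort
icd_count = rng.poisson(2.0, size=n)              # main ICD-9 code count (surrogate)
nlp_count = rng.poisson(2.0, size=n)              # main NLP mention count (surrogate)
X = np.log1p(rng.poisson(0.5, size=(n, p))).astype(float)  # candidate concept counts
X[:, 0] += 0.2 * icd_count                        # make a couple of features informative
X[:, 1] += 0.2 * nlp_count

# Silver-standard labels: extremes of the surrogate counts.
silver_pos = (icd_count >= 4) & (nlp_count >= 4)
silver_neg = (icd_count == 0) & (nlp_count == 0)
keep = silver_pos | silver_neg
y_silver = silver_pos[keep].astype(float)

# Sparse regression of silver labels on candidate features; nonzero coefficients
# define the feature set passed on to supervised training with gold labels.
lasso = Lasso(alpha=0.01).fit(X[keep], y_silver)
selected = np.flatnonzero(lasso.coef_)
print(f"SAFE-style selection kept {selected.size} of {p} candidate features")
```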


2020 ◽  
Vol 7 (4) ◽  
pp. 745
Author(s):  
Rizka Indah Armianti ◽  
Achmad Fanany Onnilita Gaffar ◽  
Arief Bramanto Wicaksono Putra

An object is considered to be moving if its dimensional position changes in each frame. The movement of an object causes it to have a different pattern shape in each frame. The frame that has the best pattern among the other frames is called the dominant frame. This study aims to select the dominant frame from a sequence of frames by applying the K-means clustering method to obtain the dominant centroid (the centroid with the highest value), which is used as the basis for dominant frame selection. Selecting the dominant frame involves four main stages: data acquisition, determination of object patterns, feature extraction, and selection. The data used are video data, on which object patterns are determined using digital image processing operations; the result is an RGB object pattern, from which NTSC-based features are extracted using a first-order statistical measure, the mean. Feature extraction yields 93 frame records, which are then grouped into 3 clusters using the K-means method. From the clustering results, the dominant centroid lies in cluster 3, with a centroid value of 0.0177 and 41 frame records. The distance of all cluster 3 records to the centroid is then measured; the record with the smallest distance to the centroid is the dominant frame. The results of dominant frame selection are shown by the distances between the centroid and the cluster members: of all 41 frame records, the three best distances obtained are 0.0008, 0.0010, and 0.0010, belonging to the 59th, 36th, and 35th frames.
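As an illustration of the selection step described above, the following sketch clusters per-frame mean features with K-means, takes the cluster with the highest centroid value, and returns the frames closest to that centroid. The feature values are synthetic placeholders for the NTSC-based first-order (mean) features of the 93 frames.

```python
# Minimal sketch of the dominant-frame selection step: cluster per-frame mean
# features with K-means, take the cluster whose centroid value is highest, and
# pick the frames closest to that centroid. Feature values here are synthetic
# placeholders for the NTSC-based first-order (mean) features of 93 frames.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
features = rng.uniform(0.0, 0.03, size=(93, 1))   # one mean value per frame

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
dominant_cluster = int(np.argmax(kmeans.cluster_centers_[:, 0]))  # highest centroid

members = np.flatnonzero(kmeans.labels_ == dominant_cluster)
dists = np.abs(features[members, 0] - kmeans.cluster_centers_[dominant_cluster, 0])
best = members[np.argsort(dists)[:3]]              # three frames nearest the centroid
print("Dominant frames (0-indexed):", best, "distances:", np.sort(dists)[:3])
```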


2020 ◽  
Author(s):  
Margaret R. Krause ◽  
Suchismita Mondal ◽  
José Crossa ◽  
Ravi P. Singh ◽  
Francisco Pinto ◽  
...  

Abstract Breeding programs for wheat and many other crops require one or more generations of seed increase before replicated yield trials can be sown. Extensive phenotyping at this stage of the breeding cycle is challenging due to the small plot size and large number of lines under evaluation. Therefore, breeders typically rely on visual selection of small, unreplicated seed increase plots for the promotion of breeding lines to replicated yield trials. With the development of aerial high-throughput phenotyping technologies, breeders now have the ability to rapidly phenotype thousands of breeding lines for traits that may be useful for indirect selection of grain yield. We evaluated early generation material in the irrigated bread wheat (Triticum aestivum L.) breeding program at the International Maize and Wheat Improvement Center to determine if aerial measurements of vegetation indices assessed on small, unreplicated plots were predictive of grain yield. To test this approach, two sets of 1,008 breeding lines were sown both as replicated yield trials and as small, unreplicated plots during two breeding cycles. Vegetation indices collected with an unmanned aerial vehicle in the small plots were observed to be heritable and moderately correlated with grain yield assessed in replicated yield trials. Furthermore, vegetation indices were more predictive of grain yield than univariate genomic selection, while multi-trait genomic selection approaches that combined genomic information with the aerial phenotypes were found to have the highest predictive abilities overall. A related experiment showed that selection approaches for grain yield based on vegetation indices could be more effective than visual selection; however, selection on the vegetation indices alone would have also driven a directional response in phenology due to confounding between those traits. A restricted selection index was proposed for improving grain yield without affecting the distribution of phenology in the breeding population. The results of these experiments provide a promising outlook for the use of aerial high-throughput phenotyping traits to improve selection at the early-generation seed-limited stage of wheat breeding programs.
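The restricted-index idea can be made concrete with a small numerical sketch using the classical Kempthorne-Nordskog formulation: build an index on the vegetation index and days to heading that targets gain in grain yield while forcing the expected genetic change in phenology to zero. The (co)variance values below are hypothetical, and the paper's own index construction may differ in detail.

```python
# Minimal sketch of a Kempthorne-Nordskog restricted selection index: select on
# vegetation index (VI) and days to heading (DH) to improve grain yield while
# forcing zero expected genetic change in DH. All (co)variances are hypothetical.
import numpy as np

P = np.array([[1.00, 0.30],        # phenotypic (co)variance of index traits [VI, DH]
              [0.30, 1.00]])
G_yield = np.array([[0.40],        # genetic covariance of [VI, DH] with grain yield
                    [0.25]])
G_dh = np.array([[0.20],           # genetic covariance of [VI, DH] with DH itself
                 [0.60]])
a = np.array([[1.0]])              # economic weight on grain yield

P_inv = np.linalg.inv(P)

# Unrestricted index weights: b = P^-1 G a
b_unres = P_inv @ G_yield @ a

# Restricted weights: project out the component that would change DH
proj = np.eye(2) - P_inv @ G_dh @ np.linalg.inv(G_dh.T @ P_inv @ G_dh) @ G_dh.T
b_res = proj @ P_inv @ G_yield @ a

print("unrestricted weights:", b_unres.ravel())
print("restricted weights:  ", b_res.ravel())
print("expected DH response is proportional to:", float(G_dh.T @ b_res))  # ~0
```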


2021 ◽  
Vol 1 (1) ◽  
pp. 453-478
Author(s):  
Heriyanto Heriyanto ◽  
Herlina Jayadianti ◽  
Juwairiah Juwairiah

There are two approaches to Qur’an recitation, namely talaqqi and qira'ati. Both approaches use the science of recitation, which contains the rules and procedures for reading the Qur’an properly. Talaqqi requires the teacher and students to sit facing each other, while qira'ati is recitation of the Qur’an with rhythms and tones. Many studies have developed automatic speech recognition systems for Qur’an recitation to help the learning process. Common feature extraction models are the Mel Frequency Cepstral Coefficient (MFCC) and Linear Predictive Coding (LPC). The MFCC method has an accuracy of 50% to 60%, while the accuracy of LPC is only 45% to 50%, so the non-linear MFCC method has higher accuracy than the linear approach. The cepstral coefficient features used range from coefficient 0 to 23, i.e., 24 cepstral coefficients. The frames taken range from frame 0 to frame 10, i.e., eleven frames. Voting over 300 recorded voice samples was tested against 200 voice recordings, of both male and female voices. The sampling rate used was 44.1 kHz, 16-bit stereo. This study aims to obtain good accuracy by selecting the right cepstral coefficient feature from MFCC feature extraction, and to match accuracy through cepstral coefficient feature selection with Dominant Weight Normalization (NBD), at TPA Nurul Huda Plus Purbayan. The results showed that the MFCC method with selection of the 23rd cepstral coefficient has a higher accuracy rate, 90.2%, than the other configurations. It can be concluded that selecting the right feature, the 23rd cepstral coefficient, affects the accuracy of recognizing the voice of Qur’an recitation.
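A minimal sketch of the described feature selection is shown below, assuming librosa as the MFCC implementation (a stand-in for whatever toolchain the authors used): compute 24 cepstral coefficients per frame and keep the 23rd coefficient over the first eleven frames. The voting and NBD matching steps are omitted, and the file name is a placeholder.

```python
# Minimal sketch of the described feature selection: compute 24 MFCCs (indices
# 0-23) per frame and keep the 23rd coefficient over the first eleven frames.
# The file name is a hypothetical placeholder; librosa handles loading and MFCCs.
import librosa
import numpy as np

y, sr = librosa.load("recitation_sample.wav", sr=44100)   # 44.1 kHz recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24)         # shape: (24, n_frames)

coef_23 = mfcc[23, :11]        # 23rd cepstral coefficient, frames 0..10
feature_vector = coef_23       # feature used for matching against stored templates
print("selected feature vector:", np.round(feature_vector, 3))
```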


Author(s):  
M. Herrero-Huerta ◽  
K. M. Rainey

Abstract. Nowadays, an essential tool for improving the efficiency of crop genetics is automated, precise, and cost-effective phenotyping of plants. The aim of this study is to develop a methodology for high-throughput phenotyping of the physiological growth dynamics of soybeans by UAS-based 3D modelling. During the 2018 growing season, a soybean experiment was performed at the Agronomy Center for Research and Education (ACRE) in West Lafayette (Indiana, USA). Periodic images were acquired with a Canon G9 X compact digital camera on board a senseFly eBee. The study area was reconstructed in 3D by image-based modelling. Algorithms and techniques were combined to analyse the growth dynamics of the crop via height variations and to quantify biomass. The results provide practical information for the selection of phenotypes for breeding.
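A minimal sketch of one common way to turn the 3D reconstruction into height-based growth metrics is shown below: subtract a bare-ground terrain model from the crop surface model and summarize canopy height per plot. The rasters and the plot mask are synthetic placeholders rather than outputs of the authors' pipeline.

```python
# Minimal sketch of extracting canopy height from UAS-derived surface models:
# subtract a bare-ground terrain model (DTM) from the crop surface model (DSM)
# and summarize heights per plot. The rasters and plot mask here are synthetic
# placeholders for models produced by image-based 3D reconstruction.
import numpy as np

rng = np.random.default_rng(3)
dtm = 200.0 + rng.normal(0.0, 0.02, size=(100, 100))   # bare-ground elevation (m)
dsm = dtm + rng.uniform(0.2, 0.9, size=(100, 100))     # crop surface elevation (m)

chm = dsm - dtm                                         # canopy height model (m)
plot_mask = np.zeros((100, 100), dtype=bool)
plot_mask[10:40, 10:30] = True                          # pixels of one plot

plot_height = np.percentile(chm[plot_mask], 90)         # robust plot-level height
print(f"estimated plot height: {plot_height:.2f} m")
```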


Author(s):  
Marwa Ben Salah ◽  
Ameni Yengui ◽  
Mahmoud Neji

In this paper, we present two steps in the process of automatic annotation of archaeological images: feature extraction and feature selection. We focus our research on archaeological images, which are widely studied today. These two steps are the most important in the process of automatically annotating an image. Feature extraction techniques are applied to obtain the features that will be used for classifying and recognizing the images, while feature selection reduces the number of uninformative features. We review various feature extraction techniques for analyzing archaeological images; each feature corresponds to one or more feature descriptors of the archaeological images. We focus on the shape descriptor of the archaeological objects extracted from the images, using contour-based shape recognition of the monuments. The feature selection stage then serves to retain the most informative features in order to improve classification accuracy. In the feature selection section, we present a comparative study of feature selection techniques and then propose how to apply these methods to archaeological images. Finally, we evaluate the performance of the two steps already mentioned: feature extraction and feature selection.
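To make the two steps concrete, the sketch below extracts contour-based shape descriptors (Hu moments, one plausible choice of contour-derived descriptor) with OpenCV and then applies a univariate feature selection step. The synthetic rectangle and circle shapes are placeholders for segmented monument images, not the authors' data.

```python
# Minimal sketch of the two steps discussed: contour-based shape feature
# extraction (Hu moments via OpenCV) followed by univariate feature selection.
# Synthetic binary shapes stand in for segmented archaeological monuments.
import cv2
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def shape_features(binary_img):
    """Return the 7 Hu moment invariants of the largest contour."""
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)
    hu = cv2.HuMoments(cv2.moments(largest)).ravel()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)   # standard log scaling

images, labels = [], []
for i in range(10):                                       # class 0: rectangles ("walls")
    img = np.zeros((128, 128), np.uint8)
    cv2.rectangle(img, (20, 30), (100 + i, 80), 255, -1)
    images.append(img)
    labels.append(0)
for i in range(10):                                       # class 1: circles ("columns")
    img = np.zeros((128, 128), np.uint8)
    cv2.circle(img, (64, 64), 30 + i, 255, -1)
    images.append(img)
    labels.append(1)

X = np.array([shape_features(im) for im in images])
selector = SelectKBest(score_func=f_classif, k=3).fit(X, labels)  # keep 3 best descriptors
print("selected Hu-moment indices:", selector.get_support(indices=True))
```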

