Identification of Chicken Populations by Machine Learning Models Using the Minimum Number of SNPs
Abstract
Background: A marker combination capable of classifying a specific chicken population could improve commercial value by increasing consumer confidence in the origin of the population. It would also facilitate the protection of genetic resources, especially in developing countries.
Methods: In this study, 283 samples from 20 lines, consisting of Korean native chickens, commercial native chickens, and commercial broilers with a layer population, were genotyped with a 600k high-density single nucleotide polymorphism (SNP) array to find the minimum marker combination. Machine learning algorithms, a genome-wide association study (GWAS), linkage disequilibrium (LD) analysis, and principal component analysis (PCA) were used to distinguish a target (case) group from control chicken groups. To verify the selected markers, 182 samples from 12 lines were used to check whether the accuracy of target chicken breed identification changed.
Results: A total of 47,303 SNPs were used for classifying chicken populations; 96 LD-pruned SNPs (50 SNPs per LD block) served as the best marker combination for target chicken classification. Moreover, 36, 44, and 8 SNPs were selected as the minimum numbers of markers by the AdaBoost (AB), Random Forest (RF), and Decision Tree (DT) machine learning classification models, which had accuracy rates of 99.6%, 98.0%, and 97.9%, respectively. The selected marker combinations increased the genetic distance between the case and control groups and reduced the number of genetic components, confirming that the groups could be classified efficiently using a small number of markers.
In a verification study including additional chicken breeds and samples, the accuracy did not change significantly, and the target chicken group could be clearly distinguished from the other populations.
Conclusions: The GWAS, PCA, and machine learning algorithms used in this study can be applied efficiently to find the minimum combination of markers that distinguishes varieties from among a large number of SNP markers.
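
The LD-pruning step mentioned in the Results can be illustrated with a minimal sketch. It is not the procedure used in the study: the abstract does not state the pruning tool, window, or threshold, so the greedy strategy, the `threshold=0.5` cutoff, the function names `r_squared` and `ld_prune`, and the toy genotypes below are all assumptions made for illustration. Genotypes are coded as allele dosages (0, 1, 2), and LD between two SNPs is approximated by the squared Pearson correlation of their dosages.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic SNP: correlation is undefined, treat it as unlinked.
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def ld_prune(genotypes, threshold=0.5):
    """Greedy pruning (assumed strategy): walk the SNPs in order and keep one
    only if its r^2 with every SNP already kept is below the threshold."""
    kept = []
    for name, dosages in genotypes.items():
        if all(r_squared(dosages, genotypes[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: four SNPs genotyped in six birds.
toy = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],  # identical to snp1, so r^2 = 1
    "snp3": [2, 0, 1, 1, 2, 0],  # weakly correlated with snp1 (r^2 = 0.25)
    "snp4": [0, 2, 2, 0, 1, 2],  # strongly correlated with snp1
}

pruned = ld_prune(toy)
print(pruned)  # ['snp1', 'snp3']
```

In this toy case the redundant markers `snp2` and `snp4` are dropped because they carry nearly the same information as `snp1`. A pruned set of this kind could then be passed to a classifier for further reduction.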