A comparative study of machine learning methods to predict average daily gain from single nucleotide polymorphisms
This study compares the accuracy with which total genetic effects, i.e. additive and non-additive genetic effects, on average daily gain (ADG) are predicted from single-nucleotide polymorphisms (SNPs) by two machine learning (ML) algorithms, Elastic Net (ENET) and Support Vector Machine (SVM), with a genome-enabled best linear unbiased prediction model (GBLUP) as the benchmark. The target values were 439 ADG records that had previously been adjusted for systematic environmental and random effects. After quality control and selection of one SNP per linkage group, the 14,713 retained SNPs were ranked by their importance for predicting the adjusted ADG records. Subsets containing increasing numbers of the most informative SNPs (50, 100, 200, 300, 500 and all of the most informative SNPs) were then used as predictor variables for the adjusted ADG records, using either a radial basis function SVM or ENET. Hyperparameters for the two algorithms were tuned using nested resampling. The predictive performance of each ML algorithm and of GBLUP was evaluated as the median Spearman correlation (SC) across the 30 testing sets generated by a 6-fold cross-validation repeated 5 times. The best and most repeatable predictive performance was obtained with a subset of 100 SNPs and ENET, with a median SC of 0.26 and an interquartile range of 0.07. Predictive ability was null when all available SNPs were used, whether with ENET, SVM or GBLUP. The identified subset of 100 SNPs could potentially be used in selection to accelerate genetic progress in ADG.
Key words: Support Vector Machine, Elastic Net, prediction, growth, genome selection
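
The evaluation protocol described in the abstract (6-fold cross-validation repeated 5 times, giving 30 testing sets, summarised by the median and interquartile range of the Spearman correlation) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the random seed, the quartile convention and the `fit_predict` callback are assumptions made for illustration, and the nested tuning of hyperparameters is not shown; it would take place inside `fit_predict`, using the training data only.

```python
import math
import random
import statistics


def repeated_kfold_indices(n, k=6, repeats=5, seed=0):
    """Yield (train, test) index lists for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)  # seed is an arbitrary choice for reproducibility
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        # Striding through the shuffled order gives k folds of near-equal size.
        folds = [order[i::k] for i in range(k)]
        for i in range(k):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, folds[i]


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant vector is scored as zero correlation
    return cov / math.sqrt(vx * vy)


def evaluate(X, y, fit_predict, k=6, repeats=5, seed=0):
    """Return (median SC, IQR of SC, per-test-set SC values)."""
    scores = []
    for train, test in repeated_kfold_indices(len(y), k, repeats, seed):
        # fit_predict is a hypothetical callback: it fits on the training
        # rows and returns one prediction per testing row.
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        scores.append(spearman(preds, [y[i] for i in test]))
    # Quartiles follow Python's default ("exclusive") convention, which may
    # differ from the convention used in the study.
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return statistics.median(scores), q3 - q1, scores
```

With 439 records, `evaluate(X, y, fit_predict)` would return 30 per-test-set correlations together with their median and interquartile range. Any model, for example an elastic net fitted on a chosen SNP subset, could be wrapped as `fit_predict`.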