Gradient boosting for the prediction of gas chromatographic retention indices

Author(s): Dmitriy D. Matyushin, Anastasia Yu. Sholokhova, Aleksey K. Buryak

The estimation of gas chromatographic retention indices from compound structures is an important problem. Predicted retention indices can be used in a mass spectral library search for the identification of unknowns. Various machine learning methods are used for this task, but methods based on decision trees, in particular gradient boosting, are not widely used. The aim of this work is to examine the usability of this method for retention index prediction. The input representation of a molecule consists of 177 molecular descriptors computed with the Chemistry Development Kit. Random subsets of the whole NIST 17 database are used as training, test and validation sets. The gradient boosting model uses 8000 trees with 6 leaves each. A neural network with one hidden layer (90 hidden nodes) is used for comparison. The same data sets and the same set of descriptors are used for the neural network and for gradient boosting. The model based on gradient boosting outperforms the neural network with one hidden layer for subsets of NIST 17 and for the set of essential oils. The performance of this model is comparable to or better than that of other modern retention prediction models. For subsets of NIST 17, the average relative deviation is ~3.0%, the median relative deviation is ~1.7%, and the median absolute deviation is ~34 retention index units. Only non-polar liquid stationary phases (such as polydimethylsiloxane, 5% phenyl 95% polydimethylsiloxane, squalane) are considered. Errors obtained with different machine learning algorithms and with the same representation of the molecule strongly correlate with each other.
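The three accuracy figures quoted in the abstract (average relative deviation, median relative deviation, median absolute deviation) can be illustrated with a minimal sketch. The abstract does not give the formulas, so the definitions below are assumptions: relative deviation is taken as |predicted − reference| / reference in percent, and "median absolute deviation" is taken as the median of the absolute errors. The function name `deviation_summary` and the retention index values are hypothetical and chosen only for illustration; they are not data from the paper.

```python
from statistics import mean, median


def deviation_summary(reference, predicted):
    """Summarise prediction errors for retention indices.

    Assumed definitions (not stated in the abstract):
      absolute deviation = |predicted - reference|
      relative deviation = 100 * |predicted - reference| / reference
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("reference and predicted must be non-empty and of equal length")

    # Per-compound absolute errors in retention index units.
    abs_err = [abs(p - r) for r, p in zip(reference, predicted)]
    # Per-compound relative errors in percent of the reference value.
    rel_err = [100.0 * e / r for e, r in zip(abs_err, reference)]

    return {
        "mean_relative_pct": mean(rel_err),
        "median_relative_pct": median(rel_err),
        "median_absolute": median(abs_err),
    }


# Made-up reference and predicted retention indices, for illustration only.
reference = [1000.0, 1200.0, 800.0, 1500.0]
predicted = [1010.0, 1176.0, 820.0, 1500.0]

summary = deviation_summary(reference, predicted)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Reporting the median alongside the mean is useful here because a few badly predicted compounds can inflate the mean relative deviation, which is consistent with the abstract's mean (~3.0%) being noticeably larger than its median (~1.7%).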
