Ways to build text collections for training classifiers
We address the problem of building a Russian-language text collection (dataset) of bibliographic descriptions of scientific articles for training classifiers. We consider various approaches to creating such collections and assess the expediency of using expert judgments to assign class labels. We analyze known datasets, formulate the requirements for the generated text collection, and justify the choice of subject area (Computer Science). We propose a technique for forming the collection when Russian-language articles are in short supply: publications (bibliographic descriptions) from available English-language electronic libraries (ACM Digital Library, IEEE Xplore Digital Library, CiteSeerX) are translated automatically, with additional expert quality control of the translation. The resulting bibliographic collection was studied using clustering (Latent Semantic Analysis) and visualization (Principal Component Analysis) methods. Training and test samples were compiled, and "standard" classifiers (k-Nearest Neighbors, Logistic Regression, Random Forest) were applied. We then calculated standard quality measures (accuracy, precision, recall). Both rigid and soft classification were carried out. All calculated measures for the studied classifiers fell within [0.79; 0.87] for rigid classification and within [0.91; 0.95] for soft classification. The experiments showed almost identical results for Russian and English bibliographic descriptions (the difference did not exceed 2%). The proposed method of forming text collections reduces the effort required for labeling compared with the expert approach, solves the problem of the shortage of Russian-language documents, and allows the formation of sufficiently large, balanced bibliographic datasets for training and testing classifiers.
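The quality measures named above (accuracy, precision, recall) can be illustrated with a minimal sketch. It assumes single-label (rigid) predictions and macro averaging over classes; the abstract does not state which averaging was used, and the function name `macro_metrics` and the class labels below are hypothetical, chosen only for illustration.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for single-label predictions.

    Macro averaging is an assumption of this sketch, not something the
    abstract specifies.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # documents that really belong to each class
    pred_count = Counter(y_pred)   # documents assigned to each class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class

    accuracy = sum(hits.values()) / len(y_true)
    # A class with no predicted (or no true) documents contributes 0 to the mean.
    precisions = [hits[c] / pred_count[c] if pred_count[c] else 0.0 for c in labels]
    recalls = [hits[c] / true_count[c] if true_count[c] else 0.0 for c in labels]
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


# Hypothetical class labels for six bibliographic descriptions.
y_true = ["AI", "AI", "DB", "DB", "NET", "NET"]
y_pred = ["AI", "DB", "DB", "DB", "NET", "AI"]

accuracy, precision, recall = macro_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Running the same function on predictions for the Russian and the English versions of a test sample would give the kind of side-by-side comparison the abstract reports.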