Limitations of Applying the Data Compression Method to the Classification of Abstracts of Publications Indexed in Scopus
The paper describes the limitations of applying a compression-based method for classifying scientific texts to all categories of the ASJC classification used in the Scopus bibliographic database. It is shown that automatically generating training samples for each category is time-consuming and in some cases impossible, both because of the limits Scopus imposes on data export and because category names are not available in the Scopus Search API. Another reason is that many subject areas contain no journals at all, and therefore no publications, that are assigned to only one category. Applying the method to all 26 subject areas is impossible because of their breadth and because of the initial classification used in Scopus. Different subject areas often contain terminologically close categories, which makes it difficult to assign a publication to its true area. These findings also indicate that the classification currently used in Scopus and SciVal may not be completely reliable. For example, according to SciVal, the category “Theoretical computer science” ranks second by number of publications within the subject area “Mathematics”. The study showed, however, that it is one of the smallest categories, both in the number of journals and in the number of publications assigned to this category alone. Thus, many studies that rely on the ASJC classification of publications may contain inaccuracies.
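
To make the idea of compression-based classification concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation used in the study: the abstract does not name the compressor or the scoring rule, so the sketch assumes zlib and the common rule of choosing the category whose training sample grows least in compressed size when the abstract is appended. The function names `compressed_size` and `classify` and the toy training texts are hypothetical.

```python
import zlib
from typing import Dict


def compressed_size(text: str) -> int:
    """Return the zlib-compressed size of the text in bytes."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(abstract: str, training: Dict[str, str]) -> str:
    """Pick the category whose sample 'explains' the abstract best.

    Assumed rule (not taken from the paper): the best category is the
    one for which appending the abstract to the training sample adds
    the fewest compressed bytes.
    """
    def extra_bytes(category: str) -> int:
        sample = training[category]
        return compressed_size(sample + " " + abstract) - compressed_size(sample)

    return min(training, key=extra_bytes)


# Toy training samples, invented for illustration only. Real samples
# would be built from abstracts of publications assigned to exactly
# one ASJC category.
training = {
    "Theoretical Computer Science": (
        "automata theory formal languages computational complexity "
        "turing machines decidability of problems polynomial time algorithms"
    ),
    "Algebra and Number Theory": (
        "prime numbers finite fields galois groups commutative rings "
        "modular forms elliptic curves algebraic number fields"
    ),
}

abstract = (
    "computational complexity of turing machines "
    "and polynomial time algorithms"
)

if __name__ == "__main__":
    print(classify(abstract, training))
```

The sketch also shows why single-category training samples matter: if the samples for two categories share most of their vocabulary, the differences in added compressed bytes shrink and the assignment becomes unstable.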