Natural language processing systems for data extraction and mapping on the basis of unstructured text blocks
Modern natural language processing technologies allow you to work with texts without being a specialist in linguistics. The use of popular data processing platforms for the development and use of linguistic models provides an opportunity to implement them in popular geographic information systems. This feature allows you to significantly expand the functionality and improve the accuracy of standard geocoding functions. The article provides a comparison of the most popular methods and software implemented on their basis, using the example of solving the problem of extracting geographical names from plain text. This option is an extended version of the geocoding operation, since the result also includes the coordinates of the point features of interest, but there is no need to separately extract the addresses or geographical names of the objects in advance from the text. In computer linguistics, this problem is solved by the methods of extracting named entities (Eng. named entity recognition). Among the most modern approaches to the final implementation, the authors of the article have chosen algorithms based on rules, models of maximum entropy and convolutional neural networks. The selected algorithms and methods were evaluated not only from the point of view of the accuracy of searching for geographical objects in the text, but also from the point of view of simplicity of refinement of the basic rules or mathematical models using their own text bodies. Reports on technological violations, accidents and incidents at the facilities of the heat and power complex of the Ministry of Energy of the Russian Federation were selected as the initial data for testing the abovementioned methods and software solutions. Also, a study is presented on a method for improving the quality of recognition of named entities based on additional training of a neural network model using a specialized text corpus.