Automatic Normalisation of Historical Text
Spelling variation in historical text negatively impacts the performance of natural language processing techniques, so normalisation is an important pre-processing step. Current methods fall some way short of perfect accuracy, often require large amounts of training data to be effective, and are rarely evaluated against a wide range of historical sources. This thesis evaluates three models: a Hidden Markov Model, which has not previously been used for historical text normalisation; a soft attention Neural Network model, which has previously only been evaluated on a single German dataset; and a hard attention Neural Network model, which is adapted from work on morphological inflection and applied here to historical text normalisation for the first time. Each is evaluated against multiple datasets taken from prior work on historical text normalisation, facilitating direct comparison with that existing work. The hard attention Neural Network model achieves state-of-the-art normalisation accuracy on all datasets, even when the volume of training data is restricted. This work will be of particular interest to researchers working with noisy historical data which they would like to explore using modern computational techniques.