Volume 19, no. 2
Pages 65-74
Developing an OCR Model for Recognizing Text in the Mansi Language
A.V. Melnikov, I.S. Veretennikov, V.Yu. Polishchuk, M.A. Rusanov, S.N. Shergin
This article examines the development of an optical character recognition (OCR) system for Mansi, a low-resource Finno-Ugric language whose script uses a distinctive set of diacritics. The primary objective of the study is to adapt existing OCR technologies to the specifics of the Mansi script, given the limited volume of digitized texts and the presence of language-specific graphic symbols. To address this challenge, a comprehensive approach was developed: generating an extensive synthetic dataset that accounts for font variability and Unicode normalization; fine-tuning the Tesseract 5 model via transfer learning from a pre-trained Russian-language model; and evaluating recognition quality with the character error rate (CER) and word error rate (WER) metrics. The resulting specialized model achieved a CER of 0.85%, roughly twenty times lower than that of the baseline model (18.5%). The developed model is deployed as a public web service and is openly accessible, enabling the automated digitization of printed sources in the Mansi language and supporting the preservation of the cultural heritage of the indigenous peoples of the North.
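For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how CER and WER are commonly computed from edit distance, with both strings brought to Unicode NFC form first. It is not the authors' evaluation code: the function names, the choice of NFC, and the toy strings are assumptions made for illustration only.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _nfc(text):
    # Assumption: NFC is used so that a precomposed letter and the same
    # letter written as base + combining macron compare as equal.
    return unicodedata.normalize("NFC", text)


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    ref, hyp = _nfc(reference), _nfc(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref, hyp = _nfc(reference).split(), _nfc(hypothesis).split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy check: U+04EF (Cyrillic u with macron) vs. U+0443 + U+0304.
same_letter_cer = cer("\u04ef", "\u0443\u0304")

# Ratio of the two CER values reported in the abstract (in percent).
ratio = 18.5 / 0.85
```

Without the normalization step, the two encodings of the same letter with a macron would be counted as a recognition error, which is one reason normalization matters for scripts that rely on diacritics. The ratio of the two reported CER values comes out at a little under 22.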
- Keywords
- language models; neural networks; optical character recognition; Mansi language; dataset.
- References
- 1. Agarwal M., Anastasopoulos A. A Concise Survey of OCR for Low-Resource Languages. Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas, 2024, pp. 88-102. DOI: 10.18653/v1/2024.americasnlp-1.10
2. Kashid H., Bhattacharyya P. RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages. Proceedings of the 21st International Conference on Natural Language Processing. NLP Association of India, 2024, pp. 274-284. DOI: 10.48550/arXiv.2412.15248
3. Ignat O., Maillard J., Chaudhary V., Guzman F. OCR Improves Machine Translation for Low-Resource Languages. Findings of the Association for Computational Linguistics, 2022, pp. 1164-1174. DOI: 10.18653/v1/2022.findings-acl.92
4. Low-Resource Language OCR: New Possibilities with AI. Available at: https://sunway.edu.np/low-resource-language-ocr-automation/ (accessed 25.01.2026)
5. Drobac S., Kauppinen P., Linden K. OCR and Post-Correction of Historical Finnish Texts. Proceedings of the 21st Nordic Conference on Computational Linguistics, 2017, pp. 70-76.
6. Tesseract User Manual: Tesseract Documentation. Available at: https://tesseract-ocr.github.io/tessdoc/Home.html (accessed 25.01.2026)
7. Unicode Normalization Forms. Available at: https://unicode.org/reports/tr15/ (accessed 26.01.2026).
8. Keren G., Schuller B. Convolutional RNN: an Enhanced Model for Extracting Features from Sequential Data. arXiv. Computation and Language. Available at: https://arxiv.org/abs/1602.05875 (accessed 25.01.2026)
9. Sequence Modeling with CTC. Available at: https://distill.pub/2017/ctc/ (accessed 24.01.2026).
10. Tesseract-ocr. Available at: https://github.com/tesseract-ocr (accessed 26.01.2026).
11. Pan S.J., Yang Q. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 2010, vol. 22, no. 10, pp. 1345-1359.
12. CS231n Deep Learning for Computer Vision. Introduction to RNN. Available at: https://cs231n.github.io/rnn/ (accessed 17.03.2026)
13. Deep Learning: An MIT Press Book. Available at: https://www.deeplearningbook.org/ (accessed 17.03.2026)
14. Raskutti G., Wainwright M.J., Yu B. Early Stopping for Non-Parametric Regression: An Optimal Data-Dependent Stopping Rule. 2011 49th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, 2011, pp. 1318-1325. DOI: 10.1109/Allerton.2011.6120320
15. Evaluating AI Models: Understanding the Character Error Rate (CER) Metric. Available at: https://galileo.ai/blog/character-error-rate-cer-metric (accessed 24.01.2026)
16. Yakubovskyi R., Morozov Yu. Speech Models Training Technologies Comparison Using Word Error Rate. Advances in Cyber-Physical Systems, 2023, vol. 8, no. 1, pp. 74-80. DOI: 10.23939/acps2023.01.074
17. URIIT/mns-tesseract. Available at: https://huggingface.co/URIIT/mns-tesseract (accessed 26.01.2026)