Volume 18, no. 3
Pages 87–95
Using Fuzzy String Comparison for Automated Transfer of Formatting in Poetic Works
N.N. Teslya, G.N. Belyak
The creation of the scientific and educational resource "Pushkin Digital" requires typesetting poetic texts on the basis of layout information from other editions. Texts may vary from one edition to another, and in each case typesetting is performed anew according to the rules of the specific edition. Manual typesetting demands attentiveness and significant time and effort from a specialist, because it requires comparing versions of the same text across multiple editions. The proposed method addresses two tasks.
First, it determines the extent to which the texts differ between editions, enabling an assessment of the number of errors or deliberate transformations of the text, which is a separate subject of study for textual scholars. Second, based on an evaluation of line differences and their fuzzy alignment, the method generates typesetting rules for each line, taking into account the rules applied in earlier editions.
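The two tasks can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain Levenshtein distance normalized by string length, a greedy best-match search in place of a full line alignment, and a single hypothetical typesetting rule (`indent`). The names `FormattedLine`, `edition_difference`, `transfer_formatting` and the threshold of 0.8 are illustrative assumptions, and the sample lines are made up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def edition_difference(old_text: str, new_text: str) -> float:
    """Task 1 (assumed measure): share of the text that differs between editions."""
    return 1.0 - similarity(old_text, new_text)


@dataclass
class FormattedLine:
    text: str
    indent: int  # hypothetical typesetting rule: indentation level of the line


def transfer_formatting(
    old: List[FormattedLine],
    new_lines: List[str],
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> List[Tuple[str, Optional[int], float]]:
    """Task 2: copy the rule of the most similar old line, if it is similar enough.

    Returns (line, indent or None, best similarity); None marks a line
    that would be left for manual typesetting.
    """
    result = []
    for line in new_lines:
        scored = [(similarity(o.text, line), o) for o in old]
        if scored:
            score, best = max(scored, key=lambda pair: pair[0])
        else:
            score, best = 0.0, None
        indent = best.indent if best is not None and score >= threshold else None
        result.append((line, indent, score))
    return result


# Made-up sample lines standing in for two editions of one poem.
old_edition = [
    FormattedLine("Will you forgive my jealous dreams,", 0),
    FormattedLine("The mad unrest of my love?", 2),
]
new_edition = [
    "Will you forgive my jealous dreams",  # punctuation differs
    "A line absent from the old edition",
]

if __name__ == "__main__":
    for text, indent, score in transfer_formatting(old_edition, new_edition):
        print(f"{score:.2f}  indent={indent}  {text}")
```

A greedy search keeps the sketch short, but it ignores line order, so a repeated line (a refrain, for example) could pick up the formatting of the wrong occurrence; an order-preserving alignment would avoid that.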
The method was tested on 914 lyrical works by A.S. Pushkin and transferred the typesetting correctly and completely for 74.55% of the texts. For the remaining 25.45%, automated transfer was not feasible and manual typesetting was required.
Keywords: fuzzy string comparison; Levenshtein distance; formatting; text processing.