INTELLIGENCE SYSTEMS AND TECHNOLOGIES
K. S. Nikolaev. Semantic Formula Search Service for a Collection of Mathematical PDF Documents
Abstract. 

This paper presents a service, built on Semantic Web technologies, for searching mathematical formulas in a collection of scientific PDF documents. A query names a mathematical concept, and the service returns the formulas that include that concept. As a result, the search results do not depend on the author's notation for formula variables and include every formula in the collection that contains the queried concept. A distinctive feature of the service is that the collection can be extended with scientific articles in PDF format that carry no explicit markup of mathematical formulas. The concepts are drawn from OntoMathPro, an ontology of professional mathematics that covers a wide range of mathematical fields.
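The idea of notation-independent, concept-based search can be illustrated with a minimal Python sketch. It is not the service's implementation: the `AnnotatedFormula` structure, the `build_concept_index` and `search` functions, the file names, and the concept labels are all hypothetical, and the labels are not taken from OntoMathPro. The sketch assumes that each formula has already been annotated with the concepts its symbols denote, and shows only why two formulas written with different variable names are returned by the same query.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedFormula:
    """A formula plus the concepts bound to its symbols (hypothetical structure)."""
    document: str
    latex: str
    symbol_concepts: tuple  # pairs of (symbol, concept label)


def build_concept_index(formulas):
    """Map each lower-cased concept label to the formulas that mention it."""
    index = defaultdict(list)
    for formula in formulas:
        # Use a set so a concept bound to two symbols indexes the formula once.
        for concept in {c.lower() for _, c in formula.symbol_concepts}:
            index[concept].append(formula)
    return index


def search(index, concept):
    """Return every formula annotated with the queried concept."""
    return list(index.get(concept.lower(), []))


# Illustrative data: the concept labels are made up, not OntoMathPro classes.
FORMULAS = [
    AnnotatedFormula("paper_a.pdf", r"\lambda x = A x",
                     ((r"\lambda", "eigenvalue"), ("A", "matrix"))),
    AnnotatedFormula("paper_b.pdf", r"M v = \mu v",
                     ((r"\mu", "eigenvalue"), ("M", "matrix"), ("v", "eigenvector"))),
    AnnotatedFormula("paper_c.pdf", r"f(x) = x^2", (("f", "function"),)),
]

INDEX = build_concept_index(FORMULAS)

# The query matches both formulas even though one writes the eigenvalue
# as \lambda and the other as \mu.
for hit in search(INDEX, "Eigenvalue"):
    print(hit.document, hit.latex)
```

Because the index is keyed by concept rather than by symbol, the author's choice of variable names never enters the lookup; only the annotation step has to deal with notation.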

Keywords: semantic formula search, PDF documents, document processing, scientific journals, scientific libraries, ontologies, web service.

DOI 10.14357/20718632250304

EDN MDPQKP

PP. 34-43.
