Linguistic knowledge in collaboration with NLP for languages of Karelia

Introduction

This summer I participated in a very interesting research project in the Google Summer of Code 2021 program. The work was coordinated by Tommi Pirinen and Jack Rueter, with whom I have worked earlier on different projects, especially in connection with the Karelian language. Karelian has a large number of dictionaries, as well as electronic resources, but several translation directions are often missing or less complete. In this project we wanted to improve the situation. The resources created during the summer are openly available, and we hope they will contribute to future dictionary work on the Karelian language, a field that certainly needs continued development to flourish.

Karelian dictionaries

Karelian has two main varieties, Karelian Proper and Livvi Karelian. Finnish is also closely related, and within Karelian there is a complex dialect continuum. The task was to approve translation pairs for three language pairs: Finnish-Livvi, Finnish-Karelian and Karelian-Livvi. Defining the varieties used in this work was not trivial, and it is a topic that could be discussed at great length.

One problem is the definition of Karelian Proper (krl). It is problematic because the language is represented by two different writing systems: a northern variant, published in the newspaper Vienan karjala, and a southern variant native to Tver. The language form used in this project is that of northern Karelian Proper as described in the Russian-Northern Karelian Proper dictionary of 2015 (Русско-карельский словарь (севернокарельские диалекты) = Venäjä-viena šanakirja / сост.: Зайков П. M. [и др.]. – Петрозаводск: Периодика, 2015. – 360 с. ISBN 978-5-88170-255-7.).

The Finnish-Livvi (fin-olo) translation pairs were extended from the materials of an open-source project funded through the Kone Language Programme begun in 2013. Since I had been working with these materials at that time, they were familiar to me. This revisit was very useful and productive, and also different from the work on the other language pairs.

https://apertium.github.io/apertium-fin-olo/apertium-fin-olo.fin-olo.dix.html

For example, work on Finnish-(Northern) Karelian Proper (fin-krl) required personal knowledge of the language and the use of a Russian-Karelian (Northern Karelian Proper) dictionary. With these resources, however, it was possible to extend the dictionary very well.

https://apertium.github.io/apertium-fin-krl/apertium-fin-krl.fin-krl.dix.html

The Karelian-Livvi dictionary work, however, was a totally different matter. No previous dictionaries, written or online, were available. It is also an important example of work in which a dictionary is built between two endangered language varieties, or between varieties of the same language.

https://github.com/apertium/apertium-krl-olo

For this last dictionary, Khalid Alnajjar, who had worked on the advancement of the Veʹrdd online dictionary editing platform in GSoC 2020 (https://www.khalidalnajjar.com/verdd-dictionary-editing-tool/), used an algorithm he had developed in previous projects for predicting translation pairs. It was very nice to see such continuing collaboration between projects from different summers. The Karelian-Livvi part of the dictionary work was indeed very different from the other language pairs I worked on during this period.

The task of approving translations was done according to two basic workflows. One was something I could perform directly online in Veʹrdd, and the other involved collaboration with others working with the Veʹrdd multilingual dictionary databases.

The first alternative was to use the online Veʹrdd editing platform, where source-language words can be searched for in a number of ways, for example by part of speech. The platform also supports regular expressions and filtering for approved word pairs.

The second alternative combined the existing content of Veʹrdd with algorithmic predictions of feasible translation pairs. In this latter workflow my duty as editor was to approve all correct translation pairs, even if they represented secondary meanings. The results of this second workflow were then automatically uploaded to the system and approved for further download and incorporation into the Apertium GitHub repositories.
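The prediction algorithm itself is not described in this post, so the following is only a minimal sketch of one common way such candidates could be produced, not the method actually used in the project. It assumes a pivot approach: if a Karelian word and a Livvi word share a Finnish translation, the pair is proposed as a candidate for the editor to approve or reject. The function name and the placeholder words (fin_1, krl_a, olo_x and so on) are hypothetical.

```python
from collections import defaultdict


def predict_pairs(fin_krl, fin_olo):
    """Propose (krl, olo) candidates that share a Finnish pivot word.

    Hypothetical illustration only; the algorithm used in the project
    is not documented here.
    """
    # Index the Livvi translations by their Finnish source word.
    olo_by_fin = defaultdict(set)
    for fin, olo in fin_olo:
        olo_by_fin[fin].add(olo)

    # Every Karelian word is paired with each Livvi word
    # that translates the same Finnish word.
    candidates = set()
    for fin, krl in fin_krl:
        for olo in olo_by_fin.get(fin, ()):
            candidates.add((krl, olo))
    return sorted(candidates)


# Placeholder data, not real dictionary entries.
fin_krl = [("fin_1", "krl_a"), ("fin_2", "krl_b")]
fin_olo = [("fin_1", "olo_x"), ("fin_1", "olo_y"), ("fin_3", "olo_z")]

for krl, olo in predict_pairs(fin_krl, fin_olo):
    print(krl, olo)
```

A pivot approach like this overgenerates when the Finnish word is polysemous, which is one reason why every predicted pair still needs an editor's approval.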

After I had completed my editing work, my mentor Jack Rueter went over the materials before checking them into the Apertium GitHub repositories. The commit ranges below show the input from our work, in the order in which it was done.

Fin-olo:

From commit #: 56809b28f8c4ff347515c5b8e13cb2211e311dde

To commit #: f3807b8c47578f776ea7152b59ce070e3b2c7438

Fin-krl:

From commit #: 74e79ef3cc3a971d3b092f4a7ff093c334cfdd59

To commit #: 68524436e454c106430df6878e6f1ccd0d049880

Krl-olo:

Commit #: f5c3092442b7ac4cdcbf1af9ce20939e06a12835

The resulting content for this summer’s work can be enumerated as follows:

The Finnish-Livvi dictionary contains 30,212 translation pairs

The Finnish-Karelian dictionary contains 2,297 translation pairs

The Karelian-Livvi dictionary contains 6,419 translation pairs
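For readers who want to check such figures against the repositories, here is a minimal sketch of how translation pairs in an Apertium bilingual dictionary could be counted. It assumes the usual layout in which each entry `<e>` holds a pair `<p>` with a left side `<l>` and a right side `<r>`. The sample entries are illustrative and are not copied from the actual files, the function names are hypothetical, and the totals above were not necessarily computed this way.

```python
import xml.etree.ElementTree as ET

# Illustrative sample in bilingual-dictionary layout (not real file content).
SAMPLE_BIDIX = """<dictionary>
  <sdefs><sdef n="n"/></sdefs>
  <section id="main" type="standard">
    <e><p><l>vesi<s n="n"/></l><r>vezi<s n="n"/></r></p></e>
    <e><p><l>talo<s n="n"/></l><r>kodi<s n="n"/></r></p></e>
  </section>
</dictionary>"""


def lemma_text(side):
    """Collect the lemma from an <l> or <r> element, skipping <s> tags."""
    parts = [side.text or ""]
    for child in side:
        if child.tag == "b":  # <b/> marks a blank inside a multiword lemma
            parts.append(" ")
        parts.append(child.tail or "")
    return "".join(parts).strip()


def translation_pairs(xml_text):
    """Return (left, right) lemma pairs for every <p> in the dictionary."""
    root = ET.fromstring(xml_text)
    pairs = []
    for p in root.iter("p"):
        left, right = p.find("l"), p.find("r")
        if left is not None and right is not None:
            pairs.append((lemma_text(left), lemma_text(right)))
    return pairs


pairs = translation_pairs(SAMPLE_BIDIX)
print(len(pairs), pairs)
```

Entries written in other forms, such as identity entries, would need separate handling, so a count produced this way may differ from the figures given here.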

Remaining questions

One issue is whether proper names should be included in the dictionary. From a machine translation perspective place names are definitely a must, but maybe there should also be room for personal names. When we delve into local place names, finding Finnish equivalents becomes problematic: for some places historical names are available, while for others there are none. So where do we go to find the Finnish place name: to a difficult Russian transliteration, or to an easier fit from Karelian or Livvi?

The Finnish orthographic solution may also present problems. Should the non-native hushing sibilants be written š and ž or sh and zh? Should the vowels be written long, as they are in Karelian and Livvi, or short, as they are in Russian, e.g. Tuuksa in Finnish, or Tuksa, which is a simple transliteration of the Russian Тукса?
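The sibilant question, at least, is mechanical enough to illustrate in code. The following is a minimal sketch, assuming that a combined resource would want to store or match both spellings; the function names are hypothetical, and the example word is taken from the dictionary title cited earlier only to show the mapping. The vowel-length question cannot be handled this way, since it requires knowing the Karelian or Livvi form.

```python
# Map caron sibilants to their digraph spellings.
CARON_TO_DIGRAPH = str.maketrans({"š": "sh", "ž": "zh", "Š": "Sh", "Ž": "Zh"})


def to_digraphs(word):
    """Rewrite š/ž as sh/zh, leaving all other characters unchanged."""
    return word.translate(CARON_TO_DIGRAPH)


def spelling_variants(word):
    """Return the distinct spellings of a word under both conventions."""
    return sorted({word, to_digraphs(word)})


print(spelling_variants("šanakirja"))
print(spelling_variants("Tuuksa"))
```

A word without sibilants, such as Tuuksa, yields a single variant, so the helper can be applied to a whole word list without creating duplicates.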

Some of these questions should not be solved at the level of individual dictionaries, but when we create new resources from different dictionaries, it would be beneficial to harmonize the result one way or another. Lastly, dictionary work is never complete: there are always some missing lexemes and translations, which will hopefully be covered in future work.