Back to the schedule
Previous: The True Frownies are the Friends We Made Along the Way: An Anecdote of Emacs's Malleability
Next: GNU's Not UNIX: Why Emacs Demonstrates The UNIX Philosophy Isn't Always The Only Answer

Emacs manuals translation and OmegaT

Jean-Christophe Helary

Q&A: live Q&A / IRC / pad
Status: Finished
Duration: 9:07

This talk will also be streamed at an alternate time for APAC hours: https://libreau.org/upcoming.html#emacsconf21

If you have questions and the speaker has not indicated public contact information on this page, please feel free to e-mail us at emacsconf-submit@gnu.org and we'll forward your question to the speaker.

Description

Although it is generally agreed that software localization is a good thing, Emacs is lacking in that respect for a number of technical reasons. Nonetheless, users of free software could greatly benefit from translations of the Emacs manuals, even if the interface were to remain in English.

OmegaT is a multiplatform GPL3+ "computer aided translation" (CAT) tool running on OpenJDK 8. CAT tools are to translators roughly what IDEs are to programmers. Casual translators can benefit from their features, but professionals and committed amateurs are the most likely to take full advantage of such tools.

When OmegaT, free-software-based forges, and Emacs meet, we have a free multi-user translation environment that can easily sustain the load of the (close to) 2 million words that make up the manuals distributed with Emacs, along with powerful features such as arbitrary string protection for easy typing and QA (quality assurance), automatic legacy translation handling, glossary management, and history-based or predictive autocompletion.

The current trial project for French is hosted on 2 different forges:

  1. sr.ht hosts the source files https://sr.ht/~brandelune/documentation_emacs/
  2. chapril hosts the OmegaT team project architecture https://forge.chapril.org/brandelune/documentation_emacs

The sources are regularly updated with a po4a-based shell script.

Discussion

IRC nick: brandelune

  • translation is nice but typing anything non latin or cyrillic is hard with keyboard
    • Try out the Emacs IMF. One of the main reasons I use Emacs. Input Method Framework: https://www.gnu.org/software/emacs/manual/html_node/emacs/Input-Methods.html
  • Hi, thanks for the talk. I love OmegaT and use it always. But I would have liked to hear about the experience of working both with Emacs and OmegaT. Can you tell us something about it?
  • brandelune: wondering if anyone is interested in working on translating the emacs manuals to a language different from French. I know there are ongoing attempts in a number of languages (Japanese for one). LibreOffice JA has worked with "machine translation post editing" (MTPE in the "industry") and they seem to have produced good results.
    • i'd definitely be interested, tho not sure i'll have the time anytime soon. but if there's a mailing list i'd be interested in subscribing or joining an irc channel.

Feedback:

  • OmegaT looks very powerful: it goes to show how much work goes into translations; work that we sometimes take for granted
  • I once had to translate a document the old-fashioned way: it was painful... Will check OmegaT afterwards. Thanks!

Outline

  • Duration: 10 minutes
  • Software introduced during the presentation
    • po4a: a tool to convert documentation formats to and from the commonly used gettext PO format. po4a supports the texinfo format along with many others.
    • OmegaT: a "computer aided translation" tool used by professional (and amateur) translators to efficiently combine translation resources (legacy translations, glossaries, etc.) so as to produce more consistent translations.

During this short presentation, I will address:

  • The specificities of the Emacs manuals and the difficulties they present to the translator
  • How to convert the texi and org files to a format that translators can handle
  • How to adapt OmegaT to the Emacs manual specificities
  • How to use OmegaT features such as arbitrary string protection, legacy translation handling, glossaries, autocompletion, QA, etc.
  • How to use OmegaT with a team of 2 (or more) translators working at the same time

I will not discuss:

  • How to create an OmegaT project
  • How to set up an OmegaT team project
  • How to use OmegaT from the command line to work in localization pipelines
  • How to use machine translation and MT "post-edit"
  • How to convert back the translated files to texi format
  • How to install translated texi files for use in Emacs

People who are interested in knowing more about OmegaT are invited to check the online user manual.

Transcript

Hello, everybody. My name is Jean-Christophe Helary, and today I'm going to talk about Emacs manuals translation and OmegaT. Thank you for joining the session.

[00:00:10.960] Translation in the free software world is really a big thing. You already know that most of the Linux distributions, most of the software packages, most of the websites are translated by dozens of communities using different processes and file formats. Translation and localization are things we know very well. It's a tad different for the Emacs community. We don't have a localization process because it's quite complex and because we don't have the resources yet. Still, we could translate the manuals, and translating the manuals would probably bring a lot of good to the Emacs community at large.

[00:00:45.600] So what's the state of the manuals? As of today, we have 182 files coming in .texi and .org format. We've got more than 2 million words. We've got more than 50 million characters. So that's quite a lot of work, and obviously, it's not a one person job.

[00:01:04.559] When we open .texi files, what do we have? Well, we actually have a lot of things that the translators shouldn't have to translate. Here we can see that only the very last segment, the very last sentence should be translated. All those meta things should not be under the translator's eyes. How do we deal with this situation? For code files, we have the gettext utility that converts all the translatable strings into a translatable format, which is the .po format. And that .po format is ubiquitous, even in the non-free software translation industry. For documentation, we have something different. It's called po4a, which is short for .po for all. When we use po4a on those 182 .texi and .org files, what do we get? We get something that's much better. Now we have three segments. It's not perfect because, as you can see, the first two segments should not be translated, so there's still room for improvement. Now, when we put that file set into OmegaT, we considerably reduce the word total. We now have 50% fewer words and 23% fewer characters to type, but that's still a lot of work.
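To make the extraction step concrete, here is a minimal Python sketch of the general idea: separate prose from markup, then write the prose out as PO entries. It is not po4a's actual algorithm. It assumes, for illustration only, that any line starting with an @-command followed by whitespace (or nothing) is markup and that everything else is translatable prose. The names `extract_paragraphs`, `to_po`, and `SAMPLE` are hypothetical.

```python
import re

# Toy assumption: "@command" followed by whitespace or end of line is a
# line-level Texinfo command (e.g. "@node Top", "@c comment", "@end example").
# A brace command such as "@code{foo}" is treated as part of the prose.
COMMAND_LINE = re.compile(r"^@[A-Za-z]+(\s|$)")


def extract_paragraphs(texi_source):
    """Return the prose paragraphs found in a Texinfo-like string."""
    paragraphs, current = [], []
    for line in texi_source.splitlines():
        stripped = line.strip()
        if not stripped or COMMAND_LINE.match(stripped):
            # A blank line or a command line closes the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def po_escape(text):
    """Escape backslashes and double quotes for a PO string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_po(paragraphs):
    """Render paragraphs as PO entries with empty translations."""
    entries = [f'msgid "{po_escape(p)}"\nmsgstr ""\n' for p in paragraphs]
    return "\n".join(entries)


SAMPLE = """@setfilename ../../info/emacs.info
@settitle GNU Emacs Manual
@c This is a comment.

Emacs is the advanced, extensible, customizable,
self-documenting editor.
"""

if __name__ == "__main__":
    print(to_po(extract_paragraphs(SAMPLE)))
```

Real Texinfo has many cases this toy filter ignores (macros, menus, conditionals), which is one reason to rely on a dedicated tool rather than an ad hoc script.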

[00:02:15.680] So let's talk about OmegaT now and see where it can help. OmegaT is a GPL3+ Java 8+ computer aided translation tool. We call them CATs. CATs are to translators what IDEs are to programmers. They leverage the power of computers to automate our work, which is reference searches, fuzzy matching, automatic insertions, and things like that.

[00:02:44.080] OmegaT is not really recent. It will turn 20 next year, and at this point, we have about 1.5 million downloads from the SourceForge site, which doesn't mean much because that includes files used for localization and manuals, but still it's a pretty big number. OmegaT is included in a lot of Linux distributions, but as you can see here, it's mostly downloaded on Windows systems because translators mostly work on Windows. OmegaT comes with a cool logo and a cool site too, and I really invite you to visit it. It's omegat.org, and you'll see all the information you need, plus downloads for Linux versions, with or without Java included. So what does OmegaT bring to the game? Professional translators have to deliver fast, consistent, and quality translations, and we need to have proper tools to achieve that. I wish po-mode was part of the toolbox, but that's not the case, and it's a pity. So we have to use those CAT tools.

[00:03:39.760] Let me show you what OmegaT looks like when I open this project that I created for this demonstration. The display is quite a mouthful, but you can actually modify all windows as needed. I just want to show you everything at once to give you a quick idea of the thing. You have various colors, windows, and all those spaces have different functions that help the translator, and that you're probably not familiar with.

[00:04:02.879] I'm going to introduce you to the interface now. So first, we have the editor. The editor comes in two parts: the current segment, which is associated to a number, and all the other segments above or below. At the top of the window, you can see the three first segments that were in the .po file.

[00:04:20.799] The last one here, the fourth one, comes with an automatic fuzzy match insertion. Such legacy translations are what we call "translation memories". OmegaT has inserted this one automatically because I told it to do so, and for my security, it comes with the predefined fuzzy prefix that I will have to remove to validate the translation.
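The idea of a fuzzy match insertion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not OmegaT's matching algorithm: similarity is computed with the standard library's `difflib.SequenceMatcher`, and the 0.7 threshold, the `[fuzzy] ` marker text, the function names, and the sample French translation are all hypothetical.

```python
from difflib import SequenceMatcher

FUZZY_PREFIX = "[fuzzy] "  # assumed marker text, to be removed by the translator
THRESHOLD = 0.7            # hypothetical minimum similarity for an insertion


def best_fuzzy_match(source, memory):
    """Return (score, legacy_source, legacy_target) for the closest entry, or None."""
    best = None
    for legacy_source, legacy_target in memory.items():
        score = SequenceMatcher(None, source, legacy_source).ratio()
        if best is None or score > best[0]:
            best = (score, legacy_source, legacy_target)
    return best


def prefill(source, memory):
    """Propose a translation for `source` from a translation memory.

    An exact match is inserted as-is; a close match is inserted with a
    prefix so that it cannot be validated by accident; anything below
    the threshold yields an empty proposal.
    """
    best = best_fuzzy_match(source, memory)
    if best is None or best[0] < THRESHOLD:
        return ""
    _score, legacy_source, legacy_target = best
    if legacy_source == source:
        return legacy_target
    return FUZZY_PREFIX + legacy_target


if __name__ == "__main__":
    memory = {"Emacs is a text editor.": "Emacs est un éditeur de texte."}
    print(prefill("Emacs is an extensible text editor.", memory))
```

The prefix plays the role described in the talk: the proposal saves typing, but the translator still has to review it and delete the marker before the segment counts as translated.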

[00:04:44.880] Our next feature is the glossary feature. In this project, we have a lot of glossary data. Some is relevant and some is not. In the segment that I'm translating at the moment, you can see underlined items.
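A glossary lookup of this kind can be approximated as follows. This is a minimal sketch, assuming a glossary stored as a plain dictionary and whole-word, case-insensitive matching; OmegaT's own matching (stemming, multi-word handling) is not reproduced here, and the function name and the French terms are illustrative only.

```python
import re


def glossary_hits(segment, glossary):
    """Return (term, translation) pairs for glossary terms found in `segment`.

    Terms are matched as whole words, ignoring case, so "buffer" is not
    reported merely because "minibuffer" appears in the segment.
    """
    hits = []
    for term, translation in glossary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, segment, flags=re.IGNORECASE):
            hits.append((term, translation))
    return hits


if __name__ == "__main__":
    # Hypothetical glossary entries, for illustration only.
    glossary = {"buffer": "tampon", "frame": "cadre", "minibuffer": "minitampon"}
    for term, translation in glossary_hits("The minibuffer is a special buffer.", glossary):
        print(f"{term} -> {translation}")
```

Only the entries relevant to the current segment are returned, which mirrors the point made in the talk that a large glossary contains both relevant and irrelevant data.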

[00:04:57.520] This pop-up menu on the right allows me to enter the terms as I type. It's kind of an auto insertion system that also supports history predictions, predefined strings, and things like that.

[00:05:14.479] In the part on the right, we have reference information that comes directly from the .po and the .texi files.

[00:05:21.440] We also have notes that I can share with fellow translators, and we have numbers that tell me that I still have 143 segments more to go before I complete this translation. As we see, there are plenty of strings that we really don't want to have to type. For example, those strings are typical .texi strings that the translator should really not have to type. So we're going to have to do something about that. We're going to have to create protected strings with regular expressions, so that the strings can be visualized right away in the source segment, entered semi-automatically in the target segment, and checked for integrity. The regular expression I came up with for defining most of the strings is this one, and I'm not a regular expression pro so I'm sure some of you will correct me. But this expression gives me a good enough definition even though it does not yet include Org mode syntax. So now we have all those .texi-specific things that we don't want to touch displayed in gray. Actually, you may have noticed that I cheated a bit, because here I added the years and the Free Software Foundation name to the previous regular expression to show you that you can protect any kind of string, really.
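The regular expression shown on the slide is not reproduced in this transcript. The Python sketch below uses a much simpler, hypothetical pattern to show the principle: Texinfo commands, their closing braces, four-digit years, and the Free Software Foundation name are all treated as protected strings. The pattern and the name `protected_strings` are assumptions for illustration, not the speaker's expression.

```python
import re

# Hypothetical pattern, far less complete than a real Texinfo definition.
PROTECTED = re.compile(
    r"@[A-Za-z]+\{?"              # an @command, with its opening brace if any
    r"|\}"                        # the closing brace of a command
    r"|\b(?:19|20)\d{2}\b"        # a four-digit year
    r"|Free Software Foundation"  # an arbitrary literal string
)


def protected_strings(segment):
    """List the protected strings of a segment, in order of appearance."""
    return PROTECTED.findall(segment)


if __name__ == "__main__":
    source = "Copyright @copyright{} 1998, 2021 Free Software Foundation, Inc."
    print(protected_strings(source))
```

Because the alternatives are plain regular expression branches, adding a year or an organization name to the pattern is enough to protect it, which is the point of the "cheat" mentioned above. This toy pattern does not handle escapes such as `@@` or Org mode syntax.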

[00:06:38.560] So what we have now is a way to visualize the strings that we do not want to touch, but we still have to enter all of them in the translation. For that, we have the pop-up menu that I used earlier with the glossary, and we also have items in the edit menu that come with shortcuts for easy insertion of missing tags.

[00:06:57.199] Last, but certainly not least, we can now validate our input. Here, OmegaT properly tells me that I made 7 protected strings, I entered only 1998, but there were five different years, the copyright string, and the FSF name string. With all this almost-native Texinfo support, we have far fewer things to type, and there is a much lower potential for errors. But we agree, it's still a lot of work.
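The validation step amounts to comparing the protected strings of the source with those of the target. Here is a minimal Python sketch of that check, loosely modeled on the copyright example; it does not reproduce OmegaT's validator or the numbers on the slide. The pattern, the function name, and the sample segments are hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern: empty-brace commands, four-digit years, and one literal.
PROTECTED = re.compile(
    r"@[A-Za-z]+\{\}"             # commands such as @copyright{}
    r"|\b(?:19|20)\d{2}\b"        # a four-digit year
    r"|Free Software Foundation"  # an arbitrary literal string
)


def missing_protected(source, target):
    """Count protected strings that appear in `source` but not in `target`.

    Counter subtraction keeps only positive counts, so a string repeated
    twice in the source and once in the target is reported once.
    """
    needed = Counter(PROTECTED.findall(source))
    present = Counter(PROTECTED.findall(target))
    return needed - present


if __name__ == "__main__":
    source = ("Copyright @copyright{} 1985, 1987, 1993, 1998, 2021 "
              "Free Software Foundation, Inc.")
    target = "Copyright 1998"
    for text, count in missing_protected(source, target).items():
        print(f"missing {count} x {text}")
```

An empty result means every protected string of the source was carried over to the translation.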

[00:07:25.199] What we'd like now is to work with fellow translators, and here we need to know that OmegaT is actually a hidden svn/git client, and team projects can be hosted on svn/git platforms. Translators don't need to know anything about VCS. They just need access credentials, and OmegaT commits for them. This way we do not have to use ugly and clumsy web-based translation interfaces, and we can use a powerful offline professional tool. So this is how it looks when you look at the platform where I hosted this project. The last updates are from 20 days and 30 seconds ago when I created this slide, and you can see that I had a partner who worked with me on the same file set. Although it looks like we actually committed the translation to the platform, it was not us, but OmegaT. OmegaT does all the heavy-duty work. It regularly saves to and syncs from the servers. Translators are regularly kept updated with work from fellow translators and when necessary, OmegaT offers a simple conflict-resolution dialogue. Translators never have to do anything with svn or git ever. And now we can envision a future not so far away where the manuals will be translated and eventually included in the distribution, but that's a topic for a different presentation.
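The synchronization logic can be pictured as a three-way merge over segments. The Python sketch below is an assumption about how such a merge could work in principle, not a description of OmegaT's implementation: each translator's work is modeled as a dictionary from source segment to translation, and a conflict is reported only when both translators changed the same segment differently. All names and sample strings are hypothetical.

```python
def merge_translations(base, mine, theirs):
    """Three-way merge of {source segment: translation} dictionaries.

    `base` is the last state both translators shared. Returns a pair
    (merged, conflicts), where `conflicts` maps a segment to the tuple
    (my translation, their translation) awaiting a human decision.
    """
    merged, conflicts = {}, {}
    for segment in set(base) | set(mine) | set(theirs):
        b, m, t = base.get(segment), mine.get(segment), theirs.get(segment)
        if m == t or t == b:
            chosen = m   # identical results, or only I changed the segment
        elif m == b:
            chosen = t   # only my colleague changed the segment
        else:
            conflicts[segment] = (m, t)  # both changed it, differently
            continue
        if chosen is not None:
            merged[segment] = chosen
    return merged, conflicts


if __name__ == "__main__":
    base = {"Save the file.": "Enregistrer le fichier."}
    mine = {"Save the file.": "Enregistrer le fichier.",
            "Open a buffer.": "Ouvrir un tampon."}
    theirs = {"Save the file.": "Enregistrez le fichier.",
              "Kill the buffer.": "Supprimer le tampon."}
    merged, conflicts = merge_translations(base, mine, theirs)
    print(merged)
    print(conflicts)
```

In this example nothing conflicts: each translator's new segments are kept, and the colleague's revision of the shared segment wins because only one side changed it. A dialogue is needed only for the entries returned in `conflicts`.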

[00:08:39.760] So we've reached the end of this session. Thank you very much again for joining it. There are plenty of topics I promised I would not address, and I think I kept my promise. There will be a Q&A now, and I also started a thread about this talk on Reddit last Saturday. You can find me on the emacs-help and emacs-devel lists as well, so don't hesitate to send me questions and remarks. Thank you again, and see you around.
