Track: Development

Pre-localizing Emacs

Jean-Christophe Helary (he/him)

Format: 11-min talk followed by live Q&A (done)
Etherpad: https://pad.emacsconf.org/2022-localizing
Discuss on IRC: #emacsconf-dev
Status: TO_CAPTION_QA

Times in different timezones:
Sunday, Dec 4 2022, ~4:00 PM - 4:10 PM EST (US/Eastern)
which is the same as:
Sunday, Dec 4 2022, ~3:00 PM - 3:10 PM CST (US/Central)
Sunday, Dec 4 2022, ~2:00 PM - 2:10 PM MST (US/Mountain)
Sunday, Dec 4 2022, ~1:00 PM - 1:10 PM PST (US/Pacific)
Sunday, Dec 4 2022, ~9:00 PM - 9:10 PM UTC
Sunday, Dec 4 2022, ~10:00 PM - 10:10 PM CET (Europe/Paris)
Sunday, Dec 4 2022, ~11:00 PM - 11:10 PM EET (Europe/Athens)
Monday, Dec 5 2022, ~2:30 AM - 2:40 AM IST (Asia/Kolkata)
Monday, Dec 5 2022, ~5:00 AM - 5:10 AM +08 (Asia/Singapore)
Monday, Dec 5 2022, ~6:00 AM - 6:10 AM JST (Asia/Tokyo)

Q&A

  • 06:11.680 Is Emacs localized/localizable?
  • 10:07.160 You mention regex on strings is a red flag for localization, are there others to look out for?
  • 13:49.980 So, your project is to localize all of Emacs?
  • 14:47.325 How deep would useful localization go? Because at the core of Emacs are docstrings and localizing them could also imply localizing Elisp.

Description

Before Emacs user-facing strings are localized into users’ languages (will that ever happen?), there are things developers must remember when including such strings in the code, and there are things that Emacs Lisp beginners (like me, forever) can do to help when they face such issues.

It is not easy to write naturally flowing language when the language depends on program variables, but even if we stick to English for the time being, it is important to separate natural language from computer language as much as possible.

I will be presenting an old patch to package.el, accepted in June 2018, that took me about a year to write. The origin of the patch is a plural mistake in the package install messages. As you can see, before the patch the code was filled with English substrings embedded in the code to produce past tenses, plurals and all sorts of English grammatical constructs.

https://git.savannah.gnu.org/cgit/emacs.git/commit/?id=61f73703c74756e6963cc622f03bcc6938ab71b2

Even if it is a beginner’s patch (thoroughly reviewed by dev-experts), it shows what Emacs Lisp beginners can do to help with “straightening” the strings, first to reduce the number of potential English bugs and then to make Emacs strings easier for real localization processes to handle, one day.
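
As an illustration of the plural problem described above, here is a minimal Emacs Lisp sketch. It is not taken from the patch: the `my-demo-` function names and the exact message wording are hypothetical, and only `format` is a standard Emacs function. It contrasts a string that is always plural, a string that computes English grammar in code, and a string that keeps the sentence fixed and reports the number separately.

```elisp
;; Hypothetical helpers for illustration only; none of these names
;; exist in package.el.

(defun my-demo-always-plural (count)
  "Return a message whose noun is plural whatever COUNT is."
  (format "%d packages are no longer needed" count))

(defun my-demo-computed-plural (count)
  "Return a message whose suffix and verb are chosen in code.
This assumes a language with exactly two number forms."
  (format "%d package%s %s no longer needed"
          count
          (if (= count 1) "" "s")
          (if (= count 1) "is" "are")))

(defun my-demo-externalized (count)
  "Return a fixed sentence that reports COUNT as a separate value."
  (format "Packages that are no longer needed: %d" count))

(my-demo-always-plural 1)    ; "1 packages are no longer needed"
(my-demo-computed-plural 1)  ; "1 package is no longer needed"
(my-demo-externalized 1)     ; "Packages that are no longer needed: 1"
```

The third form reads slightly less naturally in English, but it makes no assumption about how many number forms a language has, and the whole sentence stays in one string that a translator could handle.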

Discussion

Questions and answers

  • Q: I use Emacs in English but my mother tongue is Spanish. Is Emacs localized/localizable?
    • haha I have no English knowledge to help fix text labels, but it is of use for new developments.
    • A:
  • Q: You mention regex on strings is a red flag for localization, are there others to look out for?
    • A:
  • Q: So, your project is to localize all of Emacs?
    • A:
  • Q: How deep would useful localization go? Because at the core of Emacs are docstrings and localizing them could also imply localizing Elisp.
    • A:

Other feedback:

  • Merci Jean-Christophe ! I really enjoyed your talk (and would very much like to help localise Emacs).

Transcript

Hello everyone, I am Jean-Christophe Helary, I live in Japan, and I'm a translator. Here is my second presentation on this very prestigious stage that is the Emacs conference. Following my "Let's Translate the 2 million words in the Emacs manual" in 2021, my topic this year, always related to translation, is pre-localizing Emacs or much less pretentiously, "Just make sure that your strings don't mix up plurals".

So, for some reason I resumed Emacs use around 2016, and as I was rediscovering the thing I found really old outline-mode files here and there on my machine. And I started to experiment again and write again with Emacs. I think that at the time, I was coming from Aquamacs and because of an integration bug with macOS, I decided to check what was going on in the code. That was my first official contribution.

So as I was happily installing and uninstalling things, I noticed something weird one day. Let me enlarge that picture. See? And even if I were not a translator, I would not like that string, and obviously the same bug bites you when the string tells you to erase the package. Boom, so we agree that we have a problem here.

So, I started to do some spelunking into the code, and at least that was my feeling because I really am not a programmer by any stretch of the imagination. And what I found was an amazing piece of natural language engineering that was mixing code with English suffixes and all that, and I could see that the people who had written that code were pretty smart, but had missed a number of edge cases that produced the above bugs. That was my first experience with all the message related functions, "format", "concat", "message", etc. But even with my beginner's eyes I could see that something was off because when you want to produce natural language strings you never ever should use "replace-regexp-in-string" to add an "ing" or an "ed" suffix to change the mode of a sentence. But that's what I was seeing happening.

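
The suffix trick criticized here can be sketched as follows. This is a hypothetical reconstruction for illustration, not code shown in the talk or quoted from package.el: the template, the `__` placeholder and the `my-demo-` names are assumptions, while `replace-regexp-in-string` and `format` are standard Emacs Lisp functions.

```elisp
;; Hypothetical template: "__" stands for an English suffix that is
;; filled in later, which only works for words that inflect this way.
(defvar my-demo-template "Install__ %d package(s)"
  "Illustrative message template with a suffix placeholder.")

(defun my-demo-fragile-message (suffix count)
  "Return a status string built by injecting SUFFIX into the template."
  (format (replace-regexp-in-string "__" suffix my-demo-template)
          count))

(my-demo-fragile-message "ing" 3)  ; "Installing 3 package(s)"
(my-demo-fragile-message "ed" 3)   ; "Installed 3 package(s)"

;; Plainer alternative: one complete string per situation, with no
;; assumption about how words are formed.
(defun my-demo-plain-message (finished count)
  "Return a complete status string; FINISHED non-nil means the work is done."
  (if finished
      (format "Installation finished. Packages installed: %d" count)
    (format "Installation started. Packages to install: %d" count)))

(my-demo-plain-message nil 3)
;; "Installation started. Packages to install: 3"
```

The first version saves one string in the source, but a translator cannot do anything with "Install__": in many languages the progressive and the past form are not the same stem plus a suffix. The second version costs an extra string and leaves every sentence whole.
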
So, what we had to deal with here was way more than just a missed plural. It was an attempt at engineering all the message strings destined to the user with the smart code that was making assumptions on the structure of words, and in the localization world that's a big no-no. I'm a translator, and such UI string issues were sorted out decades ago. So I was a bit shocked.

The final patch took me about a year to write, because I'm slow, because I needed to verify and understand a lot, because there are plenty of rules and plenty of people who are explaining to you very nicely what the rules are, because I have kids, and because the Emacs development list is such a cool place to be that you often forget why you're there.

Anyway, for people who can't click on a video, and I can't either, here are the relevant parts with some short comments. I'll be talking with localization in mind, knowing full well that Emacs localization is not on the map at the moment.

So first, there is this thing about "format" and "concat". And if I remember correctly, "format" is better for user-facing things, and "concat" is better for internal things. Here, there are two things. First, a rule that we have when we prepare strings that need to be localized is never ever make assumptions on the way numbers are expressed in the language. Here, the assumption is that we have either a singular or plural form, and that's not always the case. That usually means that you should externalize numbers and find a generic way to express them. So it makes for slightly less natural language strings, but it's better anyway. Then we have that comma there that's trying to be externalized and that's weird, so I put it back into the sentence.

Here we have another construct, or two rather, that really should not be used like this. It's "prin1" that uses quoting characters, just like "print", and "princ" that does not. And you see why they were combined together.

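
For readers who do not have the slides, the two points in this passage can be illustrated with a short Emacs Lisp sketch. It is an illustration under assumptions, not code from the patch: the `my-demo-` functions, the package name and the message wording are hypothetical. `format`, `concat`, `number-to-string` and `prin1-to-string` are standard functions; in `format`, `%S` prints the way `prin1` does and `%s` the way `princ` does.

```elisp
;; Quoting difference between the two printers, seen through `format'
;; and `prin1-to-string'.
(format "%S" "magit")        ; "\"magit\""  (prin1 style, with quotes)
(format "%s" "magit")        ; "magit"      (princ style, no quotes)
(prin1-to-string "magit" t)  ; "magit"      (non-nil NOQUOTE: princ style)

;; Fragmented construction (hypothetical): the comma and the noun live
;; outside any complete sentence, so nobody sees the message as a whole.
(defun my-demo-fragmented (name count)
  "Return a message about NAME glued together from fragments."
  (concat "Package " name ", " (number-to-string count) " dependencies"))

;; Whole sentence in one format string, with the number externalized.
(defun my-demo-whole (name count)
  "Return a message about NAME written as a single format string."
  (format "Package %s, dependencies: %d" name count))

(my-demo-fragmented "magit" 3)  ; "Package magit, 3 dependencies"
(my-demo-whole "magit" 3)       ; "Package magit, dependencies: 3"
```

Both functions produce similar English output. The difference is that the second keeps punctuation and wording in one place, which is what makes a string reviewable today and translatable later.
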
And they were both trying to be really smart about which article to put in front of a vowel. And you just don't do that. You just keep things simple. Here again, the code is trying to be smart, but it's really not much more efficient than plainly stating what you want. And here again, we have "concat" things that we could just use to plainly state what we want to state. So, instead of "concat" I just put a "message".

And here we have something that's very cute. It's a computerized plural. Here again, assuming that there are only plural or singular forms. But the end string is not that much more natural than the fix, the code is less efficient and is harder to understand. Here again, the code is trying to make smart things where it could be much simpler. That is the part where you get the number of packages and their names. Here the whole sentence with the semicolons and the question mark is split in parts, between which something will be inserted. That's really ugly and difficult to read. Here again, another "ing" waiting to be regex-inserted into the code.

And here at last, we get to the point where everything started. And you can see that unlike in the other spots, there is no possibility for the expression to be singular. So, I guess that if it hadn't been for that bug, I would not have found the other items, and we would be left with code that works, of course, but that is harder to understand, and maintain. Last but not least, a last version of "just plainly state what you mean to state". Keep it simple.

So first, we have this wonderful CONTRIBUTE file that is very explicit about how we must proceed when contributing code. So, that's really the first place that we should all read. The README file is pretty cool too, especially at the beginning of the process, when you're not sure whether you want to fix that bug or just report it. And then we've got packages.

We've got a number of packages that are really helpful when it comes to reading the information and the manuals. I'm mentioning three of them here, and I think they are the most important for us. So "helpful" is on the right, and it's overflowing the window with all the contextualized information it provides, and the standard "help" is on the left. I mean, really there are like two or three screenfuls of information in the "helpful" output, so you really only see a part, but I guess if you use it, you know what I'm saying. What I like the most here is the "view in manual" part, where you can actually click and even get more information that's sometimes easier to read and understand.

And then you've got the "info" versus "inform" formats. When you're in the manual, "inform" makes a huge difference. You can see here that you've got colorized items, and also in the middle you've got that 'read' part that's green and bold. In "info" it's not a specific object, it's just a string. In "inform" it's actually a link that you can click, and actually go to that 'read' manual page.

Now, we've got "which-key". "which-key" is a savior for beginners too. Just wait half a second or something, and Emacs will show you all the keys that you can access from the prefix combination that you just typed. So, it's really helpful for discovering functions and learning new functions, getting used to them.

And so that whole process started…, it was May 23, 2017, with that thread when I found the bug. I just bumped into an English/code bug this morning. In package.el, when one package is not needed anymore, the message is: "Package menu: Operation finished. 1 packages are no longer needed", etc. So, I was asking whether we had best practices for using messages, and we had a whole thread about that. And while I was discussing on that thread, I started that new thread, which is: "package.el strings". The whole thing actually ended on June 27, 2018.

So, a year after, with that message from Noam telling me that "Yes I can close the bug," and that was it. So, it took about a year to finish that.

What I did learn basically is that helping with Emacs is not that difficult. It takes time when you're not fluent with the code, but that's okay because the reference is excellent, and there are lots of people who are here to help. Basically, the solution to all our problems is "Keep It Simple and Straightforward". As you can see in that patch, even if it's a beginner's patch, what I did shows what can be done by Emacs Lisp beginners to help with "straightening" the strings to reduce the number of potential English bugs. And then to make Emacs strings easier to be handled by real localization processes one day. But it doesn't have to be about strings because strings can be an easy entry point to Emacs, but it can be any itch that you want to scratch.

And my real conclusion is that Emacs is free software, and what that means is mostly that it allows you to do things that you would never have thought of being able to do before. That's really the biggest lesson to be learned here.

So, I want to thank all the people who allowed this to happen, allowed me to learn a bit and contribute a bit to that wonderful piece of software that Emacs is. And thank you everyone for listening, and hopefully I'll see you next year with a different translation related presentation. Thank you very much.

Captioner: jean-christophe

Questions or comments? Please e-mail emacsconf-org-private@gnu.org
