Perso-Arabic Input Methods And BIDI Aware Apps

Mohsen BANAN -- emacs@mohsen.1.banan.byname.net -- محسن بنان
pronouns: he/him, pronunciation: MO-HH-SS-EN
http://mohsen.1.banan.byname.net

Q&A: live
Duration: 19:52

If you have questions and the speaker has not indicated public contact information on this page, please feel free to e-mail us at emacsconf-submit@gnu.org and we'll forward your question to the speaker.

Description

About The Video

The video is a screen capture of a reveal presentation prepared with Beamer XeLaTeX and HeVeA. The original reveal presentation therefore lets you click on the links that you see in the video and navigate through the slides. In HTML, it is also available in a Presentation-As-Article format, which includes the complete text of the audio. The traditional Beamer slides are also available. The Access Page for PLPC-180063 points to all available forms and formats.

About This Presentation

Emacs is a multilingual user environment. A true multilingual editor must support bidirectionality and shaping of characters. Perso-Arabic scripts require both of these features.

Starting with Emacs 24, full native bidi (bidirectional) support became available. For many years prior to that Unicode support was available and by around year 2000, reasonable open-source shaping libraries were also available.

With these in place, around 2012 I developed two Persian input methods for Emacs. These input methods, or variations of them, can also be used for Arabic and other Perso-Arabic scripts.

With all of these in place, Emacs has now become the ne plus ultra Libre-Halaal and Convivial usage environment for Perso-Arabic users.

Since emacs comes loaded with everything (Gnus for email, Bbdb for address books, XeLaTeX modes for typesetting, org-mode for organization, spell checkers, completion systems, calendar, etc.), all basic computing and communication needs of Perso-Arabic users can be addressed in one place and cohesively.

In this talk I will demonstrate what a wonderful environment that can be.

My talk will be in two parts.

In Part 1, I cover Persian input methods, with an emphasis on the "Banan Multi-Character (Reverse) Transliteration Persian Input Method". The software is part of the base Emacs distribution. Full documentation is available at:

       Persian Input Methods
       For Emacs And More Broadly Speaking
       شیوه‌هایِ درج به فارسی‌

http://mohsen.1.banan.byname.net/PLPC/120036

In Part 2, I'll demonstrate that Emacs is far more than an editor. Emacs can be a complete Perso-Arabic usage environment. I will also cover the ramifications of bidi on existing emacs applications, including:

  • Spell Checking, Dictionaries And Completion Frameworks:

    • Existing emacs facilities can be extended to cover Perso-Arabic.
  • Gnus:

    • Perso-Arabic rich email sending in HTML.
    • Ramifications of bidi on from:, to: and subject: lines.
  • Bbdb: Ramifications of bidi on display and completion.

  • Calendar:

    • Ramifications of bidi on display.
    • Use of Persian text for Persian (solar) calendar.
    • Use of Arabic text for Muslim (lunar) calendar.
  • AUCTeX: Persian typesetting with XeLaTeX

    • Option of having right-to-left Perso-Arabic aliases for all latex commands.

References:

Persian Input Methods:

http://mohsen.1.banan.byname.net/PLPC/120036
http://www.persoarabic.org/PLPC/120036 -- Persian Input Methods Access Page
http://www.persoarabic.org -- Various Perso-Arabic resources
http://www.freeprotocols.org/Repub/fpf-isiri-6219 -- Re-Publication Of Persian Information Interchange and Display Mechanism, using Unicode
https://github.com/bx-blee/persian-input-method -- Git repo for persian.el -- Quail package for inputting Persian/Farsi keyboards

BIDI:

http://www.unicode.org/reports/tr9/ -- Annex #9 of the Unicode standard
https://www.gnu.org/software/emacs/manual/html_node/elisp/Bidirectional-Display.html -- Emacs Bidirectional Display
http://www.persoarabic.org/answers -- Paragraph Directionality Results Into Serious Communication Problems

Blee and Persian-Blee:

https://github.com/bx-blee/env2 -- Very messy work-in-progress git repo for Blee: By* Libre-Halaal Emacs Environment
http://www.by-star.net -- A Moral Alternative To The Proprietary American Digital Ecosystem
http://mohsen.1.banan.byname.net/PLPC/120033 -- Nature of Polyexistentials: Basis for Abolishment of The Western Intellectual Property Rights Regime
http://mohsen.1.banan.byname.net/PLPC/120039 -- Defining The Libre-Halaal Label

Mohsen BANAN -- محسن بنان:

http://mohsen.1.banan.byname.net/ -- Globish
http://mohsen.1.banan.byname.net/persian -- Farsi
http://mohsen.1.banan.byname.net/french -- French

Discussion

Pad:

  • Q1: Are there any additions that you have to make to Emacs for using non-English/Latin characters, or does it work mostly out of the box?
    • A: [Prot]: I only set the default-input-method to "greek", then switch to it with C-\ (toggle-input-method).
  • Q2: One struggle I have with this input method option is, why not use an IME that's installed on the host OS?
    • A: I live inside Emacs, and the host OS typically provides an unintelligent keyboard, whereas farsi-transliterate-banan provides multi-character input, which is a lot more powerful.
  • Q3: Do you write any lisp or other code/markup with these scripts? (Sorry if I missed you mentioning this.)
    • A: No, everything is in pure Elisp.
  • Q4: What alternatives have you looked into for solving the problem related to your markup language idea? What isn't achieved by them?
    • A: It comes down to the way that Emacs has evolved around properties of strings and text. I suggest we adopt the "web" model for Emacs application development. If you step back and look at where we are, there's no such thing as an 'Emacs native markup language' similar to HTML for the web. Emacs's display engine is capable of doing everything, but we're not exposing .... (sorry, missed this part)
    • Makes sense to me, thanks!
  • Q5: bandali: generally curious about the state of writing/reading Persian in the TTY
  • Q6: Does your input method also solve problems with exporting documents? Usually, when you export a Persian-English doc, it redirects the Persian script to LTR.

Questions/comments:

  • Thanks for giving such a nice presentation of the Emacs input method framework! I'm just curious about if you've made any plans for setting up your markup language? I know you said you hadn't written any code for it yet.
  • That makes sense. Do you think you could use org more exclusively, and just add portions to implement your idea? As-in, there's nothing within org mode that would need to be fundamentally changed, correct?
  • I wonder about that. Org doesn't quite support all the expressiveness that you see in some buffers/modes.
  • I agree. Finding a way to reach a happy medium without having to go "full elisp" would be quite powerful.
  • Potentially the tui.el system mentioned earlier in the conference could mix well with your idea as well.
  • I have one last, quick question. If you've used a version of Emacs 28, how have you found the new feature of doing a quick switch into a different IME? I know John Wiegley mentioned it in his talk earlier.
  • Does OS-level stuff work when you have to change character direction on the same line, like LtR numbers in a RtL script?

Feedback:

  • This is great. I've done a demo like this for a few friends in the past as well.
  • Whoever did the captions for this was spot on, the unicode characters would be challenging.
  • I just love the Emacs input method framework, and I don't think a lot of latin script users know about it.
  • This is really cool, it's something that I never think about from other users in other countries using Emacs.
  • The captions for this conference have had an impressive amount of work put into them.
  • omg! this is great. farsi 101 in emacs
  • ++ to all that stuff. Great job on the captions, and the demonstrated functionality is very impressive.
  • Yay for the captions!
  • This has been really slick. Kudos for the captions including the Farsi characters and latin text.
  • At first, I thought the captions would be unnecessary, but over time, understanding the accents for various individuals has been challenging, so the captions helped.
  • One struggle I have with this input method option is, why not use an IME that's installed on the host OS? I mean, I do that with Japanese, but that may no longer work easily with qubes, so maybe it's more of a thing that'd benefit me now.
    • though, I'm thinking that certain input methods don't actually simulate key-presses on virtual keyboards ... ?
    • Not a primary reason, but since I'm used to configuring Emacs, I've found it a lot easier to learn to configure the integrated IMF than to configure an external one.
    • I used SCIM/uim for Japanese input at one point, but that was before I used Emacs; it was a nightmare to set up.
  • I may have to try this, the IMEs I've used haven't been an issue too much in the past, but...maybe this would be better, at least I wouldn't have to worry about config on each qube.
  • Banan's work on BIDI support is an eye-opener...
  • yeah absolutely. it's a really great point that Emacs can always be expanded to be more inclusive to other languages in ways that are more than just Unicode related.
  • bidi destroying irssi, time to find a good emacs irc client ...
  • thanks for the talk...another example how Emacs is inclusive catering for all forms of text.
  • Lots to think about. Thanks for the talk and inspiration!
  • Awesome. Thanks again for such a great talk and a great q&a!

Transcript

[00:00:02.960] Greetings. Salaam. This is Mohsen Banan. I am an Iranian software and internet engineer. I converted to Emacs in 1986. It was Emacs version 17 then. By around 1988, when Emacs version 18 was well in place, I started living inside of Emacs. My primary digital environment has been Emacs ever since. It has been a good life. I'm a native Farsi speaker and writer. I'm not a linguist, and I do not specialize in multilingualization, internationalization, and localization. My favorite programming language is Lisp, and I am a bit of an Emacs developer.

[00:00:58.320] This presentation is about use of Perso-Arabic Scripts with Emacs. It's an overview presentation. I won't be digging deep in many of the mentioned topics. My goal is to make you aware of what can be done with Emacs today, and the potentials that Emacs presents for Perso-Arabic writers. The main topics that I'll cover are:

  • a brief introduction to Perso-Arabic scripts
  • two existing Emacs Persian input methods
  • the challenges involved with making Emacs applications bidirectional-aware
  • the ultimate goal of creating a complete digital environment for Perso-Arabic writers

I'll also be including various pointers.

[00:01:57.680] So first, let's make sure that what I'm presenting is of interest to you. If you are a Perso-Arabic writer and if you use Emacs, you're definitely my intended audience. If you're an Emacs developer who wishes to make her Emacs apps multilingual and bidi-aware, you're also my intended audience.

[00:02:27.360] For the purposes of this presentation, in this slide, I'm categorizing scripts based on directionality and shaping. Latin letters are not shaped. Generally speaking, the shape of a Latin letter is independent of its position in a word. Perso-Arabic letters are subject to shaping. For example the letter mim-- sounding similar to M-- takes three shapes depending on whether it is in the beginning of a word, in the middle of a word, or at the end of a word. I'll be showing more of how shaping works in an Emacs session screencast later. Shaping has ramifications for Emacs application developers. For example, if you are combining initial letters to create a label, those letters can be shaped together-- which is not what you want. In such cases, you would need to explicitly keep them separate. Latin-based scripts are always left-to-right. Perso-Arabic scripts are right-to-left with letters, but numbers are left-to-right. So, Perso-Arabic scripts are bi-directional (BIDI). Hebrew is also bi-directional, but Hebrew is not shaped. More recently, it has become very common to mix Perso-Arabic and Latin text. This can become very confusing if paragraph directionality is not properly observed. I'll be providing some examples as screencasts. The Emacs display engine now fully and well supports both shaping and BIDI.
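The directional categories described above can be inspected from Lisp. The following is a minimal sketch and was not shown in the talk: it reads the Unicode bidi class of a few characters with `get-char-code-property`, and then joins two initials with ZERO WIDTH NON-JOINER (U+200C) as one possible way to keep label letters from being shaped together. The helper name `my/initials-label` is hypothetical.

```elisp
;; The Unicode bidi class of a character is returned as a symbol.
(get-char-code-property ?a 'bidi-class)      ; Latin letter: L
(get-char-code-property #x0645 'bidi-class)  ; ARABIC LETTER MEEM: AL
(get-char-code-property #x05D0 'bidi-class)  ; HEBREW LETTER ALEF: R
(get-char-code-property #x06F1 'bidi-class)  ; Persian digit one: EN

(defun my/initials-label (words)
  "Return the first letters of WORDS, separated by U+200C.
Hypothetical helper: ZERO WIDTH NON-JOINER stops adjacent
Perso-Arabic letters from joining, so each initial is displayed
in its isolated shape."
  (mapconcat (lambda (word) (substring word 0 1))
             words
             (string #x200C)))

;; Builds the string: meem, U+200C, beh.
(my/initials-label '("محسن" "بنان"))
```

Letters carry the strong classes L, R, or AL, while digits carry a number class such as EN, which is why numbers inside right-to-left text still run left-to-right.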

[00:04:32.080] Since 2012, starting with Emacs version 24, we can say that Emacs is a truly multilingual-capable environment. Like everything else, multilingual support for Emacs was added gradually. Unicode support was added early on. The framework for input methods evolved in the 1990s. But it was not till version 24 in 2012 that the display engine could fully support BIDI. Hats off to Eli Zaretskii for his work on Emacs BIDI. Once full BIDI support was in place in 2012, I went ahead and added two Persian input methods to Emacs 24. So now Emacs fully supports Perso-Arabic scripts.

[00:05:30.639] By Perso-Arabic script, we're referring to the Arabic writing system with various extensions used by a large number of languages. Perso-Arabic is the second most widely used writing system in the world by the number of countries. It is the third by the number of users after the Latin and Chinese scripts. So, by well supporting Perso-Arabic, Emacs's potential user base can be greatly enhanced.

[00:06:09.919] Before focusing on the Persian input methods, let me quickly summarize Emacs's input methods model. Input methods allow you to enter characters that are not supported by your keyboard. With Quail maps, we can map ASCII key strings to multilingual characters. So we can input any text from an ASCII keyboard. You select an input method with C-x RET C-\ . We'll try that in a screencast shortly.

[00:06:49.919] Since version 24, Emacs comes loaded with two Persian input methods: farsi-isiri-9147 is the standard traditional Iranian keyboard. farsi-transliterate-banan is an intuitive transliteration keyboard for Farsi which requires near-zero training for use. I'll be mostly focused on farsi-transliterate-banan in this presentation. So let's try this out.
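For readers who want to follow along, the same selection can be made from an init file or from Lisp instead of from the keyboard. This is a minimal sketch using the standard `default-input-method` option and the `set-input-method` and `deactivate-input-method` functions; which method to make the default is of course a personal choice, and the two explicit calls are meant to be evaluated in a buffer (for example with M-:), not placed at the top of an init file.

```elisp
;; Make farsi-transliterate-banan the method that C-\ (toggle-input-method)
;; turns on and off.
(setq default-input-method "farsi-transliterate-banan")

;; Select a method explicitly in the current buffer; this is the command
;; behind C-x RET C-\ .
(set-input-method "farsi-isiri-9147")

;; The name of the active method is kept in a buffer-local variable.
current-input-method

;; Switch the current buffer back to plain keyboard input.
(deactivate-input-method)
```

Input methods are selected per buffer, so a Farsi buffer and a Latin-script buffer can be open side by side without interfering with each other.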

[00:07:27.840] In this gif-cast, we're going to select a Persian input method and write a few simple sentences. With no training and no documentation, any Farsi writer familiar with Emacs can write these, as the farsi-transliterate-banan input method is intuitive. I'll be using keycast to show you keys as they are used. Let me first describe what we have on the screen. There are three windows in one frame. Keycast will show commands and keys on the mode line. The leftmost window is showing logs of keycast. Transformed individual unshaped letters will appear here. The middle window is running a tail -f on the dribble file piped to fold -w1. This lets you see the raw ASCII characters as I type them. The right window is the empty buffer on the ex.fa file. Everything that I describe here can be done with a virgin Emacs distribution with nothing added, but I'm using Blee (By* Libre-Halaal Emacs Environment) to show things. You don't need to have Blee for writing the equivalent of the text in this gif-cast. First, I'm going to select farsi-transliterate-banan. I'm entering C-x RET C-\ . Notice the mode-line and the prompt at the mini-buffer. With completion, I'm going to select farsi-transliterate-banan. Notice that farsi-isiri-9147 was also provided as a choice. Also notice that the letter 'b' appears in the left of the mode line of ex.fa. This indicates which input method has been selected. Also notice that the cursor is on the top left corner of ex.fa. Next, I'm going to enter the 's' character. Notice the cursor moved to the right, and unshaped 'seen' appeared in the ex.fa buffer, on the mode-line, and in the keycast log buffer. Next, I'm going to enter 'l'. Notice ل in the mode-line, and notice how س was subjected to shaping. Next, I'm going to write the following: سلام –حال شما چه طوره؟ –با ایمکس همه کار میشه کرد ("Hello, how are you? You can do everything with Emacs.") Generally, same-sounding Latin characters are used.
As usual, vowels are ignored unless called for. Notice that in order to get ح, I typed 'h' twice. ش is the obvious 'sh'. چ is the obvious 'ch'. ط is uppercase 'T'. ت is lowercase 't'. That's it. We managed to write in Farsi with a QWERTY keyboard, intuitively. Next, we are going to switch back to globish and write "back to globish". Notice that the globish sentence started from the left side. This is due to proper detection of paragraph directionality by Emacs.
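The paragraph-direction detection mentioned at the end of the gif-cast can also be queried from Lisp. The sketch below is an illustration, not part of the talk: the helper `my/paragraph-direction-of` is hypothetical, while `current-bidi-paragraph-direction` and the buffer-local variable `bidi-paragraph-direction` are standard Emacs facilities.

```elisp
(defun my/paragraph-direction-of (text)
  "Return the base direction Emacs detects for TEXT in a scratch buffer.
Hypothetical helper for illustration only.  The result is the
symbol `left-to-right' or `right-to-left'."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (current-bidi-paragraph-direction)))

(my/paragraph-direction-of "سلام")             ; a Farsi paragraph
(my/paragraph-direction-of "back to globish")  ; a Latin-script paragraph

;; An application can also pin the direction of the current buffer
;; instead of relying on detection.
(setq-local bidi-paragraph-direction 'left-to-right)
```

Leaving `bidi-paragraph-direction` at its default of nil keeps the per-paragraph detection seen in the gif-cast; pinning it is mostly of interest for buffers whose layout assumes one direction throughout.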

[00:12:00.160] For the most part, Emacs is self-documenting. Here we are pointing you to some relevant self-contained Emacs resources. The BIDI documentation applies to all BIDI scripts, not just Perso-Arabic scripts. Referring to the code can also be useful for some.

[00:12:28.000] Here are some pointers. The Quail translation code for Persian input methods: the persian.el file has full details of the mapping and some documentation.

[00:12:44.000] Next, we'll show the keyboard layouts as a gif-cast. You can get relevant documentation for any input method with the describe-input-method command. So, let's try that for farsi-transliterate-banan. We are back in the ex.fa buffer as one window. We don't need the keycast logging and the dribble windows any more. With C-\, I reactivate the Farsi input method. Notice that keycast is still active on the mode line. Next, with C-h C-\ , I get the input method's documentation. I then delete other windows and keep the help buffer visible. Notice that beh is this input method's identifier. Here is the URL for full documentation on the web. The keyboard layout itself is a one-to-one mapping, but towards making transliteration intuitive, multiple keys are sometimes mapped to the same letter. For example, both 'i' and 'y' produce yeh. The usual two-letter transliterations ending with 'h' (zh, ch, sh, and kh) are provided. The ampersand prefix is used to support often invisible BIDI markings. In addition to this internal documentation, full documentation is also available.
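To make the idea of multi-character transliteration rules concrete, here is a small Quail sketch. It is not the actual persian.el definition: the package name `demo-farsi-translit`, its mode-line title, and its docstring are made up for illustration, and only the handful of keys mentioned in this talk are included. The target letters for 'zh' and 'kh' are not spelled out in the transcript and are assumed here to be the conventional ژ and خ.

```elisp
(require 'quail)

;; Hypothetical demo package; the real definitions live in persian.el.
(quail-define-package
 "demo-farsi-translit"   ; input method name (made up for this sketch)
 "Farsi"                 ; language
 "د"                     ; indicator shown in the mode line
 nil                     ; no guidance string
 "Tiny demonstration of multi-character transliteration rules.")

(quail-define-rules
 ("s"  ?س)
 ("l"  ?ل)
 ("t"  ?ت)   ; lowercase t
 ("T"  ?ط)   ; uppercase T
 ("i"  ?ی)   ; two different keys ...
 ("y"  ?ی)   ; ... mapped to the same letter
 ("hh" ?ح)
 ("zh" ?ژ)   ; assumed target letter
 ("ch" ?چ)
 ("sh" ?ش)
 ("kh" ?خ))  ; assumed target letter
```

Once this is evaluated, the demo method can be selected with C-x RET C-\ like any other, and C-h C-\ (describe-input-method) shows its docstring. For real use, the complete and documented mapping in persian.el is the one to rely on.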

[00:14:48.959] Complete documentation for Persian input methods is available as PLPC-120036. Next, we'll take a quick look at this on the web. You can click on links in the reveal web-based form of this presentation. So let's visit PLPC-120036. This document fully describes Persian input methods. In addition to HTML, you can also obtain it in PDF. Let's do that. Of particular interest in this document are various tables that enumerate lists of letters with their association to both Persian input methods. Let's take a look at a few of these. Table 3, mapping of isiri-6219 (the Farsi character set) to Emacs version input methods could be of interest to you, as well as table 8 for BIDI-related control markups, and table 9, for vowels and other signs.

[00:16:17.519] Having covered input methods, let's turn our attention to ramifications of BIDI and Perso-Arabic on various Emacs applications. Since 2012, I have been using Persian text in various Emacs applications. In short, my experience has been that most Emacs apps are usable, but they all have glitches that could at a minimum annoy Perso-Arabic users. In this slide, I'm presenting a summary. The glitches with Gnus are not all that significant for me. BBDB glitches can easily be fixed. For Calendar, I have customized my own setup to support Persian and Islamic dates in Perso-Arabic. Perhaps they should be merged upstream. Instead of dealing with apps one at a time, I think it's more reasonable to consider them collectively.

[00:17:25.280] The glitches that I mentioned in the previous slide have two roots: some are BIDI-specific and some are Perso-Arabic-specific. In this slide, I have classified them as such and have made some general suggestions, but all of these at best amount to tactical approaches. I think a more strategic approach is called for.
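The slide's own suggestions are not captured in the transcript, so the following is offered only as an assumed example of what a tactical, BIDI-specific fix can look like. When an application inserts a string of unknown directionality into a left-to-right line and follows it with numbers or punctuation, the standard function `bidi-string-mark-left-to-right` can be used to keep the trailing text from being reordered along with it. The helper `my/format-count` is hypothetical.

```elisp
(defun my/format-count (name count)
  "Return \"NAME: COUNT\" for display in a left-to-right paragraph.
Hypothetical helper: `bidi-string-mark-left-to-right' appends an
invisible LEFT-TO-RIGHT MARK (U+200E) when NAME contains
right-to-left characters, so the colon and the number that follow
are not drawn into the right-to-left run."
  (format "%s: %d" (bidi-string-mark-left-to-right name) count))

(my/format-count "Eli" 3)    ; Latin name: the string is left unchanged
(my/format-count "محسن" 3)   ; Farsi name: U+200E is added after it
```

This only helps when the surrounding paragraph is left-to-right, which is one reason such fixes remain tactical: each application has to apply them at every place where mixed text is assembled.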

[00:17:53.520] The right way to address BIDI-awareness and other awarenesses is to build them in frameworks that Emacs apps can then use. So, I'm proposing that we first create ENML, the Emacs Native Markup Language, as a Lisp-ish (perhaps even not fully secure) super-set of HTML5. With that in place, we can then build on the two decades of experience that have produced various web application development frameworks by mimicking one of them. I don't have any running code for any of these, but discussing strategy need not always be futile.

[00:18:44.640] Emacs has immense potentials, but those potentials cannot be realized unless we integrate Emacs in the totality of a specific complete digital ecosystem. Over the past two decades, I've been building the contours of The Libre-Halaal By* (ByStar) digital ecosystem. Emacs can then be fully integrated into ByStar. It's through such integration that full conviviality of Emacs can be experienced.

[00:19:24.240] Blee, the ByStar Libre-Halaal Emacs environment, is Emacs plus a whole lot of Emacs apps integrated with Debian, with ByStar services, and with BISOS, the ByStar Internet Services OS. Perhaps this could be the topic of a presentation.

Captions by Mohsen Banan
