Back to the schedule
Previous: the org-gtd package: opinions about Getting Things Done
Next: Experience Report: Steps to "Emacs Hyper Notebooks"

One Big-ass Org File or multiple tiny ones? Finally, the End of the debate!

Leo Vivier

Many discussions have been had over the years on the debate between using few big files versus many small files. However, more often than not, those discussions devolve into a collection of anecdotes with barely any science to them.

Once and for all (or, at least until org-element.el gets overhauled), I would like to settle the debate by explaining why the way we parse Org-mode files becomes slower as our files grow in size or number, and how that affects their browsing and the building of custom-agenda views.

I feel qualified to talk about this topic for two reasons:

- I went through the trouble of optimising my agenda-views by implementing clever regex-based skips, so I know the ceiling that can be reached with the current tech.
- My work on Org-roam has led me to consider the use of an external parser for Org-mode files, and whilst we are only at the prototyping stage, we know what is at stake.
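
As a rough illustration of what a regex-based skip buys, here is a toy sketch in Python rather than Emacs Lisp. It is not the speaker's configuration and not Org's own code; the function name `find_todo_headings`, the hard-coded `TODO` keyword, and the sample text are assumptions made up for the example. The idea is to jump from one candidate heading to the next with a single regular-expression search over the raw text, instead of building a full syntax tree first.

```python
import re

# Matches headings such as "** TODO Write slides": stars, a space, the keyword.
# The keyword is hard-coded here; Org lets users configure their own keywords.
TODO_HEADING = re.compile(r"^(\*+) TODO (.*)$", re.MULTILINE)


def find_todo_headings(text):
    """Return (level, title) for every TODO heading, skipping everything else."""
    return [(len(m.group(1)), m.group(2).strip())
            for m in TODO_HEADING.finditer(text)]


SAMPLE = """\
* Projects
** TODO Write slides
Some notes that are never inspected element by element.
** DONE Book train
* TODO Call Karl
"""

for level, title in find_todo_headings(SAMPLE):
    print(level, title)
```

The search still reads the whole text, but it does no per-element work on body lines, drawers, or finished tasks, which is roughly where such skips would save time.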

I intend the talk to be fairly light-hearted and humorous, which is the only way we can do true justice to the topic.

Actual start and end time (EST): Start 2020-11-28T13.43.24; Q&A 2020-11-28T13.51; End: 2020-11-28T14.00.07

Questions

What's better: one big file or many small ones? :>

For knowledge management: many files (see also org-roam).

Otherwise: one big file to have everything (todos, projects, notes, etc…) in one single place.

Possible workaround with some hacks?

Do you switch between British and French accents?

What's the Emacs icon in the firefox address bar?

Browser extension for org-protocol made by vifon: https://github.com/vifon/org-protocol-for-firefox

How do you feel about archive files in Org mode, and how can they fit in?

Could you post links?

How big are your org files?

Main file: 38,000 lines for all GTD tasks, and he does archive.

Karl does use archiving, although he also uses Org tasks in knowledge management, and those don't get archived most of the time.

Does it not consume more resources and time to load multiple files than a large file of the same contents?

Dealing with hiding contents is computationally expensive.

I doubt that is correct. The Emacs display engine is quite efficient at dealing with invisible text. Moving the cursor around is affected, but I have never heard of (or experienced) issues with scrolling in large (2 MB) Org files.
- Actually, Org currently uses overlays to hide text, and the overhead of the overlays does eventually add up. There's a working branch that uses text-properties instead, and it may be merged to Org someday.
  - It is on the way, but I need more feedback (see the help request at https://updates.orgmode.org/).
    - If I ever have time to even get my Org upgraded to the latest version, maybe I can think about trying to test that
      - Would it help to share the branch on GitHub?
        
        It would probably make it easier to use and more visible, so…maybe?
        
        Noted (or rather captured). (Using org-mode, right? Indeed.)
- Karl: whenever I had severe performance issues and somebody was nice and helped to analyze the issue, "overlays" were the root cause in probably 90% of the cases. However, an average user (including me) does not know whether a specific feature is implemented using overlays or not. My Org life is basically trial and error.
  - alphapapa: FYI, if you use org-indent-mode (or whatever the name is of the mode that uses overlays to indent contents), you could disable that to reduce the number of overlays in a buffer.
    - Karl: thanks a bunch. However, some of those modes deliver features that are important to me, so I do have to accept the performance overhead up to a certain level. That's a difficult trade-off I have to make from time to time.

Doesn't using many small Org files clutter up your buffer list when generating the agenda, etc.?

Personally, I limit org agenda to just a few files while keeping notes in many more.

Notes

Speaker's emacs.d: https://github.com/zaeph/.emacs.d
Mentioned: https://karl-voit.at/2020/05/03/current-org-files/ -> Karl's big Org files.
org-element.el: https://orgmode.org/worg/dev/org-element-api.html.
- single-threaded lisp function that parses the whole file.
"the problem is to let org-element to make sense of the item …".

Transcript

00:00:24.160 --> 00:00:58.434 Hello again, everyone! I hope you had, well, quite a lot of talks ever since the last one I did, and all more interesting one after the other. You know, I'm a bit in a bit of a weird spot right now, because I'm supposed to be presenting to you (as you can see on my screen) "One big-ass Org file or multiple tiny ones: finally, the end of the debate," and it sounds about as clickbaity as you can possibly get with those topics. By the way, credit where credit is due, the title is not mine. It's actually from Bastien Guerry, the current Org maintainer.

00:00:58.434 --> 00:01:22.823 Yeah, I wanted to talk to you a little bit today about this question because if you are used to going on reddit.com/r/emacs , you know the subreddit that we have, if you go on Hacker News often, you know it's a question that you see pop up every once in a while. "Should I be using one big file, or should I be using a lot of tiny files?"

00:01:22.823 --> 00:01:58.575 I believe you know we've got defenders on both sides. If I just show you one example... We have Karl Voit. He's one of the organizers for the conference. He is the guy who probably has the biggest Org Mode files right now in all the people I know, and god knows I know plenty of people use Org Mode. But if you just look at this line--I hope it's not too small; you just make it a little larger--but Karl basically has a file with 126,000 lines.

00:01:58.575 --> 00:02:57.040 I'm just going to pause and try to have you imagine how large a file it actually is. Just think about all of these lines being tasks in your days. Think about all those lines being about little thoughts you know that you've had throughout the day or project that you were working on. It's massive. You know one of the problems that Karl Voit actually approaches on this topic is that it takes him roughly 20 seconds to get his Org agenda going, which is a massive amount of time. I mean, we have very fast computers now. You know, ever since Emacs was created in 1976, computers... I have no idea how much faster they've gotten. And yet, you know, for 100,000 lines, Emacs seems to be choking. It's certainly not reasonable, in a way, to have to wait 20 seconds just for your entire file to be parsed. So basically what I want to do--

00:02:57.040 --> 00:03:50.720 By the way, I forgot to introduce the presentation, but I'm Leo Vivier. I did this before, for those who were around. I help maintain a software which is called org-roam, and that's the expertise that I have on the topic. Actually, if you go online, I do have a Github page. I will make sure that you have all the links available afterwards. But I do publish my init files, and you can see, if you scroll at the bottom, I have a little demonstration which shows you the fancy things that I can do with my Org Mode setup. That might be even interesting in light of the talk you've just had about GTD stuff, because the first one is about how I handle my projects, the second one is about the flow from a task as I work on it... So I won't spend too much time on this, but basically that's my expertise. I have spent eight years working with Org Mode, three of them actually thinking about writing packages.

00:03:50.720 --> 00:04:32.880 The thing is, if I go into a little bit of detail (and obviously it's only a lightning talk, so I won't have time to actually go really in depth about it), but there is something in the Org Mode library which is called org-element. You have the name right there, org-element.el, .el being for Elisp file. As you can see, the page is on the Worg wiki, so it's accessible by everyone. It's basically the API that Org Mode uses to parse Org Mode files. For those who don't know, parsing means basically checking a file, checking all the contents of the file, and extracting all the information that we need from that file.

00:04:32.880 --> 00:04:58.960 As you can imagine, you all have Org Mode files in your mind, well you know they can be fairly complex. You can have properties, you can have contextual information, like if you write a line which starts at column zero (which means at the left), it doesn't have the same meaning, whether or not it is before the beginning of a headline or if it is after the beginning of a headline. It's going to be relatively different, hierarchically speaking.

00:04:58.960 --> 00:05:39.280 So the problem, when it comes to the question of many files versus one big file or few big files, is that we always have to keep in mind what org-element wants you to do. The thing is, there are plenty of problems when it comes to parsing files, the first one being obviously that Emacs is a single-thread process (or has some threading capabilities; we're not going to go into the details right now, that's not my goal). It makes it incredibly hard to parallelize parsing processes with the current technology.

00:05:39.280 --> 00:07:03.759 So you'd have to imagine that if you have a very large file--if you go back to the example of Karl Voit from before: 100,000 lines--that means that you have to scan through every single line, basically. Because sometimes... Let's just say that you have a property drawer, for instance, which tells you, oh okay, this tree has the tag :foo:. So the problem is, there are multiple ways for you to define a tag. You can use the usual way, which is about wrapping in colons the :tag: at the end of a heading. For instance, if I... (I'm not going to switch to Emacs, that's going to waste too much time) That's one way to say your tag. But say, you have tag inheritance, which means that when you have a parent with a tag, you also want the child to inherit the tag. If you have first heading with the tag :foo:, you have the first subheading, and the tag :foo: is implied. Now imagine having to do that with a file that is completely nested, a file that has maybe 9, 10, 11 levels of depth to it. It's mind-bogglingly complicated for the software to do that, knowing that... I've told you about tags, but any property can be inheritable. Anything like priorities, even. Though why would you do this? You can have groups. You can have all this.
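
(Editorial sketch, not part of the talk.) The tag inheritance described above can be modelled with a toy example in Python. This is not how org-element is implemented; the function name `effective_tags` and the sample outline are hypothetical, and the sketch ignores file-level tags and Org's inheritance settings. It shows why a heading cannot be interpreted on its own: its effective tags depend on every ancestor above it, so the text has to be walked in order with a stack of open headings.

```python
import re

# A heading: stars, a space, a title, and optionally trailing tags like ":foo:bar:".
HEADING = re.compile(r"^(\*+) (.*?)(?:\s+(:[\w@:]+:))?\s*$")


def effective_tags(text):
    """Map each heading title to its own tags plus those inherited from ancestors."""
    stack = []   # one (level, tags) entry per currently open ancestor heading
    result = {}
    for line in text.splitlines():
        m = HEADING.match(line)
        if not m:
            continue  # body text: still read, only to find the next heading
        level, title, raw = len(m.group(1)), m.group(2), m.group(3)
        own = set(raw.strip(":").split(":")) if raw else set()
        # Close every open heading that is not an ancestor of this one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        inherited = stack[-1][1] if stack else set()
        tags = own | inherited
        stack.append((level, tags))
        result[title] = tags
    return result


SAMPLE = """\
* Work :foo:
** Report
Some body text.
*** Draft :bar:
* Home
** Garden
"""

for title, tags in effective_tags(SAMPLE).items():
    print(title, sorted(tags))
```

In this model "Report" carries :foo: without ever mentioning it, and "Draft" carries both :foo: and :bar:. With 9 to 11 levels of nesting and inheritable properties on top of tags, the same bookkeeping has to be done for every heading in the file.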

00:07:03.759 --> 00:07:21.957 And as someone who went through the trouble of optimizing his Org agenda... So basically, if we go back to the GIFs--oh god we've already had this discussion between the "git" and "magit" and now I've started "gif" and "gif" and I only have one more minute left to do so, so let's just say I'm going to say "gif" just to spite people...

00:07:21.957 --> 00:07:41.360 So if you go on the way I organize my agenda, what I did in order to keep my agenda build time under two seconds, is that I've rewritten a whole lot of codes to be able to parse my Org agenda files. So the thing is, I'm going to be talking more about this later.

00:07:41.360 --> 00:07:44.479 I only have, let's say, one minute to conclude.

00:07:44.479 --> 00:08:15.199 So as you've gathered, I'm not going to be giving you the answer right now. I'm going to be talking about org-roam a little later, which is about following the principle of having many small files. But as someone who has been using one large file to manage my life, you know, I'm sitting on the fence. I do not know which one is the best, but I hope that my presentation has given you a little idea of what goes on behind the principles.

00:08:15.520 --> 00:08:52.000 You also need to think about the philosophy behind the organization of your notes. I hope to be approaching this topic with you in about two hours or so (maybe one hour actually). I'm actually finished. I've decided to leave you two minutes of questions. If someone could feed me the questions, that might be best, because I don't want... oh actually I can just open the pad. I can just open it. Give me a second, okay. Just loading up. I might stop showing my screen. That might make it easier. So I mean if you can make myself big now on the screen, that would be splendid. ([Amin]: yeah sure)

00:08:52.000 --> 00:09:13.920 Thank you. Where are we... Question 12. Okay, so what's better, one big file or...? Is it a jab to tell me that I haven't answered the question because someone just asked me the question? Well, personally, if I were to give you a quick answer in 20 seconds, personally, I think it's a question that is contextually based.

00:09:13.920 --> 00:09:45.890 Do you want something that is efficient as far as optimization is concerned? Then you need to think about this. Personally, for all the organization that I do, all this stuff, all the TODOs that I handle, I like to do this in one simple big file because you benefit from all the refiling capabilities of Org Mode, so I would do that. But for knowledge management, for note-taking and all this, well I'd much rather follow the org-roam way of doing things, which is about having many small files.

00:09:45.890 --> 00:09:57.040 I'm not getting any more questions. I'm not sure if there is one on IRC that could be fed to me. Otherwise, I'm happy to pass over to the next speaker.

00:09:57.040 --> 00:10:06.520 By the way, just before I finish, your world is a lie. It's not a three-piece suit. I'm wearing jeans below, so I hope that satisfies your curiosity.

00:10:10.640 --> 00:10:35.680 Okay, there's one more question appearing. "but otherwise one big file to have everything..." So I'm putting you on the spot, I believe. It was such a short talk. You know the problem is, I just wanted to give you a little answer. A little, you know, path of thinking on this topic. Obviously it's a topic I could be spending 40 minutes on, but I'm going to be drained, you're going to be drained, nobody's going to be happy if I do this.

00:10:39.440 --> 00:11:08.240 Someone asked me if I switch between British and French accents. A little secret for you: when I'm stressed, I tend to revert to a French accent, so you can measure the amount of stress that I'm feeling during this talk with the amount of h's that I drop and the amount of sheer fright that you can see sometimes in my eyes, when I'm thinking about what to say next.

00:11:08.240 --> 00:11:17.040 All right sir. So, Amin, do you believe we can leave it at that? I'll be... People will see plenty more of me later on, anyway.

00:11:17.040 --> 00:11:27.120 ([Amin]: So, looking at the schedule, I think your talk has until like 2:02, meaning like five or six minutes from now.)

00:11:27.120 --> 00:11:28.000 Oh, right.

00:11:28.000 --> 00:11:33.920 ([Amin]: So if you do like to take one or two questions, to add two more questions, by all means.)

00:11:33.920 --> 00:12:20.555 So someone has asked me what is the Emacs icon (sorry, see, another French accent) here in my status bar... Oh sorry, I'm not sharing any more. I might just share again just so that everyone can catch a glimpse of that. There we go. Allow... So it should be... So if you could make me small again, Amin, I'm not sure if it's going to do it by itself, but I do have a little icon here in my status bar which is basically a way to interact with org-protocol. I'm not going to look for it right now, but it's a browser extension that is developed by one of my friends over at Ranger whose name is vifon, and it's very useful. I'm someone who uses a lot of Org protocols.

00:12:20.555 --> 00:12:53.600 And by the way, I used to teach English to high schoolers, and they were supremely worried when I showed them my status line and they saw "kill" and "explore" in my status line. As fellow Emacs users, you know that obviously kill means to kill a selection of text and keep it inside your clipboard, but for my students, they were very worried about what their professor was up to during his nights.

00:12:53.600 --> 00:13:01.920 So let's see if we've got more questions. I'm showing you the questions on the rainbow. Let's see if we've got more. People are posting a lot of questions now.

00:13:01.920 --> 00:13:06.399 So how do you feel about archiving files in Org Mode and how can that work?

00:13:06.399 --> 00:13:59.519 So one of the things when we think about optimization is: yes, archiving done trees is a good idea because it means that if we go back to the org-element, the way it works (and we'll get into technical details afterwards; I'm giving a presentation about org-roam technical aspects, sorry, so I'll have a chance to expand a little more on this) but basically, org-element needs to... Every time it sees a TODO, it has to consider it, even though it is a done TODO. Why? Because let's say, for instance, that in your agenda you want to activate log mode, which is going to show the tasks which are done... Now you could be clever and say, oh okay, the Org agenda does not need to show done items, so it's not going to look for them, but the problem is that org-element is always called. It always needs to parse the buffer.
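
(Editorial sketch, not part of the talk.) To make the archiving argument concrete, here is a toy model in Python of what moving done trees out of a file does to the amount of text left to scan. It is not Org's `org-archive-subtree`; the function name `split_done_subtrees` is hypothetical, the `DONE` keyword is hard-coded although Org lets users define their own, and the sample outline is invented.

```python
import re

HEADING = re.compile(r"^(\*+) ")


def split_done_subtrees(text):
    """Split text into (active, archived) lines, moving whole DONE subtrees."""
    active, archived = [], []
    done_level = None  # level of the DONE heading whose subtree is being moved
    for line in text.splitlines():
        m = HEADING.match(line)
        if m:
            level = len(m.group(1))
            if done_level is not None and level <= done_level:
                done_level = None  # a sibling or shallower heading ends the subtree
            if done_level is None and line[level:].startswith(" DONE "):
                done_level = level
        (archived if done_level is not None else active).append(line)
    return active, archived


SAMPLE = """\
* Projects
** DONE Book train
Paid on Monday.
*** DONE Print ticket
** TODO Write slides
* DONE Renew passport
"""

active, archived = split_done_subtrees(SAMPLE)
print(len(active), "lines stay in the main file")
print(len(archived), "lines move to the archive")
```

Whatever ends up in the archive no longer has to be read when the main file is parsed, which is the saving described above; the price is that log-style views of finished tasks need the archive to be consulted as well.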

00:13:59.519 --> 00:14:22.079 You know, Nicolas Goaziou, who is the French developer who's worked a whole lot on org-element has gone through a lot of trouble to optimize org-element, but the problem is there's just so much that we can do with a concurrent process. Right now it leaves somewhat things to be desired, but we're working on it.

00:14:22.079 --> 00:14:32.639 One more time... I feel like I spent half of this talk teasing my next talks, but I'll be talking more about this in my future talks in about one to two hours.

00:14:32.639 --> 00:14:36.079 So, continuing with questions, how big are my Org files?

00:14:36.079 --> 00:15:04.880 So in the background, I'm just going to check how many lines I have in my main file. In my own file, so the one I told you about where I keep all my TODO GTD stuff, I have 38,000 lines, which is... It's sizable, definitely. But I do archive a lot of stuff, so that might be a slight difference between myself and Karl Voit, even though I don't remember if they actually archive stuff.

00:15:04.880 --> 00:15:12.560 So does it not consume more resources and time to load multiple files than a large file of the same contents?

00:15:12.560 --> 00:16:00.560 Theoretically, yes, having many files open concurrently is slightly slower than having one main file opened. Now the problem is for those of you who have large files, you may have noticed that when you are scrolling in a very large file, it starts taking quite a bit of time. Why? It's because in Org Mode, you have a lot of content that is hidden, so when you have the view mode which hides as much stuff as possible, meaning that you only see the top heading--and I'm checking the time, Amin, don't worry, I'm finished on this one-- when you're hiding a whole lot of stuff, Org Mode needs to keep track, or I should say, Emacs needs to keep track of which areas of text to show and which areas of text to hide.

00:16:00.560 --> 00:16:21.199 The problem is that when you're hiding stuff-- let's say you're moving from the first heading to the second heading, but you've got like 10,000 lines between those two headings-- well, Emacs needs to compute the difference between the two passages, and that takes quite a lot of time. That's why you might realize that it's a little choppy when you start scrolling in large files.

00:16:21.199 --> 00:16:30.719 Anyway I could be answering questions about Org Mode for literally two hours straight, so I'm gonna hand it over to the next speakers. I'll be seeing you guys a little later.

00:16:30.719 --> 00:16:33.440 ([Amin]: Thank you very much, Leo.)

00:16:33.440 --> 00:16:34.889 Oh, thank you.

00:16:34.889 --> 00:16:36.959 ([Amin]: Yes. Bye.)

00:16:36.959 --> 00:16:39.839 Bye.

Saturday, Nov 28 2020, ~ 1:52 PM - 2:02 PM EST
Saturday, Nov 28 2020, ~10:52 AM - 11:02 AM PST
Saturday, Nov 28 2020, ~ 6:52 PM - 7:02 PM UTC
Saturday, Nov 28 2020, ~ 7:52 PM - 8:02 PM CET
Sunday, Nov 29 2020, ~ 2:52 AM - 3:02 AM +08