
Collaborative data processing and documenting using org-babel

Jonathan Hartman (he/him), Lukas C. Bossert (he/him) - https://mastodon.social/@lukascbossert, hartman@itc.rwth-aachen.de, bossert@itc.rwth-aachen.de

Format: 20-min talk; Q&A: ask questions via Etherpad/IRC; we'll e-mail the speaker and post answers on this wiki page after the conference
Status: All done

00:00.000 Introduction 01:16.080 Org Mode 02:18.960 Working together 06:27.840 Data cleaning 08:04.040 Processing 12:36.040 Visualization 14:01.760 Preserve

Duration: 19:16 minutes

Description

In our presentation we will show an efficient way of combining information and enriching it by retrieving data, processing it, and finally exporting it, all with org-mode. We will demonstrate not only org-mode itself, but also a few companion libraries that add functionality such as knowledge-graph visualization, literate programming, and collaborative editing, to quickly create a deeply informative reference page.

The starting point of our best-practice example is the National Research Data Infrastructure Germany (NFDI), about which we retrieve and process data gathered from Wikidata. For this, we additionally leverage the "org-roam" Emacs package, which provides functionality for quickly and simply linking notes and ideas together into a custom knowledge graph. Initially, we write a short abstract about the NFDI and embed it into our existing knowledge graph by linking it to other existing nodes. In the visualized graph (using the “org-roam-ui” package), links and secondary connections to other existing nodes can then be revealed.

Next, we would like to enrich the text about the NFDI with data retrieved from the Wikidata API. A convenient way of creating self-documenting code is the approach called “literate programming”, which presents program logic embedded within human-language text. In Emacs we achieve this by using the “org-babel” package. Perhaps now we find it helpful to collaborate with a colleague in the document: while one is writing the code, the other can explain its use and interpret the results. We do this simultaneously in the same document using a method called “CRDT” (conflict-free replicated data type) – and, of course, there is also an implementation of this in Emacs. The results of the code blocks can be used for further analysis and shared throughout the same document.

Finally, for the sake of proper and barrier-free documentation, we show how to export the document to various formats like PDF, HTML, and plain text, using either the built-in export features of org-mode or pandoc.

About the speakers:

Jonathan Hartman is a trained data scientist and works at the IT Center of the RWTH Aachen University, Germany.

Lukas C. Bossert is a trained classical archaeologist and is deputy head of the department "research process and data management" at the IT Center of the RWTH.

Lukas, an intermediate Emacs user, is currently exploring how to optimize his daily workflow by leveraging various Emacs packages. On the other hand, Jonathan is a relative newcomer to this environment, encountering common pitfalls faced by beginners. Together, they explore the capabilities and functionalities of org-mode, discovering how it can enhance data management and presentation in their research processes.

Lukas and Jonathan are financed by the DKZ.2R Datenkompetenzkolleg Rhein-Ruhr (16DKZ2030E), www.dks2r.de

Discussion

Questions and answers

  • Q: How reliably does it resolve conflicts? I mean, for my personal use case, for example, Syncthing sometimes doesn't work perfectly and I had to merge manually. How robust is it compared to Syncthing?
    • A (Lukas): We also sometimes faced issues where letters got mixed up. We couldn't figure out what caused it, and it was not reproducible. I cannot compare it to Syncthing; I have never used that with Emacs/org-mode.
  • Q: How's the security for this kind of thing? I mean, if we adopt this in our pad, can it execute arbitrary (Elisp) code on other people's computers? (Think like an adversary!)
    • A: (Lukas) As far as we saw, the code is executed on the local computer; see the part with the R code in our video.
    • (zaeph) We had plans with qhong (the maintainer of crdt.el) to tunnel the connection via SSL, but we were blocked by the SSL library that ships with Emacs, sadly. However, we did create a security policy that allows restricting the execution of Elisp code. (great!)
  • Q: Really nice talk and demo!  You guys clearly rehearsed :).  I always wonder with serial data processing sequencing like this, to what degree do the intermediate outputs need to appear inline in the text?  Suppose you had 50,000 or one million rows from your initial wikidata (or similar) call.  How would you handle that size of data using a collaborative, literate approach like this?
    • A: (Lukas) Good question. In your local buffer there is no difference; for the collaborative partner, I cannot tell. We tested it with 50 items because that was enough to demonstrate our purpose.
    • noweb allows getting the results of evaluation without having to put the actual data into the Org buffer - just arrange for the original block generating the data to have :results silent. Basically, :var foo=block-name does not require "block-name" to have been evaluated in advance - it will be evaluated as necessary. AFAIU, in the talk, it is re-evaluated every time (to avoid that, one would need :cache t). See the sketch after this thread.
      • This has tremendous utility
    • So it would be stored on disk and referenced by name in a subsequent block?  Sounds useful.  
      • Not on disk - just cached within a single session. To store it on disk, you need to save it to an actual file.
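    • A minimal Org sketch of the chaining pattern described in this thread (block and variable names are hypothetical):

      #+name: expensive-query
      #+begin_src python :results silent
        # Nothing is inserted into the buffer; downstream blocks can still
        # pull the value in by name via :var (re-evaluated on demand; adding
        # :cache yes to a block with a visible results drawer avoids recomputation)
        return [[i, i * i] for i in range(5)]
      #+end_src

      #+begin_src python :var squares=expensive-query :results value
        # Referencing the named block above evaluates it as needed
        return sum(row[1] for row in squares)
      #+end_src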
  • Q: How do you handle the viewing of larger or really any tabular data in Emacs/Org when you want to inspect it, like the nice way tabular data is displayed inline in Rmarkdown/RStudio?
    • A: (Lukas) I have no particular way of doing this. 
    • What about pandas data summary functionality? Can be a simple python block.
    • Lukas: Jonathan is our python expert, he might answer this question.
    • A: (Jonathan) If I follow, you can certainly just use DataFrame.describe() or Series.describe() to get summary statistics for a dataset - the return value would be a Series or a DataFrame, which would be displayed similarly to how we show things here. Alternatively, DataFrame.head(n) or DataFrame.sample(n) would return a dataframe of the first n / n random lines of a dataset, and might be a way of providing the gist of a very large dataset without printing the entire table in the document. (See the sketch after this thread.)
    • Would be nice to have a "summarized table" functionality in Org, that includes an abridged copy of a long table inline, but you can open it in another buffer to browse/edit the full table (ala block edit).  
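    • A minimal sketch of that idea as an Org block, reusing the clean-dataset table from the talk (the summary method shown is an assumption, not the speakers' setup):

      #+begin_src python :var data=clean-dataset :colnames yes :results value
        # Show only the first few rows of a potentially huge table;
        # swap in df.describe() for summary statistics instead
        import pandas as pd
        df = pd.DataFrame(data)
        return df.head(5).values.tolist()
      #+end_src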
  • Q: I'm thinking about an application for a single user, but on different platforms, in a simple case. For example, you have a buffer on your local computer, and you also want to have some files on your pad or on your phone, and you can use this CRDT concept to make sure that there's not too much conflict between different editing sessions. Do you think this is a good idea? I mean, compared to purely relying on Syncthing, which sometimes I feel is unreliable for resolving those conflicts.
    • A: (Lukas) This sounds very interesting and could be beneficial for continuously working on things.

Notes

  • I like the way you highlight the point you are talking about in real time.
  • Conflict-free Replicated Data Types (CRDT) :: https://github.com/emacs-straight/crdt
  • This is the future of PAD for our conference.
  • Just came here to say watching two users editing the same buffer simultaneously is BLOWING MY MIND 
    • BLOWING MY MIND  +2
    • blowing my mind, too ...
    • WOW
  • Gitlab custom-export.setup
  • Truly one of the most impressive talks of the day. Congrats! Very inspiring
    • Yes, indeed. 
    • (Lukas) Wow! Thank you. We weren't sure if this was worth showing at EmacsConf because there have already been plenty of talks about literate programming and org-babel...
      • Great collaborative conversation and step-wise example creates a different (and impactful) framing.  Thank you!
  • crdt is fantastic; pity that most (all but one) of my collaborators use Word & VS Code. 🙁
  • that's really cool. One of the parts that's a bit hidden from the user is seeing the format that the data is in inside the shell script
    • it is whatever constitutes the closest equivalent of a table in sh (an array)
      • yeah, you have to keep the representation in mind when filtering it as text through sed
  • this demo is so cool :D
  • Really, really impressive I have to admit
  • HA. you cannot evaluate in place so seamlessly in that way with Rmarkdown :). And you cannot combine named blocks in this way either. Wish more folks used emacs.
  • wow, so #+CALL can be embedded in text via call_()? TIL
  • such a slick presentation, I like the CRDT collaboration angle, looks like an end-game UX
  • Impressive workflow!
  • great presentation!
  • For those of you who remember the bad old days before "reproducible research," that talk is even more impressive. Great job!
    • i was prolly not there in the bad old days, but imho reproducible research is a pressing, current problem.
  • I feel like that talk video should be shared on Hacker News

Transcript

[00:00:00.000] Introduction

Collaborative Data Processing and Documenting using org-babel. My name is Lukas Bossert, and I'm from RWTH Aachen University in the city of Aachen, Germany. I'm also from the IT Center here at RWTH Aachen. And we will show you today how you can use Org Mode for data processing. So you see a little workflow of what we are going to do. First, we will give you a slight introduction to Org Mode. Then we will dive into the data preparation part. First, we're going to query the data using the language SPARQL. Then we're going to clean it using a different language. And in the main part of our presentation, we're going to do the data processing, first aggregating using Python, later on counting items using awk, and even visualizing it using R. At the end, we're going to show you how to preserve the data and the document and its documentation, first doing a plain export, then adding some metadata, and showing you two different ways: first a manual export, and then also a batch-processed export. All right. Let's dive into that.

[00:01:16.080] Org Mode

Jonathan, can you give us an introduction to Org Mode? So in case anyone isn't familiar with it, Org Mode, in the words of Carsten Dominik, is "back to the future for plain text". So this is just a module available for Emacs, plain-text based. It's been around since 2003, which makes it about 20 years old. And it's extensible and fully customizable. And especially, it's very convenient, very good for scientific text production and organization. So for example, you can do project management, agenda, diary, journaling, personal knowledge management, presentations. Even this is written in Org Mode. It's an Org Mode presentation. You can do single-source publishing, which we will do later on, and also literate programming, which is the core of our talk. OK. So what you see here is the plain text underneath it. So this is Org Mode.

[00:02:18.960] Working together

And Jonathan, since we kind of already did the introduction together, should we also do the working part together? So you see on the screen there on the right, that's my screen in Emacs. And Lukas, why don't you host a session using CRDT, and I'll connect to your buffer. I'll do that. So what I do: I'm using Doom Emacs, and I can use SPC and then l for the live share/collab part. I can use s for share current buffer. So when I do this, I'm getting asked for some settings. I'm going with the default settings here: default port, no password, and my display name. And now Emacs is connecting. And once it's connected, which just takes a couple of seconds, I can get the URL. So I'm going back to this menu and using y for copying the URL of the current session. And this is the URL I'm going to send over to you, Jonathan, to pick that up. OK. And now on my screen, I'm going to do SPC l c for connect, and I'm going to paste the URL that Lukas just sent me in here. Default port, no password. And we're connecting now. So this takes a second just to get us synced up. So we can work on the same document at the same time. We can follow each other's cursors around. We can have multiple buffers open and work on them at the same time. And so here you see that we are both in the same document. You can see my cursor popping around. And you can see we're both editing the same item. Great. [...] with the user overview. So let me just delete that window. And that's going to work in our main one. So we said the first part is about data retrieval. So we should give it a headline. We said prepare stage. So what are we going to do first, Jonathan? [...] What this whole document is based upon is we're going to pull data from Wikidata using a SPARQL query. The data we're going to pull is related to the NFDIs, which here in Germany is the Nationale Forschungsdateninfrastruktur, a sort of collection of universities that work together on various research projects. And this is emblematic of the kind of data that we would be interested in working with here. So I'm going to paste a--forgive the pre-written code--I'm going to paste some text in here. [...] keep on documenting what we do so we can split the work. [...] is the raw dataset cell. And it's going to use SPARQL, which is how we have the syntax highlighting in our code here. It's going to go to the URL endpoint query.wikidata.org/sparql, and it's going to return the data as text CSV, and it's going to cache that data so that we don't constantly hammer the API every time we run this notebook. So I'm going to run that there. You can see down at the bottom of my screen, we're contacting the host query.wikidata.org. [...] We're just going to limit this to 50 results. I'm going to run that again. There we go. That looks a little better. 50 items is fine. So what do we see here, Jonathan?
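The session-sharing steps above use Doom Emacs bindings; in vanilla Emacs the corresponding crdt.el commands are M-x crdt-share-buffer (host) and M-x crdt-connect (guest). A minimal sketch of the kind of SPARQL block described here, reconstructed from the spoken description (the NFDI Wikidata item ID is a placeholder, and the exact query and header arguments in the talk may differ):

#+name: raw-dataset
#+begin_src sparql :url https://query.wikidata.org/sparql :format text/csv :cache yes
  # Institutions (?i) and the NFDI consortium (?w) they are members of
  SELECT ?iLabel ?wLabel WHERE {
    ?i wdt:P463 ?w .           # P463: "member of"
    ?w wdt:P361 wd:Qxxxxxxxx . # P361: "part of" the NFDI (placeholder item ID)
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  LIMIT 50
#+end_src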

[00:06:27.840] Data cleaning

So the first thing we see when we look at this is a couple of Q-codes at the top, which are an artifact of Wikidata. So these are pages which don't have the label for whichever institution they happen to be. For our purposes here, we're just going to exclude them. We could just go on Wikidata and edit them ourselves. But for now, it's a little more interesting if we go and remove them. So I'm going to create a new cell. Lukas, if you don't mind starting one for data cleaning. Good point. Yeah, data cleaning. OK. How do you want to do that, Jonathan? So let's see. There we go. And so you can see, here in another cell, that the cell is now using a shell, and that we have this thing :var input=raw-dataset, which is the name of the cell above where we got our data from Wikidata. This is going to run just a simple shell command. It's going to take the input and then run sed on it and exclude any records which have a Q followed by one or more digits afterwards. That should remove those from our data set. So I'm going to run that. That seems to have done the trick. That's really good. We got rid of all the Q items. Very good. So we just have a two-column table: institutions and consortia. Very nice.
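A minimal sketch of the cleaning block as described, assuming the upstream table arrives line by line in the shell variable:

#+name: clean-dataset
#+begin_src sh :var input=raw-dataset :results table
  # Drop every row that still carries a raw Wikidata Q-code (Q followed by digits)
  echo "$input" | sed '/Q[0-9]\+/d'
#+end_src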

[00:08:04.040] Processing

So let's come to our main part, doing some processing. Let me give you a headline here: process the data. What do you want to do first? [...] Let's just do some simple counts first. I'm going to start with Python, and we're just going to do some aggregation with Python. Again, I've got some pre-written code here. You can see that we've started a cell using Python. The variable clean_df now is equal to clean-dataset. So we're going to take that data that we retrieved from the SPARQL query, we're going to run it through the cleaning cell, and then we're going to import it into this cell. This is just going to do some simple Python aggregation. We're going to import pandas, which is the Python data science library, create a data frame out of our input, and then aggregate it, grouping on wLabel, and getting a count from that and returning it. So if we execute that cell... But what about not ordering it by the alphabet, but more like ordering by counts? So let's do this... sort_values(), I think, is the Python. How does that look? [...] have the highest number first and then ascending. Well, not ascending, descending. OK, that's nice. We get a good overview here. But can we also do something else, like counting how many institutions are involved in one consortium, and also using this later on in the text? If you give me another heading down here for institutions per consortium... And here we're going to use awk code, just to spice things up and add yet another language in here. So you can see this is awk. We're using standard in instead of defining a variable. But the really interesting thing about this cell is that we have this :var consortium="NFDI4Memory". And what this code is doing is counting any time it sees that particular consortium name and keeping track of that. So if we execute this... Lukas, why don't you execute this one? And I get a result, NFDI4Memory, because this is our default value for this variable, and we get the count. So five institutions are involved in the NFDI4Memory consortium. Great, but the very nice thing, I think, is that we can use this code snippet within our text, blended in seamlessly. Let me give you an example. I'm writing out the text: now we know how many institutions are in... Give me an example. I would like to know how many institutions are involved in NFDI4Objects, which is a consortium. So I'm writing call_ and using the name of this snippet here, of this cell, which is inst-count, and writing my value, NFDI4Objects. As soon as I evaluate this using C-c C-c, I get the result back here. I can do this even for more. Or in writing: call_inst-count, go with NFDI4Earth, which is another consortium. C-c C-c: it's three institutions. This can be used throughout your text, and as soon as the data set changes - maybe different results from querying Wikidata in the beginning - this also will be updated once it's exported. Very nice, Jonathan.
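A minimal sketch of the counting block and the inline call, reconstructed from the spoken description (header arguments are assumptions):

#+name: inst-count
#+begin_src awk :stdin clean-dataset :var consortium="NFDI4Memory" :results output
  # Count every input row that mentions the given consortium name
  $0 ~ consortium { n++ }
  END { print consortium ": " n }
#+end_src

In running text the block is then evaluated inline, for example call_inst-count(consortium="NFDI4Objects"), and the printed result is refreshed whenever the document is re-evaluated or exported.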

[00:12:36.040] Visualization

But I think we did a lot of analysis on text and counting things. Can we also do something more visual? Show me something. So what we can do with this, because we just have two columns here that are sort of related, is build a little network plot out of it. So let's make a network visualization. We're going to use the igraph library from R and just plot the edges that we see here. There we go. There's my little heading and space. Here is our code. Again, just to be fancy and keep using different languages in here, we set a variable called NFDI_edges equal to clean-dataset. So this, again, is sort of cascading through: the original data that we pulled from the Wikidata endpoint, cleaning that data, and now it's being inserted into this cell as well. But you see the difference here. Instead of exporting a table, what we're saying is that there will be a graphics file, and it will be called network-plot.png. All right. And so Lukas, why don't you execute this one? I can press C-c C-c, and I get a nice plot of the network below our cell. So this is very nice indeed.
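A minimal sketch of the plotting block, treating each (institution, consortium) row of clean-dataset as an edge (plot options are assumptions):

#+name: network-plot
#+begin_src R :var NFDI_edges=clean-dataset :results graphics file :file network-plot.png
  ## Build an undirected network from the two-column edge table and draw it
  library(igraph)
  g <- graph_from_data_frame(NFDI_edges, directed = FALSE)
  plot(g, vertex.label.cex = 0.7)
#+end_src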

[00:14:01.760] Preserve

So I think it's about time to wrap it up and to export and preserve the data and the documentation that we have in our very last step, called preserve. I would like to do it in two steps: first maybe manually exporting it, but then also doing it in a batch process. Let me give you some insight into how to do that manual export. For example, you can do a LaTeX export. Let me write down the key combination to do that here. So you press SPC m e l o. Let me show you how this is done. So I'm pressing SPC. I'm pressing m, which is my local leader. I'm pressing e, which is now the org-export-dispatch. And now I have different options I can choose from. I want to do a LaTeX export because I want to get a PDF. So I'm pressing l. Now I've got different options available. So I'm pressing o for a PDF file, and open that. Let's see... Now it is exporting the document. And what we have here is a PDF, which contains our workflow in the beginning, the bullet points we have here, and also the code snippet that we used for querying the data. And we have the result below that. So this is our table with all the data sets. But as you can see, this is running out of the page. So this is not very nice using the default settings. But everything is in this PDF. I guess we can now show you a way to improve this result. So we have, of course, a version of this that we prepared ahead of time, which is more or less identical to the one we just made, but it has a little more text, a little more explanation, a little more documentation along with the code. You can see we have some metadata up at the top - the title, the authors, a bibliography - and most importantly, the custom-export.setup file, which lists specifically the sort of LaTeX commands that we're using and the HTML styles that we're going to use. And then down at the bottom of this file, we have our automatic batch process. Here is one more language we're including in here: this is Lisp. And you can see here we are exporting to HTML, ASCII, and PDF. The nice thing about this is that this is the sort of document - we have a couple of these - that we can have running and building automatically. It will export an HTML file, an ASCII file, and a PDF file every time it's run, based off of the most recent data available on Wikidata. So it's self-documenting. We have, of course, our data retrieval steps, our data cleaning steps, our data preparation steps, and our preservation steps all listed at the same time. And then you can see over on the right, there's an example of the HTML file that we get out of this. We also get a very nicely formatted PDF file, which doesn't have that little issue with the overflow of the table. It's very nicely put together. And we even have an ASCII file. And I should also point out very quickly, while you have this one up, Lukas: after the awk code, you can see the text for the number of institutions per consortium is actually printed inline. So this is what we had as code, and now this is nicely integrated into our text. So we got the consortium and the number of institutions. You can't tell the difference between code and text. So if another institution joins NFDI4Earth, then the next time this runs, we update the text right here. It's nothing we have to worry about. We just pull it directly out of Wikidata. [...] This is the ASCII file. That's the export format; it also contains everything, code and data.
Yeah, so this is what we wanted to show you, how to do some data processing, some collaborative work, documenting using org-babel. Thanks for listening.
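For reference, the manual path described above goes through org-export-dispatch (C-c C-e in vanilla Emacs; SPC m e in Doom). A minimal sketch of a batch-export block like the one described at the end of the talk, using the stock Org export functions:

#+begin_src emacs-lisp :results none
  ;; Export the current Org document to HTML, plain text, and PDF in one run
  (org-html-export-to-html)
  (org-ascii-export-to-ascii)
  (org-latex-export-to-pdf)
#+end_src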

Captioner: amine

Questions or comments? Please e-mail hartman@itc.rwth-aachen.de, bossert@itc.rwth-aachen.de
