
GRAIL---A Generalized Representation and Aggregation of Information Layers

Sameer Pradhan (he/him)


00:00.000 Introduction
01:13.400 Processing language
02:34.560 Annotation
03:43.240 Learning from data
04:39.680 Manual annotation
05:44.400 How can we develop a unified representation?
06:22.520 What role might Emacs and Org mode play?
06:55.280 The complex structure of language
08:10.800 Annotation tools
10:22.360 Org mode
12:45.480 Example
17:36.240 Different readings
19:17.680 Spontaneous speech
23:32.000 Editing properties in column view
24:20.280 Conclusion
25:15.280 Bonus material
27:20.480 Syntactic analysis
28:39.280 Forced alignment
30:12.600 Alignment before tokenization
31:42.880 Layers
34:31.320 Variations




Help wanted: Q&A could be indexed with chapter markers

The Q&A session for this talk does not have chapter markers yet. Would you like to help? See help with chapter markers for more details. You can use the vidid="grail-qanda" if adding the markers to this wiki page, or e-mail your chapter notes to

(If you want to work on this and you think it might take you a while, you can reserve this task by editing the page and adding volunteer="your-name date" or by e-mailing

The human brain receives various signals that it assimilates (filters, splices, corrects, etc.) to build a syntactic structure and its semantic interpretation. This is a complex process that enables human communication. The field of artificial intelligence (AI) is devoted to studying how we generate symbols and derive meaning from such signals and to building predictive models that allow effective human-computer interaction.

For the purpose of this talk we will limit the scope of signals to the domain of language: text and speech. Computational Linguistics (CL), a.k.a. Natural Language Processing (NLP), is a sub-area of AI that tries to interpret them. It involves modeling and predicting complex linguistic structures from these signals. These models tend to rely heavily on a large amount of "raw" (naturally occurring) data and a varying amount of (manually) enriched data, commonly known as "annotations". The models are only as good as the quality of the annotations. Owing to the complex and numerous nature of linguistic phenomena, a divide-and-conquer approach is common. The upside is that it allows one to focus on one, or a few, related linguistic phenomena. The downside is that the universe of these phenomena keeps expanding, as language is context sensitive and evolves over time. For example, depending on the context, the word "bank" can refer to a financial institution, or the rising ground surrounding a lake, or something else. The verb "google" did not exist before the company came into being.

Manually annotating data can be a very task-specific, labor-intensive endeavor. Owing to this, advances in multiple modalities have happened in silos until recently. Recent advances in computer hardware and machine learning algorithms have opened doors to the interpretation of multimodal data. However, the need to piece together such related but disjoint predictions poses a huge challenge.

This brings us to the two questions that we will try to address in this talk:

  1. How can we come up with a unified representation of data and annotations that encompasses arbitrary levels of linguistic information? and,

  2. What role might Emacs play in this process?

Emacs provides a rich environment for editing and manipulating the recursive embedded structures found in programming languages. Its view of text, however, is more or less linear: strings broken into words, strings ended by periods, strings identified using delimiters, etc. It does not assume embedded or recursive structure in text. However, the process of interpreting natural language involves operating on such structures. What if we could adapt Emacs to manipulate rich structures derived from text? Unlike programming languages, which are designed to be parsed and interpreted deterministically, the interpretation of statements in natural languages frequently has to deal with phenomena such as ambiguity, inconsistency, and incompleteness, and can get quite complex.

We present an architecture (GRAIL) which utilizes the capabilities of Emacs to allow the representation and aggregation of such rich structures in a systematic fashion. Our approach is not tied to Emacs, but uses its many built-in capabilities for creating and evaluating solution prototypes.



  • I plan to fix the issues with the subtitles in a more systematic fashion and make the video available at the emacsconf/grail URL. My sense is that this URL will be active for the foreseeable future.
  • I am going to try to revise some of the answers, which I typed quite quickly and in which I may not have provided useful context or might have made errors.
  • Please feel free to email me at for any further questions or discussions you may want to have with me, or to be part of the grail community (doesn't exist yet :-), or is a community of 1)

Questions and answers

  • Q: Has the '92 UPenn corpus of articles feat been reproduced over and over again using these tools?
    • A: 
    • Yes. The '92 corpus only annotated syntactic structure. It was probably the first time that the details captured in syntax were selected not purely based on linguistic accuracy, but on the consistency of such annotations across multiple annotators. This is often referred to as Inter-Annotator Agreement (IAA). The high IAA for this corpus was probably one of the reasons that parsers trained on it got accuracies in the mid 80s or so. Then over the next 30 years (and still continuing...) academics improved on parsers, and today the performance on the test set from this corpus is somewhere around an F-score of 95. But this has to be taken with a big grain of salt given overfitting and how many times people have seen the test set.
    • One thing that might be worth mentioning is that over the past 30 years, many different phenomena have been annotated on parts of this corpus. However, as I mentioned, current tools and representations have difficulty integrating such disparate layers of annotation. Some of these issues relate to the complexity of the phenomena and others to the brittleness of the representations. For example, I remember when we were building the OntoNotes corpus, there was a point where the guidelines were changed to split all words at a hyphen. That simple change caused a lot of heartache, because the interdependencies were not captured at a level that could be programmatically manipulated. That was around 2007, when I decided to use a relational database architecture to represent the layers. The great thing is that it was an almost perfect representation, but for some reason it never caught on, because using a database to prepare data for training was something that was kind of unthinkable 15 years ago. Maybe? Anyway, the format that is the easiest to use is very rigid: you can quickly make use of it, but if something changes somewhere you have no idea if the whole is consistent. And when I came across Org mode sometime around 2011/12 (if I remember correctly) I thought it would be a great tool. And indeed, about a decade later, I am trying to stand on its and Emacs's shoulders.
    • This corpus was one of the first large-scale manually annotated corpora that bootstrapped the statistical natural language processing era. That can be considered the first wave... Since then, there have been more corpora built on the same philosophy. In fact, I spent about 8 years, about a decade ago, building a much larger corpus with more layers of information, called OntoNotes. It covers Chinese and Arabic as well (DARPA funding!). It is freely available for research to anyone anywhere. That was quite a feat.
  • Q: Is this only for natural languages like English, or is it more general? Could this be used for programming languages?

    • A: I am using English as a use case, but the idea is to have it completely multilingual. 
    • I cannot think why you would want to use it for programming languages. In fact, the concept of an AST in programming languages was what I thought would be worth exploring in this area of research. Org mode, the way I sometimes view it, is a somewhat crude incarnation of that and can be sort of manually built, but the idea is to identify patterns and build upon them to create a larger collection of transformations that could be generally useful. That could help capture the abstract representation of "meaning" and help the models learn better.
    • These days most models are trained on a boatload of data, and no matter how much data you use to train your largest model, it is still going to be a small speck in the universe of ever-growing data that we are sitting in today. So, not surprisingly, these models tend to overfit the data they are trained on.
    • So, if you have a smaller data set which is not quite the same as the one the training data came from, then the models really do poorly. It is sometimes compared to learning a sine function using the points on the sine wave as opposed to deriving the function itself. You can get close, but then you cannot really do a lot better with that model :-)
    • I did a brief stint at the Harvard Medical School/Boston Children's Hospital to see if we could use the same underlying philosophy to build better models for understanding clinical notes. It would be an extremely useful and socially beneficial use case, but after a few years, realizing that the legal and policy issues related to making such data available on a larger scale might need a few more decades, I decided to step off that wagon (if I am using the figure of speech correctly).
    • More recently, since I joined the Linguistic Data Consortium, we have been looking at spoken neurological tests that are taken by older people, from which neurologists can predict a potential early onset of some neurological disorders. The idea is to see if we can use speech and language signals to predict such cases early on. Since we don't have cures for those conditions yet, the best we can do is identify them earlier, with the hope that the progression can be slowed down.
    • This is sort of what is happening with the deep learning hype. It is not to say that there hasn't been significant advancement in the technologies, but to say that the models can "learn" is an extreme overstatement.
  • Q: Reminds me of the advantages of pre-computer copy and paste. Cut up paper and rearrange, but having more stuff with your pieces.

    • A: Right! 
    • Kind of like that, but more "intelligent" than copy/paste, because you could have various local constraints that would ensure that the information is consistent with the whole. I am also envisioning this as a use case for hooks. And if you can have rich local dependencies, then you can be sure (as much as you can) that the information signal is not too corrupted.
    • I had not read about the "cut up paper" you mentioned. That is an interesting thought. In fact, the kind of thing I was/am envisioning is that you can cut the paper a million ways and still join the pieces back to form the original piece of paper.



  • Q: Have you used it in some real-life situation? Where have you experimented with this?

    • A: No.
    • I am probably the only person doing this crazy thing. It would be nice, or rather I have a feeling, that something like this, if worked on for a while by many, might lead to a really potent tool for the masses. I feel strongly about giving such power to the users, and about being able to edit and share the data openly so that it is not stuck in some corporate vault somewhere :-) One thing at a time.
    • I am in the process of creating a minimally viable package and seeing where that goes.
    • The idea is to start within Emacs and Org mode but not necessarily be limited to them.
  • Q: Do you see this as a format for this type of annotation specifically, or something more general that can be used for interlinear glosses, lexicons, etc.? -- Does word sense include a valence on positive or negative words (mood)?

  • Interesting question. There are sub-corpora that have some of this data.

    • A: Absolutely. In fact, the project I mentioned, OntoNotes, has multiple layers of annotation. One of them is the propositional structure, which uses a large lexicon that covers about 15K verbs and nouns and all their argument structures that we have seen so far in the corpora. There are about a million "propositions" that have been released recently (we just recently celebrated the 20th birthday of the corpus). It is called the PropBank.
  • There is an interesting history of the "Banks". It started with Treebank, and then there was PropBank (with a capital B), and then we developed OntoNotes, which contains:

    • Syntax
    • Named Entities
    • Coreference Resolution
    • Propositions
    • Word Sense
  • All in the same whole and across various genres... (can add more information here later...)

  • Q: Are there parallel efforts to analyze literary texts or news articles? Pulling the ambiguity of meaning and not just the syntax out of works? (Granted this may be out of your area-- ignore as desired)

    • A: :-) Nothing that relates to "meaning" falls too far away from where I would like to be. It is a very large landscape and growing very fast, so it is hard to be able to be everywhere at the same time :-)
    • Many people are working on trying to analyze literature. Analyzing news stories has been happening since the beginning of the statistical NLP revolution---sort of linked to the fact that the first million "trees" were curated using WSJ articles :-)
  • Q: Have you considered support for conlangs, such as Toki Pona?  The simplicity of Toki Pona seems like it would lend itself well to machine processing.

    • A: This is the first time I am hearing of conlangs and Toki Pona. I would love to know more about them to say more, but I cannot imagine any language not being able to use this framework.
    • conlangs are "constructed languages" such as Esperanto --- languages designed with intent, rather than evolved over centuries.  Toki Pona is a minimal conlang created in 2001, with a uniform syntax and small (<200 word) vocabulary.
    • Thanks for the information! I would love to look into it.
  • Q: Is there a roadmap of sorts for GRAIL?

    • A: 
    • Yes. I am now actually using real-world annotations on large corpora---both text and speech---and am validating the concept further. I am sure there will be some bumps along the way, and I am not saying that this is going to be a cure-all, but I feel (after spending most of my professional life building/using corpora) that this approach is very appealing. The speed of its development will depend on how many buy into the idea and pitch in, I guess.
  • Q: How can GRAIL be used by common people?

    • A: I don't think it can be used by common people at the very moment---partly because most common people have never heard of Emacs or Org mode. But if we can validate the concept, and if it does "grow legs" and walk out of the Emacs room into the larger universe, then absolutely, anyone who has any say about language could use it. And the contributions would be as useful as the consistency with which one can capture a certain phenomenon.
    • Every time you use a captcha these days, the algorithms used by the company storing the data get slightly better. What if we could democratize this concept? That could lead to fascinating things, like Wikipedia did for the sum total of human knowledge.


[00:00:00.000] Thank you for joining me today. I'm Sameer Pradhan from the Linguistic Data Consortium at the University of Pennsylvania and founder of . Today we'll be addressing research in computational linguistics, also known as natural language processing, a sub-area of artificial intelligence, with a focus on modeling and predicting complex linguistic structures from various signals. The work we present is limited to text and speech signals, but it can be extended to other signals. We propose an architecture, which we call GRAIL, that allows the representation and aggregation of such rich structures in a systematic fashion. I'll demonstrate a proof of concept for representing and manipulating data and annotations for the specific purpose of building machine learning models that simulate understanding. These technologies have the potential for impact in almost every conceivable field that generates and uses data.

[00:01:13.400] We process human language when our brains receive and assimilate various signals, which are then manipulated and interpreted within a syntactic structure. It's a complex process that I have simplified here for the purpose of comparison to machine learning. Recent machine learning models tend to require a large amount of raw, naturally occurring data and a varying amount of manually enriched data, commonly known as "annotations". Owing to the complex and numerous nature of linguistic phenomena, we have most often used a divide-and-conquer approach. The strength of this approach is that it allows us to focus on a single, or perhaps a few related, linguistic phenomena. The weaknesses are, first, that the universe of these phenomena keeps expanding as language itself evolves and changes over time, and second, that this approach requires an additional task of aggregating the interpretations, creating more opportunities for computer error. Our challenge, then, is to find the sweet spot that allows us to encode complex information without the use of manual annotation, or without the additional task of aggregation by computers.

[00:02:34.560] So what do I mean by "annotation"? In this talk the word annotation refers to the manual assignment of certain attributes to portions of a signal, which is necessary to perform the end task. For example, in order for an algorithm to accurately interpret a pronoun, it needs to know what that pronoun refers back to. We may find this task trivial; however, current algorithms repeatedly fail at it. So the complexities of understanding in computational linguistics require annotation. The word annotation itself is a useful example, because it reminds us that words have multiple meanings, just as I needed to define it in this context so that my message won't be misinterpreted. So, too, must annotators do this for algorithms through manual intervention.

[00:03:43.240] Learning from raw data (commonly known as unsupervised learning) poses limitations for machine learning. As I described, modeling complex phenomena needs manual annotations. The learning algorithm uses these annotations as examples to build statistical models. This is called supervised learning. Without going into too much detail, I'll simply note that the recent popularity of the concept of deep learning comes from that evolutionary step where we have learned to train models using trillions of parameters in ways that they can learn richer hierarchical structures from very large amounts of unannotated data. These models can then be fine-tuned, using varying amounts of annotated examples depending on the complexity of the task, to generate better predictions.

[00:04:39.680] As you might imagine, manually annotating complex linguistic phenomena can be a very task-specific, labor-intensive endeavor. For example, imagine if we were to go back through this presentation and connect all the pronouns with the nouns to which they refer. Even for a short 18-minute presentation, this would require hundreds of annotations. The models we build are only as good as the quality of the annotations we make. We need guidelines that ensure that the annotations are done by at least two humans who have substantial agreement with each other in their interpretations. We know that if we try to train a model using annotations that are very subjective, or have more noise, we will receive poor predictions. Additionally, there is the concern of introducing various unexpected biases into one's models. So annotation is really both an art and a science.

[00:05:44.400] In the remaining time, we will turn to two fundamental questions. First, how can we develop a unified representation of data and annotations that encompasses arbitrary levels of linguistic information? There is a long history of attempting to answer this first question. This history is documented in our recent article, and you can refer to that article. It will be on the website. It is as if we, as a community, have been searching for our own Holy Grail.

[00:06:22.520] The second question we will pose is what role might Emacs, along with Org mode, play in this process? Well, the solution itself may not be tied to Emacs. Emacs has built in capabilities that could be useful for evaluating potential solutions. It's also one of the most extensively documented pieces of software and the most customizable piece of software that I have ever come across, and many would agree with that.

[00:06:55.280] In order to approach this second question, we turn to the complex structure of language itself. At first glance, language appears to us as a series of words. Words form sentences, sentences form paragraphs, and paragraphs form completed text. If this were a sufficient description of the complexity of language, all of us would be able to speak and read at least ten different languages. We know it is much more complex than this. There is a rich, underlying recursive tree structure--in fact, many possible tree structures--which makes a particular sequence meaningful and many others meaningless. One of the better understood tree structures is the syntactic structure. While natural language has rich ambiguities and complexities, programming languages are designed to be parsed and interpreted deterministically. Emacs has been used for programming very effectively. So there is a potential for using Emacs as a tool for annotation. This would significantly improve our current set of tools.

[00:08:10.800] It is important to note that most of the annotation tools that have been developed over the past few decades have relied on graphical interfaces, even those used for enriching textual information. Most of the tools in current use are designed for an end user to add very specific, very restricted information. We have not really made use of the potential that an editor, or a rich editing environment like Emacs, can add to the mix. Emacs has long enabled the editing and manipulation of the complex embedded tree structures abundant in source code. So it's not difficult to imagine that it would have many capabilities that we need to represent actual language. In fact, it already does, with features that allow us to quickly navigate through sentences and paragraphs with only a few keystrokes, or to add various text properties to text spans to create overlays, to name but a few. Emacs figured out a way to handle Unicode, so you don't even have to worry about the complexity of managing multiple languages. It's built into Emacs. In fact, this is not the first time Emacs has been used for linguistic analysis. One of the breakthrough moments in natural language processing was the manual creation of syntactic trees for a 1-million-word collection of Wall Street Journal articles. This was back around 1992, before Java or graphical interfaces were common. The tool that was used to create that corpus was Emacs. It was created at UPenn, and is famously known as the Penn Treebank. '92 was about when the Linguistic Data Consortium was also established, and for about 30 years it has been creating various language-related resources.

[00:10:22.360] Org mode--in particular, the outlining mode, or rather the enhanced form of outline-mode--allows us to create rich outlines, attaching properties to nodes, and provides commands for easily customizing and sorting various pieces of information as per one's requirements. This can be a very useful tool. This enhanced form of outline-mode adds more power to Emacs. It provides commands for easily customizing and filtering information, while at the same time hiding unnecessary context. It also allows structural editing. This can be a very useful tool to enrich corpora where we are focusing on a limited set of phenomena. The two together allow us to create a rich representation that can simultaneously capture multiple possible sequences, capture details necessary to recreate the original source, allow the creation of hierarchical representations, and provide structural editing capabilities that can take advantage of the concept of inheritance within the tree structure. Together they allow local manipulations of structures, thereby minimizing data coupling. The concept of tags in Org mode complements the hierarchy part. Hierarchies can be very rigid, but with tags on hierarchies, we can have multifaceted representations. As a matter of fact, Org mode allows the tags themselves to have their own hierarchical structure, which further enhances the representational power. All of this can be done as a sequence of mostly functional data transformations, because most of the capabilities can be configured and customized. It is not necessary to do everything at once. Instead, it allows us to incrementally increase the complexity of the representation. Finally, all of this can be done in a plain-text representation, which comes with its own advantages.
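As a rough sketch of what such a representation could look like (the headline names, tags, and property keys here are purely illustrative assumptions, not a fixed GRAIL format), a sentence might become an Org subtree whose nodes carry inheritable tags and normally hidden properties:

```org
* S: I saw the moon with a telescope               :example:gold:
  :PROPERTIES:
  :SOURCE:  demo.txt
  :SENT_ID: 1
  :END:
** NP: I                                           :subject:
** VP: saw the moon with a telescope
*** NP: the moon
*** PP: with a telescope
```

Because tags are inherited by child headlines, a tag search on :example: collects the whole subtree, while the property drawer stays folded out of the way until it is needed.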

[00:12:45.480] Now let's take a simple example. This is a short video that I'll play. The sentence is "I saw the moon with a telescope," and let's just make a copy of the sentence. What we can do now is to see: what does this sentence comprise? It has a noun phrase "I," followed by the word "saw." Then "the moon" is another noun phrase, and "with a telescope" is a prepositional phrase. Now one thing that you might remember from grammar school or syntax is that there is a syntactic structure. In this particular case--because we know that the moon is not typically something that can hold a telescope--the seeing must be done by me, or "I," and the telescope must be in my hand; "I" am viewing the moon with a telescope. However, it is possible that in a different context the moon could be referring to an animated character in an animated series, and could actually hold the telescope. In that case the situation might be that I'm actually seeing the moon holding a telescope... I mean, the moon is holding the telescope, and I'm just seeing the moon holding the telescope. This is one of the oldest complex linguistic ambiguities that requires world knowledge, and it's called the PP attachment problem, where the prepositional phrase attachment can be ambiguous, and various contextual cues have to be used to resolve the ambiguity. So in this case, as you saw, both readings are technically true, depending on different contexts. So one thing we could do is just cut the tree and duplicate it, and then let's create another node and call it an "OR" node, because we are saying this is one of two interpretations. Now let's call one interpretation "a", and that interpretation is essentially the child of that node "a", and it says that the moon is holding the telescope.
Now we can create another representation "b" where we capture the other interpretation, where I am actually holding the telescope and watching the moon using it. So now we have two separate interpretations in the same structure, and we were able to do this with very quick keystrokes. While we are at it, let's add another interesting thing to this node that represents "I": it can be "He." It can be "She." It can be "the children," or it can be "the people." Basically, any entity that has the capability to "see" can be substituted in this particular node. Let's see what we have here now. We are just getting sort of a zoomed-out view of the entire structure we created, and essentially you can see that by just using a few keystrokes, we were able to capture two different interpretations of a simple sentence, and we were also able to add these alternate pieces of information that could help machine learning algorithms generalize better. All right.
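The two readings demonstrated above might be written out along these lines (a sketch only: the "OR" node is from the demo, while the bracketed trees use standard Penn-Treebank-style notation rather than any official GRAIL syntax):

```org
* OR: I saw the moon with a telescope
** Reading a
   The moon is holding the telescope.
   (S (NP I) (VP saw (NP (NP the moon) (PP with a telescope))))
** Reading b
   I am holding the telescope and viewing the moon with it.
   (S (NP I) (VP saw (NP the moon) (PP with a telescope)))
```

Both readings live in one outline, so structural editing commands can rearrange, duplicate, or refile either one without losing the other.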

[00:17:36.240] Now, let's look at the next thing. So in a sense, we can use this power of functional data structures to represent various potentially conflicting structural readings of that piece of text. In addition to that, we can also create more texts, each with a different structure, and have them all in the same place. This allows us to address the interpretation of a static sentence that might be occurring in the world, while simultaneously inserting information that would add more value to it. This makes the enrichment process very efficient. Additionally, we can envision a power user of the future, or present, who can not only annotate a span, but also edit the information in situ in a way that would help machine learning algorithms generalize better by making more efficient use of the annotations. So together, Emacs and Org mode can speed up the enrichment of the signals in a way that allows us to focus on certain aspects and ignore others. An extremely complex landscape of rich structures can be captured consistently, in a fashion that allows computers to understand language. We can then build tools to enhance the tasks that we do in our everyday life. YAMR is the acronym, or the file type or specification, that we are creating to capture this new rich representation.

[00:19:17.680] We'll now look at an example of spontaneous speech that occurs in spoken conversations. Conversations frequently contain errors in speech: interruptions, disfluencies, verbal sounds such as a cough or laugh, and other noises. In this sense, spontaneous speech is similar to a functional data stream. We cannot take back words that come out of our mouths, but we tend to make mistakes, and we correct ourselves as soon as we realize that we have misspoken. This process manifests through a combination of a handful of mechanisms, including immediate correction after an error, and we do this unconsciously. Computers, on the other hand, must be taught to understand these cases. What we see here is an example document or outline, or part of a document, that illustrates various different aspects of the representation. We don't have a lot of time to go through many of the details. I'm planning on making some videos, or ascii cinemas, that I'll be posting, and if you're interested, you can go through those. The idea here is to try a slightly more complex use case. But again, given the time constraint and the amount of information that needs to fit on the screen, this may not be very informative, but at least it will give you some idea of what is possible. In this particular case, what you're seeing is that there is a sentence, "What I'm I'm tr- telling now." Essentially, there is a repetition of the word "I'm", and then there is a partial word where somebody tried to say "telling", but started saying "tr-", and then corrected themselves and said "telling now." So in this case, you see, we can capture words, or a sequence of words, or a sequence of tokens.
An interesting thing to note is that in NLP, we sometimes have to break words that don't contain spaces into two separate tokens, especially contractions like "I'm", so that the syntactic parser gets two separate nodes. You can see that here. What this view shows is that each of the nodes in the sentence, or in the representation, can have a lot of different properties attached to it, and these properties are typically hidden, like you saw in the earlier slide. But you can make use of all these properties to do various kinds of searches and filtering. And on the right-hand side here--this is actually not legitimate syntax--are descriptions of what each of these represents. All the information is also available in the article. You can see how much rich context you can capture. This is just a closer snapshot of the properties on the node, and you can see we can mark things like whether the word is a token or not, or that it's incomplete; some words might need to be filtered out for parsing, and we can say this with PARSE_IGNORE; some words are restart markers, so we can add a RESTART_MARKER; and sometimes some of these might have durations. Things like that.
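For instance, the partial word "tr-" from the utterance above could carry a property drawer like this (PARSE_IGNORE and RESTART_MARKER are names mentioned in the talk; the other keys and the values are illustrative assumptions):

```org
** tr-
   :PROPERTIES:
   :TOKEN:          t
   :INCOMPLETE:     t
   :PARSE_IGNORE:   t
   :RESTART_MARKER: t
   :DURATION:       0.24
   :END:
```

A parser-preparation pass could then skip every node with :PARSE_IGNORE: t, while the original signal remains fully recoverable from the outline.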

[00:23:32.000] The other fascinating thing about this representation is that you can edit properties in the column view. Suddenly, you have a tabular data structure combined with the hierarchical data structure. You may not be able to see it here, but what has also happened is that some of the tags have been inherited from the earlier nodes, so you get a much fuller picture of things. Essentially, you can filter out the things that you want to process, process them, and then reintegrate them into the whole.
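The inheritance mentioned above can be sketched like this: a node's effective properties are its own merged over those of its ancestors, the way Org mode property inheritance works. The tree and property names below are hypothetical.

```python
# Sketch of Org-style property inheritance over a node tree.
def effective_props(node, tree):
    """Walk up the parent chain and merge properties,
    with the node's own values overriding its ancestors'."""
    chain = []
    while node is not None:
        chain.append(node)
        node = tree[node].get("parent")
    props = {}
    for n in reversed(chain):           # root first, so leaf wins on conflicts
        props.update(tree[n].get("props", {}))
    return props

tree = {
    "utterance": {"parent": None,        "props": {"SPEAKER": "A"}},
    "word-3":    {"parent": "utterance", "props": {"INCOMPLETE": "t"}},
}
print(effective_props("word-3", tree))
# {'SPEAKER': 'A', 'INCOMPLETE': 't'}
```

This is what makes the column view so useful: each row shows not just a node's own properties but everything it inherits from above.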

[00:24:20.280] So, in conclusion, today we have proposed and demonstrated the use of an architecture (GRAIL), which allows the representation, manipulation, and aggregation of rich linguistic structures in a systematic fashion. We have shown how GRAIL advances the tools available for building machine learning models that simulate understanding. Thank you very much for your time and attention today. My contact information is on this slide. If you are interested in an additional example that demonstrates the representation of speech and written text together, please continue watching. Otherwise, you can stop here and enjoy the rest of the conference.

[00:25:15.280] Welcome to the bonus material. I'm glad some of you stuck around. We are now going to examine an instance of speech and text signals together that produce multiple layers. When we take a spoken conversation and use the best language processing models available, we suddenly hit a hard spot, because the tools are typically not trained to filter out the unnecessary cruft in order to automatically interpret the part of what is being said that is actually relevant. Over time, language researchers have created many interdependent layers of annotations, yet the assumptions underlying them are seldom the same. Piecing together such related but disjointed annotations and their predictions poses a huge challenge. This is another place where we can leverage the data model underlying the Emacs editor, along with the structural editing capabilities of Org mode, to improve current tools. Let's take this very simple-looking utterance: "Um {lipsmack} and that's it. ({laugh})" What you are seeing here is a transcript of an audio signal that contains a lip smack and a laugh, as well as the interjection "Um". So this has a few interesting noises and specific things that are illustrative of how we are going to represent it.

[00:27:20.480] Okay. So let's say you want a syntactic analysis of this sentence or utterance. One common technique is just to remove the cruft: write some rules, clean up the utterance, make it look like proper English, then tokenize it, and basically just use standard tools to process it. But in that process, you end up eliminating valid pieces of signal that have meaning to others studying different phenomena of language. Here you have the rich transcript, the input to the syntactic parser. As you can see, there is a little tokenization happening, where a space is inserted between "that" and the contracted is ('s), and between "it" and the period. The output of the syntactic parser is shown below, which (surprise) is an S-expression. Like I said, parse trees were S-expressions when they were created, and still largely are when they are used, so most of the viewers here should not have much problem reading them. You can see the tree structure of the syntactic parse here.
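For readers who want to work with such output programmatically, here is a minimal S-expression reader. The tree string below is illustrative, not the parser's actual output for this utterance.

```python
# Sketch: reading a Penn-Treebank-style S-expression parse tree
# into nested Python lists.
def parse_sexp(s):
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def walk(it):
        out = []
        for tok in it:
            if tok == "(":
                out.append(walk(it))    # recurse into a subtree
            elif tok == ")":
                return out              # close the current subtree
            else:
                out.append(tok)         # label or word
        return out
    return walk(iter(tokens))[0]

tree = parse_sexp("(S (CC and) (NP (DT that)) (VP (VBZ 's) (NP (PRP it))))")
print(tree[0])  # 'S' -- the root label
```

Each subtree becomes a list whose first element is the constituent label, mirroring the parenthesized structure directly.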

[00:28:39.280] Now let's say you want to integrate the phonetic information, or phonetic layer, that is in the audio signal, and do some analysis. This requires a few steps. First, you need to align the transcript with the audio. This process is called forced alignment: you already know what the transcript is, you have the audio, and you can get a good alignment using both pieces of information. This is the technique typically used to create training data for automatic speech recognizers. One interesting thing is that in order to do forced alignment, you have to keep the non-speech events in the transcript, because they consume some of the audio signal. If you drop them, the alignment process doesn't do a good job, because it needs to align every part of the signal with something: a pause, silence, noise, or words. Interestingly, punctuation really doesn't factor in, because we don't speak in punctuation. So one of the things you need to do is remove most of the punctuation, although, as you'll see, some punctuation marks can, or must, be kept.
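The preprocessing step described above can be sketched as follows. This is a simplified assumption about the input format: noise events are assumed to be wrapped in curly braces, and only trailing sentence punctuation is stripped.

```python
import re

# Sketch: preparing a transcript for forced alignment.  Non-speech
# events like {lipsmack} and {laugh} are kept, because they consume
# audio; punctuation is removed, because we don't speak in punctuation.
def prep_for_alignment(transcript):
    out = []
    for tok in transcript.split():
        if tok.startswith("{") and tok.endswith("}"):
            out.append(tok)                        # keep noise events
        else:
            word = re.sub(r"[.,!?;:]+$", "", tok)  # strip trailing punctuation
            if word:
                out.append(word)
    return out

print(prep_for_alignment("Um {lipsmack} and that's it. {laugh}"))
# ['Um', '{lipsmack}', 'and', "that's", 'it', '{laugh}']
```

Note that the apostrophe in "that's" survives: it is part of the word's spelling, not sentence punctuation.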

[00:30:12.600] The other thing is that the alignment has to be done before tokenization, because tokenization impacts pronunciation. To show an example: here you see "that's". As one word, it has a slightly different pronunciation than as two words, "that is". So if you split the words in order for the syntactic parser to process them, you would end up with the wrong phonetic analysis. And if you run the phonetic analysis but don't know how to integrate it with the tokenized syntax, that can be pretty tricky. A lot of the time, people write one-off pieces of code to handle these cases, but the idea here is to have a general architecture that seamlessly integrates all these pieces. Then you do the syntactic parsing of the remaining tokens, align the data and the two annotations, and then integrate the two layers. Once that is done, you can do all kinds of interesting analyses, test various hypotheses, and generate statistics; without it, you are only dealing with one or the other part.
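One simple way to reconcile the two layers, sketched below, is to record which parser tokens each aligned word expands to, and project the word's time span onto them. The data and mapping are hypothetical, not GRAIL's actual representation.

```python
# Sketch: the aligner times whole words ("that's"), but the parser
# sees split tokens ("that", "'s").  Project each word's time span
# onto the tokens it was split into.
aligned_words = [
    {"word": "that's", "start": 0.42, "end": 0.71},
]
word_to_tokens = {"that's": ["that", "'s"]}

def project_times(aligned_words, word_to_tokens):
    """Give every token the time span of the word it came from."""
    rows = []
    for w in aligned_words:
        for tok in word_to_tokens.get(w["word"], [w["word"]]):
            rows.append({"token": tok, "start": w["start"], "end": w["end"]})
    return rows

for row in project_times(aligned_words, word_to_tokens):
    print(row)
```

A finer-grained scheme could split the span at a phone boundary, but even this coarse projection lets the syntactic and phonetic layers be queried together.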

[00:31:42.880] Let's take a quick look at what each of the layers involved looks like. This is "Um {lipsmack}, and that's it. {laugh}". This is the transcript, and on the right-hand side, you see the same thing listed vertically in a column. You'll see why in just a second. There are some rows that are empty and some rows that are wider than the others, and we'll see why. Next is the tokenized sentence, where a space has been added between the tokens "that" and the apostrophe "s" ('s), and between "it" and the period. You see on the right-hand side that the tokens have attributes. There is a token index--the tokens are numbered 0 through 5--and each token has a start and end character. Each space also has a start and end character, and is represented by "sp". The other things that we removed, like "{LS}" for "{lipsmack}" and "{LG}" for "{laugh}", are shown grayed out, and you'll see why in a little bit. This is what the forced alignment tool produces. Basically, it takes the transcript--a version with slightly different symbols, because different tools use different symbols and configurations--and uses it to get a time alignment with phones. This column shows the phones: for example, "and" has been aligned with these phones, and the start and end values are the timestamps it has been aligned to. Interestingly, sometimes there is no pause or time duration between some words, and those are highlighted in gray here. This space, for instance, does not have any temporal content, whereas this other space has some duration. The spaces that have some duration are captured, while the others are the ones that were left out in the earlier diagram.
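The distinction between spaces with and without duration amounts to a simple filter over the aligner's segments. The segment values below are made up for illustration.

```python
# Sketch: keep only aligner segments that carry real audio time;
# zero-duration spaces are dropped, audible pauses are kept.
segments = [
    {"sym": "and",    "start": 1.10, "end": 1.32},
    {"sym": "sp",     "start": 1.32, "end": 1.32},  # zero duration: dropped
    {"sym": "that's", "start": 1.32, "end": 1.71},
    {"sym": "sp",     "start": 1.71, "end": 1.94},  # audible pause: kept
]

kept = [s for s in segments if s["end"] > s["start"]]
print([s["sym"] for s in kept])  # ['and', "that's", 'sp']
```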

[00:34:31.320] The aligner actually produces multiple files. One of the files has a slightly different variation on the same information: in this case, you can see that the punctuation is missing, and it is deliberately missing, because there is no time associated with it, and you see that these are not the tokenized words. This now gives you a full table. You can't really look into it very carefully here, but we can focus on the part that looks like a legible, properly written sentence, process it, and reincorporate it back into the whole. So if somebody wants to look at, for example, how many pauses the person made while they were talking, they can actually measure the number and duration of those pauses, and make connections between them and the rich syntactic structure being produced. In order to do that, you have to get these layers to align with each other, and this table is just a tabular representation of the information that we'll be storing in the YAMR file. Congratulations! You have reached the end of this demonstration. Thank you for your time and attention.
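Once the layers are aligned, a question like "how many pauses did the speaker make, and how long were they?" becomes a simple query over the merged table. The sketch below uses made-up segment values, not data from the talk.

```python
# Sketch: counting and timing pauses ("sp" segments) in an
# aligned-segment table.
segments = [
    {"sym": "Um",     "start": 0.00, "end": 0.35},
    {"sym": "sp",     "start": 0.35, "end": 0.80},  # pause
    {"sym": "and",    "start": 0.80, "end": 1.05},
    {"sym": "sp",     "start": 1.05, "end": 1.15},  # pause
    {"sym": "that's", "start": 1.15, "end": 1.55},
]

pauses = [s for s in segments if s["sym"] == "sp"]
total = sum(s["end"] - s["start"] for s in pauses)
print(len(pauses), round(total, 2))  # 2 0.55
```

Because the same table also carries token indices and syntactic links, such pause statistics can be correlated with the syntactic structure directly.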

Questions or comments? Please e-mail
