+++
title = "Xml Parser"
date = "2023-07-27"
author = "John Costa"
toc = true
tags = ["Software"]
+++

---

- Git Repo: [Here](https://github.com/JohnCosta27/GoXmlParser)

During my time at university I took a compilers module. It turned out to be my favourite module of all my three years, and I had the privilege of being taught by [Elizabeth Scott](https://pure.royalholloway.ac.uk/en/persons/elizabeth-scott), one of the creators of [GLL Parsers](https://www.cs.rhul.ac.uk/research/languages/csle/GLLparsers.html), a parsing algorithm that can parse any context-free grammar. It's very cool.

I did very well in this module and I really enjoyed it, but it was heavily theoretical, with little to no practical work. This is fine: it taught me plenty of parsing algorithms, and everything I know about grammars and languages. So I decided to give it a go and actually write one!

# XML (like)

I decided to parse an XML-like language. You are probably most familiar with XML from its use in HTML.

```html
<greeting>World</greeting>
```

You have an opening tag, some content in the middle, and a closing tag. However, it can get more complicated than that.

```html
<note author="John">
  Some text here
  <nested>dsamkdsamdkas</nested>
  <selfclosing />
</note>
```

As you can see, we can have nested structures, attributes on tags, and self-closing tags.

# Building the Parser

I used Golang to build the parser; it's a language that is familiar to me, extremely performant, and a delight to use.

I decided to use recursive descent to build the parser, as XML is fairly easy to parse this way, and it's a less complicated technique with which to start my compiler journey. From my compilers module I knew how to generate recursive descent parsing functions for any grammar, even blindfolded, so this shouldn't be hard... Right?

## What I learnt

### Theory != Practice

Even though I knew how to generate RDPs (recursive descent parsers), I was always given a grammar to start from. This was a different problem: I had to create a grammar that is not left recursive (an RDP limitation), and that would parse without ambiguity. This turned out harder than I thought, because it's a task that can be done in millions of different ways. In the end I settled for the following grammar.

```
Element ::= OpenTag ElementSuffix
OpenTag ::= '<' NAME Attributes
Attributes ::= NAME '=' STRING Attributes | EPSILON
ElementSuffix ::= '/>' | Content CloseTag
Content ::= DATA Content | Element Content | EPSILON
CloseTag ::= '</' NAME '>'

// Literals
DATA = All the content between '>' and '<'
NAME = A continuous string of alphabetical characters
STRING = Content inside (and including) ""
EPSILON = null
```

This seems like a good grammar. It allows for all the XML structures we are used to seeing, and it also seemed to have the right mix of parser and lexer complexity. I'll explain what I mean by this.

### Lexer vs Parser

The lexer (lexical analysis), or tokenizer, takes the raw string input and turns it into meaningful tokens. We saw the tokens I used above:

- <
- \>
- />
- </
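
Given tokens like these, each grammar rule becomes its own parsing function that peeks at the next token to decide which alternative to take. Below is a rough sketch of that idea in Go; the token kinds, struct names, and `parseX` functions are simplified for illustration and are not the exact ones used in the repository.

```go
package main

import "fmt"

// Illustrative token kinds for the grammar's terminals (names are assumptions).
type TokenKind int

const (
	EOF        TokenKind = iota
	OpenAngle            // '<'
	CloseAngle           // '>'
	SelfClose            // '/>'
	CloseStart           // '</'
	Equals               // '='
	Name                 // NAME
	Str                  // STRING
	Data                 // DATA
)

type Token struct {
	Kind  TokenKind
	Value string
}

type Parser struct {
	tokens []Token
	pos    int
}

func (p *Parser) peek() TokenKind {
	if p.pos >= len(p.tokens) {
		return EOF
	}
	return p.tokens[p.pos].Kind
}

// eat consumes the next token if it has the expected kind.
func (p *Parser) eat(kind TokenKind) bool {
	if p.peek() != kind {
		return false
	}
	p.pos++
	return true
}

// Element ::= OpenTag ElementSuffix
func (p *Parser) parseElement() bool {
	return p.parseOpenTag() && p.parseElementSuffix()
}

// OpenTag ::= '<' NAME Attributes
func (p *Parser) parseOpenTag() bool {
	return p.eat(OpenAngle) && p.eat(Name) && p.parseAttributes()
}

// Attributes ::= NAME '=' STRING Attributes | EPSILON
// The right recursion becomes a loop; EPSILON means "stop when the lookahead isn't a NAME".
func (p *Parser) parseAttributes() bool {
	for p.peek() == Name {
		if !(p.eat(Name) && p.eat(Equals) && p.eat(Str)) {
			return false
		}
	}
	return true
}

// ElementSuffix ::= '/>' | Content CloseTag, where CloseTag ::= '</' NAME '>'
func (p *Parser) parseElementSuffix() bool {
	if p.eat(SelfClose) {
		return true
	}
	return p.parseContent() && p.eat(CloseStart) && p.eat(Name) && p.eat(CloseAngle)
}

// Content ::= DATA Content | Element Content | EPSILON
func (p *Parser) parseContent() bool {
	for {
		switch p.peek() {
		case Data:
			p.pos++
		case OpenAngle:
			if !p.parseElement() {
				return false
			}
		default:
			return true // EPSILON
		}
	}
}

func main() {
	// Tokens a lexer might produce for: <greeting lang="en">World</greeting>
	p := &Parser{tokens: []Token{
		{OpenAngle, "<"}, {Name, "greeting"}, {Name, "lang"}, {Equals, "="}, {Str, `"en"`},
		{Data, "World"}, {CloseStart, "</"}, {Name, "greeting"}, {CloseAngle, ">"},
	}}
	fmt.Println(p.parseElement()) // true: the token stream matches the grammar
}
```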