FEAT: Using hugo
102
content/projects/XmlParser.md
Normal file
@@ -0,0 +1,102 @@
+++
title = "Xml Parser"
date = "2023-07-27"
author = "John Costa"
toc = true
tags = ["Software"]
+++

---

- Git Repo: [Here](https://github.com/JohnCosta27/GoXmlParser)

During my time at university I took a compilers module. It turned out to be my favourite module of all three years, and I had the privilege of being taught by [Elizabeth Scott](https://pure.royalholloway.ac.uk/en/persons/elizabeth-scott), one of the creators of [GLL Parsers](https://www.cs.rhul.ac.uk/research/languages/csle/GLLparsers.html), a parsing algorithm that can parse any context-free grammar. It's very cool.

I did very well in this module, and I really enjoyed it, but it was heavily theoretical, with little to no practical work. That is fine: it taught me a lot of parsing algorithms, and everything I know about grammars and languages.

So I decided to give it a go and actually write a parser!

# XML (like)

I decided to parse an XML-like language. You are probably most familiar with this style of syntax from HTML.

```html
<hello>World</hello>
```

You have an opening tag, some content in the middle, and a closing tag. However, it can get more complicated than that.

```html
<hello hello="world">
  Some text here
  <a></a>
  dsamkdsamdkas

  <b attribute="dksoadisma" />
</hello>
```

As you can see, we can have nested structures, attributes on tags, and self-closing tags.

# Building the Parser

I used Golang to build the parser. It's a language that is familiar to me, extremely performant and a delight to use.

I decided to use recursive descent to build the parser, as XML is fairly easy to parse this way, and it is a gentle technique to start my compiler journey with. From my compilers module I knew how to write the recursive descent parsing functions for a given grammar even if I was blindfolded, so this shouldn't be hard... Right?

## What I learnt

### Theory != Practice

Even though I knew how to generate recursive descent parsers (RDPs), I was always given a grammar to start from. This was a different problem: I had to create a grammar that was not left recursive (an RDP limitation) and that would parse without ambiguity.

This turned out harder than I thought, because it's a task that can be done in countless different ways. In the end I settled on the following grammar.

```
Element ::= OpenTag ElementSuffix
OpenTag ::= '<' NAME Attributes
Attributes ::= NAME '=' STRING Attributes | EPSILON
ElementSuffix ::= '/>' | '>' Content CloseTag
Content ::= DATA Content | Element Content | EPSILON
CloseTag ::= '</' NAME '>'

// Literals
DATA = All the content between '>' and '<'
NAME = A continuous string of alphabetical characters
STRING = Content inside (and including) ""
EPSILON = the empty string
```

This seemed like a good grammar. It allows for all the XML-like structures shown earlier, and it also seemed to have the right mix of parser and lexer complexity. I'll explain what I mean by this.
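
To make the grammar concrete, here is a minimal Go sketch of how each rule could map onto recursive descent code. It is not the code from the repo: the `Token`, `Node`, `Parser` and `Parse` names are hypothetical, the tokens are assumed to come from a separate lexer, and a `>` token is assumed to close the opening tag before `Content`. The tail-recursive `Attributes` and `Content` rules are written as loops.

```go
package main

import "fmt"

// Token is a hypothetical lexer output ("<", ">", "/>", "</", "=", "NAME", "STRING", "DATA").
type Token struct{ Kind, Value string }

// Node is a hypothetical AST node for one element.
type Node struct {
	Name       string
	Attributes map[string]string
	Text       []string
	Children   []*Node
}

type Parser struct {
	tokens []Token
	pos    int
}

// peek returns the kind of the next token, or "EOF" at the end of input.
func (p *Parser) peek() string {
	if p.pos < len(p.tokens) {
		return p.tokens[p.pos].Kind
	}
	return "EOF"
}

// expect consumes each listed kind in order and returns the last token's value.
func (p *Parser) expect(kinds ...string) (string, error) {
	value := ""
	for _, kind := range kinds {
		if p.peek() != kind {
			return "", fmt.Errorf("token %d: expected %s, found %s", p.pos, kind, p.peek())
		}
		value = p.tokens[p.pos].Value
		p.pos++
	}
	return value, nil
}

// element implements Element ::= OpenTag ElementSuffix.
func (p *Parser) element() (*Node, error) {
	// OpenTag ::= '<' NAME Attributes
	name, err := p.expect("<", "NAME")
	if err != nil {
		return nil, err
	}
	node := &Node{Name: name, Attributes: map[string]string{}}
	// Attributes ::= NAME '=' STRING Attributes | EPSILON
	for p.peek() == "NAME" {
		key, _ := p.expect("NAME") // cannot fail: peek just saw NAME
		value, err := p.expect("=", "STRING")
		if err != nil {
			return nil, err
		}
		node.Attributes[key] = value
	}
	// ElementSuffix ::= '/>' | '>' Content CloseTag
	if p.peek() == "/>" {
		p.pos++
		return node, nil
	}
	if _, err := p.expect(">"); err != nil {
		return nil, err
	}
	// Content ::= DATA Content | Element Content | EPSILON
	for p.peek() == "DATA" || p.peek() == "<" {
		if p.peek() == "DATA" {
			text, _ := p.expect("DATA") // cannot fail: peek just saw DATA
			node.Text = append(node.Text, text)
			continue
		}
		child, err := p.element()
		if err != nil {
			return nil, err
		}
		node.Children = append(node.Children, child)
	}
	// CloseTag ::= '</' NAME '>' (tag names are not compared in this sketch)
	if _, err := p.expect("</", "NAME", ">"); err != nil {
		return nil, err
	}
	return node, nil
}

// Parse parses one root element; trailing tokens are ignored in this sketch.
func Parse(tokens []Token) (*Node, error) {
	p := &Parser{tokens: tokens}
	return p.element()
}

func main() {
	root, err := Parse([]Token{{"<", "<"}, {"NAME", "d"}, {"/>", "/>"}})
	fmt.Println(root.Name, err)
}
```

Each non-terminal only ever needs to look at the next token to decide which alternative to take, which is what makes the grammar comfortable for recursive descent.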

### Lexer vs Parser

The lexer (lexical analyser), or tokenizer, takes the raw input string and turns it into meaningful tokens. We saw the tokens I used in the grammar above:

- `<`
- `>`
- `/>`
- `</`
- STRING
- DATA
- NAME

Some of these are trivial, such as the top 4 on the list, but the rest aren't so easy. STRING is an example of a slightly more complex token.

When I was first starting, I assumed all the hard work would have to be done by the parser, and the lexer would basically just pick up single runs of characters. This led me down a rabbit hole of treating basically single characters as tokens (including spaces!), and writing progressively more complicated grammar rules (and therefore a more complicated parser), just to keep the lexer simple.

But I quickly realised that regular expressions are a tool, and I should use them well. That's how I ended up with more complex tokens, and was able to drastically simplify my grammar rules.
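
As a rough illustration of letting the lexer do more of the work, here is a minimal Go sketch of a regex-based tokenizer for the tokens listed above. It is not the lexer from the repo: the `Token` type, the `Lex` function and the exact regular expressions are assumptions, as is the choice to emit `=` as its own token and to drop whitespace-only DATA.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Token kinds follow the list above, plus "=" from the grammar.
type Token struct{ Kind, Value string }

var (
	nameRe   = regexp.MustCompile(`^[A-Za-z]+`) // NAME: alphabetical characters
	stringRe = regexp.MustCompile(`^"[^"]*"`)   // STRING: quotes included
	dataRe   = regexp.MustCompile(`^[^<]+`)     // DATA: everything up to the next '<'
)

// Lex is a hypothetical tokenizer. It tracks whether it is inside a tag,
// because DATA is only valid between '>' and '<'.
func Lex(input string) ([]Token, error) {
	var tokens []Token
	inTag := false
	// emit records a token and consumes its text from the input.
	emit := func(kind, value string) {
		tokens = append(tokens, Token{kind, value})
		input = input[len(value):]
	}
	for len(input) > 0 {
		if !inTag {
			switch {
			case strings.HasPrefix(input, "</"):
				emit("</", "</")
				inTag = true
			case strings.HasPrefix(input, "<"):
				emit("<", "<")
				inTag = true
			default:
				data := dataRe.FindString(input)
				input = input[len(data):]
				// Assumption: whitespace-only DATA is dropped, the rest is trimmed.
				if text := strings.TrimSpace(data); text != "" {
					tokens = append(tokens, Token{"DATA", text})
				}
			}
			continue
		}
		// Inside a tag, whitespace only separates tokens.
		input = strings.TrimLeft(input, " \t\r\n")
		switch {
		case input == "":
			// Nothing left; the loop condition ends the scan.
		case strings.HasPrefix(input, "/>"):
			emit("/>", "/>")
			inTag = false
		case strings.HasPrefix(input, ">"):
			emit(">", ">")
			inTag = false
		case strings.HasPrefix(input, "="):
			emit("=", "=")
		case nameRe.MatchString(input):
			emit("NAME", nameRe.FindString(input))
		case stringRe.MatchString(input):
			emit("STRING", stringRe.FindString(input))
		default:
			return nil, fmt.Errorf("unexpected character %q inside a tag", input[0])
		}
	}
	return tokens, nil
}

func main() {
	tokens, err := Lex(`<hello hello="world">Some text here<b attribute="x" /></hello>`)
	fmt.Println(tokens, err)
}
```

With whole names, strings and text runs arriving as single tokens, the grammar never has to mention individual characters or spaces.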

# Semantics

The last thing I want to talk about is my semantic analysis, which was by far the easiest part of the project, simply because all I have to do is make sure that each opening tag has the correct closing tag associated with it.

This is implemented using a stack: you push as you see opening tags, pop as you see closing ones, and compare. If you see a closing tag that does not correspond to the top item of the stack, then there is something wrong.
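
The stack check can be sketched in a few lines of Go. This is an illustration, not the repo's implementation: the `Tag` type and `CheckTags` function are hypothetical, self-closing tags are assumed to be left out of the sequence, and the final "never closed" check is an addition of mine.

```go
package main

import "fmt"

// Tag is a hypothetical record of one opening or closing tag, in document order.
type Tag struct {
	Name    string
	Closing bool
}

// CheckTags pushes opening tags, pops on closing tags, and compares names.
func CheckTags(tags []Tag) error {
	var stack []string
	for _, tag := range tags {
		if !tag.Closing {
			stack = append(stack, tag.Name)
			continue
		}
		if len(stack) == 0 {
			return fmt.Errorf("closing tag </%s> has no opening tag", tag.Name)
		}
		top := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if top != tag.Name {
			return fmt.Errorf("expected </%s>, found </%s>", top, tag.Name)
		}
	}
	// Anything left on the stack was opened but never closed.
	if len(stack) > 0 {
		return fmt.Errorf("tag <%s> was never closed", stack[len(stack)-1])
	}
	return nil
}

func main() {
	// <hello><a></a></hello> is fine; <hello></a> is not.
	fmt.Println(CheckTags([]Tag{{"hello", false}, {"a", false}, {"a", true}, {"hello", true}}))
	fmt.Println(CheckTags([]Tag{{"hello", false}, {"a", true}}))
}
```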

# Future Work

I am mostly finished with this project. The only thing I would like to add is translation from XML to other languages. This step would be fairly easy because the AST is already built, so I simply have to walk it.
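
To give a feel for what such a walk could look like, here is a small Go sketch that turns a tree into an S-expression. The `Node` type and `ToSExpr` function are hypothetical and simplified (attributes are left out, and text is printed before child elements rather than interleaved), and the target format is only an example.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, simplified AST node.
type Node struct {
	Name     string
	Text     []string
	Children []*Node
}

// ToSExpr walks the tree depth-first and renders (name "text" (child ...)).
func ToSExpr(n *Node) string {
	var b strings.Builder
	b.WriteString("(" + n.Name)
	for _, text := range n.Text {
		fmt.Fprintf(&b, " %q", text)
	}
	for _, child := range n.Children {
		b.WriteString(" " + ToSExpr(child))
	}
	b.WriteString(")")
	return b.String()
}

func main() {
	// Tree for <hello>World<a></a></hello>
	tree := &Node{Name: "hello", Text: []string{"World"}, Children: []*Node{{Name: "a"}}}
	fmt.Println(ToSExpr(tree))
}
```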

For another project, though, I would like to go into the world of programming languages instead of markup languages. To start with, I could write a compiler for a language that compiles down to Golang, and use the already amazing Golang tools to compile the resulting program. But this is for another time.

If you made it this far, thank you for reading. You can always email me at johncosta027@gmail.com with any inquiries, or check out the GitHub repo linked at the top of the article if you have any further comments.
6
content/projects/_index.md
Normal file
@@ -0,0 +1,6 @@
+++
title = 'Projects'
toc = true
+++

A list of my various projects
42
content/projects/huffmanz.md
Normal file
@@ -0,0 +1,42 @@
+++
title = "Huffmanz - Huffman Encoding Implementation in Zig"
date = "2023-09-21"
author = "John Costa"
toc = true
tags = ["Software"]
+++

---

- Link to Repo: [https://github.com/JohnCosta27/Huffmanz](https://github.com/JohnCosta27/Huffmanz)
- Link to Video: [https://www.youtube.com/watch?v=D5l5GUuNXB8&ab_channel=JohnCosta](https://www.youtube.com/watch?v=D5l5GUuNXB8&ab_channel=JohnCosta)

[Huffman Encoding](https://en.wikipedia.org/wiki/Huffman_coding) is an algorithm for compressing text. It uses a binary tree to shorten the number of bits needed to represent each character, giving the most frequent characters the shortest codes. It's one of the first algorithms I learned in Computer Science, when I was 14 years old and in Year 9. But until recently it hadn't crossed my mind again.
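
As a quick illustration of the idea (this is not the Zig code from the repo), here is a minimal sketch written in Go. The `node`, `minHeap` and `CodeLengths` names are hypothetical. It builds the tree by repeatedly merging the two least frequent nodes and reports how many bits each character's code would need.

```go
package main

import (
	"container/heap"
	"fmt"
)

// node is a leaf (one character) or an internal node with two children.
type node struct {
	freq        int
	char        rune
	left, right *node
}

// minHeap orders nodes by frequency, lowest first.
type minHeap []*node

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].freq < h[j].freq }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(*node)) }
func (h *minHeap) Pop() any {
	old := *h
	n := old[len(old)-1]
	*h = old[:len(old)-1]
	return n
}

// CodeLengths returns the Huffman code length, in bits, for each character.
func CodeLengths(text string) map[rune]int {
	freqs := map[rune]int{}
	for _, c := range text {
		freqs[c]++
	}
	h := &minHeap{}
	for c, f := range freqs {
		heap.Push(h, &node{freq: f, char: c})
	}
	lengths := map[rune]int{}
	if h.Len() == 0 {
		return lengths
	}
	if h.Len() == 1 {
		// Assumption: a lone symbol still gets a one-bit code.
		lengths[(*h)[0].char] = 1
		return lengths
	}
	// Merge the two rarest nodes until a single tree remains.
	for h.Len() > 1 {
		a := heap.Pop(h).(*node)
		b := heap.Pop(h).(*node)
		heap.Push(h, &node{freq: a.freq + b.freq, left: a, right: b})
	}
	// A leaf's depth in the tree is the length of its code.
	var walk func(n *node, depth int)
	walk = func(n *node, depth int) {
		if n.left == nil {
			lengths[n.char] = depth
			return
		}
		walk(n.left, depth+1)
		walk(n.right, depth+1)
	}
	walk((*h)[0], 0)
	return lengths
}

func main() {
	text := "aaaabbc"
	lengths := CodeLengths(text)
	bits := 0
	for _, c := range text {
		bits += lengths[c]
	}
	fmt.Printf("%d bits encoded vs %d bits at one byte per character\n", bits, 8*len(text))
}
```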

Then I found out about [Zig](https://ziglang.org/), a new(ish) language where you manage your own memory and have access to an incredible compiler that supports `comptime`. It's an amazing (and extremely fast) language.

So I decided to learn both properly, and implemented Huffman encoding in Zig. Up until this point I hadn't worked on a project in a language like Zig, which requires manual memory management and everything that goes along with it. I had also never thought much about Huffman encoding, but it's a fairly simple algorithm, so it was the perfect project for learning the language.

I won't go into much detail about Huffman encoding; you can find many tutorials that do it better than I ever could. But I will talk about what I think about Zig.

## Zig - It's great

I'm not a big language connoisseur. I like things that work, that are simple and have a friendly syntax, so I happen to love [Golang](https://go.dev/): it's simple, robust and very fast. It's also garbage collected, which is a plus.

I didn't think I was going to be a big fan of Zig. My experience with C has been fine, but I haven't loved it much, and I've never touched C++. But I really enjoyed Zig.

### Some things I liked

- The compiler is fast and extremely informative. I found that I didn't need to go to the official documentation; the compiler knew what I was trying to do and often led me in the right direction.
- Using allocator objects to initialise various data structures is a very nice pattern, one that makes managing memory really not complicated.
- Comptime. Absolutely genius!
- No secret allocations (looking at you Rust!). You are in control of the allocations completely.

### Some things I didn't like

- This isn't a fair criticism, because it's a fairly new and evolving language, but the LSP isn't the best. It's very fast, and when it works it works really well, but it seems to be lacking some features. The one I missed the most was _hover_, where you can ask the LSP what type something is. It sometimes worked, but most of the time it got confused.
- Using comptime types and returning a struct from a function seems odd, but it also seems to be the best way to achieve generic data structures. I don't love this pattern, and there might be a better way. See the `heap.zig` file in my repo to see what I'm talking about.

I would definitely love to work more with Zig. It seems like the logical step forward for programs that require the blazingly fast speed of no garbage collection. You can already see this happening in [Bun.js](https://bun.sh/), which seems to have great speed advantages (sometimes) compared to Node.js or Deno (even though both are written in low-level languages).

Thank you for reading. If you would like to chat with me, you can email me at `johncosta027@gmail.com` or visit my GitHub profile.