Create your own parser with the Nearley Parser


For a project I needed to parse data that is delivered through email. Yes, I know: when was the last time you received production data through email instead of a clean REST API? It is certainly not an ideal interface, but it is the one we have to deal with, as we cannot change the source to provide a clean REST API on short notice. And, perhaps also familiar, the project needs to deliver value as soon as possible. Luckily I came across the Nearley Parser.

Nearley Parser

At first you might try to parse the email and extract the data using regular expressions, or even just by matching on keywords, rows, and columns. However, email clients and mail servers can change the content of a message: messages get forwarded, signatures get added, or plain text is converted to HTML. So I needed something more intelligent and robust to parse these mail messages.

Searching for a document parser that could do the job, I found something more interesting. Why would you not explain to the parser how to read your document and have it return meaningful JSON objects containing your data?

The tool for this job is the Nearley Parser. It is named after the Earley parsing algorithm that it implements, which was invented by Jay Earley.

To really go into depth on how this parsing algorithm works, I would recommend reading the explanation of the algorithm by the author himself.

We can make the algorithm work for us by providing a definition file called the grammar file. But before we do, we can experiment with our grammar in an online tool which is, obviously, called the Nearley Playground. Here you create a test, which is basically the content you need to parse, and provide the grammar, which is compiled in real time so that the result is shown directly.

Nearley Parsing Primer

The data we need to parse is coming from a sensor and contains, among other things, status information about its battery. This data needs to be read and stored in a document store. So wouldn’t it be great if we could feed this piece of data to a document parser that parses it and returns a JSON object containing the data?

The first sentence of data we are interested in is:

Battery: 51%, 4.01 Volt

The sentence starts with a word, then a colon, white space and the data.

Without a custom lexer, the Nearley Parser reads the input character by character. A literal string in the grammar, such as a word or a single character, is what is called a “terminal”. When a rule matches, we can use a post-processor function to actually do something with the data, in this case construct a JSON object.

So to parse “Battery:”, the grammar is

sentence -> "Battery:"

But our line contains more characters. To tell the parser there can be zero or more whitespace characters after “Battery:”, we can use the built-in grammar “whitespace.ne”, simply by adding an underscore in our grammar. So now our grammar becomes:

sentence -> "Battery:" _

Then we encounter our first value: a percentage. To parse this we can use the “percentage” rule from the built-in grammar “number.ne” as follows:

sentence -> "Battery:" _ percentage

This way we can parse the complete line with the following grammar, including the built-in grammars:

@builtin "whitespace.ne" # `_` means arbitrary amount of whitespace
@builtin "number.ne"     # `int`, `decimal`, and `percentage`

sentence -> "Battery:" _ percentage _ "," _ decimal _ "Volt"

All this can be easily run as an experiment on the Nearley playground website.

Real usage

This is all nice in a playground web application to learn and write your grammar file, but now we need to put this to use in an application.

Nearley consists of two components, the compiler and the parser. The compiler turns your grammar file into a JavaScript module; the parser then uses that compiled grammar to parse your document.

Both components are part of the nearley npm package and can be installed through npm.

To install the parser in your project:

npm install --save nearley

This will add it as a dependency in package.json.

To use the compiler to compile your grammar, install it as follows:

npm install -g nearley

Store the grammar as shown above into a file called grammar.ne and compile it using the following command:

nearleyc grammar.ne -o grammar.js

This will compile the grammar file into a JavaScript Parser module. Now we can use the test tool provided by the Nearley compiler:

nearley-test ./grammar.js --input "Battery: 51%, 4.01 Volt"

And it will show the results:

Parse results:
[ [ 'Battery:', null, 0.51, null, ',', null, 4.01, null, 'Volt' ] ]

As you can see, it outputs a nested array containing our data, along with a null value for each stretch of whitespace in our string. To clean this up and make it return a JSON object, we can add a post-processing function as follows:

sentence -> "Battery:" _ percentage _ "," _ decimal _ "Volt" 
  {% ([,,level,,,,volts]) => 
    ({battery:{percentage: level, value: volts}}) %}

When we compile and test this, we get the following result:

Parse results:
[ { battery: { percentage: 0.51, value: 4.01 } } ]
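To see what the post-processor does, it can be tried in isolation as a plain JavaScript function. The sketch below takes the raw array from the earlier nearley-test output as its input; the names `rawMatch` and `toBattery` are made up for this illustration and do not come from nearley.

```javascript
// Raw match for the rule, copied from the earlier nearley-test output:
// one entry per symbol in the rule, with null for each `_` (whitespace).
const rawMatch = ['Battery:', null, 0.51, null, ',', null, 4.01, null, 'Volt'];

// Same arrow function as in the grammar; `toBattery` is a hypothetical name.
// The destructuring pattern skips indexes 0-1 and 3-5 and picks 2 and 6.
const toBattery = ([, , level, , , , volts]) =>
  ({ battery: { percentage: level, value: volts } });

const batteryReading = toBattery(rawMatch);
console.log(JSON.stringify(batteryReading));
```

Each empty slot between the commas stands for one symbol of the rule that we do not care about, so the pattern has to change whenever symbols are added to or removed from the rule.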

To use this in your code, you can simply include the parser and provide your compiled grammar and data and you get the result back as an array:

const nearley = require("nearley");
const grammar = require("./grammar.js");
const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));

// feed() returns the parser itself; the parse results are in parser.results
parser.feed("Battery: 51%, 4.01 Volt");
console.log(parser.results);
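nearley collects every possible parsing of the input in `parser.results`, so that array can be empty (the input is still incomplete) or hold more than one entry (the grammar is ambiguous for this input). A minimal sketch of a guard around it, assuming we want exactly one result before storing anything; `pickSingleResult` is a hypothetical helper and not part of nearley, and `exampleResults` is a stand-in for `parser.results`:

```javascript
// Hypothetical helper (not part of nearley): returns the only parse result,
// or throws when the results array is empty or holds several parsings.
function pickSingleResult(results) {
  if (results.length === 0) {
    throw new Error("Incomplete input: no parse result yet");
  }
  if (results.length > 1) {
    throw new Error(`Ambiguous grammar: ${results.length} parse results`);
  }
  return results[0];
}

// Stand-in for `parser.results` after feeding the battery line.
const exampleResults = [{ battery: { percentage: 0.51, value: 4.01 } }];
console.log(pickSingleResult(exampleResults).battery.value);
```

In an application you would pass `parser.results` to the helper after the call to `parser.feed(...)`.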

Conclusion

I have written numerous lines of code that parse data in one form or another, both single lines and multiple lines. So for the problem at hand, my first choice was also to use some kind of regular expression to parse each line in the mail message. That could of course have worked well, but knowing that the content of the mail can vary made me look for something that handles this variation cleanly, without a complex, unreadable regular expression or a complex piece of code. Instead, the Nearley Parser grammar provides you with clear, semantically readable code.

So luckily my search brought the Nearley Parser to my attention, and although it has a steep learning curve, you can create something usable quite quickly. Yes, the above example parses just one line and could have been done much quicker with a regular expression. However, as this line sits somewhere in the message, has some variation in spacing, and is surrounded by numerous other pieces of data that you want to read, things can become complex quite quickly. To be fair, I would definitely not call myself an expert on the vocabulary of the Nearley Parser, but I thought it was worth spreading the word!
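For comparison, a regular-expression version of the single battery line could look like the sketch below. The pattern and the `parseBatteryLine` name are illustrations only, not code from the project; they show both how short the regex route is for one line and how easily a small change in the mail content makes it miss.

```javascript
// Illustrative regex for "Battery: 51%, 4.01 Volt"; tolerates extra spaces.
const BATTERY_LINE = /Battery:\s*(\d+(?:\.\d+)?)%\s*,\s*(\d+(?:\.\d+)?)\s*Volt/;

// Hypothetical helper: returns the same object shape as the grammar's
// post-processor, or null when the line does not match the pattern.
function parseBatteryLine(text) {
  const match = BATTERY_LINE.exec(text);
  if (match === null) return null;
  return {
    battery: {
      percentage: Number(match[1]) / 100,
      value: Number(match[2]),
    },
  };
}

console.log(parseBatteryLine("Battery: 51%, 4.01 Volt"));
// A mail client that wraps the value in HTML already breaks the pattern:
console.log(parseBatteryLine("Battery: <b>51%</b>, 4.01 Volt"));
```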

Harald Rietman works as a senior software craftsman at the Dutch codecentric office. He is an all-round developer working on different solutions and promoting CI/CD practices on both a technical and an organisational level, always seeking what delivers the most value to the customer.
