Posted in GSoC, Python

The shortcomings of ANTLR

An important aspect of using any tool is knowing when not to use it, and the same goes for ANTLR.
This post is dedicated to the shortcomings that I faced while using the Python runtime, as of July 2018. The ANTLR team is continuously working to improve it, and it is possible that these issues will be resolved in the future.

The first and foremost shortcoming is modifying a parse tree in memory. Once the source code has been read, ANTLR does a poor job of providing an interface for rewriting the parse tree. Here is a PR that attempts to improve that, so this may change in the near future.

The next shortcoming, specific to the Python runtime (I am not sure about the Java runtime; it would be great if someone could confirm), is the retrieval of source code. Although ANTLR does very well at giving us the positions of tokens (their line numbers and column numbers in the original source code), it didn’t work out well when I tried converting the parse tree back into source code.

For example, consider Python source code such as:

var_x = 5
if var_x:
    print('Variable x is', var_x)
else:
    print('Variable x is not set')

If I pass the above code to ANTLR’s Python runtime and then use this code to regenerate the source, the output looks as follows:

print('Variable x is',var_x)
print('Variable x is not set')

The inspiration for the round-tripping code is this Stack Overflow answer.
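The loss of whitespace can be reproduced without ANTLR. The following is a minimal sketch, not the round-tripping code referred to above: `toy_lex`, `join_tokens`, the regular expression and the two channel ids are all made up for illustration. It tags whitespace tokens as "hidden" and shows that joining only the tokens a parser would see drops the indentation and spacing, while joining every token gives back the exact input.

```python
import re
from typing import List, Tuple

# Toy channel ids, chosen for this sketch (not taken from the ANTLR runtime).
DEFAULT_CHANNEL, HIDDEN_CHANNEL = 0, 1

# Whitespace is tagged as hidden; words, single-quoted strings and
# single punctuation characters are tagged as visible.
TOKEN_RE = re.compile(r"(?P<ws>\s+)|(?P<visible>\w+|'[^']*'|[^\w\s])")


def toy_lex(source: str) -> List[Tuple[str, int]]:
    """Split source into (text, channel) pairs that cover every character."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        channel = HIDDEN_CHANNEL if match.lastgroup == "ws" else DEFAULT_CHANNEL
        tokens.append((match.group(), channel))
    return tokens


def join_tokens(tokens: List[Tuple[str, int]], include_hidden: bool) -> str:
    """Concatenate token texts, optionally skipping the hidden ones."""
    return "".join(
        text for text, channel in tokens
        if include_hidden or channel == DEFAULT_CHANNEL
    )


source = "if var_x:\n    print('Variable x is', var_x)\n"
tokens = toy_lex(source)
exact = join_tokens(tokens, include_hidden=True)   # identical to source
lossy = join_tokens(tokens, include_hidden=False)  # all whitespace is gone
```

In this toy version even the newlines are lost, because they are treated as plain whitespace. A real Python grammar keeps newlines as significant tokens, which would explain why the regenerated output above still has its line breaks.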

Essentially, ANTLR has the concept of a hidden channel, to which all the tokens the parser does not use (such as whitespace) are sent. During my attempt to retrieve the original source code from a given parse tree, the Python runtime could not put the spacing back together; it missed all the whitespace. One solution could be to add whitespace according to the column number, but that doesn’t work very well if we don’t know what the original whitespace was: it is quite possible to inadvertently replace a tab with a space, or vice versa.
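The column-number workaround and its tab problem can be sketched as follows. This is an illustration under stated assumptions, not the code I used: `Tok` and `rebuild_from_positions` are hypothetical names, tokens are assumed to carry a 1-based line and a 0-based column (the convention ANTLR tokens use), a tab is assumed to count as one column, and no token is assumed to span several lines.

```python
from typing import Dict, List, NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a parser token with position info."""
    text: str
    line: int    # 1-based line number
    column: int  # 0-based character offset within the line


def rebuild_from_positions(tokens: List[Tok]) -> str:
    """Rebuild source text by padding each token out to its column with spaces."""
    lines: Dict[int, str] = {}
    for tok in sorted(tokens, key=lambda t: (t.line, t.column)):
        current = lines.get(tok.line, "")
        # We only know how many columns to skip, not which characters
        # originally filled them, so spaces are a guess.
        padding = " " * max(0, tok.column - len(current))
        lines[tok.line] = current + padding + tok.text
    last_line = max(lines) if lines else 0
    return "\n".join(lines.get(n, "") for n in range(1, last_line + 1))


# The original text indents the second line with a single tab character.
original = "if x:\n\tprint(x)"
tokens = [
    Tok("if", 1, 0), Tok("x", 1, 3), Tok(":", 1, 4),
    Tok("print", 2, 1), Tok("(", 2, 6), Tok("x", 2, 7), Tok(")", 2, 8),
]
rebuilt = rebuild_from_positions(tokens)
```

Here `rebuilt` has the same shape as `original`, but the tab has silently become a single space, which is exactly the kind of inadvertent change that makes this approach unreliable for a linter that suggests fixes.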

The above two issues are major blockers to creating an API that can suggest complicated fixes for the anomalies detected by the linting logic.

ANTLR is also a parser generator that produces parse trees, and I’ve seen many people avoid parser generators because the generated parsers can be quite slow; writing a custom parser can give you the extra drop of optimisation that’s needed in a typical compilation process.

Many articles, such as this one, explain why a hand-written parser can perform better. I haven’t run into this yet, but it is mentioned here for completeness’ sake.

With this, I conclude my post for the second phase. Hoping for a positive outcome of the evaluations 🙂

Thanks for reading!


Code Lover
