Posted in GSoC, Python

The shortcomings of ANTLR

An important aspect of using any tool is knowing when not to use it, and the same goes with ANTLR as well.
This post is dedicated to the shortcomings that I faced while using the Python runtime, as of July 2018. The ANTLR team is continuously working to improve them and is possible that they might get resolved in the future.

The first and the foremost shortcoming is modifying a parse tree in memory. Once a source code is read, ANTLR did very poorly at providing an interface for re-writing the parse tree. Here is a PR that is an attempt to improve that, so possibly this will change in future soon.

The next shortcoming in the Python runtime specifically (not sure about the Java runtime, would be great if someone could confirm), is retrieval of source code. Although ANTLR does very well at giving us the positions of tokens (there line numbers and column numbers in the original source code), it didn’t work out well when I tried converting the parse tree back into a source code.

For e.g consider a python source code such as:

var_x = 5
if var_x:
    print('Variable x is', var_x)
else:
    print('Variable x is not set')

If I pass the above code to ANTLR’s python runtime, and then use this code to regenerate the output, it would look like as follows:

var_x=5
ifvar_x:
print('Variable x is',var_x)
else:
print('Variable x is not set')

The source of inspiration for the round-tripping code is this stack overflow answer.

Essentially, ANTLR has this concept of a HiddenTokenStream, into which all the unused tokens are dumped (such as whitespaces). During my attempt to retrieve the original source code from a given parse tree, the Python runtime could not put together the spaces well, it missed out on all white spaces. The solution could be to add white spaces according to the column number, but it doesn’t work very well if we don’t know what the original white space was, it is very well possible that we replace a tab with a space, or vice-versa inadvertently.

The above two reasons are majorly blockers into creating an API that can suggest complicated fixes for the anomalies detected by the linting logic.

Also ANTLR is a parse tree generator, and I’ve seen many people avoid parse tree generators for the reason that they are quite slow, and writing a custom parser will give you the extra drop of optimisation that’s needed in a typical compilation process.

Many such articles explain the reasons for better performance of a Parser written by hand, such as this one. I didn’t run into this yet, but is mentioned here for completeness sake.

With this, I conclude my post for the second phase. Hoping for a positive outcome of the evaluations ūüôā .

Thanks for reading !

Advertisements
Posted in GSoC, Python

Deeper into Visitor Patterns – The flexible design for any meta-program

This is the third blog on ANTLR, and in this one I will walk through the process on how to manipulate ANTLR’s parse trees to do advanced stuff.
More specifically, I will talk about the visitor pattern of traversing the parse tree, and will explain alongside a complicated use case that I have been working on in these summers alongside coala for my GSoC !

In the last two blogs, I established setting up of ANTLR to work with Python and usage of the Python runtime for parse tree traversal, and also explained it’s usage with one grammar taken from the official grammars repository. You can have a look at them here (Part 1) and here (Part 2).

In the last blog, we constructed a meta-program that could read a python source code and report all the class methods in a nice dictionary format. Notice that we had to override the visitFuncdef method from the Python3Visitor class. This was a relatively simple use case that followed the defaults very closely and required only one function being over-ridden. Continue reading “Deeper into Visitor Patterns – The flexible design for any meta-program”

Posted in GSoC, Python

Dive deeper into ANTLR – Write a python meta-program !

In the previous post on ANTLR, I introduced how to set-up ANTLR locally and use the python runtime to hack on a customised grammar (and made a simple calculator using that).

In this post we will take it forward from there, and write a program that parses actual Python source code !
Just for completeness, such a kind of programming is called meta-programming : A program that takes input as the source code of another program.

Let’s first do some conceptual overview of what we want to achieve:
We want to write a program, that reads another Python class source code, and reports all the variables that are class-methods.

For e.g, let’s take the following class:

def func2():
    pass

class aSampleClass(object):
    def func(self):
        pass

From the above piece of code, we can easily see that func2 is a normal method and func is a class method. We want to identify all methods of this kind !

Continue reading “Dive deeper into ANTLR – Write a python meta-program !”

Posted in GSoC, Python

Circular Dependencies in Python

When learning Object Oriented Concepts, I was taught about various problems such as the dreaded diamond, etc etc. That time I never imagined I’d write any code that would run into any such issues, and here I am, facing those issues with the small library that I am trying to build !

So I decided to document about these things, in particular this post will be dedicated to the circular dependency problem and another post will be dedicated to the dreaded diamond problem.

The circular dependency problem:

The issue:
Suppose you have two modules, module A and module B. Module A uses some classes from module B and module B uses some classes from module A. This will result in errors when you try to use any of the modules, since none of the modules can be loaded without loading the other module.

This is a classic example of circular dependency issue. This is obviously an indicator that you have a bad design, and you need to change it, but sometimes this is the only design that is feasible and possible. Let me illustrate via an example:

Suppose you want to create a Fruit class. And you want this class to automatically detect the current season, and return a SummerFruit or WinterFruit accordingly.

So what you really want is to have the superclass select one of the child class instances for you based on some parameters.

The code would look something like this:


#!/usr/bin/env python
from random import randint
from SummerFruit import SummerFruit
from WinterFruit import WinterFruit

class Fruit(object):
	@staticmethod
	def get_fruit():
		y = randint(0,99)
		if y%2==1:
			return SummerFruit
		else:
			return WinterFruit

Continue reading “Circular Dependencies in Python”

Posted in GSoC, Non Tech

GSoC Coding Phase-1 Wrap-up

Now it’s nearly the end of coding phase -1 for GSoC, and this post will be dedicated to my journey throughout this month.

My project got a new direction, and the concept was validated with some concrete code.
I’ve started building the library hosted currently at –¬†https://gitlab.com/virresh/coala-antlr
(maybe consider it a result of anticipation of #movingtogitlab :p, but it was more because of the features it provided to us and done way before the deal was even announced).

At coala, we have a good practice of first writing out our concept concretely and documenting it first. A wise person once said:

Think twice, Code once ūüėČ

Continue reading “GSoC Coding Phase-1 Wrap-up”

Posted in GSoC, Python

moban: Templating using jinja !

While working with coala for the past few weeks, there’s a movement to standardise the code of various projects under the same organisation.
The way this is being done is through templating the common parts of code, like the setup.py files, CI files etc etc.

The tool of choice there is via a templating engine like Jinja2, and the parser that we used is moban. Those familiar with django or flask would be already aware of Jinja2 templates, very useful for html templates, and those familiar with jekyll would be aware of liquid, so an analogy for them would be  liquid:ruby::jinja:python .

Even if you didn’t get any of the jargon in the previous para, no worries, we will come back to it in a while. Consider the following example:

You want to create several html files, but you have common navigation bar code for all of them, would you copy paste the same code in each of the files manually ?

Continue reading “moban: Templating using jinja !”

Posted in GSoC

Hello ANTLR ! – Introduction to ANTLR and Parse Trees

ANTLR stands for ANother Tool for Language Recognition. That seems like a very technical jargon for the people hearing it for the first time, so I decided to write a simple introduction to ANTLR and parse trees.

We will get back to ANTLR in a while, first on towards a parse tree.
During the development of any program for any language, the language should have a set of tokens pre-defined and some reserved keywords that have some special meaning. The programming language has a particular structure, and the code that is written for that language must follow that structure and the rules.
Formally, a programming language has a grammar. This is somewhat analogous to the grammar for human languages like English, Hindi, French etc.

Let’s consider the life-cycle of a typical compiled program, for e.g a C program:

  • The source code is read by a program and broken into tokens, often done by a lexer
  • After token-isation of the source code, these tokens are arranged in a tree data-structure for further analysis. This tree is known as the parse tree.
  • Most often the parse tree cannot be used as-it is due to lots of extra and unnecessary nodes in between and is further abstracted by converting into an Abstract Syntax Tree (AST), which is then parsed by a parser and then analysed for errors
  • Once all semantic and syntactic errors are successfully determined, the AST is then used to create an intermediate representation (IR), which acts as hints for the assembler.
  • The IR is then further converted into assembly code by the assembler and this is then finally translated into machine code and results in the final binary that can be executed by the cpu.

Where does ANTLR fit in all of this ? Continue reading “Hello ANTLR ! – Introduction to ANTLR and Parse Trees”