The world is full of programming languages with all kinds of different use cases. Most of these languages, though, are built for very general purposes — sometimes, we may want to design a language to fit a very specific use case (e.g. facebook designed React to make it easier to develop their web applications, and Apple recently developed Pkl, a language designed to make configurations easier. There are many more examples of this in various fields). As such, knowing how to build a programming language is a useful skill to have in your belt.
In this article, we will build an interpreted programming language from scratch and learn a little bit about both the lambda calculus and programming languages as a whole along the way. The language we build here will be fairly esoteric, but the process should hopefully give you an idea of how to design your own use-case-specific languages (and teach you useful information about how programming languages function under the hood).
Since we’re building an interpreted language¹, our overarching flow will look something like this:
Basically, we start with some concrete syntax (code) written in our target language (the language that we are trying to write), pass it to some parser that converts it to an abstract syntax tree (a tree representation of the code that’s easy to work with), and then pass that to an interpreter that “runs” the abstract syntax tree to give us a final result. Note that the parser and interpreter are written in some already existing host language — C’s original parser and compiler, for instance, were written in assembly.
** Note: My use of “parser” here encapsulates the entire parsing process. Usually, lexing is done prior to “parsing”, but in this case parsing is just the process of taking concrete syntax to abstract syntax, whatever that may look like.
As an example, consider the following specification for a simple language for basic arithmetic:
EXPR = number
| EXPR + EXPR
| EXPR - EXPR
| EXPR * EXPR
| EXPR / EXPR
| (EXPR)
The above, by the way, is an EBNF for a context-free grammar². I won’t delve too deeply into what this means here, but all programming languages written in a form like this are parse-able in polynomial time via the CYK algorithm. For this EBNF, something like (4 + 4) * 3
is a valid program, but something like def f(x): return 5; f(5)
is not.
Suppose we are now given the concrete syntax (4 + 4) * 3
. After parsing, we should get an abstract syntax tree (AST) like this:
Then our interpreter will start at the root and recursively go down the tree until we get our answer, namely 24.
Note quickly that this grammar is ambiguous — for instance, how should the expression 4 + 4 * 3
parse? It could either parse as the above (that is, (4 + 4) * 3
), or it could also parse as 4 + (4 * 3)
— neither parsing is inherently more “correct” in the way that we have specified the language, as both are valid parse trees. In cases like this, the parser will have to arbitrarily make a decision about how to parse our language.
By the flow chart, our first step should logically be to design our concrete syntax. What you choose to make your syntax is completely up to you. I decided to create EmojiLang, a (horrible) language that ensures an extremely colorful screen while you are typing. The grammar is below:
grammar EmojiLang;program: '' expr '' EOF;
expr: '(' (ID
| atom
| ifCond
| anonFunctionDefn
| funApplication
| varAssignment
| READ_FLOAT
| READ_STRING
| printExpr
| sequentialOp
| baseComputation) ')';
atom: NUMBER | BOOLEAN | STRING;
ifCond: '' cond=expr '' ontrue=expr ':' onfalse=expr;
anonFunctionDefn: '' arg=ID '' body=expr;
funApplication: '' fun=expr arg=expr;
varAssignment: '' var=ID '' val=expr;
printExpr: '' expr;
sequentialOp: '' first=expr second=expr;
baseComputation: left=expr op=('' | '' | '' | '' | '' | '≤') right=expr;
ID: (a-zA-Z_)(a-zA-Z0-9_)*;
NUMBER: (0-9)+ ('.' (0-9)+)?;
BOOLEAN: '' | '';
STRING: '"' ~(\r\n")* '"';
READ_FLOAT: '';
READ_STRING: '';
WHITESPACE: ( \t\r\n)+ -> skip;
COMMENT: '' .*? '' -> skip;
LINE_COMMENT: '' ~(\r\n)* -> skip;
** Note: the above specification is written to be used in a tool called ANTLR, we’ll get back to this very soon.
This language is, of course, ridiculous, but it is interesting for a couple of reasons. Firstly, all of our expressions are required to be parenthesized. This makes it extremely annoying to code, but also makes our grammar non-ambiguous. Secondly, notice how we can only define anonymous functions — there is no equivalent syntax for something like def
in Python. Finally, all functions in this language (except for the base computations) have exactly one argument. We’ll explore the implications of the last two design decisions when we play around with our language in a bit.
We can, of course, write our own parser. Luckily though, there are tools that can parse arbitrary context-free grammars. For this tutorial, we will use ANTLR (you can download it here). ANTLR is a very nice and easy-to-use tool that takes grammar specifications like the above and creates parsers in a variety of target languages (in this tutorial, we will be using Python).
Using it is fairly simple, just follow the following steps:
- Download the ANTLR Java binaries from here
- Create a .g4 file (like the above) for your grammar. Note that ANTLR can’t actually handle emojis very well, so if you plan to use emojis in your language, run the following python script to demojify your grammar:
import emoji
import sysdef demojify_file(input_file, output_file):
with open(input_file, "r", encoding="utf-8") as f:
input_text = f.read()
input_text = emoji.demojize(input_text)
with open(output_file, "w", encoding="utf-8") as f:
f.write(input_text)
if __name__ == "__main__":
input_file = sys.argv(1)
output_file = sys.argv(2)
demojify_file(input_file, output_file)
3. Run java -Xmx500M -cp org.antlr.v4.Tool -Dlanguage=Python3
to generate your parser
You can then import the generated parsing files and use them as follows:
from antlr4 import *
from EmojiLangLexer import EmojiLangLexer
from EmojiLangParser import EmojiLangParser
from EmojiLangListener import EmojiLangListener
import emojiif __name__ == "__main__":
input_file = sys.argv(1)
with open(input_file, "r", encoding="utf-8") as f:
input_text = f.read()
input_text = emoji.demojize(input_text)
input_stream = InputStream(input_text)
lexer = EmojiLangLexer(input_stream)
token_stream = CommonTokenStream(lexer)
parser = EmojiLangParser(token_stream)
tree = parser.program()
if parser.getNumberOfSyntaxErrors() > 0:
exit(1)
You probably won’t have to do the demojizing step in which case you can use antlr4’s FileStream
instead of the InputStream
, but it really doesn’t matter. Now, we have a very nice abstract syntax tree that’s easy to work with, and we can move on to the hard part — interpreting³
Because we are working with trees, our interpreter will naturally be a recursive entity. We do have some choices though on how exactly we implement some of its features. For this tutorial, we will build an interpreter that uses environments that bind variables to addresses and then a mutable store that maps addresses to values. This idea is fairly common, though not ubiquitous, and it allows us to maintain proper scope and support variable mutation. For ease of implementation, we will also make our interpreter return a common value structure.
Values, Stores, and the Environment
First, let’s define what our interpreter can output. We have three obvious base cases in our EBNF (namely booleans, strings, and numbers) so let’s create value objects for those:
class Value:
def __init__(self, value):
self.value = valuedef __str__(self) -> str:
return str(self.value)
class NumValue(Value):
def __init__(self, value: float):
super().__init__(value)
class StringValue(Value):
def __init__(self, value: str):
super().__init__(value)
class BoolValue(Value):
def __init__(self, value: bool):
super().__init__(value)
To store mappings of variables to values, we will also create an environment and a store:
class EnvLookupException(Exception):
passclass Environment:
def __init__(self):
self.vars = {}
def set_var(self, name, addr: int):
self.vars(name) = addr
def get_var(self, name):
if name not in self.vars:
raise EnvLookupException(f"Variable {name} not found in environment")
return self.vars(name)
def copy(self):
new_env = Environment()
new_env.vars = self.vars.copy()
return new_env
class Store:
def __init__(self):
self.store = ()
def alloc(self, value: Value):
self.store.append(value)
return len(self.store) - 1
def get(self, addr: int):
if addr >= len(self.store):
raise EnvLookupException(f"Address {addr} not found in store")
return self.store(addr)
def set(self, addr: int, value: Value):
if addr >= len(self.store):
raise EnvLookupException(f"Address {addr} not found in store")
self.store(addr) = value
Effectively, our environment will store variable → address bindings, and our store will hold address → value bindings. The store is perhaps not necessary with our current feature set, but if we allowed for mutation for pass-by-reference variables, it would be useful⁴.
Ideally, we’d also like to pass functions as variables, so we need one more value to represent them. To do this, we create a closure, which contains the function’s parameter, body, and the environment that it was created in:
class ClosureValue(Value):
# Body should be an antlr4 tree
def __init__(self, param: str, body: object, env: 'Environment'):
super().__init__(None)
self.param = param
self.body = body
self.env = envdef __str__(self) -> str:
return f""
You may reasonably ask about why we need the environment stored in the function. Suppose instead that we had a function value like this:
class FunctionValue(Value):
# Body should be an antlr4 tree
def __init__(self, param: str, body: object):
super().__init__(None)
self.param = param
self.body = bodydef __str__(self) -> str:
return f""
Now, suppose that we had code equivalent to the following in our language:
# ----------------
# ENV MUST PERSIST
# ----------------
def f(x):
def g(y):
return x + y
return g(x)print((f(4))(5)) # 9
# ----------------
# ENV MUST CLEAR
# ----------------
def f2(x):
return x + y
def g2(y):
return f(5)
print(f(4)) # Should crash
To ensure that y
is still in scope for g
in the top case, we would have to implement dynamic scoping (scope where variables are added to the environment as the program runs without being cleared) without closures, meaning that the bottom code would actually run and print 9
. For the bottom code to properly crash though, we can’t implement dynamic scope. Thus we want the functions to effectively remember what environment they were created in — hence why we add environments to our closure class⁵.
The Interpreter
Now, we are ready to write our actual interpreter. In ANTLR, our interpreter will extend the EmojiLangListener
class. We will also need to create a top-level environment and give the interpreter a store:
class EmojiLangException(Exception):
passTOP_ENV = Environment()
class Interpreter(EmojiLangListener):
def __init__(self):
self.store = Store()
Now, we need to create an interp method that handles all of our expression cases. It will end up looking something as follows:
def interp(self, prog, env: Environment) -> Value:
if prog.ID():
return self.interp_id(prog.ID())
elif prog.atom():
return self.interp_atom(prog.atom())
elif prog.anonFunctionDefn():
return self.interp_function_defn(prog.anonFunctionDefn())
elif prog.READ_FLOAT():
return self.interp_read_float()
elif prog.READ_STRING():
return self.interp_read_string()
elif prog.printExpr():
return self.interp_print_expr()
elif prog.ifCond():
return self.interp_if(prog.ifCond(), env)
elif prog.sequentialOp():
return self.interp_sequential_op(prog.sequentialOp(), env)
elif prog.baseComputation():
return self.interp_base_computation(prog.baseComputation(), env)
elif prog.varAssignment():
return self.interp_var_assignment(prog.varAssignment(), env)
elif prog.funApplication():
return self.interp_fun_application(prog.funApplication(), env)
Our base cases (IDs, atoms, function definitions, reading, and printing) are fairly simple, so we can just write them in:
def interp(self, prog, env: Environment) -> Value:
if prog.ID():
return self.store.get(env.get_var(prog.ID().getText()))
elif prog.atom():
return self.interp_atom(prog.atom())
elif prog.anonFunctionDefn():
return ClosureValue(prog.anonFunctionDefn().arg.text, prog.anonFunctionDefn().body, env)
elif prog.READ_FLOAT():
try:
return NumValue(float(input("> ")))
except ValueError:
raise EmojiLangException("Expected float input")
elif prog.READ_STRING():
return StringValue(input("> "))
elif prog.printExpr():
value = self.interp(prog.printExpr().expr(), env)
if isinstance(value, StringValue):
# print without quotes
print(str(value)(1:-1))
else:
print(value)
return value
# ...def interp_atom(self, atm):
if atm.NUMBER():
return NumValue(float(atm.NUMBER().getText()))
elif atm.BOOLEAN():
return BoolValue(atm.BOOLEAN().getText() == ":thumbs_up:")
elif atm.STRING():
return StringValue(atm.STRING().getText())
# THIS SHOULD NEVER HAPPEN
raise EmojiLangException(f"Unknown atom {atm.getText()}")
Our if condition is also fairly simple. We just need to interpret the condition and then return the result of interpreting the true case if it’s true and the false case if its false. The code is thus just
def interp_if(self, if_cond, env: Environment):
cond = self.interp(if_cond.cond, env)
if not isinstance(cond, BoolValue):
raise EmojiLangException(f"Expected boolean when evaluating if condition, got {cond}")
return self.interp(if_cond.ontrue if cond.value else if_cond.onfalse, env)
Sequential operations are similarly simple, they just need to interpret the first expression and then the second. We can thus replace the code in that block as follows:
def interp(self, prog, env: Environment) -> Value:
# ...
elif prog.sequentialOp():
self.interp(prog.sequentialOp().first, env)
return self.interp(prog.sequentialOp().second, env)
# ...
Next are the base computations. This is a decent amount of code since we need to handle a lot of operations, but it’s not super complex:
def interp_base_computation(self, base_computation, env: Environment):
left, right = self.interp(base_computation.left, env), self.interp(base_computation.right, env)
if base_computation.op.text == ":plus:":
if isinstance(left, NumValue) and isinstance(right, NumValue):
return NumValue(left.value + right.value)
elif isinstance(left, StringValue) and isinstance(right, StringValue):
return StringValue(left.value + right.value)
raise EmojiLangException(f"Cannot add {left} and {right}")
if base_computation.op.text == ":heavy_equals_sign:":
if type(left) != type(right):
return BoolValue(False)
if isinstance(left, ClosureValue):
raise EmojiLangException("Cannot compare functions")
return BoolValue(left.value == right.value)# NUMBERS ONLY COMPUTATIONS
if not isinstance(left, NumValue) or not isinstance(right, NumValue):
raise EmojiLangException(f"Expected numbers when evaluating base computation, got {left} and {right}")
if base_computation.op.text == ":minus:":
return NumValue(left.value - right.value)
elif base_computation.op.text == ":multiply:":
return NumValue(left.value * right.value)
elif base_computation.op.text == ":divide:":
if right.value == 0:
raise EmojiLangException("Division by zero")
return NumValue(left.value / right.value)
elif base_computation.op.text == "≤":
return BoolValue(left.value <= right.value)
Perhaps this can also be cleaned up a bit with the use of e.g. dictionaries, but that’s not super important. Next we have variable assignments, which are also not too complicated. We have a couple choices here on what exactly we want it to do — namely, should it allow new variables to come into existence or only modify existing ones? I’ll choose the latter case, which makes our code look like this:
def interp_var_assignment(self, var_assign, env: Environment):
value = self.interp(var_assign.val, env)
addr = env.get_var(var_assign.var.text)
self.store.store(addr) = value
return value
Finally, we have function applications. Here, we have to do four steps. First, we interpret the function we are calling to get our closure. Then, we interpret our argument. Then, we bind the argument into a copy of the closure’s environment. Finally, we interpret the closure’s body in the new environment. The code ends up looking as follows:
def interp_fun_application(self, fun_app, env: Environment):
closure = self.interp(fun_app.fun, env)
if not isinstance(closure, ClosureValue):
raise EmojiLangException(f"Expected function when evaluating function application, got {closure}")
arg = self.interp(fun_app.arg, env)
new_env = closure.env.copy()
new_env.set_var(closure.param, self.store.alloc(arg))
return self.interp(closure.body, new_env)
Now, our interp is fully functional! We need only modify our main program to run it now on a program:
if __name__ == "__main__":
input_file = sys.argv(1)
with open(input_file, "r", encoding="utf-8") as f:
input_text = f.read()# Preprocess input to replace emojis with demojized names
input_text = emoji.demojize(input_text)
input_stream = InputStream(input_text)
lexer = EmojiLangLexer(input_stream)
token_stream = CommonTokenStream(lexer)
parser = EmojiLangParser(token_stream)
tree = parser.program()
if parser.getNumberOfSyntaxErrors() > 0:
exit(1)
interpreter = Interpreter()
interpreter.interp(tree.expr(), TOP_ENV)
We are now finally ready to start writing programs in our language. Here’s a simple hello world program in the emoji language (EML):
( ("Hello World!"))
And to run it, we just do
> python emoji_lang.py helloworld.eml
Hello World!
(the above, of course, assumes that the program is present in a file called helloworld.eml
)
Currying
Back in the first section, I noted that our programming language is interesting because functions can only take one argument. How then do we create an effect similar to multivariate functions? Consider, for instance, the following python code:
def f(x, y):
return x + yprint(f(3, 4))
f
here has arity 2 — that is, it takes two arguments. We can, however, write equivalent code that only uses functions of arity one (except addition) as follows:
def f(x):
def g(y):
return x + y
return gprint((f(3))(4))
The above concept of turning higher arity functions into unary functions is called currying. It works for functions of any arity — for a function f of arity n, we can simply perform currying n-1 times. In emoji language, this allows us to write a program like this:
(
( ("Enter two numbers to compute their sum."))
(
(
(
( x
( y
((x) (y))
)
)
())
())
)
)
the Python translation of which is
print("Enter two numbers to compute their sum.")
print(((lambda x: (lambda y: x + y)))(float(input()))(float(input())))
Or, more legibly,
print("Enter two numbers to compute their sum.")def f(x):
def g(y):
return x + y
return g
x = float(input())
y = float(input())
print(f(x)(y))
Notice also how the first python iteration used no named functions. It turns out that we don’t really need them, though of course they are useful for readability. Then we get
> python emoji_lang.py currying.eml
Enter two numbers to compute their sum
> 4
> 5
9.0
Recursion
How do we do a loop or recursion in this language though? We have no syntax for for
or while
, and without names for functions, it may seem like recursion is impossible.
We can, however, do a neat little trick. Since functions are values, we can pass functions to themselves whenever we call them, thus allowing them to call themselves.
Take, for instance, the following python code:
n = int(input())
while n > 0:
print(n)
n -= 1
We can convert this to a recursive version using something like this⁶:
def while_loop(condition, body):
"""
A recursive implementation of a while loop.Arguments
-------------
- condition: Some function of zero arguments that returns a boolean value
- body: Some function of zero arguments to run while the condition returns true
"""
if condition():
body()
while_loop(condition, body)
else:
return False
class Box:
def __init__(self, n):
self.n = n
def set_n(self, n):
self.n = n
def get_n(self):
return self.n
n = Box(int(input()))
def body():
print(n.get_n())
n.set_n(n.get_n() - 1)
while_loop(lambda: n.get_n() > 0, body)
But we do have a problem here — namely, notice how while_loop
calls itself. We can’t do that in our language with only anonymous functions, so how do we fix that? The answer is that we can do something like this:
def while_loop(self, condition, body):
if condition():
body()
self(self, condition, body)
else:
return False# ...
# (define n as a box)
# ...
def body():
print(n.get_n())
n.set_n(n.get_n() - 1)
def call_while(loop_func, condition, body):
loop_func(loop_func, condition, body)
call_while(while_loop, lambda: n.get_n() > 0, body)
In effect, we make while_loop
take itself as a parameter. Then, we can call self
inside while_loop
, allowing while_loop
to call itself recursively.
We still, of course, need to lambda-fy and curry this for our language, so we need to make code equivalent to the following:
(((lambda while_loop:
lambda n:
while_loop(while_loop)
(lambda bogus: n.get_n() > 0)
(lambda bogus: print(n.get_n()) or n.set_n(n.get_n() - 1)))
(lambda self:
lambda cond:
lambda body:
(body("BOGUS") or self(self)(cond)(body)) if cond("BOGUS") else False))
(Box(int(input()))))
This is a bit of a doozy, but it does work. In emoji lang, this then becomes
(
(
( while
( n
( ( ( (while) (while))
( bogus ( ((n) ≤ (0)) () : ())))
( bogus (
( (n))
( n ((n) (1)))
)))
))
Below is our while function. Note that it takes
itself as an argument - this allows for recursion
(there are other ways to do this, but recursion via self
passing is fairly simple)ARGS:
1. self(itself)
2. condition_func (function of zero arguments that returns a boolean)
3. body (function of zero arguments that returns nothing to run while true)
RETURNS:
false when finished
( self
( condition_func
( body
(
( (condition_func) ("BOGUS"))
(
( (body) ("BOGUS"))
( ( ( (self) (self))
(condition_func))
(body))) :
()
))))
)
())
Then we get
> python emoji_lang.py while_loop.eml
> 4
4.0
3.0
2.0
1.0
Bonus: The Y Combinator
It’s somewhat annoying to pass in while to itself each time we want to call it, so what if we could create a version of while that already had itself curried? It turns out we can with something called the Y Combinator. The Y combinator is a function that looks as follows:
Y = lambda g: (lambda f: g(lambda arg: f(f)(arg))) (lambda f: g(lambda arg: f(f)(arg)))
It’s completely absurd, but it allows us to effectively “store” a function in itself. I won’t go into too much detail about it, but if you’d like to learn more I advise looking at this excellent article
With the combinator though, we can change our code to the following:
(
(
(
( y_combinator
( while
( n
(
y-combinate our while
( while ( (y_combinator) (while)))
( ( (while)
( bogus ( ((n) ≤ (0)) () : ())))
( bogus (
( (n))
( n ((n) (1)))
))
)
)
)
)
)
Our y combinator function - this allows for recursion without e.g. self passing
by effectively currying the function and passing it to itself.
( fn_nr
(
( cc
( (fn_nr)
( x ( ( (cc) (cc)) (x)))
)
)
( cc
( (fn_nr)
( x ( ( (cc) (cc)) (x)))
)
)
)
)
)
( while
( condition_func
( body
(
( (condition_func) ("BOGUS"))
(
( (body) ("BOGUS"))
( ( (while)
(condition_func))
(body))) :
()
))))
)
())
Now, notice how our call to while after it has been y-combinated⁷ only involves passing the condition and the body — we don’t need to pass itself. and we still get
> python emoji_lang.py y_comb_while.eml
> 4
4.0
3.0
2.0
1.0
Congratulations! You have now hopefully built your own programming language and coded some fun things in it. Though something like EmojiLang obviously doesn’t have much real use, you can imagine some cool use cases for building your own languages, e.g. by creating a super case-specific scripting language to speed up common tasks.
If you’d like some challenges, try the following exercises:
Exercise 1: Build a simple parser and interpreter for the following language without using ANTLR. Ensure that parenthesis always take precedence, and that operations otherwise have equal precedence (e.g. 4 + 4 * 3
should evaluate to 24
, not 16
)
EXPR = number
| EXPR + EXPR
| EXPR - EXPR
| EXPR * EXPR
| EXPR / EXPR
| (EXPR)
Exercise 2: Modify your above code to add operator precedence
Exercise 3 (Tricky): We don’t have to make all of our functions anonymous. Try implementing an interpreter for the following language (you can use ANTLR, but you’ll have to write your own .g4 file):
program = (FUNDEF | EXPR)* // one or more function definitions or expressions// NOTE: implies that something is a string
// also, feel free to ignore whitespace or add semicolons or parenthesize
// expressions/fundefs if you please
EXPR = number
| functionApplication
| computation
FUNDEF = 'def' '(' * '):' EXPR
functionApplication = '(' EXPR* ')' // e.g. f(1, 2+2, g(3))
computation = EXPR + EXPR
| EXPR - EXPR
| EXPR * EXPR
| EXPR / EXPR
| (EXPR)
Exercise 4 (Easy →Very Very Hard): .g4 files for a ton of real languages can be found here. Implement an interpreter for any one of them.
Please contact [email protected] for any inquiries
P.S. Thank you to Brian Jones, my CS 430 professor, for teaching me a lot of this stuff.
All images by author unless otherwise stated.