1  Introduction to Formal Languages

At its core, a compiler is a translator between two languages: the source language (Python, C#, Java, HULK, etc.) and the target language (Assembly, C, LLVM, MIPS, etc.). Thus, it will pay off to study languages from a computational perspective.

In this first part of the book, we introduce Formal Language Theory, a major field in Computer Science that deals with an abstract notion of language. Formal Language Theory is one of the foundational fields in CS, and its early development during the 1950s and 60s laid the groundwork for many of the most important theoretical results in Computer Science.

So, although our focus in this book is on building compilers, in the next few chapters we will forget about them for a while and just look at languages as mathematical constructions. We will prove a bunch of surprising theorems and discover a breadth of theory that touches upon all parts of Computer Science. Towards the end of this part, we will peek outside formal languages and look at some of the most interesting connections with computability theory, computational complexity, artificial intelligence, and beyond.

But let’s start at the basics.

1.1 What is a language?

Intuitively, a language is just a collection of correct sentences. In natural languages (Spanish, English, etc.), each sentence is made up of words, which have some intrinsic meaning, and there are rules that describe which sequences of words are valid.

Some of these rules, which we often call “syntactic”, are just about the structure of words and sentences, and not their meaning–like how nouns and adjectives must match in gender and number, or how verbs connect to adverbs and other modifiers. Other rules, which we call “semantic”, deal with the valid meanings of collections of words–the reason why the sentence “the salad was happy” is perfectly valid syntactically but makes no sense. In linguistics, the set of rules that determines which sentences are valid is called a “grammar”.

In formal language theory, we want to make all these notions as precise as possible in mathematical terms. To achieve this, we will have to make some simplifications which, ultimately, imply that natural languages fall outside the scope of what formal language theory can fully study. But these simplifications will enable us to define a very robust notion of language about which we can make pretty strong theoretical claims.

So let’s build this definition from the ground up, starting with our notion of words, or, formally, symbols:

Definition 1.1 (Symbol) A symbol is an atomic element that has an intrinsic meaning.

Examples of symbols in abstract languages might be single letters like a, b or c. In programming languages, a symbol might be a variable name, a number, or a keyword like for or class. The next step is to define sentences:

Definition 1.2 (Sentence) A sentence (alternatively called a string) is a finite sequence of symbols.

An example of a sentence formed with the symbols a and b is abba. In a programming language like C# or Python, a sentence can be anything from a single expression to a full program.

One special string is the empty string, which has zero symbols, and will often bite us in proofs. It is often denoted as \(\epsilon\).

We are almost ready to define a language. But first, we need to define a “vocabulary”, which is just a collection of valid symbols.

Definition 1.3 (Vocabulary) A vocabulary \(V\) is a finite set of symbols.

An example of a vocabulary is \(\{ a,b,c \}\), which contains three symbols. In a programming language like Python, a sensible vocabulary would be something like \(\{ \mathrm{for}, \mathrm{while}, \mathrm{def}, \mathrm{class}, ... \}\) containing all keywords, but also symbols like +, ., etc.

What about identifiers?

If you think about our definition of vocabulary for a little bit, you’ll notice we defined it as a finite set of symbols. At the same time, I’m claiming that things like variable and function names, and all identifiers in general, will end up being part of the vocabulary in programming languages. However, there are infinitely many valid identifiers, so… how does that work?

The solution to this problem is that we will actually deal with two different languages, on two different levels. We will define a first language for the tokens, which just determines what types of identifiers, numbers, etc., are valid. Then the actual programming language will be defined in terms of the types of tokens available. So, at that level, all numbers are the same token, and all identifiers as well.
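To make this concrete, here is a toy sketch of that first, token-level step. The patterns and token names are made up for illustration; they are not any real language’s specification:

import re

# A toy tokenizer illustrating the two-level idea: every identifier
# collapses into one token type and every number into another, so the
# token-level vocabulary stays finite.
KEYWORDS = {"for", "while", "def", "class"}

def tokenize(source):
    tokens = []
    for word in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
        if word in KEYWORDS:
            tokens.append(word)           # keywords are their own token
        elif word[0].isdigit():
            tokens.append("NUMBER")       # all numbers become one symbol
        elif word[0].isalpha() or word[0] == "_":
            tokens.append("IDENTIFIER")   # all identifiers become one symbol
        else:
            tokens.append(word)           # punctuation like + or .
    return tokens

# tokenize("def foo(42)") -> ['def', 'IDENTIFIER', '(', 'NUMBER', ')']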

Given a concrete vocabulary, we can then define a language as a (possibly infinite) subset of all the sentences that can be formed with the symbols from that vocabulary.

Definition 1.4 (Language) Given a vocabulary \(V\), a language \(L\) is a set of sentences with symbols taken from \(V\).

Let’s see some examples.

1.2 Examples of languages

To illustrate how rich languages can be, let’s define a simple vocabulary with just two symbols, \(V = \{a,b\}\), and see how many interesting languages we can come up with.

The simplest possible language in any vocabulary is a singleton language whose only sentence is formed by a single symbol from the vocabulary. For example, \(L_a=\{a\}\) or \(L_b = \{b\}\). This is, of course, rather useless, so let’s keep going.

We can also define what’s called a finite language, which is just a collection of a few (or perhaps many) specific strings. For example, \[L_1 = \{bab, abba, ababa, babba\}\]

Note

Since languages are sets, there is no intrinsic order to the sentences in a language. For visualization purposes, we will often sort the sentences in a language from shortest to longest, and then lexicographically, assuming there is a natural order for the symbols. But this is just one arbitrary way of doing it.

Now we can enter the realm of infinite languages. Even when the vocabulary is finite, and each sentence itself is also a finite sequence of symbols, we can have infinitely many different sentences in a language. If you need to convince yourself of this claim, think about the language of natural numbers: every natural number is a finite sequence drawn from just 10 different digits, and yet we have infinitely many natural numbers, because we can always take a number and add a digit at the end to make a new one.

In the same sense, we can have infinite languages simply by concatenating symbols from the vocabulary ad infinitum. The most straightforward infinite language we can make from an arbitrary vocabulary \(V\) is called the universe language, and it’s just the collection of all possible strings one can form with symbols from \(V\).

Definition 1.5 (Universe language) Given a vocabulary \(V\), the universe language, denoted \(V^*\), is the set of all possible strings that can be formed with symbols from \(V\).

An extensional representation of a finite portion of \(V^*\), for \(V = \{a,b\}\), would be:

\[V^* = \{\epsilon,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...\}\]
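This ordering can be turned directly into a small Python generator–a sketch that enumerates \(V^*\) lazily, shortest strings first, then lexicographically:

from itertools import count, product

# Lazily enumerate the universe language V*: all strings of length 0,
# then length 1, then length 2, and so on, in lexicographic order.
def universe(vocabulary):
    for n in count(0):
        for symbols in product(sorted(vocabulary), repeat=n):
            yield "".join(symbols)

# The first seven strings yielded by universe({"a", "b"}) are:
# '', 'a', 'b', 'aa', 'ab', 'ba', 'bb'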

We can now easily see that an alternative definition of a language could be: any subset of the universe language of a given vocabulary \(V\).

Now let’s take it up a notch. We can come up with a gazillion languages just involving \(a\) and \(b\), by concocting different relationships between the symbols. For this, we will need some way to describe languages that doesn’t require listing all of their elements–as they are infinitely many. We can do it with natural language, of course, but in the long run it will pay to be slightly more formal when describing infinite languages.

For example, let \(L_2\) be the language of strings over the vocabulary \(V=\{a,b\}\) that have exactly the same number of \(a\)’s and \(b\)’s.

\[L_2 = \{\epsilon, ab, ba, aabb, abab, abba, baab, baba, bbaa, ...\}\]

We can define it with a bit of mathematical syntactic sugar as follows:

\[L_2 = \{ \omega \in \{a,b\}^* | \#(a,\omega) = \#(b,\omega) \}\]

Let’s unpack this definition. We start by saying \(\omega \in \{a,b\}^*\), which literally parses to “strings \(\omega\) in the universe language of the vocabulary \(\{a,b\}\)”. This is just standard notation to say “strings made out of \(a\) and \(b\)”. Then we add the conditional part \(\#(a,\omega) = \#(b,\omega)\), which should be pretty straightforward: we are using the \(\#(\mathrm{<symbol>},\mathrm{<string>})\) notation to denote the function that counts the occurrences of a given symbol in a string.

\(L_2\) is slightly more interesting than \(V^*\) because it introduces the notion that a formal language is equivalent to a computation. This insight is the fundamental idea that links formal language theory and computability theory, and we will formalize it in the next section. But first, let’s see other, even more interesting languages, to solidify this intuition that languages equal computation.

Let’s define \(L_3\) as the language of all strings in \(V^*\) where the number of \(a\)’s is a prime factor of the number of \(b\)’s. Intuitively, working with this language–e.g., finding valid strings–will require us to reason about primality and divisibility, as any question about \(L_3\) that has different answers for strings in \(L_3\) than for strings not in \(L_3\) must necessarily go through what it means for a number to be a prime factor of another.
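A recognizer for \(L_3\) might look like the following sketch (glossing over the edge case of zero \(b\)’s); the point is that it must embed a primality test:

from math import isqrt

def is_prime(n):
    # Trial division is enough to make the point.
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, isqrt(n) + 1))

def l3(s):
    a, b = s.count("a"), s.count("b")
    # a is a prime factor of b when a is prime and a divides b.
    return is_prime(a) and b % a == 0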

But it gets better. We can define the language of all strings made out of \(a\) and \(b\) such that, when interpreting \(a\) as \(0\) and \(b\) as \(1\), the resulting binary number has any property we want. We can thus codify all problems in number theory as problems in formal language theory.
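For instance, here is a sketch of a recognizer for one such language–strings whose binary reading is a perfect square. The choice of “perfect square” is arbitrary (any numeric property would do), and we arbitrarily let \(\epsilon\) encode zero:

from math import isqrt

# Read a as 0 and b as 1, and accept exactly the strings whose value,
# as a binary number, is a perfect square.
def encodes_perfect_square(s):
    n = int(s.replace("a", "0").replace("b", "1"), 2) if s else 0
    return isqrt(n) ** 2 == n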

And, as you can probably understand already, we can easily codify any mathematical problem, not just those about number theory. Ultimately, we can define a language as the set of strings that are valid input/output pairs for any specific problem we can come up with. Let’s make this intuition formal.

1.3 Recognizing a language

The central problem in formal language theory is called the word problem. Intuitively, it is about determining whether a given string is part of a language or not. Formally:

Definition 1.6 (The Word Problem) Given a language \(L\) on some vocabulary \(V\), the word problem is defined as devising a procedure that, for any string \(\omega \in V^*\), determines whether \(\omega \in L\).

Notice that we didn’t define the word problem simply as “given a language \(L\) and a string \(\omega\), is \(\omega \in L\)?”. Why? Because we might be able to answer that question correctly only for some \(\omega\), but not all. Instead, the word problem is about coming up with an algorithm that answers for all possible strings \(\omega\)–technically, a procedure, which is not exactly the same thing; we will see the details in the chapter on computability.
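In programming terms, a solution to the word problem for \(L\) is a total function from strings to booleans. A minimal way to express this contract in Python (just a hypothetical type alias, nothing more):

from typing import Callable

# A recognizer must answer True or False for *every* string in V*,
# not just for some inputs–that totality is what the definition demands.
Recognizer = Callable[[str], bool]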

The word problem is the most important question in formal language theory, and one of the central problems in computer science in general. So much so that we actually classify languages (and, by extension, all computer science problems) according to how easy or hard it is to solve their related word problem.

In the next few chapters, we will review different classes of languages that have certain common characteristics which make them, in a sense, equally complex. But first, let’s see what it would take to solve the word problem in our example languages.

Solving the word problem in any finite language is trivial: you only need to compare the input against each of the strings in the language. The word problem becomes way more interesting when we have infinite languages. In these cases, we need to define a recognizer mechanism, that is, some sort of computational algorithm or procedure to determine whether any particular string is part of the language.
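For instance, a recognizer for the finite language \(L_1\) from before reduces to a plain set membership check:

# The finite language L1 from before, stored as a Python set.
L1 = {"bab", "abba", "ababa", "babba"}

def l1(s):
    # For a finite language, the word problem is just set membership.
    return s in L1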

Moving to infinite languages, \(L_2\) has a very simple solution to the word problem. The following Python program gets the job done:

def l2(s):
    # Count the occurrences of each symbol.
    a, b = 0, 0

    for c in s:
        if c == "a":
            a += 1
        else:
            b += 1

    # The string belongs to L2 iff both counts match.
    return a == b

One fundamental question in formal language theory is not only coming up with a solution to the word problem for a given language but, actually, coming up with the simplest solution–for a very specific definition of simple: how much you need to remember.

For example, we can solve \(L_2\) with \(O(n)\) memory. That is, we need to remember something that is proportional to how many \(a\)’s or \(b\)’s are in the string. And we cannot do it with less, as we will prove a couple of chapters down the road.
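One way to see what must be remembered: an equivalent recognizer can keep just a single running balance, but that balance can grow as large as the string itself:

# An equivalent recognizer for L2 that keeps a single running balance:
# +1 for each a, -1 for each b. The balance can grow in proportion to
# the string's length, so the state we must remember still grows with n.
def l2_balance(s):
    balance = 0
    for c in s:
        balance += 1 if c == "a" else -1
    return balance == 0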

Now, let’s turn to the opposite problem–that of generating strings from a given language–and ask what connection, if any, exists between the two.

1.4 Generating a language