BNFGen

This is an online version of BNFGen, yet another random text generator based on context-free grammars. Here are the highlights:

  • Familiar BNF input and examples.
  • Allows you to choose how often each alternative is taken by adding a “weight” and tweak your grammars for more realistic output. For example, <x_or_y> ::= 10 "x" | "y" ; will make x appear ten times more often than y.
  • Supports comments in grammar files (shell style).
  • Good syntax error reporting.

Grammar syntax

If you are already familiar with formal languages, here's a quick summary of the BNF variant used by BNFGen. If not, there's a formal grammar tutorial below the input form.

  • Terminals are written in single or double quotes ('foo' or "foo").
  • Non-terminals are in angle brackets (<foo>, <_foo_bar0>).
  • Rules are separated by semicolons.
  • Empty alternatives (<foo> ::= | "a" | "b"; are not allowed.
  • The default starting symbol is <start>, but that's configurable.
  • You can specify the symbol separator if you don't want to insert all spaces by hand.
Start symbol: Symbol separator:
Output:
 

Tutorial

If you don't know what's BNF or context-free grammar but still found this page somehow and now you are curious, here's a quick and informal introduction that I hope will get you interested in the subject. Don't worry, I'm not an expert myself.

Recognizing languages is much more important than generating them, so most of the time people get introduced to formal languages by writing lexers and parsers. That's much less illustrative, and, let's admit it, much less fun, and many of them give up or limit themselves to someone else's recipes without learning the underlying concepts. It doesn't have to be that way—formal languages are fun.

BNFGen's primary purpose is to produce test data for other programs, but the same technique can be used for generating amusing texts or fake research papers that look plausible enough to be accepted by unscrupulous conference committees.

Formal languages

A formal language is a language whose grammar is defined precisely. In natural languages we can have endless arguments whether singular they or double negatives are grammatical, but in a formal language it's known exactly what's grammatical and what isn't. All computer languages including programming languages and file formats are examples of formal languages.

Symbols

A formal language is built upon a set of symbols. The meaning of that word is different from everyday use: it can be a letter, a word, or a larger language unit such as a noun phrase or a whole sentence.

Symbols that, from a particular grammar's point of view, do not consist of anything, are called terminal symbols. Compound symbols that can refer to other symbols are called non-terminal.

For example, in English, any infinitive looks like “to VERB”. “To” is a terminal symbol—it always stays the same. The VERB part is a non-terminal because you can put any verb there and it's still a valid infinitive.

Production (rewrite) rules

Grammars are defined in terms of production rules that describe all valid ways to reduce (rewrite) non-terminal symbols to terminals. Every grammar has a start symbol—the largest language unit it means to describe.

Consider a language of single-word sentences that can be “Yes” or “No”. Those words are its terminal symbols, and the sentence is a non-terminal symbol. In a blackboard notation, non-terminals are often represented by capital letters, while terminals are represented by lowercase letters, and the sides of a rewrite rule are separated by an arrow.

S → y (a sentence can be the terminal y, assumed to mean yes)
S → n (a sentence can also be the terminal n, assumed to mean no)

Backus-Naur form

Programmers usually use more consice and keyboard-friendly conventions, such as BNF (Backus-Naur form) named after two great computer scientists. They created some of the first programming languages, Fortran and Algol respectively. BNF itself is not formally defined, it's only a convention that says that non-terminals should be in angle brackets, terminals should be in quotes, that the | character means “or”, and that ::= means “is defined as”

Let's define our yes/no language grammar in BNF. All examples below are valid BNFGen inputs and you can paste them into the form.

<start> ::= "Yes" | "No" ;

Now let's try to define a grammar for “to be or not to be” sentences.

<start> ::= <infinitive> "or not" <infinitive> ;

<infinitive> ::= "to" <verb> ;

<verb> ::= "be" | "do" | "compute"

Note that “or not” is a single terminal symbol, since for the purpose of this grammar we are not interested in its internal structure, much like we weren't interested in the letters that “yes” and “no” consist of in our previous example.

Hitting the “generate” button a few times proves that it's not following the Hamlet's pattern—the verbs can be different.

Language complexity

This demonstrates the idea of different language complexity classes. The yes/no language is finite—there are only two valid sentences in it.

The language of sentences like “to be or not to be” is context-sensitive—if we require that the verb is the same. “To be or not to do” would not be valid. When we pick the first verb, we create a context that limits what may come next.

It seems very simple to humans, but that grammar would be quite tricky to define in terms of production rules, and it would take a lot more computational effort to check if a sentence is valid or not since you need to remember what the previous symbols were.

BNFGen works with context-free grammars. Context-free means what came before a symbol is not limiting what may come after it. This is why there can be only one symbol on the left-hand side of a rule and it must be a non-terminal. A rule like "a" <A> ::= ... would only apply to symbol A when it follows “a”. BNFGen doesn't remember what came before a symbol, so it cannot do that. While it seems limiting, most computer languages are in fact context-free—it's a complexity sweet spot where you can express a lot, but don't run into serious trouble reading or producing it automatically.

Infinite languages and recursive rules

Our actual grammar for those sentences is limited in another way though— since there are only three possible verbs, the language it defines is also finite. How to define an infinite language? The key point is that a non-terminal symbol definition may refer to any symbol, including itself. Some BNF extensions include special syntax for “one or more” and “zero or more”, but they are really just shortcuts for recursive rules.

This is how we can define a language whose sentences consist of one or more letters a:

<start> ::= "a" <start> | "a" ;

If we take the second alternative, the rewriting process immediately terminates and we end up with just one a. However, if we take the first alternative, we add one a to our sentence, and then we have two alternatives: either to add another a and stop, or take the first alternative again and repeat that process.

Of course, it's possible to accidentally create a rule will never produce anything. This grammar will go deeper and deeper until you get a “too much recursion” error:

<start> ::= "a" <start> | <start> ;

BNFGen makes no attempt to check for non-terminating rules, even though it's possible and there are grammar analysis tools that can do it for you.

In practice, a common problem with well-formed recursive rules is that if you select alternatives randomly, they terminate too early and produce uninteresting texts. For this reason, BNFGen allows you to specify how often each alternative should be taken.

For example:

<letter> ::= 10 "a" | 5 "b" | "c" ;

You can combine it with recursive rules to make sure they produce longer strings:

<start> ::= 100 "a" <start> | <start> ;

Further reading

On certain formal properties of grammars
Noam Chomsky's paper that started it all.
Introduction to Automata Theory, Languages, and Computation
A classic textbook on the subject
L-system image generator
You can make pretty pictures with grammars too.
This page was last modified: