BNFGen

BNFGen is a random text generator based on context-free grammars. Its goal is to make it easy to write and share grammars, and to give the user total control and insight into the generation process. The ultimate goal is to bring generative grammar tools to people other than programmers: writers, linguists, and anyone who may be interested in generating languages, whether it's for parser fuzzing, research, worldbuilding, or pure amusement.

How does it achieve those goals? First, it uses a DSL for grammar definitions that is similar to the familiar BNF, with two extensions: weighted random selection and deterministic repetition.

Second, it offers good syntax error reporting for that language. In most cases you get a helpful error message if you get it wrong. For example:

$ cat bad.bnf
<start> ::= ;

$ bnfgen ./bad.bnf 
Could not load grammar from ./bad.bnf.
Syntax error on line 1, character 13: The right-hand side of a rule is empty or starts with an empty alternative.
Empty rules (<foo> ::= ;) and empty alternatives (<foo> ::= | "foo") are not allowed.
Example of a valid rule: <start> ::= <foo> | <bar> ;

Finally, there are options to enable detailed debugging output that shows every reduction step and the program state so that you can easily find out what exactly is going on.

Demo

This is a pure client-side demo with a few sample grammars and an option to load your own from file.

Sample grammars:
Start symbol: Symbol separator:
Maximum reduction steps: Maximum non-productive steps:
Debug: Dump symbol stack:
Output:
 

Installation

BNFGen is two packages and three different things. It's a CLI tool, an OCaml library, and a JS library (cross-compiled from OCaml using js_of_ocaml).

If you are only interested in the CLI tool, you may want to download precompiled binaries for Linux (statically linked), Windows (64-bit), or macOS (x86-64).

If you want the OCaml library along with the CLI tool, or just happen to have the OCaml toolchain installed, you may want to install from OPAM: opam install bnfgen.

Finally, if you want the JS version, you can get it from NPM: npm install bnfgen.

Grammar definition syntax

  • Terminals are in single or double quotes ("foo", 'foo')
  • Nonterminals are in angle brackets (<baz>, <xyzzy_quux>)
  • Rule example: <start> ::= "foo" | <bar>
  • Rules are separated by semicolons: <start> ::= "foo" | <bar>; <bar> ::= "bar"
  • Any rule may refer to any symbol, including itself: <start> ::= "foo" <start> | "foo"
  • You can add comments in the shell style, any line starting with # is ignored by the parser.

Weighted random selection

Repetition is canonically expressed via recursion. For example, “string "foo" repeated one or more times” is written <start> ::= "foo" <start> | "foo". The problem with this approach for driving language generators is that there's a 50% chance that the process will pick the non-recursive alternative right at the first step and terminate immediately.

BNFGen fixes this problem by allowing weighted random rules. For example, in <start> ::= 10 "foo" <start> | "foo", the first (recursive) alternative will be taken ten times more frequently, so that longer sequences are more likely.

You can use that to express different “word frequencies” for terminals as well: <start> ::= 20 "foo" | "bar".

Repetition ranges

Sometimes you may want to set an explicit limit. For example, if you are generating names for imaginary people, you may want to say “a person may have no more than three middle names”. This is expressible in the normal BNF, but can be very tedious to write.

BNFGen gives you a shortcut that looks like the repetition syntax in regular expressions. You can write <FooFooFoo> ::= "foo"{3} (exactly three of foo) or <foos> ::= "foo"{1,3} (from one to three). The rule for names may look like this: <personName> ::= <firstName> <middleName>{0,3} <lastName>

CLI tool usage

Usage: bnfgen [OPTIONS] <BNF file>
  --dump-rules  Dump production rules and exit
  --separator <string>  Token separator for generated output, default is space
  --start <string>  Start symbol, default is "start"
  --max-reductions <int>  Maximum reductions, default is infinite
  --max-nonproductive-reductions <int>  Maximum number of reductions that don't produce a terminal, default is infinite
  --debug  Enable debug output (symbols processed, alternatives taken...)
  --dump-stack  Show symbol stack for every reduction (implies --debug)
  --version   Print version and exit
  -help   Display this list of options
  --help  Display this list of options

Running bnfgen --debug --dump-stack will make it log every reduction step and show you the current symbol stack, so that you know what it's doing and can see where your grammar is looping or growing out of control.

OCaml library usage example


# let g = Bnfgen.grammar_from_string " <greeting> ::= \"hello\" | \"hi\" ; <start> ::= <greeting> \"world\"; " |> Result.get_ok ;;
val g : Bnfgen.Grammar.grammar =
  [("greeting",
    [{Bnfgen.Grammar.weight = 1; symbols = [Bnfgen.Grammar.Terminal "hi"]};
     {Bnfgen.Grammar.weight = 1; symbols = [Bnfgen.Grammar.Terminal "hello"]}]);
   ("start",
    [{Bnfgen.Grammar.weight = 1;
      symbols =
       [Bnfgen.Grammar.Nonterminal "greeting"; Bnfgen.Grammar.Terminal "world"]}])]

# Bnfgen.generate_string ~settings:({Bnfgen.default_settings with symbol_separator=" "}) g "start" ;;
- : (string, string) result = Ok "hello world "

# Bnfgen.generate ~settings:({Bnfgen.default_settings with symbol_separator=""}) print_endline g "start" ;;
hello
world
- : (unit, string) result = Ok ()

JS library usage example


> var bnfgen = require('bnfgen');

> bnfgen.loadGrammar(' <greeting> ::= "hello" | "hi"; <start> ::= <greeting> "world"; ');

> bnfgen.symbolSeparator = " ";

> bnfgen.generate('start')
'hello world '

Grammar syntax

Tutorial

If you don't know what's BNF or context-free grammar but still found this page somehow and now you are curious, here's a quick and informal introduction that I hope will get you interested in the subject. Don't worry, I'm not an expert myself.

Recognizing languages is much more important than generating them, so most of the time people get introduced to formal languages by writing lexers and parsers. That's much less illustrative, and, let's admit it, much less fun, and many of them give up or limit themselves to someone else's recipes without learning the underlying concepts. It doesn't have to be that way—formal languages are fun.

BNFGen's primary purpose is to produce test data for other programs, but the same technique can be used for generating amusing texts or fake research papers that look plausible enough to be accepted by unscrupulous conference committees.

Formal languages

A formal language is a language whose grammar is defined precisely. In natural languages we can have endless arguments whether singular they or double negatives are grammatical, but in a formal language it's known exactly what's grammatical and what isn't. All computer languages including programming languages and file formats are examples of formal languages.

Symbols

A formal language is built upon a set of symbols. The meaning of that word is different from everyday use: it can be a letter, a word, or a larger language unit such as a noun phrase or a whole sentence.

Symbols that, from a particular grammar's point of view, do not consist of anything, are called terminal symbols. Compound symbols that can refer to other symbols are called non-terminal.

For example, in English, any infinitive looks like “to VERB”. “To” is a terminal symbol—it always stays the same. The VERB part is a non-terminal because you can put any verb there and it's still a valid infinitive.

Production (rewrite) rules

Grammars are defined in terms of production rules that describe all valid ways to reduce (rewrite) non-terminal symbols to terminals. Every grammar has a start symbol—the largest language unit it means to describe.

Consider a language of single-word sentences that can be “Yes” or “No”. Those words are its terminal symbols, and the sentence is a non-terminal symbol. In a blackboard notation, non-terminals are often represented by capital letters, while terminals are represented by lowercase letters, and the sides of a rewrite rule are separated by an arrow.

S → y (a sentence can be the terminal y, assumed to mean yes)
S → n (a sentence can also be the terminal n, assumed to mean no)

Backus-Naur form

Programmers usually use more consice and keyboard-friendly conventions, such as BNF (Backus-Naur form) named after two great computer scientists. They created some of the first programming languages, Fortran and Algol respectively. BNF itself is not formally defined, it's only a convention that says that non-terminals should be in angle brackets, terminals should be in quotes, that the | character means “or”, and that ::= means “is defined as”

Let's define our yes/no language grammar in BNF. All examples below are valid BNFGen inputs and you can paste them into the form.

<start> ::= "Yes" | "No" ;

Now let's try to define a grammar for “to be or not to be” sentences.

<start> ::= <infinitive> "or not" <infinitive> ;

<infinitive> ::= "to" <verb> ;

<verb> ::= "be" | "do" | "compute"

Note that “or not” is a single terminal symbol, since for the purpose of this grammar we are not interested in its internal structure, much like we weren't interested in the letters that “yes” and “no” consist of in our previous example.

Hitting the “generate” button a few times proves that it's not following the Hamlet's pattern—the verbs can be different.

Language complexity

This demonstrates the idea of different language complexity classes. The yes/no language is finite—there are only two valid sentences in it.

The language of sentences like “to be or not to be” is context-sensitive—if we require that the verb is the same. “To be or not to do” would not be valid. When we pick the first verb, we create a context that limits what may come next.

It seems very simple to humans, but that grammar would be quite tricky to define in terms of production rules, and it would take a lot more computational effort to check if a sentence is valid or not since you need to remember what the previous symbols were.

BNFGen works with context-free grammars. Context-free means what came before a symbol is not limiting what may come after it. This is why there can be only one symbol on the left-hand side of a rule and it must be a non-terminal. A rule like "a" <A> ::= ... would only apply to symbol A when it follows “a”. BNFGen doesn't remember what came before a symbol, so it cannot do that. While it seems limiting, most computer languages are in fact context-free—it's a complexity sweet spot where you can express a lot, but don't run into serious trouble reading or producing it automatically.

Infinite languages and recursive rules

Our actual grammar for those sentences is limited in another way though— since there are only three possible verbs, the language it defines is also finite. How to define an infinite language? The key point is that a non-terminal symbol definition may refer to any symbol, including itself. Some BNF extensions include special syntax for “one or more” and “zero or more”, but they are really just shortcuts for recursive rules.

This is how we can define a language whose sentences consist of one or more letters a:

<start> ::= "a" <start> | "a" ;

If we take the second alternative, the rewriting process immediately terminates and we end up with just one a. However, if we take the first alternative, we add one a to our sentence, and then we have two alternatives: either to add another a and stop, or take the first alternative again and repeat that process.

Of course, it's possible to accidentally create a rule will never produce anything. This grammar will go deeper and deeper until you get a “too much recursion” error:

<start> ::= "a" <start> | <start> ;

BNFGen makes no attempt to check for non-terminating rules, even though it's possible and there are grammar analysis tools that can do it for you.

In practice, a common problem with well-formed recursive rules is that if you select alternatives randomly, they terminate too early and produce uninteresting texts. For this reason, BNFGen allows you to specify how often each alternative should be taken.

For example:

<letter> ::= 10 "a" | 5 "b" | "c" ;

You can combine it with recursive rules to make sure they produce longer strings:

<start> ::= 100 "a" <start> | <start> ;

Further reading

On certain formal properties of grammars
Noam Chomsky's paper that started it all.
Introduction to Automata Theory, Languages, and Computation
A classic textbook on the subject
L-system image generator
You can make pretty pictures with grammars too.
This page was last modified: