soupault

Soupault (soup-oh) is a static website generator based on HTML rewriting.

Pretty much like client-side DOM manipulation, just without the browser or interactivity. You can tell it to do things like “insert contents of footer.html file into <div id="footer"> no matter where that div is in the page”, or “use the first <h1> element in the page for the page title”.

Full HTML awareness
Soupault is aware of the full page structure. It can find and manipulate any element using CSS3 selectors like p, div#content, or a.nav, no matter where it is in the page. There is no need for “front matter”: all metadata can be extracted from the HTML itself.
No themes
Any page can be a soupault theme, you just tell it where to insert the content. By default it inserts page content into the <body> element, but it can be anything identifiable with a CSS selector. You decide how much is the “theme” and how much is the content.
Built for text-heavy, nested websites
Tables of contents, footnotes, and breadcrumbs are supported out of the box. Existing element ids are respected, so you can make table of contents and footnote links permanent and immune to link rot, even if you change the heading text or page structure.
Preserves your content structure
Soupault can mirror your directory structure exactly, down to file extensions. You can migrate from an old handcrafted website without breaking any links.
Any input format
HTML is the primary format, but you can use anything that can be converted to HTML. Just specify preprocessors for files with certain extensions.
Easy to install
Soupault is a single executable file with no dependencies. Just unpack the archive and it's ready to use.
Fast
Soupault is written in OCaml and compiled to native code on all platforms. This website takes less than half a second to build.
Extensible
You can add your own HTML rewriting logic with Lua scripts.

Quick links:

Download latest release:
github.com/dmbaturin/soupault/releases/latest
Git repo:
github.com/dmbaturin/soupault

Soupault is free software published under the MIT license.

Prebuilt executables are available for Linux (x86-64, statically linked), macOS, and Microsoft Windows (7 and later, 32-bit).

This website is made with soupault and you can use its source as an example: github.com/dmbaturin/baturin.org.

For a very quick start:

  1. Create a directory for your site.
  2. Create a page layout with an empty <body> in templates/main.html.
  3. Create some files with content to insert inside <body> and drop them into site/.
  4. Run soupault in that directory.
  5. Look in build/.

For details, read the documentation below.


Overview

Soupault is named after the French dadaist and surrealist writer Philippe Soupault because it's based on the Lambda Soup library, which is a reference to the Beautiful Soup library, which is a reference to the Mock Turtle chapter from Alice in Wonderland (the best book on programming for the layman according to Alan Perlis), which is a reference to an actual turtle soup imitation recipe, and also a reference to tag soup, a derogatory term for malformed HTML, which is a reference to... in any case, soupault is not the French for a stick horse, for better or worse, and this paragraph is the only nod to dadaism in this document.

Soupault is quite different from other website generators. Its design goals are:

  1. No special syntax inside pages, only plain old semantic HTML.
  2. No templates or themes. Any HTML page can be a soupault theme.
  3. No front matter.

People have written lots of static website generators, but most of them are variations on a single theme: take a file with some metadata (front matter) followed by content in a limited markup format such as Markdown and feed it to a template processor.

This approach works well in many cases, but has its limitations. A template processor is only aware of its own tags, but not of the full page structure. The front matter metadata is the only part of the page source it can use, and it can only insert it into fixed locations in the template. The part below the front matter is effectively opaque, and formats like Markdown simply don't allow you to add machine-readable annotations anyway.

Well-formed HTML, however, is a machine-readable format, and libraries that can handle it have existed for a long time. As shown by microformats, you can embed a lot of information in it. More importantly, unlike front matter metadata, HTML metadata is multi-purpose: for example, you can use the id attribute for CSS styling and as a page anchor, as well as a unique element identifier for data extraction or HTML manipulation programs.

With soupault, it's possible to take advantage of the full HTML markup and even make every page on your website look different rather than built from the same template, and still have an automated workflow.

You can also use Markdown/reStructuredText/whatever if you specify preprocessors. Soupault will automatically run a preprocessor before parsing your page, though you'll miss some finer points like footnotes if you go that way.

How soupault works

Soupault takes a page “template”— an HTML file devoid of content, parses it into an element tree, and locates the content container element inside it.

By default the content container is <body>, but you can use any selector: div#content (a <div id="content"> element), article (an HTML5 article element), #post (any element with id="post") or any other valid CSS selector.

Then it traverses your site directory where page source files are stored, takes a page file, parses it into an HTML element tree too, and inserts it into the content container element of the template.

The new HTML tree is then passed to widgets, HTML rewriting modules that manipulate it in different ways: they can include other files or the output of external programs into specific elements, create breadcrumbs for your page, or delete unwanted elements.

Processed pages are then written to disk, into a directory structure that mirrors your source directory structure.

Performance

Despite its heavy-handed approach, soupault is reasonably fast. With a config that includes ToC, footnotes, file inclusion, breadcrumbs, and built-in section index generator, it can process 1000 copies of its own documentation page in about 14 seconds [1]. Small websites take less than a second to build.

For comparison, in a simplified “read-parse-prettify-write” test with 1000 copies of this document, CPython/BeautifulSoup takes about 20 seconds to complete.

Why use soupault?

If you are starting a website

All website generators provide a default theme, but making new themes can be tricky. With soupault, there are no intermediate steps between writing your page layout and building your website from it.

In the simplest case you can just create a page skeleton in templates/main.html with an empty <body>, add a bunch of pages to site/ (site/about.html, site/cv.html, ...) and run the soupault command in that directory.

If you already have a website

If you have a website made of handwritten pages, you can use soupault as a drop-in automation tool. Just take your existing page skeleton, strip the other pages down to their content, and you are good to go.

If you have an existing directory structure that you don't want to change because it will break links, soupault can mirror it exactly, with the clean_urls = false config option. It also preserves original file names and extensions in that mode.

What kind of rewriting it can do

Soupault comes with a bunch of built-in widgets that can:

  • Include external files, HTML snippets, or output of external programs into an element.
  • Set the page title from text in some element.
  • Generate breadcrumbs.
  • Move footnotes out of the text.
  • Insert a table of contents.
  • Delete unwanted elements.

For example, here's a config for the title widget that sets the page title.

[widgets.page-title]
  widget = "title"
  selector = "#title"
  default = "My Website"
  append = " &mdash; My Website"

It takes the text from the element with id="title" and copies it to the <title> tag of the generated page. It can be any element, and it can be a different element in every page. If you use <h1 id="title"> in site/foo.html and <strong id="title"> in another, soupault will still find it.

It's just as simple to prevent something from appearing on a particular page. Just don't use an element that a widget uses for its target, and the widget will not run on that page.

Another example: to automatically include the content of templates/nav-menu.html into the <nav> element, you can put this into your soupault.conf file:

[widgets.nav-menu]
  widget = "include"
  selector = "nav"
  file = "templates/nav-menu.html"

What soupault does not include

By design:

Development web server
There are plenty of those, even python3 -m http.server is perfectly good for previews.
Deployment automation
Same reason, there are lots of tools for it.

Because I don't need it and I'm not sure if anyone wants it or how it will fit:

Asset management
Incremental builds
Multilingual sites

Installation

Binary release packages

Soupault is distributed as a single, self-contained executable, so installing it from a binary release package is trivial.

You can download the latest release from github.com/dmbaturin/soupault/releases/latest. Prebuilt executables are available for Linux (x86-64, statically linked), macOS (x86-64), and Microsoft Windows (32-bit, Windows 7 and newer).

Just unpack the archive and copy the executable wherever you want.

Soupault is now stable enough to build this website, but hasn't received much testing from other people yet. Prebuilt binaries are compiled with debug symbols, which makes them a couple of megabytes larger than they could be, but gives you better error messages if something goes wrong. If you encounter an internal error, you can get an exception trace by running soupault with the OCAMLRUNPARAM=b environment variable.

Building from source

If you are familiar with the OCaml programming language, you may want to install from source [2].

Since Lua-ML, the Lua interpreter that soupault uses for executing plugins, is not in the OPAM repository yet, you need to install it first.

opam pin add lua-ml git+https://github.com/lindig/lua-ml
opam pin add soupault git+https://github.com/dmbaturin/soupault

Due to library dependencies, at the time of writing soupault doesn't build with OCaml 4.08 yet. You should build it with 4.07 until the libraries catch up.

Getting started

Create your first website

Soupault has only one file of its own: the config file soupault.conf. It does not impose any particular directory layout on you. However, it has default settings that allow you to run it unconfigured.

You can initialize a simple project with default configuration using this command:

$ soupault --init

It will create the following directory structure:

.
├── site
│   └── index.html
├── templates
│   └── main.html
└── soupault.conf

  • site/ is the site directory where page files are stored.
  • The templates/ directory is just a convention; soupault uses templates/main.html as the default page template.
  • soupault.conf is the config file.

Now you can build your website. Just run soupault in your website directory, and it will put the generated pages in build/. Your index page will become build/index.html.

By default, soupault inserts page content into the <body> element of the page template. Therefore, from the default template:

<html>
  <head></head>
  <body>
  <!-- your page content here -->
  </body>
</html>

and the default index page source that is <p>Site powered by soupault</p> it will make this page:

<!DOCTYPE html>
<html>
 <head></head>
 <body>
  <p>
   Site powered by soupault
  </p>
 </body>
</html>

You can use any CSS selector to determine where your content goes. For example, you can tell soupault to insert it into <div id="content"> by changing the content_selector in soupault.conf to content_selector = "div#content".

Page files

Soupault assumes that files with extensions .html, .htm, .md, and .rst are pages and processes them. All other files are simply copied to the build directory.

If you want to use other extensions, you can change it in soupault.conf. For example, to add .txt to the page extension list, use this option:

[settings]
  page_file_extensions = ["htm", "html", "md", "rst", "txt"]

Clean URLs

Soupault uses clean URLs by default. If you add another page to site/, for example, site/about.html, it will turn into build/about/index.html so that it can be accessed as https://mysite.example.com/about.

Index files (by default, files whose name is index) keep their place: they are written to the corresponding target directory rather than to a subdirectory of their own.

site/index.html → build/index.html
site/about.html → build/about/index.html
site/papers/theorems-for-free.html → build/papers/theorems-for-free/index.html

Note: having both a page named foo.html and a section directory named foo/ is undefined behaviour when clean URLs are used. Avoid that combination to prevent unpredictable results.

This is what soupault will make from a source directory:

$ tree site/
site/
├── about.html
├── cv.html
└── index.html

$ tree build/
build/
├── about
│   └── index.html
├── cv
│   └── index.html
└── index.html

Disabling clean URLs

If you've had a website for a long time and there are links to your page that will break if you change the URLs, you can make soupault mirror your site directory structure exactly and preserve original file names.

Just add clean_urls = false to the [settings] table of your soupault.conf file.

[settings]
  clean_urls = false

Nested structures

A flat layout is not always desirable. If you want to create site sections, just add some directories to site/. Subdirectories are subsections, their subdirectories are subsubsections and so on—they can go as deep as you want. Soupault will process them all recursively and recreate the directories in build/.

site/
├── articles
│   ├── goto-considered-harmful.html
│   ├── index.html
│   └── theorems-for-free.html
├── about.html
├── cv.html
└── index.html

build/
├── about
│   └── index.html
├── articles
│   ├── goto-considered-harmful
│   │   └── index.html
│   ├── theorems-for-free
│   │   └── index.html
│   └── index.html
├── cv
│   └── index.html
└── index.html

Note that if your section does not have an index page, soupault will not create it automatically. If you want a page to exist, you need to make it.
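
The mapping shown in the trees above can be summed up in a few lines of Python. This is only a sketch of the documented behaviour, not soupault's own code (soupault is written in OCaml). The output_path function and its parameter names are made up for illustration, and what happens to preprocessed pages such as .md files when clean URLs are disabled is not covered here.

```python
from pathlib import PurePosixPath


def output_path(page, clean_urls=True, site_dir="site", build_dir="build",
                index_page="index", index_file="index.html"):
    """Map a page source path to the path of the generated page."""
    relative = PurePosixPath(page).relative_to(site_dir)
    target_dir = PurePosixPath(build_dir) / relative.parent
    if not clean_urls:
        # Mirror the source tree exactly, file name and extension included.
        return str(target_dir / relative.name)
    if relative.stem == index_page:
        # Index pages stay in their own directory.
        return str(target_dir / index_file)
    # Every other page gets a directory named after it.
    return str(target_dir / relative.stem / index_file)


for page in ["site/index.html", "site/about.html",
             "site/articles/theorems-for-free.html"]:
    print(page, "->", output_path(page))
```

Calling output_path("site/about.html", clean_urls=False) gives build/about.html, which is the mirroring mode described under "Disabling clean URLs".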

Configuration

The default directory and file paths that soupault --init creates are not fixed; you can change any of them. If you prefer different names, or you have an existing directory structure you want soupault to use, just edit the soupault.conf file.

[settings]
  # Where generated files go
  build_dir = "build"

  # Where page files are stored
  site_dir = "site"

  # Where to insert the content in the page template
  content_selector = "#content"

  default_template = "templates/main.html"

  # Name of the page used as a section index,
  # without extension
  index_page = "index"

  # Output file name for the "clean URLs" page files
  # site/foo.html → build/foo/index.html
  index_file = "index.html"

  doctype = "<!DOCTYPE html>"

  # Don't print debugging information
  verbose = false

  # Don't fail on page processing errors
  strict = false

  # Create a directory per page, site/foo.html → build/foo/$index_file
  clean_urls = true

  # What files are considered pages and processed
  page_file_extensions = ["htm", "html", "md", "rst"]

Note that if you create soupault.conf file before running soupault --init, it will use settings from that file instead of default settings.

In this document, whenever a specific site or build dir has to be mentioned, we'll use default values.

Note that there is no check for invalid fields. Soupault will simply ignore fields it doesn't know about. A downside is that it doesn't detect misspelled options, so check your spelling carefully.

The config is also typed, and a value of the wrong type has the same effect as a missing option. All boolean values must be true or false (without quotes), all integer values must be written without quotes around the number, and all strings must be in single or double quotes.
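
To make the typing rule concrete, here is a minimal Python sketch of a lookup where a wrongly typed value behaves like a missing one. It only illustrates the rule stated above. The get_option helper and the sample settings dictionary are hypothetical and are not part of soupault.

```python
def get_option(settings, name, expected_type, default):
    """Return settings[name] if it has the expected type, else the default.

    Hypothetical helper: a value of the wrong type is treated exactly
    like a missing option.
    """
    value = settings.get(name)
    # bool is a subclass of int in Python, so reject booleans explicitly
    # when an integer is expected.
    if expected_type is int and isinstance(value, bool):
        return default
    if isinstance(value, expected_type):
        return value
    return default


# A parsed [settings] table where clean_urls was mistakenly quoted.
settings = {"clean_urls": "false", "verbose": True, "build_dir": "build"}

# The quoted "false" is a string, so the default (True) is used instead.
print(get_option(settings, "clean_urls", bool, True))
print(get_option(settings, "verbose", bool, False))
print(get_option(settings, "build_dir", str, "build"))
```

The first lookup is the trap to watch for: clean_urls = "false" in quotes does not disable clean URLs, because the string is ignored and the default stays in effect.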

Page preprocessors

Soupault has no built-in support for formats other than HTML, but you can use any format with it if you specify an appropriate page preprocessor.

Any preprocessor that takes a page file as its argument and writes the result to stdout can be used.

For example, this configuration will make soupault call a program called cmark on the page file if its extension is .md.

[preprocessors]
  md = "cmark"

The table key can be any extension (without the dot), and the value is a command. You can specify as many extensions as you want.

Preprocessor commands are executed in a shell, so it's fine to use relative paths and add arguments. The page file name will be appended to the command string.
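
The preprocessor contract can be sketched in Python as follows. This is an illustration under assumptions, not soupault's implementation: the function names are hypothetical, and quoting the file name with shlex.quote is a precaution added by this sketch, not documented behaviour.

```python
import os
import shlex
import subprocess


def preprocessor_command(preprocessors, page_file):
    """Build the shell command for a page file, or return None.

    `preprocessors` maps extensions (without the dot) to commands,
    like the [preprocessors] table.
    """
    extension = os.path.splitext(page_file)[1].lstrip(".")
    command = preprocessors.get(extension)
    if command is None:
        return None
    # The page file name is appended to the command string.
    return f"{command} {shlex.quote(page_file)}"


def load_page_source(preprocessors, page_file):
    """Return page HTML: the preprocessor's stdout, or the file as is."""
    command = preprocessor_command(preprocessors, page_file)
    if command is None:
        with open(page_file, encoding="utf-8") as f:
            return f.read()
    result = subprocess.run(command, shell=True, check=True,
                            capture_output=True, text=True)
    return result.stdout


print(preprocessor_command({"md": "cmark"}, "site/about.md"))
print(preprocessor_command({"md": "cmark"}, "site/index.html"))
```

Any program that follows the same convention (file name as the last argument, HTML on stdout) can be plugged in the same way.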

Automatic section and site index

Having to add links to all pages by hand can be a tedious task. Nothing beats a carefully written and annotated section index, but it's not always practical.

Soupault can automatically generate a section index for you. While it's not a blog generator and doesn't have built-in features for generating indices of pages by date, category etc., it can save you the time of writing section index pages by hand.

Metadata is extracted directly from pages using selectors you specify in the config. It's more than possible to use a different element for excerpt in every page, not just the first paragraph, without having to duplicate it in the “front matter”. It doesn't even have to be text either. Same goes for other fields.

To use automatic indexing, you still need an index page in your section. It can be empty, but it must be there. Default index page name is index, so you should make a page like site/articles/index.html first.

Then enable indexing in the config. All indexing options are in the [index] table.

[index]
  index = true

By default, soupault will append the index to the <body> element. You can tell it to insert it anywhere you want with the index_selector option, e.g. index_selector = "div#index".

There are a few configurable options. You can specify element selectors for page title, excerpt, date, and author.

These are all available options:

[index]
  # Whether to generate indices or not
  # Default is false, set to true to enable
  index = false

  # Where to insert the index
  index_selector = "body"

  # Page title selector
  index_title_selector = "h1"

  # Page excerpt selector
  index_excerpt_selector = "p"

  # Page date selector
  index_date_selector = "time"

  # Date format for sorting
  # Default %F means YYYY-MM-DD
  # For other formats, see http://calendar.forge.ocamlcore.org/doc/Printer.html
  index_date_format = "%F"

  # Page author selector
  index_author_selector = "#author"

  # Wrapper element for index entries
  index_item_template = "<div> </div>"

  # External index generator
  # There is no default; to enable it, set the option to a command, e.g.
  # index_processor = "scripts/index.py"

Using external index generators

The built-in index generator simply copies elements from the page to the index. You can easily end up with a rather odd-looking index, especially if you are using different elements on every page and identify them by id rather than element name.

Generating indices and blog feeds is where template processors really shine. Everyone has different preferences though, so instead of having a built-in template processor, soupault supports exporting the index to JSON and feeding it to an external program.

The JSON-encoded index is written to the program's standard input as a single line [3]. It's a list of objects with the following fields:

url
Absolute page URL path, like /papers/simple-imperative-polymorphism
nav_path
A list of strings that represents the logical section path, e.g. for /pictures/cats/grumpy it will be ["pictures", "cats"].
title, date, excerpt, author
Metadata extracted from the page. Any of them can be null.

Here's an example of a very simple indexing setup that will take the first h1 of every page in a section and make an unordered list of links to them. The external processor will use Python and Mustache templates.

First, create an index.html page in every section and include a <div id="index"> element in it.

Then write this to your config file:

[index]
  index = true
  index_selector = "#index"
  index_processor = "scripts/index.py"
  index_title_selector = "h1"

Then install the pystache library and save this script to scripts/index.py:

#!/usr/bin/env python3

import sys
import json

import pystache

template = """
<li><a href="{{url}}">{{title}}</a></li>
"""

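renderer = pystache.Renderer()

# soupault writes the JSON-encoded index to stdin as a single line
index_json = sys.stdin.readline()
index_entries = json.loads(index_json)

# Print an unordered list with one link per page
print("<ul class=\"nav\">")
for entry in index_entries:
    print(renderer.render(template, entry))
print("</ul>")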

Index processors are not required to output anything. You can as well use them to save the index data somewhere and create taxonomies and custom indices from it with another script, then re-run soupault to have them included in the pages [4].

Custom fields

Built-in fields should be enough for a typical blog taxonomy, but it's possible to add custom fields to your JSON index data.

Custom field queries are defined in the [index.custom_fields] table. Table keys are field names as they will appear in the exported JSON. Their values are inline tables with a required selector field and an optional select_all parameter.

[index.custom_fields]
  category = { selector = "span#category" }

  tags = { selector = ".tag", select_all = true }

In this example, the category field will contain the inner HTML of the first <span id="category"> element even if there's more than one in the page. The tags field will contain a list of contents of all elements with class="tag".

Exporting global site index to a file

The index processor invoked with the index_processor option receives the index of the current section. It doesn't include subsections. Since the site directory is processed top to bottom, the site/index.html page would not get the global site index either.

If you want to create your own taxonomies from the metadata imported from pages, create a global site index, or an index of a section and all its subsections, you can export the aggregated index data to a file for further processing. Add this option to your index config:

[index]
  dump_json = "path/to/file.json"

This way you can use a TeX-like workflow:

  1. Run soupault so that index file is created.
  2. Run your custom index generator and save generated taxonomy pages to site/.
  3. Run soupault one more time to have them included in the build.
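
Step 2 of that workflow could look roughly like the Python sketch below. It makes several assumptions that are not confirmed by this document: that the dumped file holds the same list of objects that index processors receive, that custom fields such as category appear as top-level keys, and that field values are plain text that is safe to escape. The function names and the sample data are invented for the example.

```python
import html
import json
from collections import defaultdict

# Sample data in the shape described for the exported index. A real script
# would load the file named in dump_json instead, e.g. with json.load().
SAMPLE_INDEX = """
[
  {"url": "/articles/goto-considered-harmful",
   "title": "Goto considered harmful", "category": "essays"},
  {"url": "/articles/theorems-for-free",
   "title": "Theorems for free", "category": "papers"},
  {"url": "/about", "title": null, "category": null}
]
"""


def group_by_category(entries):
    """Group entries by the "category" field, skipping entries without one."""
    groups = defaultdict(list)
    for entry in entries:
        category = entry.get("category")
        if category:
            groups[category].append(entry)
    return dict(groups)


def render_category_page(category, entries):
    """Render an HTML fragment that lists all pages of one category."""
    items = []
    for entry in entries:
        # Any metadata field can be null, so fall back to the URL.
        title = entry.get("title") or entry["url"]
        items.append('<li><a href="{}">{}</a></li>'.format(
            html.escape(entry["url"]), html.escape(title)))
    heading = "<h1>{}</h1>".format(html.escape(category))
    return "\n".join([heading, "<ul>"] + items + ["</ul>"])


groups = group_by_category(json.loads(SAMPLE_INDEX))
for category, entries in sorted(groups.items()):
    print(render_category_page(category, entries))
```

Instead of printing, a real generator would write each fragment to a page file under site/ so that the second soupault run picks it up.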

Widgets

Widgets provide additional functionality. When a page is processed, its content is inserted into the template, and the resulting HTML element tree is passed through a widget pipeline.

Widget behaviour

Widgets that require a selector option first check if there's an element matching that selector in the page, and do nothing if it's not found, since they wouldn't have a place to insert their output.

This way you can avoid having a widget run on a page simply by omitting the element it's looking for.

If more than one element matches the selector, the first element is used.

Widget configuration

Widget configuration is stored in the [widgets] table. The TOML syntax for nested tables is [table.subtable], so you will have entries like [widgets.foo], [widgets.bar] and so on.

Widget subtable names are purely informational and have no effect; the widget type is determined by the widget option. Therefore, if you want to use a hypothetical frobnicator widget, your entry will look like:

[widgets.frobnicate]
  widget = "frobnicator"
  selector = "div#frob"

It may seem confusing and redundant, but if the subtable name defined the widget to be called, you could only have one widget of each type, and would have to choose, for example, whether to include the header or the footer with the include widget.

Limiting widgets to specific pages or sections

If the widget target comes from the page content rather than the template, you can simply not include any elements matching its selector option.

Otherwise, you can explicitly set a widget to run or not run on specific pages or sections.

All options from this section can take either a single string, or a list of strings.

Limiting to pages or sections

There are page and section options that allow you to specify exact paths to specific pages or sections. Paths are relative to your site directory.

The page option limits a widget to an exact page file, while the section option applies a widget to all files in a subdirectory.

[widgets.site-news]
  # only on site/index.html and site/news.html
  page = ["index.html", "news.html"]

  widget = "include"
  file = "includes/site-news.html"
  selector = "div#news"

[widgets.cat-picture]
  # only on site/cats/*
  section = "cats"

  widget = "insert_html"
  html = "<img src=\"/images/lolcat_cookie.gif\" />"
  selector = "#catpic"

Excluding sections or pages

It's also possible to explicitly exclude pages or sections.

[widgets.toc]
  # Don't add a TOC to the main page
  exclude_page = "index.html"
  ...

[widgets.evil-analytics]
  exclude_section = "privacy"
  ...

Using regular expressions

When nothing else helps, path_regex and exclude_path_regex options may solve your problem. They take a Perl-compatible regular expression (not a glob).

[widgets.toc]
  # Don't add a TOC to any section index page
  exclude_path_regex = '^(.*)/index\.html$'
  ...

[widgets.cat-picture]
  path_regex = 'cats/'
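
How these options might combine can be sketched in Python. Everything about the combination logic here is an assumption, since this document does not spell it out: exclusions win over inclusions, a widget with no limiting options runs everywhere, a section also covers its subdirectories, and regular expressions are matched against the path relative to the site directory. The widget_applies function is hypothetical, and Python's re module stands in for a Perl-compatible engine.

```python
import re


def as_list(value):
    """Options may be a single string or a list of strings."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def in_section(path, section):
    """True if path (relative to the site dir) lies inside the section."""
    return path.startswith(section.rstrip("/") + "/")


def widget_applies(options, path):
    """Decide whether a widget with these options should run on a page."""
    excluded = (
        path in as_list(options.get("exclude_page"))
        or any(in_section(path, s)
               for s in as_list(options.get("exclude_section")))
        or any(re.search(r, path)
               for r in as_list(options.get("exclude_path_regex")))
    )
    if excluded:
        return False

    pages = as_list(options.get("page"))
    sections = as_list(options.get("section"))
    regexes = as_list(options.get("path_regex"))
    if not (pages or sections or regexes):
        return True  # no limiting options: run on every page
    return (
        path in pages
        or any(in_section(path, s) for s in sections)
        or any(re.search(r, path) for r in regexes)
    )


site_news = {"page": ["index.html", "news.html"]}
print(widget_applies(site_news, "news.html"))
print(widget_applies(site_news, "about.html"))
```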

Widget processing order

If the config for widget A comes before the config for widget B in your soupault.conf, that doesn't guarantee that widget A will be processed first. By default, soupault assumes that widgets are independent and can be processed in arbitrary order. In future versions they may even be processed in parallel, who knows.

This can be an issue if one widget relies on output from another. In that case, you can order widgets explicitly with the after parameter. It can be a single widget (after = "mywidget") or a list of widgets (after = ["some-widget", "another-widget"]).

Here is an example from this website's config. In the template there's a <div id="breadcrumbs"> where breadcrumbs are inserted by the add-breadcrumbs widget. Since there may not be breadcrumbs if the page is not deep enough, the div may be left empty, and that's not neat, so the cleanup-breadcrumbs widget removes it.

## Breadcrumbs
[widgets.add-breadcrumbs]
  widget = "breadcrumbs"
  selector = "#breadcrumbs"

## Remove div#breadcrumbs if the breadcrumbs widget left it empty
[widgets.cleanup-breadcrumbs]
  widget = "delete_element"
  selector = "#breadcrumbs"
  only_if_empty = true

  # Important!
  after = "add-breadcrumbs"
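
Ordering by after amounts to a topological sort of the widget dependency graph. The Python sketch below shows the idea with the standard graphlib module (Python 3.9 or later). It is not how soupault itself is implemented, and the widget_order function is a hypothetical name.

```python
from graphlib import TopologicalSorter


def widget_order(widgets):
    """Return widget names in an order that respects every `after` option.

    `widgets` maps widget names (the subtable names) to their option tables.
    """
    graph = {}
    for name, options in widgets.items():
        after = options.get("after", [])
        if isinstance(after, str):
            after = [after]  # a single widget name is allowed too
        graph[name] = set(after)
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


widgets = {
    "cleanup-breadcrumbs": {"widget": "delete_element",
                            "after": "add-breadcrumbs"},
    "add-breadcrumbs": {"widget": "breadcrumbs"},
}
print(widget_order(widgets))
```

Even though cleanup-breadcrumbs is listed first in the table, the sort puts add-breadcrumbs ahead of it. Widgets without any dependency between them can still come out in any relative order.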

     
    

Built-in widgets

File and output inclusion widgets

These widgets include something into your page: a file, a snippet, or the output of an external program.

include

The include widget simply reads a file and inserts its content into some element.

The following configuration will insert the content of templates/header.html into an element with id="header" and the content of templates/footer.html into an element with id="footer".

[widgets.header]
  widget = "include"
  file = "templates/header.html"
  selector = "#header"

[widgets.footer]
  widget = "include"
  file = "templates/footer.html"
  selector = "#footer"

This widget provides a parse option that controls whether the file is parsed or included as a text node. Use parse = false if you want to include a file verbatim, with HTML special characters escaped.

insert_html

For a small HTML snippet, a separate file may be too much. The insert_html widget inserts a snippet given in its html option into the element matching the selector.

[widgets.tracking-script]
  widget = "insert_html"
  html = '<script src="/scripts/evil-analytics.js"> </script>'
  selector = "head"

exec

The exec widget executes an external program and includes its output into an element. The program is executed in shell, so you can write a complete command with arguments in the command option. Like the include widget, it has a parse option that includes the output verbatim if set to false.

Simple example: page generation timestamp.

[widgets.generated-on]
  widget = "exec"
  selector = "#generated-on"
  command = "date -R"

Environment variables

The following environment variables are passed to the external program:

PAGE_FILE
Path to the page source file, relative to the current working directory (e.g. site/index.html).

This is how you can include a page's own source into a page, on a UNIX-like system:

[widgets.page-source]
  widget = "exec"
  selector = "#page-source"
  parse = false
  command = "cat $PAGE_FILE"

If you store your pages in git, you can get a page timestamp from the git log with a similar method (note that it's not a very fast operation for long commit histories):

[widgets.last-modified]
  widget = "exec"
  selector = "#git-timestamp"
  command = "git log -n 1 --pretty=format:%ad --date=format:%Y-%m-%d -- $PAGE_FILE"

The PAGE_FILE variable can be used in many different ways, for example, you can use it to fetch the page author and modification date from a revision control system like git or mercurial.
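
The external program does not have to be a shell one-liner. As a sketch, here is a small Python script that reads PAGE_FILE and prints the file's modification date. The script and the last_modified function are examples made up for this document; the only thing taken from soupault is the PAGE_FILE variable itself.

```python
#!/usr/bin/env python3
import os
from datetime import datetime, timezone


def last_modified(page_file):
    """Return the file's modification date as YYYY-MM-DD (in UTC)."""
    mtime = os.path.getmtime(page_file)
    moment = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


# soupault passes the page source path in the PAGE_FILE variable.
# When the script is run by hand without it, print nothing.
page_file = os.environ.get("PAGE_FILE")
if page_file:
    print(last_modified(page_file))
```

Saved under a path of your choice, say scripts/last-modified.py (a hypothetical name), it could be used in an exec widget with command = "python3 scripts/last-modified.py".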

In the current version you cannot use an external program as a filter, only as a producer, but filtering may be implemented in future versions.

Content widgets

title

The title widget sets the page title, that is, the content of the <title> element, from the text of an element matching a selector. If there is no <title> element in the page, this widget assumes you don't want it and does nothing.

Example:

[widgets.page-title]
  widget = "title"
  selector = "h1"
  default = "My Website"
  append = " on My Website"
  prepend = "Page named "

If selector is not specified, it uses the first <h1> as the title source element by default.

The selector option can be a list. For example, selector = ["h1", "h2", "#title"] means “use the first <h1> if the page has it, else use <h2>, else use anything with id="title", else use default”.

Optional prepend and append parameters allow you to insert some text before and after the title.

If there is no element matching the selector in the page, it will use the title specified in the default option. In that case the prepend and append options are ignored.

If the title source element is absent and default title is not set, the title is left empty.
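
The fallback rules above fit in a short Python function. This is a sketch of the described behaviour only: page_title is a hypothetical name, and the HTML lookup is left out, so the function receives the text found for each selector (or None) instead of a page.

```python
def page_title(found_texts, default=None, prepend="", append=""):
    """Compute the text for <title>.

    `found_texts` holds, for each selector in order, the text of the
    first matching element, or None if nothing matched.
    """
    for text in found_texts:
        if text is not None:
            # A source element was found: prepend and append apply.
            return f"{prepend}{text}{append}"
    # Nothing matched: use the default unchanged, or leave the title empty.
    return default if default is not None else ""


# selector = ["h1", "h2"] on a page that has no <h1> but has an <h2>
print(page_title([None, "About me"], default="My Website",
                 prepend="Page named ", append=" on My Website"))
# The same config on a page with neither heading
print(page_title([None, None], default="My Website",
                 prepend="Page named ", append=" on My Website"))
```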

footnotes

The footnotes widget finds all elements matching a selector, moves them to another element, and replaces them with numbered links. As usual, the container element can be anywhere in the page; you can have footnotes at the top if you feel like it.

[widgets.footnotes]
  widget = "footnotes"

  # Required: Where to move the footnotes
  selector = "#footnotes"

  # Required: What elements to consider footnotes
  footnote_selector = ".footnote"

  # Optional: Element to wrap footnotes in, default is <p>
  footnote_template = "<p> </p>"

  # Optional: Element to wrap the footnote number in, default is <sup>
  ref_template = "<sup> </sup>"

  # Optional: Class for footnote links, default is none
  footnote_link_class = "footnote"

  # Optional: create links back to original locations, default is true
  back_links = true

The footnote_selector option can be a list; in that case all elements matching any of those selectors will be considered footnotes.

By default, the number in front of a footnote is a hyperlink back to the original location. You can disable it and make footnotes one way links with back_links = false.

toc

The toc widget generates a table of contents for your page.

Table of contents is generated from the heading tags from <h1> to <h6>.

Here is the ToC configuration from this website:

[widgets.table-of-contents]
  widget = "toc"

  # Required: where to insert the ToC
  selector = "#generated-toc"

  # Optional: minimum and maximum levels, defaults are 1 and 6 respectively
  min_level = 2
  max_level = 6

  # Optional: use <ol> instead of <ul> for ToC lists
  # Default is false
  numbered_list = false

  # Optional: Class for the ToC list element, default is none
  toc_list_class = "toc"

  # Optional: append the heading level to the ToC list class
  # In this example list for level 2 would be "toc-2"
  toc_class_levels = false

  # Optional: Insert "link to this section" links next to headings
  heading_links = true

  # Optional: text for the section links
  # Default is "#"
  heading_link_text = "→ " 

  # Optional: class for the section links
  # Default is none
  heading_link_class = "here"

  # Optional: insert the section link after the header text rather than before
  # Default is false
  heading_links_append = false

  # Optional: use header text slugs for anchors
  # Default is false
  use_heading_slug = true

  # Optional: use unchanged header text for anchors
  # Default is false
  use_heading_text = false

Choosing the heading anchor options

For the table of contents to work, every heading needs a unique id attribute that can be used as an anchor.

If a heading has an id attribute, it will be used for the anchor. If it doesn't, soupault has to generate one.

By default, if a heading has no id, soupault will generate a unique numeric identifier for it. This is safe, but not very good for readers (links are non-indicative) and for people who want to share direct links to sections (they will change if you add more sections).

If you want to find a balance between readability, permanence, and ease of maintenance, there are a few ways you can do it and the choice is yours.

The use_heading_slug = true option converts the heading text to a valid HTML identifier. Right now, however, it's very aggressive and replaces everything other than ASCII letters and digits with hyphens. This is obviously a no go for non-ASCII languages, that is, pretty much all languages in the world. It may be implemented more sensibly in the future.

The use_heading_text = true option uses unmodified heading text for the id, with whitespace and all. This is against the rules of HTML, but seems to work well in practice.

Note that use_heading_slug and use_heading_text do not enforce uniqueness.

All in all, for best link permanence you should give every heading a unique id by hand, and for best readability you may want to go with use_heading_text = true.
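
If you go the hand-written id route, a minimal configuration could look like this sketch. It only uses options described above; the selector value is the one from this website's config and is an assumption for any other site.

```toml
[widgets.table-of-contents]
  widget = "toc"
  selector = "#generated-toc"

  # Both anchor generation options stay at their default (false).
  # Headings with an explicit id keep it as the anchor;
  # headings without one fall back to generated numeric ids.
  use_heading_slug = false
  use_heading_text = false
```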

breadcrumbs

The breadcrumbs widget generates breadcrumbs for the page.

The only required parameter is selector; the rest are optional.

[widgets.breadcrumbs]
  widget = "breadcrumbs"
  selector = "#breadcrumbs"
  prepend = ".. / "
  append = " /"
  between = " / "
  breadcrumb_template = ""
  min_depth = 1

The breadcrumb_template is an HTML snippet that can be used for styling your breadcrumbs. It must have an <a> element in it. By default, a simple unstyled link is used.

The min_depth option sets the minimum nesting depth where breadcrumbs appear. That's the length of the logical navigation path rather than the directory path.

There is a fixup that decrements the path length for section index pages, that is, pages named index by default, or whatever is specified in the index_page option. When clean URLs are used, their navigation path is considered one level shorter than that of any other page in the section. This prevents section index pages from having links to themselves.

  • site/index.html → 0
  • site/foo/index.html → 0 (sic!)
  • site/foo/bar.html → 1
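
To see how the depth values above interact with min_depth, consider this sketch with a higher threshold. It assumes that breadcrumbs appear on pages whose navigation path length is at least min_depth.

```toml
[widgets.breadcrumbs]
  widget = "breadcrumbs"
  selector = "#breadcrumbs"
  between = " / "

  # Only pages nested at least two levels deep get breadcrumbs
  min_depth = 2
```

Under that assumption, site/foo/bar.html (depth 1) would get no breadcrumbs, while a page such as site/foo/bar/quux.html (navigation path foo, bar) would.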

HTML manipulation widgets

delete_element

The opposite of insert_html. Deletes an element matching the given selector. It can be useful in two situations:

  • Another widget may leave an element empty and you want to clean it up.
  • Your pages are generated with another tool and it inserts something you don't want.

# Who reads footers anyway?
[widgets.delete_footer]
  widget = "delete_element"
  selector = "#footer"

You can limit it to deleting only empty elements with the only_if_empty = true option. An element is considered empty if there's nothing but whitespace inside it.
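
For the first situation, a cleanup widget might look like the following sketch. The table name and the choice of #generated-toc as the target are illustrative assumptions.

```toml
# Remove the ToC container when nothing was inserted into it
[widgets.cleanup-empty-toc]
  widget = "delete_element"
  selector = "#generated-toc"

  # Leave the element alone if it has any content besides whitespace
  only_if_empty = true
```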

Plugins

Since version 1.2, soupault can be extended with Lua plugins. Currently there are the following limitations:

  • The supported language is Lua 2.5, not modern Lua 5.x. That means no closures and no for loops in particular.
  • Lua execution errors are logged to stderr, but don't stop processing even in strict mode.
  • Only string options can be passed to plugins via widget options from soupault.conf.

Plugins are treated like widgets and are configured the same way.

Plugin example

Here is an example of a plugin that converts relative links to absolute URLs by prepending a site URL to them:

-- Converts relative links to absolute URLs
-- e.g. "/about" -> "https://www.example.com/about"

-- Get the URL from the widget config
site_url = config["site_url"]

if not Regex.match(site_url, "(.*)/$") then
  site_url = site_url .. "/"
end

links = HTML.select(page, "a")

-- That's Lua 2.5, hand-cranked iteration...
index, link = next(links)

while index do
  href = HTML.get_attribute(link, "href")
  if href then
    -- Check if URL schema is present
    if not Regex.match(href, "^([a-zA-Z0-9]+):") then
      -- Remove leading slashes
      href = Regex.replace(href, "^/*", "")
      href = site_url .. href
      HTML.set_attribute(link, "href", href)
    end
  end
  index, link = next(links, index)
end

Configuring plugins

Plugin files can be placed in any directory. By convention, we'll use plugins/. So, to use that plugin, first save it to plugins/site-url.lua.

Then you need to configure soupault to load the plugin. Add this snippet to soupault.conf:

[plugins.site-url]
  file = "plugins/site-url.lua"

It will register the plugin as a widget named site-url.

Then you can use it like any other widget. Plugin subtable name becomes the name of the widget, in our case site-url. The site_url option from the widget config will be accessible to the plugin as config["site_url"].

[widgets.absolute-urls]
  widget = "site-url"
  site_url = "https://www.example.com"

Plugin environment

Plugins have access to the following global variables:

page
The page element tree that can be manipulated with functions from the HTML module.
page_file
String containing page file path, e.g. site/index.html
nav_path
List of strings representing the logical navigation path. For example, for site/foo/bar/quux.html it's ["foo", "bar"].
config
A table with widget config options.

Note: only string options can be passed to plugins through the config table. As in, they must be TOML strings like "foo" or "42.0". Lua will convert strings to numbers when appropriate, so you can pass numbers by writing them as strings. You cannot pass TOML lists or inline tables to plugins in the current soupault version.
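
As a sketch of what that means in practice, here is a configuration for a hypothetical plugin (the plugin name, file path, and option names are all invented for illustration):

```toml
[plugins.example-plugin]
  file = "plugins/example-plugin.lua"

[widgets.example]
  widget = "example-plugin"

  # Numbers must be written as TOML strings
  max_items = "10"

  # Not supported in the current version: bare numbers, lists, inline tables
  # max_items = 10
  # tags = ["a", "b"]
```

The plugin would then read the value as config["max_items"] and rely on Lua's string to number conversion when using it in arithmetic.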

Plugin API

Apart from the standard Lua 2.5 functions, soupault provides two additional modules: HTML for HTML element tree manipulation and Regex for simple regex operations.

The HTML module

| Function | Example | Description |
|---|---|---|
| HTML.parse(string) | h = HTML.parse("<p>hello world</p>") | Parses a string into an HTML element tree. |
| HTML.create_element(tag, text) | h = HTML.create_element("p", "hello world") | Creates an HTML element node. |
| HTML.inner_html(html) | h = HTML.inner_html(HTML.parse("<p>hello world</p>")) | Returns element content as a string. |
| HTML.select(html, selector) | links = HTML.select(page, "a") | Returns a list of elements matching the specified selector. |
| HTML.select_one(html, selector) | content_div = HTML.select_one(page, "div#content") | Returns the first element matching the specified selector, or nil if none are found. |
| HTML.get_attribute(html_element, attribute) | href = HTML.get_attribute(link, "href") | Returns the value of an element attribute, or nil if the attribute is absent. The first argument must be an element reference produced by HTML.select/HTML.select_one/HTML.select_element. |
| HTML.set_attribute(html_element, attribute, value) | HTML.set_attribute(content_div, "id", "content") | Sets an attribute value. The first argument must be an element reference produced by HTML.select/HTML.select_one/HTML.select_element. |
| HTML.add_class(html_element, class_name) | HTML.add_class(p, "centered") | Adds a class="class_name" attribute. The first argument must be an element reference produced by HTML.select/HTML.select_one/HTML.select_element. |
| HTML.remove_class(html_element, class_name) | HTML.remove_class(p, "centered") | Removes the class_name class from the element. The first argument must be an element reference produced by HTML.select/HTML.select_one/HTML.select_element. |
| HTML.append_child(parent, child) | HTML.append_child(page, HTML.create_element("br")) | Appends a child element to the parent. |
| HTML.delete(html_element) | HTML.delete(HTML.select_one(page, "h1")) | Deletes an element from the page. The argument must be an element reference returned by a select function. |

The Regex module

Regular expressions used by this module are mostly Perl-compatible. Capturing groups and back references are not supported.

| Function | Example | Description |
|---|---|---|
| Regex.match(string, regex) | Regex.match("/foo/bar", "^/") | Checks if a string matches a regex. |
| Regex.find_all(string, regex) | matches = Regex.find_all("/foo/bar", "([a-z]+)") | Returns a list of substrings matching a regex. |
| Regex.replace(string, regex, string) | s = Regex.replace("/foo/bar", "^/", "") | Replaces the first occurrence of a matching substring. It returns a new string and doesn't modify the argument. |
| Regex.replace_all(string, regex, string) | s = Regex.replace_all("/foo/bar", "/", "") | Replaces every matching substring. It returns a new string and doesn't modify the argument. |

1 On my desktop with an i5-7260U CPU and a magnetic drive.

2 Building for POSIX platforms just works, but building for Windows requires unreleased fixes to fileutils as of 0.5.3, so you will need to build fileutils first.

3 Newline as end of message is a horrible protocol, but since there's no universally agreed upon alternative for sending structured data to stdin, that's what we've got.

4 TeX users are familiar with this approach.
