User-friendly topological sort in OCaml

Category: OCaml

Topological sorting is a relatively common task, but it’s also relatively hard to generalize to match everyone’s needs and keep it user-friendly at the same time. Wherever there are dependencies between tasks, whether it’s a build automation tool like Make or something else, you need to sort them, and you also need to tell users what they are doing wrong if an error occurs.

We have many libraries that are great at generic graph manipulation but want you to convert the data to a format that is designed to be usable with any possible graph algorithm. At the same time they may be unable to represent data with non-existent dependencies, or produce an easily printable cyclic dependency error.

It may be the reason we have many programs that either produce nondescript dependency error messages or just resolve dependency cycles as they please, like systemd.

That is why I wrote a topological module that can work with raw user input and detect both missing dependencies and dependency cycles. You can install it from the OPAM repository or get the source from github.com/dmbaturin/ocaml-tsort.

I intentionally sacrificed both generality and efficiency for ease of use. It does exactly one thing, and hopefully does it well.

I tried to make it as easy to use as the classic UNIX tsort(1). The main function is Tsort.sort : ('a * 'a list) list -> 'a Tsort.sort_result. Here’s a usage example:

utop # Tsort.sort [("foundation", []); ("walls", ["foundation"]); ("roof", ["walls"])] ;;
- : string Tsort.sort_result = Tsort.Sorted ["foundation"; "walls"; "roof"]

It also produces distinct errors for non-existent and circular dependencies. I went for a custom tri-state type so that it can work with any types, not just strings. The type is type 'a sort_result = Sorted of 'a list | ErrorNonexistent of 'a list | ErrorCycle of 'a list.

utop # Tsort.sort [("foundation", ["roof"]); ("walls", ["foundation"]); ("roof", ["walls"])] ;;
- : string Tsort.sort_result = Tsort.ErrorCycle ["roof"; "foundation"; "walls"]

utop # Tsort.sort [("foundation", ["construction permit"]); ("walls", ["foundation"]); ("roof", ["walls"])] ;;
- : string Tsort.sort_result = Tsort.ErrorNonexistent ["construction permit"]

Discussion

It uses the classic Kahn’s algorithm from 1962. The input is a ('a * 'a list) list, but internally I convert it to a Hashtbl for ease of manipulation.

The idea is simple:

Create an empty list for the sorted nodes.
Find a node that has no dependencies, remove it from the graph and move it to that list.
Repeat until the graph is empty.
If there are nodes that can’t be removed, there’s a dependency cycle.

That’s assuming there’s actually a graph. When the graph is constructed from user-defined statements like Requires = foo.service or foo: bar.c baz.c, the natural representation is an associative array of node names as keys and their dependency lists as values. And there’s no guarantee that foo.service or bar.c actually actually exists in the set of its keys.

However, If we sacrifice efficiency, we can work with that format directly. First make a helper that find nodes whose dependency lists are empty:

let find_isolated_nodes hash =
  let aux id deps acc =
    match deps with
    | [] -> id :: acc
    | _  -> acc
  in Hashtbl.fold aux hash []

Then a helper for removing all nodes with given names from the hash (those names are later moved to the list of sorted nodes):

let remove_nodes nodes hash =
  List.iter (Hashtbl.remove hash) nodes

And then we also need a way to delete a node name from every dependency list:

let remove_dependency hash dep =
  let aux dep hash id =
    let deps = Hashtbl.find hash id in
    let deps =
      if List.exists ((=) dep) deps then
        CCList.remove ~eq:(=) ~key:dep deps
      else deps
    in
    begin
      Hashtbl.remove hash id;
      Hashtbl.add hash id deps
    end
  in
  let ids = CCHashtbl.keys_list hash in
  List.iter (aux dep hash) ids

We make an assumption that every node mentioned in dependency lists exists in the hashtable keys, but if the graph turns out to be unsortable, we need to consider that possibility and have a function for checking it ready:

let find_nonexistent_nodes nodes =
  let keys = List.fold_left (fun acc (k, _) -> k :: acc) [] nodes in
  let rec find_aux ns nonexistent =
    match ns with
    | n :: ns ->
      if List.exists ((=) n) keys then find_aux ns nonexistent
      else find_aux ns (n :: nonexistent)
    | [] -> nonexistent
  in
  let nonexistent = List.fold_left (fun acc (_, vs) -> List.append acc (find_aux vs [])) [] nodes in
  CCList.uniq ~eq:(=) nonexistent

Now we can wrap it up. First we find nodes that have no dependencies at all (“base nodes”), remove them from the hash, and create an initial list of sorted nodes from them. Then we start a loop that has three parameters: a list of nodes that are safe to remove from dependencies (deps), a graph hash, and an accumulator list for storing sorted nodes.

After ininial identification and removal of base nodes, the list of nodes safe to remove and the list of already sorted nodes are the same. Then it gets funny because the recursion is not structural and that list can either shrink or grow.

If there are more nodes become isolated after removal of already sorted dependencies, it’s a good thing because they will be removed at the next step. If everything is fine, then eventually both the list of nodes safe to remove from dependency lists and the hash itself become empty.

If the list of nodes that are safe to remove becomes empty but the hash still has items, then there are two possibilities: something depends on a node that is not in the hash key set, or two nodes depend on each other. To tell these conditions apart, we run the check for non-existent nodes on the original data. If none are found, then there’s actually a cycle.

let sort nodes =
  let rec sorting_loop deps hash acc =
    match deps with
    | [] -> acc
    | dep :: deps ->
      let () = remove_dependency hash dep in
      let isolated_nodes = find_isolated_nodes hash in
      let () = remove_nodes isolated_nodes hash in
      sorting_loop (List.append deps isolated_nodes) hash (List.append acc isolated_nodes)
  in
  let nodes_hash = CCHashtbl.of_list nodes in
  let base_nodes = find_isolated_nodes nodes_hash in 
  let () = remove_nodes base_nodes nodes_hash in
  let sorted_node_ids = sorting_loop base_nodes nodes_hash [] in
  let sorted_node_ids = List.append base_nodes sorted_node_ids in
  let remaining_ids = CCHashtbl.keys_list nodes_hash in
  match remaining_ids with
  | [] -> Sorted sorted_node_ids
  | _  ->
    let nonexistent_nodes = find_nonexistent_nodes nodes in
    begin
      match nonexistent_nodes with
      | [] -> ErrorCycle remaining_ids
      | _  -> ErrorNonexistent nonexistent_nodes
    end

This page was last modified: 2019-12-24