A pub/sub behavior specified—with TLA+ thus verified

"If you’re used to the constraints of programming languages, it may be a while before you start taking advantage of the freedom afforded by mathematics." —Leslie Lamport, Specifying Systems

I contribute to Eclipse Zenoh as part of my current occupation. Zenoh is a pub/sub and request/response message-oriented middleware that operates on a mix of clique, mesh and brokered topologies. I have recently completed work on a major extension of Zenoh's routing layer to unlock new useful topologies in user deployments.

During development, I had introduced a regression in Zenoh's subscription bookkeeping. The bug essentially delayed the propagation of unsubscriptions from a broker to one of its clients. The issue is that the client in question—unaware that no subscriber remains in the system—would needlessly send publications to the broker, thus wasting bandwidth.

I can no longer recollect how exactly I came across TLA+ and P. Yet I am near certain that the experience of working on Zenoh's protocol implementation without much in the way of a specification convinced me that I had to add a specification language to my engineering toolbox¹.

Although I have a decent mathematical background—courtesy of les classes préparatoires—providing me with the skills needed to read and write proofs and work with Set Theory, propositional and first-order logic, etc, I still had to climb quite the steep learning curve before properly grasping the fact that TLA+ is anything but a programming language that happens to ship with a verifier.

Temporal Logic of Actions, or TLA+, is a formal specification system designed by Leslie Lamport. It is formal as in grounded in a formal logical system. It enables engineers to write precise descriptions of system behavior which is tremendously useful when dealing with distributed systems.

Back to the main issue of broker unsubscription propagation. When should a pub/sub broker notify its clients of unsubscriptions on a given topic? One naive answer would be: when no client is subscribed to said topic any longer. As alluded to earlier, this is not true and breaks down in the case of a sole subscribed client.

When I initially sought to formalize this problem in TLA+ and in P, I got bogged down in the details of Zenoh code: topic names (i.e. "key expressions"), message queues, publication messages, etc. I was set on building a faithful reproduction of Zenoh's code to the detriment of my goal: verifying the logic of subscription bookkeeping as a learning exercise and a much anticipated introduction to TLA+. Larger specifications are certainly more complex to write and check, and don't make for pleasant learning projects.

I will not go over my failed attempts at specifying subscription bookkeeping with (bounded) message queues. We will consider that (1) the broker atomically notifies all clients, that (2) there is only one implicit topic name and that (3) we need not specify how publications are routed. We start by writing down the specification variables, i.e. the system state.

---- MODULE PubSub ----

EXTENDS 
  FiniteSets, \* Needed for the `Cardinality(S)` operator, i.e. the number of elements in set `S`
  Naturals    \* Needed for the `>` operator on natural numbers

CONSTANT Client \* Set of broker clients, e.g. `{c1, c2, c3}`

VARIABLES 
  subs,    \* Set of subscribed clients
  notified \* Function mapping each client to its boolean notification state, 
           \* `TRUE` for client `c` iff there is a remote subscription from the perspective of `c`

vars == << subs, notified >>

(* TLA+ is an untyped language. 
The mechanism used to check system invariants also serves as a "type checker" of sorts *)
TypeInvariant ==
  /\ subs \subseteq Client
  /\ notified \in [Client -> BOOLEAN]

A TLA behavior is a sequence of steps, where a step is a transition between two states. Actions are formulas that describe steps. Actions are typically written as conjunctions (e.g. A /\ B) of smaller formulas. Actions by definition utilize the prime symbol ', used to denote the next value of a variable. Action conjunction elements that do not contain the prime symbol may be interpreted as enabling conditions, i.e. formulas that must be true for the action to be possible.

(* Initial state: no inbound and no outbound subscriptions *)
Init ==
  /\ subs = {}
  /\ notified = [c \in Client |-> FALSE]

(* Client `c` subscribes: notify every other client *)
Subscribe(c) ==
  /\ c \notin subs
  /\ subs' = subs \union { c }
  /\ notified' = [c2 \in Client |-> c2 /= c \/ notified[c]]

(* Client `c` unsubscribes: notify every client iff subs' is empty *)
Unsubscribe(c) ==
  /\ c \in subs
  /\ subs' = subs \ { c }
  /\ notified' = [c2 \in Client |-> subs \ { c } /= {}]

(* Some client `c` either subscribes or unsubscribes *)
Next == \E c \in Client: Subscribe(c) \/ Unsubscribe(c)

(* `Init` is true, `Next` is always a step
and steps that don't modify `vars` (i.e. _stuttering_ steps) are also part of the `Spec` *)
Spec == Init /\ [][Next]_vars

We now have a minimal TLA+ specification of the particular system aspect we are interested in. At this point, we may use the TLC model checker to verify that TypeInvariant indeed holds at every step; tlc PubSub.tla should yield no errors. We may now move onto the more interesting invariant: a client should be notified if and only if (this is denoted <=> and often abbreviated as "iff") there is at least one subscriber beside itself.

SubscriptionConsistency ==
  \A c \in Client: ( notified[c] ) <=> ( Cardinality(subs \ { c }) > 0 )

TLC expectedly reports that SubscriptionConsistency does not always hold for the reasons outlined above. It provides us with a counter-example, i.e. a behavior that satisfies Spec yet breaks SubscriptionConsistency. I have set Client to {c1, c2, c3} for the following TLC run.

Error: Invariant SubscriptionConsistency is violated.
Error: The behavior up to this point is:

State 1: <Initial predicate>
/\ subs = {}
/\ notified = (c1 :> FALSE @@ c2 :> FALSE @@ c3 :> FALSE)

State 2: <Subscribe>
/\ subs = {c1}
/\ notified = (c1 :> FALSE @@ c2 :> TRUE @@ c3 :> TRUE)

State 3: <Subscribe>
/\ subs = {c1, c2}
/\ notified = (c1 :> TRUE @@ c2 :> TRUE @@ c3 :> TRUE)

State 4: <Unsubscribe>
/\ subs = {c2}
/\ notified = (c1 :> TRUE @@ c2 :> TRUE @@ c3 :> TRUE)

Perfect! The counter-example correctly identifies the two-to-one subscription scenario that violates SubscriptionConsistency. Now all that's left is to patch the Unsubscribe action and re-run TLC.

diff --git a/Bad.tla b/Good.tla
index d73f05d..1f8f852 100644
--- a/Bad.tla
+++ b/Good.tla
@@ -13,7 +13,7 @@ Subscribe(c) ==
 Unsubscribe(c) ==
   /\ c \in subs
   /\ subs' = subs \ { c }
-  /\ notified' = [c2 \in Client |-> subs \ { c } /= {}]
+  /\ notified' = [c2 \in Client |-> Cardinality(subs \ { c, c2 }) >= 1]

SubscriptionConsistency is a good example of a safety property, i.e. a formula describing what a system must not do. Liveness properties on the other hand express what a system must (eventually) do. If broker-to-clients subscription propagation is not atomic, then SubscriptionConsistency presumably should become a liveness property since we'd want to assert consistency is eventually restored.

After completing the first eight chapters of Specifying Systems I was quite surprised to read Lamport stating that "liveness properties are likely to be the least important part of your specification". At that time, I could only imagine writing SubscriptionConsistency as a liveness formula: I was still thinking in terms of code and the full details of the system. Now I see why liveness was underplayed by the author.

Because non-atomic asynchronous notifications make the specification quite complex, I have yet to find a satisfactory expression of SubscriptionConsistency that holds in the presence of message queues. However, if we limit all queue lengths to 1, then Quiescent => SubscriptionConsistency holds as an invariant, where Quiescent is defined as "all broker-to-client queues are empty".

I chose to write about my venturing into the ivory tower of formal specification to document my learning for my future self and for anyone else on a similar path. If you have any question or advice, don't hesitate to contact me via email.

Pun shamelessly and definitely intended.