WebAssembly and replayable functions
... or how to make non-deterministic functions deterministic through the power of an isolated WASM sandbox. This time we'll go through the problems of unpredictability in code whose execution can be affected by external factors like I/O operations, time etc. Then we'll see what kind of domains would greatly benefit if we managed to solve these issues. And finally: how WebAssembly could help us deal with them.
Motivation
Let's consider several scenarios where being able to isolate and control sources of uncertainty could benefit us.
Flaky tests
One of the problems that developers often struggle with is diagnosing occasionally failing tests. This often happens when the bug inside of a test is related to non-deterministic sources of code execution like time-conditional logic, random number generators, I/O operations or the order of scheduled tasks in parallel computation. Since any of these can produce different results with each call, they make bugs hard to reproduce.
Unfortunately almost none of the languages, runtimes and ecosystems used in production systems nowadays have sufficient abstractions allowing us to easily identify and substitute all possible sources of non-determinism. This forces users to sometimes introduce their own - often half-baked and incomplete - abstractions to isolate them.
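As a made-up illustration (Jest-style, not taken from any particular code base), here's a test whose outcome silently depends on the system clock - it only fails when the run happens to cross a date boundary:

function isSameDay(a, b) {
    return a.toDateString() === b.toDateString()
}

test('orders placed "now" belong to today', () => {
    const placedAt = new Date() // hidden dependency on the system clock
    // ... slow setup work runs here ...
    // fails only when the test straddles midnight - hard to reproduce on demand
    expect(isSameDay(placedAt, new Date())).toBe(true)
})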
Stateful workflows
Workflows are another candidate. We're talking about programs representing complex state machines, potentially within long-running processes, that can include orchestrating multiple services and I/O operations over potentially long periods of time: days, weeks or months. Examples of such are periodic subscription events like automated mailing or payment scheduling services.
We cannot simply call sleep(1.month()) inside of a function and hope that the machine it lives on won't die. This issue has usually been solved by splitting the logic into steps, with complex frameworks or hand-crafted code gluing, persisting and navigating the state machine created by such a workflow. All of that plumbing logic can easily obfuscate the actual domain, making it harder to reason about and maintain.
With that in mind, solutions like Azure Durable Functions or temporal.io started offering an execution environment capable of running, suspending and restoring the state produced by language-idiomatic code across multiple machine runs.
Replicating state by replicating functions
The problem of state machine replication is quite popular in both industry and academia. Paxos and Raft are commonly used by pretty much any distributed system that needs to manage state. Conflict-free Replicated Data Types are another approach that targets the same issue, but with a different set of constraints.
Many of these implementations rely on something called a replicated log of operations, meaning we replicate a sequence of records describing state changes over multiple machines. These records are usually predefined by the system. We described that approach in the context of CRDTs in the past.
Now imagine that the produced state change was huge or generated by some user-defined script like a stored procedure or a formula in a spreadsheet. We have to replicate the results of that formula execution (potentially thousands of cells), even though replicating the formula itself and recomputing it on other nodes would be a much more lightweight option. We usually can't do that, mostly because:
- Custom scripts may be hard to serialize.
- Logic often relies on at least one external resource (usually time), which varies each time the replicated content is re-executed.
With WebAssembly, we can solve both of these issues.
Challenge
All of the examples above suffer from the same condition: non-determinism. Unfortunately, many useful programs must rely on external inputs - changing over time - that are hard to identify and control.
What if we could escape that insanity? What if we could repeat the same function call and expect the same result, even if it depends on external factors?
Here comes the WASM
The nice thing about WebAssembly is that its implementations offer us a sandboxed environment which - thanks to its deny-by-default architecture - is cut off from all peripherals like the clock and I/O devices. These can be provided by the host if it specifies them explicitly as part of the module's imports (we'll see them later).
Below we're going to use code from this repository. Our demo comes in two parts:
- A WebAssembly binary written in Rust, which provides our "user-defined stored procedure".
- A JavaScript host program which initializes a WASM virtual machine and executes the WASM binary code.
We'll be running the following Rust program:
#[no_mangle]
pub unsafe extern "C" fn echo() {
    // our "business logic": receive 5 messages from the outside
    // and send them back with 'echo:' prefix, waiting 2 seconds
    // each time
    for _ in 0..5 {
        let timestamp = receive();
        sleep(2000.0);
        send(&format!("echo: {}", timestamp))
    }
}
fn send(data: &str) {
    let ptr = data.as_ptr();
    unsafe { write(ptr, data.len() as u32) }
}
const BUF_LEN: usize = 16;
static mut BUF: [u8; BUF_LEN] = [0u8; BUF_LEN];

fn receive() -> String {
    let mut result = String::new();
    // concat buffered slices into full message
    while {
        unsafe {
            let ptr = BUF.as_mut_ptr();
            let n = read(ptr, BUF_LEN as u32) as usize;
            result.push_str(std::str::from_utf8_unchecked(&BUF[0..n]));
            n == BUF_LEN
        }
    } {}
    result
}
extern "C" {
/// Signal host that there's a string to write at address `ptr`
/// and it's `len` bytes long.
fn write(ptr: *const u8, len: u32);
/// Signal host that we want to read the next message, which could
/// be written into a buffer at a given address and provided length.
/// Returns number of bytes written into a buffer.
fn read(ptr: *mut u8, len: u32) -> u32;
fn sleep(millis: f64);
}
Since the primitives available across the WASM sandbox barrier are very constrained - basically all we can pass are numbers/pointer addresses and the memory shared between the host and the WASM module - we use a buffering pattern similar to reading the contents of a file.
What you can see is that our echo function depends on three external functions that we expect to be provided by the WASM host: read/write (exposed through our receive/send helpers) used for communication with the outside world, and sleep to show how to handle delays. Here, we'll use the following definition:
const env = {
    read: () => new Date().toISOString(),
    write: (str) => console.log(str),
    // sleep actively blocks current thread before returning
    sleep: (ms) => Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, Number(ms)),
}
The problem here is that passing strings through a WebAssembly instance requires writing to its shared memory. It doesn't happen automatically, so we need to add some extra logic to make our env object compliant with the Rust extern function signatures:
// `memory` is a module-level variable assigned once the WASM instance
// exposes its linear memory (see below)
function imports(env) {
    const read = env.read
    const write = env.write
    const sleep = env.sleep
    let buf = null
    let readIdx = 0
    return {
        env: {
            read: (ptr, len) => {
                if (buf === null || readIdx === buf.byteLength) {
                    // we already passed full message, read the next one
                    const json = read()
                    const encoder = new TextEncoder()
                    buf = encoder.encode(json)
                    readIdx = 0
                }
                let n = Math.min(len, buf.byteLength - readIdx)
                const slice = buf.slice(readIdx, readIdx + n)
                readIdx += n
                const view = new Uint8Array(memory.buffer, ptr, n)
                view.set(slice)
                return n
            },
            write: (ptr, len) => {
                const view = new Uint8Array(memory.buffer, ptr, len)
                const decoder = new TextDecoder('utf-8')
                const json = decoder.decode(view)
                write(json)
            },
            sleep,
        }
    }
}
Since our reads happen through shared buffer memory, we need to split messages into chunks that fit into the provided client buffer. If we wanted to pass more complex data types, the logic would be similar. We can do one of the following:
- Serialize the object, pass it through shared memory and deserialize it in the executing client code. This is the approach we use here.
- Have a dedicated map, store the object there and pass its unique key. We can define a set of dedicated methods that operate on that object using its identifier instead. It comes with the cost of indirection, but also allows us to e.g. execute methods of an object living in the host space - it's basically how wasm-bindgen enables our Rust code to perform reflection and interact with an external JavaScript environment (see the sketch after this list).
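A minimal sketch of that second, handle-based approach - all names here (create_logger, logger_log, drop_logger) are hypothetical and not part of the demo repository, and we reuse the memory variable from the host code above:

let nextHandle = 1
const handles = new Map()

const hostObjects = {
    // called by the WASM module to create a host-side object; returns its handle
    create_logger: () => {
        const handle = nextHandle++
        handles.set(handle, { lines: [] })
        return handle
    },
    // called by the module to invoke a method on the object behind a handle
    logger_log: (handle, ptr, len) => {
        const view = new Uint8Array(memory.buffer, ptr, len)
        handles.get(handle).lines.push(new TextDecoder('utf-8').decode(view))
    },
    // called by the module when the object is no longer needed
    drop_logger: (handle) => handles.delete(handle),
}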
The read/write/sleep function signatures are good enough for our demo, but for building general-purpose programs they may be too limiting. In order to avoid every framework/compiler relying on its own set of abstractions, the Bytecode Alliance - the organization taking care of WASM's future development - proposed to standardize them in the form of WASI. It's a thing worth keeping an eye on.
With all of the definitions prepared, we can create a WebAssembly instance straight from JavaScript like in the code below:
import * as fs from 'fs'

// binary payload representing our complete WASM script
const bin = fs.readFileSync('path/to/wasm/binary.wasm')
let memory = null // we haven't constructed the memory object yet

function imports(env) { ... }
const env = { ... }

const importObject = imports(env)
const module = new WebAssembly.Module(bin)
const instance = new WebAssembly.Instance(module, importObject)
// bind the memory exported by the instance, used by our import functions
memory = instance.exports.memory

// get handle to Rust function and run it
const { echo } = instance.exports
echo()
If you try to run this program several times, you'll notice that each run hangs for a few seconds (due to the sleep calls) and produces a different result each time. We're going to change that.
Capturing and replaying function calls
Our idea is very simple - we're going to wrap the functions declared at the very beginning of our imports definition with a result-capturing mechanism. That mechanism will work in two phases:
- capturing is executed at the source, the first time a function is called. During this phase we run the original underlying function calls and record their results on the fly onto a call list (since the same function can be called multiple times over the course of the WASM program execution).
- replaying is used to replay the captured context of the function without actually doing any of the underlying extern calls. Instead it returns the captured call results in FIFO order.
Thankfully, building such a mechanism is very simple:
export class CapturingContext {
    constructor() {
        this.isCapturing = true
        this.callPointers = {}
        this.calls = {}
        this.memory = null
    }

    capture(name, fn) {
        return ((...args) => {
            const ctx = this.calls[name] || []
            this.calls[name] = ctx
            if (this.isCapturing) {
                let result = fn(...args)
                ctx.push(result) // record result before returning
                return result
            } else {
                // dequeue the next result from queue of records
                // and return it instead of doing a function call
                let ptr = this.callPointers[name] | 0
                let result = ctx[ptr++]
                this.callPointers[name] = ptr
                return result
            }
        }).bind(this)
    }

    imports(env) {
        const read = this.capture('read', env.read)
        const write = env.write
        // uncomment line below to capture the printed output as well
        //const write = this.capture('write', env.write)
        const sleep = this.capture('sleep', env.sleep)
        /* ... rest of our imports function defined before */
    }
}
Our capture method basically replaces the original import function definitions with wrappers recording and replaying their outputs depending on the execution context (capture vs. replay). Now let's compose our CapturingContext into a WASM instance.
function run(context) {
    const module = new WebAssembly.Module(bin)
    const importObject = context.imports(env)
    const instance = new WebAssembly.Instance(module, importObject)
    context.memory = instance.exports.memory
    const { echo } = instance.exports
    echo() // our Rust code call
}

const context = new CapturingContext()
run(context) // first call - record execution trace

const snapshot = context.snapshot() // snapshot recorded calls

const context2 = new CapturingContext()
// restore recorded functions and replay the same logic
context2.replay(snapshot)
run(context2)
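The snapshot and replay methods come from the repository and weren't shown in the class listing above; a minimal sketch of how they might look as two more methods on CapturingContext (assuming a snapshot is just a copy of the recorded call results - the actual repository code may differ):

snapshot() {
    // deep-copy the recorded results so later runs can't mutate them
    return JSON.parse(JSON.stringify(this.calls))
}

replay(snapshot) {
    this.calls = snapshot
    this.callPointers = {}
    this.isCapturing = false // switch the capture() wrappers into replay mode
}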
When you run a new WASM instance over this captured snapshot in the second run call, you'll notice two things:
- Results are exactly the same - we replaced time-dependent calls with results captured from the previous run.
- Our script executes almost instantaneously - this is a result of capturing the sleep function itself. We can still see the console outputs though, because we didn't capture the write method. If we did, printed statements would also be omitted in the second run. Depending on the use case this may or may not be a desired outcome.
If you're e.g. diagnosing a flaky test, this captured snapshot could be used to investigate the last failing run. If you're replicating this stored procedure call via e.g. CAS Paxos, this snapshot could be a part of the message payload to guarantee that the final state on each replica is consistent.
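In the flaky test case that could be as simple as the sketch below - the file name is arbitrary and we assume the snapshot is JSON-serializable, as in the sketch above:

// on the CI machine, right after the failing run:
fs.writeFileSync('flaky-run.json', JSON.stringify(context.snapshot()))

// later, on a developer machine, replay the exact same execution:
const replayContext = new CapturingContext()
replayContext.replay(JSON.parse(fs.readFileSync('flaky-run.json', 'utf-8')))
run(replayContext)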
Workflows
When it comes to workflows, there are more considerations, however the basics are already there. What's special here is that long-running processes can usually be cut into a series of steps, separated from each other by things like I/O operations or longer sleep periods.
The idea here is that we could build boundaries around these natural barriers in the code. For example, in order to support long-running processes, we could adapt our sleep function implementation to persist the context snapshot recorded so far, scrap the current WASM instance, then schedule a cron job to wake it up and restart it in cases when the sleep is about to take too long.
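A hypothetical sketch of such a sleep import - persistSnapshot and scheduleWakeup are assumed helpers rather than part of the demo, and we reuse env and context from the snippets above:

const LONG_SLEEP_MS = 60_000 // arbitrary threshold for "long" sleeps

const durableEnv = {
    ...env,
    sleep: (ms) => {
        if (ms > LONG_SLEEP_MS) {
            // store the calls recorded so far and arrange a restart after the delay
            persistSnapshot(context.snapshot())
            scheduleWakeup(Date.now() + ms)
            // unwinding tears down the current WASM instance
            throw new Error('workflow suspended')
        }
        // short sleeps still block the thread as before
        return Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, Number(ms))
    },
}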
Each process restart means quickly replaying the effects captured from the barrier functions - like the immediate omission of a sleep call, so that we can move to the next execution steps right away - until we reach the next non-captured execution step.
This also means that the difference between capture/replay modes blurs, as on each restart our workflow execution is both replaying the steps already performed and capturing the not-yet-executed ones, until the program completes.
What else?
We didn't cover non-determinism caused by multithreaded function execution - we didn't need to, as WASM instances are by definition single-threaded. We could of course change that - if you haven't noticed, our send/receive methods are already hints at an actor programming model - but again, since this happens over the boundary of extern functions that we have control over, we can substitute them as well.
Keep in mind that replaying means re-executing the code blocks between external functions - if such code contains e.g. CPU-intensive logic, it will be executed each time it's called. That could be a problem, however it's much less frequent in the scenarios we discussed above and can also be solved with conditional logic over the captured functions.