# roto

Zero-allocation Rust protobuf reader and writer.

## Overview

Instead of deserializing binary protobuf data into Rust structs, roto scans a message _once_ on
construction — recording the byte offset of each field — then reads fields on demand directly from
the original bytes. No heap allocation, no data copying, no full deserialization upfront.

Writing works the same way: you provide a fixed buffer and a builder writes fields directly into it,
returning a slice of the bytes written.

## Design

`protoc` generates a `CodeGeneratorRequest` message; `protoc-gen-roto` (in
`src/bin/protoc-gen-roto.rs`) reads this from stdin, generates Rust source files, and writes a
`CodeGeneratorResponse` to stdout. `protoc` then writes those `.rs` files to disk. The generated
files are included directly in the crate that uses the protobuf messages.

Sample usage:

```sh
protoc -Iproto/ proto/hackers.proto --plugin=./target/debug/protoc-gen-roto --roto_out=src/
```

This generates `src/hackers.rs`.

## Generated code

For each protobuf message roto generates two types:

- **Reader struct** `MessageName<'a>` — borrows the original byte slice, zero-copy.
- **Builder struct** `MessageNameBuilder<'b>` — writes into a caller-provided `&mut [u8]`.

Nested message types are placed in a `pub mod message_name { ... }` module (snake_case of the
parent message name) within the same generated file.

## Sample usage

Given this proto definition:

```proto
message Hello {
    string hello_world = 1;
    message InnerWorld {
        string thought = 1;
    }
    InnerWorld inner_world = 2;
}
```

### Reading

```rust
fn parse_proto(data: &[u8]) -> roto::Result<String> {
    // Scan the data once, recording field offsets
    let hello = Hello::new(data)?;

    // String fields return &str borrowed from the original bytes (zero-copy)
    let hello_world: &str = hello.hello_world()?;

    // Nested message fields return &[u8]; construct the nested reader from those bytes
    let inner_bytes: &[u8] = hello.inner_world()?;
    let inner_world = hello::InnerWorld::new(inner_bytes)?;
    let thought: &str = inner_world.thought()?;

    Ok(format!("{} is about {}", hello_world, thought))
}
```

Fields absent from the binary data return `Err(roto::RotoError::FieldNotFound)`.
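
Because an absent field is reported as an error, callers that treat a field as optional may want to fold `FieldNotFound` into `None` while still propagating real decoding failures. The sketch below shows one way to do that. It is self-contained, so `RotoError` here is a stand-in enum: only the `FieldNotFound` variant comes from this README, while the `Malformed` variant and the `optional` helper are hypothetical.

```rust
/// Stand-in for `roto::RotoError`. Only `FieldNotFound` is documented above;
/// `Malformed` is a hypothetical variant used to show error propagation.
#[derive(Debug, PartialEq)]
pub enum RotoError {
    FieldNotFound,
    Malformed,
}

/// Hypothetical helper: turn "field absent" into `Ok(None)` and keep every
/// other error as an error.
pub fn optional<T>(result: Result<T, RotoError>) -> Result<Option<T>, RotoError> {
    match result {
        Ok(value) => Ok(Some(value)),
        Err(RotoError::FieldNotFound) => Ok(None),
        Err(other) => Err(other),
    }
}
```

With a helper like this, a call such as `optional(inner_world.thought())?.unwrap_or("")` would supply a default for a missing string without hiding malformed input.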

### Writing

Nested messages must be serialized into a scratch buffer first, then embedded as raw bytes in the
outer builder.

```rust
fn build_proto(buf: &mut [u8]) -> roto::Result<&mut [u8]> {
    // Serialize the inner message first
    let mut inner_buf = [0u8; 256];
    let inner_bytes = hello::InnerWorldBuilder::builder(&mut inner_buf)
        .thought("some thought")?
        .finish()?;

    // Build the outer message, embedding the serialized inner bytes
    HelloBuilder::builder(buf)
        .hello_world("some world")?
        .inner_world(inner_bytes)?
        .finish() // returns Result<&'b mut [u8]> — the written portion of buf
}
```

Builder methods consume `self` and return `Result<Self>`, enabling `?`-based chaining.
`finish()` returns `Result<&'b mut [u8]>` — a slice of the portion of the buffer that was written.
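
To illustrate the consuming-builder shape described above, here is a minimal standalone sketch, not roto's generated code. `SketchBuilder`, `BuildError`, and `bytes_field` are hypothetical names, and the sketch only supports field numbers up to 15 and payloads under 128 bytes so that the tag and the length each fit in a single byte.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum BuildError {
    BufferFull,
    Unsupported,
}

/// Hypothetical builder that writes into a caller-provided buffer.
pub struct SketchBuilder<'b> {
    buf: &'b mut [u8],
    len: usize,
}

impl<'b> SketchBuilder<'b> {
    pub fn builder(buf: &'b mut [u8]) -> Self {
        SketchBuilder { buf, len: 0 }
    }

    /// Append one length-delimited field (wire type 2). Consumes `self` and
    /// returns it again so calls can be chained with `?`.
    pub fn bytes_field(mut self, number: u8, payload: &[u8]) -> Result<Self, BuildError> {
        // Sketch limits: one-byte tag and one-byte length only.
        if number == 0 || number > 15 || payload.len() > 127 {
            return Err(BuildError::Unsupported);
        }
        let end = self.len + 2 + payload.len();
        if end > self.buf.len() {
            return Err(BuildError::BufferFull);
        }
        self.buf[self.len] = (number << 3) | 2; // tag = field number + wire type
        self.buf[self.len + 1] = payload.len() as u8; // length prefix
        self.buf[self.len + 2..end].copy_from_slice(payload);
        self.len = end;
        Ok(self)
    }

    /// Return only the written prefix of the buffer, with the buffer's lifetime.
    pub fn finish(self) -> Result<&'b mut [u8], BuildError> {
        let SketchBuilder { buf, len } = self;
        Ok(&mut buf[..len])
    }
}
```

Taking `self` by value means a failed write drops the builder, so a partially written message cannot be finished by accident.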

### Updating messages

To update a message, read it, set the fields you want to change on a new builder, then call
`.with(&msg)` to copy the remaining fields from the original bytes. `Message` and `foo` below are
placeholder names.

```rust
fn update_proto(data: &[u8], buf: &mut [u8]) -> roto::Result<&mut [u8]> {
    let msg = Message::new(data)?;

    let mut builder = MessageBuilder::builder(buf);
    if msg.foo()? == "bar" {
        builder = builder.foo("foosbar")?;
    }

    builder.with(&msg)?.finish()
}
```

### Repeated fields

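Accessors for repeated fields return a `RepeatedFieldIterator<'a>` that yields
`Result<(&[u8], WireType)>` items. The example below assumes `Hello` also declares a repeated field
named `tags`, which the proto definition above does not show.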

```rust
let hello = Hello::new(data)?;
for item in hello.tags() {
    let (value_bytes, _wire_type) = item?;
    // decode value_bytes according to the expected wire type
}
```
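
The "decode according to the expected wire type" step could look roughly like the following standalone sketch. It assumes the item's byte slice holds only the payload (no tag and no length prefix), which this README does not confirm. `WireType` here is a stand-in enum rather than roto's own type, and `Decoded` and `decode_item` are hypothetical names.

```rust
use std::convert::TryFrom;

/// Stand-in for roto's `WireType`; variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WireType {
    Varint,
    Fixed64,
    LengthDelimited,
    Fixed32,
}

/// Hypothetical decoded form of one repeated item.
#[derive(Debug, PartialEq)]
pub enum Decoded<'a> {
    Varint(u64),
    Fixed64(u64),
    Bytes(&'a [u8]),
    Fixed32(u32),
}

/// Decode one item payload. Returns `None` if the bytes do not match the
/// wire type (wrong length, or an unterminated varint).
pub fn decode_item<'a>(bytes: &'a [u8], wire_type: WireType) -> Option<Decoded<'a>> {
    match wire_type {
        WireType::Varint => {
            let mut value = 0u64;
            // A 64-bit varint occupies at most 10 bytes, 7 payload bits each.
            for (i, &byte) in bytes.iter().enumerate().take(10) {
                value |= u64::from(byte & 0x7f) << (7 * i);
                if byte & 0x80 == 0 {
                    return Some(Decoded::Varint(value));
                }
            }
            None
        }
        // Fixed-width values are little-endian on the wire.
        WireType::Fixed64 => {
            let raw = <[u8; 8]>::try_from(bytes).ok()?;
            Some(Decoded::Fixed64(u64::from_le_bytes(raw)))
        }
        WireType::Fixed32 => {
            let raw = <[u8; 4]>::try_from(bytes).ok()?;
            Some(Decoded::Fixed32(u32::from_le_bytes(raw)))
        }
        // Strings, bytes, and nested messages are returned as borrowed slices.
        WireType::LengthDelimited => Some(Decoded::Bytes(bytes)),
    }
}
```

A `float` field would then be recovered with `f32::from_bits` on the `Fixed32` value, and a `string` field with `std::str::from_utf8` on the `Bytes` slice.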

## Runtime API

The core runtime in `src/lib.rs` provides:

- `ProtoAccessor<'a>` — scans a message's fields and reads values at recorded offsets.
- `ProtoBuilder<'a>` — writes fields into a provided `&mut [u8]` buffer.
- `FieldIterator<'a>` / `RepeatedFieldIterator<'a>` — iterators over fields and repeated fields.
- `Tag`, `WireType` — protobuf encoding primitives.
- `read_varint`, `write_varint`, `skip_value` — low-level wire-format helpers.
- `RotoError`, `Result<T>` — error type and alias.
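
For readers unfamiliar with the wire format, the varint helpers listed above implement base-128 encoding: seven payload bits per byte, least-significant group first, with the high bit set on every byte except the last. The sketch below shows that encoding on its own. The names `decode_varint`, `encode_varint`, and `VarintError` and their signatures are illustrative assumptions and may differ from roto's `read_varint` and `write_varint`.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum VarintError {
    /// Input ended mid-varint, or the output buffer was too small.
    Truncated,
    /// More than 10 bytes, which cannot fit in a u64.
    Overflow,
}

/// Decode one varint from the start of `buf`; returns (value, bytes consumed).
pub fn decode_varint(buf: &[u8]) -> Result<(u64, usize), VarintError> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate() {
        if i == 10 {
            return Err(VarintError::Overflow);
        }
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Ok((value, i + 1));
        }
    }
    Err(VarintError::Truncated)
}

/// Encode `value` at the start of `out`; returns the number of bytes written.
pub fn encode_varint(mut value: u64, out: &mut [u8]) -> Result<usize, VarintError> {
    let mut i = 0;
    loop {
        let slot = out.get_mut(i).ok_or(VarintError::Truncated)?;
        if value < 0x80 {
            *slot = value as u8; // last byte: continuation bit clear
            return Ok(i + 1);
        }
        *slot = (value as u8 & 0x7f) | 0x80; // more bytes follow
        value >>= 7;
        i += 1;
    }
}
```

For example, 300 encodes to the two bytes `0xAC 0x02`.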

## High-level design

On construction (`MessageName::new(data)`), the generated reader struct iterates the binary once
using `FieldIterator` and records the byte offset of each field's tag. Subsequent field accesses
call `ProtoAccessor::get_value_at(offset)` — no re-scanning. For repeated fields, the start and
end offsets of the field range are recorded to bound iteration efficiently.
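
The scan-once idea can be sketched without any of roto's types. The code below makes one pass over the top-level fields, records where each value lives, and skips length-delimited payloads by their length prefix instead of descending into them. `FieldSpan`, `scan`, and `value` are hypothetical names; the sketch keeps only the last occurrence of each field number and rejects the deprecated group wire types, which may not match roto's behavior.

```rust
use std::convert::TryFrom;

/// Location of one field's value inside the original message bytes.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct FieldSpan {
    pub wire_type: u8,
    pub start: usize,
    pub end: usize,
}

/// Read one varint at `*pos` and advance `*pos` past it.
fn varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    for shift in (0..64).step_by(7) {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
    }
    None
}

/// One linear pass. `spans[n]` receives the span of field number `n`;
/// field numbers outside the slice are skipped but still validated.
pub fn scan(buf: &[u8], spans: &mut [Option<FieldSpan>]) -> Option<()> {
    let mut pos = 0;
    while pos < buf.len() {
        let tag = varint(buf, &mut pos)?;
        let number = usize::try_from(tag >> 3).ok()?;
        let wire_type = (tag & 0x7) as u8;
        let (start, end) = match wire_type {
            0 => {
                let start = pos;
                varint(buf, &mut pos)?;
                (start, pos)
            }
            1 => (pos, pos.checked_add(8)?),
            2 => {
                // Jump over the payload using its length prefix.
                let len = usize::try_from(varint(buf, &mut pos)?).ok()?;
                (pos, pos.checked_add(len)?)
            }
            5 => (pos, pos.checked_add(4)?),
            _ => return None,
        };
        if end > buf.len() {
            return None;
        }
        pos = end;
        if let Some(slot) = spans.get_mut(number) {
            *slot = Some(FieldSpan { wire_type, start, end });
        }
    }
    Some(())
}

/// Borrow a field's value bytes from the original buffer (zero-copy).
pub fn value<'a>(buf: &'a [u8], span: FieldSpan) -> &'a [u8] {
    &buf[span.start..span.end]
}
```

Because a nested message is skipped in one step, the cost of the pass depends on the number of top-level fields rather than on the total message size; reading a nested field means running `scan` again on that field's slice.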

## Benchmarks

Two benchmark suites share the same binary data files and the same four
measurement groups:

| Group           | What is timed                                           |
| --------------- | ------------------------------------------------------- |
| `shallow_parse` | Become ready to read any field (one scan / full decode) |
| `deep_parse`    | Walk the full tree: Campaign → Operations → Hackers     |
| `field_access`  | Read individual fields on an already-parsed message     |
| `iterate`       | Count top-level and nested repeated fields              |

### 1 — Generate the shared data files (do this once)

Data files are written to `data/bench/`.

```sh
cargo run --release --bin gen_bench_data -- --preset tiny
cargo run --release --bin gen_bench_data -- --preset small
cargo run --release --bin gen_bench_data -- --preset medium
cargo run --release --bin gen_bench_data -- --preset large
```

For even larger inputs use `--preset huge` (~500 MB) or set the knobs
directly:

```sh
# ~50 MB: 500 operations × 100 KB stolen_data each
cargo run --release --bin gen_bench_data -- --ops 500 --stolen-kb 100 --output data/bench/50mb.pb
```

### 2 — Rust benchmark (criterion)

```sh
cargo bench --bench hackers_bench
```

HTML reports are written to `target/criterion/`. Run a single group:

```sh
cargo bench --bench hackers_bench -- shallow_parse
```

### 3 — C / upb benchmark

Requires protobuf ≥ 21 with `protoc-gen-upb` (ships with modern `protoc`).

```sh
cd upb_test
make          # compiles hackers_bench from the pre-generated upb files
./hackers_bench
```

To regenerate the upb C files from `proto/hackers.proto`:

```sh
cd upb_test && make regen
```

### 4 — Results

Measured on Linux x86-64 with the four standard presets. Rust times are
criterion medians; C/upb times are the custom runner's mean over ≥ 0.5 s.

#### `shallow_parse` — cost to become ready to read any field

| Size   |       Bytes | roto (ns) |     upb (ns) | roto speedup |
| ------ | ----------: | --------: | -----------: | -----------: |
| tiny   |         588 |      32.7 |        606.2 |    **18.5×** |
| small  |      20,265 |     182.9 |     22,619.2 |   **123.7×** |
| medium |   2,071,053 |  16,632.0 |  5,346,977.2 |     **321×** |
| large  | 102,608,384 |   1,618.6 | 41,132,079.7 |  **25,411×** |

> roto's cost is O(number of top-level fields): it records field offsets by
> jumping past nested blobs using their length prefixes. upb fully decodes the
> entire tree — including all nested messages and raw byte payloads — into
> arena-allocated structs.

#### `deep_parse` — parse + walk Campaign → Operations → every Hacker handle

| Size   |     Bytes |   roto (ns) |    upb (ns) | roto speedup |
| ------ | --------: | ----------: | ----------: | -----------: |
| tiny   |       588 |       385.3 |       596.8 |    **1.55×** |
| small  |    20,265 |    13,374.0 |    22,321.6 |    **1.67×** |
| medium | 2,071,053 | 1,454,400.0 | 4,227,384.3 |    **2.91×** |

> roto pays one extra `::new()` scan per nesting level; upb's walk is pure
> pointer-chasing because everything was decoded upfront. roto is still
> faster overall because its per-level scans cost less than upb's full decode.

#### `field_access` — individual field reads on a pre-parsed message (`small` preset)

| Field                          | roto (ns) | upb (ns) | upb speedup |
| ------------------------------ | --------: | -------: | ----------: |
| `campaign::name`               |      14.3 |     1.11 |   **12.9×** |
| `campaign::total_bytes_stolen` |       7.1 |     1.74 |    **4.1×** |
| `operation::codename`          |      13.8 |     1.76 |    **7.8×** |
| `operation::timestamp`         |       9.7 |     1.40 |    **6.9×** |
| `operation::successful`        |       7.0 |     1.13 |    **6.1×** |
| `hacker::handle`               |      14.4 |     1.56 |    **9.2×** |
| `hacker::skill_level` (f32)    |       7.7 |     1.76 |    **4.4×** |
| `hacker::is_elite` (bool)      |       7.5 |     1.14 |    **6.6×** |
| `worm::polymorphic` (bool)     |       7.5 |     1.76 |    **4.2×** |
| `worm::payload` (bytes)        |      16.6 |     1.75 |    **9.5×** |

> After parsing, upb field reads are direct struct-member lookups (~1–2 ns).
> roto re-decodes the value at its pre-recorded byte offset on every call
> (~7–17 ns). This is the one area where upb holds a clear advantage.

#### `iterate` — count repeated fields (parse included in every iteration)

| Benchmark          | Size   | roto (ns) |    upb (ns) | roto speedup |
| ------------------ | ------ | --------: | ----------: | -----------: |
| `count_operations` | tiny   |      50.0 |       600.2 |    **12.0×** |
| `count_operations` | small  |     393.7 |    22,702.9 |    **57.7×** |
| `count_operations` | medium |  36,628.0 | 4,193,874.0 |   **114.5×** |
| `count_all_crew`   | tiny   |     235.3 |       610.2 |     **2.6×** |
| `count_all_crew`   | small  |   4,369.5 |    23,109.0 |     **5.3×** |
| `count_all_crew`   | medium | 444,930.0 | 4,151,181.5 |     **9.3×** |

> `count_operations` includes parsing; upb's O(1) array-length read is
> dominated by its full-decode cost, so roto still wins by a wide margin,
> though a smaller one than in `shallow_parse` (12–115× versus 18.5–321× on
> the same presets). `count_all_crew` also parses each `Operation` sub-message;
> roto's per-level scans remain cheaper than upb's full decode.

### Interpreting the comparison

The two libraries have fundamentally different models:

- **roto `shallow_parse`** does one linear scan recording byte offsets — no
  allocation, no field decoding. Subsequent field reads decode on demand at
  the stored offset.
- **upb `Campaign_parse`** fully decodes the entire message tree into
  arena-allocated structs upfront. Subsequent field reads are direct struct
  member lookups (~1 ns).

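The result: roto's parse is faster and allocation-free; upb's field access
after parsing is faster. For workloads that read a handful of fields from
large messages, roto wins. For workloads that read every field, and
especially ones that read fields repeatedly after a single parse, upb's
cheaper per-field access can outweigh its higher parse cost.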
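
As a rough way to reason about this trade-off, the break-even point can be estimated from the tables above: roto's head start at parse time divided by what it loses on each field read. The function below is a back-of-the-envelope model, not part of the benchmark suite. `break_even_reads` is a hypothetical name, and the model ignores the extra scans roto needs to reach nested messages.

```rust
/// Estimated number of field reads after one parse at which upb's total time
/// (parse + reads) drops below roto's. Returns `None` if either library is
/// faster on both parse and read, in which case there is no crossover.
pub fn break_even_reads(
    roto_parse_ns: f64,
    upb_parse_ns: f64,
    roto_read_ns: f64,
    upb_read_ns: f64,
) -> Option<f64> {
    let parse_gap = upb_parse_ns - roto_parse_ns; // roto's head start
    let read_gap = roto_read_ns - upb_read_ns; // roto's loss per read
    if parse_gap <= 0.0 || read_gap <= 0.0 {
        return None;
    }
    Some(parse_gap / read_gap)
}
```

Plugging in the `small` preset (`shallow_parse`: 182.9 ns versus 22,619.2 ns; `campaign::name`: 14.3 ns versus 1.11 ns) gives roughly 1,700 reads of that field before upb catches up under this simplified model.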

## Literature

https://protobuf.dev/programming-guides/encoding/