Pulsar-native Shared-Storage Streaming Engine
Oxia-backed metadata and coordination plane. Object-storage-first durable data plane. Stateless, leaderless broker. Pulsar native protocol, KoP Kafka compatibility, and lakehouse-native stream-table duality β all in one architecture.
L0–L4 layered architecture. The single internal coordinate β
streamId + offset
β powers Pulsar, Kafka, cursor, retention, compaction, and lakehouse.
Pulsar native protocol + KoP Kafka compatibility. ManagedLedger facade projects MessageId/Position onto the internal offset truth without introducing a second durable log.
Offset authority, append fencing, offset index, object manifest, cursor state, transaction state, routing. Oxia decides visibility β not the broker, not the object store.
Multi-stream WAL objects aggregate writes across partitions. Per-stream compacted objects serve reads and lakehouse queries from the same physical bytes.
Generation replacement swaps offset index targets atomically. SBT and SDT expose committed streams as Iceberg/Delta/Hudi tables.
Stateless broker, preferred routing, append session fencing, zone-aware failover. No data movement on scale-in/out.
Multi-stream WAL objects, per-stream compacted objects, SBT/SDT table files, cursor/txn snapshots. Objects are immutable once committed. Visibility is Oxia's domain.
Nereus is built on a shared-storage streaming architecture, adding first-class Pulsar-native semantics β MessageId, ManagedCursor, Shared/Key_Shared subscriptions, transactions, and replication.
| Capability | Nereus Design | Layer |
|---|---|---|
| Oxia metadata service | Oxia: offset index, fencing, routing, cursor, txn, manifest | L0/L4 |
| Stateless / leaderless broker | Broker not durable owner; preferred broker for locality only | L4 |
| Object WAL | Multi-stream WAL object; independent per-stream slice visibility | L0 |
| Offset index | {streamId, offsetEnd} β object range + entry index + cumulative size | L0 |
| Commit-time offset assignment | Offset assigned by Oxia at index-commit time, not by broker | L0 |
| Compaction | Generation replacement: same offset, new physical object, atomic swap | L3 |
| Lakehouse / stream-table duality | SBT (built-in table) + SDT (external delivery); catalog not on ack path | L3 |
| Cost profile | Object WAL + Oxia + object storage for large-scale, low-cost workloads | L0/L3 |
| Latency profile | BookKeeper WAL or local WAL feeding into the same StreamStorage truth | L0 |
| Kafka protocol | KoP projection: Kafka offset = stream record offset | L2 |
Oxia decides visibility, assigns offsets, fences stale commits. Object store stores bytes β not truth. Broker is stateless for correctness.
One physical WAL object aggregates multiple stream slices. Each slice becomes visible independently when its Oxia offset index entry is committed.
Compaction never changes stream offsets. It replaces which physical object an offset range points to β readers switch atomically to the highest visible generation.
SBT exposes streams as built-in Iceberg tables. SDT delivers to external catalogs. Lakehouse queries and streaming reads share the same compacted objects.
Pulsar protocol as first-class, Kafka via KoP projection. Both map to the same stream offset truth. No second durable log for Kafka workloads.
Latency-optimized with BookKeeper WAL, cost-optimized with Object WAL. Same StreamStorage API, same Oxia offset truth, same compaction pipeline.
Oxia append sessions with monotonic fencing tokens. Stale broker commits are rejected at the metadata plane. No split-brain visible commit.
WAL, compacted, snapshot, and table objects are immutable once committed. GC is driven by Oxia reference counting β not by object store list operations.
One target architecture, split into designable, reviewable modules. Each future maps to one or more layers (L0–L4).
StreamStorage API, object WAL, Oxia offset index, commit-time offset assignment, read resolver. L0
ManagedLedger/ManagedCursor compatibility runtime, virtual ledger projection, Position/MessageId mapping. L1
Mark-delete offset, individual ack ranges, cursor snapshot objects, cursor CAS, Shared/Failover/Exclusive recovery. L1
WALβcompacted Parquet, generation overlay, atomic index replacement, fallback, GC protection. L3
Kafka offset=stream offset, Produce/Fetch mapping, group offset, transaction visibility, leader epoch projection. L2
SBT built-in table, SDT external delivery, Iceberg catalog commit, catalog repair, delivery idempotence. L3
Broker session, zone-aware routing ring, preferred broker, append session transfer, cache invalidation. L4
Key_Shared, delayed delivery, pending ack transaction, replicated subscription, schema/system topic bootstrap, geo-replication. L1/L4
Nereus is in the design phase. Start with the architecture overview, then dive into individual future design docs.