
Performance

Work in Progress

Nexus is under active development. These benchmarks reflect the current state of the codebase and may change as optimizations are added.

All benchmarks run inside Docker on an Apple M4 Max (16 cores, 128 GB RAM), PHP 8.5.3, Swoole 6.0. Numbers are from the automated PHPUnit performance test suite (tests/Performance/).

Message throughput

How many messages per second a single actor can process end-to-end (`tell()` → mailbox → behavior handler):

| Benchmark | Fiber | Swoole |
| --- | --- | --- |
| 100K messages to one actor | 1.16M msgs/sec | 929K msgs/sec |
| 50K message burst | 1.29M msgs/sec | 909K msgs/sec |
| 100K stateful transitions | 1.06M msgs/sec | 853K msgs/sec |
| Fan-out (100 actors × 100 msgs) | 1.06M msgs/sec | 659K msgs/sec |
| Multi-dispatch (50 × 100 rounds) | 998K msgs/sec | 574K msgs/sec |

Fiber is faster in single-process benchmarks because it avoids Swoole's coroutine scheduling overhead. Swoole's advantage is true async I/O and multi-worker scaling — not single-process throughput.
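Fiber's single-process edge comes from how little machinery sits between `tell()` and the handler. As a rough illustration only (not Nexus's actual implementation, whose mailbox and scheduling are more involved), the `tell()` → mailbox → handler pipeline can be sketched with a bare PHP Fiber:

```php
<?php
// Sketch of tell() -> mailbox -> handler with a single PHP Fiber.
// TinyActor is a hypothetical stand-in, not a Nexus class.

final class TinyActor
{
    private array $mailbox = [];
    private int $processed = 0;
    private Fiber $fiber;

    public function __construct(private \Closure $handler)
    {
        // The fiber drains the mailbox, suspending when it runs dry.
        $this->fiber = new Fiber(function (): void {
            while (true) {
                while ($this->mailbox !== []) {
                    ($this->handler)(array_shift($this->mailbox));
                    $this->processed++;
                }
                Fiber::suspend();
            }
        });
        $this->fiber->start();
    }

    public function tell(object $message): void
    {
        $this->mailbox[] = $message;  // enqueue: the "tell" side
        $this->fiber->resume();       // schedule: let the fiber drain it
    }

    public function processed(): int
    {
        return $this->processed;
    }
}

$sum = 0;
$actor = new TinyActor(function (object $msg) use (&$sum): void {
    $sum += $msg->seq;                // the "behavior handler"
});

for ($i = 1; $i <= 1000; $i++) {
    $actor->tell((object) ['seq' => $i]);
}

echo $actor->processed(), ' ', $sum, PHP_EOL; // 1000 500500
```

Even this toy version shows why the Fiber path is cheap: enqueue, resume, run the closure — no coroutine scheduler in between.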

Dispatch rate

Raw tell() throughput without waiting for processing:

| Runtime | Dispatch rate |
| --- | --- |
| Fiber | 5.14M tells/sec |
| Swoole | 995K tells/sec |

Actor lifecycle

| Operation | Fiber | Swoole |
| --- | --- | --- |
| Spawn 1,000 actors | 453K ops/sec (2.2 µs/actor) | 471K ops/sec (2.1 µs/actor) |
| Kill 500 actors (PoisonPill) | 165K ops/sec | 107K ops/sec |
| 500 spawn-kill cycles | 151K ops/sec | 98K ops/sec |

Ping-pong latency

Round-trip time for a message sent to an actor that replies immediately:

| Runtime | Latency | Throughput |
| --- | --- | --- |
| Fiber | 2.5 µs per round trip | 399K ops/sec |
| Swoole | 2.5 µs per round trip | 407K ops/sec |

Memory

| Runtime | Memory per actor |
| --- | --- |
| Fiber | 3,884 bytes |
| Swoole | 3,164 bytes |

At ~3-4 KB per actor, 100K actors consume roughly 300-400 MB.
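A per-actor memory estimate of this kind can be reproduced with `memory_get_usage()`. The `FakeActor` below is a hypothetical stand-in (a mailbox array, a handler slot, a path string), so its number is a floor, not Nexus's real figure:

```php
<?php
// Rough per-actor memory estimate. FakeActor is a stand-in, not a Nexus
// class; real actors carry more state, so expect a larger real number.

final class FakeActor
{
    public array $mailbox = [];
    public ?\Closure $handler = null;
    public string $path = '';
}

$count  = 100_000;
$actors = [];

$before = memory_get_usage();
for ($i = 0; $i < $count; $i++) {
    $a = new FakeActor();
    $a->path = "/user/actor-$i";
    $actors[] = $a;
}
$after = memory_get_usage();

$perActor = intdiv($after - $before, $count);
echo "~{$perActor} bytes per actor\n";
```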

Multi-worker scaling (Swoole threads)

Cross-worker messaging through Thread\Queue (one inbox per worker) with a shared Thread\Map actor directory:

| Metric | Result |
| --- | --- |
| Cross-worker throughput | 260K msgs/sec per worker pair |
| Cross-worker round-trip latency | 20 µs per round trip |
| Fan-out (4 workers, 5K messages) | 188K msgs/sec aggregate |

Envelope delivery

The worker pool passes Envelope objects through Thread\Queue. PHP serializes each object on push and deserializes on pop — this is the primary throughput ceiling for cross-thread messaging (see Hot-path breakdown below).
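The cost of that serialize/unserialize round trip can be measured directly on an Envelope-shaped payload. The field names below (`id`, `target`, `payload`) are illustrative, not Nexus's actual Envelope layout:

```php
<?php
// Measure serialize()/unserialize() on a small Envelope-like payload.
// The exact µs/op depends on the machine; the shape of the cost does not.

$envelope = (object) [
    'id'      => bin2hex(random_bytes(16)),
    'target'  => '/user/worker-1',
    'payload' => (object) ['seq' => 42],
];

$iters = 100_000;

$start = hrtime(true);
for ($i = 0; $i < $iters; $i++) {
    $bytes = serialize($envelope);   // what Thread\Queue::push pays
}
$pushNs = (hrtime(true) - $start) / $iters;

$start = hrtime(true);
for ($i = 0; $i < $iters; $i++) {
    $copy = unserialize($bytes);     // what Thread\Queue::pop pays
}
$popNs = (hrtime(true) - $start) / $iters;

printf("serialize: %.2f µs/op, unserialize: %.2f µs/op\n",
    $pushNs / 1000, $popNs / 1000);
```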

PHP OPcache JIT

PHP's JIT compiler reduces interpreter overhead for hot loops and pure-PHP arithmetic. The Nexus php-swoole Docker image enables JIT automatically:

```ini
opcache.enable=1
opcache.enable_cli=1
opcache.jit=tracing
opcache.jit_buffer_size=64M
```

Effect on the world benchmark (16 workers · 16 senders, Apple M4 Max, Docker):

| JIT mode | Throughput |
| --- | --- |
| JIT disabled (default CLI) | ~3.1M orders/sec |
| JIT tracing | ~3.5M orders/sec |

JIT is most effective on the pure-PHP hot paths: actor handler closures, ActorPath string operations, and Behavior dispatch. Cross-thread serialization (Thread\Queue) is a native C operation and is unaffected.

ZTS compatibility. PHP 8.5 ZTS supports JIT. Each thread benefits from pre-compiled hot functions in the shared OPcache region. No configuration differences are needed between single-threaded and multi-threaded deployments.

Production deployments. When not using the Nexus Docker image, add the above opcache.* settings to php.ini or a conf.d override file. PHP CLI disables OPcache by default; opcache.enable_cli=1 is required.

Hot-path component breakdown

Each message through the worker pool passes through five stages. The table below shows the measured cost of each stage in isolation (JIT tracing enabled, Apple M4 Max, Docker, 300K iterations after warmup):

| Stage | Component | µs/op | M/s ceiling |
| --- | --- | --- | --- |
| Producer | `ActorPath::root()` cache hit | 0.01 | 86 |
| | `random_bytes(16)` + `bin2hex()` — Envelope ID | 0.18 | 5.6 |
| | `Envelope::of()` (rand + alloc + 3 fields) | 0.36 | 2.8 |
| | `Thread\Queue::push` (PHP serialize) | 0.61 | 1.6 |
| Worker | `Thread\Queue::pop` (PHP unserialize) | 0.98 | 1.0 |
| | `Channel::push()` + `pop()` (SwooleMailbox) | 0.07 | 15 |
| | ActorCell dispatch overhead | 0.06 | |
| | Full BehaviorWithState handler + apply | 0.13 | 7.7 |
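A minimal version of how such per-stage numbers are obtained: warm up so JIT traces compile, then time a closure in a tight loop. The `bench()` helper here is a sketch, not the actual harness in `tests/Performance/hotpath_breakdown.php`:

```php
<?php
// Micro-benchmark sketch: report µs/op for an isolated operation,
// the same shape of measurement as the table above.

function bench(string $name, \Closure $op, int $iters = 300_000): float
{
    for ($i = 0; $i < 1_000; $i++) {  // warmup so JIT traces are compiled
        $op();
    }
    $start = hrtime(true);
    for ($i = 0; $i < $iters; $i++) {
        $op();
    }
    $usPerOp = (hrtime(true) - $start) / $iters / 1000;
    printf("%-30s %.3f µs/op\n", $name, $usPerOp);
    return $usPerOp;
}

// Two stages from the table, measured in isolation.
bench('random_bytes(16) + bin2hex', fn () => bin2hex(random_bytes(16)));
bench('small object alloc', fn () => (object) ['seq' => 1]);
```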

Critical-path analysis

Producer and worker run concurrently on separate OS threads. Only the slower side constrains throughput:

```
Producer: Envelope::of() + serialize   = 0.97 µs
Worker:   unserialize + actor dispatch = 1.24 µs  ← bottleneck
```

At 1.24 µs per message, each worker's theoretical ceiling is 0.81 M/s, giving 12.9 M/s across 16 workers. The measured 3.5 M/s is ~27% of this ceiling — the gap is Thread\Queue mutex contention under concurrent load. Single-threaded micro-benchmarks do not capture the synchronization overhead between 16 sender threads and 16 worker threads.
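The arithmetic behind those figures, spelled out:

```php
<?php
// Ceiling arithmetic from the critical-path analysis above.
$workerCost  = 0.98 + 0.07 + 0.06 + 0.13;   // µs per message (bottleneck side)
$perWorker   = 1 / $workerCost;             // ≈ 0.81 M msgs/sec per worker
$sixteenWide = 16 * $perWorker;             // ≈ 12.9 M msgs/sec theoretical
$measured    = 3.5;                         // M msgs/sec, from the JIT table

printf("per-worker ceiling: %.2f M/s, 16 workers: %.1f M/s, utilization: %.0f%%\n",
    $perWorker, $sixteenWide, 100 * $measured / $sixteenWide);
```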

What limits throughput

Thread\Queue serialization (1.59 µs/msg total) is the structural ceiling. Thread\Queue uses PHP's native serialize()/unserialize() internally; this cannot be configured or replaced without changes to Swoole. Reducing the serialized payload size (e.g., compact ActorPath serialization) saves bytes but does not materially improve throughput because the bottleneck is PHP interpreter overhead, not data transfer time.

Secondary hotspots

| Hotspot | Cost | Fixable? |
| --- | --- | --- |
| `random_bytes(16)` per Envelope — one `getrandom()` syscall/msg | 0.18 µs | Yes — a thread-local PRNG eliminates the syscall at the cost of non-CSPRNG IDs |
| `BehaviorWithState::next()` — 1 PHP object alloc/msg | 0.08 µs | Partially — could cache a static same-state singleton to avoid allocation on no-change paths |
| `applyStatefulBehavior()`: `isStopped()` + `hasNewState()` + `state()` | 0.02 µs | Low priority — direct nullable field access, already very fast |
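The first fix can be sketched with PHP 8.2+'s `Random\Randomizer`: seed a Xoshiro256** engine once, then draw IDs in pure userland with no syscall per message. This is a hypothetical replacement, not code from Nexus, and the resulting IDs are no longer cryptographically secure — fine for message correlation, not for secrets:

```php
<?php
// Thread-local non-CSPRNG ID generator: seed once, draw cheaply per message.
// FastId is a hypothetical helper, not a Nexus class.

final class FastId
{
    private \Random\Randomizer $rng;

    public function __construct()
    {
        // Xoshiro256** is seeded once (from a CSPRNG internally);
        // subsequent getBytes() calls never hit the kernel.
        $this->rng = new \Random\Randomizer(
            new \Random\Engine\Xoshiro256StarStar()
        );
    }

    public function next(): string
    {
        return bin2hex($this->rng->getBytes(16)); // 32 hex chars, like random_bytes(16)
    }
}

$ids = new FastId();
$a = $ids->next();
$b = $ids->next();
echo $a, PHP_EOL;
```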

Running the breakdown yourself

```shell
docker compose exec php-swoole php \
  -d opcache.enable_cli=1 -d opcache.jit=tracing \
  -d opcache.jit_buffer_size=64M \
  tests/Performance/hotpath_breakdown.php
```

Running benchmarks

```shell
# All benchmarks (requires Swoole container)
docker compose exec php-swoole vendor/bin/phpunit --testsuite=performance

# Fiber-only benchmarks (no Swoole needed)
docker compose exec php vendor/bin/phpunit --testsuite=performance --filter=Fiber

# Cluster benchmarks only
docker compose exec php-swoole vendor/bin/phpunit --testsuite=performance --filter=Cluster
```

Interpreting the numbers

Fiber vs Swoole: Fiber is faster in isolated single-process benchmarks. This does not mean Fiber is "better" — Swoole provides true async I/O (database, HTTP, filesystem), multi-worker scaling, and native coroutine support. Use Fiber for development and moderate workloads. Use Swoole for production with I/O-bound or multi-core workloads.

Docker overhead: Benchmarks run inside Docker containers. Native performance on the host machine is typically 10-20% faster.

Message size: All benchmarks use small messages (`(object) ['seq' => $i]`). Larger messages reduce throughput due to serialization and memory-copy costs.
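To see the size effect directly, compare serialized byte counts — `Thread\Queue` serializes every payload, so per-message cost grows with the message. The 4 KB body below is an arbitrary example:

```php
<?php
// Serialized size of a benchmark-style small message vs. a larger payload.
$small = (object) ['seq' => 1];
$large = (object) ['seq' => 1, 'body' => str_repeat('x', 4096)];

$smallLen = strlen(serialize($small));
$largeLen = strlen(serialize($large));

printf("small: %d bytes, large: %d bytes (%.0fx)\n",
    $smallLen, $largeLen, $largeLen / $smallLen);
```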