KNet: performance evaluation
This document describes the benchmark approach used to evaluate KNet performance, presents results, and provides an interpretation of the data. The benchmarks are the Produce and Consume Benchmark and the Roundtrip Benchmark.
Initial considerations
Apache Kafka™ uses a client-server architecture that relies on the network for communication. Overall infrastructure performance depends on several elements:
1. The hardware running the Apache Kafka™ server: see https://kafka.apache.org/documentation/#hwandos for details
2. The Apache Kafka™ server configuration
3. The network between clients and servers
4. The client library and its configuration parameters
5. The user application
All elements above affect the results, with the first three typically having the highest impact. The KNet benchmarks focus on point 4 — the client library — while controlling for the others:
- Points 1, 2 and 3 are addressed by using an infrastructure based on SSD storage, a high core count, and a Gigabit LAN, reducing the influence of external conditions and distributing their effects statistically.
- Point 5 is addressed by running identical application logic for both libraries in every test, applying the same configuration parameters each time.
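Since absolute numbers are strongly influenced by hardware and network conditions that vary between environments, the benchmarks use a relative comparison approach: every result is expressed as the KNet measurement divided by the Confluent.Kafka™ measurement, as a percentage. The measured quantities are elapsed time and latency, so a lower ratio favours KNet: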
- < 100% → KNet is faster
- > 100% → Confluent.Kafka™ is faster
- ≈ 100% → comparable performance
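As a minimal sketch of how such a ratio can be computed and read, the helper below divides a KNet measurement by the corresponding Confluent.Kafka™ measurement. The names `RatioReport`, `RatioPercent` and `Interpret` are hypothetical, and the 5% tolerance used for "comparable" is an assumption made for illustration; the benchmark itself does not define a threshold.

```csharp
using System;

public static class RatioReport
{
    // Ratio of the KNet measurement to the Confluent.Kafka measurement, in percent.
    // Both inputs must use the same unit (for example elapsed microseconds).
    public static double RatioPercent(double knet, double confluent)
    {
        if (confluent <= 0)
            throw new ArgumentOutOfRangeException(nameof(confluent), "Reference value must be positive.");
        return knet / confluent * 100.0;
    }

    // Maps a ratio to the three cases listed above.
    // The tolerance is an illustrative assumption, not a value taken from the benchmark.
    public static string Interpret(double ratioPercent, double tolerance = 5.0)
    {
        if (Math.Abs(ratioPercent - 100.0) <= tolerance) return "comparable";
        return ratioPercent < 100.0 ? "KNet is faster" : "Confluent.Kafka is faster";
    }
}
```

For example, `RatioReport.Interpret(RatioReport.RatioPercent(50, 100))` yields "KNet is faster", because KNet needed half the time of the reference.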
The reference library for comparison is Confluent.Kafka™, the actively maintained .NET client for Apache Kafka™. The two libraries differ in their architecture:
- KNet wraps the official Apache Kafka™ JARs via JNI; Confluent.Kafka™ wraps librdkafka, a native C library.
- Thread models and internal queuing differ.
- Serializers and deserializers differ.
- Many configuration parameters are shared.
Produce and Consume Benchmark
This benchmark measures the throughput of KNet and Confluent.Kafka™ for produce and consume operations independently.
Test program
To make the comparison meaningful, shared configuration parameters (linger time, batch size, buffer sizes, etc.) are set identically for both libraries. Parameters that have different semantics across libraries (e.g. KNet's byte-based memory pool vs librdkafka's message-count-based queue limit) are tuned to minimise their influence by ensuring all messages are fully sent or received before stopping measurement.
Each test:
- runs produce and consume as two separate phases;
- uses a dedicated topic per test to avoid cross-test interference: {TopicPrefix}_{testName}_{packets}_{length}_{testNum}, where TopicPrefix is configurable (default testTopic), testName is KNET or CONF, and testNum is the repetition index;
- uses simple types to minimise serializer overhead: key is a long (incremental ordinal), value is a byte[] pre-built by the application;
- alternates between KNet and Confluent.Kafka™ across repetitions to distribute external effects;
- writes raw data to CSV for offline analysis;
- reports aggregated statistics at the end.
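The topic naming scheme listed above can be sketched as a small helper. `TopicNaming` and `Build` are hypothetical names that do not come from the benchmark program, and the `int` parameter types are assumptions; only the pattern and the default prefix are taken from the text.

```csharp
public static class TopicNaming
{
    // Builds {TopicPrefix}_{testName}_{packets}_{length}_{testNum}.
    // testName is expected to be "KNET" or "CONF"; testNum is the repetition index.
    public static string Build(string testName, int packets, int length, int testNum,
                               string topicPrefix = "testTopic")
        => $"{topicPrefix}_{testName}_{packets}_{length}_{testNum}";
}
```

Because the library name and the repetition index are part of the topic name, two runs never share a topic, which is what keeps one test from reading messages left behind by another.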
For each (repetitions × library) combination the test reports: Max, Min, Average, Standard Deviation, and Coefficient of Variation.
The ratio columns in the tables below are KNet / Confluent.Kafka™ × 100 for Average and Standard Deviation.
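The reported statistics can be reproduced from the raw CSV samples with a few lines of code. The sketch below is not the benchmark's implementation: `BenchmarkStats` is a hypothetical name, and the use of the population standard deviation (dividing by N rather than N-1) is an assumption, since the text does not say which variant is used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BenchmarkStats
{
    // Returns Max, Min, Average, Standard Deviation and Coefficient of Variation.
    // Assumption: population standard deviation (divide by N).
    public static (double Max, double Min, double Average, double StdDev, double Cv)
        Compute(IReadOnlyCollection<double> samples)
    {
        if (samples.Count == 0)
            throw new ArgumentException("At least one sample is required.", nameof(samples));

        double average = samples.Average();
        double variance = samples.Sum(s => (s - average) * (s - average)) / samples.Count;
        double stdDev = Math.Sqrt(variance);
        // Coefficient of Variation is the standard deviation relative to the mean.
        double cv = average == 0 ? 0 : stdDev / average;
        return (samples.Max(), samples.Min(), average, stdDev, cv);
    }
}
```

The two values shown in each table cell would then be `knet.Average / confluent.Average * 100` and `knet.StdDev / confluent.StdDev * 100`, where `knet` and `confluent` are the results of `Compute` for the two libraries.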
Approach
1. Create a topic.
2. Produce all messages, measuring elapsed time; the cycle ends with a flush to guarantee all data has been delivered before stopping the clock.
3. Consume the messages produced in step 2 until the expected count is received.
The produce cycle:
- allocates a random byte array (allocation time is excluded from measurement);
- creates and sends each message, measuring both operations;
- calls flush and stops the clock.
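The produce cycle can be sketched independently of either client library. In the example below, `send` and `flush` are hypothetical delegates standing in for the real producer calls, and `ProduceCycleSketch` is an illustrative name. The point is where the clock starts and stops, not the actual benchmark code.

```csharp
using System;
using System.Diagnostics;

public static class ProduceCycleSketch
{
    // send and flush are stand-ins for the producer calls of the library under test.
    public static TimeSpan Measure(int packets, int length,
                                   Action<long, byte[]> send, Action flush)
    {
        // Payload allocation happens before the clock starts, so it is not measured.
        var data = new byte[length];
        new Random().NextBytes(data);

        var clock = Stopwatch.StartNew();
        for (long key = 0; key < packets; key++)
        {
            send(key, data);   // key is an incremental ordinal, value is the pre-built payload
        }
        flush();               // wait until everything has been delivered
        clock.Stop();          // only now stop the clock
        return clock.Elapsed;
    }
}
```

Stopping the clock after the flush matters because both libraries buffer messages internally: without it, the measurement would only capture the time needed to enqueue the messages.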
The consume cycle:
- subscribes to the topic;
- starts the clock when the partition assignment callback fires;
- increments a counter on each received message;
- unsubscribes and stops the clock when the expected count is reached.
Configuration
| Parameter | Value |
|---|---|
| Acks | None (no server-side acknowledgement overhead) |
| LingerMs | 100 ms |
| BatchSize | 1 000 000 |
| MaxInFlight | 1 000 000 |
| SendBuffer | 32 MB |
| ReceiveBuffer | 32 MB |
| FetchMinBytes | 100 000 |
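For orientation, the table can be expressed with the standard Apache Kafka™ client property names. This mapping is an assumption: the table only gives short parameter names, so the property keys below, the split between producer and consumer settings, and the reading of "32 MB" as 32 × 1024 × 1024 bytes are illustrative. The class name is hypothetical.

```csharp
using System.Collections.Generic;

public static class ProduceConsumeBenchmarkConfig
{
    // Assumed mapping of the table rows to Apache Kafka producer properties.
    public static Dictionary<string, string> Producer() => new Dictionary<string, string>
    {
        ["acks"] = "0",                                        // Acks: None
        ["linger.ms"] = "100",                                 // LingerMs
        ["batch.size"] = "1000000",                            // BatchSize
        ["max.in.flight.requests.per.connection"] = "1000000", // MaxInFlight
        ["send.buffer.bytes"] = "33554432",                    // SendBuffer: 32 * 1024 * 1024 (assumed)
    };

    // Assumed mapping of the table rows to Apache Kafka consumer properties.
    public static Dictionary<string, string> Consumer() => new Dictionary<string, string>
    {
        ["receive.buffer.bytes"] = "33554432",                 // ReceiveBuffer: 32 * 1024 * 1024 (assumed)
        ["fetch.min.bytes"] = "100000",                        // FetchMinBytes
    };
}
```

librdkafka uses different names for some of these settings, so the same values have to be applied through the corresponding Confluent.Kafka™ options.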
Benchmark results
- KNet/Confluent.Kafka™ Produce Average ratio percentage (SD ratio percentage):
Using KNetProducer and KNetConsumer (-UseKNetProducer -UseKNetConsumer):
| | 100 bytes | 1,000 bytes | 10,000 bytes | 100,000 bytes |
|---|---|---|---|---|
| 100 messages | 41,50 (49,73) | 28,95 (20,24) | 52,10 (38,06) | 26,88 (40,87) |
| 1,000 messages | 40,64 (8,59) | 47,07 (7,92) | 44,16 (19,06) | 25,33 (34,59) |
| 10,000 messages | 235,86 (31,42) | 188,84 (173,27) | 54,48 (53,89) | 24,60 (48,63) |
Using KafkaProducer and KafkaConsumer (standard JNI wrappers):
| | 100 bytes | 1,000 bytes | 10,000 bytes | 100,000 bytes |
|---|---|---|---|---|
| 100 messages | 59,56 (88,09) | 46,10 (4,25) | 95,11 (42,90) | 21,65 (4,02) |
| 1,000 messages | 195,98 (29,46) | 58,79 (41,11) | 53,33 (46,21) | 45,11 (34,41) |
| 10,000 messages | 382,96 (59,35) | 204,80 (76,04) | 53,59 (21,63) | 54,11 (78,37) |
Results automatically updated by CI run #41 · commit e49c3e2 · 2026-05-25 08:17 UTC
- KNet/Confluent.Kafka™ Consume Average ratio percentage (SD ratio percentage):
Using KNetProducer and KNetConsumer (-UseKNetProducer -UseKNetConsumer):
| | 100 bytes | 1,000 bytes | 10,000 bytes | 100,000 bytes |
|---|---|---|---|---|
| 100 messages | 101,54 (1058,00) | 102,95 (250,48) | 102,39 (132,77) | 104,21 (187,48) |
| 1,000 messages | 100,90 (78,66) | 100,82 (107,78) | 108,52 (267,72) | 99,93 (35,29) |
| 10,000 messages | 158,47 (1979,49) | 169,91 (1791,57) | 149,81 (90,66) | 20,42 (16,30) |
Using KafkaProducer and KafkaConsumer (standard JNI wrappers):
| | 100 bytes | 1,000 bytes | 10,000 bytes | 100,000 bytes |
|---|---|---|---|---|
| 100 messages | 84,40 (635,50) | 5,05 (230,19) | 5,21 (138,11) | 13,97 (55,25) |
| 1,000 messages | 3,89 (114,63) | 4,32 (92,78) | 12,55 (134,25) | 52,01 (70,62) |
| 10,000 messages | 4,78 (6,83) | 11,51 (113,01) | 51,37 (7,33) | 38,98 (7,32) |
Results automatically updated by CI run #41 · commit e49c3e2 · 2026-05-25 08:17 UTC
Analysis
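KNet produce performance improves as payload size grows. The JNI call overhead is amortised over larger payloads, making KNet increasingly competitive: at 100,000 bytes every measured ratio is below 55%. With small messages (100 and 1,000 bytes) at 10,000 messages, the per-message JNI cost dominates and Confluent.Kafka™ is faster, with ratios between roughly 189% and 383%; with the standard KafkaProducer wrapper this also happens at 1,000 messages of 100 bytes. At 100 messages KNet remains faster even with small payloads.
KNet consume performance with small payloads is significantly faster when the standard KafkaConsumer wrapper is used (ratios mostly between about 4% and 12%), because the consumer receives messages that are already fully assembled in the JVM and only a lightweight reference crosses the JNI boundary. With KNetConsumer the ratios stay close to 100% at 100 and 1,000 messages and rise to roughly 150% to 170% at 10,000 messages, except for 100,000-byte payloads (20,42%). With larger payloads the picture is more mixed; see the Roundtrip Benchmark for a detailed explanation of what the consume numbers actually measure in KNet.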
Note
Results depend on the specific hardware and configuration used. With different parameters, Confluent.Kafka™ may outperform KNet in all combinations.
Roundtrip Benchmark
This benchmark measures end-to-end latency: the time from when a message is produced until it is received by the consumer, expressed in microseconds. Producer and consumer run in the same process on separate threads, using the system tick counter (DateTime.Now.Ticks) as the timing reference.
Test program
The setup follows the same design principles as the produce/consume benchmark: identical shared parameters, dedicated topics per test, simple key/value types, alternating library order across repetitions, CSV output, and aggregated statistics.
The key field carries the tick counter at produce time. The consumer subtracts that value from the current ticks on receipt to obtain the round-trip latency. The value is a pre-built byte[] payload.
Approach
- Create a topic.
- Start a consumer thread and subscribe to the topic.
- When the partition assignment callback fires, start the producer.
- The producer sends all messages, embedding the current tick count in each key, then flushes and waits for the consumer thread to finish.
- The consumer thread, on each received message, computes (DateTime.Now.Ticks - item.Key) and stores the result.
- When the expected message count is reached, the consumer unsubscribes and the thread exits.
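The conversion from a tick delta to microseconds is simple arithmetic, sketched below. A .NET tick is 100 nanoseconds, so ten ticks make one microsecond. `RoundtripLatency` and `ToMicroseconds` are hypothetical names used only for this illustration.

```csharp
using System;

public static class RoundtripLatency
{
    // One .NET tick is 100 ns, hence 10 ticks per microsecond.
    private const long TicksPerMicrosecond = 10;

    // producedTicks is the value carried in the message key;
    // receivedTicks is DateTime.Now.Ticks read by the consumer on receipt.
    public static double ToMicroseconds(long producedTicks, long receivedTicks)
        => (receivedTicks - producedTicks) / (double)TicksPerMicrosecond;
}
```

In the consumer loop this would be called as `RoundtripLatency.ToMicroseconds(item.Key, DateTime.Now.Ticks)`. The approach works because producer and consumer share the same process and clock; the effective resolution of `DateTime.Now` depends on the platform clock.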
What the latency number actually measures in KNet
This distinction is important for interpreting the results.
Without -CheckOnConsume (default): when a KNet consumer receives a message, the full record — key and value — has already been delivered to the JVM. The round-trip at the Kafka protocol level is complete. However, only item.Key (a long) is transferred across the JNI boundary to compute the latency delta. The value byte array stays in JVM heap and is never materialised in CLR. This measures Kafka network + broker latency, with minimal JNI overhead. It is the lower bound of what KNet can achieve.
With -CheckOnConsume: after computing the latency, the test calls item.Value.SequenceEqual(data), which forces the full payload to cross the JNI boundary and be compared in CLR. This adds JNI transfer cost that scales with payload size, and makes the KNet measurement directly comparable to Confluent.Kafka™, where Value is already a CLR byte[] and SequenceEqual runs almost for free.
The gap between the two variants quantifies the JNI payload transfer cost, which is the practical overhead a real KNet application pays when it accesses message content in .NET.
Configuration
| Parameter | Value |
|---|---|
| Acks | Default (reliable delivery required for accurate latency measurement) |
| LingerMs | 0 ms |
| BatchSize | 1 000 000 |
| MaxInFlight | 1 000 000 |
| SendBuffer | 32 MB |
| ReceiveBuffer | 32 MB |
| FetchMinBytes | 1 (deliver immediately without waiting to accumulate bytes) |
Benchmark results
Without -CheckOnConsume — Kafka round-trip latency
The Value payload is not transferred to CLR. KNet latency reflects the Kafka network + broker round-trip only.
- KNet/Confluent.Kafka™ Roundtrip Average ratio percentage (SD ratio percentage):
Using KNetProducer and KNetConsumer (-UseKNetProducer -UseKNetConsumer):
| | 100 bytes | 1,000 bytes | 10,000 bytes | 100,000 bytes |
|---|---|---|---|---|
| 100 messages | 7,97 (215,08) | 5,69 (75,54) | 7,39 (530,34) | 10,67 (685,85) |
| 1,000 messages | 12,69 (2139,84) | 14,17 (2473,82) | 17,47 (277,52) | 42,66 (44,95) |
| 10,000 messages | 52,97 (2501,70) | 55,89 (1037,47) | 83,56 (195,67) | 41,88 (34,70) |
Using KafkaProducer and KafkaConsumer (standard JNI wrappers):
| | 100 bytes | 1,000 bytes | 10,000 bytes | 100,000 bytes |
|---|---|---|---|---|
| 100 messages | 6,22 (642,52) | 4,69 (866,14) | 6,14 (883,13) | 12,37 (391,86) |
| 1,000 messages | 9,79 (2409,93) | 11,24 (2316,58) | 13,28 (40,24) | 45,09 (48,12) |
| 10,000 messages | 16,97 (437,10) | 19,86 (825,56) | 27,42 (28,34) | 50,63 (38,11) |
Results automatically updated by CI run #41 · commit e49c3e2 · 2026-05-25 08:17 UTC
Analysis
KNet shows significantly lower latency in this test. The result reflects the architectural difference: KNet's consumer receives the record in the JVM and the round-trip completes there, while Confluent.Kafka™ must also deserialise the payload into a CLR object before the application can read the key. The KNet number here is therefore a lower bound — it does not include the cost of making the value available in .NET.
With -CheckOnConsume — CLR data availability latency
item.Value.SequenceEqual(data) is called for each message, forcing full JNI payload transfer. This is the fair comparison with Confluent.Kafka™.
- KNet/Confluent.Kafka™ Roundtrip with CheckOnConsume Average ratio percentage (SD ratio percentage):
Using KNetProducer and KNetConsumer (-UseKNetProducer -UseKNetConsumer -CheckOnConsume):
| | 100 bytes | 1,000 bytes | 10,000 bytes | 100,000 bytes |
|---|---|---|---|---|
| 100 messages | 8,22 (701,39) | 5,88 (837,46) | 7,48 (1150,79) | 14,85 (625,87) |
| 1,000 messages | 13,48 (2297,67) | 14,06 (2283,15) | 18,29 (348,76) | 50,07 (45,56) |
| 10,000 messages | 53,16 (2798,93) | 54,77 (1382,01) | 80,91 (77,96) | 59,42 (70,81) |
Using KafkaProducer and KafkaConsumer (standard JNI wrappers -CheckOnConsume):
| | 100 bytes | 1,000 bytes | 10,000 bytes | 100,000 bytes |
|---|---|---|---|---|
| 100 messages | 6,57 (472,65) | 4,87 (1018,97) | 6,48 (1017,00) | 12,83 (790,75) |
| 1,000 messages | 10,11 (3837,54) | 10,98 (2566,76) | 15,80 (303,01) | 50,00 (54,25) |
| 10,000 messages | 21,47 (34,21) | 23,06 (111,64) | 45,46 (22,45) | 55,02 (81,04) |
Results automatically updated by CI run #41 · commit e49c3e2 · 2026-05-25 08:17 UTC
Analysis
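With -CheckOnConsume the JNI transfer cost is included. The gap relative to the previous table tends to grow with payload size: with KNetProducer and KNetConsumer at 100,000 bytes, for example, the ratio rises from 42,66 to 50,07 at 1,000 messages and from 41,88 to 59,42 at 10,000 messages, while several small-payload cells change only marginally or even decrease slightly. This difference quantifies the JNI overhead for payload materialisation, and it is the most relevant comparison for applications that actually read message content in .NET code.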
Note
Results depend on the specific hardware and configuration used. With different parameters, Confluent.Kafka™ may outperform KNet in all combinations.
Final considerations
In the produce benchmark, KNet performs best when messages are large, because the JNI overhead per message is amortised over a larger payload. With small messages sent in large numbers, Confluent.Kafka™ has the advantage because its native librdkafka implementation avoids the JNI boundary entirely.
The JNI overhead is measurable and scales with the number of JNI calls. Two architectural choices in KNet directly reduce this overhead:
KNetProducer batches and pipelines JNI calls more efficiently than the standard KafkaProducer wrapper. Switching from KafkaProducer to KNetProducer (via -UseKNetProducer) reduces the JNI call count and improves produce throughput, especially at high message rates.
Prefetch on consume offloads JVM method invocations to a background thread, allowing the main iterator to proceed while the next record's JNI calls are in flight:
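```csharp
var records = consumer.Poll(duration);
if (UsePrefetch)
{
    // WithPrefetch().WithThread() moves the JVM invocations for the
    // next record to a background thread while the current one is processed
    foreach (var item in records.WithPrefetch().WithThread())
    {
        // process item
    }
}
```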
This reduces the effective JNI latency visible to the application and is particularly effective at high throughput with larger payloads.
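The general idea behind prefetching can be illustrated without KNet. The sketch below is not KNet's implementation and uses none of its API: it only shows, under simplifying assumptions, how a background task can pull items from a slow source into a bounded buffer while the caller processes earlier items. `PrefetchSketch`, `WithBackgroundPrefetch` and the buffer capacity of 16 are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PrefetchSketch
{
    // Enumerates source on a background task and hands items to the caller in order.
    // Assumption: the caller enumerates the sequence to the end.
    public static IEnumerable<T> WithBackgroundPrefetch<T>(IEnumerable<T> source, int capacity = 16)
    {
        using var buffer = new BlockingCollection<T>(capacity);

        var producer = Task.Run(() =>
        {
            try
            {
                // The expensive per-item work (JNI calls, in KNet's case) happens here.
                foreach (var item in source) buffer.Add(item);
            }
            finally
            {
                buffer.CompleteAdding(); // lets the consuming loop terminate
            }
        });

        // The caller processes item N while the background task fetches item N+1.
        foreach (var item in buffer.GetConsumingEnumerable())
            yield return item;

        producer.GetAwaiter().GetResult(); // surface any exception thrown by the source
    }
}
```

The bounded capacity keeps the background task from running arbitrarily far ahead of the consumer, which limits memory use when records are large.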
The Garbage Collector is another factor: at high message rates the GC runs more frequently, increasing JNI overhead. The JCOBridge HPA (High Performance Application) Edition addresses this specifically by preventing premature collection of cross-boundary object references and by reducing GC pressure through buffer pooling and deep caching of generic type resolution.