Use MSG_ZEROCOPY for plaintext replication traffic #1543

murphyjacob4 · 2025-01-11T03:47:01Z

Summary

This PR integrates MSG_ZEROCOPY for plaintext replication traffic. MSG_ZEROCOPY is a Linux kernel feature that lets writes avoid copying data from user space to kernel space. Instead, the kernel pins the user-space pages and transmits from them asynchronously, without any copy. In turn, the caller has to track the ongoing writes and keep the data stable until a completion notification arrives on the socket's error queue.
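For readers unfamiliar with the kernel interface, the flow summarized above can be sketched roughly as follows. This is a minimal illustration based on the kernel's documented MSG_ZEROCOPY usage, not the code in this PR: the `zc_*` helper names are hypothetical, the fallback `#define`s only matter on older headers, and real code would loop until the error queue is drained.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif
#ifndef SO_EE_ORIGIN_ZEROCOPY
#define SO_EE_ORIGIN_ZEROCOPY 5
#endif

/* Opt the socket in to zero-copy sends. Returns 0 on success, -1 otherwise. */
int zc_enable(int fd) {
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* Queue a zero-copy send. buf must stay untouched until its completion. */
ssize_t zc_send(int fd, const void *buf, size_t len) {
    return send(fd, buf, len, MSG_ZEROCOPY);
}

/* Extract the inclusive range [lo, hi] of completed send calls from one
 * extended error. Returns 1 for a zero-copy notification, 0 otherwise. */
int zc_parse_completion(const struct sock_extended_err *serr, uint32_t *lo, uint32_t *hi) {
    assert(serr != NULL && lo != NULL && hi != NULL);
    if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) return 0;
    *lo = serr->ee_info;
    *hi = serr->ee_data;
    return 1;
}

/* Read one message from the error queue (never blocks).
 * Returns 1 if a completion range was read, 0 if none, -1 on error. */
int zc_read_completion(int fd, uint32_t *lo, uint32_t *hi) {
    union {
        char buf[256];
        struct cmsghdr align; /* forces suitable alignment for cmsg access */
    } control;
    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);
    if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) return errno == EAGAIN ? 0 : -1;
    for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        int is_recverr = (cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR) ||
                         (cm->cmsg_level == SOL_IPV6 && cm->cmsg_type == IPV6_RECVERR);
        if (!is_recverr) continue;
        struct sock_extended_err serr;
        memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
        if (zc_parse_completion(&serr, lo, hi)) return 1;
    }
    return 0;
}
```

Each `send` with MSG_ZEROCOPY is assigned the next 32-bit sequence number for that socket, and a single notification may cover a whole range of them, which is why the parser returns `lo` and `hi` rather than a single value.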

Design

We track the ongoing writes through a dynamic circular buffer, zeroCopyTracker. Each ongoing write is indexed by its sequence number, matching the one assigned by the kernel, and each entry holds a reference (refcount) into the replication backlog. We also add a new event type to the event loop APIs (AE_ERROR_QUEUE) to register for SO_EE_ORIGIN_ZEROCOPY notifications from the kernel. When we choose to use zero-copy for a write, we append a tracking entry to the circular buffer and register for the AE_ERROR_QUEUE event. That event later fires and notifies us of a batch of completions, allowing us to trim the tracker, which in turn decrements the refcounts and lets the replication backlog be trimmed.
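To make the tracker idea concrete, here is a small sketch of a growable circular buffer indexed by sequence number. It is an illustration under assumptions, not the zeroCopyTracker from this PR: the `zc*` names and fields are hypothetical, and a plain `int` counter stands in for the replication backlog block's refcount.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int *refcount; /* stand-in for the backlog block's reference count */
    int done;      /* set once the kernel reports this write complete */
} zcEntry;

typedef struct {
    zcEntry *entries;
    uint32_t cap;   /* always a power of two, so seq & (cap - 1) is a slot */
    uint32_t start; /* sequence number of the oldest in-flight write */
    uint32_t len;   /* number of in-flight writes */
} zcTracker;

void zcInit(zcTracker *t, uint32_t first_seq) {
    t->cap = 2;
    t->entries = calloc(t->cap, sizeof(zcEntry));
    assert(t->entries != NULL);
    t->start = first_seq;
    t->len = 0;
}

/* Record a new in-flight write, taking a reference on its backlog block.
 * Returns the sequence number that corresponds to this write. */
uint32_t zcAppend(zcTracker *t, int *refcount) {
    if (t->len == t->cap) { /* full: double and re-slot by sequence number */
        uint32_t ncap = t->cap * 2;
        zcEntry *n = calloc(ncap, sizeof(zcEntry));
        assert(n != NULL);
        for (uint32_t i = 0; i < t->len; i++) {
            uint32_t seq = t->start + i; /* wraps naturally at 2^32 */
            n[seq & (ncap - 1)] = t->entries[seq & (t->cap - 1)];
        }
        free(t->entries);
        t->entries = n;
        t->cap = ncap;
    }
    uint32_t seq = t->start + t->len;
    zcEntry *e = &t->entries[seq & (t->cap - 1)];
    e->refcount = refcount;
    e->done = 0;
    (*refcount)++;
    t->len++;
    return seq;
}

/* Apply a completion for the inclusive range [lo, hi], then trim the front,
 * dropping one reference per trimmed entry. */
void zcComplete(zcTracker *t, uint32_t lo, uint32_t hi) {
    for (uint32_t seq = lo;; seq++) {
        uint32_t off = seq - t->start; /* unsigned distance survives wrap */
        if (off < t->len) t->entries[seq & (t->cap - 1)].done = 1;
        if (seq == hi) break;
    }
    while (t->len > 0 && t->entries[t->start & (t->cap - 1)].done) {
        (*t->entries[t->start & (t->cap - 1)].refcount)--;
        t->start++;
        t->len--;
    }
}

void zcFree(zcTracker *t) {
    free(t->entries);
    t->entries = NULL;
    t->cap = t->len = 0;
}
```

Because the capacity is a power of two and all sequence arithmetic is done in `uint32_t`, slot lookup stays consistent when the sequence number wraps from 2^32 - 1 back to 0. Trimming only from the front means a completion for a later write never releases memory that an earlier, still-pending write may reference.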

To maintain a net improvement in performance, MSG_ZEROCOPY is only used in situations where it should perform better than our typical write syscalls:

MSG_ZEROCOPY is only used for writes that are over 10 KiB (by default, user configurable)
MSG_ZEROCOPY is only used for remote connections
MSG_ZEROCOPY is only used for plaintext connections (adding support for TLS should be feasible, but is left as an iterative improvement)
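The three criteria above amount to a simple predicate. The sketch below is illustrative only: `connInfo`, `wouldUseZeroCopy`, and `ZC_MIN_WRITE_SIZE` are hypothetical names, and treating the 10 KiB threshold as an inclusive minimum is an assumption, since the description only says "over 10 KiB".

```c
#include <assert.h>
#include <stddef.h>

#define ZC_MIN_WRITE_SIZE (10 * 1024) /* 10 KiB default from the description */

/* Hypothetical summary of the connection properties that matter here. */
typedef struct {
    int is_tls;      /* non-zero for TLS connections */
    int is_loopback; /* non-zero when the peer is on the same host */
} connInfo;

/* Returns 1 when a write of len bytes would qualify for MSG_ZEROCOPY. */
int wouldUseZeroCopy(const connInfo *c, size_t len, size_t min_size) {
    assert(c != NULL);
    if (c->is_tls) return 0;      /* plaintext only */
    if (c->is_loopback) return 0; /* remote connections only */
    return len >= min_size;       /* small writes are cheaper to copy */
}
```

The size check exists because page pinning and completion handling have a fixed per-write cost, so zero-copy only pays off once the copy it avoids is large enough.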

This PR enables MSG_ZEROCOPY as the new default behavior for writes from the replication backlog that meet the aforementioned criteria on builds that support it. It also introduces a config to disable it altogether via --tcp-tx-zerocopy no.

Although this PR only implements it for replication, the core concept should be reusable in various other places where we do writes, so long as reference counting can be used to defer cleanup until a later epoll cycle is notified of the write completion. The prospects of combining this with something like io_uring via IORING_OP_SEND_ZC are especially exciting. We should aim to continue to evolve the core concept of the zeroCopyTracker to meet these needs.

Some other notable call outs in the implementation:

The connection shutdown flow needs to be modified for connections that have ongoing zero-copy writes. We are not allowed to touch this memory until the kernel has explicitly told us it is safe to do so. Instead of the normal close operation, we enter a draining phase: we first call shutdown, then wait for the zero-copy tracker to reach zero length before calling close and freeing the client entirely.
Using zero-copy should be a net improvement for memory; however, it does change the semantics of how memory is accounted. Previously, TCP memory was not tracked at all by the Valkey process, but now we will be internalizing this TCP memory into the buffers that we hold for the kernel. Overall, this should reduce memory usage on the machine since each ongoing write for the same memory will not require a copy, but it may lead to situations where the replication backlog and Valkey process memory look larger in comparison to previous versions.
For testing purposes, we add two DEBUG commands, one to allow us to simulate a slow replica via pausing completion notification events, and another to allow us to use zero-copy over loopback.
Introduced some new stats to help track how many zero-copy writes are happening, how many are currently in flight, and how many connections are in the "draining" state. Additionally, added memory tracking info for the zeroCopyTracker.
Since the kernel only uses uint32_t sequence numbers, it is important that we handle sequence number wrap around gracefully in zeroCopyTracker. A single replication link with 4 billion writes is not unexpected on a long-running instance.
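As an illustration of the wrap-around point, sequence numbers in a 32-bit space can be compared and measured with plain unsigned arithmetic. The helper names below are hypothetical and this is not code from the PR; the comparison assumes the two values are less than 2^31 apart and relies on the usual two's complement conversion that GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

/* True if a comes before b in a 32-bit sequence space. Only meaningful when
 * the two values are less than 2^31 apart. */
int seqBefore(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) < 0;
}

/* Number of writes in the inclusive range [lo, hi], even across wrap. */
uint32_t seqRangeLen(uint32_t lo, uint32_t hi) {
    return hi - lo + 1;
}
```

A naive `a < b` would report that sequence 0 comes before 4294967295, which is wrong immediately after a wrap; the subtraction-based form treats 0 as the successor of 4294967295.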

Performance

Copying and pasting some initial performance comparisons from #1335:

Key size (B)	Primary Throughput Delta	Replica Throughput Delta
1024	+2.31%	+2.31%
4096	+4.30%	+4.30%
10240	+7.67%	+7.67%
40960	+17.67%	+17.67%
102400	+16.02%	+16.02%
409600	+16.84%	+1.57%

closes #1335

Signed-off-by: Jacob Murphy <[email protected]>


codecov · 2025-01-11T04:07:35Z

Codecov Report

Attention: Patch coverage is 91.10169% with 21 lines in your changes missing coverage. Please review.

Project coverage is 70.88%. Comparing base (e60990e) to head (89e614e).
Report is 7 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/zerocopy.c	90.83%	11 Missing ⚠️
src/ae.c	91.42%	3 Missing ⚠️
src/config.c	0.00%	3 Missing ⚠️
src/anet.c	50.00%	2 Missing ⚠️
src/networking.c	97.05%	1 Missing ⚠️
src/socket.c	94.44%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1543      +/-   ##
============================================
+ Coverage     70.77%   70.88%   +0.11%     
============================================
  Files           120      121       +1     
  Lines         65005    65274     +269     
============================================
+ Hits          46005    46268     +263     
- Misses        19000    19006       +6

Files with missing lines	Coverage Δ
src/ae_epoll.c	`85.71% <100.00%> (+0.42%)`	⬆️
src/connection.h	`88.29% <100.00%> (+0.79%)`	⬆️
src/debug.c	`52.38% <100.00%> (+0.25%)`	⬆️
src/object.c	`82.18% <100.00%> (+0.03%)`	⬆️
src/server.c	`87.60% <100.00%> (+0.01%)`	⬆️
src/server.h	`100.00% <ø> (ø)`
src/networking.c	`88.57% <97.05%> (+0.09%)`	⬆️
src/socket.c	`91.83% <94.44%> (+0.21%)`	⬆️
src/anet.c	`72.96% <50.00%> (-0.28%)`	⬇️
src/ae.c	`79.50% <91.42%> (+1.85%)`	⬆️
... and 2 more

... and 12 files with indirect coverage changes

ranshid

Just some high-level comments.
I have not yet dived deep into reviewing this, but it does sound promising.
I also wonder what this will look like once we integrate io-uring.
There would be several options:

no io-uring AND no ZERO-COPY support
io-uring support but NO ZERO-COPY support
both io-uring AND ZERO-COPY support

I think we would probably want to support all 3 modes, but we could also just invest in io-uring+ZERO copy support which could simplify the tracker code (am I correct in this observation?)

ranshid · 2025-01-15T12:57:47Z

src/ae.c

+            if (invert) {
+                event_order[0] = AE_ERROR_QUEUE;
+                event_order[1] = AE_WRITABLE;
+                event_order[2] = AE_READABLE;
+            } else {
+                event_order[0] = AE_ERROR_QUEUE;
+                event_order[1] = AE_READABLE;
+                event_order[2] = AE_WRITABLE;
            }


Suggested change

-            if (invert) {
-                event_order[0] = AE_ERROR_QUEUE;
-                event_order[1] = AE_WRITABLE;
-                event_order[2] = AE_READABLE;
-            } else {
-                event_order[0] = AE_ERROR_QUEUE;
-                event_order[1] = AE_READABLE;
-                event_order[2] = AE_WRITABLE;
-            }
+            event_order[0] = AE_ERROR_QUEUE;
+            event_order[1] = invert ? AE_WRITABLE : AE_READABLE;
+            event_order[2] = invert ? AE_READABLE : AE_WRITABLE;

ranshid · 2025-01-15T17:49:12Z

src/ae.h

+#define AE_READABLE 1 << 0    /* Fire when descriptor is readable. */
+#define AE_WRITABLE 1 << 1    /* Fire when descriptor is writable. */
+#define AE_BARRIER 1 << 2     /* With WRITABLE, never fire the event if the      \
+                                 READABLE event already fired in the same event  \
+                                 loop iteration. Useful when you want to persist \
+                                 things to disk before sending replies, and want \
+                                 to do that in a group fashion. */
+#define AE_ERROR_QUEUE 1 << 3 /* Fire when descriptor has a message on the \


Suggested change

-#define AE_READABLE 1 << 0    /* Fire when descriptor is readable. */
-#define AE_WRITABLE 1 << 1    /* Fire when descriptor is writable. */
-#define AE_BARRIER 1 << 2     /* With WRITABLE, never fire the event if the      \
-                                 READABLE event already fired in the same event  \
-                                 loop iteration. Useful when you want to persist \
-                                 things to disk before sending replies, and want \
-                                 to do that in a group fashion. */
-#define AE_ERROR_QUEUE 1 << 3 /* Fire when descriptor has a message on the \
+#define AE_READABLE (1 << 0)    /* Fire when descriptor is readable. */
+#define AE_WRITABLE (1 << 1)    /* Fire when descriptor is writable. */
+#define AE_BARRIER (1 << 2)     /* With WRITABLE, never fire the event if the      \
+                                   READABLE event already fired in the same event  \
+                                   loop iteration. Useful when you want to persist \
+                                   things to disk before sending replies, and want \
+                                   to do that in a group fashion. */
+#define AE_ERROR_QUEUE (1 << 3) /* Fire when descriptor has a message on the \
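The suggestion does not spell out its motivation, but the usual reason for parenthesizing such macros is operator precedence: any operator that binds tighter than `<<` (unary operators, `*`, `/`, `%`, `+`, `-`) will capture part of an unparenthesized expansion. A small demonstration, using hypothetical `FLAG_*` names rather than the real `AE_*` macros:

```c
#include <assert.h>

#define FLAG_BARE 1 << 2    /* hypothetical: same shape as the old macros */
#define FLAG_PAREN (1 << 2) /* same shape as the suggested macros */

/* Expands to 1 << 2 * 2, which parses as 1 << (2 * 2) == 16. */
int bareTimesTwo(void) {
    return FLAG_BARE * 2;
}

/* Expands to (1 << 2) * 2 == 8, which is what the caller meant. */
int parenTimesTwo(void) {
    return FLAG_PAREN * 2;
}
```

Common uses such as `mask & FLAG` or `a | b` happen to work either way because `<<` binds tighter than the bitwise operators, which is why the unparenthesized form can go unnoticed for a long time.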

ranshid · 2025-01-15T17:57:17Z

src/config.c

@@ -3289,6 +3303,7 @@ standardConfig static_configs[] = {
    createIntConfig("rdma-port", NULL, MODIFIABLE_CONFIG, 0, 65535, server.rdma_ctx_config.port, 0, INTEGER_CONFIG, NULL, updateRdmaPort),
    createIntConfig("rdma-rx-size", NULL, IMMUTABLE_CONFIG, 64 * 1024, 16 * 1024 * 1024, server.rdma_ctx_config.rx_size, 1024 * 1024, INTEGER_CONFIG, NULL, NULL),
    createIntConfig("rdma-completion-vector", NULL, IMMUTABLE_CONFIG, -1, 1024, server.rdma_ctx_config.completion_vector, -1, INTEGER_CONFIG, NULL, NULL),
+    createIntConfig("tcp-zerocopy-min-write-size", NULL, MODIFIABLE_CONFIG, 0, INT_MAX, server.tcp_zerocopy_min_write_size, CONFIG_DEFAULT_ZERO_COPY_MIN_WRITE_SIZE, INTEGER_CONFIG, NULL, NULL),


I think this falls under the "configurations we do not want users to mess with" category. I agree 10K is the recommended sweet spot, but I am not sure this needs to be tunable at the moment.
Consider dropping this config for now.

ranshid · 2025-01-16T07:29:15Z

src/networking.c

+            size_t data_len = o->used - c->repl_data->ref_block_pos;
+            int use_zerocopy = shouldUseZeroCopy(c->conn, data_len);
+            if (use_zerocopy) {
+                /* Lazily enable zero copy at the socket level only on first use */
+                if (!c->zero_copy_tracker) {
+                    connSetZeroCopy(c->conn, 1);
+                    c->zero_copy_tracker = createZeroCopyTracker();
+                }
+                nwritten = zeroCopyWriteToConn(c->conn, o->buf + c->repl_data->ref_block_pos, data_len);
+            } else {
+                nwritten = connWrite(c->conn, o->buf + c->repl_data->ref_block_pos, data_len);
+            }


This seems like it could be encapsulated in the connection abstraction, maybe? I wonder whether the tracker should be owned by the client or by the connection. I think it makes more sense for it to be part of the connection.

murphyjacob4 added 24 commits November 26, 2024 00:26

Initial draft of zerocopy for replication streams

855a059

Signed-off-by: Jacob Murphy <[email protected]>

Enforce minimum zero copy write size

0a33aaa

Signed-off-by: Jacob Murphy <[email protected]>

Incremental improvements to zerocopy

7199031

Signed-off-by: Jacob Murphy <[email protected]>

Add tests and support for non-graceful termination

756ab02

Signed-off-by: Jacob Murphy <[email protected]>

Merge remote-tracking branch 'origin/unstable' into zerocopy

b9a8086

Signed-off-by: Jacob Murphy <[email protected]>

Remove debug logs and fix a merge error

35562ce

Signed-off-by: Jacob Murphy <[email protected]>

Fix typos and cmake

6caf217

Signed-off-by: Jacob Murphy <[email protected]>

Allow debug command to pause error queue to make tests deterministic …

bbedd5b

…regardless of tcp memory buffers Signed-off-by: Jacob Murphy <[email protected]>

Make tests more deterministic

bec3ff2

Signed-off-by: Jacob Murphy <[email protected]>

Fix mac build

a04d132

Signed-off-by: Jacob Murphy <[email protected]>

Disable zerocopy test suite on unsupported builds

929bc75

Signed-off-by: Jacob Murphy <[email protected]>

Typo fix

f39f7d3

Signed-off-by: Jacob Murphy <[email protected]>

More test stability fixes

a47b959

Signed-off-by: Jacob Murphy <[email protected]>

Log when zerocopy tests are skipped

82f7303

Signed-off-by: Jacob Murphy <[email protected]>

Fix wait for offset sync behavior in tests

a1c3e52

Signed-off-by: Jacob Murphy <[email protected]>

Cleanup unused functions and fix parameter issue in test

469fd9b

Signed-off-by: Jacob Murphy <[email protected]>

Make event loop code more readable

3a11ce9

Signed-off-by: Jacob Murphy <[email protected]>

Remove draining force close logic

63a583e

Signed-off-by: Jacob Murphy <[email protected]>

Fix build

0d5a2c7

Signed-off-by: Jacob Murphy <[email protected]>

Restore gitignore

d81949c

Signed-off-by: Jacob Murphy <[email protected]>

Add additional resiliency to zerocopy tests

425366f

Signed-off-by: Jacob Murphy <[email protected]>

Cleanup zerocopy messsage processing

22d3258

Signed-off-by: Jacob Murphy <[email protected]>

Disable zerocopy for non-TCP connections

359728f

Signed-off-by: Jacob Murphy <[email protected]>

Apply clang-format

89e614e

Signed-off-by: Jacob Murphy <[email protected]>

ranshid reviewed Jan 16, 2025
