-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use MSG_ZEROCOPY for plaintext replication traffic #1543
base: unstable
Are you sure you want to change the base?
Conversation
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
…regardless of tcp memory buffers Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1543 +/- ##
============================================
+ Coverage 70.77% 70.88% +0.11%
============================================
Files 120 121 +1
Lines 65005 65274 +269
============================================
+ Hits 46005 46268 +263
- Misses 19000 19006 +6
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some highlevel comments.
I still did not dive deep into the review of this, but it does sound promising.
I also wonder how this will look like when we do integrate io-uring.
there would be several options to go:
- no io-uring AND no ZERO-COPY support
- io-uring support but NO ZERO-COPY support
- both io-uring AND ZERO-COPY support
I think we would probably want to support all 3 modes, but we could also just invest in io-uring+ZERO copy support which could simplify the tracker code (am I correct in this observation?)
if (invert) { | ||
event_order[0] = AE_ERROR_QUEUE; | ||
event_order[1] = AE_WRITABLE; | ||
event_order[2] = AE_READABLE; | ||
} else { | ||
event_order[0] = AE_ERROR_QUEUE; | ||
event_order[1] = AE_READABLE; | ||
event_order[2] = AE_WRITABLE; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (invert) { | |
event_order[0] = AE_ERROR_QUEUE; | |
event_order[1] = AE_WRITABLE; | |
event_order[2] = AE_READABLE; | |
} else { | |
event_order[0] = AE_ERROR_QUEUE; | |
event_order[1] = AE_READABLE; | |
event_order[2] = AE_WRITABLE; | |
} | |
event_order[0] = AE_ERROR_QUEUE; | |
event_order[1] = invert ? AE_WRITABLE : AE_READABLE; | |
event_order[0] = invert ? AE_READABLE : AE_WRITABLE; |
#define AE_READABLE 1 << 0 /* Fire when descriptor is readable. */ | ||
#define AE_WRITABLE 1 << 1 /* Fire when descriptor is writable. */ | ||
#define AE_BARRIER 1 << 2 /* With WRITABLE, never fire the event if the \ | ||
READABLE event already fired in the same event \ | ||
loop iteration. Useful when you want to persist \ | ||
things to disk before sending replies, and want \ | ||
to do that in a group fashion. */ | ||
#define AE_ERROR_QUEUE 1 << 3 /* Fire when descriptor has a message on the \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#define AE_READABLE 1 << 0 /* Fire when descriptor is readable. */ | |
#define AE_WRITABLE 1 << 1 /* Fire when descriptor is writable. */ | |
#define AE_BARRIER 1 << 2 /* With WRITABLE, never fire the event if the \ | |
READABLE event already fired in the same event \ | |
loop iteration. Useful when you want to persist \ | |
things to disk before sending replies, and want \ | |
to do that in a group fashion. */ | |
#define AE_ERROR_QUEUE 1 << 3 /* Fire when descriptor has a message on the \ | |
#define AE_READABLE (1 << 0) /* Fire when descriptor is readable. */ | |
#define AE_WRITABLE (1 << 1) /* Fire when descriptor is writable. */ | |
#define AE_BARRIER (1 << 2) /* With WRITABLE, never fire the event if the \ | |
READABLE event already fired in the same event \ | |
loop iteration. Useful when you want to persist \ | |
things to disk before sending replies, and want \ | |
to do that in a group fashion. */ | |
#define AE_ERROR_QUEUE (1 << 3) /* Fire when descriptor has a message on the \ |
@@ -3289,6 +3303,7 @@ standardConfig static_configs[] = { | |||
createIntConfig("rdma-port", NULL, MODIFIABLE_CONFIG, 0, 65535, server.rdma_ctx_config.port, 0, INTEGER_CONFIG, NULL, updateRdmaPort), | |||
createIntConfig("rdma-rx-size", NULL, IMMUTABLE_CONFIG, 64 * 1024, 16 * 1024 * 1024, server.rdma_ctx_config.rx_size, 1024 * 1024, INTEGER_CONFIG, NULL, NULL), | |||
createIntConfig("rdma-completion-vector", NULL, IMMUTABLE_CONFIG, -1, 1024, server.rdma_ctx_config.completion_vector, -1, INTEGER_CONFIG, NULL, NULL), | |||
createIntConfig("tcp-zerocopy-min-write-size", NULL, MODIFIABLE_CONFIG, 0, INT_MAX, server.tcp_zerocopy_min_write_size, CONFIG_DEFAULT_ZERO_COPY_MIN_WRITE_SIZE, INTEGER_CONFIG, NULL, NULL), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this falls under the "configurations we do not want users to mess with". agree 10K is the recommended sweet spot, but not sure if this needs any kind of tunability option ATM.
Consider dropping this config for now
size_t data_len = o->used - c->repl_data->ref_block_pos; | ||
int use_zerocopy = shouldUseZeroCopy(c->conn, data_len); | ||
if (use_zerocopy) { | ||
/* Lazily enable zero copy at the socket level only on first use */ | ||
if (!c->zero_copy_tracker) { | ||
connSetZeroCopy(c->conn, 1); | ||
c->zero_copy_tracker = createZeroCopyTracker(); | ||
} | ||
nwritten = zeroCopyWriteToConn(c->conn, o->buf + c->repl_data->ref_block_pos, data_len); | ||
} else { | ||
nwritten = connWrite(c->conn, o->buf + c->repl_data->ref_block_pos, data_len); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it could be encapsulated in the connection abstraction maybe? I wonder if the tracker should be a client or a connection owned? I think it makes more sense that it will be part of the connection.
Summary
This PR integrates with
MSG_ZEROCOPY
for plaintext replication traffic. MSG_ZEROCOPY is a Linux kernel functionality that allows writes to avoid copying data from user space to kernel space. Instead,MSG_ZEROCOPY
allows the kernel to pin the user space data into kernel space for asynchronous writing without any copying. In turn, users have to track the ongoing writes and ensure the data is kept stable, listening on the socket's message queue for a completion notification.Design
We track the ongoing writes through a dynamic circular buffer,
zeroCopyTracker
. Each ongoing write is indexed by its sequence number, matching the one assigned by the kernel, with each entry holding a refcount into the replication backlog. We also add a new event type to the event loop APIs (AE_ERROR_QUEUE
) to register forSO_EE_ORIGIN_ZEROCOPY
notifications from the kernel. When we choose to use zero-copy for writes, we append a tracking entry to the circular buffer and register for the AE_ERROR_QUEUE event, which later fires and notifies us of a batch of completions, allowing us to trim the tracker which in turn decrements the refcount and trims the replication backlog.To maintain a net improvement on performance,
MSG_ZEROCOPY
is only used in situations where it should perform better than our typicalwrite
syscalls:MSG_ZEROCOPY
is only used for writes that are over 10 KiB (by default, user configurable)MSG_ZEROCOPY
is only used for remote connectionsMSG_ZEROCOPY
is only used for plaintext connectionsThis PR enables
MSG_ZEROCOPY
as the new default behavior for writes from the replication backlog that meet the aforementioned criteria on builds that support it. It also introduces a config to disable it altogether via--tcp-tx-zerocopy no
.Although this PR only implements it for replication, the core concept should be reusable at various other places where we do writes, so long as reference counting can be used to defer cleanup until a later
epoll
cycle is notified of the write completion. The prospects of combining this with something like IO uring via IORING_OP_SENDZC are especially exciting. We should aim to continue to evolve the core concept of thezeroCopyTracker
to meet these needs.Some other notable call outs in the implementation:
close
operation, we instead go into a draining phase where we first callshutdown
, then await the zero copy tracker to reach zero length before later callingclose
and freeing the client in entirety.zeroCopyTracker
.zeroCopyTracker
. A single replication link with 4 billion writes is not unexpected on a long-running instance.Performance
Copying and pasting some initial performance comparisons from #1335:
closes #1335