# HDDS-11898. design doc leader side execution #7583
## Conversation
Thanks @sumitagrawl for the docs.
Please combine the files into a single markdown file with headers (title, author, status, etc.) and a license header (please see other design docs for examples).
This will help readers know where to start, and it is also needed for display on the website: https://ozone.apache.org/docs/edge/design.html
@adoroszlai Please recheck; the following are now kept as separate documents:

The above are kept separate as they are independent features within leader execution, and their designs will evolve independently.
> - The current implementation depends on consensus on the order of requests received and not on consensus on the processing of the requests.
> - The double buffer implementation currently is meant to optimize the rate at which writes get flushed to RocksDB but the effective batching achieved is 1.2 at best. It is also a source of continuous bugs and added complexity for new features.
> - The number of transactions that can be pushed through Ratis currently caps out around 25k.
> - The Current performance envelope for OM is around 12k transactions per second. The early testing pushes this to 40k transactions per second.
Is my understanding here correct?
```diff
- - The Current performance envelope for OM is around 12k transactions per second. The early testing pushes this to 40k transactions per second.
+ - The Current performance envelope for OM is around 12k transactions per second. The early testing of this feature pushes this to 40k transactions per second.
```
> 3. Cache Optimization: Cache are maintained for write operation and read also make use of same for consistency. This creates complexity for read to provide accurate result with parallel operation.
> 4. Double buffer code complexity: Double buffer provides batching for db update. This is done with ratis state machine and induces issues managing ratis state machine, cache and db updates.
The current phrasing does not make it clear that these are things this feature aims to remove. The other items listed are things it is going to add or improve.
> ### Batching (Ratis request)
> All request as executed parallel are batched and send as single request to other nodes. This helps improve performance over network with batching.
>
> ### Apply Transaction (via ratis at all nodes)
There's another step after this that needs to be specified: we don't return success to the client until the apply transaction of their request has completed on the leader.
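For illustration, a minimal sketch of how the leader could gate the client reply on local apply (all names here are hypothetical, not from the doc):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: the leader tracks a future per transaction index and
// replies to the client only after applyTransaction has completed locally.
public final class ApplyTracker {
  private final ConcurrentMap<Long, CompletableFuture<Void>> pending =
      new ConcurrentHashMap<>();

  /** Called when the request is submitted to Ratis on the leader. */
  public CompletableFuture<Void> register(long txIndex) {
    return pending.computeIfAbsent(txIndex, i -> new CompletableFuture<>());
  }

  /** Called from the state machine once applyTransaction(txIndex) finishes. */
  public void markApplied(long txIndex) {
    CompletableFuture<Void> f = pending.remove(txIndex);
    if (f != null) {
      f.complete(null); // unblocks the handler thread, which then replies
    }
  }
}
```

The request handler would then wait on `register(txIndex)` before returning success to the client.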
> Index Preserved in TransactionInfo Table with new KEY: "#KEYINDEX"
> Format: `<timestamp>#<index>`
> Time stamp: This will be used to identify last saved transaction executed
> Index: index identifier of the request
Please check the rendered version of this section; I don't think it is being displayed as intended.
> - Upgrade: Last Ratis index + 1
>
> #### Index Persistence:
Please add a lot more details to this section; it doesn't really explain how this will work. I assume there is going to be some sort of atomic long incremented in memory. The control request section also does not add much information to explain this.
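For illustration, a minimal sketch under that assumption (an AtomicLong seeded from the persisted index; class and method names are invented):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of an OM-managed index: an in-memory counter seeded
// from the last persisted index on startup, and written back to the
// TransactionInfo table as part of each transaction's DB batch.
public final class OmIndexGenerator {
  private final AtomicLong index;

  /** Seed from the last index stored in the TransactionInfo table. */
  public OmIndexGenerator(long lastPersistedIndex) {
    this.index = new AtomicLong(lastPersistedIndex);
  }

  /** Each request gets the next index; persisting it in the same DB batch
   *  as the request's updates keeps the counter crash-consistent. */
  public long next() {
    return index.incrementAndGet();
  }
}
```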
> 1. for increment changes, need remove dependency with ratis index. For this, need to use om managed index in both old and new flow.
> 2. objectId generation: need follow old logic of index to objectId mapping.
These steps aren't clear to me. This section also needs to cover update ID handling.
> ### No-Cache for write operation
>
> In old flow, a key creation / updation is added to PartialTableCache, and cleanup happens when DoubleBuffer flushes DB changes.
```diff
- In old flow, a key creation / updation is added to PartialTableCache, and cleanup happens when DoubleBuffer flushes DB changes.
+ In old flow, a key creation / update is added to PartialTableCache, and cleanup happens when DoubleBuffer flushes DB changes.
```
> - lock: granular level locking
> - unlock: unlock locked keys
What happens while we are holding the lock? Shouldn't this be where processing is happening? This seems like a duplicate of the information in the "Leader Execution" section, but both sections are missing steps. For example, submitting to Ratis is not mentioned here anywhere.
> - [Create key](request/obs-create-key.md)
> - [Commit key](request/obs-commit-key.md)
>
> ### Execution persist and distribution
I think this whole section needs to be redesigned. In theory, Ratis + RocksDB should be able to exist in its own module as a replicated DB with no dependencies on anything Ozone specific. We will need this eventually to bring the same code flow to SCM (for rolling upgrade) and Recon (for non-voting follower) without rewriting these critical pieces that deal with replication and persistence. Actually moving the code to separate modules may be outside the scope of this feature, but we need to define the API surface such that it is possible to avoid having to rewrite/refactor what is soon to be already new code. For this example I will refer to the replicated DB as its own module, even if V1 of the code does not structure it this way for migration purposes. It is the API surface used by each request that is more important to lock down now.
Input to this module should be of the form of protos that define the DB updates to perform. The actual values written to the DB should already have been serialized to bytes by this point and they should not be deserialized at any point later in the flow (with the exception of merges). This means the module has no knowledge of client ID, quota info, etc.
We would have one proto message defining each operation supported by the DB. The module takes one `Batch`, which contains these operations and will be treated as one Ratis request:

```proto
message Put {
  optional bytes columnFamily = 1;
  optional bytes key = 2;
  optional bytes value = 3;
}

message Delete {
  optional bytes columnFamily = 1;
  optional bytes key = 2;
}

message Merge {
  optional bytes columnFamily = 1;
  optional bytes key = 2;
  optional bytes value = 3;
}

message Checkpoint {
  // Path to place the checkpoint
  optional string destination = 1;
}

// Only one field should be present to define the operation to do.
// The module can validate this input.
message Operation {
  optional Put put = 1;
  optional Delete delete = 2;
  optional Merge merge = 3;
  optional Checkpoint checkpoint = 4;
}

// Each OM request would result in one list of ordered operations submitted to the module.
// The module can internally combine these lists into one Batch proto that gets submitted to Ratis.
// The update to the transaction ID table needs to be handled within the module for each batch applied.
message Batch {
  repeated Operation operations = 1;
}
```
Now to translate each proto to a DB update in Ratis' `applyTransaction`:

- `Put` and `Delete` simply map to existing RocksDB put and delete key ops. Note that RocksDB does not have a move operation.
- `Checkpoint` creates a RocksDB checkpoint and will be used by snapshots.
- `Merge` will be used to implement any increments required, like quota, using the RocksDB associative merge operator. Initializers of the module will pass in a mapping of column families to their corresponding merge operators if required.
  - For example, the OM would initialize the module with a `BucketInfoMergeOperator` on the `BucketTable`, a `VolumeInfoMergeOperator` on the `VolumeTable`, etc.
Then the API surface between OM or any other service and the replicated DB module is just a list of column families to open, with some optionally mapped to merge operator callbacks provided on construction, and calls to submit new `Operation` lists to the module.
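A hypothetical sketch of that API surface in Java (interface and method names are illustrative, not from the doc; `Operation` refers to the proto-generated class above):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical API surface between OM (or SCM / Recon later) and the
 * replicated DB module. The module owns Ratis + RocksDB internally; callers
 * only declare column families and submit ordered operation lists.
 */
public interface ReplicatedDb {

  /** Associative merge callback, e.g. applying a quota delta to BucketInfo bytes. */
  interface MergeOperator {
    byte[] merge(byte[] key, byte[] existingValue, byte[] delta);
  }

  /**
   * Open the DB with the given column families; some may be mapped to merge
   * operators (e.g. "bucketTable" -> a BucketInfoMergeOperator).
   */
  void open(List<String> columnFamilies,
            Map<String, MergeOperator> mergeOperators);

  /**
   * Submit one request's ordered Operation list. The module batches lists
   * from concurrent requests into a single Ratis entry and completes the
   * future once the batch has been applied.
   */
  CompletableFuture<Void> submit(List<Operation> operations);
}
```

Keeping the surface this narrow is what would let the same module back SCM and Recon later without touching the replication and persistence code.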
Thanks for the patch. Left some initial comments.
Regarding the request flow, could you add a more detailed sequence diagram, similar to https://issues.apache.org/jira/browse/HDDS-1595, so that it's easier to visualize the new flow?
> Here is the summary of the challenges:
>
> - The current implementation depends on consensus on the order of requests received and not on consensus on the processing of the requests.
> - The double buffer implementation currently is meant to optimize the rate at which writes get flushed to RocksDB but the effective batching achieved is 1.2 at best. It is also a source of continuous bugs and added complexity for new features.
Could you clarify what "effective batching" entails? Does it mean 1.2 OM requests per batch?
Yes, this should be clarified in the doc. I think this was meant to say "the effect of batching on performance is a 1.2x speedup at best": in the best case the double buffer adds only a 20% speedup, while prototypes of the new design show far greater improvements.
> - The current implementation depends on consensus on the order of requests received and not on consensus on the processing of the requests.
> - The double buffer implementation currently is meant to optimize the rate at which writes get flushed to RocksDB but the effective batching achieved is 1.2 at best. It is also a source of continuous bugs and added complexity for new features.
> - The number of transactions that can be pushed through Ratis currently caps out around 25k.
So this is the theoretical bottleneck of Ratis itself?
> | 3 | CPU Utilization Leader | 16% (unable to increase load) | 33% |
> | 4 | CPU Utilization Follower | 6% above | 4% below |
>
> Refer [performance prototype result](performance-prototype-result.pdf)
Are these performance results referring to the prototype in #7406?
> - On restart (leader): last preserved index + 1
> - On Switch over: last index + 1
> - Request execution: index + 1
> - Upgrade: Last Ratis index + 1
So for an existing cluster, the subsequent object IDs will be based on the last applied Ratis index?
> ### Batching (Ratis request)
> All request as executed parallel are batched and send as single request to other nodes. This helps improve performance over network with batching.
When does the OM decide whether a batch will be sent to Ratis? Is it decided based on time / size of the batch?
Also, some suggestions:
```diff
  ### Batching (Ratis request)
- All request as executed parallel are batched and send as single request to other nodes. This helps improve performance over network with batching.
+ All requests executed in parallel are batched and send as single request to other nodes. This helps improve performance over network with batching.
```
> --> Else continue request handling
>
> #### Client request replay at leader node
> - When request is received at leader node, it will cache the request in replayCache immediately
I think it's better to use `retryCache` instead of `replayCache` to standardize the terminology with Ratis.
Personally, the "replay" terminology seems more related to replaying unapplied Ratis transactions in the previous OM design documentation.
> #### Replay cache distribution to other nodes
> Request - response will be cached to other node via ratis distribution
> - It will be added to memory cache with expiry handling
> - Also will be added to DB for persistence for restart handing
>
> Below information will be sync to all nodes via ratis:
> ```
> message ClientRequestInfo {
>   optional string uuidClientId = 1;
>   optional uint64 callId = 2;
>   optional unint64 timestamp = 5;
>   optional OMResponse response = 3;
> }
> ```
I'm not sure about this. Previously, each OM / Ratis request corresponded to a single OM / Ratis response, because Ratis only replies to a pending request after the log is applied at the leader.
How would the client reply and retry cache mechanisms work now that each Ratis request contains multiple DB updates from multiple OM requests? During log apply, does the state machine need to reply to multiple clients at the same time? Am I missing something?
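For illustration, one way this could work (purely a hypothetical sketch; class and method names are invented, not from the doc): the leader keeps a pending future per clientId#callId pair, and applying a batched Ratis entry completes every future belonging to that batch.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one Ratis entry carries the DB updates of many OM
// requests, so applying it must unblock all of the corresponding clients.
public final class BatchReplyTracker {
  /** Identity and serialized response of one request inside an applied batch. */
  public static final class AppliedRequest {
    final String clientId;
    final long callId;
    final byte[] serializedResponse;

    AppliedRequest(String clientId, long callId, byte[] serializedResponse) {
      this.clientId = clientId;
      this.callId = callId;
      this.serializedResponse = serializedResponse;
    }
  }

  private final Map<String, CompletableFuture<byte[]>> pending =
      new ConcurrentHashMap<>();

  private static String key(String clientId, long callId) {
    return clientId + "#" + callId;
  }

  /** The handler thread parks here after its request joins a batch. */
  public CompletableFuture<byte[]> await(String clientId, long callId) {
    return pending.computeIfAbsent(key(clientId, callId),
        k -> new CompletableFuture<>());
  }

  /** Called once per applied Ratis entry: reply to every request it contained. */
  public void onBatchApplied(List<AppliedRequest> batch) {
    for (AppliedRequest r : batch) {
      CompletableFuture<byte[]> f = pending.remove(key(r.clientId, r.callId));
      if (f != null) {
        f.complete(r.serializedResponse);
      }
    }
  }
}
```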
> #### Memory caching:
> ```
> Memory Map: ClientId#CallId Vs Response
> Expiry: 10 minute (as current default for ratis)
> ```
Is this expiry done independently for each OM node? It won't be replicated from the leader?
> 1. With Leader side execution, metrics and its capturing information can change.
>    - Certain metrics may not be valid
>    - New metrics needs to be added
>    - Metrics will be updated at leader side now like for key create. At follower node, its just db update, so value will not be udpated.
FYI, also note that write audit logs will only be generated on the leader, instead of on both the leader and followers.
> - If we shard OM, then across OMs the object ID will not be unique.
> - When batching multiple requests, we cannot utilize Ratis metadata to generate object IDs.
>
> Longer term, we should move to a UUID based object ID generation. This will allow us to generate object IDs that are globally unique. In the mean time, we are moving to a persistent counter based object ID generation. The counter is persisted during apply transaction and is incremented for each new object created.
How about update ID? In the future if we decide to shard the OMs, we probably need some kind of sequence generator to generate a monotonically increasing ID for each update.
## What changes were proposed in this pull request?
Design doc for leader side execution
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11898
## How was this patch tested?
NA