Verify values in secondary database against expected state #13281
@archang19 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@archang19 has updated the pull request. You must reimport the pull request before landing.
@@ -412,9 +412,8 @@ class StressTest {
  std::atomic<bool> db_preload_finished_;
  std::shared_ptr<SstQueryFilterConfigsManager::Factory> sqfc_factory_;

  // Fields used for continuous verification from another thread
I guess the original authors intended for cmp_db_ to potentially be used for other purposes, but right now the only usages are for opening secondary databases, so I think we can improve the naming.
s = secondary_db_->Get(options, column_families_[cf], key,
                       &from_db);

assert(!pre_read_expected_values.empty() &&
This was to get our internal code linter to stop complaining about the vector index access
@@ -2810,6 +2863,84 @@ class NonBatchedOpsStressTest : public StressTest {
    return true;
  }

  // Compared to VerifyOrSyncValue, VerifyValueRange takes in a
I thought about adding this functionality to VerifyOrSyncValue, but that would make the method signature and the implementation even more complicated.
…FLAGS_continuous_verification_interval
Summary
TLDR: This PR enables secondary DB verification inside the "simple" crash tests (`NonBatchedOpsStressTest`). Essentially, we want to be able to verify that the secondary is a valid "prefix" of the primary. This PR allows us to do this by piggybacking on the existing verification of the primary through `Get()` requests.
I originally proposed replaying the trace file to recreate the `ExpectedState` as of a specific sequence number. This could be used to run verifications against the secondary database. I did some experimenting in #13266 and got a "mostly working" implementation of this approach. I could sometimes get through entire key space verifications, but eventually one of the keys would fail verification. I have not figured out the root cause yet, but I assume that something broke the alignment between sequence numbers and trace records.
The approach in this PR is considerably simpler. We can just check that the secondary database's value is in the correct "range," which we already have functionality for checking. Compared to the approach in #13266, this approach is much simpler, since we do not have to go through the whole headache of replaying the trace and creating an entirely new `ExpectedState`. (Look at #13266 to see how much of a mess that creates.) I think this approach is better than my original approach in almost all aspects: it's faster, uses less space, and has less room for implementation errors.
Other nice aspects of this approach:
… `TryCatchUpWithPrimary` while we are trying to perform a `Get()`)
The main drawback, of course, is that we verify against a range of expected values rather than one particular expected value. However, I think this is acceptable and "good enough," especially given all of the other aforementioned benefits.
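To make the "range" idea concrete, here is a minimal sketch of what checking a secondary's value against a range of expected values could look like. It is not the PR's `VerifyValueRange` implementation: `ExpectedRange` and `CheckSecondaryValueBase` are hypothetical names, each key is modeled as a single monotonically increasing value base, and the sketch assumes the secondary has caught up at least to the pre-read state. Wraparound and deletions are ignored.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the expected state of one key: the value base
// recorded just before and just after the read was issued.
struct ExpectedRange {
  uint32_t pre_read_base;
  uint32_t post_read_base;
};

// Returns std::nullopt when the value base read from the secondary lies in
// [pre_read_base, post_read_base]; otherwise returns an error message.
// Simplification: ignores value base wraparound and deleted keys.
std::optional<std::string> CheckSecondaryValueBase(const ExpectedRange& range,
                                                   uint32_t secondary_base) {
  if (secondary_base < range.pre_read_base) {
    return "Secondary: value base " + std::to_string(secondary_base) +
           " is older than the pre-read expected base " +
           std::to_string(range.pre_read_base);
  }
  if (secondary_base > range.post_read_base) {
    return "Secondary: value base " + std::to_string(secondary_base) +
           " is newer than the post-read expected base " +
           std::to_string(range.post_read_base);
  }
  return std::nullopt;
}
```

The error strings start with "Secondary" to mirror the prefix convention described in the summary. The trade-off is visible in the signature: any value inside the range passes, so a stale but in-range value is not flagged.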
Historical context: There is some very old code that attempted to verify secondaries, but it is not enabled. This code has not been touched or executed in an extremely long time, and the crash tests started failing when I tried enabling it, most likely because the code is not compatible with certain other crash test options. This code is for the "continuous verification" and involves long iterator scans over the secondary database. Some of the code involved the cross CF consistency test type. I don't think the old checks are what we really want for our purposes of verifying the secondary functionality. Since I don't think we will get much value out of this old "continuous verification" code, I integrated my secondary verification with the "regular" database verification. This also makes the rollout simpler on my end, since I can control whether my secondary verifications are enabled through one `test_secondary` configuration. To make sure the old code does not execute for our recurring crash test runs, I had to enforce that `continuous_verification_interval` is 0 whenever `test_secondary` is set.
Monitoring: I will want to monitor the Sandcastle "simple" runs for failures where `test_secondary` is set. All of my error messages are prefixed with "Secondary", so it should be easy to tell if this PR causes any crash test issues.
Future work:
… `TryCatchUpWithPrimary`), then we know there is a problem
… `Get()` for the secondary (i.e. iterators). I think the focus here should be testing replication-specific logic, and since we will already have separate unit tests, we do not need to repeat all of the tests against both the primary and the secondary.
Test Plan
The primary crash test commands I ran were:
As a sanity check, I added an `assert(false)` right after my secondary verification code to make sure that my code was actually being run.
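The option constraint mentioned in the summary (`continuous_verification_interval` must be 0 whenever `test_secondary` is set) can be sketched as a small sanitization step. This is only an illustration: the summary does not say where the constraint is enforced, and `CrashTestOptions`, `SanitizeOptions`, the field types, and the default interval below are all assumptions.

```cpp
#include <cstdint>

// Hypothetical subset of the crash test options named in the summary.
// The default interval is an arbitrary placeholder, not a real default.
struct CrashTestOptions {
  bool test_secondary = false;
  uint64_t continuous_verification_interval = 1000;
};

// Turns off the legacy continuous verification whenever secondary testing
// is enabled, so only the new secondary checks run in that configuration.
CrashTestOptions SanitizeOptions(CrashTestOptions opts) {
  if (opts.test_secondary) {
    opts.continuous_verification_interval = 0;
  }
  return opts;
}
```

Taking and returning the options by value keeps the caller's original configuration untouched, which makes the before/after state easy to compare when debugging a run.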