[Epic] Improve the Data Protection experience for apps with multiple instances #53654
Comments
I'm really heartened to see this epic, because in my work on IdentityServer, data protection is by far the most common support topic. The problems that I see are all to do with this:
My users are on the golden path on their local dev machines, but not when they deploy! And IdentityServer uses the data protection services itself to protect sensitive data at rest, some of which can be very long lived (e.g., signing keys, refresh token data, etc), so data protection issues are especially pronounced in that context. I'm really hopeful to see what all comes of this. |
After staring at the code for a while, this seems like a key source of these problems: #21096
Data Protection has a number of catch-and-swallow blocks, but this seems to be the only one where swallowing the exception could cause more than negligible harm (the others lose, at worst, something like a trace-level log message). Interestingly, a Warning is logged when this happens (which is how I found that issue), so it would be useful to know whether people are seeing this in prod.
[LoggerMessage(12, LogLevel.Warning, "Key {KeyId:B} is ineligible to be the default key because its {MethodName} method failed.", EventName = "KeyIsIneligibleToBeTheDefaultKeyBecauseItsMethodFailed")]
public static partial void KeyIsIneligibleToBeTheDefaultKeyBecauseItsMethodFailed(this ILogger logger, Guid keyId, string methodName, Exception exception); |
Assuming the above is correct, we probably need:
As noted in #21096, the existing recovery mechanism (swallow and eventually generate a new key) is probably helpful in apps with session affinity (not recommended) or a single instance, so we need to try to avoid making things worse for those apps. |
* Move computation of ShouldGenerateNewKey to KeyRingProvider

It used to be the case that part of IDefaultKeyResolver.ResolveDefaultKeyPolicy's job was to determine whether the current default key was close enough to expiration that a new one ought to be generated. This didn't make sense as the definition of "too close" depended upon the refresh period and propagation time of the ICacheableKeyRingProvider. That is, the IDefaultKeyResolver had to make assumptions about how often it would be polled for changes.

The old logic was also very subtle and, as far as I was able to determine, slightly incorrect. Formerly, the presence of any key activated prior to the current default key's expiration date and not expiring during the next propagation cycle was considered an acceptable replacement. Several things seem strange about this:

1. The logic for finding a successor key is not the same as the logic for finding a preferred key (e.g. CanCreateAuthenticatedEncryptor is not checked)
2. The propagation window is counted forward from the current time, rather than backward from the expiration time
3. It's not immediately clear what happens if the successor key is unexpired at the end of the propagation window but expired before the default key's expiration time (maybe that's impossible or maybe that would be caught next refresh?)
4. As mentioned above, it doesn't seem like the resolver should know about the refresh period or make assumptions about how often it's called

Now, the ICacheableKeyRingProvider is responsible for determining whether the returned default key is close enough to expiration that a new key should be generated. It checks whether the current time is within one propagation cycle of the expiration time, padding by an extra refresh period to account for the fact that we don't know where in the refresh cycle expiration will fall (i.e. so that we never generate a new key _less_ than a full propagation cycle ahead of when it's needed).

Part of #53654

* Don't repeat the second resolution after key generation
* Update comment
* Add explanatory comment
* Make comment more explicit
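The timing check described in the commit message above can be sketched roughly as follows. This is a minimal illustration of the stated rule (one propagation window before expiration, padded by one refresh period), not the actual KeyRingProvider code. The helper name, its parameters, and the sample values are all hypothetical.

```csharp
using System;

// Illustrative values only; a real app takes these from its Data Protection configuration.
var window = TimeSpan.FromDays(2);      // assumed key propagation window
var refresh = TimeSpan.FromHours(24);   // assumed key ring refresh period
var expiresAt = new DateTimeOffset(2024, 6, 30, 0, 0, 0, TimeSpan.Zero);

// Four days before expiration: outside the three-day lead time, so no new key yet.
Console.WriteLine(ShouldGenerateNewKey(expiresAt.AddDays(-4), expiresAt, window, refresh));   // False

// Two and a half days before expiration: inside the lead time, so generate now.
Console.WriteLine(ShouldGenerateNewKey(expiresAt.AddDays(-2.5), expiresAt, window, refresh)); // True

// Hypothetical helper: returns true once "now" is within one propagation window
// of the default key's expiration, padded by one extra refresh period.
static bool ShouldGenerateNewKey(
    DateTimeOffset now,
    DateTimeOffset defaultKeyExpiration,
    TimeSpan propagationWindow,
    TimeSpan refreshPeriod)
{
    // The padding covers not knowing where in the refresh cycle expiration falls.
    var leadTime = propagationWindow + refreshPeriod;
    return now >= defaultKeyExpiration - leadTime;
}
```

Counting backward from the expiration time, rather than forward from the current time, means the answer does not depend on how often the resolver happens to be polled.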
Telemetry: #54451 |
Further to josephdecock's point I do not like the implications of point This seems like something that could be resolved through distributed locking either in the If the external component is the way forward I'd likely implement this as a background timer task that runs on each node and handles the locking. Distributed locking is so useful and common (and with many good backing options via Redis, Dynamo, K8s, etc.) that it would almost make sense to introduce the concept into ASP.NET Core. But I digress 😄 |
The default application discriminator is the content root path, which can vary between a local instance and a deployed instance (e.g. to App Service). It would be nice if we could do something about this foot-gun. |
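One way to sidestep the content-root default today is to set the application name explicitly, so every environment uses the same discriminator. A minimal sketch, assuming a standard ASP.NET Core web project; the name "my-shared-app" is a hypothetical placeholder:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Hypothetical name: use the same value wherever the app runs so that
// payloads protected in one environment or instance can be unprotected in another.
builder.Services.AddDataProtection()
    .SetApplicationName("my-shared-app");

var app = builder.Build();
app.Run();
```

This only addresses the discriminator mismatch; instances still need access to the same key ring.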
@ProTip I'm not sure I fully understand your comment, so please let me know if I've missed your point. I think you're saying that we could/should resolve key-write races using some sort of coordination mechanism. While introducing a new component does add complexity, I think it has some advantages over keeping the logic in the app instances:
Note that there are no plans to eliminate the existing mechanisms, so you can continue to have each node do its own key generation and you're free to implement your own |
Heya, yes I think you grok what I'm saying fully. I think it's a great option to manually handle key generation and rotation however:
Can't this be done anyway based on implementer requirements? I think a short delay when firing up a brand new application with an empty keyring isn't a concern for the vast majority.
But it won't work as expected OOTB..
I'm not as familiar with the interfaces or issues here. However, this coordination logic may be able to be moved into the core DP code that reaches out to the Ultimately it feels a bit yucky that DP, which is supposed to just transparently work, would be the only(?) foundational ASP.NET Core component that introduces a requirement on an external service to function as expected (I'm tempted to say correctly) in a multi-node environment. |
#53539 and possibly #53860 are the only changes I'm expecting to aspnetcore to support this new pattern. Without additional configuration, I wouldn't expect either to have any effect.
It's not just the delay. Instances don't know how many other instances there are, so it's hard for them to ensure they're using a key available to all other instances (i.e. so that sessions can migrate between them).
Tell me more about what it means for things to work out of the box? It sounds like you don't mean "without user configuration", because the app developer would need to establish a shared storage location and point all the app instances at it.
I'm not sure what design you have in mind, but it sounds like it would require instances to either know about each other or know about a shared coordination component. Is that right?
None of this is expected to be required. At present, it is expected to be sample code demonstrating a way app developers can set up their application that we think will make them more reliable. There may also be a nuget package or something, but that's TBD. Obviously, we're a long way from actually shipping and things could change. |
@lukasz-zoglowek Sorry, you'll have to refresh my memory - a link to the conversation would help. Having said that, #54299 should have addressed the fact that those timestamps can be out of order (which was harmless) and, yes, the goal of this epic is to eliminate most (but not all) of those cases. Not creating a new key when a remote store is not accessible is not viable as a general solution because it can leave the app without a key to use, but #54711 added some retry support and #54490 made many potentially-flaky calls unnecessary. My expectation is that the combination of those changes will substantially mitigate the issue. I believe all of that work made it into Preview 4 and your feedback would be greatly appreciated. |
Reliability should be substantially better in 9.0 Preview 4. If you continue to see key-not-found errors in Preview 4, please tag me. |
@amcasey Is this epic done? It feels like the key deletion one isn't necessary to gate this on? |
AES encryption of the DP key would be a very welcome feature. Currently, the only way to encrypt the key when persisted via EF or Redis is to use a certificate. Why can't I just use AES encryption to encrypt and decrypt the key? |
@md-redwan-hossain Data Protection is remarkably pluggable - I'd be surprised to learn you can't do that by providing your own encryptor. Regardless, this issue probably isn't the best place to get eyes on the question - can you please file a new one? |
Key deletion is in and the docs PR is looking good. |
The golden path for Data Protection is having a single app instance, ideally on Windows. In that case, pulling anti-forgery, Razor Pages, etc. into the app will automatically pull in Data Protection with its default settings. Keys will be stored locally in an XML file on disk and protected with (e.g.) Windows DPAPI. Along the golden path, app developers don't have to think about, let alone configure, Data Protection.
Things get more complicated when there are multiple app instances. For scalability and resilience, it's desirable for any app instance to be able to handle requests from any user session, so each app instance needs to be able to use keys generated by any other instance. Unfortunately, a mechanism for sharing keys is not something Data Protection can figure out on its own, so explicit configuration is required. Still more unfortunately, common approaches like storing the keys in Azure occasionally encounter races and communication failures that result in key-not-found errors.
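As an illustration of the explicit configuration mentioned above, the sketch below points every instance at one shared key location. It assumes a directory that all instances can read and write (for example, a mounted network share); the path is hypothetical, and other storage providers follow the same pattern.

```csharp
using System.IO;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Assumption: "/mnt/shared/dp-keys" is a hypothetical path reachable by every app instance.
// Each instance reads keys from, and writes newly generated keys to, this one location.
builder.Services.AddDataProtection()
    .PersistKeysToFileSystem(new DirectoryInfo("/mnt/shared/dp-keys"));

var app = builder.Build();
app.Run();
```

Sharing a location makes keys visible to all instances, but it does not by itself prevent the races described above, where two instances generate a key at about the same time.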
What are we going to do about it? In 9.0, we're going to focus on: