Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New legacy id generator #4875

Merged
merged 2 commits into from
Oct 24, 2024
Merged

New legacy id generator #4875

merged 2 commits into from
Oct 24, 2024

Conversation

fmarco76
Copy link
Member

Current implementation of legacy id generator using serial number has the problem to leave gaps when new ranges are generated because some conversion problem from decimal to hexadecimal range value. The gap could create problem to third party tool/service if they expected all no gaps are present in the serials.

A new id generator is introduced to solve the problem. To use the new id generator the following value should be used in CS.cfg:

dbs.cert.id.generator=newLegacy
dbs.request.id.generator=newLegacy

This can be be specified in pkispawn with the parameters:

pki_request_id_generator=newLegacy
pki_cert_id_generator=newLegacy

The remaining configuration are similar to the legacy generator. The main difference is that hexadecimal value must have the prefix 0x, if it is not present then the value is read as decimal.

The number serial written in CS.cfg during range update can be forced to be only decimal or hexadecimal setting the value dbs.numberRangeRadix to 10 or 16 respectively. If not set or negative then the default for the specific generator will be used.

@fmarco76
Copy link
Member Author

fmarco76 commented Oct 11, 2024

To upgrade the id generator from legacy to newLegacy, in case there are no clones, the CS.cfg values for certificate serial numbers should be updated to hex. In partitular:

  • the beginSerialNumber can be directly converted adding the 0x prefix while;
  • the endSerialNumber is calcalted adding the increment read as hex.

Finally, the value of nextRange in DS node ou=certificateRepository,ou=ca,<DN suffix> has to be converted from hex to decimal (legacy use hes without prefix). If the previous values have been updated and the id modified then it could be enough the command: pki-server ca-range-update

@fmarco76
Copy link
Member Author

@parrjd testing the fix for next range gaps in PR #4840 create problem with update. If fixed all the radix issues (there are few other operation errors) then the ranges do not match properly generating errors in deployed instances (it is OK for new CA). To avoid problems we are implementing this fix which provides a new generator so in cases instances have no problem with gaps there is no need to modify it.
Is this approach working in your case?

Copy link
Contributor

@edewata edewata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fmarco76 I'm still reviewing this, but could you add a test here (before switching to RSNv3) to show how to fix an existing CA that has a sequential serial number gap?

So basically the test might look like this:

  1. make the necessary changes to the CA
  2. allocate new ranges
  3. enroll certs to exhaust the current range
  4. repeat step 2 & 3 if necessary
  5. verify that no new gap is created

For simplicity it's not necessary to verify the config params or LDAP entries. The verification in step 5 should be sufficient. Thanks!

@parrjd
Copy link
Contributor

parrjd commented Oct 12, 2024

@fmarco76 sorry I did not see this earlier

I think forcing the values in the CS.cfg to be read as base 10 is a bad idea and will only cause more problems for existing installs. What we have currently in our prod environment is a hex number including characters a-f that are allowed in a hex string, and if it is forced to be read as a base 10 that will likely break on startup, or when ever that vaule is required to be read.

Using 100 as an example for both end and increment
100 = 0x64
256 = 0x100

if you force it to be read and used as a decimal and the CA has already issued 0x80 then it will jump to the next range starting at 200 which will overlap the previous range. The only safe way to switch to Dec would be to convert the Incremement and the end values to Decimal so increment and end would need to be changed in the CS.cfg

dbs.beginSerialNumber=1
dbs.endSerialNumber=10000000
dbs.serialCloneTransferNumber=10000

dbs.serialIncrement=10000000
dbs.serialLowWaterMark=2000000
dbs.serialCloneTransferNumber=10000
dbs.serialDN=ou=certificateRepository, ou=ca
dbs.serialRangeDN=ou=certificateRepository, ou=ranges
dbs.beginReplicaNumber=[pki_replica_number_range_start]
dbs.endReplicaNumber=[pki_replica_number_range_end]

As a side note:
dbs.beginRequestNumber=[pki_request_number_range_start]
dbs.endRequestNumber=[pki_request_number_range_end]
dbs.requestIncrement=10000000
dbs.requestLowWaterMark=2000000
dbs.requestCloneTransferNumber=10000
dbs.requestDN=ou=ca, ou=requests
dbs.requestRangeDN=ou=requests, ou=ranges

pkispawn values
pki_serial_number_range_start=
pki_serial_number_range_end=
pki_request_number_range_start=
pki_request_number_range_end=
pki_replica_number_range_start=
pki_replica_number_range_end=

For an install that did not override the begin and end for serial numbers CS.cfg with the pkispawn vars the first range has 268,435,456 possible serial numbers, and the increment has the same amount while the request only has 10,000,000 since the request is done in base 10. as long as range management is enabled it should not be an issue since the requests will role to the next range separate of the serials.

The nextrange DS Values do not get populated for some time. We do not have them on most of our prod systems as they get populated at a point well after the CA is stood up when serial issuances reaches a point and populates the values to the DS instance.

For a stand alone CA the default end serialNumer Value will result in most installs of a dogtag CA never rolling over to a new range. Active clones are more likely to role faster since they only get a very small fraction of the top end of the existing CA's serial range for a sequential CA. For a Random serial CA I believe it pulls the new nextRange when the CA is built, but may be wrong since I have not deployed a random serial CA yet.

Ldap output from a serial CA that has a clone that has not reached the point where a range has been added to the DS instance.

dn: ou=certificateRepository,ou=ranges,dc=*****
objectClass: top
objectClass: organizationalUnit
ou: certificateRepository

I think the best solution would be to set a couple of extra values in the CS.cfg so that different subsystems can be contolled differently for like the TPS, TKS, or OSCP where everything I believe is in base 10, and treat all of the values for a type the same, so start, end, and increment for Serial would be base 16, and for request and replica they would be base 10.

dbs.serialRadix=16
dbs.requestRadix=10
dbs.replicaRadix=10

From DOGTAG_10_5_BRANCH
base/server/cms/src/com/netscape/cms/servlet/csadmin/UpdateNumberRange.java


            if (type.equals("request")) {
                radix = 10;
                endNumConfig = "dbs.endRequestNumber";
                cloneNumConfig = "dbs.requestCloneTransferNumber";
                nextEndConfig = "dbs.nextEndRequestNumber";
            } else if (type.equals("serialNo")) {
                radix = 16;
                endNumConfig = "dbs.endSerialNumber";
                cloneNumConfig = "dbs.serialCloneTransferNumber";
                nextEndConfig = "dbs.nextEndSerialNumber";
            } else if (type.equals("replicaId")) {
                radix = 10;
                endNumConfig = "dbs.endReplicaNumber";
                cloneNumConfig = "dbs.replicaCloneTransferNumber";
                nextEndConfig = "dbs.nextEndReplicaNumber";
            }

For a RHEL 7 amd RHEL 8 instance it can be assumed that for Serial should be interprtied as base 16, and for Requests and replica should be done as base 10.

Since the math error in the code results in the value being recroded in the ds being higher than the value stored in the CS.cfg when taking base into account I am not seeing the risk to fixing it by fixing how the math is done.

For dealing with existing systems I see a couple of paths.

Handle at subsystem startup

If nextRange values are not in DS, no issue since the problem code has not been executed.

If nextRange exists in the DS, read the existing range from the cn=<start_of_current_range> from the DS and compare to what is in CS.cfg, if they match nothing to do, if they do not match the write back the End of the Range to CS.cfg so that they do match. This will prevent a gap from being created. Fix the code so that Reads from DS Range Values and the CS.cfg always bigInt with the Radix and toString(radix) so that the math is all done using the proper base for the values, and also stored in the same manner in both the DS and the CS.cfg so they can easily be compared.

Switch to all Dec, but manually convert/ update the increment and end to be hex to dec values so that 100 would be 256 written back to the CS.cfg. The challenge here is I think this is a bigger lift, and anywhere in the code that currently uses the value as hex would have to be found, and anywhere it needs to be hex would need to do an integer.toHexString() to convert it.

Basicly I am looking at the best way to prevent this from happening when a CA Rolls to the next range. Not looking to go back and try to fill in where we have ended up with gaps. We are already having to update Other systems that pull stuff from the CAs to handle, the gap, and we have had to do that before when we did not use the serial management provides and we did manual clones with completely different ranges. and had huge gaps so most of the other systems have a method to deal with gaps.

If we read from DS and CS.cfg with a Radix then it is coming in to Bigint as a bigint from hex or Dec, and when writing back to either toString(radix), then it is put back in the expected value.

I think adding 0x to designate it as hex is also likely a bad Idea as most of the code in the repository class is used by different subsystems that have different bases and then every value would need to be tested to see if the 0x needs to be stripped, off then bigint with a 16 radix if it had a 0x. where as if every subsystem provides the radix that it uses, it is just used vs having to add logic to determine the value type of dec/hex.

I dont think a new id generator is really required since currently the existing has no potential of Serial overlap which would be really bad, it just can result in gaps since the math is not done the same way for the value written to the DS, and the value written to the CS.cfg

base/ca/shared/conf/CS.cfg Outdated Show resolved Hide resolved
endRange: 48
host: primary.example.com

EOF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the comment to indicate that the allocated cert range for the primary CA changed from 13-30 to 19-48.

Could you add a similar check for the request range too since we're migrating it to SSNv2 too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding the beginRange and endRange attributes are always in decimal and we're not changing it in this PR:
https://github.com/dogtagpki/pki/blob/master/base/server/src/main/java/com/netscape/cmscore/dbs/Repository.java#L574-L578

The values are derived from nextRange which was parsed as decimal:
https://github.com/dogtagpki/pki/blob/master/base/server/src/main/java/com/netscape/cmscore/dbs/Repository.java#L543-L544

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are written in decimal and derived from nextRange. However, nextRange is read as decimal when assigned to the range but it is read as hex when assigned in nextMinSerialNumber:

mNextMinSerialNo = new BigInteger(nextRange, mRadix);

Therefore the value written in range match the value written in CS.cfg but the real value is different. With this patch the value will really match after the conversion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes it really complicated since we cannot really rely on the values written in the CS.cfg or DS so there are more possibilities of problems. I need to look at the code to see how mNextMinSerialNo is used.

So in this step the range size grows from 18 (which is the size specified during installation) to 30, and it temporarily overlaps the other server's range. Where does the 30 come from?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no real overlap because the one updated is in decimal while the other is in hex. If we took the range of the primary as example here. Initially the range was from 13 to 30 (30 = 13 + increment in dec - 1). The following range start from 31 and the begin is the real one. The gap is generated because in the CS.cfg the end is not 30 but 13 + incerement in hex - 1. = 24 and the next start from 31.

BTW, I think I found a possible conflict and I am updating the code.

.github/workflows/ca-clone-sequential-test.yml Outdated Show resolved Hide resolved
Comment on lines +1065 to +1157
- name: Check cert range config in primary CA
run: |
tests/ca/bin/ca-cert-range-config.sh primary | tee output

cat > expected << EOF
dbs.beginSerialNumber=0x13
dbs.endSerialNumber=0x30
dbs.serialCloneTransferNumber=0x9
dbs.serialIncrement=0x12
dbs.serialLowWaterMark=0x9
EOF

diff expected output

- name: Check cert range config in secondary CA
run: |
tests/ca/bin/ca-cert-range-config.sh secondary | tee output

cat > expected << EOF
dbs.beginSerialNumber=0x31
dbs.endSerialNumber=0x48
dbs.serialCloneTransferNumber=0x9
dbs.serialIncrement=0x12
dbs.serialLowWaterMark=0x9
EOF

diff expected output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the ranges changed from 0x13-0x24 to 0x13-0x30 in the primary CA and from 0x31-0x42 to 0x31 to 0x48 in the secondary CA. Is there any possibility that the new ranges would conflict with other existing ranges?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I get the math correct the range in CS.cfg are smaller then in DS and not full used but the start range are always correct in case of gap. The ID between 0x42 and 0x48 are lost because when a new range is computed it start from the value recorded in nextRange and do not take in account the range objects and/or configuration range so overlap should not be possible. Using the full recorded range we are requesting to utilise the full range.

base/server/python/pki/server/cli/range.py Outdated Show resolved Hide resolved
@fmarco76 fmarco76 force-pushed the nextRange branch 2 times, most recently from 4b71c97 to da9f148 Compare October 21, 2024 15:11
Comment on lines 244 to 296
LDAPSearchResults ranges = conn.search(rangeDN, LDAPv3.SCOPE_SUB, "(objectClass=pkiRange)", null, false);

BigInteger lastUsedSerial = BigInteger.ZERO;
boolean nextRangeToUpdate = true;
// Search for the last range entry. If it is associated to the CA to update or ranges are not defined
// then the nextRange is
while (ranges.hasMoreElements()) {
LDAPEntry entry = ranges.next();
String endRange = entry.getAttribute("endRange").getStringValues().nextElement();
String host = entry.getAttribute("host").getStringValues().nextElement();
String port = entry.getAttribute("securePort").getStringValues().nextElement();
BigInteger next = new BigInteger(endRange, 16);
if (lastUsedSerial.compareTo(next) < 0) {
lastUsedSerial = next;
nextRangeToUpdate = host.equals(hostName) && port.equals(securePort);

}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC in this loop we go through all ranges and parse the endRange as hexadecimal and use that to find the last range entry. However, in the following loop we're replacing the hexadecimal endRange with its decimal value. So if we run this command again for another replica, wouldn't the now decimal endRange be incorrectly parsed as hexadecimal?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nextRange is updated only if last range is associated with the CA doing the generator update so only one CA will do the update and since the other are not creating new range there should be no conflict. In the following loop only the ranges associated with this CA are updated so if last range is not owned by this CA there is no update therefore there should not be any conflict.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually concerned about the endRange, not nextRange. In the first loop above it's parsed as hexadecimal and used for comparison:

String endRange = entry.getAttribute("endRange").getStringValues().nextElement();
BigInteger next = new BigInteger(endRange, 16);
if (lastUsedSerial.compareTo(next) < 0) {

In the second loop below, it's parsed again as hexadecimal, but written as decimal, basically replacing the old value:

String endRange = entry.getAttribute("endRange").getStringValues().nextElement();
BigInteger endRangeNo = new BigInteger(endRange, 16);
attrs.add(new LDAPAttribute("endRange", endRangeNo.toString()));

So suppose we run the tool again for another clone, the first loop will run again, but this time it will be reading the endRange that has already been converted to decimal by the second loop, and it will parse the value as hexadecimal again. Is this correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the ranges in the DS just before the migration:

beginRange: 13
endRange: 30
host: primary.example.com

beginRange: 31
endRange: 48
host: secondary.example.com

nextRange: 49

So in the DS at least there seems to be no gap. Is it possible to take these values as decimals, then update CS.cfg based on these values, but avoid updating the DS completely? Some serial numbers that used to belong to one server might "move" to another server, but since we're stopping all servers, fixing all ranges at once, and restarting the servers again, that might be OK. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have inverted the order of the operation. Update the range to decimal of the updated CA and then look for the next range. In case of not selecting the real last range is not a problem. Considering the case in CI but we suppose the primary range is from 13 to 31 and the secondary range is from 31 to 48. The first updated is the primary so it become 19 49 and it is considered as last. The nextRange will be 50. Then the other CA is update and the range is 50 -72 and the next range is updated again to 73. There are no other operation involved so it is not a problem.

Copy link
Contributor

@edewata edewata Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is we're still mix up radixes in the second loop:

LDAPSearchResults ranges = conn.search(rangeDN, LDAPv3.SCOPE_SUB, "(objectClass=pkiRange)", null, false);
String endRange = entry.getAttribute("endRange").getStringValues().nextElement();
BigInteger next = new BigInteger(endRange);  // could be decimal or hexadecimal that has not been converted to decimal

While the code might work with the above example, there isn't really a guarantee that it will work fine in all cases.

As discussed in Slack, we should explore some alternatives:

Comment on lines 302 to 307
String dn = "cn=" + beginRangeNo.toString() + "," + rangeDN;
LDAPEntry rangeEntry = new LDAPEntry(dn, attrs);
logger.info("SubsystemRangeGeneratorUpdateCLI.updateRanges: Remove entry " + entry.getDN());
conn.delete(entry.getDN());
logger.info("SubsystemRangeGeneratorUpdateCLI.updateRanges: Adding entry " + dn);
conn.add(rangeEntry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we're deleting the old cn=<hex> and adding a new cn=<decimal>. Is there a possibility that the cn=<decimal> will conflict with an existing range?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to compute this to exclude this possibility but I cannot. However, since the conversion between hex and dec is not linear, the results are not multiple and the value increase very quickly I cannot find combination of value where overlap could happen so I consider very unlikely. However, we could try to find if a combination is possible orwe could add a check and eventually do a correction (e.g. adding or reducing the range by one entry)

@fmarco76 fmarco76 force-pushed the nextRange branch 3 times, most recently from a36d41b to 6a95c31 Compare October 23, 2024 17:00
The current generator has a problem with converting from hex to decimal
the range boundaries creating gaps between ranges. This a problem when
third parties tools are used to with certificates because contiguous
range are expected.

This commit introduce the generator legacy2. This uses same
configuration parameter but hex value are specified by the prefix '0x'.

When value are written to the configuration value it is possible to set
the radix with the options:
- dbs.cert.id.radix (default to 16)
- dbs.key.id.radix (default to 16)
- dbs.request.id.radix (default to 10)

Additionally, the new command `pki-server <subsystem>-id-generator-*`
has been added to migrate from the legacy generator to the legacy2 or to
random.
Copy link

Copy link
Contributor

@edewata edewata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still some concerns about the migration process, but the SSNv2 itself looks fine. I think lets merge this PR so we can add tests for SSNv2 (without the migration) while we keep improving the migration process. Or if you prefer, feel free to split the PR and move the migration tools & tests into a separate PR. Thanks!

@fmarco76
Copy link
Member Author

@edewata Thanks! I am merging this with the current migration implementation. It will be easier to test and improve to support the corner cases.

@fmarco76 fmarco76 merged commit 16e42fd into dogtagpki:master Oct 24, 2024
157 of 166 checks passed
@fmarco76 fmarco76 deleted the nextRange branch October 24, 2024 18:08
@fmarco76
Copy link
Member Author

I dont think a new id generator is really required since currently the existing has no potential of Serial overlap which would be really bad, it just can result in gaps since the math is not done the same way for the value written to the DS, and the value written to the CS.cfg

Testing with just fixing the conversion of nextRange and increment will generate range overlaps and CA stop working so it is impossible to fix but this is a problem with master branch. In older releases, the increment is correctly evaluated and the problem is only in the nextRange. In fact, the CI test we have in master branch https://github.com/dogtagpki/pki/blob/master/.github/workflows/ca-ssnv1-test.yml will generate different gaps if tests with code from older branch (like DOGTAG_10_5_BRANCH).

To have a single solution for all versions and to avoid modifying running instance the optional new generator seems a better solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants