Add confidence intervals to the cumulative incidence estimator for competing risks. #500

mvlvrd · 2024-12-06T05:41:34Z

There are some minimal differences in the results of the confidence intervals with respect to the implementation in
the R package cmprsk, so we add some tolerance to the assertions. Apart from the Aalen estimator used in cmprsk we include two other estimators, but these are not tested so thoroughly, because we couldn't find implementations.

There is a new data set with three different terminal states.

Checklist

What does this implement/fix? Explain your changes

This adds the missing Confidence Intervals for the cumulative incidence in competing risks.

competing risks. Tests are included. There are some minimal differences in the results of the confidence intervals with respect to the implementation in the R package cmprsk, so we add some tolerance to the assertions. Apart from the Aalen estimator used in cmprsk we include two other estimators, but these are not tested so thoroughly, because we couldn't find data sources.

codecov · 2024-12-06T13:18:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (3a3ef80) to head (28d7652).
Report is 34 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #500      +/-   ##
==========================================
+ Coverage   98.23%   98.54%   +0.31%     
==========================================
  Files          37       37              
  Lines        3564     3647      +83     
  Branches      472      476       +4     
==========================================
+ Hits         3501     3594      +93     
+ Misses         30       25       -5     
+ Partials       33       28       -5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

sebp · 2024-12-16T17:50:51Z

Thanks for your PR.

Can you explain a little bit what the differences between the estimators are? In particular, in which situations one would prefer estimator X over the others?

mvlvrd · 2024-12-17T02:19:44Z

Thanks for the reply!

Aalen estimator is based on counting process theory and it is tested against the cmprsk R package, implemented by Gray. The actual formula used is Eq. 4.5 from Pintilie's book.
Dinse comes from the Delta method applied to the cumulative incidence estimator, assumed it follows a multinomial distribution. I couldn't find any external implementation, so the test data is the data generated by the current code. The formula used is derived from the original Dinse-Larson paper.
Dinse_approx is a large number of events approximation of the Dinse estimator. It seems that the approximation was popular in the 00's but with current computing power I guess it is not relevant anymore. Nevertheless, I have included it because I couldn't find real data to test the "full" Dinse estimator. The formulas used are derived from Dinse and Larson original paper and are equivalent to Eq. 4.4 in Pintilie's book.

There are multiple versions of both kind of estimators. According to Braun and Yuan (Statist. Med. 2007; 26:1170–1180), the Dinse-based estimators seems to work better than the Gray version of the Aalen-based estimator and "are fairly accurate even in small samples". I have decided to include both kind of estimators for completeness and because the Aalen estimators seem to be the standard due to the R implementation.

Apart from this, I have the personal impression that the Aalen-based estimators would not work properly in the case of discrete time because of the possibility of multiple simultaneous events. But I don't understand the theory well enough to justify it.

So in practice, I think Dinse is the best practical solution, while Aalen would be useful when comparing with the cmprsk implementation. Dinse_approx could be deprecated, but may be useful for historical reasons.

I am sorry if this is confusing, I am not an expert on these issues and it took me a while to go through the literature and such.

BR,

codacy-production · 2024-12-20T13:42:18Z

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation	Diff coverage
✅ +0.02%	✅ 100.00%

Coverage variation details

	Coverable lines	Covered lines	Coverage
Common ancestor commit (`3a3ef80`)	3564	3534	99.16%
Head commit (`344e15d`)	3653 (+89)	3623 (+89)	99.18% (+0.02%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

	Coverable lines	Covered lines	Diff coverage
Pull request (#500)	91	91	100.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

See your quality gate settings Change summary preferences

_{Codacy stopped sending the deprecated coverage status on June 5th, 2024. Learn more}

sebp · 2024-12-20T17:18:33Z

I started reviewing your code based on the following 2 papers, because I don't have access to the book by M. Pintilie that you referenced.

Chen J, Hou Y, Chen Z. Statistical inference methods for cumulative incidence function curves at a fixed point in time. Communications in Statistics - Simulation and Computation. 2020 Jan 2;49(1):79–94. http://arxiv.org/abs/1807.05846

Choudhury JB. Non‐parametric confidence interval estimation for competing risks analysis: application to contraceptive data. Statistics in Medicine. 2002 Apr 30;21(8):1129–44. https://doi.org/10.1002/sim.1070

The implementation of Aalen’s variance seems to be correct, based on [1]. Unfortunately, I couldn't find out if the cmprsk R package is supposed to implement the same estimator, or a different one. I started to look at the source code, but it is written in Fortran 😕 If cmprsk implements the same estimator, the results should probably be identical by more than 4 decimals points.

Regarding the estimator by Dinse and Larson, I used [2] as my reference (it provides some rough R code in the appendix). One thing I don't understand at the moment, is cprod = np.cumprod(1 + x) / (1 + x). First, equation (B1) in [2] doesn't have a 1 + … term, second it would expect that the call tonp.cumprod would have an axis argument. I will try if I can reproduce the results of [2] using the code from the appendix. This could potentially be a reference implementation

mvlvrd · 2024-12-21T12:47:22Z

I tried to back-engineer the cmprsk code and I think it may be not the same formula. I plotted all three components for both the Fortran code (va, vb and vc) and our implementation (var_a, var_b and var_c) and they are quite different from each other, but somehow the differences cancel out.

Regarding the Dinse formula, I think (B1) in Choudhury (2002) is wrong:

$$ \prod_{l=1}^{r-1} \frac{c_{(l)}}{N_l(N_l - c_{(l)})} $$

should be

$$ \prod_{l=1}^{r-1} \left( 1 + \frac{c_{(l)}}{N_l(N_l - c_{(l)})} \right) $$

The right formula can be deduced from Dinse et al. (1986) Eq. 5
I actually first tried to implement it using the formula in Choudhury's paper, but got very bad results. Then I went to the original Dinse's paper and noticed the error. Notice how the Taylor expansion doesn't work if you use the wrong formula.

Regarding the missing axis argument, I think it is not needed, because x only has one index (the time variable l) and no dependence on the type of risk.

I attach the relevant part of Pintilie's book
pintilie_var.pdf

sebp

I apologize for the delay.

I don't have any major concerns. The code seems to be correct, although I did not compare against reference implementation yet.

I'm not sure what exactly R's cmprsk implements. mstate seems to implement the Aalen estimator. So far, I could not find an R implementation of the Dinse estimator. Only STATA and SAS seem to have it.

I'm inclined to drop the exact version of the Dinse estimator, because the approximation seems to be more widespread (STATA and SAS only use the approximation).

sksurv/datasets/base.py

sksurv/nonparametric.py

tests/test_nonparametric.py

- Better documentation. - Some errors with `::` in Rest are corrected. - Clearer processing of the cum_inc = 0 case when calculating Confidence Intervals.

mvlvrd · 2025-01-07T01:46:13Z

Thanks for the review and sorry for the late response.
I have addressed most of the comments, please tell if you need more explanations or corrections.

I would like to keep the exact Dinse estimator, as it is the one I am using in production code.

sebp

Thanks for changing the use of einsum. My idea was to avoid the multiplication with mask, but I now understand that this isn't possible with np.cumsum. Using np.sum makes it a little bit easier to understand, but it might be worth it. Using np.einsum should be faster due to avoiding intermediate arrays.

I would revert to using np.einsum and leave the np.sum as a comment to explain what is being computed.

sksurv/datasets/base.py

sksurv/nonparametric.py

Reverting to use einsum. Other couple of minor changes.

sksurv/nonparametric.py

sebp · 2025-01-12T21:16:41Z

@mvlvrd Thanks for your hard work 🙏

mvlvrd added 3 commits December 6, 2024 14:22

Remove MS-DOS new line characters from data files.

da316fc

Move txt data file to arff format.

5ef4d0a

mvlvrd added 3 commits December 7, 2024 23:29

Add tests for wrong parameters.

a04c6bc

Add regression tests to cover the Dinse variance estimator.

ed170a4

Add comments to clarify the meaning of the CI tests.

344e15d

sebp requested changes Dec 31, 2024

View reviewed changes

Changes in the pull request review:

90a379c

- Better documentation. - Some errors with `::` in Rest are corrected. - Clearer processing of the cum_inc = 0 case when calculating Confidence Intervals.

Improve API doc of load_cgvhd()

f617776

sebp requested changes Jan 7, 2025

View reviewed changes

sksurv/datasets/base.py Outdated Show resolved Hide resolved

sksurv/nonparametric.py Outdated Show resolved Hide resolved

sksurv/nonparametric.py Outdated Show resolved Hide resolved

mvlvrd added 2 commits January 9, 2025 15:56

Better documentation of the CGVHD dataset.

6d389db

Reverting to use einsum. Other couple of minor changes.

Refactor the calculation of confidence intervals.

32f4bc0

sebp reviewed Jan 11, 2025

View reviewed changes

sksurv/nonparametric.py Outdated Show resolved Hide resolved

Fix API doc of _ci_logmlog

28d7652

sebp approved these changes Jan 12, 2025

View reviewed changes

sebp merged commit 7b6d478 into sebp:master Jan 12, 2025
17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add confidence intervals to the cumulative incidence estimator for competing risks. #500

Add confidence intervals to the cumulative incidence estimator for competing risks. #500

mvlvrd commented Dec 6, 2024 •

edited

Loading

codecov bot commented Dec 6, 2024 •

edited

Loading

sebp commented Dec 16, 2024

mvlvrd commented Dec 17, 2024

codacy-production bot commented Dec 20, 2024

sebp commented Dec 20, 2024

mvlvrd commented Dec 21, 2024

sebp left a comment

mvlvrd commented Jan 7, 2025

sebp left a comment

sebp commented Jan 12, 2025

Add confidence intervals to the cumulative incidence estimator for competing risks. #500

Add confidence intervals to the cumulative incidence estimator for competing risks. #500

Conversation

mvlvrd commented Dec 6, 2024 • edited Loading

codecov bot commented Dec 6, 2024 • edited Loading

Codecov Report

sebp commented Dec 16, 2024

mvlvrd commented Dec 17, 2024

codacy-production bot commented Dec 20, 2024

Coverage summary from Codacy

See diff coverage on Codacy

See your quality gate settings Change summary preferences

sebp commented Dec 20, 2024

mvlvrd commented Dec 21, 2024

sebp left a comment

Choose a reason for hiding this comment

mvlvrd commented Jan 7, 2025

sebp left a comment

Choose a reason for hiding this comment

sebp commented Jan 12, 2025

mvlvrd commented Dec 6, 2024 •

edited

Loading

codecov bot commented Dec 6, 2024 •

edited

Loading