Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add confidence intervals to the cumulative incidence estimator for competing risks. #500

Merged
merged 11 commits into from
Jan 12, 2025

Conversation

mvlvrd
Copy link
Contributor

@mvlvrd mvlvrd commented Dec 6, 2024

There are some minimal differences in the results of the confidence intervals with respect to the implementation in
the R package cmprsk, so we add some tolerance to the assertions. Apart from the Aalen estimator used in cmprsk we include two other estimators, but these are not tested so thoroughly, because we couldn't find implementations.

There is a new data set with three different terminal states.

Checklist

  • closes #xxxx
  • pytest passes
  • tests are included
  • code is well formatted
  • documentation renders correctly

What does this implement/fix? Explain your changes

This adds the missing Confidence Intervals for the cumulative incidence in competing risks.

competing risks.

Tests are included.
There are some minimal differences in the results of the confidence
intervals with respect to the implementation in
the R package cmprsk, so we add some tolerance to the assertions.
Apart from the Aalen estimator used in cmprsk we include two other
estimators, but these are not tested so thoroughly, because we couldn't
find data sources.
Copy link

codecov bot commented Dec 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (3a3ef80) to head (28d7652).
Report is 34 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #500      +/-   ##
==========================================
+ Coverage   98.23%   98.54%   +0.31%     
==========================================
  Files          37       37              
  Lines        3564     3647      +83     
  Branches      472      476       +4     
==========================================
+ Hits         3501     3594      +93     
+ Misses         30       25       -5     
+ Partials       33       28       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sebp
Copy link
Owner

sebp commented Dec 16, 2024

Thanks for your PR.

Can you explain a little bit what the differences between the estimators are? In particular, in which situations one would prefer estimator X over the others?

@mvlvrd
Copy link
Contributor Author

mvlvrd commented Dec 17, 2024

Thanks for the reply!

  • Aalen estimator is based on counting process theory and it is tested against the cmprsk R package, implemented by Gray. The actual formula used is Eq. 4.5 from Pintilie's book.
  • Dinse comes from the Delta method applied to the cumulative incidence estimator, assumed it follows a multinomial distribution. I couldn't find any external implementation, so the test data is the data generated by the current code. The formula used is derived from the original Dinse-Larson paper.
  • Dinse_approx is a large number of events approximation of the Dinse estimator. It seems that the approximation was popular in the 00's but with current computing power I guess it is not relevant anymore. Nevertheless, I have included it because I couldn't find real data to test the "full" Dinse estimator. The formulas used are derived from Dinse and Larson original paper and are equivalent to Eq. 4.4 in Pintilie's book.

There are multiple versions of both kind of estimators. According to Braun and Yuan (Statist. Med. 2007; 26:1170–1180), the Dinse-based estimators seems to work better than the Gray version of the Aalen-based estimator and "are fairly accurate even in small samples". I have decided to include both kind of estimators for completeness and because the Aalen estimators seem to be the standard due to the R implementation.

Apart from this, I have the personal impression that the Aalen-based estimators would not work properly in the case of discrete time because of the possibility of multiple simultaneous events. But I don't understand the theory well enough to justify it.

So in practice, I think Dinse is the best practical solution, while Aalen would be useful when comparing with the cmprsk implementation. Dinse_approx could be deprecated, but may be useful for historical reasons.

I am sorry if this is confusing, I am not an expert on these issues and it took me a while to go through the literature and such.

BR,

Copy link

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation Diff coverage
+0.02% 100.00%
Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (3a3ef80) 3564 3534 99.16%
Head commit (344e15d) 3653 (+89) 3623 (+89) 99.18% (+0.02%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#500) 91 91 100.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

See your quality gate settings    Change summary preferences

Codacy stopped sending the deprecated coverage status on June 5th, 2024. Learn more

@sebp
Copy link
Owner

sebp commented Dec 20, 2024

I started reviewing your code based on the following 2 papers, because I don't have access to the book by M. Pintilie that you referenced.

  1. Chen J, Hou Y, Chen Z. Statistical inference methods for cumulative incidence function curves at a fixed point in time. Communications in Statistics - Simulation and Computation. 2020 Jan 2;49(1):79–94. http://arxiv.org/abs/1807.05846
  2. Choudhury JB. Non‐parametric confidence interval estimation for competing risks analysis: application to contraceptive data. Statistics in Medicine. 2002 Apr 30;21(8):1129–44. https://doi.org/10.1002/sim.1070

The implementation of Aalen’s variance seems to be correct, based on [1]. Unfortunately, I couldn't find out if the cmprsk R package is supposed to implement the same estimator, or a different one. I started to look at the source code, but it is written in Fortran 😕 If cmprsk implements the same estimator, the results should probably be identical by more than 4 decimals points.

Regarding the estimator by Dinse and Larson, I used [2] as my reference (it provides some rough R code in the appendix). One thing I don't understand at the moment, is cprod = np.cumprod(1 + x) / (1 + x). First, equation (B1) in [2] doesn't have a 1 + … term, second it would expect that the call tonp.cumprod would have an axis argument. I will try if I can reproduce the results of [2] using the code from the appendix. This could potentially be a reference implementation

@mvlvrd
Copy link
Contributor Author

mvlvrd commented Dec 21, 2024

I tried to back-engineer the cmprsk code and I think it may be not the same formula. I plotted all three components for both the Fortran code (va, vb and vc) and our implementation (var_a, var_b and var_c) and they are quite different from each other, but somehow the differences cancel out.

Regarding the Dinse formula, I think (B1) in Choudhury (2002) is wrong:

$$ \prod_{l=1}^{r-1} \frac{c_{(l)}}{N_l(N_l - c_{(l)})} $$

should be

$$ \prod_{l=1}^{r-1} \left( 1 + \frac{c_{(l)}}{N_l(N_l - c_{(l)})} \right) $$

The right formula can be deduced from Dinse et al. (1986) Eq. 5
I actually first tried to implement it using the formula in Choudhury's paper, but got very bad results. Then I went to the original Dinse's paper and noticed the error. Notice how the Taylor expansion doesn't work if you use the wrong formula.

Regarding the missing axis argument, I think it is not needed, because x only has one index (the time variable l) and no dependence on the type of risk.

I attach the relevant part of Pintilie's book
pintilie_var.pdf

Copy link
Owner

@sebp sebp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize for the delay.

I don't have any major concerns. The code seems to be correct, although I did not compare against reference implementation yet.

I'm not sure what exactly R's cmprsk implements. mstate seems to implement the Aalen estimator. So far, I could not find an R implementation of the Dinse estimator. Only STATA and SAS seem to have it.

I'm inclined to drop the exact version of the Dinse estimator, because the approximation seems to be more widespread (STATA and SAS only use the approximation).

sksurv/datasets/base.py Outdated Show resolved Hide resolved
sksurv/datasets/base.py Outdated Show resolved Hide resolved
sksurv/datasets/base.py Outdated Show resolved Hide resolved
sksurv/datasets/base.py Outdated Show resolved Hide resolved
sksurv/datasets/base.py Outdated Show resolved Hide resolved
sksurv/nonparametric.py Show resolved Hide resolved
sksurv/nonparametric.py Outdated Show resolved Hide resolved
sksurv/nonparametric.py Show resolved Hide resolved
sksurv/nonparametric.py Outdated Show resolved Hide resolved
tests/test_nonparametric.py Show resolved Hide resolved
- Better documentation.
- Some errors with `::` in Rest are corrected.
- Clearer processing of the cum_inc = 0 case when calculating Confidence
  Intervals.
@mvlvrd
Copy link
Contributor Author

mvlvrd commented Jan 7, 2025

Thanks for the review and sorry for the late response.
I have addressed most of the comments, please tell if you need more explanations or corrections.

I would like to keep the exact Dinse estimator, as it is the one I am using in production code.

Copy link
Owner

@sebp sebp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for changing the use of einsum. My idea was to avoid the multiplication with mask, but I now understand that this isn't possible with np.cumsum. Using np.sum makes it a little bit easier to understand, but it might be worth it. Using np.einsum should be faster due to avoiding intermediate arrays.

I would revert to using np.einsum and leave the np.sum as a comment to explain what is being computed.

sksurv/datasets/base.py Outdated Show resolved Hide resolved
sksurv/nonparametric.py Outdated Show resolved Hide resolved
sksurv/nonparametric.py Outdated Show resolved Hide resolved
sksurv/nonparametric.py Outdated Show resolved Hide resolved
@sebp sebp merged commit 7b6d478 into sebp:master Jan 12, 2025
17 checks passed
@sebp
Copy link
Owner

sebp commented Jan 12, 2025

@mvlvrd Thanks for your hard work 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants