-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add confidence intervals to the cumulative incidence estimator for competing risks. #500
Conversation
competing risks. Tests are included. There are some minimal differences in the results of the confidence intervals with respect to the implementation in the R package cmprsk, so we add some tolerance to the assertions. Apart from the Aalen estimator used in cmprsk we include two other estimators, but these are not tested so thoroughly, because we couldn't find data sources.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #500 +/- ##
==========================================
+ Coverage 98.23% 98.54% +0.31%
==========================================
Files 37 37
Lines 3564 3647 +83
Branches 472 476 +4
==========================================
+ Hits 3501 3594 +93
+ Misses 30 25 -5
+ Partials 33 28 -5 ☔ View full report in Codecov by Sentry. |
Thanks for your PR. Can you explain a little bit what the differences between the estimators are? In particular, in which situations one would prefer estimator X over the others? |
Thanks for the reply!
There are multiple versions of both kind of estimators. According to Braun and Yuan (Statist. Med. 2007; 26:1170–1180), the Dinse-based estimators seems to work better than the Gray version of the Aalen-based estimator and "are fairly accurate even in small samples". I have decided to include both kind of estimators for completeness and because the Aalen estimators seem to be the standard due to the R implementation. Apart from this, I have the personal impression that the Aalen-based estimators would not work properly in the case of discrete time because of the possibility of multiple simultaneous events. But I don't understand the theory well enough to justify it. So in practice, I think Dinse is the best practical solution, while Aalen would be useful when comparing with the I am sorry if this is confusing, I am not an expert on these issues and it took me a while to go through the literature and such. BR, |
Coverage summary from CodacySee diff coverage on Codacy
Coverage variation details
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: Diff coverage details
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: See your quality gate settings Change summary preferencesCodacy stopped sending the deprecated coverage status on June 5th, 2024. Learn more |
I started reviewing your code based on the following 2 papers, because I don't have access to the book by M. Pintilie that you referenced.
The implementation of Aalen’s variance seems to be correct, based on [1]. Unfortunately, I couldn't find out if the cmprsk R package is supposed to implement the same estimator, or a different one. I started to look at the source code, but it is written in Fortran 😕 If cmprsk implements the same estimator, the results should probably be identical by more than 4 decimals points. Regarding the estimator by Dinse and Larson, I used [2] as my reference (it provides some rough R code in the appendix). One thing I don't understand at the moment, is |
I tried to back-engineer the Regarding the Dinse formula, I think (B1) in Choudhury (2002) is wrong: should be The right formula can be deduced from Dinse et al. (1986) Eq. 5 Regarding the missing I attach the relevant part of Pintilie's book |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologize for the delay.
I don't have any major concerns. The code seems to be correct, although I did not compare against reference implementation yet.
I'm not sure what exactly R's cmprsk implements. mstate seems to implement the Aalen estimator. So far, I could not find an R implementation of the Dinse estimator. Only STATA and SAS seem to have it.
I'm inclined to drop the exact version of the Dinse estimator, because the approximation seems to be more widespread (STATA and SAS only use the approximation).
- Better documentation. - Some errors with `::` in Rest are corrected. - Clearer processing of the cum_inc = 0 case when calculating Confidence Intervals.
Thanks for the review and sorry for the late response. I would like to keep the exact Dinse estimator, as it is the one I am using in production code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for changing the use of einsum
. My idea was to avoid the multiplication with mask
, but I now understand that this isn't possible with np.cumsum
. Using np.sum
makes it a little bit easier to understand, but it might be worth it. Using np.einsum
should be faster due to avoiding intermediate arrays.
I would revert to using np.einsum
and leave the np.sum
as a comment to explain what is being computed.
Reverting to use einsum. Other couple of minor changes.
@mvlvrd Thanks for your hard work 🙏 |
There are some minimal differences in the results of the confidence intervals with respect to the implementation in
the R package cmprsk, so we add some tolerance to the assertions. Apart from the Aalen estimator used in cmprsk we include two other estimators, but these are not tested so thoroughly, because we couldn't find implementations.
There is a new data set with three different terminal states.
Checklist
What does this implement/fix? Explain your changes
This adds the missing Confidence Intervals for the cumulative incidence in competing risks.