-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handle wes/wgs inheritance edge case #4440
Conversation
hail_search/queries/mito.py
Outdated
other_sample_type_pass_samples.contains | ||
) & ht[sample_type.other_sample_type.failed_family_sample_field][other_sample_type_family_idx].all( | ||
sample_type_pass_samples.contains | ||
)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these binds are a little wild, but the logic is that for all families, keep family if all of its samples pass this check:
if sample failed in one sample type, it must pass in the other - and if a sample for that family failed in the other sample type, it must pass in the current sample type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this, I think we may also need extra logic for comp hets, but I would do that as a separate follow up PR after this is done to aooid scope creep
hail_search/queries/mito.py
Outdated
ht = self._apply_multi_sample_type_entry_filters(ht, family_idx_map) | ||
ch_ht = self._apply_multi_sample_type_entry_filters(ch_ht, family_idx_map) | ||
return ht, ch_ht | ||
|
||
def _annotate_failed_family_samples_inheritance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels incredibly similar to _annotate_families_inheritance
. Rather than making a whole separate function for this, you could make a much more tightly scoped conditional helper to pass into _annotate_families_inheritance
, perhaps just for the lambda
function applied to the hl.enumerate(ht[entries_ht_field]).starmap(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented this, but because I don't want the single sample type families that come originally through the mito class code path to call the mito 'family_passes_inheritance_filter' function, I'm still passing a family_passes_inheritance_filter
function to _filter_inheritance
instead of using python inheritance here. Do you know if there's a cleaner way to do this?
hail_search/test_search.py
Outdated
@@ -388,10 +388,11 @@ async def test_both_sample_types_search(self): | |||
[VARIANT2_BOTH_SAMPLE_TYPES], sample_data=FAMILY_2_BOTH_SAMPLE_TYPE_SAMPLE_DATA_MISSING_PARENTAL_WGS, | |||
inheritance_mode=inheritance_mode, **COMP_HET_ALL_PASS_FILTERS, intervals=[variant2_interval] | |||
) | |||
# Genome passes quality and inheritance exome fails inheritance (parental data shows variant is inherited). | |||
# Genome passes quality and inheritance but exome fails inheritance (parental data shows variant is inherited). | |||
# Variant is excluded from search results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is testing the case where its not returned when theres no valid WGS parental data, but can we also test that it IS returned when we include the parental data for WGS and its overridden?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this case is covered by an above test -
# Variant1 in family_2 is de novo in exome but maternally inherited in genome.
# Genome passes quality and inheritance, show genotypes for both sample types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The search criteria changes between these 2 test cases. We really need back-to-back tests that the exact same search returns different results when the paternal data is present/absent. Instead of running a dominant and recessive search each filtered to variant 1 and variant 2, it would probably be better to not use an interval filter at all and test that dominant and recessive search return the expected combo of variants with and without paternal data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this^!
hail_search/queries/mito.py
Outdated
lambda other_sample_type_pass_samples: ( | ||
ht[sample_type.failed_family_sample_field][family_idx].all(other_sample_type_pass_samples.contains) | ||
), ht[sample_type.other_sample_type.family_entries_field][other_sample_type_family_idx].filter( | ||
lambda s: ~ht[sample_type.other_sample_type.failed_family_sample_field][other_sample_type_family_idx].contains(s['sampleId']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string comparisons in hail are substantially slower than integer comparisons. family_entries_field
and failed_family_sample_field
for a given sample type should be the same length and the same order so you should be able to rewrite this more performantly as
hl.enumerate(ht[sample_type.other_sample_type.family_entries_field][other_sample_type_family_idx]).starmap(
lambda i, s: hl.or_missing(
hl.is_missing(ht[sample_type.other_sample_type.failed_family_sample_field][other_sample_type_family_idx][i]),
s['sampleId'],
)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this won't work since family_entries_field
and failed_family_sample_field
for a given sample type are not necessarily the same length. ht[sample_type.other_sample_type.failed_family_sample_field][other_sample_type_family_idx]
is a list of sample_ids of variable length - it only contains failed sample IDs - so the [i] access won't work. I originally had failed_family_sample_field
as an array of failing indices but switched it to an array of failing sample IDs because it was easier for me to come up with a solution to compare the same samples from different sample types.
Maybe a better way to structure failed_family_sample_field
is an array of booleans that matches the shape of family_entries_field
where the value at each sample index is true if failed or false if pass. It's just more complicated to compare samples individually that way - I'd need to reintroduce that map of families to samples to sample types to indices inside of their respective family_entries field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd need to reintroduce that map of families to samples to sample types to indices inside of their respective family_entries field.
if failed_family_sample_field
is a list of booleans then you can easily get the list of passing sample ids from the main family_entries field without maintaining an entry map, and then your logic for using other_sample_type_pass_samples
would not need to change
hl.enumerate(ht[family_entries_field][family_idx]).starmap(
lambda sample_idx, sample: hl.or_missing(
~ht[failed_family_sample_field][family_idx][sample_idx]),
sample['sampleId'],
)
).filter(hl.is_defined)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I refactored the passes_inheritance_field
to be a list of booleans corresponding to the samples in family entries, ^ and used this more performant code to get a list of passing sample IDs
hail_search/queries/mito.py
Outdated
lambda other_sample_type_pass_samples: ( | ||
ht[sample_type.failed_family_sample_field][family_idx].all(other_sample_type_pass_samples.contains) | ||
), ht[sample_type.other_sample_type.family_entries_field][other_sample_type_family_idx].filter( | ||
lambda s: ~ht[sample_type.other_sample_type.failed_family_sample_field][other_sample_type_family_idx].contains(s['sampleId']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd need to reintroduce that map of families to samples to sample types to indices inside of their respective family_entries field.
if failed_family_sample_field
is a list of booleans then you can easily get the list of passing sample ids from the main family_entries field without maintaining an entry map, and then your logic for using other_sample_type_pass_samples
would not need to change
hl.enumerate(ht[family_entries_field][family_idx]).starmap(
lambda sample_idx, sample: hl.or_missing(
~ht[failed_family_sample_field][family_idx][sample_idx]),
sample['sampleId'],
)
).filter(hl.is_defined)
hail_search/test_search.py
Outdated
@@ -388,10 +388,11 @@ async def test_both_sample_types_search(self): | |||
[VARIANT2_BOTH_SAMPLE_TYPES], sample_data=FAMILY_2_BOTH_SAMPLE_TYPE_SAMPLE_DATA_MISSING_PARENTAL_WGS, | |||
inheritance_mode=inheritance_mode, **COMP_HET_ALL_PASS_FILTERS, intervals=[variant2_interval] | |||
) | |||
# Genome passes quality and inheritance exome fails inheritance (parental data shows variant is inherited). | |||
# Genome passes quality and inheritance but exome fails inheritance (parental data shows variant is inherited). | |||
# Variant is excluded from search results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The search criteria changes between these 2 test cases. We really need back-to-back tests that the exact same search returns different results when the paternal data is present/absent. Instead of running a dominant and recessive search each filtered to variant 1 and variant 2, it would probably be better to not use an interval filter at all and test that dominant and recessive search return the expected combo of variants with and without paternal data
Created at 2024/11/04 13:45:23 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this table to make variant2 de novo in WGS.
self._family_has_valid_quality(ht, sample_type.other_sample_type, other_sample_type_family_idx) | ||
) & | ||
self._family_has_valid_inheritance(ht, sample_type, family_idx, other_sample_type_family_idx) & | ||
self._family_has_valid_inheritance(ht, sample_type.other_sample_type, other_sample_type_family_idx, family_idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commenting for posterity that I'm not 100% sure that this is the final logic we want, because it would allow us to return variants that pass quality and fail inheritance in one sample type and fail quality and pass inheritance in the other, meaning theres no sample type that clearly passes both inheritance and quality. However, I think we maybe do want to return these, the logic gets kind of confusing and I can't quite be sure these would not be helpful. I think being overly permissive here is better, if the analysts are seeing a bunch of cases where they ultimately think that the returned variants are not helpful and should be filtered out we can always go back later and make this a stricter criteria, so we should leave this as is for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense, we may want to consider quality and inheritance passing together and handle it differently (the &
seems like it could be too simple) but it's not clear to me what the logic/change would be. I agree that trying this out and getting feedback from analysts before we do that is the way to go.
hail_search/test_search.py
Outdated
# Variant 2 in family_2 is inherited in exome and there is no parental data in genome. | ||
# Genome and exome pass quality and inheritance, show genotypes for both sample types. | ||
variant2_interval = ['1', 38724418, 38724420] | ||
# Same variants, but genome data is proband-only. | ||
inheritance_mode = 'recessive' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test file would be more readable if instead of toggling the inheritance back and forth you run both recessive searches back to back and then both de novo searches, so its clearer that nothings changing other than the parental data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line is unneeded
hail_search/test_search.py
Outdated
intervals=[variant1_interval] | ||
[VARIANT1_BOTH_SAMPLE_TYPES, VARIANT2_BOTH_SAMPLE_TYPES, VARIANT4_BOTH_SAMPLE_TYPES], | ||
sample_data=FAMILY_2_BOTH_SAMPLE_TYPE_SAMPLE_DATA, inheritance_mode=inheritance_mode, | ||
**COMP_HET_ALL_PASS_FILTERS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't need COMP_HET_ALL_PASS_FILTERS
for de novo searches
hail_search/test_search.py
Outdated
# Genome passes quality and inheritance, show genotypes for both sample types. | ||
variant1_interval = ['1', 10438, 10440] | ||
# Variant 1 is de novo in exome but maternally inherited in genome. | ||
# Variant 2 is inherited in exome and de novo in genome. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would clarify in the comment that inherited means "inherited and homozygous" as usually maternally inherited means mom has one alt allele and proband inherited that and has 1 alt allele
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You updated the comment for variant 1 but not variant 2, de novo is usually a het call so its still confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it would be better if these variants are heterozygous and inherited? That seems like it's more representative of what we'd see in real searches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having the ability to have at least one variant passing for homozygous recessive search is valuable for testing and I like the coverage we get from this variant configuration. And we do sometimes see de novo homozygotes, we just don't refer to it as plain "de novo" which is why I think a comment update is sufficient
hail_search/test_search.py
Outdated
# Variant 2 in family_2 is inherited in exome and there is no parental data in genome. | ||
# Genome and exome pass quality and inheritance, show genotypes for both sample types. | ||
variant2_interval = ['1', 38724418, 38724420] | ||
# Same variants, but genome data is proband-only. | ||
inheritance_mode = 'recessive' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line is unneeded
hail_search/test_search.py
Outdated
) | ||
# Genome passes quality and inheritance exome fails inheritance (parental data shows variant is inherited). | ||
inheritance_mode = 'de_novo' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line is unneeded
hail_search/test_search.py
Outdated
# Genome passes quality and inheritance, show genotypes for both sample types. | ||
variant1_interval = ['1', 10438, 10440] | ||
# Variant 1 is de novo in exome but maternally inherited in genome. | ||
# Variant 2 is inherited in exome and de novo in genome. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You updated the comment for variant 1 but not variant 2, de novo is usually a het call so its still confusing
) | ||
# Genome passes quality and inheritance exome fails inheritance (parental data shows variant is inherited). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think a comment here explaining whats being tested is helpful. Maybe something like "Variant 2 fails inheritance when parental data is present"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated!
No description provided.