mark analysis.rdf.InterRDF and analysis.rdf.InterRDF_s as not parallelizable #4884

Open
wants to merge 4 commits into develop

Conversation

@tanishy7777 (Contributor) commented Jan 7, 2025

Fixes #4675

Changes made in this Pull Request:

  • Mark rdf.InterRDF and rdf.InterRDF_s as non-parallelizable

In the _single_frame method of analysis.rdf.InterRDF_s we can see that:

    def _single_frame(self):
        for i, (ag1, ag2) in enumerate(self.ags):
            pairs, dist = distances.capped_distance(
                ag1.positions,
                ag2.positions,
                self._maxrange,
                box=self._ts.dimensions,
            )

            for j, (idx1, idx2) in enumerate(pairs):
                count, _ = np.histogram(dist[j], **self.rdf_settings)
                self.results.count[i][idx1, idx2, :] += count

        if self.norm == "rdf":
            self.volume_cum += self._ts.volume

self.volume_cum is accumulated across frames, so we can't parallelize simply by using the split-apply-combine technique.

Similarly, in the _single_frame method of analysis.rdf.InterRDF, self.volume_cum is accumulated across frames.
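
To make this concrete, here is a toy illustration (illustrative numbers only, not from the codebase) of why a plain accumulator attribute breaks under split-apply-combine:

    import numpy as np

    volumes = np.arange(1.0, 11.0)  # pretend per-frame box volumes, 10 frames

    serial = volumes.sum()          # what self.volume_cum holds after a serial run

    # split-apply: each worker accumulates only over its own slice of frames...
    partials = [volumes[:5].sum(), volumes[5:].sum()]

    # ...so the combine step must explicitly reduce the partial sums again;
    # a plain attribute on the analysis object is not merged automatically.
    assert sum(partials) == serial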

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Developer Certificate of Origin


📚 Documentation preview 📚: https://mdanalysis--4884.org.readthedocs.build/en/4884/

@pep8speaks commented Jan 7, 2025

Hello @tanishy7777! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 317:76: W291 trailing whitespace
Line 581:1: W293 blank line contains whitespace

Line 181:1: E302 expected 2 blank lines, found 1
Line 185:1: E302 expected 2 blank lines, found 1

Line 121:80: E501 line too long (96 > 79 characters)
Line 155:1: W391 blank line at end of file

Line 175:1: E302 expected 2 blank lines, found 1
Line 182:80: E501 line too long (90 > 79 characters)
Line 184:1: E302 expected 2 blank lines, found 1
Line 187:27: E241 multiple spaces after ','

Comment last updated at 2025-01-13 20:07:19 UTC

codecov bot commented Jan 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.63%. Comparing base (6842fd7) to head (de5d5e0).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4884      +/-   ##
===========================================
- Coverage    93.65%   93.63%   -0.03%     
===========================================
  Files          177      189      +12     
  Lines        21795    22870    +1075     
  Branches      3067     3067              
===========================================
+ Hits         20413    21415    +1002     
- Misses         931     1004      +73     
  Partials       451      451              


@marinegor (Contributor)

@tanishy7777 thanks for the prompt PR! I have some comments though:

self.volume_cum is accumulated across frames, so we can't parallelize simply by using the split-apply-combine technique.

Can we just sum it up separately though? I mean, make it a part of self.results for each worker's trajectory, and in _conclude write the sum of it to self.volume_cum.
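
Roughly something like this (a sketch only; the lookup entry for volume_cum can be any callable that reduces the list of per-worker values, e.g. the built-in sum):

    from MDAnalysis.analysis.results import ResultsGroup

    # inside the analysis class, unrelated method bodies elided:
    def _prepare(self):
        ...
        self.results.volume_cum = 0.0           # per-worker accumulator

    def _single_frame(self):
        ...
        if self.norm == "rdf":
            self.results.volume_cum += self._ts.volume

    def _get_aggregator(self):
        return ResultsGroup(lookup={
            "count": ResultsGroup.ndarray_sum,  # histogram counts just add up
            "volume_cum": sum,                  # reduce the per-worker volumes
        })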

@tanishy7777 (Contributor, Author) commented Jan 8, 2025

Can we just sum it up separately though? I mean, make it a part of self.results for each worker's trajectory, and in _conclude write the sum of it to self.volume_cum.

Got it! I have made analysis.rdf.InterRDF parallelizable with this approach, but analysis.rdf.InterRDF_s needs a bit more work.

@tanishy7777 (Contributor, Author) commented Jan 8, 2025

When trying to make analysis.rdf.InterRDF_s parallelizable, I am running into an error when aggregating results.count.

FAILED testsuite\MDAnalysisTests\analysis\test_rdf_s.py::test_nbins[client_InterRDF_s1] - ValueError: setting an array element with a 
sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous p...

I tried to find out why this is happening by looking at the result itself, which is being aggregated along 'count'.

So the arrs that is passed to the ResultsGroup.ndarray_vstack function can be broken into arrs = [arr, arr1] (see the screenshots below).

[screenshots: debugger view of arrs broken into arr and arr1]

I then tried finding the dimensions of the arrays manually, because they weren't converting to a NumPy array.

[screenshots: manually inspecting the dimensions of arr and arr1]

So it's not able to convert to a NumPy array because of inconsistent dimensions, and I am not sure how to resolve this.

Here arr is made up of 2 arrays of shapes (1, 2, 412) and (2, 2, 412), and arr1 is likewise made up of 2 arrays of shapes (1, 2, 412) and (2, 2, 412).
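
A minimal standalone reproduction of the failure (shapes taken from the screenshots above):

    import numpy as np

    # each worker's results.count is a *list* of per-pair arrays whose
    # leading dimensions differ, so the nested structure is ragged
    arr = [np.zeros((1, 2, 412)), np.zeros((2, 2, 412))]
    arr1 = [np.zeros((1, 2, 412)), np.zeros((2, 2, 412))]

    np.vstack([arr, arr1])
    # ValueError: setting an array element with a sequence. The requested
    # array has an inhomogeneous shape after 1 dimensions. ...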

@tanishy7777 (Contributor, Author) commented Jan 8, 2025

Based on the above comment (#4884 (comment)), I think we can mark analysis.rdf.InterRDF_s as non-parallelizable and mark analysis.rdf.InterRDF as parallelizable, because it's not possible to convert an array of inhomogeneous dimensions to a NumPy array, and rdf.InterRDF_s needs NumPy arrays to be passed. So even if we concatenate the arrays as plain lists, performing even basic operations would be hard.

After aggregation via _get_aggregator, when _conclude is run, operations like

self.results.count[i] / norm

won't be possible, as division is not supported between a list and an int.

[screenshot: TypeError raised when dividing a list by an int]
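
In other words, the failure boils down to plain-Python semantics:

    import numpy as np

    count_i = [np.zeros((1, 2, 412)), np.zeros((2, 2, 412))]  # a list, not an ndarray
    count_i / 4
    # TypeError: unsupported operand type(s) for /: 'list' and 'int'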

@marinegor (Contributor)

@tanishy7777 the class should be marked as non-parallelizable only if the algorithm to run it is not actually parallelizable, which I'm not yet convinced is the case for all the mentioned classes.

But I think you're on the right path here, you just need to implement a custom aggregation function instead of those implemented among the ResultsGroup staticmethods -- they obviously don't cover all the possible cases for aggregation, just the basic ones for the user's convenience.

Can you describe what kind of arrays you're trying to aggregate and cannot find an appropriate function for? I didn't quite get it from your screenshots.
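
For reference, the lookup dict passed to ResultsGroup accepts any callable, not only the built-in staticmethods; a sketch with a hypothetical function name:

    from MDAnalysis.analysis.results import ResultsGroup

    def my_count_merger(per_worker_values):
        # receives a list with one entry per worker;
        # return whatever the merged attribute should look like
        ...

    # inside the analysis class:
    def _get_aggregator(self):
        return ResultsGroup(lookup={"count": my_count_merger})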

@tanishy7777 (Contributor, Author)

But I think you're on the right path here, you just need to implement a custom aggregation function instead of those implemented among the ResultsGroup staticmethods -- they obviously don't cover all the possible cases for aggregation, just the basic ones for the user's convenience.

Can you describe what kind of arrays you're trying to aggregate and cannot find an appropriate function for? I didn't quite get it from your screenshots.

So, the array is something like this:

[screenshot: nested structure of the aggregated count list; innermost entries are length-412 arrays]

The [412] denotes an array with 412 elements.

That is why it can't be processed by NumPy directly. But I think that if I modify it using a custom aggregation function like you mentioned, I can sum the entries (by adding the 3 blue arrays of shape 2x412 shown in the picture) and convert the result to a 1x2x412 array? I am not sure about the final dimension it needs to be converted to.
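
Something like this is what I have in mind (a hypothetical helper; shapes as in the screenshots, and histogram counts from different frame blocks should simply add element-wise):

    import numpy as np

    def sum_ragged_counts(per_worker_counts):
        # per_worker_counts holds one list per worker, each containing
        # per-pair count arrays of differing shapes, e.g. (1, 2, 412)
        # and (2, 2, 412), so np.vstack/np.sum cannot be used directly
        merged = [np.copy(a) for a in per_worker_counts[0]]
        for counts in per_worker_counts[1:]:
            for i, a in enumerate(counts):
                merged[i] += a
        return merged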

@marinegor

@marinegor (Contributor)

I am not sure about the final dimension it needs to be converted to

It should be the same as if you'd run it without parallelization. And I assume you want to stack/sum/whatever along the dimension that corresponds to the timestep -- you can probably guess which one it is if you run it on some example with a known number of frames. You can find example trajectories in MDAnalysisTests.
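
For instance (the datafiles below come from MDAnalysisTests.datafiles; the selections are placeholders, any two small groups work for shape inspection):

    import MDAnalysis as mda
    from MDAnalysis.analysis.rdf import InterRDF_s
    from MDAnalysisTests.datafiles import GRO_MEMPROT, XTC_MEMPROT

    u = mda.Universe(GRO_MEMPROT, XTC_MEMPROT)
    ag1 = u.select_atoms("resid 289")
    ag2 = u.select_atoms("resid 51")

    rdf = InterRDF_s(u, [(ag1, ag2)]).run()
    print(u.trajectory.n_frames)                 # known number of frames
    print([c.shape for c in rdf.results.count])  # per-pair (n1, n2, nbins) arrays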

@tanishy7777 (Contributor, Author)

It should be the same as if you'd run it without parallelization. And I assume you want to stack/sum/whatever along the dimension that corresponds to the timestep -- you can probably guess which one it is if you run it on some example with a known number of frames. You can find example trajectories in MDAnalysisTests.

Got it. Will work on that!


Successfully merging this pull request may close these issues.

MDAnalysis.analysis.rdf: Implement parallelization or mark as unparallelizable