Parallelize summary methods #56
I ran a test script on my desktop to check the timings today. Here's the script:

```python
import time

from yammbs import MoleculeStore

# full file from hpc3 after running an industry benchmark
store = MoleculeStore("/tmp/test.sqlite")

# doesn't have to exist, just has to match the force_field column in
# mm_conformers table
forcefield = "forcefields/new-tm-2.2.offxml"

for label, fn in [
    ("dde", MoleculeStore.get_dde),
    ("rmsd", MoleculeStore.get_rmsd),
    ("tfd", MoleculeStore.get_tfd),
    ("icrmsd", MoleculeStore.get_internal_coordinate_rmsd),
]:
    start = time.time()
    fn(store, forcefield, skip_check=True)
    print(f"{label}: {time.time() - start:.2f} s")
```

And the output:
Clearly the internal-coordinate RMSDs are the slowest by a big margin. I'd be interested in working on speeding this up, but I'm also happy for you to handle it if you've already started/are interested yourself! The sqlite file I used for this was 318 MB, but I'm sure I can find a way to get it to you if you want to test locally too.
I repeated the benchmark above on a much smaller dataset (60 molecules vs 9873 molecules) to get a more useful reference for repeated benchmarks and got the following results:
Then I used this profiling script to look for hot spots in the internal coordinate RMSD code:

```python
import cProfile
from pstats import SortKey

from yammbs import MoleculeStore

store = MoleculeStore("openff-2.2.0.sqlite")
forcefield = "openff-2.2.0.offxml"

with cProfile.Profile() as pr:
    store.get_internal_coordinate_rmsd(forcefield, skip_check=True)

pr.print_stats(SortKey.CUMULATIVE)
```

Most of the time is obviously spent in geomeTRIC's internal-coordinate code. Here's the top of the profile output too. I can upload the whole thing if it's of interest.
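If the full profile is worth sharing, one option (an assumption on my part, not something shown in this thread) is to dump the stats to a file and reload them with `pstats`. A standalone sketch with a placeholder workload:

```python
import cProfile
import pstats
from pstats import SortKey

def work():
    # placeholder workload; in practice this would be
    # store.get_internal_coordinate_rmsd(forcefield, skip_check=True)
    sum(i * i for i in range(1_000_000))

with cProfile.Profile() as pr:
    work()

# write the complete profile to disk so it can be attached to the issue
# or inspected later without re-running the benchmark
pr.dump_stats("icrmsd.prof")

stats = pstats.Stats("icrmsd.prof")
stats.sort_stats(SortKey.CUMULATIVE).print_stats(25)
```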
Do we know if geomeTRIC itself is optimized for this sort of behavior? I don't know where it's been used at scale, so I don't want to run past "can this be optimized for performance?" if people haven't used it before in the way we are.
That's a good point; I'm not sure. I saw how much time it was spending doing `numpy.cross` products and assumed it was mostly doing necessary math, but from a quick peek at the code it could certainly be doing stuff that we don't need. I'll take a closer look at this before slapping multiprocessing on it.
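For context, the math in question looks roughly like the sketch below. This is not code from yammbs or geomeTRIC, just a minimal illustration of how a valence angle is typically computed from the cross and dot products of the two bond vectors, which is the sort of `numpy.cross` call the profile was full of:

```python
import numpy as np

def valence_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Return the a-b-c angle in radians.

    atan2(|u x v|, u . v) is the numerically stable textbook formula;
    each evaluation costs one np.cross and one np.dot.
    """
    u = a - b
    v = c - b
    return float(np.arctan2(np.linalg.norm(np.cross(u, v)), np.dot(u, v)))

# One call per angle per conformer adds up quickly across thousands of
# molecules, even though each individual call is cheap.
coords = np.random.default_rng(0).random((3, 3))
print(valence_angle(coords[0], coords[1], coords[2]))
```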
I looked a bit closer, and I don't really think this is the intended usage. geometric itself appears to keep around the … I copied over the parts of the … The most interesting part of geometric for our use case is the … Otherwise, I have a rough draft with multiprocessing that takes the profiling run above from 19.8 seconds down to 4.5 seconds with 8 CPUs. It possibly needs one more modification to use a generator instead of a list to avoid allocating too much memory, but it's working and passing the current yammbs analysis tests, at least.
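The multiprocessing draft itself isn't shown here, so the following is only a minimal sketch of the general shape, assuming a hypothetical per-molecule helper (`_per_molecule_rmsd`) and driver (`parallel_summary`) that are not part of yammbs. The point it illustrates is that `Pool.imap` consumes a generator lazily, which addresses the memory concern mentioned above:

```python
from multiprocessing import Pool

import numpy as np

def _per_molecule_rmsd(args):
    # Hypothetical stand-in for the per-molecule work: the real draft would
    # compute internal-coordinate RMSDs here; this just compares Cartesian
    # coordinates so the sketch runs on its own.
    molecule_id, qm_coords, mm_coords = args
    rmsd = float(np.sqrt(np.mean((qm_coords - mm_coords) ** 2)))
    return molecule_id, rmsd

def parallel_summary(work_items, n_processes=8, chunksize=10):
    """Fan per-molecule work out over a process pool.

    work_items may be a generator: Pool.imap pulls items as workers free
    up, so the whole dataset never has to be materialized as a list.
    """
    with Pool(processes=n_processes) as pool:
        # chunksize > 1 amortizes pickling overhead across tasks
        yield from pool.imap(_per_molecule_rmsd, work_items, chunksize=chunksize)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    items = ((i, rng.random((5, 3)), rng.random((5, 3))) for i in range(100))
    for molecule_id, rmsd in parallel_summary(items):
        print(molecule_id, rmsd)
```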
@ntBre reports that `.get_tfd`, `.get_icrmsd`, and possibly other methods can take ~20 minutes in some runs. These operations are almost surely serial, and I haven't ever thought about performance here. This makes a difference if you're paying for compute, especially on runners with many cores.