Hotstart overwrites previous output #56
Comments
I believe the actual issue in this case is that when the specified output directory is the same as the directory where the resume file is located, the new writes overwrite the old ones, since the file counters aren't kept across resumes and therefore restart at 0. This should be relatively easy to check by comparing the write timestamps of the files with the corresponding simulated times in the index files. Could you verify that, please? |
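For example, a check along these lines should be enough (a hypothetical Python sketch; it assumes the index file is a ParaView `.pvd` collection with `timestep` and `file` attributes, and the index file name is only illustrative):

```python
# Compare each data file's modification time with its simulated time from the
# index file: files rewritten after a resume show a recent mtime despite an
# early simulated time.
import os
import xml.etree.ElementTree as ET
from datetime import datetime

data_dir = "/home/user/nfs_fs02/high_res/data"      # directory from this report
index_file = os.path.join(data_dir, "VTUinp.pvd")   # hypothetical index file name

for ds in ET.parse(index_file).iter("DataSet"):
    path = os.path.join(data_dir, ds.get("file"))
    mtime = datetime.fromtimestamp(os.path.getmtime(path))
    print(f"{ds.get('file')}: t_sim = {ds.get('timestep')}, written {mtime}")
```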
Ahhh...yes. I believe you are correct. I checked my output files and PART_00000.vtp - PART_00018.vtp are all timestamped at a later date. Since I started at hot_00082.bin and there are 100 output files, there would be 18 files left to create. I can also confirm this by looking at the last hotstart file, which is hot_00018.bin. |
Thanks. I've taken the liberty of renaming the issue title. I've been thinking about the possible approaches to solve this. I can think of three, some of which have more far-reaching implications:
In fact, 3. could probably be extended to standard (non-resume) writes as well. Suggestions and opinions welcome. |
The bin file is by default in save-dir/data; what about just checking if the parent of the hotfile dir does not match the save dir? |
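The detection itself would amount to little more than the following (a Python sketch of the idea only, since the actual check would live in GPUSPH's C++ option handling; the function name is illustrative):

```python
import os

def resume_writes_into_source(resume_file, save_dir):
    """True if the hotfile lives under save_dir/data, i.e. resuming would
    write back into the very directory the hotfile came from."""
    hotfile_dir = os.path.dirname(os.path.abspath(resume_file))        # .../high_res/data
    return os.path.realpath(os.path.dirname(hotfile_dir)) == os.path.realpath(save_dir)

# With the paths reported in this issue this returns True:
print(resume_writes_into_source(
    "/home/user/nfs_fs02/high_res/data/hot_00082.bin",
    "/home/user/nfs_fs02/high_res"))
```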
The question is not how to detect the situation but what to do about it. Should we just abort, or resume but put stuff in a different subdir? |
I would prefer the solution be found with automation in mind. If my job is running with SLURM and gets killed, it will be queued for automatic restart. I've already modified GPUSPH to handle "auto resume", but clearly the internal counter status is not saved. |
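For reference, such an auto-resume wrapper can be as simple as the following (a hypothetical Python sketch, not the actual modification mentioned above; the paths and flags are the ones reported in this issue):

```python
#!/usr/bin/env python3
# Relaunch GPUSPH, resuming from the newest hotstart file if one exists,
# so that a requeued SLURM job continues instead of restarting from zero.
import glob
import os
import subprocess
import sys

save_dir = "/home/user/nfs_fs02/high_res"
cmd = ["./GPUSPH", "--deltap", "0.005", "--dir", save_dir]

hotfiles = sorted(glob.glob(os.path.join(save_dir, "data", "hot_*.bin")))
if hotfiles:
    cmd += ["--resume", hotfiles[-1]]   # newest checkpoint (names sort lexicographically)

sys.exit(subprocess.call(cmd))
```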
Hello @saltynexus, the issue with your 2nd proposal is that different writers will have different counters, and trying to map simulated time to counter is unreliable (off by one depending on the actual time-step, extra saves requested by the problem, etc). This is particularly relevant when resuming from any hotfile but the last (which might be the case e.g. if the last hotfile is corrupt), but also for the last hotfile case (e.g. if GPUSPH got terminated abruptly during or right after a VTKWRITER checkpoint, but before the next HOTWRITER checkpoint). This is in addition to the fact that several writers don't take very well to resumes anyway, because they either have a single output file (e.g. the common writer produces most data as “append only”) or they have additional metadata that would have to be reloaded to be kept in sync. This makes the implementation of the counter resume non-trivial in coding terms (i.e. it will take longer to get to a reliably working solution). I'm not particularly happy about the subdir/altdir solution either, but it could be a stopgap to avoid data loss until the proper counter resume is implemented. |
@Oblomov OK, so we both agree that the subdir option is a temporary fix and that the real solution lies within the workings of the counter. If I knew more about how GPUSPH works in this regard, I could offer advice, but I'm a new user and still learning. As for the subdir option, I know you said "
I'm totally fine with this temporary solution. However, can you advise on how to perform the post-processing? Will GPUSPH stitch the output files together, or are we responsible for this? Maybe it's not necessary? Would renaming the files in order and then placing them all in one directory work? Again, I'm asking because I'm a new user of GPUSPH and all I know is to point Paraview to the "data" directory and load the VTK group files. |
Hello @saltynexus, that would indeed be the general idea. Padding the name of the resume directory is a good idea if we assume a worst-case scenario where more than 10 resumes are needed. For data visualization and post-processing, rather than opening the VTK file groups directly (which wouldn't work out-of-the-box on resume due to the counter restart), something that should work almost out of the box would be to open the index files. I think it should also be relatively easy to write a post-processing script that takes the index files from all the resume directories and builds a new index file, and possibly symlinks all the data files (reindexing them as appropriate) into a new 'recovery' directory, where at least the VTKWriter output can be perused as if it was the usual "data" directory of a single uninterrupted run. |
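A rough outline of such a script (a hypothetical Python sketch; it assumes each run directory holds a ParaView `.pvd` index, with the file name and directory layout below purely illustrative, and it lets later resumes shadow earlier ones where simulated times overlap):

```python
# Merge the index files from the original run and its resume subdirectories
# into a single "recovery" index that ParaView can open directly.
import os
import xml.etree.ElementTree as ET

run_dirs = ["data", "data/resume0", "data/resume1"]   # illustrative layout
index_name = "VTUinp.pvd"                             # hypothetical index file name

merged = {}   # simulated time -> data file path; later resumes overwrite earlier ones
for d in run_dirs:
    for ds in ET.parse(os.path.join(d, index_name)).iter("DataSet"):
        merged[float(ds.get("timestep"))] = os.path.join(d, ds.get("file"))

root = ET.Element("VTKFile", type="Collection", version="0.1")
coll = ET.SubElement(root, "Collection")
for t in sorted(merged):
    ET.SubElement(coll, "DataSet", timestep=str(t), group="", part="0", file=merged[t])
ET.ElementTree(root).write("recovery.pvd", encoding="UTF-8", xml_declaration=True)
```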
Awesome...sounds like a plan! |
@Oblomov I'm sure you have a lot on your plate, but I just wanted to see if you can provide an estimated timeline on when this temporary solution will become available? |
Before this change, when resuming into the same subdir that we resumed from, the older data files would be overwritten, because we do not track the writers' file indices, and resuming for some writers would be very non-trivial (metadata handling etc.). To avoid data loss, following the discussion around issue #56 on GitHub, we enact the following policy: 1. nothing changes if we resume into a new directory; 2. we prevent resuming into an existing directory that is _not_ the same we are resuming from; 3. when resuming into the same directory we resume from, the actual problem dir is shifted into the first available `resumeN` subdir.
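In pseudocode, that directory-handling policy boils down to the following (a Python sketch of the decision logic only; the actual implementation is in GPUSPH's C++ and the function name is illustrative):

```python
import os

def choose_problem_dir(save_dir, resume_file):
    """Sketch of the three-case resume policy quoted above."""
    # Directory the hotfile was written from, e.g. .../high_res for .../high_res/data/hot_00082.bin
    resume_src = os.path.dirname(os.path.dirname(os.path.abspath(resume_file)))

    if not os.path.exists(save_dir):
        return save_dir          # 1. resuming into a new directory: nothing changes
    if os.path.realpath(save_dir) != os.path.realpath(resume_src):
        raise SystemExit("refusing to resume into an existing, unrelated directory")  # 2.
    n = 0                        # 3. same directory: shift into the first free resumeN subdir
    while os.path.exists(os.path.join(save_dir, f"resume{n}")):
        n += 1
    return os.path.join(save_dir, f"resume{n}")
```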
Hello @saltynexus, I've just pushed to the |
@Oblomov Thank you for your support with this! |
@Oblomov As far as the "merge the VTUs" stuff goes, I reached out to the ParaView group and got some good advice (https://discourse.paraview.org/t/how-to-merge-multiple-pvd-and-vtp-files/3506/6). I haven't tried testing or developing the script yet but it's a good start. I'll probably work on it early next week and if I make any progress I'll share with you. |
@saltynexus that's very good news. The proposed idea for merging the PVDs also looks very promising and surprisingly simple. Excellent. Thanks for looking into this. |
Bug description
During hotstart, GPUSPH fails to write output to the specified directory.
Summary
I'm currently running GPUSPH on a cluster, which uses SLURM scheduling. The cluster scheduling is configured to give priority to certain users. In one instance, my job was killed during execution. I therefore attempted to resume the job using a hotstart file. GPUSPH successfully read the hotstart file and the simulation carried on as expected.
After the job finished, I checked my output directory and noticed that there was no output generated following the hotstart. The only output present was that associated with the initial simulation, prior to the job being killed.
This is the command that I executed in the initial job submission
./GPUSPH --deltap 0.005 --dir /home/user/nfs_fs02/high_res
This is the command that I executed after the job was killed to resume
./GPUSPH --deltap 0.005 --dir /home/user/nfs_fs02/high_res --resume /home/user/nfs_fs02/high_res/data/hot_00082.bin
The simulation is a modified version of the "WaveTank" example test case provided with the GPUSPH source code downloaded from GitHub (master branch). The only thing that I changed was the removal of the slope in the experiment. I've run it in the past and it works as intended, so I'm 99.9% sure it has nothing to do with the specific application.
I suspect that the bug might be related to me specifying the output directory (non default). Somewhere in the hotstart procedure, it fails to properly identify that output is requested and where it is to be generated.
Details
Here is my error log
and here is my output log
The "git_branch.txt" output is
The "make_show.txt" output is
The "summary.txt" output is