-
Notifications
You must be signed in to change notification settings - Fork 23
Fix regrid memory leak #225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix regrid memory leak #225
Conversation
memory will be released when `RegridOperator` goes out of scope
if not released, memory appeared to build up, even if the objects were going out of scope, likely due to ESMF.Manager still being active
unlike regrids, regridc works on subsections of the data sections, so the memory release needed to happen for each subsection
now all create_Field and create_Grid should have a corresponding `destroy()`
if the mask does not change, the fields/grids won't be recreated each time, so if they are destroyed after the first section, the section second will have nothing to work with
Thanks, @ThibHlln - I look forward to looking at this next week. Have a good weekend, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Thibault,
This all looks sound, along the lines we have discussed. I confirm the behaviour you report with this PR vs. 3.10.0.
I've noted couple of small things we should look at before we merge.
Thanks for sorting this out - I especially appreciate you taking the time ensure that everything is consistent by modifying files in cf/regrid/
.
Some of these changes will need to be copied in cf.Field.regridc
, and you are welcome to add those in :), but equally I'm happy to do that in a separate PR after this has been merged.
David
Hi David, Thank you for your careful review. T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - looks good.
@davidhassell - when you return from leave - is there a reason this hasn't been merged despite looking like it is ready / in a mergable state? We don't want to keep Thibault waiting if there isn't a reason. Thanks. |
Ah - no reason! I'll merge it now. Thanks. |
This version includes: - fix for memory leak due to ESMF objects NCAS-CMS/cf-python#225 - lift for size 1 restriction on grid NCAS-CMS/cf-python#249 - fix bug preventing cyclic domain (e.g. global runs) NCAS-CMS/cf-python#237
resolve #222
The special methods
__del__
ofESMF.Field
andESMF.Grid
objects feature calls to their respectivedestroy
methods that release the memory allocated by ESMF in Fortran. When these objects go out of scope, the Python garbage collection ought to call these__del__
methods, yet their allocated memory does not seem to be released properly, as repeatedly regridding fields results in running out of memory after a while.After a discussion with @davidhassell, we suspect that the
ESMF.Manager
may be responsible for not making the memory release possible as long as it stays in scope. However, incf
the Manager is initialised but not bound to a variable name (see e.g. https://github.com/ThibHlln/cf-python/blob/d9cde28d9abd9cfeb6575c737560956f40620942/cf/field.py#L15619), so to me it goes out of scope directly in a Python sense. The ESMF documentation says that "The Manager will be created when the first ESMPy object is created if it is not created explicitly by the user". So I suspect that the Manager stays around (as a Fortran object I suppose) even if not bound to a name in Python's namespace (which may also very well be the case for Fields/Grids objects then).In the
cf
regridding, there are ESMF fields and grids created during each call toregrids
/regridc
. If these are explicitly releasing memory (through calls to their destroy methods), this does seem to eliminate the running out of memory when repeatedly calling the regridding methods. Note, calling the destroy methods on these objects several times does not raise an error, if memory was already released, it remains silent. So this PR does just that, i.e. it explicitly releases memory for all ESMF objects as soon as they are not required anymore. This does not provide as explanation for why the destroy calls in Python's garbage collection are not working/not enough, but it seems to be a cheap and harmless solution that solves the problem.I am attaching these field files if you'd like to test the following script:
Before this PR, after the expensive generation of the weights (it may take a short while), the memory was building up very quickly in the loop until running out of memory (using 8GB of RAM for my test), while with the modifications in this PR the memory now remains stable across the loop.