Calculation job can be submitted twice if first submission succeeds but communication of result from scheduler to AiiDA times out #3404

@Zeleznyj

Description

I've encountered an issue where a calculation is sometimes shown as finished in AiiDA while the actual calculation on the remote computer is still running. AiiDA retrieves the files and runs the parser without showing any error. This happened with the SSH transport and the SLURM scheduler. I'm not sure the problem is specific to SLURM, though, since we currently don't use other schedulers much. The calculations use our own FPLO calculation plugins. It is possible that the issue is related to a problem in the plugins, but to me it looks like a problem in AiiDA, since everything on our side is working fine: the calculation is submitted correctly and finishes correctly, and the only problem is that the results are retrieved before the remote calculation has finished. This looks like a problem with parsing the queue status. The problem happens randomly; when we resubmit a calculation, it usually finishes fine.

I noticed the problem after checking out the develop branch a couple of days ago, but most likely it also existed before, when I was using the 1.0.06b version.

I can try to include more details, but I'm not sure where to start debugging this.
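One possible starting point for narrowing this down is to compare the jobs AiiDA considers finished against what the scheduler itself reports. The sketch below is only an illustration and not AiiDA code: it assumes the queue listing was captured on the remote machine with `squeue -h -o "%i %T"` (job ID and state, no header), and the helper names `parse_squeue` and `still_active` are hypothetical. The set of states treated as "active" is also an assumption and may need adjusting for a given cluster.

```python
from typing import Dict, Iterable, List

# Assumption: SLURM states that mean the job has not finished yet.
ACTIVE_STATES = {"PENDING", "CONFIGURING", "RUNNING", "SUSPENDED", "COMPLETING"}


def parse_squeue(output: str) -> Dict[str, str]:
    """Map job ID -> state from the text of `squeue -h -o "%i %T"`."""
    jobs = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            # Skip blank or unexpected lines instead of guessing.
            continue
        job_id, state = parts
        jobs[job_id] = state
    return jobs


def still_active(output: str, finished_job_ids: Iterable[str]) -> List[str]:
    """Return the job IDs that were marked finished but are still queued or running."""
    jobs = parse_squeue(output)
    return [
        job_id
        for job_id in finished_job_ids
        if jobs.get(str(job_id)) in ACTIVE_STATES
    ]


if __name__ == "__main__":
    # Hypothetical captured output and job IDs, for illustration only.
    sample = "1001 RUNNING\n1002 PENDING\n\n1003 COMPLETED\n"
    print(still_active(sample, ["1001", "1003", "1004"]))
```

Any job ID returned by `still_active` would be a case where the results were retrieved while the scheduler still listed the job, which could help confirm whether the queue status is being misread.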
