
add retry logic and cleaner WS shutdown #622

Closed
mbartsch wants to merge 1013 commits into Unmanic:staging from mbartsch:mbartsch/staging-ai

Conversation

@mbartsch


Pull request

CLA

  • I agree that by opening a pull request I am handing over copyright ownership
    of my work contained in that pull request to the Unmanic project and the project
    owner. My contribution will become licensed under the same license as the overall project.
    This extends upon paragraph 11 of the Terms & Conditions stipulated in the GPL v3.0

Checklist

  • I have ensured that my pull request is being opened to merge into the staging branch.

  • I have ensured that all new python file contributions contain the correct header as
    stipulated in the Contributing Docs.

Description of the pull request

This PR contains two fixes that prevent crashes in production environments:

  1. WebSocket async task cleanup: pending async tasks now exit gracefully when a connection closes, preventing "Task was destroyed but it is pending!" errors.
  2. Database lock error handling: SQLite "database is locked" errors are retried with exponential backoff, preventing Foreman thread crashes.

Disclaimer: AI assisted PR

Josh5 and others added 30 commits August 29, 2024 10:35
Sometimes I need a little reminder of what I am doing on this project.
```
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see pypa/pip#5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
```
Grafana does not understand that and so everything now has no level formatting
rgregg and others added 23 commits June 9, 2026 10:29
Manually imported from Unmanic#617 after the source PR branch was no longer available for merge.

Imported only the runtime fix in unmanic/libs/plugins.py.

The original test file from the PR was intentionally omitted from this manual import.
This ensures that during remote postprocessing we never delete the source file without first confirming that the destination file has been copied successfully.
Carry out TZ conversion when reading back data.
This will probably mess some things up for some people for a day, sorry. But this is probably a worthy change moving forward to better accommodate viewing this data.
… installations when it pushes a new task

This means we only need to manage library configs in our main installation, which will then seed any updates to the library config to all the other installations in our links.
This will prevent an error log and notification when the datastore is temporarily down for a restart or something
…ote installation library even when that library has "Configure Library for receiving remote files only" enabled

This "Configure Library for receiving remote files only" option defines a library as being managed for linking only. But I do not think that should stop it from being able to receive file paths for creating the task, instead of always needing to use HTTP uploads.
update Docker file to install correct intel media driver
Root cause: Async tasks spawned via spawn_callback() were not immediately
interrupted when WebSocket closed. Tasks would be awaiting gen.sleep() or
write_message() when on_close() set the flags to False. Setting flags to
False doesn't immediately interrupt pending awaits, so tasks would try
writing to a closed connection and fail.

Solution: Check self.close_event.is_set() before and during sleep periods
in all 5 async_* methods. This allows tasks to exit immediately when the
connection closes, before they attempt to write to a closed socket.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
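
A minimal sketch of the pattern this commit message describes, not the code from the PR. It assumes `close_event` is an `asyncio.Event` and uses plain asyncio so that it runs on its own. The class `StatusPusher` and the helpers `sleep_or_close` and `demo` are hypothetical names, and `write_message` is a stand-in for the real socket write.

```python
import asyncio


class StatusPusher:
    """Hypothetical stand-in for a WebSocket handler; not Unmanic's class."""

    def __init__(self):
        self.close_event = asyncio.Event()  # assumption: an asyncio.Event
        self.sent = []

    async def write_message(self, message):
        # Stand-in for the real socket write.
        self.sent.append(message)

    async def sleep_or_close(self, seconds):
        """Wait up to `seconds`; return True if the connection closed meanwhile."""
        try:
            await asyncio.wait_for(self.close_event.wait(), timeout=seconds)
            return True
        except asyncio.TimeoutError:
            return False

    async def async_push_status(self, interval=0.01):
        # Check the event before each write, and again while sleeping.
        while not self.close_event.is_set():
            await self.write_message("status")
            if await self.sleep_or_close(interval):
                break  # closed during the sleep: exit before the next write

    def on_close(self):
        # Setting the event wakes any task waiting in sleep_or_close().
        self.close_event.set()


async def demo():
    handler = StatusPusher()
    task = asyncio.ensure_future(handler.async_push_status())
    await asyncio.sleep(0.05)   # let a few messages go out
    handler.on_close()          # simulate the socket closing
    await asyncio.wait_for(task, timeout=1)
    return handler, task
```

Waiting on the event with a timeout, rather than sleeping and then checking a flag, is what lets the task stop as soon as `on_close()` runs instead of at the end of the current sleep.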
When SQLite encounters concurrent access, it may raise "database is locked"
errors. Previously, these would crash the Foreman thread with an unhandled
exception and no logging. Now:

1. Foreman thread exception handling:
   - Log the exception with full context using logger.exception()
   - Wait 5 seconds before continuing instead of crashing
   - Prevents thread death when transient database errors occur

2. Task queue fetch with retry logic:
   - Detect "database is locked" errors specifically
   - Retry up to 3 times with exponential backoff (0.1s, 0.2s, 0.4s)
   - Log retry attempts and failures for debugging
   - Re-raise errors that aren't lock-related (actual problems)

This handles the common case where multiple Unmanic instances compete for
database access (especially common in multi-machine setups). Lock errors
are transient and typically resolve quickly with a brief wait.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
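
The retry behaviour described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `call_with_lock_retry` is a hypothetical helper, `sqlite3.OperationalError` stands in for whatever exception class the project's database layer raises, and "up to 3 times" is read as three retries after the first attempt, which gives the 0.1s, 0.2s, 0.4s waits.

```python
import logging
import sqlite3
import time

logger = logging.getLogger(__name__)


def call_with_lock_retry(operation, max_retries=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(); retry on "database is locked" with doubling waits.

    Hypothetical helper. `sleep` is injectable so callers and tests can
    avoid real delays.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except sqlite3.OperationalError as error:
            if "database is locked" not in str(error).lower():
                raise  # not a lock error: surface it unchanged
            if attempt == max_retries:
                logger.error("Database still locked after %d retries", max_retries)
                raise
            delay = base_delay * (2 ** attempt)  # 0.1, 0.2, 0.4 by default
            logger.warning("Database is locked; retry %d/%d in %.1fs",
                           attempt + 1, max_retries, delay)
            sleep(delay)
```

A caller would wrap only the database access, for example `call_with_lock_retry(lambda: fetch_next_task())`, where `fetch_next_task` is a placeholder for the real query.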
Output 'No idle workers available; pending tasks waiting for worker availability'
only when the total configured worker count is greater than 0. This prevents
confusing log messages when no workers are configured at all.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
When saving task objects (including command logs), SQLite may raise
'database is locked' errors due to concurrent access. Now implements
exponential backoff retry logic (0.1s, 0.2s, 0.4s) up to 3 attempts
before giving up. This prevents crashes when multiple machines or
threads compete for database access.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Added retry logic with exponential backoff to the build_tasks_query()
function to handle SQLite 'database is locked' errors that occur during
concurrent task lookups. Retries up to 3 times with 0.1s, 0.2s, 0.4s
waits before giving up.

This protects the initial task query/selection phase in addition to
the existing protections in task claiming and saving.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@mbartsch mbartsch changed the title from "Mbartsch/staging ai" to "add retry logic and cleaner WS shutdown" Jun 17, 2026
Added configurable timeout for remote installation API requests with
exponential backoff retry logic (0.5s, 1s, 2s) up to 3 attempts.

Changes:
- New config option: remote_installation_request_timeout (default: 10s)
- Updated remote_api_get/post/delete with retry logic
- Handles Timeout and ConnectionError exceptions gracefully
- Increased default from 2s to 10s for slow/high-latency networks

This allows users to adjust timeouts for their network conditions and
prevents transient connection issues from failing requests.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
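
As a rough illustration of a configurable timeout combined with retries, the sketch below takes the request as a callable so that it does not depend on any HTTP client. Everything in it is an assumption rather than the PR's code: `request_with_retry` and `send` are hypothetical names, the built-in `TimeoutError` and `ConnectionError` stand in for the client's own exception classes (pass those via `retry_on`), and "up to 3 attempts" is read as three calls in total, so only the 0.5s and 1s waits occur by default.

```python
import time


def request_with_retry(send, timeout=10, max_attempts=3, base_delay=0.5,
                       retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call send(timeout=...) and retry on timeout or connection errors.

    Hypothetical helper. `timeout` plays the role of the
    remote_installation_request_timeout setting (default 10 seconds).
    """
    for attempt in range(max_attempts):
        try:
            return send(timeout=timeout)
        except retry_on:
            if attempt + 1 == max_attempts:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * (2 ** attempt))  # 0.5s, then 1s, then 2s
```

With a larger `max_attempts` the third wait of 2s is reached as well. Errors outside `retry_on` are not caught, so they propagate on the first attempt.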
@mbartsch mbartsch closed this Jun 17, 2026
@mbartsch mbartsch force-pushed the mbartsch/staging-ai branch from d77bf7f to affd691 Compare June 17, 2026 11:05
@mbartsch mbartsch deleted the mbartsch/staging-ai branch June 17, 2026 11:06
@mbartsch mbartsch restored the mbartsch/staging-ai branch June 17, 2026 11:06
@mbartsch mbartsch deleted the mbartsch/staging-ai branch June 17, 2026 11:12