Video-LLaVA: Harden training pipeline, fix model embedding, and overhaul docs#3

Open
rodosingh wants to merge 5 commits into `main` from `duetvlm_s`
Conversation

@rodosingh (Member)

Summary

  • Harden Video-LLaVA training and inference pipeline — validate file existence before loading image/video data, handle corrupted video frames with fallback, fix text-only sample initialization, and tune DeepSpeed config (ZeRO stage 1, lower loss scale power) to mitigate NaN during fine-tuning
  • Refactor model embedding logic — rewrite image token position tracking in llava_arch.py to correctly handle multi-image inputs and fix prompt_len computation; add longest common contiguous subsequence utility for PDrop salient token matching
  • Overhaul README — add prerequisites (transformers patch, model consolidation), expand evaluation and training docs for LLaVA-1.5, 1.6, Video-LLaVA, and Qwen2.5-VL with full parametrized examples and token budget presets
  • Cleanup — add missing dependencies (nltk, mpi4py, openai, en-core-web-sm, huggingface_hub[hf_xet]), remove dead visionzip setup block, and delete 28 deprecated pdrop scripts superseded by scripts/llava/

Test plan

  • Video-LLaVA fine-tuning completes without NaN loss on a representative dataset
  • Inference handles videos with corrupted/missing frames without crashing
  • Text-only samples in mixed datasets are processed correctly
  • All eval scripts under scripts/llava/ and scripts/videollava/ still work after deprecated script removal
  • pip install -e . succeeds with the updated setup.py dependencies

Made with Cursor

- Add prerequisites section (transformers patch, model consolidation)
- Expand evaluation docs for LLaVA-1.5, 1.6, Video-LLaVA, and Qwen
- Add full parametrized training examples with token budget presets
- Update directory tree to reflect current project structure
- Add en-core-web-sm, nltk, mpi4py, openai, huggingface_hub[hf_xet], ipdb
- Remove dead commented-out visionzip setup block
- Refactor image token position tracking to handle multi-image
  inputs and fix prompt_len computation in llava_arch.py
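The multi-image fix boils down to tracking every image placeholder position rather than only the first one. A minimal sketch of that idea (the function name is illustrative, and the `-200` placeholder id is an assumption based on LLaVA's convention, not copied from this diff):

```python
IMAGE_TOKEN_INDEX = -200  # LLaVA-style image placeholder id (assumed value)

def image_token_positions(input_ids):
    """Indices of every image placeholder in a flat list of token ids.

    With multi-image inputs there is one placeholder per image, so keeping
    all positions lets each image's features be spliced into the right slot
    when the embedding sequence is assembled.
    """
    return [i for i, tok in enumerate(input_ids) if tok == IMAGE_TOKEN_INDEX]

ids = [1, 32, -200, 7, -200, 2]
positions = image_token_positions(ids)  # [2, 4]
# prompt_len here means the text span before the first image placeholder:
prompt_len = positions[0] if positions else len(ids)  # 2
```

Tracking a list of positions instead of a single scalar is what makes the two-or-more-images case fall out naturally.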
- Add fallback for corrupted/missing video frames in
  processing_video.py (repeat last valid frame or use black)
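The fallback policy above can be sketched as follows; this is a hedged illustration assuming the decoder yields `None` for frames it could not read, not the literal code in `processing_video.py`:

```python
import numpy as np

def repair_frames(frames, height, width):
    """Replace unreadable (None) frames in a decoded sequence.

    Policy: repeat the last valid frame; if no valid frame has been seen
    yet (the video is corrupted from the start), substitute a black frame
    so downstream processing always receives a full-length clip.
    """
    black = np.zeros((height, width, 3), dtype=np.uint8)
    last_valid = None
    repaired = []
    for frame in frames:
        if frame is None:
            frame = last_valid if last_valid is not None else black
        else:
            last_valid = frame
        repaired.append(frame)
    return repaired
```

Repeating the last valid frame preserves temporal continuity, while the black-frame fallback guarantees the clip length stays fixed even for fully corrupted prefixes.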
- Add longest common contiguous subsequence utility in utils.py
  for PDrop salient token matching
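A longest common contiguous subsequence (i.e., longest common substring over token sequences) is a standard DP; a self-contained sketch, not necessarily the exact signature used in `utils.py`:

```python
def longest_common_contiguous_subsequence(a, b):
    """Longest run of items appearing contiguously in both sequences.

    Returns (start_in_a, start_in_b, length). O(len(a) * len(b)) time,
    O(len(b)) space: dp tracks the length of the match-run ending at
    each pair of positions, keeping only the previous row.
    """
    best_len, best_i, best_j = 0, 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_i, best_j = i - best_len, j - best_len
        prev = cur
    return best_i, best_j, best_len
```

For salient-token matching, `a` and `b` would be two token-id sequences, and the returned span identifies the aligned region.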
- Add file existence validation before loading image/video data
  to skip missing or zero-byte files gracefully
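The validation amounts to a cheap pre-check before any decode attempt; a minimal sketch (helper name is illustrative):

```python
import os

def is_loadable(path):
    """True only for paths that exist and are non-empty regular files.

    Checking size > 0 catches zero-byte files, which exist on disk but
    would make image/video decoders raise mid-batch; callers can skip
    such samples gracefully instead of crashing the data loader.
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

Running this check at sample-selection time keeps a single bad file from aborting a long fine-tuning run.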
- Fix text-only samples missing image list initialization
- Add optional training throughput statistics reporting
- Reduce DeepSpeed initial_scale_power (16->10) and switch to
  ZeRO stage 1 to mitigate NaN loss during fine-tuning
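In DeepSpeed config terms, these two changes touch `fp16.initial_scale_power` and `zero_optimization.stage`; a fragment showing just the affected keys (surrounding options omitted):

```json
{
  "fp16": {
    "enabled": true,
    "initial_scale_power": 10
  },
  "zero_optimization": {
    "stage": 1
  }
}
```

A lower `initial_scale_power` starts the dynamic loss scale at 2^10 instead of 2^16, reducing early-step fp16 overflow that can surface as NaN loss.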

These scripts were superseded by the unified scripts under
scripts/llava/v1_5/ and scripts/llava/v1_6/. Removing them to
avoid confusion with the canonical script locations.
