Video-LLaVA: Harden training pipeline, fix model embedding, and overhaul docs#3

Open
rodosingh wants to merge 5 commits into `main` from `duetvlm_s`
Conversation

@rodosingh (Member)

Summary

  • Harden Video-LLaVA training and inference pipeline — validate file existence before loading image/video data, handle corrupted video frames with fallback, fix text-only sample initialization, and tune DeepSpeed config (ZeRO stage 1, lower loss scale power) to mitigate NaN during fine-tuning
  • Refactor model embedding logic — rewrite image token position tracking in llava_arch.py to correctly handle multi-image inputs and fix prompt_len computation; add longest common contiguous subsequence utility for PDrop salient token matching
  • Overhaul README — add prerequisites (transformers patch, model consolidation), expand evaluation and training docs for LLaVA-1.5, 1.6, Video-LLaVA, and Qwen2.5-VL with full parametrized examples and token budget presets
  • Cleanup — add missing dependencies (nltk, mpi4py, openai, en-core-web-sm, huggingface_hub[hf_xet]), remove dead visionzip setup block, and delete 28 deprecated pdrop scripts superseded by scripts/llava/

Test plan

  • Video-LLaVA fine-tuning completes without NaN loss on a representative dataset
  • Inference handles videos with corrupted/missing frames without crashing
  • Text-only samples in mixed datasets are processed correctly
  • All eval scripts under scripts/llava/ and scripts/videollava/ still work after deprecated script removal
  • pip install -e . succeeds with the updated setup.py dependencies

Made with Cursor

- Add prerequisites section (transformers patch, model consolidation)
- Expand evaluation docs for LLaVA-1.5, 1.6, Video-LLaVA, and Qwen
- Add full parametrized training examples with token budget presets
- Update directory tree to reflect current project structure
- Add en-core-web-sm, nltk, mpi4py, openai, huggingface_hub[hf_xet], ipdb
- Remove dead commented-out visionzip setup block
- Refactor image token position tracking to handle multi-image
  inputs and fix prompt_len computation in llava_arch.py
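The multi-image fix boils down to tracking every image placeholder position rather than only the first one. A minimal sketch of that idea (the function name is illustrative, and the `-200` placeholder id is an assumption based on LLaVA's convention, not copied from this diff):

```python
IMAGE_TOKEN_INDEX = -200  # LLaVA-style image placeholder id (assumed value)

def image_token_positions(input_ids):
    """Indices of every image placeholder in a flat list of token ids.

    With multi-image inputs there is one placeholder per image, so keeping
    all positions lets each image's features be spliced into the right slot
    when the embedding sequence is assembled.
    """
    return [i for i, tok in enumerate(input_ids) if tok == IMAGE_TOKEN_INDEX]

ids = [1, 32, -200, 7, -200, 2]
positions = image_token_positions(ids)  # [2, 4]
# prompt_len here means the text span before the first image placeholder:
prompt_len = positions[0] if positions else len(ids)  # 2
```

Tracking a list of positions instead of a single scalar is what makes the two-or-more-images case fall out naturally.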
- Add fallback for corrupted/missing video frames in
  processing_video.py (repeat last valid frame or use black)
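The fallback policy above can be sketched as follows; this is a hedged illustration assuming the decoder yields `None` for frames it could not read, not the literal code in `processing_video.py`:

```python
import numpy as np

def repair_frames(frames, height, width):
    """Replace unreadable (None) frames in a decoded sequence.

    Policy: repeat the last valid frame; if no valid frame has been seen
    yet (the video is corrupted from the start), substitute a black frame
    so downstream processing always receives a full-length clip.
    """
    black = np.zeros((height, width, 3), dtype=np.uint8)
    last_valid = None
    repaired = []
    for frame in frames:
        if frame is None:
            frame = last_valid if last_valid is not None else black
        else:
            last_valid = frame
        repaired.append(frame)
    return repaired
```

Repeating the last valid frame preserves temporal continuity, while the black-frame fallback guarantees the clip length stays fixed even for fully corrupted prefixes.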
- Add longest common contiguous subsequence utility in utils.py
  for PDrop salient token matching
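A longest common contiguous subsequence (i.e., longest common substring over token sequences) is a standard DP; a self-contained sketch, not necessarily the exact signature used in `utils.py`:

```python
def longest_common_contiguous_subsequence(a, b):
    """Longest run of items appearing contiguously in both sequences.

    Returns (start_in_a, start_in_b, length). O(len(a) * len(b)) time,
    O(len(b)) space: dp tracks the length of the match-run ending at
    each pair of positions, keeping only the previous row.
    """
    best_len, best_i, best_j = 0, 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_i, best_j = i - best_len, j - best_len
        prev = cur
    return best_i, best_j, best_len
```

For salient-token matching, `a` and `b` would be two token-id sequences, and the returned span identifies the aligned region.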
- Add file existence validation before loading image/video data
  to skip missing or zero-byte files gracefully
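The validation amounts to a cheap pre-check before any decode attempt; a minimal sketch (helper name is illustrative):

```python
import os

def is_loadable(path):
    """True only for paths that exist and are non-empty regular files.

    Checking size > 0 catches zero-byte files, which exist on disk but
    would make image/video decoders raise mid-batch; callers can skip
    such samples gracefully instead of crashing the data loader.
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

Running this check at sample-selection time keeps a single bad file from aborting a long fine-tuning run.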
- Fix text-only samples missing image list initialization
- Add optional training throughput statistics reporting
- Reduce DeepSpeed initial_scale_power (16->10) and switch to
  ZeRO stage 1 to mitigate NaN loss during fine-tuning
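In DeepSpeed config terms, these two changes touch `fp16.initial_scale_power` and `zero_optimization.stage`; a fragment showing just the affected keys (surrounding options omitted):

```json
{
  "fp16": {
    "enabled": true,
    "initial_scale_power": 10
  },
  "zero_optimization": {
    "stage": 1
  }
}
```

A lower `initial_scale_power` starts the dynamic loss scale at 2^10 instead of 2^16, reducing early-step fp16 overflow that can surface as NaN loss.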

These scripts were superseded by the unified scripts under
scripts/llava/v1_5/ and scripts/llava/v1_6/. Removing them to
avoid confusion with the canonical script locations.
