Modern VLMs can produce 2,800+ visual tokens from a single high-resolution image.

| Model | VisionZip Function | PyramidDrop | Config Location |
|-------|-------------------|-------------|-----------------|
| LLaVA-1.5 | `visionzip()` | `modeling_llama_pdrop.py` | `llava/model/` |
| LLaVA-1.6/Next | `visionzip()` | `modeling_llama_pdrop.py` | `llava/model/` |
| Video-LLaVA | `visionzip_video()` | `modeling_llama_pdrop.py` | `videollava/model/` |
| Qwen2.5-VL | Built-in `configure_duet()` | Built-in | `qwen2_5_vl/modeling_qwen2_5vl_duet.py` |


## Prerequisites

### Patching Transformers Generation (`utils.py`)

DUET-VLM's Stage 2 (T2V pruning) passes salient text-token indices (`idxs`) through the HuggingFace `model.generate()` pipeline. Stock Transformers does not forward this keyword argument to the model, so a patched `utils.py` must replace the installed copy:

```bash
cp utils.py $(python -c "import transformers; print(transformers.__path__[0])")/generation/
```

> Without this patch, `idxs` is silently dropped and PDrop pruning will not activate during inference.
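
A quick way to confirm the patch landed is to grep the installed file for the `idxs` keyword. This check is a sketch, not part of the repo:

```bash
# Report whether the installed generation/utils.py contains `idxs`.
TRANSFORMERS_DIR="$(python -c 'import transformers; print(transformers.__path__[0])' 2>/dev/null || true)"
if [ -z "$TRANSFORMERS_DIR" ]; then
  STATUS="transformers-not-installed"
elif grep -q "idxs" "$TRANSFORMERS_DIR/generation/utils.py" 2>/dev/null; then
  STATUS="patched"
else
  STATUS="stock"
fi
echo "$STATUS"   # "stock" means the cp command above still needs to run
```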

### Model Consolidation

Training with DeepSpeed produces sharded checkpoint files. Before evaluation, these must be consolidated into a single HuggingFace-compatible checkpoint to avoid key-mismatch errors when `load_pretrained_model()` is called:

```bash
python -m llava.model.consolidate \
--src /path/to/deepspeed-checkpoint \
--dst /path/to/consolidated-checkpoint
```

The consolidated path is what you pass as `MODEL_PATH` to the evaluation scripts.

## Evaluation

### Running Benchmarks

Each LLaVA-1.5 eval script accepts three positional arguments:

| Arg | Position | Required? | Default |
|-----|----------|-----------|---------|
| `CUSTOM_NAME` | `$1` | No | `vzpd_cw4_192` (used in output filenames) |
| `MODEL_PATH` | `$2` | No (but should be set for finetuned models) | `liuhaotian/llava-v1.5-7b` |
| `EXTRA_ARGS` | `$3` | No | `--layer_list [16,24] --image_token_ratio_list [0.5,0.0] --dominant 300 --contextual 7 --cluster_width 4 --conv_mode vicuna_v1 --compute_salient_tokens True` |
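
Under the hood this is ordinary bash default expansion; a minimal sketch of the argument handling (variable names are assumptions — check the actual scripts):

```bash
# Hypothetical sketch of how an eval script consumes its positional args.
CUSTOM_NAME="${1:-vzpd_cw4_192}"              # $1: tag used in output filenames
MODEL_PATH="${2:-liuhaotian/llava-v1.5-7b}"   # $2: HF repo id or local checkpoint
EXTRA_ARGS="${3:---layer_list [16,24] --image_token_ratio_list [0.5,0.0] --dominant 300 --contextual 7 --cluster_width 4 --conv_mode vicuna_v1 --compute_salient_tokens True}"

echo "answers/${CUSTOM_NAME}.jsonl"           # output path derived from $1
```

Because the arguments are positional, overriding `EXTRA_ARGS` (`$3`) requires also supplying `$1` and `$2`.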

**Quick run (base model, default hyperparams):**

```bash
# LLaVA-1.5 TextVQA
bash scripts/llava/v1_5/pdrop_eval/textvqa.sh
```

**Full run with a finetuned model (single benchmark):**

```bash
bash scripts/llava/v1_5/pdrop_eval/textvqa.sh \
"duet_192" \
"/path/to/consolidated-checkpoint" \
"--layer_list [16,24] --image_token_ratio_list [0.5,0.0] --dominant 300 --contextual 7 --cluster_width 4 --conv_mode vicuna_v1 --compute_salient_tokens True"
```

**Full evaluation suite (all benchmarks in parallel across GPUs):**

```bash
VERSION="duet_cw4_192"
CUSTOM_NAME="duet_192"
MODEL_PATH="/path/to/consolidated-checkpoint"
EXTRA_ARGS="--layer_list [16,24] --image_token_ratio_list [0.5,0.0] --dominant 300 --contextual 7 --cluster_width 4 --conv_mode vicuna_v1 --compute_salient_tokens True"

mkdir -p logs/eval
CUDA_VISIBLE_DEVICES=0 bash scripts/llava/v1_5/pdrop_eval/gqa.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_gqa.log &
CUDA_VISIBLE_DEVICES=1 bash scripts/llava/v1_5/pdrop_eval/sqa.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_sqa.log &
CUDA_VISIBLE_DEVICES=2 bash scripts/llava/v1_5/pdrop_eval/pope.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_pope.log &
CUDA_VISIBLE_DEVICES=3 bash scripts/llava/v1_5/pdrop_eval/mme.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_mme.log &
CUDA_VISIBLE_DEVICES=4 bash scripts/llava/v1_5/pdrop_eval/textvqa.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_textvqa.log &
CUDA_VISIBLE_DEVICES=5 bash scripts/llava/v1_5/pdrop_eval/mmbench.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_mmbench.log &
CUDA_VISIBLE_DEVICES=6 bash scripts/llava/v1_5/pdrop_eval/seed.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_seed.log &
CUDA_VISIBLE_DEVICES=7 bash scripts/llava/v1_5/pdrop_eval/vqav2.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_vqav2.log &
wait

CUDA_VISIBLE_DEVICES=0 bash scripts/llava/v1_5/pdrop_eval/mmvet.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_mmvet.log &
CUDA_VISIBLE_DEVICES=1 bash scripts/llava/v1_5/pdrop_eval/llavabench.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_llavabench.log &
CUDA_VISIBLE_DEVICES=2 bash scripts/llava/v1_5/pdrop_eval/vizwiz.sh "$CUSTOM_NAME" "$MODEL_PATH" "$EXTRA_ARGS" 2>&1 | tee -a logs/eval/${VERSION}_vizwiz.log &
wait
```

`VERSION` is only used for log-file naming and is not passed to the eval scripts themselves.

**Token budget presets** (values passed as `--dominant`, `--contextual`, and `--cluster_width`):

| Total Tokens | dominant | contextual | cluster_width |
|---|---|---|---|
| 192 | 300 | 7 | 4 |
| 128 | 170 | 32 | 4 |
| 64 | 72 | 30 | 4 |
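
Only `--dominant` and `--contextual` change between presets, so the `EXTRA_ARGS` string can be generated. A hypothetical helper, not part of the repo:

```bash
# Emit the EXTRA_ARGS string for a given total token budget.
duet_extra_args() {
  case "$1" in
    192) dom=300; ctx=7  ;;
    128) dom=170; ctx=32 ;;
    64)  dom=72;  ctx=30 ;;
    *)   echo "unknown token budget: $1" >&2; return 1 ;;
  esac
  echo "--layer_list [16,24] --image_token_ratio_list [0.5,0.0]" \
       "--dominant $dom --contextual $ctx --cluster_width 4" \
       "--conv_mode vicuna_v1 --compute_salient_tokens True"
}

duet_extra_args 128
```

Usage: `bash scripts/llava/v1_5/pdrop_eval/textvqa.sh "duet_128" "$MODEL_PATH" "$(duet_extra_args 128)"`.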

### LLaVA-1.6/Next (Image)

LLaVA-1.6 evaluation scripts follow the same interface as v1.5 but target the LLaVA-Next code path (trained via `train_mem_pdrop_next.py`) and use different default layer schedules:

```bash
# Single benchmark
bash scripts/llava/v1_6/pdrop_eval/textvqa.sh

# All 12 benchmarks: vqav2, gqa, sqa, textvqa, vizwiz, pope, mmvet, mme, mmbench, mmbench_cn, llavabench, seed
# Same parallel pattern as LLaVA-1.5 above, replacing v1_5 with v1_6
```

Available benchmarks in `scripts/llava/v1_6/pdrop_eval/`: VQAv2, GQA, SQA, TextVQA, VizWiz, POPE, MM-Vet, MME, MMBench, MMBench-CN, LLaVA-Bench, SEED.

### Video-LLaVA (Image + Video)

Video-LLaVA evaluation covers both image benchmarks and video QA benchmarks.

**Image benchmarks** (evaluating the Video-LLaVA model on image tasks):

```bash
# Available: vqav2, gqa, sqa, textvqa, vizwiz, pope, mmvet, mmbench, llavabench
bash scripts/videollava/v1_5/eval/eval_image_gqa.sh
bash scripts/videollava/v1_5/eval/eval_image_textvqa.sh
bash scripts/videollava/v1_5/eval/eval_image_pope.sh
# ... etc.
```

**Video QA benchmarks** (two-step: inference then GPT-based scoring):

```bash
# Step 1: Generate answers
bash scripts/videollava/v1_5/eval/run_qa_msvd.sh
bash scripts/videollava/v1_5/eval/run_qa_msrvtt.sh
bash scripts/videollava/v1_5/eval/run_qa_tgif.sh
bash scripts/videollava/v1_5/eval/run_qa_activitynet.sh

# Step 2: GPT-based evaluation
bash scripts/videollava/v1_5/eval/eval_qa_msvd.sh
bash scripts/videollava/v1_5/eval/eval_qa_msrvtt.sh
bash scripts/videollava/v1_5/eval/eval_qa_tgif.sh
bash scripts/videollava/v1_5/eval/eval_qa_activitynet.sh
```

**Video-ChatGPT benchmarks** (correctness, detail, contextual, temporal, consistency):

```bash
# Step 1: Generate answers
bash scripts/videollava/v1_5/eval/run_benchmark_1_correctness.sh
bash scripts/videollava/v1_5/eval/run_benchmark_2_detail.sh
bash scripts/videollava/v1_5/eval/run_benchmark_3_contextual.sh
bash scripts/videollava/v1_5/eval/run_benchmark_4_temporal.sh
bash scripts/videollava/v1_5/eval/run_benchmark_5_consistency.sh

# Step 2: GPT-based evaluation
bash scripts/videollava/v1_5/eval/eval_benchmark_1_correctness.sh
bash scripts/videollava/v1_5/eval/eval_benchmark_2_detail.sh
bash scripts/videollava/v1_5/eval/eval_benchmark_3_contextual.sh
bash scripts/videollava/v1_5/eval/eval_benchmark_4_temporal.sh
bash scripts/videollava/v1_5/eval/eval_benchmark_5_consistency.sh
```

> **Note:** Video eval scripts have hardcoded model paths and DUET parameters. Edit `dominant`, `contextual`, `layer_list`, and `image_token_ratio_list` directly in the scripts before running.
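
One low-risk way to make those edits is a scripted substitution. The sketch below runs on a temporary stand-in file; the flag spellings (`--dominant_num`, `--context_num`) come from the note above, but verify them against the real scripts before pointing `target` at, e.g., `scripts/videollava/v1_5/eval/run_qa_msvd.sh`:

```bash
# Demonstrate the in-place edit on a temp file standing in for an eval script.
target="$(mktemp)"
printf '%s\n' 'python run_inference.py --dominant_num 300 --context_num 7' > "$target"

sed -i \
  -e 's/--dominant_num [0-9]*/--dominant_num 170/' \
  -e 's/--context_num [0-9]*/--context_num 32/' \
  "$target"

cat "$target"   # the flags now carry the 128-token preset values
```

`run_inference.py` here is a placeholder; only the sed patterns matter.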

### Qwen2.5-VL

```bash
# Available benchmarks: gqa, sqa, textvqa, pope, mme
bash scripts/qwen/gqa.sh duet_640
bash scripts/qwen/sqa.sh duet_640
bash scripts/qwen/textvqa.sh duet_640
bash scripts/qwen/pope.sh duet_640
bash scripts/qwen/mme.sh duet_640
```

Qwen scripts support mode selection via `EXTRA_ARGS`: `--mode duet` (default), `--mode ori_visionzip`, or `--mode baseline`.

### Inference-Only Results (LLaVA-1.5-7B)

| Method | Avg Tokens | Token Reduction | Avg Accuracy (%) |
|--------|------------|-----------------|------------------|

```
DUET-VLM/
├── llava/ # LLaVA-1.5 model (image VLM)
├── llava/ # LLaVA-1.5/1.6 model (image VLM)
├── videollava/ # Video-LLaVA model (image + video VLM)
├── qwen2_5_vl/ # Qwen2.5-VL DUET (standalone implementation)
├── visionzip/ # Shared VisionZip module
├── scripts/ # Evaluation and training scripts
│ ├── llava/ # LLaVA-1.5 scripts
│ ├── videollava/ # Video-LLaVA scripts
│ └── qwen/ # Qwen2.5-VL scripts
├── scripts/
│ ├── llava/
│ │ ├── v1_5/
│ │ │ ├── pdrop_train/ # LLaVA-1.5 training (pretrain, finetune)
│ │ │ └── pdrop_eval/ # LLaVA-1.5 image eval (11 benchmarks)
│ │ └── v1_6/
│ │ ├── pdrop_train/ # LLaVA-1.6 training (pretrain, finetune)
│ │ └── pdrop_eval/ # LLaVA-1.6 image eval (12 benchmarks)
│ ├── videollava/
│ │ └── v1_5/
│ │ ├── pretrain.sh # Video-LLaVA pre-training
│ │ ├── finetune.sh # Video-LLaVA fine-tuning (full)
│ │ ├── finetune_lora.sh # Video-LLaVA fine-tuning (LoRA)
│ │ └── eval/ # Video-LLaVA eval (image + video benchmarks)
│ └── qwen/ # Qwen2.5-VL eval (gqa, sqa, textvqa, pope, mme)
├── setup.py # Package installation
├── STRUCTURE.md # Detailed codebase documentation
└── utils.py # Modified HF generation utils
```

## Training

DUET-VLM supports training with integrated token compression. See the training scripts for each model:
DUET-VLM supports training with integrated token compression. Both pretrain and finetune scripts accept positional arguments for all hyperparameters (with sensible defaults).

### LLaVA-1.5

**Quick run (default hyperparams):**

```bash
# Pre-training (projector only)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/llava/v1_5/pdrop_train/pretrain.sh

# Fine-tuning (full model)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/llava/v1_5/pdrop_train/finetune.sh
```

**Full parametrized run:**

```bash
RUN_NAME="duet-192-cw4"
PRETRAIN_OUTPUT_DIR="/path/to/checkpoints/pretrain-${RUN_NAME}"
FINETUNE_OUTPUT_DIR="/path/to/checkpoints/finetune-${RUN_NAME}"
layer_list="[16,24]"
image_token_ratio_list="[0.5,0.0]"
dominant_num="300"
context_num="7"
cluster_width="4"
PRETRAIN_MM_MLP_ADAPTER="${PRETRAIN_OUTPUT_DIR}/mm_projector.bin"

# Stage 1: Pre-training
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/llava/v1_5/pdrop_train/pretrain.sh \
"$RUN_NAME" "$PRETRAIN_OUTPUT_DIR" "$layer_list" "$image_token_ratio_list" \
"$dominant_num" "$context_num" "$cluster_width"

# Stage 2: Fine-tuning (pass pretrain adapter as 5th arg)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/llava/v1_5/pdrop_train/finetune.sh \
"$RUN_NAME" "$FINETUNE_OUTPUT_DIR" "$layer_list" "$image_token_ratio_list" \
"$PRETRAIN_MM_MLP_ADAPTER" "$dominant_num" "$context_num" "$cluster_width"

# Consolidate before evaluation
python -m llava.model.consolidate \
--src "$FINETUNE_OUTPUT_DIR" \
--dst "${FINETUNE_OUTPUT_DIR}-CONSOLIDATE"
```

Pretrain script args: `$1=RUN_NAME, $2=OUTPUT_DIR, $3=layer_list, $4=image_token_ratio_list, $5=dominant_num, $6=context_num, $7=cluster_width`

Finetune script args: `$1=RUN_NAME, $2=OUTPUT_DIR, $3=layer_list, $4=image_token_ratio_list, $5=PRETRAIN_MM_MLP_ADAPTER, $6=dominant_num, $7=context_num, $8=cluster_width`

### LLaVA-1.6/Next

LLaVA-1.6 uses `train_mem_pdrop_next.py` and the LLaVA-Next architecture with a different default layer schedule (`[8,16,24]`):

```bash
# Pre-training (projector only)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/llava/v1_6/pdrop_train/pretrain.sh

# Fine-tuning (full model)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/llava/v1_6/pdrop_train/finetune.sh
```

### Video-LLaVA

```bash
# Pre-training
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/videollava/v1_5/pretrain.sh

# Fine-tuning (full)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/videollava/v1_5/finetune.sh

# Fine-tuning (LoRA)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/videollava/v1_5/finetune_lora.sh
```

> **Note:** Video-LLaVA training scripts have hardcoded data paths and hyperparameters. Edit them directly to adjust `--dominant_num`, `--context_num`, `--layer_list`, `--image_token_ratio_list`, output directories, and data paths before running.

## Acknowledgement

This codebase builds on [LLaVA](https://github.com/haotian-liu/LLaVA), [Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA), [VisionZip](https://github.com/dvlab-research/VisionZip), [PyramidDrop](https://github.com/Cooperx521/PyramidDrop), and [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL).