TS-LLaVA

First version of the code has been released.

This is the official implementation for TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

by Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, Marie-Francine Moens.

We explore various visual token compression strategies. TS-LLaVA achieves state-of-the-art performance among training-free video LLMs.

Our ModelScope repo is available at: https://www.modelscope.cn/models/tingyuqu/TS-LLaVA

Results

Multiple Choice VideoQA:

Multitask Benchmarks

Ranked #9 among all video LLMs on the MLVU-test leaderboard, by average accuracy on multiple choice questions.

Open-Ended VideoQA & Video-based Text Generation

Installation

Building the environment

To create the conda environment, run:

conda env create -n llava --file llava.yml
conda activate llava

Install additional packages (llava & flash-attention)

pip install flash-attn --no-build-isolation
pip install -e ".[train]"

These two packages (llava and flash-attention) are commented out in the yml file, which is why they are installed separately. In case of problems, please refer to the original LLaVA repo.

Downloading the checkpoints:

The checkpoints for LLaVA-v1.6 can be found here:

git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b .ckpt/llava-v1.6-vicuna-7b
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-34b .ckpt/llava-v1.6-34b

After downloading, the checkpoints should be stored in the ckpt folder.
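
An incomplete download is easy to miss until inference fails, so a quick check can help. The following is a minimal sketch and not part of the repository: the helper name `missing_checkpoints` is hypothetical, and it assumes that a complete checkpoint directory contains a `config.json` file, as Hugging Face model repositories normally do.

```python
from pathlib import Path

# Checkpoint folder names taken from the clone commands above.
CHECKPOINTS = ["llava-v1.6-vicuna-7b", "llava-v1.6-34b"]


def missing_checkpoints(root, names=CHECKPOINTS):
    """Return the checkpoint names that look absent or incomplete under root.

    Assumption: a usable checkpoint directory contains a config.json file.
    """
    root = Path(root)
    return [name for name in names if not (root / name / "config.json").is_file()]


if __name__ == "__main__":
    # Assumption: adjust this to the directory you cloned the checkpoints into.
    CKPT_ROOT = "ckpt"
    for name in missing_checkpoints(CKPT_ROOT):
        print(f"checkpoint missing or incomplete: {name}")
```

If you only downloaded the 7B model, pass `names=["llava-v1.6-vicuna-7b"]` to skip the 34B check.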

[Optional] To enable GPT evaluation for open-ended video QA, please do the following:

export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
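
To confirm that the variable is visible to Python before starting a long run, a check along these lines may help. This is a sketch only; the helper name `openai_key_is_set` is hypothetical and not part of the repository.

```python
import os


def openai_key_is_set(environ=os.environ):
    """Return True if OPENAI_API_KEY is present and non-empty."""
    return bool(environ.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if not openai_key_is_set():
        print("OPENAI_API_KEY is not set; GPT-based evaluation may fail.")
```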

Dataset preparation

Multiple Choice VideoQA and Open-Ended VideoQA

We prepare the ground-truth question and answer files based on IG-VLM and SF-LLaVA, and put them under playground/gt_qa_files.
- NExT-QA: Download the NExT_QA.csv from here
- EgoSchema: Download the EgoSchema.csv from here
- IntentQA: Download the IntentQA.csv from here
If you want to run our model for Open-Ended VideoQA and video-based Text Generation, please download the datasets as:
- MSVD-QA: Download the MSVD_QA.csv from here
- MSRVTT-QA: Download the MSRVTT_QA.csv from here
- TGIF-QA: Download the TGIF_FrameQA.csv from here
- Activitynet-QA: Download the Activitynet_QA.csv from here
- VCGBench
  - Download all files under text_generation_benchmark
  - Reformat the files by running
```
python scripts/data/prepare_vcgbench_qa_file.py --qa_folder $TEXT_GENERATION_BENCHMARK
```
Reformatting the files:
- After getting the csv files, please reformat the files (apart from VCGBench) by running
```
python scripts/data/prepare_{DATASET}_file.py --qa_file $PATH_TO_CSV_FILE
```
- Replace DATASET with the name of the dataset. Check scripts/data to make sure the name is correct.
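
One way to see which DATASET values are valid is to list the scripts that match the `prepare_{DATASET}_file.py` pattern. The sketch below does that; the helper name `available_datasets` is hypothetical, and it assumes you run it from the repository root so that `scripts/data` resolves correctly.

```python
from pathlib import Path


def available_datasets(script_dir="scripts/data"):
    """List DATASET values for which a prepare_{DATASET}_file.py script exists."""
    prefix, suffix = "prepare_", "_file.py"
    names = []
    for path in sorted(Path(script_dir).glob(f"{prefix}*{suffix}")):
        # Strip the fixed prefix and suffix to recover the DATASET part.
        names.append(path.name[len(prefix):-len(suffix)])
    return names


if __name__ == "__main__":
    print(available_datasets())
```

Note that the VCGBench script `prepare_vcgbench_qa_file.py` also matches this pattern, although it takes `--qa_folder` rather than `--qa_file`.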
Download the raw videos from the official websites.
- Multiple Choice VideoQA
  - Download datasets from the data owners.
- Open-Ended VideoQA & video-based Text Generation:
  - [Recommended] Option 1: Follow the instructions in Video-LLaVA to download raw videos.
  - Option 2: Download videos from the data owners.
- Store the videos in a directory of your choice (BASE_VIDEO_DIR), and replace BASE_VIDEO_DIR in the scripts where needed

Multitask Benchmarks

Download the data:
- MVBench
  - Download the data from here
  - The official repo can be found here
- MLVU
  - Download the data from here
  - The official repo can be found here
- Store the videos in BASE_VIDEO_DIR

Inference and Evaluation

By default, we use all the visible GPUs on the node for the model inference. To manually select GPUs, please modify CUDA_VISIBLE_DEVICES in the scripts accordingly.
Please note that inference with TS-LLaVA-34B requires GPUs with at least 80 GB of memory.
In each script, change CKPT_NAME and model_path accordingly.

Multiple Choice VideoQA

cd scripts/infer_videos
bash run_qa_{DATASET_NAME}.sh {AGGREGATION_METHOD} {NUM_FRAMES} {NUM_SAMPLED_TOKENS} {PROMPT_VERSION} {IMAGE_ASPECT_RATIO}
The evaluation is automatically done after inference

Replace DATASET_NAME with one of {nextqa, egoschema, intentqa}.
AGGREGATION_METHOD refers to the visual token compression method of choice. The default for TS-LLaVA is V2. You can select from:
- X1, X2, X3: only use the thumbnail image.
- Z1, Z2, Z3: use multiple thumbnail images (remember to set the total number of frames to a multiple of the number of frames per thumbnail image).
- Y1, Y2, Y3: use both the thumbnail image and sampled visual tokens, with the thumbnail image tokens prepended to the sampled visual tokens.
- V1, V2, V3: similar to Y1, Y2, Y3, but the sampled tokens are prepended to the thumbnail image tokens.
- W1, W2, W3 & U1, U2, U3: use multiple thumbnail images with sampled visual tokens (for ablation studies; remember to set the number of sampled tokens accordingly).
- Here 1, 2 and 3 correspond to using 4, 6, and 8 frames per thumbnail image, respectively.
- For details, please refer to llava_arch.py
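
The naming scheme above can be summarized in a few lines of Python. This is an illustrative sketch of the conventions described in this README, not the code in llava_arch.py: the function names are hypothetical, tokens are represented as plain lists, and the divisibility check is applied only to the Z family because that is the only case where the README states it.

```python
# Suffix -> frames per thumbnail image, as listed above.
FRAMES_PER_THUMBNAIL = {"1": 4, "2": 6, "3": 8}
# Method families named in this README.
FAMILIES = ("X", "Z", "Y", "V", "W", "U")


def frames_per_thumbnail(method):
    """Return the frames per thumbnail image for a method name such as "V2"."""
    if len(method) != 2 or method[0] not in FAMILIES or method[1] not in FRAMES_PER_THUMBNAIL:
        raise ValueError(f"unknown aggregation method: {method!r}")
    return FRAMES_PER_THUMBNAIL[method[1]]


def check_frame_count(method, num_frames):
    """Raise if a Z-family method gets a frame count that is not a multiple
    of the frames per thumbnail image; otherwise return frames per thumbnail."""
    per_thumbnail = frames_per_thumbnail(method)
    if method[0] == "Z" and num_frames % per_thumbnail != 0:
        raise ValueError(
            f"{method} needs NUM_FRAMES divisible by {per_thumbnail}, got {num_frames}"
        )
    return per_thumbnail


def order_tokens(method, thumbnail_tokens, sampled_tokens):
    """Concatenate token lists in the order described for the Y and V families."""
    frames_per_thumbnail(method)  # validates the method name
    if method[0] == "Y":  # thumbnail tokens come first
        return list(thumbnail_tokens) + list(sampled_tokens)
    if method[0] == "V":  # sampled tokens come first
        return list(sampled_tokens) + list(thumbnail_tokens)
    raise ValueError(f"{method} does not combine one thumbnail with sampled tokens")
```

For example, `check_frame_count("Z1", 48)` passes, while `check_frame_count("Z3", 50)` raises because 50 is not a multiple of 8.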
NUM_FRAMES refers to the total number of frames used. The default for TS-LLaVA is 50.
NUM_SAMPLED_TOKENS refers to the number of sampled tokens. The default for TS-LLaVA is 2880.
PROMPT_VERSION refers to the textual prompt version used. The default for TS-LLaVA is v4. Please refer to get_prompt.py for more information
IMAGE_ASPECT_RATIO refers to the type of image aspect ratio. The default for TS-LLaVA is resize, which resizes each frame to 336$\times$336.
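
When launching many runs, it can be convenient to assemble the command line from the defaults listed above. The helper below is a hypothetical sketch (`build_command` is not part of the repository); it only builds the argument list in the positional order the script expects and does not run anything.

```python
import shlex

# Defaults stated above for Multiple Choice VideoQA, in positional order.
DEFAULTS = {
    "aggregation_method": "V2",
    "num_frames": 50,
    "num_sampled_tokens": 2880,
    "prompt_version": "v4",
    "image_aspect_ratio": "resize",
}


def build_command(dataset_name, **overrides):
    """Return the argv list for run_qa_{dataset_name}.sh with optional overrides."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unknown arguments: {sorted(unknown)}")
    args = {**DEFAULTS, **overrides}
    # Dicts keep insertion order, so the values follow the order of DEFAULTS.
    return ["bash", f"run_qa_{dataset_name}.sh"] + [str(args[key]) for key in DEFAULTS]


if __name__ == "__main__":
    print(shlex.join(build_command("nextqa")))
    # bash run_qa_nextqa.sh V2 50 2880 v4 resize
```

The returned list can be passed to `subprocess.run(..., cwd="scripts/infer_videos")` if you prefer to launch the script from Python.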

Multitask Benchmarks

The default arguments AGGREGATION_METHOD, NUM_FRAMES, NUM_SAMPLED_TOKENS, PROMPT_VERSION and IMAGE_ASPECT_RATIO are the same as Multiple Choice VideoQA.

MLVU

cd scripts/infer_videos
bash run_qa_mlvu_mcqa.sh V2 50 2880 v4 resize
Submit the resulting json file to the official evaluation server (https://github.com/JUNJIE99/MLVU) for evaluation

MVBench

cd scripts/infer_videos
bash run_qa_mlvu_mcqa.sh V2 50 2880 v4 resize {INPUT_FORMAT}
The evaluation is automatically done after inference

In the script, change video_dir, gt_file_qa and output_dir accordingly for different subtasks.
The sixth argument INPUT_FORMAT refers to the input format of visual contents, which corresponds to the subtask of choice. It should be either video or image.

Open-Ended VideoQA

The default value for PROMPT_VERSION is v3. The rest are the same as Multiple Choice VideoQA.

Inference

cd scripts/infer_videos
bash run_qa_{DATASET_NAME}.sh V2 50 2880 v3 resize

Same as Multiple Choice VideoQA. Replace DATASET_NAME with one of {msvd, msrvtt, anet, tgif}.

Evaluation

cd scripts/eval
bash eval_{DATASET_NAME}.sh V2 50 2880 v3 resize {API_KEY}

Use your own OpenAI API key for API_KEY.

For VCGBench (Video ChatGPT), the inference and evaluation procedures are similar. Please refer to run_gen_qa_{TASK_TYPE}.sh and eval_gen_qa.sh

Acknowledgement

We extend our gratitude to the following awesome projects: LLaVA, FreeVA, IG-VLM and SF-LLaVA.

Citations

@article{qu2024tsllava,
    title={TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models}, 
    author={Tingyu Qu and Mingxiao Li and Tinne Tuytelaars and Marie-Francine Moens},
    year={2024},
    journal={arXiv preprint arXiv:2411.11066},
}
