First version of the code has been released.
This is the official implementation for TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
by Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, Marie-Francine Moens.
We explore various visual token compression strategies. Our TS-LLaVA achieves state-of-the-art performance among training-free video LLMs.
Our ModelScope repo is available at: https://www.modelscope.cn/models/tingyuqu/TS-LLaVA
Ranked #9 among all video LLMs by average accuracy on multiple-choice questions on the MLVU-test leaderboard.
To create the conda environment, please run:
conda env create -n llava --file llava.yml
conda activate llava
Install additional packages (llava & flash-attention)
pip install flash-attn --no-build-isolation
pip install -e ".[train]"
- These two packages (llava and flash-attention) are commented out in the yml file. In case of problems, please refer to the original LLaVA repo.
The checkpoints for LLaVA-v1.6 can be found here:
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b .ckpt/llava-v1.6-vicuna-7b
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-34b .ckpt/llava-v1.6-34b
- After downloading, the checkpoints should be stored in the ckpt folder.
[Optional] To enable GPT evaluation for open-ended video QA, please do the following:
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
- We prepare the ground-truth question and answer files based on IG-VLM and SF-LLaVA, and put them under playground/gt_qa_files.
  - NExT-QA: Download NExT_QA.csv from here
  - EgoSchema: Download EgoSchema.csv from here
  - IntentQA: Download IntentQA.csv from here

  If you want to run our model for Open-Ended VideoQA and video-based Text Generation, please download the datasets as follows:
  - MSVD-QA: Download MSVD_QA.csv from here
  - MSRVTT-QA: Download MSRVTT_QA.csv from here
  - TGIF-QA: Download TGIF_FrameQA.csv from here
  - ActivityNet-QA: Download Activitynet_QA.csv from here
  - VCGBench
    - Download all files under text_generation_benchmark
    - Reformat the files by running
      python scripts/data/prepare_vcgbench_qa_file.py --qa_folder $TEXT_GENERATION_BENCHMARK
- Reformatting the files:
  - After getting the csv files, please reformat the files (apart from VCGBench) by running
    python scripts/data/prepare_{DATASET}_file.py --qa_file $PATH_TO_CSV_FILE
  - Replace DATASET with the name of the dataset. Check scripts/data to make sure the name is correct.
- Download the raw videos from the official websites.
  - Multiple Choice VideoQA
  - Open-Ended VideoQA & video-based Text Generation:
    - [Recommended] Option 1: Follow the instructions in Video-LLaVA to download raw videos.
    - Option 2: Download videos from the data owners.
  - Store the videos in a directory of your choice (BASE_VIDEO_DIR), and replace BASE_VIDEO_DIR in the scripts when needed.
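The reformatting step in the data preparation list follows one naming pattern, so the command can be composed from the dataset name and the csv path. The sketch below only prints the command. `reformat_cmd` is a hypothetical helper that is not part of the repo, and the values of `DATASET` and `PATH_TO_CSV_FILE` are placeholders, since the actual script names have to be checked in scripts/data.

```shell
# Hypothetical helper (not part of the repo): print the reformatting
# command for one dataset without executing it.
#   $1: dataset name as it appears in the script file name under scripts/data
#   $2: path to the downloaded csv file
reformat_cmd() {
  printf 'python scripts/data/prepare_%s_file.py --qa_file %s\n' "$1" "$2"
}

# Placeholders: replace both values before use.
DATASET="your_dataset"
PATH_TO_CSV_FILE="/path/to/your_dataset.csv"
reformat_cmd "$DATASET" "$PATH_TO_CSV_FILE"
```

Once the printed command matches an existing script under scripts/data, it can be run from the repo root.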
- By default, we use all the visible GPUs on the node for model inference. To manually select GPUs, please modify CUDA_VISIBLE_DEVICES in the scripts accordingly.
- Please note that model inference with TS-LLaVA-34B requires GPUs with at least 80G memory.
- In each script, change CKPT_NAME and model_path accordingly.
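As a rough illustration of the GPU selection note, the snippet below shows what a CUDA_VISIBLE_DEVICES setting could look like and how many devices it exposes. The indices 0 and 1 are placeholders, and `NUM_VISIBLE_GPUS` is a name introduced only for this sketch. How each inference script sets the variable may differ, so check the script before relying on a value exported from the shell.

```shell
# Expose only GPUs 0 and 1 to processes started afterwards.
# The indices are placeholders; use the ones available on your node.
export CUDA_VISIBLE_DEVICES=0,1

# Count the devices in the comma-separated list.
NUM_VISIBLE_GPUS=$(printf '%s' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)
echo "Using ${NUM_VISIBLE_GPUS} GPU(s): ${CUDA_VISIBLE_DEVICES}"
```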
cd scripts/infer_videos
bash run_qa_{DATASET_NAME}.sh {AGGREGATION_METHOD} {NUM_FRAMES} {NUM_SAMPLED_TOKENS} {PROMPT_VERSION} {IMAGE_ASPECT_RATIO}
The evaluation is automatically done after inference
- Replace DATASET_NAME with one of {nextqa, egoschema, intentqa}
- AGGREGATION_METHOD refers to the visual token compression method of choice. The default for TS-LLaVA is V2. You can select from:
  - X1, X2, X3: only use the thumbnail image.
  - Z1, Z2, Z3: use multiple thumbnail images (remember to set the total number of frames to be divisible by the number of frames per thumbnail image).
  - Y1, Y2, Y3: use both the thumbnail image and sampled visual tokens, with the thumbnail image tokens prepended to the sampled visual tokens.
  - V1, V2, V3: similar to Y1, Y2, Y3, but the sampled tokens are prepended to the thumbnail image tokens.
  - W1, W2, W3 & U1, U2, U3: use multiple thumbnail images with sampled visual tokens (for ablation studies; remember to set the number of sampled tokens accordingly).
  - Here 1, 2 and 3 correspond to using 4, 6, and 8 frames per thumbnail image, respectively.
  - For details, please refer to llava_arch.py
- NUM_FRAMES refers to the total number of frames used. The default for TS-LLaVA is 50.
- NUM_SAMPLED_TOKENS refers to the number of sampled tokens. The default for TS-LLaVA is 2880.
- PROMPT_VERSION refers to the textual prompt version used. The default for TS-LLaVA is v4. Please refer to get_prompt.py for more information.
- IMAGE_ASPECT_RATIO refers to the type of image aspect ratio. The default for TS-LLaVA is resize, which resizes each frame to 336$\times$336.
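The divisibility reminder for the multi-thumbnail variants can be checked before launching a run. The sketch below maps the trailing digit of AGGREGATION_METHOD to the number of frames per thumbnail image (4, 6, or 8, as listed above) and tests whether NUM_FRAMES is a multiple of it. `frames_per_thumbnail` and `check_frames` are hypothetical helpers written for this illustration; they are not taken from llava_arch.py.

```shell
# Hypothetical helper: map the trailing digit of AGGREGATION_METHOD
# to the number of frames per thumbnail image (1 -> 4, 2 -> 6, 3 -> 8).
frames_per_thumbnail() {
  case "$1" in
    *1) echo 4 ;;
    *2) echo 6 ;;
    *3) echo 8 ;;
    *)  echo "unknown aggregation method: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: succeed only if NUM_FRAMES ($2) is a multiple of
# the frames per thumbnail image implied by AGGREGATION_METHOD ($1).
check_frames() {
  per_thumb=$(frames_per_thumbnail "$1") || return 1
  [ $(( $2 % per_thumb )) -eq 0 ]
}

check_frames Z2 48 && echo "Z2 with 48 frames: OK"
check_frames Z2 50 || echo "Z2 with 50 frames: not divisible by 6"
```

According to the list above, the check matters for the multi-thumbnail methods such as Z1, Z2, Z3; the default V2 setting uses 50 frames.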
The default values of AGGREGATION_METHOD, NUM_FRAMES, NUM_SAMPLED_TOKENS, PROMPT_VERSION and IMAGE_ASPECT_RATIO are the same as for Multiple Choice VideoQA.
cd scripts/infer_videos
bash run_qa_mlvu_mcqa.sh V2 50 2880 v4 resize
Submit the resulting json file to the official evaluation server (https://github.com/JUNJIE99/MLVU) for evaluation
cd scripts/infer_videos
bash run_qa_mlvu_mcqa.sh V2 50 2880 v4 resize {INPUT_FORMAT}
The evaluation is automatically done after inference
- In the script, change video_dir, gt_file_qa and output_dir accordingly for different subtasks.
- The sixth argument INPUT_FORMAT refers to the input format of the visual content, which corresponds to the subtask of choice. It should be either video or image.
The default value for PROMPT_VERSION is v3. The rest are the same as Multiple Choice VideoQA.
cd scripts/infer_videos
bash run_qa_{DATASET_NAME}.sh V2 50 2880 v3 resize
- Same as Multiple Choice VideoQA. Replace DATASET_NAME with one of {msvd, msrvtt, anet, tgif}
cd scripts/eval
bash eval_{DATASET_NAME}.sh V2 50 2880 v3 resize {API_KEY}
- Use your own API key from OpenAI for API_KEY.
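Inference and GPT evaluation for one open-ended dataset can be chained as sketched below. `open_ended_commands` is a hypothetical helper, not part of the repo; it only prints the two commands with the default arguments given above. It assumes the key exported as OPENAI_API_KEY is the one to pass as API_KEY, and it prints the variable name rather than its value so the key does not end up in logs.

```shell
# Hypothetical helper: print the inference and evaluation commands
# for one open-ended dataset.
#   $1: dataset name, one of msvd, msrvtt, anet, tgif
open_ended_commands() {
  echo "(cd scripts/infer_videos && bash run_qa_$1.sh V2 50 2880 v3 resize)"
  echo "(cd scripts/eval && bash eval_$1.sh V2 50 2880 v3 resize \"\$OPENAI_API_KEY\")"
}

# Print the two commands for MSVD-QA.
open_ended_commands msvd
```

To execute the printed commands instead of only displaying them, pipe the output to `sh` from the repo root with OPENAI_API_KEY exported.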
For VCGBench (Video ChatGPT), the inference and evaluation procedures are similar. Please refer to run_gen_qa_{TASK_TYPE}.sh and eval_gen_qa.sh
We extend our gratitude to the following awesome projects: LLaVA, FreeVA, IG-VLM and SF-LLaVA.
@article{qu2024tsllava,
title={TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models},
author={Tingyu Qu and Mingxiao Li and Tinne Tuytelaars and Marie-Francine Moens},
year={2024},
journal={arXiv preprint arXiv:2411.11066},
}
