Evaluation
We currently provide evaluations on 8 benchmarks: VQAv2, GQA, ScienceQA, TextVQA, POPE, MME, MM-Vet and MMMU.
For VQAv2, GQA, ScienceQA, POPE, MME and MM-Vet, you MUST first download eval.zip. It contains custom annotations, scripts, and the prediction files from LLaVA v1.5. Please extract it to path/to/your/dataset/eval.
Alternatively, you can follow the evaluation instructions of LLaVA v1.5.
For MMMU, you MUST first download MMMU.zip. It contains custom annotations and scripts. Please extract it to path/to/your/dataset/eval/MMMU.
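The extraction step above can be sketched in Python as follows. This is a minimal sketch, assuming both archives have already been downloaded into the current directory; the helper name `extract_archive` is hypothetical, and whether each archive already contains a top-level folder is not stated here, so check the result against the layout in the Organize Data section.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir, creating dest_dir if needed."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    # Return the top-level entries so the result can be compared with the expected layout.
    return sorted(p.name for p in dest.iterdir())


if __name__ == "__main__":
    # Placeholder path taken from this guide; replace it with your own.
    eval_root = Path("path/to/your/dataset/eval")
    # Assumption: the archives sit in the current working directory.
    for archive, target in [("eval.zip", eval_root), ("MMMU.zip", eval_root / "MMMU")]:
        if Path(archive).is_file():
            print(archive, "->", extract_archive(archive, target))
        else:
            print(f"{archive} not found; download it first")
```

If an archive unpacks into an extra nested folder, move its contents up one level so the paths match the tree at the end of this guide.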
VQAv2
Dataset: Download test2015 and put it under path/to/your/dataset/eval/vqav2.
Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/vqav2.sh.
Inference: VQAv2 supports multi-GPU inference with the following command.
cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/vqav2.sh
Submit the results (path/to/your/dataset/eval/vqav2/answers_upload) to the vqav2_evaluation_server.
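Each benchmark asks you to set MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in its script before running it. The sketch below reads those values back from a script's text so you can confirm them before launching a long run. It assumes the scripts use plain `NAME=value` shell assignments and pass `--conv-mode <value>` on the command line; that layout is an assumption, not something this guide states, and `read_settings` is a hypothetical helper.

```python
import re

# Assumed layout: one shell assignment per line, optionally quoted.
ASSIGN = re.compile(r'^\s*(MODEL_PATH|MODEL_NAME|EVAL_DIR)=(["\']?)(.*?)\2\s*$')


def read_settings(script_text):
    """Collect the four settings from the text of an evaluation script."""
    settings = {}
    for line in script_text.splitlines():
        match = ASSIGN.match(line)
        if match:
            settings[match.group(1)] = match.group(3)
    # Assumed form of the flag: "--conv-mode value" or "--conv-mode=value".
    flag = re.search(r'--conv-mode[ =](\S+)', script_text)
    if flag:
        settings["conv-mode"] = flag.group(1)
    return settings


if __name__ == "__main__":
    from pathlib import Path

    script = Path("scripts/eval/vqav2.sh")
    if script.is_file():
        print(read_settings(script.read_text()))
```

Any key missing from the returned dictionary means the script does not match the assumed layout and should be inspected by hand.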
GQA
Dataset: Download the data and evaluation scripts following the official instructions and put them under path/to/your/dataset/eval/gqa/data.
Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/gqa.sh.
Inference: GQA supports multi-GPU inference with the following command.
cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/gqa.sh
ScienceQA
Dataset: Under path/to/your/dataset/eval/scienceqa, download images, pid_splits.json, and problems.json from the scienceqa folder of the ScienceQA repo.
Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/sqa.sh.
Inference: ScienceQA does not support multi-GPU inference; use the following command for single-GPU inference.
cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/sqa.sh
TextVQA
Dataset: Download TextVQA_0.5.1_val.json and the images, and extract them to path/to/your/dataset/eval/textvqa.
Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/textvqa.sh.
Inference: TextVQA does not support multi-GPU inference; use the following command for single-GPU inference.
cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/textvqa.sh
POPE
Dataset: Download COCO val2014 and the coco folder that contains 3 JSON files, and put them under path/to/your/dataset/eval/pope.
Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/pope.sh.
Inference: POPE does not support multi-GPU inference; use the following command for single-GPU inference.
cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/pope.sh
MME
Dataset: Download the data following the official MME instructions.
Place the downloaded images in MME_Benchmark_release_version.
Put the official eval_tool and MME_Benchmark_release_version under path/to/your/dataset/eval/MME.
Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/mme.sh.
Inference: MME does not support multi-GPU inference; use the following command for single-GPU inference.
cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mme.sh
MM-Vet
Dataset: Extract mm-vet.zip to path/to/your/dataset/eval/mmvet.
Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/mmvet.sh.
Inference: MM-Vet does not support multi-GPU inference; use the following command for single-GPU inference.
cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mmvet.sh
Submit the results (path/to/your/dataset/eval/mmvet/results) to the mmvet_evaluation_server.
MMMU
Dataset: Extract MMMU.zip to path/to/your/dataset/eval/MMMU.
Please change sample["img_path"] to your path in eval/download_images.py, and download the images as follows.
cd path/to/your/dataset/eval/MMMU
mkdir all_images
python eval/download_images.py
Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/mmmu.sh.
Inference: MMMU does not support multi-GPU inference; use the following command for single-GPU inference.
cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mmmu.sh
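The launch commands across the benchmarks differ only in the script name and in whether more than one GPU may be listed. The sketch below collects that rule in one place. It is a hypothetical convenience wrapper, not part of the repository: the names `MULTI_GPU`, `build_launch`, and `run` are assumptions, while the script names and the multi-GPU flags are copied from the sections above.

```python
import os
import subprocess

# Multi-GPU support per script, as stated in the benchmark sections above.
MULTI_GPU = {
    "vqav2": True, "gqa": True, "sqa": False, "textvqa": False,
    "pope": False, "mme": False, "mmvet": False, "mmmu": False,
}


def build_launch(benchmark, gpu_ids):
    """Return (command, env_overrides) for one benchmark script."""
    if benchmark not in MULTI_GPU:
        raise ValueError(f"unknown benchmark: {benchmark}")
    ids = list(gpu_ids)
    if not ids:
        raise ValueError("need at least one GPU id")
    # Single-GPU benchmarks use only the first requested device.
    if not MULTI_GPU[benchmark]:
        ids = ids[:1]
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in ids)}
    return ["bash", f"scripts/eval/{benchmark}.sh"], env


def run(benchmark, gpu_ids, repo_dir="TinyLLaVA_Factory"):
    """Run the script from the repository root, like the `cd` in the commands above."""
    cmd, env = build_launch(benchmark, gpu_ids)
    subprocess.run(cmd, cwd=repo_dir, env={**os.environ, **env}, check=True)
```

For example, `run("gqa", range(8))` would be equivalent to the eight-GPU GQA command, while `run("pope", [2, 3])` would silently fall back to GPU 2 only.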
Organize Data
Organize the evaluation datasets as follows in path/to/your/dataset/eval:
eval
├── vqav2
│ ├── answers
│ ├── answers_upload
│ ├── test2015
│ ├── llava_vqav2_mscoco_test2015.jsonl
│ ├── llava_vqav2_mscoco_test-dev2015.jsonl
├── gqa
│ ├── answers
│ ├── images
│ ├── train_all_questions
│ │ ├── train_all_questions_0.json
│ │ ├── ...
│ │ ├── train_all_questions_9.json
│ ├── llava_gqa_testdev_balanced.jsonl
│ ├── eval.py
│ ├── challenge_all_questions.json
│ ├── challenge_balanced_questions.json
│ ├── submission_all_questions.json
│ ├── test_all_questions.json
│ ├── test_balanced_questions.json
│ ├── testdev_all_questions.json
│ ├── testdev_balanced_questions.json
│ ├── train_balanced_questions.json
│ ├── val_all_questions.json
│ ├── val_balanced_questions.json
├── scienceqa
│ ├── answers
│ ├── images
│ │ ├── test
│ ├── llava_test_CQM-A.json
│ ├── pid_splits.json
│ ├── problems.json
├── textvqa
│ ├── answers
│ ├── train_images
│ ├── llava_textvqa_val_v051_ocr.jsonl
│ ├── TextVQA_0.5.1_val.json
├── pope
│ ├── answers
│ ├── coco
│ │ ├── coco_pope_adversarial.json
│ │ ├── coco_pope_popular.json
│ │ ├── coco_pope_random.json
│ ├── val2014
│ ├── llava_pope_test.jsonl
├── MME
│ ├── answers
│ ├── eval_tool
│ │ ├── LaVIN
│ │ ├── Your_Results
│ │ ├── calculation.py
│ ├── MME_Benchmark_release_version
│ │ ├── artwork
│ │ ├── celebrity
│ │ ├── code_reasoning
│ │ ├── color
│ │ ├── commonsense_reasoning
│ │ ├── count
│ │ ├── eval_tool
│ │ ├── existence
│ │ ├── landmark
│ │ ├── numerical_calculation
│ │ ├── OCR
│ │ ├── position
│ │ ├── posters
│ │ ├── scene
│ │ ├── text_translation
│ ├── convert_answer_to_mme.py
│ ├── llava_mme.jsonl
├── mm-vet
│ ├── answers
│ ├── images
│ ├── results
│ ├── mm-vet
│ │ ├── bard_set.json
│ │ ├── mm-vet.json
│ ├── convert_answers.py
│ ├── llava-mm-vet.jsonl
├── MMMU
│ ├── all_images
│ ├── eval
│ │ ├── utils
│ │ ├── answer_dict_val.json
│ │ ├── download_images.py
│ │ ├── main_eval_only.py
│ ├── anns_for_eval.json
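Before launching any script, it can help to confirm that the directory matches the tree above. The following is a minimal sketch of such a check; the `EXPECTED` table lists only a few representative entries per benchmark copied from the tree, and `missing_entries` is a hypothetical helper, not something shipped with the repository.

```python
from pathlib import Path

# A few entries per benchmark, copied from the tree above (not exhaustive).
EXPECTED = {
    "vqav2": ["test2015", "llava_vqav2_mscoco_test-dev2015.jsonl"],
    "gqa": ["images", "llava_gqa_testdev_balanced.jsonl", "eval.py"],
    "scienceqa": ["images/test", "pid_splits.json", "problems.json"],
    "textvqa": ["train_images", "TextVQA_0.5.1_val.json"],
    "pope": ["coco/coco_pope_random.json", "val2014", "llava_pope_test.jsonl"],
    "MME": ["eval_tool/calculation.py", "MME_Benchmark_release_version", "llava_mme.jsonl"],
    "mm-vet": ["images", "llava-mm-vet.jsonl"],
    "MMMU": ["all_images", "eval/download_images.py", "anns_for_eval.json"],
}


def missing_entries(eval_root):
    """List the expected paths (relative to eval_root) that do not exist."""
    root = Path(eval_root)
    return [
        f"{bench}/{rel}"
        for bench, rels in EXPECTED.items()
        for rel in rels
        if not (root / bench / rel).exists()
    ]


if __name__ == "__main__":
    # Placeholder path taken from this guide; replace it with your own.
    for entry in missing_entries("path/to/your/dataset/eval"):
        print("missing:", entry)
```

The answers and results folders are left out of the check on the assumption that the evaluation scripts create them; if they do not, create them by hand.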