Evaluation

We currently provide evaluations on 8 benchmarks: VQAv2, GQA, ScienceQA, TextVQA, POPE, MME, MM-Vet, and MMMU.

For VQAv2, GQA, ScienceQA, TextVQA, POPE, MME and MM-Vet, you MUST first download eval.zip. It contains custom annotations, scripts, and the prediction files from LLaVA v1.5. Please extract it to path/to/your/dataset/eval. Alternatively, you can follow the evaluation instructions of LLaVA v1.5.

For MMMU, you MUST first download MMMU.zip. It contains custom annotations and scripts. Please extract it to path/to/your/dataset/eval/MMMU.
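
The two extraction steps above can be sketched as a small shell helper. This is a minimal sketch, assuming the `unzip` tool is installed and the archives were saved under the names used above; `extract_if_present` is a hypothetical helper name, not part of the repository. Whether the archives already contain a top-level folder is not stated here, so it may be worth listing their contents with `unzip -l` first.

```shell
# Hypothetical helper: extract an archive into a destination directory.
# Creates the destination if needed and skips archives that are absent.
extract_if_present() {
    archive=$1
    dest=$2
    mkdir -p "$dest"
    if [ -f "$archive" ]; then
        # -q: quiet, -o: overwrite existing files, -d: target directory
        unzip -q -o "$archive" -d "$dest"
    else
        echo "skipping: $archive not found"
    fi
}
```

With both archives in the current directory, `extract_if_present eval.zip path/to/your/dataset/eval` followed by `extract_if_present MMMU.zip path/to/your/dataset/eval/MMMU` would mirror the instructions above.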

VQAv2

Dataset: Download test2015 and put it under path/to/your/dataset/eval/vqav2.

Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/vqav2.sh.
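
To confirm what a script currently uses before and after editing, the lines mentioning these four settings can be listed with `grep`. This is a minimal sketch; `show_eval_settings` is a hypothetical helper name, and it only assumes the names appear literally in the script as written above.

```shell
# Hypothetical helper: print the lines of an evaluation script that
# mention the settings to edit. The four names come from the instructions.
show_eval_settings() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "not found: $script" >&2
        return 1
    fi
    # -n: show line numbers, -E: extended regex with alternation
    grep -n -E 'MODEL_PATH|MODEL_NAME|EVAL_DIR|conv-mode' "$script"
}
```

For example, run `show_eval_settings scripts/eval/vqav2.sh` from inside TinyLLaVA_Factory. The same check applies to the other scripts under scripts/eval.
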
Inference: VQAv2 supports multi-GPU inference with the following command.

cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/vqav2.sh
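
How the script uses the listed devices is not described here. One common pattern in LLaVA-style evaluation scripts, assumed here rather than confirmed for scripts/eval/vqav2.sh, is to split the questions into one chunk per visible GPU. The sketch below only prints such a plan and launches nothing; `count_gpus` and the chunk numbering are illustrative assumptions.

```shell
# Count the entries in a comma-separated device list such as "0,1,2,3".
count_gpus() {
    saved_ifs=$IFS
    IFS=','
    # Unquoted on purpose: the shell splits the list on commas.
    set -- $1
    IFS=$saved_ifs
    echo "$#"
}

# Fall back to a single device when CUDA_VISIBLE_DEVICES is unset or empty.
gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
num_chunks=$(count_gpus "$gpu_list")

# Assumed scheme: one question chunk per visible GPU.
chunk_idx=0
saved_ifs=$IFS
IFS=','
for gpu in $gpu_list; do
    echo "chunk $chunk_idx of $num_chunks -> GPU $gpu"
    chunk_idx=$((chunk_idx + 1))
done
IFS=$saved_ifs
```

Under this assumption, listing fewer devices in CUDA_VISIBLE_DEVICES simply yields fewer, larger chunks.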

Submit the results (path/to/your/dataset/eval/vqav2/answers_upload) to the vqav2_evaluation_server.

GQA

Dataset: Download the data and evaluation_scripts following the official instructions and put them under path/to/your/dataset/eval/gqa/data.

Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/gqa.sh.
Inference: GQA supports multi-GPU inference with the following command.

cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/gqa.sh

ScienceQA

Dataset: Under path/to/your/dataset/eval/scienceqa, download images, pid_splits.json, problems.json from the scienceqa folder of the ScienceQA repo.

Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/sqa.sh.
Inference: ScienceQA does not support multi-GPU inference; please use the following command for single-GPU inference.

cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/sqa.sh

TextVQA

Dataset: Download TextVQA_0.5.1_val.json and the images, and extract them to path/to/your/dataset/eval/textvqa.

Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/textvqa.sh.
Inference: TextVQA does not support multi-GPU inference; please use the following command for single-GPU inference.

cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/textvqa.sh

POPE

Dataset: Download COCO val2014 and the coco folder containing 3 JSON files, and put them under path/to/your/dataset/eval/pope.

Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/pope.sh.
Inference: POPE does not support multi-GPU inference; please use the following command for single-GPU inference.

cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/pope.sh

MME

Dataset: Download the data following the official instructions.

Download the images to MME_Benchmark_release_version.
Put the official eval_tool and MME_Benchmark_release_version under path/to/your/dataset/eval/MME.
Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/mme.sh.
Inference: MME does not support multi-GPU inference; please use the following command for single-GPU inference.

cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mme.sh

MM-Vet

Dataset: Extract mm-vet.zip to path/to/your/dataset/eval/mmvet.

Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/mmvet.sh.
Inference: MM-Vet does not support multi-GPU inference; please use the following command for single-GPU inference.

cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mmvet.sh

Submit the results (path/to/your/dataset/eval/mmvet/results) to the mmvet_evaluation_server.

MMMU

Dataset: Extract MMMU.zip to path/to/your/dataset/eval/MMMU.

Please change sample["img_path"] to your path in eval/download_images.py, and download the images as follows.
```
cd path/to/your/dataset/eval/MMMU
mkdir all_images
python eval/download_images.py
```
Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, and conv-mode in scripts/eval/mmmu.sh.
Inference: MMMU does not support multi-GPU inference; please use the following command for single-GPU inference.
```
cd TinyLLaVA_Factory
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mmmu.sh
```

Organize Data

Organize the evaluation dataset as follows in path/to/your/dataset/eval:

eval
├── vqav2
│   ├── answers
│   ├── answers_upload
│   ├── test2015
│   ├── llava_vqav2_mscoco_test2015.jsonl
│   ├── llava_vqav2_mscoco_test-dev2015.jsonl
├── gqa
│   ├── answers
│   ├── images
│   ├── train_all_questions
│   │   ├── train_all_questions_0.json
│   │   ├── ...
│   │   ├── train_all_questions_9.json
│   ├── llava_gqa_testdev_balanced.jsonl
│   ├── eval.py
│   ├── challenge_all_questions.json
│   ├── challenge_balanced_questions.json
│   ├── submission_all_questions.json
│   ├── test_all_questions.json
│   ├── test_balanced_questions.json
│   ├── testdev_all_questions.json
│   ├── testdev_balanced_questions.json
│   ├── train_balanced_questions.json
│   ├── val_all_questions.json
│   ├── val_balanced_questions.json
├── scienceqa
│   ├── answers
│   ├── images
│   │   ├── test
│   ├── llava_test_CQM-A.json
│   ├── pid_splits.json
│   ├── problems.json
├── textvqa
│   ├── answers
│   ├── train_images
│   ├── llava_textvqa_val_v051_ocr.jsonl
│   ├── TextVQA_0.5.1_val.json
├── pope
│   ├── answers
│   ├── coco
│   │   ├── coco_pope_adversarial.json
│   │   ├── coco_pope_popular.json
│   │   ├── coco_pope_random.json
│   ├── val2014
│   ├── llava_pope_test.jsonl
├── MME
│   ├── answers
│   ├── eval_tool
│   │   ├── LaVIN
│   │   ├── Your_Results
│   │   ├── calculation.py
│   ├── MME_Benchmark_release_version
│   │   ├── artwork
│   │   ├── celebrity
│   │   ├── code_reasoning
│   │   ├── color
│   │   ├── commonsense_reasoning
│   │   ├── count
│   │   ├── eval_tool
│   │   ├── existence
│   │   ├── landmark
│   │   ├── numerical_calculation
│   │   ├── OCR
│   │   ├── position
│   │   ├── posters
│   │   ├── scene
│   │   ├── text_translation
│   ├── convert_answer_to_mme.py
│   ├── llava_mme.jsonl
├── mm-vet
│   ├── answers
│   ├── images
│   ├── results
│   ├── mm-vet
│   │   ├── bard_set.json
│   │   ├── mm-vet.json
│   ├── convert_answers.py
│   ├── llava-mm-vet.jsonl
├── MMMU
│   ├── all_images
│   ├── eval
│   │   ├── utils
│   │   ├── answer_dict_val.json
│   │   ├── download_images.py
│   │   ├── main_eval_only.py
│   ├── anns_for_eval.json
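
The layout above can be checked before running any script. The following is a minimal sketch, not a tool shipped with the repository: `check_eval_layout` and `EXPECTED_PATHS` are hypothetical names, and the list covers only a representative subset of the tree (inputs that must exist before inference, not the answers folders).

```shell
# Representative subset of the tree above, relative to the eval root.
EXPECTED_PATHS="
vqav2/test2015
vqav2/llava_vqav2_mscoco_test-dev2015.jsonl
gqa/images
gqa/llava_gqa_testdev_balanced.jsonl
scienceqa/images/test
scienceqa/pid_splits.json
scienceqa/problems.json
textvqa/train_images
textvqa/TextVQA_0.5.1_val.json
pope/val2014
pope/coco
MME/eval_tool
MME/MME_Benchmark_release_version
MMMU/all_images
MMMU/anns_for_eval.json
"

# Report every expected path missing under the given root.
# Returns 0 when all listed paths exist, nonzero otherwise.
check_eval_layout() {
    root=$1
    missing=0
    # Unquoted on purpose: the list is split on whitespace.
    for rel in $EXPECTED_PATHS; do
        if [ ! -e "$root/$rel" ]; then
            echo "missing: $root/$rel"
            missing=$((missing + 1))
        fi
    done
    echo "$missing expected path(s) missing under $root"
    [ "$missing" -eq 0 ]
}
```

Calling `check_eval_layout path/to/your/dataset/eval` prints one line per missing entry and exits with a nonzero status if anything is absent, so it can gate a batch of evaluation runs.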