Disentangled VQA
This module implements the Disentangled VQA metric, inspired by Section 4.1 of the paper T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation. It uses the disentangled BLIP-VQA model for attribute-binding evaluation, as proposed in T2I-CompBench.
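At a high level, disentangled BLIP-VQA decomposes the compositional prompt into one question per (adjective, noun) pair and queries the BLIP-VQA model separately for each, rather than asking a single entangled question about the whole prompt. The sketch below illustrates that decomposition with the Hugging Face transformers BLIP-VQA checkpoint; the question phrasing, the image path, and the decoding of a plain-text answer (instead of the yes-probabilities used by the metric) are illustrative assumptions, not the hemm implementation.

import torch
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

# Illustrative example: evaluate the prompt "a red car and a blue bench"
# by asking one question per (adjective, noun) pair.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("generated.png")  # placeholder path to a generated image

for question in ["is there a red car?", "is there a blue bench?"]:
    inputs = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        output_ids = vqa_model.generate(**inputs)
    print(question, "->", processor.decode(output_ids[0], skip_special_tokens=True))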
Example
Step 1: Generate evaluation dataset
Generate the dataset consisting of prompts in the format `"a {adj_1} {noun_1} and a {adj_2} {noun_2}"` and the corresponding metadata using an LLM capable of generating JSON objects, such as GPT-4o. The dataset is then published both as a W&B dataset artifact and as a Weave dataset.
from hemm.metrics.attribute_binding import AttributeBindingDatasetGenerator

dataset_generator = AttributeBindingDatasetGenerator(
    openai_model="gpt-4o",
    openai_seed=42,
    num_prompts_in_single_call=20,
    num_api_calls=50,
    project_name="disentangled_vqa",
)
dataset_generator(dump_dir="./dump")
Step 2: Evaluate
import asyncio

import weave

from hemm.metrics.vqa import DisentangledVQAMetric
from hemm.metrics.vqa.judges import BlipVQAJudge
from hemm.models import DiffusersModel

# `project`, `dataset`, `diffusion_model_address`,
# `diffusion_model_enable_cpu_offfload`, and `image_size` are placeholders
# for your own configuration.
weave.init(project_name=project)

diffusion_model = DiffusersModel(
    diffusion_model_name_or_path=diffusion_model_address,
    enable_cpu_offfload=diffusion_model_enable_cpu_offfload,
    image_height=image_size[0],
    image_width=image_size[1],
)

judge = BlipVQAJudge()
metric = DisentangledVQAMetric(judge=judge, name="disentangled_blip_metric")

evaluation = weave.Evaluation(dataset=dataset, scorers=[metric])
asyncio.run(evaluation.evaluate(diffusion_model))
Metrics
DisentangledVQAMetric
Bases: Scorer
Disentangled VQA metric to evaluate the attribute-binding capability of image-generation models, as proposed in Section 4.1 of the paper T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation.
Sample usage
import wandb
import weave

from hemm.eval_pipelines import BaseDiffusionModel, EvaluationPipeline
from hemm.metrics.vqa import DisentangledVQAMetric
from hemm.metrics.vqa.judges import BlipVQAJudge

wandb.init(project=project, entity=entity, job_type="evaluation")
weave.init(project_name=project)

diffusion_model = BaseDiffusionModel(
    diffusion_model_name_or_path=diffusion_model_address,
    enable_cpu_offfload=diffusion_model_enable_cpu_offfload,
    image_height=image_size[0],
    image_width=image_size[1],
)
evaluation_pipeline = EvaluationPipeline(model=diffusion_model)

judge = BlipVQAJudge()
metric = DisentangledVQAMetric(judge=judge, name="disentangled_blip_metric")
evaluation_pipeline.add_metric(metric)

evaluation_pipeline(dataset=dataset)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
judge | Union[Model, BlipVQAJudge] | The judge model to evaluate the attribute-binding capability. | required |
Source code in hemm/metrics/vqa/disentangled_vqa.py
score(prompt, adj_1, noun_1, adj_2, noun_2, model_output)
Evaluate the attribute-binding capability of the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt | str | The prompt for the model. | required |
adj_1 | str | The first adjective. | required |
noun_1 | str | The first noun. | required |
adj_2 | str | The second adjective. | required |
noun_2 | str | The second noun. | required |
model_output | Dict[str, Any] | The model output. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | The evaluation result. |
Source code in hemm/metrics/vqa/disentangled_vqa.py
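A minimal sketch of calling score directly, assuming the generated image is passed in model_output under an "image" key; the prompt, image path, and that key are illustrative assumptions, so adapt them to whatever your model's predict step actually returns.

from PIL import Image
from hemm.metrics.vqa import DisentangledVQAMetric
from hemm.metrics.vqa.judges import BlipVQAJudge

metric = DisentangledVQAMetric(judge=BlipVQAJudge(), name="disentangled_blip_metric")

# Assumed model_output structure; adapt the key to your model's output.
result = metric.score(
    prompt="a red car and a blue bench",
    adj_1="red",
    noun_1="car",
    adj_2="blue",
    noun_2="bench",
    model_output={"image": Image.open("generated.png")},
)
print(result)  # Dict[str, Any] with the attribute-binding evaluation result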
Judges
BlipVQAJudge
Bases: Model
Weave Model to judge the presence of entities in an image using the Blip-VQA model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
blip_processor_address | str | The address of the BlipProcessor model. | 'Salesforce/blip-vqa-base' |
blip_vqa_address | str | The address of the BlipForQuestionAnswering model. | 'Salesforce/blip-vqa-base' |
device | str | The device to use for inference. | 'cuda' |
Source code in hemm/metrics/vqa/judges/blip_vqa.py
predict(adj_1, noun_1, adj_2, noun_2, image)
Predict the probabilities of the presence of entities in an image using the Blip-VQA model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adj_1 | str | The adjective of the first entity. | required |
noun_1 | str | The noun of the first entity. | required |
adj_2 | str | The adjective of the second entity. | required |
noun_2 | str | The noun of the second entity. | required |
image | Image | The input image. | required |
Returns:
Name | Type | Description |
---|---|---|
Dict | Dict | The probabilities of the presence of the entities. |
Source code in hemm/metrics/vqa/judges/blip_vqa.py
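For reference, a minimal sketch of invoking the judge directly on a PIL image; the image path and the entity words are placeholders, and the keys of the returned dictionary are determined by the hemm implementation rather than spelled out here.

from PIL import Image
from hemm.metrics.vqa.judges import BlipVQAJudge

judge = BlipVQAJudge()  # loads Salesforce/blip-vqa-base, on CUDA by default

image = Image.open("generated.png")  # placeholder path to a generated image
probabilities = judge.predict(
    adj_1="red", noun_1="car", adj_2="blue", noun_2="bench", image=image
)
print(probabilities)  # yes-probabilities for each (adjective, noun) entity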
Dataset Generation
AttributeBindingDatasetGenerator
Dataset generator for evaluating the attribute-binding capability of image-generation models.
This class generates a dataset of prompts in the format `"a {adj_1} {noun_1} and a {adj_2} {noun_2}"`, along with the corresponding metadata, using an LLM capable of generating JSON objects, such as GPT-4o. The dataset is then published both as a W&B dataset artifact and as a Weave dataset.
Sample usage
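The generator can be configured and invoked exactly as in Step 1 of the example above; the argument values shown are illustrative.

from hemm.metrics.attribute_binding import AttributeBindingDatasetGenerator

dataset_generator = AttributeBindingDatasetGenerator(
    openai_model="gpt-4o",
    openai_seed=42,
    num_prompts_in_single_call=20,
    num_api_calls=50,
    project_name="disentangled_vqa",
)
dataset_generator(dump_dir="./dump")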
Parameters:
Name | Type | Description | Default |
---|---|---|---|
openai_model | Optional[str] | The OpenAI model to use for generating prompts. | 'gpt-3.5-turbo' |
openai_seed | Optional[Union[int, List[int]]] | Seed to use for generating prompts. If not provided, seeds will be auto-generated. | None |
num_prompts_in_single_call | Optional[int] | Number of prompts to generate in a single API call. | 20 |
num_api_calls | Optional[int] | Number of API calls to make. | 50 |
project_name | Optional[str] | Name of the Weave project to use for logging the dataset. | 'diffusion_leaderboard' |
Source code in hemm/metrics/vqa/dataset_generator/attribute_binding.py
__call__(dump_dir='./dump')
Generate the dataset and publish it to Weave.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dump_dir | Optional[str] | Directory to dump the dataset. | './dump' |
Source code in hemm/metrics/vqa/dataset_generator/attribute_binding.py
AttributeBindingModel
Bases: Model
Weave Model to generate prompts for evaluating the attribute-binding capability of image-generation models, using an OpenAI model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
openai_model | Optional[str] | The OpenAI model to use for generating prompts. | required |
num_prompts | Optional[int] | Number of prompts to generate. | required |
Source code in hemm/metrics/vqa/dataset_generator/attribute_binding.py
predict(seed)
Generate prompts and corresponding metadata for evaluating the attribute-binding capability of image-generation models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seed | int | OpenAI seed to use for generating prompts. | required |
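A minimal sketch of using the model directly; the import path is assumed from the source file location listed above, the constructor arguments mirror the parameters documented here, and the structure of the returned prompts and metadata is determined by the hemm implementation.

# Illustrative sketch; the import path below is an assumption based on the
# source file location, and the return structure is not documented here.
from hemm.metrics.vqa.dataset_generator.attribute_binding import AttributeBindingModel

prompt_model = AttributeBindingModel(openai_model="gpt-4o", num_prompts=20)
response = prompt_model.predict(seed=42)  # prompts and metadata from the OpenAI model
print(response)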