Hemm: Holistic Evaluation of Multi-modal Generative Models
Hemm is a library for comprehensive benchmarking of text-to-image diffusion models on image quality and prompt comprehension, integrated with Weights & Biases and Weave.
Hemm is heavily inspired by the following projects:
- T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
- T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation
- GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
The evaluation pipeline takes each example, passes it through your application, and scores the output with multiple custom scoring functions using Weave Evaluation. This gives you a view of your model's performance and a rich UI to drill into individual outputs and scores.
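Under the hood, this maps onto Weave's Evaluation API. The sketch below is a minimal, hypothetical illustration of that flow and is not part of Hemm's API: the single-row dataset, the `prompt_alignment_score` scorer, and the `ToyTextToImageModel` are placeholders.

```python
import asyncio
import weave

weave.init(project_name="hemm-eval-sketch")  # hypothetical project name

@weave.op()
def prompt_alignment_score(prompt: str, output: dict) -> dict:
    # Placeholder scorer: a real scorer would compare the generated
    # image against the prompt (e.g. with a CLIP-based metric).
    return {"aligned": output is not None}

class ToyTextToImageModel(weave.Model):
    @weave.op()
    def predict(self, prompt: str) -> dict:
        # A real model would run a diffusion pipeline here.
        return {"image": None, "prompt": prompt}

evaluation = weave.Evaluation(
    dataset=[{"prompt": "a red cube on top of a blue sphere"}],
    scorers=[prompt_alignment_score],
)
asyncio.run(evaluation.evaluate(ToyTextToImageModel()))
```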
Leaderboards
| Leaderboard | Weave Evals |
| --- | --- |
| Rendering prompts with Complex Actions | Weave Evals |
Installation
First, we recommend installing PyTorch by following the instructions at pytorch.org/get-started/locally.
Quickstart
First, you need to publish your evaluation dataset to Weave. Check out this tutorial that shows you how to publish a dataset on your project.
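If you prefer to publish a dataset programmatically, a minimal sketch using Weave's dataset API looks like the following. The example prompts are placeholders, and the `prompt` column name is an assumption about what your evaluation expects; the dataset name matches the `COCO:v0` reference used in the quickstart code below.

```python
import weave

weave.init(project_name="image-quality-leaderboard")

# Hypothetical example rows; each row carries the prompt used for
# image generation (column name assumed here).
dataset = weave.Dataset(
    name="COCO",
    rows=[
        {"prompt": "a photograph of an astronaut riding a horse"},
        {"prompt": "a red cube on top of a blue sphere"},
    ],
)
weave.publish(dataset)
```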
Once you have a dataset in your Weave project, you can evaluate a text-to-image generation model against the supported metrics.
import wandb
import weave
from hemm.eval_pipelines import BaseDiffusionModel, EvaluationPipeline
from hemm.metrics.image_quality import LPIPSMetric, PSNRMetric, SSIMMetric
# Prompt-alignment metrics are also available:
# from hemm.metrics.prompt_alignment import CLIPImageQualityScoreMetric, CLIPScoreMetric
# Initialize Weave and WandB
wandb.init(project="image-quality-leaderboard", job_type="evaluation")
weave.init(project_name="image-quality-leaderboard")
# Initialize the diffusion model to be evaluated as a `weave.Model` using `BaseDiffusionModel`
# The `BaseDiffusionModel` class uses a `diffusers.DiffusionPipeline` under the hood.
# You can write your own model `weave.Model` if your model is not diffusers compatible.
model = BaseDiffusionModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")
# Add the model to the evaluation pipeline
evaluation_pipeline = EvaluationPipeline(model=model)
# Add PSNR Metric to the evaluation pipeline
psnr_metric = PSNRMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(psnr_metric)
# Add SSIM Metric to the evaluation pipeline
ssim_metric = SSIMMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(ssim_metric)
# Add LPIPS Metric to the evaluation pipeline
lpips_metric = LPIPSMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(lpips_metric)
# Get the Weave dataset reference
dataset = weave.ref("COCO:v0").get()
# Evaluate!
evaluation_pipeline(dataset=dataset)
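Prompt-alignment metrics such as `CLIPScoreMetric` and `CLIPImageQualityScoreMetric` (from `hemm.metrics.prompt_alignment`) can be added to the pipeline with `evaluation_pipeline.add_metric(...)` in the same way as the image-quality metrics above.

As noted in the comments above, a model that is not diffusers-compatible can be wrapped in your own `weave.Model`. The sketch below is a minimal, hypothetical example of such a wrapper: the `checkpoint_path` field, the stubbed image generation, and the returned dictionary key are assumptions about your own inference code, and the `predict` method simply follows the standard `weave.Model` convention; check `BaseDiffusionModel` for the exact interface `EvaluationPipeline` expects.

```python
import weave
from PIL import Image

class MyCustomTextToImageModel(weave.Model):
    # Hypothetical configuration for your own inference backend.
    checkpoint_path: str

    @weave.op()
    def predict(self, prompt: str) -> dict:
        # Replace this stub with a call into your own generation code,
        # e.g. a custom sampler or a hosted inference endpoint.
        image = Image.new("RGB", (512, 512))
        return {"image": image}

model = MyCustomTextToImageModel(checkpoint_path="path/to/checkpoint")
```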