
THUDM/ImageRewardDB | Text-to-Image Generation Dataset | Human Preference Evaluation Dataset

Source: hugging_face · Updated: 2023-06-21 · Indexed: 2024-03-04
Tags: Text-to-image generation · Human preference evaluation
Download link:
https://hf-mirror.com/datasets/THUDM/ImageRewardDB
Resource description:
---
license: apache-2.0
task_categories:
- text-to-image
language:
- en
pretty_name: ImageReward Dataset
size_categories:
- 100K<n<1M
---

# ImageRewardDB

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/wuyuchen/ImageRewardDB
- **Repository:** https://github.com/THUDM/ImageReward
- **Paper:** https://arxiv.org/abs/2304.05977

### Dataset Summary

ImageRewardDB is a comprehensive text-to-image comparison dataset, focusing on text-to-image human preference. It consists of 137k pairs of expert comparisons, based on text prompts and corresponding model outputs from DiffusionDB. To build ImageRewardDB, we designed a tailored pipeline: establishing criteria for quantitative assessment and annotator training, optimizing the labeling experience, and ensuring quality validation. ImageRewardDB is now publicly available at [🤗 Hugging Face Dataset](https://huggingface.co/datasets/wuyuchen/ImageRewardDB).

Notice: All images in ImageRewardDB are collected from DiffusionDB; in addition, images corresponding to the same prompt are gathered together.

### Languages

The text in the dataset is all in English.

### Four Subsets

Considering that ImageRewardDB contains a large number of images, we provide four subsets at different scales to support different needs. For all subsets, the validation and test splits remain the same. The validation split (1.10GB) contains 412 prompts and 2.6K images (7.32K pairs), and the test split (1.16GB) contains 466 prompts and 2.7K images (7.23K pairs). The information on the train split at different scales is as follows:

|Subset|Num of Pairs|Num of Images|Num of Prompts|Size|
|:--|--:|--:|--:|--:|
|ImageRewardDB 1K|17.6K|6.2K|1K|2.7GB|
|ImageRewardDB 2K|35.5K|12.5K|2K|5.5GB|
|ImageRewardDB 4K|71.0K|25.1K|4K|10.8GB|
|ImageRewardDB 8K|141.1K|49.9K|8K|20.9GB|

## Dataset Structure

All the data in this repository is stored in a well-organized way. The 62.6K images in ImageRewardDB are split into several folders, stored in corresponding directories under "./images" according to their split. Each folder contains around 500 prompts, their corresponding images, and a JSON file. The JSON file links each image with its corresponding prompt and annotation. The file structure is as follows:

```
# ImageRewardDB
./
├── images
│   ├── train
│   │   ├── train_1
│   │   │   ├── 0a1ed3a5-04f6-4a1b-aee6-d584e7c8ed9c.webp
│   │   │   ├── 0a58cfa8-ff61-4d31-9757-27322aec3aaf.webp
│   │   │   ├── [...]
│   │   │   └── train_1.json
│   │   ├── train_2
│   │   ├── train_3
│   │   ├── [...]
│   │   └── train_32
│   ├── validation
│   │   └── [...]
│   └── test
│       └── [...]
├── metadata-train.parquet
├── metadata-validation.parquet
└── metadata-test.parquet
```

The sub-folders are named {split_name}_{part_id}, and each JSON file has the same name as its sub-folder. Each image is a lossless WebP file with a unique name generated by [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier).

### Data Instances

For instance, below is the image `1b4b2d61-89c2-4091-a1c0-f547ad5065cb.webp` and its information in train_1.json.

```json
{
    "image_path": "images/train/train_1/0280642d-f69f-41d1-8598-5a44e296aa8b.webp",
    "prompt_id": "000864-0061",
    "prompt": "painting of a holy woman, decorated, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k ",
    "classification": "People",
    "image_amount_in_total": 9,
    "rank": 5,
    "overall_rating": 4,
    "image_text_alignment_rating": 3,
    "fidelity_rating": 4
}
```

### Data Fields

* image: The image object
* prompt_id: The id of the corresponding prompt
* prompt: The text of the corresponding prompt
* classification: The classification of the corresponding prompt
* image_amount_in_total: Total amount of images related to the prompt
* rank: The relative rank of the image among all related images
* overall_rating: The overall score of this image
* image_text_alignment_rating: The score of how well the generated image matches the given text
* fidelity_rating: The score of whether the output image is true to the shape and characteristics that the object should have

### Data Splits

As mentioned above, the subsets at all scales have three splits: "train", "validation", and "test". All subsets share the same validation and test splits.

### Dataset Metadata

We also include three metadata tables, `metadata-train.parquet`, `metadata-validation.parquet`, and `metadata-test.parquet`, to help you access and comprehend ImageRewardDB without downloading the Zip files. All tables share the same schema, and each row refers to one image. The JSON files mentioned above share this schema as well:

|Column|Type|Description|
|:---|:---|:---|
|`image_path`|`string`|The relative path of the image in the repository.|
|`prompt_id`|`string`|The id of the corresponding prompt.|
|`prompt`|`string`|The text of the corresponding prompt.|
|`classification`|`string`|The classification of the corresponding prompt.|
|`image_amount_in_total`|`int`|Total amount of images related to the prompt.|
|`rank`|`int`|The relative rank of the image among all related images.|
|`overall_rating`|`int`|The overall score of this image.|
|`image_text_alignment_rating`|`int`|The score of how well the generated image matches the given text.|
|`fidelity_rating`|`int`|The score of whether the output image is true to the shape and characteristics that the object should have.|

Below is an example row from metadata-train.parquet.

|image_path|prompt_id|prompt|classification|image_amount_in_total|rank|overall_rating|image_text_alignment_rating|fidelity_rating|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|images/train/train_1/1b4b2d61-89c2-4091-a1c0-f547ad5065cb.webp|001324-0093|a magical forest that separates the good world from the dark world, ...|Outdoor Scenes|8|3|6|6|6|

## Loading ImageRewardDB

You can use the Hugging Face [Datasets](https://huggingface.co/docs/datasets/quickstart) library to easily load ImageRewardDB. As mentioned before, we provide four subsets at the scales of 1k, 2k, 4k, and 8k. You can load them as follows:

```python
from datasets import load_dataset

# Load the 1K-scale dataset
dataset = load_dataset("THUDM/ImageRewardDB", "1k")

# Load the 2K-scale dataset
dataset = load_dataset("THUDM/ImageRewardDB", "2k")

# Load the 4K-scale dataset
dataset = load_dataset("THUDM/ImageRewardDB", "4k")

# Load the 8K-scale dataset
dataset = load_dataset("THUDM/ImageRewardDB", "8k")
```

## Additional Information

### Licensing Information

The ImageRewardDB dataset is available under the [Apache license 2.0](https://www.apache.org/licenses/LICENSE-2.0.html). The Python code in this repository is available under the [MIT License](https://github.com/poloclub/diffusiondb/blob/main/LICENSE).

### Citation Information

```
@misc{xu2023imagereward,
      title={ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation},
      author={Jiazheng Xu and Xiao Liu and Yuchen Wu and Yuxuan Tong and Qinkai Li and Ming Ding and Jie Tang and Yuxiao Dong},
      year={2023},
      eprint={2304.05977},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
Provided by:
THUDM
AI-Collected Summary
Dataset Introduction
Construction
ImageRewardDB was built to quantitatively assess human preferences in text-to-image generation. Its builders designed a dedicated pipeline covering the establishment of quantitative assessment criteria, annotator training, optimization of the labeling experience, and quality validation. Based on model outputs and the corresponding text prompts from DiffusionDB, they collected 137k pairs of expert comparisons to form ImageRewardDB.
Features
ImageRewardDB is a comprehensive text-to-image comparison dataset focused on human preference for text-to-image generation. It offers four subsets at different scales to meet different needs; all subsets share the same validation and test splits, and each is carefully organized to pair images with their prompts and metadata. The dataset also ships detailed metadata tables so that users can access and understand the data without downloading the compressed archives.
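For example, the per-split metadata tables described in the card can be browsed on their own. Below is a minimal sketch, assuming the `huggingface_hub` and `pandas` libraries (with a parquet engine such as `pyarrow`) are installed; the file name comes from the dataset card above.

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Download only the training metadata table (a few MB) rather than the image archives.
metadata_path = hf_hub_download(
    repo_id="THUDM/ImageRewardDB",
    filename="metadata-train.parquet",
    repo_type="dataset",
)

# Each row describes one image: its path, prompt, rank, and the three ratings.
df = pd.read_parquet(metadata_path)
print(df.columns.tolist())
print(df[["prompt_id", "rank", "overall_rating"]].head())
```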
Usage
Users can load ImageRewardDB with the Hugging Face Datasets library. Four subsets are provided at the 1k, 2k, 4k, and 8k scales, and users can load whichever scale fits their needs. Once loaded, each example exposes the image, prompt text, prompt classification, rank, and rating fields for further data analysis and model training.
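A minimal sketch of loading the smallest configuration and reading one example's fields (the config and field names follow the dataset card above; loading downloads the image archives, and depending on the `datasets` version this script-based dataset may also require passing `trust_remote_code=True`):

```python
from datasets import load_dataset

# Load the smallest (1k-prompt) configuration; "2k", "4k", and "8k" work the same way.
dataset = load_dataset("THUDM/ImageRewardDB", "1k")

# Each example carries the decoded image plus its prompt and annotations.
example = dataset["train"][0]
print(example["prompt"])          # text prompt
print(example["classification"])  # prompt category, e.g. "People"
print(example["rank"], example["overall_rating"],
      example["image_text_alignment_rating"], example["fidelity_rating"])
image = example["image"]          # PIL image decoded from the WebP file
```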
Background and Challenges
Background Overview
The ImageRewardDB dataset was created in 2023 by THUDM (the Natural Language Processing and Social Humanities Computing Laboratory of the Department of Computer Science, Tsinghua University). It is a comprehensive comparison dataset focused on human preference in text-to-image generation, gathering 137,000 pairs of expert comparison results based on DiffusionDB model outputs, and was built to provide an evaluation benchmark for human preference learning in text-to-image generation. During construction, the research team designed a dedicated workflow, including establishing quantitative assessment criteria, training annotators, optimizing the labeling experience, and validating quality, to ensure the dataset's quality and usefulness. Its release is significant for advancing research in text-to-image generation and provides rich experimental resources for improving related algorithms.
Current Challenges
The research team faced multiple challenges while building ImageRewardDB. First, ensuring an accurate correspondence between collected images and their text prompts was a key issue. Second, quality control, especially maintaining consistency and accuracy throughout annotation, was a major challenge. In addition, given the dataset's large scale, the team had to manage and access the data efficiently and provide subsets at different scales for different research needs. At the research level, ImageRewardDB aims to address the limited understanding of human preference in text-to-image generation, and accurately capturing and quantifying these preferences remains one of the main challenges the dataset targets.
Common Scenarios
Classic Use Cases
In text-to-image generation, ImageRewardDB was built to quantitatively assess how well a model's output image matches the given text prompt. In its classic use case, researchers use the dataset for preference learning and evaluation across different text-to-image generation models, then optimize those models to produce images that better match human aesthetics and needs.
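As an illustration of how the per-prompt ranks could feed preference learning, the sketch below groups the metadata table by prompt and emits (preferred, rejected) image pairs. The helper function and its logic are assumptions for illustration, not part of the dataset tooling, and it assumes a lower `rank` value means a more preferred image, which matches the examples in the card but should be verified.

```python
from itertools import combinations

import pandas as pd


def build_preference_pairs(df: pd.DataFrame) -> list:
    """Group images by prompt and emit (preferred, rejected) pairs from their ranks."""
    pairs = []
    for _, group in df.groupby("prompt_id"):
        rows = group.sort_values("rank").to_dict("records")
        for better, worse in combinations(rows, 2):
            if better["rank"] == worse["rank"]:
                continue  # skip tied ranks
            pairs.append({
                "prompt": better["prompt"],
                "preferred": better["image_path"],  # lower rank = preferred by annotators
                "rejected": worse["image_path"],
            })
    return pairs


# Usage: pairs = build_preference_pairs(pd.read_parquet("metadata-train.parquet"))
```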
Practical Applications
In practice, ImageRewardDB can be used to train and evaluate generative adversarial networks (GANs) and other text-to-image generation models so that they produce more realistic images that better match user needs. For example, in art creation, game development, and virtual reality, the dataset helps produce high-quality scene and character images for different application scenarios.
Derived and Related Work
Building on ImageRewardDB, researchers can pursue further work such as developing new text-to-image generation algorithms, designing more effective evaluation schemes, and exploring the underlying regularities of human aesthetic preference. The dataset has already given rise to multiple academic papers and has had a far-reaching impact on technical progress and theory in text-to-image generation.
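The most direct derived work is the ImageReward reward model itself, trained on this dataset and distributed through the GitHub repository linked above. The following is a hedged sketch based on that repository's published quick start; it assumes the `image-reward` PyPI package is installed, the image paths are placeholders, and the exact API may differ between versions.

```python
import ImageReward as RM

# Load the released reward model; the checkpoint is downloaded on first use.
model = RM.load("ImageReward-v1.0")

prompt = "painting of a holy woman, highly detailed, concept art"  # illustrative prompt
images = ["generation_1.webp", "generation_2.webp"]                # placeholder local paths

# Higher scores mean the image better matches human preference for the prompt.
for image in images:
    print(image, model.score(prompt, image))
```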
The content above was collected and summarized by AI.
User Comments
Are there any related papers or references?
What is the background behind the creation of this dataset?
Who are the authors of this dataset?
Can you help me contact the dataset's authors?
How can this dataset be downloaded?
Data Topics
Embodied intelligence: 4,098 datasets, 8 institutions
Large models: 439 datasets, 10 institutions
UAVs: 37 datasets, 6 institutions
Instruction tuning: 36 datasets, 6 institutions
Protein structure: 50 datasets, 8 institutions
Spatial intelligence: 21 datasets, 5 institutions
Popular Datasets

Beijing Traffic

The Beijing Traffic Dataset collects traffic speeds at 5-minute granularity for 3126 roadway segments in Beijing between 2022/05/12 and 2022/07/25.

Indexed by Papers with Code

MOOCs Dataset

This dataset contains data related to massive open online courses (MOOCs), including course information, user behavior, and learning progress. It is mainly used to study behavioral patterns and learning outcomes in online education.

Indexed by www.kaggle.com

Breast Cancer Dataset

This project focuses on cleaning and transforming a breast cancer dataset originally obtained from the Institute of Oncology, University Medical Centre Ljubljana. The goal is to apply various data transformation techniques (such as categorization, encoding, and binarization) to produce a refined dataset that a data science team can use for future analysis.

Indexed by GitHub

China Air Quality Historical Data

This dataset contains historical air quality data for multiple Chinese cities, covering concentrations of pollutants such as PM2.5, PM10, SO2, NO2, CO, and O3, as well as the Air Quality Index (AQI). Data are recorded hourly, providing detailed air quality monitoring records.

Indexed by www.cnemc.cn

Traditional-Chinese-Medicine-Dataset-SFT

This is a high-quality Traditional Chinese Medicine (TCM) dataset composed mainly of in-house, non-web-sourced data, containing roughly 1GB of clinical cases, classic texts by renowned practitioners, medical encyclopedia entries, and term explanations across TCM fields. About 99% of the content is Simplified Chinese, with high quality and considerable information density. The dataset is intended for pretraining or continued pretraining; multi-turn dialogue and Q&A datasets for SFT/IFT will be released later. It can be used on its own, but it is recommended to continue-pretraining a model with the companion pretraining dataset before using this one for further instruction fine-tuning. The dataset also includes a certain proportion of Chinese common-sense data, Chinese multi-turn dialogue data, and classical-to-modern Chinese translation data to avoid catastrophic forgetting and strengthen model performance.

Indexed by huggingface