AudioSetCaps | Audio-language multimodal dataset | Multimodal dataset
AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models
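Dataset overview
- Data sources: AudioSet, YouTube-8M, VGGSound
- Number of audio files: 6,117,099 ten-second audio files
- Number of Q&A pairs: 18,414,789
Dataset contents
- Audio captions: each audio file comes with one descriptive caption.
- Q&A pairs: each audio file comes with three Q&A pairs, which serve as the metadata used to generate the final caption.
Examples
ID | Audio | Caption | Q&A 1 | Q&A 2 | Q&A 3 |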
---|---|---|---|---|---|
_7Xe9vD3Hpg_4_10 | <audio controls><source src="Example/_7Xe9vD3Hpg_4_10.mp3" type="audio/mpeg"> Your browser does not support the audio element.</audio> | A solemn instrumental piece unfolds, featuring the melancholic strains of a cello and the resonant tolling of a bell. The initial tempo is slow and deliberate, gradually building intensity with each successive bell ring. | Question: Describe this audio according to the sounds in it. Answer: The music starts with a slow string melody and continues with a bass note. The sound of a bell rings and the music becomes more intense. | Question: Based on the QAs, give some information about the speech, such as the emotion of the speaker, the gender of the speaker, and the spoken language, only if speech is present in this audio. Answer: Im sorry, but there is no speech in the audio. | Question: Based on the QAs, give some information about the music, such as music genre and music instruments, only if music is present in this audio. Answer: The music genre is instrumental. The music instruments are the cello and the bell. |
-TL8Mp3xcUM_0_10 | <audio controls><source src="Example/-TL8Mp3xcUM_0_10.mp3" type="audio/mpeg"> Your browser does not support the audio element.</audio> | A woman expresses strong emotions with a noticeably high-pitched vocal tone. | Question: Describe this audio according to the sounds in it. Answer: A woman speaks with a high-pitched voice. | Question: Based on the QAs, give some information about the speech, such as the emotion of the speaker, the gender of the speaker, and the spoken language, only if speech is present in this audio. Answer: The speech is emotional, as the woman speaks in a high-pitched voice. | Question: Based on the QAs, give some information about the music, such as music genre and music instruments, only if music is present in this audio. Answer: There is no music in this audio. |
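
Each Q&A cell in the examples follows a `Question: ... Answer: ...` pattern. The following is a minimal sketch of splitting such a cell into its two parts. It assumes the released metadata stores Q&A pairs in the same string form as the table shows, which this page does not confirm; `QAPair` and `parse_qa` are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class QAPair(NamedTuple):
    """Hypothetical container for one question/answer pair."""
    question: str
    answer: str


def parse_qa(cell: str) -> QAPair:
    """Split a 'Question: ... Answer: ...' cell into question and answer.

    Assumption: the cell uses exactly the markers shown in the example table.
    """
    text = cell.strip()
    if not text.startswith("Question:"):
        raise ValueError("cell does not start with 'Question:'")
    # Split on the first ' Answer:' marker; the answer itself may contain colons.
    question_part, sep, answer_part = text[len("Question:"):].partition(" Answer:")
    if not sep:
        raise ValueError("cell has no 'Answer:' marker")
    return QAPair(question_part.strip(), answer_part.strip())


# One Q&A cell copied from the example table.
example = (
    "Question: Describe this audio according to the sounds in it. "
    "Answer: A woman speaks with a high-pitched voice."
)
pair = parse_qa(example)
print(pair.question)
print(pair.answer)
```

Splitting on the first `Answer:` marker keeps the answer text intact even when it contains further colons; a cell that does not match the pattern raises `ValueError` rather than being silently truncated.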
Dataset statistics
Dataset | Number of audio captions | Number of Q&A captions | Total |
---|---|---|---|
AudioSetCaps | 1910920 | 5736072 | 7646992 |
YouTube-8M | 4023990 | 12086037 | 16110027 |
VGGSound | 182189 | 592680 | 774869 |
Total | 6117099 | 18414789 | 24531888 |
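
As a quick consistency check, the per-source counts in the statistics table can be summed and compared with the totals row. This is only an illustrative sketch: the numbers are copied from the table, and the variable names are hypothetical.

```python
# (audio caption count, Q&A count) per source, copied from the statistics table.
counts = {
    "AudioSetCaps": (1_910_920, 5_736_072),
    "YouTube-8M": (4_023_990, 12_086_037),
    "VGGSound": (182_189, 592_680),
}

total_captions = sum(captions for captions, _ in counts.values())
total_qa = sum(qa for _, qa in counts.values())
grand_total = total_captions + total_qa

print(total_captions, total_qa, grand_total)
# prints: 6117099 18414789 24531888
```

The sums match the totals row of the table.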
Download
License
- Academic use only
Citation
@inproceedings{bai2024audiosetcaps,
  title={AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models},
  author={JISHENG BAI and Haohe Liu and Mou Wang and Dongyuan Shi and Wenwu Wang and Mark D Plumbley and Woon-Seng Gan and Jianfeng Chen},
  booktitle={Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation},
  year={2024},
  url={https://openreview.net/forum?id=uez4PMZwzP}
}
