AudioSetCaps|音频-语言多模态数据集|多模态数据数据集
huggingface2024-10-23 更新2024-10-24 收录380
资源简介:
AudioSetCaps是一个丰富的音频-字幕数据集,通过使用大型音频和语言模型的自动化生成管道创建。数据集包含6,117,099个10秒的音频文件,这些文件来源于AudioSet、YouTube-8M和VGGSound。此外,数据集还提供了每个音频的中间Q&A结果,总计18,414,789对Q&A数据。数据集旨在促进未来音频-语言多模态研究的发展。
原始地址:
https://huggingface.co/datasets/baijs/AudioSetCaps
开放时间:
2024-10-21
创建时间:
2024-10-21
AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models
数据集概述
- 数据来源: AudioSet, YouTube-8M, VGGSound
- 音频文件数量: 6,117,099个10秒音频文件
- Q&A数据对数量: 18,414,789对
数据集内容
- 音频描述: 每个音频文件附有一个描述性标题。
- Q&A对: 每个音频文件附有三个Q&A对,作为生成最终标题的元数据。
示例
ID | 音频 | 描述 | Q&A 1 | Q&A 2 | Q&A 3 |
---|---|---|---|---|---|
_7Xe9vD3Hpg_4_10 | <audio controls><source src="Example /_7Xe9vD3Hpg_4_10.mp3" type="audio/mpeg"> Your browser does not support the audio element.</audio> | A solemn instrumental piece unfolds, featuring the melancholic strains of a cello and the resonant tolling of a bell. The initial tempo is slow and deliberate, gradually building intensity with each successive bell ring. | Question: Describe this audio according to the sounds in it. Answer: The music starts with a slow string melody and continues with a bass note. The sound of a bell rings and the music becomes more intense. | Question: Based on the QAs, give some information about the speech, such as the emotion of the speaker, the gender of the speaker, and the spoken language, only if speech is present in this audio. Answer: Im sorry, but there is no speech in the audio. | Question: Based on the QAs, give some information about the music, such as music genre and music instruments, only if music is present in this audio. Answer: The music genre is instrumental. The music instruments are the cello and the bell. |
-TL8Mp3xcUM_0_10 | <audio controls><source src="Example/-TL8Mp3xcUM_0_10.mp3" type="audio/mpeg"> Your browser does not support the audio element.</audio> | A woman expresses strong emotions with a noticeably high-pitched vocal tone. | Question: Describe this audio according to the sounds in it. Answer: A woman speaks with a high-pitched voice. | Question: Based on the QAs, give some information about the speech, such as the emotion of the speaker, the gender of the speaker, and the spoken language, only if speech is present in this audio. Answer: The speech is emotional, as the woman speaks in a high-pitched voice. | Question: Based on the QAs, give some information about the music, such as music genre and music instruments, only if music is present in this audio. Answer: There is no music in this audio. |
数据统计
数据集 | 音频描述数量 | Q&A描述数量 | 总计 |
---|---|---|---|
AudioSetCaps | 1910920 | 5736072 | 7646992 |
YouTube-8M | 4023990 | 12086037 | 16110027 |
VGGSound | 182189 | 592680 | 774869 |
总计 | 6117099 | 18414789 | 24531888 |
下载
许可证
- 仅允许学术用途
引用
bibtex @inproceedings{ bai2024audiosetcaps, title={AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models}, author={JISHENG BAI and Haohe Liu and Mou Wang and Dongyuan Shi and Wenwu Wang and Mark D Plumbley and Woon-Seng Gan and Jianfeng Chen}, booktitle={Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation}, year={2024}, url={https://openreview.net/forum?id=uez4PMZwzP} }