Conversation Genome Project | Conversational AI dataset | Natural language processing dataset
Conversation Genome Project dataset overview
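Dataset description
Conversation Genome Project (CGP) is an open-source project that aims to enable personalized conversational AI by providing a comprehensive dataset of indexed and tagged conversations. The project uses Bittensor infrastructure to annotate conversation-related data.
Key features
- Indexes and tags billions of conversations from a variety of sources (such as YouTube and podcasts)
- Uses fractal data mining and conversation windows for efficient, privacy-preserving processing
- Synthetic participant profiles generated from conversation metadata
- Algorithms that assess conversation quality (relevance, engagement, novelty, coherence, and fluency)
- Open-source datasets for training and fine-tuning conversational AI models
- A mining and validation system that incentivizes data contribution and integrity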
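System design
- Data stores: primary data sources, conversation windows, participant profiles, and a vector database
- Validator role: pulls data, generates overview metadata for the base conversation, creates windows, and scores submissions
- Miner role: processes conversation windows and provides metadata and tags
- Data flow: conversations are fetched from the CGP API, processed by an LLM to generate tags and metadata, and finally stored in the database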
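The windowing step that validators perform can be pictured roughly as follows. This is a minimal sketch, not CGP's actual implementation: the function name `make_windows`, the turn representation, the window size, and the overlap are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a conversation is a list of (speaker, text) turns.
Turn = Tuple[str, str]


def make_windows(turns: List[Turn], size: int = 4, overlap: int = 1) -> List[List[Turn]]:
    """Split a conversation into windows of `size` turns sharing `overlap` turns."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    windows: List[List[Turn]] = []
    for start in range(0, len(turns), step):
        windows.append(turns[start:start + size])
        if start + size >= len(turns):
            break  # this window already reaches the end of the conversation
    return windows


# Illustrative data only: ten placeholder turns from a single speaker.
conversation = [("A", f"turn {i}") for i in range(10)]
windows = make_windows(conversation, size=4, overlap=1)
```

Each window could then be handed to a miner on its own, so no single miner needs to see the whole conversation.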
Installation and compute requirements
- Requires Python 3.8 or later
- Miners and validators that use an OpenAI API key need at least 8 GB of RAM and 20 GB of disk space
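A quick pre-flight check of the Python version and free disk space could look like the sketch below. It uses only the standard library. The function name `check_environment` is an assumption and is not part of CGP. The RAM requirement is left out because the standard library has no portable way to query it.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)   # from the requirements above
MIN_DISK_GB = 20      # from the requirements above


def check_environment(path: str = ".") -> dict:
    """Report whether the interpreter and the free disk space meet the minimums."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "disk_ok": free_gb >= MIN_DISK_GB,
        "free_disk_gb": round(free_gb, 1),
    }


report = check_environment()
```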
Configuration
- Requires a configured .env file that contains the API keys and the LLM type selection
- Supported LLM types include OpenAI, Anthropic, and groq
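As a rough illustration of what such a file carries, the sketch below parses .env-style text and checks the selected LLM type against the three supported options. The key names `LLM_TYPE` and `OPENAI_API_KEY` are hypothetical placeholders. The project's own example .env file defines the real names.

```python
from typing import Dict

SUPPORTED_LLM_TYPES = {"openai", "anthropic", "groq"}


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def validate_llm_type(values: Dict[str, str], type_key: str = "LLM_TYPE") -> str:
    """Return the chosen LLM type, or raise if it is not a supported one."""
    llm_type = values.get(type_key, "").lower()
    if llm_type not in SUPPORTED_LLM_TYPES:
        raise ValueError(f"unsupported LLM type: {llm_type!r}")
    return llm_type


# Hypothetical key names and a placeholder key, for illustration only.
sample = """
# LLM selection
LLM_TYPE=openai
OPENAI_API_KEY="sk-placeholder"
"""
config = parse_env(sample)
llm_type = validate_llm_type(config)
```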
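Test run
- Run the test validator suite to check the configuration and environment setup
- The tests start a validator and a miner, process a conversation, and return metadata
Registration
- Register a UID on testnet or mainnet for mining or validating
Subnet roles
- Mining: start a miner with its specific command
- Validating: start a validator with its specific command
Custom conversation server
- Validators can run their own data source to process custom or proprietary conversation data
- An example implementation is provided and needs to be adapted to your requirements
Using Runpod
- Use Runpod to launch and manage cloud GPU and CPU instances
- Requires specific configuration settings, including port mapping and instance selection
Process management
- pm2 or Screen is recommended for managing processes
- Covers pm2 installation and basic usage commands
License
- The project is released under the MIT license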

Population and Housing Census of 2007 - Ethiopia
Geographic coverage
- National coverage
Analysis unit
- Household
- Person
- Housing unit
Universe
The census counted people on both a de jure and a de facto basis. The de jure population comprises all persons who belong to a given area at a given time by virtue of usual residence, while under the de facto approach people were counted as residents of the place where they were found. In the census, a person is a usual resident of a household (and hence of an area) if he or she has resided in the household continuously for at least six months before the census day, or intends to reside in the household for six months or longer. Visitors are therefore not included in the usual (de jure) population. Homeless persons were enumerated in the place where they spent the night on the enumeration day. The 2007 census counted foreign nationals who were residing in the city administration. Ethiopians living abroad were not counted.
Kind of data
- Census/enumeration data [cen]
Mode of data collection
- Face-to-face [f2f]
Research instrument
Two types of questionnaires were used to collect census data:
- Short questionnaire
- Long questionnaire
Unlike in the previous censuses, the contents of the short and long questionnaires were similar for urban and rural areas as well as for the entire city. The two questionnaires differ in the number of variables they contain. The short questionnaire was used to collect basic data on population characteristics, such as population size, sex, age, language, ethnic group, religion, orphanhood and disability. The long questionnaire contains the same questions as the short questionnaire and adds information on marital status, education, economic activity, migration, fertility, mortality, and housing stocks and conditions.
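The usual-residence rule in the Universe paragraph combines two conditions with an "or". A minimal sketch of that rule follows. The function and parameter names are illustrative and are not taken from the census documentation.

```python
def is_usual_resident(months_resided: float, intends_six_months_or_longer: bool) -> bool:
    """De jure rule as described: six months of continuous residence before
    the census day, or an intention to reside for six months or longer."""
    return months_resided >= 6 or intends_six_months_or_longer


# A recent arrival who plans to stay counts; a short-term visitor does not.
recent_arrival = is_usual_resident(1, True)
visitor = is_usual_resident(2, False)
```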
Source: catalog.ihsn.org
UniMed
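UniMed is a large-scale, open-source multimodal medical dataset containing more than 5.3 million image-text pairs across six medical imaging modalities: X-ray, CT, MRI, ultrasound, pathology, and fundus. It was built by using large language models (LLMs) to convert modality-specific classification datasets into image-text format and combining them with existing image-text data from the medical domain, in order to support scalable vision-language model (VLM) pretraining.
Source: GitHub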
poi
This project collects points of interest (POI) in China. The data in the current version comes from OpenStreetMap.
Source: GitHub
Analog Circuit Fault Diagnosis Dataset
The simulation experiment is based on Cadence 16.6 software. The tolerance of the resistance (R) is set to 5%, the tolerance of the capacitance (C) is set to 10%, the input is a single-pulse signal (amplitude 5 V, pulse width 10 µs, period 2 ms), and the working temperature is set to 27 °C. The operational amplifier (op-amp) uses the actual UA741 PSpice model. The experiment covers soft fault diagnosis of a Sallen-Key band-pass filter circuit (TC1), a four-op-amp biquad high-pass filter circuit (TC2), and a leap-frog low-pass filter circuit (TC3). The dataset is a CSV file in which each row represents one sample: the last column is the data label and the remaining columns are the feature values.
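Given that layout, a loader only needs to split each row into its leading feature values and its final label. The sketch below uses the standard `csv` module. It assumes the file has no header row and that features are numeric, neither of which the description confirms. The sample rows and the labels `F0` and `F1` are made up for illustration.

```python
import csv
import io
from typing import List, Tuple


def load_features_and_labels(handle) -> Tuple[List[List[float]], List[str]]:
    """Read rows whose last column is the label and whose other columns are features."""
    features: List[List[float]] = []
    labels: List[str] = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        features.append([float(value) for value in row[:-1]])
        labels.append(row[-1])
    return features, labels


# Made-up rows standing in for the real file.
sample = io.StringIO("0.12,0.98,1.50,F0\n0.10,1.02,1.47,F1\n")
X, y = load_features_and_labels(sample)
```

For the real dataset, the `io.StringIO` object would be replaced by an opened file handle, for example `open(path, newline="")`.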
Source: Mendeley Data
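AIS dataset
The study uses several publicly available AIS datasets that were filtered, cleaned, and statistically analyzed. The datasets cover many vessel types and provide key information on vessel position, speed, and heading. They include AIS messages from 19,185 vessels, about 640 million records in total.
Source: GitHub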