HPAI-BSC/medical-fields | Medical classification dataset | Language model dataset

Source: Hugging Face. Updated 2024-07-11; indexed 2024-07-13.

Medical classification

Language models

Download link:

https://hf-mirror.com/datasets/HPAI-BSC/medical-fields


Overview:

This dataset is designed for evaluating medical language models. It merges several of the most important medical QA datasets into a common format and classifies their questions into 35 distinct medical categories. This structure lets users identify the specific categories where a model's performance is lacking and address them accordingly. Each record contains a unique id, the question text, four options, the correct option, the source dataset name, the predicted medical field, the chain of thought (CoT) behind that prediction, and the cumulative log probability of that CoT. The Llama-3-70B-Instruct model was used to classify the questions into predefined medical fields; the source datasets and the prompt configuration are listed below.

Provider:

HPAI-BSC

Summary of the original dataset card

Medical Question Classification Dataset

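Dataset overview

The dataset is intended for evaluating medical language models. It merges several important medical QA datasets into a common format and classifies the questions into 35 distinct medical categories, so that users can see in which categories a model underperforms and address those gaps.

Dataset structure

Data fields

id: Unique identifier for each question.
question: The medical question.
op1: First option for the question.
op2: Second option for the question.
op3: Third option for the question.
op4: Fourth option for the question.
cop: Correct option (1, 2, 3, or 4).
dataset: Name of the source dataset.
medical_field: Predicted medical field for the question.
cot_medical_field: Chain of thought (CoT) for the medical field.
cumulative_logprob_cot_medical_field: Cumulative log probability of the medical-field CoT.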
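The `cop` and `cumulative_logprob_cot_medical_field` fields can be read as in the minimal Python sketch below. It assumes that `cop` is a 1-based index into `op1` to `op4`, as the field list states, and that the log probability uses natural logarithms, which the source does not confirm. The helper names `correct_option_text` and `cot_sequence_probability` are hypothetical; the sample values are copied from the dataset card's example instance.

```python
import math


def correct_option_text(record: dict) -> str:
    """Return the text of the correct option, treating `cop` as 1-based."""
    cop = int(record["cop"])
    if cop not in (1, 2, 3, 4):
        raise ValueError(f"cop must be 1-4, got {cop!r}")
    return record[f"op{cop}"]


def cot_sequence_probability(record: dict) -> float:
    """Convert the cumulative log probability of the CoT to a probability.

    Assumes natural logarithms; treat the result as a rough confidence
    proxy for the predicted medical field, nothing more.
    """
    return math.exp(record["cumulative_logprob_cot_medical_field"])


# Values copied from the example instance in the dataset card.
example = {
    "op1": "Disclose the error to the patient and put it in the operative report",
    "op2": "Tell the attending that he cannot fail to disclose this mistake",
    "op3": "Report the physician to the ethics committee",
    "op4": "Refuse to dictate the operative report",
    "cop": 2,
    "cumulative_logprob_cot_medical_field": -2.603069230914116,
}

print(correct_option_text(example))
print(round(cot_sequence_probability(example), 4))
```

A low sequence probability could be used to flag classifications worth a manual check, though the source does not say whether the authors filtered on it.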

Example instance

```json
[
  {
    "id": "test-00000",
    "question": "A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?",
    "op1": "Disclose the error to the patient and put it in the operative report",
    "op2": "Tell the attending that he cannot fail to disclose this mistake",
    "op3": "Report the physician to the ethics committee",
    "op4": "Refuse to dictate the operative report",
    "cop": 2,
    "dataset": "medqa_4options_test",
    "medical_field": "Surgery",
    "cot_medical_field": "This question involves a scenario related to surgical procedures and reporting complications, which falls under the category of Surgery. The category is: Surgery",
    "cumulative_logprob_cot_medical_field": -2.603069230914116
  }
]
```
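Because every record carries a `medical_field`, a model's answers can be scored per category to find weak areas. The sketch below is one way to do that, assuming predictions are available as a mapping from record `id` to a chosen option number (1 to 4). The function name `accuracy_by_field` and the toy records are illustrative and not part of the dataset.

```python
from collections import defaultdict


def accuracy_by_field(records, predictions):
    """Compute accuracy per `medical_field`.

    records: iterable of dicts with "id", "cop", and "medical_field".
    predictions: dict mapping record id -> predicted option (1-4).
    Records without a prediction count as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        field = rec["medical_field"]
        total[field] += 1
        if predictions.get(rec["id"]) == rec["cop"]:
            correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Toy records for illustration only; not taken from the dataset.
toy_records = [
    {"id": "a", "cop": 2, "medical_field": "Surgery"},
    {"id": "b", "cop": 1, "medical_field": "Surgery"},
    {"id": "c", "cop": 4, "medical_field": "Cardiology"},
]
toy_predictions = {"a": 2, "b": 3, "c": 4}

scores = accuracy_by_field(toy_records, toy_predictions)
# List the weakest categories first.
for field, acc in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{field}: {acc:.2f}")
```

Sorting by accuracy puts the categories that most need attention at the top; with many small categories it may also be worth reporting the per-field question count alongside each score.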

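Dataset creation

The dataset was created by using the Llama-3-70B-Instruct model to classify medical questions into predefined medical fields. The process consists of downloading the source datasets from HuggingFace, classifying the questions according to the fields specified in the configuration file, and creating the merged dataset.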
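The merging step can be pictured with the sketch below. It is not the authors' pipeline: the per-source `mapping` dictionaries, the source keys (`q`, `label`, `text`, `ans`, and so on), the `answer_offset` option, and the id scheme are all hypothetical, chosen only to show how differently shaped records could be mapped onto the unified fields `id`, `question`, `op1` to `op4`, `cop`, and `dataset`.

```python
def to_common_format(raw: dict, mapping: dict, dataset_name: str, index: int) -> dict:
    """Map one source record onto the unified schema.

    `mapping` names the source keys that hold the question, the four
    options, and the answer. `answer_offset` shifts the source answer
    index so that `cop` ends up in the range 1-4.
    """
    options = [raw[key] for key in mapping["options"]]
    if len(options) != 4:
        raise ValueError("expected exactly four options")
    record = {
        "id": f"{dataset_name}-{index:05d}",  # hypothetical id scheme
        "question": raw[mapping["question"]],
        "cop": int(raw[mapping["answer"]]) + mapping.get("answer_offset", 0),
        "dataset": dataset_name,
    }
    for position, text in enumerate(options, start=1):
        record[f"op{position}"] = text
    return record


def merge_sources(sources: dict) -> list:
    """Normalize every source and concatenate the results."""
    merged = []
    for dataset_name, (rows, mapping) in sources.items():
        for index, raw in enumerate(rows):
            merged.append(to_common_format(raw, mapping, dataset_name, index))
    return merged


# Toy inputs with made-up keys; real source schemas differ and must be checked.
toy_sources = {
    "source_a": (
        [{"q": "Question A?", "a": "w", "b": "x", "c": "y", "d": "z", "label": 0}],
        {"question": "q", "options": ["a", "b", "c", "d"],
         "answer": "label", "answer_offset": 1},  # 0-based answer in this source
    ),
    "source_b": (
        [{"text": "Question B?", "o1": "w", "o2": "x", "o3": "y", "o4": "z", "ans": 3}],
        {"question": "text", "options": ["o1", "o2", "o3", "o4"], "answer": "ans"},
    ),
}
merged = merge_sources(toy_sources)
print(merged[0]["id"], merged[0]["cop"])
```

The classification fields (`medical_field`, `cot_medical_field`, and the log probability) would be added to each record afterwards by the model.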

Datasets used

CareQA: https://huggingface.co/datasets/HPAI-BSC/CareQA (CareQA_en.json)
headqa_test: https://huggingface.co/datasets/openlifescienceai/headqa (test split)
medmcqa_validation: https://huggingface.co/datasets/openlifescienceai/medmcqa (validation split)
medqa_4options_test: https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options-hf (test split)
mmlu_anatomy_test: https://huggingface.co/datasets/openlifescienceai/mmlu_anatomy (test split)
mmlu_clinical_knowledge_test: https://huggingface.co/datasets/openlifescienceai/mmlu_clinical_knowledge (test split)
mmlu_college_medicine_test: https://huggingface.co/datasets/openlifescienceai/mmlu_college_medicine (test split)
mmlu_medical_genetics_test: https://huggingface.co/datasets/openlifescienceai/mmlu_medical_genetics (test split)
mmlu_professional_medicine_test: https://huggingface.co/datasets/openlifescienceai/mmlu_professional_medicine (test split)

Prompt configuration

```yaml
system_prompt: "You are a medical assistant tasked with classifying medical questions into specific categories. You will be given a medical question. Your job is to categorize the question into one of the following categories: MEDICAL_FIELDS. Ensure that your output includes a step-by-step explanation of your reasoning process followed by the final category. Provide the name of the category as a single word and nothing else. If you have any doubts or the question does not fit clearly into one category, respond with The category is: None. End your response with The category is: <category>."
fewshot_examples:
  - question: "What are the common symptoms of a myocardial infarction?"
    answer: "Myocardial infarction refers to a heart attack, which is a condition related to the heart. Heart conditions are categorized under Cardiology. The category is: Cardiology"
  - question: "What is the first-line treatment for type 2 diabetes?"
    answer: "Type 2 diabetes is a metabolic disorder that involves insulin regulation. Disorders related to metabolism and insulin are categorized under Endocrinology. The category is: Endocrinology"
  - question: "What are the stages of non-small cell lung cancer?"
    answer: "Non-small cell lung cancer is a type of cancer. The staging of cancer is a process that falls under the field of Oncology. The category is: Oncology"
  - question: "How is rheumatoid arthritis diagnosed?"
    answer: "Rheumatoid arthritis is an autoimmune disease that affects the joints. Diseases affecting the joints and autoimmune conditions are categorized under Rheumatology. The category is: Rheumatology"
  - question: "What are the side effects of the MMR vaccine?"
    answer: "The MMR vaccine triggers immune responses to prevent measles, mumps, and rubella. Immune responses and vaccinations are categorized under Immunology. The category is: Immunology"
  - question: "What is the capital of France?"
    answer: "The question is unrelated to medical fields and does not fit into any medical category. The category is: None"
  - question: "Waht are l"
    answer: "The question is incomplete and contains significant typos, making it unclear and impossible to categorize. The category is: None"
# Single quotes keep the backslash literal; \w is not a valid escape in a double-quoted YAML string.
regex: 'The category is: (?P<category>\w+)'
```
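The configuration above implies two small steps around each model call: filling the `MEDICAL_FIELDS` placeholder in the system prompt, and extracting the category from the response with the configured regex. The sketch below illustrates both using Python's standard `re` module. The function names are hypothetical, and joining the fields with commas and taking the last match in the response are assumptions rather than documented behavior.

```python
import re
from typing import Optional

# Same pattern as the `regex` entry in the prompt configuration.
CATEGORY_RE = re.compile(r"The category is: (?P<category>\w+)")


def build_system_prompt(template: str, fields: list) -> str:
    """Substitute the MEDICAL_FIELDS placeholder with a comma-separated list."""
    return template.replace("MEDICAL_FIELDS", ", ".join(fields))


def parse_category(response: str) -> Optional[str]:
    """Return the final category named in the response.

    Returns None when the pattern is missing or the model answered "None".
    The last match is used because the prompt asks the model to end with
    the category line (an assumption about how ties should be resolved).
    """
    matches = CATEGORY_RE.findall(response)
    if not matches:
        return None
    category = matches[-1]
    return None if category == "None" else category


reply = (
    "Myocardial infarction refers to a heart attack, which is a condition "
    "related to the heart. The category is: Cardiology"
)
print(parse_category(reply))
print(build_system_prompt("one of: MEDICAL_FIELDS.", ["Cardiology", "Surgery"]))
```

Note that `\w+` captures a single word only, which matches the prompt's instruction to give the category as a single word; a multi-word category name would be cut at the first space.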

Dataset statistics

[Figure: dataset statistics; image not included.]

Citation

If you use this dataset, please cite:

```bibtex
@misc{gururajan2024aloe,
  title={Aloe: A Family of Fine-tuned Open Healthcare LLMs},
  author={Ashwin Kumar Gururajan and Enrique Lopez-Cuena and Jordi Bayarri-Planas and Adrian Tormos and Daniel Hinjos and Pablo Bernabeu-Perez and Anna Arias-Duart and Pablo Agustin Martin-Torres and Lucia Urcelay-Ganzabal and Marta Gonzalez-Mallo and Sergio Alvarez-Napagao and Eduard Ayguadé-Parra and Ulises Cortés and Dario Garcia-Gasulla},
  year={2024},
  eprint={2405.01886},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

