cogsci13/Amazon-Reviews-2023-Books-Meta

Name: cogsci13/Amazon-Reviews-2023-Books-Meta
Creator: cogsci13
Published: 2024-06-12T00:32:09+08:00

Source: Hugging Face | Updated: 2024-04-18 | Indexed: 2024-06-12

Book reviews

Recommender systems

Data link:

https://hf-mirror.com/datasets/cogsci13/Amazon-Reviews-2023-Books-Meta

Resource description:

---
language:
- en
tags:
- recommendation
- reviews
size_categories:
- 100M<n<1B
---

# Amazon Reviews 2023 (Books Only)

**This is a subset of the Amazon Reviews 2023 dataset. Please visit [amazon-reviews-2023.github.io/](https://amazon-reviews-2023.github.io/) for more details, loading scripts, and preprocessed benchmark files.**

**[April 18, 2024]** Update
1. This dataset was created and pushed for the first time.

---

This is a large-scale **Amazon Reviews** dataset, collected in **2023** by [McAuley Lab](https://cseweb.ucsd.edu/~jmcauley/), and it includes rich features such as:
1. **User Reviews** (*ratings*, *text*, *helpfulness votes*, etc.);
2. **Item Metadata** (*descriptions*, *price*, *raw image*, etc.).

## What's New?

In Amazon Reviews'23, we provide:
1. **Larger Dataset:** We collected 571.54M reviews, 245.2% larger than the last version;
2. **Newer Interactions:** Current interactions range from May 1996 to Sep. 2023;
3. **Richer Metadata:** More descriptive features in item metadata;
4. **Fine-grained Timestamp:** Interaction timestamp at the second or finer level;
5. **Cleaner Processing:** Cleaner item metadata than previous versions;
6. **Standard Splitting:** Standard data splits to encourage RecSys benchmarking.

## Basic Statistics

> We define #R_Tokens as the number of [tokens](https://pypi.org/project/tiktoken/) in user reviews and #M_Tokens as the number of [tokens](https://pypi.org/project/tiktoken/) when the dictionaries of item attributes are treated as strings. We emphasize them as important statistics in the era of LLMs.
>
> We count the number of items based on user reviews rather than item metadata files. Note that some items lack metadata.
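The token statistics defined above can be approximated along the following lines. This is a minimal sketch, not the authors' counting script: the helper names `count_review_tokens` and `count_meta_tokens` are hypothetical, counting only the `text` field and serializing metadata with `str(...)` are assumptions, and the whitespace tokenizer is only a stand-in so the example runs without extra dependencies.

```python
from typing import Callable, Iterable, Mapping, Sequence

# Any callable that maps a string to a sequence of tokens can be plugged in.
Encoder = Callable[[str], Sequence]


def whitespace_encode(text: str) -> list:
    """Stand-in tokenizer (assumption): split on whitespace."""
    return text.split()


def count_review_tokens(reviews: Iterable[Mapping], encode: Encoder = whitespace_encode) -> int:
    """Sum token counts over the 'text' field of each review (#R_Tokens-style)."""
    return sum(len(encode(r.get("text") or "")) for r in reviews)


def count_meta_tokens(items: Iterable[Mapping], encode: Encoder = whitespace_encode) -> int:
    """Sum token counts over each metadata dict rendered as a string (#M_Tokens-style)."""
    return sum(len(encode(str(dict(m)))) for m in items)


# Tiny illustrative inputs; a missing review text counts as zero tokens.
reviews = [{"text": "The entire book just seems off."}, {"text": None}]
items = [{"title": "Chaucer", "price": "8.23"}]

r_tokens = count_review_tokens(reviews)
m_tokens = count_meta_tokens(items)
print(r_tokens, m_tokens)
```

To get closer to the linked tiktoken library, one could pass `tiktoken.get_encoding("cl100k_base").encode` as `encode`; which encoding the authors used is not stated here, so that choice is also an assumption.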
### Grouped by Category

| Category | #User | #Item | #Rating | #R_Token | #M_Token | Download |
| -------- | ----: | ----: | ------: | -------: | -------: | -------: |
| Books | 10.3M | 4.4M | 29.5M | 2.9B | 3.7B | <a href='https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/review_categories/Books.jsonl.gz' download> review</a>, <a href='https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/meta_categories/meta_Books.jsonl.gz' download> meta </a> |

> Check Pure ID files and corresponding data splitting strategies in [Common Data Processing](https://amazon-reviews-2023.github.io/data_processing/index.html) section.

## Quick Start

### Load User Reviews

```python
from datasets import load_dataset

dataset = load_dataset("cogsci13/Amazon-Reviews-2023-Books-Review", "raw_review_Books", trust_remote_code=True)
print(dataset["full"][0])
```

```json
{'rating': {0: 1.0}, 'title': {0: 'Not a watercolor book! Seems like copies imo.'}, 'text': {0: 'It is definitely not a watercolor book. The paper bucked completely. The pages honestly appear to be photo copies of other pictures. I say that bc if you look at the seal pics you can see the tell tale line at the bottom of the page. As someone who has made many photocopies of pages in my time so I could try out different colors & mediums that black line is a dead giveaway to me. It’s on other pages too. The entire book just seems off. Nothing is sharp & clear. There is what looks like toner dust on all the pages making them look muddy. There are no sharp lines & there is no clear definition. At least there isn’t in my copy. And the Coloring Book for Adult on the bottom of the front cover annoys me. Why is it singular & not plural? They usually say coloring book for kids or coloring book for kids & adults or coloring book for adults- plural. Lol Plus it would work for kids if you can get over the grey scale nature of it. 
Personally I’m not going to waste expensive pens & paints trying to paint over the grey & black mess. I grew up in SW Florida minutes from the beaches & I was really excited about the sea life in this. I hope the printers & designers figure out how to clean up the mess bc some of the designs are really cute. They just aren’t worth my time to hand trace & transfer them, but I’m sure there are ppl that will be up to the challenge. This is one is a hard no. Going back. I tried.'}, 'images': {0: array([{'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/516HBU7LQoL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/516HBU7LQoL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/516HBU7LQoL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/71+XwcacMmL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/71+XwcacMmL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/71+XwcacMmL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/71RbTuvD1ZL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/71RbTuvD1ZL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/71RbTuvD1ZL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/71U63wdOeZL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/71U63wdOeZL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/71U63wdOeZL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/71WFEDyKcKL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/71WFEDyKcKL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/71WFEDyKcKL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 
'https://m.media-amazon.com/images/I/8109NwjpHKL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/8109NwjpHKL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/8109NwjpHKL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/814gxfh8wcL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/814gxfh8wcL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/814gxfh8wcL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81HC0vKRC2L._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81HC0vKRC2L._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81HC0vKRC2L._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81Nx6BnRLxL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81Nx6BnRLxL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81Nx6BnRLxL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81QQMwBcVPL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81QQMwBcVPL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81QQMwBcVPL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81fgT3R3OwL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81fgT3R3OwL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81fgT3R3OwL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81mfzny0I5L._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81mfzny0I5L._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81mfzny0I5L._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 
'https://m.media-amazon.com/images/I/81nir7bf91L._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81nir7bf91L._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81nir7bf91L._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81yLUo6ZL3L._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81yLUo6ZL3L._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81yLUo6ZL3L._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/81zh9h5RwkL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/81zh9h5RwkL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/81zh9h5RwkL._SL256_.jpg'}, {'attachment_type': 'IMAGE', 'large_image_url': 'https://m.media-amazon.com/images/I/91yfcpFlEqL._SL1600_.jpg', 'medium_image_url': 'https://m.media-amazon.com/images/I/91yfcpFlEqL._SL800_.jpg', 'small_image_url': 'https://m.media-amazon.com/images/I/91yfcpFlEqL._SL256_.jpg'}], dtype=object)}, 'asin': {0: 'B09BGPFTDB'}, 'parent_asin': {0: 'B09BGPFTDB'}, 'user_id': {0: 'AFKZENTNBQ7A7V7UXW5JJI6UGRYQ'}, 'timestamp': {0: 1642399598485}, 'helpful_vote': {0: 0}, 'verified_purchase': {0: True}}
```

### Load Item Metadata

```python
from datasets import load_dataset

dataset = load_dataset("cogsci13/Amazon-Reviews-2023-Books-Meta", "raw_meta_Books", split="full", trust_remote_code=True)
print(dataset[0])
```

```json
{'main_category': {0: 'Books'}, 'title': {0: 'Chaucer'}, 'average_rating': {0: 4.5}, 'rating_number': {0: 29}, 'features': {0: array([], dtype=object)}, 'description': {0: array([], dtype=object)}, 'price': {0: '8.23'}, 'images': {0: {'hi_res': array([None], dtype=object), 'large': array(['https://m.media-amazon.com/images/I/41X61VPJYKL._SX334_BO1,204,203,200_.jpg'], dtype=object), 'thumb': array([None], dtype=object), 'variant': array(['MAIN'], dtype=object)}}, 'videos': {0: {'title': array([], dtype=object), 'url': 
array([], dtype=object), 'user_id': array([], dtype=object)}}, 'store': {0: 'Peter Ackroyd (Author)'}, 'categories': {0: array(['Books', 'Literature & Fiction', 'History & Criticism'], dtype=object)}, 'details': {0: '{"Publisher": "Chatto & Windus; First Edition (January 1, 2004)", "Language": "English", "Hardcover": "196 pages", "ISBN 10": "0701169850", "ISBN 13": "978-0701169855", "Item Weight": "10.1 ounces", "Dimensions": "5.39 x 0.71 x 7.48 inches"}'}, 'parent_asin': {0: '0701169850'}, 'bought_together': {0: None}, 'subtitle': {0: 'Hardcover – Import, January 1, 2004'}, 'author': {0: "{'avatar': 'https://m.media-amazon.com/images/I/21Je2zja9pL._SY600_.jpg', 'name': 'Peter Ackroyd', 'about': ['Peter Ackroyd, (born 5 October 1949) is an English biographer, novelist and critic with a particular interest in the history and culture of London. For his novels about English history and culture and his biographies of, among others, William Blake, Charles Dickens, T. S. Eliot and Sir Thomas More, he won the Somerset Maugham Award and two Whitbread Awards. He is noted for the volume of work he has produced, the range of styles therein, his skill at assuming different voices and the depth of his research.', 'He was elected a fellow of the Royal Society of Literature in 1984 and appointed a Commander of the Order of the British Empire in 2003.', 'Bio from Wikipedia, the free encyclopedia.']}"}}
```

> Check data loading examples and Huggingface datasets APIs in [Common Data Loading](https://amazon-reviews-2023.github.io/data_loading/index.html) section.

## Data Fields

### For User Reviews

| Field | Type | Explanation |
| ----- | ---- | ----------- |
| rating | float | Rating of the product (from 1.0 to 5.0). |
| title | str | Title of the user review. |
| text | str | Text body of the user review. |
| images | list | Images that users post after they have received the product. Each image has different sizes (small, medium, large), represented by the small_image_url, medium_image_url, and large_image_url respectively. |
| asin | str | ID of the product. |
| parent_asin | str | Parent ID of the product. Note: Products with different colors, styles, sizes usually belong to the same parent ID. The “asin” in previous Amazon datasets is actually parent ID. Please use parent ID to find product meta. |
| user_id | str | ID of the reviewer |
| timestamp | int | Time of the review (unix time) |
| verified_purchase | bool | User purchase verification |
| helpful_vote | int | Helpful votes of the review |

### For Item Metadata

| Field | Type | Explanation |
| ----- | ---- | ----------- |
| main_category | str | Main category (i.e., domain) of the product. |
| title | str | Name of the product. |
| average_rating | float | Rating of the product shown on the product page. |
| rating_number | int | Number of ratings in the product. |
| features | list | Bullet-point format features of the product. |
| description | list | Description of the product. |
| price | float | Price in US dollars (at time of crawling). |
| images | list | Images of the product. Each image has different sizes (thumb, large, hi_res). The “variant” field shows the position of image. |
| videos | list | Videos of the product including title and url. |
| store | str | Store name of the product. |
| categories | list | Hierarchical categories of the product. |
| details | dict | Product details, including materials, brand, sizes, etc. |
| parent_asin | str | Parent ID of the product. |
| bought_together | list | Recommended bundles from the websites. |

## Citation

```bibtex
@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}
```

## Contact Us

- **Report Bugs**: To report bugs in the dataset, please file an issue on our [GitHub](https://github.com/hyp1231/AmazonReviews2023/issues/new).
- **Others**: For research collaborations or other questions, please email **yphou AT ucsd.edu**.

Provider:

cogsci13

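Summary of Original Information

Dataset overview: Amazon Reviews 2023 (Books Only)

Basic dataset information

Name: Amazon Reviews 2023 (Books Only)
Language: English
Tags: recommendation, reviews
Size: 100M<n<1B

Dataset contents

Source: collected by McAuley Lab in 2023
Contents:
1. User reviews: ratings, text, helpfulness votes, etc.;
2. Item metadata: descriptions, price, raw images, etc.

Dataset updates

First release: April 18, 2024
What's new in Amazon Reviews'23 (figures refer to the full collection, of which this dataset is the Books subset):
1. Dataset size: 571.54M reviews collected, 245.2% larger than the previous version;
2. Interaction time range: May 1996 to September 2023;
3. Metadata richness: more descriptive features in item metadata;
4. Timestamp precision: interaction timestamps at the second level or finer;
5. Data processing: cleaner item metadata than previous versions;
6. Standard splitting: standard data splits to encourage recommender-system benchmarking.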

Dataset Statistics

Statistics by category:

Category	#User	#Item	#Rating	#R_Token	#M_Token	Download
Books	10.3M	4.4M	29.5M	2.9B	3.7B	review, meta
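The review and metadata downloads listed for the Books category are gzip-compressed JSON Lines files (the URLs end in `.jsonl.gz`). A minimal sketch for iterating over such a file one record at a time, without loading it all into memory, might look like this. The function name `iter_jsonl_gz` is hypothetical, and the demo writes a tiny synthetic file rather than touching the real download.

```python
import gzip
import json
import os
import tempfile
from typing import Iterator


def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a tiny synthetic file (hypothetical records, not real dataset rows).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write(json.dumps({"parent_asin": "X1", "rating": 5.0}) + "\n")
    handle.write(json.dumps({"parent_asin": "X2", "rating": 1.0}) + "\n")

records = list(iter_jsonl_gz(demo_path))
print(len(records), records[0]["parent_asin"])
```

For the real files, pointing `iter_jsonl_gz` at a locally downloaded `Books.jsonl.gz` or `meta_Books.jsonl.gz` and consuming the generator lazily (rather than calling `list`) keeps memory use flat.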

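Dataset Fields

User reviews

Field	Type	Explanation
rating	float	Rating of the product (1.0 to 5.0)
title	str	Title of the user review
text	str	Text body of the user review
images	list	Product images uploaded by the user
asin	str	Product ID
parent_asin	str	Parent ID of the product
user_id	str	ID of the reviewer
timestamp	int	Time of the review (Unix time)
verified_purchase	bool	User purchase verification
helpful_vote	int	Helpful votes of the review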
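The sample review in the Quick Start section has `timestamp` 1642399598485, a 13-digit value, which suggests milliseconds rather than seconds since the Unix epoch. A minimal sketch for converting such a value, assuming millisecond precision holds for every record (the function name `review_time` is hypothetical):

```python
from datetime import datetime, timezone


def review_time(timestamp_ms: int) -> datetime:
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


when = review_time(1642399598485)  # value taken from the sample review
print(when.isoformat())
```

Using an explicit UTC timezone avoids results that depend on the machine's local settings.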

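Item metadata

Field	Type	Explanation
main_category	str	Main category of the product
title	str	Name of the product
average_rating	float	Rating shown on the product page
rating_number	int	Number of ratings of the product
features	list	Product features (bullet-point format)
description	list	Description of the product
price	float	Price of the product (at time of crawling)
images	list	Images of the product
videos	list	Videos of the product
store	str	Store name of the product
categories	list	Hierarchical categories of the product
details	dict	Product details
parent_asin	str	Parent ID of the product
bought_together	list	Bundles recommended by the website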
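In the sample metadata record shown in the Quick Start section, `price` appears as the string `'8.23'` and `details` as a JSON-encoded string, although the table lists them as float and dict. A minimal sketch for normalizing both, assuming other records follow the same pattern (the helper names `parse_price` and `parse_details` are hypothetical, and mapping unparsable prices to `None` is a design choice of this sketch):

```python
import json
from typing import Optional


def parse_price(raw) -> Optional[float]:
    """Return the price as a float, or None when it is missing or not numeric."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def parse_details(raw) -> dict:
    """Return details as a dict, decoding it first if it arrives as a JSON string."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str) and raw:
        return json.loads(raw)
    return {}


# Values abridged from the sample metadata record.
sample = {
    "price": "8.23",
    "details": '{"Publisher": "Chatto & Windus; First Edition (January 1, 2004)", '
               '"Language": "English", "Hardcover": "196 pages"}',
}

price = parse_price(sample["price"])
details = parse_details(sample["details"])
print(price, details["Language"])
```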

Dataset Citation

@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}

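Collected Summary

Dataset Introduction

Construction

The dataset was built by McAuley Lab in 2023 and focuses on collecting and organizing Amazon book reviews. It covers detailed user review information such as ratings, text, and helpfulness votes, together with rich book metadata such as descriptions, prices, and raw images. Interactions span May 1996 to September 2023. Timestamps are recorded at the second level or finer, and standard data splits are provided to support recommender-system benchmarking.

Characteristics

The dataset's notable characteristics are its scale and diversity. The full Amazon Reviews 2023 collection contains 571.54M reviews, 245.2% larger than the previous version; the Books subset described here accounts for 29.5M ratings from 10.3M users on 4.4M items. The item metadata carries more descriptive features and is cleaner than in previous versions. Fine-grained timestamps allow more precise temporal analysis, and the standard data splits make the dataset easier to use for recommender-system research.

Usage

The data can be loaded with the Hugging Face datasets library. For user reviews, call `load_dataset` with the `raw_review_Books` configuration and access fields such as the rating, review text, and images. For book metadata, use the `raw_meta_Books` configuration to obtain titles, prices, categories, and similar information. Detailed field descriptions and loading examples are available in the official documentation.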
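Once both configurations are loaded, the field notes recommend using `parent_asin` to find the metadata for a reviewed product. A minimal sketch of that lookup on plain in-memory dictionaries follows. The function name `attach_metadata` and the second review are hypothetical; the IDs in the first review and in the item come from the sample records. Reviews without a matching item are paired with `None`, since some items lack metadata.

```python
from typing import Iterable, Iterator, Mapping, Optional, Tuple


def attach_metadata(
    reviews: Iterable[Mapping], items: Iterable[Mapping]
) -> Iterator[Tuple[Mapping, Optional[Mapping]]]:
    """Pair each review with the item sharing its parent_asin, or None if absent."""
    by_parent = {item["parent_asin"]: item for item in items}
    for review in reviews:
        yield review, by_parent.get(review["parent_asin"])


items = [{"parent_asin": "0701169850", "title": "Chaucer"}]
reviews = [
    {"parent_asin": "B09BGPFTDB", "rating": 1.0},  # IDs from the sample review
    {"parent_asin": "0701169850", "rating": 4.0},  # hypothetical review
]

pairs = list(attach_metadata(reviews, items))
for review, item in pairs:
    print(review["parent_asin"], item["title"] if item else "no metadata")
```

Building the index once keeps each lookup constant-time; for the full Books subset the index itself may be large, so an on-disk key-value store could be a better fit than a Python dict.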

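Background and Challenges

Background

The Amazon Reviews 2023 (Books Only) dataset was created by McAuley Lab in 2023 to provide rich user reviews and item metadata for recommender-system and natural language processing research. It contains user interactions from May 1996 to September 2023 for the Books category, including review details such as ratings, text, and helpfulness votes, as well as item metadata such as descriptions, prices, and images. Beyond offering a large-scale benchmark, its standardized data splits support benchmarking and algorithm comparison in recommender-system research.

Current challenges

Building and using the dataset involves several challenges. First, the data is very large: processing and storing the 571.54M reviews of the full collection (29.5M ratings in the Books subset) and the associated metadata requires efficient compute and storage. Second, the interactions span a long period, from 1996 to 2023, so cleaning and processing must account for changes in data format and content over time. Third, the metadata includes multimedia such as images and videos, which adds complexity to processing and analysis. Finally, ensuring data quality and consistency, particularly when metadata is missing or incomplete, remains an important challenge.

Common Scenarios

Typical use cases

In recommender systems, the typical use cases for the cogsci13/Amazon-Reviews-2023-Books-Meta dataset center on user review analysis and the improvement of recommendation algorithms. By analyzing users' book ratings, review text, and helpfulness votes, researchers can build more accurate user profiles and item feature models, improving the personalization and accuracy of recommendations. The rich metadata (item descriptions, prices, images, and so on) is also a resource for multimodal recommender systems, allowing algorithms to combine text, images, and other information sources.

Practical applications

In practice, the cogsci13/Amazon-Reviews-2023-Books-Meta dataset can be applied to improving recommender systems on e-commerce platforms. By analyzing user reviews and item metadata, a platform can recommend books more accurately, improving user experience and purchase conversion. The dataset can also support market analysis, helping publishers and booksellers understand reader feedback on different books and adjust publishing strategy and inventory management. In education, it can be used to build personalized learning-resource recommenders that help students and teachers choose suitable textbooks and reference books.

Derived and related work

Several lines of work build on the cogsci13/Amazon-Reviews-2023-Books-Meta dataset. Examples include sentiment analysis of user reviews to study how sentiment affects recommendation, and multimodal information fusion for more accurate book recommendation models. The dataset has also been used to develop and evaluate newer recommendation algorithms, such as recommenders based on graph neural networks and dynamic recommendation models that incorporate time-series analysis. This work extends recommender-system research and provides technical support for practical applications.

The above content was collected and summarized by 遇见数据集.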