Inside the website-hot-hub WeRead (微信读书) Crawler Module

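This article walks through website_weread.py, the WeRead crawler module in the website-hot-hub project. The module automatically fetches the trending-books chart from the WeRead platform and supports data archiving and automatic README updates.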

Project Overview

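website-hot-hub is an open-source project that collects trending data from multiple platforms, including 36Kr, bilibili, GitHub, Douyin, Juejin, WeRead, and Kuaishou. It is written in Python and uses GitHub Actions to fetch data automatically every hour and archive it by day.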

website_weread.py is the module dedicated to WeRead's popular charts. It retrieves the rising chart (飙升榜) from the WeRead web client.

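Core Features

Automatic data fetching: fetches the WeRead rising chart on a schedule
Data cleaning: converts raw API data into a structured format
Incremental updates: merges with historical data to avoid duplicate records
Multiple output formats: raw JSON data, Markdown archives, and README updates
Automatic archiving: archives data by date into designated directories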

Tech Stack

Python 3.x - primary development language
requests - HTTP request library
urllib3 - connection pooling and retry mechanism
pathlib - file path handling
contextlib - context managers

Core Code Walkthrough

1. Request Session Management

A context manager provides an HTTP session with a retry mechanism:

python

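import contextlib

import requests
from requests.adapters import HTTPAdapter

# `headers` and `retries` are assumed to be module-level settings defined elsewhere in the file.
@contextlib.contextmanager
def request_session():
    s = requests.session()
    try:
        s.headers.update(headers)
        # Apply the same retry policy to both URL schemes.
        s.mount("http://", HTTPAdapter(max_retries=retries))
        s.mount("https://", HTTPAdapter(max_retries=retries))
        yield s
    finally:
        # Always release pooled connections, even if the caller raised.
        s.close()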
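
The key property of this pattern is that the `finally` branch runs whether the `with` body finishes normally or raises. The following is a minimal sketch of that behaviour only, using a hypothetical `FakeSession` stand-in (not part of the project) so it runs without network access or third-party packages:

```python
import contextlib


class FakeSession:
    """Hypothetical stand-in for requests.Session that only tracks close()."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


@contextlib.contextmanager
def fake_request_session(session: FakeSession):
    try:
        yield session
    finally:
        # Runs on normal exit and when the with-body raises.
        session.close()


def use_and_fail(session: FakeSession) -> bool:
    """Raise inside the with-block and report whether the error propagated."""
    try:
        with fake_request_session(session):
            raise RuntimeError("simulated network failure")
    except RuntimeError:
        return True
    return False


session = FakeSession()
propagated = use_and_fail(session)
print(propagated, session.closed)
```

The exception still reaches the caller, and the session is closed regardless, which is the guarantee `request_session()` relies on.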

2. Data Fetching and Cleaning

python

@staticmethod
def get_raw() -> dict:
    """Fetch raw data from the WeRead API."""
    ret = {}
    try:
        with request_session() as s:
            resp = s.get(url, timeout=30)
            ret = resp.json()
    except Exception:
        logger.exception("get data failed")
        raise
    return ret

@staticmethod
def clean_raw(raw_data: dict) -> typing.List[typing.Dict[str, typing.Any]]:
    """Clean the raw data, keeping each book's title and link."""
    ret: typing.List[typing.Dict[str, typing.Any]] = []
    for item in raw_data.get("books", []):
        ret.append({
            "title": item["bookInfo"]["title"],
            "url": f"https://weread.qq.com/web/bookDetail/{get_weread_id(item['bookInfo']['bookId'])}",
        })
    return ret
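
To show what the cleaning step produces, here is a small sketch run against an invented payload. The sample data is made up, and `fake_weread_id` is a hypothetical placeholder for the module's `get_weread_id` helper, whose implementation is not shown in this article:

```python
import typing


def fake_weread_id(book_id: str) -> str:
    """Hypothetical placeholder for get_weread_id.

    It returns the bookId unchanged so the example stays runnable; the real
    helper presumably derives the identifier used in web URLs.
    """
    return book_id


def clean_raw_sketch(raw_data: dict) -> typing.List[typing.Dict[str, str]]:
    """Keep only the title and detail-page URL of each entry."""
    cleaned = []
    for item in raw_data.get("books", []):
        info = item["bookInfo"]
        cleaned.append(
            {
                "title": info["title"],
                "url": f"https://weread.qq.com/web/bookDetail/{fake_weread_id(info['bookId'])}",
            }
        )
    return cleaned


# Invented sample payload; only the keys read by the cleaning step are included.
sample = {
    "books": [
        {"bookInfo": {"bookId": "1001", "title": "Book A"}},
        {"bookInfo": {"bookId": "1002", "title": "Book B"}},
    ]
}
cleaned = clean_raw_sketch(sample)
print(cleaned)
```

Because the loop reads `raw_data.get("books", [])`, a response without a `books` key yields an empty list rather than an error, while a book entry missing `bookInfo` would raise `KeyError`.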

3. Data Merging and Deduplication

python

@staticmethod
def merge_data(
    cur: typing.List[typing.Dict[str, typing.Any]],
    another: typing.List[typing.Dict[str, typing.Any]],
):
    """Merge two data sets, deduplicating by URL."""
    merged_dict: typing.Dict[str, typing.Any] = {}
    for item in chain(cur, another):
        merged_dict[item["url"]] = item["title"]
    return [{"url": k, "title": v} for k, v in merged_dict.items()]
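
A worked example makes the merge semantics concrete. The sketch below repeats the same dictionary-based approach under the hypothetical name `merge_by_url`, with invented URLs and titles:

```python
import typing
from itertools import chain

Item = typing.Dict[str, str]


def merge_by_url(cur: typing.List[Item], another: typing.List[Item]) -> typing.List[Item]:
    """Merge two lists of items, keeping one entry per URL."""
    merged: typing.Dict[str, str] = {}
    for item in chain(cur, another):
        # Re-assigning an existing key updates the title but keeps its position.
        merged[item["url"]] = item["title"]
    return [{"url": url, "title": title} for url, title in merged.items()]


earlier = [
    {"url": "https://example.com/a", "title": "A (old title)"},
    {"url": "https://example.com/b", "title": "B"},
]
latest = [
    {"url": "https://example.com/a", "title": "A (new title)"},
    {"url": "https://example.com/c", "title": "C"},
]
merged = merge_by_url(earlier, latest)
print(merged)
```

Two properties follow from using a `dict` keyed by URL: when the same URL appears in both lists, the title from the second list wins, and entries keep the order in which their URL was first seen.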

4. Markdown List Generation

python

@staticmethod
def create_list(content: typing.List[typing.Dict[str, typing.Any]]) -> str:
    """Build the book list in Markdown format."""
    topics = []
    template = """<!-- BEGIN WEREAD -->
<!-- 最后更新时间 {update_time} -->
{topics}
<!-- END WEREAD -->"""
    for item in content:
        topics.append(f"1. [{item['title']}]({item['url']})")
    template = template.replace("{update_time}", current_time())
    template = template.replace("{topics}", "\n".join(topics))
    return template
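
The rendered section is easier to picture with concrete input. This sketch mirrors the template logic but takes the timestamp as a parameter, because the module's `current_time()` helper is not shown here; `render_section`, the English timestamp wording, and the sample data are illustrative assumptions:

```python
import typing

# The BEGIN/END markers match the ones used by the module; the wording of the
# timestamp comment is illustrative.
TEMPLATE = """<!-- BEGIN WEREAD -->
<!-- Last updated: {update_time} -->
{topics}
<!-- END WEREAD -->"""


def render_section(content: typing.List[typing.Dict[str, str]], update_time: str) -> str:
    """Render the marker-delimited Markdown section for the given books."""
    topics = [f"1. [{item['title']}]({item['url']})" for item in content]
    section = TEMPLATE.replace("{update_time}", update_time)
    return section.replace("{topics}", "\n".join(topics))


section = render_section(
    [
        {"title": "Book A", "url": "https://example.com/a"},
        {"title": "Book B", "url": "https://example.com/b"},
    ],
    "2026-02-23 08:00:00",
)
print(section)
```

Every item is written with the prefix `1.`; CommonMark-compliant renderers take only the first number of an ordered list into account and number the remaining items sequentially, so the list still displays as 1, 2, 3.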

5. Automatic README Update

python

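def update_readme(self, content: typing.List[typing.Dict[str, typing.Any]]) -> str:
    """Return the README.md text with the WeRead section replaced."""
    # Read as UTF-8 explicitly so the result does not depend on the platform default.
    with open("./README.md", "r", encoding="utf-8") as fd: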
        readme = fd.read()
    return re.sub(
        r"<!-- BEGIN WEREAD -->[\W\w]*<!-- END WEREAD -->",
        self.create_list(content),
        readme,
    )
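
The marker-based replacement can be tried on an in-memory string. The sketch below is a variation rather than the module's code: `replace_section` is a hypothetical name, and it passes a callable to `re.sub` so that backslashes in a book title are inserted literally instead of being interpreted as escape sequences in the replacement string:

```python
import re

SECTION_RE = re.compile(r"<!-- BEGIN WEREAD -->[\W\w]*<!-- END WEREAD -->")


def replace_section(readme: str, new_section: str) -> str:
    """Swap everything between the WEREAD markers for new_section."""
    # A callable replacement is used verbatim, so sequences such as "\n" or
    # "\1" inside the new text are not processed by the regex engine.
    return SECTION_RE.sub(lambda _match: new_section, readme)


readme = (
    "# Hot Hub\n"
    "<!-- BEGIN WEREAD -->\n"
    "1. [Old Book](https://example.com/old)\n"
    "<!-- END WEREAD -->\n"
    "Footer\n"
)
# The title deliberately contains a backslash followed by "n".
new_section = (
    "<!-- BEGIN WEREAD -->\n"
    "1. [C:\\new](https://example.com/new)\n"
    "<!-- END WEREAD -->"
)
updated = replace_section(readme, new_section)
print(updated)
```

Note that `[\W\w]*` is greedy: if the README ever contained two WEREAD marker pairs, the match would span from the first BEGIN marker to the last END marker.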

Usage Examples

Standalone Run

python

from website_weread import WebSiteWeRead

# Create an instance and run it
weread_obj = WebSiteWeRead()
weread_obj.run(update_readme=True)

Calling as a Module

python

from website_weread import WebSiteWeRead

weread_obj = WebSiteWeRead()
result = weread_obj.run(update_readme=False)

# Shape of the returned data
{
    "section_name": "WEREAD",
    "content": "<!-- BEGIN WEREAD -->...",
    "data_count": 10
}

Data Storage Layout

text

project-root/
├── raw/weread/           # raw JSON data
│   └── 2026-02-23.json
├── archives/weread/      # Markdown archives
│   └── 2026-02-23.md
└── README.md             # auto-updated main document
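
The per-day file names in this layout can be derived from a date with `pathlib`. This is a sketch only: `dated_paths` is a hypothetical helper, and the module may build its paths differently:

```python
import datetime
import typing
from pathlib import Path


def dated_paths(root: Path, day: datetime.date) -> typing.Tuple[Path, Path]:
    """Return the raw-JSON path and the Markdown-archive path for one day."""
    stamp = day.isoformat()  # ISO format, YYYY-MM-DD
    raw_path = root / "raw" / "weread" / f"{stamp}.json"
    archive_path = root / "archives" / "weread" / f"{stamp}.md"
    return raw_path, archive_path


raw_path, archive_path = dated_paths(Path("project-root"), datetime.date(2026, 2, 23))
print(raw_path.as_posix())
print(archive_path.as_posix())
```

Using one file per day is what makes the hourly runs incremental: each run can load that day's existing file, merge the new results into it, and write it back.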

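API Reference

WebSiteWeRead Class

| Method | Description | Returns |
| --- | --- | --- |
| `get_raw()` | Fetch raw data from the WeRead API | `dict` |
| `clean_raw(raw_data)` | Clean the raw data | `List[Dict]` |
| `merge_data(cur, another)` | Merge and deduplicate data | `List[Dict]` |
| `create_list(content)` | Build the Markdown list | `str` |
| `update_readme(content)` | Update the README content | `str` |
| `create_archive(content, date)` | Build the archive content | `str` |
| `run(update_readme=True)` | Run the full pipeline | `bool` / `dict` |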

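Notes

API limits: the WeRead web API may enforce rate limits, so choose a reasonable fetch interval
Data format: the API response format may change as the platform evolves, so the module needs periodic maintenance
Encoding: Chinese content is serialized with ensure_ascii=False so that characters are stored as-is rather than as \uXXXX escapes
Error handling: on network errors, the module logs the exception and re-raises it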
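
The effect of `ensure_ascii=False` is easy to see with the standard `json` module. The book entry and file name below are invented for illustration:

```python
import json
import tempfile
from pathlib import Path

books = [{"title": "三体", "url": "https://example.com/santi"}]

# Default behaviour: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(books)
# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(books, ensure_ascii=False)

print(escaped)
print(readable)

# When writing the readable form to disk, state the encoding explicitly.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "2026-02-23.json"
    path.write_text(readable, encoding="utf-8")
    restored = json.loads(path.read_text(encoding="utf-8"))
```

Both forms decode to the same data; the difference is only in how readable the stored file is and how its diffs look in version control.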

Project Links

GitHub repository: https://github.com/cxyfreedom/website-hot-hub
WeRead module: https://github.com/cxyfreedom/website-hot-hub/blob/main/website_weread.py
Data archive: https://github.com/cxyfreedom/website-hot-hub/tree/main/archives/weread
