Crawling and Syncing the Official OpenClaw Docs: Building an Automated Documentation Update System

Introduction

In the fast-moving field of AI, keeping documentation current and accurate is essential. OpenClaw, an advanced AI agent platform, updates its official documentation continuously. With a local OpenClaw knowledge base, your OpenClaw understands the whole system better, which makes it smarter and more useful! As a bonus, you no longer need to look up commands and explanations in the official docs; when a question comes up, just ask your OpenClaw.

This article walks through how to build an automated OpenClaw documentation crawling and syncing system in Python.

Core Implementation

The system works in the following steps:

  1. Fetch the complete list of documentation URLs from https://docs.openclaw.ai/llms.txt (an illustrative index format is shown right after this list)
  2. Detect document changes by comparing SHA-256 hashes
  3. Download and save only the documents whose content has changed
  4. Generate a detailed update summary report
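
For reference, the URL extraction in the script below assumes that llms.txt is a markdown-style index. The excerpt below is purely illustrative (an assumed format, not copied from the live index):

- [Getting Started](/getting-started.md)
- [CLI Reference](/reference/cli.md)
- [Configuration](https://docs.openclaw.ai/configuration.md)

Each entry is a markdown link ending in .md; relative paths are joined with BASE_URL.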

Complete Code Example: The Sync Script

Below is a complete, ready-to-run sync script:

#!/usr/bin/env python3
"""
OpenClaw Documentation Crawler
This script crawls the OpenClaw documentation site and saves all pages locally.
"""

import requests
import os
import time
import re
import hashlib
import json

# Base URL
BASE_URL = "https://docs.openclaw.ai"
INDEX_URL = "https://docs.openclaw.ai/llms.txt"
OUTPUT_DIR = "crawled_docs"
HASHES_FILE = ".page_hashes.json"

def fetch_page(url):
    """Fetch a single page with error handling"""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.text
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

def save_page(content, filepath):
    """Save page content to file"""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(content)

def extract_urls_from_index(index_content):
    """Extract URLs from the llms.txt index"""
    urls = []
    for line in index_content.split('\n'):
        if line.strip().startswith('- [') and '](' in line and '.md)' in line:
            # Extract URL from markdown link
            start = line.find('](') + 2
            end = line.find('.md)') + 3
            if start > 1 and end > start:
                url = line[start:end]
                if url.startswith('/'):
                    url = BASE_URL + url
                urls.append(url)
    return urls

def clean_filename(url):
    """Convert URL to safe filename"""
    # Remove base URL
    path = url.replace(BASE_URL, '')
    # Remove .md extension
    if path.endswith('.md'):
        path = path[:-3]
    # Replace special characters
    path = re.sub(r'[^\w\-.]', '_', path)
    # Ensure it starts with a letter or number
    if path and not path[0].isalnum():
        path = 'page_' + path
    return path + '.md'

def get_content_hash(content):
    """Calculate SHA-256 hash of content"""
    return hashlib.sha256(content.encode('utf-8')).hexdigest()

def load_old_hashes():
    """Load previous hashes from file"""
    if os.path.exists(HASHES_FILE):
        try:
            with open(HASHES_FILE, 'r', encoding='utf-8') as f:
                return json.load(f)
        except Exception:
            return {}
    return {}

def save_hashes(hashes):
    """Save current hashes to file"""
    with open(HASHES_FILE, 'w', encoding='utf-8') as f:
        json.dump(hashes, f, ensure_ascii=False, indent=2)

def main():
    print("Starting OpenClaw documentation crawl...")
    
    # Fetch index
    print("Fetching index...")
    index_content = fetch_page(INDEX_URL)
    if not index_content:
        print("Failed to fetch index")
        return
    
    # Extract URLs
    urls = extract_urls_from_index(index_content)
    print(f"Found {len(urls)} pages to crawl")
    
    # Load previous hashes
    old_hashes = load_old_hashes()
    new_hashes = {}
    
    # Track changes
    added = []
    updated = []
    removed = list(old_hashes.keys())
    
    # Create output directory
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    
    # Crawl each page
    for i, url in enumerate(urls, 1):
        print(f"Crawling {i}/{len(urls)}: {url}")
        content = fetch_page(url)
        if content:
            # Calculate hash
            content_hash = get_content_hash(content)
            new_hashes[url] = content_hash
            
            # Check if URL was in removed list
            if url in removed:
                removed.remove(url)
            
            # Compare with old hash
            if url not in old_hashes:
                added.append(url)
                print(f"  New page: {url}")
            elif old_hashes[url] != content_hash:
                updated.append(url)
                print(f"  Updated page: {url}")
            
            # Save content
            filename = clean_filename(url)
            filepath = os.path.join(OUTPUT_DIR, filename)
            save_page(content, filepath)
        else:
            print(f"Failed to crawl {url}")
        
        # Be respectful to the server
        time.sleep(0.3)
    
    # Save new hashes
    save_hashes(new_hashes)
    
    # Print summary
    print("n=== CRAWL SUMMARY ===")
    print(f"Total pages: {len(urls)}")
    print(f"New pages: {len(added)}")
    print(f"Updated pages: {len(updated)}")
    print(f"Removed pages: {len(removed)}")
    print("=====================")
    
    if added:
        print("nNew pages:")
        for url in added:
            print(f"  - {url}")
    
    if updated:
        print("nUpdated pages:")
        for url in updated:
            print(f"  - {url}")
    
    if removed:
        print("nRemoved pages:")
        for url in removed:
            print(f"  - {url}")
    
    print("nCrawling completed!")

if __name__ == "__main__":
    main()
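
To reproduce the weekly schedule described in the next section, you can run the script once by hand and then register it with cron. This is only a sketch; the file name openclaw_docs_sync.py and the paths are placeholders, so adjust them to your environment:

# Run once manually
python3 openclaw_docs_sync.py

# Example crontab entry: check for updates every Monday at 02:00
0 2 * * 1 cd /path/to/openclaw-docs && python3 openclaw_docs_sync.py >> sync.log 2>&1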

Letting OpenClaw Do the Task Directly

Besides writing a Python script, you can simply ask OpenClaw to do this job for you!

Just tell OpenClaw:

Please create a sub-agent to complete the following task: "Crawl the official OpenClaw documentation (https://docs.openclaw.ai/llms.txt) and sync it to my local knowledge base. Then set up an automated job: every Monday at 2:00 AM, check the official OpenClaw docs for updates; if there is new content, sync it to the local knowledge base and send me a summary notification."

OpenClaw will automatically:

  • Visit https://docs.openclaw.ai/llms.txt to get the documentation index
  • Crawl each documentation page one by one
  • Save the pages as clean Markdown
  • Add a scheduled task that automatically syncs and notifies you whenever the official docs change

If the model is not capable enough to finish the task on its own, just copy the script above and paste it into the conversation.

Best Practices

  • Incremental updates: download only changed content to save bandwidth
  • Error handling: wrap every request in exception handling; a retry mechanism can be added (see the sketch after this list)
  • Data integrity: verify content with hash comparison
  • Version control: integrate with Git to keep a history of every sync (also shown in the sketch below)
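
A minimal sketch of the retry and Git points above, assuming requests/urllib3 are installed and the git CLI is available; the helper names here are illustrative and not part of the script above:

import subprocess
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    """Build a requests session that retries transient failures."""
    retry = Retry(
        total=3,                                      # up to 3 retries per request
        backoff_factor=1,                             # wait 1s, 2s, 4s between attempts
        status_forcelist=[429, 500, 502, 503, 504],   # retry on these HTTP status codes
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

def commit_snapshot(output_dir="crawled_docs"):
    """Commit the latest crawl so every sync is versioned in Git."""
    subprocess.run(["git", "add", output_dir, ".page_hashes.json"], check=True)
    # check=False: if nothing changed, the empty commit attempt should not abort the run
    subprocess.run(["git", "commit", "-m", "docs: sync OpenClaw documentation"], check=False)

To use these, call session.get(url, timeout=30) inside fetch_page() instead of requests.get(), and call commit_snapshot() at the end of main().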

Summary

With this automated system you always have the latest OpenClaw documentation on hand, giving you a reliable technical reference for development and learning. The setup is efficient, reliable, easy to maintain, and a solid approach to managing technical documentation. Better still, with a local knowledge base your OpenClaw becomes smarter and can answer questions about the OpenClaw system directly!

 