Help needed: crawling login-restricted pages with crawl4ai
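I want to use crawl4ai to crawl restricted pages that require a login, not just publicly accessible ones. I have saved the post-login user_data directory and a cookies.json file, and I verified separately with Playwright that these credentials are valid: after running the code below I can open the restricted page and save its content, so the login state is fine.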
import asyncio
import json
from playwright.async_api import async_playwright

# user_data_path, cookie_path, test_url and save_html_path are not shown here;
# they are assumed to be defined elsewhere in the script.

async def check_login_state():
    async with async_playwright() as p:
        # Reuse the saved browser profile so the session data carries over.
        browser = await p.chromium.launch_persistent_context(
            user_data_dir=user_data_path,
            headless=False
        )
        page = await browser.new_page()
        try:
            # Add the saved cookies on top of the persistent profile.
            with open(cookie_path, "r", encoding="utf-8") as f:
                cookies = json.load(f)
            await page.context.add_cookies(cookies)
        except FileNotFoundError:
            print("ERROR")
        await page.goto(test_url)
        current_url = page.url
        page_html = await page.content()
        # Save the rendered page for inspection.
        with open(save_html_path, "w", encoding="utf-8") as f:
            f.write(page_html)
        await page.wait_for_timeout(3000)
        await browser.close()
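The check above stores page.url in current_url but never inspects it, so "it worked" rests on looking at the saved HTML by hand. A rough way to make the check explicit might be to compare the final URL with the requested one. This is only a heuristic sketch: the function name looks_logged_in and the login_markers values are my own assumptions, not something taken from the site or from Playwright.

```python
from urllib.parse import urlsplit


def looks_logged_in(requested_url, final_url, login_markers=("login", "signin")):
    """Heuristic: True if the browser appears to have stayed on the requested site."""
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    # A redirect to a different host suggests the session was rejected.
    if final.netloc != requested.netloc:
        return False
    # A login-like path (assumed markers) suggests a bounce to the sign-in page.
    path = final.path.lower()
    return not any(marker in path for marker in login_markers)
```

With the script above this would be called as looks_logged_in(test_url, current_url) right after page.goto. The markers would need adjusting to whatever path the target site actually redirects to.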
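But when I use the same login credentials in crawl4ai, it never reaches the target page with the login state attached. I wrote the crawling code below based on crawl4ai's parameter documentation, but it still does not get in as a logged-in user: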
from crawl4ai import AsyncWebCrawler
from crawl4ai.extensions import BrowserConfig

async def crawl4aiUrl(enter_url, user_data_path):
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        headless=True,
        use_persistent_context=True,
        user_data_dir=user_data_path
    )
    async with AsyncWebCrawler(
        enable_click=True,
        max_depth=2,
        max_pages=100,
        delay=2,
        browser_config=browser_cfg
    ) as crawler:
        try:
            result = await crawler.arun(
                url=enter_url,
                depth=2,
                wait_for="document.readyState === 'complete'",
                timeout=15,
            )
            print("Crawl Success")
        except Exception as e:
            print(f"Crawl Failed: {e}")
            return
        with open("original_page.html", "w", encoding="utf-8") as f:
            f.write(result.html)
        print("Crawling result saved as .html FILE")
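I have checked crawl4ai's parameter documentation repeatedly and still cannot find the problem. Could anyone point out what is wrong in my crawl4ai configuration? Or are there other details I need to watch out for when loading login credentials?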
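For comparison, here is a minimal sketch of one direction that might be worth trying. It is not a confirmed fix. It assumes a crawl4ai release in which BrowserConfig and CrawlerRunConfig are importable from the top-level crawl4ai package, the browser settings are passed to AsyncWebCrawler as config=, BrowserConfig accepts a cookies list, and wait conditions written in JavaScript use a "js:" prefix with page_timeout given in milliseconds. All of that should be checked against the installed version. The helper names load_cookies and crawl_logged_in are hypothetical. The sketch mirrors the Playwright check by passing both the user_data directory and the contents of cookies.json.

```python
import json
from pathlib import Path


def load_cookies(cookie_path):
    """Read cookies.json and check the fields Playwright's add_cookies needs."""
    cookies = json.loads(Path(cookie_path).read_text(encoding="utf-8"))
    if not isinstance(cookies, list):
        raise ValueError("cookies.json should contain a list of cookie objects")
    for cookie in cookies:
        if not isinstance(cookie, dict):
            raise ValueError(f"cookie entry is not an object: {cookie!r}")
        # Playwright wants either a url or a domain/path pair per cookie.
        has_target = "url" in cookie or ("domain" in cookie and "path" in cookie)
        if "name" not in cookie or "value" not in cookie or not has_target:
            raise ValueError(f"incomplete cookie entry: {cookie!r}")
    return cookies


async def crawl_logged_in(url, user_data_path, cookie_path):
    # Imported lazily so load_cookies stays usable without crawl4ai installed.
    # Assumption: these three names exist at the package top level.
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

    browser_cfg = BrowserConfig(
        browser_type="chromium",
        headless=True,
        use_persistent_context=True,
        user_data_dir=user_data_path,
        cookies=load_cookies(cookie_path),  # assumed parameter, see lead-in
    )
    run_cfg = CrawlerRunConfig(
        wait_for="js:() => document.readyState === 'complete'",
        page_timeout=15000,  # milliseconds (assumed unit)
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url=url, config=run_cfg)
    if not result.success:
        raise RuntimeError(f"crawl failed: {result.error_message}")
    return result.html
```

It could be run with asyncio.run(crawl_logged_in(enter_url, user_data_path, cookie_path)). Since the Playwright check ran with headless=False, switching headless to False here as well may help rule out a difference caused by headless mode.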
Note: this content comes from Stack Exchange; the question was asked by user29459329.




