
Help needed: using crawl4ai to crawl restricted pages after logging in

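I want to use crawl4ai to crawl restricted pages that require login, not just publicly accessible ones. I have saved the post-login user_data directory and a cookies.json file, and I verified separately with Playwright that these credentials are valid: running the code below, I can reach the restricted page and save its content, so the login state itself is fine: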

import asyncio
import json
from playwright.async_api import async_playwright

async def check_login_state(user_data_path, cookie_path, test_url, save_html_path):
    async with async_playwright() as p:
        # launch_persistent_context returns a BrowserContext, not a Browser
        context = await p.chromium.launch_persistent_context(
            user_data_dir=user_data_path,
            headless=False
        )
        page = await context.new_page()
        try:
            # Load the saved cookies on top of the persistent profile
            with open(cookie_path, "r", encoding="utf-8") as f:
                cookies = json.load(f)
            await context.add_cookies(cookies)
        except FileNotFoundError:
            print("ERROR")

        await page.goto(test_url)
        current_url = page.url
        # Save the rendered page so the login state can be inspected afterwards
        page_html = await page.content()
        with open(save_html_path, "w", encoding="utf-8") as f:
            f.write(page_html)

        await page.wait_for_timeout(3000)
        await context.close()
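
The check above records current_url but judges success by looking at the saved HTML. A minimal sketch of an automated check follows, assuming the site redirects unauthenticated visitors to a login page. The function name looks_logged_in and the path markers are hypothetical and would need to match the real site:

```python
from urllib.parse import urlparse

# Hypothetical markers: adjust to the paths the target site uses for its login flow.
LOGIN_PATH_MARKERS = ("/login", "/signin", "/sso")


def looks_logged_in(requested_url: str, final_url: str) -> bool:
    """Return True if the final URL is on the requested host and not a login path."""
    requested = urlparse(requested_url)
    final = urlparse(final_url)
    # A redirect to a different host (for example an auth subdomain) counts as not logged in.
    if requested.netloc != final.netloc:
        return False
    path = final.path.lower()
    return not any(marker in path for marker in LOGIN_PATH_MARKERS)
```

Calling looks_logged_in(test_url, current_url) after page.goto gives a quick yes/no signal that can be compared between the Playwright run and the crawl4ai run.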

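However, when I use the same credentials in crawl4ai, the target page is never loaded with the login state. Following crawl4ai's parameter documentation, I wrote the crawl code below, but it never ends up logged in: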

from crawl4ai import AsyncWebCrawler
from crawl4ai.extensions import BrowserConfig

async def crawl4aiUrl(enter_url, user_data_path):
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        headless=True,
        use_persistent_context=True,
        user_data_dir=user_data_path
    )

    async with AsyncWebCrawler(
        enable_click=True,
        max_depth=2,
        max_pages=100,
        delay=2,
        browser_config=browser_cfg
    ) as crawler:
        try:
            result = await crawler.arun(
                url=enter_url,
                depth=2,
                wait_for="document.readyState === 'complete'",
                timeout=15,
            )
            print("Crawl Success")
        except Exception as e:
            print(f"Crawl Failed: {e}")
            return

        with open("original_page.html", "w", encoding="utf-8") as f:
            f.write(result.html)
        print("Crawling result saved as .html FILE")

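I have checked crawl4ai's parameter documentation repeatedly and still cannot find the problem. Can anyone see what is wrong in my crawl4ai configuration, or is there anything else I need to watch out for when loading login credentials?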
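
One visible difference between the two scripts is that the Playwright check loads cookies.json explicitly, while the crawl4ai script relies only on user_data_dir. The sketch below converts a cookie list into the dictionary shape Playwright uses for storage state, so the same cookies could be supplied in one piece. It assumes the cookies may have come from a browser-extension export, which uses expirationDate and lowercase sameSite values. The name to_storage_state is hypothetical. Whether your installed crawl4ai version accepts such a dictionary (for example through a storage_state or cookies option on BrowserConfig) is also an assumption to verify against its documentation:

```python
# Map common export spellings to the values Playwright accepts for sameSite.
_SAME_SITE = {
    "strict": "Strict",
    "lax": "Lax",
    "none": "None",
    "no_restriction": "None",
}


def to_storage_state(raw_cookies):
    """Build a Playwright-style storage state dict from a list of cookie dicts."""
    cookies = []
    for c in raw_cookies:
        cookie = {
            "name": c["name"],
            "value": c["value"],
            "domain": c["domain"],
            "path": c.get("path", "/"),
            # Playwright uses Unix seconds; -1 marks a session cookie.
            "expires": float(c.get("expires", c.get("expirationDate", -1))),
            "httpOnly": bool(c.get("httpOnly", False)),
            "secure": bool(c.get("secure", False)),
        }
        # Drop sameSite values that cannot be mapped (such as "unspecified").
        same_site = _SAME_SITE.get(str(c.get("sameSite", "")).lower())
        if same_site:
            cookie["sameSite"] = same_site
        cookies.append(cookie)
    return {"cookies": cookies, "origins": []}
```

The result can be written out with json.dump and inspected by hand, which also makes it easy to spot expired cookies or a domain that does not match the page being crawled.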

Note: content sourced from Stack Exchange; question asked by user29459329.
