Integrating a Selenium WebDriver Scraper API with the Python File scrape.py and Debugging It Together with LangChain AI Calls

阿华AIGC实验室

2026-4-28

Solution: Getting the Selenium Scraper and LangChain to Work Together

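I went through your code and requirements carefully. The core problem is that the scraper currently extracts only the text of the page's <p> tags instead of fetching and processing the full DOM. It also has no logic connecting it directly to LangChain, so the scraped result cannot be passed cleanly to the AI for standardized output. Here is a step-by-step fix and optimization plan: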

1. Fix the scraper's DOM retrieval logic

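Your scrape_website function currently grabs only paragraph content, but the downstream functions such as extract_body_content and clean_body_content are designed to process complete HTML. Modify the scraper so that it returns the full page HTML rather than just the paragraph text: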

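from os import environ
from time import sleep

# Assumption: Options refers to Selenium's ChromeOptions.
from selenium.webdriver import Remote, ChromeOptions as Options

# NoSSLConnection is assumed to be defined elsewhere in scrape.py.

def scrape_website(website):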
    print("Launching chrome browser...")
    AUTH = environ.get('AUTH', '**********')
    if not AUTH:
        raise Exception("Missing AUTH credentials")
    print('Connecting to Browser...')
    server_addr = f'https://{AUTH}@brd.superproxy.io:9515'
    connection = NoSSLConnection(server_addr, 'goog', 'chrome')
    driver = Remote(connection, options=Options())

    def cdp(cmd, params=None):
        # Use None as the default to avoid sharing one mutable dict across calls
        return driver.execute('executeCdpCommand', {
            'cmd': cmd,
            'params': params or {},
        })['value']

    try:
        print('Connected! Starting inspect session...')
        frames = cdp('Page.getFrameTree')
        frame_id = frames['frameTree']['frame']['id']
        cdp('Page.inspect', {'frameId': frame_id})
        sleep(2)  # shortened to cut unnecessary waiting time

        print(f'Navigating to {website}...')
        driver.get(website)
        print('Waiting for captcha to solve...')
        # Reuse the cdp helper, which already unwraps the 'value' field
        result = cdp('Captcha.waitForSolve', {'detectTimeout': 10000})
        print(f"Captcha status: {result['status']}")
        sleep(3)  # wait a few seconds after load so rendering can finish

        print('Scraping full page content...')
        # Grab the full page HTML instead of only the <p> tags
        page_html = driver.page_source
        return page_html
    except Exception as e:
        print(f"Scraping error: {e}")
        return ""
    finally:
        driver.quit()
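
The fixed sleep(3) above is a guess at how long rendering takes. As a hedged alternative, the sketch below polls document.readyState until the page reports that it has finished loading. The name wait_for_page_ready and its timeout values are hypothetical and not part of your scrape.py. It relies only on the driver's execute_script method. A readyState of "complete" does not guarantee that JavaScript-rendered content has appeared, so a short extra wait may still be needed on heavily scripted sites.

```python
import time


def wait_for_page_ready(driver, timeout=15.0, poll_interval=0.5):
    """Poll document.readyState until it is 'complete' or the timeout expires.

    `driver` is any object exposing Selenium's execute_script(script) method.
    Returns True if the page reported 'complete', False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Inside scrape_website this could stand in for the second sleep, for example wait_for_page_ready(driver) right after the captcha step. Selenium's own WebDriverWait offers the same polling pattern if you prefer the library-native route.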

2. Wire the scraper pipeline into the LangChain call

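The HTML returned by the scraper now needs to be cleaned and split into batches, then passed to the LangChain LLM for processing. Add one core function that runs the whole pipeline: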

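# Import the LangChain modules (install first: pip install langchain langchain-openai)
# The older langchain.chat_models / langchain.prompts import paths are deprecated.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser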

def process_website_with_langchain(website_url, output_type="crm_data"):
    # 1. Scrape the page HTML
    page_html = scrape_website(website_url)
    if not page_html:
        print("Failed to scrape website content")
        return None
    
    # 2. Process the HTML: extract body -> clean -> split into batches
    body_content = extract_body_content(page_html)
    cleaned_content = clean_body_content(body_content)
    content_batches = list(split_dom_content(cleaned_content))
    
    # 3. Define the LangChain prompt, tailored to the output type
    prompt_template = ChatPromptTemplate.from_messages([
        ("system", f"You are an expert data extractor. Convert the provided website content into structured {output_type}. The output should be clear, formatted as JSON if possible, and include all relevant {output_type} fields."),
        ("user", "Website content batch: {content}")
    ])
    
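    # Initialize the LLM. ChatOpenAI reads OPENAI_API_KEY from the environment;
    # if the key lives in a .env file, load it first (e.g. python-dotenv's load_dotenv()).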
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    chain = prompt_template | llm | StrOutputParser()
    
    # 4. Process the content batch by batch and collect the results
    final_result = []
    for batch in content_batches:
        try:
            batch_result = chain.invoke({"content": batch})
            final_result.append(batch_result)
        except Exception as e:
            print(f"Error processing batch: {e}")
            continue
    
    # Merge the results of all batches (JSON output could be merged further; for now this simply concatenates)
    merged_result = "\n\n".join(final_result)
    return merged_result
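
Step 2 of the function above calls extract_body_content, clean_body_content and split_dom_content, which are not shown in this answer. As a rough illustration of what that stage does, here is a standard-library-only sketch. The names html_to_clean_text and split_text and the 6000-character batch size are hypothetical stand-ins, not the implementations in your scrape.py:

```python
from html.parser import HTMLParser


class _BodyTextParser(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.parts.append(data)


def html_to_clean_text(page_html):
    """Return the body text, one non-empty stripped line per text fragment."""
    parser = _BodyTextParser()
    parser.feed(page_html)
    parser.close()
    lines = (line.strip() for line in "\n".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def split_text(text, max_length=6000):
    """Cut text into consecutive chunks of at most max_length characters."""
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]
```

Splitting on a fixed character count can cut a record in half at a batch boundary. If that turns out to hurt extraction quality, splitting on line breaks near the limit would be a reasonable refinement.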

3. Usage example

Add a test call at the end of the script to verify the whole pipeline:

if __name__ == "__main__":
    # Test scraping and processing a website
    target_website = "https://example.com"
    crm_data = process_website_with_langchain(target_website, output_type="crm_data")
    if crm_data:
        print("\nFinal CRM Data Output:")
        print(crm_data)
    
    # Competitor data can be processed the same way
    # competitor_data = process_website_with_langchain(target_website, output_type="competitor_data")
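
The result printed above is the per-batch outputs joined as plain text, so it cannot be parsed as a single JSON document. If the model does return JSON for each batch, a merge step along these lines could combine the batches into one list. This is only a sketch: merge_json_batches is a hypothetical helper, and it assumes you keep the per-batch strings (the final_result list) instead of the joined string.

```python
import json


def _extract_json(text):
    """Parse the outermost JSON object or array embedded in text, or return None."""
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end < min(starts):
        return None
    try:
        return json.loads(text[min(starts):end + 1])
    except json.JSONDecodeError:
        return None


def merge_json_batches(batch_outputs):
    """Merge per-batch LLM outputs into one list of records.

    Returns (records, unparsed): arrays are flattened into records, single
    objects are appended, and outputs without valid JSON are kept in unparsed.
    """
    records, unparsed = [], []
    for raw in batch_outputs:
        parsed = _extract_json(raw)
        if parsed is None:
            unparsed.append(raw)
        elif isinstance(parsed, list):
            records.extend(parsed)
        else:
            records.append(parsed)
    return records, unparsed
```

Keeping the unparsed outputs makes it easy to see which batches the model answered in free text, which is useful when debugging the prompt.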