Node.js Web Scraping: How to Extract the JSON String After ABCD in HTML?

阿华AIGC实验室

2026-5-11

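Best approach: extracting the ABCD field's JSON from a script

Hey, I know this one well! When you scrape a page and the data you want is JSON embedded in an inline script, there are a few ways to get at it, ranging from quick and simple to robust and fault-tolerant. Here is a breakdown: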

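1. Regex for quick extraction (good when the structure is stable)

If the format of this script is unlikely to change, a regex is the quickest option to write. It matches the JSON array that follows ABCD: and the captured text is then parsed into an object.

Code example: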

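const html = `your HTML content`;

// Match the JSON array after ABCD:, allowing spaces and newlines after the colon.
// Caveat: the non-greedy .*? stops at the first "]", so this only suits a flat array.
const regex = /ABCD:\s*(\[.*?\])/s;
const match = html.match(regex);

if (match) {
  try {
    // JSON.parse only accepts strict JSON (double-quoted keys and strings)
    const abcdData = JSON.parse(match[1]);
    console.log(abcdData); // this is the JSON array you are after
  } catch (err) {
    console.error('JSON parsing failed:', err);
  }
} else {
  console.log('ABCD field not found');
}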

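Notes:

The s flag in the regex makes . match newlines as well, so line breaks inside the script do not break the match.
The non-greedy .*? stops at the first ]. It therefore captures the whole array only when the array has no nested arrays and no ] inside a string value; in that case the match ends before languageCode and whatever follows. Otherwise the capture is cut short and JSON.parse fails.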
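
If the ABCD array contains nested arrays, or a ] inside a string value, a non-greedy regex capture gets cut short. Here is a minimal sketch of one way around that: find the opening [ after the key, then scan forward while counting bracket depth and skipping string literals. The helper name extractArrayAfterKey and the sample string are made up for illustration, and the key is assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper: returns the text of the balanced [...] that follows `key:`,
// or null when the key or the closing bracket cannot be found.
function extractArrayAfterKey(source, key) {
  // Accept ABCD:, "ABCD": and 'ABCD': with optional whitespace around the colon
  const keyPattern = new RegExp(`["']?\\b${key}\\b["']?\\s*:\\s*\\[`);
  const found = keyPattern.exec(source);
  if (!found) return null;

  const start = found.index + found[0].length - 1; // index of the opening "["
  let depth = 0;
  let quote = null; // holds the quote character while inside a string literal

  for (let i = start; i < source.length; i++) {
    const ch = source[i];
    if (quote) {
      if (ch === '\\') i++;                 // skip the escaped character
      else if (ch === quote) quote = null;  // end of the string literal
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === '[') {
      depth++;
    } else if (ch === ']') {
      depth--;
      if (depth === 0) return source.slice(start, i + 1);
    }
  }
  return null; // brackets never balanced
}

// Made-up sample with a nested array and a "]" inside a string value
const sample = 'new LowFareFinder({ ABCD: [{"code":"A]B","legs":[1,2]},{"code":"C","legs":[]}], languageCode: "en" });';
const raw = extractArrayAfterKey(sample, 'ABCD');
const abcdData = raw === null ? null : JSON.parse(raw);
console.log(abcdData);
```

The returned text still goes through JSON.parse, so it has to be strict JSON. JavaScript-style literals with unquoted keys or single-quoted strings need a real JavaScript parser instead.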

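2. AST parsing (robust and fault-tolerant, good for complex cases)

If you are worried that the regex will break when the code format changes (an added comment, a variable written differently), parsing the script into an AST (abstract syntax tree) with a JavaScript parser is the more reliable approach, for example with the lightweight acorn library.

Steps and code:

First install the dependency:

npm install acorn

Then write the parsing logic: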

const acorn = require('acorn');
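const html = `your HTML content`;

// Collect the contents of every <script> tag and keep the first one that mentions ABCD
const scriptRegex = /<script\b[^>]*>(.*?)<\/script>/gs;
const scriptMatch = [...html.matchAll(scriptRegex)].find(m => m[1].includes('ABCD'));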

if (scriptMatch) {
  const scriptCode = scriptMatch[1];
  try {
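    // Parse the script into an AST
    const ast = acorn.parse(scriptCode, { ecmaVersion: 2020 });

    // Walk the AST down to the argument object of the LowFareFinder constructor call
    // (this assumes the script is an IIFE whose body contains `new LowFareFinder({...})`)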
    ast.body.forEach(node => {
      if (node.type === 'ExpressionStatement' && node.expression.type === 'CallExpression') {
        const funcExpr = node.expression.callee;
        if (funcExpr.type === 'FunctionExpression') {
          funcExpr.body.body.forEach(innerNode => {
            if (innerNode.type === 'ExpressionStatement' && innerNode.expression.type === 'NewExpression') {
              const args = innerNode.expression.arguments;
              args.forEach(arg => {
                if (arg.type === 'ObjectExpression') {
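                  // Walk the object's properties and find the one whose key is ABCD
                  arg.properties.forEach(prop => {
                    // The key may be an identifier (ABCD) or a string literal ("ABCD")
                    if (prop.type === 'Property' && (prop.key.name === 'ABCD' || prop.key.value === 'ABCD')) {
                      // prop.value is an AST node, not the data itself, so slice its source text
                      // using the start/end offsets acorn records and parse that
                      // (this requires the value to be written as strict JSON)
                      const abcdJson = scriptCode.slice(prop.value.start, prop.value.end);
                      const abcdData = JSON.parse(abcdJson);
                      console.log('Extracted ABCD data:', abcdData);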
                    }
                  });
                }
              });
            }
          });
        }
      }
    });
  } catch (err) {
    console.error('AST parsing failed:', err);
  }
}
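
The walk above is tied to one exact shape (an IIFE whose body contains a new expression). As a rough sketch of a looser alternative, the two hypothetical helpers below search the whole tree for a property named ABCD and turn its value node into plain data, which also copes with JavaScript-style literals such as unquoted keys or single-quoted strings. The names findProperty and astToValue are my own, and the hand-written sampleAst stands in for what acorn.parse would return, so the snippet runs without any dependency. Only literals, arrays, objects, and the unary - and ! operators are handled; anything else throws.

```javascript
// Hypothetical helper: depth-first search for the first Property node whose key is `name`.
function findProperty(node, name) {
  if (Array.isArray(node)) {
    for (const child of node) {
      const hit = findProperty(child, name);
      if (hit) return hit;
    }
    return null;
  }
  if (node === null || typeof node !== 'object') return null;
  if (node.type === 'Property' && !node.computed &&
      (node.key.name === name || node.key.value === name)) {
    return node;
  }
  for (const value of Object.values(node)) {
    const hit = findProperty(value, name);
    if (hit) return hit;
  }
  return null;
}

// Hypothetical helper: converts a literal-only ESTree node into a plain JS value.
function astToValue(node) {
  switch (node.type) {
    case 'Literal':
      return node.value;
    case 'ArrayExpression':
      // Holes in sparse arrays appear as null elements
      return node.elements.map(el => (el === null ? null : astToValue(el)));
    case 'ObjectExpression': {
      const out = {};
      for (const prop of node.properties) {
        if (prop.type !== 'Property' || prop.computed) {
          throw new Error('Unsupported object member: ' + prop.type);
        }
        const key = prop.key.type === 'Identifier' ? prop.key.name : String(prop.key.value);
        out[key] = astToValue(prop.value);
      }
      return out;
    }
    case 'UnaryExpression':
      if (node.operator === '-') return -astToValue(node.argument);
      if (node.operator === '!') return !astToValue(node.argument); // minified true/false
      throw new Error('Unsupported operator: ' + node.operator);
    default:
      throw new Error('Unsupported node type: ' + node.type);
  }
}

// Hand-written stand-in for the AST of:
//   new LowFareFinder({ ABCD: [{ code: 'A', 'price': -1 }] })
const sampleAst = {
  type: 'Program',
  body: [{
    type: 'ExpressionStatement',
    expression: {
      type: 'NewExpression',
      callee: { type: 'Identifier', name: 'LowFareFinder' },
      arguments: [{
        type: 'ObjectExpression',
        properties: [{
          type: 'Property',
          computed: false,
          key: { type: 'Identifier', name: 'ABCD' },
          value: {
            type: 'ArrayExpression',
            elements: [{
              type: 'ObjectExpression',
              properties: [
                { type: 'Property', computed: false,
                  key: { type: 'Identifier', name: 'code' },
                  value: { type: 'Literal', value: 'A' } },
                { type: 'Property', computed: false,
                  key: { type: 'Literal', value: 'price' },
                  value: { type: 'UnaryExpression', operator: '-',
                           argument: { type: 'Literal', value: 1 } } },
              ],
            }],
          },
        }],
      }],
    },
  }],
};

const abcdProp = findProperty(sampleAst, 'ABCD');
const abcdValue = abcdProp ? astToValue(abcdProp.value) : null;
console.log(JSON.stringify(abcdValue)); // prints [{"code":"A","price":-1}]
```

With acorn installed, sampleAst can be replaced by the result of acorn.parse(scriptCode, { ecmaVersion: 2020 }); the helpers only rely on the standard ESTree node shapes.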