Why does Node.js miscount my string's length, truncating the response content?
Problem Analysis and Solutions
Hey there! Let's dig into your issue and break down what's happening, plus how to fix it.
Why Your Content-Length Was Causing Truncation
The core problem here is a mismatch between what `html.length` measures and what the `Content-Length` header expects:

- In Node.js, a string's `length` property counts UTF-16 code units, not the number of bytes the string occupies when encoded as UTF-8 (which is what your `Content-Type` header specifies).
- When your HTML contains multi-byte characters (like Chinese, emoji, or special symbols), each takes up 2-4 bytes in UTF-8, but `html.length` only counts it as 1 unit. This means your calculated `Content-Length` was smaller than the actual number of bytes being sent, so the browser stopped reading early and cut off the trailing `</html>`.
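You can see the mismatch directly in Node (a minimal sketch; the HTML string here is just an illustration, not your actual page):

```javascript
// Compare UTF-16 code-unit count with the actual UTF-8 byte count.
const html = '<html><body>你好，世界</body></html>';

// String#length counts UTF-16 code units: 31 here.
console.log(html.length);

// Buffer.byteLength gives the real UTF-8 byte count: 41 here,
// because each of the 5 CJK characters encodes to 3 bytes.
console.log(Buffer.byteLength(html, 'utf-8'));
```

If you send this with `Content-Length: 31`, the browser stops after 31 bytes and the last 10 bytes of markup never render.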
Fixes to Try
1. Let Node.js Handle Content-Length Automatically
This is the simplest solution, which you already saw works: just remove the `Content-Length` header entirely. When you call `response.end(html)`, Node.js will either:

- Automatically calculate the correct byte length and set the header for you, or
- Use chunked transfer encoding (which doesn't require a `Content-Length` header) to send the data in pieces.
2. Manually Calculate the Correct Byte Length (If You Need To)
If you must set `Content-Length` explicitly, convert the string to a UTF-8 `Buffer` first; the buffer's `length` property gives the actual number of bytes:
```typescript
let html = this.code!.asHtml();

// Convert the string to a UTF-8 buffer to get an accurate byte count
const htmlBuffer = Buffer.from(html, 'utf-8');

response.writeHead(200, {
  "Content-Type": "text/html; charset=utf-8",
  "Content-Length": htmlBuffer.length
});
response.end(htmlBuffer);
```
What Causes String Length Mismatches in Node.js?
Here are the most common factors that lead to incorrect string length counts:
- Encoding mismatches: As you suspected, using `string.length` (UTF-16 code units) when you need the UTF-8 byte length is the top culprit. Any multi-byte character will throw off the count.
- Surrogate pairs: Some characters (like the emoji 🤯 or rare Unicode symbols) are made up of two UTF-16 code units. `string.length` counts these as 2 even though they render as a single character; this can confuse character-count logic, and mixing it up with byte counts causes errors.
- Hidden/invisible characters: Zero-width spaces, non-standard line breaks (e.g., `\r\n` vs `\n`), and control characters add to the `length` count without being visible, leading to unexpected byte counts when encoded.
- Invalid UTF-16 sequences: If a string is created from raw binary data that isn't properly UTF-16 encoded, `string.length` will count invalid code units, resulting in completely inaccurate numbers.
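A few of the cases above can be verified directly in Node (a small sketch using the emoji from the list):

```javascript
// Surrogate pairs: one visible emoji = two UTF-16 code units = four UTF-8 bytes.
const emoji = '🤯';
console.log(emoji.length);                        // 2 (UTF-16 code units)
console.log([...emoji].length);                   // 1 (actual code point)
console.log(Buffer.byteLength(emoji, 'utf-8'));   // 4 (UTF-8 bytes)

// Hidden characters: a zero-width space is invisible but still counted.
const hidden = 'a\u200Bb';
console.log(hidden.length);                        // 3
console.log(Buffer.byteLength(hidden, 'utf-8'));   // 5 (\u200B is 3 bytes)

// Line-ending differences also shift byte counts invisibly.
console.log(Buffer.byteLength('a\r\nb', 'utf-8')); // 4
console.log(Buffer.byteLength('a\nb', 'utf-8'));   // 3
```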
This question originally comes from Stack Exchange, asked by Peter Wone.