How do I generate an Arabic text file with correct indentation? (Technical help request)

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-27

Solution for Arabic RTL Text Issues in .docx to TXT Conversion

Hey there, let's tackle those right-to-left (RTL) Arabic formatting quirks you're hitting when converting .docx files to plain text in C#. I've dealt with similar bidirectional text rendering headaches before, so here's how to fix both problems cleanly:

1. Fixing Left-Indentation for Arabic Paragraphs

Arabic text relies on right-aligned/right-indented formatting, but plain text files don't support explicit alignment settings natively. Instead, we can use Unicode bidirectional control characters to force the correct RTL direction for entire paragraphs.

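Add the Right-to-Left Embedding (RLE) character (\u202B) at the start of each RTL paragraph, and the Pop Directional Formatting (PDF) character (\u202C) at the end to reset the direction for subsequent content. Avoid the Right-to-Left Override (RLO) character (\u202E) for this: an override forces every character, including digits and Latin words, to run right to left, so "123" would display as "321". With the embedding in place, bidi-aware viewers lay the paragraph out right to left. Whether the text is also right-aligned still depends on the viewer, because a plain text file carries no alignment information.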
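As a minimal sketch of the wrapping step on its own, the helper below puts an embedding character in front of a paragraph and the pop character after it. The class and method names (BidiWrap, WrapRtl) are made up for illustration. The sketch uses the embedding character RLE (\u202B) because an embedding sets direction without reversing digits the way an override would.

```csharp
using System;

public static class BidiWrap
{
    private const char Rle = '\u202B'; // RIGHT-TO-LEFT EMBEDDING
    private const char Pdf = '\u202C'; // POP DIRECTIONAL FORMATTING

    // Wraps one paragraph so bidi-aware viewers treat its content as an RTL run.
    // Empty or null input is returned unchanged.
    public static string WrapRtl(string paragraph)
    {
        if (string.IsNullOrEmpty(paragraph))
        {
            return paragraph;
        }
        return Rle + paragraph + Pdf;
    }
}
```

Calling BidiWrap.WrapRtl(text) once per paragraph, before the line break is appended, keeps each embedding confined to its own line.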

2. Fixing Leading Numbers Appearing on the Left

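This is a classic bidirectional text issue. Digits are weak directional characters: they keep their own left-to-right digit order but take their position from the surrounding strong characters. When a paragraph starts with a number and the viewer assumes an LTR base direction, nothing RTL precedes the number, so it lands at the visual left of the line. The RTL wrapping from step 1 is what moves the number to the right of the Arabic content. On top of that, insert a Right-to-Left Mark (RLM) character (\u200F) right after the number: it places a strong RTL character directly beside the number, so the number and any punctuation that follows it (such as "." or ")") stay tied to the RTL text block. Without the wrapping from step 1, an RLM placed after ASCII digits is not enough on its own, because the digits still resolve against the LTR start of the paragraph.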
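The leading-number step can be sketched in isolation as follows. The names (LeadingNumber, InsertRlmAfterLeadingDigits) are hypothetical, and the helper only handles a run of digits at the very start of the string. Text that has no leading digits, or that consists only of digits, is returned unchanged.

```csharp
using System;

public static class LeadingNumber
{
    private const char Rlm = '\u200F'; // RIGHT-TO-LEFT MARK

    // Inserts an RLM directly after a leading run of digits, if there is one
    // and it is followed by more text.
    public static string InsertRlmAfterLeadingDigits(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        int i = 0;
        while (i < text.Length && char.IsDigit(text[i]))
        {
            i++;
        }

        // No leading digits, or nothing but digits: leave the text as it is.
        if (i == 0 || i == text.Length)
        {
            return text;
        }

        return text.Substring(0, i) + Rlm + text.Substring(i);
    }
}
```

Note that char.IsDigit also accepts Arabic-Indic digits (for example U+0661), so paragraphs numbered in either digit style go through the same path.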

Example C# Code Implementation

Assuming you're using the OpenXML SDK to read the .docx file, here's how to modify your code to handle both issues:

using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System.IO;   // File.WriteAllText
using System.Linq; // Select
using System.Text;

public static void ConvertDocxToTxtWithRtlFix(string docxPath, string txtPath)
{
    StringBuilder txtContent = new StringBuilder();

    using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, false))
    {
        Body body = doc.MainDocumentPart.Document.Body;
        foreach (Paragraph para in body.Elements<Paragraph>())
        {
            // Check if the paragraph is marked as RTL in the original docx.
            // A <w:bidi/> element with no val attribute means "on", so a missing Val counts as true.
            BiDi bidi = para.ParagraphProperties?.BiDi;
            bool isRtl = bidi != null && (bidi.Val?.Value ?? true);

            // Extract raw text from the paragraph
            string paraText = string.Join("", para.Elements<Run>()
                .Select(r => string.Join("", r.Elements<Text>().Select(t => t.Text))));

            if (isRtl)
            {
                // Open an RTL embedding (RLE, U+202B) for the entire paragraph.
                // RLO (U+202E) is avoided because it would also reverse digits and Latin text.
                txtContent.Append("\u202B");

                // Fix leading numbers: insert RLM after any leading numeric sequence
                if (!string.IsNullOrEmpty(paraText))
                {
                    int firstNonDigitIndex = 0;
                    while (firstNonDigitIndex < paraText.Length && char.IsDigit(paraText[firstNonDigitIndex]))
                    {
                        firstNonDigitIndex++;
                    }
                    if (firstNonDigitIndex > 0 && firstNonDigitIndex < paraText.Length)
                    {
                        string numberPart = paraText.Substring(0, firstNonDigitIndex);
                        string arabicPart = paraText.Substring(firstNonDigitIndex);
                        paraText = $"{numberPart}\u200F{arabicPart}";
                    }
                }
            }

            txtContent.Append(paraText);
            // Pop the directional formatting opened above (PDF, U+202C)
            if (isRtl)
            {
                txtContent.Append("\u202C");
            }
            txtContent.AppendLine();
        }
    }

    // Write with UTF-8 encoding to preserve Arabic characters
    File.WriteAllText(txtPath, txtContent.ToString(), Encoding.UTF8);
}

Key Notes:

Encoding: Always use Encoding.UTF8 when writing the TXT file—this ensures Arabic characters don't get mangled during conversion.
Control Characters: Wrap each RTL paragraph in an RLE/PDF pair (\u202B and \u202C), which sets RTL direction without reversing digits the way the RLO override (\u202E) would. The RLM keeps the leading number tied to the RTL text flow.
RTL Detection: The code checks the original docx's paragraph Bidi property, so your existing support for other languages stays fully functional.

Testing the Output

After running this code, open the generated TXT file in an RTL-aware editor (like Notepad++ with RTL support enabled, or Microsoft Word) to verify:

Arabic paragraphs are visually right-aligned (no left indentation)
Leading numbers appear immediately before the Arabic text on the right side of the paragraph, not the left
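
Because the control characters are invisible in most editors, it can also help to check them programmatically. The following is a rough sketch, with a hypothetical BidiDump.Reveal helper, that replaces the bidi controls discussed here with readable tags so you can see exactly where they ended up.

```csharp
using System;
using System.Text;

public static class BidiDump
{
    // Returns a copy of the line in which bidi control characters
    // are replaced by visible tags such as <RLE> or <RLM>.
    public static string Reveal(string line)
    {
        var sb = new StringBuilder();
        foreach (char c in line)
        {
            switch (c)
            {
                case '\u200F': sb.Append("<RLM>"); break;
                case '\u202B': sb.Append("<RLE>"); break;
                case '\u202C': sb.Append("<PDF>"); break;
                case '\u202E': sb.Append("<RLO>"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

One way to use it is to loop over File.ReadLines(txtPath) and print BidiDump.Reveal(line) for each line; an unexpected <RLO> tag, or a line that opens a control without a matching <PDF>, is then easy to spot.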

The question comes from Stack Exchange; it was asked by madan.