Help with multilingual compatibility in Matplotlib word clouds: multi-font matching and algorithmic approaches

阿华AIGC实验室

2026-5-7

Getting multiple languages to display together in a Matplotlib WordCloud

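Hey, multilingual word clouds really are a headache. I ran into the same kind of trouble on a cross-border project, so here are a few approaches that worked for me:

Approach 1: Assign a matching font to each language dynamically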

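One caveat first: WordCloud's font_path parameter takes a single font file path for the whole cloud, not a function, so a single instance cannot use a Chinese font and Arial at the same time. The per-language matching therefore has to happen in your own code, by deciding for each word which font covers its script: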

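Step 1: Define a font-matching function

Start with a function that identifies the language from the text's character ranges and returns the path of the matching font file: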

from wordcloud import WordCloud
import arabic_reshaper
from bidi.algorithm import get_display

def get_font_for_text(text):
    """Return the font file for the first character that belongs to a known script."""
    for char in text:
        # Chinese (CJK Unified Ideographs)
        if '\u4e00' <= char <= '\u9fff':
            return './fonts/SimHei.ttf'  # replace with your Chinese font path, e.g. Noto Sans SC
        # Arabic/Urdu (Arabic block, plus the presentation forms that reshaped text uses)
        elif ('\u0600' <= char <= '\u06ff' or '\ufb50' <= char <= '\ufdff'
              or '\ufe70' <= char <= '\ufeff'):
            return './fonts/Arial.ttf'  # Noto Sans Arabic is a more reliable choice
        # Hindi (Devanagari block)
        elif '\u0900' <= char <= '\u097f':
            return './fonts/NotoSansDevanagari.ttf'
        # Russian (Cyrillic block)
        elif '\u0400' <= char <= '\u04ff':
            return './fonts/Arial.ttf'  # Arial itself covers Cyrillic
    # Default fallback font
    return './fonts/Arial.ttf'
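The function above decides on the first character it recognizes, so a word that mixes scripts is assigned by whichever listed script shows up first. A minimal sketch of one alternative, counting characters per script and taking the majority, might look like this. SCRIPT_RANGES, script_of, and dominant_script are names made up for this illustration. The ranges mirror the ones above, with the Arabic presentation-form blocks added as an assumption because arabic_reshaper output falls into them:

```python
from collections import Counter

# Hypothetical lookup table: script name -> inclusive code point ranges.
SCRIPT_RANGES = {
    "cjk": [(0x4E00, 0x9FFF)],
    "arabic": [(0x0600, 0x06FF), (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)],
    "devanagari": [(0x0900, 0x097F)],
    "cyrillic": [(0x0400, 0x04FF)],
}


def script_of(char):
    """Return the script name for one character, or "other" if unlisted."""
    cp = ord(char)
    for name, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return "other"


def dominant_script(text):
    """Return the most frequent listed script in text, or "other" if none appears."""
    counts = Counter(script_of(c) for c in text)
    counts.pop("other", None)  # Latin letters, digits, and punctuation do not vote
    if not counts:
        return "other"
    return counts.most_common(1)[0][0]


for sample in ["词云", "привет", "Python", "AI模型"]:
    print(sample, dominant_script(sample))
```

The script name can then be mapped to a font path in a plain dict. On a tie, the script encountered first in the word wins.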

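Step 2: Handle Arabic/Urdu text layout

RTL (right-to-left) scripts like these need glyph reshaping first and bidirectional reordering second. Without both, the letters come out disconnected and in the wrong order: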

# Assumes your word-frequency dict is word_freq, in the form {word: frequency}
processed_words = {}
for word, freq in word_freq.items():
    # Check whether the word contains Arabic/Urdu characters
    if any('\u0600' <= c <= '\u06ff' for c in word):
        # Reshape the glyphs, then apply the bidi algorithm
        reshaped = arabic_reshaper.reshape(word)
        display_word = get_display(reshaped)
        processed_words[display_word] = freq
    else:
        processed_words[word] = freq
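The range check above only looks at the basic Arabic block. As a sketch of a broader test, Python's standard unicodedata module reports each character's bidirectional class, where "R" and "AL" mark right-to-left letters. The helper name has_rtl_chars is hypothetical:

```python
import unicodedata


def has_rtl_chars(word):
    """Return True if any character has a right-to-left bidi class."""
    # "AL" = Arabic letter (Arabic, Urdu, Persian, ...); "R" = other RTL letters (e.g. Hebrew)
    return any(unicodedata.bidirectional(c) in ("R", "AL") for c in word)


for sample in ["اردو", "שלום", "hello", "词云"]:
    print(sample, has_rtl_chars(sample))
```

Words that pass this test can go through the same reshape and get_display calls used above.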

Step 3: Generate the multilingual word cloud

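WordCloud's font_path has to be a single font file path (a string), so passing the function itself will not work. Instead, call get_font_for_text on each word, group the words by the font it returns, and render one cloud per group. Each group ends up in its own image, which you can then place side by side:

from collections import defaultdict

# Group the processed words by the font that covers them
words_by_font = defaultdict(dict)
for word, freq in processed_words.items():
    words_by_font[get_font_for_text(word)][word] = freq

# Render one cloud per font group
for i, (font_path, freqs) in enumerate(words_by_font.items()):
    wc = WordCloud(
        font_path=font_path,
        width=1200,
        height=800,
        background_color="white",
        max_words=200
    )
    wc.generate_from_frequencies(freqs)
    wc.to_file(f"multilingual_wordcloud_{i}.png")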
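The font paths in this answer are placeholders, and a path that does not exist only shows up as an error once rendering starts. A small pre-flight check along these lines can catch that earlier. missing_fonts is a hypothetical helper, and the paths listed are just the example ones from above:

```python
from pathlib import Path


def missing_fonts(font_paths):
    """Return the font paths that do not point to an existing file."""
    return [p for p in font_paths if not Path(p).is_file()]


fonts = [
    "./fonts/SimHei.ttf",
    "./fonts/Arial.ttf",
    "./fonts/NotoSansDevanagari.ttf",
]
missing = missing_fonts(fonts)
if missing:
    print("Missing font files:", ", ".join(missing))
```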