How do I use python-pptx to extract data from tables that appear only occasionally in a PowerPoint file?

阿华AIGC实验室

2026-4-13

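Hey, I ran into a similar problem myself! Your earlier code missed the tables because, in python-pptx, a table lives inside its own shape type (a GraphicFrame) that has no text attribute, so your hasattr(shape, "text") check skips it outright. Let me walk you through the fix: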

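First, python-pptx provides shape.has_table for exactly this check: for any shape you reach while iterating slide.shapes, it tells you whether that shape holds a table, no matter where the table sits on the slide. After that, you only need to add one branch to your existing loop that extracts the table contents.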
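
As a small illustration of the has_table check on its own, here is a minimal sketch. The helper name iter_tables is made up for this example, and the descent into group shapes is my own addition (an assumption that a table might sit inside a group, which a plain loop over slide.shapes would not reach); it is not something your original code did.

```python
def iter_tables(shapes):
    """Yield every table found in `shapes`, descending into group shapes.

    `iter_tables` is a hypothetical helper name, not part of python-pptx.
    """
    for shape in shapes:
        # A graphic frame that holds a table reports has_table == True
        if getattr(shape, "has_table", False):
            yield shape.table
        # Group shapes expose their members through a .shapes collection
        # (assumption: nested tables are worth collecting too)
        elif hasattr(shape, "shapes"):
            yield from iter_tables(shape.shapes)
```

With a real file you would call it as iter_tables(slide.shapes) for each slide in Presentation(path).slides.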

Complete improved code

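I extended your original code a bit: besides ordinary text, it now also extracts the content of every non-empty table cell, and at the end it packs everything into the "presentation name, date, full content text" structure you need: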

from pathlib import Path

from pptx import Presentation

def extract_pptx_content(pptx_path):
    presentation = Presentation(pptx_path)
    all_slide_content = []
    
    # Get the presentation's file name and creation date
    presentation_name = Path(pptx_path).name  # uses the current platform's path separator
    presentation_date = presentation.core_properties.created  # may be None if the file records no creation time
    
    for slide_num, slide in enumerate(presentation.slides):
        slide_content = f"--- Slide {slide_num + 1} ---\n"
        for shape in slide.shapes:
            # Handle ordinary shapes that carry text, skipping empty text
            if hasattr(shape, "text"):
                if shape.text.strip():
                    slide_content += shape.text + "\n"
            # Handle table shapes separately
            elif shape.has_table:
                table = shape.table
                slide_content += "[Table content]\n"
                # Walk through every row and cell of the table
                for row in table.rows:
                    cell_texts = []
                    for cell in row.cells:
                        cleaned_text = cell.text.strip()
                        if cleaned_text:
                            cell_texts.append(cleaned_text)
                    # Join the row's non-empty cell texts with a separator; adjust the format as needed
                    if cell_texts:
                        slide_content += " | ".join(cell_texts) + "\n"
        all_slide_content.append(slide_content)
    
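    # Assemble the final output structure you asked for
    final_result = {
        "presentation_name": presentation_name,
        # created can be None when the file has no creation timestamp, so guard before formatting
        "presentation_date": presentation_date.strftime("%Y-%m-%d %H:%M:%S") if presentation_date else None,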
        "full_content": "\n".join(all_slide_content)
    }
    return final_result

# Example test call
if __name__ == "__main__":
    ppt_data = extract_pptx_content("your_presentation.pptx")
    print("Presentation name:", ppt_data["presentation_name"])
    print("Creation date:", ppt_data["presentation_date"])
    print("\nFull content:\n", ppt_data["full_content"])
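
One thing to be aware of: the loop above drops empty cells, so a row with a blank cell in the middle ends up with fewer fields and its columns no longer line up with the header. If you need the column positions, a variant along these lines might work better. This is only a sketch; table_to_rows and rows_to_text are names I made up, and they rely on nothing beyond the table.rows, row.cells, and cell.text attributes already used above.

```python
def table_to_rows(table):
    """Return the table as a list of rows, each a list of stripped cell strings.

    Empty cells are kept as "" so every row has one entry per column.
    (Hypothetical helper, not part of python-pptx.)
    """
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def rows_to_text(rows, sep=" | "):
    """Join each row with `sep` and put one row per line."""
    return "\n".join(sep.join(row) for row in rows)
```

Inside the table branch you could then write slide_content += rows_to_text(table_to_rows(shape.table)) + "\n", or keep the list of lists as-is if you want to load it into a CSV or a DataFrame later.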