How can I read Parquet files in R without using Spark packages?

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-27

R Alternatives to Spark for Reading Parquet Files (No Cluster Overhead)

Spinning up a Spark cluster just to read Parquet files adds overhead you don't need. The options below are the R equivalents of Python's pandas.read_parquet and Apache Arrow workflows, and they all run locally without Spark:

1. Use the `arrow` Package (Apache Arrow's Official R Implementation)

This is the most direct match to using Apache Arrow in Python. It's lightweight, optimized for speed, and handles Parquet files natively without any Spark dependencies.

First install the package if you haven't:
```
install.packages("arrow")
```

Read a Parquet file into a regular R data frame (just like pandas.read_parquet):

```
library(arrow)

# Read into an R data frame
df <- read_parquet("path/to/your/file.parquet")
```

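If you don't need an R data frame right away, you can read the file into an Arrow Table instead. The Table keeps the data in Arrow's columnar format and skips the conversion to R vectors, but it is still held in memory. For data that doesn't fit in memory, use open_dataset() as described in section 2.

```
# Read into an Arrow Table (stays in Arrow's columnar format, no conversion to a data frame)
arrow_table <- read_parquet("path/to/your/file.parquet", as_data_frame = FALSE)
```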
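To make the Arrow Table option concrete, here is a minimal sketch that also uses the `col_select` argument of `read_parquet()` to load only some columns. The data frame `sample_df`, its column names, and the temporary file are made up for illustration so that the snippet runs on its own.

```r
library(arrow)

# Hypothetical sample data, written to a temporary Parquet file for the demo
sample_df <- data.frame(
  id    = 1:4,
  score = c(0.5, 1.5, 2.5, 3.5),
  label = c("a", "b", "c", "d")
)
path <- tempfile(fileext = ".parquet")
write_parquet(sample_df, path)

# Read only two of the three columns and keep the result as an Arrow Table
tbl <- read_parquet(path, col_select = c("id", "score"), as_data_frame = FALSE)

# Inspect the Table without converting it
n_rows    <- tbl$num_rows
col_names <- names(tbl)

# Convert to an R data frame only when you actually need one
small_df <- as.data.frame(tbl)
```

Selecting columns at read time is often the cheapest way to cut memory use, since Parquet stores each column separately and unselected columns need not be decoded.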

You can also write Parquet files with write_parquet(), which mirrors pandas' DataFrame.to_parquet() or Arrow's writing utilities in Python.
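As a rough illustration of writing, the sketch below saves a data frame with `write_parquet()` and reads it back with `read_parquet()`. The data frame `measurements` and the temporary path are hypothetical; substitute your own object and file location.

```r
library(arrow)

# Hypothetical data frame to save
measurements <- data.frame(
  sensor  = c("s1", "s2", "s3"),
  reading = c(20.1, 19.8, 21.4)
)

# Write to a Parquet file (a temporary path is used here for the demo)
out_path <- tempfile(fileext = ".parquet")
write_parquet(measurements, out_path)

# Read it back and compare values with the original
restored <- read_parquet(out_path)
same_values <- isTRUE(all.equal(
  as.data.frame(restored), measurements,
  check.attributes = FALSE
))
```

The comparison ignores attributes so that it checks only the column values, regardless of the exact data frame class that `read_parquet()` returns.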

2. Lazy Evaluation with `dplyr` + `arrow` (For Large Datasets)

If you're working with massive Parquet files (too big to load entirely into memory), you can use dplyr syntax with Arrow's lazy query engine. This is similar to using Dask with pandas in Python, letting you run efficient queries without loading everything at once.

```
library(arrow)
library(dplyr)

# Connect to a single Parquet file or directory of Parquet files lazily
parquet_dataset <- open_dataset("path/to/parquet_files/")

# Run dplyr queries (they're executed efficiently in the background)
filtered_results <- parquet_dataset %>%
  filter(your_column > 100) %>%
  select(col1, col2, col3) %>%
  collect() # Bring the final results into an R data frame when ready
```
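For a version of this workflow you can run end to end, the sketch below first creates a small partitioned dataset with `write_dataset()` and then queries it lazily. The `sales` data, its column names, and the temporary directory are invented for the example; in practice you would point `open_dataset()` at your existing Parquet directory.

```r
library(arrow)
library(dplyr)

# Hypothetical sample data, used only so the sketch runs on its own
sales <- data.frame(
  region = rep(c("north", "south"), each = 3),
  units  = c(50L, 120L, 300L, 80L, 150L, 20L),
  price  = c(1.5, 2.0, 2.5, 3.0, 3.5, 4.0)
)

# Write one Parquet file per region into a temporary directory
dataset_dir <- file.path(tempdir(), "sales_parquet")
write_dataset(sales, dataset_dir, format = "parquet", partitioning = "region")

# Open the directory lazily: no rows are loaded at this point
ds <- open_dataset(dataset_dir)

# Build the query; it is still unevaluated here
query <- ds %>%
  filter(units > 100) %>%
  select(region, units)

# collect() runs the query and returns only the matching rows to R
result <- collect(query)
```

Because the filter and column selection are applied by Arrow before the data reaches R, only the rows and columns in `result` are materialized as an R data frame.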

Both of these approaches avoid Spark entirely and keep your workflow lightweight and local, just like the Python tools you're used to.

The question in this post comes from Stack Exchange; it was asked by Gerg.