如何用SQLite模拟大规模文件系统并实现本地持久化存储？

阿华AIGC实验室

2026-5-25

Alright, let's break down how to build a large-scale filesystem-like storage system with SQLite for metadata and disk-based file storage—handling millions of folders (each with dozens of files) without tanking performance is totally achievable with the right design choices.

Core Design Foundations

First, we need to split concerns clearly to keep the system efficient:

SQLite handles metadata: Folder IDs, file metadata (names, sizes, paths, timestamps), and parent-child relationships. SQLite is perfect for this because it’s lightweight, ACID-compliant, and scales well for read-heavy and moderate write workloads when optimized.
Local disk handles file content: Storing small files directly in SQLite works, but larger MB-scale files belong on the filesystem to avoid bloating the database and slowing down queries.

SQLite Schema Design

Since we need millions of folders with fixed-length unique IDs, we’ll use a space-efficient fixed-size identifier (e.g., 16-byte UUID stored as BLOB—hex strings work too but take more space).

Folders Table

Tracks all folder metadata and nested structure:

CREATE TABLE folders (
    folder_id BLOB(16) PRIMARY KEY NOT NULL, -- Fixed-length unique ID (binary UUID)
    name TEXT NOT NULL,
    parent_folder_id BLOB(16), -- NULL for root folders
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (parent_folder_id) REFERENCES folders(folder_id)
);

-- Critical indexes for fast nested traversal and name lookups
CREATE INDEX idx_folders_parent ON folders(parent_folder_id);
CREATE INDEX idx_folders_parent_name ON folders(parent_folder_id, name);

Files Table

Links files to their parent folders and tracks file-specific metadata:

CREATE TABLE files (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT, -- Internal unique ID for joins
    folder_id BLOB(16) NOT NULL,
    filename TEXT NOT NULL,
    file_size INTEGER NOT NULL, -- Size in bytes
    storage_path TEXT, -- Path for disk-stored files (NULL if using BLOB)
    file_blob BLOB, -- For small files (<= ~100KB, adjust based on your needs)
    hash_sha256 TEXT, -- For integrity verification
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (folder_id) REFERENCES folders(folder_id),
    UNIQUE(folder_id, filename) -- Prevent duplicate filenames in the same folder
);

-- Indexes for fast file lookups by folder or filename
CREATE INDEX idx_files_folder ON files(folder_id);
CREATE INDEX idx_files_folder_filename ON files(folder_id, filename);

Performance Optimization for Million-Scale Folders

SQLite can handle millions of rows, but these tweaks are non-negotiable:

Enable WAL Mode: Allows concurrent reads and writes, a must for scalability.
```
PRAGMA journal_mode=WAL;
```
Match Page Size to Filesystem: Set page size to your filesystem’s block size (usually 4KB or 8KB) for better I/O efficiency.
```
PRAGMA page_size=4096;
```
Batch Operations: Wrap bulk folder/file creation/deletion in transactions to avoid per-query disk I/O overhead.
Cache Frequently Used Metadata: Keep recent or top-level folder metadata in an in-memory cache (like a Python dict) to reduce SQLite query load.
Avoid Full Table Scans: Always use indexed columns in WHERE clauses—never search for folders by name without filtering by parent_folder_id first.

Disk-Based File Storage Strategy

Split file storage based on size to balance efficiency:

Small Files (<= 100KB): Store directly in the file_blob column. This cuts down on filesystem overhead and simplifies small file management.
Large Files (> 100KB): Store them in a hierarchical directory structure to avoid single-directory bloat. Use the folder_id to generate paths:
1. Convert the binary folder_id to a hex string.
2. Use the first 2 characters as the top-level directory, next 2 as the second-level directory.
3. Name the file with a unique identifier (e.g., file_id + extension).
Example path: ./storage/ab/cd/123456.pdf where ab/cd come from the folder ID’s hex string, and 123456 is the file_id.

Key Operation Examples

Create a Folder

-- Insert a nested folder (replace with your actual binary UUIDs)
INSERT INTO folders (folder_id, name, parent_folder_id)
VALUES (X'your-16-byte-uuid-here', 'Project Docs', X'parent-folder-uuid-here');

Create a File

For a small file (stored as BLOB):

INSERT INTO files (folder_id, filename, file_size, file_blob, hash_sha256)
VALUES (X'folder-uuid-here', 'notes.txt', 512, X'blob-data-here', 'sha256-hash-string');

For a large file (stored on disk):

INSERT INTO files (folder_id, filename, file_size, storage_path, hash_sha256)
VALUES (X'folder-uuid-here', 'annual-report.pdf', 2097152, './storage/ab/cd/123456.pdf', 'sha256-hash-string');

Query All Files in a Folder

SELECT filename, file_size, storage_path, created_at
FROM files
WHERE folder_id = X'folder-uuid-here'
ORDER BY created_at DESC;

Final Tips

Backup Regularly: Back up both the SQLite database and the file storage directory. Use VACUUM periodically to optimize the database file.
Concurrency Limits: SQLite’s WAL mode allows multiple readers and one writer—if you need higher write concurrency, consider sharding the database by folder ID ranges.
Integrity Checks: Use the hash_sha256 column to verify file content hasn’t been corrupted, whether stored on disk or in the BLOB.

内容的提问来源于stack exchange，提问作者valdo