How do you represent dates or datetimes in an inverted index?

阿华AIGC实验室

2026-5-21

Handling Dates/Datetimes in Inverted Indexes

Great question! Let's build on the string-based inverted index example you provided to cover how dates and datetimes can be represented. There are a few practical approaches, depending on your use case:

1. Exact String Match (Just Like Your City Example)

If you only need to exact-match specific dates, you can treat dates as unique string values, identical to how you handled city names.

For example, say you have 3 documents with dates:

Doc 1: 6/30/16
Doc 2: 7/1/16
Doc 3: 6/30/16

Your inverted index would look exactly like the city example:

6/30/16 = [1, 0, 1]
7/1/16 = [0, 1, 0]

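⚠️ Important: Make sure you standardize the date format upfront (e.g., always use MM/DD/YY instead of mixing it with DD/MM/YY), otherwise the same date written differently will be treated as separate values. If you control the format, an unambiguous one such as ISO 8601 (YYYY-MM-DD) avoids both the day/month confusion and the two-digit-year problem.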
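As a minimal sketch of this approach, the Python below parses each raw date, normalizes it to an ISO 8601 key, and builds one bit vector per distinct date. The `docs` dictionary and the helper names `normalize` and `build_exact_index` are made up for illustration, and the input is assumed to always be in M/D/YY form.

```python
from datetime import datetime

# Hypothetical sample data: document id -> raw date string in M/D/YY form.
docs = {1: "6/30/16", 2: "7/1/16", 3: "6/30/16"}


def normalize(raw: str) -> str:
    """Parse an M/D/YY string and return an ISO 8601 (YYYY-MM-DD) key."""
    return datetime.strptime(raw, "%m/%d/%y").date().isoformat()


def build_exact_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each normalized date to a bit vector with one slot per document."""
    doc_ids = sorted(docs)
    index: dict[str, list[int]] = {}
    for pos, doc_id in enumerate(doc_ids):
        key = normalize(docs[doc_id])
        # Create an all-zero vector the first time a date is seen, then set the bit.
        index.setdefault(key, [0] * len(doc_ids))[pos] = 1
    return index


exact_index = build_exact_index(docs)
for key, vector in exact_index.items():
    print(key, "=", vector)
```

Because every date passes through `normalize` before it becomes a key, two spellings of the same day cannot end up as separate entries, provided they share the assumed input format.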

2. Hierarchical Granularity Split (For Range Queries)

Dates have inherent hierarchical structure (year → month → day → hour, etc.), which makes this approach far more useful for real-world scenarios where you might want to query "all documents from June 2016" instead of just 6/30/16.

Split your date into its component parts, then create inverted indexes for each granularity:

For 6/30/16, split into: 2016 (year), 6 (month), 30 (day)
For 7/1/16, split into: 2016 (year), 7 (month), 1 (day)

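Your indexes would then be keyed by field as well as value, so that a month and a day with the same number (e.g., January and the 1st) do not end up sharing one entry:

year:2016 = [1, 1, 1]  # All docs are from 2016
month:6 = [1, 0, 1]    # Docs 1 and 3 are from June
day:30 = [1, 0, 1]     # Docs 1 and 3 are from the 30th
month:7 = [0, 1, 0]    # Doc 2 is from July
day:1 = [0, 1, 0]      # Doc 2 is from the 1st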

Now, if you want to find all June 2016 documents, you can intersect (bitwise AND) the vectors for year 2016 and month 6 to get [1, 0, 1], which matches Docs 1 and 3.
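The split-and-intersect idea can be sketched as follows. This is an illustration rather than a definitive implementation: the `docs` data and the helpers `build_component_index` and `intersect` are hypothetical, and each key is a `(field, value)` tuple so that a month number and a day number never collide.

```python
from datetime import date

# Hypothetical sample data: document id -> parsed date.
docs = {1: date(2016, 6, 30), 2: date(2016, 7, 1), 3: date(2016, 6, 30)}


def build_component_index(docs: dict[int, date]) -> dict[tuple[str, int], list[int]]:
    """Build one bit vector per (field, value) pair, e.g. ("month", 6)."""
    doc_ids = sorted(docs)
    index: dict[tuple[str, int], list[int]] = {}
    for pos, doc_id in enumerate(doc_ids):
        d = docs[doc_id]
        for key in (("year", d.year), ("month", d.month), ("day", d.day)):
            index.setdefault(key, [0] * len(doc_ids))[pos] = 1
    return index


def intersect(*vectors: list[int]) -> list[int]:
    """Bitwise AND across vectors: a document matches only if every bit is set."""
    return [int(all(bits)) for bits in zip(*vectors)]


component_index = build_component_index(docs)
june_2016 = intersect(component_index[("year", 2016)], component_index[("month", 6)])
print(june_2016)  # [1, 0, 1]
```

The same `intersect` call takes a third vector such as `component_index[("day", 30)]` when a query needs to narrow down to a single day.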

3. Timestamp Conversion (For Numeric Range Queries)

Convert your date/datetime to a numeric timestamp (e.g., Unix time, which counts seconds since 1970-01-01 00:00:00 UTC). For 6/30/16, this would be 1467244800 (assuming midnight UTC).

You have two options here:

Exact timestamp match: Treat each unique timestamp like a string value, just like approach 1. Useful if you need precise datetime matches (e.g., "all documents from 6/30/16 10:00 AM").
Interval-based indexing: Group timestamps into logical intervals (e.g., hourly, daily, monthly) and create an inverted index entry for each interval. For example, all timestamps in June 2016 would map to a 2016-06 interval, with a vector of all docs from that month. A broad range query then touches only a handful of interval vectors instead of every distinct timestamp (e.g., "all documents from Q2 2016" is the union of the 2016-04, 2016-05, and 2016-06 vectors).
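Here is a rough sketch of the interval-based option, assuming monthly buckets and UTC timestamps. The helper names (`ts`, `month_bucket`, `build_interval_index`, `union`) and the sample times are invented for illustration. A production index would more likely use compressed bitmaps than Python lists.

```python
from datetime import datetime, timezone


def ts(year: int, month: int, day: int, hour: int = 0) -> int:
    """Unix timestamp (whole seconds) for a UTC wall-clock time."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp())


# Hypothetical sample data: document id -> Unix timestamp.
docs = {1: ts(2016, 6, 30, 10), 2: ts(2016, 7, 1), 3: ts(2016, 6, 30)}


def month_bucket(timestamp: int) -> str:
    """Map a timestamp to its monthly interval key, e.g. "2016-06"."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")


def build_interval_index(docs: dict[int, int]) -> dict[str, list[int]]:
    """Build one bit vector per monthly interval."""
    doc_ids = sorted(docs)
    index: dict[str, list[int]] = {}
    for pos, doc_id in enumerate(doc_ids):
        index.setdefault(month_bucket(docs[doc_id]), [0] * len(doc_ids))[pos] = 1
    return index


def union(vectors, size: int) -> list[int]:
    """Bitwise OR across vectors: a document matches if any interval contains it."""
    result = [0] * size
    for vector in vectors:
        result = [a | b for a, b in zip(result, vector)]
    return result


interval_index = build_interval_index(docs)

# "All documents from Q2 2016" becomes a union over three monthly buckets.
q2_buckets = ["2016-04", "2016-05", "2016-06"]
q2_docs = union((interval_index[b] for b in q2_buckets if b in interval_index), len(docs))
print(q2_docs)  # [1, 0, 1]
```

Bucket size is the main design choice: coarser buckets mean fewer vectors to combine for wide ranges, while finer buckets allow tighter range boundaries at the cost of more entries.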

Which Approach to Choose?

Use exact string match if you only ever need to look up specific, full dates.
Use hierarchical splits if you need flexible queries across date components (year, month, day).
Use timestamp intervals if you're dealing with large datasets and need fast range queries over broad date ranges.

The question comes from Stack Exchange; it was asked by longtimelurker42.