How do I represent a date or datetime in an inverted index?
Great question! Building on the string-based inverted index example you provided, there are a few practical ways to represent dates and datetimes, depending on your use case:
1. Exact String Match (Just Like Your City Example)
If you only need to exact-match specific dates, you can treat dates as unique string values, identical to how you handled city names.
For example, say you have 3 documents with dates:
- Doc 1: 6/30/16
- Doc 2: 7/1/16
- Doc 3: 6/30/16
Your inverted index would look exactly like the city example:
6/30/16 = [1, 0, 1]
7/1/16 = [0, 1, 0]
⚠️ Important: Make sure you standardize the date format upfront (e.g., always use MM/DD/YY instead of mixing with DD/MM/YY), otherwise the same date written differently will be treated as separate values.
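As a rough sketch of approach 1, the snippet below normalizes each date before using it as an index key. It assumes every input is written as MM/DD/YY and stores keys in ISO 8601 form (YYYY-MM-DD); the helper names `normalize` and `build_index` are made up for illustration and are not part of any library.

```python
from datetime import datetime

def normalize(raw, fmt="%m/%d/%y"):
    """Parse a date string in one known format and return an ISO 8601 key."""
    return datetime.strptime(raw, fmt).date().isoformat()

def build_index(docs):
    """Map each normalized date to a 0/1 vector with one slot per document."""
    index = {}
    for pos, raw in enumerate(docs):
        key = normalize(raw)
        # Create the vector on first sight of a key, then mark this document.
        index.setdefault(key, [0] * len(docs))[pos] = 1
    return index

docs = ["6/30/16", "7/1/16", "6/30/16"]  # Doc 1, Doc 2, Doc 3
index = build_index(docs)
```

Because the key is built from the parsed date rather than the raw text, "6/30/16" and "06/30/16" land in the same entry.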
2. Hierarchical Granularity Split (For Range Queries)
Dates have inherent hierarchical structure (year → month → day → hour, etc.), which makes this approach far more useful for real-world scenarios where you might want to query "all documents from June 2016" instead of just 6/30/16.
Split your date into its component parts, then create inverted indexes for each granularity:
- For 6/30/16, split into: 2016 (year), 6 (month), 30 (day)
- For 7/1/16, split into: 2016 (year), 7 (month), 1 (day)
Your indexes would then be:
2016 = [1, 1, 1]  # All docs are from 2016
6 = [1, 0, 1]     # Docs 1 and 3 are from June
30 = [1, 0, 1]    # Docs 1 and 3 are from the 30th
7 = [0, 1, 0]     # Doc 2 is from July
1 = [0, 1, 0]     # Doc 2 is from the 1st
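Now, if you want to find all June 2016 documents, you can intersect (bitwise AND) the vectors for 2016 and 6 to get [1, 0, 1], matching Docs 1 and 3. In a real index, prefix each key with its field (for example, month:6 versus day:6) so that a month value and a day value never share the same entry.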
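Here is a minimal sketch of the component split and the intersection step. It assumes MM/DD/YY input and field-prefixed keys such as `year:2016` and `month:6` (my own naming choice, so that month 1 and day 1 cannot collide); `build_component_index` and `intersect` are hypothetical helpers.

```python
from datetime import datetime

def build_component_index(docs, fmt="%m/%d/%y"):
    """Index each document under its year, month, and day components."""
    index = {}
    n = len(docs)
    for pos, raw in enumerate(docs):
        d = datetime.strptime(raw, fmt)
        # Field prefixes keep "month 1" and "day 1" in separate entries.
        for key in (f"year:{d.year}", f"month:{d.month}", f"day:{d.day}"):
            index.setdefault(key, [0] * n)[pos] = 1
    return index

def intersect(*vectors):
    """Bitwise AND across vectors: 1 only where every vector has a 1."""
    return [int(all(bits)) for bits in zip(*vectors)]

component_index = build_component_index(["6/30/16", "7/1/16", "6/30/16"])
june_2016 = intersect(component_index["year:2016"], component_index["month:6"])
```

With the three sample documents, `june_2016` selects Docs 1 and 3.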
3. Timestamp Conversion (For Numeric Range Queries)
Convert your date/datetime to a numeric timestamp (e.g., Unix time, which counts seconds since 1970-01-01 00:00:00 UTC). For 6/30/16, this would be 1467244800 (midnight UTC).
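One way to do the conversion, assuming MM/DD/YY input and treating a bare date as midnight UTC (both are assumptions; `to_unix_seconds` is a hypothetical helper name):

```python
from datetime import datetime, timezone

def to_unix_seconds(raw, fmt="%m/%d/%y"):
    """Interpret a date string as midnight UTC and return Unix seconds."""
    naive = datetime.strptime(raw, fmt)
    # Attach UTC explicitly; a naive datetime would use the local time zone.
    return int(naive.replace(tzinfo=timezone.utc).timestamp())

ts = to_unix_seconds("6/30/16")
```

Pinning the time zone matters here: the same calendar date maps to different timestamps in different zones, so mixing zones would split one date across several index entries.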
You have two options here:
- Exact timestamp match: Treat each unique timestamp like a string value, just like approach 1. Useful if you need precise datetime matches (e.g., "all documents from 6/30/16 10:00 AM").
- Interval-based indexing: Group timestamps into logical intervals (e.g., hourly, daily, monthly) and create an inverted index entry for each interval. For example, all timestamps in June 2016 would map to a 2016-06 interval, with a vector of all docs from that month. A range query then combines a handful of interval vectors instead of checking every timestamp (e.g., "all documents from Q2 2016" is the union of the 2016-04, 2016-05, and 2016-06 vectors).
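The interval option could look roughly like this. The sketch assumes monthly buckets keyed as YYYY-MM in UTC and takes Unix timestamps as input; `month_bucket`, `build_interval_index`, and `union` are illustrative names, not an existing API.

```python
from datetime import datetime, timezone

def month_bucket(ts):
    """Map a Unix timestamp to its UTC month, e.g. '2016-06'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")

def build_interval_index(timestamps):
    """One 0/1 vector per month bucket, one slot per document."""
    n = len(timestamps)
    index = {}
    for pos, ts in enumerate(timestamps):
        index.setdefault(month_bucket(ts), [0] * n)[pos] = 1
    return index

def union(vectors, size):
    """Bitwise OR across vectors: 1 where any vector has a 1."""
    result = [0] * size
    for vec in vectors:
        result = [a | b for a, b in zip(result, vec)]
    return result

# Midnight UTC on 2016-06-30, 2016-07-01, and 2016-06-30 (Docs 1 to 3).
timestamps = [1467244800, 1467331200, 1467244800]
interval_index = build_interval_index(timestamps)

# Q2 2016 covers April, May, and June; skip months with no documents.
q2_keys = ["2016-04", "2016-05", "2016-06"]
q2_2016 = union([interval_index[k] for k in q2_keys if k in interval_index],
                len(timestamps))
```

Coarser buckets mean fewer vectors to combine per query, at the cost of precision: a monthly bucket alone cannot answer "June 15 to June 20", so you would typically keep a finer level (daily or hourly) alongside it.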
Which Approach to Choose?
- Use exact string match if you only ever need to look up specific, full dates.
- Use hierarchical splits if you need flexible queries across date components (year, month, day).
- Use timestamp intervals if you're dealing with large datasets and need fast range queries over broad date ranges.
This question originally appeared on Stack Exchange; asked by longtimelurker42.




