
How do you represent dates or datetimes in an inverted index?

Handling Dates/Datetimes in Inverted Indexes

Great question! Building on the string-based inverted index example you provided, here is how dates and datetimes can be represented. There are a few practical approaches, depending on your use case:

1. Exact String Match (Just Like Your City Example)

If you only need exact matches on specific dates, you can treat each date as a unique string value, just as you handled city names.

For example, say you have 3 documents with dates:

  • Doc 1: 6/30/16
  • Doc 2: 7/1/16
  • Doc 3: 6/30/16

Your inverted index would look exactly like the city example:

6/30/16 = [1, 0, 1]
7/1/16 = [0, 1, 0]

⚠️ Important: Standardize the date format upfront (e.g., always MM/DD/YY, never mixed with DD/MM/YY, or an unambiguous format such as ISO 8601 YYYY-MM-DD). Otherwise the same date written two different ways will be indexed as two separate values.
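
As a minimal sketch of the exact-match approach, assuming Python and two hypothetical helpers (`normalize` and `build_index`, neither of which comes from the original question), the snippet below normalizes each date to a single format before indexing it. Doc 3 is deliberately written as 06/30/16 here to show that it still lands in the same entry as Doc 1.

```python
from datetime import datetime


def normalize(raw, fmt="%m/%d/%y"):
    """Parse a date written as MM/DD/YY and return it as an ISO 8601 string."""
    return datetime.strptime(raw, fmt).date().isoformat()


def build_index(values):
    """Map each distinct value to a 0/1 vector with one slot per document."""
    index = {}
    for pos, value in enumerate(values):
        # Create an all-zero vector the first time a value is seen,
        # then mark the current document's position.
        index.setdefault(value, [0] * len(values))[pos] = 1
    return index


# Assumed sample data: the three documents from the example above,
# with Doc 3 written with a leading zero.
raw_dates = ["6/30/16", "7/1/16", "06/30/16"]

date_index = build_index([normalize(d) for d in raw_dates])
print(date_index)
```

Without the `normalize` step, "6/30/16" and "06/30/16" would become two separate keys even though they are the same day.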

2. Hierarchical Granularity Split (For Range Queries)

Dates have an inherent hierarchical structure (year → month → day → hour, etc.). Indexing each level separately lets you answer queries such as "all documents from June 2016", not just exact matches on 6/30/16.

Split each date into its component parts, then create an inverted index entry for each component. Prefix every key with its component name so that, for example, month 1 and day 1 do not collide in the same entry:

  • For 6/30/16, split into: year:2016, month:6, day:30
  • For 7/1/16, split into: year:2016, month:7, day:1

Your indexes would then be:

year:2016 = [1, 1, 1]  # All docs are from 2016
month:6 = [1, 0, 1]    # Docs 1 and 3 are from June
day:30 = [1, 0, 1]     # Docs 1 and 3 are from the 30th
month:7 = [0, 1, 0]    # Doc 2 is from July
day:1 = [0, 1, 0]      # Doc 2 is from the 1st

Now, to find all June 2016 documents, intersect (bitwise AND) the vectors for year:2016 and month:6 to get [1, 0, 1], which matches Docs 1 and 3.
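
The split-and-intersect idea can be sketched as follows. This is only an illustration in Python: `build_component_index` and `intersect` are hypothetical names, and prefixing each key with its component name is an assumption of this sketch, made so that a month and a day with the same number never share an entry.

```python
from datetime import date

# Assumed sample data: the three documents from the example.
docs = [date(2016, 6, 30), date(2016, 7, 1), date(2016, 6, 30)]


def build_component_index(dates):
    """Index every date under its year, month, and day components."""
    index = {}
    for pos, d in enumerate(dates):
        for key in (f"year:{d.year}", f"month:{d.month}", f"day:{d.day}"):
            index.setdefault(key, [0] * len(dates))[pos] = 1
    return index


def intersect(*vectors):
    """Position-wise AND: a document matches only if every vector has a 1."""
    return [int(all(bits)) for bits in zip(*vectors)]


index = build_component_index(docs)

# "All documents from June 2016" = year 2016 AND month 6.
june_2016 = intersect(index["year:2016"], index["month:6"])
print(june_2016)
```

Note that this handles queries on whole components well, but an arbitrary range such as "June 15 to July 10" would need a union of several day entries per month.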

3. Timestamp Conversion (For Numeric Range Queries)

Convert your date/datetime to a numeric timestamp (e.g., Unix time, which counts seconds since 1970-01-01 00:00:00 UTC). For 6/30/16 at midnight UTC, this is 1467244800.
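
For illustration, one way to do this conversion in Python is shown below. The helper name `to_unix_seconds` is hypothetical, and the sketch assumes every date is interpreted in UTC; a different time zone would shift the result.

```python
from datetime import datetime, timezone


def to_unix_seconds(year, month, day, hour=0, minute=0):
    """Return whole seconds since 1970-01-01 00:00:00 UTC for a UTC datetime."""
    dt = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return int(dt.timestamp())


midnight = to_unix_seconds(2016, 6, 30)
ten_am = to_unix_seconds(2016, 6, 30, hour=10)

print(midnight, ten_am)
```

Attaching `tzinfo=timezone.utc` explicitly matters: a naive `datetime` would be interpreted in the machine's local time zone, so the same date could produce different timestamps on different servers.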

You have two options here:

  • Exact timestamp match: Treat each unique timestamp like a string value, just like approach 1. Useful if you need precise datetime matches (e.g., "all documents from 6/30/16 10:00 AM").
  • Interval-based indexing: Group timestamps into logical intervals (e.g., hourly, daily, monthly) and create an inverted index entry for each interval. For example, all timestamps in June 2016 would map to a 2016-06 interval, whose vector marks every doc from that month. A broad range query such as "all documents from Q2 2016" then only needs to combine (bitwise OR) the three monthly vectors 2016-04, 2016-05, and 2016-06, rather than touching every individual timestamp.
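
A rough sketch of interval-based indexing, assuming monthly buckets keyed as YYYY-MM and timestamps interpreted in UTC. The function names (`utc_ts`, `month_bucket`, `build_bucket_index`, `union`) are hypothetical and chosen only for this example.

```python
from datetime import datetime, timezone


def utc_ts(year, month, day):
    """Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def month_bucket(ts):
    """Map a Unix timestamp to its monthly interval key, e.g. '2016-06'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")


def build_bucket_index(timestamps):
    """Map each monthly bucket to a 0/1 vector with one slot per document."""
    index = {}
    for pos, ts in enumerate(timestamps):
        index.setdefault(month_bucket(ts), [0] * len(timestamps))[pos] = 1
    return index


def union(vectors, size):
    """Position-wise OR: a document matches if any vector has a 1."""
    result = [0] * size
    for vec in vectors:
        result = [a | b for a, b in zip(result, vec)]
    return result


# Assumed sample data: the three documents from the earlier example.
timestamps = [utc_ts(2016, 6, 30), utc_ts(2016, 7, 1), utc_ts(2016, 6, 30)]
bucket_index = build_bucket_index(timestamps)

# "All documents from Q2 2016" = April OR May OR June.
empty = [0] * len(timestamps)
q2_keys = ("2016-04", "2016-05", "2016-06")
q2_2016 = union([bucket_index.get(k, empty) for k in q2_keys], len(timestamps))
print(q2_2016)
```

The bucket size is a trade-off: coarser buckets mean fewer vectors to combine for wide ranges, but they cannot answer queries finer than the bucket itself.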

Which Approach to Choose?

  • Use exact string match if you only ever need to look up specific, full dates.
  • Use hierarchical splits if you need flexible queries across date components (year, month, day).
  • Use timestamp intervals if you're dealing with large datasets and need fast range queries over broad date ranges.

The question in this content comes from Stack Exchange, asked by longtimelurker42.
