Prefetching Data Structures for Query Optimization
Overview
To optimize query performance on an index, we can prefetch the underlying field-specific or global data structures required by different query types. By understanding which structures are used in which contexts, we can avoid unnecessary I/O and reduce latency.
This document outlines the data structures needed during various phases of query execution.
Data Structures by Query Type
Index Probe Query
An index probe is performed on every segment to identify the documents that match a field term, especially for prefix-search-style queries.
- Required Data Structures (per field):
TermDictionaryFstTree, Postings
Term Query
A term query uses a Bloom filter to determine whether a segment should be included for an index probe.
- Required Data Structures (per field):
BloomFilter, TermDictionaryFstTree, Postings
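The segment-pruning role of the Bloom filter can be illustrated with a minimal Python sketch. It is not the engine's implementation: the `BloomFilter` class, its sizing, the hashing scheme, and the `segments_to_probe` helper are all hypothetical names chosen for illustration. The sketch only shows the general technique, where a negative answer from the filter lets a segment be skipped and a positive answer means the segment still needs an index probe.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative, not tuned (hypothetical class)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _bit_indexes(self, term):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for i in self._bit_indexes(term):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, term):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[i // 8] & (1 << (i % 8)) for i in self._bit_indexes(term)
        )


def segments_to_probe(segment_filters, term):
    """Return names of segments whose filter cannot rule out the term."""
    return [
        name for name, bloom in segment_filters.items()
        if bloom.might_contain(term)
    ]
```

A Bloom filter never produces false negatives, so a segment that contains the term is always kept. It can produce false positives, which is why the term dictionary and postings are still needed for the segments that pass.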
Aggregation and Sort Query
These operations require access to column-stored field data, typically used for sorting or aggregating numeric or keyword data.
- Required Data Structures (per field):
ColumnStoredData, ColumnStoredSkip
Final Hit Collection
Once document hits are determined, their full rows are retrieved for result formation or output.
- Required Data Structures (global — not per field):
RowStoredData, RowStoredSkip
Field-Specific Data Structures
Text Fields
Text fields require special structures for scoring and positional access.
- Required Data Structures (per field):
Norms, Positions
Geo Type Fields
Geo queries (e.g., bounding box, radius search) rely on spatial indexes.
- Required Data Structures (per field):
TermDictionaryBkd, TermDictionaryMetas, TermDictionaryBlocks, Postings
Range Queries
Range-capable fields (typically numeric or date types) use sketch-based approximations.
- Required Data Structures (per field):
Sketch
Special Case: “Must Not” Term Optimization
For must_not queries, an optimization may bypass the index probe entirely and use a column lookup instead, based on a metric threshold. Because the optimization is not always applied, the structures for both paths are listed.
- Required Data Structures (per field):
- From Index Probe:
TermDictionaryFstTree, Postings
- From Aggregation/Sort:
ColumnStoredData, ColumnStoredSkip
Summary Table
| Query Type / Operation | Field-Specific | Required Data Structures |
|---|---|---|
| Index Probe | ✅ | TermDictionaryFstTree, Postings |
| Term Query | ✅ | BloomFilter, TermDictionaryFstTree, Postings |
| Aggregation / Sort | ✅ | ColumnStoredData, ColumnStoredSkip |
| Final Hit Collection | ❌ | RowStoredData, RowStoredSkip |
| Text Fields | ✅ | Norms, Positions |
| Geo Fields | ✅ | TermDictionaryBkd, TermDictionaryMetas, TermDictionaryBlocks, Postings |
| Range Queries | ✅ | Sketch |
| Must Not Term Optimization | ✅ | TermDictionaryFstTree, Postings, ColumnStoredData, ColumnStoredSkip |
Notes
- All data structures except RowStoredData and RowStoredSkip are field-specific.
- Prefetching should be context-aware: only load structures relevant to the query type and fields involved.
- This enables fine-grained resource management and query performance improvements.
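The summary table and notes can be read as a lookup from operation to required structures. The Python sketch below shows one way to turn that lookup into a deduplicated prefetch plan. The structure names come from the table; the operation keys (`term_query`, `aggregation_sort`, and so on) and the `prefetch_plan` function are hypothetical and are not part of any Mach5 API.

```python
# Structures per operation, copied from the summary table.
# The dictionary keys are hypothetical labels chosen for this sketch.
FIELD_STRUCTURES = {
    "index_probe": ["TermDictionaryFstTree", "Postings"],
    "term_query": ["BloomFilter", "TermDictionaryFstTree", "Postings"],
    "aggregation_sort": ["ColumnStoredData", "ColumnStoredSkip"],
    "text_field": ["Norms", "Positions"],
    "geo_field": [
        "TermDictionaryBkd",
        "TermDictionaryMetas",
        "TermDictionaryBlocks",
        "Postings",
    ],
    "range_query": ["Sketch"],
    "must_not_term": [
        "TermDictionaryFstTree",
        "Postings",
        "ColumnStoredData",
        "ColumnStoredSkip",
    ],
}

# The only structures that are global rather than per field.
GLOBAL_STRUCTURES = {
    "final_hit_collection": ["RowStoredData", "RowStoredSkip"],
}


def prefetch_plan(requests):
    """Build a deduplicated prefetch plan.

    requests: iterable of (operation, field) pairs; the field is ignored
    (and may be None) for global operations.
    Returns (per_field, global_structures), where per_field maps each
    field name to an ordered list of structures to prefetch.
    """
    per_field = {}
    global_structures = []
    for operation, field in requests:
        if operation in GLOBAL_STRUCTURES:
            for name in GLOBAL_STRUCTURES[operation]:
                if name not in global_structures:
                    global_structures.append(name)
            continue
        # Raises KeyError for an operation that is not in the table.
        wanted = per_field.setdefault(field, [])
        for name in FIELD_STRUCTURES[operation]:
            if name not in wanted:
                wanted.append(name)
    return per_field, global_structures
```

For example, a term query plus a sort on the same field yields one combined list for that field, so shared structures are requested only once, while row-stored structures are collected separately because they are not tied to a field.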
Query to Extract Data Structure Metadata
To inspect the data structure metadata for an index or field, use the following queries. Run them from a Notebook resource in the Mach5 admin UI.
Field-wise metadata
index_segment_metadata('index_name')
| summarize sum(component_length) by field_name
| render piechart
Summarize metadata
index_segment_metadata('index_name')
| summarize sum(component_length) by component_type
| render piechart
Prefetch metadata
set ldop=16;
index_segment_field_prefetch("index-name", "field-name", "metadata")
Summarize metadata by segment
index_segment_metadata('index_name')
| summarize sum(component_length) by segment_name
Number of segments
index_segment_metadata('index_name')
| summarize sum(component_length) by segment_name
| summarize count()