Prefetching Data Structures for Query Optimization

Overview

To optimize query performance on an index, we can prefetch the underlying field-specific or global data structures required by different query types. By understanding which structures are used in which contexts, we can avoid unnecessary I/O and reduce latency.

This document outlines the data structures needed during various phases of query execution.


Data Structures by Query Type

Index Probe Query

An index probe runs on every segment to identify the documents that match a field term. It is used in particular for prefix-style searches.

  • Required Data Structures (per field):
    • TermDictionaryFstTree
    • Postings

Term Query

A term query uses a Bloom filter to determine whether a segment should be included for an index probe.

  • Required Data Structures (per field):
    • BloomFilter
    • TermDictionaryFstTree
    • Postings
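
The segment-skipping role of the Bloom filter can be sketched as follows. This is a minimal illustration in Python, not the engine's implementation: the BloomFilter class shown here, its sizing parameters, the hashing scheme, and the segments_to_probe helper are all assumptions made for the example. It only shows the general idea that a filter can rule a segment out (no false negatives) but can never confirm a match on its own.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, term: str):
        # Derive num_hashes bit positions by salting the term with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term: str) -> None:
        for pos in self._positions(term):
            self.bits[pos] = 1

    def might_contain(self, term: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(term))


def segments_to_probe(segment_filters: dict, term: str) -> list:
    """Return the segments whose filter cannot rule out the term (hypothetical helper)."""
    return [name for name, bf in segment_filters.items() if bf.might_contain(term)]


# Hypothetical per-segment filters for a single field.
seg_a, seg_b = BloomFilter(), BloomFilter()
seg_a.add("apple")
seg_b.add("banana")
filters = {"seg_a": seg_a, "seg_b": seg_b}
candidates = segments_to_probe(filters, "apple")
```

Only the segments returned by such a check would go on to the index probe, which is why a term query needs the BloomFilter in addition to TermDictionaryFstTree and Postings.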

Aggregation and Sort Query

These operations require access to column-stored field data, typically used for sorting or aggregating numeric or keyword data.

  • Required Data Structures (per field):
    • ColumnStoredData
    • ColumnStoredSkip

Final Hit Collection

Once document hits are determined, their full rows are retrieved for result formation or output.

  • Required Data Structures (global — not per field):
    • RowStoredData
    • RowStoredSkip

Field-Specific Data Structures

Text Fields

Text fields require special structures for scoring and positional access.

  • Required Data Structures (per field):
    • Norms
    • Positions

Geo Type Fields

Geo queries (e.g., bounding box, radius search) rely on spatial indexes.

  • Required Data Structures (per field):
    • TermDictionaryBkd
    • TermDictionaryMetas
    • TermDictionaryBlocks
    • Postings

Range Queries

Range-capable fields (typically numeric or date types) use sketch-based approximations.

  • Required Data Structures (per field):
    • Sketch

Special Case: “Must Not” Term Optimization

For must_not queries, an optimization may bypass the index probe entirely and use a column lookup instead; the choice is based on a metric threshold. Because either path may be taken, the structures for both are listed.

  • Required Data Structures (per field):
    • From Index Probe:
      • TermDictionaryFstTree
      • Postings
    • From Aggregation/Sort:
      • ColumnStoredData
      • ColumnStoredSkip

Summary Table

Query Type / Operation     | Field-Specific | Required Data Structures
Index Probe                | Yes            | TermDictionaryFstTree, Postings
Term Query                 | Yes            | BloomFilter, TermDictionaryFstTree, Postings
Aggregation / Sort         | Yes            | ColumnStoredData, ColumnStoredSkip
Final Hit Collection       | No (global)    | RowStoredData, RowStoredSkip
Text Fields                | Yes            | Norms, Positions
Geo Fields                 | Yes            | TermDictionaryBkd, TermDictionaryMetas, TermDictionaryBlocks, Postings
Range Queries              | Yes            | Sketch
Must Not Term Optimization | Yes            | TermDictionaryFstTree, Postings, ColumnStoredData, ColumnStoredSkip

Notes

  • All data structures except RowStoredData and RowStoredSkip are field-specific.

  • Prefetching should be context-aware: only load structures relevant to the query type and fields involved.

  • This enables fine-grained resource management and query performance improvements.
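
The context-aware selection described in these notes can be sketched as a simple lookup. The structure names below come from this document; the operation keys (such as "term" or "aggregation_sort") and the plan_prefetch function are hypothetical names chosen for the illustration and are not part of any Mach5 API.

```python
# Per-field structures, keyed by a hypothetical operation label.
PER_FIELD_STRUCTURES = {
    "index_probe": ["TermDictionaryFstTree", "Postings"],
    "term": ["BloomFilter", "TermDictionaryFstTree", "Postings"],
    "aggregation_sort": ["ColumnStoredData", "ColumnStoredSkip"],
    "text": ["Norms", "Positions"],
    "geo": ["TermDictionaryBkd", "TermDictionaryMetas",
            "TermDictionaryBlocks", "Postings"],
    "range": ["Sketch"],
    "must_not_term": ["TermDictionaryFstTree", "Postings",
                      "ColumnStoredData", "ColumnStoredSkip"],
}

# Global structures are not tied to a field.
GLOBAL_STRUCTURES = {
    "final_hit_collection": ["RowStoredData", "RowStoredSkip"],
}


def plan_prefetch(operations):
    """Return a set of (field, structure) pairs to prefetch.

    `operations` is an iterable of (operation, field) tuples; the field is
    None for global operations. Global structures are returned with field None.
    """
    plan = set()
    for operation, field in operations:
        if operation in GLOBAL_STRUCTURES:
            plan.update((None, s) for s in GLOBAL_STRUCTURES[operation])
        elif operation in PER_FIELD_STRUCTURES:
            if field is None:
                raise ValueError(f"{operation} requires a field name")
            plan.update((field, s) for s in PER_FIELD_STRUCTURES[operation])
        else:
            raise ValueError(f"unknown operation: {operation}")
    return plan


# Example: a term query on a hypothetical "user" field, then hit collection.
plan = plan_prefetch([("term", "user"), ("final_hit_collection", None)])
```

Returning a set means a structure shared by several operations on the same field, such as Postings, appears only once in the plan.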


Query to Extract Data Structure Metadata

To inspect the data structure metadata for an index or field, use the queries below. Run them from a Notebook resource in the Mach5 admin UI.

Field-wise metadata

index_segment_metadata('index_name')
| summarize sum(component_length) by field_name
| render piechart

Summarize metadata by component type

index_segment_metadata('index_name')
| summarize sum(component_length) by component_type
| render piechart

Prefetch metadata

set ldop=16;
index_segment_field_prefetch("index-name", "field-name", "metadata")

Summarize metadata by segment

index_segment_metadata('index_name')
| summarize sum(component_length) by segment_name

Number of segments

index_segment_metadata('index_name')
| summarize sum(component_length) by segment_name
| summarize count()
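
For readers who want to reason about these summaries outside the Notebook, the sketch below reproduces the same groupings in Python. The column names (segment_name, field_name, component_type, component_length) are the ones used in the queries above; the sample rows, their values, and the helper names are made up for the illustration and do not reflect real index_segment_metadata output.

```python
from collections import defaultdict

# Hypothetical rows standing in for index_segment_metadata output.
ROWS = [
    {"segment_name": "seg_0", "field_name": "title",
     "component_type": "Postings", "component_length": 400},
    {"segment_name": "seg_0", "field_name": "title",
     "component_type": "Norms", "component_length": 100},
    {"segment_name": "seg_1", "field_name": "price",
     "component_type": "ColumnStoredData", "component_length": 250},
    {"segment_name": "seg_1", "field_name": "title",
     "component_type": "Postings", "component_length": 300},
]


def sum_length_by(rows, key):
    """Mirror of `summarize sum(component_length) by <key>`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["component_length"]
    return dict(totals)


def segment_count(rows):
    """Mirror of grouping by segment_name and then counting the groups."""
    return len(sum_length_by(rows, "segment_name"))


by_field = sum_length_by(ROWS, "field_name")
by_component = sum_length_by(ROWS, "component_type")
num_segments = segment_count(ROWS)
```

Comparing the per-field and per-component totals is one way to decide which structures are large enough to be worth prefetching selectively.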