Low-Latency Search on Apache Iceberg with Mach5

Zachary Heilbron

Mar 1, 2025


The Rise of Open Data Lake Formats

As data volumes continue to grow, open data lake formats have become a popular choice for large-scale data storage and analytics. These formats offer several advantages:

• Low-Cost Storage: Storing data in cloud object storage significantly reduces costs.

• Columnar Storage Format: Native support for columnar formats like Parquet enhances performance for analytical queries.

• Interoperability: Open standards and formats enable seamless data integration and sharing across an organization, making them ideal for maintaining a single source of truth.

Among the most prominent open data lake formats are Apache Iceberg, Apache Hudi, and Delta Lake. Apache Iceberg has gained significant traction and is supported by major data warehouses and cloud providers. AWS's recent introduction of Amazon S3 Tables, which provide built-in support for Iceberg, further solidifies its role in modern data ecosystems.

Is One Storage Format Enough?

While open data lake formats offer numerous benefits, they aren't a one-size-fits-all solution. Columnar storage formats are optimized for OLAP (Online Analytical Processing) workloads, but have limitations when it comes to:

• Transactional Workloads: These require row-based storage for efficient updates and writes. To support high transaction rates, the storage layer must be positioned higher in the memory hierarchy, as object storage has high I/O latency.

• Search Workloads: These involve retrieving specific records quickly, often for exploratory analysis or interactive queries that demand low-latency responses.

In this blog, we will dive deep into traditional search workloads on open data lake formats like Apache Iceberg, exploring why Iceberg struggles with search at scale and how integrating it with Mach5 can bridge the gap for real-time, high-performance search.

Understanding Search Workloads: Definition and Key Differences with Analytical Workloads

Search workloads are intended to quickly retrieve specific records from massive datasets with minimal latency. Unlike traditional analytical (OLAP) workloads, which focus on aggregating data and executing long-running batch processes, search workloads emphasize real-time, interactive querying.

Key Characteristics of Search Workloads:

• Ad-hoc Exploration: Users explore vast datasets, often without predefined query patterns.

• Complex Filtering: Queries often involve multiple filters and full-text predicates.

• Low-Latency Response: Users demand sub-second response times to support interactive analysis.

• Fuzzy Matching: Search systems must accommodate inexact queries (e.g., wildcard matching and synonym recognition).
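To make the fuzzy matching characteristic concrete, here is a minimal Python sketch of how a query term might be expanded against the set of indexed terms. The synonym table, the vocabulary, and the `expand_query` helper are all made up for illustration. It is not how any particular search engine implements this.

```python
from fnmatch import fnmatchcase

# Hypothetical synonym table; a real system would load this from configuration.
SYNONYMS = {"error": {"error", "failure", "fault"}}

def expand_query(term, vocabulary):
    """Return the set of indexed terms that a query term should match."""
    # A term with known synonyms expands to all of them; otherwise it stands alone.
    candidates = SYNONYMS.get(term, {term})
    matched = set()
    for pattern in candidates:
        # fnmatchcase treats '*' and '?' as wildcards; plain terms match themselves.
        matched.update(t for t in vocabulary if fnmatchcase(t, pattern))
    return matched

# Illustrative vocabulary of terms that appear in the indexed records.
vocabulary = {"error", "errors", "failure", "timeout", "fault", "login"}

print(sorted(expand_query("err*", vocabulary)))   # wildcard expansion
print(sorted(expand_query("error", vocabulary)))  # synonym expansion
```

The expanded term set would then be looked up in an index rather than matched against every record, which is what keeps this kind of query interactive.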

Differences from Analytical Workloads:

• Query Intent:
  • Search: Retrieves individual, precise records based on specific criteria.
  • Analytical: Joins, aggregates, and summarizes data to generate insights.

• Data Access Pattern:
  • Search: Utilizes indexing and caching to avoid scanning entire datasets.
  • Analytical: Usually involves full table scans or coarse-grained partition scans.

• Performance Priorities:
  • Search: Optimized for low latency and real-time interactivity.
  • Analytical: Optimized for high throughput and large-scale computations.

• Use Cases:
  • Search: Ideal for log analysis, security investigations, and real-time monitoring.
  • Analytical: Suited for generating summaries, trends, and deep insights from aggregated data.
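The difference in data access pattern can be illustrated with a small Python sketch. It contrasts a scan that inspects every record with a lookup against an inverted index, which maps each term to the records containing it. The sample records and function names are invented for this example, and tokenization is reduced to splitting on whitespace.

```python
from collections import defaultdict

# Made-up log records for illustration.
records = [
    {"id": 1, "message": "login failed for user alice"},
    {"id": 2, "message": "disk usage at 91 percent"},
    {"id": 3, "message": "login succeeded for user bob"},
]

def scan_search(records, term):
    """Scan-style access: inspect every record."""
    return [r["id"] for r in records if term in r["message"].split()]

def build_inverted_index(records):
    """Map each term to the sorted IDs of the records that contain it."""
    index = defaultdict(set)
    for r in records:
        for token in r["message"].split():
            index[token].add(r["id"])
    return {term: sorted(ids) for term, ids in index.items()}

def index_search(index, *terms):
    """Index-style access: intersect posting lists without reading other records."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

index = build_inverted_index(records)
print(scan_search(records, "login"))
print(index_search(index, "login", "failed"))
```

The scan does work proportional to the size of the dataset on every query, while the index lookup does work proportional to the number of matching records once the index has been built.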

Understanding these distinctions clarifies why open data lake formats—while excellent for analytical processing—often struggle with the speed and efficiency required for search workloads. This gap necessitates specialized solutions, like Mach5, that leverage advanced indexing and caching mechanisms to deliver real-time, low-latency search capabilities.

The Limitations of Open Data Lake Formats for Search

Although Apache Iceberg excels at large-scale analytical workloads, it presents challenges for search-intensive use cases, because queries that cannot be narrowed by partitions or file-level statistics fall back to scanning data files. Efficient search depends on partitioning, indexing, and caching working together. However, Iceberg's support for these features is limited:

• Partitioning:
  • Helps narrow searches to relevant data segments
  • Works well for predefined access patterns but lacks flexibility for ad-hoc queries
  • Supported by Iceberg, but not sufficient by itself to satisfy search workload requirements

• Indexing:
  • Iceberg relies on underlying data formats (Parquet, Avro, ORC) for indexing
  • No built-in support for full-text search (e.g., inverted indexes)

• Caching (engine-dependent):
  • No direct caching mechanism for object storage reads
  • Every query must access object storage, increasing read latency
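The partitioning limitation can be shown with a short Python sketch of file pruning. It models each data file with a partition value and a min/max range for one column, loosely resembling the per-file statistics that table metadata can carry. The `DataFile` class, the file names, and the column names are assumptions made for illustration, not Iceberg's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:
    path: str          # hypothetical file name
    event_date: str    # partition value
    min_status: int    # lower bound of the "status" column in this file
    max_status: int    # upper bound of the "status" column in this file

files = [
    DataFile("f1.parquet", "2025-02-27", 200, 404),
    DataFile("f2.parquet", "2025-02-28", 200, 503),
    DataFile("f3.parquet", "2025-03-01", 200, 500),
]

def prune(files, event_date=None, status=None):
    """Keep only the files that could contain matching rows."""
    kept = []
    for f in files:
        if event_date is not None and f.event_date != event_date:
            continue  # partition value rules this file out
        if status is not None and not (f.min_status <= status <= f.max_status):
            continue  # value lies outside this file's min/max range
        kept.append(f.path)
    return kept

print(prune(files, event_date="2025-03-01"))  # filter on the partition column
print(prune(files, status=500))               # filter helped by min/max ranges
print(prune(files))                           # e.g. a free-text predicate: nothing to prune on
```

A filter on the partition column narrows the search to one file, and a range check eliminates some files, but a predicate such as a keyword match inside a text column gives the pruning step nothing to work with, so every file has to be read.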

Bridging the Gap with Mach5

Apache Iceberg is a powerful format for managing large-scale data lakes, but it wasn't designed for real-time, low-latency search. This is where Mach5 excels, offering a specialized search-optimized layer that complements Iceberg's strengths. By integrating Mach5 with Iceberg, organizations can unlock high-performance search capabilities while maintaining cost-efficient storage and governance.

Why Mach5?

Mach5 is built specifically to address the challenges of search workloads on object storage, providing:

• Sub-Second Query Latency: Avoiding full table scans by leveraging advanced indexing and caching.

• Complex Filtering Support: Mach5's indexing supports complex filters and predicates, including fuzzy matching and full-text queries.

• Seamless Iceberg Integration: Mach5 maintains synchronized, search-optimized indexes of Iceberg data, ensuring real-time updates without duplicating storage costs.
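As a generic illustration of why caching matters for object storage reads, the following Python sketch places a small least-recently-used cache in front of a slow fetch function. It is not a description of Mach5's caching layer. The `ReadThroughCache` class, the in-memory stand-in for object storage, and the (file, block) keys are all hypothetical.

```python
from collections import OrderedDict

class ReadThroughCache:
    """A small LRU cache placed in front of a slow fetch function."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch
        self._capacity = capacity
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)      # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = self._fetch(key)                # the slow object storage read
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)   # evict the least recently used entry
        return value

# Stand-in for object storage; keys are hypothetical (file, block) pairs.
fake_store = {("f1.parquet", 0): b"block-a", ("f1.parquet", 1): b"block-b"}
cache = ReadThroughCache(fake_store.__getitem__, capacity=2)

cache.get(("f1.parquet", 0))   # miss: fetched from the store
cache.get(("f1.parquet", 0))   # hit: served from memory
cache.get(("f1.parquet", 1))   # miss: fetched from the store
print(cache.misses)
```

Repeated reads of the same block are served from memory, so only the first access pays the object storage round trip.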

How Mach5 Enhances Search Performance

Apache Iceberg tracks changes over time, making it simple to identify which data files do not have an up-to-date index. Mach5 uses this capability to efficiently index newly arrived changes (appends, deletes, and updates), ensuring that search queries reflect the most recent data state. Here's how the process works:

• Data is stored in the Iceberg table as the primary source of truth.

• Mach5 periodically identifies which data files do not have up-to-date indexes.

• Mach5 builds indexes on top of these data files and transactionally adds these indexes back as metadata in the Iceberg table.

• The transactionally consistent indexes can now be used to perform low-latency searches without scanning the entire dataset.
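The steps above can be sketched in Python as a simple reconciliation loop. This is a simplified illustration under stated assumptions, not Mach5's implementation. The index catalog is a plain dictionary standing in for index metadata, the snapshot is a list of hypothetical file names, and `build_index` is a placeholder for the real index builder. The transactional commit back into the Iceberg table is not modeled.

```python
def files_needing_index(current_data_files, indexed_files):
    """Return data files in the current snapshot that have no index yet."""
    return sorted(set(current_data_files) - set(indexed_files))

def stale_indexes(current_data_files, indexed_files):
    """Return indexed files that are no longer part of the table."""
    return sorted(set(indexed_files) - set(current_data_files))

def sync(current_data_files, index_catalog, build_index):
    """Bring index_catalog (file path -> index) in line with the snapshot."""
    for path in stale_indexes(current_data_files, index_catalog):
        del index_catalog[path]                  # file was removed or rewritten
    for path in files_needing_index(current_data_files, index_catalog):
        index_catalog[path] = build_index(path)  # index only the new files
    return index_catalog

# Hypothetical state: two files are indexed, then the table changes.
catalog = {"a.parquet": "idx(a.parquet)", "b.parquet": "idx(b.parquet)"}
snapshot = ["b.parquet", "c.parquet"]

sync(snapshot, catalog, build_index=lambda p: f"idx({p})")
print(catalog)
```

Because only the difference between the current snapshot and the indexed set is processed, the cost of each sync is tied to how much data changed rather than to the total size of the table.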

Unlocking the Best of Both Worlds

By integrating Apache Iceberg with Mach5, organizations can create a hybrid architecture that leverages the cost-effective, scalable storage of Iceberg alongside Mach5's real-time, low-latency search capabilities. This integrated solution offers several key advantages:

• Scalability & Cost Efficiency:
  • Iceberg provides a robust, cost-effective data storage layer.
  • Mach5 enhances performance with advanced indexing and caching, without duplicating storage costs.

• Real-Time Search & Analytics:
  • Enables fast, interactive queries for use cases such as log analysis, security investigations, and real-time monitoring.
  • Delivers sub-second response times for immediate insights and decision-making.

• Data Consistency & Governance:
  • Maintains Iceberg as the single source of truth while synchronizing changes with Mach5's search indexes.
  • Ensures robust data governance and compliance across the ecosystem.

• Flexibility for Diverse Workloads:
  • Empowers teams to choose the right tool for each task, using Iceberg for large-scale analytical processing and Mach5 for specialized search workloads.
  • Combines the strengths of both platforms, ensuring optimal performance without compromise.

This hybrid approach unlocks new possibilities by marrying scalable data lake storage with high-performance search, enabling organizations to optimize their data operations and meet diverse workload requirements seamlessly.

Ready to supercharge your data search? Learn more about how Mach5 can optimize your Iceberg-based infrastructure today.

Why can’t Apache Iceberg alone deliver low-latency search on large datasets?

Iceberg is optimized for analytical scans, not interactive search. Because queries read Parquet or other columnar files directly from object storage, even simple filters can trigger full or partial table scans. This leads to high latency and makes real-time filtering, fuzzy search, or investigative workflows slow and expensive at scale.

How can I create search workloads using Iceberg tables?

Iceberg does not natively provide search-optimized indexes, term-level lookups, or sub-second predicate filtering. To power search workloads on Iceberg tables, you need an external engine, like Mach5, that builds lightweight, incremental indexes over those tables. That lets you run low-latency search without duplicating data or scanning entire Parquet files.

What limitations do I face when running search workloads directly on Iceberg?

Search queries on raw Iceberg tables translate into object-storage reads with relatively high latency and no block-level random access. Iceberg also lacks inverted indexes, full-text search, fuzzy matching, and dedicated caching layers. As a result, common search patterns degrade into slow file scans, timeouts, and a poor user experience for interactive applications.

How does Mach5 enable low-latency search on Iceberg data?

Mach5 builds a search-optimized index layer that stays automatically synchronized with your Iceberg tables. Iceberg remains the single source of truth, while search queries hit compact, accelerated indexes backed by object-store-aware caching. This bypasses slow Parquet scans and delivers sub-second search across massive datasets.

What new use cases become possible when Iceberg and Mach5 are combined?

With Iceberg and Mach5 together, you can support real-time log analysis, security investigations, observability dashboards, interactive filtering, and operational analytics on the same underlying data. Mach5’s indexing engine turns Iceberg into a platform that handles both large-scale analytical workloads and low-latency search without sacrificing cost efficiency or cloud-native simplicity.
