
Document Views: Eliminating Write Amplification in Search

Mar 31, 2026
15 min read

The Problem: Relational Data Doesn’t Fit Neatly Into Documents

Search platforms like OpenSearch and Elasticsearch are built around a simple abstraction: the document. You define a schema, index JSON documents, and query them. It works beautifully, until the data you need to search doesn’t naturally exist as pre-assembled documents.

Consider data that lives in relational databases. Information is normalized across tables, connected through primary keys, foreign keys, and junction tables. This is good relational design, but it creates a fundamental tension with document-oriented search.

Take a concrete example from identity governance. An organization stores Identities in one table, Roles in another, and Entitlements in a third. An identity is assigned roles. Each role carries a set of entitlements, fine-grained permissions like “read files matching *.pdf” or “write to /etc/config.” A security analyst needs to answer a straightforward question: Which identities have the ability to read PDF files?

The answer spans all three tables. But a search platform doesn’t understand table joins. It understands documents.
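To make the join concrete, here is a small sketch of the three normalized tables as in-memory Python data (hypothetical records, not any real schema), with the analyst's question answered by walking the foreign keys:

```python
# Hypothetical normalized tables: identities reference roles, roles
# reference entitlements, mirroring the relational design described above.

identities = [
    {"id": "i1", "name": "Alice", "role_ids": ["r1"]},
    {"id": "i2", "name": "Bob", "role_ids": ["r2"]},
]
roles = [
    {"id": "r1", "name": "Analyst", "entitlement_ids": ["e1", "e2"]},
    {"id": "r2", "name": "Operator", "entitlement_ids": ["e2"]},
]
entitlements = [
    {"id": "e1", "action": "read", "pattern": "*.pdf"},
    {"id": "e2", "action": "write", "pattern": "/etc/config"},
]

def identities_with(action, pattern):
    """Which identities hold an entitlement matching (action, pattern)?"""
    ent_ids = {e["id"] for e in entitlements
               if e["action"] == action and e["pattern"] == pattern}
    role_ids = {r["id"] for r in roles
                if ent_ids & set(r["entitlement_ids"])}
    return [i["name"] for i in identities
            if role_ids & set(i["role_ids"])]

print(identities_with("read", "*.pdf"))  # only Alice's role grants PDF read
```

In SQL this is two joins; in a document store, there is no equivalent operation to express it directly.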

[Figure: Relational tables denormalized into search documents]

The Conventional Approach

To make this query possible, the conventional approach requires a data pipeline that enriches each identity with its roles and each role with its entitlements, assembles the result into a fully denormalized JSON document, and indexes it into the search platform. Every identity becomes a self-contained document carrying the complete tree of roles and entitlements beneath it.

This works, right up until something changes.

The Write Amplification Problem

Suppose an administrator updates the entitlements on a single role, say, revoking PDF read access from the “Analyst” role. That role might be assigned to 5,000 identities. Suddenly, a one-row change in the entitlements table cascades into 5,000 document re-computations and 5,000 re-indexing operations. The search platform churns through thousands of documents, most of which are identical except for one nested field.

This is write amplification: the cost of keeping the index current is proportional not to the size of the actual change, but to the fan-out of relationships in the data model. And the deeper or wider the relationship graph, the worse it gets.
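The fan-out arithmetic is simple to state as code (numbers match the example above; this is a back-of-the-envelope sketch, not a measurement):

```python
# One row changes in the entitlements table, but every identity holding
# the affected role must be re-enriched and re-indexed.

assignments_per_role = {"Analyst": 5000}  # identities assigned per role

def reindex_cost(changed_role):
    """Documents to recompute when one role's entitlements change."""
    return assignments_per_role[changed_role]

print(reindex_cost("Analyst"))  # 5000 re-index operations for a 1-row change
```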

The operational cost is significant. Ingestion pipelines have to track dependencies between source tables and target documents. Reindexing jobs consume cluster resources (CPU, memory, I/O, and network) far out of proportion to the change that triggered them. During reindexing, queries may return stale results. And the entire pipeline adds latency between a change at the source and its visibility in search.

For datasets with high fan-out relationships (identities and roles, users and permissions, products and categories, orders and line items) this becomes an architecture-defining constraint. Teams build elaborate change-detection pipelines, batching strategies, and staleness budgets, all to work around a fundamental mismatch between how the data is structured and how the search platform expects to consume it.

Introducing Document Views

Mach5 Search introduces Document Views, a feature designed to eliminate write amplification entirely by rethinking how relational data is represented in a search platform.

A Document View defines relationships between indexes declaratively. Rather than pre-materializing every possible document by joining data at ingest time, Document Views express the shape of a document as a view over base indexes, much like a SQL view over base tables.

From the outside, a Document View looks and behaves exactly like a regular index. You query it using the full OpenSearch DSL that Mach5 supports. You get back documents that appear as if they were fully materialized and indexed. But behind the scenes, no denormalized document is ever written. Each base dataset (identities, roles, entitlements) lives in its own index, updated independently and on its own schedule.
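For illustration, a query against such a view is an ordinary OpenSearch-style request. The view name (`identity_view`) and field paths below are assumptions for this example, not Mach5-specific syntax:

```python
# A standard OpenSearch-style bool query, expressed as a Python dict.
# From the client's perspective, "identity_view" is just another index.

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"roles.entitlements.action": "read"}},
                {"match": {"roles.entitlements.pattern": "*.pdf"}},
            ]
        }
    }
}
# e.g. client.search(index="identity_view", body=query) with any
# OpenSearch-compatible client; no denormalized index backs the view.
```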

When an entitlement changes, you update one document in the entitlements index. That’s it. No cascade. No fan-out. No pipeline to re-enrich and re-index thousands of downstream documents. The next query against the Document View automatically reflects the change because it always reads from the current state of the base indexes.

How Document Views Work

A Document View starts with a definition that describes the relationships between base indexes. This definition specifies which indexes participate, how they relate to each other (analogous to join conditions), and the shape of the resulting document.
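As a sketch of the information such a definition must carry (this is NOT the Mach5 syntax, just a hypothetical representation of the same three pieces):

```python
# Hypothetical view definition: participating indexes, the join-like
# relationships between them, and the nesting shape of result documents.

identity_view = {
    "base_indexes": ["identities", "roles", "entitlements"],
    "relationships": [
        {"from": "identities.role_ids", "to": "roles.id"},           # identity -> role
        {"from": "roles.entitlement_ids", "to": "entitlements.id"},  # role -> entitlement
    ],
    "shape": {  # identities at the root, roles nested, entitlements beneath
        "root": "identities",
        "nest": {"roles": {"nest": {"entitlements": {}}}},
    },
}
```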

At query time, the Mach5 optimizer takes an incoming search request against the Document View and transforms it into an efficient execution plan over the base indexes. The optimizer performs several key operations during this process.

Relationship pruning. Not every query touches every relationship. If a query only filters on identity attributes and never references entitlements, the optimizer eliminates the entitlement join entirely. This is a significant optimization. It means that the cost of a query is proportional to the relationships it actually uses, not the total complexity of the view definition.

Predicate pushdown. Filters are pushed as deep as possible into the base index queries. If you’re searching for identities with a specific entitlement pattern, the optimizer first narrows the entitlement index, then uses those results to constrain which roles are relevant, and finally identifies the matching identities, rather than assembling every possible document and filtering afterward.

Result stitching. Once the relevant data is retrieved from each base index, the optimizer assembles the final document on the fly, presenting it as if it had been materialized all along.
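The three steps above can be sketched end to end over in-memory stand-ins for the base indexes. This is purely illustrative; the real optimizer plans against actual indexes, but the order of operations is the point:

```python
# Query-time resolution: push the entitlement filter down first, prune
# to the relevant roles, then stitch matching identities into documents.

identities = [{"id": "i1", "name": "Alice", "role_ids": ["r1"]},
              {"id": "i2", "name": "Bob", "role_ids": ["r2"]}]
roles = [{"id": "r1", "name": "Analyst", "entitlement_ids": ["e1"]},
         {"id": "r2", "name": "Operator", "entitlement_ids": ["e2"]}]
entitlements = [{"id": "e1", "action": "read", "pattern": "*.pdf"},
                {"id": "e2", "action": "write", "pattern": "/etc/config"}]

def query_view(action, pattern):
    # 1. Predicate pushdown: narrow the entitlement index first.
    ents = [e for e in entitlements
            if e["action"] == action and e["pattern"] == pattern]
    ent_ids = {e["id"] for e in ents}
    # 2. Pruning: keep only roles referencing a surviving entitlement.
    hit_roles = [r for r in roles if ent_ids & set(r["entitlement_ids"])]
    role_ids = {r["id"] for r in hit_roles}
    # 3. Stitching: assemble result documents as if pre-materialized.
    return [{"name": i["name"],
             "roles": [{"name": r["name"], "entitlements": ents}
                       for r in hit_roles if r["id"] in set(i["role_ids"])]}
            for i in identities if role_ids & set(i["role_ids"])]

print(query_view("read", "*.pdf"))  # one stitched document, for Alice
```

Note that the identity index is only consulted after the entitlement and role filters have shrunk the candidate set, which is exactly the pushdown behavior described above.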

[Figure: How a query is processed through a Document View]

There is a query-time cost to this approach. Assembling documents dynamically is inherently more work than reading a pre-built document from an index. But in practice, the optimizer’s aggressive pruning and pushdown strategies keep this cost manageable, and for many workloads, the elimination of write amplification and ingestion complexity more than compensates for the marginal increase in query latency.

Why This Matters

Write amplification isn’t just a performance problem. It’s an architectural tax that shapes how teams build and operate search-powered systems.

Cost scales with change, not with data. In a pre-materialized world, the cost of running a search cluster is driven by how frequently relationships change and how wide the fan-out is, not by the volume of actual data. Document Views decouple these concerns, so cost scales with the data you store and the queries you run, not with the churn of re-indexing.

Freshness without complexity. Keeping denormalized documents current requires change-data-capture pipelines, dependency tracking, and careful orchestration. With Document Views, each base index is updated independently. There is no pipeline to maintain, no staleness window, and no partial-update inconsistency.

Simpler data architecture. Teams no longer need to design their ingestion around the limitations of the search platform. Data can be indexed in its natural, normalized form. New relationships can be expressed as new views without rearchitecting the ingestion pipeline.

Agility in schema evolution. Adding a new relationship, say, linking identities to a new “certifications” index, is a view definition change. In a pre-materialized model, it would require modifying the enrichment pipeline, backfilling every existing document, and re-indexing the entire corpus.

Example Use Cases

Document Views are a general-purpose mechanism, but they shine in scenarios where relational data has high fan-out and frequent updates.

Identity Governance and Access Management. The motivating example above: identities, roles, entitlements, and access policies span multiple data sources and change frequently. Security analysts need to query across these relationships in real time (“Show me all identities with write access to production databases”) without waiting for a pipeline to catch up.

E-Commerce Product Catalogs. Products relate to categories, suppliers, pricing tiers, inventory levels, and reviews. A pricing change from a supplier shouldn’t require re-indexing every product they provide. A Document View over products, pricing, and inventory allows queries like “Find all in-stock products under $50 in the electronics category” without denormalization.

IT Asset Management and CMDB. Configuration items, their relationships, ownership, and compliance status live across multiple systems. When a compliance policy is updated, Document Views eliminate the need to re-process every asset that falls under that policy.

Security Observability. Network connections, process trees, file access events, and user sessions each have their own index. Analysts query across them (“Show me all processes launched by this user that opened network connections to external IPs”) and Document Views stitch the results together from the underlying event indexes.

Healthcare and Clinical Data. Patients, encounters, diagnoses, medications, and lab results are naturally normalized. Clinicians and researchers need to query across these relationships without the overhead and latency of maintaining a fully denormalized patient record in the search index.

Conclusion

The impedance mismatch between relational data and document-oriented search has been a persistent architectural challenge, one that teams have historically solved with brute force: materialize everything, re-index on every change, and absorb the cost of write amplification.

Document Views in Mach5 Search offer a fundamentally different approach. By expressing relationships declaratively and resolving them at query time, Document Views eliminate the write amplification problem, simplify data architecture, and keep search results fresh without the complexity of enrichment pipelines.

The result is a search platform that handles relational data natively, letting you index data in its natural form and query it as if every document had been fully assembled. Less infrastructure. Lower cost. Simpler operations. And search results that always reflect the current state of your data.
