Synthetic Data Generation

Synthetic Data Generation creates deterministic test, demo, and benchmark data directly in Mach5. You define a generation plan, validate it, and execute it; Mach5 then creates one or more output indexes populated with generated documents.

Use synthetic generation when you need:

repeatable demo datasets;
benchmark data with controlled size and shape;
correlated indexes such as customers, orders, order items, and payments;
realistic-looking values without importing external files;
parent/child and many-to-many relationships that are valid by construction.

A plan with the same seed and the same index definitions always produces the same data, so a synthetic dataset can be recreated exactly in any environment.

Basic workflow

1. Create a GenerationPlan resource in a namespace.
2. Validate the plan.
3. Execute the plan.
4. Watch status until the plan completes.
5. Query the generated indexes.

REST API

Generation plans are namespace-scoped resources.

GET    /apis/namespaces/{namespace}/generation_plans
GET    /apis/namespaces/{namespace}/generation_plans/{name}
PUT    /apis/namespaces/{namespace}/generation_plans/{name}
PATCH  /apis/namespaces/{namespace}/generation_plans/{name}
DELETE /apis/namespaces/{namespace}/generation_plans/{name}

POST   /apis/namespaces/{namespace}/generation_plans/{name}/_validate
POST   /apis/namespaces/{namespace}/generation_plans/{name}/_execute
POST   /apis/namespaces/{namespace}/generation_plans/{name}/_cancel
GET    /apis/namespaces/{namespace}/generation_plans/{name}/status
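
The endpoint paths above can also be assembled in a script. The following is a minimal Python sketch, using only the standard library, that builds requests for these endpoints without sending them. The helper names `plan_url` and `build_request` are illustrative and not part of Mach5, the host is the placeholder used in this page's curl examples, and authentication is left out because this page does not describe it.

```python
import json
import urllib.request

BASE_URL = "https://mach5.example.com"  # placeholder host, not a real endpoint


def plan_url(namespace, name, action=None):
    """Return the URL of a generation plan, optionally with a sub-resource."""
    url = f"{BASE_URL}/apis/namespaces/{namespace}/generation_plans/{name}"
    return f"{url}/{action}" if action else url


def build_request(method, namespace, name, action=None, body=None):
    """Build (but do not send) a request for one of the plan endpoints."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        plan_url(namespace, name, action), data=data, method=method
    )
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


plan_body = {
    "seed": 42,
    "plan": {
        "indexes": [
            {
                "name": "customers",
                "row_count": 100,
                "fields": {"customer_id": {"type": "sequence", "format": "CUST-{:04}"}},
            }
        ]
    },
}
create = build_request("PUT", "default", "demo_orders", body=plan_body)
validate = build_request("POST", "default", "demo_orders", "_validate")
status = build_request("GET", "default", "demo_orders", "status")
```

Passing any of these objects to `urllib.request.urlopen` would send the request.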

Create a plan:

curl -X PUT \
  "https://mach5.example.com/apis/namespaces/default/generation_plans/demo_orders" \
  -H "Content-Type: application/json" \
  -d '{
    "seed": 42,
    "description": "Small order demo dataset",
    "plan": {
      "indexes": [
        {
          "name": "customers",
          "row_count": 100,
          "fields": {
            "customer_id": { "type": "sequence", "format": "CUST-{:04}" },
            "name": { "type": "fake", "category": "person.name" },
            "email": { "type": "fake", "category": "internet.email" }
          }
        }
      ]
    }
  }'

Validate the plan:

curl -X POST \
  "https://mach5.example.com/apis/namespaces/default/generation_plans/demo_orders/_validate"

Execute the plan:

curl -X POST \
  "https://mach5.example.com/apis/namespaces/default/generation_plans/demo_orders/_execute"

Check status:

curl \
  "https://mach5.example.com/apis/namespaces/default/generation_plans/demo_orders/status"

Cancel a running plan:

curl -X POST \
  "https://mach5.example.com/apis/namespaces/default/generation_plans/demo_orders/_cancel"

GenerationPlan resource

A generation plan resource has a deterministic seed, an optional description, and a plan specification.

{
  "seed": 42,
  "description": "E-commerce demo data",
  "plan": {
    "indexes": []
  }
}

Field	Type	Description
`seed`	unsigned integer	Global deterministic seed. Change the seed to produce a different but repeatable dataset.
`description`	string	Optional human-readable description.
`plan`	object	Data generation specification.

The status response also includes lifecycle fields such as state, run_generation, last_observed_generation, and last_execution.

Common states:

State	Description
`idle`	Plan is saved and not currently running.
`running`	Plan execution is in progress.
`completed`	Plan execution completed successfully.
`failed`	Plan execution failed.
`cancelling`	Cancellation has been requested.
`cancelled`	Plan execution was cancelled.
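
A client that watches a plan typically polls the status endpoint until one of the final states in the table appears. The sketch below is one way to write that loop in Python. It assumes `completed`, `failed`, and `cancelled` are final and that `idle` is not (the table does not say so explicitly), and it takes the status fetcher as a parameter so the loop can be exercised without a server. The name `wait_for_plan` is hypothetical.

```python
import time

# Assumed final states, taken from the state table above.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_plan(fetch_status, poll_seconds=5.0, max_polls=720, sleep=time.sleep):
    """Poll until the plan reaches a final state and return the last status.

    fetch_status is any callable that returns the decoded status JSON as a dict.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status["state"] in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("plan did not reach a final state in time")


# Canned responses stand in for real GET .../status calls.
responses = iter([{"state": "running"}, {"state": "cancelling"}, {"state": "cancelled"}])
final = wait_for_plan(lambda: next(responses), sleep=lambda seconds: None)
```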

Plan structure

The plan object contains optional execution controls, optional entity metadata, and a required indexes array.

{
  "max_parallel_workers": 4,
  "partitions": 8,
  "workflow_timeout_seconds": 7200,
  "entities": {},
  "indexes": []
}

Field	Type	Description
`max_parallel_workers`	integer	Default maximum concurrent workers per index. Must be greater than `0`.
`partitions`	integer	Default partition count for fixed-row and relationship planning. Must be greater than `0`.
`workflow_timeout_seconds`	integer	Default timeout for generation work. Must be greater than `0`.
`entities`	object	Optional stateless entity catalog used by stateless foreign keys and `entity_field`.
`indexes`	array	Output index specifications. Required and non-empty.

Index-level max_parallel_workers, partitions, and workflow_timeout_seconds override plan-level values.
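
The override rule can be pictured as a per-key lookup that prefers the index value. This Python sketch is only an illustration of that rule. The function name `effective_settings` is made up, and keys that are set at neither level are simply left out because this page does not state the built-in defaults.

```python
EXECUTION_KEYS = ("max_parallel_workers", "partitions", "workflow_timeout_seconds")


def effective_settings(plan, index):
    """Return the execution controls for one index; index values win over plan values."""
    settings = {}
    for key in EXECUTION_KEYS:
        if key in index:
            settings[key] = index[key]
        elif key in plan:
            settings[key] = plan[key]
    return settings


plan = {"max_parallel_workers": 4, "partitions": 8, "workflow_timeout_seconds": 7200}
orders = {"name": "orders", "partitions": 32}
resolved = effective_settings(plan, orders)
```

Here `orders` keeps the plan-level worker count and timeout but uses its own partition count of 32.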

Index specification

Each index entry defines one generated output index.

{
  "name": "orders",
  "depends_on": ["customers"],
  "row_count": 100000,
  "mapping": {
    "properties": {
      "order_id": { "type": "keyword" },
      "customer_id": { "type": "keyword" },
      "amount": { "type": "double" }
    }
  },
  "fields": {
    "order_id": { "type": "sequence", "format": "ORD-{:08}" },
    "customer_id": { "type": "foreign_key", "source": "customers.customer_id" },
    "amount": { "type": "log_normal", "mean": 4.0, "stddev": 1.0, "min": 1.0 }
  }
}

Field	Type	Description
`name`	string	Output index name. Must be unique within the plan.
`depends_on`	string array	Upstream indexes that must be generated before this index.
`row_count`	integer	Fixed number of rows to generate.
`per_parent`	object	Generate child rows for every row in a parent index.
`relationship`	object	Generate a relationship or junction index between upstream indexes.
`mapping`	object	OpenSearch-compatible mapping used when creating the output index. Defaults to an empty mapping.
`fields`	object	Field name to field generator specification.
`transform`	object	Optional JavaScript or VRL transform applied after field generation.
`max_parallel_workers`	integer	Index-level worker override.
`partitions`	integer	Index-level partition override.
`workflow_timeout_seconds`	integer	Index-level timeout override.

Exactly one row strategy is required for each index:

row_count
per_parent
relationship
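
A quick pre-flight check for this rule is easy to script before calling `_validate`. The sketch below is a hypothetical client-side helper, not the Mach5 validator.

```python
ROW_STRATEGIES = ("row_count", "per_parent", "relationship")


def declared_strategies(index):
    """Return the row strategy keys present in an index specification."""
    return [key for key in ROW_STRATEGIES if key in index]


def has_one_strategy(index):
    """True when the index declares exactly one row strategy."""
    return len(declared_strategies(index)) == 1


good = {"name": "customers", "row_count": 1000, "fields": {}}
ambiguous = {"name": "users", "row_count": 10, "per_parent": {"parent": "tenants"}}
missing = {"name": "empty", "fields": {}}
```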

Row strategies

Fixed row count

Use row_count when an index should contain a fixed, known number of rows.

{
  "name": "customers",
  "row_count": 1000,
  "fields": {
    "customer_id": { "type": "sequence", "format": "CUST-{:06}" },
    "name": { "type": "fake", "category": "person.name" }
  }
}

The example creates 1,000 customer documents. customer_id values increase monotonically with the row index, and name values look realistic; both are deterministic for a given seed.

Per-parent rows

Use per_parent when a child index should generate rows for every row in a parent index.

{
  "name": "users",
  "depends_on": ["tenants"],
  "per_parent": {
    "parent": "tenants",
    "count": { "type": "uniform_int", "min": 10, "max": 50 }
  },
  "fields": {
    "tenant_id": { "type": "parent_key" },
    "user_id": { "type": "sequence", "format": "USER-{:08}" },
    "email": { "type": "fake", "category": "internet.email" }
  }
}

The parent index must appear in depends_on. The count generator decides how many child rows to create for each parent row. A child field of type parent_key receives the value of the parent field with the same name.

In this example, each tenant receives between 10 and 50 users, and tenant_id is copied from the parent tenant row.
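
Because the child count is random per parent, the size of a per_parent index is a range rather than a fixed number. The arithmetic can be sketched as follows for a `uniform_int` count, whose average is the midpoint of its inclusive bounds. The parent size of 100 tenants is a hypothetical figure chosen for illustration, and the function name is not part of Mach5.

```python
def per_parent_row_bounds(parent_rows, count_min, count_max):
    """Return (minimum, expected, maximum) child rows for a uniform_int count."""
    expected = parent_rows * (count_min + count_max) / 2
    return parent_rows * count_min, expected, parent_rows * count_max


# 100 tenants, each receiving between 10 and 50 users.
low, expected, high = per_parent_row_bounds(100, 10, 50)
```

With these inputs the users index holds between 1,000 and 5,000 rows, about 3,000 on average.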

Junction relationships

Use relationship when an index should pair keys from two upstream indexes, such as students and courses, users and groups, or products and campaigns.

{
  "name": "student_courses",
  "depends_on": ["students", "courses"],
  "relationship": {
    "type": "junction",
    "left": { "index": "students", "key": "student_id" },
    "right": { "index": "courses", "key": "course_id" },
    "left_cardinality": { "type": "uniform_int", "min": 3, "max": 5 },
    "right_max": 250,
    "unique": true
  },
  "fields": {
    "student_id": { "type": "parent_key" },
    "course_id": { "type": "parent_key" },
    "enrolled_at": {
      "type": "timestamp_range",
      "start": "2026-01-01T00:00:00Z",
      "end": "2026-03-31T23:59:59Z"
    }
  }
}

Relationship field	Description
`type`	Must be `junction`.
`left.index`	Left source index name.
`left.key`	Key field from the left source index.
`right.index`	Right source index name.
`right.key`	Key field from the right source index.
`left_cardinality`	Generator that controls how many right-side keys are assigned to each left-side row.
`right_max`	Maximum number of times each right-side row may be used.
`unique`	When true, prevents duplicate left/right pairs for the same left row.

Both source indexes must appear in depends_on.
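
The cardinality settings interact: the left side requests pairs, and right_max limits how many the right side can absorb. The sketch below is a client-side sanity check based only on counting. How Mach5 behaves when the limits cannot be met is not described on this page, and the row counts (1,000 students, 100 courses) are hypothetical.

```python
def junction_capacity(left_rows, right_rows, card_min, card_max, right_max, unique=True):
    """Compare the pairs a junction may request with what the right side can absorb."""
    capacity = right_rows * right_max
    return {
        "min_pairs": left_rows * card_min,
        "max_pairs": left_rows * card_max,
        "right_capacity": capacity,
        # Even the largest possible request fits under right_max.
        "always_fits": left_rows * card_max <= capacity,
        # A left row cannot get more distinct right keys than exist.
        "unique_possible": (not unique) or card_max <= right_rows,
    }


report = junction_capacity(1000, 100, card_min=3, card_max=5, right_max=250)
```

With the settings from the example above, the junction requests between 3,000 and 5,000 pairs against a right-side capacity of 25,000.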

Field specification

Every generated field has a generator type and optional cross-cutting properties.

"fields": {
  "service": {
    "type": "weighted_choice",
    "values": ["api", "worker", "scheduler"],
    "weights": [0.6, 0.3, 0.1],
    "optional": 0.05
  }
}

Cross-cutting field properties

Property	Type	Description
`type`	string	Generator type. Required.
`optional`	float	Probability from `0.0` to `1.0` that the field is omitted. Defaults to `0.0`.
`cardinality`	generator	When set, the field emits an array. The cardinality generator decides array length.
`unique`	boolean	When the field emits an array, attempts to avoid duplicate elements. Defaults to true.

Optional fields

"middle_name": {
  "type": "fake",
  "category": "person.name",
  "optional": 0.7
}

This field is omitted from about 70% of generated documents.

Array fields with cardinality

"tags": {
  "type": "weighted_choice",
  "values": ["prod", "dev", "pci", "public", "internal"],
  "weights": [0.3, 0.2, 0.1, 0.2, 0.2],
  "cardinality": { "type": "uniform_int", "min": 1, "max": 3 },
  "unique": true
}

This field emits an array with 1 to 3 unique tag values.
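
To make the interaction of optional, cardinality, and unique concrete, here is a small Python simulation. It is an assumption-laden sketch: Mach5's random number generator and its exact de-duplication strategy are not documented here, so the output will not match Mach5's, and `None` merely stands in for an omitted field.

```python
import random


def sample_field(rng, values, weights, optional=0.0, cardinality=None, unique=True):
    """Return a value, a list of values, or None when the field is omitted."""
    if rng.random() < optional:
        return None  # stands in for "field omitted from the document"
    if cardinality is None:
        return rng.choices(values, weights=weights, k=1)[0]
    length = rng.randint(*cardinality)  # inclusive bounds, like uniform_int
    picked = []
    attempts = 0
    # Bounded retries: duplicates are re-drawn rather than kept.
    while len(picked) < length and attempts < 100 * length:
        candidate = rng.choices(values, weights=weights, k=1)[0]
        attempts += 1
        if unique and candidate in picked:
            continue
        picked.append(candidate)
    return picked


VALUES = ["prod", "dev", "pci", "public", "internal"]
WEIGHTS = [0.3, 0.2, 0.1, 0.2, 0.2]
tags = sample_field(random.Random(42), VALUES, WEIGHTS, cardinality=(1, 3))
```

Seeding the generator makes the result repeatable, which mirrors how a plan seed makes Mach5 output repeatable.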

Field generators

sequence

Generates a deterministic sequence based on the row index.

{ "type": "sequence" }

Parameter	Type	Description
`start`	integer	First sequence value. Defaults to `1`.
`format`	string	Optional format string. Supports `{}` and zero-padded forms such as `{:06}` or `{:06d}`.

Examples:

"id": { "type": "sequence" }

Produces numeric values such as 1, 2, 3.

"customer_id": { "type": "sequence", "start": 1000, "format": "CUST-{:06}" }

Produces values such as CUST-001000, CUST-001001, CUST-001002.
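
The documented format forms happen to be valid Python format specifications, so the mapping from row to value can be reproduced in a few lines. This sketch assumes a zero-based row index whose first row receives `start`; it illustrates the documented output and is not Mach5's implementation.

```python
def sequence_value(row_index, start=1, fmt=None):
    """Return the sequence value for a zero-based row index."""
    number = start + row_index
    return number if fmt is None else fmt.format(number)


ids = [sequence_value(i, start=1000, fmt="CUST-{:06}") for i in range(3)]
plain = [sequence_value(i) for i in range(3)]
```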

uuid

Generates a deterministic UUID-like string.

"event_id": { "type": "uuid" }

Use uuid for identifiers that should look globally unique but remain repeatable for the same seed and row.

constant

Returns the same JSON value for every row.

"environment": { "type": "constant", "value": "prod" }

value may be a string, number, boolean, object, array, or null.

uniform_int

Generates an integer uniformly distributed between min and max, inclusive.

"status": { "type": "uniform_int", "min": 200, "max": 599 }

Parameter	Type	Description
`min`	integer	Minimum value.
`max`	integer	Maximum value.

uniform_float

Generates a floating-point value uniformly distributed between min and max.

"cpu": { "type": "uniform_float", "min": 0.0, "max": 100.0 }

Parameter	Type	Description
`min`	number	Minimum value.
`max`	number	Maximum value.

normal

Generates a normally distributed floating-point value.

"latency_ms": { "type": "normal", "mean": 120.0, "stddev": 25.0, "min": 0.0, "max": 1000.0 }

Parameter	Type	Description
`mean`	number	Average value.
`stddev`	number	Standard deviation.
`min`	number	Optional lower clamp.
`max`	number	Optional upper clamp.

Use normal for values clustered around an average, such as latency, age, or utilization.

log_normal

Generates a log-normal floating-point value.

"order_amount": { "type": "log_normal", "mean": 4.0, "stddev": 1.0, "min": 1.0 }

Parameter	Type	Description
`mean`	number	Mean of the underlying normal distribution.
`stddev`	number	Standard deviation of the underlying normal distribution.
`min`	number	Optional lower clamp.
`max`	number	Optional upper clamp.

Use log_normal for skewed positive values such as transaction amounts, payload sizes, and response times.
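
Because mean and stddev describe the underlying normal distribution, the generated values are on an exponential scale. For an unclamped log-normal the median is exp(mean) and the average is exp(mean + stddev^2 / 2). The Python sketch below computes both and draws a clamped sample with the standard library. It approximates the documented behavior and will not reproduce Mach5's actual values.

```python
import math
import random


def log_normal_summary(mean, stddev):
    """Median and average of an unclamped log-normal with these parameters."""
    return math.exp(mean), math.exp(mean + stddev ** 2 / 2)


def sample_log_normal(rng, mean, stddev, minimum=None, maximum=None):
    """Draw one value and clamp it to the optional bounds."""
    value = rng.lognormvariate(mean, stddev)
    if minimum is not None:
        value = max(value, minimum)
    if maximum is not None:
        value = min(value, maximum)
    return value


median, average = log_normal_summary(4.0, 1.0)
amount = sample_log_normal(random.Random(7), 4.0, 1.0, minimum=1.0)
```

For mean 4.0 and stddev 1.0 the median is about 54.6 and the average about 90.0, so most order amounts are modest while a long tail of large amounts pulls the average up.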

weighted_choice

Samples one value from a list using weights.

"level": {
  "type": "weighted_choice",
  "values": ["debug", "info", "warn", "error"],
  "weights": [0.1, 0.7, 0.15, 0.05]
}

Parameter	Type	Description
`values`	array	Candidate JSON values.
`weights`	number array	Relative weights matching the `values` array.

Weights are relative. They do not need to add up to 1.0.
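
In other words, only the ratios between weights matter. A short Python sketch of the normalization (the helper name is illustrative):

```python
def normalize(weights):
    """Convert relative weights to probabilities that sum to 1."""
    total = sum(weights)
    return [weight / total for weight in weights]


# These integer weights describe the same distribution as [0.1, 0.7, 0.15, 0.05].
probabilities = normalize([2, 14, 3, 1])
```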

fake

Generates deterministic plausible strings for common categories.

"name": { "type": "fake", "category": "person.name" }

Parameter	Type	Description
`category`	string	Fake data category.
`count`	integer	Optional amount of text for categories that support longer generated content.

Supported categories:

Category	Example use
`person.name`	Human names.
`internet.email`	Email addresses.
`address.city`	City names.
`company.name`	Company names.
`commerce.product_name`	Product names.
`phone.number`	Phone numbers.
`lorem`	Placeholder text.

Examples:

"email": { "type": "fake", "category": "internet.email" }

"product": { "type": "fake", "category": "commerce.product_name" }

"description": { "type": "fake", "category": "lorem", "count": 12 }

timestamp_range

Generates a timestamp between start and end.

"timestamp": {
  "type": "timestamp_range",
  "start": "2026-01-01T00:00:00Z",
  "end": "2026-01-31T23:59:59Z"
}

Parameter	Type	Description
`start`	string	Start date/time.
`end`	string	End date/time.
`distribution`	string	Optional distribution. Use `uniform` or `normal`. Defaults to `uniform`.

Normal distribution places more values near the middle of the time range:

"timestamp": {
  "type": "timestamp_range",
  "start": "2026-01-01T00:00:00Z",
  "end": "2026-01-31T23:59:59Z",
  "distribution": "normal"
}

format

Builds a string by replacing placeholders with nested generator values.

"message": {
  "type": "format",
  "pattern": "{service} returned status {status}",
  "service": { "type": "weighted_choice", "values": ["api", "worker"], "weights": [0.7, 0.3] },
  "status": { "type": "weighted_choice", "values": [200, 404, 500], "weights": [0.8, 0.15, 0.05] }
}

Parameter	Type	Description
`pattern`	string	String containing `{placeholder}` tokens.
placeholder name	generator	Nested generator used to replace that placeholder.

expression

Evaluates a numeric arithmetic expression using fields already generated in the same row.

"total": { "type": "expression", "expr": "quantity * unit_price" }

Parameter	Type	Description
`expr`	string	Arithmetic expression.

Supported expression syntax:

numeric literals;
field identifiers;
+, -, *, /;
parentheses;
unary minus.

Example:

"fields": {
  "quantity": { "type": "uniform_int", "min": 1, "max": 5 },
  "unit_price": { "type": "uniform_float", "min": 10.0, "max": 250.0 },
  "total": { "type": "expression", "expr": "quantity * unit_price" }
}
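
For readers who want to prototype expressions offline, the supported syntax is small enough to evaluate with Python's `ast` module. The sketch below accepts only the constructs listed above and rejects everything else. It is not Mach5's evaluator, and details such as division semantics are assumptions (Python's `/` always performs floating-point division).

```python
import ast
import operator

BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr, row):
    """Evaluate an arithmetic expression against the fields of one row."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value  # numeric literal
        if isinstance(node, ast.Name):
            return row[node.id]  # KeyError if the field does not exist yet
        if isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
            return BINARY_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    return visit(ast.parse(expr, mode="eval"))


row = {"quantity": 3, "unit_price": 12.5}
row["total"] = evaluate("quantity * unit_price", row)
```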

zipf

Generates an integer rank from 1 through n using a Zipf distribution.

"popularity_rank": { "type": "zipf", "n": 1000, "s": 1.2 }

Parameter	Type	Description
`n`	integer	Maximum rank.
`s`	number	Skew parameter. Higher values concentrate more samples on lower ranks.

Use zipf for popularity distributions such as top products, hot users, or frequently accessed resources.
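
To see how strongly a given s concentrates samples, the rank probabilities can be computed directly. This sketch assumes the standard finite Zipf definition, where the probability of rank k is proportional to 1 / k^s; this page does not state which exact variant Mach5 uses.

```python
def zipf_probabilities(n, s):
    """Return P(rank = k) for k = 1..n, with P(k) proportional to 1 / k**s."""
    weights = [1.0 / k ** s for k in range(1, n + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


probabilities = zipf_probabilities(1000, 1.2)
top_ten_share = sum(probabilities[:10])  # fraction of samples landing on ranks 1-10
```

Under this definition, rank 1 is always 2^s times as likely as rank 2, whatever the value of n.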

foreign_key

Samples a key from an upstream index or stateless entity.

"customer_id": {
  "type": "foreign_key",
  "source": "customers.customer_id"
}

Parameter	Type	Description
`source`	string	Source reference in `index.field` syntax.
`distribution`	string	Optional sampling distribution. Defaults to `uniform`. Use `zipf` for skewed references.
`s`	number	Zipf skew parameter when `distribution` is `zipf`.
`filter`	object	Optional filter that restricts candidate upstream rows.
`resolution`	string	Optional resolution mode. Use `stateless` for entity-based keys.
`target_entity`	string	Entity name for stateless resolution.
`via`	string	Source reference used to resolve a stateless relationship through another field.

Uniform foreign key:

"customer_id": { "type": "foreign_key", "source": "customers.customer_id" }

Zipf-skewed foreign key:

"customer_id": {
  "type": "foreign_key",
  "source": "customers.customer_id",
  "distribution": "zipf",
  "s": 1.2
}

Filtered foreign key:

"resource_id": {
  "type": "foreign_key",
  "source": "assets.resource_id",
  "filter": {
    "account_id": { "eq": "$account_id" }
  }
}

The filter compares an upstream field with a value from the current row. In this example, generated rows select only assets with the same account_id as the current row.
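
The effect of the filter on the candidate set can be sketched in Python as follows. Only the `eq` operator shown above is handled, the `$` prefix is read as a reference to the current row as described, and the sample rows are invented for illustration.

```python
def filter_candidates(upstream_rows, row_filter, current_row):
    """Return the upstream rows that satisfy every eq condition in the filter."""

    def resolve(operand):
        # A leading "$" refers to a field of the row being generated.
        if isinstance(operand, str) and operand.startswith("$"):
            return current_row[operand[1:]]
        return operand

    return [
        candidate
        for candidate in upstream_rows
        if all(
            candidate[field] == resolve(condition["eq"])
            for field, condition in row_filter.items()
        )
    ]


assets = [
    {"resource_id": "R-1", "account_id": "A-1"},
    {"resource_id": "R-2", "account_id": "A-2"},
    {"resource_id": "R-3", "account_id": "A-1"},
]
current = {"account_id": "A-1"}
candidates = filter_candidates(assets, {"account_id": {"eq": "$account_id"}}, current)
```

The foreign key would then be sampled from `candidates` only, so every generated resource_id belongs to the same account as the row that references it.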

Stateless foreign key using an entity:

"account_id": {
  "type": "foreign_key",
  "source": "accounts.account_id",
  "resolution": "stateless",
  "target_entity": "account"
}

parent_key

Copies a parent or relationship key into the generated row.

"tenant_id": { "type": "parent_key" }

Use parent_key in per_parent child indexes when the child field name matches a parent field name. Use it in junction indexes to emit the left and right keys into each relationship row.

ancestor_key

Resolves a value through an upstream foreign-key chain.

"account_id": {
  "type": "ancestor_key",
  "source": "accounts.account_id"
}

Use ancestor_key when a downstream row needs a value from an upstream ancestor, not just its immediate parent.

value_set

Samples from the distinct values of an upstream field.

"region": {
  "type": "value_set",
  "source": "customers.region"
}

Parameter	Type	Description
`source`	string	Source reference in `index.field` syntax.
`distribution`	string	Optional distribution. Defaults to `uniform`.

Use value_set when you want values that are known to exist upstream but do not need to preserve a row-level key relationship.

conditional

Chooses a generator based on another field or resolved source value.

"error_code": {
  "type": "conditional",
  "on": "level",
  "cases": {
    "error": { "type": "uniform_int", "min": 5000, "max": 5999 },
    "warn": { "type": "uniform_int", "min": 4000, "max": 4999 },
    "info": { "type": "constant", "value": null }
  }
}

Parameter	Type	Description
`on`	string	Field or source reference used to select a case.
`cases`	object	Map from resolved value to generator.

The resolved value must match a case key.

correlated

Generates a value by adding an offset to another field or source value.

"end_time": {
  "type": "correlated",
  "source": "start_time",
  "offset": { "type": "uniform_int", "min": 1, "max": 300 },
  "offset_unit": "seconds"
}

Parameter	Type	Description
`source`	string	Local field or upstream source reference.
`offset`	number or generator	Offset value.
`offset_unit`	string	Optional timestamp offset unit, such as `seconds`, `minutes`, or `hours`.

Use correlated for fields such as end time after start time, payment amount near order amount, or dependent metrics.
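
The relationship between source, offset, and offset_unit can be illustrated with Python's `datetime` module. This sketch covers only the three units named in the table and treats a missing unit as plain numeric addition, which is an assumption about how non-timestamp sources behave.

```python
from datetime import datetime, timedelta, timezone

SUPPORTED_UNITS = {"seconds", "minutes", "hours"}


def apply_offset(source_value, offset, offset_unit=None):
    """Add an offset to a number, or a time offset to a datetime."""
    if offset_unit is None:
        return source_value + offset
    if offset_unit not in SUPPORTED_UNITS:
        raise ValueError(f"unsupported offset_unit: {offset_unit}")
    return source_value + timedelta(**{offset_unit: offset})


start_time = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
end_time = apply_offset(start_time, 90, "seconds")
```

A positive offset guarantees that end_time is later than start_time in every row.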

entity_field

Generates a deterministic field from a stateless entity catalog entry.

"account_tier": {
  "type": "entity_field",
  "entity": "account",
  "field": "tier"
}

Parameter	Type	Description
`entity`	string	Entity name from the plan-level `entities` object.
`field`	string	Field name from that entity definition.

Entities

Use entities to define stateless metadata for deterministic relationships.

{
  "entities": {
    "account": {
      "index": "accounts",
      "row_count": 1000,
      "key": { "field": "account_id", "type": "sequence", "start": 1 },
      "fields": {
        "tier": {
          "type": "weighted_choice",
          "values": ["free", "pro", "enterprise"],
          "weights": [0.7, 0.25, 0.05]
        }
      }
    }
  },
  "indexes": []
}

Entity field	Description
`index`	Index associated with the entity.
`row_count`	Number of entity rows.
`key.field`	Key field name.
`key.type`	Key generator type. Entity keys use sequence keys.
`key.start`	Starting sequence value.
`fields`	Additional deterministic entity fields.

Entities are useful when several generated indexes need consistent attributes for the same conceptual object.

Transforms

Transforms run after fields are generated. Use transforms for derivations that are easier to express in JavaScript or VRL than in field generator expressions.

JavaScript transform

"transform": {
  "type": "js",
  "source": "arg.total = arg.quantity * arg.unit_price; arg.discounted = arg.total > 500; arg"
}

VRL transform

"transform": {
  "type": "vrl",
  "source": ".total = .quantity * .unit_price\n.discounted = .total > 500"
}

Field	Description
`type`	`js` or `vrl`.
`source`	Transform source code.

Validation

Validation checks the plan before execution. It reports errors, warnings, computed dependency levels, and estimated row counts.

Validation checks include:

plan JSON is valid;
indexes is present and non-empty;
index names are unique;
worker, partition, and timeout values are greater than 0;
every index has exactly one row strategy;
depends_on references existing indexes;
per_parent.parent appears in depends_on;
junction sides reference valid upstream indexes and keys;
field source references use valid index.field syntax;
field source references point at defined fields;
stateless foreign keys specify target_entity or via;
entity catalog entries reference defined indexes and fields.

Example validation response shape:

{
  "valid": true,
  "errors": [],
  "warnings": [],
  "dag": {
    "levels": [["customers"], ["orders"], ["order_items"]]
  },
  "estimated_total_rows": 111000
}
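
The dag.levels value groups indexes so that each one comes after everything it depends on. A plausible way to derive such levels from depends_on is a layered topological sort, sketched below in Python. The dependency of order_items on orders is assumed for the example, and Mach5's own ordering within a level may differ.

```python
def dependency_levels(indexes):
    """Group index names into levels; each index follows all of its dependencies."""
    remaining = {index["name"]: set(index.get("depends_on", [])) for index in indexes}
    levels = []
    done = set()
    while remaining:
        # An index is ready once all of its dependencies are in earlier levels.
        ready = sorted(name for name, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"cycle or unknown dependency among {sorted(remaining)}")
        levels.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return levels


levels = dependency_levels([
    {"name": "order_items", "depends_on": ["orders"]},
    {"name": "orders", "depends_on": ["customers"]},
    {"name": "customers"},
])
```

Indexes in the same level have no dependencies on each other, so they could be generated concurrently.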

Execution status

Status shows lifecycle state, dependency progress, per-index progress, row counts, and summary counters.

{
  "state": "running",
  "current_level": 1,
  "total_levels": 3,
  "dag_levels": [["customers"], ["orders"], ["shipments"]],
  "indexes": {
    "customers": { "state": "completed", "rows_generated": 1000, "rows_total": 1000 },
    "orders": { "state": "running", "rows_generated": 2000, "rows_total": 10000 }
  },
  "summary": {
    "indexes_total": 3,
    "indexes_completed": 1,
    "rows_generated": 3000,
    "rows_total": 11000,
    "workers_completed": 2,
    "workers_total": 8
  },
  "run_generation": 1,
  "last_observed_generation": 0
}

Examples

The examples below start simple and gradually add relationships, arrays, transforms, and correlated data.

Example 1: one simple index

This plan creates 10 services.

{
  "seed": 1,
  "description": "Simple services dataset",
  "plan": {
    "indexes": [
      {
        "name": "services",
        "row_count": 10,
        "mapping": {
          "properties": {
            "service_id": { "type": "keyword" },
            "service_name": { "type": "keyword" },
            "tier": { "type": "keyword" }
          }
        },
        "fields": {
          "service_id": { "type": "sequence", "format": "SVC-{:03}" },
          "service_name": { "type": "fake", "category": "company.name" },
          "tier": {
            "type": "weighted_choice",
            "values": ["frontend", "backend", "data"],
            "weights": [0.3, 0.5, 0.2]
          }
        }
      }
    ]
  }
}

Example 2: optional fields and arrays

This plan creates hosts with optional owner data and a variable number of tags.

{
  "seed": 2,
  "description": "Hosts with optional owners and tags",
  "plan": {
    "indexes": [
      {
        "name": "hosts",
        "row_count": 1000,
        "mapping": {
          "properties": {
            "host_id": { "type": "keyword" },
            "hostname": { "type": "keyword" },
            "owner_email": { "type": "keyword" },
            "tags": { "type": "keyword" }
          }
        },
        "fields": {
          "host_id": { "type": "sequence", "format": "HOST-{:06}" },
          "hostname": {
            "type": "format",
            "pattern": "host-{rack}-{slot}",
            "rack": { "type": "uniform_int", "min": 1, "max": 50 },
            "slot": { "type": "uniform_int", "min": 1, "max": 40 }
          },
          "owner_email": {
            "type": "fake",
            "category": "internet.email",
            "optional": 0.25
          },
          "tags": {
            "type": "weighted_choice",
            "values": ["prod", "dev", "pci", "public", "internal"],
            "weights": [0.3, 0.2, 0.1, 0.2, 0.2],
            "cardinality": { "type": "uniform_int", "min": 1, "max": 3 },
            "unique": true
          }
        }
      }
    ]
  }
}

Example 3: customers and orders

This plan creates customers first, then orders that reference generated customers.

{
  "seed": 42,
  "description": "Customers and orders",
  "plan": {
    "max_parallel_workers": 4,
    "indexes": [
      {
        "name": "customers",
        "row_count": 1000,
        "mapping": {
          "properties": {
            "customer_id": { "type": "keyword" },
            "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
            "email": { "type": "keyword" },
            "region": { "type": "keyword" }
          }
        },
        "fields": {
          "customer_id": { "type": "sequence", "format": "CUST-{:06}" },
          "name": { "type": "fake", "category": "person.name" },
          "email": { "type": "fake", "category": "internet.email" },
          "region": {
            "type": "weighted_choice",
            "values": ["US", "EU", "APAC"],
            "weights": [0.5, 0.3, 0.2]
          }
        }
      },
      {
        "name": "orders",
        "depends_on": ["customers"],
        "row_count": 10000,
        "mapping": {
          "properties": {
            "order_id": { "type": "keyword" },
            "customer_id": { "type": "keyword" },
            "order_time": { "type": "date" },
            "amount": { "type": "double" },
            "status": { "type": "keyword" }
          }
        },
        "fields": {
          "order_id": { "type": "sequence", "format": "ORD-{:08}" },
          "customer_id": {
            "type": "foreign_key",
            "source": "customers.customer_id",
            "distribution": "zipf",
            "s": 1.1
          },
          "order_time": {
            "type": "timestamp_range",
            "start": "2026-01-01T00:00:00Z",
            "end": "2026-01-31T23:59:59Z"
          },
          "amount": { "type": "log_normal", "mean": 4.0, "stddev": 1.0, "min": 1.0 },
          "status": {
            "type": "weighted_choice",
            "values": ["created", "paid", "shipped", "cancelled"],
            "weights": [0.1, 0.6, 0.25, 0.05]
          }
        }
      }
    ]
  }
}

Example 4: per-parent users for each tenant

This plan creates tenants, then 10 to 50 users per tenant.

{
  "seed": 7,
  "description": "Tenants and users",
  "plan": {
    "indexes": [
      {
        "name": "tenants",
        "row_count": 100,
        "fields": {
          "tenant_id": { "type": "sequence", "format": "TEN-{:04}" },
          "tenant_name": { "type": "fake", "category": "company.name" },
          "plan": {
            "type": "weighted_choice",
            "values": ["free", "pro", "enterprise"],
            "weights": [0.6, 0.3, 0.1]
          }
        }
      },
      {
        "name": "users",
        "depends_on": ["tenants"],
        "per_parent": {
          "parent": "tenants",
          "count": { "type": "uniform_int", "min": 10, "max": 50 }
        },
        "fields": {
          "tenant_id": { "type": "parent_key" },
          "user_id": { "type": "sequence", "format": "USER-{:08}" },
          "name": { "type": "fake", "category": "person.name" },
          "email": { "type": "fake", "category": "internet.email" },
          "role": {
            "type": "weighted_choice",
            "values": ["admin", "analyst", "viewer"],
            "weights": [0.05, 0.35, 0.60]
          }
        }
      }
    ]
  }
}

Example 5: many-to-many relationship

This plan creates students, courses, and an enrollment junction index.

{
  "seed": 9,
  "description": "Students, courses, and enrollments",
  "plan": {
    "indexes": [
      {
        "name": "students",
        "row_count": 1000,
        "fields": {
          "student_id": { "type": "sequence", "format": "STU-{:06}" },
          "name": { "type": "fake", "category": "person.name" }
        }
      },
      {
        "name": "courses",
        "row_count": 100,
        "fields": {
          "course_id": { "type": "sequence", "format": "CRS-{:04}" },
          "course_name": { "type": "fake", "category": "commerce.product_name" }
        }
      },
      {
        "name": "enrollments",
        "depends_on": ["students", "courses"],
        "relationship": {
          "type": "junction",
          "left": { "index": "students", "key": "student_id" },
          "right": { "index": "courses", "key": "course_id" },
          "left_cardinality": { "type": "uniform_int", "min": 3, "max": 5 },
          "right_max": 80,
          "unique": true
        },
        "fields": {
          "student_id": { "type": "parent_key" },
          "course_id": { "type": "parent_key" },
          "grade": {
            "type": "weighted_choice",
            "values": ["A", "B", "C", "D", "F"],
            "weights": [0.25, 0.35, 0.25, 0.1, 0.05]
          }
        }
      }
    ]
  }
}

Example 6: conditional and correlated fields

This plan creates request logs where end_time follows start_time by 1 to 300 seconds and error_code depends on status. duration_ms and status are generated independently of each other.

{
  "seed": 12,
  "description": "Request logs with correlated fields",
  "plan": {
    "indexes": [
      {
        "name": "request_logs",
        "row_count": 50000,
        "fields": {
          "request_id": { "type": "uuid" },
          "service": {
            "type": "weighted_choice",
            "values": ["api", "auth", "billing", "worker"],
            "weights": [0.5, 0.2, 0.2, 0.1]
          },
          "start_time": {
            "type": "timestamp_range",
            "start": "2026-01-01T00:00:00Z",
            "end": "2026-01-02T00:00:00Z"
          },
          "duration_ms": { "type": "log_normal", "mean": 4.5, "stddev": 0.8, "min": 1.0, "max": 10000.0 },
          "end_time": {
            "type": "correlated",
            "source": "start_time",
            "offset": { "type": "uniform_int", "min": 1, "max": 300 },
            "offset_unit": "seconds"
          },
          "status": {
            "type": "weighted_choice",
            "values": [200, 200, 200, 404, 429, 500, 503],
            "weights": [0.3, 0.3, 0.2, 0.08, 0.05, 0.05, 0.02]
          },
          "error_code": {
            "type": "conditional",
            "on": "status",
            "cases": {
              "200": { "type": "constant", "value": null },
              "404": { "type": "constant", "value": "NOT_FOUND" },
              "429": { "type": "constant", "value": "RATE_LIMITED" },
              "500": { "type": "constant", "value": "INTERNAL" },
              "503": { "type": "constant", "value": "UNAVAILABLE" }
            }
          }
        }
      }
    ]
  }
}

Example 7: complex e-commerce dataset

This example uses multiple indexes, foreign keys, per-parent rows, expressions, weighted choices, timestamps, arrays, transforms, and a junction index.

{
  "seed": 20260507,
  "description": "E-commerce synthetic dataset",
  "plan": {
    "max_parallel_workers": 6,
    "partitions": 12,
    "workflow_timeout_seconds": 7200,
    "indexes": [
      {
        "name": "customers",
        "row_count": 10000,
        "mapping": {
          "properties": {
            "customer_id": { "type": "keyword" },
            "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
            "email": { "type": "keyword" },
            "region": { "type": "keyword" },
            "tier": { "type": "keyword" },
            "created_at": { "type": "date" }
          }
        },
        "fields": {
          "customer_id": { "type": "sequence", "format": "CUST-{:08}" },
          "name": { "type": "fake", "category": "person.name" },
          "email": { "type": "fake", "category": "internet.email" },
          "region": {
            "type": "weighted_choice",
            "values": ["US", "EU", "APAC", "LATAM"],
            "weights": [0.45, 0.25, 0.2, 0.1]
          },
          "tier": {
            "type": "weighted_choice",
            "values": ["bronze", "silver", "gold", "platinum"],
            "weights": [0.55, 0.25, 0.15, 0.05]
          },
          "created_at": {
            "type": "timestamp_range",
            "start": "2025-01-01T00:00:00Z",
            "end": "2026-01-01T00:00:00Z"
          }
        }
      },
      {
        "name": "products",
        "row_count": 2000,
        "mapping": {
          "properties": {
            "product_id": { "type": "keyword" },
            "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
            "category": { "type": "keyword" },
            "base_price": { "type": "double" },
            "tags": { "type": "keyword" }
          }
        },
        "fields": {
          "product_id": { "type": "sequence", "format": "PROD-{:06}" },
          "name": { "type": "fake", "category": "commerce.product_name" },
          "category": {
            "type": "weighted_choice",
            "values": ["electronics", "apparel", "home", "outdoor", "books"],
            "weights": [0.25, 0.25, 0.2, 0.15, 0.15]
          },
          "base_price": { "type": "log_normal", "mean": 3.8, "stddev": 0.9, "min": 1.0, "max": 2000.0 },
          "tags": {
            "type": "weighted_choice",
            "values": ["new", "sale", "popular", "premium", "eco", "gift"],
            "weights": [0.2, 0.15, 0.25, 0.1, 0.1, 0.2],
            "cardinality": { "type": "uniform_int", "min": 1, "max": 3 },
            "unique": true
          }
        }
      },
      {
        "name": "orders",
        "depends_on": ["customers"],
        "row_count": 100000,
        "mapping": {
          "properties": {
            "order_id": { "type": "keyword" },
            "customer_id": { "type": "keyword" },
            "region": { "type": "keyword" },
            "ordered_at": { "type": "date" },
            "status": { "type": "keyword" }
          }
        },
        "fields": {
          "order_id": { "type": "sequence", "format": "ORD-{:09}" },
          "customer_id": {
            "type": "foreign_key",
            "source": "customers.customer_id",
            "distribution": "zipf",
            "s": 1.15
          },
          "region": { "type": "value_set", "source": "customers.region" },
          "ordered_at": {
            "type": "timestamp_range",
            "start": "2026-01-01T00:00:00Z",
            "end": "2026-03-31T23:59:59Z",
            "distribution": "normal"
          },
          "status": {
            "type": "weighted_choice",
            "values": ["created", "paid", "shipped", "delivered", "cancelled"],
            "weights": [0.08, 0.12, 0.2, 0.55, 0.05]
          }
        }
      },
      {
        "name": "order_items",
        "depends_on": ["orders", "products"],
        "per_parent": {
          "parent": "orders",
          "count": { "type": "uniform_int", "min": 1, "max": 5 }
        },
        "mapping": {
          "properties": {
            "order_id": { "type": "keyword" },
            "item_id": { "type": "keyword" },
            "product_id": { "type": "keyword" },
            "quantity": { "type": "integer" },
            "unit_price": { "type": "double" },
            "line_total": { "type": "double" }
          }
        },
        "fields": {
          "order_id": { "type": "parent_key" },
          "item_id": { "type": "sequence", "format": "ITEM-{:010}" },
          "product_id": { "type": "foreign_key", "source": "products.product_id", "distribution": "zipf", "s": 1.05 },
          "quantity": { "type": "uniform_int", "min": 1, "max": 4 },
          "unit_price": { "type": "log_normal", "mean": 3.8, "stddev": 0.9, "min": 1.0, "max": 2000.0 },
          "line_total": { "type": "expression", "expr": "quantity * unit_price" }
        },
        "transform": {
          "type": "js",
          "source": "arg.line_total = Math.round(arg.line_total * 100) / 100; arg"
        }
      },
      {
        "name": "customer_product_affinity",
        "depends_on": ["customers", "products"],
        "relationship": {
          "type": "junction",
          "left": { "index": "customers", "key": "customer_id" },
          "right": { "index": "products", "key": "product_id" },
          "left_cardinality": { "type": "uniform_int", "min": 2, "max": 8 },
          "right_max": 1000,
          "unique": true
        },
        "fields": {
          "customer_id": { "type": "parent_key" },
          "product_id": { "type": "parent_key" },
          "affinity_score": { "type": "uniform_float", "min": 0.0, "max": 1.0 }
        }
      }
    ]
  }
}
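
As a rough sizing check for the plan above, the expected number of documents per index can be estimated from the row counts and cardinality ranges. The sketch below is not part of Mach5. It assumes that uniform_int bounds are inclusive, that left_cardinality is the number of linked products per customer, and that right_max and unique do not reduce the totals; the plan alone does not confirm any of these.

```python
def expected_uniform_int(lo: int, hi: int) -> float:
    """Mean of a uniform integer distribution, assuming inclusive bounds."""
    return (lo + hi) / 2


# Fixed row counts taken from the Example 7 plan.
customers = 10_000
products = 2_000
orders = 100_000

# order_items: per_parent count of 1 to 5 rows for each order.
order_items = orders * expected_uniform_int(1, 5)

# customer_product_affinity: left_cardinality of 2 to 8 links per customer (assumed meaning).
affinity = customers * expected_uniform_int(2, 8)

estimates = {
    "customers": customers,
    "products": products,
    "orders": orders,
    "order_items": round(order_items),
    "customer_product_affinity": round(affinity),
}

for name, rows in estimates.items():
    print(f"{name}: ~{rows:,} documents")
```

An estimate like this helps when choosing partitions and workflow_timeout_seconds, because the per-parent and junction indexes can be several times larger than their parents.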

Best practices

Use explicit mappings

Provide mappings for generated indexes that will be used in dashboards, SQL queries, or OpenSearch-compatible queries. Explicit mappings keep field types stable.
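
In Example 7, the customer_product_affinity index defines fields but no mapping block. A small client-side check can flag generated fields that have no explicit mapping before a plan is submitted. The helper unmapped_fields below is hypothetical and not a Mach5 API; the keyword and double types in the added mapping are assumptions chosen by analogy with the other indexes in the example.

```python
def unmapped_fields(index: dict) -> list[str]:
    """Return generated field names that have no entry in mapping.properties."""
    mapped = index.get("mapping", {}).get("properties", {})
    return sorted(name for name in index.get("fields", {}) if name not in mapped)


# Index definition shaped like the junction index in Example 7 (no mapping yet).
affinity_index = {
    "name": "customer_product_affinity",
    "fields": {
        "customer_id": {"type": "parent_key"},
        "product_id": {"type": "parent_key"},
        "affinity_score": {"type": "uniform_float", "min": 0.0, "max": 1.0},
    },
}
missing_before = unmapped_fields(affinity_index)

# Add an explicit mapping (field types here are assumptions, not confirmed defaults).
affinity_index["mapping"] = {
    "properties": {
        "customer_id": {"type": "keyword"},
        "product_id": {"type": "keyword"},
        "affinity_score": {"type": "double"},
    }
}
missing_after = unmapped_fields(affinity_index)

print(missing_before)
print(missing_after)
```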

Name generated fields clearly

Use descriptive field names such as customer_id, order_id, ordered_at, and line_total. This makes downstream queries easier to write.

Put upstream indexes in depends_on

If an index references another index through foreign_key, value_set, ancestor_key, conditional, correlated, per_parent, or relationship, list the referenced index in depends_on.
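
The rule above can be checked on the client before validation. The sketch below is a hypothetical helper, not a Mach5 API. It covers only the reference shapes that appear in Example 7: a "source" of the form index.field on foreign_key and value_set fields, per_parent.parent, and the left and right sides of a relationship. It does not inspect ancestor_key, conditional, or correlated fields, whose reference syntax is not shown in this section.

```python
def referenced_indexes(index: dict) -> set[str]:
    """Collect upstream index names referenced by one index definition."""
    refs: set[str] = set()

    # foreign_key and value_set fields use a source of the form "index.field".
    for spec in index.get("fields", {}).values():
        source = spec.get("source")
        if spec.get("type") in ("foreign_key", "value_set") and source:
            refs.add(source.split(".", 1)[0])

    # per_parent names its parent index directly.
    parent = index.get("per_parent", {}).get("parent")
    if parent:
        refs.add(parent)

    # A junction relationship names an index on each side.
    relationship = index.get("relationship", {})
    for side in ("left", "right"):
        if side in relationship:
            refs.add(relationship[side]["index"])

    return refs


def missing_dependencies(index: dict) -> list[str]:
    """Referenced indexes that are absent from depends_on."""
    return sorted(referenced_indexes(index) - set(index.get("depends_on", [])))


# order_items from Example 7, with "products" deliberately left out of depends_on.
order_items = {
    "name": "order_items",
    "depends_on": ["orders"],
    "per_parent": {"parent": "orders", "count": {"type": "uniform_int", "min": 1, "max": 5}},
    "fields": {
        "order_id": {"type": "parent_key"},
        "product_id": {"type": "foreign_key", "source": "products.product_id"},
    },
}
print(missing_dependencies(order_items))
```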

Start small

Validate and execute a smaller dataset first. Increase row_count, per_parent counts, and relationship cardinalities after the generated shape is correct.
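
One way to start small is to derive a reduced copy of the plan and validate that first. The helper scale_down below is a hypothetical client-side sketch, not a Mach5 feature. It divides each fixed row_count by an integer and leaves per_parent counts and relationship cardinalities unchanged, so those still need to be reviewed by hand.

```python
import copy


def scale_down(plan_doc: dict, divisor: int, minimum: int = 1) -> dict:
    """Return a copy of the plan with every fixed row_count divided by divisor."""
    scaled = copy.deepcopy(plan_doc)  # leave the original plan untouched
    for index in scaled["plan"]["indexes"]:
        if "row_count" in index:  # per_parent and junction indexes have no row_count
            index["row_count"] = max(minimum, index["row_count"] // divisor)
    return scaled


# Trimmed-down outline of the Example 7 plan (fields omitted for brevity).
full_plan = {
    "seed": 20260507,
    "plan": {
        "indexes": [
            {"name": "customers", "row_count": 10000},
            {"name": "products", "row_count": 2000},
            {"name": "orders", "row_count": 100000},
            {"name": "order_items", "per_parent": {"parent": "orders"}},
        ]
    },
}

small_plan = scale_down(full_plan, divisor=100)
for index in small_plan["plan"]["indexes"]:
    print(index["name"], index.get("row_count", "per-parent"))
```

Keeping the seed unchanged in the reduced copy does not imply that the small dataset is a prefix of the full one; that depends on how values are derived from the seed, which this section does not specify.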

Use one seed per scenario

Keep the same seed when you want repeatable output. Change the seed when you want a different dataset with the same shape.
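
The effect of a seed can be illustrated with any seeded pseudo-random generator. The sketch below uses Python's random.Random as a stand-in. It is an analogy only and says nothing about how Mach5 derives values internally; the function sample_customer_ids is hypothetical.

```python
import random


def sample_customer_ids(seed: int, count: int = 5) -> list[str]:
    """Draw customer IDs from a generator that is fully determined by the seed."""
    rng = random.Random(seed)
    return ["CUST-{:08}".format(rng.randrange(10_000)) for _ in range(count)]


same_a = sample_customer_ids(42)
same_b = sample_customer_ids(42)    # same seed: identical sequence
different = sample_customer_ids(43)  # new seed: same shape, different values

print(same_a == same_b)
print(same_a == different)
```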

Prefer transforms for complex row-level derivations

Use expression for simple arithmetic. Use JavaScript or VRL transforms when the derivation needs branching, rounding, or multiple calculated fields.
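
To make the division of work concrete, the sketch below mirrors in Python what the order_items definition in Example 7 does to one row: the expression computes quantity * unit_price, and the transform then rounds the result to two decimals. The functions js_round and derive_line_total are hypothetical illustrations, not Mach5 code. js_round approximates JavaScript's Math.round for ordinary finite values, since Python's built-in round uses a different tie-breaking rule.

```python
import math


def js_round(value: float) -> int:
    """Approximate JavaScript Math.round: halves round toward positive infinity."""
    return math.floor(value + 0.5)


def derive_line_total(row: dict) -> dict:
    """Apply the expression step, then the rounding transform, to a copy of the row."""
    out = dict(row)
    # Expression step: "quantity * unit_price".
    out["line_total"] = out["quantity"] * out["unit_price"]
    # Transform step: Math.round(arg.line_total * 100) / 100.
    out["line_total"] = js_round(out["line_total"] * 100) / 100
    return out


row = derive_line_total({"quantity": 3, "unit_price": 12.3456})
print(row["line_total"])
```

The expression stays a one-line arithmetic formula, while the rounding rule lives in the transform, where branching or further derived fields could be added later.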
