Synthetic Data Generation

Synthetic Data Generation creates deterministic test, demo, and benchmark data directly in Mach5. You define a generation plan, validate it, and execute it; Mach5 then creates one or more output indexes populated with generated documents.

[Figures: Synthetic data generation plans; Synthetic generation plan detail; Create synthetic generation plan]

Use synthetic generation when you need:

  • repeatable demo datasets;
  • benchmark data with controlled size and shape;
  • correlated indexes such as customers, orders, order items, and payments;
  • realistic-looking values without importing external files;
  • parent/child and many-to-many relationships that are valid by construction.

The same plan seed and row definitions produce the same generated data, which makes synthetic datasets easy to recreate across environments.

Basic workflow

  1. Create a GenerationPlan resource in a namespace.
  2. Validate the plan.
  3. Execute the plan.
  4. Watch status until the plan completes.
  5. Query the generated indexes.

REST API

Generation plans are namespace-scoped resources.

GET    /apis/namespaces/{namespace}/generation_plans
GET    /apis/namespaces/{namespace}/generation_plans/{name}
PUT    /apis/namespaces/{namespace}/generation_plans/{name}
PATCH  /apis/namespaces/{namespace}/generation_plans/{name}
DELETE /apis/namespaces/{namespace}/generation_plans/{name}

POST   /apis/namespaces/{namespace}/generation_plans/{name}/_validate
POST   /apis/namespaces/{namespace}/generation_plans/{name}/_execute
POST   /apis/namespaces/{namespace}/generation_plans/{name}/_cancel
GET    /apis/namespaces/{namespace}/generation_plans/{name}/status

Create a plan:

curl -X PUT \
  "https://mach5.example.com/apis/namespaces/default/generation_plans/demo_orders" \
  -H "Content-Type: application/json" \
  -d '{
    "seed": 42,
    "description": "Small order demo dataset",
    "plan": {
      "indexes": [
        {
          "name": "customers",
          "row_count": 100,
          "fields": {
            "customer_id": { "type": "sequence", "format": "CUST-{:04}" },
            "name": { "type": "fake", "category": "person.name" },
            "email": { "type": "fake", "category": "internet.email" }
          }
        }
      ]
    }
  }'

Validate the plan:

curl -X POST \
  "https://mach5.example.com/apis/namespaces/default/generation_plans/demo_orders/_validate"

Execute the plan:

curl -X POST \
  "https://mach5.example.com/apis/namespaces/default/generation_plans/demo_orders/_execute"

Check status:

curl \
  "https://mach5.example.com/apis/namespaces/default/generation_plans/demo_orders/status"

Cancel a running plan:

curl -X POST \
  "https://mach5.example.com/apis/namespaces/default/generation_plans/demo_orders/_cancel"

GenerationPlan resource

A generation plan resource has a deterministic seed, an optional description, and a plan specification.

{
  "seed": 42,
  "description": "E-commerce demo data",
  "plan": {
    "indexes": []
  }
}

Field        Type              Description
seed         unsigned integer  Global deterministic seed. Change the seed to produce a different but repeatable dataset.
description  string            Optional human-readable description.
plan         object            Data generation specification.

The status response also includes lifecycle fields such as state, run_generation, last_observed_generation, and last_execution.

Common states:

State       Description
idle        Plan is saved and not currently running.
running     Plan execution is in progress.
completed   Plan execution completed successfully.
failed      Plan execution failed.
cancelling  Cancellation has been requested.
cancelled   Plan execution was cancelled.
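
A client that watches status until the plan finishes could be sketched as follows. This is a minimal sketch, not Mach5 client code: wait_for_plan and fetch_status are hypothetical names, fetch_status stands for any callable that returns the parsed JSON of the status endpoint, and treating completed, failed, and cancelled as the only final states is an assumption based on the table above.

```python
import time
from typing import Callable, Dict

# Assumption: these states are final; idle, running, and cancelling mean "keep waiting".
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_plan(
    fetch_status: Callable[[], Dict],
    poll_seconds: float = 2.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until the plan reports a terminal state, then return it.

    fetch_status is a hypothetical callable that returns the status JSON as a dict.
    """
    for _ in range(max_polls):
        state = fetch_status()["state"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("plan did not reach a terminal state")
```

Passing sleep as a parameter keeps the loop easy to exercise without real delays.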

Plan structure

The plan object contains optional execution controls, optional entity metadata, and a required indexes array.

{
  "max_parallel_workers": 4,
  "partitions": 8,
  "workflow_timeout_seconds": 7200,
  "entities": {},
  "indexes": []
}

Field                     Type     Description
max_parallel_workers      integer  Default maximum concurrent workers per index. Must be greater than 0.
partitions                integer  Default partition count for fixed-row and relationship planning. Must be greater than 0.
workflow_timeout_seconds  integer  Default timeout for generation work. Must be greater than 0.
entities                  object   Optional stateless entity catalog used by stateless foreign keys and entity_field.
indexes                   array    Output index specifications. Required and non-empty.

Index-level max_parallel_workers, partitions, and workflow_timeout_seconds override plan-level values.
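
The override rule can be illustrated with a small lookup. This is a sketch of the precedence only; effective_controls is a hypothetical helper, and returning None for a control that neither level sets is an assumption (the server-side default is not described here).

```python
from typing import Any, Dict, Optional

CONTROL_KEYS = ("max_parallel_workers", "partitions", "workflow_timeout_seconds")


def effective_controls(plan: Dict[str, Any], index: Dict[str, Any]) -> Dict[str, Optional[int]]:
    """Resolve execution controls for one index: index-level value first, then plan-level."""
    resolved: Dict[str, Optional[int]] = {}
    for key in CONTROL_KEYS:
        value = index.get(key, plan.get(key))
        # Each control must be greater than 0 when it is set.
        if value is not None and value <= 0:
            raise ValueError(f"{key} must be greater than 0")
        resolved[key] = value  # None: not set at either level (assumed to mean "use the default")
    return resolved
```

For a plan with partitions set to 8 and an index with partitions set to 16, the index is planned with 16 partitions while other indexes keep 8.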

Index specification

Each index entry defines one generated output index.

{
  "name": "orders",
  "depends_on": ["customers"],
  "row_count": 100000,
  "mapping": {
    "properties": {
      "order_id": { "type": "keyword" },
      "customer_id": { "type": "keyword" },
      "amount": { "type": "double" }
    }
  },
  "fields": {
    "order_id": { "type": "sequence", "format": "ORD-{:08}" },
    "customer_id": { "type": "foreign_key", "source": "customers.customer_id" },
    "amount": { "type": "log_normal", "mean": 4.0, "stddev": 1.0, "min": 1.0 }
  }
}

Field                     Type          Description
name                      string        Output index name. Must be unique within the plan.
depends_on                string array  Upstream indexes that must be generated before this index.
row_count                 integer       Fixed number of rows to generate.
per_parent                object        Generate child rows for every row in a parent index.
relationship              object        Generate a relationship or junction index between upstream indexes.
mapping                   object        OpenSearch-compatible mapping used when creating the output index. Defaults to an empty mapping.
fields                    object        Field name to field generator specification.
transform                 object        Optional JavaScript or VRL transform applied after field generation.
max_parallel_workers      integer       Index-level worker override.
partitions                integer       Index-level partition override.
workflow_timeout_seconds  integer       Index-level timeout override.

Exactly one row strategy is required for each index:

  • row_count
  • per_parent
  • relationship
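
A pre-flight check for the "exactly one row strategy" rule might look like the sketch below. row_strategy is a hypothetical helper for checking a plan locally before sending it; it is not part of the Mach5 API.

```python
ROW_STRATEGIES = ("row_count", "per_parent", "relationship")


def row_strategy(index_spec: dict) -> str:
    """Return the single row strategy used by an index spec, or raise if there is not exactly one."""
    present = [key for key in ROW_STRATEGIES if key in index_spec]
    if len(present) != 1:
        raise ValueError(
            f"index {index_spec.get('name')!r} must set exactly one of "
            f"{', '.join(ROW_STRATEGIES)}; found {present or 'none'}"
        )
    return present[0]
```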

Row strategies

Fixed row count

Use row_count when an index should contain a fixed, known number of generated rows.

{
  "name": "customers",
  "row_count": 1000,
  "fields": {
    "customer_id": { "type": "sequence", "format": "CUST-{:06}" },
    "name": { "type": "fake", "category": "person.name" }
  }
}

The example creates 1,000 customer documents. customer_id increases monotonically from CUST-000001 to CUST-001000, while name is a realistic-looking value; both are deterministic for a given seed.

Per-parent rows

Use per_parent when a child index should generate rows for every row in a parent index.

{
  "name": "users",
  "depends_on": ["tenants"],
  "per_parent": {
    "parent": "tenants",
    "count": { "type": "uniform_int", "min": 10, "max": 50 }
  },
  "fields": {
    "tenant_id": { "type": "parent_key" },
    "user_id": { "type": "sequence", "format": "USER-{:08}" },
    "email": { "type": "fake", "category": "internet.email" }
  }
}

The parent index must appear in depends_on. The count generator decides how many child rows to create for each parent row. A child field of type parent_key receives the value of the parent field with the same name.

In this example, each tenant receives between 10 and 50 users, and tenant_id is copied from the parent tenant row.
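
The per-parent expansion can be pictured as a loop over parent rows. The sketch below is an illustration under stated assumptions, not the Mach5 implementation: expand_children is a hypothetical helper, Python's random module stands in for the seeded generator, and the user_id format is copied from the example above.

```python
import random


def expand_children(parents, key_field, min_count, max_count, seed):
    """Create child rows for every parent row, copying the parent key into each child."""
    rng = random.Random(seed)  # same seed, same children
    children = []
    next_id = 1
    for parent in parents:
        count = rng.randint(min_count, max_count)  # inclusive bounds, like uniform_int
        for _ in range(count):
            children.append({
                key_field: parent[key_field],       # what parent_key would emit
                "user_id": f"USER-{next_id:08}",    # sequence continues across parents
            })
            next_id += 1
    return children
```

Because the child count is sampled per parent, the total size of the child index is only known approximately before generation.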

Junction relationships

Use relationship when an index should pair keys from two upstream indexes, such as students and courses, users and groups, or products and campaigns.

{
  "name": "student_courses",
  "depends_on": ["students", "courses"],
  "relationship": {
    "type": "junction",
    "left": { "index": "students", "key": "student_id" },
    "right": { "index": "courses", "key": "course_id" },
    "left_cardinality": { "type": "uniform_int", "min": 3, "max": 5 },
    "right_max": 250,
    "unique": true
  },
  "fields": {
    "student_id": { "type": "parent_key" },
    "course_id": { "type": "parent_key" },
    "enrolled_at": {
      "type": "timestamp_range",
      "start": "2026-01-01T00:00:00Z",
      "end": "2026-03-31T23:59:59Z"
    }
  }
}

Relationship field  Description
type                Must be junction.
left.index          Left source index name.
left.key            Key field from the left source index.
right.index         Right source index name.
right.key           Key field from the right source index.
left_cardinality    Generator that controls how many right-side keys are assigned to each left-side row.
right_max           Maximum number of times each right-side row may be used.
unique              When true, prevents duplicate left/right pairs for the same left row.

Both source indexes must appear in depends_on.
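
One way to picture how left_cardinality, right_max, and unique interact is the sketch below. It is an assumption-laden illustration rather than Mach5's algorithm: build_junction is a hypothetical helper, sampling without replacement stands in for unique, and emitting fewer pairs when right-side capacity runs out is a guess at the behavior.

```python
import random


def build_junction(left_keys, right_keys, min_card, max_card, right_max, seed):
    """Pair each left key with several distinct right keys while capping right-side usage."""
    rng = random.Random(seed)
    usage = {key: 0 for key in right_keys}  # how often each right key has been used
    pairs = []
    for left in left_keys:
        wanted = rng.randint(min_card, max_card)  # left_cardinality as an inclusive range
        available = [key for key in right_keys if usage[key] < right_max]
        # Sampling without replacement yields no duplicate pairs for this left row.
        chosen = rng.sample(available, min(wanted, len(available)))
        for right in chosen:
            usage[right] += 1
            pairs.append((left, right))
    return pairs
```

With 1,000 students taking 3 to 5 courses each and 100 courses capped at 250 uses, total capacity (25,000) comfortably exceeds the at most 5,000 pairs requested.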

Field specification

Every generated field has a generator type and optional cross-cutting properties.

"fields": {
  "service": {
    "type": "weighted_choice",
    "values": ["api", "worker", "scheduler"],
    "weights": [0.6, 0.3, 0.1],
    "optional": 0.05
  }
}

Cross-cutting field properties

Property     Type       Description
type         string     Generator type. Required.
optional     float      Probability from 0.0 to 1.0 that the field is omitted. Defaults to 0.0.
cardinality  generator  When set, the field emits an array. The cardinality generator decides array length.
unique       boolean    When the field emits an array, attempts to avoid duplicate elements. Defaults to true.

Optional fields

"middle_name": {
  "type": "fake",
  "category": "person.name",
  "optional": 0.7
}

This field is omitted from about 70% of generated documents.

Array fields with cardinality

"tags": {
  "type": "weighted_choice",
  "values": ["prod", "dev", "pci", "public", "internal"],
  "weights": [0.3, 0.2, 0.1, 0.2, 0.2],
  "cardinality": { "type": "uniform_int", "min": 1, "max": 3 },
  "unique": true
}

This field emits an array with 1 to 3 unique tag values.
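
The way optional, cardinality, and unique wrap a base generator can be sketched as below. emit_field is a hypothetical helper, generators are modelled as plain Python callables, and the bounded retry loop is only one possible reading of "attempts to avoid duplicate elements".

```python
import random


def emit_field(rng, generate, optional=0.0, cardinality=None, unique=True):
    """Return (present, value) for one field of one row.

    generate and cardinality are callables taking the RNG; both are stand-ins
    for nested generator specs.
    """
    if rng.random() < optional:
        return False, None            # field omitted from the document
    if cardinality is None:
        return True, generate(rng)    # scalar field
    length = cardinality(rng)
    values = []
    attempts = 0
    # Bounded retries: give up on uniqueness rather than loop forever.
    while len(values) < length and attempts < length * 20:
        attempts += 1
        candidate = generate(rng)
        if unique and candidate in values:
            continue
        values.append(candidate)
    return True, values
```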

Field generators

sequence

Generates a deterministic sequence based on the row index.

{ "type": "sequence" }

Parameter  Type     Description
start      integer  First sequence value. Defaults to 1.
format     string   Optional format string. Supports {} and zero-padded forms such as {:06} or {:06d}.

Examples:

"id": { "type": "sequence" }

Produces numeric values such as 1, 2, 3.

"customer_id": { "type": "sequence", "start": 1000, "format": "CUST-{:06}" }

Produces values such as CUST-001000, CUST-001001, CUST-001002.
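
The placeholder forms listed above happen to be valid in Python's str.format as well, so the mapping from row index to value can be sketched directly. sequence_value is a hypothetical helper, and a zero-based row index is an assumption.

```python
def sequence_value(row_index, start=1, fmt=None):
    """Return the sequence value for a zero-based row index (assumed), optionally formatted."""
    value = start + row_index
    # Without a format string the value stays numeric.
    return value if fmt is None else fmt.format(value)


print(sequence_value(0))                                  # 1
print(sequence_value(2, start=1000, fmt="CUST-{:06}"))    # CUST-001002
```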

uuid

Generates a deterministic UUID-like string.

"event_id": { "type": "uuid" }

Use uuid for identifiers that should look globally unique but remain repeatable for the same seed and row.

constant

Returns the same JSON value for every row.

"environment": { "type": "constant", "value": "prod" }

value may be a string, number, boolean, object, array, or null.

uniform_int

Generates an integer uniformly distributed between min and max, inclusive.

"status": { "type": "uniform_int", "min": 200, "max": 599 }

Parameter  Type     Description
min        integer  Minimum value.
max        integer  Maximum value.

uniform_float

Generates a floating-point value uniformly distributed between min and max.

"cpu": { "type": "uniform_float", "min": 0.0, "max": 100.0 }

Parameter  Type    Description
min        number  Minimum value.
max        number  Maximum value.

normal

Generates a normally distributed floating-point value.

"latency_ms": { "type": "normal", "mean": 120.0, "stddev": 25.0, "min": 0.0, "max": 1000.0 }

Parameter  Type    Description
mean       number  Average value.
stddev     number  Standard deviation.
min        number  Optional lower clamp.
max        number  Optional upper clamp.

Use normal for values clustered around an average, such as latency, age, or utilization.

log_normal

Generates a log-normal floating-point value.

"order_amount": { "type": "log_normal", "mean": 4.0, "stddev": 1.0, "min": 1.0 }

Parameter  Type    Description
mean       number  Mean of the underlying normal distribution.
stddev     number  Standard deviation of the underlying normal distribution.
min        number  Optional lower clamp.
max        number  Optional upper clamp.

Use log_normal for skewed positive values such as transaction amounts, payload sizes, and response times.
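
Because mean and stddev describe the underlying normal distribution, the generated values are much larger than the parameters suggest: with mean 4.0 the median output is e^4, roughly 54.6. The sketch below illustrates this with Python's random.lognormvariate; log_normal is a hypothetical helper, and applying min and max as simple clamps is an assumption based on the parameter table.

```python
import random


def log_normal(rng, mean, stddev, min_value=None, max_value=None):
    """Sample exp(N(mean, stddev)) and clamp it to the optional bounds."""
    value = rng.lognormvariate(mean, stddev)
    if min_value is not None:
        value = max(value, min_value)
    if max_value is not None:
        value = min(value, max_value)
    return value


rng = random.Random(42)
samples = sorted(log_normal(rng, 4.0, 1.0, min_value=1.0) for _ in range(1001))
print(samples[500])  # sample median, expected to land near e**4
```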

weighted_choice

Samples one value from a list using weights.

"level": {
  "type": "weighted_choice",
  "values": ["debug", "info", "warn", "error"],
  "weights": [0.1, 0.7, 0.15, 0.05]
}

Parameter  Type          Description
values     array         Candidate JSON values.
weights    number array  Relative weights matching the values array.

Weights are relative. They do not need to add up to 1.0.
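
Relative weights mean that only the ratios matter: [6, 3, 1] describes the same distribution as [0.6, 0.3, 0.1]. A short sketch using Python's random.choices, which also takes relative weights; normalize and weighted_choice are hypothetical helper names.

```python
import random


def normalize(weights):
    """Convert relative weights into probabilities that sum to 1."""
    total = sum(weights)
    return [weight / total for weight in weights]


def weighted_choice(rng, values, weights):
    """Pick one value; weights are relative and need not sum to 1."""
    return rng.choices(values, weights=weights, k=1)[0]


print(normalize([6, 3, 1]))  # probabilities for the three values
```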

fake

Generates deterministic plausible strings for common categories.

"name": { "type": "fake", "category": "person.name" }

Parameter  Type     Description
category   string   Fake data category.
count      integer  Optional amount of text for categories that support longer generated content.

Supported categories:

Category               Example use
person.name            Human names.
internet.email         Email addresses.
address.city           City names.
company.name           Company names.
commerce.product_name  Product names.
phone.number           Phone numbers.
lorem                  Placeholder text.

Examples:

"email": { "type": "fake", "category": "internet.email" }
"product": { "type": "fake", "category": "commerce.product_name" }
"description": { "type": "fake", "category": "lorem", "count": 12 }

timestamp_range

Generates a timestamp between start and end.

"timestamp": {
  "type": "timestamp_range",
  "start": "2026-01-01T00:00:00Z",
  "end": "2026-01-31T23:59:59Z"
}

Parameter     Type    Description
start         string  Start date/time.
end           string  End date/time.
distribution  string  Optional distribution. Use uniform or normal. Defaults to uniform.

Normal distribution places more values near the middle of the time range:

"timestamp": {
  "type": "timestamp_range",
  "start": "2026-01-01T00:00:00Z",
  "end": "2026-01-31T23:59:59Z",
  "distribution": "normal"
}

format

Builds a string by replacing placeholders with nested generator values.

"message": {
  "type": "format",
  "pattern": "{service} returned status {status}",
  "service": { "type": "weighted_choice", "values": ["api", "worker"], "weights": [0.7, 0.3] },
  "status": { "type": "weighted_choice", "values": [200, 404, 500], "weights": [0.8, 0.15, 0.05] }
}

Parameter         Type       Description
pattern           string     String containing {placeholder} tokens.
placeholder name  generator  Nested generator used to replace that placeholder.
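
Conceptually, the format generator evaluates each nested generator and substitutes the result into the pattern. A minimal sketch, assuming nested generators can be modelled as Python callables; render_format is a hypothetical helper.

```python
import random


def render_format(rng, pattern, placeholders):
    """Evaluate each nested generator, then substitute the results into the pattern."""
    values = {name: generate(rng) for name, generate in placeholders.items()}
    return pattern.format(**values)


rng = random.Random(1)
message = render_format(
    rng,
    "{service} returned status {status}",
    {
        "service": lambda r: r.choices(["api", "worker"], weights=[0.7, 0.3])[0],
        "status": lambda r: r.choices([200, 404, 500], weights=[0.8, 0.15, 0.05])[0],
    },
)
print(message)
```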

expression

Evaluates a numeric arithmetic expression using fields already generated in the same row.

"total": { "type": "expression", "expr": "quantity * unit_price" }

Parameter  Type    Description
expr       string  Arithmetic expression.

Supported expression syntax:

  • numeric literals;
  • field identifiers;
  • +, -, *, /;
  • parentheses;
  • unary minus.

Example:

"fields": {
  "quantity": { "type": "uniform_int", "min": 1, "max": 5 },
  "unit_price": { "type": "uniform_float", "min": 10.0, "max": 250.0 },
  "total": { "type": "expression", "expr": "quantity * unit_price" }
}
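
The supported syntax is small enough to evaluate with a restricted tree walk. The sketch below uses Python's ast module to accept only the constructs listed above; evaluate is a hypothetical helper, and Mach5's own parser may differ in details such as operator precedence or integer division.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr, row):
    """Evaluate an arithmetic expression against fields already present in row."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value                      # numeric literal
        if isinstance(node, ast.Name):
            return row[node.id]                    # field identifier
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)             # unary minus
        raise ValueError(f"unsupported syntax in expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))


print(evaluate("quantity * unit_price", {"quantity": 3, "unit_price": 12.5}))  # 37.5
```

Fields referenced by the expression must already exist in the row, so quantity and unit_price are generated before total.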

zipf

Generates an integer rank from 1 through n using a Zipf distribution.

"popularity_rank": { "type": "zipf", "n": 1000, "s": 1.2 }

Parameter  Type     Description
n          integer  Maximum rank.
s          number   Skew parameter. Higher values concentrate more samples on lower ranks.

Use zipf for popularity distributions such as top products, hot users, or frequently accessed resources.
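
Under a Zipf distribution, rank k is drawn with probability proportional to 1 / k^s. The sketch below computes those probabilities and samples a rank; zipf_probabilities and zipf_rank are hypothetical helpers, and the exact sampling method Mach5 uses is not specified here.

```python
import random


def zipf_probabilities(n, s):
    """Probability of each rank 1..n, proportional to 1 / rank**s."""
    weights = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


def zipf_rank(rng, n, s):
    """Sample one rank in 1..n; lower ranks are more likely."""
    ranks = list(range(1, n + 1))
    return rng.choices(ranks, weights=zipf_probabilities(n, s), k=1)[0]


probs = zipf_probabilities(1000, 1.2)
print(probs[0], probs[9], probs[99])  # probability falls off quickly with rank
```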

foreign_key

Samples a key from an upstream index or stateless entity.

"customer_id": {
  "type": "foreign_key",
  "source": "customers.customer_id"
}

Parameter      Type    Description
source         string  Source reference in index.field syntax.
distribution   string  Optional sampling distribution. Defaults to uniform. Use zipf for skewed references.
s              number  Zipf skew parameter when distribution is zipf.
filter         object  Optional filter that restricts candidate upstream rows.
resolution     string  Optional resolution mode. Use stateless for entity-based keys.
target_entity  string  Entity name for stateless resolution.
via            string  Source reference used to resolve a stateless relationship through another field.

Uniform foreign key:

"customer_id": { "type": "foreign_key", "source": "customers.customer_id" }

Zipf-skewed foreign key:

"customer_id": {
  "type": "foreign_key",
  "source": "customers.customer_id",
  "distribution": "zipf",
  "s": 1.2
}

Filtered foreign key:

"resource_id": {
  "type": "foreign_key",
  "source": "assets.resource_id",
  "filter": {
    "account_id": { "eq": "$account_id" }
  }
}

The filter compares an upstream field with a value from the current row. In this example, generated rows select only assets with the same account_id as the current row.
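
The filter semantics can be sketched as a candidate selection step followed by sampling. This is an illustration only: filtered_foreign_key is a hypothetical helper, only the eq operator shown above is handled, and reading a leading $ as "value from the current row" is inferred from the example.

```python
import random


def filtered_foreign_key(rng, upstream_rows, key_field, current_row, filter_spec):
    """Sample key_field from the upstream rows that satisfy every eq condition."""
    def resolve(value):
        # Assumption: "$name" refers to the field `name` of the row being generated.
        if isinstance(value, str) and value.startswith("$"):
            return current_row[value[1:]]
        return value

    candidates = [
        row[key_field]
        for row in upstream_rows
        if all(row.get(field) == resolve(cond["eq"]) for field, cond in filter_spec.items())
    ]
    if not candidates:
        raise LookupError("no upstream row matches the filter")
    return rng.choice(candidates)
```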

Stateless foreign key using an entity:

"account_id": {
  "type": "foreign_key",
  "source": "accounts.account_id",
  "resolution": "stateless",
  "target_entity": "account"
}

parent_key

Copies a parent or relationship key into the generated row.

"tenant_id": { "type": "parent_key" }

Use parent_key in per_parent child indexes when the child field name matches a parent field name. Use it in junction indexes to emit the left and right keys into each relationship row.

ancestor_key

Resolves a value through an upstream foreign-key chain.

"account_id": {
  "type": "ancestor_key",
  "source": "accounts.account_id"
}

Use ancestor_key when a downstream row needs a value from an upstream ancestor, not just its immediate parent.

value_set

Samples from the distinct values of an upstream field.

"region": {
  "type": "value_set",
  "source": "customers.region"
}

Parameter     Type    Description
source        string  Source reference in index.field syntax.
distribution  string  Optional distribution. Defaults to uniform.

Use value_set when you want values that are known to exist upstream but do not need to preserve a row-level key relationship.

conditional

Chooses a generator based on another field or resolved source value.

"error_code": {
  "type": "conditional",
  "on": "level",
  "cases": {
    "error": { "type": "uniform_int", "min": 5000, "max": 5999 },
    "warn": { "type": "uniform_int", "min": 4000, "max": 4999 },
    "info": { "type": "constant", "value": null }
  }
}

Parameter  Type    Description
on         string  Field or source reference used to select a case.
cases      object  Map from resolved value to generator.

The resolved value must match a case key.
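
The case lookup can be sketched as a dictionary dispatch. conditional is a hypothetical helper, and comparing the resolved value by its string form is an assumption made because JSON object keys are always strings; a resolved value with no entry in cases, such as debug for the cases above, raises an error in this sketch.

```python
import random


def conditional(rng, row, on, cases):
    """Select and run the generator whose case key matches the value of row[on]."""
    key = str(row[on])  # assumption: non-string values are matched by their string form
    if key not in cases:
        raise KeyError(f"no case for {on}={key!r}")
    return cases[key](rng)


cases = {
    "error": lambda r: r.randint(5000, 5999),
    "warn": lambda r: r.randint(4000, 4999),
    "info": lambda r: None,
}
print(conditional(random.Random(1), {"level": "warn"}, "level", cases))
```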

correlated

Generates a value by adding an offset to another field or source value.

"end_time": {
  "type": "correlated",
  "source": "start_time",
  "offset": { "type": "uniform_int", "min": 1, "max": 300 },
  "offset_unit": "seconds"
}

Parameter    Type                 Description
source       string               Local field or upstream source reference.
offset       number or generator  Offset value.
offset_unit  string               Optional timestamp offset unit, such as seconds, minutes, or hours.

Use correlated for fields such as end time after start time, payment amount near order amount, or dependent metrics.
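
For timestamps, the end_time example amounts to adding a sampled offset to the source value. A minimal sketch using Python's datetime; correlated_timestamp is a hypothetical helper, and only the units named above are assumed to be valid.

```python
import random
from datetime import datetime, timedelta, timezone


def correlated_timestamp(rng, source, min_offset, max_offset, unit="seconds"):
    """Return source plus a uniformly sampled integer offset in the given unit."""
    if unit not in ("seconds", "minutes", "hours"):
        raise ValueError(f"unsupported offset unit: {unit}")
    offset = rng.randint(min_offset, max_offset)  # inclusive bounds
    return source + timedelta(**{unit: offset})


start_time = datetime(2026, 1, 1, tzinfo=timezone.utc)
end_time = correlated_timestamp(random.Random(12), start_time, 1, 300)
print(end_time - start_time)  # between 1 and 300 seconds
```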

entity_field

Generates a deterministic field from a stateless entity catalog entry.

"account_tier": {
  "type": "entity_field",
  "entity": "account",
  "field": "tier"
}

Parameter  Type    Description
entity     string  Entity name from the plan-level entities object.
field      string  Field name from that entity definition.

Entities

Use entities to define stateless metadata for deterministic relationships.

{
  "entities": {
    "account": {
      "index": "accounts",
      "row_count": 1000,
      "key": { "field": "account_id", "type": "sequence", "start": 1 },
      "fields": {
        "tier": {
          "type": "weighted_choice",
          "values": ["free", "pro", "enterprise"],
          "weights": [0.7, 0.25, 0.05]
        }
      }
    }
  },
  "indexes": []
}

Entity field  Description
index         Index associated with the entity.
row_count     Number of entity rows.
key.field     Key field name.
key.type      Key generator type. Entity keys use sequence keys.
key.start     Starting sequence value.
fields        Additional deterministic entity fields.

Entities are useful when several generated indexes need consistent attributes for the same conceptual object.

Transforms

Transforms run after fields are generated. Use transforms for derivations that are easier to express in JavaScript or VRL than in field generator expressions.

JavaScript transform

"transform": {
  "type": "js",
  "source": "arg.total = arg.quantity * arg.unit_price; arg.discounted = arg.total > 500; arg"
}

VRL transform

"transform": {
  "type": "vrl",
  "source": ".total = .quantity * .unit_price\n.discounted = .total > 500"
}

Field   Description
type    js or vrl.
source  Transform source code.

Validation

Validation checks the plan before execution. It reports errors, warnings, computed dependency levels, and estimated row counts.

Validation checks include:

  • plan JSON is valid;
  • indexes is present and non-empty;
  • index names are unique;
  • worker, partition, and timeout values are greater than 0;
  • every index has exactly one row strategy;
  • depends_on references existing indexes;
  • per_parent.parent appears in depends_on;
  • junction sides reference valid upstream indexes and keys;
  • field source references use valid index.field syntax;
  • field source references point at defined fields;
  • stateless foreign keys specify target_entity or via;
  • entity catalog entries reference defined indexes and fields.

Example validation response shape:

{
  "valid": true,
  "errors": [],
  "warnings": [],
  "dag": {
    "levels": [["customers"], ["orders"], ["order_items"]]
  },
  "estimated_total_rows": 111000
}
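
The dag.levels value groups indexes so that every index appears after everything it depends on. One way to compute such levels from depends_on is sketched below; dependency_levels is a hypothetical helper, and sorting names within a level is a choice made here for stable output, not documented behavior.

```python
def dependency_levels(indexes):
    """Group index names into levels; each level depends only on earlier levels."""
    deps = {ix["name"]: set(ix.get("depends_on", [])) for ix in indexes}
    levels = []
    done = set()
    while len(done) < len(deps):
        # An index is ready once all of its dependencies are in earlier levels.
        ready = sorted(name for name, needs in deps.items()
                       if name not in done and needs <= done)
        if not ready:
            raise ValueError("dependency cycle or reference to an unknown index")
        levels.append(ready)
        done.update(ready)
    return levels


plan_indexes = [
    {"name": "customers"},
    {"name": "orders", "depends_on": ["customers"]},
    {"name": "order_items", "depends_on": ["orders"]},
]
print(dependency_levels(plan_indexes))  # [['customers'], ['orders'], ['order_items']]
```

Indexes in the same level have no dependencies on each other, so they are candidates for being generated in parallel.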

Execution status

Status shows lifecycle state, dependency progress, per-index progress, row counts, and summary counters.

{
  "state": "running",
  "current_level": 1,
  "total_levels": 3,
  "dag_levels": [["customers"], ["orders"], ["shipments"]],
  "indexes": {
    "customers": { "state": "completed", "rows_generated": 1000, "rows_total": 1000 },
    "orders": { "state": "running", "rows_generated": 2000, "rows_total": 10000 }
  },
  "summary": {
    "indexes_total": 3,
    "indexes_completed": 1,
    "rows_generated": 3000,
    "rows_total": 11000,
    "workers_completed": 2,
    "workers_total": 8
  },
  "run_generation": 1,
  "last_observed_generation": 0
}

Examples

The examples below start simple and gradually add relationships, arrays, transforms, and correlated data.

Example 1: one simple index

This plan creates 10 services.

{
  "seed": 1,
  "description": "Simple services dataset",
  "plan": {
    "indexes": [
      {
        "name": "services",
        "row_count": 10,
        "mapping": {
          "properties": {
            "service_id": { "type": "keyword" },
            "service_name": { "type": "keyword" },
            "tier": { "type": "keyword" }
          }
        },
        "fields": {
          "service_id": { "type": "sequence", "format": "SVC-{:03}" },
          "service_name": { "type": "fake", "category": "company.name" },
          "tier": {
            "type": "weighted_choice",
            "values": ["frontend", "backend", "data"],
            "weights": [0.3, 0.5, 0.2]
          }
        }
      }
    ]
  }
}

Example 2: optional fields and arrays

This plan creates hosts with optional owner data and a variable number of tags.

{
  "seed": 2,
  "description": "Hosts with optional owners and tags",
  "plan": {
    "indexes": [
      {
        "name": "hosts",
        "row_count": 1000,
        "mapping": {
          "properties": {
            "host_id": { "type": "keyword" },
            "hostname": { "type": "keyword" },
            "owner_email": { "type": "keyword" },
            "tags": { "type": "keyword" }
          }
        },
        "fields": {
          "host_id": { "type": "sequence", "format": "HOST-{:06}" },
          "hostname": {
            "type": "format",
            "pattern": "host-{rack}-{slot}",
            "rack": { "type": "uniform_int", "min": 1, "max": 50 },
            "slot": { "type": "uniform_int", "min": 1, "max": 40 }
          },
          "owner_email": {
            "type": "fake",
            "category": "internet.email",
            "optional": 0.25
          },
          "tags": {
            "type": "weighted_choice",
            "values": ["prod", "dev", "pci", "public", "internal"],
            "weights": [0.3, 0.2, 0.1, 0.2, 0.2],
            "cardinality": { "type": "uniform_int", "min": 1, "max": 3 },
            "unique": true
          }
        }
      }
    ]
  }
}

Example 3: customers and orders

This plan creates customers first, then orders that reference generated customers.

{
  "seed": 42,
  "description": "Customers and orders",
  "plan": {
    "max_parallel_workers": 4,
    "indexes": [
      {
        "name": "customers",
        "row_count": 1000,
        "mapping": {
          "properties": {
            "customer_id": { "type": "keyword" },
            "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
            "email": { "type": "keyword" },
            "region": { "type": "keyword" }
          }
        },
        "fields": {
          "customer_id": { "type": "sequence", "format": "CUST-{:06}" },
          "name": { "type": "fake", "category": "person.name" },
          "email": { "type": "fake", "category": "internet.email" },
          "region": {
            "type": "weighted_choice",
            "values": ["US", "EU", "APAC"],
            "weights": [0.5, 0.3, 0.2]
          }
        }
      },
      {
        "name": "orders",
        "depends_on": ["customers"],
        "row_count": 10000,
        "mapping": {
          "properties": {
            "order_id": { "type": "keyword" },
            "customer_id": { "type": "keyword" },
            "order_time": { "type": "date" },
            "amount": { "type": "double" },
            "status": { "type": "keyword" }
          }
        },
        "fields": {
          "order_id": { "type": "sequence", "format": "ORD-{:08}" },
          "customer_id": {
            "type": "foreign_key",
            "source": "customers.customer_id",
            "distribution": "zipf",
            "s": 1.1
          },
          "order_time": {
            "type": "timestamp_range",
            "start": "2026-01-01T00:00:00Z",
            "end": "2026-01-31T23:59:59Z"
          },
          "amount": { "type": "log_normal", "mean": 4.0, "stddev": 1.0, "min": 1.0 },
          "status": {
            "type": "weighted_choice",
            "values": ["created", "paid", "shipped", "cancelled"],
            "weights": [0.1, 0.6, 0.25, 0.05]
          }
        }
      }
    ]
  }
}

Example 4: per-parent users for each tenant

This plan creates tenants, then 10 to 50 users per tenant.

{
  "seed": 7,
  "description": "Tenants and users",
  "plan": {
    "indexes": [
      {
        "name": "tenants",
        "row_count": 100,
        "fields": {
          "tenant_id": { "type": "sequence", "format": "TEN-{:04}" },
          "tenant_name": { "type": "fake", "category": "company.name" },
          "plan": {
            "type": "weighted_choice",
            "values": ["free", "pro", "enterprise"],
            "weights": [0.6, 0.3, 0.1]
          }
        }
      },
      {
        "name": "users",
        "depends_on": ["tenants"],
        "per_parent": {
          "parent": "tenants",
          "count": { "type": "uniform_int", "min": 10, "max": 50 }
        },
        "fields": {
          "tenant_id": { "type": "parent_key" },
          "user_id": { "type": "sequence", "format": "USER-{:08}" },
          "name": { "type": "fake", "category": "person.name" },
          "email": { "type": "fake", "category": "internet.email" },
          "role": {
            "type": "weighted_choice",
            "values": ["admin", "analyst", "viewer"],
            "weights": [0.05, 0.35, 0.60]
          }
        }
      }
    ]
  }
}

Example 5: many-to-many relationship

This plan creates students, courses, and an enrollment junction index.

{
  "seed": 9,
  "description": "Students, courses, and enrollments",
  "plan": {
    "indexes": [
      {
        "name": "students",
        "row_count": 1000,
        "fields": {
          "student_id": { "type": "sequence", "format": "STU-{:06}" },
          "name": { "type": "fake", "category": "person.name" }
        }
      },
      {
        "name": "courses",
        "row_count": 100,
        "fields": {
          "course_id": { "type": "sequence", "format": "CRS-{:04}" },
          "course_name": { "type": "fake", "category": "commerce.product_name" }
        }
      },
      {
        "name": "enrollments",
        "depends_on": ["students", "courses"],
        "relationship": {
          "type": "junction",
          "left": { "index": "students", "key": "student_id" },
          "right": { "index": "courses", "key": "course_id" },
          "left_cardinality": { "type": "uniform_int", "min": 3, "max": 5 },
          "right_max": 80,
          "unique": true
        },
        "fields": {
          "student_id": { "type": "parent_key" },
          "course_id": { "type": "parent_key" },
          "grade": {
            "type": "weighted_choice",
            "values": ["A", "B", "C", "D", "F"],
            "weights": [0.25, 0.35, 0.25, 0.1, 0.05]
          }
        }
      }
    ]
  }
}

Example 6: conditional and correlated fields

This plan creates request logs where end_time follows start_time by 1 to 300 seconds, duration_ms is drawn independently from a log-normal distribution, and error_code depends on status.

{
  "seed": 12,
  "description": "Request logs with correlated fields",
  "plan": {
    "indexes": [
      {
        "name": "request_logs",
        "row_count": 50000,
        "fields": {
          "request_id": { "type": "uuid" },
          "service": {
            "type": "weighted_choice",
            "values": ["api", "auth", "billing", "worker"],
            "weights": [0.5, 0.2, 0.2, 0.1]
          },
          "start_time": {
            "type": "timestamp_range",
            "start": "2026-01-01T00:00:00Z",
            "end": "2026-01-02T00:00:00Z"
          },
          "duration_ms": { "type": "log_normal", "mean": 4.5, "stddev": 0.8, "min": 1.0, "max": 10000.0 },
          "end_time": {
            "type": "correlated",
            "source": "start_time",
            "offset": { "type": "uniform_int", "min": 1, "max": 300 },
            "offset_unit": "seconds"
          },
          "status": {
            "type": "weighted_choice",
            "values": [200, 200, 200, 404, 429, 500, 503],
            "weights": [0.3, 0.3, 0.2, 0.08, 0.05, 0.05, 0.02]
          },
          "error_code": {
            "type": "conditional",
            "on": "status",
            "cases": {
              "200": { "type": "constant", "value": null },
              "404": { "type": "constant", "value": "NOT_FOUND" },
              "429": { "type": "constant", "value": "RATE_LIMITED" },
              "500": { "type": "constant", "value": "INTERNAL" },
              "503": { "type": "constant", "value": "UNAVAILABLE" }
            }
          }
        }
      }
    ]
  }
}

Example 7: complex e-commerce dataset

This example uses multiple indexes, foreign keys, per-parent rows, expressions, weighted choices, timestamps, arrays, transforms, and a junction index.

{
  "seed": 20260507,
  "description": "E-commerce synthetic dataset",
  "plan": {
    "max_parallel_workers": 6,
    "partitions": 12,
    "workflow_timeout_seconds": 7200,
    "indexes": [
      {
        "name": "customers",
        "row_count": 10000,
        "mapping": {
          "properties": {
            "customer_id": { "type": "keyword" },
            "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
            "email": { "type": "keyword" },
            "region": { "type": "keyword" },
            "tier": { "type": "keyword" },
            "created_at": { "type": "date" }
          }
        },
        "fields": {
          "customer_id": { "type": "sequence", "format": "CUST-{:08}" },
          "name": { "type": "fake", "category": "person.name" },
          "email": { "type": "fake", "category": "internet.email" },
          "region": {
            "type": "weighted_choice",
            "values": ["US", "EU", "APAC", "LATAM"],
            "weights": [0.45, 0.25, 0.2, 0.1]
          },
          "tier": {
            "type": "weighted_choice",
            "values": ["bronze", "silver", "gold", "platinum"],
            "weights": [0.55, 0.25, 0.15, 0.05]
          },
          "created_at": {
            "type": "timestamp_range",
            "start": "2025-01-01T00:00:00Z",
            "end": "2026-01-01T00:00:00Z"
          }
        }
      },
      {
        "name": "products",
        "row_count": 2000,
        "mapping": {
          "properties": {
            "product_id": { "type": "keyword" },
            "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
            "category": { "type": "keyword" },
            "base_price": { "type": "double" },
            "tags": { "type": "keyword" }
          }
        },
        "fields": {
          "product_id": { "type": "sequence", "format": "PROD-{:06}" },
          "name": { "type": "fake", "category": "commerce.product_name" },
          "category": {
            "type": "weighted_choice",
            "values": ["electronics", "apparel", "home", "outdoor", "books"],
            "weights": [0.25, 0.25, 0.2, 0.15, 0.15]
          },
          "base_price": { "type": "log_normal", "mean": 3.8, "stddev": 0.9, "min": 1.0, "max": 2000.0 },
          "tags": {
            "type": "weighted_choice",
            "values": ["new", "sale", "popular", "premium", "eco", "gift"],
            "weights": [0.2, 0.15, 0.25, 0.1, 0.1, 0.2],
            "cardinality": { "type": "uniform_int", "min": 1, "max": 3 },
            "unique": true
          }
        }
      },
      {
        "name": "orders",
        "depends_on": ["customers"],
        "row_count": 100000,
        "mapping": {
          "properties": {
            "order_id": { "type": "keyword" },
            "customer_id": { "type": "keyword" },
            "region": { "type": "keyword" },
            "ordered_at": { "type": "date" },
            "status": { "type": "keyword" }
          }
        },
        "fields": {
          "order_id": { "type": "sequence", "format": "ORD-{:09}" },
          "customer_id": {
            "type": "foreign_key",
            "source": "customers.customer_id",
            "distribution": "zipf",
            "s": 1.15
          },
          "region": { "type": "value_set", "source": "customers.region" },
          "ordered_at": {
            "type": "timestamp_range",
            "start": "2026-01-01T00:00:00Z",
            "end": "2026-03-31T23:59:59Z",
            "distribution": "normal"
          },
          "status": {
            "type": "weighted_choice",
            "values": ["created", "paid", "shipped", "delivered", "cancelled"],
            "weights": [0.08, 0.12, 0.2, 0.55, 0.05]
          }
        }
      },
      {
        "name": "order_items",
        "depends_on": ["orders", "products"],
        "per_parent": {
          "parent": "orders",
          "count": { "type": "uniform_int", "min": 1, "max": 5 }
        },
        "mapping": {
          "properties": {
            "order_id": { "type": "keyword" },
            "item_id": { "type": "keyword" },
            "product_id": { "type": "keyword" },
            "quantity": { "type": "integer" },
            "unit_price": { "type": "double" },
            "line_total": { "type": "double" }
          }
        },
        "fields": {
          "order_id": { "type": "parent_key" },
          "item_id": { "type": "sequence", "format": "ITEM-{:010}" },
          "product_id": { "type": "foreign_key", "source": "products.product_id", "distribution": "zipf", "s": 1.05 },
          "quantity": { "type": "uniform_int", "min": 1, "max": 4 },
          "unit_price": { "type": "log_normal", "mean": 3.8, "stddev": 0.9, "min": 1.0, "max": 2000.0 },
          "line_total": { "type": "expression", "expr": "quantity * unit_price" }
        },
        "transform": {
          "type": "js",
          "source": "arg.line_total = Math.round(arg.line_total * 100) / 100; arg"
        }
      },
      {
        "name": "customer_product_affinity",
        "depends_on": ["customers", "products"],
        "relationship": {
          "type": "junction",
          "left": { "index": "customers", "key": "customer_id" },
          "right": { "index": "products", "key": "product_id" },
          "left_cardinality": { "type": "uniform_int", "min": 2, "max": 8 },
          "right_max": 1000,
          "unique": true
        },
        "fields": {
          "customer_id": { "type": "parent_key" },
          "product_id": { "type": "parent_key" },
          "affinity_score": { "type": "uniform_float", "min": 0.0, "max": 1.0 }
        }
      }
    ]
  }
}
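
Before running a plan of this size, it can help to estimate how many rows each index will contain. The following is a rough sketch, not part of Mach5. It assumes that uniform_int bounds are inclusive, so the mean count is (min + max) / 2, and that left_cardinality is the number of junction rows per left-side row. It ignores any reduction caused by right_max or unique. The helper names mean_count and estimate_rows are hypothetical.

```python
def mean_count(spec):
    """Mean of a count spec; only uniform_int is handled in this sketch."""
    if spec["type"] != "uniform_int":
        raise ValueError(f"unsupported count type: {spec['type']}")
    # Assumption: min and max are both inclusive.
    return (spec["min"] + spec["max"]) / 2


def estimate_rows(indexes):
    """Return {index name: expected row count}, resolving parents first."""
    by_name = {ix["name"]: ix for ix in indexes}
    estimates = {}

    def resolve(name):
        if name in estimates:
            return estimates[name]
        ix = by_name[name]
        if "row_count" in ix:
            rows = ix["row_count"]
        elif "per_parent" in ix:
            pp = ix["per_parent"]
            rows = resolve(pp["parent"]) * mean_count(pp["count"])
        elif "relationship" in ix:
            # Assumption: one junction row per left row per unit of cardinality.
            rel = ix["relationship"]
            rows = resolve(rel["left"]["index"]) * mean_count(rel["left_cardinality"])
        else:
            raise ValueError(f"cannot size index {name!r}")
        estimates[name] = rows
        return rows

    for name in by_name:
        resolve(name)
    return estimates


# Only the sizing-related keys from Example 7 are reproduced here.
example_7_indexes = [
    {"name": "customers", "row_count": 10000},
    {"name": "products", "row_count": 2000},
    {"name": "orders", "row_count": 100000},
    {"name": "order_items",
     "per_parent": {"parent": "orders",
                    "count": {"type": "uniform_int", "min": 1, "max": 5}}},
    {"name": "customer_product_affinity",
     "relationship": {"left": {"index": "customers", "key": "customer_id"},
                      "right": {"index": "products", "key": "product_id"},
                      "left_cardinality": {"type": "uniform_int", "min": 2, "max": 8}}},
]

for name, rows in estimate_rows(example_7_indexes).items():
    print(f"{name}: ~{rows:,.0f} rows")
```

Under these assumptions, order_items comes to about 300,000 rows (100,000 orders with a mean of 3 items each) and customer_product_affinity to about 50,000 rows (10,000 customers with a mean of 5 pairs each).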

Best practices

Use explicit mappings

Provide mappings for generated indexes that will be used in dashboards, SQL queries, or OpenSearch-compatible queries. Explicit mappings keep field types stable.

Name generated fields clearly

Use descriptive field names such as customer_id, order_id, ordered_at, and line_total. This makes downstream queries easier to write.

Put upstream indexes in depends_on

Any index referenced by foreign_key, value_set, ancestor_key, conditional, correlated, per_parent, or relationship should be listed in depends_on.
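
A client-side check can catch a missing depends_on entry before the plan is submitted. The sketch below is one possible approach, not a Mach5 feature. It covers only the reference shapes that appear in Example 7: a "source" of the form index.field on foreign_key and value_set fields, per_parent.parent, and the left and right sides of a relationship. It does not handle ancestor_key, conditional, or correlated. The function names are hypothetical.

```python
def referenced_indexes(index):
    """Collect upstream index names referenced by one index definition."""
    refs = set()
    for field in index.get("fields", {}).values():
        # "source" is written as "<index>.<field>" in the examples.
        if field.get("type") in ("foreign_key", "value_set") and "source" in field:
            refs.add(field["source"].split(".", 1)[0])
    if "per_parent" in index:
        refs.add(index["per_parent"]["parent"])
    rel = index.get("relationship", {})
    for side in ("left", "right"):
        if side in rel:
            refs.add(rel[side]["index"])
    return refs


def missing_depends_on(indexes):
    """Return {index name: sorted upstream names absent from depends_on}."""
    problems = {}
    for index in indexes:
        missing = referenced_indexes(index) - set(index.get("depends_on", []))
        if missing:
            problems[index["name"]] = sorted(missing)
    return problems


indexes = [
    {"name": "customers", "row_count": 100, "fields": {}},
    {
        "name": "orders",
        "depends_on": [],  # deliberately incomplete for the demonstration
        "row_count": 1000,
        "fields": {
            "customer_id": {"type": "foreign_key", "source": "customers.customer_id"},
        },
    },
]

print(missing_depends_on(indexes))  # {'orders': ['customers']}
```

An empty result means every reference the sketch recognizes is already listed in depends_on.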

Start small

Validate and execute a smaller dataset first. Increase row_count, per_parent counts, and relationship cardinalities after the generated shape is correct.
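
One way to start small is to derive a reduced copy of the plan document and validate and execute that copy first. The sketch below assumes the plan has the same shape as the examples (a top-level "plan" object with an "indexes" list). It scales only row_count. Per-parent counts and relationship cardinalities are left as written, so dependent indexes shrink along with their parents. The function name scale_plan is hypothetical.

```python
import copy


def scale_plan(plan_doc, factor):
    """Return a copy of a plan document with every row_count scaled by factor."""
    scaled = copy.deepcopy(plan_doc)  # leave the original plan untouched
    for index in scaled["plan"]["indexes"]:
        if "row_count" in index:
            # Keep at least one row so dependent indexes still have a parent.
            index["row_count"] = max(1, round(index["row_count"] * factor))
    return scaled


full_plan = {
    "seed": 42,
    "plan": {
        "indexes": [
            {"name": "customers", "row_count": 10000},
            {"name": "orders", "depends_on": ["customers"], "row_count": 100000},
            {
                "name": "order_items",
                "depends_on": ["orders"],
                "per_parent": {
                    "parent": "orders",
                    "count": {"type": "uniform_int", "min": 1, "max": 5},
                },
            },
        ]
    },
}

small_plan = scale_plan(full_plan, 0.01)
for index in small_plan["plan"]["indexes"]:
    print(index["name"], index.get("row_count", "per_parent"))
```

The seed is carried over unchanged. The document states only that the same seed and row definitions reproduce the same data, so the reduced plan should not be expected to produce a subset of the full dataset.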

Use one seed per scenario

Keep the same seed when you want repeatable output. Change the seed when you want a different dataset with the same shape.

Prefer transforms for complex row-level derivations

Use expression for simple arithmetic. Use JavaScript or VRL transforms when the derivation needs branching, rounding, or multiple calculated fields.
