Documentation

Ingesting data from S3 into Mach5

This document explains how to ingest data from files present in an S3 bucket into a Mach5 Search index by configuring connections, setting up ingest pipelines, and verifying ingestion using the Mach5 UI.

Prerequisites

  • This document assumes that Mach5 is deployed and running successfully. The Mach5 Administrative UI looks as shown below; let's assume it is running at http://localhost:8888/

FirstPageUI

  • A store, store route, and warehouse have been created successfully. Refer to the Quickstart document for help
  • Consider a file in the S3 bucket s3ipdata containing the following 10 records:
{ "id": "1", "ip_address": "192.168.1.1" } 
{ "id": "2", "ip_address": ["192.168.1.2", "10.0.0.5"] } 
{ "id": "3", "ip_address": "10.0.0.1" }
{ "id": "4", "ip_address": ["10.0.0.2", "8.8.8.8"] } 
{ "id": "5", "ip_address": "172.16.0.1" }
{ "id": "6", "ip_address": ["172.16.0.2", "192.168.1.1"] } 
{ "id": "7", "ip_address": "8.8.8.8" }
{ "id": "8", "ip_address": ["8.8.4.4", "10.0.0.5"] }
{ "id": "9", "ip_address": "192.168.2.1" } 
{ "id": "10", "ip_address": ["192.168.2.2", "8.8.4.4"] }
  • A Mach5 index is already created with the name s3index and with relevant mappings for the S3 data (a sketch of such an index definition is shown below). If no mappings are provided, Mach5 dynamically infers the mapping while ingesting data; note that some data may not get ingested if mappings are not provided.
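
For reference, below is a minimal sketch of creating such an index with explicit mappings from Dev Tools. It assumes Mach5 accepts Elasticsearch-style index creation requests; the field names come from the sample records above, and the mapping types shown (keyword for id, ip for ip_address) are illustrative, so adjust them to your data.

# Hypothetical sketch: create s3index with id mapped as keyword and ip_address as ip
PUT s3index
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "ip_address": { "type": "ip" }
    }
  }
}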

Connections

A connection is the entity that stores the properties needed to connect to, or access, the source where the data resides, for instance Iceberg, Kafka, or S3. Note that the same connection can be reused across multiple ingest pipelines.

Click Connections in the left panel of the Mach5 UI. This opens the following page:

ConnectionsBlankPage

Add a new connection

Click the + icon on the Connections page to create a new connection.

S3ConnectionCreation

  • Name: Provide a name for the connection, e.g. s3conn
  • Connection Type: There are three options in the dropdown: AwsS3, Iceberg, and Kafka. Choose AwsS3
  • S3 Endpoint: For an AWS deployment, this can be empty. For S3-compatible object stores, such as MinIO, specify that endpoint
  • Access Key: For an AWS deployment, this can be empty. For S3-compatible object stores, such as MinIO, specify the Access Key
  • Secret Name: For an AWS deployment, this can be empty. For S3-compatible object stores, such as MinIO, specify the Secret Name
  • Secret Key: For an AWS deployment, this can be empty. For S3-compatible object stores, such as MinIO, specify the Secret Key
  • Default Region: Specify the AWS region for the S3 endpoint. The default value is us-east-1
  • Test: Click the Test button to verify connectivity to S3
  • Click Save

S3ConnectionTest

The test connection image above shows an example for an S3-compatible object store, e.g. MinIO

Verify new connection

Verify that the new connection appears on the Connections page

S3ConnectionDetails

Ingest Pipelines

Mach5 Search ingest pipelines allow you to process and ingest documents from various sources such as Iceberg, Kafka, and S3 buckets. This is useful for transforming, enriching, or modifying data at ingestion time. To access the source data, an ingest pipeline needs a connection to that source. For example, if you are creating an ingest pipeline that indexes data from an S3 bucket, you need a connection of type AwsS3.

Click Ingest Pipelines in the left panel of the Mach5 UI. This opens the following page:

IngestPipelineBlank

Add an ingest pipeline

Click the + icon on the Ingest Pipelines page to create a new ingest pipeline.

IngestPipelineCreation1

  • Name: Provide a name for the ingest pipeline, e.g. s3ingest-pipeline
  • Index: Specify the name of the Mach5 index, e.g. s3index. Note that the index must be created in Mach5 before configuring the ingest pipeline
  • Transform Type: Select either None or Javascript. This allows data to be transformed before ingestion. If Javascript is selected, specify the script details in the given box
  • Connection Name: Select the connection that was created earlier, e.g. s3conn
  • Ingest Pipeline Type: The options are S3, Iceberg, and Kafka. Select S3

IngestPipelineCreation2

  • Bucket: Specify the S3 bucket from which data will be ingested, e.g. s3ipdata
  • Prefix: The key prefix within the S3 bucket to read from
  • Filter Regex: Filters the files in the S3 bucket according to the given regular expression
  • Data Format: The format of the data files in the S3 bucket. The options are NDJSON and DELIMITED. Select the option that matches your data files; the sample records above are NDJSON
  • Compression Code: The compression type of the data files stored in S3. The options are None, gzip, and zstd. Select the option that matches your data files
  • Advanced: Leave the Advanced section at its defaults, with Operation Mode set to Append/Upsert
  • Enabled: Select this checkbox to enable the ingest pipeline
  • Click Save
  • Once the ingest pipeline is created, records from the S3 bucket data start getting reflected in the Mach5 index

Verify an ingest pipeline

On the Ingest Pipelines page, verify that the ingest pipeline has been created properly

IngestPipelineDetails

Once the s3ingest-pipeline ingest pipeline is successfully created, records from S3 are ingested into the Mach5 index s3index. As new data files arrive in the S3 bucket, they are added to the Mach5 index

Verify data ingestion

Using Mach5 Dashboards - Dev Tools, verify that the data from S3 has been ingested into Mach5

  • Execute a count query on the Mach5 index to verify the number of ingested records (a sample request is sketched below). As expected, the count is 10 records, matching what was uploaded to the S3 bucket s3ipdata; refer to Prerequisites
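
For example, a count request like the one below can be run from Dev Tools. This is a sketch that assumes Mach5 Dev Tools accepts Elasticsearch-style requests:

# Return the number of documents in the s3index index
GET s3index/_count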

DevToolsCount

  • Execute a search query on the Mach5 index to verify the ingested records (a sample request is sketched below).
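
For example, a search request like the one below returns the ingested documents. Again, this is a sketch assuming Elasticsearch-style request syntax; size is set to 10 to cover the sample data set:

# Fetch all documents from s3index; size 10 covers the sample records
GET s3index/_search
{
  "query": { "match_all": {} },
  "size": 10
}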

DevToolsSearch

As expected, the output shows the 10 ipdata records that were uploaded to the S3 bucket s3ipdata; refer to Prerequisites

Disable ingest pipeline

When not in use, an ingest pipeline can be disabled so that updates to the source data are not reflected in Mach5

IPList

  • To disable an existing pipeline, click the Edit icon next to the ingest pipeline, e.g. s3ingest-pipeline
  • On the Edit ingest pipeline page, at the end of the options and before the Save button, deselect the Enabled checkbox

DisableIP

  • Save the ingest pipeline for the change to take effect
  • The s3ingest-pipeline ingest pipeline is now disabled and will no longer read data from the source into Mach5. It can be re-enabled whenever it needs to be reused

DisableIPDetails