Documentation

Ingesting data from S3 into Mach5

This document explains how to ingest data from files present in an S3 bucket into a Mach5 Search index by configuring connections, setting up ingest pipelines, and verifying ingestion using the Mach5 UI.

Prerequisites

  • This document assumes that Mach5 is deployed and running successfully. The Mach5 Administrative UI looks as shown below; let's assume it is running at http://localhost:8888/

FirstPageUI

  • A store, store route, and warehouse have been created successfully. Refer to the Quickstart document for help
  • Consider a file in the S3 bucket s3ipdata containing the following 10 records:
{ "id": "1", "ip_address": "192.168.1.1" } 
{ "id": "2", "ip_address": ["192.168.1.2", "10.0.0.5"] } 
{ "id": "3", "ip_address": "10.0.0.1" }
{ "id": "4", "ip_address": ["10.0.0.2", "8.8.8.8"] } 
{ "id": "5", "ip_address": "172.16.0.1" }
{ "id": "6", "ip_address": ["172.16.0.2", "192.168.1.1"] } 
{ "id": "7", "ip_address": "8.8.8.8" }
{ "id": "8", "ip_address": ["8.8.4.4", "10.0.0.5"] }
{ "id": "9", "ip_address": "192.168.2.1" } 
{ "id": "10", "ip_address": ["192.168.2.2", "8.8.4.4"] }
  • A Mach5 index is already created with the name s3index and with relevant mappings for the S3 data (a sketch of such an index definition is shown below). If no mappings are provided, Mach5 dynamically infers the mapping while ingesting data; note that some data may not get ingested if mappings are not provided.
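
For reference, below is a minimal sketch of creating such an index with explicit mappings from Dev Tools. It assumes Mach5 accepts Elasticsearch-style index creation requests; the field names come from the sample records above, and the mapping types shown (keyword for id, ip for ip_address) are illustrative, so adjust them to your data.

# Hypothetical sketch: create s3index with id mapped as keyword and ip_address as ip
PUT s3index
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "ip_address": { "type": "ip" }
    }
  }
}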

Connections

A connection is the entity that stores the properties needed to connect to, or access, the source where the data resides, for instance Iceberg, Kafka, or S3. Note that the same connection can be reused across multiple ingest pipelines.

Click Connections in the left panel of the Mach5 UI. This opens the following page:

ConnectionsBlankPage

Add a new connection

Click the + icon on the Connections page to create a new connection.

S3ConnectionCreation

  • Name: Provide a name for the connection, e.g. s3conn
  • Connection Type: There are three options in the dropdown: AwsS3, Iceberg, and Kafka. Choose AwsS3
  • S3 Endpoint: For an AWS deployment, this can be empty. For S3-compatible object stores, such as MinIO, specify that endpoint
  • Access Key: For an AWS deployment, this can be empty. For S3-compatible object stores, such as MinIO, specify the Access Key
  • Secret Name: For an AWS deployment, this can be empty. For S3-compatible object stores, such as MinIO, specify the Secret Name
  • Secret Key: For an AWS deployment, this can be empty. For S3-compatible object stores, such as MinIO, specify the Secret Key
  • Default Region: Specify the AWS region for the S3 endpoint. The default value is us-east-1
  • Test: Click the Test button to verify connectivity to S3
  • Click Save

S3ConnectionTest

The test connection image above shows an example for an S3-compatible object store, e.g. MinIO

Verify new connection

Verify that the new connection appears on the Connections page

S3ConnectionDetails

Ingest Pipelines

Mach5 Search ingest pipelines allow you to process and ingest documents from various sources such as Iceberg, Kafka, and S3 buckets. This is useful for transforming, enriching, or modifying data at ingestion time. To access the source data, an ingest pipeline needs a connection to that source. For example, if you are creating an ingest pipeline that indexes data from an S3 bucket, you need a connection of type AwsS3.

Click Ingest Pipelines in the left panel of the Mach5 UI. This opens the following page:

IngestPipelineBlank

Add an ingest pipeline

Click the + icon on the Ingest Pipelines page to create a new ingest pipeline.

IngestPipelineCreation1

  • Name: Provide a name for the ingest pipeline, e.g. s3ingest-pipeline
  • Index: Specify the name of the Mach5 index, e.g. s3index. Note that the index must be created in Mach5 before configuring the ingest pipeline
  • Transform Type: Select either None or Javascript. This allows data to be transformed before ingestion. If Javascript is selected, specify the script details in the given box
  • Connection Name: Select the connection that was created earlier, e.g. s3conn
  • Ingest Pipeline Type: The options are S3, Iceberg, and Kafka. Select S3

IngestPipelineCreation2

  • Bucket: Specify the S3 bucket from which data will be ingested, e.g. s3ipdata
  • Prefix: The key prefix within the S3 bucket to read from
  • Filter Regex: Filters the files in the S3 bucket according to the given regular expression
  • Data Format: The format of the data files in the S3 bucket. The options are NDJSON and DELIMITED. Select the option that matches your data files; the sample records above are NDJSON
  • Compression Code: The compression type of the data files stored in S3. The options are None, gzip, and zstd. Select the option that matches your data files
  • Advanced: Leave the Advanced section at its defaults, with Operation Mode set to Append/Upsert
  • Enabled: Select this checkbox to enable the ingest pipeline
  • Click Save
  • Once the ingest pipeline is created, records from the S3 bucket data start getting reflected in the Mach5 index

Verify an ingest pipeline

On the Ingest Pipelines page, verify that the ingest pipeline has been created properly

IngestPipelineDetails

Once the s3ingest-pipeline ingest pipeline is successfully created, records from S3 are ingested into the Mach5 index s3index. As new data files arrive in the S3 bucket, they are added to the Mach5 index

Verify data ingestion

Using Mach5 Dashboards - Dev Tools, verify that the data from S3 has been ingested into Mach5

  • Execute a count query on the Mach5 index to verify the number of ingested records (a sample request is sketched below). As expected, the count is 10 records, matching what was uploaded to the S3 bucket s3ipdata; refer to Prerequisites
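
For example, a count request like the one below can be run from Dev Tools. This is a sketch that assumes Mach5 Dev Tools accepts Elasticsearch-style requests:

# Return the number of documents in the s3index index
GET s3index/_count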

DevToolsCount

  • Execute a search query on the Mach5 index to verify the ingested records (a sample request is sketched below).
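
For example, a search request like the one below returns the ingested documents. Again, this is a sketch assuming Elasticsearch-style request syntax; size is set to 10 to cover the sample data set:

# Fetch all documents from s3index; size 10 covers the sample records
GET s3index/_search
{
  "query": { "match_all": {} },
  "size": 10
}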

DevToolsSearch

As expected, the output shows the 10 ipdata records that were uploaded to the S3 bucket s3ipdata; refer to Prerequisites

Disable ingest pipeline

When not in use, an ingest pipeline can be disabled so that updates to the source data are not reflected in Mach5

IPList

  • To disable an existing pipeline, click the Edit icon next to the ingest pipeline, e.g. s3ingest-pipeline
  • On the Edit ingest pipeline page, at the end of the options and before the Save button, deselect the Enabled checkbox

DisableIP

  • Save the ingest pipeline for the change to take effect
  • The s3ingest-pipeline ingest pipeline is now disabled and will no longer read data from the source into Mach5. It can be re-enabled whenever it needs to be reused

DisableIPDetails