How to Leverage AWS Firehose in an Event-Driven Architecture
A simple tool to ingest all your domain events
I’ve been working with AWS Firehose in production for the last 1.5 years, and I’ve learned a few things along the way. In my case, I have microservices deployed in Kubernetes consuming data from Kafka, transforming it a bit, and sending it to AWS Firehose.
Many things happened during this journey.
But Marcos, why do I have to learn a tool like AWS Firehose if I’m a Software Engineer?
That’s a good question.
🎁 But before that, just to let you know that at the end of this email, I have something useful for you and your career growth as a Software Engineer or Tech Lead.
Now, the answer to the previous question.
A tool like AWS Firehose plays a critical role in scalable, real-time data architectures. Key reasons include:
Simplifies Data Ingestion: Eliminates the overhead of managing infrastructure, scaling, and retries.
Essential for Streaming Architectures: Enables real-time pipelines to distribute your data in event-driven systems.
Optimized Cost & Performance: Provides built-in buffering, compression, and transformation to reduce storage and processing costs.
Seamless Integration: Natively connects with AWS services like S3, Redshift, OpenSearch, and third-party endpoints.
If you have use cases that match some of the points indicated above, you will find the rest of the email interesting.
In today’s newsletter edition, I will explain:
What is AWS Firehose?
Use Cases.
How to use AWS Firehose in production.
Limitations.
And Cost Considerations.
Let’s get started.
🧐 What is AWS Firehose?
☝🏼 Amazon Kinesis Data Firehose (a.k.a. AWS Firehose) is designed to capture, transform, and load streaming data with minimal setup and management.
Unlike other streaming platforms where you must manage the infrastructure, Firehose abstracts much of that complexity so that engineers can focus on developing application logic and analytical models.
Whether through the AWS Console or a Terraform module, you can configure every feature of this managed service.
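If you prefer code to clicking around, the same kind of setup can also be sketched with the AWS SDK. Here is a minimal sketch in Python with boto3; the stream name, role, bucket ARNs, and buffering values are placeholder assumptions, not a recommended configuration:

```python
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

# Create a delivery stream that buffers incoming records and writes
# them to S3 as compressed files (all names and ARNs are placeholders).
firehose.create_delivery_stream(
    DeliveryStreamName="domain-events-to-s3",
    DeliveryStreamType="DirectPut",  # our app will call the API directly
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-domain-events-bucket",
        # Firehose flushes whichever buffering limit is hit first.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)
```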
👉🏼 AWS Firehose accepts different kinds of data sources, like:
Amazon MSK (managed Apache Kafka).
Direct PUT (from your app).
CloudWatch.
SNS.
IoT.
And more: around 20 AWS services in total (as of today).
👉🏼 And different kinds of target destinations, like:
S3.
OpenSearch.
Redshift.
Splunk.
Apache Iceberg.
Custom HTTP endpoint.
S3 Tables (quite recent).
Its serverless architecture scales automatically with the volume of data ingested, which makes it ideal for scenarios where real-time analytics are essential.
So, on paper, this technology looks good.
But is it the right fit for your use case?
🎯 Use Cases
☝🏼 AWS Firehose is best suited for scenarios that require real-time or near-real-time data ingestion and processing. Here are some common use cases:
Log and Event Data Ingestion: Stream logs from applications or infrastructure into a central data store for real-time monitoring and alerting.
IoT Data Streaming: Capture and process data from IoT devices, enabling quick insights and immediate actions.
Real-Time Analytics: Load streaming data into Amazon Redshift or OpenSearch for real-time business intelligence dashboards.
Security and Compliance: Stream security events and audit logs to central repositories for threat detection and compliance monitoring.
Clickstream Analysis: Ingest user interaction data from web or mobile applications to gain insights into customer behavior.
👉🏼 In my case, I’m using it for near-real-time data ingestion. I have a microservice, deployed in Kubernetes, that sends data to AWS Firehose. The AWS Firehose is configured to write all the data as files into a specific S3 bucket.
From there, the Data Engineering team takes over and uses their ETLs to do magic with that data.
Now, let’s dig a bit more into how to use this managed service in Production.
⚙️ How to use AWS Firehose in production
☝🏼 We all see the typical Hello World in the documentation around, and the configuration is pretty straightforward: something like setting up the ARN of the AWS Firehose delivery stream, and that’s it.
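That Hello World looks roughly like this, a minimal sketch in Python with boto3 (the delivery stream name is a placeholder; note that the Put APIs take the stream name, while the ARN shows up in IAM and infrastructure config):

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

event = {"type": "order_created", "order_id": "1234"}

# Send a single record; Firehose buffers it and delivers it to the target.
# The trailing newline keeps records separated once they land in S3 files.
firehose.put_record(
    DeliveryStreamName="domain-events-to-s3",  # placeholder name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```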
👉🏼 But in this newsletter, I write about real life, and in real life, you have networking, security, and many other things that prevent you from just setting the ARN in your application’s properties.
In my case, I have my application running in Kubernetes. To allow my application to connect to AWS Firehose, I deploy a Kubernetes Service Account that is associated with an IAM Role. That IAM Role is the one that holds the permissions to use AWS Firehose. So, in my setup, I indicate:
ARN of the IAM Role to be used.
Region.
The AWS SDK for your language picks up that information via environment variables and uses it to connect to AWS Firehose.
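With that in place, the application code stays credential-free. For example, with Python and boto3, this sketch assumes the Service Account setup injects the standard web identity environment variables into the pod:

```python
import boto3

# No credentials in code: the Kubernetes Service Account mounts a token,
# and the SDK reads AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE and
# AWS_REGION from the environment to assume the IAM Role automatically.
firehose = boto3.client("firehose")
```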
☝🏼 Also, you have to know that you can configure multiple “delivery streams” in your AWS Firehose. Each delivery stream points to a specific target. In the configuration of the IAM Role indicated above, you can even restrict the set of delivery streams that are allowed to be used.
In my case, I have multiple delivery streams in the same AWS Firehose, and each of them points to a different S3 bucket.
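As an illustration of that restriction, the policy attached to the IAM Role can list only the delivery streams your service should touch. A sketch of such a policy document (region, account ID, and stream names are placeholders):

```python
# IAM policy document you would attach to the role (placeholders throughout):
firehose_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Only the write actions the application actually needs.
            "Action": ["firehose:PutRecord", "firehose:PutRecordBatch"],
            # Only the delivery streams this service is allowed to use.
            "Resource": [
                "arn:aws:firehose:eu-west-1:123456789012:deliverystream/orders-to-s3",
                "arn:aws:firehose:eu-west-1:123456789012:deliverystream/payments-to-s3",
            ],
        }
    ],
}
```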
♾️ Limitations
Despite its ease of use, AWS Firehose comes with some limitations:
Limited Data Transformation: Although Firehose supports data transformation via AWS Lambda (see the sketch after this list), the complexity is constrained by the Lambda execution environment (memory and timeout limits). For more advanced processing, you might need to integrate with AWS Kinesis Data Analytics.
Latency Concerns: Due to its buffering mechanism, there might be slight delays between data ingestion and delivery, which may not be suitable for ultra-low latency requirements. In my experience, this is not a big deal, and in less than 5 seconds, you have your data in the target service.
Fixed Batch Size: This is the most relevant one for me. You can send your data record by record, or you can send a set of records in a single batch, but the API limits a batch to 500 records per call.
Error Handling Complexity: Though Firehose provides automatic retries, persistent errors (such as schema mismatches) require careful handling and monitoring, potentially adding operational overhead.
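For reference, here is a minimal sketch of what such a transformation Lambda looks like. Firehose hands the handler a batch of base64-encoded records and expects each one back with a status (the transformation itself is a placeholder):

```python
import base64
import json

def handler(event, context):
    """Firehose invokes this with a batch of records to transform."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # your real transformation goes here
        output.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```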
💸 Cost Considerations
☝🏼 Money. Money is a dimension we, as Software Engineers, have to take into account when we design a technical solution.
👉🏼 The pricing is primarily based on the volume of data ingested and delivered. Here are the key cost factors:
Data Ingestion Volume: Charges are incurred per GB of data ingested into the service.
Data Transformation: If you enable Lambda-based transformations, additional costs related to Lambda execution are applied.
Data Delivery: Depending on the destination (e.g., S3, Redshift), there may be additional costs associated with data transfer or storage. I find S3 very flexible and cheap, and from S3 you can easily jump to other AWS services.
Optional Features: Enabling compression and encryption might slightly impact performance, but can reduce storage costs and enhance security.
Please take a more detailed look at the pricing on the official web page.
Alright! Let’s wrap up for today.
✨ Takeaways
You read about what AWS Firehose is, the use cases where it makes sense, tips for configuring it in production, the limitations, and even the cost dimensions that AWS applies.
Let me finish by providing a short set of important takeaways for you.
Near-real-time is where I find AWS Firehose shines. I haven’t tried it for pure real-time; maybe it can be tuned further, but a delay of ~5 seconds before the data appears in S3 could be a problem for real-time use cases, while it’s perfectly fine for near-real-time ones.
If you send data from your application, I recommend using the Batch API because the performance will be better (see the sketch after this list).
If you can deploy AWS Firehose via Infrastructure as Code or with a framework like CDK, all the better. The UI is nice, but you will lose track of what you have deployed. Besides, we are engineers, we like code 😊.
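To illustrate that batching recommendation, here is a minimal sketch in Python with boto3, respecting the 500-record limit mentioned earlier (the stream name is a placeholder):

```python
import json
import boto3

MAX_BATCH_SIZE = 500  # PutRecordBatch accepts at most 500 records per call

firehose = boto3.client("firehose")

def send_events(events):
    """Send events to Firehose in chunks of up to 500 records."""
    for i in range(0, len(events), MAX_BATCH_SIZE):
        chunk = events[i : i + MAX_BATCH_SIZE]
        response = firehose.put_record_batch(
            DeliveryStreamName="domain-events-to-s3",  # placeholder name
            Records=[
                {"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in chunk
            ],
        )
        # Firehose can partially fail a batch; check the count and retry
        # the failed records if needed.
        if response["FailedPutCount"] > 0:
            print(f"{response['FailedPutCount']} records failed; retry them")
```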
Do you have experience with AWS Firehose or a similar tool? I would like to hear from your experience; reply to this email or in the comments of Substack, I read all the messages.
🎁 And the gift for you.
You have the opportunity to decide what you want to learn more about. For free. Click the button below to find out how 👇🏻
Also, a bit further down in this email, you will find 3 curated links related to today’s reading. Hope you like them, and subscribe!
We are ✨1175 Optimist Engineers✨!! 🚀
Thanks for supporting my work here, I really appreciate it!
If you enjoyed this article, then click the 💜. It helps!
If you know someone else will benefit from this, ♻️ share this post.