Let's kick it off. And I'm here today to talk to you about failures. And of course, there's gonna be a lot of details about Amazon Kinesis Data Streams and AWS Lambda and how they work together and how they interact. But through it all, I will be focusing on my personal favorite subject, the failures.

And technically, of course, this is a Kinesis and Lambda talk. But by the end of it, you might actually realize this is not that much about Kinesis and Lambda at all. Let's see how it goes.

And on a more practical note, you are gonna see this anchor icon appearing on some of the key takeaway slides. So keep your eyes open for that. And with that, let's get started.

And I want to start out with Gall's law, which says that all the complex systems that work once started out as simple systems that work. So it's only wise for us to start by focusing on such a simple system, or simple architecture. And this one has been becoming increasingly popular lately.

It's a simple architecture for near real time data streaming. It is asynchronous by nature. We are using the so-called storage first pattern: we want to capture the data before we process it. And for that, we take a Kinesis Data Stream, we write our data to it with our producer application, so the application that generates our data. And on the other end, we connect AWS Lambda to it to read and consume that data from the stream, and we can do whatever we want with that data after that, right? Sky is the limit.

So this should sound simple. But I'll tell you a quick real life story today, and you will decide the rest on your own. But first, really quickly, who am I? Why am I here today? So, my name is Anna and I'm a lead cloud software engineer at a European company called Solita. I'm personally based in Finland, and I'm also an AWS Data Hero, which, by the way, doesn't mean that AWS tells me what to say. I'm still sharing just my personal thoughts and experiences with you.

Alright. All the introductions out of the way. Let's get to our story. And as I already said, this story is going to be mainly about Kinesis Data Streams, but not entirely. So I was working with this customer where we were using this simple yet powerful architecture for near real time data streaming. And all was going well, until one day we realized that we were actually losing data. Talking about storage first, right?

And we didn't realize it because the logs told us about it. At least on the producer side, there were not many errors visible. Neither did the metrics tell us much, or I would say we didn't quite realize what the metrics were trying to say. Let's just say, for now, that the metrics were somewhat controversial. But instead, we started to notice that our consumer Lambda started to behave weirdly. Like, really, really oddly. I mean, "write a ticket to AWS support" kind of oddly, when you basically have no idea what's happening. And it kind of started to unfold from there.

So let's go step by step and try to figure out what was actually going on. And let's start with a brief introduction of Kinesis itself. So Kinesis Data Streams is a fully managed massively scalable service to stream data on AWS. After you write your data to the stream, it is available for you to read within milliseconds, hence the near real time and then it's safely stored in your stream for at least 24 hours or you can configure it to be up to one year.

And during that time, you can replay or consume that data, process that data in any way that you want. But you cannot delete the data from the stream. Once it's written to the stream, it stays in the stream for at least 24 hours. And these data replay and retention capabilities are one distinct feature that separates Kinesis from some other similar services that AWS has to offer.

So the things that make Kinesis very powerful are that, well, you don't have to manage any servers or clusters, it scales pretty much endlessly, and it also integrates really well within the AWS ecosystem itself. So there are many services that can either write to Kinesis or read from Kinesis directly, or actually do both.

And when it comes to reading from Kinesis, you can have your custom data consumers that you attach to your stream to consume your data in real time in any way that you prefer. And that's where the actual magic happens. They will also tell you in the AWS documentation that Kinesis is serverless. Is it though? I'm kind of here to tell you right off the bat: it's not. Not fully, at least, not in my opinion. A lot of disclaimers there right now. And yes, I know I have the word serverless in the title of my talk. That's fine. I'm gonna explain in just a minute what I mean here.

Also, if you're someone like my fellow AWS Heroes who have been trying to tell us that serverless, in fact, is not binary (as with many things in life, it's a spectrum), then you might think, what am I talking about here? But I'll explain to you in a second why I think this is not fully serverless on that spectrum.

So to achieve the massive scalability that it promises, Kinesis uses the concept of a shard. And you can think of a shard as an ordered queue within your stream, and each stream will be composed of multiple such queues. And each queue, or each shard, will come with a limited capacity. So you can write up to one megabyte or 1000 records of data per second to each shard. And the number of shards that you have in the stream defines the available throughput that you are going to have in the stream.

So the throughput of each shard is limited, but the number of shards you can have in the stream is virtually unlimited. So you provision as many shards as you need to stream as much data as you want. And that, to me, doesn't sound too serverless. Moreover, the amount of incoming data is probably not gonna stay the same; data rarely does.

There are usually fluctuations during different times of the day or during different days of the week or a month or what not. And if the amount of incoming data gets larger than your stream can accommodate, well, you will need to scale it up. You will need to add more shards to it. And this is a process called resharding.

There are different strategies to do so. But whatever you do, it's your responsibility to do it, and you will use some API calls for that. And you can, of course, automate that process; you can build auto scaling solutions. There are different approaches to that, but it's still something you are gonna build and you are gonna maintain as well. It's not a part of the Kinesis service.
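Just to make that concrete, here is a minimal sketch of what such a resharding call could look like with the AWS SDK for JavaScript v3; the region, stream name and target shard count are made-up examples.

```typescript
import { KinesisClient, UpdateShardCountCommand } from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "eu-west-1" });

// Scale the stream to a new shard count. UNIFORM_SCALING lets Kinesis
// decide how to split or merge the existing shards evenly.
await kinesis.send(
  new UpdateShardCountCommand({
    StreamName: "my-data-stream", // hypothetical stream name
    TargetShardCount: 4,
    ScalingType: "UNIFORM_SCALING",
  })
);
```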

The other option for you is of course to over provision the stream. So you just add a huge amount of shards to your stream, not to worry about all these data fluctuations ever again. But of course, there's a catch because with Kinesis Data Streams, you don't only pay for the data that you are streaming, you are also paying for every open shard in your stream.

So you provision your capacity and then you pay for your provisioned capacity. So again, doesn't ring a servless bell to me.

So by now you might be asking, why then does AWS claim it's serverless? Oh, I'm not going to put words in their mouth, but my guess is this, and I haven't been entirely honest with you: there is this thing called capacity modes nowadays. So you can actually choose between two different capacity modes for your Kinesis stream.

And one of them is this good old, not so serverless provisioned capacity mode, with all the shard management and manual scaling. And then there is this newer option called on demand capacity mode, and the on demand mode promises you auto scaling and no shard management. Of course, the shards are still gonna be there, that's how everything in a Kinesis stream works, there are shards, but you are just not gonna see them. Neither are you gonna have any control over them; they are gonna be completely abstracted away from you. And this starts to sound kind of serverless, right?

But of course, there's a catch. With on demand mode, you can handle up to double the maximum traffic that you had during the previous 30 days. So instead of this dynamic scaling up and down, as I have here on my expectation sketch, what it actually does is double the amount of shards that you have, based on the peak that you had during the previous 30 days.

So let's say you had a previous peak that would require your stream to have 20 shards. This means that you are going to have 40 shards in your on demand stream at all times. And also, if you have these spikes of traffic every now and then, let's say once per month, you're gonna have those 40 shards all the time. It's not gonna go down; it's gonna stay there.

This is maybe a good point to note that the pricing model for the on demand mode is a bit different. You are not paying for the individual shards anymore. So in theory, you would say, why would I care how many shards there are, right? You are nodding, you know what I'm saying. But the reality is, my guess at least, that while the shards are still there, they are reserved for you, whether you use them or not, and the pricing model is made so that it's probably gonna cost you a lot more to have an on demand stream versus a provisioned stream that is over provisioned.

Even my personal experience from a couple of years back was that it can be up to two to three times more expensive to have an on demand stream versus a heavily over provisioned provisioned stream. So those are the trade offs that you need to know about and need to think about.

But all of that being said, I still think that the on demand mode is a really great addition to Kinesis capabilities, because what it allows you to do is get started. So if you have no idea about the amount of traffic or data that you are going to have, if you don't want to bother with shard management or anything like that, you just spin up an on demand stream and you just start streaming your data.

And then later on, if you start to get a better idea of how much data you're gonna have, you can actually switch to the provisioned mode and you can build all the auto scaling solutions that I was mentioning. And as a matter of fact, you can switch back and forth between those two modes up to twice per day without any extra cost. Those are just API calls. You don't need to do any modifications to your stream or your data. It's pretty seamless.
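And if it helps to see how small that switch really is, here is a hedged sketch of the call with the AWS SDK for JavaScript v3; the stream ARN is a made-up example.

```typescript
import { KinesisClient, UpdateStreamModeCommand } from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "eu-west-1" });

// Switch an existing stream to on-demand capacity mode. The same call
// with "PROVISIONED" switches it back (allowed up to twice per day).
await kinesis.send(
  new UpdateStreamModeCommand({
    StreamARN: "arn:aws:kinesis:eu-west-1:123456789012:stream/my-data-stream", // hypothetical ARN
    StreamModeDetails: { StreamMode: "ON_DEMAND" },
  })
);
```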

So it is a very useful feature that Kinesis provides. But me personally, I'm still waiting for the true auto scaling mode, with, you know, this dynamic scaling up and down and scaling down to zero, before I personally can call it fully serverless. But at the end of the day, does it really matter what we call it? I leave that to you to decide. I had this rant so that you have a better understanding of the different options you have with Kinesis, and also of how Kinesis works under the hood, because it's very important to understand how the services that we are using actually work.

Ok. So let's take a quick glance now at how we can write the data to the stream and how we can read it from the stream. And why would we do that in the first place, meaning, what are the use cases for Kinesis? So our producer application in our story, the one with some potential issues, how did we get the data flowing to Kinesis? In general, there are plenty of options to choose from.

There are libraries, direct service integrations, some third party tools. But in our story, we had a producer that writes data to the stream in the most flexible way possible, which is the AWS SDK, which is basically an abstraction over the API calls.

On the other end of things, when we want to read the data from the stream, again, there's a similar list of options that you will have: direct service integrations, libraries and third party tools. But by now we can meet the one who is going to be the hero of this story, and it's AWS Lambda. And Lambda, it turns out, comes with a lot of benefits when acting as a stream consumer. Of course, things like scalability. But then there are also plenty of configuration options and all the other magic that Lambda does, and last but not least, extensive error handling capabilities. And I will go into detail on that in just a second.

When it comes to the use cases for Kinesis, it is an excellent candidate for any data streaming application. So collecting and processing large amounts of data, that's where Kinesis shines. And when it comes to data, it can be any data that you have: log data, IoT, website clickstreams, social media, gaming, finance; you can come up with a huge list of data.

It has also been becoming popular with so-called event driven architectures, where you could use Kinesis to pass events between different components of your event driven architecture, just like you would do with other services like EventBridge, maybe SQS or SNS, and so on. And each of the services here, of course, will come with their own set of capabilities and considerations.

So for example, things like costs and all of those aspects should be considered when we are picking one service over another for our particular use case. And a fun fact: many of those services that I showed on the previous slide actually use Kinesis under the hood themselves. So those services are built on top of Kinesis, which makes Kinesis an essential service in the AWS ecosystem. And for us end users, this is important because, of course, AWS wants to make sure that it works properly, that it's as reliable as possible. So it's good news for us.

Ok. So now we have been talking about high level stuff long enough; let's get to some details. And in our story, we decided to check our entire pipeline just in case, to try to figure out what was going on, maybe we were missing something. And lo and behold, here came our first revelation. We were actually losing data while trying to write it to the stream, and we had zero idea about it happening. And trust me, it was a very awkward moment to realize that.

So what was going on? With Kinesis, there are two separate ways that you can write data to the stream, so two separate API calls: you can either write individual records, or you can batch up to 500 records and send them as a single request to Kinesis. And here's a really quick illustration of why you should probably always opt for batch operations whenever you have a chance. And this is nothing Kinesis specific; this applies to all the services that support batching.

So things like DynamoDB or SQS or SNS, they all support batching. And you can see here that on the left, we have three records being written to the stream with three individual requests, while on the right, we batch those three records into just a single request. And it's pretty obvious that the amount of request overhead is pretty considerable when you send individual requests.

So by batching requests, you are of course reducing the overall size of the requests, but also the number of requests. And the number is very important, because if your producer application waits for every single request to complete before it sends the next one, then with a large amount of data you will end up with a producer that has a lot of latency.
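To make the batching idea concrete, here is a minimal sketch of a batched write with the AWS SDK for JavaScript v3; the stream name and the example events are made up.

```typescript
import { KinesisClient, PutRecordsCommand } from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "eu-west-1" });

// Hypothetical payloads; up to 500 entries fit into one PutRecords request.
const events = [
  { deviceId: "sensor-1", temperature: 21.5 },
  { deviceId: "sensor-2", temperature: 19.8 },
  { deviceId: "sensor-3", temperature: 23.1 },
];

await kinesis.send(
  new PutRecordsCommand({
    StreamName: "my-data-stream", // hypothetical stream name
    Records: events.map((event) => ({
      Data: Buffer.from(JSON.stringify(event)),
      PartitionKey: event.deviceId, // decides which shard the record lands on
    })),
  })
);
```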

So that is a thing to always consider. But also with batching, there is a different aspect, a different discussion to be had: how real time you want your data to be. Because if your producer is sitting there waiting for the batch to be collected before sending it out to the stream, chances are high that some of the records in that batch will be kind of old, from the real time perspective, by the moment they get to the stream.

So those are the things and considerations that you need to weigh in your particular use case, and then decide on the batch size that you are gonna use. Also, batching is awesome, but with it come extra responsibilities. And here, finally, we get to my favorite topic, failures.

So let's talk a little bit about what happens if a request to Kinesis fails for some reason. The good news here is that the AWS SDK will take care of most of it for you. By default, the AWS SDK will retry a request that failed with a so-called retryable error.

So things like service unavailable, other transient 500 errors, timeouts. It will retry those failed requests up to three times by default, and it will use so-called exponential backoff. This basically means that the delays between the retry attempts will increase exponentially, and this is done so that the retry attempts are spread out more uniformly across time.

We do not want to send bursts of retries right away to a system that is probably already under a lot of stress. So that's how the SDK handles that. And all of those parameters I told you about, they can and should be configured. For example, in the Node.js SDK, you can do that when you are creating an instance of the Kinesis client.

And together with these retry settings, you should also consider configuring the timeout settings, because by default, the AWS SDK will wait for two entire minutes before it decides to time out the request. Let's just pause there for a second and think about it. We are talking about real time processing. We are counting milliseconds, we are doing some tradeoffs, and here we have the AWS SDK that will wait for two entire minutes before it decides that the request timed out. Add on top of that, that we actually have two separate timeout settings, one for new connections and one for existing ones. Then add on top of that all the possible retries, and you might end up with a producer that is pretty much stuck waiting for something that might not even happen.

So it's very crucial to configure those timeout settings to be appropriate for your particular case. And once again, this is something that is not specific to Kinesis; this applies to basically every API call that you make to AWS. So every time, you need to think about those timeout settings and how big you want them to be. But you do not want to keep the defaults; defaults can be really dangerous.
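As a rough sketch of what that configuration could look like with the AWS SDK for JavaScript v3 (the numbers are just placeholders you would tune for your own latency budget):

```typescript
import { KinesisClient } from "@aws-sdk/client-kinesis";
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Tighten the defaults: cap the retries and fail fast instead of
// waiting minutes for a request that may never complete.
const kinesis = new KinesisClient({
  region: "eu-west-1",
  maxAttempts: 3, // total attempts, i.e. the first try plus retries
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 1_000, // ms to establish a new connection
    requestTimeout: 2_000,    // ms to wait for a response on an existing one
  }),
});
```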

Ok. So the errors are retried, and if for some reason the retries don't go through, we want to know about it, right? So we log it, and if we look at the logs and we see nothing there, well, we assume everything is fine, right? But can you spot what's wrong with this code? Because I personally couldn't for a very long time.

So remember, as I told you, the AWS SDK takes care of most of it for you, right? Well, there is a catch, because in the case of batch operations, like we have PutRecords here, instead of handling only failures of the entire request, we should also handle the so-called partial failures.

So the thing here is that batch operations are not atomic. It's not either everything succeeds or everything fails. It can happen that part of the records in your batch go through, but then the other part fails, and you are still getting a 200 success response back from the SDK.

I can tell you more: it can happen that every single record in your batch fails, and you are still getting a 200 success response back. So it's your responsibility to detect those partial failures, and it's your responsibility to handle them. And if you are not doing that, you probably have no idea that it's happening.
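For PutRecords specifically, the response carries a FailedRecordCount and a per-record error code, so detecting the failed entries could look roughly like this sketch (the stream name is again a made-up example):

```typescript
import {
  KinesisClient,
  PutRecordsCommand,
  PutRecordsRequestEntry,
} from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "eu-west-1" });

// Writes one batch and returns the entries that were rejected,
// even though the request itself came back as a success.
async function putRecordsOnce(
  records: PutRecordsRequestEntry[]
): Promise<PutRecordsRequestEntry[]> {
  const response = await kinesis.send(
    new PutRecordsCommand({ StreamName: "my-data-stream", Records: records }) // hypothetical stream
  );

  if (!response.FailedRecordCount) {
    return []; // the whole batch made it
  }

  // The result array is positional: result[i] corresponds to records[i],
  // and failed entries carry an ErrorCode instead of a SequenceNumber.
  return records.filter((_, i) => response.Records?.[i]?.ErrorCode !== undefined);
}
```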

And once again, this is a very common consideration in distributed systems in general; this is nothing specific to Kinesis. If you have batch operations that are not atomic, partial failures are gonna happen, they are happening right now. And if you don't have proper error handling, you have no idea about them happening. And trust me, it's a very awkward moment when that realization comes, so we better be prepared. And when it comes to the retries of partial failures, or any retries for that matter, there are just two simple things to keep in mind that will get us really, really far.

First one, we want to configure upper limits for our retries, right? Just like the AWS SDK does, we do not want to be retrying forever; we want to stop at some point. And the second one, we need to have proper backoffs between these retry attempts. We do not want to send bursts of retries to a system that is already under a lot of stress. We want to spread out the retry attempts as uniformly as is humanly possible.

And for that, there is a very simple trick: exponential backoff and jitter. And jitter here is just a random component that we add to our exponential backoff. With that very simple trick, we manage to spread out those retry attempts even more uniformly and avoid a lot of headache with retries. And actually, jitter is what the SDK uses itself, I just didn't mention it on the previous slide.
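Putting those two rules together around the putRecordsOnce helper sketched earlier, a capped retry loop with full jitter could look something like this (the limits and delays are just illustrative):

```typescript
import { PutRecordsRequestEntry } from "@aws-sdk/client-kinesis";

// Defined in the earlier sketch: writes a batch and returns the failed entries.
declare function putRecordsOnce(
  records: PutRecordsRequestEntry[]
): Promise<PutRecordsRequestEntry[]>;

const MAX_RETRIES = 5;
const BASE_DELAY_MS = 100;
const MAX_DELAY_MS = 5_000;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function putRecordsWithRetries(records: PutRecordsRequestEntry[]): Promise<void> {
  let pending = records;

  for (let attempt = 0; attempt <= MAX_RETRIES && pending.length > 0; attempt++) {
    if (attempt > 0) {
      // Exponential backoff with full jitter: a random delay between zero and an
      // exponentially growing cap, so retries don't hit the service in bursts.
      const cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
      await sleep(Math.random() * cap);
    }
    // Only the records that failed last time are sent again.
    pending = await putRecordsOnce(pending);
  }

  if (pending.length > 0) {
    // Out of retries: surface the failure instead of silently dropping data.
    throw new Error(`${pending.length} records could not be written to the stream`);
  }
}
```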

So it's a very, very simple tool, but it's also a very powerful one. And if you remember just two things from this entire talk, please let them be these: first, partial failures of batch operations, they are happening, they are real; and second, retries with exponential backoff and jitter. This will help you tremendously no matter what service you are working with, because those are universal things, they are not service specific.

And here I want to bring up a great quote by Gregor Hohpe, who says that retries have brought more distributed systems down than all the other causes combined. And this doesn't mean we shouldn't retry, of course, quite the opposite. In distributed systems, we should always retry, but we should be smart about how we are doing those retries. We should be mindful not to kill the system that we are trying to fix in the first place.

But let's take a step back and talk a little bit about why those failures happened in the first place. Why do we need all those retries, right? So once again, there are several reasons that are common to distributed systems. We are working over a network after all, so all the familiar issues that happen in a network environment can happen here.

So network glitches, services being unavailable. Maybe you sent the request and it did go through, but you never got the response back and your request timed out. All of these things happen all the time. And more often than not, you will also get so-called throttling, which is basically exceeding one of the limits that the API has.

And if you get throttled, requests fail. Like in the case of Kinesis, for example, we had this limit of 1000 records or one megabyte of data per shard per second, and if we exceed that, then the requests beyond that limit will fail.

And here came our next not so pleasant revelation. You could actually get throttled with Kinesis even when the CloudWatch metrics tell you that the amount of incoming data is way lower than the throughput capacity that you provisioned for your stream. You can still get throttled. And the reason for that is that all the Kinesis limits that I told you about are per-second limits.

So it's one megabyte or 1000 records per second. And CloudWatch metrics are, well, at best per-minute aggregates.

So with CloudWatch metrics, you are never getting the second-to-second picture of what's actually happening. Instead, you just have this aggregate over a minute. And the thing is that you can, of course, have sudden spikes of traffic that happen on a second-to-second basis, and you can still get throttled even if the per-minute values look ok. For example, a shard can take 60 megabytes spread evenly over a minute, but if several megabytes of that arrive within the same second, those writes still get throttled.

And if you remember, in the very beginning I told you that our metrics were somewhat controversial. Well, this is exactly what I meant, because we were looking at the overall throughput of our stream and it looked perfectly fine. I mean, we had plenty of headroom still, it was heavily over provisioned. But then, for some reason, we were seeing spikes of throttling metrics, and we were like, what's going on? It can't be right. We can't get throttled, I mean, we have a lot of throughput. But that can happen.

And if you only look at the metrics that tell you about the good stuff, like the amount of incoming data, you are of course not getting the entire picture. So you always need to look at the metrics that tell you about failures; those are the most interesting ones anyway. And in the case of Kinesis, there are two separate metrics that will tell you about throttling.

So after we were digging deep into how we write data to Kinesis, we figured out that it boils down to just a few simple principles that we need to keep in mind. First and foremost, we need to have proper error handling for partial failures, limits for retry attempts, proper backoffs. We need to have those in place; we don't want to bring our system down while trying to fix it.

On the same note, we want to configure proper retry and timeout settings for the SDK, because, well, defaults can be dangerous. And lastly, we want better monitoring, which is kind of obvious. But what I mean here is that in addition to monitoring various metrics across the stream, you probably also want to create your own custom metrics that will tell you about the retries and the partial failures. That way, you will have a more holistic view of your stream and what's happening there. At the same time, you will have a more reliable producer application.

Ok. So we have been talking about writing to the stream, but it all started from reading from the stream, right? We had our consumer Lambda that was behaving oddly, and then all hell broke loose. So let's finally get to that Lambda. And as usual, let's take a look at how Lambda actually works with Kinesis.

So I mentioned in the beginning that there are some really great benefits when it comes to using Lambda as a stream consumer. Things like scalability, of course, come pretty much as a package deal with Lambda. But it turns out it also takes care of a lot of heavy lifting on your behalf when it comes to reading from the stream.

So for example, it keeps track of where it should read next, because, well, Kinesis retains the data, and it can also have multiple consumers, so it's each consumer's responsibility to track where to read next. Lambda will do that for you when acting as a stream consumer. It also takes care of record batching, and it has extensive error handling and retry capabilities if things go wrong.

And Lambda itself, of course, is not just one monolith; it's composed of multiple different components that work together behind the scenes to make it so powerful. And here we finally get to meet the hidden hero behind the rest of our story. It's literally gonna be behind everything else that I'm gonna tell you about today.

So let's get really familiar with it. And this wonderful component is called event source mapping. You will recognize it from this unicorn icon. It's not an official icon, just something I'm using. But event source mapping is really, really important. It's a crucial part of our story, because when you are attaching a Lambda function to read from your Kinesis stream, what happens is that you are actually attaching an event source mapping to that stream, and then you are pointing your Lambda function to the event source mapping.

And it's gonna be the event source mapping that picks batches of records from the stream for you and passes them to your Lambda function invocation. And when your Lambda function is done with that one batch, the event source mapping will pick the next batch from the stream and pass it to a new Lambda invocation.

And once again, this is not something specific to Kinesis only; event source mapping is used with some other Lambda triggers too. So things like DynamoDB Streams or SQS, for example, it's behind those integrations as well. It's just hidden there. You probably haven't heard about it, but it's there.

So with the help of the event source mapping, each Lambda invocation will get a batch of records, and you can configure how big or small you want that batch to be. But it's important to realize that even though we are dealing with batches now, we are still handling one event at a time. That's how Lambda works: Lambda gets one event, and that event will contain a batch of records.

And once your Lambda is done processing that one batch, the event source mapping picks up the next batch and passes it to a new Lambda invocation. And the other important thing to know is that those batches are picked from all the shards in your stream in parallel.

So there's gonna be one Lambda taking batches of records from each individual shard. This in turn means that you will have as many concurrent Lambda invocations reading from your stream as you have shards in that stream, unless you want to speed things up.

And in that case, event source mapping comes to help with this very useful feature called parallelization factor. And with parallelization factor, you can actually have up to 10 Lambda functions reading from each of the shards in your stream at all times. So instead of having just one Lambda per shard, you can have up to 10 Lambdas per shard, which can be extremely useful.
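To give a feel for where those knobs live, here is a minimal sketch of creating such an event source mapping with the AWS SDK for JavaScript v3; the function name, stream ARN and numbers are made-up examples.

```typescript
import { LambdaClient, CreateEventSourceMappingCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({ region: "eu-west-1" });

await lambda.send(
  new CreateEventSourceMappingCommand({
    FunctionName: "my-stream-consumer", // hypothetical function name
    EventSourceArn: "arn:aws:kinesis:eu-west-1:123456789012:stream/my-data-stream", // hypothetical ARN
    StartingPosition: "LATEST",  // where in the stream to begin reading
    BatchSize: 100,              // up to this many records per invocation
    ParallelizationFactor: 2,    // concurrent invocations per shard, 1 to 10
  })
);
```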

But of course, there are things that you always need to keep in mind, and one such important thing is the Lambda concurrency limit. So you are probably all aware that, as with anything else in life, Lambda, as awesome as it is, comes with its own limitations. And one such very important limit is the Lambda concurrency limit.

And it's very important because it has a very big blast radius; I'm going to explain in a second what that means. So the Lambda concurrency limit basically means that you can have a limited amount of concurrent Lambda invocations in the same account and in the same region. It is a soft limit, you can increase it, but there is still gonna be a limit no matter how much you increase it.

And once you reach that limit, all the new Lambda invocations in that account, in that region will be throttled; they will fail. So imagine you have a Kinesis stream with 100 shards, and then you decide to speed things up and you set your parallelization factor to 10. And here we are, having 1000 Lambda functions just reading from your stream at all times. Things are probably gonna be just fine, until someone else somewhere in your account, in your region, deploys another Lambda function that does something very business critical.

It has no idea about your data stream or your consumer, nothing. But that Lambda begins to fail, and it fails because your Lambda consumer consumed the entire concurrency capacity that there was. So that's why, as I said, it's a limit with a very big blast radius, and it's something you should always keep in mind with Lambda. And especially with Kinesis, it's really easy to over-scale Lambda. So be aware of that.

Ok. So there was some very seriously weird behavior with our consumer Lambda. So what was going on? To recap: by default, the event source mapping will pick up batches from your stream, pass them to your Lambda function, and your Lambda function will try to process those batches one by one. And what happens if that fails for some reason? Let's say there was a bad record in your batch, some corrupt data, your Lambda can't process it, you didn't implement error handling, Lambda throws an exception. What happens then? Again, the good news is that the event source mapping will be there to help you, but you need to know exactly how it's going to do that and what the defaults are. Because yes, the defaults can be dangerous.

So by default, the event source mapping will try to process a failed batch of records over and over and over again until it either succeeds or until the data in the batch expires. And we know by now that data in Kinesis expires after at least 24 hours. So for at least one day, Lambda will be retrying and retrying and retrying that one failed batch of records, without success, of course.

So you can imagine the amount of unnecessary Lambda invocations that come with that. And mind you, they are not free; you are paying for them. Ok, so what if we configured the retention to be a week or a month or a year? We might be in quite some trouble there. But the problems don't stop there, because all those retries will probably cause reprocessing of the same data. Because, you see, from the perspective of the event source mapping, either the entire batch succeeds or the entire batch fails.

So if you think about it, it's the exact opposite of what we had when we were writing to the stream: there, the batch operation was not atomic, so part of it could succeed and part of it could fail. But here, even if one record fails, the entire batch fails. Which in turn means that even if you have already processed some of the records before the failure, like here, where records one, two and three were successfully processed and then record four threw an error, the event source mapping will fail the entire batch. Then it will pass the entire batch to the next Lambda invocation, and you will process records one, two and three again, over and over and over. Probably not something you want to do. And bad things don't stop there either.

Because while Lambda is retrying that one batch of records from that one particular shard, no other records are being read from that shard. So all the other shards go on with their lives like nothing ever happened, but that one shard is stuck with all that reprocessing, all those retries. And that's why that kind of a bad record is often referred to as a poison pill record, because just one bad record was enough to poison your entire shard.

And still, bad things don't stop there either.

Ok, it's been 24 hours. Hopefully, the data finally expires. It finally gets discarded from the shard, probably partially unprocessed, because you never got to the end of that batch. But well, that's life, you need to move on. At least now your Lambda can start picking up the next batches from the shard, right? So it will get back to it.

But the problem here is that at that point in time, your shard is probably filled with records that were written to the stream around the same time, which means they will expire around the same time as the already expired records. So in practice, your Lambda might not even have enough time to process all those records, and they will keep expiring and expiring and expiring without you having a chance to read them.

And I often bring up this overflowing sink analogy, because that's what is happening: you can't drain the sink fast enough and the water just overflows. So we started with just one bad record, and we ended up losing a lot of valid and valuable data.

Well, that's exactly what we were seeing in our story. When I said we were seeing this weird behavior from our Lambda, that's exactly what I meant. So we were seeing a lot of repeated and prolonged bursts of Lambda invocations, which turned out to be just these retries of some failed batches. We were also seeing a lot of duplicate reprocessing. We were also seeing a lot of delays in what was supposed to be a near real time pipeline, because 24 hours, come on.

Of course, the costs were going up; all those Lambdas are not free, you are paying for them. But of course, the most important part was that we were losing data. And all of that happened just because we didn't know better and we just went with the good old defaults when it comes to error handling and retries. And I cannot not bring up this quote by Gregor once again.

So luckily, there are many simple ways we can be smarter about retries with the event source mapping and Lambda. The event source mapping comes with a lot of knobs to tweak when it comes to error handling. But we kind of already know that the most important thing we can do is to set limits for the retries. And in the case of event source mapping and Kinesis, there are actually two separate settings that we can use.

There is one where we can set the maximum retry attempts, and we can also set the maximum age of the records that we want to consume. But both of them are set to minus one by default, which means no limits. And well, that's exactly what we got.

Then there are other useful features that you should also use. So for example, batch bisecting on failure, which means that you allow Lambda to split the failed batch into two and try to process those two parts separately. And then it will keep splitting the failed batch recursively and trying to process the separate parts until, hopefully, it processes everything else except for that one failed record.

It will do that automatically; you just need to turn that flag on and Lambda will take care of it. But with that, you will probably also get a lot of reprocessing because of all that splitting and retrying.

So another very useful feature here is returning partial success, which basically means that you can tell the event source mapping exactly which record in your batch failed. So you do not have to fail the entire batch; you can say exactly which record was the bad one.

So like here in our example, we can tell the event source mapping: hey, it was record four that failed, by the way. And that will tell it that records one, two and three were already successfully processed, so it will not give them to us anymore, and we will avoid all that unnecessary processing of the same records.
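As a rough sketch of what reporting partial success could look like in a Node.js consumer (this assumes ReportBatchItemFailures is enabled on the event source mapping, and processRecord is a hypothetical helper standing in for your own business logic):

```typescript
import { KinesisStreamEvent, KinesisStreamBatchResponse } from "aws-lambda";

// Hypothetical business logic; assume it throws if a record cannot be processed.
declare function processRecord(payload: string): Promise<void>;

export const handler = async (
  event: KinesisStreamEvent
): Promise<KinesisStreamBatchResponse> => {
  const batchItemFailures: { itemIdentifier: string }[] = [];

  for (const record of event.Records) {
    try {
      const payload = Buffer.from(record.kinesis.data, "base64").toString("utf-8");
      await processRecord(payload);
    } catch {
      // Report only the failed record; everything before it counts as processed,
      // so the retry starts from this sequence number instead of the batch start.
      batchItemFailures.push({ itemIdentifier: record.kinesis.sequenceNumber });
      break; // stop here to preserve ordering within the shard
    }
  }

  return { batchItemFailures };
};
```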

Finally, when all else fails, you do not want to discard your data, but you would want to allow Lambda to send it to a failure destination, which can be SQS or SNS in the case of Kinesis. But the thing to remember here is that it's not the actual records that are being sent to the failure destination, it's the metadata about the records. That metadata will allow you to go and pick those records from the stream later on.

But this also means that the records are not copied anywhere. They only live in your Kinesis stream, which again means that they will expire like any other record in the stream. So you will need to go and process those failed records within the window of your data expiration; otherwise there was no point in setting up the failure destination in the first place, right?

And you can use any of those features in any combination that you want. But whatever you do, just please do not go for the default.
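To tie those knobs together, here is a hedged sketch of tuning them on an existing event source mapping with the AWS SDK for JavaScript v3; the mapping UUID, the queue ARN and the numbers are made-up examples.

```typescript
import { LambdaClient, UpdateEventSourceMappingCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({ region: "eu-west-1" });

await lambda.send(
  new UpdateEventSourceMappingCommand({
    UUID: "00000000-0000-0000-0000-000000000000", // hypothetical mapping id
    MaximumRetryAttempts: 3,           // stop retrying a failed batch after 3 attempts
    MaximumRecordAgeInSeconds: 3600,   // give up on records older than an hour
    BisectBatchOnFunctionError: true,  // split failed batches to isolate the bad record
    FunctionResponseTypes: ["ReportBatchItemFailures"], // enable partial success reporting
    DestinationConfig: {
      OnFailure: {
        Destination: "arn:aws:sqs:eu-west-1:123456789012:stream-failures", // hypothetical queue
      },
    },
  })
);
```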

And there's one last thing I really wanted to mention when it comes to the event source mapping. It is a very useful feature and one of my personal favorites, because what it allows you to do is control which records are being sent to your Lambda function.

So if you have a consumer that is not interested in all the records in your stream, you can actually tell the event source mapping to filter, to pick only the records that you want and pass only them to your Lambda function. And in that way, you can avoid a lot of unnecessary Lambda invocations, which will probably result in less cost.

You will also probably have fewer records to deal with, which reduces the probability of a poison pill happening. And last but not least, you will probably also have less custom code that you would otherwise have to write within your Lambda function, and less code is always a win-win situation, right?

And all of that is just a simple configuration that you can use with the event source mapping. So just a JSON filter; if you're familiar with EventBridge filtering, it's exactly the same, and it comes completely for free.
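For illustration, such a filter could look something like this sketch, assuming the records carry JSON payloads with a type field (an entirely made-up payload shape):

```typescript
import { LambdaClient, UpdateEventSourceMappingCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({ region: "eu-west-1" });

// Only records whose decoded JSON payload has type == "purchase"
// will be passed to the consumer function.
await lambda.send(
  new UpdateEventSourceMappingCommand({
    UUID: "00000000-0000-0000-0000-000000000000", // hypothetical mapping id
    FilterCriteria: {
      Filters: [{ Pattern: JSON.stringify({ data: { type: ["purchase"] } }) }],
    },
  })
);
```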

Ok. So after all that digging and after all those problems, we figured out that, once again, it actually boils down to just a few simple principles. Coincidentally, three simple principles.

Once again, we should always adjust the processing speed to our stream. We can use the batching parameters for that, or we can use the parallelization factor. So if we want to speed things up, we can, for example, increase the parallelization factor and have more Lambdas reading from the stream, right?

It's always critical to spend time on configuring and implementing proper error handling. I cannot emphasize this enough: plan for failures, fail gracefully, use all the knobs that the event source mapping gives to you. Do not go with the defaults.

Finally, save the planet: only consume what you actually need, avoid all the unnecessary Lambda invocations, all the unnecessary record processing. And for that, event filtering can be very, very useful.

And with those very simple principles, you will get your Kinesis consumer to be much more reliable, right?

So we are almost at the end of our story; just a quick wrap up and a few parting thoughts left.

So first of all, there are, of course, so many things I couldn't go into. So I've written these articles about Kinesis and Lambda. If you are at all interested, go read them; they go into a lot more detail.

But also, I always try to bring up this great article from the Amazon Builders' Library. Once again, this is something that is not specific to Kinesis; it's about their struggles when building on top of distributed systems, and it's about the retries and backoffs and jitter, all that fun stuff. So I highly recommend that one.

Finally, in life it's all about tradeoffs. When you are picking one service over another, there are probably gonna be a lot of different options that might work for you, but none of them are gonna be perfect in all possible ways. Things just don't work like that.

And even though the combination of Kinesis and Lambda can be extremely powerful, it might not be the right fit for your particular case, and that's fine. But we need to be pragmatic when we are making those choices and base them on the actual architectural needs that we have.

And well, if something doesn't meet our needs, we just choose something else. But it's also fairly easy to dismiss something right away if we don't understand how things work, instead of trying to figure out what is actually happening there.

And in the case of Kinesis, I'm getting this feeling that it's one of those services that are kind of easy to dismiss because you don't really know what's happening. In my opinion, we would be doing ourselves a disservice, because despite all my rant about it being serverless or not being serverless, the reality is that it's still one of the most reliable and scalable services that AWS provides.

So we would be doing ourselves a disservice if we wouldn't use it, or at least have it in our toolbox. Of course, in the ideal world, you just connect all the pieces together and everything just works by magic. But even then, I would argue that it would be very valuable to understand how exactly the services work, and also how they interact and how they fail.

And while we are not in that ideal world just yet, stories like the one that I told you today can actually discourage people from using certain architectures or certain services altogether, instead of trying to figure out what is actually happening.

And once again, I think we would be doing ourselves a disservice, because stories like this are nothing unusual. They are just a simple consequence of having a real world, big scale distributed application instead of just a hello world example.

And of course, each service will come with its own set of capabilities and considerations. But many things that we touched upon today are actually common to distributed systems in general. So things like service limits, handling partial failures, retries, backoffs: all of those things are inherent to distributed systems.

So whatever service that we are using, we still need to be very aware of all of those aspects.

And last but not least, don't be afraid of failures. Embrace failures, learn from them. Each failure is an opportunity to do things better next time. If we hadn't embraced failures as kids, we would never have learned how to walk and do all the other fun stuff. But somehow along the way, we kind of start to take failures as something negative, right?

Well, it can be one of the best things that actually happens to us. And as Dr. Werner Vogels likes to remind us, everything fails all the time. That's just the reality of things.

So whether in life in general or with AWS services in particular, I would argue that the best thing we can actually do is just to stay calm and be prepared for when those failures happen.

And that's it for me for today. Thank you so very much for listening. Please go fill in the survey. I really love to hear your feedback. Thank you very much.
