Your services will fail, but you can do something about it

kndb

Published on Sep 29, 2021

You read that right. Go ahead and search for faulted errors and socket exceptions in your favorite instrumentation tool (Azure Application Insights, New Relic, etc.); there is a very high chance you will find a few of them logged.

Here is a snippet from one of our sandbox Application Insights instances (screenshot: image.png).

The criminal

Transient failures

These failures can occur at any time, irrespective of the platform, operating system, or programming language you use to build your application. On a bad day, your application simply stops responding to key actions, giving the end user the impression that the system is highly unreliable.

There can be various reasons for such failures: network issues, temporarily unavailable services, a server unable to respond in time (timeouts), some idiot spilling coffee on the server where your service is hosted, and so on.

With the paradigm shift towards the cloud, these kinds of errors have become more and more prominent. You are not going to get rid of them, but you can make your system more resilient and fault tolerant to them.

💡 Note

Transient failures are often self-correcting: if the action is repeated, it is likely to succeed.

In your SSR-enabled React app, you could ask the user to wait a while and try again. But microservices and components behind the scenes communicate with each other all the time, so strategies have to be in place beforehand; manual intervention just won't work in this case.

Polly

Enter Polly - A library that enables resilience and transient-fault-handling in your .NET application.

I am going to talk about 3 types of policies that I have used in my projects.

  1. Fixed number of retries but retry after an interval
  2. Fixed number of retries but retry with exponential backoff
  3. Circuit breaker policy

Base setup

I exposed a throttled API that accepts only 2 requests per 10 seconds from a given IP address. Once that limit is exceeded, the API responds with a 429 status code and a message. We will call this API in a loop and observe how the behaviour changes under each Polly policy.

Behavior without any policy in place (animation: final_6154aee2ad89470077c10ddf_27415.gif)

💡 Note

In a production environment, the behavior would be one failure and we are done: the call fails and nothing further is executed.
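To try the policies locally without a real throttled API, the setup above can be approximated with a stub handler plugged into HttpClient. This is only a sketch under stated assumptions: ThrottlingHandler, its fixed-window counting, and the response messages are hypothetical stand-ins, not the service used in this post, and the stub ignores the per-IP aspect entirely.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the throttled API: allows `limit` requests per
// fixed time window and answers 429 for everything beyond that.
public sealed class ThrottlingHandler : HttpMessageHandler
{
    private readonly int _limit;
    private readonly TimeSpan _window;
    private readonly Func<DateTimeOffset> _clock;
    private readonly object _gate = new object();
    private DateTimeOffset _windowStart;
    private int _count;

    public ThrottlingHandler(int limit, TimeSpan window)
        : this(limit, window, () => DateTimeOffset.UtcNow) { }

    // The clock is injectable so the window can be exercised without real waiting.
    public ThrottlingHandler(int limit, TimeSpan window, Func<DateTimeOffset> clock)
    {
        _limit = limit;
        _window = window;
        _clock = clock;
        _windowStart = clock();
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        bool allowed;
        lock (_gate)
        {
            var now = _clock();
            if (now - _windowStart >= _window)
            {
                // A new window has started: reset the counter.
                _windowStart = now;
                _count = 0;
            }
            allowed = ++_count <= _limit;
        }

        var response = allowed
            ? new HttpResponseMessage(HttpStatusCode.OK)
                { Content = new StringContent("ok") }
            : new HttpResponseMessage((HttpStatusCode)429)
                { Content = new StringContent("Too many requests, try again later.") };
        return Task.FromResult(response);
    }
}
```

With this in place, new HttpClient(new ThrottlingHandler(2, TimeSpan.FromSeconds(10))) can play the role of httpClient in the snippets that follow; any absolute URL works because no network call is made.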

Fixed number of retries but retry after an interval

⚙️ Setup in place
If the response code is 429 (Too Many Requests) -> retry 3 times, waiting 2 sec before each retry.

// Requires the Polly NuGet package (using Polly;)
var asyncRetryPolicy = Policy
    .HandleResult<HttpResponseMessage>(r => r.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
    .WaitAndRetryAsync(
        3,                                        // number of retries
        (retryNumber) => TimeSpan.FromSeconds(2), // fixed wait before each retry
        (response, attemptTimespan) =>            // runs before each retry; response wraps the handled result
        {
            Console.WriteLine($"[Polly] - Encountered an error: \"{response.Result.Content.ReadAsStringAsync().Result}\" - Retrying after {attemptTimespan.TotalSeconds} sec.");
        });
// run in loop
await asyncRetryPolicy.ExecuteAsync(async () => await httpClient.GetAsync("endpoint"));

Behavior with policy in place (animation: final_6154aee2ad89470077c10ddf_910190.gif)
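To make "retry 3 times" concrete: the policy means one initial attempt plus up to three retries, so the API is called at most four times. The hand-written loop below sketches roughly the same behavior without Polly. ManualRetry and its parameters are hypothetical names for illustration, and this is not how Polly is implemented internally.

```csharp
using System;
using System.Threading.Tasks;

public static class ManualRetry
{
    // One initial attempt, then up to retryCount retries while shouldRetry
    // says the result is a failure. The delay function is passed in
    // (for example Task.Delay) so the wait can be replaced in tests.
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> action,
        Func<T, bool> shouldRetry,
        int retryCount,
        TimeSpan interval,
        Func<TimeSpan, Task> delay)
    {
        var result = await action();
        for (var retry = 1; retry <= retryCount && shouldRetry(result); retry++)
        {
            await delay(interval);   // fixed wait before each retry
            result = await action(); // try again
        }
        // The last result is returned even if it is still a failure.
        return result;
    }
}
```

Writing this loop by hand for every call site gets repetitive quickly, which is the part Polly takes off your hands.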

Fixed number of retries but retry with exponential backoff

⚙️ Setup in place
If the response code is 429 (Too Many Requests) -> retry 3 times: first retry after 2 sec, second retry after 4 sec, third retry after 8 sec.

// Requires the Polly NuGet package (using Polly;)
var asyncRetryPolicy = Policy
    .HandleResult<HttpResponseMessage>(r => r.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
    .WaitAndRetryAsync(
        3,                                                               // number of retries
        (retryNumber) => TimeSpan.FromSeconds(Math.Pow(2, retryNumber)), // 2, 4, 8 sec (retryNumber starts at 1)
        (response, attemptTimespan) =>                                   // runs before each retry; response wraps the handled result
        {
            Console.WriteLine($"[Polly] - Encountered an error: \"{response.Result.Content.ReadAsStringAsync().Result}\" - Retrying after {attemptTimespan.TotalSeconds} sec.");
        });
// run in loop
await asyncRetryPolicy.ExecuteAsync(async () => await httpClient.GetAsync("endpoint"));

Behavior with policy in place (animation: final_6154aee2ad89470077c10ddf_371182.gif)
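The wait times come straight from the Math.Pow(2, retryNumber) expression in the snippet above. A small helper makes the schedule easy to inspect before wiring it into a policy; BackoffSchedule is a hypothetical name used only for this sketch.

```csharp
using System;
using System.Linq;

public static class BackoffSchedule
{
    // Same formula as the sleep duration provider in the policy:
    // 2^retryNumber seconds, where retryNumber starts at 1.
    public static TimeSpan Delay(int retryNumber) =>
        TimeSpan.FromSeconds(Math.Pow(2, retryNumber));

    // Delays, in seconds, for retries 1..retryCount.
    public static double[] DelaysInSeconds(int retryCount) =>
        Enumerable.Range(1, retryCount)
                  .Select(n => Delay(n).TotalSeconds)
                  .ToArray();
}
```

BackoffSchedule.DelaysInSeconds(3) yields 2, 4 and 8, so a call that fails every time spends 14 seconds waiting in total, compared with 6 seconds for the fixed 2 second interval.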

Circuit breaker policy

⚙️ Setup in place
If the response code is 429 (Too Many Requests) 3 times consecutively -> open the circuit for 10 sec so that no further API calls can go through.

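// Requires the Polly NuGet package (using Polly;)
var circuitBreakerPolicy = Policy
    .HandleResult<HttpResponseMessage>(r => r.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
    .CircuitBreakerAsync(
        3,                           // consecutive handled results before the circuit opens
        TimeSpan.FromSeconds(10),    // how long the circuit stays open
        (response, breakDuration) => // onBreak: the circuit has just opened
        {
            Console.WriteLine("[Polly] - Circuit is open for 10 sec.");
        },
        () =>                        // onReset: the circuit has closed again
        {
            Console.WriteLine("[Polly] - Circuit closed, requests flow normally.");
        });
// run in loop; while the circuit is open, ExecuteAsync throws BrokenCircuitException instead of calling the API
await circuitBreakerPolicy.ExecuteAsync(async () => await httpClient.GetAsync("endpoint"));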

Behavior with policy in place (animation: final_6154aee2ad89470077c10ddf_213348.gif)
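If the open and closed terminology is new, a tiny state machine may help. The sketch below is a simplified, hand-written breaker and not Polly's implementation: SimpleCircuitBreaker and its members are hypothetical, it is not thread-safe, and it only models the three states (closed, open, half-open) with a consecutive-failure threshold.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

public sealed class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private readonly Func<DateTimeOffset> _clock;
    private int _consecutiveFailures;
    private bool _isOpen;
    private DateTimeOffset _openedAt;

    // The clock is injectable so the break duration can be tested without waiting.
    public SimpleCircuitBreaker(int failureThreshold, TimeSpan breakDuration, Func<DateTimeOffset> clock)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
        _clock = clock;
    }

    public CircuitState State
    {
        get
        {
            if (!_isOpen) return CircuitState.Closed;
            // Once the break duration has elapsed, one trial call is allowed.
            return _clock() - _openedAt >= _breakDuration
                ? CircuitState.HalfOpen
                : CircuitState.Open;
        }
    }

    // Callers check this before making the real call.
    public bool AllowRequest() => State != CircuitState.Open;

    public void RecordSuccess()
    {
        _consecutiveFailures = 0;
        _isOpen = false; // a success closes the circuit
    }

    public void RecordFailure()
    {
        // A failed trial call in half-open re-opens the circuit immediately;
        // otherwise it opens after the threshold of consecutive failures.
        if (State == CircuitState.HalfOpen || ++_consecutiveFailures >= _failureThreshold)
        {
            _isOpen = true;
            _openedAt = _clock();
            _consecutiveFailures = 0;
        }
    }
}
```

With a threshold of 3 and a break duration of 10 seconds, three consecutive 429 responses move the breaker to Open, calls are rejected for 10 seconds, and the next call is a trial that either closes the circuit or opens it again.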

In this post, I built the policies around the 429 HTTP response code, but you can write policies around any kind of transient fault or any HTTP response code (even treat a 200 response as a failure 🙃 and do something about it). It's all configurable. Polly provides a lot of resiliency options. Do check out their GitHub page.
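As one example of widening the condition, the predicate below treats a few more status codes as retryable. Which codes count as transient is an assumption made for this sketch (408, 429 and any 5xx), and TransientFaults is a hypothetical helper, so adjust the list to what your downstream services actually return.

```csharp
using System.Net;

public static class TransientFaults
{
    // Assumption for illustration: request timeouts, throttling and
    // server-side errors are worth retrying; everything else is not.
    public static bool IsTransient(HttpStatusCode status)
    {
        var code = (int)status;
        return code == 408   // Request Timeout
            || code == 429   // Too Many Requests
            || code >= 500;  // server errors
    }
}
```

The predicate can be plugged into the same builder used throughout this post, as in Policy.HandleResult<HttpResponseMessage>(r => TransientFaults.IsTransient(r.StatusCode)). To cover network-level exceptions as well, Polly lets you start with Policy.Handle<HttpRequestException>() and chain .OrResult<HttpResponseMessage>(...) with the same predicate.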

Thank you for going through this post. I hope you had some takeaways from it. Cheers!