Circuit breaker in microservices

June 12, 2020

The circuit breaker is a design pattern used extensively in distributed systems to prevent cascading failures. In this post, we’ll go through the problem of cascading failures and see how the circuit breaker pattern prevents them.

Motivation: The problem of cascading failures

Before jumping into the circuit breaker pattern, let’s try and understand what problem it tries to solve.

When service A calls service B, it allocates a thread to make that call. Two kinds of failures can occur while making the call. We use the example of a user service making a call to a friends service.

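''' user service '''
def get_user_info(user_id: str):
  try:
    # Blocks the calling thread until friends service responds or fails.
    return friends_service.get_friends(user_id)
  except Exception as e:
    # Chain the original error so the root cause is not lost.
    raise InternalServerError from e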

Immediate failures: In an immediate failure, an exception is raised right away (for example, Connection Refused) and the service A thread is freed.

Timeout failures: Here, service B takes a long time to respond, and the calling thread stays blocked until the timeout expires. As new requests arrive at service A, more and more of its threads end up waiting on service B. If enough requests arrive while earlier ones are still waiting, this can exhaust service A’s thread pool and bring service A down.
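To make the thread-pool exhaustion concrete, here is a minimal sketch using Python’s `concurrent.futures`. The names `slow_dependency` and `measure_queue_wait` are hypothetical and exist only for this illustration; the sleep stands in for a call to service B that is slow to respond.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_dependency(delay: float) -> str:
    """Hypothetical stand-in for a call to a dependency that is slow to respond."""
    time.sleep(delay)
    return "ok"


def measure_queue_wait(pool_size: int, in_flight: int, delay: float) -> float:
    """Return how long a new request waits before a worker thread picks it up."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Occupy worker threads with requests stuck on the slow dependency.
        for _ in range(in_flight):
            pool.submit(slow_dependency, delay)
        submitted_at = time.monotonic()
        # This request does no slow work itself; it only records when it starts.
        started_at = pool.submit(time.monotonic).result()
        return started_at - submitted_at


if __name__ == "__main__":
    # Every worker is busy: the new request queues behind the slow calls.
    print(f"pool exhausted: {measure_queue_wait(2, 2, 0.2):.3f}s")
    # Spare workers exist: the new request starts almost immediately.
    print(f"spare capacity: {measure_queue_wait(4, 2, 0.2):.3f}s")
```

The new request is cheap, yet when every worker is blocked it cannot even start until one of the slow calls finishes. In a real service, that queueing delay is what the callers of service A experience as timeouts.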

“Your code can’t just wait forever for a response that might never come, sooner or later, it needs to give up. Hope is not a design method.” -Michael T. Nygard, Release It!

Let’s walk through an example of a social media application to understand this better. Here we have an aggregator service, which is what the client interacts with; it aggregates results from a number of services, including the user service. The user service calls the photo service and the friends service, which in turn calls friends_db.

In this example, the aggregator, user, photos and friends services are working fine, but all requests to friends_db are timing out.

Here, the friends service makes requests to friends_db. However, friends_db does not respond with an immediate failure; instead, it keeps the friends service’s threads waiting. The friends service retries, thereby using even more threads. As new requests come in, more and more threads are waiting on friends_db to respond.

The friends service exhausts its thread pool, as all of its threads are waiting for a response from friends_db.

The friends service now becomes the source of timeouts for the user service. The user service exhausts its thread pool waiting for responses from the friends service, just as the friends service was waiting for friends_db. A failure in friends_db has caused a cascading failure in services that depend on it only indirectly.

The user service tries to call the friends service and ends up exhausting its thread pool.

Eventually the aggregator service will also go down for the same reason. The client calls the aggregator service, so our system is effectively shut down for the users. One error in one component of our architecture has caused a cascading failure that brings all the other services down.

The client can no longer reach the aggregator service, and effectively our whole system is down for the end users.

Circuit Breaker Pattern

A circuit breaker is usually implemented using the interceptor, chain of responsibility, or filter pattern. It has three states:

- **Closed**: All requests are allowed to pass to the upstream service, and the interceptor passes the upstream service’s response back to the caller.
- **Open**: No requests are allowed to pass to the upstream service, and the interceptor responds with a default response, usually an error response.
- **Half-Open**: Some of the requests are allowed to pass to the upstream service; the others are terminated and answered with the default response.

A state diagram of circuit breaker

Create a request interceptor

The circuit breaker is implemented as an interceptor that intercepts all requests from the user service to the friends service. In this picture it is in the “closed” state and allows all requests to pass to the friends service.

The circuit breaker switches to the “open” state when the number of failed calls to the friends service exceeds the failure threshold. It no longer allows requests from the user service to reach the friends service; instead, it responds immediately with a default response.

After a set “recovery timeout” period has passed, the circuit breaker switches to a “half-open” state, where it allows some of the requests to reach the friends service while the others are terminated and answered with the default response.
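The three states and their transitions can be sketched in a few lines of Python. This is a simplified illustration, not how any particular library implements it. `SimpleCircuitBreaker` and `CircuitOpenError` are hypothetical names, and the sketch makes some assumptions: it opens once the failure count reaches the threshold, it lets a single trial request through in the half-open state rather than a fraction of traffic, and it is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay open
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout:
                # Fail fast: the dependency is not called at all.
                raise CircuitOpenError("circuit is open")
            # Recovery timeout has elapsed: let a trial request through.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial request, or too many failures, (re)opens the circuit.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.state = State.CLOSED
```

A caller would wrap each outbound call, for example `breaker.call(friends_service.get_friends, user_id)`, and treat `CircuitOpenError` as the signal to return the default response. The important property is that an open circuit rejects the call without occupying a thread on a dependency that is known to be failing.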

Circuit breaking by wrapping service calls in a circuit breaker in code:

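from circuitbreaker import CircuitBreaker
# Assumption: friends_service makes its HTTP calls with the requests library.
from requests.exceptions import RequestException

class MyCircuitBreaker(CircuitBreaker):
    FAILURE_THRESHOLD = 20   # failures before the circuit opens
    RECOVERY_TIMEOUT = 60    # seconds to stay open before trying again
    EXPECTED_EXCEPTION = RequestException  # only these count as failures

@MyCircuitBreaker()
def get_friends(user_id):
  # Let RequestException propagate so the breaker can count the failure.
  return friends_service.get_friends(user_id)

def get_user_info(user_id):
  try:
    return get_friends(user_id)
  except Exception as e:
    # Also covers the error the breaker raises while the circuit is open.
    raise InternalServerError from e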

We can also leverage the sidecar pattern for this. In this approach we don’t have to modify our services by wrapping calls in circuit breakers; instead, we ship our applications with a sidecar proxy like Envoy. All outbound traffic from the service is proxied through Envoy, which supports circuit breaking out of the box. The following is an example circuit-breaking configuration for Envoy:

circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 1000
      max_requests: 1000
    - priority: HIGH
      max_connections: 2000
      max_requests: 2000
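Note that these Envoy thresholds are caps on concurrent connections and in-flight requests, rather than a failure count with a recovery timeout: once a cap is reached, further requests are rejected instead of being queued. The idea behind `max_requests` can be sketched in Python as follows. This is only an illustration of the concept, not Envoy’s implementation, and `RequestLimiter` and `RequestLimitExceeded` are hypothetical names.

```python
import threading


class RequestLimitExceeded(Exception):
    """Raised when the number of in-flight requests is already at the limit."""


class RequestLimiter:
    def __init__(self, max_requests: int):
        # One slot per request that is allowed to be in flight at once.
        self._slots = threading.BoundedSemaphore(max_requests)

    def call(self, func, *args, **kwargs):
        # Do not queue: reject immediately when no slot is free.
        if not self._slots.acquire(blocking=False):
            raise RequestLimitExceeded("too many in-flight requests")
        try:
            return func(*args, **kwargs)
        finally:
            # Free the slot whether the call succeeded or failed.
            self._slots.release()
```

Because excess requests fail immediately instead of waiting, a slow upstream service can tie up at most `max_requests` callers at a time, which bounds the thread-pool exhaustion described earlier.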

Written by Ganesh Iyer, a software engineer building platforms for leveraging artificial intelligence in healthcare.