The circuit breaker is a design pattern used extensively in distributed systems to prevent cascading failures. In this post, we’ll go through the problem of cascading failures and how the circuit breaker pattern is used to prevent them.
Motivation: The problem of cascading failures
Before jumping into the circuit breaker pattern, let’s try and understand what problem it tries to solve.
When service A tries to communicate with service B, it allocates a thread to make that call. Two kinds of failures can occur while making that call. We use the example of a user service making a call to a friends service.
''' user service '''
def get_user_info(user_id: str):
except Exception as e:
Immediate failures: In an immediate failure, an exception is raised immediately (for example, Connection Refused) and the service A thread is freed.
Timeout failures: In a timeout failure, service B takes a long time to respond, and the service A thread stays blocked until the call returns or times out. As new requests arrive at service A, more and more threads end up waiting for service B. If enough requests are made while those calls are still waiting, they can exhaust service A’s thread pool and bring service A down.
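To make the timeout case concrete, here is a minimal sketch that uses only Python's standard library. It is not the user service from this post: the pool size, the sleep duration, and every function name are assumptions chosen for illustration. It fills a small thread pool with calls to a slow dependency and then measures how long an unrelated, cheap request has to wait for a free thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2            # hypothetical size of service A's worker pool
SLOW_CALL_SECONDS = 0.3  # how long the unresponsive dependency holds a thread


def call_slow_dependency() -> str:
    """Stands in for a call to service B that hangs before giving up."""
    time.sleep(SLOW_CALL_SECONDS)
    return "timed out"


def handle_unrelated_request(submitted_at: float) -> float:
    """A cheap request that needs no dependency.

    Returns how long it sat in the queue before a thread picked it up.
    """
    return time.monotonic() - submitted_at


def measure_queue_wait() -> float:
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # Occupy every worker thread with a call stuck on the slow dependency.
        for _ in range(POOL_SIZE):
            pool.submit(call_slow_dependency)
        # This request is cheap, but no thread is free to run it yet.
        unrelated = pool.submit(handle_unrelated_request, time.monotonic())
        return unrelated.result()


if __name__ == "__main__":
    print(f"unrelated request waited {measure_queue_wait():.2f}s for a thread")
```

The unrelated request does no slow work of its own, yet it cannot start until one of the blocked calls finishes. With a real timeout of tens of seconds and a steady stream of incoming requests, the same effect is what takes service A down.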
“Your code can’t just wait forever for a response that might never come, sooner or later, it needs to give up. Hope is not a design method.” - Michael T. Nygard, Release It!
Let’s walk through an example of a social media application to understand this better. Here we have an aggregator service, which is what the client interacts with. It aggregates results from a number of services, including the user service. The user service calls the photo service and the friends service, which in turn calls friends_db.
Suppose the friends service tries to make requests to friends_db, but friends_db does not respond with an immediate failure and instead keeps the friends service’s threads waiting. The friends service retries, thereby using more threads. As new requests arrive, more threads end up waiting for friends_db to respond.
We can now see how the friends service becomes the source of timeouts for the user service. The user service exhausts its thread pool waiting for responses from the friends service, just as the friends service was waiting for friends_db. A failure in friends_db has caused a cascading failure in services that depend on it only indirectly.
Eventually the aggregator service will also come down for the same reason. The client calls the aggregator service, so our system is effectively shut down for users. One failure in one component of our architecture has caused a cascading failure that brings all the other services down.
Circuit Breaker Pattern
A circuit breaker is usually implemented as an interceptor (a chain-of-responsibility or filter pattern). It has three states:
- **Closed**: All requests are allowed to pass to the upstream service, and the interceptor passes the upstream service's response back to the caller.
- **Open**: No requests are allowed to pass to the upstream service, and the interceptor responds with a default response, usually an error.
- **Half-Open**: Some requests are allowed to pass to the upstream service; the rest are terminated and answered with the default response.
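The three states above can be sketched as a small state machine. This is a simplified illustration, not the implementation of any particular library: the class name `SimpleCircuitBreaker`, the `CircuitOpenError` exception, and the injectable `clock` parameter are all hypothetical. It also assumes the common transition rules (open after a number of consecutive failures, move to half-open once a recovery timeout has elapsed, close again on a successful trial call) and ignores thread safety.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Default response returned to the caller while the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        state = self.state
        if state == "open":
            # Fail fast: the upstream service is not called at all.
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in half-open reopens the circuit immediately;
            # in closed state we open only once the threshold is reached.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound call, for example `breaker.call(fetch_friends, user_id)` with a hypothetical `fetch_friends` function, and treat `CircuitOpenError` as the default error response. Because an open circuit fails immediately instead of holding a thread until a timeout, the caller's thread pool stays available while the dependency recovers.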
Create a request interceptor
We can add circuit breaking by wrapping service calls in a circuit breaker in code:
from circuitbreaker import CircuitBreaker
FAILURE_THRESHOLD = 20
RECOVERY_TIMEOUT = 60
EXPECTED_EXCEPTION = RequestException
except Exception as e:
We can also leverage the sidecar pattern to do this. In this approach we don’t have to modify our services by wrapping their calls in circuit breakers; instead, we ship our applications with a sidecar like Envoy. All outbound traffic from the service is proxied through Envoy, which supports circuit breaking out of the box. Following is an example configuration of circuit breaking with Envoy:
- priority: DEFAULT
- priority: HIGH
- circuitbreaker Python library: https://pypi.org/project/circuitbreaker/
- Release It! (book): https://books.google.com/books/about/Release_It.html?id=Ug9QDwAAQBAJ&source=kp_book_description
- Circuit breaking in Envoy: https://www.envoyproxy.io/learn/circuit-breaking