A good article by William Louth (Twitter @williamlouth) talks about adaptive systems versus circuit breaker patterns: https://www.jinspired.com/site/jxinsight-opencore-6-4-ea-11-released-adaptive-safety-control
I'd like to use this issue to explore whether there are principles here that can be applied to Hystrix, since Twitter is too limited for the discussion and that blog doesn't take comments.
Some background for the discussion (brain dump so forgive bad grammar and stupid thoughts):
Circuit Breaker
The circuit breaker pattern gets way too much credit in Hystrix. It is the concept people seem to grab onto, but it is a very minor aspect of the Hystrix implementation, and I have quite publicly stated that it is just "icing on the cake" and a "release valve" when the underlying system is known to be bad.
The reason (as William states in his article) is that it's an all-or-nothing affair and it is reactive.
For example, at the velocity of traffic for the most used HystrixCommands on the Netflix API system, we would be dead by the time a circuit breaker could do anything about it.
The concurrent execution limits via semaphores or thread pools are the actual thing that prevents underlying latency from saturating all application resources.
System Characteristics
Hystrix was obviously designed for the use cases Netflix has and doesn't necessarily apply to different architectures or scale down very well.
Some of these factors are:
- most backend systems are accessed via synchronous client libraries over blocking IO (some use non-blocking IO which Hystrix will better support at some point as they have different resource utilization and failure scenarios - Asynchronous Executables #11)
- application clusters scale instance counts from ~600 cores to ~3200 cores each day (number of instances depends on instance type ... but it can be ~100 on the low end to 1200+ on the high end)
- the backend consists of 150+ different functional services (modeled as HystrixCommand instances) in 40+ groups (representing backend system resources, generally modeled as thread pools for isolation)
- some functionality has good fallbacks, some can fail and cause graceful degradation of the user experience, and others must fail fast as they are required
Behavior - Adaptive, Proactive, Reactive?
This depends on which aspect of Hystrix is being looked at ...
_Concurrency Limits and Timeouts_
These are the proactive portion of Hystrix - they prevent anything from going beyond limits and throttle immediately. They don't wait for statistics or for the system to break before doing anything. They are the actual source of protection given by Hystrix - not circuit breakers. (see diagram for the decision flow: https://github.com/Netflix/Hystrix/wiki/How-it-Works#wiki-Flow)
We configure concurrency limits using semaphores or thread pools based on simple math: 99th percentile latency when the system is healthy × peak RPS = needed concurrency. Timeouts are set with similar logic.
The goal is to get the order of magnitude right, not an exact value. The principle is that if something needs 2-3 concurrent threads/connections, let's give it 10, but not 50, 100 or 200. This constrains what happens when the median becomes the 99th and the 99th multiplies.
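For concreteness, here is a rough sketch (not our actual config) of how that sizing math might translate into a command definition - the service name, latency, and RPS figures are hypothetical:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

// Hypothetical numbers: healthy 99th percentile latency ~200ms, peak ~30 RPS per instance.
// 0.2s * 30 req/s ≈ 6 concurrent executions needed -> give it 10, not 50 or 100.
public class GetRatingsCommand extends HystrixCommand<String> {

    public GetRatingsCommand() {
        super(Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("RatingsService"))
            .andThreadPoolPropertiesDefaults(
                HystrixThreadPoolProperties.Setter()
                    .withCoreSize(10)) // concurrency limit: order of magnitude above p99 * peak RPS
            .andCommandPropertiesDefaults(
                HystrixCommandProperties.Setter()
                    .withExecutionIsolationThreadTimeoutInMilliseconds(500))); // timeout a bit above healthy p99
    }

    @Override
    protected String run() throws Exception {
        return callBackendService(); // the actual blocking client library call
    }

    @Override
    protected String getFallback() {
        return "stale-or-default-ratings"; // graceful degradation when throttled, timed out or failed
    }

    private String callBackendService() {
        return "ratings"; // placeholder for the real client call
    }
}
```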
We have considered making this value adaptive - using metrics over time to adjust it dynamically. In practice, though, we have not found a need to do this, as systems rarely change behavior enough to need a change in their order of magnitude once first configured, and if they do it is an obvious consequence of a system modification (new code push, etc.).
Making it adaptive would have to take into account long-term trends (at least 24 hours, probably longer), otherwise a slow increase in latency and concurrency needs could "boil the frog" and allow the limit to slowly rise until it has stopped providing the protection it was put there for in the first place.
We have found, after operating 150+ commands at high volume for over a year, that we don't have a desire for adaptive adjustment of the concurrency limits or timeouts, as that complicates the system, makes reasoning harder, and opens up the ability for "drift" to raise the limits over time and expose vulnerability.
_Circuit Breaker_
Circuit breakers are reactive. They kick in after statistics show a HystrixCommand to be in a bad state (resulting from failures, timeouts, concurrency throttling, etc). It's a release valve to skip what we have statistically determined to be bad instead of trying every time. It helps the underlying system by reducing load and gets the user a response (fallback or failure) faster.
On a single application instance this is not a very "adaptive" thing - it's an on/off switch.
However, at scale it is actually quite adaptive because each HystrixCommand on each instance makes independent decisions.
What we see in practice in production is that when a backend has an error rate high enough to cause issues, but not high enough to shut the entire thing down, the circuits on individual instances trip open/closed in a rolling fashion back and forth across the fleet, as each instance uses its own view of the world to make decisions.
This screenshot of one circuit during such a situation demonstrates the behavior. Note the circuit opening/closing and how the different counts represent different types of throttling and rejection occurring while most traffic is still successful.
The fleet very dynamically reduces load so that a percentage of traffic can succeed, backs off, tries again, and so on.
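To make the "statistics show a bad state" part concrete, here is a sketch of the properties that govern when a circuit flips - the values are illustrative, not our production tuning:

```java
import com.netflix.hystrix.HystrixCommandProperties;

class CircuitBreakerTuning {
    // Illustrative values: the circuit looks at a rolling statistical window and opens
    // once the error percentage crosses a threshold, then lets a single trial request
    // through after the sleep window to decide whether to close again.
    static final HystrixCommandProperties.Setter CIRCUIT_CONFIG = HystrixCommandProperties.Setter()
        .withMetricsRollingStatisticalWindowInMilliseconds(10000) // 10s rolling window of stats
        .withCircuitBreakerRequestVolumeThreshold(20)              // ignore windows with too little traffic
        .withCircuitBreakerErrorThresholdPercentage(50)            // open when >= 50% of requests fail
        .withCircuitBreakerSleepWindowInMilliseconds(5000);        // wait 5s before a trial request
}
```

Per instance this is still just an on/off decision; the "adaptive" appearance only emerges from many instances making it independently.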
We have many times considered if we should make the logic of the circuit breaker more "adaptive" like a control valve that constrains a percentage of traffic depending on algorithms and statistics.
Every time we consider it we decide not to, because it makes reasoning about the system harder and because at our scale we already effectively get this behavior due to the large size of the fleet.
When Hystrix Doesn't Work
The principles above will not work well if the cluster size is 1 or a very small number of boxes. In that case a more adaptive algorithm would likely be preferable to the on/off switch of a circuit breaker - or just turn off the circuit breaker and rely on the concurrency/timeout protection alone.
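As a sketch of that last option - circuit breaker off, proactive protection kept - something like the following could work (the values are again hypothetical):

```java
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

class SmallClusterTuning {
    // Hypothetical settings for a 1-2 box deployment: skip the reactive on/off circuit
    // and rely only on timeouts and the bounded thread pool to contain latency.
    static final HystrixCommandProperties.Setter COMMAND_CONFIG = HystrixCommandProperties.Setter()
        .withCircuitBreakerEnabled(false)                          // no on/off switch on a tiny cluster
        .withExecutionIsolationThreadTimeoutInMilliseconds(500);   // still bound how long a call can take

    static final HystrixThreadPoolProperties.Setter THREAD_POOL_CONFIG = HystrixThreadPoolProperties.Setter()
        .withCoreSize(10);                                         // still cap concurrent executions
}
```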
Also, if an application only has 2 or 3 critical backend components without any reasonable fallbacks or graceful degradation, then Hystrix won't be able to help much. Constraining the single service your app needs breaks the user experience - the only value it would then give is very detailed, low-latency metrics and quick recovery when the backend service comes back to life - but it won't help the app stay resilient during the problem, since there's nothing else to do.
With the above thoughts on the matter I'm curious as to where more "adaptive" approaches would make sense.
Do they provide benefit to a large system like Netflix or just make a more complicated version of the same end result?
Are they critical to make something like Hystrix work on a smaller system?
Even with an adaptive approach and a "valve" instead of a "circuit", it still means shedding load, failing fast, and doing fallbacks. Is that any different from what already happens with circuits opening/closing independently and rolling around a large fleet?
Other than the circuit breaker (which is already a limited aspect of Hystrix) where else would this concept apply?
Thoughts, insights, (intelligent) opinions, etc. welcome ... I'm interested in whatever the best ideas are for operating a resilient system.