
Throttling in Distributed Systems


Throttling is one of the three effective methods for protecting a high-concurrency system; the other two are caching and downgrading. Throttling is used in many scenarios to limit concurrency and the number of requests. Our service handles tens of millions of TPS, with tens of thousands of hosts serving traffic, and throttling is vital for such a large distributed service.

Why throttle

To preserve service availability and security (e.g. against crawlers and brute forcing) and to reduce expenses: when traffic reaches a certain rate, throttle and degrade dependencies appropriately to protect the core system.

When to throttle

Throttling is an important part of the threat model; below are some examples:

*Scenario#1: A single customer suddenly starts cleaning up their resources, generating millions of TPS of requests to query/delete cold data. We should reduce their request frequency to avoid impacting other users.*

*Scenario#2: Multiple customers from the same IP generate a huge amount of traffic against a certain API. Most likely someone is attacking your system with an account pool, and we should reject their requests once they reach a certain limit.*

*Scenario#3: A customer creates a lot of objects in your system (in memory) without deleting them. We should reject their creation requests until objects have been deleted or have expired (and been removed from memory).*

*Scenario#4: A backend host has too many threads processing a customer's resources. We should stop the customer from triggering the processing API to avoid having too many threads waiting.*

*Scenario#5: An internal service calls the backend service too frequently. We should degrade that dependency appropriately to protect the backend system.*

How

Standalone Throttling

Fixed Window

Count the number of requests as QPS, compare it with the limit, and reset the counter every second. Code example:

// Assumed fields: TIME_WINDOWS is the window length in milliseconds, QPS is the limit,
// START_TIME marks the start of the current window, and REQ_COUNT is an AtomicInteger counter.
public static synchronized boolean tryAcquire() {
    long now = System.currentTimeMillis();
    if ((now - START_TIME) > TIME_WINDOWS) {   // the current window has expired
        START_TIME = now;                      // start a new window
        REQ_COUNT.set(1);                      // count this request in the new window
        return true;
    }
    return REQ_COUNT.incrementAndGet() <= QPS; // allow only while under the limit
}

We can also have a separate thread reset the counter every second instead of checking the window boundary on each request.
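A minimal sketch of that variant, assuming the same REQ_COUNT and QPS fields as the snippet above, using a ScheduledExecutorService to zero the counter once per second:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ScheduledFixedWindow {
    private static final int QPS = 100;                              // assumed limit per window
    private static final AtomicInteger REQ_COUNT = new AtomicInteger(0);

    static {
        // Reset the counter at the start of every one-second window.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> REQ_COUNT.set(0), 1, 1, TimeUnit.SECONDS);
    }

    public static boolean tryAcquire() {
        return REQ_COUNT.incrementAndGet() <= QPS;                   // no per-request window check needed
    }
}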

The fixed window has a deficiency: it cannot guarantee the limit over every one-second interval, because a burst at the end of one window and another at the start of the next can exceed the limit within a single second. The sliding window was designed to mitigate this issue.

Sliding window

The window is divided into multiple slots, where slotTime = windowSize / slotCount. If the request time (the current time) has moved past a slot's time range, that slot is discarded and a new one is created (the slot is reset). By doing this, the window moves forward one slot at a time:

// Assumed fields: WINDOW_SIZE and SLOT_SIZE in milliseconds, qps is the limit,
// and slots is an array of Slot objects, each holding a start time and an AtomicInteger count.
public synchronized boolean tryAcquire() {
    long currentTimeMillis = System.currentTimeMillis();
    int slotIndex = (int) (currentTimeMillis % WINDOW_SIZE / SLOT_SIZE);
    int sum = 0;
    for (int i = 0; i < slots.length; i++) {
        Slot slot = slots[i];
        if ((currentTimeMillis - slot.getStartTime()) > WINDOW_SIZE) { // sliding: this slot is stale, reset it
            slot.getCount().set(0);
            slot.setStartTime(currentTimeMillis);
        }
        if (slotIndex == i && slot.getCount().get() < qps) {           // count the request in the current slot
            slot.getCount().incrementAndGet();
        }
        sum += slot.getCount().get();                                  // total over the whole window
    }
    return sum <= qps;
}

The sliding window does not solve the fixed window's issue completely: there is still a small chance that QPS can be higher than expected, because the window moves forward step by step. The smaller the slot size, the smaller that chance, but small slots are resource-consuming, especially for large time windows such as one minute or longer.

Leaky bucket

The leaky bucket algorithm can ensure that the QPS limit is not exceeded. It is based on the producer-consumer model: the bucket has a fixed size, each new request is a drop of water added to the bucket (the producer), and requests are throttled when the bucket is full. By leaking water at a fixed rate and quantity (the consumer), the rate limit is enforced.

// Assumed fields: BUCKET_SIZE is the bucket capacity, water is the current level,
// qps is the leak rate, LEAKS_INTERVAL_IN_MILLIS is the leak interval, and
// lastLeakTimestamp records the last time water was leaked.
public synchronized boolean tryConsume(int drop) {
    tryProduce();                              // leak water that has drained since the last call
    if (water + drop > BUCKET_SIZE) {          // bucket full: throttle
        return false;
    }
    water = water + drop;                      // add this request's drop to the bucket
    return true;
}

private void tryProduce() {
    long currentTimeMillis = System.currentTimeMillis();
    if (currentTimeMillis > lastLeakTimestamp) {
        long leak = ((currentTimeMillis - lastLeakTimestamp) / LEAKS_INTERVAL_IN_MILLIS) * qps;
        if (leak > 0) {
            water = Math.max(0, water - leak); // leak at a constant rate
            this.lastLeakTimestamp = currentTimeMillis;
        }
    }
}

The leaky bucket solves the QPS-exceeded problem, but it cannot absorb bursts because the leak rate is constant. For example, if QPS = 2 and two requests come in at the same time, one of them will be throttled, because only one request can be processed in each 1s/2 = 500ms interval.

Token Bucket

The token bucket is the most commonly used algorithm for traffic shaping and rate limiting. It is the reverse of the leaky bucket: the leaky bucket forcibly limits the transmission rate, while the token bucket limits the average transmission rate but also allows a certain degree of burst. With a token bucket, as long as there are tokens in the bucket, data is allowed to be transmitted in bursts until the user-configured threshold is reached, so it is suitable for traffic with burst characteristics.

Google Guava's RateLimiter provides out-of-the-box APIs to implement that.
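For example, a short sketch of how RateLimiter might be used (the rate of 5 permits per second is just an illustrative value):

import com.google.common.util.concurrent.RateLimiter;

public class GuavaRateLimitDemo {
    // Up to 5 permits per second on average; unused capacity can absorb small bursts.
    private static final RateLimiter LIMITER = RateLimiter.create(5.0);

    public static void handleRequest(Runnable request) {
        if (LIMITER.tryAcquire()) {   // non-blocking: returns false when no permit is available
            request.run();
        } else {
            // throttled: reject, queue, or ask the caller to retry later
        }
    }
}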

Sliding Log

The sliding log is for scenarios that require precise control (that is, the number of requests within any given time window T must be less than or equal to N). For precise control, you need to log each user request; on each throttling decision, you count the log entries within the latest time window and check whether the count exceeds the threshold. This is the idea of the sliding log algorithm. As you can imagine, this algorithm is precise, but it costs a lot of memory.
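A minimal sketch of the idea, assuming the limit N applies over a window of T milliseconds for one key (the class and field names are illustrative):

import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingLogLimiter {
    private final Deque<Long> log = new ArrayDeque<>(); // timestamps of accepted requests
    private final int limit;                            // N: max requests per window
    private final long windowMillis;                    // T: window length

    public SlidingLogLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Drop log entries that have fallen out of the latest window.
        while (!log.isEmpty() && now - log.peekFirst() > windowMillis) {
            log.pollFirst();
        }
        if (log.size() < limit) {
            log.addLast(now);   // record this request
            return true;
        }
        return false;           // already N requests in the last T ms
    }
}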

Counting

The algorithms above are mostly used for rate limiting at the network or gateway level. If we throttle inside the server, it is more straightforward, because we know when the transaction ends and can return the token when it finishes.

A typical way to implement this throttling is a ConcurrentHashMap storing a <key, count> structure, where the key can be customerId::operation and the count is the number of actions currently in processing. Each operation that comes into the system checks the count for its key, and if it exceeds a certain limit, the operation is throttled.
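A rough sketch of this counting approach (the key format and the limit are assumptions for illustration); the count must be released when the operation finishes:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class InFlightCounter {
    private final ConcurrentMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();
    private final int limit; // max concurrent operations per key, e.g. per customerId::operation

    public InFlightCounter(int limit) {
        this.limit = limit;
    }

    public boolean tryAcquire(String key) {
        AtomicInteger count = counts.computeIfAbsent(key, k -> new AtomicInteger(0));
        if (count.incrementAndGet() > limit) {
            count.decrementAndGet(); // over the limit: undo and throttle
            return false;
        }
        return true;
    }

    public void release(String key) {
        AtomicInteger count = counts.get(key);
        if (count != null) {
            count.decrementAndGet(); // return the "token" when the transaction ends
        }
    }
}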

There are some notes from my learning:

  • Similar to tokens, the JDK's Semaphore is designed to limit the number of concurrent threads accessing a specific resource (see the sketch after this list).
  • Besides ConcurrentHashMap, Guava's ConcurrentHashMultiset can also be used as a more powerful counter.
  • Google Guava Cache is an LRU cache that works better than ConcurrentHashMap or ConcurrentHashMultiset for this purpose; this article (in Chinese) explains how to use it very clearly.
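As an example of the first note, a small sketch using java.util.concurrent.Semaphore to cap concurrent access (the permit count of 10 is only illustrative):

import java.util.concurrent.Semaphore;

public class SemaphoreThrottle {
    // Allow at most 10 threads into the protected section at once (illustrative value).
    private static final Semaphore PERMITS = new Semaphore(10);

    public static boolean process(Runnable task) {
        if (!PERMITS.tryAcquire()) {
            return false;           // throttled: no permit available
        }
        try {
            task.run();
            return true;
        } finally {
            PERMITS.release();      // always return the permit when the work finishes
        }
    }
}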

Override rate limit

Normally a customer-based throttle should be overridable through a hot feature toggle, in case some customers genuinely need a high volume of traffic (think of paying VIP customers).

How do we implement the hot feature toggle? It is not hard: keep a map-structured cache in memory and always read the configuration from that cache when needed. Another thread updates the cache from a config service at a regular interval (say, every 5 minutes); if the config service is unreachable, the degradation strategy is to read from a local file.
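A rough sketch of such a hot toggle cache, assuming a hypothetical ConfigService client and a local fallback file (all of the names here are illustrative, not a specific library):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThrottleConfigCache {
    private final Map<String, Integer> limits = new ConcurrentHashMap<>(); // customerId -> QPS override
    private final ConfigService configService;                             // hypothetical remote config client
    private final Path fallbackFile;                                       // local degradation source

    public ThrottleConfigCache(ConfigService configService, Path fallbackFile) {
        this.configService = configService;
        this.fallbackFile = fallbackFile;
        // Refresh from the config service every 5 minutes; fall back to the local file on failure.
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(this::refresh, 0, 5, TimeUnit.MINUTES);
    }

    public int limitFor(String customerId, int defaultLimit) {
        return limits.getOrDefault(customerId, defaultLimit); // always served from memory
    }

    private void refresh() {
        try {
            limits.putAll(configService.fetchLimits());                // hypothetical API
        } catch (Exception e) {
            try {
                for (String line : Files.readAllLines(fallbackFile)) { // e.g. "customerId=limit"
                    String[] parts = line.split("=");
                    limits.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
                }
            } catch (IOException ignored) {
                // keep serving the last known good configuration
            }
        }
    }

    public interface ConfigService {
        Map<String, Integer> fetchLimits() throws Exception;
    }
}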

This mechanism is also useful for feature toggles and allow/deny lists.

Distributed Throttling

Truly distributed throttling normally increases complexity a lot and has limitations on the number of hosts, so be extra cautious when considering it. Instead, I am going to introduce some centralized throttling strategies for distributed systems.

Gateway Throttling

In most cases, if the gateway/load balancer provides a throttling feature, that is the best option because it stops traffic before it hits our servers. However, the gateway may be limited in how far it can be customized, for example in throwing a retryable exception, degrading a dependency, or determining whether the request was authorized.

DB-based token bucket

With per-host throttling, the overall capacity changes whenever the fleet size changes, so it is hard to tell the customer exactly what their limit is.

For example, in Scenario#3 we can use the database as the counter: each time an operation is initiated, we query the object count from the DB and reject the request if it exceeds the limit. However, querying the DB directly generates a heavy load on the database and adds too much latency, so we can add an in-memory cache that expires after a certain time (e.g. 1 minute); when a read cannot find the object count in the cache, it falls back to the DB.
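A sketch of that pattern using Guava's LoadingCache with a one-minute expiry; ObjectCountDao is a hypothetical data-access interface standing in for the real DB query:

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

public class ObjectCountThrottle {
    private final LoadingCache<String, Long> countCache;
    private final long maxObjectsPerCustomer;

    public ObjectCountThrottle(ObjectCountDao dao, long maxObjectsPerCustomer) {
        this.maxObjectsPerCustomer = maxObjectsPerCustomer;
        this.countCache = CacheBuilder.newBuilder()
                .expireAfterWrite(1, TimeUnit.MINUTES)       // tolerate up to 1 minute of staleness
                .build(new CacheLoader<String, Long>() {
                    @Override
                    public Long load(String customerId) {
                        return dao.countObjects(customerId); // hits the DB only on a cache miss
                    }
                });
    }

    public boolean allowCreate(String customerId) {
        return countCache.getUnchecked(customerId) < maxObjectsPerCustomer;
    }

    public interface ObjectCountDao {
        long countObjects(String customerId);
    }
}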

How to Implement Rate Limiting

From the server's perspective, two things are required: threat modeling and performance testing.

  • Based on threat modeling, decide on an identity to throttle on, e.g. IP address, client name, or API key.
  • Based on performance-testing results, determine your application's traffic-volume breakpoint.
  • Choose a throttling algorithm, and either use a rate-limiting library or build your own.

When throttled clients retry their connections too frequently, the service can struggle to recover; this is called a "retry storm". There are some practices to prevent it on the client side:

  • Use retry policies with (see the sketch after this list):
    • Exponential backoff: wait 1s, 2s, 4s, … before retrying
    • Jitter: add a random number of milliseconds to each sleep period
    • Retry limit: retry 3-5 times at most
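A minimal client-side sketch of exponential backoff with jitter and a retry cap; ThrottledException and Call are placeholders for whatever error type and call interface your client actually uses:

import java.util.concurrent.ThreadLocalRandom;

public class BackoffRetry {
    private static final int MAX_RETRIES = 3;          // retry 3-5 times at most
    private static final long BASE_DELAY_MS = 1000;    // 1s, 2s, 4s, ...

    public static <T> T callWithRetry(Call<T> call) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.execute();
            } catch (ThrottledException e) {           // placeholder for the service's throttling error
                if (attempt >= MAX_RETRIES) {
                    throw e;                           // give up after the retry limit
                }
                long backoff = BASE_DELAY_MS << attempt;                    // exponential backoff
                long jitter = ThreadLocalRandom.current().nextLong(0, 100); // random jitter in ms
                Thread.sleep(backoff + jitter);
            }
        }
    }

    public interface Call<T> {
        T execute() throws Exception;
    }

    public static class ThrottledException extends Exception {
    }
}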

Most importantly, if you are not using the standard SDK provided by the service provider, you should be fault tolerant yourself: make sure your state stays consistent and roll back properly if an API call receives a throttling exception.

Reference