• duuyidong@gmail.com

Troubleshooting in a Large System

Our server handles 75M transactions per day, which generates a lot of ops work. I'm often asked, "What do you guys do when you're on call?" This blog will demonstrate how we do troubleshooting in a distributed system.

When we start

There are three triggers for investigation that help us find issues in the system:

  • Alarm-cut ticket: This is the most common case; alarms help us notice an issue before the customer faces it, ideally as soon as possible.
  • Customer-cut ticket: Internal customers can cut us a ticket directly, and external customers can cut us a ticket via operational support. It could be a client-side problem, an edge case in a recently launched feature, or a capacity issue.
  • Dashboard investigation: We have a weekly dashboard investigation: a manager looks for suspicious metric fluctuations and assigns them to on-call engineers for investigation. These fluctuations may be too small or too brief to trigger an alarm, but early investigation can prevent a greater failure later.

One of the things I like about AWS is the Weekly Operational Metrics Meeting. In the meeting, the leadership team randomly picks some services and asks questions about suspicious metrics. This mechanism pushes every service team to take each metric fluctuation seriously, which leads to a good customer experience.

And of course, every investigation produces an RC (Root Cause) and AIs (Action Items). Every time we take action to address an issue with wide customer impact (or a potential one), we have an Ops Win, another important thing to share across teams/orgs.

In my opinion, early investigation and Ops Win sharing are the two pillars of AWS operational excellence.

Availability drop

I’m going to present the troubleshooting process through a typical dashboard investigation - availability drop.

Our server has an average availability of 99.999%, and availability below 99.99% triggers a sev2 alarm that the on-call must act on immediately. The investigation here is for an issue that caused our availability to drop below 100% without reaching the 99.99% threshold.

Given 75M transactions per day, a 0.001% error rate means 750 failed requests, potentially affecting multiple users, so it is definitely worth investigating.

How do we define availability?

Availability is the percentage of HTTP responses that did not fail with a 5XX status code.

Our service uses a Handler Chain for context injection, metric emission, rate limiting, authentication, and serialization/deserialization. In that chain we have a MetricsHandler that checks the HTTP response status in its after() method; if the status is 500 or greater, it increments the InternalError counter. The metric emitter regularly pushes these counters to the metric service, where they are aggregated, displayed as time-series line charts, and used to trigger the alarms.
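As a rough sketch of how such a handler might work (class, method, and metric names here are illustrative, not our real implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical MetricsHandler: counts 5XX responses so a background
// emitter can periodically flush the counters to the metric service.
public class MetricsHandler {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Called by the handler chain after the response has been produced. */
    public void after(int httpStatusCode) {
        if (httpStatusCode >= 500) {
            counters.computeIfAbsent("InternalError", k -> new LongAdder())
                    .increment();
        }
    }

    /** Read a counter's current value (used by the metric emitter). */
    public long count(String metric) {
        LongAdder adder = counters.get(metric);
        return adder == null ? 0 : adder.sum();
    }
}
```

LongAdder is used instead of AtomicLong because counters are write-heavy and contended across request threads, while reads happen only at emission time.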

Each microservice has its own availability, and the API service's availability represents the overall availability of the system because it is customer-facing.

So, what action should we take when an availability drop happens?

Step#1 Check the pattern

All metrics use time as the x-axis, and there are different patterns to check:

  • Does the pattern match a recent deployment?
  • Does the pattern match some customer behavior?
  • Does the metric show periodic fluctuations in units of time (daily/weekly)?

We have different dashboards to check, some commonly used are:

  • Deployment dashboards show deployment events alongside latency, request volume, and error-rate changes.
  • Customer dashboards show the volume of traffic from top customers and throttling, categorized by customer ID and operation.
  • Dependency dashboards show each dependency's availability, latency, error rate, etc.

The pattern cannot tell you the RC, but it can give you a direction: it narrows down the problem type and the impact scope.

Step#2 Request log diving

Before starting log diving, let's answer two questions: what is the log, and how is it collected?

What is the log?

Here is an example:
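(The original example image is not reproduced here; a hypothetical request-log entry in this shape, with purely illustrative property names and values, might look like the following.)

```
StartTime=2022-xx-xxT08:00:00Z Operation=GetItem StatusCode=200 Counters={InternalError=0} Timing={authenticate=2, deserialize=1, handleRequest=13}
```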


Each request entry contains multiple key-value pairs called properties. Property keys follow the same pattern, so they can be used as an index; and because they repeat, the logs compress well while remaining readable after decompression. A property's value can be a string or a map; for example, Timing is a map with function names as keys and latencies as values.

How are logs collected?

We have a ThreadLocal object for each user session called activityContext (I've written another article about how it is implemented, in Chinese), which represents the context of each request, including the customer identity and a metric object. The metric object holds all the properties constructed by the different handler chains/call stacks, and is written to the log file by the LogHandler.
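A minimal sketch of the idea, assuming a simplified ActivityContext that carries only properties and timing (the real one holds much more, such as customer identity):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-request context held in a ThreadLocal; handlers along
// the chain accumulate properties, and a LogHandler-like component writes
// the result to the request log at the end of the request.
public class ActivityContext {
    private static final ThreadLocal<ActivityContext> CURRENT =
            ThreadLocal.withInitial(ActivityContext::new);

    // Properties accumulated by different handlers/call stacks.
    private final Map<String, Object> properties = new LinkedHashMap<>();
    // Timing map: function name -> latency in milliseconds.
    private final Map<String, Long> timing = new LinkedHashMap<>();

    public static ActivityContext current() { return CURRENT.get(); }

    // Must be called when the request finishes, or thread-pool reuse
    // would leak one request's context into the next request.
    public static void clear() { CURRENT.remove(); }

    public void putProperty(String key, Object value) { properties.put(key, value); }
    public void recordTiming(String function, long millis) { timing.put(function, millis); }

    /** Serialize into one request-log line (simplified format). */
    public String toLogLine() {
        return "Properties=" + properties + " Timing=" + timing;
    }
}
```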

We have a separate Timber daemon thread that monitors the request log (as well as other logs) and pushes them to the Timber server on a schedule. The Timber server provides an API and a user interface for querying and downloading logs.

By doing this, we aggregate logs from millions of hosts into a single server. You can think of the logs as records in a relational database table: each property key is a column name, and each log entry is a row. We can query the logs with SQL, for example:

SELECT Operation, count(1) AS Count
FROM timber_api_service_request_logs
WHERE StatusCode >= 500
GROUP BY Operation;

This SQL queries all requests with a 5xx response and groups them by operation. We can also group by other properties to narrow down the problem, for example, finding the hosts with the most internal errors in a time range (assuming the host is recorded in a HostName property):

SELECT HostName, count(1) AS Count
FROM timber_api_service_request_logs
WHERE StartTime >= datetime('2022-xx-xx') AND StartTime <= datetime('2022-xx-xx')
  AND Counters#InternalError > 0
GROUP BY HostName
ORDER BY Count DESC;

If we find that all the internal errors happened on one or a few hosts, that normally means those hosts are corrupted and might need to be patched or replaced. The action to take is to stayDown the host (remove it from the LB cluster and terminate the Java process) and move on to hardware/OS analysis.

If it’s not obviously caused by a bad host, we’ll need to dig further into every single host.

Step#3 Per host metrics

Once we have identified the host with the issue, we can check the per-host metrics, including CPU, memory, disk, I/O, and JVM GC. The hardware metrics are mature; the JVM metrics, however, may need some customization.

How are JMX metrics collected?

The JVM provides JMX (Java Management Extensions) as a standard API for Java application monitoring. Our application has a thread that regularly polls the JMX metrics, aggregates them, and pushes them to the metric server; it is controlled by a bean integrated at the web framework level.
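A simplified example of reading such metrics through the standard java.lang.management MXBeans (a sketch only; the real poller runs on a schedule and pushes the snapshot to the metric server):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Polls heap and GC metrics via JMX platform MXBeans. Metric names here
// are illustrative; the real names (e.g. G1OldGenInUse) are custom.
public class JmxPoller {
    public static Map<String, Long> snapshot() {
        Map<String, Long> metrics = new LinkedHashMap<>();

        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        metrics.put("HeapUsedBytes", memory.getHeapMemoryUsage().getUsed());

        // One bean per collector, e.g. "G1 Young Generation" / "G1 Old Generation".
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            metrics.put(gc.getName() + ".Count", gc.getCollectionCount());
            metrics.put(gc.getName() + ".TimeMs", gc.getCollectionTime());
        }
        return metrics;
    }
}
```

A scheduled executor calling snapshot() once a minute and pushing the map to the metric service is enough to produce the per-host JVM graphs described above.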

After the JMX metrics are emitted, we can view them the same way as other metrics. In our case, we found that the JVM on the affected hosts halted for 10 minutes during the event; more specifically, the G1OldGenInUse metric dropped to 0 for 10 minutes.

This indicates the JVM was restarted for some reason, which can be caused by a host reboot or a deployment. However, after checking other metrics, it turned out these were not the RC, so we moved on to application log diving.

Step#4 Application log diving

Application logs are regular logs formatted by log4j and appended directly to a log file. Like request logs, they are pushed to the Timber service on a schedule, but the difference is that they have no fixed format (some lines are pure call stacks), so they are hard to query with SQL. We need to download the logs from the Timber server and do string searches locally.
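As an illustration of the local string search, here is a small helper equivalent to grepping an unarchived log file for "Exception" or "ERROR" (the class name and filter terms are my assumptions, not a real tool):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Scans a downloaded application-log file for error lines, roughly
// equivalent to: grep -E 'Exception|ERROR' app.log
public class LogGrep {
    public static List<String> errorLines(Path logFile) throws IOException {
        try (Stream<String> lines = Files.lines(logFile)) {
            return lines.filter(l -> l.contains("Exception") || l.contains("ERROR"))
                        .collect(Collectors.toList());
        }
    }
}
```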

Once we get the hostname from the PID and the time range, we can download the log from the Timber server. Simply unarchive the log file and grep for Exception or Error to find the error stack. In our case we found:

<Time> [ERROR] (deadlock-detector) com.xx.xx.DeadlockDetector: Detected monitor deadlock among 1234, 5678; all thread traces follow, and then halting the JVM.

It turns out the JVM was halted because of a deadlock.
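For readers unfamiliar with monitor deadlocks, here is a minimal, self-contained demonstration (not our production code) that creates a classic lock-ordering deadlock and then detects it the same way a deadlock detector would, via the standard ThreadMXBean API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockDemo {
    /** Creates a two-thread monitor deadlock and returns the deadlocked thread IDs. */
    public static long[] createAndDetect() throws InterruptedException {
        final Object lockA = new Object();
        final Object lockB = new Object();
        // Ensures both threads hold their first lock before trying the second,
        // so the deadlock is deterministic.
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread t1 = new Thread(() -> {
            synchronized (lockA) {
                bothHoldFirstLock.countDown();
                try { bothHoldFirstLock.await(); } catch (InterruptedException ignored) { }
                synchronized (lockB) { } // blocks forever: t2 holds lockB
            }
        });
        Thread t2 = new Thread(() -> {
            synchronized (lockB) {
                bothHoldFirstLock.countDown();
                try { bothHoldFirstLock.await(); } catch (InterruptedException ignored) { }
                synchronized (lockA) { } // blocks forever: t1 holds lockA
            }
        });
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();

        // Poll until the monitor deadlock becomes visible (bounded wait).
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = null;
        for (int i = 0; i < 100 && deadlocked == null; i++) {
            Thread.sleep(50);
            deadlocked = threads.findMonitorDeadlockedThreads();
        }
        return deadlocked;
    }
}
```

A real detector would go one step further and dump the stack traces of the returned thread IDs before halting the JVM, which is exactly what the error log above shows.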

Using the thread IDs from the error log, we found the call stacks of those threads at the moment the deadlock happened:

Thread 1234:

Thread 5678:

It looks like an issue in the Netty server.

Step#5 Dig into Netty

After some research, we found that this issue matches https://github.com/netty/netty/issues/1817; luckily, it had already been fixed in a newer Netty version.

Take action

We upgraded the Netty version and checked all the metrics again to confirm the issue had been properly addressed.