As of today I have been working at Amazon for 1 year and 9 months, almost as long as I worked at Alibaba. I was lucky enough to go through the full project cycle as an engineer at both companies, so here I'll describe the facts and my thoughts, based on my experience, comparing the two companies.
Some well-known facts
Alibaba was founded in 1999. When I worked there in 2019 it had 10k+ employees (now 25k) and a market value of around 300 billion. Its core business is e-commerce, with a recent gradual shift toward cloud computing.
Amazon was founded in 1995. Today it has 150k+ employees and a market capitalization of 1.6+ trillion. Its core business is also e-commerce, but backed by its own complete supply chain, and it is the world's largest cloud computing provider, the absolute leader in technology and market share.
There are a lot of similarities: both started from e-commerce and are now developing cloud computing, both have a great leader (also the founder) who is a good speaker, and both believe customers' needs are always the highest priority.
However, there are also lots of differences. When talking about customers, Alibaba means the millions of small sellers who sell their products on the Taobao website, while Amazon mostly means the customers who buy products from the Amazon website. By extension, Alibaba's genes are rooted in the Internet: it builds end-to-end connections. Amazon is more like a hypermarket for a new era: it produces, transports, and sells, once books, now services.
Another fact: Alibaba is about 5 times smaller than Amazon. It was born and raised in China and has tried and failed several times to enter overseas markets. Amazon was born in the US and raised in the world; it tried very hard to enter the Chinese market, but so far its e-commerce business there has failed and its cloud computing business is still struggling. Amazon is like an open-minded Westerner who takes to the open world like a duck to water; Alibaba is more like an introverted Easterner who has advanced ideas but still finds it hard to integrate into the global village.
Experiences at giant companies can vary depending on the team. For brief context: at Alibaba I worked on the Cloud Effective team, which developed a code pipeline service on Alicloud; at Amazon I work on the Tax Platform team, which supports tax calculations across all Amazon businesses.
How projects start
When I was working at Alibaba, I wrote an article (in Chinese) about how projects start at Alicloud. In short, Alibaba's projects start chaotically but really fast, while Amazon's projects start slower but in a more orderly way.
An Amazon project starts from a 3-year strategy, which contains the high-level targets the org wants to achieve in the next 3 years; the leadership team reviews and adjusts this plan every year. Smaller teams then create 1-year strategies based on the 3-year plan. After the 1-year strategy is reviewed, Sr. SDEs work with the project's SDMs on OP1 and OP2 planning. These documents answer "What will we do next year?" and "How are we going to do it?"; the latter goes down to high-level task estimation and headcount (HC) allocation.
This sounds like a top-down flow, but in fact, at each step the management team reviews the plan with the experienced SDEs on the team. Especially for OP1 and OP2 planning, SDEs' ideas can be important: I personally had the chance to write the OP1/OP2 plan this year, and all the ideas came from me and my manager.
Alibaba's projects, however, start from the bosses' ideas. They might get data from marketing or take an idea from a competing product; either way, if the big bosses want a feature, it becomes the priority. The boss asks a manager to grab a group of people (PMs and SDEs, sometimes UX) and kick off the project in a meeting room. The boss can change their mind, and the boss themselves can change, since restructuring happens so frequently. An engineer may be involved in multiple projects but rarely gets to deliver one successfully. That is Ali-style fail-fast: you can imagine how busy engineers are in this environment, but the company does gain the ability to iterate quickly through trial and error.
I never saw a design doc at Alibaba; most documents were written to describe how a service works after it had been implemented and launched. At Amazon, by contrast, the design review is extremely valuable. It is used for decision making and knowledge transfer, and it ensures that after an SDE leaves Amazon the team can still maintain and continue developing the service, giving the service a longer lifespan. When you take ownership of a new service, you will be very happy to find a design doc that describes not only how the system works but also why it was designed that way.
The design doc contains 3 parts:
Why section, which has the following subsections:
- Goal: List the problems this design addresses, and what is out of scope, i.e. will not be considered in this design.
- Background: Describe the business background, or the current system architecture if one exists, as context for the design.
- Requirements: Include ACs (Acceptance Criteria, representing user journeys) and technical requirements such as performance, security, and availability.
- Solution: List the different options in detail, ideally with:
- Matrix: Compare the options on several factors, such as cost, performance, and implementation effort. The option with the highest score on the most factors is the preferred solution.
- Meeting Notes: Decisions need to be made by the team, so meeting notes are important. They include the consensus and, most importantly, the action items, with their owners listed.
For more information about design docs, refer to the example.
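The "highest score on the most factors" rule for the Matrix can be sketched in a few lines. This is a minimal illustration, not an Amazon tool, and it rests on assumptions: scores are integers where higher is always better (so a cheaper option gets a higher cost score), the option names and numbers are made up, a tie on a factor counts as a win for every tied option, and a tie in total wins falls back to the first option listed.

```python
# Hypothetical decision matrix: option -> factor -> score (1-5, higher is better).
# Cost and effort are scored so that cheaper / less effort gets the higher score.
Matrix = dict[str, dict[str, int]]


def preferred_option(matrix: Matrix) -> str:
    """Return the option that has the top score on the most factors."""
    factors = next(iter(matrix.values())).keys()
    wins = {option: 0 for option in matrix}
    for factor in factors:
        best = max(scores[factor] for scores in matrix.values())
        for option, scores in matrix.items():
            if scores[factor] == best:  # every tied option gets the win
                wins[option] += 1
    # max() keeps the first option listed if several have the same win count
    return max(wins, key=lambda option: wins[option])


options: Matrix = {
    "extend_existing_service": {"cost": 4, "performance": 2, "implementation_effort": 5},
    "new_microservice": {"cost": 2, "performance": 5, "implementation_effort": 2},
    "third_party_product": {"cost": 1, "performance": 4, "implementation_effort": 4},
}
print(preferred_option(options))  # extend_existing_service
```

This rule counts factor wins rather than summing scores, so an option that is best on cost and effort beats one that is only best on performance; a team that cares more about one factor would need weights, which this sketch leaves out.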
Once the design doc is ready, the team sits together, reads the doc and leaves comments for 30 minutes, and then the author replies to and discusses the concerns and suggestions in the next 30 minutes.
The important outcome of a design review meeting is the action items, which capture what the team has aligned on and what the next steps are.
CI/CD is a big topic that deserves a separate blog post. In a nutshell, Amazon's CI integrates changes earlier than Alibaba's, and Amazon's Full-CD pipeline is more automated than Alibaba's CD pipeline.
Most teams at Alibaba use a CI flow based on git flow (see my Chinese blog). When you start implementing a task, you create a change under an Application (应用). A change includes a feature branch created in the application's code repo; you develop the feature on that branch and test it in your exclusive dev environment (like the shared dev environment, but owned by you). When you finish developing, you submit your change to the change zone (变更区), which creates a release branch containing all changes for that environment, then builds and deploys it there so you can start e2e testing in a shared environment. This way of working provides good flexibility, but it generates a lot of release and feature branches, and the merge conflicts in the change zone are very painful. More importantly, your change is only merged into mainline after it has been deployed to prod.
Amazon's flow is closer to trunk-based development (see my Chinese blog). It doesn't require feature branches: most of the time we develop on the local mainline branch and test in a local environment (including integration and e2e tests, although "local" may mean a cloud desktop, i.e. a virtual machine running in the cloud). When testing is done, you publish a CR (code review) against the mainline branch; after peers review it, your changes are merged into mainline, and you can then test in the shared environment and eventually deploy to prod.
Alibaba also has another workflow, called the CD flow, which is less popular but more like Amazon's: it forces changes to be merged into mainline before they are deployed to beta. This achieves continuous integration, but the disadvantage is that each stage's deployment still relies on a manual trigger. Because the variety and number of tests are not sufficient to support automatic deployment to prod with confidence, manual testing is still mandatory for most services.
Amazon has a CD maturity model whose highest level is called Full-CD: once your CR is merged into mainline, it goes through the following flow without human interaction:
Build & UT -> Deploy to Beta -> Integration testing -> Deploy to Gamma -> Functional testing -> Deploy to PST -> Performance testing -> Deploy to 1-box -> Shadow testing -> Deploy to Gray environment -> Wait for a while -> Deploy to Prod -> Release packages to live
Each stage has its own testing purpose. A test failure at any stage triggers a rollback and blocks the pipeline, and an alarm goes off so the oncall can step in and fix the issue. Each kind of testing also has multiple tools to choose from for different types of applications; sometimes a team develops its own testing framework, which can also integrate gracefully with the pipeline interfaces.
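The promotion logic of a Full-CD pipeline can be sketched as a loop over stages, each followed by a gate. This is a minimal illustration under stated assumptions, not Amazon's actual pipeline tooling: the stage and gate names, the `promote` function, and the `deploy`/`check`/`rollback`/`alarm` callbacks are all hypothetical. Build & UT is assumed to have already passed, "wait for a while" is modeled as a "bake time" gate, and the final "release packages to live" step is left out.

```python
from typing import Callable, Optional

# Hypothetical (stage, gate) pairs mirroring the Full-CD flow; the gate is the
# check that must pass before the version moves on. Names are illustrative only.
PIPELINE: list[tuple[str, Optional[str]]] = [
    ("beta", "integration tests"),
    ("gamma", "functional tests"),
    ("pst", "performance tests"),
    ("one-box", "shadow tests"),
    ("gray", "bake time"),  # the "wait for a while" step
    ("prod", None),         # last stage, nothing left to gate
]


def promote(
    version: str,
    deploy: Callable[[str, str], None],
    check: Callable[[str, str, str], bool],
    rollback: Callable[[str, str], None],
    alarm: Callable[[str], None],
) -> list[str]:
    """Push `version` through every stage with no human in the loop.

    Returns the stages where the version is still running when the run ends.
    """
    running: list[str] = []
    for stage, gate in PIPELINE:
        deploy(stage, version)
        running.append(stage)
        if gate is not None and not check(stage, gate, version):
            rollback(stage, version)  # undo the deployment in the failing stage
            running.pop()
            alarm(f"{gate} failed in {stage} for {version}; pipeline blocked")
            return running            # later stages never receive this version
    return running


# Usage with fake callbacks that simulate a failure in the PST stage.
events: list[str] = []
reached = promote(
    "v42",
    deploy=lambda stage, v: events.append(f"deploy {v} to {stage}"),
    check=lambda stage, gate, v: gate != "performance tests",
    rollback=lambda stage, v: events.append(f"rollback {v} in {stage}"),
    alarm=lambda msg: events.append(f"ALARM: {msg}"),
)
print(reached)     # ['beta', 'gamma']
print(events[-1])  # ALARM: performance tests failed in pst for v42; pipeline blocked
```

Passing the side effects in as callbacks keeps the promotion rule (deploy, gate, roll back and block on failure) separate from whatever deployment and testing tools a team plugs in, which loosely mirrors how custom test frameworks integrate with the pipeline.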
The operation flows at Alibaba and Amazon are similar: most of the time an alarm goes off, developers check metrics and logs, identify the change that introduced the issue, and then either roll back or hotfix. The general differences:
- Alibaba's operation tools are much more user-friendly (especially for log search; CloudWatch Logs is the worst).
- Amazon has a dedicated oncall who is responsible for dealing with critical issues, whereas Alibaba relies more on the "application owner", which leads to worse work-life balance (WLB).
For customer-impacting issues, Alibaba has the fault review (故障 review) and Amazon has the COE (Correction of Errors). They serve the same purposes: 1. report the impact, 2. prevent it from happening again, 3. share the learnings across orgs. They also have a similar format:
- Customer impact
- Root Cause analysis
- Solution to improve
- Actions need to be taken
Alibaba's fault review also includes information such as the "person who should take responsibility" and the "fault level"; if you cause a number of high-level faults, you'll be fired.
Generally, I think Alibaba does better at the fault review stage, because leaders take these issues more seriously, which makes them truly the highest priority. Faults were reviewed by members across different teams, and most action items were completed within 1 week.
On the other hand, I have drafted or reviewed 3-5 COEs at Amazon, and I feel only the COE BRs (Bar Raisers) truly care about the COE itself; most other reviewers (including SDEs and SDMs) care more about the action items and who should own them, and pushing action items onto another team happens all the time. Most action items were completed within a month, but I have also seen a small action take 3 months to finish. COE action items usually aren't considered a promotion data point, so SDEs are normally not very happy to work on them.
Work-life balance
I've heard a lot of complaints about horrible WLB at both companies; objectively speaking, Alibaba is worse than Amazon.
When I was working at Alibaba, my working hours were 9am-10pm, 5 days a week, sometimes running late until 2-3am, though not too often (about once a month). I know some other teams in a project sprint had to work until 3am, 6 days a week, and the project kept sprinting until it succeeded; no one knew when that would be. At Amazon, my team's working hours are 8am-5pm, and work rarely goes past 6pm unless you're a new member spending extra time learning the new stuff.
At Alibaba, weekends are normally free, but sometimes your boss will call you about an "urgent issue". Some teams have so many "urgent issues" that everyone hates Dingding (the communication tool developed and used by Alibaba). At Amazon, on the other hand, I've never received a message from my manager or peers on a weekend.
There are a lot of complaints about oncall at Amazon, but Alibaba is worse. At Amazon, SDEs rotate through oncall: a typical two-pizza team has 8 people, and with one week per person you'll be oncall for one week every 2 months, during which you might be paged at night 2-10 times. At Alibaba, you're effectively oncall 24/7, 365 days a year, because as the application owner any issue might page you. Because of the fast pace and lack of documentation, a single issue might page a whole team at midnight to debug together, since everyone is unfamiliar with the logic developed by other engineers.
How did Alibaba end up like this? I think the major reason is that Alibaba's leadership team all grew up as engineers: almost all leaders in Alicloud have an engineering background. They are good at rational thinking but lack empathy; they pursue efficiency but are not good at long-term planning, because it is easy to get caught up in the details. They were promoted to manager without preparation and learned management with an inherently logical mindset; without humanistic thinking, a toxic environment grew. In comparison, Amazon's managers have varied backgrounds, and not all of them were engineers before managing. Because of this diversity, the organization relies more on rules to establish connections between different roles. We call it professionalism.
Alibaba and Amazon are both good companies, but Amazon is friendlier to career growth. In 2020 the 996.ICU project was flourishing in China, reflecting the voice of Chinese programmers. I hope Chinese companies can learn to treat their employees better, so that they can keep growing in a sustainable way.