Coinbase (Nasdaq: COIN) has once again shown crypto traders how fragile cloud hardware can bring down even the fastest exchanges. The outage also renewed scrutiny of the company’s AI-powered operations pivot.
The company announced Friday that a cooling failure in an Amazon Web Services (Nasdaq: AMZN) data center caused an outage lasting several hours, disrupting trading, exchange access, and balance updates across the platform.
The issue began on May 7 at approximately 23:50 UTC, when Coinbase’s internal monitors detected widespread quoting errors inside its systems.
Engineers opened several Sev1 incidents at that point, and customers were already affected across services including Spot Trading, Coinbase Prime, International, Derivatives, and the Retail, Advanced, and Institutional exchanges.
Brian Armstrong, CEO of Coinbase, wrote on X that his company had “experienced an outage” and that such an event was “absolutely unacceptable.” He said the cause was “overheating in an AWS data center room due to multiple cooling equipment failures.”
According to Armstrong, Coinbase is designed so that the failure of a single AWS Availability Zone does not take all services offline. Most services are structured this way, with the exception of exchange services, which require very low latency and therefore run on separate infrastructure.
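As a rough illustration of that split, consider a config-driven view of zone placement. This is a minimal sketch with invented service names and zones, not Coinbase’s actual topology: most services span multiple zones and survive the loss of any one, while the latency-sensitive exchange is pinned to a single zone.

    # Hypothetical zone placement; names are illustrative, not Coinbase's real topology.
    DEPLOYMENTS = {
        # Most services span several availability zones, so losing one zone
        # reduces capacity but does not take the service offline.
        "retail-api":      {"zones": ["us-east-1a", "us-east-1b", "us-east-1c"]},
        "balance-service": {"zones": ["us-east-1a", "us-east-1b", "us-east-1c"]},
        # The matching engine is pinned to one zone, trading redundancy
        # for the low latency Armstrong described.
        "matching-engine": {"zones": ["us-east-1a"]},
    }

    def survives_zone_loss(service: str, failed_zone: str) -> bool:
        """True if the service still has at least one healthy zone left."""
        return any(z != failed_zone for z in DEPLOYMENTS[service]["zones"])

    print(survives_zone_loss("retail-api", "us-east-1a"))       # True
    print(survives_zone_loss("matching-engine", "us-east-1a"))  # False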
Coinbase blames AWS chiller failure as market system begins to fail before midnight UTC
Cryptopolitan previously reported that Coinbase plans to lay off 700 employees, about 14% of its workforce, with the aim of replacing manual processes with AI.
Rob Witoff, Head of Platform at Coinbase, explained the technical details of the incident. He said the multi-hour outage affected “trading, access to exchanges, and balance updates.”
The first alert fired at 23:50 UTC, triggered by quoting errors within Coinbase’s internal systems, and Sev1 incident response began immediately. According to Witoff, the root cause was a “thermal event” affecting a small number of racks in one of the AWS us-east-1 facilities.
The design of the exchange infrastructure is deliberate. Witoff said Coinbase keeps it in a single availability zone because the industry values speed.
The company also maintains distributed backups of this infrastructure for exactly such scenarios. In this case, however, the failure did not stop at a single component, which prolonged the recovery.
Two components failed. First, the hardware underlying the matching engine malfunctioned, which meant that, before anything else, engineers had to perform a recovery and failover operation.
Second, the distributed Kafka cluster responsible for sharing data across the company’s systems also went down. Engineers had to recover Kafka partitions, amounting to tebibytes of data, onto brokers running on new hardware.
Engineers rebuild quorum and revive Coinbase market through cancellation-only and auction modes
The matching engine was the biggest source of the trading stall. It processes orders and maintains the order book, and it runs on a distributed cluster that requires a quorum before a leader can be elected and trades can be executed safely.
During the outage, too many nodes were unhealthy because of the data center failure, so quorum could not be achieved, stalling trading activity on the retail, advanced, and institutional exchanges.
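To see why a localized thermal event can freeze an entire exchange, it helps to look at the quorum rule itself. Consensus protocols such as Raft require a strict majority of healthy nodes before electing a leader; the sketch below assumes a simple five-node majority cluster, since the article does not say which protocol or cluster size Coinbase uses.

    # Majority-quorum check, as used by consensus protocols such as Raft.
    # Cluster size and node counts here are assumptions for illustration.
    def has_quorum(healthy_nodes: int, cluster_size: int) -> bool:
        """A leader can be elected only if a strict majority of nodes is healthy."""
        return healthy_nodes > cluster_size // 2

    # If overheating racks take out 3 of 5 nodes, no leader can be elected,
    # so the matching engine cannot safely accept or match orders.
    print(has_quorum(healthy_nodes=2, cluster_size=5))  # False -> trading halts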
Witoff said on-call and engineering teams had to execute the company’s disaster recovery procedures, establish quorum, and assess system health under difficult infrastructure conditions.
He said the team had to develop, test, deploy, and validate fixes while managing a widespread outage. The Kafka deployment handles thousands of terabytes a day across a partitioned architecture, which made recovery an extensive manual effort.
Balance streams lagged while Kafka was behind. Witoff said that once replication caught up, the balance issues disappeared, and according to Coinbase, no data was lost.
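The balance symptoms Witoff described are what replication lag looks like from the outside: downstream systems can only show balances as of the offsets they have replayed. Here is a rough sketch of that arithmetic, with all partition names and offsets invented for illustration; real monitoring would query the brokers themselves.

    # Replication lag per partition; names and offsets are invented.
    log_end_offsets = {"balances-0": 9_500_000, "balances-1": 9_480_000}
    replica_offsets = {"balances-0": 9_100_000, "balances-1": 9_479_500}

    for partition, end_offset in log_end_offsets.items():
        lag = end_offset - replica_offsets[partition]
        # Balances look stale until lag returns to ~0 as rebuilt brokers
        # catch up; records are replayed later rather than lost, which
        # matches Coinbase's "no data was lost" account.
        print(f"{partition}: {lag} records behind")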
When the matching engine returned to service, markets were not re-enabled all at once. Coinbase first switched all products to cancellation-only mode, verified product status, moved all markets into auction mode, and finally re-enabled trading on Coinbase Exchange.
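That staged restart maps naturally onto a one-way state machine: customers can pull orders before anyone can place them, an auction rebuilds a fair opening price, and only then does continuous trading resume. The article confirms the ordering but not the implementation; the state names below are assumptions.

    # One-way restart sequence; state names are assumptions, only the
    # ordering (cancel-only -> auction -> trading) comes from the article.
    from enum import Enum

    class MarketMode(Enum):
        HALTED = 0       # matching engine offline
        CANCEL_ONLY = 1  # customers may cancel orders, not place them
        AUCTION = 2      # rebuild an opening price before continuous trading
        TRADING = 3      # fully re-enabled

    def advance(mode: MarketMode) -> MarketMode:
        """Move a market forward one verified step; never skip or go back."""
        order = list(MarketMode)
        return order[min(order.index(mode) + 1, len(order) - 1)]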
Witoff also stressed that customers should never be locked out of their accounts, even temporarily. Coinbase promised a detailed post-mortem of the incident in the coming weeks.
However, Josh Elithorpe pushed back on the rumors that circulated after Witoff’s post on X. In his words: “No one vibe-coded anything that failed. No ‘non-engineers’ pushed the production code and took out the trading engine. It wasn’t intentional. It’s not because Coinbase failed to design a failover system. Things happen at scale. Don’t let armchair quarterbacks talk big.”

