502 Errors After a Major Headless CMS Update

How a major CMS upgrade caused intermittent 502s, and how CloudWatch triage plus SQL execution plan tuning resolved it.

Dec 20, 2025 · 3 min read
HeadlessCMS
MySQL
CloudWatch
Performance
SQL

TL;DR

  • Right after a major CMS update, the API started returning intermittent 502s.
  • CloudWatch showed unstable DatabaseConnections, a spike in selectAttempt, and sustained high CPU utilization, all pointing to the DB.
  • The schema became more normalized, table count and JOINs exploded, and EXPLAIN showed full table scans everywhere.
  • Rebuilt indexes, refined JOIN conditions and selected columns, and removed unnecessary JOINs.
  • 502s disappeared and CPU dropped; some endpoints ended up faster than before the upgrade.

System Overview (High Level)

  • Headless CMS (schema managed as code, auto-generated)
  • MySQL (RDS)
  • Applications use the CMS via API
  • Monitoring via CloudWatch

A very typical setup.


What Changed

I ran a major update of the headless CMS.

I knew there were breaking changes, but the schema definition still generated correctly and the migrations completed successfully.

At that point I expected some performance drop, but nothing dramatic.


What Happened

After deployment, these symptoms appeared:

  • API intermittently returned 502
  • Some requests were extremely slow
  • Hard to reproduce consistently

CloudWatch revealed several suspicious signals.


CloudWatch Signals

Unstable Connections

DatabaseConnections had been steady before, but started fluctuating with request volume after the update. In addition:

  • selectAttempt (number of SELECTs) nearly doubled
  • DB load clearly increased

I started to suspect connection pool behavior or query volume.
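
To cross-check the CloudWatch graph from the database side, MySQL's own status counters are enough. A minimal sketch (these are standard MySQL commands, nothing specific to this setup):

  -- Current vs. peak connection count as seen by MySQL itself
  SHOW GLOBAL STATUS LIKE 'Threads_connected';
  SHOW GLOBAL STATUS LIKE 'Max_used_connections';

  -- What those connections are actually doing right now
  SHOW FULL PROCESSLIST;

If the connection count here mirrors the CloudWatch graph, the problem really is on the DB side, and the processlist shows which queries those connections are stuck on.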

CPU Stuck High

CPUUtilization hovered around 70% consistently, not just spikes. That suggested:

  • not a single heavy query
  • something constantly keeping the DB busy
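
To see what was actually keeping the database busy, performance_schema's statement digests are a good starting point. A sketch, assuming performance_schema is enabled (on RDS this is controlled by the parameter group):

  -- Top statements by total time, aggregated by normalized query text
  SELECT DIGEST_TEXT,
         COUNT_STAR            AS exec_count,
         SUM_TIMER_WAIT / 1e12 AS total_time_s,  -- picoseconds -> seconds
         SUM_ROWS_EXAMINED     AS rows_examined
  FROM performance_schema.events_statements_summary_by_digest
  ORDER BY SUM_TIMER_WAIT DESC
  LIMIT 10;

A large rows_examined relative to exec_count is the usual hint that indexes are not being used.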

Narrowing Down the Cause

Reviewing the update changelog showed one key change.

Heavier Normalization

After the update:

  • Table count more than doubled
  • Many new join tables were added

Result:

  • JOINs exploded for reads
  • SQL per request became much more complex

It felt like the SQL layer was the real culprit, so I inspected the queries issued by the app.
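
As a purely illustrative sketch (the table names are made up, not the CMS's actual schema), the shape of a single read changed roughly like this:

  -- Before the upgrade: one mostly denormalized table
  SELECT id, title, body, author_name
  FROM articles
  WHERE slug = 'hello-world';

  -- After the upgrade: the same read fans out across join tables
  SELECT a.id, f_title.value, f_body.value, au.name
  FROM articles a
  JOIN article_fields  f_title ON f_title.article_id = a.id AND f_title.field = 'title'
  JOIN article_fields  f_body  ON f_body.article_id  = a.id AND f_body.field  = 'body'
  JOIN article_authors aa      ON aa.article_id = a.id
  JOIN authors         au      ON au.id = aa.author_id
  WHERE a.slug = 'hello-world';

Every additional JOIN target is another chance for the optimizer to pick a bad plan, which lines up with the sustained CPU load.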


EXPLAIN and Despair

Running EXPLAIN on the problematic queries showed full table scans everywhere.

  • Indexes were not used
  • Many JOIN targets
  • Decent row counts

That explained the CPU burn.
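
For reference, the check itself is just EXPLAIN in front of the generated query (hypothetical names again, matching the sketch above; the index name is also made up):

  EXPLAIN
  SELECT a.id, f.value
  FROM articles a
  JOIN article_fields f ON f.article_id = a.id
  WHERE a.slug = 'hello-world';

  -- What the upgrade produced:  table f -> type: ALL, key: NULL  (full scan on every request)
  -- What you want to see:       table f -> type: ref, key: idx_article_fields_article_id

In the output, type is the column to watch: ALL means a full table scan, while ref or eq_ref means an index is being used.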


What I Did

Then came the slow tuning work:

  • Rebuild indexes
  • Reorder JOINs and conditions
  • Limit selected columns to the minimum
  • Remove unnecessary JOINs

I repeated EXPLAIN -> fix -> recheck over and over.
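
Most of the fixes were variations on this (again a sketch with the hypothetical names from above, not the exact statements):

  -- 1. Add indexes on the columns the generated JOINs actually filter on
  CREATE INDEX idx_article_fields_article_id_field
      ON article_fields (article_id, field);

  -- 2. Select only what the endpoint needs, and drop JOINs it never reads
  SELECT a.id, f_title.value
  FROM articles a
  JOIN article_fields f_title
    ON f_title.article_id = a.id AND f_title.field = 'title'
  WHERE a.slug = 'hello-world';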


Results

  • 502s were resolved
  • CPU utilization dropped significantly
  • Some endpoints became faster than before the upgrade

It was painful, but it reinforced a simple truth: if you sit down with the query plan, things get better.


Lessons

Two things stood out the most.

Always Check the Execution Plan

  • Even with ORM/CMS, SQL always runs underneath
  • "It works" is not enough
  • Full scans are almost always bad

CloudWatch Is an Excellent Triage Tool

  • CPU
  • Connections
  • Query counts

With these, you can quickly judge whether the issue is app, DB, or schema related.


Closing Thoughts

Major updates bring convenient improvements, but the internal structure can change dramatically.

Especially:

  • deeper normalization
  • more JOINs
  • connection management changes

These are areas to scrutinize after upgrades.

It was a tough incident, but learning to read execution plans and appreciate CloudWatch made it worthwhile. I hope this helps anyone who gets stuck in a similar swamp.