Case Study: Fixing Broken SNS Notifications in Production

While working with a client who runs a cloud-based monitoring platform, I was tasked with troubleshooting a critical issue: SNS notifications had stopped reaching end users. These alerts were tied to production systems and SLAs, so restoring delivery was a high priority.

The Problem

The system was designed to send real-time alerts through AWS Simple Notification Service (SNS), primarily by email and SMS. The alerts were being triggered as expected, but users weren’t receiving them.

We had no obvious error messages, and the CloudWatch dashboards looked normal. But something was off under the hood.

What I Did

Step 1: Verified the SNS Trigger

First, I confirmed the alerts were being published to the SNS topic. I checked CloudWatch logs and metrics, and everything showed successful Publish API calls. That ruled out the application layer.
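
A check like this can be scripted. The sketch below is not the script used in the incident; it assumes boto3 and a hypothetical topic name, and sums one of the standard AWS/SNS CloudWatch metrics (for example NumberOfMessagesPublished or NumberOfNotificationsFailed) over a recent window. The CloudWatch client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def sns_metric_sum(cloudwatch, topic_name, metric_name, hours=1):
    """Sum one AWS/SNS metric for a topic over the last `hours` hours."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SNS",
        MetricName=metric_name,
        Dimensions=[{"Name": "TopicName", "Value": topic_name}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in response["Datapoints"])


# Usage (requires boto3 and AWS credentials; "prod-alerts" is a made-up name):
#   import boto3
#   cw = boto3.client("cloudwatch")
#   print(sns_metric_sum(cw, "prod-alerts", "NumberOfMessagesPublished"))
#   print(sns_metric_sum(cw, "prod-alerts", "NumberOfNotificationsFailed"))
```

A healthy publish count next to a non-zero failure count points away from the application and toward delivery.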

🔎 Step 2: Checked Subscription Status

Then I looked at the subscriptions tied to the topic. Several were stuck in PendingConfirmation status. These users hadn’t clicked the confirmation link from AWS, and SNS does not deliver to unconfirmed subscriptions, so their notifications were never sent.
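
Finding those subscriptions by hand gets tedious on a large topic. As a minimal sketch, assuming a boto3 SNS client is passed in (the function name is my own), the pending entries can be listed like this:

```python
def pending_subscriptions(sns, topic_arn):
    """Return (protocol, endpoint) pairs still awaiting confirmation."""
    pending = []
    kwargs = {"TopicArn": topic_arn}
    while True:
        page = sns.list_subscriptions_by_topic(**kwargs)
        for sub in page.get("Subscriptions", []):
            # Until the recipient confirms, SNS reports the literal string
            # "PendingConfirmation" where the subscription ARN would be.
            if sub["SubscriptionArn"] == "PendingConfirmation":
                pending.append((sub["Protocol"], sub["Endpoint"]))
        token = page.get("NextToken")
        if not token:
            return pending
        kwargs["NextToken"] = token  # fetch the next page of results


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   sns = boto3.client("sns")
#   for protocol, endpoint in pending_subscriptions(sns, topic_arn):
#       print(protocol, endpoint)
```

To my knowledge, calling subscribe again with the same protocol and endpoint prompts SNS to send a fresh confirmation message, but that is worth verifying on a test topic first.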

🛠️ Step 3: Reviewed Delivery Logs

After enabling SNS delivery logging, I found that some messages were failing silently. Reasons ranged from email typos to domains blocking Amazon SES. One SMS endpoint failed because it didn’t include a country code.
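
For the SMS side, delivery status logging is an account-level setting. The sketch below shows one way to switch it on with boto3's set_sms_attributes; the role ARN is a placeholder for an IAM role that SNS can assume to write to CloudWatch Logs, and this is an illustration rather than the exact change made for the client.

```python
def enable_sms_delivery_logging(sns, role_arn, success_sample_percent=100):
    """Turn on CloudWatch Logs delivery status for SMS in this account and region."""
    if not 0 <= success_sample_percent <= 100:
        raise ValueError("success_sample_percent must be between 0 and 100")
    attributes = {
        # Role SNS assumes to write delivery logs (placeholder value in usage).
        "DeliveryStatusIAMRole": role_arn,
        # Percentage of *successful* deliveries to log; SNS expects a string.
        "DeliveryStatusSuccessSamplingRate": str(success_sample_percent),
    }
    sns.set_sms_attributes(attributes=attributes)
    return attributes


# Usage (requires boto3 and AWS credentials; the ARN is hypothetical):
#   import boto3
#   enable_sms_delivery_logging(
#       boto3.client("sns"),
#       "arn:aws:iam::123456789012:role/SNSDeliveryLogs",
#   )
```

Setting the sampling rate to 0 keeps log volume down while still recording failed deliveries.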

🔐 Step 4: Audited Permissions and Policies

Even though the IAM role had sns:Publish permissions, the topic policy didn’t explicitly allow publishes from the service’s role. This kind of misalignment can block messages without generating obvious errors.
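
A quick way to audit this is to pull the topic policy and look for an Allow statement that names the publishing role. The helper below is a rough sketch under my own assumptions: it ignores Condition blocks, Deny statements, and NotAction/NotPrincipal, so treat a True result as "worth a closer look", not as a full policy evaluation.

```python
import json


def _as_list(value):
    """IAM policy fields may be a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def policy_allows_publish(policy_json, principal_arn):
    """Rough check: does any Allow statement grant sns:Publish to the ARN?"""
    policy = json.loads(policy_json)
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = [a.lower() for a in _as_list(stmt.get("Action", []))]
        if not any(a in ("sns:publish", "sns:*", "*") for a in actions):
            continue
        principal = stmt.get("Principal", {})
        if principal == "*":
            return True
        arns = _as_list(principal.get("AWS", [])) if isinstance(principal, dict) else []
        if principal_arn in arns or "*" in arns:
            return True
    return False


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   attrs = boto3.client("sns").get_topic_attributes(TopicArn=topic_arn)
#   print(policy_allows_publish(attrs["Attributes"]["Policy"], role_arn))
```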

🔧 Step 5: Fixed Data Issues

I corrected invalid email addresses, updated SMS formats with the proper international prefix, and cleaned up subscription scripts to validate inputs before adding them.
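
The input validation can be very small. Here is a minimal sketch of the kind of check a subscription script might run before calling SNS; the function name and the deliberately loose email pattern are my own choices, not the client's code. Phone numbers are checked against the E.164 shape (a plus sign, a country code, and at most 15 digits in total).

```python
import re

# E.164: "+" then 2 to 15 digits, the first of which is not zero.
E164_RE = re.compile(r"\+[1-9][0-9]{1,14}")
# Loose shape check only: something@something.tld with no spaces.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid_endpoint(protocol, endpoint):
    """Return True if the endpoint looks well-formed for the given protocol."""
    endpoint = endpoint.strip()
    if protocol == "sms":
        return E164_RE.fullmatch(endpoint) is not None
    if protocol == "email":
        return EMAIL_RE.fullmatch(endpoint) is not None
    raise ValueError(f"unsupported protocol: {protocol}")


# Usage:
#   is_valid_endpoint("sms", "4155550123")    # missing country code
#   is_valid_endpoint("sms", "+14155550123")  # E.164 format
```

A shape check like this catches missing country codes and obvious typos, but it cannot tell whether a mailbox or number actually exists; the confirmation step still does that job.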

The Fix

  • Manually re-confirmed or re-sent pending subscriptions

  • Corrected endpoint formats and typos

  • Updated the SNS topic policy to allow proper publishing

  • Enabled delivery logging and added alerts for future failures

  • Documented all steps in the team runbook for future incidents
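
For the alerting piece, one option is a CloudWatch alarm on the NumberOfNotificationsFailed metric. The sketch below assumes boto3 and hypothetical names; the alarm name, threshold, and period are illustrative defaults, not the values used in production.

```python
def failed_delivery_alarm(topic_name, alarm_topic_arn):
    """Build put_metric_alarm parameters for failed SNS deliveries on a topic."""
    return {
        "AlarmName": f"sns-{topic_name}-delivery-failures",  # illustrative name
        "Namespace": "AWS/SNS",
        "MetricName": "NumberOfNotificationsFailed",
        "Dimensions": [{"Name": "TopicName", "Value": topic_name}],
        "Statistic": "Sum",
        "Period": 300,  # evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # any failure trips it
        "TreatMissingData": "notBreaching",  # no datapoints means no failures
        "AlarmActions": [alarm_topic_arn],
    }


def create_failed_delivery_alarm(cloudwatch, topic_name, alarm_topic_arn):
    """Create or update the alarm and return its name."""
    params = failed_delivery_alarm(topic_name, alarm_topic_arn)
    cloudwatch.put_metric_alarm(**params)
    return params["AlarmName"]


# Usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   create_failed_delivery_alarm(
#       boto3.client("cloudwatch"), "prod-alerts", ops_topic_arn)
```

It makes sense to route the alarm to a different topic or channel than the one being monitored, so a broken topic cannot swallow its own failure alert.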

Results

  • ✅ SNS alerts fully restored

  • 📉 24% drop in support tickets related to missed alerts

  • 🛡️ SLA risk mitigated

  • 📚 Lessons documented for reusability

What I Learned

  • SNS delivery failures are easy to miss without logging

  • IAM permissions alone aren’t enough; SNS topic policies also matter

  • Always validate inputs before subscribing endpoints

  • Real-time alerting needs real-time observability

Skills Demonstrated

  • AWS SNS, CloudWatch, IAM

  • Real-world troubleshooting

  • Scripting and automation

  • Communication across support and engineering

  • Writing clear technical documentation

