June 20, 2025
Written by FahmaCloud

Case Study: Fixing Broken SNS Notifications in Production

While working with a client who runs a cloud-based monitoring platform, I was tasked with troubleshooting a critical issue: SNS notifications had stopped reaching end users. These alerts were tied to production systems and SLAs, so restoring delivery was high priority.

The Problem

The system was designed to send real-time alerts through Amazon Simple Notification Service (SNS), primarily by email and SMS. The alerts were triggered as expected, but users weren’t receiving them.

There were no obvious error messages, and the CloudWatch dashboards looked normal. The failure had to be somewhere between the topic and the subscribers.

What I Did

✅ Step 1: Verified the SNS Trigger

First, I confirmed the alerts were being published to the SNS topic. I checked CloudWatch logs and metrics, and both showed successful Publish API calls. That ruled out the application layer.
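As a rough illustration of this check, the sketch below sums the topic's `NumberOfMessagesPublished` metric from the `AWS/SNS` namespace. It is not the client's actual tooling: the function name and the `cloudwatch` argument (expected to be a boto3 CloudWatch client) are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone


def published_count(cloudwatch, topic_name, hours=24):
    """Sum the topic's NumberOfMessagesPublished datapoints over the last `hours`."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SNS",
        MetricName="NumberOfMessagesPublished",
        Dimensions=[{"Name": "TopicName", "Value": topic_name}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=3600,  # one datapoint per hour
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in resp["Datapoints"])
```

With boto3 installed, a call might look like `published_count(boto3.client("cloudwatch"), "prod-alerts")`, where `prod-alerts` is a made-up topic name. A non-zero total only shows that messages reached the topic. It says nothing about whether they were delivered.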

🔎 Step 2: Checked Subscription Status

Then I looked at the subscriptions tied to the topic. Several were stuck in PendingConfirmation status. These users had never clicked the confirmation link from AWS, and SNS does not deliver to a subscription until it is confirmed, so their notifications never went out.
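A minimal sketch of how such a check could be scripted is shown here. It assumes `sns` is a boto3 SNS client and relies on the fact that unconfirmed subscriptions report the literal string `PendingConfirmation` in place of a subscription ARN. The function name is hypothetical.

```python
def pending_subscriptions(sns, topic_arn):
    """Return (protocol, endpoint) pairs for subscriptions still awaiting confirmation."""
    pending = []
    kwargs = {"TopicArn": topic_arn}
    while True:
        page = sns.list_subscriptions_by_topic(**kwargs)
        for sub in page.get("Subscriptions", []):
            # Unconfirmed subscriptions carry this placeholder instead of an ARN.
            if sub["SubscriptionArn"] == "PendingConfirmation":
                pending.append((sub["Protocol"], sub["Endpoint"]))
        token = page.get("NextToken")
        if not token:
            return pending
        kwargs["NextToken"] = token  # fetch the next page of results
```

The loop follows `NextToken` by hand so that large topics are scanned in full. Each returned pair identifies an endpoint that needs a fresh confirmation request.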

🛠️ Step 3: Reviewed Delivery Logs

After enabling SNS delivery logging, I found that some messages were failing silently. Causes ranged from typos in email addresses to recipient domains blocking Amazon SES. One SMS endpoint failed because its number lacked a country code, which SNS requires as part of the E.164 format.
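To give a feel for how such logs can be triaged, here is a small sketch that counts failures by reason. It assumes each log message is a JSON document with a top-level `status` field and a `delivery.providerResponse` field, as in SNS delivery status entries. The function name is hypothetical.

```python
import json
from collections import Counter


def summarize_failures(log_messages):
    """Count failed deliveries per provider response, most frequent first."""
    reasons = Counter()
    for raw in log_messages:
        entry = json.loads(raw)
        if entry.get("status") != "FAILURE":
            continue  # only failed deliveries are of interest here
        delivery = entry.get("delivery", {})
        reasons[delivery.get("providerResponse", "unknown")] += 1
    return reasons.most_common()
```

Grouping by reason makes it easier to see whether the failures share one cause, such as malformed numbers, or are scattered across unrelated endpoints.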

🔐 Step 4: Audited Permissions and Policies

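Even though the IAM role had sns:Publish permissions, the topic policy didn’t explicitly allow publishes from the service’s role. That gap matters most when the publisher sits in another account or is an AWS service principal, because the topic policy must then allow the publish as well. The rejection is returned to the caller rather than recorded on the topic, so it is easy to miss if the publisher does not log it.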
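For illustration, a topic-policy statement granting one role permission to publish could be built and merged as sketched below. The helper names and the default `Sid` are assumptions for the example, not the policy used in this engagement.

```python
import json


def allow_publish_statement(topic_arn, role_arn, sid="AllowServiceRolePublish"):
    """Build a topic-policy statement that lets one IAM role publish to one topic."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "sns:Publish",
        "Resource": topic_arn,
    }


def add_statement(policy_json, statement):
    """Return the policy JSON with `statement` added, replacing any with the same Sid."""
    policy = json.loads(policy_json)
    existing = policy.get("Statement", [])
    if isinstance(existing, dict):  # a lone statement may be stored as an object
        existing = [existing]
    kept = [s for s in existing if s.get("Sid") != statement["Sid"]]
    policy["Statement"] = kept + [statement]
    return json.dumps(policy)
```

With a boto3 SNS client, the current policy can be read from `get_topic_attributes(TopicArn=...)["Attributes"]["Policy"]` and written back with `set_topic_attributes(TopicArn=..., AttributeName="Policy", AttributeValue=...)`. Replacing by `Sid` keeps the update safe to re-run.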

🔧 Step 5: Fixed Data Issues

I corrected invalid email addresses, updated SMS formats with the proper international prefix, and cleaned up subscription scripts to validate inputs before adding them.
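The input validation could look something like the sketch below. The function name and the exact rules are assumptions: phone numbers are checked against the E.164 shape (a plus sign, then up to 15 digits with no leading zero), and email addresses get only a basic structural check.

```python
import re

E164 = re.compile(r"\+[1-9][0-9]{1,14}")
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def validate_endpoint(protocol, endpoint):
    """Return a cleaned endpoint, or raise ValueError if it is malformed."""
    value = endpoint.strip()
    if protocol == "sms":
        # Drop common separators, then require a full E.164 number.
        number = re.sub(r"[\s().-]", "", value)
        if not E164.fullmatch(number):
            raise ValueError(f"not an E.164 phone number: {endpoint!r}")
        return number
    if protocol == "email":
        if not EMAIL.fullmatch(value):
            raise ValueError(f"not a valid email address: {endpoint!r}")
        return value
    raise ValueError(f"unsupported protocol: {protocol!r}")
```

A subscription script can call this before subscribing and reject bad input up front. The email pattern is deliberately loose: it catches obvious typos such as a missing domain, but it cannot tell whether a mailbox exists.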

The Fix

Re-sent confirmation requests for pending subscriptions and got them confirmed
Corrected endpoint formats and typos
Updated the SNS topic policy to allow proper publishing
Enabled delivery logging and added alerts for future failures
Documented all steps in the team runbook for future incidents
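As one possible shape for the failure alerts mentioned in this list, the sketch below builds the arguments for a CloudWatch alarm on the topic's `NumberOfNotificationsFailed` metric. The function name, alarm name, and defaults are assumptions, not the configuration deployed for the client.

```python
def failed_delivery_alarm(topic_name, notify_topic_arn, threshold=1):
    """Build put_metric_alarm arguments that fire when SNS reports failed deliveries."""
    return {
        "AlarmName": f"sns-{topic_name}-failed-notifications",
        "Namespace": "AWS/SNS",
        "MetricName": "NumberOfNotificationsFailed",
        "Dimensions": [{"Name": "TopicName", "Value": topic_name}],
        "Statistic": "Sum",
        "Period": 300,  # evaluate in five-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no datapoints means no failures
        "AlarmActions": [notify_topic_arn],
    }
```

With boto3, the result can be passed as `boto3.client("cloudwatch").put_metric_alarm(**failed_delivery_alarm(...))`. The alarm should notify through a different topic than the one it watches, so a delivery problem cannot hide its own alert.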

Results

✅ SNS alerts fully restored
📉 24% drop in support tickets related to missed alerts
🛡️ SLA risk mitigated
📚 Lessons documented for reusability

What I Learned

SNS delivery failures are easy to miss without logging
IAM permissions alone aren’t always enough; SNS topic policies also matter
Always validate inputs before subscribing endpoints
Real-time alerting needs real-time observability

Skills Demonstrated

Amazon SNS, CloudWatch, IAM
Real-world troubleshooting
Scripting and automation
Communication across support and engineering
Writing clear technical documentation
