While working with a client who runs a cloud-based monitoring platform, I was tasked with troubleshooting a critical issue: SNS notifications had stopped reaching end users. These alerts were tied to production systems and SLAs, so restoring delivery was high priority.
The Problem
The system was designed to send real-time alerts through AWS Simple Notification Service (SNS)—primarily email and SMS. The alerts were triggered as expected, but users weren’t receiving them.
We had no obvious error messages, and the CloudWatch dashboards looked normal. But something was off under the hood.
What I Did
✅ Step 1: Verified the SNS Trigger
First, I confirmed the alerts were being published to the SNS topic. I checked CloudWatch logs and metrics, and everything showed successful Publish API calls. That ruled out the application layer.
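A check like this can be scripted rather than read off a dashboard. The sketch below is an illustration, not the exact tooling used on the engagement: it assumes `cloudwatch` is a `boto3.client("cloudwatch")` object, and `sns_metric_sum` and `publish_vs_failed` are hypothetical helper names. It sums the standard `AWS/SNS` metrics for one topic so that publishes can be compared with deliveries and failures.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def sns_metric_sum(cloudwatch: Any, topic_name: str, metric_name: str, hours: int = 1) -> float:
    """Sum one AWS/SNS metric for a topic over the last `hours` hours."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SNS",
        MetricName=metric_name,
        Dimensions=[{"Name": "TopicName", "Value": topic_name}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # five-minute buckets
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in response.get("Datapoints", []))


def publish_vs_failed(cloudwatch: Any, topic_name: str, hours: int = 1) -> Dict[str, float]:
    """Collect published, delivered, and failed counts side by side."""
    return {
        "published": sns_metric_sum(cloudwatch, topic_name, "NumberOfMessagesPublished", hours),
        "delivered": sns_metric_sum(cloudwatch, topic_name, "NumberOfNotificationsDelivered", hours),
        "failed": sns_metric_sum(cloudwatch, topic_name, "NumberOfNotificationsFailed", hours),
    }
```

A healthy publish count next to a low delivered count is the pattern that points away from the application and toward subscriptions or endpoints.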
🔎 Step 2: Checked Subscription Status
Then I looked at the subscriptions tied to the topic. Several were stuck in PendingConfirmation status. These users hadn’t clicked the confirmation link from AWS, so SNS never delivered notifications to those endpoints.
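Finding those subscriptions is easy to automate. This is a minimal sketch, assuming `sns_client` is a `boto3.client("sns")` object; the helper names are hypothetical. It relies on SNS reporting the literal string `PendingConfirmation` in the `SubscriptionArn` field until the endpoint confirms.

```python
from typing import Any, Dict, Iterator, List


def iter_subscriptions(sns_client: Any, topic_arn: str) -> Iterator[Dict[str, str]]:
    """Yield every subscription on the topic, following NextToken pagination."""
    kwargs = {"TopicArn": topic_arn}
    while True:
        page = sns_client.list_subscriptions_by_topic(**kwargs)
        yield from page.get("Subscriptions", [])
        token = page.get("NextToken")
        if not token:
            return
        kwargs["NextToken"] = token


def pending_subscriptions(sns_client: Any, topic_arn: str) -> List[Dict[str, str]]:
    """Return subscriptions whose endpoint has not confirmed yet."""
    return [
        sub
        for sub in iter_subscriptions(sns_client, topic_arn)
        # Unconfirmed subscriptions carry this placeholder instead of an ARN.
        if sub.get("SubscriptionArn") == "PendingConfirmation"
    ]
```

Each returned entry includes `Protocol` and `Endpoint`, which is enough to decide who needs a fresh confirmation request.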
🛠️ Step 3: Reviewed Delivery Logs
After enabling SNS delivery logging, I found that some messages were failing silently. Reasons ranged from email typos to domains blocking Amazon SES. One SMS endpoint failed because it didn’t include a country code.
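How logging was switched on isn't covered above, so treat the following as one possible route rather than a record of what was done. For SMS, delivery status logging is an account-level setting applied with `set_sms_attributes`. The sketch assumes `sns_client` is a `boto3.client("sns")` object and `role_arn` is a placeholder for an IAM role that SNS can use to write to CloudWatch Logs; the function names are hypothetical.

```python
from typing import Any, Dict


def sms_delivery_logging_attributes(role_arn: str, success_sample_percent: int = 100) -> Dict[str, str]:
    """Build the SMS attributes that turn on delivery status logging."""
    if not 0 <= success_sample_percent <= 100:
        raise ValueError("success_sample_percent must be between 0 and 100")
    return {
        # Role SNS assumes to write delivery logs to CloudWatch Logs.
        "DeliveryStatusIAMRole": role_arn,
        # Percentage of successful deliveries to log; SNS expects a string.
        # Setting it to "0" logs failed deliveries only.
        "DeliveryStatusSuccessSamplingRate": str(success_sample_percent),
    }


def enable_sms_delivery_logging(sns_client: Any, role_arn: str, success_sample_percent: int = 100) -> None:
    """Apply the logging attributes to the account's SMS settings."""
    sns_client.set_sms_attributes(
        attributes=sms_delivery_logging_attributes(role_arn, success_sample_percent)
    )
```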
🔐 Step 4: Audited Permissions and Policies
Even though the IAM role had sns:Publish permissions, the topic policy didn’t explicitly allow publishes from the service’s role. This kind of misalignment can block messages without generating obvious errors.
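Here is a sketch of what an explicit allow might look like. Whether it is strictly required depends on the setup; it is when the publishing role lives in a different AWS account from the topic, because then both the identity policy and the topic policy must allow the action. The ARNs, the `Sid`, and the function names are placeholders for illustration.

```python
import json
from typing import Any, Dict


def allow_publish_statement(topic_arn: str, role_arn: str,
                            sid: str = "AllowServiceRolePublish") -> Dict[str, Any]:
    """Build a topic-policy statement that lets one IAM role publish."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "sns:Publish",
        "Resource": topic_arn,
    }


def with_publish_statement(policy_json: str, topic_arn: str, role_arn: str) -> str:
    """Return the policy with the publish statement added exactly once."""
    policy = json.loads(policy_json)
    existing = policy.get("Statement", [])
    if isinstance(existing, dict):  # a lone statement may not be wrapped in a list
        existing = [existing]
    new = allow_publish_statement(topic_arn, role_arn)
    # Drop any earlier copy with the same Sid so re-running is idempotent.
    policy["Statement"] = [s for s in existing if s.get("Sid") != new["Sid"]] + [new]
    return json.dumps(policy, indent=2)
```

With boto3, the current policy can be read from `get_topic_attributes(TopicArn=...)["Attributes"]["Policy"]` and written back with `set_topic_attributes(TopicArn=..., AttributeName="Policy", AttributeValue=...)`.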
🔧 Step 5: Fixed Data Issues
I corrected invalid email addresses, updated SMS formats with the proper international prefix, and cleaned up subscription scripts to validate inputs before adding them.
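The validation step might look something like this. It is a sketch, not the actual subscription script: `validate_endpoint` is a hypothetical helper, the phone check follows the E.164 format (a leading `+`, a country code, at most 15 digits in total), and the email check is a basic sanity test rather than full RFC validation.

```python
import re

# E.164: "+" followed by 2 to 15 digits, with no leading zero.
E164_RE = re.compile(r"\+[1-9]\d{1,14}")
# Loose sanity check: one "@", no whitespace, and a dot in the domain.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def validate_endpoint(protocol: str, endpoint: str) -> str:
    """Return a cleaned endpoint, or raise ValueError if it looks invalid."""
    cleaned = endpoint.strip()
    if protocol == "sms":
        # Drop common formatting characters before checking the digits.
        cleaned = re.sub(r"[\s\-()]", "", cleaned)
        if not E164_RE.fullmatch(cleaned):
            raise ValueError(f"not an E.164 phone number: {endpoint!r}")
    elif protocol == "email":
        if not EMAIL_RE.fullmatch(cleaned):
            raise ValueError(f"not a plausible email address: {endpoint!r}")
    else:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    return cleaned
```

Calling this before `subscribe` means a number without a country code is rejected up front instead of failing later at delivery time.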
The Fix
- Manually re-confirmed or re-sent pending subscriptions
- Corrected endpoint formats and typos
- Updated the SNS topic policy to allow proper publishing
- Enabled delivery logging and added alerts for future failures
- Documented all steps in the team runbook for future incidents
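For the alerting piece, one option is a CloudWatch alarm on the topic's failed-delivery metric. The sketch below only builds the arguments for `put_metric_alarm`; the alarm name, the five-minute period, and the zero threshold are illustrative choices, not the values used in production.

```python
from typing import Any, Dict


def failed_delivery_alarm(topic_name: str, notify_arn: str) -> Dict[str, Any]:
    """Build put_metric_alarm arguments that fire on any failed delivery."""
    return {
        "AlarmName": f"sns-{topic_name}-delivery-failures",
        "AlarmDescription": f"SNS topic {topic_name} reported failed notifications",
        "Namespace": "AWS/SNS",
        "MetricName": "NumberOfNotificationsFailed",
        "Dimensions": [{"Name": "TopicName", "Value": topic_name}],
        "Statistic": "Sum",
        "Period": 300,  # evaluate in five-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No datapoints means no failures were reported, so stay in OK.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [notify_arn],
    }
```

With a `boto3.client("cloudwatch")` object this would be applied as `cloudwatch.put_metric_alarm(**failed_delivery_alarm(...))`. It is worth pointing `notify_arn` at a different topic or channel from the one being monitored, so a broken topic is not responsible for announcing its own failures.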
Results
- ✅ SNS alerts fully restored
- 📉 24% drop in support tickets related to missed alerts
- 🛡️ SLA risk mitigated
- 📚 Lessons documented for reusability
What I Learned
- SNS delivery failures are easy to miss without logging
- IAM permissions aren’t enough; SNS topic policies also matter
- Always validate inputs before subscribing endpoints
- Real-time alerting needs real-time observability
Skills Demonstrated
- AWS SNS, CloudWatch, IAM
- Real-world troubleshooting
- Scripting and automation
- Communication across support and engineering
- Writing clear technical documentation