While working with a client who runs a cloud-based monitoring platform, I was tasked with troubleshooting a critical issue: SNS notifications had stopped reaching end users. These alerts were tied to production systems and SLAs, so restoring delivery was high priority.

The Problem

The system was designed to send real-time alerts through AWS Simple Notification Service (SNS)—primarily email and SMS. The alerts were triggered as expected, but users weren’t receiving them.

There were no obvious error messages, and the CloudWatch dashboards looked normal. The failure had to be somewhere between the publish call and the recipient's inbox or phone.

What I Did

Step 1: Verified the SNS Trigger

First, I confirmed the alerts were being published to the SNS topic. I checked CloudWatch logs and metrics, and everything showed successful Publish API calls. Messages were reaching SNS, which ruled out the application layer.
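A check like this can be scripted. The sketch below is a minimal example, assuming a boto3 CloudWatch client is passed in; it sums the standard AWS/SNS metrics NumberOfMessagesPublished, NumberOfNotificationsDelivered and NumberOfNotificationsFailed for one topic. The function name, the one-hour default window, and the topic name in the usage note are illustrative choices, not details from the client's setup.

```python
from datetime import datetime, timedelta, timezone

SNS_METRICS = (
    "NumberOfMessagesPublished",
    "NumberOfNotificationsDelivered",
    "NumberOfNotificationsFailed",
)


def sns_metric_totals(cloudwatch, topic_name, hours=1):
    """Sum each SNS metric for one topic over the last `hours` hours.

    `cloudwatch` is assumed to be a boto3 CloudWatch client,
    e.g. boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    totals = {}
    for metric in SNS_METRICS:
        response = cloudwatch.get_metric_statistics(
            Namespace="AWS/SNS",
            MetricName=metric,
            Dimensions=[{"Name": "TopicName", "Value": topic_name}],
            StartTime=start,
            EndTime=end,
            Period=300,  # 5-minute buckets
            Statistics=["Sum"],
        )
        # An empty Datapoints list sums to 0.
        totals[metric] = sum(dp["Sum"] for dp in response["Datapoints"])
    return totals
```

Calling sns_metric_totals(boto3.client("cloudwatch"), "alerts") with a hypothetical topic named "alerts" returns a dict of totals. A published count well above the delivered count points at a problem downstream of the Publish call.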

🔎 Step 2: Checked Subscription Status

Then I looked at the subscriptions tied to the topic. Several were stuck in PendingConfirmation status. These users hadn't clicked the confirmation link from AWS, and SNS does not deliver to a subscription until it is confirmed, so their notifications never went out.
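Finding unconfirmed subscriptions by hand gets tedious on a large topic. Here is a rough sketch of how it could be automated, assuming a boto3 SNS client is passed in. It relies on the fact that list_subscriptions_by_topic reports the literal string "PendingConfirmation" in the SubscriptionArn field for unconfirmed subscriptions. The function name is my own.

```python
def pending_subscriptions(sns, topic_arn):
    """Return (protocol, endpoint) pairs for unconfirmed subscriptions.

    `sns` is assumed to be a boto3 SNS client, e.g. boto3.client("sns").
    """
    pending = []
    kwargs = {"TopicArn": topic_arn}
    while True:
        page = sns.list_subscriptions_by_topic(**kwargs)
        for sub in page.get("Subscriptions", []):
            # Unconfirmed subscriptions carry this literal string
            # in place of a real subscription ARN.
            if sub["SubscriptionArn"] == "PendingConfirmation":
                pending.append((sub["Protocol"], sub["Endpoint"]))
        token = page.get("NextToken")
        if not token:
            return pending
        # Results are paged; keep going until no token is returned.
        kwargs["NextToken"] = token
```

The returned list can be used to chase up the affected users or to re-send the confirmation request by subscribing the endpoint again.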

🛠️ Step 3: Reviewed Delivery Logs

After enabling SNS delivery logging, I found that some messages were failing silently. Reasons ranged from email typos to domains blocking Amazon SES. One SMS endpoint failed because it didn’t include a country code.
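Once delivery logs are flowing, grouping failures by reason makes patterns easier to spot. The sketch below is a minimal example that works on log lines already exported from CloudWatch Logs. It assumes each line is one JSON record with a top-level "status" field and a "delivery" object containing "providerResponse"; those field names are an assumption about the log format, so check them against your own records before relying on this.

```python
import json
from collections import Counter


def failure_reasons(log_lines):
    """Count failed deliveries by provider response.

    Each line is assumed to be one JSON delivery-status record.
    Lines that are not valid JSON objects are skipped.
    """
    reasons = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict) or record.get("status") != "FAILURE":
            continue
        delivery = record.get("delivery") or {}
        reasons[delivery.get("providerResponse", "unknown")] += 1
    return reasons
```

Calling failure_reasons(lines).most_common(5) gives the most frequent failure reasons first, which is usually enough to separate bad addresses from blocked domains or malformed numbers.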

🔐 Step 4: Audited Permissions and Policies

Even though the IAM role had sns:Publish permissions, the topic policy didn’t explicitly allow publishes from the service’s role. This kind of misalignment can block messages without generating obvious errors.
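A policy audit like this can be partly automated. The following is a simplified sketch that inspects a topic policy document and reports whether any Allow statement grants SNS:Publish to a given role ARN. It assumes the policy JSON comes from the "Policy" entry returned by the SNS get_topic_attributes call. The function names are hypothetical, and the check deliberately ignores Condition blocks, Deny statements, and partial wildcards. Whether a missing statement actually blocks a publish depends on the account setup (cross-account access, for example), so treat a False result as a prompt to look closer, not as a verdict.

```python
import json


def _as_list(value):
    """Policy fields may be a single value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def topic_policy_allows_publish(policy_json, role_arn):
    """Return True if an Allow statement names `role_arn` (or "*")
    as a principal for SNS:Publish.

    Simplified on purpose: ignores Condition, Deny, NotAction,
    NotPrincipal, and partial wildcards such as "SNS:Pub*".
    """
    policy = json.loads(policy_json)
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = {a.lower() for a in _as_list(stmt.get("Action", []))}
        if not actions & {"sns:publish", "sns:*", "*"}:
            continue
        principal = stmt.get("Principal", {})
        if principal == "*":
            return True
        if isinstance(principal, dict):
            aws = _as_list(principal.get("AWS", []))
            if "*" in aws or role_arn in aws:
                return True
    return False
```

With a boto3 SNS client, the input would be sns.get_topic_attributes(TopicArn=topic_arn)["Attributes"]["Policy"].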

🔧 Step 5: Fixed Data Issues

I corrected invalid email addresses, updated SMS formats with the proper international prefix, and cleaned up subscription scripts to validate inputs before adding them.
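The input validation could look something like the sketch below. It is a minimal example, not the client's actual script: the function name and the regular expressions are my own. The SMS check requires E.164 format (a leading "+", then a country code and number totalling at most 15 digits). The email check is only a loose shape test that catches obvious typos such as a missing "@" or domain suffix; it cannot tell whether a mailbox exists.

```python
import re

# E.164: "+" followed by 2 to 15 digits, the first of which is non-zero.
E164_RE = re.compile(r"\+[1-9]\d{1,14}")
# Loose shape check only: something@something.something, no whitespace.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid_endpoint(protocol, endpoint):
    """Return True if `endpoint` looks usable for the given protocol.

    Only "sms" and "email" are handled; anything else raises ValueError.
    """
    endpoint = endpoint.strip()
    if protocol == "sms":
        return E164_RE.fullmatch(endpoint) is not None
    if protocol == "email":
        return EMAIL_RE.fullmatch(endpoint) is not None
    raise ValueError(f"unsupported protocol: {protocol!r}")
```

Running this check before calling subscribe keeps malformed endpoints out of the topic in the first place.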

The Fix

Results

What I Learned

Skills Demonstrated

