Post Mortem: Sept 5th, 2014

Last Friday afternoon (September 5th, 2014), Sendwithus rolled out a major infrastructure upgrade aimed at increasing API throughput and improving email delivery times.

This infrastructure upgrade introduced a bug into our email sending pipeline. For a small subset of Sendwithus customers, drip campaign and segmentation emails were delivered multiple times over the course of a two-hour period early Saturday morning.

We’d like to discuss exactly how this incident happened, as well as the steps we’ve taken to correct it and prevent future occurrences.

A full timeline of events is available at the bottom of this post.

Update: At this time all affected customers have been contacted by a Sendwithus representative. If a Sendwithus representative has not been in contact with you, your account and outgoing emails were not affected.

A bug in our new infrastructure caused our email queueing system to incorrectly report email sends as failed during periods of high email traffic. As a result, our system attempted to re-send those emails, believing that the receiving ESP was not responding properly.
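To illustrate this failure mode, here is a minimal Python sketch. It is not our production code: `FakeESP`, `send_with_retries`, and the `ack_lost_on` parameter are hypothetical names invented for this example. It shows how a send that the ESP actually accepted, but that the queue recorded as failed, turns into duplicate deliveries once retries kick in.

```python
from dataclasses import dataclass, field


@dataclass
class FakeESP:
    """Hypothetical stand-in for an ESP; records every message it accepts."""
    delivered: list = field(default_factory=list)

    def accept(self, message_id: str) -> None:
        self.delivered.append(message_id)


def send_with_retries(esp: FakeESP, message_id: str, ack_lost_on: set,
                      max_attempts: int = 3) -> int:
    """Send until a success is recorded; return the number of attempts made.

    `ack_lost_on` holds the attempt numbers that are wrongly recorded as
    failures, modelling a queue that misreports sends under heavy load.
    """
    for attempt in range(1, max_attempts + 1):
        esp.accept(message_id)      # the ESP really does accept the email
        if attempt in ack_lost_on:  # ...but the queue records a failure
            continue                # so the same email is sent again
        return attempt
    return max_attempts


esp = FakeESP()
attempts = send_with_retries(esp, "welcome-email-1", ack_lost_on={1, 2})
print(attempts, esp.delivered)
```

In this sketch the recipient receives the same email three times, even though every individual send succeeded from the ESP's point of view.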

We were able to identify and correct the issue within an hour of discovery, early Saturday morning.

We’ve also added fail-safes to our email pipeline, preventing any email from ever being sent more than once over a short period of time. This extra protection against future duplicate sends comes at a slight performance cost, but we believe it to be a requirement at this time in order to provide the best service to our customers.
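One way a fail-safe of this kind could work is a time-windowed duplicate check in front of every send. The Python sketch below is an assumption-laden illustration, not a description of our actual implementation: the `DuplicateSendGuard` class, the `(recipient, email id)` key, and the one-hour window are all hypothetical choices made for the example.

```python
import time
from typing import Callable, Dict, Hashable


class DuplicateSendGuard:
    """Hypothetical guard that rejects a repeat send of the same key
    within `window_seconds` of the last allowed send."""

    def __init__(self, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the guard can be tested
        self._last_sent: Dict[Hashable, float] = {}

    def allow(self, key: Hashable) -> bool:
        """Return True and record the send, or False if it is a duplicate."""
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # same email seen too recently: block it
        self._last_sent[key] = now
        return True


# Usage with a fake clock so the example does not need to sleep.
fake_now = [0.0]
guard = DuplicateSendGuard(window_seconds=3600, clock=lambda: fake_now[0])
key = ("user@example.com", "drip-step-1")  # assumed identity of one email

first = guard.allow(key)    # first send goes through
second = guard.allow(key)   # immediate retry is blocked
fake_now[0] = 7200.0        # two hours later
third = guard.allow(key)    # outside the window, allowed again
print(first, second, third)
```

The extra lookup and write on every send is one plausible source of the slight performance cost mentioned above. A real pipeline with multiple workers would need the sent-record kept in a shared store rather than in process memory, and old entries expired.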

If you have additional questions about this incident, please feel free to email us and we’ll respond promptly.

If you’d like to be notified automatically with system status updates, please subscribe to our status page.

Timeline of Events

All times are in Pacific Time.

8:30 am A team member noticed an unusually high number of emails being sent.

8:40 am All drip campaign and segmentation delivery pipelines were immediately suspended, and senior engineering staff were alerted.

9:00 am Identified and confirmed bug, related to the infrastructure upgrade on Friday.

9:15 am Patch deployed to production systems. Verified fix.

9:30 am After verification, all email pipelines were brought back online.

10:00 am Incident response team begins investigating the scope of customers affected and updates our status page.

10:15 am Begin to personally notify customers affected by the incident, providing full details of duplicated emails and recipients.

2:00 pm Incident is marked as “resolved” and production systems are back to normal.
