On September 5th, 2014, Sendwithus rolled out a major infrastructure upgrade aimed at increasing API throughput and reducing email delivery times.
Update: All affected customers were notified and the issue was fully resolved.
The infrastructure upgrade introduced a bug into our email sending pipeline. For a small subset of Sendwithus customers, drip campaign and segmentation emails were delivered multiple times over a two-hour period early on the morning of September 6th.
We’d like to discuss exactly how this incident happened, as well as the steps we’ve taken to correct it and prevent future occurrences.
A full timeline of events is available at the bottom of this post.
A bug in our new infrastructure caused our email queueing system to incorrectly report email sends as failed during periods of high email traffic. As a result, our system attempted to re-send those emails, believing that the receiving ESP was not responding properly.
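To make this failure mode concrete, here is a minimal sketch in Python. It is an illustration only, not our production code: the names `FlakyQueue`, `deliver_with_retry`, and `outbox` are hypothetical, and the false failure is simulated with a simple counter rather than real traffic load.

```python
class FlakyQueue:
    """Hypothetical queue that hands the email off but wrongly reports failure."""

    def __init__(self, false_failures):
        self.false_failures = false_failures  # number of sends to misreport
        self.outbox = []  # every email actually handed to the ESP

    def send(self, email_id):
        self.outbox.append(email_id)  # the email really goes out
        if self.false_failures > 0:
            self.false_failures -= 1
            return False  # ...but the caller is told the send failed
        return True


def deliver_with_retry(queue, email_id, max_attempts=3):
    """Naive retry: trusts the reported status, so a false failure causes a re-send."""
    for attempt in range(1, max_attempts + 1):
        if queue.send(email_id):
            return attempt
    return max_attempts


queue = FlakyQueue(false_failures=2)
attempts = deliver_with_retry(queue, "drip-123")
print(attempts, queue.outbox)
```

In this sketch the recipient receives the same email once per attempt, because every attempt was delivered even though only the last one was reported as successful.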
We were able to identify and correct the issue within an hour of discovery, early Saturday morning.
We’ve also added fail-safes to our email pipeline, preventing any email from ever being sent more than once over a short period of time. This extra protection against future duplicate sends comes at a slight performance cost, but we believe it to be a requirement at this time in order to provide the best service to our customers.
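A fail-safe of this kind could look roughly like the Python sketch below. This is an assumption-laden illustration rather than a description of our actual implementation: `DuplicateSendGuard`, its one-hour default window, and the in-memory dictionary are all hypothetical, and a real pipeline running on several workers would need a shared store instead.

```python
import time


class DuplicateSendGuard:
    """Hypothetical fail-safe: block the same email to the same recipient within a window."""

    def __init__(self, window_seconds=3600.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested without waiting
        self._last_sent = {}  # (recipient, email_id) -> time of the last allowed send

    def allow(self, recipient, email_id):
        """Return True if the send may proceed, False if it is a recent duplicate."""
        now = self.clock()
        key = (recipient, email_id)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: do not send
        self._last_sent[key] = now
        return True


guard = DuplicateSendGuard(window_seconds=3600)
first = guard.allow("a@example.com", "drip-123")   # allowed
second = guard.allow("a@example.com", "drip-123")  # blocked as a duplicate
print(first, second)
```

The check runs before every send, which is where a performance cost like the one mentioned above would come from: each email requires an extra lookup and write before it can leave the pipeline.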
If you’d like to be notified automatically with system status updates, please subscribe to our status page.
Timeline of Events
All times are in Pacific Time.
8:30 am A team member noticed an unusually high number of emails being sent.
8:40 am All drip campaign and segmentation delivery pipelines were immediately suspended, and senior engineering staff were alerted.
9:00 am Bug identified and confirmed as related to the infrastructure upgrade on Friday.
9:15 am Patch deployed to production systems and fix verified.
9:30 am After verification, all email pipelines were brought back online.
10:00 am Incident response team began investigating the scope of customers affected and updated our status page.
10:15 am Began personally notifying customers affected by the incident, providing full details of duplicated emails and recipients.
2:00 pm Incident marked as “resolved” and production systems back to normal operation.