How did I crash my new blog website you ask?
I 504’d my website last week because I wasn’t aware of AWS EFS throughput and burst credits.
Let me take a step back and be clear that my website isn’t anything special; I post and share my thoughts on ALL Things #AWS. I try to post about once a week, if not more. Until recently, I didn’t have any analytics set up, and I wasn’t sure I was interested in viewing the metrics. After watching the “stats” on other social media platforms, I decided it was time to set up Google Analytics and configure a WordPress plugin to support it. Keep in mind, I regularly post on social media about specific events that I’m doing or supporting. My most recent post was about “Founding the AWS #UndergroundDeepRacer road to victory, in my basement”. (see link below)
After setting up analytics, all was good, and I expected to have some data within 24 hours.
That night, while checking email, I noticed an AWS CloudWatch alarm telling me the burst credits for my EFS volume were in a critical state. Wow, I had set up CloudWatch for my EFS volume. Pretty cool, right?
"alarmDescription": "fs-0a572e8a burst credit balance - Critical - BlogWebsite-efsalarms"
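For reference, a burst-credit alarm like that one can be sketched with boto3. This is a minimal sketch, not my exact setup; the SNS topic ARN and the ~1 TiB threshold are hypothetical placeholders, and the actual API call is commented out since it needs AWS credentials:

```python
# Sketch: a CloudWatch alarm on the AWS/EFS BurstCreditBalance metric.
# The SNS topic ARN and threshold below are hypothetical placeholders.

def burst_credit_alarm_params(fs_id, threshold_bytes, sns_topic_arn):
    """Build the kwargs for cloudwatch.put_metric_alarm()."""
    return {
        "AlarmName": f"{fs_id} burst credit balance - Critical",
        "AlarmDescription": f"{fs_id} burst credit balance - Critical",
        "Namespace": "AWS/EFS",
        "MetricName": "BurstCreditBalance",
        "Dimensions": [{"Name": "FileSystemId", "Value": fs_id}],
        "Statistic": "Average",
        "Period": 300,                 # evaluate over 5-minute windows
        "EvaluationPeriods": 1,
        "ComparisonOperator": "LessThanThreshold",
        "Threshold": float(threshold_bytes),
        "AlarmActions": [sns_topic_arn],
    }

params = burst_credit_alarm_params(
    "fs-0a572e8a",
    1 * 1024**4,  # alarm when the balance drops below ~1 TiB (placeholder)
    "arn:aws:sns:us-east-1:123456789012:blog-alerts",  # hypothetical ARN
)
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```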
Well, guess what I did immediately after receiving that email? Yup, you guessed it, NOTHING! I thought, oh, it’s just hitting a threshold, and it will resume normal operations shortly. An hour later, I checked my website and was greeted by the fun 504 error. Of course, I immediately logged in and thought, “Oh great, another WordPress plugin crashed my website.” I removed the most recently installed plugin and killed any instances, but my website still didn’t come up.
There I was, troubleshooting what went wrong: why is it crashing? First, I provisioned more resources; instead of running on a general-purpose T series instance, I moved up to an M series. I modified my Auto Scaling group and terminated the instances, but still nothing!
I continued to be frustrated that my website was down, but part of me was sort of OK with it because, heck, it wasn’t like I was publishing articles with thousands of readers.
The current stats (5-20-2020) don’t lie, but as you’ll find out at the end of this post, it wasn’t analytics that killed my website. I think the title gave it away.
Where was I? Oh, that’s right, my website is down. Since it was late, I decided to head to bed and look at it first thing in the morning to try to figure out the issue.
The next morning, my first instinct was to enable access to my WordPress server, SCP the files to the jump host and start over.
Hint: *Suspend Auto Scaling while transferring files, or you’ll have to restart the whole process over and over* (insert palm to face)
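If you want to avoid the same face-palm, suspending the ASG’s Terminate and ReplaceUnhealthy processes keeps it from recycling the instance mid-transfer. A minimal boto3 sketch (the group name is a placeholder, and the actual calls are commented out since they need AWS credentials):

```python
# Sketch: suspend the ASG processes that would terminate or replace an
# "unhealthy" instance while files are being copied off it.
# "blog-website-asg" is a placeholder group name.

def suspend_params(asg_name):
    return {
        "AutoScalingGroupName": asg_name,
        # Keep the ASG from killing the instance mid-transfer.
        "ScalingProcesses": ["Terminate", "ReplaceUnhealthy"],
    }

params = suspend_params("blog-website-asg")
# import boto3
# asg = boto3.client("autoscaling")
# asg.suspend_processes(**params)
# ...copy files off the instance... then:
# asg.resume_processes(AutoScalingGroupName="blog-website-asg")
```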
Two (2) hours later, the files were still copying to the jump host. Seriously, this was taking too long for a small blogging site. I decided to take a look at the email I received the night before and dig into the CloudWatch logs.
Maybe even take a look at the EFS volume and see if I can make any changes.
It looks like the CloudFormation template I deployed set the EFS volume to Bursting throughput mode. That shouldn’t be a problem, unless you post an awesome blog article about an AWS #UndergroundDeepRacer event and it gets shared out socially.
I decided to change EFS from Bursting to Provisioned while I was still copying files from the WordPress server to my jump server, and immediately after the change took effect, the files finished transferring in seconds.
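That same change can be made programmatically. A sketch using boto3’s `update_file_system` (the 10 MiB/s figure is a placeholder, not what I actually provisioned, and the call itself is commented out):

```python
# Sketch: switch an EFS file system from Bursting to Provisioned
# Throughput mode. The 10 MiB/s value is a placeholder.

def provisioned_mode_params(fs_id, mibps):
    """Kwargs for efs.update_file_system() to enable Provisioned Throughput."""
    return {
        "FileSystemId": fs_id,
        "ThroughputMode": "provisioned",
        "ProvisionedThroughputInMibps": mibps,
    }

params = provisioned_mode_params("fs-0a572e8a", 10.0)
# import boto3
# boto3.client("efs").update_file_system(**params)
```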
Now that was really interesting, and it showed I didn’t understand EFS that well when setting up my website. I didn’t know what burst credits were, or that throughput could be provisioned to guarantee performance. It was time to go do some research on EFS burst credits and converting from Bursting to Provisioned.
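Here’s the math that bit me, as I understand the docs: in Bursting mode, a file system earns credits at a baseline of 50 MiB/s per TiB of data stored, and a small file system starts with roughly a 2.1 TiB credit balance. Drive it harder than baseline and that balance drains. A rough back-of-the-envelope sketch (figures from the EFS documentation at the time; double-check current values):

```python
# Rough model of EFS Bursting-mode credits (figures per the AWS docs at
# the time of writing; verify against current documentation).

def baseline_mibps(stored_gib):
    """Baseline throughput: 50 MiB/s per TiB of data stored."""
    return 50.0 * stored_gib / 1024.0

def hours_until_credits_empty(balance_tib, stored_gib, drive_mibps):
    """How long a sustained load above baseline takes to drain the balance."""
    drain = drive_mibps - baseline_mibps(stored_gib)  # net MiB/s burned
    if drain <= 0:
        return float("inf")  # at or below baseline, credits refill
    balance_mib = balance_tib * 1024 * 1024
    return balance_mib / drain / 3600.0

# A ~10 GiB blog site bursting at 100 MiB/s drains a full 2.1 TiB
# balance in roughly six hours:
print(hours_until_credits_empty(2.1, 10, 100.0))
```

A tiny blog barely earns any credits back, so one well-shared post can burn through the balance in an afternoon; once it hits zero, throughput drops to the (tiny) baseline, and the site starts timing out.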
The next step was to test the changes: first terminating the instances, then updating the ASG (Auto Scaling group) to one (1) instance and removing the Terminate process suspension. After waiting five (5) minutes, my website came back up!
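Those restore steps can also be sketched with boto3 (the group name is a placeholder, and the actual calls are commented out since they need AWS credentials):

```python
# Sketch: scale the ASG back to one instance and resume the suspended
# Terminate process. "blog-website-asg" is a placeholder group name.

def restore_asg_params(asg_name):
    """Build kwargs for update_auto_scaling_group() and resume_processes()."""
    update = {"AutoScalingGroupName": asg_name, "DesiredCapacity": 1}
    resume = {"AutoScalingGroupName": asg_name, "ScalingProcesses": ["Terminate"]}
    return update, resume

update, resume = restore_asg_params("blog-website-asg")
# import boto3
# asg = boto3.client("autoscaling")
# asg.update_auto_scaling_group(**update)
# asg.resume_processes(**resume)
```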
I wasn’t sure if I was out of the woods yet or if the site was going to crash again in a few minutes. All day I kept checking to make sure it was fine and continued to watch the CloudWatch metrics to make sure the changes I made didn’t have any adverse effects.
I waited a full day before reinstalling and configuring analytics again, and after retesting, everything seemed to work as it should.
Now, I realize I didn’t answer the question of what a 504 error is or what it means.
A 504 Gateway Timeout Error indicates that a web server attempting to load a page for you did not get a timely response from another server from which it requested information. It’s called a 504 error because that’s the HTTP status code that the web server uses to define that kind of error.
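If you ever want to look that up quickly, the status code and its reason phrase are baked into Python’s standard library:

```python
# Quick stdlib lookup of the 504 status code and its meaning.
from http import HTTPStatus

status = HTTPStatus.GATEWAY_TIMEOUT
print(status.value, status.phrase)  # 504 Gateway Timeout
print(status.description)           # short upstream-timeout description
```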
In my case, the underlying storage was too slow to respond, causing my website not to load and my Auto Scaling group to go crazy because the ELB (Elastic Load Balancer) had no healthy instances.
Lessons learned:
- Understand your architecture and the AWS services used in your production deployment, including their limits.
- Listen to CloudWatch alarms and notifications, or don’t even bother setting them up.
- Make sure your infrastructure is capable of scaling during peak times, including when you post something important that might draw more traffic than expected.