Giant sites with hundreds of machines can most likely skip this post, or just use it to remember back to when you were small. Remember these are my opinions about how things should be done. Your mileage may vary. Or I could just be completely wrong… please let me know if I am.
- Tip #1: A couple of little shell scripts – one called SPINUP and the other called BACKUP. SPINUP creates instances based on the parameters you set in the shell script; it’s a wrapper for ec2-run-instances. BACKUP takes an image of any running instance – it’s a trivial wrapper around the ec2-create-image program. You can run BACKUP, then use SPINUP to make a clone. Instead of building my environments from scratch using Chef, I tend to improve on a base AMI and use that with BACKUP and SPINUP. You’ll need to install the AWS command line tools, and you should rename SPINUP.txt and BACKUP.txt to SPINUP.sh and BACKUP.sh respectively.
- Tip #2: Useful Instance Descriptions with Dates. For quite a while I was guilty of naming a new image something like “NEW APP SERVER”. Fine, until the next iteration… then it’s something like “REALLY NEW APP SERVER”. And although AWS will tell you, if you look hard, how long an instance has been running (like 7435 hours), that’s not really helpful in dispelling the self-induced confusion. So include the date and description in a consistent format, e.g. “APP1 SERVER – MAY 18 2014”.
- Tip #3: Use two-factor authentication to log in to the AWS console. For all accounts, always. Amazon has the AWS Virtual MFA app for Android and Google Authenticator for the iPhone to generate auth codes, and you’ve always got your phone, so there’s really no reason not to do it. Remember, if someone gets unauthorized access to your console they can destroy everything. Or mine bitcoin.
- Tip #4: Use Amazon Route53 for your DNS if you can; it’s awesome. It’s incredibly cheap (about 1/100 of something like Dyn), and updates propagate across Amazon’s internal network very quickly (like in seconds).
- Tip #5: Assign Elastic IP addresses to each of your “main services”, and use those fixed addresses in Route53 above. This suggestion may appear a little counterintuitive, since using a CNAME to Amazon’s instance hostnames would let “ssh app1.bobo.com” resolve to the public address if you’re outside of AWS, or the internal AWS address if you’re inside. My logic for doing this is that I’ve been burned by DNS propagation times – I’d make the change, ssh to app1.bobo.com and still be pointed to the old machine. So I’d have to ssh in using the (new) IP address. Gets old fast. Especially when you power down the production machines by mistake. Gets even worse when you’re swapping machines around (dev->prod->old)… at least this way ssh to app1.bobo.com always goes to the right place.
- Tip #6: Use your DNS to impose a sensible structure, especially if you’re in an environment with dev, production, etc. Trying to do this is a little like changing the oil in a moving car, so be careful. Originally, and in many places, I see machine names like “devapp1.bobo.com”; not the end of the world until you go to promote devapp1.bobo.com->app1.bobo.com. It tends to get confusing, and it’s harder to do programmatically (using a shell script or Chef/Puppet/Ansible). Instead, create subdomains for dev and friends. So when we move a machine into production it goes from app1.pre.bobo.com -> app1.bobo.com, and the old one goes from app1.bobo.com -> app1.old.bobo.com… and we change the IP addresses accordingly. This way, if something goes wrong, we just move stuff from app1.old.bobo.com back to app1.bobo.com instead of trying to find where that old machine went.
- Tip #7: Use a load balancer as the only publicly accessible endpoint for your web services. The AWS load balancers are easy to program, if you want to do it that way, but manually putting a new machine into production is easy too – add the new machine to the load balancer and take the other one off. Voila, zero downtime. Of course, I’d point “app.bobo.com” at the load balancer using Amazon Route53. This configuration will also allow you to add more servers (app1->appN) and divide the load seamlessly if the need arises.
- Tip #8: Have a meaningful /etc/motd and prompt. Just like the old days (actually the old days were more fun, since we had various rude fortune programs running). The idea is that with AWS it’s entirely possible to have a zillion windows open, and it’s often hard to tell where you are. An ASCII banner generator helps make the motd hard to miss.
- Tip #9: Backups. You have a backup script, use it. App servers can more or less be regenerated, so that’s not such a big deal, but in places running MySQL databases? I’ve heard people say “we don’t really need backups, we’re running in master-slave mode with N slaves, with auto-failover, so what’s the point?” The point is data corruption, and being able to go back to the point in time *before* the new guy wiped out that table/database/whatever. So, how do you take a backup of a MySQL database? Simple: bring down MySQL and run the BACKUP script. Bring MySQL down first, because creating an image reboots the instance by default, and you should never crash a machine if you can avoid it.
- Tip #10: Expiration Dates. See Tip #9, Backups? I’m sort of sloppy: I clone an entire running instance as my backup. I don’t snapshot just the volume and re-attach it; I grab the whole machine. Space is pretty cheap on AWS, and being able to run SPINUP to get my instance back is priceless. The downside is you’re going to end up with a lot of AMIs, volumes, and snapshots floating around, and cleaning them up is miserable – especially if backups were being taken sort of “organically” (see Tip #2). I recently had 32 TB of crap to go through… and that was incentive for this not to happen again. The solution is to use AWS’s tagging ability to add an expiration date to anything you can, e.g. EXPIRES=20140618. For example, with backups, I have daily backups that expire in a week, weekly backups that are kept for a month, and monthly backups that are kept for a year. An example of tagging is in the BACKUP script. Here’s a copy of my CLEANUP script, which goes through AMIs tagged in the BACKUP script and removes anything that has expired.
- Tip #11: Right-size your Instances. Similar to the above – audit your AWS environment to see if you’re using your AWS resources efficiently. In general, and in my case, I’ll often spin up an instance larger than I really need “just in case”. The problem is AWS prices double between levels, i.e. a large instance costs twice as much as a medium instance, which costs twice as much as a small instance. On top of that, AWS has been bringing out new instance types (hello m3.medium), which offer better value for money than the older generation of boxes they’re replacing. So if it can be run on a smaller instance, do it. In fact, better a couple of smaller instances hanging off a load balancer than a single giant point of failure.
- Tip #12: Use Reserved Instances. A lot of places use AWS a little like their old machine room… spin ’em up and let them run forever. If you’re going to use AWS that way, then take advantage of the ~40% discount they’ll give you for telling them you’ll keep the machine for a year.
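The naming and expiration conventions from Tips #2 and #10 come down to a bit of date arithmetic, sketched here in Python. This is not the BACKUP or CLEANUP script, only an illustration: the function names, the `RETENTION_DAYS` table, and the choice of 30 and 365 days for “a month” and “a year” are assumptions, and nothing here talks to AWS.

```python
from datetime import date, timedelta

# Retention from Tip #10: dailies a week, weeklies a month, monthlies a year.
# The 30- and 365-day figures are assumed approximations.
RETENTION_DAYS = {"daily": 7, "weekly": 30, "monthly": 365}


def description(name, taken_on):
    """Dated description in the style of Tip #2 (plain hyphen as separator)."""
    return f"{name} - {taken_on.strftime('%b %d %Y').upper()}"


def expires_tag(kind, taken_on):
    """Value for an EXPIRES tag (YYYYMMDD) on a backup taken on `taken_on`."""
    expiry = taken_on + timedelta(days=RETENTION_DAYS[kind])
    return expiry.strftime("%Y%m%d")


def is_expired(tag_value, today):
    """True once today is past the EXPIRES date.

    YYYYMMDD strings sort in date order, so a string comparison is enough.
    """
    return today.strftime("%Y%m%d") > tag_value


if __name__ == "__main__":
    taken = date(2014, 6, 11)
    print(description("APP1 SERVER", taken))         # APP1 SERVER - JUN 11 2014
    print("EXPIRES=" + expires_tag("daily", taken))  # EXPIRES=20140618
    print(is_expired("20140618", date(2014, 6, 19)))  # True
```

The fixed-width YYYYMMDD format is what makes cleanup simple: a script only has to compare a tag against today’s date, with no date parsing.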
Between tips 11 and 12 you can potentially drop your AWS bill by half, without any sort of performance hit.
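As a rough check on that claim, here is the arithmetic as a small Python sketch. It uses only the two figures quoted in the tips (prices doubling between instance sizes, and a roughly 40% reserved discount); actual AWS pricing varies by instance type and over time, and the function name and parameters are made up for illustration.

```python
def relative_cost(sizes_down=0, reserved=False, reserved_discount=0.40):
    """Cost relative to the current on-demand bill (1.0 = unchanged).

    sizes_down: how many instance sizes smaller you go, assuming each
                step halves the price (Tip #11).
    reserved:   whether the ~40% reserved discount applies (Tip #12).
    """
    cost = 0.5 ** sizes_down
    if reserved:
        cost *= 1 - reserved_discount
    return cost


if __name__ == "__main__":
    print(round(relative_cost(sizes_down=1), 2))                 # 0.5
    print(round(relative_cost(reserved=True), 2))                # 0.6
    print(round(relative_cost(sizes_down=1, reserved=True), 2))  # 0.3
```

Under those assumptions, dropping one size on its own already halves the cost of that instance, and combining it with a reservation brings it to about 30% of the original. Only some machines can be downsized, so “by half” for the whole bill is a plausible ballpark.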