EC2 at XaoP
Posted by Peter Vanbroekhoven on Apr 21, 2008
At XaoP, we have recently started checking out Amazon’s EC2. Although the use of virtualization technology is hardly new in hosting technologies, Amazon’s take on it offers extra flexibility for developers to exploit.
The principle
The principle behind virtual hosting is simple: you get what looks like a machine of your own to do with as you please, but in reality that machine runs inside another machine that you share with other people. This technique allows hosting companies to provide you with fully customizable servers for a very affordable price because they need less hardware to support the same number of servers.
Amazon EC2 takes this one step further and gives the idea an interesting twist. Instead of getting one virtual host, you get one image (an Amazon Machine Image or AMI) that you can customize just like a virtual host, and that you can instantiate as many times as you like. In classic setups, whether physical or virtual hosts, adding extra servers is always somewhat of a chore because you need to set each one up again, taking care to configure it as closely as possible to your other servers to make deployment problems as unlikely as possible. With EC2, you can add instances easily and be certain that they are all identical.
Each instance gets most of its state, such as the installed software, from the AMI. The partitions this state lives on are limited in size though. For storing your data, you get a much larger partition. This is where you’d usually store your database data and such. The AMIs are stored on S3 and can be run from there. The instances of these AMIs are ephemeral though; unless the instance is bundled as an AMI before shut-down, all state is lost. This applies not only to the data that came from the AMI that was booted, but also to everything on the data partition.
Opportunities
The obvious advantage of EC2 is that you can scale up your server cloud when for instance your web application gains traction. A less obvious observation is that you can also scale it down. It is an elastic cloud that can expand and shrink to fit your current need exactly, and it can do so fast.
There are many applications that have highly varying needs for computing power. A simple example of this is a document management system that is used only in a local office, and is heavily used during office hours and very sporadically used outside these hours. You can easily shave off 50% of your hosting costs by running only a fifth of your EC2 instances outside business hours.
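The 50% figure can be checked with a rough calculation. The sketch below assumes a 40-hour business week out of 168 hours and a cost proportional to instance-hours; both are illustrative assumptions, not figures from an actual setup, and the method name is made up for this example.

```ruby
# Rough estimate of the savings from scaling down outside business hours.
# Assumptions (illustrative only): 40 business hours per 168-hour week,
# and a cost that is proportional to instance-hours.
HOURS_PER_WEEK = 168.0

def weekly_saving(business_hours, off_hours_fraction)
  off_hours = HOURS_PER_WEEK - business_hours
  # Instance-hours used, relative to running every instance all week.
  used = business_hours + off_hours * off_hours_fraction
  1.0 - used / HOURS_PER_WEEK
end

saving = weekly_saving(40, 0.2)
puts format("Saving: %.0f%%", saving * 100)
```

Under these assumptions the saving comes out at roughly 60%, so the 50% estimate leaves some margin for longer office hours.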
A more pronounced example is that of our own DRP system. Most of the time, the DRP application is only consulted for shipment plans. Periodically, the planner needs to run the heavy tasks that calculate these shipment plans and all the data required to do so. This typically happens once a week, on Friday or over the weekend. Outside these hours, the firepower needed to run these heavy tasks is wasted, so again it pays to shut down the hosts running these tasks at that time.
Part of our core business is document migration. While some migrations consist of just a dump and an import, most require some transformation or enrichment of the attached meta-data. Because this process is still largely manual (at least in reviewing the meta-data after enrichment and after import), it is usually performed in batches and we want to throw the computing power at it only when needed. EC2 is ideal for this.
These examples are all somewhat predictable in their periodicity, and the starting up and shutting down of extra servers can be hard-coded or initiated manually. This is not always the case. For instance, a document management system that stores monthly income statements will usually see heavy traffic when these statements are published and consulted shortly afterwards, but because of weekends and holidays, the exact day of publishing can vary somewhat. An intelligent system can dynamically adapt to the extra traffic, starting extra hosts when needed and shutting them down afterwards.
Starting extra hosts can still take some time, so the ultimate system would try to predict the load and start extra hosts in anticipation. It doesn’t even have to start enough hosts to match the expected load, just enough to prevent going down completely and to react sufficiently quickly to get back up to full capacity.
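A minimal sketch of that last idea, assuming we have some load prediction and know roughly how much load one host can handle. The method name, the parameters and the default coverage of 50% are all hypothetical; this is not something we have built.

```ruby
# Decide how many extra hosts to start ahead of a predicted load peak.
# All names and numbers here are hypothetical.
def hosts_to_prestart(predicted_load, capacity_per_host, running_hosts, coverage = 0.5)
  # Cover only a fraction of the predicted load: enough to stay up while
  # the remaining hosts are started reactively.
  needed = (predicted_load * coverage / capacity_per_host).ceil
  # Never return a negative number when enough hosts are already running.
  [needed - running_hosts, 0].max
end

puts hosts_to_prestart(1000, 100, 2)
```

The coverage factor is the knob that trades cost against the risk of being overwhelmed before the reactive scaling kicks in.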
Challenges
Working with EC2 poses its own set of challenges. The obvious one is the lack of persistent storage, although Amazon is working on that. This means that if your server goes down due to unforeseen circumstances such as hardware failure, then without extra precautions, you stand to lose all your data. The usual solution is to back up the data regularly to S3 or your own storage. Even if you do hourly backups, you can still lose an hour’s worth of data, which can be costly if the data is highly important.
A better solution looks to be to introduce redundancy. Amazon EC2 supports availability zones that allow you to force hosts to start in physically separated and isolated locations, greatly reducing the chance of overall failure. If the data is placed on two or more hosts and one goes down, then the other still has the data. This assumes that at least one host is running at all times, or that we save the data to S3 before taking all hosts down.
An important choice is how we split the full data set of a running instance over the corresponding AMI and the data partition. More precisely, the question arises whether the AMI should carry an installation of your application or not. On the one hand, you can create an AMI for each of your applications, or even for each role within your application such as application server, database server, a server for running heavy tasks, etc. On the other hand, you can create one generic AMI on which you deploy your specific application after instantiation.
Currently, we are tending towards the option of using a single generic AMI because we believe it is the more flexible one. It really puts all its money on your deployment. You should really have scripted your deployment with packages like Capistrano anyway, whichever option you choose. This deployment can then be made as flexible as you could ever want, allowing each instance you start to be uniquely configured. Another advantage is that you can put multiple applications on a single host. This could be useful if you have an application that is rarely used and that you “piggyback” off another application’s host, or if you have applications whose use in time doesn’t overlap (much), e.g., one for Europe, one for the US and one for China. The downside of a generic AMI is of course that starting a new instance takes longer because we need to do a full deployment. If we need to be able to start instances quickly, we can still complement this scheme by taking “AMI snapshots” of deployed instances and booting these.
Implementation
With one generic AMI, all boils down to the deployment. The more flexible we make it, the more flexibility we gain in maintaining our applications. Our requirements were these:
- Be able to put the database, the web application and the heavy tasks on separate EC2 instances, with the possibility of starting multiple instances for each of these parts of our application.
- Start the EC2 instances and install the right version of our application automatically on deployment so we get “one-click” deployment. Basically, we want to run one Capistrano command which we pass a single config file, and all is set up automatically.
- Allow easily changing the setup of the deployment. This includes changing the number of mongrels, the number of processes for running the heavy tasks, and how many EC2 instances we dedicate to each part of our application.
- Because EC2 dynamically assigns addresses to the EC2 instances (except if we assign one of our elastic addresses), we need some more user-friendly way to refer to the EC2 instances than by address.
- Deploy multiple applications to the same set of EC2 instances.
We basically use a slightly modified version of the AMIs provided by the EC2 on Rails project. For future reference, the data partition is mounted on /mnt. For deployment we use Capistrano which is a great tool for the job. We do need some tricks to get some of the behaviour we want though.
To refer to EC2 instances, we use labels. Each EC2 instance we deploy gets its own unique label, which is placed in a file /mnt/LABEL on the instance. This allows us to identify the instances. We don’t want to be constantly fetching those files from all our running EC2 instances, so we cache the labels in a file with the ID assigned by Amazon to the instance and the address as keys. When we read the file, we cross-reference it against the output of the “ec2-describe-instances” tool. If this output shows that some instances disappeared or that an instance changed address, we invalidate the entry for that instance. We fetch labels only for new instances, to update our cache. So we may run into trouble only if someone other than us starts an instance that ends up with the same identifier and the same address as a cached one, but this is highly unlikely.
Next, we start the instances for the labels that have no instance assigned per the procedure above. Because we want to have one-click deployment, after starting an instance we would like to deploy to it right away. This doesn’t work though, as the instance doesn’t get an address right away, and it still needs to boot before the SSH server is started. The output of “ec2-run-instances” gives us the ID of the instance we started, and we then poll by periodically running “ec2-describe-instances” on the ID until an address shows up. Then we use that address to contact the instance to put the label file on it. The connection may fail because the SSH server hasn’t started yet, so we have to rescue any exceptions and retry, like this:
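The cross-referencing step can be sketched in plain Ruby. This is a simplified illustration rather than our actual code: the method name and the data layout are assumptions, and parsing the output of “ec2-describe-instances” into ID and address pairs is left out.

```ruby
# Reconcile a cached label table with the instances EC2 currently reports.
# cache:   { [instance_id, address] => label }
# running: array of [instance_id, address] pairs, e.g. parsed from the
#          output of ec2-describe-instances (parsing not shown).
# Returns the cache entries that are still valid and the instances whose
# label still has to be fetched from /mnt/LABEL.
def reconcile(cache, running)
  valid = cache.select { |key, _label| running.include?(key) }
  unknown = running.reject { |key| valid.key?(key) }
  [valid, unknown]
end

cache = {
  ["i-aaaa", "10.0.0.1"] => "db",
  ["i-bbbb", "10.0.0.2"] => "web1", # address has changed since caching
  ["i-cccc", "10.0.0.3"] => "tasks" # instance is no longer running
}
running = [["i-aaaa", "10.0.0.1"], ["i-bbbb", "10.0.0.9"], ["i-dddd", "10.0.0.4"]]

valid, unknown = reconcile(cache, running)
p valid
p unknown
```

Here only the entry for i-aaaa survives; i-bbbb (changed address) and i-dddd (new) end up in the list of instances to fetch a label from.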
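attempts = 0
begin
  put label, "/mnt/LABEL"
rescue StandardError => e
  # The SSH server may not be up yet: log the error, wait and try again.
  p e
  attempts += 1
  raise if attempts >= 30 # arbitrary limit so we don't retry forever
  sleep 5
  retry
end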
The put method is what Capistrano uses to upload files. We only need the retry logic at this point; once the upload has succeeded, we are sure the EC2 instance is reachable.
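The polling for an address that precedes the upload can be sketched along the same lines. The method name and its parameters are hypothetical, and the block stands in for a call to “ec2-describe-instances” that returns the address, or nil while none has been assigned yet.

```ruby
# Poll until an instance reports an address, giving up after a fixed
# number of attempts. The block stands in for a lookup through
# ec2-describe-instances and should return the address or nil.
def wait_for_address(instance_id, attempts = 30, interval = 5)
  attempts.times do
    address = yield(instance_id)
    return address if address
    sleep interval
  end
  raise "Instance #{instance_id} did not get an address in time"
end

# Usage with a stand-in lookup that succeeds on the third call.
calls = 0
address = wait_for_address("i-aaaa", 5, 0) do |_id|
  calls += 1
  calls >= 3 ? "10.0.0.1" : nil
end
puts address
```

Bounding the number of attempts keeps a deployment from hanging forever when an instance fails to start.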
This immediately raises the question of how we connect to the newly started server only. A Capistrano task is typically executed for several predetermined servers at once, with commands such as run and put being run on each of the target servers. This makes it impossible to include if statements that are evaluated on a per-server basis, but that is exactly what we need. For each server, we need to check if it is running and, if not, start it. To make things worse, we can’t specify the servers ahead of time because we don’t know their addresses ahead of time.
To solve this, we generate the Capistrano tasks dynamically from the moment the server addresses are known. It would help if Capistrano’s roles could be dynamically scoped, but as far as we know they can’t. For each conceptual task, we generate an actual task for each of the addresses. This allows us to take the actual addresses from the config file or get them from EC2. The template we use is something like this:
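require 'yaml'

# Define a task that first generates the per-host tasks, then runs the block.
def configurated_task(name, &blk)
  task name do
    configure
    instance_eval(&blk)
  end
end

# Read the config file passed with -s conf=conf_file and generate the tasks.
def configure
  config = YAML.load(File.read(conf))
  task_for_hosts :do_setup, config[:hosts] do
    setup_host unless host_is_setup
  end
end

# Generate one task per host, plus one task that calls each of them in turn.
def task_for_hosts(name, hosts, &blk)
  hosts.each do |h|
    task "#{name}_for_#{h}", :hosts => [h] do
      instance_eval(&blk)
    end
  end
  task name do
    hosts.each do |h|
      send("#{name}_for_#{h}")
    end
  end
end

configurated_task :setup do
  do_setup
end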
This defines the setup task, which is called like so:
cap setup -s conf=conf_file
The configurated_task method generates a task with the given name which first calls configure and then calls the block. The hosts are extracted from the config file and tasks are generated for these hosts. This is actually done by task_for_hosts, which generates a task for each of the hosts that executes the block and that encodes the host in its name. It generates one additional task that just calls the host-specific tasks it generated.
This setup enables us to give each host a personal treatment. We can check per host if it was already set up, and if not set it up. We can check per host if a certain revision of our application has already been deployed, and if not deploy it. We can start a different number of mongrels or tasks per host, depending on what instance type we’ve chosen.
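To illustrate the per-host treatment, here is a sketch of how per-host settings could be read from a config file. The layout of the file, the key names and the process_plan helper are hypothetical and only meant to show the idea; our actual config file is not shown here.

```ruby
require 'yaml'

# Hypothetical config layout: one entry per label with per-host settings.
CONFIG_TEXT = <<~CONF
  hosts:
    web1:
      role: web
      mongrels: 4
    tasks1:
      role: tasks
      workers: 2
CONF

config = YAML.safe_load(CONFIG_TEXT)

# Build a per-host plan of which kind of process to start and how many.
def process_plan(hosts)
  hosts.map do |label, settings|
    count = settings["mongrels"] || settings["workers"] || 1
    [label, settings["role"], count]
  end
end

process_plan(config["hosts"]).each do |label, role, count|
  puts "#{label}: start #{count} #{role} process(es)"
end
```

Inside a generated per-host task, a plan like this would determine how many mongrels or task processes to start on that particular host.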
The two techniques outlined above provide us with all the tools necessary to satisfy all of our requirements. There may be better ways to do this, especially the hack with hosts, but it works well.
Conclusion
In our business, it sure looks like EC2 could be a valuable asset. It not only gives us the computing power we need at will, but it also allows us to reduce the computing power when possible to save costs. This gives us a cost-effective solution with a flexibility that is unheard of with the classic hosting solutions. Deploying to EC2 has its own set of challenges, but some of these will be relieved by the persistent storage that will be introduced to EC2 later this year.
To make deployment even easier, we will be building a web interface to control deployment. This web interface can just call the Capistrano recipes and have them do all the hard work.