CPU Steal Time on EC2 cloud

Lately my EC2 micro instance has been suffering from spotty performance, often to the point of preventing me from logging into the machine over SSH. It runs a slimmed-down Ubuntu with barely any services beyond LAMP, so I had no idea why; the standard deployment had been working just fine, which told me it was something related to the new Java app I had just added to the server or… perhaps something else.

The top command showed an elevated “st” value, something I hadn’t seen before. Going through the forums, it turns out that st is in fact steal time, which is defined by IBM as:

Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor.

In other words, my deployment had outgrown the virtual machine dedicated to me; the CPU cycles the hypervisor actually grants my VM aren’t enough for my app. If it weren’t Amazon, I’d have assumed my cloud host was overselling their infrastructure, but I don’t think Amazon would do such a thing.

Checking with the top and vmstat commands, I could tell that the Java app I had just added was behaving exactly as it should. I had benchmarked it on my dev machine (a physical rather than virtual machine) and it had a predictable footprint that my deployment should handle easily, so my conclusion was that something in my code wasn’t playing nice with the hypervisor used by Amazon. The logical thing to do was to kill it and see what happens, which I did, and to my surprise nothing changed: the st value stayed at around 98% for 10 minutes before dropping, which may or may not be related to me killing the app, especially since restarting it didn’t impact the steal time directly.
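If you want to watch steal time over a longer window than top comfortably allows, the counter can be read straight from /proc/stat. A minimal sketch of that idea, assuming a Linux guest with the standard /proc/stat field layout (the class name and sampling interval are just illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal steal-time sampler: reads the aggregate "cpu" line from /proc/stat
// twice and reports what fraction of the interval was stolen by the hypervisor.
public class StealSampler {
    // Returns {totalJiffies, stealJiffies} from the first line of /proc/stat.
    private static long[] readCpuLine() throws Exception {
        String[] f = Files.readAllLines(Paths.get("/proc/stat")).get(0).trim().split("\\s+");
        long total = 0;
        for (int i = 1; i < f.length; i++) total += Long.parseLong(f[i]);
        long steal = f.length > 8 ? Long.parseLong(f[8]) : 0; // field 8 = steal
        return new long[] { total, steal };
    }

    public static void main(String[] args) throws Exception {
        while (true) {
            long[] a = readCpuLine();
            Thread.sleep(5000); // illustrative 5-second sampling interval
            long[] b = readCpuLine();
            double stealPct = 100.0 * (b[1] - a[1]) / Math.max(1, b[0] - a[0]);
            System.out.printf("steal: %.1f%%%n", stealPct);
        }
    }
}
```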

My current suspicion is my own Java code: it uses Thread.sleep() and Timers, and I’m not convinced either of them behaves well on a virtualized CPU.
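For what it’s worth, if the sleep/Timer loops do turn out to be the problem, the usual replacement is a ScheduledExecutorService, which parks the thread between runs instead of hand-rolling the delay. A rough sketch of that pattern, with a made-up task and interval standing in for my actual code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PollingExample {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Instead of: while (true) { doWork(); Thread.sleep(30_000); }
        // let the scheduler handle the waiting; the worker thread stays parked between runs.
        scheduler.scheduleAtFixedRate(() -> {
            // placeholder for whatever periodic work the app actually does
            System.out.println("polling at " + System.currentTimeMillis());
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```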

The other suspicion I’m keeping to myself is the cloud capacity itself; perhaps they are overselling when it comes to micro instances.

I could simply solve this issue by switching from a micro to a small instance, but that would triple what I’m currently paying.


***UPDATE***

After carefully monitoring the system, it turns out MySQL was creating too many waiting threads, causing the performance fluctuation. Restarting it cleared the issue up, until I started my Java apps again, at which point performance would degrade once more.
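The buildup is easy to see from MySQL’s own counters; a minimal JDBC sketch of the kind of check I mean, with placeholder connection details (it assumes MySQL Connector/J on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MysqlThreadCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- substitute your own host, user and password.
        String url = "jdbc:mysql://localhost:3306/mysql";
        try (Connection conn = DriverManager.getConnection(url, "root", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SHOW STATUS LIKE 'Threads_%'")) {
            // Threads_connected and Threads_running climbing while the app sits idle
            // is the same symptom that shows up in SHOW PROCESSLIST.
            while (rs.next()) {
                System.out.println(rs.getString(1) + " = " + rs.getString(2));
            }
        }
    }
}
```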

So, as I suspected, it was a code issue. It turns out that in my code I was creating many DB connections and never bothering to disconnect them, believing that Java’s garbage collection would take care of that; obviously that wasn’t the case. I fixed it by forcing the code to close every connection it opens, and on the DB side I’m going to add a threshold on idle connections to make sure no orphaned idle connections stick around.
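The “close every connection it opens” part boils down to try-with-resources, which closes the connection deterministically instead of leaving it to the garbage collector. A minimal sketch of the pattern, with an illustrative query and placeholder connection details; on the server side, the idle threshold can be something like MySQL’s wait_timeout:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class VisitCounter {
    // try-with-resources guarantees the connection, statement and result set are
    // closed when the block exits, even on exceptions -- no waiting for the GC.
    public static int countVisits(String url, String user, String pass) throws Exception {
        String sql = "SELECT COUNT(*) FROM visits";  // illustrative query
        try (Connection conn = DriverManager.getConnection(url, user, pass);
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getInt(1) : 0;
        }
    }
}
```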

Right now the system is back to its expected performance level.


3 thoughts on “CPU Steal Time on EC2 cloud”

  1. I have been reading some other posts around steal time % and Amazon EC2, and it seems that if your CPU utilization spikes for more than a brief period of time (1 or 2 seconds), they will throttle your CPU back to about 2%, stealing 98% of your compute time to service others.

    I am wondering if the garbage collection in the Java runtime is triggering this spike, which causes EC2 to throttle back your CPU resources. It would be interesting to hear other experiences with running Java-based web applications on EC2.

    I know that the web application I have been working on is currently deployed in EC2 on a micro-instance and is performing terribly. I will most likely move it to another provider which does not throttle instances so much.

    • I’m not surprised by what you are saying here. Lately I’ve been seeing my steal time flap wildly for no particular reason; the system isn’t nearly loaded enough to cause that, and I’ve already benchmarked my code on my own machine, running it for hours and even building in a couple of forced garbage-collection points (not a best practice, but my only way to get it to run on EC2). On my test machine it never exceeded the expected load, yet for some reason that’s beyond me, at least once or twice every day the steal time starts increasing and later settles down… or not.

      Lately I’ve started losing the ability to connect to the server because it gets too loaded down for SSHD to function properly (most probably due to steal %), forcing me to restart the instance just to regain connectivity.

      I’m starting to suspect this is a gentle reminder from Amazon that we should upgrade to a small instance instead of the less expensive micro instance. What I don’t get, though, is this: if my service requires nothing more than the resources available within a micro instance, why the $@#$ do I have to upgrade?

      As for alternatives, you can deploy on OpenShift or Google App Engine, but both provide a managed service rather than a system you can access directly (which I need in my case), and Rackspace is too expensive for my budget. If you have any other options, please do share them.
