Lately my EC2 micro instance has been suffering from spotty performance, often to the extent of preventing me from logging into the machine over SSH. It runs a slimmed-down Ubuntu with barely any services besides LAMP, so I had no idea why; the standard deployment had been working just fine, so I figured it was either something related to the new Java app I had just added to the server or... perhaps something else.
The top command showed an elevated “st” value, something I hadn’t seen before. Going through the forums, it turns out that st is in fact steal time, which is defined by IBM as:
Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor.
In other words, my deployment has outgrown the virtual machine dedicated to me: the CPU cycles my VM gets aren’t enough for my app. If it weren’t Amazon, I’d have assumed that my cloud host was overselling their infrastructure, but I don’t think Amazon would do such a thing.
Checking with the top and vmstat commands, I could tell that the Java app I had just added was behaving exactly as it should. I had benchmarked it on my dev machine (a physical rather than a virtual machine) and it had a predictable footprint that my deployment should handle easily, so my conclusion was that something in my code wasn’t playing nice with the hypervisor used by Amazon. The logical thing to do was to kill the app and see what happened, which I did, and to my surprise nothing changed: the st value stayed at around 98% for ten minutes before dropping, which may or may not have been related to killing the app, especially since restarting it didn’t affect the st value directly.
My current suspicion is my own Java code: I’m using both Thread.sleep() and Timers, and neither of them seems particularly safe on a virtualized CPU.
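To make that concrete, here is a minimal sketch of the kind of pattern I mean; the class, intervals, and workload are made up for illustration and are not my actual code:

    import java.util.Timer;
    import java.util.TimerTask;

    public class PollingSketch {
        public static void main(String[] args) throws InterruptedException {
            // A Timer firing frequently: every wake-up costs CPU time
            // that a micro instance may not actually be granted.
            Timer timer = new Timer();
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    // placeholder for periodic work
                }
            }, 0, 100); // fires every 100 ms (illustrative value)

            // A polling loop built on Thread.sleep(): it naps briefly,
            // wakes, checks a condition, and naps again, over and over.
            while (true) {
                // placeholder for the condition check
                Thread.sleep(50);
            }
        }
    }

The theory was that all those frequent wake-ups were fighting the hypervisor for cycles; as it turned out later, that wasn’t the real problem.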
The other suspicion, which I’m keeping to myself, is the cloud capacity itself: perhaps they are overselling when it comes to micro instances.
I could simply solve this issue by switching from a micro to a small instance, but that would triple what I’m currently paying.
After carefully monitoring the system, it turned out MySQL was creating too many waiting threads, causing the performance fluctuation. Restarting it cleared the issue, until I started my Java apps, and then the performance would degrade again.
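If you want to watch the same thing, here is a small sketch of one way to query MySQL’s thread counters from Java; the connection details are placeholders, and it assumes the MySQL Connector/J driver is on the classpath (running SHOW PROCESSLIST in the mysql client gives the same picture):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class ThreadStatus {
        public static void main(String[] args) throws SQLException {
            // Placeholder host and credentials; use your own.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mysql", "root", "secret");
                 Statement stmt = conn.createStatement();
                 // Threads_connected, Threads_running, etc. come from MySQL itself
                 ResultSet rs = stmt.executeQuery("SHOW STATUS LIKE 'Threads%'")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " = " + rs.getString(2));
                }
            }
        }
    }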
So, as I suspected, it was a code issue. It turns out my code opened many DB connections and never bothered to disconnect them, on the belief that Java garbage collection would take care of that; obviously that wasn’t the case. I fixed it by forcing the code to terminate any connection it starts, and on the DB side I’m going to add a threshold on idle connections to make sure no orphaned idle connections can linger.
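The fix itself is nothing exotic. Here is the shape of it as a sketch, with a hypothetical query standing in for my real code; the point is that the connection, statement, and result set are all closed deterministically instead of being left for the garbage collector:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class UserDao {
        // Placeholder URL and credentials, not my real deployment's.
        private static final String URL = "jdbc:mysql://localhost:3306/appdb";

        public int countUsers() throws SQLException {
            // try-with-resources closes everything, even when an exception
            // is thrown, so no connection lingers as an idle MySQL thread.
            try (Connection conn = DriverManager.getConnection(URL, "app", "secret");
                 PreparedStatement ps = conn.prepareStatement("SELECT COUNT(*) FROM users");
                 ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }

As for the DB-side threshold, the obvious knob is MySQL’s wait_timeout variable, which makes the server drop any connection that sits idle for too long, so an orphaned connection can’t hang around even if the code misses one.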
Right now the system is back to its expected performance level.