Saturday, June 30, 2012

Research in the cloud


I'm continuing to work on methods that investigate large amounts of biological sequence data. I have a new method in the works for assembly of next generation sequence data. This work grew out of a rather successful honors thesis by Matt Segar, '12. I'm also delving into mining large amounts of HIV sequence data to sift for possible patterns, mutations, motifs, etc. that are related to different levels of infection. This is work that has just started with Charles Cole, '13. I have another project that is a bit of a stretch for me. It involves mining for descriptive and predictive patterns in Twitter data related to the stock market. This work was started by Marc Burian, '12, a student in my data mining class a couple of semesters ago, and it is actually turning out some interesting results. The papers for two of these projects are in the works as I write this.

Now, I have a different problem. I am facing the reality of being a research-motivated faculty member at a non-R1 institution. At Bucknell, we strive for excellence in teaching and research. We work hard to provide an excellent education to our students, one that will maximize their chances of success after graduation. We believe that research needs to be a part of our overall career, as it makes us better teachers. In fact, one of the reasons I chose to come to Bucknell was its insistence that we do both. However, as I've noted in the past, I'm increasingly aware of the difficulties of doing both simultaneously. (I understand now, more than I ever have, why so many faculty at R1 universities have their grad students teach undergraduate courses. They don't have time for both!) Fortunately, I'm slowly discovering that undergraduate research can be successful, with the right students and the right mindset. I've had an extraordinary number of students come to me with interest in completing research projects related to my work. I did not expect to see that type of interest at the undergrad level. Bucknell really has some great students.

OK, so my main point is not so much about completing successful undergraduate research, but about completing research that requires high performance computational resources at a non-R1 university. We currently have one primary cluster of 128 cores, with roughly 4-8GB of memory per processor, depending on the configuration of the system. Now that my research is underway, I have an extraordinary amount of data to mine through, and a lot of parameters in my models to evaluate in order to understand which parameter values will give me the best results. I have thousands of jobs to execute. On my quad-core MacBook with 8GB of memory, one job takes roughly 2 hours to complete. Executing one job at a time is, as my 4-year-old son would tell me, just "silly!"

At an institution like Bucknell, with faculty who must be excellent teachers during the semester, there honestly is not much research that gets done during the semesters. The vast majority of the faculty complete their hard-core research during the summer. Our cluster sits largely unused during the semester, and then summer arrives and a lot of faculty start using it at once. To put this into perspective, since the beginning of the summer, all 128 cores have been running at 100% utilization. I submitted a batch of about 800 jobs about 10 days ago, and only about 300 of them have completed. To say that this is frustrating is an understatement. Thankfully, my colleague, Felipe Perrone, is letting me use his own 48-core system to get some of my jobs done in a timely manner. But that is not a solution for the long term.
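To put rough numbers on that backlog, here is a minimal Python sketch. It is only back-of-the-envelope arithmetic on the figures above, and it assumes the cluster keeps finishing jobs at the same average rate as over the last 10 days and that a laptop run stays at roughly 2 hours per job. The function names are hypothetical, chosen just for this illustration.

```python
def queue_estimate(submitted, completed, elapsed_days):
    """Return (jobs finished per day, days still needed for the rest).

    Assumes the observed average completion rate stays constant,
    which a shared, fully loaded cluster does not guarantee.
    """
    rate = completed / elapsed_days
    remaining = submitted - completed
    return rate, remaining / rate


def serial_days(jobs, hours_per_job):
    """Days needed to run every job back to back on a single machine."""
    return jobs * hours_per_job / 24


# Approximate figures from the post: 800 jobs submitted, 300 done in 10 days.
rate, days_left = queue_estimate(submitted=800, completed=300, elapsed_days=10)
laptop_days = serial_days(jobs=800, hours_per_job=2)

print(f"cluster throughput: {rate:.0f} jobs/day")
print(f"remaining wait at that rate: {days_left:.1f} days")
print(f"one job at a time on the laptop: {laptop_days:.1f} days")
```

Under those assumptions, the shared cluster is delivering about 30 jobs per day, the remaining 500 jobs would need roughly another 17 days, and running all 800 jobs one at a time on the laptop would take about 67 days.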

I met with our awesome IT support team, and we're now investigating some possible solutions to help with these summer surges in HPC demands. One solution that I am currently investigating is the use of a cloud-based compute server. Specifically, the first service I am investigating is Amazon Web Services Elastic Compute Cloud (EC2).

Getting a basic "free" system set up on Amazon EC2 was surprisingly easy. Without going through all of the details, it took me roughly a couple of hours to read through some documentation on their site and set up a simple, single-core Ubuntu Linux system with 8GB of storage. I was able to ssh into the system and scp my research folder and corresponding files from our local systems over to the cloud system. I started my program, and that was it! I now have my program running in the cloud. I can keep this one system running 24/7 for about a year, and assuming I don't start any more instances or use enormous amounts of disk space beyond this, it will stay free for the entire year. But clearly this is not an HPC system. How much would it cost to configure a true multi-core cluster in the cloud?

Amazon has a few different HPC options. Let's consider their "Cluster Compute Eight Extra Large" system. It boasts 60.5 GB of memory, 88 EC2 "Compute Units", 3370 GB of local instance storage, a 64-bit platform, and 10 Gigabit Ethernet. One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This specific configuration uses two Intel Xeon E5-2670 processors (eight-core "Sandy Bridge" architecture). That is a pretty reasonable compute instance for one researcher. Now suppose I wanted one of these systems entirely for my own work for the summer, three months to be exact. What would the cost be?

Let's use Amazon's calculator: http://calculator.s3.amazonaws.com/calc5.html. A single instance of this system configured "on demand" would cost $1756.80 for one entire month at 100% utilization. Multiplied by 3 months, that's about $5270 for the summer. In comparison, this same system configured as a "reserved" instance for an entire year, with "medium utilization", would cost $4541.28. Averaged out per month, the reserved configuration is clearly the better deal, even though my usage will not be consistent over the year. In fact, three months of on-demand use is more costly than the full-year reserved instance. If I'm willing to pay for a full three years, the cost only goes up to $6773.28. Averaged out, that is pretty good, and I don't have to worry about losing anything for the entire duration. All of my data, my programs, etc. are stored in the cloud.
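The comparison is easy to check with a few lines of Python. This is only a sketch that takes the calculator totals quoted above at face value (in particular, it treats the "reserved" figures as the complete cost for the term). The variable names are mine, not anything from Amazon's tools.

```python
# Totals quoted from the AWS calculator in this post (US dollars, 2012).
ON_DEMAND_PER_MONTH = 1756.80   # one instance, 100% utilization
RESERVED_1_YEAR = 4541.28       # "medium utilization" reserved, 1-year term
RESERVED_3_YEAR = 6773.28       # same, 3-year term

SUMMER_MONTHS = 3

# Cost of renting on demand for the summer only.
summer_on_demand = SUMMER_MONTHS * ON_DEMAND_PER_MONTH

# Reserved totals spread evenly over every month of the term.
reserved_1yr_monthly = RESERVED_1_YEAR / 12
reserved_3yr_monthly = RESERVED_3_YEAR / 36

# Months of full-time on-demand use that cost as much as a 1-year reservation.
break_even_months = RESERVED_1_YEAR / ON_DEMAND_PER_MONTH

print(f"on demand, 3 months:      ${summer_on_demand:,.2f}")
print(f"reserved 1 yr, per month: ${reserved_1yr_monthly:,.2f}")
print(f"reserved 3 yr, per month: ${reserved_3yr_monthly:,.2f}")
print(f"break-even vs. 1 yr:      {break_even_months:.2f} months")
```

With these figures, a summer of on-demand use comes to $5270.40, the one-year reservation averages $378.44 per month, the three-year reservation averages about $188.15 per month, and on-demand use overtakes the one-year reservation after roughly 2.6 months of full-time use.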

I have complete access to install / configure / maintain this entire system to my own liking for the needs of my own research.

I want to run some actual benchmarks to see how well an ECU measures up in practice.

For information about how AWS solutions can be used to address HPC needs, check out the following: http://aws.amazon.com/hpc-applications/.

As an aside, for those into data mining like myself, if you choose to go the Amazon EC2 path for compute time, you will have access to some Amazon-hosted public data sets: http://aws.amazon.com/publicdatasets

Finally, as you might know, Google just announced its own cloud compute service yesterday. That should only make things even cheaper in the near future. I have not seen any prices made available yet.


The research momentum continues (a.k.a. Thank you, Matt!)

It has been a while since I've written any updates here. I have become quite busy doing... wait for it... here it comes... RESEARCH! I have finally gained my momentum back, thanks to great work by a student of mine, Matthew Segar. Matt completed an honors thesis under my advisement, which subsequently won the Harold W. Miller Prize, an award given to one or two seniors for an exemplary honors thesis. His work focused on the development of a novel, probabilistic method for the assembly of next generation sequence data. Way to go, Matt! The award was well-deserved. I wish you the very best in your future endeavors as you continue to pursue your interests in bioinformatics. You will continue to make great achievements no matter what you do!