Tuesday, October 09, 2012

Big data, big jobs?

See the article Big data, big jobs? in a recent issue of ComputerWorld.

I teach an undergraduate course in data mining. It is definitely one of my favorite courses to teach.  I've been telling students over the past couple of years that this area is a huge opportunity in the job market, and will continue to blossom for years to come.

Yet, even as this article suggests, data mining is not for everyone.

Here are my observations: Unfortunately, I am not seeing enough students ready or willing to dive into the statistical math required to really understand how to get the most interesting information out of the data. Being a good programmer is not enough for getting into data mining. Knowing a programming language that is popular for data mining tasks (such as R) is better. However, what really matters is knowing how to analyze, visualize, and extract relevant information from large data sets. To do this, you need to know what models to apply when, and why, and what to expect out of them. You need to know how to evaluate model performance, and select appropriate parameters to improve it. You need to understand causes of poor performance (e.g. noisy data, lack of preprocessing, etc.). I see too many students blindly meandering around the data mining landscape with programs such as SAS, SPSS, or even Weka, not really understanding why a particular model is behaving the way it is. (BTW, Weka is an absolutely wonderful piece of software for exploring data mining!)
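
To make "evaluate model performance, and select appropriate parameters" a little more concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from a real course or data set: the data is synthetic (a 1-D feature with 15% of the labels deliberately flipped to mimic noise), the model is a bare-bones k-nearest-neighbor classifier, and the function names are made up for this example. It uses 5-fold cross-validation to compare a few values of k.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def cross_val_accuracy(data, k, folds=5):
    """Mean accuracy of k-NN, holding out each of `folds` slices in turn."""
    scores = []
    for f in range(folds):
        test = data[f::folds]                                   # held-out slice
        train = [p for i, p in enumerate(data) if i % folds != f]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / folds

# Synthetic toy data (assumption): label is 1 when x > 0.5, then 15% of the
# labels are flipped to simulate noisy data.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.15:
        y = 1 - y
    data.append((x, y))

# Compare a few candidate parameter values on held-out data only.
results = {k: cross_val_accuracy(data, k) for k in (1, 5, 15, 31)}
best_k = max(results, key=results.get)

for k, acc in results.items():
    print(f"k={k:2d}  cross-validated accuracy={acc:.3f}")
print("best k:", best_k)
```

The point of the sketch is the workflow, not the model: accuracy is always measured on data the model did not see, and the parameter is chosen by that measurement rather than by guesswork.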

The best foundation you can give yourself for a career in data mining is a solid grounding in statistics and probability. Then, take a course in data mining! I find that failure to fully embrace these important topics makes it difficult for a student to understand the strengths and weaknesses of the wide range of statistical models and algorithms for inference and induction in data mining tasks. More importantly -- you will likely miss many hidden gems buried in the data, and this is what your potential employer is after.

Sunday, August 05, 2012

How Software Updates Are Destroying the Stock Market - Businessweek

This is a sobering reminder of how far we have yet to go in this modern era of software development and engineering. We continue to strive to release some great development tools, and yet it is practically impossible to set up tests to capture every possible event that might raise exceptions....

Friday, July 27, 2012

High school student wins Google Science Fair with an artificial neural net to predict breast tumor malignancy

You can view her presentation and information related to the project here. I am thoroughly impressed. Congratulations Brittany on a job well done! You are going to go extremely far!

Now, I have two questions:

  1. How can I recruit her to work with me on my research?  The reality is that Brittany is most likely MIT or CMU bound, though she will be able to choose whatever school she really wants, and will most likely get a full scholarship on top of the awards she has already earned. Awesome.
  2. How do I get in touch with her parents to see what they did with this amazing girl as she was growing up? My son is almost 4 yrs old now. And, right now, he is all about Thomas the Tank Engine! But, the young mind is so malleable, impressionable... they are little sponges! Do I start giving him some Java and Python books now? (Yes, I'm kidding... I'll wait until he is 5 yrs old. :-) Do I get him started with some neat robotics project? Teach him how to code in R? Yet, I want him to enjoy his life as a child. This is more important to me. It goes by so fast! There must be a balance, right? I suspect that the voice of encouragement and support from the parental units is probably the most critical part of any child's success. But, is there something more? Perhaps feeding them more brain food? Keeping them away from the TV? I wonder what steps we parents need to take to maximize our chances of raising more Brittanys for this world? What did they do to inspire such motivation and creativity in her mind at such a young age?

    To me, it sounds like this is an opportunity for data mining to answer these questions... if I only had a large repository of data...
Again, congratulations Brittany on a job well done! Awesome job, girl. Your parents must be extremely proud of you.

Saturday, June 30, 2012

Research in the cloud


I'm continuing to work on methods that investigate large amounts of biological sequence data. I have a new method in the works for assembly of next generation sequence data. This was work that culminated from a rather successful honors thesis by Matt Segar, '12. I'm also delving into mining large amounts of HIV sequence data to sift for possible patterns, mutations, motifs, etc. that are related to different levels of infection. This is work that has just started with Charles Cole, '13. I have another project that is a bit of a stretch for me. It has involved mining for descriptive and predictive patterns in Twitter data related to the stock market. This is work that was started by Marc Burian, '12, a student in my data mining class a couple of semesters ago. This work is actually turning out some interesting results. The papers for two of these projects are in the works as I write this.

Now, I have a different problem. I am facing the reality of being a research-motivated faculty member in a non-R1 institution. At Bucknell, we strive for excellence in teaching and research. We work hard to provide an excellent education to our students, one that will maximize their chances of success after graduation. We believe that research needs to be a part of our overall career, as it makes us better teachers.  In fact, one of the reasons I chose to come to Bucknell was because of their insistence that we do both.  However, as I've noted in the past, I'm increasingly aware of the difficulties of doing both simultaneously. (I understand now, more than I ever have, why so many faculty at R-1 universities have their grad students doing the teaching of undergraduate courses. They don't have time for both!) Fortunately, I'm slowly discovering the sweet realization that undergraduate research can be successful, with the right students, and the right mindset. I've had an extraordinary number of students come to me with interest in completing some interesting research projects related to my work. I did not expect to see that type of interest at the undergrad level. Bucknell really has some great students.

OK, so my main point is not so much on completing successful undergraduate research, but on completing research that requires high performance computational resources at a non-R1 university. We currently have one primary cluster of 128 cores, and roughly 4-8GB of memory per processor, depending on the configuration of the system. Now that my research is underway, I have an extraordinary amount of data to mine through, and a lot of parameters in my models to evaluate in order to understand what parameter values will give me the best results. I have thousands of jobs to execute. On my quad-core MacBook with 8GB of memory, one job takes roughly 2 hours to complete. Surely, executing one job at a time is, well, as my 4 yr old son would tell me, that's just "silly!"

At an institution like Bucknell, with faculty that must be excellent teachers during the semester, there honestly is not much research that gets done during the semesters. The vast majority of the faculty complete their hard-core research during the summer. Our cluster sits largely unused during the semester, but once summer arrives, a lot of faculty are using it at once. To put this into perspective, since the beginning of the summer, all 128 cores have been running at 100% utilization. I submitted a batch of about 800 jobs about 10 days ago, and only about 300 of them have completed. To say that this is frustrating is an understatement. Thankfully, my colleague, Felipe Perrone, is letting me use his own 48-core system to get some of my jobs done in a timely manner. But, that is not a solution for the long term.
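
A back-of-the-envelope sketch in Python puts rough numbers on the situation. It rests on two loud assumptions: that a job takes about 2 hours on a cluster core (the figure quoted above is for the MacBook and may not carry over), and that jobs run one per core. The function names are made up for illustration.

```python
import math

def ideal_wall_hours(n_jobs, hours_per_job, cores):
    """Wall-clock hours if `cores` run jobs in parallel, one job per core."""
    waves = math.ceil(n_jobs / cores)   # number of rounds needed to cover all jobs
    return waves * hours_per_job

def effective_cores(jobs_done, hours_per_job, elapsed_hours):
    """Average number of cores that were actually busy on these jobs."""
    return jobs_done * hours_per_job / elapsed_hours

HOURS_PER_JOB = 2.0   # assumption: the MacBook timing holds on the cluster

serial_days = 800 * HOURS_PER_JOB / 24                   # one job at a time
ideal_hours = ideal_wall_hours(800, HOURS_PER_JOB, 128)  # whole cluster, no sharing
share = effective_cores(300, HOURS_PER_JOB, 10 * 24)     # what the queue delivered

print(f"one at a time:        {serial_days:.1f} days")
print(f"all 128 cores:        {ideal_hours:.0f} hours")
print(f"effective core share: {share:.1f} cores")
```

Under those assumptions, 800 jobs would take about 67 days run one at a time and about 14 hours with all 128 cores free, while 300 jobs finished in 10 days works out to an average of only about 2.5 cores' worth of throughput.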

I met with our awesome IT support team, and we're now investigating some possible solutions to help with these summer surges in HPC demands. One solution that I am currently investigating is the use of a cloud-based compute server. Specifically, the first service I am investigating is Amazon Web Services Elastic Compute Cloud (EC2).

It was surprisingly easy to get a basic "free" system set up on Amazon EC2. Without going through all of the details, it took me roughly a couple of hours to read through some documentation on their site and set up a simple, single-core Ubuntu Linux system with 8GB of storage. I was able to ssh into the system, and scp'd my research folder and corresponding files from our local systems over to the cloud system. I started my program, and that was it! I now have my program running in the cloud. I can keep this one system running 24/7 for about a year, and assuming I don't start any more instances of the system or use enormous amounts of disk space beyond this, this will stay free for the entire year. But, clearly this is not an HPC system. How much would it cost to configure a true multi-core cluster in the cloud?

Amazon has a few different HPC options. Let's consider their "Cluster Compute Eight Extra Large" system. It boasts 60.5 GB memory, 88 EC2 "Compute Units", 3370 GB of local instance storage, a 64-bit platform, and 10 Gigabit Ethernet. One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This specific configuration uses 2 x Intel Xeon E5-2670 (eight-core "Sandy Bridge" architecture). This is a pretty reasonable compute instance for one researcher. So, let's suppose I wanted one of these systems entirely for my own work for the summer -- 3 months, to be exact. What would the cost be?

Let's use Amazon's calculator -- http://calculator.s3.amazonaws.com/calc5.html. A single instance of this system, configured "on demand", would cost $1756.80 for one entire month at 100% utilization. Multiplied by 3 months, that's about $5270 for the summer. In comparison, this same system configured as a "reserved" instance for an entire year, with "medium utilization", would cost $4541.28. Of course, averaged out per month, the reserved configuration is surely better, even though my usage will not be consistent over the year. The on-demand three-month configuration is actually more costly than the full-year reserved instance. If I'm willing to pay for a full three years, the cost only goes up to $6773.28. Averaged out, that is pretty good, and I don't have to worry about losing anything for the entire duration. All of my data, my programs, etc... are all stored in the cloud.
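
The comparison is easy to check with a few lines of Python. This sketch uses only the 2012 calculator figures quoted above and takes them at face value; in particular, it assumes each reserved total already covers everything for the term, which is how the quotes read but is not verified here. The plan labels and helper name are made up for illustration.

```python
# Prices quoted above from Amazon's calculator (2012), in US dollars.
ON_DEMAND_PER_MONTH = 1756.80   # on demand, 100% utilization
RESERVED_1YR_TOTAL = 4541.28    # reserved, "medium utilization", 1 year
RESERVED_3YR_TOTAL = 6773.28    # reserved, 3 years

def per_month(total, months):
    """Average monthly cost of a plan, rounded to cents."""
    return round(total / months, 2)

summer_on_demand = round(ON_DEMAND_PER_MONTH * 3, 2)

# Months of on-demand use at which the 1-year reserved plan becomes cheaper.
break_even_months = RESERVED_1YR_TOTAL / ON_DEMAND_PER_MONTH

plans = {
    "on demand, 3 months": (summer_on_demand, 3),
    "reserved, 1 year": (RESERVED_1YR_TOTAL, 12),
    "reserved, 3 years": (RESERVED_3YR_TOTAL, 36),
}

for name, (total, months) in plans.items():
    avg = per_month(total, months)
    print(f"{name:20s} total ${total:8.2f}   avg ${avg:7.2f}/month")
print(f"break-even vs. 1-year reserved: {break_even_months:.2f} months")
```

On these numbers the summer on demand comes to $5270.40, the one-year reserved plan averages $378.44 per month, the three-year plan averages $188.15 per month, and on-demand use overtakes the one-year reserved price after roughly 2.6 months.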

I have complete access to install / configure / maintain this entire system to my own liking for the needs of my own research.

I want to run some actual benchmarks to see how well an ECU measures up in practice.

For information about how AWS solutions can be used to address HPC needs, check out the following: http://aws.amazon.com/hpc-applications/.

As an aside, for those into data mining like myself, if you choose to go the Amazon EC2 path for compute time, you will have access to some Amazon-hosted public data sets: http://aws.amazon.com/publicdatasets

Finally, as you might know, Google just announced their compute server yesterday. That should only make things even cheaper in the near future. I have not seen any prices made available yet.


The research momentum continues (a.k.a. Thank you, Matt!)

It has been a while since I've written any updates here. I have become quite busy doing... wait for it... here it comes... RESEARCH! I have finally gained my momentum back, thanks to great work by a student of mine, Matthew Segar. Matt completed an honors thesis under my advisement, which subsequently won the Harold W. Miller Prize, an award given to one or two seniors for an exemplary honors thesis. His work focused on the development of a novel, probabilistic method for the assembly of next generation sequence data. Way to go, Matt! The award was well-deserved. I wish you the very best in your future endeavors as you continue to pursue your interests in bioinformatics. You will continue to make great achievements no matter what you do!

Thursday, April 05, 2012

That Dalai Lama Quotation, and the Historical Sceptic

There has been an excellent "quote" (I'm putting that word in literal quotes for a reason) going around the blogs lately. It is a supposed quote from the Dalai Lama. Oh, ahem -- what I mean is that there is absolutely no evidence that he ever said it.  But if you post something to Facebook, Twitter, Wikipedia, LinkedIn, etc. then it must be true, right? (See NT Blog: That Dalai Lama Quotation, and the Historical Sceptic) Regardless, whoever actually generated this quote is awesome:
"The Dalai Lama, when asked what surprised him most about humanity, answered, ‘Man. Because he sacrifices his health in order to make money. Then he sacrifices money to recuperate his health. And then he is so anxious about the future that he does not enjoy the present; the result being that he does not live in the present or the future; he lives as if he is never going to die, and then dies having never really lived.’"
I rarely, if ever, post things not directly related to CS on this blog. So, why did I post this? For two reasons:

  1. Stop trusting everything you read online. Learn how to use the web to investigate sources of information before spreading it.
  2. It's a really thought-provoking quote. 'Nuff said.
Reflect on your own life. Adjust your priorities. Determine what is really important in your life, and start living it now. All you have is now, and despite what your spouse, friend, relative, girl/boyfriend, parents, peers, blog, your dog, news source, or your professor will tell you, now is good. Be thankful for it.

Now, I need to eat my own words.

The next DARPA challenge -- a Humanoid


This is a video showing some of the more recent advances in the development of humanoids. Boston Dynamics has been doing some great work in this field. And, it represents the next DARPA Grand Challenge.  See the Wired article here that documents a bit more about the challenge.

This video reminds me a bit of The Terminator. Yes, it's kinda cool, yet kinda creepy at the same time. But, entirely fascinating. Be sure to check out some of the other robots these guys are working on. Fantastic work!