Wednesday, May 05, 2010

N.Y. bomb plot highlights limitations of data mining - Computerworld

The field of data mining is a huge interest of mine. Like most researchers in computational methods, I always keep my eyes and ears open for opportunities to apply these algorithms. Consider the use of data mining to help identify terrorists. I am a little surprised that the government bought into data mining as a key solution for this task. (OK, let's be honest -- I'm really not surprised, but that is for reasons unrelated to this article.) For the most part, I agree with what Bruce Schneier said: you need a well-defined signal, profile, or pattern in order to extract the item of interest from the data.

The key to successful data mining is, of course, successful learning. The question becomes: how can a computer learn patterns of interest from enormous amounts of data when there are so few real examples to learn from? To distinguish terrorist activity from the overabundant supply of non-terrorist activity, these algorithms need positive examples, and there are (thankfully) very few each year. The most they can do is learn a model of the typical behavior of the average citizen and flag anyone whose pattern of behavior does not fit that model. Another common approach is to provide fictitious examples that "domain experts" think would fit terrorist activity. The NSA is likely providing examples of what it believes are patterns that led to known activity in the past. Whatever the method, the aim of a terrorist is to fit the model of normal behavior, right? They are only successful if they remain completely inconspicuous. Thus, the problem really is the equivalent of searching for a "needle in a haystack" at best. You will undoubtedly deal with a high rate of false positives, which carries an enormous cost. Or you turn the sensitivity of your algorithm down, but then you face the potentially high cost of a false negative, such as what happened this past week. Where is the "sweet" spot here? (That is the job of risk management!) This is a very difficult problem, indeed.
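To put rough numbers on that trade-off, here is a minimal sketch in Python. Every figure in it (the population size, the number of real plotters, the detection rates and the false-alarm rates) is a made-up assumption for illustration only, not data from the article or from any real program. The function name `flagged_breakdown` is likewise just something I chose for this example.

```python
def flagged_breakdown(population, true_cases, sensitivity, false_positive_rate):
    """Return (true_positives, false_positives, precision) for a screening rule.

    sensitivity: fraction of real cases the rule flags.
    false_positive_rate: fraction of innocent people the rule flags anyway.
    precision: fraction of flagged people who are real cases.
    """
    innocents = population - true_cases
    true_positives = true_cases * sensitivity
    false_positives = innocents * false_positive_rate
    flagged = true_positives + false_positives
    precision = true_positives / flagged if flagged else 0.0
    return true_positives, false_positives, precision


# Hypothetical scenario: 300 million people, 100 real plotters.
POPULATION = 300_000_000
TRUE_CASES = 100

# Two hypothetical operating points: an aggressive rule and a toned-down one.
settings = [
    ("aggressive", 0.99, 0.001),     # catches almost everyone, many false alarms
    ("toned down", 0.50, 0.00001),   # far fewer false alarms, misses half
]

for name, sensitivity, fpr in settings:
    tp, fp, precision = flagged_breakdown(POPULATION, TRUE_CASES, sensitivity, fpr)
    missed = TRUE_CASES - tp
    print(f"{name}: caught={tp:.0f} missed={missed:.0f} "
          f"false alarms={fp:.0f} precision={precision:.4%}")
```

Under these assumed numbers, the aggressive rule buries its handful of real hits under hundreds of thousands of false alarms, while the toned-down rule cuts the false alarms to a few thousand at the price of missing half of the real cases. Neither setting looks like a sweet spot, which is the whole difficulty.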

I don't think data mining should be disregarded as a potentially useful tool in this field. If nothing else, this represents an area where researchers still have much to learn, so to speak.

On a related note, I'm intrigued as to what these nearly 200 data mining programs are that the federal government has invested in. As more of the world around us goes digital, data will continue to pile up higher and deeper. There is certainly no dearth of opportunities for data mining! Despite this, I think I'll stick with bioinformatics for the time being. A single strand of DNA offers plenty of interesting potential for various algorithms. Consider the opportunities of sifting for informative blocks across hundreds or thousands of DNA sequences from thousands of species... sounds like a needle in a haystack again!
