I have been on a data mining kick lately due to the immense amount of information that sites like Facebook, Twitter and FriendFeed gather every day. I have always loved data mining, mainly because there are a lot of answers waiting to be found in large amounts of data. That being said, I know many of my readers probably do not know much about data mining, so I figured a really gentle introduction might be required. Let’s start with the Wikipedia definition of data mining:

Data mining is the process of extracting hidden patterns from data. As more data is gathered data mining is becoming an increasingly important tool to transform this data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

That definition is fairly vague. In practice, data mining is more of an aggregate field within artificial intelligence: it borrows techniques from narrower specializations, such as machine learning, and applies them to large amounts of data.

To give you an idea of what we are really talking about, let’s look at the supermarket example. Many supermarkets now offer discount cards that give you a small discount on your groceries and let the store track what you buy. The supermarkets look at what you buy on each trip and determine patterns of behavior. For example, you have probably noticed that the entrance of your supermarket changes somewhat frequently. This happens because many people come in for only a few items, so the front of the store now has basic items like milk, juice and eggs. It also has current seasonal items, and maybe some prepackaged fruits and vegetables. This layout comes from mining what people purchased during their trips. Supermarkets found that when people buy only a small number of items, they are typically “core” items like milk or seasonal items like candy near Halloween. They also noticed that these shoppers tend to buy things like small packages of baby carrots. This is why the entrance to your supermarket changes from time to time: the store is optimizing the items there so that you purchase more, even if you only planned to buy milk and eggs.
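To make the supermarket idea a little more concrete, here is a minimal sketch in Python of counting what shows up in small shopping trips. The trips are made up for illustration, and the function name `small_trip_counts` and the `max_items` cutoff are my own assumptions, not anything a real supermarket is known to use.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping trips for illustration only; real loyalty-card data
# would be far larger and messier.
trips = [
    ["milk", "eggs"],
    ["milk", "baby carrots"],
    ["milk", "eggs", "candy"],
    ["eggs", "baby carrots", "milk"],
    ["milk", "eggs", "bread", "chicken", "rice", "pasta", "cheese"],
]

def small_trip_counts(trips, max_items=3):
    """Count items and item pairs across trips with at most max_items distinct items."""
    item_counts = Counter()
    pair_counts = Counter()
    for trip in trips:
        items = sorted(set(trip))  # de-duplicate and fix an order for the pairs
        if len(items) > max_items:
            continue  # skip the big weekly shops
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts

item_counts, pair_counts = small_trip_counts(trips)
print(item_counts.most_common(3))
print(pair_counts.most_common(3))
```

Items and pairs that keep turning up in these quick trips are the kind of thing a store might consider placing near the entrance. Algorithms such as Apriori generalize this counting idea to larger item sets.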

The previous example is one of the reasons that data mining is closer to the everyday consumer than it used to be. Another reason is the internet. Every day, websites collect huge amounts of data about what we read, purchase and share. Given that many of my readers are fairly technical, the question becomes: how do you do some data mining, and where do you find more information?

One place to start is the KDnuggets newsletter. It may be a bit much for people wanting some light reading, but it does highlight some really good sources of information. The current issue mentioned a recently published book about the top 10 data mining algorithms. I have not read the book or even looked at its table of contents, but the algorithms it discusses are:

  • C4.5 – a decision tree algorithm that can be used for classification and prediction.
  • k-Means – mostly used as a clustering algorithm.
  • Support Vector Machines (SVM) – a supervised learning method used for classification and regression.
  • Apriori – an algorithm for learning association rules.
  • Expectation-Maximization (EM) – I am not familiar with this, but it seems like a statistical method for estimation.
  • PageRank – yes, the Google PageRank algorithm.
  • AdaBoost – Adaptive boosting is a “meta-learning” algorithm and is used to improve the performance of other algorithms.
  • k-Nearest Neighbors (kNN) – a method typically used for classification.
  • Naive Bayes – a statistical classification method used heavily in spam detection.
  • Classification and Regression Trees (CART) – basically a classification tree algorithm that has been expanded upon in various ways.
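To show how small some of these algorithms are at their core, here is a rough sketch of k-Nearest Neighbors in plain Python. It is not taken from the book; the shopping-trip data, the labels and the function name `knn_predict` are all made up for illustration.

```python
import math
from collections import Counter

# Made-up training data: (number of items, total spent) with a label.
train = [
    ((2, 8.0), "quick trip"),
    ((3, 12.0), "quick trip"),
    ((4, 15.0), "quick trip"),
    ((25, 110.0), "weekly shop"),
    ((30, 140.0), "weekly shop"),
    ((22, 95.0), "weekly shop"),
]

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Sort by Euclidean distance to the query and keep the k nearest points.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(train, (5, 20.0)))
print(knn_predict(train, (28, 120.0)))
```

In practice you would usually scale the features first so that one column (here, the amount spent) does not dominate the distance, but the voting idea stays the same.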

This is not a full catalog of data mining methods, but it is definitely a good sampling to get started with. There are other good places to look for data mining information as well. The ACM has a special interest group, SIGKDD, that has published a proposed curriculum for data mining. The AAAI also has an AITopics wiki where you can browse various topics in artificial intelligence. I also subscribe to a few data mining related blogs.

Granted, this is a lot of high-level information and a bunch of hand waving, but a real introduction to data mining takes around 300 pages in book form. If any of you want more detailed information, just let me know in the comments, and hopefully I can give a more thorough treatment of a specific algorithm or topic.
