Thoughts on Data Mining
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information (see prior blogs including The Data Information Hierarchy series). The term is overused and conjures impressions that do not reflect the true state of the industry. Knowledge Discovery from Databases (KDD) is more descriptive and not as misused – but the base meaning is the same.
Nevertheless, this definition is very general and does not convey the different aspects of data mining / knowledge discovery.
The basic types of Data Mining are:
- Descriptive data mining, and
- Predictive data mining
Descriptive Data Mining generally seeks groups, subgroups and clusters. Algorithms are developed that draw associative relationships from which actionable results may be derived (e.g. a snake with a diamond-shaped head should be considered poisonous).
Generally, a descriptive data mining result will appear as a series of if – then – elseif – then … conditions. Alternatively, a scoring system may be used, much like the self-assessment quizzes found in some magazines. Regardless of the approach, the end result is a clustering of the samples with some measure of quality.
Predictive Data Mining then analyzes previous data to derive a prediction of the next outcome. For example: newly incorporated businesses tend to look for credit card merchant solutions. This may seem obvious, but someone had to discover this tendency – and then exploit it.
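To make the two forms concrete, here is a minimal sketch of the same descriptive result written first as an if / elif chain and then as a point score. The feature names (head_shape, has_rattle, pupil_shape), the point values and the threshold are all made up for illustration; they are not real herpetology and not taken from any dataset.

```python
# Hypothetical features, points and threshold, for illustration only.

def classify_by_rules(snake):
    """Descriptive result expressed as an if / elif / else chain."""
    if snake["head_shape"] == "diamond":
        return "treat as poisonous"
    elif snake["has_rattle"]:
        return "treat as poisonous"
    else:
        return "unknown"


def classify_by_score(snake, threshold=2):
    """The same idea as a point score, like a self-assessment quiz."""
    score = 0
    if snake["head_shape"] == "diamond":
        score += 2
    if snake["has_rattle"]:
        score += 2
    if snake["pupil_shape"] == "slit":
        score += 1
    return "treat as poisonous" if score >= threshold else "unknown"
```

The score version degrades more gracefully when a single feature is missing or noisy, while the rule chain is easier to read aloud and audit.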
Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature: 1) massive data collection, 2) powerful multiprocessor computers, and 3) data mining algorithms (http://www.thearling.com/text/dmwhite/dmwhite.htm).
Kurt Thearling identifies five types of data mining (definitions taken from Wikipedia):
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. If in practice decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a probability model as a best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.
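As a rough sketch of the decision-analysis use described above, the snippet below "rolls back" a small tree: chance nodes average their children by probability, and decision nodes pick the branch with the highest expected value. The tuple-based node layout and the example payoffs are assumptions chosen for this illustration, not a standard format.

```python
def expected_value(node):
    """Roll back a decision tree and return its expected value.

    Assumed node layout (hypothetical, for illustration):
      ("outcome", value)
      ("chance", [(probability, child), ...])
      ("decision", {label: child, ...})
    """
    kind = node[0]
    if kind == "outcome":
        return node[1]
    if kind == "chance":
        return sum(p * expected_value(child) for p, child in node[1])
    if kind == "decision":
        return max(expected_value(child) for child in node[1].values())
    raise ValueError(f"unknown node kind: {kind!r}")


def best_choice(decision_node):
    """Return the label of the branch with the highest expected value."""
    branches = decision_node[1]
    return max(branches, key=lambda label: expected_value(branches[label]))


# Made-up payoffs: launching is a coin flip, holding is a sure small gain.
tree = ("decision", {
    "launch": ("chance", [(0.5, ("outcome", 100)), (0.5, ("outcome", -40))]),
    "hold": ("outcome", 10),
})
```

With these made-up numbers, best_choice(tree) picks "launch", because its expected value (30) beats the sure 10 from holding.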
Nearest neighbour or shortest distance is a method of calculating distances between clusters in hierarchical clustering. In single linkage, the distance between two clusters is computed as the distance between the two closest elements in the two clusters.
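The single-linkage distance described above can be sketched in a few lines, assuming points are numeric tuples and distance is Euclidean (the definition itself does not fix a metric):

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters = distance between their closest pair."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)
```

For example, with cluster_a = [(0, 0), (1, 0)] and cluster_b = [(4, 0), (9, 0)], the closest pair is (1, 0) and (4, 0), so the clusters are 3 units apart.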
The term neural network was traditionally used to refer to a network or circuit of biological neurons. The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes.
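To give a feel for what one artificial neuron does, here is a minimal sketch: a weighted sum of the inputs plus a bias, squashed through a sigmoid. The sigmoid activation is an assumption made for this example; other activation functions are common too.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum plus bias, then a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

A network is many of these wired together, with the weights and biases adjusted during training rather than set by hand.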
Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.
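A very small illustration of extracting rules from observations: for one chosen feature, predict the most common outcome seen for each of its values. This is a deliberately simplified, OneR-style sketch, not the method any particular tool uses, and the observations are invented.

```python
from collections import Counter, defaultdict


def one_rule(rows, feature, target):
    """For each value of `feature`, pick the most frequent `target` value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


# Invented observations, for illustration only.
observations = [
    {"head": "diamond", "poisonous": "yes"},
    {"head": "diamond", "poisonous": "yes"},
    {"head": "diamond", "poisonous": "no"},
    {"head": "round", "poisonous": "no"},
    {"head": "round", "poisonous": "no"},
]
rules = one_rule(observations, "head", "poisonous")
```

Here rules reads as "if head is diamond then poisonous = yes, else if head is round then poisonous = no", which is exactly the if – then form described earlier.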
Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
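There are many clustering algorithms. As one example, the sketch below is a bare-bones k-means, assuming numeric tuples, Euclidean distance, and starting centroids supplied by the caller (real implementations usually choose the starting points more carefully):

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids
```

Given the points (1, 1), (1, 2), (8, 8) and (9, 8) with starting centroids (0, 0) and (10, 10), the first two points end up in one cluster and the last two in the other.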