110510 Datamining
Summary of book on Data Mining. See also http://slidesha.re/T8LEx
- Start with a table of data. Choice: (i) a general decision tree or (ii) a detailed one. One can get from (ii) to (i) by pruning.
- One could then infer diagnostic rules through machine learning and through expert domain knowledge.
- Statistics and machine learning converge.
- Bias: language bias, search bias, overfitting-avoidance bias.
- Central: concepts, instances, attributes
- Four basic styles of learning: classification learning, association learning, clustering, numeric prediction. Regardless of the style, the thing to be learned is the concept and the output of the learning scheme is the concept description.
- Association rules differ from classification rules in two ways: they can "predict" any attribute, not just the class, and they can predict more than one attribute's value at a time. Thus, more assoc. rules than classif. rules.
- Denormalization (note: in my research and in the log files there is a relationship between the instances) > can "add discoveries".
- Levels of measurement: nominal (sunny, overcast, rainy), ordinal (ordered, no distance), interval (distances meaningful, no true zero), ratio (true zero).
- ARFF files.
% ARFF file for log file analysis DME (example)
%
@relation dmelogfiles
@attribute outlook {sunny, rainy}
@attribute temperature numeric
@attribute description string
% date format, e.g. 2004-04-03T12:00:00
@attribute today date "yyyy-MM-dd'T'HH:mm:ss"
@data
%
% several thousand lines
%
sunny, 85, nice, 2004-04-03T12:00:00
....
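The ARFF layout above (% comments, a header of @relation and @attribute lines, then rows after @data) can be read with a few lines of Python. This is only a minimal sketch: the function name parse_arff is hypothetical, and it assumes simple unquoted values without embedded commas and no sparse rows. Real ARFF files can contain both.

```python
import io


def parse_arff(text):
    """Parse a minimal ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []  # (name, type) pairs in declaration order
    rows = []
    in_data = False
    for raw in io.StringIO(text):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank lines and % comments carry no data
        lower = line.lower()
        if in_data:
            # Assumption: no quoted values containing commas.
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, type_ = line.split(None, 2)
            attributes.append((name, type_))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


example = """\
% ARFF file for log file analysis DME (example)
@relation dmelogfiles
@attribute outlook {sunny, rainy}
@attribute temperature numeric
@attribute description string
@attribute today date "yyyy-MM-dd'T'HH:mm:ss"
@data
sunny, 85, nice, 2004-04-03T12:00:00
"""

relation, attributes, rows = parse_arff(example)
print(relation, len(attributes), rows)
```

The header gives one (name, type) pair per attribute, and each data row has one value per attribute in the same order.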
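- Sparse data: values of 0 are simply left out; a value that is really missing (not there) is written as ?
- Interval and ratio data can be normalized (statistically: standardized to mean 0 and standard deviation 1, N(0,1)).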
- Data cleaning is essential.
- Propositional versus relational (comparing attributes with other attributes)
- Linear regression, regression tree and model trees.
- 1R (one-level decision trees), Naive Bayes, ID3 > C4.5
- Covering/rules vs. trees.
- PRISM: separate-and-conquer vs. divide-and-conquer.
- p311: Time Series > http://davis.wpi.edu/~xmdv/weka/
Time series in SPSS: http://www.let.leidenuniv.nl/history/RES/VStat/html/les7.html Or with R.
- So: for WEKA. Clean up the csv, perhaps add columns in Excel. Import in WEKA and use J4.8 (WEKA's C4.5 implementation, class J48).
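The clean-up step in the last note could also be scripted instead of done by hand. The sketch below is one possible approach, not a fixed recipe: the function name csv_to_arff is hypothetical, and it assumes a rectangular CSV with a header row, attribute names and values without spaces or commas, and only numeric or nominal columns. Empty cells are written as ? (missing).

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text to ARFF text (numeric or nominal attributes only)."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    # Clean-up: trim whitespace, skip blank lines, mark empty cells as missing.
    rows = [[cell.strip() or "?" for cell in row] for row in reader if row]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation]
    for i, name in enumerate(header):
        values = [row[i] for row in rows if row[i] != "?"]
        if values and all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list the distinct values in sorted order.
            distinct = ", ".join(sorted(set(values)))
            lines.append("@attribute %s {%s}" % (name, distinct))
    lines.append("@data")
    lines.extend(", ".join(row) for row in rows)
    return "\n".join(lines) + "\n"


raw = "outlook, temperature\nsunny, 85\nrainy, \n\nsunny, 70\n"
arff = csv_to_arff(raw, "dmelogfiles")
print(arff)
```

Guessing the attribute type from the values is a convenience for a first look; for real log files it is safer to check the generated header by hand before importing.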