110510 Datamining

Summary of a book on data mining. See also http://slidesha.re/T8LEx

  1. Start with a table of data. Choice: (i) a general decision tree or (ii) a detailed one. One can get from (ii) to (i) by pruning.
  2. Now one could infer diagnostic rules through machine learning, and through expert domain knowledge.
  3. Statistics and machine learning converge.
  4. Bias: language bias, search bias, overfitting-avoidance bias.
  5. Central: concepts, instances, attributes
  6. Four basic styles of learning: classification learning, association learning, clustering, numeric prediction. Regardless of the style: what is to be learned is the concept, and what the learning scheme outputs is the concept description.
  7. Association rules differ from classification rules in two ways: they can "predict" any attribute, not just the class, and they can predict more than one attribute's value at a time. Thus there are many more association rules than classification rules.
  8. Denormalization (note: in my research and in the log files there is a relationship between the instances) can "add discoveries": apparent regularities that merely reflect the structure of the original database.
  9. Attribute types differ: nominal (e.g. sunny, overcast, rainy), ordinal (ordered, but no distance), interval (distances are meaningful, no true zero), ratio (true zero).
  10. ARFF files.
    % ARFF file for log file analysis DME (example)
    %
    @relation dmelogfiles

    @attribute outlook {sunny, rainy}
    @attribute temperature numeric
    @attribute description string
    % date values follow the format string, e.g. 2004-04-03T12:00:00
    @attribute today date "yyyy-MM-dd'T'HH:mm:ss"

    @data
    %
    % several thousand lines
    %
    sunny, 85, nice, 2004-04-03T12:00:00
    ....
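  11. Sparse data: in the sparse ARFF format a value that is left out counts as 0; a value that is really unknown must be written explicitly as ?.
  12. Interval and ratio data can be normalized (statistically: to N(0,1), i.e. mean 0 and standard deviation 1).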
  13. Data cleaning is essential.
  14. Propositional versus relational (comparing attributes with other attributes)
  15. Linear regression, regression trees and model trees.
  16. 1R (a one-level decision tree), Naive Bayes, decision trees: ID3 > C4.5
  17. Covering/rules vs. trees.
  18. PRISM: separate-and-conquer (covering rules) vs. divide-and-conquer (trees).
  19. p311: Time Series > http://davis.wpi.edu/~xmdv/weka/
    Time series in SPSS: http://www.let.leidenuniv.nl/history/RES/VStat/html/les7.html
    Or with R.
  20. So, for WEKA: clean up the CSV, perhaps add columns in Excel. Then import into WEKA and use J4.8 (?), WEKA's implementation of C4.5.
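
Sketch for items 10, 11, 12 and 20: a minimal way to turn cleaned rows into ARFF text, assuming the clean-up is done in Python rather than Excel. The function names (zscore, quote, to_arff) and the sample rows are my own illustration, not part of WEKA. Unknown values are written as ?, and the numeric column can be rescaled to mean 0 and standard deviation 1 first.

```python
import statistics


def zscore(values):
    """Rescale numbers to mean 0 and standard deviation 1.

    None marks an unknown value and is passed through unchanged.
    Assumes at least one known value in the list.
    """
    known = [v for v in values if v is not None]
    mean = statistics.mean(known)
    sd = statistics.pstdev(known)
    if sd == 0:
        # All known values are equal: map them to 0.0 instead of dividing by zero.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mean) / sd for v in values]


def quote(text):
    """Wrap a string value in single quotes, escaping backslashes and quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind); kind is 'numeric', 'string',
                or a list of nominal labels.
    rows:       list of lists, one value per attribute; None = unknown.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        spec = "{" + ", ".join(kind) + "}" if isinstance(kind, list) else kind
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        cells = []
        for (_name, kind), value in zip(attributes, row):
            if value is None:
                cells.append("?")           # unknown value
            elif kind == "string":
                cells.append(quote(value))  # strings may contain spaces
            else:
                cells.append(str(value))
        lines.append(", ".join(cells))
    return "\n".join(lines) + "\n"


# Hypothetical sample data in the spirit of the ARFF example in item 10.
attributes = [
    ("outlook", ["sunny", "rainy"]),
    ("temperature", "numeric"),
    ("description", "string"),
]
rows = [
    ["sunny", 85, "nice"],
    ["rainy", None, "wet day"],
]
arff_text = to_arff("dmelogfiles", attributes, rows)
print(arff_text)
```

The result can be saved as a .arff file and opened in WEKA. Date attributes are left out here to keep the sketch short.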