Notes on running RandomForest/J48 from the weka command line (creating a model file)
We use the Random Forest algorithm to compute a confidence value for each prediction. The tool is WEKA. Since it runs inside a daily batch job, we don't use the GUI; WEKA is driven from shell scripts and a Java application.
This memo is a reminder of how to use WEKA's Random Forest engine from a batch job (shell script).
The weka version is 3.7.3. The 3.6 series appears to lack the option used in the examples below for choosing the output format (CSV, XML, etc.) and fails with an error.
The overall flow has two steps:
- Feed the training data (file) to the RandomForest engine and output a prediction model (file)
- Feed the prediction model (file) plus a data file whose class is unknown, and output the class and its confidence
If you don't want to create a model file, see here.
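The two steps above are easy to wrap in a small script for a daily batch. The sketch below is mine, not the original batch: the weka.jar path is an assumption, and it only prints the command lines (a dry run) so it can run without a WEKA install.

```shell
#!/bin/sh
# Dry-run sketch of the two-step flow. The weka.jar path is an assumption.
WEKA_CP="/opt/weka-3-7-3/weka.jar"
TRAIN="iris.arff"        # training data (input)
MODEL="iris.model"       # prediction model (output of step 1, input of step 2)
TEST="iris-test.arff"    # unknown-class data (input)

# Step 1: training data -> prediction model file
TRAIN_CMD="java -cp $WEKA_CP weka.classifiers.trees.RandomForest -t $TRAIN -i -k -d $MODEL"

# Step 2: model + unknown-class data -> class and confidence, as CSV
PREDICT_CMD="java -cp $WEKA_CP weka.classifiers.trees.RandomForest -classifications weka.classifiers.evaluation.output.prediction.CSV -l $MODEL -T $TEST"

# Echo only, so the sketch runs anywhere; in the real batch you would
# execute the commands instead, e.g. eval "$TRAIN_CMD" && eval "$PREDICT_CMD"
echo "$TRAIN_CMD"
echo "$PREDICT_CMD"
```

Swapping weka.classifiers.trees.RandomForest for weka.classifiers.trees.J48 gives the J48 variant of the same flow.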
Data preparation
We'll try it with iris.arff, the sample data bundled with WEKA. iris.arff contains 50 samples each of three iris species (setosa, versicolor, virginica), with the width and length of the sepal and petal. The main body of an arff file is a series of CSV rows; before the CSV part comes metadata such as column names and data types.
- Use iris.arff as-is for the training data
- For the data to predict, edit iris.arff in an editor (→ iris-test.arff)
  - In the CSV part of iris.arff, delete all but two or three records
  - Change the class column of the remaining records to "?"
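The hand edit above can also be scripted. A minimal sketch (the awk logic and file names are mine, not from the original memo): copy the ARFF header through @DATA, keep a few records, and blank out their class column with "?". A tiny stand-in for iris.arff is embedded so the sketch runs anywhere.

```shell
# Stand-in for iris.arff so the sketch is self-contained:
cat > iris-sample.arff <<'EOF'
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
4.9,3.0,1.4,0.2,Iris-setosa
EOF

# Copy the header as-is; keep the first `keep` data records and
# replace their class column (the last field) with "?".
awk -F',' -v OFS=',' -v keep=3 '
  !data { print; if ($0 ~ /^@DATA/) data = 1; next }
  data && kept < keep { $NF = "?"; print; kept++ }
' iris-sample.arff > iris-test.arff

cat iris-test.arff
```

The resulting iris-test.arff keeps the full header and ends with three records whose class is "?", matching the hand-edited file used below.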
Procedure log (summary)
- Prerequisites
- Create the prediction model (file) from the training data with RandomForest
  - java weka.classifiers.trees.RandomForest -t iris.arff -i -k -d iris.model
  - Training data (input file): iris.arff
  - Prediction model (output file): iris.model
- Use the prediction model to compute the class and its confidence for data whose class is unknown
  - java weka.classifiers.trees.RandomForest -classifications weka.classifiers.evaluation.output.prediction.CSV -l iris.model -T iris-test.arff
- Result output (CSV)

  inst#,actual,predicted,error,prediction
  1,1:?,1:Iris-setosa,,1
  2,1:?,2:Iris-versicolor,,1
  3,1:?,3:Iris-virginica,,1
# For the J48 algorithm, just change the main class to weka.classifiers.trees.J48.
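The CSV above has one row per instance: inst#, the (unknown) actual class, the predicted class as "index:label", an error marker, and the confidence. A small sketch for consuming it in the batch (the file name and awk logic are mine; the sample output is embedded so it runs without WEKA):

```shell
# Sample of the WEKA prediction CSV, embedded for a self-contained run:
cat > predictions.csv <<'EOF'
inst#,actual,predicted,error,prediction
1,1:?,1:Iris-setosa,,1
2,1:?,2:Iris-versicolor,,1
3,1:?,3:Iris-virginica,,1
EOF

# Column 3 is "index:label"; column 5 is the confidence.
# Strip the "index:" prefix and emit "label,confidence" per instance.
awk -F',' 'NR > 1 { split($3, p, ":"); print p[2] "," $5 }' predictions.csv
# -> Iris-setosa,1
#    Iris-versicolor,1
#    Iris-virginica,1
```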
Procedure log (details)
D:\java\weka-3-7-3\data>type iris.arff
% 1. Title: Iris Plants Database
%
% 2. Sources:
%    (a) Creator: R.A. Fisher
%    (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
%    (c) Date: July, 1988
%
% 3. Past Usage:
%    - Publications: too many to mention!!! Here are a few.
%    1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"
%       Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions
%       to Mathematical Statistics" (John Wiley, NY, 1950).
%    2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
%       (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
%    3. Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
%       Structure and Classification Rule for Recognition in Partially Exposed
%       Environments". IEEE Transactions on Pattern Analysis and Machine
%       Intelligence, Vol. PAMI-2, No. 1, 67-71.
%       -- Results:
%          -- very low misclassification rates (0% for the setosa class)
%    4. Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE
%       Transactions on Information Theory, May 1972, 431-433.
%       -- Results:
%          -- very low misclassification rates again
%    5. See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
%       conceptual clustering system finds 3 classes in the data.
%
% 4. Relevant Information:
%    --- This is perhaps the best known database to be found in the pattern
%        recognition literature. Fisher's paper is a classic in the field
%        and is referenced frequently to this day. (See Duda & Hart, for
%        example.) The data set contains 3 classes of 50 instances each,
%        where each class refers to a type of iris plant. One class is
%        linearly separable from the other 2; the latter are NOT linearly
%        separable from each other.
%    --- Predicted attribute: class of iris plant.
%    --- This is an exceedingly simple domain.
%
% 5. Number of Instances: 150 (50 in each of three classes)
%
% 6. Number of Attributes: 4 numeric, predictive attributes and the class
%
% 7. Attribute Information:
%    1. sepal length in cm
%    2. sepal width in cm
%    3. petal length in cm
%    4. petal width in cm
%    5. class:
%       -- Iris Setosa
%       -- Iris Versicolour
%       -- Iris Virginica
%
% 8. Missing Attribute Values: None
%
% Summary Statistics:
%                  Min  Max  Mean  SD    Class Correlation
%    sepal length: 4.3  7.9  5.84  0.83   0.7826
%    sepal width:  2.0  4.4  3.05  0.43  -0.4194
%    petal length: 1.0  6.9  3.76  1.76   0.9490 (high!)
%    petal width:  0.1  2.5  1.20  0.76   0.9565 (high!)
%
% 9. Class Distribution: 33.3% for each of 3 classes.

@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
%
%
%

D:\java\weka-3-7-3\data>java weka.classifiers.trees.RandomForest -t iris.arff -i -k -d iris.model

Random forest of 10 trees, each constructed while considering 3 random features.
Out of bag error: 0.0867

Time taken to build model: 0.04 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances        150              100      %
Incorrectly Classified Instances        0                0      %
Kappa statistic                         1
K&B Relative Info Score             14745.7647 %
K&B Information Score                 233.7148 bits     1.5581 bits/instance
Class complexity | order 0            237.7444 bits     1.585  bits/instance
Class complexity | scheme               4.0295 bits     0.0269 bits/instance
Complexity improvement     (Sf)       233.7148 bits     1.5581 bits/instance
Mean absolute error                     0.0111
Root mean squared error                 0.0467
Relative absolute error                 2.5    %
Root relative squared error             9.8995 %
Coverage of cases (0.95 level)        100      %
Mean rel. region size (0.95 level)     36.8889 %
Total Number of Instances             150

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1        0        1          1       1          1         Iris-setosa
               1        0        1          1       1          1         Iris-versicolor
               1        0        1          1       1          1         Iris-virginica
Weighted Avg.  1        0        1          1       1          1

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 50  0 |  b = Iris-versicolor
  0  0 50 |  c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances        141               94      %
Incorrectly Classified Instances        9                6      %
Kappa statistic                         0.91
K&B Relative Info Score             13709.9102 %
K&B Information Score                 217.2969 bits     1.4486 bits/instance
Class complexity | order 0            237.7444 bits     1.585  bits/instance
Class complexity | scheme            3238.3528 bits    21.589  bits/instance
Complexity improvement     (Sf)     -3000.6084 bits   -20.0041 bits/instance
Mean absolute error                     0.0431
Root mean squared error                 0.176
Relative absolute error                 9.7    %
Root relative squared error            37.3363 %
Coverage of cases (0.95 level)         98      %
Mean rel. region size (0.95 level)     36.6667 %
Total Number of Instances             150

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1        0        1          1       1          1         Iris-setosa
               0.92     0.05     0.902      0.92    0.911      0.968     Iris-versicolor
               0.9      0.04     0.918      0.9     0.909      0.973     Iris-virginica
Weighted Avg.  0.94     0.03     0.94       0.94    0.94       0.98

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 46  4 |  b = Iris-versicolor
  0  5 45 |  c = Iris-virginica

D:\java\weka-3-7-3\data>type iris-test.arff
@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,?
7.0,3.2,4.7,1.4,?
6.3,3.3,6.0,2.5,?
%
%
%

D:\java\weka-3-7-3\data>java weka.classifiers.trees.RandomForest -classifications weka.classifiers.evaluation.output.prediction.CSV -l iris.model -T iris-test.arff

=== Predictions on test data ===

inst#,actual,predicted,error,prediction
1,1:?,1:Iris-setosa,,1
2,1:?,2:Iris-versicolor,,1
3,1:?,3:Iris-virginica,,1

D:\java\weka-3-7-3\data>
Command-line help (the scheme-specific options shown here are J48's; the general options are the same for RandomForest)
Weka exception: No training file and no object input file given.

General options:

-h or -help
    Output help information.
-synopsis or -info
    Output synopsis for classifier (use in conjunction with -h)
-t <name of training file>
    Sets training file.
-T <name of test file>
    Sets test file. If missing, a cross-validation will be performed on the training data.
-c <class index>
    Sets index of class attribute (default: last).
-x <number of folds>
    Sets number of folds for cross-validation (default: 10).
-no-cv
    Do not perform any cross validation.
-split-percentage <percentage>
    Sets the percentage for the train/test set split, e.g., 66.
-preserve-order
    Preserves the order in the percentage split.
-s <random number seed>
    Sets random number seed for cross-validation or percentage split (default: 1).
-m <name of file with cost matrix>
    Sets file with cost matrix.
-l <name of input file>
    Sets model input file. In case the filename ends with '.xml', a PMML file is loaded or, if that fails, options are loaded from the XML file.
-d <name of output file>
    Sets model output file. In case the filename ends with '.xml', only the options are saved to the XML file, not the model.
-v
    Outputs no statistics for training data.
-o
    Outputs statistics only, not the classifier.
-i
    Outputs detailed information-retrieval statistics for each class.
-k
    Outputs information-theoretic statistics.
-classifications "weka.classifiers.evaluation.output.prediction.AbstractOutput + options"
    Uses the specified class for generating the classification output. E.g.: weka.classifiers.evaluation.output.prediction.PlainText
-p range
    Outputs predictions for test instances (or the train instances if no test instances provided and -no-cv is used), along with the attributes in the specified range (and nothing else). Use '-p 0' if no attributes are desired. Deprecated: use "-classifications ..." instead.
-distribution
    Outputs the distribution instead of only the prediction in conjunction with the '-p' option (only nominal classes). Deprecated: use "-classifications ..." instead.
-r
    Only outputs cumulative margin distribution.
-z <class name>
    Only outputs the source representation of the classifier, giving it the supplied name.
-g
    Only outputs the graph representation of the classifier.
-xml filename | xml-string
    Retrieves the options from the XML-data instead of the command line.
-threshold-file <file>
    The file to save the threshold data to. The format is determined by the extensions, e.g., '.arff' for ARFF format or '.csv' for CSV.
-threshold-label <label>
    The class label to determine the threshold data for (default is the first label)

Options specific to weka.classifiers.trees.J48:

-U
    Use unpruned tree.
-O
    Do not collapse tree.
-C <pruning confidence>
    Set confidence threshold for pruning. (default 0.25)
-M <minimum number of instances>
    Set minimum number of instances per leaf. (default 2)
-R
    Use reduced error pruning.
-N <number of folds>
    Set number of folds for reduced error pruning. One fold is used as pruning set. (default 3)
-B
    Use binary splits only.
-S
    Don't perform subtree raising.
-L
    Do not clean up after the tree has been built.
-A
    Laplace smoothing for predicted probabilities.
-J
    Do not use MDL correction for info gain on numeric attributes.
-Q <seed>
    Seed for random data shuffling (default 1).