Notes on running RandomForest/J48 from the weka command line (creating a model file)

I use the Random Forest algorithm to compute the confidence of predictions, with WEKA as the tool. Since it runs inside a daily batch process, I do not use the GUI; WEKA is invoked from shell scripts and Java applications instead.

This memo is a reminder on how to use WEKA's Random Forest engine from a batch job (shell script).

The weka version is 3.7.3. The 3.6 series does not seem to have the option for choosing the output format (CSV, XML, etc.) used in the examples below, and fails with an error.

The overall flow consists of the following two steps; a minimal shell sketch follows below.

  1. Take the training data (file) as input and output a prediction model (file) using the RandomForest engine
  2. Take the prediction model (file) and data (file) whose class is unknown as input, and output the class and its confidence

If you do not want to create a model file, see here.
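
As a rough sketch of this two-step flow as a daily batch, something like the shell script below should work. This is only a sketch: the bash shell, the weka.jar location (/opt/weka/weka.jar), and the output file predictions.csv are assumptions; adjust them for your environment. The two java commands themselves are the ones used later in this memo.

#!/bin/bash
# Minimal sketch of the two-step batch flow; paths and file names are assumptions.
export CLASSPATH=/opt/weka/weka.jar

# Step 1: training data (iris.arff) -> prediction model file (iris.model)
java weka.classifiers.trees.RandomForest -t iris.arff -i -k -d iris.model

# Step 2: model file + unlabeled data -> class and confidence as CSV
java weka.classifiers.trees.RandomForest \
  -classifications weka.classifiers.evaluation.output.prediction.CSV \
  -l iris.model -T iris-test.arff > predictions.csv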

Preparing the data

Let's try it with iris.arff, the sample data bundled with WEKA. iris.arff holds 50 samples each of three species of iris flowers (setosa, versicolor, virginica), with the width and length of the sepal and the petal. The main data part of an arff file is a sequence of CSV rows, preceded by column names and data-type information.

  • Use iris.arff as-is for the training data
  • For the data to predict, edit iris.arff in an editor (→ iris-test.arff); one way to script this edit is sketched below
    • Delete all but two or three arbitrary records from the CSV part of iris.arff
    • Change the class column of the remaining records to "?"
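
The memo simply makes this edit by hand in an editor. As one possible way to script the same edit (a sketch assuming a Unix-like shell with sed; the three rows kept here are the ones that appear in the detail log below):

# Copy the header of iris.arff up to and including the @DATA line
sed -n '1,/^@DATA/p' iris.arff > iris-test.arff

# Keep one row per class and replace the class label with "?"
for row in 5.1,3.5,1.4,0.2 7.0,3.2,4.7,1.4 6.3,3.3,6.0,2.5; do
  echo "${row},?" >> iris-test.arff
done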

Procedure log (summary)

  1. Prerequisites
    • A console where the java command can be run and weka.jar is on the classpath
    • iris.arff and iris-test.arff from "Preparing the data" above
  2. Build a prediction model (file) from the training data using RandomForest
    • java weka.classifiers.trees.RandomForest -t iris.arff -i -k -d iris.model
      • The training data (input file) is iris.arff
      • The prediction model (output file) is iris.model
  3. Use the prediction model to compute the class and its confidence for data whose class is unknown
    • java weka.classifiers.trees.RandomForest -classifications weka.classifiers.evaluation.output.prediction.CSV -l iris.model -T iris-test.arff
      • For the other output formats (CSV, XML, etc.), see here
      • The prediction model (input file) is iris.model
      • The data to predict (file) is iris-test.arff
  4. Resulting output (CSV); a sketch for parsing it in the batch follows below
inst#,actual,predicted,error,prediction
1,1:?,1:Iris-setosa,,1
2,1:?,2:Iris-versicolor,,1
3,1:?,3:Iris-virginica,,1

# For the J48 algorithm, just change the main class to weka.classifiers.trees.J48.
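
In the batch, the predicted class and its confidence can be pulled out of this CSV with something like the following. This is a sketch: it assumes the output of step 3 was redirected to a file named predictions.csv (as in the sketch near the top of this memo), and that the prediction column holds the probability of the predicted class.

# Print "instance  predicted-class  confidence" for each data row.
# Data rows are the lines whose first CSV field is a plain number;
# the header line and the "=== Predictions on test data ===" banner are skipped.
awk -F',' '$1 ~ /^[0-9]+$/ { split($3, p, ":"); print $1, p[2], $5 }' predictions.csv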

Procedure log (details)

D:\java\weka-3-7-3\data>type iris.arff
% 1. Title: Iris Plants Database
%
% 2. Sources:
%      (a) Creator: R.A. Fisher
%      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
%      (c) Date: July, 1988
%
% 3. Past Usage:
%    - Publications: too many to mention!!!  Here are a few.
%    1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"
%       Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions
%       to Mathematical Statistics" (John Wiley, NY, 1950).
%    2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
%       (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
%    3. Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
%       Structure and Classification Rule for Recognition in Partially Exposed
%       Environments".  IEEE Transactions on Pattern Analysis and Machine
%       Intelligence, Vol. PAMI-2, No. 1, 67-71.
%       -- Results:
%          -- very low misclassification rates (0% for the setosa class)
%    4. Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE
%       Transactions on Information Theory, May 1972, 431-433.
%       -- Results:
%          -- very low misclassification rates again
%    5. See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al's AUTOCLASS II
%       conceptual clustering system finds 3 classes in the data.
%
% 4. Relevant Information:
%    --- This is perhaps the best known database to be found in the pattern
%        recognition literature.  Fisher's paper is a classic in the field
%        and is referenced frequently to this day.  (See Duda & Hart, for
%        example.)  The data set contains 3 classes of 50 instances each,
%        where each class refers to a type of iris plant.  One class is
%        linearly separable from the other 2; the latter are NOT linearly
%        separable from each other.
%    --- Predicted attribute: class of iris plant.
%    --- This is an exceedingly simple domain.
%
% 5. Number of Instances: 150 (50 in each of three classes)
%
% 6. Number of Attributes: 4 numeric, predictive attributes and the class
%
% 7. Attribute Information:
%    1. sepal length in cm
%    2. sepal width in cm
%    3. petal length in cm
%    4. petal width in cm
%    5. class:
%       -- Iris Setosa
%       -- Iris Versicolour
%       -- Iris Virginica
%
% 8. Missing Attribute Values: None
%
% Summary Statistics:
%                  Min  Max   Mean    SD   Class Correlation
%    sepal length: 4.3  7.9   5.84  0.83    0.7826
%     sepal width: 2.0  4.4   3.05  0.43   -0.4194
%    petal length: 1.0  6.9   3.76  1.76    0.9490  (high!)
%     petal width: 0.1  2.5   1.20  0.76    0.9565  (high!)
%
% 9. Class Distribution: 33.3% for each of 3 classes.

@RELATION iris

@ATTRIBUTE sepallength  REAL
@ATTRIBUTE sepalwidth   REAL
@ATTRIBUTE petallength  REAL
@ATTRIBUTE petalwidth   REAL
@ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
%
%
%

D:\java\weka-3-7-3\data>java weka.classifiers.trees.RandomForest -t iris.arff -i -k -d iris.model

Random forest of 10 trees, each constructed while considering 3 random features.
Out of bag error: 0.0867



Time taken to build model: 0.04 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances         150              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1
K&B Relative Info Score              14745.7647 %
K&B Information Score                  233.7148 bits      1.5581 bits/instance
Class complexity | order 0             237.7444 bits      1.585  bits/instance
Class complexity | scheme                4.0295 bits      0.0269 bits/instance
Complexity improvement     (Sf)        233.7148 bits      1.5581 bits/instance
Mean absolute error                      0.0111
Root mean squared error                  0.0467
Relative absolute error                  2.5    %
Root relative squared error              9.8995 %
Coverage of cases (0.95 level)         100      %
Mean rel. region size (0.95 level)      36.8889 %
Total Number of Instances              150


=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 1         0          1         1         1          1        Iris-setosa
                 1         0          1         1         1          1        Iris-versicolor
                 1         0          1         1         1          1        Iris-virginica
Weighted Avg.    1         0          1         1         1          1


=== Confusion Matrix ===

  a  b  c   <-- classified as
50  0  0 |  a = Iris-setosa
  0 50  0 |  b = Iris-versicolor
  0  0 50 |  c = Iris-virginica



=== Stratified cross-validation ===

Correctly Classified Instances         141               94      %
Incorrectly Classified Instances         9                6      %
Kappa statistic                          0.91
K&B Relative Info Score              13709.9102 %
K&B Information Score                  217.2969 bits      1.4486 bits/instance
Class complexity | order 0             237.7444 bits      1.585  bits/instance
Class complexity | scheme             3238.3528 bits     21.589  bits/instance
Complexity improvement     (Sf)      -3000.6084 bits    -20.0041 bits/instance
Mean absolute error                      0.0431
Root mean squared error                  0.176
Relative absolute error                  9.7    %
Root relative squared error             37.3363 %
Coverage of cases (0.95 level)          98      %
Mean rel. region size (0.95 level)      36.6667 %
Total Number of Instances              150


=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 1         0          1         1         1          1        Iris-setosa
                 0.92      0.05       0.902     0.92      0.911      0.968    Iris-versicolor
                 0.9       0.04       0.918     0.9       0.909      0.973    Iris-virginica
Weighted Avg.    0.94      0.03       0.94      0.94      0.94       0.98


=== Confusion Matrix ===

  a  b  c   <-- classified as
50  0  0 |  a = Iris-setosa
  0 46  4 |  b = Iris-versicolor
  0  5 45 |  c = Iris-virginica


D:\java\weka-3-7-3\data>type iris-test.arff
@RELATION iris

@ATTRIBUTE sepallength  REAL
@ATTRIBUTE sepalwidth   REAL
@ATTRIBUTE petallength  REAL
@ATTRIBUTE petalwidth   REAL
@ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,?
7.0,3.2,4.7,1.4,?
6.3,3.3,6.0,2.5,?
%
%
%

D:\java\weka-3-7-3\data>java weka.classifiers.trees.RandomForest -classifications weka.classifiers.evaluation.output.prediction.CSV -l iris.model -T iris-test.arff


=== Predictions on test data ===

inst#,actual,predicted,error,prediction
1,1:?,1:Iris-setosa,,1
2,1:?,2:Iris-versicolor,,1
3,1:?,3:Iris-virginica,,1


D:\java\weka-3-7-3\data>

RandomForest command-line help

Weka exception: No training file and no object input file given.

General options:

-h or -help
	Output help information.
-synopsis or -info
	Output synopsis for classifier (use in conjunction  with -h)
-t <name of training file>
	Sets training file.
-T <name of test file>
	Sets test file. If missing, a cross-validation will be performed
	on the training data.
-c <class index>
	Sets index of class attribute (default: last).
-x <number of folds>
	Sets number of folds for cross-validation (default: 10).
-no-cv
	Do not perform any cross validation.
-split-percentage <percentage>
	Sets the percentage for the train/test set split, e.g., 66.
-preserve-order
	Preserves the order in the percentage split.
-s <random number seed>
	Sets random number seed for cross-validation or percentage split
	(default: 1).
-m <name of file with cost matrix>
	Sets file with cost matrix.
-l <name of input file>
	Sets model input file. In case the filename ends with '.xml',
	a PMML file is loaded or, if that fails, options are loaded
	from the XML file.
-d <name of output file>
	Sets model output file. In case the filename ends with '.xml',
	only the options are saved to the XML file, not the model.
-v
	Outputs no statistics for training data.
-o
	Outputs statistics only, not the classifier.
-i
	Outputs detailed information-retrieval statistics for each class.
-k
	Outputs information-theoretic statistics.
-classifications "weka.classifiers.evaluation.output.prediction.AbstractOutput + options"
	Uses the specified class for generating the classification output.
	E.g.: weka.classifiers.evaluation.output.prediction.PlainText
-p range
	Outputs predictions for test instances (or the train instances if
	no test instances provided and -no-cv is used), along with the 
	attributes in the specified range (and nothing else). 
	Use '-p 0' if no attributes are desired.
	Deprecated: use "-classifications ..." instead.
-distribution
	Outputs the distribution instead of only the prediction
	in conjunction with the '-p' option (only nominal classes).
	Deprecated: use "-classifications ..." instead.
-r
	Only outputs cumulative margin distribution.
-z <class name>
	Only outputs the source representation of the classifier,
	giving it the supplied name.
-g
	Only outputs the graph representation of the classifier.
-xml filename | xml-string
	Retrieves the options from the XML-data instead of the command line.
-threshold-file <file>
	The file to save the threshold data to.
	The format is determined by the extensions, e.g., '.arff' for ARFF 
	format or '.csv' for CSV.
-threshold-label <label>
	The class label to determine the threshold data for
	(default is the first label)

Options specific to weka.classifiers.trees.J48:

-U
	Use unpruned tree.
-O
	Do not collapse tree.
-C <pruning confidence>
	Set confidence threshold for pruning.
	(default 0.25)
-M <minimum number of instances>
	Set minimum number of instances per leaf.
	(default 2)
-R
	Use reduced error pruning.
-N <number of folds>
	Set number of folds for reduced error
	pruning. One fold is used as pruning set.
	(default 3)
-B
	Use binary splits only.
-S
	Don't perform subtree raising.
-L
	Do not clean up after the tree has been built.
-A
	Laplace smoothing for predicted probabilities.
-J
	Do not use MDL correction for info gain on numeric attributes.
-Q <seed>
	Seed for random data shuffling (default 1).

Diff between the RandomForest and J48 command-line help

The beginning through the middle is identical; only the last part differs slightly.