|
The LibCVM Toolkit is a C++ implementation of the improved Core Vector Machine (CVM) and recently developed Ball Vector Machine (BVM), which are fast Support Vector Machine (SVM) training algorithms using core-set approximation on very large scale data sets. It is adapted from the LIBSVM implementation (version 2.85). The code has been used on a large range of problems, including network intrusion detection, face detection and implicit surface modeling.
The main features of the toolkit are the following:
Pending features:
If you have any suggestions or bug findings, please email to ivor.tsang@gmail.com. Thank you!
The toolkit is free for research purpose. The software is not tested under Linux/Unix platform, and the performance under Linux/Unix platform may not be reliable. The software must not be further distributed without prior permission of the authors. The author is not responsible for implications from the use of this software. If you use LibCVM Toolkit in your scientific work, please cite as
The toolkit consists of two modules, bvm_train for SVM training and bvm_predict for SVM prediction. Since the LibCVM Toolbox is adapted from the LIBSVM, some options are used for LIBSVM only. The options of bvm_train are:
Usage: bvm-train [options] training_set_file [model_file]
options:
-s svm_type : set type of SVM (default 9)
0 -- C-SVC
1 -- nu-SVC
2 -- one-class SVM
3 -- epsilon-SVR
4 -- nu-SVR
5 -- CVDD (Core Vector Data Description for novelty detection)
[Tsang, Kwok, Cheung, JMLR 2005]
6 -- CVM (sqr. hinge-loss for classification) [Tsang, Kwok, Cheung, JMLR 2005]
7 -- CVM-LS (sqr. eps.-insensitive loss for sparse least-squares classification)
[Tsang, Kwok, Lai, ICML 2005]
8 -- CVR [Tsang, Kwok, Lai, ICML 2005], [Tsang, Kwok, Zurada, TNN 2006]
9 -- BVM [Tsang, Kocsor, Kwok, ICML 2007]
-t kernel_type : set type of kernel function (default 2)
0 -- linear: u'*v
1 -- polynomial: (gamma*u'*v + coef0)^degree
2 -- radial basis function: exp(-gamma*|u-v|^2)
3 -- sigmoid: tanh(gamma*u'*v + coef0)
4 -- precomputed kernel (kernel values in training_set_file)
5 -- laplacian: exp(-sqrt(gamma)*|u-v|)
6 -- normalized poly: ((gamma*u'*v+coef0)/sqrt((gamma*u'*u+coef0)*(gamma*v'*v+coef0)))^degree
7 -- inverse distance: 1/(sqrt(gamma)*|u-v|+1)
8 -- inverse square distance: 1/(gamma*|u-v|^2+1)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default -1, which sets 1/averaged distance between patterns)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the regularization parameter C(= cost) of C-SVC, eps.-SVR, nu-SVR, BVM and CVDD/CVM,
and the regularization parameter C/(mu*m) in CVM-LS s.t. (mu = s_ratio/(cost*m))
(default 100 for BVM/CVDD/CVM/CVM-LS)
-C s_ratio : set the scale parameter C = s_ratio*max|Y_i| in CVR,
and the scale parameter C = s_ratio in CVM-LS (same as the scale parameter in LASSO)
(default 10000 for CVR/CVM-LS)
-u mu_ratio : set the regularization parameter mu = mu_ratio*max|Y_i| in CVR (default = 0.02)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 200)
-e epsilon : set tolerance of termination criterion
(In CVM/BVM, default eps=-1 which sets eps according to the bound |f(x)-f(x)^*|; default 1e-3 for others)
-f max #CVs : MAX number of Core Vectors in binary CVM and BVM (default 50000)
-h shrinking: whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1)
-v n: n-fold cross validation mode
-a size: sample size for probabilistic sampling (default 60)
Example: Zero/one digit classification using the Gaussian kernel:
bvm_train -s 9 -t 2 -c 100 zero_one.txt zero_one.model.txt
Example: Forest Cover Type binary class data using the Gaussian kernel:
bvm_train -s 9 -t 2 -c 10000 -g 1e-4 forest.txt forest.model.txt
Example: KDDCUP-99 Intrusion detection binary class data using the Gaussian kernel:
bvm_train -s 9 -t 2 -c 1000000 intrusion.txt intrusion.model.txt
Example: Regression using the Gaussian kernel:
bvm_train -s 8 -t 2 -u 0.01 -C 20000 census.txt census.model.txt
bvm_train -s 8 -t 2 -u 0.03 -C 100000 cpu.txt cpu.model.txt
Note that -g -1 option can be used to set the width (gamma) of the Gaussian kernel exp(-gamma*|u-v|^2), where
1/gamma = sum ||x_i-x_j||^2/m^2
is the average distance between patterns from a subsampled training set (with 5000 patterns).
The usage of bvm_predict is:
Usage: bvm_predict [options] test_file model_file output_file
options:
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0);
one-class SVM not supported yet
-r Ranking: whether to output the ranking score or not, 0 or 1 (default 0)
Example: Zero/one digit classification:
bvm_predict zero_one.test.txt zero_one.model.txt zero_one.output.txt
Example: Forest cover type binary class classification:
bvm_predict forest.test.txt forest.model.txt forest.output.txt
Example: KDDCUP-99 intrusion detection binary class classification:
bvm_predict intrusion.test.txt intrusion.model.txt intrusion.output.txt
Example: Regression:
bvm_predict census.test.txt census.model.txt census.output.txt
bvm_predict cpu.test.txt cpu.model.txt cpu.output.txt
The LibCVM Toolkit supports both sparse and dense data representations. The sparse format is the same as in SVMlight and LIBSVM. Each line represents one training example and is of the form:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value>
<target> .=. <label> | <float>
<feature> .=. <unsigned integer>
<value> .=. <float>
For the dense format, each line represents one training example and is of the form:
<line> .=. <value>,<value>, ... <value>,<target>
<target> .=. <label> | <float>
<value> .=. <float>
For the multiple training file mode, the input file is started with a header:
#m N M <file name 1> ... <file name N> <label renaming rule 1> ... <label renaming rule M>
The label renaming rule is defined as:
E.g.:<label renaming rule> .=. <original label>><new label>
#m 3 6 usps1.txt usps2.txt usps3.txt 0>1 1>-1 2>1 3>-1 4>1 5>-1
Some classification and regresssion data sets used in experiments:
Extended MIT face + non-face images data set (Dense format: training 489,410 patterns, testing 24,045 patterns, 361 attributes) (full Sets)
Extended USPS 0/1digit images data set (Sparse format: training 266,079 patterns, testing 75,383 patterns, 676 attributes) (full Sets)
Forest Cover Type binary class data set (Sparse format: training 522,910 patterns, testing 58,102 patterns, 54 attributes)
KDDCUP-99 intrusion detection binary class data set (Sparse format: training 4,898,431 patterns, testing 311,029 patterns, 127 attributes)
Letter data set (Sparse format: training 15,000 patterns, testing 5,000 patterns, 16 attributes)
Reuters (money-fx/non money-fx) data set (Sparse format: training 7,770 patterns, testing 3,299 patterns, 8,315 attributes)
IJCNN1 data set (Sparse format: training 49,990 patterns, testing 91,701 patterns, 22 attributes)
Web data set (Sparse format: training 49,749 patterns, testing 14,951 patterns, 300 attributes)
Census housing regression data set (Sparse format: training 18,186 patterns, testing 2,273 patterns, 121attributes)
Computer activity regression data set (Sparse format: training 6,554 patterns, testing 819 patterns, 21attributes)