machine learning - Huge number of classes with Multinomial Naive Bayes (scikit-learn) -


Whenever I start having a bigger number of classes (1000 or more), MultinomialNB gets super slow and takes gigabytes of RAM. The same is true for all the scikit-learn classification algorithms that support .partial_fit() (SGDClassifier, Perceptron). When working with convolutional neural networks, 10000 classes are no problem. But when I want to train MultinomialNB on the same data, 12 GB of RAM are not enough and it is very slow. From my understanding of Naive Bayes, even with a lot of classes, it should be a lot faster. Might this be a problem of the scikit-learn implementation (maybe of the .partial_fit() function)? How can I train MultinomialNB/SGDClassifier/Perceptron on 10000+ classes (batchwise)?
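As background for the batchwise part of the question: .partial_fit() does let you feed the data in chunks, as long as the full label set is declared on the first call (later batches may not contain every class). A minimal sketch with SGDClassifier on synthetic data (the sizes here are placeholders, much smaller than the 10000-class case):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data; sizes reduced for illustration.
rng = np.random.RandomState(0)
n_classes, n_features = 50, 20
X = rng.rand(1000, n_features)
y = rng.randint(0, n_classes, size=1000)

clf = SGDClassifier(random_state=0)

# The complete set of labels must be passed on the first partial_fit call.
all_classes = np.arange(n_classes)
batch_size = 200
for start in range(0, len(X), batch_size):
    clf.partial_fit(X[start:start + batch_size],
                    y[start:start + batch_size],
                    classes=all_classes)

# One-versus-all training leaves one weight row per class.
print(clf.coef_.shape)  # (50, 20), i.e. (n_classes, n_features)
```

This addresses the memory cost of the *data* (only one batch is in RAM at a time), but as the answer below explains, the *model* itself still grows with the number of classes.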

Short answer, without more information:

  • MultinomialNB fits an independent model for each of the classes; thus, if you have c=10000+ classes it fits c=10000+ models. Therefore, the model parameters are [n_classes x n_features], which is quite a lot of memory if n_features is large.

  • The SGDClassifier of scikit-learn uses an OvA (one-versus-all) strategy to train a multiclass model (as the SGDClassifier is not inherently multiclass), and therefore another c=10000+ models need to be trained.

  • And for Perceptron, from the documentation of scikit-learn:

Perceptron and SGDClassifier share the same underlying implementation. In fact, Perceptron() is equivalent to SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None).
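The [n_classes x n_features] parameter matrix alone explains the memory blow-up. A quick back-of-the-envelope check, assuming a 100000-feature bag-of-words input (the feature count is an assumption, not from the question):

```python
n_classes = 10_000
n_features = 100_000   # assumed vocabulary size for a text problem
bytes_per_float = 8    # float64, the default dtype in scikit-learn

# MultinomialNB stores a (n_classes, n_features) log-probability matrix;
# the OvA SGDClassifier stores a coefficient matrix of the same shape.
param_bytes = n_classes * n_features * bytes_per_float
print(param_bytes / 1e9, "GB")  # 8.0 GB for a single parameter matrix
```

At these sizes one dense parameter matrix already approaches the 12 GB the question mentions, before counting the data, intermediate buffers, or a second copy made during fitting.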

So, all three classifiers you mention don't work well with a high number of classes, since an independent model needs to be trained for each of the classes. I would recommend trying something that inherently supports multiclass classification, such as RandomForestClassifier.

