Large-scale sparse multiclass classification
I’m thrilled to announce that my paper “Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification” (published in the Machine Learning journal) is now online: PDF, BibTeX [*].
Over the past decade, l1 regularization has emerged as a powerful way to learn classifiers with implicit feature selection. More recently, mixed-norm (e.g., l1/l2) regularization has been used as a way to select entire groups of features. In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs l1/l2 regularization so as to force weights corresponding to the same features to be zero across all classes, resulting in compact and fast-to-evaluate multiclass models. For optimization, we employ two globally-convergent variants of block coordinate descent, one with line search (Tseng and Yun, 2009) and the other without (Richtárik and Takáč, 2012). We present the two variants in a unified manner and develop the core components needed to efficiently solve our formulation. The end result is a pair of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent compares favorably with other solvers such as FOBOS, FISTA and SpaRSA. Furthermore, we show that our formulation obtains very compact multiclass models and outperforms l1/l2-regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.
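As a rough sketch of the kind of objective the abstract describes, the formulation can be written as follows. The notation is illustrative and may differ from the paper's: W is assumed to be a d x m weight matrix (d features, m classes), W_{j:} is the row of weights for feature j, W_{:r} is the column for class r, and lambda is the regularization strength. The exact scaling of the two terms is an assumption.

```latex
\min_{W \in \mathbb{R}^{d \times m}} \;
  \lambda \sum_{j=1}^{d} \lVert W_{j:} \rVert_2
  \; + \; \frac{1}{n} \sum_{i=1}^{n} \sum_{r \neq y_i}
  \max\bigl( 1 - ( W_{:y_i} \cdot x_i - W_{:r} \cdot x_i ),\; 0 \bigr)^2
```

The first term is the l1/l2 penalty: because it sums unsquared Euclidean norms of whole rows, it tends to set an entire row to zero, which removes the corresponding feature for all classes at once. The second term is a multiclass squared hinge loss that penalizes every wrong class whose score comes within a margin of 1 of the correct class.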
The code of the proposed multiclass method is available in my Python/Cython machine learning library, lightning. Below is an example of how to use it on the News20 dataset.
    from sklearn.datasets import fetch_20newsgroups_vectorized
    # In recent lightning releases, CDClassifier is importable from
    # lightning.classification instead.
    from lightning.primal_cd import CDClassifier

    bunch = fetch_20newsgroups_vectorized(subset="all")
    X = bunch.data
    y = bunch.target

    clf = CDClassifier(penalty="l1/l2",
                       loss="squared_hinge",
                       multiclass=True,
                       max_iter=20,
                       alpha=1e-4,
                       C=1.0 / X.shape[0],  # scale the loss by the number of samples
                       tol=1e-3)
    clf.fit(X, y)

    # accuracy (on the training set)
    print(clf.score(X, y))

    # percentage of selected features
    print(clf.n_nonzero(percentage=True))
To use the variant without line search (as presented in the paper), add the max_steps=0 option to CDClassifier.
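To give a feel for what a block coordinate descent pass without line search looks like, here is a minimal NumPy sketch on dense toy data. It is not lightning's implementation: the function names, the 1/n scaling of the loss, the `lam` parameter, and the block Lipschitz bound `2 * m / n * sum(x_ij ** 2)` are assumptions chosen for illustration. It also recomputes all margins for every block, which a real solver would avoid. Each block is one row of W (one feature across all classes), updated by a gradient step of size 1/L_j followed by group soft-thresholding.

```python
import numpy as np


def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrink v, or zero it out entirely."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v


def hinge_margins(X, y, W):
    """max(0, 1 - (s_y - s_r)) for each sample and each class r != y."""
    n = X.shape[0]
    scores = X @ W
    A = 1.0 - scores[np.arange(n), y][:, None] + scores
    A[np.arange(n), y] = 0.0  # the correct class contributes no loss
    return np.maximum(A, 0.0)


def objective(X, y, W, lam):
    """Mean multiclass squared hinge loss plus the l1/l2 penalty on rows of W."""
    A = hinge_margins(X, y, W)
    return (A ** 2).sum() / X.shape[0] + lam * np.linalg.norm(W, axis=1).sum()


def bcd_epoch(X, y, W, lam):
    """One cyclic pass over the feature blocks, updating W in place."""
    n, d = X.shape
    m = W.shape[1]
    rows = np.arange(n)
    for j in range(d):
        # Assumed upper bound on the Lipschitz constant of the block gradient.
        L_j = 2.0 * m * (X[:, j] ** 2).sum() / n
        if L_j == 0.0:
            continue  # feature j is all zeros, nothing to update
        A = hinge_margins(X, y, W)
        G = 2.0 * A / n              # d(loss)/d(score) for wrong classes
        G[rows, y] = -G.sum(axis=1)  # d(loss)/d(score) for the correct class
        grad_j = X[:, j] @ G         # gradient with respect to row j of W
        W[j] = group_soft_threshold(W[j] - grad_j / L_j, lam / L_j)
    return W


# Toy data: only the first 3 of 20 features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
W_true = np.zeros((20, 3))
W_true[:3] = rng.normal(size=(3, 3))
y = np.argmax(X @ W_true, axis=1)

W = np.zeros((20, 3))
for _ in range(20):
    bcd_epoch(X, y, W, lam=0.1)
print("objective:", objective(X, y, W, lam=0.1))
print("nonzero feature rows:", int((np.linalg.norm(W, axis=1) > 0).sum()))
```

Because the step size comes from a fixed bound rather than a search, each block update needs a single gradient evaluation, at the cost of possibly more conservative steps.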
I also released the Amazon7 dataset used in the paper. It contains 1,362,109 reviews of Amazon products. Each review belongs to one of 7 categories (apparel, book, dvd, electronics, kitchen & housewares, music, video) and is represented as a 262,144-dimensional vector. It is, to my knowledge, one of the largest publicly available multiclass classification datasets.
[*] The final publication is available here.