Amazon7 is a dataset that we created from raw data collected by Mark Dredze and his colleagues. It contains 1,362,109 reviews of Amazon products. Each review belongs to one of 7 categories (apparel, book, dvd, electronics, kitchen & housewares, music, video) and is represented as a 262,144-dimensional vector. The data is very sparse: only 0.04% of the feature values are non-zero. This is the dataset that we used in our paper "Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification" (see below).


amazon7.pkl.tar.bz2 [303 MB]

For Python users, the easiest way to load the dataset is with joblib's pickle functionality. Note that older releases of scikit-learn also bundle joblib as sklearn.externals.joblib. After decompressing the above archive, you can load the dataset as follows:

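try:
    import joblib
except ImportError:
    # Fall back to the copy bundled with older scikit-learn releases.
    from sklearn.externals import joblib

# Load the decompressed pickle; it behaves like a dictionary.
data = joblib.load("amazon7.pkl")
X = data["X"]  # feature matrix, one row per review
y = data["y"]  # category label of each review
print(X.shape)
print(y.shape)
print(data["categories"])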
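
Once loaded, a quick sanity check is to compare the data against the figures quoted above. The helper below is only a sketch: summarize is a hypothetical name, and it assumes that X exposes .shape and .nnz (as SciPy sparse matrices do) and that y is a one-dimensional sequence of labels. Neither assumption is stated on this page.

```python
from collections import Counter


def summarize(X, y):
    """Return (density, label_counts) for a feature matrix X and labels y.

    Assumption: X has .shape and .nnz attributes, like a SciPy sparse matrix.
    """
    n_samples, n_features = X.shape
    # Fraction of entries that are stored as non-zero.
    density = X.nnz / float(n_samples * n_features)
    # Number of reviews per distinct label value.
    label_counts = Counter(list(y))
    return density, label_counts
```

If these assumptions hold, summarize(X, y) on the full dataset should report a density close to 0.0004 (the 0.04% quoted above) and counts for 7 distinct labels.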

svmlight / libsvm format

amazon7.bz2 [209 MB]

For convenience, the dataset is also provided in the well-known svmlight / libsvm format.
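
In this format, each line holds one example: a label followed by index:value pairs for the non-zero features. The sketch below streams the compressed file using only the Python standard library, so the whole dataset never has to fit in memory. iter_svmlight is a hypothetical helper name, and the code assumes plain "label index:value ..." lines without qid fields.

```python
import bz2


def iter_svmlight(path):
    """Yield (label, {feature_index: value}) pairs from a bz2-compressed file."""
    with bz2.open(path, "rt") as f:
        for line in f:
            # Drop an optional trailing comment and surrounding whitespace.
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            tokens = line.split()
            features = {}
            for pair in tokens[1:]:
                index, value = pair.split(":")
                features[int(index)] = float(value)
            yield float(tokens[0]), features
```

For example, iterating over iter_svmlight("amazon7.bz2") visits the reviews one at a time. To get the whole dataset as a sparse matrix instead, scikit-learn's load_svmlight_file is an alternative that can read .bz2 files directly.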


If you use Amazon7 in a paper, please cite both of the following papers.