GraphLab Create Performance
GraphLab Create™ outperforms scikit-learn
GraphLab Create™ is built for scale. It can process datasets with billions of data points and millions of features, and it scales to large datasets without compromising accuracy. This is a hallmark of our product, and it sets us apart in the eyes of our customers.
GraphLab Create™ vs. scikit-learn
The chart below compares GraphLab Create™ with scikit-learn, a popular open-source package for single-node machine learning. Scikit-learn implementations run entirely in memory. In comparison, GraphLab Create™ can scale out-of-core, and many of our multi-threaded implementations take advantage of multiple cores on the machine. We expect GraphLab Create™ to outperform scikit-learn in terms of scale, and we also hope to win in terms of speed. Here we test both hypotheses.
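To make the in-memory versus out-of-core distinction concrete, here is a minimal sketch of out-of-core logistic regression in plain Python. It is not GraphLab Create™'s implementation and not scikit-learn's. It assumes a simple stochastic gradient descent update and a CSV file whose last column is a 0/1 label, and all function names (stream_rows, train_out_of_core, error_rate, make_toy_file) are hypothetical. The point is only that the training loop reads one row at a time from disk, so memory use does not grow with the size of the file.

```python
import csv
import math
import os
import random
import tempfile


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def stream_rows(path):
    """Yield (features, label) one row at a time; the file is never loaded whole."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            *features, label = row
            yield [float(v) for v in features], int(label)


def train_out_of_core(path, n_features, epochs=5, lr=0.1):
    """Fit logistic regression by SGD, re-reading the file once per epoch."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in stream_rows(path):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the linear score
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
            b -= lr * g
    return w, b


def error_rate(path, w, b):
    """Fraction of rows misclassified, again streaming from disk."""
    wrong = total = 0
    for x, y in stream_rows(path):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0
        wrong += pred != y
        total += 1
    return wrong / total


def make_toy_file(n_rows=2000, seed=0):
    """Write a small synthetic, linearly separable dataset (illustration only)."""
    rng = random.Random(seed)
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_rows):
            x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
            writer.writerow([x1, x2, 1 if x1 + x2 > 0 else 0])
    return path
```

An in-memory learner would first load every row into a list or array and then iterate over it. The streaming version trades that memory for repeated disk reads, which is why disk speed matters once a dataset no longer fits in RAM.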
Performance is measured on binary classification with logistic regression, a task common to many applications, including ad targeting systems, spam detection, and customer demographic classification. The test dataset, Epsilon, was featured in the Pascal Large Scale Learning Challenge. It contains about half a million points, each with a few thousand features, and is roughly 12 GB on disk. All experiments were conducted on commodity hardware: an 8-core Intel Core i7-4770 processor with a 1 TB 7200 RPM hard drive and 32 GB of RAM.
Note: stay tuned for more benchmarks on larger datasets of varying dimensions.
Default options were used for both products. We measured computation time (speed) and classification error in percent (accuracy). The better-performing product should excel at both.
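The two measurements described above can be sketched as a small harness. This is an illustrative assumption about how such a measurement might be set up, not the benchmark code used for the chart. The names benchmark, classification_error_pct, fit_majority, and predict_majority are hypothetical, and the majority-label "model" is only a stand-in so the example runs without any machine learning library.

```python
import time


def classification_error_pct(y_true, y_pred):
    """Percentage of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return 100.0 * wrong / len(y_true)


def benchmark(fit, predict, X, y):
    """Time `fit`, then score the fitted model on the same data (training error)."""
    start = time.perf_counter()
    model = fit(X, y)
    seconds = time.perf_counter() - start
    error = classification_error_pct(y, predict(model, X))
    return {"seconds": seconds, "error_pct": error}


# Hypothetical stand-in model: always predict the most common training label.
def fit_majority(X, y):
    return max(set(y), key=y.count)


def predict_majority(model, X):
    return [model] * len(X)


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 1, 1, 1]
result = benchmark(fit_majority, predict_majority, X, y)
print(result["error_pct"])  # 25.0: one of four labels differs from the majority
```

To compare two real libraries, each one's training and prediction calls would be wrapped in a fit/predict pair with this shape and run on the same data and the same machine.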
The speed results show that GraphLab Create™ outperforms scikit-learn. As expected, out-of-core computation gives GraphLab Create™ a significant advantage at scale. But even when all of the data fits in memory, GraphLab Create™'s carefully engineered logistic regression module is five times faster than scikit-learn, with comparable training accuracy.