Turi Products Release Notes

Latest versions:

GraphLab Create™ 2.1
Turi Predictive Services™ 2.0
Turi Distributed™ 2.1


GraphLab Create

  • Bug fixes for classifier early-stopping behavior

GraphLab Create

  • Minor bug fixes
  • GraphLab Create for Python 3 Beta

    GraphLab Create is available for Python 3 as a beta release. You can download the Python installation package from the links below or pip install it directly from there. Note that neither Turi Predictive Services nor Turi Distributed supports Python 3 yet.

  • Python 3.5 beta install files:
  • Python 3.4 beta install files:

GraphLab Create

  • Automatic Feature Vectorizer

    Automatically detect, interpret, and apply transforms to various types of content to prepare it for use in other ML models.

Toolkits Enhancements:

  • Lead Scoring toolkit

    Predict how likely users are to take an action like subscribing to a product.

  • Item Content Recommender toolkit

    Build a recommender system for similar content (like blog posts) without user-item interactions data (like user ratings).

  • Object detection toolkit

    Automatically detect objects within images.

  • Updated Views to visually explore, explain and evaluate models.
    • Recommender toolkit
    • Churn predictor toolkit
  • GraphLab Create for Python 3 Beta

    GraphLab Create is available for Python 3 as a beta release. You can download the Python installation package from the links below or pip install it directly from there. Note that neither Turi Predictive Services nor Turi Distributed supports Python 3 yet.

    Turi Distributed

    • Distributed tree models

      Train your Boosted Trees and Random Forest models in a distributed fashion using a cluster of machines. This will drastically speed up the creation of tree models from billions of rows, scaling linearly with the number of nodes. Distributed tree model training is robust by periodically saving checkpoints to enable resuming training after interruptions.

    • Distributed ML job launch improvements

      Improved launching of distributed ML jobs, as well as the usability of the respective APIs.

    Turi Predictive Services

    • Web-based dashboard

      In addition to the programmatic APIs you can now access metadata, logs, and metrics of a Predictive Services deployment through a web-based dashboard.

    • Built-in metrics for live evaluation of recommender model quality

      Evaluate the performance of recommender models in real-time, based on live data.

    • REST-based management interface

      All operations to manage a Turi Predictive Services deployment are now based on pure REST interfaces through HTTP/HTTPS. You can choose to use our client wrappers for development in Python, JavaScript, etc. or write your own.

    • Tableau Integration

      Turi has partnered with Tableau to integrate Predictive Services with Tableau software, allowing the Tableau user to create more powerful dashboards by integrating ML libraries and trained models into their analytics (More info in this blog posting).

    • Standalone Python client library

      The Python client API is now available as a standalone package psclient. You can now create and manage a Predictive Services deployment without installing GraphLab Create. Several of the Predictive Services APIs formerly under GraphLab Create have changed:

      • connect() replaces load() for getting a handle to a running deployment. The signature changed as well, as the new management interface does not use the state path to identify a Predictive Services deployment anymore.
      • As we removed dependencies on GraphLab Create, no API (get_endpoints(), for example) returns an SFrame anymore.
      • Relatedly, we made it easier to retrieve log files directly and moved the parsing of logs into tabular structures to a utility module, psclient.logparse.
      • We removed show() as the functionality has been subsumed by the dashboard.
      • Connecting to and working with a Predictive Services deployment will not create artifacts under ~/.graphlab anymore. When you create a new deployment, psclient.create will write an explicit configuration file by default.
      • We removed deployed_predictive_objects() as it is subsumed by get_endpoints().
      • get_predictive_objects_status() is now subsumed by get_status('endpoint').
      • We removed apply_changes(); adding an endpoint now takes effect immediately. Accordingly, we removed pending_endpoint_changes, clear_pending_changes() and test_query().
      • save_client_config() is now generate_config().
      • reconfigure() is now configure().
      • endpoints is now get_endpoints().
      • environment_variables is replaced by a dict environ, which behaves like os.environ.
      • We removed describe(); its functionality is subsumed by an endpoint's description and the new, more powerful input and output schema metadata infrastructure.
      • add() and update() are replaced with deploy() and deploy(override=True).
      • As the authorization model of Turi Predictive Services has become more powerful, a new set of key-related APIs has been added, replacing set_admin_key(), set_api_key(), admin_key, and api_key.
      • We replaced set_CORS() and set_scale_factor() by key-value pairs specified through configure().
      • set_query_timeout() is replaced by a parameter in connect().
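
The renames listed above can be collected into a small lookup table to help audit existing scripts. This is only a sketch: the dictionary, the set, and the helper `migration_hint` are hypothetical names introduced here, and only the old and new API names are taken from the list above.

```python
# Old GraphLab Create Predictive Services names mapped to their psclient
# replacements, transcribed from the release notes. The container and helper
# names are hypothetical and are not part of psclient.
PSCLIENT_RENAMES = {
    "load()": "connect()",
    "deployed_predictive_objects()": "get_endpoints()",
    "get_predictive_objects_status()": "get_status('endpoint')",
    "save_client_config()": "generate_config()",
    "reconfigure()": "configure()",
    "endpoints": "get_endpoints()",
    "environment_variables": "environ",
    "add()": "deploy()",
    "update()": "deploy(override=True)",
}

# Names removed without a one-to-one replacement in the list.
PSCLIENT_REMOVED = {
    "show()", "apply_changes()", "pending_endpoint_changes",
    "clear_pending_changes()", "test_query()", "describe()",
}


def migration_hint(old_name):
    """Return a short migration hint for an old API name."""
    if old_name in PSCLIENT_RENAMES:
        return "use " + PSCLIENT_RENAMES[old_name]
    if old_name in PSCLIENT_REMOVED:
        return "removed"
    return "not covered by this table"
```

For example, `migration_hint("update()")` returns `"use deploy(override=True)"`. The key-related and configure()-based replacements are left out because the notes do not name them one-to-one.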

      We have consolidated all documentation resources related to Turi Predictive Services here, incorporating and improving the former Turi User Guide content: Turi Predictive Services User Guide.

  • GraphLab Create

    • Improvements to XGBoost
      • Add lossless json tree export via m._dump_to_json()
      • Bug fixes
    • GraphLab Create for Python 3.5 Beta

      You can download the Python installation package from the links below or pip install it directly from there. Note that neither Dato Predictive Services nor Dato Distributed supports Python 3 yet.

    GraphLab Create

    • GraphLab Create for Python 3.5 Beta

      We now provide GraphLab Create for Python 3.5 as a beta release. You can download the Python installation package from the links below or pip install it directly from there. Note that neither Turi Predictive Services nor Turi Distributed supports Python 3 yet.

    • SFrame performance improvements
      • Faster SFrame Sort

        We significantly improved the sorting performance for SFrames with a large number of columns.

      • Faster slicing

        Slicing operations such as sf[:1000] are now lazy and faster.

    • SArray API additions
      • SArray.where

        A ternary operator similar to numpy.where. It allows values in an SArray to be replaced.

      • SArray.hash

        Computes a hash value for each SArray element.

      • SArray.contains and SArray.is_in

        Performs an element-wise search of an item in the SArray.
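
To illustrate the semantics of the element-wise operations above, here is a minimal sketch on plain Python lists. It mirrors the behavior of numpy.where and an element-wise membership test; it is not the SArray implementation, the function signatures are assumptions, and the 0/1 encoding of the membership result is also an assumption.

```python
def where(condition, if_true, if_false):
    """Element-wise ternary: take if_true where condition holds, else if_false."""
    return [t if c else f for c, t, f in zip(condition, if_true, if_false)]


def is_in(values, candidates):
    """Element-wise membership test: 1 if the element is in candidates, else 0."""
    lookup = set(candidates)
    return [1 if v in lookup else 0 for v in values]


ratings = [5, 1, 4, 2]
# Replace every rating below 3 with 3, keeping the others unchanged.
clipped = where([r >= 3 for r in ratings], ratings, [3] * len(ratings))
# Flag the ratings that are either 1 or 2.
flags = is_in(ratings, [1, 2])
```

Here `clipped` is `[5, 3, 4, 3]` and `flags` is `[0, 1, 0, 1]`.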

    Toolkits Enhancements:

    • Churn prediction improvements
      • We improved the explanations for predictions by the churn model and made them more intuitive and understandable. The explanations now also account for missing values during predictions.
      • The churn predictor view has been improved with a new layout.
    • Item Similarity Scalability

      We improved scalability to efficiently handle a significantly larger number of items, and we increased speed and decreased memory usage at both training and recommendation time.

    • mxnet improvements
      • New API to iterate over and preprocess an image SFrame: SFrameImageIter
      • New API for finetune
      • New API for extract_features

    Turi Predictive Services

    • Support for text feature engineering

      Turi Predictive Services now comes pre-installed with the feature engineering transformers for splitting sentences and tagging parts of speech (SentenceSplitter, PartOfSpeechExtractor), to facilitate server-side transformations of text-based features. Note that this improvement significantly increases the size of the on-premises installation package.

    • Latency Improvements

      In this release we reduced the impact of model updates and logging on the performance of a predictive service. This improvement removes the variations in throughput and latency that management operations caused in previous versions.

    • Improved VPC support

      You can now deploy a predictive service into a VPC with a private subnet, using an internal load balancer.

    GraphLab Create

    • Integrates mxnet into neural network toolkit
    • New text feature engineering toolkit
      • Includes functionality for splitting sentences and tagging parts of speech
      • Leverages spaCy for superior performance.
    • Reduced memory usage for decision tree method
    • Interactive views to evaluate and explore Recommender and Churn Predictor models (Beta)
    • Beta support for Python 3.4. Install files:

    GraphLab Create

    • Fixes performance regression for apply operations
    • Resolves crash with IPython notebook on some Python distributions due to conflict with pyzmq
    • Fixes inconsistent JSON export representation for decision tree method
      • Previous serialization truncated float split value into integer, which led to prediction inconsistencies between the binary model and model based on exported JSON.

    GraphLab Create

    Feature Engineering Enhancements:

    • Dimensionality reduction through Gaussian random projection
    • RareWordTrimmer
      • Allows for the easy trimming of rare words in a corpus
    • Added Count Featurizer
      • Replaces categorical features with simple count features that accelerate classification methods without impacting quality
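
As a rough illustration of the dimensionality reduction technique named above, the following sketch applies a Gaussian random projection using only the standard library. It shows the general technique, not the toolkit's implementation; the function name, the seed parameter, and the 1/sqrt(output_dim) scaling are choices made for this example.

```python
import math
import random


def gaussian_random_projection(rows, output_dim, seed=0):
    """Project each row onto output_dim random Gaussian directions."""
    rng = random.Random(seed)
    input_dim = len(rows[0])
    # Scaling keeps expected squared lengths comparable to the input.
    scale = 1.0 / math.sqrt(output_dim)
    # One random direction per output dimension, entries drawn from N(0, 1).
    directions = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
                  for _ in range(output_dim)]
    return [[scale * sum(d * x for d, x in zip(direction, row))
             for direction in directions]
            for row in rows]


data = [[1.0, 0.0, 2.0, 0.5], [0.0, 3.0, 1.0, 1.5], [2.0, 2.0, 0.0, 0.0]]
reduced = gaussian_random_projection(data, output_dim=2)
```

Each four-dimensional row in `data` becomes a two-dimensional row in `reduced`. Using the same seed reproduces the same projection, which matters when new data must be transformed consistently with the training data.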

    Toolkits Enhancements:

    • Enhancements to boosted trees (classification and regression)
      • Speed and scale improvements. Up to 20% faster and works with datasets that are 5x larger
      • Reduced memory usage by 2x and disk usage by up to 5x
      • Ability to stop training early by tracking model performance on a validation set
      • Better progress tracking of evaluation metrics during training
    • Improved churn prediction model
      • Lower memory footprint
      • Ability to provide pre-aggregated data
      • Added evaluation metrics
      • Note that this improvement is not backwards-compatible with the previous version of the churn prediction toolkit!
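
The early-stopping behavior mentioned for boosted trees above can be sketched as follows. This is a generic illustration under stated assumptions, not the toolkit's code: the list of validation errors stands in for the metric a real training loop would compute after each iteration, and the `patience` parameter is a hypothetical name.

```python
def train_with_early_stopping(validation_errors, patience=2):
    """Return (best_iteration, best_error), stopping once validation stalls.

    validation_errors stands in for the per-iteration validation metric that
    a real training loop would compute after adding each tree.
    """
    best_iteration, best_error = None, float("inf")
    rounds_without_improvement = 0
    for iteration, error in enumerate(validation_errors):
        if error < best_error:
            best_iteration, best_error = iteration, error
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break  # validation error has stalled; stop training
    return best_iteration, best_error


errors = [0.40, 0.31, 0.27, 0.28, 0.29, 0.20]
result = train_with_early_stopping(errors, patience=2)
```

With a patience of 2, training stops after two consecutive iterations without improvement, so `result` is `(2, 0.27)` and the later value 0.20 is never reached. A larger patience trades extra training time for a lower risk of stopping too soon.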

    Turi Predictive Services

    • Schema annotation for custom methods
    • Increased throughput by parallelized model execution
    • Additional metrics about server resource utilization

    GraphLab Create

    • Bug fix in SFrame logical filter

    GraphLab Create

    • DBAPI2 interface for SFrame
      • Write to and read from SQL databases through Python DBAPI2
    • Reduced startup time of the GraphLab Create engine
    • Toolkits Enhancements:

      • Decision trees for regression and classification
      • Bayesian change point detection for univariate time series

    GraphLab Create

    • Further stability improvements in long-running lambda workers
    • Fixed memory leak of empty SFrames
    • Bug fixes in the churn predictor

    GraphLab Create

    • Stability improvements in lambda workers
    • Rolling window and cumulative aggregations for SFrames, SArrays, and TimeSeries

    Toolkits Enhancements:

    • Anomaly detection toolkit
      • Local outlier factor for multivariate IID data
      • Moving Z-score for univariate time series
    • Improved performance for Gradient Boosted Trees and Random Forest models
      • Out-of-core implementation
      • Multicore on Windows and OS X
    • Provide standard errors for linear models
    • Predict method for kmeans models
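
For readers unfamiliar with the moving Z-score listed under the anomaly detection toolkit above, here is a minimal standard-library sketch of the idea. It is an assumption about the general technique, not the toolkit's implementation: each point is scored against the mean and population standard deviation of the preceding window, and points without enough history are left unscored.

```python
import statistics


def moving_zscore(series, window):
    """Score each point against the mean and std of the preceding window."""
    scores = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i]
        if len(history) < window:
            scores.append(None)  # not enough history yet
            continue
        mean = statistics.mean(history)
        std = statistics.pstdev(history)
        # A constant window has zero spread, so the score is undefined.
        scores.append(None if std == 0 else (value - mean) / std)
    return scores


readings = [10.0, 12.0, 10.0, 12.0, 30.0]
scores = moving_zscore(readings, window=4)
```

The first four entries of `scores` are `None`, and the last reading scores about 19 standard deviations above the preceding window, which marks it as a likely anomaly.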

    Turi Predictive Services

    • Improved security for on-premises deployments
    • HTTP header authentication

    GraphLab Create

    • Improved Python lambda stability
    • Improved SSL certificate verification for S3 bucket access
    • Union operator for time series data
    • SFrame and SArray methods for conversion to NumPy arrays

    Toolkits Enhancements:

    • Tab completion for models
    • Additional evaluation metrics for classifier models
    • Adding class weights in NeuralNet for imbalanced datasets
    • Feature engineering enhancements:
      • Multiple column support for all feature engineering objects
      • New bag-of-words and bag-of-ngrams transformers: graphlab.feature_engineering.WordCounter and graphlab.feature_engineering.NGramCounter
      • New function graphlab.text_analytics.tokenize
      • New options for graphlab.feature_engineering.Tokenizer

    Breaking Changes:

    • graphlab.feature_engineering.Tokenizer now defaults to using space characters as delimiters.
    • Multiple column selection on a TimeSeries object (e.g., ts[['col1', 'col2']]) now returns TimeSeries instead of SFrame
    • TimeSeries.date_range is renamed to slice
    • graphlab.feature_engineering.tf_idf now returns an SArray rather than a one-column SFrame.

    Turi Predictive Services

    New Features:

    • Support for specifying custom Python packages as user code dependencies
    • Support for setting environment variables in a predictive service
    • Increased number of supported EC2 instance types
    • SSL support for management operations

    Breaking Changes:

    • graphlab.deploy.PredictiveService.api_key is now a read-only property. Use set_api_key to set the key.

    Bug fixes

    • fix for Python lambda workers not starting on Windows when system username includes certain characters
    • fix for GraphLab Canvas output in IPython Notebook

    GraphLab Create

    New Features:

    • GraphLab Create Launcher
      Unless you are a Python expert, setting up a Python environment can be tricky and error-prone. With the GraphLab Create Launcher, we automatically set up a working Python environment with GraphLab Create and other useful packages to get you started faster.
    • Time Series Data
      • For scalable manipulation and aggregation of multi-variate time series we are providing a new out-of-core time series data type
      • Improved time zone support
      • Increasing the time resolution to microseconds
    • Improvements to tree models
      • Retrieve feature importance metrics
      • Better missing value support
      • Model-based feature extraction on new data
    • Model tuning in Canvas
      Use GraphLab Create Canvas to evaluate and compare your recommender model in a more intuitive way.

    New Machine Learning Toolkits:

    • Churn Prediction Toolkit
      Predict which users will churn (stop using) a product or website given user activity logs.
    • Product Sentiment Analysis
      Easily analyze text to keep a pulse on users' sentiments about your products or services.
    • DBSCAN Clustering
      We have extended our clustering toolkit with the DBSCAN method, making clustering tasks more flexible and intuitive.
    • Record Linker
      By adding a Record Linker as a new tool to the Data Matching toolkit you can now match tabular queries to tabular reference data.
    • Frequent Pattern Mining
      Extract and analyze frequent patterns in log/event data.

    Performance Improvements:

    • Improved performance of Nearest Neighbor toolkit by constructing a similarity graph directly.
    • Fast approximation of Nearest Neighbors through locality-sensitive hashing.
    • More efficient and faster access of data in S3.
    • Improved performance of SFrame creation.

    Turi Predictive Services

    New Features:

    • Support adaptive model serving through endpoint policies.
    • Support for deploying Turi Predictive Services into a non-default VPC in AWS.
    • Setup package for an on-premises Turi Predictive Services deployment.
    • Support metrics for on-premises Turi Predictive Services deployment
    • More flexible and intuitive metrics retrieval.

    Performance Improvements:

    • Improved service latency for all supervised learning models.

    Turi Distributed

    New Features:

    • Setup package for Turi Distributed on Hadoop.
    • Distributed Machine Learning running in EC2.
    • Surfacing job preemption in Hadoop.
    • Interface between DataFrames and SFrames in Scala.

    Performance Improvements:

    • Improved performance of distributed graph analytics.

    Critical Bug Fixes:

    • Fixed installation issues on Mac OS X 10.9 (Mavericks).
    • Fixed installation issues on Windows (all versions).
    • Fixed import issues when there are spaces in the install path.

    New Features:

    GraphLab Create

    • NumPy integration with SArray.
      • When NumPy is used with GraphLab Create, it is now possible to scale NumPy arrays out-of-core and to convert efficiently between NumPy arrays and SArrays.
    • GraphLab Canvas enhanced scalable plots.
      • In addition to the scalable, streaming heatmap introduced in 1.4, bar charts, line charts, and scatter plots now also scale to unlimited data size with automatic, streaming aggregation. A box-and-whisker plot type was also added.
    • Native Windows support.
      • In addition to Mac and Linux, GraphLab Create can now be pip installed on Windows. The following capabilities are not yet fully supported on Windows:
        • Scalable Numpy integration
        • Some data connectors: HDFS: read/write, ODBC: graphlab.connect_odbc and associated ODBC methods, Spark: graphlab.from_rdd, graphlab.to_rdd and associated RDD methods
        • Initiating jobs to Turi Distributed on Hadoop
        • Initiating distributed ML jobs to Turi Distributed
    • New machine learning toolkits.
      • Random decision forest for classification and regression. See random_forest_classifier, random_forest_regression.
      • Support for boosted tree models for feature extraction. See BoostedTreesClassifier.extract_features
      • Image similarity search toolkit. Find similar images using locality-sensitive hashing (LSH) for approximate nearest neighbors of features extracted from a pretrained model. See data_matching.similarity_search.
      • ROC curve for binary classification models. Returns an SFrame of predictive performance when defining a cutoff threshold. See evaluation.roc_curve
      • Get factorization recommender similar users. Returns the k most similar users. See factorization_recommender.FactorizationRecommender.get_similar_users.
      • New graphlab.feature_engineering methods
        • DeepFeatureExtractor. Extract features from a NeuralNetClassifier model or an online ImageNet NeuralNetClassifier model pretrained by Turi.
        • Tokenizer. Split sentences intelligently with punctuation.
        • BM25. Search relevance score for text features.
        • CategoricalImputer. Fill missing values (None) in data sets that have categorical data.
    • SArray subslice method. graphlab.SArray.subslice performs a range slice (ex: element[start:stop:step]) on each element of the SArray. This can be used for substring or sublist operations.
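
The subslice semantics described above can be sketched on plain Python lists. This only illustrates applying element[start:stop:step] to every element; it is not the SArray implementation, and passing missing values through as None is an assumption.

```python
def subslice(elements, start=None, stop=None, step=None):
    """Apply element[start:stop:step] to every element; None passes through."""
    window = slice(start, stop, step)
    return [None if e is None else e[window] for e in elements]


names = ["graphlab", "sframe", None]
prefixes = subslice(names, 0, 3)                        # substring operation
every_other = subslice([[1, 2, 3, 4], [5, 6]], step=2)  # sublist operation
```

Here `prefixes` is `["gra", "sfr", None]` and `every_other` is `[[1, 3], [5]]`.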

    Turi Predictive Services

  • Turi Predictive Service is available for on-premises deployment. Deployment of a single node Turi Predictive Service server using Docker technology is supported.
  • Support for configuring the connection port to Turi Predictive Service on EC2.
  • New API for obtaining service logs for Turi Predictive Service.
  • Turi Distributed

  • Support for creating a long-running EC2 cluster, so that multiple jobs can be submitted to one cluster rather than each job creating its own.
  • Turi Distributed is available for on-premises deployment. In addition to use on EC2, deployment of Turi Distributed on a local Hadoop infrastructure is supported.
  • Turi Distributed on Hadoop supports distributed machine learning for the following algorithms: logistic regression, linear regression, svm classifier, label propagation, pagerank.
  • Performance Optimizations

  • nearest_neighbors is now 2-3X faster for batch queries on pre-featurized dense, numeric, euclidean data objects
  • kmeans now supports minibatch k-means which enables training over a specified subset of the dataset
  • Support for multiple GPUs for improved performance when using the deeplearning.NeuralNet toolkit
  • SFrame.save has been significantly optimized
  • Integer and floating point SArray/SFrames are smaller and faster due to improved compression methods.
  • Breaking Changes

    • Data Connectors.
      • When using SFrame.to_rdd or SFrame.from_rdd, SparkContext('yarn-client') is not supported. A future release will provide this support and more.
    • Machine Learning Toolkits.
      • vowpal_wabbit has been removed
    • Deployment
      • deploy.predictive_service.copy_predictive_object has been changed to deploy.predictive_service.copy_ec2_predictive_object
      • deploy.environment.EC2 has been replaced with deploy.Ec2Config, use the latter for creating both EC2 cluster and Predictive Service
      • deploy.environment.Hadoop has been removed and subsumed by deploy.hadoop_cluster.create
      • aws.launch_EC2, aws.terminate_EC2 have been removed and replaced with deploy.ec2_cluster

    Bug Fixes:

    • Fixed issue with S3 upload/download for some GraphLab Create models.
    • Fixed configuration file corruption issue that resulted in missing product key.

    New Features:

    • Label Propagation Toolkit: now easily detect communities from large graph data in a semi-supervised setting.
    • Nearest Neighbor Classifier: now leverage nearest-neighbors when performing multi-class classification (or binary classification)
    • Feature Engineering Transformers:
      • TF-IDF as a stateful transformer (by popular demand!). Now easily compute TF-IDF scores at predict time
      • Numeric Imputation
    • Support Spark DataFrame (Spark 1.3)
    • Model Visualization in GraphLab Canvas (by popular demand!)
    • Model Parameter Search (Hyper-parameter tuning) on a Hadoop cluster with Turi Distributed
    • Fine-grained cache control, Predictive Object migration between deployments, and easily consuming query/feedback/result/server logs in Turi Predictive Services
    • New Java and JavaScript client libraries for Turi Predictive Services
    • Deploy scikit-learn models directly to Turi Predictive Services
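
The idea of TF-IDF as a stateful transformer, mentioned in the feature list above, is that document frequencies are learned once and then reused at predict time. The sketch below illustrates that split with a hypothetical `TfIdf` class; it is not the GraphLab Create transformer, and the weighting (raw term count times log(N / document frequency)) is one common variant chosen for this example.

```python
import math
from collections import Counter


class TfIdf:
    """Stateful TF-IDF: fit() learns document frequencies, transform() reuses them."""

    def fit(self, documents):
        self.num_documents = len(documents)
        self.document_frequency = Counter()
        for words in documents:
            # Count each word once per document.
            self.document_frequency.update(set(words))
        return self

    def transform(self, documents):
        scored = []
        for words in documents:
            counts = Counter(words)
            scored.append({
                word: count * math.log(
                    self.num_documents / self.document_frequency[word])
                for word, count in counts.items()
                if word in self.document_frequency  # skip words unseen in fit
            })
        return scored


corpus = [["deep", "learning"], ["deep", "graphs"],
          ["sparse", "graphs"], ["deep", "trees"]]
model = TfIdf().fit(corpus)
scores = model.transform([["deep", "deep", "sparse", "unseen"]])
```

Because the state lives in the fitted object, new documents are scored against the training corpus statistics. In this example the rare word "sparse" receives a higher weight than the frequent word "deep", and the unseen word is dropped.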

    Performance Optimizations:

    • SFrame Query Optimization - new lazy-evaluation pipeline for faster execution of SFrame operations.
    • Kmeans clustering - significant improvements for large k.
    • PageRank and Connected components 20x speedup.
    • Streaming histograms and improved heat map rendering provide substantial Canvas performance improvements (~3800% improvement for heat maps).

    Other Changes:

    • SFrame.apply() operations can now reference other SFrames (by popular demand!)
    • Deduplication toolkit now has a 'grouping_features' parameter, rather than an 'exact' distance component.
    • Support custom Python packages specified for Hadoop execution.
    • Backwards compatibility breaking changes to Job artifacts and model_parameter_search API.

    New Features

    • Data Matching Toolkit
      • Now automatically annotate unstructured text and duplicate records from multiple data sources.
    • Feature Engineering Transformers
      • Powerful building blocks to compose complex feature engineering work flows as reusable Python functions.
      • Some of these include on-the-fly feature hashing, quadratic feature generation, one-hot-encoding, numerical feature binning. Extensibility for the module allows custom transformations to work seamlessly with the built-in ones.

    Enhancements

    • Turi Predictive Services
      • Now scale deployments up and down with one command.
      • Repair/Replace impaired nodes with one command.
      • Simplified definition of Custom Predictive Objects.
        • Note: This change is not backwards compatible, but is much simpler to define and manage.
      • Detailed Cache administration (enable/disable per Predictive Object, overall for the deployment).
      • CORS Support: secure/administer which domains can make REST queries to the deployment.
      • More robust Predictive Object upload/download with AWS S3.
      • New test_query() method on deployment to test queries prior to deploying Predictive Objects.
      • Improved get_status() method with node/cache/object view (programmatically accessible)
      • Turi Predictive Services: Changed response format, will require minor changes in applications calling a new Predictive Service deployment.
        • Note: This change is not backwards compatible. Please note that the previous response JSON will now be wrapped in another layer, to include a uuid and Predictive Object version number. (ex. previous: { 'result' : 'good' }, now: { 'response': { 'result' : 'good' }, 'uuid' : 'x', 'version': 1 })
    • Jobs
      • Simplified definition of Jobs. Note: This change is not backwards compatible, as the Task object is no longer visible. However, these changes are based on customer feedback, and we believe the changes make defining Jobs significantly simpler.
      • Simplified definition for distributed execution of arbitrary python code on remote environments.
      • Support for exception handling, improved visualization, and streaming of logs to GraphLab Canvas™.
    • Toolkits
      • Simplified Model Parameter Search API, with easier result evaluation. Trained models are now returned with Job results.
      • Validation tracking for classifiers and regression models.
      • Improved training stability and accuracy for Factorization Recommender with AdaGrad optimization method.
      • Improved recommendation speed for Factorization Recommender (often 2-3x improvement).
    • Engine
      • CSV parsing now supports multi-char delimiters.
      • Improvements in type inference when parsing CSV files.
      • More efficient compression and computation of Image-type columns.
      • Improved support for Anaconda distributions of Python.
      • Fixes for Spark 1.2 usage in from_rdd/to_rdd.
      • Reduction in SFrame memory utilization.
      • Fixes for Unicode representation in SArray.
      • Fixes for SGraph.edge id column types.
      • Fix for SGraph.select_fields not preserving column ordering.
      • Fix for SGraph.triple_apply failing when vertex and edge have the same field name.
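
Applications that must work against both older and newer deployments need to handle the Predictive Services response format change described above. The helper below is a minimal sketch based only on the example shapes given in the note; the function name `unwrap_response` is hypothetical and not part of any client library.

```python
def unwrap_response(payload):
    """Return (result, uuid, version) for either the old or new response shape."""
    if "response" in payload:
        # New format: the original body is nested under 'response'.
        return payload["response"], payload.get("uuid"), payload.get("version")
    # Old format: the payload is the body itself, with no uuid or version.
    return payload, None, None


old_style = {"result": "good"}
new_style = {"response": {"result": "good"}, "uuid": "x", "version": 1}

old_result = unwrap_response(old_style)
new_result = unwrap_response(new_style)
```

Both calls yield the same body, `{"result": "good"}`. Note that this check would misread an old-format body that itself contains a top-level 'response' key, so it is only suitable where that cannot occur.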

    New Features

    • GraphLab Create™ SDK 1.x (beta)
      • Extend the GraphLab Create™ C++ engine with custom algorithms built directly on top of the SFrame and SGraph data structures. The SDK enables you to natively optimize data transformations and remove bottlenecks caused by object instantiation in Python. The SDK is released under the commercially-friendly BSD3 license, and we welcome contributions from the community. Fork our repo, check out the API docs, and send us a pull request!
    • GraphLab Turi Predictive Services Python Client Library
      • We’ve released the first version of our client library for Turi Predictive Services, making it even easier to query a Model or Predictive Object. Check out this Turi Predictive Services Notebook to learn more and get started. Sample code:
        $ pip install graphlab-service-client
        $ python
        >>> from graphlab_service_client import PredictiveServiceClient
        >>> client = PredictiveServiceClient(endpoint='', api_key='')
        >>> results = client.query('uri', {'users':[111,222,333]})
        
    • Recommender Toolkit
      • Recommend method has been improved to allow for observation data to specify context for recommendations (e.g. recommending items based on day of week).
    • Deep Learning Toolkit
      • Support for categorical features as class labels
    • Classifier Toolkit
      • Logistic Regression model now included in multi-class classifier
    • Turi Predictive Services
      • Several performance improvements, improved logging, added maintenance operations (i.e. flushing logs, enable/disable/restart cache)
    • Data Pipelines
      • Improved EC2 logging, better Job status reported in GraphLab Canvas™, flushing logs

    Fixes

    • SFrame: Improved performance and resiliency for S3 operations (now using a multi-part uploader/downloader).

    New Features

    • SFrame: SQL Integration
      • Local ODBC support for connecting to databases in GraphLab Create™. Easily write SQL queries inside GraphLab Create™ to ingest SQL tables into our SFrames, and write SFrames back to SQL databases with one command. Refer to the ODBC section in the Data Sources and Formats chapter of the User Guide for more information.
    • Toolkits: Multi-class Classification supported in graphlab.classifier.create
      • The boosted trees classifier model can now do multi-class classification.
      • classifier.create can also handle multi-class classification (using boosted trees)
    • Toolkits: Classifier improvements
      • SVM and Logistic regression now support class weights to help with training of imbalanced classes.
      • All classifiers can now handle string columns as target.
    • Toolkits: Recommender now supports Implicit Matrix Factorization
      • Implicit Matrix Factorization using Alternating Least Squares is now an additional option for recommenders on datasets with implicit feedback. It is available using ranking_factorization with the option solver='ials'.
    • Toolkits: Deep Learning Toolkit now exposes feature extraction for trained NeuralNet Classifiers
      • Uses hidden layer activations of a trained NeuralNetClassifier applied to training data as a feature vector
      • With this, a NeuralNetClassifier learned for one task can be repurposed for another task
      • Reduces the need for extensive hyper-parameter search, large data, and training time to harness the power of deep learning effectively.
      • See the NeuralNetClassifier.extract_features API documentation for how to use this.
    • Toolkits: Nearest Neighbors Improvements
      • Distance functions can now be customized when building a NearestNeighborsModel. This enables more flexibility in describing the similarity between rows.
      • Distance functions can be combined using the composite_params options to enable sophisticated distance measures.
      • Levenshtein distance, also known as "edit distance", is now available and is the default method for comparing string columns when creating NearestNeighborsModels.
      • Distances are now exposed through a top level name space, graphlab.distances, and can be used when doing SFrame.apply operations.
      • Refer to the graphlab.nearest_neighbors.create API documentation for more details.
    • Toolkits: Kmeans now allows the user to set initial centers
      • Also allows the user to retrieve the initial centers when they are chosen automatically by the kmeans++ algorithm. Refer to the graphlab.kmeans.create API documentation for more details.
    • Toolkits: Distributed Execution of Model Parameter Search on EC2
      • Model Parameter Search can now be run in parallel on EC2. Simply specify the number of hosts to use, and the Job will automatically spin up that many instances and distribute the specified experiments across them. When the experiments are completed, the results are automatically tabulated and available as an SFrame.
    • Deployment: Turi Predictive Services Now Support Custom Predictive Objects
      • Now specify a Python function to introduce business rules or ensemble GraphLab Models as a custom Predictive Object
      • Refer to the new GraphLab Turi Predictive Services section in the Deployment chapter of the User Guide for more information.
    • Deployment: Many Improvements to Turi Predictive Services (including 1-node clusters now)
      • Add a Predictive Object to a Predictive Service from an S3 path
      • Support 1-2 node Predictive Service Deployments now
      • Improved logging and request handling (easier to retrieve and parse request logs)
      • Better status retrieval and monitoring ( get_status() now returns detailed status from each node in deployment)
      • GraphLab Canvas™ now shows plots of important metrics for monitoring Predictive Service Deployments
      • Refer to the new GraphLab Turi Predictive Services section in the Deployment chapter of the User Guide for more information
    • Deployment: Improved Efficiency on EC2 Instances
      • Temporary objects now use ephemeral storage, so the instance's local storage is used for SFrame, SGraph, SArray, and Model objects. This greatly improves the performance of Jobs run on EC2 and of operating Predictive Service Deployments
    • GraphLab Canvas™: Now supports images
      • Thumbnails for images are displayed in the table view
      • Table layout is more uniform and responsive to window resizing
      • Large performance improvements in loading and interacting with GraphLab Canvas™
    • Bug Fixes
      • Improved robustness of pylambda_workers, which now correctly clean up allocated memory in certain situations
      • Better support for CentOS 6.4 where Python 2.7 is installed through Anaconda. pylambda_workers now use the same libpython27.so path as Python 2.7
      • SFrame Table navigation displays correct index of last row
      • datetime_to_str and str_to_datetime support missing values
    • Compatibility Changes
      • Environment objects (e.g. graphlab.deploy.environment.EC2, graphlab.deploy.environment.Hadoop, etc.) may, in some cases, need to be recreated with GraphLab Create 1.1. This is due to a format change for these objects that is not compatible with GraphLab Create 1.0. Please recreate these objects if they do not load successfully in GraphLab Create 1.1
      • The positional arguments of graphlab.toolkits.model_parameter_search changed order, so if you have code that calls this API, please update it. This API was changed to be consistent with graphlab.deploy.job.create and the new graphlab.deploy.parallel_for_each APIs
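
The nearest-neighbors notes above mention Levenshtein ("edit") distance and combining several distances into one measure. The following is a minimal plain-Python sketch of both ideas and does not use the GraphLab Create API. The helper names and the (column, function, weight) layout are hypothetical illustrations and are not the composite_params format.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def absolute_difference(x, y):
    """Distance between two numeric values."""
    return abs(x - y)


def composite_distance(row_a, row_b, components):
    """Weighted sum of per-column distances.

    components: list of (column_name, distance_function, weight) tuples.
    This layout is an assumption made for illustration only.
    """
    return sum(weight * dist(row_a[col], row_b[col])
               for col, dist, weight in components)


components = [("name", levenshtein, 1.0), ("age", absolute_difference, 0.5)]
row_a = {"name": "kitten", "age": 3}
row_b = {"name": "sitting", "age": 5}
total = composite_distance(row_a, row_b, components)  # 3 * 1.0 + 2 * 0.5
```

Weighting each component lets a string column and a numeric column contribute on comparable scales. Refer to the graphlab.nearest_neighbors.create API documentation for the actual option format.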

    New Features

    • Deep Learning toolkit for building customized neural nets
    • Neural net classifier
    • Image type
    • Turi Predictive Services
    • Spark integration
    • Datetime support
    • Visualization of bivariate relationships (scatter plot, heat map, line/bar chart)
    • New recommender models for implicit data
    • New solvers for K-means and topic models
    • Added get_similar_items for factorization-based recommender models

    Compatibility Changes

    • API changes for better organization of various models in toolkits

    Known Issues

    • EC2 Environment objects for use with Jobs must provide the num_hosts parameter.

    New Features

    • support for avro file format when creating SArray
    • count_ngrams supports counting n-gram words or characters
    • new SFrame.groupby aggregators graphlab.aggregate.ARGMIN and ARGMAX, which let you pick a value from one column according to the value of another column -- e.g. for each user, pick the item that has the highest ranking score
    • SFrame.read_csv supports loading from URL
    • GraphLab Canvas™ support for boosted trees visualization
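
To illustrate what an ARGMAX-style aggregation computes (for each user, the item with the highest ranking score), here is a small plain-Python sketch. It does not call graphlab.aggregate. The function name, argument names, and sample data are hypothetical.

```python
def argmax_by_group(rows, key, agg_column, output_column):
    """For each distinct value of `key`, return the `output_column` value
    from the row where `agg_column` is largest."""
    best = {}
    for row in rows:
        group = row[key]
        if group not in best or row[agg_column] > best[group][agg_column]:
            best[group] = row
    return {group: row[output_column] for group, row in best.items()}


ratings = [
    {"user": "alice", "item": "A", "score": 0.2},
    {"user": "alice", "item": "B", "score": 0.9},
    {"user": "bob", "item": "A", "score": 0.7},
    {"user": "bob", "item": "C", "score": 0.1},
]
top_item = argmax_by_group(ratings, "user", "score", "item")
# top_item == {"alice": "B", "bob": "A"}
```

An ARGMIN-style aggregation would be the same with the comparison reversed.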

    Compatibility Changes

    • SFrame.unpack parameter new_colum_name_prefix is renamed to column_name_prefix
    • recommender.create parameter names have changed
      • user_column is renamed to user_id
      • item_column is renamed to item_id
      • target_column is renamed to target
      • The loss function for linear_model, matrix_factorization, and factorization_model is now the average loss instead of total loss. Hence, the new input regularization parameter should be on the scale of old_regularization/num_datapoints.
      • Handling of binary targets with method=matrix_factorization, factorization_model, and linear_model: if binary_targets=True and the target column contains values other than 0 or 1, an error is thrown.
    • recommender.MatrixFactorizationModel and recommender.FactorizationModel parameter names have changed
      • unobserved_rating_regularization is renamed to ranking_regularization
    • toolkit recommender model method .score() is renamed to .predict()
    • individual recommender models no longer have create() methods
    • random_split_by_user no longer supports the min_items_per_user option
    • boosted_trees.create target_column renamed to target
    • svm.create solver = ‘sdca’ is no longer supported. ‘lbfgs’ is the only solver supported
    • graphlab.pagerank.create renamed random_jump_prob to reset_prob
    • graphlab.kmeans.create
      • renamed max_iter to max_iterations
      • renamed data to dataset
    • topic_model.create now also accepts an SFrame with a single column of type dict as the input dataset
    • graph_analytics models and boosted_trees model created and saved in 0.9 or earlier are no longer loadable in 0.9.1
    • model_parameter_search parameters renamed:
      • train_set to train_set_path
      • test_set to test_set_path
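
The note above about the switch from total loss to average loss implies a simple rescaling of the regularization parameter. Minimizing total_loss + old_regularization * penalty has the same solution as minimizing total_loss / num_datapoints + (old_regularization / num_datapoints) * penalty. The sketch below only shows this arithmetic. The function name and the sample values are hypothetical.

```python
def rescale_regularization(old_regularization, num_datapoints):
    """Convert a regularization value tuned for a total-loss objective
    into the equivalent value for an average-loss objective."""
    if num_datapoints <= 0:
        raise ValueError("num_datapoints must be positive")
    return old_regularization / float(num_datapoints)


# A value of 50.0 tuned on 100,000 observations under the old objective
new_regularization = rescale_regularization(50.0, 100000)
```

Treat the result as a starting point on the right scale rather than an exact replacement, since other defaults may also have changed between releases.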

    Known Issues

    • Deployment Environment objects created with deploy.environment in 0.9 do not load with 0.9.1; to fix this, recreate them in 0.9.1. Example:
      • [ERROR] Error opening from path: /home/rajat/.graphlab/artifacts/ec2.Environment, error: No module named _context
      • To fix, delete the following file, restart the Python session, and recreate the Environment in 0.9.1:
        rm /home/rajat/.graphlab/artifacts/ec2.Environment

    New Features

    SFrame:

    • stack -- convert an SFrame with a wide column ( list / array / dict ) to an SFrame with tall columns, where each value in the wide column is placed in its own row
    • unstack -- transforms an SFrame by concatenating values from one or two columns into one column
    • pack_columns -- pack two or more columns of an SFrame into one column of list / array / dict type
    • unpack -- expand one column of SFrame to multiple columns
    • sort -- sort the SFrame by one or more columns
    • flat_map -- arbitrary transformation to transform each row of an SFrame to multiple rows in the new SFrame
    • read_csv -- support read from a directory
    • read_csv -- automatic type inference
    • read_csv_with_errors -- support returning error rows as a separate SArray
    • groupby support new aggregators -- CONCAT and SELECT_ONE
    • dropna -- support removing rows from SFrame that have missing values.
    • dropna_split -- support splitting an SFrame into two SFrames, one without missing values and the other with missing values
    • fillna -- support filling missing values in a given column with some value
    • add_columns -- support adding columns from another SFrame, in addition to adding a collection of SArrays
    • add_row_number -- support adding a sequentially increasing row number as a new column to an existing SFrame
    • append -- is now lazily evaluated
    • Many performance improvements -- join , logical_filter , sample , file format, SFrames now handle many columns (>1K)
    • Improved file-handle limit
    • We now only have a soft dependency on numpy and pandas. (i.e. numpy arrays and pandas DataFrames are supported if numpy and pandas are available. But numpy / pandas are not required to use GraphLab Create™)
    • Pretty print of SFrame
    • print_rows -- customized control over pretty printing of the SFrame
    • show -- show SFrame in GraphLab Canvas™

    SArray:

    • fillna -- fill missing values with some non missing value
    • from_const -- create an SArray from constant
    • sketch_summary -- extended to support complex SArray type like list / array / dict . element_length_summary , element_sub_sketch , dict_key_summary , dict_value_summary are supported.
    • unpack -- unpack SArray of list / array / dict type to an SFrame
    • show -- show SArray in GraphLab Canvas™
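
The stack operation listed under SFrame above turns one "wide" row into several "tall" rows. The plain-Python sketch below shows the idea for a list-typed column, using a list of dicts in place of an SFrame. The function name and sample data are hypothetical. In this sketch a row with an empty list produces no output rows, which may differ from the library's behavior.

```python
def stack_list_column(rows, column, new_column):
    """Emit one output row per element of rows[i][column], repeating the
    remaining columns for every element."""
    stacked = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != column}
        for value in row[column]:
            new_row = dict(rest)
            new_row[new_column] = value
            stacked.append(new_row)
    return stacked


wide = [
    {"user": 1, "friends": [2, 3]},
    {"user": 2, "friends": [4]},
]
tall = stack_list_column(wide, "friends", "friend")
# tall has three rows: (1, 2), (1, 3), (2, 4)
```

unstack goes the other way, gathering values that share the same remaining columns back into a single list or dict value.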

    SGraph:

    • SGraph edges and vertices are now represented as SFrames: g.edges and g.vertices
    • mutating vertices and edges of SGraph -- g.edges and g.vertices allow the graph's vertices and edges to be mutated
    • triple_apply -- apply a transform function to each edge and its associated source and target vertices in parallel, which makes it possible to write custom graph algorithms
    • show now uses GraphLab Canvas instead of NetworkX for layout and matplotlib for display
      • Additional parameters: per-vertex coloring, custom layout, vertex/edge label on hover


    Toolkits:

    • new tree_ensembles toolkit with boosted_trees model
    • new classification toolkit with Support-Vector-Machine (SVM) for binary classification
    • new nearest_neighbors toolkit
    • new text toolkit with topic modeling (LDA) and many supporting utilities
    • text.utilities:
      • tf_idf
      • stopwords
      • parse_sparse
      • parse_docword
      • random_split - implementation specific for splitting text in bag-of-words format
      • Also see relevant string and dictionary processing utilities in SArray such as count_words , dict_trim_by_key , dict_trim_by_value
    • recommender toolkit:
      • more efficient implementation of the training/create procedure for item_similarity
        • fast and in-memory for fewer than 20K items
        • uses SGraph (out-of-core) for more than 20K items
      • ItemSimilarityModel.get_similar_items accepts a list of items and will return a list of the most similar items for each
      • Model parameters (linear coefficients, latent factors, global bias terms) of matrix factorization, factorization machine, and linear models are now exposed through model.get("coefficients") or model["coefficients"]
    • Regression/classification toolkits:
      • feature rescaling: Automatic feature rescaling in linear/logistic/SVM
      • model_parameter_search -- a new toolkit utility which creates a GraphLab job to asynchronously run a hyperparameter search, using multiple EC2 hosts or Yarn containers
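
As background for the tf_idf utility listed under text.utilities above, the sketch below computes one common TF-IDF formulation in plain Python: term count multiplied by log(number of documents / number of documents containing the term). This formulation is an assumption, and the toolkit's exact weighting may differ. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document, where each document
    is given as a list of words."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: count * math.log(float(n_docs) / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "bird"],
]
weights = tf_idf(docs)
# "the" appears in every document, so its weight is 0.0 everywhere
```

Words that occur in every document receive zero weight, which is why TF-IDF is often paired with stopword removal when preparing text for topic models.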


      Miscellaneous

      • Many GraphLab internal performance configuration variables are now accessible via gl.set_runtime_config() and gl.get_runtime_config() .


      Deployment

      • This module includes tools for building and executing Data Pipelines for ETL and machine learning in local (synchronous/blocking and asynchronous/non-blocking) and remote (EC2 and Hadoop) environments
      • Direct integration to run GraphLab jobs on top of Yarn


      Visualization

      • This module contains all the methods associated with GraphLab Canvas, GraphLab Create™'s visualization platform.
      • GraphLab Canvas is designed to be a companion to and seamlessly augment your development workflow
        • we advocate side-by-side UX: terminal, text editor, GraphLab Canvas in browser
      • Interactive browser experience for GraphLab Canvas (navigate between views and explore data)
        • view table and summary view of SFrame
        • view plot and stats for SArray
        • view and customize visuals for SGraph
      • optionally output GraphLab Canvas™ views in IPython Notebook
      • supported HTML5 capable browsers
        • Chrome
        • Firefox
        • Safari

      Compatibility Changes:

      SFrame
      • group() is now deprecated and has been replaced by sort()
      Toolkits
      • saved models from previous versions are no longer loadable in 0.9
      • models are now versioned

      Known Issues:

      SFrame
      • unpack() of a dictionary only extracts keys from the first 100 rows.
      SGraph
      • triple_apply on machines with many CPUs may trip the operating system file handle limit. This may result in a system crash during triple_apply. If this is the case, run “ulimit -n 2048” from the console before initiating a new python session.
      Toolkits
      • factorization_model in the recommender toolkit has a known problem during initialization. When training on datasets with a large number of users, items, or side features using a machine with a large number of cores, initialization could take a long time. The fix will be available in the next immediate release.
      Visualization
      • GraphLab Canvas™ “browser” target (the default browser-based experience) only works with a GraphLab Create™ installation on the local machine, not over ssh or other remote mechanism. It should still work when using a remote engine instance (such as EC2).
      • GraphLab Canvas™ “ipynb” target only works with IPython Notebook over http (not https).

    New Features

    ML Toolkits:

    • new recommender API for consistent input across models
    • improved regression API for easy evaluation and prediction on test data
    • categorical variables, sparse vector and dense vector data types can now be used with our regression models
    • linear and logistic regression can work with up to 100K features
    • GraphLab solvers for logistic regression
    • new Factorization Model which allows users to include item and user side features/meta data to improve recommendations
    • L1 & L2 regularization for linear and logistic regression
    • model parameters are now accessible for linear and logistic regression
    • flexible evaluation module for confusion_matrix, accuracy, rmse, max_error
    • new users and new observations may be included when scoring or making recommendations

    SFrame:

    • join
    • append
    • dictionary support and manipulation features (dict_* functions) and text to bag of words conversion
    • list type support (list of arbitrary types)
    • additional CSV parsing options: user defined na_values, skip bad lines, line limit
    • groupby - improved api, you can now specify the aggregation column name

    SGraph:

    • SGraph (“Scalable Graph”) is replacing the in-memory Graph structure
    • SGraph scales to graphs larger than the machine’s RAM
    • many graph operations are faster including: pagerank, get_edges(fields=...), get_vertices(fields=...), save/load
    • some graph operations are slower in this version including: kcore, connected_components, and shortest path. The performance impact depends on the diameter of the graph.

    Compatibility Changes

    • model.recommend(users) accepts an SArray or list of users now rather than an SFrame with users, items, and ratings
    • model.score(dataset) now outputs an SArray as opposed to an SFrame
    • the score method has replaced the predict method for models accessed through the recommender toolkit
    • saved models must be retrained as they are now saved as an archive directory
    • formerly saved SFrames must be resaved as they are now saved as an archive directory
    • groupby API has been updated
    • recommender API has multiple changes
    • logistic regression accepts target values as 0/1 rather than -1/+1
    • Graph is deprecated, replaced with SGraph
    • load_graph(format=’adj’) is no longer supported
    • add_vertices() and add_edges() no longer take a “format” argument. The format is inferred automatically
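
Because logistic regression now expects target values of 0/1 rather than -1/+1, existing label columns need a one-time conversion. The plain-Python sketch below shows the mapping on a list. The function name is hypothetical, and applying it to an SFrame column is left to the reader.

```python
def to_zero_one(labels):
    """Map labels from {-1, +1} to {0, 1}; reject anything else."""
    mapping = {-1: 0, 1: 1}
    converted = []
    for label in labels:
        if label not in mapping:
            raise ValueError("expected -1 or +1, got %r" % (label,))
        converted.append(mapping[label])
    return converted


converted = to_zero_one([-1, 1, 1, -1])
# converted == [0, 1, 1, 0]
```

Rejecting unexpected values makes the conversion fail loudly instead of silently producing targets the model would misinterpret.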

    Known Issues

    • SFrames with many columns (>20) or the use of SGraph with many fields may hit file handle limits. Run “ulimit -n 4096” prior to running python to increase the file handle limit to a comfortable level
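
Before starting a long session, it can help to check whether the current file handle limit is already at the level suggested above. The sketch below uses Python's standard resource module, which is available on Unix-like systems only. It reads the limit and does not change it. The function name and the default threshold are illustrative.

```python
import resource


def file_handle_limit_ok(wanted=4096):
    """Return True if the soft limit on open files is at least `wanted`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft == resource.RLIM_INFINITY or soft >= wanted


if not file_handle_limit_ok():
    print("File handle limit is low; consider running 'ulimit -n 4096' "
          "in the shell before starting python.")
```

The check only reports the limit inherited by the current Python process, so the fix is still to run ulimit in the shell and then start a new session.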

    New Features

    ML Toolkits:

    • The item_similarity model supports data sets having a large number of unique items. Using the `only_top_k` argument, users can choose to store only the k most similar items for each item, dropping the amount of memory required and reducing the amount of work needed for predictions. As an example, the model will scale to 500,000 unique items on a machine with 16GB of RAM.
    • We offer regression methods with a variety of backend optimization solvers for regression and classification. Check out graphlab.linear_regression.create() and graphlab.logistic_regression.create().
    • This release includes a wrapping of the open source library Vowpal Wabbit for online learning problems: check out graphlab.vowpal_wabbit.create()
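
The memory saving described for only_top_k above comes from keeping, for each item, only its k most similar neighbors. The plain-Python sketch below illustrates that pruning step on a small dict of similarity scores. It is not the item_similarity implementation, and the function name and data are hypothetical.

```python
import heapq


def keep_top_k(similarities, k):
    """For each item, keep only the k neighbors with the highest score.

    similarities: {item: {neighbor: score}}
    Returns {item: [(neighbor, score), ...]} sorted by descending score.
    """
    return {
        item: heapq.nlargest(k, neighbors.items(), key=lambda pair: pair[1])
        for item, neighbors in similarities.items()
    }


similarities = {
    "A": {"B": 0.9, "C": 0.1, "D": 0.5},
    "B": {"A": 0.9, "C": 0.3, "D": 0.2},
}
top2 = keep_top_k(similarities, 2)
# top2["A"] == [("B", 0.9), ("D", 0.5)]
```

Storage then grows with the number of items times k rather than with the square of the number of items, at the cost of discarding weaker similarities.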

    SFrame:

    • SFrame now supports groupby operations. Many classical operators such as COUNT, SUM are supported. An approximate group QUANTILE is also supported. This was one of the most requested features from users, and we are happy to deliver it.
    • SArrays now support a sketch_summary() function which returns a sketch object containing exact and approximate summary statistics of the array. Using sketch_summary() you can quickly see a bunch of statistics about your SFrame.
    • Long file download / CSV parsing operations can now be cancelled with Ctrl-C. This should help all of us that have accidentally run the long-running download one too many times.
    • Lambda operations can now be applied to SFrames. Ex:
      sf['download_time'] = sf.apply(lambda x: x['filesize'] / x['download_speed'])
    • Data download from S3 now supports all AWS regions.

    Graph:

    • get_neighborhood is a graph querying function to make generation of vertex neighborhoods easy.

    AWS / EC2:

    • Launch EC2 instances in all AWS regions (us-east-1, us-west-1, eu-west-1, ap-northeast-1, ap-southeast-1, ap-southeast-2, sa-east-1). Yep, you can still launch in us-west-2 and that is the default.
    • Encryption is on by default (using strong standards-based encryption).
    • EC2 launches can specify CIDR rules to restrict IP address access. EC2 launches can also specify tags to document your EC2 usage.

    Compatibility Changes

    • Column Aggregate Functions (array.sum(), array.var()) now skip missing values instead of throwing an exception
    • All of the models in the recommender toolkit should now be accessed through graphlab.recommender.<model_name> instead of graphlab.<model_name>.
    • The column name of the output of predict() of all recommender models changed from 'rating' to 'prediction' , to be more consistent with the semantics of the operation.
    • Saved models need to be retrained and saved. Unfortunately we were so focused on bringing new features like Linear Regression and better integration with SFrames we weren’t able to maintain backwards compatibility with the binary format for models. Simply rerun the code to train the model and save it again, and then load the newly saved model with GraphLab Create™ 0.2.0.
    • We redesigned the underlying implementation of item_similarity which is now faster, but in doing so we had to re-engineer how we handle new users. Making predictions for new users will be added back in an upcoming release.