Distributing model parameter search

All model parameter search methods, as well as cross_val_score, can run their jobs either locally or remotely.

Locally

By default, jobs are scheduled to run locally in an asynchronous fashion. This is called a LocalAsync environment.
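To make "scheduled locally in an asynchronous fashion" concrete, here is a minimal sketch in plain Python. It does not use GraphLab and is not how the LocalAsync environment is implemented. It only illustrates the general pattern of submitting one job per parameter combination to a local worker pool and collecting the results afterwards. The names evaluate, expand_grid, and local_async_search, as well as the scoring formula, are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    # Hypothetical stand-in for training and scoring one model.
    score = -(params["step_size"] - 0.1) ** 2 - 0.01 * params["max_depth"]
    return {"params": params, "score": score}


def expand_grid(grid):
    # Turn {"a": [1, 2], "b": [3]} into one dict per parameter combination.
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def local_async_search(grid, max_workers=4):
    combos = expand_grid(grid)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit() returns immediately, so every job is queued before any result is read.
        futures = [pool.submit(evaluate, combo) for combo in combos]
        # result() blocks until the corresponding job has finished.
        return [future.result() for future in futures]


grid = {"max_depth": [2, 4], "step_size": [0.05, 0.1, 0.5]}
results = local_async_search(grid)
best = max(results, key=lambda r: r["score"])
```

The caller gets control back as soon as the jobs are submitted and only blocks when it asks for results, which is the behaviour the asynchronous local default is meant to provide.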

Remotely

You may also run jobs on an EC2 cluster or a Hadoop cluster. This is especially useful when you want to perform a larger-scale parameter search.

To launch a job on an EC2 cluster, you first create an EC2 environment and pass it into the environment argument:

import graphlab

# Configure and start a four-host EC2 cluster.
ec2config = graphlab.deploy.Ec2Config()
ec2 = graphlab.deploy.ec2_cluster.create(name='mps',
                                         s3_path='s3://bucket/path',
                                         ec2_config=ec2config,
                                         num_hosts=4)

# Run the parameter search on the cluster instead of locally.
j = graphlab.model_parameter_search.create((train, valid),
                                           my_model, my_params,
                                           environment=ec2)

To launch a job on a Hadoop cluster, you instead create a Hadoop environment and pass it into the environment argument:

hd = graphlab.deploy.hadoop_cluster.create(name='hadoop-cluster',
                                           turi_dist_path=<path to installation>)

j = graphlab.model_parameter_search.create((train, valid),
                                           my_model, my_params,
                                           environment=hd)

For more details on creating EC2- and Hadoop-based environments, check out the API docs or the Deployment chapter of the User Guide.

When getting started, it is useful to keep perform_trial_run=True to make sure you are creating your models properly.
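The value of a trial run can be illustrated with a small plain-Python sketch. This is an assumption about the general idea, not a description of what GraphLab does internally when perform_trial_run=True: the sketch fits the first parameter combination on a small sample before the full search, so that a misconfigured model fails immediately rather than after many jobs have been launched. The functions fit_constant_model and search, and the sample_size argument, are hypothetical.

```python
def fit_constant_model(params, train, valid):
    # Hypothetical model factory: predict the training mean plus a shift,
    # and report the mean absolute error on the validation data.
    if "shift" not in params:
        raise KeyError("missing required parameter 'shift'")
    prediction = sum(train) / len(train) + params["shift"]
    return sum(abs(v - prediction) for v in valid) / len(valid)


def search(combos, fit, train, valid, perform_trial_run=True, sample_size=3):
    if perform_trial_run:
        # Fail fast: fit one combination on a small sample before the full search.
        fit(combos[0], train[:sample_size], valid[:sample_size])
    return [(params, fit(params, train, valid)) for params in combos]


train = [1.0, 2.0, 3.0, 4.0]
valid = [2.0, 3.0]
results = search([{"shift": 0.0}, {"shift": 1.0}], fit_constant_model, train, valid)

# A misconfigured parameter set is caught by the trial run.
try:
    search([{"wrong_name": 1.0}], fit_constant_model, train, valid)
    trial_run_failed = False
except KeyError:
    trial_run_failed = True
```

In this sketch the cost of the check is one extra fit on a few rows, which is small compared with discovering the same error after a cluster has been started.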