graphlab.Sketch

class graphlab.Sketch(array)

The Sketch object contains a sketch of a single SArray (a column of an SFrame). Using a sketch representation of an SArray, many approximate and exact statistics can be computed very quickly.

To construct a Sketch object, the following methods are equivalent:

>>> my_sarray = graphlab.SArray([1,2,3,4,5])
>>> sketch = graphlab.Sketch(my_sarray)
>>> sketch = my_sarray.sketch_summary()

Typically, the SArray is a column of an SFrame:

>>> my_sframe = graphlab.SFrame({'column1': [1,2,3]})
>>> sketch = graphlab.Sketch(my_sframe['column1'])
>>> sketch = my_sframe['column1'].sketch_summary()

The sketch computation is fast, with time complexity approximately linear in the length of the SArray. Once the Sketch has been computed, every query function returns almost instantly.
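To illustrate why the exact statistics cost only one linear pass and are then cheap to query, here is a minimal pure-Python sketch. It is not GraphLab's implementation: the RunningStats class and its attribute names are hypothetical, and the use of population variance (dividing by n) is an assumption made for the example.

```python
import math


class RunningStats(object):
    """Hypothetical one-pass accumulator for exact summary statistics."""

    def __init__(self):
        self.size = 0            # every element seen, including None
        self.num_undefined = 0   # elements equal to None
        self.total = 0.0
        self.minimum = None
        self.maximum = None
        self._mean = 0.0
        self._m2 = 0.0           # running sum of squared deviations (Welford)

    def update(self, value):
        self.size += 1
        if value is None:
            self.num_undefined += 1
            return
        defined = self.size - self.num_undefined
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)
        # Welford's update keeps mean and variance exact without a second pass.
        delta = value - self._mean
        self._mean += delta / defined
        self._m2 += delta * (value - self._mean)

    def mean(self):
        return self._mean

    def var(self):
        # Population variance; an assumption made for this illustration.
        defined = self.size - self.num_undefined
        return self._m2 / defined if defined else 0.0

    def std(self):
        return math.sqrt(self.var())


stats = RunningStats()
for value in [1, 2, 3, 4, 5, None]:
    stats.update(value)   # one O(1) update per element, so O(n) overall
```

After the single pass, each query (stats.mean(), stats.var(), and so on) only reads stored state, which is why it returns immediately.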

A sketch can compute the following information depending on the dtype of the SArray:

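For numeric columns, the following information is provided exactly: length (size()), number of undefined values (num_undefined()), minimum (min()), maximum (max()), mean (mean()), sum (sum()), variance (var()), and standard deviation (std()).

And the following information is provided approximately: number of unique values (num_unique()), quantiles (quantile()), frequent items (frequent_items()), and the frequency count of any value (frequency_count()).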

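For non-numeric columns (str), the following information is provided exactly: length (size()) and number of undefined values (num_undefined()).

And the following information is provided approximately: number of unique values (num_unique()), frequent items (frequent_items()), and the frequency count of any value (frequency_count()).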

For an SArray of type list or array, there is a sub sketch covering all sub-elements. The sub sketch flattens all list/array values and then computes a sketch summary over the flattened values. The element sub sketch may be retrieved through:

>>> sketch.element_summary()
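The flattening step described above can be pictured in plain Python. This is an illustration only, not GraphLab's code: flatten_elements is a hypothetical helper, and skipping rows that are None is an assumption made for the example.

```python
def flatten_elements(rows):
    """Concatenate the values of every list row into one flat list."""
    flat = []
    for row in rows:
        if row is None:      # assumption: undefined rows contribute nothing
            continue
        flat.extend(row)
    return flat


rows = [[1, 2], [3], None, [4, 5, 6]]
flat = flatten_elements(rows)

# A summary over the flattened values, analogous to an element sub sketch.
summary = {
    "size": len(flat),
    "min": min(flat),
    "max": max(flat),
    "mean": sum(flat) / float(len(flat)),
}
```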

For an SArray of type dict, there are sub sketches for both the dict keys and the dict values. They may be retrieved through:

>>> sketch.dict_key_summary()
>>> sketch.dict_value_summary()

For an SArray of type dict, you can also pass a list of dictionary keys to the sketch_summary function. This generates one sub sketch for each key. For example:

>>> sa = graphlab.SArray([{'a':1, 'b':2}, {'a':3}])
>>> sketch = sa.sketch_summary(sub_sketch_keys=["a", "b"])

Then the sub summary may be retrieved by:

>>> sketch.element_sub_sketch()

or, for a subset of the keys:

>>> sketch.element_sub_sketch(["a"])
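Conceptually, a per-key sub sketch summarizes the values stored under one key across all rows. The plain-Python sketch below shows that grouping only; values_by_key is a hypothetical helper, and the treatment of rows that lack a key (they are skipped here) is an assumption rather than documented GraphLab behavior.

```python
def values_by_key(rows, keys):
    """Collect, for each requested key, the values found under it."""
    grouped = dict((key, []) for key in keys)
    for row in rows:
        if row is None:
            continue
        for key in keys:
            if key in row:   # assumption: rows without the key are skipped
                grouped[key].append(row[key])
    return grouped


rows = [{'a': 1, 'b': 2}, {'a': 3}]
grouped = values_by_key(rows, ['a', 'b'])
# Each list in `grouped` is the input that one per-key summary would describe.
```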

Similarly, for an SArray of type vector (array), you can pass a list of integers, each an index into the vector, to get one sub sketch per index. For example:

>>> sa = graphlab.SArray([[100,200,300,400,500], [100,200,300], [400,500]])
>>> sketch = sa.sketch_summary(sub_sketch_keys=[1,3,5])

Then the sub summary may be retrieved by:

>>> sketch.element_sub_sketch()

or, for a subset of the keys:

>>> sketch.element_sub_sketch([1,3])

Please see the individual function documentation for details about each of these statistics.

Parameters:

array : SArray

Array to generate sketch summary.

background : boolean

If True, the sketch construction will return immediately and the sketch will be constructed in the background. While this is going on, the sketch can be queried incrementally, but at a performance penalty. Defaults to False.
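When a sketch is built in the background, a caller will usually poll it until it is complete. The helper below is a minimal sketch of such a loop using only sketch_ready(), num_elements_processed(), and cancel() from this page. The function wait_for_sketch and its poll_seconds and timeout_seconds parameters are hypothetical and not part of GraphLab.

```python
import time


def wait_for_sketch(sketch, poll_seconds=0.5, timeout_seconds=None):
    """Poll a background sketch; return (finished, progress_snapshots)."""
    progress = []
    start = time.time()
    while not sketch.sketch_ready():
        # Record how many elements have been processed so far.
        progress.append(sketch.num_elements_processed())
        if timeout_seconds is not None and time.time() - start >= timeout_seconds:
            sketch.cancel()   # give up and stop the background computation
            return False, progress
        time.sleep(poll_seconds)
    return True, progress


# Intended use (requires GraphLab, so it is shown as a comment only):
#   sketch = graphlab.Sketch(my_sarray, background=True)
#   finished, progress = wait_for_sketch(sketch, poll_seconds=1.0)
```

Waiting for sketch_ready() before querying avoids the performance penalty of incremental queries mentioned above.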

Methods

Sketch.cancel() Cancels a background sketch computation immediately if one is ongoing.
Sketch.dict_key_summary() Returns the sketch summary for all dictionary keys.
Sketch.dict_value_summary() Returns the sketch summary for all dictionary values.
Sketch.element_length_summary() Returns the sketch summary for the element length.
Sketch.element_sub_sketch([keys]) Returns the sketch summary for the given set of keys.
Sketch.element_summary() Returns the sketch summary for all element values.
Sketch.frequency_count(element) Returns a sketched estimate of the number of occurrences of a given element.
Sketch.frequent_items() Returns a sketched estimate of the most frequent elements in the SArray based on the SpaceSaving sketch.
Sketch.max() Returns the maximum value in the SArray.
Sketch.mean() Returns the mean of the values in the SArray.
Sketch.min() Returns the minimum value in the SArray.
Sketch.num_elements_processed() Returns the number of elements processed so far.
Sketch.num_undefined() Returns the number of undefined elements in the SArray.
Sketch.num_unique() Returns a sketched estimate of the number of unique values in the SArray based on the HyperLogLog sketch.
Sketch.quantile(quantile_val) Returns a sketched estimate of the value at a particular quantile between 0.0 and 1.0.
Sketch.size() Returns the size of the input SArray.
Sketch.sketch_ready() Returns True if the sketch has been executed on all the data.
Sketch.std() Returns the standard deviation of the values in the SArray.
Sketch.sum() Returns the sum of all the values in the SArray.
Sketch.var() Returns the variance of the values in the SArray.
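For intuition about why frequent_items() and frequency_count() return estimates rather than exact counts, here is a minimal pure-Python version of the Space-Saving idea. It is an illustration under stated assumptions and not GraphLab's implementation: the function name space_saving and the counter budget k are hypothetical.

```python
def space_saving(stream, k):
    """Track at most k counters; return (item, estimated_count) pairs."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            # Evict the item with the smallest counter; the newcomer inherits
            # that count plus one, so estimates can overcount, never undercount.
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


stream = ['a'] * 5 + ['b'] * 3 + ['c', 'd']
top = space_saving(stream, k=3)
```

With only three counters, 'd' replaces 'c' and inherits its count, so the estimate for 'd' is 2 although it occurred once. Memory stays bounded by k regardless of the stream length.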