Clustering

helps determine the intrinsic grouping among unlabeled numeric data.

Overview

Clustering helps determine the intrinsic grouping among unlabeled numeric data. All categorical fields are ignored.

Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. Clustering is a type of unsupervised learning, which is a technique in which you can draw inferences from datasets consisting of data without labeled responses.

Use this Snap:
  • to discover customer segments for marketing purposes
  • to classify different species of plants and animals
  • to group books on the basis of topics and information

Clustering Snap dialog

Prerequisites

None.

Limitations and known issues

None.

Snap views

View Description Examples of upstream and downstream Snaps
Input #1 A document with numeric fields. A Snap that offers documents. Examples:
Input #2 Optional. A document that contains a model built by another Clustering Snap. If the model is not available, the Snap builds a model. A Snap that offers documents which provide a clustering model built by the Clustering Snap. Example: A combination of File Reader and JSON Parser
Output #1 A document with the input data and assigned cluster index. A Snap that accepts documents. Examples:
Output #2 Optional. A document that represents the model built by the Snap. A Snap that accepts documents. Examples:
Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the pipeline by choosing one of the following options from the When errors occur list under the Views tab. The available options are:

  • Stop Pipeline Execution Stops the current pipeline execution when the Snap encounters an error.
  • Discard Error Data and Continue Ignores the error, discards that record, and continues with the remaining records.
  • Route Error Data to Error View Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Important:
  • With one input view, the Snap builds a model. With two input views, the Snap uses the model to give predictions.
  • SnapLogic recommends using either 2 input views with 1 output view or 2 output views with 1 input view. Do not use 2 input views with 2 output views.

Snap settings

Note:
  • Suggestion icon (): Indicates a list that is dynamically populated based on the configuration.
  • Expression icon (): Indicates whether the value is an expression (if enabled) or a static value (if disabled). Learn more about Using Expressions in SnapLogic.
  • Add icon (): Indicates that you can add fields in the field set.
  • Remove icon (): Indicates that you can remove fields from the field set.
Field / Field set Type Description
Label String Required. Specify a unique name for the Snap. Modify this to be more appropriate, especially if there are more than one of the same Snap in the pipeline.
Algorithm single selection Required. The clustering algorithm that must be used to cluster the data into specific groups.
Valid values:
  • K-Means: Partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
  • X-Means: An extended K-Means which tries to automatically determine the number of clusters based on Bayesian Information Criterion (BIC) scores.
  • G-Means: Another extended K-Means which tries to automatically determine the number of clusters by normality test.

Default value: K-Means

Max cluster integer Required. The maximum number of clusters that the Snap must create.

Valid values: 2 through 10000

Default value: 3

Important: If you select K-Means, the Snap creates the exact number of clusters that you specify. For X-Means and G-Means algorithms, the Snap performs an automatic optimization on your dataset, and the number of clusters might be equal to or less than the number of clusters that you specify.
Pass through checkbox If selected, includes all input fields in the output. Otherwise, the Snap outputs only the cluster index.

Default status: Selected

Snap execution Dropdown list Select one of the three modes in which the Snap executes.
Available options are:
  • Validate & Execute. Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime.
  • Execute only. Performs full execution of the Snap during pipeline execution without generating preview data.
  • Disabled. Disables the Snap and all Snaps that are downstream from it.

Troubleshooting

None.