AutoML

Overview

Transform-type Snap
Does not support Ultra Pipelines

AutoML automates the process of exploring and tuning machine learning models for a given dataset within the resource limit.

A machine learning model is a mathematical representation of a real-world process that can be used to predict or solve a specific problem. For example, predict whether the customers are going to churn, predict whether the loan will be fully paid, or, forecast sales. To generate a machine learning model, you must provide training data to a machine learning algorithm to learn from.

Currently, the AutoML Snap supports binary classification, multiclass classification, and regression. For each type of problem, the Snap provides a different set of metrics and reports.

Prerequisites

None.

Known issues

The AutoML generates models with statistical information. If the statistic value of a model for RMSLE is NaN, then the Snap skips that value in charts that are generated in reports.

Preview showing one item where RMSLE is NaN

Snap views


View	Description	Examples of upstream and downstream Snaps
Input #1	A document stream with classification or regression dataset.	A Snap that generates a classification or regression dataset. Examples: CSV Generator Mapper A combination of File Reader and JSON Parser
Input #2	A document that contains a model built by an AutoML Snap from a previous execution.	A Snap that offers documents that provide a model built by an AutoML Snap. Example: A combination of File Reader and JSON Parser
Output #1	A serialization of a machine learning model, and metadata that are not human-readable. Additionally, the output includes a human-readable representation of the model if you select the Readable checkbox.	A Snap that formats and saves the model. Example: A combination of JSON Formatter and File Writer Snaps
Output #2	A document that contains the leaderboard. All the models built by this Snap display in the order of ranking along with metrics indicating the performance of the model.	A Snap that accepts documents. Examples: Aggregate Mapper CSV Formatter
Output #3	A document that contains an interactive report of up to top 10 models.	A Snap that formats and saves the report. Example: A combination of Binary to Document and File Writer
Error	Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the pipeline by choosing one of the following options from the When errors occur list under the Views tab. The available options are: Stop Pipeline Execution Stops the current pipeline execution when the Snap encounters an error. Discard Error Data and Continue Ignores the error, discards that record, and continues with the remaining records. Route Error Data to Error View Routes the error data to an error view without stopping the Snap execution. Learn more about Error handling in Pipelines.

Snap settings

Note:

Suggestion icon (): Indicates a list that is dynamically populated based on the configuration.
Expression icon (): Indicates whether the value is an expression (if enabled) or a static value (if disabled). Learn more about Using Expressions in SnapLogic.
Add icon (): Indicates that you can add fields in the field set.
Remove icon (): Indicates that you can remove fields from the field set.

Field / Field set Type Description

Label String Required. Specify a unique name for the Snap. Modify this to be more appropriate, especially if there are more than one of the same Snap in the pipeline.

Label field

string

Required. The label/class/target field in the dataset. This is the field you will train the model to predict.

Default value: None

Example: $class

Time limit

integer

Required. The maximum number of seconds that the AutoML Snap can be run. If 0, the Snap builds the number of models specified in the Number of models property, without a time restriction.

Note:

The Snap throws an error if it cannot build at least one model within the specified time limit.
- During validation, the Snap uses either the following values or the values you specify (if any and if less than the following values):
  - Time Limit: 60 seconds
  - Number of models: 5
- During execution, the Pipeline runs up to the Time Limit and Number of models that you specify in this dialog.

Default value: 3600

Number of models

integer

The number of models that must be included in AutoML. If 0, the Snap builds as many models as possible within the time specified in the Time limit property.

Note:

- During validation, the Snap uses either the following values or the values you specify (if any and if less than the following values):
  - Time Limit: 60 seconds
  - Number of models: 5
- During execution, the Pipeline runs up to the Time Limit and Number of models that you specify in this dialog.
For Weka, the Snap builds multiple models in parallel; so the number of output models may not be exactly as specified.

Default value: 10

Fold

integer

Required. The number of folds in k-fold cross validation.

Valid values: 2 through 10

Default value: 5

Engine

selection

Required. The engine to be used.

Valid values:

Weka
H2O

Default value: H2O

Algorithms

multiple selection

Valid values:

Standard
Tree
XGBoost
NN

Default value: Standard, Tree, XGBoost, NN

The AutoML Snap supports the following algorithms:


	Weka engine	H2O engine
Standard	Logistic, Linear Regression, Naive Bayes, Naive Bayes Multinomial, Simple Linear Regression, SMO, SGD	Generalized Linear Modeling, Gradient Boosting Machine
Tree	J48, Decision Stump, Random Forest, Random Tree, REP Tree, M5P, LMT	Distributed Random Forest
XGBoost	n/a	XGBoost
NN	Multilayer Perceptron	Deep Learning

Readable

checkbox

If selected, the model is displayed in a human-readable format. A $readable field is added to the output.

Default status: Not selected

Use random seed

checkbox

If selected, the value of Random seed is applied to the randomizer to get reproducible results.

Default status: Selected

Random seed

integer

Required. A number used as the static seed for the randomizer.

Default value: 12345

Report title string Title for the report.

Snap execution

Dropdown list

Select one of the three modes in which the Snap executes.

Available options are:

Validate & Execute. Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime.
Execute only. Performs full execution of the Snap during pipeline execution without generating preview data.
Disabled. Disables the Snap and all Snaps that are downstream from it.

Temporary Files

During execution, data processing on Snaplex nodes occurs principally in-memory as streaming and is unencrypted. When processing larger datasets that exceed the available compute memory, the Snap writes unencrypted pipeline data to local storage to optimize the performance. These temporary files are deleted when the pipeline execution completes. You can configure the temporary data's location in the Global properties table of the Snaplex node properties, which can also help avoid pipeline errors because of the unavailability of space. Learn more about Temporary Folder.

Troubleshooting

None.