This pipeline demonstrates how to use the Clean Missing Values Snap to replace missing values in a dataset with the most popular value (the mode).
-
Configure the CSV Generator Snap to generate the input dataset.
The dataset includes some missing values, particularly in the $Category field.
-
Pass the input through the Type Converter Snap to automatically detect and convert data types.
-
Use the Copy Snap to duplicate the input data stream.
The Snap is configured to output two streams:
- The first stream is passed to the Clean Missing Values Snap.
- The second stream is passed to the Profile Snap to compute field statistics.
-
Configure the Profile Snap to calculate field-level statistics on the input data.
This Snap identifies missing values and calculates the most popular value for each field.
Note: In this example, the most popular value in the $Category field is Publishing, which is used to fill in missing values.
-
Configure the Clean Missing Values Snap to impute missing values with the most popular value.
The Snap takes two input views:
- The first input view receives the raw data from the Copy Snap.
- The second input view receives the statistics from the Profile Snap.
Configure the Snap with the rule Impute with Popular to handle missing values in the $Category field.
Note: The Snap replaces missing, null, or whitespace-only values with the most popular value computed by the Profile Snap.
-
Use the File Writer Snap to write the cleaned output to a file.
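The data flow in the steps above can be sketched outside the platform in plain Python. This is only an illustration of the Impute with Popular idea, not how the Snaps are implemented: the sample rows, the Title field, and the function names `is_missing`, `profile`, and `impute_with_popular` are all hypothetical, and the field is written as Category rather than $Category.

```python
from collections import Counter


def is_missing(value):
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def profile(rows, field):
    """Return (missing_count, most_popular_value) for one field.

    Stands in for the statistics stream; missing values are ignored
    when counting the most popular value.
    """
    present = [row.get(field) for row in rows if not is_missing(row.get(field))]
    missing_count = len(rows) - len(present)
    most_popular = Counter(present).most_common(1)[0][0] if present else None
    return missing_count, most_popular


def impute_with_popular(rows, field, popular_value):
    """Return new rows with missing values in `field` replaced by popular_value."""
    return [
        {**row, field: popular_value} if is_missing(row.get(field)) else dict(row)
        for row in rows
    ]


# Hypothetical input data; None and "   " stand in for missing values.
rows = [
    {"Title": "A", "Category": "Publishing"},
    {"Title": "B", "Category": None},
    {"Title": "C", "Category": "Retail"},
    {"Title": "D", "Category": "Publishing"},
    {"Title": "E", "Category": "   "},
]

# Mirror the two output streams of the copy step: one for cleaning, one for profiling.
raw_stream, stats_stream = rows, rows

missing_count, popular = profile(stats_stream, "Category")
cleaned = impute_with_popular(raw_stream, "Category", popular)
```

In this sketch the statistics are computed once and then passed to the cleaning function, which mirrors the way the Clean Missing Values Snap receives raw data on one input view and the profile statistics on the other.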
To successfully reuse pipelines:
- Download and import the pipeline into the SnapLogic Platform.
- Configure Snap accounts, as applicable.
- Provide pipeline parameters, as applicable.