Handling Missing Values in a Dataset

This pipeline demonstrates how to use the Clean Missing Values Snap to process missing values in a dataset using the most popular value (mode).

  1. Configure the CSV Generator Snap to generate the input dataset.

    The dataset includes some missing values, particularly in the $Category field.

  2. Pass the input through the Type Converter Snap to automatically detect and convert data types.
  3. Use the Copy Snap to duplicate the input data stream.

    The Snap is configured to output two streams:

    • The first stream is passed to the Clean Missing Values Snap.
    • The second stream is passed to the Profile Snap to compute field statistics.

    Copy Snap Output Showing Missing $Category Value

  4. Configure the Profile Snap to calculate field-level statistics on the input data.

    This Snap identifies missing values and calculates the most popular value for each field.


    Profile Snap Settings


    Profile Snap Output with Popular Category

    Note: In this example, the most popular value in the $Category field is Publishing, which is used to fill in missing values.
  5. Configure the Clean Missing Values Snap to impute missing values with the most popular value.
    The Snap takes two input views:
    • The first input view receives the raw data from the Copy Snap.
    • The second input view receives the statistics from the Profile Snap.

    Configure the Snap with the rule Impute with Popular to handle missing values in the $Category field.


    Clean Missing Values Snap Configuration


    Clean Missing Values Snap Output

    Note: The Snap replaces missing, null, or whitespace values with the most popular value derived from the Profile Snap.
  6. Use the File Writer Snap to write the cleaned output to a file.
To successfully reuse pipelines:
  1. Download and import the pipeline in to the SnapLogic Platform.
  2. Configure Snap accounts, as applicable.
  3. Provide pipeline parameters, as applicable.