Class DLPDeidentifyText

java.lang.Object
org.apache.beam.sdk.transforms.PTransform<PCollection<KV<String,String>>,PCollection<KV<String,com.google.privacy.dlp.v2.DeidentifyContentResponse>>>
org.apache.beam.sdk.extensions.ml.DLPDeidentifyText
All Implemented Interfaces:
Serializable, HasDisplayData

public abstract class DLPDeidentifyText extends PTransform<PCollection<KV<String,String>>,PCollection<KV<String,com.google.privacy.dlp.v2.DeidentifyContentResponse>>>
A PTransform that connects to Cloud DLP (https://cloud.google.com/dlp/docs/libraries) and deidentifies text according to the provided settings. The transform supports both columnar delimited input data (e.g. CSV) and unstructured input.

If the headerColumns property is set and a side input with table headers is added to the PTransform, the delimiter must also be set; otherwise the results will be incorrect. If headerColumns is neither set nor passed as a side input, the input is assumed to be unstructured.

Either deidentifyTemplateName (String) or deidentifyConfig (DeidentifyConfig) needs to be set. inspectTemplateName and inspectConfig (InspectConfig) are optional.
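The two rules above (a deidentification template or config is required, and header columns require a delimiter) can be expressed as a small pre-flight check. The following is a minimal sketch using only the JDK; the class DeidentifySettingsCheck and its method are hypothetical names for illustration, not part of Beam, and the sketch makes no claim about how or whether the transform itself enforces these rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Beam API.
final class DeidentifySettingsCheck {

  // Returns a list of human-readable problems; an empty list means the
  // combination of settings satisfies the rules described above.
  static List<String> problems(
      String deidentifyTemplateName,
      Object deidentifyConfig,
      boolean hasHeaderColumns,
      String columnDelimiter) {
    List<String> problems = new ArrayList<>();
    // Rule 1: at least one source of deidentification settings is required.
    if (deidentifyTemplateName == null && deidentifyConfig == null) {
      problems.add("set deidentifyTemplateName or deidentifyConfig");
    }
    // Rule 2: columnar input needs a delimiter to split rows into columns.
    if (hasHeaderColumns && columnDelimiter == null) {
      problems.add("headerColumns requires a delimiter");
    }
    return problems;
  }

  public static void main(String[] args) {
    // A template name is given, but headers are set without a delimiter.
    System.out.println(problems("my-template", null, true, null));
  }
}
```

Collecting all problems in a list, rather than failing on the first one, lets a caller report every misconfiguration in one pass.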

Batch size defines the size, in bytes, of the batches sent to DLP in a single request.
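To illustrate what a byte-bounded batch means, here is a minimal sketch of greedy batching by UTF-8 size, using only the JDK. The class ByteBatcher is a hypothetical name, and the greedy strategy and UTF-8 size measurement are assumptions made for illustration; the transform's actual batching logic is not described on this page.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Beam API.
final class ByteBatcher {

  // Groups elements in order so that each batch stays within batchSizeBytes
  // (measured as UTF-8 bytes). An element larger than the limit is placed
  // alone in its own batch rather than being dropped or split.
  static List<List<String>> batch(List<String> elements, int batchSizeBytes) {
    List<List<String>> batches = new ArrayList<>();
    List<String> current = new ArrayList<>();
    int currentBytes = 0;
    for (String element : elements) {
      int size = element.getBytes(StandardCharsets.UTF_8).length;
      // Close the current batch if adding this element would exceed the limit.
      if (!current.isEmpty() && currentBytes + size > batchSizeBytes) {
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(element);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    // prints [[aaaa, bbbb], [cc, dddddd]]
    System.out.println(batch(List.of("aaaa", "bbbb", "cc", "dddddd"), 8));
  }
}
```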

The transform consumes KVs of Strings (assumed to be the filename as key and the contents as value) and outputs KVs of String (e.g. filename) and DeidentifyContentResponse, which contains a Table of results for the user to consume.

  • Field Details

    • DLP_PAYLOAD_LIMIT_BYTES

      public static final Integer DLP_PAYLOAD_LIMIT_BYTES
  • Constructor Details

    • DLPDeidentifyText

      public DLPDeidentifyText()
  • Method Details

    • getInspectTemplateName

      public abstract @Nullable String getInspectTemplateName()
      Returns:
      Template name for data inspection.
    • getDeidentifyTemplateName

      public abstract @Nullable String getDeidentifyTemplateName()
      Returns:
      Template name for data deidentification.
    • getInspectConfig

      public abstract @Nullable com.google.privacy.dlp.v2.InspectConfig getInspectConfig()
      Returns:
      Configuration object for data inspection. If present, supersedes the template settings.
    • getDeidentifyConfig

      public abstract @Nullable com.google.privacy.dlp.v2.DeidentifyConfig getDeidentifyConfig()
      Returns:
      Configuration object for deidentification. If present, supersedes the template.
    • getHeaderColumns

      public abstract @Nullable PCollectionView<List<String>> getHeaderColumns()
      Returns:
      List of column names if the input KV value is a delimited row.
    • getColumnDelimiter

      public abstract @Nullable String getColumnDelimiter()
      Returns:
      Delimiter to be used when splitting values from input strings into columns.
    • getBatchSizeBytes

      public abstract Integer getBatchSizeBytes()
      Returns:
      Size, in bytes, of the batch of input elements sent to the Cloud DLP service in one request.
    • getProjectId

      public abstract String getProjectId()
      Returns:
      ID of Google Cloud project to be used when deidentifying data.
    • newBuilder

      public static DLPDeidentifyText.Builder newBuilder()
    • expand

      public PCollection<KV<String,com.google.privacy.dlp.v2.DeidentifyContentResponse>> expand(PCollection<KV<String,String>> input)
      The transform converts the contents of the input PCollection into Table.Rows and then calls the Cloud DLP service to perform deidentification according to the provided settings.
      Specified by:
      expand in class PTransform<PCollection<KV<String,String>>,PCollection<KV<String,com.google.privacy.dlp.v2.DeidentifyContentResponse>>>
      Parameters:
      input - input PCollection
      Returns:
      PCollection after transformations
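The conversion of delimited input into rows that expand describes can be pictured with a small JDK-only sketch: one input line is split on the column delimiter and each value is paired with its header. The class DelimitedRowSketch is a hypothetical name, and a plain Map stands in for the DLP Table.Row type; quoted fields, escaping, and duplicate header names are not handled, and this is not the transform's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Beam API.
final class DelimitedRowSketch {

  // Splits one delimited line and maps each value to its column header,
  // preserving column order. The delimiter is treated as a literal string.
  static Map<String, String> toRow(List<String> headers, String line, String delimiter) {
    // Limit -1 keeps trailing empty columns instead of discarding them.
    String[] values = line.split(Pattern.quote(delimiter), -1);
    if (values.length != headers.size()) {
      throw new IllegalArgumentException(
          "expected " + headers.size() + " columns but found " + values.length);
    }
    Map<String, String> row = new LinkedHashMap<>();
    for (int i = 0; i < values.length; i++) {
      row.put(headers.get(i), values[i]);
    }
    return row;
  }

  public static void main(String[] args) {
    // prints {name=Ada, email=ada@example.com}
    System.out.println(toRow(List.of("name", "email"), "Ada,ada@example.com", ","));
  }
}
```

The column-count check shows why a missing or wrong delimiter matters for columnar input: without a correct split, values no longer line up with their headers.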