Class DLPReidentifyText

java.lang.Object
org.apache.beam.sdk.transforms.PTransform<PCollection<KV<String,String>>,PCollection<KV<String,com.google.privacy.dlp.v2.ReidentifyContentResponse>>>
org.apache.beam.sdk.extensions.ml.DLPReidentifyText
All Implemented Interfaces:
Serializable, HasDisplayData

public abstract class DLPReidentifyText extends PTransform<PCollection<KV<String,String>>,PCollection<KV<String,com.google.privacy.dlp.v2.ReidentifyContentResponse>>>
A PTransform connecting to Cloud DLP (https://cloud.google.com/dlp/docs/libraries) and reidentifying text according to provided settings.

The transform supports both delimited columnar input data and unstructured input.

If the headerColumns property is set and a side input with headers is added to the PTransform, the delimiter must also be set; otherwise the results will be incorrect. If headerColumns is neither set nor passed as a side input, the input is assumed to be unstructured.
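
As a rough illustration of why the delimiter matters once headers are supplied, the sketch below pairs each cell of a delimited line with a header name. It is not the transform's internal code: the class DelimitedRowSketch and its methods are hypothetical, and the strict cell-count check is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Beam: shows how a delimited value lines up with headers.
final class DelimitedRowSketch {

  /** Splits one delimited line into cells, keeping trailing empty cells. */
  static List<String> splitRow(String line, String delimiter) {
    // Pattern.quote makes the delimiter literal, so "|" or "." are not read as regex syntax.
    // The -1 limit keeps trailing empty strings, so "a,b," yields three cells.
    return new ArrayList<>(Arrays.asList(line.split(Pattern.quote(delimiter), -1)));
  }

  /** Pairs each header with the cell at the same position, preserving column order. */
  static Map<String, String> toRecord(List<String> headers, String line, String delimiter) {
    List<String> cells = splitRow(line, delimiter);
    if (cells.size() != headers.size()) {
      // Assumption for this sketch: a mismatch is an error. A wrong delimiter typically
      // shows up here, because the whole line collapses into a single cell.
      throw new IllegalArgumentException(
          "Expected " + headers.size() + " cells but found " + cells.size());
    }
    Map<String, String> record = new LinkedHashMap<>();
    for (int i = 0; i < headers.size(); i++) {
      record.put(headers.get(i), cells.get(i));
    }
    return record;
  }

  public static void main(String[] args) {
    List<String> headers = Arrays.asList("name", "email");
    System.out.println(toRecord(headers, "Ann|ann@example.com", "|"));
  }
}
```

With the wrong delimiter (for example "," on the line above), the line stays in one cell and no longer matches the two headers, which is the kind of incorrect result the note above warns about.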

Batch size defines the size, in bytes, of the batches sent to DLP in a single request.
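
To make the idea of a byte-based batch size concrete, here is a minimal sketch of size-bounded batching. How the transform actually groups elements is not described on this page, so the class ByteBatcherSketch, the use of UTF-8 byte counts, and the handling of oversized rows are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Beam: groups rows so each batch stays within a byte budget.
final class ByteBatcherSketch {

  static List<List<String>> batchByBytes(List<String> rows, int batchSizeBytes) {
    List<List<String>> batches = new ArrayList<>();
    List<String> current = new ArrayList<>();
    int currentBytes = 0;
    for (String row : rows) {
      // Assumption: size is measured as the UTF-8 encoded length of each row.
      int rowBytes = row.getBytes(StandardCharsets.UTF_8).length;
      // Close the current batch when adding this row would exceed the budget.
      if (!current.isEmpty() && currentBytes + rowBytes > batchSizeBytes) {
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      // A single row larger than the budget still gets a batch of its own.
      current.add(row);
      currentBytes += rowBytes;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> rows = new ArrayList<>();
    rows.add("aaaa");
    rows.add("bbbb");
    rows.add("cc");
    System.out.println(batchByBytes(rows, 6));
  }
}
```

Counting bytes rather than characters matters for non-ASCII input: a character such as "é" occupies two bytes in UTF-8, so a batch fills up faster than its character count suggests.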

The transform consumes KVs of Strings (the key is assumed to be a filename and the value its contents) and outputs KVs of a String (e.g. the filename) and a ReidentifyContentResponse, which contains a Table of results for the user to consume.

Either reidentifyTemplateName (String) or reidentifyConfig (DeidentifyConfig) needs to be set. inspectConfig (InspectConfig) and inspectTemplateName (String) are optional.
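
The rule above, combined with the note under getReidentifyConfig that a config supersedes the template, can be sketched as a small check. The class ReidentifySettingsSketch is hypothetical and uses a plain Object in place of DeidentifyConfig so that it compiles without the DLP client library; it does not reflect how the transform validates its settings.

```java
// Hypothetical helper, not part of Beam: models the "template or config" requirement.
final class ReidentifySettingsSketch {

  /**
   * Returns which setting would drive reidentification in this sketch.
   * The config parameter stands in for a DeidentifyConfig instance.
   */
  static String chooseSource(String reidentifyTemplateName, Object reidentifyConfig) {
    if (reidentifyTemplateName == null && reidentifyConfig == null) {
      // At least one of the two settings is required.
      throw new IllegalArgumentException(
          "Either reidentifyTemplateName or reidentifyConfig needs to be set");
    }
    // When both are present, the config takes precedence over the template.
    return reidentifyConfig != null ? "config" : "template";
  }

  public static void main(String[] args) {
    // The template name below is a made-up placeholder.
    System.out.println(chooseSource("example-template", null));
  }
}
```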

  • Field Details

    • DLP_PAYLOAD_LIMIT_BYTES

      public static final Integer DLP_PAYLOAD_LIMIT_BYTES
  • Constructor Details

    • DLPReidentifyText

      public DLPReidentifyText()
  • Method Details

    • getInspectTemplateName

      public abstract @Nullable String getInspectTemplateName()
      Returns:
      Template name for data inspection.
    • getReidentifyTemplateName

      public abstract @Nullable String getReidentifyTemplateName()
      Returns:
      Template name for data reidentification.
    • getInspectConfig

      public abstract @Nullable com.google.privacy.dlp.v2.InspectConfig getInspectConfig()
      Returns:
      Configuration object for data inspection. If present, supersedes the template settings.
    • getReidentifyConfig

      public abstract @Nullable com.google.privacy.dlp.v2.DeidentifyConfig getReidentifyConfig()
      Returns:
      Configuration object for reidentification. If present, supersedes the template.
    • getColumnDelimiter

      public abstract @Nullable String getColumnDelimiter()
      Returns:
      Delimiter to be used when splitting values from input strings into columns.
    • getHeaderColumns

      public abstract @Nullable PCollectionView<List<String>> getHeaderColumns()
      Returns:
      List of column names if the input KV value is a delimited row.
    • getBatchSizeBytes

      public abstract Integer getBatchSizeBytes()
      Returns:
      Size, in bytes, of the batch of input elements sent to the Cloud DLP service in one request.
    • getProjectId

      public abstract String getProjectId()
      Returns:
      ID of the Google Cloud project to be used when reidentifying data.
    • newBuilder

      public static DLPReidentifyText.Builder newBuilder()
    • expand

      public PCollection<KV<String,com.google.privacy.dlp.v2.ReidentifyContentResponse>> expand(PCollection<KV<String,String>> input)
      The transform converts the contents of input PCollection into Table.Rows and then calls Cloud DLP service to perform the reidentification according to provided settings.
      Specified by:
      expand in class PTransform<PCollection<KV<String,String>>,PCollection<KV<String,com.google.privacy.dlp.v2.ReidentifyContentResponse>>>
      Parameters:
      input - input PCollection
      Returns:
      PCollection after transformations