apache_beam.ml.gcp.naturallanguageml module

class apache_beam.ml.gcp.naturallanguageml.Document(content, type='PLAIN_TEXT', language_hint=None, encoding='UTF8', from_gcs=False)[source]

Bases: object

Represents the input to the AnnotateText transform.

  • content (str) – The content of the input or the Google Cloud Storage URI where the file is stored.
  • type (Union[str, google.cloud.language.enums.Document.Type]) – Text type. Possible values are HTML, PLAIN_TEXT. The default value is PLAIN_TEXT.
  • language_hint (Optional[str]) – The language of the text. If not specified, the language is detected automatically. Values should conform to the ISO-639-1 standard.
  • encoding (Optional[str]) – Text encoding. Possible values are: NONE, UTF8, UTF16, UTF32. The default value is UTF8.
  • from_gcs (bool) – Whether the content should be interpreted as a Google Cloud Storage URI. The default value is False.
static to_dict(document)[source]
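
As a minimal sketch of how the constructor arguments above fit together, the snippet below prepares keyword arguments for one inline-text document and one stored in Cloud Storage. The helpers `document_kwargs` and `make_document`, the `gs://` prefix check, and the bucket path are hypothetical illustrations, not part of the module; only the `Document` argument names come from the signature above. The Beam import is deferred because the module needs the GCP extras to be installed.

```python
def document_kwargs(content, language_hint=None):
    """Build keyword arguments for Document (hypothetical helper).

    Assumption: any content starting with 'gs://' is a Cloud Storage URI.
    """
    return {
        'content': content,
        'type': 'PLAIN_TEXT',
        'language_hint': language_hint,
        'encoding': 'UTF8',
        'from_gcs': content.startswith('gs://'),
    }


def make_document(content, language_hint=None):
    """Create a Document; assumes apache-beam[gcp] is installed."""
    # Imported lazily so the helper above works without the GCP client.
    from apache_beam.ml.gcp.naturallanguageml import Document
    return Document(**document_kwargs(content, language_hint))


# Inline text with an ISO-639-1 language hint.
inline_kwargs = document_kwargs('Beam makes batch and streaming easy.', 'en')

# Text stored in a (hypothetical) bucket; from_gcs is set to True.
gcs_kwargs = document_kwargs('gs://my-bucket/reviews/review-001.txt')
```

Calling `make_document(...)` with the same arguments would produce the objects that an input PCollection for AnnotateText is expected to contain.
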
apache_beam.ml.gcp.naturallanguageml.AnnotateText(pcoll, features, timeout=None, metadata=None)[source]

A PTransform for annotating text using the Google Cloud Natural Language API: https://cloud.google.com/natural-language/docs.

  • pcoll (PCollection) – An input PCollection of Document objects.
  • features (Union[Mapping[str, bool], types.AnnotateTextRequest.Features]) –

    A dictionary of natural language operations to be performed on the given text, in the following format:

    {'extract_syntax': True, 'extract_entities': True}
  • timeout (Optional[float]) – The amount of time, in seconds, to wait for the request to complete. The timeout applies to each individual retry attempt.
  • metadata (Optional[Sequence[Tuple[str, str]]]) – Additional metadata that is provided to the method.
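
The parameters above can be combined roughly as follows. This is a sketch under stated assumptions, not a definitive usage: `build_features` and `annotate` are hypothetical helpers, the list of feature names is taken from the Natural Language API's `Features` message as I understand it, and the `pcoll` argument is assumed to be supplied by the pipe operator, as is usual for function-style Beam transforms. Running `annotate` requires `apache-beam[gcp]` and valid Google Cloud credentials.

```python
# Assumption: feature flags accepted by the Natural Language API.
KNOWN_FEATURES = (
    'extract_syntax',
    'extract_entities',
    'extract_document_sentiment',
    'extract_entity_sentiment',
    'classify_text',
)


def build_features(*enabled):
    """Return a features dict, rejecting misspelled names (hypothetical helper)."""
    unknown = set(enabled) - set(KNOWN_FEATURES)
    if unknown:
        raise ValueError('Unknown feature(s): %s' % ', '.join(sorted(unknown)))
    return {name: True for name in enabled}


def annotate(texts, features, timeout=None):
    """Annotate in-memory strings and print each API response."""
    # Imported lazily: these need apache-beam[gcp] to be installed.
    import apache_beam as beam
    from apache_beam.ml.gcp.naturallanguageml import AnnotateText, Document

    with beam.Pipeline() as pipeline:
        _ = (
            pipeline
            | 'CreateTexts' >> beam.Create(texts)
            | 'ToDocument' >> beam.Map(lambda text: Document(text))
            | 'Annotate' >> AnnotateText(features, timeout=timeout)
            | 'Print' >> beam.Map(print))


features = build_features('extract_syntax', 'extract_entities')
```

Validating the keys up front is a design choice made for this sketch: a misspelled key such as `'extact_syntax'` is caught before any request is sent.
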