apache_beam.io.tfrecordio module

TFRecord sources and sinks.

class apache_beam.io.tfrecordio.ReadFromTFRecord(file_pattern, coder=BytesCoder, compression_type='auto', validate=True)[source]

Bases: PTransform

Transform for reading TFRecord sources.

Initialize a ReadFromTFRecord transform.

Parameters:
  • file_pattern – A file glob pattern to read TFRecords from.

  • coder – Coder used to decode each record.

  • compression_type – Used to handle compressed input files. Default value is CompressionTypes.AUTO, in which case the file path’s extension will be used to detect the compression.

  • validate – Boolean flag to verify that the files exist at pipeline creation time.

Returns:

A ReadFromTFRecord transform object.
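
A minimal usage sketch, assuming a hypothetical input glob; with the default coder, each element of the resulting PCollection is the raw bytes of one serialized record:

import apache_beam as beam
from apache_beam.io.tfrecordio import ReadFromTFRecord

with beam.Pipeline() as p:
    record_sizes = (
        p
        # '/path/to/input*.tfrecord' is a placeholder glob pattern.
        | 'Read' >> ReadFromTFRecord('/path/to/input*.tfrecord')
        # With the default coder each element is raw bytes; take its length
        # as a trivial downstream step.
        | 'Size' >> beam.Map(len)
    )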

expand(pvalue)[source]
class apache_beam.io.tfrecordio.ReadAllFromTFRecord(coder=BytesCoder, compression_type='auto', with_filename=False)[source]

Bases: PTransform

A PTransform for reading a PCollection of TFRecord files.

Initialize the ReadAllFromTFRecord transform.

Parameters:
  • coder – Coder used to decode each record.

  • compression_type – Used to handle compressed input files. Default value is CompressionTypes.AUTO, in which case the file path’s extension will be used to detect the compression.

  • with_filename – If True, each element is a key-value pair in which the key is the file name and the value is the record data. If False, each element is only the record data.
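
The input to this transform is a PCollection of file names or file patterns. A minimal sketch, assuming hypothetical file paths:

import apache_beam as beam
from apache_beam.io.tfrecordio import ReadAllFromTFRecord

with beam.Pipeline() as p:
    file_names_read = (
        p
        # The file paths below are placeholders.
        | 'Files' >> beam.Create(['/data/a.tfrecord', '/data/b.tfrecord'])
        | 'ReadAll' >> ReadAllFromTFRecord(with_filename=True)
        # with_filename=True yields (file_name, record_bytes) pairs;
        # keep only the file names here.
        | 'KeepNames' >> beam.Map(lambda kv: kv[0])
    )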

expand(pvalue)[source]
class apache_beam.io.tfrecordio.WriteToTFRecord(file_path_prefix, coder=BytesCoder, file_name_suffix='', num_shards=0, shard_name_template=None, compression_type='auto', triggering_frequency=None)[source]

Bases: PTransform

Transform for writing to TFRecord sinks.

Initialize WriteToTFRecord transform.

Parameters:
  • file_path_prefix – The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards), and end in a common extension, if given by file_name_suffix.

  • coder – Coder used to encode each record.

  • file_name_suffix – Suffix for the files written.

  • num_shards – The number of files (shards) used for output. If not set, the default value will be used. In streaming, if not set, the service will write one file per bundle.

  • shard_name_template – A template string containing placeholders for the shard number and shard count. Currently only '', '-SSSSS-of-NNNNN', '-W-SSSSS-of-NNNNN' and '-V-SSSSS-of-NNNNN' are patterns accepted by the service. When constructing a filename for a particular shard number, the upper-case letters S and N are replaced with the 0-padded shard number and shard count respectively. This argument can be '', in which case it behaves as if num_shards were set to 1 and only one file will be generated. The default pattern is '-SSSSS-of-NNNNN' for bounded PCollections and '-W-SSSSS-of-NNNNN' for unbounded PCollections. W is used for windowed shard naming and is replaced with [window.start, window.end). V is also used for windowed shard naming and is replaced with [window.start.to_utc_datetime().strftime("%Y-%m-%dT%H-%M-%S"), window.end.to_utc_datetime().strftime("%Y-%m-%dT%H-%M-%S")).

  • compression_type – Used to handle compressed output files. Typical value is CompressionTypes.AUTO, in which case the file path’s extension will be used to detect the compression.

  • triggering_frequency – (int) A window will be triggered every triggering_frequency duration, and all bundles in that window will be written. If set, it overrides any user windowing. Mandatory for GlobalWindow.

Returns:

A WriteToTFRecord transform object.
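
A minimal sketch that writes two records to sharded files, assuming a hypothetical output prefix; with the default coder, each element must already be bytes:

import apache_beam as beam
from apache_beam.io.tfrecordio import WriteToTFRecord

with beam.Pipeline() as p:
    _ = (
        p
        | 'Create' >> beam.Create([b'record-one', b'record-two'])
        # '/path/to/output/part' is a placeholder prefix. With num_shards=2
        # and the default shard_name_template, files such as
        # part-00000-of-00002.tfrecord are produced.
        | 'Write' >> WriteToTFRecord(
            '/path/to/output/part',
            file_name_suffix='.tfrecord',
            num_shards=2)
    )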

expand(pcoll)[source]