apache_beam.io.tfrecordio module
TFRecord sources and sinks.
- class apache_beam.io.tfrecordio.ReadFromTFRecord(file_pattern, coder=BytesCoder, compression_type='auto', validate=True)[source]
Bases:
PTransform
Transform for reading TFRecord sources.
Initialize a ReadFromTFRecord transform.
- Parameters:
file_pattern – A file glob pattern to read TFRecords from.
coder – Coder used to decode each record.
compression_type – Used to handle compressed input files. Default value is CompressionTypes.AUTO, in which case the file_path’s extension will be used to detect the compression.
validate – Boolean flag to verify that the files exist at pipeline creation time.
- Returns:
A ReadFromTFRecord transform object.
- class apache_beam.io.tfrecordio.ReadAllFromTFRecord(coder=BytesCoder, compression_type='auto', with_filename=False)[source]
Bases:
PTransform
A PTransform for reading a PCollection of TFRecord files.
Initialize the ReadAllFromTFRecord transform.
- Parameters:
coder – Coder used to decode each record.
compression_type – Used to handle compressed input files. Default value is CompressionTypes.AUTO, in which case the file_path’s extension will be used to detect the compression.
with_filename – If True, returns a key-value pair with the key being the file name and the value being the record. If False, it returns only the record.
- class apache_beam.io.tfrecordio.WriteToTFRecord(file_path_prefix, coder=BytesCoder, file_name_suffix='', num_shards=0, shard_name_template=None, compression_type='auto', triggering_frequency=None)[source]
Bases:
PTransform
Transform for writing to TFRecord sinks.
Initialize WriteToTFRecord transform.
- Parameters:
file_path_prefix – The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards), and end in a common extension, if given by file_name_suffix.
coder – Coder used to encode each record.
file_name_suffix – Suffix for the files written.
num_shards – The number of files (shards) used for output. If not set, the default value will be used. In streaming, if not set, the service will write one file per bundle.
shard_name_template – A template string containing placeholders for the shard number and shard count. Currently only '', '-SSSSS-of-NNNNN', '-W-SSSSS-of-NNNNN' and '-V-SSSSS-of-NNNNN' are patterns accepted by the service. When constructing a filename for a particular shard number, the upper-case letters S and N are replaced with the 0-padded shard number and shard count respectively. This argument can be '', in which case it behaves as if num_shards was set to 1 and only one file will be generated. The default pattern is '-SSSSS-of-NNNNN' for bounded PCollections and '-W-SSSSS-of-NNNNN' for unbounded PCollections. W is used for windowed shard naming and is replaced with [window.start, window.end). V is used for windowed shard naming and is replaced with [window.start.to_utc_datetime().strftime("%Y-%m-%dT%H-%M-%S"), window.end.to_utc_datetime().strftime("%Y-%m-%dT%H-%M-%S")).
compression_type – Used to handle compressed output files. Typical value is CompressionTypes.AUTO, in which case the file_path’s extension will be used to detect the compression.
triggering_frequency – (int) Every triggering_frequency duration, a window will be triggered and all bundles in the window will be written. If set, it overrides user windowing. Mandatory for GlobalWindow.
- Returns:
A WriteToTFRecord transform object.