apache_beam.io.textio module

A source and a sink for reading from and writing to text files.

class apache_beam.io.textio.ReadAllFromText(min_bundle_size=0, desired_bundle_size=67108864, compression_type='auto', strip_trailing_newlines=True, validate=False, coder=StrUtf8Coder, skip_header_lines=0, with_filename=False, delimiter=None, escapechar=None, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for reading a PCollection of text files.

Reads a PCollection of text files or file patterns and produces a PCollection of strings.

Parses a text file as newline-delimited elements, by default assuming UTF-8 encoding. Supports newline delimiters \n and \r\n.

If with_filename is True, the output will include the file name. This is similar to ReadFromTextWithFilename, but this PTransform can be placed anywhere in the pipeline.

This implementation only supports reading text encoded using UTF-8 or ASCII. This does not support other encodings such as UTF-16 or UTF-32.

This implementation has only been tested with batch pipelines. In streaming pipelines, reads may be delayed due to a limitation of the Reshuffle transform involved.
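A minimal usage sketch (the gs:// path is a placeholder): the file patterns themselves arrive as a PCollection, so the set of files to read can be computed while the pipeline runs.

  import apache_beam as beam
  from apache_beam.io.textio import ReadAllFromText

  with beam.Pipeline() as p:
      lines = (
          p
          | 'CreatePatterns' >> beam.Create(['gs://my-bucket/logs/*.txt'])
          | 'ReadAll' >> ReadAllFromText())

Passing with_filename=True instead would emit (file name, line) pairs.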

Initialize the ReadAllFromText transform.

Parameters:
  • min_bundle_size – Minimum size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
  • desired_bundle_size – Desired size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
  • compression_type – Used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
  • strip_trailing_newlines – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
  • validate – Flag to verify that the files exist during pipeline creation.
  • skip_header_lines – Number of header lines to skip. The same number is skipped from each source file. Must be 0 or higher. A large number of skipped lines might impact performance.
  • coder – Coder used to decode each line.
  • with_filename – If True, returns a key-value pair with the file name as the key and the actual data as the value. If False, returns only the data.
  • delimiter (bytes) – Delimiter used to split records. Must not self-overlap, because self-overlapping delimiters cause ambiguous parsing.
  • escapechar (bytes) – A single byte used to escape the record delimiter; it can also escape itself.
DEFAULT_DESIRED_BUNDLE_SIZE = 67108864
expand(pvalue)[source]
class apache_beam.io.textio.ReadAllFromTextContinuously(file_pattern, **kwargs)[source]

Bases: apache_beam.io.textio.ReadAllFromText

A PTransform for reading text files matching the given file patterns. This PTransform acts as a source and continuously produces a PCollection of strings.

For more details, see ReadAllFromText for text parsing settings; see apache_beam.io.fileio.MatchContinuously for watching settings.

ReadAllFromTextContinuously is experimental. No backwards-compatibility guarantees. Due to a limitation of Reshuffle, the current implementation does not scale.

Initialize the ReadAllFromTextContinuously transform.

Accepts the constructor arguments of both ReadAllFromText and MatchContinuously.
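A hedged sketch (the bucket path and polling interval are placeholders); interval is a MatchContinuously argument controlling how often the pattern is re-matched, and a streaming-capable runner is assumed:

  import apache_beam as beam
  from apache_beam.io.textio import ReadAllFromTextContinuously

  with beam.Pipeline() as p:
      # Re-match the pattern every 60 seconds and read any new files.
      lines = p | 'WatchAndRead' >> ReadAllFromTextContinuously(
          'gs://my-bucket/incoming/*.txt',
          interval=60.0)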

expand(pbegin)[source]
class apache_beam.io.textio.ReadFromText(file_pattern=None, min_bundle_size=0, compression_type='auto', strip_trailing_newlines=True, coder=StrUtf8Coder, validate=True, skip_header_lines=0, delimiter=None, escapechar=None, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for reading text files.

Parses a text file as newline-delimited elements, by default assuming UTF-8 encoding. Supports newline delimiters \n and \r\n, or a custom delimiter if one is specified.

This implementation only supports reading text encoded using UTF-8 or ASCII. This does not support other encodings such as UTF-16 or UTF-32.
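A minimal sketch (the glob pattern is a placeholder) reading CSV-like files while skipping a one-line header in each:

  import apache_beam as beam
  from apache_beam.io.textio import ReadFromText

  with beam.Pipeline() as p:
      lines = p | 'Read' >> ReadFromText(
          'gs://my-bucket/data/part-*.csv',
          skip_header_lines=1)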

Initialize the ReadFromText transform.

Parameters:
  • file_pattern (str) – The file path to read from as a local file path or a GCS gs:// path. The path can contain glob characters (*, ?, and [...] sets).
  • min_bundle_size (int) – Minimum size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
  • compression_type (str) – Used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
  • strip_trailing_newlines (bool) – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
  • validate (bool) – Flag to verify that the files exist during pipeline creation.
  • skip_header_lines (int) – Number of header lines to skip. The same number is skipped from each source file. Must be 0 or higher. A large number of skipped lines might impact performance.
  • coder (Coder) – Coder used to decode each line.
  • delimiter (bytes) – Delimiter used to split records. Must not self-overlap, because self-overlapping delimiters cause ambiguous parsing.
  • escapechar (bytes) – A single byte used to escape the record delimiter; it can also escape itself.
expand(pvalue)[source]
class apache_beam.io.textio.ReadFromTextWithFilename(file_pattern=None, min_bundle_size=0, compression_type='auto', strip_trailing_newlines=True, coder=StrUtf8Coder, validate=True, skip_header_lines=0, delimiter=None, escapechar=None, **kwargs)[source]

Bases: apache_beam.io.textio.ReadFromText

A ReadFromText variant that returns the name of the file along with the content of the file.

This class extends ReadFromText, merely setting a different _source_class attribute.
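A brief sketch (with a hypothetical path) that counts lines per file, relying on the (file name, line) output pairs:

  import apache_beam as beam
  from apache_beam.io.textio import ReadFromTextWithFilename

  with beam.Pipeline() as p:
      pairs = p | 'Read' >> ReadFromTextWithFilename('gs://my-bucket/data/*.txt')
      counts = (
          pairs
          | 'KeyByFile' >> beam.Map(lambda kv: (kv[0], 1))
          | 'SumPerFile' >> beam.CombinePerKey(sum))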

Initialize the ReadFromTextWithFilename transform.

Parameters:
  • file_pattern (str) – The file path to read from as a local file path or a GCS gs:// path. The path can contain glob characters (*, ?, and [...] sets).
  • min_bundle_size (int) – Minimum size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
  • compression_type (str) – Used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
  • strip_trailing_newlines (bool) – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
  • validate (bool) – Flag to verify that the files exist during pipeline creation.
  • skip_header_lines (int) – Number of header lines to skip. The same number is skipped from each source file. Must be 0 or higher. A large number of skipped lines might impact performance.
  • coder (Coder) – Coder used to decode each line.
  • delimiter (bytes) – Delimiter used to split records. Must not self-overlap, because self-overlapping delimiters cause ambiguous parsing.
  • escapechar (bytes) – A single byte used to escape the record delimiter; it can also escape itself.
class apache_beam.io.textio.WriteToText(file_path_prefix, file_name_suffix='', append_trailing_newlines=True, num_shards=0, shard_name_template=None, coder=ToBytesCoder, compression_type='auto', header=None, footer=None, *, max_records_per_shard=None, max_bytes_per_shard=None, skip_if_empty=False)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for writing to text files.

Initialize a WriteToText transform.

Parameters:
  • file_path_prefix (str) – The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards), and end in a common extension, if given by file_name_suffix. In most cases, only this argument is specified and num_shards, shard_name_template, and file_name_suffix use default values.
  • file_name_suffix (str) – Suffix for the files written.
  • append_trailing_newlines (bool) – Indicates whether this sink should write an additional newline character after writing each element.
  • num_shards (int) – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards. Constraining the number of shards is likely to reduce the performance of a pipeline. Setting this value is not recommended unless you require a specific number of output files.
  • shard_name_template (str) – A template string containing placeholders for the shard number and shard count. Currently only '' and '-SSSSS-of-NNNNN' are patterns accepted by the service. When constructing a filename for a particular shard number, the upper-case letters S and N are replaced with the 0-padded shard number and shard count respectively. This argument can be '' in which case it behaves as if num_shards was set to 1 and only one file will be generated. The default pattern used is '-SSSSS-of-NNNNN'.
  • coder (Coder) – Coder used to encode each line.
  • compression_type (str) – Used to handle compressed output files. Typical value is CompressionTypes.AUTO, in which case the final file path’s extension (as determined by file_path_prefix, file_name_suffix, num_shards and shard_name_template) will be used to detect the compression.
  • header (str) – String to write at beginning of file as a header. If not None and append_trailing_newlines is set, \n will be added.
  • footer (str) – String to write at the end of file as a footer. If not None and append_trailing_newlines is set, \n will be added.
  • max_records_per_shard – Maximum number of records to write to any individual shard.
  • max_bytes_per_shard – Target maximum number of bytes to write to any individual shard. This may be exceeded slightly: a new shard is created once this limit is hit, but the remainder of the current record, a subsequent newline, and a footer may push the actual shard size past this value. This limit applies to the uncompressed, not the compressed, size of the shard.
  • skip_if_empty – Don’t write any shards if the PCollection is empty.
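A minimal sketch (the output prefix is a placeholder) writing a few strings with a header and the default shard naming:

  import apache_beam as beam
  from apache_beam.io.textio import WriteToText

  with beam.Pipeline() as p:
      (p
       | 'Create' >> beam.Create(['alpha', 'beta', 'gamma'])
       | 'Write' >> WriteToText(
           '/tmp/results',
           file_name_suffix='.txt',
           header='value'))

With the default '-SSSSS-of-NNNNN' template this produces files such as /tmp/results-00000-of-00001.txt.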
expand(pcoll)[source]