apache_beam.io.textio module

A source and a sink for reading from and writing to text files.

class apache_beam.io.textio.ReadAllFromText(min_bundle_size=0, desired_bundle_size=67108864, compression_type='auto', strip_trailing_newlines=True, coder=StrUtf8Coder, skip_header_lines=0, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for reading a PCollection of text files.

Reads a PCollection of text files or file patterns and and produces a PCollection of strings.

Parses a text file as newline-delimited elements, by default assuming UTF-8 encoding. Supports newline delimiters ‘n’ and ‘rn’.

This implementation only supports reading text encoded using UTF-8 or ASCII. This does not support other encodings such as UTF-16 or UTF-32.

Initialize the ReadAllFromText transform.

Parameters:
  • min_bundle_size – Minimum size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
  • desired_bundle_size – Desired size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
  • compression_type – Used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
  • strip_trailing_newlines – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
  • validate – flag to verify that the files exist during the pipeline creation time.
  • skip_header_lines – Number of header lines to skip. Same number is skipped from each source file. Must be 0 or higher. Large number of skipped lines might impact performance.
  • coder – Coder used to decode each line.
DEFAULT_DESIRED_BUNDLE_SIZE = 67108864
expand(pvalue)[source]
class apache_beam.io.textio.ReadFromText(file_pattern=None, min_bundle_size=0, compression_type='auto', strip_trailing_newlines=True, coder=StrUtf8Coder, validate=True, skip_header_lines=0, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for reading text files.

Parses a text file as newline-delimited elements, by default assuming UTF-8 encoding. Supports newline delimiters \n and \r\n.

This implementation only supports reading text encoded using UTF-8 or ASCII. This does not support other encodings such as UTF-16 or UTF-32.

Initialize the ReadFromText transform.

Parameters:
  • file_pattern (str) – The file path to read from as a local file path or a GCS gs:// path. The path can contain glob characters (*, ?, and [...] sets).
  • min_bundle_size (int) – Minimum size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
  • compression_type (str) – Used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
  • strip_trailing_newlines (bool) – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
  • validate (bool) – flag to verify that the files exist during the pipeline creation time.
  • skip_header_lines (int) – Number of header lines to skip. Same number is skipped from each source file. Must be 0 or higher. Large number of skipped lines might impact performance.
  • coder (Coder) – Coder used to decode each line.
expand(pvalue)[source]
class apache_beam.io.textio.ReadFromTextWithFilename(file_pattern=None, min_bundle_size=0, compression_type='auto', strip_trailing_newlines=True, coder=StrUtf8Coder, validate=True, skip_header_lines=0, **kwargs)[source]

Bases: apache_beam.io.textio.ReadFromText

A ReadFromText for reading text files returning the name of the file and the content of the file.

This class extend ReadFromText class just setting a different _source_class attribute.

Initialize the ReadFromText transform.

Parameters:
  • file_pattern (str) – The file path to read from as a local file path or a GCS gs:// path. The path can contain glob characters (*, ?, and [...] sets).
  • min_bundle_size (int) – Minimum size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
  • compression_type (str) – Used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
  • strip_trailing_newlines (bool) – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
  • validate (bool) – flag to verify that the files exist during the pipeline creation time.
  • skip_header_lines (int) – Number of header lines to skip. Same number is skipped from each source file. Must be 0 or higher. Large number of skipped lines might impact performance.
  • coder (Coder) – Coder used to decode each line.
class apache_beam.io.textio.WriteToText(file_path_prefix, file_name_suffix='', append_trailing_newlines=True, num_shards=0, shard_name_template=None, coder=ToStringCoder, compression_type='auto', header=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for writing to text files.

Initialize a WriteToText transform.

Parameters:
  • file_path_prefix (str) – The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards), and end in a common extension, if given by file_name_suffix. In most cases, only this argument is specified and num_shards, shard_name_template, and file_name_suffix use default values.
  • file_name_suffix (str) – Suffix for the files written.
  • append_trailing_newlines (bool) – indicate whether this sink should write an additional newline char after writing each element.
  • num_shards (int) – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards. Constraining the number of shards is likely to reduce the performance of a pipeline. Setting this value is not recommended unless you require a specific number of output files.
  • shard_name_template (str) – A template string containing placeholders for the shard number and shard count. Currently only '' and '-SSSSS-of-NNNNN' are patterns accepted by the service. When constructing a filename for a particular shard number, the upper-case letters S and N are replaced with the 0-padded shard number and shard count respectively. This argument can be '' in which case it behaves as if num_shards was set to 1 and only one file will be generated. The default pattern used is '-SSSSS-of-NNNNN'.
  • coder (Coder) – Coder used to encode each line.
  • compression_type (str) – Used to handle compressed output files. Typical value is CompressionTypes.AUTO, in which case the final file path’s extension (as determined by file_path_prefix, file_name_suffix, num_shards and shard_name_template) will be used to detect the compression.
  • header (str) – String to write at beginning of file as a header. If not None and append_trailing_newlines is set, \n will be added.
expand(pcoll)[source]