apache_beam.io.textio module¶
A source and a sink for reading from and writing to text files.
- 
class apache_beam.io.textio.ReadAllFromText(min_bundle_size=0, desired_bundle_size=67108864, compression_type='auto', strip_trailing_newlines=True, validate=False, coder=StrUtf8Coder, skip_header_lines=0, with_filename=False, delimiter=None, escapechar=None, **kwargs)[source]¶
- Bases: - apache_beam.transforms.ptransform.PTransform- A - PTransformfor reading a- PCollectionof text files.Reads a- PCollectionof text files or file patterns and produces a- PCollectionof strings.- Parses a text file as newline-delimited elements, by default assuming UTF-8 encoding. Supports newline delimiters ‘n’ and ‘rn’. - If with_filename is - Truethe output will include the file name. This is similar to- ReadFromTextWithFilenamebut this- PTransformcan be placed anywhere in the pipeline.- This implementation only supports reading text encoded using UTF-8 or ASCII. This does not support other encodings such as UTF-16 or UTF-32. - This implementation is only tested with batch pipeline. In streaming, reading may happen with delay due to the limitation in ReShuffle involved. - Initialize the - ReadAllFromTexttransform.- Parameters: - min_bundle_size – Minimum size of bundles that should be generated when
splitting this source into bundles. See FileBasedSourcefor more details.
- desired_bundle_size – Desired size of bundles that should be generated when
splitting this source into bundles. See FileBasedSourcefor more details.
- compression_type – Used to handle compressed input files. Typical value
is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
- strip_trailing_newlines – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
- validate – flag to verify that the files exist during the pipeline creation time.
- skip_header_lines – Number of header lines to skip. Same number is skipped from each source file. Must be 0 or higher. Large number of skipped lines might impact performance.
- coder – Coder used to decode each line.
- with_filename – If True, returns a Key Value with the key being the file name and the value being the actual data. If False, it only returns the data.
- delimiter (bytes) – delimiter to split records. Must not self-overlap, because self-overlapping delimiters cause ambiguous parsing.
- escapechar (bytes) – a single byte to escape the records delimiter, can also escape itself.
 - 
DEFAULT_DESIRED_BUNDLE_SIZE= 67108864¶
 
- min_bundle_size – Minimum size of bundles that should be generated when
splitting this source into bundles. See 
- 
class apache_beam.io.textio.ReadAllFromTextContinuously(file_pattern, **kwargs)[source]¶
- Bases: - apache_beam.io.textio.ReadAllFromText- A - PTransformfor reading text files in given file patterns. This PTransform acts as a Source and produces continuously a- PCollectionof strings.- For more details, see - ReadAllFromTextfor text parsing settings; see- apache_beam.io.fileio.MatchContinuouslyfor watching settings.- ReadAllFromTextContinuously is experimental. No backwards-compatibility guarantees. Due to the limitation on Reshuffle, current implementation does not scale. - Initialize the - ReadAllFromTextContinuouslytransform.- Accepts args for constructor args of both - ReadAllFromTextand- MatchContinuously.
- 
class apache_beam.io.textio.ReadFromText(file_pattern=None, min_bundle_size=0, compression_type='auto', strip_trailing_newlines=True, coder=StrUtf8Coder, validate=True, skip_header_lines=0, delimiter=None, escapechar=None, **kwargs)[source]¶
- Bases: - apache_beam.transforms.ptransform.PTransform- A - PTransformfor reading text files.- Parses a text file as newline-delimited elements, by default assuming - UTF-8encoding. Supports newline delimiters- \nand- \r\nor specified delimiter .- This implementation only supports reading text encoded using - UTF-8or- ASCII. This does not support other encodings such as- UTF-16or- UTF-32.- Initialize the - ReadFromTexttransform.- Parameters: - file_pattern (str) – The file path to read from as a local file path or a
GCS gs://path. The path can contain glob characters (*,?, and[...]sets).
- min_bundle_size (int) – Minimum size of bundles that should be generated
when splitting this source into bundles. See
FileBasedSourcefor more details.
- compression_type (str) – Used to handle compressed input files.
Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
- strip_trailing_newlines (bool) – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
- validate (bool) – flag to verify that the files exist during the pipeline creation time.
- skip_header_lines (int) – Number of header lines to skip. Same number is skipped from each source file. Must be 0 or higher. Large number of skipped lines might impact performance.
- coder (Coder) – Coder used to decode each line.
- delimiter (bytes) – delimiter to split records. Must not self-overlap, because self-overlapping delimiters cause ambiguous parsing.
- escapechar (bytes) – a single byte to escape the records delimiter, can also escape itself.
 
- file_pattern (str) – The file path to read from as a local file path or a
GCS 
- 
class apache_beam.io.textio.ReadFromTextWithFilename(file_pattern=None, min_bundle_size=0, compression_type='auto', strip_trailing_newlines=True, coder=StrUtf8Coder, validate=True, skip_header_lines=0, delimiter=None, escapechar=None, **kwargs)[source]¶
- Bases: - apache_beam.io.textio.ReadFromText- A - ReadFromTextfor reading text files returning the name of the file and the content of the file.- This class extend ReadFromText class just setting a different _source_class attribute. - Initialize the - ReadFromTexttransform.- Parameters: - file_pattern (str) – The file path to read from as a local file path or a
GCS gs://path. The path can contain glob characters (*,?, and[...]sets).
- min_bundle_size (int) – Minimum size of bundles that should be generated
when splitting this source into bundles. See
FileBasedSourcefor more details.
- compression_type (str) – Used to handle compressed input files.
Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
- strip_trailing_newlines (bool) – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
- validate (bool) – flag to verify that the files exist during the pipeline creation time.
- skip_header_lines (int) – Number of header lines to skip. Same number is skipped from each source file. Must be 0 or higher. Large number of skipped lines might impact performance.
- coder (Coder) – Coder used to decode each line.
- delimiter (bytes) – delimiter to split records. Must not self-overlap, because self-overlapping delimiters cause ambiguous parsing.
- escapechar (bytes) – a single byte to escape the records delimiter, can also escape itself.
 
- file_pattern (str) – The file path to read from as a local file path or a
GCS 
- 
class apache_beam.io.textio.WriteToText(file_path_prefix, file_name_suffix='', append_trailing_newlines=True, num_shards=0, shard_name_template=None, coder=ToBytesCoder, compression_type='auto', header=None, footer=None, *, max_records_per_shard=None, max_bytes_per_shard=None, skip_if_empty=False)[source]¶
- Bases: - apache_beam.transforms.ptransform.PTransform- A - PTransformfor writing to text files.- Initialize a - WriteToTexttransform.- Parameters: - file_path_prefix (str) – The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards), and end in a common extension, if given by file_name_suffix. In most cases, only this argument is specified and num_shards, shard_name_template, and file_name_suffix use default values.
- file_name_suffix (str) – Suffix for the files written.
- append_trailing_newlines (bool) – indicate whether this sink should write an additional newline char after writing each element.
- num_shards (int) – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards. Constraining the number of shards is likely to reduce the performance of a pipeline. Setting this value is not recommended unless you require a specific number of output files.
- shard_name_template (str) – A template string containing placeholders for
the shard number and shard count. Currently only ''and'-SSSSS-of-NNNNN'are patterns accepted by the service. When constructing a filename for a particular shard number, the upper-case lettersSandNare replaced with the0-padded shard number and shard count respectively. This argument can be''in which case it behaves as if num_shards was set to 1 and only one file will be generated. The default pattern used is'-SSSSS-of-NNNNN'.
- coder (Coder) – Coder used to encode each line.
- compression_type (str) – Used to handle compressed output files.
Typical value is CompressionTypes.AUTO, in which case the final file path’s extension (as determined by file_path_prefix, file_name_suffix, num_shards and shard_name_template) will be used to detect the compression.
- header (str) – String to write at beginning of file as a header.
If not Noneand append_trailing_newlines is set,\nwill be added.
- footer (str) – String to write at the end of file as a footer.
If not Noneand append_trailing_newlines is set,\nwill be added.
- max_records_per_shard – Maximum number of records to write to any individual shard.
- max_bytes_per_shard – Target maximum number of bytes to write to any individual shard. This may be exceeded slightly, as a new shard is created once this limit is hit, but the remainder of a given record, a subsequent newline, and a footer may cause the actual shard size to exceed this value. This also tracks the uncompressed, not compressed, size of the shard.
- skip_if_empty – Don’t write any shards if the PCollection is empty.