apache_beam.io.textio module
A source and a sink for reading from and writing to text files.
class apache_beam.io.textio.ReadAllFromText(min_bundle_size=0, desired_bundle_size=67108864, compression_type='auto', strip_trailing_newlines=True, coder=StrUtf8Coder, skip_header_lines=0, **kwargs)
Bases: apache_beam.transforms.ptransform.PTransform
A PTransform for reading a PCollection of text files.
Reads a PCollection of text files or file patterns and produces a PCollection of strings.
Parses a text file as newline-delimited elements, by default assuming UTF-8 encoding. Supports newline delimiters \n and \r\n.
This implementation only supports reading text encoded using UTF-8 or ASCII. It does not support other encodings such as UTF-16 or UTF-32.
Initialize the ReadAllFromText transform.
Parameters:
- min_bundle_size – Minimum size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
- desired_bundle_size – Desired size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
- compression_type – Used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
- strip_trailing_newlines – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
- validate – Flag to verify that the files exist during the pipeline creation time.
- skip_header_lines – Number of header lines to skip. The same number is skipped from each source file. Must be 0 or higher. A large number of skipped lines might impact performance.
- coder – Coder used to decode each line.
DEFAULT_DESIRED_BUNDLE_SIZE = 67108864
class apache_beam.io.textio.ReadFromText(file_pattern=None, min_bundle_size=0, compression_type='auto', strip_trailing_newlines=True, coder=StrUtf8Coder, validate=True, skip_header_lines=0, **kwargs)
Bases: apache_beam.transforms.ptransform.PTransform
A PTransform for reading text files.
Parses a text file as newline-delimited elements, by default assuming UTF-8 encoding. Supports newline delimiters \n and \r\n.
This implementation only supports reading text encoded using UTF-8 or ASCII. It does not support other encodings such as UTF-16 or UTF-32.
Initialize the ReadFromText transform.
Parameters:
- file_pattern (str) – The file path to read from as a local file path or a GCS gs:// path. The path can contain glob characters (*, ?, and [...] sets).
- min_bundle_size (int) – Minimum size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
- compression_type (str) – Used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
- strip_trailing_newlines (bool) – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
- validate (bool) – Flag to verify that the files exist during the pipeline creation time.
- skip_header_lines (int) – Number of header lines to skip. The same number is skipped from each source file. Must be 0 or higher. A large number of skipped lines might impact performance.
- coder (Coder) – Coder used to decode each line.
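The interaction of the newline delimiters, strip_trailing_newlines, and skip_header_lines can be pictured with a small pure-Python model. This is only a sketch of the documented behavior for a single file held in memory, not the library's implementation. The function name split_records is hypothetical, and UTF-8 decoding stands in for the default coder.

```python
def split_records(data, strip_trailing_newlines=True, skip_header_lines=0):
    """Illustrative model: split the bytes of one file into decoded lines.

    Lines end at "\n"; a preceding "\r" is treated as part of the delimiter.
    """
    records = []
    start = 0
    while start < len(data):
        end = data.find(b"\n", start)
        if end == -1:
            # Last line without a trailing newline.
            raw = data[start:]
            start = len(data)
        else:
            raw = data[start:end + 1]
            start = end + 1
        if strip_trailing_newlines:
            # The delimiter is removed before the line is decoded.
            if raw.endswith(b"\r\n"):
                raw = raw[:-2]
            elif raw.endswith(b"\n"):
                raw = raw[:-1]
        records.append(raw.decode("utf-8"))
    # The same number of header lines is dropped from every file.
    return records[skip_header_lines:]


csv_lines = split_records(b"id,name\r\n1,a\n2,b", skip_header_lines=1)
raw_lines = split_records(b"a\nb\n", strip_trailing_newlines=False)
```

In this model, csv_lines holds the two data rows without their delimiters, while raw_lines keeps the trailing \n on each element.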
class apache_beam.io.textio.ReadFromTextWithFilename(file_pattern=None, min_bundle_size=0, compression_type='auto', strip_trailing_newlines=True, coder=StrUtf8Coder, validate=True, skip_header_lines=0, **kwargs)
Bases: apache_beam.io.textio.ReadFromText
A ReadFromText for reading text files that returns the name of the file along with the content of the file.
This class extends ReadFromText and only sets a different _source_class attribute.
Initialize the ReadFromText transform.
Parameters:
- file_pattern (str) – The file path to read from as a local file path or a GCS gs:// path. The path can contain glob characters (*, ?, and [...] sets).
- min_bundle_size (int) – Minimum size of bundles that should be generated when splitting this source into bundles. See FileBasedSource for more details.
- compression_type (str) – Used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the underlying file_path’s extension will be used to detect the compression.
- strip_trailing_newlines (bool) – Indicates whether this source should remove the newline char in each line it reads before decoding that line.
- validate (bool) – Flag to verify that the files exist during the pipeline creation time.
- skip_header_lines (int) – Number of header lines to skip. The same number is skipped from each source file. Must be 0 or higher. A large number of skipped lines might impact performance.
- coder (Coder) – Coder used to decode each line.
class apache_beam.io.textio.WriteToText(file_path_prefix, file_name_suffix='', append_trailing_newlines=True, num_shards=0, shard_name_template=None, coder=ToBytesCoder, compression_type='auto', header=None)
Bases: apache_beam.transforms.ptransform.PTransform
A PTransform for writing to text files.
Initialize a WriteToText transform.
Parameters:
- file_path_prefix (str) – The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards), and end in a common extension, if given by file_name_suffix. In most cases, only this argument is specified and num_shards, shard_name_template, and file_name_suffix use default values.
- file_name_suffix (str) – Suffix for the files written.
- append_trailing_newlines (bool) – Indicates whether this sink should write an additional newline char after writing each element.
- num_shards (int) – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards. Constraining the number of shards is likely to reduce the performance of a pipeline. Setting this value is not recommended unless you require a specific number of output files.
- shard_name_template (str) – A template string containing placeholders for the shard number and shard count. Currently only '' and '-SSSSS-of-NNNNN' are patterns accepted by the service. When constructing a filename for a particular shard number, the upper-case letters S and N are replaced with the 0-padded shard number and shard count respectively. This argument can be '', in which case it behaves as if num_shards was set to 1 and only one file will be generated. The default pattern used is '-SSSSS-of-NNNNN'.
- coder (Coder) – Coder used to encode each line.
- compression_type (str) – Used to handle compressed output files. Typical value is CompressionTypes.AUTO, in which case the final file path’s extension (as determined by file_path_prefix, file_name_suffix, num_shards and shard_name_template) will be used to detect the compression.
- header (str) – String to write at beginning of file as a header. If not None and append_trailing_newlines is set, \n will be added.
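To make the naming rule for output files concrete, the sketch below expands a shard_name_template the way the parameter description reads: each run of S is replaced by the zero-padded shard number and each run of N by the zero-padded shard count. The helper shard_file_name is hypothetical and written only for illustration. It is not a function of the library, and the paths are made up.

```python
import re

DEFAULT_SHARD_NAME_TEMPLATE = "-SSSSS-of-NNNNN"


def shard_file_name(prefix, shard_index, num_shards,
                    template=DEFAULT_SHARD_NAME_TEMPLATE, suffix=""):
    """Illustrative helper: build prefix + expanded template + suffix."""

    def zero_pad(value):
        # Pad the number to the width of the matched run of letters.
        return lambda match: str(value).zfill(len(match.group(0)))

    shard_part = re.sub(r"S+", zero_pad(shard_index), template)
    shard_part = re.sub(r"N+", zero_pad(num_shards), shard_part)
    return prefix + shard_part + suffix


sharded = shard_file_name("/tmp/out", 3, 10, suffix=".txt")
single = shard_file_name("/tmp/out", 0, 1, template="", suffix=".txt")
```

With the default template, shard 3 of 10 maps to /tmp/out-00003-of-00010.txt. With the empty template there is no shard identifier at all, which matches the note that '' leads to a single output file.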