Class CompressedSource<T>
- Type Parameters:
T
- The type to read from the compressed file.
- All Implemented Interfaces:
Serializable
,HasDisplayData
CompressedSources
wraps a delegate FileBasedSource
that is able to read the decompressed file format.
For example, use the following to read from a gzip-compressed file-based source:
FileBasedSource<T> mySource = ...;
PCollection<T> collection = p.apply(Read.from(CompressedSource
.from(mySource)
.withCompression(Compression.GZIP)));
Supported compression algorithms are Compression.GZIP
, Compression.BZIP2
,
Compression.ZIP
, Compression.ZSTD
, Compression.LZO
, Compression.LZOP
, Compression.SNAPPY
, and Compression.DEFLATE
. User-defined
compression types are supported by implementing a CompressedSource.DecompressingChannelFactory
.
By default, the compression algorithm is selected from those supported in Compression
based on the file name provided to the source, namely ".bz2"
indicates Compression.BZIP2
, ".gz"
indicates Compression.GZIP
, ".zip"
indicates
Compression.ZIP
, ".zst"
indicates Compression.ZSTD
,
".lzo_deflate"
indicates Compression.LZO
, ".lzo"
indicates Compression.LZOP
, ".snappy"
indicted Compression.SNAPPY
, and ".deflate"
indicates Compression.DEFLATE
. If the file name does not match any of the supported
algorithms, it is assumed to be uncompressed data.
- See Also:
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic class
Reader for aCompressedSource
.static enum
Deprecated.static interface
Factory interface for creating channels that decompress the content of an underlying channel.Nested classes/interfaces inherited from class org.apache.beam.sdk.io.FileBasedSource
FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.OffsetBasedSource
OffsetBasedSource.OffsetBasedReader<T>
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BoundedSource
BoundedSource.BoundedReader<T>
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T>
-
Method Summary
Modifier and TypeMethodDescriptionprotected FileBasedSource
<T> createForSubrangeOfFile
(MatchResult.Metadata metadata, long start, long end) Creates aCompressedSource
for a subrange of a file.protected final FileBasedSource.FileBasedReader
<T> createSingleFileReader
(PipelineOptions options) Creates aFileBasedReader
to read a single file.static <T> CompressedSource
<T> from
(FileBasedSource<T> sourceDelegate) Creates aCompressedSource
from an underlyingFileBasedSource
.Returns the delegate source's output coder.protected final boolean
Determines whether a single file represented by this source is splittable.void
populateDisplayData
(DisplayData.Builder builder) Register display data for the given transform or component.void
validate()
Validates that the delegate source is a valid source and that the channel factory is not null.withCompression
(Compression compression) LikewithDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory)
but takes a canonicalCompression
.withDecompression
(CompressedSource.DecompressingChannelFactory channelFactory) Return aCompressedSource
that is like this one but will decompress its underlying file with the givenCompressedSource.DecompressingChannelFactory
.Methods inherited from class org.apache.beam.sdk.io.FileBasedSource
createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, split, toString
Methods inherited from class org.apache.beam.sdk.io.OffsetBasedSource
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
Methods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder
-
Method Details
-
from
Creates aCompressedSource
from an underlyingFileBasedSource
. The type of compression used will be based on the file name extension unless explicitly configured viawithDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory)
. -
withDecompression
public CompressedSource<T> withDecompression(CompressedSource.DecompressingChannelFactory channelFactory) Return aCompressedSource
that is like this one but will decompress its underlying file with the givenCompressedSource.DecompressingChannelFactory
. -
withCompression
LikewithDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory)
but takes a canonicalCompression
. -
validate
public void validate()Validates that the delegate source is a valid source and that the channel factory is not null.- Overrides:
validate
in classFileBasedSource<T>
-
createForSubrangeOfFile
protected FileBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end) Creates aCompressedSource
for a subrange of a file. Called by superclass to create a source for a single file.- Specified by:
createForSubrangeOfFile
in classFileBasedSource<T>
- Parameters:
metadata
- file backing the newFileBasedSource
.start
- starting byte offset of the newFileBasedSource
.end
- ending byte offset of the newFileBasedSource
. May be Long.MAX_VALUE, in which case it will be inferred usingFileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions)
.
-
isSplittable
protected final boolean isSplittable()Determines whether a single file represented by this source is splittable. Returns true if we are using the default decompression factory and it determines from the requested file name that the file is not compressed.- Overrides:
isSplittable
in classFileBasedSource<T>
-
createSingleFileReader
Creates aFileBasedReader
to read a single file.Uses the delegate source to create a single file reader for the delegate source. Utilizes the default decompression channel factory to not wrap the source reader if the file name does not represent a compressed file allowing for splitting of the source.
- Specified by:
createSingleFileReader
in classFileBasedSource<T>
-
populateDisplayData
Description copied from class:Source
Register display data for the given transform or component.populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData)
. Implementations may callsuper.populateDisplayData(builder)
in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)
to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Overrides:
populateDisplayData
in classFileBasedSource<T>
- Parameters:
builder
- The builder to populate with display data.- See Also:
-
getOutputCoder
Returns the delegate source's output coder.- Overrides:
getOutputCoder
in classSource<T>
-
getChannelFactory
-
Compression
instead