T - The type to read from the compressed file.

@Experimental(value=SOURCE_SINK)
public class CompressedSource<T> extends FileBasedSource<T>

A CompressedSource wraps a delegate FileBasedSource that is able to read the decompressed file format.
For example, use the following to read from a gzip-compressed file-based source:
FileBasedSource<T> mySource = ...;
PCollection<T> collection = p.apply(Read.from(CompressedSource
.from(mySource)
.withCompression(Compression.GZIP)));
Supported compression algorithms are Compression.GZIP, Compression.BZIP2, Compression.ZIP, Compression.ZSTD, Compression.LZO, Compression.LZOP, Compression.SNAPPY, and Compression.DEFLATE. User-defined compression types are supported by implementing a CompressedSource.DecompressingChannelFactory.
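The idea behind a decompressing channel factory (take a channel over compressed bytes, return a channel over decompressed bytes) can be sketched with the JDK alone. This is a minimal illustration, not Beam code: the `ChannelWrapper` interface and the `GzipChannelSketch` class are hypothetical names introduced here, and they do not reproduce the signature of CompressedSource.DecompressingChannelFactory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipChannelSketch {
  /** Hypothetical stand-in for a channel-wrapping factory; not the Beam interface. */
  interface ChannelWrapper {
    ReadableByteChannel wrap(ReadableByteChannel compressed) throws IOException;
  }

  /** Wraps a channel of gzip bytes in a channel that yields the decompressed bytes. */
  static final ChannelWrapper GZIP =
      compressed ->
          Channels.newChannel(new GZIPInputStream(Channels.newInputStream(compressed)));

  /** Compresses text in memory so the sketch has something to decompress. */
  static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return out.toByteArray();
  }

  /** Drains a channel to a UTF-8 string. */
  static String readAll(ReadableByteChannel channel) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ByteBuffer buf = ByteBuffer.allocate(64);
    while (channel.read(buf) != -1) {
      buf.flip();
      out.write(buf.array(), 0, buf.limit());
      buf.clear();
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  /** Compresses the text, then reads it back through the wrapped channel. */
  static String roundTrip(String text) {
    try {
      ReadableByteChannel raw = Channels.newChannel(new ByteArrayInputStream(gzip(text)));
      try (ReadableByteChannel plain = GZIP.wrap(raw)) {
        return readAll(plain);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("hello, compressed world"));
  }
}
```

A user-defined compression type would follow the same shape, substituting its own decompressing stream for `GZIPInputStream`.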
By default, the compression algorithm is selected from those supported in Compression based on the file name provided to the source: ".bz2" indicates Compression.BZIP2, ".gz" indicates Compression.GZIP, ".zip" indicates Compression.ZIP, ".zst" indicates Compression.ZSTD, ".lzo_deflate" indicates Compression.LZO, ".lzo" indicates Compression.LZOP, ".snappy" indicates Compression.SNAPPY, and ".deflate" indicates Compression.DEFLATE. If the file name does not match any of the supported algorithms, it is assumed to be uncompressed data.
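The extension rules above amount to a suffix lookup with an uncompressed fallback. The following is a stand-alone sketch of that lookup, not Beam's implementation: the class `CompressionByExtension`, the method `detect`, and the string labels are hypothetical, and exact (case-sensitive) suffix matching is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CompressionByExtension {
  // Suffix-to-algorithm table taken from the description above.
  private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();

  static {
    SUFFIXES.put(".bz2", "BZIP2");
    SUFFIXES.put(".gz", "GZIP");
    SUFFIXES.put(".zip", "ZIP");
    SUFFIXES.put(".zst", "ZSTD");
    SUFFIXES.put(".lzo_deflate", "LZO");
    SUFFIXES.put(".lzo", "LZOP");
    SUFFIXES.put(".snappy", "SNAPPY");
    SUFFIXES.put(".deflate", "DEFLATE");
  }

  /** Returns the label for the first matching suffix, or "UNCOMPRESSED" if none matches. */
  static String detect(String fileName) {
    for (Map.Entry<String, String> entry : SUFFIXES.entrySet()) {
      // Assumption: suffixes are matched exactly, without case folding.
      if (fileName.endsWith(entry.getKey())) {
        return entry.getValue();
      }
    }
    return "UNCOMPRESSED";
  }

  public static void main(String[] args) {
    for (String name : new String[] {"events.gz", "dump.lzo_deflate", "notes.txt"}) {
      System.out.println(name + " -> " + detect(name));
    }
  }
}
```

Note that ".lzo_deflate" and ".deflate" do not collide, because a name ending in "_deflate" does not end in ".deflate".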
Modifier and Type | Class and Description |
---|---|
static class | CompressedSource.CompressedReader<T> Reader for a CompressedSource. |
static class | CompressedSource.CompressionMode Deprecated. Use Compression instead. |
static interface | CompressedSource.DecompressingChannelFactory Factory interface for creating channels that decompress the content of an underlying channel. |
FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
OffsetBasedSource.OffsetBasedReader<T>
BoundedSource.BoundedReader<T>
Source.Reader<T>
Modifier and Type | Method and Description |
---|---|
protected FileBasedSource<T> | createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end) Creates a CompressedSource for a subrange of a file. |
protected FileBasedSource.FileBasedReader<T> | createSingleFileReader(PipelineOptions options) Creates a FileBasedReader to read a single file. |
static <T> CompressedSource<T> | from(FileBasedSource<T> sourceDelegate) Creates a CompressedSource from an underlying FileBasedSource. |
CompressedSource.DecompressingChannelFactory | getChannelFactory() |
Coder<T> | getOutputCoder() Returns the delegate source's output coder. |
protected boolean | isSplittable() Determines whether a single file represented by this source is splittable. |
void | populateDisplayData(DisplayData.Builder builder) Register display data for the given transform or component. |
void | validate() Validates that the delegate source is a valid source and that the channel factory is not null. |
CompressedSource<T> | withCompression(Compression compression) Like withDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory) but takes a canonical Compression. |
CompressedSource<T> | withDecompression(CompressedSource.DecompressingChannelFactory channelFactory) Return a CompressedSource that is like this one but will decompress its underlying file with the given CompressedSource.DecompressingChannelFactory. |
createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, split, toString
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
getDefaultOutputCoder
public static <T> CompressedSource<T> from(FileBasedSource<T> sourceDelegate)
Creates a CompressedSource from an underlying FileBasedSource. The type of compression used will be based on the file name extension unless explicitly configured via withDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory).

public CompressedSource<T> withDecompression(CompressedSource.DecompressingChannelFactory channelFactory)
Return a CompressedSource that is like this one but will decompress its underlying file with the given CompressedSource.DecompressingChannelFactory.

public CompressedSource<T> withCompression(Compression compression)
Like withDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory) but takes a canonical Compression.

public void validate()
Validates that the delegate source is a valid source and that the channel factory is not null.
validate in class FileBasedSource<T>

protected FileBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end)
Creates a CompressedSource for a subrange of a file. Called by superclass to create a source for a single file.
createForSubrangeOfFile in class FileBasedSource<T>
Parameters:
metadata - file backing the new FileBasedSource.
start - starting byte offset of the new FileBasedSource.
end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in which case it will be inferred using FileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).

protected final boolean isSplittable()
Determines whether a single file represented by this source is splittable.
isSplittable in class FileBasedSource<T>

protected final FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
Creates a FileBasedReader to read a single file.
Uses the delegate source to create a single-file reader. If the file name does not indicate a compressed file, the default decompression channel factory leaves the delegate reader unwrapped, which allows the source to be split.
createSingleFileReader in class FileBasedSource<T>

public void populateDisplayData(DisplayData.Builder builder)
Description copied from class: Source
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData in interface HasDisplayData
populateDisplayData in class FileBasedSource<T>
Parameters:
builder - The builder to populate with display data.
See Also:
HasDisplayData

public final Coder<T> getOutputCoder()
Returns the delegate source's output coder.
getOutputCoder in class Source<T>

public final CompressedSource.DecompressingChannelFactory getChannelFactory()