T
- The type to read from the compressed file.@Experimental(value=SOURCE_SINK) public class CompressedSource<T> extends FileBasedSource<T>
CompressedSources
wraps a delegate
FileBasedSource
that is able to read the decompressed file format.
For example, use the following to read from a gzip-compressed file-based source:
FileBasedSource<T> mySource = ...;
PCollection<T> collection = p.apply(Read.from(CompressedSource
.from(mySource)
.withDecompression(CompressedSource.CompressionMode.GZIP)));
Supported compression algorithms are CompressedSource.CompressionMode.GZIP
,
CompressedSource.CompressionMode.BZIP2
, CompressedSource.CompressionMode.ZIP
and CompressedSource.CompressionMode.DEFLATE
.
User-defined compression types are supported by implementing
CompressedSource.DecompressingChannelFactory
.
By default, the compression algorithm is selected from those supported in
CompressedSource.CompressionMode
based on the file name provided to the source, namely
".bz2"
indicates CompressedSource.CompressionMode.BZIP2
, ".gz"
indicates
CompressedSource.CompressionMode.GZIP
, ".zip"
indicates CompressedSource.CompressionMode.ZIP
and
".deflate"
indicates CompressedSource.CompressionMode.DEFLATE
. If the file name does not match
any of the supported
algorithms, it is assumed to be uncompressed data.
Modifier and Type | Class and Description |
---|---|
static class |
CompressedSource.CompressedReader<T>
Reader for a
CompressedSource . |
static class |
CompressedSource.CompressionMode
Default compression types supported by the
CompressedSource . |
static interface |
CompressedSource.DecompressingChannelFactory
Factory interface for creating channels that decompress the content of an underlying channel.
|
FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
OffsetBasedSource.OffsetBasedReader<T>
BoundedSource.BoundedReader<T>
Source.Reader<T>
Modifier and Type | Method and Description |
---|---|
protected FileBasedSource<T> |
createForSubrangeOfFile(MatchResult.Metadata metadata,
long start,
long end)
Creates a
CompressedSource for a subrange of a file. |
protected FileBasedSource.FileBasedReader<T> |
createSingleFileReader(PipelineOptions options)
Creates a
FileBasedReader to read a single file. |
static <T> CompressedSource<T> |
from(FileBasedSource<T> sourceDelegate)
Creates a
CompressedSource from an underlying FileBasedSource . |
CompressedSource.DecompressingChannelFactory |
getChannelFactory() |
Coder<T> |
getDefaultOutputCoder()
Returns the delegate source's default output coder.
|
protected boolean |
isSplittable()
Determines whether a single file represented by this source is splittable.
|
void |
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
|
void |
validate()
Validates that the delegate source is a valid source and that the channel factory is not null.
|
CompressedSource<T> |
withDecompression(CompressedSource.DecompressingChannelFactory channelFactory)
Return a
CompressedSource that is like this one but will decompress its underlying file
with the given CompressedSource.DecompressingChannelFactory . |
createReader, createSourceForSubrange, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, split, toString
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
public static <T> CompressedSource<T> from(FileBasedSource<T> sourceDelegate)
CompressedSource
from an underlying FileBasedSource
. The type
of compression used will be based on the file name extension unless explicitly
configured via withDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory)
.public CompressedSource<T> withDecompression(CompressedSource.DecompressingChannelFactory channelFactory)
CompressedSource
that is like this one but will decompress its underlying file
with the given CompressedSource.DecompressingChannelFactory
.public void validate()
validate
in class FileBasedSource<T>
protected FileBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end)
CompressedSource
for a subrange of a file. Called by superclass to create a
source for a single file.createForSubrangeOfFile
in class FileBasedSource<T>
metadata
- file backing the new FileBasedSource
.start
- starting byte offset of the new FileBasedSource
.end
- ending byte offset of the new FileBasedSource
. May be Long.MAX_VALUE,
in which case it will be inferred using FileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions)
.protected final boolean isSplittable() throws java.lang.Exception
isSplittable
in class FileBasedSource<T>
java.lang.Exception
protected final FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
FileBasedReader
to read a single file.
Uses the delegate source to create a single file reader for the delegate source. Utilizes the default decompression channel factory to not wrap the source reader if the file name does not represent a compressed file allowing for splitting of the source.
createSingleFileReader
in class FileBasedSource<T>
public void populateDisplayData(DisplayData.Builder builder)
Source
populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData)
. Implementations may call
super.populateDisplayData(builder)
in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder)
to use
the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData
in interface HasDisplayData
populateDisplayData
in class FileBasedSource<T>
builder
- The builder to populate with display data.HasDisplayData
public final Coder<T> getDefaultOutputCoder()
getDefaultOutputCoder
in class Source<T>
public final CompressedSource.DecompressingChannelFactory getChannelFactory()