T - The type to read from the compressed file.@Experimental(value=SOURCE_SINK) public class CompressedSource<T> extends FileBasedSource<T>
CompressedSources wraps a delegate
FileBasedSource that is able to read the decompressed file format.
For example, use the following to read from a gzip-compressed file-based source:
FileBasedSource<T> mySource = ...;
PCollection<T> collection = p.apply(Read.from(CompressedSource
.from(mySource)
.withDecompression(CompressedSource.CompressionMode.GZIP)));
Supported compression algorithms are CompressedSource.CompressionMode.GZIP,
CompressedSource.CompressionMode.BZIP2, CompressedSource.CompressionMode.ZIP and CompressedSource.CompressionMode.DEFLATE.
User-defined compression types are supported by implementing
CompressedSource.DecompressingChannelFactory.
By default, the compression algorithm is selected from those supported in
CompressedSource.CompressionMode based on the file name provided to the source, namely
".bz2" indicates CompressedSource.CompressionMode.BZIP2, ".gz" indicates
CompressedSource.CompressionMode.GZIP, ".zip" indicates CompressedSource.CompressionMode.ZIP and
".deflate" indicates CompressedSource.CompressionMode.DEFLATE. If the file name does not match
any of the supported
algorithms, it is assumed to be uncompressed data.
| Modifier and Type | Class and Description |
|---|---|
static class |
CompressedSource.CompressedReader<T>
Reader for a
CompressedSource. |
static class |
CompressedSource.CompressionMode
Default compression types supported by the
CompressedSource. |
static interface |
CompressedSource.DecompressingChannelFactory
Factory interface for creating channels that decompress the content of an underlying channel.
|
FileBasedSource.FileBasedReader<T>, FileBasedSource.ModeOffsetBasedSource.OffsetBasedReader<T>BoundedSource.BoundedReader<T>Source.Reader<T>| Modifier and Type | Method and Description |
|---|---|
protected FileBasedSource<T> |
createForSubrangeOfFile(MatchResult.Metadata metadata,
long start,
long end)
Creates a
CompressedSource for a subrange of a file. |
protected FileBasedSource.FileBasedReader<T> |
createSingleFileReader(PipelineOptions options)
Creates a
FileBasedReader to read a single file. |
static <T> CompressedSource<T> |
from(FileBasedSource<T> sourceDelegate)
Creates a
CompressedSource from an underlying FileBasedSource. |
CompressedSource.DecompressingChannelFactory |
getChannelFactory() |
Coder<T> |
getDefaultOutputCoder()
Returns the delegate source's default output coder.
|
protected boolean |
isSplittable()
Determines whether a single file represented by this source is splittable.
|
void |
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
|
void |
validate()
Validates that the delegate source is a valid source and that the channel factory is not null.
|
CompressedSource<T> |
withDecompression(CompressedSource.DecompressingChannelFactory channelFactory)
Return a
CompressedSource that is like this one but will decompress its underlying file
with the given CompressedSource.DecompressingChannelFactory. |
createReader, createSourceForSubrange, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, split, toStringgetBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffsetpublic static <T> CompressedSource<T> from(FileBasedSource<T> sourceDelegate)
CompressedSource from an underlying FileBasedSource. The type
of compression used will be based on the file name extension unless explicitly
configured via withDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory).public CompressedSource<T> withDecompression(CompressedSource.DecompressingChannelFactory channelFactory)
CompressedSource that is like this one but will decompress its underlying file
with the given CompressedSource.DecompressingChannelFactory.public void validate()
validate in class FileBasedSource<T>protected FileBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end)
CompressedSource for a subrange of a file. Called by superclass to create a
source for a single file.createForSubrangeOfFile in class FileBasedSource<T>metadata - file backing the new FileBasedSource.start - starting byte offset of the new FileBasedSource.end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE,
in which case it will be inferred using FileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).protected final boolean isSplittable()
throws java.lang.Exception
isSplittable in class FileBasedSource<T>java.lang.Exceptionprotected final FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
FileBasedReader to read a single file.
Uses the delegate source to create a single file reader for the delegate source. Utilizes the default decompression channel factory to not wrap the source reader if the file name does not represent a compressed file allowing for splitting of the source.
createSingleFileReader in class FileBasedSource<T>public void populateDisplayData(DisplayData.Builder builder)
SourcepopulateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData). Implementations may call
super.populateDisplayData(builder) in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use
the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData in interface HasDisplayDatapopulateDisplayData in class FileBasedSource<T>builder - The builder to populate with display data.HasDisplayDatapublic final Coder<T> getDefaultOutputCoder()
getDefaultOutputCoder in class Source<T>public final CompressedSource.DecompressingChannelFactory getChannelFactory()