Type Parameters: T - the type of values written to the sink.
@Experimental(value=FILESYSTEM) public abstract class FileBasedSink<T> extends java.lang.Object implements java.io.Serializable, HasDisplayData
At pipeline construction time, the methods of FileBasedSink are called to validate the sink
and to create a FileBasedSink.WriteOperation
that manages the process of writing to the sink.
The process of writing to a file-based sink is as follows:
In order to ensure fault-tolerance, a bundle may be executed multiple times (e.g., in the
event of failure/retry or for redundancy). However, exactly one of these executions will have its
result passed to the finalize method. Each call to
FileBasedSink.Writer.openWindowed(java.lang.String, org.apache.beam.sdk.transforms.windowing.BoundedWindow, org.apache.beam.sdk.transforms.windowing.PaneInfo, int)
or FileBasedSink.Writer.openUnwindowed(java.lang.String, int) is passed a unique bundle id when it
is called by the WriteFiles transform, so even redundant or retried bundles will have a unique way
of identifying their output.
The bundle id should be used to guarantee that a bundle's output is unique. This uniqueness
guarantee is important; if a bundle is to be output to a file, for example, the name of the file
will encode the unique bundle id to avoid conflicts with other writers.
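The role of the bundle id can be illustrated with a minimal, JDK-only sketch. The class BundleFileNames, the method tempFileName, and the ".tmp" suffix are hypothetical names chosen for this example and are not part of the Beam API; the sketch also assumes that every execution attempt receives a fresh id (a random UUID here).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** Illustrative helper; hypothetical and not part of the Beam API. */
class BundleFileNames {

  /** Embeds the bundle id in the file name so each execution writes to its own file. */
  public static String tempFileName(String tempDirectory, String bundleId) {
    return tempDirectory + "/" + bundleId + ".tmp";
  }

  public static void main(String[] args) {
    Set<String> seen = new HashSet<>();
    // Simulate three executions of the same logical bundle (one run plus two retries).
    for (int attempt = 0; attempt < 3; attempt++) {
      // Assumption: each attempt is handed a fresh, unique id.
      String bundleId = UUID.randomUUID().toString();
      String name = tempFileName("/tmp/foo/.temp", bundleId);
      if (!seen.add(name)) {
        throw new IllegalStateException("Name collision: " + name);
      }
      System.out.println(name);
    }
  }
}
```

Because the id is part of the name, a retried bundle never overwrites the file of an earlier attempt, and only the attempt whose result reaches finalization needs to be kept.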
FileBasedSink can take a custom FileBasedSink.FilenamePolicy object to determine output
filenames, and this policy object can be used to write windowed or triggered
PCollections into separate files per window pane. This allows file output from unbounded
PCollections, and also works for bounded PCollections.
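As a rough illustration of per-window, per-pane naming, the sketch below builds an output path from a window start time, a pane index, and a shard number. It uses only the JDK. The class WindowedNames, the method windowedName, and the directory and file layout are assumptions made for this example; they do not reproduce the actual FileBasedSink.FilenamePolicy interface.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Illustrative naming helper; hypothetical and not part of the Beam API. */
class WindowedNames {

  // Formats the window start as a yyyy/MM/dd directory path in UTC.
  private static final DateTimeFormatter DAY =
      DateTimeFormatter.ofPattern("yyyy/MM/dd", Locale.ROOT).withZone(ZoneOffset.UTC);

  /** Returns a name that is distinct for each window, pane, and shard. */
  public static String windowedName(
      String baseDirectory,
      Instant windowStart,
      int paneIndex,
      int shard,
      int numShards,
      String extension) {
    return String.format(
        Locale.ROOT,
        "%s/%s/pane-%d-%05d-of-%05d%s",
        baseDirectory,
        DAY.format(windowStart),
        paneIndex,
        shard,
        numShards,
        extension);
  }

  public static void main(String[] args) {
    System.out.println(
        windowedName("output", Instant.parse("2017-06-01T12:00:00Z"), 0, 0, 3, ".txt"));
  }
}
```

Since the window and pane are encoded in the path, a later pane for the same window lands in a new file rather than replacing an earlier one.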
Supported file systems are those registered with FileSystems.
Modifier and Type | Class and Description |
---|---|
static class | FileBasedSink.CompressionType: Directly supported file output compression types. |
static class | FileBasedSink.FilenamePolicy: A naming policy for output files. |
static class | FileBasedSink.FileResult: Result of a single bundle write. |
static class | FileBasedSink.FileResultCoder: A coder for FileBasedSink.FileResult objects. |
static interface | FileBasedSink.WritableByteChannelFactory: Implementations create instances of WritableByteChannel used by FileBasedSink and related classes to allow decorating, or otherwise transforming, the raw data that would normally be written directly to the WritableByteChannel passed into FileBasedSink.WritableByteChannelFactory.create(WritableByteChannel). |
static class | FileBasedSink.WriteOperation<T>: Abstract operation that manages the process of writing to FileBasedSink. |
static class | FileBasedSink.Writer<T>: Abstract writer that writes a bundle to a FileBasedSink. |
Constructor and Description |
---|
FileBasedSink(ValueProvider<ResourceId> baseOutputDirectoryProvider, FileBasedSink.FilenamePolicy filenamePolicy): Construct a FileBasedSink with the given filename policy, producing uncompressed files. |
FileBasedSink(ValueProvider<ResourceId> baseOutputDirectoryProvider, FileBasedSink.FilenamePolicy filenamePolicy, FileBasedSink.WritableByteChannelFactory writableByteChannelFactory): Construct a FileBasedSink with the given filename policy and output channel type. |
Modifier and Type | Method and Description |
---|---|
static ResourceId | convertToFileResourceIfPossible(java.lang.String outputPrefix): A helper function that converts a user-provided output filename prefix into a ResourceId for writing output files. |
abstract FileBasedSink.WriteOperation<T> | createWriteOperation(): Return a subclass of FileBasedSink.WriteOperation that will manage the write to the sink. |
ValueProvider<ResourceId> | getBaseOutputDirectoryProvider(): Returns the base directory inside which files will be written according to the configured FileBasedSink.FilenamePolicy. |
protected java.lang.String | getExtension(): Returns the extension that will be written to the produced files. |
FileBasedSink.FilenamePolicy | getFilenamePolicy(): Returns the policy by which files will be named inside of the base output directory. |
void | populateDisplayData(DisplayData.Builder builder): Register display data for the given transform or component. |
void | validate(PipelineOptions options) |
@Experimental(value=FILESYSTEM) public FileBasedSink(ValueProvider<ResourceId> baseOutputDirectoryProvider, FileBasedSink.FilenamePolicy filenamePolicy)
Construct a FileBasedSink with the given filename policy, producing uncompressed files.
@Experimental(value=FILESYSTEM) public FileBasedSink(ValueProvider<ResourceId> baseOutputDirectoryProvider, FileBasedSink.FilenamePolicy filenamePolicy, FileBasedSink.WritableByteChannelFactory writableByteChannelFactory)
Construct a FileBasedSink with the given filename policy and output channel type.
@Experimental(value=FILESYSTEM) public static ResourceId convertToFileResourceIfPossible(java.lang.String outputPrefix)
A helper function that converts a user-provided output filename prefix into a ResourceId
for writing output files. See TextIO.Write.to(String) for an example use case.
Typically, the input prefix will be something like /tmp/foo/bar, and the user would
like output files to be named as /tmp/foo/bar-0-of-3.txt. Thus, this function tries to
interpret the provided string as a file ResourceId path.
However, this may fail, for example if the user gives a prefix that is a directory, e.g.,
/, gs://my-bucket, or c://. In that case, interpreting the string as a
file will fail and this function will return a directory ResourceId instead.
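The try-as-file, fall-back-to-directory behavior described above can be sketched with the JDK alone. Everything in this sketch is hypothetical: the Resource class stands in for ResourceId, and the matchNew rule (a spec that ends in "/" or names only a scheme and host cannot be a file) is a toy assumption, not the logic of any registered Beam file system.

```java
/** Illustrative only; these types and rules are assumptions, not the Beam API. */
class PrefixResolution {

  /** Hypothetical stand-in for a resolved resource. */
  public static final class Resource {
    public final String spec;
    public final boolean isDirectory;

    Resource(String spec, boolean isDirectory) {
      this.spec = spec;
      this.isDirectory = isDirectory;
    }
  }

  /** Toy resolver: rejects directory-like specs when a file is requested. */
  static Resource matchNew(String spec, boolean isDirectory) {
    boolean directoryLike = spec.endsWith("/") || spec.matches("[a-z]+://[^/]+");
    if (!isDirectory && directoryLike) {
      throw new IllegalArgumentException("Not a file path: " + spec);
    }
    return new Resource(spec, isDirectory);
  }

  /** Tries the prefix as a file first and falls back to a directory on failure. */
  public static Resource convertToFileIfPossible(String outputPrefix) {
    try {
      return matchNew(outputPrefix, false);
    } catch (IllegalArgumentException e) {
      return matchNew(outputPrefix, true);
    }
  }

  public static void main(String[] args) {
    for (String prefix : new String[] {"/tmp/foo/bar", "/", "gs://my-bucket", "c://"}) {
      Resource r = convertToFileIfPossible(prefix);
      System.out.println(prefix + " -> " + (r.isDirectory ? "directory" : "file"));
    }
  }
}
```

Under these assumed rules, /tmp/foo/bar resolves as a file, while /, gs://my-bucket, and c:// resolve as directories.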
@Experimental(value=FILESYSTEM) public ValueProvider<ResourceId> getBaseOutputDirectoryProvider()
Returns the base directory inside which files will be written according to the configured
FileBasedSink.FilenamePolicy.
@Experimental(value=FILESYSTEM) public final FileBasedSink.FilenamePolicy getFilenamePolicy()
Returns the policy by which files will be named inside of the base output directory. The
FileBasedSink.FilenamePolicy may itself specify one or more inner directories before each output
file, say when writing windowed outputs in an output/YYYY/MM/DD/file.txt format.
public void validate(PipelineOptions options)
public abstract FileBasedSink.WriteOperation<T> createWriteOperation()
Return a subclass of FileBasedSink.WriteOperation that will manage the write to the sink.
public void populateDisplayData(DisplayData.Builder builder)
Description copied from interface: HasDisplayData
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData). Implementations may call
super.populateDisplayData(builder) in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use
the namespace of the subcomponent.
Specified by: populateDisplayData in interface HasDisplayData
Parameters: builder - The builder to populate with display data.
See Also: HasDisplayData
protected final java.lang.String getExtension()
Returns the extension that will be written to the produced files.