Class FileBasedSink<UserT,DestinationT,OutputT>
- Type Parameters:
OutputT - the type of values written to the sink.
- All Implemented Interfaces:
Serializable, HasDisplayData
- Direct Known Subclasses:
AvroSink
At pipeline construction time, the methods of FileBasedSink are called to validate the sink
and to create a FileBasedSink.WriteOperation that manages the process of writing to the sink.
The process of writing to a file-based sink is as follows:
- an optional subclass-defined initialization,
- a parallel write of bundles to temporary files, and finally,
- a rename of these temporary files to their final output filenames.
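The three steps above can be sketched with plain java.nio file operations. This is only an illustration of the write-to-temp-then-rename pattern, not Beam code: the class name TempThenRenameSketch, the ".temp" directory, and the "out-N-of-M.txt" naming are assumptions made for the example, and the bundles are written in a simple loop rather than in parallel.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the temp-file-then-rename pattern; not Beam code. */
class TempThenRenameSketch {

  /** Writes each bundle to a temporary file, then renames the files to final names. */
  static List<Path> writeAll(Path outputDir, List<List<String>> bundles) throws IOException {
    // Step 1: initialization (here: create the temporary directory).
    Path tempDir = Files.createDirectories(outputDir.resolve(".temp"));

    // Step 2: write every bundle to its own temporary file.
    // A real runner does this in parallel; a loop keeps the sketch simple.
    List<Path> tempFiles = new ArrayList<>();
    for (int i = 0; i < bundles.size(); i++) {
      Path temp = tempDir.resolve("bundle-" + i + ".tmp");
      Files.write(temp, bundles.get(i), StandardCharsets.UTF_8);
      tempFiles.add(temp);
    }

    // Step 3: finalize by renaming temporary files to their final output names.
    List<Path> finalFiles = new ArrayList<>();
    for (int i = 0; i < tempFiles.size(); i++) {
      Path target = outputDir.resolve("out-" + i + "-of-" + tempFiles.size() + ".txt");
      Files.move(tempFiles.get(i), target, StandardCopyOption.REPLACE_EXISTING);
      finalFiles.add(target);
    }

    // All temporary files have been moved, so the temp directory is empty.
    Files.delete(tempDir);
    return finalFiles;
  }
}
```

Because readers only ever see files that were renamed in the last step, a failure during the write step leaves no partial output under a final name.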
In order to ensure fault-tolerance, a bundle may be executed multiple times (e.g., in the
event of failure/retry or for redundancy). However, exactly one of these executions will have its
result passed to the finalize method. Each call to FileBasedSink.Writer.open(java.lang.String) is passed a unique
bundle id when it is called by the WriteFiles transform, so even redundant or retried
bundles will have a unique way of identifying their output.
The bundle id should be used to guarantee that a bundle's output is unique. This uniqueness guarantee is important; if a bundle is to be output to a file, for example, the name of the file will encode the unique bundle id to avoid conflicts with other writers.
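As a rough illustration of why the unique bundle id matters, the sketch below gives every attempt its own temporary file name and lets only one attempt per shard reach finalization. Everything here is hypothetical (the class BundleIdSketch, the ".temp-" prefix, the use of a random UUID as the id, and the first-result-wins rule); it is not how WriteFiles selects a result.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical illustration of bundle-id-based temp names; not Beam code. */
class BundleIdSketch {
  // Maps a shard number to the single attempt chosen for finalization.
  private final Map<Integer, String> winners = new HashMap<>();

  /** Generates a fresh id per attempt, so retried bundles never share an id. */
  static String newBundleId() {
    return UUID.randomUUID().toString();
  }

  /** Encodes the bundle id in the temp file name to avoid writer conflicts. */
  static String tempFileName(String bundleId) {
    return ".temp-" + bundleId;
  }

  /** Accepts the first result offered for a shard; later attempts are rejected. */
  boolean offerResult(int shard, String bundleId) {
    return winners.putIfAbsent(shard, bundleId) == null;
  }

  /** Returns the bundle id whose output would be finalized for the shard. */
  String winner(int shard) {
    return winners.get(shard);
  }
}
```

Two attempts at the same shard write to different temporary files, so a retry cannot overwrite or corrupt the output of an earlier attempt.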
FileBasedSink can take a custom FileBasedSink.FilenamePolicy object to determine output
filenames, and this policy object can be used to write windowed or triggered PCollections into
separate files per window pane. This allows file output from unbounded PCollections, and also
works for bounded PCollections.
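To make the idea of one file per window pane concrete, here is a minimal naming sketch. It does not use or reproduce the FileBasedSink.FilenamePolicy API; the class WindowedNamingSketch, its windowedFilename method, and the name layout are all assumptions chosen for illustration.

```java
import java.time.Instant;
import java.util.Locale;

/** Hypothetical naming sketch; the names below are not Beam's FilenamePolicy API. */
class WindowedNamingSketch {

  /** Builds a distinct file name for each (window, pane, shard) combination. */
  static String windowedFilename(
      String prefix,
      Instant windowStart,
      Instant windowEnd,
      long paneIndex,
      int shard,
      int numShards,
      String suffix) {
    // Locale.ROOT keeps the digits ASCII regardless of the default locale.
    return String.format(
        Locale.ROOT,
        "%s-%s-%s-pane-%d-%05d-of-%05d%s",
        prefix, windowStart, windowEnd, paneIndex, shard, numShards, suffix);
  }
}
```

Including the window bounds and pane index in the name is what keeps output from different panes of an unbounded PCollection from colliding.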
Supported file systems are those registered with FileSystems.
Nested Class Summary
Nested Classes
- static enum FileBasedSink.CompressionType: Deprecated.
- static class FileBasedSink.DynamicDestinations: A class that allows value-dependent writes in FileBasedSink.
- static class FileBasedSink.FilenamePolicy: A naming policy for output files.
- static final class FileBasedSink.FileResult: Result of a single bundle write.
- static final class FileBasedSink.FileResultCoder: A coder for FileBasedSink.FileResult objects.
- static interface FileBasedSink.OutputFileHints: Provides hints about how to generate output files, such as a suggested filename suffix.
- static interface FileBasedSink.WritableByteChannelFactory: Implementations create instances of WritableByteChannel used by FileBasedSink and related classes to allow decorating, or otherwise transforming, the raw data that would normally be written directly to the WritableByteChannel passed into FileBasedSink.WritableByteChannelFactory.create(WritableByteChannel).
- static class FileBasedSink.WriteOperation: Abstract operation that manages the process of writing to FileBasedSink.
- static class FileBasedSink.Writer: Abstract writer that writes a bundle to a FileBasedSink.
-
Constructor Summary
Constructors
- FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?, DestinationT, OutputT> dynamicDestinations): Construct a FileBasedSink with the given temp directory, producing uncompressed files.
- FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?, DestinationT, OutputT> dynamicDestinations, Compression compression): Construct a FileBasedSink with the given temp directory and output channel type.
- FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?, DestinationT, OutputT> dynamicDestinations, FileBasedSink.WritableByteChannelFactory writableByteChannelFactory): Construct a FileBasedSink with the given temp directory and output channel type.
-
Method Summary
- static ResourceId convertToFileResourceIfPossible(String outputPrefix): Helper function that converts a user-provided output filename prefix into a ResourceId for writing output files.
- abstract FileBasedSink.WriteOperation<DestinationT, OutputT> createWriteOperation(): Return a subclass of FileBasedSink.WriteOperation that will manage the write to the sink.
- getDynamicDestinations(): Return the FileBasedSink.DynamicDestinations used.
- getTempDirectoryProvider(): Returns the directory inside which temporary files will be written according to the configured FileBasedSink.FilenamePolicy.
- protected final FileBasedSink.WritableByteChannelFactory getWritableByteChannelFactory(): Returns the FileBasedSink.WritableByteChannelFactory used.
- void populateDisplayData(DisplayData.Builder builder): Register display data for the given transform or component.
- void validate(PipelineOptions options)
-
Constructor Details
-
FileBasedSink
public FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?, DestinationT, OutputT> dynamicDestinations)
Construct a FileBasedSink with the given temp directory, producing uncompressed files.
-
FileBasedSink
public FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?, DestinationT, OutputT> dynamicDestinations, FileBasedSink.WritableByteChannelFactory writableByteChannelFactory)
Construct a FileBasedSink with the given temp directory and output channel type.
-
FileBasedSink
public FileBasedSink(ValueProvider<ResourceId> tempDirectoryProvider, FileBasedSink.DynamicDestinations<?, DestinationT, OutputT> dynamicDestinations, Compression compression)
Construct a FileBasedSink with the given temp directory and output channel type.
-
-
Method Details
-
convertToFileResourceIfPossible
Helper function that converts a user-provided output filename prefix into a ResourceId for writing output files. See TextIO.Write.to(String) for an example use case.
Typically, the input prefix will be something like /tmp/foo/bar, and the user would like output files to be named like /tmp/foo/bar-0-of-3.txt. Thus, this function tries to interpret the provided string as a file ResourceId path.
However, this may fail, for example if the user gives a prefix that is a directory, e.g., /, gs://my-bucket, or c://. In that case, interpreting the string as a file will fail and this function will return a directory ResourceId instead.
-
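The file-versus-directory distinction described above can be illustrated with a small string-based sketch. It is not the Beam implementation and does not call FileSystems or ResourceId: the class PrefixSketch and both of its methods are hypothetical, and the directory heuristic is an assumption tuned only to reproduce the examples given in the text.

```java
/** Hypothetical sketch of prefix handling; not the Beam implementation. */
class PrefixSketch {

  /**
   * Guesses whether a prefix names a directory rather than a file prefix.
   * Assumed heuristic: without a scheme, a trailing "/" means directory; with a
   * scheme, an empty path, a bare authority, or a trailing "/" means directory.
   */
  static boolean looksLikeDirectory(String prefix) {
    int schemeEnd = prefix.indexOf("://");
    if (schemeEnd < 0) {
      return prefix.endsWith("/");
    }
    String rest = prefix.substring(schemeEnd + 3);
    return rest.isEmpty() || !rest.contains("/") || rest.endsWith("/");
  }

  /** Builds a sharded output name such as /tmp/foo/bar-0-of-3.txt from a file prefix. */
  static String shardName(String prefix, int shard, int numShards, String suffix) {
    return prefix + "-" + shard + "-of-" + numShards + suffix;
  }
}
```

With this heuristic, /tmp/foo/bar is treated as a file prefix, while /, gs://my-bucket, and c:// are treated as directories, matching the cases listed in the description.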
getDynamicDestinations
Return the FileBasedSink.DynamicDestinations used.
-
getTempDirectoryProvider
Returns the directory inside which temporary files will be written according to the configured FileBasedSink.FilenamePolicy.
-
validate
-
createWriteOperation
Return a subclass of FileBasedSink.WriteOperation that will manage the write to the sink.
-
populateDisplayData
Description copied from interface: HasDisplayData
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
- Specified by:
populateDisplayData in interface HasDisplayData
- Parameters:
builder - The builder to populate with display data.
-
getWritableByteChannelFactory
Returns the FileBasedSink.WritableByteChannelFactory used.
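The decorating role of a channel factory can be sketched with standard java.nio and java.util.zip classes. The ChannelFactory interface below is a hypothetical stand-in that only mirrors the create(WritableByteChannel) shape mentioned in this page; it is not FileBasedSink.WritableByteChannelFactory, and gzip is just one example of a transformation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration of channel decoration; not Beam code. */
class ChannelDecoratorSketch {

  /** Stand-in for a channel factory: wraps a raw channel in another channel. */
  interface ChannelFactory {
    WritableByteChannel create(WritableByteChannel raw) throws IOException;
  }

  /** Decorates the raw channel so that written bytes are gzip-compressed. */
  static final ChannelFactory GZIP =
      raw -> Channels.newChannel(new GZIPOutputStream(Channels.newOutputStream(raw)));

  /** Writes text through the decorated channel and returns the compressed bytes. */
  static byte[] writeCompressed(String text) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    // Closing the decorated channel finishes the gzip stream and closes the raw channel.
    try (WritableByteChannel channel = GZIP.create(Channels.newChannel(sink))) {
      ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
      while (buffer.hasRemaining()) {
        channel.write(buffer);
      }
    }
    return sink.toByteArray();
  }

  /** Decompresses bytes produced by writeCompressed, for round-trip checking. */
  static String readCompressed(byte[] bytes) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
      byte[] chunk = new byte[1024];
      int n;
      while ((n = in.read(chunk)) != -1) {
        out.write(chunk, 0, n);
      }
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }
}
```

The writer code only sees a WritableByteChannel, so the choice of transformation stays in the factory and the writing logic does not change.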
-