Package org.apache.beam.sdk.io
Class TextRowCountEstimator
java.lang.Object
org.apache.beam.sdk.io.TextRowCountEstimator
Returns an estimate of the row count for the files that match a file pattern.
Nested Class Summary
Nested Classes
static class: Builder for TextRowCountEstimator.
static class: Sampling strategy that stops sampling once enough bytes have been sampled.
static class: Sampling strategy that stops sampling when the total number of sampled bytes exceeds a threshold.
static class: Exception thrown if the estimator cannot produce an estimate of the number of lines.
static class: Sampling strategy that samples all the files.
static interface: A sampling strategy determines when to stop reading further files.
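The role of a sampling strategy can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `SamplingSketch`, `StopRule`, `totalBytesLimit`, `sampleAll`, and `filesSampled` are illustrative names, not Beam's API, and the loop only assumes that the strategy is consulted before each file is sampled.

```java
/** Hypothetical sketch of stop-sampling rules; not Beam's implementation. */
final class SamplingSketch {

  /** Illustrative stand-in for a sampling strategy. */
  interface StopRule {
    /** Returns true if no further files should be sampled. */
    boolean shouldStop(long totalSampledBytes);
  }

  /** Stops once the total number of sampled bytes exceeds limitBytes. */
  static StopRule totalBytesLimit(long limitBytes) {
    return total -> total > limitBytes;
  }

  /** Never stops, so every file is sampled. */
  static StopRule sampleAll() {
    return total -> false;
  }

  /** Counts how many files are sampled when at most bytesPerFile bytes are read from each. */
  static int filesSampled(long[] fileSizes, long bytesPerFile, StopRule rule) {
    long total = 0;
    int sampled = 0;
    for (long size : fileSizes) {
      if (rule.shouldStop(total)) {
        break; // the rule decided enough has been read
      }
      total += Math.min(size, bytesPerFile);
      sampled++;
    }
    return sampled;
  }

  public static void main(String[] args) {
    long[] sizes = {100, 100, 100, 100};
    System.out.println(filesSampled(sizes, 60, totalBytesLimit(100)));
    System.out.println(filesSampled(sizes, 60, sampleAll()));
  }
}
```

A byte-limited rule trades accuracy for speed on large file sets, while sampling everything gives the most representative average at the cost of touching every file.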
Constructor Summary
Constructors
TextRowCountEstimator()
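Method Summary
Modifier and Type, Method, Description
builder()
Double estimateRowCount(PipelineOptions pipelineOptions): Estimates the number of non-empty rows.
abstract Compression getCompression()
abstract byte @Nullable [] getDelimiters()
getDirectoryTreatment()
abstract EmptyMatchTreatment getEmptyMatchTreatment()
abstract String getFilePattern()
abstract long getNumSampledBytesPerFile()
getSamplingStrategy()
abstract int getSkipHeaderLines()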
Constructor Details
TextRowCountEstimator
public TextRowCountEstimator()
Method Details
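getNumSampledBytesPerFile
public abstract long getNumSampledBytesPerFile()

getDelimiters
public abstract byte @Nullable [] getDelimiters()

getSkipHeaderLines
public abstract int getSkipHeaderLines()

getFilePattern
public abstract String getFilePattern()

getCompression
public abstract Compression getCompression()

getSamplingStrategy

getEmptyMatchTreatment
public abstract EmptyMatchTreatment getEmptyMatchTreatment()

getDirectoryTreatment

builder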
estimateRowCount
public Double estimateRowCount(PipelineOptions pipelineOptions) throws IOException, TextRowCountEstimator.NoEstimationException
Estimates the number of non-empty rows. It samples NumSampledBytesPerFile bytes from each file until the condition in the sampling strategy is met. It then computes the average line size of the sampled rows and divides the total size of the files by that average. If all the sampled rows are empty and it has not sampled all the lines (because of the sampling strategy), it throws TextRowCountEstimator.NoEstimationException.
Returns:
The estimated number of rows.
Throws:
TextRowCountEstimator.NoEstimationException - if all the sampled lines are empty and not all the lines in the matched files have been read.
IOException
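The arithmetic behind the estimate can be sketched in a few lines. This is a hypothetical, self-contained illustration, not Beam's code: `RowCountSketch` and `estimateRows` are invented names, `IllegalStateException` stands in for NoEstimationException, and returning 0 when every line was read and all were empty is an assumption about the case the description leaves implicit.

```java
/** Hypothetical sketch of the estimate: total bytes divided by average sampled line size. */
final class RowCountSketch {

  /**
   * @param sampledBytes bytes read across all sampled files
   * @param sampledNonEmptyLines non-empty lines found in the sample
   * @param totalBytes combined size of all matched files
   * @param sampledEverything true if the sample covered every line of every file
   */
  static double estimateRows(
      long sampledBytes, long sampledNonEmptyLines, long totalBytes, boolean sampledEverything) {
    if (sampledNonEmptyLines == 0) {
      if (sampledEverything) {
        return 0.0; // assumption: nothing but empty lines exists, so there are no rows
      }
      // Stand-in for NoEstimationException: no average line size can be computed.
      throw new IllegalStateException("All sampled lines are empty; cannot estimate");
    }
    double averageLineSize = (double) sampledBytes / sampledNonEmptyLines;
    return totalBytes / averageLineSize;
  }

  public static void main(String[] args) {
    // 1000 sampled bytes holding 50 lines gives an average of 20 bytes per line.
    System.out.println(estimateRows(1000L, 50L, 1_000_000L, false));
  }
}
```

The estimate is only as good as the sample: if the sampled prefix of each file has unusually short or long lines, the result is skewed in proportion.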