Package org.apache.beam.sdk.io
Class TextRowCountEstimator
java.lang.Object
org.apache.beam.sdk.io.TextRowCountEstimator
Returns an estimated row count for the files that match a file pattern.
-
Nested Class Summary
Nested Classes
Modifier and Type | Description
static class | Builder for TextRowCountEstimator.
static class | This strategy stops sampling once enough bytes have been sampled.
static class | This strategy stops sampling when the total number of sampled bytes exceeds a threshold.
static class | An exception thrown if the estimator cannot estimate the number of lines.
static class | This strategy samples all the files.
static interface | A sampling strategy determines when to stop reading further files.
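The idea of a sampling strategy as a stop condition can be illustrated with a small standalone sketch. It does not use the Beam classes: StopRule, byteLimit, sampleAll and filesSampled are hypothetical names invented for illustration, and the rule is assumed to be checked before each additional file is read.

```java
class SamplingSketch {
    /** Hypothetical stop rule: decides whether to stop before reading another file. */
    public interface StopRule {
        boolean shouldStop(int filesSampled, long bytesSampled);
    }

    /** Stops once the total number of sampled bytes exceeds maxBytes. */
    public static StopRule byteLimit(long maxBytes) {
        return (files, bytes) -> bytes > maxBytes;
    }

    /** Never stops early, so every file is sampled. */
    public static StopRule sampleAll() {
        return (files, bytes) -> false;
    }

    /** Counts how many files get sampled, given the bytes sampled from each file. */
    public static int filesSampled(long[] sampleSizes, StopRule rule) {
        int files = 0;
        long bytes = 0;
        for (long size : sampleSizes) {
            // Assumption: the rule is consulted before each new file is read.
            if (rule.shouldStop(files, bytes)) {
                break;
            }
            files++;
            bytes += size;
        }
        return files;
    }

    public static void main(String[] args) {
        long[] sizes = {100, 100, 100, 100};
        System.out.println(filesSampled(sizes, byteLimit(150))); // prints 2
        System.out.println(filesSampled(sizes, sampleAll()));    // prints 4
    }
}
```

With a 150-byte limit, the third file is never read because 200 bytes have already been sampled by then.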
Constructor Summary
Constructors -
Method Summary
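Modifier and Type | Method | Description
 | builder() |
Double | estimateRowCount(PipelineOptions pipelineOptions) | Estimates the number of non-empty rows.
abstract Compression | getCompression() |
abstract byte @Nullable [] | getDelimiters() |
abstract EmptyMatchTreatment | getEmptyMatchTreatment() |
abstract String | getFilePattern() |
abstract long | getNumSampledBytesPerFile() |
abstract int | getSkipHeaderLines() |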
-
Constructor Details
-
TextRowCountEstimator
public TextRowCountEstimator()
-
-
Method Details
-
getNumSampledBytesPerFile
public abstract long getNumSampledBytesPerFile() -
getDelimiters
-
getSkipHeaderLines
public abstract int getSkipHeaderLines() -
getFilePattern
-
getCompression
-
getSamplingStrategy
-
getEmptyMatchTreatment
-
getDirectoryTreatment
-
builder
-
estimateRowCount
public Double estimateRowCount(PipelineOptions pipelineOptions) throws IOException, TextRowCountEstimator.NoEstimationException
Estimates the number of non-empty rows. It samples NumSampledBytesPerFile bytes from each file until the condition in the sampling strategy is met. It then computes the average line size of the sampled rows and divides the total size of the files by that average. If all the sampled rows are empty and not all lines have been sampled (because of the sampling strategy), it throws a NoEstimationException.
- Returns:
- Number of estimated rows.
- Throws:
TextRowCountEstimator.NoEstimationException - if all the sampled lines are empty and not all lines in the matched files have been read.
IOException
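The arithmetic described above can be sketched without Beam. This is a minimal illustration, not the library's implementation: RowCountSketch, NoEstimate and estimate are hypothetical names, lines are assumed to be UTF-8 with a one-byte newline delimiter, and the bytes of empty lines are assumed to count toward the sampled size.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class RowCountSketch {
    /** Hypothetical exception: no non-empty line was sampled, so no average exists. */
    public static class NoEstimate extends RuntimeException {
        public NoEstimate(String message) {
            super(message);
        }
    }

    /** Estimates non-empty rows as totalBytes / (sampledBytes / nonEmptyLines). */
    public static double estimate(List<String> sampledLines, long totalBytes) {
        long sampledBytes = 0;
        long nonEmptyLines = 0;
        for (String line : sampledLines) {
            // +1 accounts for the newline delimiter (assumed to be one byte).
            sampledBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
            if (!line.isEmpty()) {
                nonEmptyLines++;
            }
        }
        if (nonEmptyLines == 0) {
            throw new NoEstimate("all sampled lines were empty");
        }
        double averageLineSize = (double) sampledBytes / nonEmptyLines;
        return totalBytes / averageLineSize;
    }

    public static void main(String[] args) {
        // 12 sampled bytes over 2 non-empty lines gives an average of 6 bytes per row.
        List<String> sample = Arrays.asList("aaa", "", "bbbbbb");
        System.out.println(estimate(sample, 1200L)); // prints 200.0
    }
}
```

The sketch throws whenever no non-empty line was sampled; the method documented here throws only if, in addition, not all lines in the matched files were read.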
-