Table Row Inference Sklearn Batch

Model: Scikit-learn classifier on structured table data (Beam.Row)
Accelerator: CPU-based inference (fixed batch size)
Host: 10 × n1-standard-4 (4 vCPUs, 15 GB RAM)

This batch pipeline performs inference on structured table rows using RunInference with a Scikit-learn model. It reads structured data (table rows) from GCS in JSONL format, extracts the specified feature columns, and runs batched inference while preserving the original table schema. The pipeline ensures exactly-once semantics within batch execution by deduplicating inputs and writing results to BigQuery using file-based loads, enabling reproducible and comparable performance measurements across runs.
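The per-batch work described above can be sketched as follows. This is an illustrative sketch, not the benchmark's actual implementation: the feature column names, the sample rows, and the stand-in model trained inline are all hypothetical (the real pipeline loads a pre-trained model and reads JSONL from GCS, running inside Beam's RunInference transform).

```python
import json
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURE_COLUMNS = ["f0", "f1"]  # assumed feature column names

# Stand-in model trained inline for illustration; the benchmark
# would load a pre-trained model artifact from GCS instead.
model = LogisticRegression().fit([[0.0, 0.0], [1.0, 1.0]], [0, 1])

# Hypothetical JSONL table rows as they might arrive from GCS.
jsonl_lines = [
    '{"id": 1, "f0": 0.1, "f1": 0.2}',
    '{"id": 2, "f0": 0.9, "f1": 0.8}',
]

# Parse rows, extract the feature columns, and run one batched
# prediction call per bundle rather than one call per row.
rows = [json.loads(line) for line in jsonl_lines]
features = np.array([[row[c] for c in FEATURE_COLUMNS] for row in rows])
preds = model.predict(features)

# Preserve the original table schema, appending the prediction
# as an extra column before the rows are written out.
results = [dict(row, prediction=int(p)) for row, p in zip(rows, preds)]
```

Batching the `predict` call amortizes model overhead across rows, which is the main lever the fixed batch size in the configuration above controls.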

The following graphs show various metrics when running the Table Row Inference Sklearn Batch pipeline. See the glossary for definitions.

The full pipeline implementation is available here.

What is the estimated cost to run the pipeline?

RunTime and EstimatedCost

How have various metrics changed when running the pipeline across different Beam SDK versions?

AvgThroughputBytesPerSec by Version

AvgThroughputElementsPerSec by Version

How have various metrics changed over time when running the pipeline?

AvgThroughputBytesPerSec by Date

AvgThroughputElementsPerSec by Date

See also Table Row Inference Sklearn Streaming for the streaming variant of this pipeline.