public class BigQueryIO
extends java.lang.Object
PTransform
s for reading and writing BigQuery tables.
A fully-qualified BigQuery table name consists of three components:
projectId
: the Cloud project id (defaults to GcpOptions.getProject()
).
datasetId
: the BigQuery dataset id, unique within a project.
tableId
: a table id, unique within a dataset.
BigQuery table references are stored as a TableReference
, which comes from the BigQuery Java Client API. Tables
can be referred to as Strings, with or without the projectId
. A helper function is
provided (BigQueryHelpers.parseTableSpec(String)
) that parses the following string forms
into a TableReference
:
project_id
]:[dataset_id
].[table_id
]
dataset_id
].[table_id
]
Reading from BigQuery is supported by read(SerializableFunction)
, which parses
records in AVRO format
into a custom type using a specified parse function, and by readTableRows()
which parses
them into TableRow
, which may be more convenient but has lower performance.
Both functions support reading either from a table or from the result of a query, via
BigQueryIO.TypedRead.from(String)
and BigQueryIO.TypedRead.fromQuery(java.lang.String)
respectively. Exactly one
of these must be specified.
Example: Reading rows of a table as TableRow
.
PCollection<TableRow> weatherData = pipeline.apply(
BigQueryIO.readTableRows().from("clouddataflow-readonly:samples.weather_stations"));
Example: Reading rows of a table and parsing them into a custom type.
PCollection<WeatherRecord> weatherData = pipeline.apply(
BigQueryIO
.read(new SerializableFunction<SchemaAndRecord, WeatherRecord>() {
public WeatherRecord apply(SchemaAndRecord schemaAndRecord) {
return new WeatherRecord(...);
}
})
.from("clouddataflow-readonly:samples.weather_stations"))
.withCoder(SerializableCoder.of(WeatherRecord.class));
Note: When using read(SerializableFunction)
, you may sometimes need to use
BigQueryIO.TypedRead.withCoder(Coder)
to specify a Coder
for the result type, if Beam
fails to infer it automatically.
Example: Reading results of a query as TableRow
.
PCollection<TableRow> meanTemperatureData = pipeline.apply(BigQueryIO.readTableRows()
.fromQuery("SELECT year, mean_temp FROM [samples.weather_stations]"));
To write to a BigQuery table, apply a BigQueryIO.Write
transformation. This consumes
either a PCollection
of TableRows
as input when using writeTableRows()
or of a user-defined type when using write()
.
When using a user-defined type, a function must be provided to turn this type into a TableRow
using BigQueryIO.Write.withFormatFunction(SerializableFunction)
.
PCollection<TableRow> quotes = ...
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("source").setType("STRING"));
fields.add(new TableFieldSchema().setName("quote").setType("STRING"));
TableSchema schema = new TableSchema().setFields(fields);
quotes.apply(BigQueryIO.writeTableRows()
.to("my-project:output.output_table")
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
See BigQueryIO.Write
for details on how to specify if a write should append to an
existing table, replace the table, or verify that the table is empty. Note that the dataset being
written to must already exist. Unbounded PCollections can only be written using BigQueryIO.Write.WriteDisposition.WRITE_EMPTY
or BigQueryIO.Write.WriteDisposition.WRITE_APPEND
.
A common use case is to dynamically generate BigQuery table names based on the current window
or the current value. To support this, BigQueryIO.Write.to(SerializableFunction)
accepts
a function mapping the current element to a tablespec. For example, here's code that outputs
daily tables to BigQuery:
PCollection<TableRow> quotes = ...
quotes.apply(Window.<TableRow>into(CalendarWindows.days(1)))
.apply(BigQueryIO.writeTableRows()
.withSchema(schema)
.to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
public TableDestination apply(ValueInSingleWindow<TableRow> value) {
// The cast below is safe because CalendarWindows.days(1) produces IntervalWindows.
String dayString = DateTimeFormat.forPattern("yyyy_MM_dd")
.withZone(DateTimeZone.UTC)
.print(((IntervalWindow) value.getWindow()).start());
return new TableDestination(
"my-project:output.output_table_" + dayString, // Table spec
"Output for day " + dayString // Table description
);
}
}));
Note that this also allows the table to be a function of the element as well as the current
pane, in the case of triggered windows. In this case it might be convenient to call write()
directly instead of using the writeTableRows()
helper.
This will allow the mapping function to access the element of the user-defined type. In this
case, a formatting function must be specified using BigQueryIO.Write.withFormatFunction(org.apache.beam.sdk.transforms.SerializableFunction<T, com.google.api.services.bigquery.model.TableRow>)
to convert each element into a TableRow
object.
Per-table schemas can also be provided using BigQueryIO.Write.withSchemaFromView(org.apache.beam.sdk.values.PCollectionView<java.util.Map<java.lang.String, java.lang.String>>)
. This
allows you the schemas to be calculated based on a previous pipeline stage or statically via a
Create
transform. This method expects to receive a
map-valued PCollectionView
, mapping table specifications (project:dataset.table-id), to
JSON formatted TableSchema
objects. All destination tables must be present in this map,
or the pipeline will fail to create tables. Care should be taken if the map value is based on a
triggered aggregation over and unbounded PCollection
; the side input will contain the
entire history of all table schemas ever generated, which might blow up memory usage. This method
can also be useful when writing to a single table, as it allows a previous stage to calculate the
schema (possibly based on the full collection of records being written to BigQuery).
For the most general form of dynamic table destinations and schemas, look at BigQueryIO.Write.to(DynamicDestinations)
.
BigQueryIO.Write
supports two methods of inserting data into BigQuery specified using
BigQueryIO.Write.withMethod(org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method)
. If no method is supplied, then a default method will be
chosen based on the input PCollection. See BigQueryIO.Write.Method
for more information
about the methods. The different insertion methods provide different tradeoffs of cost, quota,
and data consistency; please see BigQuery documentation for more information about these
tradeoffs.
When using read()
or readTableRows()
in a template, it's required to specify
BigQueryIO.Read.withTemplateCompatibility()
. Specifying this in a non-template pipeline is not
recommended because it has somewhat lower performance.
When using write()
or writeTableRows()
with batch loads in a template, it is
recommended to specify BigQueryIO.Write.withCustomGcsTempLocation(org.apache.beam.sdk.options.ValueProvider<java.lang.String>)
. Writing to BigQuery via batch
loads involves writing temporary files to this location, so the location must be accessible at
pipeline execution time. By default, this location is captured at pipeline construction
time, may be inaccessible if the template may be reused from a different project or at a moment
when the original location no longer exists.
BigQueryIO.Write.withCustomGcsTempLocation(ValueProvider)
allows specifying the location as an
argument to the template invocation.
Permission requirements depend on the PipelineRunner
that is used to execute the
pipeline. Please refer to the documentation of corresponding PipelineRunner
s for more
details.
Please see BigQuery Access Control for security and permission related information specific to BigQuery.
Modifier and Type | Class and Description |
---|---|
static class |
BigQueryIO.Read
Implementation of
read() . |
static class |
BigQueryIO.TypedRead<T>
Implementation of
read(SerializableFunction) . |
static class |
BigQueryIO.Write<T>
Implementation of
write() . |
Modifier and Type | Method and Description |
---|---|
static BigQueryIO.Read |
read()
Deprecated.
Use
read(SerializableFunction) or readTableRows() instead.
readTableRows() does exactly the same as read() , however
read(SerializableFunction) performs better. |
static <T> BigQueryIO.TypedRead<T> |
read(SerializableFunction<SchemaAndRecord,T> parseFn)
Reads from a BigQuery table or query and returns a
PCollection with one element per
each row of the table or query result, parsed from the BigQuery AVRO format using the specified
function. |
static BigQueryIO.TypedRead<TableRow> |
readTableRows()
Like
read(SerializableFunction) but represents each row as a TableRow . |
static <T> BigQueryIO.Write<T> |
write()
A
PTransform that writes a PCollection to a BigQuery table. |
static BigQueryIO.Write<TableRow> |
writeTableRows()
|
@Deprecated public static BigQueryIO.Read read()
read(SerializableFunction)
or readTableRows()
instead.
readTableRows()
does exactly the same as read()
, however
read(SerializableFunction)
performs better.public static BigQueryIO.TypedRead<TableRow> readTableRows()
read(SerializableFunction)
but represents each row as a TableRow
.
This method is more convenient to use in some cases, but usually has significantly lower
performance than using read(SerializableFunction)
directly to parse data into a
domain-specific type, due to the overhead of converting the rows to TableRow
.
public static <T> BigQueryIO.TypedRead<T> read(SerializableFunction<SchemaAndRecord,T> parseFn)
PCollection
with one element per
each row of the table or query result, parsed from the BigQuery AVRO format using the specified
function.
Each SchemaAndRecord
contains a BigQuery TableSchema
and a
GenericRecord
representing the row, indexed by column name. Here is a
sample parse function that parses click events from a table.
p.apply(BigQueryIO.read(new SerializableFunction<SchemaAndRecord, ClickEvent>() {
public Event apply(SchemaAndRecord record) {
GenericRecord r = record.getRecord();
return new Event((Long) r.get("userId"), (String) r.get("url"));
}
}).from("...");
public static <T> BigQueryIO.Write<T> write()
PTransform
that writes a PCollection
to a BigQuery table. A formatting
function must be provided to convert each input element into a TableRow
using
BigQueryIO.Write.withFormatFunction(SerializableFunction)
.
In BigQuery, each table has an encosing dataset. The dataset being written must already exist.
By default, tables will be created if they do not exist, which corresponds to a
BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED
disposition that matches the default of
BigQuery's Jobs API. A schema must be provided (via BigQueryIO.Write.withSchema(TableSchema)
),
or else the transform may fail at runtime with an IllegalArgumentException
.
By default, writes require an empty table, which corresponds to
a BigQueryIO.Write.WriteDisposition.WRITE_EMPTY
disposition that matches the default of
BigQuery's Jobs API.
Here is a sample transform that produces TableRow values containing "word" and "count" columns:
static class FormatCountsFn extends DoFn<KV<String, Long>, TableRow> {
public void processElement(ProcessContext c) {
TableRow row = new TableRow()
.set("word", c.element().getKey())
.set("count", c.element().getValue().intValue());
c.output(row);
}
}
public static BigQueryIO.Write<TableRow> writeTableRows()