public class CoGroup
extends java.lang.Object
PCollection
s.
This transform has similarites to CoGroupByKey
, however works on PCollections that
have schemas. This allows users of the transform to simply specify schema fields to join on. The
output type of the transform is a KV<Row, Row> where the value contains one field for
every input PCollection and the key represents the fields that were joined on. By default the
cross product is not expanded, so all fields in the output row are array fields.
For example, the following demonstrates joining three PCollections on the "user" and "country" fields.
TupleTag<Input1Type> input1Tag = new TupleTag<>("input1");
TupleTag<Input2Type> input2Tag = new TupleTag<>("input2");
TupleTag<Input3Type> input3Tag = new TupleTag<>("input3");
PCollection<KV<Row, Row>> joined = PCollectionTuple
.of(input1Tag, input1)
.and(input2Tag, input2)
.and(input3Tag, input3)
.apply(CoGroup.byFieldNames("user", "country"));
In the above case, the key schema will contain the two string fields "user" and "country"; in this case, the schemas for Input1, Input2, Input3 must all have fields named "user" and "country". The value schema will contain three array of Row fields named "input1" "input2" and "input3". The value Row contains all inputs that came in on any of the inputs for that key. Standard join types (inner join, outer join, etc.) can be accomplished by expanding the cross product of these arrays in various ways.
To put it in other words, the key schema is convertible to the following POJO:
{@literal @}DefaultSchema(JavaFieldSchema.class)
public class JoinedKey {
public String user;
public String country;
}
PCollection<JoinedKey> keys = joined
.apply(Keys.create())
.apply(Convert.to(JoinedKey.class));
The value schema is convertible to the following POJO:
{@literal @}DefaultSchema(JavaFieldSchema.class)
public class JoinedValue {
// The below lists contain all values from each of the three inputs that match on the given
// key.
public List<Input1Type> input1;
public List<Input2Type> input2;
public List<Input3Type> input3;
}
PCollection<JoinedValue> values = joined
.apply(Values.create())
.apply(Convert.to(JoinedValue.class));
It's also possible to join between different fields in two inputs, as long as the types of those fields match. In this case, fields must be specified for every input PCollection. For example:
PCollection<KV<Row, Row>> joined = PCollectionTuple
.of(input1Tag, input1)
.and(input2Tag, input2)
.apply(CoGroup
.byFieldNames(input1Tag, "referringUser"))
.byFieldNames(input2Tag, "user"));
Modifier and Type | Class and Description |
---|---|
static class |
CoGroup.Inner
The implementing PTransform.
|
Constructor and Description |
---|
CoGroup() |
Modifier and Type | Method and Description |
---|---|
static CoGroup.Inner |
byFieldAccessDescriptor(FieldAccessDescriptor fieldAccessDescriptor)
Join by the following
FieldAccessDescriptor . |
static CoGroup.Inner |
byFieldAccessDescriptor(TupleTag<?> tag,
FieldAccessDescriptor fieldAccessDescriptor)
Select the following fields for the specified PCollection using
FieldAccessDescriptor . |
static CoGroup.Inner |
byFieldIds(java.lang.Integer... fieldIds)
Join by the following field ids.
|
static CoGroup.Inner |
byFieldIds(TupleTag<?> tag,
java.lang.Integer... fieldIds)
Select the following field ids for the specified PCollection.
|
static CoGroup.Inner |
byFieldNames(java.lang.String... fieldNames)
Join by the following field names.
|
static CoGroup.Inner |
byFieldNames(TupleTag<?> tag,
java.lang.String... fieldNames)
Select the following field names for the specified PCollection.
|
public static CoGroup.Inner byFieldNames(java.lang.String... fieldNames)
The same field names are used in all input PCollections.
public static CoGroup.Inner byFieldIds(java.lang.Integer... fieldIds)
The same field ids are used in all input PCollections.
public static CoGroup.Inner byFieldAccessDescriptor(FieldAccessDescriptor fieldAccessDescriptor)
FieldAccessDescriptor
.
The same access descriptor is used in all input PCollections.
public static CoGroup.Inner byFieldNames(TupleTag<?> tag, java.lang.String... fieldNames)
Each PCollection in the input must have fields specified for the join key.
public static CoGroup.Inner byFieldIds(TupleTag<?> tag, java.lang.Integer... fieldIds)
Each PCollection in the input must have fields specified for the join key.
public static CoGroup.Inner byFieldAccessDescriptor(TupleTag<?> tag, FieldAccessDescriptor fieldAccessDescriptor)
FieldAccessDescriptor
.
Each PCollection in the input must have fields specified for the join key.