apache_beam.dataframe.expressions module¶
-
class
apache_beam.dataframe.expressions.
Session
(bindings=None)[source]¶ Bases:
object
A session represents a mapping of expressions to concrete values.
The bindings typically include required placeholders, but may be any intermediate expression as well.
-
class
apache_beam.dataframe.expressions.
PartitioningSession
(bindings=None)[source]¶ Bases:
apache_beam.dataframe.expressions.Session
An extension of Session that enforces actual partitioning of inputs.
Each expression is evaluated multiple times for various supported partitionings determined by its requires_partition_by specification. For each tested partitioning, the input is partitioned and the expression is evaluated on each partition separately, as if this were actually executed in a parallel manner.
For each input partitioning, the results are verified to be partitioned appropriately according to the expression’s preserves_partition_by specification.
For testing only.
-
apache_beam.dataframe.expressions.
output_partitioning
(expr, input_partitioning)[source]¶ Return the expected output partitioning for expr when it’s input is partitioned by input_partitioning.
For internal use only; No backward compatibility guarantees
-
class
apache_beam.dataframe.expressions.
Expression
(name, proxy, _id=None)[source]¶ Bases:
object
An expression is an operation bound to a set of arguments.
An expression represents a deferred tree of operations, which can be evaluated at a specific bindings of root expressions to values.
requires_partition_by indicates the upper bound of a set of partitionings that are acceptable inputs to this expression. The expression should be able to produce the correct result when given input(s) partitioned by its requires_partition_by attribute, or by any partitoning that is _not_ a subpartitioning of it.
preserves_partition_by indicates the upper bound of a set of partitionings that can be preserved by this expression. When the input(s) to this expression are partitioned by preserves_partition_by, or by any partitioning that is _not_ a subpartitioning of it, this expression should produce output(s) partitioned by the same partitioning.
However, if the partitioning of an expression’s input is a subpartitioning of the partitioning that it preserves, the output is presumed to have no particular partitioning (i.e. Arbitrary()).
For example, let’s look at an “element-wise operation”, that has no partitioning requirement, and preserves any partitioning given to it:
requires_partition_by = Arbitrary() -----------------------------+ | +-----------+-------------+---------- ... ----+---------| | | | | | Singleton() < Index([i]) < Index([i, j]) < ... < Index() < Arbitrary() | | | | | +-----------+-------------+---------- ... ----+---------| | preserves_partition_by = Arbitrary() ----------------------------+
As a more interesting example, consider this expression, which requires Index partitioning, and preserves just Singleton partitioning:
requires_partition_by = Index() -----------------------+ | +-----------+-------------+---------- ... ----| | | | | Singleton() < Index([i]) < Index([i, j]) < ... < Index() < Arbitrary() | | preserves_partition_by = Singleton()
Note that any non-Arbitrary partitioning is an acceptable input for this expression. However, unless the inputs are Singleton-partitioned, the expression makes no guarantees about the partitioning of the output.
-
class
apache_beam.dataframe.expressions.
PlaceholderExpression
(proxy, reference=None)[source]¶ Bases:
apache_beam.dataframe.expressions.Expression
An expression whose value must be explicitly bound in the session.
Initialize a placeholder expression.
Parameters: proxy – A proxy object with the type expected to be bound to this expression. Used for type checking at pipeline construction time.
-
class
apache_beam.dataframe.expressions.
ConstantExpression
(value, proxy=None)[source]¶ Bases:
apache_beam.dataframe.expressions.Expression
An expression whose value is known at pipeline construction time.
Initialize a constant expression.
Parameters: - value – The constant value to be produced by this expression.
- proxy – (Optional) a proxy object with same type as value to use for rapid type checking at pipeline construction time. If not provided, value will be used directly.
-
class
apache_beam.dataframe.expressions.
ComputedExpression
(name, func, args, proxy=None, _id=None, requires_partition_by=Index, preserves_partition_by=Singleton)[source]¶ Bases:
apache_beam.dataframe.expressions.Expression
An expression whose value must be computed at pipeline execution time.
Initialize a computed expression.
Parameters: - name – The name of this expression.
- func – The function that will be used to compute the value of this expression. Should accept arguments of the types returned when evaluating the args expressions.
- args – The list of expressions that will be used to produce inputs to func.
- proxy – (Optional) a proxy object with same type as the objects that this ComputedExpression will produce at execution time. If not provided, a proxy will be generated using func and the proxies of args.
- _id – (Optional) a string to uniquely identify this expression.
- requires_partition_by – The required (common) partitioning of the args.
- preserves_partition_by – The level of partitioning preserved.