apache_beam.dataframe.expressions module

class apache_beam.dataframe.expressions.Session(bindings=None)[source]

Bases: object

A session represents a mapping of expressions to concrete values.

The bindings typically include required placeholders, but may be any intermediate expression as well.
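
The caching behaviour described above can be pictured with a small standalone model. This is only a sketch: ToyExpression and ToySession are hypothetical names invented for illustration and do not import or reproduce Beam's code. It assumes a session stores each value the first time it is computed, so binding an intermediate expression skips that expression's own computation.

```python
class ToyExpression:
    """Hypothetical stand-in for an expression: a function of other expressions."""

    def __init__(self, name, func=None, args=()):
        self.name = name
        self.func = func          # None marks a placeholder
        self.args = tuple(args)
        self.calls = 0            # counts how often func actually runs

    def evaluate_at(self, session):
        if self.func is None:
            raise KeyError(f"placeholder {self.name!r} is not bound")
        self.calls += 1
        return self.func(*(session.evaluate(arg) for arg in self.args))


class ToySession:
    """Hypothetical model of a session: a cache from expressions to values."""

    def __init__(self, bindings=None):
        self._bindings = dict(bindings or {})

    def evaluate(self, expr):
        # Compute on first request, then reuse the stored value.
        if expr not in self._bindings:
            self._bindings[expr] = expr.evaluate_at(self)
        return self._bindings[expr]

    def lookup(self, expr):
        # Only returns values that are already bound; never computes.
        return self._bindings[expr]


x = ToyExpression("x")                                   # placeholder
double = ToyExpression("double", lambda v: 2 * v, [x])   # intermediate
total = ToyExpression("total", lambda v: v + 1, [double])

# Binding only the placeholder: everything downstream is computed.
session = ToySession({x: 10})
result_from_placeholder = session.evaluate(total)

# Binding an intermediate expression: its function is never called,
# and the placeholder x does not need a value at all.
override = ToySession({double: 100})
result_from_intermediate = override.evaluate(total)
```

In this model, lookup only reads values that are already stored, while evaluate fills in whatever is missing.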

evaluate(expr)[source]
lookup(expr)[source]
class apache_beam.dataframe.expressions.PartitioningSession(bindings=None)[source]

Bases: apache_beam.dataframe.expressions.Session

An extension of Session that enforces actual partitioning of inputs.

Each expression is evaluated multiple times for various supported partitionings determined by its requires_partition_by specification. For each tested partitioning, the input is partitioned and the expression is evaluated on each partition separately, as if this were actually executed in a parallel manner.

For each input partitioning, the results are verified to be partitioned appropriately according to the expression’s preserves_partition_by specification.

For testing only.
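
The idea of evaluating an expression on each partition separately can be sketched without Beam. The helpers below (partition_by_index, evaluate_partitioned, and the two sample functions) are hypothetical and stand in for the checks described above; rows are modelled as a plain dict from integer index to value, which is an assumption made to keep the example deterministic.

```python
def partition_by_index(rows, num_partitions):
    """Split {index: value} rows so equal index keys share a partition."""
    parts = [{} for _ in range(num_partitions)]
    for key, value in rows.items():
        parts[hash(key) % num_partitions][key] = value
    return parts


def evaluate_partitioned(func, rows, num_partitions):
    """Apply func to each partition separately and merge the outputs."""
    merged = {}
    for part in partition_by_index(rows, num_partitions):
        merged.update(func(part))
    return merged


def add_one(rows):
    # Element-wise: each output depends only on its own input row.
    return {k: v + 1 for k, v in rows.items()}


def subtract_mean(rows):
    # Needs every row at once, so it is not safe on arbitrary partitions.
    mean = sum(rows.values()) / len(rows) if rows else 0
    return {k: v - mean for k, v in rows.items()}


rows = {0: 1.0, 1: 2.0, 2: 3.0, 3: 10.0}

# Compare the partitioned result against evaluating on all rows at once.
elementwise_ok = evaluate_partitioned(add_one, rows, 2) == add_one(rows)
global_op_ok = evaluate_partitioned(subtract_mean, rows, 2) == subtract_mean(rows)
```

The element-wise function survives partitioning, while the function that needs a global mean does not. A mismatch of this kind is what a partition-enforcing test session is meant to surface.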

evaluate(expr)[source]
apache_beam.dataframe.expressions.output_partitioning(expr, input_partitioning)[source]

Return the expected output partitioning for expr when its input is partitioned by input_partitioning.

For internal use only; no backward-compatibility guarantees.

class apache_beam.dataframe.expressions.Expression(name: str, proxy: T, _id: Optional[str] = None)[source]

Bases: typing.Generic

An expression is an operation bound to a set of arguments.

An expression represents a deferred tree of operations, which can be evaluated for a specific set of bindings of root expressions to values.

requires_partition_by indicates the upper bound of a set of partitionings that are acceptable inputs to this expression. The expression should be able to produce the correct result when given input(s) partitioned by its requires_partition_by attribute, or by any partitioning that is _not_ a subpartitioning of it.

preserves_partition_by indicates the upper bound of a set of partitionings that can be preserved by this expression. When the input(s) to this expression are partitioned by preserves_partition_by, or by any partitioning that is _not_ a subpartitioning of it, this expression should produce output(s) partitioned by the same partitioning.

However, if the partitioning of an expression’s input is a subpartitioning of the partitioning that it preserves, the output is presumed to have no particular partitioning (i.e. Arbitrary()).

For example, consider an “element-wise operation”, which has no partitioning requirement and preserves any partitioning given to it:

requires_partition_by = Arbitrary() -----------------------------+
                                                                 |
         +-----------+-------------+---------- ... ----+---------|
         |           |             |                   |         |
    Singleton() < Index([i]) < Index([i, j]) < ... < Index() < Arbitrary()
         |           |             |                   |         |
         +-----------+-------------+---------- ... ----+---------|
                                                                 |
preserves_partition_by = Arbitrary() ----------------------------+

As a more interesting example, consider this expression, which requires Index partitioning, and preserves just Singleton partitioning:

requires_partition_by = Index() -----------------------+
                                                       |
         +-----------+-------------+---------- ... ----|
         |           |             |                   |
    Singleton() < Index([i]) < Index([i, j]) < ... < Index() < Arbitrary()
         |
         |
preserves_partition_by = Singleton()

Note that any non-Arbitrary partitioning is an acceptable input for this expression. However, unless the inputs are Singleton-partitioned, the expression makes no guarantees about the partitioning of the output.
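
The two diagrams can be read as a rule over the chain of partitionings. The sketch below encodes that reading with plain strings and integer ranks; RANK, is_acceptable_input and toy_output_partitioning are hypothetical names, and the model only covers the single chain drawn above, not Beam's actual partitioning classes.

```python
# Ranks follow the chain drawn in the diagrams, left to right.
# They are an illustrative assumption, not Beam's data model.
RANK = {
    "Singleton()": 0,
    "Index([i])": 1,
    "Index([i, j])": 2,
    "Index()": 3,
    "Arbitrary()": 4,
}


def is_acceptable_input(requires, input_partitioning):
    """An input is acceptable if it sits at or left of `requires`."""
    return RANK[input_partitioning] <= RANK[requires]


def toy_output_partitioning(requires, preserves, input_partitioning):
    """Partitioning of the output for one point on the chain."""
    if not is_acceptable_input(requires, input_partitioning):
        raise ValueError(f"{input_partitioning} does not satisfy {requires}")
    if RANK[input_partitioning] <= RANK[preserves]:
        return input_partitioning  # preserved unchanged
    return "Arbitrary()"  # no guarantee about the output


# Element-wise operation: requires Arbitrary(), preserves Arbitrary().
elementwise = [
    toy_output_partitioning("Arbitrary()", "Arbitrary()", p) for p in RANK
]

# Second example: requires Index(), preserves Singleton().
second = {
    p: toy_output_partitioning("Index()", "Singleton()", p)
    for p in RANK
    if is_acceptable_input("Index()", p)
}
```

Under these assumptions the element-wise operation returns every input partitioning unchanged, while the second expression accepts everything except Arbitrary() and keeps only Singleton() intact.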

proxy() → T[source]
placeholders()[source]

Returns all the placeholders that self depends on.

evaluate_at(session: apache_beam.dataframe.expressions.Session) → T[source]

Returns the result of self with the bindings given in session.

requires_partition_by() → apache_beam.dataframe.partitionings.Partitioning[source]

Returns the partitioning, if any, required to evaluate this expression.

Returns partitionings.Arbitrary() if no partitioning is required.

preserves_partition_by() → apache_beam.dataframe.partitionings.Partitioning[source]

Returns the partitioning, if any, preserved by this expression.

This gives an upper bound on the partitioning of its output. The actual partitioning of the output may be less strict (e.g. if the input was less partitioned).

class apache_beam.dataframe.expressions.PlaceholderExpression(proxy, reference=None)[source]

Bases: apache_beam.dataframe.expressions.Expression

An expression whose value must be explicitly bound in the session.

Initialize a placeholder expression.

Parameters:
  • proxy – A proxy object with the type expected to be bound to this expression. Used for type checking at pipeline construction time.
placeholders()[source]
args()[source]
evaluate_at(session)[source]
requires_partition_by()[source]
preserves_partition_by()[source]
class apache_beam.dataframe.expressions.ConstantExpression(value, proxy=None)[source]

Bases: apache_beam.dataframe.expressions.Expression

An expression whose value is known at pipeline construction time.

Initialize a constant expression.

Parameters:
  • value – The constant value to be produced by this expression.
  • proxy – (Optional) a proxy object with the same type as value to use for rapid type checking at pipeline construction time. If not provided, value will be used directly.
placeholders()[source]
args()[source]
evaluate_at(session)[source]
requires_partition_by()[source]
preserves_partition_by()[source]
class apache_beam.dataframe.expressions.ComputedExpression(name, func, args, proxy=None, _id=None, requires_partition_by=Index, preserves_partition_by=Singleton)[source]

Bases: apache_beam.dataframe.expressions.Expression

An expression whose value must be computed at pipeline execution time.

Initialize a computed expression.

Parameters:
  • name – The name of this expression.
  • func – The function that will be used to compute the value of this expression. Should accept arguments of the types returned when evaluating the args expressions.
  • args – The list of expressions that will be used to produce inputs to func.
  • proxy – (Optional) a proxy object with the same type as the objects that this ComputedExpression will produce at execution time. If not provided, a proxy will be generated using func and the proxies of args.
  • _id – (Optional) a string to uniquely identify this expression.
  • requires_partition_by – The required (common) partitioning of the args.
  • preserves_partition_by – The level of partitioning preserved.
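
The proxy parameter can be illustrated with a standalone model. This is a sketch under stated assumptions: ToyPlaceholder and ToyComputed are hypothetical classes, and the model assumes that a missing proxy is derived by calling func on the proxies of args, as the parameter description suggests. It also assumes a computed expression's placeholders are the union of its arguments' placeholders.

```python
class ToyPlaceholder:
    """Hypothetical placeholder: carries only a proxy of the expected type."""

    def __init__(self, proxy):
        self._proxy = proxy

    def proxy(self):
        return self._proxy

    def placeholders(self):
        return frozenset([self])


class ToyComputed:
    """Hypothetical computed expression built from func and args."""

    def __init__(self, name, func, args, proxy=None):
        self.name = name
        self.func = func
        self.args = tuple(args)
        if proxy is None:
            # Derive the proxy by running func on the argument proxies.
            # A type mismatch therefore fails here, before any real data.
            proxy = func(*(arg.proxy() for arg in self.args))
        self._proxy = proxy

    def proxy(self):
        return self._proxy

    def placeholders(self):
        # Union of the placeholders of every argument.
        return frozenset().union(*(arg.placeholders() for arg in self.args))


a = ToyPlaceholder(0)
b = ToyPlaceholder(0)
name = ToyPlaceholder("s")

a_plus_b = ToyComputed("add", lambda x, y: x + y, [a, b])
nested = ToyComputed("add_again", lambda x, y: x + y, [a_plus_b, a])

# Adding an int proxy to a str proxy fails while the expression is built.
try:
    ToyComputed("bad_add", lambda x, y: x + y, [a, name])
    construction_error = None
except TypeError as exc:
    construction_error = exc
```

In this model a mismatch between argument types surfaces when the expression is constructed, which is the role the proxy plays at pipeline construction time.
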
placeholders()[source]
args()[source]
evaluate_at(session)[source]
requires_partition_by()[source]
preserves_partition_by()[source]
apache_beam.dataframe.expressions.elementwise_expression(name, func, args)[source]
apache_beam.dataframe.expressions.allow_non_parallel_operations(allow=True)[source]
exception apache_beam.dataframe.expressions.NonParallelOperation(msg)[source]

Bases: Exception