apache_beam.dataframe.frames module

Analogs for pandas.DataFrame and pandas.Series: DeferredDataFrame and DeferredSeries.

These classes are effectively wrappers around a schema-aware PCollection that provide a set of operations compatible with the pandas API.

The Beam DataFrame API aims to be fully compatible with the pandas API, but some features are currently unimplemented for various reasons. Pay particular attention to the ‘Differences from pandas’ section of each operation to understand where the two diverge.
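
Because several members on this page are rejected for producing "an output type that is not deferred", it may help to see what deferred means in miniature. The toy class below is only a sketch in plain Python: `ToyDeferred` and its methods are hypothetical names, not part of Beam, and the real classes wrap a PCollection rather than a list of functions.

```python
class ToyDeferred:
    """Hypothetical stand-in for a deferred column; not a Beam class."""

    def __init__(self, steps=()):
        # Each step is a function applied to the whole list of values.
        self._steps = tuple(steps)

    def add(self, n):
        # Record the operation; nothing is computed yet.
        return ToyDeferred(self._steps + (lambda xs: [x + n for x in xs],))

    def between(self, left, right):
        return ToyDeferred(
            self._steps + (lambda xs: [left <= x <= right for x in xs],))

    @property
    def shape(self):
        # The size is unknown until data flows through, so refuse eagerly.
        raise NotImplementedError("shape needs data that is not available yet")

    def evaluate(self, values):
        # Only here does any computation happen.
        for step in self._steps:
            values = step(values)
        return values


plan = ToyDeferred().add(1).between(2, 3)  # builds a plan, computes nothing
result = plan.evaluate([0, 1, 2, 3])
print(result)  # [False, True, True, False]
```

Element-wise operations compose naturally into such a plan, whereas anything that must return a concrete Python value immediately (a shape, an array) has nothing to return until the plan is run.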

class apache_beam.dataframe.frames.DeferredSeries(expr)[source]

Bases: apache_beam.dataframe.frames.DeferredDataFrameOrSeries

name

Return the name of the Series.

The name of a Series becomes its index or column name if it is used to form a DataFrame. It is also used whenever displaying the Series using the interpreter.

Returns:The name of the DeferredSeries, also the column name if part of a DeferredDataFrame.
Return type:label (hashable object)

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.rename
Sets the DeferredSeries name when given a scalar input.
Index.name
Corresponding Index property.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

The Series name can be set initially when calling the constructor.

>>> s = pd.Series([1, 2, 3], dtype=np.int64, name='Numbers')
>>> s
0    1
1    2
2    3
Name: Numbers, dtype: int64
>>> s.name = "Integers"
>>> s
0    1
1    2
2    3
Name: Integers, dtype: int64

The name of a Series within a DataFrame is its column name.

>>> df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
...                   columns=["Odd Numbers", "Even Numbers"])
>>> df
   Odd Numbers  Even Numbers
0            1             2
1            3             4
2            5             6
>>> df["Even Numbers"].name
'Even Numbers'
hasnans

Return True if there are any NaNs.

Enables various performance speedups.

Returns:
Return type:bool

Differences from pandas

This operation has no known divergences from the pandas API.

dtype

Return the dtype object of the underlying data.

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s.dtype
dtype('int64')
dtypes

Return the dtype object of the underlying data.

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s.dtype
dtype('int64')
keys()[source]

Return alias for index.

Returns:Index of the DeferredSeries.
Return type:Index

Differences from pandas

This operation has no known divergences from the pandas API.

T(**kwargs)

Return the transpose, which is by definition self.

Differences from pandas

This operation has no known divergences from the pandas API.

transpose(**kwargs)

Return the transpose, which is by definition self.

Returns:
Return type:DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

shape

pandas.Series.shape() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

append(to_append, ignore_index, verify_integrity, **kwargs)[source]

This method has been removed in the current version of pandas.

align(other, join, axis, level, method, **kwargs)[source]

Align two objects on their axes with the specified join method.

Join method is specified for each axis Index.

Parameters:
  • other (DeferredDataFrame or DeferredSeries) –
  • join ({'outer', 'inner', 'left', 'right'}, default 'outer') –
  • axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).
  • level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
  • fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
  • method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –

    Method to use for filling holes in reindexed DeferredSeries:

    • pad / ffill: propagate last valid observation forward to next valid.
    • backfill / bfill: use NEXT valid observation to fill gap.
  • limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
  • fill_axis ({0 or 'index'}, default 0) – Filling axis, method and limit.
  • broadcast_axis ({0 or 'index'}, default None) – Broadcast values along this axis, if aligning two objects of different dimensions.
Returns:

Aligned objects.

Return type:

tuple of (DeferredSeries, type of other)

Differences from pandas

Aligning per-level is not yet supported. Only the default, level=None, is allowed.

Filling NaN values via method is not supported, because it is order-sensitive. Only the default, method=None, is allowed.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other
    A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900

Align on columns:

>>> left, right = df.align(other, join="outer", axis=1)
>>> left
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right
    A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN

We can also align on the index:

>>> left, right = df.align(other, join="outer", axis=0)
>>> left
    D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right
    A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0

Finally, the default `axis=None` will align on both index and columns:

>>> left, right = df.align(other, join="outer", axis=None)
>>> left
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
argsort(**kwargs)

pandas.Series.argsort() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

array

pandas.Series.array() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

get(**kwargs)

pandas.Series.get() is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.

For more information see https://s.apache.org/dataframe-non-deferred-columns.

ravel(**kwargs)

pandas.Series.ravel() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

slice_shift(**kwargs)

pandas.Series.slice_shift() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

tshift(**kwargs)

pandas.Series.tshift() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

rename(**kwargs)

Alter Series index labels or name.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

Alternatively, change Series.name with a scalar value.

See the user guide for more.

Parameters:
  • index (scalar, hashable sequence, dict-like or function, optional) – Functions or dict-like are transformations to apply to the index. Scalar or hashable sequence-like will alter the DeferredSeries.name attribute.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
  • copy (bool, default True) – Also copy underlying data.
  • inplace (bool, default False) – Whether to return a new DeferredSeries. If True the value of copy is ignored.
  • level (int or level name, default None) – In case of MultiIndex, only rename labels in the specified level.
  • errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise KeyError when a dict-like mapper or index contains labels that are not present in the index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.
Returns:

DeferredSeries with index labels or name altered or None if inplace=True.

Return type:

DeferredSeries or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.rename()
Corresponding DeferredDataFrame method.
DeferredSeries.rename_axis()
Set the name of the axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.rename("my_name")  # scalar, changes Series.name
0    1
1    2
2    3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2)  # function, changes labels
0    1
1    2
4    3
dtype: int64
>>> s.rename({1: 3, 2: 5})  # mapping, changes labels
0    1
3    2
5    3
dtype: int64
between(**kwargs)

Return boolean Series equivalent to left <= series <= right.

This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.

Parameters:
  • left (scalar or list-like) – Left boundary.
  • right (scalar or list-like) – Right boundary.
  • inclusive ({"both", "neither", "left", "right"}) –

    Include boundaries. Whether to set each bound as closed or open.

    Changed in version 1.3.0.

Returns:

DeferredSeries representing whether each element is between left and right (inclusive).

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.gt()
Greater than of series and other.
DeferredSeries.lt()
Less than of series and other.

Notes

This function is equivalent to (left <= ser) & (ser <= right)
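
The equivalence in the note above, together with the four inclusive options and the rule that NA values give False, can be sketched on a plain list. This is an illustration in standard-library Python, not the pandas or Beam implementation; the helper name `between` is chosen only to mirror the method.

```python
import math


def between(values, left, right, inclusive="both"):
    """Element-wise range check on a plain list; NaN always maps to False."""
    if inclusive not in ("both", "neither", "left", "right"):
        raise ValueError(f"unknown inclusive option: {inclusive!r}")
    lower_closed = inclusive in ("both", "left")
    upper_closed = inclusive in ("both", "right")
    result = []
    for value in values:
        if isinstance(value, float) and math.isnan(value):
            result.append(False)  # missing values are never "between"
            continue
        above = left <= value if lower_closed else left < value
        below = value <= right if upper_closed else value < right
        result.append(above and below)
    return result


data = [2, 0, 4, 8, math.nan]
print(between(data, 1, 4))                       # [True, False, True, False, False]
print(between(data, 1, 4, inclusive="neither"))  # [True, False, False, False, False]
```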

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([2, 0, 4, 8, np.nan])

Boundary values are included by default:

>>> s.between(1, 4)
0     True
1    False
2     True
3    False
4    False
dtype: bool

With inclusive set to "neither", boundary values are excluded:

>>> s.between(1, 4, inclusive="neither")
0     True
1    False
2    False
3    False
4    False
dtype: bool

`left` and `right` can be any scalar value:

>>> s = pd.Series(['Alice', 'Bob', 'Carol', 'Eve'])
>>> s.between('Anna', 'Daniel')
0    False
1     True
2     True
3    False
dtype: bool
add_suffix(**kwargs)

Suffix labels with string suffix.

For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.

Parameters:
  • suffix (str) – The string to add after each label.
  • axis ({0 or 'index', 1 or 'columns', None}, default None) –

    Axis to add suffix on

    New in version 2.0.0.

Returns:

New DeferredSeries or DeferredDataFrame with updated labels.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.add_prefix()
Prefix row labels with string prefix.
DeferredDataFrame.add_prefix()
Prefix column labels with string prefix.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_suffix('_item')
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_suffix('_col')
     A_col  B_col
0       1       3
1       2       4
2       3       5
3       4       6
add_prefix(**kwargs)

Prefix labels with string prefix.

For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.

Parameters:
  • prefix (str) – The string to add before each label.
  • axis ({0 or 'index', 1 or 'columns', None}, default None) –

    Axis to add prefix on

    New in version 2.0.0.

Returns:

New DeferredSeries or DeferredDataFrame with updated labels.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.add_suffix()
Suffix row labels with string suffix.
DeferredDataFrame.add_suffix()
Suffix column labels with string suffix.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_prefix('item_')
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_prefix('col_')
     col_A  col_B
0       1       3
1       2       4
2       3       5
3       4       6
info(**kwargs)

pandas.Series.info() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

idxmin(**kwargs)[source]

Return the row label of the minimum value.

If multiple values equal the minimum, the first row label with that value is returned.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
  • skipna (bool, default True) – Exclude NA/null values. If the entire DeferredSeries is NA, the result will be NA.
  • *args, **kwargs –

    Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Label of the minimum value.

Return type:

Index

Raises:

ValueError – If the DeferredSeries is empty.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.argmin()
Return indices of the minimum values along the given axis.
DeferredDataFrame.idxmin()
Return index of first occurrence of minimum over requested axis.
DeferredSeries.idxmax()
Return index label of the first occurrence of maximum of values.

Notes

This method is the DeferredSeries version of ndarray.argmin. This method returns the label of the minimum, while ndarray.argmin returns the position. To get the position, use series.values.argmin().
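
The label-versus-position distinction, the first-match tie rule and the skipna behaviour described in this entry can be sketched in plain Python. The helper below is a hypothetical illustration over parallel lists, not the pandas or Beam code.

```python
import math


def idxmin(labels, values, skipna=True):
    """Label of the first minimum value; a sketch, not the pandas code."""
    if not values:
        raise ValueError("cannot take idxmin of an empty sequence")
    best_label, best_value = math.nan, None
    for label, value in zip(labels, values):
        if isinstance(value, float) and math.isnan(value):
            if skipna:
                continue
            return math.nan  # any NA with skipna=False gives nan
        # Strict "<" keeps the first label when several values tie.
        if best_value is None or value < best_value:
            best_label, best_value = label, value
    return best_label


labels = ["A", "B", "C", "D"]
values = [1.0, math.nan, 4.0, 1.0]

label = idxmin(labels, values)   # the label, as idxmin reports it
position = labels.index(label)   # the position, as argmin would report it
print(label, position)           # A 0
```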

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(data=[1, None, 4, 1],
...               index=['A', 'B', 'C', 'D'])
>>> s
A    1.0
B    NaN
C    4.0
D    1.0
dtype: float64

>>> s.idxmin()
'A'

If skipna is False and there is an NA value in the data, the function returns nan.

>>> s.idxmin(skipna=False)
nan
idxmax(**kwargs)[source]

Return the row label of the maximum value.

If multiple values equal the maximum, the first row label with that value is returned.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
  • skipna (bool, default True) – Exclude NA/null values. If the entire DeferredSeries is NA, the result will be NA.
  • *args, **kwargs –

    Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Label of the maximum value.

Return type:

Index

Raises:

ValueError – If the DeferredSeries is empty.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.argmax()
Return indices of the maximum values along the given axis.
DeferredDataFrame.idxmax()
Return index of first occurrence of maximum over requested axis.
DeferredSeries.idxmin()
Return index label of the first occurrence of minimum of values.

Notes

This method is the DeferredSeries version of ndarray.argmax. This method returns the label of the maximum, while ndarray.argmax returns the position. To get the position, use series.values.argmax().

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(data=[1, None, 4, 3, 4],
...               index=['A', 'B', 'C', 'D', 'E'])
>>> s
A    1.0
B    NaN
C    4.0
D    3.0
E    4.0
dtype: float64

>>> s.idxmax()
'C'

If skipna is False and there is an NA value in the data, the function returns nan.

>>> s.idxmax(skipna=False)
nan
explode(ignore_index)[source]

Transform each element of a list-like to a row.

Parameters:ignore_index (bool, default False) –

If True, the resulting index will be labeled 0, 1, …, n - 1.

New in version 1.1.0.

Returns:Exploded lists to rows; index will be duplicated for these rows.
Return type:DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.str.split()
Split string values on specified separator.
DeferredSeries.unstack()
Unstack, a.k.a. pivot, DeferredSeries with MultiIndex to produce DeferredDataFrame.
DeferredDataFrame.melt()
Unpivot a DeferredDataFrame from wide format to long format.
DeferredDataFrame.explode()
Explode a DeferredDataFrame from list-like columns to long format.

Notes

This routine will explode list-likes including lists, tuples, sets, DeferredSeries, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of elements in the output will be non-deterministic when exploding sets.

Reference the user guide for more examples.
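
The rules in the notes above (list-likes are expanded, scalars pass through, empty list-likes become NaN, index labels repeat) can be sketched on plain Python lists. This hypothetical helper handles only list, tuple and set, and is an illustration rather than the pandas or Beam implementation.

```python
import math


def explode(index, values, ignore_index=False):
    """Expand list-like entries into one row per element."""
    out_index, out_values = [], []
    for label, value in zip(index, values):
        if isinstance(value, (list, tuple, set)):
            # An empty list-like still yields one row, holding NaN.
            items = list(value) if len(value) > 0 else [math.nan]
        else:
            items = [value]  # scalars, including strings, are left unchanged
        for item in items:
            out_index.append(label)  # the original label repeats per element
            out_values.append(item)
    if ignore_index:
        out_index = list(range(len(out_values)))
    return out_index, out_values


new_index, new_values = explode([0, 1, 2, 3], [[1, 2, 3], "foo", [], [3, 4]])
print(new_index)   # [0, 0, 0, 1, 2, 3, 3]
print(new_values)  # [1, 2, 3, 'foo', nan, 3, 4]
```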

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
>>> s
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object

>>> s.explode()
0      1
0      2
0      3
1    foo
2    NaN
3      3
3      4
dtype: object
dot(other)[source]

Compute the matrix multiplication between the DataFrame and other.

This method computes the matrix product between the DataFrame and the values of an other Series, DataFrame or a numpy array.

It can also be called using self @ other in Python >= 3.5.

Parameters:other (DeferredSeries, DeferredDataFrame or array-like) – The other object to compute the matrix product with.
Returns:If other is a DeferredSeries, return the matrix product between self and other as a DeferredSeries. If other is a DeferredDataFrame or a numpy.array, return the matrix product of self and other in a DeferredDataFrame or a np.array.
Return type:DeferredSeries or DeferredDataFrame

Differences from pandas

other must be a DeferredDataFrame or DeferredSeries instance. Computing the dot product with an array-like is not supported because it is order-sensitive.

See also

DeferredSeries.dot()
Similar method for DeferredSeries.

Notes

The dimensions of DeferredDataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of DeferredDataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.

The dot method for DeferredSeries computes the inner product, instead of the matrix product here.
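
The alignment requirement in the notes above means the product is driven by labels rather than by position. The sketch below models a Series as a label-to-value dict and a DataFrame as a dict of rows; both helpers are hypothetical and only illustrate the idea.

```python
def series_dot(left, right):
    """Inner product of two label->value mappings, matched on labels."""
    if set(left) != set(right):
        raise ValueError("labels are not aligned")
    return sum(left[label] * right[label] for label in left)


def frame_dot(rows, series):
    """Matrix-vector product: one inner product per row of the frame."""
    return {row_label: series_dot(row, series) for row_label, row in rows.items()}


df_rows = {
    0: {0: 0, 1: 1, 2: -2, 3: -1},
    1: {0: 1, 1: 1, 2: 1, 3: 1},
}
s = {0: 1, 1: 1, 2: 2, 3: 1}
s_shuffled = {1: 1, 0: 1, 2: 2, 3: 1}  # same labels, different order

print(frame_dot(df_rows, s))           # {0: -4, 1: 5}
print(frame_dot(df_rows, s_shuffled))  # {0: -4, 1: 5}
```

Because values are paired by label, reordering one operand leaves the result unchanged, which is consistent with the order-sensitivity restriction on array-like inputs noted under ‘Differences from pandas’.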

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Here we multiply a DataFrame with a Series.

>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> s = pd.Series([1, 1, 2, 1])
>>> df.dot(s)
0    -4
1     5
dtype: int64

Here we multiply a DataFrame with another DataFrame.

>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(other)
    0   1
0   1   4
1   2   2

Note that the dot method gives the same result as @

>>> df @ other
    0   1
0   1   4
1   2   2

The dot method also works if other is an np.array.

>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(arr)
    0   1
0   1   4
1   2   2

Note how shuffling of the objects does not change the result.

>>> s2 = s.reindex([1, 0, 2, 3])
>>> df.dot(s2)
0    -4
1     5
dtype: int64
nunique(**kwargs)[source]

Return number of unique elements in the object.

Excludes NA values by default.

Parameters:dropna (bool, default True) – Don’t include NaN in the count.
Returns:
Return type:int

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.nunique()
Method nunique for DeferredDataFrame.
DeferredSeries.count()
Count non-NA/null observations in the DeferredSeries.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 3, 5, 7, 7])
>>> s
0    1
1    3
2    5
3    7
4    7
dtype: int64

>>> s.nunique()
4
quantile(q, **kwargs)[source]

Return value at the given quantile.

Parameters:
  • q (float or array-like, default 0.5 (50% quantile)) – The quantile(s) to compute, which can lie in range: 0 <= q <= 1.
  • interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –

    This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:

    • linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
    • lower: i.
    • higher: j.
    • nearest: i or j whichever is nearest.
    • midpoint: (i + j) / 2.
Returns:

If q is an array, a DeferredSeries will be returned where the index is q and the values are the quantiles, otherwise a float will be returned.

Return type:

float or DeferredSeries

Differences from pandas

quantile is not parallelizable. See Issue 20933 tracking the possible addition of an approximate, parallelizable implementation of quantile.

See also

core.window.Rolling.quantile()
Calculate the rolling quantile.
numpy.percentile()
Returns the q-th percentile(s) of the array elements.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> s = pd.Series([1, 2, 3, 4])
>>> s.quantile(.5)
2.5
>>> s.quantile([.25, .5, .75])
0.25    1.75
0.50    2.50
0.75    3.25
dtype: float64
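
The 'linear' interpolation rule listed under the interpolation parameter can be worked through by hand. The function below is a standard-library sketch of that one rule, with a hypothetical name, and is not the pandas or Beam implementation. It sorts all values before indexing, which suggests why the operation is hard to split across workers.

```python
import math


def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest points."""
    if not 0 <= q <= 1:
        raise ValueError("q must lie in the range 0 <= q <= 1")
    data = sorted(values)  # requires every value at once
    if not data:
        raise ValueError("cannot take the quantile of empty data")
    position = (len(data) - 1) * q  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    i, j = data[lower], data[upper]
    return i + (j - i) * fraction   # the "linear" rule: i + (j - i) * fraction


s = [1, 2, 3, 4]
print(quantile_linear(s, 0.5))                            # 2.5
print([quantile_linear(s, q) for q in (0.25, 0.5, 0.75)])  # [1.75, 2.5, 3.25]
```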
std(*args, **kwargs)[source]

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) – For DeferredSeries this parameter is unused and defaults to 0.
  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
Returns:

Return type:

scalar or DeferredSeries (if level specified)

Differences from pandas

This operation has no known divergences from the pandas API.

Notes

To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
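
The effect of ddof is easiest to see with the divisor written out. The sketch below uses only the standard library and the age column from the example that follows; the function name `std` is chosen to mirror the method and is not the pandas or Beam code.

```python
import math
import statistics


def std(values, ddof=1):
    """Standard deviation with divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))


ages = [21, 25, 62, 43]
sample_std = std(ages)              # divisor N - 1 (the default)
population_std = std(ages, ddof=0)  # divisor N, the numpy.std convention

print(round(sample_std, 6))      # 18.786076
print(round(population_std, 6))  # 16.269219

# Cross-check against the standard library's two estimators.
assert math.isclose(sample_std, statistics.stdev(ages))
assert math.isclose(population_std, statistics.pstdev(ages))
```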

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                    'age': [21, 25, 62, 43],
...                    'height': [1.61, 1.87, 1.49, 2.01]}
...                   ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

The standard deviation of the columns can be found as follows:

>>> df.std()
age       18.786076
height     0.237417
dtype: float64

Alternatively, `ddof=0` can be set to normalize by N instead of N-1:

>>> df.std(ddof=0)
age       16.269219
height     0.205609
dtype: float64
mean(skipna, **kwargs)[source]

Return the mean of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
  • **kwargs – Additional keyword arguments to be passed to the function.
Returns:

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

var(axis, skipna, level, ddof, **kwargs)[source]

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) – For DeferredSeries this parameter is unused and defaults to 0.
  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
Returns:

Return type:

scalar or DeferredSeries (if level specified)

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                   'age': [21, 25, 62, 43],
...                   'height': [1.61, 1.87, 1.49, 2.01]}
...                  ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

>>> df.var()
age       352.916667
height      0.056367
dtype: float64

Alternatively, ddof=0 can be set to normalize by N instead of N-1:

>>> df.var(ddof=0)
age       264.687500
height      0.042275
dtype: float64
corr(other, method, min_periods)[source]

Compute correlation with other Series, excluding missing values.

The two Series objects are not required to be the same length and will be aligned internally before the correlation function is applied.

Parameters:
  • other (DeferredSeries) – DeferredSeries with which to compute the correlation.
  • method ({'pearson', 'kendall', 'spearman'} or callable) –

    Method used to compute correlation:

    • pearson : Standard correlation coefficient
    • kendall : Kendall Tau correlation coefficient
    • spearman : Spearman rank correlation
    • callable: Callable with input two 1d ndarrays and returning a float.

    Warning

    Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

  • min_periods (int, optional) – Minimum number of observations needed to have a valid result.
Returns:

Correlation with other.

Return type:

float

Differences from pandas

Only method='pearson' is currently parallelizable.

See also

DeferredDataFrame.corr()
Compute pairwise correlation between columns.
DeferredDataFrame.corrwith()
Compute pairwise correlation with another DeferredDataFrame or DeferredSeries.

Notes

Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.
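
"Pairwise complete observations" means a pair is used only when both of its values are present. A minimal sketch of the Pearson case in plain Python follows. It assumes the two inputs are already aligned position by position, whereas the real method aligns on the index first, and the helper name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over pairs where neither value is NaN."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    co_moment = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return co_moment / math.sqrt(ss_x * ss_y)


s1 = [0.2, 0.0, 0.6, 0.2]
s2 = [0.3, 0.6, 0.0, 0.1]
r = pearson(s1, s2)
print(r)  # a negative value: s2 tends to fall as s1 rises

# The pair containing NaN is dropped, leaving a perfectly linear relation.
r_with_gap = pearson([1.0, 2.0, math.nan, 3.0], [2.0, 4.0, 100.0, 6.0])
```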

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> s1 = pd.Series([.2, .0, .6, .2])
>>> s2 = pd.Series([.3, .6, .0, .1])
>>> s1.corr(s2, method=histogram_intersection)
0.3
skew(axis, skipna, level, numeric_only, **kwargs)[source]

Return unbiased skew over requested axis.

Normalized by N-1.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
  • **kwargs – Additional keyword arguments to be passed to the function.
Returns:

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.
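
As a worked illustration of a sample-size-corrected skew, the sketch below applies the adjusted Fisher-Pearson formula in plain Python. That this is the exact estimator used underneath is an assumption, not something stated in this entry, and the name `sample_skew` is hypothetical.

```python
import math


def sample_skew(values):
    """Adjusted Fisher-Pearson skewness (assumed estimator, see lead-in)."""
    n = len(values)
    if n < 3:
        return math.nan  # the correction factor divides by n - 2
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no asymmetry
    g1 = m3 / m2 ** 1.5                            # biased skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1   # small-sample correction


print(round(sample_skew([1, 2, 3, 4, 5]), 6))  # 0.0, symmetric data
print(round(sample_skew([1, 1, 1, 10]), 6))    # 2.0, long right tail
```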

kurtosis(axis, skipna, level, numeric_only, **kwargs)[source]

Return unbiased kurtosis over requested axis.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
  • **kwargs – Additional keyword arguments to be passed to the function.
Returns:

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.
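
NOTE: The sketch below is not part of the pandas or Beam documentation. It illustrates Fisher's definition (a normal distribution scores 0.0) with the usual small-sample bias correction, written in plain Python. The name unbiased_kurtosis is hypothetical, and the exact estimator is an assumption based on the description above.

```python
import math


def unbiased_kurtosis(values):
    """Bias-corrected excess (Fisher) kurtosis, skipping NaN entries."""
    xs = [x for x in values if not math.isnan(x)]
    n = len(xs)
    if n < 4:
        return math.nan  # the correction terms divide by (n - 2) * (n - 3)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    m4 = sum((x - mean) ** 4 for x in xs)  # sum of fourth-power deviations
    if m2 == 0:
        return 0.0  # constant input; this sketch reports zero
    numerator = n * (n + 1) * (n - 1) * m4
    denominator = (n - 2) * (n - 3) * m2 ** 2
    adjustment = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return numerator / denominator - adjustment


flat = unbiased_kurtosis([1, 2, 3, 4, 5])  # evenly spread data, negative excess kurtosis
```

Evenly spread data is flatter than a normal distribution, so its excess kurtosis is negative.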

kurt(*args, **kwargs)[source]

Return unbiased kurtosis over requested axis.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
  • **kwargs – Additional keyword arguments to be passed to the function.
Returns:

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

cov(other, min_periods, ddof)[source]

Compute covariance with Series, excluding missing values.

The two Series objects are not required to be the same length and will be aligned internally before the covariance is calculated.

Parameters:
  • other (DeferredSeries) – DeferredSeries with which to compute the covariance.
  • min_periods (int, optional) – Minimum number of observations needed to have a valid result.
  • ddof (int, default 1) –

    Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

    New in version 1.1.0.

Returns:

Covariance between DeferredSeries and other normalized by N-1 (unbiased estimator).

Return type:

float

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.cov()
Compute pairwise covariance of columns.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s1 = pd.Series([0.90010907, 0.13484424, 0.62036035])
>>> s2 = pd.Series([0.12528585, 0.26962463, 0.51111198])
>>> s1.cov(s2)
-0.01685762652715874
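
NOTE: The sketch below is not part of the pandas or Beam documentation. It shows how the N - ddof divisor and the min_periods check described above fit together, in plain Python. The name covariance is hypothetical, and the sketch pairs values by position, whereas pandas aligns the two Series by index first.

```python
import math


def covariance(xs, ys, min_periods=None, ddof=1):
    """Covariance of two equal-length sequences, ignoring pairs with a NaN."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    if min_periods is not None and n < min_periods:
        return math.nan  # too few valid observations
    if n - ddof <= 0:
        return math.nan  # divisor N - ddof would be zero or negative
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - ddof)


s1 = [0.90010907, 0.13484424, 0.62036035]
s2 = [0.12528585, 0.26962463, 0.51111198]
result = covariance(s1, s2)  # same inputs as the pandas example above
```

With the default ddof=1 this reproduces the value in the pandas example up to floating-point rounding.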
dropna(**kwargs)[source]

Return a new Series with missing values removed.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
  • inplace (bool, default False) – If True, do operation inplace and return None.
  • how (str, optional) – Not in use. Kept for compatibility.
  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    New in version 2.0.0.

Returns:

DeferredSeries with NA entries dropped from it or None if inplace=True.

Return type:

DeferredSeries or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.isna()
Indicate missing values.
DeferredSeries.notna()
Indicate existing (non-missing) values.
DeferredSeries.fillna()
Replace missing values.
DeferredDataFrame.dropna()
Drop rows or columns which contain NA values.
Index.dropna()
Drop missing indices.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> ser = pd.Series([1., 2., np.nan])
>>> ser
0    1.0
1    2.0
2    NaN
dtype: float64

Drop NA values from a Series.

>>> ser.dropna()
0    1.0
1    2.0
dtype: float64

Empty strings are not considered NA values. ``None`` is considered an
NA value.

>>> ser = pd.Series([np.NaN, 2, pd.NaT, '', None, 'I stay'])
>>> ser
0       NaN
1         2
2       NaT
3
4      None
5    I stay
dtype: object
>>> ser.dropna()
1         2
3
5    I stay
dtype: object
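
NOTE: The sketch below is not part of the pandas or Beam documentation. It models a Series as a plain list whose labels are the positions 0, 1, 2, and shows the effect of ignore_index on the labels that survive. The names is_na and drop_na are hypothetical, and only None and float NaN are treated as missing here (pd.NaT is left out to keep the sketch dependency-free).

```python
import math


def is_na(value):
    """True for None and float NaN; empty strings and inf are not missing."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


def drop_na(values, ignore_index=False):
    """Return (label, value) pairs for the non-missing entries."""
    kept = [(label, v) for label, v in enumerate(values) if not is_na(v)]
    if ignore_index:
        # Relabel the surviving entries 0, 1, ..., n - 1.
        kept = [(new_label, v) for new_label, (_, v) in enumerate(kept)]
    return kept


data = [float('nan'), 2, '', None, 'I stay']
original_labels = drop_na(data)
relabeled = drop_na(data, ignore_index=True)
```

By default the surviving entries keep their original labels, so gaps appear where values were dropped.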
set_axis(labels, **kwargs)[source]

Assign desired index to given axis.

Indexes for row labels can be changed by assigning a list-like or Index.

Parameters:
  • labels (list-like, Index) – The values for the new index.
  • axis ({0 or 'index'}, default 0) – The axis to update. The value 0 identifies the rows. For DeferredSeries this parameter is unused and defaults to 0.
  • copy (bool, default True) –

    Whether to make a copy of the underlying data.

    New in version 1.5.0.

Returns:

An object of type DeferredSeries.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.rename_axis()
Alter the name of the index.

Examples

>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64

>>> s.set_axis(['a', 'b', 'c'], axis=0)
a    1
b    2
c    3
dtype: int64
isnull(**kwargs)

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:Mask of bool values for each element in DeferredSeries that indicates whether an element is an NA value.
Return type:DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.isnull()
Alias of isna.
DeferredSeries.notna()
Boolean inverse of isna.
DeferredSeries.dropna()
Omit axes labels with missing values.
isna()
Top-level isna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()
0    False
1    False
2     True
dtype: bool
isna(**kwargs)

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:Mask of bool values for each element in DeferredSeries that indicates whether an element is an NA value.
Return type:DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.isnull()
Alias of isna.
DeferredSeries.notna()
Boolean inverse of isna.
DeferredSeries.dropna()
Omit axes labels with missing values.
isna()
Top-level isna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()
0    False
1    False
2     True
dtype: bool
notnull(**kwargs)

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:Mask of bool values for each element in DeferredSeries that indicates whether an element is not an NA value.
Return type:DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.notnull()
Alias of notna.
DeferredSeries.isna()
Boolean inverse of notna.
DeferredSeries.dropna()
Omit axes labels with missing values.
notna()
Top-level notna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()
0     True
1     True
2    False
dtype: bool
notna(**kwargs)

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:Mask of bool values for each element in DeferredSeries that indicates whether an element is not an NA value.
Return type:DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.notnull()
Alias of notna.
DeferredSeries.isna()
Boolean inverse of notna.
DeferredSeries.dropna()
Omit axes labels with missing values.
notna()
Top-level notna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()
0     True
1     True
2    False
dtype: bool
items(**kwargs)

pandas.Series.items() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

iteritems(**kwargs)

pandas.Series.iteritems() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

tolist(**kwargs)

pandas.Series.tolist() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_numpy(**kwargs)

pandas.Series.to_numpy() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_string(**kwargs)

pandas.Series.to_string() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

duplicated(keep)[source]

Indicate duplicate Series values.

Duplicated values are indicated as True values in the resulting Series. Either all duplicates, all except the first or all except the last occurrence of duplicates can be indicated.

Parameters:keep ({'first', 'last', False}, default 'first') –

Method to handle dropping duplicates:

  • ’first’ : Mark duplicates as True except for the first occurrence.
  • ’last’ : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
Returns:DeferredSeries indicating whether each value has occurred in the preceding values.
Return type:DeferredSeries[bool]

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.

See also

Index.duplicated()
Equivalent method on pandas.Index.
DeferredDataFrame.duplicated()
Equivalent method on DeferredDataFrame.
DeferredSeries.drop_duplicates()
Remove duplicate values from DeferredSeries.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

By default, for each set of duplicated values, the first occurrence is
set to False and all others to True:

>>> animals = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> animals.duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool

which is equivalent to

>>> animals.duplicated(keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool

By using 'last', the last occurrence of each set of duplicated values
is set to False and all others to True:

>>> animals.duplicated(keep='last')
0     True
1    False
2     True
3    False
4    False
dtype: bool

By setting keep to ``False``, all duplicates are True:

>>> animals.duplicated(keep=False)
0     True
1    False
2     True
3    False
4     True
dtype: bool
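
NOTE: The sketch below is not part of the pandas or Beam documentation. It illustrates why keep=False does not depend on element order: the flag for each value depends only on how often that value occurs. The name duplicated_keep_false is hypothetical, and this is a plain-Python model of the semantics, not Beam's implementation.

```python
from collections import Counter


def duplicated_keep_false(values):
    """Mark every member of a duplicated group as True, regardless of order."""
    counts = Counter(values)
    return [counts[v] > 1 for v in values]


animals = ['lama', 'cow', 'lama', 'beetle', 'lama']
mask = duplicated_keep_false(animals)

# Reordering the input changes the order of the flags but not which
# value gets which flag.
reversed_mask = duplicated_keep_false(animals[::-1])
```

By contrast, 'first' and 'last' need to know where each element sits relative to its duplicates, which is what makes them order-sensitive.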
drop_duplicates(keep)[source]

Return Series with duplicate values removed.

Parameters:
  • keep ({‘first’, ‘last’, False}, default ‘first’) –

    Method to handle dropping duplicates:

    • ’first’ : Drop duplicates except for the first occurrence.
    • ’last’ : Drop duplicates except for the last occurrence.
    • False : Drop all duplicates.
  • inplace (bool, default False) – If True, performs operation inplace and returns None.
  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    New in version 2.0.0.

Returns:

DeferredSeries with duplicates dropped or None if inplace=True.

Return type:

DeferredSeries or None

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.

See also

Index.drop_duplicates()
Equivalent method on Index.
DeferredDataFrame.drop_duplicates()
Equivalent method on DeferredDataFrame.
DeferredSeries.duplicated()
Related method on DeferredSeries, indicating duplicate DeferredSeries values.
DeferredSeries.unique()
Return unique values as an array.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Generate a Series with duplicated entries.

>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'],
...               name='animal')
>>> s
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object

With the 'keep' parameter, the selection behaviour of duplicated values
can be changed. The value 'first' keeps the first occurrence for each
set of duplicated entries. The default value of keep is 'first'.

>>> s.drop_duplicates()
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object

The value 'last' for parameter 'keep' keeps the last occurrence for
each set of duplicated entries.

>>> s.drop_duplicates(keep='last')
1       cow
3    beetle
4      lama
5     hippo
Name: animal, dtype: object

The value ``False`` for parameter 'keep' discards all sets of
duplicated entries.

>>> s.drop_duplicates(keep=False)
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
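
NOTE: The sketch below is not part of the pandas or Beam documentation. It models the two supported options in plain Python on (label, value) pairs: keep=False discards every value that occurs more than once, while keep="any" keeps exactly one entry per distinct value without promising which one. Both function names are hypothetical.

```python
from collections import Counter


def drop_duplicates_keep_false(items):
    """Keep only the (label, value) pairs whose value occurs exactly once."""
    counts = Counter(value for _, value in items)
    return [(label, value) for label, value in items if counts[value] == 1]


def drop_duplicates_keep_any(items):
    """Keep one (label, value) pair per distinct value."""
    chosen = {}
    for label, value in items:
        # The last label seen wins in this sketch, but callers must not
        # rely on which duplicate survives.
        chosen[value] = (label, value)
    return list(chosen.values())


s = ['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo']
items = list(enumerate(s))
unique_only = drop_duplicates_keep_false(items)
one_of_each = drop_duplicates_keep_any(items)
```

With keep="any" the set of surviving values is well defined, but the surviving label for 'lama' could be 0, 2, or 4.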
sample(**kwargs)[source]

Return a random sample of items from an axis of object.

You can use random_state for reproducibility.

Parameters:
  • n (int, optional) – Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
  • frac (float, optional) – Fraction of axis items to return. Cannot be used with n.
  • replace (bool, default False) – Allow or disallow sampling of the same row more than once.
  • weights (str or ndarray-like, optional) – Default ‘None’ results in equal probability weighting. If passed a DeferredSeries, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DeferredDataFrame, will accept the name of a column when axis = 0. Unless weights are a DeferredSeries, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. Infinite values not allowed.
  • random_state (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) –

    If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.

    Changed in version 1.1.0: array-like and BitGenerator object now passed to np.random.RandomState() as seed

    Changed in version 1.4.0: np.random.Generator objects now accepted

  • axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – Axis to sample. Accepts axis number or name. Default is stat axis for given data type. For DeferredSeries this parameter is unused and defaults to None.
  • ignore_index (bool, default False) –

    If True, the resulting index will be labeled 0, 1, …, n - 1.

    New in version 1.3.0.

Returns:

A new object of same type as caller containing n items randomly sampled from the caller object.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

Only n and/or weights may be specified. frac, random_state, and replace=True are not yet supported. See Issue 21010.

Note that pandas will raise an error if n is larger than the length of the dataset, while the Beam DataFrame API will simply return the full dataset in that case.

See also

DeferredDataFrameGroupBy.sample()
Generates random samples from each group of a DeferredDataFrame object.
DeferredSeriesGroupBy.sample()
Generates random samples from each group of a DeferredSeries object.
numpy.random.choice()
Generates a random sample from a given 1-D numpy array.

Notes

If frac > 1, replacement should be set to True.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

Extract 3 random elements from the ``Series`` ``df['num_legs']``:
Note that we use `random_state` to ensure the reproducibility of
the examples.

>>> df['num_legs'].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64

A random 50% sample of the ``DataFrame`` with replacement:

>>> df.sample(frac=0.5, replace=True, random_state=1)
      num_legs  num_wings  num_specimen_seen
dog          4          0                  2
fish         0          0                  8

An upsampled ``DataFrame`` with replacement:
Note that the `replace` parameter has to be `True` when `frac` > 1.

>>> df.sample(frac=2, replace=True, random_state=1)
        num_legs  num_wings  num_specimen_seen
dog            4          0                  2
fish           0          0                  8
falcon         2          2                 10
falcon         2          2                 10
fish           0          0                  8
dog            4          0                  2
fish           0          0                  8
dog            4          0                  2

Using a DataFrame column as weights. Rows with larger value in the
`num_specimen_seen` column are more likely to be sampled.

>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8
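
NOTE: The sketch below is not part of the pandas or Beam documentation. It models, in plain Python, the difference noted above for n larger than the dataset: instead of raising an error, the whole dataset is returned. The name sample_at_most is hypothetical, and Beam's actual sampling strategy is not shown.

```python
import random


def sample_at_most(values, n):
    """Sample n items without replacement, or everything if n is too large."""
    # random.sample raises ValueError when asked for more items than exist,
    # so the request is capped at the population size first.
    k = min(n, len(values))
    return random.sample(values, k)


population = ['falcon', 'dog', 'spider', 'fish']
three = sample_at_most(population, 3)        # three distinct animals
everything = sample_at_most(population, 10)  # all four, in random order
```

Because no seed is fixed, the order and choice of items vary between runs; only the number of items returned is predictable.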
aggregate(func, axis, *args, **kwargs)[source]

Aggregate using one or more operations over the specified axis.

Parameters:
  • func (function, str, list or dict) –

    Function to use for aggregating the data. If a function, must either work when passed a DeferredSeries or when passed to DeferredSeries.apply.

    Accepted combinations are:

    • function
    • string function name
    • list of functions and/or function names, e.g. [np.sum, 'mean']
    • dict of axis labels -> functions, function names or list of such.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
  • *args – Positional arguments to pass to func.
  • **kwargs – Keyword arguments to pass to func.
Returns:

The return can be:

  • scalar : when DeferredSeries.agg is called with single function
  • DeferredSeries : when DeferredDataFrame.agg is called with a single function
  • DeferredDataFrame : when DeferredDataFrame.agg is called with several functions

Return scalar, DeferredSeries or DeferredDataFrame.

Return type:

scalar, DeferredSeries or DeferredDataFrame

Differences from pandas

Some aggregation methods cannot be parallelized, and computing them will require collecting all data on a single machine.

See also

DeferredSeries.apply()
Invoke function on a DeferredSeries.
DeferredSeries.transform()
Transform function producing a DeferredSeries with like indexes.

Notes

agg is an alias for aggregate. Use the alias.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

A passed user-defined-function will be passed a DeferredSeries for evaluation.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.agg('min')
1

>>> s.agg(['min', 'max'])
min   1
max   4
dtype: int64
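
NOTE: The sketch below is not part of the pandas or Beam documentation. It illustrates the difference between an aggregation that can be combined from per-partition partial results and one that needs all the data in one place. The partitions and function names are hypothetical, and median is used only as a typical example of an aggregation that is hard to decompose; which methods Beam actually treats this way is not stated here.

```python
from statistics import median

# Hypothetical split of the Series [1, 2, 3, 4] across two workers.
partitions = [[1, 2], [3, 4]]


def parallel_min(parts):
    """Each worker reports its own minimum; the results combine exactly."""
    return min(min(part) for part in parts)


def parallel_mean(parts):
    """Each worker reports (sum, count); the pairs combine exactly."""
    partials = [(sum(part), len(part)) for part in parts]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def collected_median(parts):
    """Per-partition medians do not combine exactly, so gather everything."""
    everything = [x for part in parts for x in part]
    return median(everything)


lowest = parallel_min(partitions)
average = parallel_mean(partitions)
middle = collected_median(partitions)
```

The first two functions only move a small summary per partition, while the third moves every element to a single place before computing.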
agg(func, axis, *args, **kwargs)

Aggregate using one or more operations over the specified axis.

Parameters:
  • func (function, str, list or dict) –

    Function to use for aggregating the data. If a function, must either work when passed a DeferredSeries or when passed to DeferredSeries.apply.

    Accepted combinations are:

    • function
    • string function name
    • list of functions and/or function names, e.g. [np.sum, 'mean']
    • dict of axis labels -> functions, function names or list of such.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
  • *args – Positional arguments to pass to func.
  • **kwargs – Keyword arguments to pass to func.
Returns:

The return can be:

  • scalar : when DeferredSeries.agg is called with single function
  • DeferredSeries : when DeferredDataFrame.agg is called with a single function
  • DeferredDataFrame : when DeferredDataFrame.agg is called with several functions

Return scalar, DeferredSeries or DeferredDataFrame.

Return type:

scalar, DeferredSeries or DeferredDataFrame

Differences from pandas

Some aggregation methods cannot be parallelized, and computing them will require collecting all data on a single machine.

See also

DeferredSeries.apply()
Invoke function on a DeferredSeries.
DeferredSeries.transform()
Transform function producing a DeferredSeries with like indexes.

Notes

agg is an alias for aggregate. Use the alias.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

A passed user-defined-function will be passed a DeferredSeries for evaluation.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.agg('min')
1

>>> s.agg(['min', 'max'])
min   1
max   4
dtype: int64
axes

Return a list of the row axis labels.

Differences from pandas

This operation has no known divergences from the pandas API.

clip(**kwargs)
all(*args, **kwargs)

Return whether all elements are True, potentially over an axis.

Returns True unless there is at least one element within a series or along a DataFrame axis that is False or equivalent (e.g. zero or empty).

Parameters:
  • axis ({0 or 'index', 1 or 'columns', None}, default 0) –

    Indicate which axis or axes should be reduced. For DeferredSeries this parameter is unused and defaults to 0.

    • 0 / ‘index’ : reduce the index, return a DeferredSeries whose index is the original column labels.
    • 1 / ‘columns’ : reduce the columns, return a DeferredSeries whose index is the original index.
    • None : reduce all axes, return a scalar.
  • bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for DeferredSeries.
  • skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
  • **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns:

If level is specified, then, DeferredSeries is returned; otherwise, scalar is returned.

Return type:

scalar or DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.all()
Return True if all elements are True.
DeferredDataFrame.any()
Return True if one (or more) elements are True.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

>>> pd.Series([True, True]).all()
True
>>> pd.Series([True, False]).all()
False
>>> pd.Series([], dtype="float64").all()
True
>>> pd.Series([np.nan]).all()
True
>>> pd.Series([np.nan]).all(skipna=False)
True

**DataFrames**

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df
   col1   col2
0  True   True
1  True  False

Default behaviour checks if values in each column all return True.

>>> df.all()
col1     True
col2    False
dtype: bool

Specify ``axis='columns'`` to check if values in each row all return True.

>>> df.all(axis='columns')
0     True
1    False
dtype: bool

Or ``axis=None`` for whether every value is True.

>>> df.all(axis=None)
False
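
NOTE: The sketch below is not part of the pandas or Beam documentation. It models the skipna rules described above for a Series in plain Python: with skipna=True missing values are dropped first, and with skipna=False they count as True. The names series_all and _is_na are hypothetical.

```python
import math


def _is_na(value):
    """True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def series_all(values, skipna=True):
    """Return True unless some element is False or equivalent."""
    if skipna:
        values = [v for v in values if not _is_na(v)]
    # Any NA still present is treated as True; an empty input yields True.
    return all(True if _is_na(v) else bool(v) for v in values)


results = [
    series_all([True, True]),
    series_all([True, False]),
    series_all([]),
    series_all([float('nan')]),
    series_all([float('nan')], skipna=False),
]
```

The five calls mirror the Series examples above and give the same answers.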
any(*args, **kwargs)

Return whether any element is True, potentially over an axis.

Returns False unless there is at least one element within a series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).

Parameters:
  • axis ({0 or 'index', 1 or 'columns', None}, default 0) –

    Indicate which axis or axes should be reduced. For DeferredSeries this parameter is unused and defaults to 0.

    • 0 / ‘index’ : reduce the index, return a DeferredSeries whose index is the original column labels.
    • 1 / ‘columns’ : reduce the columns, return a DeferredSeries whose index is the original index.
    • None : reduce all axes, return a scalar.
  • bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for DeferredSeries.
  • skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
  • **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns:

If level is specified, then, DeferredSeries is returned; otherwise, scalar is returned.

Return type:

scalar or DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.any()
Numpy version of this method.
DeferredSeries.any()
Return whether any element is True.
DeferredSeries.all()
Return whether all elements are True.
DeferredDataFrame.any()
Return whether any element is True over requested axis.
DeferredDataFrame.all()
Return whether all elements are True over requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

For Series input, the output is a scalar indicating whether any element
is True.

>>> pd.Series([False, False]).any()
False
>>> pd.Series([True, False]).any()
True
>>> pd.Series([], dtype="float64").any()
False
>>> pd.Series([np.nan]).any()
False
>>> pd.Series([np.nan]).any(skipna=False)
True

**DataFrame**

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
   A  B  C
0  1  0  0
1  2  2  0

>>> df.any()
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})
>>> df
       A  B
0   True  1
1  False  2

>>> df.any(axis='columns')
0    True
1    True
dtype: bool

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})
>>> df
       A  B
0   True  1
1  False  0

>>> df.any(axis='columns')
0     True
1    False
dtype: bool

Aggregating over the entire DataFrame with ``axis=None``.

>>> df.any(axis=None)
True

`any` for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()
Series([], dtype: bool)
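
The skipna rule described for this operation can be modelled in plain Python. This is only a sketch of the documented semantics for a flat list of values, not the pandas or Beam implementation; the names any_with_skipna and is_na are hypothetical, and treating None as NA is an assumption of the sketch.

```python
import math


def is_na(value):
    """Treat None and float NaN as NA (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def any_with_skipna(values, skipna=True):
    """Model of the documented any() semantics for a flat list of values."""
    if skipna:
        # NA values are dropped, so an all-NA input behaves like an empty one.
        values = [v for v in values if not is_na(v)]
    # With skipna=False, NA counts as True because it is not equal to zero.
    return any(is_na(v) or bool(v) for v in values)


print(any_with_skipna([False, False]))                # False
print(any_with_skipna([float("nan")]))                # False
print(any_with_skipna([float("nan")], skipna=False))  # True
```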
count(*args, **kwargs)

Return number of non-NA/null observations in the Series.

Returns:Number of non-null values in the DeferredSeries.
Return type:int or DeferredSeries (if level specified)

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.count()
Count non-NA cells for each column or row.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([0.0, 1.0, np.nan])
>>> s.count()
2
describe(*args, **kwargs)

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters:
  • percentiles (list-like of numbers, optional) – The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
  • include ('all', list-like of dtypes or None (default), optional) –

    A white list of data types to include in the result. Ignored for DeferredSeries. Here are the options:

    • 'all' : All columns of the input will be included in the output.
    • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types, submit numpy.number. To limit it instead to object columns, submit the object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
    • None (default) : The result will include all numeric columns.
  • exclude (list-like of dtypes or None (default), optional) –

    A black list of data types to omit from the result. Ignored for DeferredSeries. Here are the options:

    • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types, submit numpy.number. To exclude object columns, submit the object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
    • None (default) : The result will exclude nothing.
Returns:

Summary statistics of the DeferredSeries or DeferredDataFrame provided.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

describe cannot currently be parallelized. It will require collecting all data on a single node.

See also

DeferredDataFrame.count()
Count number of non-NA/null observations.
DeferredDataFrame.max()
Maximum of the values in the object.
DeferredDataFrame.min()
Minimum of the values in the object.
DeferredDataFrame.mean()
Mean of the values.
DeferredDataFrame.std()
Standard deviation of the observations.
DeferredDataFrame.select_dtypes()
Subset of a DeferredDataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DeferredDataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DeferredDataFrame are analyzed for the output. The parameters are ignored when analyzing a DeferredSeries.
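
The count, unique, top and freq entries described for object data can be sketched with the standard library. This is an illustration of what those four statistics mean, not how pandas or Beam computes them; describe_objects is a hypothetical helper that assumes a non-empty list with no NA values.

```python
from collections import Counter


def describe_objects(values):
    """Return count, unique, top and freq for a non-empty list of labels."""
    counts = Counter(values)
    # most_common(1) yields one (value, frequency) pair; when several values
    # tie for the highest count, only one of them is reported as "top".
    top, freq = counts.most_common(1)[0]
    return {"count": len(values), "unique": len(counts), "top": top, "freq": freq}


print(describe_objects(["a", "a", "b", "c"]))
# {'count': 4, 'unique': 3, 'top': 'a', 'freq': 2}
```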

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields
are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')  
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as
an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[object])  
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[object])  
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
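
For numeric data, the statistics in the result can be reproduced by hand. The sketch below uses only the Python standard library and assumes linear interpolation between neighbouring values for the percentiles; describe_numeric and percentile are hypothetical names, and the input is assumed to hold at least two numbers and no NaN values.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile for q between 0 and 1."""
    position = q * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    low_value = sorted_values[lower]
    return low_value + fraction * (sorted_values[upper] - low_value)


def describe_numeric(values, percentiles=(0.25, 0.5, 0.75)):
    """Summary statistics for a list of at least two non-NaN numbers."""
    data = sorted(float(v) for v in values)
    summary = {
        "count": float(len(data)),
        "mean": statistics.fmean(data),
        "std": statistics.stdev(data),  # sample standard deviation (N - 1)
        "min": data[0],
    }
    for q in percentiles:
        summary[f"{q:.0%}"] = percentile(data, q)  # keys such as '25%'
    summary["max"] = data[-1]
    return summary


print(describe_numeric([1, 2, 3]))
```

Called on [1, 2, 3], the sketch yields the same eight entries as the numeric Series example in this section.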
min(*args, **kwargs)

Return the minimum of the values over the requested axis.

If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
  • **kwargs – Additional keyword arguments to be passed to the function.
Returns:

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum()
Return the sum.
DeferredSeries.min()
Return the minimum.
DeferredSeries.max()
Return the maximum.
DeferredSeries.idxmin()
Return the index of the minimum.
DeferredSeries.idxmax()
Return the index of the maximum.
DeferredDataFrame.sum()
Return the sum over the requested axis.
DeferredDataFrame.min()
Return the minimum over the requested axis.
DeferredDataFrame.max()
Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.min()
0
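
The effect of skipna can be modelled in plain Python. This is a sketch of the semantics for a flat list of floats with NaN as the NA marker, not the pandas or Beam implementation; min_with_skipna is a hypothetical name.

```python
import math


def min_with_skipna(values, skipna=True):
    """Model of min() over a flat list of numbers, with NaN as the NA marker."""
    if skipna:
        values = [v for v in values if not math.isnan(v)]
    elif any(math.isnan(v) for v in values):
        # An NA that is not skipped makes the whole result NA.
        return math.nan
    # An empty (or all-NA) input has no minimum; NaN models that here.
    return min(values) if values else math.nan


print(min_with_skipna([4, 2, 0, 8]))                   # 0
print(min_with_skipna([1.0, math.nan]))                # 1.0
print(min_with_skipna([1.0, math.nan], skipna=False))  # nan
```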
max(*args, **kwargs)

Return the maximum of the values over the requested axis.

If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
  • **kwargs – Additional keyword arguments to be passed to the function.
Returns:

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum()
Return the sum.
DeferredSeries.min()
Return the minimum.
DeferredSeries.max()
Return the maximum.
DeferredSeries.idxmin()
Return the index of the minimum.
DeferredSeries.idxmax()
Return the index of the maximum.
DeferredDataFrame.sum()
Return the sum over the requested axis.
DeferredDataFrame.min()
Return the minimum over the requested axis.
DeferredDataFrame.max()
Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.max()
8
prod(*args, **kwargs)

Return the product of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
  • **kwargs – Additional keyword arguments to be passed to the function.
Returns:

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum()
Return the sum.
DeferredSeries.min()
Return the minimum.
DeferredSeries.max()
Return the maximum.
DeferredSeries.idxmin()
Return the index of the minimum.
DeferredSeries.idxmax()
Return the index of the maximum.
DeferredDataFrame.sum()
Return the sum over the requested axis.
DeferredDataFrame.min()
Return the minimum over the requested axis.
DeferredDataFrame.max()
Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
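
The interaction between skipna and min_count shown above can be sketched in plain Python. This models the documented behaviour for a flat list of floats with NaN as the NA marker; prod_with_min_count is a hypothetical name and the code is not taken from pandas or Beam.

```python
import math


def prod_with_min_count(values, skipna=True, min_count=0):
    """Model of prod() over a flat list of floats, with NaN as the NA marker."""
    valid = [v for v in values if not math.isnan(v)]
    if not skipna and len(valid) != len(values):
        return math.nan  # an NA that is not skipped propagates
    if len(valid) < min_count:
        return math.nan  # fewer valid values than min_count requires
    # math.prod returns the start value for an empty list, hence 1.0.
    return math.prod(valid, start=1.0)


print(prod_with_min_count([]))                         # 1.0
print(prod_with_min_count([], min_count=1))            # nan
print(prod_with_min_count([math.nan]))                 # 1.0
print(prod_with_min_count([math.nan], min_count=1))    # nan
```

Because NA values are removed before the min_count check, an all-NA input and an empty input take the same path, which is why the two cases agree.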
product(*args, **kwargs)

Return the product of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
  • **kwargs – Additional keyword arguments to be passed to the function.
Returns:

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum()
Return the sum.
DeferredSeries.min()
Return the minimum.
DeferredSeries.max()
Return the maximum.
DeferredSeries.idxmin()
Return the index of the minimum.
DeferredSeries.idxmax()
Return the index of the maximum.
DeferredDataFrame.sum()
Return the sum over the requested axis.
DeferredDataFrame.min()
Return the minimum over the requested axis.
DeferredDataFrame.max()
Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
sum(*args, **kwargs)

Return the sum of the values over the requested axis.

This is equivalent to the method numpy.sum.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
  • **kwargs – Additional keyword arguments to be passed to the function.
Returns:

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum()
Return the sum.
DeferredSeries.min()
Return the minimum.
DeferredSeries.max()
Return the maximum.
DeferredSeries.idxmin()
Return the index of the minimum.
DeferredSeries.idxmax()
Return the index of the maximum.
DeferredDataFrame.sum()
Return the sum over the requested axis.
DeferredDataFrame.min()
Return the minimum over the requested axis.
DeferredDataFrame.max()
Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.sum()
14

By default, the sum of an empty or all-NA Series is ``0``.

>>> pd.Series([], dtype="float64").sum()  # min_count=0 is the default
0.0

This can be controlled with the ``min_count`` parameter. For example, if
you'd like the sum of an empty series to be NaN, pass ``min_count=1``.

>>> pd.Series([], dtype="float64").sum(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
0.0

>>> pd.Series([np.nan]).sum(min_count=1)
nan
median(*args, **kwargs)

Return the median of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
  • **kwargs – Additional keyword arguments to be passed to the function.
Returns:

Return type:

scalar

Differences from pandas

median cannot currently be parallelized. It will require collecting all data on a single node.
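
One way to see why a median is harder to distribute than a sum: partial sums from separate partitions can be combined, but the medians of separate partitions cannot. The sketch below demonstrates this with the standard library on a made-up two-way split; it illustrates the general problem only and says nothing about how Beam implements either operation.

```python
import statistics

partitions = [[1, 2, 3], [4, 100]]  # hypothetical split of one dataset
all_values = [v for part in partitions for v in part]

# Sums combine: the sum of the per-partition sums equals the global sum.
assert sum(sum(part) for part in partitions) == sum(all_values)

# Medians do not: combining per-partition medians gives a different answer.
median_of_medians = statistics.median(statistics.median(p) for p in partitions)
global_median = statistics.median(all_values)
print(median_of_medians, global_median)  # 27.0 3
```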

sem(*args, **kwargs)

Return unbiased standard error of the mean over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) – For DeferredSeries this parameter is unused and defaults to 0.
  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.
Returns:

Return type:

scalar or DeferredSeries (if level specified)

Differences from pandas

sem cannot currently be parallelized. It will require collecting all data on a single node.
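
The standard error of the mean is the standard deviation (with divisor N - ddof) divided by the square root of N. A minimal plain-Python sketch of that formula follows; the function is a hypothetical stand-in for illustration, assumes a list of non-NA numbers with more than ddof elements, and is not the pandas or Beam implementation.

```python
import math
import statistics


def sem(values, ddof=1):
    """Standard error of the mean for a list of non-NA numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # Variance with divisor N - ddof, as described for the ddof parameter.
    variance = sum((v - mean) ** 2 for v in values) / (n - ddof)
    return math.sqrt(variance) / math.sqrt(n)


print(sem([1, 2, 3]))          # 1 / sqrt(3), roughly 0.577
print(sem([1, 2, 3], ddof=0))  # sqrt(2) / 3, roughly 0.471
```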

argmax(**kwargs)

pandas.Series.argmax() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

argmin(**kwargs)

pandas.Series.argmin() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cummax(**kwargs)

pandas.Series.cummax() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cummin(**kwargs)

pandas.Series.cummin() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cumprod(**kwargs)

pandas.Series.cumprod() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cumsum(**kwargs)

pandas.Series.cumsum() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.
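
To illustrate what "sensitive to the order of the data" means for operations such as cumsum, the plain-Python sketch below compares an ordinary sum with a cumulative sum over the same elements in two different orders. It uses itertools.accumulate as a stand-in for a cumulative sum and is not Beam code.

```python
from itertools import accumulate

values = [3, 1, 2]
reordered = [1, 2, 3]  # same elements, different arrival order

# An order-insensitive reduction gives the same answer either way.
assert sum(values) == sum(reordered)

# A cumulative sum does not: each output depends on everything before it.
print(list(accumulate(values)))     # [3, 4, 6]
print(list(accumulate(reordered)))  # [1, 3, 6]
```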

diff(**kwargs)

pandas.Series.diff() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

interpolate(**kwargs)

pandas.Series.interpolate() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

searchsorted(**kwargs)

pandas.Series.searchsorted() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

shift(**kwargs)

pandas.Series.shift() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

pct_change(**kwargs)

pandas.Series.pct_change() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

is_monotonic(**kwargs)

pandas.Series.is_monotonic() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

is_monotonic_increasing(**kwargs)

pandas.Series.is_monotonic_increasing() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

is_monotonic_decreasing(**kwargs)

pandas.Series.is_monotonic_decreasing() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

asof(**kwargs)

pandas.Series.asof() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

first_valid_index(**kwargs)

pandas.Series.first_valid_index() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

last_valid_index(**kwargs)

pandas.Series.last_valid_index() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

autocorr(**kwargs)

pandas.Series.autocorr() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

iat

pandas.Series.iat() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

head(**kwargs)

pandas.Series.head() is not yet supported in the Beam DataFrame API because it is order-sensitive.

If you want to peek at a large dataset consider using interactive Beam’s ib.collect with n specified, or sample(). If you want to find the N largest elements, consider using DeferredDataFrame.nlargest().

tail(**kwargs)

pandas.Series.tail() is not yet supported in the Beam DataFrame API because it is order-sensitive.

If you want to peek at a large dataset consider using interactive Beam’s ib.collect with n specified, or sample(). If you want to find the N largest elements, consider using DeferredDataFrame.nlargest().

filter(**kwargs)

Subset the dataframe rows or columns according to the specified index labels.

Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.

Parameters:
  • items (list-like) – Keep labels from axis which are in items.
  • like (str) – Keep labels from axis for which “like in label == True”.
  • regex (str (regular expression)) – Keep labels from axis for which re.search(regex, label) == True.
  • axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, ‘columns’ for DeferredDataFrame. For DeferredSeries this parameter is unused and defaults to None.
Returns:

Return type:

same type as input object

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.loc()
Access a group of rows and columns by label(s) or a boolean array.

Notes

The items, like, and regex parameters are enforced to be mutually exclusive.

axis defaults to the info axis that is used when indexing with [].

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6

>>> # select columns by name
>>> df.filter(items=['one', 'three'])
         one  three
mouse     1      3
rabbit    4      6

>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
         one  three
mouse     1      3
rabbit    4      6

>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
         one  two  three
rabbit    4    5      6
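
The three selection modes operate on labels only, never on the data. The plain-Python sketch below models them on a list of labels; filter_labels is a hypothetical helper, the choice of TypeError for conflicting arguments and the ordering of the items result are assumptions of the sketch, and none of this is the pandas or Beam implementation.

```python
import re


def filter_labels(labels, items=None, like=None, regex=None):
    """Return the labels kept by exactly one of items, like or regex."""
    if sum(arg is not None for arg in (items, like, regex)) != 1:
        # The three parameters are mutually exclusive.
        raise TypeError("pass exactly one of items, like or regex")
    if items is not None:
        return [item for item in items if item in labels]
    if like is not None:
        return [label for label in labels if like in label]
    return [label for label in labels if re.search(regex, label)]


columns = ["one", "two", "three"]
print(filter_labels(columns, items=["one", "three"]))  # ['one', 'three']
print(filter_labels(columns, regex="e$"))              # ['one', 'three']
print(filter_labels(["mouse", "rabbit"], like="bbi"))  # ['rabbit']
```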
memory_usage(**kwargs)

pandas.Series.memory_usage() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

nbytes(**kwargs)

pandas.Series.nbytes() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_list(**kwargs)

pandas.Series.to_list() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

factorize(**kwargs)

pandas.Series.factorize() is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.

For more information see https://s.apache.org/dataframe-non-deferred-columns.

nlargest(keep, **kwargs)

Return the largest n elements.

Parameters:
  • n (int, default 5) – Return this many descending sorted values.
  • keep ({'first', 'last', 'all'}, default 'first') –

    When there are duplicate values that cannot all fit in a DeferredSeries of n elements:

    • first : return the first n occurrences in order of appearance.
    • last : return the last n occurrences in reverse order of appearance.
    • all : keep all occurrences. This can result in a DeferredSeries of size larger than n.
Returns:

The n largest values in the DeferredSeries, sorted in decreasing order.

Return type:

DeferredSeries

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about which duplicate element is kept.

See also

DeferredSeries.nsmallest()
Get the n smallest elements.
DeferredSeries.sort_values()
Sort DeferredSeries by values.
DeferredSeries.head()
Return the first n rows.

Notes

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the DeferredSeries object.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Malta": 434000, "Maldives": 434000,
...                         "Brunei": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)
>>> s
Italy       59000000
France      65000000
Malta         434000
Maldives      434000
Brunei        434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The `n` largest elements where ``n=5`` by default.

>>> s.nlargest()
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64

The `n` largest elements where ``n=3``. Default `keep` value is 'first'
so Malta will be kept.

>>> s.nlargest(3)
France    65000000
Italy     59000000
Malta       434000
dtype: int64

The `n` largest elements where ``n=3`` and keeping the last duplicates.
Brunei will be kept since it is the last with value 434000 based on
the index order.

>>> s.nlargest(3, keep='last')
France      65000000
Italy       59000000
Brunei        434000
dtype: int64

The `n` largest elements where ``n=3`` with all duplicates kept. Note
that the returned Series has five elements due to the three duplicates.

>>> s.nlargest(3, keep='all')
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64
nsmallest(keep, **kwargs)[source]

Return the smallest n elements.

Parameters:
  • n (int, default 5) – Return this many ascending sorted values.
  • keep ({'first', 'last', 'all'}, default 'first') –

    When there are duplicate values that cannot all fit in a DeferredSeries of n elements:

    • first : return the first n occurrences in order of appearance.
    • last : return the last n occurrences in reverse order of appearance.
    • all : keep all occurrences. This can result in a DeferredSeries of size larger than n.
Returns:

The n smallest values in the DeferredSeries, sorted in increasing order.

Return type:

DeferredSeries

Differences from pandas

Only keep="all" and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note that keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantee about which duplicate element is kept.
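
To make the order-sensitivity concrete, here is a small pandas-only sketch (not Beam code) that runs nsmallest on the same data in two different row orders. The data is a made-up subset of the example below. It suggests why position-based tie-breaking is a poor fit for a distributed dataset, whose row order is generally not defined.

```python
import pandas as pd

s = pd.Series({"Montserrat": 5200, "Nauru": 11300, "Tuvalu": 11300,
               "Anguilla": 11300, "Iceland": 337000})
reversed_s = s.iloc[::-1]  # same data, opposite row order

# keep="first" breaks ties by position, so the answer depends on row order.
first_a = set(s.nsmallest(2, keep="first").index)
first_b = set(reversed_s.nsmallest(2, keep="first").index)

# keep="all" returns every tied element, so row order does not matter.
all_a = set(s.nsmallest(2, keep="all").index)
all_b = set(reversed_s.nsmallest(2, keep="all").index)

print(first_a == first_b, all_a == all_b)
```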

See also

DeferredSeries.nlargest()
Get the n largest elements.
DeferredSeries.sort_values()
Sort DeferredSeries by values.
DeferredSeries.head()
Return the first n rows.

Notes

Faster than .sort_values().head(n) for small n relative to the size of the DeferredSeries object.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Brunei": 434000, "Malta": 434000,
...                         "Maldives": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)
>>> s
Italy       59000000
France      65000000
Brunei        434000
Malta         434000
Maldives      434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The `n` smallest elements where ``n=5`` by default.

>>> s.nsmallest()
Montserrat    5200
Nauru        11300
Tuvalu       11300
Anguilla     11300
Iceland     337000
dtype: int64

The `n` smallest elements where ``n=3``. Default `keep` value is
'first' so Nauru and Tuvalu will be kept.

>>> s.nsmallest(3)
Montserrat   5200
Nauru       11300
Tuvalu      11300
dtype: int64

The `n` smallest elements where ``n=3`` and keeping the last
duplicates. Anguilla and Tuvalu will be kept since they are the last
with value 11300 based on the index order.

>>> s.nsmallest(3, keep='last')
Montserrat   5200
Anguilla    11300
Tuvalu      11300
dtype: int64

The `n` smallest elements where ``n=3`` with all duplicates kept. Note
that the returned Series has four elements due to the three duplicates.

>>> s.nsmallest(3, keep='all')
Montserrat   5200
Nauru       11300
Tuvalu      11300
Anguilla    11300
dtype: int64
is_unique

Return a boolean indicating whether the values in the object are unique.

Returns:
Return type:bool

Differences from pandas

This operation has no known divergences from the pandas API.

plot(**kwargs)

pandas.Series.plot() is not yet supported in the Beam DataFrame API because it is a plotting tool.

For more information see https://s.apache.org/dataframe-plotting-tools.

pop(**kwargs)

pandas.Series.pop() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

rename_axis(**kwargs)

Set the name of the axis for the index or columns.

Parameters:
  • mapper (scalar, list-like, optional) – Value to set the axis name attribute.
  • index, columns (scalar, list-like, dict-like or function, optional) –

    A scalar, list-like, dict-like or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a DeferredSeries; it applies only to DeferredDataFrame objects.

    Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename. For DeferredSeries this parameter is unused and defaults to 0.
  • copy (bool, default None) – Also copy underlying data.
  • inplace (bool, default False) – Modifies the object directly, instead of creating a new DeferredSeries or DeferredDataFrame.
Returns:

The same type as the caller or None if inplace=True.

Return type:

DeferredSeries, DeferredDataFrame, or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.rename()
Alter DeferredSeries index labels or name.
DeferredDataFrame.rename()
Alter DeferredDataFrame index labels or name.
Index.rename()
Set new names on index.

Notes

DeferredDataFrame.rename_axis supports two calling conventions

  • (index=index_mapper, columns=columns_mapper, ...)
  • (mapper, axis={'index', 'columns'}, ...)

The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter copy is ignored.

The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.

We highly recommend using keyword arguments to clarify your intent.
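
The two calling conventions can be compared side by side with a short pandas-only sketch. The frame and the axis names are illustrative, and on a DeferredDataFrame the same calls would be deferred rather than evaluated immediately.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [4, 4, 2], "num_arms": [0, 0, 2]},
                  index=["dog", "cat", "monkey"])

# Convention 1: name each axis by keyword in a single call.
by_keyword = df.rename_axis(index="animal", columns="limbs")

# Convention 2: pass a mapper plus the axis it targets, one axis per call.
by_mapper = (df.rename_axis("animal", axis="index")
               .rename_axis("limbs", axis="columns"))

print(by_keyword.index.name, by_keyword.columns.name)
```

The keyword form states which axis each name belongs to, which is why it is the recommended style.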

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

>>> s = pd.Series(["dog", "cat", "monkey"])
>>> s
0       dog
1       cat
2    monkey
dtype: object
>>> s.rename_axis("animal")
animal
0    dog
1    cat
2    monkey
dtype: object

**DataFrame**

>>> df = pd.DataFrame({"num_legs": [4, 4, 2],
...                    "num_arms": [0, 0, 2]},
...                   ["dog", "cat", "monkey"])
>>> df
        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("animal")
>>> df
        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("limbs", axis="columns")
>>> df
limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

**MultiIndex**

>>> df.index = pd.MultiIndex.from_product([['mammal'],
...                                        ['dog', 'cat', 'monkey']],
...                                       names=['type', 'name'])
>>> df
limbs          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(index={'type': 'class'})
limbs          num_legs  num_arms
class  name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(columns=str.upper)
LIMBS          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2
round(**kwargs)

Round each value in a Series to the given number of decimals.

Parameters:
  • decimals (int, default 0) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.
  • *args, **kwargs –

    Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Rounded values of the DeferredSeries.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.
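
The effect of a negative decimals value is easiest to see with numbers. The following is a pandas-only sketch with made-up values; it is not specific to Beam.

```python
import pandas as pd

s = pd.Series([1234.5678, 15.0])

# decimals=-2 rounds to the nearest hundred (two places left of the point).
to_hundreds = s.round(-2)

# decimals=2 rounds to two places right of the decimal point.
to_cents = s.round(2)

print(to_hundreds.tolist())
```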

See also

numpy.around()
Round values of an np.array.
DeferredDataFrame.round()
Round values of a DeferredDataFrame.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([0.1, 1.3, 2.7])
>>> s.round()
0    0.0
1    1.0
2    3.0
dtype: float64
take(**kwargs)

pandas.Series.take() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

to_dict(**kwargs)

pandas.Series.to_dict() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_frame(**kwargs)

Convert Series to DataFrame.

Parameters:name (object, optional) – The passed name should substitute for the series name (if it has one).
Returns:DeferredDataFrame representation of DeferredSeries.
Return type:DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.
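
A minimal pandas-only sketch of the name parameter follows; the replacement column name "letters" is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], name="vals")

# Without a name argument, the Series name becomes the column label.
default_frame = s.to_frame()

# A passed name replaces the Series name as the column label.
renamed_frame = s.to_frame(name="letters")

print(list(default_frame.columns), list(renamed_frame.columns))
```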

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(["a", "b", "c"],
...               name="vals")
>>> s.to_frame()
  vals
0    a
1    b
2    c
unique(as_series=False)[source]

Return unique values of Series object.

Unique values are returned in order of appearance. The implementation is hash table-based, so the result is NOT sorted.

Returns:The unique values returned as a NumPy array. See Notes.
Return type:ndarray or ExtensionArray

Differences from pandas

unique is not supported by default because it produces a non-deferred result: an ndarray. You can use the Beam-specific argument unique(as_series=True) to get the result as a DeferredSeries.
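
The difference between an array result and a Series result can be sketched in plain pandas. Here drop_duplicates() stands in for a Series-valued unique; whether Beam's as_series=True behaves exactly like this is an assumption of the sketch, not something this page confirms.

```python
import pandas as pd

s = pd.Series([2, 1, 3, 3], name="A")

# pandas returns a NumPy array, which has no deferred counterpart in Beam.
as_array = s.unique()

# A Series-valued result holds the same values but stays in the Series world.
as_series = s.drop_duplicates()

print(type(as_array).__name__, type(as_series).__name__)
```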

See also

DeferredSeries.drop_duplicates()
Return DeferredSeries with duplicate values removed.
unique()
Top-level unique method for any 1-d array-like object.
Index.unique()
Return Index with unique values from an Index object.

Notes

Returns the unique values as a NumPy array. In case of an extension-array backed DeferredSeries, a new ExtensionArray of that type with just the unique values is returned. This includes

  • Categorical
  • Period
  • Datetime with Timezone
  • Datetime without Timezone
  • Timedelta
  • Interval
  • Sparse
  • IntegerNA

See Examples section.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> pd.Series([2, 1, 3, 3], name='A').unique()
array([2, 1, 3])

>>> pd.Series([pd.Timestamp('2016-01-01') for _ in range(3)]).unique()
<DatetimeArray>
['2016-01-01 00:00:00']
Length: 1, dtype: datetime64[ns]

>>> pd.Series([pd.Timestamp('2016-01-01', tz='US/Eastern')
...            for _ in range(3)]).unique()
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

A Categorical will return categories in the order of
appearance and with the same dtype.

>>> pd.Series(pd.Categorical(list('baabc'))).unique()
['b', 'a', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> pd.Series(pd.Categorical(list('baabc'), categories=list('abc'),
...                          ordered=True)).unique()
['b', 'a', 'c']
Categories (3, object): ['a' < 'b' < 'c']
update(other)[source]

Modify Series in place using values from passed Series.

Uses non-NA values from passed Series to make updates. Aligns on index.

Parameters:other (DeferredSeries, or object coercible into DeferredSeries) –

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, 5, 6]))
>>> s
0    4
1    5
2    6
dtype: int64

>>> s = pd.Series(['a', 'b', 'c'])
>>> s.update(pd.Series(['d', 'e'], index=[0, 2]))
>>> s
0    d
1    b
2    e
dtype: object

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, 5, 6, 7, 8]))
>>> s
0    4
1    5
2    6
dtype: int64

If ``other`` contains NaNs the corresponding values are not updated
in the original Series.

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, np.nan, 6]))
>>> s
0    4
1    2
2    6
dtype: int64

``other`` can also be a non-Series object type
that is coercible into a Series

>>> s = pd.Series([1, 2, 3])
>>> s.update([4, np.nan, 6])
>>> s
0    4
1    2
2    6
dtype: int64

>>> s = pd.Series([1, 2, 3])
>>> s.update({1: 9})
>>> s
0    1
1    9
2    3
dtype: int64
value_counts(sort=False, normalize=False, ascending=False, bins=None, dropna=True)[source]

Return a Series containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters:
  • normalize (bool, default False) – If True then the object returned will contain the relative frequencies of the unique values.
  • sort (bool, default True) – Sort by frequencies.
  • ascending (bool, default False) – Sort in ascending order.
  • bins (int, optional) – Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.
  • dropna (bool, default True) – Don’t include counts of NaN.
Returns:

Return type:

DeferredSeries

Differences from pandas

sort is False by default, and sort=True is not supported because it imposes an ordering on the dataset which likely will not be preserved.

When bins is specified this operation is not parallelizable. See [Issue 20903](https://github.com/apache/beam/issues/20903) tracking the possible addition of a distributed implementation.
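
Because sort=False is the default here, results are best treated as an unordered mapping from value to count. The pandas-only sketch below shows that view of the result and an explicit sort applied afterwards; it does not use Beam, and the data is taken from the examples in this section.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# With sort=False the row order of the result should not be relied upon.
unsorted_counts = s.value_counts(sort=False)

# Compare as a mapping, which ignores row order.
as_mapping = unsorted_counts.to_dict()

# If a ranking is needed, sort explicitly once the result is small and local.
most_common = unsorted_counts.sort_values(ascending=False).index[0]

print(as_mapping, most_common)
```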

See also

DeferredSeries.count()
Number of non-NA elements in a DeferredSeries.
DeferredDataFrame.count()
Number of non-NA elements in a DeferredDataFrame.
DeferredDataFrame.value_counts()
Equivalent method on DeferredDataFrames.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> index = pd.Index([3, 1, 2, 3, 4, np.nan])
>>> index.value_counts()
3.0    2
1.0    1
2.0    1
4.0    1
Name: count, dtype: int64

With `normalize` set to `True`, returns the relative frequency by
dividing all values by the sum of values.

>>> s = pd.Series([3, 1, 2, 3, 4, np.nan])
>>> s.value_counts(normalize=True)
3.0    0.4
1.0    0.2
2.0    0.2
4.0    0.2
Name: proportion, dtype: float64

**bins**

Bins can be useful for going from a continuous variable to a
categorical variable; instead of counting unique
occurrences of values, divide the index into the specified
number of half-open bins.

>>> s.value_counts(bins=3)
(0.996, 2.0]    2
(2.0, 3.0]      2
(3.0, 4.0]      1
Name: count, dtype: int64

**dropna**

With `dropna` set to `False` we can also see NaN index values.

>>> s.value_counts(dropna=False)
3.0    2
1.0    1
2.0    1
4.0    1
NaN    1
Name: count, dtype: int64
values

pandas.Series.values() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

view(**kwargs)

pandas.Series.view() is not yet supported in the Beam DataFrame API because it relies on memory-sharing semantics that are not compatible with the Beam model.

str

Vectorized string functions for Series and Index.

NAs stay NA unless handled otherwise by a particular method. Patterned after Python’s string methods, with some inspiration from R’s stringr package.

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(["A_Str_Series"])
>>> s
0    A_Str_Series
dtype: object

>>> s.str.split("_")
0    [A, Str, Series]
dtype: object

>>> s.str.replace("_", "")
0    AStrSeries
dtype: object
cat

Accessor object for categorical properties of the Series values.

Parameters:data (DeferredSeries or CategoricalIndex) –

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(list("abbccc")).astype("category")
>>> s
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')

>>> s.cat.rename_categories(list("cba"))
0    c
1    b
2    b
3    a
4    a
5    a
dtype: category
Categories (3, object): ['c', 'b', 'a']

>>> s.cat.reorder_categories(list("cba"))
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['c', 'b', 'a']

>>> s.cat.add_categories(["d", "e"])
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

>>> s.cat.remove_categories(["a", "c"])
0    NaN
1      b
2      b
3    NaN
4    NaN
5    NaN
dtype: category
Categories (1, object): ['b']

>>> s1 = s.cat.add_categories(["d", "e"])
>>> s1.cat.remove_unused_categories()
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> s.cat.set_categories(list("abcde"))
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

>>> s.cat.as_ordered()
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

>>> s.cat.as_unordered()
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
dt
mode(*args, **kwargs)[source]

Return the mode(s) of the Series.

The mode is the value that appears most often. There can be multiple modes.

Always returns Series even if only one value is returned.

Parameters:dropna (bool, default True) – Don’t consider counts of NaN/NaT.
Returns:Modes of the DeferredSeries in sorted order.
Return type:DeferredSeries

Differences from pandas

mode is not currently parallelizable. An approximate, parallelizable implementation of mode may be added in the future (Issue 20946).
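
One way to see why mode needs a global view of the data is to express it through counts: every value must be counted before the maximum count is known. The pandas-only sketch below illustrates that idea with made-up data. It is not how Beam computes mode, and it is not the approximate implementation mentioned above.

```python
import pandas as pd

s = pd.Series([2, 4, 2, 4, 7])

# Step 1: count every distinct value (this part combines well across partitions).
counts = s.value_counts(sort=False)

# Step 2: keep the values whose count equals the global maximum.
modes_from_counts = sorted(counts[counts == counts.max()].index)

print(modes_from_counts)
```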

apply(**kwargs)

Invoke function on values of Series.

Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.

Parameters:
  • func (function) – Python function or NumPy ufunc to apply.
  • convert_dtype (bool, default True) – Try to find better dtype for elementwise function results. If False, leave as dtype=object. Note that the dtype is always preserved for some extension array dtypes, such as Categorical.
  • args (tuple) – Positional arguments passed to func after the series value.
  • **kwargs – Additional keyword arguments passed to func.
Returns:

If func returns a DeferredSeries object the result will be a DeferredDataFrame.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.map()
For element-wise operations.
DeferredSeries.agg()
Only perform aggregating type operations.
DeferredSeries.transform()
Only perform transforming type operations.

Notes

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Create a series with typical summer temperatures for each city.

>>> s = pd.Series([20, 21, 12],
...               index=['London', 'New York', 'Helsinki'])
>>> s
London      20
New York    21
Helsinki    12
dtype: int64

Square the values by defining a function and passing it as an
argument to ``apply()``.

>>> def square(x):
...     return x ** 2
>>> s.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64

Square the values by passing an anonymous function as an
argument to ``apply()``.

>>> s.apply(lambda x: x ** 2)
London      400
New York    441
Helsinki    144
dtype: int64

Define a custom function that needs additional positional
arguments and pass these additional arguments using the
``args`` keyword.

>>> def subtract_custom_value(x, custom_value):
...     return x - custom_value

>>> s.apply(subtract_custom_value, args=(5,))
London      15
New York    16
Helsinki     7
dtype: int64

Define a custom function that takes keyword arguments
and pass these arguments to ``apply``.

>>> def add_custom_values(x, **kwargs):
...     for month in kwargs:
...         x += kwargs[month]
...     return x

>>> s.apply(add_custom_values, june=30, july=20, august=25)
London      95
New York    96
Helsinki    87
dtype: int64

Use a function from the NumPy library.

>>> s.apply(np.log)
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64
map(**kwargs)

Map values of Series according to an input mapping or function.

Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.

Parameters:
  • arg (function, collections.abc.Mapping subclass or DeferredSeries) – Mapping correspondence.
  • na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.
Returns:

Same index as caller.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.apply()
For applying more complex functions on a DeferredSeries.
DeferredDataFrame.apply()
Apply a function row-/column-wise.
DeferredDataFrame.applymap()
Apply a function elementwise on a whole DeferredDataFrame.

Notes

When arg is a dictionary, values in DeferredSeries that are not in the dictionary (as keys) are converted to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for default values), then this default is used rather than NaN.
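
The __missing__ behavior described above can be sketched in plain pandas with collections.defaultdict, which is a dict subclass that defines __missing__. The fallback string "unknown" is an arbitrary choice for illustration.

```python
from collections import defaultdict

import pandas as pd

s = pd.Series(["cat", "dog", "rabbit"])
names = {"cat": "kitten", "dog": "puppy"}

# Plain dict: keys that are absent map to NaN.
plain = s.map(names)

# dict subclass with __missing__: absent keys use the default instead.
with_default = s.map(defaultdict(lambda: "unknown", names))

print(plain.isna().tolist(), with_default.tolist())
```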

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0      cat
1      dog
2      NaN
3   rabbit
dtype: object

``map`` accepts a ``dict`` or a ``Series``. Values that are not found
in the ``dict`` are converted to ``NaN``, unless the dict has a default
value (e.g. ``defaultdict``):

>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0   kitten
1    puppy
2      NaN
3      NaN
dtype: object

It also accepts a function:

>>> s.map('I am a {}'.format)
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

To avoid applying the function to missing values (and keep them as
``NaN``) ``na_action='ignore'`` can be used:

>>> s.map('I am a {}'.format, na_action='ignore')
0     I am a cat
1     I am a dog
2            NaN
3  I am a rabbit
dtype: object
repeat(repeats, axis)[source]

Repeat elements of a Series.

Returns a new Series where each element of the current Series is repeated consecutively a given number of times.

Parameters:
  • repeats (int or array of ints) – The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty DeferredSeries.
  • axis (None) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

Newly created DeferredSeries with repeated elements.

Return type:

DeferredSeries

Differences from pandas

repeats must be an int or a DeferredSeries. Lists are not supported because they make this operation order-sensitive.
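
A list of repeat counts is matched to elements by position, which is what makes it order-sensitive. Counts held in a Series can instead be matched by index label. The pandas-only sketch below emulates label-based matching with reindex; the labels and counts are made up, and Beam's own handling of a DeferredSeries argument is not shown here.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], index=["x", "y", "z"])

# Counts keyed by label, deliberately listed in a different order than s.
counts = pd.Series([3, 1, 2], index=["z", "x", "y"])

# Align the counts to the labels of s, then repeat.
aligned = counts.reindex(s.index)
result = s.repeat(aligned.to_numpy())

print(result.tolist())
```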

See also

Index.repeat()
Equivalent function for Index.
numpy.repeat()
Similar method for numpy.ndarray.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> s = pd.Series(['a', 'b', 'c'])
>>> s
0    a
1    b
2    c
dtype: object
>>> s.repeat(2)
0    a
0    a
1    b
1    b
2    c
2    c
dtype: object
>>> s.repeat([1, 2, 3])
0    a
1    b
1    b
2    c
2    c
2    c
dtype: object
compare(other, align_axis, **kwargs)[source]

Compare to another Series and show the differences.

New in version 1.1.0.

Parameters:
  • other (DeferredSeries) – Object to compare with.
  • align_axis ({0 or 'index', 1 or 'columns'}, default 1) –

    Determine which axis to align the comparison on.

    • 0, or ‘index’ : Resulting differences are stacked vertically
      with rows drawn alternately from self and other.
    • 1, or ‘columns’ : Resulting differences are aligned horizontally
      with columns drawn alternately from self and other.
  • keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.
  • keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.
  • result_names (tuple, default ('self', 'other')) –

    Set the dataframes names in the comparison.

    New in version 1.5.0.

Returns:

If axis is 0 or ‘index’ the result will be a DeferredSeries. The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.

If axis is 1 or ‘columns’ the result will be a DeferredDataFrame. It will have two columns namely ‘self’ and ‘other’.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.compare()
Compare with another DeferredDataFrame and show differences.

Notes

Matching NaNs will not appear as a difference.
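
A short pandas-only sketch of this note, with made-up values: position 1 holds NaN in both Series and is not reported, while position 2 holds different numbers and is.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])
s2 = pd.Series([1.0, np.nan, 4.0])

# Only rows that genuinely differ are returned.
diff = s1.compare(s2)

print(diff)
```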

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s1 = pd.Series(["a", "b", "c", "d", "e"])
>>> s2 = pd.Series(["a", "a", "c", "b", "e"])

Align the differences on columns

>>> s1.compare(s2)
  self other
1    b     a
3    d     b

Stack the differences on indices

>>> s1.compare(s2, align_axis=0)
1  self     b
   other    a
3  self     d
   other    b
dtype: object

Keep all original rows

>>> s1.compare(s2, keep_shape=True)
  self other
0  NaN   NaN
1    b     a
2  NaN   NaN
3    d     b
4  NaN   NaN

Keep all original rows and also all original values

>>> s1.compare(s2, keep_shape=True, keep_equal=True)
  self other
0    a     a
1    b     a
2    c     c
3    d     b
4    e     e
abs(**kwargs)

Return a Series/DataFrame with absolute numeric value of each element.

This function only applies to elements that are all numeric.

Returns:DeferredSeries/DeferredDataFrame containing the absolute value of each element.
Return type:DeferredSeries/DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.absolute()
Calculate the absolute value element-wise.

Notes

For a complex input a + bj, such as 1.2 + 1j, the absolute value is \(\sqrt{ a^2 + b^2 }\).
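
The formula can be checked numerically with a small pandas-only sketch. The second value, 3 + 4j, is added here because its magnitude is exactly 5.

```python
import math

import pandas as pd

s = pd.Series([1.2 + 1j, 3 + 4j])

# Magnitudes as computed by pandas.
magnitudes = s.abs()

# The same quantity from the formula sqrt(a^2 + b^2).
expected = [math.sqrt(z.real ** 2 + z.imag ** 2) for z in s]

print(magnitudes.tolist(), expected)
```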

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])
>>> s.abs()
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])
>>> s.abs()
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta('1 days')])
>>> s.abs()
0   1 days
dtype: timedelta64[ns]

Select rows with data closest to certain value using argsort (from
`StackOverflow <https://stackoverflow.com/a/17758115>`__).

>>> df = pd.DataFrame({
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df
     a    b    c
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
>>> df.loc[(df.c - 43).abs().argsort()]
     a    b    c
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
add(**kwargs)

Return Addition of series and other, element-wise (binary operator add).

Equivalent to series + other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.radd()
Reverse of the Addition operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.add(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64
asfreq(**kwargs)

pandas.Series.asfreq() is not implemented yet in the Beam DataFrame API.

If support for ‘asfreq’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

astype(dtype, copy, errors)

Cast a pandas object to a specified dtype.

Parameters:
  • dtype (str, data type, DeferredSeries or Mapping of column name -> data type) – Use a str, numpy.dtype, pandas.ExtensionDtype or Python type to cast entire pandas object to the same type. Alternatively, use a mapping, e.g. {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DeferredDataFrame’s columns to column-specific types.
  • copy (bool, default True) – Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
  • errors ({'raise', 'ignore'}, default 'raise') –

    Control raising of exceptions on invalid data for provided dtype.

    • raise : allow exceptions to be raised
    • ignore : suppress exceptions. On error return original object.
Returns:

Return type:

same type as caller

Differences from pandas

astype is not parallelizable when errors="ignore" is specified.

copy=False is not supported because it relies on memory-sharing semantics.

dtype="category" is not supported because the type of the output column depends on the data. Please use pd.CategoricalDtype with explicit categories instead.
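
The contrast between inferred and explicit categories can be sketched in plain pandas. With "category" the categories come from the data itself; with pd.CategoricalDtype they are fixed up front, so the output type is known before any data is read. The category list here is an illustrative assumption.

```python
import pandas as pd

ser = pd.Series(["b", "a", "b"])

# Inferred: the categories depend on which values happen to be present.
inferred = ser.astype("category")

# Explicit: the categories are declared in advance, independent of the data.
explicit = ser.astype(pd.CategoricalDtype(categories=["a", "b", "c"]))

print(list(inferred.cat.categories), list(explicit.cat.categories))
```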

See also

to_datetime()
Convert argument to datetime.
to_timedelta()
Convert argument to timedelta.
to_numeric()
Convert argument to a numeric type.
numpy.ndarray.astype()
Cast a numpy array to a specified type.

Notes

Changed in version 2.0.0: Using astype to convert from timezone-naive dtype to timezone-aware dtype will raise an exception. Use DeferredSeries.dt.tz_localize() instead.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Create a DataFrame:

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.dtypes
col1    int64
col2    int64
dtype: object

Cast all columns to int32:

>>> df.astype('int32').dtypes
col1    int32
col2    int32
dtype: object

Cast col1 to int32 using a dictionary:

>>> df.astype({'col1': 'int32'}).dtypes
col1    int32
col2    int64
dtype: object

Create a series:

>>> ser = pd.Series([1, 2], dtype='int32')
>>> ser
0    1
1    2
dtype: int32
>>> ser.astype('int64')
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')
0    1
1    2
dtype: category
Categories (2, int32): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> from pandas.api.types import CategoricalDtype
>>> cat_dtype = CategoricalDtype(
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Create a series of dates:

>>> ser_date = pd.Series(pd.date_range('20200101', periods=3))
>>> ser_date
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[ns]
at

pandas.Series.at() is not implemented yet in the Beam DataFrame API.

If support for ‘at’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

at_time(**kwargs)

Select values at particular time of day (e.g., 9:30AM).

Parameters:
  • time (datetime.time or str) – The values to select.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – For DeferredSeries this parameter is unused and defaults to 0.
Returns:

Return type:

DeferredSeries or DeferredDataFrame

Raises:

TypeError – If the index is not a DatetimeIndex

Differences from pandas

This operation has no known divergences from the pandas API.

See also

between_time()
Select values between particular times of the day.
first()
Select initial periods of time series based on a date offset.
last()
Select final periods of time series based on a date offset.
DatetimeIndex.indexer_at_time()
Get just the index locations for values at particular time of the day.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> i = pd.date_range('2018-04-09', periods=4, freq='12H')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-09 12:00:00  2
2018-04-10 00:00:00  3
2018-04-10 12:00:00  4

>>> ts.at_time('12:00')
                     A
2018-04-09 12:00:00  2
2018-04-10 12:00:00  4
attrs

pandas.DataFrame.attrs() is not yet supported in the Beam DataFrame API because it is experimental in pandas.

backfill(*args, **kwargs)

Synonym for DataFrame.fillna() with method='bfill'.

Deprecated since version 2.0: Series/DataFrame.backfill is deprecated. Use Series/DataFrame.bfill instead.

Returns:Object with missing values filled or None if inplace=True.
Return type:DeferredSeries/DeferredDataFrame or None

Differences from pandas

backfill is only supported for axis="columns". axis="index" is order-sensitive.

between_time(**kwargs)

Select values between particular times of the day (e.g., 9:00-9:30 AM).

By setting start_time to be later than end_time, you can get the times that are not between the two times.

Parameters:
  • start_time (datetime.time or str) – Initial time as a time filter limit.
  • end_time (datetime.time or str) – End time as a time filter limit.
  • inclusive ({"both", "neither", "left", "right"}, default "both") – Include boundaries; whether to set each bound as closed or open.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Determine range time on index or columns value. For DeferredSeries this parameter is unused and defaults to 0.
Returns:

Data from the original object filtered to the specified dates range.

Return type:

DeferredSeries or DeferredDataFrame

Raises:

TypeError – If the index is not a DatetimeIndex

Differences from pandas

This operation has no known divergences from the pandas API.

See also

at_time()
Select values at a particular time of the day.
first()
Select initial periods of time series based on a date offset.
last()
Select final periods of time series based on a date offset.
DatetimeIndex.indexer_between_time()
Get just the index locations for values between particular times of the day.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
2018-04-12 01:00:00  4

>>> ts.between_time('0:15', '0:45')
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3

You get the times that are not between two times by setting
start_time later than end_time:

>>> ts.between_time('0:45', '0:15')
                     A
2018-04-09 00:00:00  1
2018-04-12 01:00:00  4
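
The wrap-around rule used when start_time is later than end_time can be sketched with the standard library alone. The helper is_between is hypothetical and assumes inclusive="both"; it illustrates the selection rule and is not the pandas or Beam implementation.

```python
from datetime import time


def is_between(t, start, end):
    """Closed-interval time-of-day test with wrap-around."""
    if start <= end:
        return start <= t <= end
    # start later than end: keep times outside the gap between end and start.
    return t >= start or t <= end


times = [time(0, 0), time(0, 20), time(0, 40), time(1, 0)]
inside = [t for t in times if is_between(t, time(0, 15), time(0, 45))]
outside = [t for t in times if is_between(t, time(0, 45), time(0, 15))]
```

With the four times of day from the example above, inside keeps 00:20 and 00:40, and outside keeps 00:00 and 01:00.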
bfill(*args, **kwargs)

bfill is only supported for axis="columns". axis="index" is order-sensitive.
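
As an illustration of the supported case, the pandas-only sketch below back-fills along the columns. Each row is filled using only its own values, so the result does not depend on row order. The sample frame is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [np.nan, 1.0],
    "b": [2.0, np.nan],
    "c": [3.0, 4.0],
})

# Each missing value takes the next valid value to its right in the same row.
filled = df.bfill(axis="columns")
```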

bool()

Return the bool of a single element Series or DataFrame.

This must be a boolean scalar value, either True or False. It will raise a ValueError if the Series or DataFrame does not have exactly 1 element, or that element is not boolean (integer values 0 and 1 will also raise an exception).

Returns:The value in the DeferredSeries or DeferredDataFrame.
Return type:bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.astype()
Change the data type of a DeferredSeries, including to boolean.
DeferredDataFrame.astype()
Change the data type of a DeferredDataFrame, including to boolean.
numpy.bool_()
NumPy boolean data type, used by pandas for boolean values.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

The method will only work for single element objects with a boolean value:

>>> pd.Series([True]).bool()
True
>>> pd.Series([False]).bool()
False

>>> pd.DataFrame({'col': [True]}).bool()
True
>>> pd.DataFrame({'col': [False]}).bool()
False
combine(**kwargs)

Perform column-wise combine with another DataFrame.

Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.

Parameters:
  • other (DeferredDataFrame) – The DeferredDataFrame to merge column-wise.
  • func (function) – Function that takes two series as inputs and returns a DeferredSeries or a scalar. Used to merge the two dataframes column by column.
  • fill_value (scalar value, default None) – The value to fill NaNs with prior to passing any column to the merge func.
  • overwrite (bool, default True) – If True, columns in self that do not exist in other will be overwritten with NaNs.
Returns:

Combination of the provided DeferredDataFrames.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.combine_first()
Combine two DeferredDataFrame objects and default to non-null values in frame calling the method.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Combine using a simple function that chooses the smaller column.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2
>>> df1.combine(df2, take_smaller)
   A  B
0  0  3
1  0  3

Example using a true element-wise combine function.

>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, np.minimum)
   A  B
0  1  2
1  0  3

Using `fill_value` fills Nones prior to passing the column to the
merge function.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  4.0

However, if the same element in both dataframes is None, that None
is preserved

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
    A    B
0  0 -5.0
1  0  3.0

Example that demonstrates the use of `overwrite` and behavior when
the axis differ between the dataframes.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2])
>>> df1.combine(df2, take_smaller)
     A    B     C
0  NaN  NaN   NaN
1  NaN  3.0 -10.0
2  NaN  3.0   1.0

>>> df1.combine(df2, take_smaller, overwrite=False)
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0

Demonstrating the preference of the passed in dataframe.

>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2])
>>> df2.combine(df1, take_smaller)
   A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN

>>> df2.combine(df1, take_smaller, overwrite=False)
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 1.0
2  NaN  3.0 1.0
combine_first(**kwargs)

Update null elements with value in the same location in other.

Combine two DataFrame objects by filling null values in one DataFrame with non-null values from the other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two. When calling first.combine_first(second), the result keeps the value from first wherever both first.loc[index, col] and second.loc[index, col] are non-missing.

Parameters:other (DeferredDataFrame) – Provided DeferredDataFrame to use to fill null values.
Returns:The result of combining the provided DeferredDataFrame with the other object.
Return type:DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.combine()
Perform series-wise operation on two DeferredDataFrames using a given function.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0

Null values still persist if the location of that null value
does not exist in `other`

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
convert_dtypes(**kwargs)

pandas.Series.convert_dtypes() is not implemented yet in the Beam DataFrame API.

If support for ‘convert_dtypes’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

copy(**kwargs)

Make a copy of this object’s indices and data.

When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).

When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

Parameters:deep (bool, default True) – Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.
Returns:Object type matches caller.
Return type:DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

Notes

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).

While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.

Since pandas is not thread safe, see the gotchas when copying in a threading environment.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64

>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64

Shallow copy versus default (deep) copy:

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

Shallow copy shares data and index with original.

>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True

Deep copy has own copy of data and index.

>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False

Updates to the data shared by shallow copy and original are reflected
in both; deep copy remains unchanged.

>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64

Note that when copying an object containing Python objects, a deep copy
will copy the data, but will not do so recursively. Updating a nested
data object will be reflected in the deep copy.

>>> s = pd.Series([[1, 2], [3, 4]])
>>> deep = s.copy()
>>> s[0][0] = 10
>>> s
0    [10, 2]
1     [3, 4]
dtype: object
>>> deep
0    [10, 2]
1     [3, 4]
dtype: object
div(**kwargs)

Return Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.rtruediv()
Reverse of the Floating division operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
divide(**kwargs)

Return Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.rtruediv()
Reverse of the Floating division operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
divmod(**kwargs)

Return Integer division and modulo of series and other, element-wise (binary operator divmod).

Equivalent to divmod(series, other), but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

2-Tuple of DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.rdivmod()
Reverse of the Integer division and modulo operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divmod(b, fill_value=0)
(a    1.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64,
 a    0.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64)
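
When no fill_value is involved, the two results correspond to floor division and modulo. The pandas-only sketch below checks that correspondence on made-up integer data; it says nothing about how Beam evaluates the deferred operation.

```python
import pandas as pd

s = pd.Series([7, 8, 9])

# divmod returns a 2-tuple: (integer quotient, remainder).
quotient, remainder = s.divmod(2)

# The same values can be obtained with the two separate operators.
same_quotient = quotient.equals(s // 2)
same_remainder = remainder.equals(s % 2)
```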
drop(labels, axis, index, columns, errors, **kwargs)

Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels.

Parameters:
  • labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
  • index (single label or list-like) – Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
  • columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
  • level (int or level name, optional) – For MultiIndex, level from which the labels will be removed.
  • inplace (bool, default False) – If False, return a copy. Otherwise, do operation inplace and return None.
  • errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.
Returns:

DeferredDataFrame without the removed index or column labels or None if inplace=True.

Return type:

DeferredDataFrame or None

Raises:

KeyError – If any of the labels is not found in the selected axis.

Differences from pandas

drop is not parallelizable when dropping from the index and errors="raise" is specified. It requires collecting all data on a single node in order to detect if one of the index values is missing.
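
The note above ties the restriction to errors="raise". The pandas-only sketch below shows how the two settings differ when a requested label is absent; the frame and labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]})

# Label 99 is absent; errors="ignore" drops only the labels that exist.
kept = df.drop(index=[0, 99], errors="ignore")

# With the default errors="raise", the missing label is reported.
try:
    df.drop(index=[0, 99])
    raised = False
except KeyError:
    raised = True
```

Detecting the missing label requires knowing every index value, which is the check the note describes as needing all data on a single node.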

See also

DeferredDataFrame.loc()
Label-location based indexer for selection by label.
DeferredDataFrame.dropna()
Return DeferredDataFrame with labels on given axis omitted where (all or any) data are missing.
DeferredDataFrame.drop_duplicates()
Return DeferredDataFrame with duplicate rows removed, optionally only considering certain columns.
DeferredSeries.drop()
Return DeferredSeries with specified index labels removed.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns

>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11

>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Drop rows by index

>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

Drop columns and/or rows of MultiIndex DataFrame

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2

Drop a specific index combination from the MultiIndex
DataFrame, i.e., drop the combination 'falcon' and
'weight', which deletes only the corresponding row

>>> df.drop(index=('falcon', 'weight'))
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        length  0.3     0.2

>>> df.drop(index='cow', columns='small')
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3

>>> df.drop(index='length', level=1)
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8
droplevel(level, axis)

Return Series/DataFrame with requested index / column level(s) removed.

Parameters:
  • level (int, str, or list-like) – If a string is given, must be the name of a level. If list-like, elements must be names or positional indexes of levels.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) –

    Axis along which the level(s) is removed:

    • 0 or ‘index’: remove level(s) from the index (row labels).
    • 1 or ‘columns’: remove level(s) from the columns.

    For DeferredSeries this parameter is unused and defaults to 0.

Returns:

DeferredSeries/DeferredDataFrame with requested index / column level(s) removed.

Return type:

DeferredSeries/DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([
...     [1, 2, 3, 4],
...     [5, 6, 7, 8],
...     [9, 10, 11, 12]
... ]).set_index([0, 1]).rename_axis(['a', 'b'])

>>> df.columns = pd.MultiIndex.from_tuples([
...     ('c', 'e'), ('d', 'f')
... ], names=['level_1', 'level_2'])

>>> df
level_1   c   d
level_2   e   f
a b
1 2      3   4
5 6      7   8
9 10    11  12

>>> df.droplevel('a')
level_1   c   d
level_2   e   f
b
2        3   4
6        7   8
10      11  12

>>> df.droplevel('level_2', axis=1)
level_1   c   d
a b
1 2      3   4
5 6      7   8
9 10    11  12
empty

Indicator whether Series/DataFrame is empty.

True if Series/DataFrame is entirely empty (no items), meaning any of the axes are of length 0.

Returns:True if the DeferredSeries/DeferredDataFrame is empty, False otherwise.
Return type:bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.dropna
Return series without null values.
DeferredDataFrame.dropna
Return DeferredDataFrame with labels on given axis omitted where (all or any) data are missing.

Notes

If DeferredSeries/DeferredDataFrame contains only NaNs, it is still not considered empty. See the example below.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

An example of an actual empty DataFrame. Notice the index is empty:

>>> df_empty = pd.DataFrame({'A' : []})
>>> df_empty
Empty DataFrame
Columns: [A]
Index: []
>>> df_empty.empty
True

If we only have NaNs in our DataFrame, it is not considered empty! We
will need to drop the NaNs to make the DataFrame empty:

>>> df = pd.DataFrame({'A' : [np.nan]})
>>> df
    A
0 NaN
>>> df.empty
False
>>> df.dropna().empty
True

>>> ser_empty = pd.Series({'A' : []})
>>> ser_empty
A    []
dtype: object
>>> ser_empty.empty
False
>>> ser_empty = pd.Series()
>>> ser_empty.empty
True
eq(**kwargs)

Return Equal to of series and other, element-wise (binary operator eq).

Equivalent to series == other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.eq(b, fill_value=0)
a     True
b    False
c    False
d    False
e    False
dtype: bool
equals(other)

Test whether two objects contain the same elements.

This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.

The row/column index do not need to have the same type, as long as the values are considered equal. Corresponding columns must be of the same dtype.

Parameters:other (DeferredSeries or DeferredDataFrame) – The other DeferredSeries or DeferredDataFrame to be compared with the first.
Returns:True if all elements are the same in both objects, False otherwise.
Return type:bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.eq()
Compare two DeferredSeries objects of the same length and return a DeferredSeries where each element is True if the element in each DeferredSeries is equal, False otherwise.
DeferredDataFrame.eq()
Compare two DeferredDataFrame objects of the same shape and return a DeferredDataFrame where each element is True if the respective element in each DeferredDataFrame is equal, False otherwise.
testing.assert_series_equal()
Raises an AssertionError if left and right are not equal. Provides an easy interface to ignore inequality in dtypes, indexes and precision among others.
testing.assert_frame_equal()
Like assert_series_equal, but targets DeferredDataFrames.
numpy.array_equal()
Return True if two arrays have the same shape and elements, False otherwise.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> df
    1   2
0  10  20

DataFrames df and exactly_equal have the same types and values for
their elements and column labels, which will return True.

>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal
    1   2
0  10  20
>>> df.equals(exactly_equal)
True

DataFrames df and different_column_type have the same element
types and values, but have different types for the column labels,
which will still return True.

>>> different_column_type = pd.DataFrame({1.0: [10], 2.0: [20]})
>>> different_column_type
   1.0  2.0
0   10   20
>>> df.equals(different_column_type)
True

DataFrames df and different_data_type have different types for the
same values for their elements, and will return False even though
their column labels are the same values and types.

>>> different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})
>>> different_data_type
      1     2
0  10.0  20.0
>>> df.equals(different_data_type)
False
ewm(**kwargs)

pandas.Series.ewm() is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics.

For more information see https://s.apache.org/dataframe-event-time-semantics.

expanding(**kwargs)

pandas.Series.expanding() is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics.

For more information see https://s.apache.org/dataframe-event-time-semantics.

ffill(*args, **kwargs)

ffill is only supported for axis="columns". axis="index" is order-sensitive.

fillna(value, method, axis, limit, **kwargs)

Fill NA/NaN values using the specified method.

Parameters:
  • value (scalar, dict, DeferredSeries, or DeferredDataFrame) – Value to use to fill holes (e.g. 0), alternately a dict/DeferredSeries/DeferredDataFrame of values specifying which value to use for each index (for a DeferredSeries) or column (for a DeferredDataFrame). Values not in the dict/DeferredSeries/DeferredDataFrame will not be filled. This value cannot be a list.
  • method ({'backfill', 'bfill', 'ffill', None}, default None) –

    Method to use for filling holes in reindexed DeferredSeries:

    • ffill: propagate last valid observation forward to next valid.
    • backfill / bfill: use next valid observation to fill gap.
  • axis ({0 or 'index', 1 or 'columns'}) – Axis along which to fill missing values. For DeferredSeries this parameter is unused and defaults to 0.
  • inplace (bool, default False) – If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DeferredDataFrame).
  • limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
  • downcast (dict, default is None) – A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
Returns:

Object with missing values filled or None if inplace=True.

Return type:

DeferredDataFrame or None

Differences from pandas

When axis="index", both method and limit must be None; otherwise this operation is order-sensitive.
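
The restriction can be seen in plain pandas (this is not Beam code, and the small frame below is hypothetical). Filling with a value alone treats every row independently, whereas adding limit makes the outcome depend on which NaN comes first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 2.0, np.nan]}, index=["x", "y", "z"])
reordered = df.loc[["z", "y", "x"]]  # same rows, different order

# Value only: every NaN is replaced, regardless of where its row sits.
plain = df.fillna(0)
plain_reordered = reordered.fillna(0)
same_plain = plain.sort_index().equals(plain_reordered.sort_index())

# With limit=1 only the first NaN encountered in each column is filled,
# so the row that gets filled depends on the row order.
limited = df.fillna(0, limit=1)
limited_reordered = reordered.fillna(0, limit=1)

print(same_plain)
print(limited.sort_index())
print(limited_reordered.sort_index())
```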

See also

interpolate()
Fill NaN values using interpolation.
reindex()
Conform object to new index.
asfreq()
Convert time series to specified frequency.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0

Replace all NaN elements with 0s.

>>> df.fillna(0)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0

We can also propagate non-null values forward or backward.

>>> df.fillna(method="ffill")
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  1.0
3  3.0  3.0 NaN  4.0

Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1,
2, and 3 respectively.

>>> values = {"A": 0, "B": 1, "C": 2, "D": 3}
>>> df.fillna(value=values)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  2.0  1.0
2  0.0  1.0  2.0  3.0
3  0.0  3.0  2.0  4.0

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  NaN  1.0
2  NaN  1.0  NaN  3.0
3  NaN  3.0  NaN  4.0

When filling using a DataFrame, replacement happens along
the same column names and same indices.

>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
>>> df.fillna(df2)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  NaN
3  0.0  3.0  0.0  4.0

Note that column D is not affected since it is not present in df2.
first(offset)

Select initial periods of time series data based on a date offset.

For a DataFrame with a sorted DatetimeIndex, this function can select the first few rows based on a date offset.

Parameters:offset (str, DateOffset or dateutil.relativedelta) – The offset length of the data that will be selected. For instance, ‘1M’ will display all the rows having their index within the first month.
Returns:A subset of the caller.
Return type:DeferredSeries or DeferredDataFrame
Raises:TypeError – If the index is not a DatetimeIndex

Differences from pandas

This operation has no known divergences from the pandas API.

See also

last()
Select final periods of time series based on a date offset.
at_time()
Select values at a particular time of the day.
between_time()
Select values between particular times of the day.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the first 3 days:

>>> ts.first('3D')
            A
2018-04-09  1
2018-04-11  2

Notice that data for the first 3 calendar days were returned, not the first
3 days observed in the dataset, and therefore data for 2018-04-13 was
not returned.
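
Because the selection is defined by index values rather than by row position, the same rows can be picked out with a boolean mask on the index. The sketch below is plain pandas (not Beam code) and reproduces the ts.first('3D') result above. Anchoring the window at ts.index.min() is an assumption that matches first() only for fixed-length offsets such as '3D'; recent pandas releases also deprecate first() in favor of this kind of mask.

```python
import pandas as pd

i = pd.date_range("2018-04-09", periods=4, freq="2D")
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)

# Keep rows whose timestamp falls within 3 calendar days of the earliest one.
cutoff = ts.index.min() + pd.Timedelta("3D")
first_three_days = ts.loc[ts.index < cutoff]
print(first_three_days)
```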
flags

pandas.Series.flags() is not implemented yet in the Beam DataFrame API.

If support for ‘flags’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

floordiv(**kwargs)

Return Integer division of series and other, element-wise (binary operator floordiv).

Equivalent to series // other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rfloordiv()
Reverse of the Integer division operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.floordiv(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
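
The fill_value rule described above can be spelled out by hand. The following plain-pandas sketch (not Beam code) aligns the two Series, fills one-sided gaps with 0, divides, and then restores NaN where both inputs were missing. The names left, right, both_missing and manual are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=["a", "b", "c", "d"])
b = pd.Series([1, np.nan, 1, np.nan], index=["a", "b", "d", "e"])

# Step 1: align both operands on the union of their labels.
left, right = a.align(b, join="outer")

# Step 2: remember the positions where neither side has a value.
both_missing = left.isna() & right.isna()

# Step 3: fill the remaining gaps, divide, and blank out step 2's positions.
manual = (left.fillna(0) // right.fillna(0)).mask(both_missing)

builtin = a.floordiv(b, fill_value=0)
print(manual)
print(manual.equals(builtin))
```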
ge(**kwargs)

Return Greater than or equal to of series and other, element-wise (binary operator ge).

Equivalent to series >= other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.ge(b, fill_value=0)
a     True
b     True
c    False
d    False
e     True
f    False
dtype: bool
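
One practical difference between the method and the bare operator, shown here in plain pandas (not Beam code): the >= operator refuses to compare Series whose labels differ, while ge() aligns them first. The variable operator_error is introduced for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan, 1], index=["a", "b", "c", "d", "e"])
b = pd.Series([0, 1, 2, np.nan, 1], index=["a", "b", "c", "d", "f"])

# The bare operator raises because the two indexes are not identical.
try:
    a >= b
    operator_error = None
except ValueError as exc:
    operator_error = str(exc)

# The method aligns on the union of labels and fills one-sided gaps first.
result = a.ge(b, fill_value=0)
print(operator_error)
print(result)
```

Position 'd' stays False because both inputs are missing there, so nothing is filled and a comparison involving NaN evaluates to False.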
groupby(by, level, axis, as_index, group_keys, **kwargs)

Group DataFrame using a mapper or by a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters:
  • by (mapping, function, label, pd.Grouper or list of such) – Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or DeferredSeries is passed, the DeferredSeries or dict VALUES will be used to determine the groups (the DeferredSeries’ values are first aligned; see .align() method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Split along rows (0) or columns (1). For DeferredSeries this parameter is unused and defaults to 0.
  • level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels. Do not specify both by and level.
  • as_index (bool, default True) – For aggregated output, return object with group labels as the index. Only relevant for DeferredDataFrame input. as_index=False is effectively “SQL-style” grouped output.
  • sort (bool, default True) –

    Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

    Changed in version 2.0.0: Specifying sort=False with an ordered categorical grouper will no longer sort the values.

  • group_keys (bool, default True) –

    When calling apply and the by argument produces a like-indexed (i.e. a transform) result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise.

    Changed in version 1.5.0: Warns that group_keys will no longer be ignored when the result from apply is a like-indexed DeferredSeries or DeferredDataFrame. Specify group_keys explicitly to include the group keys or not.

    Changed in version 2.0.0: group_keys now defaults to True.

  • observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
  • dropna (bool, default True) –

    If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

    New in version 1.1.0.

Returns:

Returns a groupby object that contains information about the groups.

Return type:

DeferredDataFrameGroupBy

Differences from pandas

as_index must be True.

Aggregations grouping by a categorical column with observed=False set are not currently parallelizable (Issue 21827).
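
Since as_index must be True, one way to obtain "SQL-style" output is to aggregate with the default and then move the group labels back into a column. The sketch below shows the idea in plain pandas (not Beam code); whether reset_index() is appropriate in a given Beam pipeline is an assumption to check against its own 'Differences from pandas' notes.

```python
import pandas as pd

df = pd.DataFrame({"Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
                   "Max Speed": [380., 370., 24., 26.]})

# Aggregate with the default as_index=True, then turn the group labels
# back into an ordinary column.
flat = df.groupby("Animal").mean().reset_index()

# For comparison: what pandas itself returns for as_index=False.
sql_style = df.groupby("Animal", as_index=False).mean()

print(flat)
print(sql_style)
```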

See also

resample()
Convenience method for frequency conversion and resampling of time series.

Notes

See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

**Hierarchical Indexes**

We can groupby different levels of a hierarchical index
using the `level` parameter:

>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
...                   index=index)
>>> df
                Max Speed
Animal Type
Falcon Captive      390.0
       Wild         350.0
Parrot Captive       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level="Type").mean()
         Max Speed
Type
Captive      210.0
Wild         185.0

We can also choose to include NA in group keys or not by setting
the `dropna` parameter; the default setting is `True`.

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])

>>> df.groupby(by=["b"]).sum()
    a   c
b
1.0 2   3
2.0 2   5

>>> df.groupby(by=["b"], dropna=False).sum()
    a   c
b
1.0 2   3
2.0 2   5
NaN 1   4

>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])

>>> df.groupby(by="a").sum()
    b     c
a
a   13.0   13.0
b   12.3  123.0

>>> df.groupby(by="a", dropna=False).sum()
    b     c
a
a   13.0   13.0
b   12.3  123.0
NaN 12.3   33.0

When using ``.apply()``, use ``group_keys`` to include or exclude the group keys.
The ``group_keys`` argument defaults to ``True`` (include).

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df.groupby("Animal", group_keys=True).apply(lambda x: x)
          Animal  Max Speed
Animal
Falcon 0  Falcon      380.0
       1  Falcon      370.0
Parrot 2  Parrot       24.0
       3  Parrot       26.0

>>> df.groupby("Animal", group_keys=False).apply(lambda x: x)
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
gt(**kwargs)

Return Greater than of series and other, element-wise (binary operator gt).

Equivalent to series > other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.gt(b, fill_value=0)
a     True
b    False
c    False
d    False
e     True
f    False
dtype: bool
hist(**kwargs)

pandas.DataFrame.hist() is not yet supported in the Beam DataFrame API because it is a plotting tool.

For more information see https://s.apache.org/dataframe-plotting-tools.

iloc

Purely integer-location based indexing for selection by position.

.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Allowed inputs are:

  • An integer, e.g. 5.
  • A list or array of integers, e.g. [4, 3, 0].
  • A slice object with ints, e.g. 1:7.
  • A boolean array.
  • A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
  • A tuple of row and column indexes. The tuple elements consist of one of the above inputs, e.g. (0, 1).

.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).

See more at Selection by Position.

Differences from pandas

Position-based indexing with iloc is order-sensitive in almost every case. Beam DataFrame users should prefer label-based indexing with loc.
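
A small plain-pandas sketch (not Beam code; the frame is hypothetical) shows what order-sensitive means here: the same rows in a different order give a different answer for iloc but the same answer for loc.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 100, 1000]}, index=["x", "y", "z"])
reordered = df.loc[["z", "x", "y"]]  # same rows, different order

# Position 0 refers to whichever row happens to come first.
by_position = (df.iloc[0]["a"], reordered.iloc[0]["a"])

# Label 'x' refers to the same row in both frames.
by_label = (df.loc["x", "a"], reordered.loc["x", "a"])

print(by_position)
print(by_label)
```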

See also

DeferredDataFrame.iat
Fast integer location scalar accessor.
DeferredDataFrame.loc
Purely label-location based indexer for selection by label.
DeferredSeries.iloc
Purely integer-location based indexing for selection by position.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
...           {'a': 100, 'b': 200, 'c': 300, 'd': 400},
...           {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
>>> df = pd.DataFrame(mydict)
>>> df
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000

**Indexing just the rows**

With a scalar integer.

>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>
>>> df.iloc[0]
a    1
b    2
c    3
d    4
Name: 0, dtype: int64

With a list of integers.

>>> df.iloc[[0]]
   a  b  c  d
0  1  2  3  4
>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>

>>> df.iloc[[0, 1]]
     a    b    c    d
0    1    2    3    4
1  100  200  300  400

With a `slice` object.

>>> df.iloc[:3]
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000

With a boolean mask the same length as the index.

>>> df.iloc[[True, False, True]]
      a     b     c     d
0     1     2     3     4
2  1000  2000  3000  4000

With a callable, useful in method chains. The `x` passed
to the ``lambda`` is the DataFrame being sliced. This selects
the rows whose index label is even.

>>> df.iloc[lambda x: x.index % 2 == 0]
      a     b     c     d
0     1     2     3     4
2  1000  2000  3000  4000

**Indexing both axes**

You can mix the indexer types for the index and columns. Use ``:`` to
select the entire axis.

With scalar integers.

>>> df.iloc[0, 1]
2

With lists of integers.

>>> df.iloc[[0, 2], [1, 3]]
      b     d
0     2     4
2  2000  4000

With `slice` objects.

>>> df.iloc[1:3, 0:3]
      a     b     c
1   100   200   300
2  1000  2000  3000

With a boolean array whose length matches the columns.

>>> df.iloc[:, [True, False, True, False]]
      a     c
0     1     3
1   100   300
2  1000  3000

With a callable function that expects the Series or DataFrame.

>>> df.iloc[:, lambda df: [0, 2]]
      a     c
0     1     3
1   100   300
2  1000  3000
index

The index (row labels) of the DataFrame.

Differences from pandas

This operation has no known divergences from the pandas API.

infer_object(**kwargs)

pandas.Series.infer_objects() is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.

For more information see https://s.apache.org/dataframe-non-deferred-columns.

infer_objects(**kwargs)

pandas.Series.infer_objects() is not implemented yet in the Beam DataFrame API.

If support for ‘infer_objects’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

isin(**kwargs)

Whether each element in the DataFrame is contained in values.

Parameters:values (iterable, DeferredSeries, DeferredDataFrame or dict) – The result will only be true at a location if all the labels match. If values is a DeferredSeries, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DeferredDataFrame, then both the index and column labels must match.
Returns:DeferredDataFrame of booleans showing whether each element in the DeferredDataFrame is contained in values.
Return type:DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq()
Equality test for DeferredDataFrame.
DeferredSeries.isin()
Equivalent method on DeferredSeries.
DeferredSeries.str.contains()
Test if pattern or regex is contained within a string of a DeferredSeries or Index.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
...                   index=['falcon', 'dog'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0

When ``values`` is a list check whether every value in the DataFrame
is present in the list (which animals have 0 or 2 legs or wings)

>>> df.isin([0, 2])
        num_legs  num_wings
falcon      True       True
dog        False       True

To check if ``values`` is *not* in the DataFrame, use the ``~`` operator:

>>> ~df.isin([0, 2])
        num_legs  num_wings
falcon     False      False
dog         True      False

When ``values`` is a dict, we can pass values to check for each
column separately:

>>> df.isin({'num_wings': [0, 3]})
        num_legs  num_wings
falcon     False      False
dog        False       True

When ``values`` is a Series or DataFrame the index and column must
match. Note that 'falcon' does not match based on the number of legs
in other.

>>> other = pd.DataFrame({'num_legs': [8, 3], 'num_wings': [0, 2]},
...                      index=['spider', 'falcon'])
>>> df.isin(other)
        num_legs  num_wings
falcon     False       True
dog        False      False
item(**kwargs)

pandas.Series.item() is not implemented yet in the Beam DataFrame API.

If support for ‘item’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

last(offset)

Select final periods of time series data based on a date offset.

For a DataFrame with a sorted DatetimeIndex, this function selects the last few rows based on a date offset.

Parameters:offset (str, DateOffset, dateutil.relativedelta) – The offset length of the data that will be selected. For instance, ‘3D’ will display all the rows having their index within the last 3 days.
Returns:A subset of the caller.
Return type:DeferredSeries or DeferredDataFrame
Raises:TypeError – If the index is not a DatetimeIndex

Differences from pandas

This operation has no known divergences from the pandas API.

See also

first()
Select initial periods of time series based on a date offset.
at_time()
Select values at a particular time of the day.
between_time()
Select values between particular times of the day.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the last 3 days:

>>> ts.last('3D')
            A
2018-04-13  3
2018-04-15  4

Notice that data for the last 3 calendar days were returned, not the last
3 observed days in the dataset, and therefore data for 2018-04-11 was
not returned.
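
The ts.last('3D') result above can also be written as a mask on the index, which states the time window explicitly. This is a plain-pandas sketch (not Beam code); measuring the window back from ts.index.max() is an assumption that matches last() for fixed-length offsets such as '3D'.

```python
import pandas as pd

i = pd.date_range("2018-04-09", periods=4, freq="2D")
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)

# Keep rows strictly later than 3 calendar days before the latest timestamp.
cutoff = ts.index.max() - pd.Timedelta("3D")
last_three_days = ts.loc[ts.index > cutoff]
print(last_three_days)
```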
le(**kwargs)

Return Less than or equal to of series and other, element-wise (binary operator le).

Equivalent to series <= other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.le(b, fill_value=0)
a    False
b     True
c     True
d    False
e    False
f     True
dtype: bool
length()

Alternative to len(df) which returns a deferred result that can be used in arithmetic with DeferredSeries or DeferredDataFrame instances.

loc

Access a group of rows and columns by label(s) or a boolean array.

.loc[] is primarily label based, but may also be used with a boolean array.

Allowed inputs are:

  • A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

  • A list or array of labels, e.g. ['a', 'b', 'c'].

  • A slice object with labels, e.g. 'a':'f'.

    Warning

    Note that, contrary to usual Python slices, both the start and the stop are included.

  • A boolean array of the same length as the axis being sliced, e.g. [True, False, True].

  • An alignable boolean Series. The index of the key will be aligned before masking.

  • An alignable Index. The Index of the returned selection will be the input.

  • A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).

See more at Selection by Label.
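
The inclusive-stop rule for label slices is easy to overlook, so here is a minimal plain-pandas sketch (not Beam code; the Series is hypothetical) that contrasts it with a positional slice.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s.loc["a":"c"]   # label slice: the stop label 'c' is included
by_position = s.iloc[0:2]   # positional slice: the stop position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```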

Raises:
  • KeyError – If any items are not found.
  • IndexingError – If an indexed key is passed and its index is unalignable to the frame index.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.at
Access a single value for a row/column label pair.
DeferredDataFrame.iloc
Access group of rows and columns by integer position(s).
DeferredDataFrame.xs
Returns a cross-section (row(s) or column(s)) from the DeferredSeries/DeferredDataFrame.
DeferredSeries.loc
Access group of values using labels.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Getting values**

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=['cobra', 'viper', 'sidewinder'],
...      columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8

Single label. Note this returns the row as a Series.

>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64

List of labels. Note using ``[[]]`` returns a DataFrame.

>>> df.loc[['viper', 'sidewinder']]
            max_speed  shield
viper               4       5
sidewinder          7       8

Single label for row and column

>>> df.loc['cobra', 'shield']
2

Slice with labels for row and single label for column. As mentioned
above, note that both the start and stop of the slice are included.

>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64

Boolean list with the same length as the row axis

>>> df.loc[[False, False, True]]
            max_speed  shield
sidewinder          7       8

Alignable boolean Series:

>>> df.loc[pd.Series([False, True, False],
...        index=['viper', 'sidewinder', 'cobra'])]
            max_speed  shield
sidewinder          7       8

Index (same behavior as ``df.reindex``)

>>> df.loc[pd.Index(["cobra", "viper"], name="foo")]
       max_speed  shield
foo
cobra          1       2
viper          4       5

Conditional that returns a boolean Series

>>> df.loc[df['shield'] > 6]
            max_speed  shield
sidewinder          7       8

Conditional that returns a boolean Series with column labels specified

>>> df.loc[df['shield'] > 6, ['max_speed']]
            max_speed
sidewinder          7

Callable that returns a boolean Series

>>> df.loc[lambda df: df['shield'] == 8]
            max_speed  shield
sidewinder          7       8

**Setting values**

Set value for all items matching the list of labels

>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
>>> df
            max_speed  shield
cobra               1       2
viper               4      50
sidewinder          7      50

Set value for an entire row

>>> df.loc['cobra'] = 10
>>> df
            max_speed  shield
cobra              10      10
viper               4      50
sidewinder          7      50

Set value for an entire column

>>> df.loc[:, 'max_speed'] = 30
>>> df
            max_speed  shield
cobra              30      10
viper              30      50
sidewinder         30      50

Set value for rows matching callable condition

>>> df.loc[df['shield'] > 35] = 0
>>> df
            max_speed  shield
cobra              30      10
viper               0       0
sidewinder          0       0

**Getting values on a DataFrame with an index that has integer labels**

Another example using integers for the index

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=[7, 8, 9], columns=['max_speed', 'shield'])
>>> df
   max_speed  shield
7          1       2
8          4       5
9          7       8

Slice with integer labels for rows. As mentioned above, note that both
the start and stop of the slice are included.

>>> df.loc[7:9]
   max_speed  shield
7          1       2
8          4       5
9          7       8

**Getting values with a MultiIndex**

A number of examples using a DataFrame with a MultiIndex

>>> tuples = [
...    ('cobra', 'mark i'), ('cobra', 'mark ii'),
...    ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
...    ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
...         [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36

Single label. Note this returns a DataFrame with a single index.

>>> df.loc['cobra']
         max_speed  shield
mark i          12       2
mark ii          0       4

Single index tuple. Note this returns a Series.

>>> df.loc[('cobra', 'mark ii')]
max_speed    0
shield       4
Name: (cobra, mark ii), dtype: int64

Single label for row and column. Similar to passing in a tuple, this
returns a Series.

>>> df.loc['cobra', 'mark i']
max_speed    12
shield        2
Name: (cobra, mark i), dtype: int64

Single tuple. Note using ``[[]]`` returns a DataFrame.

>>> df.loc[[('cobra', 'mark ii')]]
               max_speed  shield
cobra mark ii          0       4

Single tuple for the index with a single label for the column

>>> df.loc[('cobra', 'mark i'), 'shield']
2

Slice from index tuple to single label

>>> df.loc[('cobra', 'mark i'):'viper']
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36

Slice from index tuple to index tuple

>>> df.loc[('cobra', 'mark i'):('viper', 'mark ii')]
                    max_speed  shield
cobra      mark i          12       2
           mark ii          0       4
sidewinder mark i          10      20
           mark ii          1       4
viper      mark ii          7       1
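
One way to reason about tuple-to-tuple slices is lexicographic comparison with both endpoints included. The sketch below uses plain Python tuples and assumes the index is already sorted; `slice_tuples` is a hypothetical helper, not a pandas or Beam function.

```python
# Hypothetical helper for illustration; not part of pandas or Beam.
def slice_tuples(index, start, stop):
    """Keep index tuples between start and stop, endpoints included.

    Assumes the index is sorted. Python compares tuples lexicographically,
    which mirrors how a sorted MultiIndex orders its entries.
    """
    return [key for key in index if start <= key <= stop]


index = [
    ('cobra', 'mark i'), ('cobra', 'mark ii'),
    ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
    ('viper', 'mark ii'), ('viper', 'mark iii'),
]

selected = slice_tuples(index, ('cobra', 'mark i'), ('viper', 'mark ii'))
for key in selected:
    print(key)
```

Under these assumptions `('viper', 'mark iii')` is left out because it sorts after the stop tuple, which agrees with the output above.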

Please see the pandas user guide section on advanced indexing with a hierarchical index
for more details and explanations of advanced indexing.
lt(**kwargs)

Return Less than of series and other, element-wise (binary operator lt).

Equivalent to series < other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.lt(b, fill_value=0)
a    False
b    False
c     True
d    False
e    False
f     True
dtype: bool
mask(cond, **kwargs)

mask is not parallelizable when errors="ignore" is specified.

mod(**kwargs)

Return Modulo of series and other, element-wise (binary operator mod).

Equivalent to series % other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rmod()
Reverse of the Modulo operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.mod(b, fill_value=0)
a    0.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64
mul(**kwargs)

Return Multiplication of series and other, element-wise (binary operator mul).

Equivalent to series * other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rmul()
Reverse of the Multiplication operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
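
The fill_value rule from the parameter description can be sketched in plain Python. This is an illustration only, not how pandas or Beam implement alignment: `binary_op_with_fill` is a hypothetical helper, and dicts keyed by label stand in for series.

```python
import math
import operator

NAN = float('nan')


# Hypothetical helper for illustration; not how pandas or Beam implement it.
def binary_op_with_fill(left, right, op, fill_value=None):
    """Apply op label by label over the union of both indexes."""
    result = {}
    for label in sorted(set(left) | set(right)):
        # A label absent from one side counts as missing (NaN) on that side.
        x = left.get(label, NAN)
        y = right.get(label, NAN)
        x_missing, y_missing = math.isnan(x), math.isnan(y)
        if fill_value is not None and x_missing != y_missing:
            # Exactly one side is missing: substitute fill_value for it.
            x = fill_value if x_missing else x
            y = fill_value if y_missing else y
        # If both sides are missing they stay NaN, so the result is NaN too.
        result[label] = op(x, y)
    return result


a = {'a': 1.0, 'b': 1.0, 'c': 1.0, 'd': NAN}
b = {'a': 1.0, 'b': NAN, 'd': 1.0, 'e': NAN}

product = binary_op_with_fill(a, b, operator.mul, fill_value=0)
for label, value in product.items():
    print(label, value)
```

Label `e` stays NaN because it is missing on both sides, which is the case the fill_value description calls out.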
multiply(**kwargs)

Return Multiplication of series and other, element-wise (binary operator mul).

Equivalent to series * other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rmul()
Reverse of the Multiplication operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
ndim

Return an int representing the number of axes / array dimensions.

Return 1 if Series. Otherwise return 2 if DataFrame.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

ndarray.ndim
Number of array dimensions.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3})
>>> s.ndim
1

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.ndim
2
ne(**kwargs)

Return Not equal to of series and other, element-wise (binary operator ne).

Equivalent to series != other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.ne(b, fill_value=0)
a    False
b     True
c     True
d     True
e     True
dtype: bool
pad(*args, **kwargs)

Synonym for DataFrame.fillna() with method='ffill'.

Deprecated since version 2.0: Series/DataFrame.pad is deprecated. Use Series/DataFrame.ffill instead.

Returns:Object with missing values filled or None if inplace=True.
Return type:DeferredSeries/DeferredDataFrame or None

Differences from pandas

pad is only supported for axis="columns". axis="index" is order-sensitive.

pipe(func, *args, **kwargs)

Apply chainable functions that expect Series or DataFrames.

Parameters:
  • func (function) – Function to apply to the DeferredSeries/DeferredDataFrame. args, and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the DeferredSeries/DeferredDataFrame.
  • args (iterable, optional) – Positional arguments passed into func.
  • kwargs (mapping, optional) – A dictionary of keyword arguments passed into func.
Returns:

The return type of func.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.apply()
Apply a function along input axis of DeferredDataFrame.
DeferredDataFrame.applymap()
Apply a function elementwise on a whole DeferredDataFrame.
DeferredSeries.map()
Apply a mapping correspondence on a DeferredSeries.

Notes

Use .pipe when chaining together functions that expect DeferredSeries, DeferredDataFrames or GroupBy objects. Instead of writing

>>> func(g(h(df), arg1=a), arg2=b, arg3=c)  # doctest: +SKIP

You can write

>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe(func, arg2=b, arg3=c)
... )  # doctest: +SKIP

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose func takes its data as arg2:

>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe((func, 'arg2'), arg1=a, arg3=c)
...  )  # doctest: +SKIP
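
The tuple form can be easier to follow with the dispatch written out. Below is a minimal sketch of that behavior in plain Python; the module-level `pipe` function, `scale`, and `shift` are hypothetical stand-ins for illustration and are not the pandas or Beam implementation.

```python
# Hypothetical stand-in for illustration; not the pandas or Beam implementation.
def pipe(obj, func, *args, **kwargs):
    """Call func with obj, supporting the (callable, data_keyword) form."""
    if isinstance(func, tuple):
        target, data_keyword = func
        if data_keyword in kwargs:
            raise ValueError(
                f"{data_keyword} is both the pipe target and a keyword argument")
        # Pass the data under the keyword that the callable expects.
        kwargs[data_keyword] = obj
        return target(*args, **kwargs)
    # Default form: the data is the first positional argument.
    return func(obj, *args, **kwargs)


def scale(data, factor):
    return [value * factor for value in data]


def shift(offset, data):
    # The data is the second parameter here, so pipe needs its keyword.
    return [value + offset for value in data]


data = [1, 2, 3]
scaled = pipe(data, scale, factor=10)
shifted = pipe(scaled, (shift, 'data'), offset=1)
print(shifted)
```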
pow(**kwargs)

Return Exponential power of series and other, element-wise (binary operator pow).

Equivalent to series ** other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rpow()
Reverse of the Exponential power operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.pow(b, fill_value=0)
a    1.0
b    1.0
c    1.0
d    0.0
e    NaN
dtype: float64
radd(**kwargs)

Return Addition of series and other, element-wise (binary operator radd).

Equivalent to other + series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.add()
Element-wise Addition, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.add(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64
rank(**kwargs)

pandas.Series.rank() is not implemented yet in the Beam DataFrame API.

If support for ‘rank’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

rdiv(**kwargs)

Return Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.truediv()
Element-wise Floating division, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
rdivmod(**kwargs)

Return Integer division and modulo of series and other, element-wise (binary operator rdivmod).

Equivalent to divmod(other, series), but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

2-Tuple of DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.divmod()
Element-wise Integer division and modulo, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divmod(b, fill_value=0)
(a    1.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64,
 a    0.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64)
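
The example above calls divmod, so the reversed operand order is not visible there. As a rough sketch, with plain Python floats standing in for series elements and no pandas or Beam involved, the difference between the forward and reverse forms looks like this:

```python
# Plain Python floats stand in for series elements in this sketch.
series = [1.0, 2.0, 4.0]
other = 7.0

# divmod: each series element is the dividend.
forward = [divmod(x, other) for x in series]

# rdivmod: each series element is the divisor.
reverse = [divmod(other, x) for x in series]

# Split the pairs into two sequences, mirroring the 2-tuple return shape.
quotients, remainders = zip(*reverse)

print(forward)     # [(0.0, 1.0), (0.0, 2.0), (0.0, 4.0)]
print(quotients)   # (7.0, 3.0, 1.0)
print(remainders)  # (0.0, 1.0, 3.0)
```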
reindex(**kwargs)

pandas.DataFrame.reindex() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

reindex_like(**kwargs)

pandas.Series.reindex_like() is not implemented yet in the Beam DataFrame API.

If support for ‘reindex_like’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

reorder_levels(**kwargs)

Rearrange index levels using input order. May not drop or duplicate levels.

Parameters:
  • order (list of int or list of str) – List representing new level order. Reference level by number (position) or by key (label).
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Where to reorder levels.
Returns:

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> data = {
...     "class": ["Mammals", "Mammals", "Reptiles"],
...     "diet": ["Omnivore", "Carnivore", "Carnivore"],
...     "species": ["Humans", "Dogs", "Snakes"],
... }
>>> df = pd.DataFrame(data, columns=["class", "diet", "species"])
>>> df = df.set_index(["class", "diet"])
>>> df
                                  species
class      diet
Mammals    Omnivore                Humans
           Carnivore                 Dogs
Reptiles   Carnivore               Snakes

Let's reorder the levels of the index:

>>> df.reorder_levels(["diet", "class"])
                                  species
diet      class
Omnivore  Mammals                  Humans
Carnivore Mammals                    Dogs
          Reptiles                 Snakes
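
Conceptually, reordering levels permutes the entries of every index tuple without touching the row data. A minimal sketch under that reading, using plain tuples; this `reorder_levels` function is a hypothetical helper and not the pandas or Beam implementation.

```python
# Hypothetical helper for illustration; not part of pandas or Beam.
def reorder_levels(index, names, order):
    """Permute the levels of every index tuple to follow order.

    order may reference levels by position (int) or by name (str). It must
    mention every level exactly once, so levels are never dropped or
    duplicated.
    """
    positions = [o if isinstance(o, int) else names.index(o) for o in order]
    if sorted(positions) != list(range(len(names))):
        raise ValueError("order must list each level exactly once")
    new_names = [names[p] for p in positions]
    new_index = [tuple(key[p] for p in positions) for key in index]
    return new_index, new_names


index = [
    ("Mammals", "Omnivore"),
    ("Mammals", "Carnivore"),
    ("Reptiles", "Carnivore"),
]
names = ["class", "diet"]

new_index, new_names = reorder_levels(index, names, ["diet", "class"])
print(new_names)
print(new_index)
```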
replace(to_replace, value, limit, method, **kwargs)

Replace values given in to_replace with value.

Values of the DataFrame are replaced with other values dynamically.

This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

Parameters:
  • to_replace (str, regex, list, dict, DeferredSeries, int, float, or None) –

    How to find the values that will be replaced.

    • numeric, str or regex:
      • numeric: numeric values equal to to_replace will be replaced with value
      • str: string exactly matching to_replace will be replaced with value
      • regex: regexes matching to_replace will be replaced with value
    • list of str, regex, or numeric:
      • First, if to_replace and value are both lists, they must be the same length.
      • Second, if regex=True then all of the strings in both lists will be interpreted as regexes; otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
      • str, regex and numeric rules apply as above.
    • dict:
      • Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way, the optional value parameter should not be given.
      • For a DeferredDataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
      • For a DeferredDataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The optional value parameter should not be specified to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
    • None:
      • This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or DeferredSeries of such elements. If value is also None then this must be a nested dictionary or DeferredSeries.

    See the examples section for examples of each of these.

  • value (scalar, dict, list, str, regex, default None) – Value to replace any values matching to_replace with. For a DeferredDataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
  • inplace (bool, default False) – Whether to modify the DeferredDataFrame rather than creating a new one.
  • limit (int, default None) – Maximum size gap to forward or backward fill.
  • regex (bool or same types as to_replace, default False) – Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.
  • method ({'pad', 'ffill', 'bfill'}) – The method to use for replacement when to_replace is a scalar, list or tuple and value is None.
Returns:

Object after replacement.

Return type:

DeferredDataFrame

Raises:
  • AssertionError – If regex is not a bool and to_replace is not None.
  • TypeError –
    • If to_replace is not a scalar, array-like, dict, or None.
    • If to_replace is a dict and value is not a list, dict, ndarray, or DeferredSeries.
    • If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or DeferredSeries.
    • When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced.
  • ValueError – If a list or an ndarray is passed to to_replace and value but they are not the same length.

Differences from pandas

method is not supported in the Beam DataFrame API because it is order-sensitive. It cannot be specified.

If limit is specified this operation is not parallelizable.

See also

DeferredDataFrame.fillna()
Fill NA values.
DeferredDataFrame.where()
Replace values based on boolean condition.
DeferredSeries.str.replace()
Simple string replacement.

Notes

  • Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
  • Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
  • This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
  • When dict is used as the to_replace value, it is like key(s) in the dict are the to_replace part and value(s) in the dict are the value parameter.
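
The first two notes can be illustrated with re.sub directly. This is a simplified sketch that uses plain lists in place of columns; `regex_replace` is a hypothetical helper, and pandas applies more rules than are shown here.

```python
import re


# Hypothetical helper for illustration; pandas applies more rules than this.
def regex_replace(values, pattern, replacement):
    """Apply re.sub to string values and leave everything else untouched."""
    compiled = re.compile(pattern)
    return [
        compiled.sub(replacement, value) if isinstance(value, str) else value
        for value in values
    ]


column_a = ['bat', 'foo', 'bait']
column_b = ['abc', 'bar', 'xyz']
numbers = [1.5, 2.5]

new_a = regex_replace(column_a, r'^ba.$', 'new')
new_b = regex_replace(column_b, r'^ba.$', 'new')

# Numeric values are never matched, even if the pattern looks like the number.
new_numbers = regex_replace(numbers, r'^1\.5$', 'new')

print(new_a)
print(new_b)
print(new_numbers)
```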

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

**Scalar `to_replace` and `value`**

>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s.replace(1, 5)
0    5
1    2
2    3
3    4
4    5
dtype: int64

>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

**List-like `to_replace`**

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

>>> s.replace([1, 2], method='bfill')
0    3
1    3
2    3
3    4
4    5
dtype: int64

**dict-like `to_replace`**

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e

>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e

>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

**Regular expression `to_replace`**

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz

>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz

>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

Compare the behavior of ``s.replace({'a': None})`` and
``s.replace('a', None)`` to understand the peculiarities
of the `to_replace` parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])

When one uses a dict as the `to_replace` value, it is like the
value(s) in the dict are equal to the `value` parameter.
``s.replace({'a': None})`` is equivalent to
``s.replace(to_replace={'a': None}, value=None, method=None)``:

>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object

When ``value`` is not explicitly passed and `to_replace` is a scalar, list
or tuple, `replace` uses the method parameter (default 'pad') to do the
replacement. This is why the 'a' values are replaced by 10 in rows 1 and 2
and by 'b' in row 4 in this case.

>>> s.replace('a')
0    10
1    10
2    10
3     b
4     b
dtype: object

On the other hand, if ``None`` is explicitly passed for ``value``, it will
be respected:

>>> s.replace('a', None)
0      10
1    None
2    None
3       b
4    None
dtype: object

Changed in version 1.4.0: Previously the explicit ``None`` was silently ignored.
resample(**kwargs)

pandas.DataFrame.resample() is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics.

For more information see https://s.apache.org/dataframe-event-time-semantics.

reset_index(level=None, **kwargs)

Reset the index, or a level of it.

Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.

Parameters:
  • level (int, str, tuple, or list, default None) – Only remove the given levels from the index. Removes all levels by default.
  • drop (bool, default False) – Do not try to insert index into dataframe columns. This resets the index to the default integer index.
  • inplace (bool, default False) – Whether to modify the DeferredDataFrame rather than creating a new one.
  • col_level (int or str, default 0) – If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
  • col_fill (object, default '') – If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
  • allow_duplicates (bool, optional, default lib.no_default) –

    Allow duplicate column labels to be created.

    New in version 1.5.0.

  • names (int, str or 1-dimensional list, default None) –

    Using the given string, rename the DeferredDataFrame column which contains the index data. If the DeferredDataFrame has a MultiIndex, this has to be a list or tuple with length equal to the number of levels.

    New in version 1.5.0.

Returns:

DeferredDataFrame with the new index or None if inplace=True.

Return type:

DeferredDataFrame or None

Differences from pandas

Dropping the entire index (e.g. with reset_index(level=None)) is not parallelizable. It is also only guaranteed that the newly generated index values will be unique. The Beam DataFrame API makes no guarantee that the same index values as the equivalent pandas operation will be generated, because that implementation is order-sensitive.

See also

DeferredDataFrame.set_index()
Opposite of reset_index.
DeferredDataFrame.reindex()
Change to new indices or expand indices.
DeferredDataFrame.reindex_like()
Change to same indices as other DeferredDataFrame.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame([('bird', 389.0),
...                    ('bird', 24.0),
...                    ('mammal', 80.5),
...                    ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
>>> df
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN

When we reset the index, the old index is added as a column, and a
new sequential index is used:

>>> df.reset_index()
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN

We can use the `drop` parameter to avoid the old index being added as
a column:

>>> df.reset_index(drop=True)
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN

You can also use `reset_index` with `MultiIndex`.

>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
...                                    ('bird', 'parrot'),
...                                    ('mammal', 'lion'),
...                                    ('mammal', 'monkey')],
...                                   names=['class', 'name'])
>>> columns = pd.MultiIndex.from_tuples([('speed', 'max'),
...                                      ('species', 'type')])
>>> df = pd.DataFrame([(389.0, 'fly'),
...                    (24.0, 'fly'),
...                    (80.5, 'run'),
...                    (np.nan, 'jump')],
...                   index=index,
...                   columns=columns)
>>> df
               speed species
                 max    type
class  name
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump

Using the `names` parameter, choose a name for the index column:

>>> df.reset_index(names=['classes', 'names'])
  classes   names  speed species
                     max    type
0    bird  falcon  389.0     fly
1    bird  parrot   24.0     fly
2  mammal    lion   80.5     run
3  mammal  monkey    NaN    jump

If the index has multiple levels, we can reset a subset of them:

>>> df.reset_index(level='class')
         class  speed species
                  max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump

If we are not dropping the index, by default, it is placed in the top
level. We can place it in another level:

>>> df.reset_index(level='class', col_level=1)
                speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump

When the index is inserted under another level, we can specify under
which one with the parameter `col_fill`:

>>> df.reset_index(level='class', col_level=1, col_fill='species')
              species  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump

If we specify a nonexistent level for `col_fill`, it is created:

>>> df.reset_index(level='class', col_level=1, col_fill='genus')
                genus  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
rfloordiv(**kwargs)

Return Integer division of series and other, element-wise (binary operator rfloordiv).

Equivalent to other // series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.floordiv()
Element-wise Integer division, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.floordiv(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
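
The pandas example above calls floordiv, not the reversed form. As a minimal sketch in plain pandas (eager, not the deferred Beam API, with made-up values), the following shows how rfloordiv swaps the operand roles and how fill_value is applied before the division.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, np.nan], index=["a", "b", "c", "d"])

# s.rfloordiv(8) computes 8 // s, so the series is the divisor.
reversed_result = s.rfloordiv(8)
operator_result = 8 // s

# With fill_value, the missing entry in s is replaced by 2 before dividing.
filled = s.rfloordiv(8, fill_value=2)
```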
rmod(**kwargs)

Return Modulo of series and other, element-wise (binary operator rmod).

Equivalent to other % series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.mod()
Element-wise Modulo, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.mod(b, fill_value=0)
a    0.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64
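
The example above uses mod. A short sketch in plain pandas (not the deferred Beam API, values chosen for illustration) contrasts it with the reversed operator, where the scalar becomes the left operand.

```python
import pandas as pd

s = pd.Series([3, 4, 5], index=["a", "b", "c"])

# s.mod(10) is s % 10, while s.rmod(10) is 10 % s.
forward = s.mod(10)    # 3 % 10, 4 % 10, 5 % 10
reverse = s.rmod(10)   # 10 % 3, 10 % 4, 10 % 5
```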
rmul(**kwargs)

Return Multiplication of series and other, element-wise (binary operator rmul).

Equivalent to other * series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.mul()
Element-wise Multiplication, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
rolling(**kwargs)

pandas.DataFrame.rolling() is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics.

For more information see https://s.apache.org/dataframe-event-time-semantics.

rpow(**kwargs)

Return Exponential power of series and other, element-wise (binary operator rpow).

Equivalent to other ** series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.pow()
Element-wise Exponential power, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.pow(b, fill_value=0)
a    1.0
b    1.0
c    1.0
d    0.0
e    NaN
dtype: float64
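
Operand order matters for exponentiation, so the difference between pow and rpow is easy to see. This is a minimal sketch in plain pandas (eager, not the deferred Beam API) with illustrative values.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

power = s.pow(2)      # s ** 2: the series is the base
exponent = s.rpow(2)  # 2 ** s: the series is the exponent
```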
rsub(**kwargs)

Return Subtraction of series and other, element-wise (binary operator rsub).

Equivalent to other - series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.sub()
Element-wise Subtraction, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64
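
The reversed form also applies when both operands are series. The sketch below uses plain pandas (not the deferred Beam API) and two made-up series with partially overlapping indexes: a.rsub(b) aligns the indexes and computes b - a, filling labels missing from either side with fill_value.

```python
import pandas as pd

a = pd.Series([1.0, 2.0], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Equivalent to b - a after aligning on the union of the two indexes;
# "x" is missing from b and "z" is missing from a, so both are filled with 0.
result = a.rsub(b, fill_value=0)
```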
rtruediv(**kwargs)

Return Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.truediv()
Element-wise Floating division, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
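
With rtruediv the series is the divisor, so zeros in the series produce infinities instead of zeros in the result. A minimal sketch in plain pandas (eager, not the deferred Beam API), with illustrative values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 0.0, 4.0])

# s.rtruediv(2) computes 2 / s element-wise.
result = s.rtruediv(2)
```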
set_flags(**kwargs)

pandas.Series.set_flags() is not implemented yet in the Beam DataFrame API.

If support for ‘set_flags’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on issue 20318.

size

Return an int representing the number of elements in this object.

Return the number of rows if Series. Otherwise return the number of rows times number of columns if DataFrame.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

ndarray.size
Number of elements in the array.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3})
>>> s.size
3

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.size
4
sort_index(axis, **kwargs)

Sort object by labels (along an axis).

Returns a new DataFrame sorted by label if inplace argument is False, otherwise updates the original DataFrame and returns None.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
  • level (int or level name or list of ints or list of level names) – If not None, sort on values in specified index level(s).
  • ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
  • inplace (bool, default False) – Whether to modify the DeferredDataFrame rather than creating a new one.
  • kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more information. mergesort and stable are the only stable algorithms. For DeferredDataFrames, this option is only applied when sorting on a single column or label.
  • na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end. Not implemented for MultiIndex.
  • sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
  • key (callable, optional) –

    If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape. For MultiIndex inputs, the key is applied per level.

    New in version 1.1.0.

Returns:

The original DeferredDataFrame sorted by the labels or None if inplace=True.

Return type:

DeferredDataFrame or None

Differences from pandas

axis=index is not allowed because it imposes an ordering on the dataset, and we cannot guarantee it will be maintained (see https://s.apache.org/dataframe-order-sensitive-operations). Only axis=columns is allowed.
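
The permitted case, sorting along the columns axis, reorders column labels only and leaves the rows alone. The following sketch shows this in plain pandas with a made-up frame. It is not Beam code, and the deferred API would return a deferred result rather than a concrete frame.

```python
import pandas as pd

df = pd.DataFrame({"b": [1, 2], "c": [3, 4], "a": [5, 6]})

# Sorting along axis="columns" reorders the column labels;
# the row order, and therefore the row index, is untouched.
by_columns = df.sort_index(axis="columns")
```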

See also

DeferredSeries.sort_index()
Sort DeferredSeries by the index.
DeferredDataFrame.sort_values()
Sort DeferredDataFrame by the value.
DeferredSeries.sort_values()
Sort DeferredSeries by the value.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150],
...                   columns=['A'])
>>> df.sort_index()
     A
1    4
29   2
100  1
150  5
234  3

By default, it sorts in ascending order. To sort in descending order,
use ``ascending=False``:

>>> df.sort_index(ascending=False)
     A
234  3
150  5
100  1
29   2
1    4

A key function can be specified which is applied to the index before
sorting. For a ``MultiIndex`` this is applied to each level separately.

>>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd'])
>>> df.sort_index(key=lambda x: x.str.lower())
   a
A  1
b  2
C  3
d  4
sort_values(axis, **kwargs)

sort_values is not implemented.

It is not implemented for axis=index because it imposes an ordering on the dataset, and it likely will not be maintained (see https://s.apache.org/dataframe-order-sensitive-operations).

It is not implemented for axis=columns because it makes the order of the columns depend on the data (see https://s.apache.org/dataframe-non-deferred-columns).

sparse

pandas.DataFrame.sparse() is not implemented yet in the Beam DataFrame API.

If support for ‘sparse’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on issue 20902.

squeeze(**kwargs)

pandas.Series.squeeze() is not implemented yet in the Beam DataFrame API.

If support for ‘squeeze’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on issue 20318.

sub(**kwargs)

Return Subtraction of series and other, element-wise (binary operator sub).

Equivalent to series - other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rsub()
Reverse of the Subtraction operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64
subtract(**kwargs)

Return Subtraction of series and other, element-wise (binary operator sub).

Equivalent to series - other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rsub()
Reverse of the Subtraction operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64
swapaxes(**kwargs)

pandas.Series.swapaxes() is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.

For more information see https://s.apache.org/dataframe-non-deferred-columns.

swaplevel(**kwargs)

Swap levels i and j in a MultiIndex.

Default is to swap the two innermost levels of the index.

Parameters:
  • i, j (int or str) – Levels of the indices to be swapped. Can pass level name as string.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to swap levels on. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
Returns:

DeferredDataFrame with levels swapped in MultiIndex.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(
...     {"Grade": ["A", "B", "A", "C"]},
...     index=[
...         ["Final exam", "Final exam", "Coursework", "Coursework"],
...         ["History", "Geography", "History", "Geography"],
...         ["January", "February", "March", "April"],
...     ],
... )
>>> df
                                    Grade
Final exam  History     January      A
            Geography   February     B
Coursework  History     March        A
            Geography   April        C

In the following example, we will swap the levels of the indices.
Here, we swap the levels of the row index (axis=0, the default), but levels
of the column index can be swapped in a similar manner with axis=1.
By not supplying any arguments for i and j, we swap the last and second to
last indices.

>>> df.swaplevel()
                                    Grade
Final exam  January     History         A
            February    Geography       B
Coursework  March       History         A
            April       Geography       C

By supplying one argument, we can choose which index to swap the last
index with. We can for example swap the first index with the last one as
follows.

>>> df.swaplevel(0)
                                    Grade
January     History     Final exam      A
February    Geography   Final exam      B
March       History     Coursework      A
April       Geography   Coursework      C

We can also define explicitly which indices we want to swap by supplying values
for both i and j. Here, we for example swap the first and second indices.

>>> df.swaplevel(0, 1)
                                    Grade
History     Final exam  January         A
Geography   Final exam  February        B
History     Coursework  March           A
Geography   Coursework  April           C
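
The examples above all operate on the row index. As a minimal sketch in plain pandas (eager, not the deferred Beam API), the following swaps the levels of a column MultiIndex by passing axis=1. The frame and the level names "group" and "field" are made up for illustration.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("speed", "max"), ("species", "type")], names=["group", "field"])
df = pd.DataFrame([[389.0, "fly"], [80.5, "run"]],
                  index=["falcon", "lion"], columns=columns)

# Swap the two column levels; the row index is left as it was.
swapped = df.swaplevel(0, 1, axis=1)
```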
to_clipboard(**kwargs)

pandas.DataFrame.to_clipboard() is not implemented yet in the Beam DataFrame API.

If support for ‘to_clipboard’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on issue 20318.

to_csv(path, transform_label=None, *args, **kwargs)

Write object to a comma-separated values (csv) file.

Parameters:
  • path_or_buf (str, path object, file-like object, or None, default None) –

    String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string. If a non-binary file object is passed, it should be opened with newline=’’, disabling universal newlines. If a binary file object is passed, mode might need to contain a ‘b’.

    Changed in version 1.2.0: Support for binary file objects was introduced.

  • sep (str, default ',') – String of length 1. Field delimiter for the output file.
  • na_rep (str, default '') – Missing data representation.
  • float_format (str, Callable, default None) – Format string for floating point numbers. If a Callable is given, it takes precedence over other numeric formatting parameters, like decimal.
  • columns (sequence, optional) – Columns to write.
  • header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
  • index (bool, default True) – Write row names (index).
  • index_label (str or sequence, or False, default None) – Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R.
  • mode (str, default 'w') – Python write mode. The available write modes are the same as open().
  • encoding (str, optional) – A string representing the encoding to use in the output file, defaults to ‘utf-8’. encoding is not supported if path_or_buf is a non-binary file object.
  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    New in version 1.5.0: Added support for .tar files.

    Changed in version 1.0.0: May now be a dict with key ‘method’ as compression mode and other entries as additional compression options if compression mode is ‘zip’.

    Changed in version 1.1.0: Passing compression options as keys in dict is supported for compression modes ‘gzip’, ‘bz2’, ‘zstd’, and ‘zip’.

    Changed in version 1.2.0: Compression is supported for binary file objects.

    Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to gzip.open instead of gzip.GzipFile which prevented setting mtime.

  • quoting (optional constant from csv module) – Defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric.
  • quotechar (str, default '"') – String of length 1. Character used to quote fields.
  • lineterminator (str, optional) –

    The newline character or character sequence to use in the output file. Defaults to os.linesep, which depends on the OS in which this method is called (e.g. ‘\n’ for Linux, ‘\r\n’ for Windows).

    Changed in version 1.5.0: Previously was line_terminator, changed for consistency with read_csv and the standard library ‘csv’ module.

  • chunksize (int or None) – Rows to write at a time.
  • date_format (str, default None) – Format string for datetime objects.
  • doublequote (bool, default True) – Control quoting of quotechar inside a field.
  • escapechar (str, default None) – String of length 1. Character used to escape sep and quotechar when appropriate.
  • decimal (str, default '.') – Character recognized as decimal separator. E.g. use ‘,’ for European data.
  • errors (str, default 'strict') –

    Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

    New in version 1.1.0.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details; further examples of storage options are given in the pandas I/O documentation.

    New in version 1.2.0.

Returns:

If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.

Return type:

None or str

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_csv()
Load a CSV file into a DeferredDataFrame.
to_excel()
Write DeferredDataFrame to an Excel file.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'

Create 'out.zip' containing 'out.csv'

>>> compression_opts = dict(method='zip',
...                         archive_name='out.csv')  
>>> df.to_csv('out.zip', index=False,
...           compression=compression_opts)  

To write a csv file to a new folder or nested folder you will first
need to create it using either Pathlib or os:

>>> from pathlib import Path  
>>> filepath = Path('folder/subfolder/out.csv')  
>>> filepath.parent.mkdir(parents=True, exist_ok=True)  
>>> df.to_csv(filepath)  

>>> import os  
>>> os.makedirs('folder/subfolder', exist_ok=True)  
>>> df.to_csv('folder/subfolder/out.csv')  
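
Several of the formatting parameters listed above can be combined. The sketch below uses plain pandas (eager, not the deferred Beam API) with a made-up frame to show sep, na_rep and float_format together. Because no path is given, the CSV text is returned as a string.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["falcon", "monkey"],
                   "max_speed": [389.0, np.nan]})

# With no path, to_csv returns the CSV text instead of writing a file.
text = df.to_csv(index=False, sep=";", na_rep="NA", float_format="%.1f")
lines = text.splitlines()
```

Splitting with splitlines() keeps the check independent of the platform's line terminator.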
to_excel(path, *args, **kwargs)

Write object to an Excel sheet.

To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an ExcelWriter object with a target file name, and specify a sheet in the file to write to.

Multiple sheets may be written to by specifying unique sheet_name. With all data written to the file it is necessary to save the changes. Note that creating an ExcelWriter object with a file name that already exists will result in the contents of the existing file being erased.

Parameters:
  • excel_writer (path-like, file-like, or ExcelWriter object) – File path or existing ExcelWriter.
  • sheet_name (str, default 'Sheet1') – Name of sheet which will contain DeferredDataFrame.
  • na_rep (str, default '') – Missing data representation.
  • float_format (str, optional) – Format string for floating point numbers. For example float_format="%.2f" will format 0.1234 to 0.12.
  • columns (sequence or list of str, optional) – Columns to write.
  • header (bool or list of str, default True) – Write out the column names. If a list of string is given it is assumed to be aliases for the column names.
  • index (bool, default True) – Write row names (index).
  • index_label (str or sequence, optional) – Column label for index column(s) if desired. If not specified, and header and index are True, then the index names are used. A sequence should be given if the DeferredDataFrame uses MultiIndex.
  • startrow (int, default 0) – Upper left cell row to dump data frame.
  • startcol (int, default 0) – Upper left cell column to dump data frame.
  • engine (str, optional) – Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set this via the options io.excel.xlsx.writer or io.excel.xlsm.writer.
  • merge_cells (bool, default True) – Write MultiIndex and Hierarchical Rows as merged cells.
  • inf_rep (str, default 'inf') – Representation for infinity (there is no native representation for infinity in Excel).
  • freeze_panes (tuple of int (length 2), optional) – Specifies the one-based bottommost row and rightmost column that is to be frozen.
  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details; the pandas I/O documentation gives more examples of storage options.

    New in version 1.2.0.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

to_csv()
Write DeferredDataFrame to a comma-separated values (csv) file.
ExcelWriter()
Class for writing DeferredDataFrame objects into excel sheets.
read_excel()
Read an Excel file into a DeferredDataFrame.
read_csv()
Read a comma-separated values (csv) file into DeferredDataFrame.
io.formats.style.Styler.to_excel()
Add styles to Excel sheet.

Notes

For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing.

Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Create, write to and save a workbook:

>>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                    index=['row 1', 'row 2'],
...                    columns=['col 1', 'col 2'])
>>> df1.to_excel("output.xlsx")  

To specify the sheet name:

>>> df1.to_excel("output.xlsx",
...              sheet_name='Sheet_name_1')  

If you wish to write to more than one sheet in the workbook, it is
necessary to specify an ExcelWriter object:

>>> df2 = df1.copy()
>>> with pd.ExcelWriter('output.xlsx') as writer:  
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')

ExcelWriter can also be used to append to an existing Excel file:

>>> with pd.ExcelWriter('output.xlsx',
...                     mode='a') as writer:  
...     df1.to_excel(writer, sheet_name='Sheet_name_3')

To set the library that is used to write the Excel file,
you can pass the `engine` keyword (the default engine is
automatically chosen depending on the file extension):

>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')  
to_feather(path, *args, **kwargs)

Write a DataFrame to the binary Feather format.

Parameters:
  • path (str, path object, file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If a string or a path, it will be used as Root Directory path when writing a partitioned dataset.
  • **kwargs

    Additional keywords passed to pyarrow.feather.write_feather(). Starting with pyarrow 0.17, this includes the compression, compression_level, chunksize and version keywords.

    New in version 1.1.0.

Differences from pandas

This operation has no known divergences from the pandas API.

Notes

This function writes the dataframe as a feather file and requires a default index. To save a DeferredDataFrame with a custom index, use a method that supports custom indices, e.g. to_parquet.

to_hdf(**kwargs)

pandas.DataFrame.to_hdf() is not yet supported in the Beam DataFrame API because HDF5 is a random access file format.

to_html(path, *args, **kwargs)

Render a DataFrame as an HTML table.

Parameters:
  • buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
  • columns (sequence, optional, default None) – The subset of columns to write. Writes all columns by default.
  • col_space (str or int, list or dict of int or str, optional) – The minimum width of each column in CSS length units. An int is assumed to be px units.
  • header (bool, optional) – Whether to print column labels, default True.
  • index (bool, optional, default True) – Whether to print index (row) labels.
  • na_rep (str, optional, default 'NaN') – String representation of NaN to use.
  • formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.
  • float_format (one-parameter function, optional, default None) –

    Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.

    Changed in version 1.2.0.

  • sparsify (bool, optional, default True) – Set to False for a DeferredDataFrame with a hierarchical index to print every multiindex key at each row.
  • index_names (bool, optional, default True) – Prints the names of the indexes.
  • justify (str, default None) –

    How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are

    • left
    • right
    • center
    • justify
    • justify-all
    • start
    • end
    • inherit
    • match-parent
    • initial
    • unset.
  • max_rows (int, optional) – Maximum number of rows to display in the console.
  • max_cols (int, optional) – Maximum number of columns to display in the console.
  • show_dimensions (bool, default False) – Display DeferredDataFrame dimensions (number of rows by number of columns).
  • decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
  • bold_rows (bool, default True) – Make the row labels bold in the output.
  • classes (str or list or tuple, default None) – CSS class(es) to apply to the resulting html table.
  • escape (bool, default True) – Convert the characters <, >, and & to HTML-safe sequences.
  • notebook ({True, False}, default False) – Whether the generated HTML is for IPython Notebook.
  • border (int) – A border=border attribute is included in the opening <table> tag. Default pd.options.display.html.border.
  • table_id (str, optional) – A css id is included in the opening <table> tag if specified.
  • render_links (bool, default False) – Convert URLs to HTML links.
  • encoding (str, default "utf-8") –

    Set character encoding.

    New in version 1.0.

Returns:

If buf is None, returns the result as a string. Otherwise returns None.

Return type:

str or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

to_string()
Convert DeferredDataFrame to a string.
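
The effect of the escape parameter described above can be illustrated with the standard library. This is only a sketch: it uses Python's html.escape rather than any pandas or Beam call, and the sample cell text is made up. With quote=False, html.escape performs the same three substitutions the parameter description lists (<, >, and &).

```python
import html

# Hypothetical cell content, used only for illustration.
cell_text = "a < b & c > d"

# quote=False limits escaping to the characters <, > and &.
escaped = html.escape(cell_text, quote=False)

print(escaped)
```

Leaving such characters unescaped (escape=False) is only appropriate when the cell contents are trusted HTML.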
to_json(path, orient=None, *args, **kwargs)

Convert the object to a JSON string.

Note that NaN and None values are converted to null, and datetime objects are converted to UNIX timestamps.

Parameters:
  • path_or_buf (str, path object, file-like object, or None, default None) – String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string.
  • orient (str) –

    Indication of expected JSON string format.

    • DeferredSeries:
      • default is ‘index’
      • allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}.
    • DeferredDataFrame:
      • default is ‘columns’
      • allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}.
    • The format of the JSON string:
      • ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
      • ‘records’ : list like [{column -> value}, … , {column -> value}]
      • ‘index’ : dict like {index -> {column -> value}}
      • ‘columns’ : dict like {column -> {index -> value}}
      • ‘values’ : just the values array
      • ‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}}, describing the data, where the data component is like orient='records'.

  • date_format ({None, 'epoch', 'iso'}) – Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.
  • double_precision (int, default 10) – The number of decimal places to use when encoding floating point values.
  • force_ascii (bool, default True) – Force encoded string to be ASCII.
  • date_unit (str, default 'ms' (milliseconds)) – The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.
  • default_handler (callable, default None) – Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.
  • lines (bool, default False) – If orient is ‘records’, write out line-delimited JSON. Raises ValueError for any other orient, since the other formats are not list-like.
  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    New in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • index (bool, default True) – Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.
  • indent (int, optional) – Length of whitespace used to indent each record.
  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details; the pandas I/O documentation gives more examples of storage options.

    New in version 1.2.0.

  • mode (str, default 'w' (writing)) – Specify the IO mode for output when supplying a path_or_buf. Accepted args are ‘w’ (writing) and ‘a’ (append) only. mode='a' is only supported when lines is True and orient is ‘records’.
Returns:

If path_or_buf is None, returns the resulting json format as a string. Otherwise returns None.

Return type:

None or str

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_json()
Convert a JSON string to pandas object.

Notes

The behavior of indent=0 varies from the stdlib, which does not indent the output but does insert newlines. Currently, indent=0 and the default indent=None are equivalent in pandas, though this may change in a future release.

orient='table' contains a ‘pandas_version’ field under ‘schema’. This stores the version of pandas used in the latest revision of the schema.
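
The stdlib behavior that the first note refers to can be checked with Python's json module. This is a minimal sketch using only the standard library; the sample record is made up for illustration and no pandas or Beam call is involved.

```python
import json

# Hypothetical sample record, used only for illustration.
record = {"col 1": "a", "col 2": "b"}

# indent=None (the stdlib default): the whole object on a single line.
compact = json.dumps(record)

# indent=0 in the stdlib: a newline around each item, but no leading spaces.
zero_indent = json.dumps(record, indent=0)

print(repr(compact))
print(repr(zero_indent))
```

Both strings decode to the same object; only the whitespace differs, which is the divergence the note describes.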

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> from json import loads, dumps
>>> df = pd.DataFrame(
...     [["a", "b"], ["c", "d"]],
...     index=["row 1", "row 2"],
...     columns=["col 1", "col 2"],
... )

>>> result = df.to_json(orient="split")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}

Encoding/decoding a Dataframe using ``'records'`` formatted JSON.
Note that index labels are not preserved with this encoding.

>>> result = df.to_json(orient="records")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
[
    {
        "col 1": "a",
        "col 2": "b"
    },
    {
        "col 1": "c",
        "col 2": "d"
    }
]

Encoding/decoding a Dataframe using ``'index'`` formatted JSON:

>>> result = df.to_json(orient="index")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "row 1": {
        "col 1": "a",
        "col 2": "b"
    },
    "row 2": {
        "col 1": "c",
        "col 2": "d"
    }
}

Encoding/decoding a Dataframe using ``'columns'`` formatted JSON:

>>> result = df.to_json(orient="columns")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "col 1": {
        "row 1": "a",
        "row 2": "c"
    },
    "col 2": {
        "row 1": "b",
        "row 2": "d"
    }
}

Encoding/decoding a Dataframe using ``'values'`` formatted JSON:

>>> result = df.to_json(orient="values")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
[
    [
        "a",
        "b"
    ],
    [
        "c",
        "d"
    ]
]

Encoding with Table Schema:

>>> result = df.to_json(orient="table")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "schema": {
        "fields": [
            {
                "name": "index",
                "type": "string"
            },
            {
                "name": "col 1",
                "type": "string"
            },
            {
                "name": "col 2",
                "type": "string"
            }
        ],
        "primaryKey": [
            "index"
        ],
        "pandas_version": "1.4.0"
    },
    "data": [
        {
            "index": "row 1",
            "col 1": "a",
            "col 2": "b"
        },
        {
            "index": "row 2",
            "col 1": "c",
            "col 2": "d"
        }
    ]
}
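
The orient layouts listed under Parameters can also be written out by hand, which may help when deciding which one a downstream consumer expects. The sketch below uses only the standard library and the same labels and values as the examples above; it models the shape of each layout and the line-delimited form selected by lines=True, and is not how pandas or Beam produce them.

```python
import json

index = ["row 1", "row 2"]
columns = ["col 1", "col 2"]
data = [["a", "b"], ["c", "d"]]

# Hand-built equivalents of the layouts described for the orient parameter.
layouts = {
    "split": {"columns": columns, "index": index, "data": data},
    "records": [dict(zip(columns, row)) for row in data],
    "index": {label: dict(zip(columns, row)) for label, row in zip(index, data)},
    "columns": {
        col: {label: row[j] for label, row in zip(index, data)}
        for j, col in enumerate(columns)
    },
    "values": data,
}

# Line-delimited JSON: one 'records' object per line.
json_lines = "\n".join(json.dumps(rec) for rec in layouts["records"])
print(json_lines)
```

Only the ‘records’ layout is a flat list of row objects, which is why it is the one that can be written one object per line.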
to_latex(**kwargs)

pandas.Series.to_latex() is not implemented yet in the Beam DataFrame API.

If support for ‘to_latex’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_markdown(**kwargs)

pandas.Series.to_markdown() is not implemented yet in the Beam DataFrame API.

If support for ‘to_markdown’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_msgpack(**kwargs)

pandas.DataFrame.to_msgpack() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

to_parquet(path, *args, **kwargs)

Write a DataFrame to the binary parquet format.

This function writes the dataframe as a parquet file. You can choose different parquet backends, and have the option of compression. See the user guide for more details.

Parameters:
  • path (str, path object, file-like object, or None, default None) –

    String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If None, the result is returned as bytes. If a string or path, it will be used as Root Directory path when writing a partitioned dataset.

    Changed in version 1.2.0.

    Previously this was “fname”.

  • engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.
  • compression ({'snappy', 'gzip', 'brotli', None}, default 'snappy') – Name of the compression to use. Use None for no compression.
  • index (bool, default None) – If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, similar to True the dataframe’s index(es) will be saved. However, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.
  • partition_cols (list, optional, default None) – Column names by which to partition the dataset. Columns are partitioned in the order they are given. Must be None if path is not a string.
  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details; the pandas I/O documentation gives more examples of storage options.

    New in version 1.2.0.

  • **kwargs – Additional arguments passed to the parquet library. See pandas io for more details.
Returns:

bytes if no path argument is provided, else None.

Return type:

bytes or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_parquet()
Read a parquet file.
DeferredDataFrame.to_orc()
Write an orc file.
DeferredDataFrame.to_csv()
Write a csv file.
DeferredDataFrame.to_sql()
Write to a sql table.
DeferredDataFrame.to_hdf()
Write to hdf.

Notes

This function requires either the fastparquet or pyarrow library.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_parquet('df.parquet.gzip',
...               compression='gzip')  
>>> pd.read_parquet('df.parquet.gzip')  
   col1  col2
0     1     3
1     2     4

If you want to get a buffer to the parquet content you can use an io.BytesIO
object, as long as you don't use partition_cols, which creates multiple files.

>>> import io
>>> f = io.BytesIO()
>>> df.to_parquet(f)
>>> f.seek(0)
0
>>> content = f.read()
to_period(**kwargs)

pandas.Series.to_period() is not implemented yet in the Beam DataFrame API.

If support for ‘to_period’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_pickle(**kwargs)

pandas.Series.to_pickle() is not implemented yet in the Beam DataFrame API.

If support for ‘to_pickle’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_sql(**kwargs)

pandas.Series.to_sql() is not implemented yet in the Beam DataFrame API.

If support for ‘to_sql’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_stata(path, *args, **kwargs)

Export DataFrame object to Stata dta format.

Writes the DataFrame to a Stata dataset file. “dta” files contain a Stata dataset.

Parameters:
  • path (str, path object, or buffer) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function.
  • convert_dates (dict) – Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are ‘tc’, ‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to ‘tc’. Raises NotImplementedError if a datetime column has timezone information.
  • write_index (bool) – Write the index to Stata dataset.
  • byteorder (str) – Can be “>”, “<”, “little”, or “big”. Default is sys.byteorder.
  • time_stamp (datetime) – A datetime to use as file creation date. Default is the current time.
  • data_label (str, optional) – A label for the data set. Must be 80 characters or smaller.
  • variable_labels (dict) – Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller.
  • version ({114, 117, 118, 119, None}, default 114) –

    Version to use in the output dta file. Set to None to let pandas decide between 118 or 119 formats depending on the number of columns in the frame. pandas Version 114 can be read by Stata 10 and later. pandas Version 117 can be read by Stata 13 or later. pandas Version 118 is supported in Stata 14 and later. pandas Version 119 is supported in Stata 15 and later. pandas Version 114 limits string variables to 244 characters or fewer while versions 117 and later allow strings with lengths up to 2,000,000 characters. Versions 118 and 119 support Unicode characters, and pandas version 119 supports more than 32,767 variables.

    pandas Version 119 should usually only be used when the number of variables exceeds the capacity of dta format 118. Exporting smaller datasets in format 119 may have unintended consequences, and, as of November 2020, Stata SE cannot read pandas version 119 files.

  • convert_strl (list, optional) – List of column names to convert to string columns to Stata StrL format. Only available if version is 117. Storing strings in the StrL format can produce smaller dta files if strings have more than 8 characters and values are repeated.
  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    New in version 1.5.0: Added support for .tar files.

    New in version 1.1.0.

    Changed in version 1.4.0: Zstandard support.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details; the pandas I/O documentation gives more examples of storage options.

    New in version 1.2.0.

  • value_labels (dict of dicts) –

    Dictionary containing columns as keys and dictionaries of column value to labels as values. Labels for a single variable must be 32,000 characters or smaller.

    New in version 1.4.0.

Raises:
  • NotImplementedError –
    • If datetimes contain timezone information
    • Column dtype is not representable in Stata
  • ValueError –
    • Columns listed in convert_dates are neither datetime64[ns] nor datetime.datetime
    • Column listed in convert_dates is not in DeferredDataFrame
    • Categorical label contains more than 32,000 characters

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_stata()
Import Stata data files.
io.stata.StataWriter()
Low-level writer for Stata data files.
io.stata.StataWriter117()
Low-level writer for pandas version 117 files.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon',
...                               'parrot'],
...                    'speed': [350, 18, 361, 15]})
>>> df.to_stata('animals.dta')  
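
The version parameter above says that version=None chooses between formats 118 and 119 based on the number of columns, and that only 119 supports more than 32,767 variables. A rough sketch of that rule follows. The helper name infer_dta_version and the constant are hypothetical, introduced only for illustration; this is not the pandas or Beam implementation.

```python
# Largest variable (column) count that format 118 can hold, per the
# parameter description above. The constant name is an assumption.
MAX_VARIABLES_FORMAT_118 = 32_767


def infer_dta_version(n_columns: int) -> int:
    """Hypothetical helper: pick 118 unless the column count requires 119."""
    return 118 if n_columns <= MAX_VARIABLES_FORMAT_118 else 119


print(infer_dta_version(2))
print(infer_dta_version(40_000))
```

Preferring 118 whenever it fits follows the advice above that format 119 should usually be reserved for frames that exceed the capacity of format 118.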
to_timestamp(**kwargs)

pandas.Series.to_timestamp() is not implemented yet in the Beam DataFrame API.

If support for ‘to_timestamp’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_xarray(**kwargs)

pandas.DataFrame.to_xarray() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

transform(**kwargs)

Call func on self producing a DataFrame with the same axis shape as self.

Parameters:
  • func (function, str, list-like or dict-like) –

    Function to use for transforming the data. If a function, must either work when passed a DeferredDataFrame or when passed to DeferredDataFrame.apply. If func is both list-like and dict-like, dict-like behavior takes precedence.

    Accepted combinations are:

    • function
    • string function name
    • list-like of functions and/or function names, e.g. [np.exp, 'sqrt']
    • dict-like of axis labels -> functions, function names or list-like of such.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
  • *args – Positional arguments to pass to func.
  • **kwargs – Keyword arguments to pass to func.
Returns:

A DeferredDataFrame that must have the same length as self.

Return type:

DeferredDataFrame

Raises:

ValueError : If the returned DeferredDataFrame has a different length than self.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.agg()
Only perform aggregating type operations.
DeferredDataFrame.apply()
Invoke function on a DeferredDataFrame.

Notes

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
>>> df
   A  B
0  0  1
1  1  2
2  2  3
>>> df.transform(lambda x: x + 1)
   A  B
0  1  2
1  2  3
2  3  4

Even though the resulting DataFrame must have the same length as the
input DataFrame, it is possible to provide several input functions:

>>> s = pd.Series(range(3))
>>> s
0    0
1    1
2    2
dtype: int64
>>> s.transform([np.sqrt, np.exp])
       sqrt        exp
0  0.000000   1.000000
1  1.000000   2.718282
2  1.414214   7.389056

You can call transform on a GroupBy object:

>>> df = pd.DataFrame({
...     "Date": [
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"],
...     "Data": [5, 8, 6, 1, 50, 100, 60, 120],
... })
>>> df
         Date  Data
0  2015-05-08     5
1  2015-05-07     8
2  2015-05-06     6
3  2015-05-05     1
4  2015-05-08    50
5  2015-05-07   100
6  2015-05-06    60
7  2015-05-05   120
>>> df.groupby('Date')['Data'].transform('sum')
0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data, dtype: int64

>>> df = pd.DataFrame({
...     "c": [1, 1, 1, 2, 2, 2, 2],
...     "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
   c type
0  1    m
1  1    n
2  1    o
3  2    m
4  2    m
5  2    n
6  2    n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
   c type size
0  1    m    3
1  1    n    3
2  1    o    3
3  2    m    4
4  2    m    4
5  2    n    4
6  2    n    4
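
The grouped transform('sum') example above can be read as "aggregate per group, then broadcast each group's result back to its rows", which is why the output has the same length as the input. The following is a plain-Python model of that idea using the same Date and Data values; it uses only the standard library and does not call pandas or Beam.

```python
from collections import defaultdict

dates = [
    "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
    "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
]
data = [5, 8, 6, 1, 50, 100, 60, 120]

# Step 1: aggregate, one total per distinct date.
totals = defaultdict(int)
for date, value in zip(dates, data):
    totals[date] += value

# Step 2: broadcast, one output value per input row, in the original order.
transformed = [totals[date] for date in dates]

print(transformed)
```

An aggregation alone would return one value per group (four here); the broadcast step is what preserves the original eight rows.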
truediv(**kwargs)

Return Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rtruediv()
Reverse of the Floating division operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
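
The fill_value rule in the parameter description (fill a missing value on one side, but leave the result missing when both sides are missing) can be modelled in plain Python. The function truediv_with_fill below is a hypothetical illustration over {label: float} dictionaries, not the pandas or Beam implementation; it mimics float division by zero instead of raising, and it ignores the sign of a zero divisor.

```python
import math


def truediv_with_fill(left, right, fill_value=None):
    """Model left / right on {label: float} dicts with fill_value semantics."""

    def is_missing(values, label):
        return label not in values or math.isnan(values[label])

    def divide(x, y):
        # Mimic float division by zero (inf or NaN) instead of raising.
        if y == 0:
            if x == 0 or math.isnan(x):
                return math.nan
            return math.copysign(math.inf, x)
        return x / y

    result = {}
    for label in sorted(set(left) | set(right)):
        left_missing = is_missing(left, label)
        right_missing = is_missing(right, label)
        if (left_missing and right_missing) or (
            fill_value is None and (left_missing or right_missing)
        ):
            # Missing on both sides, or nothing to fill with: stays missing.
            result[label] = math.nan
            continue
        x = fill_value if left_missing else left[label]
        y = fill_value if right_missing else right[label]
        result[label] = divide(x, y)
    return result


a = {"a": 1.0, "b": 1.0, "c": 1.0, "d": math.nan}
b = {"a": 1.0, "b": math.nan, "d": 1.0, "e": math.nan}
quotient = truediv_with_fill(a, b, fill_value=0)
print(quotient)
```

With the same inputs as the example above, label 'e' stays NaN because it is absent from a and NaN in b, while 'b' and 'c' become inf because the missing divisor is filled with 0.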
truncate(before, after, axis)

Truncate a Series or DataFrame before and after some index value.

This is a useful shorthand for boolean indexing based on index values above or below certain thresholds.

Parameters:
  • before (date, str, int) – Truncate all rows before this index value.
  • after (date, str, int) – Truncate all rows after this index value.
  • axis ({0 or 'index', 1 or 'columns'}, optional) – Axis to truncate. Truncates the index (rows) by default. For DeferredSeries this parameter is unused and defaults to 0.
  • copy (bool, default True) – Return a copy of the truncated section.
Returns:

The truncated DeferredSeries or DeferredDataFrame.

Return type:

type of caller

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.loc()
Select a subset of a DeferredDataFrame by label.
DeferredDataFrame.iloc()
Select a subset of a DeferredDataFrame by position.

Notes

If the index being truncated contains only datetime values, before and after may be specified as strings instead of Timestamps.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
...                    'B': ['f', 'g', 'h', 'i', 'j'],
...                    'C': ['k', 'l', 'm', 'n', 'o']},
...                   index=[1, 2, 3, 4, 5])
>>> df
   A  B  C
1  a  f  k
2  b  g  l
3  c  h  m
4  d  i  n
5  e  j  o

>>> df.truncate(before=2, after=4)
   A  B  C
2  b  g  l
3  c  h  m
4  d  i  n

The columns of a DataFrame can be truncated.

>>> df.truncate(before="A", after="B", axis="columns")
   A  B
1  a  f
2  b  g
3  c  h
4  d  i
5  e  j

For Series, only rows can be truncated.

>>> df['A'].truncate(before=2, after=4)
2    b
3    c
4    d
Name: A, dtype: object
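
The description of truncate as a shorthand for boolean indexing can be illustrated with a minimal pandas sketch (plain pandas, not Beam code). The names truncated and masked are illustrative only, and the index is assumed to be sorted, which truncate requires.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
                   'B': ['f', 'g', 'h', 'i', 'j']},
                  index=[1, 2, 3, 4, 5])

# truncate keeps rows whose index label lies in the closed interval [2, 4].
truncated = df.truncate(before=2, after=4)

# The same selection written as an explicit boolean mask on the index.
masked = df[(df.index >= 2) & (df.index <= 4)]

print(truncated.equals(masked))  # True
```

Both bounds are inclusive, so labels 2 and 4 appear in the result.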

The index values in ``truncate`` can be datetimes or string
dates.

>>> dates = pd.date_range('2016-01-01', '2016-02-01', freq='s')
>>> df = pd.DataFrame(index=dates, data={'A': 1})
>>> df.tail()
                     A
2016-01-31 23:59:56  1
2016-01-31 23:59:57  1
2016-01-31 23:59:58  1
2016-01-31 23:59:59  1
2016-02-01 00:00:00  1

>>> df.truncate(before=pd.Timestamp('2016-01-05'),
...             after=pd.Timestamp('2016-01-10')).tail()
                     A
2016-01-09 23:59:56  1
2016-01-09 23:59:57  1
2016-01-09 23:59:58  1
2016-01-09 23:59:59  1
2016-01-10 00:00:00  1

Because the index is a DatetimeIndex containing only dates, we can
specify `before` and `after` as strings. They will be coerced to
Timestamps before truncation.

>>> df.truncate('2016-01-05', '2016-01-10').tail()
                     A
2016-01-09 23:59:56  1
2016-01-09 23:59:57  1
2016-01-09 23:59:58  1
2016-01-09 23:59:59  1
2016-01-10 00:00:00  1

Note that ``truncate`` assumes a 0 value for any unspecified time
component (midnight). This differs from partial string slicing, which
returns any partially matching dates.

>>> df.loc['2016-01-05':'2016-01-10', :].tail()
                     A
2016-01-10 23:59:55  1
2016-01-10 23:59:56  1
2016-01-10 23:59:57  1
2016-01-10 23:59:58  1
2016-01-10 23:59:59  1
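
The difference between truncate and partial string slicing is easier to see on a small index. The following is a sketch in plain pandas with a hypothetical four-row Series; it is not Beam code.

```python
import pandas as pd

idx = pd.DatetimeIndex(['2016-01-09 12:00', '2016-01-10 00:00',
                        '2016-01-10 12:00', '2016-01-11 00:00'])
s = pd.Series(range(4), index=idx)

# truncate reads '2016-01-10' as midnight, so 12:00 on that day is dropped.
by_truncate = s.truncate(after='2016-01-10')

# Partial string slicing reads '2016-01-10' as the whole day, so 12:00 stays.
by_slice = s.loc[:'2016-01-10']

print(len(by_truncate))  # 2
print(len(by_slice))     # 3
```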
tz_convert(**kwargs)

Convert tz-aware axis to target time zone.

Parameters:
  • tz (str or tzinfo object or None) – Target time zone. Passing None will convert to UTC and remove the timezone information.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to convert.
  • level (int, str, default None) – If axis is a MultiIndex, convert a specific level. Otherwise must be None.
  • copy (bool, default True) – Also make a copy of the underlying data.
Returns:

Object with time zone converted axis.

Return type:

DeferredSeries/DeferredDataFrame

Raises:

TypeError – If the axis is tz-naive.

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Change to another time zone:

>>> s = pd.Series(
...     [1],
...     index=pd.DatetimeIndex(['2018-09-15 01:30:00+02:00']),
... )
>>> s.tz_convert('Asia/Shanghai')
2018-09-15 07:30:00+08:00    1
dtype: int64

Pass None to convert to UTC and get a tz-naive index:

>>> s = pd.Series([1],
...     index=pd.DatetimeIndex(['2018-09-15 01:30:00+02:00']))
>>> s.tz_convert(None)
2018-09-14 23:30:00    1
dtype: int64
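
As a further sketch in plain pandas (not Beam code), the snippet below checks two points from the description above: conversion changes only the wall-clock label, not the instant, and a tz-naive index raises TypeError. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1], index=pd.DatetimeIndex(['2018-09-15 01:30:00+02:00']))

# Converting changes the wall-clock label but not the underlying instant.
utc = s.tz_convert('UTC')
print(utc.index[0])                # 2018-09-14 23:30:00+00:00
print(utc.index[0] == s.index[0])  # True

# A tz-naive index cannot be converted; it has to be localized first.
naive = pd.Series([1], index=pd.DatetimeIndex(['2018-09-15 01:30:00']))
try:
    naive.tz_convert('UTC')
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```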
tz_localize(ambiguous, **kwargs)

Localize tz-naive index of a Series or DataFrame to target time zone.

This operation localizes the Index. To localize the values in a timezone-naive Series, use Series.dt.tz_localize().
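
The distinction between localizing the index and localizing the values can be sketched in plain pandas as follows (not Beam code; the names by_index and by_values are illustrative).

```python
import pandas as pd

stamps = pd.to_datetime(['2018-09-15 01:30:00', '2018-09-15 02:30:00'])

# Series.tz_localize acts on the index and leaves the values untouched.
by_index = pd.Series([1, 2], index=stamps).tz_localize('UTC')
print(by_index.index.tz)  # UTC

# Series.dt.tz_localize acts on datetime values and leaves the index untouched.
by_values = pd.Series(stamps).dt.tz_localize('UTC')
print(by_values.dt.tz)    # UTC
```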

Parameters:
  • tz (str or tzinfo or None) – Time zone to localize. Passing None will remove the time zone information and preserve local time.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to localize.
  • level (int, str, default None) – If axis is a MultiIndex, localize a specific level. Otherwise must be None.
  • copy (bool, default True) – Also make a copy of the underlying data.
  • ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –

    When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.

    • 'infer' will attempt to infer fall dst-transition hours based on order
    • bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
    • 'NaT' will return NaT where there are ambiguous times
    • 'raise' will raise an AmbiguousTimeError if there are ambiguous times.
  • nonexistent (str, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST. Valid values are:

    • 'shift_forward' will shift the nonexistent time forward to the closest existing time
    • 'shift_backward' will shift the nonexistent time backward to the closest existing time
    • 'NaT' will return NaT where there are nonexistent times
    • timedelta objects will shift nonexistent times by the timedelta
    • 'raise' will raise a NonExistentTimeError if there are nonexistent times.
Returns:

Same type as the input.

Return type:

DeferredSeries/DeferredDataFrame

Raises:

TypeError – If the time series is tz-aware and tz is not None.

Differences from pandas

ambiguous cannot be set to "infer" as its semantics are order-sensitive. Similarly, specifying ambiguous as an ndarray is order-sensitive, but you can achieve similar functionality by specifying ambiguous as a Series.
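
To illustrate why a positional boolean array is order-sensitive, here is a minimal sketch in plain pandas (not Beam code). It localizes the same two ambiguous timestamps with the flags in opposite orders. The variable names are illustrative, and the example assumes the 'CET' zone is available in the local time zone database.

```python
import numpy as np
import pandas as pd

# 02:30 on 2018-10-28 occurs twice in CET: once in DST, once after the switch.
idx = pd.DatetimeIndex(['2018-10-28 02:30:00', '2018-10-28 02:30:00'])
s = pd.Series([0, 1], index=idx)

# Flags are matched to rows by position: True means "this row is DST".
first_dst = s.tz_localize('CET', ambiguous=np.array([True, False]))
second_dst = s.tz_localize('CET', ambiguous=np.array([False, True]))

print(first_dst.index[0].utcoffset())   # 2:00:00
print(second_dst.index[0].utcoffset())  # 1:00:00
```

Because the result for a given row depends on where that row sits, the outcome would change if the rows were reordered. A deferred, distributed API cannot guarantee row order, which is presumably why a flag carried alongside each row (a Series) is the suggested alternative.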

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Localize local times:

>>> s = pd.Series(
...     [1],
...     index=pd.DatetimeIndex(['2018-09-15 01:30:00']),
... )
>>> s.tz_localize('CET')
2018-09-15 01:30:00+02:00    1
dtype: int64

Pass None to convert to tz-naive index and preserve local time:

>>> s = pd.Series([1],
...     index=pd.DatetimeIndex(['2018-09-15 01:30:00+02:00']))
>>> s.tz_localize(None)
2018-09-15 01:30:00    1
dtype: int64

Be careful with DST changes. When there is sequential data, pandas
can infer the DST time:

>>> s = pd.Series(range(7),
...               index=pd.DatetimeIndex(['2018-10-28 01:30:00',
...                                       '2018-10-28 02:00:00',
...                                       '2018-10-28 02:30:00',
...                                       '2018-10-28 02:00:00',
...                                       '2018-10-28 02:30:00',
...                                       '2018-10-28 03:00:00',
...                                       '2018-10-28 03:30:00']))
>>> s.tz_localize('CET', ambiguous='infer')
2018-10-28 01:30:00+02:00    0
2018-10-28 02:00:00+02:00    1
2018-10-28 02:30:00+02:00    2
2018-10-28 02:00:00+01:00    3
2018-10-28 02:30:00+01:00    4
2018-10-28 03:00:00+01:00    5
2018-10-28 03:30:00+01:00    6
dtype: int64

In some cases, inferring the DST is impossible. In such cases, you can
pass an ndarray to the ambiguous parameter to set the DST explicitly:

>>> s = pd.Series(range(3),
...               index=pd.DatetimeIndex(['2018-10-28 01:20:00',
...                                       '2018-10-28 02:36:00',
...                                       '2018-10-28 03:46:00']))
>>> s.tz_localize('CET', ambiguous=np.array([True, True, False]))
2018-10-28 01:20:00+02:00    0
2018-10-28 02:36:00+02:00    1
2018-10-28 03:46:00+01:00    2
dtype: int64

If the