apache_beam.dataframe.frames module

Analogs for pandas.DataFrame and pandas.Series: DeferredDataFrame and DeferredSeries.

These classes are effectively wrappers around a schema-aware PCollection that provide a set of operations compatible with the pandas API.

Note that we aim for the Beam DataFrame API to be completely compatible with the pandas API, but there are some features that are currently unimplemented for various reasons. Pay particular attention to the ‘Differences from pandas’ section for each operation to understand where we diverge.
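
To make "deferred" concrete, here is a toy sketch in plain Python. It is not Beam's implementation: `ToyDeferredSeries`, `map`, and `evaluate` are hypothetical names invented for this illustration. The object only records operations, and nothing is computed until data is supplied. This is roughly why operations whose result must be known immediately (such as `shape`) are documented as unsupported.

```python
class ToyDeferredSeries:
    """Hypothetical stand-in that records operations instead of running them."""

    def __init__(self, steps=()):
        self._steps = tuple(steps)

    def map(self, fn):
        # Return a new plan with one more element-wise step appended.
        return ToyDeferredSeries(self._steps + (fn,))

    def between(self, left, right):
        # Mirrors the inclusive default of Series.between.
        return self.map(lambda x: left <= x <= right)

    def evaluate(self, values):
        # Only now is any data touched: apply the recorded steps in order.
        result = list(values)
        for fn in self._steps:
            result = [fn(x) for x in result]
        return result


# Building the plan runs nothing; evaluate() supplies the data.
plan = ToyDeferredSeries().map(lambda x: x * 2).between(2, 6)
result = plan.evaluate([0, 1, 2, 3, 4])
print(result)  # [False, True, True, True, False]
```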

class apache_beam.dataframe.frames.DeferredSeries(expr)[source]

Bases: DeferredDataFrameOrSeries

property name

Return the name of the Series.

The name of a Series becomes its index or column name if it is used to form a DataFrame. It is also used whenever displaying the Series using the interpreter.

Returns:

The name of the DeferredSeries, also the column name if part of a DeferredDataFrame.

Return type:

label (hashable object)

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.rename

Sets the DeferredSeries name when given a scalar input.

Index.name

Corresponding Index property.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

The Series name can be set initially when calling the constructor.

>>> s = pd.Series([1, 2, 3], dtype=np.int64, name='Numbers')
>>> s
0    1
1    2
2    3
Name: Numbers, dtype: int64
>>> s.name = "Integers"
>>> s
0    1
1    2
2    3
Name: Integers, dtype: int64

The name of a Series within a DataFrame is its column name.

>>> df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
...                   columns=["Odd Numbers", "Even Numbers"])
>>> df
   Odd Numbers  Even Numbers
0            1             2
1            3             4
2            5             6
>>> df["Even Numbers"].name
'Even Numbers'
property hasnans

Return True if there are any NaNs.

Enables various performance speedups.

Return type:

bool

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, None])
>>> s
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
>>> s.hasnans
True
property dtype

Return the dtype object of the underlying data.

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s.dtype
dtype('int64')
property dtypes

Return the dtype object of the underlying data.

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s.dtypes
dtype('int64')
keys()[source]

Return alias for index.

Returns:

Index of the DeferredSeries.

Return type:

Index

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3], index=[0, 1, 2])
>>> s.keys()
Index([0, 1, 2], dtype='int64')
T(**kwargs)

Return the transpose, which is by definition self.

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

For Series:

>>> s = pd.Series(['Ant', 'Bear', 'Cow'])
>>> s
0     Ant
1    Bear
2     Cow
dtype: object
>>> s.T
0     Ant
1    Bear
2     Cow
dtype: object

For Index:

>>> idx = pd.Index([1, 2, 3])
>>> idx.T
Index([1, 2, 3], dtype='int64')
transpose(**kwargs)

Return the transpose, which is by definition self.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

property shape

pandas.Series.shape() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

append(to_append, ignore_index, verify_integrity, **kwargs)[source]

This method has been removed in current versions of pandas (2.0 and later).

align(other, join, axis, level, method, **kwargs)[source]

Align two objects on their axes with the specified join method.

Join method is specified for each axis Index.

Parameters:
  • other (DeferredDataFrame or DeferredSeries)

  • join ({'outer', 'inner', 'left', 'right'}, default 'outer') –

    Type of alignment to be performed.

    • left: use only keys from left frame, preserve key order.

    • right: use only keys from right frame, preserve key order.

    • outer: use union of keys from both frames, sort keys lexicographically.

    • inner: use intersection of keys from both frames, preserve the order of the left keys.

  • axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).

  • level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

  • fill_value (scalar, default np.nan) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

  • method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –

    Method to use for filling holes in reindexed DeferredSeries:

    • pad / ffill: propagate last valid observation forward to next valid.

    • backfill / bfill: use NEXT valid observation to fill gap.

    Deprecated since version 2.1.

  • limit (int, default None) –

    If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

    Deprecated since version 2.1.

  • fill_axis ({0 or 'index'} for DeferredSeries, {0 or 'index', 1 or 'columns'} for DeferredDataFrame, default 0) –

    Filling axis, method and limit.

    Deprecated since version 2.1.

  • broadcast_axis ({0 or 'index'} for DeferredSeries, {0 or 'index', 1 or 'columns'} for DeferredDataFrame, default None) –

    Broadcast values along this axis, if aligning two objects of different dimensions.

    Deprecated since version 2.1.

Returns:

Aligned objects.

Return type:

tuple of (DeferredSeries/DeferredDataFrame, type of other)

Differences from pandas

Aligning per-level is not yet supported. Only the default, level=None, is allowed.

Filling NaN values via method is not supported, because it is order-sensitive. Only the default, method=None, is allowed.
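
The four join methods differ only in which keys survive and in what order. The following plain-Python sketch illustrates the key selection described in the parameter list. It models a series as a dict from label to value; `aligned_keys` and `reindex` are hypothetical helpers written for this illustration, not part of Beam or pandas.

```python
def aligned_keys(left_keys, right_keys, join="outer"):
    """Return the keys both aligned objects end up with for the given join."""
    if join == "left":
        return list(left_keys)
    if join == "right":
        return list(right_keys)
    if join == "inner":
        # Intersection, preserving the order of the left keys.
        right_set = set(right_keys)
        return [key for key in left_keys if key in right_set]
    if join == "outer":
        # Union, sorted.
        return sorted(set(left_keys) | set(right_keys))
    raise ValueError(f"unknown join method: {join!r}")


def reindex(series, keys, fill_value=float("nan")):
    """Reindex a dict-based 'series' onto keys, filling missing entries."""
    return {key: series.get(key, fill_value) for key in keys}


left = {1: 10.0, 2: 20.0}
right = {2: 1.0, 3: 2.0, 4: 3.0}
keys = aligned_keys(left, right, join="outer")  # [1, 2, 3, 4]
left_aligned = reindex(left, keys)    # labels 3 and 4 are filled with NaN
right_aligned = reindex(right, keys)  # label 1 is filled with NaN
```

After alignment both objects carry the same labels in the same order, which is what lets element-wise operations line up.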

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other
    A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900

Align on columns:

>>> left, right = df.align(other, join="outer", axis=1)
>>> left
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right
    A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN

We can also align on the index:

>>> left, right = df.align(other, join="outer", axis=0)
>>> left
    D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right
    A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0

Finally, the default `axis=None` will align on both index and columns:

>>> left, right = df.align(other, join="outer", axis=None)
>>> left
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
argsort(**kwargs)

pandas.Series.argsort() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

property array

pandas.Series.array() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

get(**kwargs)

pandas.Series.get() is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.

For more information see https://s.apache.org/dataframe-non-deferred-columns.

ravel(**kwargs)

pandas.Series.ravel() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

slice_shift(**kwargs)

pandas.Series.slice_shift() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

tshift(**kwargs)

pandas.Series.tshift() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

rename(**kwargs)

Alter Series index labels or name.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

Alternatively, change Series.name with a scalar value.

See the user guide for more.

Parameters:
  • index (scalar, hashable sequence, dict-like or function, optional) – Functions or dict-like are transformations to apply to the index. Scalar or hashable sequence-like will alter the DeferredSeries.name attribute.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

  • copy (bool, default True) – Also copy underlying data.

  • inplace (bool, default False) – Whether to return a new DeferredSeries. If True the value of copy is ignored.

  • level (int or level name, default None) – In case of MultiIndex, only rename labels in the specified level.

  • errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise KeyError when a dict-like mapper or index contains labels that are not present in the index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.

Returns:

DeferredSeries with index labels or name altered or None if inplace=True.

Return type:

DeferredSeries or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.rename

Corresponding DeferredDataFrame method.

DeferredSeries.rename_axis

Set the name of the axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.rename("my_name")  # scalar, changes Series.name
0    1
1    2
2    3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2)  # function, changes labels
0    1
1    2
4    3
dtype: int64
>>> s.rename({1: 3, 2: 5})  # mapping, changes labels
0    1
3    2
5    3
dtype: int64
between(**kwargs)

Return boolean Series equivalent to left <= series <= right.

This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.

Parameters:
  • left (scalar or list-like) – Left boundary.

  • right (scalar or list-like) – Right boundary.

  • inclusive ({"both", "neither", "left", "right"}) –

    Include boundaries. Whether to set each bound as closed or open.

    Changed in version 1.3.0.

Returns:

DeferredSeries representing whether each element is between left and right (inclusive).

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.gt

Greater than of series and other.

DeferredSeries.lt

Less than of series and other.

Notes

This function is equivalent to (left <= ser) & (ser <= right)
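
As a rough illustration of how `inclusive` changes that comparison, here is a plain-Python sketch over a list of values. The function name and list-based interface are assumptions made for this example; only float NaN is handled as a missing value, relying on the fact that every comparison with NaN is False.

```python
def between(values, left, right, inclusive="both"):
    """Element-wise range check with open or closed bounds."""
    if inclusive not in ("both", "neither", "left", "right"):
        raise ValueError(f"unknown inclusive option: {inclusive!r}")
    closed_left = inclusive in ("both", "left")
    closed_right = inclusive in ("both", "right")
    result = []
    for x in values:
        lower_ok = left <= x if closed_left else left < x
        upper_ok = x <= right if closed_right else x < right
        # NaN fails both comparisons, so it always yields False.
        result.append(lower_ok and upper_ok)
    return result


values = [2, 0, 4, 8, float("nan")]
both = between(values, 1, 4)                          # [True, False, True, False, False]
neither = between(values, 1, 4, inclusive="neither")  # [True, False, False, False, False]
```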

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([2, 0, 4, 8, np.nan])

Boundary values are included by default:

>>> s.between(1, 4)
0     True
1    False
2     True
3    False
4    False
dtype: bool

With `inclusive` set to ``"neither"`` boundary values are excluded:

>>> s.between(1, 4, inclusive="neither")
0     True
1    False
2    False
3    False
4    False
dtype: bool

`left` and `right` can be any scalar value:

>>> s = pd.Series(['Alice', 'Bob', 'Carol', 'Eve'])
>>> s.between('Anna', 'Daniel')
0    False
1     True
2     True
3    False
dtype: bool
add_suffix(**kwargs)

Suffix labels with string suffix.

For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.

Parameters:
  • suffix (str) – The string to add after each label.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) –

    Axis to add suffix on

    Added in version 2.0.0.

Returns:

New DeferredSeries or DeferredDataFrame with updated labels.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.add_prefix

Prefix row labels with string prefix.

DeferredDataFrame.add_prefix

Prefix column labels with string prefix.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_suffix('_item')
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_suffix('_col')
     A_col  B_col
0       1       3
1       2       4
2       3       5
3       4       6
add_prefix(**kwargs)

Prefix labels with string prefix.

For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.

Parameters:
  • prefix (str) – The string to add before each label.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) –

    Axis to add prefix on

    Added in version 2.0.0.

Returns:

New DeferredSeries or DeferredDataFrame with updated labels.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.add_suffix

Suffix row labels with string suffix.

DeferredDataFrame.add_suffix

Suffix column labels with string suffix.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_prefix('item_')
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_prefix('col_')
     col_A  col_B
0       1       3
1       2       4
2       3       5
3       4       6
info(**kwargs)

pandas.Series.info() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

idxmin(**kwargs)[source]

Return the row label of the minimum value.

If multiple values equal the minimum, the first row label with that value is returned.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

  • skipna (bool, default True) – Exclude NA/null values. If the entire DeferredSeries is NA, the result will be NA.

  • *args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Label of the minimum value.

Return type:

Index

Raises:

ValueError – If the DeferredSeries is empty.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.argmin

Return indices of the minimum values along the given axis.

DeferredDataFrame.idxmin

Return index of first occurrence of minimum over requested axis.

DeferredSeries.idxmax

Return index label of the first occurrence of maximum of values.

Notes

This method is the DeferredSeries version of ndarray.argmin. This method returns the label of the minimum, while ndarray.argmin returns the position. To get the position, use series.values.argmin().
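
The label-versus-position distinction and the effect of `skipna` can be sketched in plain Python. This is an illustration only: the `idxmin` function below is a hypothetical helper taking parallel lists of labels and float values, and returning NaN when `skipna=False` follows the example shown in this section (other pandas versions may behave differently).

```python
import math


def idxmin(labels, values, skipna=True):
    """Return the label of the first minimum among float values."""
    if not values:
        raise ValueError("attempt to get idxmin of an empty sequence")
    if not skipna and any(math.isnan(v) for v in values):
        return math.nan
    # Pair each valid value with its position; ties go to the lowest position.
    candidates = [(v, pos) for pos, v in enumerate(values) if not math.isnan(v)]
    if not candidates:
        return math.nan
    _, position = min(candidates)
    return labels[position]


labels = ["A", "B", "C", "D"]
values = [1.0, math.nan, 4.0, 1.0]
label = idxmin(labels, values)         # 'A' (a label, not a position)
position = labels.index(label)         # 0 (the corresponding position)
strict = idxmin(labels, values, skipna=False)  # nan
```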

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(data=[1, None, 4, 1],
...               index=['A', 'B', 'C', 'D'])
>>> s
A    1.0
B    NaN
C    4.0
D    1.0
dtype: float64

>>> s.idxmin()
'A'

If `skipna` is False and there is an NA value in the data,
the function returns ``nan``.

>>> s.idxmin(skipna=False)
nan
idxmax(**kwargs)[source]

Return the row label of the maximum value.

If multiple values equal the maximum, the first row label with that value is returned.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

  • skipna (bool, default True) – Exclude NA/null values. If the entire DeferredSeries is NA, the result will be NA.

  • *args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Label of the maximum value.

Return type:

Index

Raises:

ValueError – If the DeferredSeries is empty.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.argmax

Return indices of the maximum values along the given axis.

DeferredDataFrame.idxmax

Return index of first occurrence of maximum over requested axis.

DeferredSeries.idxmin

Return index label of the first occurrence of minimum of values.

Notes

This method is the DeferredSeries version of ndarray.argmax. This method returns the label of the maximum, while ndarray.argmax returns the position. To get the position, use series.values.argmax().

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(data=[1, None, 4, 3, 4],
...               index=['A', 'B', 'C', 'D', 'E'])
>>> s
A    1.0
B    NaN
C    4.0
D    3.0
E    4.0
dtype: float64

>>> s.idxmax()
'C'

If `skipna` is False and there is an NA value in the data,
the function returns ``nan``.

>>> s.idxmax(skipna=False)
nan
explode(ignore_index)[source]

Transform each element of a list-like to a row.

Parameters:

ignore_index (bool, default False) – If True, the resulting index will be labeled 0, 1, …, n - 1.

Returns:

Exploded lists to rows; index will be duplicated for these rows.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.str.split

Split string values on specified separator.

DeferredSeries.unstack

Unstack, a.k.a. pivot, DeferredSeries with MultiIndex to produce DeferredDataFrame.

DeferredDataFrame.melt

Unpivot a DeferredDataFrame from wide format to long format.

DeferredDataFrame.explode

Explode a DeferredDataFrame from list-like columns to long format.

Notes

This routine will explode list-likes including lists, tuples, sets, DeferredSeries, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of elements in the output will be non-deterministic when exploding sets.

Reference the user guide for more examples.
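
The rules in the note above can be sketched in a few lines of plain Python. The `explode` function here is a hypothetical list-based stand-in, not the Beam or pandas implementation; it treats lists, tuples and sets as list-likes and leaves everything else (including strings) unchanged.

```python
import math


def explode(index, values, ignore_index=False):
    """Return (label, item) rows, one per element of each list-like value."""
    rows = []
    for label, value in zip(index, values):
        if isinstance(value, (list, tuple, set)):
            # An empty list-like still produces one row, holding NaN.
            items = list(value) or [math.nan]
        else:
            # Scalars, including strings, pass through unchanged.
            items = [value]
        rows.extend((label, item) for item in items)
    if ignore_index:
        rows = [(i, item) for i, (_, item) in enumerate(rows)]
    return rows


rows = explode([0, 1, 2, 3], [[1, 2, 3], "foo", [], [3, 4]])
exploded_labels = [label for label, _ in rows]  # [0, 0, 0, 1, 2, 3, 3]
```

Note how the original label is repeated for every element that came from the same row, unless `ignore_index=True` relabels the rows 0, 1, …, n - 1.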

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
>>> s
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object

>>> s.explode()
0      1
0      2
0      3
1    foo
2    NaN
3      3
3      4
dtype: object
dot(other)[source]

Compute the matrix multiplication between the DataFrame and other.

This method computes the matrix product between the DataFrame and the values of another Series, DataFrame or a numpy array.

It can also be called using self @ other.

Parameters:

other (DeferredSeries, DeferredDataFrame or array-like) – The other object to compute the matrix product with.

Returns:

If other is a DeferredSeries, return the matrix product between self and other as a DeferredSeries. If other is a DeferredDataFrame or a numpy.array, return the matrix product of self and other in a DeferredDataFrame or a np.array.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

other must be a DeferredDataFrame or DeferredSeries instance. Computing the dot product with an array-like is not supported because it is order-sensitive.

See also

DeferredSeries.dot

Similar method for DeferredSeries.

Notes

The dimensions of DeferredDataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of DeferredDataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.

The dot method for DeferredSeries computes the inner product, instead of the matrix product here.
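
Because operands are aligned on labels before multiplying, the order in which labels appear does not affect the result. A minimal plain-Python sketch of that idea, using dicts from label to value (`dot_aligned` is a hypothetical helper for this illustration):

```python
def dot_aligned(left, right):
    """Inner product of two label -> value mappings, matched on labels."""
    if set(left) != set(right):
        raise ValueError("labels are not aligned")
    return sum(left[label] * right[label] for label in left)


# Each row maps column label -> value; s maps index label -> value.
rows = [{0: 0, 1: 1, 2: -2, 3: -1}, {0: 1, 1: 1, 2: 1, 3: 1}]
s = {0: 1, 1: 1, 2: 2, 3: 1}
s_shuffled = {1: 1, 0: 1, 2: 2, 3: 1}  # same labels, different order

product = [dot_aligned(row, s) for row in rows]                    # [-4, 5]
product_shuffled = [dot_aligned(row, s_shuffled) for row in rows]  # [-4, 5]
```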

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Here we multiply a DataFrame with a Series.

>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> s = pd.Series([1, 1, 2, 1])
>>> df.dot(s)
0    -4
1     5
dtype: int64

Here we multiply a DataFrame with another DataFrame.

>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(other)
    0   1
0   1   4
1   2   2

Note that the dot method gives the same result as @

>>> df @ other
    0   1
0   1   4
1   2   2

The dot method also works if other is an np.array.

>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(arr)
    0   1
0   1   4
1   2   2

Note how shuffling of the objects does not change the result.

>>> s2 = s.reindex([1, 0, 2, 3])
>>> df.dot(s2)
0    -4
1     5
dtype: int64
nunique(**kwargs)[source]

Return number of unique elements in the object.

Excludes NA values by default.

Parameters:

dropna (bool, default True) – Don’t include NaN in the count.

Return type:

int

Differences from pandas

This operation has no known divergences from the pandas API.
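
To illustrate what `dropna` changes, here is a plain-Python sketch. The `nunique` and `_is_na` helpers are hypothetical and assume that `None` and float NaN are the only missing values; with `dropna=False`, all missing values together count as one extra element.

```python
import math


def _is_na(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nunique(values, dropna=True):
    """Count distinct values, optionally counting missing values once."""
    distinct = {v for v in values if not _is_na(v)}
    has_na = any(_is_na(v) for v in values)
    return len(distinct) + (1 if has_na and not dropna else 0)


count = nunique([1, 3, 5, 7, 7])                               # 4
with_na = nunique([1, math.nan, math.nan, None], dropna=False)  # 2
without_na = nunique([1, math.nan, math.nan, None])             # 1
```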

See also

DeferredDataFrame.nunique

Method nunique for DeferredDataFrame.

DeferredSeries.count

Count non-NA/null observations in the DeferredSeries.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 3, 5, 7, 7])
>>> s
0    1
1    3
2    5
3    7
4    7
dtype: int64

>>> s.nunique()
4
quantile(q, **kwargs)[source]

Return value at the given quantile.

Parameters:
  • q (float or array-like, default 0.5 (50% quantile)) – The quantile(s) to compute, which can lie in range: 0 <= q <= 1.

  • interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –

    This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:

    • linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.

    • lower: i.

    • higher: j.

    • nearest: i or j whichever is nearest.

    • midpoint: (i + j) / 2.

Returns:

If q is an array, a DeferredSeries will be returned where the index is q and the values are the quantiles, otherwise a float will be returned.

Return type:

float or DeferredSeries

Differences from pandas

quantile is not parallelizable. See Issue 20933 tracking the possible addition of an approximate, parallelizable implementation of quantile.
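
The interpolation options are easier to compare in code. The sketch below is a plain-Python illustration of the formulas in the parameter list, not the Beam or pandas implementation. The `quantile` function name, the use of `q * (n - 1)` as the fractional index, and the tie handling for `nearest` are assumptions of this sketch.

```python
import math


def quantile(values, q, interpolation="linear"):
    """Quantile of a list of numbers using the named interpolation method."""
    if not 0 <= q <= 1:
        raise ValueError("q must satisfy 0 <= q <= 1")
    data = sorted(values)
    if not data:
        return math.nan
    position = q * (len(data) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    i, j = data[lower], data[upper]
    fraction = position - lower
    if interpolation == "linear":
        return i + (j - i) * fraction
    if interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "midpoint":
        return (i + j) / 2
    if interpolation == "nearest":
        # Tie handling when fraction == 0.5 is an assumption of this sketch.
        return data[round(position)]
    raise ValueError(f"unknown interpolation: {interpolation!r}")


data = [1, 2, 3, 4]
median = quantile(data, 0.5)                               # 2.5
quartiles = [quantile(data, q) for q in (0.25, 0.5, 0.75)]  # [1.75, 2.5, 3.25]
```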

See also

core.window.Rolling.quantile

Calculate the rolling quantile.

numpy.percentile

Returns the q-th percentile(s) of the array elements.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> s = pd.Series([1, 2, 3, 4])
>>> s.quantile(.5)
2.5
>>> s.quantile([.25, .5, .75])
0.25    1.75
0.50    2.50
0.75    3.25
dtype: float64
std(*args, **kwargs)[source]

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) – For DeferredSeries this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

Return type:

scalar or DeferredSeries (if level specified)

Differences from pandas

This operation has no known divergences from the pandas API.

Notes

To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
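
The effect of `ddof` on the divisor can be checked by hand. The following plain-Python sketch (hypothetical `var` and `std` helpers, not the Beam or pandas implementation) divides the sum of squared deviations by N - ddof:

```python
import math


def var(values, ddof=1):
    """Variance with divisor N - ddof."""
    n = len(values)
    if n - ddof <= 0:
        return math.nan
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


def std(values, ddof=1):
    """Standard deviation as the square root of var()."""
    return math.sqrt(var(values, ddof))


ages = [21, 25, 62, 43]
sample_var = var(ages)              # 1058.75 / 3, about 352.916667
population_var = var(ages, ddof=0)  # 1058.75 / 4 = 264.6875
sample_std = std(ages)              # about 18.786076
```

The same `ages` values appear in the `std` and `var` examples in this section, so the results can be compared directly.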

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                    'age': [21, 25, 62, 43],
...                    'height': [1.61, 1.87, 1.49, 2.01]}
...                   ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

The standard deviation of the columns can be found as follows:

>>> df.std()
age       18.786076
height     0.237417
dtype: float64

Alternatively, `ddof=0` can be set to normalize by N instead of N-1:

>>> df.std(ddof=0)
age       16.269219
height     0.205609
dtype: float64
mean(skipna, **kwargs)[source]

Return the mean of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s.mean()
2.0

With a DataFrame:

>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra'])
>>> df
       a   b
tiger  1   2
zebra  2   3
>>> df.mean()
a   1.5
b   2.5
dtype: float64

Using axis=1:

>>> df.mean(axis=1)
tiger   1.5
zebra   2.5
dtype: float64

In this case, `numeric_only` should be set to `True` to avoid getting an error.

>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']},
...                   index=['tiger', 'zebra'])
>>> df.mean(numeric_only=True)
a   1.5
dtype: float64

Differences from pandas

This operation has no known divergences from the pandas API.

var(axis, skipna, level, ddof, **kwargs)[source]

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) – For DeferredSeries this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

Return type:

scalar or DeferredSeries (if level specified)

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                   'age': [21, 25, 62, 43],
...                   'height': [1.61, 1.87, 1.49, 2.01]}
...                  ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

>>> df.var()
age       352.916667
height      0.056367
dtype: float64

Alternatively, ``ddof=0`` can be set to normalize by N instead of N-1:

>>> df.var(ddof=0)
age       264.687500
height      0.042275
dtype: float64
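
The effect of ddof can be checked by hand. The following plain-Python sketch is not how Beam or pandas compute the result; the `variance` helper is a name made up here for illustration. It only shows the divisor N - ddof described above, applied to the `age` column.

```python
def variance(values, ddof=1):
    """Variance with divisor N - ddof (ddof=1 gives the unbiased estimate)."""
    n = len(values)
    if n - ddof <= 0:
        return float("nan")  # not enough observations for this ddof
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - ddof)


ages = [21, 25, 62, 43]
sample_var = variance(ages)              # divisor N - 1 = 3
population_var = variance(ages, ddof=0)  # divisor N = 4
print(round(sample_var, 6), population_var)
```

The two results agree with the `age` values shown for `df.var()` and `df.var(ddof=0)` above.
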
corr(other, method, min_periods)[source]

Compute correlation with other Series, excluding missing values.

The two Series objects are not required to be the same length and will be aligned internally before the correlation function is applied.

Parameters:
  • other (DeferredSeries) – DeferredSeries with which to compute the correlation.

  • method ({'pearson', 'kendall', 'spearman'} or callable) –

    Method used to compute correlation:

    • pearson : Standard correlation coefficient

    • kendall : Kendall Tau correlation coefficient

    • spearman : Spearman rank correlation

    • callable: Callable with input two 1d ndarrays and returning a float.

    Warning

    Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

  • min_periods (int, optional) – Minimum number of observations needed to have a valid result.

Returns:

Correlation with other.

Return type:

float

Differences from pandas

Only method='pearson' is currently parallelizable.

See also

DeferredDataFrame.corr

Compute pairwise correlation between columns.

DeferredDataFrame.corrwith

Compute pairwise correlation with another DeferredDataFrame or DeferredSeries.

Notes

Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.

Automatic data alignment: as with all pandas operations, automatic data alignment is performed for this method. corr() automatically considers values with matching indices.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> s1 = pd.Series([.2, .0, .6, .2])
>>> s2 = pd.Series([.3, .6, .0, .1])
>>> s1.corr(s2, method=histogram_intersection)
0.3

Pandas auto-aligns the values with matching indices

>>> s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
>>> s2 = pd.Series([1, 2, 3], index=[2, 1, 0])
>>> s1.corr(s2)
-1.0
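
One way to see why the Pearson method suits parallel execution is that it can be derived from a handful of sums that combine by simple addition. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not a description of Beam's actual implementation, and the helper names (`partial_sums`, `merge`, `pearson`) are hypothetical. Rank-based methods such as Kendall and Spearman depend on the ordering of the whole dataset, so they do not decompose this way.

```python
import math


def partial_sums(pairs):
    """Summarize one partition of (x, y) pairs as six running sums."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)


def merge(a, b):
    """Combine two partition summaries; element-wise addition is order-free."""
    return tuple(u + v for u, v in zip(a, b))


def pearson(summary):
    """Pearson correlation from the merged sums."""
    n, sx, sy, sxx, syy, sxy = summary
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / math.sqrt(var_x * var_y)


# Align the two series on their index first, as in the example above.
s1 = {0: 1, 1: 2, 2: 3}
s2 = {2: 1, 1: 2, 0: 3}
pairs = [(s1[k], s2[k]) for k in s1.keys() & s2.keys()]

# Pretend the aligned pairs live in two separate partitions.
left, right = pairs[:1], pairs[1:]
r = pearson(merge(partial_sums(left), partial_sums(right)))
print(r)  # -1.0, matching s1.corr(s2) above
```

This textbook formulation is numerically naive for large or poorly scaled data; it is meant only to show that the per-partition summaries can be merged in any order.
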
skew(axis, skipna, level, numeric_only, **kwargs)[source]

Return unbiased skew over requested axis.

Normalized by N-1.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

Examples

>>> s = pd.Series([1, 2, 3])
>>> s.skew()
0.0

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [1, 3, 5]},
...                  index=['tiger', 'zebra', 'cow'])
>>> df
        a   b   c
tiger   1   2   1
zebra   2   3   3
cow     3   4   5
>>> df.skew()
a   0.0
b   0.0
c   0.0
dtype: float64

Using axis=1

>>> df.skew(axis=1)
tiger   1.732051
zebra  -1.732051
cow     0.000000
dtype: float64

In this case, `numeric_only` should be set to `True` to avoid
getting an error.

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['T', 'Z', 'X']},
...                  index=['tiger', 'zebra', 'cow'])
>>> df.skew(numeric_only=True)
a   0.0
dtype: float64

Differences from pandas

This operation has no known divergences from the pandas API.
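
For readers who want to check the numbers above by hand, here is a plain-Python sketch of the bias-adjusted (Fisher-Pearson) sample skewness. That this is the formula behind the documented values is an assumption, verified only against the examples above. The `skew` helper is hypothetical and is not Beam's implementation.

```python
import math


def skew(values):
    """Adjusted Fisher-Pearson sample skewness (needs at least 3 values)."""
    n = len(values)
    if n < 3:
        return float("nan")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data: treated as zero skew in this sketch
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


print(round(skew([1, 2, 1]), 6))  # the 'tiger' row from df.skew(axis=1)
print(round(skew([2, 3, 3]), 6))  # the 'zebra' row
```
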

kurtosis(axis, skipna, level, numeric_only, **kwargs)[source]

Return unbiased kurtosis over requested axis.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

Examples

>>> s = pd.Series([1, 2, 2, 3], index=['cat', 'dog', 'dog', 'mouse'])
>>> s
cat    1
dog    2
dog    2
mouse  3
dtype: int64
>>> s.kurt()
1.5

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]},
...                   index=['cat', 'dog', 'dog', 'mouse'])
>>> df
       a   b
  cat  1   3
  dog  2   4
  dog  2   4
mouse  3   4
>>> df.kurt()
a   1.5
b   4.0
dtype: float64

With axis=None

>>> df.kurt(axis=None).round(6)
-0.988693

Using axis=1

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [3, 4], 'd': [1, 2]},
...                   index=['cat', 'dog'])
>>> df.kurt(axis=1)
cat   -6.0
dog   -6.0
dtype: float64

Differences from pandas

This operation has no known divergences from the pandas API.
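
The values in the examples above can be reproduced with the standard bias-adjusted excess kurtosis formula (Fisher's definition, so a normal distribution scores 0.0). The plain-Python sketch below assumes that formula and was checked only against the documented outputs. The `kurtosis` helper is a hypothetical name, not the Beam implementation.

```python
def kurtosis(values):
    """Bias-adjusted excess kurtosis (needs at least 4 values)."""
    n = len(values)
    if n < 4:
        return float("nan")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        return 0.0  # constant data: treated as zero in this sketch
    g2 = m4 / m2 ** 2 - 3  # excess kurtosis before the small-sample adjustment
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)


print(kurtosis([1, 2, 2, 3]))  # column 'a' in the example above
print(kurtosis([3, 4, 4, 4]))  # column 'b'
```
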

kurt(*args, **kwargs)[source]

Return unbiased kurtosis over requested axis.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

Examples

>>> s = pd.Series([1, 2, 2, 3], index=['cat', 'dog', 'dog', 'mouse'])
>>> s
cat    1
dog    2
dog    2
mouse  3
dtype: int64
>>> s.kurt()
1.5

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]},
...                   index=['cat', 'dog', 'dog', 'mouse'])
>>> df
       a   b
  cat  1   3
  dog  2   4
  dog  2   4
mouse  3   4
>>> df.kurt()
a   1.5
b   4.0
dtype: float64

With axis=None

>>> df.kurt(axis=None).round(6)
-0.988693

Using axis=1

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [3, 4], 'd': [1, 2]},
...                   index=['cat', 'dog'])
>>> df.kurt(axis=1)
cat   -6.0
dog   -6.0
dtype: float64

Differences from pandas

This operation has no known divergences from the pandas API.

cov(other, min_periods, ddof)[source]

Compute covariance with Series, excluding missing values.

The two Series objects are not required to be the same length and will be aligned internally before the covariance is calculated.

Parameters:
  • other (DeferredSeries) – DeferredSeries with which to compute the covariance.

  • min_periods (int, optional) – Minimum number of observations needed to have a valid result.

  • ddof (int, default 1) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Returns:

Covariance between DeferredSeries and other normalized by N-1 (unbiased estimator).

Return type:

float

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.cov

Compute pairwise covariance of columns.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s1 = pd.Series([0.90010907, 0.13484424, 0.62036035])
>>> s2 = pd.Series([0.12528585, 0.26962463, 0.51111198])
>>> s1.cov(s2)
-0.01685762652715874
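
As a rough illustration of how ddof and min_periods interact, the following plain-Python sketch computes the covariance of two sequences that are assumed to be already aligned, using `None` to stand in for a missing value. The `covariance` helper and its handling of edge cases are assumptions made for this example and do not describe Beam's implementation.

```python
def covariance(xs, ys, ddof=1, min_periods=1):
    """Covariance of two aligned sequences; None marks a missing value."""
    # Keep only positions where both observations are present.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < min_periods or n - ddof <= 0:
        return float("nan")  # too few complete observations
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return cross / (n - ddof)


s1 = [0.90010907, 0.13484424, 0.62036035]
s2 = [0.12528585, 0.26962463, 0.51111198]
print(covariance(s1, s2))
```
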
dropna(**kwargs)[source]

Return a new Series with missing values removed.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

  • inplace (bool, default False) – If True, do operation inplace and return None.

  • how (str, optional) – Not in use. Kept for compatibility.

  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    Added in version 2.0.0.

Returns:

DeferredSeries with NA entries dropped from it or None if inplace=True.

Return type:

DeferredSeries or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.isna

Indicate missing values.

DeferredSeries.notna

Indicate existing (non-missing) values.

DeferredSeries.fillna

Replace missing values.

DeferredDataFrame.dropna

Drop rows or columns which contain NA values.

Index.dropna

Drop missing indices.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> ser = pd.Series([1., 2., np.nan])
>>> ser
0    1.0
1    2.0
2    NaN
dtype: float64

Drop NA values from a Series.

>>> ser.dropna()
0    1.0
1    2.0
dtype: float64

Empty strings are not considered NA values. ``None`` is considered an
NA value.

>>> ser = pd.Series([np.nan, 2, pd.NaT, '', None, 'I stay'])
>>> ser
0       NaN
1         2
2       NaT
3
4      None
5    I stay
dtype: object
>>> ser.dropna()
1         2
3
5    I stay
dtype: object
set_axis(labels, **kwargs)[source]

Assign desired index to given axis.

Indexes for row labels can be changed by assigning a list-like or Index.

Parameters:
  • labels (list-like, Index) – The values for the new index.

  • axis ({0 or 'index'}, default 0) – The axis to update. The value 0 identifies the rows. For DeferredSeries this parameter is unused and defaults to 0.

  • copy (bool, default True) –

    Whether to make a copy of the underlying data.

    Added in version 1.5.0.

Returns:

An object of type DeferredSeries.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

Series.rename_axis

Alter the name of the index.

Examples

>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64

>>> s.set_axis(['a', 'b', 'c'], axis=0)
a    1
b    2
c    3
dtype: int64
isnull(**kwargs)

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:

Mask of bool values for each element in DeferredSeries that indicates whether an element is an NA value.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.isnull

Alias of isna.

DeferredSeries.notna

Boolean inverse of isna.

DeferredSeries.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()
0    False
1    False
2     True
dtype: bool
isna(**kwargs)

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:

Mask of bool values for each element in DeferredSeries that indicates whether an element is an NA value.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.isnull

Alias of isna.

DeferredSeries.notna

Boolean inverse of isna.

DeferredSeries.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()
0    False
1    False
2     True
dtype: bool
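
The rule described above (None and NaN are missing; empty strings and infinities are not) can be mimicked in a few lines of plain Python. This is only a sketch with a hypothetical `is_na` helper; it does not model pandas-specific markers such as `pd.NaT`, and it is not how Beam evaluates the operation.

```python
import math


def is_na(value):
    """True for None and float NaN; everything else counts as present."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


values = [5.0, float("nan"), None, "", float("inf")]
mask = [is_na(v) for v in values]   # element-wise analogue of isna()
inverse = [not m for m in mask]     # element-wise analogue of notna()
print(mask)
```
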
notnull(**kwargs)

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:

Mask of bool values for each element in DeferredSeries that indicates whether an element is not an NA value.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.notnull

Alias of notna.

DeferredSeries.isna

Boolean inverse of notna.

DeferredSeries.dropna

Omit axes labels with missing values.

notna

Top-level notna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()
0     True
1     True
2    False
dtype: bool
notna(**kwargs)

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:

Mask of bool values for each element in DeferredSeries that indicates whether an element is not an NA value.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.notnull

Alias of notna.

DeferredSeries.isna

Boolean inverse of notna.

DeferredSeries.dropna

Omit axes labels with missing values.

notna

Top-level notna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()
0     True
1     True
2    False
dtype: bool
items(**kwargs)

pandas.Series.items() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

iteritems(**kwargs)

pandas.Series.iteritems() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

tolist(**kwargs)

pandas.Series.tolist() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_numpy(**kwargs)

pandas.Series.to_numpy() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_string(**kwargs)

pandas.Series.to_string() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.
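
To make the distinction concrete, the toy class below shows what "deferred" means in spirit: operations return new objects that describe pending work, while a method such as tolist() would have to hand back concrete data before anything has run. `ToyDeferredSeries` is entirely hypothetical and far simpler than the real classes, and the `NotImplementedError` is only a stand-in for whatever error the Beam DataFrame API actually raises.

```python
class ToyDeferredSeries:
    """Hypothetical stand-in: records work to do instead of holding data."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable run at execution time

    def dropna(self):
        # Deferred result: wrap the pending computation in a new deferred object.
        return ToyDeferredSeries(
            lambda: [x for x in self._compute() if x is not None])

    def tolist(self):
        # A plain list would need the data now, before anything has executed.
        raise NotImplementedError("tolist() would produce a non-deferred result")

    def run(self):
        """Stand-in for executing the pipeline."""
        return self._compute()


series = ToyDeferredSeries(lambda: [1, None, 3])
cleaned = series.dropna()  # nothing has been computed yet
result = cleaned.run()     # the data only exists once execution happens
print(result)
```
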

duplicated(keep)[source]

Indicate duplicate Series values.

Duplicated values are indicated as True values in the resulting Series. Either all duplicates, all except the first or all except the last occurrence of duplicates can be indicated.

Parameters:

keep ({'first', 'last', False}, default 'first') –

Method to handle dropping duplicates:

  • 'first' : Mark duplicates as True except for the first occurrence.

  • 'last' : Mark duplicates as True except for the last occurrence.

  • False : Mark all duplicates as True.

Returns:

DeferredSeries indicating whether each value has occurred in the preceding values.

Return type:

DeferredSeries[bool]

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note that keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about which duplicate element is kept.

See also

Index.duplicated

Equivalent method on pandas.Index.

DeferredDataFrame.duplicated

Equivalent method on DeferredDataFrame.

DeferredSeries.drop_duplicates

Remove duplicate values from DeferredSeries.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

By default, for each set of duplicated values, the first occurrence is
marked False and all others are marked True:

>>> animals = pd.Series(['llama', 'cow', 'llama', 'beetle', 'llama'])
>>> animals.duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool

which is equivalent to

>>> animals.duplicated(keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool

By using 'last', the last occurrence of each set of duplicated values
is marked False and all others are marked True:

>>> animals.duplicated(keep='last')
0     True
1    False
2     True
3    False
4    False
dtype: bool

By setting keep to ``False``, all duplicates are marked True:

>>> animals.duplicated(keep=False)
0     True
1    False
2     True
3    False
4     True
dtype: bool
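
The keep=False case does not depend on element order, which is what distinguishes it from 'first' and 'last'. A minimal plain-Python sketch of that idea follows; the `duplicated_all` helper is hypothetical and is not Beam's implementation. It counts occurrences of each value and flags every element whose value appears more than once.

```python
from collections import Counter


def duplicated_all(values):
    """Analogue of duplicated(keep=False): flag every repeated value."""
    counts = Counter(values)  # per-value counts do not depend on order
    return [counts[v] > 1 for v in values]


animals = ['llama', 'cow', 'llama', 'beetle', 'llama']
mask = duplicated_all(animals)
print(mask)

# Reordering the input changes positions, but not the flag attached to each value.
reordered = list(reversed(animals))
flags_by_value = dict(zip(reordered, duplicated_all(reordered)))
print(flags_by_value)
```
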
drop_duplicates(keep)[source]

Return Series with duplicate values removed.

Parameters:
  • keep ({'first', 'last', False}, default 'first') –

    Method to handle dropping duplicates:

    • 'first' : Drop duplicates except for the first occurrence.

    • 'last' : Drop duplicates except for the last occurrence.

    • False : Drop all duplicates.

  • inplace (bool, default False) – If True, performs operation inplace and returns None.

  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    Added in version 2.0.0.

Returns:

DeferredSeries with duplicates dropped or None if inplace=True.

Return type:

DeferredSeries or None

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note that keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about which duplicate element is kept.

See also

Index.drop_duplicates

Equivalent method on Index.

DeferredDataFrame.drop_duplicates

Equivalent method on DeferredDataFrame.

DeferredSeries.duplicated

Related method on DeferredSeries, indicating duplicate DeferredSeries values.

DeferredSeries.unique

Return unique values as an array.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Generate a Series with duplicated entries.

>>> s = pd.Series(['llama', 'cow', 'llama', 'beetle', 'llama', 'hippo'],
...               name='animal')
>>> s
0     llama
1       cow
2     llama
3    beetle
4     llama
5     hippo
Name: animal, dtype: object

With the 'keep' parameter, the selection behaviour of duplicated values
can be changed. The value 'first' keeps the first occurrence for each
set of duplicated entries. The default value of keep is 'first'.

>>> s.drop_duplicates()
0     llama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object

The value 'last' for parameter 'keep' keeps the last occurrence for
each set of duplicated entries.

>>> s.drop_duplicates(keep='last')
1       cow
3    beetle
4     llama
5     hippo
Name: animal, dtype: object

The value ``False`` for parameter 'keep' discards all sets of
duplicated entries.

>>> s.drop_duplicates(keep=False)
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
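
The difference between the two supported options can be sketched in plain Python. The helpers below are hypothetical and say nothing about how Beam implements the operation; they only mirror the stated semantics: keep="any" retains exactly one element per distinct value with no promise about which one, while keep=False discards every value that occurs more than once.

```python
from collections import Counter


def drop_duplicates_any(values):
    """Analogue of keep="any": one element per distinct value, order unspecified."""
    return list(set(values))


def drop_duplicates_none(values):
    """Analogue of keep=False: drop every value that occurs more than once."""
    counts = Counter(values)
    return [v for v in values if counts[v] == 1]


s = ['llama', 'cow', 'llama', 'beetle', 'llama', 'hippo']
one_of_each = drop_duplicates_any(s)
only_unique = drop_duplicates_none(s)
print(sorted(one_of_each))  # sorted only to make the printed order stable
print(only_unique)
```
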
sample(**kwargs)[source]

Return a random sample of items from an axis of object.

You can use random_state for reproducibility.

Parameters:
  • n (int, optional) – Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.

  • frac (float, optional) – Fraction of axis items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the same row more than once.

  • weights (str or ndarray-like, optional) – Default ‘None’ results in equal probability weighting. If passed a DeferredSeries, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DeferredDataFrame, will accept the name of a column when axis = 0. Unless weights are a DeferredSeries, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. Infinite values not allowed.

  • random_state (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) –

    If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.

    Changed in version 1.4.0: np.random.Generator objects now accepted

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Axis to sample. Accepts axis number or name. Default is stat axis for given data type. For DeferredSeries this parameter is unused and defaults to None.

  • ignore_index (bool, default False) –

    If True, the resulting index will be labeled 0, 1, …, n - 1.

    Added in version 1.3.0.

Returns:

A new object of same type as caller containing n items randomly sampled from the caller object.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

Only n and/or weights may be specified. frac, random_state, and replace=True are not yet supported. See Issue 21010.

Note that pandas will raise an error if n is larger than the length of the dataset, while the Beam DataFrame API will simply return the full dataset in that case.

See also

DeferredDataFrameGroupBy.sample

Generates random samples from each group of a DeferredDataFrame object.

DeferredSeriesGroupBy.sample

Generates random samples from each group of a DeferredSeries object.

numpy.random.choice

Generates a random sample from a given 1-D numpy array.

Notes

If frac > 1, replacement should be set to True.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

Extract 3 random elements from the ``Series`` ``df['num_legs']``.
Note that we use `random_state` to ensure the reproducibility of
the examples.

>>> df['num_legs'].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64

A random 50% sample of the ``DataFrame`` with replacement:

>>> df.sample(frac=0.5, replace=True, random_state=1)
      num_legs  num_wings  num_specimen_seen
dog          4          0                  2
fish         0          0                  8

An upsampled sample of the ``DataFrame`` with replacement.
Note that the `replace` parameter has to be `True` when `frac` is greater than 1.

>>> df.sample(frac=2, replace=True, random_state=1)
        num_legs  num_wings  num_specimen_seen
dog            4          0                  2
fish           0          0                  8
falcon         2          2                 10
falcon         2          2                 10
fish           0          0                  8
dog            4          0                  2
fish           0          0                  8
dog            4          0                  2

Using a DataFrame column as weights. Rows with larger value in the
`num_specimen_seen` column are more likely to be sampled.

>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8
aggregate(func, axis, *args, **kwargs)[source]

Aggregate using one or more operations over the specified axis.

Parameters:
  • func (function, str, list or dict) –

    Function to use for aggregating the data. If a function, must either work when passed a DeferredSeries or when passed to DeferredSeries.apply.

    Accepted combinations are:

    • function

    • string function name

    • list of functions and/or function names, e.g. [np.sum, 'mean']

    • dict of axis labels -> functions, function names or list of such.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

  • *args – Positional arguments to pass to func.

  • **kwargs – Keyword arguments to pass to func.

Returns:

The return can be:

  • scalar : when DeferredSeries.agg is called with single function

  • DeferredSeries : when DeferredDataFrame.agg is called with a single function

  • DeferredDataFrame : when DeferredDataFrame.agg is called with several functions

Return scalar, DeferredSeries or DeferredDataFrame.

Return type:

scalar, DeferredSeries or DeferredDataFrame

Differences from pandas

Some aggregation methods cannot be parallelized, and computing them will require collecting all data on a single machine.

See also

DeferredSeries.apply

Invoke function on a DeferredSeries.

DeferredSeries.transform

Transform function producing a DeferredSeries with like indexes.

Notes

The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
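The contrast with numpy described above can be checked with a small array. This sketch uses plain pandas and numpy rather than the deferred Beam objects; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

arr_2d = np.array([[1, 2], [3, 4]])

# NumPy default: one mean over the flattened array.
flat_mean = np.mean(arr_2d)

# NumPy with an explicit axis: one mean per column.
axis0_mean = np.mean(arr_2d, axis=0)

# pandas aggregates over the index by default, matching axis=0.
frame_mean = pd.DataFrame(arr_2d, columns=['a', 'b']).agg('mean')

print(flat_mean, axis0_mean.tolist(), frame_mean.tolist())
```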

agg is an alias for aggregate. Use the alias.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

A passed user-defined-function will be passed a DeferredSeries for evaluation.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.agg('min')
1

>>> s.agg(['min', 'max'])
min    1
max    4
dtype: int64
agg(func, axis, *args, **kwargs)

Aggregate using one or more operations over the specified axis.

Parameters:
  • func (function, str, list or dict) –

    Function to use for aggregating the data. If a function, must either work when passed a DeferredSeries or when passed to DeferredSeries.apply.

    Accepted combinations are:

    • function

    • string function name

    • list of functions and/or function names, e.g. [np.sum, 'mean']

    • dict of axis labels -> functions, function names or list of such.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

  • *args – Positional arguments to pass to func.

  • **kwargs – Keyword arguments to pass to func.

Returns:

The return can be:

  • scalar : when DeferredSeries.agg is called with single function

  • DeferredSeries : when DeferredDataFrame.agg is called with a single function

  • DeferredDataFrame : when DeferredDataFrame.agg is called with several functions

Return scalar, DeferredSeries or DeferredDataFrame.

Return type:

scalar, DeferredSeries or DeferredDataFrame

Differences from pandas

Some aggregation methods cannot be parallelized, and computing them will require collecting all data on a single machine.

See also

DeferredSeries.apply

Invoke function on a DeferredSeries.

DeferredSeries.transform

Transform function producing a DeferredSeries with like indexes.

Notes

The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).

agg is an alias for aggregate. Use the alias.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

A passed user-defined-function will be passed a DeferredSeries for evaluation.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.agg('min')
1

>>> s.agg(['min', 'max'])
min    1
max    4
dtype: int64
property axes

Return a list of the row axis labels.

Differences from pandas

This operation has no known divergences from the pandas API.
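As a brief illustration in plain pandas (not the deferred Beam object), the list returned for a Series holds a single entry, the index:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

axes = s.axes                # a list with one element for a Series
row_labels = list(axes[0])   # the index labels themselves

print(len(axes), row_labels)
```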

clip(**kwargs)

Trim values at input threshold(s).

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.
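The two threshold styles described above can be sketched with a single Series. This uses plain pandas for illustration; the data mirrors the `col_0` column used in the examples further down.

```python
import pandas as pd

s = pd.Series([9, -3, 0, -1, 5])

# Scalar thresholds apply to every element.
scalar_clipped = s.clip(lower=-2, upper=4)

# Array-like thresholds are matched element by element.
lower = pd.Series([2, -4, -1, 6, 3])
elementwise_clipped = s.clip(lower=lower, upper=lower + 4)

print(scalar_clipped.tolist())
print(elementwise_clipped.tolist())
```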

Parameters:
  • lower (float or array-like, default None) – Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g NA) will not clip the value.

  • upper (float or array-like, default None) – Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g NA) will not clip the value.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Align object with lower and upper along the given axis. For DeferredSeries this parameter is unused and defaults to None.

  • inplace (bool, default False) – Whether to perform the operation in place on the data.

  • *args – Additional positional arguments have no effect but might be accepted for compatibility with numpy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns:

Same type as calling object with the values outside the clip boundaries replaced or None if inplace=True.

Return type:

DeferredSeries or DeferredDataFrame or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.clip

Trim values at input threshold in series.

DeferredDataFrame.clip

Trim values at input threshold in dataframe.

numpy.clip

Clip (limit) the values in an array.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])
>>> t
0    2
1   -4
2   -1
3    6
4    3
dtype: int64

>>> df.clip(t, t + 4, axis=0)
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3

Clips using specific lower threshold per column element, with missing values:

>>> t = pd.Series([2, -4, np.nan, 6, 3])
>>> t
0    2.0
1   -4.0
2    NaN
3    6.0
4    3.0
dtype: float64

>>> df.clip(t, axis=0)
   col_0  col_1
0      9      2
1     -3     -4
2      0      6
3      6      8
4      5      3
all(*args, **kwargs)

Return whether all elements are True, potentially over an axis.

Returns True unless there is at least one element within a series or along a DataFrame axis that is False or equivalent (e.g. zero or empty).
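A short sketch of the "False or equivalent" rule, written against plain pandas for illustration: a single zero is enough to make the result False, while an empty Series has nothing to contradict it.

```python
import pandas as pd

all_nonzero = bool(pd.Series([1, 2, 3]).all())         # no falsy element
has_zero = bool(pd.Series([1, 0, 3]).all())            # zero counts as False
empty = bool(pd.Series([], dtype='float64').all())     # nothing to contradict

print(all_nonzero, has_zero, empty)
```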

Parameters:
  • axis ({0 or 'index', 1 or 'columns', None}, default 0) –

    Indicate which axis or axes should be reduced. For DeferredSeries this parameter is unused and defaults to 0.

    • 0 / ‘index’ : reduce the index, return a DeferredSeries whose index is the original column labels.

    • 1 / ‘columns’ : reduce the columns, return a DeferredSeries whose index is the original index.

    • None : reduce all axes, return a scalar.

  • bool_only (bool, default False) – Include only boolean columns. Not implemented for DeferredSeries.

  • skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

  • **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

If level is specified, a DeferredSeries is returned; otherwise, a scalar is returned.

Return type:

scalar or DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.all

Return True if all elements are True.

DeferredDataFrame.any

Return True if one (or more) elements are True.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

>>> pd.Series([True, True]).all()
True
>>> pd.Series([True, False]).all()
False
>>> pd.Series([], dtype="float64").all()
True
>>> pd.Series([np.nan]).all()
True
>>> pd.Series([np.nan]).all(skipna=False)
True

**DataFrames**

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df
   col1   col2
0  True   True
1  True  False

Default behaviour checks if values in each column all return True.

>>> df.all()
col1     True
col2    False
dtype: bool

Specify ``axis='columns'`` to check if values in each row all return True.

>>> df.all(axis='columns')
0     True
1    False
dtype: bool

Or ``axis=None`` for whether every value is True.

>>> df.all(axis=None)
False
any(*args, **kwargs)

Return whether any element is True, potentially over an axis.

Returns False unless there is at least one element within a series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).

Parameters:
  • axis ({0 or 'index', 1 or 'columns', None}, default 0) –

    Indicate which axis or axes should be reduced. For DeferredSeries this parameter is unused and defaults to 0.

    • 0 / ‘index’ : reduce the index, return a DeferredSeries whose index is the original column labels.

    • 1 / ‘columns’ : reduce the columns, return a DeferredSeries whose index is the original index.

    • None : reduce all axes, return a scalar.

  • bool_only (bool, default False) – Include only boolean columns. Not implemented for DeferredSeries.

  • skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

  • **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

If level is specified, a DeferredSeries is returned; otherwise, a scalar is returned.

Return type:

scalar or DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.any

Numpy version of this method.

DeferredSeries.any

Return whether any element is True.

DeferredSeries.all

Return whether all elements are True.

DeferredDataFrame.any

Return whether any element is True over requested axis.

DeferredDataFrame.all

Return whether all elements are True over requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

For Series input, the output is a scalar indicating whether any element
is True.

>>> pd.Series([False, False]).any()
False
>>> pd.Series([True, False]).any()
True
>>> pd.Series([], dtype="float64").any()
False
>>> pd.Series([np.nan]).any()
False
>>> pd.Series([np.nan]).any(skipna=False)
True

**DataFrame**

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
   A  B  C
0  1  0  0
1  2  2  0

>>> df.any()
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})
>>> df
       A  B
0   True  1
1  False  2

>>> df.any(axis='columns')
0    True
1    True
dtype: bool

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})
>>> df
       A  B
0   True  1
1  False  0

>>> df.any(axis='columns')
0     True
1    False
dtype: bool

Aggregating over the entire DataFrame with ``axis=None``.

>>> df.any(axis=None)
True

`any` for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()
Series([], dtype: bool)
count(*args, **kwargs)

Return number of non-NA/null observations in the Series.
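To make the distinction concrete, this plain-pandas sketch compares the count with the total length of the same Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan])

non_null = int(s.count())        # NaN is not counted
total = len(s)                   # length includes the NaN entry
missing = int(s.isna().sum())    # the difference between the two

print(non_null, total, missing)
```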

Returns:

Number of non-null values in the DeferredSeries.

Return type:

int or DeferredSeries (if level specified)

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.count

Count non-NA cells for each column or row.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([0.0, 1.0, np.nan])
>>> s.count()
2
describe(*args, **kwargs)

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters:
  • percentiles (list-like of numbers, optional) – The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

  • include ('all', list-like of dtypes or None (default), optional) –

    A white list of data types to include in the result. Ignored for DeferredSeries. Here are the options:

    • ’all’ : All columns of the input will be included in the output.

    • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'

    • None (default) : The result will include all numeric columns.

  • exclude (list-like of dtypes or None (default), optional) –

    A black list of data types to omit from the result. Ignored for DeferredSeries. Here are the options:

    • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'

    • None (default) : The result will exclude nothing.

Returns:

Summary statistics of the DeferredSeries or DeferredDataFrame provided.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

describe cannot currently be parallelized. It will require collecting all data on a single node.

See also

DeferredDataFrame.count

Count number of non-NA/null observations.

DeferredDataFrame.max

Maximum of the values in the object.

DeferredDataFrame.min

Minimum of the values in the object.

DeferredDataFrame.mean

Mean of the values.

DeferredDataFrame.std

Standard deviation of the observations.

DeferredDataFrame.select_dtypes

Subset of a DeferredDataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DeferredDataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DeferredDataFrame are analyzed for the output. The parameters are ignored when analyzing a DeferredSeries.
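The index labels listed in these notes can be inspected directly. The sketch below uses plain pandas for illustration and checks only the labels, not the computed statistics.

```python
import pandas as pd

# Numeric data: count, mean, std, min, the percentiles, and max.
numeric_index = list(pd.Series([1, 2, 3]).describe().index)

# Object data: count, unique, top, and freq.
object_index = list(pd.Series(['a', 'a', 'b', 'c']).describe().index)

# Custom percentiles change the lower and upper labels.
custom = pd.Series([1, 2, 3]).describe(percentiles=[0.1, 0.9])

print(numeric_index)
print(object_index)
print(list(custom.index))
```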

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields
are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')  
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as
an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[object])  
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[object])  
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
min(*args, **kwargs)

Return the minimum of the values over the requested axis.

If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
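The three related lookups can be compared side by side. This is a plain-pandas sketch with illustrative variable names: `min` returns the value, `idxmin` the index label, and numpy's `argmin` the integer position.

```python
import pandas as pd

s = pd.Series([4, 2, 0, 8],
              index=['dog', 'falcon', 'fish', 'spider'], name='legs')

smallest = s.min()                       # the value itself
label = s.idxmin()                       # the index label of that value
position = int(s.to_numpy().argmin())    # the integer position in the array

print(smallest, label, position)
```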

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum

Return the sum.

DeferredSeries.min

Return the minimum.

DeferredSeries.max

Return the maximum.

DeferredSeries.idxmin

Return the index of the minimum.

DeferredSeries.idxmax

Return the index of the maximum.

DeferredDataFrame.sum

Return the sum over the requested axis.

DeferredDataFrame.min

Return the minimum over the requested axis.

DeferredDataFrame.max

Return the maximum over the requested axis.

DeferredDataFrame.idxmin

Return the index of the minimum over the requested axis.

DeferredDataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.min()
0
max(*args, **kwargs)

Return the maximum of the values over the requested axis.

If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.
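As a small illustration of skipna in plain pandas, a missing value is ignored by default and propagates to the result when skipna=False:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

skipped = s.max()               # NaN is excluded by default
kept = s.max(skipna=False)      # NaN propagates to the result

print(skipped, kept)
```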

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum

Return the sum.

DeferredSeries.min

Return the minimum.

DeferredSeries.max

Return the maximum.

DeferredSeries.idxmin

Return the index of the minimum.

DeferredSeries.idxmax

Return the index of the maximum.

DeferredDataFrame.sum

Return the sum over the requested axis.

DeferredDataFrame.min

Return the minimum over the requested axis.

DeferredDataFrame.max

Return the maximum over the requested axis.

DeferredDataFrame.idxmin

Return the index of the minimum over the requested axis.

DeferredDataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.max()
8
prod(*args, **kwargs)

Return the product of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.
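The interaction between skipna and min_count can be sketched with a partially missing Series. This uses plain pandas for illustration: only non-NA values count toward min_count.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 3.0])    # two valid values, one missing

default_prod = s.prod()              # NaN skipped: 2.0 * 3.0
enough = s.prod(min_count=2)         # two valid values satisfy min_count=2
too_few = s.prod(min_count=3)        # only two valid values, so the result is NA

print(default_prod, enough, too_few)
```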

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum

Return the sum.

DeferredSeries.min

Return the minimum.

DeferredSeries.max

Return the maximum.

DeferredSeries.idxmin

Return the index of the minimum.

DeferredSeries.idxmax

Return the index of the maximum.

DeferredDataFrame.sum

Return the sum over the requested axis.

DeferredDataFrame.min

Return the minimum over the requested axis.

DeferredDataFrame.max

Return the maximum over the requested axis.

DeferredDataFrame.idxmin

Return the index of the minimum over the requested axis.

DeferredDataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
product(*args, **kwargs)

Return the product of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum

Return the sum.

DeferredSeries.min

Return the minimum.

DeferredSeries.max

Return the maximum.

DeferredSeries.idxmin

Return the index of the minimum.

DeferredSeries.idxmax

Return the index of the maximum.

DeferredDataFrame.sum

Return the sum over the requested axis.

DeferredDataFrame.min

Return the minimum over the requested axis.

DeferredDataFrame.max

Return the maximum over the requested axis.

DeferredDataFrame.idxmin

Return the index of the minimum over the requested axis.

DeferredDataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
sum(*args, **kwargs)

Return the sum of the values over the requested axis.

This is equivalent to the method numpy.sum.
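The equivalence holds for data without missing values. With the default skipna=True, NaN entries are excluded, so the result matches numpy.nansum rather than numpy.sum on the raw array. A plain-pandas sketch:

```python
import numpy as np
import pandas as pd

clean = pd.Series([1.0, 2.0, 3.0])
with_nan = pd.Series([1.0, np.nan, 2.0])

clean_matches = clean.sum() == np.sum(clean.to_numpy())   # same result without NaN

pandas_total = with_nan.sum()                   # skipna=True drops the NaN
numpy_total = np.sum(with_nan.to_numpy())       # plain numpy.sum propagates NaN
nansum_total = np.nansum(with_nan.to_numpy())   # numpy.nansum matches pandas

print(clean_matches, pandas_total, numpy_total, nansum_total)
```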

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum

Return the sum.

DeferredSeries.min

Return the minimum.

DeferredSeries.max

Return the maximum.

DeferredSeries.idxmin

Return the index of the minimum.

DeferredSeries.idxmax

Return the index of the maximum.

DeferredDataFrame.sum

Return the sum over the requested axis.

DeferredDataFrame.min

Return the minimum over the requested axis.

DeferredDataFrame.max

Return the maximum over the requested axis.

DeferredDataFrame.idxmin

Return the index of the minimum over the requested axis.

DeferredDataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.sum()
14

By default, the sum of an empty or all-NA Series is ``0``.

>>> pd.Series([], dtype="float64").sum()  # min_count=0 is the default
0.0

This can be controlled with the ``min_count`` parameter. For example, if
you'd like the sum of an empty series to be NaN, pass ``min_count=1``.

>>> pd.Series([], dtype="float64").sum(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
0.0

>>> pd.Series([np.nan]).sum(min_count=1)
nan
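
As a rough illustration of the skipna and min_count rules described above, the following plain-Python sketch reproduces the documented results for empty and all-NA inputs. The helper name sum_with_min_count is hypothetical and is not part of pandas or Beam; it only mirrors the documented behavior.

```python
import math


def sum_with_min_count(values, skipna=True, min_count=0):
    """Hypothetical helper mirroring the documented skipna/min_count rules."""
    valid = [v for v in values if not math.isnan(v)]
    # With skipna=False, any NA value makes the whole result NA.
    if not skipna and len(valid) != len(values):
        return math.nan
    # Fewer than min_count valid values also yields NA.
    if len(valid) < min_count:
        return math.nan
    return float(sum(valid))


empty_total = sum_with_min_count([])                # like Series([]).sum()
empty_strict = sum_with_min_count([], min_count=1)  # like .sum(min_count=1)
all_na_total = sum_with_min_count([math.nan])       # like Series([nan]).sum()
legs_total = sum_with_min_count([4, 2, 0, 8])       # the 'legs' example
```

An empty input and an all-NA input take the same path once NA values are skipped, which is why min_count treats them identically.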

median(*args, **kwargs)

Return the median of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

Examples

>>> s = pd.Series([1, 2, 3])
>>> s.median()
2.0

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra'])
>>> df
       a   b
tiger  1   2
zebra  2   3
>>> df.median()
a   1.5
b   2.5
dtype: float64

Using axis=1

>>> df.median(axis=1)
tiger   1.5
zebra   2.5
dtype: float64

In this case, `numeric_only` should be set to `True`
to avoid getting an error.

>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']},
...                   index=['tiger', 'zebra'])
>>> df.median(numeric_only=True)
a   1.5
dtype: float64

Differences from pandas

median cannot currently be parallelized. It will require collecting all data on a single node.
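
One way to see why a median is harder to distribute than a sum: partial sums can be combined into the global sum, but medians of separate partitions generally cannot be combined into the global median. The sketch below uses made-up partitions and only the standard library; it illustrates the general problem and is not Beam's implementation.

```python
from statistics import median

# Made-up partitions standing in for data spread across workers.
partitions = [[1, 2, 9], [3, 4], [5, 100, 200, 300]]
all_values = [v for part in partitions for v in part]

# Sums combine: the sum of per-partition sums is the global sum.
combined_sum = sum(sum(part) for part in partitions)

# Medians do not: the median of per-partition medians is a different number.
global_median = median(all_values)
median_of_medians = median([median(part) for part in partitions])
```

Here the global median and the median of the partition medians disagree, so an exact median needs to see all values together.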

sem(*args, **kwargs)

Return unbiased standard error of the mean over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) – For DeferredSeries this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

Return type:

scalar or Series (if level specified)

Examples

>>> s = pd.Series([1, 2, 3])
>>> s.sem().round(6)
0.57735

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra'])
>>> df
       a   b
tiger  1   2
zebra  2   3
>>> df.sem()
a   0.5
b   0.5
dtype: float64

Using axis=1

>>> df.sem(axis=1)
tiger   0.5
zebra   0.5
dtype: float64

In this case, `numeric_only` should be set to `True`
to avoid getting an error.

>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']},
...                   index=['tiger', 'zebra'])
>>> df.sem(numeric_only=True)
a   0.5
dtype: float64

Differences from pandas

sem cannot currently be parallelized. It will require collecting all data on a single node.
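
The ddof normalization described above can be written out directly. The following is a minimal sketch of the standard-error formula, the standard deviation with divisor N - ddof divided by sqrt(N), in plain Python. The function name sem is chosen here for illustration only, and NA handling is omitted.

```python
import math


def sem(values, ddof=1):
    """Standard error of the mean: std with divisor (N - ddof), over sqrt(N)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    std = math.sqrt(squared_deviations / (n - ddof))
    return std / math.sqrt(n)


series_sem = round(sem([1, 2, 3]), 6)  # same input as the Series example above
column_sem = sem([1, 2])               # column 'a' of the DataFrame example
population_sem = sem([1, 2, 3], ddof=0)
```

Passing ddof=0 divides by N instead of N - 1, which gives a smaller value for the same data.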

argmax(**kwargs)

pandas.Series.argmax() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

argmin(**kwargs)

pandas.Series.argmin() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cummax(**kwargs)

pandas.Series.cummax() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cummin(**kwargs)

pandas.Series.cummin() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cumprod(**kwargs)

pandas.Series.cumprod() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cumsum(**kwargs)

pandas.Series.cumsum() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.
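
To illustrate what "sensitive to the order of the data" means: the same elements visited in a different order give a different running total, while an order-insensitive aggregation such as sum is unaffected. A minimal sketch using only the standard library, with made-up values:

```python
from itertools import accumulate

values = [1, 2, 3]
reordered = [3, 1, 2]  # same elements, delivered in a different order

running_a = list(accumulate(values))     # cumulative sum in the first order
running_b = list(accumulate(reordered))  # cumulative sum in the second order
totals_match = sum(values) == sum(reordered)
```

The two running totals differ element by element even though the final sum is the same, which is why cumulative operations need a defined element order and plain aggregations do not.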

diff(**kwargs)

pandas.Series.diff() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

interpolate(**kwargs)

pandas.Series.interpolate() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

searchsorted(**kwargs)

pandas.Series.searchsorted() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

shift(**kwargs)

pandas.Series.shift() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

pct_change(**kwargs)

pandas.Series.pct_change() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

is_monotonic(**kwargs)

pandas.Series.is_monotonic() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

is_monotonic_increasing(**kwargs)

pandas.Series.is_monotonic_increasing() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

is_monotonic_decreasing(**kwargs)

pandas.Series.is_monotonic_decreasing() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

asof(**kwargs)

pandas.Series.asof() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

first_valid_index(**kwargs)

pandas.Series.first_valid_index() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

last_valid_index(**kwargs)

pandas.Series.last_valid_index() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

autocorr(**kwargs)

pandas.Series.autocorr() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

property iat

pandas.Series.iat() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

head(**kwargs)

pandas.Series.head() is not yet supported in the Beam DataFrame API because it is order-sensitive.

If you want to peek at a large dataset consider using interactive Beam’s ib.collect with n specified, or sample(). If you want to find the N largest elements, consider using DeferredDataFrame.nlargest().

tail(**kwargs)

pandas.Series.tail() is not yet supported in the Beam DataFrame API because it is order-sensitive.

If you want to peek at a large dataset consider using interactive Beam’s ib.collect with n specified, or sample(). If you want to find the N largest elements, consider using DeferredDataFrame.nlargest().

filter(**kwargs)

Subset the dataframe rows or columns according to the specified index labels.

Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.

Parameters:
  • items (list-like) – Keep labels from axis which are in items.

  • like (str) – Keep labels from axis for which “like in label == True”.

  • regex (str (regular expression)) – Keep labels from axis for which re.search(regex, label) == True.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, ‘columns’ for DeferredDataFrame. For DeferredSeries this parameter is unused and defaults to None.

Return type:

same type as input object

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.loc

Access a group of rows and columns by label(s) or a boolean array.

Notes

The items, like, and regex parameters are enforced to be mutually exclusive.

axis defaults to the info axis that is used when indexing with [].

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6

>>> # select columns by name
>>> df.filter(items=['one', 'three'])
         one  three
mouse     1      3
rabbit    4      6

>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
         one  three
mouse     1      3
rabbit    4      6

>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
         one  two  three
rabbit    4    5      6
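
The label-matching rules for items, like, and regex can be sketched in plain Python. The helper filter_labels below is hypothetical (it is not a pandas or Beam function) and works on a list of labels only, which mirrors the point that filter never looks at the data values.

```python
import re


def filter_labels(labels, items=None, like=None, regex=None):
    """Hypothetical helper: select labels by exactly one of items, like, regex."""
    supplied = [arg for arg in (items, like, regex) if arg is not None]
    if len(supplied) != 1:
        # The three selectors are mutually exclusive.
        raise TypeError("pass exactly one of items, like, or regex")
    if items is not None:
        return [item for item in items if item in labels]
    if like is not None:
        return [label for label in labels if like in label]
    return [label for label in labels if re.search(regex, label)]


columns = ["one", "two", "three"]
by_items = filter_labels(columns, items=["one", "three"])
by_regex = filter_labels(columns, regex="e$")
by_like = filter_labels(["mouse", "rabbit"], like="bbi")
```

The three calls select the same labels as the column and row examples above.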

memory_usage(**kwargs)

pandas.Series.memory_usage() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

nbytes(**kwargs)

pandas.Series.nbytes() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_list(**kwargs)

pandas.Series.to_list() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.
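
A rough way to picture the "non-deferred result" restriction: a deferred object only records how to compute its values, so it has nothing to hand back as a concrete Python list until the pipeline runs. The Deferred class below is a hypothetical toy written for this illustration; it is not Beam's DeferredSeries and does not reflect how Beam is implemented.

```python
class Deferred:
    """Hypothetical toy: records how to compute values, holds no data itself."""

    def __init__(self, compute):
        self._compute = compute

    def map(self, fn):
        # Build a new deferred step; nothing is computed here.
        return Deferred(lambda: [fn(v) for v in self._compute()])

    def evaluate(self):
        # Stands in for actually running the pipeline.
        return self._compute()


reads = []


def load():
    reads.append("load")
    return [1, 2, 3]


doubled = Deferred(load).map(lambda v: v * 2)
reads_before_run = list(reads)  # still empty: no data has been read yet
result = doubled.evaluate()
```

An operation such as to_list would have to return real values at the point where only the recipe exists, which is the situation the message above describes.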

factorize(**kwargs)

pandas.Series.factorize() is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.

For more information see https://s.apache.org/dataframe-non-deferred-columns.

nlargest(keep, **kwargs)[source]

Return the largest n elements.

Parameters:
  • n (int, default 5) – Return this many descending sorted values.

  • keep ({'first', 'last', 'all'}, default 'first') –

    When there are duplicate values that cannot all fit in a DeferredSeries of n elements:

    • first : return the first n occurrences in order of appearance.

    • last : return the last n occurrences in reverse order of appearance.

    • all : keep all occurrences. This can result in a DeferredSeries of size larger than n.

Returns:

The n largest values in the DeferredSeries, sorted in decreasing order.

Return type:

DeferredSeries

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.
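
The reason "first" and "last" are order-sensitive can be sketched with a generic distributed top-n: each partition contributes its own top n candidates, and the candidates are merged. This is an illustrative sketch using heapq, not Beam's implementation; the partitioning shown is made up.

```python
import heapq


def top_n(partitions, n):
    """Merge per-partition top-n candidates into a global top-n by value."""
    candidates = []
    for part in partitions:
        candidates.extend(heapq.nlargest(n, part, key=lambda item: item[1]))
    return heapq.nlargest(n, candidates, key=lambda item: item[1])


population = [("Italy", 59000000), ("France", 65000000), ("Malta", 434000),
              ("Maldives", 434000), ("Brunei", 434000), ("Iceland", 337000)]

# Two possible ways the same rows might be split and ordered across workers.
split_a = [population[:3], population[3:]]
split_b = [population[3:], population[:3]]

top_a = top_n(split_a, 3)
top_b = top_n(split_b, 3)
values_a = [value for _, value in top_a]
values_b = [value for _, value in top_b]
```

The selected values are the same for both splits, but which of the tied labels fills the last slot depends on how the data happened to be arranged. That is the guarantee keep="any" describes: one of the duplicates is kept, with no promise about which.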

See also

DeferredSeries.nsmallest

Get the n smallest elements.

DeferredSeries.sort_values

Sort DeferredSeries by values.

DeferredSeries.head

Return the first n rows.

Notes

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the DeferredSeries object.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Malta": 434000, "Maldives": 434000,
...                         "Brunei": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)
>>> s
Italy       59000000
France      65000000
Malta         434000
Maldives      434000
Brunei        434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The `n` largest elements where ``n=5`` by default.

>>> s.nlargest()
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64

The `n` largest elements where ``n=3``. Default `keep` value is 'first'
so Malta will be kept.

>>> s.nlargest(3)
France    65000000
Italy     59000000
Malta       434000
dtype: int64

The `n` largest elements where ``n=3`` and keeping the last duplicates.
Brunei will be kept since it is the last with value 434000 based on
the index order.

>>> s.nlargest(3, keep='last')
France      65000000
Italy       59000000
Brunei        434000
dtype: int64

The `n` largest elements where ``n=3`` with all duplicates kept. Note
that the returned Series has five elements due to the three duplicates.

>>> s.nlargest(3, keep='all')
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64

nsmallest(keep, **kwargs)[source]

Return the smallest n elements.

Parameters:
  • n (int, default 5) – Return this many ascending sorted values.

  • keep ({'first', 'last', 'all'}, default 'first') –

    When there are duplicate values that cannot all fit in a DeferredSeries of n elements:

    • first : return the first n occurrences in order of appearance.

    • last : return the last n occurrences in reverse order of appearance.

    • all : keep all occurrences. This can result in a DeferredSeries of size larger than n.

Returns:

The n smallest values in the DeferredSeries, sorted in increasing order.

Return type:

DeferredSeries

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.

See also

DeferredSeries.nlargest

Get the n largest elements.

DeferredSeries.sort_values

Sort DeferredSeries by values.

DeferredSeries.head

Return the first n rows.

Notes

Faster than .sort_values().head(n) for small n relative to the size of the DeferredSeries object.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Brunei": 434000, "Malta": 434000,
...                         "Maldives": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)
>>> s
Italy       59000000
France      65000000
Brunei        434000
Malta         434000
Maldives      434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The `n` smallest elements where ``n=5`` by default.

>>> s.nsmallest()
Montserrat    5200
Nauru        11300
Tuvalu       11300
Anguilla     11300
Iceland     337000
dtype: int64

The `n` smallest elements where ``n=3``. Default `keep` value is
'first' so Nauru and Tuvalu will be kept.

>>> s.nsmallest(3)
Montserrat   5200
Nauru       11300
Tuvalu      11300
dtype: int64

The `n` smallest elements where ``n=3`` and keeping the last
duplicates. Anguilla and Tuvalu will be kept since they are the last
with value 11300 based on the index order.

>>> s.nsmallest(3, keep='last')
Montserrat   5200
Anguilla    11300
Tuvalu      11300
dtype: int64

The `n` smallest elements where ``n=3`` with all duplicates kept. Note
that the returned Series has four elements due to the three duplicates.

>>> s.nsmallest(3, keep='all')
Montserrat   5200
Nauru       11300
Tuvalu      11300
Anguilla    11300
dtype: int64

property is_unique

Return True if the values in the object are unique, False otherwise.

Return type:

bool

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s.is_unique
True

>>> s = pd.Series([1, 2, 3, 1])
>>> s.is_unique
False

plot(**kwargs)

pandas.Series.plot() is not yet supported in the Beam DataFrame API because it is a plotting tool.

For more information see https://s.apache.org/dataframe-plotting-tools.

pop(**kwargs)

pandas.Series.pop() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

rename_axis(**kwargs)

Set the name of the axis for the index or columns.

Parameters:
  • mapper (scalar, list-like, optional) – Value to set the axis name attribute.

  • index (scalar, list-like, dict-like or function, optional) –

    A scalar, list-like, dict-like, or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a DeferredSeries; it applies only to DeferredDataFrame objects.

    Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.

  • columns (scalar, list-like, dict-like or function, optional) –

    A scalar, list-like, dict-like, or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a DeferredSeries; it applies only to DeferredDataFrame objects.

    Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename. For DeferredSeries this parameter is unused and defaults to 0.

  • copy (bool, default None) – Also copy underlying data.

  • inplace (bool, default False) – Modifies the object directly, instead of creating a new DeferredSeries or DeferredDataFrame.

Returns:

The same type as the caller or None if inplace=True.

Return type:

DeferredSeries, DeferredDataFrame, or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.rename

Alter DeferredSeries index labels or name.

DeferredDataFrame.rename

Alter DeferredDataFrame index labels or name.

Index.rename

Set new names on index.

Notes

DeferredDataFrame.rename_axis supports two calling conventions:

  • (index=index_mapper, columns=columns_mapper, ...)

  • (mapper, axis={'index', 'columns'}, ...)

The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter copy is ignored.

The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.

We highly recommend using keyword arguments to clarify your intent.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

>>> s = pd.Series(["dog", "cat", "monkey"])
>>> s
0       dog
1       cat
2    monkey
dtype: object
>>> s.rename_axis("animal")
animal
0       dog
1       cat
2    monkey
dtype: object

**DataFrame**

>>> df = pd.DataFrame({"num_legs": [4, 4, 2],
...                    "num_arms": [0, 0, 2]},
...                   ["dog", "cat", "monkey"])
>>> df
        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("animal")
>>> df
        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("limbs", axis="columns")
>>> df
limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

**MultiIndex**

>>> df.index = pd.MultiIndex.from_product([['mammal'],
...                                        ['dog', 'cat', 'monkey']],
...                                       names=['type', 'name'])
>>> df
limbs          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(index={'type': 'class'})
limbs          num_legs  num_arms
class  name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(columns=str.upper)
LIMBS          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2

round(**kwargs)

Round each value in a Series to the given number of decimals.

Parameters:
  • decimals (int, default 0) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.

  • *args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Rounded values of the DeferredSeries.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.around

Round values of an np.array.

DeferredDataFrame.round

Round values of a DeferredDataFrame.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([0.1, 1.3, 2.7])
>>> s.round()
0    0.0
1    1.0
2    3.0
dtype: float64

take(**kwargs)

pandas.Series.take() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

to_dict(**kwargs)

pandas.Series.to_dict() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_frame(**kwargs)

Convert Series to DataFrame.

Parameters:

name (object, optional) – The passed name should substitute for the series name (if it has one).

Returns:

DeferredDataFrame representation of DeferredSeries.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(["a", "b", "c"],
...               name="vals")
>>> s.to_frame()
  vals
0    a
1    b
2    c

unique(as_series=False)[source]

Return unique values of Series object.

Uniques are returned in order of appearance. This is a hash table-based unique, so the result is NOT sorted.

Returns:

The unique values returned as a NumPy array. See Notes.

Return type:

ndarray or ExtensionArray

Differences from pandas

unique is not supported by default because it produces a non-deferred result: an ndarray. You can use the Beam-specific argument unique(as_series=True) to get the result as a DeferredSeries.

See also

DeferredSeries.drop_duplicates

Return DeferredSeries with duplicate values removed.

unique

Top-level unique method for any 1-d array-like object.

Index.unique

Return Index with unique values from an Index object.

Notes

Returns the unique values as a NumPy array. In case of an extension-array backed DeferredSeries, a new ExtensionArray of that type with just the unique values is returned. This includes:

  • Categorical

  • Period

  • Datetime with Timezone

  • Datetime without Timezone

  • Timedelta

  • Interval

  • Sparse

  • IntegerNA

See Examples section.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> pd.Series([2, 1, 3, 3], name='A').unique()
array([2, 1, 3])

>>> pd.Series([pd.Timestamp('2016-01-01') for _ in range(3)]).unique()
<DatetimeArray>
['2016-01-01 00:00:00']
Length: 1, dtype: datetime64[ns]

>>> pd.Series([pd.Timestamp('2016-01-01', tz='US/Eastern')
...            for _ in range(3)]).unique()
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

A Categorical will return categories in the order of
appearance and with the same dtype.

>>> pd.Series(pd.Categorical(list('baabc'))).unique()
['b', 'a', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> pd.Series(pd.Categorical(list('baabc'), categories=list('abc'),
...                          ordered=True)).unique()
['b', 'a', 'c']
Categories (3, object): ['a' < 'b' < 'c']

update(other)[source]

Modify Series in place using values from passed Series.

Uses non-NA values from passed Series to make updates. Aligns on index.

Parameters:

other (DeferredSeries, or object coercible into DeferredSeries)

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, 5, 6]))
>>> s
0    4
1    5
2    6
dtype: int64

>>> s = pd.Series(['a', 'b', 'c'])
>>> s.update(pd.Series(['d', 'e'], index=[0, 2]))
>>> s
0    d
1    b
2    e
dtype: object

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, 5, 6, 7, 8]))
>>> s
0    4
1    5
2    6
dtype: int64

If ``other`` contains NaNs the corresponding values are not updated
in the original Series.

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, np.nan, 6]))
>>> s
0    4
1    2
2    6
dtype: int64

``other`` can also be a non-Series object type
that is coercible into a Series

>>> s = pd.Series([1, 2, 3])
>>> s.update([4, np.nan, 6])
>>> s
0    4
1    2
2    6
dtype: int64

>>> s = pd.Series([1, 2, 3])
>>> s.update({1: 9})
>>> s
0    1
1    9
2    3
dtype: int64
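
The update rules shown above (align on index, skip NA values, ignore labels the target does not have) can be sketched with plain dictionaries keyed by index label. The helpers is_na and update_in_place are hypothetical names used only for this illustration.

```python
import math


def is_na(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def update_in_place(target, other):
    """Overwrite entries of target with non-NA entries of other sharing a label."""
    for label, value in other.items():
        if label in target and not is_na(value):
            target[label] = value


numbers = {0: 1, 1: 2, 2: 3}
# Label 1 is NA in the update and label 3 does not exist in the target.
update_in_place(numbers, {0: 4, 1: math.nan, 2: 6, 3: 7})

letters = {0: "a", 1: "b", 2: "c"}
update_in_place(letters, {0: "d", 2: "e"})
```

As in the examples above, the target keeps its original labels and length; only matching, non-NA entries change.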

value_counts(sort=False, normalize=False, ascending=False, bins=None, dropna=True)[source]

Return a Series containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters:
  • normalize (bool, default False) – If True then the object returned will contain the relative frequencies of the unique values.

  • sort (bool, default True) – Sort by frequencies when True. Preserve the order of the data when False.

  • ascending (bool, default False) – Sort in ascending order.

  • bins (int, optional) – Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.

  • dropna (bool, default True) – Don’t include counts of NaN.

Return type:

DeferredSeries

Differences from pandas

sort is False by default, and sort=True is not supported because it imposes an ordering on the dataset which likely will not be preserved.

When bins is specified, this operation is not parallelizable. See [Issue 20903](https://github.com/apache/beam/issues/20903), which tracks the possible addition of a distributed implementation.
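
Counting values combines well across partitions, which the following sketch illustrates with collections.Counter. It is a plain-Python illustration with made-up partitions, not Beam's implementation; the helper count_values is hypothetical and always drops NaN, as dropna=True does.

```python
from collections import Counter
import math


def count_values(partitions, normalize=False):
    """Count non-NaN values across partitions by merging per-partition counts."""
    total = Counter()
    for part in partitions:
        total.update(v for v in part if not math.isnan(v))
    if normalize:
        n = sum(total.values())
        return {value: count / n for value, count in total.items()}
    return dict(total)


partitions = [[3, 1, 2], [3, 4, math.nan]]
counts = count_values(partitions)
proportions = count_values(partitions, normalize=True)
same_counts_any_order = count_values(partitions[::-1]) == counts
```

The merged counts are the same whichever partition is processed first, but the result is a mapping with no meaningful row order, so a sorted result is a separate concern from the counts themselves.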

See also

DeferredSeries.count

Number of non-NA elements in a DeferredSeries.

DeferredDataFrame.count

Number of non-NA elements in a DeferredDataFrame.

DeferredDataFrame.value_counts

Equivalent method on DeferredDataFrames.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> index = pd.Index([3, 1, 2, 3, 4, np.nan])
>>> index.value_counts()
3.0    2
1.0    1
2.0    1
4.0    1
Name: count, dtype: int64

With `normalize` set to `True`, returns the relative frequency by
dividing all values by the sum of values.

>>> s = pd.Series([3, 1, 2, 3, 4, np.nan])
>>> s.value_counts(normalize=True)
3.0    0.4
1.0    0.2
2.0    0.2
4.0    0.2
Name: proportion, dtype: float64

**bins**

Bins can be useful for going from a continuous variable to a
categorical variable; instead of counting unique
occurrences of values, divide the index into the specified
number of half-open bins.

>>> s.value_counts(bins=3)
(0.996, 2.0]    2
(2.0, 3.0]      2
(3.0, 4.0]      1
Name: count, dtype: int64

**dropna**

With `dropna` set to `False` we can also see NaN index values.

>>> s.value_counts(dropna=False)
3.0    2
1.0    1
2.0    1
4.0    1
NaN    1
Name: count, dtype: int64
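
Because sort=True is not supported in Beam, code ported from pandas should not depend on the order of the counts. The sketch below is plain pandas, not Beam code, and its variable names are illustrative. It requests unsorted counts and compares them as a mapping, which is independent of row order.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# Ask for unsorted counts, mirroring the Beam default of sort=False.
counts = s.value_counts(sort=False)

# Convert to a dict so later checks do not depend on row order.
counts_by_value = {float(k): int(v) for k, v in counts.items()}

# With normalize=True the relative frequencies of the non-NA values sum to 1.
proportions = s.value_counts(sort=False, normalize=True)
total = float(proportions.sum())
```

If a sorted result is needed, one option is to sort after the data has been materialized outside the deferred pipeline.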
property values

pandas.Series.values() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

view(**kwargs)

pandas.Series.view() is not yet supported in the Beam DataFrame API because it relies on memory-sharing semantics that are not compatible with the Beam model.

property str

Vectorized string functions for Series and Index.

NAs stay NA unless handled otherwise by a particular method. Patterned after Python’s string methods, with some inspiration from R’s stringr package.

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(["A_Str_Series"])
>>> s
0    A_Str_Series
dtype: object

>>> s.str.split("_")
0    [A, Str, Series]
dtype: object

>>> s.str.replace("_", "")
0    AStrSeries
dtype: object
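
The statement that NAs stay NA can be checked with a small plain-pandas sketch (not Beam code; the input values are illustrative). A missing element is passed through by the string methods rather than raising an error.

```python
import pandas as pd

s = pd.Series(["A_Str_Series", None])

# String methods skip the missing element and leave it missing.
upper = s.str.upper()
parts = s.str.split("_")

upper_missing = upper.isna().tolist()
parts_missing = parts.isna().tolist()
```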
property cat

Accessor object for categorical properties of the Series values.

Parameters:

data (DeferredSeries or CategoricalIndex)

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(list("abbccc")).astype("category")
>>> s
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')

>>> s.cat.rename_categories(list("cba"))
0    c
1    b
2    b
3    a
4    a
5    a
dtype: category
Categories (3, object): ['c', 'b', 'a']

>>> s.cat.reorder_categories(list("cba"))
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['c', 'b', 'a']

>>> s.cat.add_categories(["d", "e"])
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

>>> s.cat.remove_categories(["a", "c"])
0    NaN
1      b
2      b
3    NaN
4    NaN
5    NaN
dtype: category
Categories (1, object): ['b']

>>> s1 = s.cat.add_categories(["d", "e"])
>>> s1.cat.remove_unused_categories()
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> s.cat.set_categories(list("abcde"))
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

>>> s.cat.as_ordered()
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

>>> s.cat.as_unordered()
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
property dt
mode(*args, **kwargs)[source]

Return the mode(s) of the Series.

The mode is the value that appears most often. There can be multiple modes.

Always returns Series even if only one value is returned.

Parameters:

dropna (bool, default True) – Don’t consider counts of NaN/NaT.

Returns:

Modes of the DeferredSeries in sorted order.

Return type:

DeferredSeries

Differences from pandas

mode is not currently parallelizable. An approximate, parallelizable implementation of mode may be added in the future (Issue 20946).

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> s = pd.Series([2, 4, 2, 2, 4, None])
>>> s.mode()
0    2.0
dtype: float64

More than one mode:

>>> s = pd.Series([2, 4, 8, 2, 4, None])
>>> s.mode()
0    2.0
1    4.0
dtype: float64

With and without considering null value:

>>> s = pd.Series([2, 4, None, None, 4, None])
>>> s.mode(dropna=False)
0   NaN
dtype: float64
>>> s = pd.Series([2, 4, None, None, 4, None])
>>> s.mode()
0    4.0
dtype: float64
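
The definition of the mode can be restated in terms of value counts. The following plain-pandas sketch (not Beam code, and not how Beam implements mode) keeps every value tied for the highest count and sorts the result, which matches the documented return order.

```python
import pandas as pd

s = pd.Series([2, 4, 8, 2, 4, None])

# Count each distinct non-NA value, then keep the values tied for the top count.
counts = s.value_counts()
modes = sorted(counts[counts == counts.max()].index.tolist())
```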
apply(**kwargs)

Invoke function on values of Series.

func can be a ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.

Parameters:
  • func (function) – Python function or NumPy ufunc to apply.

  • convert_dtype (bool, default True) –

    Try to find better dtype for elementwise function results. If False, leave as dtype=object. Note that the dtype is always preserved for some extension array dtypes, such as Categorical.

    Deprecated since version 2.1.0: convert_dtype has been deprecated. Do ser.astype(object).apply() instead if you want convert_dtype=False.

  • args (tuple) – Positional arguments passed to func after the series value.

  • by_row (False or "compat", default "compat") –

    If "compat" and func is a callable, func is passed each element of the DeferredSeries, like DeferredSeries.map. If func is a list or dict of callables, apply first tries to translate each func into pandas methods. If that does not work, it tries calling apply again with by_row="compat", and if that fails, it calls apply again with by_row=False (backward compatible). If False, func is passed the whole DeferredSeries at once.

    by_row has no effect when func is a string.

    Added in version 2.1.0.

  • **kwargs – Additional keyword arguments passed to func.

Returns:

If func returns a DeferredSeries object the result will be a DeferredDataFrame.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.map

For element-wise operations.

DeferredSeries.agg

Only perform aggregating type operations.

DeferredSeries.transform

Only perform transforming type operations.

Notes

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Create a series with typical summer temperatures for each city.

>>> s = pd.Series([20, 21, 12],
...               index=['London', 'New York', 'Helsinki'])
>>> s
London      20
New York    21
Helsinki    12
dtype: int64

Square the values by defining a function and passing it as an
argument to ``apply()``.

>>> def square(x):
...     return x ** 2
>>> s.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64

Square the values by passing an anonymous function as an
argument to ``apply()``.

>>> s.apply(lambda x: x ** 2)
London      400
New York    441
Helsinki    144
dtype: int64

Define a custom function that needs additional positional
arguments and pass these additional arguments using the
``args`` keyword.

>>> def subtract_custom_value(x, custom_value):
...     return x - custom_value

>>> s.apply(subtract_custom_value, args=(5,))
London      15
New York    16
Helsinki     7
dtype: int64

Define a custom function that takes keyword arguments
and pass these arguments to ``apply``.

>>> def add_custom_values(x, **kwargs):
...     for month in kwargs:
...         x += kwargs[month]
...     return x

>>> s.apply(add_custom_values, june=30, july=20, august=25)
London      95
New York    96
Helsinki    87
dtype: int64

Use a function from the NumPy library.

>>> s.apply(np.log)
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64
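
The Returns section notes that a func returning a Series produces a DataFrame. The sketch below shows this in plain pandas. The helper name `describe` and its column labels are hypothetical, and whether a given Beam runner can infer the output schema for such a function is not verified here.

```python
import pandas as pd

s = pd.Series([20, 21, 12], index=["London", "New York", "Helsinki"])

def describe(x):
    # Returning a Series per element makes the overall result a DataFrame,
    # with one column per label of the returned Series.
    return pd.Series({"value": x, "squared": x ** 2})

result = s.apply(describe)
columns = list(result.columns)
```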
map(**kwargs)

Map values of Series according to an input mapping or function.

Used for substituting each value in a Series with another value that may be derived from a function, a dict, or a Series.

Parameters:
  • arg (function, collections.abc.Mapping subclass or DeferredSeries) – Mapping correspondence.

  • na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.

Returns:

Same index as caller.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.apply

For applying more complex functions on a DeferredSeries.

DeferredSeries.replace

Replace values given in to_replace with value.

DeferredDataFrame.apply

Apply a function row-/column-wise.

DeferredDataFrame.map

Apply a function elementwise on a whole DeferredDataFrame.

Notes

When arg is a dictionary, values in DeferredSeries that are not in the dictionary (as keys) are converted to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for default values), then this default is used rather than NaN.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0      cat
1      dog
2      NaN
3   rabbit
dtype: object

``map`` accepts a ``dict`` or a ``Series``. Values that are not found
in the ``dict`` are converted to ``NaN``, unless the dict has a default
value (e.g. ``defaultdict``):

>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0   kitten
1    puppy
2      NaN
3      NaN
dtype: object

It also accepts a function:

>>> s.map('I am a {}'.format)
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

To avoid applying the function to missing values (and keep them as
``NaN``) ``na_action='ignore'`` can be used:

>>> s.map('I am a {}'.format, na_action='ignore')
0     I am a cat
1     I am a dog
2            NaN
3  I am a rabbit
dtype: object
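
The Notes describe how a dict subclass that defines `__missing__` changes the handling of unmatched values. A minimal plain-pandas sketch (not Beam code) using `collections.defaultdict`, with the fallback string "unknown" chosen purely for illustration:

```python
from collections import defaultdict

import pandas as pd

s = pd.Series(["cat", "dog", "rabbit"])
mapping = {"cat": "kitten", "dog": "puppy"}

# A plain dict turns unmatched values into NaN.
plain = s.map(mapping)

# A defaultdict supplies its default instead of NaN.
with_default = s.map(defaultdict(lambda: "unknown", mapping))
```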
repeat(repeats, axis)[source]

Repeat elements of a Series.

Returns a new Series where each element of the current Series is repeated consecutively a given number of times.

Parameters:
  • repeats (int or array of ints) – The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty DeferredSeries.

  • axis (None) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

Newly created DeferredSeries with repeated elements.

Return type:

DeferredSeries

Differences from pandas

repeats must be an int or a DeferredSeries. Lists are not supported because they make this operation order-sensitive.

See also

Index.repeat

Equivalent function for Index.

numpy.repeat

Similar method for numpy.ndarray.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> s = pd.Series(['a', 'b', 'c'])
>>> s
0    a
1    b
2    c
dtype: object
>>> s.repeat(2)
0    a
0    a
1    b
1    b
2    c
2    c
dtype: object
>>> s.repeat([1, 2, 3])
0    a
1    b
1    b
2    c
2    c
2    c
dtype: object
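
Since lists are not accepted for repeats in Beam, per-element counts would need to be supplied as a Series. The sketch below shows the equivalent call in plain pandas (not Beam code), where the counts are held in a Series instead of a list. The name `counts` is illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])
counts = pd.Series([1, 2, 3])  # per-element repeat counts held in a Series

repeated = s.repeat(counts)
values = repeated.tolist()
labels = repeated.index.tolist()
```

In Beam the counts would be a DeferredSeries, so each count travels with the data rather than depending on element order.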
compare(other, align_axis, **kwargs)[source]

Compare to another Series and show the differences.

Parameters:
  • other (DeferredSeries) – Object to compare with.

  • align_axis ({0 or 'index', 1 or 'columns'}, default 1) –

    Determine which axis to align the comparison on.

    • 0, or ‘index’: Resulting differences are stacked vertically with rows drawn alternately from self and other.

    • 1, or ‘columns’: Resulting differences are aligned horizontally with columns drawn alternately from self and other.

  • keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.

  • keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.

  • result_names (tuple, default ('self', 'other')) –

    Set the names of the dataframes in the comparison.

    Added in version 1.5.0.

Returns:

If align_axis is 0 or ‘index’ the result will be a DeferredSeries. The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.

If align_axis is 1 or ‘columns’ the result will be a DeferredDataFrame. It will have two columns, namely ‘self’ and ‘other’.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.compare

Compare with another DeferredDataFrame and show differences.

Notes

Matching NaNs will not appear as a difference.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s1 = pd.Series(["a", "b", "c", "d", "e"])
>>> s2 = pd.Series(["a", "a", "c", "b", "e"])

Align the differences on columns

>>> s1.compare(s2)
  self other
1    b     a
3    d     b

Stack the differences on indices

>>> s1.compare(s2, align_axis=0)
1  self     b
   other    a
3  self     d
   other    b
dtype: object

Keep all original rows

>>> s1.compare(s2, keep_shape=True)
  self other
0  NaN   NaN
1    b     a
2  NaN   NaN
3    d     b
4  NaN   NaN

Keep all original rows and also all original values

>>> s1.compare(s2, keep_shape=True, keep_equal=True)
  self other
0    a     a
1    b     a
2    c     c
3    d     b
4    e     e
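
The note that matching NaNs do not appear as a difference can be illustrated with a short plain-pandas sketch (not Beam code; the input values are illustrative). Both series hold NaN at position 1, and only position 2 is reported.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])
s2 = pd.Series([1.0, np.nan, 4.0])

# NaN at position 1 matches on both sides, so only position 2 differs.
diff = s1.compare(s2)
differing_labels = diff.index.tolist()
```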
add(**kwargs)

Return Addition of series and other, element-wise (binary operator add).

Equivalent to series + other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.radd

Reverse of the Addition operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.add(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64
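
To make the effect of fill_value concrete, the following plain-pandas sketch (not Beam code) contrasts the `+` operator with `add(..., fill_value=0)` on the same inputs and records which labels end up missing in each result.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=["a", "b", "c", "d"])
b = pd.Series([1, np.nan, 1, np.nan], index=["a", "b", "d", "e"])

# Without fill_value, any label missing on either side yields NaN.
plain = a + b
# With fill_value=0, a label is NaN only when it is missing on both sides.
filled = a.add(b, fill_value=0)

missing_plain = plain.index[plain.isna()].tolist()
missing_filled = filled.index[filled.isna()].tolist()
```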
asfreq(**kwargs)

pandas.Series.asfreq() is not implemented yet in the Beam DataFrame API.

If support for ‘asfreq’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

property at

pandas.Series.at() is not implemented yet in the Beam DataFrame API.

If support for ‘at’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

convert_dtypes(**kwargs)

pandas.Series.convert_dtypes() is not implemented yet in the Beam DataFrame API.

If support for ‘convert_dtypes’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

div(**kwargs)

Return Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rtruediv

Reverse of the Floating division operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
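
In pandas, div, divide, and truediv are spellings of the same floating division. The sketch below checks this in plain pandas (not Beam code) with illustrative inputs that contain no missing values.

```python
import pandas as pd

a = pd.Series([1.0, 4.0, 9.0])
b = pd.Series([2.0, 2.0, 3.0])

by_operator = a / b
by_div = a.div(b)
by_divide = a.divide(b)
by_truediv = a.truediv(b)

quotients = by_div.tolist()
```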
divide(**kwargs)

Return Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rtruediv

Reverse of the Floating division operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
divmod(**kwargs)

Return Integer division and modulo of series and other, element-wise (binary operator divmod).

Equivalent to divmod(series, other), but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

2-Tuple of DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rdivmod

Reverse of the Integer division and modulo operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divmod(b, fill_value=0)
(a    1.0
 b    inf
 c    inf
 d    0.0
 e    NaN
 dtype: float64,
 a    0.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64)
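
The two elements of the returned tuple correspond to floordiv and mod. A minimal plain-pandas sketch (not Beam code), using small integer inputs chosen for illustration so that no missing values or infinities are involved:

```python
import pandas as pd

a = pd.Series([7, 8, 9])
b = pd.Series([2, 3, 4])

# divmod returns a (quotient, remainder) pair of Series.
quotient, remainder = a.divmod(b)

quotient_values = quotient.tolist()
remainder_values = remainder.tolist()
```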
eq(**kwargs)

Return Equal to of series and other, element-wise (binary operator eq).

Equivalent to series == other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.eq(b, fill_value=0)
a     True
b    False
c    False
d    False
e    False
dtype: bool
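
One practical difference between the method and the operator in plain pandas: `==` refuses to compare Series whose labels differ, while `eq` aligns them first. The sketch below is plain pandas, not Beam code. Whether the deferred `==` operator behaves the same way in Beam is not verified here.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=["a", "b", "c", "d"])
b = pd.Series([1, np.nan, 1, np.nan], index=["a", "b", "d", "e"])

# The operator requires identically labeled Series and raises otherwise.
try:
    a == b
    operator_raised = False
except ValueError:
    operator_raised = True

# The method aligns on the union of labels and applies fill_value.
result = a.eq(b, fill_value=0)
```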
property flags

pandas.Series.flags() is not implemented yet in the Beam DataFrame API.

If support for ‘flags’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

floordiv(**kwargs)

Return Integer division of series and other, element-wise (binary operator floordiv).

Equivalent to series // other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rfloordiv

Reverse of the Integer division operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.floordiv(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
ge(**kwargs)

Return Greater than or equal to of series and other, element-wise (binary operator ge).

Equivalent to series >= other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.ge(b, fill_value=0)
a     True
b     True
c    False
d    False
e     True
f    False
dtype: bool
gt(**kwargs)

Return Greater than of series and other, element-wise (binary operator gt).

Equivalent to series > other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.gt(b, fill_value=0)
a     True
b    False
c    False
d    False
e     True
f    False
dtype: bool
infer_objects(**kwargs)

pandas.Series.infer_objects() is not implemented yet in the Beam DataFrame API.

If support for ‘infer_objects’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

item(**kwargs)

pandas.Series.item() is not implemented yet in the Beam DataFrame API.

If support for ‘item’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

le(**kwargs)

Return Less than or equal to of series and other, element-wise (binary operator le).

Equivalent to series <= other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.le(b, fill_value=0)
a    False
b     True
c     True
d    False
e    False
f     True
dtype: bool
lt(**kwargs)

Return Less than of series and other, element-wise (binary operator lt).

Equivalent to series < other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.lt(b, fill_value=0)
a    False
b    False
c     True
d    False
e    False
f     True
dtype: bool
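
A comparison is False wherever the value is missing on both sides, so lt is not simply the negation of ge. The plain-pandas sketch below (not Beam code) reuses the inputs from the example above and lists the labels where neither comparison holds.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan, 1], index=["a", "b", "c", "d", "e"])
b = pd.Series([0, 1, 2, np.nan, 1], index=["a", "b", "c", "d", "f"])

ge = a.ge(b, fill_value=0)
lt = a.lt(b, fill_value=0)

# Labels where neither comparison holds: only "d", which is NaN in both inputs.
neither = ge.index[~ge & ~lt].tolist()
```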
mod(**kwargs)

Return Modulo of series and other, element-wise (binary operator mod).

Equivalent to series % other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.rmod

Reverse of the Modulo operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.mod(b, fill_value=0)
a    0.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64
mul(**kwargs)

Return Multiplication of series and other, element-wise (binary operator mul).

Equivalent to series * other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.rmul

Reverse of the Multiplication operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
multiply(**kwargs)

Return Multiplication of series and other, element-wise (binary operator mul).

Equivalent to series * other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.rmul

Reverse of the Multiplication operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
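
The sketch below, written against eager pandas as an assumption (it is not deferred Beam code), shows two points the example above leaves implicit: how the result differs from the bare `*` operator, and that multiply and mul give the same result.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])

# Bare operator: any label that is NaN or absent on either side yields NaN.
plain = a * b

# fill_value=0 substitutes 0 where exactly one side is missing.
# Label 'e' is missing on both sides, so it stays NaN.
filled = a.multiply(b, fill_value=0)

# multiply and mul are two names for the same operation.
same_as_mul = filled.equals(a.mul(b, fill_value=0))
```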
ne(**kwargs)

Return the element-wise not-equal-to comparison of series and other (binary operator ne).

Equivalent to series != other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.ne(b, fill_value=0)
a    False
b     True
c     True
d     True
e     True
dtype: bool
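
A small eager pandas sketch (an illustration under the assumption of ordinary pandas Series, not deferred Beam code) relating ne to eq. It also shows why label 'e' is True: it is missing on both sides, remains NaN after filling, and NaN never compares equal to NaN.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])

ne_result = a.ne(b, fill_value=0)
eq_result = a.eq(b, fill_value=0)

# ne is the element-wise negation of eq, including at NaN positions.
is_negation = bool((ne_result == ~eq_result).all())
```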
pow(**kwargs)

Return Exponential power of series and other, element-wise (binary operator pow).

Equivalent to series ** other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.rpow

Reverse of the Exponential power operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.pow(b, fill_value=0)
a    1.0
b    1.0
c    1.0
d    0.0
e    NaN
dtype: float64
radd(**kwargs)

Return Addition of series and other, element-wise (binary operator radd).

Equivalent to other + series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.add

Element-wise Addition, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.add(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64
rank(**kwargs)

pandas.Series.rank() is not implemented yet in the Beam DataFrame API.

If support for ‘rank’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on issue 20318.

rdiv(**kwargs)

Return Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.truediv

Element-wise Floating division, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
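
The pandas example above calls the forward method divide. As a complementary sketch in eager pandas (an assumption for illustration, not deferred Beam code), the reverse form swaps the operands, so a.rdiv(b) computes b / a:

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])

# Forward: a / b after filling one-sided gaps with 0.
forward = a.div(b, fill_value=0)

# Reverse: b / a after the same filling. Label 'd' becomes 1 / 0.
reverse = a.rdiv(b, fill_value=0)

# The reverse method matches the forward method with operands swapped.
matches_swapped = reverse.equals(b.truediv(a, fill_value=0))
```

The infinite entries move accordingly: forward is infinite at 'b' and 'c', while reverse is infinite only at 'd'.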
rdivmod(**kwargs)

Return Integer division and modulo of series and other, element-wise (binary operator rdivmod).

Equivalent to divmod(other, series), but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

2-Tuple of DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.divmod

Element-wise Integer division and modulo, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divmod(b, fill_value=0)
(a    1.0
 b    inf
 c    inf
 d    0.0
 e    NaN
 dtype: float64,
 a    0.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64)
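
The example above uses the forward method divmod. The following eager pandas sketch (assumed for illustration, not deferred Beam code) shows the reverse form, which returns the pair (other // series, other % series):

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])

# rdivmod swaps the operands: quotient and remainder of b divided by a.
quotient, remainder = a.rdivmod(b, fill_value=0)

# Each half matches the corresponding single operation with operands swapped.
quotient_matches = quotient.equals(b.floordiv(a, fill_value=0))
remainder_matches = remainder.equals(b.mod(a, fill_value=0))
```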
reindex_like(**kwargs)

pandas.Series.reindex_like() is not implemented yet in the Beam DataFrame API.

If support for ‘reindex_like’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on issue 20318.

rfloordiv(**kwargs)

Return Integer division of series and other, element-wise (binary operator rfloordiv).

Equivalent to other // series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.floordiv

Element-wise Integer division, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.floordiv(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
rmod(**kwargs)

Return Modulo of series and other, element-wise (binary operator rmod).

Equivalent to other % series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.mod

Element-wise Modulo, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.mod(b, fill_value=0)
a    0.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64
rmul(**kwargs)

Return Multiplication of series and other, element-wise (binary operator rmul).

Equivalent to other * series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.mul

Element-wise Multiplication, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
rpow(**kwargs)

Return Exponential power of series and other, element-wise (binary operator rpow).

Equivalent to other ** series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.pow

Element-wise Exponential power, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.pow(b, fill_value=0)
a    1.0
b    1.0
c    1.0
d    0.0
e    NaN
dtype: float64
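
Because the example above calls the forward method pow, here is a short eager pandas sketch (an assumption for illustration, not deferred Beam code) of the reverse form, where a.rpow(b) computes b ** a:

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])

# After filling one-sided gaps with 0, this evaluates b ** a:
#   'a': 1 ** 1, 'b': 0 ** 1, 'c': 0 ** 1, 'd': 1 ** 0, 'e': NaN (both missing)
reverse = a.rpow(b, fill_value=0)

matches_swapped = reverse.equals(b.pow(a, fill_value=0))
```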
rsub(**kwargs)

Return Subtraction of series and other, element-wise (binary operator rsub).

Equivalent to other - series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.sub

Element-wise Subtraction, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64
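
The example above calls the forward method subtract. A minimal eager pandas sketch (assumed for illustration, not deferred Beam code) of the reverse form follows; a.rsub(b) computes b - a, so every sign flips relative to the output shown above:

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])

forward = a.sub(b, fill_value=0)   # a - b
reverse = a.rsub(b, fill_value=0)  # b - a

matches_swapped = reverse.equals(b.sub(a, fill_value=0))
```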
rtruediv(**kwargs)

Return Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.truediv

Element-wise Floating division, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
set_flags(**kwargs)

pandas.Series.set_flags() is not implemented yet in the Beam DataFrame API.

If support for ‘set_flags’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on issue 20318.

squeeze(**kwargs)

pandas.Series.squeeze() is not implemented yet in the Beam DataFrame API.

If support for ‘squeeze’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on issue 20318.

sub(**kwargs)

Return Subtraction of series and other, element-wise (binary operator sub).

Equivalent to series - other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.rsub

Reverse of the Subtraction operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64
subtract(**kwargs)

Return Subtraction of series and other, element-wise (binary operator sub).

Equivalent to series - other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported

See also

DeferredSeries.rsub

Reverse of the Subtraction operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64
to_clipboard(**kwargs)

pandas.DataFrame.to_clipboard() is not implemented yet in the Beam DataFrame API.

If support for ‘to_clipboard’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on issue 20318.

to_csv(path, transform_label=None, *args, **kwargs)

Write object to a comma-separated values (csv) file.

Parameters:
  • path_or_buf (str, path object, file-like object, or None, default None) –

    String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string. If a non-binary file object is passed, it should be opened with newline='', disabling universal newlines. If a binary file object is passed, mode might need to contain a 'b'.

    Changed in version 1.2.0: Support for binary file objects was introduced.

  • sep (str, default ',') – String of length 1. Field delimiter for the output file.

  • na_rep (str, default '') – Missing data representation.

  • float_format (str, Callable, default None) – Format string for floating point numbers. If a Callable is given, it takes precedence over other numeric formatting parameters, like decimal.

  • columns (sequence, optional) – Columns to write.

  • header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.

  • index (bool, default True) – Write row names (index).

  • index_label (str or sequence, or False, default None) – Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R.

  • mode ({'w', 'x', 'a'}, default 'w') –

    Forwarded to either open(mode=) or fsspec.open(mode=) to control the file opening. Typical values include:

    • 'w', truncate the file first.

    • 'x', exclusive creation, failing if the file already exists.

    • 'a', append to the end of file if it exists.

  • encoding (str, optional) – A string representing the encoding to use in the output file, defaults to ‘utf-8’. encoding is not supported if path_or_buf is a non-binary file object.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    Added in version 1.5.0: Added support for .tar files.

    May be a dict with key ‘method’ as compression mode and other entries as additional compression options if compression mode is ‘zip’.

    Passing compression options as keys in dict is supported for compression modes ‘gzip’, ‘bz2’, ‘zstd’, and ‘zip’.

    Changed in version 1.2.0: Compression is supported for binary file objects.

    Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to gzip.open instead of gzip.GzipFile which prevented setting mtime.

  • quoting (optional constant from csv module) – Defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric.

  • quotechar (str, default '"') – String of length 1. Character used to quote fields.

  • lineterminator (str, optional) –

    The newline character or character sequence to use in the output file. Defaults to os.linesep, which depends on the OS in which this method is called (for example, '\n' on Linux and '\r\n' on Windows).

    Changed in version 1.5.0: Previously was line_terminator, changed for consistency with read_csv and the standard library ‘csv’ module.

  • chunksize (int or None) – Rows to write at a time.

  • date_format (str, default None) – Format string for datetime objects.

  • doublequote (bool, default True) – Control quoting of quotechar inside a field.

  • escapechar (str, default None) – String of length 1. Character used to escape sep and quotechar when appropriate.

  • decimal (str, default '.') – Character recognized as decimal separator. E.g. use ‘,’ for European data.

  • errors (str, default 'strict') – Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    Added in version 1.2.0.
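
The reproducible gzip archive mentioned for the compression parameter comes down to pinning the modification time stored in the gzip header. The following is a minimal stdlib-only sketch of that effect, not the pandas or Beam implementation; the helper name gzip_bytes and the sample payload are assumptions made for illustration.

```python
import gzip
import io


def gzip_bytes(payload, mtime=None, compresslevel=1):
    """Compress payload in memory with gzip.GzipFile (hypothetical helper)."""
    buf = io.BytesIO()
    # mtime is written into the gzip header; fixing it makes output repeatable.
    with gzip.GzipFile(fileobj=buf, mode="wb",
                       compresslevel=compresslevel, mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


csv_text = b"name,mask,weapon\nRaphael,red,sai\n"

# Same options as compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.
first = gzip_bytes(csv_text, mtime=1)
second = gzip_bytes(csv_text, mtime=1)
```

With mtime fixed, the two archives are byte-for-byte identical, which is what makes the output reproducible across runs.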

Returns:

If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.

Return type:

None or str

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_csv

Load a CSV file into a DeferredDataFrame.

to_excel

Write DeferredDataFrame to an Excel file.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'

Create 'out.zip' containing 'out.csv'

>>> compression_opts = dict(method='zip',
...                         archive_name='out.csv')  
>>> df.to_csv('out.zip', index=False,
...           compression=compression_opts)  

To write a CSV file to a new or nested folder, first create the folder
using either pathlib or os:

>>> from pathlib import Path  
>>> filepath = Path('folder/subfolder/out.csv')  
>>> filepath.parent.mkdir(parents=True, exist_ok=True)  
>>> df.to_csv(filepath)  

>>> import os  
>>> os.makedirs('folder/subfolder', exist_ok=True)  
>>> df.to_csv('folder/subfolder/out.csv')  
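
The interaction between float_format and csv.QUOTE_NONNUMERIC noted for the quoting parameter can be reproduced with the standard csv module alone. This is a sketch under the assumption that a formatted float reaches the writer as a string; render_row is a hypothetical helper, not part of pandas or Beam.

```python
import csv
import io


def render_row(row):
    """Write one row with QUOTE_NONNUMERIC and return the CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(row)
    return buf.getvalue()


# A real float is numeric, so it is written without quotes.
unformatted = render_row([0.1234, "sai"])

# Applying a format such as "%.2f" first turns the float into a string,
# so QUOTE_NONNUMERIC quotes it like any other text field.
formatted = render_row(["%.2f" % 0.1234, "sai"])
```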
to_excel(path, *args, **kwargs)

Write object to an Excel sheet.

To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an ExcelWriter object with a target file name, and specify a sheet in the file to write to.

Multiple sheets may be written by giving each a unique sheet_name. Once all data has been written to the file, the changes must be saved. Note that creating an ExcelWriter object with a file name that already exists erases the contents of the existing file.

Parameters:
  • excel_writer (path-like, file-like, or ExcelWriter object) – File path or existing ExcelWriter.

  • sheet_name (str, default 'Sheet1') – Name of sheet which will contain DeferredDataFrame.

  • na_rep (str, default '') – Missing data representation.

  • float_format (str, optional) – Format string for floating point numbers. For example float_format="%.2f" will format 0.1234 to 0.12.

  • columns (sequence or list of str, optional) – Columns to write.

  • header (bool or list of str, default True) – Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.

  • index (bool, default True) – Write row names (index).

  • index_label (str or sequence, optional) – Column label for index column(s) if desired. If not specified, and header and index are True, then the index names are used. A sequence should be given if the DeferredDataFrame uses MultiIndex.

  • startrow (int, default 0) – Upper left cell row to dump data frame.

  • startcol (int, default 0) – Upper left cell column to dump data frame.

  • engine (str, optional) – Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set this via the options io.excel.xlsx.writer or io.excel.xlsm.writer.

  • merge_cells (bool, default True) – Write MultiIndex and Hierarchical Rows as merged cells.

  • inf_rep (str, default 'inf') – Representation for infinity (there is no native representation for infinity in Excel).

  • freeze_panes (tuple of int (length 2), optional) – Specifies the one-based bottommost row and rightmost column that is to be frozen.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    Added in version 1.2.0.

  • engine_kwargs (dict, optional) – Arbitrary keyword arguments passed to excel engine.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

to_csv

Write DeferredDataFrame to a comma-separated values (csv) file.

ExcelWriter

Class for writing DeferredDataFrame objects into excel sheets.

read_excel

Read an Excel file into a DeferredDataFrame.

read_csv

Read a comma-separated values (csv) file into DeferredDataFrame.

io.formats.style.Styler.to_excel

Add styles to Excel sheet.

Notes

For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing.

Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Create, write to and save a workbook:

>>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                    index=['row 1', 'row 2'],
...                    columns=['col 1', 'col 2'])
>>> df1.to_excel("output.xlsx")  

To specify the sheet name:

>>> df1.to_excel("output.xlsx",
...              sheet_name='Sheet_name_1')  

If you wish to write to more than one sheet in the workbook, it is
necessary to specify an ExcelWriter object:

>>> df2 = df1.copy()
>>> with pd.ExcelWriter('output.xlsx') as writer:  
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')

ExcelWriter can also be used to append to an existing Excel file:

>>> with pd.ExcelWriter('output.xlsx',
...                     mode='a') as writer:  
...     df1.to_excel(writer, sheet_name='Sheet_name_3')

To set the library that is used to write the Excel file,
you can pass the `engine` keyword (the default engine is
automatically chosen depending on the file extension):

>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')  
to_feather(path, *args, **kwargs)

Write a DataFrame to the binary Feather format.

Parameters:
  • path (str, path object, file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If a string or a path, it will be used as Root Directory path when writing a partitioned dataset.

  • **kwargs – Additional keywords passed to pyarrow.feather.write_feather(). This includes the compression, compression_level, chunksize and version keywords.

Differences from pandas

This operation has no known divergences from the pandas API.

Notes

This function writes the dataframe as a feather file. It requires a default index. To save a DeferredDataFrame with a custom index, use a method that supports custom indices, e.g. to_parquet.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
>>> df.to_feather("file.feather")  
to_hdf(**kwargs)

pandas.DataFrame.to_hdf() is not yet supported in the Beam DataFrame API because HDF5 is a random access file format.

to_html(path, *args, **kwargs)

Render a DataFrame as an HTML table.

Parameters:
  • buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.

  • columns (array-like, optional, default None) – The subset of columns to write. Writes all columns by default.

  • col_space (str or int, list or dict of int or str, optional) – The minimum width of each column in CSS length units. An int is assumed to be px units.

  • header (bool, optional) – Whether to print column labels, default True.

  • index (bool, optional, default True) – Whether to print index (row) labels.

  • na_rep (str, optional, default 'NaN') – String representation of NaN to use.

  • formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

  • float_format (one-parameter function, optional, default None) –

    Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.

    Changed in version 1.2.0.

  • sparsify (bool, optional, default True) – Set to False for a DeferredDataFrame with a hierarchical index to print every multiindex key at each row.

  • index_names (bool, optional, default True) – Prints the names of the indexes.

  • justify (str, default None) –

    How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are

    • left

    • right

    • center

    • justify

    • justify-all

    • start

    • end

    • inherit

    • match-parent

    • initial

    • unset.

  • max_rows (int, optional) – Maximum number of rows to display in the console.

  • max_cols (int, optional) – Maximum number of columns to display in the console.

  • show_dimensions (bool, default False) – Display DeferredDataFrame dimensions (number of rows by number of columns).

  • decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.

  • bold_rows (bool, default True) – Make the row labels bold in the output.

  • classes (str or list or tuple, default None) – CSS class(es) to apply to the resulting html table.

  • escape (bool, default True) – Convert the characters <, >, and & to HTML-safe sequences.

  • notebook ({True, False}, default False) – Whether the generated HTML is for IPython Notebook.

  • border (int) – A border=border attribute is included in the opening <table> tag. Default pd.options.display.html.border.

  • table_id (str, optional) – A css id is included in the opening <table> tag if specified.

  • render_links (bool, default False) – Convert URLs to HTML links.

  • encoding (str, default "utf-8") – Set character encoding.

Returns:

If buf is None, returns the result as a string. Otherwise returns None.

Return type:

str or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

to_string

Convert DeferredDataFrame to a string.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [4, 3]})
>>> html_string = '''<table border="1" class="dataframe">
...   <thead>
...     <tr style="text-align: right;">
...       <th></th>
...       <th>col1</th>
...       <th>col2</th>
...     </tr>
...   </thead>
...   <tbody>
...     <tr>
...       <th>0</th>
...       <td>1</td>
...       <td>4</td>
...     </tr>
...     <tr>
...       <th>1</th>
...       <td>2</td>
...       <td>3</td>
...     </tr>
...   </tbody>
... </table>'''
>>> assert html_string == df.to_html()
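
The effect of the escape parameter can be pictured with the standard library's html.escape, which replaces the same three characters when quote=False. This is only an illustration of the substitution; it is an assumption that a cell value like the one below would be rendered this way, and it does not call pandas or Beam.

```python
import html

# A hypothetical cell value containing the three characters named above.
cell = "<b>R&D</b>"

# quote=False limits the replacement to <, > and &.
escaped = html.escape(cell, quote=False)
```

Here escaped holds the HTML-safe form, so a browser shows the literal text rather than interpreting it as markup.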
to_json(path, orient=None, *args, **kwargs)

Convert the object to a JSON string.

Note that NaN and None values are converted to null, and datetime objects are converted to UNIX timestamps.

Parameters:
  • path_or_buf (str, path object, file-like object, or None, default None) – String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string.

  • orient (str) –

    Indication of expected JSON string format.

    • DeferredSeries:

      • default is ‘index’

      • allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}.

    • DeferredDataFrame:

      • default is ‘columns’

      • allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}.

    • The format of the JSON string:

      • ’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

      • ’records’ : list like [{column -> value}, … , {column -> value}]

      • ’index’ : dict like {index -> {column -> value}}

      • ’columns’ : dict like {column -> {index -> value}}

      • ’values’ : just the values array

      • ’table’ : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, where the data component is like orient='records'.

  • date_format ({None, 'epoch', 'iso'}) – Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.

  • double_precision (int, default 10) – The number of decimal places to use when encoding floating point values. The possible maximal value is 15. Passing double_precision greater than 15 will raise a ValueError.

  • force_ascii (bool, default True) – Force encoded string to be ASCII.

  • date_unit (str, default 'ms' (milliseconds)) – The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

  • default_handler (callable, default None) – Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

  • lines (bool, default False) – If ‘orient’ is ‘records’, write out line-delimited JSON. Raises ValueError for any other ‘orient’, since the others are not list-like.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    Added in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • index (bool or None, default None) – The index is only used when ‘orient’ is ‘split’, ‘index’, ‘columns’, or ‘table’. Of these, ‘index’ and ‘columns’ do not support index=False.

  • indent (int, optional) – Length of whitespace used to indent each record.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    Added in version 1.2.0.

  • mode (str, default 'w' (writing)) – Specify the IO mode for output when supplying a path_or_buf. Accepted args are ‘w’ (writing) and ‘a’ (append) only. mode=’a’ is only supported when lines is True and orient is ‘records’.

Returns:

If path_or_buf is None, returns the resulting json format as a string. Otherwise returns None.

Return type:

None or str

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_json

Convert a JSON string to pandas object.

Notes

The behavior of indent=0 varies from the stdlib, which does not indent the output but does insert newlines. Currently, indent=0 and the default indent=None are equivalent in pandas, though this may change in a future release.

orient='table' contains a ‘pandas_version’ field under ‘schema’. This stores the version of pandas used in the latest revision of the schema.
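
The stdlib behavior that the indent note contrasts with can be seen directly with json.dumps. The sketch below only exercises the standard json module; the sample list is an assumption chosen for brevity.

```python
import json

values = [1, 2]

# indent=None (the default) keeps everything on one line.
compact = json.dumps(values)

# indent=0 adds no leading spaces but still inserts a newline per element.
zero_indent = json.dumps(values, indent=0)
```

In the stdlib these two calls produce different strings, whereas the note above says pandas currently treats indent=0 and indent=None the same.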

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> from json import loads, dumps
>>> df = pd.DataFrame(
...     [["a", "b"], ["c", "d"]],
...     index=["row 1", "row 2"],
...     columns=["col 1", "col 2"],
... )

>>> result = df.to_json(orient="split")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}

Encoding/decoding a DataFrame using 'records' formatted JSON.
Note that index labels are not preserved with this encoding.

>>> result = df.to_json(orient="records")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
[
    {
        "col 1": "a",
        "col 2": "b"
    },
    {
        "col 1": "c",
        "col 2": "d"
    }
]

Encoding/decoding a DataFrame using 'index' formatted JSON:

>>> result = df.to_json(orient="index")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "row 1": {
        "col 1": "a",
        "col 2": "b"
    },
    "row 2": {
        "col 1": "c",
        "col 2": "d"
    }
}

Encoding/decoding a DataFrame using 'columns' formatted JSON:

>>> result = df.to_json(orient="columns")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "col 1": {
        "row 1": "a",
        "row 2": "c"
    },
    "col 2": {
        "row 1": "b",
        "row 2": "d"
    }
}

Encoding/decoding a DataFrame using 'values' formatted JSON:

>>> result = df.to_json(orient="values")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
[
    [
        "a",
        "b"
    ],
    [
        "c",
        "d"
    ]
]

Encoding with Table Schema:

>>> result = df.to_json(orient="table")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "schema": {
        "fields": [
            {
                "name": "index",
                "type": "string"
            },
            {
                "name": "col 1",
                "type": "string"
            },
            {
                "name": "col 2",
                "type": "string"
            }
        ],
        "primaryKey": [
            "index"
        ],
        "pandas_version": "1.4.0"
    },
    "data": [
        {
            "index": "row 1",
            "col 1": "a",
            "col 2": "b"
        },
        {
            "index": "row 2",
            "col 1": "c",
            "col 2": "d"
        }
    ]
}
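
The orient layouts shown above differ only in how the same index, columns and values are nested. The following pure-Python sketch rebuilds five of them from plain lists to make the nesting explicit; the reshape function is a hypothetical helper for illustration and is not how pandas or Beam produce the output. It also joins the 'records' form one object per line, which is the general idea behind lines=True.

```python
import json

index = ["row 1", "row 2"]
columns = ["col 1", "col 2"]
data = [["a", "b"], ["c", "d"]]


def reshape(orient):
    """Nest index/columns/data the way each orient is documented."""
    if orient == "split":
        return {"columns": columns, "index": index, "data": data}
    if orient == "records":
        return [dict(zip(columns, row)) for row in data]
    if orient == "index":
        return {label: dict(zip(columns, row))
                for label, row in zip(index, data)}
    if orient == "columns":
        return {col: {label: row[i] for label, row in zip(index, data)}
                for i, col in enumerate(columns)}
    if orient == "values":
        return data
    raise ValueError(f"unsupported orient in this sketch: {orient!r}")


# Line-delimited form: one JSON object per record, one record per line.
json_lines = "\n".join(json.dumps(record) for record in reshape("records"))
```

Note that 'records' and 'values' drop the index labels, which is why they cannot be round-tripped back to the original row labels.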
to_latex(**kwargs)

pandas.Series.to_latex() is not implemented yet in the Beam DataFrame API.

If support for ‘to_latex’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_markdown(**kwargs)

pandas.Series.to_markdown() is not implemented yet in the Beam DataFrame API.

If support for ‘to_markdown’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_msgpack(**kwargs)

pandas.DataFrame.to_msgpack() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

to_parquet(path, *args, **kwargs)

Write a DataFrame to the binary parquet format.

This function writes the dataframe as a parquet file. You can choose different parquet backends, and have the option of compression. See the user guide for more details.

Parameters:
  • path (str, path object, file-like object, or None, default None) –

    String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If None, the result is returned as bytes. If a string or path, it will be used as Root Directory path when writing a partitioned dataset.

    Changed in version 1.2.0: Previously this was “fname”.

  • engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.

  • compression (str or None, default 'snappy') – Name of the compression to use. Use None for no compression. Supported options: ‘snappy’, ‘gzip’, ‘brotli’, ‘lz4’, ‘zstd’.

  • index (bool, default None) – If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, similar to True the dataframe’s index(es) will be saved. However, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.

  • partition_cols (list, optional, default None) – Column names by which to partition the dataset. Columns are partitioned in the order they are given. Must be None if path is not a string.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    Added in version 1.2.0.

  • **kwargs – Additional arguments passed to the parquet library. See pandas io for more details.

Return type:

bytes if no path argument is provided else None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_parquet

Read a parquet file.

DeferredDataFrame.to_orc

Write an orc file.

DeferredDataFrame.to_csv

Write a csv file.

DeferredDataFrame.to_sql

Write to a sql table.

DeferredDataFrame.to_hdf

Write to hdf.

Notes

This function requires either the fastparquet or pyarrow library.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_parquet('df.parquet.gzip',
...               compression='gzip')  
>>> pd.read_parquet('df.parquet.gzip')  
   col1  col2
0     1     3
1     2     4

If you want to get a buffer to the parquet content you can use an io.BytesIO
object, as long as you don't use partition_cols, which creates multiple files.

>>> import io
>>> f = io.BytesIO()
>>> df.to_parquet(f)
>>> f.seek(0)
0
>>> content = f.read()
to_period(**kwargs)

pandas.Series.to_period() is not implemented yet in the Beam DataFrame API.

If support for ‘to_period’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_pickle(**kwargs)

pandas.Series.to_pickle() is not implemented yet in the Beam DataFrame API.

If support for ‘to_pickle’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_sql(**kwargs)

pandas.Series.to_sql() is not implemented yet in the Beam DataFrame API.

If support for ‘to_sql’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_stata(path, *args, **kwargs)

Export DataFrame object to Stata dta format.

Writes the DataFrame to a Stata dataset file. “dta” files contain a Stata dataset.

Parameters:
  • path (str, path object, or buffer) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function.

  • convert_dates (dict) – Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are ‘tc’, ‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to ‘tc’. Raises NotImplementedError if a datetime column has timezone information.

  • write_index (bool) – Write the index to Stata dataset.

  • byteorder (str) – Can be “>”, “<”, “little”, or “big”. Default is sys.byteorder.

  • time_stamp (datetime) – A datetime to use as file creation date. Default is the current time.

  • data_label (str, optional) – A label for the data set. Must be 80 characters or smaller.

  • variable_labels (dict) – Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller.

  • version ({114, 117, 118, 119, None}, default 114) –

    Version to use in the output dta file. Set to None to let pandas decide between 118 or 119 formats depending on the number of columns in the frame. pandas Version 114 can be read by Stata 10 and later. pandas Version 117 can be read by Stata 13 or later. pandas Version 118 is supported in Stata 14 and later. pandas Version 119 is supported in Stata 15 and later. pandas Version 114 limits string variables to 244 characters or fewer while versions 117 and later allow strings with lengths up to 2,000,000 characters. Versions 118 and 119 support Unicode characters, and pandas version 119 supports more than 32,767 variables.

    pandas Version 119 should usually only be used when the number of variables exceeds the capacity of dta format 118. Exporting smaller datasets in format 119 may have unintended consequences, and, as of November 2020, Stata SE cannot read pandas version 119 files.

  • convert_strl (list, optional) – List of names of string columns to convert to the Stata StrL format. Only available if version is 117. Storing strings in the StrL format can produce smaller dta files if strings have more than 8 characters and values are repeated.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    Added in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    Added in version 1.2.0.

  • value_labels (dict of dicts) –

    Dictionary containing columns as keys and dictionaries of column value to labels as values. Labels for a single variable must be 32,000 characters or smaller.

    Added in version 1.4.0.
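
The limits described for version, data_label and variable_labels can be checked before writing. The sketch below is a hypothetical pre-flight helper, not part of pandas or Beam; in particular, using 32,767 columns as the switch point between formats 118 and 119 is an assumption inferred from the text above.

```python
MAX_VARIABLES_118 = 32_767  # assumed capacity of dta format 118
MAX_LABEL_CHARS = 80        # limit stated for data and variable labels


def choose_dta_version(n_columns, version=None):
    """Return the explicit version, or pick 118/119 by column count."""
    if version is not None:
        return version
    return 118 if n_columns <= MAX_VARIABLES_118 else 119


def too_long_labels(variable_labels):
    """List the columns whose label exceeds the 80-character limit."""
    return [column for column, label in variable_labels.items()
            if len(label) > MAX_LABEL_CHARS]


small = choose_dta_version(n_columns=2)
wide = choose_dta_version(n_columns=40_000)
bad = too_long_labels({"animal": "Kind of bird", "speed": "x" * 81})
```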

Raises:
  • NotImplementedError

    • If datetimes contain timezone information

    • Column dtype is not representable in Stata

  • ValueError

    • Columns listed in convert_dates are neither datetime64[ns] nor datetime.datetime

    • Column listed in convert_dates is not in DeferredDataFrame

    • Categorical label contains more than 32,000 characters

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_stata

Import Stata data files.

io.stata.StataWriter

Low-level writer for Stata data files.

io.stata.StataWriter117

Low-level writer for pandas version 117 files.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon',
...                               'parrot'],
...                    'speed': [350, 18, 361, 15]})
>>> df.to_stata('animals.dta')  
to_timestamp(**kwargs)

pandas.Series.to_timestamp() is not implemented yet in the Beam DataFrame API.

If support for ‘to_timestamp’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

truediv(**kwargs)

Return floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value)

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DeferredDataFrame.

Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

Only level=None is supported.

See also

DeferredSeries.rtruediv

Reverse of the Floating division operator, see Python documentation for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
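
The fill_value rule in the example above (fill when exactly one side is missing, stay missing when both are) can be modeled with plain dictionaries. This is a sketch of the documented semantics only, assuming label-to-float mappings in place of series; truediv_with_fill is a hypothetical name and not a pandas or Beam function.

```python
import math


def truediv_with_fill(left, right, fill_value=None):
    """Divide two label->float mappings, aligning on the union of labels."""
    result = {}
    for label in sorted(set(left) | set(right)):
        # A label absent from one input is treated as missing (NaN) there.
        x = left.get(label, math.nan)
        y = right.get(label, math.nan)
        x_missing, y_missing = math.isnan(x), math.isnan(y)
        if fill_value is not None and x_missing != y_missing:
            # Exactly one side is missing: substitute fill_value for it.
            if x_missing:
                x = fill_value
            else:
                y = fill_value
        if math.isnan(x) or math.isnan(y):
            # Still missing (both sides were missing, or nothing was filled).
            result[label] = math.nan
        elif y == 0:
            # Python raises on division by zero; mimic IEEE float results.
            result[label] = math.nan if x == 0 else math.copysign(math.inf, x)
        else:
            result[label] = x / y
    return result


a = {"a": 1.0, "b": 1.0, "c": 1.0, "d": math.nan}
b = {"a": 1.0, "b": math.nan, "d": 1.0, "e": math.nan}
quotient = truediv_with_fill(a, b, fill_value=0.0)
```

The result mirrors the table above: 'b' and 'c' become inf because the missing divisor is filled with 0, 'd' becomes 0.0, and 'e' stays missing.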
class apache_beam.dataframe.frames.DeferredDataFrame(expr)[source]

Bases: DeferredDataFrameOrSeries

property columns

The column labels of the DataFrame.

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
>>> df
   A  B
0  1  3
1  2  4
>>> df.columns
Index(['A', 'B'], dtype='object')
keys()[source]

Get the ‘info axis’ (see Indexing for more).

This is index for Series, columns for DataFrame.

Returns:

Info axis.

Return type:

Index

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> d = pd.DataFrame(data={'A': [1, 2, 3], 'B': [0, 4, 8]},
...                  index=['a', 'b', 'c'])
>>> d
   A  B
a  1  0
b  2  4
c  3  8
>>> d.keys()
Index(['A', 'B'], dtype='object')
align(other, join, axis, copy, level, method, **kwargs)[source]

Align two objects on their axes with the specified join method.

Join method is specified for each axis Index.

Parameters:
  • other (DeferredDataFrame or DeferredSeries)

  • join ({'outer', 'inner', 'left', 'right'}, default 'outer') –

    Type of alignment to be performed.

    • left: use only keys from left frame, preserve key order.

    • right: use only keys from right frame, preserve key order.

    • outer: use union of keys from both frames, sort keys lexicographically.

    • inner: use intersection of keys from both frames, preserve the order of the left keys.

  • axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).

  • level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

  • fill_value (scalar, default np.nan) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

  • method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –

    Method to use for filling holes in reindexed DeferredSeries:

    • pad / ffill: propagate last valid observation forward to next valid.

    • backfill / bfill: use NEXT valid observation to fill gap.

    Deprecated since version 2.1.

  • limit (int, default None) –

    If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

    Deprecated since version 2.1.

  • fill_axis ({0 or 'index'} for DeferredSeries, {0 or 'index', 1 or 'columns'} for DeferredDataFrame, default 0) –

    Filling axis, method and limit.

    Deprecated since version 2.1.

  • broadcast_axis ({0 or 'index'} for DeferredSeries, {0 or 'index', 1 or 'columns'} for DeferredDataFrame, default None) –

    Broadcast values along this axis, if aligning two objects of different dimensions.

    Deprecated since version 2.1.

Returns:

Aligned objects.

Return type:

tuple of (DeferredSeries/DeferredDataFrame, type of other)

Differences from pandas

Aligning per level is not yet supported. Only the default, level=None, is allowed.

Filling NaN values via method is not supported, because it is order-sensitive. Only the default, method=None, is allowed.

copy=False is not supported because its behavior (whether or not it is an inplace operation) depends on the data.
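
As a rough illustration of the supported subset, the sketch below calls align with only join and axis, leaving level, method and copy at their defaults. It uses eager pandas objects with made-up values (left_df and right_df are illustrative names); on the Beam side the same call would be made on DeferredDataFrame objects and evaluated lazily.

```python
import pandas as pd

left_df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[1, 2])
right_df = pd.DataFrame({"B": [30, 40], "C": [50, 60]}, index=[2, 3])

# Only join and axis are passed; level, method and copy keep their defaults.
left, right = left_df.align(right_df, join="inner", axis=None)

# With join="inner" and axis=None, both results share the intersection
# of the row labels and of the column labels.
shared_rows = list(left.index)
shared_cols = list(left.columns)
```

Because no fill method is involved, the result depends only on the labels present in each frame, not on the order in which rows arrive.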

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other
    A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900

Align on columns:

>>> left, right = df.align(other, join="outer", axis=1)
>>> left
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right
    A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN

We can also align on the index:

>>> left, right = df.align(other, join="outer", axis=0)
>>> left
    D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right
    A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0

Finally, the default `axis=None` will align on both index and columns:

>>> left, right = df.align(other, join="outer", axis=None)
>>> left
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
append(other, ignore_index, verify_integrity, sort, **kwargs)[source]

This method has been removed in the current version of pandas.

get(key, default_value=None)[source]

Get item from object for given key (ex: DataFrame column).

Returns default value if not found.

Parameters:

key (object)

Return type:

same type as items contained in object

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(
...     [
...         [24.3, 75.7, "high"],
...         [31, 87.8, "high"],
...         [22, 71.6, "medium"],
...         [35, 95, "medium"],
...     ],
...     columns=["temp_celsius", "temp_fahrenheit", "windspeed"],
...     index=pd.date_range(start="2014-02-12", end="2014-02-15", freq="D"),
... )

>>> df
            temp_celsius  temp_fahrenheit windspeed
2014-02-12          24.3             75.7      high
2014-02-13          31.0             87.8      high
2014-02-14          22.0             71.6    medium
2014-02-15          35.0             95.0    medium

>>> df.get(["temp_celsius", "windspeed"])
            temp_celsius windspeed
2014-02-12          24.3      high
2014-02-13          31.0      high
2014-02-14          22.0    medium
2014-02-15          35.0    medium

>>> ser = df['windspeed']
>>> ser.get('2014-02-13')
'high'

If the key isn't found, the default value will be used.

>>> df.get(["temp_celsius", "temp_kelvin"], default="default_value")
'default_value'

>>> ser.get('2014-02-10', '[unknown]')
'[unknown]'
set_index(keys, **kwargs)[source]

Set the DataFrame index using existing columns.

Set the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.

Parameters:
  • keys (label or array-like or list of labels/arrays) – This parameter can be either a single column key, a single array of the same length as the calling DeferredDataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses DeferredSeries, Index, np.ndarray, and instances of Iterator.

  • drop (bool, default True) – Delete columns to be used as the new index.

  • append (bool, default False) – Whether to append columns to existing index.

  • inplace (bool, default False) – Whether to modify the DeferredDataFrame rather than creating a new one.

  • verify_integrity (bool, default False) – Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.

Returns:

Changed row labels or None if inplace=True.

Return type:

DeferredDataFrame or None

Differences from pandas

keys must be a str or list[str]. Passing an Index or Series is not yet supported (Issue 20759).
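
One possible way around this restriction is to attach the values as a named column first and then index by that column's label. The sketch below shows the equivalence in eager pandas; the frame, the quarter Series and the idea that this pattern carries over to deferred objects are assumptions for illustration, not documented Beam guidance.

```python
import pandas as pd

df = pd.DataFrame({"year": [2012, 2014], "sale": [55, 40]})
quarter = pd.Series(["Q1", "Q2"], name="quarter")

# pandas accepts a Series directly as the key.
by_series = df.set_index(quarter)

# The same result is reachable with a column label only: attach the values
# as a column (aligned on the index), then set the index by name.
by_label = df.assign(quarter=quarter).set_index("quarter")
```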

See also

DeferredDataFrame.reset_index

Opposite of set_index.

DeferredDataFrame.reindex

Change to new indices or expand indices.

DeferredDataFrame.reindex_like

Change to same indices as other DeferredDataFrame.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'month': [1, 4, 7, 10],
...                    'year': [2012, 2014, 2013, 2014],
...                    'sale': [55, 40, 84, 31]})
>>> df
   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31

Set the index to become the 'month' column:

>>> df.set_index('month')
       year  sale
month
1      2012    55
4      2014    40
7      2013    84
10     2014    31

Create a MultiIndex using columns 'year' and 'month':

>>> df.set_index(['year', 'month'])
            sale
year  month
2012  1     55
2014  4     40
2013  7     84
2014  10    31

Create a MultiIndex using an Index and a column:

>>> df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
         month  sale
   year
1  2012  1      55
2  2014  4      40
3  2013  7      84
4  2014  10     31

Create a MultiIndex using two Series:

>>> s = pd.Series([1, 2, 3, 4])
>>> df.set_index([s, s**2])
      month  year  sale
1 1       1  2012    55
2 4       4  2014    40
3 9       7  2013    84
4 16     10  2014    31
set_axis(labels, axis, **kwargs)[source]

Assign desired index to given axis.

Indexes for column or row labels can be changed by assigning a list-like or Index.

Parameters:
  • labels (list-like, Index) – The values for the new index.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to update. The value 0 identifies the rows. For DeferredSeries this parameter is unused and defaults to 0.

  • copy (bool, default True) –

    Whether to make a copy of the underlying data.

    Added in version 1.5.0.

Returns:

An object of type DeferredDataFrame.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DataFrame.rename_axis

Alter the name of the index or columns.

Examples

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

Change the row labels.

>>> df.set_axis(['a', 'b', 'c'], axis='index')
   A  B
a  1  4
b  2  5
c  3  6

Change the column labels.

>>> df.set_axis(['I', 'II'], axis='columns')
   I  II
0  1   4
1  2   5
2  3   6
property axes

Return a list representing the axes of the DataFrame.

It has the row axis labels and column axis labels as the only members. They are returned in that order.

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.axes
[RangeIndex(start=0, stop=2, step=1), Index(['col1', 'col2'],
dtype='object')]
property dtypes

Return the dtypes in the DataFrame.

This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype. See the User Guide for more.

Returns:

The data type of each column.

Return type:

pandas.Series

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'float': [1.0],
...                    'int': [1],
...                    'datetime': [pd.Timestamp('20180310')],
...                    'string': ['foo']})
>>> df.dtypes
float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object
assign(**kwargs)[source]

Assign new columns to a DataFrame.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters:

**kwargs (dict of {str: callable or DeferredSeries}) – The column names are keywords. If the values are callable, they are computed on the DeferredDataFrame and assigned to the new columns. The callable must not change input DeferredDataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a DeferredSeries, scalar, or array), they are simply assigned.

Returns:

A new DeferredDataFrame with the new columns in addition to all the existing columns.

Return type:

DeferredDataFrame

Differences from pandas

value must be a callable or DeferredSeries. Other types make this operation order-sensitive.
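
The order sensitivity can be seen in eager pandas with a small, made-up frame: a Series is matched to rows by index label, whereas a plain list is matched by position. The names reordered, from_series and from_list are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"temp_c": [17.0, 25.0]}, index=["Portland", "Berkeley"])
reordered = df.loc[["Berkeley", "Portland"]]

# A Series is matched to rows by index label, so row order is irrelevant.
temp_f = df["temp_c"] * 9 / 5 + 32
from_series = reordered.assign(temp_f=temp_f)

# A plain list is matched by position, so the result depends on row order.
from_list = reordered.assign(temp_f=[62.6, 77.0])
```

After reordering, the Series version still pairs Portland with its own value, while the list version pairs Portland with the value that was meant for Berkeley. A distributed dataset has no fixed row order, which is why only the label-aligned forms are accepted.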

Notes

Assigning multiple columns within the same assign is possible. Later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
          temp_c
Portland    17.0
Berkeley    25.0

Where the value is a callable, evaluated on `df`:

>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Alternatively, the same behavior can be achieved by directly
referencing an existing Series or sequence:

>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

You can create multiple columns within the same assign where one
of the columns depends on another one defined within the same assign:

>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
explode(column, ignore_index)[source]

Transform each element of a list-like to a row, replicating index values.

Parameters:
  • column (IndexLabel) –

    Column(s) to explode. For multiple columns, specify a non-empty list in which each element is a str or tuple; the list-like data of all specified columns must have matching length on each row of the frame.

    Added in version 1.3.0: Multi-column explode

  • ignore_index (bool, default False) – If True, the resulting index will be labeled 0, 1, …, n - 1.

Returns:

Exploded lists to rows of the subset columns; index will be duplicated for these rows.

Return type:

DeferredDataFrame

Raises:

ValueError :

  • If columns of the frame are not unique.

  • If the specified columns to explode are an empty list.

  • If the specified columns to explode do not have a matching count of elements row-wise in the frame.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.unstack

Pivot a level of the (necessarily hierarchical) index labels.

DeferredDataFrame.melt

Unpivot a DeferredDataFrame from wide format to long format.

DeferredSeries.explode

Explode a DeferredDataFrame from list-like columns to long format.

Notes

This routine will explode list-likes including lists, tuples, sets, DeferredSeries, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when exploding sets.

Reference the user guide for more examples.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
...                    'B': 1,
...                    'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
>>> df
           A  B          C
0  [0, 1, 2]  1  [a, b, c]
1        foo  1        NaN
2         []  1         []
3     [3, 4]  1     [d, e]

Single-column explode.

>>> df.explode('A')
     A  B          C
0    0  1  [a, b, c]
0    1  1  [a, b, c]
0    2  1  [a, b, c]
1  foo  1        NaN
2  NaN  1         []
3    3  1     [d, e]
3    4  1     [d, e]

Multi-column explode.

>>> df.explode(list('AC'))
     A  B    C
0    0  1    a
0    1  1    b
0    2  1    c
1  foo  1  NaN
2  NaN  1  NaN
3    3  1    d
3    4  1    e
insert(value, **kwargs)[source]

Insert column into DataFrame at specified location.

Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.

Parameters:
  • loc (int) – Insertion index. Must verify 0 <= loc <= len(columns).

  • column (str, number, or hashable object) – Label of the inserted column.

  • value (Scalar, DeferredSeries, or array-like)

  • allow_duplicates (bool, optional, default lib.no_default)

Differences from pandas

value cannot be a list because aligning it with this DeferredDataFrame is order-sensitive.

See also

Index.insert

Insert new item by index.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
>>> df.insert(1, "newcol", [99, 99])
>>> df
   col1  newcol  col2
0     1      99     3
1     2      99     4
>>> df.insert(0, "col1", [100, 100], allow_duplicates=True)
>>> df
   col1  col1  newcol  col2
0   100     1      99     3
1   100     2      99     4

Notice that pandas uses index alignment in case of `value` from type `Series`:

>>> df.insert(0, "col0", pd.Series([5, 6], index=[1, 2]))
>>> df
   col0  col1  col1  newcol  col2
0   NaN   100     1      99     3
1   5.0   100     2      99     4
static from_dict(*args, **kwargs)[source]

Construct DataFrame from dict of array-like or dicts.

Creates DataFrame object from dictionary by columns or by index allowing dtype specification.

Parameters:
  • data (dict) – Of the form {field : array-like} or {field : dict}.

  • orient ({'columns', 'index', 'tight'}, default 'columns') –

    The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DeferredDataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’. If ‘tight’, assume a dict with keys [‘index’, ‘columns’, ‘data’, ‘index_names’, ‘column_names’].

    Added in version 1.4.0: ‘tight’ as an allowed value for the orient argument

  • dtype (dtype, default None) – Data type to force after DeferredDataFrame construction, otherwise infer.

  • columns (list, default None) – Column labels to use when orient='index'. Raises a ValueError if used with orient='columns' or orient='tight'.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.from_records

DeferredDataFrame from structured ndarray, sequence of tuples or dicts, or DeferredDataFrame.

DeferredDataFrame

DeferredDataFrame object creation using constructor.

DeferredDataFrame.to_dict

Convert the DeferredDataFrame to a dictionary.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

By default the keys of the dict become the DataFrame columns:

>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Specify ``orient='index'`` to create the DataFrame using dictionary
keys as rows:

>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data, orient='index')
       0  1  2  3
row_1  3  2  1  0
row_2  a  b  c  d

When using the 'index' orientation, the column names can be
specified manually:

>>> pd.DataFrame.from_dict(data, orient='index',
...                        columns=['A', 'B', 'C', 'D'])
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d

Specify ``orient='tight'`` to create the DataFrame using a 'tight'
format:

>>> data = {'index': [('a', 'b'), ('a', 'c')],
...         'columns': [('x', 1), ('y', 2)],
...         'data': [[1, 3], [2, 4]],
...         'index_names': ['n1', 'n2'],
...         'column_names': ['z1', 'z2']}
>>> pd.DataFrame.from_dict(data, orient='tight')
z1     x  y
z2     1  2
n1 n2
a  b   1  3
   c   2  4
static from_records(*args, **kwargs)[source]

Convert structured or record ndarray to DataFrame.

Creates a DataFrame object from a structured ndarray, sequence of tuples or dicts, or DataFrame.

Parameters:
  • data (structured ndarray, sequence of tuples or dicts, or DeferredDataFrame) –

    Structured input data.

    Deprecated since version 2.1.0: Passing a DeferredDataFrame is deprecated.

  • index (str, list of fields, array-like) – Field of array to use as the index, alternately a specific set of input labels to use.

  • exclude (sequence, default None) – Columns or fields to exclude.

  • columns (sequence, default None) – Column names to use. If the passed data do not have names associated with them, this argument provides names for the columns. Otherwise this argument indicates the order of the columns in the result (any names not found in the data will become all-NA columns).

  • coerce_float (bool, default False) – Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.

  • nrows (int, default None) – Number of rows to read if data is an iterator.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.from_dict

DeferredDataFrame from dict of array-like or dicts.

DeferredDataFrame

DeferredDataFrame object creation using constructor.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Data can be provided as a structured ndarray:

>>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')],
...                 dtype=[('col_1', 'i4'), ('col_2', 'U1')])
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Data can be provided as a list of dicts:

>>> data = [{'col_1': 3, 'col_2': 'a'},
...         {'col_1': 2, 'col_2': 'b'},
...         {'col_1': 1, 'col_2': 'c'},
...         {'col_1': 0, 'col_2': 'd'}]
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Data can be provided as a list of tuples with corresponding columns:

>>> data = [(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')]
>>> pd.DataFrame.from_records(data, columns=['col_1', 'col_2'])
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
duplicated(keep, subset)[source]

Return boolean Series denoting duplicate rows.

Considering certain columns is optional.

Parameters:
  • subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.

  • keep ({'first', 'last', False}, default 'first') –

    Determines which duplicates (if any) to mark.

    • first : Mark duplicates as True except for the first occurrence.

    • last : Mark duplicates as True except for the last occurrence.

    • False : Mark all duplicates as True.

Returns:

Boolean series indicating, for each row, whether it is a duplicate.

Return type:

DeferredSeries

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.
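
To see why keep=False is safe while "first" and "last" are not, the following eager pandas sketch compares results on a small, made-up frame and on the same rows in reverse order. The variable names are illustrative, and keep="any" is not shown because pandas itself does not accept it.

```python
import pandas as pd

df = pd.DataFrame({"brand": ["Yum Yum", "Yum Yum", "Indomie"],
                   "style": ["cup", "cup", "pack"]})
reversed_df = df.iloc[::-1]

# keep=False flags every member of a duplicate group, so each row label
# gets the same answer however the rows are ordered.
stable = df.duplicated(keep=False)
stable_reversed = reversed_df.duplicated(keep=False).sort_index()

# keep="first" depends on which duplicate happens to come first.
first = df.duplicated(keep="first")
first_reversed = reversed_df.duplicated(keep="first").sort_index()
```

Here stable and stable_reversed agree label by label, while first and first_reversed flag different rows of the same duplicate pair.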

See also

Index.duplicated

Equivalent method on index.

DeferredSeries.duplicated

Equivalent method on DeferredSeries.

DeferredSeries.drop_duplicates

Remove duplicate values from DeferredSeries.

DeferredDataFrame.drop_duplicates

Remove duplicate values from DeferredDataFrame.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Consider a dataset containing ramen ratings.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, for each set of duplicated values, the first occurrence
is set to False and all others to True.

>>> df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool

By using 'last', the last occurrence of each set of duplicated values
is set to False and all others to True.

>>> df.duplicated(keep='last')
0     True
1    False
2    False
3    False
4    False
dtype: bool

By setting ``keep`` to False, all duplicates are True.

>>> df.duplicated(keep=False)
0     True
1     True
2    False
3    False
4    False
dtype: bool

To find duplicates on specific column(s), use ``subset``.

>>> df.duplicated(subset=['brand'])
0    False
1     True
2    False
3     True
4     True
dtype: bool
drop_duplicates(keep, subset, ignore_index)[source]

Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes, are ignored.

Parameters:
  • subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.

  • keep ({'first', 'last', False}, default 'first') –

    Determines which duplicates (if any) to keep.

    • 'first' : Drop duplicates except for the first occurrence.

    • 'last' : Drop duplicates except for the last occurrence.

    • False : Drop all duplicates.

  • inplace (bool, default False) – Whether to modify the DeferredDataFrame rather than creating a new one.

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns:

DeferredDataFrame with duplicates removed or None if inplace=True.

Return type:

DeferredDataFrame or None

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.
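
What keep="any" promises can be approximated in eager pandas by comparing keep="first" with keep="last" on a small, made-up frame: the surviving row labels differ, but the surviving row values are the same. This is only an emulation for illustration, since pandas does not accept keep="any".

```python
import pandas as pd

df = pd.DataFrame({"brand": ["Yum Yum", "Yum Yum", "Indomie"],
                   "style": ["cup", "cup", "pack"]})

kept_first = df.drop_duplicates(keep="first")
kept_last = df.drop_duplicates(keep="last")

# The surviving row labels differ between the two policies.
labels_first = list(kept_first.index)
labels_last = list(kept_last.index)

# The surviving row values are the same set: one representative per
# distinct row, without saying which original row it came from.
values_first = set(kept_first.itertuples(index=False, name=None))
values_last = set(kept_last.itertuples(index=False, name=None))
```

Code that relies only on the values, and not on which original row label survived, should behave the same under any of these policies.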

See also

DeferredDataFrame.value_counts

Count unique combinations of columns.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Consider a dataset containing ramen ratings.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use ``subset``.

>>> df.drop_duplicates(subset=['brand'])
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurrences, use ``keep``.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
aggregate(func, axis, *args, **kwargs)[source]

Aggregate using one or more operations over the specified axis.

Parameters:
  • func (function, str, list or dict) –

    Function to use for aggregating the data. If a function, must either work when passed a DeferredDataFrame or when passed to DeferredDataFrame.apply.

    Accepted combinations are:

    • function

    • string function name

    • list of functions and/or function names, e.g. [np.sum, 'mean']

    • dict of axis labels -> functions, function names or list of such.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.

  • *args – Positional arguments to pass to func.

  • **kwargs – Keyword arguments to pass to func.

Returns:

The return can be:

  • scalar : when DeferredSeries.agg is called with single function

  • DeferredSeries : when DeferredDataFrame.agg is called with a single function

  • DeferredDataFrame : when DeferredDataFrame.agg is called with several functions

Return scalar, DeferredSeries or DeferredDataFrame.

Return type:

scalar, DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.apply

Perform any type of operations.

DeferredDataFrame.transform

Perform transformation type operations.

core.groupby.GroupBy

Perform operations over groups.

core.resample.Resampler

Perform operations over resampled bins.

core.window.Rolling

Perform operations over rolling window.

core.window.Expanding

Perform operations over expanding window.

core.window.ExponentialMovingWindow

Perform operation over exponential weighted window.

Notes

The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
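
The difference in defaults can be checked with a small eager example; the array values and variable names below are made up for illustration.

```python
import numpy as np
import pandas as pd

arr_2d = np.array([[1.0, 2.0], [3.0, 4.0]])
df = pd.DataFrame(arr_2d, columns=["A", "B"])

# NumPy default: aggregate over the flattened array, giving one number.
flattened_mean = np.mean(arr_2d)

# NumPy with an explicit axis: one number per column.
per_column_np = np.mean(arr_2d, axis=0)

# pandas default (axis=0): one number per column, labelled by column name.
per_column_pd = df.agg("mean")
```

The pandas result matches np.mean(arr_2d, axis=0), not the flattened np.mean(arr_2d).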

agg is an alias for aggregate. Use the alias.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

A passed user-defined-function will be passed a DeferredSeries for evaluation.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])

Aggregate these functions over the rows.

>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0

Different aggregations per column.

>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0

Aggregate different functions over the columns and rename the index of the resulting
DataFrame.

>>> df.agg(x=('A', 'max'), y=('B', 'min'), z=('C', 'mean'))
     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0

Aggregate over the columns.

>>> df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
agg(func, axis, *args, **kwargs)

Aggregate using one or more operations over the specified axis.

Parameters:
  • func (function, str, list or dict) –

    Function to use for aggregating the data. If a function, must either work when passed a DeferredDataFrame or when passed to DeferredDataFrame.apply.

    Accepted combinations are:

    • function

    • string function name

    • list of functions and/or function names, e.g. [np.sum, 'mean']

    • dict of axis labels -> functions, function names or list of such.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.

  • *args – Positional arguments to pass to func.

  • **kwargs – Keyword arguments to pass to func.

Returns:

The return can be:

  • scalar : when DeferredSeries.agg is called with single function

  • DeferredSeries : when DeferredDataFrame.agg is called with a single function

  • DeferredDataFrame : when DeferredDataFrame.agg is called with several functions

Return scalar, DeferredSeries or DeferredDataFrame.

Return type:

scalar, DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.apply

Perform any type of operations.

DeferredDataFrame.transform

Perform transformation type operations.

core.groupby.GroupBy

Perform operations over groups.

core.resample.Resampler

Perform operations over resampled bins.

core.window.Rolling

Perform operations over rolling window.

core.window.Expanding

Perform operations over expanding window.

core.window.ExponentialMovingWindow

Perform operation over exponential weighted window.

Notes

The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
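As a small sketch of the axis note above, the following compares NumPy's flattened default with the per-axis behavior of agg. It uses plain pandas objects (not deferred ones), and the array values are made up for illustration.

```python
import numpy as np
import pandas as pd

arr_2d = np.array([[1.0, 2.0], [3.0, 4.0]])
df = pd.DataFrame(arr_2d, columns=["A", "B"])

# NumPy flattens by default: one mean over all four values.
flat_mean = np.mean(arr_2d)

# agg always works along an axis: the default (index) gives one mean per column.
col_means = df.agg("mean")

# With axis="columns" the result holds one mean per row.
row_means = df.agg("mean", axis="columns")
```

Here flat_mean is a single scalar, while col_means and row_means are Series indexed by column label and row label respectively.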

agg is an alias for aggregate. Use the alias.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

A passed user-defined-function will be passed a DeferredSeries for evaluation.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])

Aggregate these functions over the rows.

>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0

Different aggregations per column.

>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0

Aggregate different functions over the columns and rename the index of the resulting
DataFrame.

>>> df.agg(x=('A', 'max'), y=('B', 'min'), z=('C', 'mean'))
     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0

Aggregate over the columns.

>>> df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
applymap(**kwargs)

Apply a function to a DataFrame elementwise.

Deprecated since version 2.1.0: DataFrame.applymap has been deprecated. Use DataFrame.map instead.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

Parameters:
  • func (callable) – Python function, returns a single value from a single value.

  • na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to func.

  • **kwargs – Additional keyword arguments to pass as keyword arguments to func.

Returns:

Transformed DeferredDataFrame.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.apply

Apply a function along input axis of DeferredDataFrame.

DeferredDataFrame.map

Apply a function to a DeferredDataFrame elementwise.

DeferredDataFrame.replace

Replace values given in to_replace with value.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])
>>> df
       0      1
0  1.000  2.120
1  3.356  4.567

>>> df.map(lambda x: len(str(x)))
   0  1
0  3  4
1  5  5
map(**kwargs)

Apply a function to a DataFrame elementwise.

Added in version 2.1.0: DataFrame.applymap was deprecated and renamed to DataFrame.map.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

Parameters:
  • func (callable) – Python function, returns a single value from a single value.

  • na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to func.

  • **kwargs – Additional keyword arguments to pass as keyword arguments to func.

Returns:

Transformed DeferredDataFrame.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.apply

Apply a function along input axis of DeferredDataFrame.

DeferredDataFrame.replace

Replace values given in to_replace with value.

DeferredSeries.map

Apply a function elementwise on a DeferredSeries.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])
>>> df
       0      1
0  1.000  2.120
1  3.356  4.567

>>> df.map(lambda x: len(str(x)))
   0  1
0  3  4
1  5  5

Like Series.map, NA values can be ignored:

>>> df_copy = df.copy()
>>> df_copy.iloc[0, 0] = pd.NA
>>> df_copy.map(lambda x: len(str(x)), na_action='ignore')
     0  1
0  NaN  4
1  5.0  5

Note that a vectorized version of `func` often exists, which will
be much faster. You could square each number elementwise.

>>> df.map(lambda x: x**2)
           0          1
0   1.000000   4.494400
1  11.262736  20.857489

But it's better to avoid map in that case.

>>> df ** 2
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
add_prefix(**kwargs)

Prefix labels with string prefix.

For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.

Parameters:
  • prefix (str) – The string to add before each label.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) –

    Axis to add prefix on

    Added in version 2.0.0.

Returns:

New DeferredSeries or DeferredDataFrame with updated labels.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.add_suffix

Suffix row labels with string suffix.

DeferredDataFrame.add_suffix

Suffix column labels with string suffix.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_prefix('item_')
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_prefix('col_')
     col_A  col_B
0       1       3
1       2       4
2       3       5
3       4       6
add_suffix(**kwargs)

Suffix labels with string suffix.

For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.

Parameters:
  • suffix (str) – The string to add after each label.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) –

    Axis to add suffix on

    Added in version 2.0.0.

Returns:

New DeferredSeries or DeferredDataFrame with updated labels.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.add_prefix

Prefix row labels with string prefix.

DeferredDataFrame.add_prefix

Prefix column labels with string prefix.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_suffix('_item')
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_suffix('_col')
     A_col  B_col
0       1       3
1       2       4
2       3       5
3       4       6
memory_usage(**kwargs)

pandas.DataFrame.memory_usage() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

info(**kwargs)

pandas.DataFrame.info() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

clip(axis, **kwargs)[source]

Trim values at input threshold(s).

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.

Parameters:
  • lower (float or array-like, default None) – Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.

  • upper (float or array-like, default None) – Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Align object with lower and upper along the given axis. For DeferredSeries this parameter is unused and defaults to None.

  • inplace (bool, default False) – Whether to perform the operation in place on the data.

  • *args – Additional positional arguments have no effect but might be accepted for compatibility with numpy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns:

Same type as calling object with the values outside the clip boundaries replaced or None if inplace=True.

Return type:

DeferredSeries or DeferredDataFrame or None

Differences from pandas

lower and upper must be DeferredSeries instances, or constants. Array-like arguments are not supported because they are order-sensitive.
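A minimal sketch of why a labeled Series threshold is order-insensitive, shown with plain pandas objects rather than deferred ones. The row labels 'a', 'b', 'c' are hypothetical; the point is that a Series is matched to rows by index label, so reordering the rows does not change which threshold applies to which row.

```python
import pandas as pd

df = pd.DataFrame({"col_0": [9, -3, 0], "col_1": [-2, -7, 6]},
                  index=["a", "b", "c"])

# Thresholds carry their own labels, so they align by index, not by position.
lower = pd.Series([2, -4, -1], index=["a", "b", "c"])

# Reorder the rows to mimic data arriving in a different order.
shuffled = df.loc[["c", "a", "b"]]

clipped = df.clip(lower=lower, axis=0)
clipped_shuffled = shuffled.clip(lower=lower, axis=0)
```

Each row receives the same threshold in both calls. A positional list such as [2, -4, -1] has no labels, so its meaning would depend on row order, which is the reason given above for not supporting array-like arguments.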

See also

DeferredSeries.clip

Trim values at input threshold in series.

DeferredDataFrame.clip

Trim values at input threshold in dataframe.

numpy.clip

Clip (limit) the values in an array.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])
>>> t
0    2
1   -4
2   -1
3    6
4    3
dtype: int64

>>> df.clip(t, t + 4, axis=0)
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3

Clips using specific lower threshold per column element, with missing values:

>>> t = pd.Series([2, -4, np.nan, 6, 3])
>>> t
0    2.0
1   -4.0
2    NaN
3    6.0
4    3.0
dtype: float64

>>> df.clip(t, axis=0)
   col_0  col_1
0      9      2
1     -3     -4
2      0      6
3      6      8
4      5      3
corr(method, min_periods)[source]

Compute pairwise correlation of columns, excluding NA/null values.

Parameters:
  • method ({'pearson', 'kendall', 'spearman'} or callable) –

    Method of correlation:

    • pearson : standard correlation coefficient

    • kendall : Kendall Tau correlation coefficient

    • spearman : Spearman rank correlation

    • callable: a callable that takes two 1d ndarrays as input and returns a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

  • min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    Added in version 1.5.0.

    Changed in version 2.0.0: The default value of numeric_only is now False.

Returns:

Correlation matrix.

Return type:

DeferredDataFrame

Differences from pandas

Only method="pearson" can be parallelized. Other methods require collecting all data on a single worker (see https://s.apache.org/dataframe-non-parallel-operations for details).
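One way to see why Pearson correlation lends itself to parallel execution: it can be derived from a handful of running sums that are computed per partition and then added together. The sketch below is a pure-Python illustration of that idea, not Beam's actual implementation; the function names and the partition layout are hypothetical. Rank-based methods such as Spearman and Kendall need a global ordering of the values, which is consistent with the single-worker requirement stated above.

```python
import math
from functools import reduce


def partial_sums(pairs):
    """Summarize one partition as (n, sum_x, sum_y, sum_xx, sum_yy, sum_xy)."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)


def merge(a, b):
    """Combine two partition summaries by elementwise addition."""
    return tuple(u + v for u, v in zip(a, b))


def pearson(stats):
    """Compute the Pearson coefficient from the merged sums."""
    n, sx, sy, sxx, syy, sxy = stats
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / math.sqrt(var_x * var_y)


# Three hypothetical partitions of (x, y) observations.
partitions = [[(1.0, 2.0), (2.0, 4.1)], [(3.0, 5.9)], [(4.0, 8.2), (5.0, 9.8)]]

r_parallel = pearson(reduce(merge, map(partial_sums, partitions)))
r_single = pearson(partial_sums([p for part in partitions for p in part]))
```

Up to floating-point rounding, r_parallel and r_single agree, regardless of how the observations are split across partitions.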

See also

DeferredDataFrame.corrwith

Compute pairwise correlation with another DeferredDataFrame or DeferredSeries.

DeferredSeries.corr

Compute the correlation between two DeferredSeries.

Notes

Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0

>>> df = pd.DataFrame([(1, 1), (2, np.nan), (np.nan, 3), (4, 4)],
...                   columns=['dogs', 'cats'])
>>> df.corr(min_periods=3)
      dogs  cats
dogs   1.0   NaN
cats   NaN   1.0
cov(min_periods, ddof)[source]

Compute pairwise covariance of columns, excluding NA/null values.

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters:
  • min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result.

  • ddof (int, default 1) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. This argument is applicable only when no nan is in the dataframe.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    Added in version 1.5.0.

    Changed in version 2.0.0: The default value of numeric_only is now False.

Returns:

The covariance matrix of the series of the DeferredDataFrame.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.cov

Compute covariance with another DeferredSeries.

core.window.ewm.ExponentialMovingWindow.cov

Exponential weighted sample covariance.

core.window.expanding.Expanding.cov

Expanding sample covariance.

core.window.rolling.Rolling.cov

Rolling sample covariance.

Notes

Returns the covariance matrix of the DeferredDataFrame’s time series. The covariance is normalized by N-ddof.
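To make the N - ddof normalization concrete, here is a small pure-Python sketch that computes a single covariance entry by hand, using the dogs and cats values from the first example in the Examples section. The helper name covariance is hypothetical and is not part of pandas or Beam.

```python
def covariance(xs, ys, ddof=1):
    """Covariance of two equal-length sequences, divided by N - ddof."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - ddof)


dogs = [1, 0, 2, 1]
cats = [2, 3, 0, 1]

sample_cov = covariance(dogs, cats)              # divisor N - 1 = 3
population_cov = covariance(dogs, cats, ddof=0)  # divisor N = 4
dogs_var = covariance(dogs, dogs)                # diagonal entry for dogs
```

With the default ddof=1, sample_cov is -1.0 and dogs_var is about 0.666667, matching the corresponding entries of the matrix returned by df.cov() in that example.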

For DeferredDataFrames that have DeferredSeries that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member DeferredSeries.

However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
...                   columns=['dogs', 'cats'])
>>> df.cov()
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667

>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(1000, 5),
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword
that specifies the required minimum number of non-NA observations for
each column pair in order to have a valid result:

>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> df.cov(min_periods=12)
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
corrwith(other, axis, drop, method)[source]

Compute pairwise correlation.

Pairwise correlation is computed between rows or columns of DataFrame with rows or columns of Series or DataFrame. DataFrames are first aligned along both axes before computing the correlations.

Parameters:
  • other (DeferredDataFrame, DeferredSeries) – Object with which to compute correlations.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ to compute row-wise, 1 or ‘columns’ for column-wise.

  • drop (bool, default False) – Drop missing indices from result.

  • method ({'pearson', 'kendall', 'spearman'} or callable) –

    Method of correlation:

    • pearson : standard correlation coefficient

    • kendall : Kendall Tau correlation coefficient

    • spearman : Spearman rank correlation

    • callable: a callable that takes two 1d ndarrays as input and returns a float.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    Added in version 1.5.0.

    Changed in version 2.0.0: The default value of numeric_only is now False.

Returns:

Pairwise correlations.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.corr

Compute pairwise correlation of columns.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> index = ["a", "b", "c", "d", "e"]
>>> columns = ["one", "two", "three", "four"]
>>> df1 = pd.DataFrame(np.arange(20).reshape(5, 4), index=index, columns=columns)
>>> df2 = pd.DataFrame(np.arange(16).reshape(4, 4), index=index[:4], columns=columns)
>>> df1.corrwith(df2)
one      1.0
two      1.0
three    1.0
four     1.0
dtype: float64

>>> df2.corrwith(df1, axis=1)
a    1.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64
cummax(**kwargs)

pandas.DataFrame.cummax() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cummin(**kwargs)

pandas.DataFrame.cummin() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cumprod(**kwargs)

pandas.DataFrame.cumprod() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

cumsum(**kwargs)

pandas.DataFrame.cumsum() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.
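As a rough illustration of what "sensitive to the order of the data" means for the cumulative operations listed here, the pure-Python sketch below compares a running sum with a plain sum over the same values in two different orders. The values are made up, and itertools.accumulate stands in for cumsum.

```python
from itertools import accumulate

values = [3, 1, 2]
reordered = [1, 2, 3]  # same elements, different arrival order

# A plain sum does not depend on order.
total_a = sum(values)
total_b = sum(reordered)

# A running sum does: each output depends on everything that came before it.
running_a = list(accumulate(values))
running_b = list(accumulate(reordered))
```

Both totals are 6, but the running sums are [3, 4, 6] and [1, 3, 6]. Because the elements of a PCollection have no guaranteed order, an operation of the second kind has no single well-defined result.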

diff(**kwargs)

pandas.DataFrame.diff() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

interpolate(**kwargs)

pandas.DataFrame.interpolate() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

pct_change(**kwargs)

pandas.DataFrame.pct_change() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

asof(**kwargs)

pandas.DataFrame.asof() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

first_valid_index(**kwargs)

pandas.DataFrame.first_valid_index() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

last_valid_index(**kwargs)

pandas.DataFrame.last_valid_index() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

property iat

pandas.DataFrame.iat() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information see https://s.apache.org/dataframe-order-sensitive-operations.

lookup(**kwargs)

pandas.DataFrame.lookup() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

head(**kwargs)

pandas.DataFrame.head() is not yet supported in the Beam DataFrame API because it is order-sensitive.

If you want to peek at a large dataset consider using interactive Beam’s ib.collect with n specified, or sample(). If you want to find the N largest elements, consider using DeferredDataFrame.nlargest().
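The difference between head and the suggested nlargest alternative can be sketched with plain pandas objects (not deferred ones). The data below is hypothetical; the second frame holds the same rows in a different order.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 8, 0]},
                  index=["falcon", "dog", "spider", "fish"])
shuffled = df.loc[["fish", "spider", "falcon", "dog"]]

# head depends on row order, so the two frames disagree.
head_labels = list(df.head(2).index)
head_labels_shuffled = list(shuffled.head(2).index)

# nlargest selects by value, so the row order does not matter.
top_two = df.nlargest(2, "num_legs")
top_two_shuffled = shuffled.nlargest(2, "num_legs")
```

Here top_two and top_two_shuffled contain the same rows (spider, then dog), whereas the two head results differ. Note that with tied values nlargest may still break ties by position.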

tail(**kwargs)

pandas.DataFrame.tail() is not yet supported in the Beam DataFrame API because it is order-sensitive.

If you want to peek at a large dataset consider using interactive Beam’s ib.collect with n specified, or sample(). If you want to find the N largest elements, consider using DeferredDataFrame.nlargest().

sample(n, frac, replace, weights, random_state, axis)[source]

Return a random sample of items from an axis of object.

You can use random_state for reproducibility.

Parameters:
  • n (int, optional) – Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.

  • frac (float, optional) – Fraction of axis items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the same row more than once.

  • weights (str or ndarray-like, optional) – Default ‘None’ results in equal probability weighting. If passed a DeferredSeries, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DeferredDataFrame, will accept the name of a column when axis = 0. Unless weights are a DeferredSeries, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. Infinite values not allowed.

  • random_state (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) –

    If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.

    Changed in version 1.4.0: np.random.Generator objects now accepted

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Axis to sample. Accepts axis number or name. Default is stat axis for given data type. For DeferredSeries this parameter is unused and defaults to None.

  • ignore_index (bool, default False) –

    If True, the resulting index will be labeled 0, 1, …, n - 1.

    Added in version 1.3.0.

Returns:

A new object of same type as caller containing n items randomly sampled from the caller object.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

When axis='index', only n and/or weights may be specified. frac, random_state, and replace=True are not yet supported. See Issue 21010.

Note that pandas will raise an error if n is larger than the length of the dataset, while the Beam DataFrame API will simply return the full dataset in that case.

sample is fully supported for axis=’columns’.
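The pandas side of the n-larger-than-dataset difference can be reproduced as follows. This sketch uses plain pandas only; the capped call at the end is one hypothetical way to get, in pandas, the "return the full dataset" behavior that is described above for the Beam DataFrame API.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 8, 0]},
                  index=["falcon", "dog", "spider", "fish"])

# pandas refuses to draw more rows than exist when sampling without replacement.
try:
    df.sample(n=10)
    raised = False
except ValueError:
    raised = True

# Capping n at the dataset length returns every row (in random order).
capped = df.sample(n=min(10, len(df)))
```

After running this, raised is True and capped contains all four rows.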

See also

DeferredDataFrameGroupBy.sample

Generates random samples from each group of a DeferredDataFrame object.

DeferredSeriesGroupBy.sample

Generates random samples from each group of a DeferredSeries object.

numpy.random.choice

Generates a random sample from a given 1-D numpy array.

Notes

If frac > 1, replacement should be set to True.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

Extract 3 random elements from the ``Series`` ``df['num_legs']``:
Note that we use `random_state` to ensure the reproducibility of
the examples.

>>> df['num_legs'].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64

A random 50% sample of the ``DataFrame`` with replacement:

>>> df.sample(frac=0.5, replace=True, random_state=1)
      num_legs  num_wings  num_specimen_seen
dog          4          0                  2
fish         0          0                  8

An upsampled sample of the DataFrame with replacement.
Note that the replace parameter has to be True when frac > 1.

>>> df.sample(frac=2, replace=True, random_state=1)
        num_legs  num_wings  num_specimen_seen
dog            4          0                  2
fish           0          0                  8
falcon         2          2                 10
falcon         2          2                 10
fish           0          0                  8
dog            4          0                  2
fish           0          0                  8
dog            4          0                  2

Using a DataFrame column as weights. Rows with larger value in the
`num_specimen_seen` column are more likely to be sampled.

>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8
dot(other)[source]

Compute the matrix multiplication between the DataFrame and other.

This method computes the matrix product between the DataFrame and the values of another Series, DataFrame or a numpy array.

It can also be called using self @ other.

Parameters:

other (DeferredSeries, DeferredDataFrame or array-like) – The other object to compute the matrix product with.

Returns:

If other is a DeferredSeries, return the matrix product between self and other as a DeferredSeries. If other is a DeferredDataFrame or a numpy.array, return the matrix product of self and other as a DeferredDataFrame.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.dot

Similar method for DeferredSeries.

Notes

The dimensions of DeferredDataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of DeferredDataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.

The dot method for DeferredSeries computes the inner product, instead of the matrix product here.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Here we multiply a DataFrame with a Series.

>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> s = pd.Series([1, 1, 2, 1])
>>> df.dot(s)
0    -4
1     5
dtype: int64

Here we multiply a DataFrame with another DataFrame.

>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(other)
    0   1
0   1   4
1   2   2

Note that the dot method gives the same result as @

>>> df @ other
    0   1
0   1   4
1   2   2

The dot method also works if other is an np.array.

>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(arr)
    0   1
0   1   4
1   2   2

Note how shuffling of the objects does not change the result.

>>> s2 = s.reindex([1, 0, 2, 3])
>>> df.dot(s2)
0    -4
1     5
dtype: int64
mode(axis=0, *args, **kwargs)[source]

Get the mode(s) of each element along the selected axis.

The mode of a set of values is the value that appears most often. It can be multiple values.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) –

    The axis to iterate over while searching for the mode:

    • 0 or ‘index’ : get mode of each column

    • 1 or ‘columns’ : get mode of each row.

  • numeric_only (bool, default False) – If True, only apply to numeric columns.

  • dropna (bool, default True) – Don’t consider counts of NaN/NaT.

Returns:

The modes of each column or row.

Return type:

DeferredDataFrame

Differences from pandas

mode with axis="columns" is not implemented because it produces non-deferred columns.

mode with axis="index" is not currently parallelizable. An approximate, parallelizable implementation of mode may be added in the future (Issue 20946).

See also

DeferredSeries.mode

Return the highest frequency value in a DeferredSeries.

DeferredSeries.value_counts

Return the counts of values in a DeferredSeries.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame([('bird', 2, 2),
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))
>>> df
           species  legs  wings
falcon        bird     2    2.0
horse       mammal     4    NaN
spider   arthropod     8    0.0
ostrich       bird     2    NaN

By default, missing values are not considered, and the modes of wings
are both 0 and 2. Because the resulting DataFrame has two rows,
the second row of ``species`` and ``legs`` contains ``NaN``.

>>> df.mode()
  species  legs  wings
0    bird   2.0    0.0
1     NaN   NaN    2.0

With ``dropna=False``, ``NaN`` values are considered and they can be
the mode (as for wings).

>>> df.mode(dropna=False)
  species  legs  wings
0    bird     2    NaN

Setting ``numeric_only=True``, only the mode of numeric columns is
computed, and columns of other types are ignored.

>>> df.mode(numeric_only=True)
   legs  wings
0   2.0    0.0
1   NaN    2.0

To compute the mode over columns and not rows, use the axis parameter:

>>> df.mode(axis='columns', numeric_only=True)
           0    1
falcon   2.0  NaN
horse    4.0  NaN
spider   0.0  8.0
ostrich  2.0  NaN
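
The definition of mode used here (the most frequent value, possibly several values when there is a tie) can be sketched for a single column with the standard library. This is an illustrative sketch only: the function name modes is hypothetical, and None stands in for NaN/NaT so that missing values compare equal to each other.

```python
from collections import Counter


def modes(values, dropna=True):
    """Return every value tied for the highest count, in sorted order.

    None stands in for a missing value (NaN/NaT) in this sketch.
    """
    if dropna:
        values = [v for v in values if v is not None]
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    tied = [v for v, n in counts.items() if n == top]
    # Sort the non-missing values; a missing value, if tied, goes last.
    return sorted(v for v in tied if v is not None) + [v for v in tied if v is None]


wings = [2.0, None, 0.0, None]
legs = [2, 4, 8, 2]

wings_modes = modes(wings)                        # missing values ignored
wings_modes_with_na = modes(wings, dropna=False)  # missing value can win
legs_modes = modes(legs)
```

With the wings column from the example above, ignoring missing values leaves a two-way tie, while counting them makes the missing value the single mode.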
dropna(axis, **kwargs)[source]

Remove missing values.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) –

    Determine if rows or columns which contain missing values are removed.

    • 0, or ‘index’ : Drop rows which contain missing values.

    • 1, or ‘columns’ : Drop columns which contain missing values.

    Only a single axis is allowed.

  • how ({'any', 'all'}, default 'any') –

    Determine if row or column is removed from DeferredDataFrame, when we have at least one NA or all NA.

    • 'any' : If any NA values are present, drop that row or column.

    • 'all' : If all values are NA, drop that row or column.

  • thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.

  • subset (column label or sequence of labels, optional) – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

  • inplace (bool, default False) – Whether to modify the DeferredDataFrame rather than creating a new one.

  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    Added in version 2.0.0.

Returns:

DeferredDataFrame with NA entries dropped from it or None if inplace=True.

Return type:

DeferredDataFrame or None

Differences from pandas

dropna with axis="columns" specified cannot be parallelized.

See also

DeferredDataFrame.isna

Indicate missing values.

DeferredDataFrame.notna

Indicate existing (non-missing) values.

DeferredDataFrame.fillna

Replace missing values.

DeferredSeries.dropna

Drop missing values.

Index.dropna

Drop missing indices.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Drop the rows where at least one element is missing.

>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'toy'])
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
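
How the how, thresh, and subset parameters interact when dropping rows can be sketched in plain Python. This is a simplified model, not the pandas or Beam implementation: drop_rows is a hypothetical helper, rows are plain dicts, and None stands in for every kind of missing value.

```python
def drop_rows(rows, how=None, thresh=None, subset=None):
    """Return the rows that survive, treating None as the missing value."""
    if how is not None and thresh is not None:
        # thresh cannot be combined with how (error type chosen for the sketch).
        raise TypeError("how and thresh cannot be combined")
    how = how or "any"
    kept = []
    for row in rows:
        keys = subset if subset is not None else list(row)
        present = sum(row[k] is not None for k in keys)
        if thresh is not None:
            keep = present >= thresh     # require that many non-NA values
        elif how == "any":
            keep = present == len(keys)  # drop if any value is missing
        elif how == "all":
            keep = present > 0           # drop only if every value is missing
        else:
            raise ValueError(f"invalid how: {how!r}")
        if keep:
            kept.append(row)
    return kept


rows = [
    {"name": "Alfred", "toy": None, "born": None},
    {"name": "Batman", "toy": "Batmobile", "born": "1940-04-25"},
    {"name": "Catwoman", "toy": "Bullwhip", "born": None},
]


def names(result):
    return [row["name"] for row in result]


any_missing = names(drop_rows(rows))
all_missing = names(drop_rows(rows, how="all"))
at_least_two = names(drop_rows(rows, thresh=2))
name_and_toy = names(drop_rows(rows, subset=["name", "toy"]))
```

The four results correspond to the four row-wise examples above.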
eval(expr, inplace, **kwargs)[source]

Evaluate a string describing operations on DataFrame columns.

Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.

Parameters:
  • expr (str) – The expression string to evaluate.

  • inplace (bool, default False) – If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DeferredDataFrame. Otherwise, a new DeferredDataFrame is returned.

  • **kwargs – See the documentation for eval() for complete details on the keyword arguments accepted by query().

Returns:

The result of the evaluation or None if inplace=True.

Return type:

ndarray, scalar, pandas object, or None

Differences from pandas

Accessing local variables with @<varname> is not yet supported (Issue 20626).

Arguments local_dict, global_dict, level, target, and resolvers are not yet supported.

See also

DeferredDataFrame.query

Evaluates a boolean expression to query the columns of a frame.

DeferredDataFrame.assign

Can evaluate an expression or function to create new values for a column.

eval

Evaluate a Python expression as a string using various backends.

Notes

For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
>>> df
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2
>>> df.eval('A + B')
0    11
1    10
2     9
3     8
4     7
dtype: int64

Assignment is allowed, though by default the original DataFrame is not
modified.

>>> df.eval('C = A + B')
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7
>>> df
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2

Multiple columns can be assigned to using multi-line expressions:

>>> df.eval(
...     '''
... C = A + B
... D = A - B
... '''
... )
   A   B   C  D
0  1  10  11 -9
1  2   8  10 -6
2  3   6   9 -3
3  4   4   8  0
4  5   2   7  3
query(expr, inplace, **kwargs)[source]

Query the columns of a DataFrame with a boolean expression.

Parameters:
  • expr (str) –

    The query string to evaluate.

    You can refer to variables in the environment by prefixing them with an ‘@’ character like @a + b.

    You can refer to column names that are not valid Python variable names by surrounding them in backticks. Thus, column names containing spaces or punctuations (besides underscores) or starting with digits must be surrounded by backticks. (For example, a column named “Area (cm^2)” would be referenced as `Area (cm^2)`). Column names which are Python keywords (like “list”, “for”, “import”, etc) cannot be used.

    For example, if one of your columns is called a a and you want to sum it with b, your query should be `a a` + b.

  • inplace (bool) – Whether to modify the DeferredDataFrame rather than creating a new one.

  • **kwargs – See the documentation for eval() for complete details on the keyword arguments accepted by DeferredDataFrame.query().

Returns:

DeferredDataFrame resulting from the provided query expression or None if inplace=True.

Return type:

DeferredDataFrame or None

Differences from pandas

Accessing local variables with @<varname> is not yet supported (Issue 20626).

Arguments local_dict, global_dict, level, target, and resolvers are not yet supported.

See also

eval

Evaluate a string describing operations on DeferredDataFrame columns.

DeferredDataFrame.eval

Evaluate a string describing operations on DeferredDataFrame columns.

Notes

The result of the evaluation of this expression is first passed to DeferredDataFrame.loc and if that fails because of a multidimensional key (e.g., a DeferredDataFrame) then the result will be passed to DeferredDataFrame.__getitem__().

This method uses the top-level eval() function to evaluate the passed query.

The query() method uses a slightly modified Python syntax by default. For example, the & and | (bitwise) operators have the precedence of their boolean cousins, and and or. This is syntactically valid Python; however, the semantics are different.

You can change the semantics of the expression by passing the keyword argument parser='python'. This enforces the same semantics as evaluation in Python space. Likewise, you can pass engine='python' to evaluate an expression using Python itself as a backend. This is not recommended as it is inefficient compared to using numexpr as the engine.
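
To see why the precedence difference matters, it can help to look at how plain Python reads the two spellings. The sketch below uses only the standard library ast module and ordinary integers; it does not call query() itself, and the variable names are chosen for illustration.

```python
import ast

# Plain Python: & binds tighter than the comparisons, so this is the
# chained comparison  A > (1 & B) < 5.
bitwise = ast.parse("A > 1 & B < 5", mode="eval").body

# Plain Python: `and` joins two separate comparisons.
boolean = ast.parse("A > 1 and B < 5", mode="eval").body

A, B = 1, 2
plain_bitwise = A > 1 & B < 5    # 1 > (1 & 2) < 5, that is 1 > 0 < 5
plain_boolean = A > 1 and B < 5  # (1 > 1) and (2 < 5)
```

Here plain_bitwise is True while plain_boolean is False. As described above, the default query parser gives & the precedence of and, so a query string written with & is read like the second form; passing parser='python' keeps the plain-Python reading.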

The DeferredDataFrame.index and DeferredDataFrame.columns attributes of the DeferredDataFrame instance are placed in the query namespace by default, which allows you to treat both the index and columns of the frame as a column in the frame. The identifier index is used for the frame index; you can also use the name of the index to identify it in a query. Please note that Python keywords may not be used as identifiers.

For further details and examples see the query documentation in indexing.

Backtick quoted variables

Backtick quoted variables are parsed as literal Python code and are converted internally to a Python valid identifier. This can lead to the following problems.

During parsing a number of disallowed characters inside the backtick quoted string are replaced by strings that are allowed as a Python identifier. These characters include all operators in Python, the space character, the question mark, the exclamation mark, the dollar sign, and the euro sign. For other characters that fall outside the ASCII range (U+0001..U+007F) and those that are not further specified in PEP 3131, the query parser will raise an error. This excludes whitespace different than the space character, but also the hashtag (as it is used for comments) and the backtick itself (backtick can also not be escaped).

In a special case, quotes that make a pair around a backtick can confuse the parser. For example, `it's` > `that's` will raise an error, as it forms a quoted string ('s > `that') with a backtick inside.

See also the Python documentation about lexical analysis (https://docs.python.org/3/reference/lexical_analysis.html) in combination with the source code in pandas.core.computation.parsing.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'A': range(1, 6),
...                    'B': range(10, 0, -2),
...                    'C C': range(10, 5, -1)})
>>> df
   A   B  C C
0  1  10   10
1  2   8    9
2  3   6    8
3  4   4    7
4  5   2    6
>>> df.query('A > B')
   A  B  C C
4  5  2    6

The previous expression is equivalent to

>>> df[df.A > df.B]
   A  B  C C
4  5  2    6

For columns with spaces in their name, you can use backtick quoting.

>>> df.query('B == `C C`')
   A   B  C C
0  1  10   10

The previous expression is equivalent to

>>> df[df.B == df['C C']]
   A   B  C C
0  1  10   10
isnull(**kwargs)

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:

Mask of bool values for each element in DeferredDataFrame that indicates whether an element is an NA value.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.isnull

Alias of isna.

DeferredDataFrame.notna

Boolean inverse of isna.

DeferredDataFrame.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()
0    False
1    False
2     True
dtype: bool
isna(**kwargs)

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:

Mask of bool values for each element in DeferredDataFrame that indicates whether an element is an NA value.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.isnull

Alias of isna.

DeferredDataFrame.notna

Boolean inverse of isna.

DeferredDataFrame.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()
0    False
1    False
2     True
dtype: bool
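
The rule for what counts as NA can be approximated for scalar values with the standard library. This is a rough sketch under stated assumptions: is_na is a hypothetical helper, it covers only None and floating-point NaN (not NaT or other pandas-specific missing markers), and it ignores the use_inf_as_na option.

```python
import math


def is_na(value):
    """Sketch of the NA test described above: None and NaN count as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


samples = [5.0, None, float("nan"), "", float("inf"), "Joker"]
mask = [is_na(v) for v in samples]
```

Only the None and NaN entries are flagged; the empty string and infinity are treated as ordinary values, matching the description at the top of this entry.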
notnull(**kwargs)

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:

Mask of bool values for each element in DeferredDataFrame that indicates whether an element is not an NA value.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.notnull

Alias of notna.

DeferredDataFrame.isna

Boolean inverse of notna.

DeferredDataFrame.dropna

Omit axes labels with missing values.

notna

Top-level notna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()
0     True
1     True
2    False
dtype: bool
notna(**kwargs)

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:

Mask of bool values for each element in DeferredDataFrame that indicates whether an element is not an NA value.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.notnull

Alias of notna.

DeferredDataFrame.isna

Boolean inverse of notna.

DeferredDataFrame.dropna

Omit axes labels with missing values.

notna

Top-level notna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()
0     True
1     True
2    False
dtype: bool
items(**kwargs)

pandas.DataFrame.items() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

itertuples(**kwargs)

pandas.DataFrame.itertuples() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

iterrows(**kwargs)

pandas.DataFrame.iterrows() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

iteritems(**kwargs)

pandas.DataFrame.iteritems() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

join(other, on, **kwargs)[source]

Join columns of another DataFrame.

Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

Parameters:
  • other (DeferredDataFrame, DeferredSeries, or a list containing any combination of them) – Index should be similar to one of the columns in this one. If a DeferredSeries is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DeferredDataFrame.

  • on (str, list of str, or array-like, optional) – Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DeferredDataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DeferredDataFrame. Like an Excel VLOOKUP operation.

  • how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'left') –

    How to handle the operation of the two objects.

    • left: use calling frame’s index (or column if on is specified)

    • right: use other’s index.

    • outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.

    • inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling frame’s index.

    • cross: creates the cartesian product from both frames, preserves the order of the left keys.

      Added in version 1.2.0.

  • lsuffix (str, default '') – Suffix to use from left frame’s overlapping columns.

  • rsuffix (str, default '') – Suffix to use from right frame’s overlapping columns.

  • sort (bool, default False) – Order result DeferredDataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).

  • validate (str, optional) –

    If specified, checks if join is of specified type.

    • "one_to_one" or "1:1": check if join keys are unique in both left and right datasets.

    • "one_to_many" or "1:m": check if join keys are unique in left dataset.

    • "many_to_one" or "m:1": check if join keys are unique in right dataset.

    • "many_to_many" or "m:m": allowed, but does not result in checks.

    Added in version 1.5.0.

Returns:

A dataframe containing columns from both the caller and other.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.merge

For column(s)-on-column(s) operations.

Notes

Parameters on, lsuffix, and rsuffix are not supported when passing a list of DeferredDataFrame objects.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

>>> df
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5

>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
...                       'B': ['B0', 'B1', 'B2']})

>>> other
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

Join DataFrames using their indexes.

>>> df.join(other, lsuffix='_caller', rsuffix='_other')
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

If we want to join using the key columns, we need to set key to be
the index in both `df` and `other`. The joined DataFrame will have
key as its index.

>>> df.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

Another option to join using the key columns is to use the `on`
parameter. DataFrame.join always uses `other`'s index but we can use
any column in `df`. This method preserves the original DataFrame's
index in the result.

>>> df.join(other.set_index('key'), on='key')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN

Using non-unique key values shows how they are matched.

>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'],
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

>>> df
  key   A
0  K0  A0
1  K1  A1
2  K1  A2
3  K3  A3
4  K0  A4
5  K1  A5

>>> df.join(other.set_index('key'), on='key', validate='m:1')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K1  A2   B1
3  K3  A3  NaN
4  K0  A4   B0
5  K1  A5   B1
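
The last example, a left join of a key column against the other frame's index with a many-to-one check, can be modeled in plain Python. This is a simplified sketch and not the pandas or Beam implementation: left_join_on is a hypothetical helper, rows are dicts, unmatched values are filled with None instead of NaN, and the error type is chosen for the sketch.

```python
def left_join_on(left_rows, on, right_by_index, validate=None):
    """Left join: look up each left row's `on` value in the right frame's index.

    left_rows: list of dicts (rows of the calling frame)
    right_by_index: list of (index_label, row_dict) pairs for the other frame
    """
    lookup = {}
    for label, row in right_by_index:
        if validate in ("m:1", "many_to_one") and label in lookup:
            raise ValueError("join keys are not unique in the right dataset")
        lookup.setdefault(label, []).append(row)

    right_columns = sorted({c for _, row in right_by_index for c in row})
    joined = []
    for left in left_rows:
        # Unmatched left rows are kept, with the right columns left empty.
        matches = lookup.get(left[on]) or [dict.fromkeys(right_columns)]
        for match in matches:
            joined.append({**left, **match})
    return joined


df_rows = [
    {"key": k, "A": a}
    for k, a in zip(["K0", "K1", "K1", "K3", "K0", "K1"],
                    ["A0", "A1", "A2", "A3", "A4", "A5"])
]
other_indexed = [("K0", {"B": "B0"}), ("K1", {"B": "B1"}), ("K2", {"B": "B2"})]

joined = left_join_on(df_rows, "key", other_indexed, validate="m:1")
b_column = [row["B"] for row in joined]
```

Every row of the calling frame appears once in the result because each key matches at most one row on the right; the K3 row has no match and keeps an empty B value.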
merge(right, on, left_on, right_on, left_index, right_index, suffixes, **kwargs)[source]

Merge DataFrame or named Series objects with a database-style join.

A named Series object is treated as a DataFrame with a single named column.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

Warning

If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
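
The difference this warning describes can be illustrated with a small hash join in plain Python. This is a sketch under stated assumptions, not the pandas or Beam implementation: inner_join and its match_nulls flag are hypothetical, and None stands in for a null key.

```python
def inner_join(left, right, left_on, right_on, match_nulls=True):
    """Inner hash join over lists of dicts; None stands in for a null key.

    match_nulls=True follows the behaviour the warning describes (null keys
    match each other); match_nulls=False follows the usual SQL rule.
    """
    buckets = {}
    for row in right:
        buckets.setdefault(row[right_on], []).append(row)
    joined = []
    for lrow in left:
        key = lrow[left_on]
        if key is None and not match_nulls:
            continue  # SQL-style: a null key never equals anything
        for rrow in buckets.get(key, []):
            joined.append((lrow, rrow))
    return joined


left = [{"lkey": "foo", "value": 1}, {"lkey": None, "value": 2}]
right = [{"rkey": "foo", "value": 5}, {"rkey": None, "value": 6}]

pandas_like = inner_join(left, right, "lkey", "rkey")
sql_like = inner_join(left, right, "lkey", "rkey", match_nulls=False)
```

With null keys allowed to match, the result has two pairs (foo with foo, null with null); under the SQL-style rule only the foo pair remains.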

Parameters:
  • right (DeferredDataFrame or named DeferredSeries) – Object to merge with.

  • how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'inner') –

    Type of merge to be performed.

    • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

    • right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

    • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

    • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

    • cross: creates the cartesian product from both frames, preserves the order of the left keys.

      Added in version 1.2.0.

  • on (label or list) – Column or index level names to join on. These must be found in both DeferredDataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DeferredDataFrames.

  • left_on (label or list, or array-like) – Column or index level names to join on in the left DeferredDataFrame. Can also be an array or list of arrays of the length of the left DeferredDataFrame. These arrays are treated as if they are columns.

  • right_on (label or list, or array-like) – Column or index level names to join on in the right DeferredDataFrame. Can also be an array or list of arrays of the length of the right DeferredDataFrame. These arrays are treated as if they are columns.

  • left_index (bool, default False) – Use the index from the left DeferredDataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DeferredDataFrame (either the index or a number of columns) must match the number of levels.

  • right_index (bool, default False) – Use the index from the right DeferredDataFrame as the join key. Same caveats as left_index.

  • sort (bool, default False) – Sort the join keys lexicographically in the result DeferredDataFrame. If False, the order of the join keys depends on the join type (how keyword).

  • suffixes (list-like, default is ("_x", "_y")) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.

  • copy (bool, default True) – If False, avoid copy if possible.

  • indicator (bool or str, default False) – If True, adds a column to the output DeferredDataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DeferredDataFrame, “right_only” for observations whose merge key only appears in the right DeferredDataFrame, and “both” if the observation’s merge key is found in both DeferredDataFrames.

  • validate (str, optional) –

    If specified, checks if merge is of specified type.

    • "one_to_one" or "1:1": check if merge keys are unique in both left and right datasets.

    • "one_to_many" or "1:m": check if merge keys are unique in left dataset.

    • "many_to_one" or "m:1": check if merge keys are unique in right dataset.

    • "many_to_many" or "m:m": allowed, but does not result in checks.

Returns:

A DeferredDataFrame of the two merged objects.

Return type:

DeferredDataFrame

Differences from pandas

merge is not parallelizable unless left_index or right_index is ``True``, because it requires generating an entirely new unique index. See notes on DeferredDataFrame.reset_index(). It is recommended to move the join key for one of your columns to the index to avoid this issue. For an example see the enrich pipeline in apache_beam.examples.dataframe.taxiride.

how="cross" is not yet supported.

See also

merge_ordered

Merge with optional filling/interpolation.

merge_asof

Merge on nearest keys.

DeferredDataFrame.join

Similar method using indices.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]})
>>> df1
    lkey value
0   foo      1
1   bar      2
2   baz      3
3   foo      5
>>> df2
    rkey value
0   foo      5
1   bar      6
2   baz      7
3   foo      8

Merge df1 and df2 on the lkey and rkey columns. The value columns have
the default suffixes, _x and _y, appended.

>>> df1.merge(df2, left_on='lkey', right_on='rkey')
  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7

Merge DataFrames df1 and df2 with specified left and right suffixes
appended to any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey',
...           suffixes=('_left', '_right'))
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  foo           5  foo            5
3  foo           5  foo            8
4  bar           2  bar            6
5  baz           3  baz            7

Merge DataFrames df1 and df2, but raise an exception if the DataFrames have
any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
Traceback (most recent call last):
...
ValueError: columns overlap but no suffix specified:
    Index(['value'], dtype='object')

>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
>>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
>>> df1
      a  b
0   foo  1
1   bar  2
>>> df2
      a  c
0   foo  3
1   baz  4

>>> df1.merge(df2, how='inner', on='a')
      a  b  c
0   foo  1  3

>>> df1.merge(df2, how='left', on='a')
      a  b  c
0   foo  1  3.0
1   bar  2  NaN

>>> df1 = pd.DataFrame({'left': ['foo', 'bar']})
>>> df2 = pd.DataFrame({'right': [7, 8]})
>>> df1
    left
0   foo
1   bar
>>> df2
    right
0   7
1   8

>>> df1.merge(df2, how='cross')
   left  right
0   foo      7
1   foo      8
2   bar      7
3   bar      8
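
As a rough illustration only (plain Python, not Beam or pandas internals), the how='cross' result above corresponds to the Cartesian product of the rows of the two frames. The helper name cross_rows is hypothetical, and the sketch assumes the two inputs share no column names, so no suffixes are needed.

```python
from itertools import product


def cross_rows(left_rows, right_rows):
    """Pair every left row with every right row (hypothetical helper)."""
    # Each row is a dict; merging the dicts places the columns side by side.
    # This assumes the two sides have no column names in common.
    return [{**left, **right} for left, right in product(left_rows, right_rows)]


left = [{"left": "foo"}, {"left": "bar"}]
right = [{"right": 7}, {"right": 8}]

result = cross_rows(left, right)
for row in result:
    print(row)
```

The output has len(left) * len(right) rows, in the same order as the pandas example above.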

nlargest(keep, **kwargs)[source]

Return the first n rows ordered by columns in descending order.

Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.

Parameters:
  • n (int) – Number of rows to return.

  • columns (label or list of labels) – Column label(s) to order by.

  • keep ({'first', 'last', 'all'}, default 'first') –

    Where there are duplicate values:

    • first : prioritize the first occurrence(s)

    • last : prioritize the last occurrence(s)

    • all : do not drop any duplicates, even if it means selecting more than n items.

Returns:

The first n rows ordered by the given columns in descending order.

Return type:

DeferredDataFrame

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.

See also

DeferredDataFrame.nsmallest

Return the first n rows ordered by columns in ascending order.

DeferredDataFrame.sort_values

Sort DeferredDataFrame by the values.

DeferredDataFrame.head

Return the first n rows without re-ordering.

Notes

This function cannot be used with all column types. For example, when specifying columns with object or category dtypes, TypeError is raised.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,
...                                   434000, 434000, 337000, 11300,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560 , 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru          11300      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use ``nlargest`` to select the three
rows having the largest values in column "population".

>>> df.nlargest(3, 'population')
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Malta       434000    12011      MT
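
The equivalence with sort_values(columns, ascending=False).head(n) stated earlier in this entry can be sketched in plain Python. This is only a model of the idea, not how pandas or Beam implement it: the rows are reduced to (label, population) tuples, and the standard-library heapq.nlargest stands in for a selection that avoids a full sort.

```python
import heapq

# (index label, population) pairs taken from the example frame above.
rows = [
    ("Italy", 59000000), ("France", 65000000), ("Malta", 434000),
    ("Maldives", 434000), ("Brunei", 434000), ("Iceland", 337000),
    ("Nauru", 11300), ("Tuvalu", 11300), ("Anguilla", 11300),
]


def population(row):
    return row[1]


# Full sort, then take the head.
by_sort = sorted(rows, key=population, reverse=True)[:3]

# Heap-based selection: only n candidates are tracked at any time.
by_heap = heapq.nlargest(3, rows, key=population)

print([name for name, _ in by_heap])
```

Both approaches return the same three rows here; the heap-based version does less work when n is much smaller than the number of rows.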

When using ``keep='last'``, ties are resolved in reverse order:

>>> df.nlargest(3, 'population', keep='last')
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN

When using ``keep='all'``, all duplicate items are maintained:

>>> df.nlargest(3, 'population', keep='all')
          population      GDP alpha-2
France      65000000  2583560      FR
Italy       59000000  1937894      IT
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN

To order by the largest values in column "population" and then "GDP",
we can specify multiple columns like in the next example.

>>> df.nlargest(3, ['population', 'GDP'])
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN

nsmallest(keep, **kwargs)[source]

Return the first n rows ordered by columns in ascending order.

Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=True).head(n), but more performant.

Parameters:
  • n (int) – Number of items to retrieve.

  • columns (list or str) – Column name or names to order by.

  • keep ({'first', 'last', 'all'}, default 'first') –

    Where there are duplicate values:

    • first : take the first occurrence.

    • last : take the last occurrence.

    • all : do not drop any duplicates, even if it means selecting more than n items.

Return type:

DeferredDataFrame

Differences from pandas

Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.

See also

DeferredDataFrame.nlargest

Return the first n rows ordered by columns in descending order.

DeferredDataFrame.sort_values

Sort DeferredDataFrame by the values.

DeferredDataFrame.head

Return the first n rows without re-ordering.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,
...                                   434000, 434000, 337000, 337000,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560 , 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru         337000      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use ``nsmallest`` to select the
three rows having the smallest values in column "population".

>>> df.nsmallest(3, 'population')
          population    GDP alpha-2
Tuvalu         11300     38      TV
Anguilla       11300    311      AI
Iceland       337000  17036      IS

When using ``keep='last'``, ties are resolved in reverse order:

>>> df.nsmallest(3, 'population', keep='last')
          population  GDP alpha-2
Anguilla       11300  311      AI
Tuvalu         11300   38      TV
Nauru         337000  182      NR
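
The order sensitivity mentioned under ‘Differences from pandas’ can be illustrated with a small plain-Python model. The helper nsmallest_first is hypothetical and only mimics the idea of keep='first' (ties go to the rows seen earlier); it is not the pandas or Beam implementation.

```python
def nsmallest_first(rows, n):
    """Hypothetical model of keep='first': ties go to rows seen earlier."""
    # sorted() is stable, so rows with equal population keep their input order.
    return sorted(rows, key=lambda row: row[1])[:n]


rows = [
    ("Iceland", 337000), ("Nauru", 337000),
    ("Tuvalu", 11300), ("Anguilla", 11300),
]

original = nsmallest_first(rows, 3)
reordered = nsmallest_first(list(reversed(rows)), 3)

print([name for name, _ in original])
print([name for name, _ in reordered])
```

The same rows in a different order yield a different third row (Iceland versus Nauru). A distributed collection gives no guarantee about row order, which is why results that depend on it are described as order-sensitive.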

When using ``keep='all'``, all duplicate items are maintained:

>>> df.nsmallest(3, 'population', keep='all')
          population    GDP alpha-2
Tuvalu         11300     38      TV
Anguilla       11300    311      AI
Iceland       337000  17036      IS
Nauru         337000    182      NR

To order by the smallest values in column "population" and then "GDP", we can
specify multiple columns like in the next example.

>>> df.nsmallest(3, ['population', 'GDP'])
          population  GDP alpha-2
Tuvalu         11300   38      TV
Anguilla       11300  311      AI
Nauru         337000  182      NR

plot(**kwargs)

pandas.DataFrame.plot() is not yet supported in the Beam DataFrame API because it is a plotting tool.

For more information see https://s.apache.org/dataframe-plotting-tools.

pop(item)[source]

Return item and drop from frame. Raise KeyError if not found.

Parameters:

item (label) – Label of column to be popped.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN

>>> df.pop('class')
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object

>>> df
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN

quantile(q, axis, **kwargs)[source]

Return values at the given quantile over requested axis.

Parameters:
  • q (float or array-like, default 0.5 (50% quantile)) – Value between 0 <= q <= 1, the quantile(s) to compute.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Equals 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    Changed in version 2.0.0: The default value of numeric_only is now False.

  • interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –

    This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:

    • linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.

    • lower: i.

    • higher: j.

    • nearest: i or j whichever is nearest.

    • midpoint: (i + j) / 2.

  • method ({'single', 'table'}, default 'single') – Whether to compute quantiles per-column (‘single’) or over all columns (‘table’). When ‘table’, the only allowed interpolation methods are ‘nearest’, ‘lower’, and ‘higher’.

Returns:

If q is an array, a DeferredDataFrame will be returned where the index is q, the columns are the columns of self, and the values are the quantiles.

If q is a float, a DeferredSeries will be returned where the index is the columns of self and the values are the quantiles.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

quantile(axis="index") is not parallelizable. See Issue 20933 tracking the possible addition of an approximate, parallelizable implementation of quantile.

When using quantile with axis="columns" only a single q value can be specified.

See also

core.window.rolling.Rolling.quantile

Rolling quantile.

numpy.percentile

Numpy function to compute the percentile.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
...                   columns=['a', 'b'])
>>> df.quantile(.1)
a    1.3
b    3.7
Name: 0.1, dtype: float64
>>> df.quantile([.1, .5])
       a     b
0.1  1.3   3.7
0.5  2.5  55.0
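
The interpolation rules listed under the interpolation parameter can be checked by hand against the values above. The following is a minimal single-column sketch in plain Python, not the pandas implementation; the function name and the handling of exact halves for 'nearest' are assumptions of the sketch.

```python
import math


def quantile(values, q, interpolation="linear"):
    """Quantile of a non-empty list of numbers (illustrative sketch)."""
    data = sorted(values)
    position = q * (len(data) - 1)  # fractional index into the sorted data
    lower_idx = math.floor(position)
    upper_idx = math.ceil(position)
    i, j = data[lower_idx], data[upper_idx]
    fraction = position - lower_idx
    if interpolation == "linear":
        return i + (j - i) * fraction
    if interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "midpoint":
        return (i + j) / 2
    if interpolation == "nearest":
        # Assumption: exact halves go to the even index, as Python's round() does.
        return data[round(position)]
    raise ValueError(f"unknown interpolation: {interpolation!r}")


a = [1, 2, 3, 4]
b = [1, 10, 100, 100]
print(quantile(a, 0.1), quantile(b, 0.1))
print(quantile(a, 0.5), quantile(b, 0.5))
```

For q=0.1 and four values, the fractional index is 0.3, so column a gives 1 + (2 - 1) * 0.3 and column b gives 1 + (10 - 1) * 0.3, matching 1.3 and 3.7 in the example output up to floating-point rounding.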

Specifying `method='table'` will compute the quantile over all columns.

>>> df.quantile(.1, method="table", interpolation="nearest")
a    1
b    1
Name: 0.1, dtype: int64
>>> df.quantile([.1, .5], method="table", interpolation="nearest")
     a    b
0.1  1    1
0.5  3  100

Specifying `numeric_only=False` will also compute the quantile of
datetime and timedelta data.

>>> df = pd.DataFrame({'A': [1, 2],
...                    'B': [pd.Timestamp('2010'),
...                          pd.Timestamp('2011')],
...                    'C': [pd.Timedelta('1 days'),
...                          pd.Timedelta('2 days')]})
>>> df.quantile(0.5, numeric_only=False)
A                    1.5
B    2010-07-02 12:00:00
C        1 days 12:00:00
Name: 0.5, dtype: object

rename(**kwargs)[source]

Rename columns or index labels.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

See the user guide for more.

Parameters:
  • mapper (dict-like or function) – Dict-like or function transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.

  • index (dict-like or function) – Alternative to specifying axis (mapper, axis=0 is equivalent to index=mapper).

  • columns (dict-like or function) – Alternative to specifying axis (mapper, axis=1 is equivalent to columns=mapper).

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to target with mapper. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’.

  • copy (bool, default True) – Also copy underlying data.

  • inplace (bool, default False) – Whether to modify the DeferredDataFrame rather than creating a new one. If True then value of copy is ignored.

  • level (int or level name, default None) – In case of a MultiIndex, only rename labels in the specified level.

  • errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise a KeyError when a dict-like mapper, index, or columns contains labels that are not present in the Index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.

Returns:

DeferredDataFrame with the renamed axis labels or None if inplace=True.

Return type:

DeferredDataFrame or None

Raises:

KeyError – If any of the labels is not found in the selected axis and errors='raise'.

Differences from pandas

rename is not parallelizable when axis="index" and errors="raise". It requires collecting all data on a single node in order to detect if one of the index values is missing.
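
To see why errors="raise" needs a global view of the index, consider the following simplified model in plain Python. It is not Beam's implementation: the partitions list, the rename_partition helper, and the idea of combining per-partition "seen" sets are all assumptions made for illustration.

```python
def rename_partition(index_labels, mapper):
    """Rename labels in one partition and report which mapper keys it saw."""
    renamed = [mapper.get(label, label) for label in index_labels]
    seen = {label for label in index_labels if label in mapper}
    return renamed, seen


mapper = {0: "x", 1: "y", 5: "z"}
partitions = [[0, 1], [2, 3]]  # hypothetical split of the index across workers

results = [rename_partition(part, mapper) for part in partitions]
renamed = [labels for labels, _ in results]

# errors="ignore": each partition's output is already final.
print(renamed)

# errors="raise": no single partition can tell whether key 5 exists elsewhere,
# so information from every partition has to be brought together first.
seen_anywhere = set().union(*(seen for _, seen in results))
missing = sorted(set(mapper) - seen_anywhere)
print(missing)
```

Renaming itself is a per-row operation, but deciding whether a mapper key is absent from the whole index is not, which is the dependency the note above describes.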

See also

DeferredDataFrame.rename_axis

Set the name of the axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

``DataFrame.rename`` supports two calling conventions

* ``(index=index_mapper, columns=columns_mapper, ...)``
* ``(mapper, axis={'index', 'columns'}, ...)``

We *highly* recommend using keyword arguments to clarify your
intent.

Rename columns using a mapping:

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6

Rename index using a mapping:

>>> df.rename(index={0: "x", 1: "y", 2: "z"})
   A  B
x  1  4
y  2  5
z  3  6

Cast index labels to a different type:

>>> df.index
RangeIndex(start=0, stop=3, step=1)
>>> df.rename(index=str).index
Index(['0', '1', '2'], dtype='object')

>>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise")
Traceback (most recent call last):
KeyError: ['C'] not found in axis

Using axis-style parameters:

>>> df.rename(str.lower, axis='columns')
   a  b
0  1  4
1  2  5
2  3  6

>>> df.rename({1: 2, 2: 4}, axis='index')
   A  B
0  1  4
2  2  5
4  3  6

rename_axis(**kwargs)

Set the name of the axis for the index or columns.

Parameters:
  • mapper (scalar, list-like, optional) – Value to set the axis name attribute.

  • index (scalar, list-like, dict-like or function, optional) –

    A scalar, list-like, dict-like or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a DeferredSeries. This parameter only applies to DeferredDataFrame type objects.

    Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.

  • columns (scalar, list-like, dict-like or function, optional) –

    A scalar, list-like, dict-like or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a DeferredSeries. This parameter only applies to DeferredDataFrame type objects.

    Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename. For DeferredSeries this parameter is unused and defaults to 0.

  • copy (bool, default None) – Also copy underlying data.

  • inplace (bool, default False) – Modifies the object directly, instead of creating a new DeferredSeries or DeferredDataFrame.

Returns:

The same type as the caller or None if inplace=True.

Return type:

DeferredSeries, DeferredDataFrame, or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.rename

Alter DeferredSeries index labels or name.

DeferredDataFrame.rename

Alter DeferredDataFrame index labels or name.

Index.rename

Set new names on index.

Notes

DeferredDataFrame.rename_axis supports two calling conventions

  • (index=index_mapper, columns=columns_mapper, ...)

  • (mapper, axis={'index', 'columns'}, ...)

The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter copy is ignored.

The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.

We highly recommend using keyword arguments to clarify your intent.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

>>> s = pd.Series(["dog", "cat", "monkey"])
>>> s
0       dog
1       cat
2    monkey
dtype: object
>>> s.rename_axis("animal")
animal
0    dog
1    cat
2    monkey
dtype: object

**DataFrame**

>>> df = pd.DataFrame({"num_legs": [4, 4, 2],
...                    "num_arms": [0, 0, 2]},
...                   ["dog", "cat", "monkey"])
>>> df
        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("animal")
>>> df
        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("limbs", axis="columns")
>>> df
limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

**MultiIndex**

>>> df.index = pd.MultiIndex.from_product([['mammal'],
...                                        ['dog', 'cat', 'monkey']],
...                                       names=['type', 'name'])
>>> df
limbs          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(index={'type': 'class'})
limbs          num_legs  num_arms
class  name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(columns=str.upper)
LIMBS          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2

round(decimals, *args, **kwargs)[source]

Round a DataFrame to a variable number of decimal places.

Parameters:
  • decimals (int, dict, DeferredSeries) – Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and DeferredSeries round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a DeferredSeries. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.

  • *args – Additional keywords have no effect but might be accepted for compatibility with numpy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns:

A DeferredDataFrame with the affected columns rounded to the specified number of decimal places.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.around

Round a numpy array to the given number of decimals.

DeferredSeries.round

Round a DeferredSeries to the given number of decimals.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)],
...                   columns=['dogs', 'cats'])
>>> df
    dogs  cats
0  0.21  0.32
1  0.01  0.67
2  0.66  0.03
3  0.21  0.18

By providing an integer each column is rounded to the same number
of decimal places

>>> df.round(1)
    dogs  cats
0   0.2   0.3
1   0.0   0.7
2   0.7   0.0
3   0.2   0.2

With a dict, the number of places for specific columns can be
specified with the column names as key and the number of decimal
places as value

>>> df.round({'dogs': 1, 'cats': 0})
    dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0
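
The dict form above can be modelled in plain Python to make the rules from the decimals parameter concrete: columns missing from the dict are left as is, and dict keys that are not columns are ignored. The helper round_columns is hypothetical, and Python's built-in round() is used as a stand-in; it may differ from pandas for values that sit exactly between two representable results.

```python
def round_columns(rows, decimals):
    """Round each column by decimals[column] (hypothetical helper)."""
    return [
        {
            # Columns missing from `decimals` are left unchanged.
            col: round(value, decimals[col]) if col in decimals else value
            for col, value in row.items()
        }
        for row in rows
    ]


rows = [
    {"dogs": 0.21, "cats": 0.32},
    {"dogs": 0.01, "cats": 0.67},
    {"dogs": 0.66, "cats": 0.03},
    {"dogs": 0.21, "cats": 0.18},
]

# "birds" is not a column of the input, so that entry has no effect.
rounded = round_columns(rows, {"dogs": 1, "cats": 0, "birds": 2})
for row in rounded:
    print(row)

# Only "dogs" is listed, so "cats" passes through untouched.
dogs_only = round_columns(rows, {"dogs": 1})
```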

Using a Series, the number of places for specific columns can be
specified with the column names as index and the number of
decimal places as value

>>> decimals = pd.Series([0, 1], index=['cats', 'dogs'])
>>> df.round(decimals)
    dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0

select_dtypes(**kwargs)

Return a subset of the DataFrame’s columns based on the column dtypes.

Parameters:
  • include (scalar or list-like) – A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.

  • exclude (scalar or list-like) – A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.

Returns:

The subset of the frame including the dtypes in include and excluding the dtypes in exclude.

Return type:

DeferredDataFrame

Raises:

ValueError

  • If both of include and exclude are empty

  • If include and exclude have overlapping elements

  • If any kind of string dtype is passed in.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.dtypes

Return DeferredSeries with the data type of each column.

Notes

  • To select all numeric types, use np.number or 'number'

  • To select strings you must use the object dtype, but note that this will return all object dtype columns

  • See the numpy dtype hierarchy

  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'

  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'

  • To select Pandas categorical dtypes, use 'category'

  • To select Pandas datetimetz dtypes, use 'datetimetz' or 'datetime64[ns, tz]'

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'a': [1, 2] * 3,
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df
        a      b  c
0       1   True  1.0
1       2  False  2.0
2       1   True  1.0
3       2  False  2.0
4       1   True  1.0
5       2  False  2.0

>>> df.select_dtypes(include='bool')
   b
0  True
1  False
2  True
3  False
4  True
5  False

>>> df.select_dtypes(include=['float64'])
   c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0

>>> df.select_dtypes(exclude=['int64'])
       b    c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
5  False  2.0

shift(axis, freq, **kwargs)[source]

Shift index by desired number of periods with an optional time freq.

When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.

Parameters:
  • periods (int or Sequence) – Number of periods to shift. Can be positive or negative. If an iterable of ints, the data will be shifted once by each int. This is equivalent to shifting by one value at a time and concatenating all resulting frames. The resulting columns will have the shift suffixed to their column names. For multiple periods, axis must not be 1.

  • freq (DateOffset, tseries.offsets, timedelta, or str, optional) – Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Shift direction. For DeferredSeries this parameter is unused and defaults to 0.

  • fill_value (object, optional) – The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, etc., NaT is used. For extension dtypes, self.dtype.na_value is used.

  • suffix (str, optional) – If str and periods is an iterable, this is added after the column name and before the shift value for each shifted column name.

Returns:

Copy of input object, shifted.

Return type:

DeferredDataFrame

Differences from pandas

shift with axis="index" is only supported with freq specified and fill_value undefined. Other configurations make this operation order-sensitive.

See also

Index.shift

Shift values of Index.

DatetimeIndex.shift

Shift values of DatetimeIndex.

PeriodIndex.shift

Shift values of PeriodIndex.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45],
...                    "Col2": [13, 23, 18, 33, 48],
...                    "Col3": [17, 27, 22, 37, 52]},
...                   index=pd.date_range("2020-01-01", "2020-01-05"))
>>> df
            Col1  Col2  Col3
2020-01-01    10    13    17
2020-01-02    20    23    27
2020-01-03    15    18    22
2020-01-04    30    33    37
2020-01-05    45    48    52

>>> df.shift(periods=3)
            Col1  Col2  Col3
2020-01-01   NaN   NaN   NaN
2020-01-02   NaN   NaN   NaN
2020-01-03   NaN   NaN   NaN
2020-01-04  10.0  13.0  17.0
2020-01-05  20.0  23.0  27.0

>>> df.shift(periods=1, axis="columns")
            Col1  Col2  Col3
2020-01-01   NaN    10    13
2020-01-02   NaN    20    23
2020-01-03   NaN    15    18
2020-01-04   NaN    30    33
2020-01-05   NaN    45    48

>>> df.shift(periods=3, fill_value=0)
            Col1  Col2  Col3
2020-01-01     0     0     0
2020-01-02     0     0     0
2020-01-03     0     0     0
2020-01-04    10    13    17
2020-01-05    20    23    27

>>> df.shift(periods=3, freq="D")
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
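
A plain-Python sketch of the freq case, using only the standard library, may help show why it does not depend on row order: each index label is moved by periods * freq and the values are left untouched, so every row can be handled on its own. The helper shift_index is hypothetical and models a single column; it is not how pandas or Beam implement shift.

```python
from datetime import date, timedelta


def shift_index(rows, periods, freq):
    """Move every index label by periods * freq, leaving the values untouched."""
    return [(label + periods * freq, value) for label, value in rows]


# (index, Col1) pairs from the example frame above.
rows = [
    (date(2020, 1, 1) + timedelta(days=k), value)
    for k, value in enumerate([10, 20, 15, 30, 45])
]

shifted = shift_index(rows, periods=3, freq=timedelta(days=1))
for label, value in shifted:
    print(label, value)
```

By contrast, shifting without freq moves values between neighbouring rows, so the result depends on which row comes before which.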

>>> df.shift(periods=3, freq="infer")
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52

>>> df['Col1'].shift(periods=[0, 1, 2])
            Col1_0  Col1_1  Col1_2
2020-01-01      10     NaN     NaN
2020-01-02      20    10.0     NaN
2020-01-03      15    20.0    10.0
2020-01-04      30    15.0    20.0
2020-01-05      45    30.0    15.0

property shape

pandas.DataFrame.shape() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

stack(**kwargs)

Stack the prescribed level(s) from columns to index.

Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:

  • if the columns have a single level, the output is a Series;

  • if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s) and the output is a DataFrame.

Parameters:
  • level (int, str, list, default -1) – Level(s) to stack from the column axis onto the index axis, defined as one index or label, or a list of indices or labels.

  • dropna (bool, default True) – Whether to drop rows in the resulting Frame/DeferredSeries with missing values. Stacking a column level onto the index axis can create combinations of index and column values that are missing from the original dataframe. See Examples section.

  • sort (bool, default True) – Whether to sort the levels of the resulting MultiIndex.

  • future_stack (bool, default False) – Whether to use the new implementation that will replace the current implementation in pandas 3.0. When True, dropna and sort have no impact on the result and must remain unspecified. See pandas 2.1.0 Release notes for more details.

Returns:

Stacked dataframe or series.

Return type:

DeferredDataFrame or DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.unstack

Unstack prescribed level(s) from index axis onto column axis.

DeferredDataFrame.pivot

Reshape dataframe from long format to wide format.

DeferredDataFrame.pivot_table

Create a spreadsheet-style pivot table as a DeferredDataFrame.

Notes

The function is named by analogy with a collection of books being reorganized from being side by side on a horizontal position (the columns of the dataframe) to being stacked vertically on top of each other (in the index of the dataframe).

Reference the user guide for more examples.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Single level columns**

>>> df_single_level_cols = pd.DataFrame([[0, 1], [2, 3]],
...                                     index=['cat', 'dog'],
...                                     columns=['weight', 'height'])

Stacking a dataframe with a single level column axis returns a Series:

>>> df_single_level_cols
     weight height
cat       0      1
dog       2      3
>>> df_single_level_cols.stack(future_stack=True)
cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64
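
As an informal model of the single-level case, stacking can be thought of as turning each (row label, column label) pair into one entry of a two-level index. The plain-Python helper below is hypothetical and uses a dict keyed by tuples in place of a Series with a MultiIndex.

```python
def stack_rows(index, columns, data):
    """Turn a table into a {(row label, column label): value} mapping."""
    return {
        (row_label, col_label): value
        for row_label, row in zip(index, data)
        for col_label, value in zip(columns, row)
    }


stacked = stack_rows(["cat", "dog"], ["weight", "height"], [[0, 1], [2, 3]])
for key, value in stacked.items():
    print(key, value)
```

The four entries appear in the same order as the rows of the stacked Series shown above.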

**Multi level columns: simple case**

>>> multicol1 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('weight', 'pounds')])
>>> df_multi_level_cols1 = pd.DataFrame([[1, 2], [2, 4]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol1)

Stacking a dataframe with a multi-level column axis:

>>> df_multi_level_cols1
     weight
         kg    pounds
cat       1        2
dog       2        4
>>> df_multi_level_cols1.stack(future_stack=True)
            weight
cat kg           1
    pounds       2
dog kg           2
    pounds       4

**Missing values**

>>> multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('height', 'm')])
>>> df_multi_level_cols2 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol2)

It is common to have missing values when stacking a dataframe
with multi-level columns, as the stacked dataframe typically
has more values than the original dataframe. Missing values
are filled with NaNs:

>>> df_multi_level_cols2
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> df_multi_level_cols2.stack(future_stack=True)
        weight  height
cat kg     1.0     NaN
    m      NaN     2.0
dog kg     3.0     NaN
    m      NaN     4.0

**Prescribing the level(s) to be stacked**

The first parameter controls which level or levels are stacked:

>>> df_multi_level_cols2.stack(0, future_stack=True)
             kg    m
cat weight  1.0  NaN
    height  NaN  2.0
dog weight  3.0  NaN
    height  NaN  4.0
>>> df_multi_level_cols2.stack([0, 1], future_stack=True)
cat  weight  kg    1.0
     height  m     2.0
dog  weight  kg    3.0
     height  m     4.0
dtype: float64

**Dropping missing values**

>>> df_multi_level_cols3 = pd.DataFrame([[None, 1.0], [2.0, 3.0]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol2)

Note that rows where all values are missing are dropped by
default but this behaviour can be controlled via the dropna
keyword parameter:

>>> df_multi_level_cols3
    weight height
        kg      m
cat    NaN    1.0
dog    2.0    3.0
>>> df_multi_level_cols3.stack(dropna=False)
        weight  height
cat kg     NaN     NaN
    m      NaN     1.0
dog kg     2.0     NaN
    m      NaN     3.0
>>> df_multi_level_cols3.stack(dropna=True)
        weight  height
cat m      NaN     1.0
dog kg     2.0     NaN
    m      NaN     3.0
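
The relationship between stack and unstack can be illustrated with a small pandas-only sketch (it does not use Beam, and the names `stacked` and `restored` are only illustrative). It reuses the data from the single-level example above and assumes default arguments; some pandas versions may emit a deprecation warning for the default stack implementation.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [2, 3]],
                  index=['cat', 'dog'],
                  columns=['weight', 'height'])

# The column labels move into a new innermost index level, giving a Series.
stacked = df.stack()

# unstack() moves that innermost level back onto the column axis.
restored = stacked.unstack()
```

Looking up individual cells in `restored` gives the same values as in `df`, which is a quick way to check that a reshape did not lose data.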
all(*args, **kwargs)

Return whether all elements are True, potentially over an axis.

Returns True unless there is at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).

Parameters:
  • axis ({0 or 'index', 1 or 'columns', None}, default 0) –

    Indicate which axis or axes should be reduced. For DeferredSeries this parameter is unused and defaults to 0.

    • 0 / ‘index’ : reduce the index, return a DeferredSeries whose index is the original column labels.

    • 1 / ‘columns’ : reduce the columns, return a DeferredSeries whose index is the original index.

    • None : reduce all axes, return a scalar.

  • bool_only (bool, default False) – Include only boolean columns. Not implemented for DeferredSeries.

  • skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

  • **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

If level is specified, a DeferredDataFrame is returned; otherwise, a DeferredSeries is returned.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.all

Return True if all elements are True.

DeferredDataFrame.any

Return True if one (or more) elements are True.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

>>> pd.Series([True, True]).all()
True
>>> pd.Series([True, False]).all()
False
>>> pd.Series([], dtype="float64").all()
True
>>> pd.Series([np.nan]).all()
True
>>> pd.Series([np.nan]).all(skipna=False)
True

**DataFrames**

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df
   col1   col2
0  True   True
1  True  False

Default behaviour checks if values in each column all return True.

>>> df.all()
col1     True
col2    False
dtype: bool

Specify ``axis='columns'`` to check if values in each row all return True.

>>> df.all(axis='columns')
0     True
1    False
dtype: bool

Or ``axis=None`` for whether every value is True.

>>> df.all(axis=None)
False
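
The interaction between `skipna` and NA values described in the parameter list can be sketched in plain pandas as follows. This is only an illustration of the documented semantics, not Beam-specific code; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan, np.nan])

# With skipna=True (the default) the NaNs are dropped first, and all() of
# an empty selection is True.
skipped = all_nan.all()

# With skipna=False, NaN is not equal to zero and therefore counts as True.
kept = all_nan.all(skipna=False)

# A genuinely falsy value still makes the result False.
with_zero = pd.Series([0.0, np.nan]).all(skipna=False)
```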
any(*args, **kwargs)

Return whether any element is True, potentially over an axis.

Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).

Parameters:
  • axis ({0 or 'index', 1 or 'columns', None}, default 0) –

    Indicate which axis or axes should be reduced. For DeferredSeries this parameter is unused and defaults to 0.

    • 0 / ‘index’ : reduce the index, return a DeferredSeries whose index is the original column labels.

    • 1 / ‘columns’ : reduce the columns, return a DeferredSeries whose index is the original index.

    • None : reduce all axes, return a scalar.

  • bool_only (bool, default False) – Include only boolean columns. Not implemented for DeferredSeries.

  • skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

  • **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

If level is specified, a DeferredDataFrame is returned; otherwise, a DeferredSeries is returned.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.any

Numpy version of this method.

DeferredSeries.any

Return whether any element is True.

DeferredSeries.all

Return whether all elements are True.

DeferredDataFrame.any

Return whether any element is True over requested axis.

DeferredDataFrame.all

Return whether all elements are True over requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

For Series input, the output is a scalar indicating whether any element
is True.

>>> pd.Series([False, False]).any()
False
>>> pd.Series([True, False]).any()
True
>>> pd.Series([], dtype="float64").any()
False
>>> pd.Series([np.nan]).any()
False
>>> pd.Series([np.nan]).any(skipna=False)
True

**DataFrame**

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
   A  B  C
0  1  0  0
1  2  2  0

>>> df.any()
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})
>>> df
       A  B
0   True  1
1  False  2

>>> df.any(axis='columns')
0    True
1    True
dtype: bool

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})
>>> df
       A  B
0   True  1
1  False  0

>>> df.any(axis='columns')
0     True
1    False
dtype: bool

Aggregating over the entire DataFrame with ``axis=None``.

>>> df.any(axis=None)
True

`any` for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()
Series([], dtype: bool)
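
For boolean data, any and all are related by negation: a column contains at least one True exactly when its values are not all False. A minimal pandas-only sketch of that identity, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({"a": [True, False], "b": [False, False]})

any_true = df.any()

# "any value is True" is the same as "not all values are False".
not_all_false = ~(~df).all()
```

Both expressions reduce along the index and return one boolean per column.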
count(*args, **kwargs)

Count non-NA cells for each column or row.

The values None, NaN, NaT, pandas.NA are considered NA.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.

  • numeric_only (bool, default False) – Include only float, int or boolean data.

Returns:

For each column/row the number of non-NA/null entries.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.count

Number of non-NA elements in a DeferredSeries.

DeferredDataFrame.value_counts

Count unique combinations of columns.

DeferredDataFrame.shape

Number of DeferredDataFrame rows and columns (including NA elements).

DeferredDataFrame.isna

Boolean same-sized DeferredDataFrame showing places of NA elements.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Constructing DataFrame from a dictionary:

>>> df = pd.DataFrame({"Person":
...                    ["John", "Myla", "Lewis", "John", "Myla"],
...                    "Age": [24., np.nan, 21., 33, 26],
...                    "Single": [False, True, True, True, False]})
>>> df
   Person   Age  Single
0    John  24.0   False
1    Myla   NaN    True
2   Lewis  21.0    True
3    John  33.0    True
4    Myla  26.0   False

Notice the uncounted NA values:

>>> df.count()
Person    5
Age       4
Single    5
dtype: int64

Counts for each **row**:

>>> df.count(axis='columns')
0    3
1    2
2    3
3    3
4    3
dtype: int64
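
One way to read count is as "flag the non-NA cells, then add up the flags". The sketch below checks that reading in plain pandas on a shortened version of the frame above; it is an illustration only and does not involve Beam.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Person": ["John", "Myla", "Lewis"],
                   "Age": [24.0, np.nan, 21.0]})

per_column = df.count()

# Equivalent formulation: mark non-NA cells, then sum the marks per column.
via_notna = df.notna().sum()
```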
describe(*args, **kwargs)

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters:
  • percentiles (list-like of numbers, optional) – The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

  • include ('all', list-like of dtypes or None (default), optional) –

    A white list of data types to include in the result. Ignored for DeferredSeries. Here are the options:

    • ’all’ : All columns of the input will be included in the output.

    • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the object data type (numpy.object_). Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.

    • None (default) : The result will include all numeric columns.

  • exclude (list-like of dtypes or None (default), optional) –

    A black list of data types to omit from the result. Ignored for DeferredSeries. Here are the options:

    • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the object data type (numpy.object_). Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.

    • None (default) : The result will exclude nothing.

Returns:

Summary statistics of the DeferredSeries or Dataframe provided.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

describe cannot currently be parallelized. It will require collecting all data on a single node.

See also

DeferredDataFrame.count

Count number of non-NA/null observations.

DeferredDataFrame.max

Maximum of the values in the object.

DeferredDataFrame.min

Minimum of the values in the object.

DeferredDataFrame.mean

Mean of the values.

DeferredDataFrame.std

Standard deviation of the observations.

DeferredDataFrame.select_dtypes

Subset of a DeferredDataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as the lower, 50th and upper percentiles. By default the lower percentile is the 25th and the upper percentile is the 75th. The 50th percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DeferredDataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DeferredDataFrame are analyzed for the output. The parameters are ignored when analyzing a DeferredSeries.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields
are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')  
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as
an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[object])  
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[object])  
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
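
The `percentiles` parameter is not exercised in the examples above. As a hedged pandas-only sketch, the following requests the 10th, 50th and 90th percentiles explicitly; the data and variable names are illustrative, and the median is listed explicitly so the result does not depend on whether a given pandas version adds it automatically.

```python
import pandas as pd

s = pd.Series(range(1, 11))

# Request the 10th, 50th and 90th percentiles instead of the defaults.
summary = s.describe(percentiles=[0.1, 0.5, 0.9])

# The index labels are derived from the requested percentiles.
labels = list(summary.index)
```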
max(*args, **kwargs)

Return the maximum of the values over the requested axis.

If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

DeferredSeries or scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum

Return the sum.

DeferredSeries.min

Return the minimum.

DeferredSeries.max

Return the maximum.

DeferredSeries.idxmin

Return the index of the minimum.

DeferredSeries.idxmax

Return the index of the maximum.

DeferredDataFrame.sum

Return the sum over the requested axis.

DeferredDataFrame.min

Return the minimum over the requested axis.

DeferredDataFrame.max

Return the maximum over the requested axis.

DeferredDataFrame.idxmin

Return the index of the minimum over the requested axis.

DeferredDataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.max()
8
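
The difference between max and idxmax mentioned above can be sketched in plain pandas: max returns the value, while idxmax returns the label at which it occurs. The variable names `largest` and `where` are only illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['warm', 'warm', 'cold', 'cold'],
     ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
s = pd.Series([4, 2, 0, 8], name='legs', index=idx)

largest = s.max()    # the maximum value itself
where = s.idxmax()   # the index label of that value (a tuple for a MultiIndex)
```

Looking up `s.loc[where]` returns the same value as `largest`.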
min(*args, **kwargs)

Return the minimum of the values over the requested axis.

If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

DeferredSeries or scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum

Return the sum.

DeferredSeries.min

Return the minimum.

DeferredSeries.max

Return the maximum.

DeferredSeries.idxmin

Return the index of the minimum.

DeferredSeries.idxmax

Return the index of the maximum.

DeferredDataFrame.sum

Return the sum over the requested axis.

DeferredDataFrame.min

Return the minimum over the requested axis.

DeferredDataFrame.max

Return the maximum over the requested axis.

DeferredDataFrame.idxmin

Return the index of the minimum over the requested axis.

DeferredDataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.min()
0
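
The effect of `skipna` on min is not shown above. A minimal pandas-only sketch, with illustrative names, of how a missing value is either ignored or propagated:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0])

ignoring_na = s.min()               # NaN is skipped by default
propagating = s.min(skipna=False)   # NaN propagates to the result
```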
pivot(index=None, columns=None, values=None, **kwargs)[source]

Return reshaped DataFrame organized by given index / column values.

Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation; multiple values will result in a MultiIndex in the columns. See the User Guide for more on reshaping.

Parameters:
  • columns (str or object or a list of str) – Column to use to make new frame’s columns.

  • index (str or object or a list of str, optional) – Column to use to make new frame’s index. If not given, uses existing index.

  • values (str, object or a list of the previous, optional) – Column(s) to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.

Returns:

Returns reshaped DeferredDataFrame.

Return type:

DeferredDataFrame

Raises:

ValueError – When there are any index, columns combinations with multiple values. Use DeferredDataFrame.pivot_table when you need to aggregate.

Differences from pandas

Because pivot is a non-deferred method, any columns specified in columns must be CategoricalDtype so we can determine the output column names.
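
A categorical dtype carries its full list of categories, which is what allows the output column names to be known without looking at the data. The sketch below shows, in plain pandas only, how a column might be given such a dtype before pivoting. It assumes the categories are known up front; `bar_dtype` and `wide` are illustrative names, and the sketch does not show how the frame is obtained in a Beam pipeline.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'two', 'two'],
                   'bar': ['A', 'B', 'A', 'B'],
                   'baz': [1, 2, 3, 4]})

# Declare the full set of categories so the output columns are known
# from the dtype alone, without inspecting the data.
bar_dtype = pd.CategoricalDtype(categories=['A', 'B'])
df['bar'] = df['bar'].astype(bar_dtype)

wide = df.pivot(index='foo', columns='bar', values='baz')
```

The resulting columns correspond to the declared categories, in the declared order.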

See also

DeferredDataFrame.pivot_table

Generalization of pivot that can handle duplicate values for one index/column pair.

DeferredDataFrame.unstack

Pivot based on the index values instead of a column.

wide_to_long

Wide panel to long format. Less flexible but more user-friendly than melt.

Notes

For finer-tuned control, see hierarchical indexing documentation along with the related stack/unstack methods.

Reference the user guide for more examples.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
...                            'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
>>> df
    foo   bar  baz  zoo
0   one   A    1    x
1   one   B    2    y
2   one   C    3    z
3   two   A    4    q
4   two   B    5    w
5   two   C    6    t

>>> df.pivot(index='foo', columns='bar', values='baz')
bar  A   B   C
foo
one  1   2   3
two  4   5   6

>>> df.pivot(index='foo', columns='bar')['baz']
bar  A   B   C
foo
one  1   2   3
two  4   5   6

>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
      baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t

You could also assign a list of column names or a list of index names.

>>> df = pd.DataFrame({
...        "lev1": [1, 1, 1, 2, 2, 2],
...        "lev2": [1, 1, 2, 1, 1, 2],
...        "lev3": [1, 2, 1, 2, 1, 2],
...        "lev4": [1, 2, 3, 4, 5, 6],
...        "values": [0, 1, 2, 3, 4, 5]})
>>> df
    lev1 lev2 lev3 lev4 values
0   1    1    1    1    0
1   1    1    2    2    1
2   1    2    1    3    2
3   2    1    2    4    3
4   2    1    1    5    4
5   2    2    2    6    5

>>> df.pivot(index="lev1", columns=["lev2", "lev3"], values="values")
lev2    1         2
lev3    1    2    1    2
lev1
1     0.0  1.0  2.0  NaN
2     4.0  3.0  NaN  5.0

>>> df.pivot(index=["lev1", "lev2"], columns=["lev3"], values="values")
      lev3    1    2
lev1  lev2
   1     1  0.0  1.0
         2  2.0  NaN
   2     1  4.0  3.0
         2  NaN  5.0

A ValueError is raised if there are any duplicates.

>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]})
>>> df
   foo bar  baz
0  one   A    1
1  one   A    2
2  two   B    3
3  two   C    4

Notice that the first two rows are the same for our `index`
and `columns` arguments.

>>> df.pivot(index='foo', columns='bar', values='baz')
Traceback (most recent call last):
   ...
ValueError: Index contains duplicate entries, cannot reshape
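
When duplicates like these are expected, pivot_table can combine them with an aggregation function instead of raising. A pandas-only sketch on the same data, assuming a sum is the desired aggregation (the choice of `aggfunc` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
                   "bar": ['A', 'A', 'B', 'C'],
                   "baz": [1, 2, 3, 4]})

# The two ('one', 'A') rows are combined by the aggregation function.
table = df.pivot_table(index='foo', columns='bar', values='baz',
                       aggfunc='sum')
```

Combinations that never occur in the input, such as ('one', 'B'), are left as missing values.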
prod(*args, **kwargs)

Return the product of the values over the requested axis.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

DeferredSeries or scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum

Return the sum.

DeferredSeries.min

Return the minimum.

DeferredSeries.max

Return the maximum.

DeferredSeries.idxmin

Return the index of the minimum.

DeferredSeries.idxmax

Return the index of the maximum.

DeferredDataFrame.sum

Return the sum over the requested axis.

DeferredDataFrame.min

Return the minimum over the requested axis.

DeferredDataFrame.max

Return the maximum over the requested axis.

DeferredDataFrame.idxmin

Return the index of the minimum over the requested axis.

DeferredDataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
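
The examples above only use a Series. As a pandas-only sketch of the `axis` parameter on a frame (names are illustrative), the product can be taken per column or per row:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

by_column = df.prod()      # axis=0: one product per column
by_row = df.prod(axis=1)   # axis=1: one product per row
```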
product(*args, **kwargs)

Return the product of the values over the requested axis.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

DeferredSeries or scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum

Return the sum.

DeferredSeries.min

Return the minimum.

DeferredSeries.max

Return the maximum.

DeferredSeries.idxmin

Return the index of the minimum.

DeferredSeries.idxmax

Return the index of the maximum.

DeferredDataFrame.sum

Return the sum over the requested axis.

DeferredDataFrame.min

Return the minimum over the requested axis.

DeferredDataFrame.max

Return the maximum over the requested axis.

DeferredDataFrame.idxmin

Return the index of the minimum over the requested axis.

DeferredDataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
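
The examples above call prod rather than product. In pandas the two names refer to the same reduction, which the following small sketch checks; it uses plain pandas and illustrative variable names.

```python
import pandas as pd

s = pd.Series([2, 3, 4])

via_product = s.product()
via_prod = s.prod()
```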
sum(*args, **kwargs)

Return the sum of the values over the requested axis.

This is equivalent to the method numpy.sum.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

DeferredSeries or scalar

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.sum

Return the sum.

DeferredSeries.min

Return the minimum.

DeferredSeries.max

Return the maximum.

DeferredSeries.idxmin

Return the index of the minimum.

DeferredSeries.idxmax

Return the index of the maximum.

DeferredDataFrame.sum

Return the sum over the requested axis.

DeferredDataFrame.min

Return the minimum over the requested axis.

DeferredDataFrame.max

Return the maximum over the requested axis.

DeferredDataFrame.idxmin

Return the index of the minimum over the requested axis.

DeferredDataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.sum()
14

By default, the sum of an empty or all-NA Series is ``0``.

>>> pd.Series([], dtype="float64").sum()  # min_count=0 is the default
0.0

This can be controlled with the ``min_count`` parameter. For example, if
you'd like the sum of an empty series to be NaN, pass ``min_count=1``.

>>> pd.Series([], dtype="float64").sum(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
0.0

>>> pd.Series([np.nan]).sum(min_count=1)
nan
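
The interaction between skipna and min_count shown above can be modelled in a few lines of plain Python. This is only an illustrative sketch: sum_with_min_count is a hypothetical helper, not a pandas or Beam function, and it handles a flat list of floats only.

```python
import math


def sum_with_min_count(values, skipna=True, min_count=0):
    """Hypothetical model of sum(skipna=..., min_count=...) for a list of floats."""
    non_na = [v for v in values if not math.isnan(v)]
    # Fewer than min_count non-NA values means the result is NA.
    if len(non_na) < min_count:
        return math.nan
    # With skipna=False any NaN in the input propagates into the total.
    return float(sum(non_na if skipna else values))


empty_total = sum_with_min_count([])                          # empty input
empty_strict = sum_with_min_count([], min_count=1)            # empty, one value required
all_na_total = sum_with_min_count([math.nan])                 # all-NA input
all_na_strict = sum_with_min_count([math.nan], min_count=1)   # all-NA, one value required
```

In this model the empty and all-NA inputs behave identically, because skipna removes the NaN before the count is compared with min_count.
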
mean(*args, **kwargs)

Return the mean of the values over the requested axis.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

DeferredSeries or scalar

Examples

>>> s = pd.Series([1, 2, 3])
>>> s.mean()
2.0

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra'])
>>> df
       a   b
tiger  1   2
zebra  2   3
>>> df.mean()
a   1.5
b   2.5
dtype: float64

Using axis=1

>>> df.mean(axis=1)
tiger   1.5
zebra   2.5
dtype: float64

In this case, `numeric_only` should be set to `True` to avoid
getting an error.

>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']},
...                   index=['tiger', 'zebra'])
>>> df.mean(numeric_only=True)
a   1.5
dtype: float64

Differences from pandas

This operation has no known divergences from the pandas API.

median(*args, **kwargs)

Return the median of the values over the requested axis.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

DeferredSeries or scalar

Examples

>>> s = pd.Series([1, 2, 3])
>>> s.median()
2.0

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra'])
>>> df
       a   b
tiger  1   2
zebra  2   3
>>> df.median()
a   1.5
b   2.5
dtype: float64

Using axis=1

>>> df.median(axis=1)
tiger   1.5
zebra   2.5
dtype: float64

In this case, `numeric_only` should be set to `True`
to avoid getting an error.

>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']},
...                   index=['tiger', 'zebra'])
>>> df.median(numeric_only=True)
a   1.5
dtype: float64

Differences from pandas

median cannot currently be parallelized. It will require collecting all data on a single node.
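
The following plain-Python sketch illustrates why a median is harder to distribute than a mean. It shows a general property of the two statistics and is not a description of Beam's implementation; the partitions list is made-up data standing in for values held on different workers.

```python
import statistics

# Hypothetical data split across two workers.
partitions = [[1, 2, 3], [4, 5, 6, 7, 100]]
all_values = [v for part in partitions for v in part]

# Mean: each partition contributes a (sum, count) pair, and the pairs merge exactly.
partials = [(sum(part), len(part)) for part in partitions]
total, count = map(sum, zip(*partials))
merged_mean = total / count

# Median: combining per-partition medians does not give the global median.
median_of_medians = statistics.median(statistics.median(p) for p in partitions)
true_median = statistics.median(all_values)
```

Here merged_mean equals the mean of all values, while median_of_medians differs from true_median, so an exact median needs all of the values in one place.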

nunique(*args, **kwargs)

Count number of distinct elements in specified axis.

Return Series with number of distinct elements. Can ignore NaN values.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

  • dropna (bool, default True) – Don’t include NaN in the counts.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.nunique

Method nunique for DeferredSeries.

DeferredDataFrame.count

Count non-NA cells for each column or row.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]})
>>> df.nunique()
A    3
B    2
dtype: int64

>>> df.nunique(axis=1)
0    1
1    2
2    2
dtype: int64
std(*args, **kwargs)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0), columns (1)}) – For DeferredSeries this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

Return type:

DeferredSeries or DeferredDataFrame (if level specified)

Differences from pandas

This operation has no known divergences from the pandas API.

Notes

To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
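
As a rough illustration of the N - ddof divisor, the sketch below recomputes the standard deviation of the age column used in the examples for this method. std_with_ddof is a hypothetical helper written for this note, not part of pandas, NumPy, or Beam.

```python
import math
import statistics


def std_with_ddof(values, ddof=1):
    """Hypothetical helper: standard deviation with divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))


ages = [21, 25, 62, 43]

sample_std = std_with_ddof(ages)               # divisor N - 1 (the default)
population_std = std_with_ddof(ages, ddof=0)   # divisor N, the numpy.std convention
```

The two results correspond to statistics.stdev and statistics.pstdev from the Python standard library.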

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                    'age': [21, 25, 62, 43],
...                    'height': [1.61, 1.87, 1.49, 2.01]}
...                   ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

The standard deviation of the columns can be found as follows:

>>> df.std()
age       18.786076
height     0.237417
dtype: float64

Alternatively, `ddof=0` can be set to normalize by N instead of N-1:

>>> df.std(ddof=0)
age       16.269219
height     0.205609
dtype: float64
var(*args, **kwargs)

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0), columns (1)}) – For DeferredSeries this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

Return type:

DeferredSeries or DeferredDataFrame (if level specified)

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                   'age': [21, 25, 62, 43],
...                   'height': [1.61, 1.87, 1.49, 2.01]}
...                  ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

>>> df.var()
age       352.916667
height      0.056367
dtype: float64

Alternatively, ``ddof=0`` can be set to normalize by N instead of N-1:

>>> df.var(ddof=0)
age       264.687500
height      0.042275
dtype: float64
sem(*args, **kwargs)

Return unbiased standard error of the mean over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0), columns (1)}) – For DeferredSeries this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

Return type:

DeferredSeries or DeferredDataFrame (if level specified)

Examples

>>> s = pd.Series([1, 2, 3])
>>> s.sem().round(6)
0.57735

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra'])
>>> df
       a   b
tiger  1   2
zebra  2   3
>>> df.sem()
a   0.5
b   0.5
dtype: float64

Using axis=1

>>> df.sem(axis=1)
tiger   0.5
zebra   0.5
dtype: float64

In this case, `numeric_only` should be set to `True`
to avoid getting an error.

>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']},
...                   index=['tiger', 'zebra'])
>>> df.sem(numeric_only=True)
a   0.5
dtype: float64

Differences from pandas

sem cannot currently be parallelized. It will require collecting all data on a single node.
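
For reference, the standard error of the mean is the standard deviation (with divisor N - ddof) divided by the square root of N. The sketch below is a plain-Python illustration of that formula; standard_error is a hypothetical helper, not a Beam or pandas function.

```python
import math


def standard_error(values, ddof=1):
    """Hypothetical helper: std (divisor N - ddof) divided by sqrt(N)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - ddof)
    return math.sqrt(variance) / math.sqrt(n)


sem_value = standard_error([1, 2, 3])
row_sem = standard_error([1, 2])  # the 'tiger' row of the DataFrame example
```

Rounded to six places, sem_value matches the Series example for this method.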

skew(*args, **kwargs)

Return unbiased skew over requested axis.

Normalized by N-1.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

DeferredSeries or scalar

Examples

>>> s = pd.Series([1, 2, 3])
>>> s.skew()
0.0

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [1, 3, 5]},
...                  index=['tiger', 'zebra', 'cow'])
>>> df
        a   b   c
tiger   1   2   1
zebra   2   3   3
cow     3   4   5
>>> df.skew()
a   0.0
b   0.0
c   0.0
dtype: float64

Using axis=1

>>> df.skew(axis=1)
tiger   1.732051
zebra  -1.732051
cow     0.000000
dtype: float64

In this case, `numeric_only` should be set to `True` to avoid
getting an error.

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['T', 'Z', 'X']},
...                  index=['tiger', 'zebra', 'cow'])
>>> df.skew(numeric_only=True)
a   0.0
dtype: float64

Differences from pandas

This operation has no known divergences from the pandas API.
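
The values in the examples for this method can be reproduced with the adjusted Fisher-Pearson skewness coefficient. The sketch below assumes that formula because it matches the outputs shown; sample_skew is a hypothetical plain-Python helper and says nothing about how Beam computes the result internally.

```python
import math


def sample_skew(values):
    """Hypothetical helper: adjusted Fisher-Pearson sample skewness."""
    n = len(values)
    if n < 3:
        return math.nan  # the correction factor is undefined below three values
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no skew
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5


tiger = sample_skew([1, 2, 1])
zebra = sample_skew([2, 3, 3])
cow = sample_skew([3, 4, 5])
```

The three rows correspond to the axis=1 example: tiger and zebra are skewed in opposite directions, and cow is symmetric.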

kurt(*args, **kwargs)

Return unbiased kurtosis over requested axis.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

DeferredSeries or scalar

Examples

>>> s = pd.Series([1, 2, 2, 3], index=['cat', 'dog', 'dog', 'mouse'])
>>> s
cat    1
dog    2
dog    2
mouse  3
dtype: int64
>>> s.kurt()
1.5

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]},
...                   index=['cat', 'dog', 'dog', 'mouse'])
>>> df
       a   b
  cat  1   3
  dog  2   4
  dog  2   4
mouse  3   4
>>> df.kurt()
a   1.5
b   4.0
dtype: float64

With axis=None

>>> df.kurt(axis=None).round(6)
-0.988693

Using axis=1

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [3, 4], 'd': [1, 2]},
...                   index=['cat', 'dog'])
>>> df.kurt(axis=1)
cat   -6.0
dog   -6.0
dtype: float64

Differences from pandas

This operation has no known divergences from the pandas API.
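
The values in the examples for this method are consistent with the bias-corrected sample excess kurtosis under Fisher's definition. The sketch below assumes that formula because it reproduces the outputs shown; sample_excess_kurtosis is a hypothetical plain-Python helper, not Beam's implementation.

```python
import math


def sample_excess_kurtosis(values):
    """Hypothetical helper: bias-corrected excess kurtosis (normal == 0.0)."""
    n = len(values)
    if n < 4:
        return math.nan  # the correction factor is undefined below four values
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        return 0.0  # constant data
    g2 = m4 / m2 ** 2 - 3  # uncorrected excess kurtosis
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)


column_a = sample_excess_kurtosis([1, 2, 2, 3])
column_b = sample_excess_kurtosis([3, 4, 4, 4])
row_cat = sample_excess_kurtosis([1, 3, 3, 1])
```

column_a and column_b correspond to the df.kurt() example, and row_cat to the cat row of the axis=1 example.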

kurtosis(*args, **kwargs)

Return unbiased kurtosis over requested axis.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For DeferredSeries this parameter is unused and defaults to 0.

    For DeferredDataFrames, specifying axis=None will apply the aggregation across both axes.

    Added in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for DeferredSeries.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

DeferredSeries or scalar

Examples

>>> s = pd.Series([1, 2, 2, 3], index=['cat', 'dog', 'dog', 'mouse'])
>>> s
cat    1
dog    2
dog    2
mouse  3
dtype: int64
>>> s.kurt()
1.5

With a DataFrame

>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]},
...                   index=['cat', 'dog', 'dog', 'mouse'])
>>> df
       a   b
  cat  1   3
  dog  2   4
  dog  2   4
mouse  3   4
>>> df.kurt()
a   1.5
b   4.0
dtype: float64

With axis=None

>>> df.kurt(axis=None).round(6)
-0.988693

Using axis=1

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [3, 4], 'd': [1, 2]},
...                   index=['cat', 'dog'])
>>> df.kurt(axis=1)
cat   -6.0
dog   -6.0
dtype: float64

Differences from pandas

This operation has no known divergences from the pandas API.

take(**kwargs)

pandas.DataFrame.take() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

to_records(**kwargs)

pandas.DataFrame.to_records() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_dict(**kwargs)

pandas.DataFrame.to_dict() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_numpy(**kwargs)

pandas.DataFrame.to_numpy() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_string(**kwargs)

pandas.DataFrame.to_string() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

to_sparse(**kwargs)

pandas.DataFrame.to_sparse() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

transpose(**kwargs)

pandas.DataFrame.transpose() is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.

For more information see https://s.apache.org/dataframe-non-deferred-columns.

property T

pandas.DataFrame.T() is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.

For more information see https://s.apache.org/dataframe-non-deferred-columns.

update(**kwargs)

Modify in place using non-NA values from another DataFrame.

Aligns on indices. There is no return value.

Parameters:
  • other (DeferredDataFrame, or object coercible into a DeferredDataFrame) – Should have at least one matching index/column label with the original DeferredDataFrame. If a DeferredSeries is passed, its name attribute must be set, and that will be used as the column name to align with the original DeferredDataFrame.

  • join ({'left'}, default 'left') – Only left join is implemented, keeping the index and columns of the original object.

  • overwrite (bool, default True) –

    How to handle non-NA values for overlapping keys:

    • True: overwrite original DeferredDataFrame’s values with values from other.

    • False: only update values that are NA in the original DeferredDataFrame.

  • filter_func (callable(1d-array) -> bool 1d-array, optional) – Can choose to replace values other than NA. Return True for values that should be updated.

  • errors ({'raise', 'ignore'}, default 'ignore') – If ‘raise’, will raise a ValueError if the DeferredDataFrame and other both contain non-NA data in the same place.

Returns:

This method directly changes calling object.

Return type:

None

Raises:
  • ValueError

    • When errors=’raise’ and there’s overlapping non-NA data.

    • When errors is not either ‘ignore’ or ‘raise’.

  • NotImplementedError

    • If join != ‘left’

Differences from pandas

This operation has no known divergences from the pandas API.
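
The left-join and overwrite rules described in the parameters can be modelled for a single column with plain dictionaries. This is a simplified sketch under stated assumptions: update_column is a hypothetical helper, each dict maps an index label to a value, None stands in for NA, and a new dict is returned rather than modifying the input in place.

```python
def update_column(original, other, overwrite=True):
    """Hypothetical model of update() for one column (None plays the role of NA)."""
    updated = dict(original)
    for label, new_value in other.items():
        if label not in updated:
            continue  # left join: labels that exist only in `other` are ignored
        if new_value is None:
            continue  # NA values in `other` never replace anything
        if overwrite or updated[label] is None:
            updated[label] = new_value
    return updated


# A partial update on labels 0 and 2.
partial = update_column({0: 'x', 1: 'y', 2: 'z'}, {0: 'd', 2: 'e'})

# NA in `other` leaves the original value untouched.
with_na = update_column({0: 400, 1: 500, 2: 600}, {0: 4, 1: None, 2: 6})

# overwrite=False only fills positions that are NA in the original.
fill_only = update_column({0: None, 1: 500}, {0: 4, 1: 5}, overwrite=False)

# Extra labels in `other` do not lengthen the result.
longer_other = update_column({0: 'x', 1: 'y'}, {0: 'd', 1: 'e', 2: 'f'})
```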

See also

dict.update

Similar method for dictionaries.

DeferredDataFrame.merge

For column(s)-on-column(s) operations.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, 5, 6],
...                        'C': [7, 8, 9]})
>>> df.update(new_df)
>>> df
   A  B
0  1  4
1  2  5
2  3  6

The DataFrame's length does not increase as a result of the update,
only values at matching index/column labels are updated.

>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']})
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  e
2  c  f

For Series, its name attribute must be set.

>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_column = pd.Series(['d', 'e'], name='B', index=[0, 2])
>>> df.update(new_column)
>>> df
   A  B
0  a  d
1  b  y
2  c  e
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e']}, index=[1, 2])
>>> df.update(new_df)
>>> df
   A  B
0  a  x
1  b  d
2  c  e

If `other` contains NaNs the corresponding values are not updated
in the original dataframe.

>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, np.nan, 6]})
>>> df.update(new_df)
>>> df
   A    B
0  1    4
1  2  500
2  3    6
property values

pandas.DataFrame.values() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

property style

pandas.DataFrame.style() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.

For more information see https://s.apache.org/dataframe-non-deferred-result.

melt(ignore_index, **kwargs)[source]

Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:
  • id_vars (tuple, list, or ndarray, optional) – Column(s) to use as identifier variables.

  • value_vars (tuple, list, or ndarray, optional) – Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

  • var_name (scalar) – Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

  • value_name (scalar, default 'value') – Name to use for the ‘value’ column.

  • col_level (int or str, optional) – If columns are a MultiIndex then use this level to melt.

  • ignore_index (bool, default True) – If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.

Returns:

Unpivoted DeferredDataFrame.

Return type:

DeferredDataFrame

Differences from pandas

ignore_index=True is not supported, because it requires generating an order-sensitive index.
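
The sketch below illustrates, in plain Python, why keeping the original index avoids the ordering problem. melt_keep_index is a hypothetical helper operating on a list of (index label, row dict) pairs; it is not how Beam implements melt.

```python
def melt_keep_index(rows, id_vars, value_vars):
    """Hypothetical model of melt(ignore_index=False) on (label, row) pairs."""
    melted = []
    for column in value_vars:
        for label, row in rows:
            record = {name: row[name] for name in id_vars}
            record['variable'] = column
            record['value'] = row[column]
            # The output label comes from the input row alone, so no global
            # row numbering (and therefore no global ordering) is needed.
            melted.append((label, record))
    return melted


rows = [
    (0, {'A': 'a', 'B': 1, 'C': 2}),
    (1, {'A': 'b', 'B': 3, 'C': 4}),
    (2, {'A': 'c', 'B': 5, 'C': 6}),
]
long_rows = melt_keep_index(rows, id_vars=['A'], value_vars=['B', 'C'])
labels = [label for label, _ in long_rows]
```

Each input label appears once per value column in the output. Producing a fresh 0..N-1 index instead would require numbering rows across the whole dataset.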

See also

melt

Identical method.

pivot_table

Create a spreadsheet-style pivot table as a DeferredDataFrame.

DeferredDataFrame.pivot

Return reshaped DeferredDataFrame organized by given index / column values.

DeferredDataFrame.explode

Explode a DeferredDataFrame from list-like columns to long format.

Notes

Reference the user guide for more examples.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
...                    'B': {0: 1, 1: 3, 2: 5},
...                    'C': {0: 2, 1: 4, 2: 6}})
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6

>>> df.melt(id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5

>>> df.melt(id_vars=['A'], value_vars=['B', 'C'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6

The names of 'variable' and 'value' columns can be customized:

>>> df.melt(id_vars=['A'], value_vars=['B'],
...         var_name='myVarname', value_name='myValname')
   A myVarname  myValname
0  a         B          1
1  b         B          3
2  c         B          5

Original index values can be kept around:

>>> df.melt(id_vars=['A'], value_vars=['B', 'C'], ignore_index=False)
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
0  a        C      2
1  b        C      4
2  c        C      6

If you have multi-index columns:

>>> df.columns = [list('ABC'), list('DEF')]
>>> df
   A  B  C
   D  E  F
0  a  1  2
1  b  3  4
2  c  5  6

>>> df.melt(col_level=0, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5

>>> df.melt(id_vars=[('A', 'D')], value_vars=[('B', 'E')])
  (A, D) variable_0 variable_1  value
0      a          B          E      1
1      b          B          E      3
2      c          B          E      5
value_counts(subset=None, sort=False, normalize=False, ascending=False, dropna=True)[source]

Return a Series containing the frequency of each distinct row in the DataFrame.

Parameters:
  • subset (label or list of labels, optional) – Columns to use when counting unique combinations.

  • normalize (bool, default False) – Return proportions rather than frequencies.

  • sort (bool, default True) – Sort by frequencies when True. Sort by DeferredDataFrame column values when False.

  • ascending (bool, default False) – Sort in ascending order.

  • dropna (bool, default True) –

    Don’t include counts of rows that contain NA values.

    Added in version 1.3.0.

Return type:

DeferredSeries

Differences from pandas

sort is False by default, and sort=True is not supported because it imposes an ordering on the dataset which likely will not be preserved.
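
Conceptually, an unsorted value count is a mapping from each distinct row to its frequency. The plain-Python sketch below models that with collections.Counter; the rows list is the num_legs / num_wings data from the examples for this method, and nothing here reflects Beam's internal implementation.

```python
from collections import Counter

# (num_legs, num_wings) for falcon, dog, cat, ant.
rows = [(2, 2), (4, 0), (4, 0), (6, 0)]

# Frequency of each distinct row; no ordering of the result is implied.
counts = Counter(rows)

# normalize=True corresponds to dividing every count by the number of rows.
proportions = {row: n / len(rows) for row, n in counts.items()}
```

Because the result is keyed by row value, it can be looked up or joined without depending on any particular output order.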

See also

DeferredSeries.value_counts

Equivalent method on DeferredSeries.

Notes

The returned DeferredSeries will have a MultiIndex with one level per input column but an Index (non-multi) for a single label. By default, rows that contain any NA values are omitted from the result. By default, the resulting DeferredSeries will be in descending order so that the first element is the most frequently-occurring row.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
...                    'num_wings': [2, 0, 0, 0]},
...                   index=['falcon', 'dog', 'cat', 'ant'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0
cat            4          0
ant            6          0

>>> df.value_counts()
num_legs  num_wings
4         0            2
2         2            1
6         0            1
Name: count, dtype: int64

>>> df.value_counts(sort=False)
num_legs  num_wings
2         2            1
4         0            2
6         0            1
Name: count, dtype: int64

>>> df.value_counts(ascending=True)
num_legs  num_wings
2         2            1
6         0            1
4         0            2
Name: count, dtype: int64

>>> df.value_counts(normalize=True)
num_legs  num_wings
4         0            0.50
2         2            0.25
6         0            0.25
Name: proportion, dtype: float64

With `dropna` set to `False` we can also count rows with NA values.

>>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
...                    'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
>>> df
  first_name middle_name
0       John       Smith
1       Anne        <NA>
2       John        <NA>
3       Beth      Louise

>>> df.value_counts()
first_name  middle_name
Beth        Louise         1
John        Smith          1
Name: count, dtype: int64

>>> df.value_counts(dropna=False)
first_name  middle_name
Anne        NaN            1
Beth        Louise         1
John        Smith          1
            NaN            1
Name: count, dtype: int64

>>> df.value_counts("first_name")
first_name
John    2
Anne    1
Beth    1
Name: count, dtype: int64

compare(other, align_axis, keep_shape, **kwargs)[source]

Compare to another DataFrame and show the differences.

Parameters:
  • other (DeferredDataFrame) – Object to compare with.

  • align_axis ({0 or 'index', 1 or 'columns'}, default 1) –

    Determine which axis to align the comparison on.

    • 0, or ‘index’: Resulting differences are stacked vertically with rows drawn alternately from self and other.

    • 1, or ‘columns’: Resulting differences are aligned horizontally with columns drawn alternately from self and other.

  • keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.

  • keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.

  • result_names (tuple, default ('self', 'other')) –

    Set the dataframes' names in the comparison.

    Added in version 1.5.0.

Returns:

DeferredDataFrame that shows the differences stacked side by side.

The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.

Return type:

DeferredDataFrame

Raises:

ValueError – When the two DeferredDataFrames don’t have identical labels or shape.

Differences from pandas

The default combination of align_axis=1 and keep_shape=False is not supported, because the output columns depend on the data. To use align_axis=1, please specify keep_shape=True.
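
The reason for this restriction can be seen in plain pandas. The sketch below (pandas only, not Beam code; the variable names are illustrative) shows that with keep_shape=False the output columns depend on which values differ, while with keep_shape=True they can be derived from the input columns alone.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "b", "b", "a"],
                   "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
                   "col3": [1.0, 2.0, 3.0, 4.0, 5.0]})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

# keep_shape=False: only columns containing a difference survive,
# so the result schema cannot be known before looking at the data.
data_dependent = df.compare(df2).columns.tolist()

# keep_shape=True: every input column appears once for each side.
fixed = df.compare(df2, align_axis=1, keep_shape=True).columns.tolist()
expected = [(col, side) for col in df.columns for side in ("self", "other")]
```

In this data col2 has no differences, so it is missing from data_dependent but present in fixed.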

See also

DeferredSeries.compare

Compare with another DeferredSeries and show differences.

DeferredDataFrame.equals

Test whether two objects contain the same elements.

Notes

Matching NaNs will not appear as a difference.

Can only compare identically-labeled (i.e. same shape, identical row and column labels) DeferredDataFrames.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame(
...     {
...         "col1": ["a", "a", "b", "b", "a"],
...         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
...     },
...     columns=["col1", "col2", "col3"],
... )
>>> df
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

Align the differences on columns

>>> df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

Assign result_names

>>> df.compare(df2, result_names=("left", "right"))
  col1       col3
  left right left right
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

Stack the differences on rows

>>> df.compare(df2, align_axis=0)
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0

Keep the equal values

>>> df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0

Keep all original rows and columns

>>> df.compare(df2, keep_shape=True)
  col1       col2       col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN

Keep all original rows and columns and also all original values

>>> df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0

idxmin(**kwargs)[source]

Return index of first occurrence of minimum over requested axis.

NA/null values are excluded.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    Added in version 1.5.0.

Returns:

Indexes of minima along the specified axis.

Return type:

DeferredSeries

Raises:

ValueError

  • If the row/column is empty

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.idxmin

Return index of the minimum element.

Notes

This method is the DeferredDataFrame version of ndarray.argmin.
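
A small pandas sketch of the two rules stated in the description: ties resolve to the first occurrence, and NA values are skipped. The data is made up for illustration and is not part of the pandas examples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3.0, 1.0, 1.0],    # minimum 1.0 occurs at 'y' and 'z'
                   'b': [np.nan, 5.0, 2.0]},  # NaN at 'x' is ignored
                  index=['x', 'y', 'z'])

result = df.idxmin()
first_tie = result['a']    # label of the first of the two tied minima
skips_nan = result['b']    # label of the smallest non-NA value
```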

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Consider a dataset containing food consumption in Argentina.

>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                     'co2_emissions': [37.2, 19.66, 1712]},
...                   index=['Pork', 'Wheat Products', 'Beef'])

>>> df
                consumption  co2_emissions
Pork                  10.51         37.20
Wheat Products       103.11         19.66
Beef                  55.48       1712.00

By default, it returns the index for the minimum value in each column.

>>> df.idxmin()
consumption                Pork
co2_emissions    Wheat Products
dtype: object

To return the index for the minimum value in each row, use ``axis="columns"``.

>>> df.idxmin(axis="columns")
Pork                consumption
Wheat Products    co2_emissions
Beef                consumption
dtype: object

idxmax(**kwargs)[source]

Return index of first occurrence of maximum over requested axis.

NA/null values are excluded.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    Added in version 1.5.0.

Returns:

Indexes of maxima along the specified axis.

Return type:

DeferredSeries

Raises:

ValueError

  • If the row/column is empty

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.idxmax

Return index of the maximum element.

Notes

This method is the DeferredDataFrame version of ndarray.argmax.
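
To make the relationship to ndarray.argmax concrete, the pandas sketch below compares the two on a single column: argmax returns an integer position, while idxmax returns the index label at that position. It is an illustration in plain pandas and NumPy, not Beam code.

```python
import pandas as pd

df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
                   'co2_emissions': [37.2, 19.66, 1712]},
                  index=['Pork', 'Wheat Products', 'Beef'])

column = df['co2_emissions']
position = int(column.to_numpy().argmax())  # integer position of the maximum
label_from_position = df.index[position]    # translate position to a label
label = column.idxmax()                     # label returned directly
```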

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Consider a dataset containing food consumption in Argentina.

>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                     'co2_emissions': [37.2, 19.66, 1712]},
...                   index=['Pork', 'Wheat Products', 'Beef'])

>>> df
                consumption  co2_emissions
Pork                  10.51         37.20
Wheat Products       103.11         19.66
Beef                  55.48       1712.00

By default, it returns the index for the maximum value in each column.

>>> df.idxmax()
consumption     Wheat Products
co2_emissions             Beef
dtype: object

To return the index for the maximum value in each row, use ``axis="columns"``.

>>> df.idxmax(axis="columns")
Pork              co2_emissions
Wheat Products     consumption
Beef              co2_emissions
dtype: object

add(**kwargs)

Get Addition of dataframe and other, element-wise (binary operator add).

Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, radd.

Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
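
The fill_value rule is easiest to see on a tiny input. The following pandas sketch uses made-up frames (not Beam code) to cover the three cases: both values present, one missing, and both missing.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'x': [1.0, np.nan, np.nan]}, index=['r1', 'r2', 'r3'])
right = pd.DataFrame({'x': [10.0, 20.0, np.nan]}, index=['r1', 'r2', 'r3'])

total = left.add(right, fill_value=0)

both_present = total.loc['r1', 'x']  # 1.0 + 10.0
one_missing = total.loc['r2', 'x']   # missing left value replaced by 0
both_missing = total.loc['r3', 'x']  # stays missing
```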

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.
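
As an illustration of this note, the pandas sketch below adds two frames whose indices only partly overlap. The frames are hypothetical and the code is plain pandas rather than Beam.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=['r1', 'r2'])
b = pd.DataFrame({'x': [10, 20]}, index=['r2', 'r3'])

total = a.add(b)

labels = list(total.index)                   # union of both indices
overlap = total.loc['r2', 'x']               # the only label present in both
unmatched = int(total['x'].isna().sum())     # labels present in one input only
```

Labels that appear in only one input are kept in the result with a missing value, unless fill_value is given.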

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

apply(**kwargs)

pandas.DataFrame.apply() is not implemented yet in the Beam DataFrame API.

If support for ‘apply’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

asfreq(**kwargs)

pandas.DataFrame.asfreq() is not implemented yet in the Beam DataFrame API.

If support for ‘asfreq’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

property at

pandas.DataFrame.at() is not implemented yet in the Beam DataFrame API.

If support for ‘at’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

boxplot(**kwargs)

pandas.DataFrame.boxplot() is not implemented yet in the Beam DataFrame API.

If support for ‘boxplot’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

convert_dtypes(**kwargs)

pandas.DataFrame.convert_dtypes() is not implemented yet in the Beam DataFrame API.

If support for ‘convert_dtypes’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

div(**kwargs)

Get Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
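
The equivalences described above can be checked directly in pandas. This is a minimal sketch in plain pandas, not Beam code; rdiv is used here as the pandas spelling of the reverse version.

```python
import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4],
                   'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

# Method form and operator form produce the same frame.
method_matches_operator = df.div(10).equals(df / 10)

# div and truediv are two names for the same operation.
div_matches_truediv = df.div(10).equals(df.truediv(10))

# The reverse version swaps the operands: 10 / df instead of df / 10.
reverse_matches_operator = df.rdiv(10).equals(10 / df)
```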

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

divide(**kwargs)

Get Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
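
The effect of axis when other is a Series can be sketched as follows. The example is plain pandas with made-up divisors (row_divisor and column_divisor are illustrative names), not Beam code.

```python
import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4],
                   'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

# One divisor per row: the Series index is matched against the row labels.
row_divisor = pd.Series([10, 20, 40],
                        index=['circle', 'triangle', 'rectangle'])
by_row = df.divide(row_divisor, axis='index')

# One divisor per column: the Series index is matched against the column names.
column_divisor = pd.Series({'angles': 2, 'degrees': 360})
by_column = df.divide(column_divisor, axis='columns')
```

With axis='index' each row is divided by its own value, so the 'triangle' row is divided by 20. With axis='columns' every value in 'degrees' is divided by 360.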

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

eq(**kwargs)

Get Equal to of dataframe and other, element-wise (binary operator eq).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq

Compare DeferredDataFrames for equality elementwise.

DeferredDataFrame.ne

Compare DeferredDataFrames for inequality elementwise.

DeferredDataFrame.le

Compare DeferredDataFrames for less than inequality or equality elementwise.

DeferredDataFrame.lt

Compare DeferredDataFrames for strictly less than inequality elementwise.

DeferredDataFrame.ge

Compare DeferredDataFrames for greater than inequality or equality elementwise.

DeferredDataFrame.gt

Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
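
The NaN rule means that a frame compared element-wise with an identical copy of itself is not all True. The pandas sketch below (illustrative data, not Beam code) contrasts eq with equals, which treats NaNs in the same location as equal.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'cost': [250.0, np.nan],
                   'revenue': [100.0, 250.0]},
                  index=['A', 'B'])

elementwise = df.eq(df.copy())

regular_cell = bool(elementwise.loc['A', 'cost'])  # ordinary values compare equal
nan_cell = bool(elementwise.loc['B', 'cost'])      # NaN == NaN is False
whole_frame = df.equals(df.copy())                 # equals() matches NaNs by position
```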

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When `other` is a Series, the columns of a DataFrame are aligned
with the index of `other` and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must
match the number of elements in `other`:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False

filter(**kwargs)

Subset the dataframe rows or columns according to the specified index labels.

Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.

Parameters:
  • items (list-like) – Keep labels from axis which are in items.

  • like (str) – Keep labels from axis for which “like in label == True”.

  • regex (str (regular expression)) – Keep labels from axis for which re.search(regex, label) == True.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, ‘columns’ for DeferredDataFrame. For DeferredSeries this parameter is unused and defaults to None.

Return type:

same type as input object

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.loc

Access a group of rows and columns by label(s) or a boolean array.

Notes

The items, like, and regex parameters are enforced to be mutually exclusive.

axis defaults to the info axis that is used when indexing with [].
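
Both notes can be illustrated with a short pandas sketch (plain pandas, not Beam code). It rebuilds the regex selection by hand with re.search on the column labels, and shows that combining two of the selection parameters is rejected.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
                  index=['mouse', 'rabbit'],
                  columns=['one', 'two', 'three'])

# filter(regex=...) looks only at labels, never at cell values.
by_regex = df.filter(regex='e$', axis=1)
keep = [bool(re.search('e$', label)) for label in df.columns]
by_hand = df.loc[:, keep]

# items, like and regex cannot be combined in one call.
try:
    df.filter(items=['one'], like='t')
    rejected = False
except TypeError:
    rejected = True
```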

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6

>>> # select columns by name
>>> df.filter(items=['one', 'three'])
         one  three
mouse     1      3
rabbit    4      6

>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
         one  three
mouse     1      3
rabbit    4      6

>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
         one  two  three
rabbit    4    5      6

property flags

pandas.DataFrame.flags() is not implemented yet in the Beam DataFrame API.

If support for ‘flags’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

floordiv(**kwargs)

Get Integer division of dataframe and other, element-wise (binary operator floordiv).

Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rfloordiv.

Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
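
A short pandas sketch of the method, the // operator and the reverse version side by side. The frame is made up for illustration and avoids zeros so that every quotient is an ordinary integer. This is plain pandas, not Beam code.

```python
import pandas as pd

df = pd.DataFrame({'angles': [3, 4, 5],
                   'degrees': [180, 360, 540]},
                  index=['triangle', 'rectangle', 'pentagon'])

quotient = df.floordiv(7)      # same as df // 7
reverse = df.rfloordiv(1000)   # same as 1000 // df

method_matches_operator = quotient.equals(df // 7)
reverse_matches_operator = reverse.equals(1000 // df)
```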

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.
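
A minimal sketch of the index union and of fill_value for floordiv, written in plain pandas rather than the deferred Beam API. The frames left and right are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'x': [10, 20]}, index=['r1', 'r2'])
right = pd.DataFrame({'x': [3, 4]}, index=['r2', 'r3'])

# Only 'r2' is present on both sides; 'r1' and 'r3' come from the union
# of the two indices and have no partner, so they are NaN.
plain = left.floordiv(right)

# fill_value stands in for the missing side before dividing:
# r1 -> 10 // 1, r2 -> 20 // 3, r3 -> 1 // 4
filled = left.floordiv(right, fill_value=1)

print(list(plain.index))     # ['r1', 'r2', 'r3']
print(plain['x'].tolist())   # [nan, 6.0, nan]
print(filled['x'].tolist())  # [10.0, 6.0, 0.0]
```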

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results as add.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
ge(**kwargs)

Compare dataframe and other element-wise for greater than or equal to (binary operator ge).

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around the comparison operators.

The wrappers are equivalent to ==, !=, <=, <, >=, > respectively, with support for choosing the axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq

Compare DeferredDataFrames for equality elementwise.

DeferredDataFrame.ne

Compare DeferredDataFrames for inequality elementwise.

DeferredDataFrame.le

Compare DeferredDataFrames for less than inequality or equality elementwise.

DeferredDataFrame.lt

Compare DeferredDataFrames for strictly less than inequality elementwise.

DeferredDataFrame.ge

Compare DeferredDataFrames for greater than inequality or equality elementwise.

DeferredDataFrame.gt

Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
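
The shared examples below do not call ge itself, so here is a minimal sketch in plain pandas (not the deferred Beam API). It reuses the cost/revenue frame from the examples; with_nan is a made-up frame that shows the NaN rule from the note above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'cost': [250, 150, 100],
                   'revenue': [100, 250, 300]},
                  index=['A', 'B', 'C'])

# The method and the operator agree for a scalar.
at_least_150 = df.ge(150)
same_as_operator = at_least_150.equals(df >= 150)

# NaN never compares as greater-or-equal, not even to itself.
with_nan = pd.DataFrame({'v': [np.nan, 1.0]})
nan_result = with_nan.ge(with_nan)['v'].tolist()

print(at_least_150['cost'].tolist())     # [True, True, False]
print(at_least_150['revenue'].tolist())  # [False, True, True]
print(same_as_operator)                  # True
print(nan_result)                        # [False, True]
```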

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
gt(**kwargs)

Compare dataframe and other element-wise for strictly greater than (binary operator gt).

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around the comparison operators.

The wrappers are equivalent to ==, !=, <=, <, >=, > respectively, with support for choosing the axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq

Compare DeferredDataFrames for equality elementwise.

DeferredDataFrame.ne

Compare DeferredDataFrames for inequality elementwise.

DeferredDataFrame.le

Compare DeferredDataFrames for less than inequality or equality elementwise.

DeferredDataFrame.lt

Compare DeferredDataFrames for strictly less than inequality elementwise.

DeferredDataFrame.ge

Compare DeferredDataFrames for greater than inequality or equality elementwise.

DeferredDataFrame.gt

Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
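
A minimal sketch of how the axis parameter changes what a Series is matched against in gt. It is plain pandas rather than the deferred Beam API, and the threshold values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'cost': [250, 150, 100],
                   'revenue': [100, 250, 300]},
                  index=['A', 'B', 'C'])

# Default axis='columns': the Series index is matched to column labels,
# so each column gets its own threshold.
per_column = df.gt(pd.Series({'cost': 120, 'revenue': 260}))

# axis='index': the Series index is matched to row labels instead,
# so each row gets its own threshold.
per_row = df.gt(pd.Series({'A': 200, 'B': 100, 'C': 400}), axis='index')

print(per_column['cost'].tolist())     # [True, True, False]
print(per_column['revenue'].tolist())  # [False, False, True]
print(per_row['cost'].tolist())        # [True, True, False]
print(per_row['revenue'].tolist())     # [False, True, False]
```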

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
infer_objects(**kwargs)

pandas.DataFrame.infer_objects() is not implemented yet in the Beam DataFrame API.

If support for ‘infer_objects’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

isetitem(**kwargs)

pandas.DataFrame.isetitem() is not implemented yet in the Beam DataFrame API.

If support for ‘isetitem’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

le(**kwargs)

Compare dataframe and other element-wise for less than or equal to (binary operator le).

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around the comparison operators.

The wrappers are equivalent to ==, !=, <=, <, >=, > respectively, with support for choosing the axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq

Compare DeferredDataFrames for equality elementwise.

DeferredDataFrame.ne

Compare DeferredDataFrames for inequality elementwise.

DeferredDataFrame.le

Compare DeferredDataFrames for less than inequality or equality elementwise.

DeferredDataFrame.lt

Compare DeferredDataFrames for strictly less than inequality elementwise.

DeferredDataFrame.ge

Compare DeferredDataFrames for greater than inequality or equality elementwise.

DeferredDataFrame.gt

Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
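
One common use of le is building a boolean mask. The sketch below is plain pandas (not the deferred Beam API); the per-column limits are invented, and combining the mask with all(axis=1) is just one possible way to use the result.

```python
import pandas as pd

df = pd.DataFrame({'cost': [250, 150, 100],
                   'revenue': [100, 250, 300]},
                  index=['A', 'B', 'C'])

# A list is compared column by column: cost <= 150, revenue <= 250.
mask = df.le([150, 250])

# Keep only the rows where every column is within its limit.
within_limits = df[mask.all(axis=1)]

print(mask['cost'].tolist())      # [False, True, True]
print(mask['revenue'].tolist())   # [True, True, False]
print(list(within_limits.index))  # ['B']
```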

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
lt(**kwargs)

Compare dataframe and other element-wise for strictly less than (binary operator lt).

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around the comparison operators.

The wrappers are equivalent to ==, !=, <=, <, >=, > respectively, with support for choosing the axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq

Compare DeferredDataFrames for equality elementwise.

DeferredDataFrame.ne

Compare DeferredDataFrames for inequality elementwise.

DeferredDataFrame.le

Compare DeferredDataFrames for less than inequality or equality elementwise.

DeferredDataFrame.lt

Compare DeferredDataFrames for strictly less than inequality elementwise.

DeferredDataFrame.ge

Compare DeferredDataFrames for greater than inequality or equality elementwise.

DeferredDataFrame.gt

Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
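
A minimal sketch of the index union for lt, in plain pandas rather than the deferred Beam API. It also shows one place where, in pandas, the method and the bare operator differ: the < operator does not align differently-labeled frames and raises ValueError instead.

```python
import pandas as pd

df = pd.DataFrame({'cost': [250, 150, 100],
                   'revenue': [100, 250, 300]},
                  index=['A', 'B', 'C'])
other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
                     index=['A', 'B', 'C', 'D'])

# The method aligns on the union of labels; cells with no partner
# (all of 'cost', and row 'D') compare as False.
below = df.lt(other)

# The bare operator refuses frames whose labels differ.
try:
    df < other
    operator_raised = False
except ValueError:
    operator_raised = True

print(list(below.index))          # ['A', 'B', 'C', 'D']
print(below['cost'].tolist())     # [False, False, False, False]
print(below['revenue'].tolist())  # [True, False, False, False]
print(operator_raised)            # True
```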

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
mod(**kwargs)

Get modulo of dataframe and other, element-wise (binary operator mod).

Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs. The reverse version is rmod.

One of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.
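
The shared examples below do not call mod itself, so here is a minimal sketch of mod and its reverse version rmod. It is plain pandas (not the deferred Beam API) and the one-column frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 7, -7]})

# Same as df % 4. The result takes the sign of the divisor, as in
# Python's own % operator, so -7 % 4 is 1.
remainder = df.mod(4)

# The reverse version swaps the operands: 20 % df.
reverse = df.rmod(20)

print(remainder['a'].tolist())  # [1, 3, 1]
print(reverse['a'].tolist())    # [0, 6, -1]
```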

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results as add.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
mul(**kwargs)

Get multiplication of dataframe and other, element-wise (binary operator mul).

Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. The reverse version is rmul.

One of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.
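
The fill_value rule described under Parameters (a cell missing on one side is filled, a cell missing on both sides stays missing) can be sketched in plain pandas as follows. The frames a and b are made up for illustration, and this is not the deferred Beam API.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan, np.nan]})
b = pd.DataFrame({'x': [2.0, 3.0, np.nan]})

# row 0: 1 * 2
# row 1: left side is missing, so it is filled: 10 * 3
# row 2: missing on both sides, so the result stays NaN
product = a.mul(b, fill_value=10)

print(product['x'].tolist())  # [2.0, 30.0, nan]
```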

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results as add.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
multiply(**kwargs)

Get multiplication of dataframe and other, element-wise (binary operator mul).

Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. The reverse version is rmul.

One of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
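
The fill_value rule described above can be sketched in plain pandas (this is not the deferred Beam API, and the frame names left and right are illustrative assumptions). A value missing on one side only is replaced before the multiplication, while a value missing on both sides stays missing.

```python
import numpy as np
import pandas as pd

# Illustrative frames (hypothetical names, not taken from the Beam docs).
left = pd.DataFrame({'a': [1.0, np.nan, np.nan]}, index=['x', 'y', 'z'])
right = pd.DataFrame({'a': [2.0, 3.0, np.nan]}, index=['x', 'y', 'z'])

# Without fill_value, a NaN on either side propagates to the result.
plain = left * right

# With fill_value=1, row 'y' (missing on one side only) becomes 1 * 3.0,
# while row 'z' (missing on both sides) stays NaN.
filled = left.mul(right, fill_value=1)

print(plain)
print(filled)
```

Choosing fill_value=1 keeps the multiplication neutral for the substituted entries; for addition, 0 would play the same role.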

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.
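
As a rough illustration of the note above, the following plain-pandas sketch (not the deferred Beam API; the frames left and right are illustrative assumptions) multiplies two frames whose indices only partly overlap. The result carries the union of both indices, with NaN wherever one side had no row.

```python
import pandas as pd

# Illustrative frames with partly overlapping indices (hypothetical data).
left = pd.DataFrame({'v': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'v': [10, 20]}, index=['b', 'c'])

# The result index is the union of both indices; only 'b' exists on both sides.
product = left.mul(right)

# fill_value substitutes for the rows that exist on one side only.
product_filled = left.mul(right, fill_value=1)

print(product)
print(product_filled)
```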

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same
results as the method version.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
ne(**kwargs)

Get the element-wise "not equal to" comparison of dataframe and other (binary operator ne).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq

Compare DeferredDataFrames for equality elementwise.

DeferredDataFrame.ne

Compare DeferredDataFrames for inequality elementwise.

DeferredDataFrame.le

Compare DeferredDataFrames for less than inequality or equality elementwise.

DeferredDataFrame.lt

Compare DeferredDataFrames for strictly less than inequality elementwise.

DeferredDataFrame.ge

Compare DeferredDataFrames for greater than inequality or equality elementwise.

DeferredDataFrame.gt

Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
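
The NaN behavior in the note above can be seen in a small plain-pandas sketch (not the deferred Beam API; the frame df is an illustrative assumption). Comparing a frame with itself reports a difference exactly where the value is NaN, so isna is the usual way to detect missing values.

```python
import numpy as np
import pandas as pd

# Illustrative frame containing one missing value (hypothetical data).
df = pd.DataFrame({'x': [1.0, np.nan]})

# NaN is never equal to NaN, so ne() is True at the missing position.
not_equal = df.ne(df)

# isna() is the reliable way to locate missing values.
missing = df.isna()

print(not_equal)
print(missing)
```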

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When `other` is a Series, the columns of a DataFrame are aligned
with the index of `other` and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must
match the number of elements in `other`:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
pivot_table(**kwargs)

pandas.DataFrame.pivot_table() is not implemented yet in the Beam DataFrame API.

If support for ‘pivot_table’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

pow(**kwargs)

Get the element-wise exponential power of dataframe and other (binary operator pow).

Equivalent to dataframe ** other, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is rpow.

Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
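
The examples further down never call pow itself, so here is a minimal plain-pandas sketch (not the deferred Beam API; the frame df and its column name are illustrative assumptions) of how pow relates to the ** operator and to its reverse version rpow.

```python
import pandas as pd

# Illustrative frame (hypothetical data).
df = pd.DataFrame({'base': [1, 2, 3]})

# pow raises each element to the given power: same as df ** 2.
squared = df.pow(2)

# rpow swaps the operands: same as 2 ** df.
powers_of_two = df.rpow(2)

print(squared)
print(powers_of_two)
```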

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same
results as the method version.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
radd(**kwargs)

Get the element-wise addition of dataframe and other (binary operator radd).

Equivalent to other + dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is add.

Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
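
A minimal plain-pandas sketch (not the deferred Beam API; the frame df is an illustrative assumption) of the operand order that radd implies. For numeric data the result matches add, because addition is commutative; the reversed form matters mainly for consistency with the other reverse operators.

```python
import pandas as pd

# Illustrative frame (hypothetical data).
df = pd.DataFrame({'v': [1, 2, 3]})

# radd computes other + dataframe, here 10 + df.
reversed_sum = df.radd(10)

# add computes dataframe + other, here df + 10.
forward_sum = df.add(10)

print(reversed_sum)
```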

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same
results as the method version.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rank(**kwargs)

pandas.DataFrame.rank() is not implemented yet in the Beam DataFrame API.

If support for ‘rank’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

rdiv(**kwargs)

Get the element-wise floating division of dataframe and other (binary operator rtruediv).

Equivalent to other / dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is truediv.

Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
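
A minimal plain-pandas sketch (not the deferred Beam API; the frame df is an illustrative assumption) contrasting rdiv with div. The only difference is which operand ends up as the numerator.

```python
import pandas as pd

# Illustrative frame (hypothetical data).
df = pd.DataFrame({'v': [1, 2, 4]})

# rdiv computes other / dataframe, here 8 / df.
reversed_quotient = df.rdiv(8)

# div computes dataframe / other, here df / 8.
forward_quotient = df.div(8)

print(reversed_quotient)
print(forward_quotient)
```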

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same
results as the method version.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
reindex_like(**kwargs)

pandas.DataFrame.reindex_like() is not implemented yet in the Beam DataFrame API.

If support for ‘reindex_like’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

rfloordiv(**kwargs)

Get the element-wise integer division of dataframe and other (binary operator rfloordiv).

Equivalent to other // dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is floordiv.

Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
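
The examples further down do not exercise rfloordiv directly, so here is a minimal plain-pandas sketch (not the deferred Beam API; the frame df is an illustrative assumption) showing how it differs from floordiv.

```python
import pandas as pd

# Illustrative frame (hypothetical data).
df = pd.DataFrame({'v': [2, 3]})

# rfloordiv computes other // dataframe, here 7 // df.
reversed_floor = df.rfloordiv(7)

# floordiv computes dataframe // other, here df // 7.
forward_floor = df.floordiv(7)

print(reversed_floor)
print(forward_floor)
```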

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported; see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same
results as the method version.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rmod(**kwargs)

Get the element-wise modulo of dataframe and other (binary operator rmod).

Equivalent to other % dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is mod.

Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
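
The examples further down do not exercise rmod directly, so here is a minimal plain-pandas sketch (not the deferred Beam API; the frame df is an illustrative assumption) showing how it differs from mod.

```python
import pandas as pd

# Illustrative frame (hypothetical data).
df = pd.DataFrame({'v': [2, 3, 4]})

# rmod computes other % dataframe, here 7 % df.
reversed_remainder = df.rmod(7)

# mod computes dataframe % other, here df % 7.
forward_remainder = df.mod(7)

print(reversed_remainder)
print(forward_remainder)
```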

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.
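A small pandas sketch of what "unioned together" means in practice. The two frames below are hypothetical, and plain pandas stands in for the deferred Beam objects: labels present in only one input still appear in the result, with missing values where no pair exists.

```python
import pandas as pd

# Two hypothetical frames whose indices only partly overlap.
left = pd.DataFrame({'v': [10, 20]}, index=['a', 'b'])
right = pd.DataFrame({'v': [3, 7]}, index=['b', 'c'])

# left.rmod(right) computes right % left on the union of both indices.
result = left.rmod(right)
print(result)
```

Only label 'b' exists on both sides, so it is the only row with a computed value; passing fill_value would instead substitute a value for the side that is missing.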

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rmul(**kwargs)

Get the multiplication of dataframe and other, element-wise (binary operator rmul).

Equivalent to other * dataframe, but with support for substituting a fill_value for missing data in one of the inputs. This is the reverse version of mul.

One of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
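The axis parameter decides which labels a Series-like other is matched against. The following is a hedged sketch in plain pandas (not the deferred API); the Series names by_row and by_col are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

# One factor per row label, matched against the index.
by_row = pd.Series([1, 2, 3], index=['circle', 'triangle', 'rectangle'])
# One factor per column label, matched against the columns.
by_col = pd.Series([10, 100], index=['angles', 'degrees'])

rows = df.rmul(by_row, axis='index')
cols = df.rmul(by_col, axis='columns')

print(rows)
print(cols)
```

Because multiplication is commutative, rmul and mul give the same values here; the example is only meant to show how axis selects the labels to align on.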

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rpow(**kwargs)

Get the exponential power of dataframe and other, element-wise (binary operator rpow).

Equivalent to other ** dataframe, but with support for substituting a fill_value for missing data in one of the inputs. This is the reverse version of pow.

One of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
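A sequence passed as other is matched against the columns by position, so each column can get its own base. This is a minimal pandas sketch with made-up data, not Beam-specific code, and whether a plain Python list is accepted by the deferred API is an assumption not confirmed here.

```python
import pandas as pd

# Hypothetical exponents; column 'a' gets base 2, column 'b' gets base 10.
df = pd.DataFrame({'a': [0, 1, 2], 'b': [1, 2, 3]})

result = df.rpow([2, 10])  # computes [2, 10] ** df, column by column
print(result)
```

The frame supplies the exponents and other supplies the bases, which is the opposite of df.pow([2, 10]).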

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rsub(**kwargs)

Get the subtraction of dataframe and other, element-wise (binary operator rsub).

Equivalent to other - dataframe, but with support for substituting a fill_value for missing data in one of the inputs. This is the reverse version of sub.

One of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
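As a rough illustration of fill_value with a reverse operation, the pandas sketch below subtracts the frame from an other that lacks one column. Plain pandas is used in place of the deferred API, and the data mirrors the shapes used in the examples further down.

```python
import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])
# 'other' has no 'degrees' column, so alignment introduces missing values.
other = pd.DataFrame({'angles': [0, 3, 4]},
                     index=['circle', 'triangle', 'rectangle'])

plain = df.rsub(other)                 # other - df; 'degrees' is all NaN
filled = df.rsub(other, fill_value=0)  # missing 'degrees' treated as 0

print(plain)
print(filled)
```

With fill_value=0 the missing side is replaced before the subtraction, so the 'degrees' column becomes 0 - degrees rather than NaN.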

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rtruediv(**kwargs)

Get the floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support for substituting a fill_value for missing data in one of the inputs. This is the reverse version of truediv.

One of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
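Because the frame ends up in the denominator, zeros in the data matter for rtruediv. The sketch below uses plain pandas (standing in for the deferred API) and the same sample frame as the examples further down.

```python
import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

result = df.rtruediv(10)  # computes 10 / df, always as float division
print(result)

# A zero in the frame yields inf rather than raising an error.
print(result.loc['circle', 'angles'])
```

In pandas, rdiv produces the same values as rtruediv, which is why the examples further down use df.rdiv(10).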

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
set_flags(**kwargs)

pandas.DataFrame.set_flags() is not implemented yet in the Beam DataFrame API.

If support for ‘set_flags’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

squeeze(**kwargs)

pandas.DataFrame.squeeze() is not implemented yet in the Beam DataFrame API.

If support for ‘squeeze’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

sub(**kwargs)

Get the subtraction of dataframe and other, element-wise (binary operator sub).

Equivalent to dataframe - other, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is rsub.

One of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
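The last sentence of the fill_value description is easy to miss: a value is substituted only when exactly one side is missing. A minimal pandas sketch (hypothetical single-column frames, plain pandas rather than the deferred API) shows all three cases:

```python
import numpy as np
import pandas as pd

# Row 0: only 'b' is missing; row 1: only 'a' is missing; row 2: both are.
a = pd.DataFrame({'v': [1.0, np.nan, np.nan]})
b = pd.DataFrame({'v': [np.nan, 2.0, np.nan]})

plain = a.sub(b)                 # any missing input gives NaN
filled = a.sub(b, fill_value=0)  # one-sided gaps are treated as 0

print(plain)
print(filled)
```

Row 2 stays NaN even with fill_value=0, because both inputs are missing at that location.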

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
subtract(**kwargs)

Get the subtraction of dataframe and other, element-wise (binary operator sub).

Equivalent to dataframe - other, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is rsub.

One of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
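In pandas, subtract is documented with the same text as sub (both are described as "binary operator sub"). A short pandas sketch, using the sample frame from the examples further down and plain pandas rather than the deferred API, checks that the two spellings and the - operator agree:

```python
import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

# Subtract 1 from 'angles' and 2 from 'degrees', three equivalent ways.
via_subtract = df.subtract([1, 2], axis='columns')
via_sub = df.sub([1, 2], axis='columns')
via_operator = df - [1, 2]

print(via_subtract)
```

The method forms are needed only when axis, level, or fill_value has to be specified; otherwise the operator is equivalent.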

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.
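
The interaction between index unioning and fill_value can be modelled in plain Python. This is only a sketch of the rule described above, not the pandas or Beam implementation: the helper sub_with_fill is hypothetical, and each dict stands in for a single column keyed by index label.

```python
import math


def sub_with_fill(left, right, fill_value=None):
    """Model `left - right` for two {label: value} dicts (hypothetical helper)."""

    def missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    result = {}
    # Mismatched labels are unioned, as stated in the Notes.
    for label in sorted(set(left) | set(right)):
        a = left.get(label)
        b = right.get(label)
        if missing(a) and missing(b):
            # Missing on both sides: the result stays missing, even with fill_value.
            result[label] = math.nan
        elif fill_value is None and (missing(a) or missing(b)):
            # No fill_value: any missing operand propagates.
            result[label] = math.nan
        else:
            a = fill_value if missing(a) else a
            b = fill_value if missing(b) else b
            result[label] = a - b
    return result


left = {"x": 1.0, "y": 2.0, "w": math.nan}
right = {"y": 10.0, "z": 20.0, "w": math.nan}

plain = sub_with_fill(left, right)                 # only "y" has both operands
filled = sub_with_fill(left, right, fill_value=0)  # "x" and "z" are filled with 0
```

In this model only "y" has a value without fill_value. With fill_value=0, "x" and "z" are computed against 0, while "w" remains missing because it is missing in both inputs.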

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same
results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
to_clipboard(**kwargs)

pandas.DataFrame.to_clipboard() is not implemented yet in the Beam DataFrame API.

If support for ‘to_clipboard’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_csv(path, transform_label=None, *args, **kwargs)

Write object to a comma-separated values (csv) file.

Parameters:
  • path_or_buf (str, path object, file-like object, or None, default None) –

    String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string. If a non-binary file object is passed, it should be opened with newline=’’, disabling universal newlines. If a binary file object is passed, mode might need to contain a ‘b’.

    Changed in version 1.2.0: Support for binary file objects was introduced.

  • sep (str, default ',') – String of length 1. Field delimiter for the output file.

  • na_rep (str, default '') – Missing data representation.

  • float_format (str, Callable, default None) – Format string for floating point numbers. If a Callable is given, it takes precedence over other numeric formatting parameters, like decimal.

  • columns (sequence, optional) – Columns to write.

  • header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.

  • index (bool, default True) – Write row names (index).

  • index_label (str or sequence, or False, default None) – Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R.

  • mode ({'w', 'x', 'a'}, default 'w') –

    Forwarded to either open(mode=) or fsspec.open(mode=) to control the file opening. Typical values include:

    • ’w’, truncate the file first.

    • ’x’, exclusive creation, failing if the file already exists.

    • ’a’, append to the end of file if it exists.

  • encoding (str, optional) – A string representing the encoding to use in the output file, defaults to ‘utf-8’. encoding is not supported if path_or_buf is a non-binary file object.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    Added in version 1.5.0: Added support for .tar files.

    May be a dict with key ‘method’ as compression mode and other entries as additional compression options if compression mode is ‘zip’.

    Passing compression options as keys in dict is supported for compression modes ‘gzip’, ‘bz2’, ‘zstd’, and ‘zip’.

    Changed in version 1.2.0: Compression is supported for binary file objects.

    Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to gzip.open instead of gzip.GzipFile which prevented setting mtime.

  • quoting (optional constant from csv module) – Defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric.

  • quotechar (str, default '"') – String of length 1. Character used to quote fields.

  • lineterminator (str, optional) –

    The newline character or character sequence to use in the output file. Defaults to os.linesep, which depends on the OS in which this method is called (e.g. ‘\n’ for Linux, ‘\r\n’ for Windows).

    Changed in version 1.5.0: Previously was line_terminator, changed for consistency with read_csv and the standard library ‘csv’ module.

  • chunksize (int or None) – Rows to write at a time.

  • date_format (str, default None) – Format string for datetime objects.

  • doublequote (bool, default True) – Control quoting of quotechar inside a field.

  • escapechar (str, default None) – String of length 1. Character used to escape sep and quotechar when appropriate.

  • decimal (str, default '.') – Character recognized as decimal separator. E.g. use ‘,’ for European data.

  • errors (str, default 'strict') – Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    Added in version 1.2.0.
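
The remark under quoting (a float_format turns floats into strings, so csv.QUOTE_NONNUMERIC quotes them) can be seen with the standard csv module on its own. The sketch below does not call to_csv; it only shows how the csv writer treats a float versus a pre-formatted string.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")

# A real float is numeric, so QUOTE_NONNUMERIC leaves it unquoted.
writer.writerow(["raw", 3.14159])
# Once formatted with "%.2f" the value is a string, so it is quoted.
writer.writerow(["formatted", "%.2f" % 3.14159])

output = buf.getvalue()
```

The first row keeps 3.14159 unquoted; the second row writes "3.14" in quotes, which is what a downstream reader using QUOTE_NONNUMERIC would then parse as text rather than a number.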

Returns:

If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.

Return type:

None or str
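
Because mode is forwarded to open(), the three values listed for it behave like the corresponding built-in file modes. A minimal sketch using only the standard library (the file name out.csv is arbitrary and written to a temporary directory):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")

    # 'x': exclusive creation succeeds only while the file does not exist.
    with open(path, "x") as f:
        f.write("a,b\n")
    try:
        open(path, "x")
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    # 'a': append to the end of the existing file.
    with open(path, "a") as f:
        f.write("1,2\n")
    with open(path) as f:
        appended = f.read()

    # 'w': truncate the file before writing.
    with open(path, "w") as f:
        f.write("c,d\n")
    with open(path) as f:
        truncated = f.read()
```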

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_csv

Load a CSV file into a DeferredDataFrame.

to_excel

Write DeferredDataFrame to an Excel file.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'

Create 'out.zip' containing 'out.csv'

>>> compression_opts = dict(method='zip',
...                         archive_name='out.csv')  
>>> df.to_csv('out.zip', index=False,
...           compression=compression_opts)  

To write a csv file to a new folder or nested folder, you will first
need to create it using either pathlib or os:

>>> from pathlib import Path  
>>> filepath = Path('folder/subfolder/out.csv')  
>>> filepath.parent.mkdir(parents=True, exist_ok=True)  
>>> df.to_csv(filepath)  

>>> import os  
>>> os.makedirs('folder/subfolder', exist_ok=True)  
>>> df.to_csv('folder/subfolder/out.csv')  
to_excel(path, *args, **kwargs)

Write object to an Excel sheet.

To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an ExcelWriter object with a target file name, and specify a sheet in the file to write to.

Multiple sheets may be written to by specifying unique sheet_name. With all data written to the file it is necessary to save the changes. Note that creating an ExcelWriter object with a file name that already exists will result in the contents of the existing file being erased.

Parameters:
  • excel_writer (path-like, file-like, or ExcelWriter object) – File path or existing ExcelWriter.

  • sheet_name (str, default 'Sheet1') – Name of sheet which will contain DeferredDataFrame.

  • na_rep (str, default '') – Missing data representation.

  • float_format (str, optional) – Format string for floating point numbers. For example float_format="%.2f" will format 0.1234 to 0.12.

  • columns (sequence or list of str, optional) – Columns to write.

  • header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.

  • index (bool, default True) – Write row names (index).

  • index_label (str or sequence, optional) – Column label for index column(s) if desired. If not specified, and header and index are True, then the index names are used. A sequence should be given if the DeferredDataFrame uses MultiIndex.

  • startrow (int, default 0) – Upper left cell row to dump data frame.

  • startcol (int, default 0) – Upper left cell column to dump data frame.

  • engine (str, optional) – Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set this via the options io.excel.xlsx.writer or io.excel.xlsm.writer.

  • merge_cells (bool, default True) – Write MultiIndex and Hierarchical Rows as merged cells.

  • inf_rep (str, default 'inf') – Representation for infinity (there is no native representation for infinity in Excel).

  • freeze_panes (tuple of int (length 2), optional) – Specifies the one-based bottommost row and rightmost column that is to be frozen.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    Added in version 1.2.0.

  • engine_kwargs (dict, optional) – Arbitrary keyword arguments passed to excel engine.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

to_csv

Write DeferredDataFrame to a comma-separated values (csv) file.

ExcelWriter

Class for writing DeferredDataFrame objects into excel sheets.

read_excel

Read an Excel file into a DeferredDataFrame.

read_csv

Read a comma-separated values (csv) file into DeferredDataFrame.

io.formats.style.Styler.to_excel

Add styles to Excel sheet.

Notes

For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing.

Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Create, write to and save a workbook:

>>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                    index=['row 1', 'row 2'],
...                    columns=['col 1', 'col 2'])
>>> df1.to_excel("output.xlsx")  

To specify the sheet name:

>>> df1.to_excel("output.xlsx",
...              sheet_name='Sheet_name_1')  

If you wish to write to more than one sheet in the workbook, it is
necessary to specify an ExcelWriter object:

>>> df2 = df1.copy()
>>> with pd.ExcelWriter('output.xlsx') as writer:  
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')

ExcelWriter can also be used to append to an existing Excel file:

>>> with pd.ExcelWriter('output.xlsx',
...                     mode='a') as writer:  
...     df1.to_excel(writer, sheet_name='Sheet_name_3')

To set the library that is used to write the Excel file,
you can pass the `engine` keyword (the default engine is
automatically chosen depending on the file extension):

>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')  
to_feather(path, *args, **kwargs)

Write a DataFrame to the binary Feather format.

Parameters:
  • path (str, path object, file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If a string or a path, it will be used as Root Directory path when writing a partitioned dataset.

  • **kwargs – Additional keywords passed to pyarrow.feather.write_feather(). This includes the compression, compression_level, chunksize and version keywords.

Differences from pandas

This operation has no known divergences from the pandas API.

Notes

This function writes the dataframe as a feather file. Requires a default index. For saving the DeferredDataFrame with your custom index use a method that supports custom indices e.g. to_parquet.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
>>> df.to_feather("file.feather")  
to_gbq(**kwargs)

pandas.DataFrame.to_gbq() is not implemented yet in the Beam DataFrame API.

If support for ‘to_gbq’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_hdf(**kwargs)

pandas.DataFrame.to_hdf() is not yet supported in the Beam DataFrame API because HDF5 is a random access file format.

to_html(path, *args, **kwargs)

Render a DataFrame as an HTML table.

Parameters:
  • buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.

  • columns (array-like, optional, default None) – The subset of columns to write. Writes all columns by default.

  • col_space (str or int, list or dict of int or str, optional) – The minimum width of each column in CSS length units. An int is assumed to be px units.

  • header (bool, optional) – Whether to print column labels, default True.

  • index (bool, optional, default True) – Whether to print index (row) labels.

  • na_rep (str, optional, default 'NaN') – String representation of NaN to use.

  • formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

  • float_format (one-parameter function, optional, default None) –

    Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.

    Changed in version 1.2.0.

  • sparsify (bool, optional, default True) – Set to False for a DeferredDataFrame with a hierarchical index to print every multiindex key at each row.

  • index_names (bool, optional, default True) – Prints the names of the indexes.

  • justify (str, default None) –

    How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are

    • left

    • right

    • center

    • justify

    • justify-all

    • start

    • end

    • inherit

    • match-parent

    • initial

    • unset.

  • max_rows (int, optional) – Maximum number of rows to display in the console.

  • max_cols (int, optional) – Maximum number of columns to display in the console.

  • show_dimensions (bool, default False) – Display DeferredDataFrame dimensions (number of rows by number of columns).

  • decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.

  • bold_rows (bool, default True) – Make the row labels bold in the output.

  • classes (str or list or tuple, default None) – CSS class(es) to apply to the resulting html table.

  • escape (bool, default True) – Convert the characters <, >, and & to HTML-safe sequences.

  • notebook ({True, False}, default False) – Whether the generated HTML is for IPython Notebook.

  • border (int) – A border=border attribute is included in the opening <table> tag. Default pd.options.display.html.border.

  • table_id (str, optional) – A css id is included in the opening <table> tag if specified.

  • render_links (bool, default False) – Convert URLs to HTML links.

  • encoding (str, default "utf-8") – Set character encoding.
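
As a small illustration of the escape parameter, the following sketch uses plain pandas (not the deferred Beam API) and a made-up one-cell frame to compare the default output with escape=False.

```python
import pandas as pd

df = pd.DataFrame({"text": ["<b>x & y</b>"]})

escaped = df.to_html()          # escape=True is the default
raw = df.to_html(escape=False)  # cell content is emitted verbatim
```

With the default, the cell is rendered as &lt;b&gt;x &amp; y&lt;/b&gt;; with escape=False the markup is passed through unchanged, so it should only be used with trusted content.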

Returns:

If buf is None, returns the result as a string. Otherwise returns None.

Return type:

str or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

to_string

Convert DeferredDataFrame to a string.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [4, 3]})
>>> html_string = '''<table border="1" class="dataframe">
...   <thead>
...     <tr style="text-align: right;">
...       <th></th>
...       <th>col1</th>
...       <th>col2</th>
...     </tr>
...   </thead>
...   <tbody>
...     <tr>
...       <th>0</th>
...       <td>1</td>
...       <td>4</td>
...     </tr>
...     <tr>
...       <th>1</th>
...       <td>2</td>
...       <td>3</td>
...     </tr>
...   </tbody>
... </table>'''
>>> assert html_string == df.to_html()
to_json(path, orient=None, *args, **kwargs)

Convert the object to a JSON string.

Note that NaN and None values will be converted to null, and datetime objects will be converted to UNIX timestamps.

Parameters:
  • path_or_buf (str, path object, file-like object, or None, default None) – String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string.

  • orient (str) –

    Indication of expected JSON string format.

    • DeferredSeries:

      • default is ‘index’

      • allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}.

    • DeferredDataFrame:

      • default is ‘columns’

      • allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}.

    • The format of the JSON string:

      • ’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

      • ’records’ : list like [{column -> value}, … , {column -> value}]

      • ’index’ : dict like {index -> {column -> value}}

      • ’columns’ : dict like {column -> {index -> value}}

      • ’values’ : just the values array

      • ’table’ : dict like {‘schema’: {schema}, ‘data’: {data}}

      Describing the data, where data component is like orient='records'.

  • date_format ({None, 'epoch', 'iso'}) – Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.

  • double_precision (int, default 10) – The number of decimal places to use when encoding floating point values. The possible maximal value is 15. Passing double_precision greater than 15 will raise a ValueError.

  • force_ascii (bool, default True) – Force encoded string to be ASCII.

  • date_unit (str, default 'ms' (milliseconds)) – The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

  • default_handler (callable, default None) – Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

  • lines (bool, default False) – If ‘orient’ is ‘records’ write out line-delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list-like.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    Added in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • index (bool or None, default None) – The index is only used when ‘orient’ is ‘split’, ‘index’, ‘columns’, or ‘table’. Of these, ‘index’ and ‘columns’ do not support index=False.

  • indent (int, optional) – Length of whitespace used to indent each record.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    Added in version 1.2.0.

  • mode (str, default 'w' (writing)) – Specify the IO mode for output when supplying a path_or_buf. Accepted args are ‘w’ (writing) and ‘a’ (append) only. mode=’a’ is only supported when lines is True and orient is ‘records’.
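
The compression entry mentions pinning mtime to obtain a reproducible gzip archive. The sketch below shows why that works using gzip.GzipFile directly, the class the dict options are documented as being forwarded to. The helper gzip_bytes and the sample payload are illustrative only.

```python
import gzip
import io


def gzip_bytes(payload, mtime):
    """Compress payload in memory with a fixed header timestamp."""
    buf = io.BytesIO()
    # mtime is stored in the gzip header; fixing it makes the bytes repeatable.
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=1, mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


payload = b'{"col 1":"a","col 2":"b"}\n' * 100

first = gzip_bytes(payload, mtime=1)
second = gzip_bytes(payload, mtime=1)
other = gzip_bytes(payload, mtime=2)
```

Two runs with the same mtime produce identical bytes, while changing mtime changes the header even though the decompressed content is the same.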

Returns:

If path_or_buf is None, returns the resulting json format as a string. Otherwise returns None.

Return type:

None or str
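
The lines parameter is easiest to understand from its output. This sketch uses plain pandas (not the deferred Beam API) with the same small frame as the examples below; it writes line-delimited JSON and shows that combining lines=True with another orient raises ValueError, as described for lines.

```python
import json

import pandas as pd

df = pd.DataFrame([["a", "b"], ["c", "d"]], columns=["col 1", "col 2"])

# One JSON object per line; only valid with orient="records".
text = df.to_json(orient="records", lines=True)
records = [json.loads(line) for line in text.splitlines()]

try:
    df.to_json(orient="split", lines=True)
    lines_error = None
except ValueError as exc:
    lines_error = exc
```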

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_json

Convert a JSON string to pandas object.

Notes

The behavior of indent=0 varies from the stdlib, which does not indent the output but does insert newlines. Currently, indent=0 and the default indent=None are equivalent in pandas, though this may change in a future release.
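
For reference, this is the standard-library behaviour that the note contrasts with: json.dumps with indent=0 inserts newlines but no indentation, whereas the default produces a single line.

```python
import json

record = {"col 1": "a"}

compact = json.dumps(record)                # single line
zero_indent = json.dumps(record, indent=0)  # newlines, zero-width indentation
```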

orient='table' contains a ‘pandas_version’ field under ‘schema’. This stores the version of pandas used in the latest revision of the schema.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> from json import loads, dumps
>>> df = pd.DataFrame(
...     [["a", "b"], ["c", "d"]],
...     index=["row 1", "row 2"],
...     columns=["col 1", "col 2"],
... )

>>> result = df.to_json(orient="split")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}

Encoding/decoding a DataFrame using 'records' formatted JSON.
Note that index labels are not preserved with this encoding.

>>> result = df.to_json(orient="records")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
[
    {
        "col 1": "a",
        "col 2": "b"
    },
    {
        "col 1": "c",
        "col 2": "d"
    }
]

Encoding/decoding a DataFrame using 'index' formatted JSON:

>>> result = df.to_json(orient="index")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "row 1": {
        "col 1": "a",
        "col 2": "b"
    },
    "row 2": {
        "col 1": "c",
        "col 2": "d"
    }
}

Encoding/decoding a DataFrame using 'columns' formatted JSON:

>>> result = df.to_json(orient="columns")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "col 1": {
        "row 1": "a",
        "row 2": "c"
    },
    "col 2": {
        "row 1": "b",
        "row 2": "d"
    }
}

Encoding/decoding a DataFrame using 'values' formatted JSON:

>>> result = df.to_json(orient="values")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
[
    [
        "a",
        "b"
    ],
    [
        "c",
        "d"
    ]
]

Encoding with Table Schema:

>>> result = df.to_json(orient="table")
>>> parsed = loads(result)
>>> dumps(parsed, indent=4)  
{
    "schema": {
        "fields": [
            {
                "name": "index",
                "type": "string"
            },
            {
                "name": "col 1",
                "type": "string"
            },
            {
                "name": "col 2",
                "type": "string"
            }
        ],
        "primaryKey": [
            "index"
        ],
        "pandas_version": "1.4.0"
    },
    "data": [
        {
            "index": "row 1",
            "col 1": "a",
            "col 2": "b"
        },
        {
            "index": "row 2",
            "col 1": "c",
            "col 2": "d"
        }
    ]
}
to_latex(**kwargs)

pandas.DataFrame.to_latex() is not implemented yet in the Beam DataFrame API.

If support for ‘to_latex’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_markdown(**kwargs)

pandas.DataFrame.to_markdown() is not implemented yet in the Beam DataFrame API.

If support for ‘to_markdown’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_msgpack(**kwargs)

pandas.DataFrame.to_msgpack() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.

to_orc(**kwargs)

pandas.DataFrame.to_orc() is not implemented yet in the Beam DataFrame API.

If support for ‘to_orc’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_parquet(path, *args, **kwargs)

Write a DataFrame to the binary parquet format.

This function writes the dataframe as a parquet file. You can choose different parquet backends, and have the option of compression. See the user guide for more details.

Parameters:
  • path (str, path object, file-like object, or None, default None) –

    String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If None, the result is returned as bytes. If a string or path, it will be used as Root Directory path when writing a partitioned dataset.

    Changed in version 1.2.0: Previously this was “fname”.

  • engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.

  • compression (str or None, default 'snappy') – Name of the compression to use. Use None for no compression. Supported options: ‘snappy’, ‘gzip’, ‘brotli’, ‘lz4’, ‘zstd’.

  • index (bool, default None) – If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, similar to True the dataframe’s index(es) will be saved. However, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.

  • partition_cols (list, optional, default None) – Column names by which to partition the dataset. Columns are partitioned in the order they are given. Must be None if path is not a string.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    Added in version 1.2.0.

  • **kwargs – Additional arguments passed to the parquet library. See pandas io for more details.

Return type:

bytes if no path argument is provided else None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_parquet

Read a parquet file.

DeferredDataFrame.to_orc

Write an orc file.

DeferredDataFrame.to_csv

Write a csv file.

DeferredDataFrame.to_sql

Write to a sql table.

DeferredDataFrame.to_hdf

Write to hdf.

Notes

This function requires either the fastparquet or pyarrow library.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_parquet('df.parquet.gzip',
...               compression='gzip')  
>>> pd.read_parquet('df.parquet.gzip')  
   col1  col2
0     1     3
1     2     4

If you want to get a buffer to the parquet content you can use an io.BytesIO
object, as long as you don't use partition_cols, which creates multiple files.

>>> import io
>>> f = io.BytesIO()
>>> df.to_parquet(f)
>>> f.seek(0)
0
>>> content = f.read()

to_period(**kwargs)

pandas.DataFrame.to_period() is not implemented yet in the Beam DataFrame API.

If support for ‘to_period’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_pickle(**kwargs)

pandas.DataFrame.to_pickle() is not implemented yet in the Beam DataFrame API.

If support for ‘to_pickle’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_sql(**kwargs)

pandas.DataFrame.to_sql() is not implemented yet in the Beam DataFrame API.

If support for ‘to_sql’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_stata(path, *args, **kwargs)

Export DataFrame object to Stata dta format.

Writes the DataFrame to a Stata dataset file. “dta” files contain a Stata dataset.

Parameters:
  • path (str, path object, or buffer) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function.

  • convert_dates (dict) – Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are ‘tc’, ‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to ‘tc’. Raises NotImplementedError if a datetime column has timezone information.

  • write_index (bool) – Write the index to Stata dataset.

  • byteorder (str) – Can be “>”, “<”, “little”, or “big”. The default is sys.byteorder.

  • time_stamp (datetime) – A datetime to use as file creation date. Default is the current time.

  • data_label (str, optional) – A label for the data set. Must be 80 characters or smaller.

  • variable_labels (dict) – Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller.

  • version ({114, 117, 118, 119, None}, default 114) –

    Version to use in the output dta file. Set to None to let pandas decide between 118 or 119 formats depending on the number of columns in the frame. pandas Version 114 can be read by Stata 10 and later. pandas Version 117 can be read by Stata 13 or later. pandas Version 118 is supported in Stata 14 and later. pandas Version 119 is supported in Stata 15 and later. pandas Version 114 limits string variables to 244 characters or fewer while versions 117 and later allow strings with lengths up to 2,000,000 characters. Versions 118 and 119 support Unicode characters, and pandas version 119 supports more than 32,767 variables.

    pandas Version 119 should usually only be used when the number of variables exceeds the capacity of dta format 118. Exporting smaller datasets in format 119 may have unintended consequences, and, as of November 2020, Stata SE cannot read pandas version 119 files.

  • convert_strl (list, optional) – List of names of string columns to convert to Stata StrL format. Only available if version is 117. Storing strings in the StrL format can produce smaller dta files if strings have more than 8 characters and values are repeated.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    Added in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details. More examples of storage options are given in the pandas IO user guide.

    Added in version 1.2.0.

  • value_labels (dict of dicts) –

    Dictionary containing columns as keys and dictionaries of column value to labels as values. Labels for a single variable must be 32,000 characters or smaller.

    Added in version 1.4.0.

Raises:
  • NotImplementedError

    • If datetimes contain timezone information
    • Column dtype is not representable in Stata

  • ValueError

    • Columns listed in convert_dates are neither datetime64[ns] nor datetime.datetime
    • Column listed in convert_dates is not in DeferredDataFrame
    • Categorical label contains more than 32,000 characters

Differences from pandas

This operation has no known divergences from the pandas API.

See also

read_stata

Import Stata data files.

io.stata.StataWriter

Low-level writer for Stata data files.

io.stata.StataWriter117

Low-level writer for pandas version 117 files.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon',
...                               'parrot'],
...                    'speed': [350, 18, 361, 15]})
>>> df.to_stata('animals.dta')  
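As a further illustration, the following minimal sketch round-trips a small frame through an in-memory buffer instead of a file on disk. It uses plain pandas, not the deferred Beam API, and assumes a pandas version in which to_stata accepts a binary file-like object for path, as the parameter description above states. The variable names and the choice of write_index=False and version=118 are illustrative only.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "animal": ["falcon", "parrot", "falcon", "parrot"],
    "speed": [350, 18, 361, 15],
})

# Write the dta content into an in-memory binary buffer.
# write_index=False keeps the RangeIndex out of the dataset;
# version=118 selects a format with Unicode support.
buffer = io.BytesIO()
df.to_stata(buffer, write_index=False, version=118)

# Rewind and read the dataset back to check the round trip.
buffer.seek(0)
restored = pd.read_stata(buffer)

print(list(restored.columns))
print(restored["speed"].tolist())
```

The integer column may come back with a narrower integer dtype than it was written with, so the sketch compares values rather than dtypes.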
to_timestamp(**kwargs)

pandas.DataFrame.to_timestamp() is not implemented yet in the Beam DataFrame API.

If support for ‘to_timestamp’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

to_xml(**kwargs)

pandas.DataFrame.to_xml() is not implemented yet in the Beam DataFrame API.

If support for ‘to_xml’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on 20318.

truediv(**kwargs)

Get floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. The reverse version is rtruediv.

This method is one of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, DeferredSeries, dict or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DeferredDataFrame

Differences from pandas

Only level=None is supported.

See also

DeferredDataFrame.add

Add DeferredDataFrames.

DeferredDataFrame.sub

Subtract DeferredDataFrames.

DeferredDataFrame.mul

Multiply DeferredDataFrames.

DeferredDataFrame.div

Divide DeferredDataFrames (float division).

DeferredDataFrame.truediv

Divide DeferredDataFrames (float division).

DeferredDataFrame.floordiv

Divide DeferredDataFrames (integer division).

DeferredDataFrame.mod

Calculate modulo (remainder after division).

DeferredDataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.
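A minimal pandas sketch of this note (eager pandas, not the deferred Beam API; the frames and labels are made up for illustration): dividing two frames whose indices only partly overlap gives a result indexed by the union of both. Labels present on only one side yield NaN unless fill_value supplies a substitute.

```python
import pandas as pd

left = pd.DataFrame({"angles": [0, 3, 4]},
                    index=["circle", "triangle", "rectangle"])
right = pd.DataFrame({"angles": [2, 4]}, index=["triangle", "square"])

# Operator form: a label that exists on only one side produces NaN.
plain = left / right

# Method form: a value missing on exactly one side is replaced by
# fill_value before the division is carried out.
filled = left.truediv(right, fill_value=1)

print(sorted(plain.index))               # all four labels from both frames
print(plain.loc["triangle", "angles"])   # 1.5
print(filled.loc["square", "angles"])    # 0.25
```

Only "triangle" appears in both inputs, so it is the only row with a value in the operator result. With fill_value=1, "square" becomes 1 / 4 and "rectangle" becomes 4 / 1.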

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result as the method.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, then use the reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, using the operator version for the list.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720

>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0