apache_beam.dataframe.frames module

Analogs for pandas.DataFrame and pandas.Series: DeferredDataFrame and DeferredSeries.

These classes are effectively wrappers around a schema-aware PCollection that provide a set of operations compatible with the pandas API.

Note that we aim for the Beam DataFrame API to be completely compatible with the pandas API, but there are some features that are currently unimplemented for various reasons. Pay particular attention to the ‘Differences from pandas’ section for each operation to understand where we diverge.

class apache_beam.dataframe.frames.DeferredSeries(expr)

Bases: apache_beam.dataframe.frames.DeferredDataFrameOrSeries

name
dtype
dtypes
keys()
append(to_append, ignore_index, verify_integrity, **kwargs)
align(other, join, axis, level, method, **kwargs)

Align two objects on their axes with the specified join method.

Join method is specified for each axis Index.

Parameters:
  • other (DeferredDataFrame or DeferredSeries) –
  • join ({'outer', 'inner', 'left', 'right'}, default 'outer') –
  • axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).
  • level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.
  • copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
  • fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
  • method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –

    Method to use for filling holes in reindexed DeferredSeries:

    • pad / ffill: propagate last valid observation forward to next valid.
    • backfill / bfill: use NEXT valid observation to fill gap.
  • limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
  • fill_axis ({0 or 'index'}, default 0) – Filling axis, method and limit.
  • broadcast_axis ({0 or 'index'}, default None) – Broadcast values along this axis, if aligning two objects of different dimensions.
Returns:

(left, right) – Aligned objects.

Return type:

(DeferredSeries, type of other)

Differences from pandas

Aligning per-level is not yet supported. Only the default, level=None, is allowed.

Filling NaN values via method is not supported, because it is sensitive to the order of the data (see https://s.apache.org/dataframe-order-sensitive-operations). Only the default, method=None, is allowed.
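
As a rough illustration of what join='outer' combined with fill_value means, the plain-Python sketch below aligns two label-to-value dicts on the union of their labels. The helper name align_outer and the use of dicts in place of DeferredSeries are assumptions made for illustration only; this is not how Beam or pandas implements align.

```python
import math


def align_outer(left, right, fill_value=math.nan):
    """Align two label->value mappings on the union of their labels.

    Hypothetical helper: dicts stand in for Series objects.
    """
    # join='outer' keeps every label that appears on either side.
    # Labels are sorted here only to make the result deterministic.
    labels = sorted(set(left) | set(right))
    # Labels missing from one side are filled with fill_value.
    aligned_left = {label: left.get(label, fill_value) for label in labels}
    aligned_right = {label: right.get(label, fill_value) for label in labels}
    return aligned_left, aligned_right


left = {"a": 1, "b": 2}
right = {"b": 20, "c": 30}
aligned_left, aligned_right = align_outer(left, right, fill_value=0)
print(aligned_left)   # {'a': 1, 'b': 2, 'c': 0}
print(aligned_right)  # {'a': 0, 'b': 20, 'c': 30}
```

Both results share the same labels afterwards, which is the property that element-wise operations on the aligned objects rely on.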

array

pandas.Series.array is not supported in the Beam DataFrame API because it produces an output type that is not deferred.

ravel(**kwargs)

pandas.Series.ravel is not supported in the Beam DataFrame API because it produces an output type that is not deferred.

rename(**kwargs)

Alter Series index labels or name.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

Alternatively, change Series.name with a scalar value.

See the user guide for more.

Parameters:
  • axis ({0 or "index"}) – Unused. Accepted for compatibility with DeferredDataFrame method only.
  • index (scalar, hashable sequence, dict-like or function, optional) – Functions or dict-like are transformations to apply to the index. Scalar or hashable sequence-like will alter the DeferredSeries.name attribute.
  • **kwargs – Additional keyword arguments passed to the function. Only the “inplace” keyword is used.
Returns:

DeferredSeries with index labels or name altered or None if inplace=True.

Return type:

DeferredSeries or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.rename()
Corresponding DeferredDataFrame method.
DeferredSeries.rename_axis()
Set the name of the axis.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.rename("my_name")  # scalar, changes Series.name
0    1
1    2
2    3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2)  # function, changes labels
0    1
1    2
4    3
dtype: int64
>>> s.rename({1: 3, 2: 5})  # mapping, changes labels
0    1
3    2
5    3
dtype: int64
between(**kwargs)

Return boolean Series equivalent to left <= series <= right.

This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.

Parameters:
  • left (scalar or list-like) – Left boundary.
  • right (scalar or list-like) – Right boundary.
  • inclusive (bool, default True) – Include boundaries.
Returns:

DeferredSeries representing whether each element is between left and right (inclusive).

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.gt()
Greater than of series and other.
DeferredSeries.lt()
Less than of series and other.

Notes

This function is equivalent to (left <= ser) & (ser <= right)
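
The equivalence above can be sketched in plain Python. The helper name between_inclusive and the use of a list in place of a Series are assumptions for illustration. Because every comparison against NaN evaluates to False, the sketch also shows why NA values come out as False.

```python
import math


def between_inclusive(values, left, right):
    """Element-wise (left <= x) & (x <= right) over a plain list."""
    # Any comparison involving NaN is False, so NaN entries map to False.
    return [(left <= x) and (x <= right) for x in values]


values = [2, 0, 4, 8, math.nan]
mask = between_inclusive(values, 1, 4)
print(mask)  # [True, False, True, False, False]
```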

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([2, 0, 4, 8, np.nan])

Boundary values are included by default:

>>> s.between(1, 4)
0     True
1    False
2     True
3    False
4    False
dtype: bool

With `inclusive` set to ``False`` boundary values are excluded:

>>> s.between(1, 4, inclusive=False)
0     True
1    False
2    False
3    False
4    False
dtype: bool

`left` and `right` can be any scalar value:

>>> s = pd.Series(['Alice', 'Bob', 'Carol', 'Eve'])
>>> s.between('Anna', 'Daniel')
0    False
1     True
2     True
3    False
dtype: bool
add_suffix(**kwargs)

Suffix labels with string suffix.

For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.

Parameters: suffix (str) – The string to add after each label.
Returns: New DeferredSeries or DeferredDataFrame with updated labels.
Return type: DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.add_prefix()
Prefix row labels with string prefix.
DeferredDataFrame.add_prefix()
Prefix column labels with string prefix.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_suffix('_item')
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_suffix('_col')
     A_col  B_col
0       1       3
1       2       4
2       3       5
3       4       6
add_prefix(**kwargs)

Prefix labels with string prefix.

For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.

Parameters: prefix (str) – The string to add before each label.
Returns: New DeferredSeries or DeferredDataFrame with updated labels.
Return type: DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.add_suffix()
Suffix row labels with string suffix.
DeferredDataFrame.add_suffix()
Suffix column labels with string suffix.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_prefix('item_')
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_prefix('col_')
     col_A  col_B
0       1       3
1       2       4
2       3       5
3       4       6
dot(other)
std(*args, **kwargs)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) –
  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
  • level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
  • numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
Returns:

Return type:

scalar or DeferredSeries (if level specified)

Differences from pandas

This operation has no known divergences from the pandas API.

Notes

To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
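
To make the role of ddof concrete, the sketch below computes the standard deviation by hand with the divisor N - ddof. The function name std_with_ddof and the sample values are assumptions chosen for illustration; it is a plain-Python sketch of the formula, not the Beam implementation.

```python
import math


def std_with_ddof(values, ddof=1):
    """Standard deviation using the divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sample_std = std_with_ddof(values)               # divisor N - 1, about 2.138
population_std = std_with_ddof(values, ddof=0)   # divisor N, exactly 2.0
print(sample_std, population_std)
```

With ddof=0 the divisor is N, which is the convention numpy.std uses by default.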

var(axis, skipna, level, ddof, **kwargs)

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) –
  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
  • level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
  • numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
Returns:

Return type:

scalar or DeferredSeries (if level specified)

Differences from pandas

Per-level aggregation is not yet supported (BEAM-11777). Only the default, level=None, is allowed.

Notes

To have the same behaviour as numpy.var, use ddof=0 (instead of the default ddof=1)

corr(other, method, min_periods)
cov(other, min_periods, ddof)
dropna(**kwargs)
isnull(**kwargs)

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns: Mask of bool values for each element in DeferredSeries that indicates whether an element is an NA value.
Return type: DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.isnull()
Alias of isna.
DeferredSeries.notna()
Boolean inverse of isna.
DeferredSeries.dropna()
Omit axes labels with missing values.
isna()
Top-level isna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()
0    False
1    False
2     True
dtype: bool
isna(**kwargs)

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns: Mask of bool values for each element in DeferredSeries that indicates whether an element is an NA value.
Return type: DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.isnull()
Alias of isna.
DeferredSeries.notna()
Boolean inverse of isna.
DeferredSeries.dropna()
Omit axes labels with missing values.
isna()
Top-level isna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()
0    False
1    False
2     True
dtype: bool
notnull(**kwargs)

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns: Mask of bool values for each element in DeferredSeries that indicates whether an element is not an NA value.
Return type: DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.notnull()
Alias of notna.
DeferredSeries.isna()
Boolean inverse of notna.
DeferredSeries.dropna()
Omit axes labels with missing values.
notna()
Top-level notna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()
0     True
1     True
2    False
dtype: bool
notna(**kwargs)

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns: Mask of bool values for each element in DeferredSeries that indicates whether an element is not an NA value.
Return type: DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.notnull()
Alias of notna.
DeferredSeries.isna()
Boolean inverse of notna.
DeferredSeries.dropna()
Omit axes labels with missing values.
notna()
Top-level notna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()
0     True
1     True
2    False
dtype: bool
items(**kwargs)

pandas.Series.items is not supported in the Beam DataFrame API because it produces an output type that is not deferred.

iteritems(**kwargs)

pandas.Series.iteritems is not supported in the Beam DataFrame API because it produces an output type that is not deferred.

tolist(**kwargs)

pandas.Series.tolist is not supported in the Beam DataFrame API because it produces an output type that is not deferred.

to_numpy(**kwargs)

pandas.Series.to_numpy is not supported in the Beam DataFrame API because it produces an output type that is not deferred.

to_string(**kwargs)

pandas.Series.to_string is not supported in the Beam DataFrame API because it produces an output type that is not deferred.

aggregate(func, axis=0, *args, **kwargs)
agg(func, axis=0, *args, **kwargs)
axes
clip(**kwargs)

Trim values at input threshold(s).

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.

Parameters:
  • lower (float or array_like, default None) – Minimum threshold value. All values below this threshold will be set to it.
  • upper (float or array_like, default None) – Maximum threshold value. All values above this threshold will be set to it.
  • axis (int or str axis name, optional) – Align object with lower and upper along the given axis.
  • inplace (bool, default False) – Whether to perform the operation in place on the data.
  • *args, **kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.
Returns:

Same type as calling object with the values outside the clip boundaries replaced or None if inplace=True.

Return type:

DeferredSeries or DeferredDataFrame or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.clip()
Trim values at input threshold in series.
DeferredDataFrame.clip()
Trim values at input threshold in dataframe.
numpy.clip()
Clip (limit) the values in an array.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])
>>> t
0    2
1   -4
2   -1
3    6
4    3
dtype: int64

>>> df.clip(t, t + 4, axis=0)
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3
all(*args, **kwargs)
any(*args, **kwargs)
count(*args, **kwargs)
min(*args, **kwargs)
max(*args, **kwargs)
prod(*args, **kwargs)
product(*args, **kwargs)
sum(*args, **kwargs)
mean(*args, **kwargs)
median(*args, **kwargs)
argmax(**kwargs)

pandas.Series.argmax is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

argmin(**kwargs)

pandas.Series.argmin is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

cummax(**kwargs)

pandas.Series.cummax is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

cummin(**kwargs)

pandas.Series.cummin is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

cumprod(**kwargs)

pandas.Series.cumprod is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

cumsum(**kwargs)

pandas.Series.cumsum is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.
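
The order sensitivity that rules out cumulative operations can be seen without Beam or pandas. The plain-Python sketch below, with made-up values, processes the same elements in two different orders: the total is identical, but the running totals are not. A Beam PCollection does not guarantee element order, so only the order-independent kind of result is well defined.

```python
from itertools import accumulate

# The same three elements, visited in two different orders.
rows_in_one_order = [1, 2, 3]
rows_in_another_order = [3, 1, 2]

# An aggregation such as sum does not depend on the order.
total_a = sum(rows_in_one_order)
total_b = sum(rows_in_another_order)

# A cumulative sum does: each output depends on what came before it.
running_a = list(accumulate(rows_in_one_order))      # [1, 3, 6]
running_b = list(accumulate(rows_in_another_order))  # [3, 4, 6]

print(total_a == total_b)      # True
print(running_a == running_b)  # False
```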

diff(**kwargs)

pandas.Series.diff is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

first(**kwargs)

pandas.Series.first is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

head(**kwargs)

pandas.Series.head is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

interpolate(**kwargs)

pandas.Series.interpolate is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

last(**kwargs)

pandas.Series.last is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

searchsorted(**kwargs)

pandas.Series.searchsorted is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

shift(**kwargs)

pandas.Series.shift is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

tail(**kwargs)

pandas.Series.tail is not supported in the Beam DataFrame API because it is sensitive to the order of the data.

For more information, see https://s.apache.org/dataframe-order-sensitive-operations.

filter(**kwargs)

Subset the dataframe rows or columns according to the specified index labels.

Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.

Parameters:
  • items (list-like) – Keep labels from axis which are in items.
  • like (str) – Keep labels from axis for which “like in label == True”.
  • regex (str (regular expression)) – Keep labels from axis for which re.search(regex, label) == True.
  • axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, ‘index’ for DeferredSeries, ‘columns’ for DeferredDataFrame.
Returns:

Return type:

same type as input object

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.loc()
Access a group of rows and columns by label(s) or a boolean array.

Notes

The items, like, and regex parameters are enforced to be mutually exclusive.

axis defaults to the info axis that is used when indexing with [].
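
The label-only nature of filter and the mutual exclusivity of items, like, and regex can be sketched in plain Python. The helper filter_labels is hypothetical and works on a list of labels instead of an index; the exception type raised is an assumption of this sketch, not a statement about what Beam or pandas raises.

```python
import re


def filter_labels(labels, items=None, like=None, regex=None):
    """Select labels by exactly one of items, like, or regex."""
    given = [arg for arg in (items, like, regex) if arg is not None]
    if len(given) != 1:
        # Assumed error type, for illustration only.
        raise TypeError("pass exactly one of items, like, or regex")
    if items is not None:
        # Keep the requested items that actually exist as labels.
        return [item for item in items if item in labels]
    if like is not None:
        # Substring match: "like in label".
        return [label for label in labels if like in label]
    pattern = re.compile(regex)
    return [label for label in labels if pattern.search(label)]


columns = ["one", "two", "three"]
rows = ["mouse", "rabbit"]
by_items = filter_labels(columns, items=["one", "three"])
by_regex = filter_labels(columns, regex="e$")
by_like = filter_labels(rows, like="bbi")
print(by_items, by_regex, by_like)
```

Only the labels are inspected; the values stored under those labels never influence the selection.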

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6

>>> # select columns by name
>>> df.filter(items=['one', 'three'])
         one  three
mouse     1      3
rabbit    4      6

>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
         one  three
mouse     1      3
rabbit    4      6

>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
         one  two  three
rabbit    4    5      6
memory_usage(**kwargs)

pandas.Series.memory_usage is not supported in the Beam DataFrame API because it produces an output type that is not deferred.

nlargest(keep, **kwargs)
nsmallest(keep, **kwargs)
is_unique
plot(**kwargs)

pandas.Series.plot is not supported in the Beam DataFrame API because it is a plotting tool.

pop(**kwargs)

pandas.Series.pop is not supported in the Beam DataFrame API because it produces an output type that is not deferred.

rename_axis(**kwargs)

Set the name of the axis for the index or columns.

Parameters:
  • mapper (scalar, list-like, optional) – Value to set the axis name attribute.
  • index, columns (scalar, list-like, dict-like or function, optional) –

    A scalar, list-like, dict-like or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a DeferredSeries; it applies only to DeferredDataFrame objects.

    Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.

    Changed in version 0.24.0.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename.
  • copy (bool, default True) – Also copy underlying data.
  • inplace (bool, default False) – Modifies the object directly, instead of creating a new DeferredSeries or DeferredDataFrame.
Returns:

The same type as the caller or None if inplace=True.

Return type:

DeferredSeries, DeferredDataFrame, or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.rename()
Alter DeferredSeries index labels or name.
DeferredDataFrame.rename()
Alter DeferredDataFrame index labels or name.
Index.rename()
Set new names on index.

Notes

DeferredDataFrame.rename_axis supports two calling conventions

  • (index=index_mapper, columns=columns_mapper, ...)
  • (mapper, axis={'index', 'columns'}, ...)

The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter copy is ignored.

The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.

We highly recommend using keyword arguments to clarify your intent.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

>>> s = pd.Series(["dog", "cat", "monkey"])
>>> s
0       dog
1       cat
2    monkey
dtype: object
>>> s.rename_axis("animal")
animal
0       dog
1       cat
2    monkey
dtype: object

**DataFrame**

>>> df = pd.DataFrame({"num_legs": [4, 4, 2],
...                    "num_arms": [0, 0, 2]},
...                   ["dog", "cat", "monkey"])
>>> df
        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("animal")
>>> df
        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("limbs", axis="columns")
>>> df
limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

**MultiIndex**

>>> df.index = pd.MultiIndex.from_product([['mammal'],
...                                        ['dog', 'cat', 'monkey']],
...                                       names=['type', 'name'])
>>> df
limbs          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(index={'type': 'class'})
limbs          num_legs  num_arms
class  name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(columns=str.upper)
LIMBS          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2
replace(to_replace, value, limit, method, **kwargs)
round(**kwargs)

Round each value in a Series to the given number of decimals.

Parameters:
  • decimals (int, default 0) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.
  • *args, **kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
Returns:

Rounded values of the DeferredSeries.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.around()
Round values of an np.array.
DeferredDataFrame.round()
Round values of a DeferredDataFrame.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([0.1, 1.3, 2.7])
>>> s.round()
0    0.0
1    1.0
2    3.0
dtype: float64
take(**kwargs)

pandas.Series.take is not supported in the Beam DataFrame API because it is deprecated in pandas.

to_dict(**kwargs)

pandas.Series.to_dict is not supported in the Beam DataFrame API because it produces an output type that is not deferred.

to_frame(**kwargs)

Convert Series to DataFrame.

Parameters: name (object, default None) – The passed name should substitute for the series name (if it has one).
Returns: DeferredDataFrame representation of DeferredSeries.
Return type: DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(["a", "b", "c"],
...               name="vals")
>>> s.to_frame()
  vals
0    a
1    b
2    c
unique(as_series=False)
update(other)
unstack(**kwargs)

pandas.Series.unstack is not supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.


values

pandas.Series.values is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


view(**kwargs)

pandas.Series.view is not supported in the Beam DataFrame API because it relies on memory-sharing semantics that are not compatible with the Beam model.

str
apply(**kwargs)

Invoke function on values of Series.

Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.

Parameters:
  • func (function) – Python function or NumPy ufunc to apply.
  • convert_dtype (bool, default True) – Try to find better dtype for elementwise function results. If False, leave as dtype=object.
  • args (tuple) – Positional arguments passed to func after the series value.
  • **kwds – Additional keyword arguments passed to func.
Returns:

If func returns a DeferredSeries object the result will be a DeferredDataFrame.

Return type:

DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.map()
For element-wise operations.
DeferredSeries.agg()
Only perform aggregating type operations.
DeferredSeries.transform()
Only perform transforming type operations.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Create a series with typical summer temperatures for each city.

>>> s = pd.Series([20, 21, 12],
...               index=['London', 'New York', 'Helsinki'])
>>> s
London      20
New York    21
Helsinki    12
dtype: int64

Square the values by defining a function and passing it as an
argument to ``apply()``.

>>> def square(x):
...     return x ** 2
>>> s.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64

Square the values by passing an anonymous function as an
argument to ``apply()``.

>>> s.apply(lambda x: x ** 2)
London      400
New York    441
Helsinki    144
dtype: int64

Define a custom function that needs additional positional
arguments and pass these additional arguments using the
``args`` keyword.

>>> def subtract_custom_value(x, custom_value):
...     return x - custom_value

>>> s.apply(subtract_custom_value, args=(5,))
London      15
New York    16
Helsinki     7
dtype: int64

Define a custom function that takes keyword arguments
and pass these arguments to ``apply``.

>>> def add_custom_values(x, **kwargs):
...     for month in kwargs:
...         x += kwargs[month]
...     return x

>>> s.apply(add_custom_values, june=30, july=20, august=25)
London      95
New York    96
Helsinki    87
dtype: int64

Use a function from the NumPy library.

>>> s.apply(np.log)
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64
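
The examples above call apply() with both a Python function and a NumPy ufunc. The pandas-only sketch below reuses the same city temperatures to show that, for these simple cases, the vectorized expression gives the same values as apply(); it is an illustration rather than a statement about how Beam executes the operation.

```python
import numpy as np
import pandas as pd

s = pd.Series([20, 21, 12], index=["London", "New York", "Helsinki"])


def square(x):
    # Called once per element when passed to apply().
    return x ** 2


via_apply = s.apply(square)
vectorized = s ** 2  # same values, computed on the whole Series at once

logs_apply = s.apply(np.log)
logs_direct = np.log(s)  # a ufunc can also be called on the Series directly

print(via_apply.tolist())
print(np.allclose(logs_apply, logs_direct))
```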
map(**kwargs)

Map values of Series according to input correspondence.

Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.

Parameters:
  • arg (function, collections.abc.Mapping subclass or DeferredSeries) – Mapping correspondence.
  • na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.
Returns:

Same index as caller.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.apply()
For applying more complex functions on a DeferredSeries.
DeferredDataFrame.apply()
Apply a function row-/column-wise.
DeferredDataFrame.applymap()
Apply a function elementwise on a whole DeferredDataFrame.

Notes

When arg is a dictionary, values in DeferredSeries that are not in the dictionary (as keys) are converted to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for default values), then this default is used rather than NaN.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0       cat
1       dog
2       NaN
3    rabbit
dtype: object

``map`` accepts a ``dict`` or a ``Series``. Values that are not found
in the ``dict`` are converted to ``NaN``, unless the dict has a default
value (e.g. ``defaultdict``):

>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0    kitten
1     puppy
2       NaN
3       NaN
dtype: object

It also accepts a function:

>>> s.map('I am a {}'.format)
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

To avoid applying the function to missing values (and keep them as
``NaN``) ``na_action='ignore'`` can be used:

>>> s.map('I am a {}'.format, na_action='ignore')
0       I am a cat
1       I am a dog
2              NaN
3    I am a rabbit
dtype: object
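
The Notes for map() mention dict subclasses that define __missing__. A minimal pandas-only sketch of that behaviour follows; the class name WithDefault and the fallback string "unknown" are hypothetical names chosen for this illustration.

```python
import pandas as pd


class WithDefault(dict):
    """Hypothetical dict subclass that supplies a default for missing keys."""

    def __missing__(self, key):
        return "unknown"


s = pd.Series(["cat", "dog", "rabbit"])

# A plain dict leaves unmatched values as NaN.
plain = s.map({"cat": "kitten"})

# A dict subclass with __missing__ supplies its default instead.
defaulted = s.map(WithDefault({"cat": "kitten"}))

print(plain.isna().tolist())
print(defaulted.tolist())
```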
T
abs(**kwargs)

Return a Series/DataFrame with absolute numeric value of each element.

This function only applies to elements that are all numeric.

Returns: DeferredSeries/DeferredDataFrame containing the absolute value of each element.
Return type: DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.absolute()
Calculate the absolute value element-wise.

Notes

For complex inputs of the form a + bj (for example 1.2 + 1j), the absolute value is \(\sqrt{ a^2 + b^2 }\).

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])
>>> s.abs()
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])
>>> s.abs()
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta('1 days')])
>>> s.abs()
0   1 days
dtype: timedelta64[ns]

Select rows with data closest to certain value using argsort (from
`StackOverflow <https://stackoverflow.com/a/17758115>`__).

>>> df = pd.DataFrame({
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df
     a    b    c
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
>>> df.loc[(df.c - 43).abs().argsort()]
     a    b    c
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
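
The complex-number formula in the Notes can be checked directly. This pandas-only sketch compares abs() against the square root of the sum of squares for the same value used in the example above.

```python
import math

import pandas as pd

z = 1.2 + 1j
s = pd.Series([z])

# abs() on a complex Series returns the modulus of each element.
modulus = s.abs().iloc[0]

# The formula from the Notes: sqrt(a^2 + b^2) for a + bj.
expected = math.sqrt(z.real ** 2 + z.imag ** 2)

print(modulus, expected)
```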
add(**kwargs)
argsort(**kwargs)
asfreq(**kwargs)
asof(**kwargs)
astype(**kwargs)

Cast a pandas object to a specified dtype.

Parameters:
  • dtype (data type, or dict of column name -> data type) – Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DeferredDataFrame’s columns to column-specific types.
  • copy (bool, default True) – Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
  • errors ({'raise', 'ignore'}, default 'raise') –

    Control raising of exceptions on invalid data for provided dtype.

    • raise : allow exceptions to be raised
    • ignore : suppress exceptions. On error return original object.
Returns:

casted

Return type:

same type as caller

Differences from pandas

This operation has no known divergences from the pandas API.

See also

to_datetime()
Convert argument to datetime.
to_timedelta()
Convert argument to timedelta.
to_numeric()
Convert argument to a numeric type.
numpy.ndarray.astype()
Cast a numpy array to a specified type.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Create a DataFrame:

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.dtypes
col1    int64
col2    int64
dtype: object

Cast all columns to int32:

>>> df.astype('int32').dtypes
col1    int32
col2    int32
dtype: object

Cast col1 to int32 using a dictionary:

>>> df.astype({'col1': 'int32'}).dtypes
col1    int32
col2    int64
dtype: object

Create a series:

>>> ser = pd.Series([1, 2], dtype='int32')
>>> ser
0    1
1    2
dtype: int32
>>> ser.astype('int64')
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> cat_dtype = pd.api.types.CategoricalDtype(
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using ``copy=False`` and changing data on a new
pandas object may propagate changes:

>>> s1 = pd.Series([1, 2])
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1  # note that s1[0] has changed too
0    10
1     2
dtype: int64

Create a series of dates:

>>> ser_date = pd.Series(pd.date_range('20200101', periods=3))
>>> ser_date
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[ns]

Datetimes are localized to UTC first before
converting to the specified timezone:

>>> ser_date.astype('datetime64[ns, US/Eastern]')
0   2019-12-31 19:00:00-05:00
1   2020-01-01 19:00:00-05:00
2   2020-01-02 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
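
None of the examples above exercises the errors parameter. The pandas-only sketch below shows the default errors='raise' behaviour with made-up string data; how a failed cast surfaces in a deferred Beam pipeline is not covered here.

```python
import pandas as pd

# Made-up data: the last entry cannot be parsed as an integer.
raw = pd.Series(["1", "2", "x"], dtype=object)

try:
    raw.astype("int64")  # errors='raise' is the default
    cast_failed = False
except ValueError:
    cast_failed = True

# With clean input, the same cast succeeds.
clean = pd.Series(["1", "2", "3"], dtype=object).astype("int64")

print(cast_failed)
print(clean.tolist())
```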
at
at_time(**kwargs)
attrs

pandas.DataFrame.attrs is not supported in the Beam DataFrame API because it is experimental in pandas.

autocorr(**kwargs)
backfill(**kwargs)
between_time(**kwargs)
bfill(**kwargs)
bool()
cat
combine(**kwargs)
combine_first(**kwargs)
compare(**kwargs)
convert_dtypes(**kwargs)
copy(**kwargs)

Make a copy of this object’s indices and data.

When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).

When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

Parameters: deep (bool, default True) – Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.
Returns: copy – Object type matches caller.
Return type: DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

Notes

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).

While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64

>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64

**Shallow copy versus default (deep) copy:**

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

Shallow copy shares data and index with original.

>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True

Deep copy has own copy of data and index.

>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False

Updates to the data shared by the shallow copy and the original are
reflected in both; the deep copy remains unchanged.

>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64

Note that when copying an object containing Python objects, a deep copy
will copy the data, but will not do so recursively. Updating a nested
data object will be reflected in the deep copy.

>>> s = pd.Series([[1, 2], [3, 4]])
>>> deep = s.copy()
>>> s[0][0] = 10
>>> s
0    [10, 2]
1     [3, 4]
dtype: object
>>> deep
0    [10, 2]
1     [3, 4]
dtype: object
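
If fully independent nested objects are needed, one possible approach (an assumption of this sketch, not something the pandas or Beam documentation prescribes) is to apply copy.deepcopy to each element. The pandas-only example below contrasts that with the default deep copy.

```python
import copy

import pandas as pd

s = pd.Series([[1, 2], [3, 4]])

# copy() duplicates the container, but the nested lists are shared references.
deep = s.copy()
shares_inner_list = deep.iloc[0] is s.iloc[0]

# Copying each element recursively yields fully independent nested lists.
independent = pd.Series([copy.deepcopy(v) for v in s], index=s.index)

# Mutate the nested list held by the original Series.
s.iloc[0].append(99)

print(shares_inner_list)
print(deep.iloc[0], independent.iloc[0])
```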
describe(**kwargs)
div(**kwargs)
divide(**kwargs)
divmod(**kwargs)
drop(labels, axis, index, columns, errors, **kwargs)
drop_duplicates(**kwargs)
droplevel(level, axis)
dt
duplicated(**kwargs)
empty
eq(**kwargs)

Return the element-wise equal-to comparison of series and other (binary operator eq).

Equivalent to series == other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.eq(b, fill_value=0)
a     True
b    False
c    False
d    False
e    False
dtype: bool
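
One practical difference between the method and the operator is label alignment. In the pandas versions this sketch was written against, == raises a ValueError for Series whose labels differ, while eq() aligns them first; treat the exception type as an observation about pandas rather than a documented Beam guarantee.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=["a", "b", "c", "d"])
b = pd.Series([1, np.nan, 1, np.nan], index=["a", "b", "d", "e"])

try:
    a == b  # labels differ, so the operator form refuses to compare
    operator_raised = False
except ValueError:
    operator_raised = True

# eq() aligns on the union of labels and fills gaps before comparing.
aligned = a.eq(b, fill_value=0)

print(operator_raised)
print(aligned.tolist())
```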
equals(other)
ewm(**kwargs)
expanding(**kwargs)
explode(**kwargs)
factorize(**kwargs)
ffill(**kwargs)
fillna(value, method, axis, limit, **kwargs)
first_valid_index(**kwargs)
flags
floordiv(**kwargs)
ge(**kwargs)

Return the element-wise greater-than-or-equal-to comparison of series and other (binary operator ge).

Equivalent to series >= other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.ge(b, fill_value=0)
a     True
b     True
c    False
d    False
e     True
f    False
dtype: bool
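
To see what fill_value changes, the pandas-only sketch below runs the same comparison with and without it. Only label 'e' differs: it is absent from b, so it compares as 1 >= NaN (False) without a fill and as 1 >= 0 (True) with fill_value=0.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan, 1], index=["a", "b", "c", "d", "e"])
b = pd.Series([0, 1, 2, np.nan, 1], index=["a", "b", "c", "d", "f"])

without_fill = a.ge(b)
with_fill = a.ge(b, fill_value=0)

# Labels where the two results disagree.
changed = with_fill.index[with_fill != without_fill].tolist()

print(without_fill.tolist())
print(with_fill.tolist())
print(changed)
```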
get(**kwargs)
groupby(by, level, axis, as_index, group_keys, **kwargs)
gt(**kwargs)

Return the element-wise greater-than comparison of series and other (binary operator gt).

Equivalent to series > other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.gt(b, fill_value=0)
a     True
b    False
c    False
d    False
e     True
f    False
dtype: bool
hasnans
hist(**kwargs)

pandas.DataFrame.hist is not supported in the Beam DataFrame API because it is a plotting tool.


iat
idxmax(**kwargs)
idxmin(**kwargs)
iloc
index
infer_objects(**kwargs)
is_monotonic
is_monotonic_decreasing
is_monotonic_increasing
isin(**kwargs)

Whether each element in the DataFrame is contained in values.

Parameters: values (iterable, DeferredSeries, DeferredDataFrame or dict) – The result will only be true at a location if all the labels match. If values is a DeferredSeries, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DeferredDataFrame, then both the index and column labels must match.
Returns: DeferredDataFrame of booleans showing whether each element in the DeferredDataFrame is contained in values.
Return type: DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq()
Equality test for DeferredDataFrame.
DeferredSeries.isin()
Equivalent method on DeferredSeries.
DeferredSeries.str.contains()
Test if pattern or regex is contained within a string of a DeferredSeries or Index.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
...                   index=['falcon', 'dog'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0

When ``values`` is a list check whether every value in the DataFrame
is present in the list (which animals have 0 or 2 legs or wings)

>>> df.isin([0, 2])
        num_legs  num_wings
falcon      True       True
dog        False       True

When ``values`` is a dict, we can pass values to check for each
column separately:

>>> df.isin({'num_wings': [0, 3]})
        num_legs  num_wings
falcon     False      False
dog        False       True

When ``values`` is a Series or DataFrame the index and columns must
match. Note that 'falcon' is the only label shared with ``other``, so
'dog' matches nowhere.

>>> other = pd.DataFrame({'num_legs': [8, 2], 'num_wings': [0, 2]},
...                      index=['spider', 'falcon'])
>>> df.isin(other)
        num_legs  num_wings
falcon      True       True
dog        False      False
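
The examples above use a DataFrame. For a Series, the check is a plain membership test per element, as in this pandas-only sketch with made-up animal names:

```python
import pandas as pd

# Made-up data for illustration.
s = pd.Series(["llama", "cow", "llama", "beetle"], name="animal")

# True where the element appears in the given list of values.
mask = s.isin(["cow", "llama"])

# The boolean mask can then be used to filter the Series.
selected = s[mask]

print(mask.tolist())
print(selected.tolist())
```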
item(**kwargs)
kurt(**kwargs)
kurtosis(**kwargs)
last_valid_index(**kwargs)
le(**kwargs)

Return the element-wise less-than-or-equal-to comparison of series and other (binary operator le).

Equivalent to series <= other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.le(b, fill_value=0)
a    False
b     True
c     True
d    False
e    False
f     True
dtype: bool
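
The fill_value description notes that a location missing in both inputs stays missing. The pandas-only sketch below illustrates the consequence for label 'd': both le() and gt() report False there, so the two results are not complements of each other.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan, 1], index=["a", "b", "c", "d", "e"])
b = pd.Series([0, 1, 2, np.nan, 1], index=["a", "b", "c", "d", "f"])

le_result = a.le(b, fill_value=0)
gt_result = a.gt(b, fill_value=0)

# 'd' is NaN in both inputs, so it is not filled and compares as False
# under both operators.
either = le_result | gt_result

print(either.tolist())
```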
loc
lt(**kwargs)

Return the element-wise less-than comparison of series and other (binary operator lt).

Equivalent to series < other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.lt(b, fill_value=0)
a    False
b    False
c     True
d    False
e    False
f     True
dtype: bool
mad(**kwargs)
mask(cond, **kwargs)
mod(**kwargs)
mode(**kwargs)
mul(**kwargs)
multiply(**kwargs)
nbytes
ndim
ne(**kwargs)

Return the element-wise not-equal-to comparison of series and other (binary operator ne).

Equivalent to series != other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (DeferredSeries or scalar value) –
  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:

The result of the operation.

Return type:

DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.ne(b, fill_value=0)
a    False
b     True
c     True
d     True
e     True
dtype: bool
nunique(**kwargs)
pad(**kwargs)
pct_change(**kwargs)
pipe(**kwargs)
pow(**kwargs)
quantile(**kwargs)
radd(**kwargs)
rank(**kwargs)
rdiv(**kwargs)
rdivmod(**kwargs)
reindex(**kwargs)
reindex_like(**kwargs)
reorder_levels(**kwargs)

Rearrange index levels using input order. May not drop or duplicate levels.

Parameters:
  • order (list of int or list of str) – List representing new level order. Reference level by number (position) or by key (label).
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Where to reorder levels.
Returns:

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.
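
No example accompanies this entry, so here is a minimal pandas-only sketch. The level names "letter" and "number" and the data are hypothetical, chosen only to make the reordering visible.

```python
import pandas as pd

# Hypothetical two-level index for illustration.
idx = pd.MultiIndex.from_tuples(
    [("x", 1), ("y", 2)], names=["letter", "number"]
)
s = pd.Series([10, 20], index=idx)

# Levels can be referenced by label (as here) or by position, e.g. [1, 0].
reordered = s.reorder_levels(["number", "letter"])

print(list(reordered.index.names))
print(reordered.index.tolist())
```

The values stay attached to the same rows; only the order of the index levels changes.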

repeat(**kwargs)
resample(**kwargs)
reset_index(**kwargs)
rfloordiv(**kwargs)
rmod(**kwargs)
rmul(**kwargs)
rolling(**kwargs)
rpow(**kwargs)
rsub(**kwargs)
rtruediv(**kwargs)
sample(**kwargs)
sem(**kwargs)
set_axis(**kwargs)
set_flags(**kwargs)
shape
size
skew(**kwargs)
slice_shift(**kwargs)
sort_index(axis, **kwargs)

Sort object by labels (along an axis).

Returns a new DataFrame sorted by label if inplace argument is False, otherwise updates the original DataFrame and returns None.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
  • level (int or level name or list of ints or list of level names) – If not None, sort on values in specified index level(s).
  • ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
  • inplace (bool, default False) – If True, perform operation in-place.
  • kind ({'quicksort', 'mergesort', 'heapsort'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more information. mergesort is the only stable algorithm. For DeferredDataFrames, this option is only applied when sorting on a single column or label.
  • na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end. Not implemented for MultiIndex.
  • sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    New in version 1.0.0.

  • key (callable, optional) –

    If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape. For MultiIndex inputs, the key is applied per level.

    New in version 1.1.0.

Returns:

The original DeferredDataFrame sorted by the labels or None if inplace=True.

Return type:

DeferredDataFrame or None

Differences from pandas

axis=index is not allowed because it imposes an ordering on the dataset, and we cannot guarantee it will be maintained (see https://s.apache.org/dataframe-order-sensitive-operations). Only axis=columns is allowed.

See also

DeferredSeries.sort_index()
Sort DeferredSeries by the index.
DeferredDataFrame.sort_values()
Sort DeferredDataFrame by the value.
DeferredSeries.sort_values()
Sort DeferredSeries by the value.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150],
...                   columns=['A'])
>>> df.sort_index()
     A
1    4
29   2
100  1
150  5
234  3

By default, it sorts in ascending order. To sort in descending order,
use ``ascending=False``.

>>> df.sort_index(ascending=False)
     A
234  3
150  5
100  1
29   2
1    4

A key function can be specified which is applied to the index before
sorting. For a ``MultiIndex`` this is applied to each level separately.

>>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd'])
>>> df.sort_index(key=lambda x: x.str.lower())
   a
A  1
b  2
C  3
d  4
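
In the key example above the index already happens to be in case-insensitive order, so the effect of key is hard to see. The pandas-only sketch below shuffles the labels to make the difference visible. Note that, per 'Differences from pandas', sorting along axis=index is not allowed in the Beam DataFrame API, so this illustrates pandas semantics only.

```python
import pandas as pd

# Same labels as the example above, in a different (made-up) order.
df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=["b", "A", "d", "C"])

# Default ordering compares the raw labels: uppercase sorts before lowercase.
default_order = df.sort_index().index.tolist()

# The key receives the whole Index and must return an Index of the same shape.
case_insensitive = df.sort_index(key=lambda idx: idx.str.lower()).index.tolist()

print(default_order)
print(case_insensitive)
```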
sort_values(axis, **kwargs)

sort_values is not implemented.

It is not implemented for axis=index because it imposes an ordering on the dataset, and we cannot guarantee it will be maintained (see https://s.apache.org/dataframe-order-sensitive-operations).

It is not implemented for axis=columns because it makes the order of the columns depend on the data (see https://s.apache.org/dataframe-non-deferred-column-names).

sparse
squeeze(**kwargs)
sub(**kwargs)
subtract(**kwargs)
swapaxes(**kwargs)
swaplevel(**kwargs)
to_clipboard(**kwargs)
to_csv(path, *args, **kwargs)
to_excel(path, *args, **kwargs)
to_feather(path, *args, **kwargs)
to_hdf(**kwargs)

pandas.DataFrame.to_hdf is not supported in the Beam DataFrame API because HDF5 is a random access file format.

to_html(path, *args, **kwargs)
to_json(path, orient=None, *args, **kwargs)
to_latex(**kwargs)
to_list(**kwargs)
to_markdown(**kwargs)
to_msgpack(**kwargs)

pandas.DataFrame.to_msgpack is not supported in the Beam DataFrame API because it is deprecated in pandas.

to_parquet(path, *args, **kwargs)
to_period(**kwargs)
to_pickle(**kwargs)
to_sql(**kwargs)
to_stata(path, *args, **kwargs)
to_timestamp(**kwargs)
to_xarray(**kwargs)
transform(**kwargs)
transpose(**kwargs)
truediv(**kwargs)
truncate(**kwargs)
tshift(**kwargs)
tz_convert(**kwargs)
tz_localize(ambiguous, **kwargs)
value_counts(**kwargs)
where(cond, other, errors, **kwargs)
classmethod wrap(expr, split_tuples=True)
xs(**kwargs)
class apache_beam.dataframe.frames.DeferredDataFrame(expr)[source]

Bases: apache_beam.dataframe.frames.DeferredDataFrameOrSeries

T
columns
keys()[source]
align(other, join, axis, copy, level, method, **kwargs)[source]
append(other, ignore_index, verify_integrity, sort, **kwargs)[source]
set_index(keys, **kwargs)[source]
loc
iloc
axes
dtypes
assign(**kwargs)[source]
explode(column, ignore_index)[source]
aggregate(func, axis=0, *args, **kwargs)[source]
agg(func, axis=0, *args, **kwargs)
applymap(**kwargs)

Apply a function to a DataFrame elementwise.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

Parameters:
  • func (callable) – Python function, returns a single value from a single value.
  • na_action ({None, 'ignore'}, default None) –

    If ‘ignore’, propagate NaN values, without passing them to func.

    New in version 1.2.

Returns:

Transformed DeferredDataFrame.

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.apply()
Apply a function along input axis of DeferredDataFrame.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])
>>> df
       0      1
0  1.000  2.120
1  3.356  4.567

>>> df.applymap(lambda x: len(str(x)))
   0  1
0  3  4
1  5  5

Like Series.map, NA values can be ignored:

>>> df_copy = df.copy()
>>> df_copy.iloc[0, 0] = pd.NA
>>> df_copy.applymap(lambda x: len(str(x)), na_action='ignore')
      0  1
0  <NA>  4
1     5  5

Note that a vectorized version of `func` often exists, which will
be much faster. You could square each number elementwise.

>>> df.applymap(lambda x: x**2)
           0          1
0   1.000000   4.494400
1  11.262736  20.857489

But it's better to avoid applymap in that case.

>>> df ** 2
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
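
Recent pandas releases expose the same elementwise behaviour under the name DataFrame.map, with applymap kept as the older spelling. The pandas-only sketch below picks whichever is available (the helper name elementwise is hypothetical) and checks the elementwise result against the vectorized expression; whether the Beam wrapper offers both names is not confirmed here.

```python
import pandas as pd

df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap

lengths = elementwise(lambda x: len(str(x)))

# Elementwise squaring and the vectorized form give the same values.
squared_slow = elementwise(lambda x: x ** 2)
squared_fast = df ** 2
max_difference = (squared_slow - squared_fast).abs().max().max()

print(lengths.values.tolist())
print(max_difference)
```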
add_prefix(**kwargs)

Prefix labels with string prefix.

For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.

Parameters: prefix (str) – The string to add before each label.
Returns: New DeferredSeries or DeferredDataFrame with updated labels.
Return type: DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.add_suffix()
Suffix row labels with string suffix.
DeferredDataFrame.add_suffix()
Suffix column labels with string suffix.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_prefix('item_')
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_prefix('col_')
     col_A  col_B
0       1       3
1       2       4
2       3       5
3       4       6
add_suffix(**kwargs)

Suffix labels with string suffix.

For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.

Parameters: suffix (str) – The string to add after each label.
Returns: New DeferredSeries or DeferredDataFrame with updated labels.
Return type: DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.add_prefix()
Prefix row labels with string prefix.
DeferredDataFrame.add_prefix()
Prefix column labels with string prefix.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_suffix('_item')
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_suffix('_col')
     A_col  B_col
0       1       3
1       2       4
2       3       5
3       4       6
memory_usage(**kwargs)

pandas.DataFrame.memory_usage is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


info(**kwargs)

pandas.DataFrame.info is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


clip(**kwargs)
corr(method, min_periods)[source]

Compute pairwise correlation of columns, excluding NA/null values.

Parameters:
  • method ({'pearson', 'kendall', 'spearman'} or callable) –

    Method of correlation:

    • pearson : standard correlation coefficient
    • kendall : Kendall Tau correlation coefficient
    • spearman : Spearman rank correlation
    • callable: callable with input two 1d ndarrays
      and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

      New in version 0.24.0.

  • min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.
Returns:

Correlation matrix.

Return type:

DeferredDataFrame

Differences from pandas

Only method="pearson" can be parallelized. Other methods require collecting all data on a single worker (see https://s.apache.org/dataframe-non-parallelizable-operations for details).
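One way to see why the Pearson method lends itself to parallel execution: the coefficient depends only on a handful of sums, and sums computed on separate partitions can simply be added together. The sketch below is plain Python, not Beam's actual implementation; `partial_stats`, `merge`, and `pearson` are hypothetical helper names chosen for illustration.

```python
import math

def partial_stats(pairs):
    """Summarise one partition of (x, y) pairs as six running sums."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)

def merge(a, b):
    """Combine two partial summaries by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))

def pearson(stats):
    """Pearson correlation coefficient computed from the merged sums."""
    n, sx, sy, sxx, syy, sxy = stats
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / math.sqrt(var_x * var_y)

data = [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)]
# Pretend the rows live on two different workers.
part_a, part_b = data[:2], data[2:]
r_split = pearson(merge(partial_stats(part_a), partial_stats(part_b)))
r_whole = pearson(partial_stats(data))
```

Both routes give the same coefficient (up to floating-point rounding). Rank-based methods such as Spearman need a ranking over all values at once, which does not decompose into per-partition sums in this way.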

See also

DeferredDataFrame.corrwith()
Compute pairwise correlation with another DeferredDataFrame or DeferredSeries.
DeferredSeries.corr()
Compute the correlation between two DeferredSeries.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
cov(min_periods, ddof)[source]
corrwith(other, axis, drop, method)[source]
cummax(**kwargs)

pandas.DataFrame.cummax is not supported in the Beam DataFrame API because it is sensitive to the order of the data.


cummin(**kwargs)

pandas.DataFrame.cummin is not supported in the Beam DataFrame API because it is sensitive to the order of the data.


cumprod(**kwargs)

pandas.DataFrame.cumprod is not supported in the Beam DataFrame API because it is sensitive to the order of the data.


cumsum(**kwargs)

pandas.DataFrame.cumsum is not supported in the Beam DataFrame API because it is sensitive to the order of the data.
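A small plain-Python illustration (not Beam code) of what "sensitive to the order of the data" means. Beam does not guarantee any particular element order within a PCollection, so an operation whose result changes with row order has no single well-defined answer. The lists below are made-up sample data.

```python
from itertools import accumulate

rows = [3, 1, 2]
shuffled = [2, 3, 1]  # same elements, different arrival order

# A plain sum does not care about order...
total_a, total_b = sum(rows), sum(shuffled)

# ...but a running (cumulative) sum does.
running_a = list(accumulate(rows))
running_b = list(accumulate(shuffled))
```

Here the totals agree while the running sums differ, which is the property that rules out the cumulative operations in this group.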


diff(**kwargs)

pandas.DataFrame.diff is not supported in the Beam DataFrame API because it is sensitive to the order of the data.


first(**kwargs)

pandas.DataFrame.first is not supported in the Beam DataFrame API because it is sensitive to the order of the data.


head(**kwargs)

pandas.DataFrame.head is not supported in the Beam DataFrame API because it is sensitive to the order of the data.


interpolate(**kwargs)

pandas.DataFrame.interpolate is not supported in the Beam DataFrame API because it is sensitive to the order of the data.


last(**kwargs)

pandas.DataFrame.last is not supported in the Beam DataFrame API because it is sensitive to the order of the data.


tail(**kwargs)

pandas.DataFrame.tail is not supported in the Beam DataFrame API because it is sensitive to the order of the data.


dot(other)[source]
mode(axis=0, *args, **kwargs)[source]
dropna(axis, **kwargs)[source]
eval(expr, inplace, **kwargs)[source]
query(expr, inplace, **kwargs)[source]
isnull(**kwargs)

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.nan, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns: Mask of bool values for each element in DeferredDataFrame that indicates whether an element is an NA value.
Return type: DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.isnull()
Alias of isna.
DeferredDataFrame.notna()
Boolean inverse of isna.
DeferredDataFrame.dropna()
Omit axes labels with missing values.
isna()
Top-level isna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()
0    False
1    False
2     True
dtype: bool
isna(**kwargs)

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.nan, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns: Mask of bool values for each element in DeferredDataFrame that indicates whether an element is an NA value.
Return type: DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.isnull()
Alias of isna.
DeferredDataFrame.notna()
Boolean inverse of isna.
DeferredDataFrame.dropna()
Omit axes labels with missing values.
isna()
Top-level isna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()
0    False
1    False
2     True
dtype: bool
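As a quick check of the rules stated above, the following sketch uses plain (eager) pandas rather than the deferred Beam API. The column names and values are made up for illustration. It shows that infinity and the empty string count as present values, and that `notna()` is the element-wise negation of `isna()`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": [1.0, np.inf, np.nan],
    "name": ["Alfred", "", None],
})

na_mask = df.isna()
# Only the NaN and the None are flagged; inf and '' count as present values.
flags = na_mask.values.tolist()

# notna() is the element-wise negation of isna().
same = na_mask.equals(~df.notna())
```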
notnull(**kwargs)

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.nan, get mapped to False values.

Returns: Mask of bool values for each element in DeferredDataFrame that indicates whether an element is not an NA value.
Return type: DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.notnull()
Alias of notna.
DeferredDataFrame.isna()
Boolean inverse of notna.
DeferredDataFrame.dropna()
Omit axes labels with missing values.
notna()
Top-level notna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()
0     True
1     True
2    False
dtype: bool
notna(**kwargs)

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.nan, get mapped to False values.

Returns: Mask of bool values for each element in DeferredDataFrame that indicates whether an element is not an NA value.
Return type: DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.notnull()
Alias of notna.
DeferredDataFrame.isna()
Boolean inverse of notna.
DeferredDataFrame.dropna()
Omit axes labels with missing values.
notna()
Top-level notna.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()
0     True
1     True
2    False
dtype: bool
items(**kwargs)

pandas.DataFrame.items is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


itertuples(**kwargs)

pandas.DataFrame.itertuples is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


iterrows(**kwargs)

pandas.DataFrame.iterrows is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


iteritems(**kwargs)

pandas.DataFrame.iteritems is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


join(other, on, **kwargs)[source]
merge(right, on, left_on, right_on, left_index, right_index, suffixes, **kwargs)[source]
nlargest(keep, **kwargs)[source]
nsmallest(keep, **kwargs)[source]
nunique(**kwargs)[source]
plot(**kwargs)

pandas.DataFrame.plot is not supported in the Beam DataFrame API because it is a plotting tool.


pop(item)[source]
quantile(q, axis, **kwargs)[source]
rename(**kwargs)[source]
rename_axis(**kwargs)

Set the name of the axis for the index or columns.

Parameters:
  • mapper (scalar, list-like, optional) – Value to set the axis name attribute.
  • index, columns (scalar, list-like, dict-like or function, optional) –

    A scalar, list-like, dict-like or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a DeferredSeries; it applies only to DeferredDataFrame objects.

    Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.

    Changed in version 0.24.0.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename.
  • copy (bool, default True) – Also copy underlying data.
  • inplace (bool, default False) – Modifies the object directly, instead of creating a new DeferredSeries or DeferredDataFrame.
Returns:

The same type as the caller or None if inplace=True.

Return type:

DeferredSeries, DeferredDataFrame, or None

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredSeries.rename()
Alter DeferredSeries index labels or name.
DeferredDataFrame.rename()
Alter DeferredDataFrame index labels or name.
Index.rename()
Set new names on index.

Notes

DeferredDataFrame.rename_axis supports two calling conventions

  • (index=index_mapper, columns=columns_mapper, ...)
  • (mapper, axis={'index', 'columns'}, ...)

The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter copy is ignored.

The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.

We highly recommend using keyword arguments to clarify your intent.
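The two calling conventions can be compared side by side. This is a minimal sketch in plain (eager) pandas, not the deferred Beam API, and the sample frame is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [4, 2], "num_arms": [0, 2]},
                  index=["dog", "monkey"])

# Convention 1: name each axis by keyword in a single call.
by_keyword = df.rename_axis(index="animal", columns="limbs")

# Convention 2: pass a mapper plus the axis it targets, one axis per call.
by_mapper = (df.rename_axis("animal", axis="index")
               .rename_axis("limbs", axis="columns"))
```

With scalar names both forms set only the axis names and leave the labels untouched; the keyword form states the intent more directly.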

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Series**

>>> s = pd.Series(["dog", "cat", "monkey"])
>>> s
0       dog
1       cat
2    monkey
dtype: object
>>> s.rename_axis("animal")
animal
0       dog
1       cat
2    monkey
dtype: object

**DataFrame**

>>> df = pd.DataFrame({"num_legs": [4, 4, 2],
...                    "num_arms": [0, 0, 2]},
...                   ["dog", "cat", "monkey"])
>>> df
        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("animal")
>>> df
        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("limbs", axis="columns")
>>> df
limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

**MultiIndex**

>>> df.index = pd.MultiIndex.from_product([['mammal'],
...                                        ['dog', 'cat', 'monkey']],
...                                       names=['type', 'name'])
>>> df
limbs          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(index={'type': 'class'})
limbs          num_legs  num_arms
class  name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(columns=str.upper)
LIMBS          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2
replace(limit, **kwargs)[source]
reset_index(level=None, **kwargs)[source]
round(decimals, *args, **kwargs)[source]
select_dtypes(**kwargs)

Return a subset of the DataFrame’s columns based on the column dtypes.

Parameters: include, exclude (scalar or list-like) – A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.
Returns: The subset of the frame including the dtypes in include and excluding the dtypes in exclude.
Return type: DeferredDataFrame
Raises: ValueError –
  • If both of include and exclude are empty.
  • If include and exclude have overlapping elements.
  • If any kind of string dtype is passed in.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.dtypes()
Return DeferredSeries with the data type of each column.

Notes

  • To select all numeric types, use np.number or 'number'
  • To select strings you must use the object dtype, but note that this will return all object dtype columns
  • See the numpy dtype hierarchy
  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
  • To select Pandas categorical dtypes, use 'category'
  • To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'a': [1, 2] * 3,
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df
   a      b    c
0  1   True  1.0
1  2  False  2.0
2  1   True  1.0
3  2  False  2.0
4  1   True  1.0
5  2  False  2.0

>>> df.select_dtypes(include='bool')
       b
0   True
1  False
2   True
3  False
4   True
5  False

>>> df.select_dtypes(include=['float64'])
     c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0

>>> df.select_dtypes(exclude=['int64'])
       b    c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
5  False  2.0
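The Notes mention selecting all numeric types with 'number'. A minimal sketch in plain (eager) pandas, not the deferred Beam API, with an invented three-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2],          # integer column
    "b": [True, False],   # boolean column
    "c": [1.0, 2.0],      # float column
})

numeric = df.select_dtypes(include="number")
non_numeric = df.select_dtypes(exclude="number")
```

Note that the boolean column is not treated as numeric here, so it ends up in `non_numeric`.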
shift(axis, **kwargs)[source]
shape

pandas.DataFrame.shape is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


stack(**kwargs)

Stack the prescribed level(s) from columns to index.

Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:

  • if the columns have a single level, the output is a Series;
  • if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s) and the output is a DataFrame.
Parameters:
  • level (int, str, list, default -1) – Level(s) to stack from the column axis onto the index axis, defined as one index or label, or a list of indices or labels.
  • dropna (bool, default True) – Whether to drop rows in the resulting DeferredDataFrame/DeferredSeries with missing values. Stacking a column level onto the index axis can create combinations of index and column values that are missing from the original dataframe. See Examples section.
Returns:

Stacked dataframe or series.

Return type:

DeferredDataFrame or DeferredSeries

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.unstack()
Unstack prescribed level(s) from index axis onto column axis.
DeferredDataFrame.pivot()
Reshape dataframe from long format to wide format.
DeferredDataFrame.pivot_table()
Create a spreadsheet-style pivot table as a DeferredDataFrame.

Notes

The function is named by analogy with a collection of books being reorganized from being side by side on a horizontal position (the columns of the dataframe) to being stacked vertically on top of each other (in the index of the dataframe).

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

**Single level columns**

>>> df_single_level_cols = pd.DataFrame([[0, 1], [2, 3]],
...                                     index=['cat', 'dog'],
...                                     columns=['weight', 'height'])

Stacking a dataframe with a single level column axis returns a Series:

>>> df_single_level_cols
     weight height
cat       0      1
dog       2      3
>>> df_single_level_cols.stack()
cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64

**Multi level columns: simple case**

>>> multicol1 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('weight', 'pounds')])
>>> df_multi_level_cols1 = pd.DataFrame([[1, 2], [2, 4]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol1)

Stacking a dataframe with a multi-level column axis:

>>> df_multi_level_cols1
     weight
         kg    pounds
cat       1        2
dog       2        4
>>> df_multi_level_cols1.stack()
            weight
cat kg           1
    pounds       2
dog kg           2
    pounds       4

**Missing values**

>>> multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('height', 'm')])
>>> df_multi_level_cols2 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol2)

It is common to have missing values when stacking a dataframe
with multi-level columns, as the stacked dataframe typically
has more values than the original dataframe. Missing values
are filled with NaNs:

>>> df_multi_level_cols2
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> df_multi_level_cols2.stack()
        height  weight
cat kg     NaN     1.0
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN

**Prescribing the level(s) to be stacked**

The first parameter controls which level or levels are stacked:

>>> df_multi_level_cols2.stack(0)
             kg    m
cat height  NaN  2.0
    weight  1.0  NaN
dog height  NaN  4.0
    weight  3.0  NaN
>>> df_multi_level_cols2.stack([0, 1])
cat  height  m     2.0
     weight  kg    1.0
dog  height  m     4.0
     weight  kg    3.0
dtype: float64

**Dropping missing values**

>>> df_multi_level_cols3 = pd.DataFrame([[None, 1.0], [2.0, 3.0]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol2)

Note that rows where all values are missing are dropped by
default but this behaviour can be controlled via the dropna
keyword parameter:

>>> df_multi_level_cols3
    weight height
        kg      m
cat    NaN    1.0
dog    2.0    3.0
>>> df_multi_level_cols3.stack(dropna=False)
        height  weight
cat kg     NaN     NaN
    m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN
>>> df_multi_level_cols3.stack(dropna=True)
        height  weight
cat m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN
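Since unstack is listed above as the counterpart of stack, a short round trip may help. This sketch uses plain (eager) pandas, not the deferred Beam API, and reuses the single-level frame from the first example.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [2, 3]],
                  index=["cat", "dog"],
                  columns=["weight", "height"])

stacked = df.stack()          # Series indexed by (row label, column label)
restored = stacked.unstack()  # back to a two-column DataFrame
```

Each cell of the original frame becomes one entry of `stacked`, and `restored` holds the same values under the same row and column labels.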
all(*args, **kwargs)
any(*args, **kwargs)
count(*args, **kwargs)
max(*args, **kwargs)
min(*args, **kwargs)
prod(*args, **kwargs)
product(*args, **kwargs)
sum(*args, **kwargs)
mean(*args, **kwargs)
median(*args, **kwargs)
take(**kwargs)

pandas.DataFrame.take is not supported in the Beam DataFrame API because it is deprecated in pandas.

to_records(**kwargs)

pandas.DataFrame.to_records is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


to_dict(**kwargs)

pandas.DataFrame.to_dict is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


to_numpy(**kwargs)

pandas.DataFrame.to_numpy is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


to_string(**kwargs)

pandas.DataFrame.to_string is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


to_sparse(**kwargs)

pandas.DataFrame.to_sparse is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


transpose(**kwargs)

pandas.DataFrame.transpose is not supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.
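A plain-Python illustration (not Beam code) of why the output columns depend on the data. The `transpose` function below is a hypothetical helper that works on nested dicts, and the two batches are made-up sample data.

```python
def transpose(records):
    """Turn {row_label: {column: value}} into {column: {row_label: value}}."""
    out = {}
    for row_label, row in records.items():
        for column, value in row.items():
            out.setdefault(column, {})[row_label] = value
    return out

batch_1 = {"dog": {"legs": 4, "arms": 0}}
batch_2 = {"dog": {"legs": 4, "arms": 0}, "monkey": {"legs": 2, "arms": 2}}

# After transposing, the "column names" are the original row labels,
# so they change whenever the data changes.
cols_1 = sorted(next(iter(transpose(batch_1).values())))
cols_2 = sorted(next(iter(transpose(batch_2).values())))
```

The first batch yields a single column and the second yields two, so the set of output columns cannot be known before the rows themselves have been read.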


unstack(*args, **kwargs)[source]
update(**kwargs)

Modify in place using non-NA values from another DataFrame.

Aligns on indices. There is no return value.

Parameters:
  • other (DeferredDataFrame, or object coercible into a DeferredDataFrame) – Should have at least one matching index/column label with the original DeferredDataFrame. If a DeferredSeries is passed, its name attribute must be set, and that will be used as the column name to align with the original DeferredDataFrame.
  • join ({'left'}, default 'left') – Only left join is implemented, keeping the index and columns of the original object.
  • overwrite (bool, default True) –

    How to handle non-NA values for overlapping keys:

    • True: overwrite original DeferredDataFrame’s values with values from other.
    • False: only update values that are NA in the original DeferredDataFrame.
  • filter_func (callable(1d-array) -> bool 1d-array, optional) – Can choose to replace values other than NA. Return True for values that should be updated.
  • errors ({'raise', 'ignore'}, default 'ignore') –

    If ‘raise’, will raise a ValueError if the DeferredDataFrame and other both contain non-NA data in the same place.

    Changed in version 0.24.0: Changed from raise_conflict=False|True to errors=’ignore’|’raise’.

Returns:

None. The method modifies the calling object in place.

Return type:

None

Raises:
  • ValueError –
      • When errors=‘raise’ and there’s overlapping non-NA data.
      • When errors is not either ‘ignore’ or ‘raise’.
  • NotImplementedError – If join != ‘left’.

Differences from pandas

This operation has no known divergences from the pandas API.

See also

dict.update()
Similar method for dictionaries.
DeferredDataFrame.merge()
For column(s)-on-column(s) operations.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, 5, 6],
...                        'C': [7, 8, 9]})
>>> df.update(new_df)
>>> df
   A  B
0  1  4
1  2  5
2  3  6

The DataFrame's length does not increase as a result of the update,
only values at matching index/column labels are updated.

>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']})
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  e
2  c  f

For Series, its name attribute must be set.

>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_column = pd.Series(['d', 'e'], name='B', index=[0, 2])
>>> df.update(new_column)
>>> df
   A  B
0  a  d
1  b  y
2  c  e
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e']}, index=[1, 2])
>>> df.update(new_df)
>>> df
   A  B
0  a  x
1  b  d
2  c  e

If `other` contains NaNs the corresponding values are not updated
in the original dataframe.

>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, np.nan, 6]})
>>> df.update(new_df)
>>> df
   A      B
0  1    4.0
1  2  500.0
2  3    6.0
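The examples above all use the default overwrite=True. A minimal sketch of overwrite=False in plain (eager) pandas, not the deferred Beam API, with invented values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [np.nan, 500.0, 600.0]})
other = pd.DataFrame({"A": [10.0, 20.0, 30.0],
                      "B": [40.0, 50.0, 60.0]})

# overwrite=False fills only the positions that are NA in `df`;
# existing non-NA values are left alone.
df.update(other, overwrite=False)
```

Only the two missing cells take values from `other`; every value that was already present in `df` is kept.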
values

pandas.DataFrame.values is not supported in the Beam DataFrame API because it produces an output type that is not deferred.


abs(**kwargs)

Return a Series/DataFrame with absolute numeric value of each element.

This function only applies to elements that are all numeric.

Returns: DeferredSeries/DeferredDataFrame containing the absolute value of each element.
Return type: DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

numpy.absolute()
Calculate the absolute value element-wise.

Notes

For a complex input a + bj (for example, 1.2 + 1j), the absolute value is \(\sqrt{ a^2 + b^2 }\).
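The formula can be checked with nothing but the Python standard library. This sketch takes the real part as a and the imaginary part as b.

```python
import math

z = 1.2 + 1j
a, b = z.real, z.imag

# Modulus computed from the formula sqrt(a^2 + b^2)...
by_formula = math.sqrt(a ** 2 + b ** 2)
# ...agrees with Python's built-in abs() for complex numbers.
by_builtin = abs(z)
```

Rounded to five decimal places the value is 1.56205, matching the complex-number example for this method.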

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])
>>> s.abs()
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])
>>> s.abs()
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta('1 days')])
>>> s.abs()
0   1 days
dtype: timedelta64[ns]

Select rows with data closest to a certain value using argsort (from
StackOverflow: https://stackoverflow.com/a/17758115).

>>> df = pd.DataFrame({
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df
     a    b    c
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
>>> df.loc[(df.c - 43).abs().argsort()]
     a    b    c
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
add(**kwargs)
apply(**kwargs)
asfreq(**kwargs)
asof(**kwargs)
astype(**kwargs)

Cast a pandas object to a specified dtype.

Parameters:
  • dtype (data type, or dict of column name -> data type) – Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DeferredDataFrame’s columns to column-specific types.
  • copy (bool, default True) – Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
  • errors ({'raise', 'ignore'}, default 'raise') –

    Control raising of exceptions on invalid data for provided dtype.

    • raise : allow exceptions to be raised
    • ignore : suppress exceptions. On error return original object.
Returns:casted
Return type:same type as caller
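
A minimal sketch of the default errors='raise' behaviour, written against plain pandas rather than the deferred Beam API. The sample data and variable names are assumptions for illustration. It also shows pd.to_numeric (listed under "See also") as one way to turn unparseable values into NaN instead of raising.

```python
import pandas as pd

raw = pd.Series(['1', '2', 'x'])   # 'x' cannot be parsed as an integer

try:
    raw.astype('int64')            # errors='raise' is the default
    cast_failed = False
except ValueError:
    cast_failed = True

# pd.to_numeric can coerce values it cannot parse to NaN instead of raising.
coerced = pd.to_numeric(raw, errors='coerce')

print(cast_failed)                 # True
print(coerced.isna().tolist())     # [False, False, True]
```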

Differences from pandas

This operation has no known divergences from the pandas API.

See also

to_datetime()
Convert argument to datetime.
to_timedelta()
Convert argument to timedelta.
to_numeric()
Convert argument to a numeric type.
numpy.ndarray.astype()
Cast a numpy array to a specified type.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

Create a DataFrame:

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.dtypes
col1    int64
col2    int64
dtype: object

Cast all columns to int32:

>>> df.astype('int32').dtypes
col1    int32
col2    int32
dtype: object

Cast col1 to int32 using a dictionary:

>>> df.astype({'col1': 'int32'}).dtypes
col1    int32
col2    int64
dtype: object

Create a series:

>>> ser = pd.Series([1, 2], dtype='int32')
>>> ser
0    1
1    2
dtype: int32
>>> ser.astype('int64')
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> cat_dtype = pd.api.types.CategoricalDtype(
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using ``copy=False`` and changing data on a new
pandas object may propagate changes:

>>> s1 = pd.Series([1, 2])
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1  # note that s1[0] has changed too
0    10
1     2
dtype: int64

Create a series of dates:

>>> ser_date = pd.Series(pd.date_range('20200101', periods=3))
>>> ser_date
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[ns]

Datetimes are localized to UTC before being
converted to the specified timezone:

>>> ser_date.astype('datetime64[ns, US/Eastern]')
0   2019-12-31 19:00:00-05:00
1   2020-01-01 19:00:00-05:00
2   2020-01-02 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
at
at_time(**kwargs)
attrs

pandas.DataFrame.attrs is not supported in the Beam DataFrame API because it is experimental in pandas.

backfill(**kwargs)
between_time(**kwargs)
bfill(**kwargs)
bool()
boxplot(**kwargs)
combine(**kwargs)
combine_first(**kwargs)
compare(**kwargs)
convert_dtypes(**kwargs)
copy(**kwargs)

Make a copy of this object’s indices and data.

When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).

When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

Parameters:deep (bool, default True) – Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.
Returns:copy – Object type matches caller.
Return type:DeferredSeries or DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

Notes

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).

While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.
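
The distinction drawn in the notes above can be illustrated with the standard library alone. The sketch below is an analogy on a plain nested list, not pandas or Beam code: copy.copy duplicates only the outer container (comparable to how a deep copy of a Series treats nested Python objects), while copy.deepcopy recurses into the inner lists.

```python
import copy

original = [[1, 2], [3, 4]]
one_level = copy.copy(original)       # new outer list, same inner lists
recursive = copy.deepcopy(original)   # inner lists are copied as well

original[0][0] = 10                   # mutate a nested object in place

print(one_level[0])   # [10, 2]  (shares the inner list with original)
print(recursive[0])   # [1, 2]   (fully independent)
```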

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64

>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64

**Shallow copy versus default (deep) copy:**

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

Shallow copy shares data and index with original.

>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True

Deep copy has own copy of data and index.

>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False

Updates to the data shared by the shallow copy and the original are reflected
in both; the deep copy remains unchanged.

>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64

Note that when copying an object containing Python objects, a deep copy
will copy the data, but will not do so recursively. Updating a nested
data object will be reflected in the deep copy.

>>> s = pd.Series([[1, 2], [3, 4]])
>>> deep = s.copy()
>>> s[0][0] = 10
>>> s
0    [10, 2]
1     [3, 4]
dtype: object
>>> deep
0    [10, 2]
1     [3, 4]
dtype: object
describe(**kwargs)
div(**kwargs)
divide(**kwargs)
drop(labels, axis, index, columns, errors, **kwargs)
drop_duplicates(**kwargs)
droplevel(level, axis)
dtype
duplicated(**kwargs)
empty
eq(**kwargs)

Test whether dataframe is equal to other, element-wise (binary operator eq).

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around the comparison operators.

Equivalent to ==, !=, <=, <, >=, >, with support for choosing the axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:Result of the comparison.
Return type:DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq()
Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
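
The NaN rule in the note above follows ordinary IEEE 754 floating-point semantics. A small standard-library sketch (no pandas or Beam involved) shows why an equality test is not a reliable way to detect missing values:

```python
import math

nan = float('nan')

print(nan == nan)        # False: NaN never compares equal, even to itself
print(nan != nan)        # True
print(math.isnan(nan))   # True: use an explicit NaN check instead
```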

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When `other` is a Series, the columns of a DataFrame are aligned
with the index of `other` and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must
match the number of elements in `other`:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
equals(other)
ewm(**kwargs)
expanding(**kwargs)
ffill(**kwargs)
fillna(value, method, axis, limit, **kwargs)
filter(**kwargs)

Subset the dataframe rows or columns according to the specified index labels.

Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.

Parameters:
  • items (list-like) – Keep labels from axis which are in items.
  • like (str) – Keep labels from axis for which “like in label == True”.
  • regex (str (regular expression)) – Keep labels from axis for which re.search(regex, label) == True.
  • axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, ‘index’ for DeferredSeries, ‘columns’ for DeferredDataFrame.
Return type:same type as input object

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.loc()
Access a group of rows and columns by label(s) or a boolean array.

Notes

The items, like, and regex parameters are enforced to be mutually exclusive.

axis defaults to the info axis that is used when indexing with [].
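
The label-matching rules described for items, like, and regex can be sketched in plain Python. select_labels below is a hypothetical helper written for illustration only; it is not part of pandas or Beam, and its error handling is a simplification of the mutual-exclusivity check mentioned above.

```python
import re

def select_labels(labels, items=None, like=None, regex=None):
    """Hypothetical helper mirroring the label rules described above."""
    given = [arg for arg in (items, like, regex) if arg is not None]
    if len(given) != 1:
        # Simplified: exactly one selector must be supplied.
        raise TypeError('pass exactly one of items, like, or regex')
    if items is not None:
        present = set(labels)
        return [item for item in items if item in present]
    if like is not None:
        # Substring test: "like in label".
        return [label for label in labels if like in str(label)]
    pattern = re.compile(regex)
    # Regex test: re.search(regex, label) finds a match anywhere in the label.
    return [label for label in labels if pattern.search(str(label))]

columns = ['one', 'two', 'three']
print(select_labels(columns, items=['one', 'three']))  # ['one', 'three']
print(select_labels(columns, regex='e$'))              # ['one', 'three']
print(select_labels(['mouse', 'rabbit'], like='bbi'))  # ['rabbit']
```

The three calls mirror the column and row selections in the examples below. In every case only labels are inspected, never the data.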

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6

>>> # select columns by name
>>> df.filter(items=['one', 'three'])
         one  three
mouse     1      3
rabbit    4      6

>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
         one  three
mouse     1      3
rabbit    4      6

>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
         one  two  three
rabbit    4    5      6
first_valid_index(**kwargs)
flags
floordiv(**kwargs)
from_dict(**kwargs)
from_records(**kwargs)
ge(**kwargs)

Test whether dataframe is greater than or equal to other, element-wise (binary operator ge).

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around the comparison operators.

Equivalent to ==, !=, <=, <, >=, >, with support for choosing the axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:Result of the comparison.
Return type:DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq()
Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
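
A small sketch of the index-union behaviour noted above, using plain pandas rather than the deferred Beam API. The frames left and right are made-up sample data. Rows that exist on only one side are compared against NaN and therefore come out False.

```python
import pandas as pd

left = pd.DataFrame({'cost': [250, 150]}, index=['A', 'B'])
right = pd.DataFrame({'cost': [100, 150]}, index=['B', 'C'])

# The result index is the union A, B, C. Only B is present on both
# sides, so only B can compare True (150 >= 100).
result = left.ge(right)

print(list(result.index))        # ['A', 'B', 'C']
print(result['cost'].tolist())   # [False, True, False]
```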

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When `other` is a Series, the columns of a DataFrame are aligned
with the index of `other` and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must
match the number of elements in `other`:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
get(**kwargs)
groupby(by, level, axis, as_index, group_keys, **kwargs)
gt(**kwargs)

Test whether dataframe is greater than other, element-wise (binary operator gt).

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around the comparison operators.

Equivalent to ==, !=, <=, <, >=, >, with support for choosing the axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:Result of the comparison.
Return type:DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq()
Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When `other` is a Series, the columns of a DataFrame are aligned
with the index of `other` and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must
match the number of elements in `other`:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
hist(**kwargs)

pandas.DataFrame.hist is not supported in the Beam DataFrame API because it is a plotting tool.

iat
idxmax(**kwargs)
idxmin(**kwargs)
index
infer_objects(**kwargs)
insert(**kwargs)
isin(**kwargs)

Whether each element in the DataFrame is contained in values.

Parameters:values (iterable, DeferredSeries, DeferredDataFrame or dict) – The result will only be true at a location if all the labels match. If values is a DeferredSeries, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DeferredDataFrame, then both the index and column labels must match.
Returns:DeferredDataFrame of booleans showing whether each element in the DeferredDataFrame is contained in values.
Return type:DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq()
Equality test for DeferredDataFrame.
DeferredSeries.isin()
Equivalent method on DeferredSeries.
DeferredSeries.str.contains()
Test if pattern or regex is contained within a string of a DeferredSeries or Index.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
...                   index=['falcon', 'dog'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0

When ``values`` is a list, check whether every value in the DataFrame
is present in the list (which animals have 0 or 2 legs or wings):

>>> df.isin([0, 2])
        num_legs  num_wings
falcon      True       True
dog        False       True

When ``values`` is a dict, we can pass values to check for each
column separately:

>>> df.isin({'num_wings': [0, 3]})
        num_legs  num_wings
falcon     False      False
dog        False       True

When ``values`` is a Series or DataFrame, the index and column must
match. Here 'falcon' matches in both columns, while 'dog' has no
corresponding row in `other` and is therefore False throughout.

>>> other = pd.DataFrame({'num_legs': [8, 2], 'num_wings': [0, 2]},
...                      index=['spider', 'falcon'])
>>> df.isin(other)
        num_legs  num_wings
falcon      True       True
dog        False      False
kurt(**kwargs)
kurtosis(**kwargs)
last_valid_index(**kwargs)
le(**kwargs)

Test whether dataframe is less than or equal to other, element-wise (binary operator le).

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around the comparison operators.

Equivalent to ==, !=, <=, <, >=, >, with support for choosing the axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:Result of the comparison.
Return type:DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq()
Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When `other` is a Series, the columns of a DataFrame are aligned
with the index of `other` and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must
match the number of elements in `other`:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
lookup(**kwargs)
lt(**kwargs)

Test whether dataframe is less than other, element-wise (binary operator lt).

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around the comparison operators.

Equivalent to ==, !=, <=, <, >=, >, with support for choosing the axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:Result of the comparison.
Return type:DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq()
Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When `other` is a Series, the columns of a DataFrame are aligned
with the index of `other` and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must
match the number of elements in `other`:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
mad(**kwargs)
mask(cond, **kwargs)
melt(**kwargs)
mod(**kwargs)
mul(**kwargs)
multiply(**kwargs)
ndim
ne(**kwargs)

Test whether dataframe is not equal to other, element-wise (binary operator ne).

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around the comparison operators.

Equivalent to ==, !=, <=, <, >=, >, with support for choosing the axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns:Result of the comparison.
Return type:DeferredDataFrame of bool

Differences from pandas

This operation has no known divergences from the pandas API.

See also

DeferredDataFrame.eq()
Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
Compare DeferredDataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
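
The two rules in this note can be seen in a small pandas sketch. The frames `left` and `right` are made-up illustration data, not part of the Beam or pandas documentation, and the code uses plain pandas rather than the deferred Beam API:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the two frames share labels A and B, but not C and D.
left = pd.DataFrame({"cost": [250.0, np.nan, 100.0]}, index=["A", "B", "C"])
right = pd.DataFrame({"cost": [250.0, np.nan, 100.0]}, index=["A", "B", "D"])

# The flexible method aligns both frames on the union of their indices.
# Labels missing on one side are filled with NaN before comparing.
result = left.ne(right)

# A: 250 vs 250 compares equal, so ne is False.
# B: NaN vs NaN is treated as different, so ne is True.
# C and D: present on one side only, so one operand is NaN and ne is True.
print(result)
```

Note that row B is reported as "not equal" even though both inputs hold NaN at that label.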

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When `other` is a Series, the columns of a DataFrame are aligned
with the index of `other` and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must
match the number of elements in `other`:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
pad(**kwargs)
pct_change(**kwargs)
pipe(**kwargs)
pivot(**kwargs)
pivot_table(**kwargs)
pow(**kwargs)
radd(**kwargs)
rank(**kwargs)
rdiv(**kwargs)
reindex(**kwargs)
reindex_like(**kwargs)
reorder_levels(**kwargs)

Rearrange index levels using input order. May not drop or duplicate levels.

Parameters:
  • order (list of int or list of str) – List representing new level order. Reference level by number (position) or by key (label).
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Where to reorder levels.
Returns:

Return type:

DeferredDataFrame

Differences from pandas

This operation has no known divergences from the pandas API.
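
The `order` parameter described above accepts either level labels or level positions. The following is a minimal pandas sketch of both forms; the level names `quarter` and `product` and the data are assumptions chosen for illustration, and usage with the deferred Beam API will look different:

```python
import pandas as pd

# Hypothetical two-level index with named levels.
index = pd.MultiIndex.from_tuples(
    [("Q1", "A"), ("Q1", "B"), ("Q2", "A")], names=["quarter", "product"])
df = pd.DataFrame({"cost": [250, 150, 300]}, index=index)

# Reference the levels by label ...
by_label = df.reorder_levels(["product", "quarter"])

# ... or by position. Both calls describe the same new order.
by_position = df.reorder_levels([1, 0])

print(by_label)
```

Only the order of the levels changes. The rows stay in their original order, and no level is dropped or duplicated.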

resample(**kwargs)
rfloordiv(**kwargs)
rmod(**kwargs)
rmul(**kwargs)
rolling(**kwargs)
rpow(**kwargs)
rsub(**kwargs)
rtruediv(**kwargs)
sample(**kwargs)
sem(**kwargs)
set_axis(**kwargs)
set_flags(**kwargs)
size
skew(**kwargs)
slice_shift(**kwargs)
sort_index(axis, **kwargs)

Sort object by labels (along an axis).

Returns a new DataFrame sorted by label if inplace argument is False, otherwise updates the original DataFrame and returns None.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
  • level (int or level name or list of ints or list of level names) – If not None, sort on values in specified index level(s).
  • ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
  • inplace (bool, default False) – If True, perform operation in-place.
  • kind ({'quicksort', 'mergesort', 'heapsort'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort for more information. mergesort is the only stable algorithm. For DeferredDataFrames, this option is only applied when sorting on a single column or label.
  • na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end. Not implemented for MultiIndex.
  • sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    New in version 1.0.0.

  • key (callable, optional) –

    If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape. For MultiIndex inputs, the key is applied per level.

    New in version 1.1.0.

Returns:

The original DeferredDataFrame sorted by the labels or None if inplace=True.

Return type:

DeferredDataFrame or None

Differences from pandas

axis=index is not allowed because it imposes an ordering on the dataset, and we cannot guarantee it will be maintained (see https://s.apache.org/dataframe-order-sensitive-operations). Only axis=columns is allowed.

See also

DeferredSeries.sort_index()
Sort DeferredSeries by the index.
DeferredDataFrame.sort_values()
Sort DeferredDataFrame by the value.
DeferredSeries.sort_values()
Sort DeferredSeries by the value.

Examples

NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.

>>> df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150],
...                   columns=['A'])
>>> df.sort_index()
     A
1    4
29   2
100  1
150  5
234  3

By default, it sorts in ascending order. To sort in descending order,
use ``ascending=False``:

>>> df.sort_index(ascending=False)
     A
234  3
150  5
100  1
29   2
1    4

A key function can be specified which is applied to the index before
sorting. For a ``MultiIndex`` this is applied to each level separately.

>>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd'])
>>> df.sort_index(key=lambda x: x.str.lower())
   a
A  1
b  2
C  3
d  4
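
The examples above all sort by the row index, which the 'Differences from pandas' section says is not allowed in the Beam API. As a rough sketch of the one mode that section does allow, axis=columns, here is the equivalent call in plain pandas. The frame is made-up illustration data, and the deferred Beam usage will look different:

```python
import pandas as pd

# Hypothetical frame whose columns are not in alphabetical order.
df = pd.DataFrame({"revenue": [100, 250], "cost": [250, 150]},
                  index=["A", "B"])

# Sort the column labels. The rows keep their original order, so no
# ordering is imposed on the dataset itself.
by_columns = df.sort_index(axis="columns")

print(by_columns)
```

The resulting column order depends only on the column labels, not on the data in the rows.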
sort_values(axis, **kwargs)

sort_values is not implemented.

It is not implemented for axis=index because it imposes an ordering on the dataset, and we cannot guarantee it will be maintained (see https://s.apache.org/dataframe-order-sensitive-operations).

It is not implemented for axis=columns because it makes the order of the columns depend on the data (see https://s.apache.org/dataframe-non-deferred-column-names).
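
To illustrate the axis=columns restriction, the plain pandas sketch below sorts two frames that have identical column names but different values. The frames `first` and `second` are hypothetical illustration data. This is pandas behavior shown for explanation, not something the Beam API supports:

```python
import pandas as pd

# Two hypothetical frames with the same columns but different values.
first = pd.DataFrame({"a": [1], "b": [2]})
second = pd.DataFrame({"a": [2], "b": [1]})

# Sort the columns by the values found in the row labeled 0.
first_sorted = first.sort_values(by=0, axis="columns")
second_sorted = second.sort_values(by=0, axis="columns")

# The column order of each result is decided by the data itself.
print(list(first_sorted.columns))
print(list(second_sorted.columns))
```

Because the two results end up with different column orders, the output columns cannot be determined before the data is read, which is the problem a deferred API runs into.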

sparse
squeeze(**kwargs)
std(**kwargs)
style
sub(**kwargs)
subtract(**kwargs)
swapaxes(**kwargs)
swaplevel(**kwargs)
to_clipboard(**kwargs)
to_csv(path, *args, **kwargs)
to_excel(path, *args, **kwargs)
to_feather(path, *args, **kwargs)
to_gbq(**kwargs)
to_hdf(**kwargs)

pandas.DataFrame.to_hdf is not supported in the Beam DataFrame API because HDF5 is a random access file format.

to_html(path, *args, **kwargs)
to_json(path, orient=None, *args, **kwargs)
to_latex(**kwargs)
to_markdown(**kwargs)
to_msgpack(**kwargs)

pandas.DataFrame.to_msgpack is not supported in the Beam DataFrame API because it is deprecated in pandas.

to_parquet(path, *args, **kwargs)
to_period(**kwargs)
to_pickle(**kwargs)
to_sql(**kwargs)
to_stata(path, *args, **kwargs)
to_timestamp(**kwargs)
to_xarray(**kwargs)
transform(**kwargs)
truediv(**kwargs)
truncate(**kwargs)
tshift(**kwargs)
tz_convert(**kwargs)
tz_localize(ambiguous, **kwargs)
value_counts(**kwargs)
var(**kwargs)
where(cond, other, errors, **kwargs)
classmethod wrap(expr, split_tuples=True)
xs(**kwargs)