apache_beam.dataframe.frames module
Analogs for pandas.DataFrame and pandas.Series: DeferredDataFrame and DeferredSeries.
These classes are effectively wrappers around a schema-aware PCollection that provide a set of operations compatible with the pandas API.
Note that we aim for the Beam DataFrame API to be completely compatible with the pandas API, but there are some features that are currently unimplemented for various reasons. Pay particular attention to the ‘Differences from pandas’ section for each operation to understand where we diverge.
-
class apache_beam.dataframe.frames.DeferredSeries(expr)[source]
Bases: apache_beam.dataframe.frames.DeferredDataFrameOrSeries
-
name
¶ Return the name of the Series.
The name of a Series becomes its index or column name if it is used to form a DataFrame. It is also used whenever displaying the Series using the interpreter.
Returns: The name of the DeferredSeries, also the column name if part of a DeferredDataFrame.
Return type: label (hashable object)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.rename
- Sets the DeferredSeries name when given a scalar input.
Index.name
- Corresponding Index property.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
The Series name can be set initially when calling the constructor.
>>> s = pd.Series([1, 2, 3], dtype=np.int64, name='Numbers')
>>> s
0    1
1    2
2    3
Name: Numbers, dtype: int64
>>> s.name = "Integers"
>>> s
0    1
1    2
2    3
Name: Integers, dtype: int64
The name of a Series within a DataFrame is its column name.
>>> df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
...                   columns=["Odd Numbers", "Even Numbers"])
>>> df
   Odd Numbers  Even Numbers
0            1             2
1            3             4
2            5             6
>>> df["Even Numbers"].name
'Even Numbers'
-
hasnans
Return True if there are any NaNs; enables various performance speedups.
Differences from pandas
This operation has no known divergences from the pandas API.
-
dtype
¶ Return the dtype object of the underlying data.
Differences from pandas
This operation has no known divergences from the pandas API.
-
dtypes
¶ Return the dtype object of the underlying data.
Differences from pandas
This operation has no known divergences from the pandas API.
-
keys
()[source]¶ Return alias for index.
Returns: Index of the DeferredSeries.
Return type: Index
Differences from pandas
This operation has no known divergences from the pandas API.
-
T
(**kwargs)¶ Return the transpose, which is by definition self.
Differences from pandas
This operation has no known divergences from the pandas API.
-
transpose
(**kwargs)¶ Return the transpose, which is by definition self.
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
-
shape
pandas.Series.shape() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
append
(to_append, ignore_index, verify_integrity, **kwargs)[source]¶ Concatenate two or more Series.
Parameters: - to_append (DeferredSeries or list/tuple of DeferredSeries) – DeferredSeries to append with self.
- ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
- verify_integrity (bool, default False) – If True, raise Exception on creating index with duplicates.
Returns: Concatenated DeferredSeries.
Return type: DeferredSeries
Differences from pandas
ignore_index=True is not supported, because it requires generating an order-sensitive index.
See also
concat()
- General function to concatenate DeferredDataFrame or DeferredSeries objects.
Notes
Iteratively appending to a DeferredSeries can be more computationally intensive than a single concatenate. A better solution is to append values to a list and then concatenate the list with the original DeferredSeries all at once.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> s1 = pd.Series([1, 2, 3])
>>> s2 = pd.Series([4, 5, 6])
>>> s3 = pd.Series([4, 5, 6], index=[3, 4, 5])
>>> s1.append(s2)
0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64
>>> s1.append(s3)
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
With `ignore_index` set to True:
>>> s1.append(s2, ignore_index=True)
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
With `verify_integrity` set to True:
>>> s1.append(s2, verify_integrity=True)
Traceback (most recent call last):
...
ValueError: Indexes have overlapping values: [0, 1, 2]
-
align
(other, join, axis, level, method, **kwargs)[source]¶ Align two objects on their axes with the specified join method.
Join method is specified for each axis Index.
Parameters: - other (DeferredDataFrame or DeferredSeries) –
- join ({'outer', 'inner', 'left', 'right'}, default 'outer') –
- axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).
- level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
- fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
- method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –
Method to use for filling holes in reindexed DeferredSeries:
- pad / ffill: propagate last valid observation forward to next valid.
- backfill / bfill: use NEXT valid observation to fill gap.
- limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
- fill_axis ({0 or 'index'}, default 0) – Filling axis, method and limit.
- broadcast_axis ({0 or 'index'}, default None) – Broadcast values along this axis, if aligning two objects of different dimensions.
Returns: (left, right) – Aligned objects.
Return type: (DeferredSeries, type of other)
Differences from pandas
Aligning per-level is not yet supported. Only the default, level=None, is allowed.
Filling NaN values via method is not supported, because it is order-sensitive. Only the default, method=None, is allowed.
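The join semantics described above can be sketched in plain Python. This is only an illustration of label alignment, not the pandas or Beam implementation: the `align` helper and the use of dicts as stand-ins for Series are assumptions made for this example, and the sorted label order for the outer and inner joins is chosen only to keep the output deterministic.

```python
import math


def align(left, right, join="outer", fill_value=math.nan):
    """Align two label->value dicts (hypothetical stand-ins for Series)."""
    if join == "outer":
        labels = sorted(set(left) | set(right))
    elif join == "inner":
        labels = sorted(set(left) & set(right))
    elif join == "left":
        labels = list(left)
    elif join == "right":
        labels = list(right)
    else:
        raise ValueError(f"unknown join: {join!r}")
    # Labels missing on one side are filled with fill_value.
    aligned_left = {label: left.get(label, fill_value) for label in labels}
    aligned_right = {label: right.get(label, fill_value) for label in labels}
    return aligned_left, aligned_right


left = {"a": 1, "b": 2}
right = {"b": 10, "c": 20}
outer_left, outer_right = align(left, right, join="outer", fill_value=0)
inner_left, inner_right = align(left, right, join="inner")
print(outer_left, outer_right)
print(inner_left, inner_right)
```

Both results always share the same labels, which is what allows element-wise operations on the aligned pair.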
-
argsort
(**kwargs) pandas.Series.argsort() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
array
pandas.Series.array() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
get
(**kwargs) pandas.Series.get() is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data. For more information see https://s.apache.org/dataframe-non-deferred-columns.
-
ravel
(**kwargs) pandas.Series.ravel() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
slice_shift
(**kwargs) pandas.Series.slice_shift() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.
-
tshift
(**kwargs) pandas.Series.tshift() is not yet supported in the Beam DataFrame API because it is deprecated in pandas.
-
rename
(**kwargs)¶ Alter Series index labels or name.
Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
Alternatively, change Series.name with a scalar value.
See the user guide for more.
Parameters: - axis ({0 or "index"}) – Unused. Accepted for compatibility with DeferredDataFrame method only.
- index (scalar, hashable sequence, dict-like or function, optional) – Functions or dict-like are transformations to apply to the index. Scalar or hashable sequence-like will alter the DeferredSeries.name attribute.
- **kwargs – Additional keyword arguments passed to the function. Only the “inplace” keyword is used.
Returns: DeferredSeries with index labels or name altered or None if inplace=True.
Return type: DeferredSeries or None
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.rename()
- Corresponding DeferredDataFrame method.
DeferredSeries.rename_axis()
- Set the name of the axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.rename("my_name")  # scalar, changes Series.name
0    1
1    2
2    3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2)  # function, changes labels
0    1
1    2
4    3
dtype: int64
>>> s.rename({1: 3, 2: 5})  # mapping, changes labels
0    1
3    2
5    3
dtype: int64
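The label-mapping rules for function and dict-like inputs can be sketched in plain Python. The `rename_labels` helper below is hypothetical and operates on a plain list of labels; it is meant only to show that unmapped labels are kept and that extra mapping keys are ignored, not how pandas or Beam implement rename.

```python
def rename_labels(index, mapper):
    """Return new labels after applying a function or dict-like mapper."""
    if callable(mapper):
        return [mapper(label) for label in index]
    # Labels missing from the mapping are left as-is; extra keys are ignored.
    return [mapper.get(label, label) for label in index]


index = [0, 1, 2]
squared = rename_labels(index, lambda x: x ** 2)
mapped = rename_labels(index, {1: 3, 2: 5})
partial = rename_labels(index, {1: 3, 99: 7})
print(squared, mapped, partial)
```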
-
between
(**kwargs)¶ Return boolean Series equivalent to left <= series <= right.
This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.
Parameters: - left (scalar or list-like) – Left boundary.
- right (scalar or list-like) – Right boundary.
- inclusive ({"both", "neither", "left", "right"}) – Include boundaries. Whether to set each bound as closed or open. Changed in version 1.3.0.
Returns: DeferredSeries representing whether each element is between left and right (inclusive).
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.gt()
- Greater than of series and other.
DeferredSeries.lt()
- Less than of series and other.
Notes
This function is equivalent to (left <= ser) & (ser <= right).
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([2, 0, 4, 8, np.nan])
Boundary values are included by default:
>>> s.between(1, 4)
0     True
1    False
2     True
3    False
4    False
dtype: bool
With `inclusive` set to ``"neither"`` boundary values are excluded:
>>> s.between(1, 4, inclusive="neither")
0     True
1    False
2    False
3    False
4    False
dtype: bool
`left` and `right` can be any scalar value:
>>> s = pd.Series(['Alice', 'Bob', 'Carol', 'Eve'])
>>> s.between('Anna', 'Daniel')
0    False
1     True
2     True
3    False
dtype: bool
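The equivalence stated in the Notes, together with the `inclusive` options and the rule that NA values are treated as False, can be sketched in plain Python. The `between` function below is a hypothetical list-based stand-in written for this illustration, not the pandas or Beam implementation.

```python
import math


def between(values, left, right, inclusive="both"):
    """Element-wise bounds check; NaN and None always yield False."""
    if inclusive not in {"both", "neither", "left", "right"}:
        raise ValueError(f"unknown inclusive option: {inclusive!r}")
    left_closed = inclusive in ("both", "left")
    right_closed = inclusive in ("both", "right")
    result = []
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            result.append(False)
            continue
        above = left <= value if left_closed else left < value
        below = value <= right if right_closed else value < right
        result.append(above and below)
    return result


numbers = [2, 0, 4, 8, math.nan]
both = between(numbers, 1, 4)
neither = between(numbers, 1, 4, inclusive="neither")
names = between(["Alice", "Bob", "Carol", "Eve"], "Anna", "Daniel")
print(both, neither, names)
```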
-
add_suffix
(**kwargs)¶ Suffix labels with string suffix.
For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.
Parameters: suffix (str) – The string to add after each label.
Returns: New DeferredSeries or DeferredDataFrame with updated labels.
Return type: DeferredSeries or DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.add_prefix()
- Prefix row labels with string prefix.
DeferredDataFrame.add_prefix()
- Prefix column labels with string prefix.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_suffix('_item')
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_suffix('_col')
   A_col  B_col
0      1      3
1      2      4
2      3      5
3      4      6
-
add_prefix
(**kwargs)¶ Prefix labels with string prefix.
For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.
Parameters: prefix (str) – The string to add before each label.
Returns: New DeferredSeries or DeferredDataFrame with updated labels.
Return type: DeferredSeries or DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.add_suffix()
- Suffix row labels with string suffix.
DeferredDataFrame.add_suffix()
- Suffix column labels with string suffix.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_prefix('item_')
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_prefix('col_')
   col_A  col_B
0      1      3
1      2      4
2      3      5
3      4      6
-
idxmin
(**kwargs)[source]¶ Return the row label of the minimum value.
If multiple values equal the minimum, the first row label with that value is returned.
Parameters: - axis (int, default 0) – For compatibility with DeferredDataFrame.idxmin. Redundant for application on DeferredSeries.
- skipna (bool, default True) – Exclude NA/null values. If the entire DeferredSeries is NA, the result will be NA.
- *args, **kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
Returns: Label of the minimum value.
Return type:
Raises: ValueError – If the DeferredSeries is empty.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
numpy.argmin()
- Return indices of the minimum values along the given axis.
DeferredDataFrame.idxmin()
- Return index of first occurrence of minimum over requested axis.
DeferredSeries.idxmax()
- Return index label of the first occurrence of maximum of values.
Notes
This method is the DeferredSeries version of ndarray.argmin. This method returns the label of the minimum, while ndarray.argmin returns the position. To get the position, use series.values.argmin().
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series(data=[1, None, 4, 1],
...               index=['A', 'B', 'C', 'D'])
>>> s
A    1.0
B    NaN
C    4.0
D    1.0
dtype: float64
>>> s.idxmin()
'A'
If `skipna` is False and there is an NA value in the data, the function returns ``nan``.
>>> s.idxmin(skipna=False)
nan
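The difference between a label and a position, the first-occurrence tie rule, and the `skipna` behaviour documented above can be sketched in plain Python. The `idxmin` and `_is_na` helpers are hypothetical stand-ins that take separate value and label lists; they mirror the documented behaviour rather than the pandas or Beam implementation.

```python
import math


def _is_na(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def idxmin(values, index, skipna=True):
    """Return the label of the first minimum, following the rules above."""
    if not values:
        raise ValueError("attempt to get idxmin of an empty sequence")
    best_label, best_value = None, None
    for label, value in zip(index, values):
        if _is_na(value):
            if skipna:
                continue
            return math.nan  # an NA is present and skipna is False
        # Strict comparison keeps the first label when values tie.
        if best_value is None or value < best_value:
            best_label, best_value = label, value
    return math.nan if best_value is None else best_label


values = [1, None, 4, 1]
labels = ["A", "B", "C", "D"]
label_of_min = idxmin(values, labels)
position_of_min = labels.index(label_of_min)
print(label_of_min, position_of_min)
```

The same loop with the comparison reversed gives the idxmax behaviour.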
-
idxmax
(**kwargs)[source]¶ Return the row label of the maximum value.
If multiple values equal the maximum, the first row label with that value is returned.
Parameters: - axis (int, default 0) – For compatibility with DeferredDataFrame.idxmax. Redundant for application on DeferredSeries.
- skipna (bool, default True) – Exclude NA/null values. If the entire DeferredSeries is NA, the result will be NA.
- *args, **kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
Returns: Label of the maximum value.
Return type:
Raises: ValueError – If the DeferredSeries is empty.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
numpy.argmax()
- Return indices of the maximum values along the given axis.
DeferredDataFrame.idxmax()
- Return index of first occurrence of maximum over requested axis.
DeferredSeries.idxmin()
- Return index label of the first occurrence of minimum of values.
Notes
This method is the DeferredSeries version of ndarray.argmax. This method returns the label of the maximum, while ndarray.argmax returns the position. To get the position, use series.values.argmax().
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series(data=[1, None, 4, 3, 4],
...               index=['A', 'B', 'C', 'D', 'E'])
>>> s
A    1.0
B    NaN
C    4.0
D    3.0
E    4.0
dtype: float64
>>> s.idxmax()
'C'
If `skipna` is False and there is an NA value in the data, the function returns ``nan``.
>>> s.idxmax(skipna=False)
nan
-
explode
(ignore_index)[source]¶ Transform each element of a list-like to a row.
New in version 0.25.0.
Parameters: ignore_index (bool, default False) – If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.1.0.
Returns: Exploded lists to rows; index will be duplicated for these rows.
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.str.split()
- Split string values on specified separator.
DeferredSeries.unstack()
- Unstack, a.k.a. pivot, DeferredSeries with MultiIndex to produce DeferredDataFrame.
DeferredDataFrame.melt()
- Unpivot a DeferredDataFrame from wide format to long format.
DeferredDataFrame.explode()
- Explode a DeferredDataFrame from list-like columns to long format.
Notes
This routine will explode list-likes including lists, tuples, sets, DeferredSeries, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of elements in the output will be non-deterministic when exploding sets.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
>>> s
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object
>>> s.explode()
0      1
0      2
0      3
1    foo
2    NaN
3      3
3      4
dtype: object
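The rules in the Notes (list-likes become one row per element with a duplicated label, scalars pass through, empty list-likes become NaN) can be sketched in plain Python. The `explode` helper is a hypothetical stand-in that returns `(label, value)` pairs and only recognises lists, tuples, and sets as list-like, which is narrower than the list in the Notes.

```python
import math


def explode(values, index):
    """Return (label, value) rows, one per element of each list-like."""
    rows = []
    for label, value in zip(index, values):
        if isinstance(value, (list, tuple, set)):
            if len(value) == 0:
                rows.append((label, math.nan))  # empty list-like -> NaN row
            else:
                rows.extend((label, item) for item in value)
        else:
            rows.append((label, value))  # scalars are returned unchanged
    return rows


rows = explode([[1, 2, 3], "foo", [], [3, 4]], index=[0, 1, 2, 3])
for label, value in rows:
    print(label, value)
```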
-
dot
(other)[source]¶ Compute the matrix multiplication between the DataFrame and other.
This method computes the matrix product between the DataFrame and the values of an other Series, DataFrame or a numpy array.
It can also be called using self @ other in Python >= 3.5.
Parameters: other (DeferredSeries, DeferredDataFrame or array-like) – The other object to compute the matrix product with.
Returns: If other is a DeferredSeries, return the matrix product between self and other as a DeferredSeries. If other is a DeferredDataFrame or a numpy.array, return the matrix product of self and other in a DeferredDataFrame or a np.array.
Return type: DeferredSeries or DeferredDataFrame
Differences from pandas
other must be a DeferredDataFrame or DeferredSeries instance. Computing the dot product with an array-like is not supported because it is order-sensitive.
See also
DeferredSeries.dot()
- Similar method for DeferredSeries.
Notes
The dimensions of DeferredDataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of DeferredDataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.
The dot method for DeferredSeries computes the inner product, instead of the matrix product here.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
Here we multiply a DataFrame with a Series.
>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> s = pd.Series([1, 1, 2, 1])
>>> df.dot(s)
0    -4
1     5
dtype: int64
Here we multiply a DataFrame with another DataFrame.
>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(other)
   0  1
0  1  4
1  2  2
Note that the dot method gives the same result as @.
>>> df @ other
   0  1
0  1  4
1  2  2
The dot method also works if other is an np.array.
>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(arr)
   0  1
0  1  4
1  2  2
Note how shuffling of the objects does not change the result.
>>> s2 = s.reindex([1, 0, 2, 3])
>>> df.dot(s2)
0    -4
1     5
dtype: int64
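The label alignment mentioned in the Notes is why reordering the operands leaves the result unchanged. A minimal plain-Python sketch, using dicts keyed by label as hypothetical stand-ins for a DataFrame row and a Series (the `dot` helper is not the pandas or Beam implementation):

```python
def dot(left, right):
    """Inner product of two label->value dicts, matched by label."""
    if set(left) != set(right):
        raise ValueError("labels are not aligned")
    # Values are paired by label, so insertion order does not matter.
    return sum(left[label] * right[label] for label in left)


row0 = {0: 0, 1: 1, 2: -2, 3: -1}
row1 = {0: 1, 1: 1, 2: 1, 3: 1}
s = {0: 1, 1: 1, 2: 2, 3: 1}
s_shuffled = {1: 1, 0: 1, 2: 2, 3: 1}  # same labels, different order

result = [dot(row0, s), dot(row1, s)]
result_shuffled = [dot(row0, s_shuffled), dot(row1, s_shuffled)]
print(result, result_shuffled)
```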
-
nunique
(**kwargs)[source]¶ Return number of unique elements in the object.
Excludes NA values by default.
Parameters: dropna (bool, default True) – Don’t include NaN in the count.
Return type: int
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.nunique()
- Method nunique for DeferredDataFrame.
DeferredSeries.count()
- Count non-NA/null observations in the DeferredSeries.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([1, 3, 5, 7, 7])
>>> s
0    1
1    3
2    5
3    7
4    7
dtype: int64
>>> s.nunique()
4
-
quantile
(q, **kwargs)[source]¶ Return value at the given quantile.
Parameters: - q (float or array-like, default 0.5 (50% quantile)) – The quantile(s) to compute, which can lie in range: 0 <= q <= 1.
- interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
- linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
- lower: i.
- higher: j.
- nearest: i or j whichever is nearest.
- midpoint: (i + j) / 2.
Returns: If q is an array, a DeferredSeries will be returned where the index is q and the values are the quantiles, otherwise a float will be returned.
Return type: float or DeferredSeries
Differences from pandas
quantile is not parallelizable. See BEAM-12167 tracking the possible addition of an approximate, parallelizable implementation of quantile.
See also
core.window.Rolling.quantile()
- Calculate the rolling quantile.
numpy.percentile()
- Returns the q-th percentile(s) of the array elements.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> s = pd.Series([1, 2, 3, 4])
>>> s.quantile(.5)
2.5
>>> s.quantile([.25, .5, .75])
0.25    1.75
0.50    2.50
0.75    3.25
dtype: float64
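The interpolation options listed above can be sketched in plain Python. The `quantile` function below is a hypothetical stand-in, not the pandas or Beam implementation; it assumes the fractional position is `q * (n - 1)` in the sorted data, and the tie rule used for `nearest` is an assumption of this sketch. Note that it sorts the complete data set first, which hints at why an exact quantile is hard to parallelize.

```python
import math


def quantile(values, q, interpolation="linear"):
    """Quantile of a non-empty list using the interpolation rules above."""
    if not values:
        raise ValueError("cannot compute a quantile of empty data")
    if not 0 <= q <= 1:
        raise ValueError("q must satisfy 0 <= q <= 1")
    data = sorted(values)  # needs all of the data in one place
    position = q * (len(data) - 1)
    lower_idx = math.floor(position)
    upper_idx = math.ceil(position)
    i, j = data[lower_idx], data[upper_idx]
    fraction = position - lower_idx
    if interpolation == "linear":
        return i + (j - i) * fraction
    if interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "midpoint":
        return (i + j) / 2
    if interpolation == "nearest":
        # Tie handling at fraction == 0.5 is an assumption of this sketch.
        return i if fraction < 0.5 else j
    raise ValueError(f"unknown interpolation: {interpolation!r}")


data = [1, 2, 3, 4]
median = quantile(data, 0.5)
quartiles = [quantile(data, q) for q in (0.25, 0.5, 0.75)]
print(median, quartiles)
```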
-
std
(*args, **kwargs)[source]¶ Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: - axis ({index (0)}) –
- skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
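The effect of `ddof` on the divisor can be checked with a short plain-Python sketch. The `std` function here is a hypothetical stand-in written for illustration; the standard library's `statistics.stdev` (divisor N - 1) and `statistics.pstdev` (divisor N) are used only as reference points.

```python
import math
import statistics


def std(values, ddof=1):
    """Standard deviation with divisor N - ddof."""
    n = len(values)
    if n - ddof <= 0:
        return math.nan  # not enough observations for this ddof
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))


data = [1, 2, 3, 4]
sample_std = std(data)              # divisor N - 1, the documented default
population_std = std(data, ddof=0)  # divisor N, the numpy.std default
print(sample_std, statistics.stdev(data))
print(population_std, statistics.pstdev(data))
```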
-
var
(axis, skipna, level, ddof, **kwargs)[source]¶ Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: - axis ({index (0)}) –
- skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
Per-level aggregation is not yet supported (BEAM-11777). Only the default, level=None, is allowed.
Notes
To have the same behaviour as numpy.var, use ddof=0 (instead of the default ddof=1)
-
corr
(other, method, min_periods)[source]¶ Compute correlation with other Series, excluding missing values.
Parameters: - other (DeferredSeries) – DeferredSeries with which to compute the correlation.
- method ({'pearson', 'kendall', 'spearman'} or callable) –
Method used to compute correlation:
- pearson : Standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- callable: Callable with input two 1d ndarrays and returning a float.
Warning
Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.
- min_periods (int, optional) – Minimum number of observations needed to have a valid result.
Returns: Correlation with other.
Return type:
Differences from pandas
Only method='pearson' is currently parallelizable.
See also
DeferredDataFrame.corr()
- Compute pairwise correlation between columns.
DeferredDataFrame.corrwith()
- Compute pairwise correlation with another DeferredDataFrame or DeferredSeries.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> s1 = pd.Series([.2, .0, .6, .2])
>>> s2 = pd.Series([.3, .6, .0, .1])
>>> s1.corr(s2, method=histogram_intersection)
0.3
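One way to see why the Pearson coefficient lends itself to parallel computation is that it can be derived from a handful of sums, and sums from separate partitions can simply be added together. The sketch below is an illustration under that assumption, with hypothetical helper names; it is not Beam's implementation, and the naive sum-of-squares formula it uses is not numerically robust for large or badly scaled data. Rank-based methods such as Spearman and Kendall need a global ordering or pairwise comparisons, which this kind of summary cannot provide.

```python
import math


def partial_sums(xs, ys):
    """Summarize one partition as (n, sum x, sum y, sum x^2, sum y^2, sum xy)."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(y * y for y in ys),
        sum(x * y for x, y in zip(xs, ys)),
    )


def combine(a, b):
    """Merge two partition summaries by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def pearson(summary):
    """Pearson correlation computed from a merged summary."""
    n, sx, sy, sxx, syy, sxy = summary
    covariance_term = n * sxy - sx * sy
    variance_x_term = n * sxx - sx * sx
    variance_y_term = n * syy - sy * sy
    return covariance_term / math.sqrt(variance_x_term * variance_y_term)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 4.0]
whole = pearson(partial_sums(xs, ys))
split = pearson(
    combine(partial_sums(xs[:2], ys[:2]), partial_sums(xs[2:], ys[2:]))
)
print(whole, split)
```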
-
skew
(axis, skipna, level, numeric_only, **kwargs)[source]¶ Return unbiased skew over requested axis.
Normalized by N-1.
Parameters: - axis ({index (0)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
-
kurtosis
(axis, skipna, level, numeric_only, **kwargs)[source]¶ Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
Parameters: - axis ({index (0)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
-
kurt
(*args, **kwargs)[source]¶ Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
Parameters: - axis ({index (0)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
-
cov
(other, min_periods, ddof)[source]¶ Compute covariance with Series, excluding missing values.
Parameters: - other (DeferredSeries) – DeferredSeries with which to compute the covariance.
- min_periods (int, optional) – Minimum number of observations needed to have a valid result.
- ddof (int, default 1) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. New in version 1.1.0.
Returns: Covariance between DeferredSeries and other normalized by N-1 (unbiased estimator).
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.cov()
- Compute pairwise covariance of columns.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s1 = pd.Series([0.90010907, 0.13484424, 0.62036035])
>>> s2 = pd.Series([0.12528585, 0.26962463, 0.51111198])
>>> s1.cov(s2)
-0.01685762652715874
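The divisor `N - ddof` and the exclusion of missing values can be sketched in plain Python. The `cov` and `_is_na` helpers are hypothetical stand-ins for illustration, not the pandas or Beam implementation; dropping a pair when either side is missing is how this sketch interprets "excluding missing values".

```python
import math


def _is_na(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def cov(xs, ys, ddof=1):
    """Covariance of paired values with divisor N - ddof."""
    # Drop a pair when either value is missing.
    pairs = [(x, y) for x, y in zip(xs, ys) if not (_is_na(x) or _is_na(y))]
    n = len(pairs)
    if n - ddof <= 0:
        return math.nan
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - ddof)


s1 = [0.90010907, 0.13484424, 0.62036035]
s2 = [0.12528585, 0.26962463, 0.51111198]
covariance = cov(s1, s2)
print(covariance)
```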
-
dropna
(**kwargs)[source]¶ Return a new Series with missing values removed.
See the User Guide for more on which values are considered missing, and how to work with missing data.
Parameters:
Returns: DeferredSeries with NA entries dropped from it or None if inplace=True.
Return type: DeferredSeries or None
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.isna()
- Indicate missing values.
DeferredSeries.notna()
- Indicate existing (non-missing) values.
DeferredSeries.fillna()
- Replace missing values.
DeferredDataFrame.dropna()
- Drop rows or columns which contain NA values.
Index.dropna()
- Drop missing indices.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> ser = pd.Series([1., 2., np.nan])
>>> ser
0    1.0
1    2.0
2    NaN
dtype: float64
Drop NA values from a Series.
>>> ser.dropna()
0    1.0
1    2.0
dtype: float64
Keep the Series with valid entries in the same variable.
>>> ser.dropna(inplace=True)
>>> ser
0    1.0
1    2.0
dtype: float64
Empty strings are not considered NA values. ``None`` is considered an NA value.
>>> ser = pd.Series([np.NaN, 2, pd.NaT, '', None, 'I stay'])
>>> ser
0       NaN
1         2
2       NaT
3
4      None
5    I stay
dtype: object
>>> ser.dropna()
1         2
3
5    I stay
dtype: object
-
set_axis
(labels, **kwargs)[source]¶ Assign desired index to given axis.
Indexes for row labels can be changed by assigning a list-like or Index.
Parameters:
Returns: renamed – An object of type DeferredSeries or None if inplace=True.
Return type: DeferredSeries or None
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.rename_axis()
- Alter the name of the index.
Examples
>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.set_axis(['a', 'b', 'c'], axis=0)
a    1
b    2
c    3
dtype: int64
-
isnull
(**kwargs)¶ Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns: Mask of bool values for each element in DeferredSeries that indicates whether an element is an NA value.
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.isnull()
- Alias of isna.
DeferredSeries.notna()
- Boolean inverse of isna.
DeferredSeries.dropna()
- Omit axes labels with missing values.
isna()
- Top-level isna.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
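The rules above can be checked with a short plain-pandas sketch (eager pandas, not the deferred Beam API). It assumes the default setting in which infinity is not treated as NA; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series([5.0, np.nan, np.inf])
text = pd.Series(['Alfred', '', None])

na_mask = ser.isnull()      # NaN is NA; inf is not (by default)
text_mask = text.isnull()   # None is NA; the empty string is not

# isnull is an alias of isna, so both masks agree element for element.
same_as_isna = na_mask.equals(ser.isna())
```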
-
isna
(**kwargs)¶ Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns: Mask of bool values for each element in DeferredSeries that indicates whether an element is an NA value.
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.isnull()
- Alias of isna.
DeferredSeries.notna()
- Boolean inverse of isna.
DeferredSeries.dropna()
- Omit axes labels with missing values.
isna()
- Top-level isna.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
-
notnull
(**kwargs)¶ Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
Returns: Mask of bool values for each element in DeferredSeries that indicates whether an element is not an NA value.
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.notnull()
- Alias of notna.
DeferredSeries.isna()
- Boolean inverse of notna.
DeferredSeries.dropna()
- Omit axes labels with missing values.
notna()
- Top-level notna.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
dtype: bool
-
notna
(**kwargs)¶ Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
Returns: Mask of bool values for each element in DeferredSeries that indicates whether an element is not an NA value.
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.notnull()
- Alias of notna.
DeferredSeries.isna()
- Boolean inverse of notna.
DeferredSeries.dropna()
- Omit axes labels with missing values.
notna()
- Top-level notna.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
dtype: bool
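The relationship between notna and isna can be sketched in plain pandas (eager, not the deferred Beam API). The mask is also a common way to keep only the non-missing entries; the names below are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series([5, 6, np.nan])

present = ser.notna()

# notna is the element-wise negation of isna.
is_inverse = present.equals(~ser.isna())

# Boolean indexing with the mask keeps only the non-missing entries.
kept = ser[present]
```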
-
items
(**kwargs)¶ pandas.Series.items() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
iteritems
(**kwargs)¶ pandas.Series.iteritems() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
tolist
(**kwargs)¶ pandas.Series.tolist() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
to_numpy
(**kwargs)¶ pandas.Series.to_numpy() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
to_string
(**kwargs)¶ pandas.Series.to_string() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
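The toy sketch below is not Beam's implementation. ToyDeferredSeries is a hypothetical class used only to illustrate the idea behind these entries: a deferred object records the operations to run later and holds no concrete values, so a method that must hand back a Python list, array, or string has nothing to return at pipeline-construction time.

```python
class ToyDeferredSeries:
    """Hypothetical stand-in for a deferred series: it records work, it holds no data."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def notna(self):
        # Deferred-friendly: the result is another deferred object.
        return ToyDeferredSeries(self.steps + ('notna',))

    def tolist(self):
        # A Python list needs concrete values, which do not exist
        # until the recorded steps are actually executed.
        raise NotImplementedError('tolist() would produce a non-deferred result')


plan = ToyDeferredSeries().notna()

try:
    plan.tolist()
    tolist_failed = False
except NotImplementedError:
    tolist_failed = True
```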
-
duplicated
(keep)[source]¶ Indicate duplicate Series values.
Duplicated values are indicated as True values in the resulting Series. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated.
Parameters: keep ({'first', 'last', False}, default 'first') – Method to handle dropping duplicates:
- 'first' : Mark duplicates as True except for the first occurrence.
- 'last' : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
Returns: DeferredSeries indicating whether each value has occurred in the preceding values.
Return type: DeferredSeries[bool]
Differences from pandas
Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about which duplicate element is kept.
See also
Index.duplicated()
- Equivalent method on pandas.Index.
DeferredDataFrame.duplicated()
- Equivalent method on DeferredDataFrame.
DeferredSeries.drop_duplicates()
- Remove duplicate values from DeferredSeries.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
By default, for each set of duplicated values, the first occurrence is set on False and all others on True:

>>> animals = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> animals.duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool

which is equivalent to

>>> animals.duplicated(keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool

By using 'last', the last occurrence of each set of duplicated values is set on False and all others on True:

>>> animals.duplicated(keep='last')
0     True
1    False
2     True
3    False
4    False
dtype: bool

By setting keep on ``False``, all duplicates are True:

>>> animals.duplicated(keep=False)
0     True
1    False
2     True
3    False
4     True
dtype: bool
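To see why keep='first' is order-sensitive while keep=False is not, the plain-pandas sketch below (eager pandas, not the Beam API) computes both masks on the same data in original and reversed row order. The variable names are illustrative.

```python
import pandas as pd

animals = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
reversed_rows = animals.iloc[::-1]  # same labels and values, opposite order

# keep=False marks every member of a duplicated group, so row order is irrelevant.
all_fwd = animals.duplicated(keep=False)
all_rev = reversed_rows.duplicated(keep=False).sort_index()
order_insensitive = all_fwd.equals(all_rev)

# keep='first' depends on which row is encountered first.
first_fwd = animals.duplicated(keep='first')
first_rev = reversed_rows.duplicated(keep='first').sort_index()
order_sensitive = not first_fwd.equals(first_rev)
```

Because elements of a distributed dataset have no inherent order, only the order-insensitive variants map cleanly onto it.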
-
drop_duplicates
(keep)[source]¶ Return Series with duplicate values removed.
Parameters: - keep ({'first', 'last', False}, default 'first') – Method to handle dropping duplicates:
- 'first' : Drop duplicates except for the first occurrence.
- 'last' : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
- inplace (bool, default False) – If True, performs operation inplace and returns None.
Returns: DeferredSeries with duplicates dropped or None if inplace=True.
Differences from pandas
Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about which duplicate element is kept.
See also
Index.drop_duplicates()
- Equivalent method on Index.
DeferredDataFrame.drop_duplicates()
- Equivalent method on DeferredDataFrame.
DeferredSeries.duplicated()
- Related method on DeferredSeries, indicating duplicate DeferredSeries values.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
Generate a Series with duplicated entries.

>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'],
...               name='animal')
>>> s
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object

With the 'keep' parameter, the selection behaviour of duplicated values can be changed. The value 'first' keeps the first occurrence for each set of duplicated entries. The default value of keep is 'first'.

>>> s.drop_duplicates()
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object

The value 'last' for parameter 'keep' keeps the last occurrence for each set of duplicated entries.

>>> s.drop_duplicates(keep='last')
1       cow
3    beetle
4      lama
5     hippo
Name: animal, dtype: object

The value ``False`` for parameter 'keep' discards all sets of duplicated entries. Setting the value of 'inplace' to ``True`` performs the operation inplace and returns ``None``.

>>> s.drop_duplicates(keep=False, inplace=True)
>>> s
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
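The plain-pandas sketch below (eager pandas, not the Beam API) shows what the choice of keep affects. 'first' and 'last' retain the same set of values but different rows, which is the distinction that keep="any" declines to guarantee; keep=False retains only values that never repeat. Names are illustrative.

```python
import pandas as pd

s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')

first = s.drop_duplicates(keep='first')
last = s.drop_duplicates(keep='last')
never_repeated = s.drop_duplicates(keep=False)

# Same distinct values either way...
same_values = sorted(first.tolist()) == sorted(last.tolist())
# ...but the surviving 'lama' row differs (index label 0 versus 4).
different_rows = first.index.tolist() != last.index.tolist()
```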
-
sample
(**kwargs)[source]¶ Return a random sample of items from an axis of object.
You can use random_state for reproducibility.
Parameters: - n (int, optional) – Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
- frac (float, optional) – Fraction of axis items to return. Cannot be used with n.
- replace (bool, default False) – Allow or disallow sampling of the same row more than once.
- weights (str or ndarray-like, optional) – Default ‘None’ results in equal probability weighting. If passed a DeferredSeries, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DeferredDataFrame, will accept the name of a column when axis = 0. Unless weights are a DeferredSeries, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. Infinite values not allowed.
- random_state (int, array-like, BitGenerator, np.random.RandomState, optional) – If int, array-like, or BitGenerator (NumPy>=1.17), seed for random number generator. If np.random.RandomState, use as numpy RandomState object.
Changed in version 1.1.0: array-like and BitGenerator (for NumPy>=1.17) object now passed to np.random.RandomState() as seed
- axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – Axis to sample. Accepts axis number or name. Default is stat axis for given data type (0 for DeferredSeries and DeferredDataFrames).
- ignore_index (bool, default False) –
If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.3.0.
Returns: A new object of same type as caller containing n items randomly sampled from the caller object.
Differences from pandas
Only n and/or weights may be specified. frac, random_state, and replace=True are not yet supported. See BEAM-12476.
Note that pandas will raise an error if n is larger than the length of the dataset, while the Beam DataFrame API will simply return the full dataset in that case.
See also
DeferredDataFrameGroupBy.sample()
- Generates random samples from each group of a DeferredDataFrame object.
DeferredSeriesGroupBy.sample()
- Generates random samples from each group of a DeferredSeries object.
numpy.random.choice()
- Generates a random sample from a given 1-D numpy array.
Notes
If frac > 1, replacement should be set to True.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

Extract 3 random elements from the ``Series`` ``df['num_legs']``: Note that we use `random_state` to ensure the reproducibility of the examples.

>>> df['num_legs'].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64

A random 50% sample of the ``DataFrame`` with replacement:

>>> df.sample(frac=0.5, replace=True, random_state=1)
      num_legs  num_wings  num_specimen_seen
dog          4          0                  2
fish         0          0                  8

An upsample sample of the ``DataFrame`` with replacement: Note that `replace` parameter has to be `True` for `frac` parameter > 1.

>>> df.sample(frac=2, replace=True, random_state=1)
        num_legs  num_wings  num_specimen_seen
dog            4          0                  2
fish           0          0                  8
falcon         2          2                 10
falcon         2          2                 10
fish           0          0                  8
dog            4          0                  2
fish           0          0                  8
dog            4          0                  2

Using a DataFrame column as weights. Rows with larger value in the `num_specimen_seen` column are more likely to be sampled.

>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8
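The difference in handling an oversized n can be sketched in plain pandas. The first part shows the pandas error; the second part only emulates, by capping n, the "return the full dataset" behavior that the text above attributes to Beam. It does not call the Beam API.

```python
import pandas as pd

s = pd.Series([2, 4, 8, 0], index=['falcon', 'dog', 'spider', 'fish'])

# pandas refuses to draw more items than exist when sampling without replacement.
try:
    s.sample(n=10)
    pandas_raised = False
except ValueError:
    pandas_raised = True

# Emulation of the behavior described for Beam: cap n at the dataset size,
# which yields every element (in some random order).
capped = s.sample(n=min(10, len(s)))
```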
-
aggregate
(func, axis, *args, **kwargs)[source]¶ Aggregate using one or more operations over the specified axis.
Parameters: - func (function, str, list or dict) – Function to use for aggregating the data. If a function, must either work when passed a DeferredSeries or when passed to DeferredSeries.apply.
Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g. [np.sum, 'mean']
- dict of axis labels -> functions, function names or list of such.
- axis ({0 or 'index'}) – Parameter needed for compatibility with DeferredDataFrame.
- *args – Positional arguments to pass to func.
- **kwargs – Keyword arguments to pass to func.
Returns: The return can be:
- scalar : when DeferredSeries.agg is called with single function
- DeferredSeries : when DeferredDataFrame.agg is called with a single function
- DeferredDataFrame : when DeferredDataFrame.agg is called with several functions
Return scalar, DeferredSeries or DeferredDataFrame.
Return type: scalar, DeferredSeries or DeferredDataFrame
Differences from pandas
Some aggregation methods cannot be parallelized, and computing them will require collecting all data on a single machine.
See also
DeferredSeries.apply()
- Invoke function on a DeferredSeries.
DeferredSeries.transform()
- Transform function producing a DeferredSeries with like indexes.
Notes
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.
A passed user-defined-function will be passed a DeferredSeries for evaluation.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.agg('min')
1
>>> s.agg(['min', 'max'])
min    1
max    4
dtype: int64
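How the return type follows the form of func can be sketched in plain pandas (eager, not the deferred Beam API, where results stay deferred). A single function name gives a scalar, and a list of names gives a Series indexed by those names.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

single = s.agg('min')            # one function name -> scalar
several = s.agg(['min', 'max'])  # list of names -> Series indexed by name
```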
-
agg
(func, axis, *args, **kwargs)¶ Aggregate using one or more operations over the specified axis.
Parameters: - func (function, str, list or dict) – Function to use for aggregating the data. If a function, must either work when passed a DeferredSeries or when passed to DeferredSeries.apply.
Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g. [np.sum, 'mean']
- dict of axis labels -> functions, function names or list of such.
- axis ({0 or 'index'}) – Parameter needed for compatibility with DeferredDataFrame.
- *args – Positional arguments to pass to func.
- **kwargs – Keyword arguments to pass to func.
Returns: The return can be:
- scalar : when DeferredSeries.agg is called with single function
- DeferredSeries : when DeferredDataFrame.agg is called with a single function
- DeferredDataFrame : when DeferredDataFrame.agg is called with several functions
Return scalar, DeferredSeries or DeferredDataFrame.
Return type: scalar, DeferredSeries or DeferredDataFrame
Differences from pandas
Some aggregation methods cannot be parallelized, and computing them will require collecting all data on a single machine.
See also
DeferredSeries.apply()
- Invoke function on a DeferredSeries.
DeferredSeries.transform()
- Transform function producing a DeferredSeries with like indexes.
Notes
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.
A passed user-defined-function will be passed a DeferredSeries for evaluation.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.agg('min')
1
>>> s.agg(['min', 'max'])
min    1
max    4
dtype: int64
-
axes
¶ Return a list of the row axis labels.
Differences from pandas
This operation has no known divergences from the pandas API.
-
clip
(**kwargs)¶
-
all
(*args, **kwargs)¶ Return whether all elements are True, potentially over an axis.
Returns True unless there is at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).
Parameters: - axis ({0 or 'index', 1 or 'columns', None}, default 0) – Indicate which axis or axes should be reduced.
- 0 / ‘index’ : reduce the index, return a DeferredSeries whose index is the original column labels.
- 1 / ‘columns’ : reduce the columns, return a DeferredSeries whose index is the original index.
- None : reduce all axes, return a scalar.
- bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for DeferredSeries.
- skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: If level is specified, then, DeferredSeries is returned; otherwise, scalar is returned.
Return type: scalar or DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.all()
- Return True if all elements are True.
DeferredDataFrame.any()
- Return True if one (or more) elements are True.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
**Series**

>>> pd.Series([True, True]).all()
True
>>> pd.Series([True, False]).all()
False
>>> pd.Series([], dtype="float64").all()
True
>>> pd.Series([np.nan]).all()
True
>>> pd.Series([np.nan]).all(skipna=False)
True

**DataFrames**

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df
   col1   col2
0  True   True
1  True  False

Default behaviour checks if column-wise values all return True.

>>> df.all()
col1     True
col2    False
dtype: bool

Specify ``axis='columns'`` to check if row-wise values all return True.

>>> df.all(axis='columns')
0     True
1    False
dtype: bool

Or ``axis=None`` for whether every value is True.

>>> df.all(axis=None)
False
-
any
(*args, **kwargs)¶ Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
Parameters: - axis ({0 or 'index', 1 or 'columns', None}, default 0) – Indicate which axis or axes should be reduced.
- 0 / ‘index’ : reduce the index, return a DeferredSeries whose index is the original column labels.
- 1 / ‘columns’ : reduce the columns, return a DeferredSeries whose index is the original index.
- None : reduce all axes, return a scalar.
- bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for DeferredSeries.
- skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: If level is specified, then, DeferredSeries is returned; otherwise, scalar is returned.
Return type: scalar or DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
See also
numpy.any()
- Numpy version of this method.
DeferredSeries.any()
- Return whether any element is True.
DeferredSeries.all()
- Return whether all elements are True.
DeferredDataFrame.any()
- Return whether any element is True over requested axis.
DeferredDataFrame.all()
- Return whether all elements are True over requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
**Series**

For Series input, the output is a scalar indicating whether any element is True.

>>> pd.Series([False, False]).any()
False
>>> pd.Series([True, False]).any()
True
>>> pd.Series([], dtype="float64").any()
False
>>> pd.Series([np.nan]).any()
False
>>> pd.Series([np.nan]).any(skipna=False)
True

**DataFrame**

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
   A  B  C
0  1  0  0
1  2  2  0
>>> df.any()
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})
>>> df
       A  B
0   True  1
1  False  2
>>> df.any(axis='columns')
0    True
1    True
dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})
>>> df
       A  B
0   True  1
1  False  0
>>> df.any(axis='columns')
0     True
1    False
dtype: bool

Aggregating over the entire DataFrame with ``axis=None``.

>>> df.any(axis=None)
True

`any` for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()
Series([], dtype: bool)
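The skipna rules for all and any are easy to mix up, so the plain-pandas sketch below (eager, not the deferred Beam API) collects the edge cases side by side. With skipna=True an all-NA Series behaves like an empty one; with skipna=False, NaN counts as True because it is not equal to zero.

```python
import numpy as np
import pandas as pd

only_nan = pd.Series([np.nan])
empty = pd.Series([], dtype='float64')

results = {
    'all, skipna=True': bool(only_nan.all()),
    'any, skipna=True': bool(only_nan.any()),
    'any, skipna=False': bool(only_nan.any(skipna=False)),
    'all on empty': bool(empty.all()),
    'any on empty': bool(empty.any()),
}
```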
-
count
(*args, **kwargs)¶ Return number of non-NA/null observations in the Series.
Parameters: level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller DeferredSeries.
Returns: Number of non-null values in the DeferredSeries.
Return type: int or DeferredSeries (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.count()
- Count non-NA cells for each column or row.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([0.0, 1.0, np.nan])
>>> s.count()
2
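A brief plain-pandas sketch (eager, not the deferred Beam API) contrasting count with the total length: count skips NA entries, so the difference between the two is the number of missing values.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan])

non_null = s.count()        # NA entries are not counted
total = len(s)              # every entry, including NA
missing = total - non_null  # number of NA entries
```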
-
describe
(*args, **kwargs)¶ Generate descriptive statistics.
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
Parameters: - percentiles (list-like of numbers, optional) – The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
- include ('all', list-like of dtypes or None (default), optional) – A white list of data types to include in the result. Ignored for DeferredSeries. Here are the options:
- 'all' : All columns of the input will be included in the output.
- A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
- None (default) : The result will include all numeric columns.
- exclude (list-like of dtypes or None (default), optional) – A black list of data types to omit from the result. Ignored for DeferredSeries. Here are the options:
- A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'
- None (default) : The result will exclude nothing.
- datetime_is_numeric (bool, default False) – Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DeferredDataFrame input, this also controls whether datetime columns are included by default.
New in version 1.1.0.
Returns: Summary statistics of the DeferredSeries or Dataframe provided.
Differences from pandas
describe cannot currently be parallelized. It will require collecting all data on a single node.
See also
DeferredDataFrame.count()
- Count number of non-NA/null observations.
DeferredDataFrame.max()
- Maximum of the values in the object.
DeferredDataFrame.min()
- Minimum of the values in the object.
DeferredDataFrame.mean()
- Mean of the values.
DeferredDataFrame.std()
- Standard deviation of the observations.
DeferredDataFrame.select_dtypes()
- Subset of a DeferredDataFrame including/excluding columns based on their dtype.
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DeferredDataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DeferredDataFrame are analyzed for the output. The parameters are ignored when analyzing a DeferredSeries.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
- percentiles (list-like of numbers, optional) – The percentiles to include in the output. All should
fall between 0 and 1. The default is
-
min
(*args, **kwargs)¶ Return the minimum of the values over the requested axis.
If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
Parameters: - axis ({index (0)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.sum()
- Return the sum.
DeferredSeries.min()
- Return the minimum.
DeferredSeries.max()
- Return the maximum.
DeferredSeries.idxmin()
- Return the index of the minimum.
DeferredSeries.idxmax()
- Return the index of the maximum.
DeferredDataFrame.sum()
- Return the sum over the requested axis.
DeferredDataFrame.min()
- Return the minimum over the requested axis.
DeferredDataFrame.max()
- Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
- Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
- Return the index of the maximum over the requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.min()
0
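The effect of the skipna parameter can be sketched in plain Python. The helper below is hypothetical (series_min is not part of Beam or pandas) and only mirrors the documented behaviour on a list of floats: NA values are dropped when skipna is true, and otherwise any NA makes the result NA.

```python
import math


def series_min(values, skipna=True):
    """Hypothetical stand-in for a minimum reduction with a skipna flag."""
    if skipna:
        # Drop NaN entries before reducing.
        values = [v for v in values if not math.isnan(v)]
    elif any(math.isnan(v) for v in values):
        # With skipna=False a single NaN propagates to the result.
        return math.nan
    # An empty (or all-NaN) input has no minimum, so report NaN.
    return min(values) if values else math.nan


data = [4.0, math.nan, 0.0, 8.0]
with_skip = series_min(data)                    # NaN ignored
without_skip = series_min(data, skipna=False)   # NaN propagates
```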
-
max
(*args, **kwargs)¶ Return the maximum of the values over the requested axis.
If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
Parameters: - axis ({index (0)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.sum()
- Return the sum.
DeferredSeries.min()
- Return the minimum.
DeferredSeries.max()
- Return the maximum.
DeferredSeries.idxmin()
- Return the index of the minimum.
DeferredSeries.idxmax()
- Return the index of the maximum.
DeferredDataFrame.sum()
- Return the sum over the requested axis.
DeferredDataFrame.min()
- Return the minimum over the requested axis.
DeferredDataFrame.max()
- Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
- Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
- Return the index of the maximum over the requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.max()
8
-
prod
(*args, **kwargs)¶ Return the product of the values over the requested axis.
Parameters: - axis ({index (0)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.sum()
- Return the sum.
DeferredSeries.min()
- Return the minimum.
DeferredSeries.max()
- Return the maximum.
DeferredSeries.idxmin()
- Return the index of the minimum.
DeferredSeries.idxmax()
- Return the index of the maximum.
DeferredDataFrame.sum()
- Return the sum over the requested axis.
DeferredDataFrame.min()
- Return the minimum over the requested axis.
DeferredDataFrame.max()
- Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
- Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
- Return the index of the maximum over the requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
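The interaction between skipna and min_count shown in these examples can be approximated with the standard library. This is a minimal sketch, assuming floats with NaN as the only NA marker; prod_with_min_count is a hypothetical helper, not a Beam or pandas function.

```python
import math


def prod_with_min_count(values, skipna=True, min_count=0):
    """Hypothetical sketch of a product reduction with skipna and min_count."""
    valid = [v for v in values if not math.isnan(v)]
    if not skipna and len(valid) != len(values):
        # NA values are not skipped, so they poison the result.
        return math.nan
    if len(valid) < min_count:
        # Too few non-NA values to produce a result.
        return math.nan
    # The empty product is the multiplicative identity.
    return math.prod(valid, start=1.0)


empty_default = prod_with_min_count([])                      # identity
empty_strict = prod_with_min_count([], min_count=1)          # NaN
all_na_default = prod_with_min_count([math.nan])             # identity
all_na_strict = prod_with_min_count([math.nan], min_count=1) # NaN
mixed = prod_with_min_count([2.0, math.nan, 3.0])            # NaN skipped
```

Because NA values are removed before the count is checked, an all-NA input and an empty input take the same path, which is why the two cases behave identically.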
-
product
(*args, **kwargs)¶ Return the product of the values over the requested axis.
Parameters: - axis ({index (0)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.sum()
- Return the sum.
DeferredSeries.min()
- Return the minimum.
DeferredSeries.max()
- Return the maximum.
DeferredSeries.idxmin()
- Return the index of the minimum.
DeferredSeries.idxmax()
- Return the index of the maximum.
DeferredDataFrame.sum()
- Return the sum over the requested axis.
DeferredDataFrame.min()
- Return the minimum over the requested axis.
DeferredDataFrame.max()
- Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
- Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
- Return the index of the maximum over the requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
-
sum
(*args, **kwargs)¶ Return the sum of the values over the requested axis.
This is equivalent to the method numpy.sum.
Parameters: - axis ({index (0)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.sum()
- Return the sum.
DeferredSeries.min()
- Return the minimum.
DeferredSeries.max()
- Return the maximum.
DeferredSeries.idxmin()
- Return the index of the minimum.
DeferredSeries.idxmax()
- Return the index of the maximum.
DeferredDataFrame.sum()
- Return the sum over the requested axis.
DeferredDataFrame.min()
- Return the minimum over the requested axis.
DeferredDataFrame.max()
- Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
- Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
- Return the index of the maximum over the requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.sum()
14

By default, the sum of an empty or all-NA Series is ``0``.

>>> pd.Series([], dtype="float64").sum()  # min_count=0 is the default
0.0

This can be controlled with the ``min_count`` parameter. For example, if you'd like the sum of an empty series to be NaN, pass ``min_count=1``.

>>> pd.Series([], dtype="float64").sum(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and empty series identically.

>>> pd.Series([np.nan]).sum()
0.0

>>> pd.Series([np.nan]).sum(min_count=1)
nan
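To make the level parameter concrete, here is a pure-Python sketch of what reducing along one level of a two-level index means, using the same animal data. The names sum_by_level, index and legs are illustrative assumptions; this models the idea only and does not call Beam or pandas.

```python
from collections import defaultdict

# Each index entry is a (blooded, animal) tuple, mirroring the MultiIndex above.
index = [("warm", "dog"), ("warm", "falcon"), ("cold", "fish"), ("cold", "spider")]
legs = [4, 2, 0, 8]


def sum_by_level(index, values, level):
    """Group values by one position of the index tuple and sum each group."""
    totals = defaultdict(int)
    for key, value in zip(index, values):
        totals[key[level]] += value
    return dict(totals)


total = sum(legs)                                # plain reduction over all rows
by_blooded = sum_by_level(index, legs, level=0)  # one total per 'blooded' label
```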
-
mean
(*args, **kwargs)¶ Return the mean of the values over the requested axis.
Parameters: - axis ({index (0)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
mean cannot currently be parallelized. It will require collecting all data on a single node.
-
median
(*args, **kwargs)¶ Return the median of the values over the requested axis.
Parameters: - axis ({index (0)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
median cannot currently be parallelized. It will require collecting all data on a single node.
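One way to see why a median is awkward to distribute: unlike a sum, it cannot in general be rebuilt from per-partition results. The sketch below uses two made-up partitions and the standard statistics module. It illustrates the general problem only and says nothing about how Beam executes the operation.

```python
import statistics

# Two hypothetical partitions of the same dataset.
partitions = [[1, 2, 100], [3, 4, 5]]
all_values = [x for part in partitions for x in part]

# Sums combine cleanly: the sum of partial sums is the total sum.
sum_of_partial_sums = sum(sum(part) for part in partitions)
total_sum = sum(all_values)

# Medians do not: the median of partial medians differs from the true median.
true_median = statistics.median(all_values)
median_of_medians = statistics.median([statistics.median(p) for p in partitions])
```

Here the true median is 3.5 while the median of the partition medians is 3.0, so an exact answer needs access to the full set of values.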
-
sem
(*args, **kwargs)¶ Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
Parameters: - axis ({index (0)}) –
- skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
- ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
sem cannot currently be parallelized. It will require collecting all data on a single node.
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1).
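The role of ddof is easiest to see in the formula itself. The following is a minimal sketch, assuming no NA values, of the standard error of the mean as the standard deviation (with divisor N - ddof) divided by the square root of N. The function is a hypothetical stand-in written for illustration.

```python
import math


def sem(values, ddof=1):
    """Standard error of the mean with a configurable degrees-of-freedom offset."""
    n = len(values)
    mean = sum(values) / n
    # Variance uses the divisor N - ddof, as described for the ddof parameter.
    variance = sum((v - mean) ** 2 for v in values) / (n - ddof)
    return math.sqrt(variance) / math.sqrt(n)


data = [1.0, 2.0, 3.0, 4.0]
sem_default = sem(data)             # divisor N - 1
sem_population = sem(data, ddof=0)  # divisor N
```

For this data the sum of squared deviations is 5, so the default gives sqrt(5/3)/2 and ddof=0 gives sqrt(5/4)/2.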
-
mad
(*args, **kwargs)¶ Return the mean absolute deviation of the values over the requested axis.
Parameters: Returns: Return type: scalar or DeferredSeries (if level specified)
Differences from pandas
mad cannot currently be parallelized. It will require collecting all data on a single node.
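For reference, the mean absolute deviation is the average distance of each value from the mean. A minimal pure-Python sketch, assuming no NA values (mean_absolute_deviation is a hypothetical helper):

```python
def mean_absolute_deviation(values):
    """Average of |x - mean| over all values."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# Mean is 2.5; the absolute deviations are 1.5, 0.5, 0.5, 1.5.
mad_value = mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
```

Every deviation depends on the global mean, so the mean must be known before any deviation can be computed.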
-
argmax
(**kwargs)¶ pandas.Series.argmax()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
argmin
(**kwargs)¶ pandas.Series.argmin()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
cummax
(**kwargs)¶ pandas.Series.cummax()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
cummin
(**kwargs)¶ pandas.Series.cummin()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
cumprod
(**kwargs)¶ pandas.Series.cumprod()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
cumsum
(**kwargs)¶ pandas.Series.cumsum()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
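What "sensitive to the order of the data" means can be shown with the standard library: a plain sum gives the same answer for any ordering of the rows, while a cumulative sum does not. This is an illustration of the concept with made-up values, not Beam code.

```python
from itertools import accumulate

rows = [3, 1, 2]
reordered = [1, 2, 3]  # the same elements in a different order

# An order-insensitive reduction: both orderings agree.
same_total = sum(rows) == sum(reordered)

# An order-sensitive operation: the running totals differ.
cumsum_rows = list(accumulate(rows))            # [3, 4, 6]
cumsum_reordered = list(accumulate(reordered))  # [1, 3, 6]
```

Elements of a distributed collection have no inherent order, so only the first kind of operation has a well-defined result there.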
-
diff
(**kwargs)¶ pandas.Series.diff()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
interpolate
(**kwargs)¶ pandas.Series.interpolate()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
searchsorted
(**kwargs)¶ pandas.Series.searchsorted()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
shift
(**kwargs)¶ pandas.Series.shift()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
pct_change
(**kwargs)¶ pandas.Series.pct_change()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
is_monotonic
(**kwargs)¶ pandas.Series.is_monotonic()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
is_monotonic_increasing
(**kwargs)¶ pandas.Series.is_monotonic_increasing()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
is_monotonic_decreasing
(**kwargs)¶ pandas.Series.is_monotonic_decreasing()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
asof
(**kwargs)¶ pandas.Series.asof()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
first_valid_index
(**kwargs)¶ pandas.Series.first_valid_index()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
last_valid_index
(**kwargs)¶ pandas.Series.last_valid_index()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
autocorr
(**kwargs)¶ pandas.Series.autocorr()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
iat
¶ pandas.Series.iat()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
head
(**kwargs)¶ pandas.Series.head()
is not yet supported in the Beam DataFrame API because it is order-sensitive.
If you want to peek at a large dataset consider using interactive Beam’s ib.collect with n specified, or sample(). If you want to find the N largest elements, consider using DeferredDataFrame.nlargest().
-
tail
(**kwargs)¶ pandas.Series.tail()
is not yet supported in the Beam DataFrame API because it is order-sensitive.
If you want to peek at a large dataset consider using interactive Beam’s ib.collect with n specified, or sample(). If you want to find the N largest elements, consider using DeferredDataFrame.nlargest().
-
filter
(**kwargs)¶ Subset the dataframe rows or columns according to the specified index labels.
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.
Parameters: - items (list-like) – Keep labels from axis which are in items.
- like (str) – Keep labels from axis for which “like in label == True”.
- regex (str (regular expression)) – Keep labels from axis for which re.search(regex, label) == True.
- axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, ‘index’ for DeferredSeries, ‘columns’ for DeferredDataFrame.
Returns: Return type: same type as input object
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.loc()
- Access a group of rows and columns by label(s) or a boolean array.
Notes
The items, like, and regex parameters are enforced to be mutually exclusive.
axis defaults to the info axis that is used when indexing with [].
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6

>>> # select columns by name
>>> df.filter(items=['one', 'three'])
         one  three
mouse     1      3
rabbit    4      6

>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
         one  three
mouse     1      3
rabbit    4      6

>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
         one  two  three
rabbit    4    5      6
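The three selection modes act only on labels, which a short pure-Python sketch can make explicit. filter_labels is a hypothetical helper written for illustration. It mirrors the documented rules (membership for items, substring for like, re.search for regex, exactly one of the three given) and is not the Beam or pandas implementation.

```python
import re


def filter_labels(labels, items=None, like=None, regex=None):
    """Select labels using exactly one of items, like, or regex."""
    given = [arg for arg in (items, like, regex) if arg is not None]
    if len(given) != 1:
        raise TypeError("pass exactly one of items, like, or regex")
    if items is not None:
        # Keep requested items that actually exist among the labels.
        return [item for item in items if item in labels]
    if like is not None:
        # Substring match: "like in label".
        return [label for label in labels if like in label]
    pattern = re.compile(regex)
    return [label for label in labels if pattern.search(label)]


columns = ["one", "two", "three"]
rows = ["mouse", "rabbit"]

by_items = filter_labels(columns, items=["one", "three"])
by_regex = filter_labels(columns, regex="e$")
by_like = filter_labels(rows, like="bbi")
```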
-
memory_usage
(**kwargs)¶ pandas.Series.memory_usage()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
nbytes
(**kwargs)¶ pandas.Series.nbytes()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
to_list
(**kwargs)¶ pandas.Series.to_list()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
factorize
(**kwargs)¶ pandas.Series.factorize()
is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data. For more information see https://s.apache.org/dataframe-non-deferred-columns.
-
nlargest
(keep, **kwargs)[source]¶ Return the largest n elements.
Parameters: - n (int, default 5) – Return this many descending sorted values.
- keep ({'first', 'last', 'all'}, default 'first') –
When there are duplicate values that cannot all fit in a DeferredSeries of n elements:
- first : return the first n occurrences in order of appearance.
- last : return the last n occurrences in reverse order of appearance.
- all : keep all occurrences. This can result in a DeferredSeries of size larger than n.
Returns: The n largest values in the DeferredSeries, sorted in decreasing order.
Return type:
Differences from pandas
Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.
See also
DeferredSeries.nsmallest()
- Get the n smallest elements.
DeferredSeries.sort_values()
- Sort DeferredSeries by values.
DeferredSeries.head()
- Return the first n rows.
Notes
Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the DeferredSeries object.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Malta": 434000, "Maldives": 434000,
...                         "Brunei": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)
>>> s
Italy       59000000
France      65000000
Malta         434000
Maldives      434000
Brunei        434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The `n` largest elements where ``n=5`` by default.

>>> s.nlargest()
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64

The `n` largest elements where ``n=3``. Default `keep` value is 'first' so Malta will be kept.

>>> s.nlargest(3)
France    65000000
Italy     59000000
Malta       434000
dtype: int64

The `n` largest elements where ``n=3`` and keeping the last duplicates. Brunei will be kept since it is the last with value 434000 based on the index order.

>>> s.nlargest(3, keep='last')
France      65000000
Italy       59000000
Brunei        434000
dtype: int64

The `n` largest elements where ``n=3`` with all duplicates kept. Note that the returned Series has five elements due to the three duplicates.

>>> s.nlargest(3, keep='all')
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64
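The guarantee described for keep="any" (the returned values are fixed, but which tied label appears is not) can be illustrated with heapq from the standard library. This is a sketch on a plain dict under the assumption that rows may arrive in any order. n_largest_any is a hypothetical helper and does not reflect Beam's implementation.

```python
import heapq

population = {"Italy": 59000000, "France": 65000000,
              "Malta": 434000, "Maldives": 434000,
              "Brunei": 434000, "Iceland": 337000,
              "Nauru": 11300, "Tuvalu": 11300,
              "Anguilla": 11300, "Montserrat": 5200}


def n_largest_any(pairs, n):
    """Return n (label, value) pairs with the largest values; ties broken arbitrarily."""
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


pairs = list(population.items())
forward = n_largest_any(pairs, 3)         # rows seen in the original order
backward = n_largest_any(pairs[::-1], 3)  # same rows seen in reverse order

forward_values = [value for _, value in forward]
backward_values = [value for _, value in backward]
tied_labels = {"Malta", "Maldives", "Brunei"}
```

The value lists agree for both orderings, while the label in the last slot is only guaranteed to be one of the three tied countries.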
-
nsmallest
(keep, **kwargs)[source]¶ Return the smallest n elements.
Parameters: - n (int, default 5) – Return this many ascending sorted values.
- keep ({'first', 'last', 'all'}, default 'first') –
When there are duplicate values that cannot all fit in a DeferredSeries of n elements:
- first : return the first n occurrences in order of appearance.
- last : return the last n occurrences in reverse order of appearance.
- all : keep all occurrences. This can result in a DeferredSeries of size larger than n.
Returns: The n smallest values in the DeferredSeries, sorted in increasing order.
Return type:
Differences from pandas
Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.
See also
DeferredSeries.nlargest()
- Get the n largest elements.
DeferredSeries.sort_values()
- Sort DeferredSeries by values.
DeferredSeries.head()
- Return the first n rows.
Notes
Faster than .sort_values().head(n) for small n relative to the size of the DeferredSeries object.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Brunei": 434000, "Malta": 434000,
...                         "Maldives": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)
>>> s
Italy       59000000
France      65000000
Brunei        434000
Malta         434000
Maldives      434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The `n` smallest elements where ``n=5`` by default.

>>> s.nsmallest()
Montserrat    5200
Nauru        11300
Tuvalu       11300
Anguilla     11300
Iceland     337000
dtype: int64

The `n` smallest elements where ``n=3``. Default `keep` value is 'first' so Nauru and Tuvalu will be kept.

>>> s.nsmallest(3)
Montserrat   5200
Nauru       11300
Tuvalu      11300
dtype: int64

The `n` smallest elements where ``n=3`` and keeping the last duplicates. Anguilla and Tuvalu will be kept since they are the last with value 11300 based on the index order.

>>> s.nsmallest(3, keep='last')
Montserrat   5200
Anguilla    11300
Tuvalu      11300
dtype: int64

The `n` smallest elements where ``n=3`` with all duplicates kept. Note that the returned Series has four elements due to the three duplicates.

>>> s.nsmallest(3, keep='all')
Montserrat   5200
Nauru       11300
Tuvalu      11300
Anguilla    11300
dtype: int64
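The note comparing this method with .sort_values().head(n) reflects a general trade-off between selecting a few elements and sorting everything. As a standard-library analogue (not a description of pandas or Beam internals), heapq.nsmallest returns the same values as a full sort followed by a slice:

```python
import heapq

values = [59000000, 65000000, 434000, 434000, 434000,
          337000, 11300, 11300, 11300, 5200]

# Selection: only tracks the n best candidates while scanning the input once.
selected = heapq.nsmallest(3, values)

# Full sort followed by a slice: same answer, but every element gets ordered.
sorted_then_sliced = sorted(values)[:3]
```

When n is small relative to the input, the selection approach avoids ordering elements that can never appear in the result.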
-
is_unique
¶ Return boolean if values in the object are unique.
Returns: Return type: bool
Differences from pandas
This operation has no known divergences from the pandas API.
-
plot
(**kwargs)¶ pandas.Series.plot()
is not yet supported in the Beam DataFrame API because it is a plotting tool. For more information see https://s.apache.org/dataframe-plotting-tools.
-
pop
(**kwargs)¶ pandas.Series.pop()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
rename_axis
(**kwargs)¶ Set the name of the axis for the index or columns.
Parameters: - mapper (scalar, list-like, optional) – Value to set the axis name attribute.
- index, columns –
A scalar, list-like, dict-like or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a DeferredSeries. This parameter only applies to DeferredDataFrame objects.
Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename.
- copy (bool, default True) – Also copy underlying data.
- inplace (bool, default False) – Modifies the object directly, instead of creating a new DeferredSeries or DeferredDataFrame.
Returns: The same type as the caller or None if inplace=True.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.rename()
- Alter DeferredSeries index labels or name.
DeferredDataFrame.rename()
- Alter DeferredDataFrame index labels or name.
Index.rename()
- Set new names on index.
Notes
DeferredDataFrame.rename_axis supports two calling conventions:
- (index=index_mapper, columns=columns_mapper, ...)
- (mapper, axis={'index', 'columns'}, ...)
The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter copy is ignored.
The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.
We highly recommend using keyword arguments to clarify your intent.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
**Series**

>>> s = pd.Series(["dog", "cat", "monkey"])
>>> s
0       dog
1       cat
2    monkey
dtype: object
>>> s.rename_axis("animal")
animal
0    dog
1    cat
2    monkey
dtype: object

**DataFrame**

>>> df = pd.DataFrame({"num_legs": [4, 4, 2],
...                    "num_arms": [0, 0, 2]},
...                   ["dog", "cat", "monkey"])
>>> df
        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("animal")
>>> df
        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("limbs", axis="columns")
>>> df
limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

**MultiIndex**

>>> df.index = pd.MultiIndex.from_product([['mammal'],
...                                        ['dog', 'cat', 'monkey']],
...                                       names=['type', 'name'])
>>> df
limbs          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(index={'type': 'class'})
limbs          num_legs  num_arms
class  name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(columns=str.upper)
LIMBS          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2
-
round
(**kwargs)¶ Round each value in a Series to the given number of decimals.
Parameters: - decimals (int, default 0) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.
- *args, **kwargs –
Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
Returns: Rounded values of the DeferredSeries.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
See also
numpy.around()
- Round values of an np.array.
DeferredDataFrame.round()
- Round values of a DeferredDataFrame.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([0.1, 1.3, 2.7])
>>> s.round()
0    0.0
1    1.0
2    3.0
dtype: float64
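The meaning of a negative decimals value can be shown with Python's built-in round, used here purely as a stand-in for the element-wise operation. The values are made up for illustration.

```python
# decimals=0: round to the nearest whole number (result stays a float).
rounded_default = [round(v, 0) for v in [0.1, 1.3, 2.7]]

# Negative decimals: round to positions left of the decimal point.
rounded_hundreds = round(1234.5678, -2)

# Positive decimals: keep that many digits after the decimal point.
rounded_two_places = round(1234.5678, 2)
```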
-
take
(**kwargs)¶ pandas.Series.take()
is not yet supported in the Beam DataFrame API because it is deprecated in pandas.
-
to_dict
(**kwargs)¶ pandas.Series.to_dict()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
to_frame
(**kwargs)¶ Convert Series to DataFrame.
Parameters: name (object, default None) – The passed name should substitute for the series name (if it has one). Returns: DeferredDataFrame representation of DeferredSeries. Return type: DeferredDataFrame Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series(["a", "b", "c"],
...               name="vals")
>>> s.to_frame()
  vals
0    a
1    b
2    c
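As a minimal plain-pandas sketch of the `name` parameter, the snippet below converts the same Series twice, once relying on the Series name and once overriding it. The label `"letters"` is an arbitrary choice for illustration.

```python
import pandas as pd

letters = pd.Series(["a", "b", "c"], name="vals")

# Without `name`, the Series name becomes the column label.
default_frame = letters.to_frame()

# With `name`, the passed label is used instead of the Series name.
renamed_frame = letters.to_frame(name="letters")
```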
-
unique
(as_series=False)[source]¶ Return unique values of Series object.
Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.
Returns: The unique values returned as a NumPy array. See Notes. Return type: ndarray or ExtensionArray Differences from pandas
unique is not supported by default because it produces a non-deferred result: an ndarray. You can use the Beam-specific argument unique(as_series=True) to get the result as a DeferredSeries.
See also
unique()
- Top-level unique method for any 1-d array-like object.
Index.unique()
- Return Index with unique values from an Index object.
Notes
Returns the unique values as a NumPy array. In case of an extension-array backed DeferredSeries, a new ExtensionArray of that type with just the unique values is returned. This includes:
- Categorical
- Period
- Datetime with Timezone
- Interval
- Sparse
- IntegerNA
See Examples section.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> pd.Series([2, 1, 3, 3], name='A').unique()
array([2, 1, 3])

>>> pd.Series([pd.Timestamp('2016-01-01') for _ in range(3)]).unique()
array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

>>> pd.Series([pd.Timestamp('2016-01-01', tz='US/Eastern')
...            for _ in range(3)]).unique()
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

A Categorical will return categories in the order of appearance and with the same dtype.

>>> pd.Series(pd.Categorical(list('baabc'))).unique()
['b', 'a', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> pd.Series(pd.Categorical(list('baabc'), categories=list('abc'),
...                          ordered=True)).unique()
['b', 'a', 'c']
Categories (3, object): ['a' < 'b' < 'c']
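The difference between an array-shaped and a Series-shaped result can be sketched in plain pandas. Note that `as_series=True` is a Beam-specific argument that pandas does not have; `drop_duplicates()` is used here only as a rough pandas analogue, and this is an assumption made for illustration, not a statement about how Beam implements it.

```python
import pandas as pd

s = pd.Series([2, 1, 3, 3, 1], name="A")

# Eager pandas: unique() returns a NumPy array, in order of first appearance.
as_array = s.unique()

# A pandas-only analogue of a Series-shaped result: drop_duplicates()
# keeps the first occurrence of each value and returns a Series.
as_series = s.drop_duplicates()
```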
-
update
(other)[source]¶ Modify Series in place using values from passed Series.
Uses non-NA values from passed Series to make updates. Aligns on index.
Parameters: other (DeferredSeries, or object coercible into DeferredSeries) – Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, 5, 6]))
>>> s
0    4
1    5
2    6
dtype: int64

>>> s = pd.Series(['a', 'b', 'c'])
>>> s.update(pd.Series(['d', 'e'], index=[0, 2]))
>>> s
0    d
1    b
2    e
dtype: object

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, 5, 6, 7, 8]))
>>> s
0    4
1    5
2    6
dtype: int64

If ``other`` contains NaNs the corresponding values are not updated in the original Series.

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, np.nan, 6]))
>>> s
0    4
1    2
2    6
dtype: int64

``other`` can also be a non-Series object type that is coercible into a Series

>>> s = pd.Series([1, 2, 3])
>>> s.update([4, np.nan, 6])
>>> s
0    4
1    2
2    6
dtype: int64

>>> s = pd.Series([1, 2, 3])
>>> s.update({1: 9})
>>> s
0    1
1    9
2    3
dtype: int64
-
unstack
(**kwargs)¶ pandas.Series.unstack()
is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data. For more information see https://s.apache.org/dataframe-non-deferred-columns.
-
value_counts
(sort=False, normalize=False, ascending=False, bins=None, dropna=True)[source]¶ Return a Series containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
Parameters: - normalize (bool, default False) – If True then the object returned will contain the relative frequencies of the unique values.
- sort (bool, default True) – Sort by frequencies.
- ascending (bool, default False) – Sort in ascending order.
- bins (int, optional) – Rather than count values, group them into half-open bins,
a convenience for
pd.cut
, only works with numeric data. - dropna (bool, default True) – Don’t include counts of NaN.
Returns: Return type: Differences from pandas
sort is False by default, and sort=True is not supported because it imposes an ordering on the dataset which likely will not be preserved.
When bins is specified this operation is not parallelizable. See [BEAM-12441](https://issues.apache.org/jira/browse/BEAM-12441) tracking the possible addition of a distributed implementation.
See also
DeferredSeries.count()
- Number of non-NA elements in a DeferredSeries.
DeferredDataFrame.count()
- Number of non-NA elements in a DeferredDataFrame.
DeferredDataFrame.value_counts()
- Equivalent method on DeferredDataFrames.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> index = pd.Index([3, 1, 2, 3, 4, np.nan])
>>> index.value_counts()
3.0    2
1.0    1
2.0    1
4.0    1
dtype: int64

With `normalize` set to `True`, returns the relative frequency by dividing all values by the sum of values.

>>> s = pd.Series([3, 1, 2, 3, 4, np.nan])
>>> s.value_counts(normalize=True)
3.0    0.4
1.0    0.2
2.0    0.2
4.0    0.2
dtype: float64

**bins**

Bins can be useful for going from a continuous variable to a categorical variable; instead of counting unique occurrences of values, divide the index in the specified number of half-open bins.

>>> s.value_counts(bins=3)
(0.996, 2.0]    2
(2.0, 3.0]      2
(3.0, 4.0]      1
dtype: int64

**dropna**

With `dropna` set to `False` we can also see NaN index values.

>>> s.value_counts(dropna=False)
3.0    2
1.0    1
2.0    1
4.0    1
NaN    1
dtype: int64
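Because `sort=False` is the default in the Beam API, code that consumes the counts should not depend on their position. The plain-pandas sketch below passes `sort=False` explicitly and reads the result as a mapping; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4])

# sort=False mirrors the Beam default: counts come back in no guaranteed
# frequency order, so compare them as a mapping rather than by position.
counts = s.value_counts(sort=False).to_dict()

# normalize=True divides each count by the total number of counted values.
shares = s.value_counts(sort=False, normalize=True).to_dict()
```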
-
values
¶ pandas.Series.values()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
view
(**kwargs)¶ pandas.Series.view()
is not yet supported in the Beam DataFrame API because it relies on memory-sharing semantics that are not compatible with the Beam model.
-
str
¶ Vectorized string functions for Series and Index.
NAs stay NA unless handled otherwise by a particular method. Patterned after Python’s string methods, with some inspiration from R’s stringr package.
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series(["A_Str_Series"])
>>> s
0    A_Str_Series
dtype: object

>>> s.str.split("_")
0    [A, Str, Series]
dtype: object

>>> s.str.replace("_", "")
0    AStrSeries
dtype: object
-
cat
¶ Accessor object for categorical properties of the Series values.
Be aware that assigning to categories is an in-place operation, while all methods return new categorical data by default (but can be called with inplace=True).
Parameters: data (DeferredSeries or CategoricalIndex) – Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series(list("abbccc")).astype("category")
>>> s
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')

>>> s.cat.rename_categories(list("cba"))
0    c
1    b
2    b
3    a
4    a
5    a
dtype: category
Categories (3, object): ['c', 'b', 'a']

>>> s.cat.reorder_categories(list("cba"))
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['c', 'b', 'a']

>>> s.cat.add_categories(["d", "e"])
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

>>> s.cat.remove_categories(["a", "c"])
0    NaN
1      b
2      b
3    NaN
4    NaN
5    NaN
dtype: category
Categories (1, object): ['b']

>>> s1 = s.cat.add_categories(["d", "e"])
>>> s1.cat.remove_unused_categories()
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> s.cat.set_categories(list("abcde"))
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

>>> s.cat.as_ordered()
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

>>> s.cat.as_unordered()
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
-
dt
¶
-
mode
(*args, **kwargs)[source]¶ Return the mode(s) of the Series.
The mode is the value that appears most often. There can be multiple modes.
Always returns Series even if only one value is returned.
Parameters: dropna (bool, default True) – Don’t consider counts of NaN/NaT. Returns: Modes of the DeferredSeries in sorted order. Return type: DeferredSeries Differences from pandas
mode is not currently parallelizable. An approximate, parallelizable implementation of mode may be added in the future (BEAM-12181).
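The two properties stated above (several modes are possible, and the result is always a Series) can be checked with a small plain-pandas sketch. It runs eagerly and says nothing about how a parallel implementation would behave.

```python
import pandas as pd

# 2 and 3 both appear twice, so both are modes (returned in sorted order).
tied = pd.Series([3, 1, 2, 3, 2]).mode()

# Even a single mode comes back as a one-element Series, not a scalar.
single = pd.Series([5, 5, 7]).mode()
```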
-
apply
(**kwargs)¶ Invoke function on values of Series.
Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.
Parameters: - func (function) – Python function or NumPy ufunc to apply.
- convert_dtype (bool, default True) – Try to find better dtype for elementwise function results. If False, leave as dtype=object. Note that the dtype is always preserved for some extension array dtypes, such as Categorical.
- args (tuple) – Positional arguments passed to func after the series value.
- **kwargs – Additional keyword arguments passed to func.
Returns: If func returns a DeferredSeries object the result will be a DeferredDataFrame.
Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.map()
- For element-wise operations.
DeferredSeries.agg()
- Only perform aggregating type operations.
DeferredSeries.transform()
- Only perform transforming type operations.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Create a series with typical summer temperatures for each city.

>>> s = pd.Series([20, 21, 12],
...               index=['London', 'New York', 'Helsinki'])
>>> s
London      20
New York    21
Helsinki    12
dtype: int64

Square the values by defining a function and passing it as an argument to ``apply()``.

>>> def square(x):
...     return x ** 2
>>> s.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64

Square the values by passing an anonymous function as an argument to ``apply()``.

>>> s.apply(lambda x: x ** 2)
London      400
New York    441
Helsinki    144
dtype: int64

Define a custom function that needs additional positional arguments and pass these additional arguments using the ``args`` keyword.

>>> def subtract_custom_value(x, custom_value):
...     return x - custom_value

>>> s.apply(subtract_custom_value, args=(5,))
London      15
New York    16
Helsinki     7
dtype: int64

Define a custom function that takes keyword arguments and pass these arguments to ``apply``.

>>> def add_custom_values(x, **kwargs):
...     for month in kwargs:
...         x += kwargs[month]
...     return x

>>> s.apply(add_custom_values, june=30, july=20, august=25)
London      95
New York    96
Helsinki    87
dtype: int64

Use a function from the Numpy library.

>>> s.apply(np.log)
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64
-
map
(**kwargs)¶ Map values of Series according to input correspondence.
Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.
Parameters: - arg (function, collections.abc.Mapping subclass or DeferredSeries) – Mapping correspondence.
- na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.
Returns: Same index as caller.
Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.apply()
- For applying more complex functions on a DeferredSeries.
DeferredDataFrame.apply()
- Apply a function row-/column-wise.
DeferredDataFrame.applymap()
- Apply a function elementwise on a whole DeferredDataFrame.
Notes
When arg is a dictionary, values in DeferredSeries that are not in the dictionary (as keys) are converted to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for default values), then this default is used rather than NaN.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0       cat
1       dog
2       NaN
3    rabbit
dtype: object

``map`` accepts a ``dict`` or a ``Series``. Values that are not found in the ``dict`` are converted to ``NaN``, unless the dict has a default value (e.g. ``defaultdict``):

>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0    kitten
1     puppy
2       NaN
3       NaN
dtype: object

It also accepts a function:

>>> s.map('I am a {}'.format)
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

To avoid applying the function to missing values (and keep them as ``NaN``) ``na_action='ignore'`` can be used:

>>> s.map('I am a {}'.format, na_action='ignore')
0       I am a cat
1       I am a dog
2              NaN
3    I am a rabbit
dtype: object
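The `__missing__` behaviour mentioned in the Notes can be sketched in plain pandas as follows. `DefaultingDict` is a hypothetical helper class written for this illustration; it is not part of pandas or Beam.

```python
import pandas as pd


class DefaultingDict(dict):
    """Hypothetical dict subclass that supplies a default for missing keys."""

    def __missing__(self, key):
        return "unknown"


animals = pd.Series(["cat", "dog", "rabbit"])
names = {"cat": "kitten", "dog": "puppy"}

# Plain dict: 'rabbit' has no key, so it maps to NaN.
plain = animals.map(names)

# dict subclass with __missing__: the default replaces NaN.
with_default = animals.map(DefaultingDict(names))
```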
-
repeat
(repeats, axis)[source]¶ Repeat elements of a Series.
Returns a new Series where each element of the current Series is repeated consecutively a given number of times.
Parameters: Returns: Newly created DeferredSeries with repeated elements.
Return type: Differences from pandas
repeats must be an int or a DeferredSeries. Lists are not supported because they make this operation order-sensitive.
See also
Index.repeat()
- Equivalent function for Index.
numpy.repeat()
- Similar method for numpy.ndarray.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> s = pd.Series(['a', 'b', 'c'])
>>> s
0    a
1    b
2    c
dtype: object
>>> s.repeat(2)
0    a
0    a
1    b
1    b
2    c
2    c
dtype: object
>>> s.repeat([1, 2, 3])
0    a
1    b
1    b
2    c
2    c
2    c
dtype: object
-
compare
(other, align_axis, **kwargs)[source]¶ Compare to another Series and show the differences.
New in version 1.1.0.
Parameters: - other (DeferredSeries) – Object to compare with.
- align_axis ({0 or 'index', 1 or 'columns'}, default 1) – Determine which axis to align the comparison on.
  - 0, or ‘index’ : Resulting differences are stacked vertically with rows drawn alternately from self and other.
  - 1, or ‘columns’ : Resulting differences are aligned horizontally with columns drawn alternately from self and other.
- keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.
- keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.
Returns: If axis is 0 or ‘index’ the result will be a DeferredSeries. The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.
If axis is 1 or ‘columns’ the result will be a DeferredDataFrame. It will have two columns namely ‘self’ and ‘other’.
Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.compare()
- Compare with another DeferredDataFrame and show differences.
Notes
Matching NaNs will not appear as a difference.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s1 = pd.Series(["a", "b", "c", "d", "e"])
>>> s2 = pd.Series(["a", "a", "c", "b", "e"])

Align the differences on columns

>>> s1.compare(s2)
  self other
1    b     a
3    d     b

Stack the differences on indices

>>> s1.compare(s2, align_axis=0)
1  self     b
   other    a
3  self     d
   other    b
dtype: object

Keep all original rows

>>> s1.compare(s2, keep_shape=True)
  self other
0  NaN   NaN
1    b     a
2  NaN   NaN
3    d     b
4  NaN   NaN

Keep all original rows and also all original values

>>> s1.compare(s2, keep_shape=True, keep_equal=True)
  self other
0    a     a
1    b     a
2    c     c
3    d     b
4    e     e
-
abs
(**kwargs)¶ Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
Returns: DeferredSeries/DeferredDataFrame containing the absolute value of each element. Return type: abs Differences from pandas
This operation has no known divergences from the pandas API.
See also
numpy.absolute()
- Calculate the absolute value element-wise.
Notes
For complex inputs, 1.2 + 1j, the absolute value is \(\sqrt{ a^2 + b^2 }\).
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])
>>> s.abs()
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])
>>> s.abs()
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta('1 days')])
>>> s.abs()
0   1 days
dtype: timedelta64[ns]

Select rows with data closest to certain value using argsort (from `StackOverflow <https://stackoverflow.com/a/17758115>`__).

>>> df = pd.DataFrame({
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df
   a   b    c
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50
>>> df.loc[(df.c - 43).abs().argsort()]
   a   b    c
1  5  20   50
0  4  10  100
2  6  30  -30
3  7  40  -50
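The complex-number formula from the Notes, with a the real part and b the imaginary part, can be checked by hand using only the Python standard library. This is a sketch of the arithmetic, not of the pandas or Beam implementation.

```python
import math

z = 1.2 + 1j

# Magnitude computed from the real part a and the imaginary part b.
by_formula = math.sqrt(z.real ** 2 + z.imag ** 2)

# Python's built-in abs() returns the same magnitude for complex numbers.
by_builtin = abs(z)
```

Both values agree with the 1.56205 shown in the Series example above, to the printed precision.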
-
add
(**kwargs)¶ Return Addition of series and other, element-wise (binary operator add).
Equivalent to series + other, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type: Differences from pandas
Only level=None is supported.
See also
DeferredSeries.radd()
- Reverse of the Addition operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.add(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64
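To make the role of `fill_value` concrete, the plain-pandas sketch below contrasts the bare `+` operator with `add(..., fill_value=0)` on two Series whose indexes only partly overlap. The labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Plain operator: a label that is missing (or NaN) on either side yields NaN.
plain_sum = a + b

# fill_value=0 substitutes 0 wherever only one side is missing.
filled_sum = a.add(b, fill_value=0)
```

Here every entry of `plain_sum` is NaN, while `filled_sum` keeps the value from whichever side had data.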
-
asfreq
(**kwargs)¶ pandas.Series.asfreq()
is not implemented yet in the Beam DataFrame API. If support for ‘asfreq’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
astype
(dtype, copy, errors)¶ Cast a pandas object to a specified dtype dtype.
Parameters: - dtype (data type, or dict of column name -> data type) – Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DeferredDataFrame’s columns to column-specific types.
- copy (bool, default True) – Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
- errors ({'raise', 'ignore'}, default 'raise') – Control raising of exceptions on invalid data for provided dtype.
  - raise : allow exceptions to be raised
  - ignore : suppress exceptions. On error return original object.
Returns: casted
Return type: same type as caller
Differences from pandas
astype is not parallelizable when errors="ignore" is specified. copy=False is not supported because it relies on memory-sharing semantics. dtype="category" is not supported because the type of the output column depends on the data. Please use pd.CategoricalDtype with explicit categories instead.
See also
to_datetime()
- Convert argument to datetime.
to_timedelta()
- Convert argument to timedelta.
to_numeric()
- Convert argument to a numeric type.
numpy.ndarray.astype()
- Cast a numpy array to a specified type.
Notes
Deprecated since version 1.3.0: Using astype to convert from timezone-naive dtype to timezone-aware dtype is deprecated and will raise in a future version. Use DeferredSeries.dt.tz_localize() instead.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
Create a DataFrame:

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.dtypes
col1    int64
col2    int64
dtype: object

Cast all columns to int32:

>>> df.astype('int32').dtypes
col1    int32
col2    int32
dtype: object

Cast col1 to int32 using a dictionary:

>>> df.astype({'col1': 'int32'}).dtypes
col1    int32
col2    int64
dtype: object

Create a series:

>>> ser = pd.Series([1, 2], dtype='int32')
>>> ser
0    1
1    2
dtype: int32
>>> ser.astype('int64')
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> from pandas.api.types import CategoricalDtype
>>> cat_dtype = CategoricalDtype(
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using ``copy=False`` and changing data on a new pandas object may propagate changes:

>>> s1 = pd.Series([1, 2])
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1  # note that s1[0] has changed too
0    10
1     2
dtype: int64

Create a series of dates:

>>> ser_date = pd.Series(pd.date_range('20200101', periods=3))
>>> ser_date
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[ns]
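The recommendation above to use pd.CategoricalDtype with explicit categories can be sketched in plain pandas as follows. The color values are invented for illustration; the point is that the resulting dtype is fixed by the declaration rather than by whatever values occur in the data.

```python
import pandas as pd

colors = pd.Series(["red", "green", "red"])

# Declare the categories up front so the output dtype does not depend
# on which values happen to be present in the data.
color_dtype = pd.CategoricalDtype(categories=["red", "green", "blue"])
as_category = colors.astype(color_dtype)
```

Note that "blue" is a valid category even though it never appears in the data.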
-
at
¶ pandas.Series.at()
is not implemented yet in the Beam DataFrame API. If support for ‘at’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
at_time
(**kwargs)¶ Select values at particular time of day (e.g., 9:30AM).
Parameters: - time (datetime.time or str) –
- axis ({0 or 'index', 1 or 'columns'}, default 0) –
Returns: Return type: Raises: TypeError – If the index is not a DatetimeIndex
Differences from pandas
This operation has no known divergences from the pandas API.
See also
between_time()
- Select values between particular times of the day.
first()
- Select initial periods of time series based on a date offset.
last()
- Select final periods of time series based on a date offset.
DatetimeIndex.indexer_at_time()
- Get just the index locations for values at particular time of the day.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> i = pd.date_range('2018-04-09', periods=4, freq='12H')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-09 12:00:00  2
2018-04-10 00:00:00  3
2018-04-10 12:00:00  4

>>> ts.at_time('12:00')
                     A
2018-04-09 12:00:00  2
2018-04-10 12:00:00  4
-
attrs
¶ pandas.DataFrame.attrs()
is not yet supported in the Beam DataFrame API because it is experimental in pandas.
-
backfill
(*args, **kwargs)¶ Synonym for DataFrame.fillna() with method='bfill'.
Returns: Object with missing values filled or None if inplace=True.
Return type: DeferredSeries/DeferredDataFrame or None Differences from pandas
backfill is only supported for axis="columns". axis="index" is order-sensitive.
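Why the index direction is order-sensitive while the column direction is not can be seen in a small plain-pandas sketch. It uses `bfill`, the pandas method that performs the same backward fill; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 1.0], "b": [5.0, np.nan]})

# Column-wise backfill: each NaN takes the next valid value in its own row.
filled = df.bfill(axis="columns")

# Reversing the row order does not change any row's result.
filled_reversed = df.iloc[::-1].bfill(axis="columns").sort_index()

# Index-wise backfill on a Series depends on which row comes next.
s = pd.Series([np.nan, 1.0, np.nan, 2.0])
forward_order = s.bfill().tolist()
reverse_order = s.iloc[::-1].bfill().sort_index().tolist()
```

The column-wise result is the same for both row orders, whereas `forward_order` and `reverse_order` differ, which is the kind of dependence on row order that a distributed dataset cannot guarantee.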
-
between_time
(**kwargs)¶ Select values between particular times of the day (e.g., 9:00-9:30 AM).
By setting start_time to be later than end_time, you can get the times that are not between the two times.
Parameters: - start_time (datetime.time or str) – Initial time as a time filter limit.
- end_time (datetime.time or str) – End time as a time filter limit.
- include_start (bool, default True) – Whether the start time needs to be included in the result.
- include_end (bool, default True) – Whether the end time needs to be included in the result.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Determine range time on index or columns value.
Returns: Data from the original object filtered to the specified dates range.
Return type: Raises: TypeError – If the index is not a DatetimeIndex
Differences from pandas
This operation has no known divergences from the pandas API.
See also
at_time()
- Select values at a particular time of the day.
first()
- Select initial periods of time series based on a date offset.
last()
- Select final periods of time series based on a date offset.
DatetimeIndex.indexer_between_time()
- Get just the index locations for values between particular times of the day.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
2018-04-12 01:00:00  4

>>> ts.between_time('0:15', '0:45')
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3

You get the times that are *not* between two times by setting ``start_time`` later than ``end_time``:

>>> ts.between_time('0:45', '0:15')
                     A
2018-04-09 00:00:00  1
2018-04-12 01:00:00  4
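The wrap-around rule (a `start_time` later than `end_time` selects the complement) can be sketched with the standard library alone. `is_between` is a hypothetical helper written for this illustration, assuming both endpoints are included; it is not the pandas or Beam implementation.

```python
from datetime import time


def is_between(t, start, end):
    """Hypothetical helper: inclusive time-of-day test with wrap-around."""
    if start <= end:
        # Ordinary window, e.g. 00:15 to 00:45.
        return start <= t <= end
    # start later than end: the window wraps past midnight, so it selects
    # the times that fall outside the (end, start) interval.
    return t >= start or t <= end


stamps = [time(0, 0), time(0, 20), time(0, 40), time(1, 0)]
inside = [t for t in stamps if is_between(t, time(0, 15), time(0, 45))]
outside = [t for t in stamps if is_between(t, time(0, 45), time(0, 15))]
```

The two lists match the rows selected by the two `between_time` calls in the example above.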
-
bfill
(*args, **kwargs)¶ bfill is only supported for axis="columns". axis="index" is order-sensitive.
-
bool
()¶ Return the bool of a single element Series or DataFrame.
This must be a boolean scalar value, either True or False. It will raise a ValueError if the Series or DataFrame does not have exactly 1 element, or that element is not boolean (integer values 0 and 1 will also raise an exception).
Returns: The value in the DeferredSeries or DeferredDataFrame. Return type: bool Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.astype()
- Change the data type of a DeferredSeries, including to boolean.
DeferredDataFrame.astype()
- Change the data type of a DeferredDataFrame, including to boolean.
numpy.bool_()
- NumPy boolean data type, used by pandas for boolean values.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
The method will only work for single element objects with a boolean value:

>>> pd.Series([True]).bool()
True
>>> pd.Series([False]).bool()
False

>>> pd.DataFrame({'col': [True]}).bool()
True
>>> pd.DataFrame({'col': [False]}).bool()
False
-
combine
(**kwargs)¶ Perform column-wise combine with another DataFrame.
Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.
Parameters: - other (DeferredDataFrame) – The DeferredDataFrame to merge column-wise.
- func (function) – Function that takes two series as inputs and returns a DeferredSeries or a scalar. Used to merge the two dataframes column by column.
- fill_value (scalar value, default None) – The value to fill NaNs with prior to passing any column to the merge func.
- overwrite (bool, default True) – If True, columns in self that do not exist in other will be overwritten with NaNs.
Returns: Combination of the provided DeferredDataFrames.
Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.combine_first()
- Combine two DeferredDataFrame objects and default to non-null values in frame calling the method.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Combine using a simple function that chooses the smaller column.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2
>>> df1.combine(df2, take_smaller)
   A  B
0  0  3
1  0  3

Example using a true element-wise combine function.

>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, np.minimum)
   A  B
0  1  2
1  0  3

Using `fill_value` fills Nones prior to passing the column to the merge function.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  4.0

However, if the same element in both dataframes is None, that None is preserved

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  3.0

Example that demonstrates the use of `overwrite` and behavior when the axes differ between the dataframes.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2])
>>> df1.combine(df2, take_smaller)
     A    B     C
0  NaN  NaN   NaN
1  NaN  3.0 -10.0
2  NaN  3.0   1.0

>>> df1.combine(df2, take_smaller, overwrite=False)
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0

Demonstrating the preference of the passed in dataframe.

>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2])
>>> df2.combine(df1, take_smaller)
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN

>>> df2.combine(df1, take_smaller, overwrite=False)
     A    B    C
0  0.0  NaN  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
-
combine_first
(**kwargs)¶ Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.
Parameters: other (DeferredDataFrame) – Provided DeferredDataFrame to use to fill null values.
Returns: The result of combining the provided DeferredDataFrame with the other object.
Return type: DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.combine()
- Perform series-wise operation on two DeferredDataFrames using a given function.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0

Null values still persist if the location of that null value does not exist in `other`

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
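The fill rule described above can be sketched in plain Python. This is only a toy model: frames are represented as nested dicts of the form {column: {row_label: value}}, None stands in for a null, and the helper name combine_first is reused purely for illustration. It is not how Beam or pandas implement the operation.

```python
def combine_first(left, right):
    """Fill missing (None) cells of `left` with the matching cells of `right`.

    Frames are modelled as {column: {row_label: value}}; this is a toy
    stand-in for a DataFrame, not the Beam or pandas implementation.
    """
    # The result covers the union of both frames' columns and row labels.
    columns = sorted(set(left) | set(right))
    rows = sorted(
        {r for frame in (left, right) for col in frame.values() for r in col})
    result = {}
    for col in columns:
        result[col] = {}
        for row in rows:
            value = left.get(col, {}).get(row)
            if value is None:
                # Fall back to `right`; stays None if `right` has no value.
                value = right.get(col, {}).get(row)
            result[col][row] = value
    return result


df1 = {"A": {0: None, 1: 0}, "B": {0: 4, 1: None}}
df2 = {"B": {1: 3, 2: 3}, "C": {1: 1, 2: 1}}
combined = combine_first(df1, df2)
```

The inputs mirror the second pandas example: cells that are null in the caller and absent from the other frame remain null in the result.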
-
convert_dtypes
(**kwargs)¶ pandas.Series.convert_dtypes()
is not implemented yet in the Beam DataFrame API. If support for ‘convert_dtypes’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
copy
(**kwargs)¶ Make a copy of this object’s indices and data.
When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).
When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).
Parameters: deep (bool, default True) – Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.
Returns: copy – Object type matches caller.
Return type: DeferredSeries or DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
Notes
When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).
While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64

>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64

**Shallow copy versus default (deep) copy:**

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

Shallow copy shares data and index with original.

>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True

Deep copy has own copy of data and index.

>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False

Updates to the data shared by shallow copy and original are reflected in both; deep copy remains unchanged.

>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64

Note that when copying an object containing Python objects, a deep copy will copy the data, but will not do so recursively. Updating a nested data object will be reflected in the deep copy.

>>> s = pd.Series([[1, 2], [3, 4]])
>>> deep = s.copy()
>>> s[0][0] = 10
>>> s
0    [10, 2]
1     [3, 4]
dtype: object
>>> deep
0    [10, 2]
1     [3, 4]
dtype: object
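The contrast with copy.deepcopy mentioned in the Notes can be seen with plain Python lists. This is an analogy only: list(original) plays the role of a copy that duplicates the container but not the nested objects, which is how the Notes describe deep=True for object data.

```python
import copy

original = [[1, 2], [3, 4]]

# New outer list, but the inner lists are the very same objects.
container_copy = list(original)

# Recursively copies the inner lists as well.
recursive_copy = copy.deepcopy(original)

# Mutate a nested object through the original.
original[0][0] = 10
```

After the mutation, container_copy sees the change through the shared inner list, while recursive_copy does not.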
-
div
(**kwargs)¶ Return Floating division of series and other, element-wise (binary operator truediv).
Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type: DeferredSeries
Differences from pandas
Only level=None is supported
See also
DeferredSeries.rtruediv()
- Reverse of the Floating division operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
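The fill_value rule is easy to misread, so here is a minimal pure-Python model of it. Series are represented as dicts from label to float, and the helper names (_is_missing, _float_div, div_with_fill) are hypothetical. The sketch follows the parameter description above: a value is substituted only when exactly one side is missing, and a label missing on both sides stays missing.

```python
import math


def _is_missing(value):
    """True for an absent label (None) or a NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def _float_div(x, y):
    """Divide like floats do: x/0 is inf with the sign of x, 0/0 is NaN.

    Assumes a positive zero divisor, which is all this sketch needs.
    """
    if y == 0:
        return math.nan if x == 0 else math.copysign(math.inf, x)
    return x / y


def div_with_fill(left, right, fill_value=None):
    result = {}
    # Align on the union of labels, as the binary operators do.
    for label in sorted(set(left) | set(right)):
        x, y = left.get(label), right.get(label)
        both_missing = _is_missing(x) and _is_missing(y)
        if fill_value is not None and not both_missing:
            x = fill_value if _is_missing(x) else x
            y = fill_value if _is_missing(y) else y
        if _is_missing(x) or _is_missing(y):
            result[label] = math.nan
        else:
            result[label] = _float_div(float(x), float(y))
    return result


a = {"a": 1.0, "b": 1.0, "c": 1.0, "d": math.nan}
b = {"a": 1.0, "b": math.nan, "d": 1.0, "e": math.nan}
quotient = div_with_fill(a, b, fill_value=0)
```

With the same inputs as the pandas example, labels b and c divide by the filled-in 0 and give inf, d gives 0.0, and e stays NaN because it is missing on both sides.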
-
divide
(**kwargs)¶ Return Floating division of series and other, element-wise (binary operator truediv).
Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type: DeferredSeries
Differences from pandas
Only level=None is supported
See also
DeferredSeries.rtruediv()
- Reverse of the Floating division operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
-
divmod
(**kwargs)¶ Return Integer division and modulo of series and other, element-wise (binary operator divmod).
Equivalent to divmod(series, other), but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type: 2-Tuple of DeferredSeries
Differences from pandas
Only level=None is supported
See also
DeferredSeries.rdivmod()
- Reverse of the Integer division and modulo operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divmod(b, fill_value=0)
(a    1.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64,
 a    0.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64)
-
drop
(labels, axis, index, columns, errors, **kwargs)¶ Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels.
Parameters: - labels (single label or list-like) – Index or column labels to drop.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
- index (single label or list-like) – Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
- columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
- level (int or level name, optional) – For MultiIndex, level from which the labels will be removed.
- inplace (bool, default False) – If False, return a copy. Otherwise, do operation inplace and return None.
- errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.
Returns: DeferredDataFrame without the removed index or column labels or None if inplace=True.
Return type:
Raises: KeyError – If any of the labels is not found in the selected axis.
Differences from pandas
drop is not parallelizable when dropping from the index and errors="raise" is specified. It requires collecting all data on a single node in order to detect if one of the index values is missing.
See also
DeferredDataFrame.loc()
- Label-location based indexer for selection by label.
DeferredDataFrame.dropna()
- Return DeferredDataFrame with labels on given axis omitted where (all or any) data are missing.
DeferredDataFrame.drop_duplicates()
- Return DeferredDataFrame with duplicate rows removed, optionally only considering certain columns.
DeferredSeries.drop()
- Return DeferredSeries with specified index labels removed.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4), ... columns=['A', 'B', 'C', 'D']) >>> df A B C D 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 Drop columns >>> df.drop(['B', 'C'], axis=1) A D 0 0 3 1 4 7 2 8 11 >>> df.drop(columns=['B', 'C']) A D 0 0 3 1 4 7 2 8 11 Drop a row by index >>> df.drop([0, 1]) A B C D 2 8 9 10 11 Drop columns and/or rows of MultiIndex DataFrame >>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'], ... ['speed', 'weight', 'length']], ... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], ... [0, 1, 2, 0, 1, 2, 0, 1, 2]]) >>> df = pd.DataFrame(index=midx, columns=['big', 'small'], ... data=[[45, 30], [200, 100], [1.5, 1], [30, 20], ... [250, 150], [1.5, 0.8], [320, 250], ... [1, 0.8], [0.3, 0.2]]) >>> df big small lama speed 45.0 30.0 weight 200.0 100.0 length 1.5 1.0 cow speed 30.0 20.0 weight 250.0 150.0 length 1.5 0.8 falcon speed 320.0 250.0 weight 1.0 0.8 length 0.3 0.2 >>> df.drop(index='cow', columns='small') big lama speed 45.0 weight 200.0 length 1.5 falcon speed 320.0 weight 1.0 length 0.3 >>> df.drop(index='length', level=1) big small lama speed 45.0 30.0 weight 200.0 100.0 cow speed 30.0 20.0 weight 250.0 150.0 falcon speed 320.0 250.0 weight 1.0 0.8
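The parallelization caveat under ‘Differences from pandas’ can be illustrated with a small pure-Python model. The data is split into partitions, each a dict from index label to row; drop_ignore and drop_raise are hypothetical helper names for this sketch and are not part of the Beam API.

```python
def drop_ignore(partitions, labels):
    """errors='ignore': each partition can be filtered on its own."""
    labels = set(labels)
    return [{k: v for k, v in part.items() if k not in labels}
            for part in partitions]


def drop_raise(partitions, labels):
    """errors='raise': every index label must be seen before deciding."""
    # This union is the global step that prevents per-partition processing.
    all_labels = set().union(*partitions)
    missing = sorted(set(labels) - all_labels)
    if missing:
        raise KeyError(f"{missing} not found in axis")
    return drop_ignore(partitions, labels)


partitions = [{0: "a", 1: "b"}, {2: "c"}]
ignored = drop_ignore(partitions, [1, 99])
checked = drop_raise(partitions, [2])
```

Filtering with errors='ignore' never needs to look outside a partition, whereas confirming that a label exists somewhere requires knowledge of the whole index.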
-
droplevel
(level, axis)¶ Return Series/DataFrame with requested index / column level(s) removed.
Parameters: - level (int, str, or list-like) – If a string is given, must be the name of a level. If list-like, elements must be names or positional indexes of levels.
- axis ({0 or 'index', 1 or 'columns'}, default 0) –
Axis along which the level(s) is removed:
- 0 or ‘index’: remove level(s) from the row index.
- 1 or ‘columns’: remove level(s) from the column index.
Returns: DeferredSeries/DeferredDataFrame with requested index / column level(s) removed.
Return type: DeferredSeries/DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame([ ... [1, 2, 3, 4], ... [5, 6, 7, 8], ... [9, 10, 11, 12] ... ]).set_index([0, 1]).rename_axis(['a', 'b']) >>> df.columns = pd.MultiIndex.from_tuples([ ... ('c', 'e'), ('d', 'f') ... ], names=['level_1', 'level_2']) >>> df level_1 c d level_2 e f a b 1 2 3 4 5 6 7 8 9 10 11 12 >>> df.droplevel('a') level_1 c d level_2 e f b 2 3 4 6 7 8 10 11 12 >>> df.droplevel('level_2', axis=1) level_1 c d a b 1 2 3 4 5 6 7 8 9 10 11 12
-
empty
¶ Indicator whether DataFrame is empty.
True if DataFrame is entirely empty (no items), meaning any of the axes are of length 0.
Returns: If DeferredDataFrame is empty, return True, if not return False.
Return type: bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.dropna
- Return series without null values.
DeferredDataFrame.dropna
- Return DeferredDataFrame with labels on given axis omitted where (all or any) data are missing.
Notes
If DeferredDataFrame contains only NaNs, it is still not considered empty. See the example below.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
An example of an actual empty DataFrame. Notice the index is empty:

>>> df_empty = pd.DataFrame({'A' : []})
>>> df_empty
Empty DataFrame
Columns: [A]
Index: []
>>> df_empty.empty
True

If we only have NaNs in our DataFrame, it is not considered empty! We will need to drop the NaNs to make the DataFrame empty:

>>> df = pd.DataFrame({'A' : [np.nan]})
>>> df
    A
0 NaN
>>> df.empty
False
>>> df.dropna().empty
True
-
eq
(**kwargs)¶ Return Equal to of series and other, element-wise (binary operator eq).
Equivalent to series == other, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.eq(b, fill_value=0)
a     True
b    False
c    False
d    False
e    False
dtype: bool
-
equals
(other)¶ Test whether two objects contain the same elements.
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
The row/column index do not need to have the same type, as long as the values are considered equal. Corresponding columns must be of the same dtype.
Parameters: other (DeferredSeries or DeferredDataFrame) – The other DeferredSeries or DeferredDataFrame to be compared with the first.
Returns: True if all elements are the same in both objects, False otherwise.
Return type: bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.eq()
- Compare two DeferredSeries objects of the same length and return a DeferredSeries where each element is True if the element in each DeferredSeries is equal, False otherwise.
DeferredDataFrame.eq()
- Compare two DeferredDataFrame objects of the same shape and return a DeferredDataFrame where each element is True if the respective element in each DeferredDataFrame is equal, False otherwise.
testing.assert_series_equal()
- Raises an AssertionError if left and right are not equal. Provides an easy interface to ignore inequality in dtypes, indexes and precision among others.
testing.assert_frame_equal()
- Like assert_series_equal, but targets DeferredDataFrames.
numpy.array_equal()
- Return True if two arrays have the same shape and elements, False otherwise.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> df
    1   2
0  10  20

DataFrames df and exactly_equal have the same types and values for their elements and column labels, which will return True.

>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal
    1   2
0  10  20
>>> df.equals(exactly_equal)
True

DataFrames df and different_column_type have the same element types and values, but have different types for the column labels, which will still return True.

>>> different_column_type = pd.DataFrame({1.0: [10], 2.0: [20]})
>>> different_column_type
   1.0  2.0
0   10   20
>>> df.equals(different_column_type)
True

DataFrames df and different_data_type have different types for the same values for their elements, and will return False even though their column labels are the same values and types.

>>> different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})
>>> different_data_type
      1     2
0  10.0  20.0
>>> df.equals(different_data_type)
False
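The difference between an element-wise eq() and a whole-object equals() can be modelled on plain lists. This is a simplified sketch: the function names are hypothetical, and a per-element type check stands in for the dtype comparison that equals() performs on columns.

```python
import math


def elementwise_eq(left, right):
    """Like eq(): one boolean per position; NaN never equals NaN."""
    return [x == y for x, y in zip(left, right)]


def same_contents(left, right):
    """Like equals(): one boolean overall; NaNs in the same spot match."""
    if len(left) != len(right):
        return False
    for x, y in zip(left, right):
        # Stand-in for the dtype check: 10 and 10.0 are not "the same".
        if type(x) is not type(y):
            return False
        both_nan = isinstance(x, float) and math.isnan(x) and math.isnan(y)
        if not both_nan and x != y:
            return False
    return True


with_nan = [1.0, math.nan]
also_with_nan = [1.0, math.nan]
per_element = elementwise_eq(with_nan, also_with_nan)
overall = same_contents(with_nan, also_with_nan)
mixed_types = same_contents([10], [10.0])
```

The element-wise comparison reports False at the NaN position, while the whole-object comparison treats NaNs in the same location as equal and rejects matching values of different types.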
-
ewm
(**kwargs)¶ pandas.Series.ewm()
is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics. For more information see https://s.apache.org/dataframe-event-time-semantics.
-
expanding
(**kwargs)¶ pandas.Series.expanding()
is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics. For more information see https://s.apache.org/dataframe-event-time-semantics.
-
ffill
(*args, **kwargs)¶ ffill is only supported for axis=”columns”. axis=”index” is order-sensitive.
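To see why a forward fill along the index is order-sensitive, consider this small pure-Python sketch. Rows are (label, value) pairs and None marks a missing value; ffill and ffill_rows are hypothetical helpers, and the reordered list simulates rows arriving in a different order, as can happen when data is distributed.

```python
def ffill(values):
    """Propagate the last non-missing value forward."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def ffill_rows(rows):
    """Forward-fill (label, value) rows in the order given."""
    labels = [label for label, _ in rows]
    return dict(zip(labels, ffill([value for _, value in rows])))


rows = [("r0", 1.0), ("r1", None), ("r2", 3.0), ("r3", None)]
reordered = [rows[2], rows[1], rows[0], rows[3]]

in_order = ffill_rows(rows)
out_of_order = ffill_rows(reordered)
```

The same rows produce different fills for r1 and r3 depending on arrival order, so the result is not well defined without a guaranteed ordering.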
-
fillna
(value, method, axis, limit, **kwargs)¶ Fill NA/NaN values using the specified method.
Parameters: - value (scalar, dict, DeferredSeries, or DeferredDataFrame) – Value to use to fill holes (e.g. 0), alternately a dict/DeferredSeries/DeferredDataFrame of values specifying which value to use for each index (for a DeferredSeries) or column (for a DeferredDataFrame). Values not in the dict/DeferredSeries/DeferredDataFrame will not be filled. This value cannot be a list.
- method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) – Method to use for filling holes in reindexed DeferredSeries. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use next valid observation to fill gap.
- axis ({0 or 'index', 1 or 'columns'}) – Axis along which to fill missing values.
- inplace (bool, default False) – If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DeferredDataFrame).
- limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
- downcast (dict, default is None) – A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
Returns: Object with missing values filled or None if inplace=True.
Return type:
Differences from pandas
When axis="index", both method and limit must be None; otherwise this operation is order-sensitive.
See also
interpolate()
- Fill NaN values using interpolation.
reindex()
- Conform object to new index.
asfreq()
- Convert time series to specified frequency.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], ... [3, 4, np.nan, 1], ... [np.nan, np.nan, np.nan, 5], ... [np.nan, 3, np.nan, 4]], ... columns=list("ABCD")) >>> df A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN NaN NaN 5 3 NaN 3.0 NaN 4 Replace all NaN elements with 0s. >>> df.fillna(0) A B C D 0 0.0 2.0 0.0 0 1 3.0 4.0 0.0 1 2 0.0 0.0 0.0 5 3 0.0 3.0 0.0 4 We can also propagate non-null values forward or backward. >>> df.fillna(method="ffill") A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 3.0 4.0 NaN 5 3 3.0 3.0 NaN 4 Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1, 2, and 3 respectively. >>> values = {"A": 0, "B": 1, "C": 2, "D": 3} >>> df.fillna(value=values) A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 2.0 1 2 0.0 1.0 2.0 5 3 0.0 3.0 2.0 4 Only replace the first NaN element. >>> df.fillna(value=values, limit=1) A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 NaN 1 2 NaN 1.0 NaN 5 3 NaN 3.0 NaN 4 When filling using a DataFrame, replacement happens along the same column names and same indices >>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE")) >>> df.fillna(df2) A B C D 0 0.0 2.0 0.0 0 1 3.0 4.0 0.0 1 2 0.0 0.0 0.0 5 3 0.0 3.0 0.0 4
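Filling with a fixed value (or a per-column mapping) looks only at one cell at a time, which is why it does not depend on row order. A minimal sketch under that reading, with rows modelled as dicts, None as the missing marker, and a hypothetical fillna helper:

```python
def fillna(rows, values):
    """Replace None cells using a per-column mapping; other cells are kept.

    Columns absent from `values` are left unfilled.
    """
    return [
        {col: values.get(col) if cell is None else cell
         for col, cell in row.items()}
        for row in rows
    ]


rows = [
    {"A": None, "B": 2.0, "C": None, "D": 0},
    {"A": 3.0, "B": 4.0, "C": None, "D": 1},
    {"A": None, "B": None, "C": None, "D": 5},
    {"A": None, "B": 3.0, "C": None, "D": 4},
]
values = {"A": 0, "B": 1, "C": 2, "D": 3}

whole = fillna(rows, values)
# Processing two partitions separately gives the same rows.
in_parts = fillna(rows[:2], values) + fillna(rows[2:], values)
```

Because each row is handled independently, splitting the data into partitions does not change the outcome, unlike the method and limit options, which depend on neighbouring rows.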
-
first
(offset)¶ Select initial periods of time series data based on a date offset.
When having a DataFrame with dates as index, this function can select the first few rows based on a date offset.
Parameters: offset (str, DateOffset or dateutil.relativedelta) – The offset length of the data that will be selected. For instance, ‘1M’ will display all the rows having their index within the first month.
Returns: A subset of the caller.
Return type: DeferredSeries or DeferredDataFrame
Raises: TypeError – If the index is not a DatetimeIndex
Differences from pandas
This operation has no known divergences from the pandas API.
See also
last()
- Select final periods of time series based on a date offset.
at_time()
- Select values at a particular time of the day.
between_time()
- Select values between particular times of the day.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the first 3 days:

>>> ts.first('3D')
            A
2018-04-09  1
2018-04-11  2

Notice that the data for the first 3 calendar days was returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
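The calendar-based selection in the example can be modelled with the standard library. This sketch assumes a fixed-length offset expressed as a timedelta (the analogue of '3D') and rows already sorted by date; calendar offsets such as '1M' are not covered, and the function name first is reused only for illustration.

```python
from datetime import date, timedelta


def first(rows, offset):
    """Keep rows whose date falls before (earliest date + offset).

    `rows` is a list of (date, value) pairs sorted by date.
    """
    if not rows:
        return []
    end = rows[0][0] + offset
    return [(day, value) for day, value in rows if day < end]


# Four observations, two days apart, starting 2018-04-09.
ts = [(date(2018, 4, 9) + timedelta(days=2 * i), i + 1) for i in range(4)]
selected = first(ts, timedelta(days=3))
```

The window ends three calendar days after the first observation, so only the rows for 2018-04-09 and 2018-04-11 are kept.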
-
flags
¶ pandas.Series.flags()
is not implemented yet in the Beam DataFrame API. If support for ‘flags’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
floordiv
(**kwargs)¶ Return Integer division of series and other, element-wise (binary operator floordiv).
Equivalent to series // other, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type: DeferredSeries
Differences from pandas
Only level=None is supported
See also
DeferredSeries.rfloordiv()
- Reverse of the Integer division operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.floordiv(b, fill_value=0)
a    1.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64
-
ge
(**kwargs)¶ Return Greater than or equal to of series and other, element-wise (binary operator ge).
Equivalent to series >= other, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.ge(b, fill_value=0)
a     True
b     True
c    False
d    False
e     True
f    False
dtype: bool
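For the ordering and equality comparisons (eq, ge, gt and friends), a label that is missing on both sides compares as False rather than producing a missing value. The following pure-Python model captures that reading; Series are dicts from label to float, and compare_with_fill is a hypothetical helper that does not attempt to cover ne.

```python
import math
import operator


def _is_missing(value):
    """True for an absent label (None) or a NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def compare_with_fill(left, right, op, fill_value=None):
    result = {}
    # Align on the union of labels before comparing.
    for label in sorted(set(left) | set(right)):
        x, y = left.get(label), right.get(label)
        if _is_missing(x) and _is_missing(y):
            # Missing on both sides: never filled, compares as False.
            result[label] = False
            continue
        if fill_value is not None:
            x = fill_value if _is_missing(x) else x
            y = fill_value if _is_missing(y) else y
        if _is_missing(x) or _is_missing(y):
            result[label] = False
        else:
            result[label] = op(x, y)
    return result


a = {"a": 1.0, "b": 1.0, "c": 1.0, "d": math.nan, "e": 1.0}
b = {"a": 0.0, "b": 1.0, "c": 2.0, "d": math.nan, "f": 1.0}
ge_result = compare_with_fill(a, b, operator.ge, fill_value=0)
gt_result = compare_with_fill(a, b, operator.gt, fill_value=0)
```

With the inputs from the pandas example, label e compares 1 against the filled 0, label f compares the filled 0 against 1, and label d is False because it is missing in both inputs.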
-
groupby
(by, level, axis, as_index, group_keys, **kwargs)¶ Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
Parameters: - by (mapping, function, label, or list of labels) – Used to determine the groups for the groupby.
If by is a function, it’s called on each value of the object’s index. If a dict or DeferredSeries is passed, the DeferredSeries or dict VALUES will be used to determine the groups (the DeferredSeries’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Split along rows (0) or columns (1).
- level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
- as_index (bool, default True) – For aggregated output, return object with group labels as the index. Only relevant for DeferredDataFrame input. as_index=False is effectively “SQL-style” grouped output.
- sort (bool, default True) – Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
- group_keys (bool, default True) – When calling apply, add group keys to index to identify pieces.
- squeeze (bool, default False) –
Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
Deprecated since version 1.1.0.
- observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
- dropna (bool, default True) –
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups
New in version 1.1.0.
Returns: Returns a groupby object that contains information about the groups.
Return type: DeferredDataFrameGroupBy
Differences from pandas
as_index and group_keys must both be True.
Aggregations grouping by a categorical column with observed=False set are not currently parallelizable (BEAM-11190).
See also
resample()
- Convenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', ... 'Parrot', 'Parrot'], ... 'Max Speed': [380., 370., 24., 26.]}) >>> df Animal Max Speed 0 Falcon 380.0 1 Falcon 370.0 2 Parrot 24.0 3 Parrot 26.0 >>> df.groupby(['Animal']).mean() Max Speed Animal Falcon 375.0 Parrot 25.0 **Hierarchical Indexes** We can groupby different levels of a hierarchical index using the `level` parameter: >>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'], ... ['Captive', 'Wild', 'Captive', 'Wild']] >>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type')) >>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]}, ... index=index) >>> df Max Speed Animal Type Falcon Captive 390.0 Wild 350.0 Parrot Captive 30.0 Wild 20.0 >>> df.groupby(level=0).mean() Max Speed Animal Falcon 370.0 Parrot 25.0 >>> df.groupby(level="Type").mean() Max Speed Type Captive 210.0 Wild 185.0 We can also choose to include NA in group keys or not by setting `dropna` parameter, the default setting is `True`: >>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]] >>> df = pd.DataFrame(l, columns=["a", "b", "c"]) >>> df.groupby(by=["b"]).sum() a c b 1.0 2 3 2.0 2 5 >>> df.groupby(by=["b"], dropna=False).sum() a c b 1.0 2 3 2.0 2 5 NaN 1 4 >>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]] >>> df = pd.DataFrame(l, columns=["a", "b", "c"]) >>> df.groupby(by="a").sum() b c a a 13.0 13.0 b 12.3 123.0 >>> df.groupby(by="a", dropna=False).sum() b c a a 13.0 13.0 b 12.3 123.0 NaN 12.3 33.0
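The split, apply and combine steps map naturally onto partitioned data. As an illustration only (this is not a description of how Beam executes groupby), a grouped mean can be computed by accumulating a (sum, count) pair per key within each partition and merging the pairs afterwards. The helper names below are hypothetical.

```python
from collections import defaultdict


def partial_sums(rows):
    """Per-partition step: accumulate [sum, count] for each key."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return acc


def merge_means(partials):
    """Combine step: merge per-partition accumulators, then divide."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (subtotal, count) in partial.items():
            total[key][0] += subtotal
            total[key][1] += count
    return {key: subtotal / count for key, (subtotal, count) in total.items()}


# The 'Animal' / 'Max Speed' rows from the first example, split in two.
partition_1 = [("Falcon", 380.0), ("Parrot", 24.0)]
partition_2 = [("Falcon", 370.0), ("Parrot", 26.0)]
means = merge_means([partial_sums(partition_1), partial_sums(partition_2)])
```

The merged result matches the grouped means shown in the pandas example (375.0 for Falcon and 25.0 for Parrot), regardless of how the rows are distributed across partitions.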
-
gt
(**kwargs)¶ Return Greater than of series and other, element-wise (binary operator gt).
Equivalent to series > other, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.gt(b, fill_value=0) a True b False c False d False e True f False dtype: bool
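To make the role of alignment and fill_value concrete, here is a minimal pandas-only sketch contrasting the plain > operator with gt on two Series whose labels differ. It assumes eager pandas objects; with a DeferredSeries the same calls would build a deferred expression instead of returning values immediately. The variable names operator_failed and result are illustrative only.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])

# The plain operator refuses to compare Series whose labels differ.
try:
    a > b
    operator_failed = False
except ValueError:
    operator_failed = True

# gt aligns the two indexes first. A label missing on one side ('e', 'f')
# is filled with fill_value before comparing. 'd' is NaN on both sides,
# so it stays missing and compares as False.
result = a.gt(b, fill_value=0)
print(result)
```

The method form is therefore the one to reach for when the two inputs may not share exactly the same labels.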
-
hist
(**kwargs)¶ pandas.DataFrame.hist()
is not yet supported in the Beam DataFrame API because it is a plotting tool.
For more information see https://s.apache.org/dataframe-plotting-tools.
-
iloc
¶ Purely integer-location based indexing for selection by position.
.iloc[]
is primarily integer position based (from 0
to length-1
of the axis), but may also be used with a boolean array.
Allowed inputs are:
- An integer, e.g.
5
. - A list or array of integers, e.g.
[4, 3, 0]
. - A slice object with ints, e.g.
1:7
. - A boolean array.
- A
callable
function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
.iloc
will raise IndexError
if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
See more at Selection by Position.
Differences from pandas
Position-based indexing with iloc is order-sensitive in almost every case. Beam DataFrame users should prefer label-based indexing with loc.
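The following pandas-only sketch illustrates what order sensitivity means here. It assumes a small eager DataFrame and simulates a different row order by reindexing; in a Beam pipeline the row order of a PCollection is simply not defined. The names reordered, by_label and by_position are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'speed': [1, 4, 7]},
                  index=['cobra', 'viper', 'sidewinder'])

# The same rows in a different order, as could happen when data is
# partitioned and processed in parallel.
reordered = df.loc[['sidewinder', 'cobra', 'viper']]

# Label-based lookup gives the same answer for both orderings.
by_label = (int(df.loc['viper', 'speed']),
            int(reordered.loc['viper', 'speed']))

# Position-based lookup depends on which row happens to come first.
by_position = (int(df.iloc[0]['speed']),
               int(reordered.iloc[0]['speed']))

print(by_label, by_position)
```

Because the label-based result does not change with row order, it is the form that carries over to a deferred, parallel setting.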
See also
DeferredDataFrame.iat
- Fast integer location scalar accessor.
DeferredDataFrame.loc
- Purely label-location based indexer for selection by label.
DeferredSeries.iloc
- Purely integer-location based indexing for selection by position.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, ... {'a': 100, 'b': 200, 'c': 300, 'd': 400}, ... {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }] >>> df = pd.DataFrame(mydict) >>> df a b c d 0 1 2 3 4 1 100 200 300 400 2 1000 2000 3000 4000 **Indexing just the rows** With a scalar integer. >>> type(df.iloc[0]) <class 'pandas.core.series.Series'> >>> df.iloc[0] a 1 b 2 c 3 d 4 Name: 0, dtype: int64 With a list of integers. >>> df.iloc[[0]] a b c d 0 1 2 3 4 >>> type(df.iloc[[0]]) <class 'pandas.core.frame.DataFrame'> >>> df.iloc[[0, 1]] a b c d 0 1 2 3 4 1 100 200 300 400 With a `slice` object. >>> df.iloc[:3] a b c d 0 1 2 3 4 1 100 200 300 400 2 1000 2000 3000 4000 With a boolean mask the same length as the index. >>> df.iloc[[True, False, True]] a b c d 0 1 2 3 4 2 1000 2000 3000 4000 With a callable, useful in method chains. The `x` passed to the ``lambda`` is the DataFrame being sliced. This selects the rows whose index label even. >>> df.iloc[lambda x: x.index % 2 == 0] a b c d 0 1 2 3 4 2 1000 2000 3000 4000 **Indexing both axes** You can mix the indexer types for the index and columns. Use ``:`` to select the entire axis. With scalar integers. >>> df.iloc[0, 1] 2 With lists of integers. >>> df.iloc[[0, 2], [1, 3]] b d 0 2 4 2 2000 4000 With `slice` objects. >>> df.iloc[1:3, 0:3] a b c 1 100 200 300 2 1000 2000 3000 With a boolean array whose length matches the columns. >>> df.iloc[:, [True, False, True, False]] a c 0 1 3 1 100 300 2 1000 3000 With a callable function that expects the Series or DataFrame. >>> df.iloc[:, lambda df: [0, 2]] a c 0 1 3 1 100 300 2 1000 3000
-
index
¶ The index (row labels) of the DataFrame.
Differences from pandas
This operation has no known divergences from the pandas API.
-
infer_object
(**kwargs)¶ pandas.Series.infer_objects()
is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.
For more information see https://s.apache.org/dataframe-non-deferred-columns.
-
infer_objects
(**kwargs)¶ pandas.Series.infer_objects()
is not implemented yet in the Beam DataFrame API.
If support for ‘infer_objects’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
isin
(**kwargs)¶ Whether each element in the DataFrame is contained in values.
Parameters: values (iterable, DeferredSeries, DeferredDataFrame or dict) – The result will only be true at a location if all the labels match. If values is a DeferredSeries, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DeferredDataFrame, then both the index and column labels must match.
Returns: DeferredDataFrame of booleans showing whether each element in the DeferredDataFrame is contained in values.
Return type: DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.eq()
- Equality test for DeferredDataFrame.
DeferredSeries.isin()
- Equivalent method on DeferredSeries.
DeferredSeries.str.contains()
- Test if pattern or regex is contained within a string of a DeferredSeries or Index.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]}, ... index=['falcon', 'dog']) >>> df num_legs num_wings falcon 2 2 dog 4 0 When ``values`` is a list check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings) >>> df.isin([0, 2]) num_legs num_wings falcon True True dog False True When ``values`` is a dict, we can pass values to check for each column separately: >>> df.isin({'num_wings': [0, 3]}) num_legs num_wings falcon False False dog False True When ``values`` is a Series or DataFrame the index and column must match. Note that 'falcon' does not match based on the number of legs in df2. >>> other = pd.DataFrame({'num_legs': [8, 2], 'num_wings': [0, 2]}, ... index=['spider', 'falcon']) >>> df.isin(other) num_legs num_wings falcon True True dog False False
-
item
(**kwargs)¶ pandas.Series.item()
is not implemented yet in the Beam DataFrame API.
If support for ‘item’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
last
(offset)¶ Select final periods of time series data based on a date offset.
For a DataFrame with a sorted DatetimeIndex, this function selects the last few rows based on a date offset.
Parameters: offset (str, DateOffset, dateutil.relativedelta) – The offset length of the data that will be selected. For instance, ‘3D’ will display all the rows having their index within the last 3 days.
Returns: A subset of the caller.
Return type: DeferredSeries or DeferredDataFrame
Raises: TypeError – If the index is not a DatetimeIndex
Differences from pandas
This operation has no known divergences from the pandas API.
See also
first()
- Select initial periods of time series based on a date offset.
at_time()
- Select values at a particular time of the day.
between_time()
- Select values between particular times of the day.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D') >>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i) >>> ts A 2018-04-09 1 2018-04-11 2 2018-04-13 3 2018-04-15 4 Get the rows for the last 3 days: >>> ts.last('3D') A 2018-04-13 3 2018-04-15 4 Notice the data for 3 last calendar days were returned, not the last 3 observed days in the dataset, and therefore data for 2018-04-11 was not returned.
-
le
(**kwargs)¶ Return Less than or equal to of series and other, element-wise (binary operator le).
Equivalent to
series <= other
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters:
- other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.le(b, fill_value=0) a False b True c True d False e False f True dtype: bool
-
length
()¶ Alternative to
len(df)
which returns a deferred result that can be used in arithmetic with DeferredSeries
or DeferredDataFrame
instances.
-
loc
¶ Access a group of rows and columns by label(s) or a boolean array.
.loc[]
is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f'.
  Warning: Note that contrary to usual python slices, both the start and the stop are included.
- A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
- An alignable boolean Series. The index of the key will be aligned before masking.
- An alignable Index. The Index of the returned selection will be the input.
- A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
See more at Selection by Label.
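As a quick illustration of two of the input kinds listed above, this pandas-only sketch shows the inclusive label slice next to an ordinary half-open positional slice, and a boolean Series key that is aligned by label before masking. It assumes eager pandas objects; the names label_slice, position_slice, key and masked are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Label slices include both endpoints.
label_slice = s.loc['b':'c']

# Positional slices follow the usual half-open Python convention.
position_slice = s.iloc[1:2]

# A boolean Series key is matched to rows by label, not by position,
# so the order in which the key lists its labels does not matter.
key = pd.Series([True, False, False, True], index=['d', 'c', 'b', 'a'])
masked = s.loc[key]

print(label_slice.tolist(), position_slice.tolist(), masked.tolist())
```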
Raises:
- KeyError – If any items are not found.
- IndexingError – If an indexed key is passed and its index is unalignable to the frame index.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.at
- Access a single value for a row/column label pair.
DeferredDataFrame.iloc
- Access group of rows and columns by integer position(s).
DeferredDataFrame.xs
- Returns a cross-section (row(s) or column(s)) from the DeferredSeries/DeferredDataFrame.
DeferredSeries.loc
- Access group of values using labels.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
**Getting values** >>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]], ... index=['cobra', 'viper', 'sidewinder'], ... columns=['max_speed', 'shield']) >>> df max_speed shield cobra 1 2 viper 4 5 sidewinder 7 8 Single label. Note this returns the row as a Series. >>> df.loc['viper'] max_speed 4 shield 5 Name: viper, dtype: int64 List of labels. Note using ``[[]]`` returns a DataFrame. >>> df.loc[['viper', 'sidewinder']] max_speed shield viper 4 5 sidewinder 7 8 Single label for row and column >>> df.loc['cobra', 'shield'] 2 Slice with labels for row and single label for column. As mentioned above, note that both the start and stop of the slice are included. >>> df.loc['cobra':'viper', 'max_speed'] cobra 1 viper 4 Name: max_speed, dtype: int64 Boolean list with the same length as the row axis >>> df.loc[[False, False, True]] max_speed shield sidewinder 7 8 Alignable boolean Series: >>> df.loc[pd.Series([False, True, False], ... index=['viper', 'sidewinder', 'cobra'])] max_speed shield sidewinder 7 8 Index (same behavior as ``df.reindex``) >>> df.loc[pd.Index(["cobra", "viper"], name="foo")] max_speed shield foo cobra 1 2 viper 4 5 Conditional that returns a boolean Series >>> df.loc[df['shield'] > 6] max_speed shield sidewinder 7 8 Conditional that returns a boolean Series with column labels specified >>> df.loc[df['shield'] > 6, ['max_speed']] max_speed sidewinder 7 Callable that returns a boolean Series >>> df.loc[lambda df: df['shield'] == 8] max_speed shield sidewinder 7 8 **Setting values** Set value for all items matching the list of labels >>> df.loc[['viper', 'sidewinder'], ['shield']] = 50 >>> df max_speed shield cobra 1 2 viper 4 50 sidewinder 7 50 Set value for an entire row >>> df.loc['cobra'] = 10 >>> df max_speed shield cobra 10 10 viper 4 50 sidewinder 7 50 Set value for an entire column >>> df.loc[:, 'max_speed'] = 30 >>> df max_speed shield cobra 30 10 viper 30 50 sidewinder 30 50 Set value for rows matching callable condition >>> df.loc[df['shield'] > 
35] = 0 >>> df max_speed shield cobra 30 10 viper 0 0 sidewinder 0 0 **Getting values on a DataFrame with an index that has integer labels** Another example using integers for the index >>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]], ... index=[7, 8, 9], columns=['max_speed', 'shield']) >>> df max_speed shield 7 1 2 8 4 5 9 7 8 Slice with integer labels for rows. As mentioned above, note that both the start and stop of the slice are included. >>> df.loc[7:9] max_speed shield 7 1 2 8 4 5 9 7 8 **Getting values with a MultiIndex** A number of examples using a DataFrame with a MultiIndex >>> tuples = [ ... ('cobra', 'mark i'), ('cobra', 'mark ii'), ... ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'), ... ('viper', 'mark ii'), ('viper', 'mark iii') ... ] >>> index = pd.MultiIndex.from_tuples(tuples) >>> values = [[12, 2], [0, 4], [10, 20], ... [1, 4], [7, 1], [16, 36]] >>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index) >>> df max_speed shield cobra mark i 12 2 mark ii 0 4 sidewinder mark i 10 20 mark ii 1 4 viper mark ii 7 1 mark iii 16 36 Single label. Note this returns a DataFrame with a single index. >>> df.loc['cobra'] max_speed shield mark i 12 2 mark ii 0 4 Single index tuple. Note this returns a Series. >>> df.loc[('cobra', 'mark ii')] max_speed 0 shield 4 Name: (cobra, mark ii), dtype: int64 Single label for row and column. Similar to passing in a tuple, this returns a Series. >>> df.loc['cobra', 'mark i'] max_speed 12 shield 2 Name: (cobra, mark i), dtype: int64 Single tuple. Note using ``[[]]`` returns a DataFrame. 
>>> df.loc[[('cobra', 'mark ii')]] max_speed shield cobra mark ii 0 4 Single tuple for the index with a single label for the column >>> df.loc[('cobra', 'mark i'), 'shield'] 2 Slice from index tuple to single label >>> df.loc[('cobra', 'mark i'):'viper'] max_speed shield cobra mark i 12 2 mark ii 0 4 sidewinder mark i 10 20 mark ii 1 4 viper mark ii 7 1 mark iii 16 36 Slice from index tuple to index tuple >>> df.loc[('cobra', 'mark i'):('viper', 'mark ii')] max_speed shield cobra mark i 12 2 mark ii 0 4 sidewinder mark i 10 20 mark ii 1 4 viper mark ii 7 1
-
lt
(**kwargs)¶ Return Less than of series and other, element-wise (binary operator lt).
Equivalent to
series < other
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters:
- other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.lt(b, fill_value=0) a False b False c True d False e False f True dtype: bool
-
mask
(cond, **kwargs)¶ mask is not parallelizable when
errors="ignore"
is specified.
-
mod
(**kwargs)¶ Return Modulo of series and other, element-wise (binary operator mod).
Equivalent to
series % other
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters:
- other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.rmod()
- Reverse of the Modulo operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.mod(b, fill_value=0) a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64
-
mul
(**kwargs)¶ Return Multiplication of series and other, element-wise (binary operator mul).
Equivalent to
series * other
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters:
- other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.rmul()
- Reverse of the Multiplication operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.multiply(b, fill_value=0) a 1.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64
-
multiply
(**kwargs)¶ Return Multiplication of series and other, element-wise (binary operator mul).
Equivalent to
series * other
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters:
- other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.rmul()
- Reverse of the Multiplication operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.multiply(b, fill_value=0) a 1.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64
-
ndim
¶ Return an int representing the number of axes / array dimensions.
Return 1 if Series. Otherwise return 2 if DataFrame.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
ndarray.ndim
- Number of array dimensions.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3}) >>> s.ndim 1 >>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df.ndim 2
-
ne
(**kwargs)¶ Return Not equal to of series and other, element-wise (binary operator ne).
Equivalent to
series != other
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters:
- other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.ne(b, fill_value=0) a False b True c True d True e True dtype: bool
-
pad
(*args, **kwargs)¶ Synonym for
DataFrame.fillna()
with method='ffill'.
Returns: Object with missing values filled or None if inplace=True.
Return type: DeferredSeries/DeferredDataFrame or None
Differences from pandas
pad is only supported for axis=”columns”. axis=”index” is order-sensitive.
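A short pandas-only sketch may help show why the column-wise case can be supported. It uses ffill, on the assumption that it behaves the same as pad (pandas documents pad as a synonym for forward fill, and some pandas releases no longer ship pad). Each row is filled using only its own values, so the result does not depend on the order of the rows. The name filled is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan],
                   'b': [np.nan, np.nan],
                   'c': [3.0, 4.0]})

# Forward fill along each row, left to right across the columns.
# Row 0: 'b' takes the value of 'a'. Row 1: nothing precedes the
# missing values, so 'a' and 'b' stay NaN.
filled = df.ffill(axis='columns')
print(filled)
```

Filling along axis="index" would instead copy values from the previous row, which is only meaningful when the rows have a defined order.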
-
pipe
(func, *args, **kwargs)¶ Apply func(self, *args, **kwargs).
Parameters:
- func (function) – Function to apply to the DeferredSeries/DeferredDataFrame. args and kwargs are passed into func. Alternatively, a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the DeferredSeries/DeferredDataFrame.
- args (iterable, optional) – Positional arguments passed into func.
- kwargs (mapping, optional) – A dictionary of keyword arguments passed into func.
Returns: object
Return type: the return type of func.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.apply()
- Apply a function along input axis of DeferredDataFrame.
DeferredDataFrame.applymap()
- Apply a function elementwise on a whole DeferredDataFrame.
DeferredSeries.map()
- Apply a mapping correspondence on a
DeferredSeries
.
Notes
Use .pipe when chaining together functions that expect DeferredSeries, DeferredDataFrames or GroupBy objects. Instead of writing
>>> func(g(h(df), arg1=a), arg2=b, arg3=c) # doctest: +SKIP
You can write
>>> (df.pipe(h) ... .pipe(g, arg1=a) ... .pipe(func, arg2=b, arg3=c) ... ) # doctest: +SKIP
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose func takes its data as arg2:
>>> (df.pipe(h) ... .pipe(g, arg1=a) ... .pipe((func, 'arg2'), arg1=a, arg3=c) ... ) # doctest: +SKIP
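The placeholders in the notes above are not runnable, so here is a minimal pandas-only sketch of both calling conventions. The helper functions add_total and scale are hypothetical and exist only for this illustration; with a DeferredDataFrame the chain would produce a deferred result rather than concrete values.

```python
import pandas as pd


def add_total(df):
    """Append a column holding the row-wise sum of 'x' and 'y'."""
    return df.assign(total=df['x'] + df['y'])


def scale(factor, data):
    """Multiply every value by factor; the frame is the second argument."""
    return data * factor


df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# add_total takes the frame first, so it can be piped directly.
# scale does not, so the (callable, data_keyword) tuple tells pipe to
# pass the frame as the 'data' keyword.
result = (df.pipe(add_total)
            .pipe((scale, 'data'), factor=10))
print(result)
```

The chained form reads top to bottom in the order the steps run, which is the main reason to prefer it over nested calls.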
-
pow
(**kwargs)¶ Return Exponential power of series and other, element-wise (binary operator pow).
Equivalent to
series ** other
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters:
- other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.rpow()
- Reverse of the Exponential power operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.pow(b, fill_value=0) a 1.0 b 1.0 c 1.0 d 0.0 e NaN dtype: float64
-
radd
(**kwargs)¶ Return Addition of series and other, element-wise (binary operator radd).
Equivalent to
other + series
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters:
- other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.add()
- Element-wise Addition, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
rank
(**kwargs)¶ pandas.Series.rank()
is not implemented yet in the Beam DataFrame API.
If support for ‘rank’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
rdiv
(**kwargs)¶ Return Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to
other / series
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters:
- other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.truediv()
- Element-wise Floating division, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divide(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
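The pandas example above demonstrates the forward operator (divide), so the following pandas-only sketch shows the reversed form directly. It assumes eager pandas Series; the names scalar_reversed and series_reversed are illustrative only.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 4.0], index=['x', 'y', 'z'])
b = pd.Series([8.0, 8.0], index=['x', 'w'])

# a.rdiv(2) computes 2 / a, element by element.
scalar_reversed = a.rdiv(2)

# a.rdiv(b, fill_value=1) computes b / a after aligning on labels.
# A label present on only one side is filled with 1 on the other side.
series_reversed = a.rdiv(b, fill_value=1)

print(scalar_reversed.to_dict())
print(series_reversed.to_dict())
```

Here 'w' exists only in b, so it is computed as 8 / 1, while 'y' and 'z' exist only in a and are computed as 1 / 2 and 1 / 4.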
-
rdivmod
(**kwargs)¶ Return Integer division and modulo of series and other, element-wise (binary operator rdivmod).
Equivalent to
other divmod series
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters:
- other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type: 2-Tuple of DeferredSeries
Differences from pandas
Only level=None is supported
See also
DeferredSeries.divmod()
- Element-wise Integer division and modulo, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divmod(b, fill_value=0) (a 1.0 b NaN c NaN d 0.0 e NaN dtype: float64, a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64)
-
reindex
(**kwargs)¶ pandas.DataFrame.reindex()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.
For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
reindex_like
(**kwargs)¶ pandas.Series.reindex_like()
is not implemented yet in the Beam DataFrame API.
If support for ‘reindex_like’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
reorder_levels
(**kwargs)¶ Rearrange index levels using input order. May not drop or duplicate levels.
Parameters: - order (list of int or list of str) – List representing new level order. Reference level by number (position) or by key (label).
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Where to reorder levels.
Returns:
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
-
replace
(to_replace, value, limit, method, **kwargs)¶ Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically.
This differs from updating with
.loc
or .iloc
, which require you to specify a location to update with some value.
Parameters: - to_replace (str, regex, list, dict, DeferredSeries, int, float, or None) –
How to find the values that will be replaced.
- numeric, str or regex:
  - numeric: numeric values equal to to_replace will be replaced with value
  - str: string exactly matching to_replace will be replaced with value
  - regex: regexs matching to_replace will be replaced with value
- list of str, regex, or numeric:
  - First, if to_replace and value are both lists, they must be the same length.
  - Second, if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
  - str, regex and numeric rules apply as above.
- dict:
  - Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be None.
  - For a DeferredDataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
  - For a DeferredDataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The value parameter should be None to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
- None:
  - This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or DeferredSeries of such elements. If value is also None then this must be a nested dictionary or DeferredSeries.
See the examples section for examples of each of these.
- value (scalar, dict, list, str, regex, default None) – Value to replace any values matching to_replace with. For a DeferredDataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
- inplace (bool, default False) – If True, performs operation inplace and returns None.
- limit (int, default None) – Maximum size gap to forward or backward fill.
- regex (bool or same types as to_replace, default False) – Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions, in which case to_replace must be None.
- method ({‘pad’, ‘ffill’, ‘bfill’, None}) – The method to use for replacement, when to_replace is a scalar, list or tuple and value is None.
Changed in version 0.23.0: Added to DeferredDataFrame.
Returns: Object after replacement.
Return type:
Raises:
- AssertionError – If regex is not a bool and to_replace is not None.
- TypeError –
  - If to_replace is not a scalar, array-like, dict, or None.
  - If to_replace is a dict and value is not a list, dict, ndarray, or DeferredSeries.
  - If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or DeferredSeries.
  - When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced.
- ValueError – If a list or an ndarray is passed to to_replace and value but they are not the same length.
Differences from pandas
method is not supported in the Beam DataFrame API because it is order-sensitive. It cannot be specified.
If limit is specified this operation is not parallelizable.
See also
DeferredDataFrame.fillna()
- Fill NA values.
DeferredDataFrame.where()
- Replace values based on boolean condition.
DeferredSeries.str.replace()
- Simple string replacement.
Notes
- Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
- Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
- This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
- When dict is used as the to_replace value, it is like key(s) in the dict are the to_replace part and value(s) in the dict are the value parameter.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
**Scalar `to_replace` and `value`** >>> s = pd.Series([0, 1, 2, 3, 4]) >>> s.replace(0, 5) 0 5 1 1 2 2 3 3 4 4 dtype: int64 >>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4], ... 'B': [5, 6, 7, 8, 9], ... 'C': ['a', 'b', 'c', 'd', 'e']}) >>> df.replace(0, 5) A B C 0 5 5 a 1 1 6 b 2 2 7 c 3 3 8 d 4 4 9 e **List-like `to_replace`** >>> df.replace([0, 1, 2, 3], 4) A B C 0 4 5 a 1 4 6 b 2 4 7 c 3 4 8 d 4 4 9 e >>> df.replace([0, 1, 2, 3], [4, 3, 2, 1]) A B C 0 4 5 a 1 3 6 b 2 2 7 c 3 1 8 d 4 4 9 e >>> s.replace([1, 2], method='bfill') 0 0 1 3 2 3 3 3 4 4 dtype: int64 **dict-like `to_replace`** >>> df.replace({0: 10, 1: 100}) A B C 0 10 5 a 1 100 6 b 2 2 7 c 3 3 8 d 4 4 9 e >>> df.replace({'A': 0, 'B': 5}, 100) A B C 0 100 100 a 1 1 6 b 2 2 7 c 3 3 8 d 4 4 9 e >>> df.replace({'A': {0: 100, 4: 400}}) A B C 0 100 5 a 1 1 6 b 2 2 7 c 3 3 8 d 4 400 9 e **Regular expression `to_replace`** >>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'], ... 'B': ['abc', 'bar', 'xyz']}) >>> df.replace(to_replace=r'^ba.$', value='new', regex=True) A B 0 new abc 1 foo new 2 bait xyz >>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True) A B 0 new abc 1 foo bar 2 bait xyz >>> df.replace(regex=r'^ba.$', value='new') A B 0 new abc 1 foo new 2 bait xyz >>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'}) A B 0 new abc 1 xyz new 2 bait xyz >>> df.replace(regex=[r'^ba.$', 'foo'], value='new') A B 0 new abc 1 new new 2 bait xyz Compare the behavior of ``s.replace({'a': None})`` and ``s.replace('a', None)`` to understand the peculiarities of the `to_replace` parameter: >>> s = pd.Series([10, 'a', 'a', 'b', 'a']) When one uses a dict as the `to_replace` value, it is like the value(s) in the dict are equal to the `value` parameter. 
``s.replace({'a': None})`` is equivalent to ``s.replace(to_replace={'a': None}, value=None, method=None)``: >>> s.replace({'a': None}) 0 10 1 None 2 None 3 b 4 None dtype: object When ``value=None`` and `to_replace` is a scalar, list or tuple, `replace` uses the method parameter (default 'pad') to do the replacement. So this is why the 'a' values are being replaced by 10 in rows 1 and 2 and 'b' in row 4 in this case. The command ``s.replace('a', None)`` is actually equivalent to ``s.replace(to_replace='a', value=None, method='pad')``: >>> s.replace('a', None) 0 10 1 10 2 10 3 b 4 b dtype: object
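The Notes for replace state that regex substitution follows the rules of re.sub and only applies to strings. The sketch below models those two rules on a plain Python list. The helper name replace_with_regex is hypothetical and this is not how Beam or pandas implement replace; it only mirrors the `^ba.$` example above and shows a numeric value passing through untouched.

```python
import re


def replace_with_regex(values, pattern, replacement):
    """Apply re.sub to string elements only; leave other types unchanged."""
    compiled = re.compile(pattern)
    return [
        compiled.sub(replacement, v) if isinstance(v, str) else v
        for v in values
    ]


column = ['bat', 'foo', 'bait', 1.5]
result = replace_with_regex(column, r'^ba.$', 'new')
print(result)  # ['new', 'foo', 'bait', 1.5]
```

'bait' is unchanged because `^ba.$` matches exactly three characters, and 1.5 is unchanged because it is not a string.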
-
resample
(**kwargs)¶ pandas.DataFrame.resample()
is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics. For more information see https://s.apache.org/dataframe-event-time-semantics.
-
reset_index
(level=None, **kwargs)¶ Reset the index, or a level of it.
Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.
Parameters: - level (int, str, tuple, or list, default None) – Only remove the given levels from the index. Removes all levels by default.
- drop (bool, default False) – Do not try to insert index into dataframe columns. This resets the index to the default integer index.
- inplace (bool, default False) – Modify the DeferredDataFrame in place (do not create a new object).
- col_level (int or str, default 0) – If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
- col_fill (object, default '') – If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
Returns: DeferredDataFrame with the new index or None if inplace=True.
Return type:
Differences from pandas
Dropping the entire index (e.g. with reset_index(level=None)) is not parallelizable. It is also only guaranteed that the newly generated index values will be unique. The Beam DataFrame API makes no guarantee that the same index values as the equivalent pandas operation will be generated, because that implementation is order-sensitive.
See also
DeferredDataFrame.set_index()
- Opposite of reset_index.
DeferredDataFrame.reindex()
- Change to new indices or expand indices.
DeferredDataFrame.reindex_like()
- Change to same indices as other DeferredDataFrame.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame([('bird', 389.0), ... ('bird', 24.0), ... ('mammal', 80.5), ... ('mammal', np.nan)], ... index=['falcon', 'parrot', 'lion', 'monkey'], ... columns=('class', 'max_speed')) >>> df class max_speed falcon bird 389.0 parrot bird 24.0 lion mammal 80.5 monkey mammal NaN When we reset the index, the old index is added as a column, and a new sequential index is used: >>> df.reset_index() index class max_speed 0 falcon bird 389.0 1 parrot bird 24.0 2 lion mammal 80.5 3 monkey mammal NaN We can use the `drop` parameter to avoid the old index being added as a column: >>> df.reset_index(drop=True) class max_speed 0 bird 389.0 1 bird 24.0 2 mammal 80.5 3 mammal NaN You can also use `reset_index` with `MultiIndex`. >>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'), ... ('bird', 'parrot'), ... ('mammal', 'lion'), ... ('mammal', 'monkey')], ... names=['class', 'name']) >>> columns = pd.MultiIndex.from_tuples([('speed', 'max'), ... ('species', 'type')]) >>> df = pd.DataFrame([(389.0, 'fly'), ... ( 24.0, 'fly'), ... ( 80.5, 'run'), ... (np.nan, 'jump')], ... index=index, ... columns=columns) >>> df speed species max type class name bird falcon 389.0 fly parrot 24.0 fly mammal lion 80.5 run monkey NaN jump If the index has multiple levels, we can reset a subset of them: >>> df.reset_index(level='class') class speed species max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump If we are not dropping the index, by default, it is placed in the top level. 
We can place it in another level: >>> df.reset_index(level='class', col_level=1) speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump When the index is inserted under another level, we can specify under which one with the parameter `col_fill`: >>> df.reset_index(level='class', col_level=1, col_fill='species') species speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump If we specify a nonexistent level for `col_fill`, it is created: >>> df.reset_index(level='class', col_level=1, col_fill='genus') genus speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
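The "Differences from pandas" note for reset_index says the new index values are only guaranteed to be unique, not sequential in row order. The sketch below shows one hypothetical way a partitioned system could hand out unique integers without a global ordering. It is purely illustrative: the function name, the partition layout, and the numbering scheme are assumptions, and none of it describes Beam's actual implementation.

```python
def assign_unique_ids(partitions):
    """Give every row a unique integer without coordinating across partitions.

    Partition p hands out p, p + n, p + 2n, ... where n is the number of
    partitions, so no two partitions can ever produce the same value.
    """
    n = len(partitions)
    return [
        [(p + i * n, row) for i, row in enumerate(rows)]
        for p, rows in enumerate(partitions)
    ]


# Hypothetical split of the four example rows across three workers.
partitions = [['falcon', 'parrot'], ['lion'], ['monkey']]
labelled = assign_unique_ids(partitions)
ids = [new_id for part in labelled for new_id, _ in part]
print(ids)  # [0, 3, 1, 2]
```

The ids are unique, but they are not the 0, 1, 2, 3 sequence that pandas would produce for the same rows, which is the kind of divergence the note warns about.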
-
rfloordiv
(**kwargs)¶ Return Integer division of series and other, element-wise (binary operator rfloordiv).
Equivalent to
other // series
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.floordiv()
- Element-wise Integer division, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.floordiv(b, fill_value=0) a 1.0 b NaN c NaN d 0.0 e NaN dtype: float64
-
rmod
(**kwargs)¶ Return Modulo of series and other, element-wise (binary operator rmod).
Equivalent to
other % series
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.mod()
- Element-wise Modulo, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.mod(b, fill_value=0) a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64
-
rmul
(**kwargs)¶ Return Multiplication of series and other, element-wise (binary operator rmul).
Equivalent to
other * series
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.mul()
- Element-wise Multiplication, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.multiply(b, fill_value=0) a 1.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64
-
rolling
(**kwargs)¶ pandas.DataFrame.rolling()
is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics. For more information see https://s.apache.org/dataframe-event-time-semantics.
-
rpow
(**kwargs)¶ Return Exponential power of series and other, element-wise (binary operator rpow).
Equivalent to
other ** series
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.pow()
- Element-wise Exponential power, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.pow(b, fill_value=0) a 1.0 b 1.0 c 1.0 d 0.0 e NaN dtype: float64
-
rsub
(**kwargs)¶ Return Subtraction of series and other, element-wise (binary operator rsub).
Equivalent to
other - series
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.sub()
- Element-wise Subtraction, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.subtract(b, fill_value=0) a 0.0 b 1.0 c 1.0 d -1.0 e NaN dtype: float64
-
rtruediv
(**kwargs)¶ Return Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to
other / series
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.truediv()
- Element-wise Floating division, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divide(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
-
set_flags
(**kwargs)¶ pandas.Series.set_flags()
is not implemented yet in the Beam DataFrame API. If support for ‘set_flags’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
size
¶ Return an int representing the number of elements in this object.
Return the number of rows if Series. Otherwise return the number of rows times number of columns if DataFrame.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
ndarray.size
- Number of elements in the array.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3}) >>> s.size 3 >>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df.size 4
-
sort_index
(axis, **kwargs)¶ Sort object by labels (along an axis).
Returns a new DataFrame sorted by label if inplace argument is
False
, otherwise updates the original DataFrame and returns None.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
- level (int or level name or list of ints or list of level names) – If not None, sort on values in specified index level(s).
- ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
- inplace (bool, default False) – If True, perform operation in-place.
- kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also
numpy.sort()
for more information. mergesort and stable are the only stable algorithms. For DeferredDataFrames, this option is only applied when sorting on a single column or label.
- na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end. Not implemented for MultiIndex.
- sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
- ignore_index (bool, default False) –
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
- key (callable, optional) –
If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin
sorted()
function, with the notable difference that this key function should be vectorized. It should expect an Index
and return an Index
of the same shape. For MultiIndex inputs, the key is applied per level.
New in version 1.1.0.
Returns: The original DeferredDataFrame sorted by the labels or None if inplace=True.
Return type:
Differences from pandas
axis=index is not allowed because it imposes an ordering on the dataset, and we cannot guarantee it will be maintained (see https://s.apache.org/dataframe-order-sensitive-operations). Only axis=columns is allowed.
See also
DeferredSeries.sort_index()
- Sort DeferredSeries by the index.
DeferredDataFrame.sort_values()
- Sort DeferredDataFrame by the value.
DeferredSeries.sort_values()
- Sort DeferredSeries by the value.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150], ... columns=['A']) >>> df.sort_index() A 1 4 29 2 100 1 150 5 234 3 By default, it sorts in ascending order, to sort in descending order, use ``ascending=False`` >>> df.sort_index(ascending=False) A 234 3 150 5 100 1 29 2 1 4 A key function can be specified which is applied to the index before sorting. For a ``MultiIndex`` this is applied to each level separately. >>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd']) >>> df.sort_index(key=lambda x: x.str.lower()) a A 1 b 2 C 3 d 4
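The key parameter above is described as similar to the key argument of the builtin sorted(), except that it is vectorized: it receives the whole index at once and must return something of the same shape. The sketch below models that contract with plain lists. sort_labels is a hypothetical helper, not a Beam or pandas function, and it says nothing about which axes Beam permits.

```python
def sort_labels(labels, key=None):
    """Sort labels using a vectorized key: key(all_labels) -> all_sort_keys."""
    labels = list(labels)
    # The key is called once with the full sequence, not once per element.
    keyed = labels if key is None else list(key(labels))
    if len(keyed) != len(labels):
        raise ValueError("key must return one sort key per label")
    order = sorted(range(len(labels)), key=keyed.__getitem__)
    return [labels[i] for i in order]


labels = ['d', 'C', 'b', 'A']
default_order = sort_labels(labels)
folded_order = sort_labels(labels, key=lambda idx: [s.lower() for s in idx])
print(default_order)  # ['A', 'C', 'b', 'd']
print(folded_order)   # ['A', 'b', 'C', 'd']
```

The second call corresponds to the `key=lambda x: x.str.lower()` example above, where the lambda operates on the whole index rather than on a single label.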
-
sort_values
(axis, **kwargs)¶ sort_values
is not implemented.
It is not implemented for axis=index because it imposes an ordering on the dataset, and it likely will not be maintained (see https://s.apache.org/dataframe-order-sensitive-operations).
It is not implemented for axis=columns because it makes the order of the columns depend on the data (see https://s.apache.org/dataframe-non-deferred-columns).
-
sparse
¶ pandas.DataFrame.sparse()
is not implemented yet in the Beam DataFrame API. If support for ‘sparse’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-12425.
-
squeeze
(**kwargs)¶ pandas.Series.squeeze()
is not implemented yet in the Beam DataFrame API. If support for ‘squeeze’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
sub
(**kwargs)¶ Return Subtraction of series and other, element-wise (binary operator sub).
Equivalent to
series - other
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.rsub()
- Reverse of the Subtraction operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.subtract(b, fill_value=0) a 0.0 b 1.0 c 1.0 d -1.0 e NaN dtype: float64
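The fill_value rule in the parameter list above (fill when one side is missing, stay missing when both sides are) can be modelled with plain dictionaries. The sketch below is a pure-Python model of that rule only. The aligned_op helper is hypothetical and is not the Beam or pandas implementation; the dictionaries stand in for the labelled series a and b from the example.

```python
import math
import operator


def aligned_op(left, right, op, fill_value=None):
    """Apply op label-by-label over the union of labels, honoring fill_value."""
    result = {}
    for label in sorted(set(left) | set(right)):
        a = left.get(label, math.nan)
        b = right.get(label, math.nan)
        a_missing, b_missing = math.isnan(a), math.isnan(b)
        if a_missing and b_missing:
            # Missing on both sides: the result stays missing.
            result[label] = math.nan
            continue
        if a_missing or b_missing:
            if fill_value is None:
                result[label] = math.nan
                continue
            # Exactly one side is missing: substitute fill_value for it.
            a = fill_value if a_missing else a
            b = fill_value if b_missing else b
        result[label] = op(a, b)
    return result


a = {'a': 1.0, 'b': 1.0, 'c': 1.0, 'd': math.nan}
b = {'a': 1.0, 'b': math.nan, 'd': 1.0, 'e': math.nan}
diff = aligned_op(a, b, operator.sub, fill_value=0)
print(diff)  # {'a': 0.0, 'b': 1.0, 'c': 1.0, 'd': -1.0, 'e': nan}
```

The output matches the `a.subtract(b, fill_value=0)` example above. Label 'e' stays NaN because it is absent from a and NaN in b, so both sides are missing.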
-
subtract
(**kwargs)¶ Return Subtraction of series and other, element-wise (binary operator sub).
Equivalent to
series - other
, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type:
Differences from pandas
Only level=None is supported
See also
DeferredSeries.rsub()
- Reverse of the Subtraction operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.subtract(b, fill_value=0) a 0.0 b 1.0 c 1.0 d -1.0 e NaN dtype: float64
-
swapaxes
(**kwargs)¶ pandas.Series.swapaxes()
is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data. For more information see https://s.apache.org/dataframe-non-deferred-columns.
-
swaplevel
(**kwargs)¶ pandas.Series.swaplevel()
is not implemented yet in the Beam DataFrame API. If support for ‘swaplevel’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_clipboard
(**kwargs)¶ pandas.DataFrame.to_clipboard()
is not implemented yet in the Beam DataFrame API. If support for ‘to_clipboard’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_csv
(path, transform_label=None, *args, **kwargs)¶ Write object to a comma-separated values (csv) file.
Parameters: - path_or_buf (str or file handle, default None) –
File path or object, if None is provided the result is returned as a string. If a non-binary file object is passed, it should be opened with newline=’’, disabling universal newlines. If a binary file object is passed, mode might need to contain a ‘b’.
Changed in version 1.2.0: Support for binary file objects was introduced.
- sep (str, default ',') – String of length 1. Field delimiter for the output file.
- na_rep (str, default '') – Missing data representation.
- float_format (str, default None) – Format string for floating point numbers.
- columns (sequence, optional) – Columns to write.
- header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
- index (bool, default True) – Write row names (index).
- index_label (str or sequence, or False, default None) – Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R.
- mode (str) – Python write mode, default ‘w’.
- encoding (str, optional) – A string representing the encoding to use in the output file, defaults to ‘utf-8’. encoding is not supported if path_or_buf is a non-binary file object.
- compression (str or dict, default 'infer') –
If str, represents compression mode. If dict, value at ‘method’ is the compression mode. Compression mode may be any of the following possible values: {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}. If compression mode is ‘infer’ and path_or_buf is path-like, then detect compression mode from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’ or ‘.xz’. (otherwise no compression). If dict given and mode is one of {‘zip’, ‘gzip’, ‘bz2’}, or inferred as one of the above, other entries passed as additional compression options.
Changed in version 1.0.0: May now be a dict with key ‘method’ as compression mode and other entries as additional compression options if compression mode is ‘zip’.
Changed in version 1.1.0: Passing compression options as keys in dict is supported for compression modes ‘gzip’ and ‘bz2’ as well as ‘zip’.
Changed in version 1.2.0: Compression is supported for binary file objects.
Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to gzip.open instead of gzip.GzipFile which prevented setting mtime.
- quoting (optional constant from csv module) – Defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric.
- quotechar (str, default '"') – String of length 1. Character used to quote fields.
- line_terminator (str, optional) – The newline character or character sequence to use in the output file. Defaults to os.linesep, which depends on the OS in which this method is called (for example, ‘\n’ on Linux and ‘\r\n’ on Windows).
- chunksize (int or None) – Rows to write at a time.
- date_format (str, default None) – Format string for datetime objects.
- doublequote (bool, default True) – Control quoting of quotechar inside a field.
- escapechar (str, default None) – String of length 1. Character used to escape sep and quotechar when appropriate.
- decimal (str, default '.') – Character recognized as decimal separator. E.g. use ‘,’ for European data.
- errors (str, default 'strict') –
Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.
New in version 1.1.0.
- storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
Returns: If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.
Return type: None or str
Differences from pandas
This operation has no known divergences from the pandas API.
See also
read_csv()
- Load a CSV file into a DeferredDataFrame.
to_excel()
- Write DeferredDataFrame to an Excel file.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'

Create 'out.zip' containing 'out.csv'

>>> compression_opts = dict(method='zip',
...                         archive_name='out.csv')
>>> df.to_csv('out.zip', index=False,
...           compression=compression_opts)
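The interaction between float_format and quoting noted in the parameter list can be checked with a small sketch. It uses eager pandas rather than the deferred Beam API, and the frame contents are made up for illustration.

```python
import csv

import pandas as pd

# Illustrative data; the column names are arbitrary.
df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, 2.25]})

# Without float_format the floats stay numeric, so QUOTE_NONNUMERIC
# leaves them unquoted.
plain = df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC)

# With float_format the floats are rendered to strings first, so they
# are quoted like any other non-numeric field.
formatted = df.to_csv(
    index=False, quoting=csv.QUOTE_NONNUMERIC, float_format="%.2f"
)

print(plain)
print(formatted)
```

If downstream readers rely on unquoted numbers, it is probably safer to leave float_format unset when using csv.QUOTE_NONNUMERIC.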
-
to_excel
(path, *args, **kwargs)¶ Write object to an Excel sheet.
To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an ExcelWriter object with a target file name, and specify a sheet in the file to write to.
Multiple sheets may be written to by specifying unique sheet_name. With all data written to the file it is necessary to save the changes. Note that creating an ExcelWriter object with a file name that already exists will result in the contents of the existing file being erased.
Parameters: - excel_writer (path-like, file-like, or ExcelWriter object) – File path or existing ExcelWriter.
- sheet_name (str, default 'Sheet1') – Name of sheet which will contain DeferredDataFrame.
- na_rep (str, default '') – Missing data representation.
- float_format (str, optional) – Format string for floating point numbers. For example float_format="%.2f" will format 0.1234 to 0.12.
- columns (sequence or list of str, optional) – Columns to write.
- header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
- index (bool, default True) – Write row names (index).
- index_label (str or sequence, optional) – Column label for index column(s) if desired. If not specified, and header and index are True, then the index names are used. A sequence should be given if the DeferredDataFrame uses MultiIndex.
- startrow (int, default 0) – Upper left cell row to dump data frame.
- startcol (int, default 0) – Upper left cell column to dump data frame.
- engine (str, optional) –
Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set this via the options io.excel.xlsx.writer, io.excel.xls.writer, and io.excel.xlsm.writer.
Deprecated since version 1.2.0: As the xlwt package is no longer maintained, the xlwt engine will be removed in a future version of pandas.
- merge_cells (bool, default True) – Write MultiIndex and Hierarchical Rows as merged cells.
- encoding (str, optional) – Encoding of the resulting excel file. Only necessary for xlwt, other writers support unicode natively.
- inf_rep (str, default 'inf') – Representation for infinity (there is no native representation for infinity in Excel).
- verbose (bool, default True) – Display more information in the error logs.
- freeze_panes (tuple of int (length 2), optional) – Specifies the one-based bottommost row and rightmost column that is to be frozen.
- storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
to_csv()
- Write DeferredDataFrame to a comma-separated values (csv) file.
ExcelWriter()
- Class for writing DeferredDataFrame objects into excel sheets.
read_excel()
- Read an Excel file into a DeferredDataFrame.
read_csv()
- Read a comma-separated values (csv) file into DeferredDataFrame.
Notes
For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing.
Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Create, write to and save a workbook:

>>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                    index=['row 1', 'row 2'],
...                    columns=['col 1', 'col 2'])
>>> df1.to_excel("output.xlsx")

To specify the sheet name:

>>> df1.to_excel("output.xlsx",
...              sheet_name='Sheet_name_1')

If you wish to write to more than one sheet in the workbook, it is necessary to specify an ExcelWriter object:

>>> df2 = df1.copy()
>>> with pd.ExcelWriter('output.xlsx') as writer:
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')

ExcelWriter can also be used to append to an existing Excel file:

>>> with pd.ExcelWriter('output.xlsx',
...                     mode='a') as writer:
...     df1.to_excel(writer, sheet_name='Sheet_name_3')

To set the library that is used to write the Excel file, you can pass the `engine` keyword (the default engine is automatically chosen depending on the file extension):

>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')
-
to_feather
(path, *args, **kwargs)¶ Write a DataFrame to the binary Feather format.
Parameters: - path (str or file-like object) – If a string, it will be used as Root Directory path.
- **kwargs –
Additional keywords passed to pyarrow.feather.write_feather(). Starting with pyarrow 0.17, this includes the compression, compression_level, chunksize and version keywords.
New in version 1.1.0.
Differences from pandas
This operation has no known divergences from the pandas API.
-
to_hdf
(**kwargs)¶ pandas.DataFrame.to_hdf()
is not yet supported in the Beam DataFrame API because HDF5 is a random access file format.
-
to_html
(path, *args, **kwargs)¶ Render a DataFrame as an HTML table.
Parameters: - buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
- columns (sequence, optional, default None) – The subset of columns to write. Writes all columns by default.
- col_space (str or int, list or dict of int or str, optional) –
The minimum width of each column in CSS length units. An int is assumed to be px units.
New in version 0.25.0: Ability to use str.
- header (bool, optional) – Whether to print column labels, default True.
- index (bool, optional, default True) – Whether to print index (row) labels.
- na_rep (str, optional, default 'NaN') – String representation of NaN to use.
- formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.
- float_format (one-parameter function, optional, default None) –
Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.
Changed in version 1.2.0.
- sparsify (bool, optional, default True) – Set to False for a DeferredDataFrame with a hierarchical index to print every multiindex key at each row.
- index_names (bool, optional, default True) – Prints the names of the indexes.
- justify (str, default None) –
How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are
- left
- right
- center
- justify
- justify-all
- start
- end
- inherit
- match-parent
- initial
- unset.
- max_rows (int, optional) – Maximum number of rows to display in the console.
- min_rows (int, optional) – The number of rows to display in the console in a truncated repr (when number of rows is above max_rows).
- max_cols (int, optional) – Maximum number of columns to display in the console.
- show_dimensions (bool, default False) – Display DeferredDataFrame dimensions (number of rows by number of columns).
- decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
- bold_rows (bool, default True) – Make the row labels bold in the output.
- classes (str or list or tuple, default None) – CSS class(es) to apply to the resulting html table.
- escape (bool, default True) – Convert the characters <, >, and & to HTML-safe sequences.
- notebook ({True, False}, default False) – Whether the generated HTML is for IPython Notebook.
- border (int) – A border=border attribute is included in the opening <table> tag. Default pd.options.display.html.border.
- encoding (str, default "utf-8") –
Set character encoding.
New in version 1.0.
- table_id (str, optional) – A css id is included in the opening <table> tag if specified.
- render_links (bool, default False) – Convert URLs to HTML links.
Returns: If buf is None, returns the result as a string. Otherwise returns None.
Return type: str or None
Differences from pandas
This operation has no known divergences from the pandas API.
See also
to_string()
- Convert DeferredDataFrame to a string.
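As a rough illustration of the escape, classes and table_id parameters described above, the sketch below renders a tiny frame with eager pandas (not the deferred Beam API). The frame contents, the class name "grid" and the id "demo" are arbitrary placeholders.

```python
import pandas as pd

# Cell values that contain characters with special meaning in HTML.
df = pd.DataFrame({"label": ["<b>x</b>", "a & b"]})

# Default escape=True converts <, > and & to HTML-safe sequences;
# classes and table_id end up on the opening <table> tag.
escaped = df.to_html(table_id="demo", classes="grid")

# escape=False writes the cell contents through unchanged.
raw = df.to_html(escape=False)

print(escaped)
```

Leaving escape at its default is usually the safer choice when cell values come from untrusted input.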
-
to_json
(path, orient=None, *args, **kwargs)¶ Convert the object to a JSON string.
Note that NaN and None values are converted to null, and datetime objects are converted to UNIX timestamps.
Parameters: - path_or_buf (str or file handle, optional) – File path or object. If not specified, the result is returned as a string.
- orient (str) –
Indication of expected JSON string format.
- DeferredSeries:
- default is ‘index’
- allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}.
- DeferredDataFrame:
- default is ‘columns’
- allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}.
- The format of the JSON string:
- ’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
- ’records’ : list like [{column -> value}, … , {column -> value}]
- ’index’ : dict like {index -> {column -> value}}
- ’columns’ : dict like {column -> {index -> value}}
- ’values’ : just the values array
- ’table’ : dict like {‘schema’: {schema}, ‘data’: {data}}
Describing the data, where data component is like orient='records'.
- date_format ({None, 'epoch', 'iso'}) – Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.
- double_precision (int, default 10) – The number of decimal places to use when encoding floating point values.
- force_ascii (bool, default True) – Force encoded string to be ASCII.
- date_unit (str, default 'ms' (milliseconds)) – The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.
- default_handler (callable, default None) – Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.
- lines (bool, default False) – If ‘orient’ is ‘records’ write out line-delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list-like.
- compression ({'infer', 'gzip', 'bz2', 'zip', 'xz', None}) – A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.
- index (bool, default True) – Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.
- indent (int, optional) –
Length of whitespace used to indent each record.
New in version 1.0.0.
- storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
Returns: If path_or_buf is None, returns the resulting json format as a string. Otherwise returns None.
Return type: None or str
Differences from pandas
This operation has no known divergences from the pandas API.
See also
read_json()
- Convert a JSON string to pandas object.
Notes
The behavior of indent=0 varies from the stdlib, which does not indent the output but does insert newlines. Currently, indent=0 and the default indent=None are equivalent in pandas, though this may change in a future release.
orient='table' contains a ‘pandas_version’ field under ‘schema’. This stores the version of pandas used in the latest revision of the schema.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> import json
>>> df = pd.DataFrame(
...     [["a", "b"], ["c", "d"]],
...     index=["row 1", "row 2"],
...     columns=["col 1", "col 2"],
... )

>>> result = df.to_json(orient="split")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}

Encoding/decoding a Dataframe using ``'records'`` formatted JSON. Note that index labels are not preserved with this encoding.

>>> result = df.to_json(orient="records")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
[
    {
        "col 1": "a",
        "col 2": "b"
    },
    {
        "col 1": "c",
        "col 2": "d"
    }
]

Encoding/decoding a Dataframe using ``'index'`` formatted JSON:

>>> result = df.to_json(orient="index")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "row 1": {
        "col 1": "a",
        "col 2": "b"
    },
    "row 2": {
        "col 1": "c",
        "col 2": "d"
    }
}

Encoding/decoding a Dataframe using ``'columns'`` formatted JSON:

>>> result = df.to_json(orient="columns")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "col 1": {
        "row 1": "a",
        "row 2": "c"
    },
    "col 2": {
        "row 1": "b",
        "row 2": "d"
    }
}

Encoding/decoding a Dataframe using ``'values'`` formatted JSON:

>>> result = df.to_json(orient="values")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
[
    [
        "a",
        "b"
    ],
    [
        "c",
        "d"
    ]
]

Encoding with Table Schema:

>>> result = df.to_json(orient="table")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "schema": {
        "fields": [
            {
                "name": "index",
                "type": "string"
            },
            {
                "name": "col 1",
                "type": "string"
            },
            {
                "name": "col 2",
                "type": "string"
            }
        ],
        "primaryKey": [
            "index"
        ],
        "pandas_version": "0.20.0"
    },
    "data": [
        {
            "index": "row 1",
            "col 1": "a",
            "col 2": "b"
        },
        {
            "index": "row 2",
            "col 1": "c",
            "col 2": "d"
        }
    ]
}
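The lines parameter described above is not covered by the examples, so here is a minimal sketch of it using eager pandas (the deferred Beam API would be driven from a pipeline instead). It reuses the same illustrative frame and parses the output line by line so that the result does not depend on whether a trailing newline is written.

```python
import json

import pandas as pd

df = pd.DataFrame(
    [["a", "b"], ["c", "d"]],
    index=["row 1", "row 2"],
    columns=["col 1", "col 2"],
)

# orient="records" with lines=True emits one JSON object per row.
text = df.to_json(orient="records", lines=True)
records = [json.loads(line) for line in text.splitlines() if line]

# Any other orient is rejected when lines=True, because only
# 'records' is list-like.
try:
    df.to_json(orient="split", lines=True)
except ValueError as exc:
    error = str(exc)
else:
    error = None

print(records)
print(error)
```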
-
to_latex
(**kwargs)¶ pandas.Series.to_latex()
is not implemented yet in the Beam DataFrame API. If support for ‘to_latex’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_markdown
(**kwargs)¶ pandas.Series.to_markdown()
is not implemented yet in the Beam DataFrame API. If support for ‘to_markdown’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_msgpack
(**kwargs)¶ pandas.DataFrame.to_msgpack()
is not yet supported in the Beam DataFrame API because it is deprecated in pandas.
-
to_parquet
(path, *args, **kwargs)¶ Write a DataFrame to the binary parquet format.
This function writes the dataframe as a parquet file. You can choose different parquet backends, and have the option of compression. See the user guide for more details.
Parameters: - path (str or file-like object, default None) –
If a string, it will be used as Root Directory path when writing a partitioned dataset. By file-like object, we refer to objects with a write() method, such as a file handle (e.g. via builtin open function) or io.BytesIO. The engine fastparquet does not accept file-like objects. If path is None, a bytes object is returned.
Changed in version 1.2.0.
Previously this was “fname”
- engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.
- compression ({'snappy', 'gzip', 'brotli', None}, default 'snappy') – Name of the compression to use. Use None for no compression.
- index (bool, default None) – If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, similar to True the dataframe’s index(es) will be saved. However, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.
- partition_cols (list, optional, default None) – Column names by which to partition the dataset. Columns are partitioned in the order they are given. Must be None if path is not a string.
- storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
- **kwargs – Additional arguments passed to the parquet library. See pandas io for more details.
Returns: Return type: bytes if no path argument is provided else None
Differences from pandas
This operation has no known divergences from the pandas API.
See also
read_parquet()
- Read a parquet file.
DeferredDataFrame.to_csv()
- Write a csv file.
DeferredDataFrame.to_sql()
- Write to a sql table.
DeferredDataFrame.to_hdf()
- Write to hdf.
Notes
This function requires either the fastparquet or pyarrow library.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_parquet('df.parquet.gzip',
...               compression='gzip')
>>> pd.read_parquet('df.parquet.gzip')
   col1  col2
0     1     3
1     2     4

If you want to get a buffer to the parquet content you can use a io.BytesIO object, as long as you don't use partition_cols, which creates multiple files.

>>> import io
>>> f = io.BytesIO()
>>> df.to_parquet(f)
>>> f.seek(0)
0
>>> content = f.read()
-
to_period
(**kwargs)¶ pandas.Series.to_period()
is not implemented yet in the Beam DataFrame API. If support for ‘to_period’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_pickle
(**kwargs)¶ pandas.Series.to_pickle()
is not implemented yet in the Beam DataFrame API. If support for ‘to_pickle’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_sql
(**kwargs)¶ pandas.Series.to_sql()
is not implemented yet in the Beam DataFrame API. If support for ‘to_sql’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_stata
(path, *args, **kwargs)¶ Export DataFrame object to Stata dta format.
Writes the DataFrame to a Stata dataset file. “dta” files contain a Stata dataset.
Parameters: - path (str, buffer or path object) –
String, path object (pathlib.Path or py._path.local.LocalPath) or object implementing a binary write() function. If using a buffer then the buffer will not be automatically closed after the file data has been written.
Changed in version 1.0.0.
Previously this was “fname”
- convert_dates (dict) – Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are ‘tc’, ‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to ‘tc’. Raises NotImplementedError if a datetime column has timezone information.
- write_index (bool) – Write the index to Stata dataset.
- byteorder (str) – Can be “>”, “<”, “little”, or “big”. default is sys.byteorder.
- time_stamp (datetime) – A datetime to use as file creation date. Default is the current time.
- data_label (str, optional) – A label for the data set. Must be 80 characters or smaller.
- variable_labels (dict) – Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller.
- version ({114, 117, 118, 119, None}, default 114) –
Version to use in the output dta file. Set to None to let pandas decide between 118 or 119 formats depending on the number of columns in the frame. pandas Version 114 can be read by Stata 10 and later. pandas Version 117 can be read by Stata 13 or later. pandas Version 118 is supported in Stata 14 and later. pandas Version 119 is supported in Stata 15 and later. pandas Version 114 limits string variables to 244 characters or fewer while versions 117 and later allow strings with lengths up to 2,000,000 characters. Versions 118 and 119 support Unicode characters, and pandas version 119 supports more than 32,767 variables.
pandas Version 119 should usually only be used when the number of variables exceeds the capacity of dta format 118. Exporting smaller datasets in format 119 may have unintended consequences, and, as of November 2020, Stata SE cannot read pandas version 119 files.
Changed in version 1.0.0: Added support for formats 118 and 119.
- convert_strl (list, optional) – List of column names to convert to string columns to Stata StrL format. Only available if version is 117. Storing strings in the StrL format can produce smaller dta files if strings have more than 8 characters and values are repeated.
- compression (str or dict, default 'infer') –
For on-the-fly compression of the output dta. If string, specifies compression mode. If dict, value at key ‘method’ specifies compression mode. Compression mode must be one of {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}. If compression mode is ‘infer’ and fname is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no compression). If dict and compression mode is one of {‘zip’, ‘gzip’, ‘bz2’}, or inferred as one of the above, other entries passed as additional compression options.
New in version 1.1.0.
- storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
Raises: - NotImplementedError –
- If datetimes contain timezone information
- Column dtype is not representable in Stata
- ValueError –
- Columns listed in convert_dates are neither datetime64[ns] nor datetime.datetime
- Column listed in convert_dates is not in DeferredDataFrame
- Categorical label contains more than 32,000 characters
Differences from pandas
This operation has no known divergences from the pandas API.
See also
read_stata()
- Import Stata data files.
io.stata.StataWriter()
- Low-level writer for Stata data files.
io.stata.StataWriter117()
- Low-level writer for pandas version 117 files.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon',
...                               'parrot'],
...                    'speed': [350, 18, 361, 15]})
>>> df.to_stata('animals.dta')
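To see what write_index controls, the following sketch writes a small frame to a temporary .dta file with eager pandas and reads it back. It is an illustration under the assumption that a local temporary directory is writable; the file name and data are placeholders, and the deferred Beam API would be used from a pipeline instead.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"animal": ["falcon", "parrot"], "speed": [350, 18]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "animals.dta")
    # write_index=False keeps the row index out of the Stata dataset,
    # so only the two data columns are stored.
    df.to_stata(path, write_index=False)
    back = pd.read_stata(path)

print(back)
```

The integer dtype may come back narrower than int64 because Stata stores integers in the smallest type that fits, so the sketch compares values rather than dtypes.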
-
to_timestamp
(**kwargs)¶ pandas.Series.to_timestamp()
is not implemented yet in the Beam DataFrame API. If support for ‘to_timestamp’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_xarray
(**kwargs)¶ pandas.DataFrame.to_xarray()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
transform
(**kwargs)¶ Call func on self producing a DataFrame with transformed values.
Produced DataFrame will have same axis length as self.
Parameters: - func (function, str, list-like or dict-like) –
Function to use for transforming the data. If a function, must either work when passed a DeferredDataFrame or when passed to DeferredDataFrame.apply. If func is both list-like and dict-like, dict-like behavior takes precedence.
Accepted combinations are:
- function
- string function name
- list-like of functions and/or function names, e.g. [np.exp, 'sqrt']
- dict-like of axis labels -> functions, function names or list-like of such.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
- *args – Positional arguments to pass to func.
- **kwargs – Keyword arguments to pass to func.
Returns: A DeferredDataFrame that must have the same length as self.
Return type: DeferredDataFrame
Raises: ValueError – If the returned DeferredDataFrame has a different length than self.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.agg()
- Only perform aggregating type operations.
DeferredDataFrame.apply()
- Invoke function on a DeferredDataFrame.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
>>> df
   A  B
0  0  1
1  1  2
2  2  3
>>> df.transform(lambda x: x + 1)
   A  B
0  1  2
1  2  3
2  3  4

Even though the resulting DataFrame must have the same length as the input DataFrame, it is possible to provide several input functions:

>>> s = pd.Series(range(3))
>>> s
0    0
1    1
2    2
dtype: int64
>>> s.transform([np.sqrt, np.exp])
       sqrt        exp
0  0.000000   1.000000
1  1.000000   2.718282
2  1.414214   7.389056

You can call transform on a GroupBy object:

>>> df = pd.DataFrame({
...     "Date": [
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"],
...     "Data": [5, 8, 6, 1, 50, 100, 60, 120],
... })
>>> df
         Date  Data
0  2015-05-08     5
1  2015-05-07     8
2  2015-05-06     6
3  2015-05-05     1
4  2015-05-08    50
5  2015-05-07   100
6  2015-05-06    60
7  2015-05-05   120
>>> df.groupby('Date')['Data'].transform('sum')
0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data, dtype: int64

>>> df = pd.DataFrame({
...     "c": [1, 1, 1, 2, 2, 2, 2],
...     "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
   c type
0  1    m
1  1    n
2  1    o
3  2    m
4  2    m
5  2    n
6  2    n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
   c type  size
0  1    m     3
1  1    n     3
2  1    o     3
3  2    m     4
4  2    m     4
5  2    n     4
6  2    n     4
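Two points from the parameter description are not exercised above: the dict-like form of func, and the ValueError raised when the result does not keep the length of the input. The sketch below shows both with eager pandas; the data is illustrative, and the deferred Beam API may accept a narrower set of arguments.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 4], "B": [1, 2, 3]})

# dict-like: map each column label to its own function.
per_column = df.transform({"A": np.sqrt, "B": lambda x: x * 2})

# An aggregation collapses each column to a single value, so the
# result no longer lines up with the input index and is rejected.
try:
    df.transform("sum")
except ValueError as exc:
    error = str(exc)
else:
    error = None

print(per_column)
print(error)
```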
-
truediv
(**kwargs)¶ Return Floating division of series and other, element-wise (binary operator truediv).
Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.
Parameters: - other (DeferredSeries or scalar value) –
- fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful DeferredSeries alignment, with this value before computation. If data in both corresponding DeferredSeries locations is missing the result of filling (at that location) will be missing.
- level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: The result of the operation.
Return type: DeferredSeries
Differences from pandas
Only level=None is supported.
See also
DeferredSeries.rtruediv()
- Reverse of the Floating division operator, see Python documentation for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
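To contrast the plain operator with the fill_value behaviour described in the parameter list, here is a small sketch in eager pandas with made-up labels. It leaves level at its default of None, which is the only setting the Beam API supports.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, np.nan], index=["x", "y", "w"])
b = pd.Series([2.0, 4.0, np.nan], index=["y", "z", "w"])

# The plain operator aligns on the index and propagates every gap.
plain = a / b

# fill_value replaces a gap on one side before dividing; a label that
# is missing on both sides ("w") still yields NaN.
filled = a.truediv(b, fill_value=1)

print(plain)
print(filled)
```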
-
truncate
(before, after, axis)¶ Truncate a Series or DataFrame before and after some index value.
This is a useful shorthand for boolean indexing based on index values above or below certain thresholds.
Parameters: - before (date, str, int) – Truncate all rows before this index value.
- after (date, str, int) – Truncate all rows after this index value.
- axis ({0 or 'index', 1 or 'columns'}, optional) – Axis to truncate. Truncates the index (rows) by default.
- copy (bool, default is True) – Return a copy of the truncated section.
Returns: The truncated DeferredSeries or DeferredDataFrame.
Return type: type of caller
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.loc()
- Select a subset of a DeferredDataFrame by label.
DeferredDataFrame.iloc()
- Select a subset of a DeferredDataFrame by position.
Notes
If the index being truncated contains only datetime values, before and after may be specified as strings instead of Timestamps.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'], ... 'B': ['f', 'g', 'h', 'i', 'j'], ... 'C': ['k', 'l', 'm', 'n', 'o']}, ... index=[1, 2, 3, 4, 5]) >>> df A B C 1 a f k 2 b g l 3 c h m 4 d i n 5 e j o >>> df.truncate(before=2, after=4) A B C 2 b g l 3 c h m 4 d i n The columns of a DataFrame can be truncated. >>> df.truncate(before="A", after="B", axis="columns") A B 1 a f 2 b g 3 c h 4 d i 5 e j For Series, only rows can be truncated. >>> df['A'].truncate(before=2, after=4) 2 b 3 c 4 d Name: A, dtype: object The index values in ``truncate`` can be datetimes or string dates. >>> dates = pd.date_range('2016-01-01', '2016-02-01', freq='s') >>> df = pd.DataFrame(index=dates, data={'A': 1}) >>> df.tail() A 2016-01-31 23:59:56 1 2016-01-31 23:59:57 1 2016-01-31 23:59:58 1 2016-01-31 23:59:59 1 2016-02-01 00:00:00 1 >>> df.truncate(before=pd.Timestamp('2016-01-05'), ... after=pd.Timestamp('2016-01-10')).tail() A 2016-01-09 23:59:56 1 2016-01-09 23:59:57 1 2016-01-09 23:59:58 1 2016-01-09 23:59:59 1 2016-01-10 00:00:00 1 Because the index is a DatetimeIndex containing only dates, we can specify `before` and `after` as strings. They will be coerced to Timestamps before truncation. >>> df.truncate('2016-01-05', '2016-01-10').tail() A 2016-01-09 23:59:56 1 2016-01-09 23:59:57 1 2016-01-09 23:59:58 1 2016-01-09 23:59:59 1 2016-01-10 00:00:00 1 Note that ``truncate`` assumes a 0 value for any unspecified time component (midnight). This differs from partial string slicing, which returns any partially matching dates. >>> df.loc['2016-01-05':'2016-01-10', :].tail() A 2016-01-10 23:59:55 1 2016-01-10 23:59:56 1 2016-01-10 23:59:57 1 2016-01-10 23:59:58 1 2016-01-10 23:59:59 1
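The "shorthand for boolean indexing" description can be checked directly. This is a small sketch in eager pandas, assuming a sorted integer index; masked is an illustrative name for the hand-written equivalent.

```python
import pandas as pd

df = pd.DataFrame({'A': list('abcde')}, index=[1, 2, 3, 4, 5])

# truncate keeps the rows whose index label lies in [before, after].
truncated = df.truncate(before=2, after=4)

# The same selection written as an explicit boolean mask on the index.
masked = df[(df.index >= 2) & (df.index <= 4)]
```

Both bounds are inclusive, so labels 2, 3 and 4 are retained by either form.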
-
tz_convert
(**kwargs)¶ Convert tz-aware axis to target time zone.
Parameters: Returns: Object with time zone converted axis.
Return type: {klass}
Raises: TypeError
– If the axis is tz-naive.
Differences from pandas
This operation has no known divergences from the pandas API.
-
tz_localize
(ambiguous, **kwargs)¶ Localize tz-naive index of a Series or DataFrame to target time zone.
This operation localizes the Index. To localize the values in a timezone-naive Series, use Series.dt.tz_localize().
Parameters: - tz (str or tzinfo) –
- axis (the axis to localize) –
- level (int, str, default None) – If axis is a MultiIndex, localize a specific level. Otherwise must be None.
- copy (bool, default True) – Also make a copy of the underlying data.
- ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –
When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.
- ’infer’ will attempt to infer fall dst-transition hours based on order
- bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
- ’NaT’ will return NaT where there are ambiguous times
- ’raise’ will raise an AmbiguousTimeError if there are ambiguous times.
- nonexistent (str, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST. Valid values are:
- ’shift_forward’ will shift the nonexistent time forward to the closest existing time
- ’shift_backward’ will shift the nonexistent time backward to the closest existing time
- ’NaT’ will return NaT where there are nonexistent times
- timedelta objects will shift nonexistent times by the timedelta
- ’raise’ will raise a NonExistentTimeError if there are nonexistent times.
Returns: Same type as the input.
Return type: Raises: TypeError
– If the DeferredSeries is tz-aware and tz is not None.
Differences from pandas
ambiguous cannot be set to "infer" as its semantics are order-sensitive. Similarly, specifying ambiguous as an ndarray is order-sensitive, but you can achieve similar functionality by specifying ambiguous as a Series.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
Localize local times: >>> s = pd.Series([1], ... index=pd.DatetimeIndex(['2018-09-15 01:30:00'])) >>> s.tz_localize('CET') 2018-09-15 01:30:00+02:00 1 dtype: int64 Be careful with DST changes. When there is sequential data, pandas can infer the DST time: >>> s = pd.Series(range(7), ... index=pd.DatetimeIndex(['2018-10-28 01:30:00', ... '2018-10-28 02:00:00', ... '2018-10-28 02:30:00', ... '2018-10-28 02:00:00', ... '2018-10-28 02:30:00', ... '2018-10-28 03:00:00', ... '2018-10-28 03:30:00'])) >>> s.tz_localize('CET', ambiguous='infer') 2018-10-28 01:30:00+02:00 0 2018-10-28 02:00:00+02:00 1 2018-10-28 02:30:00+02:00 2 2018-10-28 02:00:00+01:00 3 2018-10-28 02:30:00+01:00 4 2018-10-28 03:00:00+01:00 5 2018-10-28 03:30:00+01:00 6 dtype: int64 In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the ambiguous parameter to set the DST explicitly >>> s = pd.Series(range(3), ... index=pd.DatetimeIndex(['2018-10-28 01:20:00', ... '2018-10-28 02:36:00', ... '2018-10-28 03:46:00'])) >>> s.tz_localize('CET', ambiguous=np.array([True, True, False])) 2018-10-28 01:20:00+02:00 0 2018-10-28 02:36:00+02:00 1 2018-10-28 03:46:00+01:00 2 dtype: int64 If the DST transition causes nonexistent times, you can shift these dates forward or backward with a timedelta object or `'shift_forward'` or `'shift_backward'`. >>> s = pd.Series(range(2), ... index=pd.DatetimeIndex(['2015-03-29 02:30:00', ... '2015-03-29 03:30:00'])) >>> s.tz_localize('Europe/Warsaw', nonexistent='shift_forward') 2015-03-29 03:00:00+02:00 0 2015-03-29 03:30:00+02:00 1 dtype: int64 >>> s.tz_localize('Europe/Warsaw', nonexistent='shift_backward') 2015-03-29 01:59:59.999999999+01:00 0 2015-03-29 03:30:00+02:00 1 dtype: int64 >>> s.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H')) 2015-03-29 03:30:00+02:00 0 2015-03-29 03:30:00+02:00 1 dtype: int64
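To make the ambiguous parameter concrete, here is a hedged sketch in eager pandas showing that one wall-clock time during the autumn DST transition maps to two different UTC instants. The zone name 'Europe/Berlin' is an assumption chosen for illustration (it follows Central European Time), and the explicit boolean array is the pandas form; per the differences noted above, Beam expects a Series in its place.

```python
import numpy as np
import pandas as pd

# 02:30 local time occurs twice on 2018-10-28 in Central European Time.
idx = pd.DatetimeIndex(['2018-10-28 02:30:00'])

# True marks the timestamp as DST (UTC+02:00), False as non-DST (UTC+01:00).
as_dst = idx.tz_localize('Europe/Berlin', ambiguous=np.array([True]))
as_std = idx.tz_localize('Europe/Berlin', ambiguous=np.array([False]))

# Converting to UTC shows the two distinct instants.
utc_from_dst = as_dst.tz_convert('UTC')[0]
utc_from_std = as_std.tz_convert('UTC')[0]
```

The DST reading corresponds to 00:30 UTC and the non-DST reading to 01:30 UTC, which is why an explicit per-row flag is needed when the order of rows cannot be relied on.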
-
where
(cond, other, errors, **kwargs)¶ where is not parallelizable when
errors="ignore"
is specified.
-
classmethod
wrap
(expr, split_tuples=True)¶
-
xs
(key, axis, level, **kwargs)¶ Return cross-section from the Series/DataFrame.
This method takes a key argument to select data at a particular level of a MultiIndex.
Parameters: - key (label or tuple of label) – Label contained in the index, or partially in a MultiIndex.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to retrieve cross-section on.
- level (object, defaults to first n levels (n=1 or len(key))) – In case of a key partially contained in a MultiIndex, indicate which levels are used. Levels can be referred by label or position.
- drop_level (bool, default True) – If False, returns object with same levels as self.
Returns: Cross-section from the original DeferredSeries or DeferredDataFrame corresponding to the selected index levels.
Return type: Differences from pandas
Note that xs(axis='index') will raise a KeyError at execution time if the key does not exist in the index.
See also
DeferredDataFrame.loc()
- Access a group of rows and columns by label(s) or a boolean array.
DeferredDataFrame.iloc()
- Purely integer-location based indexing for selection by position.
Notes
xs cannot be used to set values.
MultiIndex Slicers is a generic way to get/set values on any level or levels. It is a superset of xs functionality, see MultiIndex Slicers.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> d = {'num_legs': [4, 4, 2, 2], ... 'num_wings': [0, 0, 2, 2], ... 'class': ['mammal', 'mammal', 'mammal', 'bird'], ... 'animal': ['cat', 'dog', 'bat', 'penguin'], ... 'locomotion': ['walks', 'walks', 'flies', 'walks']} >>> df = pd.DataFrame(data=d) >>> df = df.set_index(['class', 'animal', 'locomotion']) >>> df num_legs num_wings class animal locomotion mammal cat walks 4 0 dog walks 4 0 bat flies 2 2 bird penguin walks 2 2 Get values at specified index >>> df.xs('mammal') num_legs num_wings animal locomotion cat walks 4 0 dog walks 4 0 bat flies 2 2 Get values at several indexes >>> df.xs(('mammal', 'dog')) num_legs num_wings locomotion walks 4 0 Get values at specified index and level >>> df.xs('cat', level=1) num_legs num_wings class locomotion mammal walks 4 0 Get values at several indexes and levels >>> df.xs(('bird', 'walks'), ... level=[0, 'locomotion']) num_legs num_wings animal penguin 2 2 Get values at specified column and axis >>> df.xs('num_wings', axis=1) class animal locomotion mammal cat walks 0 dog walks 0 bat flies 2 bird penguin walks 2 Name: num_wings, dtype: int64
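For comparison with the deferred behaviour described under "Differences from pandas", the sketch below shows what eager pandas does with a missing key: the KeyError is raised immediately at the call site. The small two-level frame and the flag missing_key_raised are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {'num_legs': [4, 4, 2], 'num_wings': [0, 0, 2]},
    index=pd.MultiIndex.from_tuples(
        [('mammal', 'cat'), ('mammal', 'dog'), ('bird', 'penguin')],
        names=['class', 'animal']),
)

# Selecting on the first level drops that level from the result.
mammals = df.xs('mammal')

# In eager pandas a key that is absent from the index fails right away.
try:
    df.xs('reptile')
    missing_key_raised = False
except KeyError:
    missing_key_raised = True
```

In a Beam pipeline the same mistake would, per the note above, only surface once the pipeline executes, so it is worth validating keys before building the deferred expression.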
-
-
class
apache_beam.dataframe.frames.
DeferredDataFrame
(expr)[source]¶ Bases:
apache_beam.dataframe.frames.DeferredDataFrameOrSeries
-
columns
¶ The column labels of the DataFrame.
Differences from pandas
This operation has no known divergences from the pandas API.
-
keys
()[source]¶ Get the ‘info axis’ (see Indexing for more).
This is index for Series, columns for DataFrame.
Returns: Info axis.
Return type: Index
Differences from pandas
This operation has no known divergences from the pandas API.
-
align
(other, join, axis, copy, level, method, **kwargs)[source]¶ Align two objects on their axes with the specified join method.
Join method is specified for each axis Index.
Parameters: - other (DeferredDataFrame or DeferredSeries) –
- join ({'outer', 'inner', 'left', 'right'}, default 'outer') –
- axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).
- level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
- fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
- method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –
Method to use for filling holes in reindexed DeferredSeries:
- pad / ffill: propagate last valid observation forward to next valid.
- backfill / bfill: use NEXT valid observation to fill gap.
- limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
- fill_axis ({0 or 'index', 1 or 'columns'}, default 0) – Filling axis, method and limit.
- broadcast_axis ({0 or 'index', 1 or 'columns'}, default None) – Broadcast values along this axis, if aligning two objects of different dimensions.
Returns: (left, right) – Aligned objects.
Return type: (DeferredDataFrame, type of other)
Differences from pandas
Aligning per level is not yet supported. Only the default, level=None, is allowed.
Filling NaN values via method is not supported, because it is order-sensitive. Only the default, method=None, is allowed.
copy=False is not supported because its behavior (whether or not it is an inplace operation) depends on the data.
-
append
(other, ignore_index, verify_integrity, sort, **kwargs)[source]¶ Append rows of other to the end of caller, returning a new object.
Columns in other that are not in the caller are added as new columns.
Parameters: - other (DeferredDataFrame or DeferredSeries/dict-like object, or list of these) – The data to append.
- ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
- verify_integrity (bool, default False) – If True, raise ValueError on creating index with duplicates.
- sort (bool, default False) –
Sort columns if the columns of self and other are not aligned.
Changed in version 1.0.0: Changed to not sort by default.
Returns: A new DeferredDataFrame consisting of the rows of caller and the rows of other.
Return type: Differences from pandas
ignore_index=True is not supported, because it requires generating an order-sensitive index.
See also
concat()
- General function to concatenate DeferredDataFrame or DeferredSeries objects.
Notes
If a list of dict/series is passed and the keys are all contained in the DeferredDataFrame’s index, the order of the columns in the resulting DeferredDataFrame will be unchanged.
Iteratively appending rows to a DeferredDataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DeferredDataFrame all at once.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y']) >>> df A B x 1 2 y 3 4 >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y']) >>> df.append(df2) A B x 1 2 y 3 4 x 5 6 y 7 8 With `ignore_index` set to True: >>> df.append(df2, ignore_index=True) A B 0 1 2 1 3 4 2 5 6 3 7 8 The following, while not recommended methods for generating DataFrames, show two ways to generate a DataFrame from multiple data sources. Less efficient: >>> df = pd.DataFrame(columns=['A']) >>> for i in range(5): ... df = df.append({'A': i}, ignore_index=True) >>> df A 0 0 1 1 2 2 3 3 4 4 More efficient: >>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)], ... ignore_index=True) A 0 0 1 1 2 2 3 3 4 4
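Because DataFrame.append was removed in pandas 2.0, the examples above only run on older pandas releases. As a sketch in eager pandas, the same row-wise combination can be written with pd.concat, keeping the original index labels (that is, without ignore_index, which Beam does not support for append).

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y'])

# Rows of df2 follow the rows of df; index labels are preserved, so they repeat.
combined = pd.concat([df, df2])
```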
-
get
(key, default_value=None)[source]¶ Get item from object for given key (ex: DataFrame column).
Returns default value if not found.
Parameters: key (object) –
Returns: value
Return type: same type as items contained in object
Differences from pandas
This operation has no known divergences from the pandas API.
-
set_index
(keys, **kwargs)[source]¶ Set the DataFrame index using existing columns.
Set the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.
Parameters: - keys (label or array-like or list of labels/arrays) – This parameter can be either a single column key, a single array of
the same length as the calling DeferredDataFrame, or a list containing an
arbitrary combination of column keys and arrays. Here, “array”
encompasses DeferredSeries, Index, np.ndarray, and instances of Iterator.
- drop (bool, default True) – Delete columns to be used as the new index.
- append (bool, default False) – Whether to append columns to existing index.
- inplace (bool, default False) – If True, modifies the DeferredDataFrame in place (do not create a new object).
- verify_integrity (bool, default False) – Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.
Returns: Changed row labels or None if inplace=True.
Return type:
Differences from pandas
keys must be a str or List[str]. Passing an Index or Series is not yet supported (BEAM-11711).
See also
DeferredDataFrame.reset_index()
- Opposite of set_index.
DeferredDataFrame.reindex()
- Change to new indices or expand indices.
DeferredDataFrame.reindex_like()
- Change to same indices as other DeferredDataFrame.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'month': [1, 4, 7, 10], ... 'year': [2012, 2014, 2013, 2014], ... 'sale': [55, 40, 84, 31]}) >>> df month year sale 0 1 2012 55 1 4 2014 40 2 7 2013 84 3 10 2014 31 Set the index to become the 'month' column: >>> df.set_index('month') year sale month 1 2012 55 4 2014 40 7 2013 84 10 2014 31 Create a MultiIndex using columns 'year' and 'month': >>> df.set_index(['year', 'month']) sale year month 2012 1 55 2014 4 40 2013 7 84 2014 10 31 Create a MultiIndex using an Index and a column: >>> df.set_index([pd.Index([1, 2, 3, 4]), 'year']) month sale year 1 2012 1 55 2 2014 4 40 3 2013 7 84 4 2014 10 31 Create a MultiIndex using two Series: >>> s = pd.Series([1, 2, 3, 4]) >>> df.set_index([s, s**2]) month year sale 1 1 1 2012 55 2 4 4 2014 40 3 9 7 2013 84 4 16 10 2014 31
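Since keys is limited to column names, one possible workaround for the Series-based examples above is to attach the Series as a named column first and then index by name. The sketch below uses eager pandas and a hypothetical column name 'key'; whether this fits a given Beam pipeline depends on the Series being alignable with the frame.

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})
s = pd.Series([1, 2, 3, 4])

# Instead of df.set_index([s, 'year']), add the Series as a column
# (hypothetical name 'key') and pass only column labels to set_index.
indexed = df.assign(key=s).set_index(['key', 'year'])
```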
-
set_axis
(labels, axis, **kwargs)[source]¶ Assign desired index to given axis.
Indexes for column or row labels can be changed by assigning a list-like or Index.
Parameters:
Returns: renamed – An object of type DeferredDataFrame or None if inplace=True.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.rename_axis()
- Alter the name of the index or columns.
Examples
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
Change the row labels.
>>> df.set_axis(['a', 'b', 'c'], axis='index')
   A  B
a  1  4
b  2  5
c  3  6
Change the column labels.
>>> df.set_axis(['I', 'II'], axis='columns')
   I  II
0  1   4
1  2   5
2  3   6
Now, update the labels inplace.
>>> df.set_axis(['i', 'ii'], axis='columns', inplace=True)
>>> df
   i  ii
0  1   4
1  2   5
2  3   6
-
axes
¶ Return a list representing the axes of the DataFrame.
It has the row axis labels and column axis labels as the only members. They are returned in that order.
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df.axes [RangeIndex(start=0, stop=2, step=1), Index(['col1', 'col2'], dtype='object')]
-
dtypes
¶ Return the dtypes in the DataFrame.
This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype. See the User Guide for more.
Returns: The data type of each column.
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'float': [1.0], ... 'int': [1], ... 'datetime': [pd.Timestamp('20180310')], ... 'string': ['foo']}) >>> df.dtypes float float64 int int64 datetime datetime64[ns] string object dtype: object
-
assign
(**kwargs)[source]¶ Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
Parameters: **kwargs (dict of {str: callable or DeferredSeries}) – The column names are keywords. If the values are callable, they are computed on the DeferredDataFrame and assigned to the new columns. The callable must not change the input DeferredDataFrame (though pandas doesn’t check it). If the values are not callable (e.g. a DeferredSeries, scalar, or array), they are simply assigned.
Returns: A new DeferredDataFrame with the new columns in addition to all the existing columns.
Return type: DeferredDataFrame
Differences from pandas
value must be a callable or DeferredSeries. Other types make this operation order-sensitive.
Notes
Assigning multiple columns within the same assign is possible. Later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]}, ... index=['Portland', 'Berkeley']) >>> df temp_c Portland 17.0 Berkeley 25.0 Where the value is a callable, evaluated on `df`: >>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32) temp_c temp_f Portland 17.0 62.6 Berkeley 25.0 77.0 Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence: >>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32) temp_c temp_f Portland 17.0 62.6 Berkeley 25.0 77.0 You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign: >>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32, ... temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9) temp_c temp_f temp_k Portland 17.0 62.6 290.15 Berkeley 25.0 77.0 298.15
-
explode
(column, ignore_index)[source]¶ Transform each element of a list-like to a row, replicating index values.
New in version 0.25.0.
Parameters: - column (IndexLabel) –
Column(s) to explode. To explode multiple columns, specify a non-empty list in which each element is a str or tuple; on every row of the frame, the list-like data in all specified columns must have matching lengths.
New in version 1.3.0: Multi-column explode
- ignore_index (bool, default False) –
If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.1.0.
Returns: Exploded lists to rows of the subset columns; index will be duplicated for these rows.
Return type:
Raises: ValueError –
- If columns of the frame are not unique.
- If the specified list of columns to explode is empty.
- If the specified columns to explode do not have a matching count of elements row-wise in the frame.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.unstack()
- Pivot a level of the (necessarily hierarchical) index labels.
DeferredDataFrame.melt()
- Unpivot a DeferredDataFrame from wide format to long format.
DeferredSeries.explode()
- Explode a DeferredDataFrame from list-like columns to long format.
Notes
This routine will explode list-likes including lists, tuples, sets, DeferredSeries, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when exploding sets.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]], ... 'B': 1, ... 'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]}) >>> df A B C 0 [0, 1, 2] 1 [a, b, c] 1 foo 1 NaN 2 [] 1 [] 3 [3, 4] 1 [d, e] Single-column explode. >>> df.explode('A') A B C 0 0 1 [a, b, c] 0 1 1 [a, b, c] 0 2 1 [a, b, c] 1 foo 1 NaN 2 NaN 1 [] 3 3 1 [d, e] 3 4 1 [d, e] Multi-column explode. >>> df.explode(list('AC')) A B C 0 0 1 a 0 1 1 b 0 2 1 c 1 foo 1 NaN 2 NaN 1 NaN 3 3 1 d 3 4 1 e
-
insert
(value, **kwargs)[source]¶ Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.
Parameters: - loc (int) – Insertion index. Must verify 0 <= loc <= len(columns).
- column (str, number, or hashable object) – Label of the inserted column.
- value (int, DeferredSeries, or array-like) –
- allow_duplicates (bool, optional) –
Differences from pandas
value cannot be a List because aligning it with this DeferredDataFrame is order-sensitive.
See also
Index.insert()
- Insert new item by index.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df col1 col2 0 1 3 1 2 4 >>> df.insert(1, "newcol", [99, 99]) >>> df col1 newcol col2 0 1 99 3 1 2 99 4 >>> df.insert(0, "col1", [100, 100], allow_duplicates=True) >>> df col1 col1 newcol col2 0 100 1 99 3 1 100 2 99 4 Notice that pandas uses index alignment in case of `value` from type `Series`: >>> df.insert(0, "col0", pd.Series([5, 6], index=[1, 2])) >>> df col0 col1 col1 newcol col2 0 NaN 100 1 99 3 1 5.0 100 2 99 4
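The order-sensitivity concern disappears when the inserted values carry their own index, because alignment is then by label rather than by position. A minimal sketch in eager pandas, with values deliberately listed out of order:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# The Series lists label 1 before label 0; insert aligns by label,
# so row 0 receives 98 and row 1 receives 99 regardless of that order.
df.insert(1, 'newcol', pd.Series([99, 98], index=[1, 0]))
```

A plain list carries no labels, so the only way to match it to rows is by position, which is what the restriction above rules out.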
-
static
from_dict
(*args, **kwargs)[source]¶ Construct DataFrame from dict of array-like or dicts.
Creates DataFrame object from dictionary by columns or by index allowing dtype specification.
Parameters: - data (dict) – Of the form {field : array-like} or {field : dict}.
- orient ({'columns', 'index'}, default 'columns') – The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DeferredDataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
- dtype (dtype, default None) – Data type to force, otherwise infer.
- columns (list, default None) – Column labels to use when orient='index'. Raises a ValueError if used with orient='columns'.
Returns: Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.from_records()
- DeferredDataFrame from structured ndarray, sequence of tuples or dicts, or DeferredDataFrame.
DeferredDataFrame()
- DeferredDataFrame object creation using constructor.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
By default the keys of the dict become the DataFrame columns: >>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']} >>> pd.DataFrame.from_dict(data) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d Specify ``orient='index'`` to create the DataFrame using dictionary keys as rows: >>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']} >>> pd.DataFrame.from_dict(data, orient='index') 0 1 2 3 row_1 3 2 1 0 row_2 a b c d When using the 'index' orientation, the column names can be specified manually: >>> pd.DataFrame.from_dict(data, orient='index', ... columns=['A', 'B', 'C', 'D']) A B C D row_1 3 2 1 0 row_2 a b c d
-
static
from_records
(*args, **kwargs)[source]¶ Convert structured or record ndarray to DataFrame.
Creates a DataFrame object from a structured ndarray, sequence of tuples or dicts, or DataFrame.
Parameters: - data (structured ndarray, sequence of tuples or dicts, or DeferredDataFrame) – Structured input data.
- index (str, list of fields, array-like) – Field of array to use as the index, alternately a specific set of input labels to use.
- exclude (sequence, default None) – Columns or fields to exclude.
- columns (sequence, default None) – Column names to use. If the passed data do not have names associated with them, this argument provides names for the columns. Otherwise this argument indicates the order of the columns in the result (any names not found in the data will become all-NA columns).
- coerce_float (bool, default False) – Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
- nrows (int, default None) – Number of rows to read if data is an iterator.
Returns: Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.from_dict()
- DeferredDataFrame from dict of array-like or dicts.
DeferredDataFrame()
- DeferredDataFrame object creation using constructor.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Data can be provided as a structured ndarray: >>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')], ... dtype=[('col_1', 'i4'), ('col_2', 'U1')]) >>> pd.DataFrame.from_records(data) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d Data can be provided as a list of dicts: >>> data = [{'col_1': 3, 'col_2': 'a'}, ... {'col_1': 2, 'col_2': 'b'}, ... {'col_1': 1, 'col_2': 'c'}, ... {'col_1': 0, 'col_2': 'd'}] >>> pd.DataFrame.from_records(data) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d Data can be provided as a list of tuples with corresponding columns: >>> data = [(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')] >>> pd.DataFrame.from_records(data, columns=['col_1', 'col_2']) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d
-
duplicated
(keep, subset)[source]¶ Return boolean Series denoting duplicate rows.
Considering certain columns is optional.
Parameters: - subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.
- keep ({'first', 'last', False}, default 'first') –
Determines which duplicates (if any) to mark.
- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
Returns: Boolean Series marking each duplicated row.
Return type: Differences from pandas
Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about which duplicate element is kept.
See also
Index.duplicated()
- Equivalent method on index.
DeferredSeries.duplicated()
- Equivalent method on DeferredSeries.
DeferredSeries.drop_duplicates()
- Remove duplicate values from DeferredSeries.
DeferredDataFrame.drop_duplicates()
- Remove duplicate values from DeferredDataFrame.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
Consider dataset containing ramen rating. >>> df = pd.DataFrame({ ... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'], ... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'], ... 'rating': [4, 4, 3.5, 15, 5] ... }) >>> df brand style rating 0 Yum Yum cup 4.0 1 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0 By default, for each set of duplicated values, the first occurrence is set on False and all others on True. >>> df.duplicated() 0 False 1 True 2 False 3 False 4 False dtype: bool By using 'last', the last occurrence of each set of duplicated values is set on False and all others on True. >>> df.duplicated(keep='last') 0 True 1 False 2 False 3 False 4 False dtype: bool By setting ``keep`` on False, all duplicates are True. >>> df.duplicated(keep=False) 0 True 1 True 2 False 3 False 4 False dtype: bool To find duplicates on specific column(s), use ``subset``. >>> df.duplicated(subset=['brand']) 0 False 1 True 2 False 3 True 4 True dtype: bool
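Of the options shown above, keep=False is the one that pandas and Beam have in common. The sketch below, in eager pandas, combines it with subset to flag every row that shares a brand and style with another row; dup_rows is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep=False marks every member of a duplicated group, not just the extras.
mask = df.duplicated(subset=['brand', 'style'], keep=False)

# The mask can be used directly to pull out all rows that have a twin.
dup_rows = df[mask]
```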
-
drop_duplicates
(keep, subset, ignore_index)[source]¶ Return DataFrame with duplicate rows removed.
Considering certain columns is optional. Indexes, including time indexes are ignored.
Parameters: - subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.
- keep ({'first', 'last', False}, default 'first') – Determines which duplicates (if any) to keep.
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
- inplace (bool, default False) – Whether to drop duplicates in place or to return a copy.
- ignore_index (bool, default False) –
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
Returns: DeferredDataFrame with duplicates removed or None if inplace=True.
Return type:
Differences from pandas
Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about _which_ duplicate element is kept.
See also
DeferredDataFrame.value_counts()
- Count unique combinations of columns.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
Consider a dataset containing ramen ratings.
>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
By default, it removes duplicate rows based on all columns.
>>> df.drop_duplicates()
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
To remove duplicates on specific column(s), use ``subset``.
>>> df.drop_duplicates(subset=['brand'])
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
To remove duplicates and keep last occurrences, use ``keep``.
>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
     brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
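To make the keep="any" guarantee more concrete, the following pandas-only sketch compares keep="first" and keep="last" on a made-up frame. Both keep exactly one row per group but pick different survivors; a keep="any" result, as described above, is only promised to have that one-row-per-group shape, with no promise about which survivor it contains. The data and variable names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "rating": [4.0, 4.5, 3.5, 15.0, 5.0],
})

first = df.drop_duplicates(subset=["brand"], keep="first")
last = df.drop_duplicates(subset=["brand"], keep="last")

# Both results contain exactly one row per brand ...
brands_first = sorted(first["brand"])
brands_last = sorted(last["brand"])

# ... but the surviving row, and therefore its rating, differs.
ratings_first = sorted(first["rating"])
ratings_last = sorted(last["rating"])

# keep=False drops every row whose brand occurs more than once.
none_left = df.drop_duplicates(subset=["brand"], keep=False)
```

Code that consumes a keep="any" result should therefore rely only on the grouping columns, not on the other columns of the surviving row.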
-
aggregate
(func, axis, *args, **kwargs)[source]¶ Aggregate using one or more operations over the specified axis.
Parameters: - func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a DeferredDataFrame or when passed to DeferredDataFrame.apply.
Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g.
[np.sum, 'mean']
- dict of axis labels -> functions, function names or list of such.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
- *args – Positional arguments to pass to func.
- **kwargs – Keyword arguments to pass to func.
Returns: scalar, DeferredSeries or DeferredDataFrame – The return can be:
- scalar : when DeferredSeries.agg is called with single function
- DeferredSeries : when DeferredDataFrame.agg is called with a single function
- DeferredDataFrame : when DeferredDataFrame.agg is called with several functions
Return scalar, DeferredSeries or DeferredDataFrame.
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
agg is an alias for aggregate. Use the alias.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.apply()
- Perform any type of operations.
DeferredDataFrame.transform()
- Perform transformation type operations.
core.groupby.GroupBy()
- Perform operations over groups.
core.resample.Resampler()
- Perform operations over resampled bins.
core.window.Rolling()
- Perform operations over rolling window.
core.window.Expanding()
- Perform operations over expanding window.
core.window.ExponentialMovingWindow()
- Perform operation over exponential weighted window.
Notes
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.
A passed user-defined-function will be passed a DeferredSeries for evaluation.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])
Aggregate these functions over the rows.
>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0
Different aggregations per column.
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0
Aggregate different functions over the columns and rename the index of the resulting DataFrame.
>>> df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean))
     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0
Aggregate over the columns.
>>> df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
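The return-type rules and the contrast with NumPy's flattened default can be checked with a short pandas sketch. The frame below is a made-up example and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["A", "B", "C"])

# A single function on a DataFrame yields a Series indexed by column.
single = df.agg("sum")

# Several functions yield a DataFrame with one row per function.
several = df.agg(["sum", "min"])

# NumPy aggregates the flattened array by default ...
flat_mean = np.mean(df.to_numpy())

# ... whereas agg works along an axis (axis=0, i.e. per column, by default).
col_means = df.agg("mean")
```

flat_mean is a single number, while col_means holds one mean per column.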
-
agg
(func, axis, *args, **kwargs)¶ Aggregate using one or more operations over the specified axis.
Parameters: - func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a DeferredDataFrame or when passed to DeferredDataFrame.apply.
Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g.
[np.sum, 'mean']
- dict of axis labels -> functions, function names or list of such.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
- *args – Positional arguments to pass to func.
- **kwargs – Keyword arguments to pass to func.
Returns: scalar, DeferredSeries or DeferredDataFrame – The return can be:
- scalar : when DeferredSeries.agg is called with single function
- DeferredSeries : when DeferredDataFrame.agg is called with a single function
- DeferredDataFrame : when DeferredDataFrame.agg is called with several functions
Return scalar, DeferredSeries or DeferredDataFrame.
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
agg is an alias for aggregate. Use the alias.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.apply()
- Perform any type of operations.
DeferredDataFrame.transform()
- Perform transformation type operations.
core.groupby.GroupBy()
- Perform operations over groups.
core.resample.Resampler()
- Perform operations over resampled bins.
core.window.Rolling()
- Perform operations over rolling window.
core.window.Expanding()
- Perform operations over expanding window.
core.window.ExponentialMovingWindow()
- Perform operation over exponential weighted window.
Notes
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.
A passed user-defined-function will be passed a DeferredSeries for evaluation.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])
Aggregate these functions over the rows.
>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0
Different aggregations per column.
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0
Aggregate different functions over the columns and rename the index of the resulting DataFrame.
>>> df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean))
     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0
Aggregate over the columns.
>>> df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
-
applymap
(**kwargs)¶ Apply a function to a DataFrame elementwise.
This method applies a function that accepts and returns a scalar to every element of a DataFrame.
Parameters: - func (callable) – Python function, returns a single value from a single value.
- na_action ({None, 'ignore'}, default None) –
If ‘ignore’, propagate NaN values, without passing them to func.
New in version 1.2.
- **kwargs –
Additional keyword arguments to pass as keyword arguments to func.
New in version 1.3.0.
Returns: Transformed DeferredDataFrame.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.apply()
- Apply a function along input axis of DeferredDataFrame.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])
>>> df
       0      1
0  1.000  2.120
1  3.356  4.567
>>> df.applymap(lambda x: len(str(x)))
   0  1
0  3  4
1  5  5
Like Series.map, NA values can be ignored:
>>> df_copy = df.copy()
>>> df_copy.iloc[0, 0] = pd.NA
>>> df_copy.applymap(lambda x: len(str(x)), na_action='ignore')
      0  1
0  <NA>  4
1     5  5
Note that a vectorized version of `func` often exists, which will be much faster. You could square each number elementwise.
>>> df.applymap(lambda x: x**2)
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
But it's better to avoid applymap in that case.
>>> df ** 2
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
-
add_prefix
(**kwargs)¶ Prefix labels with string prefix.
For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.
Parameters: prefix (str) – The string to add before each label.
Returns: New DeferredSeries or DeferredDataFrame with updated labels.
Return type: DeferredSeries or DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.add_suffix()
- Suffix row labels with string suffix.
DeferredDataFrame.add_suffix()
- Suffix column labels with string suffix.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_prefix('item_')
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_prefix('col_')
   col_A  col_B
0      1      3
1      2      4
2      3      5
3      4      6
-
add_suffix
(**kwargs)¶ Suffix labels with string suffix.
For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.
Parameters: suffix (str) – The string to add after each label.
Returns: New DeferredSeries or DeferredDataFrame with updated labels.
Return type: DeferredSeries or DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.add_prefix()
- Prefix row labels with string prefix.
DeferredDataFrame.add_prefix()
- Prefix column labels with string prefix.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_suffix('_item')
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_suffix('_col')
   A_col  B_col
0      1      3
1      2      4
2      3      5
3      4      6
-
memory_usage
(**kwargs)¶ pandas.DataFrame.memory_usage()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
info
(**kwargs)¶ pandas.DataFrame.info()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
clip
(axis, **kwargs)[source]¶ lower and upper must be DeferredSeries instances, or constants. Array-like arguments are not supported because they are order-sensitive.
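In pandas terms, the two supported kinds of bounds look like the sketch below: a constant applies everywhere, and a Series is aligned to the rows by index label rather than by position. The frame, the bounds and the variable names are made up for illustration, and the code uses plain pandas rather than the deferred Beam API.

```python
import pandas as pd

df = pd.DataFrame({"a": [-5, 0, 5], "b": [10, -10, 3]})

# Constant bounds apply to every element.
clipped_const = df.clip(lower=-1, upper=4)

# A Series bound is matched to rows by index label (axis=0),
# so it does not rely on the physical order of the rows.
lower = pd.Series([0, -2, 1], index=df.index)
clipped_series = df.clip(lower=lower, axis=0)
```

A plain list or NumPy array carries no labels, so matching it to rows would depend on row order, which is why such arguments are rejected.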
-
corr
(method, min_periods)[source]¶ Compute pairwise correlation of columns, excluding NA/null values.
Parameters: - method ({'pearson', 'kendall', 'spearman'} or callable) –
Method of correlation:
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.
- min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.
Returns: Correlation matrix.
Return type:
Differences from pandas
Only method="pearson" can be parallelized. Other methods require collecting all data on a single worker (see https://s.apache.org/dataframe-non-parallel-operations for details).
See also
DeferredDataFrame.corrwith()
- Compute pairwise correlation with another DeferredDataFrame or DeferredSeries.
DeferredSeries.corr()
- Compute the correlation between two DeferredSeries.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
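One way to see why Pearson correlation lends itself to parallel execution is that it can be computed from a handful of sums that combine across partitions. The sketch below illustrates that idea with NumPy and pandas; it is not a description of how Beam implements corr, and the helper names partial_stats and pearson_from_stats are hypothetical.

```python
import numpy as np
import pandas as pd

def partial_stats(x, y):
    """Sufficient statistics for one partition: n, sums, and sums of products."""
    return np.array([len(x), x.sum(), y.sum(),
                     (x * x).sum(), (y * y).sum(), (x * y).sum()])

def pearson_from_stats(stats):
    """Pearson correlation computed from the merged statistics."""
    n, sx, sy, sxx, syy, sxy = stats
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / np.sqrt(var_x * var_y)

df = pd.DataFrame({"dogs": [0.2, 0.0, 0.6, 0.2, 0.5, 0.1],
                   "cats": [0.3, 0.6, 0.0, 0.1, 0.2, 0.4]})

# Pretend each slice lives on a different worker.
parts = [df.iloc[:2], df.iloc[2:4], df.iloc[4:]]

# Statistics are merged by simple addition, in any order.
merged = sum(partial_stats(p["dogs"].to_numpy(), p["cats"].to_numpy())
             for p in parts)

parallel_r = pearson_from_stats(merged)
reference_r = df["dogs"].corr(df["cats"])  # pandas default is Pearson
```

Rank-based methods such as Kendall and Spearman need a global ordering of the values, which does not decompose into per-partition sums in the same way.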
-
cov
(min_periods, ddof)[source]¶ Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to understand the relationship between different measures across time.
Parameters:
Returns: The covariance matrix of the series of the DeferredDataFrame.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.cov()
- Compute covariance with another DeferredSeries.
core.window.ExponentialMovingWindow.cov()
- Exponential weighted sample covariance.
core.window.Expanding.cov()
- Expanding sample covariance.
core.window.Rolling.cov()
- Rolling sample covariance.
Notes
Returns the covariance matrix of the DeferredDataFrame’s time series. The covariance is normalized by N-ddof.
For DeferredDataFrames that have DeferredSeries that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member DeferredSeries.
However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
...                   columns=['dogs', 'cats'])
>>> df.cov()
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(1000, 5),
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795
**Minimum number of periods**
This method also supports an optional ``min_periods`` keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> df.cov(min_periods=12)
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
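The "normalized by N-ddof" statement in the Notes can be verified by hand for the small dogs/cats frame. The sketch below assumes the pandas default of ddof=1 and recomputes the matrix from centered data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["dogs", "cats"])

n = len(df)
ddof = 1  # assumed default: sample covariance

# Subtract each column mean, then form the matrix of summed cross products.
centered = df - df.mean()
manual = centered.T.dot(centered) / (n - ddof)

matches_pandas = np.allclose(manual.to_numpy(), df.cov().to_numpy())
```

For this frame the dogs variance is 2/3 and the dogs/cats covariance is -1, matching the first example above.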
-
corrwith
(other, axis, drop, method)[source]¶ Compute pairwise correlation.
Pairwise correlation is computed between rows or columns of DataFrame with rows or columns of Series or DataFrame. DataFrames are first aligned along both axes before computing the correlations.
Parameters: - other (DeferredDataFrame, DeferredSeries) – Object with which to compute correlations.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ to compute column-wise, 1 or ‘columns’ for row-wise.
- drop (bool, default False) – Drop missing indices from result.
- method ({'pearson', 'kendall', 'spearman'} or callable) –
Method of correlation:
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- callable: callable with input two 1d ndarrays and returning a float.
Returns: Pairwise correlations.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.corr()
- Compute pairwise correlation of columns.
-
cummax
(**kwargs)¶ pandas.DataFrame.cummax()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
cummin
(**kwargs)¶ pandas.DataFrame.cummin()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
cumprod
(**kwargs)¶ pandas.DataFrame.cumprod()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
cumsum
(**kwargs)¶ pandas.DataFrame.cumsum()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
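What "sensitive to the order of the data" means for the cumulative operations can be shown with the standard library alone. In this sketch the two lists hold the same elements in different orders: a plain sum is unaffected, while the running sum is not.

```python
from itertools import accumulate

values = [3, 1, 2]
reordered = [1, 2, 3]  # same elements, different arrival order

# A full reduction does not care about order.
total_same = sum(values) == sum(reordered)

# A running sum exposes every intermediate state, so order matters.
cumsum_values = list(accumulate(values))
cumsum_reordered = list(accumulate(reordered))
```

Because a distributed collection has no inherent element order, only the first kind of result is well defined without extra ordering information.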
-
diff
(**kwargs)¶ pandas.DataFrame.diff()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
interpolate
(**kwargs)¶ pandas.DataFrame.interpolate()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
pct_change
(**kwargs)¶ pandas.DataFrame.pct_change()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
asof
(**kwargs)¶ pandas.DataFrame.asof()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
first_valid_index
(**kwargs)¶ pandas.DataFrame.first_valid_index()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
last_valid_index
(**kwargs)¶ pandas.DataFrame.last_valid_index()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
iat
¶ pandas.DataFrame.iat()
is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data. For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
lookup
(**kwargs)¶ pandas.DataFrame.lookup()
is not yet supported in the Beam DataFrame API because it is deprecated in pandas.
-
head
(**kwargs)¶ pandas.DataFrame.head()
is not yet supported in the Beam DataFrame API because it is order-sensitive. If you want to peek at a large dataset consider using interactive Beam’s ib.collect with n specified, or sample(). If you want to find the N largest elements, consider using DeferredDataFrame.nlargest().
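The difference between head and the suggested nlargest alternative can be seen in plain pandas. The sketch below uses a made-up frame and reverses its rows to stand in for a different element order; nlargest selects by value, so its result does not change, while head does.

```python
import pandas as pd

df = pd.DataFrame({"city": ["a", "b", "c", "d"],
                   "population": [50, 10, 40, 30]})
reversed_df = df.iloc[::-1]  # same rows, different order

# head depends on whichever rows happen to come first.
head_cities = df.head(2)["city"].tolist()
head_cities_reversed = reversed_df.head(2)["city"].tolist()

# nlargest selects by value, so row order is irrelevant (no ties here).
top_cities = df.nlargest(2, "population")["city"].tolist()
top_cities_reversed = reversed_df.nlargest(2, "population")["city"].tolist()
```

With tied values nlargest would also have to break ties somehow, so the illustration deliberately uses distinct populations.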
-
tail
(**kwargs)¶ pandas.DataFrame.tail()
is not yet supported in the Beam DataFrame API because it is order-sensitive. If you want to peek at a large dataset consider using interactive Beam’s ib.collect with n specified, or sample(). If you want to find the N largest elements, consider using DeferredDataFrame.nlargest().
-
sample
(n, frac, replace, weights, random_state, axis)[source]¶ Return a random sample of items from an axis of object.
You can use random_state for reproducibility.
Parameters: - n (int, optional) – Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
- frac (float, optional) – Fraction of axis items to return. Cannot be used with n.
- replace (bool, default False) – Allow or disallow sampling of the same row more than once.
- weights (str or ndarray-like, optional) – Default ‘None’ results in equal probability weighting. If passed a DeferredSeries, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DeferredDataFrame, will accept the name of a column when axis = 0. Unless weights are a DeferredSeries, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. Infinite values not allowed.
- random_state (int, array-like, BitGenerator, np.random.RandomState, optional) –
If int, array-like, or BitGenerator (NumPy>=1.17), seed for random number generator. If np.random.RandomState, use as numpy RandomState object.
Changed in version 1.1.0: array-like and BitGenerator (for NumPy>=1.17) object now passed to np.random.RandomState() as seed
- axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – Axis to sample. Accepts axis number or name. Default is stat axis for given data type (0 for DeferredSeries and DeferredDataFrames).
- ignore_index (bool, default False) –
If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.3.0.
Returns: A new object of same type as caller containing n items randomly sampled from the caller object.
Return type:
Differences from pandas
When axis='index', only n and/or weights may be specified. frac, random_state, and replace=True are not yet supported. See BEAM-12476.
Note that pandas will raise an error if n is larger than the length of the dataset, while the Beam DataFrame API will simply return the full dataset in that case.
sample is fully supported for axis='columns'.
See also
DeferredDataFrameGroupBy.sample()
- Generates random samples from each group of a DeferredDataFrame object.
DeferredSeriesGroupBy.sample()
- Generates random samples from each group of a DeferredSeries object.
numpy.random.choice()
- Generates a random sample from a given 1-D numpy array.
Notes
If frac > 1, replacement should be set to True.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8
Extract 3 random elements from the ``Series`` ``df['num_legs']``: Note that we use `random_state` to ensure the reproducibility of the examples.
>>> df['num_legs'].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64
A random 50% sample of the ``DataFrame`` with replacement:
>>> df.sample(frac=0.5, replace=True, random_state=1)
      num_legs  num_wings  num_specimen_seen
dog          4          0                  2
fish         0          0                  8
An upsample sample of the ``DataFrame`` with replacement: Note that `replace` parameter has to be `True` for `frac` parameter > 1.
>>> df.sample(frac=2, replace=True, random_state=1)
        num_legs  num_wings  num_specimen_seen
dog            4          0                  2
fish           0          0                  8
falcon         2          2                 10
falcon         2          2                 10
fish           0          0                  8
dog            4          0                  2
fish           0          0                  8
dog            4          0                  2
Using a DataFrame column as weights. Rows with larger value in the `num_specimen_seen` column are more likely to be sampled.
>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8
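The divergence for an oversized n can be reproduced on the pandas side. In the sketch below, pandas raises when asked for more rows than exist; sample_at_most is a hypothetical helper, written only to mimic in pandas the behaviour the Beam DataFrame API is described as having, and is not part of either library.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 8, 0]},
                  index=["falcon", "dog", "spider", "fish"])

# pandas refuses to draw more rows than exist when replace=False.
try:
    df.sample(n=10)
    raised = False
except ValueError:
    raised = True

def sample_at_most(frame, n):
    """Hypothetical helper: cap n at the frame length instead of raising."""
    return frame.sample(n=min(n, len(frame)))

capped = sample_at_most(df, 10)
```

Code ported from pandas that relies on the ValueError as a size check will therefore need an explicit length test when run on Beam.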
-
dot
(other)[source]¶ Compute the matrix multiplication between the DataFrame and other.
This method computes the matrix product between the DataFrame and the values of an other Series, DataFrame or a numpy array.
It can also be called using self @ other in Python >= 3.5.
Parameters: other (DeferredSeries, DeferredDataFrame or array-like) – The other object to compute the matrix product with.
Returns: If other is a DeferredSeries, return the matrix product between self and other as a DeferredSeries. If other is a DeferredDataFrame or a numpy.array, return the matrix product of self and other in a DeferredDataFrame of a np.array.
Return type: DeferredSeries or DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.dot()
- Similar method for DeferredSeries.
Notes
The dimensions of DeferredDataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of DeferredDataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.
The dot method for DeferredSeries computes the inner product, instead of the matrix product here.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Here we multiply a DataFrame with a Series.
>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> s = pd.Series([1, 1, 2, 1])
>>> df.dot(s)
0    -4
1     5
dtype: int64
Here we multiply a DataFrame with another DataFrame.
>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(other)
   0  1
0  1  4
1  2  2
Note that the dot method gives the same result as @.
>>> df @ other
   0  1
0  1  4
1  2  2
The dot method also works if other is an np.array.
>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(arr)
   0  1
0  1  4
1  2  2
Note how shuffling of the objects does not change the result.
>>> s2 = s.reindex([1, 0, 2, 3])
>>> df.dot(s2)
0    -4
1     5
dtype: int64
-
mode
(axis=0, *args, **kwargs)[source]¶ Get the mode(s) of each element along the selected axis.
The mode of a set of values is the value that appears most often. It can be multiple values.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) –
The axis to iterate over while searching for the mode:
- 0 or ‘index’ : get mode of each column
- 1 or ‘columns’ : get mode of each row.
- numeric_only (bool, default False) – If True, only apply to numeric columns.
- dropna (bool, default True) – Don’t consider counts of NaN/NaT.
Returns: The modes of each column or row.
Return type:
Differences from pandas
mode with axis="columns" is not implemented because it produces non-deferred columns.
mode with axis="index" is not currently parallelizable. An approximate, parallelizable implementation of mode may be added in the future (BEAM-12181).
See also
DeferredSeries.mode()
- Return the highest frequency value in a DeferredSeries.
DeferredSeries.value_counts()
- Return the counts of values in a DeferredSeries.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame([('bird', 2, 2),
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))
>>> df
           species  legs  wings
falcon        bird     2    2.0
horse       mammal     4    NaN
spider   arthropod     8    0.0
ostrich       bird     2    NaN
By default, missing values are not considered, and the modes of wings are both 0 and 2. Because the resulting DataFrame has two rows, the second row of ``species`` and ``legs`` contains ``NaN``.
>>> df.mode()
  species  legs  wings
0    bird   2.0    0.0
1     NaN   NaN    2.0
Setting ``dropna=False``, ``NaN`` values are considered and they can be the mode (like for wings).
>>> df.mode(dropna=False)
  species  legs  wings
0    bird     2    NaN
Setting ``numeric_only=True``, only the mode of numeric columns is computed, and columns of other types are ignored.
>>> df.mode(numeric_only=True)
   legs  wings
0   2.0    0.0
1   NaN    2.0
To compute the mode over columns and not rows, use the axis parameter:
>>> df.mode(axis='columns', numeric_only=True)
           0    1
falcon   2.0  NaN
horse    4.0  NaN
spider   0.0  8.0
ostrich  2.0  NaN
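The relationship between mode and value_counts, which the See also entries point to, can be sketched in pandas as follows. This only illustrates the definition (the values whose count equals the maximum count) on a made-up Series; it is not a description of Beam's implementation.

```python
import pandas as pd

s = pd.Series([2, 4, 8, 2, 4, 1])

# Count occurrences of each distinct value.
counts = s.value_counts()

# The modes are all values that reach the maximum count.
modes = sorted(counts[counts == counts.max()].index)

reference = s.mode().tolist()
```

The final comparison against the maximum needs the complete set of counts in one place.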
-
dropna
(axis, **kwargs)[source]¶ Remove missing values.
See the User Guide for more on which values are considered missing, and how to work with missing data.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) –
Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’ : Drop rows which contain missing values.
- 1, or ‘columns’ : Drop columns which contain missing value.
Changed in version 1.0.0: Passing a tuple or list to drop on multiple axes is no longer supported. Only a single axis is allowed.
- how ({'any', 'all'}, default 'any') –
Determine if row or column is removed from DeferredDataFrame, when we have at least one NA or all NA.
- ’any’ : If any NA values are present, drop that row or column.
- ’all’ : If all values are NA, drop that row or column.
- thresh (int, optional) – Require that many non-NA values.
- subset (array-like, optional) – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
- inplace (bool, default False) – If True, do operation inplace and return None.
Returns: DeferredDataFrame with NA entries dropped from it, or None if inplace=True.
Return type: DeferredDataFrame or None
Differences from pandas
dropna with axis="columns" specified cannot be parallelized.
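The following sketch illustrates one plausible reading of why the column-wise case is harder to parallelize. It uses plain pandas rather than the deferred API, and the frame contents are made up for illustration: a row can be kept or dropped by looking at that row alone, while a column can only be kept or dropped after every row has been inspected.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not taken from a Beam pipeline.
df = pd.DataFrame({"name": ["Alfred", "Batman"],
                   "toy": [np.nan, "Batmobile"]})

# Row-wise: the decision for each row depends only on that row.
rows_kept = df[df.notna().all(axis=1)]

# Column-wise: the decision for each column is a reduction over all rows.
columns_to_keep = df.notna().all(axis=0)
cols_kept = df.loc[:, columns_to_keep]
```

Here `rows_kept` matches `df.dropna()` and `cols_kept` matches `df.dropna(axis="columns")`.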
See also
DeferredDataFrame.isna()
- Indicate missing values.
DeferredDataFrame.notna()
- Indicate existing (non-missing) values.
DeferredDataFrame.fillna()
- Replace missing values.
DeferredSeries.dropna()
- Drop missing values.
Index.dropna()
- Drop missing indices.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Drop the rows where at least one element is missing.

>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'toy'])
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep the DataFrame with valid entries in the same variable.

>>> df.dropna(inplace=True)
>>> df
     name        toy       born
1  Batman  Batmobile 1940-04-25
-
eval
(expr, inplace, **kwargs)[source]¶ Evaluate a string describing operations on DataFrame columns.
Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
Parameters: - expr (str) – The expression string to evaluate.
- inplace (bool, default False) – If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DeferredDataFrame. Otherwise, a new DeferredDataFrame is returned.
- **kwargs – See the documentation for eval() for complete details on the keyword arguments accepted by query().
Returns: The result of the evaluation or None if inplace=True.
Return type: ndarray, scalar, pandas object, or None
Differences from pandas
Accessing local variables with @<varname> is not yet supported (BEAM-11202).
Arguments local_dict, global_dict, level, target, and resolvers are not yet supported.
See also
DeferredDataFrame.query()
- Evaluates a boolean expression to query the columns of a frame.
DeferredDataFrame.assign()
- Can evaluate an expression or function to create new values for a column.
eval()
- Evaluate a Python expression as a string using various backends.
Notes
For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
>>> df
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2
>>> df.eval('A + B')
0    11
1    10
2     9
3     8
4     7
dtype: int64

Assignment is allowed though by default the original DataFrame is not modified.

>>> df.eval('C = A + B')
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7
>>> df
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2

Use ``inplace=True`` to modify the original DataFrame.

>>> df.eval('C = A + B', inplace=True)
>>> df
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7

Multiple columns can be assigned to using multi-line expressions:

>>> df.eval(
...     '''
... C = A + B
... D = A - B
... '''
... )
   A   B   C  D
0  1  10  11 -9
1  2   8  10 -6
2  3   6   9 -3
3  4   4   8  0
4  5   2   7  3
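As a rough illustration of the relationship between eval and assign mentioned under "See also", the sketch below uses plain pandas (not the deferred API) to show that an assignment expression with the default inplace=False returns a new frame and leaves the original untouched, much like an equivalent assign call.

```python
import pandas as pd

df = pd.DataFrame({"A": range(1, 6), "B": range(10, 0, -2)})

# String expression: returns a new frame with column C added.
via_eval = df.eval("C = A + B")

# Equivalent column-wise expression using assign.
via_assign = df.assign(C=df["A"] + df["B"])

# The original frame still has only its two columns.
original_columns = list(df.columns)
```

The assign form avoids building expression strings, which may be preferable when any part of the expression would otherwise come from user input.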
-
query
(expr, inplace, **kwargs)[source]¶ Query the columns of a DataFrame with a boolean expression.
Parameters: - expr (str) –
The query string to evaluate.
You can refer to variables in the environment by prefixing them with an ‘@’ character like @a + b.
You can refer to column names that are not valid Python variable names by surrounding them in backticks. Thus, column names containing spaces or punctuation (besides underscores) or starting with digits must be surrounded by backticks. (For example, a column named “Area (cm^2)” would be referenced as `Area (cm^2)`.) Column names which are Python keywords (like “list”, “for”, “import”, etc) cannot be used.
For example, if one of your columns is called a a and you want to sum it with b, your query should be `a a` + b.
New in version 0.25.0: Backtick quoting introduced.
New in version 1.0.0: Expanding functionality of backtick quoting for more than only spaces.
- inplace (bool) – Whether the query should modify the data in place or return a modified copy.
- **kwargs – See the documentation for eval() for complete details on the keyword arguments accepted by DeferredDataFrame.query().
Returns: DeferredDataFrame resulting from the provided query expression or None if inplace=True.
Return type: DeferredDataFrame or None
Differences from pandas
Accessing local variables with @<varname> is not yet supported (BEAM-11202).
Arguments local_dict, global_dict, level, target, and resolvers are not yet supported.
See also
eval()
- Evaluate a string describing operations on DeferredDataFrame columns.
DeferredDataFrame.eval()
- Evaluate a string describing operations on DeferredDataFrame columns.
Notes
The result of the evaluation of this expression is first passed to DeferredDataFrame.loc and if that fails because of a multidimensional key (e.g., a DeferredDataFrame) then the result will be passed to DeferredDataFrame.__getitem__().
This method uses the top-level eval() function to evaluate the passed query.
The query() method uses a slightly modified Python syntax by default. For example, the & and | (bitwise) operators have the precedence of their boolean cousins, and and or. This is syntactically valid Python, however the semantics are different.
You can change the semantics of the expression by passing the keyword argument parser='python'. This enforces the same semantics as evaluation in Python space. Likewise, you can pass engine='python' to evaluate an expression using Python itself as a backend. This is not recommended as it is inefficient compared to using numexpr as the engine.
The DeferredDataFrame.index and DeferredDataFrame.columns attributes of the DeferredDataFrame instance are placed in the query namespace by default, which allows you to treat both the index and columns of the frame as a column in the frame. The identifier index is used for the frame index; you can also use the name of the index to identify it in a query. Please note that Python keywords may not be used as identifiers.
For further details and examples see the query documentation in indexing.
Backtick quoted variables
Backtick quoted variables are parsed as literal Python code and are converted internally to a Python valid identifier. This can lead to the following problems.
During parsing a number of disallowed characters inside the backtick quoted string are replaced by strings that are allowed as a Python identifier. These characters include all operators in Python, the space character, the question mark, the exclamation mark, the dollar sign, and the euro sign. For other characters that fall outside the ASCII range (U+0001..U+007F) and those that are not further specified in PEP 3131, the query parser will raise an error. This excludes whitespace different than the space character, but also the hashtag (as it is used for comments) and the backtick itself (backtick can also not be escaped).
In a special case, quotes that make a pair around a backtick can confuse the parser. For example, `it's` > `that's` will raise an error, as it forms a quoted string ('s > `that') with a backtick inside.
See also the Python documentation about lexical analysis (https://docs.python.org/3/reference/lexical_analysis.html) in combination with the source code in pandas.core.computation.parsing.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'A': range(1, 6),
...                    'B': range(10, 0, -2),
...                    'C C': range(10, 5, -1)})
>>> df
   A   B  C C
0  1  10   10
1  2   8    9
2  3   6    8
3  4   4    7
4  5   2    6
>>> df.query('A > B')
   A  B  C C
4  5  2    6

The previous expression is equivalent to

>>> df[df.A > df.B]
   A  B  C C
4  5  2    6

For columns with spaces in their name, you can use backtick quoting.

>>> df.query('B == `C C`')
   A   B  C C
0  1  10   10

The previous expression is equivalent to

>>> df[df.B == df['C C']]
   A   B  C C
0  1  10   10
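The note above about & and | taking the precedence of and and or can be made concrete with a small pandas-only sketch (the deferred API is not exercised here). Inside a query string with the default parser, the comparison operands need no parentheses; in ordinary Python indexing they do.

```python
import pandas as pd

df = pd.DataFrame({"A": range(1, 6), "B": range(10, 0, -2)})

# In query(), & binds like `and`, so this reads as (A > 1) and (B < 8).
via_query = df.query("A > 1 & B < 8")

# In plain Python, & binds tighter than the comparisons,
# so explicit parentheses are required for the same meaning.
via_mask = df[(df["A"] > 1) & (df["B"] < 8)]
```

Both forms select the rows where A is greater than 1 and B is less than 8.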
-
isnull
(**kwargs)¶ Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns: Mask of bool values for each element in DeferredDataFrame that indicates whether an element is an NA value.
Return type: DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.isnull()
- Alias of isna.
DeferredDataFrame.notna()
- Boolean inverse of isna.
DeferredDataFrame.dropna()
- Omit axes labels with missing values.
isna()
- Top-level isna.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Show which entries in a DataFrame are NA. >>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker >>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False Show which entries in a Series are NA. >>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64 >>> ser.isna() 0 False 1 False 2 True dtype: bool
-
isna
(**kwargs)¶ Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns: Mask of bool values for each element in DeferredDataFrame that indicates whether an element is an NA value.
Return type: DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.isnull()
- Alias of isna.
DeferredDataFrame.notna()
- Boolean inverse of isna.
DeferredDataFrame.dropna()
- Omit axes labels with missing values.
isna()
- Top-level isna.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Show which entries in a DataFrame are NA. >>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker >>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False Show which entries in a Series are NA. >>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64 >>> ser.isna() 0 False 1 False 2 True dtype: bool
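A short pandas-only sketch of which values count as NA under the default options, as described above. The sample values are arbitrary and chosen to cover None, NaN, an empty string, infinity, and zero.

```python
import numpy as np
import pandas as pd

# An object-dtype Series mixing missing and non-missing values.
values = pd.Series([None, np.nan, "", np.inf, 0], dtype=object)

# Only None and NaN are reported as NA; '', inf and 0 are not.
na_mask = values.isna()

# isnull is an alias and gives the same mask.
null_mask = values.isnull()
```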
-
notnull
(**kwargs)¶ Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
Returns: Mask of bool values for each element in DeferredDataFrame that indicates whether an element is not an NA value.
Return type: DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.notnull()
- Alias of notna.
DeferredDataFrame.isna()
- Boolean inverse of notna.
DeferredDataFrame.dropna()
- Omit axes labels with missing values.
notna()
- Top-level notna.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Show which entries in a DataFrame are not NA. >>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker >>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True Show which entries in a Series are not NA. >>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64 >>> ser.notna() 0 True 1 True 2 False dtype: bool
-
notna
(**kwargs)¶ Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
Returns: Mask of bool values for each element in DeferredDataFrame that indicates whether an element is not an NA value.
Return type: DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.notnull()
- Alias of notna.
DeferredDataFrame.isna()
- Boolean inverse of notna.
DeferredDataFrame.dropna()
- Omit axes labels with missing values.
notna()
- Top-level notna.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Show which entries in a DataFrame are not NA. >>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker >>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True Show which entries in a Series are not NA. >>> ser = pd.Series([5, 6, np.NaN]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64 >>> ser.notna() 0 True 1 True 2 False dtype: bool
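Since notna is the boolean inverse of isna, its mask can be summed to count the non-missing entries in each column. The sketch below uses plain pandas with a made-up frame; it is meant only to illustrate that relationship.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 6, np.nan],
                   "toy": [None, "Batmobile", "Joker"]})

present = df.notna()

# True where notna and isna disagree everywhere, i.e. they are exact inverses.
is_inverse = bool((present == ~df.isna()).all().all())

# Summing the boolean mask counts non-missing values per column.
non_missing = present.sum()
```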
-
items
(**kwargs)¶ pandas.DataFrame.items() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.
For more information see https://s.apache.org/dataframe-non-deferred-result.
-
itertuples
(**kwargs)¶ pandas.DataFrame.itertuples() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.
For more information see https://s.apache.org/dataframe-non-deferred-result.
-
iterrows
(**kwargs)¶ pandas.DataFrame.iterrows() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.
For more information see https://s.apache.org/dataframe-non-deferred-result.
-
iteritems
(**kwargs)¶ pandas.DataFrame.iteritems() is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred.
For more information see https://s.apache.org/dataframe-non-deferred-result.
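Row iteration hands back eager Python objects, which is why these methods have no deferred counterpart. A common way around this, sketched here in plain pandas with hypothetical column names, is to express the per-row computation as a column-wise operation instead. Whether a given column-wise operation is available on a deferred frame should be checked against its own entry in this reference.

```python
import pandas as pd

# Hypothetical columns, for illustration only.
df = pd.DataFrame({"price": [2.0, 3.5], "qty": [3, 2]})

# Eager pandas loop: relies on itertuples, which is not deferred.
loop_totals = [row.price * row.qty for row in df.itertuples()]

# Column-wise form of the same computation, with no row iteration.
vector_totals = df["price"] * df["qty"]
```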
-
join
(other, on, **kwargs)[source]¶ Join columns of another DataFrame.
Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.
Parameters: - other (DeferredDataFrame, DeferredSeries, or list of DeferredDataFrame) – Index should be similar to one of the columns in this one. If a DeferredSeries is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DeferredDataFrame.
- on (str, list of str, or array-like, optional) – Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DeferredDataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DeferredDataFrame. Like an Excel VLOOKUP operation.
- how ({'left', 'right', 'outer', 'inner'}, default 'left') –
How to handle the operation of the two objects.
- left: use calling frame’s index (or column if on is specified)
- right: use other’s index.
- outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.
- inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling frame’s index.
- lsuffix (str, default '') – Suffix to use from left frame’s overlapping columns.
- rsuffix (str, default '') – Suffix to use from right frame’s overlapping columns.
- sort (bool, default False) – Order result DeferredDataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).
Returns: A dataframe containing columns from both the caller and other.
Return type: DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.merge()
- For column(s)-on-column(s) operations.
Notes
Parameters on, lsuffix, and rsuffix are not supported when passing a list of DeferredDataFrame objects.
Support for specifying index levels as the on parameter was added in pandas version 0.23.0.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
...                       'B': ['B0', 'B1', 'B2']})
>>> other
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

Join DataFrames using their indexes.

>>> df.join(other, lsuffix='_caller', rsuffix='_other')
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

If we want to join using the key columns, we need to set key to be the index in both `df` and `other`. The joined DataFrame will have key as its index.

>>> df.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

Another option to join using the key columns is to use the `on` parameter. DataFrame.join always uses `other`'s index but we can use any column in `df`. This method preserves the original DataFrame's index in the result.

>>> df.join(other.set_index('key'), on='key')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN
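The examples above all use the default how='left'. As a complement, this pandas-only sketch (with small made-up frames) shows how the how argument changes which index labels survive an index-on-index join.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"B": ["B1", "B2", "B3"]}, index=["K1", "K2", "K3"])

# Index labels kept by each join type.
labels = {
    how: list(left.join(right, how=how).index)
    for how in ("left", "right", "inner", "outer")
}
```

With these inputs, 'left' keeps K0 to K2, 'right' keeps K1 to K3, 'inner' keeps only K1 and K2, and 'outer' keeps all four labels in sorted order.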
-
merge
(right, on, left_on, right_on, left_index, right_index, suffixes, **kwargs)[source]¶ Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
Parameters: - right (DeferredDataFrame or named DeferredSeries) – Object to merge with.
- how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'inner') –
Type of merge to be performed.
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
- cross: creates the cartesian product from both frames, preserves the order of the left keys.
New in version 1.2.0.
- on (label or list) – Column or index level names to join on. These must be found in both DeferredDataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DeferredDataFrames.
- left_on (label or list, or array-like) – Column or index level names to join on in the left DeferredDataFrame. Can also be an array or list of arrays of the length of the left DeferredDataFrame. These arrays are treated as if they are columns.
- right_on (label or list, or array-like) – Column or index level names to join on in the right DeferredDataFrame. Can also be an array or list of arrays of the length of the right DeferredDataFrame. These arrays are treated as if they are columns.
- left_index (bool, default False) – Use the index from the left DeferredDataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DeferredDataFrame (either the index or a number of columns) must match the number of levels.
- right_index (bool, default False) – Use the index from the right DeferredDataFrame as the join key. Same caveats as left_index.
- sort (bool, default False) – Sort the join keys lexicographically in the result DeferredDataFrame. If False, the order of the join keys depends on the join type (how keyword).
- suffixes (list-like, default is ("_x", "_y")) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.
- copy (bool, default True) – If False, avoid copy if possible.
- indicator (bool or str, default False) – If True, adds a column to the output DeferredDataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DeferredDataFrame, “right_only” for observations whose merge key only appears in the right DeferredDataFrame, and “both” if the observation’s merge key is found in both DeferredDataFrames.
- validate (str, optional) –
If specified, checks if merge is of specified type.
- “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
- “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
- “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
- “many_to_many” or “m:m”: allowed, but does not result in checks.
Returns: A DeferredDataFrame of the two merged objects.
Return type: DeferredDataFrame
Differences from pandas
merge is not parallelizable unless left_index or right_index is True, because it requires generating an entirely new unique index. See notes on DeferredDataFrame.reset_index(). It is recommended to move the join key for one of your columns to the index to avoid this issue. For an example see the enrich pipeline in apache_beam.examples.dataframe.taxiride.
how="cross" is not yet supported.
See also
merge_ordered()
- Merge with optional filling/interpolation.
merge_asof()
- Merge on nearest keys.
DeferredDataFrame.join()
- Similar method using indices.
Notes
Support for specifying index levels as the on, left_on, and right_on parameters was added in pandas version 0.23.0.
Support for merging named DeferredSeries objects was added in pandas version 0.24.0.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], ... 'value': [1, 2, 3, 5]}) >>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], ... 'value': [5, 6, 7, 8]}) >>> df1 lkey value 0 foo 1 1 bar 2 2 baz 3 3 foo 5 >>> df2 rkey value 0 foo 5 1 bar 6 2 baz 7 3 foo 8 Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended. >>> df1.merge(df2, left_on='lkey', right_on='rkey') lkey value_x rkey value_y 0 foo 1 foo 5 1 foo 1 foo 8 2 foo 5 foo 5 3 foo 5 foo 8 4 bar 2 bar 6 5 baz 3 baz 7 Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns. >>> df1.merge(df2, left_on='lkey', right_on='rkey', ... suffixes=('_left', '_right')) lkey value_left rkey value_right 0 foo 1 foo 5 1 foo 1 foo 8 2 foo 5 foo 5 3 foo 5 foo 8 4 bar 2 bar 6 5 baz 3 baz 7 Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns. >>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False)) Traceback (most recent call last): ... ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object') >>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]}) >>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]}) >>> df1 a b 0 foo 1 1 bar 2 >>> df2 a c 0 foo 3 1 baz 4 >>> df1.merge(df2, how='inner', on='a') a b c 0 foo 1 3 >>> df1.merge(df2, how='left', on='a') a b c 0 foo 1 3.0 1 bar 2 NaN >>> df1 = pd.DataFrame({'left': ['foo', 'bar']}) >>> df2 = pd.DataFrame({'right': [7, 8]}) >>> df1 left 0 foo 1 bar >>> df2 right 0 7 1 8 >>> df1.merge(df2, how='cross') left right 0 foo 7 1 foo 8 2 bar 7 3 bar 8
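The "Differences from pandas" note recommends moving the join key to the index so that the merge can use left_index or right_index. The sketch below shows what that rewrite looks like in plain pandas, using small made-up frames; it is only an illustration of the shape of the call, and the taxiride example referenced above remains the authoritative Beam usage.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["foo", "bar"], "b": [1, 2]})
df2 = pd.DataFrame({"key": ["foo", "baz"], "c": [3, 4]})

# Column-on-column merge: the result gets a brand-new integer index.
on_columns = df1.merge(df2, on="key")

# Same inner join with the key moved to the index on both sides.
on_index = df1.set_index("key").merge(
    df2.set_index("key"), left_index=True, right_index=True)
```

Both results contain the single matching key "foo"; the index-based form carries the key in the index rather than in a column.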
-
nlargest
(keep, **kwargs)[source]¶ Return the first n rows ordered by columns in descending order.
Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.
This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.
Parameters: - n (int) – Number of rows to return.
- columns (label or list of labels) – Column label(s) to order by.
- keep ({'first', 'last', 'all'}, default 'first') –
Where there are duplicate values:
- first : prioritize the first occurrence(s)
- last : prioritize the last occurrence(s)
- all : do not drop any duplicates, even if it means selecting more than n items.
Returns: The first n rows ordered by the given columns in descending order.
Return type: DeferredDataFrame
Differences from pandas
Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about which duplicate element is kept.
See also
DeferredDataFrame.nsmallest()
- Return the first n rows ordered by columns in ascending order.
DeferredDataFrame.sort_values()
- Sort DeferredDataFrame by the values.
DeferredDataFrame.head()
- Return the first n rows without re-ordering.
Notes
This function cannot be used with all column types. For example, when specifying columns with object or category dtypes, TypeError is raised.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000, ... 434000, 434000, 337000, 11300, ... 11300, 11300], ... 'GDP': [1937894, 2583560 , 12011, 4520, 12128, ... 17036, 182, 38, 311], ... 'alpha-2': ["IT", "FR", "MT", "MV", "BN", ... "IS", "NR", "TV", "AI"]}, ... index=["Italy", "France", "Malta", ... "Maldives", "Brunei", "Iceland", ... "Nauru", "Tuvalu", "Anguilla"]) >>> df population GDP alpha-2 Italy 59000000 1937894 IT France 65000000 2583560 FR Malta 434000 12011 MT Maldives 434000 4520 MV Brunei 434000 12128 BN Iceland 337000 17036 IS Nauru 11300 182 NR Tuvalu 11300 38 TV Anguilla 11300 311 AI In the following example, we will use ``nlargest`` to select the three rows having the largest values in column "population". >>> df.nlargest(3, 'population') population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Malta 434000 12011 MT When using ``keep='last'``, ties are resolved in reverse order: >>> df.nlargest(3, 'population', keep='last') population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Brunei 434000 12128 BN When using ``keep='all'``, all duplicate items are maintained: >>> df.nlargest(3, 'population', keep='all') population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Malta 434000 12011 MT Maldives 434000 4520 MV Brunei 434000 12128 BN To order by the largest values in column "population" and then "GDP", we can specify multiple columns like in the next example. >>> df.nlargest(3, ['population', 'GDP']) population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Brunei 434000 12128 BN
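The stated equivalence with sort_values followed by head can be checked on a small frame. This sketch uses plain pandas and a subset of the data above with no tied values, so the choice of keep does not affect the result.

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [59000000, 65000000, 434000, 337000],
     "GDP": [1937894, 2583560, 12011, 17036]},
    index=["Italy", "France", "Malta", "Iceland"])

top_two = df.nlargest(2, "population")

# The same selection, spelled as a full sort followed by head.
via_sort = df.sort_values("population", ascending=False).head(2)
```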
-
nsmallest
(keep, **kwargs)[source]¶ Return the first n rows ordered by columns in ascending order.
Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.
This method is equivalent to df.sort_values(columns, ascending=True).head(n), but more performant.
Parameters: - n (int) – Number of items to retrieve.
- columns (list or str) – Column name or names to order by.
- keep ({'first', 'last', 'all'}, default 'first') –
Where there are duplicate values:
- first : take the first occurrence.
- last : take the last occurrence.
- all : do not drop any duplicates, even if it means selecting more than n items.
Return type: DeferredDataFrame
Differences from pandas
Only keep=False and keep="any" are supported. Other values of keep make this an order-sensitive operation. Note keep="any" is a Beam-specific option that guarantees only one duplicate will be kept, but unlike "first" and "last" it makes no guarantees about which duplicate element is kept.
See also
DeferredDataFrame.nlargest()
- Return the first n rows ordered by columns in descending order.
DeferredDataFrame.sort_values()
- Sort DeferredDataFrame by the values.
DeferredDataFrame.head()
- Return the first n rows without re-ordering.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000, ... 434000, 434000, 337000, 337000, ... 11300, 11300], ... 'GDP': [1937894, 2583560 , 12011, 4520, 12128, ... 17036, 182, 38, 311], ... 'alpha-2': ["IT", "FR", "MT", "MV", "BN", ... "IS", "NR", "TV", "AI"]}, ... index=["Italy", "France", "Malta", ... "Maldives", "Brunei", "Iceland", ... "Nauru", "Tuvalu", "Anguilla"]) >>> df population GDP alpha-2 Italy 59000000 1937894 IT France 65000000 2583560 FR Malta 434000 12011 MT Maldives 434000 4520 MV Brunei 434000 12128 BN Iceland 337000 17036 IS Nauru 337000 182 NR Tuvalu 11300 38 TV Anguilla 11300 311 AI In the following example, we will use ``nsmallest`` to select the three rows having the smallest values in column "population". >>> df.nsmallest(3, 'population') population GDP alpha-2 Tuvalu 11300 38 TV Anguilla 11300 311 AI Iceland 337000 17036 IS When using ``keep='last'``, ties are resolved in reverse order: >>> df.nsmallest(3, 'population', keep='last') population GDP alpha-2 Anguilla 11300 311 AI Tuvalu 11300 38 TV Nauru 337000 182 NR When using ``keep='all'``, all duplicate items are maintained: >>> df.nsmallest(3, 'population', keep='all') population GDP alpha-2 Tuvalu 11300 38 TV Anguilla 11300 311 AI Iceland 337000 17036 IS Nauru 337000 182 NR To order by the smallest values in column "population" and then "GDP", we can specify multiple columns like in the next example. >>> df.nsmallest(3, ['population', 'GDP']) population GDP alpha-2 Tuvalu 11300 38 TV Anguilla 11300 311 AI Nauru 337000 182 NR
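To make the effect of ties concrete, this pandas-only sketch asks for a single row when two rows share the smallest value. It shows pandas behavior for keep='all'; per the "Differences from pandas" note, the Beam API may not accept this value.

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [337000, 11300, 11300, 434000]},
    index=["Iceland", "Tuvalu", "Anguilla", "Malta"])

# Default keep='first': exactly one row, the first of the tied pair.
first_only = df.nsmallest(1, "population")

# keep='all': every row tied for the smallest value is returned,
# so the result can have more than n rows.
with_ties = df.nsmallest(1, "population", keep="all")
```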
-
plot
(**kwargs)¶ pandas.DataFrame.plot()
is not yet supported in the Beam DataFrame API because it is a plotting tool. For more information see https://s.apache.org/dataframe-plotting-tools.
-
pop
(item)[source]¶ Return item and drop from frame. Raise KeyError if not found.
Parameters: item (label) – Label of column to be popped.
Returns:
Return type: DeferredSeries
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN

>>> df.pop('class')
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object

>>> df
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN
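The summary line mentions two behaviours: the column is removed from the frame, and a missing label raises KeyError. A small pandas sketch of both (plain pandas, not the deferred Beam API; the frame is a shortened version of the one above):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["falcon", "lion"], "class": ["bird", "mammal"], "max_speed": [389.0, 80.5]}
)

popped = df.pop("class")  # returns the column and removes it from df

# Popping the same label again fails because the column is gone.
try:
    df.pop("class")
    second_pop_failed = False
except KeyError:
    second_pop_failed = True

print(popped.tolist(), list(df.columns), second_pop_failed)
```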
-
quantile
(q, axis, **kwargs)[source]¶ Return values at the given quantile over requested axis.
Parameters: - q (float or array-like, default 0.5 (50% quantile)) – Value between 0 <= q <= 1, the quantile(s) to compute.
- axis ({0, 1, 'index', 'columns'}, default 0) – Equals 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- numeric_only (bool, default True) – If False, the quantile of datetime and timedelta data will be computed as well.
- interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
- linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
- lower: i.
- higher: j.
- nearest: i or j whichever is nearest.
- midpoint: (i + j) / 2.
Returns: - If q is an array, a DeferredDataFrame will be returned where the index is q, the columns are the columns of self, and the values are the quantiles.
- If q is a float, a DeferredSeries will be returned where the index is the columns of self and the values are the quantiles.
Return type: Differences from pandas
quantile(axis="index") is not parallelizable. See BEAM-12167 tracking the possible addition of an approximate, parallelizable implementation of quantile.
When using quantile with axis="columns" only a single q value can be specified.
See also
core.window.Rolling.quantile()
- Rolling quantile.
numpy.percentile()
- Numpy function to compute the percentile.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
...                   columns=['a', 'b'])
>>> df.quantile(.1)
a    1.3
b    3.7
Name: 0.1, dtype: float64
>>> df.quantile([.1, .5])
       a     b
0.1  1.3   3.7
0.5  2.5  55.0

Specifying `numeric_only=False` will also compute the quantile of datetime and timedelta data.

>>> df = pd.DataFrame({'A': [1, 2],
...                    'B': [pd.Timestamp('2010'),
...                          pd.Timestamp('2011')],
...                    'C': [pd.Timedelta('1 days'),
...                          pd.Timedelta('2 days')]})
>>> df.quantile(0.5, numeric_only=False)
A                    1.5
B    2010-07-02 12:00:00
C        1 days 12:00:00
Name: 0.5, dtype: object
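The interpolation rules listed under Parameters can be worked through by hand. The sketch below is a plain-Python illustration, not the pandas or Beam implementation. The helper name quantile and the tie rule for 'nearest' are assumptions made for this example. It reproduces the values in the first example above for columns a and b.

```python
import math

def quantile(values, q, interpolation="linear"):
    """Quantile of a non-empty sequence, following the rules listed above."""
    data = sorted(values)
    pos = (len(data) - 1) * q       # fractional position of the quantile
    lo = math.floor(pos)
    hi = math.ceil(pos)
    i, j = data[lo], data[hi]       # the two surrounding data points
    fraction = pos - lo
    if interpolation == "linear":
        return i + (j - i) * fraction
    if interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "midpoint":
        return (i + j) / 2
    if interpolation == "nearest":
        # Assumption: exact ties go to the lower point; libraries may differ.
        return i if fraction <= 0.5 else j
    raise ValueError(f"unknown interpolation: {interpolation!r}")

a = [1, 2, 3, 4]
b = [1, 10, 100, 100]
print(quantile(a, 0.1), quantile(b, 0.1))
print(quantile(a, 0.5), quantile(b, 0.5))
```

For q = 0.1 and four rows the position is 0.3, so the result lies 30% of the way from the first to the second sorted value.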
-
rename
(**kwargs)[source]¶ Alter axes labels.
Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
See the user guide for more.
Parameters: - mapper (dict-like or function) – Dict-like or function transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.
- index (dict-like or function) – Alternative to specifying axis (mapper, axis=0 is equivalent to index=mapper).
- columns (dict-like or function) – Alternative to specifying axis (mapper, axis=1 is equivalent to columns=mapper).
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to target with mapper. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’.
- copy (bool, default True) – Also copy underlying data.
- inplace (bool, default False) – Whether to return a new DeferredDataFrame. If True then value of copy is ignored.
- level (int or level name, default None) – In case of a MultiIndex, only rename labels in the specified level.
- errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise a KeyError when a dict-like mapper, index, or columns contains labels that are not present in the Index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.
Returns: DeferredDataFrame with the renamed axis labels or None if inplace=True.
Return type:
Raises: KeyError – If any of the labels is not found in the selected axis and “errors=’raise’”.
Differences from pandas
rename is not parallelizable when axis="index" and errors="raise". It requires collecting all data on a single node in order to detect if one of the index values is missing.
See also
DeferredDataFrame.rename_axis()
- Set the name of the axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
``DataFrame.rename`` supports two calling conventions * ``(index=index_mapper, columns=columns_mapper, ...)`` * ``(mapper, axis={'index', 'columns'}, ...)`` We *highly* recommend using keyword arguments to clarify your intent. Rename columns using a mapping: >>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) >>> df.rename(columns={"A": "a", "B": "c"}) a c 0 1 4 1 2 5 2 3 6 Rename index using a mapping: >>> df.rename(index={0: "x", 1: "y", 2: "z"}) A B x 1 4 y 2 5 z 3 6 Cast index labels to a different type: >>> df.index RangeIndex(start=0, stop=3, step=1) >>> df.rename(index=str).index Index(['0', '1', '2'], dtype='object') >>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise") Traceback (most recent call last): KeyError: ['C'] not found in axis Using axis-style parameters: >>> df.rename(str.lower, axis='columns') a b 0 1 4 1 2 5 2 3 6 >>> df.rename({1: 2, 2: 4}, axis='index') A B 0 1 4 2 2 5 4 3 6
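The restriction on axis="index" with errors="raise" follows from how the rows are distributed. The sketch below is a plain-Python illustration, not Beam's implementation; the partitions list and both helper names are hypothetical. Each partition can apply the mapping to its own labels independently, but deciding whether a mapper key is missing needs the labels from every partition.

```python
# Hypothetical index labels held by three separate workers.
partitions = [[0, 1], [2, 3], [4]]
mapper = {1: "one", 4: "four", 9: "nine"}

def rename_partition(labels, mapper):
    """errors='ignore': each partition can be renamed on its own."""
    return [mapper.get(label, label) for label in labels]

def missing_labels(partitions, mapper):
    """errors='raise': needs the union of labels across all partitions."""
    seen = set()
    for labels in partitions:
        seen.update(labels)
    return sorted(set(mapper) - seen)

renamed = [rename_partition(labels, mapper) for labels in partitions]
missing = missing_labels(partitions, mapper)
print(renamed)
print(missing)
```

No single partition can tell that 9 is absent; only the combined view can, which is why the check is not parallelizable.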
-
rename_axis
(**kwargs)¶ Set the name of the axis for the index or columns.
Parameters: - mapper (scalar, list-like, optional) – Value to set the axis name attribute.
- index, columns (scalar, list-like, dict-like or function, optional) –
A scalar, list-like, dict-like or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a DeferredSeries. This parameter only applies to DeferredDataFrame objects.
Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename.
- copy (bool, default True) – Also copy underlying data.
- inplace (bool, default False) – Modifies the object directly, instead of creating a new DeferredSeries or DeferredDataFrame.
Returns: The same type as the caller or None if inplace=True.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.rename()
- Alter DeferredSeries index labels or name.
DeferredDataFrame.rename()
- Alter DeferredDataFrame index labels or name.
Index.rename()
- Set new names on index.
Notes
DeferredDataFrame.rename_axis supports two calling conventions:
- (index=index_mapper, columns=columns_mapper, ...)
- (mapper, axis={'index', 'columns'}, ...)
The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter copy is ignored.
The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.
We highly recommend using keyword arguments to clarify your intent.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
**Series** >>> s = pd.Series(["dog", "cat", "monkey"]) >>> s 0 dog 1 cat 2 monkey dtype: object >>> s.rename_axis("animal") animal 0 dog 1 cat 2 monkey dtype: object **DataFrame** >>> df = pd.DataFrame({"num_legs": [4, 4, 2], ... "num_arms": [0, 0, 2]}, ... ["dog", "cat", "monkey"]) >>> df num_legs num_arms dog 4 0 cat 4 0 monkey 2 2 >>> df = df.rename_axis("animal") >>> df num_legs num_arms animal dog 4 0 cat 4 0 monkey 2 2 >>> df = df.rename_axis("limbs", axis="columns") >>> df limbs num_legs num_arms animal dog 4 0 cat 4 0 monkey 2 2 **MultiIndex** >>> df.index = pd.MultiIndex.from_product([['mammal'], ... ['dog', 'cat', 'monkey']], ... names=['type', 'name']) >>> df limbs num_legs num_arms type name mammal dog 4 0 cat 4 0 monkey 2 2 >>> df.rename_axis(index={'type': 'class'}) limbs num_legs num_arms class name mammal dog 4 0 cat 4 0 monkey 2 2 >>> df.rename_axis(columns=str.upper) LIMBS num_legs num_arms type name mammal dog 4 0 cat 4 0 monkey 2 2
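rename_axis and rename are easy to confuse: the former sets the name of an axis, the latter changes the labels on it. A minimal pandas sketch of the difference (plain pandas rather than the deferred Beam API; the small frame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [4, 2]}, index=["dog", "monkey"])

named = df.rename_axis("animal")               # sets the index *name*
relabeled = df.rename(index={"dog": "puppy"})  # changes index *labels*

print(named.index.name, named.index.tolist())
print(relabeled.index.name, relabeled.index.tolist())
```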
-
round
(decimals, *args, **kwargs)[source]¶ Round a DataFrame to a variable number of decimal places.
Parameters: - decimals (int, dict, DeferredSeries) – Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and DeferredSeries round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a DeferredSeries. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.
- *args – Additional keywords have no effect but might be accepted for compatibility with numpy.
- **kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.
Returns: A DeferredDataFrame with the affected columns rounded to the specified number of decimal places.
Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
numpy.around()
- Round a numpy array to the given number of decimals.
DeferredSeries.round()
- Round a DeferredSeries to the given number of decimals.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)], ... columns=['dogs', 'cats']) >>> df dogs cats 0 0.21 0.32 1 0.01 0.67 2 0.66 0.03 3 0.21 0.18 By providing an integer each column is rounded to the same number of decimal places >>> df.round(1) dogs cats 0 0.2 0.3 1 0.0 0.7 2 0.7 0.0 3 0.2 0.2 With a dict, the number of places for specific columns can be specified with the column names as key and the number of decimal places as value >>> df.round({'dogs': 1, 'cats': 0}) dogs cats 0 0.2 0.0 1 0.0 1.0 2 0.7 0.0 3 0.2 0.0 Using a Series, the number of places for specific columns can be specified with the column names as index and the number of decimal places as value >>> decimals = pd.Series([0, 1], index=['cats', 'dogs']) >>> df.round(decimals) dogs cats 0 0.2 0.0 1 0.0 1.0 2 0.7 0.0 3 0.2 0.0
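Two rules from the decimals description are not covered by the examples above: columns missing from decimals are left as is, and entries that do not match a column are ignored. A short pandas sketch of both, where the 'birds' key is deliberately not a column:

```python
import pandas as pd

df = pd.DataFrame({"dogs": [0.21, 0.66], "cats": [0.32, 0.03]})

# 'birds' is not a column, so it is ignored; 'cats' is not listed,
# so it is left unchanged.
rounded = df.round({"dogs": 1, "birds": 2})

print(rounded)
```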
-
select_dtypes
(**kwargs)¶ Return a subset of the DataFrame’s columns based on the column dtypes.
Parameters: include, exclude (scalar or list-like) – A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.
Returns: The subset of the frame including the dtypes in include and excluding the dtypes in exclude.
Return type: DeferredDataFrame
Raises: ValueError –
- If both of include and exclude are empty
- If include and exclude have overlapping elements
- If any kind of string dtype is passed in.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.dtypes()
- Return DeferredSeries with the data type of each column.
Notes
- To select all numeric types, use np.number or 'number'
- To select strings you must use the object dtype, but note that this will return all object dtype columns
- See the numpy dtype hierarchy
- To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
- To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
- To select Pandas categorical dtypes, use 'category'
- To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'a': [1, 2] * 3, ... 'b': [True, False] * 3, ... 'c': [1.0, 2.0] * 3}) >>> df a b c 0 1 True 1.0 1 2 False 2.0 2 1 True 1.0 3 2 False 2.0 4 1 True 1.0 5 2 False 2.0 >>> df.select_dtypes(include='bool') b 0 True 1 False 2 True 3 False 4 True 5 False >>> df.select_dtypes(include=['float64']) c 0 1.0 1 2.0 2 1.0 3 2.0 4 1.0 5 2.0 >>> df.select_dtypes(exclude=['int64']) b c 0 True 1.0 1 False 2.0 2 True 1.0 3 False 2.0 4 True 1.0 5 False 2.0
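The notes and the Raises entry can be exercised together. This pandas sketch (plain pandas, with a frame shaped like the one above) selects numeric columns with 'number', combines include with exclude, and triggers the overlap error:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [True, False], "c": [1.0, 2.0]})

# 'number' matches integer and float columns, but not booleans.
numeric = df.select_dtypes(include="number")

# include and exclude can be combined as long as they do not overlap.
numeric_not_float = df.select_dtypes(include="number", exclude="float64")

# Listing the same dtype in both include and exclude raises ValueError.
try:
    df.select_dtypes(include=["int64"], exclude=["int64"])
    overlap_error = False
except ValueError:
    overlap_error = True

print(list(numeric.columns), list(numeric_not_float.columns), overlap_error)
```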
-
shift
(axis, freq, **kwargs)[source]¶ Shift index by desired number of periods with an optional time freq.
When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.
Parameters: - periods (int) – Number of periods to shift. Can be positive or negative.
- freq (DateOffset, tseries.offsets, timedelta, or str, optional) – Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.
- axis ({0 or 'index', 1 or 'columns', None}, default None) – Shift direction.
- fill_value (object, optional) –
The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, etc. NaT is used. For extension dtypes, self.dtype.na_value is used.
Changed in version 1.1.0.
Returns: Copy of input object, shifted.
Return type: Differences from pandas
shift with axis="index" is only supported with freq specified and fill_value undefined. Other configurations make this operation order-sensitive.
See also
Index.shift()
- Shift values of Index.
DatetimeIndex.shift()
- Shift values of DatetimeIndex.
PeriodIndex.shift()
- Shift values of PeriodIndex.
tshift()
- Shift the time index, using the index’s frequency if available.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45], ... "Col2": [13, 23, 18, 33, 48], ... "Col3": [17, 27, 22, 37, 52]}, ... index=pd.date_range("2020-01-01", "2020-01-05")) >>> df Col1 Col2 Col3 2020-01-01 10 13 17 2020-01-02 20 23 27 2020-01-03 15 18 22 2020-01-04 30 33 37 2020-01-05 45 48 52 >>> df.shift(periods=3) Col1 Col2 Col3 2020-01-01 NaN NaN NaN 2020-01-02 NaN NaN NaN 2020-01-03 NaN NaN NaN 2020-01-04 10.0 13.0 17.0 2020-01-05 20.0 23.0 27.0 >>> df.shift(periods=1, axis="columns") Col1 Col2 Col3 2020-01-01 NaN 10 13 2020-01-02 NaN 20 23 2020-01-03 NaN 15 18 2020-01-04 NaN 30 33 2020-01-05 NaN 45 48 >>> df.shift(periods=3, fill_value=0) Col1 Col2 Col3 2020-01-01 0 0 0 2020-01-02 0 0 0 2020-01-03 0 0 0 2020-01-04 10 13 17 2020-01-05 20 23 27 >>> df.shift(periods=3, freq="D") Col1 Col2 Col3 2020-01-04 10 13 17 2020-01-05 20 23 27 2020-01-06 15 18 22 2020-01-07 30 33 37 2020-01-08 45 48 52 >>> df.shift(periods=3, freq="infer") Col1 Col2 Col3 2020-01-04 10 13 17 2020-01-05 20 23 27 2020-01-06 15 18 22 2020-01-07 30 33 37 2020-01-08 45 48 52
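The order-sensitivity mentioned under 'Differences from pandas' can be seen in plain pandas. In the sketch below (an illustration only, not Beam code), the same rows are shifted in two different orders. With freq only the index labels move, so both orders agree; without freq the values move to the neighbouring row, so the result depends on row order.

```python
import pandas as pd

df = pd.DataFrame(
    {"Col1": [10, 20, 15]},
    index=pd.date_range("2020-01-01", "2020-01-03"),
)
shuffled = df.iloc[[2, 0, 1]]  # same rows, different order

# With freq, only the labels move: row order does not matter.
by_freq = df.shift(periods=1, freq="D")
by_freq_shuffled = shuffled.shift(periods=1, freq="D").sort_index()

# Without freq, values move to the next row: row order matters.
by_position = df.shift(periods=1)
by_position_shuffled = shuffled.shift(periods=1).sort_index()

print(by_freq["Col1"].tolist(), by_freq_shuffled["Col1"].tolist())
print(by_position["Col1"].tolist(), by_position_shuffled["Col1"].tolist())
```

A distributed collection has no inherent row order, which is consistent with only the freq form being supported along the index.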
-
shape
¶ pandas.DataFrame.shape()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
stack
(**kwargs)¶ Stack the prescribed level(s) from columns to index.
Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:
- if the columns have a single level, the output is a Series;
- if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s) and the output is a DataFrame.
Parameters: - level (int, str, list, default -1) – Level(s) to stack from the column axis onto the index axis, defined as one index or label, or a list of indices or labels.
- dropna (bool, default True) – Whether to drop rows in the resulting Frame/DeferredSeries with missing values. Stacking a column level onto the index axis can create combinations of index and column values that are missing from the original dataframe. See Examples section.
Returns: Stacked dataframe or series.
Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.unstack()
- Unstack prescribed level(s) from index axis onto column axis.
DeferredDataFrame.pivot()
- Reshape dataframe from long format to wide format.
DeferredDataFrame.pivot_table()
- Create a spreadsheet-style pivot table as a DeferredDataFrame.
Notes
The function is named by analogy with a collection of books being reorganized from being side by side on a horizontal position (the columns of the dataframe) to being stacked vertically on top of each other (in the index of the dataframe).
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
**Single level columns** >>> df_single_level_cols = pd.DataFrame([[0, 1], [2, 3]], ... index=['cat', 'dog'], ... columns=['weight', 'height']) Stacking a dataframe with a single level column axis returns a Series: >>> df_single_level_cols weight height cat 0 1 dog 2 3 >>> df_single_level_cols.stack() cat weight 0 height 1 dog weight 2 height 3 dtype: int64 **Multi level columns: simple case** >>> multicol1 = pd.MultiIndex.from_tuples([('weight', 'kg'), ... ('weight', 'pounds')]) >>> df_multi_level_cols1 = pd.DataFrame([[1, 2], [2, 4]], ... index=['cat', 'dog'], ... columns=multicol1) Stacking a dataframe with a multi-level column axis: >>> df_multi_level_cols1 weight kg pounds cat 1 2 dog 2 4 >>> df_multi_level_cols1.stack() weight cat kg 1 pounds 2 dog kg 2 pounds 4 **Missing values** >>> multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'), ... ('height', 'm')]) >>> df_multi_level_cols2 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], ... index=['cat', 'dog'], ... columns=multicol2) It is common to have missing values when stacking a dataframe with multi-level columns, as the stacked dataframe typically has more values than the original dataframe. Missing values are filled with NaNs: >>> df_multi_level_cols2 weight height kg m cat 1.0 2.0 dog 3.0 4.0 >>> df_multi_level_cols2.stack() height weight cat kg NaN 1.0 m 2.0 NaN dog kg NaN 3.0 m 4.0 NaN **Prescribing the level(s) to be stacked** The first parameter controls which level or levels are stacked: >>> df_multi_level_cols2.stack(0) kg m cat height NaN 2.0 weight 1.0 NaN dog height NaN 4.0 weight 3.0 NaN >>> df_multi_level_cols2.stack([0, 1]) cat height m 2.0 weight kg 1.0 dog height m 4.0 weight kg 3.0 dtype: float64 **Dropping missing values** >>> df_multi_level_cols3 = pd.DataFrame([[None, 1.0], [2.0, 3.0]], ... index=['cat', 'dog'], ... 
columns=multicol2) Note that rows where all values are missing are dropped by default but this behaviour can be controlled via the dropna keyword parameter: >>> df_multi_level_cols3 weight height kg m cat NaN 1.0 dog 2.0 3.0 >>> df_multi_level_cols3.stack(dropna=False) height weight cat kg NaN NaN m 1.0 NaN dog kg NaN 2.0 m 3.0 NaN >>> df_multi_level_cols3.stack(dropna=True) height weight cat m 1.0 NaN dog kg NaN 2.0 m 3.0 NaN
-
all
(*args, **kwargs)¶ Return whether all elements are True, potentially over an axis.
Returns True unless there is at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).
Parameters: - axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced.
- 0 / ‘index’ : reduce the index, return a DeferredSeries whose index is the original column labels.
- 1 / ‘columns’ : reduce the columns, return a DeferredSeries whose index is the original index.
- None : reduce all axes, return a scalar.
- bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for DeferredSeries.
- skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: If level is specified, then, DeferredDataFrame is returned; otherwise, DeferredSeries is returned.
Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.all()
- Return True if all elements are True.
DeferredDataFrame.any()
- Return True if one (or more) elements are True.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
**Series** >>> pd.Series([True, True]).all() True >>> pd.Series([True, False]).all() False >>> pd.Series([], dtype="float64").all() True >>> pd.Series([np.nan]).all() True >>> pd.Series([np.nan]).all(skipna=False) True **DataFrames** Create a dataframe from a dictionary. >>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]}) >>> df col1 col2 0 True True 1 True False Default behaviour checks if column-wise values all return True. >>> df.all() col1 True col2 False dtype: bool Specify ``axis='columns'`` to check if row-wise values all return True. >>> df.all(axis='columns') 0 True 1 False dtype: bool Or ``axis=None`` for whether every value is True. >>> df.all(axis=None) False
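The skipna rule can be restated in plain Python. The helper below, all_values, is a hypothetical name used only for this sketch and is not part of pandas or Beam. It drops NaN when skipna is True and otherwise relies on the fact that NaN is truthy because it is not equal to zero.

```python
import math

def all_values(values, skipna=True):
    """Plain-Python sketch of the skipna rule described for all()."""
    if skipna:
        values = [
            v for v in values
            if not (isinstance(v, float) and math.isnan(v))
        ]
    # An empty sequence yields True; a kept NaN is truthy.
    return all(bool(v) for v in values)

nan = float("nan")
print(all_values([True, True]), all_values([True, False]))
print(all_values([]), all_values([nan]), all_values([nan], skipna=False))
```

The five calls mirror the Series examples above.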
-
any
(*args, **kwargs)¶ Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
Parameters: - axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced.
- 0 / ‘index’ : reduce the index, return a DeferredSeries whose index is the original column labels.
- 1 / ‘columns’ : reduce the columns, return a DeferredSeries whose index is the original index.
- None : reduce all axes, return a scalar.
- bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for DeferredSeries.
- skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: If level is specified, then, DeferredDataFrame is returned; otherwise, DeferredSeries is returned.
Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
numpy.any()
- Numpy version of this method.
DeferredSeries.any()
- Return whether any element is True.
DeferredSeries.all()
- Return whether all elements are True.
DeferredDataFrame.any()
- Return whether any element is True over requested axis.
DeferredDataFrame.all()
- Return whether all elements are True over requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
**Series** For Series input, the output is a scalar indicating whether any element is True. >>> pd.Series([False, False]).any() False >>> pd.Series([True, False]).any() True >>> pd.Series([], dtype="float64").any() False >>> pd.Series([np.nan]).any() False >>> pd.Series([np.nan]).any(skipna=False) True **DataFrame** Whether each column contains at least one True element (the default). >>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]}) >>> df A B C 0 1 0 0 1 2 2 0 >>> df.any() A True B True C False dtype: bool Aggregating over the columns. >>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]}) >>> df A B 0 True 1 1 False 2 >>> df.any(axis='columns') 0 True 1 True dtype: bool >>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]}) >>> df A B 0 True 1 1 False 0 >>> df.any(axis='columns') 0 True 1 False dtype: bool Aggregating over the entire DataFrame with ``axis=None``. >>> df.any(axis=None) True `any` for an empty DataFrame is an empty Series. >>> pd.DataFrame([]).any() Series([], dtype: bool)
-
count
(*args, **kwargs)¶ Count non-NA cells for each column or row.
The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
- level (int or str, optional) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredDataFrame. A str specifies the level name.
- numeric_only (bool, default False) – Include only float, int or boolean data.
Returns: For each column/row the number of non-NA/null entries. If level is specified returns a DeferredDataFrame.
Return type: Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.count()
- Number of non-NA elements in a DeferredSeries.
DeferredDataFrame.value_counts()
- Count unique combinations of columns.
DeferredDataFrame.shape()
- Number of DeferredDataFrame rows and columns (including NA elements).
DeferredDataFrame.isna()
- Boolean same-sized DeferredDataFrame showing places of NA elements.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Constructing DataFrame from a dictionary: >>> df = pd.DataFrame({"Person": ... ["John", "Myla", "Lewis", "John", "Myla"], ... "Age": [24., np.nan, 21., 33, 26], ... "Single": [False, True, True, True, False]}) >>> df Person Age Single 0 John 24.0 False 1 Myla NaN True 2 Lewis 21.0 True 3 John 33.0 True 4 Myla 26.0 False Notice the uncounted NA values: >>> df.count() Person 5 Age 4 Single 5 dtype: int64 Counts for each **row**: >>> df.count(axis='columns') 0 3 1 2 2 3 3 3 4 3 dtype: int64
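count is related to isna, listed under See also: counting non-NA cells gives the same numbers as summing the boolean mask of non-missing values. A small pandas sketch (plain pandas, with a shortened version of the frame above):

```python
import pandas as pd

df = pd.DataFrame(
    {"Person": ["John", "Myla", "Lewis"], "Age": [24.0, float("nan"), 21.0]}
)

per_column = df.count()              # non-NA cells in each column
via_notna = df.notna().sum()         # same numbers from the boolean mask
per_row = df.count(axis="columns")   # non-NA cells in each row

print(per_column.tolist(), via_notna.tolist(), per_row.tolist())
```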
-
describe
(*args, **kwargs)¶ Generate descriptive statistics.
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
Parameters: - percentiles (list-like of numbers, optional) – The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
- include ('all', list-like of dtypes or None (default), optional) –
A white list of data types to include in the result. Ignored for DeferredSeries. Here are the options:
- ’all’ : All columns of the input will be included in the output.
- A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
- None (default) : The result will include all numeric columns.
- exclude (list-like of dtypes or None (default), optional) –
A black list of data types to omit from the result. Ignored for DeferredSeries. Here are the options:
- A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'
- None (default) : The result will exclude nothing.
Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DeferredDataFrame input, this also controls whether datetime columns are included by default.
New in version 1.1.0.
Returns: Summary statistics of the DeferredSeries or Dataframe provided.
Return type: Differences from pandas
describe cannot currently be parallelized. It will require collecting all data on a single node.
See also
DeferredDataFrame.count()
- Count number of non-NA/null observations.
DeferredDataFrame.max()
- Maximum of the values in the object.
DeferredDataFrame.min()
- Minimum of the values in the object.
DeferredDataFrame.mean()
- Mean of the values.
DeferredDataFrame.std()
- Standard deviation of the observations.
DeferredDataFrame.select_dtypes()
- Subset of a DeferredDataFrame including/excluding columns based on their dtype.
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DeferredDataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DeferredDataFrame are analyzed for the output. The parameters are ignored when analyzing a DeferredSeries.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
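The numeric rows listed in the Notes for describe can be reproduced with the Python standard library. The following is a minimal sketch, assuming linear interpolation between the closest ranks for the percentiles (this matches the numeric ``Series`` example above). ``numeric_summary`` is a hypothetical helper for illustration only; it is neither Beam nor pandas code.

```python
import statistics


def numeric_summary(values):
    """Hypothetical helper: a describe()-style summary of a list of numbers."""
    # method="inclusive" interpolates linearly between the closest ranks.
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": float(len(values)),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation (N - 1)
        "min": float(min(values)),
        "25%": q25,
        "50%": q50,  # same as the median
        "75%": q75,
        "max": float(max(values)),
    }


summary = numeric_summary([1, 2, 3])
for label, value in summary.items():
    print(f"{label:<6}{value}")
```

Every statistic here needs the complete list of values, which is consistent with the note that describe collects all data on a single node.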
-
max
(*args, **kwargs)¶ Return the maximum of the values over the requested axis.
If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.sum()
- Return the sum.
DeferredSeries.min()
- Return the minimum.
DeferredSeries.max()
- Return the maximum.
DeferredSeries.idxmin()
- Return the index of the minimum.
DeferredSeries.idxmax()
- Return the index of the maximum.
DeferredDataFrame.sum()
- Return the sum over the requested axis.
DeferredDataFrame.min()
- Return the minimum over the requested axis.
DeferredDataFrame.max()
- Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
- Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
- Return the index of the maximum over the requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.max()
8
-
min
(*args, **kwargs)¶ Return the minimum of the values over the requested axis.
If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.sum()
- Return the sum.
DeferredSeries.min()
- Return the minimum.
DeferredSeries.max()
- Return the maximum.
DeferredSeries.idxmin()
- Return the index of the minimum.
DeferredSeries.idxmax()
- Return the index of the maximum.
DeferredDataFrame.sum()
- Return the sum over the requested axis.
DeferredDataFrame.min()
- Return the minimum over the requested axis.
DeferredDataFrame.max()
- Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
- Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
- Return the index of the maximum over the requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.min()
0
-
prod
(*args, **kwargs)¶ Return the product of the values over the requested axis.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.sum()
- Return the sum.
DeferredSeries.min()
- Return the minimum.
DeferredSeries.max()
- Return the maximum.
DeferredSeries.idxmin()
- Return the index of the minimum.
DeferredSeries.idxmax()
- Return the index of the maximum.
DeferredDataFrame.sum()
- Return the sum over the requested axis.
DeferredDataFrame.min()
- Return the minimum over the requested axis.
DeferredDataFrame.max()
- Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
- Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
- Return the index of the maximum over the requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
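The interaction between skipna and min_count shown in these examples can be sketched in plain Python. ``prod_with_min_count`` is a hypothetical helper that mirrors the documented pandas rules; it is an illustration of the semantics, not the Beam implementation.

```python
import math


def prod_with_min_count(values, skipna=True, min_count=0):
    """Hypothetical helper mirroring the skipna/min_count rules shown above."""
    valid = [v for v in values if not math.isnan(v)]
    if not skipna and len(valid) != len(values):
        return math.nan  # a NaN propagates when it is not skipped
    if len(valid) < min_count:
        return math.nan  # too few non-NA values to produce a result
    return math.prod(valid, start=1.0)  # the empty product is 1.0


print(prod_with_min_count([]))                       # empty input
print(prod_with_min_count([], min_count=1))          # empty input, one value required
print(prod_with_min_count([math.nan]))               # all-NA input
print(prod_with_min_count([math.nan], min_count=1))  # all-NA input, one value required
```

Because NaN values are dropped before the count is compared with min_count, an all-NA input and an empty input take the same path.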
-
product
(*args, **kwargs)¶ Return the product of the values over the requested axis.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.sum()
- Return the sum.
DeferredSeries.min()
- Return the minimum.
DeferredSeries.max()
- Return the maximum.
DeferredSeries.idxmin()
- Return the index of the minimum.
DeferredSeries.idxmax()
- Return the index of the maximum.
DeferredDataFrame.sum()
- Return the sum over the requested axis.
DeferredDataFrame.min()
- Return the minimum over the requested axis.
DeferredDataFrame.max()
- Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
- Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
- Return the index of the maximum over the requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
By default, the product of an empty or all-NA Series is ``1``

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the ``min_count`` parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()
1.0

>>> pd.Series([np.nan]).prod(min_count=1)
nan
-
sum
(*args, **kwargs)¶ Return the sum of the values over the requested axis.
This is equivalent to the method numpy.sum.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.sum()
- Return the sum.
DeferredSeries.min()
- Return the minimum.
DeferredSeries.max()
- Return the maximum.
DeferredSeries.idxmin()
- Return the index of the minimum.
DeferredSeries.idxmax()
- Return the index of the maximum.
DeferredDataFrame.sum()
- Return the sum over the requested axis.
DeferredDataFrame.min()
- Return the minimum over the requested axis.
DeferredDataFrame.max()
- Return the maximum over the requested axis.
DeferredDataFrame.idxmin()
- Return the index of the minimum over the requested axis.
DeferredDataFrame.idxmax()
- Return the index of the maximum over the requested axis.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.sum()
14

By default, the sum of an empty or all-NA Series is ``0``.

>>> pd.Series([], dtype="float64").sum()  # min_count=0 is the default
0.0

This can be controlled with the ``min_count`` parameter. For example, if you'd like the sum of an empty series to be NaN, pass ``min_count=1``.

>>> pd.Series([], dtype="float64").sum(min_count=1)
nan

Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and empty series identically.

>>> pd.Series([np.nan]).sum()
0.0

>>> pd.Series([np.nan]).sum(min_count=1)
nan
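The level parameter is described as collapsing a MultiIndex along one level. As a rough sketch of that idea, the snippet below uses the same data as the example above, stored as a plain dict keyed by (blooded, animal) tuples. ``sum_by_level`` is a hypothetical helper; it illustrates the pandas semantics only and says nothing about how Beam executes the operation.

```python
from collections import defaultdict

# Same data as the example above: (blooded, animal) -> legs
legs = {
    ("warm", "dog"): 4,
    ("warm", "falcon"): 2,
    ("cold", "fish"): 0,
    ("cold", "spider"): 8,
}


def sum_by_level(data, level):
    """Hypothetical helper: sum the values that share the key at `level`."""
    totals = defaultdict(int)
    for key, value in data.items():
        totals[key[level]] += value
    return dict(totals)


total = sum(legs.values())              # plain sum over every row
by_blooded = sum_by_level(legs, level=0)  # one entry per 'blooded' label
print(total, by_blooded)
```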
-
mean
(*args, **kwargs)¶ Return the mean of the values over the requested axis.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
mean
cannot currently be parallelized. It will require collecting all data on a single node.
-
median
(*args, **kwargs)¶ Return the median of the values over the requested axis.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
median
cannot currently be parallelized. It will require collecting all data on a single node.
-
nunique
(*args, **kwargs)¶ Count number of distinct elements in specified axis.
Return Series with number of distinct elements. Can ignore NaN values.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- dropna (bool, default True) – Don’t include NaN in the counts.
Returns: Return type: DeferredSeries
Differences from pandas
nunique cannot currently be parallelized. It will require collecting all data on a single node.
See also
DeferredSeries.nunique()
- Method nunique for DeferredSeries.
DeferredDataFrame.count()
- Count non-NA cells for each column or row.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]})
>>> df.nunique()
A    3
B    2
dtype: int64

>>> df.nunique(axis=1)
0    1
1    2
2    2
dtype: int64
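The axis and dropna parameters can be illustrated with a small standard-library sketch that reproduces the counts in the example above. ``nunique`` here is a hypothetical stand-alone function over plain lists, and treating only float NaN as missing is an assumption made for brevity.

```python
import math

columns = {"A": [4, 5, 6], "B": [4, 1, 1]}


def nunique(values, dropna=True):
    """Hypothetical helper: count distinct values, optionally ignoring NaN."""
    distinct = set()
    saw_nan = False
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            saw_nan = True
        else:
            distinct.add(v)
    # With dropna=False, all NaN values together count as one extra value.
    return len(distinct) + (1 if saw_nan and not dropna else 0)


per_column = {name: nunique(vals) for name, vals in columns.items()}  # axis=0
per_row = [nunique(row) for row in zip(*columns.values())]            # axis=1
print(per_column, per_row)
```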
-
std
(*args, **kwargs)¶ Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
std cannot currently be parallelized. It will require collecting all data on a single node.
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1).
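The effect of ddof is easiest to see by writing out the divisor N - ddof. The sketch below uses hypothetical helpers ``var`` and ``std`` over plain lists and checks them against the standard library, where ``statistics.stdev`` corresponds to ddof=1 and ``statistics.pstdev`` to ddof=0.

```python
import math
import statistics


def var(values, ddof=1):
    """Hypothetical helper: variance with divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


def std(values, ddof=1):
    """Hypothetical helper: standard deviation with divisor N - ddof."""
    return math.sqrt(var(values, ddof))


data = [1.0, 2.0, 3.0, 4.0]
print(std(data), statistics.stdev(data))           # sample estimate, ddof=1
print(std(data, ddof=0), statistics.pstdev(data))  # population estimate, ddof=0
```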
-
var
(*args, **kwargs)¶ Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
var cannot currently be parallelized. It will require collecting all data on a single node.
Notes
To have the same behaviour as numpy.var, use ddof=0 (instead of the default ddof=1).
-
sem
(*args, **kwargs)¶ Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
Parameters: - axis ({index (0), columns (1)}) –
- skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
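sem cannot currently be parallelized. It will require collecting all data on a single node.
Notes
numpy has no direct counterpart of sem. The ddof argument is applied to the standard deviation before it is divided by the square root of N, so ddof=0 (instead of the default ddof=1) yields numpy.std(x) / sqrt(N).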
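As a worked illustration, the standard error of the mean is the standard deviation (with divisor N - ddof) divided by the square root of N. ``sem`` below is a hypothetical helper over a plain list, given as a sketch of the formula rather than the Beam implementation.

```python
import math


def sem(values, ddof=1):
    """Hypothetical helper: standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - ddof)
    return math.sqrt(variance) / math.sqrt(n)


print(sem([1.0, 2.0, 3.0]))          # default ddof=1
print(sem([1.0, 2.0, 3.0], ddof=0))  # population standard deviation in the numerator
```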
-
mad
(*args, **kwargs)¶ Return the mean absolute deviation of the values over the requested axis.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default None) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
mad
cannot currently be parallelized. It will require collecting all data on a single node.
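For reference, the mean absolute deviation is the average distance of each value from the mean. The sketch below uses a hypothetical helper ``mad`` over a plain list and assumes that NaN is the only missing-value marker.

```python
import math


def mad(values, skipna=True):
    """Hypothetical helper: mean absolute deviation around the mean."""
    if skipna:
        values = [v for v in values if not math.isnan(v)]
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


print(mad([1.0, 2.0, 3.0, 4.0]))
print(mad([1.0, math.nan, 3.0]))
```

The deviations can only be computed once the mean is known, so the data is traversed twice.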
-
skew
(*args, **kwargs)¶ Return unbiased skew over requested axis.
Normalized by N-1.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
skew
cannot currently be parallelized. It will require collecting all data on a single node.
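The source does not spell out the formula behind "unbiased skew". The sketch below assumes the adjusted Fisher-Pearson coefficient, which rescales the biased moment coefficient by sqrt(N(N-1)) / (N-2). ``skew`` is a hypothetical helper over a plain list; it needs at least three values that are not all equal.

```python
import math


def skew(values):
    """Hypothetical helper: adjusted Fisher-Pearson skewness (assumed formula)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5                            # biased coefficient
    return math.sqrt(n * (n - 1)) / (n - 2) * g1   # small-sample correction


print(skew([1.0, 2.0, 3.0, 4.0, 5.0]))  # symmetric data
print(skew([1.0, 2.0, 3.0, 10.0]))      # long right tail
```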
-
kurt
(*args, **kwargs)¶ Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
kurt
cannot currently be parallelized. It will require collecting all data on a single node.
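Fisher's definition subtracts 3 so that a normal distribution has kurtosis 0. The sketch below assumes the usual small-sample corrected estimator of excess kurtosis; the exact formula is not stated in the source, so treat it as an assumption. ``kurt`` is a hypothetical helper over a plain list and needs at least four values.

```python
import math


def kurt(values):
    """Hypothetical helper: bias-corrected excess kurtosis (assumed formula)."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values)  # sum of squared deviations
    s4 = sum((v - mean) ** 4 for v in values)  # sum of fourth-power deviations
    numerator = n * (n + 1) * (n - 1) * s4
    denominator = (n - 2) * (n - 3) * s2 ** 2
    adjustment = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return numerator / denominator - adjustment


# Evenly spaced data is flatter than a normal distribution, so the result is negative.
print(kurt([1.0, 2.0, 3.0, 4.0, 5.0]))
```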
-
kurtosis
(*args, **kwargs)¶ Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
Parameters: - axis ({index (0), columns (1)}) – Axis for the function to be applied on.
- skipna (bool, default True) – Exclude NA/null values when computing the result.
- level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DeferredSeries.
- numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for DeferredSeries.
- **kwargs – Additional keyword arguments to be passed to the function.
Returns: Return type: DeferredSeries or DeferredDataFrame (if level specified)
Differences from pandas
kurtosis
cannot currently be parallelized. It will require collecting all data on a single node.
-
take
(**kwargs)¶ pandas.DataFrame.take()
is not yet supported in the Beam DataFrame API because it is deprecated in pandas.
-
to_records
(**kwargs)¶ pandas.DataFrame.to_records()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
to_dict
(**kwargs)¶ pandas.DataFrame.to_dict()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
to_numpy
(**kwargs)¶ pandas.DataFrame.to_numpy()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
to_string
(**kwargs)¶ pandas.DataFrame.to_string()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
to_sparse
(**kwargs)¶ pandas.DataFrame.to_sparse()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
transpose
(**kwargs)¶ pandas.DataFrame.transpose()
is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data. For more information see https://s.apache.org/dataframe-non-deferred-columns.
-
T
¶ pandas.DataFrame.T
is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data. For more information see https://s.apache.org/dataframe-non-deferred-columns.
-
unstack
(*args, **kwargs)[source]¶ Pivot a level of the (necessarily hierarchical) index labels.
Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels.
If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex).
Parameters: Returns: Return type:
Differences from pandas
unstack cannot be used on DeferredDataFrame instances with multiple index levels, because the columns in the output depend on the data.
See also
DeferredDataFrame.pivot()
- Pivot a table based on column values.
DeferredDataFrame.stack()
- Pivot a level of the column labels (inverse operation from unstack).
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
...                                    ('two', 'a'), ('two', 'b')])
>>> s = pd.Series(np.arange(1.0, 5.0), index=index)
>>> s
one  a   1.0
     b   2.0
two  a   3.0
     b   4.0
dtype: float64

>>> s.unstack(level=-1)
     a   b
one  1.0  2.0
two  3.0  4.0

>>> s.unstack(level=0)
   one  two
a  1.0   3.0
b  2.0   4.0

>>> df = s.unstack(level=0)
>>> df.unstack()
one  a  1.0
     b  2.0
two  a  3.0
     b  4.0
dtype: float64
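A plain-Python sketch of the pivot makes it clear why the output columns depend on the data: the column labels are whatever values happen to appear in the pivoted index level. ``unstack`` below is a hypothetical helper for two-level keys stored as tuples, using the same data as the example above.

```python
series = {
    ("one", "a"): 1.0,
    ("one", "b"): 2.0,
    ("two", "a"): 3.0,
    ("two", "b"): 4.0,
}


def unstack(data, level=-1):
    """Hypothetical helper: pivot one level of a two-level key into columns."""
    level %= 2  # normalise -1 to 1
    table = {}
    for key, value in data.items():
        row_label, column_label = key[1 - level], key[level]
        table.setdefault(row_label, {})[column_label] = value
    return table


inner = unstack(series)           # rows 'one'/'two', columns 'a'/'b'
outer = unstack(series, level=0)  # rows 'a'/'b', columns 'one'/'two'
# The column set is only known after scanning every key.
columns = sorted({col for row in inner.values() for col in row})
print(inner, outer, columns)
```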
-
update
(**kwargs)¶ Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.
Parameters: - other (DeferredDataFrame, or object coercible into a DeferredDataFrame) – Should have at least one matching index/column label with the original DeferredDataFrame. If a DeferredSeries is passed, its name attribute must be set, and that will be used as the column name to align with the original DeferredDataFrame.
- join ({'left'}, default 'left') – Only left join is implemented, keeping the index and columns of the original object.
- overwrite (bool, default True) –
How to handle non-NA values for overlapping keys:
- True: overwrite original DeferredDataFrame’s values with values from other.
- False: only update values that are NA in the original DeferredDataFrame.
- filter_func (callable(1d-array) -> bool 1d-array, optional) – Can choose to replace values other than NA. Return True for values that should be updated.
- errors ({'raise', 'ignore'}, default 'ignore') – If ‘raise’, will raise a ValueError if the DeferredDataFrame and other both contain non-NA data in the same place.
Returns: None
Return type: method directly changes calling object
Raises:
- ValueError – When errors='raise' and there’s overlapping non-NA data, or when errors is not either 'ignore' or 'raise'.
- NotImplementedError – If join != 'left'.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
dict.update()
- Similar method for dictionaries.
DeferredDataFrame.merge()
- For column(s)-on-column(s) operations.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, 5, 6],
...                        'C': [7, 8, 9]})
>>> df.update(new_df)
>>> df
   A  B
0  1  4
1  2  5
2  3  6

The DataFrame's length does not increase as a result of the update, only values at matching index/column labels are updated.

>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']})
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  e
2  c  f

For Series, its name attribute must be set.

>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_column = pd.Series(['d', 'e'], name='B', index=[0, 2])
>>> df.update(new_column)
>>> df
   A  B
0  a  d
1  b  y
2  c  e

>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e']}, index=[1, 2])
>>> df.update(new_df)
>>> df
   A  B
0  a  x
1  b  d
2  c  e

If `other` contains NaNs the corresponding values are not updated in the original dataframe.

>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, np.nan, 6]})
>>> df.update(new_df)
>>> df
   A      B
0  1    4.0
1  2  500.0
2  3    6.0
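The rules in these examples (left join only, NA values never overwrite, overwrite=False fills only missing cells) can be summarised in a short sketch over nested dicts of the form {column: {index: value}}. ``update`` and ``is_na`` are hypothetical helpers written for illustration; they are not the Beam or pandas implementation.

```python
import math


def is_na(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def update(original, other, overwrite=True):
    """Hypothetical helper: in-place update of {column: {index: value}}."""
    for column, new_values in other.items():
        if column not in original:
            continue  # left join: columns that exist only in `other` are ignored
        for index, value in new_values.items():
            if index not in original[column] or is_na(value):
                continue  # no new rows are added, and NA never overwrites
            if overwrite or is_na(original[column][index]):
                original[column][index] = value


df = {"A": {0: 1, 1: 2, 2: 3}, "B": {0: 400, 1: 500, 2: 600}}
update(df, {"B": {0: 4, 1: math.nan, 2: 6, 3: 8}, "C": {0: 7}})
print(df)

partial = {"B": {0: None, 1: 5}}
update(partial, {"B": {0: 1, 1: 9}}, overwrite=False)
print(partial)
```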
-
values
¶ pandas.DataFrame.values
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
style
¶ pandas.DataFrame.style
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
melt
(ignore_index, **kwargs)[source]¶ Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
Parameters: - id_vars (tuple, list, or ndarray, optional) – Column(s) to use as identifier variables.
- value_vars (tuple, list, or ndarray, optional) – Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
- var_name (scalar) – Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
- value_name (scalar, default 'value') – Name to use for the ‘value’ column.
- col_level (int or str, optional) – If columns are a MultiIndex then use this level to melt.
- ignore_index (bool, default True) –
If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.
New in version 1.1.0.
Returns: Unpivoted DeferredDataFrame.
Return type: DeferredDataFrame
Differences from pandas
ignore_index=True is not supported, because it requires generating an order-sensitive index.
See also
melt()
- Identical method.
pivot_table()
- Create a spreadsheet-style pivot table as a DeferredDataFrame.
DeferredDataFrame.pivot()
- Return reshaped DeferredDataFrame organized by given index / column values.
DeferredDataFrame.explode()
- Explode a DeferredDataFrame from list-like columns to long format.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'}, ... 'B': {0: 1, 1: 3, 2: 5}, ... 'C': {0: 2, 1: 4, 2: 6}}) >>> df A B C 0 a 1 2 1 b 3 4 2 c 5 6 >>> df.melt(id_vars=['A'], value_vars=['B']) A variable value 0 a B 1 1 b B 3 2 c B 5 >>> df.melt(id_vars=['A'], value_vars=['B', 'C']) A variable value 0 a B 1 1 b B 3 2 c B 5 3 a C 2 4 b C 4 5 c C 6 The names of 'variable' and 'value' columns can be customized: >>> df.melt(id_vars=['A'], value_vars=['B'], ... var_name='myVarname', value_name='myValname') A myVarname myValname 0 a B 1 1 b B 3 2 c B 5 Original index values can be kept around: >>> df.melt(id_vars=['A'], value_vars=['B', 'C'], ignore_index=False) A variable value 0 a B 1 1 b B 3 2 c B 5 0 a C 2 1 b C 4 2 c C 6 If you have multi-index columns: >>> df.columns = [list('ABC'), list('DEF')] >>> df A B C D E F 0 a 1 2 1 b 3 4 2 c 5 6 >>> df.melt(col_level=0, id_vars=['A'], value_vars=['B']) A variable value 0 a B 1 1 b B 3 2 c B 5 >>> df.melt(id_vars=[('A', 'D')], value_vars=[('B', 'E')]) (A, D) variable_0 variable_1 value 0 a B E 1 1 b B E 3 2 c B E 5
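Since ignore_index=True is not supported, a call that stays within the supported arguments passes ignore_index=False explicitly. The following is a minimal sketch in plain pandas, not in Beam itself. It assumes, without verifying it here, that the same arguments would be passed to DeferredDataFrame.melt. The frame and its column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c"], "B": [1, 3, 5], "C": [2, 4, 6]})

# ignore_index=False keeps the original row labels, so no new
# order-dependent 0..n-1 index has to be generated.
long_df = df.melt(id_vars=["A"], value_vars=["B", "C"], ignore_index=False)

# Each original label appears once per melted column.
print(sorted(long_df.index.tolist()))
print(sorted(long_df["variable"].unique()))
```

The original labels repeat once per value column, which is why the result can be produced without knowing the global row order.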
-
value_counts
(subset=None, sort=False, normalize=False, ascending=False, dropna=True)[source]¶ Return a Series containing counts of unique rows in the DataFrame.
New in version 1.1.0.
Parameters: - subset (list-like, optional) – Columns to use when counting unique combinations.
- normalize (bool, default False) – Return proportions rather than frequencies.
- sort (bool, default False) – Sort by frequencies. pandas defaults to True; the Beam signature defaults to False.
- ascending (bool, default False) – Sort in ascending order.
- dropna (bool, default True) –
Don’t include counts of rows that contain NA values.
New in version 1.3.0.
Returns:
Return type:
Differences from pandas
sort is False by default, and sort=True is not supported because it imposes an ordering on the dataset which likely will not be preserved.
See also
DeferredSeries.value_counts()
- Equivalent method on DeferredSeries.
Notes
The returned DeferredSeries will have a MultiIndex with one level per input column. By default, rows that contain any NA values are omitted from the result. By default, the resulting DeferredSeries will be in descending order so that the first element is the most frequently-occurring row.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'num_legs': [2, 4, 4, 6], ... 'num_wings': [2, 0, 0, 0]}, ... index=['falcon', 'dog', 'cat', 'ant']) >>> df num_legs num_wings falcon 2 2 dog 4 0 cat 4 0 ant 6 0 >>> df.value_counts() num_legs num_wings 4 0 2 2 2 1 6 0 1 dtype: int64 >>> df.value_counts(sort=False) num_legs num_wings 2 2 1 4 0 2 6 0 1 dtype: int64 >>> df.value_counts(ascending=True) num_legs num_wings 2 2 1 6 0 1 4 0 2 dtype: int64 >>> df.value_counts(normalize=True) num_legs num_wings 4 0 0.50 2 2 0.25 6 0 0.25 dtype: float64 With `dropna` set to `False` we can also count rows with NA values. >>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'], ... 'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']}) >>> df first_name middle_name 0 John Smith 1 Anne <NA> 2 John <NA> 3 Beth Louise >>> df.value_counts() first_name middle_name Beth Louise 1 John Smith 1 dtype: int64 >>> df.value_counts(dropna=False) first_name middle_name Anne NaN 1 Beth Louise 1 John Smith 1 NaN 1 dtype: int64
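With sort=False the order of the result rows should not be relied on. The sketch below uses plain pandas to show one way of consuming the counts without depending on order; the variable name counts_by_row is illustrative. It assumes the Beam result, once computed, holds the same counts.

```python
import pandas as pd

df = pd.DataFrame(
    {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
    index=["falcon", "dog", "cat", "ant"],
)

# sort=False matches the Beam default; the row order is not meaningful.
counts = df.value_counts(sort=False)

# Key the counts by (num_legs, num_wings) so that later checks do not
# depend on the order of the rows.
counts_by_row = counts.to_dict()
print(counts_by_row)
```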
-
compare
(other, align_axis, keep_shape, **kwargs)[source]¶ Compare to another DataFrame and show the differences.
New in version 1.1.0.
Parameters: - other (DeferredDataFrame) – Object to compare with.
- align_axis ({0 or 'index', 1 or 'columns'}, default 1) –
Determine which axis to align the comparison on.
- 0, or ‘index’ : Resulting differences are stacked vertically with rows drawn alternately from self and other.
- 1, or ‘columns’ : Resulting differences are aligned horizontally with columns drawn alternately from self and other.
- keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.
- keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.
Returns: DeferredDataFrame that shows the differences stacked side by side.
The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.
Return type:
Raises: ValueError – When the two DeferredDataFrames don’t have identical labels or shape.
Differences from pandas
The default values align_axis=1 and keep_shape=False are not supported, because the output columns depend on the data. To use align_axis=1, please specify keep_shape=True.
See also
DeferredSeries.compare()
- Compare with another DeferredSeries and show differences.
DeferredDataFrame.equals()
- Test whether two objects contain the same elements.
Notes
Matching NaNs will not appear as a difference.
Can only compare identically-labeled (i.e. same shape, identical row and column labels) DeferredDataFrames.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame( ... { ... "col1": ["a", "a", "b", "b", "a"], ... "col2": [1.0, 2.0, 3.0, np.nan, 5.0], ... "col3": [1.0, 2.0, 3.0, 4.0, 5.0] ... }, ... columns=["col1", "col2", "col3"], ... ) >>> df col1 col2 col3 0 a 1.0 1.0 1 a 2.0 2.0 2 b 3.0 3.0 3 b NaN 4.0 4 a 5.0 5.0 >>> df2 = df.copy() >>> df2.loc[0, 'col1'] = 'c' >>> df2.loc[2, 'col3'] = 4.0 >>> df2 col1 col2 col3 0 c 1.0 1.0 1 a 2.0 2.0 2 b 3.0 4.0 3 b NaN 4.0 4 a 5.0 5.0 Align the differences on columns >>> df.compare(df2) col1 col3 self other self other 0 a c NaN NaN 2 NaN NaN 3.0 4.0 Stack the differences on rows >>> df.compare(df2, align_axis=0) col1 col3 0 self a NaN other c NaN 2 self NaN 3.0 other NaN 4.0 Keep the equal values >>> df.compare(df2, keep_equal=True) col1 col3 self other self other 0 a c 1.0 1.0 2 b b 3.0 4.0 Keep all original rows and columns >>> df.compare(df2, keep_shape=True) col1 col2 col3 self other self other self other 0 a c NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN 3.0 4.0 3 NaN NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN NaN Keep all original rows and columns and also all original values >>> df.compare(df2, keep_shape=True, keep_equal=True) col1 col2 col3 self other self other self other 0 a c 1.0 1.0 1.0 1.0 1 a a 2.0 2.0 2.0 2.0 2 b b 3.0 3.0 3.0 4.0 3 b b NaN NaN 4.0 4.0 4 a a 5.0 5.0 5.0 5.0
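The restriction on align_axis=1 is easier to see with keep_shape=True spelled out. This is a minimal pandas sketch, not Beam code, and the small frames are illustrative. With keep_shape=True the output has a self/other pair for every input column, whether or not that column contains a difference.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "b"], "col3": [1.0, 2.0, 3.0]})
df2 = df.copy()
df2.loc[0, "col1"] = "c"  # introduce a single difference

# keep_shape=True keeps every row and column, so the output columns are
# known from the input columns alone and do not depend on the data.
diff = df.compare(df2, align_axis=1, keep_shape=True)

print(list(diff.columns))
```

Without keep_shape=True, col3 would be dropped from the result because it has no differences, which is the data dependence that the deferred API cannot resolve up front.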
-
idxmin
(**kwargs)[source]¶ Return index of first occurrence of minimum over requested axis.
NA/null values are excluded.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: Indexes of minima along the specified axis.
Return type:
Raises: ValueError – If the row/column is empty.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.idxmin()
- Return index of the minimum element.
Notes
This method is the DeferredDataFrame version of ndarray.argmin.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                    index=['Pork', 'Wheat Products', 'Beef'])
>>> df
                consumption  co2_emissions
Pork                  10.51          37.20
Wheat Products       103.11          19.66
Beef                  55.48        1712.00
By default, it returns the index for the minimum value in each column.
>>> df.idxmin()
consumption                Pork
co2_emissions    Wheat Products
dtype: object
To return the index for the minimum value in each row, use ``axis="columns"``.
>>> df.idxmin(axis="columns")
Pork                consumption
Wheat Products    co2_emissions
Beef                consumption
dtype: object
-
idxmax
(**kwargs)[source]¶ Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: Indexes of maxima along the specified axis.
Return type:
Raises: ValueError – If the row/column is empty.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.idxmax()
- Return index of the maximum element.
Notes
This method is the DeferredDataFrame version of ndarray.argmax.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                    index=['Pork', 'Wheat Products', 'Beef'])
>>> df
                consumption  co2_emissions
Pork                  10.51          37.20
Wheat Products       103.11          19.66
Beef                  55.48        1712.00
By default, it returns the index for the maximum value in each column.
>>> df.idxmax()
consumption      Wheat Products
co2_emissions              Beef
dtype: object
To return the index for the maximum value in each row, use ``axis="columns"``.
>>> df.idxmax(axis="columns")
Pork              co2_emissions
Wheat Products      consumption
Beef              co2_emissions
dtype: object
-
abs
(**kwargs)¶ Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
Returns: DeferredSeries/DeferredDataFrame containing the absolute value of each element.
Return type: abs
Differences from pandas
This operation has no known divergences from the pandas API.
See also
numpy.absolute()
- Calculate the absolute value element-wise.
Notes
For complex inputs of the form a + bj, such as 1.2 + 1j, the absolute value is \(\sqrt{a^2 + b^2}\).
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Absolute numeric values in a Series. >>> s = pd.Series([-1.10, 2, -3.33, 4]) >>> s.abs() 0 1.10 1 2.00 2 3.33 3 4.00 dtype: float64 Absolute numeric values in a Series with complex numbers. >>> s = pd.Series([1.2 + 1j]) >>> s.abs() 0 1.56205 dtype: float64 Absolute numeric values in a Series with a Timedelta element. >>> s = pd.Series([pd.Timedelta('1 days')]) >>> s.abs() 0 1 days dtype: timedelta64[ns] Select rows with data closest to certain value using argsort (from `StackOverflow <https://stackoverflow.com/a/17758115>`__). >>> df = pd.DataFrame({ ... 'a': [4, 5, 6, 7], ... 'b': [10, 20, 30, 40], ... 'c': [100, 50, -30, -50] ... }) >>> df a b c 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 >>> df.loc[(df.c - 43).abs().argsort()] a b c 1 5 20 50 0 4 10 100 2 6 30 -30 3 7 40 -50
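The complex-number note can be checked by hand. The sketch below, in plain Python and pandas with illustrative variable names, computes the magnitude of 1.2 + 1j from the formula and compares it with what Series.abs returns.

```python
import math

import pandas as pd

z = 1.2 + 1j  # a = 1.2, b = 1.0

# Magnitude computed directly from sqrt(a^2 + b^2).
by_formula = math.sqrt(z.real ** 2 + z.imag ** 2)

# pandas applies the same definition element-wise.
by_pandas = float(pd.Series([z]).abs().iloc[0])

print(round(by_formula, 5), round(by_pandas, 5))
```

Both values round to 1.56205, the figure shown in the example above.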
-
add
(**kwargs)¶ Get Addition of dataframe and other, element-wise (binary operator add).
Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, radd.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360 Add a scalar with operator version which return the same results. >>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361 >>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361 Divide by constant with reverse version. >>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0 >>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778 Subtract a list and Series by axis with operator version. >>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358 >>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358 >>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359 Multiply a DataFrame of different shape with operator version. >>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4 >>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN >>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0 Divide by a MultiIndex by level. >>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720 >>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
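The effect of fill_value is clearest when one operand lacks a column. The following pandas sketch leaves level at its default of None, the only value the Differences section lists as supported. The frames are illustrative, and the snippet runs plain pandas rather than Beam.

```python
import pandas as pd

labels = ["circle", "triangle", "rectangle"]
df = pd.DataFrame({"angles": [0, 3, 4], "degrees": [360, 180, 360]}, index=labels)
other = pd.DataFrame({"angles": [1, 1, 1]}, index=labels)

# The operator form has no fill_value: 'degrees' is missing in `other`,
# so every result in that column is NaN.
plain = df + other

# With fill_value=0 the missing 'degrees' entries are treated as 0
# before the addition, so the original values survive.
filled = df.add(other, fill_value=0)

print(filled)
```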
-
apply
(**kwargs)¶ pandas.DataFrame.apply() is not implemented yet in the Beam DataFrame API.
If support for ‘apply’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
asfreq
(**kwargs)¶ pandas.DataFrame.asfreq() is not implemented yet in the Beam DataFrame API.
If support for ‘asfreq’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
astype
(dtype, copy, errors)¶ Cast a pandas object to a specified dtype dtype.
Parameters: - dtype (data type, or dict of column name -> data type) – Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DeferredDataFrame’s columns to column-specific types.
- copy (bool, default True) – Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
- errors ({'raise', 'ignore'}, default 'raise') –
Control raising of exceptions on invalid data for provided dtype.
- raise : allow exceptions to be raised
- ignore : suppress exceptions. On error return original object.
Returns: casted
Return type: same type as caller
Differences from pandas
astype is not parallelizable when errors="ignore" is specified.
copy=False is not supported because it relies on memory-sharing semantics.
dtype="category" is not supported because the type of the output column depends on the data. Please use pd.CategoricalDtype with explicit categories instead.
See also
to_datetime()
- Convert argument to datetime.
to_timedelta()
- Convert argument to timedelta.
to_numeric()
- Convert argument to a numeric type.
numpy.ndarray.astype()
- Cast a numpy array to a specified type.
Notes
Deprecated since version 1.3.0: Using astype to convert from timezone-naive dtype to timezone-aware dtype is deprecated and will raise in a future version. Use DeferredSeries.dt.tz_localize() instead.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
Create a DataFrame: >>> d = {'col1': [1, 2], 'col2': [3, 4]} >>> df = pd.DataFrame(data=d) >>> df.dtypes col1 int64 col2 int64 dtype: object Cast all columns to int32: >>> df.astype('int32').dtypes col1 int32 col2 int32 dtype: object Cast col1 to int32 using a dictionary: >>> df.astype({'col1': 'int32'}).dtypes col1 int32 col2 int64 dtype: object Create a series: >>> ser = pd.Series([1, 2], dtype='int32') >>> ser 0 1 1 2 dtype: int32 >>> ser.astype('int64') 0 1 1 2 dtype: int64 Convert to categorical type: >>> ser.astype('category') 0 1 1 2 dtype: category Categories (2, int64): [1, 2] Convert to ordered categorical type with custom ordering: >>> from pandas.api.types import CategoricalDtype >>> cat_dtype = CategoricalDtype( ... categories=[2, 1], ordered=True) >>> ser.astype(cat_dtype) 0 1 1 2 dtype: category Categories (2, int64): [2 < 1] Note that using ``copy=False`` and changing data on a new pandas object may propagate changes: >>> s1 = pd.Series([1, 2]) >>> s2 = s1.astype('int64', copy=False) >>> s2[0] = 10 >>> s1 # note that s1[0] has changed too 0 10 1 2 dtype: int64 Create a series of dates: >>> ser_date = pd.Series(pd.date_range('20200101', periods=3)) >>> ser_date 0 2020-01-01 1 2020-01-02 2 2020-01-03 dtype: datetime64[ns]
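Because dtype="category" infers its categories from the data, the supported route is a CategoricalDtype whose categories are listed up front. Here is a minimal pandas sketch with hypothetical category names; it assumes the resulting dtype object would be passed to the deferred astype in the same way.

```python
import pandas as pd

ser = pd.Series(["low", "high", "low"])

# The categories are declared explicitly, so the output type is fully
# known before any data is read ("medium" is allowed even though it
# never occurs in this series).
level_dtype = pd.CategoricalDtype(
    categories=["low", "medium", "high"], ordered=True
)
levels = ser.astype(level_dtype)

print(list(levels.cat.categories))
print(levels.cat.codes.tolist())
```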
-
at
¶ pandas.DataFrame.at() is not implemented yet in the Beam DataFrame API.
If support for ‘at’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
at_time
(**kwargs)¶ Select values at particular time of day (e.g., 9:30AM).
Parameters: - time (datetime.time or str) –
- axis ({0 or 'index', 1 or 'columns'}, default 0) –
Returns:
Return type:
Raises: TypeError – If the index is not a DatetimeIndex.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
between_time()
- Select values between particular times of the day.
first()
- Select initial periods of time series based on a date offset.
last()
- Select final periods of time series based on a date offset.
DatetimeIndex.indexer_at_time()
- Get just the index locations for values at particular time of the day.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> i = pd.date_range('2018-04-09', periods=4, freq='12H')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-09 12:00:00  2
2018-04-10 00:00:00  3
2018-04-10 12:00:00  4
>>> ts.at_time('12:00')
                     A
2018-04-09 12:00:00  2
2018-04-10 12:00:00  4
-
attrs
¶ pandas.DataFrame.attrs()
is not yet supported in the Beam DataFrame API because it is experimental in pandas.
-
backfill
(*args, **kwargs)¶ Synonym for DataFrame.fillna() with method='bfill'.
Returns: Object with missing values filled or None if inplace=True.
Return type: DeferredSeries/DeferredDataFrame or None
Differences from pandas
backfill is only supported for axis="columns". axis="index" is order-sensitive.
-
between_time
(**kwargs)¶ Select values between particular times of the day (e.g., 9:00-9:30 AM).
By setting start_time to be later than end_time, you can get the times that are not between the two times.
Parameters: - start_time (datetime.time or str) – Initial time as a time filter limit.
- end_time (datetime.time or str) – End time as a time filter limit.
- include_start (bool, default True) – Whether the start time needs to be included in the result.
- include_end (bool, default True) – Whether the end time needs to be included in the result.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Determine range time on index or columns value.
Returns: Data from the original object filtered to the specified dates range.
Return type:
Raises: TypeError – If the index is not a DatetimeIndex.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
at_time()
- Select values at a particular time of the day.
first()
- Select initial periods of time series based on a date offset.
last()
- Select final periods of time series based on a date offset.
DatetimeIndex.indexer_between_time()
- Get just the index locations for values between particular times of the day.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
2018-04-12 01:00:00  4
>>> ts.between_time('0:15', '0:45')
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
You get the times that are *not* between two times by setting ``start_time`` later than ``end_time``:
>>> ts.between_time('0:45', '0:15')
                     A
2018-04-09 00:00:00  1
2018-04-12 01:00:00  4
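The wrap-around behaviour can also be sketched with explicit timestamps, which avoids relying on a particular frequency string. This is plain pandas with illustrative data. Note that both endpoints are inclusive by default, so a row falling exactly on a boundary would appear in both selections.

```python
import pandas as pd

idx = pd.to_datetime(
    [
        "2018-04-09 00:00",
        "2018-04-10 00:20",
        "2018-04-11 00:40",
        "2018-04-12 01:00",
    ]
)
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=idx)

# start before end: keep rows whose time of day is within 00:15-00:45.
inside = ts.between_time("0:15", "0:45")

# start after end: the window wraps past midnight, which selects the
# rows outside 00:15-00:45 instead.
outside = ts.between_time("0:45", "0:15")

print(inside["A"].tolist(), outside["A"].tolist())
```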
-
bfill
(*args, **kwargs)¶ bfill is only supported for axis="columns". axis="index" is order-sensitive.
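A backward fill along axis="columns" only looks at other values in the same row, which is why it does not depend on row order. The sketch below runs plain pandas on an illustrative frame; the assumption is that the deferred bfill is called with the same axis argument.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [np.nan, 1.0], "b": [2.0, np.nan], "c": [3.0, 4.0]}
)

# Each missing value is replaced by the next non-missing value to its
# right in the same row; no other row is consulted.
filled = df.bfill(axis="columns")

print(filled.values.tolist())
```

Filling along axis="index" would instead pull values from the following rows, which requires a defined row order.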
-
bool
()¶ Return the bool of a single element Series or DataFrame.
This must be a boolean scalar value, either True or False. It will raise a ValueError if the Series or DataFrame does not have exactly 1 element, or that element is not boolean (integer values 0 and 1 will also raise an exception).
Returns: The value in the DeferredSeries or DeferredDataFrame.
Return type: bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.astype()
- Change the data type of a DeferredSeries, including to boolean.
DeferredDataFrame.astype()
- Change the data type of a DeferredDataFrame, including to boolean.
numpy.bool_()
- NumPy boolean data type, used by pandas for boolean values.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
The method will only work for single element objects with a boolean value:
>>> pd.Series([True]).bool()
True
>>> pd.Series([False]).bool()
False
>>> pd.DataFrame({'col': [True]}).bool()
True
>>> pd.DataFrame({'col': [False]}).bool()
False
-
boxplot
(**kwargs)¶ pandas.DataFrame.boxplot() is not implemented yet in the Beam DataFrame API.
If support for ‘boxplot’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
combine
(**kwargs)¶ Perform column-wise combine with another DataFrame.
Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.
Parameters: - other (DeferredDataFrame) – The DeferredDataFrame to merge column-wise.
- func (function) – Function that takes two series as inputs and returns a DeferredSeries or a scalar. Used to merge the two dataframes column by column.
- fill_value (scalar value, default None) – The value to fill NaNs with prior to passing any column to the merge func.
- overwrite (bool, default True) – If True, columns in self that do not exist in other will be overwritten with NaNs.
Returns: Combination of the provided DeferredDataFrames.
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.combine_first()
- Combine two DeferredDataFrame objects and default to non-null values in frame calling the method.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Combine using a simple function that chooses the smaller column. >>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2 >>> df1.combine(df2, take_smaller) A B 0 0 3 1 0 3 Example using a true element-wise combine function. >>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> df1.combine(df2, np.minimum) A B 0 1 2 1 0 3 Using `fill_value` fills Nones prior to passing the column to the merge function. >>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> df1.combine(df2, take_smaller, fill_value=-5) A B 0 0 -5.0 1 0 4.0 However, if the same element in both dataframes is None, that None is preserved >>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]}) >>> df1.combine(df2, take_smaller, fill_value=-5) A B 0 0 -5.0 1 0 3.0 Example that demonstrates the use of `overwrite` and behavior when the axis differ between the dataframes. >>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]}) >>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2]) >>> df1.combine(df2, take_smaller) A B C 0 NaN NaN NaN 1 NaN 3.0 -10.0 2 NaN 3.0 1.0 >>> df1.combine(df2, take_smaller, overwrite=False) A B C 0 0.0 NaN NaN 1 0.0 3.0 -10.0 2 NaN 3.0 1.0 Demonstrating the preference of the passed in dataframe. >>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2]) >>> df2.combine(df1, take_smaller) A B C 0 0.0 NaN NaN 1 0.0 3.0 NaN 2 NaN 3.0 NaN >>> df2.combine(df1, take_smaller, overwrite=False) A B C 0 0.0 NaN NaN 1 0.0 3.0 1.0 2 NaN 3.0 1.0
-
combine_first
(**kwargs)¶ Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.
Parameters: other (DeferredDataFrame) – Provided DeferredDataFrame to use to fill null values.
Returns: The result of combining the provided DeferredDataFrame with the other object.
Return type: DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.combine()
- Perform series-wise operation on two DeferredDataFrames using a given function.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0
Null values still persist if the location of that null value does not exist in `other`
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
-
convert_dtypes
(**kwargs)¶ pandas.DataFrame.convert_dtypes() is not implemented yet in the Beam DataFrame API.
If support for ‘convert_dtypes’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
copy
(**kwargs)¶ Make a copy of this object’s indices and data.
When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).
When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).
Parameters: deep (bool, default True) – Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.
Returns: copy – Object type matches caller.
Return type: DeferredSeries or DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
Notes
When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).
While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series([1, 2], index=["a", "b"]) >>> s a 1 b 2 dtype: int64 >>> s_copy = s.copy() >>> s_copy a 1 b 2 dtype: int64 **Shallow copy versus default (deep) copy:** >>> s = pd.Series([1, 2], index=["a", "b"]) >>> deep = s.copy() >>> shallow = s.copy(deep=False) Shallow copy shares data and index with original. >>> s is shallow False >>> s.values is shallow.values and s.index is shallow.index True Deep copy has own copy of data and index. >>> s is deep False >>> s.values is deep.values or s.index is deep.index False Updates to the data shared by shallow copy and original is reflected in both; deep copy remains unchanged. >>> s[0] = 3 >>> shallow[1] = 4 >>> s a 3 b 4 dtype: int64 >>> shallow a 3 b 4 dtype: int64 >>> deep a 1 b 2 dtype: int64 Note that when copying an object containing Python objects, a deep copy will copy the data, but will not do so recursively. Updating a nested data object will be reflected in the deep copy. >>> s = pd.Series([[1, 2], [3, 4]]) >>> deep = s.copy() >>> s[0][0] = 10 >>> s 0 [10, 2] 1 [3, 4] dtype: object >>> deep 0 [10, 2] 1 [3, 4] dtype: object
-
div
(**kwargs)¶ Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360 Add a scalar with operator version which return the same results. >>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361 >>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361 Divide by constant with reverse version. >>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0 >>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778 Subtract a list and Series by axis with operator version. >>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358 >>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358 >>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359 Multiply a DataFrame of different shape with operator version. >>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4 >>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN >>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0 Divide by a MultiIndex by level. >>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720 >>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
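The interaction between index alignment and fill_value described above can be sketched in plain pandas, whose semantics the deferred API aims to mirror. This is a minimal illustration that runs eagerly rather than inside a Beam pipeline; the names `a`, `b`, `plain` and `filled` are made up for the example.

```python
import numpy as np
import pandas as pd

# Two Series whose indices only partly overlap (illustrative data).
a = pd.Series([1.0, 2.0, np.nan], index=["x", "y", "z"])
b = pd.Series([2.0, 4.0], index=["y", "w"])

# Operator form: the indices are unioned, and any label that is
# missing on either side produces NaN.
plain = a / b

# Method form: a value missing on one side is replaced by fill_value
# before dividing. Label "z" is missing on both sides, so it stays NaN.
filled = a.div(b, fill_value=1.0)

print(plain)
print(filled)
```

Only label "y" is present in both inputs, so it is the only non-missing entry in `plain`, while `filled` also has values for "w" and "x".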
-
divide
(**kwargs)¶ Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360 Add a scalar with operator version which return the same results. >>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361 >>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361 Divide by constant with reverse version. >>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0 >>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778 Subtract a list and Series by axis with operator version. >>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358 >>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358 >>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359 Multiply a DataFrame of different shape with operator version. >>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4 >>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN >>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0 Divide by a MultiIndex by level. >>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720 >>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
-
drop
(labels, axis, index, columns, errors, **kwargs)¶ Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the pandas user guide section on MultiIndex (advanced.shown_levels) for more information about the now unused levels.
Parameters: - labels (single label or list-like) – Index or column labels to drop.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
- index (single label or list-like) – Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
- columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
- level (int or level name, optional) – For MultiIndex, level from which the labels will be removed.
- inplace (bool, default False) – If False, return a copy. Otherwise, do operation inplace and return None.
- errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.
Returns: DeferredDataFrame without the removed index or column labels, or None if inplace=True.
Return type:
Raises: KeyError – If any of the labels is not found in the selected axis.
Differences from pandas
drop is not parallelizable when dropping from the index and errors="raise" is specified. It requires collecting all data on a single node in order to detect if one of the index values is missing.
See also
DeferredDataFrame.loc()
- Label-location based indexer for selection by label.
DeferredDataFrame.dropna()
- Return DeferredDataFrame with labels on given axis omitted where (all or any) data are missing.
DeferredDataFrame.drop_duplicates()
- Return DeferredDataFrame with duplicate rows removed, optionally only considering certain columns.
DeferredSeries.drop()
- Return DeferredSeries with specified index labels removed.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4), ... columns=['A', 'B', 'C', 'D']) >>> df A B C D 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 Drop columns >>> df.drop(['B', 'C'], axis=1) A D 0 0 3 1 4 7 2 8 11 >>> df.drop(columns=['B', 'C']) A D 0 0 3 1 4 7 2 8 11 Drop a row by index >>> df.drop([0, 1]) A B C D 2 8 9 10 11 Drop columns and/or rows of MultiIndex DataFrame >>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'], ... ['speed', 'weight', 'length']], ... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], ... [0, 1, 2, 0, 1, 2, 0, 1, 2]]) >>> df = pd.DataFrame(index=midx, columns=['big', 'small'], ... data=[[45, 30], [200, 100], [1.5, 1], [30, 20], ... [250, 150], [1.5, 0.8], [320, 250], ... [1, 0.8], [0.3, 0.2]]) >>> df big small lama speed 45.0 30.0 weight 200.0 100.0 length 1.5 1.0 cow speed 30.0 20.0 weight 250.0 150.0 length 1.5 0.8 falcon speed 320.0 250.0 weight 1.0 0.8 length 0.3 0.2 >>> df.drop(index='cow', columns='small') big lama speed 45.0 weight 200.0 length 1.5 falcon speed 320.0 weight 1.0 length 0.3 >>> df.drop(index='length', level=1) big small lama speed 45.0 30.0 weight 200.0 100.0 cow speed 30.0 20.0 weight 250.0 150.0 falcon speed 320.0 250.0 weight 1.0 0.8
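The difference between the two errors modes, which matters for the parallelization note above, can be sketched in plain pandas. This is an eager illustration only; the frame and the label "missing" are hypothetical, and the snippet makes no claim about how Beam executes either mode.

```python
import pandas as pd

# Small frame with string row labels (illustrative data).
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["r0", "r1", "r2"])

# errors="ignore": labels that do not exist are skipped silently, so no
# global check for "is this label present anywhere?" is needed.
kept = df.drop(index=["r1", "missing"], errors="ignore")

# errors="raise" (the default): an absent label must be detected, which
# requires knowledge of the complete index.
try:
    df.drop(index=["missing"])
    raised = False
except KeyError:
    raised = True

print(kept)
print(raised)
```

When the labels are known to exist, or missing labels are acceptable, passing errors="ignore" avoids the condition that the note above describes as non-parallelizable.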
-
droplevel
(level, axis)¶ Return Series/DataFrame with requested index / column level(s) removed.
Parameters: - level (int, str, or list-like) – If a string is given, must be the name of a level. If list-like, elements must be names or positional indexes of levels.
- axis ({0 or 'index', 1 or 'columns'}, default 0) –
Axis along which the level(s) is removed:
- 0 or ‘index’: remove level(s) from the row index.
- 1 or ‘columns’: remove level(s) from the column index.
Returns: DeferredSeries/DeferredDataFrame with requested index / column level(s) removed.
Return type: DeferredSeries/DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame([
...     [1, 2, 3, 4],
...     [5, 6, 7, 8],
...     [9, 10, 11, 12]
... ]).set_index([0, 1]).rename_axis(['a', 'b'])

>>> df.columns = pd.MultiIndex.from_tuples([
...     ('c', 'e'), ('d', 'f')
... ], names=['level_1', 'level_2'])

>>> df
level_1   c   d
level_2   e   f
a b
1 2      3   4
5 6      7   8
9 10    11  12

>>> df.droplevel('a')
level_1   c   d
level_2   e   f
b
2        3   4
6        7   8
10      11  12

>>> df.droplevel('level_2', axis=1)
level_1   c   d
a b
1 2      3   4
5 6      7   8
9 10    11  12
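Since this entry documents the Series variant, a Series-based sketch may also help. It uses plain pandas, run eagerly, and the level names "outer" and "inner" are invented for the illustration.

```python
import pandas as pd

# A Series with a two-level index (illustrative data).
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["outer", "inner"]
)
s = pd.Series([10, 20, 30], index=idx)

# Removing the "outer" level leaves a flat index built from "inner".
# Values are untouched; duplicate labels may appear in the result.
flat = s.droplevel("outer")

print(flat)
```

Note that the resulting index is no longer unique here, because both groups contained the inner label 1.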
-
dtype
¶
-
empty
¶ Indicator whether DataFrame is empty.
True if DataFrame is entirely empty (no items), meaning any of the axes are of length 0.
Returns: If DeferredDataFrame is empty, return True, if not return False.
Return type: bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.dropna
- Return series without null values.
DeferredDataFrame.dropna
- Return DeferredDataFrame with labels on given axis omitted where (all or any) data are missing.
Notes
If DeferredDataFrame contains only NaNs, it is still not considered empty. See the example below.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
An example of an actual empty DataFrame. Notice the index is empty:

>>> df_empty = pd.DataFrame({'A' : []})
>>> df_empty
Empty DataFrame
Columns: [A]
Index: []
>>> df_empty.empty
True

If we only have NaNs in our DataFrame, it is not considered empty! We will need to drop the NaNs to make the DataFrame empty:

>>> df = pd.DataFrame({'A' : [np.nan]})
>>> df
    A
0 NaN
>>> df.empty
False
>>> df.dropna().empty
True
-
eq
(**kwargs)¶ Get Equal to of dataframe and other, element-wise (binary operator eq).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: Result of the comparison.
Return type: DeferredDataFrame of bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.eq()
- Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
- Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
- Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
- Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
- Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
- Compare DeferredDataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300 Comparison with a scalar, using either the operator or method: >>> df == 100 cost revenue A False True B False False C True False >>> df.eq(100) cost revenue A False True B False False C True False When `other` is a :class:`Series`, the columns of a DataFrame are aligned with the index of `other` and broadcast: >>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True Use the method to control the broadcast axis: >>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True When comparing to an arbitrary sequence, the number of columns must match the number elements in `other`: >>> df == [250, 100] cost revenue A True True B False False C False False Use the method to control the axis: >>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False Compare to a DataFrame of different shape. >>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150 >>> df.gt(other) cost revenue A False False B False False C False True D False False Compare to a MultiIndex by level. >>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225 >>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
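The two points from the Notes (index union and NaN never comparing equal) can be seen together in a small Series example. This is a plain-pandas sketch run eagerly; `left`, `right` and `aligned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Two Series with partly overlapping labels and some NaNs (illustrative).
left = pd.Series([1.0, 2.0, np.nan], index=["x", "y", "z"])
right = pd.Series([2.0, np.nan, 5.0], index=["y", "z", "w"])

# The method form aligns on the union of both indices before comparing.
# Labels present on only one side compare against NaN and give False.
# Label "z" is NaN on both sides and still gives False (NaN != NaN).
aligned = left.eq(right)

print(aligned)
```

Only label "y", which holds 2.0 on both sides, compares equal.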
-
equals
(other)¶ Test whether two objects contain the same elements.
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
The row/column index do not need to have the same type, as long as the values are considered equal. Corresponding columns must be of the same dtype.
Parameters: other (DeferredSeries or DeferredDataFrame) – The other DeferredSeries or DeferredDataFrame to be compared with the first.
Returns: True if all elements are the same in both objects, False otherwise.
Return type: bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredSeries.eq()
- Compare two DeferredSeries objects of the same length and return a DeferredSeries where each element is True if the element in each DeferredSeries is equal, False otherwise.
DeferredDataFrame.eq()
- Compare two DeferredDataFrame objects of the same shape and return a DeferredDataFrame where each element is True if the respective element in each DeferredDataFrame is equal, False otherwise.
testing.assert_series_equal()
- Raises an AssertionError if left and right are not equal. Provides an easy interface to ignore inequality in dtypes, indexes and precision among others.
testing.assert_frame_equal()
- Like assert_series_equal, but targets DeferredDataFrames.
numpy.array_equal()
- Return True if two arrays have the same shape and elements, False otherwise.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> df
    1   2
0  10  20

DataFrames df and exactly_equal have the same types and values for their elements and column labels, which will return True.

>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal
    1   2
0  10  20
>>> df.equals(exactly_equal)
True

DataFrames df and different_column_type have the same element types and values, but have different types for the column labels, which will still return True.

>>> different_column_type = pd.DataFrame({1.0: [10], 2.0: [20]})
>>> different_column_type
   1.0  2.0
0   10   20
>>> df.equals(different_column_type)
True

DataFrames df and different_data_type have different types for the same values for their elements, and will return False even though their column labels are the same values and types.

>>> different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})
>>> different_data_type
      1     2
0  10.0  20.0
>>> df.equals(different_data_type)
False
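For the Series case, the contrast between equals and an element-wise eq can be sketched as follows. This is plain pandas run eagerly, with illustrative variable names; it highlights the NaN handling and dtype sensitivity described above.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan])
b = pd.Series([1.0, np.nan])

# equals treats NaNs in the same position as equal ...
same = a.equals(b)

# ... whereas an element-wise comparison does not, so reducing
# eq() with all() gives a different answer.
elementwise_all = bool(a.eq(b).all())

# equals is also dtype-sensitive: equal values stored as int64 and
# float64 are not considered the same.
ints = pd.Series([1, 2])
floats = pd.Series([1.0, 2.0])
dtype_sensitive = ints.equals(floats)

print(same, elementwise_all, dtype_sensitive)
```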
-
ewm
(**kwargs)¶ pandas.Series.ewm()
is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics. For more information see https://s.apache.org/dataframe-event-time-semantics.
-
expanding
(**kwargs)¶ pandas.Series.expanding()
is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics. For more information see https://s.apache.org/dataframe-event-time-semantics.
-
ffill
(*args, **kwargs)¶ ffill is only supported for axis="columns". With axis="index" the operation is order-sensitive.
-
fillna
(value, method, axis, limit, **kwargs)¶ Fill NA/NaN values using the specified method.
Parameters: - value (scalar, dict, DeferredSeries, or DeferredDataFrame) – Value to use to fill holes (e.g. 0), alternately a dict/DeferredSeries/DeferredDataFrame of values specifying which value to use for each index (for a DeferredSeries) or column (for a DeferredDataFrame). Values not in the dict/DeferredSeries/DeferredDataFrame will not be filled. This value cannot be a list.
- method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) – Method to use for filling holes in a reindexed DeferredSeries. pad / ffill: propagate the last valid observation forward to the next valid one. backfill / bfill: use the next valid observation to fill the gap.
- axis ({0 or 'index', 1 or 'columns'}) – Axis along which to fill missing values.
- inplace (bool, default False) – If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DeferredDataFrame).
- limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
- downcast (dict, default is None) – A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
Returns: Object with missing values filled, or None if inplace=True.
Return type:
Differences from pandas
When axis="index", both method and limit must be None; otherwise this operation is order-sensitive.
See also
interpolate()
- Fill NaN values using interpolation.
reindex()
- Conform object to new index.
asfreq()
- Convert a time series to the specified frequency.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], ... [3, 4, np.nan, 1], ... [np.nan, np.nan, np.nan, 5], ... [np.nan, 3, np.nan, 4]], ... columns=list("ABCD")) >>> df A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN NaN NaN 5 3 NaN 3.0 NaN 4 Replace all NaN elements with 0s. >>> df.fillna(0) A B C D 0 0.0 2.0 0.0 0 1 3.0 4.0 0.0 1 2 0.0 0.0 0.0 5 3 0.0 3.0 0.0 4 We can also propagate non-null values forward or backward. >>> df.fillna(method="ffill") A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 3.0 4.0 NaN 5 3 3.0 3.0 NaN 4 Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1, 2, and 3 respectively. >>> values = {"A": 0, "B": 1, "C": 2, "D": 3} >>> df.fillna(value=values) A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 2.0 1 2 0.0 1.0 2.0 5 3 0.0 3.0 2.0 4 Only replace the first NaN element. >>> df.fillna(value=values, limit=1) A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 NaN 1 2 NaN 1.0 NaN 5 3 NaN 3.0 NaN 4 When filling using a DataFrame, replacement happens along the same column names and same indices >>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE")) >>> df.fillna(df2) A B C D 0 0.0 2.0 0.0 0 1 3.0 4.0 0.0 1 2 0.0 0.0 0.0 5 3 0.0 3.0 0.0 4
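The order-insensitive forms of fillna, which use neither method nor limit, can be sketched on a Series in plain pandas. The snippet runs eagerly and the data and names are illustrative; it is meant to show the value-based fills that do not depend on row order.

```python
import numpy as np
import pandas as pd

# A Series with two holes (illustrative data).
s = pd.Series([np.nan, 2.0, np.nan], index=["a", "b", "c"])

# A scalar fills every missing entry.
by_scalar = s.fillna(0.0)

# A dict fills only the listed index labels; "c" is not listed and
# therefore stays NaN.
by_mapping = s.fillna({"a": 10.0})

print(by_scalar)
print(by_mapping)
```

Both calls decide each replacement from the element's own label, which is why they do not need to see neighbouring rows.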
-
filter
(**kwargs)¶ Subset the dataframe rows or columns according to the specified index labels.
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.
Parameters: - items (list-like) – Keep labels from axis which are in items.
- like (str) – Keep labels from axis for which “like in label == True”.
- regex (str (regular expression)) – Keep labels from axis for which re.search(regex, label) == True.
- axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, ‘index’ for DeferredSeries, ‘columns’ for DeferredDataFrame.
Returns: Return type: same type as input object
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.loc()
- Access a group of rows and columns by label(s) or a boolean array.
Notes
The items, like, and regex parameters are enforced to be mutually exclusive.
axis defaults to the info axis that is used when indexing with [].
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6

>>> # select columns by name
>>> df.filter(items=['one', 'three'])
        one  three
mouse     1      3
rabbit    4      6

>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
        one  three
mouse     1      3
rabbit    4      6

>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
        one  two  three
rabbit    4    5      6
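For a Series the info axis is the index, so filter selects by index label. A minimal plain-pandas sketch, run eagerly and with made-up labels, showing each of the three mutually exclusive selectors:

```python
import pandas as pd

# Labels are what gets filtered; values play no role (illustrative data).
s = pd.Series([1, 2, 3], index=["apple", "banana", "avocado"])

# like: keep labels that contain the substring.
by_like = s.filter(like="an")

# regex: keep labels for which re.search finds a match.
by_regex = s.filter(regex="^a")

# items: keep only the listed labels that actually exist.
by_items = s.filter(items=["apple", "kiwi"])

print(by_like)
print(by_regex)
print(by_items)
```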
-
first
(offset)¶ Select initial periods of time series data based on a date offset.
When having a DataFrame with dates as index, this function can select the first few rows based on a date offset.
Parameters: offset (str, DateOffset or dateutil.relativedelta) – The offset length of the data that will be selected. For instance, ‘1M’ will display all the rows having their index within the first month.
Returns: A subset of the caller.
Return type: DeferredSeries or DeferredDataFrame
Raises: TypeError – If the index is not a DatetimeIndex.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
last()
- Select final periods of time series based on a date offset.
at_time()
- Select values at a particular time of the day.
between_time()
- Select values between particular times of the day.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the first 3 days:

>>> ts.first('3D')
            A
2018-04-09  1
2018-04-11  2

Notice that the data for the first 3 calendar days was returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
-
flags
¶ pandas.DataFrame.flags()
is not implemented yet in the Beam DataFrame API. If support for ‘flags’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
floordiv
(**kwargs)¶ Get Integer division of dataframe and other, element-wise (binary operator floordiv).
Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rfloordiv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360 Add a scalar with operator version which return the same results. >>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361 >>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361 Divide by constant with reverse version. >>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0 >>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778 Subtract a list and Series by axis with operator version. >>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358 >>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358 >>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359 Multiply a DataFrame of different shape with operator version. >>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4 >>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN >>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0 Divide by a MultiIndex by level. >>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720 >>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
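The shared examples above do not call floordiv itself, so here is a short plain-pandas sketch contrasting it with div and with the reverse version. It runs eagerly, and the Series `s` is illustrative.

```python
import pandas as pd

s = pd.Series([7, 8, 9])

# Float (true) division keeps the fractional part.
true_div = s.div(2)

# Integer (floor) division rounds each quotient down: s // 2.
floor_div = s.floordiv(2)

# The reverse version swaps the operands: 20 // s.
reverse = s.rfloordiv(20)

print(true_div.tolist(), floor_div.tolist(), reverse.tolist())
```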
-
ge
(**kwargs)¶ Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: Result of the comparison.
Return type: DeferredDataFrame of bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.eq()
- Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
- Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
- Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
- Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
- Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
- Compare DeferredDataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300 Comparison with a scalar, using either the operator or method: >>> df == 100 cost revenue A False True B False False C True False >>> df.eq(100) cost revenue A False True B False False C True False When `other` is a :class:`Series`, the columns of a DataFrame are aligned with the index of `other` and broadcast: >>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True Use the method to control the broadcast axis: >>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True When comparing to an arbitrary sequence, the number of columns must match the number elements in `other`: >>> df == [250, 100] cost revenue A True True B False False C False False Use the method to control the axis: >>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False Compare to a DataFrame of different shape. >>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150 >>> df.gt(other) cost revenue A False False B False False C False True D False False Compare to a MultiIndex by level. >>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225 >>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
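The two points in the Notes above (index union and NaN handling) can be seen in a small plain-pandas sketch. The frames `left` and `right` are made up for illustration, and this runs eagerly in pandas rather than through the deferred Beam API.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: `right` has an extra row "C", and both hold a NaN in row "B".
left = pd.DataFrame({"cost": [250.0, np.nan]}, index=["A", "B"])
right = pd.DataFrame({"cost": [250.0, np.nan, 100.0]}, index=["A", "B", "C"])

# The method form aligns on the union of the two indices, so row "C"
# appears in the result even though `left` has no such row.
result = left.ge(right)

# Comparisons involving NaN evaluate to False, including NaN >= NaN.
print(result)
```

Row "A" compares as True; rows "B" and "C" compare as False because at least one side is NaN after alignment.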
-
groupby
(by, level, axis, as_index, group_keys, **kwargs)¶ Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
Parameters: - by (mapping, function, label, or list of labels) – Used to determine the groups for the groupby.
If by is a function, it’s called on each value of the object’s index. If a dict or DeferredSeries is passed, the DeferredSeries or dict VALUES will be used to determine the groups (the DeferredSeries’ values are first aligned; see the .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Split along rows (0) or columns (1).
- level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
- as_index (bool, default True) – For aggregated output, return object with group labels as the index. Only relevant for DeferredDataFrame input. as_index=False is effectively “SQL-style” grouped output.
- sort (bool, default True) – Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
- group_keys (bool, default True) – When calling apply, add group keys to index to identify pieces.
- squeeze (bool, default False) –
Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
Deprecated since version 1.1.0.
- observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
- dropna (bool, default True) –
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
New in version 1.1.0.
Returns: Returns a groupby object that contains information about the groups.
Return type: DeferredDataFrameGroupBy
Differences from pandas
as_index and group_keys must both be True.
Aggregations grouping by a categorical column with observed=False set are not currently parallelizable (BEAM-11190).
See also
resample()
- Convenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', ... 'Parrot', 'Parrot'], ... 'Max Speed': [380., 370., 24., 26.]}) >>> df Animal Max Speed 0 Falcon 380.0 1 Falcon 370.0 2 Parrot 24.0 3 Parrot 26.0 >>> df.groupby(['Animal']).mean() Max Speed Animal Falcon 375.0 Parrot 25.0 **Hierarchical Indexes** We can groupby different levels of a hierarchical index using the `level` parameter: >>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'], ... ['Captive', 'Wild', 'Captive', 'Wild']] >>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type')) >>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]}, ... index=index) >>> df Max Speed Animal Type Falcon Captive 390.0 Wild 350.0 Parrot Captive 30.0 Wild 20.0 >>> df.groupby(level=0).mean() Max Speed Animal Falcon 370.0 Parrot 25.0 >>> df.groupby(level="Type").mean() Max Speed Type Captive 210.0 Wild 185.0 We can also choose to include NA in group keys or not by setting `dropna` parameter, the default setting is `True`: >>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]] >>> df = pd.DataFrame(l, columns=["a", "b", "c"]) >>> df.groupby(by=["b"]).sum() a c b 1.0 2 3 2.0 2 5 >>> df.groupby(by=["b"], dropna=False).sum() a c b 1.0 2 3 2.0 2 5 NaN 1 4 >>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]] >>> df = pd.DataFrame(l, columns=["a", "b", "c"]) >>> df.groupby(by="a").sum() b c a a 13.0 13.0 b 12.3 123.0 >>> df.groupby(by="a", dropna=False).sum() b c a a 13.0 13.0 b 12.3 123.0 NaN 12.3 33.0
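Since as_index must be True here, one way to get the flat, "SQL-style" shape is to aggregate with the default and then move the group labels back into a column. The sketch below shows this in plain pandas; whether the same chain is available on a deferred frame is an assumption to check against the Beam documentation for reset_index.

```python
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
    "Max Speed": [380.0, 370.0, 24.0, 26.0],
})

# Default as_index=True: the group labels become the index of the result.
by_animal = df.groupby("Animal").mean()

# Flat output without as_index=False: turn the index back into a column.
flat = by_animal.reset_index()

print(by_animal)
print(flat)
```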
-
gt
(**kwargs)¶ Get Greater than of dataframe and other, element-wise (binary operator gt).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: Result of the comparison.
Return type: DeferredDataFrame of bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.eq()
- Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
- Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
- Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
- Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
- Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
- Compare DeferredDataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300 Comparison with a scalar, using either the operator or method: >>> df == 100 cost revenue A False True B False False C True False >>> df.eq(100) cost revenue A False True B False False C True False When `other` is a :class:`Series`, the columns of a DataFrame are aligned with the index of `other` and broadcast: >>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True Use the method to control the broadcast axis: >>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True When comparing to an arbitrary sequence, the number of columns must match the number elements in `other`: >>> df == [250, 100] cost revenue A True True B False False C False False Use the method to control the axis: >>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False Compare to a DataFrame of different shape. >>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150 >>> df.gt(other) cost revenue A False False B False False C False True D False False Compare to a MultiIndex by level. >>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225 >>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
-
hist
(**kwargs)¶ pandas.DataFrame.hist() is not yet supported in the Beam DataFrame API because it is a plotting tool.
For more information see https://s.apache.org/dataframe-plotting-tools.
-
iloc
¶ Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
Allowed inputs are:
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
See more at Selection by Position.
Differences from pandas
Position-based indexing with iloc is order-sensitive in almost every case. Beam DataFrame users should prefer label-based indexing with loc.
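A small plain-pandas sketch of why position-based selection is order-sensitive while label-based selection is not. The frame and its labels are made up for illustration. The concern matters for Beam because the elements of a PCollection have no guaranteed order.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

# Same rows, different order (stands in for data arriving in another order).
shuffled = df.iloc[[2, 0, 1]]

# Position 0 refers to a different row in each frame...
first_by_position = (int(df.iloc[0]["value"]), int(shuffled.iloc[0]["value"]))

# ...while the label "a" refers to the same row in both.
by_label = (int(df.loc["a", "value"]), int(shuffled.loc["a", "value"]))

print(first_by_position, by_label)
```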
See also
DeferredDataFrame.iat
- Fast integer location scalar accessor.
DeferredDataFrame.loc
- Purely label-location based indexer for selection by label.
DeferredSeries.iloc
- Purely integer-location based indexing for selection by position.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, ... {'a': 100, 'b': 200, 'c': 300, 'd': 400}, ... {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }] >>> df = pd.DataFrame(mydict) >>> df a b c d 0 1 2 3 4 1 100 200 300 400 2 1000 2000 3000 4000 **Indexing just the rows** With a scalar integer. >>> type(df.iloc[0]) <class 'pandas.core.series.Series'> >>> df.iloc[0] a 1 b 2 c 3 d 4 Name: 0, dtype: int64 With a list of integers. >>> df.iloc[[0]] a b c d 0 1 2 3 4 >>> type(df.iloc[[0]]) <class 'pandas.core.frame.DataFrame'> >>> df.iloc[[0, 1]] a b c d 0 1 2 3 4 1 100 200 300 400 With a `slice` object. >>> df.iloc[:3] a b c d 0 1 2 3 4 1 100 200 300 400 2 1000 2000 3000 4000 With a boolean mask the same length as the index. >>> df.iloc[[True, False, True]] a b c d 0 1 2 3 4 2 1000 2000 3000 4000 With a callable, useful in method chains. The `x` passed to the ``lambda`` is the DataFrame being sliced. This selects the rows whose index label even. >>> df.iloc[lambda x: x.index % 2 == 0] a b c d 0 1 2 3 4 2 1000 2000 3000 4000 **Indexing both axes** You can mix the indexer types for the index and columns. Use ``:`` to select the entire axis. With scalar integers. >>> df.iloc[0, 1] 2 With lists of integers. >>> df.iloc[[0, 2], [1, 3]] b d 0 2 4 2 2000 4000 With `slice` objects. >>> df.iloc[1:3, 0:3] a b c 1 100 200 300 2 1000 2000 3000 With a boolean array whose length matches the columns. >>> df.iloc[:, [True, False, True, False]] a c 0 1 3 1 100 300 2 1000 3000 With a callable function that expects the Series or DataFrame. >>> df.iloc[:, lambda df: [0, 2]] a c 0 1 3 1 100 300 2 1000 3000
-
index
¶ The index (row labels) of the DataFrame.
Differences from pandas
This operation has no known divergences from the pandas API.
-
infer_object
(**kwargs)¶ pandas.Series.infer_objects() is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.
For more information see https://s.apache.org/dataframe-non-deferred-columns.
-
infer_objects
(**kwargs)¶ pandas.DataFrame.infer_objects() is not implemented yet in the Beam DataFrame API.
If support for ‘infer_objects’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
isin
(**kwargs)¶ Whether each element in the DataFrame is contained in values.
Parameters: values (iterable, DeferredSeries, DeferredDataFrame or dict) – The result will only be true at a location if all the labels match. If values is a DeferredSeries, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DeferredDataFrame, then both the index and column labels must match.
Returns: DeferredDataFrame of booleans showing whether each element in the DeferredDataFrame is contained in values.
Return type: DeferredDataFrame
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.eq()
- Equality test for DeferredDataFrame.
DeferredSeries.isin()
- Equivalent method on DeferredSeries.
DeferredSeries.str.contains()
- Test if pattern or regex is contained within a string of a DeferredSeries or Index.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]}, ... index=['falcon', 'dog']) >>> df num_legs num_wings falcon 2 2 dog 4 0 When ``values`` is a list check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings) >>> df.isin([0, 2]) num_legs num_wings falcon True True dog False True When ``values`` is a dict, we can pass values to check for each column separately: >>> df.isin({'num_wings': [0, 3]}) num_legs num_wings falcon False False dog False True When ``values`` is a Series or DataFrame the index and column must match. Note that 'falcon' does not match based on the number of legs in df2. >>> other = pd.DataFrame({'num_legs': [8, 2], 'num_wings': [0, 2]}, ... index=['spider', 'falcon']) >>> df.isin(other) num_legs num_wings falcon True True dog False False
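The list and dict forms behave differently for columns that are not mentioned, which the following plain-pandas sketch makes explicit. It reuses the frame from the examples above. The final row filter via Series.isin is an illustrative extra, not something the source prescribes.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]},
                  index=["falcon", "dog"])

# List: every column is tested against the same set of values.
in_list = df.isin([0, 2])

# Dict: each key names a column; columns that are not listed come back all False.
in_dict = df.isin({"num_wings": [0, 3]})

# A common use: build a boolean mask from one column and filter rows with it.
four_legged = df[df["num_legs"].isin([4])]

print(in_list, in_dict, four_legged, sep="\n\n")
```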
-
last
(offset)¶ Select final periods of time series data based on a date offset.
For a DataFrame with a sorted DatetimeIndex, this function selects the last few rows based on a date offset.
Parameters: offset (str, DateOffset, dateutil.relativedelta) – The offset length of the data that will be selected. For instance, ‘3D’ will display all the rows having their index within the last 3 days.
Returns: A subset of the caller.
Return type: DeferredSeries or DeferredDataFrame
Raises: TypeError – If the index is not a DatetimeIndex
Differences from pandas
This operation has no known divergences from the pandas API.
See also
first()
- Select initial periods of time series based on a date offset.
at_time()
- Select values at a particular time of the day.
between_time()
- Select values between particular times of the day.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D') >>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i) >>> ts A 2018-04-09 1 2018-04-11 2 2018-04-13 3 2018-04-15 4 Get the rows for the last 3 days: >>> ts.last('3D') A 2018-04-13 3 2018-04-15 4 Notice the data for 3 last calendar days were returned, not the last 3 observed days in the dataset, and therefore data for 2018-04-11 was not returned.
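The "calendar days, not observed rows" behaviour described above can be reproduced with an explicit cutoff. This is a plain-pandas sketch of the same selection for a fixed-length offset. It avoids calling last() itself and is not how Beam implements the operation.

```python
import pandas as pd

i = pd.date_range("2018-04-09", periods=4, freq="2D")
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)

# Everything strictly after (latest timestamp - 3 days) is kept.
cutoff = ts.index.max() - pd.Timedelta(days=3)
last_three_days = ts.loc[ts.index > cutoff]

# 2018-04-11 falls before the cutoff of 2018-04-12, so it is excluded.
print(last_three_days)
```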
-
le
(**kwargs)¶ Get Less than or equal to of dataframe and other, element-wise (binary operator le).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: Result of the comparison.
Return type: DeferredDataFrame of bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.eq()
- Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
- Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
- Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
- Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
- Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
- Compare DeferredDataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300 Comparison with a scalar, using either the operator or method: >>> df == 100 cost revenue A False True B False False C True False >>> df.eq(100) cost revenue A False True B False False C True False When `other` is a :class:`Series`, the columns of a DataFrame are aligned with the index of `other` and broadcast: >>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True Use the method to control the broadcast axis: >>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True When comparing to an arbitrary sequence, the number of columns must match the number elements in `other`: >>> df == [250, 100] cost revenue A True True B False False C False False Use the method to control the axis: >>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False Compare to a DataFrame of different shape. >>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150 >>> df.gt(other) cost revenue A False False B False False C False True D False False Compare to a MultiIndex by level. >>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225 >>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
-
length
()¶ Alternative to len(df) which returns a deferred result that can be used in arithmetic with DeferredSeries or DeferredDataFrame instances.
-
loc
¶ Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f'.
  Warning: Note that contrary to usual python slices, both the start and the stop are included.
- A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
- An alignable boolean Series. The index of the key will be aligned before masking.
- An alignable Index. The Index of the returned selection will be the input.
- A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
See more at Selection by Label.
Raises: - KeyError – If any items are not found.
- IndexingError – If an indexed key is passed and its index is unalignable to the frame index.
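The inclusive-stop behaviour of label slices is easiest to see next to a positional slice. The following plain-pandas sketch uses a small made-up frame.

```python
import pandas as pd

df = pd.DataFrame({"max_speed": [1, 4, 7]},
                  index=["cobra", "viper", "sidewinder"])

# Label slice: both endpoints are included.
by_label = df.loc["cobra":"viper"]

# Positional slice: the stop position is excluded, as in ordinary Python slicing.
by_position = df.iloc[0:1]

print(by_label.index.tolist(), by_position.index.tolist())
```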
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.at
- Access a single value for a row/column label pair.
DeferredDataFrame.iloc
- Access group of rows and columns by integer position(s).
DeferredDataFrame.xs
- Returns a cross-section (row(s) or column(s)) from the DeferredSeries/DeferredDataFrame.
DeferredSeries.loc
- Access group of values using labels.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
**Getting values** >>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]], ... index=['cobra', 'viper', 'sidewinder'], ... columns=['max_speed', 'shield']) >>> df max_speed shield cobra 1 2 viper 4 5 sidewinder 7 8 Single label. Note this returns the row as a Series. >>> df.loc['viper'] max_speed 4 shield 5 Name: viper, dtype: int64 List of labels. Note using ``[[]]`` returns a DataFrame. >>> df.loc[['viper', 'sidewinder']] max_speed shield viper 4 5 sidewinder 7 8 Single label for row and column >>> df.loc['cobra', 'shield'] 2 Slice with labels for row and single label for column. As mentioned above, note that both the start and stop of the slice are included. >>> df.loc['cobra':'viper', 'max_speed'] cobra 1 viper 4 Name: max_speed, dtype: int64 Boolean list with the same length as the row axis >>> df.loc[[False, False, True]] max_speed shield sidewinder 7 8 Alignable boolean Series: >>> df.loc[pd.Series([False, True, False], ... index=['viper', 'sidewinder', 'cobra'])] max_speed shield sidewinder 7 8 Index (same behavior as ``df.reindex``) >>> df.loc[pd.Index(["cobra", "viper"], name="foo")] max_speed shield foo cobra 1 2 viper 4 5 Conditional that returns a boolean Series >>> df.loc[df['shield'] > 6] max_speed shield sidewinder 7 8 Conditional that returns a boolean Series with column labels specified >>> df.loc[df['shield'] > 6, ['max_speed']] max_speed sidewinder 7 Callable that returns a boolean Series >>> df.loc[lambda df: df['shield'] == 8] max_speed shield sidewinder 7 8 **Setting values** Set value for all items matching the list of labels >>> df.loc[['viper', 'sidewinder'], ['shield']] = 50 >>> df max_speed shield cobra 1 2 viper 4 50 sidewinder 7 50 Set value for an entire row >>> df.loc['cobra'] = 10 >>> df max_speed shield cobra 10 10 viper 4 50 sidewinder 7 50 Set value for an entire column >>> df.loc[:, 'max_speed'] = 30 >>> df max_speed shield cobra 30 10 viper 30 50 sidewinder 30 50 Set value for rows matching callable condition >>> df.loc[df['shield'] > 
35] = 0 >>> df max_speed shield cobra 30 10 viper 0 0 sidewinder 0 0 **Getting values on a DataFrame with an index that has integer labels** Another example using integers for the index >>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]], ... index=[7, 8, 9], columns=['max_speed', 'shield']) >>> df max_speed shield 7 1 2 8 4 5 9 7 8 Slice with integer labels for rows. As mentioned above, note that both the start and stop of the slice are included. >>> df.loc[7:9] max_speed shield 7 1 2 8 4 5 9 7 8 **Getting values with a MultiIndex** A number of examples using a DataFrame with a MultiIndex >>> tuples = [ ... ('cobra', 'mark i'), ('cobra', 'mark ii'), ... ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'), ... ('viper', 'mark ii'), ('viper', 'mark iii') ... ] >>> index = pd.MultiIndex.from_tuples(tuples) >>> values = [[12, 2], [0, 4], [10, 20], ... [1, 4], [7, 1], [16, 36]] >>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index) >>> df max_speed shield cobra mark i 12 2 mark ii 0 4 sidewinder mark i 10 20 mark ii 1 4 viper mark ii 7 1 mark iii 16 36 Single label. Note this returns a DataFrame with a single index. >>> df.loc['cobra'] max_speed shield mark i 12 2 mark ii 0 4 Single index tuple. Note this returns a Series. >>> df.loc[('cobra', 'mark ii')] max_speed 0 shield 4 Name: (cobra, mark ii), dtype: int64 Single label for row and column. Similar to passing in a tuple, this returns a Series. >>> df.loc['cobra', 'mark i'] max_speed 12 shield 2 Name: (cobra, mark i), dtype: int64 Single tuple. Note using ``[[]]`` returns a DataFrame. 
>>> df.loc[[('cobra', 'mark ii')]] max_speed shield cobra mark ii 0 4 Single tuple for the index with a single label for the column >>> df.loc[('cobra', 'mark i'), 'shield'] 2 Slice from index tuple to single label >>> df.loc[('cobra', 'mark i'):'viper'] max_speed shield cobra mark i 12 2 mark ii 0 4 sidewinder mark i 10 20 mark ii 1 4 viper mark ii 7 1 mark iii 16 36 Slice from index tuple to index tuple >>> df.loc[('cobra', 'mark i'):('viper', 'mark ii')] max_speed shield cobra mark i 12 2 mark ii 0 4 sidewinder mark i 10 20 mark ii 1 4 viper mark ii 7 1
-
lt
(**kwargs)¶ Get Less than of dataframe and other, element-wise (binary operator lt).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: Result of the comparison.
Return type: DeferredDataFrame of bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.eq()
- Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
- Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
- Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
- Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
- Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
- Compare DeferredDataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300 Comparison with a scalar, using either the operator or method: >>> df == 100 cost revenue A False True B False False C True False >>> df.eq(100) cost revenue A False True B False False C True False When `other` is a :class:`Series`, the columns of a DataFrame are aligned with the index of `other` and broadcast: >>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True Use the method to control the broadcast axis: >>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True When comparing to an arbitrary sequence, the number of columns must match the number elements in `other`: >>> df == [250, 100] cost revenue A True True B False False C False False Use the method to control the axis: >>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False Compare to a DataFrame of different shape. >>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150 >>> df.gt(other) cost revenue A False False B False False C False True D False False Compare to a MultiIndex by level. >>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225 >>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
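The axis parameter matters most when other is a Series keyed by row label. The sketch below, in plain pandas with made-up per-row thresholds, compares each row against its own limit by matching the Series index to the frame's index.

```python
import pandas as pd

df = pd.DataFrame({"cost": [250, 150, 100], "revenue": [100, 250, 300]},
                  index=["A", "B", "C"])

# Hypothetical per-row thresholds, keyed by the same row labels as df.
threshold = pd.Series([200, 200, 150], index=["A", "B", "C"])

# axis='index' matches the Series index against the row labels,
# so every value in row "A" is compared with 200, row "C" with 150, and so on.
below = df.lt(threshold, axis="index")

print(below)
```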
-
mask
(cond, **kwargs)¶ mask is not parallelizable when errors="ignore" is specified.
-
mod
(**kwargs)¶ Get Modulo of dataframe and other, element-wise (binary operator mod).
Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmod.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type: DeferredDataFrame
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
mul
(**kwargs)¶ Get Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
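As a rough illustration of how the flexible wrapper names line up with Python's arithmetic operators, the mapping below uses the standard `operator` module. `FLEX_WRAPPERS` and `apply_elementwise` are hypothetical names for this sketch and do not exist in pandas or Beam; a plain list stands in for a column.

```python
import operator

# Wrapper name -> the binary operator it corresponds to.
FLEX_WRAPPERS = {
    "add": operator.add,            # +
    "sub": operator.sub,            # -
    "mul": operator.mul,            # *
    "div": operator.truediv,        # /
    "truediv": operator.truediv,    # /
    "floordiv": operator.floordiv,  # //
    "mod": operator.mod,            # %
    "pow": operator.pow,            # **
}


def apply_elementwise(name, values, other):
    """Apply the named wrapper between every element of `values` and a scalar."""
    op = FLEX_WRAPPERS[name]
    return [op(value, other) for value in values]


angles = [0, 3, 4]
doubled = apply_elementwise("mul", angles, 2)
remainders = apply_elementwise("mod", angles, 3)
```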
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
multiply
(**kwargs)¶ Get Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
ndim
¶ Return an int representing the number of axes / array dimensions.
Return 1 if Series. Otherwise return 2 if DataFrame.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
ndarray.ndim
- Number of array dimensions.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3})
>>> s.ndim
1

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.ndim
2
-
ne
(**kwargs)¶ Get Not equal to of dataframe and other, element-wise (binary operator ne).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: Result of the comparison.
Return type: DeferredDataFrame of bool
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.eq()
- Compare DeferredDataFrames for equality elementwise.
DeferredDataFrame.ne()
- Compare DeferredDataFrames for inequality elementwise.
DeferredDataFrame.le()
- Compare DeferredDataFrames for less than inequality or equality elementwise.
DeferredDataFrame.lt()
- Compare DeferredDataFrames for strictly less than inequality elementwise.
DeferredDataFrame.ge()
- Compare DeferredDataFrames for greater than inequality or equality elementwise.
DeferredDataFrame.gt()
- Compare DeferredDataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
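The two rules in this note can be checked with ordinary Python floats. The sketch below is a simplified model, not Beam or pandas code: `ne_elementwise` is a hypothetical helper, and label-to-value dicts stand in for a single column.

```python
import math


def ne_elementwise(left, right):
    """Hypothetical model of `ne`: union the labels, then compare value by value."""
    labels = sorted(set(left) | set(right))
    # A label missing on one side is treated as NaN, which never equals anything.
    return {k: left.get(k, math.nan) != right.get(k, math.nan) for k in labels}


a = {"A": 250.0, "B": math.nan, "C": 100.0}
b = {"A": 250.0, "B": math.nan, "D": 100.0}
result = ne_elementwise(a, b)
```

Label "B" holds NaN on both sides and is still reported as not equal, while "C" and "D" each exist on one side only and so compare as different as well.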
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
-
pad
(*args, **kwargs)¶ Synonym for DataFrame.fillna() with method='ffill'.
Returns: Object with missing values filled or None if inplace=True.
Return type: DeferredSeries/DeferredDataFrame or None
Differences from pandas
pad is only supported for axis="columns". axis="index" is order-sensitive.
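To see why the two axes differ, consider a forward fill that works within one row at a time. The sketch below is a plain-Python model with a hypothetical `ffill_row` helper and None standing in for missing values; it is not how Beam implements pad.

```python
def ffill_row(row):
    """Forward-fill missing (None) values within a single row."""
    filled, last = [], None
    for value in row:
        if value is not None:
            last = value
        # Leading gaps stay None because there is nothing to carry forward yet.
        filled.append(last)
    return filled


rows = [[1, None, 3], [None, 5, None]]
padded = [ffill_row(row) for row in rows]
```

Each row is filled using only its own values, so the result does not depend on the order in which rows are processed. Filling down a column would instead need the previous row, which is what makes axis="index" order-sensitive.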
-
pipe
(func, *args, **kwargs)¶ Apply func(self, *args, **kwargs).
Parameters: - func (function) – Function to apply to the DeferredSeries/DeferredDataFrame. args, and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the DeferredSeries/DeferredDataFrame.
- args (iterable, optional) – Positional arguments passed into func.
- kwargs (mapping, optional) – A dictionary of keyword arguments passed into func.
Returns: object
Return type: the return type of func.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.apply()
- Apply a function along input axis of DeferredDataFrame.
DeferredDataFrame.applymap()
- Apply a function elementwise on a whole DeferredDataFrame.
DeferredSeries.map()
- Apply a mapping correspondence on a DeferredSeries.
Notes
Use .pipe when chaining together functions that expect DeferredSeries, DeferredDataFrames or GroupBy objects. Instead of writing
>>> func(g(h(df), arg1=a), arg2=b, arg3=c)  # doctest: +SKIP
You can write
>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe(func, arg2=b, arg3=c)
... )  # doctest: +SKIP
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose func takes its data as arg2:
>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe((func, 'arg2'), arg1=a, arg3=c)
... )  # doctest: +SKIP
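The dispatch rule described above, including the (callable, data_keyword) tuple form, can be sketched as a free function. This is an illustrative model only: the standalone `pipe`, `scale` and `offset` below are hypothetical, and a list stands in for the deferred object.

```python
def pipe(obj, func, *args, **kwargs):
    """Call func with obj as the first argument, or under a named keyword."""
    if isinstance(func, tuple):
        func, data_keyword = func
        if data_keyword in kwargs:
            raise ValueError(
                f"{data_keyword} is both the pipe target and a keyword argument")
        # The data is routed to the keyword the callable expects.
        kwargs[data_keyword] = obj
        return func(*args, **kwargs)
    return func(obj, *args, **kwargs)


def scale(values, factor):
    return [v * factor for v in values]


def offset(amount, data):
    # Takes its data as the second argument, so it needs the tuple form.
    return [v + amount for v in data]


data = [1, 2, 3]
step1 = pipe(data, scale, factor=10)
step2 = pipe(step1, (offset, "data"), amount=1)
```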
-
pivot
(**kwargs)¶ pandas.DataFrame.pivot() is not implemented yet in the Beam DataFrame API.
If support for ‘pivot’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
pivot_table
(**kwargs)¶ pandas.DataFrame.pivot_table() is not implemented yet in the Beam DataFrame API.
If support for ‘pivot_table’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
pow
(**kwargs)¶ Get Exponential power of dataframe and other, element-wise (binary operator pow).
Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rpow.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
radd
(**kwargs)¶ Get Addition of dataframe and other, element-wise (binary operator radd).
Equivalent to other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, add.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
rank
(**kwargs)¶ pandas.DataFrame.rank() is not implemented yet in the Beam DataFrame API.
If support for ‘rank’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
rdiv
(**kwargs)¶ Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
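The relationship between div and its reverse version rdiv mirrors Python's reflected-operator protocol. The toy class below is a hypothetical illustration, not Beam or pandas code; `Column` exists only for this sketch.

```python
class Column:
    """Toy one-column container showing forward and reflected division."""

    def __init__(self, values):
        self.values = list(values)

    def div(self, other):
        # column / other
        return Column(v / other for v in self.values)

    def rdiv(self, other):
        # other / column: the operands swap sides.
        return Column(other / v for v in self.values)

    __truediv__ = div
    __rtruediv__ = rdiv


degrees = Column([360, 180, 360])
forward = (degrees / 10).values
reverse = (10 / degrees).values
```

Python falls back to `__rtruediv__` when the left operand (here the int 10) does not know how to divide by a `Column`, which is why `10 / degrees` and `degrees.rdiv(10)` give the same values.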
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
reindex
(**kwargs)¶ pandas.DataFrame.reindex() is not yet supported in the Beam DataFrame API because it is sensitive to the order of the data.
For more information see https://s.apache.org/dataframe-order-sensitive-operations.
-
reindex_like
(**kwargs)¶ pandas.DataFrame.reindex_like() is not implemented yet in the Beam DataFrame API.
If support for ‘reindex_like’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
reorder_levels
(**kwargs)¶ Rearrange index levels using input order. May not drop or duplicate levels.
Parameters: - order (list of int or list of str) – List representing new level order. Reference level by number (position) or by key (label).
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Where to reorder levels.
Returns:
Return type:
Differences from pandas
This operation has no known divergences from the pandas API.
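What "rearrange index levels using input order" means can be shown on a list of tuples standing in for a MultiIndex. This is a simplified sketch under stated assumptions: the standalone `reorder_levels` below is hypothetical and accepts level positions only, not labels.

```python
def reorder_levels(keys, order):
    """Rearrange each tuple key so its levels follow `order` (positions only)."""
    n_levels = len(keys[0]) if keys else len(order)
    # Levels may be neither dropped nor duplicated.
    if sorted(order) != list(range(n_levels)):
        raise ValueError("order must use every level exactly once")
    return [tuple(key[i] for i in order) for key in keys]


index = [("A", "circle"), ("A", "triangle"), ("B", "square")]
swapped = reorder_levels(index, [1, 0])
```

The rows themselves are untouched; only the order of the levels inside each key changes.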
-
replace
(to_replace, value, limit, method, **kwargs)¶ Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically.
This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
Parameters: - to_replace (str, regex, list, dict, DeferredSeries, int, float, or None) – How to find the values that will be replaced.
  - numeric, str or regex:
    - numeric: numeric values equal to to_replace will be replaced with value
    - str: string exactly matching to_replace will be replaced with value
    - regex: regexs matching to_replace will be replaced with value
  - list of str, regex, or numeric:
    - First, if to_replace and value are both lists, they must be the same length.
    - Second, if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
    - str, regex and numeric rules apply as above.
  - dict:
    - Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be None.
    - For a DeferredDataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
    - For a DeferredDataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The value parameter should be None to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
  - None:
    - This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or DeferredSeries of such elements. If value is also None then this must be a nested dictionary or DeferredSeries.
  See the examples section for examples of each of these.
- value (scalar, dict, list, str, regex, default None) – Value to replace any values matching to_replace with. For a DeferredDataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
- inplace (bool, default False) – If True, performs operation inplace and returns None.
- limit (int, default None) – Maximum size gap to forward or backward fill.
- regex (bool or same types as to_replace, default False) – Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.
- method ({‘pad’, ‘ffill’, ‘bfill’, None}) – The method to use for replacement, when to_replace is a scalar, list or tuple and value is None.
Changed in version 0.23.0: Added to DeferredDataFrame.
Returns: Object after replacement.
Return type:
Raises:
- AssertionError – If regex is not a bool and to_replace is not None.
- TypeError –
  - If to_replace is not a scalar, array-like, dict, or None.
  - If to_replace is a dict and value is not a list, dict, ndarray, or DeferredSeries.
  - If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or DeferredSeries.
  - When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced.
- ValueError – If a list or an ndarray is passed to to_replace and value but they are not the same length.
Differences from pandas
method is not supported in the Beam DataFrame API because it is order-sensitive. It cannot be specified.
If limit is specified this operation is not parallelizable.
See also
DeferredDataFrame.fillna()
- Fill NA values.
DeferredDataFrame.where()
- Replace values based on boolean condition.
DeferredSeries.str.replace()
- Simple string replacement.
Notes
- Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
- Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
- This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
- When dict is used as the to_replace value, it is like key(s) in the dict are the to_replace part and value(s) in the dict are the value parameter.
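The first two notes can be illustrated without pandas. The following is a minimal pure-Python sketch, not the pandas or Beam implementation: the helper replace_regex is hypothetical, and it only models the rule that re.sub is applied to string cells while non-string cells pass through unchanged.

```python
import re


def replace_regex(values, pattern, repl):
    """Model of regex replacement: substitute in strings, skip everything else."""
    compiled = re.compile(pattern)
    result = []
    for value in values:
        if isinstance(value, str):
            # Same substitution rules as re.sub, including backreferences in repl.
            result.append(compiled.sub(repl, value))
        else:
            # Numeric cells are never matched, even if their text form would match.
            result.append(value)
    return result


strings = replace_regex(['bat', 'foo', 'bait'], r'^ba.$', 'new')
mixed = replace_regex([1.5, '1.5'], r'^1\.5$', 'one-and-a-half')
```

In this model the float 1.5 is left alone while the string '1.5' is replaced, which mirrors the note about numeric dtypes above.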
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
**Scalar `to_replace` and `value`**

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64

>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

**List-like `to_replace`**

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

>>> s.replace([1, 2], method='bfill')
0    0
1    3
2    3
3    3
4    4
dtype: int64

**dict-like `to_replace`**

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e

>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e

>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

**Regular expression `to_replace`**

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz

>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz

>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

Compare the behavior of ``s.replace({'a': None})`` and ``s.replace('a', None)`` to understand the peculiarities of the `to_replace` parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])

When one uses a dict as the `to_replace` value, it is like the value(s) in the dict are equal to the `value` parameter. ``s.replace({'a': None})`` is equivalent to ``s.replace(to_replace={'a': None}, value=None, method=None)``:

>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object

When ``value=None`` and `to_replace` is a scalar, list or tuple, `replace` uses the method parameter (default 'pad') to do the replacement. So this is why the 'a' values are being replaced by 10 in rows 1 and 2 and 'b' in row 4 in this case. The command ``s.replace('a', None)`` is actually equivalent to ``s.replace(to_replace='a', value=None, method='pad')``:

>>> s.replace('a', None)
0    10
1    10
2    10
3     b
4     b
dtype: object
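The two dict forms of to_replace are easy to confuse, so here is a small pure-Python model of them over a dict of column lists. The helpers replace_flat and replace_nested are hypothetical names used only for this sketch; they are not part of pandas or Beam, and they ignore regex, list and method handling.

```python
def replace_flat(table, mapping):
    """{old: new} applied to every column, like df.replace({0: 10, 1: 100})."""
    return {col: [mapping.get(v, v) for v in values]
            for col, values in table.items()}


def replace_nested(table, mapping):
    """{column: {old: new}} applied per column, like df.replace({'A': {0: 100}})."""
    return {col: [mapping.get(col, {}).get(v, v) for v in values]
            for col, values in table.items()}


table = {'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9]}
flat = replace_flat(table, {0: 10, 1: 100})
nested = replace_nested(table, {'A': {0: 100, 4: 400}})
```

Both results agree with the corresponding dict-like examples above for columns A and B; column C is left out to keep the sketch short.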
- to_replace (str, regex, list, dict, DeferredSeries, int, float, or None) –
-
resample
(**kwargs)¶ pandas.DataFrame.resample()
is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics. For more information see https://s.apache.org/dataframe-event-time-semantics.
-
reset_index
(level=None, **kwargs)¶ Reset the index, or a level of it.
Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.
Parameters: - level (int, str, tuple, or list, default None) – Only remove the given levels from the index. Removes all levels by default.
- drop (bool, default False) – Do not try to insert index into dataframe columns. This resets the index to the default integer index.
- inplace (bool, default False) – Modify the DeferredDataFrame in place (do not create a new object).
- col_level (int or str, default 0) – If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
- col_fill (object, default '') – If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
Returns: DeferredDataFrame with the new index or None if inplace=True.
Return type:
Differences from pandas
Dropping the entire index (e.g. with reset_index(level=None)) is not parallelizable. It is also only guaranteed that the newly generated index values will be unique. The Beam DataFrame API makes no guarantee that the same index values as the equivalent pandas operation will be generated, because that implementation is order-sensitive.
See also
DeferredDataFrame.set_index()
- Opposite of reset_index.
DeferredDataFrame.reindex()
- Change to new indices or expand indices.
DeferredDataFrame.reindex_like()
- Change to same indices as other DeferredDataFrame.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame([('bird', 389.0),
...                    ('bird', 24.0),
...                    ('mammal', 80.5),
...                    ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
>>> df
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN

When we reset the index, the old index is added as a column, and a new sequential index is used:

>>> df.reset_index()
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN

We can use the `drop` parameter to avoid the old index being added as a column:

>>> df.reset_index(drop=True)
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN

You can also use `reset_index` with `MultiIndex`.

>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
...                                    ('bird', 'parrot'),
...                                    ('mammal', 'lion'),
...                                    ('mammal', 'monkey')],
...                                   names=['class', 'name'])
>>> columns = pd.MultiIndex.from_tuples([('speed', 'max'),
...                                      ('species', 'type')])
>>> df = pd.DataFrame([(389.0, 'fly'),
...                    ( 24.0, 'fly'),
...                    ( 80.5, 'run'),
...                    (np.nan, 'jump')],
...                   index=index,
...                   columns=columns)
>>> df
               speed species
                 max    type
class  name
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump

If the index has multiple levels, we can reset a subset of them:

>>> df.reset_index(level='class')
         class  speed species
                  max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump

If we are not dropping the index, by default, it is placed in the top level. We can place it in another level:

>>> df.reset_index(level='class', col_level=1)
                speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump

When the index is inserted under another level, we can specify under which one with the parameter `col_fill`:

>>> df.reset_index(level='class', col_level=1, col_fill='species')
              species  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump

If we specify a nonexistent level for `col_fill`, it is created:

>>> df.reset_index(level='class', col_level=1, col_fill='genus')
                genus  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
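The ‘Differences from pandas’ note for reset_index says that only uniqueness of the new index values is guaranteed. The sketch below is one way to picture why; it is a pure-Python model under assumed names (sequential_index, partitioned_index), not a description of how Beam actually generates index values.

```python
def sequential_index(rows):
    """pandas-style reset: labels 0..n-1 depend on the order rows arrive in."""
    return {label: row for label, row in enumerate(rows)}


def partitioned_index(partitions):
    """Labels built as (partition, offset) pairs. They are unique without any
    partition needing to know how many rows the other partitions hold."""
    return {(p, i): row
            for p, part in enumerate(partitions)
            for i, row in enumerate(part)}


rows = ['falcon', 'parrot', 'lion', 'monkey']
in_order = sequential_index(rows)
shuffled = sequential_index(list(reversed(rows)))
parallel = partitioned_index([rows[:2], rows[2:]])
```

The same row receives a different label in in_order and shuffled, which is the order sensitivity the note refers to, while the labels in parallel are unique but do not match the 0..n-1 values pandas would produce.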
-
rfloordiv
(**kwargs)¶ Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).
Equivalent to other // dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, floordiv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
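The shared examples above do not exercise rfloordiv itself. As a rough illustration of what "reverse version" means, the sketch below defines a hypothetical Column class (a stand-in, not a pandas or Beam type) whose rfloordiv method computes other // column and is also wired into Python's reflected-operator hook __rfloordiv__.

```python
class Column:
    """Hypothetical stand-in for a column; not a pandas or Beam class."""

    def __init__(self, values):
        self.values = list(values)

    def floordiv(self, other):
        # Forward version: column // other
        return Column(v // other for v in self.values)

    def rfloordiv(self, other):
        # Reverse version: other // column
        return Column(other // v for v in self.values)

    # Hook the named methods into Python's operator protocol.
    __floordiv__ = floordiv
    __rfloordiv__ = rfloordiv


degrees = Column([360, 180, 360])
forward = (degrees // 100).values   # [3, 1, 3]
reverse = (1000 // degrees).values  # [2, 5, 2]
```

Writing 1000 // degrees works because int does not know how to divide by a Column, so Python falls back to the right-hand operand's __rfloordiv__; the named method degrees.rfloordiv(1000) gives the same result.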
-
rmod
(**kwargs)¶ Get Modulo of dataframe and other, element-wise (binary operator rmod).
Equivalent to other % dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, mod.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
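Because modulo is not commutative, rmod and mod generally give different answers for the same inputs. The following pure-Python sketch shows the operand swap using the standard operator module; the reverse helper is a hypothetical name for this example, and the values follow plain Python integer semantics rather than any pandas or Beam code path.

```python
import operator


def reverse(op):
    """Turn a binary function f(a, b) into its reflected form f(b, a)."""
    return lambda a, b: op(b, a)


rmod = reverse(operator.mod)

column = [3, -3, 4]
forward = [operator.mod(x, 10) for x in column]   # x % 10
reflected = [rmod(x, 10) for x in column]         # 10 % x
```

Here forward is [3, 7, 4] and reflected is [1, -2, 2]; in Python the result of % takes the sign of the divisor, which is why 10 % -3 is -2.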
-
rmul
(**kwargs)¶ Get Multiplication of dataframe and other, element-wise (binary operator rmul).
Equivalent to other * dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, mul.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
rolling
(**kwargs)¶ pandas.DataFrame.rolling()
is not yet supported in the Beam DataFrame API because implementing it would require integrating with Beam event-time semantics. For more information see https://s.apache.org/dataframe-event-time-semantics.
-
rpow
(**kwargs)¶ Get Exponential power of dataframe and other, element-wise (binary operator rpow).
Equivalent to other ** dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, pow.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
rsub
(**kwargs)¶ Get Subtraction of dataframe and other, element-wise (binary operator rsub).
Equivalent to other - dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, sub.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
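The fill_value rule in the parameter description (fill a value missing on one side, but keep the result missing when both sides are missing) can be written out as a short pure-Python model. The function below is a hypothetical sketch for positionally aligned lists, with None standing in for NaN; it is not the pandas or Beam implementation of rsub.

```python
def rsub(values, other, fill_value=None):
    """Model of other - values with fill_value; None marks a missing entry."""
    result = []
    for mine, theirs in zip(values, other):
        if mine is None and theirs is None:
            result.append(None)       # missing on both sides stays missing
        elif fill_value is None and (mine is None or theirs is None):
            result.append(None)       # no fill_value: missing propagates
        else:
            mine = fill_value if mine is None else mine
            theirs = fill_value if theirs is None else theirs
            result.append(theirs - mine)
    return result


values = [1, None, 3, None]
other = [10, 20, None, None]
plain = rsub(values, other)
filled = rsub(values, other, fill_value=0)
```

Without fill_value only the first position has a result; with fill_value=0 the second and third positions are computed as 20 - 0 and 0 - 3, and the last position stays missing because both inputs are missing there.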
-
rtruediv
(**kwargs)¶ Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Return type:
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
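The note "Mismatched indices will be unioned together" can be pictured with two label-keyed dicts. This is a simplified pure-Python model with a hypothetical rtruediv helper; it uses None where pandas would produce NaN, and the sorting of labels is only there to make the output deterministic, not a claim about the ordering pandas or Beam produce.

```python
def rtruediv(series, other):
    """Model of other / series for two label-keyed dicts."""
    labels = sorted(set(series) | set(other))   # union of both indices
    result = {}
    for label in labels:
        if label in series and label in other:
            result[label] = other[label] / series[label]
        else:
            result[label] = None                # present on one side only
    return result


angles = {'circle': 4, 'triangle': 3}
scale = {'triangle': 9, 'rectangle': 8}
ratio = rtruediv(angles, scale)
```

Only 'triangle' appears in both inputs, so it is the only label with a numeric result (9 / 3); 'circle' and 'rectangle' are still present in the output because the indices are unioned.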
-
set_flags
(**kwargs)¶ pandas.DataFrame.set_flags()
is not implemented yet in the Beam DataFrame API. If support for ‘set_flags’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
size
¶ Return an int representing the number of elements in this object.
Return the number of rows if Series. Otherwise return the number of rows times number of columns if DataFrame.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
ndarray.size
- Number of elements in the array.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3})
>>> s.size
3

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.size
4
-
slice_shift
(**kwargs)¶ pandas.DataFrame.slice_shift()
is not implemented yet in the Beam DataFrame API. If support for ‘slice_shift’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
sort_index
(axis, **kwargs)¶ Sort object by labels (along an axis).
Returns a new DataFrame sorted by label if inplace argument is
False
, otherwise updates the original DataFrame and returns None.
Parameters: - axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
- level (int or level name or list of ints or list of level names) – If not None, sort on values in specified index level(s).
- ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
- inplace (bool, default False) – If True, perform operation in-place.
- kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also
numpy.sort()
for more information. mergesort and stable are the only stable algorithms. For DeferredDataFrames, this option is only applied when sorting on a single column or label.
- na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end. Not implemented for MultiIndex.
- sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
- ignore_index (bool, default False) –
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
- key (callable, optional) –
If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin
sorted()
function, with the notable difference that this key function should be vectorized. It should expect an Index
and return an Index
of the same shape. For MultiIndex inputs, the key is applied per level.
New in version 1.1.0.
Returns: The original DeferredDataFrame sorted by the labels or None if
inplace=True
.
Differences from pandas
axis=index
is not allowed because it imposes an ordering on the dataset, and we cannot guarantee it will be maintained (see https://s.apache.org/dataframe-order-sensitive-operations). Only
axis=columns
is allowed.
See also
DeferredSeries.sort_index()
- Sort DeferredSeries by the index.
DeferredDataFrame.sort_values()
- Sort DeferredDataFrame by the value.
DeferredSeries.sort_values()
- Sort DeferredSeries by the value.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150],
...                   columns=['A'])
>>> df.sort_index()
     A
1    4
29   2
100  1
150  5
234  3

By default, it sorts in ascending order; to sort in descending order, use ``ascending=False``.

>>> df.sort_index(ascending=False)
     A
234  3
150  5
100  1
29   2
1    4

A key function can be specified which is applied to the index before sorting. For a ``MultiIndex`` this is applied to each level separately.

>>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd'])
>>> df.sort_index(key=lambda x: x.str.lower())
   a
A  1
b  2
C  3
d  4
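Because only axis=columns is allowed on a DeferredDataFrame, the pandas examples above (which sort rows) do not correspond to a supported Beam call. The following is a minimal sketch of the column-wise form, using a plain pandas.DataFrame as a stand-in for the deferred object; the column names and values are made up for illustration.

```python
import pandas as pd

# A plain pandas DataFrame stands in for a DeferredDataFrame here (assumption).
df = pd.DataFrame({"b": [1, 2], "c": [3, 4], "a": [5, 6]})

# Sorting by column label reorders the columns only; row order is untouched,
# which is why this form does not impose an ordering on the dataset.
by_columns = df.sort_index(axis="columns")

print(list(by_columns.columns))
print(by_columns["a"].tolist())
```

The result depends only on the column labels, which are known without looking at the data.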
-
sort_values
(axis, **kwargs)¶ sort_values
is not implemented.
It is not implemented for
axis=index
because it imposes an ordering on the dataset, and it likely will not be maintained (see https://s.apache.org/dataframe-order-sensitive-operations).
It is not implemented for
axis=columns
because it makes the order of the columns depend on the data (see https://s.apache.org/dataframe-non-deferred-columns).
-
sparse
¶ pandas.DataFrame.sparse()
is not implemented yet in the Beam DataFrame API. If support for ‘sparse’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-12425.
-
squeeze
(**kwargs)¶ pandas.DataFrame.squeeze()
is not implemented yet in the Beam DataFrame API. If support for ‘squeeze’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
sub
(**kwargs)¶ Get Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to
dataframe - other
, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rsub.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
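The fill_value description says that a location missing in both inputs stays missing, which the examples above do not show for subtraction. A small sketch in plain pandas (standing in for the deferred API; the frames and values are invented for illustration) might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: "b" is missing on the left only, "c" is missing on both sides.
left = pd.DataFrame({"x": [1.0, np.nan, np.nan]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]}, index=["a", "b", "c"])

# fill_value replaces a NaN only when the other operand has a value there.
result = left.sub(right, fill_value=0)

print(result)
```

Row "b" is computed as 0 - 20, while row "c" remains NaN because neither input supplies a value.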
-
subtract
(**kwargs)¶ Get Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to
dataframe - other
, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rsub.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which return the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
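The note that mismatched indices are unioned can be illustrated with two frames whose row labels only partly overlap. This is a sketch in plain pandas, used here as a stand-in for the deferred API; the labels and values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical inputs sharing only the row label "b".
left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"x": [10, 20]}, index=["b", "c"])

# The result covers the union of both indices; labels present on only
# one side produce NaN because no fill_value is given.
diff = left.subtract(right)

# subtract and sub describe the same operation, so the results match.
same_as_sub = diff.equals(left.sub(right))

print(diff)
print(same_as_sub)
```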
-
swapaxes
(**kwargs)¶ pandas.Series.swapaxes()
is not yet supported in the Beam DataFrame API because the columns in the output DataFrame depend on the data.
For more information see https://s.apache.org/dataframe-non-deferred-columns.
-
swaplevel
(**kwargs)¶ pandas.DataFrame.swaplevel()
is not implemented yet in the Beam DataFrame API. If support for ‘swaplevel’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_clipboard
(**kwargs)¶ pandas.DataFrame.to_clipboard()
is not implemented yet in the Beam DataFrame API. If support for ‘to_clipboard’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_csv
(path, transform_label=None, *args, **kwargs)¶ Write object to a comma-separated values (csv) file.
Parameters: - path_or_buf (str or file handle, default None) –
File path or object, if None is provided the result is returned as a string. If a non-binary file object is passed, it should be opened with newline=’’, disabling universal newlines. If a binary file object is passed, mode might need to contain a ‘b’.
Changed in version 1.2.0: Support for binary file objects was introduced.
- sep (str, default ',') – String of length 1. Field delimiter for the output file.
- na_rep (str, default '') – Missing data representation.
- float_format (str, default None) – Format string for floating point numbers.
- columns (sequence, optional) – Columns to write.
- header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
- index (bool, default True) – Write row names (index).
- index_label (str or sequence, or False, default None) – Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R.
- mode (str) – Python write mode, default ‘w’.
- encoding (str, optional) – A string representing the encoding to use in the output file, defaults to ‘utf-8’. encoding is not supported if path_or_buf is a non-binary file object.
- compression (str or dict, default 'infer') –
If str, represents compression mode. If dict, value at ‘method’ is the compression mode. Compression mode may be any of the following possible values: {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}. If compression mode is ‘infer’ and path_or_buf is path-like, then detect compression mode from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’ or ‘.xz’. (otherwise no compression). If dict given and mode is one of {‘zip’, ‘gzip’, ‘bz2’}, or inferred as one of the above, other entries passed as additional compression options.
Changed in version 1.0.0: May now be a dict with key ‘method’ as compression mode and other entries as additional compression options if compression mode is ‘zip’.
Changed in version 1.1.0: Passing compression options as keys in dict is supported for compression modes ‘gzip’ and ‘bz2’ as well as ‘zip’.
Changed in version 1.2.0: Compression is supported for binary file objects.
Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to gzip.open instead of gzip.GzipFile which prevented setting mtime.
- quoting (optional constant from csv module) – Defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric.
- quotechar (str, default '"') – String of length 1. Character used to quote fields.
- line_terminator (str, optional) – The newline character or character sequence to use in the output file. Defaults to os.linesep, which depends on the OS in which this method is called (e.g. ‘\n’ for Linux, ‘\r\n’ for Windows).
- chunksize (int or None) – Rows to write at a time.
- date_format (str, default None) – Format string for datetime objects.
- doublequote (bool, default True) – Control quoting of quotechar inside a field.
- escapechar (str, default None) – String of length 1. Character used to escape sep and quotechar when appropriate.
- decimal (str, default '.') – Character recognized as decimal separator. E.g. use ‘,’ for European data.
- errors (str, default 'strict') –
Specifies how encoding and decoding errors are to be handled. See the errors argument for
open()
for a full list of options.
New in version 1.1.0.
- storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to
urllib
as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to
fsspec
. Please see
fsspec
and
urllib
for more details.
New in version 1.2.0.
Returns: If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
read_csv()
- Load a CSV file into a DeferredDataFrame.
to_excel()
- Write DeferredDataFrame to an Excel file.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'

Create 'out.zip' containing 'out.csv'

>>> compression_opts = dict(method='zip',
...                         archive_name='out.csv')
>>> df.to_csv('out.zip', index=False,
...           compression=compression_opts)
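The sep and na_rep parameters are described above but not exercised in the examples. As a rough illustration in plain pandas (the deferred Beam call takes a path and will look different), a custom delimiter and missing-value marker could be applied like this; the data is invented for the example.

```python
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"name": ["Raphael", "Donatello"], "age": [15.0, None]})

# With no path given, pandas returns the CSV text as a string.
text = df.to_csv(index=False, sep=";", na_rep="NA")

# splitlines() avoids depending on the platform's line terminator.
rows = text.splitlines()
print(rows)
```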
-
to_excel
(path, *args, **kwargs)¶ Write object to an Excel sheet.
To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an ExcelWriter object with a target file name, and specify a sheet in the file to write to.
Multiple sheets may be written to by specifying unique sheet_name. With all data written to the file it is necessary to save the changes. Note that creating an ExcelWriter object with a file name that already exists will result in the contents of the existing file being erased.
Parameters: - excel_writer (path-like, file-like, or ExcelWriter object) – File path or existing ExcelWriter.
- sheet_name (str, default 'Sheet1') – Name of sheet which will contain DeferredDataFrame.
- na_rep (str, default '') – Missing data representation.
- float_format (str, optional) – Format string for floating point numbers. For example
float_format="%.2f"
will format 0.1234 to 0.12.
- columns (sequence or list of str, optional) – Columns to write.
- header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
- index (bool, default True) – Write row names (index).
- index_label (str or sequence, optional) – Column label for index column(s) if desired. If not specified, and header and index are True, then the index names are used. A sequence should be given if the DeferredDataFrame uses MultiIndex.
- startrow (int, default 0) – Upper left cell row to dump data frame.
- startcol (int, default 0) – Upper left cell column to dump data frame.
- engine (str, optional) –
Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set this via the options
io.excel.xlsx.writer
,
io.excel.xls.writer
, and
io.excel.xlsm.writer
.
Deprecated since version 1.2.0: As the xlwt package is no longer maintained, the
xlwt
engine will be removed in a future version of pandas.
- merge_cells (bool, default True) – Write MultiIndex and Hierarchical Rows as merged cells.
- encoding (str, optional) – Encoding of the resulting excel file. Only necessary for xlwt, other writers support unicode natively.
- inf_rep (str, default 'inf') – Representation for infinity (there is no native representation for infinity in Excel).
- verbose (bool, default True) – Display more information in the error logs.
- freeze_panes (tuple of int (length 2), optional) – Specifies the one-based bottommost row and rightmost column that is to be frozen.
- storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to
urllib
as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to
fsspec
. Please see
fsspec
and
urllib
for more details.
New in version 1.2.0.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
to_csv()
- Write DeferredDataFrame to a comma-separated values (csv) file.
ExcelWriter()
- Class for writing DeferredDataFrame objects into excel sheets.
read_excel()
- Read an Excel file into a DeferredDataFrame.
read_csv()
- Read a comma-separated values (csv) file into DeferredDataFrame.
Notes
For compatibility with
to_csv()
, to_excel serializes lists and dicts to strings before writing.
Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
Create, write to and save a workbook:

>>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                    index=['row 1', 'row 2'],
...                    columns=['col 1', 'col 2'])
>>> df1.to_excel("output.xlsx")

To specify the sheet name:

>>> df1.to_excel("output.xlsx",
...              sheet_name='Sheet_name_1')

If you wish to write to more than one sheet in the workbook, it is necessary to specify an ExcelWriter object:

>>> df2 = df1.copy()
>>> with pd.ExcelWriter('output.xlsx') as writer:
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')

ExcelWriter can also be used to append to an existing Excel file:

>>> with pd.ExcelWriter('output.xlsx',
...                     mode='a') as writer:
...     df1.to_excel(writer, sheet_name='Sheet_name_3')

To set the library that is used to write the Excel file, you can pass the `engine` keyword (the default engine is automatically chosen depending on the file extension):

>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')
-
to_feather
(path, *args, **kwargs)¶ Write a DataFrame to the binary Feather format.
Parameters: - path (str or file-like object) – If a string, it will be used as Root Directory path.
- **kwargs –
Additional keywords passed to
pyarrow.feather.write_feather()
. Starting with pyarrow 0.17, this includes the compression, compression_level, chunksize and version keywords.
New in version 1.1.0.
Differences from pandas
This operation has no known divergences from the pandas API.
-
to_gbq
(**kwargs)¶ pandas.DataFrame.to_gbq()
is not implemented yet in the Beam DataFrame API. If support for ‘to_gbq’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_hdf
(**kwargs)¶ pandas.DataFrame.to_hdf()
is not yet supported in the Beam DataFrame API because HDF5 is a random access file format.
-
to_html
(path, *args, **kwargs)¶ Render a DataFrame as an HTML table.
Parameters: - buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
- columns (sequence, optional, default None) – The subset of columns to write. Writes all columns by default.
- col_space (str or int, list or dict of int or str, optional) –
The minimum width of each column in CSS length units. An int is assumed to be px units.
New in version 0.25.0: Ability to use str.
- header (bool, optional) – Whether to print column labels, default True.
- index (bool, optional, default True) – Whether to print index (row) labels.
- na_rep (str, optional, default 'NaN') – String representation of
NaN
to use.
- formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.
- float_format (one-parameter function, optional, default None) –
Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-
NaN
elements, with
NaN
being handled by
na_rep
.
Changed in version 1.2.0.
- sparsify (bool, optional, default True) – Set to False for a DeferredDataFrame with a hierarchical index to print every multiindex key at each row.
- index_names (bool, optional, default True) – Prints the names of the indexes.
- justify (str, default None) –
How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are
- left
- right
- center
- justify
- justify-all
- start
- end
- inherit
- match-parent
- initial
- unset.
- max_rows (int, optional) – Maximum number of rows to display in the console.
- min_rows (int, optional) – The number of rows to display in the console in a truncated repr (when number of rows is above max_rows).
- max_cols (int, optional) – Maximum number of columns to display in the console.
- show_dimensions (bool, default False) – Display DeferredDataFrame dimensions (number of rows by number of columns).
- decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
- bold_rows (bool, default True) – Make the row labels bold in the output.
- classes (str or list or tuple, default None) – CSS class(es) to apply to the resulting html table.
- escape (bool, default True) – Convert the characters <, >, and & to HTML-safe sequences.
- notebook ({True, False}, default False) – Whether the generated HTML is for IPython Notebook.
- border (int) – A
border=border
attribute is included in the opening <table> tag. Default
pd.options.display.html.border
.
- encoding (str, default "utf-8") –
Set character encoding.
New in version 1.0.
- table_id (str, optional) – A css id is included in the opening <table> tag if specified.
- render_links (bool, default False) – Convert URLs to HTML links.
Returns: If buf is None, returns the result as a string. Otherwise returns None.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
to_string()
- Convert DeferredDataFrame to a string.
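To make the escape parameter concrete, here is a short sketch in plain pandas (used as a stand-in for the deferred API, whose to_html takes a path). The cell content is a made-up string containing HTML-significant characters.

```python
import pandas as pd

# Hypothetical cell containing <, > and & characters.
df = pd.DataFrame({"tag": ["<b>bold</b> & more"]})

# Default behaviour: the characters are converted to HTML-safe sequences.
escaped = df.to_html()

# With escape=False the cell text is emitted as-is.
raw = df.to_html(escape=False)

print("&lt;b&gt;" in escaped)
print("<b>bold</b>" in raw)
```

Leaving escape at its default is the safer choice when the data may contain untrusted text.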
-
to_json
(path, orient=None, *args, **kwargs)¶ Convert the object to a JSON string.
Note NaN’s and None will be converted to null and datetime objects will be converted to UNIX timestamps.
Parameters: - path_or_buf (str or file handle, optional) – File path or object. If not specified, the result is returned as a string.
- orient (str) –
Indication of expected JSON string format.
- DeferredSeries:
- default is ‘index’
- allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}.
- DeferredDataFrame:
- default is ‘columns’
- allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}.
- The format of the JSON string:
- ’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
- ’records’ : list like [{column -> value}, … , {column -> value}]
- ’index’ : dict like {index -> {column -> value}}
- ’columns’ : dict like {column -> {index -> value}}
- ’values’ : just the values array
- ’table’ : dict like {‘schema’: {schema}, ‘data’: {data}}
Describing the data, where data component is like
orient='records'
.
- date_format ({None, 'epoch', 'iso'}) – Type of date conversion. ‘epoch’ = epoch milliseconds,
‘iso’ = ISO8601. The default depends on the orient. For
orient='table'
, the default is ‘iso’. For all other orients, the default is ‘epoch’.
- double_precision (int, default 10) – The number of decimal places to use when encoding floating point values.
- force_ascii (bool, default True) – Force encoded string to be ASCII.
- date_unit (str, default 'ms' (milliseconds)) – The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.
- default_handler (callable, default None) – Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.
- lines (bool, default False) – If ‘orient’ is ‘records’ write out line-delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list-like.
- compression ({'infer', 'gzip', 'bz2', 'zip', 'xz', None}) – A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.
- index (bool, default True) – Whether to include the index values in the JSON string. Not
including the index (
index=False
) is only supported when orient is ‘split’ or ‘table’.
- indent (int, optional) –
Length of whitespace used to indent each record.
New in version 1.0.0.
- storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to
urllib
as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to
fsspec
. Please see
fsspec
and
urllib
for more details.
New in version 1.2.0.
Returns: If path_or_buf is None, returns the resulting json format as a string. Otherwise returns None.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
read_json()
- Convert a JSON string to pandas object.
Notes
The behavior of
indent=0
varies from the stdlib, which does not indent the output but does insert newlines. Currently,
indent=0
and the default
indent=None
are equivalent in pandas, though this may change in a future release.
orient='table'
contains a ‘pandas_version’ field under ‘schema’. This stores the version of pandas used in the latest revision of the schema.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> import json
>>> df = pd.DataFrame(
...     [["a", "b"], ["c", "d"]],
...     index=["row 1", "row 2"],
...     columns=["col 1", "col 2"],
... )

>>> result = df.to_json(orient="split")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}

Encoding/decoding a Dataframe using ``'records'`` formatted JSON. Note that index labels are not preserved with this encoding.

>>> result = df.to_json(orient="records")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
[
    {
        "col 1": "a",
        "col 2": "b"
    },
    {
        "col 1": "c",
        "col 2": "d"
    }
]

Encoding/decoding a Dataframe using ``'index'`` formatted JSON:

>>> result = df.to_json(orient="index")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "row 1": {
        "col 1": "a",
        "col 2": "b"
    },
    "row 2": {
        "col 1": "c",
        "col 2": "d"
    }
}

Encoding/decoding a Dataframe using ``'columns'`` formatted JSON:

>>> result = df.to_json(orient="columns")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "col 1": {
        "row 1": "a",
        "row 2": "c"
    },
    "col 2": {
        "row 1": "b",
        "row 2": "d"
    }
}

Encoding/decoding a Dataframe using ``'values'`` formatted JSON:

>>> result = df.to_json(orient="values")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
[
    [
        "a",
        "b"
    ],
    [
        "c",
        "d"
    ]
]

Encoding with Table Schema:

>>> result = df.to_json(orient="table")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "schema": {
        "fields": [
            {
                "name": "index",
                "type": "string"
            },
            {
                "name": "col 1",
                "type": "string"
            },
            {
                "name": "col 2",
                "type": "string"
            }
        ],
        "primaryKey": [
            "index"
        ],
        "pandas_version": "0.20.0"
    },
    "data": [
        {
            "index": "row 1",
            "col 1": "a",
            "col 2": "b"
        },
        {
            "index": "row 2",
            "col 1": "c",
            "col 2": "d"
        }
    ]
}
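The lines parameter is documented above but not shown in the examples. A minimal sketch in plain pandas (standing in for the deferred API, which writes to a path) of line-delimited output, and of the ValueError raised for a non-'records' orient, could look like this; the frame is invented for illustration.

```python
import json
import pandas as pd

df = pd.DataFrame({"col 1": ["a", "c"], "col 2": ["b", "d"]})

# One JSON object per line; only valid together with orient="records".
ndjson = df.to_json(orient="records", lines=True)
records = [json.loads(line) for line in ndjson.splitlines()]

# Any other orient is not list-like, so pandas rejects the combination.
try:
    df.to_json(orient="split", lines=True)
    rejected = False
except ValueError:
    rejected = True

print(records)
print(rejected)
```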
-
to_latex
(**kwargs)¶ pandas.DataFrame.to_latex()
is not implemented yet in the Beam DataFrame API. If support for ‘to_latex’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_markdown
(**kwargs)¶ pandas.DataFrame.to_markdown()
is not implemented yet in the Beam DataFrame API. If support for ‘to_markdown’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_msgpack
(**kwargs)¶ pandas.DataFrame.to_msgpack()
is not yet supported in the Beam DataFrame API because it is deprecated in pandas.
-
to_parquet
(path, *args, **kwargs)¶ Write a DataFrame to the binary parquet format.
This function writes the dataframe as a parquet file. You can choose different parquet backends, and have the option of compression. See the user guide for more details.
Parameters: - path (str or file-like object, default None) –
If a string, it will be used as Root Directory path when writing a partitioned dataset. By file-like object, we refer to objects with a write() method, such as a file handle (e.g. via builtin open function) or io.BytesIO. The engine fastparquet does not accept file-like objects. If path is None, a bytes object is returned.
Changed in version 1.2.0.
Previously this was “fname”
- engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.
- compression ({'snappy', 'gzip', 'brotli', None}, default 'snappy') – Name of the compression to use. Use None for no compression.
- index (bool, default None) – If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, similar to True, the dataframe’s index(es) will be saved. However, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.
- partition_cols (list, optional, default None) – Column names by which to partition the dataset. Columns are partitioned in the order they are given. Must be None if path is not a string.
- storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
- **kwargs – Additional arguments passed to the parquet library. See pandas io for more details.
Returns: Return type: bytes if no path argument is provided else None
Differences from pandas
This operation has no known divergences from the pandas API.
See also
read_parquet()
- Read a parquet file.
DeferredDataFrame.to_csv()
- Write a csv file.
DeferredDataFrame.to_sql()
- Write to a sql table.
DeferredDataFrame.to_hdf()
- Write to hdf.
Notes
This function requires either the fastparquet or pyarrow library.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_parquet('df.parquet.gzip',
...               compression='gzip')
>>> pd.read_parquet('df.parquet.gzip')
   col1  col2
0     1     3
1     2     4

If you want to get a buffer to the parquet content you can use an io.BytesIO
object, as long as you don't use partition_cols, which creates multiple files.

>>> import io
>>> f = io.BytesIO()
>>> df.to_parquet(f)
>>> f.seek(0)
0
>>> content = f.read()
-
to_period
(**kwargs)¶ pandas.DataFrame.to_period()
is not implemented yet in the Beam DataFrame API. If support for ‘to_period’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_pickle
(**kwargs)¶ pandas.DataFrame.to_pickle()
is not implemented yet in the Beam DataFrame API. If support for ‘to_pickle’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_sql
(**kwargs)¶ pandas.DataFrame.to_sql()
is not implemented yet in the Beam DataFrame API. If support for ‘to_sql’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_stata
(path, *args, **kwargs)¶ Export DataFrame object to Stata dta format.
Writes the DataFrame to a Stata dataset file. “dta” files contain a Stata dataset.
Parameters: - path (str, buffer or path object) –
String, path object (pathlib.Path or py._path.local.LocalPath) or object implementing a binary write() function. If using a buffer then the buffer will not be automatically closed after the file data has been written.
Changed in version 1.0.0.
Previously this was “fname”
- convert_dates (dict) – Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are ‘tc’, ‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to ‘tc’. Raises NotImplementedError if a datetime column has timezone information.
- write_index (bool) – Write the index to Stata dataset.
- byteorder (str) – Can be “>”, “<”, “little”, or “big”. default is sys.byteorder.
- time_stamp (datetime) – A datetime to use as file creation date. Default is the current time.
- data_label (str, optional) – A label for the data set. Must be 80 characters or smaller.
- variable_labels (dict) – Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller.
- version ({114, 117, 118, 119, None}, default 114) –
Version to use in the output dta file. Set to None to let pandas decide between 118 or 119 formats depending on the number of columns in the frame. pandas Version 114 can be read by Stata 10 and later. pandas Version 117 can be read by Stata 13 or later. pandas Version 118 is supported in Stata 14 and later. pandas Version 119 is supported in Stata 15 and later. pandas Version 114 limits string variables to 244 characters or fewer while versions 117 and later allow strings with lengths up to 2,000,000 characters. Versions 118 and 119 support Unicode characters, and pandas version 119 supports more than 32,767 variables.
pandas Version 119 should usually only be used when the number of variables exceeds the capacity of dta format 118. Exporting smaller datasets in format 119 may have unintended consequences, and, as of November 2020, Stata SE cannot read pandas version 119 files.
Changed in version 1.0.0: Added support for formats 118 and 119.
- convert_strl (list, optional) – List of column names to convert to string columns to Stata StrL format. Only available if version is 117. Storing strings in the StrL format can produce smaller dta files if strings have more than 8 characters and values are repeated.
- compression (str or dict, default 'infer') –
For on-the-fly compression of the output dta. If string, specifies compression mode. If dict, value at key ‘method’ specifies compression mode. Compression mode must be one of {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}. If compression mode is ‘infer’ and fname is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no compression). If dict and compression mode is one of {‘zip’, ‘gzip’, ‘bz2’}, or inferred as one of the above, other entries passed as additional compression options.
New in version 1.1.0.
- storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
Raises: - NotImplementedError –
  - If datetimes contain timezone information
  - Column dtype is not representable in Stata
- ValueError –
  - Columns listed in convert_dates are neither datetime64[ns] nor datetime.datetime
  - Column listed in convert_dates is not in DeferredDataFrame
  - Categorical label contains more than 32,000 characters
Differences from pandas
This operation has no known divergences from the pandas API.
See also
read_stata()
- Import Stata data files.
io.stata.StataWriter()
- Low-level writer for Stata data files.
io.stata.StataWriter117()
- Low-level writer for pandas version 117 files.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon',
...                               'parrot'],
...                    'speed': [350, 18, 361, 15]})
>>> df.to_stata('animals.dta')
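The compression parameter described above infers the mode from the file extension when set to 'infer'. The sketch below restates that rule in plain Python, assuming exact lowercase extension matching. The helper infer_compression is hypothetical and is not the pandas implementation.

```python
import os

# Extension-to-mode table taken from the parameter description above.
_EXTENSION_TO_MODE = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}


def infer_compression(path, compression="infer"):
    """Return the compression mode that would apply to `path`.

    An explicit mode is returned unchanged. With 'infer', the mode is looked up
    from the final file extension; unknown extensions mean no compression.
    """
    if compression != "infer":
        return compression
    extension = os.path.splitext(str(path))[1]
    return _EXTENSION_TO_MODE.get(extension)


print(infer_compression("animals.dta.gz"))
print(infer_compression("animals.dta"))
print(infer_compression("animals.dta", compression="zip"))
```

Under this rule a plain 'animals.dta' path is written uncompressed, while 'animals.dta.gz' is gzip-compressed without any extra argument.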
-
to_timestamp
(**kwargs)¶ pandas.DataFrame.to_timestamp()
is not implemented yet in the Beam DataFrame API. If support for ‘to_timestamp’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
to_xarray
(**kwargs)¶ pandas.DataFrame.to_xarray()
is not yet supported in the Beam DataFrame API because it produces an output type that is not deferred. For more information see https://s.apache.org/dataframe-non-deferred-result.
-
to_xml
(**kwargs)¶ pandas.DataFrame.to_xml()
is not implemented yet in the Beam DataFrame API. If support for ‘to_xml’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
transform
(**kwargs)¶ Call func on self producing a DataFrame with transformed values.
Produced DataFrame will have same axis length as self.
Parameters: - func (function, str, list-like or dict-like) –
Function to use for transforming the data. If a function, must either work when passed a DeferredDataFrame or when passed to DeferredDataFrame.apply. If func is both list-like and dict-like, dict-like behavior takes precedence.
Accepted combinations are:
- function
- string function name
- list-like of functions and/or function names, e.g. [np.exp, 'sqrt']
- dict-like of axis labels -> functions, function names or list-like of such.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
- *args – Positional arguments to pass to func.
- **kwargs – Keyword arguments to pass to func.
Returns: A DeferredDataFrame that must have the same length as self.
Raises: ValueError – If the returned DeferredDataFrame has a different length than self.
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.agg()
- Only perform aggregating type operations.
DeferredDataFrame.apply()
- Invoke function on a DeferredDataFrame.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
>>> df
   A  B
0  0  1
1  1  2
2  2  3
>>> df.transform(lambda x: x + 1)
   A  B
0  1  2
1  2  3
2  3  4

Even though the resulting DataFrame must have the same length as the
input DataFrame, it is possible to provide several input functions:

>>> s = pd.Series(range(3))
>>> s
0    0
1    1
2    2
dtype: int64
>>> s.transform([np.sqrt, np.exp])
       sqrt       exp
0  0.000000  1.000000
1  1.000000  2.718282
2  1.414214  7.389056

You can call transform on a GroupBy object:

>>> df = pd.DataFrame({
...     "Date": [
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"],
...     "Data": [5, 8, 6, 1, 50, 100, 60, 120],
... })
>>> df
         Date  Data
0  2015-05-08     5
1  2015-05-07     8
2  2015-05-06     6
3  2015-05-05     1
4  2015-05-08    50
5  2015-05-07   100
6  2015-05-06    60
7  2015-05-05   120
>>> df.groupby('Date')['Data'].transform('sum')
0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data, dtype: int64

>>> df = pd.DataFrame({
...     "c": [1, 1, 1, 2, 2, 2, 2],
...     "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
   c type
0  1    m
1  1    n
2  1    o
3  2    m
4  2    m
5  2    n
6  2    n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
   c type  size
0  1    m     3
1  1    n     3
2  1    o     3
3  2    m     4
4  2    m     4
5  2    n     4
6  2    n     4
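The key property of transform, as opposed to an aggregation, is that the output has one value per input row. The following plain-Python sketch mimics the groupby('Date')['Data'].transform('sum') example above without pandas. The helper group_transform_sum is hypothetical and only illustrates the semantics.

```python
from collections import defaultdict


def group_transform_sum(keys, values):
    """Return, for every row, the sum of `values` over that row's group."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    # Broadcast each group's total back to the rows of that group,
    # so the result has the same length and order as the input.
    return [totals[key] for key in keys]


dates = [
    "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
    "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
]
data = [5, 8, 6, 1, 50, 100, 60, 120]

result = group_transform_sum(dates, data)
print(result)  # [55, 108, 66, 121, 55, 108, 66, 121]
```

An aggregation over the same groups would return four values, one per date, whereas the transform returns eight.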
-
truediv
(**kwargs)¶ Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other (scalar, sequence, DeferredSeries, or DeferredDataFrame) – Any single or multiple element data structure, or list-like object.
- axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For DeferredSeries input, axis to match DeferredSeries index on.
- level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DeferredDataFrame alignment, with this value before computation. If data in both corresponding DeferredDataFrame locations is missing the result will be missing.
Returns: Result of the arithmetic operation.
Differences from pandas
Only level=None is supported.
See also
DeferredDataFrame.add()
- Add DeferredDataFrames.
DeferredDataFrame.sub()
- Subtract DeferredDataFrames.
DeferredDataFrame.mul()
- Multiply DeferredDataFrames.
DeferredDataFrame.div()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.truediv()
- Divide DeferredDataFrames (float division).
DeferredDataFrame.floordiv()
- Divide DeferredDataFrames (integer division).
DeferredDataFrame.mod()
- Calculate modulo (remainder after division).
DeferredDataFrame.pow()
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360 Add a scalar with operator version which return the same results. >>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361 >>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361 Divide by constant with reverse version. >>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0 >>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778 Subtract a list and Series by axis with operator version. >>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358 >>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358 >>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359 Multiply a DataFrame of different shape with operator version. >>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4 >>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN >>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0 Divide by a MultiIndex by level. >>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720 >>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
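The interaction between index alignment and fill_value can be easy to misread, so here is a small plain-Python sketch of the rule for one column. It assumes labelled values stored in dicts, with None standing in for NaN, and it avoids zero denominators (pandas would return inf there, while plain Python raises). The helper truediv_aligned is hypothetical.

```python
def truediv_aligned(left, right, fill_value=None):
    """Element-wise left / right over the union of labels.

    A value missing on one side is replaced by `fill_value` (when given).
    If the value is missing on both sides, the result stays missing.
    """
    result = {}
    for label in sorted(set(left) | set(right)):
        a = left.get(label)
        b = right.get(label)
        if a is None and b is None:
            result[label] = None
            continue
        if fill_value is not None:
            a = fill_value if a is None else a
            b = fill_value if b is None else b
        result[label] = None if a is None or b is None else a / b
    return result


degrees = {"circle": 360, "triangle": 180}
sides = {"triangle": 3, "rectangle": 4}

plain = truediv_aligned(degrees, sides)
filled = truediv_aligned(degrees, sides, fill_value=1)
print(plain)   # {'circle': None, 'rectangle': None, 'triangle': 60.0}
print(filled)  # {'circle': 360.0, 'rectangle': 0.25, 'triangle': 60.0}
```

Labels present on only one side appear in the result because mismatched indices are unioned.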
-
truncate
(before, after, axis)¶ Truncate a Series or DataFrame before and after some index value.
This is a useful shorthand for boolean indexing based on index values above or below certain thresholds.
Parameters: - before (date, str, int) – Truncate all rows before this index value.
- after (date, str, int) – Truncate all rows after this index value.
- axis ({0 or 'index', 1 or 'columns'}, optional) – Axis to truncate. Truncates the index (rows) by default.
- copy (bool, default True) – Return a copy of the truncated section.
Returns: The truncated DeferredSeries or DeferredDataFrame.
Return type: type of caller
Differences from pandas
This operation has no known divergences from the pandas API.
See also
DeferredDataFrame.loc()
- Select a subset of a DeferredDataFrame by label.
DeferredDataFrame.iloc()
- Select a subset of a DeferredDataFrame by position.
Notes
If the index being truncated contains only datetime values, before and after may be specified as strings instead of Timestamps.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'], ... 'B': ['f', 'g', 'h', 'i', 'j'], ... 'C': ['k', 'l', 'm', 'n', 'o']}, ... index=[1, 2, 3, 4, 5]) >>> df A B C 1 a f k 2 b g l 3 c h m 4 d i n 5 e j o >>> df.truncate(before=2, after=4) A B C 2 b g l 3 c h m 4 d i n The columns of a DataFrame can be truncated. >>> df.truncate(before="A", after="B", axis="columns") A B 1 a f 2 b g 3 c h 4 d i 5 e j For Series, only rows can be truncated. >>> df['A'].truncate(before=2, after=4) 2 b 3 c 4 d Name: A, dtype: object The index values in ``truncate`` can be datetimes or string dates. >>> dates = pd.date_range('2016-01-01', '2016-02-01', freq='s') >>> df = pd.DataFrame(index=dates, data={'A': 1}) >>> df.tail() A 2016-01-31 23:59:56 1 2016-01-31 23:59:57 1 2016-01-31 23:59:58 1 2016-01-31 23:59:59 1 2016-02-01 00:00:00 1 >>> df.truncate(before=pd.Timestamp('2016-01-05'), ... after=pd.Timestamp('2016-01-10')).tail() A 2016-01-09 23:59:56 1 2016-01-09 23:59:57 1 2016-01-09 23:59:58 1 2016-01-09 23:59:59 1 2016-01-10 00:00:00 1 Because the index is a DatetimeIndex containing only dates, we can specify `before` and `after` as strings. They will be coerced to Timestamps before truncation. >>> df.truncate('2016-01-05', '2016-01-10').tail() A 2016-01-09 23:59:56 1 2016-01-09 23:59:57 1 2016-01-09 23:59:58 1 2016-01-09 23:59:59 1 2016-01-10 00:00:00 1 Note that ``truncate`` assumes a 0 value for any unspecified time component (midnight). This differs from partial string slicing, which returns any partially matching dates. >>> df.loc['2016-01-05':'2016-01-10', :].tail() A 2016-01-10 23:59:55 1 2016-01-10 23:59:56 1 2016-01-10 23:59:57 1 2016-01-10 23:59:58 1 2016-01-10 23:59:59 1
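Since truncate is described as shorthand for boolean indexing on the index, a plain-Python sketch of that filter may help. It assumes rows are (label, value) pairs already sorted by label, and that both bounds are inclusive, as in the examples above. The helper truncate_rows is hypothetical.

```python
from datetime import datetime


def truncate_rows(rows, before=None, after=None):
    """Keep rows whose label lies in the inclusive range [before, after]."""
    return [
        (label, value)
        for label, value in rows
        if (before is None or label >= before)
        and (after is None or label <= after)
    ]


rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
kept = truncate_rows(rows, before=2, after=4)
print(kept)  # [(2, 'b'), (3, 'c'), (4, 'd')]

# A date-only bound is treated as midnight, so anything later that day is cut.
stamps = [
    (datetime(2016, 1, 9, 23, 59, 59), 1),
    (datetime(2016, 1, 10, 0, 0, 0), 1),
    (datetime(2016, 1, 10, 0, 0, 1), 1),
]
kept_stamps = truncate_rows(stamps, after=datetime(2016, 1, 10))
print(len(kept_stamps))  # 2
```

The second call shows why the truncate result ends at 2016-01-10 00:00:00, while partial string slicing with loc keeps the whole of that day.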
-
tshift
(**kwargs)¶ pandas.DataFrame.tshift()
is not implemented yet in the Beam DataFrame API. If support for ‘tshift’ is important to you, please let the Beam community know by writing to user@beam.apache.org or commenting on BEAM-9547.
-
tz_convert
(**kwargs)¶ Convert tz-aware axis to target time zone.
Returns: Object with time zone converted axis.
Return type: {klass}
Raises: TypeError
– If the axis is tz-naive.
Differences from pandas
This operation has no known divergences from the pandas API.
-
tz_localize
(ambiguous, **kwargs)¶ Localize tz-naive index of a Series or DataFrame to target time zone.
This operation localizes the Index. To localize the values in a timezone-naive Series, use Series.dt.tz_localize().
Parameters: - tz (str or tzinfo) –
- axis (the axis to localize) –
- level (int, str, default None) – If axis is a MultiIndex, localize a specific level. Otherwise must be None.
- copy (bool, default True) – Also make a copy of the underlying data.
- ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –
When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.
- ‘infer’ will attempt to infer fall dst-transition hours based on order
- bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
- ‘NaT’ will return NaT where there are ambiguous times
- ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
- nonexistent (str, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST. Valid values are:
- ‘shift_forward’ will shift the nonexistent time forward to the closest existing time
- ‘shift_backward’ will shift the nonexistent time backward to the closest existing time
- ‘NaT’ will return NaT where there are nonexistent times
- timedelta objects will shift nonexistent times by the timedelta
- ‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
Returns: Same type as the input.
Raises: TypeError – If the TimeSeries is tz-aware and tz is not None.
Differences from pandas
ambiguous cannot be set to "infer" as its semantics are order-sensitive. Similarly, specifying ambiguous as an ndarray is order-sensitive, but you can achieve similar functionality by specifying ambiguous as a Series.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
Localize local times: >>> s = pd.Series([1], ... index=pd.DatetimeIndex(['2018-09-15 01:30:00'])) >>> s.tz_localize('CET') 2018-09-15 01:30:00+02:00 1 dtype: int64 Be careful with DST changes. When there is sequential data, pandas can infer the DST time: >>> s = pd.Series(range(7), ... index=pd.DatetimeIndex(['2018-10-28 01:30:00', ... '2018-10-28 02:00:00', ... '2018-10-28 02:30:00', ... '2018-10-28 02:00:00', ... '2018-10-28 02:30:00', ... '2018-10-28 03:00:00', ... '2018-10-28 03:30:00'])) >>> s.tz_localize('CET', ambiguous='infer') 2018-10-28 01:30:00+02:00 0 2018-10-28 02:00:00+02:00 1 2018-10-28 02:30:00+02:00 2 2018-10-28 02:00:00+01:00 3 2018-10-28 02:30:00+01:00 4 2018-10-28 03:00:00+01:00 5 2018-10-28 03:30:00+01:00 6 dtype: int64 In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the ambiguous parameter to set the DST explicitly >>> s = pd.Series(range(3), ... index=pd.DatetimeIndex(['2018-10-28 01:20:00', ... '2018-10-28 02:36:00', ... '2018-10-28 03:46:00'])) >>> s.tz_localize('CET', ambiguous=np.array([True, True, False])) 2018-10-28 01:20:00+02:00 0 2018-10-28 02:36:00+02:00 1 2018-10-28 03:46:00+01:00 2 dtype: int64 If the DST transition causes nonexistent times, you can shift these dates forward or backward with a timedelta object or `'shift_forward'` or `'shift_backward'`. >>> s = pd.Series(range(2), ... index=pd.DatetimeIndex(['2015-03-29 02:30:00', ... '2015-03-29 03:30:00'])) >>> s.tz_localize('Europe/Warsaw', nonexistent='shift_forward') 2015-03-29 03:00:00+02:00 0 2015-03-29 03:30:00+02:00 1 dtype: int64 >>> s.tz_localize('Europe/Warsaw', nonexistent='shift_backward') 2015-03-29 01:59:59.999999999+01:00 0 2015-03-29 03:30:00+02:00 1 dtype: int64 >>> s.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H')) 2015-03-29 03:30:00+02:00 0 2015-03-29 03:30:00+02:00 1 dtype: int64
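The ambiguity described for the ambiguous parameter comes down to simple offset arithmetic: the same wall-clock time maps to two different UTC instants depending on whether it is read as DST. The sketch below works this out with fixed-offset time zones from the standard library, which stand in for the real CET rules. The helper localize_fixed is hypothetical and does not consult a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for Central European Time (assumption: no tz database).
CEST = timezone(timedelta(hours=2), "CEST")  # DST offset, UTC+02
CET = timezone(timedelta(hours=1), "CET")    # non-DST offset, UTC+01


def localize_fixed(naive, is_dst):
    """Attach the DST or non-DST offset, mirroring one True/False ambiguous flag."""
    return naive.replace(tzinfo=CEST if is_dst else CET)


wall = datetime(2018, 10, 28, 2, 30)  # occurs twice on the night clocks go back

as_dst = localize_fixed(wall, is_dst=True).astimezone(timezone.utc)
as_std = localize_fixed(wall, is_dst=False).astimezone(timezone.utc)

print(as_dst.isoformat())  # 2018-10-28T00:30:00+00:00
print(as_std.isoformat())  # 2018-10-28T01:30:00+00:00
```

Because the flag has to be supplied per element, it must stay attached to its timestamp. An index-aligned Series provides that pairing regardless of element order, which is why the Beam API accepts a Series here but not a positional ndarray.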
-
where
(cond, other, errors, **kwargs)¶ where is not parallelizable when errors="ignore" is specified.
-
classmethod
wrap
(expr, split_tuples=True)¶
-
xs
(key, axis, level, **kwargs)¶ Return cross-section from the Series/DataFrame.
This method takes a key argument to select data at a particular level of a MultiIndex.
Parameters: - key (label or tuple of label) – Label contained in the index, or partially in a MultiIndex.
- axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to retrieve cross-section on.
- level (object, defaults to first n levels (n=1 or len(key))) – In case of a key partially contained in a MultiIndex, indicate which levels are used. Levels can be referred by label or position.
- drop_level (bool, default True) – If False, returns object with same levels as self.
Returns: Cross-section from the original DeferredSeries or DeferredDataFrame corresponding to the selected index levels.
Differences from pandas
Note that xs(axis='index') will raise a KeyError at execution time if the key does not exist in the index.
See also
DeferredDataFrame.loc()
- Access a group of rows and columns by label(s) or a boolean array.
DeferredDataFrame.iloc()
- Purely integer-location based indexing for selection by position.
Notes
xs cannot be used to set values.
MultiIndex Slicers is a generic way to get/set values on any level or levels. It is a superset of xs functionality, see MultiIndex Slicers.
Examples
NOTE: These examples are pulled directly from the pandas documentation for convenience. Usage of the Beam DataFrame API will look different because it is a deferred API. In addition, some arguments shown here may not be supported, see ‘Differences from pandas’ for details.
>>> d = {'num_legs': [4, 4, 2, 2], ... 'num_wings': [0, 0, 2, 2], ... 'class': ['mammal', 'mammal', 'mammal', 'bird'], ... 'animal': ['cat', 'dog', 'bat', 'penguin'], ... 'locomotion': ['walks', 'walks', 'flies', 'walks']} >>> df = pd.DataFrame(data=d) >>> df = df.set_index(['class', 'animal', 'locomotion']) >>> df num_legs num_wings class animal locomotion mammal cat walks 4 0 dog walks 4 0 bat flies 2 2 bird penguin walks 2 2 Get values at specified index >>> df.xs('mammal') num_legs num_wings animal locomotion cat walks 4 0 dog walks 4 0 bat flies 2 2 Get values at several indexes >>> df.xs(('mammal', 'dog')) num_legs num_wings locomotion walks 4 0 Get values at specified index and level >>> df.xs('cat', level=1) num_legs num_wings class locomotion mammal walks 4 0 Get values at several indexes and levels >>> df.xs(('bird', 'walks'), ... level=[0, 'locomotion']) num_legs num_wings animal penguin 2 2 Get values at specified column and axis >>> df.xs('num_wings', axis=1) class animal locomotion mammal cat walks 0 dog walks 0 bat flies 2 bird penguin walks 2 Name: num_wings, dtype: int64
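As a rough model of what a cross-section does, the sketch below selects entries from a dict keyed by index tuples, matching one level and optionally dropping it. It assumes a single key and a positional level. The helper cross_section is hypothetical and covers only a small part of what xs supports.

```python
def cross_section(rows, key, level=0, drop_level=True):
    """Select entries whose index tuple equals `key` at position `level`."""
    result = {}
    for index, value in rows.items():
        if index[level] != key:
            continue
        if drop_level:
            # Remove the matched level from the index, as xs does by default.
            index = index[:level] + index[level + 1:]
        result[index] = value
    if not result:
        raise KeyError(key)
    return result


# The num_legs column from the example above, keyed by (class, animal, locomotion).
num_legs = {
    ("mammal", "cat", "walks"): 4,
    ("mammal", "dog", "walks"): 4,
    ("mammal", "bat", "flies"): 2,
    ("bird", "penguin", "walks"): 2,
}

mammals = cross_section(num_legs, "mammal")
cats = cross_section(num_legs, "cat", level=1)
print(mammals)
print(cats)  # {('mammal', 'walks'): 4}
```

A key that matches nothing raises KeyError in this sketch, loosely mirroring the error that the Beam API raises at execution time.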
-