An R-like 'which' Function for Python Pandas
If you’re coming from R to Python Pandas, you might have some habits that are hard to kick. For me, it was using the which function, which I use fairly frequently in R. I find at least two uses for which: one is for indexing and selecting data, though I can’t particularly recommend this as good practice. You’ll almost certainly get better performance from Pandas’ built-in logical indexing. When working with small data, though, the difference between which and logical indexing probably won’t be noticeable.
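To make the comparison concrete, here is a minimal sketch of the two selection styles side by side. It assumes a small made-up dataframe and uses NumPy's `np.flatnonzero` as a stand-in for a which-style lookup; neither the variable names nor the stand-in come from the original workflow.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({"A": [4.4, 4.5, 4.7, 3.2], "B": [99, 200, 65, 140]})

# Logical (boolean-mask) indexing: the mask is passed straight to .loc
mask = df["B"] < 100
by_mask = df.loc[mask, ["A"]]

# which-style indexing: first turn the mask into integer positions,
# then select those positions with .iloc
positions = np.flatnonzero(mask.to_numpy())
by_position = df.iloc[positions][["A"]]

print(by_mask.equals(by_position))
```

Both routes select the same rows; the mask route simply skips the intermediate step of materializing the positions.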
The second reason I like using which is simply to figure out precisely where in my dataframe the rows I’m interested in are located. Especially when working with big data(frames), it’s hard to see where the data you’re looking for sits. You can count how many rows match your condition, but you might miss a useful insight (that they all come from consecutive rows, or that matches are more common later in the dataframe than earlier, etc.). These are situations where it’d be nice to have an R-like which function.
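As a rough illustration of the kind of insight that positions can give, the sketch below checks whether the matching rows form one consecutive block and splits them into runs. The Series values are invented, and `np.flatnonzero` is again used as an assumed stand-in for a which-style lookup.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only
s = pd.Series([1, 9, 8, 7, 2, 1, 9])

# Integer positions of the rows that satisfy the condition
positions = np.flatnonzero((s > 5).to_numpy())

# True only if every match directly follows the previous one
is_one_block = bool(np.all(np.diff(positions) == 1))

# Split the positions wherever the gap to the previous match is not 1
breaks = np.flatnonzero(np.diff(positions) != 1) + 1
runs = [run.tolist() for run in np.split(positions, breaks)]

print(positions.tolist(), is_one_block, runs)
```

A simple count of matches would report four rows here, while the runs show that three of them sit together and one is isolated at the end.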
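That’s why I’ve come up with the following little Python function. It behaves much like the which you’re used to in R, except that you can apply it to Python Pandas logical Series objects. It converts the input to a boolean array and returns the indices of all the values that evaluate to True. For a Series, those indices are the index labels; for a NumPy array or a list, they are 0-based positions rather than R’s 1-based ones.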
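The core of the technique can be sketched in a few lines, assuming a Series with a non-default index so that the difference between positions and labels is visible. The data and variable names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with string labels instead of the default 0..n-1 index
s = pd.Series([10, 25, 30], index=["a", "b", "c"])
mask = s > 20

# np.where on a 1-D boolean array returns a one-element tuple;
# element [0] holds the 0-based positions of the True values
positions = np.where(mask.to_numpy())[0]

# Looking those positions up in the index gives the matching labels
labels = mask.index[positions]

print(positions.tolist(), list(labels))
```

Positions are what `.iloc` expects, while labels are what `.loc` expects, which is why a Series-aware which can return labels and still be used for indexing.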
import pandas as pd
import numpy as np

def which(self):
    """Return the indices of the values that evaluate to True."""
    original_type = type(self)
    try:
        # Convert input to numpy boolean array
        boolean_array = np.asarray(self, dtype=bool)
        # Only proceed if boolean_array is 1D
        if boolean_array.ndim != 1:
            raise ValueError("Input to 'which' must be 1-dimensional (vector-like), not shape {}".format(boolean_array.shape))
        # 0-based positions of the True values
        indices_array = np.where(boolean_array)[0]
        if original_type == pd.Series:
            # Map positions back to the Series' index labels
            return self.index[indices_array]
        elif original_type == np.ndarray:
            return indices_array
        elif original_type == list:
            return list(indices_array)
        else:
            # If the input is not a recognized type, return indices as a numpy array
            return indices_array
    except Exception as e:
        # Re-raise with a more helpful message, keeping the original error attached
        raise TypeError(
            "Input to 'which' could not be processed as a boolean array-like structure.\n"
            "This typically means the input is not a pandas Series, numpy array, or "
            f"list/iterable of boolean-convertible values.\nOriginal error: {str(e)}"
        ) from e

# Optional: attach it as a method on Pandas Series objects
pd.Series.which = which
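Assigning to `pd.Series.which` patches the class directly. If you would rather not modify `pd.Series` itself, one possible alternative is pandas' accessor registration API. The sketch below is only an illustration under that assumption: the accessor name `r` and the class `RLikeAccessor` are hypothetical, and the method is a simplified version that handles Series input only.

```python
import numpy as np
import pandas as pd

# Register a namespace called "r" (hypothetical name) on every Series
@pd.api.extensions.register_series_accessor("r")
class RLikeAccessor:
    def __init__(self, series):
        self._series = series

    def which(self):
        # Positions of the truthy values, mapped back to index labels
        mask = np.asarray(self._series, dtype=bool)
        return self._series.index[np.flatnonzero(mask)]

s = pd.Series([99, 200, 65, 140])
hits = (s == 200).r.which()
print(list(hits))
```

The call then reads `(condition).r.which()`, which keeps the helper in its own namespace and away from any method names pandas may define.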
Just to give you a feel for how it works, I’ll load some toy data:
from io import StringIO
toy_data = StringIO("""A;B
4.4;99
4.5;200
4.7;65
3.2;140
""")
df = pd.read_csv(toy_data, sep=";")
df
     A    B
0  4.4   99
1  4.5  200
2  4.7   65
3  3.2  140
With our toy dataframe, we can apply which to a column condition as a standalone function:
which(df.A > 4)
[0, 1, 2]
Or, if you’ve attached it to pd.Series as a method, you can call .which() like any other Pandas method:
(df.B == 200).which()
[1]
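One detail worth checking in your own session is the return type. A function that returns `series.index[...]` for Series input hands back a pandas Index object, whose printed form depends on the pandas version; calling `.tolist()` on it gives a plain Python list. The sketch below uses a simplified, hypothetical stand-in named `which_labels` for Series input only, not the full function.

```python
import numpy as np
import pandas as pd

def which_labels(mask):
    """Simplified stand-in (hypothetical): index labels of the True values."""
    mask = pd.Series(mask)
    return mask.index[np.flatnonzero(mask.to_numpy(dtype=bool))]

df = pd.DataFrame({"A": [4.4, 4.5, 4.7, 3.2], "B": [99, 200, 65, 140]})

result = which_labels(df["A"] > 4)   # a pandas Index object
as_list = result.tolist()            # a plain Python list of labels

print(type(result).__name__, as_list)
```

Either form can be passed to `df.loc`, so the conversion only matters when you want list semantics or list-style output.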
Just like in R, it works perfectly well for indexing:
df.loc[which(df.B < 100), ['A']]
     A
0  4.4
2  4.7
Hopefully this function can ease some of the pains of switching between R and Pandas.
Alex Miller is a scientist, developer, and former academic. Feel free to connect with him on LinkedIn.