Why pandas apply method is slow, and how Terality accelerates it by 80x

February 21, 2022
Lucas Guillermou

While processing data with pandas, it is quite common to apply a user-defined function to every row of a DataFrame. The typical way to do so is the apply method. This article focuses on apply with axis=1, which evaluates a function on every row. The axis=0 version evaluates a function on each column, but it does not suffer from the same performance issues, as pandas DataFrames are internally stored column-wise using NumPy arrays.

How does the apply method work?

When using DataFrame.apply(func, axis=1), func is evaluated on every row. During the evaluation, each row is wrapped as a Series, so the input parameter of func is a Series whose length equals the number of columns of the original DataFrame, indexed by the column labels.

While the input is always a Series, the output type of func will determine the return type of the method apply. For instance, if func returns a scalar, the result of df.apply(func, axis=1) will be a Series, while if it returns a Series, the result will be a DataFrame. You can also modify this behavior with the parameter result_type, for which you will find more details at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html.

Let’s see how apply works on a dummy example.

-- CODE language-python --
def inspect_and_add_5(row: pd.Series) -> pd.Series:
    print(type(row))
    return row + 5

>>> df = pd.DataFrame({"c1": [0, 1, 2], "c2": [10, 11, 12]})
>>> df.apply(inspect_and_add_5, axis=1)
<class 'pandas.core.series.Series'>  # evaluation on 1st row
<class 'pandas.core.series.Series'>  # evaluation on 2nd row
<class 'pandas.core.series.Series'>  # evaluation on 3rd row
   c1  c2  # result of the function
0   5  15
1   6  16
2   7  17
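As mentioned above, the return type of apply depends on what func returns. A minimal sketch to illustrate the scalar case: when func returns a scalar, the result collapses to a Series with one value per row.

```python
import pandas as pd

df = pd.DataFrame({"c1": [0, 1, 2], "c2": [10, 11, 12]})

# func returns a scalar, so apply returns a Series (one value per row).
row_sums = df.apply(lambda row: row["c1"] + row["c2"], axis=1)
print(row_sums)
```

Here row_sums is a Series containing 10, 12 and 14, whereas the Series-returning func above produced a DataFrame.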

Apply seems pretty easy to use and very convenient as you can literally perform any operation you want on rows. However, this comes at a price.

Understanding apply’s poor performance

Let’s compare two code snippets computing the sum of the squares of dataframe columns.

-- CODE language-python --
import numpy as np
import pandas as pd
n = 1_000_000
df = pd.DataFrame({"A": np.random.rand(n), "B": np.random.rand(n)})


Now, we compute the sum of squares of the dataframe columns. The first version uses apply while the other uses regular pandas methods.

-- CODE language-python --
def sum_squares(row):
    return row["A"] ** 2 + row["B"] ** 2


df.apply(sum_squares, axis=1)
8.74 s ± 157 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

a_squared = df["A"] ** 2
b_squared = df["B"] ** 2
a_squared + b_squared
3.07 ms ± 69.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Here the non-apply version is about 3,000 times faster than the apply one, which makes a stunning difference. Regarding memory, the following snippets show that the apply version consumes at least 2.5 times more than the non-apply one.

-- CODE language-python --
lucas@lucas-XPS-15-9500:~$ python -m memory_profiler benchmark_apply.py
Filename: benchmark_apply.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     9     76.5 MiB     76.5 MiB           1   @profile
    10                                         def apply_version():
    11     76.5 MiB      0.0 MiB           1       n = 1_000_000
    12     92.0 MiB     15.6 MiB           1       df = pd.DataFrame({"A": np.random.rand(n), "B": np.random.rand(n)})
    13    155.2 MiB     63.1 MiB           1       return df.apply(sum_squares, axis=1)

lucas@lucas-XPS-15-9500:~$ python -m memory_profiler benchmark_non_apply.py
Filename: benchmark_non_apply.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     9     76.4 MiB     76.4 MiB           1   @profile
    10                                         def non_apply_version():
    11     76.4 MiB      0.0 MiB           1       n = 1_000_000
    12     92.0 MiB     15.6 MiB           1       df = pd.DataFrame({"A": np.random.rand(n), "B": np.random.rand(n)})
    13    115.3 MiB     23.3 MiB           1       return df["A"] ** 2 + df["B"] ** 2

But why such a difference in performance?

Let’s first understand why the non-apply version is so fast.

Series pow and add operators are vectorized operations, which means the computation is optimized so that an operation is performed on multiple elements simultaneously. It’s a kind of parallelisation at the processor level, which pandas benefits from thanks to its NumPy backend relying on C arrays.

On the other hand, calling DataFrame.apply with a user-defined function has several performance issues:

  • The function is applied in Python space, which is much less efficient than performing vectorized operations on C arrays.
  • A new Series object is created for each row of the dataframe, resulting in substantial memory consumption.
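One way to partially sidestep the second cost is apply’s raw=True option, which passes each row to the function as a plain NumPy array instead of a Series, skipping the per-row Series construction (the function must then index columns by position). A minimal sketch; note that the function still runs in Python space, so this is only a modest mitigation, not a substitute for proper vectorization.

```python
import numpy as np
import pandas as pd

n = 1_000
df = pd.DataFrame({"A": np.random.rand(n), "B": np.random.rand(n)})

def sum_squares_raw(row):
    # With raw=True, row is a plain ndarray: address columns by position.
    return row[0] ** 2 + row[1] ** 2

result = df.apply(sum_squares_raw, axis=1, raw=True)
```

The result matches the Series-based version, but no intermediate Series is built for each row.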

How to improve apply performance?

The first thing you should do is ask yourself whether you really need apply. Most of the time, your function can be rewritten with pandas or NumPy vectorized operations. For instance, if you want to create a column based on a single condition, use np.where instead of defining a function that performs an if-else on every row without benefiting from vectorization.

-- CODE language-python --
df = pd.DataFrame({"code": ["A", "B", "A"], "value": [0, 1, 2]})

def add_fruit_column(row):
    if row["code"] == "A":
        return "APPLE"
    return "LEMON"

# Not efficient
df["fruit"] = df.apply(add_fruit_column, axis=1)

# Vectorized, good practice
df["fruit"] = np.where(df["code"] == "A", "APPLE", "LEMON")


  code  value  fruit
0    A      0  APPLE
1    B      1  LEMON
2    A      2  APPLE
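When the mapping depends on more than one condition, np.select generalizes the same vectorized idea: it takes a list of boolean conditions, evaluated in order, and a matching list of choices. A short sketch (the "C"/"OTHER" values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"code": ["A", "B", "C", "A"], "value": [0, 1, 2, 3]})

# Conditions are checked in order; rows matching none get the default.
conditions = [df["code"] == "A", df["code"] == "B"]
choices = ["APPLE", "LEMON"]
df["fruit"] = np.select(conditions, choices, default="OTHER")
```

This stays fully vectorized no matter how many branches your if-elif chain would have had.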

Accelerating apply by 80x with Terality and zero effort

There might be some cases where the operation you perform is sufficiently complex so that you can’t avoid using apply. In that case, apply slowness might be a pain, and you may want to use Terality to enhance performance.

Terality is a serverless, auto-scalable data processing engine with the same syntax as pandas, which lets you scale your data processing with zero effort. Basically, Terality makes pandas scale and makes pandas run faster. You just need to pip install the Python package and change the import line, and you can process hundreds of GBs of data on our clusters (soon terabytes). As Terality parallelizes computations, you will benefit from huge improvements in execution time. In these benchmarks, pandas operations were run on a Dell XPS 15 9500 (i7-10750H CPU @ 2.60GHz, 16 GB RAM).

-- CODE language-python --
import numpy as np
import pandas as pd
import terality as te

n = 1_000_000
df_pd = pd.DataFrame({
    "a": np.random.randn(n),
    "b": np.random.randn(n),
    "N": np.random.randint(1000, 10_000, n),
})
df_te = te.DataFrame.from_pandas(df_pd)

def f(x):
    return x * (x - 1)

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

df_pd.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
11min 45s ± 6.79 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

df_te.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
9.03 s ± 2.06 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this example, Terality’s apply is 80 times faster than pandas’. With Terality, you can then run identical pandas workflows with far better performance, without worrying about infrastructure constraints.

Interested in joining the team?
