When processing data with pandas, it is quite common to apply a user-defined function to every row of a DataFrame. The typical way to do this is the apply method. This article focuses on apply with axis=1, which evaluates a function on every row. The axis=0 version evaluates a function on each column and generally does not suffer from the same performance issues, because pandas DataFrames are internally stored column-wise using NumPy arrays.
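As a minimal sketch of the two modes described above, the snippet below calls apply once per row (axis=1) and once per column (axis=0). The DataFrame contents, the column names price and quantity, and the helper row_total are illustrative assumptions, not taken from any dataset discussed in this article.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})


def row_total(row):
    # With axis=1, `row` is a pandas Series holding one row of the DataFrame,
    # indexed by column name.
    return row["price"] * row["quantity"]


# axis=1: the function is called once per row.
row_totals = df.apply(row_total, axis=1)

# axis=0: the function is called once per column, receiving the whole column
# as a Series.
column_sums = df.apply(lambda col: col.sum(), axis=0)
```

Here row_totals has one value per row and column_sums has one value per column. On a DataFrame with many rows, the axis=1 call invokes the Python function once for each row, which is where the cost tends to accumulate.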
To check how Terality compares to the best solutions on the market, we picked a rigorous, unbiased, and well-known benchmark for pandas alternatives: the h2o benchmark. It consists of a list of timed tests of database-like operations such as join, merge, and groupby, run on datasets of three sizes: 0.5 GB, 5 GB, and 50 GB. See the final section for more detail on the experiments and on how to reproduce the results for Terality.
In this article, we review the options available to you as a data scientist for making pandas work at scale, whether that means handling larger datasets or running faster. We then explain why we built Terality, which combines the best features of these options in a single solution. Data scientists can finally run pandas at scale with our fully serverless engine by changing just one line of their code.