The 11 solutions to make pandas scale and run faster

November 11, 2021
Adrien Hoarau


In this article, we review the different options available to you as a Data Scientist to make pandas work at scale, whether that means handling larger datasets or running faster. We then explain why we built Terality, which combines the best features of these options in a single solution. Data scientists can finally run pandas at scale with our fully serverless engine, by changing just one line of their code.

Big data has been around for some time, but as a Data Scientist, have you ever tried to use your favorite data processing library, pandas, on a dataset of more than 1GB? If you have, you quickly ran into one of these two problems:

  • out of memory errors, sometimes even crashing your whole notebook.
  • computations getting slow, going from a few dozen seconds to dozens of minutes.

Even worse, both problems can happen at the same time: there is nothing more infuriating than running a join between two large dataframes, waiting for 20 minutes, and then realizing your notebook has crashed and you have to rerun everything from the start.

But rejoice, you're not alone, and there are many options to run pandas faster and avoid memory errors.

1. Scaling pandas: The do-it-yourself solutions

A natural reflex is to try to solve the problem on your own, using one of the following strategies:

Sampling

Sub-sampling your dataset is not exactly a way of scaling pandas, but if your data is too big, it's easy: just make it smaller.

While it's always nice to be able to take a quick first glance at your data this way, this solution has many issues. Among those:

  • You will discover new critical insights that you missed when you later start loading more of the data. 
  • You will run into the problem of joining two tables: how will you be able to match the samples from both tables?
  • What’s more, did you load the whole data at least once to do the sampling? If yes, that means you still had to find a way to load and process the data at least once. If not, are you confident that your data is homogeneous over time and that you didn't miss all the interesting changes happening later?
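As an illustration of the last point, here is a minimal sketch of one way to draw a random sample while reading a CSV, so that the full file is never held in memory. The helper name read_csv_sample and the tiny in-memory dataset are made up for this example; the skiprows callable is standard pandas. Note that the sample is only approximately the requested fraction, and that it does nothing for the join problem described above.

```python
import io
import random

import pandas as pd


def read_csv_sample(source, fraction, seed=0):
    """Read roughly `fraction` of the rows of a CSV (hypothetical helper)."""
    rng = random.Random(seed)
    # Row 0 is the header and is always kept; every other row is skipped
    # with probability (1 - fraction), so skipped rows are never loaded.
    return pd.read_csv(
        source,
        skiprows=lambda i: i > 0 and rng.random() > fraction,
    )


# Made-up in-memory CSV standing in for a large file on disk.
csv_text = "user_id,amount\n" + "\n".join(f"{i},{i * 10}" for i in range(1000))
sample = read_csv_sample(io.StringIO(csv_text), fraction=0.1)
print(len(sample), "rows kept out of 1000")
```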

Conclusion: sub-sampling your dataset might sometimes be helpful, but I wouldn't call it a solution, as it has many critical issues.

Iterating over chunks

OK, your dataset is too big to fit in memory. But maybe you don't need to load it all at once. Perhaps you can just process each chunk sequentially. If you're lucky, your dataset is already nicely split over several small files. If not, you might know about pd.read_csv(..., iterator=True) or its chunksize argument.
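Here is a minimal sketch of that pattern, assuming a simple aggregation that can be computed chunk by chunk and merged at the end. The column names and the in-memory CSV are made up for illustration; in practice you would pass a file path and a much larger chunksize.

```python
import io

import pandas as pd

# Made-up CSV standing in for a file too large to load at once.
csv_text = "store,sales\n" + "\n".join(f"{'ab'[i % 2]},{i}" for i in range(10))

partials = []
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one big DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partials.append(chunk.groupby("store")["sales"].sum())

# Merge the per-chunk results: the same store can appear in several
# chunks, so the partial sums are summed again per store.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This works because a sum of partial sums is still the right sum. Operations without that property are where the approach gets painful.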

Pros:

  • Simple enough to get started by yourself.

Cons:

  • It’s not faster. It’s actually slower, since you're repeatedly loading and saving data each time you want to process it.
  • Tedious to write specific code for iterating in a subtly different way for each of the functions you run into.
  • Flat out does not work for complicated functions: this won't help you join two dataframes or explore the number of unique values.

Conclusion: A quick and dirty solution you've probably already tried by yourself. Generally, you do it once, realize all the cons, and move on to something a little more advanced.

Parallelizing pandas yourself

Your laptop has eight cores, and it feels like a big waste that pandas only uses one while the seven others sit idle. You have heard of Python's multiprocessing module, so why not do it yourself?
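To give an idea of what "doing it yourself" involves, here is a minimal sketch for a single operation, a grouped sum, assuming columns named key and value. All function names are made up for this example. Every other pandas function you want to parallelize needs its own split and merge logic.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def split_rows(df, n_parts):
    """Cut the DataFrame into n_parts contiguous row slices."""
    bounds = np.linspace(0, len(df), n_parts + 1, dtype=int)
    return [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


def partial_sum(part):
    """Work done inside each worker process on one slice."""
    return part.groupby("key")["value"].sum()


def combine(partials):
    """Merge per-slice results: sum the partial sums per key."""
    return pd.concat(partials).groupby(level=0).sum()


def parallel_group_sum(df, n_workers=4):
    """Grouped sum spread over n_workers processes."""
    with mp.Pool(n_workers) as pool:
        partials = pool.map(partial_sum, split_rows(df, n_workers))
    return combine(partials)


# Usage, from a script and under an `if __name__ == "__main__":` guard:
#     df = pd.DataFrame({"key": list("abcd") * 250, "value": range(1000)})
#     print(parallel_group_sum(df))
```

Each slice is pickled and sent to a worker, so for cheap operations the copying can cost more than the computation saves.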

Pros:

  • Sure, it could theoretically work.

Cons:

  • Laborious and time-consuming. Don't you have something more useful to do for the next few months (or years)?

Conclusion: While it is a sound strategy, you're better off using one of the solutions presented in the following paragraphs, which will implement it for you.

2. Scaling pandas: using other techs

As Data Scientists, we are not the first engineers to deal with the problem of processing data at scale, so it's reasonable to try one of the solutions our colleagues use.

Relational Databases/SQL

Databases, especially relational ones, are very good at joins and a very mature technology.
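As a small illustration, here is a sketch that pushes a join and an aggregation into a database and only brings the result back into pandas. It uses an in-memory SQLite database from the standard library purely for convenience; the tables and columns are made up, and a real setup would point at an actual database server.

```python
import sqlite3

import pandas as pd

# Made-up tables for the example.
orders = pd.DataFrame(
    {"customer_id": [1, 2, 2, 3], "amount": [10.0, 5.0, 7.5, 3.0]}
)
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "country": ["FR", "US", "FR"]}
)

conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
customers.to_sql("customers", conn, index=False)

# The join and the aggregation run inside the database engine;
# pandas only receives the small result table.
query = """
    SELECT c.country, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    GROUP BY c.country
    ORDER BY c.country
"""
totals = pd.read_sql_query(query, conn)
conn.close()
print(totals)
```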

Pros:

  • Mature and reliable technology.
  • You probably already have some familiarity with it.

Cons:

  • You're not doing pandas anymore at all: you'll miss most of the API you're used to.
  • Writing schemas for each of your tables and dealing with all the issues of inconsistent types between your database table and your pandas DataFrame
  • Infrastructure: setting up and managing the database.

Conclusion: While it's natural to think of a relational database to solve our problem, there are better solutions in this space.

Spark

A sound choice, as Spark was explicitly designed to solve our problem of processing data at scale.

Pros:

  • Spark is both relatively modern and mature enough.

Cons:

  • JVM-based (Spark is written in Scala and runs on the Java Virtual Machine): from the foundations to small details, you'll be consistently reminded that you're not doing pandas in Python.
  • Complicated: it's not a library. It's a whole framework. It will enable you to do many more things than you can think of, but there is a steep price to pay for it.
  • Different API than pandas.
  • Infrastructure: setting up, running, scaling, and maintaining a Spark cluster is a job on its own.

Conclusion: It works, no doubt. But what's the point of mixing pandas and Spark? If you're already proficient in Spark and want to work in the Java + Spark ecosystem, go for it! If you're a Data Scientist working in Python + pandas, there are better options down the road.

PySpark

PySpark is a Spark interface in Python. Now that sounds better. And it is: it does alleviate some of the pains of trying to work with Spark as a Data Scientist working with pandas. Yet it only alleviates some of the pain. For all the rest, you're still working with a different ecosystem.

Pros:

  • All of the above, plus it’s much easier coming from Python

Cons:

  • You're still trying to fit your pandas use case into a Java + Spark ecosystem.

Conclusion: Up to now, this still seems to be the best solution. But it would be a bit sad if that were all there was to it.

3. Scaling pandas: using open-source solutions

OK, you don’t want to try scaling pandas by yourself, and you don't want to switch your processing to SQL or Spark. You're not alone in looking for a solution based on Python and pandas. Here are the alternatives in this space:

Dask

Dask is probably the most popular and mature solution for distributed pandas. It was one of the first to prove that this problem could be solved directly in Python and inspired many other solutions.

Pros:

  • Based on Python and pandas
  • Can run on your laptop without any infrastructure to manage
  • It uses the power of all of your CPUs

Cons:

  • Different syntax than pandas: has a lazy evaluation workflow that does not work well for Data Scientist exploration and processing work.
  • Dask does not provide all of the pandas API or support all pandas styles.
  • When Dask runs on your laptop, you are still limited to the memory you have available.
  • Dask can be run on a cluster, but you'll have to set up and manage it yourself.

Conclusion: Dask is a good pick. It has been the most mature and reliable way to scale pandas without relying on Spark for the past few years.

Modin

Modin is an interesting take on Dask. One of its main features is that it breaks away from Dask's lazy evaluation framework and sticks closer to pandas' standard workflow.

Pros:

  • Modin tries to improve on Dask.
  • It’s modular and experimental.

Cons:

  • Not all functions are available.
  • Not very mature yet.
  • Same limitations as Dask for running on clusters.

Conclusion: Modin is the new kid on the block, trying to improve on Dask. Currently, I would probably still stick to Dask unless I specifically want one of Modin's features. If so, accept that your experience will be a little rougher around the edges overall.

Koalas

A crossbreed of PySpark and pandas, Koalas implements the pandas DataFrame API on top of Spark and tries to bring together the best of both worlds.

Pros:

  • Closer to pandas than PySpark
  • Great solution if you want to combine pandas and Spark in your workflow

Cons:

  • Not as close to pandas as Dask.
  • Infrastructure: can run on a cluster, but then runs into the same infrastructure issues as Spark.

Conclusion: A sound alternative too. How to choose between Dask and Koalas? I think both solutions have their pros and cons, and in the end, your choice should be guided by the rest of your team's ecosystem. Want to work in a pure Python ecosystem? Go Dask. Also proficient in Spark and have other people running a Spark cluster for you? Give Koalas a try.

Vaex

Vaex is a little different from the solutions we have presented so far. It is a highly specialized library that addresses a small subset of pandas very well: aggregation and visualization.

Pros:

  • The best at its specific use case

Cons:

  • Not a general solution, it only addresses a niche

Conclusion: One of the most popular of the specialized libraries, but not a general solution. We also won't expand here on other specialized libraries (pandarallel, swifter, ...)

4. Terality: the ultimate solution to scale pandas

All these solutions bring some exciting ideas to the table. All will help you alleviate your pain and help you scale pandas. Unfortunately, they all have their shortcomings, and none of them checks all the boxes we would like for running pandas at scale.

As former data scientists, we felt the pain too and were frustrated with the lack of solutions. So we decided to build it ourselves at Terality. We thought a lot about what we liked about each solution and what we didn't and designed our engine to bring you the best experience for processing data at scale in pandas.

So without further ado, we are introducing Terality:

Terality is the fully hosted solution to process data at scale with pandas, even on large datasets, 10 to 100x faster than pandas, and with zero infrastructure management.

With Terality we have designed the solution we dreamt of as pandas users, focusing on providing the best user experience to data scientists:

  • Speed: Terality processes pandas faster by parallelizing its execution behind the scenes, accelerating it by 10 to 100x compared to pandas (our first benchmark here).
  • Infrastructure: Terality is fully serverless. There is zero infrastructure to set up, manage or scale. Terality auto-scales infrastructure for you behind the scenes.
  • Syntax: Terality has the exact same syntax as pandas. No learning curve for pandas users.
  • pandas API coverage: Terality offers all the pandas functions, and is fully compatible with pandas, from data types to syntax.
  • Security: Terality provides secure isolation. Your data is fully protected and isolated, whether in transit or during computations.
  • Time-to-value: It takes 30 seconds to get started. Create your free account on www.terality.com (no credit card required), pip install the package, and start executing pandas code at scale. It’s as simple as using pandas.

Want to get started? Sign up on www.terality.com.


Additional resources about Terality:

  • We provide a Terality tutorial so that you can try Terality in 1 minute.
  • We also support classic pandas users with pandas guides, available here.


