In this article, we review the options available to you as a Data Scientist to make pandas work at scale: on larger datasets, faster, or both. We then explain why we built Terality, which combines the best features of these options in a single solution. With our fully serverless engine, data scientists can finally run pandas at scale by changing just one line of their code.
Big data has been around for some time, but as a Data Scientist, have you ever tried to use your favorite data processing library, pandas, on a dataset of more than 1GB? If you have, you quickly ran into one of these two problems: either your kernel crashed with a memory error, or the computation took forever.
Even worse, both problems can happen at the same time: there is nothing more infuriating than running a join between two large dataframes, waiting 20 minutes, and then realizing your notebook crashed and you have to start over.
But rejoice: you're not alone, and there are many options to make pandas faster and avoid memory errors.
A natural reflex is to try to solve the problem on your own, using one of the following strategies:
Sub-sampling your dataset is not exactly a way of scaling pandas, but if your data is too big, it's easy: just make it smaller.
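As a quick illustration, here is a minimal sketch of the two usual ways to sub-sample (the file name, sampling rate, and row count are hypothetical):

```python
import pandas as pd

# Option 1: only read the first rows of the file.
df = pd.read_csv("transactions.csv", nrows=100_000)

# Option 2: keep a 10% random sample (note that this still
# loads the full file into memory once before sampling).
df = pd.read_csv("transactions.csv").sample(frac=0.1, random_state=0)
```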
While it's always nice to be able to take a quick first glance at your data this way, this solution has many issues. Among those:
Conclusion: sub-sampling your dataset might sometimes be helpful, but I wouldn't call it a solution, as it has too many critical issues.
Ok, your dataset is too big to fit in memory. But maybe you don't need to load it all at once: perhaps you can process each chunk sequentially. If you're lucky, your dataset is already nicely split over several small files. If not, you might know about pd.read_csv(..., iterator=True).
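In practice, a chunked aggregation might look like this sketch (the file and column names are hypothetical):

```python
import pandas as pd

# Stream the file in chunks of one million rows instead of loading it whole.
total = 0.0
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    # Aggregate each chunk, then combine the partial results.
    total += chunk["amount"].sum()

print(f"Total amount: {total}")
```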
Pros:
Cons:
Conclusion: A quick and dirty solution you've probably already tried by yourself. Generally, you do it once, realize all the cons, and move on to something a little more advanced.
Your laptop has eight cores, and it feels like a big waste that pandas only uses one while the other seven sit idle. You have heard of Python's multiprocessing module, so why not do it yourself?
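A minimal sketch of the do-it-yourself approach (the file, columns, and per-partition function are hypothetical):

```python
import multiprocessing as mp

import numpy as np
import pandas as pd

def process(chunk: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for your actual per-partition computation.
    return chunk.assign(total=chunk["price"] * chunk["quantity"])

if __name__ == "__main__":
    df = pd.read_csv("orders.csv")
    # Split the dataframe into one partition per core...
    parts = np.array_split(df, mp.cpu_count())
    # ...process the partitions in parallel, then reassemble.
    with mp.Pool() as pool:
        df = pd.concat(pool.map(process, parts))
```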
Pros:
Cons:
Conclusion: While it is a sound strategy, you're better off using one of the solutions presented in the following paragraphs, which will implement it for you.
As Data Scientists, we are not the first engineers to deal with processing data at scale, so it's reasonable to try one of the solutions our colleagues use.
Databases, especially relational ones, are very good at joins and are a very mature technology.
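As an illustration, here is a minimal sketch that offloads a join to SQLite (the table and column names are hypothetical):

```python
import sqlite3

import pandas as pd

orders = pd.DataFrame({"user_id": [1, 2], "amount": [10.0, 25.0]})
users = pd.DataFrame({"id": [1, 2], "country": ["FR", "US"]})

# Push both dataframes into SQLite and let the database do the join.
conn = sqlite3.connect("local.db")
orders.to_sql("orders", conn, if_exists="replace", index=False)
users.to_sql("users", conn, if_exists="replace", index=False)

merged = pd.read_sql(
    "SELECT o.user_id, o.amount, u.country "
    "FROM orders o JOIN users u ON o.user_id = u.id",
    conn,
)
```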
Pros:
Cons:
Conclusion: While it's natural to think of relational databases to solve our problem, there are better solutions in this space.
A sound choice, as Spark was explicitly designed to solve our problem of processing data at scale.
Pros:
Cons:
Conclusion:
It works, no doubt. But what's the point of mixing pandas and Spark? If you're already proficient in Spark and want to work in the Java + Spark ecosystem, go for it! If you're a Data Scientist working in Python + pandas, there are better options down the road.
PySpark is Spark's Python interface. Now that sounds better. And it is: it alleviates some of the pain of trying to work with Spark as a Data Scientist used to pandas. But only some of it; for all the rest, you're still working with a different ecosystem.
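For reference, a minimal PySpark sketch (the file and column names are hypothetical):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; in production this would point at a cluster.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Spark reads and aggregates lazily, distributing work across executors.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)
counts = df.groupBy("customer_id").count()

# Bring a small result back as a pandas dataframe.
counts.toPandas()
```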
Pros:
Cons:
Conclusion: Up to now, this still seems like the best solution. But it would be a bit sad if that were all there was to it.
Ok, you don't want to try scaling pandas by yourself, and you don't want to switch your processing to SQL or Spark. You're not alone in looking for a solution based on Python and pandas. Here are the alternatives in this space:
Dask is probably the most popular and mature solution for distributed pandas. It was one of the first to prove that this problem could be solved directly in Python and inspired many other solutions.
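A minimal sketch of an aggregation in Dask (the file and column names are hypothetical):

```python
import dask.dataframe as dd

# Dask splits the files into partitions and builds a lazy task graph
# instead of loading everything into memory at once.
df = dd.read_csv("transactions-*.csv")

# The API mirrors pandas, but nothing runs until .compute() is called.
result = df.groupby("customer_id")["amount"].sum().compute()
```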
Pros:
Cons:
Conclusion: Dask is a good pick. It has been the most mature and reliable way to scale pandas without relying on Spark for the past few years.
Modin is an interesting take on Dask. One of its main features is that it breaks away from Dask's lazy evaluation model and sticks closer to pandas' standard, eager workflow.
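A minimal sketch (the file and column names are hypothetical); note that the only change from pandas is the import:

```python
# Swap the pandas import for Modin's; the rest of the code is unchanged.
import modin.pandas as pd

# Evaluation is eager, just like pandas: no lazy graph, no .compute() step.
df = pd.read_csv("transactions.csv")
result = df.groupby("customer_id")["amount"].sum()
```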
Pros:
Cons:
Conclusion: Modin is the new kid on the block, trying to improve on Dask. Currently, I would probably still stick to Dask unless I specifically want one of Modin's features. If so, accept that your experience will be a little rougher around the edges overall.
A crossbreed of PySpark and Dask, Koalas implements a pandas-like API on top of Spark, trying to bridge the best of both worlds.
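A minimal sketch using the standalone koalas package (the file and column names are hypothetical):

```python
import databricks.koalas as ks

# Koalas exposes a pandas-like API while Spark executes the work underneath.
df = ks.read_csv("transactions.csv")
result = df.groupby("customer_id")["amount"].sum()
```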
Pros:
Cons:
Conclusion: A sound alternative too. How to choose between Dask and Koalas? I think both solutions have their pros and cons, and in the end, your choice should be guided by the rest of your team's ecosystem. Want to work in a pure Python ecosystem? Go Dask. Also proficient in Spark and have other people running a Spark cluster for you? Give Koalas a try.
Vaex is a little different from the solutions presented so far. It is a highly specialized library that addresses a small subset of pandas very well: aggregation and visualization.
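A minimal sketch of a Vaex aggregation (the file and column names are hypothetical):

```python
import vaex

# Vaex memory-maps the file, so aggregations can run on datasets larger
# than RAM. HDF5 is the format Vaex handles best.
df = vaex.open("transactions.hdf5")
result = df.groupby("customer_id", agg=vaex.agg.sum("amount"))
```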
Pros:
Cons:
Conclusion: One of the most popular specialized libraries, but not a general solution. We won't expand here on the other specialized libraries (pandarallel, swifter, ...).
All these solutions bring exciting ideas to the table, and all of them will help alleviate your pain and scale pandas. Unfortunately, each has its shortcomings, and none checks all the boxes for running pandas at scale.
As former data scientists, we felt the pain too and were frustrated by the lack of a complete solution. So we decided to build it ourselves at Terality. We thought hard about what we liked and disliked about each existing option, and designed our engine to bring you the best experience for processing data at scale with pandas.
So without further ado, we are introducing Terality:
Terality is a fully hosted solution for processing data at scale with pandas: 10 to 100x faster than pandas, even on large datasets, and with zero infrastructure management.
With Terality we have designed the solution we dreamt of as pandas users, focusing on providing the best user experience to data scientists:
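In practice, the one-line change mentioned above looks like this minimal sketch (the dataset path is hypothetical, and we assume the drop-in import from our documentation, here aliased to pd):

```python
import terality as pd  # the only change: replaces "import pandas as pd"

# Everything below is standard pandas syntax, executed on Terality's
# serverless engine instead of your machine.
df = pd.read_csv("s3://my-bucket/transactions.csv")
result = df.groupby("customer_id")["amount"].sum()
```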
Want to get started? Sign up at www.terality.com.
Additional resources about Terality:
As of today, you can also use Terality in your favorite online data science notebook environment: Google Colab. Many Google Colab users have experienced memory errors and speed issues with pandas, which doesn't scale well when processing datasets above 5 or 10GB.