In this article, we review the different options available to you as a Data Scientist to make pandas work at scale: whether on larger datasets or faster. We then explain why we built Terality, combining all the best features of these options in a single solution. Data scientists can finally run pandas at scale with our fully serverless engine, by changing just one line of their code.
Big data has been around for some time, but as a Data Scientist, have you ever tried to use your favorite data processing library, pandas, on a dataset of more than 1GB? If you have, you quickly ran into one of these two problems:
Even worse, both problems can happen at the same time: there is nothing more infuriating than to run a join between two large dataframes, wait for 20 minutes, and then realize your notebook crashed and you're good to start rerunning it from the start.
But rejoice, you're not alone, and there are many options to process pandas faster and avoid memory errors.
A natural reflex is to try to solve the problem on your own, using one of the following strategies:
Sub-sampling your dataset is not exactly a way of scaling pandas, but if your data is too big, it's easy: just make it smaller.
While it's always nice to be able to take a quick first glance at your data this way, this solution has many issues. Among those:
Conclusion: sub-sampling your dataset might be sometimes helpful, but I wouldn't call it a solution as it has many critical issues.
Ok, your dataset is too big to fit in memory. But maybe you don't need to load it all at once. Perhaps you can just process each chunk sequentially. If you're lucky, your dataset is already nicely split over several small files. If not, you might know about pd.read_csv(..., iterator=True)
Conclusion: A quick and dirty solution you've probably already tried by yourself. Generally, you do it once, realize all the cons, and move on to something a little more advanced.
Your laptop has eight cores, and it feels like a big waste that pandas only uses one while the seven others sit idle. You have heard of python.multiprocessing, so why not do it yourself?
Conclusion: While it is a sound strategy, you're better off using one of the solutions presented in the following paragraphs, which will implement it for you.
As Data scientists, we are not the first engineers to deal with the problem of processing data at scale, so it's reasonable to try one of the solutions our colleagues use.
Databases, especially relational ones, are very good at joins and a very mature technology.
Conclusion: While it's natural to think of RDB to solve our problem, there are better solutions in this space.
A sound choice, as Spark was explicitly designed to solve our problem of processing data at scale.
It works, no doubt. But what's the point of mixing pandas and Spark? If you're already proficient in Spark and want to work in the Java + Spark ecosystem, go for it! If you're a Data Scientist working in python + pandas, there are better options down the road.
Pyspark is a Spark interface in Python. Now that sounds better. And it is, it does alleviate some of the pains of trying to work with Spark as a Data Scientist working with pandas. Yet, it only alleviates some of the pain. For all the rest, you're still working with a different ecosystem.
Conclusion: Up to now, this still seems the best solution. But that would be a bit sad if that was all there is to it.
Ok, you don’t want to try scaling pandas by yourself, and you don't want to switch your processing to SQL or Spark. You're not alone looking for a solution based on python and pandas. Here are the alternatives in this space:
Dask is probably the most popular and mature solution for distributed pandas. It was one of the first to prove that this problem could be solved directly in Python and inspired many other solutions.
Conclusion: Dask is a good pick. It has been the most mature and reliable way to scale pandas without relying on Spark for the past few years.
Modin is an interesting take on Dask. One of the main features is that they break away from Dask's lazy evaluation framework and stick closer to pandas' standard workflow.
Conclusion: Modin is the new kid on the block, trying to improve on Dask. Currently, I would probably still stick to Dask unless I specifically want one of Modin's features. If so, accept that your experience will be a little rougher around the edges overall.
The crossbreed of Pyspark and Dask, Koalas tries to bridge the best of both worlds.
Conclusion: A sound alternative too. How to choose between Dask and Koalas? I think both solutions have their pros and cons, and in the end, your choice should be guided by the rest of your team's ecosystem. Want to work in a pure Python ecosystem? Go Dask. Also proficient in Spark and have other people running a Spark cluster for you? Give Koalas a try.
Vaex is a little different from the solutions we presented. It is a highly specialized library to address very well a tiny subset of pandas: aggregation and visualization.
All these solutions bring some exciting ideas to the table. All will help you alleviate your pain and help you scale pandas. Unfortunately, they all have their shortcomings, and none of them will check all the features we would like to have to do pandas at scale.
As former data scientists, we felt the pain too and were frustrated with the lack of solutions. So we decided to build it ourselves at Terality. We thought a lot about what we liked about each solution and what we didn't and designed our engine to bring you the best experience for processing data at scale in pandas.
So without further ado, we are introducingTerality:
Terality is the fully hosted solution to process data at scale with pandas, even on large datasets, 10 to 100x faster than pandas, and with zero infrastructure management.
With Terality we have designed the solution we dreamt of as pandas users, focusing on providing the best user experience to data scientists:
Additional resources about Terality:
To check how Terality compares to the best solutions on the market, we picked the most scientific, unbiased and well-known benchmark for pandas alternatives: the h2o benchmark. It consists of a list of timed simulations on different database-like operations like: join, merge, and groupby, run on different dataset sizes: 0.5, 5 and 50GB. You can check the final section where we give more detail on the experiments and how to reproduce the results for Terality.
After weeks of preparation, we’re proud to finally announce Terality hosted demo notebook - the fastest way to take Terality for a test ride, completely free of charge. We wanted to lower the time needed for you to realize what Terality is all about to 1 click! There’s no better way than running a pre-written tutorial on our infrastructure to experience our pandas lightning-fast serverless data processing