Terality beats Spark and Dask on the large datasets of the H2O benchmark

December 1, 2021
Simon Saliba

Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool. It has become one of the leading Python libraries for data science, alongside TensorFlow, NumPy, SciPy and others. However, how fast you can process your data with pandas depends heavily on your hardware. Data is becoming more and more abundant, and dataset sizes keep growing for both professional and educational use, so users now face gigabytes of data to process on a daily basis. Meanwhile, gains in single-machine processing speed and memory size have slowed over the last few years, so users need alternative solutions for scaling pandas. The best-known solutions are Spark and Dask (see The 11 solutions to make pandas scale and run faster), along with the memory format Apache Arrow.


Recently, a new actor entered the market: Terality. Terality is a fully managed, auto-scalable and very fast version of pandas that scales pandas in one line of code. It is designed as a direct replacement for pandas on medium to large datasets, since it adopts the exact same syntax and API. Terality aims to remove memory and hardware speed limitations: you import the terality library in your notebooks and write the exact same code as with pandas.


To check how Terality compares to the best solutions on the market, we picked what we consider the most rigorous, unbiased and well-known benchmark for pandas alternatives: the H2O benchmark. It consists of timed runs of database-like operations such as join, merge and groupby, on different dataset sizes: 0.5, 5 and 50GB. The final section gives more detail on the experiments and on how to reproduce the Terality results.


Results:

Advanced groupby operations

On the 5GB datasets, Terality comes out well ahead of the other solutions, especially on complicated groupby operations. On the 50GB datasets, Terality was the only solution that completed the experiments with a final result, and impressively fast: 13 minutes and 45 seconds in total for the 5 complex experiments, or 2 minutes and 45 seconds per operation on average. All other solutions ran out of memory on the 50GB datasets (see h2o benchmark).

Missing data on the graphs means that the experiment did not complete successfully because of memory, internal or implementation errors (see h2o benchmark).


Join operations

On the 5GB datasets, Terality is slightly ahead of Spark and well ahead of all other alternatives. On the 50GB datasets, Terality is once again the only solution to reach a final result without errors: 14 minutes and 20 seconds in total for the 5 join experiments, or about 2 minutes and 50 seconds per join operation on average.



As you can see, Terality comes out on top of the existing pandas alternatives. In some specific cases it is comparable to Spark, but it remains far ahead of all the other alternatives on large datasets. Keep in mind that Terality is still, as of November 2021, in its early days. Big performance improvements will come over the next few months.

The better news is that Terality has other advantages besides speed. Let’s cover the most important ones.


A hidden factor: Concurrency


The speed of a solution is a determining factor in the user experience, but it is not the only one. Another crucial factor is concurrency: the ability to use the same machine for other tasks and calculations while your pandas code executes. Since Terality is fully hosted, all the computation is done on Terality’s servers. The amount of local RAM consumed is minimal, which lets data scientists multitask effectively. You can run a merge on 100GB while reading documentation, writing code or sending an email. This is not possible with solutions that run locally, because they compete for the same machine's memory and CPU.



Avoiding memory errors with Terality


Another aspect that is not reflected in the benchmark results above is the number of memory errors and crashes you can suffer when working with local solutions. I personally had a painful experience running a merge on a 5GB dataset with pandas locally, on my MacBook Pro with 16GB of RAM. I had to repeat the experiment several times because of memory failures, and my machine was totally unusable for a couple of hours. These problems do not occur with Terality because it’s auto-scalable and fully managed. With Terality, pandas scales behind the scenes.


Removing barriers to entry: setup and ease of use


One more crucial element of the Terality vs. Spark comparison is how hard each is to set up, launch and maintain. Configuring Spark and getting it to run locally is a hell of a journey, especially for people with little to no Spark or Java experience, and hosting a Spark cluster is a job in itself. Even with a managed Spark solution such as Databricks, EMR or Dataproc, you’ll have to choose your infrastructure, set up options and wait for the clusters to be provisioned.
With Terality, however, you can create your account, get your API key and configure your client in under two minutes. To start using Terality, you only need to import it in your script the same way you’d import pandas. When you read your dataset files, using `read_parquet` for example, Terality makes a copy of the data on its secured cloud infrastructure so that you can start working on it. It has exactly the same syntax as pandas, so there is no new syntax to learn. If you need further assistance, you can check the complete documentation on the company's website and reach out to the team.
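As a minimal sketch of what "same syntax" means in practice, the snippet below runs a groupby followed by a merge. It assumes Terality can be imported under the usual `pd` alias (`import terality as pd`), which is our shorthand for this illustration. The fallback to pandas is only there so the sketch still runs on a machine without Terality installed, and the small in-memory tables are made-up data, not benchmark inputs.

```python
try:
    # Assumption: Terality is installed, configured, and usable under the pd alias.
    import terality as pd
except ImportError:
    # Fallback so the sketch runs locally with plain pandas.
    import pandas as pd

# Made-up illustration data (in a real script this would come from
# a reader such as pd.read_parquet).
sales = pd.DataFrame({"store": ["a", "b", "a", "b"], "amount": [10, 20, 30, 40]})
stores = pd.DataFrame({"store": ["a", "b"], "city": ["Paris", "Lyon"]})

# Groupby: total amount per store.
totals = sales.groupby("store", as_index=False)["amount"].sum()

# Join: attach the city of each store to its total.
report = totals.merge(stores, on="store", how="left")
```

Everything after the import is ordinary pandas code, which is the point: only the first line changes between the two libraries.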


Launched in September 2021, Terality is still in beta. We expect many improvements in the future.



More on the H2O benchmark:


Description:


The H2O benchmark consists of a set of 15 “groupby” and “join” experiments on different dataset sizes. 


Groupby experiments: 10 experiments in total, divided into two categories of 5 experiments each depending on the complexity of the calculation: basic or advanced.


Join experiments: 5 experiments in total.


Every experiment is run twice, and the reported execution time is the average of the two runs, which reduces random variance.


The figures displayed in the article are the sum of the execution times of all experiments in a given category. For example, "Terality took 1373 seconds on Groupby experiments for 50GB" means that the execution times of the 10 groupby calculations on 50GB sum to 1373 seconds, or 137.3 seconds on average per groupby operation.
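The aggregation described above can be sketched in a few lines. The per-run timings in `groupby_runs` are hypothetical values chosen for illustration, and `category_total` is our own helper name; only the 1373-second total and the count of 10 experiments come from the article.

```python
from statistics import mean


def category_total(run_times):
    """Sum, over all experiments, of the average of each experiment's runs."""
    return sum(mean(runs) for runs in run_times.values())


# Hypothetical timings in seconds: (first run, second run) per experiment.
groupby_runs = {
    "q1": (12.0, 14.0),
    "q2": (30.0, 28.0),
    "q3": (51.0, 49.0),
}
total = category_total(groupby_runs)       # figure shown for the category
per_operation = total / len(groupby_runs)  # average per experiment

# Figures quoted in the article for groupby on 50GB.
reported_total_s = 1373
reported_experiments = 10
reported_average_s = reported_total_s / reported_experiments
```

With the article's numbers, `reported_average_s` works out to the 137.3 seconds per groupby operation quoted above.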


Reproduction:


The results of the H2O benchmark are published on the following website: https://h2oai.github.io/db-benchmark/


The benchmark code is open source: https://github.com/h2oai/db-benchmark

To reproduce Terality results, you will need to:

  • clone the GitHub repo
  • duplicate the pandas folder and rename the copy terality
  • install the terality library
  • replace the pandas import line in the copied pandas scripts with the terality import line
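
The folder duplication and import replacement could be scripted roughly as follows. This is a sketch under assumptions: it presumes the repository is already cloned and the terality library installed, that the pandas scripts contain the literal line `import pandas as pd`, and that `import terality as pd` is the matching Terality import. The function name `make_terality_copy` is ours and is not part of the benchmark.

```python
import shutil
from pathlib import Path
from typing import List

PANDAS_IMPORT = "import pandas as pd"      # assumed form of the import in the scripts
TERALITY_IMPORT = "import terality as pd"  # assumed replacement line


def make_terality_copy(repo_root: Path) -> List[Path]:
    """Copy <repo_root>/pandas to <repo_root>/terality and swap the import line.

    Returns the scripts in the copy whose import line was rewritten.
    """
    source = repo_root / "pandas"
    target = repo_root / "terality"
    # copytree raises FileExistsError if the terality folder already exists.
    shutil.copytree(source, target)

    rewritten = []
    for script in sorted(target.rglob("*.py")):
        text = script.read_text(encoding="utf-8")
        if PANDAS_IMPORT in text:
            script.write_text(
                text.replace(PANDAS_IMPORT, TERALITY_IMPORT), encoding="utf-8"
            )
            rewritten.append(script)
    return rewritten
```

Calling `make_terality_copy(Path("db-benchmark"))` from the directory that contains the clone leaves the original pandas folder untouched. The copied scripts may still need other adjustments to run inside the benchmark harness; the sketch covers only the steps listed above.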

For any additional help, you can contact our technical team. We are more than happy to support you.

