As of today, you can use Terality in Google Colab, one of the most popular online notebook environments for data science. Many Colab users have been hitting memory errors and speed issues with pandas, which doesn't scale well when processing datasets above 5 or 10 GB.
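Getting started in a Colab cell looks roughly like this. A minimal sketch: the import swap is Terality's documented usage pattern, but the dataset URL and column name are placeholders, and you will also need to configure your Terality account credentials first.

```python
# Run in a Colab cell: install the Terality client.
!pip install terality

# Swap the pandas import for Terality; the rest of your code is unchanged.
import terality as pd

# From here on, the familiar pandas API runs on Terality's hosted engine.
df = pd.read_csv("https://example.com/large_dataset.csv")  # placeholder URL
print(df.groupby("country").size())  # illustrative column name
```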
We cut execution time by around 30% when running functions on small datasets.
While processing data with pandas, it is quite common to apply a user-defined function to every row of a DataFrame, typically with the apply method. This article focuses on apply with axis=1, which evaluates a function on every row. The axis=0 version evaluates a function on each column and does not suffer from the same performance issues, because pandas DataFrames are stored column-wise internally as NumPy arrays.
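As a toy illustration of the two axes (not from the article itself):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 25.0, 7.5], "quantity": [3, 1, 12]})

# axis=1: the function receives one row (a Series) at a time.
# This is the slow path discussed here: pandas must materialize each
# row, which works against its column-wise storage.
df["total"] = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# axis=0: the function receives one column at a time, which maps
# directly onto the underlying NumPy arrays and stays fast.
col_means = df[["price", "quantity"]].apply(lambda col: col.mean(), axis=0)
```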
Terality now caches the results of computations. In other words, running the same function twice on the same inputs returns the second result almost instantly.
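A sketch of what this looks like in practice, assuming the usual `import terality as pd` drop-in and a hypothetical dataset path; the caching is transparent, and the timings in the comments are illustrative:

```python
import time
import terality as pd  # drop-in replacement for pandas

df = pd.read_parquet("s3://my-bucket/events.parquet")  # hypothetical path

start = time.time()
result = df.groupby("user_id").size()  # first run: computed remotely
print(f"first run:  {time.time() - start:.2f}s")

start = time.time()
result = df.groupby("user_id").size()  # second run: served from the cache
print(f"second run: {time.time() - start:.2f}s")  # near-instant
```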
To check how Terality compares to the best solutions on the market, we picked the most scientific, unbiased, and well-known benchmark for pandas alternatives: the h2o benchmark. It consists of a series of timed database-like operations (join, merge, and groupby) run on datasets of three sizes: 0.5, 5, and 50 GB. The final section gives more detail on the experiments and explains how to reproduce the results for Terality.
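For a sense of what the benchmark measures, here is the shape of one of its groupby queries expressed in pandas. The id1/v1 column names follow the benchmark's synthetic schema and the file name is one of its generated datasets; this is an illustration, not our benchmarking code:

```python
import pandas as pd

# The h2o benchmark's synthetic tables use id1..id6 as grouping keys
# and v1..v3 as value columns; a representative question is
# "sum of v1 by id1".
df = pd.read_csv("G1_1e7_1e2_0_0.csv")  # one of the benchmark's generated files
answer = df.groupby("id1", as_index=False).agg({"v1": "sum"})
```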
After weeks of preparation, we're proud to finally announce the Terality hosted demo notebook: the fastest way to take Terality for a test ride, completely free of charge. We wanted to reduce the time it takes to see what Terality is all about to a single click. There's no better way to experience our lightning-fast, serverless pandas data processing than running a pre-written tutorial on our infrastructure.
You can now export DataFrames into multiple files with two methods added on top of the pandas API: to_csv_folder and to_parquet_folder. Exporting your data as one huge file is often a problem for the next steps in your workflow; these methods split the output across a folder of files instead.
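A minimal sketch of how this looks. The method names are the ones announced above; the destination paths are placeholders, and any extra parameters (such as file sizing) are assumptions, so check the docs for the exact signatures:

```python
import terality as pd  # drop-in replacement for pandas

df = pd.read_parquet("s3://my-bucket/input/")  # hypothetical source

# Instead of one huge file, write the DataFrame as a folder of parts.
df.to_csv_folder("s3://my-bucket/output/csv/")
df.to_parquet_folder("s3://my-bucket/output/parquet/")
```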
In this article, we review the options available to you as a Data Scientist to make pandas work at scale, whether to handle larger datasets or to run faster. We then explain why we built Terality, which combines the best features of these options in a single solution. With our fully serverless engine, data scientists can finally run pandas at scale by changing just one line of their code.
Whether it is concatenating several datasets from different CSV files or merging sets of aggregated data from different Google Analytics accounts, combining data from various sources is critical to drawing the right conclusions and extracting the most value from data analytics.
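In pandas terms, the two patterns mentioned here look like this (file names and join keys are placeholders):

```python
import pandas as pd

# Stacking several CSV exports with the same columns into one DataFrame.
frames = [pd.read_csv(path) for path in ["jan.csv", "feb.csv", "mar.csv"]]
combined = pd.concat(frames, ignore_index=True)

# Merging aggregated data from two sources on a shared key.
accounts_a = pd.read_csv("analytics_account_a.csv")
accounts_b = pd.read_csv("analytics_account_b.csv")
merged = accounts_a.merge(accounts_b, on="date", how="outer")
```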
Terality can now provision infrastructure in under 500 milliseconds, so you can get started with far less latency.
In our first benchmarks, Terality is on average 10x faster than pandas on files of 1 GB and above.
We have released a side engine that will run all the functions we haven't optimized and parallelized yet. This means that you can execute (almost) any pandas function with Terality, even the ones we haven't hand-optimized. You don't have to do anything to use this side engine: our scheduler will automatically use it when needed.
This side engine keeps Terality's promise: it's fully serverless, scalable, and compatible with the pandas API.
At Terality, we want to be transparent with our users and keep our community updated regularly. That's why we will publish product updates every two weeks or so. First up: our documentation is online!
Terality is a distributed data processing engine that lets Data Scientists execute all their pandas code 100 times faster, even on terabytes of data, by changing only one line of code. Terality is hosted, so there is no infrastructure to manage, and memory is virtually unlimited. Data Scientists can be up and running within minutes, even on an existing pandas codebase.