As of today, you can use Terality in your favorite data science online notebook environment - Google Colab. Many Google Colab users have experienced the pain of memory errors and speed issues with Pandas. Indeed, Pandas doesn’t scale well when it comes to processing large datasets above 5 or 10 GB.
Terality scales dynamically to process up to terabytes of data and runs up to 100x faster than Pandas. That is why we are making Terality available on Google Colab today.
Please note: If you need to use the apply() function in Colab you’ll need to configure a Python 3.8 environment. There is a 2-minute guide to doing so at the end of this article.
You can install Terality as you normally would with Pip - just don’t forget the exclamation mark at the front:
-- CODE language-python --
!pip install --upgrade terality
Once installed, configure it by authenticating with your email and API key:
-- CODE language-python --
!terality account configure --email <your_email> --api-key <your_api_key> --overwrite
If you entered your credentials correctly, you’ll see the following message:
And that’s it with regards to the setup! In the following section, we’ll verify Terality works through Google Colab by accessing a large dataset on Amazon S3 and processing it.
It’s assumed you have a dataset ready - preferably stored on Amazon S3. The one you see below was created during our Dask and Terality comparison. If you’re following along, feel free to copy the code from the article to create your version:
It’s also assumed you have AWS credentials configured on your PC/laptop. There’s no additional setup if you’re working with a local Google Colab environment, but in case you’re not, refer to this article to learn how to access an S3 bucket from Colab.
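If you’re on a hosted (non-local) runtime, the quickest approach is usually to expose your credentials as environment variables, which boto3 and most AWS clients pick up automatically. Here’s a minimal sketch with hypothetical placeholder values - fine for a quick experiment, but never paste real keys into a notebook you plan to share:

```python
import os

# Hypothetical placeholders - replace with your own values and
# never commit real keys to a shared notebook
os.environ['AWS_ACCESS_KEY_ID'] = '<your_access_key_id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your_secret_access_key>'
```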
Note that you can also import data into Terality from storage services other than AWS S3 (list here).
Import Terality and connect to your data source:
-- CODE language-python --
import terality as te
s3_folder = 's3://<your_bucket>'
with te.disable_cache():
    data = te.read_parquet(f'{s3_folder}/<your_dataset>')
data.head()
Here’s what the first five rows look like:
Let’s go over our first use case - sorting. The following code snippet uses Terality to sort the DataFrame by the column a in descending order:
-- CODE language-python --
%%time
data_sorted = data.sort_values(by='a', ascending=False)
The dataset has over 31 million rows - a trivial workload for the processing power behind Terality, which parallelizes and scales pandas:
To verify the dataset was sorted:
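If you’d rather check programmatically than eyeball the output, pandas exposes `is_monotonic_decreasing` on a Series, and Terality mirrors the pandas API, so the same call works on `data_sorted['a']`. Here’s a minimal sketch using plain pandas and made-up toy data:

```python
import pandas as pd

# Toy stand-in for the real DataFrame; only column 'a' matters here
df = pd.DataFrame({'a': [3.1, 0.5, 7.2, 1.9]})

df_sorted = df.sort_values(by='a', ascending=False)

# True when every value of 'a' is >= the next one, i.e. sorted descending
print(df_sorted['a'].is_monotonic_decreasing)  # → True
```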
Let’s try something more complex. The math_func() function below takes the maximum value of a row and calculates its sine and cosine, which are returned as a Pandas Series.
-- CODE language-python --
import numpy as np
import pandas as pd
def math_func(row):
    max_ = row.max()
    cos = np.cos(max_)
    sin = np.sin(max_)
    return pd.Series([cos, sin], index=['cos', 'sin'])
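To make the function’s output concrete before running it at scale, here it is applied to a tiny pandas DataFrame with made-up values (the real run below uses Terality on the full dataset):

```python
import numpy as np
import pandas as pd

def math_func(row):
    max_ = row.max()
    return pd.Series([np.cos(max_), np.sin(max_)], index=['cos', 'sin'])

toy = pd.DataFrame({'a': [0.0, 1.0], 'b': [np.pi, 0.5]})

# Row 0: max is pi  -> cos ≈ -1.0, sin ≈ 0.0
# Row 1: max is 1.0 -> cos ≈ 0.5403, sin ≈ 0.8415
print(toy.apply(math_func, axis=1))
```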
We’ll apply the function to all numerical columns:
-- CODE language-python --
%%time
math_res = data[['a', 'b', 'c', 'd', 'e']].apply(math_func, axis=1)
Terality finishes in around 30 seconds:
Just for reference - Pandas running on an M1 Pro MacBook Pro needs over 3 hours for the same calculation, roughly 340 times slower.
Here’s what the dataset looks like:
We won’t dive into further tests, as the point was only to showcase the integration of Terality and Google Colab.
To summarize, Terality and Google Colab work nicely together. The only tricky part is changing the Python version on Colab, which is 3.7 by default (February 2022). If you don’t plan to use the apply() function then there’s no need to set up a local environment. Expect to see these issues resolved by our tech team soon.
What are your thoughts on Terality and Colab integration? What would you like to see in the future? Let us know:
Recommended reads:
To use the apply() function in Colab you’ll need to configure a Python 3.8 environment. The reason is simple - Colab ships with Python 3.7, and our recommended version is 3.8 or later. We are working on this issue and we’ll update this article as soon as it’s resolved.
There’s no easy way to change the Python version in Google Colab. For that reason, the recommended solution is to create a local Python environment that you’ll connect to Colab. Use the following shell commands to create and activate a virtual environment based on Python 3.8 through Anaconda:
-- CODE language-python --
conda create --name <env_name> python=3.8 -y
conda activate <env_name>
From here, use Pip to install and enable the jupyter_http_over_ws jupyter extension:
-- CODE language-python --
pip install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
One step left: run a new notebook server with a flag that explicitly trusts WebSocket connections from Colab:
-- CODE language-python --
jupyter notebook \
--NotebookApp.allow_origin='https://colab.research.google.com' \
--port=8888 \
--NotebookApp.port_retries=0
Once the server has started, copy the backend URL used for authentication and create a new notebook in Google Colab. In the top right corner, you’ll see an option to connect to a runtime. Click on the dropdown for more options, and select “Connect to a local runtime”:
A popup window will appear, asking you to enter a local address and authentication token of your Jupyter notebook instance:
After clicking on “Connect”, verify you’re using Python 3.8 by running the following command:
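One simple check (a minimal sketch - any cell that prints the interpreter version will do) is:

```python
import sys

# Should report 3.8.x once the local runtime is connected correctly
print(sys.version)
```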