Terality Meets Google Colab - How to Get Started in Minutes

February 23, 2022
Guillaume Duvaux

As of today, you can use Terality in your favorite online data science notebook environment - Google Colab. Many Google Colab users have run into memory errors and speed issues with Pandas. Indeed, Pandas doesn’t scale well when processing large datasets above 5 or 10GB.

Terality scales dynamically to process up to terabytes of data and runs up to 100x faster than Pandas. That is why we are making Terality available on Google Colab today.

Please note: If you need to use the apply() function in Colab you’ll need to configure a Python 3.8 environment. There is a 2-minute guide to doing so at the end of this article.

Installing Terality in Google Colab

You can install Terality as you normally would with Pip - just don’t forget the exclamation mark at the front:

-- CODE language-python --
!pip install --upgrade terality

Once installed, configure it by authenticating with your email and API key:

-- CODE language-python --
!terality account configure --email <your_email> --api-key <your_api_key> --overwrite

If you entered your credentials correctly, you’ll see the following message:

Image 1 - Terality configuration success message

And that’s it for the setup! In the following section, we’ll verify that Terality works in Google Colab by accessing a large dataset on Amazon S3 and processing it.

Demo: Processing Data From Amazon S3 with Terality

It’s assumed you have a dataset ready - preferably stored on Amazon S3. The one you see below was created during our Dask and Terality comparison. If you’re following along, feel free to copy the code from the article to create your version:

Image 2 - Contents of an S3 bucket

It’s also assumed you have AWS credentials configured on your PC/laptop. There’s no additional setup if you’re working with a local Google Colab environment, but in case you’re not, refer to this article to learn how to access an S3 bucket from Colab.

Note that you can also import data into Terality from storage services other than AWS S3 (list here).

Import Terality and connect to your data source:

-- CODE language-python --
import terality as te

s3_folder = 's3://<your_bucket>'
with te.disable_cache():
    data = te.read_parquet(f'{s3_folder}/<your_dataset>')


Here’s what the first five rows look like:

Image 3 - Head of the 31.5M row dataset

Let’s go over our first use case - sorting. The following code snippet uses Terality to sort the DataFrame by the column a in descending order:

-- CODE language-python --
data_sorted = data.sort_values(by='a', ascending=False)

There are over 31 million rows, which is no trouble for the processing power behind Terality, which parallelizes and scales pandas:

Image 4 - Time required to sort a DataFrame with Terality in Google Colab

To verify the dataset was sorted:

Image 5 - Sorted dataset
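
If you’d rather check the order in code than by eye, here is a minimal sketch using plain pandas on a tiny made-up DataFrame. The sample data is hypothetical, and whether is_monotonic_decreasing is available on Terality objects is not confirmed here, so treat that part as an assumption:

```python
import pandas as pd

# Tiny made-up stand-in for the real dataset (hypothetical values)
sample = pd.DataFrame({'a': [3, 1, 2], 'b': [10, 20, 30]})

# Same call as above, but on a plain pandas DataFrame
sample_sorted = sample.sort_values(by='a', ascending=False)

# True when no value in column a is larger than the one before it
is_sorted_desc = sample_sorted['a'].is_monotonic_decreasing
print(is_sorted_desc)
```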

Let’s try something more complex. The math_func() function written below takes the maximum value of a row and computes its cosine and sine, which are then returned as a Pandas Series.

-- CODE language-python --
import numpy as np
import pandas as pd

def math_func(row):
    max_ = row.max()
    cos = np.cos(max_)
    sin = np.sin(max_)
    return pd.Series([cos, sin], index=['cos', 'sin'])
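
To get a feel for what math_func() returns, here is a small sketch that runs it with plain pandas on a made-up two-row DataFrame. The sample values and variable names are hypothetical and only serve as an illustration; the function is repeated so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

def math_func(row):
    # Largest value in the row, then its cosine and sine
    max_ = row.max()
    cos = np.cos(max_)
    sin = np.sin(max_)
    return pd.Series([cos, sin], index=['cos', 'sin'])

# Hypothetical sample data: the row maxima are 0.5 and 1.0
sample = pd.DataFrame({'a': [0.0, 1.0], 'b': [0.5, 0.2], 'c': [0.1, 0.3]})

# axis=1 passes each row to math_func; the result has one 'cos' and one 'sin' column
sample_res = sample.apply(math_func, axis=1)
print(sample_res)
```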

We’ll apply the function to all numerical columns:

-- CODE language-python --
math_res = data[['a', 'b', 'c', 'd', 'e']].apply(math_func, axis=1)

Terality finishes in around 30 seconds:

Image 6 - Time required to perform heavy math operations with Terality on Google Colab

Just for reference - Pandas running on an M1 Pro MacBook Pro needs over 3 hours to do the same calculation, which is about 340 times slower.

Here’s what the dataset looks like:

Image 7 - The resulting DataFrame

We won’t dive into further tests, as the point was only to showcase the integration of Terality and Google Colab.

Where To Go From Here

To summarize, Terality and Google Colab work nicely together. The only tricky part is changing the Python version on Colab, which is 3.7 by default (February 2022). If you don’t plan to use the apply() function, there’s no need to set up a local environment. Expect to see this issue resolved by our tech team soon.

What are your thoughts on the Terality and Colab integration? What would you like to see in the future? Let us know.

Need to Use the apply() Function in Colab? Here Is How to Configure a Local Google Colab Environment for Terality

To use the apply() function in Colab you’ll need to configure a Python 3.8 environment. The reason is simple - Colab ships with Python 3.7, and our recommended version is 3.8 or later. We are working on this issue and we’ll update this article as soon as it’s resolved.

There’s no easy way to change the Python version in Google Colab. For that reason, the recommended solution is to create a local Python environment that you’ll connect to Colab. Use the following shell commands to create and activate a virtual environment based on Python 3.8 through Anaconda:

-- CODE language-python --
conda create --name <env_name> python=3.8 -y
conda activate <env_name>

From here, use Pip to install the jupyter_http_over_ws Jupyter extension, then enable it:

-- CODE language-python --
pip install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws

One step left - start a new notebook server and set a flag to explicitly trust WebSocket connections from Colab:

-- CODE language-python --
jupyter notebook \
  --NotebookApp.allow_origin='https://colab.research.google.com' \
  --port=8888

Once the server has started, copy the backend URL used for authentication and create a new notebook in Google Colab. In the top right corner, you’ll see an option to connect to a runtime. Click on the dropdown for more options, and select “Connect to a local runtime”:

Image 1 - Connecting to a local runtime in Colab

A popup window will appear, asking you to enter the local address and authentication token of your Jupyter notebook instance:

Image 2 - Connecting to a local runtime in Colab (2)

After clicking on “Connect”, verify you’re using Python 3.8 by running the following command:

Image 3 - Python version for a local Google Colab environment
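
The exact command from the screenshot isn’t reproduced here. One possible way to run the check from a notebook cell is sketched below; meets_minimum() is a hypothetical helper written for this example, not part of Terality or Colab:

```python
import sys

def meets_minimum(version_info=sys.version_info, minimum=(3, 8)):
    # Compare only (major, minor), e.g. (3, 8) >= (3, 8)
    return tuple(version_info[:2]) >= minimum

# Full version string of the interpreter behind the current runtime
print(sys.version)
print("Python 3.8+ detected" if meets_minimum() else "Python is older than 3.8")
```

If the second line reports an older version, the notebook is most likely still connected to the hosted runtime rather than the local one.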
