One Billion Taxi Rides: SigScalr vs ClickHouse

            We are excited to announce that after conducting a performance test using the popular NYC Taxi dataset, SigScalr emerged a staggering 37x faster than ClickHouse. The speed advantage of SigScalr over ClickHouse stems from its advanced compression and query optimization techniques. Further bolstering its performance is SigScalr's distributed framework, which seamlessly scales horizontally, catering to high-demand tasks.
Let’s get into the nitty-gritty of the dataset, nodes used, detailed queries, and test results.

NYC Taxi Dataset Analysis

The NYC Taxi Dataset is a massive dataset of over 1.1 Billion trips made by NYC Taxis, Uber,and other for-hire vehicles, all originating in New York City since 2009.

The queries were as follows:

  • Q1: Find the total number of rides in each borough.
  • Q2: Find the average ride time for each cab type.
  • Q3: Find the most popular pick-up and drop-off locations.
  • Q4: Find the top 10 most popular ride-hailing apps.

We found that SigScalr was significantly faster than ClickHouse on all queries.
This is probably because SigScalr is better at handling complex queries involving large amounts of data.
One reason for the difference in speed is that SigScalr uses dynamic columnar compression with adaptive microindexes and an innovative agile aggregation tree algorithm.

NYC Taxi Dataset

The dataset is divided into two tables:

  • trips: This table contains all yellow and green taxi trips. Each trip has a cab_type_id, which references the cab_types table and refers to one of yellow or green.
  • fhv_trips: This table contains all for-hire vehicle trip records, including ride-hailing apps Uber, Lyft, Via, and Juno.

The dataset also includes several other tables, such as:

  • fhv_bases: This table maps fhv_trips to base names and "doing business as" labels, which include ride-hailing app names.
  • nyct2010: This table contains NYC census tracts plus the Newark Airport. It also maps census tracts to NYC's official neighborhood tabulation areas.
  • taxi_zones: This table contains the TLC's official taxi zone boundaries. Starting in July 2016, the TLC no longer provides pickup and dropoff coordinates. Instead, each trip comes with taxi zone pickup and dropoff location IDs.
  • central_park_weather_observations: This table has a summary of weather data by date.

Source: GitHub
Here’s the Github Link for New York City Taxi and For-Hire Vehicle Data

Node Types Used

SigScalr

SigScalr is a log management solution that is faster, better, and more efficient than most log management solutions. It runs on a very small form factor and supports a variety of ingestion integrations and various query languages.

Node Type Used - im4gn.2xlarge

  • 8 vCPUs
  • 32 GB of RAM
  • 3.5 TB of storage
  • 10 Gigabit networking

This is the link where ClickHouse performance numbers are available: 1.1 Billion Taxi Rides on ClickHouse & an Intel Core i5.


Setting Up Servers

SigScalr Setup

Setup the Node and disk volumes

  1. Create an EC2 instance of type im4gn.2xlarge.
  2. Attach a disk volume of at least 1 TB to the EC2 instance.
  3. Mount the disk volume to the /data directory.

Run SigScalr

  1. Download the SigScalr binary from the SigScalr website.
  2. Make the SigScalr binary executable.
  3. Run the SigScalr binary with the following command:
./hyperion -d <http://localhost:8081/elastic> -p 1

Prepare the Data

  1. Download the NYC taxi trip data from the AWS datasets website.
  2. Uncompress the data files.
  3. Convert the data files to JSON format using the following command:
go run cmd/utils/converter.go --input trips_X --output trips_X.json {X == 0…19}

Ingest the data into SigScalr

For each data file, run the following command:

go run main.go ingest -d http://localhost:8081/elastic -p 1 -g file -x trips_X.json -t 50_000_000 -b 500 {X == 0…19}

Run the queries

ClickHouse Setup

Mark Litwintschik used ClickHouse, a distributed column-oriented database management system, to analyze the NYC taxi trip data.

For a more detailed explanation of Mark's steps check out the blog here: 1.1 Billion Taxi Rides on ClickHouse & an Intel Core i5 .

Benchmarking ClickHouse:

  • Mark ran SQL commands via the ClickHouse CLI to benchmark the database's performance.
  • He measured the lowest query times during a series of runs to assess ClickHouse's speed.

Introduction of Results

Query 1

Group count(*) by cab_type

This query groups the data by cab type and counts the number of trips for each type. This query can be used to understand the distribution of trips by cab type.

SELECT cab_type, count(*)
FROM trips_mergetree 
GROUP BY cab_type

Query 2

Group avg(total_amount) by passenger_count

This query groups the data by passenger count and averages the total amount for each group. This query can be used to understand the average spending per passenger count.

SELECT passenger_count, avg(total_amount)
FROM trips_mergetree
GROUP BY passenger_count

Query 3

Group count(*) by passenger_count, pickup_date

This query groups the data by passenger count and pickup date and counts the number of trips for each group. This query can be used to understand the distribution of trips by passenger count and pickup date.

SELECT passenger_count, toYear(pickup_date)
AS year, count(*) 
FROM trips_mergetree 
GROUP BY passenger_count, year

Query 4

Group count(*) by passenger_count, pickup_date, trip_distance

This query groups the data by passenger count, pickup date, and trip distance and counts the number of trips for each group. This query can be used to understand the distribution of trips by passenger count, pickup date, and trip distance.

SELECT passenger_count, toYear(pickup_date) 
AS year, round(trip_distance) 
AS distance, count(*) 
FROM trips_mergetree 
GROUP BY passenger_count, year, distance 
ORDER BY year, count(*) DESC

SigScalr vs ClickHouse-single-node: 4x - 38x faster

Query Type ClickHouse Single Node SigScalr Comparison
Q1 1034 ms 131 ms 8x faster
Q2 3058 ms 168 ms 18x faster
Q3 5354 ms 142 ms 38x faster
Q4 12748 ms 3185 ms 4x faster

SigScalr vs ClickHouse-Cluster: 0.3x - 19x faster

Query Type 108-core ClickHouse Cluster SigScalr Comparison
Q1 2449 ms 131 ms 19x faster
Q2 691 ms 168 ms 4x faster
Q3 582 ms 142 ms 4x faster
Q4 983 ms 3185 ms 0.3x slower

SigScalr vs ClickHouse-on-DoubleCloud: 0.9x - 10x faster

Query Type ClickHouse on DoubleCloud SigScalr Comparison
Q1 347 ms 131 ms 3x faster
Q2 1100 ms 168 ms 6x faster
Q3 1389 ms 142 ms 10x faster
Q4 2935 ms 3185 ms 0.9x slower

Conclusion

SigScalr is your go-to solution for managing vast amounts of logs. Crafted to scale horizontally, it's effortless to add resources as your needs expand. Notably, SigScalr ensures optimal utilization of your hardware and cloud assets.

Here are some key benefits of using SigScalr for large datasets:

  • Scalability: Regardless of your dataset's size, SigScalr's horizontal scalability has you covered.
  • Efficiency: Get the full potential of your hardware and cloud resources with SigScalr's efficiency.
  • Fast: SigScalr can search and aggregate billions of log lines in under a second.
  • Ease of use: SigScalr offers an intuitive interface, making it accessible even for those unfamiliar with observability tools.
  • High Query Throughput: SigScalr has high query throughput and can handle thousands of users and queries concurrently.