We are excited to
announce that after conducting a performance test using the popular NYC Taxi dataset, SigScalr
emerged a staggering 37x faster than ClickHouse. The speed advantage of SigScalr over
ClickHouse
stems from its advanced compression and query optimization techniques. Further bolstering its
performance is SigScalr's distributed framework, which seamlessly scales horizontally, catering
to high-demand tasks.
Let’s get into the nitty-gritty of the dataset, nodes used, detailed
queries, and test results.
The NYC Taxi Dataset is a massive dataset of over 1.1 Billion trips made by NYC Taxis, Uber,and other for-hire vehicles, all originating in New York City since 2009.
The queries were as follows:
We found that SigScalr was significantly faster than ClickHouse on all
queries.
This is probably because SigScalr is better at handling complex
queries involving large amounts of data.
One reason for the difference in
speed is that SigScalr uses dynamic columnar compression with adaptive
microindexes and an innovative agile aggregation tree algorithm.
The dataset is divided into two tables:
The dataset also includes several other tables, such as:
Source: GitHub
Here’s the Github Link for New York City Taxi
and For-Hire Vehicle Data
SigScalr is a log management solution that is faster, better, and more efficient than most log management solutions. It runs on a very small form factor and supports a variety of ingestion integrations and various query languages.
Node Type Used - im4gn.2xlarge
This is the link where ClickHouse performance numbers are available: 1.1 Billion Taxi Rides on ClickHouse & an Intel Core i5.
Setup the Node and disk volumes
Run SigScalr
./hyperion -d <http://localhost:8081/elastic> -p 1
Prepare the Data
go run cmd/utils/converter.go --input trips_X --output trips_X.json {X == 0…19}
Ingest the data into SigScalr
For each data file, run the following command:
go run main.go ingest -d http://localhost:8081/elastic -p 1 -g file -x trips_X.json -t 50_000_000 -b 500 {X == 0…19}
Run the queries
Mark Litwintschik used ClickHouse, a distributed column-oriented database management system, to analyze the NYC taxi trip data.
For a more detailed explanation of Mark's steps check out the blog here: 1.1 Billion Taxi Rides on ClickHouse & an Intel Core i5 .
Benchmarking ClickHouse:
Group count(*) by cab_type
This query groups the data by cab type and counts the number of trips for each type. This query can be used to understand the distribution of trips by cab type.
SELECT cab_type, count(*)
FROM trips_mergetree
GROUP BY cab_type
Group avg(total_amount) by passenger_count
This query groups the data by passenger count and averages the total amount for each group. This query can be used to understand the average spending per passenger count.
SELECT passenger_count, avg(total_amount)
FROM trips_mergetree
GROUP BY passenger_count
Group count(*) by passenger_count, pickup_date
This query groups the data by passenger count and pickup date and counts the number of trips for each group. This query can be used to understand the distribution of trips by passenger count and pickup date.
SELECT passenger_count, toYear(pickup_date)
AS year, count(*)
FROM trips_mergetree
GROUP BY passenger_count, year
Group count(*) by passenger_count, pickup_date, trip_distance
This query groups the data by passenger count, pickup date, and trip distance and counts the number of trips for each group. This query can be used to understand the distribution of trips by passenger count, pickup date, and trip distance.
SELECT passenger_count, toYear(pickup_date)
AS year, round(trip_distance)
AS distance, count(*)
FROM trips_mergetree
GROUP BY passenger_count, year, distance
ORDER BY year, count(*) DESC
Query Type | ClickHouse Single Node | SigScalr | Comparison |
---|---|---|---|
Q1 | 1034 ms | 131 ms | 8x faster |
Q2 | 3058 ms | 168 ms | 18x faster |
Q3 | 5354 ms | 142 ms | 38x faster |
Q4 | 12748 ms | 3185 ms | 4x faster |
Query Type | 108-core ClickHouse Cluster | SigScalr | Comparison |
---|---|---|---|
Q1 | 2449 ms | 131 ms | 19x faster |
Q2 | 691 ms | 168 ms | 4x faster |
Q3 | 582 ms | 142 ms | 4x faster |
Q4 | 983 ms | 3185 ms | 0.3x slower |
Query Type | ClickHouse on DoubleCloud | SigScalr | Comparison |
---|---|---|---|
Q1 | 347 ms | 131 ms | 3x faster |
Q2 | 1100 ms | 168 ms | 6x faster |
Q3 | 1389 ms | 142 ms | 10x faster |
Q4 | 2935 ms | 3185 ms | 0.9x slower |
SigScalr is your go-to solution for managing vast amounts of logs. Crafted to scale horizontally, it's effortless to add resources as your needs expand. Notably, SigScalr ensures optimal utilization of your hardware and cloud assets.
Here are some key benefits of using SigScalr for large datasets: