Skip to main content

· One min read

I want to export data segregating the path in S3 bucket to follow a structure like:

  • 2022
    • 1
    • 2
    • ...
    • 12
  • 2021
    • 1
    • 2
    • ...
    • 12

and so on ...

Answer

Considering the ClickHouse table:

CREATE TABLE sample_data (
`name` String,
`age` Int,
`time` DateTime
) ENGINE = MergeTree
ORDER BY
name

Add 10000 entries:

INSERT INTO
sample_data
SELECT
*
FROM
generateRandom(
'name String, age Int, time DateTime',
10,
10,
10
)
LIMIT
10000;

Run this to create the desired structure in s3 bucket my_bucket (note this example writes files in parquet format):

INSERT INTO
FUNCTION s3(
'https://s3-host:4321/my_bucket/{_partition_id}/file.parquet.gz',
's3-access-key',
's3-secret-access-key',
Parquet,
'name String, age Int, time DateTime'
) PARTITION BY concat(
formatDateTime(time, '%Y'),
'/',
formatDateTime(time, '%m')
)
SELECT
name,
age,
time
FROM
sample_data
Query id: 55adcf22-f6af-491e-b697-d09694bbcc56

Ok.

0 rows in set. Elapsed: 15.579 sec. Processed 10.00 thousand rows, 219.93 KB (641.87 rows/s., 14.12 KB/s.)

· 3 min read

ClickHouse has a built-in debugger and introspection capabilities. For example, you can get the stack traces of every server's thread at runtime by querying the system.stack_trace table:

SELECT
count(),
arrayStringConcat(arrayMap(x -> concat(demangle(addressToSymbol(x)), '\n ', addressToLine(x)), trace), '\n') AS sym
FROM system.stack_trace
GROUP BY trace
ORDER BY count() DESC
LIMIT 10
FORMAT Vertical
SETTINGS allow_introspection_functions = 1;

The query result will show the locations in the ClickHouse source code where the threads are running or waiting. (You will need to set allow_introspection_functions to 1 to enable the introspection functions.) The response looks like:

Row 1:
──────
count(): 144
sym: pthread_cond_wait

DB::BackgroundSchedulePool::threadFunction()
/usr/bin/clickhouse

/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 2:
──────
count(): 80
sym: pthread_cond_wait

std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
/usr/bin/clickhouse
DB::MergeTreeBackgroundExecutor<DB::OrdinaryRuntimeQueue>::threadFunction()
/usr/bin/clickhouse
ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 3:
──────
count(): 55
sym: pthread_cond_wait

ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 4:
──────
count(): 16
sym:

DB::AsynchronousInsertQueue::processBatchDeadlines(unsigned long)
/usr/bin/clickhouse

/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 5:
──────
count(): 16
sym: pthread_cond_wait

std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
/usr/bin/clickhouse
DB::MergeTreeBackgroundExecutor<DB::MergeMutateRuntimeQueue>::threadFunction()
/usr/bin/clickhouse
ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 6:
──────
count(): 10
sym: poll

Poco::Net::SocketImpl::pollImpl(Poco::Timespan&, int)
/usr/bin/clickhouse
Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
/usr/bin/clickhouse
Poco::Net::TCPServer::run()
/usr/bin/clickhouse
Poco::ThreadImpl::runnableEntry(void*)
/usr/bin/clickhouse


clone


Row 7:
──────
count(): 9
sym: pthread_cond_wait

ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 8:
──────
count(): 7
sym: poll

Poco::Net::SocketImpl::pollImpl(Poco::Timespan&, int)
/usr/bin/clickhouse
Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
/usr/bin/clickhouse
DB::ReadBufferFromPocoSocket::poll(unsigned long) const
/usr/bin/clickhouse
DB::TCPHandler::runImpl()
/usr/bin/clickhouse
DB::TCPHandler::run()
/usr/bin/clickhouse

/usr/bin/clickhouse
Poco::Net::TCPServerConnection::start()
/usr/bin/clickhouse
Poco::Net::TCPServerDispatcher::run()
/usr/bin/clickhouse
Poco::PooledThread::run()
/usr/bin/clickhouse
Poco::ThreadImpl::runnableEntry(void*)
/usr/bin/clickhouse


clone


Row 9:
───────
count(): 3
sym: pthread_cond_wait

Poco::EventImpl::waitImpl()
/usr/bin/clickhouse
DB::DDLWorker::runCleanupThread()
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 10:
───────
count(): 3
sym: pthread_cond_wait

Poco::EventImpl::waitImpl()
/usr/bin/clickhouse
DB::DDLWorker::runMainThread()
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


10 rows in set. Elapsed: 0.026 sec.
Note

If you installed ClickHouse from a .deb/.rpm/.tgz you can also install the package with the debug info to see the line numbers from the source code:

sudo apt install clickhouse-common-static-dbg

If you've installed ClickHouse as a single-binary, it already contains the debug info.

Tip

For more high-level information, check out some of these other system tables:

And there is handy info in the other system tables also.

· 2 min read

Using INTO OUTFILE Clause

Add an INTO OUTFILE clause to your query.

For example:

SELECT * FROM table INTO OUTFILE 'file'

By default, ClickHouse uses the file extension of the filename to deteremine the output format and compression. For example, all of the rows in nyc_taxi will be exported to the nyc_taxi.parquet using the Parquet format:

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.parquet'

And the following file will be a compressed, tab-separated file:

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.tsv.gz'

If ClickHouse can not determine the format from the file extension, then the output format defaults to TabSeparated for output data. To specify the output format, use the FORMAT clause.

For example:

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.txt'
FORMAT CSV

Using the File table engine

Another option is to use the File table engine, where ClickHouse uses the file to store the data. You can perform queries and inserts directly on the file.

For example:

CREATE TABLE my_table (
x UInt32,
y String,
z DateTime
)
ENGINE = File(Parquet)

Insert a few rows:

INSERT INTO my_table VALUES
(1, 'Hello', now()),
(2, 'World', now()),
(3, 'Goodbye', now())

The file is stored in the data folder of your ClickHouse server - specifically in /data/default/my_table in a file named data.Parquet.

Note

Using the File table engine is incredibly handy for creating and querying files on your file system, but keep in mind that File tables are not MergeTree tables, so you don't get all the benefits that come with MergeTree. Use File for convenience when exporting data out of ClickHouse in convenient formats.

Using Command-Line Redirection

$ clickhouse-client --query "SELECT * from table" --format FormatName > result.txt

See clickhouse-client.

· 3 min read

Below are some basics of using the S3 table engine to read parquet files.

  • create access and secret keys for an IAM service user. normal login users usually don't work since they may have been configured with an MFA policy.

  • set the permissions on the policy to allow the service user to access the bucket and folders.

The following is a very simple example that you can use to test the mechanics of accessing your parquet files successfully prior to applying to your actual data.

If you need an example of creating a user and bucket, you can follow the first two sections (create user and create bucket): https://clickhouse.com/docs/en/guides/sre/configuring-s3-for-clickhouse-use/

I used this sample file: https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet and uploaded it to my test bucket

You can set the policy something like this on the bucket: (adjust as needed, this one is fairly open for privileges but will help in testing. you can narrow your permissions as necessary)

{
"Version": "2012-10-17",
"Id": "Policy123456",
"Statement": [
{
"Sid": "abc123",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::1234567890:user/mars-s3-user"
]
},
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::mars-doc-test",
"arn:aws:s3:::mars-doc-test/*"
]
}
]
}

You can run queries with this type of syntax using the S3 table engine: https://clickhouse.com/docs/en/sql-reference/table-functions/s3/

clickhouse-cloud :)  select count(*) from s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet','ABC123', 'abc+123', 'Parquet', 'first_name String');

SELECT count(*)
FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123', 'abc+123', 'Parquet', 'first_name String')

Query id: fd4f1193-d604-4ac0-9a46-bdd2d5e14727

┌─count()─┐
│ 1000 │
└─────────┘

1 row in set. Elapsed: 1.274 sec. Processed 1.00 thousand rows, 14.64 KB (784.81 rows/s., 11.49 KB/s.)

The data types reference for parquet format are here: https://clickhouse.com/docs/en/interfaces/formats/#data-format-parquet

To bring in the data into a native ClickHouse table:

create the table, something like this (just chose a couple of the columns in the parquet file):

clickhouse-cloud :) CREATE TABLE my_parquet_table (id UInt64, first_name String) ENGINE = MergeTree ORDER BY id;

CREATE TABLE my_parquet_table
(
`id` UInt64,
`first_name` String
)
ENGINE = MergeTree
ORDER BY id

Query id: 412e3994-bf8e-444e-ac43-a7c82642b7da

Ok.

0 rows in set. Elapsed: 0.600 sec.

Select the data from the S3 bucket to insert into the new table:

clickhouse-cloud :) INSERT INTO my_parquet_table (id, first_name) SELECT id, first_name FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123','abc+123', 'Parquet', 'id UInt64, first_name String') FORMAT Parquet

INSERT INTO my_parquet_table (id, first_name) SELECT
id,
first_name
FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123', 'abc+123', 'Parquet', 'id UInt64, first_name String')

Query id: c3cdc871-f338-462d-8797-6751b45a0b58

Ok.

0 rows in set. Elapsed: 1.220 sec. Processed 1.00 thousand rows, 22.64 KB (819.61 rows/s., 18.56 KB/s.)

Verify the import:

clickhouse-cloud :) SELECT * FROM my_parquet_table LIMIT 10;

SELECT *
FROM my_parquet_table
LIMIT 10

Query id: 1ccf59dd-d804-46a9-aadd-ed5c57b9e1a0

┌─id─┬─first_name─┐
│ 1 │ Amanda │
│ 2 │ Albert │
│ 3 │ Evelyn │
│ 4 │ Denise │
│ 5 │ Carlos │
│ 6 │ Kathryn │
│ 7 │ Samuel │
│ 8 │ Harry │
│ 9 │ Jose │
│ 10 │ Emily │
└────┴────────────┘

When you are ready to import your real data, you can use some special syntax like wildcards and ranges to specify your folders, subfolders and files in your bucket. I'd recommend to filter a few directories and files to test the import, maybe a certain year, a couple months and some date range to test first.

besides the path options here, newly released is syntax ** which specifies all subdirectories recursively. https://clickhouse.com/docs/en/sql-reference/table-functions/s3/

For example, assuming the paths and bucket structure is something like this: https://your_s3_bucket.s3.amazonaws.com/<your_folder>/<year>/<month>/<day>/<filename>.parquet https://mars-doc-test.s3.amazonaws.com/system_logs/2022/11/01/my-app-logs-0001.parquet

This would get all files for 1st day of every month in 2021-2022 https://mars-doc-test.s3.amazonaws.com/system_logs/{2021-2022}/**/01/*.parquet

· 3 min read

You can use clickhouse-local to convert files between any of the input and output formats that ClickHouse supports (which is over 70 different formats!). In this article, we are convert a Parquet file in S3 into a CSV and JSON file.

Let's start at the beginning. ClickHouse has a collection of table functions that read from files, databases and other resoures and converts the data to a table. To demonstrate, suppose we have a Parquet file in S3. We will use the s3 table function to read it (ClickHouse knows it's a Parquet file based on the filename).

But first, let's download the clickhouse binary:

curl https://clickhouse.com/ | sh

Accessing the data using a table function

Let's verify we can read the file by using DESCRIBE on the resulting table that the s3 table function creates:

./clickhouse local -q "DESCRIBE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')"

This particular file contains home prices of properties sold in the United Kingdom. The response looks like:

price   Nullable(Int64)
date Nullable(UInt16)
postcode1 Nullable(String)
postcode2 Nullable(String)
type Nullable(String)
is_new Nullable(UInt8)
duration Nullable(String)
addr1 Nullable(String)
addr2 Nullable(String)
street Nullable(String)
locality Nullable(String)
town Nullable(String)
district Nullable(String)
county Nullable(String)

You can run any query you want on the data. For example, let's see which towns have the highest average price of homes:

./clickhouse local -q "SELECT
town,
avg(price) AS avg_price
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
GROUP BY town
ORDER BY avg_price DESC
LIMIT 10"

The response looks like:

GATWICK 16818750
CHALFONT ST GILES 938090.0985915493
VIRGINIA WATER 789301.1320224719
COBHAM 699874.7111622555
BEACONSFIELD 677247.5483146068
ESHER 616004.6888297872
KESTON 607585.8597560975
GERRARDS CROSS 566330.2959086584
ASCOT 551491.2975753123
WEYBRIDGE 548974.828692494

Convert the Parquet file to a CSV

You can send the result of any SQL query to a file. Let's grab all the columns from our Parquet file in S3 and send the output to a new CSV file. Because the output file ends in .csv, ClickHouse knows to use the CSV output format:

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
INTO OUTFILE 'house_prices.csv'"

Let's verify it worked:

$ tail house_prices.csv
70000,10508,"YO8","9XN","detached",0,"freehold","7","","POPPY CLOSE","SELBY","SELBY","SELBY","NORTH YORKSHIRE"
130000,14274,"YO8","9XP","detached",0,"freehold","10","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
150000,18180,"YO8","9XP","detached",0,"freehold","11","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
157000,18088,"YO8","9XP","detached",0,"freehold","12","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
134000,17333,"YO8","9XP","semi-detached",0,"freehold","16","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
250000,13405,"YO8","9YA","detached",0,"freehold","6","","YORKDALE COURT","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
59500,11166,"YO8","9YB","semi-detached",0,"freehold","4","","YORKDALE DRIVE","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
142500,17648,"YO8","9YB","semi-detached",0,"freehold","4A","","YORKDALE DRIVE","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
230000,15125,"YO8","9YD","detached",0,"freehold","1","","ONE ACRE GARTH","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
250000,15950,"YO8","9YD","detached",0,"freehold","3","","ONE ACRE GARTH","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"

Convert the Parquet file to a JSON

To convert the Parquet file to JSON, simply change the extension on the output filename:

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
INTO OUTFILE 'house_prices.ndjson'"

Let's verify it worked:

 $ tail house_prices.ndjson
{"price":"70000","date":10508,"postcode1":"YO8","postcode2":"9XN","type":"detached","is_new":0,"duration":"freehold","addr1":"7","addr2":"","street":"POPPY CLOSE","locality":"SELBY","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"130000","date":14274,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"10","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"150000","date":18180,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"11","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"157000","date":18088,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"12","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"134000","date":17333,"postcode1":"YO8","postcode2":"9XP","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"16","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"250000","date":13405,"postcode1":"YO8","postcode2":"9YA","type":"detached","is_new":0,"duration":"freehold","addr1":"6","addr2":"","street":"YORKDALE COURT","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"59500","date":11166,"postcode1":"YO8","postcode2":"9YB","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"4","addr2":"","street":"YORKDALE DRIVE","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"142500","date":17648,"postcode1":"YO8","postcode2":"9YB","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"4A","addr2":"","street":"YORKDALE DRIVE","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"230000","date":15125,"postcode1":"YO8","postcode2":"9YD","type":"detached","is_new":0,"duration":"freehold","addr1":"1","addr2":"","street":"ONE ACRE GARTH","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"250000","date":15950,"postcode1":"YO8","postcode2":"9YD","type":"detached","is_new":0,"duration":"freehold","addr1":"3","addr2":"","street":"ONE ACRE GARTH","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}

Convert CSV to Parquet

It works both ways - we can easily read in the new CSV file and output it into a Parquet file. The local file house_prices.csv can be read in ClickHouse using the file table function, and ClickHouse outputs the file in Parquet format based on the filename ending in .parquet (or we could have added the FORMAT Parquet clause):

./clickhouse local -q "SELECT *
FROM file('house_prices.csv')
INTO OUTFILE 'house_prices.parquet'"

As we mentioned above, you can use any of the ClickHouse input and output formats along with clickhouse local to easily convert files into different formats.

· 3 min read

This one is easy with clickhouse-local:

  • Use the postgresql table function to read the data
  • Use the INTO OUTFILE _filename_ FORMAT clause and specify the desired output format

The output format can be any of the supported ouput formats in ClickHouse. Let's look at a few examples...

These examples use clickhouse-local, which is a part of the ClickHouse binary. Download it using the following:

curl https://clickhouse.com/ | sh

Export PostgreSQL to Parquet

The postgresql table function allows SELECT (and INSERT) queries to be performed on data that is stored on a remote PostgreSQL server. For example, to view the entire contents of a table in PostgreSQL:

SELECT *
FROM
postgresql(
'localhost:5432',
'postgres_database',
'postgres_table',
'user',
'password'
);

We can pipe the output of this query to a file using INTO OUTFILE. Use FORMAT to specify the format of the file to be created. Let's grab the entire contents of the PostgreSQL table, and send its contents to a Parquet file:

./clickhouse local -q "SELECT * FROM
postgresql(
'localhost:5432',
'postgres_database',
'postgres_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.parquet'"
Note

Because the name of the output file has a .parquet extension, ClickHouse assumes we want the Parquet format, so notice we omitted the FORMAT Parquet clause.

Export PostgreSQL to CSV

It's the same as for Parquet, except we specify a more approriate filename for the output:

./clickhouse local -q "SELECT * FROM
postgresql(
'localhost:5432',
'postgres_database',
'postgres_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.csv'"

That's it! ClickHouse sees the .csv extension on the output file name and outputs the data as comma-separated. Otherwise, it's the exact same command as above.

Export PostgreSQL to JSON

To go from PostgreSQL to JSON, we just change the filename and ClickHouse will figure out the format:

./clickhouse local -q "SELECT * FROM
postgresql(
'localhost:5432',
'postgres_database',
'postgres_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.ndjson'"
Note

You don't have to stop here - you can use clickhouse-local to pull data from PostgreSQL and send it to all types of output formats.

If ClickHouse can not determine the output type by the filename extension, or if you want to specifically choose a format, add the FOMRAT clause:

```bash
./clickhouse local -q "SELECT * FROM
postgresql(
'localhost:5432',
'postgres_database',
'postgres_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.ndjson'
FORMAT JSONEachRow"

Stream PostgreSQL to another process

Instead of using INTO OUTFILE, you can stream the results of a table function to another process. Here's a simple example to demonstrate the syntax - we count the number of rows using the Linux wc -l command:

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet'
FORMAT JSONEachRow
)" | wc -l

However, we could easily stream the rows to a shell script, Python script, or any other process that you want.

· 2 min read

The clickhouse-local tool makes it quick and easy to read data from MySQL and output the data into lots of different formats, including Parquet, CSV, and JSON. We are going to:

  • Use the mysql table function to read the data
  • Use the INTO OUTFILE _filename_ FORMAT clause and specify the desired output format

The clickhouse-local tool is a part of the ClickHouse binary. Download it using the following:

curl https://clickhouse.com/ | sh

Export MySQL to Parquet

The mysql table function creates a table based on the results of a query sent to a MySQL instance. For example:

SELECT *
FROM
mysql(
'localhost:3306',
'my_sql_database',
'my_sql_table',
'user',
'password'
);

We can pipe the output of this query to a file using INTO OUTFILE. Use FORMAT to specify the format of the file to be created. Let's grab the entire contents of a MySQL table, and send its contents to a Parquet file:

./clickhouse local -q "SELECT * FROM
mysql(
'localhost:3306',
'my_sql_database',
'my_sql_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.parquet'"
Note

Because the name of the output file has a .parquet extension, ClickHouse assumes we want the Parquet format, so notice we omitted the FORMAT Parquet clause.

Export MySQL to CSV

It's the same as for Parquet, except this time we use a .csv extension on the filename. ClickHouse will realize we want a comma-separated output and that's how the data will be written to the file:

./clickhouse local -q "SELECT * FROM
mysql(
'localhost:3306',
'my_sql_database',
'my_sql_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.csv'"

Export MySQL to JSON

To go from MySQL to JSON, just change the extension on the filename to jsonl or ndjson:

./clickhouse local -q "SELECT * FROM
mysqlql(
'localhost:3306',
'my_sql_database',
'my_sql_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.ndjson'"

It's impressive how simple yet powerful the clickhouse-local tool really is. You can easily read data from a database like MySQL and output it into all types of different output formats.

· 2 min read

Problem remote() or remoteSecure() table function allows the access of remote table from another ClickHouse node.

When using these functions on a node that is located more than 100ms (latency wise) away from the remote node, it is common to encounter the following timeout error.

4776d4bd8190 :) SELECT * FROM remoteSecure('HOSTNAME.us-east-2.aws.clickhouse.cloud', DATABASE, TABLE, 'USER', 'USER_PASSWORD')

SELECT *
FROM remoteSecure('HOSTNAME.us-east-2.aws.clickhouse.cloud', DATABASE, TABLE, 'USER', 'USER_PASSWORD')

Query id: 2bd6ddd0-66d9-4d19-830f-87e3cec3724b


0 rows in set. Elapsed: 1.213 sec.

Received exception from server (version 22.6.9):
Code: 519. DB::Exception: Received from localhost:9000. DB::NetException. DB::NetException: All attempts to get table structure failed. Log:

Code: 279. DB::NetException: All connection tries failed. Log:

Code: 209. DB::NetException: Timeout: connect timed out: 18.218.245.169:9440 (hc7d963h1t.us-east-2.aws.clickhouse.cloud:9440, connection timeout 100 ms). (SOCKET_TIMEOUT) (version 22.6.9.11 (official build))
Code: 209. DB::NetException: Timeout: connect timed out: 18.218.245.169:9440 (hc7d963h1t.us-east-2.aws.clickhouse.cloud:9440, connection timeout 100 ms). (SOCKET_TIMEOUT) (version 22.6.9.11 (official build))
Code: 209. DB::NetException: Timeout: connect timed out: 18.218.245.169:9440 (hc7d963h1t.us-east-2.aws.clickhouse.cloud:9440, connection timeout 100 ms). (SOCKET_TIMEOUT) (version 22.6.9.11 (official build))

. (ALL_CONNECTION_TRIES_FAILED) (version 22.6.9.11 (official build))

. (NO_REMOTE_SHARD_AVAILABLE)

Workaround To get increase the connection timeout, set connect_timeout_with_failover_secure_ms to a higher value (e.g. 1 second) from the default 100ms.

4776d4bd8190 :) SELECT * FROM remoteSecure('HOSTNAME.us-east-2.aws.clickhouse.cloud:9440', DATABASE, TABLE, 'USER', 'USER_PASSWORD') SETTINGS connect_timeout_with_failover_secure_ms = 1000

SELECT *
FROM remoteSecure('HOSTNAME.us-east-2.aws.clickhouse.cloud:9440', DATABASE, TABLE, 'USER', 'USER_PASSWORD')
SETTINGS connect_timeout_with_failover_secure_ms = 1000

Query id: 8e2f4d41-307b-4e61-abb8-809190023247

┌─x─┐
│ 1 │
└───┘

1 row in set. Elapsed: 2.403 sec.

· 3 min read

You reached the parts_to_throw_insert setting on a MergeTree table. You can monitor the number of active parts for a given table with:

select count(*) from system.parts where table = '<table_name>' and active == 1

The main requirement about inserting into Clickhouse: you should never send too many INSERT statements per second. Ideally - one insert per second / per few seconds.

So you can insert 100K rows per second but only with one big bulk INSERT statement. When you send hundreds / thousands insert statements per second to *MergeTree table you will always get some errors, and it can not be changed by adjusting some settings.

If you can't combine lot of inserts into one big bulk insert statement outside - then you should create Buffer table before *MergeTree table.

  1. Each insert create a folder in /var/lib/clickhouse/.../table_name/. Inside that folder there are 2 files per each column - one with data (compressed), second with index. Data is physically sorted by primary key inside those files. Those folders are called 'parts'.

  2. ClickHouse merges those smaller parts to bigger parts in the background. It chooses parts to merge according to some rules. After merging two (or more) parts one bigger part is being created and old parts are queued to be removed. The settings you list allow finetuning the rules of merging parts. The goal of merging process - is to leave one big part for each partition (or few big parts per partition which are not worth to merge because they are too big). Please check also that comment.

  3. If you create new parts too fast (for example by doing lot of small inserts) and ClickHouse is not able to merge them with proper speed (so new parts come faster than ClickHouse can merge them) - then you get the exception 'Merges are processing significantly slower than inserts'. You can try to increase the limit but you can get the situation then you get filesystem problems caused by the too big number of files / directories (like inodes limit).

  4. If you insert to lot of partitions at once the problem is multiplied by the number of partitions affected by insert.

  5. You can try to adjust the behaviour of clickhouse with one of the listed settings, or with max_insert_block_size / max_block_size / insert_format_max_block_size / max_client_network_bandwidth. But: the better solution is just to insert data in expected tempo. The expected tempo is: one insert per 1-2 sec, each insert containing 10K-500K rows of data.

  6. So proper solution to solve "Merges are processing significantly slower than inserts" is to adjust the number of inserts per second and number of rows in each insert. Use batch insert to combine small inserts into one bigger if data comes row-by-row. Throttle huge inserts if you have too much data to insert at once. Don't change clickhouse internals, unless you really understand well what does they it mean.

  7. If your data comes faster than 500K rows per second - most probably you need more servers in the cluster to serve that traffic, not the adjustment of settings.

  8. The speed of background merges usually depends on storage speed, used compression settings, the MergeTree option (the merge algorithm - plain merge/aggregating/summing/collapsing, etc.), and the used sorting key.

· One min read

ClickHouse supports a wide range of data formats for input and output. There are multiple JSON variations among them, but the most commonly used for data ingestion is JSONEachRow. It expects one JSON object per row, each object separated by a newline.

Examples

Using HTTP interface:

$ echo '{"foo":"bar"}' | curl 'http://localhost:8123/?query=INSERT%20INTO%20test%20FORMAT%20JSONEachRow' --data-binary @-

Using CLI interface:

$ echo '{"foo":"bar"}'  | clickhouse-client --query="INSERT INTO test FORMAT JSONEachRow"

Instead of inserting data manually, you might consider to use one of client libraries instead.

Useful Settings

  • input_format_skip_unknown_fields allows to insert JSON even if there were additional fields not present in table schema (by discarding them).
  • input_format_import_nested_json allows to insert nested JSON objects into columns of Nested type.
Note

Settings are specified as GET parameters for the HTTP interface or as additional command-line arguments prefixed with -- for the CLI interface.