"id","deleted","type","by","time","text","dead","parent","poll","kids","url","score","title","parts","descendants" 18346787,0,"comment","RobAtticus","2018-10-31 15:56:39.000000000","We do have comparisons, but judging by their Medium read times some may not be considered "quick" :)
* Influx: https://blog.timescale.com/timescaledb-vs-influxdb-for-time-...
* Cassandra: https://blog.timescale.com/time-series-data-cassandra-vs-tim...
* Mongo: https://blog.timescale.com/how-to-store-time-series-data-mon...
We also released a tool called the Time Series Benchmark Suite (TSBS), for which someone just submitted a ClickHouse PR: https://github.com/timescale/tsbs/pull/26
There is also this spreadsheet that compares a bunch of different time series databases, including TimescaleDB: https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCE...
Hopefully some of that is useful :)",0,18346746,0,"[18346822]","",0,"","[]",0 18355652,0,"comment","jeroensoeters","2018-11-01 16:29:16.000000000","Instana | Senior Software Engineer | Austin, TX | Onsite | Full-time | Competitive salary + equity Instana is the leading provider of Application Performance Management solutions for containerized microservice applications. At Instana, we apply automation and artificial intelligence to deliver the visibility needed to effectively manage the performance of today's dynamic applications across the DevOps lifecycle.
At Instana, we have a myriad of complex and interesting projects to work on: from our agent software that has ridiculous performance requirements, to our big data processing pipeline that processes many terabytes per day, and from a fully 3D-rendered web UI to state-of-the-art machine learning algorithms for detecting and predicting anomalies.
Tech: Java 8, Project Reactor, Cassandra, ElasticSearch, ClickHouse, Kafka, C, C++, Go, ES6, React, ThreeJS, AWS and much more.
Requirements: deep knowledge of the JVM and related technologies, solid understanding of building distributed systems. Preferred: experience building ingress systems, stream processing
If you're interested please email me at jeroen.soeters@instana.com",0,18354503,0,"[]","",0,"","[]",0 18362819,0,"comment","arespredator","2018-11-02 13:26:59.000000000","MessageBird | Amsterdam, Netherlands | Data Engineer | Full-time | Onsite | Visa
MessageBird is a Cloud Communications Platform as a Service (CPaaS) company for SMS, Voice and Chat communications that connects businesses to 7 billion phones worldwide. We’re one of the fastest growing software companies in Europe and we’re looking to expand our engineering team with an experienced Data Engineer.
Data engineering at MessageBird is programming-heavy, so we're looking for people who like to code and have significant software engineering experience.
Tech stack: Go, gRPC, Clickhouse, Bigtable, Java, Apache Beam (Google Dataflow), GCP, k8s.
Our data team is currently 10 engineers and 8 nationalities. We have a very well stocked kitchen and a roof terrace in our brand new Rivierenbuurt office.
Apply at https://www.messagebird.com/en/careers and feel free to contact me at piotr@messagebird.com in case you have any questions.",0,18354503,0,"[18364519]","",0,"","[]",0 21938521,0,"comment","lykr0n","2020-01-02 19:01:23.000000000","Role: Site Reliability Engineer/System Administrator/System Engineer
Location: Seattle, WA (and surrounding areas)
Willing to relocate: I'd rather not
Technologies: Linux (CentOS/RHEL), MySQL, Postgres, Clickhouse, Docker, Nomad, Consul, Vault, Puppet, Ansible, SaltStack, Python 2/3 (development + administration), Rust (development + administration), Java + JVM (administration), KVM (oVirt/RHEV), VMware vSphere, Limited AWS/GCP, etcd, zookeeper, kafka, haproxy, nginx, Bash, GitHub/GitLab, Git, HTML, Datadog, Grafana, InfluxDB, and so on and so on. On Call? Love it.
Résumé/CV: On Request
Email: lykron@mm.st
Looking for a smaller company this time around, 5 to 250 people or so, anywhere from startup to established. I love building infrastructure and being involved with architecture design. I've been heavily involved in improving reliability of applications and systems to make sure they do not go down.",0,21936438,0,"[]","",0,"","[]",0 21942826,0,"story","phatak-dev","2020-01-03 03:25:46.000000000","",0,0,0,"[]","http://blog.madhukaraphatak.com/clickouse-clustering-spark-developer/",1,"ClickHouse Clustering from Hadoop Perspective","[]",0 21953967,0,"comment","jmakov","2020-01-04 09:56:48.000000000","Clickhouse + Grafana or Prometheus + Victoriametrics",0,21949997,0,"[]","",0,"","[]",0 21966741,0,"comment","Dim25","2020-01-06 05:31:48.000000000","SEEKING WORK | San Francisco, CA, USA | REMOTE or LOCAL
Hi all, I'm Dima (https://www.linkedin.com/in/dim25/) from SF (San Francisco Bay Area). Full-stack with Machine Learning experience; AI/ML product manager.
Python: * Machine Learning: (TensorFlow; Keras; PyTorch). * Computer Vision (OpenCV; TensorFlow). * Media \ communications (Twilio; Ring Central; Kurento). * Streaming \ Workflows: Kafka+Faust; Airflow; Celery. * Web servers (Flask), and many other applications of Python.
Web Development: HTML; CSS; Bootstrap. JS (Front-end + Node.js): All the basics necessary for web development; Basic experience with d3.js and other visualizations and dashboards tools.
DBs: MongoDB; ElasticSearch; Redis (incl. RediSearch), SQLs. Basics of ClickHouse.
C/C++: Some experiments with ROS/robotics.
Most recent projects:
* Analyzing millions of job postings worldwide.
* Computer Vision CCTV Stream analytics.
Previously: * Co-founder at MBaaS startup. 'Firefighter' from $0 to $120K MRR.
* Managed a team of 15 mobile developers to assist with the delivery of
the #1 mobile banking app in Russia (iOS + Android).
* AWM, rev-share with Kinks (guys from San Francisco Armory).
Especially good match: if you need a cost-efficient prototype; need to fix and deliver your machine learning or automation strategy; are looking for an early-stage full-stack dev with ML experience; or have a remote team you don’t have time to manage. Rate: Open to discuss. Don't need perks, 'cool' office spaces and other shenanigans. Available now.
Email: dima_cv1@protonmail.com
Latest version of this CV: https://bitly.com/dima_cv1
Add me on LinkedIn: https://www.linkedin.com/in/dim25/",0,21936439,0,"[]","",0,"","[]",0 18404015,0,"comment","manigandham","2018-11-08 02:44:50.000000000","All modern columnstores can handle vast ingest rates and query speeds. It's all down to sharding, zone maps and sparse indexing, fast algorithms that operate on compressed data, and storage throughput. These are well-solved problems at this point.
Your blog post doesn't mention a single columnstore database though. KDB+, Clickhouse, MemSQL, or any of the GPU-powered variations will happily beat any TSDB out there.",0,18403683,0,"[18404257,18404255]","",0,"","[]",0 18404089,0,"comment","qaq","2018-11-08 03:05:27.000000000","You do realize there are ClickHouse clusters that ingest more data in a few days than the largest Timescale cluster can hold at its maximum size.",0,18403810,0,"[18404185]","",0,"","[]",0 18404090,0,"comment","manigandham","2018-11-08 03:06:14.000000000","Timescale, for all their wonderful marketing, is just an automatic sharding extension for PostgreSQL. You can accomplish the same yourself using native partitioning, or pg_partman, or Citus.
Partitions are a basic building block for scaling performance and storage so it helps when you have lots of data, but Postgres w/Timescale does not have column-oriented storage and is still single-node only so it comes nowhere near the capabilities of cutting-edge columnstores like Clickhouse, KDB+, MemSQL, Kinetica, etc.",0,18403810,0,"[18404114]","",0,"","[]",0 18404237,0,"comment","qaq","2018-11-08 04:00:06.000000000","Google is your friend: search clickhouse, vertica, etc. The comment about limited analytical power is especially fun. Cloudflare is ingesting 11 million rows per second into CH.",0,18404185,0,"[]","",0,"","[]",0 18404239,0,"comment","manigandham","2018-11-08 04:00:20.000000000","Not really what I was saying at all.
Timescale adds automatic partitioning to Postgres, a single-node rowstore relational database. This will naturally give you better performance for larger data (whether time-series or not).
This will not approach the performance and scalability of a fully distributed relational column-oriented database like Clickhouse or MemSQL, because automatic partitioning is just one of many techniques they use for fast performance. There is nothing a special TSDB, or TSDB extension, can do that these databases cannot already do faster, while providing rich SQL and joins.",0,18404114,0,"[]","",0,"","[]",0 18404267,0,"comment","manigandham","2018-11-08 04:08:25.000000000","I think Clickhouse would do well but I've seen other metrics/observability vendors (like Honeycomb) also build their own systems given the scale and cost factors.
Isn't Datadog on AWS? If you have very specific needs and can build a vertical infrastructure stack then it makes perfect sense to build your own.",0,18403718,0,"[18405578]","",0,"","[]",0 18404271,0,"comment","manigandham","2018-11-08 04:10:23.000000000","Clickhouse, MemSQL, Redshift, MapD, Kinetica, etc.
If you just want rollups and don't care about every row, then look at Druid (or imply.io for a startup making it easier).
All these systems can delete old data very quickly, as they just delete entire compressed partition files.",0,18404234,0,"[18404378,18404322]","",0,"","[]",0 18404324,0,"comment","manigandham","2018-11-08 04:29:14.000000000","Compared to what? Economically viable is very vague and relative. Columnar storage can easily reach 90% compression levels, is faster to read, and vectorized processing beats per-row/record iteration, so there's a reason it's the best for OLAP currently.
Why not benchmark IronDB against Clickhouse and post the results?",0,18404255,0,"[18404539]","",0,"","[]",0 18404607,0,"comment","manish_gill","2018-11-08 06:06:05.000000000","As I was reading through the post I kept wondering why they weren't using some warehousing technique for older data - either dump it to S3 or better yet, Google BigQuery, which is amazingly fast at that scale. They only did it after doing lots of fire-fighting and per-tenant clusters.
Clickhouse would also be a good option for doing aggregating queries that TSDBs are mostly used for.
One of my wishlist items in the data space is a Managed Clickhouse offering. :-)",0,18402890,0,"[]","",0,"","[]",0 21970952,0,"story","hodgesrm","2020-01-06 17:23:30.000000000","",0,0,0,"[21973580,21971228,21972398,21976779,21977576,21974126,21973697,21972879,21979089,21973787,21983718,21977335,21971546,21973207,21974145,21971459,21986381]","https://www.altinity.com/blog/2020/1/1/clickhouse-cost-efficiency-in-action-analyzing-500-billion-rows-on-an-intel-nuc",216,"ClickHouse cost-efficiency in action: analyzing 500B rows on an Intel NUC","[]",86 21971228,0,"comment","jgrahamc","2020-01-06 17:51:49.000000000","We use ClickHouse extensively and it's been great: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...",0,21970952,0,"[21971842]","",0,"","[]",0 21971665,0,"comment","polskibus","2020-01-06 18:32:30.000000000","Heavily optimized column store, incl use of SSE instructions, etc. Moreover, some architectural tradeoffs mentioned at https://clickhouse.yandex/docs/en/introduction/features_cons...",0,21971459,0,"[]","",0,"","[]",0 21971842,0,"comment","chupasaurus","2020-01-06 18:50:36.000000000","You've lost a Russian IT meme in translation.
In Russian, we use the verb "to brake" to describe something that works slowly. There was a long history of threads about Java performance, and the meme solidified after news from the 2005 DARPA Grand Challenge (a racing competition for autonomous self-driving cars): a car named Tommy, by the Jefferson Team, which was running Java under Linux, didn't use its brakes before a turn and crashed into the wall at 70 mph. Hence "Java runs fast" and "Java doesn't use brakes" were both described perfectly by the same sentence.
Yandex used the meme to advertise Clickhouse for engineers.
edit: formatting
edit2: brakes instead of breaks, wtf",0,21971228,0,"[21971971,21971951,21971938]","",0,"","[]",0 21971971,0,"comment","PeterZaitsev","2020-01-06 19:01:33.000000000","I think this is poor translation. In Russian "Тормозить" may mean to "use brakes" when applied to a car or just "be slow" when applied to a program (or a person). "MySQL сегодня тормозит" would mean MySQL is acting slow today, not what it is using brakes. So that meme I think is best translated as "Clickhouse is never slow" or "Clickhouse never acts slow"",0,21971842,0,"[]","",0,"","[]",0 21972280,0,"comment","hodgesrm","2020-01-06 19:31:33.000000000","Mat views are great as the article showed. I use them to get query response down to milliseconds, as they vastly reduce the amount of data ClickHouse must scan.
That said, there are a lot of other tools: column storage, vectorized query execution, efficient compression including column codecs, and skip indexes, to name a few. If you only have a few billion rows it's still possible to get sub-second query results using brute-force scans.
Disclaimer: I work for Altinity, who wrote this article.",0,21971686,0,"[21972794]","",0,"","[]",0 21972398,0,"comment","patelh","2020-01-06 19:41:32.000000000","Not exactly a good comparison if you don't generate the data the same way for the test setup. Your generated data is more compressible by clickhouse, which skews the comparison. Would have been better to not change the test data if you wanted to do a comparison.",0,21970952,0,"[21972658,21972764,21974114,21977409]","",0,"","[]",0 21972764,0,"comment","hodgesrm","2020-01-06 20:14:53.000000000","The important difference is that we used a more realistic temperature profile, which as you say does affect compression for that column. Schema design (including sort order, compression, and codecs) for the remaining columns is just good ClickHouse practice. Much of the storage and I/O savings is in the date, time, and sensor_id columns.
It's also useful to note that the materialized view results would be essentially the same no matter how you generate and store data because the materialized view down-samples temperature max/min to daily aggregates. The data are vastly smaller no matter how you generate them.
The article illustrates that if you really had such an IoT app and designed it properly you could run analytics with surprisingly few resources. I think that's a significant point.",0,21972398,0,"[21973222,21972843]","",0,"","[]",0 21972879,0,"comment","codexon","2020-01-06 20:24:02.000000000","One thing I haven't seen anyone note about clickhouse though which would be really important to many for data durability, is that it does not use fsync anywhere at all.",0,21970952,0,"[21974265,21973150,21973779,21974410]","",0,"","[]",0 21973150,0,"comment","caust1c","2020-01-06 20:49:55.000000000","It's pretty clearly laid out in the docs. Hopefully anyone seriously considering using Clickhouse reads the docs thoroughly and understands what they're implementing.",0,21972879,0,"[21973567]","",0,"","[]",0 21973567,0,"comment","codexon","2020-01-06 21:32:13.000000000","What do you mean clearly laid out? This is the only mention of fsync I could find through google or their own search function.
https://clickhouse.yandex/docs/en/operations/settings/settin...",0,21973150,0,"[21974186]","",0,"","[]",0 21973580,0,"comment","ofek","2020-01-06 21:33:06.000000000","Hey, Ofek from Datadog here!
I recently implemented our ClickHouse integration [1], so if any of you would like to try it out we would appreciate feedback. I really enjoyed learning about this database, and it has excellent docs :)
Oh fun fact, speaking of docs, this was the first integration of ours that we scrape docs for as part of the test suite. So when a new built-in metric is added it will fail our CI until we support it [2]. We just did this again for Apache Airflow [3].
[1]: https://github.com/DataDog/integrations-core/pull/4957
[2]: https://github.com/DataDog/integrations-core/pull/5233
[3]: https://github.com/DataDog/integrations-core/pull/5311",0,21970952,0,"[21974584]","",0,"","[]",0 21973624,0,"comment","endymi0n","2020-01-06 21:37:33.000000000","I'd argue that ClickHouse isn't even that fast (compared to comparable technology like Snowflake, Redshift or BigQuery); the ScyllaDB example is actually completely misleading. Scylla is probably one of the fastest OLTP datastores, yet they're benchmarking an analytics query — which is pretty easy to crack for any columnar datastore.
The actual point here is that you can execute millions of (different!) individual queries per second on ScyllaDB, which beats any columnar datastore hands down. ClickHouse "cheated" here by translating the (unfortunate) benchmark setup into a single query that's extremely heavily optimized under the hood.",0,21971459,0,"[21973735,21973919]","",0,"","[]",0 21973735,0,"comment","PeterZaitsev","2020-01-06 21:47:13.000000000","Actually while ClickHouse does not have all features of RedShift, BigQuery etc it usually is much faster than them. It can be slower on some workloads on GPU powered systems, when all data fits in GPU memory but it is not the use case it targets.
ScyllaDB is amazing when it comes to OLTP performance, but not analytical workloads.
I think they took pretty mediocre Analytical Workload results and shared them as something outstanding.",0,21973624,0,"[21978109]","",0,"","[]",0 21973787,0,"comment","atombender","2020-01-06 21:52:02.000000000","Is ClickHouse good for event data when you want to do rollups? For example, say all my events are of the form:
{event: "viewedArticle", article_id: 63534, user_id: 42, topic: "news", time: "2020-01-06"}
I want to be able to build aggregations which shows number of "viewedArticle" events grouped by hour, grouped by topic, counting unique user_ids within each bucket.
Or let's say I want the top K articles viewed each day, filtered by a topic.
That's something that's trivial with Elasticsearch, which has a hierarchical aggregation DSL. Is ClickHouse good at this?
Whenever I see time-series databases such as InfluxDB mentioned, they look like they're focused on measurements, not discrete rows. You can attach the event data as "labels", but this isn't efficient when the cardinality of each column is very high (e.g. article IDs or user IDs in the above example).",0,21970952,0,"[21973958]","",0,"","[]",0 21973857,0,"comment","FridgeSeal","2020-01-06 21:58:31.000000000","For what it’s worth, I’ve used Clickhouse and Snowflake and I strongly prefer Clickhouse.
Performance was superior, the client libraries and built-in HTTP interface were a godsend, and it supported geospatial queries. I had perpetual issues getting Snowflake to properly escape strings in CSV or handle JSON in anything approaching a sensible way; there are claims that it integrates properly with Kafka as a consumer, but it most certainly does not. The UX is horrible to boot.",0,21973207,0,"[21973993]","",0,"","[]",0 21973958,0,"comment","manigandham","2020-01-06 22:05:32.000000000","Yes. Clickhouse is a column-oriented relational database among many others like MemSQL, Vertica, Redshift, BigQuery, Snowflake, Greenplum, etc. They're all focused on analytical queries over very large datasets using SQL.
An aggregation with several `group by` statements is no challenge and all of these databases also support approximate counting via HyperLogLog for faster results.
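To make that concrete, here is a plain-Python sketch of the aggregations the question asked for (this is not ClickHouse syntax; the event shape and field names are invented to match the hypothetical example upthread):

```python
from collections import Counter, defaultdict

# Hypothetical events shaped like the example upthread (field names invented).
events = [
    {"article_id": 63534, "user_id": 42, "topic": "news",   "hour": "2020-01-06T10"},
    {"article_id": 63534, "user_id": 43, "topic": "news",   "hour": "2020-01-06T10"},
    {"article_id": 99,    "user_id": 42, "topic": "sports", "hour": "2020-01-06T11"},
]

# GROUP BY (hour, topic) counting unique user_ids -- the work an exact
# distinct count (or a HyperLogLog approximation) does server-side in one pass.
unique_users = defaultdict(set)
for e in events:
    unique_users[(e["hour"], e["topic"])].add(e["user_id"])
counts = {bucket: len(users) for bucket, users in unique_users.items()}

# Top-K articles for one topic: the equivalent of ORDER BY views DESC LIMIT k.
views = Counter(e["article_id"] for e in events if e["topic"] == "news")
top_k = views.most_common(1)
```

In SQL this is just a GROUP BY with a distinct-count aggregate plus an ORDER BY/LIMIT; the database does the same bucketing, only vectorized and over compressed columns.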
Clickhouse has some unique features where each table can have a separate 'engine' including some that automatically apply aggregations. Start with a normal table though since it'll be plenty fast enough for most use cases.",0,21973787,0,"[21974131]","",0,"","[]",0 21974068,0,"comment","bdcravens","2020-01-06 22:15:36.000000000","Clickhouse is also crazy fast without materialized views - I've only done some PoC's against it, but in loading a largish data set of raw invoice CSVs, I was very impressed with the performance compared to our standard RDBMS.",0,21972129,0,"[21977203]","",0,"","[]",0 21974131,0,"comment","atombender","2020-01-06 22:23:10.000000000","Thanks! Looks like the only downside is that, as it returns rows as results, you end up getting a lot of duplicate column data back and need to "nest" the nested buckets yourself.
For example, a result like:
topic;time;count
news;2020-01-01;44
news;2020-01-02;31
Now you have "news" repeated, and to group this into buckets for rendering summary tables and such (with subtotals at each level), you need to iterate through the flattened results and generate nested structures. This is something Elasticsearch gives you out of the box. Last I looked at Clickhouse, it had master/slave replication only, and if you want shards of data distributed across a cluster it's something you need to manually manage?",0,21973958,0,"[21974632,21974181]","",0,"","[]",0 21974181,0,"comment","manigandham","2020-01-06 22:28:22.000000000","Right, relational databases only return flat tabular results but that seems minor compared to the performance increase you gain.
Clickhouse is fast but not as operationally friendly as the others. It's much more work once you go beyond a single node, so I'd suggest looking at those other options if you want something easier to operate, or use one of the cloud data warehouses like Bigquery or Snowflake to eliminate ops entirely.",0,21974131,0,"[]","",0,"","[]",0 21974186,0,"comment","caust1c","2020-01-06 22:29:02.000000000","The title of the page might be a little snarky, but it's in the introduction that transactional queries are not supported:
https://clickhouse.yandex/docs/en/introduction/features_cons...
Sure it's not specifically about `fsync` but presumably this is what the consumer of the database actually wants to know.",0,21973567,0,"[21975707]","",0,"","[]",0 21974205,0,"comment","caust1c","2020-01-06 22:31:16.000000000","The implication is that clickhouse can't easily support transactional queries. That's why it's an OLAP not OLTP database. (On-Line Analytics Processing vs On-Line Transaction Processing).",0,21973779,0,"[21975731,21975209]","",0,"","[]",0 21974265,0,"comment","DasIch","2020-01-06 22:37:52.000000000","I can't find anything about this in the docs except[1]. I also can't find any issues in their bug tracker related to clickhouse not using fsync[2].
I can however find code that actually calls fsync[3][4]. To be fair I haven't read enough to determine how this (doesn't) affect durability. Nevertheless I'm wondering do you have a source for this claim?
[1]: https://clickhouse.yandex/docs/en/operations/settings/settings/#fsync-metadata
[2]: https://github.com/ClickHouse/ClickHouse/search?q=fsync&type=Issues
[3]: https://github.com/ClickHouse/ClickHouse/blob/355b1e5594119e036a2d62988bfa42bc8b1a1687/dbms/src/IO/WriteBufferFromFileDescriptor.cpp#L113
[4]: https://github.com/ClickHouse/ClickHouse/blob/e765733a26cfc4cecc13c981686560338256a6b1/dbms/src/IO/WriteBufferAIO.cpp#L98
",0,21972879,0,"[21975810]","",0,"","[]",0
21974316,0,"comment","DasIch","2020-01-06 22:43:20.000000000","When you write to a file, you generally don't write to physical storage. Instead the writes get buffered in memory and written to physical storage in batches. This substantially improves performance but creates a risk: if there is some sort of outage before the data is flushed to disk, you might lose data. In order to address that risk, you can explicitly force data to be written to disk by calling fsync. Databases generally do this to ensure durability and only signal success after fsync succeeded and the data is safely stored.
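As a minimal sketch of that trade-off (plain Python, nothing ClickHouse-specific; `durable_write` is a made-up helper name):

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data and force it onto physical storage before returning.

    Without the os.fsync call, a successful return only means the bytes
    reached the OS page cache; a power loss could still discard them.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flush this file's dirty pages to the device
    finally:
        os.close(fd)
```

Skipping the fsync line is exactly the performance-for-durability trade being discussed: writes return as soon as they hit the page cache.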
So ClickHouse not calling fsync implies that it might lose data in case of a power outage or a similar event.",0,21973779,0,"[21987159,21975412]","",0,"","[]",0 21974632,0,"comment","hodgesrm","2020-01-06 23:17:22.000000000","> Now you have "news" repeated, and to group this into buckets for rendering summary tables and such (with sub totals at each level), you need to iterate through the flattened results and generate nested structures. This is something Elasticsearch gives you out of the box.
ClickHouse has a number of optimizations for solving the 'visitor' problems you describe. Assuming you just want to group in different ways, an idiomatic ClickHouse solution is to construct a materialized view that aggregates counts (e.g., of unique users via uniq(user)). You can then select from the materialized view and further aggregate to have larger buckets. ClickHouse can also compute single-level totals using the WITH TOTALS modifier.
If you need to have cascading sub-totals within the same listing as far as I know you'll have to compute the totals yourself. (That feature actually might be an interesting pull request since ClickHouse generates JSON output.)
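For what it's worth, that client-side nesting is only a few lines in any language; a Python sketch (row shape invented to match the earlier topic;time;count example):

```python
from collections import defaultdict

# Flat rows as a SQL engine returns them.
rows = [
    ("news", "2020-01-01", 44),
    ("news", "2020-01-02", 31),
    ("sports", "2020-01-01", 7),
]

# One pass: nest per-day counts under each topic and keep a running subtotal,
# producing {topic: {"days": {date: count}, "total": subtotal}}.
nested = defaultdict(lambda: {"days": {}, "total": 0})
for topic, day, count in rows:
    nested[topic]["days"][day] = count
    nested[topic]["total"] += count
```

The same single pass extends to deeper hierarchies by adding another level of dict per grouping key.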
> Last I looked at Clickhouse, it had master/slave replication only, and if you want shards of data distributed across a cluster it's something you need to manually manage?
ClickHouse replication is multi-master. The model is eventually consistent. Also, ClickHouse can automatically shard INSERTs across a cluster using distributed tables. That said, many users insert directly to local nodes because it's faster and uses fewer resources.",0,21974131,0,"[21974963]","",0,"","[]",0 21975373,0,"comment","hodgesrm","2020-01-07 00:40:33.000000000","That's a sticker from a ClickHouse community event, not related to the benchmark. We tend to stick them on anything flat. My ancient Dell XPS-13 has one. It's definitely not that fast.
That said, the sticker is from a real performance test. I assume it was a cluster but don't have details. ClickHouse query performance is outstanding--it's not hard to scan billions of rows per second on relatively modest hosts. These are brute force queries on source data, no optimization using materialized views or indexes.
For instance, I have an Amazon md5.2xlarge with 8 vcpus, 32 GB of RAM, and EBS GP2 storage rated at 100 iops. I can compute average passengers on the benchmark NYC taxi cab dataset [1] in .551 seconds using direct I/O. The throughput is 2.37B rows/sec.
ClickHouse is so fast on raw scans that many production users don't even use materialized views. I mostly use them to get responses down to small numbers of milliseconds for demos.
[1] https://tech.marksblogg.com/benchmarks.html",0,21974967,0,"[]","",0,"","[]",0 21975658,0,"comment","manigandham","2020-01-07 01:11:31.000000000","Yes it has geospatial support. Variant columns are better than every other database so far. Redshift and Bigquery just have a text field and require far more verbose commands to operate and cast. It sounds like most of your issues are with importing and exporting data rather than querying it?
Snowflake is basically EC2 servers reading files from S3, so you get more bandwidth with a larger warehouse size, but that's a fundamental limit and it will have much higher latency compared to running on a local SSD with clickhouse. Lack of streaming is a known problem. They actually do have an HTTP interface, you just don't see it; that's how all the ODBC drivers are implemented (with HTTP calls and JSON data in the background).
If your data fits or you don't mind the operational overhead of running your own data warehouse then it's almost always a cheaper and faster option.",0,21975193,0,"[]","",0,"","[]",0 21975731,0,"comment","codexon","2020-01-07 01:21:07.000000000","This is not the implication at all.
Clickhouse could easily add fsync; they just choose not to.
Mongodb also did not use fsync and was ridiculed for it, yet no one mentions this about clickhouse.",0,21974205,0,"[21978559,21979034]","",0,"","[]",0 21975810,0,"comment","codexon","2020-01-07 01:34:27.000000000","As I mentioned, there's only 1 place where it says anything about fsync, and in that page, it says that is only for creating .sql files.
https://groups.google.com/d/msg/clickhouse/cjJ6v8uzu0Q/jGV59...
> The reason is because CH does not use fsync (for performance)
https://www.linkedin.com/in/dzhuravlev/",0,21974265,0,"[]","",0,"","[]",0 21976779,0,"comment","avisk","2020-01-07 04:16:28.000000000","At Sematext we replaced our HBase based metrics datastore with Clickhouse. We are happy with the performance gain and flexibility. We also added support for Clickhouse monitoring - https://sematext.com/blog/clickhouse-monitoring-sematext/",0,21970952,0,"[]","",0,"","[]",0 21977203,0,"comment","tumanian","2020-01-07 05:30:27.000000000","That sounds like a non-canonical use of clickhouse. Wouldn't a good RDBMS be a better fit for invoice data? This is on the surface, of course; I'm really interested in what this invoice data is like and what queries you are trying to run on it.",0,21974068,0,"[]","",0,"","[]",0 21977344,0,"comment","xs83","2020-01-07 06:00:17.000000000","https://github.com/ClickHouse/ClickHouse/pull/8430
Just to answer my own question - this looks good - I might have to try it out!",0,21977335,0,"[]","",0,"","[]",0 21977576,0,"comment","subhajeet2107","2020-01-07 06:43:01.000000000","We use Clickhouse extensively at work and boy is it better than anything I have used in column-oriented databases so far. Documentation is good, the HTTP query interface and features such as built-in URL parsing are amazing. We also tested Druid and found Clickhouse to be better; it is easier to set up and maintain as well",0,21970952,0,"[]","",0,"","[]",0 21978103,0,"comment","lmeyerov","2020-01-07 08:27:36.000000000","Yeah I was confused, where I couldn't tell what was precomputed stats (col min/max/count), view calcs, and what's actual perf -- even legacy SQL vendors do all those. That's apples/oranges, more of a statement against the other db vs for clickhouse. Likewise, the db comparison I'd like to see is _other_columnar_stores_.
I know some folks running one of the larger clickhouse instances out there... but this article made me trust the community less, not more.",0,21971686,0,"[]","",0,"","[]",0 21978559,0,"comment","pritambaral","2020-01-07 10:06:06.000000000","> Mongodb also did not use fsync and was ridiculed for it, yet no one mentions this about clickhouse.
MongoDB claimed to be a replacement for RDBMS-es (which includes OLTP). ClickHouse is explicit about being OLAP-only. MongoDB also hid the fact that they weren't doing fsync, especially when showing off "benchmarks" against OLTP RDBMS-es, while ClickHouse has not tried to show themselves as a replacement for OLTP RDBMS-es.
> Clickhouse can easily add fsync, they just choose not to do it.
For good reason. It's not a simple matter of choosing one of two options. The choice has consequences: performance.",0,21975731,0,"[21984836,21984551]","",0,"","[]",0 21978658,0,"comment","jayleeg","2020-01-07 10:24:21.000000000","I've worked with MPP DBs, Hadoop, Spark, ElasticSearch, Druid, kdb, DolphinDB and now ClickHouse and performance wise it's all true - in our case ClickHouse was 10-20x faster than Spark and used 4x less memory. I've seen it outperform the fastest commercial timeseries stores by 2x.
This will make me unpopular but my conclusion is that the file-based data lake, splitting data from compute, is not the right approach in many (not all) cases and that Spark was not really that revolutionary. I would go as far as to say that the direction data has taken has been a failure and ClickHouse and such come closer to solving the real problem of 'BigData'.
So two things here about 'loading'...
1) ClickHouse table/data files are completely portable (like Parquet) and can be moved from one server to another, copied or cloned etc.. there is even a mechanism to allow remote execution or to pull just the files from a remote server or an S3 store etc.. Just because the CH native file format isn't spoken about in the same circles as Parquet and ORC doesn't mean it can't be treated the same way if thats your thing. The CH native format is far more performant/compressible than Parquet or ORC and the specification is Open Source. Someone could implement a CH native file format serdes for Hive for example.
2) In this instance they were generating the data so no different to running Spark and writing to a Parquet file and running analytics on it later. Spark can't write / generate this amount of data in this amount of time on these resources and write out / compress the data to Parquet or whatever other preferred format. I've tried.
ClickHouse isn't perfect and I'm not affiliated with the Altinity guys but I can tell you this is the real deal.",0,21977335,0,"[22043472]","",0,"","[]",0 21978754,0,"comment","jayleeg","2020-01-07 10:43:47.000000000","But they didn't set the temperature reading to anything that would advantage their tests. Without access to the original data they simply generated a dataset as close to the original dataset and volume as possible. The fact they took a few sentences talking about the temperature doesn't equate to invalidating the test.
Looking at this your way - Scylla used an INT, Altinity used a Decimal type with specialized compression (T64). I can tell you that this would have hampered ClickHouse and advantaged Scylla. It's the opposite of what you're saying. They actually performed this benchmark with one arm tied behind their back.
It's a funny benchmark anyway because the two systems have very different use cases but it doesn't invalidate the result.",0,21973222,0,"[]","",0,"","[]",0 21979034,0,"comment","jayleeg","2020-01-07 11:38:45.000000000","To add to pritambaral comments.
The top commercial high-performance time-series databases (which ClickHouse can usually best), used by banks to make decisions about your money, also don't use fsync. You can literally quit the software and watch your transaction data be written out 5 seconds later.
Edit: a word",0,21975731,0,"[]","",0,"","[]",0 21979089,0,"comment","pachicodev","2020-01-07 11:50:00.000000000","We have been using ClickHouse in production for some months already and we find it a real game changer. We're running queries over 3B rows for business intelligence purposes.
As a side project, a group of friends and I are working on a simple web analytics project powered by ClickHouse (what it was originally built for). If anyone wants to contribute, just let me know.
Cheers",0,21970952,0,"[]","",0,"","[]",0 21979127,0,"comment","zX41ZdbW","2020-01-07 11:56:27.000000000","You can also use ClickHouse to query files directly - with clickhouse-local tool.
Example: https://www.altinity.com/blog/2019/6/11/clickhouse-local-the...",0,21977335,0,"[]","",0,"","[]",0 21979694,0,"comment","zX41ZdbW","2020-01-07 13:29:50.000000000","There is a video named "The Secrets of ClickHouse Performance Optimizations":
https://youtu.be/ZOZQCQEtrz8",0,21971459,0,"[]","",0,"","[]",0 21983718,0,"comment","thekozmo","2020-01-07 19:28:51.000000000","ClickHouse achieved good (great) results. However, it's a bad comparison. Clickhouse is an analytics DB while Scylla is a realtime, random access one. /me am Scylla co-founder.
We could stack 100k rows in a single partition and be 1000x faster in this use case than the performance we demonstrate but we wanted to keep it real. Actually the use case we wanted to show is a single row per partition which would require more machines but surprisingly we couldn't provision that many on AWS.
The use case presented by ClickHouse is 100x slower on writes (8M rows/s), as they report. It doesn't matter since it's just a completely different use case. Use Clickhouse for analytics (I wonder why they stop at SSE and don't go all the way to the GPU like SqreamDB) and use Scylla for OLTP",0,21970952,0,"[21986750]","",0,"","[]",0 21984836,0,"comment","codexon","2020-01-07 21:05:43.000000000","I can't find any evidence showing that OLAP means it is okay to lose data from unexpected shutdowns. How can you have correct analytics without a complete set of data?
> For good reason. It's not a simple matter of choosing one of two options. The choice has consequences: performance.
It is a simple matter though. They can choose to sacrifice performance for data durability - and I suspect performance would not be impacted very much, since clickhouse acts like an append log. It just seems that Yandex doesn't care much for durability since they are just using the database to store people's web traffic. They wouldn't care if some of that data is lost, so they don't use fsync.",0,21978559,0,"[22100368]","",0,"","[]",0 21986381,0,"comment","valyala","2020-01-07 23:23:36.000000000","VictoriaMetrics core developer here.
The performance numbers from ClickHouse running on Intel NUC are impressive! We are going to publish VictoriaMetrics performance numbers for the original Billy benchmark from ScyllaDB [1] running on the same hardware from packet.com . Initial results are quite promising [2], [3].
[1] https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one...
[2] https://mobile.twitter.com/MetricsVictoria/status/1209116702...
[3] https://mobile.twitter.com/MetricsVictoria/status/1209186575...",0,21970952,0,"[]","",0,"","[]",0 21986543,0,"comment","valyala","2020-01-07 23:44:15.000000000","Btw, it should be easy to add fsync to ClickHouse. For instance VictoriaMetrics uses a similar file format to ClickHouse, and it issues proper fsyncs at least every second, so it may lose only the last second of data on an unclean shutdown such as OOM, hardware reset or `kill -9`. [1], [2].
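For reference, the "fsync at least every second" idea reads roughly like this (a toy Python sketch of the approach; class and method names are mine, not VictoriaMetrics code):

```python
import os
import threading

class PeriodicFsyncWriter:
    """Append-only writer that fsyncs at most once per interval.

    On a crash, at most the last `interval` seconds of appends are lost.
    Illustrative sketch only; not actual VictoriaMetrics/ClickHouse code.
    """

    def __init__(self, path, interval=1.0):
        self._f = open(path, "ab")
        self._lock = threading.Lock()
        self._dirty = False
        self._interval = interval
        self._stop = threading.Event()
        self._t = threading.Thread(target=self._flusher, daemon=True)
        self._t.start()

    def append(self, data: bytes):
        with self._lock:
            self._f.write(data)   # buffered write, cheap
            self._dirty = True

    def _flusher(self):
        # Wake up once per interval and sync whatever accumulated.
        while not self._stop.wait(self._interval):
            self._sync()

    def _sync(self):
        with self._lock:
            if self._dirty:
                self._f.flush()
                os.fsync(self._f.fileno())  # one fsync covers many appends
                self._dirty = False

    def close(self):
        self._stop.set()
        self._t.join()
        self._sync()  # final sync so a clean shutdown loses nothing
        self._f.close()
```

The point is that the cost of one fsync is amortized over every append made in the interval, instead of being paid per write.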
[1] https://medium.com/@valyala/wal-usage-looks-broken-in-modern...
[2] https://medium.com/@valyala/how-victoriametrics-makes-instan...",0,21975707,0,"[]","",0,"","[]",0 21986750,0,"comment","jayleeg","2020-01-08 00:11:10.000000000","The ScyllaDB one is a bit funny anyway as it doesn't really target analytical workloads. On SSE/GPUs - the ClickHouse guys don't use GPUs today (GPUs are on the roadmap for next year) as their workloads target volumes greater than GPU memory. If your hot dataset sits totally in GPU memory then it makes sense for some things otherwise they found the cost/performance ratio doesn't add up after you paginate in/out. I don't doubt GPU based DB perf numbers but cost is the main factor.
Now just to clarify - you're saying Scylla writes are 100x faster on the same hardware as ClickHouse (so 800M row/s on a NUC). Using the same code that Altinity used I manage around 25M rows/s on my home PC (8 cores/16HT) and elsewhere in this thread the guys from VictoriaMetrics pulled in 53M rows/s on a single node with 28 cores/56 threads (probably doable with ClickHouse on similar hardware I'd suspect).
I'm going to test this with Scylla on my home PC to validate your 800M row/s claim and I'll post about it - I should be able to hit around 2.5 billion rows/s with Scylla if what you've said is true. I've had CH write 300M row/s on my 8 core box using memory buffered tables but that was only at burst.",0,21983718,0,"[]","",0,"","[]",0 21987159,0,"comment","hodgesrm","2020-01-08 01:03:14.000000000","Most ClickHouse installations run replication for availability and read scaling. If you do get corrupted data for some reason, you can read it back from another replica. That's much more efficient than trying to fsync transactions, especially on HDD. The performance penalty for fsyncs can be substantial and most users seem to be pleased with the trade-off to get more speed.
This would obviously be a poor trade-off for handling financial transactions or storing complex objects that depend on referential integrity to function correctly. But people don't use ClickHouse to solve those problems. It's mostly append-only datasets for analytic applications.",0,21974316,0,"[]","",0,"","[]",0 21997349,0,"comment","rektide","2020-01-09 00:34:30.000000000","I enjoyed the post. Good links to a lot of relevant, recent stories & events.
Not the article's fault, but it cites the "ClickHouse Cost-Efficiency in Action: Analyzing 500 Billion Rows on an Intel NUC" article that was published January 1. It's a week old, & I kind of feel like I'm never going to get away from it. It seems like a great, fun, interesting premise, but the authors took what is a challenging, huge data-set, and, under the guise of making the data look "realistic", they drained all the entropy out of the dataset, & then claimed they were 10-100x faster.
Well, yes, maybe for some workloads maybe. Maybe the changes they made might in some circumstances be "realistic" for some IoT use cases, maybe.
But I feel like I'm going to see this article come up again, and again, and again. And each time, I'll have these frustrations, about how while they may still be running queries on the same number of rows, they are running queries on many orders of magnitude less data. It's a fun read, & genuinely useful- in some circumstances- tech, but I don't expect to see this nuance showing up. I'm already weary, seeing this Clickhouse article again.",0,21995942,0,"[21998895,21997661]","",0,"","[]",0 21998895,0,"comment","manigandham","2020-01-09 06:00:36.000000000","The difference between rowstores (Scylla/Cassandra) and columnstores (Clickhouse) comes down to the physical layout of data with batch/vectorized processing and other techniques.
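The layout difference can be seen in miniature (a toy illustration, not ClickHouse or Scylla internals): scanning one column out of a row-oriented store touches every record, while a column store keeps each column contiguous and scans only what the query needs.

```python
# Toy illustration of row-store vs column-store layout (not real DB code).
rows = [
    {"ts": 1, "sensor": 7, "temp": 20.5},
    {"ts": 2, "sensor": 7, "temp": 21.0},
    {"ts": 3, "sensor": 8, "temp": 19.5},
]

# Row store: one record per row; a scan of `temp` must walk every dict.
avg_row = sum(r["temp"] for r in rows) / len(rows)

# Column store: each column is a contiguous array; a scan of `temp`
# touches only that array and can be processed in tight batches.
columns = {
    "ts": [1, 2, 3],
    "sensor": [7, 7, 8],
    "temp": [20.5, 21.0, 19.5],
}
avg_col = sum(columns["temp"]) / len(columns["temp"])

assert avg_row == avg_col  # same answer, very different memory traffic
```

At billions of rows the contiguous layout is what enables the vectorized, cache-friendly scans being discussed.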
There will always be a 1-2 magnitude increase in performance regardless of the data. They also used the same number of rows, except with smaller cardinality in measurements which would make an insignificant speed difference.",0,21997349,0,"[]","",0,"","[]",0 22014040,0,"comment","zepearl","2020-01-10 19:01:39.000000000","Concerning the BTRFS fs:
I did use it as well many years ago (probably around 2012-2015) in a raid5-configuration after reading a lot of positive comments about this next-gen fs => after a few weeks my raid started falling apart (while performing normal operations!) as I got all kinds of weird problems => my conclusion was that the raid was corrupt and it couldn't be fixed => no big problem as I did have a backup, but that definitely ruined my initial BTRFS-experience. During those times, even if the fs was new and even if there were warnings about it (being new), everybody was very optimistic/positive about it, but in my case that experiment was a disaster.
That event has held me back until today from trying to use it again. I admit that today it might be a lot better than in the past, but as people were already positive about it in the past (and then in my case it broke), it's difficult for me now to say "aha - now the general positive opinion is probably more realistic than in the past", due e.g. to that bug that can potentially still destroy a raid (the "write hole" bug): personally I think that if BTRFS still makes that raid functionality available while it has such a big bug, while at the same time advertising it as a great feature of the fs, the "unrealistically positive" behaviour is still present, therefore I still cannot trust it. Additionally, that bug being open since forever makes me think that it's really hard to fix, which in turn makes me think that the foundation and/or code of BTRFS is bad (which is the reason why that bug cannot be fixed quickly) and that therefore potentially in the future some even more complicated bugs might show up.
Concerning alternatives:
I have been writing and testing for a looong time a program which ends up creating a big database (using "Yandex Clickhouse" for the main DB) distributed on multiple hosts, where each one uses multiple HDDs to save the data, and that at the same time is able to fight against potential "bitrot" ( https://en.wikipedia.org/wiki/Data_degradation ) without having to resync the whole local storage each time a byte on some HDD loses its value. Excluding BTRFS, the only other candidate that I found is ZFSoL, which performs checksums on data (both XFS and NILFS2 do checksums, but only on metadata).
Excluding BTRFS because of the reasons mentioned above, I was left only with ZFS.
I've now been using ZFSoL for a couple of months and so far everything has gone very well (a bit difficult to understand & deal with at the beginning, but extremely flexible) and performance is good as well (but to be fair that's easy in combination with the Clickhouse DB, as the DB itself already writes data in a CoW way, therefore blocks of a table stored on ZFS are always very likely to be contiguous).
On one hand, technically, now I'm happy. On the other hand I do admit that the problems with licensing and the non-integration of ZFSoL into the kernel do carry risks. Unfortunately I just don't see any alternative.
I do donate something monthly to https://www.patreon.com/bcachefs but I don't have high hopes - not much is happening, and BCACHE (even if currently integrated in the kernel) hasn't been very good in my experience (https://github.com/akiradeveloper/dm-writeboost worked A LOT better, but I'm not using it anymore as I no longer have a use case for it, and it was a risk as well, not being included in the kernel), therefore BCACHEFS might end up being the same.
Bah :(",0,22012315,0,"[]","",0,"","[]",0 22023247,0,"comment","champtar","2020-01-11 23:39:19.000000000","One of the best 2h practical course that I had was just write the fastest square matrix multiplication. You could use any language, any algorithm, just no libraries. The target was a 32 core CPU server (this was ~10 years ago). At 5000x5000 all the Java and Python attempts were running out of memory. In C, We tried some openmp, some optimized algorithm, but in the end the best trick was to flip one of the matrix so that memory could be always prefetched. Out of curiosity another student tried GNU Scientific Library, it turned out to be ~100 times faster. My take away was find the right tool for the job!
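The "flip one of the matrices" trick can be sketched like this (illustrative Python rather than the original C; in CPython the win is mostly pedagogical, but in C the transposed version lets the hardware prefetcher stream both operands sequentially):

```python
def matmul_naive(a, b):
    """C[i][j] = sum_k A[i][k] * B[k][j]; the inner loop strides down
    a column of B, which is cache-unfriendly in row-major layout."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def matmul_transposed(a, b):
    """Same product, but B is flipped (transposed) once up front so the
    inner loop reads both operands sequentially in memory."""
    n, m, p = len(a), len(b), len(b[0])
    bt = [[b[k][j] for k in range(m)] for j in range(p)]  # flip B once
    return [[sum(x * y for x, y in zip(a[i], bt[j])) for j in range(p)]
            for i in range(n)]
```

The O(n^2) cost of the transpose is repaid over the O(n^3) multiply, which is why it paid off at 5000x5000.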
A fun read on cloud scale vs optimized code is this recent article comparing ClickHouse and ScyllaDB (https://www.altinity.com/blog/2020/1/1/clickhouse-cost-effic...)",0,22020796,0,"[22024212,22024922,22028107,22026239]","",0,"","[]",0 22031499,0,"story","antman","2020-01-13 03:00:30.000000000","",1,0,0,"[]","https://tech.marksblogg.com/fast-ip-to-hostname-clickhouse-postgresql.html",1,"Fast IPv4 Lookup – Postgres vs. Clickhouse","[]",0 22038283,0,"comment","1996","2020-01-13 20:35:37.000000000","Been there done that.
Selecting a database is the least of your worries. And you learn to live with the limitations - for example, in clickhouse you add a millisecond and a microsecond field.
Literally every solution listed will work, as the database will only be used to persist your data. Your trading will NOT use your database in any way except to load the data when you start (and potentially restart) your bots.
What really matters is 1) execution, 2) network performance and finally 3) good data
Language, database, etc. are just tools. This is a case of premature optimization.",0,22038005,0,"[]","",0,"","[]",0 18476655,0,"story","valyala","2018-11-17 18:21:38.000000000","",0,0,0,"[]","https://medium.com/@AltinityDB/clickhouse-for-time-series-be35342bf31d",3,"ClickHouse for Time Series","[]",0 18491672,0,"comment","lykr0n","2018-11-20 03:43:35.000000000","Yeah, but that doc assumes you've read and understood everything else in the docs. That document provides no help in understanding the database, its concepts, or how to implement them. I could try and reconcile that doc with the python tutorial and everything else, but I just want something where I can copy and paste and it works.
FoundationDB doesn't have a batteries included documentation like Clickhouse does: https://clickhouse.yandex/tutorial.html",0,18490417,0,"[]","",0,"","[]",0 18493427,0,"story","AltinityDB","2018-11-20 12:53:08.000000000","",0,0,0,"[]","https://www.altinity.com/blog/clickhouse-for-time-series",12,"ClickHouse for Time Series","[]",0 18505197,0,"story","jetter","2018-11-21 19:14:59.000000000","",0,0,0,"[]","https://pixeljets.com/blog/clickhouse-as-a-replacement-for-elk-big-query-and-timescaledb/",3,"Clickhouse as a Replacement for ELK, Big Query and TimescaleDB","[]",0 18516019,0,"story","AltinityDB","2018-11-23 12:30:43.000000000","",0,0,0,"[]","https://www.altinity.com/blog/2018/10/16/updates-in-clickhouse",1,"Updates and Deletes in ClickHouse","[]",0 22100368,0,"comment","pritambaral","2020-01-20 17:21:42.000000000","> I can't find any evidence showing that OLAP means it is okay to ...
OLAP also doesn't mean "be the source of truth of the data". You can have a separate source of truth of the "complete set of data" outside of your OLAP engine and load (and reload) data into your OLAP engine any time you're not sure if you have the "complete set of data" in it.
The important difference lies in how often one finds themselves in that situation. In OLAP, the sheer majority of the time is spent querying (i.e., reading) data than loading (i.e., writing) data and waiting for it to be durably saved (i.e., fsync-ed). Because of this imbalance, it makes sense to prioritise for one scenario and handle the other sub-optimally.
> They wouldn't care if some of that data is lost so they don't use fsync.
Or, they can still care about data correctness and simply re-load data they suspect is/may not be consistent in the rare case of an improper shutdown. It's not like they use ClickHouse as their primary data store.
I've loaded a 100 Billion Rows in into a 5 shard database and can do full queries across the whole dataset in under 10 seconds. It also natively consumes multiple kafka topics.",0,18553705,0,"[18601506,18553988,18554606]","",0,"","[]",0 18553988,0,"comment","haggy","2018-11-28 18:10:19.000000000","> I've loaded a 100 Billion Rows
Have you done any load tests that would more closely mirror a production environment such as performing queries while clickhouse is handling a heavy insert load?",0,18553933,0,"[18554009,18558758]","",0,"","[]",0 18554009,0,"comment","lykr0n","2018-11-28 18:12:03.000000000","I'm working on developing benchmarking tools for internal testing, but both Yandex and CloudFlare use Clickhouse for realtime querying. I'm still in development phase for my product, but I'll make sure to post information & results when we launch here.
https://blog.cloudflare.com/http-analytics-for-6m-requests-p...
But I've spent a long time looking at the various solutions out there, and while ClickHouse is not perfect, I think it's the best multi-purpose database out there for large volumes of data. TimescaleDB is another one, but until they get sharding it's dead on arrival.",0,18553988,0,"[18556251,18554030]","",0,"","[]",0 18554115,0,"comment","bsg75","2018-11-28 18:22:58.000000000","Still a bit messy, but the clickhouse-copier utility helps a bit: https://github.com/yandex/ClickHouse/issues/2579",0,18554062,0,"[18554131]","",0,"","[]",0 18560210,0,"comment","lykr0n","2018-11-29 14:08:33.000000000","Takes up 200gb or so across 5 servers (this is according to ClickHouse's query stats). Actual disk might be a bit higher.",0,18558658,0,"[18567859]","",0,"","[]",0 18569297,0,"comment","IanCal","2018-11-30 14:27:27.000000000","I've been using clickhouse (https://clickhouse.yandex) a lot recently. One thing I found useful was that the underlying tables are just folders on disk that you can easily copy around. The data will be automatically compressed too.
I'm sure it'll depend on some of your other constraints, but may be worth looking at. I've been extremely impressed with its performance.",0,18568933,0,"[18569520]","",0,"","[]",0 18569520,0,"comment","Keyframe","2018-11-30 15:00:57.000000000","How did you find clickhouse vs drill/dremio? Have you tried maybe both and compared them?",0,18569297,0,"[18572071]","",0,"","[]",0 18572071,0,"comment","IanCal","2018-11-30 19:52:21.000000000","I'm afraid I've not tried those, I played around only a bit and mostly spent a while trying to get cassandra setup but failed.
Then I hit on https://tech.marksblogg.com/benchmarks.html and saw clickhouse sitting at a very impressive point given the hardware it's on there. It was one of the few data things I saw that looked like it scaled down to a level relevant for me (cassandra I think talks about a dev setup of 5 servers). It's been fast enough with zero tuning or work so far that I've not done that much. I'm sure I could get closer to its speed by carefully setting some things in postgres, or clever work somewhere but I've got other things to do and that's not my speciality.
It has a few issues with ingesting CSV (I end up quoting everything, though I think it's less hassle if you use TSV instead).
Frankly, it's been incredible for my analysis work. apt-get install setup, fast to ingest data, easy to query and output large amounts of things. Happily works on a single box, though apparently scales up as well. More than happy to waffle on about this but I'm aware I've gone on very long already for a "I don't know" answer.",0,18569520,0,"[]","",0,"","[]",0 22155110,0,"comment","bdcravens","2020-01-26 22:04:59.000000000","Clickhouse seems to be another great option for what you've described.",0,22154337,0,"[22158334]","",0,"","[]",0 22156247,0,"comment","qaq","2020-01-27 01:47:45.000000000","ClickHouse is open source it's def not trash for olap and ts",0,22155606,0,"[22157528]","",0,"","[]",0 22156786,0,"comment","arminiusreturns","2020-01-27 04:04:43.000000000","So for the past few years since about ~2013 I've been on and off keeping an eye on Michael Stonebraker and his work on VoltDB (based on lessons learned from h-store) and Scidb and in general his criticisms of nosql and newsql variants.
I think Scidb, being a column-oriented dbms geared for multi-dimensional arrays (datacubes), is very interesting given current trends, and there are only a handful of similar dbs around; the other two that interest me are rasdaman and monetdb. I don't know if OmniScidb counts as a datacube db but it is also really interesting, especially due to its gpu and caching model.
As a sysadmin/ops type having to deal with monitoring, timeseries db's are also something I like to keep an eye on. It used to be mostly rrdtool in this space, but now I am comparing prometheus and influxdb. Like OmniSci, another case of something that's not quite in the same db model but might even be a better solution for the space (metrics) is Apache Druid. (elasticsearch being another that can be massaged to fit as a quasi tsdb as well.) I think there is some room to unify the monitoring/metrics and log storage arena into one space (usually they are separated, which adds admin overhead) and right now I really like druid as a potential for this.
Another interesting application of timeseriesdb's that I have been keeping an eye on is in the quant/algo trading area. Most people have been using kdb+ there but many are looking for replacements and there are some really good conversations to be found about the kind of limitations they are hitting.
I'm just a sysadmin who likes to keep up with what's going on, and my db knowledge is limited, but I do have a process for narrowing my focus to the dbs I reference. It must be open source, bonus points for gpl or apache licenses. The language it is written in is important but not a deal breaker (very tired of so many java based dbs). I don't like it when they are tacked on top of an "older" tech (such as kairos on cassandra, timeseriesdb on top of postgres, opentsdb on top of hbase, kudu on hadoop, etc). Being either filesystem aware or agnostic can be a nice feature (playing well with ceph, lustre, etc). Not saying this is the sort of selection criteria others should use, just giving some info on mine.
A few more interesting mentions: clickhouse, gnocchi, marketstore, Atlas (Netflix), opentick (on top of foundationdb).",0,22153898,0,"[22158099,22157197]","",0,"","[]",0 22157528,0,"comment","jnordwick","2020-01-27 07:39:38.000000000","clickhouse could barely complete a 15 min moving average; last time I checked it required a very slow correlated subquery. That's pretty much where I stopped evaluating it.
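For context, the trailing moving average being discussed can be computed in a single streaming pass over time-ordered data (a Python sketch, not ClickHouse SQL; the window is the 15 minutes mentioned above):

```python
from collections import deque

def moving_average(points, window_seconds=15 * 60):
    """points: iterable of (timestamp_seconds, value), sorted by time.
    Yields (timestamp, average of values inside the trailing window)."""
    buf = deque()   # (ts, value) pairs currently inside the window
    total = 0.0
    for ts, v in points:
        buf.append((ts, v))
        total += v
        # Evict values that fell out of the trailing window.
        while buf and buf[0][0] <= ts - window_seconds:
            _, old_v = buf.popleft()
            total -= old_v
        yield ts, total / len(buf)
```

A correlated subquery re-scans the window for every output row, which is where the complaint about performance comes from; the streaming pass touches each point twice at most.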
edit: after looking it up again, looks like that is still the case, and you have to be fairly limited with cumulative aggregates if you want to keep performance. maybe someday, but as of now, still not very good.",0,22156247,0,"[]","",0,"","[]",0 22157669,0,"comment","pachico","2020-01-27 08:15:27.000000000","I'm still surprised that the industry barely knows about ClickHouse. Very few times have I had the impression of adopting a game-changer technology, and that's the case with ClickHouse. We currently only use it for analytical purposes but it's been proven that it's a very valid solution for log storage or as a time-series DB. I already have in my roadmap to migrate ElasticSearch clusters (for logs) and InfluxDB to ClickHouse.",0,22153515,0,"[22157726]","",0,"","[]",0 22157726,0,"comment","snikolaev","2020-01-27 08:29:59.000000000","Does Clickhouse already have inverted index capabilities, or how are you going to search for logs containing "error"? Is LIKE's performance going to be enough? Or is that not the case for you?",0,22157669,0,"[]","",0,"","[]",0 22158334,0,"comment","MrBuddyCasino","2020-01-27 11:20:14.000000000","He needs updates, and Clickhouse isn't really made for that. But otherwise I agree.",0,22155110,0,"[]","",0,"","[]",0 18589569,0,"comment","ajawee","2018-12-03 15:43:13.000000000","Clickhouse - https://clickhouse.yandex",0,18568216,0,"[]","",0,"","[]",0 18591037,0,"comment","lykr0n","2018-12-03 17:56:34.000000000","Location: Austin, TX
Remote: Eh
Willing to relocate: Yes.
Technologies: Python 3 Development, Linux (CentOS) & Bash, VMware, oVirt/RHEV, Nomad + Consul, ClickHouse, Ansible/Salt/Puppet, Postgresql, Stolon, Datadog + Veneur, DNS, HAproxy and a bunch of other fun stuff
Resume: Upon request via Email- same with GitHub
Email: lykron@mm.st
Looking for, ideally: Systems focused SRE role, Systems Engineer, or Systems Administrator (or some mixture of the 3).
I've been extremely involved recently with platform & application monitoring- from health self-reporting, to service SLO monitoring",0,18589704,0,"[18655458]","",0,"","[]",0 22160877,0,"comment","xzcat","2020-01-27 16:34:33.000000000",""fairly painlessly" and "without significant work or downtime" doesn't sound like it lines up with btrfs's, which I would describe as "one command and zero downtime (just some io load if you rebalance immediately)" for both operations. btrfs is also mainline, which increases how painless it is to use.
BTRFS does have some scary stories from earlier in its development, and true raid5 seems like it's unlikely to be safe for quite a while, but raid1 and "normal" fs usage has been rock solid in my experience. The only time I've ever had an issue was probably 4 years ago at this point, and it was solved by just booting an Arch live iso and running a btrfs command that was basically "fix exactly the bug that your error message indicates". I don't remember exactly what it is, something about two sizes not matching, but googling the text it showed at boot led me directly to the command to fix it. Certainly dramatically less trouble than I've ever had when hardware RAID goes south.
I do agree that modern lvm does probably compete with btrfs, but again you're trading how dang simple btrfs raid1 is to manage for monkeying with partitions in lvm in exchange for ~some? performance.
IMO ZFS is in a weird spot where I don't know where I'd use it. It's too complicated/annoying to admin for me to want to run it in my basement for myself/my family, and for anything bigger or more professional I'd use ceph or a problem-domain-specific storage system (HDFS, clickhouse, aws, etc).",0,22160529,0,"[22168494]","",0,"","[]",0 18601028,0,"comment","polskibus","2018-12-04 16:43:06.000000000","Would you mind comparing TiDB to other HTAP databases like SAP HANA, MemSQL, HyPer? I'm more interested in the architecture, trade-offs, best/worst use cases. How would you compare the analytical bit with regard to analytical databases like ClickHouse, SQL Server tabular model, MapD?",0,18600721,0,"[18601262]","",0,"","[]",0 22189960,0,"comment","barrkel","2020-01-30 09:00:38.000000000","If you're disappointed with the speed and complexity of your Hadoop cluster, and especially if you're trying to crack a bit, you should give ClickHouse a spin.",0,22188877,0,"[22190174]","",0,"","[]",0 22190174,0,"comment","KptMarchewa","2020-01-30 09:51:28.000000000",">crack a bit
What does that mean? I don't understand if you're trying to endorse ClickHouse or make fun of it.",0,22189960,0,"[22191351]","",0,"","[]",0 22190939,0,"comment","jacques_chester","2020-01-30 12:47:37.000000000","> Except when you have large, telemetry style datasets e.g. web/product analytics which won't fit.
Web analytics was one of the first applications for Greenplum. My understanding is that Yahoo collected tens of billions of events per day in the mid-2000s.
> Or when you are trying to build a wide table and you run out of columns.
HAWQ can run SQL queries over Hadoop clusters. Clickhouse's table width is limited by how much RAM you give it.
> Or when your favourite SaaS products gives you highly nested JSON data.
This is why major databases have JSON querying capabilities and why it's been added to the next SQL standard. PostgreSQL even allows you to define indices on fields inside your JSON structures.
Better yet: decompose the highly nested data. Relational databases begin to shine when you get past at least first normal form.
> RDBMS works great up until the point that it doesn't.
RDBMSes do work great until they don't. Which means they are almost always the best solution and almost always remain so.
Folks regularly overestimate the size of their problem and underestimate the capabilities of the literally dozens of RDBMSes now available for use. Yes, it irks me.
Disclosure: I work for VMware, which sponsors Greenplum development.",0,22190268,0,"[]","",0,"","[]",0 22191351,0,"comment","barrkel","2020-01-30 14:09:06.000000000","Phone typo. Should have been 'nut'.
And yes, I'm endorsing ClickHouse; it scales down much better than Hadoop.",0,22190174,0,"[]","",0,"","[]",0 18623351,0,"comment","valyala","2018-12-06 23:02:39.000000000","ClickHouse isn't a general-purpose DBMS. It is the best tool for collecting and near-online analysis of huge amounts of events with many properties. We were successfully using a ClickHouse cluster with 10 shards for collecting up to 3M events per second with 50 properties each (properties translate to columns). Each shard was running on an n1-highmem-16 instance in Google Cloud. The cluster was able to scan tens of billions of rows per second for our queries. The scan performance was 100x better than on the previous highly tuned system built on PostgreSQL.
ClickHouse may be used as a timeseries backend, but currently it has a few drawbacks comparing to specialized solutions: - It has no efficient inverted index for fast metrics lookup by a set of label matchers. - It doesn't support delta coding yet - https://github.com/yandex/ClickHouse/issues/838 .
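Delta coding, mentioned in the linked issue, is simple to sketch (a toy illustration, not the actual codec): store the first value and then only the successive differences, which for slowly changing time series are small integers that compress far better than the raw values.

```python
def delta_encode(values):
    """Store the first value, then successive differences.
    Slowly changing series turn into many small, well-compressing ints."""
    out = []
    prev = 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """A running sum reverses the encoding exactly."""
    out = []
    acc = 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out
```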
Learn how we created a startup - VictoriaMetrics - that builds on performance ideas from ClickHouse and solves the issues mentioned above - https://medium.com/devopslinks/victoriametrics-creating-the-... . Currently it has the highest performance/cost ratio compared to competitors.
Hi all, I'm Dima (https://www.linkedin.com/in/dim25/) from SF (San Francisco Bay Area). Full-stack with Machine Learning experience; AI/ML product manager.
Python: * Machine Learning: (TensorFlow; Keras; PyTorch). * Computer Vision (OpenCV; TensorFlow). * Media \ communications (Twilio; RingCentral; Kurento). * Streaming \ Workflows: Kafka+Faust; Airflow; Celery. * Web servers (Flask), and many other applications of Python.
Web Development: HTML; CSS; Bootstrap. JS (Front-end + Node.js): All the basics necessary for web development; Basic experience with d3.js and other visualizations and dashboards tools.
DBs: MongoDB; ElasticSearch; Redis (incl. RediSearch), SQLs. Basics of ClickHouse.
C/C++: Some experiments with ROS/robotics.
Most recent projects:
* Analyzing millions of job postings worldwide.
* Computer Vision CCTV Stream analytics.
Previously: * Co-founder at MBaaS startup. 'Firefighter' from $0 to $120K MRR.
* Managed a team of 15 mobile developers to assist with the delivery of
the #1 mobile banking app in Russia (iOS + Android).
* AWM, rev-share with Kinks (guys from San Francisco Armory).
Email: dima_cv1@protonmail.com | One-page CV: https://bitly.com/dima_cv1",0,22225313,0,"[]","",0,"","[]",0 22231340,0,"comment","hwwc","2020-02-04 01:18:11.000000000","SEEKING WORK | Backend Services; Data Engineering; Systems Engineering
Location: Boston, US | Remote: Yes
I'm an experienced Rust software engineer looking for 10-20 hr/week contract writing robust, performant, and ergonomic backend services.
I'm most experienced in the data-analytics backend-stack: from ETL to database design to web-api to devops. One of my major projects is an analytics engine for web applications (https://github.com/hwchen/tesseract).
However, I'm naturally curious and happy to work in any domain which requires high performance and maintainable code. I've worked with a distributed worker system, debugged async database drivers, and implemented text layout primitives.
Primary Skills: Rust, Linux, Google Compute Platform, ClickhouseDB, Postgresql
Production Experience with: Python/Pandas, Node/JS, AWS, Docker, Redis, MySql, Nginx
Github: https://github.com/hwchen
Contact: hello@hwc.io",0,22225313,0,"[]","",0,"","[]",0 18667047,0,"comment","nwmcsween","2018-12-12 19:39:05.000000000","First I'm not trashing the project, I'm just wondering why simple unix like solutions aren't used.
Store it wherever you want; this isn't a magical datastore that makes things faster. Use clickhouse-client, whatever, it doesn't matter.
There is a widening disconnect between the unix way and how new projects are created.",0,18666948,0,"[]","",0,"","[]",0 18669320,0,"comment","aseipp","2018-12-13 01:55:15.000000000","Timescale isn't really a columnar database, it's more like an advanced partitioning extension for time-series data ("any data you want to shard based on a time column") where you can optionally include other partition keys for the sharding. But it can be used very well for analytic cases like this thanks to that.
Real columnar databases like MemSQL or Clickhouse are a different beast -- for example they give very good column-wise compression in the dataset, which can save dramatic amounts of space. They're also very good for use cases like this, since they're heavily optimized for OLAP-style workloads.
There is also cstore_fdw which does offer columnar, compressed storage for PostgreSQL as a foreign table, but it won't hold a candle to something like MemSQL or Clickhouse in terms of raw performance. Maybe one day.
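The column-wise compression win mentioned above is easy to demonstrate with a toy experiment (plain Python and zlib, not any particular database's codec): grouping a column's values together before compression exposes far more redundancy than interleaving them row by row.

```python
# Toy illustration of why columnar layout compresses better than
# row-oriented layout. The data and byte encodings are made up for
# the demo; real columnar engines add per-column codecs (RLE,
# dictionary, delta) on top of generic compression.
import random
import zlib

random.seed(0)
n = 5000
ids = [str(random.randrange(10**9)) for _ in range(n)]  # high-entropy column
country = ["US"] * n                                    # low-entropy column

# Row-oriented: each id is interleaved with its country value.
row_wise = ",".join(f"{i}:{c}" for i, c in zip(ids, country)).encode()
# Column-oriented: each column stored contiguously.
col_wise = (",".join(ids) + "|" + ",".join(country)).encode()

# The repetitive country column collapses to almost nothing when it is
# stored as one contiguous run instead of being scattered between ids.
print(len(zlib.compress(row_wise)), len(zlib.compress(col_wise)))
assert len(zlib.compress(col_wise)) < len(zlib.compress(row_wise))
```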
Ultimately it's not about columnar storage or partitioning support, though, it's about the data and the queries you want to run on it, in what amount of time. Timescale can do pretty well for a lot of cases like this, I bet, and I'm investigating it myself for a project.",0,18668321,0,"[18669414]","",0,"","[]",0 22234322,0,"comment","pachico","2020-02-04 11:12:45.000000000","Not really the same, is it? And proxysql has the very same limitations it had since day one which makes it useless to me: ClickHouse and proxysql must be on the same host and ClickHouse must not have any authentication. Or has anything changed lately?",0,22226265,0,"[]","",0,"","[]",0 22380245,0,"comment","leonardteo","2020-02-21 01:28:40.000000000","This looks great but was there a reason for going with Postgres over something purpose-built for analytics like Clickhouse? I am seriously considering building a similar tool for our platform, as Mixpanel/Amplitude is cost-prohibitive at our scale. We had to move off Google Analytics and run our own Matomo server, and it brings MariaDB to a grinding halt.
In any case, will be looking closer at this. Looks very interesting. Thanks.",0,22376732,0,"[22380357,22380350]","",0,"","[]",0 22380357,0,"comment","james_impliu","2020-02-21 01:58:20.000000000","We have actually integrated Clickhouse already for this reason. We started with Postgres as it works well for smaller volumes, but we have this integration in a paid version.",0,22380245,0,"[22380400]","",0,"","[]",0 18678723,0,"comment","nwmcsween","2018-12-14 05:45:54.000000000","If you read my earlier posts logs vs metrics doesn't matter in this context and hence why I said to stream to clickhouse-client...",0,18673574,0,"[]","",0,"","[]",0 22392051,0,"comment","buro9","2020-02-22 17:10:34.000000000","It's a database for a metric platform.
Think of OpenTSDB and Prometheus. Or for a better comparison think of Thanos https://thanos.io/
As to whether they could fulfil Uber's needs, the thing about scale (real massive scale - I work at Cloudflare) is that everything breaks in weird ways according to your specific uses of a technology. The things listed above work for companies, until they don't. There are few things that seem to truly work at every scale; Kafka and ClickHouse come to mind, for wholly different use cases than a time series database.",0,22391982,0,"[22393977,22394762,22393800]","",0,"","[]",0 22392085,0,"comment","jmakov","2020-02-22 17:16:39.000000000","So how does this compare to e.g. Clickhouse?",0,22391270,0,"[22392362]","",0,"","[]",0 22392362,0,"comment","bdcravens","2020-02-22 18:09:53.000000000","Clickhouse is an analytic column-based RDBMS. It's not a timeseries database. Each class of product is used to solve different problems.",0,22392085,0,"[22394007,22393288,22392443,22392987]","",0,"","[]",0 22392443,0,"comment","aeyes","2020-02-22 18:24:15.000000000","Clickhouse works exceptionally well as a TSDB.",0,22392362,0,"[22393250,22393305,22392995]","",0,"","[]",0 22393288,0,"comment","mbell","2020-02-22 20:59:37.000000000","Clickhouse has a table engine for graphite. We've used it for a couple years now after outscaling InfluxDB and working around it several times. Clickhouse works _extremely_ well for graphite data, it can handle several orders of magnitude more load than Influx in my experience.",0,22392362,0,"[]","",0,"","[]",0 22393303,0,"comment","mbell","2020-02-22 21:02:22.000000000","Most practical applications using Clickhouse for metrics data store the metric index separately. What index you want really depends on the metric system, e.g. with graphite data you don't want an inverted index, you want a trie.",0,22393250,0,"[22393469]","",0,"","[]",0 22393305,0,"comment","idjango","2020-02-22 21:02:33.000000000","I also confirm that.
Several companies have successfully transitioned their monitoring stack from Graphite's initial Python implementation to a ClickHouse-based backend.",0,22392443,0,"[22395378]","",0,"","[]",0 22393469,0,"comment","roskilli","2020-02-22 21:31:56.000000000","Yes I've seen that also work, it's a lot of stitching together things yourself and we had to put a lot of caching in front of the inverted index we were using, however definitely plausible. ClickHouse doesn't do any streaming of data between nodes as you scale up and down, which was a big thing for us since we had large datasets and needed to rebalance when the cluster expanded/shrunk.
With regards to trie vs inverted index for Graphite data, I'd actually still be inclined to say inverted index is better based on the number of queries I saw at Uber with Graphite where people did `servers.*.disk.bytes-used` type queries, which is way faster to do using an inverted index since you have a postings list for each part of the dot-separated metric name, rather than traversing a trie with thousands to tens of thousands of entries at index 1 (the host part) of the Graphite name. This is what M3DB does[0].
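A minimal sketch of the postings-list idea described above (plain Python, not M3DB's actual implementation): keep one inverted index per position of the dot-separated name, mapping the token at that position to the set of matching series IDs, then intersect postings for the non-wildcard parts of the query.

```python
# Toy per-position inverted index for Graphite-style metric names.
# Series names and IDs here are made up for the demo.
from collections import defaultdict

series = [
    "servers.host001.disk.bytes-used",
    "servers.host001.cpu.user",
    "servers.host002.disk.bytes-used",
    "clusters.east.disk.bytes-used",
]

# index[position][token] -> set of series ids having that token there
index = defaultdict(lambda: defaultdict(set))
for sid, name in enumerate(series):
    for pos, token in enumerate(name.split(".")):
        index[pos][token].add(sid)

def query(pattern):
    """Intersect postings lists for the non-wildcard parts of a pattern."""
    result = set(range(len(series)))
    for pos, token in enumerate(pattern.split(".")):
        if token != "*":
            result &= index[pos].get(token, set())
    return sorted(series[sid] for sid in result)

print(query("servers.*.disk.bytes-used"))
# ['servers.host001.disk.bytes-used', 'servers.host002.disk.bytes-used']
```

The wildcard host part costs nothing here: only the `servers`, `disk`, and `bytes-used` postings are intersected, regardless of how many distinct hosts exist.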
[0]: https://github.com/m3db/m3/blob/b2f5b55e8313eb48f023e08f6d53...",0,22393303,0,"[22396235]","",0,"","[]",0 22393800,0,"comment","1996","2020-02-22 22:33:58.000000000","> ClickHouse come to mind for wholly different use cases than a time series database.
ClickHouse works fine as a TSDB if you don't mind getting a little dirty",0,22392051,0,"[22398099]","",0,"","[]",0 22393977,0,"comment","manigandham","2020-02-22 23:12:23.000000000","Clickhouse (and other columnstore RDBMS) are all perfectly fine for time-series and usually better than the standard options because they have SQL querying.",0,22392051,0,"[]","",0,"","[]",0 22393988,0,"comment","manigandham","2020-02-22 23:14:29.000000000","Any columnstore RDBMS would work great. Clickhouse, MemSQL, KDB, Greenplum, Vertica, etc. Fast, efficient, and with the full flexibility of SQL queries.",0,22391982,0,"[]","",0,"","[]",0 22394555,0,"comment","roskilli","2020-02-23 01:46:14.000000000","I think ScyllaDB would have definitely done better than Cassandra (which we were using alongside ElasticSearch), although another thing I mention in this thread is that a lot of existing distributed databases do not have a multi-dimensional inverted index available that can index keys in their primary storage engine.
This makes it tough to use solely either ScyllaDB, ClickHouse or Cassandra for that matter for metrics workloads at scale since they need to find a needle in a haystack - a few thousand time series amongst a set of millions to billions, where users only specify a subset of the dimensions on the metrics in any order they want to. This is hard to do without an inverted index.",0,22394375,0,"[]","",0,"","[]",0 22395378,0,"comment","rixed","2020-02-23 05:53:09.000000000","Not to bad-mouth Clickhouse but the original python implementation of graphite + carbon was setting the bar very low, though, and transitioning from there to anything would have increased performance by orders of magnitude.",0,22393305,0,"[22396171]","",0,"","[]",0 22395434,0,"comment","shaklee3","2020-02-23 06:16:05.000000000","Hundreds of thousands per second isn't very high when you compare that to clickhouse or kdb+.",0,22395040,0,"[]","",0,"","[]",0 22395451,0,"comment","shaklee3","2020-02-23 06:23:03.000000000","It seems clickhouse also has most of those features, but is not considered a time series database. Is that wrong?",0,22394886,0,"[]","",0,"","[]",0 22396171,0,"comment","idjango","2020-02-23 10:32:37.000000000","You should read the link below [1]. Even if it's not Uber scale, I suspect Yandex uses something similar.
I agree that the Python implementation of Graphite was not particularly fast, but there were faster implementations in C that companies used first to significantly increase performance. Then coordination of the storage backend becomes complex when you try to scale the initial design. This is where ClickHouse really shines. It provides out-of-the-box distributed storage with compaction, rollup and fast querying. The other layers are stateless, which means that they'll scale with your computing resources.
M3DB is roughly doing the same thing as ClickHouse, but ClickHouse is a much more advanced database with a proven record of running at petabyte scale without breaking a sweat. For example, they now have tiered storage, which means that you can store recent events on NVMe and roll up to standard HDD...
[1] https://medium.com/avitotech/metrics-storage-how-we-migrated...",0,22395378,0,"[]","",0,"","[]",0 22396235,0,"comment","idjango","2020-02-23 10:52:32.000000000","Just to point out that there is an inverted index implementation of graphite data working on clickhouse.
Regarding the auto-rebalance feature, I couldn't agree with you more. It's something that clickhouse definitely needs to handle internally.",0,22396235,0,"[22402155]","",0,"","[]",0 15091009,0,"story","alvil","2017-08-24 15:41:23.000000000","",0,0,0,"[]","https://clickhouse.yandex",1,"ClickHouse – open source distributed column-oriented DBMS","[]",0 22398099,0,"comment","valyala","2020-02-23 17:24:57.000000000","There is a TSDB solution if you don't want to get a little dirty - VictoriaMetrics [1]. It is built on the same principles as ClickHouse [2].
[1] https://github.com/VictoriaMetrics/VictoriaMetrics/blob/mast...
[2] https://medium.com/@valyala/how-victoriametrics-makes-instan...",0,22393800,0,"[22428823]","",0,"","[]",0 22400202,0,"comment","Znafon","2020-02-23 23:08:12.000000000","I had a lot of success with Clickhouse recently for tables that are 200+ million rows, and that was with the Log table engine, not the MergeTree one, so I would expect it to get even faster when we change.
It's very easy to set up, so you should be able to test it quickly to see if it fits your needs.",0,22400135,0,"[]","",0,"","[]",0 22400501,0,"comment","FridgeSeal","2020-02-24 00:06:21.000000000","They _appear_ to solve a bunch of problems by simply punting them down the road into downstream applications.
None of the databases you listed there are OLAP databases.
Clickhouse, TiDB, Redshift, Snowflake, etc are significantly more suitable and should be the target of comparison here.",0,22400443,0,"[]","",0,"","[]",0 22402155,0,"comment","roskilli","2020-02-24 06:50:20.000000000","That's interesting, I had not heard of ClickHouse as a backend for Graphite with an inverted index. Let me know if you have any links to that.
I'm assuming this is an out-of-process inverted index used alongside ClickHouse? Or is it more of a secondary table contained by ClickHouse which can be searched to find the metrics, then the data is looked up?
The latter doesn't scale as well with billions of unique metrics since it's always a scan across the unique metrics stored in the time window your query searches for (since any arbitrary dimensions can be specified, all must be evaluated). This is the drawback of PromHouse, which is an implementation of Prometheus remote storage on top of ClickHouse - and the major reason why PromHouse was only ever a proof of concept rather than a production offering.",0,22396235,0,"[]","",0,"","[]",0 18694525,0,"comment","InGodsName","2018-12-16 19:01:37.000000000","It will take a long time to build something robust.
Build something like Clickhouse; its source is on GitHub.
It's a work in progress, so it's not very huge yet.
See if you can recreate it in Rust.
Lemme know if you want me to contribute to your project :).
Happy to code some parts.",0,18690083,0,"[]","",0,"","[]",0 22271512,0,"comment","samokhvalov","2020-02-07 22:02:57.000000000","There are various approaches here, and there are some FOSS tools that you can use.
Some links:
- https://blog.taadeem.net/english/2019/01/03/8_anonymization_... – description of methods, and a tool for Postgres, postgresql_anonymizer
- https://habr.com/en/company/yandex/blog/485096/ – not for Postgres, it's for ClickHouse (open-source DBMS for analytics) but covers the topic very well.",0,22270554,0,"[]","",0,"","[]",0 22406396,0,"comment","stereosteve","2020-02-24 17:52:43.000000000","ClickHouse is easy to run on a local machine and has great performance.",0,22405513,0,"[]","",0,"","[]",0 22431460,0,"comment","pachico","2020-02-27 07:44:58.000000000","That was my point exactly. The HTTP API is very cool but very far from being unique to couchdb (see InfluxDB, ClickHouse, Prometheus, etc.)",0,22426378,0,"[]","",0,"","[]",0 22434325,0,"story","bigned","2020-02-27 15:54:59.000000000","",0,0,0,"[]","https://www.nedmcclain.com/why-devops-love-clickhouse/",3,"Why DevOps Love ClickHouse","[]",0 22447622,0,"comment","streetcat1","2020-02-29 00:11:23.000000000","For a data warehouse, try clickhouse.",0,22447197,0,"[]","",0,"","[]",0 22316204,0,"comment","manigandham","2020-02-13 08:30:52.000000000","You don't need to store everything into RAM to get fast results. Data warehouse relational databases are designed exactly for this kind of fast SQL analysis over extremely large datasets. They use a variety of techniques like vectorized processing on compressed columnar storage to get you quick results.
Google's BigQuery, AWS Redshift, Snowflake (are all hosted), or MemSQL, Clickhouse (to run yourself). Other options include Greenplum, Vertica, Actian, YellowBrick, or even GPU-powered systems like MapD, Kinetica, and Sqream.
I recommend BigQuery for a no-ops hosted version or MemSQL if you want a local install.",0,22311040,0,"[22316213]","",0,"","[]",0 22455782,0,"comment","pachico","2020-03-01 08:17:15.000000000","I started this project some weeks ago: https://github.com/iris-analytics It's a small JS that gathers data and sends it to a Go backend to then be stored in ClickHouse. Although there's lots to do, we use it in production successfully. Remember ClickHouse was born precisely for web analytics, where a single instance can handle hundreds of millions of inserts per day with no effort. I did this because stats say adblocker penetration in Europe is beyond 30%, and this would give us real-time insights with no sampling and ad-hoc queries.
If you want to help me out, you are very welcome!!!",0,22454520,0,"[22455844]","",0,"","[]",0 22457767,0,"story","mooreds","2020-03-01 17:05:28.000000000","",0,0,0,"[22478661,22478856,22478338,22478486,22479208,22478991,22478441]","https://clickhouse.tech/docs/en/operations/utils/clickhouse-local/",124,"Clickhouse Local","[]",75 22466334,0,"comment","buro9","2020-03-02 17:12:28.000000000","Cloudflare | Engineers | London, Austin, Lisbon, Champaign, Singapore, San Francisco | Onsite | Full Time
https://www.cloudflare.com/careers/jobs/
Cloudflare has a mission to save the internet, and we are hiring in many different teams across many offices.
I'm specifically looking for data engineers / scientists to build and use systems that can answer some of the challenging questions about denial of service attacks across the internet.
If you want to work with a lot of data and systems and technologies like Kafka, ClickHouse, XDP, eBPF, Rust, Go... then get in touch.
If this isn't the role for you, check the link above as we have a lot of open roles.
Uncertain whether Cloudflare is for you? My work email is dkitchen@cloudflare.com and feel free to ask questions and when you're ready you can apply for a role via the link above.",0,22465476,0,"[]","",0,"","[]",0 22468165,0,"comment","hodgesrm","2020-03-02 19:49:32.000000000","Altinity | Multiple ClickHouse engineering positions | REMOTE in North America and Europe| Full-time | Competitive Salary and Equity
Hello! We are Altinity, a fast-growing database startup with a distributed team spanning from California to Eastern Europe. Our business is to make customers successful with ClickHouse, the leading open source data warehouse. Our customers range from ambitious startups to some of the most well-known enterprises on the planet. And we are looking for people to join us!
* Data Warehouse Implementation Engineer
* Data Warehouse Support Manager
* Data Warehouse Support Engineer
There are more positions on the way. If you have experience with ClickHouse and want to join, check out our jobs here:
https://www.altinity.com/careers",0,22465476,0,"[]","",0,"","[]",0 22338552,0,"story","pella","2020-02-15 22:43:55.000000000","",0,0,0,"[]","https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7",2,"Comparison of the Open Source OLAP Systems for Big Data: ClickHouse,Druid,Pinot","[]",0 22471314,0,"comment","hwwc","2020-03-03 02:36:08.000000000","SEEKING WORK | Backend Services; Data Engineering; Systems Engineering
Location: Boston, US | Remote: Yes
I'm an experienced software engineer looking for part-time and short-term contracts.
I'm most experienced in the data-analytics backend-stack: from ETL to database design to web-api to devops. One of my major projects is an analytics engine for web applications (https://github.com/hwchen/tesseract).
However, I'm naturally curious and happy to work in any domain which requires high performance and maintainable code. I've worked with a distributed worker system, debugged async database drivers, and implemented text layout primitives.
Primary Skills: Rust, Python, Linux, Google Compute Platform, ClickhouseDB, Postgresql
Production Experience with: Python/Pandas, Node/JS, AWS, Docker, Redis, MySql, Nginx
Github: https://github.com/hwchen
Contact: hello@hwc.io",0,22465475,0,"[]","",0,"","[]",0 22478338,0,"comment","1996","2020-03-03 20:17:10.000000000","Clickhouse is one of the most underrated databases.
This basically replaces most of my usages of SQLite.
When its SQL "dialect" matures, Clickhouse will eat MySQL's lunch, then PostgreSQL's.",0,22457767,0,"[22478407,22479791]","",0,"","[]",0 22478407,0,"comment","oreoftw","2020-03-03 20:26:57.000000000","For analytics - sure. Clickhouse was not designed to handle OLTP workloads; there's no transaction support.",0,22478338,0,"[22478698]","",0,"","[]",0 22478661,0,"comment","georgewfraser","2020-03-03 20:59:44.000000000","Why do people on HN love Clickhouse so much? As far as I can tell, it’s an ordinary column store, with a bunch of limitations around distributed joins and a heuristic-based query planner. There are several good analytical databases that will give you the same scan performance and a much better query planner and executor.
This is not a rhetorical question, I would really like to know why it gets so much attention here.",0,22457767,0,"[22480553,22479895,22478905,22479484,22481530,22481850,22481541,22478683,22480870,22480984,22480796,22479311]","",0,"","[]",0 22478698,0,"comment","1996","2020-03-03 21:04:04.000000000","I cringe a bit inside at people using, say, NoSQL approaches when it makes literally no sense to do so.
Therefore I think the lack of OLTP will not matter much and that clickhouse will be widely used, but also misused when it becomes too fashionable.",0,22478407,0,"[22478876,22478851]","",0,"","[]",0 22478732,0,"comment","edmundsauto","2020-03-03 21:08:05.000000000","Of those, it looks like only Presto is open source and/or free. So maybe it's a presto versus clickhouse comparison, which explains why so many choose clickhouse (it's one of only 2 options in its class).",0,22478688,0,"[22480627,22479774]","",0,"","[]",0 22478760,0,"comment","deepsun","2020-03-03 21:11:48.000000000","I can say for BigQuery and Databricks from personal experience.
BigQuery is much slower and is much more expensive for both storage and query.
Databricks (Spark) is even slower than that (both io and compute), although you can write custom code/use libs.
You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).",0,22478688,0,"[22479219]","",0,"","[]",0 22478814,0,"comment","FridgeSeal","2020-03-03 21:19:12.000000000","Snowflake doesn’t really keep up with Clickhouse (in my experience) and it costs money.
DataBricks is essentially Spark, and I shouldn’t need a whole spark cluster just to get database functionality. It also costs money.
Unless I’m mistaken, Presto is just a distributed query tool over the top of a separate storage layer, so that’s 2 things you have to setup.
I have no experience with BigQuery but I’ve heard good things about it and Redshift; however, if the rest of your infra isn’t on GCP/AWS then that will probably be a blocker.
Clickhouse is open source, comes with convenient clients in a bunch of languages as well as a HTTP API. It’s outrageously fast and has some cool features and makes the right trade-offs for its use-case, large range of supported input/output formats, built-in Kafka support and the replication and sharding is reasonably straightforward to setup.",0,22478688,0,"[]","",0,"","[]",0 15174966,0,"story","AndreyKarpov","2017-09-05 13:39:50.000000000","",0,0,0,"[]","https://www.viva64.com/en/b/0529/",2,"Check of ClickHouse column database for OLAP using PVS-Studio analyzer","[]",0 22478851,0,"comment","joelwilsson","2020-03-03 21:25:49.000000000","Sure - but the comment you're replying to made no mention of NoSQL. It just said Clickhouse lacks OLTP by design, that doesn't mean it won't be widely used, just that it will perhaps be limited to analytics workloads.
If you need deletes and transactions, look elsewhere, but Clickhouse seems to be great for what it's been designed for.",0,22478698,0,"[]","",0,"","[]",0 22478856,0,"comment","wikibob","2020-03-03 21:26:29.000000000","How Sentry.io uses Clickhouse
https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-...",0,22457767,0,"[22481267]","",0,"","[]",0 22478876,0,"comment","atombender","2020-03-03 21:28:25.000000000","This makes no sense.
For example, aside from the lack of transactions, Clickhouse is designed for insertion. There's an INSERT statement, but no UPDATE or DELETE statements. You can rewrite tables (there's ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE), but they're intended for large batch operations, and the operations are potentially asynchronous, meaning that the statement returns right away, but you only see the results later.
ClickHouse has many other limitations. For example, there's no enforcement of uniqueness: You can insert the same primary key multiple times. You can dedupe the data, but only specific table engines support this.
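To make the dedupe point concrete, here is a hedged sketch (plain Python, not ClickHouse internals) of the semantics an engine like ReplacingMergeTree provides: duplicates are accepted on insert and coexist until a background merge keeps only the latest version per key.

```python
# Toy model of merge-time dedupe: keys, versions, and values are
# invented for the demo; real engines do this inside part merges.

def merge_parts(rows):
    """rows: (key, version, value) tuples; keep highest version per key."""
    latest = {}
    for key, version, value in rows:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return {k: v for k, (_, v) in latest.items()}

inserted = [
    ("user:1", 1, "alice"),
    ("user:1", 2, "alice_renamed"),   # duplicate key, newer version
    ("user:2", 1, "bob"),
]
# Until the merge runs, a plain SELECT would see all three rows;
# only the merged view collapses the duplicate.
print(merge_parts(inserted))  # {'user:1': 'alice_renamed', 'user:2': 'bob'}
```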
There's absolutely no way anyone will want to use ClickHouse as a general-purpose database.",0,22478698,0,"[22487356]","",0,"","[]",0 22478991,0,"comment","dzonga","2020-03-03 21:42:40.000000000","funny thing, just learnt about clickhouse today. for experienced people that use columnar stores and pandas for analytics, which tool do you usually prefer for BI stuff ? do ya'll load data into clickhouse then analyse it using pandas. or all analysis is done via the clickhouse sql dialect. As i'm sure things like pivot tables and rolling windows are a PITA in SQL",0,22457767,0,"[22479067]","",0,"","[]",0 22479067,0,"comment","meritt","2020-03-03 21:52:07.000000000","Why would you bother using a database like clickhouse to store data if you're just going to analyze it in pandas? Just store it in a csv, parquet, or orc.
> As i'm sure things like pivot tables and rolling windows are a PITA in SQL
I can't speak for clickhouse, but group-by and window functions are a very standard part of any SQL analysts toolbelt.",0,22478991,0,"[22479339,22479232,22479290]","",0,"","[]",0 22479197,0,"comment","bdcravens","2020-03-03 22:09:05.000000000","According to https://tech.marksblogg.com/benchmarks.html Clickhouse has better performance than 3 of those (the other 2 haven't been tested in that benchmark)",0,22478688,0,"[22479267]","",0,"","[]",0 22479208,0,"comment","bdcravens","2020-03-03 22:10:49.000000000","Clickhouse Local is great also for importing into your Clickhouse Server, where you can validate and preprocess CSVs into Clickhouse's native table format.",0,22457767,0,"[]","",0,"","[]",0 22479219,0,"comment","derefr","2020-03-03 22:11:58.000000000","> You seem to underestimate how heavily ClickHouse is optimized (e.g. compressed storage).
Is it any more compressed than Apache Hive's ORC format (https://orc.apache.org)? Because that's increasingly accepted as a storage format in a lot of these analytical systems.",0,22478760,0,"[22479762,22479785]","",0,"","[]",0 22479232,0,"comment","FridgeSeal","2020-03-03 22:13:43.000000000","> Why would you bother using a database like clickhouse to store data if you're just going to analyze it in pandas?
Because I have more data than what fits locally, there’s a data pipeline that pushes more in, and I only need to work on a subset.
Storing everything in flat csv/parquet etc is useless when there’s more than fits on your local/single machine memory or if you want to search/subset etc some of the data or do anything that’s larger than memory without having to write spill-to-disk stuff in Python/pandas.",0,22479067,0,"[22479451]","",0,"","[]",0 22479484,0,"comment","amyjess","2020-03-03 22:50:01.000000000","It's legitimately fast.
My company migrated our time-series data from InfluxDB to ClickHouse last year (I personally led this, in fact), and the performance difference is night and day.
While I liked a lot of what Influx could do, it was also nonstandard in bizarre ways (Clickhouse behaves more like a subset of SQL), sometimes shockingly immature, and despite appearing fast when we first started using it, so slow that it was a considerable bottleneck.",0,22478661,0,"[22479713,22479787]","",0,"","[]",0 22479530,0,"comment","atombender","2020-03-03 22:56:41.000000000","Note that Greenplum supports column-oriented tables. I haven't used it myself, so I can't comment on whether it's slower than ClickHouse.",0,22479112,0,"[]","",0,"","[]",0 22479544,0,"comment","atombender","2020-03-03 22:59:11.000000000","CitusDB is not relevant here, I believe, as it still uses Postgres' table storage, so it's not columnar. It might be good for analytical workloads, but I very much doubt it will perform anywhere close to ClickHouse.
Greenplum: I've not used it, but it does support columnar tables, so maybe it's comparable.",0,22478966,0,"[22481297]","",0,"","[]",0 22479713,0,"comment","deepsun","2020-03-03 23:23:15.000000000","Clickhouse is not really a time-series database, it's a more general analytical DB (e.g. it can also handle strings, logs, user IPs).
But if you have a lot of time-series metrics (only numbers), you might be better off with specialized time-series databases like Prometheus + VictoriaMetrics, with Grafana for visualizing it.",0,22479484,0,"[22480541]","",0,"","[]",0 22479762,0,"comment","deepsun","2020-03-03 23:29:43.000000000","Yes, looks like it. According to these posts, ORC only uses snappy or zlib compression, while Clickhouse uses double-delta, Gorilla, and T64 algorithms.
https://engineering.fb.com/core-data/even-faster-data-at-the...
https://www.altinity.com/blog/2019/7/new-encodings-to-improv...",0,22479219,0,"[]","",0,"","[]",0 22479785,0,"comment","marcinzm","2020-03-03 23:33:29.000000000","ORC or Parquet are file storage formats so without context their performance can be almost anything. Where is the data stored? S3? HDFS? Local ram disk?
Clickhouse manages the whole distributed storage, ram caching, etc. thing for you.
In my experience, a unified single purpose vertically integrated solution will be faster than a bunch of kitchen sink solutions bolted together.",0,22479219,0,"[]","",0,"","[]",0 22479787,0,"comment","jstrong","2020-03-03 23:33:45.000000000","I think of influx performance as similar to pandas. It's fast enough for many things it was designed for, but it's pretty easy to hit a performance cliff. In particular, "select * from table" type queries are very slow, so don't expect to be pulling out large blocks of data to do analysis somewhere else.
However, influx, being schemaless, can be outstanding for rapidly prototyping ephemeral metrics, as adding a new measurement is zero-cost (just start writing it). It also plays great with grafana for building dashboards. Finally, I much prefer the ergonomics enhancements of the influx query language (v1, the v2 "flux" language looks terrible to me personally), particularly how duration strings are a first-class datatype ("group by time(1h)").
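For readers unfamiliar with that syntax, a "group by time(1h)" boils down to truncating each timestamp to its hour bucket and aggregating per bucket; a small Python sketch (not Influx's implementation):

```python
# Toy time-bucketed aggregation: truncate each unix timestamp to the
# start of its bucket, then average the values in each bucket.
from collections import defaultdict

def group_by_time(points, bucket_seconds=3600):
    """points: (unix_ts, value) pairs; returns {bucket_start: mean value}."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

points = [(3600, 1.0), (5400, 3.0), (7200, 10.0)]
print(group_by_time(points))  # {3600: 2.0, 7200: 10.0}
```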
Interested to hear more about clickhouse performance, haven't had a chance to use it for anything where performance would matter significantly, although am aware it can be very fast.",0,22479484,0,"[]","",0,"","[]",0 22479791,0,"comment","deepsun","2020-03-03 23:34:47.000000000","No, Clickhouse doesn't support UPDATEs, and schema changes are even harder than in classic SQL DBs. But we love it anyway :)",0,22478338,0,"[]","",0,"","[]",0 22479895,0,"comment","dunkelheit","2020-03-03 23:49:57.000000000","> There are several good analytical databases that will give you the same scan performance
The notion that you will get approximately the same query performance with all column stores is false. There can easily be an order of magnitude difference depending on the implementation. Take GROUP BY as a paradigmatic example of what OLAP stores do. Of course the way to implement GROUP BY is with a hash table, but little tricks make all the difference, and a lot of love went into the clickhouse implementation. Just to give you a taste: there is a custom hash table with specializations for different key types (e.g. it will store a precomputed hash for strings but not for integers and use it to speed up equality tests). Variable-length data is stored in arenas to reduce allocator pressure. Data will of course be aggregated in several hash tables in different threads and then merged together, but if there are a lot of keys each table will additionally be sharded so that the merge step can be performed in parallel too.
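To make the sharding idea concrete, here is a toy Python sketch of my own (an illustration of the technique, not ClickHouse's actual code): each worker aggregates into per-shard hash tables keyed by the hash of the group key, so the final merge can proceed shard by shard.

```python
from collections import defaultdict

NUM_SHARDS = 4

def aggregate_sharded(rows):
    # Each worker aggregates into per-shard hash tables; a key always
    # lands in the same shard, so shards from different workers align.
    shards = [defaultdict(int) for _ in range(NUM_SHARDS)]
    for key, value in rows:
        shards[hash(key) % NUM_SHARDS][key] += value
    return shards

def merge(all_worker_shards):
    # Shard i only ever contains keys that hash to i, so in a real
    # engine each shard index could be merged by a separate thread.
    merged = defaultdict(int)
    for shards in all_worker_shards:
        for shard in shards:
            for key, value in shard.items():
                merged[key] += value
    return dict(merged)

part1 = aggregate_sharded([('a', 1), ('b', 2), ('a', 3)])
part2 = aggregate_sharded([('b', 4), ('c', 5)])
merged = merge([part1, part2])  # {'a': 4, 'b': 6, 'c': 5}
```

The point is only that the per-key work never needs cross-thread locks; all coordination is deferred to the shard-wise merge.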
Of course you shouldn't trust random claims on the internet that clickhouse is fast and should do a small case study yourself. Then you'll appreciate how easy it is to set up a clickhouse instance or a small cluster. It can easily slurp up most common formats. It is just a single binary with minimal dependencies that will run as-is on any modern linux. There is just a single node type (compare this to druid madness).
You are right that there are a lot of limitations and, how should I put it, quirks. This is resoundingly not a general-purpose database, and someone used to the comforts of e.g. postgres will encounter some nasty surprises. Bugs are unfortunately common, especially in the newer functionality. But performance is its main feature, and it makes many users of clickhouse put up with its limitations.
As a column store engine that supports (hybrid) SQL, manages its own storage and clustering, and has only one external dependency (Zookeeper), I believe its main competitor is Vertica, which can be very expensive. I assume Oracle, IBM, and MS have column stores as well, also at a cost.
Greenplum is a Postgres fork, and the same scale requires much more hardware. Citus is row based, and therefore will lag in scan time for many OLAP query patterns. Presto, Hive, Spark, all of the "post-hadoop" options may scale larger, but will also lag in scan time, and have significant external dependencies - mainly storage.
Clickhouse is easy to install, configure a cluster, load and query. It does have limitations, but currently all horizontally scaled database platforms do.",0,22478661,0,"[]","",0,"","[]",0 22480758,0,"comment","jfim","2020-03-04 02:39:11.000000000","> Of course the way to implement GROUP BY is with a hash table but little tricks make all the difference and a lot of love went into the clickhouse implementation.
Actually, for low cardinality columns, _not_ using a hash table will speed things up.
For example, a dictionary-encoded column for states might have a few values in its dictionary (1 => CA, 2 => FL, 3 => NY) and the data looks like an array of numbers (eg. [1, 1, 1, 2, 1, 3, 3, 3, 1, 3, 2]). The fastest way to aggregate is to actually use an array, as the dictionary index conveniently maps to an array index.
Then, when it comes to merging several of those arrays, they're turned into hash tables.
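A tiny Python sketch of that trick (my own illustration, assuming 0-based dictionary codes for simplicity): the code is itself the array index during aggregation, and only the merge falls back to a hash table keyed by the decoded value.

```python
def aggregate_dict_encoded(codes, values, dict_size):
    # For a low-cardinality dictionary-encoded column the code itself
    # is a valid array index, so no hashing is needed at all.
    sums = [0] * dict_size
    for code, value in zip(codes, values):
        sums[code] += value
    return sums

def merge_to_hash(states, dictionary):
    # Merging several per-thread arrays: fall back to a plain dict
    # keyed by the decoded value.
    merged = {}
    for sums in states:
        for code, total in enumerate(sums):
            key = dictionary[code]
            merged[key] = merged.get(key, 0) + total
    return merged

dictionary = {0: 'CA', 1: 'FL', 2: 'NY'}
s1 = aggregate_dict_encoded([0, 0, 1], [10, 5, 7], 3)
s2 = aggregate_dict_encoded([2, 0], [3, 2], 3)
result = merge_to_hash([s1, s2], dictionary)  # {'CA': 17, 'FL': 7, 'NY': 3}
```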
Combine enough of those optimizations and proper data layouts, and you end up with several orders of magnitude of performance differences between engines.",0,22479895,0,"[]","",0,"","[]",0 22480870,0,"comment","golover721","2020-03-04 03:04:57.000000000","Clickhouse is extremely optimized for performing analytics on time series style data. Things like clickstream data (hence the name). Performing queries to do funnel level analysis for example, are extremely fast compared to other analytics databases I have tried. As others have mentioned it is often compared to Apache Druid in its intent and feature set.",0,22478661,0,"[]","",0,"","[]",0 22480874,0,"comment","hodgesrm","2020-03-04 03:05:21.000000000","In addition to excellent raw scan rates for reasons given above ClickHouse has materialized views, which can re-order and/or preaggregate data. This can speed up query response by 3 orders of magnitude over queries on the source data. See https://www.altinity.com/blog/clickhouse-materialized-views-... + ClickHouse docs for an intro.
As the parent says, try it yourself on your own data. That's all that counts.
Disclaimer: I wrote the blog article and we sell support for ClickHouse.",0,22479895,0,"[]","",0,"","[]",0 22481008,0,"comment","barrkel","2020-03-04 03:36:15.000000000","How well do those work on a single 8GB node? Because ClickHouse works very well at that scale, with a single C++ executable.
There's large complexity and cost overheads to Hadoop solutions, and not everyone has actual big data problems. ClickHouse hugely outperforms on query patterns that would devolve into table scans in a row store, while working at row store volumes of data without a bunch of big nodes.",0,22478688,0,"[]","",0,"","[]",0 22481530,0,"comment","tadkar","2020-03-04 05:57:04.000000000","To summarise a lot of the responses here, Clickhouse is extremely fast on very modest hardware, very easy to set up, very easy to get started (mostly normal SQL) and free. For our workloads and scale of data, nothing comes close in terms of performance (redshift, BQ, spark) and especially TCO. It is super simple to try out, why not give it a go? Or is the question borne from a previous bad experience?",0,22478661,0,"[]","",0,"","[]",0 22481541,0,"comment","econcon","2020-03-04 06:00:03.000000000","Its main use in analytics and real time dashboard/report generation in areas like adtech.
If a conversion happens, you need to generate the views/cubes again in the reporting dashboard; clickhouse makes it cheap and easy to run such operations on commodity hardware.
If you don't have this, you'll be using big query and it might not be as fast.",0,22478661,0,"[]","",0,"","[]",0 22481850,0,"comment","pyppo","2020-03-04 07:27:08.000000000","Perspective from sentry.io (we use it in production as wikibob pointed out below): Besides all the performance considerations discussed in this thread, which are 100% correct, a few other features we really like are:
- the multiple table engines that are heavily optimized for specific data access patterns: ReplacingMergeTrees, which essentially make records mutable; SummingMergeTrees, which allow us to progressively build pre-aggregated data; AggregatingMergeTrees, which allow storing the intermediate aggregation state of most aggregation functions and composing them at query time over multiple groups (for example, store a p95 aggregation state hourly and query the daily p95 by composing them); and more.
- the set of column data types is extensive and includes nested columns
- the architecture is relatively simple, so developers and on-prem users can deploy a single-node local clickhouse very easily
- it is very efficient at inserting big batches of data, which works really well for our use case, where we ingest massive amounts of errors.
- data skipping indexes, bloom filter indexes
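The intermediate-state idea above can be sketched in a few lines of Python (my own toy model: average is used because its state is trivially mergeable, while a real p95 state would be a mergeable quantile sketch rather than the final number):

```python
def avg_state(values):
    # Intermediate state for avg(): a (sum, count) pair.
    # Unlike the final average, two such states can be merged losslessly.
    return (sum(values), len(values))

def merge_avg_states(states):
    # Compose e.g. hourly states into a daily result at query time.
    total = sum(s for s, _ in states)
    count = sum(c for _, c in states)
    return total / count

hourly = [avg_state([10, 20]), avg_state([30]), avg_state([40, 50, 60])]
daily_avg = merge_avg_states(hourly)  # 35.0
```

Storing the state instead of the result is exactly what lets one materialized view answer queries at several granularities.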
(yes, as vlad@sentry.io mentioned below we are hiring for the team that manages storage and thus clickhouse)",0,22478661,0,"[]","",0,"","[]",0 22481952,0,"comment","peferron","2020-03-04 08:02:03.000000000","> Apache Druid is supposed to be very mature, but also very difficult to set up and manage
I wouldn't be surprised if this inverted at large scale (say 30+ machines). Druid data servers are rebalanced automatically; if you're on AWS and decide to scale up by adding a new data server, it will automatically load its assigned subset of data from S3. If AWS kills one of your data servers, then other data servers will automatically load from S3 some of the data that server used to carry, in order to reach the desired replication factor again.
Last time I checked ClickHouse had no automatic rebalancing at all, which sounds horrendous unless you're running at very small scale or willing to have people babysit it. I haven't operated ClickHouse at large scale though, so if I'm wrong I'd be happy to hear how people manage ordinary tasks like scaling up and down, replacing dead instances, changing instance types to adjust CPU/mem/disk, etc... with let's say a 100 TB compressed dataset.
Another difference is that Druid can index all dimensions. So if you plan to run queries with filters that only match a small fraction of rows, then Druid can be faster than ClickHouse. Conversely, if your queries have filters that match many rows, then ClickHouse will be faster because it has higher raw scan speed. (At least that was the case about a year ago. Since then, Druid has added vectorized aggregation, which I haven't benchmarked, but I'd bet that ClickHouse is still faster at doing full table scans.)
IMO these 2 things are the main elements to think about when choosing between ClickHouse and Druid.",0,22478905,0,"[22488601]","",0,"","[]",0 22482172,0,"comment","dunkelheit","2020-03-04 08:55:31.000000000","Probably you are right about vertica and snowflake (where can I browse the source code to be sure?) but with these clickhouse competes on price.",0,22481211,0,"[]","",0,"","[]",0 22487356,0,"comment","1996","2020-03-04 19:55:00.000000000","I should have phrased that differently: if something is good enough in some key metric, it extends to other uses - even if it makes a poor fit.
So I insist: everyone will WANT to use clickhouse as a general purpose database, and will create ways to make it so (ex: copy table with the columns you don't want filtered out, drop the original, rename)
It is just too fast and too good for many other things, so it will expand from these strongholds to the rest.
A personal example: I am migrating my cold storage to clickhouse, because I can just copy the files in place and be up and running.
I know about insert and the likes, I have a great existing system - but this lets me simplify the design, and deprecate many things. Fewer moving parts is in general better.
After that is done, there is a database where I would benefit from things like alter tables or advanced joins, but keeping PostgreSQL and ClickHouse side by side, just for this? No. PostgreSQL will go. Dirty tricks will be deployed. Data will be duplicated if necessary.",0,22478876,0,"[22487619]","",0,"","[]",0 22487619,0,"comment","hodgesrm","2020-03-04 20:21:28.000000000","Advanced joins (specifically merge joins) and object storage are on the way. See the following PRs:
* https://github.com/ClickHouse/ClickHouse/pulls?q=is%3Apr+mer... -- Recent work to enable merge joins
* https://github.com/ClickHouse/ClickHouse/pulls?q=is%3Apr+s3 -- Same thing for managing data on S3 compatible object storage
There's been a lot of community interest in both topics. Merge join work is largely driven by the ClickHouse team at Yandex. Object storage contributions are from a wider range of teams.
That said I don't see ClickHouse replacing OLTP databases any time soon. It's an analytic store and many of the design choices favor fast, resource efficient scanning and aggregation over large datasets. ClickHouse is not the right choice for high levels of concurrent users working on mutable point data. For this Redis, PostgreSQL, or MySQL are your friends.",0,22487356,0,"[]","",0,"","[]",0 22488601,0,"comment","oddtodd","2020-03-04 21:58:04.000000000","I have worked with Druid and ClickHouse extensively in the past year.
I do think rebalancing is a weak point for ClickHouse, although for our use case that would not be so much of an issue, and it feels like it is on the roadmap for ClickHouse this year, but we will see. And if you are on Kubernetes, some of that headache may be handled for you with the ClickHouse Kubernetes operator.
I will say that Druid indexing comes at a heavy cost in hardware for ingestion.
We find ClickHouse can easily ingest at least 3x the rate of Druid on the same hardware, and since Druid is asymmetric in design, you then have to get even more hardware to handle the queries.
Even with the vectorized aggregation, ClickHouse is beating Druid for full table scans at least, especially on high cardinality data. But the vectorized aggregation has some restrictions to get on the fast paths, so that may improve as those are removed.
Overall, I find ClickHouse much easier to work with and manage compared to Druid. ymmv",0,22481952,0,"[]","",0,"","[]",0 18786734,0,"comment","gerdesj","2018-12-30 01:15:49.000000000","I still scratch my head as to why doing a kernel update of Ubuntu running under Hyper-V on a spinning disk is so horrifically slow.
Define slow please.
For a laugh I picked a random VM (VMWare) at work and ran (I did apt update first):
# time apt upgrade
...
82 to upgrade, 5 to newly install, 0 to remove and 0 not to upgrade
...
real 6m16.015s
user 2m38.936s
sys 0m55.216s
The updates included two client server DB engines (Clickhouse and Postgresql) the fileserving thingie (Samba) and a few other bits. The reboot takes about 90 seconds before the webby interface appears for rspamd.",0,18785069,0,"[18787264]","",0,"","[]",0
22361316,0,"comment","deepsun","2020-02-18 22:57:31.000000000","I'd recommend them to check out Clickhouse for exactly the same purposes. Works well for Cloudflare, Yandex, Sentry. Another idea is to run probabilistic queries instead of exact ones, which could bring down costs even more.",0,22359324,0,"[]","",0,"","[]",0 22361628,0,"comment","streblo","2020-02-18 23:43:49.000000000","How does materialize compare in performance (especially ingress/egress latency) to other OLAP systems like Druid or ClickHouse? Would love to see some benchmarks.",0,22359769,0,"[22364627,22363243]","",0,"","[]",0 22362534,0,"comment","justlexi93","2020-02-19 02:17:48.000000000","Clickhouse has materialized views and is free.",0,22359769,0,"[22362922]","",0,"","[]",0 22362922,0,"comment","vhold","2020-02-19 03:55:37.000000000","I think the biggest difference is that Materialize can do any kind of SQL join on many tables at once. Clickhouse materialized views can only reference one table.
What I'd like to know is if that would enable basically implementing social networks as just 3 tables and one materialized view, and how it would scale and perform.
Users, Relationships, Post, and a Feed materialized view that simply joins them together with an index of user_id and post_timestamp.
As relationships and messages are created or deleted, the feed view is nearly instantly updated. The whole entire view service logic then is just one really fast query. "select user,post,post_timestamp from feed where user_id = current_user and post_timestamp <= last_page_post_timestamp order by post_timestamp desc limit page_size"",0,22362534,0,"[22363189]","",0,"","[]",0 18811677,0,"comment","tyingq","2019-01-03 00:17:54.000000000","Clickhouse is another option: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...",0,18810956,0,"[]","",0,"","[]",0 25636056,0,"comment","zX41ZdbW","2021-01-04 18:49:44.000000000","Also take a look at the more technical overview of the Octoverse: https://gh.clickhouse.tech/explorer/
It is like an interactive notebook where you can edit and run your own queries on the metadata from GitHub.",0,25621516,0,"[25640489]","",0,"","[]",0 25638853,0,"comment","jariel","2021-01-04 22:01:55.000000000","The top repo by stars (according to query: https://gh.clickhouse.tech/explorer/#counting-stars) is:
https://gh.clickhouse.tech/explorer/#counting-stars
Which is a 'protest' against 996 work schedule.
That's notable because I wouldn't have thought of GitHub as being a conduit for Chinese labour concerns before 2020.
Secondarily: TS has shot up the ranks, arguably a flavour of JS (already #1), together making them overwhelmingly the top dog.
Python now #2.
It would be interesting to see the types of information repo'd there by language, if anyone has resources on that please feel free to share.
We roughly know what people are doing in JS and C++ ... but in Python? What exactly is happening? We can only speculate. Is it Django? Devop scripts? Data? Education? AI?",0,25621516,0,"[25639709,25640642]","",0,"","[]",0 25660334,0,"comment","pachico","2021-01-06 18:01:35.000000000","I just wished it used Clickhouse as persistence layer.",0,25650795,0,"[]","",0,"","[]",0 25666630,0,"comment","joshxyz","2021-01-07 02:40:18.000000000","Id say try em out as you see them fit. Postgresql for relational, redis for cache or some streams, elasticsearch for search, clickhouse for analytics.. each of them is intended for different use cases.
The managed instances by cloud vendors (microsoft azure, google cloud platform, amazon web services, digitalocean) should also help you get up and running without worrying about the infra side of things on small-scale projects.
A bit outdated, but here are some comparisons:
https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis",0,25664397,0,"[]","",0,"","[]",0 25686882,0,"comment","PeterZaitsev","2021-01-08 17:09:19.000000000","If you want cost efficient high performance Clickhouse is hard to beat",0,25686173,0,"[]","",0,"","[]",0 22580357,0,"comment","zX41ZdbW","2020-03-15 01:33:40.000000000","Example of generic COW implementation in C++: https://github.com/ClickHouse/ClickHouse/blob/master/dbms/sr...",0,22577690,0,"[]","",0,"","[]",0 22590119,0,"comment","shaklee3","2020-03-16 04:55:44.000000000","Can anyone from Netflix comment on why not clickhouse? We tried druid, and the performance was pretty bad compared to a clickhouse.",0,22574647,0,"[22590393]","",0,"","[]",0 22590393,0,"comment","modarts","2020-03-16 05:57:48.000000000","Can confirm clickhouse is generally faster across most typical workloads",0,22590119,0,"[22590539]","",0,"","[]",0 22590539,0,"comment","jpgvm","2020-03-16 06:38:40.000000000","Clickhouse is much more resource efficient in many cases but is less flexible and importantly extensible than Druid.
Druid can easily be extended through available 3rd party extensions and you can write your own to implement custom serialisation formats, aggregations, connect to new streaming systems, read directly from whatever cold storage you have etc.
In the Clickhouse model you have to work out a lot more of that stuff yourself, though these days it can read from Kafka directly, which is useful.
One thing that is important for Clickhouse vs Druid at big scale is the rather large difference in indexing approaches. Clickhouse uses bloom filters and other probabilistic data structures to index large chunks of data; for the most part, though, actually checking for rows requires a full scan of that chunk to strip false positives.
This is different from Druid, which uses full inverted indices for dimension filtering.
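To illustrate the chunk-skipping pattern (my own toy Python model, not ClickHouse's real index structures): one bloom filter per chunk rules out chunks that definitely lack the value, and chunks that pass are linearly scanned to strip false positives.

```python
import hashlib

class TinyBloom:
    def __init__(self, bits=64, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, item):
        # Derive several bit positions per item from a cryptographic hash.
        for i in range(self.hashes):
            digest = hashlib.sha256(f'{i}:{item}'.encode()).digest()
            yield int.from_bytes(digest[:4], 'big') % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.field |= 1 << pos

    def might_contain(self, item):
        # No false negatives; false positives are possible.
        return all(self.field & (1 << pos) for pos in self._positions(item))

# One filter per chunk: skip chunks the filter rules out entirely;
# chunks that pass still need a linear scan to weed out false positives.
chunks = [['alice', 'bob'], ['carol', 'dave']]
filters = []
for chunk in chunks:
    bf = TinyBloom()
    for row in chunk:
        bf.add(row)
    filters.append(bf)

hits = [row
        for chunk, bf in zip(chunks, filters)
        if bf.might_contain('carol')
        for row in chunk
        if row == 'carol']
```

The filter only decides which chunks to scan at all; the scan itself remains linear, which is the tradeoff being described.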
The tradeoff is basically Clickhouse is cheaper, especially when scaling out but Druid is faster especially when the cluster is under heavy concurrent query load, like serving analytics dashboards or data exploration interfaces to users.
Clickhouse excels when you want to scan most but not all the data most of the time. Namely reporting or bulk analytics queries that will hit most rows in a block.
I consider both to be excellent databases.",0,22590393,0,"[22593630,22592398,22591489]","",0,"","[]",0 22591489,0,"comment","danielbln","2020-03-16 10:31:58.000000000","In our experience, on a significantly smaller scale, Clickhouse is vastly easier to operate compared to Druid, with all of its various components that all have various knobs and dials to configure and have to be orchestrated.",0,22590539,0,"[22593455,22592471]","",0,"","[]",0 22592398,0,"comment","shaklee3","2020-03-16 13:20:06.000000000","I think what you're getting at can be accomplished with materialized views in clickhouse now. Most queries that might be fast with inverted indices can be solved that way.
Also, I don't think they use bloom filters for the index as far as I can tell from the documentation. There is certainly an option to use a bloom filter aggregator on a table for faster counts, but it's not the default. If you're referring to the fact that count() is not precise, there's an exact count function too. This is my speculation, though, and you may be right.",0,22590539,0,"[22592482,22593776]","",0,"","[]",0 22593423,0,"comment","derefr","2020-03-16 15:18:46.000000000","Mind you, given the “Timely Dataflow” abstraction Materialize operates on top of, if you give it a query that only requires certain result rows from one of its mat views, then Materialize is only going to compute the intermediate rows (and, further back, retrieve the source rows) required to “render” the particular result-rows you ask for. (Sort of like how Excel, in memory-constrained conditions, only computes the intermediate cell-values required to render the cells currently in view, and thus, if an intermediate cell requires an `XmlHttpRequest` call to resolve, that call won’t fire if the cell’s value doesn’t need resolving yet.)
Because of that, you don’t really need to scale Materialize in a sharding sense. You can just have a bunch of “the same” Materialize node (i.e. every node just freestanding clone of a template node, with exactly the same sources and matviews) and then hit them with the parts of a map-reduce query launched by, say, Citus—where Citus was thinking it was talking to a bunch of Citus shard nodes each holding a table-shard named X, but was actually talking to Materialize nodes each holding a matview named X. As long as the query sent from the map-reduce job to each node is constrained in its WHERE clause to only the part of the data it expects to get from that node—rather than relying on the node to know what data it has—then the Materialize nodes would each just do the work required to supply that data (including only pulling in the parts of the configured sources required to compute that result.)
That’s just my intuition from how Materialize presents itself as PG-wire-protocol compatible, though; I haven’t tried this myself, and there might be some footguns in the path of anyone really trying to implement it.
And, of course, this is all irrelevant the moment you write a query that needs a pure reduce (e.g. the computation of a current finite-state-machine state over an event-stream source) rather than a map-reduce. Druid/Clickhouse/etc. can probably “scale” those, in at least the Hadoop “move the job around between each serial stage, so each stage has data-locality for the data of that stage” sense; while Materialize would give you no benefit at all in such a job over just querying a plain PG view defined on top of a Foreign Data Wrapper source.",0,22592520,0,"[22604979]","",0,"","[]",0 15288555,0,"story","zX41ZdbW","2017-09-19 20:35:17.000000000","",0,0,0,"[]","http://www.proxysql.com/blog/proxysql-143-clickhouse",10,"ProxySQL integrates ClickHouse with MySQL protocol","[]",0 22593776,0,"comment","peferron","2020-03-16 16:02:47.000000000","> I think what you're getting at can be accomplished with materialized views in clickhouse now. Most queries that might be fast with inverted indices can be solved that way.
Let's say you have data with a few dozen dimensions, and want to compute aggregations filtered by any user-supplied union or intersection of dimension values. This is a fairly common use case in analytics dashboards. How do materialized views help with that?",0,22592398,0,"[]","",0,"","[]",0 22599107,0,"comment","jpgvm","2020-03-16 22:55:12.000000000","Good point regarding aggregations, especially if done at ingest time w/rollup they really make a big difference. Clickhouse has aggregations too but they are done at merge time in the background.
Vectorized query engine and JOINs sounds awesome.
(We did meet in SF! Beer hall!)",0,22593630,0,"[]","",0,"","[]",0 18898048,0,"comment","jetter","2019-01-13 18:53:30.000000000","Clickhouse is a good (self hosted) alternative to Elasticsearch for log storage: it saves a lot of space due to better compression, it supports sql (with regex search instead of useless by-word indexing), and ingestion speed is great.",0,18897910,0,"[18898129]","",0,"","[]",0 18898129,0,"comment","mdaniel","2019-01-13 19:09:03.000000000","> (with regex search instead of useless by-word indexing)
Perhaps I misunderstand your situation, but I don't see any "CREATE INDEX" available in Clickhouse, and thus won't "SELECT * FROM logs WHERE match(message, '(?i)error.*database')" require a full column-scan (including, as you mentioned, decompressing it)? Versus the very idea of an indexer like ES is "give me all documents that have the token 'ERROR' and the token 'database'" which need not tablescan anything
I only learned about the project 9 minutes ago so any experiences you can share about the actual performance of those queries would be enlightening -- maybe it's so fast that my concern isn't relevant",0,18898048,0,"[18898193,18898355,18911829]","",0,"","[]",0 18898193,0,"comment","ryanworl","2019-01-13 19:21:18.000000000","Clickhouse is designed for full table scans. It allows one index per table, usually a compound key including the date as the leftmost part of the key. This allows it to eliminate blocks of data that don’t contain relevant time ranges. It is also a column store, so the data being read is only the columns used in the query.
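That block elimination can be sketched roughly like this (my own toy model of a sparse min-key index with the date leftmost, not ClickHouse internals):

```python
import bisect
from datetime import date

# Sparse index: one minimum-key mark per block of rows that are
# physically sorted by the primary key (date leftmost).
blocks = [
    (date(2019, 1, 1), ['row1', 'row2']),
    (date(2019, 1, 8), ['row3', 'row4']),
    (date(2019, 1, 15), ['row5', 'row6']),
]

def candidate_blocks(lo, hi):
    # Keep only blocks whose key range can overlap [lo, hi];
    # everything else is skipped without being read at all.
    marks = [b[0] for b in blocks]
    start = max(bisect.bisect_right(marks, lo) - 1, 0)
    end = bisect.bisect_right(marks, hi)
    return blocks[start:end]

candidate_blocks(date(2019, 1, 8), date(2019, 1, 10))  # only the middle block
```

A time-range filter therefore reads only the blocks it can possibly hit, and within those only the referenced columns.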
If your query is linearly scalable conceptually, Clickhouse is also linearly scalable. Per core performance is also pretty good. (tens of millions of rows per second on good hardware and simple queries, like most log aggregation queries are)",0,18898129,0,"[]","",0,"","[]",0 18898355,0,"comment","lykr0n","2019-01-13 20:01:10.000000000","Clickhouse (like any other SQL DB) would work great if you could chop up your log files into fields and store one type per DB. ElasticSearch is great for this because you don't have to worry about schema; with ClickHouse you will... unless you use two arrays: one for the field type, and one for the field value.
If you value being able to store arbitrary log files, ClickHouse is not for you. If you want to build your system to generate tables on the fly, ClickHouse might work.
See: https://github.com/flant/loghouse",0,18898129,0,"[]","",0,"","[]",0 25722170,0,"comment","zX41ZdbW","2021-01-11 00:19:28.000000000","You can obtain this data for all existing repositories by one request in one second:
https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUCiAgIC...",0,25719116,0,"[25725897]","",0,"","[]",0 18905120,0,"comment","hodgesrm","2019-01-14 18:29:12.000000000","> These companies [AWS, Microsoft, and Google] are some of the biggest contributors to open source.
True, but their contributions are quite selective. Google writes very good infrastructure code. However OSS DBMS services are quite different. They tend to originate in companies with data problems. Sometimes that's a company like Google or Microsoft, but more often it's a company like Facebook that has an issue processing a lot of data.
BTW I'm not so sure AWS counts as a big OSS contributor. They use it but don't really make a lot of contributions back. This may change over time.
Disclaimer: I work at Altinity, which offers support and software for ClickHouse. (Incidentally written by Yandex to process their clickstream data.)",0,18902851,0,"[]","",0,"","[]",0 18911829,0,"comment","hodgesrm","2019-01-15 15:32:46.000000000","You can use materialized views in ClickHouse to simulate secondary indexes. See https://www.percona.com/blog/2019/01/14/should-you-use-click... for an example of this usage. It's about half-way through the article.
Disclaimer: I work for Altinity which is commercializing ClickHouse.",0,18898129,0,"[]","",0,"","[]",0 22619025,0,"story","bankim","2020-03-18 17:25:31.000000000","",0,0,0,"[22621701]","https://blog.cloudera.com/benchmarking-time-series-workloads-on-apache-kudu-using-tsbs/",14,"Benchmarking Time Series Workloads with Kudu, InfluxDB and ClickHouse","[]",1 25749333,0,"comment","jetter","2021-01-12 16:42:32.000000000","Thank you! Yes this is a big solo project now. Didn't have time to do a write up yet. I am a big fan of Clickhouse, so I use it for log storage. To collect server logs, it is not exactly a simple file parsing setup - I have a gateway daemon that stays in front of the API server. The daemon receives all the requests from subscribers, authorizes that this specific connection has access to specific API, rate limits, and proxies all the connections to upstream API, dumping HTTP logs to Clickhouse once in a while. Then Vue.js & Laravel powered dashboard queries Clickhouse to generate various stats, which is later used for analytics and usage-based billing.",0,25748578,0,"[]","",0,"","[]",0 22649877,0,"comment","jpgvm","2020-03-21 20:40:25.000000000","Linear search approaches fall down when you have a lot of data and you only want to select a very small portion of it.
A linear-scan approach can get you to about 1GB/s or so per core with Rust.
A medium-ish size startup probably logs around 200GB/day of logs if they aren't very tight on their log volume. If you only want to search the last 24 hours that is maybe ok, you can search that in ~10-20 seconds on a single machine.
However, this quickly breaks down when your log volume is a multiple of this and/or you want to search more than just a few hours.
In which case you need some sort of index.
There are different approaches to indexing logs. The most common is full text search indexing using an engine like Lucene. Elasticsearch (from the ELK stack) and Solr explicitly use Lucene. Splunk uses its own indexing format, but I'm pretty sure it's in a similar vein. Papertrail uses Clickhouse, which probably means they are using some sort of data-skipping indices and lots of linear searching.
Of these approaches Clickhouse is probably the best way to go. It combines fast linear search with distributed storage and data-skipping indices that reduce the amount of data you need to scan (especially if you filter with PREWHERE clauses).
So why not go with Clickhouse? Clickhouse requires a schema. You can do various things like flatten your nested structured data into KV (not a problem if you are already using a flat system) and have a single column for all keys and the other column for values. This works but doesn't get great compression, makes filtering ineffective for the most part and you now have to operate a distributed database that requires Zookeeper for coordination.
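The flatten-into-KV trick mentioned above looks roughly like this in Python (my own sketch of the common pattern, not any particular tool's implementation):

```python
def flatten(record, prefix=''):
    # Flatten a nested log record into parallel key/value arrays,
    # the usual trick for fitting schemaless logs into a fixed schema.
    keys, values = [], []
    for k, v in record.items():
        path = f'{prefix}{k}'
        if isinstance(v, dict):
            sub_keys, sub_values = flatten(v, f'{path}.')
            keys += sub_keys
            values += sub_values
        else:
            keys.append(path)
            values.append(str(v))
    return keys, values

kv = flatten({'level': 'error', 'http': {'status': 500, 'path': '/api'}})
# (['level', 'http.status', 'http.path'], ['error', '500', '/api'])
```

Everything is coerced to strings in one values column, which is exactly why compression and filtering suffer compared to typed per-field columns.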
The reason I am choosing to build my own is that logs require unique indexing characteristics. First and foremost, the storage system needs to be fully schemaless. Secondly, you need to retain non-word characters. The standard Lucene tokenizers in Elastic strip important punctuation that you might want to match on when searching log data. Field dimensionality can be very high, so you need a system that won't buckle with metadata overhead when there are crazy numbers of unique fields; the same goes for cardinality.
TLDR: For big users you must have indices in order not to search 20TB of logs for a month. Current indices suck for logs. I write custom index that is hella fast for regex.",0,22649733,0,"[22653403,22661285]","",0,"","[]",0 18940830,0,"comment","tristor","2019-01-18 17:17:59.000000000","There's also Clickhouse [1] which seems to scale much better than Druid, and has similar architectural decisions to make it somewhat general as a columnar store for OLAP uses. Cloudflare wrote an article in the past where they compared Clickhouse and Druid and they chose Clickhouse because they could get similar performance on the same workload with 9 nodes in Clickhouse which would require hundreds for Druid. They built all of the DNS analytics at CloudFlare on Clickhouse [2].
Disclosure: I work at Percona, and we've seen a lot of our customers make use of Clickhouse and have begun some of our own services work around it in Consulting. It's now a primary database talked about at our conferences, and we post about it regularly. [3]
[1]: https://clickhouse.yandex/ [2]: https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-q... [3]: https://www.percona.com/blog/2018/10/01/clickhouse-two-years...",0,18940830,0,"[18942270,18941175]","",0,"","[]",0 18941175,0,"comment","polskibus","2019-01-18 17:52:46.000000000","Does ClickHouse support fine grained data security (for example role A gives access only to tuples with column X==123)?",0,18940830,0,"[18944117]","",0,"","[]",0 18942270,0,"comment","neeleshs","2019-01-18 19:41:54.000000000","There is a very good article[1] by one of the Druid committers about Clickhouse/Druid/Pinot that goes into some details on why the Cloudflare tests turned out the way they did.
[1]:https://medium.com/@leventov/comparison-of-the-open-source-o...",0,18940830,0,"[19011532]","",0,"","[]",0 18943000,0,"comment","zepearl","2019-01-18 20:57:30.000000000","I love DBeaver.
A long time ago I started searching for a DB client similar to "TOAD", but on Linux, targeting only my key requirements: 1) write & execute a single SQL on a "page" which holds multiple SQLs, without having to use any terminator (";") between them, 2) the ability to show the execution plan of the SQL in which the cursor is positioned, and 3) the ability to connect to different kinds of databases.
I ended up with nothing, and I actually even initially excluded DBeaver because it runs on Java (a personal thing - most Java apps I tried always had some kind of bug which made them a no-go for me).
After a while I gave it a try and I ended up being extremely happy with it, especially because I can use it with basically all databases that I am/was using (I'm currently using it mostly with MariaDB & Clickhouse, but used it in the past as well with DB2, Kudu, PostgreSQL).",0,18935138,0,"[]","",0,"","[]",0 18944117,0,"comment","atombender","2019-01-18 23:30:11.000000000","No [1]. ClickHouse is a fairly low-level tool. If you need that kind of thing, you build an ACL-aware app on top of it.
[1] https://clickhouse.yandex/docs/en/operations/access_rights/",0,18941175,0,"[]","",0,"","[]",0 25768188,0,"comment","joshxyz","2021-01-13 21:27:45.000000000","It's ok, I'm part of the audience that doesn't know much about databases; it's fun to read discussions about them once in a while.
I learned a lot about PostgreSQL, Redis, ClickHouse, and Elasticsearch here; people's perspectives here are great to learn from - they tell you which to avoid and which to try.",0,25767271,0,"[]","",0,"","[]",0 22653403,0,"comment","manish_gill","2020-03-22 07:47:14.000000000","My first thought was also Clickhouse and the problems with schema for dynamic log content.
Also, if this is open source, I will definitely be interested in checking it out/contributing etc.",0,22649877,0,"[]","",0,"","[]",0 22653894,0,"comment","antonrevyako","2020-03-22 10:02:06.000000000","Thank you for the feedback! The service will be put into commercial operation as a SaaS in the coming months. I do not plan to open the source code, unfortunately.
The idea is that the tool will work only with the source text of SQL queries, and I had to work hard to implement that.
The work consists of several steps: 1) get the AST (Abstract Syntax Tree) of the database schema (DDL). At this stage, only Postgresql up to version 10 is supported. Soon I will deal with the parser from Postgresql 13. This is not a trivial task: I have to build everything from the source code of Postgresql :)
2) Building a database model. We parse all DDL commands one by one and apply the changes. For example, apply all ALTER TABLEs to the table described above, add user-defined functions to the list of built-in functions, and so on. It is necessary to have a complete overview of all the types, tables, and indexes described in a DDL script.
3) Get an AST for the query (DML) and build a model of the result. This is the most complex and interesting part :) The task is to get the list of field names and their types which will be returned after the query executes. You need to consider CTEs and the list of tables specified in FROM. You need to understand which function will be called and what result it will produce. For example, the function ROUND is described in three variants, with different arguments and result types; the function ABS in six variants. I.e., it is required to understand the types of the arguments before selecting a suitable variant :) In the process, implicit type casting is applied where necessary.
The same is valid for operators. An operator in Postgresql is a separate entity that you can create yourself. Postgresql 11, for example, describes 788 operators.
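For readers unfamiliar with Postgresql operators being first-class objects, a minimal user-defined operator looks roughly like this (illustrative only, not part of the service; names are made up):

```sql
-- "Approximately equal" for integers: true if the values differ by at most 1.
CREATE FUNCTION int4_approx_eq(int4, int4) RETURNS boolean
    AS 'SELECT abs($1 - $2) <= 1' LANGUAGE SQL IMMUTABLE;

CREATE OPERATOR ~= (
    LEFTARG  = int4,
    RIGHTARG = int4,
    FUNCTION = int4_approx_eq
);

SELECT 3 ~= 4;  -- true: abs(3 - 4) <= 1
```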
Various types of syntax are taken into account, for example - SELECT *, t1.*, id, public.t2.name, t3.* FROM t1, t2, myschema.t3 - will be parsed correctly.
But even this is not the most challenging thing :) The most exciting thing is to be able to understand two things: A. Whether each of the fields can be NULL or not. It depends on many factors, such as how the JOIN with the source column's table is made, whether there is a FOREIGN KEY, what type of JOIN is used, what conditions are described in the ON section, and what is written in the WHERE conditions. B. How many records the query will return, described by the categories NONE, ONE, ONE OR NONE, MANY, MANY OR NONE. Again, this is affected by the conditions described in JOIN and WHERE, whether there are aggregate functions, whether there is GROUP BY, and whether there are functions that return multiple records.
This function, by the way, is also used in the first step - to get types for VIEW.
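Point A above can be illustrated with a small sketch (hypothetical tables):

```sql
-- Whether a result column can be NULL depends, among other things,
-- on the JOIN type:
SELECT u.id,     -- NOT NULL: key of the driving table
       o.total   -- may be NULL even if orders.total is declared NOT NULL,
                 -- because the LEFT JOIN can produce a non-matching row
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
```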
This was a brief description of the first part of the service ;). It can already be a service in itself. It is possible to generate types and all the code of microservices, including JSON-schema and tests, based on a set of DDL and DML. But as I wrote above, most people prefer to use an ORM such as Django or RoR. :( For this reason, I've removed this functionality from the playground and will move it to a separate project when I get to it. It will also include various tools, such as information about all possible exceptions that a query may throw, automatic creation of migrations in CI if your DDL files are in the repository, detection of unused indexes or fields, and many other exciting things :)
And the second part of the service, which I plan to promote first, is a tool to search for bugs and architectural and performance problems automatically. The target audience here is DBAs, who have to deal with the handiwork of RoR developers and their colleagues :) This part is entirely based on all the information obtained in the previous stages. Some of the errors can be caught from the AST alone (the linter principle), but the most interesting rules are based on knowledge about types, the NULL / IS NOT NULL analysis, and the number of returned records.
There are more than a thousand such rules, but I suppose there will be about 5000 of them in the next couple of years :) About 200 rules that can lead to runtime errors are also described, but DBAs don't need those, because their job is to find problems in valid queries :) There will be more than 1000 such rules as well, since Postgresql describes more than 1700 runtime errors.
And yes, this is all about Postgresql. After the Postgresql version goes into commercial use, I plan to do the same for Mysql, and then perhaps for Clickhouse if there are no special offers of cooperation :)",0,22652068,0,"[22654461,22655528]","",0,"","[]",0 22657523,0,"comment","pachico","2020-03-22 18:57:33.000000000","In my scarce free time, I'm working on a web analytics platform. Js, Go and ClickHouse.",0,22648431,0,"[]","",0,"","[]",0 22661285,0,"comment","hodgesrm","2020-03-23 04:32:23.000000000","Here's another thought--why not try to fix this in ClickHouse? It sounds like you are rebuilding Elastic.",0,22649877,0,"[22665534]","",0,"","[]",0 22665534,0,"comment","jpgvm","2020-03-23 16:26:29.000000000","I considered that but it's harder than it sounds. Clickhouse is very strongly coupled to the idea of a schema, and it is also very coupled to using indices only for data skipping.
If I were to make the changes I want to Clickhouse, i.e. schemaless storage and full per-segment indexes, then it wouldn't be Clickhouse anymore.",0,22661285,0,"[]","",0,"","[]",0 22666549,0,"comment","hwwc","2020-03-23 17:43:24.000000000","Location: Boston, US
Remote: Yes
Willing to relocate: No
Technologies: Rust, Python/Pandas, Node/JS, Clickhouse, Postgres, GCP/AWS, Linux
Resume: https://github.com/hwchen , https://www.linkedin.com/in/walther-chen-5b87a512/
Email: hello@hwc.io
ABOUT: I'm an experienced software engineer looking for part-time or short-term contracts.
I've most recently worked in a data-analytics backend-stack: from ETL to database design to web-api to devops. One of my major projects is an analytics engine for web applications (https://github.com/hwchen/tesseract) using Rust and Clickhouse.
However, I'm naturally curious and happy to work in any domain which requires high performance and maintainable code. I've worked with a distributed worker system, debugged async database drivers, and implemented text layout primitives.",0,22665396,0,"[]","",0,"","[]",0 22670293,0,"comment","hodgesrm","2020-03-24 00:00:22.000000000","Altinity | Multiple ClickHouse engineering positions | REMOTE in North America and Europe| Full-time | Competitive Salary and Equity
Hello! We are Altinity, a fast-growing database startup with a distributed team spanning from California to Eastern Europe. Our business is to make customers successful with ClickHouse, the leading open source data warehouse. Our customers range from ambitious startups to some of the most well-known enterprises on the planet. And we are looking for people to join us!
* Data Warehouse Implementation Engineer
* Data Warehouse Support Manager
* Data Warehouse Support Engineer
* Various product engineering positions
If you have experience with ClickHouse and want to join, check out our jobs here:
https://www.altinity.com/careers",0,22665398,0,"[]","",0,"","[]",0 25795804,0,"comment","hodgesrm","2021-01-15 20:18:35.000000000","There's also Cloki, which offers the same Loki APIs but stores data in ClickHouse. It looks interesting.
https://github.com/qxip/cloki-go
Disclaimer: I work on ClickHouse at Altinity.",0,25795351,0,"[25795888]","",0,"","[]",0 18991544,0,"comment","olavgg","2019-01-24 18:39:26.000000000","For data analytics I use ClickHouse instead of PostgreSQL. There is a PostgreSQL Foreign Data Wrapper (FDW) for the ClickHouse database, but I have never used it.",0,18991290,0,"[]","",0,"","[]",0 25814681,0,"comment","joshxyz","2021-01-17 20:36:48.000000000","I legit use https://hn.algolia.com/ and just type in terms like "postgresql" or "clickhouse" which is cool cause when something is relevant and sensible people upvote it, and you can filter by posts or comments, and you know, HN has good signal to noise ratio",0,25814083,0,"[]","",0,"","[]",0 15406899,0,"comment","dolbyzerr","2017-10-05 05:52:57.000000000","Divvit | Big Data Architect/Engineer | REMOTE | Full-time | http://www.divvit.com
We are a small startup aiming to make a difference in E-commerce Analytics. Our goal is to provide e-commerce owners with all the information they need for their businesses to succeed. We are a fully remote company.
We have a constantly growing number of events coming into our platform. We need an engineer who can build and maintain a new streaming and processing system that allows us to process and store large amounts of data, available for real-time analytics requests.
Responsibilities:
- Selecting and integrating any Big Data tools and frameworks required to provide requested capabilities
- Implementing ETL process
- Monitoring performance and advising any necessary infrastructure changes
Required Skills:
- Proficient understanding of distributed computing principles
- Ability to solve any ongoing issues with operating the cluster
- Experience with building stream-processing systems, using solutions such as Storm or Spark-Streaming
- Experience with integration of data from multiple data sources
- Experience with NoSQL databases, such as MongoDB, HBase or Cassandra
- Experience with brokers like Kafka or AWS Kinesis
- Experience with ElasticSearch
- Experience with Amazon Web Services
- Understanding of Lambda Architecture
It would be awesome if you have:
- Experience with Cloudera/MapR/Hortonworks
- Experience with Hadoop, HDFS, and querying tools like Pig, Hive, Impala
- Experience with Yandex ClickHouse
If you are interested, drop me a line: andrei@divvit.com",0,15384262,0,"[]","",0,"","[]",0 22714537,0,"comment","manigandham","2020-03-28 21:43:13.000000000","Yet another Prometheus/time-series backend project.
And yet again, it would be far better to just export the data from Prometheus into a distributed columnstore data warehouse like Clickhouse, MemSQL, Vertica (or other alternatives). This gives you fast SQL analysis across massive datasets, real-time updates regardless of ordering, and unlimited metadata, cardinality and overall flexibility.
Prometheus is good at scraping and metrics collection, but it's a terrible storage system.",0,22712933,0,"[22714574,22714675,22715286]","",0,"","[]",0 25834353,0,"comment","hodgesrm","2021-01-19 15:30:02.000000000","ClickHouse. It's Apache 2.0 and will stay that way.
Edit to add disclaimer: I work on ClickHouse.",0,25834187,0,"[]","",0,"","[]",0 19019943,0,"story","hodgesrm","2019-01-28 18:45:39.000000000","",0,0,0,"[19021952,19020686,19021896,19020965]","https://www.altinity.com/blog/migrating-from-redshift-to-clickhouse",73,"Migrating from Redshift to ClickHouse","[]",13 19020686,0,"comment","wgjordan","2019-01-28 20:18:09.000000000","> Reasons to move to ClickHouse
> the [VACUUM] process requires an outrageous amount of time
As of Dec 19 2018, Amazon Redshift now runs VACUUM DELETE automatically, and has been made drastically more resource-efficient [1].
> There are no queries in Redshift that take less than a couple of seconds.
This is likely due to incorrectly-tuned Workload Management Queues. In addition, as of Aug 8 2018, Redshift automatically enables short query acceleration [2], which should speed up short queries by default without additional tuning.
> According to our calculations, deploying ClickHouse on AWS instances with the same resources was exactly half as expensive.
You also need to factor in the cost of migration (two full-time specialists, three months effort), and ongoing support effort needed to maintain the now-custom system.
Plus, it sounds like some (possibly similar?) cost savings could have been achieved simply by optimizing the existing Redshift cluster:
> We can’t make a straight comparison about query speed because the data schema changed so much. But many queries sped up simply because less data are read from the disk. Truth be told, we should have made this change in Redshift, but we decided to combine it with our migration to ClickHouse.
In short, continue testing assumptions and tracking updated feature-sets on a regular basis- valid reasons to move to ClickHouse yesterday may no longer be valid today, or at some point in the future. Also, don't double down on sunk costs- it might make sense to migrate back to Redshift in the future, if your issues with the system are eventually improved or resolved. Or maybe not, depending on your use-case.
[1] https://aws.amazon.com/about-aws/whats-new/2018/12/amazon-re...
[2] https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-re...",0,19019943,0,"[19020888]","",0,"","[]",0 19020965,0,"comment","devereaux","2019-01-28 20:52:24.000000000","I love clickhouse: it's simple yet flexible enough and free software.
I'm migrating a lot of data to clickhouse on clusters of servers.
One of the unexpected gains is how it uses far less disk space, which means the data can be put on a few NVME in RAID10, which provides speed benefits even for data that can't fit in memory.
I am now also considering ClickHouse for "cold storage": either by compressing the directories or leaving them as such on a smaller server.
Being able to standardize on one thing for both production and storage would be nice.",0,19019943,0,"[19021584]","",0,"","[]",0 22732648,0,"comment","manigandham","2020-03-30 21:56:29.000000000","There's a similar story of using Druid at Netflix: https://news.ycombinator.com/item?id=22574647
Druid is a very specialized datastore with a niche usage around low-latency analytics under specific pre-set aggregations but it has a lot of moving parts. It also just recently got a beta SQL interface (which translates to the underlying custom JSON API) and isn't as easy to integrate with existing tools (outside of Kafka).
Overall I suggest smaller companies stick to a standard column-oriented data warehouse like Redshift, BigQuery, Snowflake, Clickhouse, MemSQL, etc instead of Druid.",0,22732604,0,"[]","",0,"","[]",0 19021896,0,"comment","manigandham","2019-01-28 22:56:41.000000000","Clickhouse is a fantastic piece of engineering but really needs work on the operational and maintenance side. Replication, backups, updates/deletes, etc are very rough and a big obstacle for greater usage.
That being said, even a single server can be many times faster than other data warehouse solutions if your data fits.",0,19019943,0,"[]","",0,"","[]",0 19021952,0,"comment","mmcniece","2019-01-28 23:03:22.000000000","Cool to see ClickHouse get some interest here. Been using it for about a year now and consistently impressed with the performance on analytical queries across very large (1b+ rows) tables
Couple pain points though:
* Integration Maturity - Many tools/services either don't have integration with CH or are missing features
* User Management/Security - Have to configure users in their custom XML format, only applied at database level (no table level or row level option), and doesn't plug into SSO, LDAP, etc.
* Getting "current state" for a table - e.g. some table has users and some attributes, harder than it should be to get the "current attribute value" for all users, to do analytics on
* Log Format - very challenging to pull into log aggregation tool and get helpful information from",0,19019943,0,"[]","",0,"","[]",0 19022341,0,"comment","sethhochberg","2019-01-28 23:54:03.000000000","Not the poster above, but using Clickhouse for similar purposes (archival/analytics on huge data that started out in MySQL and gets imported to Clickhouse for the long term). Some real-world numbers on our end for the same time period, roughly 2.5 billion rows:
InnoDB Barracuda - (logical data + indexes 583GB, physical ZFS LZ4: 386GB, ~1.51x compression ratio)
Clickhouse MergeTree - (logical 349GB, physical ZFS LZ4: 279GB, ~1.27x compression ratio)
The MySQL version of the data has a few different compound indexes which certainly contribute to the size difference, but regardless Clickhouse is dozens of times faster on complex queries against this data.
We run both our MySQL and Clickhouse servers on ZFS (Clickhouse accesses its datastore over NFS on a 10gbit link... in practical terms we have not seen a severe performance penalty for doing this, even though we know it isn't encouraged by the Clickhouse maintainers).",0,19021584,0,"[19036139]","",0,"","[]",0 19023509,0,"comment","paulryanrogers","2019-01-29 03:35:03.000000000","You mean 7z the raw SQL? Just as a preemptive check before moving to Clickhouse?",0,19022274,0,"[19023687]","",0,"","[]",0 19029643,0,"comment","polskibus","2019-01-29 20:43:54.000000000","I wonder if they considered adding GPU support to ClickHouse instead of building another database.",0,19028860,0,"[19030976]","",0,"","[]",0 22740295,0,"comment","manigandham","2020-03-31 17:51:46.000000000","Vertica lost its lead a long time ago. There are better columnstore data warehouses like MemSQL and Clickhouse, and most people are just moving to the cloud with Redshift, BigQuery and Snowflake.
Druid doesn't really compete with these systems but is more of an addition if you need low-latency queries against defined and fully-indexed fields along with pre-aggregation (the source of its performance). It also has a lot more operational complexity with very basic SQL support.
As of 2020, I don't see much of a use for Druid since columnstores already support real-time updates and are adding indexing, aggregation pipelines and concurrency scaling.",0,22739461,0,"[22741761]","",0,"","[]",0 22740468,0,"comment","gopalv","2020-03-31 18:05:39.000000000","Apache Druid is pretty amazing tool, with one assumption - your data has an event timestamp as a crucial part of the data ingest & that it has no updates at all.
My run-ins with Vertica for BI/PM metrics data are almost a decade old, but it is a bit more powerful in the way it does projections + distributions, for instance.
The most common queries which Vertica got hit by was Unique users workloads, which had intersections - there was a single table being ingested, but 3 projections. One partitioned by user, one partitioned by (user, property), one partitioned by (user,property,date).
The biggest dimension tables were the A/B experiment id allocation list which was duplicated on every single host.
A better storage model for this would be something like a Replex [1]
Druid can be used for the same sort of workload at a high scale (i.e millions of users), where a best-effort distinct count is as good as the real thing, but much faster.
If I had to do this today, I would also use the BloomKFilter in Apache Druid for the experiment membership queries, which would also work better at approximate queries than anything built to generate accurate results (& store the dimension table in a slowly-changing-dimension store).
The real power of Druid is pushing the segments to S3 + being able to rehydrate off Kafka, to be able to handle entire local data loss without being very expensive with EBS (i.e. downloading segments to ephemeral SSDs), while answering dashboard queries where a pixel is bigger than the error bar on these approximations.
Plus the immutability of the data means, you can maintain a partial results cache at the segment granularity rather than recomputing for every refresh of the dashboard.
Picking up this problem today for a web-scale environment, I would pick Druid for experiment data streams and define rollup aggregates ahead of time (over, say, Clickhouse), but as things get more mutable and less time-ordered, other tools like Apache Kudu look better at the storage layer.
[1] https://blog.acolyer.org/2016/10/27/replex-a-scalable-highly...",0,22739461,0,"[22741755]","",0,"","[]",0 22740504,0,"comment","pachico","2020-03-31 18:08:54.000000000","At this moment I cannot really understand why someone wouldn't use ClickHouse for OLAP. Beyond my success experience with it in production, I'm currently testing it with data that for historical or commercial reasons was always sitting in BigQuery and the results are really fantastic. I don't think Druid can compare to it at all.",0,22739461,0,"[22741055,22740920]","",0,"","[]",0 22740920,0,"comment","didip","2020-03-31 18:40:47.000000000","Cannot compare in a sense that ClickHouse is superior?",0,22740504,0,"[]","",0,"","[]",0 19030108,0,"comment","subprotocol","2019-01-29 21:48:35.000000000","Very cool! I love geeking out on analytics tech and look forward to studying its design further. My take as I see it so far (please correct me if I'm wrong)-
* As a datapoint Pinot/Druid/Clickhouse can do 1B timeseries on one server. AresDB sounds like it's in the same ballpark here
* Pinot/Druid don't do cross table joins where AresDB can. My understanding is these are at (or near?) sub-second which would be a very distinguishing feature. I'm not sure how this will translate to when distributed mode is built out, as shuffling would become the bottleneck. Maybe there would be some partitioning strategy that within a partition allows arbitrary joining or something?
* Clickhouse can do cross table joins, but aren't going to be sub-second
* AresDB supports event-deduping. I think this can easily be handled by the upstream systems (samza, spark, flink, ..) in lambda
* Reliance on fact/dimension tables. - This design/encoding is probably to help overcome transfer from memory to GPU, which in my limited experience with Thrust was always the bottleneck. - High cardinality columns would make dimension tables grow very large and could become unmanageable (unless they are somehow trimmable?)",0,19028860,0,"[19032495]","",0,"","[]",0 19032471,0,"comment","subhajeet2107","2019-01-30 03:46:00.000000000","How does it compare to ClickHouse? Isn't creating a proprietary Query Language going to be a problem for ad-hoc queries? Why create yet another language when the industry is standardizing on SQL or a subset of SQL",0,19028860,0,"[19032866]","",0,"","[]",0 19032866,0,"comment","jamesblonde","2019-01-30 05:10:54.000000000","ClickHouse is interesting - it's also an apache project. I am surprised Uber didn't build AresDB on it. The stigma, unfortunately, of coming from Russia makes it hard for the project to gain mindshare in the Valley.
ClickHouse scan performance is through the roof, but it also seems fairly difficult to operate compared to the alternatives. For example, the concept of "deep storage" in Druid and Pinot makes rebalancing and replication trivial (at the expense of additional storage space). ClickHouse doesn't have that and requires more babysitting. And that's without even going into something like BigQuery, which is on a completely different level regarding operational simplicity if your use cases support it.
Also, if queries are heavily filtered by arbitrary dimensions, then ClickHouse starts to lose its edge compared to fully-indexed systems.
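(For reference, ClickHouse's secondary data-skipping indexes, which narrow but don't close that gap, look roughly like this - a sketch with hypothetical table and column names:)

```sql
-- A bloom_filter skip index lets ClickHouse skip granules that
-- definitely don't contain the filtered value, but it is still far
-- from the full inverted indexes of Druid/Pinot/Elastic.
ALTER TABLE events
    ADD INDEX idx_user user_id TYPE bloom_filter GRANULARITY 4
```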
This makes ClickHouse fairly niche IMO, despite being exceptional at what it does.",0,19032866,0,"[]","",0,"","[]",0 19034116,0,"comment","jgrahamc","2019-01-30 10:56:18.000000000","Yes we do. We moved our entire analytics backend to ClickHouse some time ago. It works well for us: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...",0,19034080,0,"[19040530]","",0,"","[]",0 19036139,0,"comment","devereaux","2019-01-30 16:00:40.000000000","[my numbers match yours]
It's very interesting that we made more or less the same choices: NFS on a LAN for cold storage. In practice, I haven't seen performance penalties either. I will certainly standardize on that.
If needed, 'reviving' the cold stored data is as simple as copying it to other machines-- which will also offer sharding.
My only reproach is that servers have to be identical, as clickhouse doesn't seem to be able to take into account the speed differences. My backup servers are quite different from the production server, but in an emergency I'd like to use them all.",0,19022341,0,"[]","",0,"","[]",0 15435038,0,"comment","justinsaccount","2017-10-09 16:04:58.000000000","How hard would it be to add https://clickhouse.yandex/ to the benchmarks?",0,15434272,0,"[15435182]","",0,"","[]",0 15435230,0,"comment","justinsaccount","2017-10-09 16:27:36.000000000","Yes.. that's a good starting point but the hardware used for each platform varies a lot.
Clickhouse on a single i5 compared to redshift on a 6-node ds2.8xlarge cluster
I've been throwing everything I can at clickhouse on a little VM, and the performance is amazing.",0,15435182,0,"[15436989]","",0,"","[]",0 15436989,0,"comment","IanCal","2017-10-09 20:02:43.000000000","I've recently moved to clickhouse for some analytical work, and it's been awesome. It's not perfect for what I want, and I'm sure I'm using it wrong, but it's torn through what I have to throw at it and I've not spent any time doing perf tweaks.
There's a connector for redash too, so I can make nice little dashboards from the results.
Major problem is a lack of updates to data. At least loading is fast enough for my use.",0,15435230,0,"[15438595]","",0,"","[]",0 15438595,0,"comment","justinsaccount","2017-10-10 01:05:21.000000000","You can't do updates, but you can create tables using the AggregatingMergeTree table engine.
I've been working out how to use this for tracking count/first seen/last seen for data sets. With normal sql you need to do upserts, but with clickhouse you can just create the table using countState, minState, maxState and insert:
1, now, now
for count, first, last, and when it compacts the table it'll apply the aggregations. Basically what you do with https://github.com/facebook/rocksdb/wiki/merge-operator, just with a few lines of SQL instead of a ton of custom code.",0,15436989,0,"[15440119]","",0,"","[]",0 22741055,0,"comment","beagle3","2020-03-31 18:51:22.000000000","As of 2 years ago, ClickHouse did not have as-of joins, which was a dealbreaker for me.
(None of the usual suspects do either... I eventually used pandas+dask)",0,22740504,0,"[22741714]","",0,"","[]",0 22741714,0,"comment","pachico","2020-03-31 19:50:37.000000000","Lots of features have been implemented in the meanwhile, including asof. https://clickhouse.tech/docs/en/query_language/select/#selec...",0,22741055,0,"[]","",0,"","[]",0 22741761,0,"comment","gianm","2020-03-31 19:54:06.000000000","Hey Mani. Druid committer here. It actually is a column store! The project makes a big deal about its ability to do indexes and pre-aggregation because those are important capabilities and, while not unique, are also not universally supported by every column store out there. So they are interesting differentiators. But architecturally they are really just extra icing on the cake.
Personally I see stuff like Druid, MemSQL, Clickhouse, Redshift, BigQuery, and Snowflake as technological siblings in the space. These systems are all evolving rapidly too (well, the healthy ones are anyway) so it's definitely a good time to be an analytical database enthusiast.
With regard to the operational complexity, that's an interesting point. It shows up in two main ways, I think -- the multi-process architecture and usage of external deep storage. On huge clusters, which is what Druid was designed for, the idea is that explicitly separating components in this way gives you three benefits: they don't interfere with each other (spikes in ingestion load won't interfere with ability to query historical data), you can scale each one individually, and it makes most components "disposable" (as long as your storage is reliable, the other Druid components can be blown away and recreated without losing any data). It helps when you're trying to run a big cluster in a stateless / containerized environment.
But these aspects are less good on small clusters or single servers, where it just feels like a bunch of overhead. So we're currently working on simplifying some of this for people that aren't running huge clusters.
We're also expanding SQL support rapidly. Almost every release adds additional SQL capabilities. The next release is a big one, adding JOIN and GROUPING SETS operators. The project's goal is to support it all before too long -- up next after this release will likely be analytic functions.
If you're interested in checking out the community, we do meetups pretty often (all virtual now, though, due to COVID-19). We're also planning our first user conference later in the year @ https://druidsummit.org/.",0,22740295,0,"[22743394]","",0,"","[]",0 19040530,0,"comment","bithavoc","2019-01-30 22:59:59.000000000","What Clickhouse engines does CF uses?",0,19034116,0,"[19040763]","",0,"","[]",0 19045911,0,"comment","dqminh","2019-01-31 16:23:19.000000000","We mostly use replicated MergeTree, SummingMergeTree, AggregatedMergeTree etc. And yes, we do use Kafka engine to consume directly from Kafka, in additions to standard Clickhouse inserters.",0,19040763,0,"[]","",0,"","[]",0 15448142,0,"comment","IanCal","2017-10-11 08:23:56.000000000","Thanks!
I was also playing yesterday with CollapsingMergeTree.
CREATE TABLE cmt
(
whatever Date DEFAULT '2000-01-01',
key String,
value String,
sign Int8
) ENGINE = CollapsingMergeTree(whatever, (key, value), 8192, sign)
Now you can 'delete' a row by sending the same row again but with a sign of -1:
insert into cmt (key, value, sign) values ('k1', 'v1', 1)
insert into cmt (key, value, sign) values ('k1', 'v1', -1)
insert into cmt (key, value, sign) values ('k1', 'v1 update', 1)
insert into cmt (key, value, sign) values ('k2', 'just delete this one', 1)
insert into cmt (key, value, sign) values ('k2', 'just delete this one', -1)
As far as I can tell, you have to either add FINAL to the query or optimise the table for this to work, or hope it's been done in the background. I thought you had to use the sign in the query otherwise it wouldn't work, but creating this example this morning works fine. If you're not getting the response you expect after optimising, try adding the sign column to the query.
:) select key, value, sign from cmt
SELECT
key,
value,
sign
FROM cmt
┌─key─┬─value─────┬─sign─┐
│ k1 │ v1 update │ 1 │
└─────┴───────────┴──────┘
1 rows in set. Elapsed: 0.002 sec.
:)
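The collapsing rule itself can be sketched in plain Python (illustrative only, not how ClickHouse actually implements it):

```python
from collections import defaultdict

def collapse(rows):
    # Sum signs per (key, value) pair; a +1 cancelled by a -1 nets
    # to zero and the pair disappears, mimicking what a merge (or
    # a query with FINAL) does with the sign column.
    net = defaultdict(int)
    for key, value, sign in rows:
        net[(key, value)] += sign
    return [pair for pair, s in net.items() if s > 0]

rows = [
    ('k1', 'v1', 1),
    ('k1', 'v1', -1),
    ('k1', 'v1 update', 1),
    ('k2', 'just delete this one', 1),
    ('k2', 'just delete this one', -1),
]
print(collapse(rows))  # [('k1', 'v1 update')]
```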
[disclaimer - I know very little about how to make high performance things in clickhouse, tbh the Log format has been easily fast enough for my data so far]",0,15442461,0,"[]","",0,"","[]",0
22749700,0,"comment","hwwc","2020-04-01 15:31:56.000000000","Location: Boston, US
Remote: Yes
Willing to relocate: No
Technologies: Rust, Python/Pandas, Node/JS, Clickhouse, Postgres, GCP/AWS, Linux
Resume: https://github.com/hwchen , https://www.linkedin.com/in/walther-chen-5b87a512/
Email: hello@hwc.io
ABOUT:
I've most recently worked in a data-analytics backend-stack: from ETL to database design to web-api to devops. One of my major projects is an analytics engine for web applications (https://github.com/hwchen/tesseract) using Rust and Clickhouse.
However, I'm naturally curious and happy to work in any domain which requires high performance and maintainable code. I've worked with a distributed worker system, debugged async database drivers, and implemented text layout primitives.",0,22749306,0,"[]","",0,"","[]",0 22750128,0,"comment","hwwc","2020-04-01 16:02:26.000000000","SEEKING WORK | Backend Services; Data Engineering; Systems Engineering
Location: Boston, US | Remote: Yes
I'm an experienced software engineer looking for part-time and short-term contracts.
I've most recently worked in the data-analytics backend-stack: from ETL to database design to web-api to devops. One of my major projects is an analytics engine for web applications using Rust and Clickhouse (https://github.com/hwchen/tesseract).
However, I'm naturally curious and happy to work in any domain which requires high performance and maintainable code. I've worked with a distributed worker system, debugged async database drivers, and implemented text layout primitives.
Primary Skills: Rust, Python, Linux, Google Compute Platform, ClickhouseDB, Postgresql
Production Experience with: Python/Pandas, Node/JS, AWS, Docker, Redis, MySql, Nginx
Github: https://github.com/hwchen
Contact: hello@hwc.io",0,22749307,0,"[]","",0,"","[]",0 25872095,0,"comment","timgl","2021-01-22 14:54:01.000000000","We do have a completely FOSS version here: https://github.com/posthog/posthog-foss :). The only difference is support for Clickhouse and some advanced permissioning stuff.",0,25872053,0,"[]","",0,"","[]",0 25873409,0,"story","hodgesrm","2021-01-22 16:58:58.000000000","",0,0,0,"[]","https://altinity.com/blog/clickhouse-is-apache-2-0",17,"ClickHouse Is Apache 2.0","[]",0 19055461,0,"comment","buro9","2019-02-01 16:24:59.000000000","At Cloudflare a year ago we said we were doing 6m per second ingest rate on a cluster of 106 brokers with x3 replication factor, 106 partitions: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...
The more interesting blog post IMHO is this one on Kafka compression https://blog.cloudflare.com/squeezing-the-firehose/",0,19055022,0,"[19056136]","",0,"","[]",0 19056444,0,"comment","sin7","2019-02-01 17:50:26.000000000","Location: Denver, CO
Remote: No
Willing to relocate: No
Technologies: R, PostgreSQL, ClickHouse, Data Analysis
Résumé/CV: Email if needed
Email: luisd303 at gmail.com
6 years in data analysis.",0,19055164,0,"[]","",0,"","[]",0 15458440,0,"comment","lima","2017-10-12 14:58:33.000000000","Have a look at ClickHouse for OLAP workloads:
It's a recently open-sourced database from Yandex. It powers their web analytics backend. CloudFlare is already using it for their DNS analytics.
See: https://clickhouse.yandex/presentations/data_at_scale/
Their docs also talk about it a lot.",0,15458479,0,"[]","",0,"","[]",0 22770105,0,"comment","kthejoker2","2020-04-03 15:21:16.000000000","You want an OLAP database, something like Clickhouse.",0,22769438,0,"[]","",0,"","[]",0 25901447,0,"story","yamrzou","2021-01-25 09:59:57.000000000","",0,0,0,"[]","https://altinity.com/blog/clickhouse-is-apache-2-0",5,"Clickhouse Is Apache 2.0","[]",0 22794568,0,"comment","seektable","2020-04-06 15:28:57.000000000","Yandex ClickHouse one more free/open source solution that can be considered for this purpose. Setup/maintenance may be a bit tricky - but you'll be surprised how fast CH executes your ad-hoc queries (SQL!) over billions of rows.",0,22786660,0,"[22797702]","",0,"","[]",0 22796087,0,"comment","sman393","2020-04-06 17:57:55.000000000","Depending on what your reporting needs are I would definitely recommend TimescaleDB. I love Prometheus for storing server resource metrics or basic app metrics (counts/timers). For more business-facing data, having the query functionality and visualization tooling of Postgres is a major win. I am working on a similar use-case to you, we considered Cassandra, Timescale, and Clickhouse. With our insert rates and reporting needs, We decided to start working with TimescaleDB.",0,22786660,0,"[]","",0,"","[]",0 22797702,0,"comment","Kkoala","2020-04-06 20:49:44.000000000","Interesting, haven't heard of ClickHouse before. I will check it out, thanks.",0,22794568,0,"[]","",0,"","[]",0 11908254,0,"story","mechmind","2016-06-15 10:06:17.000000000","",0,0,0,"[11909231,11912302,11909666,11908660,11908886,11908687,11909349,11909556,11909576,11908543,11908980,11968533,11909138,11911836,11910332,11911758]","https://clickhouse.yandex/reference_en.html",243,"ClickHouse – high-performance open-source distributed column-oriented DBMS","[]",70 11908660,0,"comment","waynenilsen","2016-06-15 11:47:14.000000000","I found this [1] much more informative in terms of what this is good for.
[1] https://clickhouse.yandex/reference_en.html",0,11908254,0,"[11910533,11909049]","",0,"","[]",0 11908910,0,"comment","tyingq","2016-06-15 12:41:06.000000000","This page seems to cover most of the prerequisites, and some other advice: https://github.com/yandex/ClickHouse/blob/master/doc/build.m...
And it does say this: "With appropriate changes, build should work on any other Linux distribution...Only x86_64 with SSE 4.2 is supported"",0,11908886,0,"[11908992]","",0,"","[]",0 11908992,0,"comment","threeseed","2016-06-15 12:53:52.000000000","Thanks so much for that. It definitely looks like the documentation might be out of sync as there are also some references to ODBC support in the code:
https://github.com/yandex/ClickHouse/blob/32057cf2afa965033c...
Going to look into building a Spark data source for this so we can see how well it compares to databases like Cassandra.",0,11908910,0,"[]","",0,"","[]",0 11909138,0,"comment","StreamBright","2016-06-15 13:24:29.000000000","Without supporting other operating systems it is hard to consider it as an alternative to anything. We have several clusters using another operating system than the one supported by ClickHouse. Unfortunately few customers are going to invest in a different platform to try out something new like this. The lower the entry bar for new tech, the better.",0,11908254,0,"[11911813,11909563,11913060]","",0,"","[]",0 11909231,0,"comment","buremba","2016-06-15 13:39:25.000000000","This is huge. It seems to me that it's similar to BigQuery but has many other features that I didn't see in other databases.
AggregatingMergeTree in particular allows incremental aggregation, which is a huge gain for analytics services.
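The incremental-aggregation idea can be sketched in Python: keep small partial states per key and merge states from separate batches, instead of re-scanning raw rows. This is only the concept behind AggregatingMergeTree, not ClickHouse's implementation, and all names below are invented:

```python
# Partial aggregate state per key: (count, sum). Averages etc. can
# be derived from the state at query time.
def update(states, key, value):
    count, total = states.get(key, (0, 0))
    states[key] = (count + 1, total + value)

# Merging two states is associative, so batches can be aggregated
# independently and combined later -- the core trick.
def merge(a, b):
    out = dict(a)
    for key, (count, total) in b.items():
        c, t = out.get(key, (0, 0))
        out[key] = (c + count, t + total)
    return out

batch1, batch2 = {}, {}
for key, value in [('page_a', 3), ('page_a', 5)]:
    update(batch1, key, value)
for key, value in [('page_a', 2), ('page_b', 7)]:
    update(batch2, key, value)

merged = merge(batch1, batch2)
print(merged)  # {'page_a': (3, 10), 'page_b': (1, 7)}
```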
Also, it provides many table engines for different use cases. You don't even need a commit log such as Apache Kafka in front of ClickHouse: you can just push the data to a TinyLog table and move it in micro-batches to a more efficient column-oriented table that uses a different engine.",0,11908254,0,"[]","",0,"","[]",0 11909567,0,"comment","RMarcus","2016-06-15 14:33:35.000000000","Well, they have benchmarks! Against Vertica! Which is surprising, because I imagine Vertica's EULA forbids that.
https://clickhouse.yandex/benchmark.html#["1000000000",["Cli...]",0,11909484,0,"[11910938,11909699]","",0,"","[]",0 11910533,0,"comment","dang","2016-06-15 17:07:20.000000000","Ok, we changed the URL to that from https://clickhouse.yandex/.",0,11908660,0,"[]","",0,"","[]",0 25925326,0,"comment","joshxyz","2021-01-27 07:29:46.000000000","Clickhouse. That analytics db is a fucking beast.",0,25917198,0,"[]","",0,"","[]",0 22806852,0,"comment","jmakov","2020-04-07 19:55:17.000000000","So how does it compare to Clickhouse?",0,22803504,0,"[22806977]","",0,"","[]",0 22806977,0,"comment","bluestreak","2020-04-07 20:06:48.000000000","We have not benched against clickhouse yet. Sorry. Sounds like this should be an interesting things to do!",0,22806852,0,"[22810725]","",0,"","[]",0 22809164,0,"comment","polskibus","2020-04-08 01:08:34.000000000","It would be great if you include clickhouse in your benchmark. It also boasts heavy SIMD use and is free + open source.",0,22803504,0,"[]","",0,"","[]",0 22810725,0,"comment","missosoup","2020-04-08 07:19:00.000000000","The broader question though is how does your offering compare to Clickhouse?",0,22806977,0,"[22811515]","",0,"","[]",0 22811377,0,"comment","numlock86","2020-04-08 09:28:10.000000000","Serious but bold question: What are the benefits versus Clickhouse for example? Why should I use QuestDB?
Amazing work either way. The space of databases can never have too much competition.",0,22803530,0,"[22811869]","",0,"","[]",0 22811515,0,"comment","j1897","2020-04-08 09:54:07.000000000","co-founder of questdb here - we have been asked the same question on reddit as well. We are starting to work on an article going through a comparison between QuestDB and Clickhouse today - this will also include a bench. Will share as soon as we can. stay tuned!",0,22810725,0,"[]","",0,"","[]",0 22811869,0,"comment","bluestreak","2020-04-08 11:19:21.000000000","It is hard for me to say right now as we did not benchmark against Clickhouse yet, this is clearly the most requested comparison. We will come back on this!",0,22811377,0,"[]","",0,"","[]",0 11911813,0,"comment","pnathan","2016-06-15 20:11:14.000000000","> ClickHouse manages extremely large volumes of data in a stable and sustainable manner. It currently powers Yandex.Metrica, world’s second largest web analytics platform, with over 13 trillion database records and over 20 billion events a day, generating customized reports on-the-fly, directly from non-aggregated data. This system was successfully implemented at CERN’s LHCb experiment to store and process metadata on 10bn events with over 1000 attributes per event registered in 2011.
Might want to rethink that consideration. :-)",0,11909138,0,"[11915104]","",0,"","[]",0 11915104,0,"comment","StreamBright","2016-06-16 10:05:51.000000000","I have systems with 100 trillion records and 100 billion events / day using a technology that runs on CentOS, Ubuntu, FreeBSD, Windows Server and Solaris. What am I doing wrong? Why should I vet a technology that does not run on the platform that our company has chosen, it is approved by security and compliant with all the corporate requirements while the only platform ClickHouse is using is not?",0,11911813,0,"[11917805,11918987]","",0,"","[]",0 11918970,0,"comment","flr_null","2016-06-16 21:20:58.000000000","No, it's actually not.
Cassandra is not a columnar DB. That means if you need all user_id values from your table, Cassandra will scan all your data (e.g. 10PB) on disk, but ClickHouse will only scan 1 column file (e.g. 10GB).
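The difference is easy to put in rough numbers (the sizes below are assumptions for illustration, not measurements from any real system):

```python
# Back-of-envelope: a row store must read whole rows to extract one
# column, while a column store reads just that column's file.
n_rows = 1_000_000_000        # 1 billion rows
row_bytes = 1_000             # assumed average full-row size
user_id_bytes = 8             # assumed fixed-width user_id column

row_store_scan = n_rows * row_bytes          # ~1 TB read
column_store_scan = n_rows * user_id_bytes   # ~8 GB read
print(row_store_scan // column_store_scan)   # 125
```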
ClickHouse is Russian Kudu.",0,11911758,0,"[]","",0,"","[]",0 25928020,0,"comment","zX41ZdbW","2021-01-27 13:34:06.000000000","Using LZ4 can easily improve performance even if all data reside in memory.
This is the case in ClickHouse: if data is compressed, we decompress it in blocks that fit in CPU cache and then perform data processing inside the cache; if data is uncompressed, a larger amount of data has to be read from memory.
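The bandwidth side of this can be sketched with rough arithmetic (the per-core and memory figures are typical values; the 3x compression ratio is an assumption):

```python
# Illustrative arithmetic only; figures are typical values and the
# compression ratio is an assumption, not a measurement.
cores = 128
lz4_per_core = 3.0      # GB/s of decompressed output per core
memcpy_per_core = 12.0  # GB/s per core
mem_bw = 150.0          # GB/s total memory bandwidth
ratio = 3.0             # assumed compression ratio

# memcpy must read every uncompressed byte, so it is memory-bound.
memcpy_rate = min(cores * memcpy_per_core, mem_bw)    # 150.0

# LZ4 reads only compressed bytes, so memory can feed ratio x more
# decompressed output; here the CPU becomes the limit instead.
lz4_rate = min(cores * lz4_per_core, mem_bw * ratio)  # 384.0

print(memcpy_rate, lz4_rate)
```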
Strictly speaking, LZ4 decompression (typically 3 GB/sec) is slower than memcpy (typically 12 GB/sec). But when using e.g. 128 CPU cores, LZ4 decompression scales up to memory bandwidth (typically 150 GB/sec) just as memcpy does. And memcpy wastes more memory bandwidth by reading uncompressed data, while LZ4 decompression reads compressed data.",0,25926758,0,"[25935389]","",0,"","[]",0 22815550,0,"comment","nemothekid","2020-04-08 18:24:29.000000000","My rule of thumb has become:
If you know all your access patterns and your writes >>> reads, a NoSQL solution will be cheaper to operate than Postgres. Meaning, I believe, for most deployments you can get the same amount of performance from Postgres, but simply at a higher cost (which may be 3-6x at most). Another reason to go with NoSQL is if you are latency sensitive, although I don't think Dynamo falls in this bucket.
NoSQL was also really good for OLAP, but I think now there are several really good OLAP solutions (like Clickhouse for OSS and Redshift/BigQuery in the cloud) that are easier to manage.",0,22815383,0,"[]","",0,"","[]",0 11920763,0,"story","velmu","2016-06-17 04:57:34.000000000","",0,0,0,"[]","https://clickhouse.yandex/",1,"ClickHouse is an open-source column-oriented database management system","[]",0 25941486,0,"comment","joshxyz","2021-01-28 13:29:39.000000000","My cheat sheet is: For transactions, postgresql. For analytics, clickhouse. For cache, redis. For search, elasticsearch. For objects, anything s3 compatible.
Anything else, I really have to do in-depth research and comparisons because 1.) each database has its own pros and cons, 2.) these pros and cons tend to overlap between DBs and each has its own gotchas, and 3.) they also change over time, so problems brought up earlier might already be addressed in the latest releases.
CMU Database Group's YT channel is also worth checking out for introductions on some databases.
https://m.youtube.com/channel/UCHnBsf2rH-K7pn09rb3qvkA",0,25911870,0,"[]","",0,"","[]",0 22827507,0,"comment","bckygldstn","2020-04-09 21:42:17.000000000","Analytics workloads are often a good fit for columnar databases. Popular examples are Redshift (AWS), Vertica (enterprise), and Clickhouse (open source).
Columnar databases are awesome: for the right kinds of task, the speedup can be multiple orders of magnitude. Columnar databases excel at filtering and aggregating a subset of columns, storing sparse or slowly-changing data, and timeseries operations.
Of course there are always tradeoffs. Columnar databases tend to suck at reading individual rows, reading large numbers of columns, and heavy writes.
Another option is of course to put a caching layer between the dashboard and ElasticSearch, and precompute common queries e.g. daily.
Feel free to message me if you want to chat, email in profile.",0,22824040,0,"[]","",0,"","[]",0 22829797,0,"comment","snikolaev","2020-04-10 03:42:09.000000000","Do you need full-text search? I think the answer would depend on that, as there are only a few open source technologies that do full-text search: ElasticSearch, SOLR, Manticore Search, a couple of others, and LOTS of others that don't, but are much better at pure analytics. Clickhouse should be a good fit then.",0,22824040,0,"[]","",0,"","[]",0 22830352,0,"comment","hodgesrm","2020-04-10 05:51:00.000000000","ClickHouse is a common replacement for ElasticSearch in cases where data consists of structured records. ContentSquare publicly reported an 11x decrease in cost and 10x faster 99th percentile queries after migration. [1] Others have seen similar results.
[1] https://github.com/ClickHouse/clickhouse-presentations/blob/...
Disclaimer: some of the "others" are customers of my employer Altinity, which offers support for ClickHouse.",0,22824040,0,"[]","",0,"","[]",0 11931163,0,"comment","flr_null","2016-06-18 23:53:27.000000000","Of course I'm asking in the context of massively parallel realtime queries from customers, as in Yandex Metrica or Google Analytics. And yes, 100T lines with Hive+ORC is not an issue, but that's another league.
I assume that only Kudu is a competitor for ClickHouse now. Or maybe Greenplum.
Thank you for your answer.",0,11930474,0,"[]","",0,"","[]",0 22833988,0,"comment","andrea_s","2020-04-10 15:27:09.000000000","Yandex ClickHouse also should be on the list!",0,22833299,0,"[]","",0,"","[]",0 25952898,0,"comment","barrkel","2021-01-29 02:27:16.000000000","MySQL was the chosen store, chosen primarily for its (async) replication story which was better than Postgres back in 2012. Though I'm not sure it wouldn't still be a better choice today with the update load, I've had problems in personal toy (~20 million row) projects with slow bulk update speed in Postgres.
The last thing I did before I left was build out a mechanism for mirroring a denormalized copy of recent data in ClickHouse, partitioned by time (for matched) and type (for partials / unmatched). CH, being analytics oriented, works much better for the ad-hoc end user queries, which filter and sort on a handful of arbitrary columns - ideal for a columnar store. Interactive updates can be handled by CH's versioned merge tree table engine, and batch updates by rewriting the relevant partition from the source of truth in MySQL.
I chose CH primarily because it scaled down well for a columnar store - much better than anything in the Hadoop space. The conservatism of the market meant you couldn't just throw random cloud provider data mechanisms at it, nor a whole lot of fancy big stuff.",0,25952550,0,"[]","",0,"","[]",0 22869770,0,"comment","polskibus","2020-04-14 18:41:10.000000000","Why choose Druid over clickhouse?",0,22868286,0,"[22870117,22871472,22870523,22871090,22870189]","",0,"","[]",0 22870117,0,"comment","manigandham","2020-04-14 19:06:50.000000000","Most people shouldn't. Clickhouse or other column-store data warehouses (redshift, bigquery, etc) are very fast and have all the features to handle time-series and other data.
Druid is good if you (1) make use of native integrations like Kafka (2) need every field indexed for fast seeking to a few rows (3) can use the JSON API to make up for the still-in-beta SQL interface (4) don't need every single event/row as they are pre-aggregated (5) always have a time-column to partition by (6) want to use the native S3 tiering for older data (7) don't need joins and complex analysis
Imply's distribution is better than core Druid, but it's still more operationally complex than Clickhouse and alternatives.",0,22869770,0,"[22873569,22870405,22874672]","",0,"","[]",0 22870405,0,"comment","leetbulb","2020-04-14 19:30:26.000000000","1) Most people handling large data streams are already using Kafka or similar. If you aren't, Druid has pretty wide support for event ingress, including HTTP.
2) If you're looking to provide this type of analytics in the first place, you probably do want this. Being able to execute extremely fast, granular, ad-hoc queries over highly dimensional data is very powerful. I designed a reporting frontend that really took advantage of this, and I always felt guilty when people complimented me on how fast it drilled down into 10+ dimensions.
3) There are plenty of mature libraries for the Druid API.
4) This is almost inherent with any OLAP system. Although, even with high-cardinality data, Druid performs extremely well. Either way, you should have a backing data warehouse and offline, batch jobs if you need to perform BI analytics on row-by-row / non-aggregated data. Remember, OLAP sits on top of a data warehouse.
5) Yes, but technically you just need a key that increments.
6) Not sure if this is an argument against Druid. I've found the S3 deep storage / tiering to be very efficient and powerful, especially because you can create and push segments directly to storage and not run it through an indexer. S3 is also just an object store protocol specification now. There are lots of people who run S3-compatible object stores in-house. HDFS is also natively supported and another widely used storage backend in this space. Also, there are plenty of community extensions for other object stores.
7) Again, OLAP. Data flowing into your OLAP layer should already be denormalized and ready for indexing. Also, you can join data into your existing indices with Hadoop, etc. Druid supports joins and lookups, although I've never used them. ClickHouse and other similar systems also don't do very well with joins. Maybe we have different definitions of "complex analysis," but in my experience, you can do some pretty crazy stuff with queries, including writing your own JS functions, and if you're really dedicated, you can write your own extensions.
One thing that I feel like a lot of people miss is that Druid is specifically an OLAP layer designed for large scale, and I mean beyond Netflix size scale (they use Druid). Every individual component of Druid is designed to scale out independently and play nicely with your existing infrastructure. Similar to ES, you have nodes that have specific roles such as event ingress / real-time compute, query handlers, query brokers, historical data / cache / compute, etc. Then you also have a bunch of supporting architecture provided for you for (re)indexing data, coordinating the entire cluster, etc. Druid is huge, not an AIO (OLAP, OLTP, DWH, etc) analytics solution, and it takes more than one person to run a larger cluster, even though I did it for a few years.",0,22870117,0,"[22870913]","",0,"","[]",0 22870523,0,"comment","polote","2020-04-14 19:40:51.000000000","It is crazy, every post which talks about Druid, there is someone asking if clickhouse is better",0,22869770,0,"[22870993]","",0,"","[]",0 22870913,0,"comment","manigandham","2020-04-14 20:16:34.000000000","Relational databases with joins and full SQL support are still unmatched in flexibility, and functionality like materialized views, aggregation pipelines (and table engines for Clickhouse) allows you to do everything that Druid does with aggregated summaries while still having all the other querying abilities.
Druid has a slight edge in data scale-out and indexed seeks, but modern data warehouses are adding similar tiering features, along with field indexing, full-text search, nested data records, and even in-memory rowstores for OLTP support.
They're all converging on the same feature set and eventually Druid will just become another data warehouse option, although I'd still recommend Clickhouse or MemSQL at that point.",0,22870405,0,"[22872947]","",0,"","[]",0 22870993,0,"comment","polskibus","2020-04-14 20:22:47.000000000","The in-depth pros and cons are one of the most valuable bits of knowledge I find on HN. Everyone including Chief Databaseologist Andy Pavlo recommends starting with Clickhouse for your OLAP needs (in https://www.youtube.com/watch?v=dPMc7FZ3Gqo&list=PLSE8ODhjZX...).
When this is the case, Druid needs strong arguments to secure mindshare in the future. It is crazy to write a comparison of Druid and something else these days and not mention Clickhouse at all.",0,22870523,0,"[22871265]","",0,"","[]",0 22871265,0,"comment","polote","2020-04-14 20:50:14.000000000","I'm not saying you shouldn't; in my previous company they went all-in on clickhouse and it just changed their lives.
But it feels like ads when you have the same person every time telling you the same thing",0,22870993,0,"[22873418]","",0,"","[]",0 22871472,0,"comment","DevKoala","2020-04-14 21:08:43.000000000","I have gone from Druid to Clickhouse. At this point, I don't really know anymore.",0,22869770,0,"[22872422]","",0,"","[]",0 19161612,0,"story","nauhygon","2019-02-14 13:19:48.000000000","",0,0,0,"[19164959]","https://www.percona.com/blog/2017/03/17/column-store-database-benchmarks-mariadb-columnstore-vs-clickhouse-vs-apache-spark/",6,"Benchmarks: MariaDB ColumnStore vs. Clickhouse vs. Apache Spark","[]",1 22872422,0,"comment","code_biologist","2020-04-14 22:48:47.000000000","Can you speak briefly to the change? What motivated it? How's it been working out?
For context, I'm an engineer that uses Postgres heavily, with some basic BigQuery + Redshift. I don't understand what the benefits of something like Clickhouse are over a standard warehouse.",0,22871472,0,"[22873480]","",0,"","[]",0 22873089,0,"comment","yumraj","2020-04-15 00:27:51.000000000","Just curious what are people using Druid, Clickhouse, Pinot... for?
Would love to learn about the use-cases where I should perhaps consider such technologies.",0,22868286,0,"[22873730]","",0,"","[]",0 22873480,0,"comment","DevKoala","2020-04-15 01:23:25.000000000","Druid and Clickhouse are OLAP systems. The key here is “online”, as in they are optimized to answer queries in a matter of milliseconds. Maybe things have improved, but RedShift is not really an OLAP system. The last time I used RedShift was back in 2015, and it took tens of seconds or even minutes to answer the queries that Druid could answer in just a few seconds; the Druid cluster was a third of the cost too, and AWS was charging us too much. I love BigQuery, but I look at it more as a background processor for queries over huge datasets. With BigQuery you have some real limitations on the number of concurrent queries. Even if BQ was able to answer in seconds, your users would be reaching your concurrent query limit all the time.
In RedShift you can perform joins, but in Druid and Clickhouse you cannot; your data has to be denormalized. There are quasi-joins, but they are not the same. If you are going to swap Redshift with Druid or Clickhouse, then you need to denormalize the data or design around the specific quasi-join concepts of these databases that abstract dictionary lookups.
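A hedged sketch of what denormalizing means in practice: resolve dimension lookups at ingest time so the OLAP store never has to join at query time. All table and field names below are invented:

```python
# Hypothetical dimension "table": user_id -> attributes.
users = {42: {'country': 'US', 'plan': 'pro'}}

def denormalize(event, users):
    # Flatten the event by folding in the user's attributes, so a
    # wide, join-free row can be loaded into the OLAP store.
    flat = dict(event)
    flat.update(users.get(event['user_id'], {}))
    return flat

event = {'user_id': 42, 'ts': '2020-04-15', 'clicks': 3}
print(denormalize(event, users))
```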
Druid is operations-heavy compared to Clickhouse, while RedShift is just a few button presses. I started using Druid in 2014, and back then to set up a cluster you needed a decent understanding of orchestration. There are plenty of “minimal ops” solutions for Druid nowadays, but IMHO all they are doing is abstracting the problem and not really solving it. To run a Druid cluster you need 5 or 6 node types, and arriving at the proper configuration of resources for your data needs will take you a lot of time. Clickhouse requires a single node type.
Ingesting data is easy in RedShift, but you will have to work a bit more with Druid and Clickhouse. Two of the Druid node types are dedicated to ingestion, so your data pipelines should allocate these dynamically. Loading data into Clickhouse is as simple as asking it to ingest a file, but since it is big data, you need a dynamic place to stage such a file, right? All I am saying is that I wish there was an easier way to pick up the data from private S3 buckets.
Performance is a wash between Druid and Clickhouse for me. I never did a proper head-to-head between the two. However, the product I built on top of Clickhouse is a next-gen version of the product I built on top of Druid, and in my experience Clickhouse achieves the same performance for a third of the cost. However, you must account for the fact that cloud computing has gotten cheaper over the years, and my numbers for Druid come from 2017, which is the last time I approved a bill.
I forgot to mention PG. I have put small analytic datasets, 80 million rows or so, in Postgres with good results, but even then the response time was never less than a few seconds.",0,22872422,0,"[22877005,22875247,22874679]","",0,"","[]",0 22873569,0,"comment","DevKoala","2020-04-15 01:37:37.000000000","You are right about Clickhouse, but other data warehouses are not optimized for the same use case as Druid and Clickhouse: OLAP.
For example, RedShift and BigQuery cannot be put behind a user-facing backend. BigQuery has a default limit of 50 concurrent queries; if that's your user limit, perfect. RedShift takes seconds for queries that Druid and Clickhouse can answer in milliseconds.",0,22870117,0,"[22874708,22874997]","",0,"","[]",0 22873730,0,"comment","hodgesrm","2020-04-15 02:10:14.000000000","Some common ClickHouse use cases and verticals are shown below. Basically any case where you have relatively structured data, very large fact tables, and a need for low-latency response. Names of companies that have given public talks are shown.
* Analyzing netflow logs (network management, numerous users)
* Content delivery (Mux.com)
* Web analytics (ContentSquare, CloudFlare)
* Generating optimized parameters for real time ad bidding (various, not too many recent talks on this)
* Log management and system observability (surprisingly common across industries: Sentry, Cloudflare, Uber)
* Valuing assets from market data, e.g. tick data (financial services, not many companies will discuss publicly)
Druid is used for some of these. I don't know that much about Pinot, hence cannot comment there.",0,22873089,0,"[]","",0,"","[]",0 22874672,0,"comment","shaklee3","2020-04-15 04:59:15.000000000","Clickhouse has had native Kafka ingest support for over a year.",0,22870117,0,"[]","",0,"","[]",0 22874679,0,"comment","shaklee3","2020-04-15 05:01:00.000000000","In our case clickhouse was significantly faster at insertions and querying compared to druid. Which metrics did you test?",0,22873480,0,"[22874781]","",0,"","[]",0 22874692,0,"comment","shaklee3","2020-04-15 05:02:33.000000000","To be fair, 10k rows per second is extremely slow. Clickhouse can do millions per second on a single server in our tests.",0,22869700,0,"[]","",0,"","[]",0 22874708,0,"comment","manigandham","2020-04-15 05:05:25.000000000","All data warehouses are designed for OLAP, that's their purpose. It doesn't require low latency though.
Redshift is an always-running cluster of scale-out distributed Postgres forked by AWS, so it can and does return results in milliseconds, very similar to Clickhouse, although still not as advanced in performance techniques.
Bigquery is a completely managed model that uses a far greater scale-out architecture designed for throughput (petabytes in seconds) rather than latency, although it has real-time streaming, BI Engine (memory cache) and materialized views so you can get pretty close today.
Snowflake is another option that runs on top of major clouds using instances to access object storage and also has low latency when your cluster is running.",0,22873569,0,"[22875421]","",0,"","[]",0 22874781,0,"comment","DevKoala","2020-04-15 05:20:07.000000000","Filtering by partial string match on dimensions. It is equally slow on both of them.
For straightforward filters and aggregations, yeah, Clickhouse is faster. However I read a PR by the Druid team where they added a vector engine. If they release that, then the performance gap should be smaller or maybe in Druid’s favor depending on the dataset.
You are right about insertions up to a point.
If you consider data loads as insertions, with Druid I could scale my cluster elastically to speed them up. With Clickhouse I am bound by the query nodes.
Also, Druid can ingest data in real time using a special type of node that wasn’t part of the original distribution. I haven’t done real time data ingestion on Clickhouse. Hourly updates are good enough for my use case.",0,22874679,0,"[]","",0,"","[]",0 22875247,0,"comment","manigandham","2020-04-15 06:49:43.000000000","Online refers to interactive queries that work and return results after submitting them, in contrast to offline or batch queries. There's no hard latency requirement and standard TPC-H data warehouse benchmark queries can take hours to run.
The main advantage of Druid is that it pre-aggregates and indexes everything, resulting in much smaller summarized data with fast seeks to specific rows. Materialized views, stored procedures, and ETL processes can provide the same result in relational data warehouses, and come with much better SQL support as you mentioned.
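The pre-aggregation idea is easy to sketch: raw events collapse into a far smaller summary keyed by the dimensions, and queries then seek into that summary instead of scanning raw rows. A toy Python illustration (field names and figures invented) of what Druid's ingest-time rollup, or an equivalent materialized view, effectively maintains:

```python
from collections import defaultdict

# Raw events: one row per hit (invented sample data).
raw_events = [
    {"hour": "2020-04-15T05", "page": "/home", "ms": 120},
    {"hour": "2020-04-15T05", "page": "/home", "ms": 80},
    {"hour": "2020-04-15T05", "page": "/about", "ms": 40},
]

# Rollup: collapse to one row per (hour, page) with count and sum.
summary = defaultdict(lambda: {"count": 0, "ms_sum": 0})
for e in raw_events:
    key = (e["hour"], e["page"])
    summary[key]["count"] += 1
    summary[key]["ms_sum"] += e["ms"]

# "Average latency per page this hour" now reads the small summary,
# not the raw rows.
row = summary[("2020-04-15T05", "/home")]
assert row["count"] == 2 and row["ms_sum"] == 200
```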
You should try Redshift again, as it's come a long way since the early days. Also, I highly recommend MemSQL, which is faster than Clickhouse, has a much easier deployment and more usability, and can serve OLTP workloads in the same database.
The Core Automation Services (CAS) team at Tesla is building applications to enable manufacturing, with an eye towards reliability, availability, scalability, speed and security. We're a diverse team composed of Controls Automation Engineers, Software Engineers, and various other disciplines that help facilitate automated manufacturing processes. As an SRE on the CAS team you'll be working with the infrastructure, systems and applications that act as the middleware layer between Programmable Logic Controllers (PLCs) and the outside world, such as Databases, MES systems and other services.
Location: Fremont, CA
Responsibilities:
* Support interim HMI/SCADA vendor application (Ignition from Inductive Automation)
* Building tooling around it, evaluating its usage, and helping to ensure its reliability, availability and security
* Design software and systems that enable automated manufacturing at Tesla
* Assist Software, Controls, Manufacturing and other types of Engineers with onboarding and integrating services into the Tesla technology stack
* Ensuring best practices and observability of the service, such as metrics, logging, tracing, and alerting
* Automate configuration and deployment of services
* Consult on and design infrastructure, systems and application architecture
Apply at: https://www.tesla.com/careers/search/job/site-reliability-en...
=======================
Tesla | Database Site Reliability Engineer, Manufacturing Systems | Fremont, CA
As a Database SRE on the CAS team you'll be setting up and managing the databases, including MySQL, CockroachDB, FoundationDB, Clickhouse, and InfluxDB that back various software and systems that enable manufacturing in our various factories.
Location: Fremont, CA
Responsibilities:
* Evaluate current database deployments and make recommendations for how to improve their reliability, availability, scalability and security
* Design and implement automation for managing the deployment and upgrades of the databases
* Define Disaster Recovery and Business Continuity plans for the various database deployments
* Assist Software, Controls, Manufacturing and other types of engineers with using databases sustainably
* Ensuring best practices and observability of the databases, such as metrics, logging, tracing, and alerting
* Consult on and design infrastructure, systems and application architecture
Requirements:
* Experience with running databases on bare-metal or VMs
* Expert skills in Linux and its administration
* Experience in a high level language such as Go, Python and/or Java
* Understand the concepts of Observability and Infrastructure as Code
* Comfortable on an on-call rotation
* Comfortable doing live troubleshooting of issues on NOC bridges/outage calls
* Habitual documenter and spreader of knowledge
* Willing to mentor other team members and engineers with less database knowledge
* Strong bias for action vs endless planning, willing to get hands dirty and make mistakes sometimes
* 3+ years as DBA/SRE
Apply at: https://www.tesla.com/careers/search/job/database-site-relia...",0,25989764,0,"[]","",0,"","[]",0
22886664,0,"comment","manigandham","2020-04-16 06:19:21.000000000","HN is a great resource with plenty of important news posted, along with the HighScalability blog [1] and the SoftwareEngineeringDaily podcast [2].
1. http://highscalability.com/
2. https://softwareengineeringdaily.com/
Other than that, reading the blogs of the various vendors is how I keep up (using Feedly with RSS). The modern projects are Redshift, BigQuery, Snowflake, Azure DW, MemSQL, Clickhouse, YellowBrick with older projects being Vertica, Teradata, Greenplum. It's also useful to follow the "new" distributed SQL projects like CockroachDB, Citus, TiDB, Vitess, Yugabyte.",0,22882131,0,"[22889573]","",0,"","[]",0 19177700,0,"story","gerenuk","2019-02-16 08:21:58.000000000","Hey everyone,
Looking for suggestions and recommendations: what would be your ideal choice if you were about to build a backlinks database and later run a PageRank algorithm on it? Our use case is to improve search relevancy based on URL rank.
Considered Storage DBs:
1. Clickhouse
2. ElasticSearch (the team is experienced with it, but finding duplicate URLs etc. over a huge dataset is still clumsy in ElasticSearch)
So we thought of going with Clickhouse.
Thanks.",0,0,0,"[]","",2,"Ask HN: Storage DB for backlinks?","[]",0 15590888,0,"comment","est","2017-10-31 03:21:11.000000000","This in general is a terrible idea IMHO
I use tabix.ui for Clickhouse; Clickhouse generates tons of extra headers to act as progress bars, and Chrome will just cut the connection and report a "header too large" error.
HTTP 103 will just be abused like that. People will make megabyte-sized headers.",0,15590379,0,"[15591400,15592805]","",0,"","[]",0 12000233,0,"comment","buremba","2016-06-29 08:24:07.000000000","We're integrating Clickhouse (http://clickhouse.yandex) into our open-source analytics platform Rakam (https://github.com/rakam-io/rakam).",0,12000099,0,"[]","",0,"","[]",0 22901555,0,"comment","zepearl","2020-04-17 18:28:05.000000000","I also had some Seagate drives fail... (I'm now using only HGST and WD drives - no experience with Toshiba.)
Question about ZFS and its RAIDZ: do you have any recommendations (personal experience, links, books, ...) concerning the parameters/setup to use when setting up a RAIDZ (1 and 2)?
I'm new to ZFS, and I already had to do a lot of testing with ZFS on a single HDD before I finally managed to get good performance out of it (it's used by a "Clickhouse" database, which itself writes data in a CoW style => I had to raise the "recordsize" to 2MiB), and I imagine that with RAIDZ it can get more complicated?
I would like to set up a RAIDZ for the database (again, "Clickhouse", which generates multi-GB files) and two more to be used as simple NAS (a main one and a backup, storing files of all sizes).
I searched a lot and found some websites that were "ok", and also bought 2 small books ( https://www.amazon.com/Introducing-ZFS-Linux-Understand-Stor... and https://www.amazon.com/ZFS-Linux-Administration-William-Spei... ), but the books were mediocre and the information I found on the web was sparse and a bit dated.
Cheers",0,22901073,0,"[22904937]","",0,"","[]",0 19194981,0,"comment","PeterZaitsev","2019-02-18 22:15:24.000000000","Hi,
Apache Spark is an easy target to beat. I would be much more interested in seeing a comparison to ClickHouse :)",0,19192625,0,"[19195102]","",0,"","[]",0 19195102,0,"comment","felipe_aramburu","2019-02-18 22:32:48.000000000","Do you have an example of workloads that are easy to reproduce using clickhouse so we can see how much effort it would be to make a comparison? We are always happy to test against other tools!",0,19194981,0,"[19197267]","",0,"","[]",0 19198310,0,"story","stefant","2019-02-19 11:57:01.000000000","",0,0,0,"[]","https://sematext.com/blog/clickhouse-monitoring-key-metrics/",5,"Key Metrics for Monitoring Clickhouse","[]",0 19198415,0,"comment","felipe_aramburu","2019-02-19 12:21:36.000000000","And you think https://tech.marksblogg.com/billion-nyc-taxi-rides-clickhous... for example is something that can be considered fast? It takes the user 55 minutes just to load its data into a state so that it can be "queryable".
After importing, they then spend 34 more minutes converting the data into a columnar representation. Alright, so 89 minutes in and we still haven't run any queries.
Oh, but it's not distributed yet. Darn, I have to run some non-standard SQL commands like
CREATE TABLE trips_mergetree_x3 AS trips_mergetree_third ENGINE = Distributed(perftest_3shards, default, trips_mergetree_third, rand());
Ok, can I query my data yet? No, you have to move it into this distributed representation, and that takes 15 more minutes. Oh, ok...
And now? Yes you can run your queries but they aren't really very fast.
SELECT cab_type, count(*) FROM trips_mergetree_x3 GROUP BY cab_type;
It can take 2.5 seconds on a 108-CPU-core cluster for only 1.1BN rows? That's not fast. It's particularly slow given that it requires you to ingest and optimize your data first.
Maybe you want to show us an example of some simple tests you have run with blazing and clickhouse. As I read it now, it's not worth our time to look into because it's so very different from what we are trying to offer, which is:
Connect to your files wherever you have them. ETL quickly. Train / classify. Move on!",0,19197267,0,"[19198701]","",0,"","[]",0 19198701,0,"comment","bicubic","2019-02-19 13:15:10.000000000","The ingest time is due to updating the merge tree. You don't need a merge tree for etl... It's like the worst backing store you could possibly choose. You're also comparing an intentionally horizontally distributed query to a purely vertical one on a single node. You can see just slightly below that the same query takes 0.2 seconds on a single node.
I was hoping to see some serious consideration given to these kinds of benchmarks, considering Clickhouse is one of the most cost effective tools I've used in the real world and occasionally outperforms things like mapd.
I was expecting your solution to outperform Clickhouse at least in some aspects, and a benchmark showing where it wins. Instead you reveal ignorance of Clickhouse and even the benchmarks you linked.
Your comment comes off as incredibly arrogant and at the same time incredibly misinformed. Disappointing to see this attitude from the team.",0,19198415,0,"[19199094]","",0,"","[]",0 19199094,0,"comment","felipe_aramburu","2019-02-19 14:26:16.000000000","I am ignorant of clickhouse. It doesn't really compete in the workloads we are interested in. Sorry you feel this way but we are a small team and need to consider tools that integrate with Apache Arrow and CUDF natively.
If it doesn't take input from Arrow and CUDF, and it doesn't produce output that is Arrow, CUDF, or one of the file formats we are decompressing on the GPU, then we don't care unless one of our users asks us for this.
We are 16 people, and a year ago we were 5. We can't test everything out, just the tools our users need to replace in their stacks. I apologize if I came off as arrogant. I have Tourette's syndrome and a few other things that make it difficult for me to communicate, particularly when discussing technical matters. If I have offended you, I do apologize, but not a single one of our users has said to me "I am using clickhouse and want to speed up my GPU workloads". Maybe it's so fast they don't mind paying a serialization cost going from clickhouse to a GPU workload, and if so that's great for them!",0,19198701,0,"[19199209]","",0,"","[]",0 19199209,0,"comment","bicubic","2019-02-19 14:44:46.000000000","Understood.
I do suggest you seriously benchmark against clickhouse, because where single-node performance is concerned, it is the tool to beat outside arcane proprietary stuff like kdb+ and brytlytdb. I have used single-node clickhouse and seen interactive query times in cases where a >10-node spark cluster was recommended by supposed experts.
Clickhouse is not a mainstream tool (and I have discussed its limitations in other threads) but it is certainly rising in popularity, and in my view it comes pretty close to 1st place for general purpose perf short of Google scale datasets.",0,19199094,0,"[19199524]","",0,"","[]",0 19199524,0,"comment","felipe_aramburu","2019-02-19 15:23:34.000000000","Ok. Right now we are in tunnel vision mode to get our distributed version out by GTC in mid march. We will benchmark against clickhouse sometime in March. Do you know of any benchmark tests that are a bit more involved in terms of query complexity? We are most interested in queries where you can't be clever and use things like indexing and precomputed materializations.
The more complex the query the less you can rely on being clever and the more the guts need to be performant and that is more important to us right now.",0,19199209,0,"[19218571,19218697]","",0,"","[]",0 26021076,0,"comment","wesm","2021-02-04 00:23:06.000000000","Almost no database systems support multidimensional arrays. So they are not appropriate for many use cases?
* BigQuery: no * Redshift: no * Spark SQL: no * Snowflake: no * Clickhouse: no * Dremio: no * Impala: no * Presto: no ... list continues
We've invited developers to add the extension types for tensor data, but no one has contributed them yet. I'm not seeing a lot of tabular data with embedded tensors out in the wild.",0,26020657,0,"[26022146,26023933,26021239,26024658,26022161,26024119]","",0,"","[]",0 26021239,0,"comment","mkl","2021-02-04 00:49:48.000000000","If you put a blank line between your bullet points, they'll display properly:
* BigQuery: no
* Redshift: no
* Spark SQL: no
* Snowflake: no
* Clickhouse: no
* Dremio: no
* Impala: no
* Presto: no",0,26021076,0,"[]","",0,"","[]",0 26022161,0,"comment","est","2021-02-04 03:11:15.000000000","On a side note, Clickhouse had some Arrow support
https://github.com/ClickHouse/ClickHouse/issues/12284",0,26021076,0,"[]","",0,"","[]",0 26023684,0,"comment","otabdeveloper4","2021-02-04 08:21:15.000000000","It's nice that the cyclical technology pendulum is finally swinging back from XML/JSON to files-with-C-structs again, but any serious analytics data store (e.g. Clickhouse) uses its own aggressively optimized storage format, so the process of loading from Arrow files to database won't go away.",0,26018375,0,"[]","",0,"","[]",0 26024119,0,"comment","zX41ZdbW","2021-02-04 09:54:25.000000000","ClickHouse has support for multidimensional arrays.",0,26021076,0,"[]","",0,"","[]",0 26024128,0,"comment","zX41ZdbW","2021-02-04 09:57:03.000000000","ClickHouse has support for multidimensional arrays with arbitrary types and number of dimensions.
They are stored in tables in efficient column-oriented format.",0,26021326,0,"[]","",0,"","[]",0 19204867,0,"comment","bnolsen","2019-02-20 01:45:53.000000000","Clickhouse, a Russian product, seems to index dramatically better than splunk. Add grafana, etc., and you can build a usable and dramatically faster logging platform. Feature parity?",0,19204611,0,"[19205215]","",0,"","[]",0 19205215,0,"comment","justinsaccount","2019-02-20 03:06:12.000000000","Not even close. Clickhouse doesn't even really index, it's a column store for structured data.",0,19204867,0,"[]","",0,"","[]",0 26031043,0,"comment","rkwasny","2021-02-04 23:29:22.000000000","We live in a post-big-data world; few companies maintain datasets of 10B+ rows, and for anything else a decent MPP database like Clickhouse is enough.
Also beware: investing in technology that does not support standard SQL will create problems in the future. MAPD, for example, is an amazing database, but there is no way to connect it to enterprise analytics software like Tableau.",0,26030152,0,"[]","",0,"","[]",0 19218571,0,"comment","hodgesrm","2019-02-21 17:28:57.000000000","I work for Altinity, which offers commercial support for ClickHouse. We like benchmarks. :)
We use the DTC airline on time performance dataset (https://www.transtats.bts.gov/tables.asp?DB_ID=120) and Yellow Taxi trip data from NYC Open Data (https://data.cityofnewyork.us/browse?q=yellow%20taxi%20data&...) for benchmarking real-time query performance on ClickHouse. I'm working on publishing both datasets in a form that makes it easy to load them quickly. Queries are an exercise for the reader but see Mark Litwintschik's blog for good examples of queries: https://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html.
We've also done head-to-head comparisons on time series using the TSBS benchmark developed by the Timescale team. See https://www.altinity.com/blog/clickhouse-timeseries-scalabil... for a description of our tests as well as a link to the TSBS Github project.",0,19199524,0,"[19254585]","",0,"","[]",0 19218697,0,"comment","hodgesrm","2019-02-21 17:38:02.000000000","BTW, I think you do need to consider materialized views. ClickHouse materialized views function like projections in Vertica. They can apply different indexing and sorting to data. Unless your query patterns are very rigid it's hard to get high performance in any DBMS without some ability to implement different clustering patterns in storage.",0,19199524,0,"[]","",0,"","[]",0 26048816,0,"comment","merightnow","2021-02-06 18:50:18.000000000","
Location: Zurich, Switzerland
Remote: Yes
Willing to relocate: No
Technologies:
- [Professional experience] Scala, Java, Typescript, Kafka, Docker, Kubernetes, Clickhouse, RDBMS, Serverless, AWS, Lightbend stack
- [Hobby experience] Kotlin, Go, Rust, VueJS, Firebase
Résumé/CV: On request
Email: nivarut@protonmail.com
Experience:
More than 7 years of experience designing, implementing, and operating distributed platforms, with ownership of the entire lifecycle of the platform: from business requirements to monitoring code and infrastructure, CI/CD, architecture design, and setting platform-wide standards.
",0,25989762,0,"[]","",0,"","[]",0
22949788,0,"story","pachico","2020-04-22 20:24:33.000000000","",0,0,0,"[]","https://www.altinity.com/blog/2020/4/14/handling-real-time-updates-in-clickhouse",4,"Handling Real-Time Updates in ClickHouse","[]",0
19245390,0,"story","set321","2019-02-25 13:20:50.000000000","",0,0,0,"[]","https://sematext.com/blog/clickhouse-monitoring-tools/",3,"ClickHouse Monitoring Tools","[]",0
15650021,0,"comment","deepsun","2017-11-08 05:06:22.000000000","No ClickHouse, no PipelineDB?",0,15649405,0,"[]","",0,"","[]",0
15652988,0,"comment","manigandham","2017-11-08 14:58:07.000000000","Any distributed relational database, especially with compressed columnstores, will be better than any existing timeseries-specific database. Timescale, Citus, PipelineDB = postgres based but no columnstores. MemSQL, MariaDB, ClickHouse = with columnstores.",0,15649679,0,"[15653696]","",0,"","[]",0 22960302,0,"comment","pachico","2020-04-23 19:45:40.000000000","Don't get me wrong, I celebrate any competitor of AWS, even though I use it massively, but Redis is a tricky one.
For a start, you want Redis to be as near as possible to your application. It is often used as a cache, and it makes no sense to have long latencies to your cache layer (Redis is often even deployed in the same pod as the app using it, precisely because you want it nearby). And if your infrastructure is already in AWS (why would you choose ElastiCache otherwise?), you would be paying for all the data transfer from AWS to your external Redis-as-a-service provider, and that might cost much more than what you expected to save in the first place.
To be honest, AWS ElastiCache is not even an expensive service (t3.micro instances work just great and allow no-upfront reservations, which you failed to use as a comparison for obvious reasons).
Really, I don't think Redis is a problem to solve, and I'd put my money on someone offering cheaper DocumentDB alternatives, or Redshift alternatives, or managed ClickHouse services, etc. Those are the real killers!
Anyhow, sorry for being a bummer and wish you best of luck!!!",0,22957091,0,"[22961546,22961505,22962535,22963071]","",0,"","[]",0 19252995,0,"comment","qaq","2019-02-26 09:18:35.000000000","Depends on the use-case you can look at performance of say ClickHouse vs alternatives which separate storage layer. The performance difference is fairly significant.",0,19252930,0,"[]","",0,"","[]",0 15655167,0,"comment","manigandham","2017-11-08 18:12:53.000000000","None of that is difficult nor is it related to time-series data specifically, they are just modern data warehouse features.
SQL already has window/analytical functions. Relational databases with columnstores also have their traditional rowstores which support all the OLTP features you need, along with easy joins across both table types. Many also now pair a rowstore or in-memory segment with each columnstore for background merges to handle rapid ingest and easy updates.
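As a concrete illustration of the window/analytical functions mentioned above, here is a small Python sketch of what something like AVG(value) OVER (ORDER BY ts ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) computes over a time series (the values are invented):

```python
def moving_avg(values, window=3):
    """Trailing moving average: mimics a SQL window frame of
    `ROWS BETWEEN (window - 1) PRECEDING AND CURRENT ROW`."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

# Early rows have shorter frames, exactly as in SQL.
assert moving_avg([10, 20, 30, 40]) == [10.0, 15.0, 20.0, 30.0]
```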
If you want a polished system, use MemSQL or SQL Server Columnstore Indexes, or more manual work with clickhouse and others. We have 28 billion rows of classic "time series" monitoring data in memsql compressed to less than 50gb and complex aggregations return in milliseconds.",0,15654756,0,"[15657462]","",0,"","[]",0 26081569,0,"comment","pranay01","2021-02-09 19:34:06.000000000","Thanks for your suggestions on better ways to support self hosting! I agree we need to do a much better job here.
We chose Kafka and Druid because:
1. Any company that reaches a decent scale invariably uses some form of Kafka, and it is a trusted system that scales to huge volumes.
2. Community adoption and support. When choosing a datastore, we also evaluated Apache Pinot & Clickhouse, but Druid seemed to have the best community. Also, it was proven at scale in places like Lyft.
I agree though that these are not simple systems, and may be too much for smaller orgs. We are also evaluating supporting simpler datastores, but that would depend on what the community demands. Our architecture is modular so we are not strictly tied to druid and we can support other datastores if there is interest.
I agree with your point around integrations. That is one of the moats of DataDog in my opinion. Agree to the usefulness of integrations for PagerDuty/Slack. I have added an issue for this - https://github.com/SigNoz/signoz/issues/21#issue-804860212
Though we are hoping that, being an open source project, our community will be able to create integrations. I have answered this in more detail in another comment - https://news.ycombinator.com/item?id=26080530",0,26081408,0,"[26081994]","",0,"","[]",0 26081994,0,"comment","julienfr112","2021-02-09 20:22:02.000000000","What was wrong with clickhouse ?",0,26081569,0,"[26082370]","",0,"","[]",0 26082370,0,"comment","ankitnayan","2021-02-09 21:05:18.000000000","Nothing wrong there. If enough users want, we can add clickhouse also",0,26081994,0,"[]","",0,"","[]",0 22964836,0,"comment","pachico","2020-04-24 05:08:31.000000000","One of our applications receives more than 10m hits a day through Kong, which uses Redis for its rate limit plugin. We put a t3.micro for that and never had any issue. In reality, during our performance tests we got to much higher volumes and it always worked fine. What kind of issue with network did you encounter? Micro, small and medium should have the same network capacity. So, t3.micros cost $7.4, yes, but if you do just 1 year no-upfront reservation it should be around $4.9 (it's usually a third in saving). I think the value proposition is too weak to even consider it, at least to me.
However, as someone else also mentioned, we need alternatives to other AWS services, those that are expensive enough to run them on prem. The day you offer cheap Elasticsearch, ClickHouse, documentdb, etc, I'll kill my Hetzner machines and will come to you, sir.",0,22962535,0,"[]","",0,"","[]",0 26089587,0,"story","xoelop","2021-02-10 14:31:29.000000000","",0,0,0,"[]","https://blog.tinybird.co/2021/02/10/tinybird-tips-debugging-clickhouse-on-vscode/",5,"Debugging ClickHouse on VSCode","[]",0 15673323,0,"comment","lima","2017-11-10 21:46:24.000000000","Yandex's recently open sourced ClickHouse[1] column store does some of these.
It heavily relies on compression, data locality and SIMD instructions and supports external dictionaries for lookup.
[1]: https://clickhouse.yandex/",0,15672027,0,"[15674070,15673868]","",0,"","[]",0 15673868,0,"comment","spinco","2017-11-10 23:09:40.000000000","HN discussion on ClickHouse from a few years ago: https://news.ycombinator.com/item?id=11908254",0,15673323,0,"[]","",0,"","[]",0 15674070,0,"comment","posnet","2017-11-10 23:47:37.000000000","I am always impressed with clickhouse, especially when it holds up against massive data processing systems, but running on a laptop http://tech.marksblogg.com/benchmarks.html",0,15673323,0,"[]","",0,"","[]",0 15675513,0,"comment","manigandham","2017-11-11 08:32:16.000000000","Most of these techniques are already in production:
Microsoft SQL Server has columnstore indexes and can even combine them with its in-memory tables. MemSQL has been doing this for years, and v6 is incredibly fast; it also combines in-memory row-stores. ClickHouse is very good if you don't mind more operations work. MariaDB has the ColumnStore storage engine, Postgres has the cstore_fdw extension. Vertica, Greenplum, Druid, etc. EventQL was an interesting project but is abandoned now.
AWS RedShift, Azure SQL Data Warehouse, Snowflake Data, Google BigQuery are the hosted options, with BQ being the most advanced with its vertical integration.
If you want to operationalize Apache Arrow today, Dremio is built around it and works similarly to Apache Drill and Spark to run distributed queries and joins across data sources.",0,15672027,0,"[15675756]","",0,"","[]",0 15678110,0,"comment","xstartup","2017-11-11 20:50:27.000000000","Have you actually deployed this in production? Last year, we had lots of issues with Druid. Do you know about clickhouse?",0,15673631,0,"[15681314]","",0,"","[]",0 22994010,0,"story","ThePhysicist","2020-04-27 08:30:49.000000000","I'm looking for a database that can efficiently store and retrieve a very large number (billions) of structured datapoints for use in machine learning. Each datapoint can have an arbitrary number of categorical and numerical attributes and belong to one or more datasets.
I want to be able to quickly (ideally in a few seconds at most for result sets with 1,000-1,000,000 datapoints) select datapoints of a given dataset and possibly filter them based on their attribute values, e.g. formulating queries like "give me all datapoints belonging to dataset A for which x < 4.5 AND category = 'test' AND event_date >= '2009-04-10'". Once written, datapoints will not change, though I would like to attach additional information to specific datapoints (e.g. test results or additional labels), which could be done in a separate data structure or table.
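Expressed over plain Python dicts (the data here is invented; the field names come from the example predicate), that query shape is just:

```python
from datetime import date

# Invented sample datapoints matching the predicate's fields.
datapoints = [
    {"dataset": "A", "x": 3.1, "category": "test", "event_date": date(2010, 1, 1)},
    {"dataset": "A", "x": 5.0, "category": "test", "event_date": date(2010, 1, 1)},
    {"dataset": "B", "x": 1.0, "category": "test", "event_date": date(2010, 1, 1)},
]

# "all datapoints in dataset A where x < 4.5 AND category = 'test'
#  AND event_date >= '2009-04-10'"
hits = [
    p for p in datapoints
    if p["dataset"] == "A"
    and p["x"] < 4.5
    and p["category"] == "test"
    and p["event_date"] >= date(2009, 4, 10)
]
assert len(hits) == 1 and hits[0]["x"] == 3.1
```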
Right now I'm solving this using a simple PostgreSQL database with auxiliary index tables, but I'm looking for more scalable alternatives.
I've considered software like Cassandra or Clickhouse but I'm not sure they will fit my use case well. Do you have any recommendations or did you realise such a system in your work and can provide some ideas or guidance? Thanks!",0,0,0,"[23028951,22994045]","",3,"Ask HN: Database for storing machine learning data?","[]",3 19284499,0,"comment","MrBuddyCasino","2019-03-01 19:39:59.000000000","Instana | Munich, Germany | ONSITE | Senior Backend Developer | instana.com
Instana is revolutionizing the application performance management space, where we compete with the likes of AppDynamics and Dynatrace. Financed by Merritech and Accel, we are growing very rapidly and need help! We have teams all over the world, but most of the engineering is in Germany and Serbia.
Our stack is good ol' Java on Dropwizard with Reactive Streams, using ClickHouse, Cassandra, Elasticsearch, Kafka and CockroachDB as data stores. The product is offered both as SaaS on AWS and on-prem.
You should have a minimum of 5 years of relevant work experience with Java 8, lambdas and streams and be fluent in english. Familiarity with Reactor, K8S and AWS are a plus. We are processing > 40k events/sec per datacenter, so if you want to work on something more interesting than a SpringBoot CRUD app, by all means drop us a line.
We offer competitive pay, stock options and all the usual perks. Check us out at: https://www.instana.com/
Please email fabian.staeber@instana.com directly.",0,19281834,0,"[]","",0,"","[]",0 26111598,0,"comment","FridgeSeal","2021-02-12 08:05:12.000000000","Not that I advocate running your own Postgres setup in your own cluster instead of just renting a managed version, but I've run a few databases on K8s and found it pretty fine: useful for when your hosting provider doesn't support the database you want to run (Clickhouse managed AWS service when?) or for application-specific KV-stores. EBS volumes and PVCs are great, performance is solid, Kubernetes takes care of the networking and will resurrect it if the worst happens and it does go down.
I probably could run those things on their own instances, but then I'd have to go through the hassle of networking, failover/recreation, deployments, etc., and for the vast majority of cases that's 100% more effort than deploying a stateful-set.",0,26111236,0,"[26115163]","",0,"","[]",0 26115163,0,"comment","hodgesrm","2021-02-12 16:10:59.000000000","> (Clickhouse managed AWS service when?)
Now! Altinity runs Altinity.Cloud in AWS. Feel free to drop by.
There are also services in other clouds. Yandex runs one in their cloud and there are at least 3 in China. ClickHouse has a big and active community of providers.
Disclaimer: I work for Altinity.",0,26111598,0,"[26119888]","",0,"","[]",0 19300871,0,"story","stefant","2019-03-04 13:01:40.000000000","",0,0,0,"[]","https://sematext.com/blog/clickhouse-monitoring-sematext/",2,"Monitoring ClickHouse with Sematext","[]",0 19310430,0,"comment","buro9","2019-03-05 13:46:23.000000000","In the other blog post (linked from the top of this blog post) I wrote about the motivation for using the Wireshark-like syntax and how we also have a few customers with a lot of rules (tens of thousands and greater).
That said, it seems obvious to us that now that we have a Rust library that can parse a Wireshark-like syntax into an AST, we don't have to perform only the matching in Rust, i.e. we can ask the library to produce translations of the expression as SQL (for our ClickHouse), GraphQL (for our analytics API), or even eBPF.
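The translation step is mechanical once the AST exists. A hypothetical Python sketch (the node shapes and the exact SQL mapping are invented for illustration; position() is ClickHouse's substring-search function):

```python
from dataclasses import dataclass

@dataclass
class Cmp:
    field: str
    op: str      # e.g. "==", "contains"
    value: str

@dataclass
class And:
    left: object
    right: object

def to_sql(node) -> str:
    """Walk the filter AST and render a SQL WHERE fragment."""
    if isinstance(node, And):
        return f"({to_sql(node.left)} AND {to_sql(node.right)})"
    if node.op == "==":
        return f"{node.field} = '{node.value}'"
    if node.op == "contains":
        return f"position({node.field}, '{node.value}') > 0"
    raise ValueError(f"unsupported operator: {node.op}")

# A Wireshark-style expression such as
#   http.host == "example.com" and http.user_agent contains "curl"
# parses into the tree below, which then renders as SQL:
expr = And(Cmp("http.host", "==", "example.com"),
           Cmp("http.user_agent", "contains", "curl"))
assert to_sql(expr) == "(http.host = 'example.com' AND position(http.user_agent, 'curl') > 0)"
```

The same tree could equally be walked by a GraphQL or eBPF backend, which is the point of keeping parsing and matching separate.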
We can't run everything in eBPF, but we could check the list of the fields within an expression to see whether it could be run in eBPF, and then look at heavy hitter rules and promote the ones doing the most work inside L7 to be eBPF and to run in XDP.
Even if we don't do this for customer configured rules, this might be something we do for handling denial of service attacks using the same Wireshark-like expression syntax throughout our system.",0,19309702,0,"[19312699]","",0,"","[]",0 19321668,0,"comment","zeeg","2019-03-06 19:12:14.000000000","For the curious, all of this is powered by Clickhouse, which has been a phenomenal addition to Sentry's stack.",0,19321655,0,"[]","",0,"","[]",0 26141317,0,"comment","FridgeSeal","2021-02-15 10:34:04.000000000","Yandex is responsible for developing and maintaining my favourite columnar database: ClickHouse. It’s one of those pieces of software where everything I use it go “wow this is fast”.",0,26138745,0,"[]","",0,"","[]",0 23028951,0,"comment","pachico","2020-04-30 09:15:33.000000000","I use ClickHouse for analytical purposes and I managed to ingest with very modest hardware up to 5 million rows per second. I stopped there but with more multiple jobs I might achieve even more. Queries and export are very fast too. At the moment I cannot think of anything better for this. Let me know if you need extra info.",0,22994010,0,"[]","",0,"","[]",0 26153876,0,"story","xoelop","2021-02-16 13:57:15.000000000","",0,0,0,"[]","https://blog.tinybird.co/2021/02/16/tinybird-clickhouse/",6,"Why we are hiring open source devs to contribute to ClickHouse full-time","[]",0 23043955,0,"comment","hwwc","2020-05-01 16:47:42.000000000","SEEKING WORK | Backend Services; Data Engineering; Systems Engineering
Location: Boston, US | Remote: Yes
I'm an experienced software engineer looking for part-time and short-term contracts.
I've most recently worked in the data-analytics backend-stack: from ETL to database design to web-api to devops. One of my major projects is an analytics engine for web applications using Rust and Clickhouse (https://github.com/hwchen/tesseract).
However, I'm naturally curious and happy to work in any domain which requires high performance and maintainable code. I've worked with a distributed worker system, debugged async database drivers, and implemented text layout primitives.
Primary Skills: Rust, Python, Linux, Google Compute Platform, ClickhouseDB, Postgresql
Production Experience with: Python/Pandas, Node/JS, AWS, Docker, Redis, MySql, Nginx
Github: https://github.com/hwchen
Contact: hello@hwc.io",0,23042617,0,"[]","",0,"","[]",0 15740718,0,"story","valyala","2017-11-20 15:49:38.000000000","",0,0,0,"[]","https://github.com/Vertamedia/chproxy",2,"Chproxy – configurable proxy and load balancer for ClickHouse","[]",0 23044627,0,"comment","lykr0n","2020-05-01 17:46:52.000000000","Role: Site Reliability Engineer/System Administrator/System Engineer
Location: Seattle, WA (and surrounding areas)
Willing to relocate: I'd rather not
Technologies: Linux (CentOS/RHEL), MySQL, Postgres, Clickhouse, Docker, Nomad, Consul, Vault, Puppet, Ansible, SaltStack, Python 2/3 (development + administration), Rust (development + administration), Java + JVM (administration), KVM (oVirt/RHEV), VMware vSphere, Limited AWS/GCP, etcd, zookeeper, kafka, haproxy, nginx, Bash, GitHub/GitLab, Git, HTML, Datadog, Grafana, InfluxDB, and so on and so on. Looking for On Call? I find it fun.
Résumé/CV: On Request
Email: lykron@mm.st
I love building infrastructure and being involved with architecture design. I've been heavily involved in improving reliability of applications and systems to make sure they do not go down. Always looking to learn and help others do so as well.",0,23042616,0,"[]","",0,"","[]",0 23055414,0,"comment","scoresmoke","2020-05-02 22:31:51.000000000","Of course, the world is not limited to just nginx. As soon as the Web server allows sending access logs via the syslog protocol, Ballcone should handle it well. Well, nginx is my usual choice, so I focused on supporting it first. As far as I know, haproxy allows it, too, but I have not tried it yet.
An embedded database is used intentionally. Not just because I like them, but because I am inspired by services like Cockpit (https://cockpit-project.org/) that provide a Web console to the server on SSH login.
Scaling is an excellent question. My preliminary tests showed that MonetDBLite handled approximately 500–700 inserts per second on a small Linux virtual machine with HDD. Larger scales require specialized solutions like ClickHouse or Vertica. Just in case, Ballcone supports gathering data from multiple remote hosts.",0,23055304,0,"[]","",0,"","[]",0 12168055,0,"comment","ddispaltro","2016-07-26 19:09:20.000000000","How does it compare to Clickhouse, recently released from Yandex [0]?
[0] https://clickhouse.yandex/",0,12167944,0,"[]","",0,"","[]",0 26182015,0,"story","mmcclure","2021-02-18 16:40:29.000000000","",0,0,0,"[26186955,26187501,26190502,26188020,26187459,26193314,26190763,26187541]","https://mux.com/blog/from-russia-with-love-how-clickhouse-saved-our-data/",128,"How ClickHouse saved our data (2020)","[]",52 26186955,0,"comment","wiredfool","2021-02-18 22:53:15.000000000","I’ve also had some pretty clear successes moving a ~2TB Postgres database into a 70GB Clickhouse instance, (though, in this case a good 2/3 of the issue was scheme design on the PG side, but there were still another order of magnitude compression to be had from clickhouse.)
Queries went from a 12 hour table scan to ~2 minutes, and production queries from 45 sec to 400ms.
This is without doing anything particularly tricky on the clickhouse side, nor even putting it on a particularly large machine. When benchmarking it, I put it on a 10yr old laptop, and I was seeing production queries in the seconds range,",0,26182015,0,"[26192972,26189742,26188249]","",0,"","[]",0 26187541,0,"comment","bdod6","2021-02-18 23:52:58.000000000","I don't understand the obsession with ClickHouse. While it seems like it fits this particular use case, it still deals with the same limitations and challenges of columnar DBs. Your queries will be very fast with counts/averages, but there's a tradeoff with other functions: inserts are efficient for bulk inserts only, your deletes and updates are slow, no secondary indexes...
While Clickhouse can be lightning fast, is it really designed to be a main backend database?",0,26182015,0,"[26187613,26190607,26190666,26187599]","",0,"","[]",0 26187613,0,"comment","tadkar","2021-02-19 00:00:52.000000000","Clickhouse is not something you use for a CRUD backend.
The obsession with Clickhouse is the phenomenal performance for the OLAP use case, a scenario where there were not many open source, easy to install and maintain options. For the most part you can treat it as a “normal” database, insert data into it and query it without messing about with file format conversions and so on. The fact that it is blindingly fast is a big bonus!",0,26187541,0,"[26189806]","",0,"","[]",0 26187645,0,"comment","A-F1V3","2021-02-19 00:04:04.000000000","(Mux co-founder here) The Postgres system described as the predecessor to Clickhouse here actually was cstore_fdw with a pretty heavily customized (and outdated ) CitusDB. CitusDB is a great system for sharding and distributing postgres, but we found that the compression and query performance for this specific analytics use case was much better served by Clickhouse.",0,26187459,0,"[26187798,26187720]","",0,"","[]",0 26187720,0,"comment","tomnipotent","2021-02-19 00:14:39.000000000","The only place Citus is mentioned in the post is in the title of a chart. Might be worth updating to include this, would also clarify note #3:
"ClickHouse is still faster than the sharded Postgres setup at retrieving a single row, despite being column-oriented and using sparse indices. I'll describe how we optimized this query in a moment."",0,26187645,0,"[26188087]","",0,"","[]",0 26187798,0,"comment","caust1c","2021-02-19 00:23:10.000000000","Cloudflare also moved from Citus to Clickhouse",0,26187645,0,"[]","",0,"","[]",0 26188020,0,"comment","_kyran","2021-02-19 00:48:39.000000000","Does anyone know of a hosted Clickhouse service?
I'd like to use Clickhouse but on a very small team without an ops person, it feels like a liability managing a server ourselves.",0,26182015,0,"[26190423,26188069]","",0,"","[]",0 26189742,0,"comment","riku_iki","2021-02-19 04:28:37.000000000","> ~2TB Postgres database into a 70GB Clickhouse
was your PG DB compressed?",0,26186955,0,"[26193252,26191107]","",0,"","[]",0 26189806,0,"comment","lixtra","2021-02-19 04:36:39.000000000","> For the most part you can treat it as a “normal” database
While Clickhouse is great at what it does, I would expect a „normal“ database to support transactions. Don’t use it to handle bank accounts.",0,26187613,0,"[26190650]","",0,"","[]",0 23072138,0,"comment","hodgesrm","2020-05-04 19:18:58.000000000","Altinity | Multiple ClickHouse engineering positions | REMOTE in North America and Europe| Full-time | Competitive Salary and Equity
Hello! We are Altinity, a fast-growing database startup with a distributed team spanning from California to Eastern Europe. Our business is to make customers successful with ClickHouse, the leading open source data warehouse. Our customers range from ambitious startups to some of the most well-known enterprises on the planet. And we are looking for people to join us!
* Demand Generation Marketing Manager
* Cloud Engineer
* Site Reliability Engineer
* Altinity Test Engineer
* Data Warehouse Implementation Engineer
* Data Warehouse Support Manager
* Data Warehouse Support Engineer
If you are looking and want to join, check out our jobs here: https://www.altinity.com/careers",0,23042618,0,"[]","",0,"","[]",0 23072555,0,"comment","thda","2020-05-04 19:56:49.000000000","> Oracle DB also has been the leader performance wise for complex queries and large datasets
SQL server has a column store type of storage, and major innovations like 'froid'. Oracle is not such a strong leader there. Also, on a whole lot of workloads clickhouse is much superior.",0,23071596,0,"[23073975]","",0,"","[]",0 23073975,0,"comment","tomnipotent","2020-05-04 22:12:31.000000000","> whole lot of workloads clickhouse is much superior.
ClickHouse is a read-only analytics database and overlaps with Oracle in only those areas, otherwise Oracle blows it out of the water.",0,23072555,0,"[23085244]","",0,"","[]",0 26190165,0,"comment","mastazi","2021-02-19 05:44:42.000000000","> and if it's analytics why were you using a row store instead of a columnar database?
This point confuses me, parent is telling us how they went from a row-based (pgsql) to a column-based DB (Clickhouse), and you asked them why they didn't?",0,26188249,0,"[26192256]","",0,"","[]",0 26190423,0,"comment","x4m","2021-02-19 06:47:50.000000000","Consider using Yandex Cloud, we offer managed Clickhouse as a service. Clickhouse is opensource DB created by Yandex.",0,26188020,0,"[]","",0,"","[]",0 26190502,0,"comment","DevKoala","2021-02-19 07:05:45.000000000","I am currently building an internal analytics platform for web event data on top of CH, and I recommend it too.
I am migrating a lot of complex Scala queries to CH SQL and I am surprised at everything I can do, very deep and rich API backed by great performance. Also, Clickhouse is a bit more operationally involved than a solution like BigQuery, but at a fraction of the performance per dollar cost.",0,26182015,0,"[]","",0,"","[]",0 26190607,0,"comment","llampx","2021-02-19 07:34:24.000000000","I believe Clickhouse is competing with other analytical, OLAP databases, not something like vanilla Postgres or MariaDB or Oracle.
This is not your application backend database.",0,26187541,0,"[]","",0,"","[]",0 26190650,0,"comment","zX41ZdbW","2021-02-19 07:46:56.000000000","ClickHouse supports transactional (ACID) properties for INSERT queries. It can also replicate data across geographically distributed locations with automatic support for consistency, failover and recovery. Quorum writes are supported as well.
This makes it safe to use ClickHouse for billing data.",0,26189806,0,"[26191659]","",0,"","[]",0 26190666,0,"comment","zX41ZdbW","2021-02-19 07:50:48.000000000","Secondary indexes are supported by ClickHouse.
https://clickhouse.tech/docs/en/engines/table-engines/merget...
They are not indexes for point lookups; they too are sparse. Actually, they are the best you can do without blowing up the storage.",0,26187541,0,"[]","",0,"","[]",0 26190763,0,"comment","ramraj07","2021-02-19 08:11:29.000000000","Controversial question: can you trust clickhouse? After the JetBrains hack I'm starting to wonder if I should question any software coming from eastern Europe (even if the org is legit).",0,26182015,0,"[26190937,26190902,26191536]","",0,"","[]",0 26190937,0,"comment","aPoCoMiLogin","2021-02-19 08:43:05.000000000","> After the JetBrains hack
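The "sparse" behaviour described in the secondary-index comment above can be illustrated with a toy min/max skipping index: store the value range of each block of rows, then scan only the blocks whose range could contain the predicate. This is a simplified sketch of the general technique, not ClickHouse's actual implementation:

```python
# Simplified illustration of a min/max data-skipping index:
# per-block value ranges let a scan skip blocks that cannot match.

def build_index(values, block_size):
    """Store (min, max) for each block of `block_size` values."""
    index = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        index.append((min(block), max(block)))
    return index

def blocks_to_scan(index, needle):
    """Return indices of blocks whose range could contain `needle`."""
    return [i for i, (lo, hi) in enumerate(index) if lo <= needle <= hi]

values = [1, 3, 5, 8, 20, 22, 25, 29, 40, 41, 44, 48]
idx = build_index(values, block_size=4)   # [(1, 8), (20, 29), (40, 48)]
print(blocks_to_scan(idx, 22))            # only the middle block matches
```

The index stays tiny (one pair per block) regardless of row count, which is why it does not blow up storage; the trade-off is that each hit still requires scanning a whole block rather than jumping to one row.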
AFAIK there was no evidence that the SolarWinds fiasco was because of JetBrains [0]; also, ClickHouse is open source [1] under the Apache 2 license, so you can check the source and question it yourself.
[0] - https://www.zdnet.com/article/jetbrains-denies-being-involve...
[1] - https://github.com/ClickHouse/ClickHouse",0,26190763,0,"[]","",0,"","[]",0 26191118,0,"comment","wiredfool","2021-02-19 09:16:02.000000000","Why did I do a full scan? Because I wanted to see just how bad it was, and I wanted to see roughly what was in the database.
I ran: `select count (distinct foo)` and got ~9000 distinct items over 2B rows in 12 hours. Running the same query in Clickhouse was O(minute)
As a little extra comparison, the total size of the Clickhouse database is 1/2 the size of the indexes in Postgres.",0,26188249,0,"[]","",0,"","[]",0 26191659,0,"comment","lixtra","2021-02-19 10:47:24.000000000","ACID seems to be a very recent feature.
https://stackoverflow.com/questions/60163070/is-clickhouse-d...",0,26190650,0,"[]","",0,"","[]",0 26191986,0,"comment","olavgg","2021-02-19 11:31:16.000000000","I have evaluated these different options. And by far I prefer Clickhouse when we're talking about less than a petabyte of data. It's super simple to install and set up. And it is a lot faster.",0,26187501,0,"[]","",0,"","[]",0 26192972,0,"comment","zepearl","2021-02-19 13:36:55.000000000","I had excellent results as well when I experimented with Clickhouse (in my case the table with the most rows that I ever created had 6 columns using a total of ~55 bytes per row - after having inserted ~35 billion rows the table was 450MiB big). My usecases involving big aggregations and mini-searches (when searching/retrieving just a few rows) were excellent (big aggr) respectively good (small searches - not as fast as a normal DB, but for me for sure acceptable).
For whoever is interested to try it out, be aware that you really should understand how "parts" and "merges" work and your write-workload should take that into consideration (for example dynamically lower the workload when "merges" are running, take into consideration the I/O needed to delete pending old "parts", etc...) - if you don't then you're very likely to get bad performance if your workload is write-heavy. The amount of columns and the sort order is as well extremely important.
Last year I decided to start using Clickhouse to host Collectd's metrics ( https://www.segfault.digital/how-to-directly-use-the-clickho... ) and so far it's working well. I upgraded the version of the CH software twice, and once killed its VM by mistake, but afterwards I never had problems ("so far", hehe).",0,26186955,0,"[26193972,26200786]","",0,"","[]",0 26193192,0,"comment","FridgeSeal","2021-02-19 14:00:38.000000000","I’ve used Presto and Athena and neither could really hold a candle to ClickHouse. Presto was handy, but I’d personally much rather go to the effort of getting things into a proper columnar database than put up with the quirks of the likes of Presto.
If you want, there’s a detailed analysis of various db’s performance on various tasks here:
https://tech.marksblogg.com/benchmarks.html
It’s one of the top performing ones, even the single machine setup comfortably beats the likes of Presto, Athena, etc",0,26187501,0,"[]","",0,"","[]",0 26193252,0,"comment","FridgeSeal","2021-02-19 14:05:24.000000000","In addition to what OP mentions in the sibling comment, ClickHouse supports a whole bunch of encodings that can take advantage of column based layouts to further compress the columns in very clever ways",0,26189742,0,"[]","",0,"","[]",0 26193314,0,"comment","FridgeSeal","2021-02-19 14:12:09.000000000","ClickHouse has to be one of my favourite databases and pieces of technology.
It’s stupid fast, I know it’s capable of workload sizes past where I’ve pushed it so far. I’ve used it in anger and even then I was able to solve the issues I had easily. The out-of-the-box integrations (like the Kafka one) have made my life so much easier. I get excited whenever I get to use the instance I manage at work, or I demo something about it to someone because I know I’ll often blow their minds a bit.",0,26182015,0,"[]","",0,"","[]",0 26195057,0,"story","adam_carrigan","2021-02-19 16:55:02.000000000","Hi HN,
Adam and Jorge here, and today we’re very excited to share MindsDB with you (http://github.com/mindsdb/mindsdb). MindsDB AutoML Server is an open-source platform designed to accelerate machine learning workflows for people with data inside databases by introducing virtual AI tables. We allow you to create and consume machine learning models as regular database tables.
Jorge and I have been friends for many years, having first met at college. We have previously founded and failed at another startup, but we stuck together as a team to start MindsDB. Initially a passion project, MindsDB began as an idea to help those who could not afford to hire a team of data scientists, which at the time was (and still is) very expensive. It has since grown into a thriving open-source community with contributors and users all over the globe.
With the plethora of data available in databases today, predictive modeling can often be a pain, especially if you need to write complex applications for ingesting data, training encoders and embedders, writing sampling algorithms, training models, optimizing, scheduling, versioning, moving models into production environments, maintaining them and then having to explain the predictions and the degree of confidence… we knew there had to be a better way!
We aim to steer you away from constantly reinventing the wheel by abstracting most of the unnecessary complexities around building, training, and deploying machine learning models. MindsDB provides you with two techniques for this: build and train models as simply as you would write an SQL query, and seamlessly “publish” and manage machine learning models as virtual tables inside your databases (we support Clickhouse, MariaDB, MySQL, PostgreSQL, and MSSQL. MongoDB is coming soon.) We also support getting data from other sources, such as Snowflake, s3, SQLite, and any excel, JSON, or CSV file.
When we talk to our community, we find that they are using MindsDB for anything ranging from reducing financial risk in the payments sector to predicting in-app usage statistics - one user is even trying to predict the price of Bitcoin using sentiment analysis (we wish them luck). No matter what the use-case, what we hear most often is that the two most painful parts of the whole process are model generation (R&D) and/or moving the model into production.
For those who already have models (i.e. who have already done the R&D part), we are launching the ability to bring your own models from frameworks like Pytorch, Tensorflow, scikit-learn, Keras, XGBoost, CatBoost, LightGBM, etc. directly into your database. If you’d like to try this experimental feature, you can sign-up here: (https://mindsdb.com/bring-your-own-ml-models)
We currently have a handful of customers who pay us for support. However, we will soon be launching a cloud version of MindsDB for those who do not want to worry about DevOps, scalability, and managing GPU clusters. Nevertheless, MindsDB will always remain free and open-source, because democratizing machine learning is at the core of every decision we make.
We’re making good progress thanks to our open-source community and are also grateful to have the backing of the founders of MySQL & MariaDB. We would love your feedback and invite you to try it out.
We’d also love to hear about your experience, so please share your feedback, thoughts, comments, and ideas below. https://docs.mindsdb.com/ or https://mindsdb.com/
Thanks in advance, Adam & Jorge",0,0,0,"[26196710,26195128,26195783,26206329,26196973,26202549,26198393,26196883,26206247,26202898,26201021,26196173,26196083,26196479,26197256,26195949,26195792,26195900]","",176,"Launch HN: MindsDB (YC W20) – Machine Learning Inside Your Database","[]",60 26195783,0,"comment","pachico","2021-02-19 17:54:01.000000000","I've been following you, guys, for some months and I must say I'm a huge fan. Being a hardcore Clickhouse user, I got hooked with your tutorial about how to make it work with your product.
Best of luck!",0,26195057,0,"[26195991,26195960]","",0,"","[]",0 26195960,0,"comment","torrmal","2021-02-19 18:07:22.000000000","thank you!! lets chat, we would love to show you the timeseries cool stuff we have done for clickhouse!",0,26195783,0,"[26195995,26197763]","",0,"","[]",0 26195996,0,"comment","torrmal","2021-02-19 18:09:38.000000000","you are right!! the main thing is to offer it for other databases (mysql, mariadb, postgres, clickhouse, clickhouse, timescale, mongodb) as well as to support more powerful machine learning capabilities than the vanila classical models supported by oracle, for instance great timeseries support",0,26195949,0,"[]","",0,"","[]",0 26196973,0,"comment","hodgesrm","2021-02-19 19:28:07.000000000","Go MindsDB!! We've enjoyed working with the MindsDB team at Altinity. The integration with ClickHouse makes clever use of the MySQL protocol to implement models as queryable tables. For anybody interested in the specifics check out the following article: https://altinity.com/blog/machine-learning-models-as-tables.
We will watch your career with great interest.",0,26195057,0,"[26197246]","",0,"","[]",0 26197037,0,"comment","zepearl","2021-02-19 19:33:29.000000000","> That's a savings of something like 10% over the whole dataset.
Sorry, I'm probably misunderstanding you - I'm not sure about that "10%"; what does it refer to?
Anyway, I do like a lot in Clickhouse to be able to chain codecs (I personally like to think about "Delta/DoubleDelta/Gorilla/T64 codecs" being "encodings" and the "general purpose codecs LZ4/LZ4HC/ZSTD" being "compression codecs").
I don't have a good math background => about Delta & DoubleDelta I liked this explanation ( https://altinity.com/blog/2019/7/new-encodings-to-improve-cl... ) which defines "delta" as tracking "distance" and "doubledelta" tracking "acceleration" between consecutive column values.
In the end I tested, using relatively small datasets (which were different between usecases), all combinations of encodings (delta/doubledelta/gorilla/t64) and ZSTD-compression levels (mostly 1/3/9) (I ignored LZ4*).
- "Delta"&"DoubleDelta" were often interesting (but in general for my data using "Delta"+ZSTD was already good enough compared to the rest).
- "Gorilla" somehow never gave me any benefits if compared to other codecs and/or compression algos.
- "T64" is a bit of a mystery for me; anyway, in some tests it delivered excellent results compared to the other combinations, therefore I'm currently using just T64 for some columns, and T64+ZSTD(9) for some others.
EDIT: sorry, I think I got it - you probably meant something like "just by doing that on that specific column, the overall storage needs were reduced by 10% for the whole dataset", right?",0,26193972,0,"[26198957]","",0,"","[]",0 26197170,0,"story","igorlukanin","2021-02-19 19:43:47.000000000","",1,0,0,"[]","https://dev.to/cubejs/building-clickhouse-dashboard-and-crunching-wallstreetbets-data-14ao",1,"ClickHouse Analytical Dashboard with Stock Market Data","[]",0 26197404,0,"comment","hodgesrm","2021-02-19 20:03:23.000000000","Great to see this project write-up, especially the level of detail about the implementation. The team at Altinity collaborated with Chao and his colleagues early on to investigate options for schema definition and query scaling, as these topics are of general interest to the ClickHouse community. I really appreciate the generosity of the Uber team in sharing the work publicly.
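The Delta + ZSTD combination discussed in the codec comment above can be demonstrated with Python's standard library (zlib stands in for ZSTD, which is not in the stdlib): delta-encoding a slowly-increasing series leaves a highly repetitive byte stream, so the general-purpose compressor does far better. A minimal sketch, with an invented timestamp-like column:

```python
# Illustration of why Delta + compression beats compression alone on
# monotonic series (zlib stands in for ZSTD here).
import struct
import zlib

# A timestamp-like column: one reading per second with slight jitter.
timestamps = [1_600_000_000 + i + (i % 3) for i in range(10_000)]

def pack(ints):
    """Serialize signed 64-bit integers, little-endian."""
    return b"".join(struct.pack("<q", v) for v in ints)

# Delta encoding: keep the first value, then store differences.
deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

raw = len(zlib.compress(pack(timestamps), 9))
delta = len(zlib.compress(pack(deltas), 9))
print(raw, delta)  # the delta-encoded column compresses far smaller
```

In ClickHouse this roughly corresponds to declaring the column with a chained codec, e.g. `ts DateTime CODEC(Delta, ZSTD(9))` (column name invented), which applies the encoding and the compressor in exactly this order.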
It would be really cool to see this work evolve into one or more open source projects for service log management.",0,26197126,0,"[]","",0,"","[]",0 26197763,0,"comment","ajawee","2021-02-19 20:38:07.000000000","Is it possible do the anomaly deduction with clickhouse data?",0,26195960,0,"[26198010]","",0,"","[]",0 26199078,0,"comment","hodgesrm","2021-02-19 22:50:29.000000000","Based on the feedback here it may soon be time to do a follow-up talk on MindsDB at a future ClickHouse meetup. :)",0,26197246,0,"[26222147]","",0,"","[]",0 26200786,0,"comment","hodgesrm","2021-02-20 02:06:35.000000000","The system tables are incredibly helpful for optimizing ClickHouse performance. The system.columns table gives exact compression ratios for every column and system.parts shows similar stats for table parts. It's possible to tune schema very effectively without a lot of deep knowledge of how codecs and table partitioning work internally.",0,26192972,0,"[]","",0,"","[]",0 23085244,0,"comment","derefr","2020-05-05 21:24:15.000000000","Yes, and? If you're choosing a tool to deploy for that workload, why would you deploy Oracle instead of ClickHouse? The same question goes for any other analogous workload. Why use something that's second-best at 100 different jobs (Oracle), when you could just choose the best-in-class tool for the exact job you're doing each time?
Especially since, in the particular use-case we're talking about here (data warehousing), the whole paradigm and all the tooling is built around the expectation of ETL pipelines copying+transforming+"cubing" data around from OLTP (or data-lake) systems to OLAP systems. "Everything being part of one solution from one vendor" doesn't make one whit of difference in that case, since the whole architecture is expected to be built around having a one-way pipeline of mutually-opaque interoperating systems, so any two pipeline stages that can manage to speak to one-another at all can't really be any "more" well-integrated than that.",0,23073975,0,"[23088465]","",0,"","[]",0 23088465,0,"comment","tomnipotent","2020-05-06 06:48:05.000000000","> Especially since, in the particular use-case we're talking about here (data warehousing)
I didn't read any context of data warehousing except for the ClickHouse comment.
When comparing CH to Oracle, there's at best a 10% overlap. Within that overlap, CH is pretty amazing in what it can offer. However, for the remaining 90% Oracle kicks the shit out of CH.
CH does not have to worry about being an OLTP database and everything that entails (transactions, MVCC etc.) That means CH gets to take a LOT of shortcuts to offer what it does.",0,23085244,0,"[23088903]","",0,"","[]",0 23088903,0,"comment","thda","2020-05-06 08:12:56.000000000","I thought that by large datasets you meant TB of data that we see in analytics. And in this area clickhouse is growing and coming to the big companies I work for, one way or another. postgresql handles the OLTP decently enough. That leaves only niches for oracle.",0,23088465,0,"[23092118]","",0,"","[]",0 23102304,0,"comment","pachico","2020-05-07 12:44:46.000000000","It depends on what you have to do. If your stack includes a series of microservices/monoliths connected to the typical OLTP DB then you might very well sit entirely on cloud. Things change when you need heavy lifting like having big ElasticSearch or ClickHouse clusters, or any other operation that requires heavy CPU and high RAM capacity. In that case using providers like Hetzner can cost you 1/10 of the bill compared to AWS.",0,23089999,0,"[]","",0,"","[]",0 19424007,0,"story","gerenuk","2019-03-18 19:00:03.000000000","Hi,
We have a dataset of around 150 million URLs and other metadata in ElasticSearch and are looking for an efficient way to identify duplicate URLs/titles in our dataset. We used ElasticSearch term aggregation, but it becomes very slow, returns only 10,000 URLs, and most of the time it misses URLs.
Currently, we have redis with Sorted Sets; before indexing any URL, we look it up in the redis set.
Options we have explored:
1. Clickhouse: storing all the URLs and running aggregations etc. on them later on?
2. Storing the URLs in redis along with a bloomfilter.
If you have worked on a similar thing, would love to hear your feedback.
Thanks.",0,0,0,"[19424640]","",2,"Ask HN: Identifying duplicate data from a large dataset?","[]",3 15831555,0,"comment","amund","2017-12-02 12:06:48.000000000","Zedge | Data Scientist and Android SWE positions | Trondheim, Norway | ONSITE, FULL-TIME | EU/EEC work permit/visa required | https://corp.zedge.net/join-our-playground
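For the Ask HN question above, option 1 boils down to a plain aggregation (in ClickHouse, roughly `SELECT url, count() AS c FROM urls GROUP BY url HAVING c > 1`; the table and column names here are made up). The equivalent logic, sketched in Python over a toy sample:

```python
# Sketch of the duplicate-finding aggregation, equivalent to
# GROUP BY url HAVING count() > 1 over a toy sample.
from collections import Counter

urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    "https://example.com/c",
    "https://example.com/b",
    "https://example.com/a",
]

counts = Counter(urls)
duplicates = {url: n for url, n in counts.items() if n > 1}
print(duplicates)  # the duplicated URLs with their occurrence counts
```

At 150M rows this is exactly the kind of full-column aggregation a columnar store handles well, since only the URL column needs to be read.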
Zedge (NYSE MKT: ZDGE) provides personalization apps/services (primarily on Android and iOS) for ~30 million monthly active users.
On the data science side we use Hadoop and (increasingly) Clickhouse for analytics in combination with both using and developing Deep Learning (Keras/Tensorflow) for content analysis (e.g. audio and images) and content discovery (e.g. recommender systems and search). We are looking for data scientist candidates that also have solid software engineering skills, a doer mindset and an aptitude to learn.
Blog posts related to some of the things we've been looking into related to Deep Learning:
- https://corp.zedge.net/developers-blog/creative-ai-on-the-ip...
- https://corp.zedge.net/developers-blog/deep-learning-at-zedg...
(I am leading the data science team)",0,15824597,0,"[]","",0,"","[]",0 26263053,0,"comment","ymolodtsov","2021-02-25 14:25:17.000000000","The rise of Clickhouse is fascinating.",0,26262949,0,"[]","",0,"","[]",0 23164263,0,"story","rrampage","2020-05-13 07:05:20.000000000","",0,0,0,"[]","https://www.youtube.com/watch?v=fGG9dApIhDU",2,"ClickHouse-The Fastest Data Warehouse You've Not Heard Of – CMU DB Talk [video]","[]",0 23167494,0,"story","bluestreak","2020-05-13 14:58:17.000000000","",0,0,0,"[]","https://github.com/questdb/questdb/releases/tag/4.2.1",6,"QuestDB 4.2.1 Release. SIMD performance gain 20%. Clickhouse benchmark","[]",0 23171991,0,"story","mmcclure","2020-05-13 21:23:49.000000000","",0,0,0,"[]","https://mux.com/blog/from-russia-with-love-how-clickhouse-saved-our-data/",6,"From Russia with Love: How ClickHouse Saved Our Data","[]",0 23180010,0,"comment","aseipp","2020-05-14 14:33:15.000000000","GPU based databases haven't reached broad adoption because sending things over the PCIe link is a huge waste of time if you can avoid it. Working around this with custom design like NVLink/NVSwitch do is ridiculously expensive (and why a DGX costs a gajillion dollars), and there is simply not enough volume to subsidize it. They are largely analytics focused, because the parallel hardware can obviously map onto primitives like sequential scan and filters relatively easily. Futhermore, data sizes are not small. Thus the architectures tend to emphasize things like in-memory (VRAM) workloads that get scaled horizontally via RDMA (or RoCE, whatever people are doing these days), which is expensive and limited. Major businesses (i.e. people with money, who nvidia are targeting) already pay for proprietary databases, regularly, every day. That's not the barrier. All of the actual true secret sauce is in the hardware design, and you can't replicate that. You're always at Nvidia's mercy to design solutions to their customers needs. 
(And frankly, they've done that pretty well, I think.)
Sure, you can pay almost $10,000 per Tesla V100 (which aren't going to become magically cheaper, all of a sudden), and buy 8x of them. That's a 256GB working set, for the price of like, $70k USD. It might make sense for some things. For everyone else? Pay $30,000 for a single server, run something like ClickHouse, and you'll have a better overall TCO for a vast majority of workloads. It'll saturate every NVMe drive and all the RAM (terabytes) you can give it, and will scale out too. It's got nothing to do with openness and everything to do with system architecture. You can replicate all of this with whatever AMD has and it won't make a single bit of difference in the market.
I don't like the fact Nvidia keeps their software closed either (and in fact it was a motivating reason for replacing my old GTX in my headless server with a Radeon Pro card recently), but the problems you're talking about are not ones of openness.
> If I type a list comprehension in Python, it would run at higher speed and lower power on a SIMD or MIMD platform.
I think you vastly underestimate the complexity of these platforms and how to extract performance from them, if you think it's as simple as your list comprehension going faster now and you hang up your coat and you're done. Sure, when you're experimenting, that 5x raw wall clock time improvement is nice, and you don't think about whether or not you could have done it with comparable hardware under a different cost profile (5x faster is good, but 5x longer wall clock than the GPU but 15x lower power is a winner). But when you're paying millions of dollars for these systems, it's not a matter of "how to make this thing faster", it's "how do I utilize the resources I have, so 85% of this $300,000 machine isn't sitting idle". This thinking is what drives the design of the overall system, and that's much more complicated.",0,23179611,0,"[23181448]","",0,"","[]",0 26306169,0,"comment","caust1c","2021-03-01 18:29:05.000000000","Segment by Twilio | Engineering | SF, NYC, Vancouver | Full-Time | Onsite & Remote (US & Canada)
Segment is building the customer data platform for everyone. We transform data and route to over 200 different integrations, adding new ones every day. We're processing billions of events daily and maintain the analytics infrastructure for Fortune 500s. Recently acquired by Twilio, we've increased our hiring velocity substantially. Our goal is to help companies learn from how their users interact with the products to build even better products. We also like to share our work and what we learn, here are some examples:
- https://segment.com/blog/kafka-optimization
- https://segment.com/blog/kubernetes-configuration
- https://segment.com/blog/goodbye-microservices
- https://segment.com/blog/the-10m-engineering-problem
- https://open.segment.com
We have a modern stack consisting of Go (golang), AWS ECS/EKS, Docker, Typescript, React, GraphQL, Kafka, Clickhouse and others! If any of this sounds interesting, we'd love to hear from you! Check out our open positions at https://segment.com/careers/ . If you have specific questions about a role or company culture, feel free to reach out to me: alan ⒜ segment.com (but please do apply on the site).",0,26304051,0,"[]","",0,"","[]",0
26306170,0,"comment","caseyaedwards","2021-03-01 18:29:09.000000000","Tesla | Senior or Staff Site Reliability Engineer (SRE), Manufacturing Systems | Fremont, CA
The Core Automation Services (CAS) team at Tesla is building applications to enable manufacturing, with an eye towards reliability, availability, scalability, speed and security. We're a diverse team composed of Controls Automation Engineers, Software Engineers, and various other disciplines that help facilitate automated manufacturing processes. As an SRE on the CAS team you'll be working with the infrastructure, systems and applications that act as the middleware layer between Programmable Logic Controllers (PLCs) and the outside world, such as Databases, MES systems and other services.
Location: Fremont, CA
Responsibilities:
* Support interim HMI/SCADA vendor application (Ignition from Inductive Automation)
* Building tooling around it, evaluating its usage, and helping to ensure its reliability, availability and security
* Design software and systems that enable automated manufacturing at Tesla
* Assist Software, Controls, Manufacturing and other types of Engineers with onboarding and integrating services into the Tesla technology stack
* Ensuring best practices and observability of the service, such as metrics, logging, tracing, and alerting
* Automate configuration and deployment of services
* Consult on and design infrastructure, systems and application architecture
Apply at: https://www.tesla.com/careers/search/job/site-reliability-en...
=======================
Tesla | Database Site Reliability Engineer, Manufacturing Systems | Fremont, CA
As a Database SRE on the Core Automation Services (CAS) team you'll be setting up and managing the databases, including MySQL, CockroachDB, FoundationDB, Clickhouse, and InfluxDB that back various software and systems that enable manufacturing in our various factories.
Location: Fremont, CA
Responsibilities:
* Evaluate current database deployments and make recommendations for how to improve their reliability, availability, scalability and security
* Design and implement automation for managing the deployment and upgrades of the databases
* Define Disaster Recovery and Business Continuity plans for the various database deployments
* Assist Software, Controls, Manufacturing and other types of engineers with using databases sustainably
* Ensuring best practices and observability of the databases, such as metrics, logging, tracing, and alerting
* Consult on and design infrastructure, systems and application architecture
Requirements:
* Experience with running databases on bare-metal or VMs
* Expert skills in Linux and its administration
* Experience in a high level language such as Go, Python and/or Java
* Understand the concepts of Observability and Infrastructure as Code
* Comfortable on an on-call rotation
* Comfortable doing live troubleshooting of issues on NOC bridges/outage calls
* Habitual documenter and spreader of knowledge
* Willing to mentor other team members and engineers with less database knowledge
* Strong bias for action vs endless planning, willing to get hands dirty and make mistakes sometimes
* 3+ years as DBA/SRE
Apply at: https://www.tesla.com/careers/search/job/database-site-relia...",0,26304051,0,"[]","",0,"","[]",0
26316401,0,"story","jetter","2021-03-02 15:40:32.000000000","",0,0,0,"[26320811,26317841,26318260,26317060,26317759,26318150,26318082,26317791,26317445,26317438,26323542,26319794,26317672,26317058,26318189,26317318,26321444,26321989,26320145,26317032,26321483,26318448,26319059,26324073,26317768,26316829,26316499]","https://pixeljets.com/blog/clickhouse-vs-elasticsearch/",383,"ClickHouse as an alternative to Elasticsearch for log storage and analysis","[]",134
26317058,0,"comment","wakatime","2021-03-02 16:32:06.000000000","A related database using ideas from Clickhouse: https://github.com/VictoriaMetrics/VictoriaMetrics",0,26316401,0,"[26317796]","",0,"","[]",0 26317060,0,"comment","sylvinus","2021-03-02 16:32:10.000000000","ClickHouse is incredible. It has also replaced a large, expensive and slow Elasticsearch cluster at Contentsquare. We are actually starting an internal team to improve it and upstream patches, email me if interested!",0,26316401,0,"[26318707,26317969,26317775]","",0,"","[]",0 26317318,0,"comment","harporoeder","2021-03-02 16:53:28.000000000","I don't have any production experience running Clickhouse, but I have used it on a side project for an OLAP workload. Compared to Postgres, Clickhouse was a couple orders of magnitude faster (for the query pattern), and it was pretty easy to set up a single-node configuration compared to lots of the "big data" stuff. Clickhouse is really a game changer.",0,26316401,0,"[26317393]","",0,"","[]",0 26317438,0,"comment","dominotw","2021-03-02 17:01:55.000000000","How does clickhouse compare to druid, pinot, rockset (commercial), memsql (commercial). I know clickhouse is easier to deploy.
But from user's perspective is clickhouse superior to the others?",0,26316401,0,"[26318733,26322178,26318460,26318218]","",0,"","[]",0 26317445,0,"comment","BoorishBears","2021-03-02 17:02:22.000000000","My biggest problem with Elasticsearch is how easy it is to get data in there and think everything is just fine... until it falls flat on its face the moment you hit some random use case that, according to Murphy's law, will also be a very important one.
I wish Elasticsearch were maybe a little more opinionated in its defaults. In some ways Clickhouse feels like they filled the gap not having opinionated defaults created. My usage is from a few years back so maybe things have improved",0,26316401,0,"[26317607]","",0,"","[]",0 26317672,0,"comment","moralestapia","2021-03-02 17:19:18.000000000","Sorry to hijack the thread but can anyone suggest alternatives to the 'search' side of Elasticsearch?
I haven't been following the topic and there's probably new and interesting developments like ClickHouse is for logging.",0,26316401,0,"[26325652,26323607,26318268,26317806,26317725,26318058,26317811,26319299]","",0,"","[]",0 26317711,0,"comment","aseipp","2021-03-02 17:21:41.000000000","There's replication in ClickHouse and you can just shove reads off to one of them if you'd like. From a backup/safety standpoint that's important, but I think there are other options besides just replicas, of course.
From an operations standpoint, however, ClickHouse is ridiculously efficient at what it does. You can store tens of billions, probably trillions of records on a single node machine. You can query at tens of billions of rows a second, etc, all with SQL. (The only competitor I know of in the same class is MemSQL.) So another thing to keep in mind is you'll be able to go much further with a single node using ClickHouse than the alternatives. For OLAP style workloads, it's well worth investigating.",0,26317393,0,"[26336987]","",0,"","[]",0 26317759,0,"comment","js2","2021-03-02 17:26:27.000000000","Sentry.io is using ClickHouse for search, with an API they built on top of it to make it easier to transition if need be. They blogged about it at the time they adopted it:
https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-...",0,26316401,0,"[26336863]","",0,"","[]",0 26317768,0,"comment","moralsupply","2021-03-02 17:26:56.000000000","I'm happy that more people are "discovering" ClickHouse.
ClickHouse is an outstanding product, with great capabilities that serve a wide array of big data use cases.
It's simple to deploy, simple to operate, simple to ingest large amounts of data, simple to scale, and simple to query.
We've been using ClickHouse to handle 100's of TB of data for workloads that require ranking on multi-dimensional timeseries aggregations, and we can resolve most complex queries in less than 500ms under load.",0,26316401,0,"[]","",0,"","[]",0 26317791,0,"comment","dabeeeenster","2021-03-02 17:28:30.000000000","I've been recording a podcast with Commercial Open Source company founders (Plug! https://www.flagsmith.com/podcast) and have been surprised how often Clickhouse has come up. It is always referred to with glowing praise/couldn't have built our business without it etc etc etc.",0,26316401,0,"[26317912]","",0,"","[]",0 26317796,0,"comment","wikibob","2021-03-02 17:28:36.000000000","Are you familiar with VictoriaMetrics?
Can you elaborate on how it is similar and dissimilar to Clickhouse?
What specific techniques are the same?",0,26317058,0,"[26319434]","",0,"","[]",0 26317841,0,"comment","tgtweak","2021-03-02 17:32:51.000000000","I think it's an unfair comparison, notably because:
1) Clickhouse is rigid-schema + append-only - you can't simply dump semi-structured data (csv/json/documents) into it and worry about schema (index definition) + querying later. The only clickhouse integration I've seen up close had a lot of "json" blobs in it as a workaround, which cannot be queried with the same ease as in ES.
2) Clickhouse scalability is not as simple/documented as elasticsearch. You can set up a 200-node ES cluster with a relatively simple helm config or readily-available cloudformation recipe.
3) Elastic is more than elasticsearch - kibana and the "on top of elasticsearch" featureset is pretty substantial.
4) Every language/platform under the sun (except powerbi... god damnit) has native + mature client drivers for elasticsearch, and you can fall back to bog-standard http calls for querying if you need/want. ClickHouse supports some very elementary SQL primitives ("ANSI") and even those have some gotchas and are far from drop-in.
In this manner, I think that clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases, and less a replacement for elasticsearch. If you're using Elasticsearch for OLAP, you're probably better off ETLing the semi-structured/raw data you specifically want out of ES into a more suitable database that is meant for that.",0,26316401,0,"[26317929,26318186,26318077,26318599,26318405,26321270,26317913,26336909]","",0,"","[]",0 26317912,0,"comment","jabo","2021-03-02 17:39:00.000000000","Thanks for sharing your podcast! Just subscribed.
To add yet another data point, we use Clickhouse as well for centralized logging for the SaaS version of our open source product, and can't imagine what we would have done without it.",0,26317791,0,"[]","",0,"","[]",0 26317913,0,"comment","ignoramous","2021-03-02 17:39:01.000000000","> Elastic is more than elasticsearch...
Grafana Labs' sponsored FOSS projects are probably an adequate replacement for Elasticsearch? https://grafana.com/oss/
> ...clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases
Aurora would likely be worse at this than Redshift or Snowflake.",0,26317841,0,"[]","",0,"","[]",0 26317929,0,"comment","jetter","2021-03-02 17:40:08.000000000","I address your concern from #1 in the "2. Flexible schema - but strict when you need it" section - take a look at https://www.youtube.com/watch?v=pZkKsfr8n3M&feature=emb_titl...
Regarding #2: Clickhouse scalability is not simple, but I don't think Elasticsearch scalability is that simple either; they just have it out of the box, while in Clickhouse you have to use Zookeeper for it. I agree that for 200 nodes ES may be a better choice, especially for full-text search. For 5 nodes of 10 TB of logs data I would choose Clickhouse.
#3 is totally true. I mention it in "Cons" section - Kibana and ecosystem may be a deal breaker for a lot of people.
#4. Clickhouse in 2021 has pretty good support in all major languages. And it can talk HTTP, too.",0,26317841,0,"[26318676]","",0,"","[]",0 26318082,0,"comment","kaak3","2021-03-02 17:51:54.000000000","Uber recently blogged that they rebuilt the log analytics platform based on ClickHouse, replacing the previous ELK based one. The table schema choices made it easy to handle JSON formatted logs with changing schemas. https://eng.uber.com/logging/",0,26316401,0,"[26318119]","",0,"","[]",0 26318150,0,"comment","guardiangod","2021-03-02 17:56:08.000000000","I am using Clickhouse at my workplace as a side project. I wrote a Rust app that dumps the daily traffic data collected from my company's products into a ClickHouse database.
That's 1-5 billion rows, per day, with 60 days of data, onto a single i5 3500 desktop I have laying around. It returns a complex query in less than 5 minutes.
I was gonna get a beefier server, but 5 minutes is fine for my task. I was flabbergasted.",0,26316401,0,"[26318239]","",0,"","[]",0 26318218,0,"comment","jetter","2021-03-02 18:01:44.000000000","I've mentioned Pinot and Druid briefly in my 2018 writeup: https://pixeljets.com/blog/clickhouse-as-a-replacement-for-e... (see "Compete with Pinot and Druid" )",0,26317438,0,"[]","",0,"","[]",0 26318460,0,"comment","caust1c","2021-03-02 18:19:22.000000000","When Cloudflare was considering clickhouse, we did estimates on just the hardware cost, and Druid came out well over 10x what clickhouse would cost, based on Druid's given numbers on events processed per compute unit.",0,26317438,0,"[26318769]","",0,"","[]",0 26318599,0,"comment","hodgesrm","2021-03-02 18:29:25.000000000","I'm the author of at least one of the ClickHouse video presentations referenced in the article as well as here on HN. ElasticSearch is a great product, but three of your points undersell ClickHouse capabilities considerably.
1.) ClickHouse JSON blobs are queryable and can be turned into columns as needed. The Uber engineering team posted a great write-up on their new log management platform, which uses these capabilities at large scale. One of the enabling ClickHouse features is ALTER TABLE commands that just change metadata, so you can extend schema very efficiently. [1]
2.) With reference to scalability, the question is not what it takes to get 200 nodes up and running but what you get from them. ClickHouse typically gets better query results on log management using far fewer resources than ElasticSearch. ContentSquare did a great talk on the performance gains including 10x speed-up in queries and 11x reduction in cost. [2]
3.) Kibana is excellent and well-liked by users. Elastic has done a great job on it. This is an area where the ClickHouse ecosystem needs to grow.
4.) This is just flat-out wrong. ClickHouse has a very powerful SQL implementation that is particularly strong at helping to reduce I/O, compute aggregations efficiently and solve specific use cases like funnel analysis. It has the best implementation of arrays of any DBMS I know of. [3] Drivers are maturing rapidly, but to be honest, it's so easy to submit queries via HTTP that you don't need a driver for many use cases. My own team does that for PHP.
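On the "queries via HTTP" point: ClickHouse listens for plain HTTP on port 8123 by default, so a driver really is optional. A minimal sketch in Python (the localhost URL and the query are assumptions for illustration):

```python
import urllib.request

# ClickHouse's HTTP interface: POST the SQL as the request body and read
# the response; the FORMAT clause controls the output encoding.
def clickhouse_query(sql: str, url: str = "http://localhost:8123/") -> str:
    req = urllib.request.Request(url, data=sql.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

sql = "SELECT count() FROM system.tables FORMAT TabSeparated"
# result = clickhouse_query(sql)   # needs a running ClickHouse server
```

Any language with an HTTP client can do the same, which is why thin or absent drivers are less of a problem than they might appear.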
I don't want to take away anything from Elastic's work--ElasticSearch and the ecosystem products are great, as shown by their wide adoption. At the same time ClickHouse is advancing very quickly and has much better capabilities than many people know.
p.s., As far as ANSI capability, we're working on TPC-DS and have ClickHouse running at full steam on over 60% of the cases. That's up from 15% a year ago. We'll have more to say on that publicly later this year.
[1] https://eng.uber.com/logging/
[2] https://www.slideshare.net/VianneyFOUCAULT/meetup-a-successf...
[3] https://altinity.com/blog/harnessing-the-power-of-clickhouse...
p.s., I'm CEO of Altinity and work on ClickHouse, so usual disclaimers.",0,26317841,0,"[26318748]","",0,"","[]",0 26318658,0,"comment","AurimasJLT","2021-03-02 18:34:04.000000000","https://github.com/ClickHouse/clickhouse-presentations/blob/... and the presentation itself https://www.youtube.com/watch?v=lwYSYMwpJOU 300Elastic nodes vs 12ClickHouse nodes / 260TB/ lots of querries",0,26317775,0,"[]","",0,"","[]",0 26318707,0,"comment","hodgesrm","2021-03-02 18:38:25.000000000","Your 2019 talk on this was great. I cited it above. Here's a link to slides to supplement others elsewhere in the thread: https://www.slideshare.net/VianneyFOUCAULT/meetup-a-successf...",0,26317060,0,"[]","",0,"","[]",0 26318733,0,"comment","AurimasJLT","2021-03-02 18:40:02.000000000","Yeah Druid got blown away by ClickHouse at eBay Druid 700+ servers versus 2 region fully replicated ClickHouse system of 40 nodes. https://tech.ebayinc.com/engineering/ou-online-analytical-pr... and a webinar they did with us at Altinity https://www.youtube.com/watch?v=KI0AqpmcSOk&t=20s",0,26317438,0,"[]","",0,"","[]",0 26318748,0,"comment","jetter","2021-03-02 18:40:50.000000000","Thank you for what you guys do. Altinity blog and videos are an outstanding source of practical in-depth knowledge on the subject, so much needed for Clickhouse recognition.",0,26318599,0,"[26318807]","",0,"","[]",0 26318769,0,"comment","advaita","2021-03-02 18:42:55.000000000","Are you saying druid hardware costs were coming out to be 10x of clickhouse hardware costs?
Caveat: English is not my first language so might have missed your point in translation. :)",0,26318460,0,"[26320294]","",0,"","[]",0 26319059,0,"comment","didip","2021-03-02 19:06:53.000000000","Does ClickHouse have integration with Superset and Grafana?",0,26316401,0,"[26319496]","",0,"","[]",0 26319434,0,"comment","ekimekim","2021-03-02 19:34:37.000000000","The core storage engine borrows heavily from it - I'll attempt to summarize and apologies for any errors, it's been a while since I worked with VictoriaMetrics or ClickHouse.
Basically data is stored in sorted "runs". Appending is cheap because you just create a new run. You have a background "merge" operation that coalesces runs into larger runs periodically, amortizing write costs. Reads are very efficient as long as you're doing range queries (very likely on a time-series database) as you need only linearly scan the portion of each run that contains your time range.",0,26317796,0,"[]","",0,"","[]",0 26319496,0,"comment","daniel_levine","2021-03-02 19:38:35.000000000","Yes to both, Altinity maintains the ClickHouse Grafana plugin https://altinity.com/blog/2019/12/28/creating-beautiful-graf...
And Superset has a recommendation of a ClickHouse connector https://superset.apache.org/docs/databases/clickhouse",0,26319059,0,"[]","",0,"","[]",0 26319794,0,"comment","jcims","2021-03-02 20:00:13.000000000","Does ClickHouse, or anything else out there, even remotely compete with Splunk for ad-hoc troubleshooting/forensics/threat-hunting type work?
I started off with Splunk and every time I try Elasticsearch I feel like I'm stuck in a cage. Probably why they can charge so much for it.",0,26316401,0,"[26320477]","",0,"","[]",0 26320294,0,"comment","caust1c","2021-03-02 20:37:31.000000000","That's right, yeah. We would have had to buy 10x the servers in order to support the same workload that clickhouse could.",0,26318769,0,"[]","",0,"","[]",0 26320811,0,"comment","the-alchemist","2021-03-02 21:11:26.000000000","Also wanted to share my overall positive experience with Clickhouse.
UPSIDES
* started a 3-node cluster using the official Docker images super quickly
* ingested billions of rows super fast
* great compression (of course, depends on your data's characteristics)
* features like https://clickhouse.tech/docs/en/engines/table-engines/merget... are amazing to see
* ODBC support. I initially said "Who uses that??", but we used it to connect to PostgreSQL, so we can keep the non-timeseries data in PostgreSQL but still access those tables from Clickhouse (!)
* you can go the other way too: read Clickhouse from PostgreSQL (see https://github.com/Percona-Lab/clickhousedb_fdw, although we didn't try this)
* PRs welcome, and quickly reviewed. (We improved the ODBC UUID support)
* code quality is pretty high.
DOWNSIDES
* limited JOIN capabilities, which is expected from a timeseries-oriented database like Clickhouse. It's almost impossible to implement JOINs at this kind of scale. The philosophy is "If it won't be fast at scale, we don't support it"
* not-quite-standard SQL syntax, but they've been improving it
* limited DELETE support, which is also expected from this kind of database, but rarely used in the kinds of environments that CH usually runs in (how often do people delete data from ElasticSearch?)
It's really an impressive piece of engineering. Hats off to the Yandex crew.",0,26316401,0,"[26322916,26322264,26321358,26321558,26323524]","",0,"","[]",0 26321270,0,"comment","outworlder","2021-03-02 21:42:22.000000000","> In this manner, I think that clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases, and less a replacement for elasticsearch.
Neither of which is normally used for logging.
I am glad there are some alternatives to ELK. Elasticsearch is great, but it's not as great when you have to ingest terabytes of logs daily. You can do it, but at a very large resource cost (both computing and human). Managing shards is a headache with the logging use-case.
Most logs don't have that much structure. A few fields, sure. For this, Elasticsearch is not only overkill, but also not very well suited. This is the reason why placing Kafka in front of Elasticsearch for ingestion is rather popular.",0,26317841,0,"[]","",0,"","[]",0 26321558,0,"comment","yamrzou","2021-03-02 22:00:11.000000000","Could you share more details about the limited JOIN capabilities? AFAIK, Clickhouse has multiple join algorithms and supports on-disk joins to avoid out of memory:
https://github.com/ClickHouse/ClickHouse/issues/10830
https://github.com/ClickHouse/ClickHouse/issues/9702#issueco...",0,26320811,0,"[26322297]","",0,"","[]",0 26321698,0,"comment","mgachka","2021-03-02 22:09:29.000000000","Hi
I'm the guy who did the two presentations of Clickhouse at ContentSquare. There are no blog posts on the migration from ES to CH, but you can find the slides of the 2018 presentation here: https://www.slideshare.net/VianneyFOUCAULT/clickhouse-meetup... And the slides of the 2019 presentation here: https://www.slideshare.net/VianneyFOUCAULT/meetup-a-successf...
There is also a video recording of the 2019 presentation available here. https://www.youtube.com/watch?v=lwYSYMwpJOU nb: The video is not great because the camera is often losing focus but it's still understandable.",0,26317969,0,"[]","",0,"","[]",0 26322178,0,"comment","mgachka","2021-03-02 22:45:51.000000000","FYI, we're using clickhouse since 2018 at ContentSquare.
I did a few POCs to compare clickhouse vs other databases on ContentSquare's use case. One of them was memSQL. Although memSQL was very good, since we don't need to JOIN big datasets or killer features like full-text search, clickhouse gave a better perf/cost ratio for us (I don't remember exactly, but it was at least twice as cheap).",0,26317438,0,"[]","",0,"","[]",0 26322264,0,"comment","FridgeSeal","2021-03-02 22:52:53.000000000","Most minor of nitpicks:
> timeseries-oriented database
Technically it’s a column oriented database that is good at time series stuff. I only say that because I know there are some databases that are even more specialised towards timeseries and ClickHouse can do way more.",0,26320811,0,"[26326865]","",0,"","[]",0 26322916,0,"comment","IanCal","2021-03-02 23:53:55.000000000","I'd like to add an upside which is:
Totally great and simple on a single node.
I looked at a bunch of analytical databases, and a lot of them started with "so here's a basic 10-node cluster". Clickhouse installed and worked instantly, with no hassle, on decent but not "big" data. A hundred million rows with lots of heavy text blobs and a lot of columns, that kind of thing. It happily dealt with triple-nested joins over that, and didn't bat an eye at billions of entries in arrays on those columns.
I'm sure I could do some great magic in postgres but naive work didn't give anywhere near the same results as clickhouse (obvious caveat for my workload).
Pretty good with JSON data, my only issue there at the time (may have improved) was you had to format the JSON quite strictly.",0,26320811,0,"[26323799]","",0,"","[]",0 26323081,0,"comment","FridgeSeal","2021-03-03 00:13:20.000000000","ClickHouse will happily replace the ElasticSearch bit, and there’s a few open source dashboards you could use as a kibana stand in: - Metabase (with ClickHouse plug-in) - Superset - Grafana",0,26317032,0,"[]","",0,"","[]",0 26323524,0,"comment","kureikain","2021-03-03 01:19:59.000000000","Can clickhouse deal with medium-large blob data? Say the size of a normal email?
We are using Postgres to store email at my app: https://hanami.run The log is append-only and gets scrubbed daily.
Can clickhouse deal with that? The queries are very simple: just an exact match on a single column (domain), plus pagination.",0,26320811,0,"[26324010]","",0,"","[]",0 26323907,0,"comment","neilknowsbest","2021-03-03 02:18:36.000000000","One issue I've come across is that the query optimizer in Clickhouse does not propagate `where` clauses through a join. My terminology might be wrong, so consider this example:
select * from a inner join b using (id) where b.foo = 'bar'
Clickhouse will not evaluate `foo = 'bar'` before performing the join, so you might wind up with a join that produces a large intermediate result before the filtering happens. Postgres (probably other databases) will optimize this for you. To force Clickhouse to filter first, you would need to write something like
select * from a inner join ( select * from b where foo = 'bar' ) b using (id)
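A toy way to see why that rewrite matters is to count the intermediate rows each plan materializes (table names and data below are made up for illustration):

```python
# Toy data: 'a' has 1,000 ids; 'b' has 10 rows per id, and foo == 'bar'
# on only 1% of them.
a = [{"id": i} for i in range(1000)]
b = [{"id": i % 1000, "foo": "bar" if i % 100 == 0 else "baz"}
     for i in range(10_000)]

# Plan 1: join first, filter after (the behavior described above).
joined = [(ra, rb) for ra in a for rb in b if ra["id"] == rb["id"]]
late = [(ra, rb) for ra, rb in joined if rb["foo"] == "bar"]

# Plan 2: filter b first, then join (the manual-subquery workaround).
b_small = [rb for rb in b if rb["foo"] == "bar"]
early = [(ra, rb) for ra in a for rb in b_small if ra["id"] == rb["id"]]

print(len(joined), len(late), len(early))  # 10000 100 100
```

Both plans return the same 100 rows, but plan 1 materializes a 10,000-row intermediate join while plan 2 joins against just 100 pre-filtered rows.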
Maybe not a strict limitation, but the workaround is a bit janky.",0,26322297,0,"[26324587]","",0,"","[]",0 26325254,0,"comment","tgtweak","2021-03-03 05:54:17.000000000","Imagine trying to make the argument that forcing your developers/clients to send all their telemetry on fixed/rigid schemas to make it immediately queryable is quicker than updating 1 line of an etl script on the data warehouse side. That adding a new queryable field to your event now requires creating migration scripts for the databases and api versioning for the services so things don't break and old clients can continue using the old schema. Imagine making a central telemetry receiver that needs to support 200+ different external apps/clients, with most under active development - adding new events and extending existing ones - being released several times per day. What's the alternative you're proposing? Just put it in a json column and make extractors in the databases every time you want to analyze a field? I've seen this design pattern often enough in MSSQL servers with stored procedures... Talk to me about painful rewrites.
I'll take semi-structured events parsed and indexed by default during ingestion over flat logs + rigid schema events any day. When you force developers to log into a rigid schema you get json blob fields or "extra2" db fields, or perhaps the worst of all, no data at all since it's such a pain in the ass to instrument new events.
We're talking about sending, logging and accessing telemetry. The goal is to "see it" and make it accessible for simple querying and analysis - in realtime ideally, and without a ticket to data engineering.
ES type-inference and wide/open schema with blind json input is second to none as far as simplicity of getting data indexed goes. There are tradeoffs with the defaults, such as indexing lots of text that you don't need to full-text search - you might want to tell ES that it doesn't need to parse every word into an index if you don't want to burn extra cpu and storage for nothing. This is one line of config at the cluster level and can be changed seamlessly while running and ingesting data.
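The kind of mapping tweak being referred to might look like the following sketch (the template name is an assumption; `dynamic_templates` mapping incoming strings to `keyword` is a standard Elasticsearch pattern for skipping full-text analysis):

```python
import json

# A dynamic template telling ES to map incoming string fields as
# exact-match 'keyword' instead of analyzed 'text', so it doesn't
# tokenize every word of large fields at ingest time.
mapping = {
    "mappings": {
        "dynamic_templates": [
            {"strings_as_keywords": {
                "match_mapping_type": "string",
                "mapping": {"type": "keyword"},
            }}
        ]
    }
}
print(json.dumps(mapping))  # request body for an index or index template
```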
I guarantee you there is more semi-structured data in the world than rigid schema, and for one simple reason: It's quicker to generate. The only argument against it has thus far been "yeah but then it's difficult to parse and make it queryable again" and suddenly you've come full circle and you have the reason elasticsearch exists and shines (extended further on both ends by logstash and kibana).
I'm not saying it makes sense to do away with schemas everywhere but for logging and telemetry - of any that you actually care to analyze anyway - there is rarely a reason to go rigid schema on the accepting or processing side since you'll be working with, in the vast majority of cases, semi-structured data.
Changing ES index mappings on the fly is trivial, you can do it with as much ease as ALTER TABLE on clickhouse, and you have the luxury of doing it optimistically and after the fact, once your data/schema has stabilized.
Rewriting the app to accommodate this should never be required unless you really don't know how to use indexing and index mapping in ES. You would, however, have to make changes to your app and/or database and/or ETL every time you wanted to add a new queryable field to your rigid-schema masterpiece.
Ultimately, applications have always and will always generate more data than will ultimately be analyzed, so saving development time on that generating end (by accepting any semi-structured data without first having to define a rigid schema) is more valuable than saving it on the end that is parsing a subset of that data. Having to involve a data team to deploy an alter table so you can query a field from your json doesn't sound like the hallmark of agile self-serve. I also believe strongly and fundamentally that encouraging product teams and their developers to send and analyze as much telemetry as both their hearts desire and DPOs agree to without worrying about the relatively trivial cost of parsing and storing it, will always come out on top vs creating operational complexity over the same. Maybe if you have a small team logging billions of heavy, never-changing events that will seldom get queried, it would tip the scales in favor of using rigid schema. I counter: you don't need telemetry, you need archiving.
On that subject of pure compute and storage/transfer efficiency: Yes both rigid schema and processing-by-exception will win here every time as far as cycles and bits go. Rarely is the inefficiency of semi-structured so high that it merits handicapping an entire engineering org into dancing around rigid schemas to get their telemetry accepted and into a dashboard.
I hear you, platform ops teams... "But the developers will send big events! ! There will be lots of data that we're parsing and indexing for nothing!" Ok - so add a provision to selectively ignore those? Maybe tell the offender to stop doing it? On the rare occasion that this happens (I've seen 1 or 2 events out of 100s in my anecdotal experience) you may require some human intervention. Compare this labor requirement to the proposed system where human intervention is required every time somebody wants to look at their fancy new field.
In practice, I've not seen it be a nightmare unless you've got some very bad practices on the ingestion or indexing side - both of which are easily remedied without changing much, if anything, outside of ES.
I think ClickHouse is pretty cool, but it's not handling anywhere near the constraints that ES does, even without Logstash and Kibana. ES is also getting faster and more efficient at ingestion/parsing with every release - releases that seem to be coming faster and faster these days.",0,26318405,0,"[]","",0,"","[]",0 26326865,0,"comment","valyala","2021-03-03 10:56:30.000000000","I'm working on VictoriaMetrics - a fast specialized time series database and monitoring solution built on top of ClickHouse architecture ideas. [1]
[1] https://valyala.medium.com/how-victoriametrics-makes-instant...",1,26322264,0,"[]","",0,"","[]",0 15915304,0,"comment","srcmap","2017-12-13 16:37:30.000000000",""There is half a terabyte of DDR4 RAM on each machine."
I also wonder how much of the speedup is due to the GPU parallelization vs the DB fitting within that much DDR4 RAM.
" 0.005 0.011 0.103 0.188 BrytlytDB 2.1 & 5-node IBM Minsky cluster" NVME
"1.034 3.058 5.354 12.748 ClickHouse, Intel Core i5 4670K" SSD
The Core i5 system only has 16GB of RAM. I'd love to see what that number would look like if it also had 512GB or more of DDR4 with an NVMe drive. One could also try increasing the data size from 1.1B -> 1000B rows and see how it scales on the Minsky cluster.",0,15914950,0,"[]","",0,"","[]",0 26343294,0,"story","wooguru","2021-03-04 15:06:40.000000000","",0,0,0,"[26364895]","https://twiwoo.com/docker/clickhouse-server-in-1-minute-with-docker/",3,"ClickHouse Server in 1 minute with Docker","[]",1 26351936,0,"comment","zX41ZdbW","2021-03-05 02:50:06.000000000","A similar list from the ClickHouse repository: https://github.com/ClickHouse/ClickHouse/blob/master/base/ha...",0,26347867,0,"[]","",0,"","[]",0 26368435,0,"comment","azophy_2","2021-03-06 16:25:36.000000000","There was a HN thread a couple of days ago discussing Clickhouse. I haven't tried it myself, but it may be useful:
https://news.ycombinator.com/item?id=26316401",0,26367952,0,"[]","",0,"","[]",0 19550281,0,"comment","hwwc","2019-04-02 01:39:32.000000000","SEEKING WORK | Backend Services and Data Engineering
Location: US | Remote: Yes
I'm a software engineer experienced in all parts of a data-analytics backend-stack: from ETL to database design to web-API to devops. One of my major projects is an analytics engine for web applications (https://github.com/hwchen/tesseract).
I'm looking for a 10-20 hr/week contract writing robust, performant, and ergonomic applications for processing and querying data.
Primary Skills: Rust, Linux, Google Compute Platform, ClickhouseDB, Postgresql
Production Experience with: Python/Pandas, Node/JS, AWS, Docker, Redis, MySql, Nginx
Github: https://github.com/hwchen
Contact: hello@hwc.io",0,19543939,0,"[]","",0,"","[]",0 26375042,0,"comment","arp242","2021-03-07 11:34:15.000000000","The table is essentially:
| site_id | created_at | [..other columns trimmed..]
| 1 | 2021-06-18 | [..]
| 2 | 2021-06-18 | [..]
| 1 | 2021-06-19 | [..]
| 2 | 2021-06-19 | [..]
| 3 | 2021-06-19 | [..]
Site 1 may choose to keep things for 30 days, site 2 for 50 days, and site 3 may choose to keep things indefinitely. The solution I use now is to just run a background job every day with "delete from hits where site_id = ? and created_at < now() - interval ?". It's pretty simple, but it works.
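For illustration, a minimal self-contained sketch of such a per-site retention job, here in Python with SQLite; the `sites`/`hits` schema and the retention values are hypothetical, not Plausible's actual tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- retention_days NULL means "keep forever"
    CREATE TABLE sites (site_id INTEGER PRIMARY KEY, retention_days INTEGER);
    CREATE TABLE hits  (site_id INTEGER, created_at TEXT);
    INSERT INTO sites VALUES (1, 30), (2, 50), (3, NULL);
    INSERT INTO hits VALUES
        (1, date('now', '-40 days')),   -- past site 1's 30-day window: purged
        (1, date('now', '-10 days')),   -- within the window: kept
        (2, date('now', '-40 days')),   -- within site 2's 50-day window: kept
        (3, date('now', '-400 days'));  -- site 3 keeps everything
""")

def purge_expired(conn):
    # One DELETE per site that has a finite retention period.
    for site_id, days in conn.execute(
            "SELECT site_id, retention_days FROM sites "
            "WHERE retention_days IS NOT NULL"):
        conn.execute(
            "DELETE FROM hits WHERE site_id = ? AND created_at < date('now', ?)",
            (site_id, f"-{days} days"))
    conn.commit()

purge_expired(conn)
print(conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0])  # → 3
```

The same shape works against PostgreSQL with `now() - interval`; only the date arithmetic syntax differs, which is exactly the kind of small PostgreSQL/SQLite branch mentioned above.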
There are probably some clever things I can do: partition by site ID and then date range, or something like that. But this has a bunch of downsides too:
- I like to keep things as simple as possible to make it easier for people to self-host this, which is a big reason I use PostgreSQL in the first place. If it was just some SaaS I was running then I wouldn't have a problem turning up the complexity a bit if it gave me serious benefits, but when you distribute software that you expect other people to be able to run, it's a bit more of a trade-off.
- The entire thing also runs on SQLite, and I'd like to minimize the number of special PostgreSQL/SQLite branches when possible.
Ideally, what I'd like is that people can optionally plug in something like Citus if they need it (and many don't! Even SQLite actually works fine for many) and have standard SQL "just work" without sweeping architectural changes. They can switch it on/off as they please. I don't mind adding a few "if using_citus then ..." exceptions here or there, but there's a limit, especially since many people just don't need it (but would still be "stuck" with a much more complex table structure because of it).
This, for me anyway, is the appeal of things like Citus or TimescaleDB vs. more specialized solutions like Clickhouse or whatnot. I don't need to get the "fastest possible solution", and these solutions strike a good middle ground between complexity/ease of setup vs. speed.
There is also a second use case for UPDATE (the above is mostly for DELETE): right now I pre-compute some data and run most queries on that, rather than on the main append-only table of events, because this is rather faster in various cases. Because the data is aggregated by hour or day, it needs to update those rows when new events come in. The way I do that now is with a unique key and "insert [..] on conflict [..] do update [..]", but there's a bunch of other ways to do this (this is probably the easiest, though).
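As an aside, that insert-or-update pattern can be sketched in Python with SQLite, which supports the same ON CONFLICT upsert syntax (since SQLite 3.24); the `hit_counts` table and column names here are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hit_counts (
        site_id INTEGER,
        day     TEXT,
        hits    INTEGER,
        UNIQUE (site_id, day)   -- the unique key the upsert targets
    )
""")

def record_hit(conn, site_id, day):
    # Insert a fresh row, or bump the counter if (site_id, day) already exists.
    conn.execute("""
        INSERT INTO hit_counts (site_id, day, hits) VALUES (?, ?, 1)
        ON CONFLICT (site_id, day) DO UPDATE SET hits = hits + 1
    """, (site_id, day))

for _ in range(3):
    record_hit(conn, 1, "2021-06-18")
record_hit(conn, 2, "2021-06-18")

print(conn.execute("SELECT site_id, hits FROM hit_counts ORDER BY site_id").fetchall())
# → [(1, 3), (2, 1)]
```

PostgreSQL accepts the identical statement, so the upsert path needs no per-database branch.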
In principle those tables should become obsolete with a column store, but when I tried cstore_fdw last year this wasn't really fast enough (although this version may be, I didn't try yet). Even if it would be fast enough, this still means I'd have to write all queries twice: once for the "normal" PostgreSQL/SQLite use case, and once for the Citus use case, which isn't especially appealing.
So, tl;dr, I want Citus to be an optional thing people can use, and maintain compatibility with mainstream PostgreSQL and SQLite (and, possibly, other engines in the future too).
Perhaps this is a rare use case? I don't know. But this is my use case.",0,26374003,0,"[]","",0,"","[]",0 26375679,0,"comment","Kovah","2021-03-07 13:27:07.000000000","I cannot use the docker-compose setup they provide on my server, because it interferes with my existing Docker containers. Setting up the Clickhouse container never worked because Plausible wasn't able to connect properly to it. After researching for hours, I gave up. Not sure what the issue was.",0,26375073,0,"[26375909,26380398]","",0,"","[]",0 23263792,0,"comment","pachico","2020-05-21 19:30:31.000000000","I thought Druid lost the battle against ClickHouse long ago. Am I wrong?",0,23261198,0,"[23264662,23264001,23267103]","",0,"","[]",0 23264662,0,"comment","gilbetron","2020-05-21 20:27:22.000000000","Incorrect. They both have their strengths and weaknesses:
https://medium.com/@leventov/comparison-of-the-open-source-o...",0,23263792,0,"[23268474]","",0,"","[]",0 26380398,0,"comment","amanzi","2021-03-07 22:26:46.000000000","I've been self-hosting Plausible for a few months now, but you're right - there's a bit of fiddling to get everything working. I ended up creating dedicated hosts for the Clickhouse and Postgres databases, and then playing with the connection strings in the Plausible container to connect to them.",0,26375679,0,"[]","",0,"","[]",0 26381726,0,"story","yousef63","2021-03-08 02:13:02.000000000","",1,0,0,"[]","https://clickhouse.tech/blog/en/2020/pixel-benchmark/",1,"Running ClickHouse on an Android Phone","[]",0 26383043,0,"comment","st1ck","2021-03-08 07:19:04.000000000","Dataset similar to yours (github data) which you can query using Clickhouse: https://gh.clickhouse.tech/explorer/",0,26371706,0,"[]","",0,"","[]",0 23267103,0,"comment","otabdeveloper4","2020-05-21 23:58:46.000000000","You're right. Clickhouse is the kind of viral grass-roots tech that eventually spontaneously appears in every enterprise.",0,23263792,0,"[23269340]","",0,"","[]",0 23268474,0,"comment","rb808","2020-05-22 03:49:44.000000000","> ClickHouse is simpler and has less moving parts and services.
sounds good to me",0,23264662,0,"[23276256,23269337]","",0,"","[]",0 23269337,0,"comment","pachico","2020-05-22 06:53:47.000000000","I've been using ClickHouse in production to do things I just couldn't do with any other technology, it's not only simpler.",0,23268474,0,"[]","",0,"","[]",0 23274769,0,"comment","petr25102018","2020-05-22 17:38:21.000000000","Why should someone use TimescaleDB over ClickHouse for time-series/analytics workloads?",0,23272992,0,"[23276474,23275928,23275424,23282237,23275414]","",0,"","[]",0 23275414,0,"comment","valyala","2020-05-22 18:38:11.000000000","If you use PostgreSQL, then it feels natural to add TimescaleDB extension and start storing time series or analytical data there alongside other relational data.
If you need to efficiently store trillions of rows and perform real-time OLAP queries over billions of rows, then it is better to use ClickHouse [1], since it requires 10x-100x less compute resources (mostly CPU, disk IO and storage space) than PostgreSQL for such workloads.
If you need to efficiently store and query large amounts of time series data, then take a look at VictoriaMetrics [2]. It is built on ideas from ClickHouse, but is optimized solely for time series workloads. It has comparable performance to ClickHouse, while being easier to set up and manage. And it supports MetricsQL [3] - a query language which is much easier to use than SQL when dealing with time series data. MetricsQL is based on PromQL [4] from Prometheus.
[2] https://github.com/VictoriaMetrics/VictoriaMetrics
[3] https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/Metr...
[4] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...",0,23274769,0,"[23299096,23293710]","",0,"","[]",0 23275424,0,"comment","hagen1778","2020-05-22 18:39:14.000000000","That's a good question! Especially considering these overwhelming benchmarks [1] made via Timescale TSBS [2].
[1] https://www.altinity.com/blog/clickhouse-for-time-series
[2] https://github.com/timescale/tsbs",0,23274769,0,"[23276606]","",0,"","[]",0 23275529,0,"comment","valyala","2020-05-22 18:50:33.000000000","There is more cost effective alternative to BigQuery for storing and analyzing big amounts of logs - LogHouse [1], which is built on ClickHouse.
[1] https://github.com/flant/loghouse",0,23274485,0,"[]","",0,"","[]",0 23275928,0,"comment","manigandham","2020-05-22 19:26:38.000000000","The biggest reason is if you're using Postgres already as an operational database and want some timeseries/analytical capabilities.
Originally Timescale wasn't much more than automatic partitioning, but with the new compression and scale-out features, along with the automatic aggregations and other utilities, it can actually deliver pretty good overall performance. It still won't get you the raw speed of Clickhouse, but instead you get all the functionality of Postgres (extensions, full SQL support, JSON, etc.) and can avoid big ETL jobs.
Another PG extension is Citus, which does scale-out automatic sharding with distributed nodes but is more generalized than Timescale for handling non-timeseries use cases. Microsoft offers Citus on Azure.",0,23274769,0,"[23281163]","",0,"","[]",0 23276256,0,"comment","gilbetron","2020-05-22 19:59:35.000000000","Having administered Druid for several years, ClickHouse's supposed simplicity is definitely appealing were I to start a new project with similar requirements. Then again, back then, I needed Petabyte scale and > 1 million inserts/sec and ClickHouse couldn't do it.",0,23268474,0,"[23277029]","",0,"","[]",0 23276474,0,"comment","k-rus","2020-05-22 20:18:56.000000000","I've heard several points for not choosing ClickHouse and going to TimescaleDB as an extension of PostgreSQL:
1. As already mentioned, if metadata (data about the timeseries) are already in PostgreSQL, then it is nice to stay in the same database engine so data can be queried with joins over both metadata and timeseries data, and there is no need to implement integration of the two sources in the application layer.
2. Also related to the first item: the advantage of already knowing the PostgreSQL API. ClickHouse has a different management API that you would need to learn, while if you know PostgreSQL you only need to learn the timeseries-specific API of TimescaleDB.
3. ClickHouse doesn't support updating and deleting existing data in the same way relational databases do.
The final decision still depends on your needs.",0,23274769,0,"[]","",0,"","[]",0 23282237,0,"comment","cevian","2020-05-23 12:03:23.000000000","Another thing to mention is that TimescaleDB has much stronger ACID guarantees than ClickHouse, which means you get clearer semantics for consistency.",0,23274769,0,"[]","",0,"","[]",0 23288082,0,"comment","smbrian","2020-05-24 00:48:13.000000000","The pitch is faster and more space-efficient, since column stores are far better for analytics than row stores. Some benchmarks that found ~5-10x speedup: https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html
Consider someone who analyzes medium-sized volumes of (slowly-changing) data -- OLAP, not OLTP. People who need to do this primarily have 2 alternatives:
* columnar database (redshift, snowflake, bigquery)
* a data lake architecture (spark, presto, hive)
The latter can be slow and wasteful, because the data is stored in a form that allows very limited indexing. So imagine you want query speeds that require the former.
Traditional databases can be hugely wasteful for this use case -- space overhead due to no compression, slow inserts due to transactions. The best analytics databases are closed-source and come with vendor lock-in (there are very few good open-source column stores -- clickhouse is one, duckdb is another). Most solutions are multi-node, so they come with operational complexity. So DuckDB could fill a niche here -- data that's big enough to be unwieldy, but not big enough to need something like redshift. It's analogous to the niche SQLite fills in the transactional database world.",0,23287675,0,"[23288134,23288197]","",0,"","[]",0 23289488,0,"comment","econcon","2020-05-24 06:00:38.000000000","I've been using clickhouse with our 100gb of click log data per day. It just works; before that we were using bigquery, and bigquery was cheaper than our home-brewed solution :(
So now we use the two together: only recent data, which must be available for fast querying in the user dashboard, is served from the clickhouse cluster.
And old data, where the customer is willing to tolerate slower responses of up to 2 seconds, is served by bigquery.",0,23287278,0,"[]","",0,"","[]",0 23294083,0,"comment","scoresmoke","2020-05-24 19:41:32.000000000","In my case, the typical scenario is running aggregation and analytical functions over a small number of columns, while the number of rows is huge. The SQL syntax is the same, but the columnar storage model better fits this case. A good illustration is available in the ClickHouse documentation: https://clickhouse.tech/docs/en/#why-column-oriented-databas....",0,23292746,0,"[23294562]","",0,"","[]",0 15995806,0,"comment","lima","2017-12-23 19:31:05.000000000","I recently discovered ClickHouse[1], Yandex's recently open sourced BigQuery equivalent (someone on HN pointed me to it!).
If you're looking for an OLAP database running on your own infrastructure, make sure to give it a try.
[1]: https://clickhouse.yandex/",0,15993947,0,"[15996924,15995910]","",0,"","[]",0 26412458,0,"comment","lipanski","2021-03-10 15:49:47.000000000","I recently installed ClickHouse (part of my self-hosted Plausible setup) and had to modify a record inside the database. At this point my only knowledge of ClickHouse was that it comes with an SQL interface. So I opened up a client and typed "SHOW DATABASES" and guess what - it showed me a list of all databases. Then I typed "USE mydatabase" and I was connected to my database. I typed "SHOW TABLES" and got a list of tables, followed by "DESCRIBE TABLE users" and "UPDATE users SET email_verified = true" (FYI I was trying to avoid having to set up SMTP credentials for Plausible). I was able to use ClickHouse without any prior knowledge because the authors decided to base it on a well-known and fairly simple standard instead of inventing their own.
It felt as good as building IKEA furniture without checking the manual, and it's what user/developer experience should be about.",0,26410047,0,"[26419848]","",0,"","[]",0 23299096,0,"comment","quade1664","2020-05-25 08:11:07.000000000","We spent about 6 months looking at pretty much every database tech on the market - cockroach, clickhouse, influx, voltdb, memsql etc were top contenders. There was an outdated article on medium.com (by victoria metrics) which slammed TimescaleDB for its disk usage; we did not realise it was biased, so TSDB dropped off the list. But we saw an email about their compression segmented by device_id and gave it a shot... we implemented it, and 5 months after our production release we now have outstanding performance and compression (95x). We are planning to move the rest of our databases to TSDB now as it ticks our boxes; our use case is HTAP, not solely OLAP or OLTP.
I'm super excited about this news, but TSDB, please work on allowing us to put data over 1 year old on separate slow-disk servers, so we can keep the hot stuff on the NVMe servers. Once you get this sorted, it will be the perfect fit for us.",0,23275414,0,"[23307184,23305214]","",0,"","[]",0 26419848,0,"comment","hodgesrm","2021-03-11 03:29:21.000000000","You've used MySQL, I would guess. Personally I like the superficial similarities to basic MySQL syntax in ClickHouse. MySQL and Sybase T-SQL have always struck me as the friendliest SQL dialects.",0,26412458,0,"[]","",0,"","[]",0 23307184,0,"comment","hodgesrm","2020-05-26 03:39:10.000000000","> TSDB, please work on allowing us to put data over 1 year old on separate slow-disk servers, so we can keep the hot stuff on the NVMe servers. Once you get this sorted, it will be the perfect fit for us.
ClickHouse recently added multi-volume storage for exactly the use case you describe. [1] It's a great feature.
[1] https://www.altinity.com/blog/2019/11/27/amplifying-clickhou...",0,23299096,0,"[]","",0,"","[]",0 26428471,0,"comment","hodgesrm","2021-03-11 20:13:54.000000000","Sometimes it's the lesser of two evils. I'm working on ClickHouse SQLAlchemy driver support to enable better integration with Superset. Superset talks to a bunch of backends and chose SQLAlchemy as the API. It has many of the problems you mention, though since it's read-only there are perhaps fewer of them. Superset does workarounds on top, but they don't have to worry about basic stuff like listing table metadata, distinguishing between tables and views, etc.
Using an ORM seems like a reasonable choice for multi-platform use cases, or at least some of them. The alternative would be to implement something more or less from scratch.",0,26424140,0,"[]","",0,"","[]",0 26433364,0,"story","ligurio","2021-03-12 08:23:09.000000000","",0,0,0,"[26441758]","https://clickhouse.tech/blog/en/2021/fuzzing-clickhouse/",5,"Fuzzing ClickHouse with SQLancer","[]",1 26434119,0,"comment","citrin_ru","2021-03-12 10:25:47.000000000","ClickHouse works well for storing logs, especially if a log message has some fixed structure which can be mapped to separate columns. And most logs have some number of predefined fields which are easy to map to columns (e.g. timestamp, IP, request time, etc.). You can store free-form JSON which doesn't have a pre-defined schema in a string column, but ELK would probably be better in this case.
And then you can query logs using the full power of SQL.",0,26424312,0,"[]","",0,"","[]",0 19612086,0,"story","avivallssa","2019-04-09 05:35:46.000000000","",0,0,0,"[]","https://www.percona.com/blog/2019/03/29/postgresql-access-clickhouse-one-of-the-fastest-column-dbmss-with-clickhousedb_fdw/",2,"PostgreSQL Access ClickHouse, the Fastest Column DBMSs, with Clickhousedb_fdw","[]",0 26445570,0,"story","krnaveen14","2021-03-13 10:26:32.000000000","",0,0,0,"[]","https://altinity.com/blog/clickhouse-nails-cost-efficiency-challenge-against-druid-rockset",4,"ClickHouse nails cost efficiency challenge against Druid and Rockset","[]",0 26446053,0,"comment","em500","2021-03-13 12:10:51.000000000","You could try Clickhouse. It's a bit heavier than DuckDB and the default mode is server-client. But you can also use the client (a single binary) without a server to directly query data from csv or parquet files.
edit: added a better link, the stand-alone mode is called clickhouse-local
https://clickhouse.tech/docs/en/interfaces/cli
https://altinity.com/blog/2019/6/11/clickhouse-local-the-pow...",0,26442745,0,"[26446287]","",0,"","[]",0 26450834,0,"comment","joshxyz","2021-03-13 22:44:28.000000000","Reading official docs (be it postgresql, clickhouse, elasticsearch, redis), a lot of googling and stackoverflow, reading blogs and watching presentations on latest developments (like Altinity does for ClickHouse).
Thing is, I want to cover the latest stable version of the tools I'm using so I can take advantage of their latest features, which might be absent in older ones and not covered by older third-party content. The latest stable docs and changelogs are my best source for that.",0,26425483,0,"[]","",0,"","[]",0 26452863,0,"comment","the_optimist","2021-03-14 04:44:26.000000000","You missed the other two points.
We can all compile C to wasm now.
K is a db that loses to Clickhouse and isn't resilient. J has cleaner syntax, and no crushing commercial license.
Show us something interesting.",0,26442684,0,"[26457218]","",0,"","[]",0 26453182,0,"comment","jinmingjian","2021-03-14 06:09:34.000000000","It is often doubtful when someone uses the word "fastest". You often see one micro-benchmark list ten products and then claim "look, I run in the shortest time".
The problem is that people often compare apples to oranges. Do you know how to correctly use ClickHouse (there are 20-30 engines in ClickHouse to choose from - do you compare an in-memory engine to a disk-persistent database?), Spark, Arrow...? How can you guarantee a fair evaluation among ten or twelve products?",0,26451894,0,"[]","",0,"","[]",0 26453283,0,"comment","f311a","2021-03-14 06:37:49.000000000","I think ClickHouse does that.",0,26452120,0,"[26477070]","",0,"","[]",0 26457218,0,"comment","kelas","2021-03-14 17:53:48.000000000","
> We can all compile C to wasm now.
yes, true. some people did indeed port not just python3 to wasm, but also llvm (not sure about llvm9 to run dex/jax/vax, but still). the practicality of such heroics is questioned by some folks, but tastes differ. "we can compile anything to anything" is a bit of a non-argument. while it is a true statement, the public k9 wasm build is meant to illustrate how well k9 runs on bare metal ("bare", in this particular case). syscall table and two stdlib functions used by the web console are optional.
> clickhouse is winning left right and center
oh, thank you - never heard of that one, Russians produce a lot of excellent software (and sometimes manage to get away with it, see nginx). in particular, 3 queries against famous nyc taxi ride db is a very popular (non-)comparison, i really enjoyed the story of this guy riding a taxi on 100-way cluster: https://tech.marksblogg.com/billion-nyc-taxi-rides-clickhouse-cluster.html
> show us something
maybe this, although old news: https://shakti.sh/benchmark/taxi.k
i know it doesn't look remotely as shiny (and resilient!) as the clickhouse cluster configuration rant up above, but taxi.k is fully self-contained, includes some economy of scale estimations for k9 and some other systems, and will happily do 100-way over ipc if you so desire.
> ..interesting?
maybe this: https://kparc.io/k/#%5Cl%20taxi.k
this is nyc taxi Q1 on your phone. no threads, no ipc, no ec2, no simd, no simdjs. it is just you and your mobile phone. maybe not so impressive. i get about 9ms on iphone12 and about 40ms on mbp2012.
what do you get?
> J is cleaner and open source
while that could be very true as well, especially the open source part, i am not aware of taxi shootout readings for J, so my competence ends here. what i know about J, however, is where it came from - see the legendary "origins of j" anecdote at https://kparc.io
* * *
finally:
> k is a db
that is imprecise. k is a small, agile and integrated system, very well designed. it is a language that gives you a "db" if you need one, but it can also solve euler p572 for you only 10-20 times slower than a carefully crafted c solution (one should be lucky to get ~200ms, but maybe you get luckier).
that's the idea.",0,26452863,0,"[26457419]","",0,"","[]",0 23342509,0,"comment","valyala","2020-05-28 20:57:43.000000000","Take a look at ClickHouse [1] and VictoriaMetrics [2]. Both solutions share architecture details and are optimized for high performance and low resource usage. They can handle trillions of rows (i.e. more than 10^12 rows) on a single node and can scale to multiple nodes.
[2] https://github.com/VictoriaMetrics/VictoriaMetrics",0,23254639,0,"[]","",0,"","[]",0 23346819,0,"comment","speedgoose","2020-05-29 06:24:18.000000000","Have you tried column based databases such as Cassandra, HPE vertica, or Clickhouse?
Have you tried a "big data" approach like Apache Spark on parquet files?
Did you consider saving less data? Do you really need to save that much data? Can't you sample the data and save a lot less while keeping the same information overall?",0,23346645,0,"[23346894]","",0,"","[]",0 23347196,0,"comment","speedgoose","2020-05-29 07:31:31.000000000","I would first try these column-based databases, but I think you should consider saving less data. Doing what you want is definitely possible but it may be expensive, and if it's only because one non-technical co-worker would love to query so much data for his reports, perhaps you should show him the cost once you have tried a few technologies.
Querying less data but saving everything is also an alternative. With Clickhouse you can specify a sampling rate for example.",0,23346894,0,"[23376430]","",0,"","[]",0 19642501,0,"comment","polskibus","2019-04-12 06:33:36.000000000","How's bigquery performance in comparison to clickhouse? When is it worth switching from one to the other?",0,19632263,0,"[19663807,19643303,19643187,19643207]","",0,"","[]",0 26461781,0,"comment","pupdogg","2021-03-15 01:58:09.000000000","This is an overly complex solution that we were able to resolve using a simple VPS running Clickhouse as backend and Grafana for frontend. Our production db is an Aurora MySQL instance and we keep it lean by performing daily dumps of reporting related data into a CSV with gzip compression -> push it to S3 -> convert it to parquet file format using AWS glue -> bring it into ClickHouse. Data size for these specific reports is approx 100k rows daily and is partitioned by MONTH/YEAR. Overall cost: $20/month VPS and approx. $15/month in AWS billing.",0,26446207,0,"[26470747,26464523]","",0,"","[]",0 26467374,0,"story","feross","2021-03-15 15:47:29.000000000","",0,0,0,"[]","https://thenewstack.io/clickhouse-rapidly-rivals-other-open-source-databases-in-active-contributors/",2,"ClickHouse has rapidly rivaled other open source databases in active","[]",0 23352521,0,"comment","bluestreak","2020-05-29 17:04:04.000000000","Author here.
About a month ago, I posted about using SIMD instructions to make aggregation calculations faster. I am very thankful for the feedback so far, this post is the result of the comments we received last time.
Many comments suggested that we implement compensated summation (aka Kahan) as the naive method could produce inaccurate and unreliable results. This is why we spent some time integrating the Kahan and Neumaier summation algorithms. This post summarises a few things we learned along this journey.
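(For readers who haven't seen compensated summation before, here is a minimal sketch of the Neumaier variant in Python - illustrative only, not the actual QuestDB or Clickhouse implementation:)

```python
def neumaier_sum(values):
    # Compensated summation: carry a running correction term `c` that
    # captures the low-order bits lost when adding numbers of very
    # different magnitude.
    s = 0.0
    c = 0.0  # accumulated correction
    for x in values:
        t = s + x
        if abs(s) >= abs(x):
            c += (s - t) + x  # low-order bits of x were lost in s + x
        else:
            c += (x - t) + s  # low-order bits of s were lost in s + x
        s = t
    return s + c

data = [1.0, 1e100, 1.0, -1e100]
print(sum(data))           # naive float sum → 0.0 (both 1.0s are lost)
print(neumaier_sum(data))  # → 2.0
```

The inner loop is just a handful of adds and compares per element, which is why, with good memory prefetching, it can run nearly as fast as the naive loop.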
We thought Kahan would badly affect the performance since it uses 4x as many operations as the naive approach. However, some comments also suggested we could use prefetch and co-routines to pull the data from RAM to cache in parallel with other CPU instructions. We got phenomenal results thanks to these suggestions, with Kahan sums nearly as fast as the naive approach.
A lot of you also asked if we could compare this with Clickhouse. As they implement Kahan summation, we ran a quick comparison. Here's what we got for summing 1bn doubles with nulls using the Kahan algo. The details of how this was done are in the post.
QuestDB: 68ms
Clickhouse: 139ms
Thanks for all the feedback so far and keep it going so we can continue to improve. Vlad",0,23352517,0,"[23354625]","",0,"","[]",0 26470747,0,"comment","twotwotwo","2021-03-15 23:31:10.000000000","Appreciate the info on real-world use here. A data warehouse would be sort of interesting at work but is not urgently needed (because reporting from MySQL works, without the nifty speedups data warehouses can achieve), and we're somewhere between the size you're talking about and the really-big-data use cases I tend to see blogged about more often. Am curious about ClickHouse and a lower-cost deployment might make it worthwhile when it wouldn't be otherwise.",0,26461781,0,"[]","",0,"","[]",0 26474508,0,"story","jinqueeny","2021-03-16 09:14:30.000000000","",0,0,0,"[]","https://thenewstack.io/clickhouse-rapidly-rivals-other-open-source-databases-in-active-contributors/",1,"ClickHouse Rapidly Rivaled Other Open Source Databases in Active Contributors","[]",0 23362812,0,"comment","hodgesrm","2020-05-30 17:23:14.000000000","The JSON-to-columns approach is a best practice for analytic applications. ClickHouse has a feature called materialized columns that allows you to do this cheaply. You can add new columns that are computed from JSON on existing rows and materialized for new rows.",0,23270912,0,"[]","",0,"","[]",0 26477070,0,"comment","nijave","2021-03-16 14:25:05.000000000","I actually started looking at Clickhouse a couple weeks ago but got a bit side tracked trying to grok how distributed tables work. It looks promising but there's a bit of a learning curve (seems some of the performance also comes from its use of arrays but best I can tell my use case should just use regular tables)",0,26453283,0,"[26481094]","",0,"","[]",0 26481094,0,"comment","hodgesrm","2021-03-16 18:45:52.000000000","ClickHouse performance is principally due to column storage, compression, and ability to parallelize processing. 
Arrays can improve performance in some specific cases but are more commonly used to help deal with semi-structured data or perform custom processing on values within groups.
If your data maps cleanly to tables, that's in fact the best case with the easiest options for performance enhancement.",0,26477070,0,"[]","",0,"","[]",0 26481656,0,"comment","DevKoala","2021-03-16 19:28:06.000000000","K8s. It is night and day how well our organization operates now that we are fully on k8s.
ClickHouse/BigQuery have allowed me to tackle massive analytics projects with a tenth of the effort compared to when I had to set up a map-reduce/spark infra.",0,26477507,0,"[26481683]","",0,"","[]",0 19663807,0,"comment","filimonov","2019-04-15 07:50:33.000000000","For ClickHouse, speed and performance are the #1 priority, so it beats most of the other databases.
So as an in-house solution, ClickHouse most probably would be the fastest option (if your use case suits OLAP requirements).
For clouds / PaaS - it's hard to compare directly. Do you know how many servers will process your BigQuery request? AFAIK BigQuery usually shows a bit higher performance than a single mid-level ClickHouse server (but you can also have a cluster of ClickHouse servers).",0,19642501,0,"[]","",0,"","[]",0 26487416,0,"story","jinmingjian","2021-03-17 06:51:59.000000000","",0,0,0,"[26502753,26497622]","https://tensorbase.io/2021/03/16/announce_base_fe.html",19,"Show HN: TensorBase: 5x~10000x Faster Drop-In/Accelerator for ClickHouse in Rust","[]",3 23376430,0,"comment","psankar","2020-06-01 08:37:52.000000000","This turned out to be good advice. I am evaluating if it is possible to somehow intelligently sample the data when it comes to ELK. Something like an average of 1min data via a logstash filter. We could do this from our backend programs too, but for now, I am trying to do this in ELK during the writes.
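A minimal sketch of that kind of 1-minute averaging in Python - the event shape and field names are made up for illustration; in practice this logic would live in a logstash filter or the backend:

```python
from collections import defaultdict
from datetime import datetime

# Raw (timestamp, value) events; in a real pipeline these would stream in.
events = [
    ("2020-06-01T08:00:05", 10.0),
    ("2020-06-01T08:00:40", 20.0),
    ("2020-06-01T08:01:10", 30.0),
]

def downsample_1min(events):
    # Group events by their minute, then emit one averaged row per minute,
    # so only a fraction of the raw volume gets indexed.
    buckets = defaultdict(list)
    for ts, value in events:
        minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
        buckets[minute].append(value)
    return {m.isoformat(): sum(v) / len(v) for m, v in sorted(buckets.items())}

print(downsample_1min(events))
# → {'2020-06-01T08:00:00': 15.0, '2020-06-01T08:01:00': 30.0}
```

The same bucketing idea extends to other reductions (min/max/count) if an average alone loses too much information.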
Also, we do not use Clickhouse, but I will see if I can somehow apply a sampling rate during my read queries. Thanks.",0,23347196,0,"[]","",0,"","[]",0 23383465,0,"comment","hwwc","2020-06-01 20:36:06.000000000","SEEKING WORK | Backend Services; Data Engineering; Systems Engineering
Location: Boston, US | Remote: Yes
I'm an experienced software engineer looking for part-time and short-term contracts.
I've most recently worked in the data-analytics backend-stack: from ETL to database design to web-api to devops. One of my major projects is an analytics engine for web applications using Rust and Clickhouse (https://github.com/hwchen/tesseract).
However, I'm naturally curious and happy to work in any domain which requires high performance and maintainable code. I've worked with a distributed worker system, debugged async database drivers, and implemented text layout primitives.
Primary Skills: Rust, Python, Linux, Google Compute Platform, ClickhouseDB, Postgresql
Production Experience with: Python/Pandas, Node/JS, AWS, Docker, Redis, MySql, Nginx
Github: https://github.com/hwchen
Contact: hello@hwc.io",0,23379195,0,"[]","",0,"","[]",0 23383544,0,"comment","starduck","2020-06-01 20:41:53.000000000","SEEKING WORK | Design, Full Stack Development & Data Engineering | Location: US | Remote: Yes
We're starduck, a multidisciplinary designer/developer team experienced in the entire web application stack:
- Wireframing & design mockups
- Design systems
- Front & back-end development
- Web accessibility & responsive design
- ETL
- Database design & Data APIs
- Devops & build tooling
For every client, we focus intensely on:
- a coherent design system for better user experience
- performance as a part of the user experience
- maintainable code
- timely and transparent communication
Relevant projects include:
- A web platform for reporting & analyzing the state of open source software (https://opensourcecompass.io/).
- An analytics engine for web applications (https://github.com/hwchen/tesseract).
Primary Skills: Sketch, Photoshop, (S)CSS, JS, React/Vue/Svelte, Rust, Linux, Google Compute Platform, ClickhouseDB, Postgresql
Production experience with: Python/Pandas, Node/JS, AWS, Docker, Redis, MySql, Nginx, PHP
Github: https://github.com/hwchen | https://github.com/perpetualgrimace
Contact: hello@hwc.io",0,23379195,0,"[]","",0,"","[]",0 23384021,0,"comment","simoes","2020-06-01 21:22:23.000000000","Datawheel (datawheel.us) | Front-End Developer and Back-End Developer and Product Designer | Cambridge MA and Washington DC | Full-time
Datawheel is a small but mighty crew of programmers and designers who are here to make sense of the world’s vast amount of data! Learn more about us here: https://www.datawheel.us/
Fullstack Developer
-----------------------------
We are looking for someone comfortable with both front-end and back-end technologies. An ideal candidate is someone who is passionate about what they do and can bring that to the projects assigned to them. We are looking to work with someone on a contract basis with the option to transition to a salaried employee based on performance.
Requirements
-----------------------------
- 3+ years experience
- Eligible for Security Clearance
- Familiarity with Java, Node.js, React
- Comfortable with rapid prototyping
- Experience writing SQL queries
- Experience working with Linux server environments
Bonuses
-----------------------------
- Experience with Scikit-Learn/Tensorflow or other machine learning libraries
- Experience working with ClickHouse or similar columnar databases
- Experience working with GCP and/or similar cloud platforms
- Experience with Docker/Kubernetes
- Experience with Spring Boot
APPLY HERE: https://www.datawheel.us/jobs",0,23379196,0,"[]","",0,"","[]",0
23384994,0,"comment","lykr0n","2020-06-01 22:49:41.000000000","Location: Seattle, WA
Remote: No, sorry :(
Relocation: Maybe.
Programming languages: Rust 2018, Bash, Python. Limited JavaScript & Go. Knowledge of C++ and Java.
Technologies: CentOS 7/8, Puppet, Salt Stack, Git, Consul, Nomad, HAProxy, Nginx, Datadog, Docker, Postgres, MySQL, Clickhouse, PowerDNS, Kafka, Memcache + mcrouter, ZooKeeper, ELK Stack, Grafana, and more.
Email: lykron@mm.st
Resume: Upon Email",0,23379194,0,"[]","",0,"","[]",0 26502753,0,"comment","pachico","2021-03-18 14:00:29.000000000","Unless this means "sure, the result was already cached in memory" I don't see how you can be 10000x faster than Clickhouse. This claim is too big.",0,26487416,0,"[26504279]","",0,"","[]",0 19693092,0,"comment","zepearl","2019-04-18 17:48:08.000000000","I just needed a DB-client with which 1) I could have multiple SQLs in a single page/file (hope you know what I mean) and 2) execute single ones based on where the cursor is positioned, 3) without a delimiter at the end of each SQL, and 4) see as well execution plans (for some DBs e.g. MariaDB) and 5) to work with multiple databases (MariaDB, PostgreSQL, Clickhouse, DB2, MonetDB, Oracle, generic JDBC) and DBeaver totally saved me and I don't see any alternative (which I now anyway don't need).
I would pay for the "enterprise edition", but for a fixed amount not limited by usage but based on SW-versions (e.g. get upgrades for 1 year and after that to be able to keep using "the old version XYZ" until I have a reason to upgrade for which I would have to pay again). The 149$/year ( https://dbeaver.com/ at the bottom) sounds like it would stop working after 1y no matter which version I'm using :(",0,19690671,0,"[]","",0,"","[]",0 19695935,0,"comment","hodgesrm","2019-04-19 00:12:44.000000000","ML and neural nets work surprisingly well on a lot of boring but important use cases. Optimizing online ad placement might be humdrum but it's the foundation of a $100B market in the US alone. [0]
In my opinion the real hype is that ML has become so popular that new practitioners tend to forget other analytic techniques like SQL data warehouses. Interestingly, these are starting to absorb ML capabilities like logistic regression, which are now accessible through SQL and can benefit from MPP and vectorized query execution in DBMS types like ClickHouse, Vertica, and Google BigQuery.
[0] https://www.forbes.com/sites/danafeldman/2018/03/28/u-s-tv-a...
Disclaimer: I work on ClickHouse.",0,19693902,0,"[]","",0,"","[]",0 26526648,0,"comment","zX41ZdbW","2021-03-20 22:49:57.000000000","Testing performance of memcpy should be more exhaustive than just running on single CPU: https://github.com/ClickHouse/ClickHouse/issues/18583#issuec...",0,26526475,0,"[26526797]","",0,"","[]",0 26538873,0,"comment","pupdogg","2021-03-22 10:25:57.000000000","Based on their numbers, they should be archiving their historical data in parquet format partitioned by YYYYMMDD onto something like Clickhouse. This way, they can run a lean Postgres instance(s) at all times yet still get benefits of real-time reporting. Based on their use case, they can retain up-to 30 days of data in Postgres and offload the rest onto Clickhouse.",0,26535357,0,"[26539057]","",0,"","[]",0 26540132,0,"comment","pupdogg","2021-03-22 12:56:06.000000000","Your comment clearly illustrates that you have no working knowledge of Clickhouse or parquet file format or data archiving capabilities available in 2021. It's OK! I was in the same boat until I needed to implement such a solution for my use case. What I'm suggesting does not limit their customers from searching any historical data. Matter of fact, it might be 100x to 1000x faster for them to do so with the suggested solution. I strongly believe that mission critical transactional databases (postgres in this case) MUST be run very lean to keep their app running at hyper speeds at all times. 50TB overhead seems very inefficient when you take into account the the low cost solutions available in this day and age.
Based on my personal experience of achieving 94% compression on 2TB of data using snappy parquet file format, they could be looking at a final dataset size of 3.5TB on Clickhouse.",0,26539057,0,"[26544601,26540267]","",0,"","[]",0 26540267,0,"comment","KptMarchewa","2021-03-22 13:09:52.000000000",">Matter of fact, it might be 100x to 1000x faster for them to do so with the suggested solution.
That must be trolling. 1000x faster than single digit millisecond indexed query retrieving 15 rows?
The fact that you keep talking about storage size means that you're talking about analytics not transactional needs.
>Your comment clearly illustrates that you have no working knowledge of Clickhouse or parquet file format or data archiving capabilities available in 2021. It's OK! I was in the same boat until I needed to implement such a solution for my use case.
Also, fuck off with that condescension.",0,26540132,0,"[26544615,26540639]","",0,"","[]",0 26552621,0,"story","xoelop","2021-03-23 08:55:25.000000000","",0,0,0,"[]","https://blog.tinybird.co/2021/03/16/coming-soon-on-clickhouse-windodw-functions/",4,"Window functions are coming to ClickHouse","[]",0 26561085,0,"comment","bmn__","2021-03-23 22:22:33.000000000","> PCRE is hardly used "everywhere".
The topic under discussion was extensions that make regex non-regular (as popularised by Perl and libpcre), not PCRE per se. Per this site's rules, I assume good faith and that you simply misunderstood me and did not deliberately put up and topple this straw-man.
Adoption of non-regular extensions is overwhelmingly larger than adoption of the opposite.
1. These non-regular extensions can be found in Java/Kotlin/Scala/etc., Javascript, Perl, PHP, Python, Ruby, C#, R, Swift, Matlab, Julia, Haxe, Ocaml and literally dozens of other languages on various popularity charts, and as a first pick option in C, C++ and Lua. Go and Rust are the exceptions to the rule! There are millions of pieces of software written using these which one can't even see because they are not public.
2. Programmers and end users want features and power much more than they want determinism. (Performance is a red herring because the vast majority of the time, performance is good enough, or even identical to non-extended.) That's why ripgrep and GNU grep and rspamd have them.
https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#how...
https://www.gnu.org/software/grep/manual/html_node/Regular-E...
https://rspamd.com/announce/2016/03/21/rspamd-1.2.0.html
3. A factual survey of where the libraries are used. This will be invisible for the aforementioned programming languages because they have built-in regex, but comparing libpcre alone versus re2 and libhs shows clearly which paradigm is dominant and which is a niche.
libpcre: ag, apache2, blender, clamav, cppcheck, exim, fish, git, gnome-builder, godot, grep, haproxy, kodi, libvte, lighttpd, lldb, mariadb, mongodb, mutt/neomutt, mysql-workbench, nginx, nmap, pam, postfix, Qt5/Qt6, rspamd, selinux, sway, swig, syslog-ng, systemd, uwsgi, varnish, vlc, wget, zsh … and 110 more.
re2: bloaty, chromium/chromedriver/qtwebengine, clickhouse, libgrpc
libhs: libndpi, rspamd",0,26557597,0,"[]","",0,"","[]",0 26566369,0,"story","xoelop","2021-03-24 12:38:45.000000000","",0,0,0,"[]","https://blog.tinybird.co/2021/03/24/tips-5-adding-and-subtracting-intervals/",1,"Adding and Subtracting Intervals on ClickHouse","[]",0 23453442,0,"comment","st1ck","2020-06-08 03:58:14.000000000","I'm in no way a DBA, but if your main usecase is analytics (OLAP) and updates are infrequent it's common to use column-oriented DBMS. Postgres has cstore_fdw, but you can also use others: DuckDB, MonetDB, ClickHouse among FOSS, and quite a few well-known proprietary options.
That said, 2M rows is not really a lot of data (unless the rows are huge). In case you don't need to worry about updates, you can just load everything into memory, e.g. into a Pandas dataframe (large overhead, slow, many features) or a more efficient implementation like `datatable` (lower overhead, faster, fewer features).
Also I recently discovered BI tools (more like realized that despite the name it doesn't have to apply to business data). E.g. Metabase provides nice UI for non-complicated analytical (like SELECT avg(...) GROUP BY ...). So if it fits 90% of your queries, then maybe you got the frontend for free, and only need to work on backend (and the rest 10% of queries).",0,23389108,0,"[]","",0,"","[]",0 26581807,0,"comment","pupdogg","2021-03-25 15:53:23.000000000","You put in-place a loss mitigation strategy. This strategy will vary by application. In my case, I have a similar setup where we write 25-30k records to SQLite daily. We start each day fresh with a new SQLite db file (named yyyy-mm-dd.db) and back it up to AWS S3 daily under the scheme /app_name/data/year/month/file. You could say that's 9 million records a year or 365 mini-sqlite dbs containing 25-30k records. Portability is another awesome trait of SQLite. Then, at the end of the week (after 7 days that is), we use AWS Glue (PySpark specifically) to process these weekly database files and create a Parquet (snappy compression) file which is then imported into Clickhouse for analytics and reporting.
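The rotation piece is simple to script; a rough sketch (paths and schema here are illustrative, not our actual code):

```python
# Sketch of the daily SQLite rotation described above: one small db
# file per day, named yyyy-mm-dd.db, later batched off for archival.
# Directory layout and the 'records' schema are hypothetical.
import datetime
import os
import sqlite3
import tempfile

def daily_db_path(day, base_dir):
    # e.g. <base_dir>/2021-03-25.db
    return os.path.join(base_dir, f'{day.isoformat()}.db')

def open_daily_db(day, base_dir):
    conn = sqlite3.connect(daily_db_path(day, base_dir))
    conn.execute('CREATE TABLE IF NOT EXISTS records (ts TEXT, payload TEXT)')
    return conn

base_dir = tempfile.mkdtemp()  # fresh dir so each run starts empty
day = datetime.date(2021, 3, 25)
conn = open_daily_db(day, base_dir)
conn.execute('INSERT INTO records VALUES (?, ?)', ('2021-03-25T00:00:01', 'ping'))
conn.commit()
count = conn.execute('SELECT count(*) FROM records').fetchone()[0]
conn.close()
```

At day rollover you just stop writing to the old file and open the next date's file, so every db is immutable once the day ends and can be shipped off as-is.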
At any given point in time, we retain 7 years worth of files in S3. That's approx. 2275 files for under $10/month. Anything older, is archived into AWS Glacier...all while the data is still accessible within Clickhouse. As of right now, we have 12 years worth of data. Hope it helps!",0,26581410,0,"[26586079,26587340]","",0,"","[]",0 26586079,0,"comment","hodgesrm","2021-03-25 22:03:51.000000000","This sounds interesting. Have you thought of doing a talk or blog article about it?
p.s., I run the SF Bay Area ClickHouse meetup. Sounds like an interesting topic for a future meeting. https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Mee...",0,26581807,0,"[]","",0,"","[]",0 23477378,0,"comment","bsg75","2020-06-10 14:30:51.000000000","Clickhouse can bring a performance advantage on the query side, and with a proper cluster config scale for ingest.
CloudFlare has some useful blog posts on their uses of it.",0,23470941,0,"[]","",0,"","[]",0 23487040,0,"story","long2ice","2020-06-11 09:59:56.000000000","",1,0,0,"[23487041]","https://github.com/long2ice/mysql2ch",1,"Sync data from MySQL to ClickHouse, support full and increment ETL","[]",0 23487041,0,"comment","long2ice","2020-06-11 09:59:56.000000000","Sync data from MySQL to ClickHouse, support full and increment ETL.",0,23487040,0,"[]","",0,"","[]",0 23493070,0,"comment","oddtodd","2020-06-11 20:40:02.000000000","Hmm, considering after 2 years of shipping Druid as part of a big data analytics product, my company is moving off of Druid because of performance and stability issues, I don't think my company, at least, would consider Druid a success at all.
We are moving to Vertica, because other products in our company use that, although I'd have liked if we had gone to ClickHouse, which is faster than Vertica and Druid for our product.",0,23479799,0,"[]","",0,"","[]",0 16197496,0,"comment","seektable","2018-01-21 10:12:59.000000000","Take a look to Yandex ClickHouse, this is open-source append-only analytical database. It offers ultimate performance of OLAP queries and data like events log, and its SQL dialect includes a lot of specialized functions for metrics calculation.",0,16197084,0,"[]","",0,"","[]",0 16224488,0,"comment","ddorian43","2018-01-24 17:33:30.000000000","See Clickhouse for `oltp` columnar store (built for powering dashboards).",0,16224298,0,"[16225049]","",0,"","[]",0 16224599,0,"comment","manigandham","2018-01-24 17:44:15.000000000","> Pretty much the columnar/mpp stuff is good for simple aggregates for ad-hoc non-real-time queries
How is a column-oriented database which is primarily designed for fast performance across large data not good for real-time queries? Memsql, clickhouse, vertica, druid, etc are all fast systems that can scan terabytes in milliseconds with complex queries.
It's great that timescale provides new options and works with postgres but let's keep things accurate in a field that has plenty of confusion already.",0,16224298,0,"[]","",0,"","[]",0 16225049,0,"comment","cevian","2018-01-24 18:32:21.000000000","Clickhouse is a very cool project but is more analytical than OLTP. Percona had a nice writeup[1] (the comparison to mysql is especially apt -- note the lack of real-time updates and deletes in Clickhouse).
[1] https://www.percona.com/blog/2017/02/13/clickhouse-new-opens...",0,16224488,0,"[]","",0,"","[]",0 23534113,0,"comment","lima","2020-06-15 23:06:18.000000000","ClickHouse also supports incremental streaming from Kafka into a materialized view.
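A rough sketch of that pattern (broker, topic, and table names are made up):

```sql
-- Kafka engine table acts as the consumer (hypothetical settings).
CREATE TABLE events_queue (ts DateTime, user_id UInt64, value Float64)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'broker:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'ch_consumer',
         kafka_format = 'JSONEachRow';

-- Durable backing table that holds the data.
CREATE TABLE events (ts DateTime, user_id UInt64, value Float64)
ENGINE = MergeTree ORDER BY (user_id, ts);

-- The materialized view streams rows from the queue into the table.
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT ts, user_id, value FROM events_queue;
```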
You can even detach and reattach the view from its backing table.",0,23531825,0,"[]","",0,"","[]",0 19826472,0,"comment","11thEarlOfMar","2019-05-04 13:58:59.000000000","We're planning to deploy ClickHouse from Yandex[0]. Would like to hear from anyone who has it in production already, and what is your experience with it.
[0] https://clickhouse.yandex/",0,19825566,0,"[19826683,19828535,19827033,19834258,19827245]","",0,"","[]",0 19826683,0,"comment","gaahrdner","2019-05-04 14:33:33.000000000","Cloudflare[0] uses ClickHouse extensively, might want to reach out to them.
[0]https://blog.cloudflare.com/http-analytics-for-6m-requests-p...",0,19826472,0,"[19826861]","",0,"","[]",0 19827245,0,"comment","shaklee3","2019-05-04 16:00:14.000000000","Any reason you choose clickhouse over druid or Pinot?",0,19826472,0,"[19827333,19828055]","",0,"","[]",0 19827333,0,"comment","mfrye0","2019-05-04 16:14:46.000000000","I've evaluated Timescale, Clickhouse, Druid, and Pinot for our own use case.
Druid and Pinot have a lot of moving parts. If I remember correctly, Druid was something like 6-8 different nodes for different parts of the ingestion / querying processes. So it's going to be a lot of upfront and ongoing dev ops work.
Clickhouse is interesting because it seems to just "work". One thing to deploy and you just increase the number of nodes as you scale.",0,19827245,0,"[19829939]","",0,"","[]",0 19827924,0,"comment","bsdpqwz","2019-05-04 17:39:26.000000000","Azure Data Explorer (Kusto) might be one to add to the "To watch list"
https://azure.microsoft.com/en-us/services/data-explorer/
We're migrating from a 50-node Elasticsearch to ADX, imho:
- amazing query language (KQL)
- less work to maintain cluster
- lower cost
(it appears to be similar to Clickhouse, but more feature rich)",0,19825566,0,"[]","",0,"","[]",0 19828055,0,"comment","bsg75","2019-05-04 17:58:32.000000000","Simplicity.
In our case, the simple deployment and management of Clickhouse is a key feature. It is masterless [1] with no namenodes or coordinators, so each machine looks the same, and there is only one process to manage.
If you rely on its replication mechanism for sharding, Zookeeper becomes necessary, but writing directly to nodes in an orderly fashion is also an option (as we are).
[1] This means nodes are not aware of what each other contain, so queries hit all nodes with some maybe having no work to do. Depending on your workload this may or may not be a concern.",0,19827245,0,"[19921622,19828582]","",0,"","[]",0 19828535,0,"comment","tepidandroid","2019-05-04 19:07:09.000000000","I would love to use Clickhouse, if only it has some kind of Cassandra-esque node-to-node replication mechanism. I'm loathe to bring in something heavy like Zookeeper for that.
It would also be ideal if they could accelerate the data ingestion process somehow without the need to buffer chunks with yet another moving piece like Kafka.
I guess ideally, I'm looking for some kind of standalone HTAP system minus transactional guarantees.",0,19826472,0,"[19913534]","",0,"","[]",0 19828782,0,"comment","bsg75","2019-05-04 19:41:55.000000000","> In effect, you're doing your sharding client-side?
Correct. Some ETL inserts to specific nodes when sharding is necessary, in other cases Kafka engine tables on a group of nodes subscribe to common topics, and we simply let the whole cluster participate in queries. This works just fine when table scans are acceptable.
Rebalancing is a missing option here, short of moving partitions manually. But in my specific use cases, I have not yet needed to rebalance across nodes.
Note that using native Clickhouse replication is still an option if we need it. One cost is the extra work needed in the database cluster, so addressing it in an earlier layer works for us.
> How do you handle cross-node queries like joins?
If I understand your question, since we are using the Distributed view type across the cluster definition, a query on any node will receive data from the others as part of a join-less SELECT, and federate on the node with the client connection.
We are not doing any database-side JOINs currently. Plans are to augment data in ETL, or join data post-query (potentially Spark). Clickhouse dictionaries handle simple cases.",0,19828582,0,"[19828838]","",0,"","[]",0 16233493,0,"comment","lima","2018-01-25 19:00:53.000000000","ClickHouse (the analytics DMBS by Yandex), while not explicitly designed as such, is a fantastic time series database.
There's even a special backend, the GraphiteMergeTree, which does staggered downsampling, something most TSDBs aren't able to do.
It's the most promising development in this space I've seen in a long time.
https://clickhouse.yandex/docs/en/table_engines/graphitemerg...",0,16230464,0,"[16234636,16235060,16233855,16233858]","",0,"","[]",0 16233855,0,"comment","betaby","2018-01-25 19:38:14.000000000","Direct link how to use as a graphite whisper replacement https://github.com/yandex/graphouse
Telegram channel https://t.me/clickhouse_en",0,16233493,0,"[]","",0,"","[]",0 16233858,0,"comment","nawgszy","2018-01-25 19:38:27.000000000","Completely unrelated question, but how on earth is 'clickhouse.yandex' a valid web address?",0,16233493,0,"[16233897]","",0,"","[]",0 16235060,0,"comment","cevian","2018-01-25 21:54:15.000000000","Clickhouse is very cool. But note that it does not support transactional and relational semantics and does not have real-time updates or deletes. Thus, its meant for very different applications than TimescaleDB. I would classify Clickhouse more in the data-warehouse space...",0,16233493,0,"[]","",0,"","[]",0 16235166,0,"comment","manigandham","2018-01-25 22:08:55.000000000","It depends on the queries but columnstores would yield a faster result. We're not new to this and have used ClickHouse, MemSQL, SQL Server, and Druid extensively.
Columnstores just store data by column, they do not have any inherent limitations because of it. They all support SQL and compatible tools (although Druid is experimental SQL using apache calcite). They all store columnstore tables on disk (memsql uses rowstores in memory, sql server can optionally run columnstores in-memory using its hekaton engine, and they all use in-memory buffers for rapid ingest). They can all do geospatial queries, support JSON columns and some can handle nested/repeated structures. Indexes are available but unnecessary when you can prune partitions based on what's contained in each segment, especially when using a primary sort key (like a timestamp column in your case). SnappyData has a unique statistical engine to tradeoff query precision for much faster results (like HLL+ algorithms applied to the entire dataset). MemSQL will do OLTP access with full transactions across both rowstore and columnstore data.
Congrats on the VC funding, I'm always happy to see new projects and building on Postgres does give you a solid base with triggers and foreign keys (which come with their own scaling issues), and extending time-based functions will be useful -- however my issue is the marketing spin where you claim to be better than everything else. Columnstores are very fast, efficient, performant, and time as a dimension is not a new challenge. That's before considering the bigquery/snowflake superscale options or specialized databases like kdb+ which have served the financial industry for decades.
Approaching the field with a single-node automatic partitioning extension (as of today) for a rowstore RDBMS and saying you're better than the rest on features that they already have just strikes me as insincere. It would be better to recognize the competition and focus on what you're good at instead.",0,16235021,0,"[16235368,16235290]","",0,"","[]",0 23537312,0,"comment","pachico","2020-06-16 09:34:04.000000000","> What I have yet to see but always secretly wanted, however, is a database that natively supports incremental updates to materialized views. Yep, that’s right: Materialize listens for changes in the data sources that you specify and updates your views as those sources change.
This is precisely one of the features that make ClickHouse shine",0,23531825,0,"[]","",0,"","[]",0 19834258,0,"comment","hodgesrm","2019-05-05 18:41:24.000000000","There's a ClickHouse meetup 4 June in the Bay Area. Good place to find out more about ClickHouse. Google 'Clickhouse SF Meetup'.
Disclaimer: I'm an organizer.",0,19826472,0,"[]","",0,"","[]",0 19838092,0,"story","atomlib","2019-05-06 10:08:37.000000000","",0,0,0,"[]","https://habr.com/en/post/449818/",4,"PHP scripts monitoring in real time. ClickHouse and Grafana go to Pinba for help","[]",0 23549329,0,"comment","citrin_ru","2020-06-17 09:01:53.000000000","ClickHouse DBMS allows to combine delta, double delta, Gorilla with LZ4 or ZSTD for column compression. But it is not often used as DB for monitoring metrics so something else is probably expected from time series DB.",0,23548300,0,"[23549464]","",0,"","[]",0 23549341,0,"comment","redis_mlc","2020-06-17 09:03:57.000000000","> In particular, I'd love to know if theres anything major that generic RDBMS's could do better here.
Well, everybody with experience outsources monitoring now since it's a non-core cost center, unless there's a compelling scale or secrecy issue.
If RAM and CPU were free, I'd use MySQL or Postgres w/partitions because of their mgmt. features, tested replication and SQL.
But Prometheus or Clickhouse are 10-25x more efficient in terms of space, and often have much faster queries. The tradeoffs are bizarre HA gaps, lack of trained people, and ops groups are stuck supporting it.
I would never recommend monitoring with anything based on HDFS (OpenTSDB), written in Java (Cassandra), or in-memory for large clusters (InfluxDB.)
For monitoring under 200 nodes, anything will work.
If you only have a day to do something, just install Nagios and you'll get 99% of what you really need.
Source: DBA.",0,23548174,0,"[23550968,23552618]","",0,"","[]",0 23549464,0,"comment","redis_mlc","2020-06-17 09:24:44.000000000","It would work, but Clickhouse is a Russian (Yandex) thing, and SQL isn't really needed for most monitoring and alerting use cases.
If I didn't want to use MySQL or Postgres, I'd rank Prometheus #1 and Clickhouse #2.
The killer thing for Clickhouse is that Percona supports it, so if you want to outsource the installation, mgmt. and support, you can just write a check and get good results.
Also, Clickhouse is a column store with SQL, so you could use an instance for monitoring and another to replace Vertica or Greenplum or whatever so long as it has the client libraries you need.",0,23549329,0,"[23552909]","",0,"","[]",0 23551565,0,"comment","ekabod","2020-06-17 14:08:26.000000000","ClickHouse has aggregate functions (Pearson correlation coefficient, quantiles, etc.) [0].
Associated with their Materialized view and Liveview features, you can achieve observability [1].
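For example (made-up table and column names):

```sql
-- A few ClickHouse aggregates over a hypothetical metrics table.
SELECT
    corr(cpu_pct, latency_ms)             AS pearson_r,
    quantile(0.95)(latency_ms)            AS p95,
    quantiles(0.5, 0.9, 0.99)(latency_ms) AS latency_quantiles
FROM metrics;
```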
[0]https://clickhouse-docs.readthedocs.io/en/latest
[1]https://www.altinity.com/blog/2019/11/13/making-data-come-to...
Edit: fixed broken link.",0,23550997,0,"[23551877]","",0,"","[]",0 23551808,0,"comment","hodgesrm","2020-06-17 14:27:45.000000000","Compression for correlated measurements can be spectacular. ClickHouse can often get 99.9% reduction in data size with the right combination of codecs (delta, double-delta, gorilla, etc.) and compression (e.g., ZSTD). Sort order within tables is also important.",0,23547786,0,"[]","",0,"","[]",0 23551877,0,"comment","hodgesrm","2020-06-17 14:32:45.000000000","That second link seems to be broken. The base URL is: https://www.altinity.com/blog/2019/11/13/making-data-come-to...",0,23551565,0,"[23551949]","",0,"","[]",0 26669757,0,"comment","mescudi","2021-04-02 07:54:43.000000000","Location: Kazakhstan
Remote: Yes, English/Russian
Willing to relocate: Yes
Technologies: k8s, golang, aws, python, docker, helm, linux, terraform/ansible, clickhouse, ELK, cockroachdb, aerospike
Resume: https://tinyurl.com/2rpdpzm9
Email: nurtas977@gmail.com",0,26661279,0,"[]","",0,"","[]",0 23552909,0,"comment","rcatcher","2020-06-17 15:59:07.000000000","ClickHouse is licensed under Apache License 2.0 and Yandex is incorporated in the Netherlands. What are your concerns with it being developed by russians (other than xenophobia)?",0,23549464,0,"[23559435]","",0,"","[]",0 23559933,0,"comment","snicker7","2020-06-18 05:00:44.000000000","While those numbers are good, TimescaleDB does not even attempt to compete with the performance of purpose-built time series DBs (e.g. shakti, clickhouse).
Time series DBs (and OLAP dbs in general) have very different trade-offs/needs than transactional DBs.",0,23554766,0,"[23563512]","",0,"","[]",0 19851107,0,"comment","lima","2019-05-07 16:41:39.000000000","TimescaleDB confuses me. Postgres is an OLTP database and their disk storage format is uncompressed and not particularly effective.
By clever sharding, you can work around the performance issues somewhat but it'll never be as efficient as an OLAP column store like ClickHouse or MemSQL:
- Timestamps and metric values compress very nicely using delta-of-delta encoding.
- Compression dramatically improves scan performance.
- Aligning data by columns means much faster aggregation. A typical time series query does min/max/avg aggregations by timestamp. You can load data straight from disk into memory, use SSE/AVX instructions and only the small subset of data you aggregate on will have to be read from disk.
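To illustrate the first point, a toy sketch of delta-of-delta encoding (plain Python, not any particular database's codec): evenly spaced samples collapse to a constant first delta and then zeros, which compress extremely well.

```python
# Toy delta-of-delta codec for monotonically sampled timestamps.

def delta_of_delta(values):
    """Return [first value, first delta, then second-order deltas]."""
    if len(values) < 2:
        return list(values)
    deltas = [b - a for a, b in zip(values, values[1:])]
    dod = [b - a for a, b in zip(deltas, deltas[1:])]
    return [values[0], deltas[0]] + dod

def decode(encoded):
    """Invert delta_of_delta."""
    if len(encoded) < 2:
        return list(encoded)
    values = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        values.append(values[-1] + delta)
    return values

ts = [1700000000 + 15 * i for i in range(6)]  # one sample every 15 s
enc = delta_of_delta(ts)
print(enc)  # [1700000000, 15, 0, 0, 0, 0] -> mostly zeros, trivially compressible
assert decode(enc) == ts
```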
So what's the use case for TimescaleDB? Complex queries that OLAP databases can't handle? Small amounts of metrics where storage cost is irrelevant, but PostgreSQL compatibility matters?
Storing time series data in TimescaleDB takes at least 10x (if not more) space compared to, say, ClickHouse or the Prometheus TSDB.",0,19850071,0,"[19851660,19852336,19851320,19857438,19851371,19851494]","",0,"","[]",0 19851660,0,"comment","akulkarni","2019-05-07 17:41:10.000000000","(TimescaleDB co-founder)
TimescaleDB is more performant than you may think. We've benchmarked this extensively: e.g. outperforming InfluxDB [1] [2], Cassandra [3], and Mongo [4].
We've also open-sourced the benchmarking suite so others can run these themselves and verify our results. [5]
We also beat MemSQL regularly for enterprise engagements (unfortunately can't share those results publicly).
I think the scalability of ClickHouse is quite compelling, and if you need more than 1-2M inserts a second and 100TBs of storage, then that would be one case where I'd recommend another database over our own. But horizontal scalability is something we have been working on for nearly a year, so we expect this to be less of an issue in the near future (will have more to share later this month).
You are correct however that TimescaleDB requires more storage than some of these other options. If storage is the most important criteria for you (ie more important than usability or performance), then again I would recommend you to one of the other databases that are more optimized for compression. However, you can get 6-8x compression by running TimescaleDB on ZFS today, and we are also currently working on additional techniques for achieving higher compression rates.
[1] https://blog.timescale.com/timescaledb-vs-influxdb-for-time-...
[2] https://blog.timescale.com/what-is-high-cardinality-how-do-t...
[3] https://blog.timescale.com/time-series-data-cassandra-vs-tim...
[4] https://blog.timescale.com/how-to-store-time-series-data-mon...
[5] https://github.com/timescale/tsbs",0,19851107,0,"[19852983,19852020,19852946]","",0,"","[]",0 19852946,0,"comment","ruw1090","2019-05-07 20:03:38.000000000","> You are correct however that TimescaleDB requires more storage than some of these other options. If storage is the most important criteria for you (ie more important than usability or performance), then again I would recommend you to one of the other databases that are more optimized for compression. However, you can get 6-8x compression by running TimescaleDB on ZFS today, and we are also currently working on additional techniques for achieving higher compression rates.
This is a weird answer, since compression is used by columnar databases like MemSQL and ClickHouse both to save on storage and to accelerate queries. Compare this to using generic filesystem compression, which would both compress worse and make the system slower.
I mean, I don't necessarily claim that this is not a solution, it just isn't obvious to me at all. Maybe a more extensive explanation with better examples would make it all clear to me and I'd be super-hyped about it already, but right now I'm more like confused.
First off, it would be helpful to show the table structure in the examples, and then compare each EdgeQL query to the easiest solution in SQL. After all, the readers have supposedly used SQL almost daily for many years (I know I have), but don't know a thing about EdgeQL, so if it can do everything SQL can, but more easily, such a comparison should make that pretty obvious.
TBH, my knowledge of relational algebra is quite rusty by now, so maybe that's the problem, but as I remember, many queries we commonly use with SQL are not really "relational" queries. Relational algebra deals with sets of tuples, so things like count(*) or ORDER BY or GROUP BY are not really part of the relational model; they exist because they are super-helpful for what we usually are trying to achieve with SQL.
The problems with NULL are of a similar nature. I don't think we should pretend that NULL not being equal to NULL is not useful (we don't expect "missing data" to be exactly the same value as another "missing data", do we?), or that SELECT DISTINCT treating them as equal is not intuitive (for me it absolutely is: when I'm asking what values occur in a table, a missing entry is a missing entry to me, and I don't want to see NULL 10000 times).
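Both behaviours can be checked in a few lines; a minimal sketch using SQLite (the same NULL semantics the comment describes for SQL generally):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (None,), (2,)])

# NULL = NULL evaluates to unknown, so NULL rows fail even the tautology x = x.
matched = con.execute("SELECT count(*) FROM t WHERE x = x").fetchone()[0]
assert matched == 2  # only the two non-NULL rows survive the filter

# Yet DISTINCT collapses the two NULLs into a single "missing entry".
distinct = con.execute("SELECT DISTINCT x FROM t ORDER BY x").fetchall()
assert distinct == [(None,), (1,), (2,)]
```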
So, the introduction kind of made me expect the solution to be more in compliance with relational concepts, but it doesn't seem to be, since all of the above are present in EdgeQL in one form or another.
I'm not sure how {} is different from NULL in EdgeQL, since {} seems to be a special thing here, the same as NULL is in SQL. I mean, it doesn't behave like a true empty set at all! Non-empty set {value} OR {} = {value}, not {} (OR ≡ ∪). <bool>{} being {} instead of a true boolean value looks even more confusing to me than NULL OR (NOT NULL) = NULL. Same ternary algebra here.
Then, I don't really understand the concept of a flat set here. I do kind of understand what we are trying to achieve: we want to solve the problem of SELECT x, (SELECT y) FROM z throwing an error at runtime if count(SELECT y) != 1. And that would kind of make sense in SQL, but it's explicitly advertised as a feature of EdgeQL that it can return trees (JSON-like structures), and here it doesn't seem to make sense that the output of a query (which is a "flat set", I guess?) cannot have another set as an element. Moreover, it obviously can be ordered, which also isn't a property of how a "set" is commonly defined in set theory.
As a first impression, the syntax and overall structure of the queries don't strike me as obvious either. In fact, since
> SQL does not integrate well enough with application languages and protocols
I would ultimately hope for something that can be expressed as a number of function calls and commonly used data structures (a list, a dictionary/record, etc.) in most/any mainstream PLs, not one more DSL as in "free-form text". (Maybe with a more succinct DSL for use in a console. Maybe.)
And my ultimate source of confusion: SQL is more or less the same thing even in those DBMSes where it isn't exactly The SQL (like ClickHouse). And given that I know the overall structure of the DBMS (is it, for instance, row-based or column-based?), I can make pretty good assumptions about the performance of a given query, even though SQL is still declarative and I do not know exactly what the query optimizer will do. Maybe it's just that I'm not used to it, but I don't have a feel for how performant the last EdgeQL example of the article would be, or whether it would be better to separate it into several queries at some scale. In fact, I don't even understand if it's something that would be reasonably easy to implement in other major RDBMSes, or is it ultimately an EdgeDB-only feature? If so, it can only be as good as EdgeDB — and is it as good as PostgreSQL, or MariaDB, or SQLite? Unfortunately, in the real world I have to worry more about how performant and robust a thing is under load than about programming convenience.",0,19871051,0,"[19872836]","",0,"","[]",0 19873464,0,"comment","zepearl","2019-05-09 23:18:48.000000000","In my opinion the example of the first chapter ("Lack of Orthogonality") is wrong. The subquery (is it called an "inline subquery"?)...
> SELECT name FROM emp WHERE role = 'dept head' AND deptno = dept.no
...should in my opinion definitely return only 1 row for each department - from a logical point of view, returning multiple rows would mean that the data is corrupt upstream, or that the organization itself is broken, or that the DB lacks attributes (if no additional selection criteria like "management" or "operations" or "deputy" etc. can be added to the query - meaning that no sub-organization can have more than 1 person responsible for the exact same thing).
I admit that this is a very focused criticism and that I'm very happy with the current behaviour of the generic SQL language and its special cases, which are linked to the DB being used (currently using Oracle, MariaDB, ClickHouse - previously DB2, Kudu through the Cloudera stack, PostgreSQL, maybe something else), and how it stores/processes the data, etc.",0,19871051,0,"[]","",0,"","[]",0 23582552,0,"comment","posnet","2020-06-20 08:05:52.000000000","What I find interesting about the whole TiFlash part (which is not open source, as far as I can tell).
Looking at the symbol table of the binary, they appear to have embedded the entirety of the ClickHouse database as their processing engine.
I hope they do open source it at some point. The more I think about it, the more I like the idea of having a transactional database do analytic predicate pushdown by just transparently querying an actual OLAP database.
Two patterns are emerging in HTAP: scale-up, like HANA, and scale-out, like TiDB 4.0. In both cases the engine/system transparently handles the merge between the OLTP delta row store and the OLAP column store (AutoETL), and there is a transparent federated query that is aware of both store types.
Does Presto or another scale-out solution transparently perform these two HTAP functions?",0,23582552,0,"[23584351,23583294]","",0,"","[]",0 23583294,0,"comment","qaq","2020-06-20 11:20:14.000000000",""I’m not sure what benefits ClickHouse has over the ORC/Parquet based Open Source engines like Presto/Impala" performance for one",0,23583119,0,"[]","",0,"","[]",0 23584022,0,"comment","ilovesoup","2020-06-20 14:10:14.000000000","I'm the product owner of TiFlash. Yes. We used ClickHouse as the compute engine for TiFlash. The project started as a modification of ClickHouse (more or less it still is). It was like "pushdown query to an actual OLAP database" style 2 years ago. But later on we made a lot tighter integration (raft instead of binlog sync from TP or data-replication for TiFlash itself, implemented same type system, txn, online ddl, Coprocessor interface as TiKV and etc) to make it more "transparent" for query layer.
We will have more details to share soon.
It will be open sourced in a year or two. For us, we need to make the code open-source ready rather than just flipping a GitHub setting.",0,23582552,0,"[23597431]","",0,"","[]",0 23584351,0,"comment","ilovesoup","2020-06-20 15:09:35.000000000","The reason for ClickHouse is simple: it's fast. And we needed it to function like a TiKV coprocessor, which mainly means supporting filtering and aggregation, and ClickHouse is good at aggregation and filtering. Also, it might have taken more time, and been messier, to do a seamless, compatible integration with TiDB using Impala or Presto on top of an MPP layer. But the price we paid is implementing the MPP layer ourselves now.
Almost all data-lake-based products lose full control over the storage system. That makes it very hard for them to build the delta-main engine we need. To make HTAP storage transparent to the query layer, TiFlash needs a lot more control over the storage engine than a data lake can provide.
For even better results TimescaleDB and ClickHouse achieved approximately 2-6 bytes per float32 timestamped measurement, depending on the dataset and shape.",0,26712978,0,"[26714647]","",0,"","[]",0 26716863,0,"story","weastur","2021-04-06 20:18:36.000000000","",0,0,0,"[26716864]","https://github.com/weastur/grafana-dashboards/tree/main/dashboards/clickhouse",3,"Grafana dashboard for ClickHouse with annotations, trends/peaks view","[]",1 26721963,0,"comment","barrkel","2021-04-07 08:42:08.000000000","Well, you're better off not doing joins at all in ClickHouse, beyond small dimension tables. Don't do joins between two or more big tables at all, is generally the rule in analytics databases; instead, pre-join your data at insert time.
CH supports optimizations for low-cardinality columns, so you can efficiently store things like enums directly as strings, rather than needing a separate table for them.",0,26711343,0,"[]","",0,"","[]",0 26728884,0,"comment","citrin_ru","2021-04-07 18:26:21.000000000","> 100 sensors sampling at 1kHz for a year, you'd have ~3 trillion rows
PostgreSQL is a great OLTP DB, but this looks like a good fit for ClickHouse or some time series DB.",0,26711226,0,"[]","",0,"","[]",0 23617196,0,"comment","osipov","2020-06-23 17:27:05.000000000","Cool project! You posted a comparison to ClickHouse on the front page. Do you have any insights on how QuestDB compares performance-wise to PrestoDB?
Also, PrestoDB is known to be manageable at scale, for example as part of AWS Athena. What are your thoughts on building out an Athena-like service using QuestDB?",0,23616896,0,"[23617568,23617225]","",0,"","[]",0 19913534,0,"comment","valyala","2019-05-14 20:24:19.000000000","A ClickHouse cluster can run without ZooKeeper and without replication. Data durability can be provided by using durable storage such as Google Compute Engine disks [1].
The data ingestion process can be accelerated without resorting to Kafka: just insert data into a Buffer table [2].
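For reference, a sketch of what that looks like (the `metrics` table name is invented; the parameters follow the documented Buffer engine signature: database, table, num_layers, then min/max time, rows, and bytes flush thresholds):

```sql
-- Writes go to metrics_buffer; ClickHouse flushes each in-memory layer to
-- the underlying default.metrics table once a min/max threshold is crossed.
CREATE TABLE default.metrics_buffer AS default.metrics
ENGINE = Buffer(default, metrics, 16, 10, 100, 10000, 1000000, 10000000, 100000000);

-- Small frequent inserts go here instead of hitting MergeTree directly:
INSERT INTO default.metrics_buffer VALUES (...);
```

Reads against the Buffer table see both buffered and already-flushed rows, so clients can query it directly.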
[1] https://cloud.google.com/compute/docs/disks/#pdspecs
[2] https://clickhouse.yandex/docs/en/operations/table_engines/b...",0,19828535,0,"[]","",0,"","[]",0 19913583,0,"comment","valyala","2019-05-14 20:29:25.000000000","There is a chproxy [1] - a proxy that is able to balance inserts and selects across ClickHouse nodes / replicas. Then client applications shouldn't know anything about ClickHouse cluster topology - they just talk to chproxy.
[1] https://github.com/Vertamedia/chproxy",0,19828582,0,"[]","",0,"","[]",0 19913783,0,"story","valyala","2019-05-14 20:51:21.000000000","",0,0,0,"[]","https://pixeljets.com/blog/clickhouse-as-a-replacement-for-elk-big-query-and-timescaledb/",2,"ClickHouse as a Replacement for Elk, BigQuery and TimescaleDB","[]",0 23619555,0,"comment","sa46","2020-06-23 20:08:02.000000000","How timely! I've done a deep dive into column store databases for the past couple of weeks. Reading through the Quest docs, I'd give it the following characteristics. Are these accurate?
- single node database, not [yet] distributed
- primary focus is time-series data, specifically in-order time series data (the `designated timestamp` extension)
- physical data layout is an append-only column store
- Implements a small subset of SQL with some affordances for time series (LATEST BY, SAMPLE BY).
- Doesn't support explicit GROUP BY or HAVING clauses. Instead, QuestDB implicitly assumes GROUP BY or HAVING based on the presence of aggregation functions in the SELECT clause.
- Small standard library of functions: only 4 text functions.
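To illustrate the implicit GROUP BY point above, a sketch with a hypothetical `trades` table (`symbol`, `price` columns):

```sql
-- Standard SQL (e.g. PostgreSQL): the grouping must be spelled out.
SELECT symbol, avg(price) FROM trades GROUP BY symbol;

-- QuestDB: the same grouping is inferred from the mix of aggregated
-- and non-aggregated columns in the SELECT list.
SELECT symbol, avg(price) FROM trades;
```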
Based on these characteristics, it seems QuestDB is well positioned against Influx. It's probably faster than TimescaleDB but significantly less flexible, given that Timescale has all of Postgres behind it. QuestDB might eventually compete with ClickHouse, but it's a long way out given that it's not distributed and implements a much smaller subset of SQL.
I'd love to get any insight into join performance. Quite a few column stores handle large joins poorly (clickhouse, druid).",0,23616878,0,"[23620005,23629821,23619690,23621954]","",0,"","[]",0 23623374,0,"comment","jinmingjian","2020-06-24 02:40:09.000000000","sharing some thoughts here, in that I am recently developing a similar thing:
1. "Query 1.6B rows in milliseconds, live" is just like "sum 1.6B numbers from memory in ms".
In fact, if full SQL functionality isn't supported, a naive SQL query is just a tight loop over arrays (as partitions for naive data parallelism) on multi-core processors.
So this kind of claim amounts to a several-line benchmark (ignoring data preparation and threading boilerplate) of how quickly the sum loop can finish.
Again, this is really just naive memory-bandwidth benchmark code.
Let's count: a 6-channel Xeon-SP can provide ~120 GB/s of memory bandwidth. A sum loop over 1.6B uncompressed 4-byte ints in such a processor's memory could then finish in about 1.6*4/120 ≈ 50 ms.
Then, if you get 200 ms in xxx db, you have in fact wasted 75% of the time (150 ms) on things other than what your own home-brewed small C program would need for such a toy analysis.
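The back-of-envelope number can be checked in a couple of lines (same assumptions as the comment: 1.6B uncompressed 4-byte ints, ~120 GB/s of memory bandwidth):

```python
rows = 1.6e9          # values scanned
bytes_per_value = 4   # int32, uncompressed
bandwidth = 120e9     # bytes/second, 6-channel Xeon-SP ballpark

# Time is bounded below by bytes moved divided by memory bandwidth.
seconds = rows * bytes_per_value / bandwidth
print(f"bandwidth-bound lower limit: {seconds * 1000:.1f} ms")  # ~53 ms
```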
2. Some readers like to see comparisons to ClickHouse (referred to as CH below).
The fact is that CH is a little slow for such naive cases (see the benchmarks at [1] that others have pointed to).
This is because CH is a real-world product. The optimizations involved reflect ten years of research and usage in the database industry, all of which (and much more) is included in CH.
Can you sustain the statement in the title once you enable reading from persistent disk, or when the query does a high-cardinality aggregation (imagine that a low-cardinality aggregation is just a tight loop plus a hash table that fits in L2)?
[1] https://tech.marksblogg.com/benchmarks.html",0,23616878,0,"[23626109]","",0,"","[]",0 23624343,0,"comment","gregwebs","2020-06-24 05:14:39.000000000","This seems very similar to Victoria Metrics. Victoria Metrics is very much based on the design of Clickhouse and currently shows best of class performance numbers for time series data: it would be a lot more interesting to see a comparison to Victoria Metrics than ClickHouse (which is not fully optimized for time series). Victoria Metrics is Prometheus compatible whereas Quest now supports Postgres compatibility. Both have compatibility with InfluxDB.",0,23616878,0,"[]","",0,"","[]",0 23625027,0,"comment","seektable","2020-06-24 07:17:02.000000000","SYMBOL concept sounds like ClickHouse LowCardinality (or maybe implicit 'dictionary').",0,23620005,0,"[]","",0,"","[]",0 19916075,0,"comment","zeeg","2019-05-15 03:07:34.000000000","Absolutely is - sqlite had been valuable for a long time as it made testing/local dev fast, but MySQL has always been a burden. Its made it hard for us to build optimal solutions in many cases as we had to cater to multiple different approaches to a solution. With our newer stuff we're actually able to remove a lot of the infrastructure cost/complexity by using a better solution (Clickhouse). Obviously has its costs, but its a net win.",0,19915874,0,"[19917697]","",0,"","[]",0 19921622,0,"comment","oddtodd","2019-05-15 17:19:08.000000000","I'll second Simplicity and add in that ClickHouse is also faster on less hardware and more stable compared to our current Druid configuration.
My company currently uses Druid and has for a few years now, but I have been evaluating ClickHouse on the side as a possible replacement, and as a testament to its simplicity, I was able to stand it up and get it going as a PoC pretty quickly and easily.
So far I have found only good things about ClickHouse. Maybe the only downside has been the management of the cluster and data, but I haven't gotten far enough into operationalizing ClickHouse to know how much those kinds of items will cost.
Perhaps the other thing is the documentation, while reasonably good imo, still doesn't explain everything as well as I'd like. It was good enough, but I definitely had to experiment on a couple items to get them working as needed.",0,19828055,0,"[]","",0,"","[]",0 26739113,0,"comment","Sodman","2021-04-08 14:27:19.000000000","A lot of people are going to jump on the "he used k8s and he doesn't even work at Google scale!" part of this writeup, but I think it's a perfect demonstration of the concept of innovation tokens [1]. He admits in TFA that clickhouse was the only new piece of tech in his stack, and he was already familiar with k8s et al - so he's able to focus on actually building the products he wants. I could see somebody unfamiliar with k8s (but very familiar with all other pieces of tech in the system they want to build) being able to learn it as part of a side project, if it's the only new thing. Where the wheels come off is when you've never touched k8s, postgres, aws, rust, graphQL or vue - and you try to mash them all together in one ambitious project.
[1] https://mcfunley.com/choose-boring-technology",0,26737771,0,"[26740830,26740314,26741042]","",0,"","[]",0 26740902,0,"comment","anurag","2021-04-08 16:40:23.000000000","(Render founder) This is incredible work, and underscores the reason Render exists and is recommended by OP. Everything mentioned in the post is baked into Render already:
* Automatic DNS, SSL, and Load Balancing
* Automated rollouts and rollbacks
* Health checks and zero downtime deploys (let it crash)
* Horizontal autoscaling (in early access!)
* Application data caching (one-click ClickHouse and Redis)
* Built-in cron jobs
* Zero-config secrets and environment variable management
* Managed PostgreSQL
* DNS-based service discovery with private networking
* Infrastructure-as-Code
* Native logging and monitoring and 3rd-party integrations (LogDNA, Datadog, more coming this month!)
* Slack notifications
More at https://render.com.",0,26737771,0,"[26741032]","",0,"","[]",0 23626397,0,"comment","y42","2020-06-24 10:45:10.000000000","I cannot talk for the whole industry, but I know for sure, that I (working in BigData / OnlineMarketing) was looking for exactly this recently: A BI tool that is agile and a database that is fast. I ended up with Tableau Server and ClickHouse (which is impressivly fast). Problem was: Tableau does not really support CH and Tableau is not that agile. So to answer your question: Yes. If QuestDB is that fast, there is a demand.",0,23625404,0,"[23626526]","",0,"","[]",0 23626526,0,"comment","seektable","2020-06-24 11:06:37.000000000","JFYI SeekTable already has a connector for ClickHouse :) it works over native binary protocol (with connection-level compression). Since you have control over SQL query and SQL expressions used for dimensions/measures ClickHouse SQL-dialect is not a problem.",0,23626397,0,"[]","",0,"","[]",0 23628369,0,"comment","valyala","2020-06-24 14:40:50.000000000","Great demo!
I'm curious why the following query doesn't finish in tens of seconds:
select sum(passenger_count) from trips where passenger_count=1
The same query executes in about a hundred milliseconds on ClickHouse running on the same hardware (~20 vCPUs).
12732989,0,"comment","woodcut","2016-10-18 08:59:45.000000000","How would Redshift compare to Yandex's ClickHouse [1] for this kind of architecture? [1] https://clickhouse.yandex/",0,12729515,0,"[12741324]","",0,"","[]",0 23639284,0,"comment","pachico","2020-06-25 11:15:13.000000000","It looks very promising, congrats. I use ClickHouse in production and I'd love to see how this project evolves. My main disappointment is the number of aggregation functions: https://questdb.io/docs/functionsAggregation ClickHouse provides hundreds of functions, many of which I use. It would be hard to even consider QuestDB with this number of functions. I'll stay tuned, anyway. Keep up the good work!",0,23616878,0,"[23642864]","",0,"","[]",0 19939568,0,"story","bretthoerner","2019-05-17 13:52:05.000000000","",0,0,0,"[]","https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-new-search-infrastructure/",2,"Snuba: Sentry's New Search Infrastructure (Built on ClickHouse)","[]",0 16343731,0,"comment","samat","2018-02-09 20:52:36.000000000","Yay, finally a Paw for SQL! :)
Do you consider adding Clickhouse support? I would really appreciate that
(https://clickhouse.yandex)",0,16339004,0,"[]","",0,"","[]",0 12750904,0,"story","lcnmrn","2016-10-20 09:45:35.000000000","",0,0,0,"[]","https://clickhouse.yandex/",1,"ClickHouse","[]",0 26766022,0,"comment","Redsquare","2021-04-10 22:54:48.000000000","react, c#, redis for caching+pubsub, mongo/postgres, clickhouse - awesome for analytics/mi, algolia for search, mindsdb, logentries for log aggregation, datadog monitoring + catchpoint for synthetic tests
killer combo",0,26762674,0,"[26873113]","",0,"","[]",0 23659275,0,"story","wangfenjin","2020-06-27 03:22:11.000000000","",0,0,0,"[]","https://github.com/wangfenjin/xeus-clickhouse",3,"Show HN: A Jupyter Kernel for ClickHouse","[]",0 23666971,0,"comment","st1ck","2020-06-28 05:49:41.000000000","For analytical workloads, DuckDB may be better than SQLite, but it's just incomparably slower than e.g. ClickHouse (which is admittedly not embedded).",0,23663843,0,"[23667630]","",0,"","[]",0 16376040,0,"comment","olavgg","2018-02-14 14:47:40.000000000","I've stopped reading database benchmarks, because they are extremely vague. Instead I spend my time optimizing my current solution/stack. For example Postgresql has hundreds of knobs that you can adjust for almost every scenario you can imagine. Sometimes you have a special query and increase the work_mem just for that session. Other cases you adjust the cost settings for another query/session. You can analyze your indexes and index types. And sometimes you need to rewrite parts of a big query.
Learning all this takes time, but you are much better off learning more about your chosen technology stack than switching to another one.
Though in a few rare cases, you need a different technology to solve your business problem. In most cases they complement your existing solution, like Elasticsearch/Solr for full-text search or ClickHouse for OLAP workloads.
I used ClickHouse previously with large data like this and it worked much better. Obviously, I'm comparing apples with oranges, but couldn't PostgreSQL support a columnar data engine or some kind of index?",0,19993540,0,"[19994873,19994181]","",0,"","[]",0 19994181,0,"comment","sterwill","2019-05-23 17:59:41.000000000","If you want an index on a column (and you probably do if you're going to query over 500 million rows) you have to create it at some point. Creating it after will be more efficient in your case.
So what do you mean "too slow?" How long did it take?
I don't have any experience with Clickhouse, and not much with columnar databases in general, but if we're talking about a simple table with one index over one text column, I'm not sure whether it makes any difference if you store the tuples row-wise or column-wise on disk. It's an index that covers 100% of the data either way.
I don't know anything about your server but it sounds like PostgreSQL just wasn't able to get much IO throughput from your storage system. Things like storage hardware, filesystem type, and kernel parameters are the big factors here.",0,19994106,0,"[]","",0,"","[]",0 26824842,0,"story","sundaresanr","2021-04-15 19:06:57.000000000","",0,0,0,"[]","https://clickhouse.tech/blog/en/2020/the-clickhouse-community/",2,"The ClickHouse Community","[]",0 26858356,0,"comment","jodrellblank","2021-04-19 01:34:22.000000000","If you want a self-hostable thing, https://www.adminer.org/ is a single PHP file you can put on a webserver and use to manage a database[1] that the webserver can talk to; put in the server name and credentials and have a low-effort useful GUI management tool.
[1] It claims: Works with MySQL, MariaDB, PostgreSQL, SQLite, MS SQL, Oracle, Elasticsearch, MongoDB, SimpleDB (plugin), Firebird (plugin), ClickHouse (plugin)",0,26857098,0,"[26858483]","",0,"","[]",0 26873113,0,"comment","karolist","2021-04-20 11:15:49.000000000","Curious to hear what you use clickhouse for",0,26766022,0,"[]","",0,"","[]",0 26873816,0,"story","jinmingjian","2021-04-20 12:44:45.000000000","",0,0,0,"[26887183]","https://tensorbase.io/2021/04/20/base_reload.html",4,"Show HN: New ClickHouse in Rust on the Top of Apache Arrow and DataFusion","[]",1 26910927,0,"comment","pachico","2021-04-23 04:14:42.000000000","It is a feature, although experimental, of ClickHouse https://clickhouse.tech/docs/en/sql-reference/statements/cre...",0,26901352,0,"[]","",0,"","[]",0 26915422,0,"story","xoelop","2021-04-23 15:01:57.000000000","",0,0,0,"[]","https://blog.tinybird.co/2021/04/23/tips-6-faster-joins-without-joining-using-where/",1,"Avoiding making joins on ClickHouse prefiltering data","[]",0 26929207,0,"comment","joshxyz","2021-04-24 23:50:10.000000000","uWeebSockets.js: better performance and api than expressjs, hapi, koa, fastify
React: stable, doesn't break, probably lindy.
Tailwind css: it just works.
Postgresql, elasticsearch, clickhouse, redis.
Namecheap, digitalocean.",0,26903018,0,"[]","",0,"","[]",0 20125373,0,"comment","felixge","2019-06-07 15:14:39.000000000","SQL isn't the problem IMO. The problem is that the implementation we're using (PostgreSQL) is a row-store which is indeed more optimal for the kind of operations you mention.
However, SQL isn't limited to row-stores. There are column-store implementations that are quite amazing for aggregate queries, e.g. Clickhouse. Using one of those would very likely work for us, but my understanding is that loading data into them in real-time is problematic.",0,20124243,0,"[20126408]","",0,"","[]",0 16532643,0,"story","bretthoerner","2018-03-06 22:24:25.000000000","",0,0,0,"[]","https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/",15,"HTTP Analytics for 6M requests per second using ClickHouse","[]",0 16535759,0,"story","lima","2018-03-07 11:40:03.000000000","",0,0,0,"[]","https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/",6,"CloudFlare replaced Citus with ClickHouse","[]",0 26952413,0,"comment","joshxyz","2021-04-27 05:37:57.000000000","Makes me curious if clickhouse has already explored gpus",0,26951455,0,"[26962222]","",0,"","[]",0 26959010,0,"comment","welder","2021-04-27 17:36:12.000000000","Totally get that. The journey here wasn't straightforward. I've tried dual-writing in production to many databases including Cassandra, RethinkDB, CockroachDB, TimescaleDB, and more. Haven't tried ClickHouse/VictoriaMetrics, but probably won't now because S3 scales beautifully. The main reason not using EC2 is compute and attached SSD IOPs costs. This balance of DO compute and AWS S3 is the best combination so far.",0,26958900,0,"[]","",0,"","[]",0 16540285,0,"comment","ryanworl","2018-03-07 21:54:41.000000000","Did you evaluate Clickhouse?",0,16539317,0,"[16540344,16540315,16541734]","",0,"","[]",0 16540344,0,"comment","pauldix","2018-03-07 22:00:29.000000000","Yeah, I'd think about that for the analytics use case. It's interesting technology and I always keep an eye out for what Influx can learn from these other projects. 
This CloudFlare post is on my reading list to learn more about it: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...",0,16540285,0,"[]","",0,"","[]",0 16541486,0,"comment","ryanworl","2018-03-08 01:17:03.000000000","I know I mentioned this in another comment, but you really should check out ClickHouse. They have a table engine for exactly this purpose.
You create a table with the raw logs, then a materialized view (or another actual table) which declaratively does the rollup for you in real time.
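A sketch of that pattern (table and column names invented; the AggregatingMergeTree engine and the -State/-Merge combinators are the documented ClickHouse mechanism):

```sql
CREATE TABLE requests
(
    ts    DateTime,
    host  String,
    bytes UInt64
)
ENGINE = MergeTree() ORDER BY ts;

-- Rolls up incoming rows per host and hour as they are inserted.
CREATE MATERIALIZED VIEW requests_hourly
ENGINE = AggregatingMergeTree() ORDER BY (host, hour)
AS SELECT
    host,
    toStartOfHour(ts) AS hour,
    countState()      AS hits,
    sumState(bytes)   AS total_bytes
FROM requests
GROUP BY host, hour;

-- Reading back requires the matching -Merge combinators.
SELECT host, hour, countMerge(hits) AS hits, sumMerge(total_bytes) AS total_bytes
FROM requests_hourly
GROUP BY host, hour;
```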
https://clickhouse.yandex/docs/en/table_engines/aggregatingm...
A full example: https://www.altinity.com/blog/2017/7/10/clickhouse-aggregate...",0,16541399,0,"[]","",0,"","[]",0 16541734,0,"comment","cevian","2018-03-08 02:15:35.000000000","The issue for an operational workload like this with Clickhouse might be the lack of direct support for UPDATE and DELETE. The workarounds required would add additional complexity, I think.",0,16540285,0,"[]","",0,"","[]",0 16543654,0,"comment","eis","2018-03-08 11:11:17.000000000","The "developer" edition of Memsql does not permit you to put it into production and doesn't have any high availability features. So basically you can only run the enterprise version which starts at a minimum commitment of $25k annually and goes up according to RAM usage. Many small projects and startups will opt for a database that lets them get started cheaper/free and with less requirements. Also being not open source might make people worried as too many closed source databases folded in the past, leaving users stuck or forced to migrate to something else.
I evaluated Memsql for a project which would have made good use of both the row storage and columnar storage engines. But with Clickhouse, there's now a columnar database that performs extremely well and is completely free and open source. So half the use case for Memsql went away. For the row-based engine, the competition is a bit tougher. If one doesn't need extreme performance, CockroachDB provides a consistent SQL DB that is super easy to cluster. And for people with greater performance needs, there's MySQL Cluster (NDB), for example, or several NoSQL solutions.
Memsql is aiming for the enterprise market with well paying customers. They are not targeting the HN startup scene that much.",0,16543513,0,"[16544075]","",0,"","[]",0 16543668,0,"story","jgrahamc","2018-03-08 11:15:20.000000000","",0,0,0,"[]","https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/",2,"HTTP Analytics for 6M requests per second using ClickHouse","[]",0 26962222,0,"comment","hodgesrm","2021-04-27 21:55:47.000000000","Yes. There was a prototype at NVIDIA that did this and there have been other attempts. See discussion in the issues. [1]
[1] https://github.com/ClickHouse/ClickHouse/issues/12572",0,26952413,0,"[]","",0,"","[]",0 16549618,0,"comment","eis","2018-03-09 03:32:35.000000000","You already mentioned two examples. But open sourcing a database doesn't just prevent against the folding of the company behind it. It creates a community that drives improvements and helps prevent the project from folding. Clickhouse for example has a growing community that files bugs and authors improvements. CloudFlare for example contributed some very useful features.
Open source also means one can examine the bug tracker (in most cases, some don't provide an open bug tracker) for known bugs and dive into the implementation in order to understand the inner workings in more detail if needed. I've made good use of this ability numerous times in the past.
The gist is: if a database, which is always a key part of a software architecture, is not open source, then it had better provide extremely convincing arguments to choose it over other products. Being open source, on the other hand, doesn't mean choosing a particular software is a no-brainer. There are tons and tons of open source databases with questionable quality. Open Source as a criterion is just one amongst many. I would for example not hesitate to use some hosted proprietary DB on AWS if it fit the project because I know AWS is unlikely to go away. But some smaller/young companies? The risk that they'll disappear unfortunately is very real in this industry.",0,16544075,0,"[]","",0,"","[]",0 16563721,0,"story","r4um","2018-03-11 16:57:30.000000000","",0,0,0,"[]","https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/",2,"HTTP Analytics for 6M requests per second using ClickHouse","[]",0 20163017,0,"story","valyala","2019-06-12 09:28:32.000000000","",0,0,0,"[]","https://www.altinity.com/blog/2019/6/11/clickhouse-local-the-power-of-clickhouse-sql-in-a-single-command",2,"ClickHouse-local: The power of ClickHouse SQL in a single command","[]",0 20167441,0,"comment","vinay_ys","2019-06-12 17:59:06.000000000","For such a small scale, you can use a simple event tracking schema from your client-side and server-side code and have a simple stream processor to join these events and then save it to a simple event table in a SQL database. The DB tech you choose should be something suitable for OLAP workloads. For your scale, PostgreSQL or MySQL would just work fine. When your data grows you can look at more distributed systems like Vertica or Memsql or Clickhouse etc.
In this architecture, most of your brain cycles will go into designing the queries that generate aggregates at regular intervals from the raw events table and store them in various aggregate tables. You should be familiar with fact and dimension tables as understood in a data warehouse context.
Most real-world data though is messy, and defining a uniqueness constraint upfront (upon ingestion) is often limiting, so for practical use cases this gets relaxed to a multi-set rather than a sparse array model for storage, with uniqueness imposed in some way after the fact (if required).",0,23899545,0,"[23901308]","",0,"","[]",0 16597098,0,"story","derekperkins","2018-03-15 22:42:28.000000000","",0,0,0,"[]","https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/",1,"HTTP Analytics for 6M requests per second using ClickHouse","[]",0 27024086,0,"story","thegurus","2021-05-03 12:14:42.000000000","",1,0,0,"[]","https://medium.com/datatau/how-to-connect-to-clickhouse-with-python-using-sqlalchemy-760c8df8d753",1,"How to connect to ClickHouse with Python using SQLAlchemy","[]",0 23906494,0,"comment","jordic","2020-07-21 13:06:26.000000000","We started evaluating ClickHouse after seeing that Sentry is using it",0,23905720,0,"[]","",0,"","[]",0 23906734,0,"comment","mbell","2020-07-21 13:46:43.000000000","We had major issues with scaling InfluxDB. We use ClickHouse (graphite table engine) now and it is more than an order of magnitude more resource efficient.",0,23906556,0,"[]","",0,"","[]",0 23906735,0,"comment","pachico","2020-07-21 13:47:03.000000000","I approached InfluxDB since it looked promising. It actually served its purpose when it was simple, and Telegraf was indeed handy. Now that I have more mature requirements I can't wait to move away from it. It freezes frequently, its UI, Chronograph, is really rubbish, functions are very limited, and managing continuous queries is tiresome.
I'm now having better results and a better experience storing data in ClickHouse (yes, not a time-series DB).
From time to time I also follow what's coming in InfluxDB 2.0 but I must confess that 16 betas in 8 months are not very promising.
It might just be me.",0,23906165,0,"[23906790,23907550,23907439]","",0,"","[]",0 23907439,0,"comment","gregwebs","2020-07-21 15:14:00.000000000","Have you tried Victoria Metrics? It originally started as a ClickHouse fork for time series but is now a rewrite keeping some of the same principles.",0,23906735,0,"[23908016]","",0,"","[]",0 27027649,0,"comment","caseyaedwards","2021-05-03 17:04:19.000000000","Tesla | Site Reliability Engineer (SRE), Manufacturing Systems | Fremont, CA or Austin, TX
The Core Automation Services (CAS) team at Tesla is building applications to enable manufacturing, with an eye towards reliability, availability, scalability, speed and security. We're a diverse team composed of Controls Automation Engineers, Software Engineers, and various other disciplines that help facilitate automated manufacturing processes. As an SRE on the CAS team you'll be working with the infrastructure, systems and applications that act as the middleware layer between Programmable Logic Controllers (PLCs) and the outside world, such as Databases, MES systems and other services.
Location: Fremont, CA | Austin, TX
Responsibilities:
* Support interim HMI/SCADA vendor application (Ignition from Inductive Automation)
* Build tooling around it, evaluate its usage, and help ensure its reliability, availability and security
* Design software and systems that enable automated manufacturing at Tesla
* Assist Software, Controls, Manufacturing and other types of Engineers with onboarding and integrating services into the Tesla technology stack
* Ensure best practices and observability of the service, such as metrics, logging, tracing, and alerting
* Automate configuration and deployment of services
* Consult on and design infrastructure, systems and application architecture
Apply at: https://www.tesla.com/careers/search/job/site-reliability-en...
=======================
Tesla | Database Site Reliability Engineer, Manufacturing Systems | Fremont, CA or Austin, TX
As a Database SRE on the Core Automation Services (CAS) team you'll be setting up and managing the databases, including MySQL, CockroachDB, FoundationDB, Clickhouse, and InfluxDB that back various software and systems that enable manufacturing in our various factories.
Location: Fremont, CA | Austin, TX
Responsibilities:
* Evaluate current database deployments and make recommendations for how to improve their reliability, availability, scalability and security
* Design and implement automation for managing the deployment and upgrades of the databases
* Define Disaster Recovery and Business Continuity plans for the various database deployments
* Assist Software, Controls, Manufacturing and other types of engineers with using databases sustainably
* Ensure best practices and observability of the databases, such as metrics, logging, tracing, and alerting
* Consult on and design infrastructure, systems and application architecture
Requirements:
* Experience with running databases on bare-metal or VMs
* Expert skills in Linux and its administration
* Experience in a high level language such as Go, Python and/or Java
* Understand the concepts of Observability and Infrastructure as Code
* Comfortable on an on-call rotation
* Comfortable doing live troubleshooting of issues on NOC bridges/outage calls
* Habitual documenter and spreader of knowledge
* Willing to mentor other team members and engineers with less database knowledge
* Strong bias for action vs endless planning, willing to get hands dirty and make mistakes sometimes
* 3+ years as DBA/SRE
Apply at: https://www.tesla.com/careers/search/job/database-site-relia...",0,27025922,0,"[]","",0,"","[]",0
27027861,0,"comment","hodgesrm","2021-05-03 17:18:46.000000000","ClickHouse has lambdas for arrays. They are very useful. Here's an example:
WITH ['a', 'bc', 'def', 'g'] AS array
SELECT arrayFilter(v -> (length(v) > 1), array) AS filtered
┌─filtered─────┐
│ ['bc','def'] │
└──────────────┘
The lambda in this case is a selector for strings with more than one character. I would not argue that they are as general as Pandas, but they are mighty useful. More examples in the following article: https://altinity.com/blog/harnessing-the-power-of-clickhouse...",0,27026802,0,"[]","",0,"","[]",0 27029335,0,"comment","mescudi","2021-05-03 19:05:37.000000000","Location: Kazakhstan
Remote: Yes, English/Russian
Willing to relocate: Yes
Technologies: kubernetes (k8s), golang, aws, python, docker, helm, linux, terraform, ansible, clickhouse, ELK, cockroachdb, aerospike
Resume: https://tinyurl.com/2rpdpzm9
Email: nurtas977@gmail.com",0,27025920,0,"[]","",0,"","[]",0 27029410,0,"comment","hodgesrm","2021-05-03 19:11:15.000000000","Altinity | ClickHouse Support & Data Engineer | Full-time | North America | Remote
Altinity helps enterprise companies succeed with ClickHouse, a popular SQL data warehouse. We offer managed ClickHouse in Amazon plus 24x7 support for self-managed installations. We're an Accel portfolio company with a rapidly growing business. If you like databases and cloud technology, you'll love working at Altinity.
We have a variety of great positions, but the one I would like to highlight is support. We're looking for database experts who are customer-focused puzzle solvers. You will help users design, deploy, and operate analytic apps that deliver low-latency response on enormous datasets.
Check out our ClickHouse Support & Data Engineer position and many others at https://altinity.com/careers today!",0,27025922,0,"[]","",0,"","[]",0 27030194,0,"comment","0xferruccio","2021-05-03 20:24:50.000000000","June (YC W21) | Founding Engineer | Remote | Full-time
June (https://june.so) is instant product analytics. We connect to Segment and automatically generate graphs of the metrics companies should track.
- Join a team of 3
- Product minded, talk with users, write code, scale the infrastructure
- Interesting challenges (like scaling a Clickhouse cluster and writing a DSL for filtering user cohorts)
Learn more: https://www.notion.so/Founding-Engineer-339274009f594b58aff3...
If interested reach out at work [at] june [/dot] so",0,27025922,0,"[]","",0,"","[]",0 23917541,0,"comment","valyala","2020-07-22 14:43:28.000000000","Did you try ClickHouse? [1]
We were successfully ingesting hundreds of billions of ad serving events per day to it. It is much faster at query speed than any Postgres-based database (for instance, it may scan tens of billions of rows per second on a single node). And it scales to many nodes.
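As a toy illustration of where that scan speed comes from (this is not ClickHouse code, just the column-layout idea): an aggregate like SUM over one column only has to read that column's contiguous buffer, not every field of every row.

```python
from array import array

# Row layout: each row carries all fields together.
rows = [(i, "GET", i % 500, i * 10) for i in range(100_000)]
sum_rows = sum(r[3] for r in rows)  # touches whole tuples to read one field

# Column layout: one tightly packed array per field.
bytes_col = array("q", (i * 10 for i in range(100_000)))
sum_col = sum(bytes_col)  # sequential scan over one contiguous buffer

assert sum_rows == sum_col
print(sum_col)  # 49999500000
```

Real column stores add vectorized execution and compression on top, but the locality win is the starting point.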
While it is possible to store monitoring data in ClickHouse, it may be non-trivial to set up. So we decided to create VictoriaMetrics [2]. It is built on design ideas from ClickHouse, so it offers high performance in addition to ease of setup and operation. This is borne out by publicly available case studies [3].
[2] https://github.com/VictoriaMetrics/VictoriaMetrics/
[3] https://victoriametrics.github.io/CaseStudies.html",0,23908292,0,"[23918104]","",0,"","[]",0 23918104,0,"comment","gshulegaard","2020-07-22 15:33:34.000000000","ClickHouse's initial release was circa 2016 IIRC. The work I was doing at NGINX predates ClickHouse's initial release by 1-2 years.
ClickHouse was certainly something we evaluated later on when we were looking at moving to a true columnar storage approach, but like most columnar systems there are trade-offs.
* Partial SQL support.
* No transactions (not ACID).
* Certain workloads are less efficient like updates and deletes, or single key look ups.
None of these are unique to ClickHouse; they are fairly well-known trade-offs most columnar stores make to improve write throughput and prioritize highly scalable sequential read performance. As I mentioned before, the amount of data we were ingesting never really reached the limits of even Postgres 9.4, so we didn't feel like we had to make those trade-offs...yet.
I would imagine that servicing ad events is several factors larger scale than we were dealing with.",0,23917541,0,"[]","",0,"","[]",0 23918411,0,"comment","1996","2020-07-22 16:01:00.000000000","Speaking from my own experience, you may save yourself some future effort by moving directly to clickhouse.
Timescale brings its own issues. If your goal is performance, you will be better served by clickhouse.",0,23917266,0,"[23921122,23918569]","",0,"","[]",0 23918733,0,"comment","1996","2020-07-22 16:29:39.000000000","It would be too long. To quickly summarize, from the pain of backups (unless you setup a WAL replica, the load may take your database down), the large size of the data on disk (timescale does offer some compression now, but it's still too much), the low performance of large queries, the memory requirements - it's death by a thousand papercuts!
Don't get me wrong, timescale is a great way to get started with time series - just like sqlite is a great way to get started with databases if all you know is nosql.
However, it quickly brings its own challenges - and the new license is the cherry on the cake: it locks you down to your own infrastructure unless you want to pay for Timescale's own SaaS offering (and then pray they do not alter the conditions of the deal too much later)
It is just not worth it, unless you have a very small problem, or you can afford to have people concentrating on timescale maintenance - and in that case, you would be getting better bang for the buck by having those people work on clickhouse.
I'm speaking only from my own experience. I have relatively large servers dedicated to time series (about 100T of disk space, between 128 and 256 Gb of RAM). They were going to be retired for even bigger servers. Instead, we experimented with clickhouse on one of the recently decommissioned servers. We could not believe the benchmarks! Moving to clickhouse has improved the performance on about every metric. Yes, it required some minor SQL rewrites, about 1 day of work total, but unless your hardware is free and your queries are set in stone, clickhouse makes more sense.",0,23918569,0,"[23919184,23919371,23919132,23977275]","",0,"","[]",0 23919132,0,"comment","cevian","2020-07-22 17:06:28.000000000","(TimescaleDB engineer here) Some of the comments here sound technically off.
We've never seen a backup take down a machine. The backups we use are the same as Postgres which are used by millions of companies without a problem (and can be streaming incremental backups like pgBackrest, WAL-E, etc. or whole-database backups like pg_dump). As with any DB you do have to size and configure your database correctly (which these days isn't hard).
We've never seen anybody claim that ClickHouse offers significantly better compression than we do overall. Obviously compression depends heavily on data distribution and I'm sure you could make up a dataset where clickhouse does better (just as you could where TimescaleDB does better). But on real distributions we don't see this at all, we do pretty advanced columnar compression on a per-datatype basis [1], and see median space reduction of 95% from compression across users.
"Large queries" is a weird claim to make since Postgres has more types of indexes than ClickHouse and supports multiple indexes. If you are processing all of your data for all your queries then yes, ClickHouse sequential scans may be better. But that's less common, and it's also where TimescaleDB continuous aggregates come in.
We've seen customers successfully use our single-node version with 100s of billions of rows so claiming that we are just for small use-cases is simply untrue, and especially with the launch of multi-node TimescaleDB.
I understand people may have different preferences and experiences, but some of these felt a bit off to me.
[1] https://blog.timescale.com/blog/building-columnar-compressio...",0,23918733,0,"[23920405]","",0,"","[]",0 23919371,0,"comment","manigandham","2020-07-22 17:25:53.000000000","Timescale has added their own layer of compression and columnar layouts to the Postgres row storage. That will get you to around 70% of the performance of using a dedicated column-oriented data warehouse, with the rest depending on how complex and selective your queries are.
It won't match the pure scan and computation speed of Clickhouse but the continuous aggregation feature is the recommended approach for querying large datasets (similar to Clickhouse table engines like AggregatingMergeTree).",0,23918733,0,"[]","",0,"","[]",0 23920405,0,"comment","1996","2020-07-22 19:01:42.000000000","On servers with a very high CPU load, backup done without using continuous streaming to a second server and done from this second server, something (I have stopped using timescale so I can't tell you what did) during backup caused a peak in load and IO, impacting read and write performance of the primary server, causing a cascading failure of the processes due to timeouts, eventually taking the server down due to swap issues and OOM triggering a reboot.
So we stopped doing backups. Actually, that's how we started using clickhouse: for cold storage, as the files in /var/lib/clickhouse used far less storage space and caused fewer issues. Eventually the same data was sent both to timescaledb and clickhouse, as a poor man's backup. Finally, timescaledb was removed.
> As with any DB you do have to size and configure your database correctly (which these days isn't hard).
Thanks for supposing we didn't try. We did not end up with 256Gb of RAM per server for no reason.
All I'm saying is that Timescale totally has a place, but not beyond a certain scale and complexity.
> We've never seen anybody claim that ClickHouse offers significantly better compression than we do overall
Altinity does, so do a few others. manigandham above says that you are now at 70% of what clickhouse does. I'm not saying you're not improving. It was just one of too many issues we had to fight.
Also, you have only recently introduced compression - good, but I'm not aware if you already offer something like DateTime Codec(DoubleDelta, LZ4), or the choice of compression algorithms. LZ4 can be slow, so there is a choice between various alternatives.
For example, T64 calculates the max and min values for the encoded range, and then strips the higher bits by transposing a 64-bit matrix. Sometimes it makes sense. zStd is slower than T64 but needs to scan less data, which makes up for it. Sometimes it makes more sense.
Large databases need more flexibility.
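For readers who haven't met it, the delta-of-delta idea behind a codec like DoubleDelta can be sketched in a few lines (an illustration only, not ClickHouse's actual bit-level format):

```python
def double_delta_encode(timestamps):
    """Delta-of-delta: regularly spaced timestamps collapse to long runs
    of zeros, which a downstream compressor (LZ4, ZSTD, ...) squeezes
    extremely well."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dod = [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0]] + dod

def double_delta_decode(encoded):
    """Reverse the encoding: rebuild deltas, then the original values."""
    if len(encoded) < 2:
        return list(encoded)
    first, dod = encoded[0], encoded[1:]
    deltas, d = [], 0
    for step in dod:
        d += step
        deltas.append(d)
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

# Samples arriving every 10s, with one jittered interval:
ts = [1000, 1010, 1020, 1030, 1045, 1055]
enc = double_delta_encode(ts)
print(enc)  # [1000, 10, 0, 0, 5, -5]
assert double_delta_decode(enc) == ts
```

The real codec additionally bit-packs those near-zero values; this sketch only shows why they end up near zero.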
> If you are processing all of your data for all your queries then yes, ClickHouse sequential scans may be better
I confirm, it is better.
And for some workloads, continuous aggregates make no sense.
> We've seen customers successfully use our single-node version with 100s of billions of rows so claiming that we are just for small use-cases is simply untrue, and especially with the launch of multi-node TimescaleDB
I have about 50Tb of data per server. What is below 1Tb I call "small use cases".
> I understand people may have different preferences and experiences, but some of these felt a bit off to me.
When I was trying to use timescaledb and reported weird issues, I got the same response: my use case and bug report felt "off" to the person I reported them to.
Maybe that is why they weren't addressed - or only much later, when reported by more clients?
Personally, I have no horse in this race. If you become better than clickhouse for my workload, and if the license changes to allow me to deploy to a cluster of AWS servers (just in case we ditch our own hardware), I will consider timescale again in the future.
For now, I'm watching it evolve, and slowly address the outstanding issues, like disk usage, and performance. By your own admission and benchmarks, you are now at 70% of what clickhouse does - in my experience, the actual difference is much higher.
But I sincerely hope you succeed and catch up, as more software diversity is always better.",0,23919132,0,"[23920527]","",0,"","[]",0 23920476,0,"comment","1996","2020-07-22 19:10:17.000000000","> InfluxDB is purely proprietary (paid, closed source).
And clickhouse is not. I just suggest skipping the timescaledb step to someone migrating from influx, and going straight to clickhouse.
> For the TSL version, what it primarily restricts is the cloud providers like AWS and Azure from offering TimescaleDB-as-a-service (e.g., TimescaleDB Community on AWS RDS)
If there is some kind of emergency and I need to have the database on the cloud, this is a serious restriction. It limits my choices and constrains my actions.
> Many thousands of companies use our community version for free to build SaaS services running on their own AWS instances.
We have our servers, so it wasn't an issue. It was more of a long term concern, a chilling effect: what else may be restricted in the future?
Again, I think timescaledb has a wonderful place. It will certainly become the entry level database for timeseries.
It is just not suited for our workload.
On the compression point:
- I believe the Altinity Benchmarks [0] are from 2018, on TimescaleDB 0.12.1. TimescaleDB has gotten much better since then (now on version 1.7.2), and most notably, offers native compression now (it did not then).
- I believe manigandham's 70% comment is more of an offhand estimate and not a concrete benchmark. But perhaps he can weigh in. :-)
- Re: compression algorithms, TimescaleDB now employs several best-in-class algorithms, including delta-delta, gorilla, Simple-8b RLE. Much more than just LZ4. [1]
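The gorilla idea in particular is easy to convey - a toy sketch of its XOR step (an illustration, not the actual bit packing either engine uses):

```python
import struct

def xor_stream(values):
    """Gorilla-style intuition: XOR each float's bit pattern with the
    previous one. Repeated or slowly changing values XOR to zero (or to
    a few set bits), which can then be encoded in very few bits."""
    bits = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

# A flat-lining gauge: every repeated sample XORs to exactly 0.
readings = [21.5, 21.5, 21.5, 21.5, 22.0]
xors = xor_stream(readings)
print(xors[1:4])  # [0, 0, 0]
```

The production codecs then spend roughly one bit per unchanged sample; this sketch only shows where the zeros come from.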
Overall, I don't think anyone has done a real storage comparison between TimescaleDB and Clickhouse since we launched native compression. It's on our todo list, but we also welcome external benchmarks. But based on what we've found versus similar systems, I suspect our storage usage would be really similar.
[0] https://www.altinity.com/blog/clickhouse-for-time-series
[1] https://blog.timescale.com/blog/time-series-compression-algo...",0,23920405,0,"[]","",0,"","[]",0 16617473,0,"comment","ryanworl","2018-03-19 11:00:16.000000000","For 5, check out Clickhouse. It isn’t identical, but scanning a trillion rows a second is just a matter of sharding the data into enough nodes.
https://clickhouse.yandex",0,16617317,0,"[]","",0,"","[]",0 16617615,0,"comment","olavgg","2018-03-19 11:28:54.000000000","With the ClickHouse OLAP db I'm limited by memory bandwidth.",0,16617469,0,"[]","",0,"","[]",0 27042202,0,"comment","oandrew","2021-05-04 20:01:37.000000000","How does this compare to Apache Echarts (https://echarts.apache.org/en/index.html) ?
p.s. My favorite data exploration toolkit is Clickhouse + Tabix (it uses apache echarts). e.g. https://tabix.io/doc/draw/Draw_Chart/",0,27036768,0,"[27042232]","",0,"","[]",0 23920751,0,"comment","mfreed","2020-07-22 19:40:49.000000000","Hey, thanks for the continued discussion:
> If there is some kind of emergency and I need to have
> the database on the cloud, this is a serious restriction.
> It limits my choices and constrains my actions.
> if the license change to allow me to deploy to a cluster
> of AWS servers (just in case we ditch our own hardware),
You can deploy TimescaleDB on AWS servers (the TSL certainly allows it). Most of our users do. They don't run their own hardware. You can even use our Apache-2 k8s helm charts [1] to immediately spin up a cluster of replicated TimescaleDB nodes with automated leader-election/failover and continuous incremental backup. The helm charts have first-class support for AWS EKS.
What the TSL prevents is _Amazon_ offering TimescaleDB as a paid DBaaS service. To my knowledge, none of the major cloud vendors offer Clickhouse as a first-class paid service, so that's somewhat a moot point. I guess theoretically Amazon could launch Clickhouse-as-a-service, but that theoretical possibility doesn't help you in your emergency.
[1] https://github.com/timescale/timescaledb-kubernetes",0,23920476,0,"[]","",0,"","[]",0 23921119,0,"comment","fiddlerwoaroof","2020-07-22 20:16:14.000000000","Apart from license issues, Clickhouse is just really impressive: it’s a minor pain operationally, but in our tests it left all the Postgres-based timeseries solutions in the dust for real-time analytics without rollup tables.",0,23920476,0,"[23928568]","",0,"","[]",0 23921122,0,"comment","nwmcsween","2020-07-22 20:16:26.000000000","The issue I have with clickhouse is the codebase, it's an absolute behemoth and seemingly embeds musl libc? It also uses a huge amount of SIMD intrinsics for everything when SWAR or really nothing from my view looking in would have been better (memcpy, etc).",0,23918411,0,"[23923432]","",0,"","[]",0 23923432,0,"comment","jstrong","2020-07-23 01:53:24.000000000","caveat: I'm not familiar with the clickhouse codebase. however, usually "embedding" musl libc is about portability - it allows you to build an entirely static binary that can run on practically any box. is this different?
secondly, I don't get where you're coming from faulting a columnar database for using SIMD! why is memcpy better? it comes across as, 'they worked too hard making it fast'!",0,23921122,0,"[]","",0,"","[]",0 23928568,0,"comment","1996","2020-07-23 15:46:50.000000000","> Clickhouse is just really impressive: it’s a minor pain operationally, but in our tests it left all the Postgres-based timeseries solutions in the dust for real-time analytics without rollup tables
There are many things I'm willing to tolerate with that level of performance!",0,23921119,0,"[]","",0,"","[]",0 20221036,0,"comment","andrea_s","2019-06-19 05:31:21.000000000","MongoDB is not well suited for OLAP-style workloads - have you considered Yandex ClickHouse?",0,20219803,0,"[]","",0,"","[]",0 16624045,0,"comment","manigandham","2018-03-19 22:47:55.000000000","It'll be faster than Redshift and about the same as ClickHouse, +/- depending on hardware and setup.
It's a great system, we used it for 2 years and it's one of the most polished databases out there with a simple MySQL interface. It's more general purpose than kdb, with a nice rowstore + columnstore architecture. I believe they're adding full-text search indexes in the latest version too.
If you need the query language, the advanced/asof joins, or the tightly integrated query/process environment, then there's no match for kdb though.",0,16621988,0,"[]","",0,"","[]",0 27051613,0,"comment","throwaway375","2021-05-05 16:00:03.000000000","I think the idea and promise of Timescale is great, but the current (well, actually I tried it a year ago) state of things makes it very hard to choose Timescale over Clickhouse. I tried to set up a simple Twitter parser for trends analysis, so I needed a few thousand counters every few seconds. While I did not encounter any performance issues, size on disk was a huge deal. I don't remember precise numbers, but Clickhouse used a few orders of magnitude less disk space. And while Timescale has nice things like materialized views, Clickhouse has them too. And apart from them, Clickhouse has excellent data compression algorithms for repeated key-value-type counters. So it becomes really hard to understand why Timescale. It aims to help you with tables bigger than traditional pg can handle, but at the same time uses the same amount of space.
Btw, ClickHouse is under Apache 2 license, which makes it much easier to use in big companies.
[0] https://blog.cloudflare.com/http-analytics-for-6m-requests-p...
[1] https://eng.uber.com/logging/
[2] https://altinity.com/blog/clickhouse-for-time-series",0,27051613,0,"[]","",0,"","[]",0 23941002,0,"comment","tristor","2020-07-24 15:56:24.000000000",""serialized" here doesn't really mean processed serially, it means "serializable" in the context of database transaction theory. Databases have special concurrency control requirements in order to create hard guarantees on database consistency. You can process queries in parallel and still have a serializable result, because of transaction coordination. Doing this on one server is much easier than doing it across a cluster of servers.
So in your case, MVCC is what you're talking about, which is not the same level of consistency guarantee as serializable; rather, it is based on snapshot isolation. Some database vendors consider them effectively the same isolation level because the anomalies associated with other common non-serializable isolation levels aren't typically present in most MVCC implementations, but there's a lot more complexity here than you are acknowledging.
Mixing OLTP and OLAP workloads on the same database is pretty much always a bad idea. This is why it's common practice to use ETL jobs to move data from an OLTP optimized database like Postgres or MySQL to a separate database for OLAP (which could be another MySQL or PG instance, or could be something like ClickHouse or another columnar database optimized for OLAP). Just because you /can/ do something, doesn't mean you /should/ do something...",0,23940476,0,"[23944638,23942683]","",0,"","[]",0 23941733,0,"comment","MrBuddyCasino","2020-07-24 16:53:33.000000000","I've used Cassandra, it's not that impressive. Much slower than the C++ rewrite (ScyllaDB?), latency issues due to GC, can’t hold a candle to Clickhouse. And they’ve been optimizing it for a long time now.",0,23941592,0,"[23941997]","",0,"","[]",0 23941997,0,"comment","peferron","2020-07-24 17:15:34.000000000","Cassandra and ClickHouse are designed to do different things. To flip things around, have you compared the latency of a single-row update or delete in Cassandra vs ClickHouse?",0,23941733,0,"[23942939,23945803]","",0,"","[]",0 23942939,0,"comment","hodgesrm","2020-07-24 18:30:01.000000000","Or the fact that Cassandra uses consistent hashing to distribute data automatically across hosts.
My company supports ClickHouse, but there are many use cases where it's simply not the right solution.",0,23941997,0,"[]","",0,"","[]",0 23945803,0,"comment","MrBuddyCasino","2020-07-25 00:17:12.000000000","If you care about the latency of a single row update or delete, Clickhouse is definitely the wrong tool for the job. First, it doesn’t really have deletes (afaik). Second, you need to batch updates aggressively to get good throughput.
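The batching pattern can be sketched as a toy client-side buffer (illustrative only; not any real ClickHouse client API — a real client would send each flushed block as one bulk INSERT):

```python
class BatchWriter:
    """Buffer rows and flush in large blocks instead of many tiny inserts."""

    def __init__(self, flush_every=10_000):
        self.flush_every = flush_every
        self.buffer = []
        self.flushes = 0

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushes += 1  # one bulk insert instead of thousands of small ones
            self.buffer.clear()

w = BatchWriter(flush_every=10_000)
for i in range(25_000):
    w.write((i, "event"))
w.flush()  # flush the tail
print(w.flushes)  # 3
```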
But you’re right C* and CH are designed to do different things. I just found the difference in general performance across everything (startup, schema changes, throughput, query performance, optimization opportunities) to be quite pronounced. One feels like a race car, the other not so much.",0,23941997,0,"[]","",0,"","[]",0 27070808,0,"comment","zX41ZdbW","2021-05-07 01:07:41.000000000","What are the largest datasets in Splitgraph? Can I list the datasets sorted by size?
We have the need for large public datasets for testing ClickHouse: https://clickhouse.tech/docs/en/getting-started/example-data...",0,27070056,0,"[27071136]","",0,"","[]",0 27071136,0,"comment","chatmasta","2021-05-07 02:06:29.000000000","On the public DDN (data.splitgraph.com:5432), we enforce a (currently arbitrary) 10k row limit on responses. You can construct multiple queries using LIMIT and OFFSET, or you can run a local Splitgraph engine without a limit. We also have a private beta program if you want a managed or self-hosted cloud deployment with the full catalog and DDN features. And we are planning to ship some "export to..." type workflows for exporting to CSV and potentially other formats.
For live/external data, we proxy the query to the data source, so there is no theoretical data size limit except for any defined by the upstream.
For snapshotted data, we store the data as fragments in object storage. Any size limit depends on the machine where Splitgraph's Postgres engine is running, and how you choose to materialize the data when downloading it from object storage. You can "check out" an entire image to materialize it locally, at which point it will be like any other Postgres schema. Or you can use "layered querying" which will return a result set while only materializing the fragments necessary to answer the query.
Regarding ClickHouse, you could watch this presentation [0] my co-founder Artjoms gave at a recent ClickHouse meet-up on the topic of your question. We also have specific documentation for using the ClickHouse ODBC client with the DDN [1], as well as an example reference implementation. [2]
[0] https://www.youtube.com/watch?v=44CDs7hJTho
[1] https://www.splitgraph.com/connect
[2] https://github.com/splitgraph/splitgraph/tree/master/example...",0,27070808,0,"[]","",0,"","[]",0 20251614,0,"comment","danlark","2019-06-22 20:39:12.000000000","We tried mimalloc in ClickHouse and it is two times slower than jemalloc in our common use case https://github.com/microsoft/mimalloc/issues/11",0,20249743,0,"[20251821,20306843,20251823]","",0,"","[]",0 20256904,0,"comment","bsg75","2019-06-23 17:01:55.000000000","TiDB is not (currently) as commonly used as the other things in this stack. But I guess KOS(NewSQL) does not make for a good acronym.
Maybe KOSS where the last S covers anything from TiDB and Cockroach to Clickhouse and S3 (Parquet, Avro, etc.).",0,20255668,0,"[]","",0,"","[]",0 23968609,0,"story","zX41ZdbW","2020-07-27 19:37:56.000000000","",0,0,0,"[23969473]","https://www.percona.com/blog/2020/07/27/clickhouse-and-columnstore-in-the-star-schema-benchmark/",6,"ClickHouse and ColumnStore in the Star Schema Benchmark","[]",1 23969473,0,"comment","PeterZaitsev","2020-07-27 21:13:51.000000000","Was interesting to compare how Clickhouse and MariaDB Columnstore have improved over the years. Clickhouse still kicks ass :)",0,23968609,0,"[]","",0,"","[]",0 23975926,0,"story","davidquilty","2020-07-28 14:08:55.000000000","",0,0,0,"[]","https://www.percona.com/blog/2020/07/27/clickhouse-and-columnstore-in-the-star-schema-benchmark/",4,"ClickHouse and ColumnStore in the Star Schema Benchmark","[]",0 23977017,0,"comment","gregwebs","2020-07-28 15:51:11.000000000","I am still hoping to see comparisons to Victoria Metrics, which also shows much better performance than many other TSDBs. Victoria Metrics is Prometheus compatible whereas QuestDB now supports Postgres compatibility. Both have compatibility with InfluxDB.
The Victoria Metrics story is somewhat similar: someone tried using Clickhouse for large time series data at work and was astonished at how much faster it was. He then made a reimplementation customized for time series data and the Prometheus ecosystem.",0,23975807,0,"[23984208]","",0,"","[]",0 23977275,0,"comment","valyala","2020-07-28 16:14:53.000000000","Did you try VictoriaMetrics for storing time series data? This is a specialized high-performance time series database based on ClickHouse ideas [1].
[1] https://medium.com/@valyala/how-victoriametrics-makes-instan...",1,23918733,0,"[]","",0,"","[]",0 23977522,0,"comment","valyala","2020-07-28 16:32:45.000000000","The following time series databases are popular right now:
* ClickHouse (this is a general-purpose OLAP database, but it is easy to adapt it to time series workloads)
* InfluxDB
* TimescaleDB
* M3DB
* Cortex
* VictoriaMetrics
The last three of these TSDBs support the PromQL query language - the most practical query language for typical time series queries [1]. So I'd recommend starting by learning PromQL and then evaluating the time series databases from the list above.
[1] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...",0,23921331,0,"[]","",0,"","[]",0 23978732,0,"comment","pachico","2020-07-28 17:58:51.000000000","It would then be in your interest to know ClickHouse. I recommend you have a look at it.",0,23978257,0,"[23978793]","",0,"","[]",0 23978793,0,"comment","j1897","2020-07-28 18:03:59.000000000","We've had one of their contributors bench questdb versus Clickhouse recently - you can find the results here https://github.com/questdb/questdb/issues/436
This came from a benchmark we had on our previous website versus them about summing 1 billion doubles.",0,23978732,0,"[]","",0,"","[]",0 23978850,0,"comment","pachico","2020-07-28 18:08:09.000000000","I see this as a very interesting project. I use ClickHouse as OLAP and I'm very happy with it. I can tell you the features that make me stick to it. If some day QuestDB offers them, I might explore the possibility to switch, but never before.
- very fast (I guess we're aligned here)
- real-time materialized views for aggregation functions (this is absolutely a killer feature that makes it quite pointless to be fast if you don't have it)
- data warehouse features: I can join different data sources in one query. This allows me to join, for instance, my MySQL/MariaDB domain DB with it and produce very complete reports.
- Grafana plugin
- very easy to shard/scale at table level
- huge set of functions, from geo to URL, from ML to string manipulation
- dictionaries: I can load the MaxMind geo DB and do real-time geolocation in queries
I might add some more once they come to my mind. Having said this, good job!!!",0,23975807,0,"[23979059,23986665]","",0,"","[]",0 23979730,0,"comment","pachico","2020-07-28 19:31:03.000000000","Glad to be useful. On the other side, I can tell you that ClickHouse also misses a feature everyone in the community of users wishes for, which is automatic rebalancing when you add a new node (sort of what elasticsearch does).
And before I forget, the ClickHouse Kafka Engine is simply brilliant. The possibility of just publishing to Kafka and having your data not only inserted in your DB but also pre-processed is very powerful.
Let me know if I can help you with use cases we have.
Cheers",0,23979059,0,"[23980083,23980209]","",0,"","[]",0 23981281,0,"comment","gregmac","2020-07-28 22:18:06.000000000","I'd also be interested in hearing when QuestDB is not a good choice. Are there use cases where TimescaleDB, InfluxDB, ClickHouse or something else are better suited?",0,23976092,0,"[23981358,24027710,23981417]","",0,"","[]",0 23981564,0,"comment","beagle3","2020-07-28 22:55:10.000000000","One differentiating feature is "as of" join. You have records of the form (time, value), and you ask "what's the most recent value as of $time?"; on a non-TS oriented DBMS, this query is usually slow and hard to write. Window extensions to SQL can make it a little better, but you can assume that a proper TSDB answers this query x10 to x10,000 times faster on the same hardware, especially when done in bulk (e.g.: I have one million (time,bid_price) records, and one million (time,transaction_price) records; for each transaction record, I want to know what the most-recent bid price was at that time.)
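Conceptually, the as-of join is a binary search per probe row over time-sorted data. A toy sketch with made-up numbers (nothing like an actual TSDB implementation, which works on columnar storage):

```python
import bisect

# Hypothetical time-sorted (time, bid_price) quotes.
bids = [(1, 100.0), (3, 101.5), (7, 99.8), (12, 102.3)]
bid_times = [t for t, _ in bids]

def asof_bid(trade_time):
    """Most recent bid at or before trade_time - the 'as of' lookup for one row."""
    i = bisect.bisect_right(bid_times, trade_time) - 1
    return bids[i][1] if i >= 0 else None

# Join each trade time to the prevailing bid.
trades = [2, 7, 10]
print([asof_bid(t) for t in trades])  # [100.0, 99.8, 99.8]
```

In bulk, with both sides sorted by time, the whole join degenerates into a single merge pass, which is why TS-oriented engines answer it so quickly.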
That's something kdb+ and ClickHouse do in milliseconds; and I assume QuestDB can too, though I didn't check.",0,23976122,0,"[23984911]","",0,"","[]",0 23983841,0,"comment","j1897","2020-07-29 05:57:18.000000000","Not yet - there is a bench vs clickhouse that has been done by one of their contributors though, see below in the comments.
As I already said or rather asked there: Assume I already use Clickhouse for example. What are the benefits of QuestDB? Why should I use it instead?
Surely it's a good tech and competition is key. But what are the key points that should make me look into it? There is a lot of story about the making and such, but I don't see the "selling point".",0,23975807,0,"[23985760]","",0,"","[]",0 23985760,0,"comment","j1897","2020-07-29 12:13:46.000000000","Hey, one of the key differences here is that Clickhouse is owned by a large corporation, Yandex (the Google of Russia), and seems to drive its roadmap as a function of the needs of the company. We are committed to our community and driving our roadmap based on their needs rather than having to fulfill the needs of a parent company.
Ultimately, as a result we think that questDB will be a better fit for your community. We acknowledge that Clickhouse has a lot more features as of now, being a more mature product.",0,23985414,0,"[]","",0,"","[]",0 23985964,0,"comment","seektable","2020-07-29 12:39:13.000000000","For Excel-like analysis with pivot tables take a look at our https://www.seektable.com; it can connect to various DWs that are suitable for near real-time big-data analytics (like Redshift, BigQuery, Snowflake, Clickhouse). SeekTable can be deployed on-premise.
We can add a connector for QuestDB if someone is really interested in this.",0,23978869,0,"[]","",0,"","[]",0 27106640,0,"comment","kokizzu2","2021-05-10 14:58:32.000000000","oh batched '__') that makes sense.. clickhouse also fast if batched, but not fast when stormed with lots of small requests",0,27106459,0,"[27106862]","",0,"","[]",0 23986665,0,"comment","pupdogg","2020-07-29 13:53:34.000000000","If you don’t mind sharing, what are the specs of your ClickHouse cluster including Zookeepers? Also, how large of a dataset are you working with?",0,23978850,0,"[]","",0,"","[]",0 23990050,0,"comment","bdcravens","2020-07-29 18:30:37.000000000","No, but I think each piece of software is put in a proper context, to match what most would commonly use in that particular use case. For example, the Clickhouse benchmarks are run against typical modest cloud instances.",0,23989158,0,"[23990310]","",0,"","[]",0 23990310,0,"comment","tmostak","2020-07-29 18:53:12.000000000","To be fair, the c5d.9xlarge instances are $1.728 each per hour, or $5.18 for the 3-server cluster (looks to be about $3.06/hr for reserved 1-year pricing). Even with reserved pricing, that's $26,806 a year, or 6.5X more than a $4K laptop that likely will last for years and would be bought anyway (or at least a cheaper variant, which would also run these queries nearly as quickly). Of course that's very apples-to-oranges, so another way to look at this is that OmniSci would probably see significantly better performance on a single c5d.9xlarge than what we saw on this Mac (would need to benchmark, but informally I can say that OmniSci was 2-3X faster running on CPU on my Linux workstation compared to my Mac).
Disclaimer: No disrespect to ClickHouse here, it's an amazing system that I'm sure beats out OmniSci for certain workflows.",0,23990050,0,"[23994824]","",0,"","[]",0 23990851,0,"comment","zX41ZdbW","2020-07-29 19:49:15.000000000","Instructions, scripts and log can be found here: https://github.com/ClickHouse/ClickHouse/tree/master/benchma...",0,23990844,0,"[]","",0,"","[]",0 23990961,0,"comment","hodgesrm","2020-07-29 20:04:30.000000000","It would be great to understand why OmniSciDB does so well on this benchmark but seems to do far less well on others.
The ClickHouse team was (obviously!) very interested in Mark's result and tried out OmniSciDB on the standard analytics benchmark that CH uses to check performance. Results are here: https://presentations.clickhouse.tech/original_website/bench...
Anyway, really intriguing results from Mark. Looking forward to learning more about the source of the differences.
Disclaimer: I work at Altinity, which supports ClickHouse.
Edit: Fixed bad link",0,23986925,0,"[23992009,23994478,23991096]","",0,"","[]",0 27110988,0,"comment","fiddlerwoaroof","2021-05-10 21:16:14.000000000","I really wanted to migrate an analytics project I was working on from Elasticsearch to Postgres: however, when we sat down and ran production-scale proofs of concepts for the change, ClickHouse handily outclassed all the Postgres-based solutions we tried. (A Real DBA might have been able to solve this for us: I did some tuning, but I’m not an expert). ClickHouse, however, worked near-optimally out of the box.",0,27110663,0,"[]","",0,"","[]",0 27111552,0,"comment","dreyfan","2021-05-10 22:15:51.000000000","Take a look at Kimball/Star-Schema. It's worked extremely well as a data warehouse technique for decades. That said, I think modern offerings (e.g. Clickhouse) are superior in most use cases, but it's definitely not impossible on a traditional row-oriented RDBMS.",0,27111254,0,"[27116889]","",0,"","[]",0 27112366,0,"comment","FridgeSeal","2021-05-10 23:44:01.000000000","> Because of this dedicated data warehouses…use column-oriented storage and don't have indexes.
Well, that’s not really correct, is it? ClickHouse for one definitely has them, as did Snowflake the last time I used it.
This is a lot of work to go through to avoid using the right tool for the job. Just use something like ClickHouse or even DuckDB and reap the benefits of better performance with fewer caveats.",0,27109960,0,"[27112401,27113744,27112791,27137432]","",0,"","[]",0 27113744,0,"comment","tomnipotent","2021-05-11 02:57:18.000000000","It's correct.
Snowflake does not have indexes, and ClickHouse indexes are what they call "data skipping indexes". BigQuery, Redshift, Netezza, and Vertica also do not have support for indexes.",0,27112366,0,"[27115818,27114449]","",0,"","[]",0 27114449,0,"comment","FridgeSeal","2021-05-11 05:12:52.000000000","> ClickHouse indexes are what they call "data skipping indexes".
That’s still an index though, isn’t it? It might work slightly differently, but the purpose is still the same.",0,27113744,0,"[27114485]","",0,"","[]",0 27114485,0,"comment","tomnipotent","2021-05-11 05:19:09.000000000","Not in the same sense we consider them in traditional databases, used to find specific rows (needle in haystack). ClickHouse indexes are used to eliminate data pages to reduce IO for range queries, not to find specific rows by value.",0,27114449,0,"[27115851]","",0,"","[]",0 27115818,0,"comment","zX41ZdbW","2021-05-11 08:55:08.000000000","ClickHouse has both variants: primary and secondary indexes.
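The page-elimination idea can be sketched generically with per-block min/max metadata (an illustration of the concept, not ClickHouse's actual on-disk format):

```python
# Rows are stored in blocks; each block keeps min/max of a column.
blocks = [
    {"min": 0,   "max": 99,  "rows": list(range(0, 100))},
    {"min": 100, "max": 199, "rows": list(range(100, 200))},
    {"min": 200, "max": 299, "rows": list(range(200, 300))},
]

def range_query(lo, hi):
    """Scan only blocks whose [min, max] overlaps [lo, hi]; skip the rest."""
    hits, scanned = [], 0
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            continue  # skipped: no IO spent on this block
        scanned += 1
        hits.extend(r for r in b["rows"] if lo <= r <= hi)
    return hits, scanned

hits, scanned = range_query(150, 160)
print(len(hits), scanned)  # 11 1
```

The metadata never points at individual rows, which is the sense in which it is not an index "to find specific rows by value".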
The primary key is a range index - to quickly locate records. Secondary indexes are data skipping indexes - to quickly skip blocks of data.",0,27113744,0,"[]","",0,"","[]",0 27115840,0,"comment","zX41ZdbW","2021-05-11 08:58:18.000000000","ClickHouse has support for block range indexes as well. If you want to check open-source implementations - this is probably the best place to do it.",0,27112791,0,"[]","",0,"","[]",0 27117338,0,"comment","dreyfan","2021-05-11 12:38:50.000000000","Yes, in Clickhouse you’d generally take a denormalized approach.",0,27116889,0,"[]","",0,"","[]",0 24003120,0,"comment","champtar","2020-07-30 21:36:39.000000000","If they want to migrate to something else, they need to have a look at ClickHouse. When switching from Elasticsearch to ClickHouse 1.5 years ago, I reduced my storage needs by 20x, gained SQL, performance, and a ton of analytics features.
In hindsight I would say that Elasticsearch is for full text queries, and if you are using it for something else (access logs) there is a good chance this is the wrong tool for the job.",0,23998503,0,"[]","",0,"","[]",0 20304917,0,"comment","stareatgoats","2019-06-28 14:58:13.000000000","Not to mention ClickHouse",0,20304087,0,"[]","",0,"","[]",0 16709888,0,"comment","xstartup","2018-03-29 18:44:16.000000000","We use clickhouse cluster with 1000 nodes and 50000 GB clickstream data.",0,16709630,0,"[16709938]","",0,"","[]",0 16709938,0,"comment","_wmd","2018-03-29 18:48:24.000000000","That's only 50gb per node. Why do you/Clickhouse need so many nodes?",0,16709888,0,"[16713940]","",0,"","[]",0 27137230,0,"comment","hodgesrm","2021-05-13 01:00:26.000000000","ClickHouse can handle multiple joins just fine and has for a while. I just gave a conference talk on CH this morning that discussed this exact topic, among others.
The fact is that for large datasets scans on denormalized fact tables parallelize well, which means you can (a) offer stable performance and (b) scale more efficiently. This is important for use cases like web analytics, where users play around with different dimensions and measures but still expect consistent response. Note also, dimensions for things like Year, Month, Week, and the like compress absurdly well. It is often way faster to scan these values than to join them.",0,27116889,0,"[]","",0,"","[]",0 27137432,0,"comment","hodgesrm","2021-05-13 01:30:32.000000000","In addition to other example cited here, Druid has bitmap indexes on dimension columns. So, it's hard to make a hard and fast generalization. All databases use indexes to a greater or lesser extent.
To my mind the big differences for data warehouses are the following:
(a) Table scans are relatively cheap thanks to columnar structure and compression. It's much more important to tune compression than indexes. If you can reduce stored data size by 10^3, you don't need an index. That's exactly the opposite of row stores like MySQL.
(b) Data warehouses don't use indexes to maintain referential integrity, because it's not something they really care about in the first place.
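Point (a) can be illustrated with a toy run-length encoding of a low-cardinality Month dimension (RLE here is just for illustration; real columnar codecs are more sophisticated):

```python
def rle(values):
    """Run-length encode a column: [(value, run_length), ...]."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

# ~1M rows sorted by time, so the Month column is 12 long runs.
month_col = [m for m in range(1, 13) for _ in range(83_333)]
encoded = rle(month_col)
print(len(month_col), len(encoded))  # 999996 12
```

A million-entry column collapsing to 12 runs is why scanning such a dimension can beat joining out to a dimension table.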
My DW experience is with ClickHouse, but I think it illustrates a lot of the principles.",0,27112366,0,"[]","",0,"","[]",0 16721721,0,"story","xstartup","2018-03-31 04:26:32.000000000","Previously, there were only Redshift and BigQuery. Now, there are more Columnstores. New products like Snowflake, TimescaleDB, ClickHouse. Which one are you using and why?",0,0,0,"[16741360]","",1,"Ask HN: Which OLAP do you use in 2018?","[]",1 16724289,0,"comment","xstartup","2018-03-31 17:49:44.000000000","We run a 1K node ClickHouse cluster on top of OVH. In BigQuery, we had a latency of 10-50s per query and it was costing us 400K per month.
ClickHouse costs us 100K per month, all deployed on OVH. 100-200ms query latency.",0,16723599,0,"[16724398]","",0,"","[]",0 16724398,0,"comment","manigandham","2018-03-31 18:17:37.000000000","Why $400k for BigQuery? It has flat-rate billing starting at 40k/month for unlimited queries.
Also your other comment mentioned 50TB across 1000 Clickhouse nodes, why so little data per node?",0,16724289,0,"[16724829]","",0,"","[]",0 27148535,0,"comment","Sytten","2021-05-13 22:57:41.000000000","They mentionned that they decided against thanos for the storage of metrics, but would be curious to hear if other TSDB were considered. It is a hot space, I know about M3BD, Clickhouse, Timescale, influx, QuestDB, opentsdb, etc.",0,27147482,0,"[27148609]","",0,"","[]",0 24032845,0,"comment","FridgeSeal","2020-08-03 00:01:31.000000000","I agree, 400ms is a long time.
I’d be over the moon if most websites did things within 50ms, even 100ms would be a significant and welcome improvement.
Anecdotally, when I’m working on the database at work, the OLAP database (Clickhouse) feels amazingly snappy because it’ll respond in around 30-60ms, but SQL Server feels like molasses, even when doing the queries each DB is designed for.",0,24031938,0,"[]","",0,"","[]",0
This gave me a wide range of experience. I have worked on frontend, backend, infrastructure and security roles. I prefer working with backend and/or infrastructure, but I'd take a frontend job depending on the language.
Location: Sao Paulo, Brazil
Remote: Yes (any timezone)
Willing to relocate: Not currently. I can spend 2-3 months per year on-site though.
Technologies: I have used a lot of different tools throughout the years. I'm listing the ones I'm most familiar with.
Elixir/Erlang (4y), Python (3y), PHP (3y), Clojure/CLJS (9m), Elm (2y), Vanilla JS + React (1y), currently learning Go (2m)
AWS & GCP, Ansible, Postgresql, Nginx, FreeBSD & jails, Linux & Docker, Clickhouse
CV: I'll send it upon request.
Email: renatosmassaro@gmail.com
Github: https://github.com/renatomassaro",0,20325923,0,"[]","",0,"","[]",0 24041804,0,"comment","hodgesrm","2020-08-03 18:53:20.000000000","Altinity | Multiple ClickHouse & Cloud engineering positions | REMOTE in North America and Europe | Full-time | Competitive Salary and Equity
Hello! We are Altinity, a fast-growing database startup with a distributed team spanning from California to Eastern Europe. Our business is to make customers successful with ClickHouse, the leading open source data warehouse. Our customers range from ambitious startups to some of the most well-known enterprises on the planet. And we are looking for people to join us!
Here are a few of our open positions:
* Cloud Engineer
* Security Engineer
* Site Reliability Engineer
* Test Engineer
* Data Warehouse Implementation Engineer
* Data Warehouse Support Manager
* Data Warehouse Support Engineer
If you have experience with ClickHouse, data warehouses in general, and/or cloud technology, check out our jobs here:
https://www.altinity.com/careers",0,24038520,0,"[]","",0,"","[]",0 24042484,0,"comment","simoes","2020-08-03 19:49:42.000000000","Datawheel (datawheel.us) | Back-End Developer | REMOTE, ONSITE Cambridge MA and Washington DC | Full-time Datawheel is a small but mighty crew of programmers and designers who are here to make sense of the world’s vast amount of data! Learn more about us here: https://www.datawheel.us/
Back-end Developer
-----------------------------
We are looking for someone with back-end development and data ETL experience and comfort with devops. An ideal candidate is someone who is passionate about what they do and can bring that to the projects assigned to them. We are looking to work with someone on a contract basis with the option to transition to a salaried employee based on performance. Requirements
-----------------------------
- 3+ years experience
- Familiarity with Python, Node.js and Rust (bonus)
- Comfortable with rapid prototyping
- Experience writing SQL queries
- Experience working with Linux server environments
Bonuses
-----------------------------
- Experience with Scikit-Learn/Tensorflow or other machine learning libraries
- Experience working with ClickHouse or similar columnar databases
- Experience working with GCP and/or similar cloud platforms
- Experience with Docker/Kubernetes
APPLY HERE: https://www.datawheel.us/jobs",0,24038520,0,"[24056610]","",0,"","[]",0
20336993,0,"comment","jhgg","2019-07-02 16:56:38.000000000","We use a rather bespoke syslog -> clickhouse log sink (https://github.com/discordapp/punt/tree/clickhouse) we wrote in house because logstash (and then subsequently elasticsearch) was too slow. Would love to switch off of it and to this! Hopefully a clickhouse sink comes soon! Maybe we'll contribute one upstream!",0,20334779,0,"[20337443,20337563]","",0,"","[]",0
20337443,0,"comment","reacharavindh","2019-07-02 17:38:23.000000000","Out of curiosity, could you tell us a little more about your log analysis workflow? Once they are in Clickhouse, how do you visualise/search/analyse your logs? What is your equivalent of Kibana?",0,20336993,0,"[20340274]","",0,"","[]",0
20340274,0,"comment","jhgg","2019-07-02 23:23:20.000000000","We do rollups into bigquery where we have a bunch of dashboards to look at stuff historically. I did really like Kibana; ultimately, we had to ditch it (because of ditching ES). Of course, this was a good thing, as I more than once degraded ingest on the ES cluster just by using Kibana to do some aggressive filtering. Clickhouse handles these without problem.
I think a more complete world view may be to pipe logs into kafka, and ingest them into Clickhouse/Druid for different types of analysis/rollups.
Our current logging volume exceeds ~10b log lines per day now. Clickhouse handles this ingest almost too well (we have 3 16-core nodes that sit at 5% CPU). This is down from a... 20ish node ES cluster that basically sat pegged on CPU... and our log volume then was ~1b/day.
For more ad-hoc, we just use the clickhouse-cli to query the dataset directly. We are tangentially investigating using superset with it.",0,20337443,0,"[20342070]","",0,"","[]",0 16741360,0,"comment","bsg75","2018-04-03 02:13:56.000000000","Clickhouse: Having enough hardware at my disposal, I can do a lot with it at minimal relative cost. Its performance as a column store, and masterless setup are the attractive features. If I did not have hardware at my disposal, I would be using BQ, or looking at Snowflake.",0,16721721,0,"[]","",0,"","[]",0 20342070,0,"comment","reacharavindh","2019-07-03 06:22:14.000000000","Thanks for the response. Lot of tips to go research about for me.
I was mentally debating between trying to find a schema for our logs and storing them in a database where they can be queried efficiently
Vs
Throwing logs into ElasticSearch in a lazy way and letting it index the whole thing to enable us to do full-text search on logs. But with a limitation of only having a few days' worth of data in ES indexes.
Kibana’s visualisation is what is holding ES up for me. I will look into superset+Clickhouse to see if I can come up with a good analysis front for our log data.",0,20340274,0,"[]","",0,"","[]",0 27180452,0,"story","Bimal_kumar","2021-05-17 05:57:09.000000000","",0,0,0,"[27181471,27181215,27180600,27180693,27183112,27183852,27180846,27185219,27180903,27182303]","https://clickhouse.tech/blog/en/2020/the-clickhouse-community/",122,"The ClickHouse Community","[]",62 27180600,0,"comment","andridk","2021-05-17 06:33:54.000000000","Sorry if this is obvious, but... What is ClickHouse?",0,27180452,0,"[27180661,27180606,27180646]","",0,"","[]",0 27180661,0,"comment","amyjess","2021-05-17 06:45:38.000000000","Officially, it's a column-oriented database. This means that, internally, it stores columns together rather than rows together. In practice, it means that it's optimized for calculating analytics over large datasets.
I've found, from personal experience, that it makes a good replacement for time-series databases, even though it's technically not a time-series database. My employer migrated our KPIs and other metrics from InfluxDB to ClickHouse a couple of years ago, and the drastic improvements in performance were well worth the time it took to migrate our data. It also helped that ClickHouse uses a subset of SQL, unlike InfluxDB which uses a superficially SQL-like but practically very different proprietary language.",0,27180600,0,"[27180976,27180730,27183831]","",0,"","[]",0 27180846,0,"comment","mikpanko","2021-05-17 07:23:39.000000000","What is the difference between ClickHouse and Apache Parquet or Google Capacitor?",0,27180452,0,"[27180869]","",0,"","[]",0 27180869,0,"comment","FridgeSeal","2021-05-17 07:29:35.000000000","ClickHouse is a column-based database, also called an OLAP/analytic database.
Parquet is a column based file storage format.
Edit: Capacitor also appears to be an internal storage format Google uses inside of BigQuery.",0,27180846,0,"[]","",0,"","[]",0 27180903,0,"comment","mikpanko","2021-05-17 07:35:24.000000000","Is it faster for querying than using Presto on top of Hive? The page comparing ClickHouse with other analytics solutions doesn’t list Presto, which is very popular these days.",0,27180452,0,"[27181957,27180979,27182427]","",0,"","[]",0
I'm grateful for their work and the passion they put into assisting the community.
Well done, everyone!",0,27180452,0,"[]","",0,"","[]",0 27181471,0,"comment","doix","2021-05-17 09:15:15.000000000","Has anyone used Clickhouse in production in anger? Our DevOps guy hates it and complains about how "fragile" it is.
Keeping zookeeper happy seems to be a huge pain. Getting consistent backups seems to be difficult for some reason, and there is some problem with multiple replicas. As in, if you have 3 replicas and one dies, when it comes back up it will be out of sync and will refuse to resync itself automatically. So someone has to go and prod it to make it work again.
In practice, none of this has ever been an issue and only comes up when he does his "chaos engineering" tests, but it makes him very nervous.
On the other hand, it's been orders of magnitude faster than AWS Redshift/Apache Drill/PostgreSQL whenever we try to benchmark them on comparable hardware, so we stick with it.",0,27180452,0,"[27182010,27182607,27182045,27182762,27182544,27184507]","",0,"","[]",0 27181957,0,"comment","vulkoingim","2021-05-17 10:29:40.000000000","Waay faster. In my experience it has been the fastest OLAP db I've ever used, and by a wide margin. I don't think any other system comes close to the price/performance ratio you get with ClickHouse.
In this blog [1] you can find a very nice comparison of different dbs on the same dataset/same queries
The fastest Presto benchmark there [2] vs ClickHouse on a single node core i5 [3] vs ClickHouse in cluster mode [4]
[1] https://tech.marksblogg.com/benchmarks.html
[2] https://tech.marksblogg.com/billion-nyc-taxi-rides-spark-2-4...
[3] https://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html NOTE: not really comparable, since it's on single machine, but I think it really shows what ClickHouse is capable of
[4] https://tech.marksblogg.com/billion-nyc-taxi-rides-clickhous...",0,27180903,0,"[27182023,27187249,27185064]","",0,"","[]",0 27182010,0,"comment","ekimekim","2021-05-17 10:39:12.000000000","Hi, I ran a Clickhouse server in production at non-trivial scale for storage and query of application logs (semi-structured JSON documents).
In our case we were consuming from a kafka topic which was the source of truth, so we opted not to use Clickhouse's built-in write replication and instead manually sharded and wrote to each replica independently (this meant replicas weren't always in perfect sync for brand-new data, but in practice this was acceptable).
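That dual-write scheme boils down to a stable hash from row key to shard, then an application-side write to every replica of that shard instead of relying on ClickHouse's built-in replication. A toy sketch with invented host names and shard layout:

```python
import hashlib

# Hypothetical cluster layout: each shard has two replicas.
SHARDS = [
    ["ch-shard0-a:9000", "ch-shard0-b:9000"],
    ["ch-shard1-a:9000", "ch-shard1-b:9000"],
]

def shard_for(key: str) -> int:
    # Stable hash so a given key always lands on the same shard.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % len(SHARDS)

def write_targets(key: str) -> list[str]:
    # The application writes the row to every replica of its shard itself.
    return SHARDS[shard_for(key)]
```

Distributed reads still work over such a layout because ClickHouse's query side only cares that each shard has the data somewhere, not how it got there.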
The distributed query side still worked perfectly despite this weird setup, even when we were doing weird things like re-sharding or taking some replicas down for one reason or another.
I can't speak to backups as, once again, we were considering the kafka stream the source of truth and not the clickhouse datastore. We archived the data directly out of kafka, in a format more suitable to long term archival instead of live query.
My experience with Clickhouse was that the core primitives were rock solid, but you were often on shaky ground when using more niche features - we encountered crashes when trying to use bloom filters for text search, for example.
The performance of Clickhouse absolutely blew me away. It could read, filter and aggregate results as fast as the underlying disks could serve it data. I'll point out though that while Clickhouse is very good at what it does, it puts a lot of onus on the person designing the schema and writing the query to make sure it does it in the right way. In our case this worked well because we only had a small number of "shapes" of query we needed to serve, and we had a CLI tool that translated the user's query written in a simple DSL (really just a key-matches-value filter expression) into the SQL that would result in an efficient query on clickhouse's end. But for more flexibility-demanding workloads this could be a major issue as you need to know the system pretty well to write good queries.",0,27181471,0,"[]","",0,"","[]",0 27182045,0,"comment","vulkoingim","2021-05-17 10:47:21.000000000","In my experience there is a bit of a learning curve until you understand how each of the parts fit together and how to optimize for your use case. ZooKeeper indeed was one of those pain points, but once you get it stable there is practically no need to touch it.
I found the tips in the docs on optimizing ZK really helpful [1].
Another mistake we made early on with ZooKeeper, which caused us a lot of pain, was running it with network-attached disks (we ran ClickHouse+ZK in the cloud). Once we switched to local disks, 99% of the problems disappeared.
[1] https://clickhouse.tech/docs/en/operations/tips/#zookeeper",0,27181471,0,"[]","",0,"","[]",0 27182427,0,"comment","ckdarby","2021-05-17 11:52:34.000000000","You should look at switching from hive to Iceberg. You'll see a 4-10x speed up and it'll start to give Clickhouse a run for it's money.",0,27180903,0,"[]","",0,"","[]",0 27182544,0,"comment","lnsp","2021-05-17 12:10:27.000000000","I'm quite sure Cloudflare uses ClickHouse for their web analytics stuff (see https://blog.cloudflare.com/http-analytics-for-6m-requests-p...).",0,27181471,0,"[27183000,27184214]","",0,"","[]",0 27182607,0,"comment","caust1c","2021-05-17 12:18:49.000000000","Similar experience to vulkoingim: steep learning curve but quite stable once deployed properly.
Schema management in zookeeper has been the biggest pain point for us. Occasionally individual clickhouse shards will get out of sync during a schema update, which can be hard to diagnose.
We use a heavily modified version of clickhouse-backup[1], which works well for us.
As for hands-off replica reboot: you must have an automated process to reapply the same schema which exists in zookeeper, otherwise it won't resync. If the local schema gets out of sync with that in zookeeper, then you'll have issues again.
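Such a reapply step can be as simple as replaying idempotent DDL files at boot. A hedged sketch (the `execute` callback stands in for a real ClickHouse client, and the one-file-per-table layout is invented):

```python
from pathlib import Path
from typing import Callable

def reapply_schema(schema_dir: str, execute: Callable[[str], None]) -> int:
    """Replay every .sql DDL file so a rebooted replica converges on the
    schema recorded for the cluster. Statements should be idempotent
    (CREATE TABLE IF NOT EXISTS ...) so re-running is harmless."""
    applied = 0
    for path in sorted(Path(schema_dir).glob("*.sql")):
        execute(path.read_text())
        applied += 1
    return applied
```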
I expect a lot of these ergonomics issues will be fixed over time. It's already much easier to use than it was 3 years ago, and even if progress on usability and reducing the learning curve is slow the database performance makes it worth it.
[1] https://github.com/AlexAkulov/clickhouse-backup",0,27181471,0,"[27183773]","",0,"","[]",0 27182762,0,"comment","rdsubhas","2021-05-17 12:39:31.000000000","What you say is true, clustering clickhouse is hard. But we are happy in a very different way. What used to be 6+6 cluster of graphite is now served by a single clickhouse instance handling millions of unique metric datapoints everyday, that we haven't even had the need to cluster it yet. It's an order of magnitude difference in performance & cost. Scale, but just in a different way, not just horizontal but holistically.
Maybe when we need to cluster, we'll face the same issues.",0,27181471,0,"[]","",0,"","[]",0 27183112,0,"comment","dugmartin","2021-05-17 13:26:16.000000000","If you want to see some code where it is used, I ran across it when looking at the Plausible Analytics repo:
https://github.com/plausible/analytics/search?q=clickhouse
They use it to store analytics data.",0,27180452,0,"[]","",0,"","[]",0 27183261,0,"comment","swasheck","2021-05-17 13:41:41.000000000","Every high-volume Influx/Grafana implementation I’ve used has been a disaster. I’m now at a place that uses ClickHouse and I can now see the utility of Grafana",0,27180976,0,"[27183490,27187949]","",0,"","[]",0 27183773,0,"comment","FridgeSeal","2021-05-17 14:24:58.000000000","I hold out perpetual hope that ClickHouse will move to a Raft or similar based consensus model and drop the external dependence on ZooKeeper on day.",0,27182607,0,"[27184366,27184978]","",0,"","[]",0 27183852,0,"comment","gautambt","2021-05-17 14:31:31.000000000","If someone from the clickhouse team is reading this, the slack link[1] on your community page seems broken [2]
[1] https://clickhousedb.slack.com/join/shared_invite/zt-nwwakmk... [2] https://clickhouse.tech/#community",0,27180452,0,"[]","",0,"","[]",0 27184214,0,"comment","wdb","2021-05-17 14:59:56.000000000","PostHog is using Clickhouse for the Cloud version",0,27182544,0,"[]","",0,"","[]",0 27184366,0,"comment","tonyhb","2021-05-17 15:13:05.000000000","This is happening: https://github.com/ClickHouse/ClickHouse/issues/15090
They're using NuRaft from eBay to drop ZooKeeper. As far as I remember, it's actually already merged and available to test. Performance is ~3x slower in terms of writes as they're coordinating everything themselves, but it should be _much_ easier to maintain.
I'm looking forward to this, too, and want to see more in terms of docs / support here :)",0,27183773,0,"[27190246]","",0,"","[]",0 27184507,0,"comment","dominotw","2021-05-17 15:25:32.000000000","How does the setup and operational ease compared to druid. Afaik, druid also uses zookeeper. Is there any reason someone would choose druid over clickhouse?",0,27181471,0,"[]","",0,"","[]",0 27185219,0,"comment","kokizzu2","2021-05-17 16:22:34.000000000","https://t.me/clickhouse_en clickhouse en '__')",0,27180452,0,"[]","",0,"","[]",0 27187020,0,"comment","manigandham","2021-05-17 19:05:30.000000000","Clickhouse/Druid/Pinot are all columnstores/column-oriented databases. Clickhouse is a relational engine while Druid/Pinot are a different (and older) design using heavy indexing and pre-aggregation. All of them store table data as per-column segments though which is a defining feature leading to high compression and I/O performance.
There's also the badly named wide-column database type like Cassandra, but this is really just advanced or nested key/value rather than what people would consider "columns".",0,27183831,0,"[]","",0,"","[]",0 27187249,0,"comment","drewpc","2021-05-17 19:23:22.000000000","Thanks for these links. Unrelated to ClickHouse, it was interesting to read about BrytlytDB--I'd never heard of it before. Essentially a PostgreSQL DB that uses GPUs.",0,27181957,0,"[]","",0,"","[]",0 27187344,0,"story","swyx","2021-05-17 19:30:37.000000000","",0,0,0,"[]","https://softwareengineeringdaily.com/2021/05/17/clickhouse-data-warehousing-with-robert-hodges/",3,"ClickHouse: Data Warehousing with Robert Hodges","[]",0 27187949,0,"comment","ithkuil","2021-05-17 20:20:23.000000000","The shape of the data matters. In particular the cardinality of the tags. If clickhouse works well for you, chances are that your use case will be well served by influxdb_iox too",0,27183261,0,"[]","",0,"","[]",0 27190284,0,"comment","FridgeSeal","2021-05-18 01:14:28.000000000","Scuba appears to be an in-memory-only database, so it’s possibly faster (not seen benchmarks nor used it) but at the cost of persistence.
Is Procella even usable by anyone not-google?
Haven’t used dremel + capacitor, but have used Snowflake and found it lacking. Promises a lot, performance was just kind of alright, support was useless, sales team was aggressive to the point of harassment, and it was eye-wateringly expensive on top of all that. Some people love it, I will never go near it again, its advantages just aren’t sufficient to warrant choosing it over ClickHouse or even Redshift.",0,27185064,0,"[]","",0,"","[]",0 24078935,0,"story","mmcclure","2020-08-07 06:23:17.000000000","",0,0,0,"[]","https://mux.com/blog/from-russia-with-love-how-clickhouse-saved-our-data/",7,"ClickHouse Saved Our Data (2020)","[]",0 20380560,0,"comment","alexvaut","2019-07-08 07:13:36.000000000","I will go a step further by stating that metrics, logs and traces are very similar and should be treated as such in a unique platform. Leveraging these 3 sources of data in a micro-services world is more than needed for troubleshooting, documentation and monitoring.
Right now I'm using Prometheus (metrics) + Jaeger (traces) + Fluentd&Clickhouse (logs) + Grafana to render all of that. It's not that easy to correlate data but I'm getting there (with tricky queries in Grafana panels and custom Grafana sources). A PoC about displaying traces in a nice way: https://github.com/alexvaut/OpenTracingDiagram.",0,20375190,0,"[]","",0,"","[]",0 16782355,0,"comment","Nitrado","2018-04-07 19:28:31.000000000","ClickHouse ships with a command line tool which does this (without the actual database server):
ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' | \
clickhouse-local -S "user String, mem Float64" \
-q "SELECT user, round(sum(mem), 2) as memTotal FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"
┏━━━━━━━━━━┳━━━━━━━━━━┓
┃ user ┃ memTotal ┃
┡━━━━━━━━━━╇━━━━━━━━━━┩
│ clickho+ │ 0.7 │
├──────────┼──────────┤
│ root │ 0.2 │
├──────────┼──────────┤
│ netdata │ 0.1 │
├──────────┼──────────┤
│ ntp │ 0 │
├──────────┼──────────┤
│ dbus │ 0 │
├──────────┼──────────┤
│ nginx │ 0 │
├──────────┼──────────┤
│ polkitd │ 0 │
├──────────┼──────────┤
│ nscd │ 0 │
├──────────┼──────────┤
│ postfix │ 0 │
└──────────┴──────────┘
Has the advantage of being really fast.",0,16781294,0,"[16782630]","",0,"","[]",0
13183758,0,"story","jgrahamc","2016-12-15 10:59:24.000000000","",0,0,0,"[]","https://github.com/cloudflare/sqlalchemy-clickhouse",5,"SQLAlchemy ClickHouse","[]",0
16800881,0,"comment","yarapavan","2018-04-10 12:40:24.000000000","tl;dr: No Redshift, ClickHouse or BigQuery. Use Citus extensions for PostgreSQL - HyperLogLog (HLL) and TopN - to handle billions of searches per day across thousands of customers. Data pipelines are written in Go, run on GKE/K8s for orchestration, using Google Pub/Sub for communication across services.",0,16800468,0,"[]","",0,"","[]",0 16802650,0,"comment","ryanworl","2018-04-10 16:31:37.000000000","I think the choice to not go with Clickhouse deserves a bit more explanation than what was given in the article.
Instead of writing all this code to do roll ups they could’ve used an AggregatingMergeTree table over their raw events table and... gotten back to work.
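For readers who haven't met AggregatingMergeTree: the idea is that a materialized view maintains pre-aggregated partial states per time bucket, so queries scan the roll-up instead of the raw events. A toy Python illustration of the bucketing idea (not ClickHouse code; names are mine):

```python
from collections import defaultdict

def rollup(events, bucket_seconds=300):
    """Collapse raw (timestamp, key, value) events into per-bucket
    sums/counts - roughly what an AggregatingMergeTree materialized view
    maintains incrementally as rows are inserted."""
    acc = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for ts, key, value in events:
        bucket = ts - ts % bucket_seconds  # floor to the bucket boundary
        slot = acc[(bucket, key)]
        slot["sum"] += value
        slot["count"] += 1
    return dict(acc)
```

In ClickHouse the merge engine does this continuously in the background, which is why queries over the roll-up finish in milliseconds.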
Cloudflare is using Clickhouse for their DNS analytics and (maybe even by now) soon their HTTP analytics. And the system they migrated off of looked a heck of a lot like this one in the article.
Edit: I should add that I am not saying their decision was wrong. I just think the sentence that was given in the article does not justify the decision by itself on an engineering level.
The data load process of Clickhouse and Citus (in this configuration) are nearly identical. Clickhouse takes CSV files just fine like Citus. The default settings are fine for the volume mentioned in the article of single digit billions of records per day. This would probably fit on a single server if you age out the raw logs after your coarsest aggregate is created. Queries over the AggregatingMergeTree table at five minute resolution will finish in high double digit to low triple digit milliseconds if the server is not being hammered with queries and the time range is days to weeks.",0,16797124,0,"[16803606,16803107]","",0,"","[]",0 16803606,0,"comment","sfg75","2018-04-10 18:09:05.000000000","Hey, sorry if that wasn't clear enough (author here).
We decided not to go with ClickHouse because we were mostly looking for a SaaS solution. That's pretty much why we also didn't spend too much time on Druid either.
Choosing Citus meant we could leverage a technology that we already had a bit of experience with (Postgres) and not have to really care about the infrastructure underneath it. We're still a fairly small team and those are meaningful factors to us.
At the end of the day I'm sure all those systems would do the job fine (ClickHouse or Druid); we just went for what seemed the easiest to implement and scale.",0,16802650,0,"[16803647]","",0,"","[]",0 16803647,0,"comment","ryanworl","2018-04-10 18:12:59.000000000","That makes sense. If you do ever want to check out Clickhouse and want someone to run it for you, Percona or Altinity [1] can probably help. Not affiliated with either, I just read their Clickhouse-related content.
[1](https://www.altinity.com)",0,16803606,0,"[]","",0,"","[]",0 16804338,0,"comment","al_james","2018-04-10 19:24:47.000000000","A great article, and I am a big fan of algolia, Citus and Redshift. However this article ends up making an odd apples to oranges comparison.
They state that "However, achieving sub-second aggregation performances on very large datasets is prohibitively expensive with RedShift", this suggests that they want to do sub-second aggregations across raw event data. However, later in the article, the solution they build is to use rollup tables for sub-second responses.
You can also do rollup tables in Redshift, and I can assure you (if you enable the fast query acceleration option) you can get sub-second queries from the rolled up lower-cardinality tables. If you want even better response times, you can store the rollups in plain old Postgres and use something like dblink or postgres_fdw to perform the periodic aggregations on Redshift and insert into the local rollup tables (see [1]). In this model the solution ends up being very similar to their solution with Citus.... and I would predict that this is cheaper than Citus Cloud as Redshift really is a great price point for a hosted system.
So the question of performing sub-second aggregations across the raw data remains unanswered... however that really is the ideal end game as you can then offer way more flexibility in terms of filtering than any rollup based solution.
Right now, research suggests Clickhouse, Redshift or BigQuery are probably the fastest solutions for that. Not sure about Druid, I don't know it. GPU databases appear to be the future of this. I would be interested to see benchmarks of Citus under this use case. I should imagine that Citus is also way better if you have something like a mixed OLAP and OLTP workload (e.g. you need the analytics and the row data to match exactly at all times).
Aside: It would be great to see Citus benchmarked against the 1.1 billion taxi rides benchmark by Mark Litwintschik. Any chance of that?
[1] https://aws.amazon.com/blogs/big-data/join-amazon-redshift-a... [2] http://tech.marksblogg.com/benchmarks.html",0,16797124,0,"[16806631,16806452]","",0,"","[]",0 16806631,0,"comment","massaman_yams","2018-04-10 23:27:30.000000000","Similar to your point about mixed workloads, I have a hunch that Mark's benchmarks are not comprehensive enough to correlate well to real-world usage across a lot of different scenarios, even on pure OLAP workloads. It's great that a billion rows can be aggregated in 0.02 seconds, but there's a reason TPC-H uses 9(-ish?) different queries with varying aggregations and joins, vs. these benchmarks on a single table. (Of course, if your use case is heavy on a specific type of aggregation, it probably makes sense to optimize for that at the expense of other query performance.)
And - perhaps I missed it, but his benchmarks don't seem to utilize rollup/materialization unless the DB does it automatically (or at least easily) on the backend.
As is, it's almost certain that Citus would underperform most of the leaders here. The PG9.5 benchmark actually uses the Citus-developed cstore_fdw extension, and it shows up towards the bottom, albeit running on a single node with hardware a few CPU generations old. (Same as used for the Clickhouse benchmark.) I am curious how Citus/Postgres might perform using the HLL / TopN extensions, though.
Also of note is his Redshift benchmark was run on magnetic drives on ds2 instances, not SSDs. Using those would almost certainly bump performance up a bit.
Druid is optimized for aggregation and filtering, and is somewhat similar to BigQuery on the backend, as I understand it. The Cloudflare blog posted elsewhere in the thread covers it briefly. https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-q... More on its indexing strategy here: https://hortonworks.com/blog/apache-hive-druid-part-1-3/
Druid's downsides are: more complex deploy and operational needs due to architectural complexity, lack of full SQL support, limited fault tolerance on the query execution path, and the whole query being bottlenecked by the slowest historical data access.
More here: https://medium.com/@leventov/the-problems-with-druid-at-larg...",0,16804338,0,"[]","",0,"","[]",0 16811633,0,"comment","geocar","2018-04-11 14:52:34.000000000","> Julia is regularly 1.2x C and sometimes faster than c when leveraging builtin libraries.
It is very difficult to say a language is faster than another, and only slightly easier to say an implementation is faster.
http://tech.marksblogg.com/benchmarks.html
describes a number of benchmarks for an interesting data problem. kdb+ is the third fastest performer and the fastest CPU performer. kdb+ is 1000x faster than another database written in C (kdb+ is written in k); but only about 100x faster than other C-based databases. The next closest looks like Yandex Clickhouse which is written in C++ and still less than half the speed of kdb+.
How much work is spent getting C++ to approach half the speed of an interpreted language?
> How can you be "a lot faster" than Julia?
That is an incredibly loaded question, but a big part of "how" is because of the notation: Because it's written in a language that is very dense, and puts parts of the solution near each other, it is very easy to be correct, and to find repetitive parts of the program that can be coalesced and merged.
> What's mmu?
http://code.kx.com/q/ref/matrixes/#mmu
> Im sorry, but when i see apl, I just see a lot of symbols I can't bother to look up to understand.
It never ceases to amaze me when a programmer thinks that just because they can't do something, that somehow that follows that nobody else should either.
I mean, seriously!
> Code reviews would be a nightmare.
Why would you do code reviews in a language you don't understand?
The dense and carefully chosen symbols and meanings in an APLish language are its value. Not that it's fast (although it is), but specifically that it's easy to understand. Heck, the article we're commenting about mentions several algorithms that are so much easier to understand in an array language than anything else.",0,16811431,0,"[16830534,16812496]","",0,"","[]",0 24121202,0,"comment","ing33k","2020-08-11 15:18:52.000000000","Congratulations on shipping!
This is an interesting problem to solve.
ClickHouse support will make it way more interesting.
Are you planning to leverage materialized views?",0,24120325,0,"[24121294]","",0,"","[]",0 24121294,0,"comment","vklmn","2020-08-11 15:25:13.000000000","We're actually working on the ClickHouse integration right now; it will be shipped next week!
Re: materialized views - since not all DWH support them, we're not sure how we can leverage them. However, what do you have in mind? Maybe we're wrong!
I’m currently evaluating typesense vs ES for a fts project and typesense is winning so far by simply be “not painful” to deal with.",0,27249603,0,"[27252310]","",0,"","[]",0 27252310,0,"comment","vosper","2021-05-23 03:53:27.000000000","> I don’t want a full-text search engine that also has graphs, ML, some bizarre scripting feature, log management, etc all stapled in on top
Sure, so use something else. I do need (most all of) that at my work (plus the horizontal scaling), and there's no competition. I know we're not the only ones.
Also, there's nothing bizarre about the scripting feature. There are several options for scripting, it's very flexible, and it suits implementing custom logic when you need it.
And, I'm not saying ES is perfect! I'm saying that there's a set of use-cases that only ES (to my knowledge) can fulfil, and that's complex aggregations also involving complex full-text search, over tera/petabytes of data. Clickhouse can do aggregations, but doesn't have anything close to the search chops (again, to my knowledge).",0,27252131,0,"[]","",0,"","[]",0 20444856,0,"comment","polskibus","2019-07-15 20:24:44.000000000","Is there a tutorial to run it on-prem? For example for development or testing purposes?
What data warehouses are currently supported and which are on the roadmap? Is ClickHouse somewhere in your plans?",0,20443926,0,"[20444931]","",0,"","[]",0 20444931,0,"comment","1ewish","2019-07-15 20:32:05.000000000","On-prem: Right now our IDE is only available as SaaS, although we will be looking at this in the near future. You can develop and test projects with the CLI and deploy them yourself but no tutorials for setting this up beyond the basics yet: https://docs.dataform.co/guides/command-line-interface/
Warehouse support: Athena/Presto and Azure are top of mind. I've not come across ClickHouse before but I'll definitely add it to our tracker!",0,20444856,0,"[]","",0,"","[]",0 20445661,0,"story","valyala","2019-07-15 21:55:55.000000000","",0,0,0,"[]","https://www.altinity.com/blog/2019/7/new-encodings-to-improve-clickhouse",2,"New encodings to improve ClickHouse efficiency","[]",0 20453036,0,"comment","monstrado","2019-07-16 19:19:40.000000000","Cool project! I've had success with ClickHouse's local utility, which is extremely fast. It helps that it's basically a "local" version of an already insanely efficient columnar database.
https://www.altinity.com/blog/2019/6/11/clickhouse-local-the...",0,20449703,0,"[]","",0,"","[]",0 16868163,0,"comment","lima","2018-04-18 15:31:31.000000000","I built a similar pipeline using Kafka and ClickHouse - it's amazing how easy it is nowadays to ingest and analyze billions of events a day using standard tools.
ClickHouse can even ingest directly from Kafka (courtesy of Cloudflare - http://github.com/vavrusa contributed it).",0,16868028,0,"[16868269,16868583]","",0,"","[]",0 16868725,0,"comment","buremba","2018-04-18 16:33:18.000000000","Sounds like they over-engineered the solution. If you have ad-hoc use-case, BigQuery is great but it's quite expensive. If you just need to pre-calculate the metrics using SQL, Athena / Prestodb / Clickhouse / Redshift Spectrum might be much easier and cost-efficient.",0,16868028,0,"[16869218,16870866,16870717,16868997]","",0,"","[]",0 16872227,0,"comment","manigandham","2018-04-18 23:40:47.000000000","How much data size are you dealing with? For that price range, you can probably just get MemSQL or Clickhouse and handle real-time queries across all of the data.",0,16869652,0,"[]","",0,"","[]",0 20482882,0,"comment","zepearl","2019-07-19 22:14:14.000000000","Thanks a lot - zstd is reaaaally fast for the compression ratios that it achieves.
I personally use it all the time in Clickhouse tables (the "Yandex Clickhouse" database). I admit that I'm still using "xz" when I focus on hardcore max compression (max compression within an "acceptable" timeframe) when doing specific tests with their focus on max final compression.",0,20482517,0,"[]","",0,"","[]",0 27310247,0,"story","tosh","2021-05-28 00:12:48.000000000","",0,0,0,"[27313075,27311202,27313354,27313029,27314247,27314888,27313827,27313962,27311715,27314289,27312540,27315384]","https://github.com/ClickHouse/ClickHouse",198,"ClickHouse: An open-source column-oriented database management system","[]",55 27311202,0,"comment","skunkworker","2021-05-28 02:46:55.000000000","I'm not sure why this got posted again on HN. It's been posted numerous times, though in the past the url used to be https://github.com/yandex/ClickHouse so it wouldn't have been caught as a duplicate.
Just searching on HN Algolia there are numerous posts, e.g.:
https://hn.algolia.com/?q=clickhouse",0,27310247,0,"[27311477,27311896]","",0,"","[]",0 27313029,0,"comment","sixhobbits","2021-05-28 08:34:03.000000000","Their docs intro page is really nice and has a cool gif explanation of row vs column databases. I have never used a column db before and while I kind of knew some of that theory this made it really click for me.
https://clickhouse.tech/docs/en/",0,27310247,0,"[]","",0,"","[]",0 27313075,0,"comment","momothereal","2021-05-28 08:43:27.000000000","People who've used Clickhouse or other OLAP databases in production & at scale, how do you "interconnect" it with relational data?
I'm currently experimenting with Clickhouse, because my dataflow is increasing in size (40M rows right now, doubling every month or so) and my current setup (MongoDB) is at its limits. I would like to migrate the 40M rows to CH, but I also need the metadata for the rows to be in something more robust like Postgres or Mongo. Would you have a microservice that does the queries between the OLAP and relational DBs and does the join manually, exposing that as some low-level API? Or is using the various FDW options (remote tables in Clickhouse, Postgres clickhouse_fdw, etc.) realistic in production?
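For what it's worth, the "join manually in a microservice" option usually looks like: aggregate in the OLAP store, fetch metadata by id from the relational store, and stitch the two in the application. A minimal sketch with stand-in names (the fetch callback represents a real Postgres/Mongo client):

```python
def join_olap_with_metadata(aggregates, fetch_metadata):
    """aggregates: rows from the OLAP side, e.g. [{"entity_id": 1, "total": 42}].
    fetch_metadata: callable taking a list of ids and returning {id: metadata}
    from the relational side. Returns the aggregate rows enriched in-app."""
    ids = [row["entity_id"] for row in aggregates]
    meta = fetch_metadata(ids)
    return [
        {**row, "metadata": meta.get(row["entity_id"])}
        for row in aggregates
    ]
```

The batched id lookup keeps it to one round trip per side, which is usually the main cost of doing the join in the application.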
Sorry for the ramble, it's late here. But I can clarify if necessary.",0,27310247,0,"[27316565,27314038,27315106,27313484,27313594,27313153,27313648,27314267,27313919]","",0,"","[]",0 27313354,0,"comment","pibi","2021-05-28 09:37:39.000000000","Great software. We are managing terabytes of stocks data and realtime market scanners queries across all market (billions of books and timesales) with hundreds of concurrent requests.
We were using kdb before, but clickhouse is more scalable, way cheaper and much easier to grasp for a newbie.",0,27310247,0,"[27313941,27313711,27314538]","",0,"","[]",0 27313401,0,"comment","tosh","2021-05-28 09:44:29.000000000","I'm not very familiar with baseball so the concept of a "stolen base", the rules around it and how they changed over time were a fascinating read for me personally and I thought it might resonate here as well. It seems like it did not.
My heuristic for submitting is when I come across something (in this case ClickHouse) that I find useful or fascinating (a-ha moment) where I also think this might be of interest to others here (filter).",0,27312284,0,"[]","",0,"","[]",0 27313484,0,"comment","unamedrus","2021-05-28 09:54:45.000000000","> I also need the metadata for the rows to be in something more robust like Postgres or Mongo.
How big is that metadata? If it's less than tens of millions rows, you can look at External Dictionaries support in clickhouse.
https://clickhouse.tech/docs/en/sql-reference/dictionaries/e...
https://altinity.com/blog/2020/5/19/clickhouse-dictionaries-...",0,27313075,0,"[]","",0,"","[]",0 27313594,0,"comment","pella","2021-05-28 10:10:52.000000000","there is a new Clickhouse Database Engine: "PostgreSQL"
"Allows to connect to databases on a remote PostgreSQL server. Supports read and write operations (SELECT and INSERT queries) to exchange data between ClickHouse and PostgreSQL."
https://clickhouse.tech/docs/en/engines/database-engines/pos...",0,27313075,0,"[]","",0,"","[]",0 27313648,0,"comment","wiredfool","2021-05-28 10:18:22.000000000","I ran across this recently: https://eng.uber.com/logging/
Uber's use of clickhouse to handle semi-structured data from logging.",0,27313075,0,"[]","",0,"","[]",0 27313827,0,"comment","hocuspocus","2021-05-28 10:47:48.000000000","My team is (over|ab)using Elasticsearch and I've had my eye on ClickHouse for a while. However we're going to migrate everything to AWS and I wonder if RedShift could be a good alternative too, since it's now supporting JSON and semi-structured data apparently.",0,27310247,0,"[27313966,27354673,27314933]","",0,"","[]",0 27313941,0,"comment","qeternity","2021-05-28 11:06:15.000000000","You’re ingesting real time data into Clickhouse?",0,27313354,0,"[27314842,27354645]","",0,"","[]",0 27314038,0,"comment","hashhar","2021-05-28 11:20:05.000000000","Take a look at query engines like Trino (formerly PrestoSQL) [https://trino.io/]. (Disclaimer: I'm a contributor to Trino).
I used it at a previous job to combine data from MongoDB, Kafka, S3 and Postgres to great effect. It tries to push-down as many operations as possible to the source too to improve performance.
Full ANSI SQL support over multiple number of backends (Kafka, Cassandra, Postgres, ClickHouse, S3 and many more).
The best part is it has a plugin ecosystem so you can very easily implement your own connectors and all the heavy lifting gets done by the core-engine while your plugin only has to abstract your backend to concepts that the engine can understand.",0,27313075,0,"[27315985,27314994,27314565]","",0,"","[]",0 27314247,0,"comment","nojito","2021-05-28 11:51:40.000000000","Clickhouse is easily my favorite secret weapon.
Lowered our costs by ~50% and reduced data processing times by orders of magnitude.",0,27310247,0,"[27314558]","",0,"","[]",0 27314933,0,"comment","daniel_levine","2021-05-28 13:12:42.000000000","Altinity offers a managed service for ClickHouse on AWS https://altinity.com/cloud-database/",0,27313827,0,"[27314975]","",0,"","[]",0 27315106,0,"comment","hodgesrm","2021-05-28 13:32:26.000000000","ClickHouse can select and insert directly from/to remote MySQL [1] and PostgreSQL [2] tables. See MySQL and PostgreSQL database engines. It's a common way to access mutable dimension data as well as to pull data into ClickHouse for analysis.
I have not used PostgreSQL myself but the MySQL database engine works great. In some cases queries from ClickHouse to MySQL run faster than they do on MySQL itself. There are other engines as well, e.g., MongoDB.
[1] https://clickhouse.tech/docs/en/engines/table-engines/integr...
[2] https://clickhouse.tech/docs/en/engines/table-engines/integr...",0,27313075,0,"[]","",0,"","[]",0 27315985,0,"comment","LaserToy","2021-05-28 14:48:35.000000000","The problem with Trino is that it is not that easy to scale to possible RPS of Clickhouse, it introduces tons a of latency and push downs are faaaar from perfect. Uber has a smart solution for Pinot, when they run it as a single node proxies",0,27314038,0,"[27330357]","",0,"","[]",0 27316565,0,"comment","mason55","2021-05-28 15:30:05.000000000","Really depends what you're trying to do with your OLAP database.
If you're using it purely for reporting purposes then you really don't want to interconnect it with your OLTP database to support real-time queries. Reason number one is that you don't want some unexpected analytics workload to suddenly impact your production Postgres. Reason number two is that your analytics data model and your OLTP data model are frequently different. Usually your OLAP model needs to know what the value of a given dimension was at the time the event occurred, but if you're linking directly to Postgres then you can only see what the values are right now.
You can also go the other direction if you need the current values of OLTP data joined with your Clickhouse data: schedule an ETL out of Clickhouse and back into your OLTP database and aggregate the data to a reasonable level such that Postgres can handle it without a problem.
What's your actual use case? I'd normally consider a requirement to join OLAP & OLTP data in real time to be a "design smell". I don't mean that there's no value in things like fdw or easier ways to move data around, but you should consider using it to help with the ETL process and not as a real-time interconnect. Keep OLTP & OLAP workloads separate and both of your DBs will be happier.",0,27313075,0,"[27317049]","",0,"","[]",0 27317049,0,"comment","momothereal","2021-05-28 16:02:32.000000000","Alright, here is an example. My data is a stream of events, each row has the values (event_id, person_id, date). This is the table with 40M rows, with inserts between 2/s and 10/s.
person_id is a foreign ID for a "person table" that fits well in the relational model. A "person" has various attributes (name, DOB, email) as well as a one-to-many relation to "groups". Groups are a collection of users with additional attributes (e.g. group name).
Now, what if I want to answer questions like:
- how many total events for all persons in group X
- who are the top 10 users in number of events in group X
- which are the top 10 groups in number of events
In this case, the person/group tables are part of the core business logic; they aren't specific to the events table. It doesn't make sense to store them in Clickhouse. Also, this person/group data gets updated sparsely, but staleness should be kept to a minimum (< 30 secs). The simple approach to the first question would be
- Get all the user IDs in group X
- Filter events by those user IDs
But what if there are tens of thousands of users in group X? And hundreds of groups? Are megabyte-long queries supported in Clickhouse?",0,27316565,0,"[27317772,27319464]","",0,"","[]",0
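For a rough feel of the two-step approach, here is a tiny in-memory sketch in Python (hypothetical IDs; a real ClickHouse query would push the membership check into a subquery or join rather than inlining a megabyte-long ID list):

```python
# Naive two-step filter on hypothetical in-memory data:
# fetch the member IDs of group X, then filter the event stream against them.
events = [
    {"event_id": 1, "person_id": 10},
    {"event_id": 2, "person_id": 11},
    {"event_id": 3, "person_id": 10},
    {"event_id": 4, "person_id": 99},
]
group_x_members = {10, 11}  # imagine tens of thousands of IDs here

# Count events belonging to group X members.
total = sum(1 for e in events if e["person_id"] in group_x_members)
print(total)  # 3
```

The pain point is exactly the one raised above: as group X grows, shipping the ID list to the query engine grows with it, which is why keeping the membership lookup inside the database is preferable.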
27317772,0,"comment","mason55","2021-05-28 16:56:25.000000000","Some things to think about:
1. Are the questions you're asking completely ad hoc? Or can you mostly define them ahead of time? If it's the former then you should be looking at getting your OLTP data into Clickhouse. If it's the latter then you should be looking to aggregate the data to various levels and get it out of Clickhouse.
All three of your sample questions lend themselves quite nicely to pre-aggregation. I'm sure your actual questions are more complex, but what I'd do to address all three of your examples is every night I'd roll up the raw events into (person_id, date, event_count) and send it back over to Postgres. Then every week you roll up the previous seven days into (person_id, week, event_count). Each month you roll the weeks up and each year you roll the months up. If you need the data more frequently than daily then you can go down to hourly or whatever it is you need.
Now you've got your data back into Postgres but at a reasonable granularity. Depending on the cardinality of the user-to-group relationship you might have to do some magic to pre-aggregate that if the join is big as well, which could turn into a challenge as group membership changes (you'd need to re-aggregate all your group metrics any time group membership changed) but it's still better than trying to join across the Clickhouse/Postgres boundary.
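The nightly rollup described above can be sketched in Python (hypothetical rows; in practice this would be an aggregating query in ClickHouse feeding an insert into Postgres):

```python
# Roll up raw (person_id, date) events into (person_id, date, event_count).
from collections import Counter

raw_events = [
    (1, "2021-05-27"),
    (1, "2021-05-27"),
    (2, "2021-05-27"),
    (1, "2021-05-28"),
]

daily = Counter(raw_events)  # keyed by (person_id, date)
rollup = [(pid, day, n) for (pid, day), n in sorted(daily.items())]
print(rollup)
# [(1, '2021-05-27', 2), (1, '2021-05-28', 1), (2, '2021-05-27', 1)]
```

Weekly, monthly, and yearly rollups repeat the same idea over coarser date keys, which is what keeps the data small enough for Postgres.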
If you really do need to support totally ad hoc questions all the time then you should figure out how to get your Postgres data into Clickhouse. If the data really gets updated infrequently then it shouldn't be a problem to get changes in user/group membership into Clickhouse quickly, and then you can do all your joins and analysis completely in Clickhouse.
2. Do you really want the current group memberships? Or do you want the group memberships at the time the event occurred? It's a subtle difference and there's not usually one right answer (or the answer is "I need both").",0,27317049,0,"[]","",0,"","[]",0 27318374,0,"comment","nexuist","2021-05-28 17:45:34.000000000","See the docs: https://clickhouse.tech/docs/en/
> Different orders for storing data are better suited to different scenarios. The data access scenario refers to what queries are made, how often, and in what proportion; how much data is read for each type of query – rows, columns, and bytes; the relationship between reading and updating data; the working size of the data and how locally it is used; whether transactions are used, and how isolated they are; requirements for data replication and logical integrity; requirements for latency and throughput for each type of query, and so on.",0,27315384,0,"[]","",0,"","[]",0 24208310,0,"comment","AlfeG","2020-08-19 08:23:31.000000000","500 000 000 requests per month is just about 200 requests per second. Why should there be any problem for any DB?
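That per-second figure checks out; a quick back-of-envelope in Python, averaging over a 30-day month:

```python
# 500M requests/month averaged over a 30-day month.
requests_per_month = 500_000_000
seconds_per_month = 30 * 24 * 60 * 60  # 2,592,000

avg_rps = requests_per_month / seconds_per_month
print(round(avg_rps))  # roughly 193 requests/second
```

Of course the average hides peaks, which is why the follow-up about burst rates matters.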
As for the question - I saw a lot of great reviews on ClickHouse DB",0,24208193,0,"[24208336]","",0,"","[]",0 24209036,0,"comment","superlupo","2020-08-19 10:53:50.000000000","Yes, there are some peaks with 10-20000 per sec.
Clickhouse seems very suitable as a database. Does anybody know open source analytics tools that use it? Two parts would be needed: the client javascript tracker which inserts into the database, and a GUI for reports.",0,24208336,0,"[]","",0,"","[]",0 24210819,0,"comment","hodgesrm","2020-08-19 14:38:58.000000000","Anyone interested in hearing more, please join the next San Francisco ClickHouse meetup. The SplitGraph folks will be doing a presentation on integration of open data with ClickHouse.
https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Mee...",0,24210332,0,"[24210957]","",0,"","[]",0 24210957,0,"comment","chatmasta","2020-08-19 14:51:48.000000000","We're excited for it :)
In the meantime, if anyone wants to query Splitgraph data from ClickHouse, we have specific instructions for that here:
https://www.splitgraph.com/connect",0,24210819,0,"[]","",0,"","[]",0 27330357,0,"comment","hashhar","2021-05-30 03:27:40.000000000","I wouldn't use Trino if you are looking to ONLY query a single database like Clickhouse or Postgres etc (unless you want an ANSI SQL abstraction over your choice of database). Obviously ClickHouse and Postgres will have lower latency when hit directly because you can bypass the analysis, planning, optimization and scheduling that Trino does.
It does federation better than ClickHouse and that's where it shines. Joins across disparate systems - even between relational and non-relational systems. And obviously for the MPP queries on distributed filesystems.",0,27315985,0,"[27340072]","",0,"","[]",0 27335619,0,"comment","rekwah","2021-05-30 19:00:09.000000000","Interesting concept. I could imagine a slightly modified approach being used as a learning tool for making large/complex codebases more approachable.
Want to understand how Clickhouse works? We've got this challenge with 20 stages that teaches you the concepts, using the CH codebase itself, not re-implemented in the language of your choosing. Learn the storage format & use the internal codebase classes/data structures to implement some things.
Now, the learner has a more holistic view of how the technology works and hopefully an increased likelihood of contributing back to the open source project.",0,27334932,0,"[]","",0,"","[]",0 24235556,0,"comment","chatmasta","2020-08-21 15:15:00.000000000","I'm not sure if you're asking about (a) querying Oracle from a Postgres client through Splitgraph, or (b) querying Splitgraph from Oracle.
We want to support both these use cases. For (a), Oracle would be an "upstream" to Splitgraph. We'll need to write a plugin that implements the FDW and does introspection. Eventually, we want you to be able to configure upstreams from the Web UI.
For (b), you can probably find a way to query Splitgraph from Oracle, e.g. using Oracle's "gateway" feature [0]. What's nice about Splitgraph is that it's compatible with any SQL client that can speak the Postgres protocol (or ODBC). So if Oracle can connect to a Postgres database, it can connect to Splitgraph.
We have instructions for how to query Splitgraph from within ClickHouse at [1]. We're actually giving a presentation about this to a ClickHouse meetup on Sep 10, feel free to join. [2]
[0] https://docs.oracle.com/cd/E18283_01/owb.112/e10582/gateways...
[1] https://www.splitgraph.com/connect
[2] https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Mee...",0,24235177,0,"[24238776]","",0,"","[]",0 24238776,0,"comment","hodgesrm","2020-08-21 20:52:36.000000000","SF ClickHouse meetup organizer here. Thanks for the shout-out for the SF ClickHouse Meetup. We're looking forward to hearing about SplitGraph on September 10th.",0,24235556,0,"[]","",0,"","[]",0 20532936,0,"comment","kirankn","2019-07-26 07:08:16.000000000","We haven't used Druid yet. We did a high level comparison among Druid, Clickhouse & Pinot from material available on the internet. Apparently, all 3 have similar mechanisms. But druid is a little expensive to deploy when at a smaller scale. Clickhouse seems to be performant too. We seem to be inclining towards Clickhouse.",0,20532615,0,"[20532985]","",0,"","[]",0 27356030,0,"comment","protoduction","2021-06-01 15:32:17.000000000","Friendly Captcha (https://friendlycaptcha.com) | Remote, EU timezone | Part-time or hourly basis | Full-stack developer
We are a small, profitable company providing a privacy-friendly and accessible alternative to Google ReCAPTCHA based on proof of work. We are looking for a fullstack developer to help us further improve our anti-spam tools that don't carry an accessibility and privacy cost to end-users. The ideal candidate is a generalist, has strong knowledge of web APIs, has experience with open source, and has built a SaaS product before.
Technologies:
* Typescript, Go, HTML, CSS (and likely React in the future for a revamped customer/admin dashboard)
* Serverless (Cloudflare Workers) as well as good-old load balanced services (in Go)
* Redis, FaunaDB, Postgres, Clickhouse
* Git, Github Actions, Sentry, Stripe
* Plugins for Wordpress in PHP, and Flutter (Dart)
Reach out to me at guido@<our domain name>.",0,27355392,0,"[]","",0,"","[]",0 27360199,0,"comment","ajzo90","2021-06-01 19:43:28.000000000","Data and backend engineer | infobaleen.com | remote | sweden | go
As a data engineer at Infobaleen, you will design and develop data systems and machine learning pipelines. The data-centric platform consists of a data ingestion database, a machine-learning engine, and a dashboard engine. You will be responsible for helping our customer success team master all components and make crucial decisions about which requested features we should prioritize. In addition, you will provide technical leadership and a deep understanding of data modeling. We work remotely from Stockholm, Gothenburg, Umeå, Piteå, and Berlin.
We are a tech-agnostic company and have built our service using a wide range of technologies. The core application is built in Go. Supporting tools depend on Rust, Python, Tensorflow, and ClickHouse. We orchestrate and deploy everything with Docker and Kubernetes on Google Cloud Platform
christian @ [our-domain]
https://www.linkedin.com/jobs/view/2568138502/",0,27355392,0,"[]","",0,"","[]",0 16949137,0,"story","postila","2018-04-28 19:44:54.000000000","",0,0,0,"[]","https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/",3,"HTTP Analytics for 6M requests per second using ClickHouse","[]",0 27365272,0,"comment","mescudi","2021-06-02 05:42:40.000000000","Location: Kazakhstan
Remote: Yes, English/Russian
Willing to relocate: Yes
Technologies: kubernetes (k8s), golang, aws, python, docker, helm, linux, terraform, ansible, clickhouse, ELK, cockroachdb, aerospike
Resume: https://tinyurl.com/2rpdpzm9
Email: nurtas977@gmail.com
Eager to work at startup culture, but open to interviews from big companies. Currently DevOps, sometimes accomplish SE and SRE tasks.
Gallup's CliftonStrength top-5: Restorative, Competition, Deliberative, Learner, Achiever",0,27355390,0,"[]","",0,"","[]",0 27367036,0,"story","xoelop","2021-06-02 10:33:40.000000000","",0,0,0,"[27378332]","https://blog.tinybird.co/2021/06/01/tips-8-generating-time-series-on-clickhouse/",2,"Generating Time-Series on ClickHouse","[]",1 27378332,0,"comment","qoega","2021-06-03 07:50:46.000000000","Have you tried generateRandom? It supports almost all types. https://clickhouse.tech/docs/en/sql-reference/table-function...",0,27367036,0,"[]","",0,"","[]",0 27378380,0,"comment","mananaysiempre","2021-06-03 07:57:50.000000000","... Also known as a very simple case of a “column-oriented database”, of which there are several at various scales from Metakit[1] to Clickhouse[2]. It’s a neat way to have columns which are sparsely populated, required to accommodate large blobs, numerous but usually not accessed all at once, or frequently added and deleted.
Nothing’s perfect, of course: you can’t stream records in such a format, so no convenient Unix-style tooling.
[1]: http://www.equi4.com/metakit.html [2]: https://yandex.com/dev/clickhouse/",0,27375564,0,"[]","",0,"","[]",0 24265301,0,"comment","Redsquare","2020-08-24 20:45:00.000000000","Try clickhouse",0,24265269,0,"[24267813]","",0,"","[]",0 24266358,0,"comment","corford","2020-08-24 22:41:27.000000000","We make heavy use of views within Snowflake, have sensible cluster keying and, further downstream, also leverage things like Looker's PDTs and Elastic Search.
The issue is we have billions of rows and very varied analytical requirements, so there are quite a few "pathological" queries.
Change streams are something we're looking at (as well as MemSQL, Clickhouse etc.)",0,24265933,0,"[24266537]","",0,"","[]",0 24267800,0,"comment","FridgeSeal","2020-08-25 03:06:55.000000000","I could not disagree more.
Working with it was fraught with issues. Performance was mediocre at best, it was horribly expensive, and the Python and JS client libs had recurring issues with disconnecting and reconnecting. The advice given to us around scaling concurrent connections was bizarre at best. Teammates had numerous issues where it was clear corners had been cut in handling some edge cases around certain unicode characters. Their Snowpipe "streaming" implementation was...not good. The idea of having compute workers that "spun up and down" sounded good in theory, but in practice led to more bottlenecks and delays than anything else.
The AWS outage last year that prevented you from provisioning new instances essentially crippled our snowflake DB.
I almost go out of my way to recommend people _not_ use it. I keep seeing it pop up, but mostly because it seems they're doing what Mongo DB did in the early days and just throw marketing money to capture mindshare as opposed to being an actually good product.
We changed to ClickHouse and the difference was literally night-and-day. The performance especially was far superior.",0,24265319,0,"[24270537]","",0,"","[]",0 24267817,0,"comment","FridgeSeal","2020-08-25 03:10:16.000000000","Haven't used MemSQL, but check out ClickHouse.",0,24265458,0,"[]","",0,"","[]",0 24267876,0,"comment","FridgeSeal","2020-08-25 03:23:48.000000000","Out of curiousity, have you used ClickHouse?
Because I had the opposite experience - you literally couldn't pay me to use Snowflake again.",0,24265349,0,"[]","",0,"","[]",0 24270797,0,"comment","aseipp","2020-08-25 12:54:12.000000000","GPU databases are limited primarily by memory constraints on the card (e.g. ~32GB maximum per card for GV100 or whatever) and interconnect latency/bandwidth, not by raw parallel scan speed. If scan speed was all that mattered, we'd have had GPU-like parallel database hardware decades ago. You can crunch rows, but only as long as it fits in memory. Once your working set exceeds the provided RAM and has to page out data to the CPU over PCIe or some other link, the numbers and utilization begin looking much worse. "Every benchmark looks amazing when your working set fits entirely in cache."
But even more than that, for the price of a single high end Tesla (approx. 10k USD), you can build a high-end COTS x86 machine with a shitload of RAM, NVMe, and then install ClickHouse on it. That machine will scale to trillions of rows with ease and millisecond response times, whether or not everything fits in memory. It will cost less money and also cost less energy and it will scale out easier, and have better utilization of the hardware.
I'd wager that unless you have infinite money to dump on Nvidia or exceedingly specific requirements, any GPU database will get soaked by a comparable columnar OLAP store in every dimension.",0,24268905,0,"[24273903]","",0,"","[]",0 20566206,0,"story","bdcravens","2019-07-30 16:06:14.000000000","",0,0,0,"[]","https://www.altinity.com/blog/2019/6/11/clickhouse-local-the-power-of-clickhouse-sql-in-a-single-command",3,"Clickhouse-local: The power of ClickHouse SQL in a single command","[]",0 16968165,0,"comment","mike_heffner","2018-05-01 15:57:13.000000000","SolarWinds Cloud | Sr Data Engineer | SF / US-REMOTE | Full-time | https://solarwinds.jobs/jobs/?q=cloud
We're looking for a full-time software engineer to take a key role in building the large-scale distributed systems that power Solarwinds Cloud products: Papertrail (Real Time Logging), AppOptics (Server, Infrastructure, Application Performance Monitoring and Distributed Tracing), Pingdom (DEM) and Loggly (Structured Log Analysis).
We’re a small team so everyone has the opportunity to have a big impact. We’ve built our platform out largely on Java8 Dropwizard services, a handful of Golang services and some C++ where performance is critical. We leverage Kafka as our main service bus, Cassandra for long term storage, our in-house stream processing framework for online analytics, ClickHouse for large scale log storage, and we rely on Zookeeper as a core part of intra/inter-service coordination. Our data pipeline pushes millions of messages a second and tens of terabytes of logs per day.
All team members, whether in San Francisco, one of many offices, or remote, commit code to Github, communicate over Slack and Hangouts, push code to production via our ChatOps bot, and run all production applications on AWS. We also use an array of best-breed SaaS applications to get code to production quickly and reliably. We are a team that is committed to a healthy work/life balance.
At SolarWinds Cloud you get all the benefits of a small startup, with the backing of a big company so there is no worry about the next round of funding. SolarWinds offers competitive bonus and matching 401k programs that create an attractive total compensation package.
This is an example of some of the technology we build and work with on a regular basis: http://www.heavybit.com/library/blog/streamlining-distribute....
Learn more at: https://solarwinds.jobs/jobs/?q=cloud or contact me directly at mike-at-solarwinds.cloud (no recruiters).",0,16967543,0,"[]","",0,"","[]",0 16968222,0,"comment","dmangot","2018-05-01 16:02:00.000000000","SolarWinds Cloud | Site Reliability Engineers (SRE) | VAN, ATX, BOS, RTP, SLC | ONSITE
http://bit.ly/2z4qmId For more information, email dmangot[at]solarwinds[dot]cloud with the subject line [Hacker News SRE]
Metrics, monitoring, observability. You live and breathe it every day. Now you want to take it to the next level and work on a product that does the same. The SolarWinds Cloud teams are looking for SREs to help build, improve, and manage our high performance stream processing pipelines. This is truly one of those jobs where you and your developer/ops friends can use the tool you operate every single day.
The Cloud teams (Loggly, Papertrail, AppOptics, Pingdom) stack is largely Ruby, Java, Kafka, Python, Elasticsearch, Clickhouse, and Cassandra, processing millions of metrics, logs, and traces every second. The SRE team uses a mix of Terraform, Packer, Python, Vagrant, and SaltStack to run our 100% AWS platform. This is your opportunity to join a talented SRE team at a company that is constantly growing (7 acquisitions in 4 years). Plus, with the backing of SolarWinds behind it, there are no worries about running out of VC funding, or where the next round is coming from. We're a distributed team where everyone writes code, building for now and the future and we're looking for the next piece of the puzzle to collaborate in creating that future.
If this sounds interesting to you, we'd love to open up a conversation about whether we're a good match, setup some interviews and a coding test. You can find the contact info above.
About the company: The SolarWinds Cloud companies are a collection of tools that can be used together or independently to give best of breed monitoring to cloud, hybrid, and on premise installations. Offering metrics, traces, and events, the products cover all aspects of the observability triad. Having been grown through acquisition, each product has a high throughput stream processing pipeline that serves thousands of customers.",0,16967543,0,"[]","",0,"","[]",0 24282198,0,"comment","kakoni","2020-08-26 13:22:30.000000000","Well there is clickhouse. And projects like this https://github.com/flant/loghouse",0,24281008,0,"[]","",0,"","[]",0 24288531,0,"comment","FridgeSeal","2020-08-26 23:34:33.000000000","Pretty much everything I threw at both, Clickhouse did faster. I never benchmarked write speeds properly, but I do know CH is capable of high write performance.
General analytics queries for the likes of dashboards: CH latencies on the order of < 100ms, Snowflake about a second. Snowflake couldn’t do geospatial queries when I had to use it, but I was getting responses from CH in like 40ms for a dataset of tens of millions of points.
This guy has done some really in depth benchmarks: https://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html
And https://tech.marksblogg.com/benchmarks.html
CH is one of the fastest non-GPU databases there.",0,24278880,0,"[24288933]","",0,"","[]",0 27418820,0,"comment","parhamn","2021-06-07 03:47:29.000000000","Great write up and congrats on the results. I think one of my takeaways from this is also how well rounded Clickhouse is. That fact that it is also so performant against timeseries datasets + databases is very surprising.",0,27411307,0,"[]","",0,"","[]",0 27446704,0,"comment","citrin_ru","2021-06-09 11:33:52.000000000","ClickHouse includes a tool which allows to run SQL queries over files/data in any supported format including TSV [1]
E. g. you can use this command to see which unix users use most memory (RSS):
ps aux | awk 'NR > 1 { print $1"\t"$6 }' | clickhouse-local -S "user String, mem UInt64" -q "SELECT user, formatReadableSize(sum(mem)*1024) as mem_total FROM table GROUP BY user ORDER BY sum(mem) DESC LIMIT 20 FORMAT PrettyCompact"
It is pretty fast - I've generated 7.7G of test data for this query and it took 23 seconds to run the query above (using multiple threads). For comparison, wc -l for this file takes 9 seconds.
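For comparison, the same per-user aggregation can be sketched in plain Python over hypothetical (user, rss_kb) pairs:

```python
# Sum RSS per user and rank users by total memory, like the SQL above.
from collections import defaultdict

rows = [("root", 1024), ("www", 2048), ("root", 512)]

totals = defaultdict(int)
for user, rss_kb in rows:
    totals[user] += rss_kb

top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(top)  # [('www', 2048), ('root', 1536)]
```

The clickhouse-local version wins at scale mainly because it parallelizes the scan and aggregation across threads.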
[1] https://clickhouse.tech/docs/en/operations/utilities/clickho...",0,27423276,0,"[]","",0,"","[]",0 13443653,0,"comment","sply","2017-01-20 13:35:39.000000000","Very thoughtful notes, thanks. Waiting for your full blog posts.
Have you examined emerging databases like Tarantool https://tarantool.org/, GunDB http://gundb.io, TiDB https://github.com/pingcap/tidb, ClickHouse https://clickhouse.yandex/ ?
It would be great to read some deep and independent analysis for them too.",0,13442042,0,"[13444361,13530101]","",0,"","[]",0 27464503,0,"comment","nicbaz","2021-06-10 19:21:35.000000000","Kaiko | Software Engineer | Remote (France) / Onsite (Paris, France) | Full Time
+ We collect and aggregate financial data on centralized crypto exchanges as well as DeFi. We're talking about 9.7B unique transactions stored in May 2021
+ We built our platform using things like ClickHouse, Kafka, CockroachDB, NATS, Consul, Golang, Nomad, Terraform, Vault, VictoriaMetrics, Loki, Alertmanager and Grafana
+ We do continuous integration, peer reviewing and semantic versioning
+ You will join a team of 6 people (minimum 6 years of experience), most of whom are parents, work in short iterations (2 weeks), with daily team status updates
+ You will have full shared-ownership of the platform, complete freedom in how you work and organize yourself (one person's freedom ends where another's begins)
+ You can and are expected to make mistakes
+ You will be working on prioritization, organization, implementation and support of everything that needs to be done in order to fulfill the company's objectives. We're talking about road map building, infrastructure deployments, knowledge sharing, code development, customer feedback, architectural decisions, etc
+ You have the utmost respect for legacy code, with the occasional and perfectly understandable rant every now and then
In short, if you want to be accountable for all aspects of your work, are a good communicator, value team effort, and are looking for a good technical challenge (high volumes, low latencies): let's talk about what we have in store for you.
Our interview process is fairly straightforward. You'll be able to meet people from all departments, ask all the questions you need. No whiteboard quick sorting, no documentation quizzes, we do a role-play discussion and put ourselves in the shoes of an imaginary team, with real-life problems.
https://www.kaiko.com/pages/software-developer / nicolas@kaiko.com",1,27355392,0,"[]","",0,"","[]",0 24357257,0,"story","theorangeone","2020-09-02 19:52:12.000000000","",0,0,0,"[]","https://theorangeone.net/posts/calming-down-clickhouse/",2,"Calming Down Clickhouse","[]",0 20654604,0,"comment","olavgg","2019-08-09 14:25:01.000000000","I would love to see benchmarks with Clickhouse which scales much better than regular SQL databases on a single machine.",0,20653266,0,"[20684796]","",0,"","[]",0 27482843,0,"comment","nezirus","2021-06-12 08:46:11.000000000","Deleting whole partitions is generally useful strategy. It's like the difference between single inserts and batch inserts (often huge performance difference, and much lower IO)
Since you mentioned Cassandra and TTL, I'll mention ClickHouse: it has very nice TTL options, and splitting into smaller partitions and using "ttl_only_drop_parts=1" has proven itself in production with high data ingestion rates.
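The partition-dropping idea can be illustrated in plain Python (hypothetical day-keyed buckets): expiring old data means discarding whole buckets rather than scanning and rewriting rows one by one.

```python
# Data bucketed by day; TTL-style expiry drops entire buckets.
partitions = {
    "2021-06-10": [1, 2, 3],
    "2021-06-11": [4, 5],
    "2021-06-12": [6],
}

cutoff = "2021-06-11"
for day in [d for d in partitions if d < cutoff]:
    del partitions[day]  # cheap: one discard per partition, no row scan

print(sorted(partitions))  # ['2021-06-11', '2021-06-12']
```

This is the same reason batch deletes beat single-row deletes: the work is proportional to the number of partitions, not the number of rows.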
Last but not least, I almost always prefer Postgres for data storage needs; one can trust it to be safe and fast enough. Only some specific situations warrant other solutions, but it's a long way until that point (if ever), and it's better not to optimize too early.",0,27482732,0,"[]","",0,"","[]",0 17077675,0,"comment","buro9","2018-05-15 20:39:59.000000000","CPU presently.
The filter expressions are based on the Wireshark Display Filters https://www.wireshark.org/docs/wsug_html_chunked/ChWorkBuild... and we support everything except the slice operator.
Rust handles the parsing, validation, AST creation, etc. That AST can then be applied to a trait table similar to the Wireshark implementation but without the necessity of a pcap step.
I hope that the filter becomes an invariant form of filter against traffic and that once we've got the AST we can apply that filter to different places. Initially just to itself within a Rust matching engine at the edge, but if you have columns on a DB why not ask for a SQL expression derived from the filter expression and then filter a ClickHouse store using the same filter, and likewise as per your suggestion if we can take some of the expressions that aren't L7 why can't we have these run in the network card, etc.
Right now... just CPU as it is early days. But eventually we can look at all places we match traffic and consider that a contender for the same filter to be applied there.",0,17077618,0,"[]","",0,"","[]",0 27495519,0,"comment","nwmcsween","2021-06-13 19:29:13.000000000","I dont understand the need to NiH things, why not clickhouse with weekly, monthly, etc aggregation. It comes with built in sharding, can store time series data somewhat efficiently and doesn't have some hack h/a setup (in general not thanos).",0,27491955,0,"[27498450]","",0,"","[]",0 27498450,0,"comment","BurritoAlPastor","2021-06-14 01:44:37.000000000","Wouldn’t Clickhouse be the NIH solution? Thanos supports Prometheus natively, but Clickhouse doesn’t (or at least doesn’t appear on the integrations list in the Prometheus docs), so you’d have to write an adapter for it.",0,27495519,0,"[27558434]","",0,"","[]",0 24381343,0,"comment","hodgesrm","2020-09-05 04:09:57.000000000","Adding columns can be hard at scale. MongoDB allows you to add columns incrementally without stopping the world. You just add the value and it's materialized in the DBMS. The point is you don't have to manually define schema but instead the database works with the structure of the data.
I'm a dyed-in-the-wool RDBMS user but I can see the value of this feature. The trade-off, of course, is that your application has to handle varying schema levels. MongoDB also won't protect you against typos, inconsistent types, and other foolishness. I would not judge people who choose to make this trade--it's a sensible one for many use cases.
Many analytic databases are headed in this direction due to the amount of data that arrives in the form of nested JSON structures. I can't speak for other DBMS types but it's something we're very interested in for ClickHouse.",0,24380008,0,"[24422556,24391968]","",0,"","[]",0 20681763,0,"comment","fnewberg","2019-08-13 00:53:19.000000000","Thanks for all the great questions!
The overhead is fairly minimal since we mostly intercept things that happen in the app and don't really have any things that run continuously at high frequency, e.g. we add about 0.5ms to network calls on average devices. The RAM usage is limited since we put caps on how much data we capture in a session. We do give devs the option to configure the SDK to capture screenshots when errors occur, which is the largest consumption of RAM we incur, but that can be disabled.
We make network calls for logs and submitting session data, so that does use some bandwidth, but as mentioned above, we put caps on how much data we capture in a session to avoid payloads getting excessively large.
We've had some customers who have been pretty sensitive to battery drain, and from working with them we've solved a couple of issues that did affect battery drain but no longer do.
We haven't invested a ton of effort yet to optimize the SDK specifically for 2G environments. That said we do have customers using our service with large user bases in India and the Philippines, where many of their users are on less-than-stellar network connections. As we expand to serve more markets where 2G is more common, we will be focusing on SDK performance for that.
As to how we ensured the SDK wasn't causing problems.... blood, sweat, and tears? Jokes aside, we had great early adopters who worked through some painful bugs with us. We've also had some devs at companies that we think have high code standards give us pointed feedback. Our basic thesis is that development for mobile is hard, and developing an SDK for mobile that does not impact the app it's integrated in is really hard, so we also used our own tool to figure out when things were not going right with the SDK.
The backend stack uses a bunch of wonderful OSS tech like nginx, Kafka, Cassandra, Clickhouse, Redis, MySQL, React, Gin, and Django. We are dealing with data volumes that pose fun engineering challenges, and we wouldn't be able to do it if we weren't standing on the shoulders of giants.
If I missed the mark or you're looking for more depth, don't hesitate to follow up!",0,20680921,0,"[20683490]","",0,"","[]",0 20684796,0,"comment","zX41ZdbW","2019-08-13 11:48:08.000000000","ClickHouse is happy to use multiple cores if the query is heavy enough. We have tested it on AMD EPYC 7351 more than a year ago and get promising results. (I have not saved them but I'll try to reproduce and post them here.)
Another case of scalability: we have also tested ClickHouse on an Aarch64 server (Cavium ThunderX2) with 224 logical cores. Despite the fact that each core is 3..7 times slower than an Intel E5-2650 core and the code is not as optimized as for x86_64, it was on par in throughput for heavy queries.
There are also tests of ClickHouse on Power9, if you're interested...",0,20654604,0,"[]","",0,"","[]",0 13485691,0,"comment","dunkelheit","2017-01-25 20:29:04.000000000","Pretty wide array of technologies covered in these benchmarks. I wonder how ClickHouse will fare; it should be very competitive.",0,13481824,0,"[]","",0,"","[]",0 20708411,0,"comment","luizfelberti","2019-08-15 18:53:28.000000000","I'll give an unorthodox suggestion: ClickHouse
You'll need to manage some stuff yourself, and assemble your own dashboards and stuff, so there will be some labor involved. That being said, I doubt it will be more painful than managing an ELK stack: there are just too many ways you can destabilize a cluster with it.
ClickHouse clusters from my experience are ridiculously scalable, fast, and stable. There are several other accounts to back that up, and a good case study is Cloudflare, which uses it to store and query all of their logs and metrics from all data centers (that's quite a few PB of data).
There are some projects on GitHub you can use to get inspired, but what you need is pretty much a ClickHouse cluster, Grafana, and a Log Shipper.",0,20707324,0,"[]","",0,"","[]",0 20711285,0,"comment","PeterZaitsev","2019-08-16 01:11:04.000000000","Check out ClickHouse https://clickhouse.yandex/ - this looks like a very good fit for what you say",0,20683016,0,"[]","",0,"","[]",0 24425095,0,"comment","hodgesrm","2020-09-09 19:46:24.000000000","It's definitely easier than a conventional RDBMS. Adding a column in ClickHouse is just a metadata operation. BigQuery has great nested structure support.
Still, it's hard to beat MongoDB in this respect. In my first app I was amazed that I could just insert a BSON object and MongoDB created a queryable table automatically. You pay for it of course in other ways but the ease of use is quite extraordinary.",0,24422556,0,"[]","",0,"","[]",0 27555785,0,"story","jtsymonds","2021-06-18 22:39:09.000000000","",0,0,0,"[]","https://altinity.com/blog/integrating-clickhouse-with-minio",2,"Integrating ClickHouse with S3 Compatible MinIO","[]",0 27558434,0,"comment","nwmcsween","2021-06-19 07:05:31.000000000","I'll bite, no because Prometheus, Thanos, etc all have to reinvent sharding (Thanos), a query language (Promql), vectorizing (Prometgeus can't), transactions (Prometheus internals), etc when it all exists in clickhouse and uses somewhat standard SQL. There are trade-offs but imo it's not worth NiHing something",0,27498450,0,"[]","",0,"","[]",0 17142734,0,"comment","olavgg","2018-05-24 10:45:13.000000000","Postgresql is good for analytics, but it doesn't scale really well with a lot of data. I have moved my analytics to Clickhouse, 1000x better performance.",0,17135718,0,"[]","",0,"","[]",0 20757055,0,"comment","einpoklum","2019-08-21 13:44:38.000000000","Don't use PostgreSQL or MySQL/MariaDB for analytic work - they're super-slow at that. Try systems like MonetDB, ClickHouse (not a full DBMS) etc: https://en.wikipedia.org/wiki/List_of_column-oriented_DBMSes Column stores are where it's at.
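The schema-on-read trade-off described in the comments above (trivial inserts, but nothing stops typos or inconsistent types) can be sketched in plain Python, with dicts standing in for BSON documents; the collection, field names, and values are all invented for illustration:

```python
# A list of dicts stands in for a schemaless collection (as with MongoDB/BSON).
events = []

# "Adding a column" is just writing a new field; no migration needed.
events.append({"user": "alice", "clicks": 3})
events.append({"user": "bob", "clicks": 5, "country": "DE"})  # new field appears

# But nothing catches a typo or an inconsistent type:
events.append({"user": "carol", "cliks": "7"})  # misspelled key, string value

# Queries must therefore tolerate missing or oddly typed values.
total = sum(e["clicks"] for e in events if isinstance(e.get("clicks"), int))
print(total)  # 8 -- carol's mistyped row is silently skipped
```

A real document store behaves the same way: the mistyped field is stored happily and only surfaces later, when a query quietly misses it.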
(Not relevant if you need to process transactions.)",0,20753985,0,"[20757136]","",0,"","[]",0 24462597,0,"comment","nickpeterson","2020-09-13 17:50:49.000000000","As someone interested in esoteric databases, I’d be curious to hear which ones you’re aware of that in your opinion don’t live up to claims or do. For instance, I’ve been very impressed with Clickhouse and have messed around with Jd (from Jsoftware ). Any other good ones I should check out?",0,24462403,0,"[]","",0,"","[]",0 24462933,0,"comment","DevKoala","2020-09-13 18:38:46.000000000","Does anybody with Clickhouse experience at scale know if AresDB is better on some use cases?",0,24461844,0,"[24463303,24467764]","",0,"","[]",0 24463303,0,"comment","hodgesrm","2020-09-13 19:33:33.000000000","It's hard to say, though I think the UPSERT capability looks useful because it simplifies handling duplicates. On the other hand it does not appear that Ares offers clustering, which is critical for large datasets.
(I work on ClickHouse and enjoyed this article when it came out.)",0,24462933,0,"[24467575,24463490]","",0,"","[]",0 24467575,0,"comment","einpoklum","2020-09-14 07:02:37.000000000","It should be mentioned, to ClickHouse' credit, that they made an effort to publish relatively detailed benchmark results for a some data sets and queries, when they first came out. They even got in contact with my research group at the time (the MonetDB group at CWI) to make an effort to present the MonetDB results in a fair manner.",0,24463303,0,"[]","",0,"","[]",0 24467764,0,"comment","tmd83","2020-09-14 07:35:42.000000000","What databases are comparable to ClickHouse say for Ease of single node deployment Super fast basic analytics, filter/group/count without a lot of optimization Fabulous compression",0,24462933,0,"[24468526]","",0,"","[]",0 20762890,0,"comment","manigandham","2019-08-21 23:07:46.000000000","There are other relational databases like MemSQL or Clickhouse that use distributed column-oriented architectures that are much better at large scale analytics and aggregations.
Postgres is getting pluggable storage engines in the next version (and already has foreign data wrappers) so that can at least lead to a better storage design.",0,20762820,0,"[]","",0,"","[]",0 20765160,0,"comment","einpoklum","2019-08-22 06:16:01.000000000","Yes, it's just like I thought. You're comparing against transaction-oriented DBMSes, or ones which handle documents rather than tabular data (and hence slow on tabular data).
One possible exception is InfluxDB - I'm not familiar enough with it.
Anyway, try running TSBS on columnar DBMSes like Actian VectorH, Vertica, SAP HANA etc. ClickHouse may also be relevant; they don't support any possible schema, but it may be enough to run TSBS.",0,20763024,0,"[20771646,20766847,20767331,20769608]","",0,"","[]",0 20765215,0,"comment","einpoklum","2019-08-22 06:26:24.000000000","Oh yea, MemSQL and ClickHouse are also indeed relevant and in this category, except that ClickHouse doesn't support all of SQL and any table structure, so it's not a full-fledged DBMS.",0,20763014,0,"[]","",0,"","[]",0 20766847,0,"comment","oddtodd","2019-08-22 11:24:11.000000000","At the end of 2018 Altinity benchmarked ClickHouse against the TSBS and documented it.
https://www.altinity.com/blog/clickhouse-for-time-series",0,20765160,0,"[]","",0,"","[]",0 20770073,0,"comment","valyala","2019-08-22 16:53:40.000000000","TimescaleDB could fit your workload if PostgreSQL fits you. The main issue with PostgreSQL and TimescaleDB is big amounts of storage space required for huge time series data volumes. There are reports that storing data on ZSF can reduce the required storage space.
ClickHouse [1] would probably fit your needs better. It can write millions of rows per second [2]. It can scan billions of rows per second on a single node, and it scales to multiple nodes.
Also I'd recommend taking a look at other open-source TSDBs with cluster support:
- M3DB [3]
- Cortex [4]
- VictoriaMetrics [5]
These TSDBs speak PromQL instead of SQL. PromQL is a query language specially optimized for typical time-series queries [6].
[2] https://blog.cloudflare.com/http-analytics-for-6m-requests-p...
[4] https://github.com/cortexproject/cortex
[5] https://github.com/VictoriaMetrics/VictoriaMetrics/
[6] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...",0,20762133,0,"[20772404]","",0,"","[]",0 20771646,0,"comment","RobAtticus","2019-08-22 19:40:03.000000000","We're happy to take pull requests for new databases, we have so far from Clickhouse, CrateDB, and SiriDB (and one pending). We've tried to make it relatively easy for new databases to hook in.
We usually implement ones that we hear about a lot from customers, and so far those haven't come up a ton. We'll keep it in mind though as we look to keep adding new ones.",0,20765160,0,"[]","",0,"","[]",0 27604239,0,"story","dengolius","2021-06-23 13:22:06.000000000","",0,0,0,"[27604240]","https://gitlab.com/mikler/glaber",4,"Glaber is a Zabbix fork with many improvements and ClickHouse based","[]",1 20787585,0,"comment","zX41ZdbW","2019-08-24 15:16:31.000000000","BTW, this algorithm is implemented in ClickHouse for fuzzy string matching employing rather clever technique for SIMD optimization: https://github.com/yandex/ClickHouse/blob/a9cfe4ce91b5cdfedb...",0,20772044,0,"[]","",0,"","[]",0 20797568,0,"comment","bdcravens","2019-08-26 05:05:17.000000000","Clickhouse - the speed and cost savings I'm able to get out of it for expensive database queries is some next-level wizardry.",0,20793590,0,"[]","",0,"","[]",0 20807398,0,"comment","wikibob","2019-08-27 08:32:18.000000000","BigQuery is /terrific/, for in-house analytics. It would very likely not be appropriate for backing a SaaS, at $5 per terabyte scanned.
I would suggest the OP is just fine with Postgres for a while. They can shard it when needed.
Then eventually they can either get more sophisticated with Postgres sharding, or move to something like TiDB, clickhouse, or another event store.",0,20802833,0,"[]","",0,"","[]",0 17206342,0,"comment","mike_heffner","2018-06-01 15:39:35.000000000","SolarWinds Cloud | Sr Data Engineer | SF / US-REMOTE | Full-time | https://solarwinds.jobs/jobs/?q=cloud
We're looking for a full-time software engineer to take a key role in building the large-scale distributed systems that power Solarwinds Cloud products: Papertrail (Real Time Logging), AppOptics (Server, Infrastructure, Application Performance Monitoring and Distributed Tracing), Pingdom (DEM) and Loggly (Structured Log Analysis).
We’re a small team so everyone has the opportunity to have a big impact. We’ve built our platform out largely on Java8 Dropwizard services, a handful of Golang services and some C++ where performance is critical. We leverage Kafka as our main service bus, Cassandra for long term storage, our in-house stream processing framework for online analytics, ClickHouse for large scale log storage, and we rely on Zookeeper as a core part of intra/inter-service coordination. Our data pipeline pushes millions of messages a second and 50TB of logs per day.
All team members, whether in one of our offices or those remote, commit code to Github, communicate over Slack and Hangouts, push code to production via our ChatOps bot, and run all production applications on AWS. We also use an array of best-of-breed SaaS applications to get code to production quickly and reliably. We are a team that is committed to a healthy work/life balance.
At SolarWinds Cloud you get all the benefits of a small startup, with the backing of a big company so there is no worry about the next round of funding. SolarWinds offers competitive bonus and matching 401k programs that create an attractive total compensation package.
Learn more at: https://solarwinds.jobs/jobs/?q=cloud or contact me directly at mike-at-solarwinds.cloud (no recruiters).",0,17205865,0,"[]","",0,"","[]",0 13609870,0,"story","marklit","2017-02-09 19:10:37.000000000","",0,0,0,"[]","http://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html",47,"1.1B Taxi Rides on ClickHouse","[]",0 13613290,0,"comment","rixed","2017-02-10 04:47:32.000000000","Don't they use clickhouse instead?",0,13611496,0,"[13613653]","",0,"","[]",0 17215037,0,"comment","jeroensoeters","2018-06-02 15:47:38.000000000","Instana | Senior Software Engineer | Austin, TX | Onsite | Full-time | Competitive salary + equity
Instana is the leading provider of Application Performance Management solutions for containerized microservice applications. At Instana, we apply automation and artificial intelligence to deliver the visibility needed to effectively manage the performance of today's dynamic applications across the DevOps lifecycle.
At Instana, we have a myriad of complex and interesting projects to work on; from our agent software that has ridiculous performance requirements, to our big data processing pipeline that processes many terabytes per day, and from a fully 3D rendered web UI, to state of the art machine learning algorithms for detecting and predicting anomalies.
Tech: Java8, Project Reactor, Cassandra, ElasticSearch, ClickHouse, Kafka, C, C++, Go, ES6, React, ThreeJS, AWS and much more.
Requirements: deep knowledge of the JVM, solid understanding of building distributed systems. Preferred: experience building ingress systems, stream processing
If you're interested please email me at jeroen.soeters@instana.com",0,17205865,0,"[]","",0,"","[]",0 24518926,0,"story","krnaveen14","2020-09-18 16:41:02.000000000","",0,0,0,"[]","https://altinity.com/blog/clickhouse-and-redshift-face-off-again-in-nyc-taxi-rides-benchmark",4,"ClickHouse and Redshift Face Off Again in NYC Taxi Rides Benchmark","[]",0 24522573,0,"comment","sixdimensional","2020-09-18 22:44:17.000000000","IMHO the Wikipedia definition of OLAP [1] covers the general topic quite well. Doing OLAP in the big data world, if you use the sense of OLAP equating to multidimensional analysis of data aggregated into cubes, there are several approaches to it in the open source world, such as Apache Kylin [2].
The basic idea remains the same though - building dimensional data models, and then using OLAP/cube technology with MDX on top.
That said, if I think about it, my experiences with wide columnar databases were with HBase and Cassandra. The main goals when using those systems were usually three things: A) getting many attributes/variables being tracked in a single row as part of a measurement or observation of some event - and needing to write the data ridiculously fast (append only) and report on that data in real time; B) coming up with creative data processing patterns to aggregate data into higher level tabular structures, to make for blazing fast queries on these aggregates, which could be updating in real time, and C) being able to store huge volumes of data on a redundant cluster in an efficient format.
If you look at A, B and C above, these are, in a sense, the same reasons why people wish to implement OLAP cube technology, for the most part - aggregating huge results into aggregated views, lightning fast query responses and ease of navigating - "slicing and dicing" the data. However, OLAP cubes are most commonly associated with batch processing - there are real-time OLAP cube technologies, that basically work by doing change data detection on the sources.
I kind of believe that you don't need specialized OLAP technology to accomplish the same thing - it's all about pre-aggregated and calculated data, efficient writing/reading and storage. I mean, OLAP technology is there to try to make that easier to do, but you can do the same thing with certain tricks in traditional dimensional models (accumulating snapshots, periodic snapshots, aggregate fact tables). Or in wide columnar databases (storing aggregates in wide tables). It's just maybe easier to use OLAP technology; otherwise, you kind of need to do all the scheduling, processing and aggregating of data yourself - but it's totally doable too and I have seen that work really well.
I personally have never been a fan of OLAP cube technology (MDX), despite how powerful it is, because it was so specialized - but that opinion was formed from a legacy of years of only being able to get these features from commercial vendors.
If you look at the legacy of SQL as a language, I actually believe there was an attempt to unify OLAP queries with SQL by implementing "CUBE" and "ROLLUP" as part of the ANSI SQL standard, but it was never as powerful as MDX and the OLAP engines that were built.
I think today, the world has changed, as we have open source OLAP and MDX options available - Pentaho Mondrian was popular for a while, not sure how much today. The current hot thing seems to be Apache Kylin. I also played with Apache Druid, in the past, although it doesn't seem as popular any more, as far as I can tell. Yandex Clickhouse is very popular also these days, for being extremely fast and open source. Again, Wikipedia has a good list of OLAP systems [3].
More to the point, if you use an event-based streaming approach (things like Kafka) - where the event comes directly from the source, combined with writing the data efficiently to something like an append-only, wide, column data store.. well, you have something pretty powerful. The only thing that a solution like this lacks, sometimes, is a powerful query engine - for example, Cassandra or HBase cannot do joins. There are ways to simulate joins with Apache Hive or other technologies, sitting on top of Cassandra or HBase.. but not sure I ever saw those work well.
I mean, the goal of aggregation is to avoid needing joins.. but, you'd be surprised how once in a while, you really wish you could just join this with that to come up with a new result - even in things like Cassandra or HBase. If you do your design right, you can avoid it.. but I find myself wanting it sometimes.
Really, in OLAP, everything is built to avoid JOINs or minimize them as much as possible. It's one reason, I believe though, that having JOINs in a dimensional model is still helpful. But maybe that's the old geezer who still likes SQL in me talking :)
Lastly, I thought I might mention - anyone who leaves all data without any kind of model structure or architecture (e.g. flat or raw data only) is in for a world of hurt once they get to scale. This is what creates data swamps instead of data lakes. Data models are still as important as ever - snowflake, dimensional, multidimensional, etc. Pure raw data can be very useful for exploratory analysis, even machine learning models, but easy analysis for end consumers always requires some form of data modelling and architecture.
There is an opinion that we should throw data modelling to the wayside, and just feed raw data into AI and machine learning models - but I personally have never seen this as a reality yet.
I hope all the above might be useful!
[1] https://en.wikipedia.org/wiki/Online_analytical_processing
[3] https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers",0,24486010,0,"[]","",0,"","[]",0 24532883,0,"comment","FridgeSeal","2020-09-20 09:27:38.000000000","We should make a list of technology that does this, because I know Clickhouse also has a reasonably detailed page on when to use it and when to not use it and why. Postgres also has a very nice “do and donts” wiki page.",0,24532152,0,"[24534465]","",0,"","[]",0 24533612,0,"comment","FridgeSeal","2020-09-20 12:28:14.000000000","OLAP databases can/are still Relational databases. The difference is that they’re optimised for different workloads.
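The "CUBE"/"ROLLUP" idea mentioned in the comment above can be sketched outside any database: a rollup is just the same fact rows aggregated at successively coarser grains, which is also what an aggregate fact table pre-computes. A minimal Python sketch, with invented sales data (the column names and numbers are illustrative only):

```python
from collections import defaultdict

# Toy fact rows: (region, product, amount) -- invented for illustration.
facts = [("EU", "a", 10), ("EU", "b", 20), ("US", "a", 5)]

def rollup(rows):
    """Aggregate at (region, product), (region,), and () grains,
    mimicking SQL's GROUP BY ROLLUP(region, product)."""
    out = defaultdict(int)
    for region, product, amount in rows:
        out[(region, product)] += amount  # finest grain
        out[(region, None)] += amount     # subtotal per region
        out[(None, None)] += amount       # grand total
    return dict(out)

agg = rollup(facts)
print(agg[("EU", "a")])   # 10  (detail row)
print(agg[("EU", None)])  # 30  (region subtotal)
print(agg[(None, None)])  # 35  (grand total)
```

This is exactly the "do the scheduling and aggregating yourself" path: the cube engine's value is maintaining these pre-aggregates for you as new facts arrive.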
SQLite/MySQL/Postgres/MSSQL etc are all OLTP databases whose primary workload is operations on single (or a few) rows.
OLAP databases like ClickHouse/DuckDB, Monet, Redshift, etc are optimised for operating on columns and performing operations like bulk aggregations, group-bys, pivots, etc on large subsets or whole tables.
If I was recording user purchases/transactions: SQLite.
If I was aggregating and analysing a batch of data on a machine: DuckDB.
I read an interesting engineering blog from Spotify(?) where, in their data pipeline, instead of passing around CSVs or rows of JSON, they passed around SQLite databases: DuckDB would probably be a good fit there.",0,24533493,0,"[24533692]","",0,"","[]",0 24538495,0,"comment","FridgeSeal","2020-09-21 00:03:37.000000000","To add to the other comment (albeit with different DBs because I haven't used DuckDB yet) comparing MSSQL and Clickhouse: similar hardware and same dataset, CH responds in ~30-60ms, MSSQL 300-400+ ms for simpler queries. 90-180ms vs several seconds, up to about 30s for more complex queries.
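The OLTP-vs-OLAP split in the comments above can be shown concretely with Python's stdlib sqlite3: the same table serves a keyed lookup (OLTP-style) and a whole-table aggregation (OLAP-style). The schema and data are invented for illustration; at scale, a column store like ClickHouse or DuckDB would run the second query far faster:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (user_id INTEGER, amount_cents INTEGER)")
con.executemany("INSERT INTO purchases VALUES (?, ?)",
                [(1, 999), (2, 500), (1, 350)])

# OLTP-style access: touch one or a few rows selected by key.
row = con.execute(
    "SELECT SUM(amount_cents) FROM purchases WHERE user_id = 1").fetchone()
print(row[0])  # 1349

# OLAP-style access: scan and aggregate the whole table; column-oriented
# engines are built for exactly this query shape.
total, n = con.execute(
    "SELECT SUM(amount_cents), COUNT(*) FROM purchases").fetchone()
print(total, n)  # 1849 3
```

Both shapes run fine on a tiny table; the architectural difference only shows up when the scan covers billions of rows.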
I could add more indices to MSSQL and do all sorts of query optimisations, but out of the box the OLAP wins hands down for that workload.",0,24533692,0,"[]","",0,"","[]",0 13643171,0,"story","zX41ZdbW","2017-02-14 12:56:34.000000000","",0,0,0,"[13643533,13643209,13650608,13650609]","https://www.percona.com/blog/2017/02/13/clickhouse-new-opensource-columnar-database/",28,"ClickHouse: New Open Source Columnar Database","[]",1 13643533,0,"comment","brudgers","2017-02-14 14:03:41.000000000","Clickhouse's house: https://clickhouse.yandex/",0,13643171,0,"[]","",0,"","[]",0 13644094,0,"story","kungfudoi","2017-02-14 15:27:10.000000000","",0,0,0,"[]","https://clickhouse.yandex",2,"ClickHouse – open-source distributed column-oriented DBMS","[]",0 20849084,0,"comment","zX41ZdbW","2019-08-31 22:07:27.000000000","Worth noting clickhouse-local: a full-featured ClickHouse SQL engine for files in CSV/TSV/JSONLines, whatever...
https://www.altinity.com/blog/2019/6/11/clickhouse-local-the...",0,20848581,0,"[20850368]","",0,"","[]",0 20855156,0,"comment","giancarlostoro","2019-09-01 21:44:04.000000000","Nice, although ClickHouse seems to be a Yandex project. Not sure my boss would appreciate ClickHouse. I wonder how this compares to ElasticSearch, though? 8GB of RAM is not something I can sell my manager.",0,20853335,0,"[20855234]","",0,"","[]",0 20855234,0,"comment","PeterZaitsev","2019-09-01 21:54:21.000000000","Check out Altinity - this is a US-based ClickHouse vendor, if your boss needs commercial support and fewer ties to Russia",0,20855156,0,"[]","",0,"","[]",0 27673840,0,"comment","marvinblum","2021-06-29 09:59:03.000000000","We're hosting Postgres on HashiCorp Nomad and ClickHouse on a separate VM for Pirsch [0]. The Postgres db is only a few kb (maybe it's a few mb now, I haven't checked in a while), as it is only used for user accounts, settings, and some configuration. It doesn't do much so it's doing OK in the cluster using a host volume on one of the machines. ClickHouse uses more storage (don't know how much right now, but it should be less than 100mb) and resources and therefore lives on its own VM.
The main reason we self-host is privacy and cost. Postgres costs almost nothing, because it's part of the cluster we require anyway (also self-hosted), and ClickHouse can be scaled as needed. Hetzner has some really cheap VMs; our whole setup, including the cluster, costs about 45€ a month.
[0] https://pirsch.io/",0,27671376,0,"[]","",0,"","[]",0 27677852,0,"story","jtsymonds","2021-06-29 16:04:58.000000000","",0,0,0,"[]","https://altinity.com/blog/clickhouse-object-storage-performance-minio-vs-aws-s3",4,"ClickHouse Object Storage Performance Benchmarking: MinIO vs. AWS S3","[]",0 27680170,0,"comment","dengolius","2021-06-29 18:54:32.000000000","Yes. We used Mysql setups, Percona Xtradb cluster and Clickhouse servers and it's ok to run it by myself cuz we work on bare metal servers.",0,27671376,0,"[]","",0,"","[]",0 24562520,0,"story","daniel_levine","2020-09-23 03:04:57.000000000","",0,0,0,"[]","https://tech.ebayinc.com/engineering/ou-online-analytical-processing/",6,"How eBay moved their OLAP to ClickHouse on Kubernetes","[]",0 24572288,0,"story","harporoeder","2020-09-23 21:28:54.000000000","",0,0,0,"[]","https://altinity.com/blog/clickhouse-materialized-views-illuminated-part-1",1,"ClickHouse Materialized Views Illuminated","[]",0 24577480,0,"comment","samcolvin","2020-09-24 11:22:31.000000000","Is it compatible with clickhouse?",0,24577239,0,"[]","",0,"","[]",0 24581748,0,"comment","yamrzou","2020-09-24 18:21:45.000000000","Another side question, please.
How does TimescaleDB compare to Clickhouse? Since you mentioned compression, I've been using Clickhouse to store bitemporal data, and have been amazed by its speed and compression levels. Unfortunately it's lacking in terms of relational modeling.
Would I get the best of two worlds with TimescaleDB? What are the tradeoffs?",0,24581066,0,"[24582018]","",0,"","[]",0 24582018,0,"comment","lima","2020-09-24 18:42:13.000000000","Compression and aggregation performance in TimescaleDB is much worse than ClickHouse - it represents a different tradeoff, you trade relational features for raw speed.
TimescaleDB is basically (very) fancy PostgreSQL sharding and row-oriented, while ClickHouse is a column store.
Depending on what you need to do, ClickHouse dictionaries and JOINs might be good enough.",0,24581748,0,"[24582408,24587497]","",0,"","[]",0 24582408,0,"comment","akulkarni","2020-09-24 19:15:15.000000000","> Compression and aggregation performance in TimescaleDB is much worse than ClickHouse
You are likely basing this off on an old version of TimescaleDB.
For the last year TimescaleDB has included native compression (in part by storing data in a columnar format):
https://blog.timescale.com/blog/building-columnar-compressio...
TimescaleDB now implements delta-delta, Gorilla, and other best-in-class compression algorithms:
https://blog.timescale.com/blog/time-series-compression-algo...
This has yielded 94%+ compression, which should make TimescaleDB and Clickhouse fairly similar in storage compression.",0,24582018,0,"[]","",0,"","[]",0 27698750,0,"story","jtsymonds","2021-07-01 13:17:30.000000000","",0,0,0,"[]","https://altinity.com/blog/clickhouse-object-storage-performance-minio-vs-aws-s3",2,"AWS vs. MinIO – Benchmarking ClickHouse with the NYC Taxi Dataset","[]",0 27703321,0,"comment","ajzo90","2021-07-01 19:19:43.000000000","Data and backend engineer | infobaleen.com | remote | sweden | go | golang As a data engineer in Infobaleen, you will design and develop data systems and machine learning pipelines. The data-centric platform consists of a data ingestion database, a machine-learning engine, and a dashboard engine. You will be responsible for helping our customer success team master all components and make crucial decisions when deciding which requested features we should prioritize. In addition, you will provide technical leadership and a deep understanding of data modeling. We work remotely from Stockholm, Gothenburg, Umeå, Piteå, and Berlin.
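Delta-delta (double-delta) encoding, one of the compression techniques mentioned above, works because time-series timestamps are usually almost evenly spaced: after differencing twice, most values are zero or tiny, which the subsequent bit-packing stage then shrinks dramatically. A minimal sketch of the integer encoding step only (no bit packing), with invented timestamps:

```python
def delta_delta_encode(timestamps):
    """Store the first value, the first delta, then deltas-of-deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dod = [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0]] + dod

def delta_delta_decode(encoded):
    """Invert the encoding by re-accumulating deltas."""
    if len(encoded) < 2:
        return list(encoded)
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out

# Regularly spaced timestamps collapse to almost all zeros.
ts = [1000, 1010, 1020, 1030, 1041]
enc = delta_delta_encode(ts)
print(enc)                            # [1000, 10, 0, 0, 1]
print(delta_delta_decode(enc) == ts)  # True
```

Production implementations (Gorilla, and the variants in TimescaleDB and ClickHouse) then encode those mostly-zero values in a handful of bits each, which is where the large compression ratios come from.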
We are a tech-agnostic company and have built our service using a wide range of technologies. The core application is built in Go. Supporting tools depend on Rust, Python, Tensorflow, and ClickHouse. We orchestrate and deploy everything with Docker and Kubernetes on Google Cloud Platform
christian @ [our-domain] https://infobaleen.com/jobs/data-and-backend-engineer",0,27699704,0,"[]","",0,"","[]",0 24587497,0,"comment","ants_a","2020-09-25 07:36:48.000000000","There is no technical tradeoff here. PostgreSQL could add a batched, vectorized and JITed execution engine, and ClickHouse could add "relational features". Either one would of course be a significant engineering project, but there is no fundamental breakthrough required. A small matter of programming as they say.",0,24582018,0,"[]","",0,"","[]",0 13693359,0,"comment","LOLOLOLO1","2017-02-21 04:51:36.000000000","Spark is not just written in Scala, they shared the same philosophy of underthought design.
Luckily, there's little reason to use Spark after Yandex released ClickHouse: it compresses data better, it is much faster, it doesn't need the crazy infrastructure of low-quality Apache code, and it can be used in real time, which is really huge.",0,13644589,0,"[]","",0,"","[]",0 24599307,0,"story","arvindkumarc","2020-09-26 14:58:48.000000000","",0,0,0,"[24600121,24599308]","https://github.com/delium/clickhouse-migrator",29,"Clickhouse DB Migration Framework","[]",5 17299420,0,"comment","kenshaw","2018-06-13 00:59:48.000000000","Just released the latest version of usql, which fixes issues with syntax highlighting, adds initial support for Cassandra databases via CQL, among other changes.
If you've not seen usql before, it's a universal command-line client for SQL databases (and now also Cassandra), modeled on psql. usql makes it easy to work from the command-line, in a simple and consistent way across any database and on any platform (Windows, macOS, Linux). usql is written in Go, and provides things like syntax highlighting and compatibility with databases other than PostgreSQL (see below for the list of the supported databases).
Progress is moving at a decent clip towards a v1.0, which I expect to include 100% native Go support for Oracle databases, tab-completion, and full compatibility with psql's \pset (and other output formatting) commands.
usql supports the following databases:
Microsoft SQL Server
MySQL
PostgreSQL
SQLite3
Oracle
Apache Avatica
Cassandra
ClickHouse
Couchbase
Cznic QL
Firebird SQL
Microsoft ADODB (Windows only)
ODBC
Presto
SAP HANA
VoltDB
Hope others in the community can make use of usql. Glad to answer any questions if anyone has any.",0,17299356,0,"[17300297,17306492,17302115,17306315]","",0,"","[]",0
24604779,0,"story","mooreds","2020-09-27 06:56:19.000000000","",0,0,0,"[]","https://altinity.com/blog/introduction-to-clickhouse-backups-and-clickhouse-backup",2,"Introduction to ClickHouse Backups and clickhouse-backup","[]",0
27722514,0,"story","vishnunair","2021-07-03 16:00:06.000000000","",1,0,0,"[]","https://github.com/vishnudxb/clickhouse-db-cluster",1,"A GitHub action to run ClickHouse db with 2 shards / 2 replica with ZooKeeper","[]",0
24639184,0,"comment","jgrahamc","2020-09-30 14:42:08.000000000","Frontend: React. Serverless: Cloudflare Workers and Workers KV.
Data: ClickHouse",0,24639164,0,"[24640441,24639268]","",0,"","[]",0 17334911,0,"story","setra","2018-06-17 23:08:25.000000000","",0,0,0,"[]","https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7",1,"Comparison of OLAP Systems for Big Data: ClickHouse, Druid and Pinot","[]",0 17339835,0,"comment","ryanworl","2018-06-18 17:07:31.000000000","They have since moved off that setup and now use ClickHouse, according to later blog posts and conference talks.
https://blog.cloudflare.com/http-analytics-for-6m-requests-p...",0,17339651,0,"[]","",0,"","[]",0 17339851,0,"comment","bzillins","2018-06-18 17:09:26.000000000","In a follow up blog post https://blog.cloudflare.com/http-analytics-for-6m-requests-p... Cloudflare released the following image https://blog.cloudflare.com/content/images/2018/03/Old-syste... with 'single instance PostgreSQL database (a.k.a. RollupDB), accepted aggregates from Zoneagg consumers and wrote them into temporary tables per partition per minute' as the description for the single Postgres instance depicted. The aggregation system described in the Citus blog distributes the aggregation work over multiple Postgres instances.",0,17339651,0,"[]","",0,"","[]",0 17342406,0,"comment","samaysharma","2018-06-18 23:16:06.000000000","(Samay from Citus)
This blog post is quite different from Cloudflare's old pipeline. I can think of three fundamental differences.
(1) Cloudflare's pipeline used an old version of Citus. On that version, Citus was still a proprietary database that forked from Postgres. Since then, Citus became an extension of Postgres (not a fork) and went open source.
(2) Features highlighted in this blog post weren't available in the old version of Citus. Example features are hash distribution, distributed roll-ups, and sharding by site_id and partitioning by time.
These features help in two ways: (a) They simplify the overall pipeline quite a bit. You don't need to rely on a single Postgres node and Kafka queues to do roll-ups. (b) Your end-to-end price-performance improves by 4-5x. You can ingest larger volumes of data through sharding + time partitioning. You can also do parallel roll-ups inside the database. Neither of these was possible back in the day.
(3) Similarly, Cloudflare outlined issues around single point of failures around Postgres / Citus: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...
Postgres streaming replication and Citus' HA solutions have come a long way in the past three years. These points shouldn't be an issue in newer versions.
Overall, the database space is an exciting market, where products continuously evolve. This blog post presents a reference architecture that uses new features in Citus - that have been added in the past three years and that our customers are using today.",0,17339651,0,"[]","",0,"","[]",0 27760627,0,"story","elenasamuylova","2021-07-07 13:02:26.000000000","Hi HN, we are Elena and Emeli, co-founders of Evidently AI http://evidentlyai.com. We're building monitoring for machine learning models in production. The tool is open source and available on GitHub: https://github.com/evidentlyai/evidently. You can use it locally in a Jupyter notebook or in a Bash shell. There’s a video showing how it works in Jupyter here: https://www.youtube.com/watch?v=NPtTKYxm524.
Machine learning models can stop working as expected, often for non-obvious reasons. If this happens to a marketing personalization model, you might spam your customers by mistake. If this happens to credit scoring models, you might face legal and reputational risks. And so on. To catch issues with the model, it is not enough to just look at service metrics like latency. You have to track data quality, data drift (did the inputs change too much?), underperforming segments (does the model fail only for users in a certain region?), model metrics (accuracy, ROC AUC, mean error, etc.), and so on.
Emeli and I have been friends for many years. We first met when we both worked at Yandex (the company behind CatBoost and ClickHouse). We worked on creating ML systems for large enterprises. We then co-founded a startup focused on ML for manufacturing. Overall we've worked on more than 50 real-world ML projects, from e-commerce recommendations to steel production optimization. We faced the monitoring problem on our own when we put models in production and had to create and build custom dashboards. Emeli is also an ML instructor on Coursera (co-author of the most popular ML course in Russian) and a number of offline courses. She knows first-hand how many data scientists try to repeatedly implement the same things over and over. There is no reason why everyone should have to build their own version of something like drift detection.
We spent a couple of months talking to ML teams from different industries. We learned that there are no good, standard solutions for model monitoring. Some told us horror stories about broken models that went unnoticed and led to $100K+ in losses. Others showed us home-grown dashboards and complained that they are hard to maintain. Some said they simply have a recurring task to look at the logs once per month, and often catch issues late. It is surprising how often models are not monitored until the first failure: many teams told us they only started to think about monitoring after their first breakdown. Some never do, and failures go undetected.
If you want to calculate a couple of performance metrics on top of your data, it is easy to do ad hoc. But if you want to have stable visibility into different models, you need to consider edge cases, choose the right statistical tests and implement them, design visuals, define thresholds for alerts etc. That is a harder problem that combines statistics and engineering. Beyond that, monitoring often involves sharing the results with different teams: from domain experts to developers. In practice, data scientists often end up sharing screenshots of their plots and sending files here and there. Building a maintainable software system that supports these workflows is a project in itself, and machine learning teams usually do not have time or resources for it.
Since there is no standard open-source solution, we decided to build one. We want to automate as much as possible to help people focus on the modeling work that matters, not boilerplate code.
Our main tool is an open-source Python library that generates interactive reports on ML model performance. To get it, you need to provide the model logs (input features, prediction, and ground truth if available) and reference data (usually from training). Then you choose the report type and we generate a set of dashboards. We have pre-built several reports to detect things like data drift, prediction drift, visualize performance metrics, and help understand where the model makes errors. We can display these in a Jupyter notebook or HTML. We can also generate a JSON profile instead of a report. You can then integrate this output with any external tool (like Grafana) and build a workflow you want to trigger retraining or alerts.
Under the hood, we perform the needed calculations (e.g. Kolmogorov Smirnov or Chi-Squared test to detect drift) and generate multiple interactive tables and plots (using Plotly on the backend). Right now it works with tabular data only. In the future, we plan to add more data types, reports and make it easier to customize metrics. Our goal is to make it dead easy to understand all aspects of model performance and monitor them.
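The Kolmogorov-Smirnov check mentioned above can be illustrated with a standalone two-sample KS statistic in plain Python (a sketch only, not Evidently's actual code; all names and thresholds are illustrative):

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the empirical CDFs of the two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples in merge order, tracking the ECDF gap.
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

random.seed(0)
reference = [random.gauss(0, 1) for _ in range(1000)]  # training-time feature values
no_drift  = [random.gauss(0, 1) for _ in range(1000)]  # production, same distribution
drifted   = [random.gauss(1, 1) for _ in range(1000)]  # production, shifted mean

print(ks_statistic(reference, no_drift))  # small statistic: no drift flagged
print(ks_statistic(reference, drifted))   # large statistic: flag drift
```

In practice you would convert the statistic into a p-value (or compare against a critical value that depends on sample sizes) before alerting, and run one such test per feature.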
We differ from other approaches in a couple of ways. There are end-to-end ML platforms on the market that include monitoring features. These work for teams who are ready to trade flexibility in order to have an all-in-one tool. But most teams we spoke to have custom needs and prefer to build their own platform from open components. We want to create a tool that does one thing well and is easy to integrate with whatever stack you use. There are also some proprietary ML monitoring solutions on the market, but we believe that tools like these should be open, transparent, and available for self-hosting. That is why we are building it as open source.
We launched under Apache 2.0 license so that everyone can use the tool. For now, our focus is to get adoption for the open-source project. We don’t plan to charge individual users or small teams. We believe that the open-source project should remain open and be highly valuable. Later on, we plan to make money by providing a hosted cloud version for teams that do not want to run it themselves. We're also considering an open-core business model where we charge for features that large companies care about like single sign-on, security and audits.
If you work in tech companies, you might think that many ML infra problems are already solved. But in more traditional industries like manufacturing, retail, finance, etc., ML is just hitting adoption. Their ML needs and environment are often very different due to legacy IT systems, regulations, and types of use cases they work with. Now that many move from ML proof-of-concept projects to production, they will need the tools to help run the models reliably.
We are super excited to share this early release, and we’d love if you could give it a try: https://github.com/evidentlyai/evidently. If you run models in production - let us know how you monitor them and if anything is missing. If you need some help to test the tool - happy to chat! We want to build this open-source project together with the community, so let us know if you have any thoughts or feedback.",0,0,0,"[27761194,27764697,27769787,27765167,27762103,27760694,27778496,27765758,27768352]","",111,"Launch HN: Evidently AI (YC S21) – Track and Debug ML Models in Production","[]",17 27761434,0,"comment","hodgesrm","2021-07-07 14:21:49.000000000","> No, we don't need olap cubes. But we do need some type of rigor around analytic data. Otherwise, why go to all the trouble to collect it, count it, and predict from it if it may be wrong with no measure of that uncertainty?
Modern data warehouses don't need to build cubes for performance reasons. Low latency data warehouses like ClickHouse or Druid can aggregate directly off source data. The biggest driver for modeling is allowing non-coders to access data and perform their own analyses. I don't see that problem ever going away. Cube modeling with dimensions and measures solves it well.",0,27740150,0,"[]","",0,"","[]",0 24643541,0,"story","daniel_levine","2020-09-30 20:39:55.000000000","",0,0,0,"[]","http://brandonharris.io/redshift-clickhouse-time-series/",2,"ClickHouse, Redshift and 2.5B Rows of Time Series Data – Brandonharris.io","[]",0 24653865,0,"comment","lykr0n","2020-10-01 17:38:12.000000000","Location: Seattle, WA
Remote: I'd like not to
Willing to relocate: No, but the right offer makes anything possible.
Technologies: Linux (CentOS), Bare Metal & Cloud Infrastructure, MySQL, PostgreSQL, Hashicorp Suite (Vault, Consul, Nomad), Docker, Clickhouse, Kafka, Rust, Python, Puppet, SaltStack, Bash, Some Golang, Prometheus, Grafana, Datadog, and more.
Resume: On Request
Email: hn@lykron.mm.st
I'm looking for a position where I can learn and grow, and help others do the same. I know I might not have all the skills needed, but I'm the kind of person who will accomplish a task and figure out the best solution possible given the constraints. To be frank, if you're looking for someone who would manage your AWS or GCP stack, I don't think I'm your guy.
Site Reliability Engineer (SRE) and/or Systems Engineer and/or Infrastructure Engineer 4 AAoE",0,24651637,0,"[]","",0,"","[]",0 27780996,0,"story","matesz","2021-07-09 06:31:57.000000000","",0,0,0,"[]","https://presentations.clickhouse.tech/meetup24/2.%20SQLGraph%20--%20When%20ClickHouse%20marries%20graph%20processing%20Amoisbird.pdf",2,"SQLGraph: When ClickHouse marries graph processing [pdf]","[]",0 27783714,0,"story","saleiva","2021-07-09 13:57:31.000000000","",1,0,0,"[]","https://blog.tinybird.co/2021/07/09/projections/",7,"Experimental ClickHouse: Projections","[]",0 27788434,0,"comment","yuppie_scum","2021-07-09 21:00:11.000000000","You probably want Clickhouse",0,27784100,0,"[]","",0,"","[]",0 27797343,0,"comment","zX41ZdbW","2021-07-10 23:57:51.000000000","This is how it is done in ClickHouse. It has Nullable(T) type. The functions of non-Nullable types will return non-Nullable types (except some specific functions).
https://clickhouse.tech/docs/en/sql-reference/data-types/nul...",0,27791738,0,"[]","",0,"","[]",0 27797359,0,"comment","zX41ZdbW","2021-07-11 00:01:06.000000000","This is also fixed in ClickHouse - you can set and reuse aliases in any expressions in the query.",0,27792169,0,"[]","",0,"","[]",0 27810290,0,"comment","mescudi","2021-07-12 13:54:47.000000000","Full-time/Part-time
Location: Kazakhstan
Remote: Yes, English/Russian
Willing to relocate: Yes
Technologies: kubernetes (k8s), golang, aws, python, docker, helm, linux, terraform, ansible, clickhouse, ELK, cockroachdb, aerospike
Resume: https://tinyurl.com/2rpdpzm9
Email: nurtas977@gmail.com
Eager to work in a startup culture, but open to interviews from big companies. Currently a DevOps Engineer, sometimes taking on SE and SRE tasks.
Gallup's CliftonStrength top-5: Restorative, Competition, Deliberative, Learner, Achiever",0,27699702,0,"[]","",0,"","[]",0 27810844,0,"story","carlosap","2021-07-12 14:50:48.000000000","",0,0,0,"[27812109]","https://guides.tinybird.co/guide/postgres-to-clickhouse",2,"From Postgres to ClickHouse","[]",1 27812109,0,"comment","xoelop","2021-07-12 16:39:02.000000000","Hey HN,
Inspired by this great post on how to do Pandas stuff with Postgres, I wrote this guide on how to do all that on ClickHouse.
Hope you like it! Xoel",0,27810844,0,"[]","",0,"","[]",0 24696149,0,"comment","markosaric","2020-10-06 09:50:27.000000000","Hello HN!
We started developing Plausible early last year, launched our SaaS business and you can now self-host Plausible on your server too! The project is battle-tested running on more than 5,000 sites and we’ve counted 180 million page views in the last three months.
Plausible is a standard Elixir/Phoenix application backed by a PostgreSQL database for general data and a Clickhouse database for stats. On the frontend we use TailwindCSS for styling and React to make the dashboard interactive.
The script is lightweight at 0.7 KB. Cookies are not used and no personal data is collected. There’s no cross-site or cross-device tracking either.
We build everything in the open with a public roadmap so would love to hear your feedback and feature requests. Thank you!",0,24696145,0,"[24699471,24697795,24696760,24700565,24701112]","",0,"","[]",0 24697272,0,"comment","sradman","2020-10-06 13:09:06.000000000","Plausible Analytics [1] is an MIT Licensed alternative to Google Analytics. It is hosted at plausible.io but can also be self-hosted. The app server is written in Phoenix/Elixir. The self-hosted version is distributed as a Docker image. It is configured [2] with a PostgreSQL server for user data, a Clickhouse server for analytics data, and an SMTP server for transactional email.
EDIT: according to markosaric, the data policy restricts the granularity of Active Users [3] to daily statistics for privacy reasons so the common Monthly Active Users (MAU) stat is not available.
[1] https://github.com/plausible/analytics
[2] https://docs.plausible.io/self-hosting-configuration
[3] https://en.wikipedia.org/wiki/Active_users",0,24696145,0,"[]","",0,"","[]",0 21013986,0,"story","ngaut","2019-09-19 07:06:15.000000000","",0,0,0,"[]","https://github.com/Vertamedia/clickhouse-grafana",1,"New Release for Grafana ClickHouse Plugin","[]",0 24725785,0,"comment","hodgesrm","2020-10-09 01:26:23.000000000","Which existing data warehouses do you mean? ClickHouse and Druid can return answers in milliseconds. Data warehouses are starting to optimize for low-latency response.
Disclaimer: My company supports ClickHouse.",0,24720583,0,"[]","",0,"","[]",0 27851430,0,"comment","peferron","2021-07-15 23:44:08.000000000","Once you start having multiple data sources, business logic implemented in application code might be easier to reuse than business logic implemented in one of the data sources. For example: start with Postgres, then add ClickHouse to the mix. It might be easier to let the application handle access to both DBs, rather than write RLS policies in Postgres and then try to apply them to ClickHouse.",0,27849889,0,"[]","",0,"","[]",0 17440849,0,"comment","kockic","2018-07-02 11:45:01.000000000","I see that most of the `don't use mongodb for analytics` are being down-voted, however I tend to agree with them. For all the people out there looking for the database for analytics please check Clickhouse from Yandex, it's easy to get started, amazingly fast and open source.
Disclaimer: I am not affiliated with Yandex in any way, just a happy customer",0,17438516,0,"[]","",0,"","[]",0 17444763,0,"story","PeterZaitsev","2018-07-02 19:43:40.000000000","",0,0,0,"[17444786]","https://www.altinity.com/blog/2018/6/30/realtime-mysql-clickhouse-replication-in-practice",2,"Realtime replication from MySQL to ClickHouse","[]",1 17444786,0,"comment","PeterZaitsev","2018-07-02 19:46:57.000000000","I am a big fan of ClickHouse to supplement MySQL for some analytics needs. While ClickHouse does not have full SQL support, for many simple queries it is 100 times faster than MySQL on a single system, and with linear multi-server scalability you can get 1000x performance improvements.
We started using it recently and it's been amazing.",0,13844461,0,"[13854226]","",0,"","[]",0 21049677,0,"comment","missosoup","2019-09-23 15:17:00.000000000","> And so for data processing/streaming/batch [...] serverless actually does work out pretty well.
This is my field of expertise. Serverless in the sense of lambda/functions is not usable for serious analytics pipelines due to the max allowed image size being smaller than the smallest NLP models or even lightweight analytics python distributions. You can't use lambda on the ETL side and you can't use lambda on the query side unless your queries are trivial enough to be piped straight through to the underlying store. And if your workload is trivial, you should just use clickhouse or straight up postgres because it vastly outperforms serverless stacks in cost and performance[1]
For non-trivial pipelines, tools like spark and dask dominate. And it just so happens that both have plugins to provision their own resources through kubernetes instead of messing around with serverless/paas noise.
And PaaS products, well.
https://weekly-geekly.github.io/articles/433346/index.html
>One table instead of 90
>Service requests are executed in milliseconds
>The cost has decreased by half
>Easy removal of duplicate events
Please explain.
[1] https://blog.cloudflare.com/http-analytics-for-6m-requests-p...
IaaS is the peak value proposition of cloud vendors. Serverless/PaaS are grossly overpriced products aimed at non-technical audiences and are mostly snake oil. Change my mind.",0,21049488,0,"[21049854,21049722,21049996,21051086]","",0,"","[]",0 21049722,0,"comment","cthalupa","2019-09-23 15:21:36.000000000","That article appears to be discussing a migration from Redshift to Clickhouse. Redshift is a managed data warehouse, not a serverless solution in the same vein as Lambda.
I don't understand the point you are trying to make.
Edit: The comment I am replying to was originally just 'Please explain' and a link to the article in question, and contained no other context or details.",0,21049677,0,"[21049814]","",0,"","[]",0 21051086,0,"comment","RhodesianHunter","2019-09-23 17:22:54.000000000","Clickhouse is a really strange thing to compare to Lambda here. One is a method of performing small compute jobs, the other is an analytics database. They serve vastly different functions and saying "Clickhouse or postgres is cheaper and more performant than lambdas" is nonsensical.",0,21049677,0,"[]","",0,"","[]",0 27873640,0,"comment","hodgesrm","2021-07-18 15:18:48.000000000","You can query CSV directly without a DBMS backend using clickhouse-local. [1] It's often used to clean data exactly as this article describes. You can also load CSV very easily into tables.
Superintendent.app seems to bring automatic schema definition to the table. We've discussed adding this to ClickHouse. Would be a great PR if anyone is interested.
[1] https://altinity.com/blog/2019/6/11/clickhouse-local-the-pow...
(This note brought to you by the ClickHouse evangelization task force.)",0,27871574,0,"[27875938,27877687]","",0,"","[]",0 27874290,0,"comment","pachico","2021-07-18 16:37:04.000000000","I've been dealing with lots of data and SQLite and my takeaways are:
- the language matters very little, since the bottleneck will most likely be the way you insert data
- prepared statements are indeed useful, but performance didn't change much once I used long transactions and committed every few thousand rows
- having lots of rows in your table is fine, but certain queries, like aggregations over many rows, are not what SQLite is great at.
- ClickHouse can easily ingest that and more on a laptop, without even needing a scripting language.",0,27872575,0,"[]","",0,"","[]",0 27875938,0,"comment","btown","2021-07-18 19:39:40.000000000","Interested, but writing and compiling a large new C++ codebase when I haven't written C++ in years isn't something I quite have time for at the moment :)
For anyone else with more time to roll up their sleeves, https://github.com/ClickHouse/ClickHouse/blob/b0ddc4fb30f1a0... would be the place to split the header row, and perhaps use some heuristics on the first data row to identify datatypes!",0,27873640,0,"[]","",0,"","[]",0 24759684,0,"comment","supz_k","2020-10-12 21:34:52.000000000","Hi. I'm surprised that you have run this platform for 8 years. That's incredible!
There's one question I want to ask you. Do you have any experience handling traffic in the millions? I saw that you are using MySQL for storage. I assume each pageview is stored in a row. In my experience, MySQL is very slow at aggregate queries on datasets like that (we recently moved to ClickHouse because of this). I'd like to know your experiences handling millions of rows in MySQL, if you have any :)",0,24746921,0,"[24759767]","",0,"","[]",0 24763653,0,"comment","supz_k","2020-10-13 09:04:07.000000000","Hi,
Thanks for the answer. In our MySQL database, we wanted to count the number of pageviews for each website. So, the query was like
SELECT COUNT(id) FROM page_views WHERE website_id = x AND created_at > (start of the month)
For some websites, there are 20 million+ pageviews each month. MySQL goes through all of those 20 million rows (according to EXPLAIN) even with an index or composite index (it took a few seconds). So, we had to pre-calculate the number of pageviews each hour and show that to the user. Things got worse when we had to show analytics: there we had to group by month or day, so it took even more time.
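The hourly pre-calculation described above can be sketched with SQLite standing in for MySQL (table and column names are made up for illustration): each insert also bumps an hourly rollup row, so a monthly count reads a few hundred rollup rows instead of scanning millions of raw ones.

```python
import sqlite3
from datetime import datetime

# Hypothetical schema illustrating hourly pre-aggregation of pageviews.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE page_views (website_id INT, created_at TEXT);
CREATE TABLE hourly_counts (website_id INT, hour TEXT, n INT,
                            PRIMARY KEY (website_id, hour));
""")

def record_view(website_id, ts):
    hour = ts.strftime("%Y-%m-%d %H:00")
    db.execute("INSERT INTO page_views VALUES (?, ?)", (website_id, ts.isoformat()))
    # Upsert the rollup row so monthly counts never scan the raw table.
    db.execute("""INSERT INTO hourly_counts VALUES (?, ?, 1)
                  ON CONFLICT(website_id, hour) DO UPDATE SET n = n + 1""",
               (website_id, hour))

# Simulate 120 pageviews spread over two hours.
for minute in range(120):
    record_view(1, datetime(2020, 10, 1, minute // 60, minute % 60))

# The monthly total sums at most ~720 rollup rows per site.
total = db.execute("""SELECT SUM(n) FROM hourly_counts
                      WHERE website_id = 1 AND hour >= '2020-10-01 00:00'""").fetchone()[0]
print(total)  # 120
```

An analytical database makes this rollup unnecessary for many queries, but the same pattern is what database triggers or a cron job typically implement on the OLTP side.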
That's when we started looking for a solution, and I learned that there's something called "Analytical Databases", which are designed for exactly these purposes. How they work is completely different from MySQL.
And here's a benchmark of MySQL vs ClickHouse: http://mafiree.com/blogs.php?ref=Benchmark-::-MySQL-Vs-Colum...
MySQL does a pretty good job for many things. However, when it comes to analytics, I think it's better to use an analytical database.
As each of your users has its own database, there won't be any issues at all. In our case, all of our clients' pageviews are stored in one table, which grows at 35m records per month.
We're also new to this analytical databases thing. I'd like to know your thoughts. :)",0,24759767,0,"[24766619]","",0,"","[]",0 27878541,0,"comment","jinmingjian","2021-07-19 02:08:09.000000000","I recommend one ClickHouse compatible OLAP database project in Rust: [TensorBase](https://github.com/tensorbase/tensorbase/) for anyone who likes working with AP-side DBs on Rust.
FYI, here is recent information and progress on TensorBase:
1. TensorBase (TB, for short) is not a reimplementation or clone of ClickHouse (CH, for short). TensorBase just supports the ClickHouse wire protocol on its server side.
2. TB's in-Rust CH-compatible server side is faster than CH's in C++. TB enables *F4* in the critical write path: Copy-Free, Lock-Free, Async-Free, Dyn-Free (no dynamic object dispatching).
The result of TB's architectural performance: the untuned write throughput of TB is ~2x that of CH in the Rust driver bench, or ~70% faster using CH's own ```clickhouse-client``` command. Use [this parallel script](https://github.com/tensorbase/tools/blob/main/import_csv_to_...) to try it yourself!
3. Thanks to the Arrow-DataFusion, TensorBase has supported good parts of TPC-H. [Untuned TPC-H Q1 result here](https://github.com/tensorbase/benchmarks/blob/main/tpch.md).
4. In simple (no-groupby) aggregation, TensorBase is several times faster than ClickHouse. [Benchmark here](https://github.com/tensorbase/benchmarks/blob/main/quick.md).
5. For complex groupby aggregations, we recently helped boost the speed of the TB engine to the same level as ClickHouse (not released, but coming soon).
6. TB will soon support the MySQL wire protocol, distributed queries, adaptive columnar storage optimization... Watch [issues here](https://github.com/tensorbase/tensorbase/issues)
Finally, it is really great to build an AP database in Rust. You are welcome to join!
Disclaimer: I am the author of TensorBase.",0,27874992,0,"[]","",0,"","[]",0 24766619,0,"comment","XCSme","2020-10-13 15:39:28.000000000","Thanks for the extra details!
I built userTrack mostly as a cheaper alternative for smaller businesses, so they can still have access to good tools/data without paying enterprise prices, so the goal was never to support 1M+ monthly sessions, thus I never spent too much time looking into heavy-traffic performance or scaling.
> For some websites, there are 20 million+ pageviews for each month. MYSQL goes through all of those 20 million rows (according to EXPLAIN) even with index or composite index (It took a few seconds). So, we had to pre-calculate the number of pageviews each hour and show it to the user.
Yes, count usually "goes" through all the rows when using more complex conditions. The way to improve this is usually, as you also did, to pre-calculate the counts using database triggers (whenever a new row is inserted, the trigger updates the total count).
I am not familiar with Clickhouse, but I assume that to be so fast it provides a lot fewer features/options to store and manipulate data. How hard was the transition to Clickhouse? Were you able to easily convert the DB schema and all queries?",0,24763653,0,"[]","",0,"","[]",0 21063427,0,"comment","minitoar","2019-09-24 18:56:07.000000000","How would you categorize something like ClickHouse or Interana or Druid? Columnar I guess, but then the description of Column-family in the article doesn't match up with my experience of how those work.",0,21060866,0,"[21064194]","",0,"","[]",0 21066396,0,"comment","barrkel","2019-09-25 00:43:18.000000000","There's nothing intrinsic about not supporting joins in a columnar store; it's just that you lose a huge amount of the linear scanning performance if you have to do joins for each value. Most columnar stores I've used (primarily Impala, SparkSQL and Clickhouse) all support joins, but they materialize one side of the join as an in-memory hash table, which limits the allowable size of the join, and is a cost multiplier for a distributed query. I believe per the docs that MemSQL can mix and match row-based with columnar more easily, but joins are always going to be really slow compared to the speed you can get from scanning the minimum number of columns to answer your question.",0,21064306,0,"[21068381,21067150]","",0,"","[]",0 21067561,0,"comment","shrumm","2019-09-25 04:19:30.000000000","ClickHouse is another favourite",0,21066579,0,"[21073734]","",0,"","[]",0 21068381,0,"comment","hodgesrm","2019-09-25 06:52:50.000000000","The ClickHouse team is working on merge joins which will supplement the currently supported in-memory hash join mechanism. It's not a panacea as you point out, especially on distributed tables.
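The in-memory hash join discussed in this thread — materialize one (small) side as a hash table, stream the other side past it — can be sketched in Python (purely illustrative; this is not ClickHouse's implementation, and the row/key names are made up):

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Materialize the (smaller) build side in memory, then stream the
    probe side past it; memory use is bounded by the build side only."""
    table = defaultdict(list)
    for row in build_rows:                 # build phase: fully in memory
        table[row[build_key]].append(row)
    for row in probe_rows:                 # probe phase: streamed
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

dims = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
facts = [{"dim_id": 1, "v": 10}, {"dim_id": 1, "v": 20}, {"dim_id": 3, "v": 30}]
print(list(hash_join(dims, facts, "id", "dim_id")))
# [{'id': 1, 'name': 'a', 'dim_id': 1, 'v': 10}, {'id': 1, 'name': 'a', 'dim_id': 1, 'v': 20}]
```

Note how only the build side bounds memory: the probe side can be arbitrarily large, which is why you put the dimension table, not the fact table, on the build side.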
That said it will help with a number of important use cases such as those that require joining a fact table against very large dimension tables.",0,21066396,0,"[]","",0,"","[]",0 21073734,0,"comment","einpoklum","2019-09-25 18:33:03.000000000","Clickhouse is a columnar system, yes, but is not a full-fledged DBMS. Specifically, I don't think it can join tables.",0,21067561,0,"[]","",0,"","[]",0 27910214,0,"comment","fredros","2021-07-21 18:36:37.000000000","> but on internet scale joins have miserable performance.
Please define internet scale. Joining 10s of billions of rows using clickhouse here.",0,27909580,0,"[27910226]","",0,"","[]",0 27910226,0,"comment","flowerlad","2021-07-21 18:38:04.000000000","Clickhouse is OLAP. We are talking OLTP.",0,27910214,0,"[]","",0,"","[]",0 24791657,0,"comment","machiaweliczny","2020-10-15 18:03:51.000000000","From my understanding (don't have much backend experience) you need those only for specific workloads. First learn difference between OLTP and OLAP. Traditional DBs are usually designed for OLTP, and new DBs are designed for OLAP and some for mega scale (Petabytes).
I recommend you learn: * ES - for text search * Clickhouse - simplest OLAP * Cassandra (Petabytes of data, columnar store) * Learn some about timeseries DBs (analytics) * Graph DBs
RabbitMQ, Kafka or Pulsar are used for message bus/queue implementations. Simple case: producing a message takes 1 time unit but processing it takes 5, so you want a kind of parallelism without coupling to specific hosts; you push to a queue and subscribe readers to it. Read the ZeroMQ docs on all the communication patterns to learn the typical cases.",0,24762734,0,"[]","",0,"","[]",0 17492903,0,"comment","gricardo99","2018-07-09 19:49:42.000000000","Yup.
>The pickup_ntaname column is stored as varchars by ClickHouse and kdb+, and as dictionary encoded single byte values by LocustDB.
But it would be trivial to convert to enum/sym type in kdb+. It's silly to query and group by strings.",0,17492571,0,"[17493689]","",0,"","[]",0 17493544,0,"comment","frankmcsherry","2018-07-09 21:25:04.000000000","This seems like a very unfair reading of what the author actually wrote:
> One note about the results for kdb+: The ingestion scripts I used for kdb+ partition/index the data on the year and passenger_count columns. This may give it a somewhat unfair advantage over ClickHouse and LocustDB on all queries that group or filter on these columns (queries 2, 3, 4, 5 and 7). I was going to figure out how to remove that partitioning and report those results as well, but didn’t manage before my self-imposed deadline.",0,17492571,0,"[]","",0,"","[]",0 17496209,0,"comment","geocar","2018-07-10 08:12:05.000000000","The first red flag is that Mark's benchmarks look very different for kdb, even though his ClickHouse times are similar to Clemens.
Looking over the queries, he made some... interesting changes that have him benchmarking oranges to apples.",0,17492571,0,"[]","",0,"","[]",0 13900769,0,"story","sply","2017-03-18 11:02:23.000000000","",0,0,0,"[]","https://www.percona.com/blog/2017/03/17/column-store-database-benchmarks-mariadb-columnstore-vs-clickhouse-vs-apache-spark/",4,"Column Store Benchmarks: MariaDB vs. Clickhouse vs. Apache Spark","[]",0 27917345,0,"story","charlieirish","2021-07-22 10:34:50.000000000","",0,0,0,"[]","https://tech.marksblogg.com/clickhouse-prometheus-grafana.html",8,"Monitor ClickHouse with Prometheus and Grafana","[]",0 13904810,0,"story","dkarapetyan","2017-03-19 02:27:24.000000000","",0,0,0,"[]","https://clickhouse.yandex/",8,"ClickHouse – open-source distributed column-oriented DBMS","[]",0 13910472,0,"comment","otterley","2017-03-19 23:30:22.000000000","What do you think of ClickHouse vis-a-vis Druid and Redshift?",0,13909968,0,"[13911349]","",0,"","[]",0 27934480,0,"comment","zepearl","2021-07-23 19:13:30.000000000","Thanks a lot!!! I really mean it!
This might change everything for me. I'm already using MariaDB+MyRocks and Clickhouse for some special/dedicated tasks, but I was really missing a good DB for normal/typical OLTP tasks.
So far I used (as mentioned above) MariaDB+TokuDB (but MariaDB's optimizer can be a bit crazy from time to time). I thought about PostgreSQL multiple times during the past months, but the lack of hints always made me take a step back from it => this addon seems to be exactly what I wished for.
Looks like my weekend will be all about PG - the last time I set it up its version number had a single digit => I cannot remember anything anymore...
Again, thanks for the hint :P",0,27927233,0,"[]","",0,"","[]",0 24816631,0,"comment","ing33k","2020-10-18 09:43:49.000000000","I run a replicated ClickHouse server setup; ClickHouse uses ZooKeeper to enable replication. The ZooKeeper instance was not replicated; it was a single node. The server on which ZooKeeper was running ran out of disk space and ClickHouse went into read-only mode. Luckily, no data was lost while this happened, because we use RabbitMQ to store the messages before they get written to the db. Thanks to RabbitMQ's ACK mechanism.",0,24813795,0,"[]","",0,"","[]",0 27939325,0,"comment","pritambaral","2021-07-24 08:51:24.000000000","That's exactly fair. MongoDB Inc. et al. love to take others' contributions along with their rights too. This applies to everyone who demands a copyright assignment or exemption. They don't "collaborate" with the public to build something for the public, they take public contributions to enrich only themselves.
Open Source allows us to build software greater than any single contributor. Any single contributor can go out of business and the software still has a chance of surviving. Any single contributor can decide to take their work a different direction and other people can still build what they want with it. We've seen this happen so many times, over and over, and the world is in a much better place because that happened.
Closed, source-available software vendors are the antithesis of that. They want others to participate and contribute, but they want to keep all the rights and benefits to themselves. If they go out of business, or just decide to drop development on a piece of software, the software dies. If they decide they don't want a certain feature in that software, you've got no recourse. That's them having the cake.
But if I offer them some code under the same license they hold everyone else to, using the same copyright rules they use against everyone else, they demand I sign over my rights and use a different license. Were MongoDB to accept a patch, under their own SSPL, under fair collaborative conditions, they'd have to actually abide by the intentionally-insane conditions of their SSPL. But that's impossible, so do they stop taking public contributions? Absolutely not. They take contributions, when accompanied by the contributor's rights, just fine. This is them eating it too.
This environment actively discourages collaboration. The parasitic demand for copyright assignment deters me from even touching the concept of improving their code. When I have a client with an insurmountable database problem that could be solved by going to the source code, I'm glad when it's a database that has an actually collaborative approach, like Postgres or ClickHouse. In the case of both those databases, I've been able to actually collaborate with the maintainers and improve the software. When a client has an SSPL version of Mongo? I can't even take a look at the source code.
That last client had actually already paid MongoDB Inc. to come in and have a look at the problem. 10K USD later, the consultant gives up and says, "I'm sorry, this will never work." At least the guy was honest, as opposed to the company, which in their report says, "We promise we can bring down your latencies to single-digit milliseconds. Just move to Atlas."
What can the client do? Nothing. What can I do? I have some ideas, if I could modify the source code of the DB to add some acceleration engines specifically for this client. But I can't. Not out of technical limitations, but legal. Because of a hostile license that was itself plagiarized (illegally) from the AGPL.
So the client moves to ClickHouse, and along the way ClickHouse gets some improvements.",0,27937361,0,"[27946626]","",0,"","[]",0 17524507,0,"comment","lossolo","2018-07-13 16:46:33.000000000","I am running a top-1000 site in one of the EU countries on one 4-core machine with 20-30% load. Around 1000 http/https reqs/s. Most of those requests do a couple of Postgres reqs (read and write) and a couple of Redis reqs.
Elasticsearch - for searching/recommendations
Redis - hot data (certain data is only kept in redis)
Postgres - for the rest of data
Clickhouse - analytics
Most of the system is written in Go. The whole system was tuned for performance from day one. As to latency, here is data from the last 21 million requests today:
p99: 17.37 ms
p95: 6.86 ms
avg: 2.37 ms",0,17523259,0,"[17525414,17524851]","",0,"","[]",0 27944640,0,"comment","pupdogg","2021-07-24 21:29:31.000000000","Sorry I don’t mean to hijack the original post but for performant insert and indexing (which I assume is for analysis), I’d recommend using Clickhouse or QuestDB",0,27944569,0,"[27944905]","",0,"","[]",0 27944905,0,"comment","wiredfool","2021-07-24 22:20:54.000000000","I’d second the rec for clickhouse.",0,27944640,0,"[27945102]","",0,"","[]",0 27945102,0,"comment","champtar","2021-07-24 23:03:34.000000000","+1 for clickhouse",0,27944905,0,"[]","",0,"","[]",0 24838903,0,"comment","caust1c","2020-10-20 15:58:07.000000000","Flink uses highwatermarks like google's dataflow and is based on (I think) the original Millwheel paper.
The alternative to all this nonsense is to just throw everything into ClickHouse and build materialized views! The drawback is you can't do complex joins, but for 90% of use-cases, ClickHouse materialized views work swimmingly.",0,24838111,0,"[24840028]","",0,"","[]",0 24840028,0,"comment","thom","2020-10-20 17:23:37.000000000","As I understand it, ClickHouse still doesn't support window functions either, so it gets hard to do complex sequence-based logic in calculations. Materialise actually supports lateral joins so you can do some powerful things, but I am sure a better developer experience is possible.",0,24838903,0,"[24840344]","",0,"","[]",0 24840344,0,"comment","hodgesrm","2020-10-20 17:51:47.000000000","ClickHouse does this using arrays. There's a rich set of functions to pull grouped values into arrays, process them with lambdas (e.g., map, sort, etc.), and explode them back out into a tabular result.
Here's a slide deck with examples from a recent presentation: https://altinity.com/presentations/introduction-to-high-velo...
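To make the pattern concrete, here is a minimal sketch in ClickHouse SQL (the events table and its columns are hypothetical): it collects timestamps into an array per user, sorts them, applies a lambda over two parallel arrays to get the gap between consecutive events, and explodes the result back into rows.

```sql
-- Hypothetical table: events(user_id UInt64, event_time DateTime).
-- Gap between consecutive events per user, without window functions.
SELECT
    user_id,
    arrayJoin(                               -- explode the array back into rows
        arrayMap((t, prev) -> t - prev,      -- lambda over two parallel arrays
                 arraySlice(ts, 2),
                 arraySlice(ts, 1, length(ts) - 1))
    ) AS gap_seconds
FROM
(
    SELECT user_id, arraySort(groupArray(toUInt32(event_time))) AS ts
    FROM events
    GROUP BY user_id
)
```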
ClickHouse sneaks a functional programming model with lambda expressions into SQL. It's not standard SQL but has enormous flexibility.",0,24840028,0,"[]","",0,"","[]",0 24843080,0,"comment","zX41ZdbW","2020-10-20 22:47:57.000000000","How does it compare in performance with the optimized code from ClickHouse? https://github.com/ClickHouse/ClickHouse/blob/c0fef71507b43c...",0,24839113,0,"[24843776]","",0,"","[]",0 24843776,0,"comment","aqrit","2020-10-21 00:34:32.000000000","The algorithm used by ClickHouse was "inspired" by https://github.com/cyb70289/utf8/ ... Which is known to be slower than this new method from simdjson.",0,24843080,0,"[]","",0,"","[]",0 24849533,0,"comment","jplevine","2020-10-21 16:44:15.000000000","Hi, I'm the product manager for Cloudflare Analytics. Thanks for this thorough and thoughtful review.
We are totally serious about building a world-class, privacy-first, free analytics product. At risk of HN cliche, this is our "early work". We are actively working to fix many of the rough edges mentioned here; if we had waited to fix all of them before shipping, we never would have shipped!
For folks who haven't seen it, I suggest checking out our launch blog post[0] which gives some more context around edge vs browser analytics (spoiler: we do both!), why we count visits the way we do, and how we handle bot traffic.
We know we have work to do on the "jagged lines" problem. For some low-traffic websites, we might show noisier, low-resolution data than is ideal. (We've artificially constrained our analytics to query a maximum of 7 days at a time because this problem is exacerbated with longer time ranges.)
My colleague Jamie wrote a nice blog post about how and why we sample data [1]. In short: we have an existing customer base of 25 million+ Internet properties, whose traffic volume spans 9 orders of magnitude! Sampling data is an elegant approach that allows us to serve fast, flexible analytics for all our customers. Sampling shouldn't be feared, but we know we can do better in some cases. We've recently merged some deep-in-the-weeds improvements to ClickHouse [2] that should result in improved resolution. And we're currently working to store full-resolution data for the smallest websites.
Happy to address any other specific points that folks have questions about.
[0] https://blog.cloudflare.com/free-privacy-first-analytics-for... [1] https://blog.cloudflare.com/explaining-cloudflares-abr-analy... [2] https://github.com/ClickHouse/ClickHouse/pull/14221",0,24846300,0,"[24849742,24852457]","",0,"","[]",0 27977814,0,"comment","dillondoyle","2021-07-27 21:20:20.000000000","I hate when HN'ers just chime in and say "you're shit, why didn't you do this."
So hopefully I'm not being that totally rude person; no shade intended in pointing out that if for some reason you haven't seen ClickHouse, you should check it out.
It solves the problems I parsed from your blog post, and it was originally built for that use case.
100 recs/s for an ad network is super duper mega tiny so I hope you're successful and get bigger!",0,27977217,0,"[27977844]","",0,"","[]",0 27977844,0,"comment","davidfischer","2021-07-27 21:23:20.000000000","I did check out ClickHouse but I haven't gotten a chance to load more real data to give it the full test. It's definitely on the todo.",0,27977814,0,"[]","",0,"","[]",0 27979989,0,"comment","dilyevsky","2021-07-28 03:33:41.000000000","We run almost all of our DBs - pg, cockroachdb, clickhouse and etcd/kafka/redis (if you can consider those a database) inside docker/crio inside k8s. In production under high load. Works really well. We’ve had more crashes of the db itself than anything container/node related",0,27979363,0,"[]","",0,"","[]",0 27980406,0,"comment","hodgesrm","2021-07-28 05:00:44.000000000","If you mean using Docker containers, well those are pretty stable. There are hundreds of companies running ClickHouse on Kubernetes, which deploys using Docker containers. Some of them run on very large K8s clusters. We've seen very few problems.
The one issue I've seen specific to Docker is that you can run into configuration errors that keep them from coming up. That happens occasionally and it can be tricky to debug.
Disclaimer: My company Altinity wrote the ClickHouse Kubernetes Operator, which is also a Docker container.",0,27979363,0,"[27981389]","",0,"","[]",0 24866936,0,"comment","zX41ZdbW","2020-10-23 07:39:37.000000000","ClickHouse has been comparatively tested on Graviton2 as well as other types of ARM servers: https://clickhouse.tech/benchmark/hardware/",0,24860579,0,"[]","",0,"","[]",0 27991000,0,"comment","hodgesrm","2021-07-28 23:21:04.000000000","> Unless that suggestion happens to be on the project's to-do list or it fixes an open bug, I'd say it's unlikely to get merged.
There are lots of open source projects that contradict your assertion. Here are 3 projects that have been quite liberal about accepting PRs from outside.
* MySQL - MySQL AB was very open about accepting external patches. Oracle still does today.
* ClickHouse - Yandex team has the most commits, but accepts PRs from hundreds of outside contributors.
* Superset - Apache project that is supported by Preset. Again, pretty open to new PRs as far as I can tell.
Overall it's a question of how the community is governed. A lot of us view open source communities as sources of innovation. That being the case, why would you refuse a good PR? Somebody just solved a problem for you, maybe one you didn't even know existed.",0,27962428,0,"[]","",0,"","[]",0 21194186,0,"story","tmlee","2019-10-08 16:44:32.000000000","",0,0,0,"[]","https://clickhouse.yandex/",3,"Clickhouse – column-oriented database management system","[]",0 28022712,0,"comment","zX41ZdbW","2021-08-01 00:37:56.000000000","In ClickHouse there is 1.6 times performance improvement after simply adding __restrict in aggregate functions:
https://github.com/ClickHouse/ClickHouse/pull/19946#issuecom...",0,28013259,0,"[]","",0,"","[]",0 28029301,0,"comment","joshxyz","2021-08-01 19:53:29.000000000","Yandex ClickHouse is also a good one for this.
clickhouse.tech",0,28006892,0,"[]","",0,"","[]",0 24919752,0,"comment","markpapadakis","2020-10-28 15:20:36.000000000","https://medium.com/@markpapadakis/interesting-codebases-159f...
In addition to those I highly recommend studying ClickHouse’s codebase. There are brilliant design and engineering bits everywhere. I learned more from studying this codebase than from most others I can think of, especially with regard to template metaprogramming (I learned about “template template parameters” from coming across extensive use of them there). It’s actually somewhat challenging to grok what’s going on — but it is worth pushing through until you “get it”.
Preset is a modern business intelligence platform designed to support all data users in an organization. Founded and led by Maxime Beauchemin (the original creator of Apache Airflow and Apache Superset), Preset is about to launch our cloud-hosted offering of Apache Superset. We're very well capitalized (most of our funding is not yet announced) by the fine folks at A16Z and others.
Apache Superset is:
- a top 200 Github project (in terms of stars, vanity metric but still!)
- built for scale (originally incubated at Airbnb, but we've met with hundreds of large teams self-managing & scaling Superset internally)
- written primarily in Python, TypeScript, and ReactJS
- built on frameworks like Flask App Builder, Apache ECharts (for viz), and SQLAlchemy (for helping us provide a no-code query / viz interface for a large # of databases)
- visualization rich (over 40 chart types, including geospatial) and you can build custom viz plugins as well
Here are some recent blog posts about Superset / Preset:
- Apache Superset as a Chart.io Alternative (https://preset.io/blog/2021-7-22-superset-vs-chartio/)
- Apache Superset 1.2 Released (https://preset.io/blog/2021-7-14-apache-superset-1.2/)
- 9 New ECharts Visualizations in Apache Superset (https://preset.io/blog/2021-6-14-superset-nine-new-charts/)
- Why Apache Superset is Betting on Apache ECharts (https://preset.io/blog/2021-4-1-why-echarts/)
- Superset + Clickhouse (https://preset.io/blog/2021-5-26-clickhouse-superset/), Druid (https://preset.io/blog/2021-03-03-druid-prophet-pt1/), Trino (https://preset.io/blog/2021-6-22-trino-superset/), Athena (https://preset.io/blog/2021-5-25-data-lake-athena/), Dremio (https://preset.io/blog/2021-7-27-dremio-superset/), and more (https://preset.io/blog/)
We're hiring in pretty much all departments:
* Engineering: Senior Backend, Cloud Infra, Frontend, Fullstack, and an Engineering Manager
* Product: Product Managers (Growth, SaaS), Product Designers (Data Viz, Lead)
* GTM / Developer Relations: Developer Relations Engineer (I work on this team, join our awesome team of 2!). Demand Generation Manager & SDR as well!
* Customer Engagement: Head of Technical Customer Engagement and Senior Customer Support specialist
Our careers page is here! (https://preset.io/careers/)",0,28037366,0,"[]","",0,"","[]",0 28041057,0,"comment","fuziontech","2021-08-02 18:53:04.000000000","PostHog | Remote (US/Europe timezones) | Senior Backend, SRE, ClickHouse/C++ Engineers | https://posthog.com PostHog is open-source product analytics. Graduated YC W20, we were the most popular B2B software HN launch since 2012. Our GitHub repo [0] has 4k stars and a growing and active community. We've raised significant funding with 10 years of runway and are growing quickly. We're 30+ people and will be 50 by the end of the year. We're looking for senior Backend and ClickHouse/C++ engineers, someone who geeks out on scale and doing crazy fun stuff with a ton of data. Our stack is Django/Node/C++ and of course ClickHouse. We are also looking for a senior SRE to help us level up scaling to 100B+ events and beyond for both our cloud offering and our on-prem customers.
We have a culture of written async communication (see our handbook [1]), lots of individual responsibility and an opportunity to make a huge impact. Being fully remote means we're able to create a team that is truly diverse. We're based all over the world, and the team includes former YC founders, CTOs turned developers and recent grads.
To apply see https://posthog.com/careers or email us careers@posthog.com
[0] https://github.com/posthog/posthog [1] https://posthog.com/handbook/",0,28037366,0,"[]","",0,"","[]",0 28041459,0,"comment","orthoxerox","2021-08-02 19:26:40.000000000","And yet Postgres still refuses to add hints.
I'd rather deal with a nonexistent query optimizer, like the one in Clickhouse, than with an insufficiently smart one that I can't control.",0,28040801,0,"[28041918,28041693,28041906]","",0,"","[]",0 28041681,0,"comment","mescudi","2021-08-02 19:45:32.000000000","Full-time/Part-time
Location: Kazakhstan
Remote: Yes, English/Russian
Willing to relocate: Yes
Experience: 4+ years
Technologies: kubernetes (k8s), golang, aws, python, docker, helm, linux, terraform, ansible, clickhouse, ELK, cockroachdb, aerospike
Resume: https://tinyurl.com/2rpdpzm9
Email: nurtas977@gmail.com
Eager to work in a startup culture, but open to interviews from big companies. Currently a DevOps Engineer, sometimes taking on SE and SRE tasks.",0,28037364,0,"[]","",0,"","[]",0 24921501,0,"comment","oandrew","2020-10-28 17:22:49.000000000","ClickHouse has a very clean and modern C++ codebase. https://github.com/ClickHouse/ClickHouse",0,24901244,0,"[]","",0,"","[]",0 24923992,0,"comment","shepik","2020-10-28 20:51:11.000000000","Yep, I second the ClickHouse suggestion. Much easier to understand than, say, the MongoDB or MySQL codebases.",0,24919752,0,"[]","",0,"","[]",0 28043065,0,"comment","ndom91","2021-08-02 21:42:04.000000000","Checkly | Remote (GMT-3 to GMT+3) | Full-time
Making E2E Active Webtesting kick ass! https://checklyhq.com
Jobs (https://www.checklyhq.com/jobs):
- Frontend Dev (mostly Vue)
- Fullstack (AWS (Lambdas, Queues, DBs, etc.) / Vue / Postgres / Clickhouse)
- Head of Marketing
- Product
- SRE
Mention "Nico" :)",0,28037366,0,"[]","",0,"","[]",0 24933136,0,"comment","pachico","2020-10-29 17:52:45.000000000","From what I've seen, performance is still much worse than Clickhouse, that was always distributed, open source, data warehouse like and feature rich. Why should I use timescale? I'm really asking, I'm not being rhetorical.",0,24931994,0,"[24933228,24938454,24933268]","",0,"","[]",0 24933228,0,"comment","hardwaresofton","2020-10-29 18:01:04.000000000","Clickhouse and Timescale are different types of databases -- Clickhouse is a columnar store and Timescale is a row-oriented store that is specialized for time series data with some benefits of columnar stores[0].
Something like InfluxDB is a better thing to compare to TimescaleDB (and TimescaleDB does very well, though the benchmark was a bit old[1] and influx might have improved in the meantime).
Database types aside, what really gets me excited about Timescale is that it's just another Postgres extension. If you're already running a Postgres cluster for your OLTP workloads (web-app-y workloads) and have just a bit of fast-moving time series data (ex. logs, audit logs, event streams, etc), Timescale is only an extension away. You get the usual time-tested, battle-hardened Postgres, with all its features and also support for your time series workloads. Yeah, you could set up declarative partitioning yourself (it is a Postgres feature after all), but why bother when Timescale has done the heavy lifting?
[EDIT] - see the response below -- the benchmark is up to date, and Timescale does even better against the purpose-built tool that is InfluxDB.
> Note: This study was originally published in August 2018, updated in June 2019 and last updated on 3 August 2020.
[0]: https://blog.timescale.com/blog/building-columnar-compressio...
[1]: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...",0,24933136,0,"[24933376,24933564,24935050]","",0,"","[]",0 24933268,0,"comment","szemet","2020-10-29 18:04:05.000000000","Based on a more mature codebase.
/When I tried ClickHouse, I managed to segfault it with a NULL pointer dereference error. OK, it is anecdotal, and maybe I just had bad luck; the same could happen with Postgres etc... But anyway: it can be a deciding factor. (And they fixed it quickly - issue 7955 on GitHub) /
MySQL engine for Clickhouse sounds like dblink[0] or foreign data wrappers(fdw)[1] in Postgres. Doing it with Postgres allows for way more flexibility (the data could be local or remote) in this case, and the data will be at home in Postgres, with all the stability, features, operational knowledge (and also bugs/warts of course) that come with Postgres.
You may never get 100% of the performance you'd get from a purpose-built database that doesn't make the choices Postgres makes but the idea of getting 80/90% of the way there, with only one thing to maintain is very exciting to me.
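For anyone who hasn't used fdw, a minimal postgres_fdw setup looks roughly like this (server names, hosts, credentials, and the table below are made up for illustration):

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point at the remote Postgres instance.
CREATE SERVER analytics_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'analytics.internal', port '5432', dbname 'metrics');

CREATE USER MAPPING FOR CURRENT_USER SERVER analytics_srv
    OPTIONS (user 'reader', password 'secret');

-- The remote table now behaves like a local one (joins, views, etc.).
CREATE FOREIGN TABLE events_remote (
    user_id    bigint,
    event_time timestamptz
) SERVER analytics_srv OPTIONS (schema_name 'public', table_name 'events');
```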
[0]: https://www.postgresql.org/docs/current/contrib-dblink-funct...
[1]: https://www.postgresql.org/docs/current/postgres-fdw.html",0,24933564,0,"[]","",0,"","[]",0 24938454,0,"comment","manigandham","2020-10-30 02:19:18.000000000","Native column-oriented data warehouses designed for OLAP queries will always be faster. There are multiple alternatives from Clickhouse to Redshift that will be faster.
Originally I didn't like Timescale because it didn't offer anything new but the product has improved greatly over the years. Today it's close on performance by using a custom column-oriented data layer that stores the actual chunks in PostgreSQL rows and has several time-related processing and analytical features (continuous aggregates, time bucketing, smoothing values, etc) that make it easier than doing it yourself in raw SQL.
One of the big advantages is that it allows you to use Postgres which means you can continue to use it as your main OLTP operational database as well. This avoids a lot of complicated polyglot issues like syncing datasets or using different querying systems with different syntax. It's one of the better examples of using Postgres as a data platform rather than a simple database.
There are other alternatives that combine this OLAP+OLTP functionality like Citus (another automatic sharding distributed database extension for Postgres), Vitess (automatic sharded mysql), TiDB (natively distributed mysql interface on top of key/value store), MemSQL (proprietary distributed mysql interface with ram-based rowstores and disk-based columnstores) and SQL Server (with hekaton column-stores, in-memory tables, and scale out).",0,24933136,0,"[]","",0,"","[]",0 24939732,0,"comment","zX41ZdbW","2020-10-30 06:37:06.000000000","In ClickHouse* we patched Ryu to provide nicer representations and to have better performance for floats that appear to be integers.
* https://github.com/ClickHouse/ClickHouse/pull/8542",0,24939430,0,"[]","",0,"","[]",0 24939744,0,"comment","zX41ZdbW","2020-10-30 06:40:04.000000000","The change to Ryu is here: https://github.com/ClickHouse-Extras/ryu/pull/1",0,24939430,0,"[]","",0,"","[]",0 17644885,0,"story","ztlpn","2018-07-30 15:04:01.000000000","",0,0,0,"[17645135]","https://hackernoon.com/clickhouse-an-analytics-database-for-the-21st-century-82d3828f79cc",8,"ClickHouse, an analytics database for the 21st century","[]",1 17645135,0,"comment","zX41ZdbW","2018-07-30 15:32:35.000000000","What's interesting, ClickHouse works like APL/kdb inside (vectorized processing) but it looks like an SQL database and convenient to use.",0,17644885,0,"[]","",0,"","[]",0 14055637,0,"comment","zX41ZdbW","2017-04-06 23:21:33.000000000","I think, ClickHouse (open-source distributed column-oriented DBMS) should fit better for your needs.",0,14055592,0,"[14055667]","",0,"","[]",0 14055882,0,"comment","ddorian43","2017-04-07 00:06:53.000000000","No. Clickhouse is actual column-store. Cassandra just stores every column of a row as an actual separate row. Cassandra is for oltp, column-stores for olap.",0,14055667,0,"[14055896]","",0,"","[]",0 28071631,0,"comment","BohuTANG","2021-08-05 09:53:29.000000000","Datafuse team mainly from the ClickHouse community, but more focused on the cloud database. Datafuse Labs team: https://github.com/orgs/datafuselabs/people",0,28071281,0,"[28072147]","",0,"","[]",0 28072147,0,"comment","MrBuddyCasino","2021-08-05 11:13:12.000000000","Suddenly things got a lot more interesting. The Clickhouse guys are really talented & have good taste.",0,28071631,0,"[]","",0,"","[]",0 28074841,0,"comment","caust1c","2021-08-05 15:10:52.000000000","Curious what the motivation behind rebuilding it in rust is, versus contributing more to Clickhouse? Obviously memory safety is a big one, but is that the only reason?
What are the other goals of the project?
Personally, I'd love to see an easier-to-manage system with replication considered as a first-class feature rather than bolted on at the end.",0,28069895,0,"[28075483]","",0,"","[]",0 28077830,0,"comment","caust1c","2021-08-05 18:44:29.000000000","Awesome, great to hear! I've been using clickhouse for a long time and although we haven't contributed significantly to development, random bugs and issues have been quite painful in the past. Looking forward to what you're able to do!
p.s. Please don't add in-process DNS caching ;-)
https://github.com/ClickHouse/ClickHouse/issues/5287",0,28075483,0,"[]","",0,"","[]",0 28079551,0,"comment","eatonphil","2021-08-05 20:47:54.000000000","The JVM and CLR have the long tail of libraries that just doesn't exist for smaller languages. What happens when you want to integrate with Parquet files or some common but sort-of-obscure proprietary system like ClickHouse? I've done a bunch of FFI integrations with C libraries and I never enjoy it. You always have to make sure you're handling object lifetimes correctly and you just don't have that issue if you're able to use Java or CLR as your FFI target.
For small apps where you know you'll never have to leave the native ecosystem it can be fine to stay outside of JVM/CLR. But for business apps I think there's a lot of security to be had by being able to integrate with JVM/CLR.
Edit: I'm not saying it's a good idea to choose ABCL, I hope this article made that clear. SBCL is much more mature in most dimensions.",0,28079423,0,"[28081912,28081324]","",0,"","[]",0 21259706,0,"comment","haggy","2019-10-15 15:47:40.000000000","Not sure how much sarcasm is built into your statement but as a long time user & supporter of Postgres I have to disagree. Postgres is fantastic at many things but there are times when it's not the best choice. An example would be high-scale OLAP. I've tried to use PG for those cases but ingesting thousands+ of events per second into PG and then trying to perform rollups on them in an online capacity (near real-time availability) requires a TON of extra legwork to get it going and near continuous maintenance thereafter. Other purpose-built DB's for OLAP such as Druid, Clickhouse, etc are much better suited for this type of use-case. The main advantage these systems have over Postgres is their column-oriented nature vs. row-oriented of PG and other RDMS.",0,21259579,0,"[21259757,21260160,21260359]","",0,"","[]",0 17669917,0,"comment","arespredator","2018-08-02 08:00:22.000000000","MessageBird | Amsterdam, Netherlands | Data Engineer, ML Engineer | Full-time | Onsite | Visa
MessageBird is a Cloud Communications Platform as a Service (CPaaS) company for SMS, Voice and Chat communications that connects businesses to 7 billion phones worldwide. We’re one of the fastest growing software companies in the world and we’re looking to expand our best-in-class Engineering Team with an experienced Data Engineer and Machine Learning Engineer.
Data engineering at MessageBird is programming-heavy, so we're looking for people who like to code and have significant software engineering experience. On the ML front we're looking for engineers experienced in delivering products more than purely research-oriented folk, but if you've a solid research background and want to try moving to the private sector, give us a shout too.
Tech stack: Go, gRPC, Clickhouse, Bigtable, Java, Apache Beam (Google Dataflow), GCP, k8s.
Our data team is currently 8 engineers and 8 nationalities. We have a very well stocked kitchen and a roof terrace in our brand new Rivierenbuurt office.
Apply at https://www.messagebird.com/en/careers and feel free to contact me at piotr@messagebird.com in case you have any questions.",0,17663077,0,"[]","",0,"","[]",0 14072179,0,"comment","lima","2017-04-09 13:31:09.000000000","And apparently ClickHouse out-performs MemSQL: https://clickhouse.yandex/benchmark.html
Does anyone here have experience with ClickHouse?",0,14072179,0,"[14072738,14072807]","",0,"","[]",0 14072738,0,"comment","dignan","2017-04-09 15:31:54.000000000","No experience, but I did a thorough read-through of the docs. One thing to keep in mind about ClickHouse is that its replication guarantees aren't very strong. From the docs: "There are no quorum writes. You can't write data with confirmation that it was received by more than one replica."
That's pretty troubling, but at least they're open about it. That said, their performance claims are pretty spectacular, and it seems solidly engineered. Further, if you're not planning on using replication, it certainly seems interesting. I'd be curious to hear about someone's production experience as well, since the list of companies running it seems rather thin.
Altinity offers support and public cloud ClickHouse data warehouse-as-a-service. If you know data services and are looking for a great technical challenge with a fun team, please check us out.
We have open positions for cloud development, site reliability engineering, security, support, customer success, and technical writing. Support and customer success roles in particular require excellent expertise in ClickHouse.
For more information please check out: https://altinity.com/careers",0,24969524,0,"[]","",0,"","[]",0 14081452,0,"comment","ztlpn","2017-04-10 18:55:09.000000000","(ClickHouse dev here)
Yes, replication in ClickHouse is asynchronous by default. For intended use cases (OLAP queries aggregating data from many rows) data that is a few seconds stale is usually okay. In a serious production deployment you absolutely should enable replication, otherwise you risk losing all your data, not just last couple of seconds of inserts.
That said, sometimes synchronous replication is necessary despite the latency penalty that comes with it. This feature is actually implemented but not yet considered ready for prime time.
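For context, enabling replication is mostly a matter of picking the table engine; a rough sketch using current syntax (the ZooKeeper path, macros, and table below are placeholders, not a production layout):

```sql
-- Replicated variant of MergeTree; '{shard}' and '{replica}' come from
-- macros defined in each server's configuration.
CREATE TABLE hits
(
    event_date Date,
    user_id    UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/hits', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- The synchronous mode mentioned above: require acknowledgement from at
-- least two replicas before an INSERT returns.
SET insert_quorum = 2;
```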
We have several years of production experience with ClickHouse (as the DBMS powering Yandex.Metrica - the second-largest web analytics system in the world). If you have questions - just ask.",0,14072738,0,"[]","",0,"","[]",0 24982186,0,"comment","amzans","2020-11-03 18:05:33.000000000","I’m building a private analytics suite for websites. It aims to be a more complete toolset than alternatives like Plausible, Fathom or Simple Analytics (great products too BTW). Feeling super lucky as I just launched and am already handling millions of page views per month for 200+ customers.
Not yet at $500 MRR but I’m focusing on providing a lot of value in a simple package, and automating as much as possible to reduce daily operations and move fast.
The stack mainly consists of Python, React, Postgres, Clickhouse, and Redis running on Kubernetes (deployment configured via Terraform and Kustomize). Was running on DO + Linode and just moved to AWS.
Happy to answer any questions!",0,24947167,0,"[24986809,24983961,24986646]","",0,"","[]",0 24986809,0,"comment","hodgesrm","2020-11-04 06:31:06.000000000","Hey, that's cool! I run the SF Bay Area ClickHouse Meetup. We like hearing about new analytic apps. Let me know if you want to do a talk about how you are using ClickHouse. Our next meetup going to be in early December.
My email is on my profile if you want to contact me directly.",0,24982186,0,"[25016101]","",0,"","[]",0 24986810,0,"comment","carlineng","2020-11-04 06:31:10.000000000","Somewhat related is Pavlo's musings on Naming a Database Management System [1]:
"In my opinion, the best DBMS names from the last thirty years are Postgres[4] and Clickhouse.
These two names did not mean anything before, but now they only have one connotation. There is no ambiguity. There is no overlap with other DBMS names or non-database entities. They are easy to spell correctly[5]. Everyone refers to them by their full name. Nobody mistakingly calls them "PostgresDB" or "ClickhouseSQL."
After reflecting on what makes them so good, I realized what the secret was to them. They are a two-syllable name that is derived from combining two unrelated one-syllable words together (e.g., Post + Gres, Click + House). Each individual word has its own meaning. It is only when you put them together does it mean the database.
Given this, I henceforth contend that the best way to name a DBMS is to combine two one-syllable words."
Noise + Page
[1]: https://www.cs.cmu.edu/~pavlo/blog/2020/03/on-naming-a-datab...",0,24983872,0,"[24987083]","",0,"","[]",0 24987083,0,"comment","chrismorgan","2020-11-04 07:38:39.000000000","> They are easy to spell correctly[5]. Everyone refers to them by their full name. Nobody mistakingly calls them "PostgresDB" or "ClickhouseSQL."
… and yet that article misspells both of them, consistently not calling one of them by its full name (though it acknowledges this in the fourth footnote, with the relevant history).
Postgres is actually PostgreSQL.
Clickhouse is actually ClickHouse.
Gluing two words together guarantees that a significant percentage of your users will misspell it. (I’m categorising incorrect capitalisation as a misspelling.) Some will follow the declared spelling, and some will incorrectly capitalise, regardless of what you declare to be true, either introducing spurious capitals or lowercasing authentic capitals.
Some entities change the spelling of their name over time. Some are inconsistent by accident, which I don’t really understand (I could never do it myself; yet I observe it happens quite commonly). But you know what really grinds my gears? When the original source is deliberately inconsistent in its spelling, refusing to declare a canonical spelling, as is the case with sauceHut.",0,24986810,0,"[]","",0,"","[]",0 24998385,0,"story","jgrahamc","2020-11-05 14:18:00.000000000","",0,0,0,"[]","https://blog.cloudflare.com/clickhouse-capacity-estimation-framework/",2,"ClickHouse Capacity Estimation Framework","[]",0 17697044,0,"comment","olavgg","2018-08-06 12:35:06.000000000","I highly recommend Clickhouse for this. It is blazing fast and can do over 100GB/s on a single modern machine if you have enough RAM. And it is very easy to install and configure.",0,17696886,0,"[17697056,17697283]","",0,"","[]",0 17697283,0,"comment","lykr0n","2018-08-06 13:20:27.000000000","+1 for ClickHouse. But it's the kind of software you need to figure out before you scale beyond a single node.
Fast as hell and surprisingly performant in low-memory conditions.",0,17697044,0,"[]","",0,"","[]",0 17700348,0,"comment","massaman_yams","2018-08-06 18:44:35.000000000","And it's... really not all that fast when compared to mature analytical databases. ClickHouse on identical hardware is ~ 100x faster than cstore_fdw.
http://tech.marksblogg.com/benchmarks.html
More interesting to me is the reverse: using FDW from the analytical DB to Postgres, e.g., https://aws.amazon.com/blogs/big-data/join-amazon-redshift-a...",0,17697032,0,"[17704281,17703170,17728374]","",0,"","[]",0 25009347,0,"story","pachico","2020-11-06 17:30:02.000000000","",0,0,0,"[25009370,25009801]","https://blog.cloudflare.com/clickhouse-capacity-estimation-framework/",47,"ClickHouse Capacity Estimation Framework at Cloudflare","[]",6 25009370,0,"comment","pachico","2020-11-06 17:32:54.000000000","I'm wondering how they manage the tables creation on a 100 nodes cluster. Is it all by hand? And why do to they use clickhouse_exporter? Doesn't the built in exporter provide the required data?",0,25009347,0,"[25009854,25009990]","",0,"","[]",0 25009854,0,"comment","hodgesrm","2020-11-06 18:21:19.000000000","I don't know about CloudFlare specifically but the usual way to create tables in clusters is to use the CREATE TABLE IF NOT EXISTS <name> ON CLUSTER <cluster>. It executes the command across all nodes. You can automate this easily. [1]
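A minimal sketch of that ON CLUSTER pattern (the table name, cluster name, and ZooKeeper path here are placeholders, not anything from Cloudflare's actual setup):

```sql
-- Placeholder names throughout; adjust to your own deployment.
CREATE TABLE IF NOT EXISTS events ON CLUSTER my_cluster
(
    ts DateTime,
    host String,
    value Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (host, ts);
```

Run once against any node and the DDL is propagated to every node in the cluster, which is what makes it easy to script.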
The built-in exporter is relatively new--the original PR was merged in December of last year and it still had changes coming in mid-year 2020.
Also, compatibility with clickhouse-exporter was unfortunately not a requirement, so the metric names and coverage do not fully match.
[1] https://clickhouse.tech/docs/en/sql-reference/statements/cre...",0,25009370,0,"[25010208]","",0,"","[]",0 25009990,0,"comment","ooxaanaa","2020-11-06 18:37:08.000000000","All migrations are applied automatically. We have bootstrap files and migrations themselves. All of these files are stored in the repository and go through the review process, after which they are deployed to a specific cluster.
We use clickhouse_exporter from the early days. It queries system tables and exposes metrics in the required format to Prometheus.",0,25009370,0,"[]","",0,"","[]",0 17704610,0,"story","wiradikusuma","2018-08-07 07:59:04.000000000","",0,0,0,"[]","https://clickhouse.yandex",3,"Clickhouse","[]",0 28129297,0,"comment","spyspy","2021-08-10 14:30:35.000000000","Real-time aggregates is the core value pitch of Clickhouse.",0,28128955,0,"[]","",0,"","[]",0 28130110,0,"comment","yawniek","2021-08-10 15:35:06.000000000","C Wire | Typescript Developer - Vue.js / Nest.js | REMOTE | Zurich | Full-Time | cwire.media
C Wire found a sustainable way to fund Free Journalism without betraying the users' privacy. We are an AdTech company using NLP to place Ads on Premium Publisher sites within the right context.
State of the art tech ( Typescript, Golang, K8s, Clickhouse etc ) and a fun, distributed team.
You should ideally be in a Central European time zone.
https://www.cwire.ch/career/typescript-developer-vue-js-nest...",0,28037366,0,"[]","",0,"","[]",0 25016101,0,"comment","amzans","2020-11-07 16:41:07.000000000","Hey thanks for considering me! I do intend to write a few blog posts about how I use Clickhouse, but I'll get in touch if anything.",0,24986809,0,"[]","",0,"","[]",0 28132846,0,"comment","citrin_ru","2021-08-10 19:05:22.000000000","If all you have is a hummer everything look like a nail: when Hadoop first appeared there was almost no other open source systems to process 'big data' and it was widely adopted. Now there are many options to choose from. We don't have to use map-reduce for every task which could be solved using map-redude. E. g. for some tasks a columnar store, like ClickHouse is a better fit.",0,28130144,0,"[28135539]","",0,"","[]",0 25023839,0,"comment","st1ck","2020-11-08 07:26:34.000000000","Probably not what you want, but I was looking for a while how I can get rid of Pandas, and for the most part ClickHouse (DBMS for analytics) can do almost everything Pandas does (for my use case at least) and much more efficiently. It has very good support of array types, so it's pretty convenient for slightly nested data. Whatever you can't do completely in ClickHouse, you can export into Parquet/JSON/CSV/etc. and finish the analysis in Pandas or anywhere else.
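As a rough illustration of that workflow (the table and column names are invented), the array handling and Parquet export might look like:

```sql
-- Filter a nested per-row array in ClickHouse, then export the
-- result for further analysis in Pandas; names here are made up.
SELECT
    user_id,
    arrayFilter(x -> x > 100, latencies) AS slow_requests
FROM events
INTO OUTFILE 'events.parquet'
FORMAT Parquet;
```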
Just be aware that ClickHouse is still a somewhat experimental and idiosyncratic DBMS; it doesn't have some SQL goodies like window functions, B+Tree indexes, or pivots. They only added CTE support a week ago; I haven't even tried it yet.",0,24985184,0,"[]","",0,"","[]",0 25042294,0,"comment","1996","2020-11-10 02:28:54.000000000","Indeed, that's what I currently have, but for this database I'm looking at alternatives (ex: clickhouse) that could get everything and "send it onward" on a schedule.
It's just a glorified bulk insert with extra moving parts, but if it works around the issue, why not?
BTW Have you played with Odyssey? It doesn't seem essentially different from (or better than) pgbouncer.",0,25040583,0,"[25042350]","",0,"","[]",0 25056712,0,"comment","barrkel","2020-11-11 08:55:37.000000000","My Linux laptop locks up every now and then with swapping when I'm running our app in k3s; three database servers (2 mysql, 1 clickhouse), 4 JVMs, node, rails, IntelliJ, Chrome, Firefox and Slack, and you're starting to hit the buffers. I was contemplating adding more ram; 64 GB looks appealing.
I would not buy a new machine today for work with less than 32 GB.",0,25055682,0,"[]","",0,"","[]",0 25057747,0,"comment","valyala","2020-11-11 12:23:54.000000000","There is the more mature ClickHouse database [1]. This is a column-based OLAP database, which provides outstanding query performance (it can scan billions of rows per second per CPU core) and outstanding on-disk data compression (up to 100x if proper table configs are used). ClickHouse also scales horizontally to multiple nodes. Compared to QuestDB, ClickHouse consistently shows high performance on a wide range of query types from production. It also provides many SQL extensions optimized for analytical workloads.
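"Proper table configs" largely means per-column compression codecs; a hedged sketch with an invented schema:

```sql
-- Delta-style codecs on timestamps and gauges, combined with ZSTD,
-- can compress time series data dramatically; schema is illustrative only.
CREATE TABLE cpu_metrics
(
    ts DateTime CODEC(DoubleDelta, ZSTD),
    host LowCardinality(String),
    usage Float64 CODEC(Gorilla, ZSTD)
)
ENGINE = MergeTree
ORDER BY (host, ts);
```

The actual compression ratio depends heavily on how well the sort order groups similar values together.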
BTW, VictoriaMetrics [2] is a time series database built on top of ClickHouse architecture ideas [3], so it inherits high performance and scalability from ClickHouse, while providing simpler configuration and operation for typical production time series workloads.
[2] https://victoriametrics.com/
[3] https://valyala.medium.com/how-victoriametrics-makes-instant...",0,25057320,0,"[25067013]","",0,"","[]",0 25058009,0,"comment","valyala","2020-11-11 13:07:42.000000000","It looks like Influx-IOx and Apache Druid have many common base building blocks. The main difference is that Influx-IOx is optimized for time series workloads, while Apache Druid is optimized for analytical workloads. While time series workloads can be treated as analytical workloads in most cases, time series databases usually provide specialized query languages such as Flux, InfluxQL, PromQL [1] or MetricsQL [2] that simplify typical queries over time series data.
P.S. Apache Druid should be compared to ClickHouse [3] or similar analytical databases.
[1] https://valyala.medium.com/promql-tutorial-for-beginners-9ab...
[2] https://victoriametrics.github.io/MetricsQL.html
[3] https://clickhouse.tech/",0,25056991,0,"[]","",0,"","[]",0 25058051,0,"comment","valyala","2020-11-11 13:13:25.000000000","It looks like InfluxDB is going to target general-purpose analytical workloads. It would be interesting to look at how InfluxDB will compete with ClickHouse in this space.",0,25051727,0,"[]","",0,"","[]",0 25058209,0,"comment","valyala","2020-11-11 13:31:57.000000000","It looks like TensorBase has many common things with ClickHouse. Do you have benchmarks that compare performance of TensorBase with ClickHouse similar to benchmarks published by ClickHouse [1]?
[1] https://clickhouse.tech/benchmark/dbms/",0,25055005,0,"[]","",0,"","[]",0 28183155,0,"comment","flashm","2021-08-14 19:07:31.000000000","PostGIS would get you up and running, but is not particularly quick unless you have a budget and the knowledge to tune it. It is however probably the most mature, and will be able to do anything you need.
If you need read speed, check out Clickhouse, or something in memory, like Tile38. Clickhouse in my experience is at least 50-100x faster for a point-in-polygon query than PostGIS out of the box.",0,28181726,0,"[]","",0,"","[]",0 28188436,0,"comment","zepearl","2021-08-15 13:00:31.000000000","Same opinion, I would say "it depends".
Personally I like Gentoo on the desktop especially in relation to when I code (e.g. very handy to be able to easily switch SW-versions of some packages while always using the same repository). I use it as well on some servers as root OS (I mean the one that runs mainly just the hypervisor), if they have special needs (e.g. if for some reason I want/have to use a recent version of some SW, e.g. ZFS, Kernel, firewall, QEMU, etc...).
On the other hand for VMs I usually just use Debian or Mint, as maintenance/upgrade effort is a lot lower & quicker. In some cases I still have to use PPAs but they're usually exceptions (e.g. Postgres 13 & kernel 5.10 & Clickhouse for Debian 10).",0,28186001,0,"[]","",0,"","[]",0 25069614,0,"comment","jiofih","2020-11-12 13:30:56.000000000","Using Clickhouse btw.",0,25061674,0,"[]","",0,"","[]",0 14213672,0,"comment","otterley","2017-04-27 18:25:01.000000000","This is an issue even with non-GPU-based OLAP engines. See, e.g., Clickhouse (https://clickhouse.yandex/reference_en.html):
"We'll say that the following is true for the OLAP (online analytical processing) scenario...Queries are relatively rare (usually hundreds of queries per server or less per second)."",0,14213528,0,"[14214304]","",0,"","[]",0 21414293,0,"comment","grumpydba","2019-10-31 22:11:38.000000000","> Anyone considering a “time series database” should first set up a modern commercial column store, partition their tables on the time column, and time their workload. For any scan-oriented workload, it will crush a row store like Timescale.
Or you can set up a clickhouse instance. It's a seriously promising and underrated product.",0,21414180,0,"[21414938,21414943]","",0,"","[]",0 21414938,0,"comment","manigandham","2019-10-31 23:40:07.000000000","Clickhouse is a distributed relational columnar database. It competes with MemSQL, Vertica, Actian, Greenplum, and hosted options like Redshift, Bigquery, Snowflake, etc.",0,21414293,0,"[21414946]","",0,"","[]",0 21414943,0,"comment","atombender","2019-10-31 23:41:02.000000000","Clickhouse is good, but it's definitely made for a very limited purpose; it's not a general purpose SQL database. Which is fine, but the attraction with something like TimescaleDB is that your time series data can coexist with normal data.",0,21414293,0,"[21415022,21414962]","",0,"","[]",0 21415022,0,"comment","PeterZaitsev","2019-10-31 23:54:29.000000000","Many Time-Series applications do not need super complicated SQL. This is why there are many timeseries focused databases even without SQL support.
There is also a PostgreSQL Foreign Data Wrapper for ClickHouse which allows you to run all the SQL PostgreSQL supports, often with great performance",0,21414943,0,"[21417031]","",0,"","[]",0 21417031,0,"comment","buremba","2019-11-01 08:26:38.000000000","If you use Postgresql's query engine on the Clickhouse data, you lose all the benefits of the columnar query engine of Clickhouse, so that's not correct.",0,21415022,0,"[21417444]","",0,"","[]",0 21417444,0,"comment","grumpydba","2019-11-01 10:22:49.000000000","No, you don't lose them. The FDW supports push-down of where clauses and only selects the required columns. You can also create views in clickhouse to make sure the joins are processed there.",0,21417031,0,"[21418292]","",0,"","[]",0 21418292,0,"comment","buremba","2019-11-01 13:09:29.000000000","You're right, but if the syntax you're using is not supported in Clickhouse, aggregate and predicate pushdowns won't work, and this FDW (https://github.com/adjust/clickhouse_fdw) needs to map all the Postgresql functions / procedures to Clickhouse in order to take advantage of push-down. So the only use case here is that you may want to join the data in Clickhouse with the data in Postgresql (or other FDW sources).",0,21417444,0,"[]","",0,"","[]",0 21419926,0,"comment","missosoup","2019-11-01 15:37:48.000000000","> Also, I think typical scenario is to resolve embeddings in your model code or data input pipeline.
Correct. PG has no place in this workload other than being the final store for the model output. And even then, you'd be using a column store like Redshift or Clickhouse. PG is not even suitable for the ngram counters because its ingest rates are way too slow to keep up with a fanned-out model spitting out millions of ngrams per second in addition to everything else going on in the pipeline.
You -could- probably do it all in PG. But that'd be a silly esoteric challenge exercise and not something anyone would try on a project. I am sure you recognise that.",0,21419054,0,"[21420227]","",0,"","[]",0 21423138,0,"comment","arespredator","2019-11-01 20:05:23.000000000","Location: Amsterdam, NL
Remote: Yes
Willing to relocate: No
Technologies: Go, Python, AWS, GCP, K8S, Linux, Clickhouse, Dataflow
Resume: https://piotr.is/cv.pdf
Email: m@piotr.is
I am an experienced backend/data engineer with background in systems engineering/SRE and research. I'm interested in any job that features a significant programming component (not so keen on devops roles). I have remote work experience, but I'd only consider remote work when a large part of the organisation is remote.",0,21419534,0,"[]","",0,"","[]",0 21428413,0,"comment","hwwc","2019-11-02 16:09:15.000000000","SEEKING WORK | Backend Services and Data Engineering
Location: US
Remote: Yes
I'm a software engineer experienced in all parts of a data-analytics backend-stack: from ETL to database design to web-API to devops. One of my major projects is an analytics engine for web applications (https://github.com/hwchen/tesseract).
I'm looking for a 10-20 hr/week contract writing robust, performant, and ergonomic applications for processing and querying data.
Primary Skills: Rust, Linux, Google Compute Platform, ClickhouseDB, Postgresql
Production Experience with: Python/Pandas, Node/JS, AWS, Docker, Redis, MySql, Nginx
Github: https://github.com/hwchen
Contact: hello@hwc.io",0,21419535,0,"[]","",0,"","[]",0 28248292,0,"story","zX41ZdbW","2021-08-20 16:53:50.000000000","",0,0,0,"[28248344]","https://clickhouse.tech/blog/en/2021/performance-test-1/",5,"Testing the Performance of ClickHouse","[]",1 28248344,0,"comment","zX41ZdbW","2021-08-20 16:57:55.000000000","Cheburashka is on the photo, wearing ClickHouse T-shirt.",0,28248292,0,"[]","",0,"","[]",0 21432858,0,"comment","dylz","2019-11-03 08:59:42.000000000","I use https://clickhouse.yandex for first-party analytics.",0,21432790,0,"[]","",0,"","[]",0 21433046,0,"comment","missosoup","2019-11-03 10:03:34.000000000","If you can meet your demands in PG currently, you will probably never need something as heavy as Spark.
If PG is currently working for you, then chances are it'll keep working if you optimise it (partitioning, clustering, etc).
If you do want something else, the particular use cases you described are pretty much exactly what Clickhouse was built for.",0,21432790,0,"[]","",0,"","[]",0 21434148,0,"comment","PeterZaitsev","2019-11-03 14:49:43.000000000","I would surely look at ClickHouse for such use case.
If you're rolling your own solution (rather than using DBaaS), ClickHouse is unparalleled when it comes to scaling and efficiency. Its SQL is somewhat restricted, but it should be able to do what you're looking for very well, especially since it was born out of the needs of Yandex Metrica, which is quite a similar use case to yours.
If you're using Kubernetes, there is a great Operator available from Altinity to run Clickhouse on Kubernetes.",0,21432790,0,"[21434267]","",0,"","[]",0 21434267,0,"comment","hwwc","2019-11-03 15:10:08.000000000","I’ve also had a great experience with clickhouse. Very easy to set up and maintain. Perhaps a little rough around the edges compared to Postgres, but I would look to clickhouse first for analytics.
It covers a lot of the scaling bottlenecks in their original PG-based pipeline and how they eliminated them.
(Disclaimer: I work on ClickHouse.)",0,21432853,0,"[]","",0,"","[]",0 14243734,0,"story","bockra","2017-05-02 03:26:11.000000000","",0,0,0,"[14243771]","https://clickhouse.yandex/",3,"Clickhouse DBMS","[]",1 28256205,0,"story","Algunenano","2021-08-21 11:50:08.000000000","",0,0,0,"[]","https://clickhouse.tech/blog/en/2021/performance-test-1/",1,"Testing the Performance of ClickHouse","[]",0 21439981,0,"comment","buremba","2019-11-04 09:07:21.000000000","We also developed a product that collects the user event data via SDKs (https://github.com/rakam-io/rakam), depending on the data volume the users pick one of the two deployment types. The first one is based on Postgresql 11 and it takes advantage of partitioned tables (one table for each event time for the configured time period such as month, week, etc.), BRIN indexes, and materialized views.
It's capable of handling 250M events per month without an issue, but beyond that we suggest users use a data warehouse that supports horizontal scaling and a columnar storage engine, so we use Snowflake with Snowpipe, Kinesis or Kafka (pluggable), and S3.
I have also tried using Clickhouse which is much cheaper but it lacks tooling for automatic scaling.",0,21432790,0,"[]","",0,"","[]",0 14246354,0,"comment","lima","2017-05-02 14:01:18.000000000","It looks like Yandex recently open-sourced their Graphite engine they built on top of Clickhouse:
https://github.com/yandex/graphouse
Looks really interesting for Graphite-like use cases.",0,14245603,0,"[]","",0,"","[]",0 25153782,0,"story","arunk-s","2020-11-19 20:09:41.000000000","",0,0,0,"[25155772,25155107,25154179,25155513,25157636]","https://arunsori.me/posts/postgres-clickhouse-fdw-in-go/",114,"Writing a Postgres Foreign Data Wrapper for Clickhouse in Go","[]",26 25155772,0,"comment","tristor","2020-11-19 23:43:07.000000000","I wonder if you've benchmarked your Go implementation against the Percona Lab's FDW for Clickhouse? https://github.com/Percona-Lab/clickhousedb_fdw",0,25153782,0,"[25157642,25157637]","",0,"","[]",0 21451953,0,"comment","lima","2019-11-05 13:02:03.000000000","Yandex' open source ClickHouse analytics database also had significant engineering effort applied to enable selective and permanent deletion of data in what used to be an append-only database that could only drop whole partitions. Regulation works, and the large tech companies are the most likely to be compliant, since they have the most to lose and have mature compliance and legal processes. Google is particularly good about this - consumer trust is their #1 asset.",0,21450881,0,"[]","",0,"","[]",0 25157637,0,"comment","AhtiK","2020-11-20 05:43:50.000000000","There's now also a more recent CH FDW written in C by the team at adjust, https://github.com/adjust/clickhouse_fdw",0,25155772,0,"[25157866]","",0,"","[]",0 25157642,0,"comment","arunk-s","2020-11-20 05:44:12.000000000","Hi, Author here.
I have to admit, I haven't done any benchmarking against the existing FDWs for Clickhouse. But I actually wrote the Go FDW 2 years ago (around Dec 2018). ;)
There weren't any Clickhouse FDWs available at that time, or I probably would've tried them as well.
Now I've just gotten around to writing the blog post and convincing the team to release the code.
Though I have a suspicion that the Percona FDW might win in benchmarks, as it won't have to pay the penalty of crossing between Go land and C land. [1]
1: https://www.cockroachlabs.com/blog/the-cost-and-complexity-o...",0,25155772,0,"[25157873]","",0,"","[]",0 25165121,0,"comment","zepearl","2020-11-20 21:21:09.000000000","Thank you.
Yeah, in my case the DBs "Clickhouse" and "MariaDB+MyRocks" might fit the 1MB case well (as they both never "update" existing files but keep writing new files, not just for "inserts" but for "updates" as well; Clickhouse anyway barely supports "update/delete" at all, heh).
On the other hand "PostgreSQL" and maybe "MariaDB+TokuDB" as well might need a small recordsize -> I'll have to test it. And anyway, splitting each DB into its own dataset seems to be a great idea :)
- Brotli compression, google
- Clickhouse (highly efficient OLAP db), yandex
- ESBuild (extremely fast js bundler), figma
- TailwindCSS (rapid css styling made ez)
- Caddy (nice auto-ssl for the lazy, also nice proxy)
- HAProxy (highly configurable proxy/lb)
- uWebSockets (highly performant web server with websockets; used by bitfinex, kraken, trello; has uWebSockets.js api for nodejs)",0,28277646,0,"[28291221]","",0,"","[]",0 28291221,0,"comment","pdevr","2021-08-24 16:37:41.000000000","I am amazed at the variety of repos shared - by you, as well as others. Thanks!
Repo links:
https://github.com/facebook/zstd
https://github.com/google/brotli
https://github.com/ClickHouse/ClickHouse
https://github.com/evanw/esbuild
https://github.com/tailwindlabs/tailwindcss
https://github.com/caddyserver/caddy
https://github.com/haproxy/haproxy
https://github.com/uNetworking/uWebSockets",0,28288604,0,"[]","",0,"","[]",0 28302924,0,"comment","protoduction","2021-08-25 15:32:13.000000000","I'm the technical founder of FriendlyCaptcha [1], a privacy-friendly proof-of-work alternative to reCaptcha that doesn't suck for end users (and also doesn't need any tracking or cookies, so legal/compliance teams like it too). While not a one-man company anymore, I've been the only engineer of it for long enough that I think it qualifies :)
* We use cloudflare workers for our API endpoints which have given us amazing reliability and scalability (which is of course very important for our service)
* We use cloudflare KV and cloudflare cache for some caching and logged in user sessions.
* Mailgun, FaunaDB, Sentry, LogDNA, Stripe, BigQuery.
* The web app is good old server side rendered html and css with a tiny bit of JS here and there.
* The widget is open source, written in Typescript (and as it has to run in really old browsers, there's Babel to transpile). The solver inside of it and the proof of work library are AssemblyScript (i.e. WASM), with a JS fallback.
* PHP for our wordpress plugin.
The way we operate is that we provide free (or very low cost) plans for hobby users and stuff like small (wordpress) blogs to hopefully make a dent into recaptcha's market share, it's a bit of a social mission. Larger companies pay for more advanced protection features, EU-only endpoints, (as well as custom agreements and other paperwork). That balance has worked well for us, and even the small customers that use the free service contribute to our protection.
The past half year or so we learned that our customers that bring in the lion's share of the revenue really prefer it if we keep their processing and data in Europe (our privacy friendliness and our EU-basedness are big selling points, even more than improved UX and accessibility). So much so that we are heavily investing in that, and we are slowly moving in favor of our own infra in Hetzner (Germany).
There our tech stack is fully Golang (Fiber framework), with Redis, Postgres and Clickhouse as data stores. The way that our system works is that we look at patterns of access, and we can tweak the difficulty of the proof of work challenge on a request by request basis. One nice property is that it's not all or nothing: if we suspect a puzzle request is from a spammer they will get a rather difficult puzzle which will take a while to solve, but at least it won't lock out any false positives. Clickhouse is fantastic for this purpose (putting in millions of events is not even close to its capacity, it's lightning fast). Of course the widget itself also has the most basic of anti headless browser checks, but that will only deter the most naive spammers.
Of course no captcha system is perfect or will protect against a spammer who is willing to spend real resources (e.g. pay for compute, or human labelers) to spam, but so far we're happy with its effectiveness, and it warms our hearts when we receive messages from blind or even deaf-blind users who encountered our captcha and went out of their way to say thank you :). I hope that at some point captcha labeling tasks can be a thing of the past.
[1]: https://friendlycaptcha.com",0,28299053,0,"[]","",0,"","[]",0 25187308,0,"comment","robertlagrant","2020-11-23 14:56:03.000000000","Very, very similar to the tech stack we have in my engineering teams (other than we use Flask and SQLAlchemy instead of Django). Really like it. I've not heard of ClickHouse before, but that looks interesting. (If I had a requirement around time series, I was thinking of checking out Timescale - https://www.timescale.com.)",0,25186342,0,"[]","",0,"","[]",0 25187467,0,"comment","eric_b","2020-11-23 15:08:19.000000000","It's cool that this stack works for the author, but if anyone is just starting out on their own one-person journey to build a product, I don't think they should follow the stack in this article.
There are too many dependencies and too much complexity here. Kubernetes is overkill for 95% of applications, especially single-founder SaaS businesses. Clickhouse may make sense for an analytics product, but its care and feeding is non-trivial (see: the ZooKeeper dependency and all the problems you get once you move to a distributed database).
Most people would be better off with a single beefy machine running Linux, Postgres and whatever web app framework they know. The whole "cattle not pets" thing is fine once you obtain product market fit and your product needs to scale up. At that point you'll have time and money to do it - before then you're just wasting cycles.",0,25186342,0,"[25187725,25188474,25187806,25187636,25187779,25190102,25190046,25189212,25188374,25188455,25188317,25188726,25189840,25188576,25192053]","",0,"","[]",0 25188257,0,"comment","joshmn","2020-11-23 16:12:58.000000000","Serial one-man SaaS builder here with a pair of soft-landings and one big hit:
I would agree with others who have opined that this is overkill. You couldn't pay me to use Terraform or Kubernetes for my SaaS unless it was a much larger team.
My stack usually consists of this: A framework I am comfortable with that removes a lot of boilerplate (happens to be the one I know best in the language I know best), the database I know best, the memory store I know best, Turbolinks for load times, a theme from ThemeForest (with the commercial license), self-hosted analytics, Amplitude, and Heroku.
The author mentions "lessons learned" regarding Clickhouse and Kubernetes. I'm not sure what their original goal was when starting their SaaS. Usually mine is financially motivated more than learning things.
What I'm really saying is: just go with what you know best. Having to scale is a good problem to have. Premature optimization is the root of all evil — asking "is x the right stack for y?" will always get a yes when you ask in the x subreddit.
Stop thinking and build.",0,25186342,0,"[25189096,25188385,25188334,25191687]","",0,"","[]",0 25188317,0,"comment","jturpin","2020-11-23 16:17:31.000000000","Setting up Kubernetes is a fixed cost in the beginning with clear payoffs. Things like microk8s, Rancher or GKE/EKS make it a lot easier to set up than it used to be. Deploying your app is then just a simple Helm chart, or another way of getting deployments out. If I was running any kind of business that was making more than zero dollars I would absolutely go with Kubernetes. I don't think it's as complicated as people think at any rate, and it lets you use premade Helm charts which can simplify the deployment of things like Clickhouse.",0,25187467,0,"[25188668]","",0,"","[]",0 25188536,0,"comment","joshmn","2020-11-23 16:37:51.000000000","That's really great, happy to hear.
I didn't mean to refute what you shared. In fact I was reinforcing some of your points. Doing things the cheap way is arguably the best way. Learning new tech while building is good, but I wouldn't recommend new tech to build a SaaS.
My concern was copycats. Some starters reading your post may have this feeling that "because Panelbear did it with Clickhouse and Kubernetes and Django, that's the only way to do it." What it really should read is "because Panelbear used what he knew best, he was able to focus on getting customers; that's the only way to do it."",0,25188385,0,"[25188901]","",0,"","[]",0 28325991,0,"comment","dikei","2021-08-27 10:07:02.000000000","Column store is just the first step in OLAP optimization. You also need an engine that's optimized for columnar calculation(minimize data transfer/copy, vectorized calculation, etc..). It's the reason why purpose-built columnar DB like Clickhouse still run circle around MySQL/PgSQL in analytic jobs, even when the later use their column store.",0,28325769,0,"[]","",0,"","[]",0 28334141,0,"comment","hodgesrm","2021-08-27 23:31:36.000000000","I work for a company that supports ClickHouse. Our focus is analytic systems, which tend to run large compared to OLTP systems.
* Sharding is part of schema design for any analytic app whose data exceeds the capacity of a single host. This is very common for use cases like web analytics, observability, network management, intrusion detection, to name just a few. Automatic resharding is one of the top asks in the ClickHouse community. (We're working on it.)
* How do I back up ClickHouse is one of our top 3 support questions in order of frequency. I just taught a ClickHouse class yesterday--part of a regular series--and it was the first question out of the gate from the audience. It has come up at some point in almost every customer engagement I can think of.
In my experience, your comment is only correct for relatively small applications that are not critical to the business.",0,28331950,0,"[]","",0,"","[]",0 25223026,0,"comment","dilyevsky","2020-11-26 19:49:21.000000000","There are a few oss dremel alternatives with Clickhouse being my favorite
To me their absolute best technology edge is still their storage system (colossus)",0,25218778,0,"[]","",0,"","[]",0 14326405,0,"comment","lima","2017-05-12 18:17:34.000000000","Currently using Kafka + ClickHouse for something very similar.
The Grafana integration comes in handy! I've been stalking https://github.com/vavrusa on GitHub for a while ;-)
Thanks for the writeup!",0,14314931,0,"[]","",0,"","[]",0 14326810,0,"comment","ndr","2017-05-12 18:57:51.000000000","If you are reading this comment before the article: they use Yandex's OLAP solution, ClickHouse, and Kafka.
The whole article explains how and why they got there, it is actually very well written!",0,14314931,0,"[]","",0,"","[]",0 21521068,0,"comment","subhajeet2107","2019-11-13 04:00:06.000000000","Custom Collector(analytics) -> Clickhouse -> Custom ETL Scripts -> Clickhouse -> Re:dash We tried metabase which is awesome , but Redash is also great and easy to setup as well if your team knows sql then Redash is better We also looked at druid and after some benchmarking we settled on Clickhouse, realtime queries even without etl runs within seconds in clickhouse",0,21513566,0,"[]","",0,"","[]",0 14333696,0,"comment","tejasmanohar","2017-05-14 00:46:25.000000000","How does ClickHouse compare to Redshift? I'm sure CloudFlare doesn't want to use AWS products, but I am curious if anyone's used both.",0,14314931,0,"[]","",0,"","[]",0 14335239,0,"comment","timeseriesdbfan","2017-05-14 11:53:27.000000000","Can anybody comment how this compares to open source alternatives [1] [2] ?
[1] https://clickhouse.yandex/ [2] https://eventql.io/",0,14334742,0,"[14335488]","",0,"","[]",0 14335488,0,"comment","ishi","2017-05-14 13:15:44.000000000","As far as I can see, the design shares some similarities with ClickHouse.
LittleTable sounds less performant than ClickHouse:
LittleTable - "returns 500,000 rows/second"
ClickHouse - "processes hundreds of millions to more than a billion rows and tens of gigabytes of data per single server per second."
LittleTable also seems to be much less flexible (e.g. ClickHouse offers quite a few different storage engines, views, materialized views, and other goodies).
ClickHouse rocks.",0,14335239,0,"[14340266]","",0,"","[]",0 25234661,0,"comment","zX41ZdbW","2020-11-28 02:31:11.000000000","ClickHouse presentations are almost 100% in HTML [1].
All of them are based on Shower [2], and to prepare a new presentation I just copy-paste a previous presentation and edit the HTML directly... actually very understandable and convenient, even for a C++ developer.
- [1] https://github.com/ClickHouse/clickhouse-presentations - [2] https://shwr.me/",0,25233136,0,"[25234741]","",0,"","[]",0 21533936,0,"story","softwarelimits","2019-11-14 09:04:57.000000000","",0,0,0,"[]","https://www.altinity.com/blog/2019/11/13/making-data-come-to-life-with-clickhouse-live-view-tables",2,"Making Data Come to Life with ClickHouse Live View Tables","[]",0 28366879,0,"comment","joshxyz","2021-08-31 12:10:41.000000000","i think good example too is clickhouse analytics db, their development is all on github, fun to read issues and release notes every once in a while",0,28363479,0,"[28387527]","",0,"","[]",0 25275181,0,"story","xoelop","2020-12-02 10:36:18.000000000","",0,0,0,"[25275219]","https://twitter.com/tinybirdco/status/1334080035351912456",3,"A contest to win free dev accounts of our ClickHouse-based SaaS product","[]",1 25275219,0,"comment","xoelop","2020-12-02 10:45:56.000000000","Hey people!
We (https://tinybird.co) are a small startup that makes it very easy to build the data infrastructure for real-time analytics on large amounts of data. It provides a layer of abstraction on top of ClickHouse to make ingesting, transforming and creating API endpoints on the data super fast, and after our first year in business we have a bunch of small customers and several big enterprise ones.
During Black Friday we had the biggest load on our systems so far (everything held up!) and we're running a small contest on Twitter to give free accounts to the 3 closest guesses on how many rows were read per hour on BF.
We're slowly opening up and we'd love to see our tech applied to new use cases, so feel free to participate or reach out to us to try out the product :D",0,25275181,0,"[]","",0,"","[]",0 25278085,0,"comment","PeterZaitsev","2020-12-02 16:36:14.000000000","This is what is the problem with such compatibility tests... they tend to test full power of the language while you really may use quite small subset in your application, as such even solution which is "20% compatible" may well meet all your application needs.
I remember in its early days MySQL had pretty poor SQL support (if you think about the full standard), which did not prevent it from having huge success.
Or a more recent example: ClickHouse, which I think is similar to VictoriaMetrics in that it does not fully implement SQL, but also adds many convenient extensions which are not part of the standard.
Chances are if you choose VictoriaMetrics you will find a lot more utility in the advanced features of MetricsQL than you lose from exact compatibility with PromQL",0,25271643,0,"[]","",0,"","[]",0 25288875,0,"comment","ants_a","2020-12-03 14:03:29.000000000","I did some time series data benchmarking recently. Most large data is "time series" data, but I will not digress on terminology right now. For the use case I was looking at, my results for InfluxDB did not qualitatively disagree with the above blogpost. Timescale got better compression efficiency, faster query results, and more constrained memory usage, but lower ingest speed at high concurrencies, blocked by WAL insert locking and single-threaded compression. If InfluxDB's measurement-oriented data model and query interface is a good fit for your use case, and you don't have high-cardinality data, then it might be convenient to use it, but it's a terrible choice for anything outside its niche.
However none of the databases tested above is anywhere close to the efficiency of a column store database that can do vectorized execution over batches of rows. ClickHouse is a good example of one such database. For queries that have to sift through large amounts of data, either filtering or aggregating it, the performance difference is easily >10x. I was seeing aggregation performance above 2B rows/s and that was I/O throughput bound.",0,25288393,0,"[25289028,25315094]","",0,"","[]",0 25288898,0,"comment","BenoitP","2020-12-03 14:06:15.000000000","Pre-reading hypothesis: TimescaleDB is declared orders of magnitude faster because the benchmark is serving results they're computing at write time? Is it just like the ClickHouse benchmark from earlier, where they read from a `CREATE TABLE [...] ENGINE = AggregatingMergeTree`?
Post-reading:
"faster queries via continuous aggregates". So is this it? I couldn't find how tables / materialized views were created in the source though [1].
TimescaleDB is probably a very good product (and pg-compatible!), but producing such articles while hiding the usage of a magic feature is sort of dishonest. Why not make an article directly about the power of the feature? It's hurting their brand reputation a bit.
[1] https://github.com/timescale/tsbs",0,25287793,0,"[25289214,25289041]","",0,"","[]",0 25289028,0,"comment","hardwaresofton","2020-12-03 14:20:32.000000000","I know the first thing that shocked me was how they could get so close to something that was purpose built for time series. Even being within spitting distance is really great in my opinion for an off the shelf, general tool.
> However none of the databases tested above is anywhere close to the efficiency of a column store database that can do vectorized execution over batches of rows. ClickHouse is a good example of one such database. For queries that have to sift through large amounts of data, either filtering or aggregating it, the performance difference is easily >10x. I was seeing aggregation performance above 2B rows/s and that was I/O throughput bound.
Agreed -- OLTP (in the case of Timescale) and purpose-built timeseries-focused (but not necessarily analytics-focused) DBs are no match for a proper OLAP database.
Did you write about this anywhere? Would love to read it. I've never had a real need for the kind of stuff that Clickhouse does, but it looks to be the best in class for F/OSS OLAP DBs. Have you ever tried Druid?",0,25288875,0,"[25299727,25297247]","",0,"","[]",0 14402678,0,"comment","justinsaccount","2017-05-23 16:35:09.000000000","One problem I've noticed is that there are no good "medium data" tools.
Column stores are crazy fast, but there isn't much simple tooling built around things like parquet or ORC files. It's all gigantic Java projects. Having some tools like grep, cut, sort, uniq, jq etc. that worked against parquet files would go a long way to bridge the gap.
Something like pyspark may be the answer, I think it may be possible to wrap it and build the tools that I want.. like
find logs/ | xargs -P 16 json2parquet --out parquet_logs/
parquet-sql-query parquet_logs/ 'select src,count(*) from conn group by src...'
I've been testing https://clickhouse.yandex/. I threw it on a single VM with 4G of RAM and imported billions of flow records into it. Queries rip through data at tens of millions of records a second. Edit: another example... I have a few months of ssh honeypot logs in a compressed json log file. Reporting on top user/password combos by unique source address took tens of minutes with a jq pipeline. The same thing imported into clickhouse took a few seconds to run something like
select user,password,uniq(src) as sources from ssh group by user,password order by sources desc limit 100
",0,14401399,0,"[14403402,14403969,14408570,14402890,14402747,14404418,14408774,14407410,14404099,14403164,14403874]","",0,"","[]",0
14403028,0,"comment","justinsaccount","2017-05-23 17:12:21.000000000","Exactly.. I used to say that I didn't have "big data", but I have "annoying data". ClickHouse turns my annoying data back into something that I can query in 30 seconds.
I just ran a random query to find what day had the most connections:
select day,count() as c from conn group by day order by c desc limit 1
And that took all of: 1 rows in set. Elapsed: 16.412 sec. Processed 1.43 billion rows, 2.87 GB (87.33 million rows/s., 174.65 MB/s.)
",0,14402890,0,"[]","",0,"","[]",0
14403071,0,"story","saleiva","2017-05-23 17:18:58.000000000","",0,0,0,"[14404254]","https://carto.com/blog/inside/geospatial-processing-with-clickhouse",12,"Geospatial processing with Clickhouse","[]",1
14404794,0,"comment","justinsaccount","2017-05-23 20:07:15.000000000","> Today, we’re very happy with Redshift, and we have 5 billion rows stored in our database. 1.4 billion of our connection logs (24 fields) takes up 89G on my clickhouse VM. 5 billion records would take ~320G.
Based on http://tech.marksblogg.com/benchmarks.html a 6-node ds2.8xlarge redshift cluster is about as fast as clickhouse on a single i5.",0,14404418,0,"[]","",0,"","[]",0 14404974,0,"comment","justinsaccount","2017-05-23 20:23:18.000000000","I did above:
> I've been testing https://clickhouse.yandex/. I threw it on a single VM with 4G of RAM and imported billions of flow records into it. Queries rip through data at tens of millions of records a second.
In my experience, people often underestimate the continuous effort to maintain Kimball's "Enterprise Data Warehouse Bus Architecture" diagram, even with more powerful machines and modern distributed tooling.
In today's fast-evolving world of Internet apps, data use cases and scenarios change quickly. That brings its own set of challenges.
Having good, usable tools for managing the lifecycle of entity or event definitions, their variants (emitted/logged vs cleaned/processed/synthesized), their data quality checks etc., and ensuring they are easily discoverable and understandable by everyone in the org is super crucial, and it is significantly under-appreciated.
Usually, strong systems engineers who are in charge of the data platform focus on building the data infra (job scheduling, data pipelines execution, storage etc) but the crucial work of defining the data dictionaries, event or entity models etc are left out. The data producers and data consumers who are spread out throughout the organization have to muddle through this on their own without any centralized tooling to support this activity. These make data use very difficult and siloed.
Usually, there would be a team of BI analysts who are tasked to get some answers out of the data for the questions asked of them by various data users. Funnily these analysts are also working in silos assigned to those different data users. Inevitably, they become the super-inefficient intermediary between the data users and the data insights.
The pre-cooked data insights are presented in spreadsheets and slides in review meetings – where a narrative is already prepared by the analysts.
This robs the opportunity for the data users to explore and ask data questions on their own in a fast iteration cycle to improve their intuition and understanding of their product/business environment.
IMO, these challenges still remain largely unsolved even to this day across organizations of all size and scale.",0,28401230,0,"[]","",0,"","[]",0 28426142,0,"comment","hodgesrm","2021-09-05 17:44:08.000000000","JSON blob + columns is the recommended approach for handling semi-structured data in ClickHouse. It's easy to add columns on the fly, since you just give the expression to haul out the data using a DEFAULT clause. ClickHouse applies it automatically on older blocks without rewriting them. For new blocks it materializes the data as rows are inserted.",0,28422618,0,"[]","",0,"","[]",0 25311386,0,"comment","hardwaresofton","2020-12-05 03:18:01.000000000","> I was really surprised at how well Timescales compression worked. It was pretty much comparable to best in class columnstores. Only the row-by-row query execution engine was holding it back. Perhaps something that future versions of postgres can help with.
I think zedstore [0] might be something that could help here. I've mentioned it in the past, but one of the best things about postgres is its extensibility, and if timescale rides that wave (and maybe contacts zedstore to get this integration started early) it could be awesome.
[0]: https://blogs.vmware.com/opensource/2020/07/14/zedstore-comp...
> Didn't look into Druid in detail, but did try out Hive. Both of them look more suitable for cases where there is significant engineering effort in developing the data ingest and structuring pipeline. I wouldn't recommend either to a small team. With a measly triple digit TB database size both seemed overkill.
Thanks for this -- I haven't tried it yet at all but will try to remember this. ClickHouse was already first on my list for hobbyist->enterprise scalability but this cements it.",0,25299727,0,"[]","",0,"","[]",0 28427841,0,"comment","gkoberger","2021-09-05 21:11:22.000000000","Oh I've been thinking a lot about this! It might not be the answer you're looking for since they're all very UX-related and not particularly database-specific, however I have a few ideas.
Sometimes I feel like databases were created by people who never built a website before. Most websites are pretty similar, and databases historically have never felt (to me at least) "modern". I always feel like I'm fighting against them, and making usability concessions for the sake of performance.
First, the ability to subscribe to external data sets. I feel like I spend so much time writing crappy syncing code with external APIs (like Clearbit, GitHub, etc), and it would be so much nicer if I could just "connect" with them and know it will be fairly up to date.
I also think there are so many things everyone finds themselves redoing for no reason. For example, almost every site on the internet has a user database with sessions, and each user has an email (that must be valid + verified), a password (that's encrypted + salted) and 2FA (which everyone is basically implementing themselves). It'd be so nice if the database just "knew" it was a user, and you were tweaking the presets rather than building it from scratch.
Every single company has similar workflows they each solve themselves (often in insecure ways): they all have a production database, staging environments, migrations, direct access for customer support to fix things, local db access/clones for development, etc. I'd LOVE a database that was created with all these use-cases in mind... such as a way to connect to a DB locally but scrub sensitive data, take care of migrations seamlessly, etc.
This might be a bit too "in a magical world"-y, but I'd love to not have to think about tradeoffs. Kind of like an automatic car, I'd love my database to be able to shift based on the types of data and amount of read/writes. At my company, we have 3-4 different databases for different reasons (Mongo, REDIS, ElasticSearch, ClickHouse), and it gets really difficult to keep all the data synced and connect them behind the scenes. I'd love to just never have to think about the low-level data store ever again, and have the DB do all the work without us having to worry.
There's a number of primitives that I think are used a lot, and it'd be amazing if they were built in. For example, time. It'd be great to easily get the difference between two times, or total the times of a bunch of rows. Airtable has a time primitive, and it's amazing how much friendlier it is to use.
Overall, I'd also love it to just feel a lot more like Airtable, including an Airtable-like interface for working with it (right down to the ability to create custom views and create on-the-fly forms people can submit data to). I honestly use Airtable for most of my DB needs these days (for one-off small projects), and it's such a delight to use.
Maybe I'm underestimating the importance but... I feel like databases are pretty performant these days. I hope that we can start seeing dramatic UX improvements, since we don't have to optimize for performance the same way we have in the past.",0,28425379,0,"[28427905,28446918]","",0,"","[]",0 28430409,0,"comment","5e92cb50239222b","2021-09-06 04:54:42.000000000","> moving old records to separate cold storage
FWIW this is available in ClickHouse (which is an analytics database, though)
https://clickhouse.tech/docs/en/engines/table-engines/merget...",0,28429229,0,"[]","",0,"","[]",0 25315094,0,"comment","valyala","2020-12-05 15:12:42.000000000","ClickHouse is a great OLAP database with outstanding performance! It can be used for collecting and querying observability data [1]. But it may be hard to properly design a database schema for ClickHouse for storing general-purpose observability data. That's why we created VictoriaMetrics - a purpose-built time series database based on ClickHouse architecture ideas [2]. It just works out of the box without the need to design a database schema, while providing outstanding performance [3].
[1] https://github.com/lomik/graphite-clickhouse
[2] https://valyala.medium.com/how-victoriametrics-makes-instant...
[3] https://valyala.medium.com/measuring-vertical-scalability-fo...",0,25288875,0,"[]","",0,"","[]",0 25315289,0,"comment","valyala","2020-12-05 15:38:01.000000000","3) comparison with ClickHouse
4) comparison with VictoriaMetrics",0,25291441,0,"[]","",0,"","[]",0 21613683,0,"comment","missosoup","2019-11-23 11:55:33.000000000","IIRC clickhouse still doesn't have this guarantee. It guarantees exactly one ingestion but not that the ingested event gets processed all the way through to view. And if the processing fails, that event won't be retried and is now gone.",0,21613635,0,"[]","",0,"","[]",0 21614126,0,"comment","1996","2019-11-23 13:58:53.000000000","Or normalize your data and use clickhouse.",0,21614007,0,"[21625282]","",0,"","[]",0 28440717,0,"comment","g48ywsJk6w48","2021-09-07 03:44:37.000000000","
Location: Montreal, Canada
Remote: Yes
Willing to relocate: within Canada
Technologies:
Backend: Perl, Python, Ruby
Frameworks: Mojolicious, Flask, Django, Sinatra, Ruby on Rails
Frontend: Responsive HTML/CSS, JavaScript
SQL databases: PostgreSQL, MySQL, ClickHouse, Oracle, InfluxDB
SQL databases skills: DBA role, query optimization, performance tuning, database design
OS: Linux - all major distributions (Debian, CentOS, Ubuntu, RHEL) + FreeBSD
OS skills: OS security, optimization, troubleshooting, custom install images
everyday instruments: vim, emacs, zsh, dozens of CLI utils, code testing tools
instruments: nginx, git, docker, LXC, CI/CD tools, stress test tools
Résumé/CV: on request
Email: g48ywsjk6w48@gmail.com
Hard-working, reliable software engineer who enjoys troubleshooting bottlenecks, solving challenges, and working with others. In search of a challenging position that provides professional growth opportunities.
",0,28380659,0,"[]","",0,"","[]",0
28441890,0,"story","SerCe","2021-09-07 07:14:02.000000000","",0,0,0,"[]","https://clickhouse.tech/blog/en/2021/performance-test-1/",1,"Testing the Performance of ClickHouse","[]",0
21625282,0,"comment","shin_lao","2019-11-25 03:44:27.000000000","Clickhouse is a nightmare to scale and operate and doesn't support TS joins.",0,21614126,0,"[]","",0,"","[]",0
18027070,0,"comment","qaq","2018-09-19 19:37:57.000000000","Nice product for small/mid scale workloads. It's no Vertica or ClickHouse but if you do not need the scale should work well.",0,18026699,0,"[18027429,18027303]","",0,"","[]",0
18029426,0,"comment","qaq","2018-09-20 02:58:25.000000000","Would be interesting to see benchmarks vs performant options e.g Vertica and ClickHouse",0,18028893,0,"[]","",0,"","[]",0
18036297,0,"comment","sethhochberg","2018-09-20 23:55:33.000000000","IMO this is one of the biggest issues with the alternative storage engines for MySQL-family databases... we've also experimented with TokuDB for log-like data but found that, ultimately, the shortage of detailed documentation and operational issues like needing to develop homegrown tooling for things like backups overpowered the performance benefits.InnoDB isn't perfect, but it _is_ exhaustively documented and pretty well-understood, with a great set of related tools from Percona, etc, for simplifying operations. That goes a long way.
Recently we've switched back to using InnoDB for ingestion on one of our write-heavy tables and aggressively archiving the data out of it and into Clickhouse (InnoDB deals with the high volume of concurrent inserts, data is loaded into Clickhouse in large batches for querying). By comparison to Toku or RocksDB, Clickhouse is refreshingly well-documented and it's easy for us to make consistent backups with ZFS snapshots.",0,18035821,0,"[]","",0,"","[]",0 18048783,0,"comment","manigandham","2018-09-23 00:46:32.000000000","There is absolutely nothing special about "time-series" to be an actual type of database. It's all hype.
Time-series data is data that has a time component which is usually the primary property to query by. Almost every database can handle this, like MongoDB/Redis for lightweight use, Cassandra for write-heavy/global replication, ElasticSearch for raw search-style querying, or an OLAP columnstore (Redshift, MSSQL, Snowflake, Clickhouse) for serious ingest and querying. Monitoring systems like Prometheus and Netdata even have time-series storage built-in because it just isn't that hard.
InfluxDB is only really useful in the context of the integrations that it provides with the common monitoring software stacks. You can easily just create a regular relational table with "time" as a column and get great performance with an index, and then use partitioning to break up the table to get great performance over lots of data.
pg_partman is an extension that makes partitioning automatic. Timescale is an extension that makes time-focused partitioning automatic. Citus is an extension that makes partitioning across multiple nodes automatic. Or use one of the distributed OLAP systems mentioned above.",0,18045506,0,"[]","",0,"","[]",0 28471488,0,"story","eternalban","2021-09-09 16:29:12.000000000","",0,0,0,"[]","https://clickhouse.tech/docs/en/development/architecture/",2,"Overview of ClickHouse Architecture","[]",0 28480513,0,"story","diminish","2021-09-10 12:23:24.000000000","",0,0,0,"[]","https://www.youtube.com/watch?v=cMdQsxolcqc",1,"Benchmarking ClickHouse vs. TimescaleDB","[]",0 25365005,0,"comment","orloffv","2020-12-09 21:07:54.000000000","Amazing!
You write: "all the events in all GitHub repositories since 2011" I have a question: When was the data loaded? Will the data in https://gh-api.clickhouse.tech/play be updated?",0,25364000,0,"[25365281]","",0,"","[]",0 25365281,0,"comment","zX41ZdbW","2020-12-09 21:24:09.000000000","The downloadable dataset was created two days ago and the queries were run on this dataset.
Data on https://gh-api.clickhouse.tech/play (for interactive queries) is updated every hour as described in the article, but downloadable datasets in .xz are not updated.",0,25365005,0,"[]","",0,"","[]",0 25365355,0,"comment","siradjev","2020-12-09 21:28:26.000000000","The query below says there are 40 threads, not 80, and RAM for Clickhouse is limited to 20GB =)
select * from system.settings where name IN ('max_memory_usage', 'max_threads'); -- 20GB limit, 40 CPU -- 40 threads, 20GB RAM limit",0,25364319,0,"[25365448]","",0,"","[]",0 25365448,0,"comment","zX41ZdbW","2020-12-09 21:33:45.000000000","Yes, 80 vCPU are logical cores, but ClickHouse is using the number of threads equal to the number of physical cores by default (40 threads).
This is a reasonable default - when setting the number of threads to the full vCPU count, performance of a single query will be slightly better, but this will badly affect the max number of concurrent queries.
https://gh-api.clickhouse.tech/play?user=play#c2VsZWN0IGNvdW...
(replace condition on repo_name to your favorite repo)",0,25364000,0,"[25365770]","",0,"","[]",0 25365770,0,"comment","iamazat","2020-12-09 21:53:47.000000000","Something more complex - how long it takes to merge non-trivial PRs
https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUCiAgIC...",0,25365653,0,"[]","",0,"","[]",0 25372772,0,"story","zX41ZdbW","2020-12-10 12:04:44.000000000","",0,0,0,"[25372773]","https://github.com/github-sql/explorer",2,"Show HN: Insights on GitHub Ecosystem with ClickHouse","[]",1 14480419,0,"story","george3d6","2017-06-04 08:32:56.000000000","",1,0,0,"[]","https://blog.cerebralab.com/#!/blog/%0AClickhouse%2C%20a%20database%20for%20the%2021st%20century%0A",1,"Clickhouse","[]",0 25384372,0,"comment","zX41ZdbW","2020-12-11 10:58:49.000000000","+ more details: https://gh.clickhouse.tech/explorer/",0,25384371,0,"[25384547,25393168]","",0,"","[]",0 25384547,0,"comment","mvfmvf","2020-12-11 11:27:33.000000000","Hm, interesting... Especially the possibility to run your own queries
It seems that using it we can make our own summary of the year for a favorite repo:
https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUIGV2ZW...
Or cross-check the stats published by GitHub (there are differences):
https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUIGV2ZW...",0,25384372,0,"[]","",0,"","[]",0 21683604,0,"comment","Dim25","2019-12-02 16:04:00.000000000","SEEKING WORK | San Francisco, CA, USA | REMOTE or LOCAL
Hi all, I'm Dima (https://www.linkedin.com/in/dim25/) from SF (San Francisco Bay Area). Startup Founder, PM, Full-stack with Machine Learning experience.
Python: * Machine Learning: (TensorFlow; Keras) [+ML Engineer Nanodegree]. * Computer Vision (OpenCV; TensorFlow). * Media \ communications (Twillio; Ring Central; Kurento). * Streaming \ Workflows: Kafka+Faust; Airflow; Celery. * Web servers (Flask), and many other applications of Python.
Web Development: HTML; CSS; Bootstrap. JS (Front-end + Node.js): All the basics necessary for web development; Basic experience with d3.js and other visualizations and dashboards tools.
DBs: MongoDB; ElasticSearch; Redis (incl. RediSearch), SQLs. Basics of ClickHouse.
C/C++: Basic experience with ROS (Robot Operating System). [As a part of Self-Driving Car engineering nanodegree].
Most recent projects:
* Analyzing millions of job postings. Orchestration (Airflow, Docker);
Data gathering (Selenium; Scrapy; Plugins; MitmProxy), enrichment, and analytics.
* CCTV Stream analytics (TensorFlow computer vision w/ Kurento WebRTC gateway).
Previously: * Co-founder at MBaaS startup. 'Firefighter', from $0 to $120K MRR.
* Hired and managed a team of 15 mobile developers to assist with the delivery of
the #1 mobile banking app in Russia (iOS + Android).
* AWM, rev-share with Kinks (guys from San Francisco Armory).
Especially a good match if: you need a cost-efficient prototype; need someone to fix and deliver your machine learning or automation strategy; are looking for an early-stage full-stack dev with ML experience; or have a remote team you don’t have time to manage. Rate: open to discuss. Don't need perks, 'cool' office spaces and other shenanigans. Available now.
Email: dima_cv1@protonmail.com
Latest version of this CV: https://bitly.com/dima_cv1",0,21683553,0,"[]","",0,"","[]",0 18083114,0,"comment","manigandham","2018-09-27 09:40:03.000000000","Modern column-oriented databases are rather incredible at what they do, and I'm surprised by how little they are used or known about. Redshift, BigQuery, Snowflake, Azure SQL DW, SQL Server, MonetDB, Vertica, Greenplum, MemSQL, kdb+, Clickhouse, even Druid are all column-stores that can do sub-second queries on massive amounts of data.
I also want to note that time-series databases are basically obsolete at this point because time-series data is very well handled by these OLAP column-stores. Create a table with time as a primary or sort key and you'll get fast queries with full SQL and joins.",0,18076547,0,"[18085635,18084952,18083465]","",0,"","[]",0 18083777,0,"comment","halayli","2018-09-27 12:25:28.000000000","mapd beats it and iirc so does clickhouse.",0,18083465,0,"[18083798]","",0,"","[]",0 18085417,0,"comment","halayli","2018-09-27 15:30:33.000000000","Yes there is, and that's why tpc benchmarks exist. I didn't say there's a one-database-fits-all. But comparing kdb+ to mapd and clickhouse is a very reasonable comparison.",0,18083798,0,"[18085473]","",0,"","[]",0 18099796,0,"story","jinqueeny","2018-09-29 11:18:32.000000000","",0,0,0,"[18109037,18110685,18104978,18109250,18115267,18118294,18109219]","https://github.com/yandex/ClickHouse",104,"ClickHouse, a column-oriented DBMS to generate analytical reports in real time","[]",32 25406354,0,"comment","joshxyz","2020-12-13 13:15:21.000000000","ClickHouse lol, that database blows my fucking mind",0,25401590,0,"[]","",0,"","[]",0 21708313,0,"comment","hwwc","2019-12-04 23:28:58.000000000","SEEKING WORK | Design, Full Stack Development & Data Engineering Location: US Remote: Yes
We're a multidisciplinary designer/developer team experienced in the entire web application stack:
- Wireframing & design mockups
- Design systems
- Front & back-end development
- Web accessibility & responsive design
- ETL
- Database design & Data APIs
- Devops & build tooling
For every client, we focus intensely on:
- a coherent design system for better user experience
- performance as a part of the user experience
- maintainable code
- timely and transparent communication
Relevant projects include:
- A web platform for reporting & analyzing the state of open source software (https://opensourcecompass.io/).
- An analytics engine for web applications (https://github.com/hwchen/tesseract).
Primary Skills: Sketch, Photoshop, (S)CSS, JS, React/Vue/Svelte, Rust, Linux, Google Cloud Platform, ClickhouseDB, Postgresql
Production experience with: Python/Pandas, Node/JS, AWS, Docker, Redis, MySql, Nginx, PHP
Github: https://github.com/hwchen | https://github.com/perpetualgrimace
Contact: hello@hwc.io",0,21683553,0,"[]","",0,"","[]",0 18109037,0,"comment","lykr0n","2018-09-30 23:26:48.000000000","Clickhouse has some weird quirks when you think of it as a SQL database, but it's astounding to use. It's faster than one would think, it can do some really cool data modeling, and it provides a wealth of features for the average user out of the box.
The most important thing, and the thing that makes it attractive to me, is that it is almost stupidly simple to set up and get running. It's quite simple (once you wrap your head around it) to do sharding or replication and scale up. The zookeeper stuff takes a bit more effort, but most of that is due to zookeeper and not ClickHouse.",0,18099796,0,"[18109197,18110566]","",0,"","[]",0 18109197,0,"comment","gary__","2018-10-01 00:11:44.000000000","A look through the below does highlight some of its differences with a standard sql database.
https://www.slideshare.net/Altinity/migration-to-clickhouse-...
Year on now, perhaps things have changed.",0,18109037,0,"[]","",0,"","[]",0 18109250,0,"comment","georgewfraser","2018-10-01 00:33:52.000000000","The basic techniques for implementing a fast column-store data warehouse have been well-known for 10 years. There are several excellent commercial and open-source implementations of these techniques:
- BigQuery
- Snowflake
- Redshift
- Presto
ClickHouse is not one of them. It doesn't have:
- Transactions
- Distributed joins
- Separate compute from storage
- UPDATE
- User management
I don't mean to be a jerk, I'm just trying to save people some time. Columnar DBs are well-trod territory and ClickHouse is way behind.",0,18099796,0,"[18109373,18109565,18115230,18109305,18109593,18109568,18109554]","",0,"","[]",0
18109305,0,"comment","bretthoerner","2018-10-01 00:58:18.000000000","ClickHouse stable has both UPDATE and DELETE.",0,18109250,0,"[18109375]","",0,"","[]",0
18109373,0,"comment","ehfeng","2018-10-01 01:18:19.000000000","I wouldn't call Redshift "excellent", nor would I call ClickHouse "way behind". ClickHouse was the best choice for my last employer's use case (https://twitter.com/zeeg/status/987009550501928960), after many other solutions were tested and benchmarked.Just because a tool doesn't have a specific feature checklist doesn't mean you should categorically rule it out, particularly if you don't have experience using/running/deploying it.",0,18109250,0,"[]","",0,"","[]",0 18109565,0,"comment","manigandham","2018-10-01 02:13:03.000000000","Redshift doesn't separate compute from storage either unless you're using Spectrum. Presto isn't a database at all and can read from many data stores. The rest are all cloud-hosted with lots of moving parts. MemSQL, Vertica, Actian, Greenplum, and SQL Server are better comparisons.
ClickHouse is a column-oriented db and actually one of the most advanced, focusing on performance at all costs with lots of table storage engines that provide flexibility for your exact use-case. It also supports distributed joins and deletes but has some limitations they are working on.
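As a hedged illustration of that flexibility (table and column names here are hypothetical, not from this thread), a typical MergeTree definition sorted by time looks roughly like:

```sql
-- Hypothetical example: a MergeTree table ordered by time,
-- the usual starting point for time-series-style workloads.
CREATE TABLE metrics
(
    event_time DateTime,
    host       String,
    value      Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (host, event_time);
```

The ORDER BY key doubles as the sparse primary index, which is what makes range scans over time so fast.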
It can definitely use better tooling and compatibility, but that's the tradeoff the core team made, and it seems to be working well for the companies that can afford the time and talent.",0,18109250,0,"[18109715]","",0,"","[]",0 18109568,0,"comment","dikei","2018-10-01 02:13:13.000000000","I wouldn't dismiss ClickHouse so quickly. There's no one-size-fits-all solution for data warehouses; everything has its own quirks.",0,18109250,0,"[]","",0,"","[]",0 18109715,0,"comment","georgewfraser","2018-10-01 02:54:12.000000000","My point is there’s just no advantage to ClickHouse. The things that make it fast are in every column store. There are other options that do everything it does and more.",0,18109565,0,"[18110144,18112923]","",0,"","[]",0 18109893,0,"comment","eldargab","2018-10-01 03:44:00.000000000","MonetDB is a sort of drop-in replacement for a regular database with all expected features and good compatibility.
On the other hand, ClickHouse will be incompatible with most existing tools, and it's better to learn its limitations and workaround techniques in advance. But once you dump substantial amounts of time-series data into it, you'll find it 10+ times faster and 2-3 times smaller than MonetDB.",0,18109219,0,"[18110425]","",0,"","[]",0 18110027,0,"comment","sin7","2018-10-01 04:20:49.000000000","You can run it on a cluster or a single server. It's pretty easy to set up either way.
No updates. Fast inserts.
You can only join two tables at a time, but the joins can be chained to deal with this limitation.
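A hedged sketch of that chaining workaround (table names a, b, c are hypothetical): the first join is wrapped in a subquery, and the result is then joined to the next table:

```sql
-- Each SELECT carries a single join; the subquery result
-- is joined to the next table in the outer query.
SELECT t.id, t.b_val, c.val AS c_val
FROM
(
    SELECT a.id, b.val AS b_val
    FROM a
    ANY LEFT JOIN b USING (id)
) AS t
ANY LEFT JOIN c USING (id);
```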
I tried Monet. It wasn't very stable for me. I didn't stick with it long enough to judge it. ClickHouse has the backing of Yandex. I think that makes a huge difference.
I have used Clickhouse for the past year and have thrown 3000-column by 120-million-row tables at it. It worked where PostgreSQL came to a halt. Different use cases, really.
It fits my use case perfectly: large amounts of data with no updates and tons of aggregations. It's lightning fast.",0,18109219,0,"[18110153]","",0,"","[]",0 18110566,0,"comment","tadkar","2018-10-01 07:10:02.000000000","Second the “stupidly simple to setup and get running”. My company works with billion row datasets on client sites where we get super locked down accounts. Clickhouse is a single binary that you can run with no actual “install” needed.
Also echo the comments in the rest of the discussion about it being blazing fast. On our beefier machines we get querying in the 100s of millions of rows per second when the data is not in cache.",0,18109037,0,"[]","",0,"","[]",0 18110986,0,"comment","dschuler","2018-10-01 08:58:58.000000000","By one developer you mean Yandex or that most commits are made by a couple of users? Being backed by a large company (the Russian Google apparently) that has an independent revenue stream seems like a large plus, but maybe not enough to cancel out.
I'm wary of investing effort into a potentially unsupported project as well, but I wonder if ClickHouse only seems "out there" because we're not aware of the Russian tech ecosystem (at least I'm not).
People don't seem concerned about building anything with Firebase, but Google has a track record of changing its mind about priorities and service pricing.
What would you recommend instead for a column-oriented db that you can self-host (commercial or open source)?",0,18110685,0,"[18113423]","",0,"","[]",0 28526526,0,"story","x4m","2021-09-14 15:45:44.000000000","",0,0,0,"[28557268]","https://github.com/jaegertracing/jaeger-clickhouse/blob/main/blog/post1.md",31,"ClickHouse Storage for Jaeger Tracing","[]",1 28533209,0,"comment","hodgesrm","2021-09-14 23:56:41.000000000","Lakehouse seems like an evolution of Hadoop to add better SQL and transactions + reasonable performance on large datasets. ("Reasonable" = not dog slow like Hive.) Reading this article as well as the survey Armbrust, Ghodsi et al. paper [0] you might easily forget that a large fraction of new data warehouse use cases get real-time data from event streams like Kafka, not S3 or HDFS. They also require stable response in small numbers of milliseconds for the more demanding use cases.
So Lakehouse is not really an evolution of data warehouses or at least new ones like ClickHouse and Druid. SQL data warehouses are highly optimized for analytic query speed. Think columnar storage, high compression, vectorized query, materialized views, etc. They also couple well with event streams. You can't get high performance without optimized storage and very tight integration of parts.
I have massive respect for Ali and Matei but there's no way Lakehouse will replace this.
[0] http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
Edit: replaced "original" with "survey".",0,28531009,0,"[28535395]","",0,"","[]",0 18113423,0,"comment","drej","2018-10-01 15:21:27.000000000","It's in the Yandex namespace and it is used by said company, which is a huge plus. But if you looked at the development history just a while ago, it was highly dependent on Alexey.
It reminds me of Grumpy (https://github.com/google/grumpy), which was released by Google, but was later basically abandoned when the lead left Google.
That being said, the situation is better than last time I checked this, there is a handful of somewhat active developers. https://github.com/yandex/ClickHouse/graphs/contributors",0,18110986,0,"[]","",0,"","[]",0 18114792,0,"comment","bsg75","2018-10-01 17:27:44.000000000","> It seemed (and still seems) like a project that lives and dies with one developer
One major contributor (who may be project lead at Yandex?) and a lot of active contributors: https://github.com/yandex/ClickHouse/graphs/contributors",0,18110685,0,"[]","",0,"","[]",0 18115192,0,"comment","PeterZaitsev","2018-10-01 18:07:18.000000000","If you need Advanced SQL Support ClickHouse is not there (Yet) but if you need high performance for relatively basic queries ClickHouse is great.
It is developed mostly by ClickHouse staff but there is at least one company https://www.altinity.com/ which offers Commercial Support, Consulting, and Training for ClickHouse",0,18110685,0,"[]","",0,"","[]",0 18115230,0,"comment","PeterZaitsev","2018-10-01 18:11:02.000000000","There is quite a difference between theoretical technologies and a stable high-performance implementation. The majority of things ClickHouse does are very well known; it just actually does them
Here is an example performance comparison we did at Percona: https://www.percona.com/blog/2017/02/13/clickhouse-new-opens...",0,18109250,0,"[]","",0,"","[]",0 18115267,0,"comment","PeterZaitsev","2018-10-01 18:14:34.000000000","ClickHouse indeed does not do "Separate Compute from Storage", yet that is an architectural decision, not a feature gap. Running ClickHouse with directly attached storage and built-in replication can be super fast and cost-efficient. It works best for stable workloads",0,18099796,0,"[]","",0,"","[]",0 18118294,0,"comment","tuananh","2018-10-02 02:06:09.000000000","CloudFlare is using ClickHouse. That does say something
https://blog.cloudflare.com/http-analytics-for-6m-requests-p...",0,18099796,0,"[]","",0,"","[]",0 28538831,0,"comment","hodgesrm","2021-09-15 13:33:28.000000000","Thanks for your comment and sorry if I was unclear. I'm not arguing that storage and compute need to be directly coupled. However, storage does need to be very carefully optimized to match compute, especially when you are trying read events and make them available for immediate storage. ClickHouse for example has multiple formats for table parts in order to allow efficient buffering of rapidly arriving records. Using customized formats has allowed the project to evolve quickly.
In fact the Lakehouse paper seems to be setting up a strawman. Here are three examples.
* The new low-latency SQL data warehouses are open source. They are not locking data in proprietary formats. We're not Snowflake.
* SQL data warehouses are already headed toward support for object storage for the same reason everyone else is: costs and durability in large datasets. Here's just one sample of many: https://altinity.com/blog/tips-for-high-performance-clickhou...
* Not everyone cares about ML and data warehouse integration. From my experience working on ClickHouse only a small percentage of users integrate ML. By contrast 100% of our users care about efficient visualization and keeping data pipelines as short as possible, hence the benefit of a tightly integrated server.
I think there's actually a bifurcation of the market into low-latency use cases driven by event streams versus much larger datasets containing unstructured/semi-structured data stored in low-cost object storage. Lakehouse addresses the latter. SQL data warehouses are focused on the former. I don't see one "winning"--both markets are growing.",0,28535395,0,"[]","",0,"","[]",0 21721739,0,"comment","pachico","2019-12-06 13:55:45.000000000","Ever thought of using ClickHouse for data storage?",0,21720881,0,"[21735484]","",0,"","[]",0 18145551,0,"comment","qaq","2018-10-05 03:34:16.000000000","And we are running pretty hefty setup in docker Pulsar Cluster, multiple Postgres Instances, ClickHouse and a bunch of services hyperkit spins out of control periodically slowing down MBP to unusable and much less beefy linux notebook is doing OK.",0,18145139,0,"[]","",0,"","[]",0 21767967,0,"comment","SergeAx","2019-12-11 23:39:52.000000000","Disclaimer: I worked at Yandex in 2006-2007.
Yandex is a very BIG company in terms of users, requests, data stored, etc. It's surely bigger than Twitter, and I believe it is bigger than Netflix. It is also an algorithms company (as in "not a content company"). So when Yandex does something, it is mostly because all other options failed at their load.
Yandex also has a very extensive expertise in C/C++. There is a ClickHouse, there is a CatBoost. I would trust them in their domain.",0,21766538,0,"[21768018,21768728]","",0,"","[]",0 28593840,0,"story","hodgesrm","2021-09-20 13:59:08.000000000","",0,0,0,"[]","https://clickhouse.com/blog/en/2021/clickhouse-inc/",3,"ClickHouse spins out From Yandex","[]",0 28594160,0,"story","zX41ZdbW","2021-09-20 14:27:54.000000000","",1,0,0,"[]","https://clickhouse.com/blog/en/2021/clickhouse-inc/",1,"ClickHouse, Inc","[]",0 28594734,0,"story","rochoa","2021-09-20 15:17:55.000000000","",0,0,0,"[]","https://clickhouse.com/blog/en/2021/clickhouse-inc/",2,"Introducing ClickHouse Inc., the new home of ClickHouse","[]",0 28595419,0,"story","zX41ZdbW","2021-09-20 16:13:48.000000000","",0,0,0,"[28602495,28596480,28597482,28597553,28598408,28595911,28596490,28596023,28599834,28602911,28596528,28600794,28595902,28597522,28651987,28641396,28602904,28602432,28596082,28596408,28599535,28602123,28598168,28599455,28603751,28595932,28597445,28595983,28597167]","https://github.com/ClickHouse/ClickHouse/blob/master/website/blog/en/2021/clickhouse-inc.md",519,"ClickHouse, Inc.","[]",159 28595902,0,"comment","ucarion","2021-09-20 16:52:57.000000000","Might make more sense to link to the blog post, instead of its underlying markdown in GitHub?
https://clickhouse.com/blog/en/2021/clickhouse-inc/",0,28595419,0,"[28595988]","",0,"","[]",0 28595911,0,"comment","chrismorgan","2021-09-20 16:53:35.000000000","Canonical link: https://clickhouse.com/blog/en/2021/clickhouse-inc/
But I presume the GitHub link (https://github.com/ClickHouse/ClickHouse/blob/master/website...) has been submitted because clickhouse.com is going to be blocked for a large fraction of HN users (Peter Lowe’s Ad and tracking server list, which I think uBlock Origin has enabled by default, includes ||clickhouse.com^). I’m actually a bit curious why clickhouse.com (or more likely a subdomain?) would be being used this way; I’d have thought that they’d separate any such uses to a different domain so as not to hinder their main domain which is about the software and nothing to do with ads or tracking at all (even if that’s probably the main end use of such an OLAP DBMS).",0,28595419,0,"[28595996,28598088,28602333]","",0,"","[]",0 28595932,0,"comment","KitDuncan","2021-09-20 16:55:34.000000000","Just got started with clickhouse. Super cool software.",0,28595419,0,"[]","",0,"","[]",0 28595996,0,"comment","pgl","2021-09-20 17:01:11.000000000","Someone just reported this to me and I've removed the entry from my blocklist.
This was a very old entry - it was added on Fri, 06 Jun 2003 19:53:00. Back then it was a marketing company that served ads.
I pride myself on knowing the entries in my list very well, but I have to admit I forgot about this one, which is ironic because I use Clickhouse at my job these days.",0,28595911,0,"[28596355,28596026,28597757,28597569,28596039]","",0,"","[]",0 28596408,0,"comment","einpoklum","2021-09-20 17:36:59.000000000","For those who don't know it:
ClickHouse is a columnar, analytic, close-to-a-DBMS, but not a full-fledged one. The "100x-1000x faster" is compared to row stores. Last time I checked it was mostly single-table-oriented.",0,28595419,0,"[]","",0,"","[]",0 28596480,0,"comment","whitepoplar","2021-09-20 17:42:47.000000000","Haven't used any of these yet, but how does ClickHouse compare to Postgres extensions like TimescaleDB and Citus (which recently launched a columnar feature)? I remember reading in the ClickHouse docs some time ago that it does not have DELETE functionality. Does this pose any problems with GDPR and data deletion requests?",0,28595419,0,"[28597744,28596892,28596718,28596827,28600300,28602970,28596774,28598072,28596649,28597956,28605433]","",0,"","[]",0 28596490,0,"comment","data_ders","2021-09-20 17:43:17.000000000","I'm betting we'll see a "Clickhouse Cloud" product announcement in the next 12 months. I'm curious to see if they can provide enough add-on value to their open source product to be profitable. But I'm certainly rooting for them!",0,28595419,0,"[28596949,28596870,28596907,28597410]","",0,"","[]",0 28596529,0,"comment","acidbaseextract","2021-09-20 17:46:49.000000000","As a sidenote, I saw your talk on Clickhouse to the CMU database group [1] back when and was extremely impressed with your deep technical knowledge yet down-to-earth presentation. Still haven't had an opportunity to use Clickhouse for production work, but would welcome it.
[1] https://www.youtube.com/watch?v=fGG9dApIhDU",0,28596229,0,"[28597997]","",0,"","[]",0 28596546,0,"comment","zX41ZdbW","2021-09-20 17:48:15.000000000","Thank you! This is an important milestone for ClickHouse and will benefit the entire ecosystem.",0,28596229,0,"[]","",0,"","[]",0 28596649,0,"comment","ryanbooz","2021-09-20 17:55:41.000000000","(Timescale DevRel here)
We've recently been working through a detailed benchmark of TimescaleDB and Clickhouse. The DELETE/UPDATE question has been an intriguing story to follow - and I honestly hadn't considered the GDPR angle.
ATM, Clickhouse is still OLAP focused and their MergeTree implementation does not allow direct DELETE (or UPDATE) of any data. All DELETE/UPDATE requests are applied asynchronously by (essentially) re-writing/merging the table data (it's referred to as a "mutation") without whatever data was referenced in the DELETE/UPDATE. [1]
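As a hedged sketch of that mutation syntax (table and column names hypothetical; ClickHouse applies the rewrite asynchronously in the background):

```sql
-- Hypothetical table: asynchronously rewrites the affected
-- parts rather than deleting rows in place.
ALTER TABLE events DELETE WHERE user_id = 42;

-- Mutation progress can be inspected via the system table:
SELECT command, is_done FROM system.mutations WHERE table = 'events';
```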
[1]: https://clickhouse.com/docs/en/sql-reference/statements/alte...",0,28596480,0,"[28598182,28601681,28597887,28599333]","",0,"","[]",0 28596718,0,"comment","stingraycharles","2021-09-20 18:01:25.000000000","In a nutshell, my extremely subjective and biased take on it:
* Citus has a great clustering story, and a small data warehousing story, afaik no timeseries story;
* TimescaleDB has a great timeseries story, and an average data warehousing story;
* Clickhouse has a great data warehousing story, an average timeseries story, and a bit meh clustering story (YMMV).
(Disclaimer: I work for a competitor)",0,28596480,0,"[28597197,28597709]","",0,"","[]",0 28596774,0,"comment","zX41ZdbW","2021-09-20 18:05:14.000000000","There are many independent comparisons of ClickHouse vs TimescaleDB:
By Splitbee: https://github.com/ClickHouse/ClickHouse/issues/22398#issuec...
By GitLab: https://github.com/ClickHouse/ClickHouse/issues/22398#issuec...
And others: https://github.com/ClickHouse/ClickHouse/issues/22398#issuec... https://github.com/ClickHouse/ClickHouse/issues/22398#issuec...
If you find more, please post them there.
TimescaleDB can work pretty well in time-series scenarios but does not shine on analytical queries. For most time-series queries it is below ClickHouse in terms of performance, but for small (point) queries it can be better.
The main advantage of TimescaleDB is that it better integrates with Postgres (for obvious reasons).
There are also many comparisons of ClickHouse vs Citus. The most notable is here: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...
ClickHouse can do batch DELETE operations for data cleanup. https://clickhouse.com/docs/en/sql-reference/statements/alte... It is not for frequent single-record deletions, but it can fulfill the needs for data cleanup, retention, GDPR requirements.
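Beyond batch deletes, retention like this can also be declared ahead of time with TTL clauses; a minimal sketch, with hypothetical table and column names:

```sql
-- Hypothetical table: drop whole rows after a year, and reset the
-- ip column to its default (empty) after three months while keeping
-- the rest of the row.
CREATE TABLE visits
(
    event_time DateTime,
    ip         String TTL event_time + INTERVAL 3 MONTH,
    url        String
)
ENGINE = MergeTree
ORDER BY event_time
TTL event_time + INTERVAL 1 YEAR;
```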
Also you can tune TTL rules in ClickHouse, per table or per columns (say, replace all IP addresses to zero after three months).",0,28596480,0,"[28597313]","",0,"","[]",0 28596827,0,"comment","didip","2021-09-20 18:09:37.000000000","ClickHouse competes with OLAP storages like Druid or Pinot.
I don't know about ClickHouse but the other two use bitmap indexes to make storing petabytes of data affordable.
Row oriented databases would struggle to compete against ClickHouse. They are easily an order of magnitude slower.",0,28596480,0,"[28600074]","",0,"","[]",0 28596870,0,"comment","ignoramous","2021-09-20 18:13:03.000000000","There's definitely market for a managed clickhouse 1p product. It remains to be seen if the product is substantial enough to challenge the incumbents. The engineering pedigree is ample. So that's already 50% of the way there. With money in the bank, it is all about how they suit it up with their sales and marketing. Interesting times ahead for them.",0,28596490,0,"[28597290,28596921]","",0,"","[]",0 28596892,0,"comment","ericb","2021-09-20 18:14:30.000000000","ClickHouse wins on licensing--Apache.
The TimeScale licensing approach, the way it is written, perhaps accidentally, has lots of hidden landmines. The TimeScale license slants toward cloud giant defense to the extent that normal use is perilous.
For example, timescale can be used for normal data (postgres) as well, so any rules seem to apply to all your data in the database. The free license is only available if:
the customer is prohibited, either contractually or technically, from defining, redefining, or modifying the database schema or other structural aspects of database objects, such as through use of the Timescale Data Definition Interfaces, in a Timescale Database utilized by such Value Added Products or Services.
My read is that if you let a customer do anything that adds a custom field, or table, or database, or trigger, or anything that is "structural" (even in the regular relational stuff) anywhere in your database (metrics or not), you are in violation. There doesn't seem to be a distinction about whether this is "direct" control or not, or whether a setting indirectly adds a trigger. I don't want to be in a courtroom debating whether a new metric is a "structural change!"
Now, none of that might be the intent of the license, but you have to go by what it says, not intentions.
The sad part of that is, I, and I'm sure many folks, have no interest in starting a database company, but we can't rally timescale because of legal risk. Looks awesome otherwise, though.",0,28596480,0,"[28597093,28597916]","",0,"","[]",0 28596921,0,"comment","polskibus","2021-09-20 18:16:50.000000000","Yandex offers managed clickhouse : https://cloud.yandex.com/en/services/managed-clickhouse",0,28596870,0,"[28598656]","",0,"","[]",0 28597197,0,"comment","akulkarni","2021-09-20 18:39:09.000000000","[Timescale co-founder]
This is a really great comparison. I might borrow it in the future :-)
But yes, if you have classic OLAP-style queries (e.g., queries that need to touch every database row), Clickhouse is likely the better option.
For anything time-series related, and/or if you like/love Postgres, that is where TimescaleDB shines. (But please make sure you turn on compression!)
TimescaleDB also has a good clustering story, which is also improving over time. [0][1]
[0] https://news.ycombinator.com/item?id=23272992
[1] https://news.ycombinator.com/item?id=24931994",0,28596718,0,"[28598696,28601481]","",0,"","[]",0 28597290,0,"comment","yigitkonur35","2021-09-20 18:47:41.000000000","Aiven will offer a managed Clickhouse service too: https://landing.aiven.io/2020-upcoming-aiven-services-webina...
Altinity is also a trusted partner with great know-how about Clickhouse internals. They have started to offer managed instances in AWS: https://altinity.com/altinity-cloud-test-drive/
Lastly, Alibaba Cloud has an offering as well: https://www.alibabacloud.com/product/clickhouse
Are there any other ones?",0,28596870,0,"[]","",0,"","[]",0 28597313,0,"comment","ryanbooz","2021-09-20 18:50:41.000000000","[Timescale DevRel here]
@zX41ZdbW - Thanks for pointing out the various benchmarks that have been run by other companies between Clickhouse and TimescaleDB using TSBS[1]. As we mentioned, we'll dig deeper into a similar benchmark with much more detail than any of those examples in an upcoming blog post.
One notable omission on all of the benchmarks that we've seen is that none of them enable TimescaleDB compression (which also transforms row-oriented data into a columnar-type format). In our detailed benchmarking, queries on compressed columnar data in Timescale outperformed Clickhouse in most queries, particularly as cardinality increases, often by 5x or more. And with compression of 90% or more, storage is often comparable. (Again, blog post coming soon - we are just making sure our results are accurate before rushing to publish.)
The beauty of TimescaleDB's columnar compression model is that it allows the user to decide when their workload can benefit from deep/narrow queries of data that doesn't change often (although it can still be modified just like regular row data), versus shallow/wide queries for things like inserting data and near-time queries.
It's a hybrid model that provides a lot of flexibility for users AND significantly improves the performance of historical queries. So yes, we do agree that columnar storage is a huge performance win for many types of queries.
And of course, with TimescaleDB, one also gets all of the benefits of PostgreSQL and its vibrant ecosystem.
Can't wait to share the details in the coming weeks!
[1]: https://github.com/timescale/tsbs",0,28596774,0,"[28599935,28599891,28598269]","",0,"","[]",0 28597410,0,"comment","nemo44x","2021-09-20 19:00:32.000000000","I'm guessing that's the entire purpose here. Build Snowflake with Clickhouse branding.",0,28596490,0,"[]","",0,"","[]",0 28597482,0,"comment","jenny91","2021-09-20 19:06:18.000000000","These some really great technology coming out of Russia in the information retrieval/database world: ClickHouse, a bunch of Postgres stuff that Yandex is working on, 2gis.ru (a super detailed vector map on a completely different stack to Google/MapBox), etc.",0,28595419,0,"[28601308]","",0,"","[]",0 28597522,0,"comment","mempko","2021-09-20 19:09:44.000000000","It's great to see this spin off. ClickHouse is fast but certain use cases are not ideal and having it spin off the team can focus more on the community and what they need instead of just the use cases Yandex had. Cheers to the team and good luck!",0,28595419,0,"[]","",0,"","[]",0 28597744,0,"comment","fiddlerwoaroof","2021-09-20 19:25:27.000000000","I benchmarked ClickHouse vs. Timescale, Citus, Greenplum and Elasticsearch for a real-time analytics application. With a couple hours learning for each (although I’ve used Postgres extensively and so Postgres-backed databases had a bit of an advantage), ClickHouse’s performance was easily an order of magnitude or two better than anything except ES. ES had its own downsides with respect to the queries we could run (which is why we were leaving ES in the first place).",0,28596480,0,"[28601492,28598173,28600428]","",0,"","[]",0 28597887,0,"comment","nezirus","2021-09-20 19:38:29.000000000","You are correct, the proper way to do deletions in ClickHouse is to use partitions, and drop partitions. That is probably good enough for most analytical use cases, but YMMV.",0,28596649,0,"[]","",0,"","[]",0 28597916,0,"comment","goodpoint","2021-09-20 19:40:30.000000000","> ClickHouse wins on licensing--Apache
How so? An end user should prefer a database under a license that protects the developer and users from cloudification/proprietization/SaaS
Clickhouse has ALTER ... DELETE and ALTER ... UPDATE functionality now! (and TTLs)",0,28596480,0,"[]","",0,"","[]",0 28598088,0,"comment","newman314","2021-09-20 19:52:08.000000000","FWIW, clickhouse.com is also blocked by "Malvertising filter list by Disconnect"",0,28595911,0,"[28598265]","",0,"","[]",0 28598161,0,"comment","jordanthoms","2021-09-20 19:57:22.000000000","We recently setup Clickhouse on GKE using the Altinity operator (and signed up for Altinity support).
There's been so many queries where I've thought 'that's going to need a join and aggregation across tens of billions of rows, no way!' - and then Clickhouse spits back a query result in 10 seconds...",0,28596229,0,"[]","",0,"","[]",0 28598168,0,"comment","legg0myegg0","2021-09-20 19:57:41.000000000","Does anyone happen to know which country the new company is incorporated in? I'm still looking for a chance to use ClickHouse because it sounds so excellent!",0,28595419,0,"[]","",0,"","[]",0 28598182,0,"comment","hkolk","2021-09-20 19:58:17.000000000","We are using Clickhouse combined with GDPR's Data Deletion Requests. We store the user-ids in a separate system, and run the ALTER/DELETE statements once per week. Works pretty smooth, though I would prefer some more automation within Clickhouse for them.
Data for in-active users gets deleted because our clickhouse retention policy is lower than the in-active-user timeout",0,28596649,0,"[]","",0,"","[]",0 28598269,0,"comment","zX41ZdbW","2021-09-20 20:05:54.000000000","Thank you! Looking forward for a blog post. We need more references for comparison to optimize ClickHouse performance.",0,28597313,0,"[]","",0,"","[]",0 28598408,0,"comment","PeterZaitsev","2021-09-20 20:16:39.000000000","Interesting news indeed! I very much wonder what it means long term in terms of Licenses. I would imagine much better future if Clickhouse would become Foundation driven process which gives good protection from license change (through I'm biased here) - Currently Clickhouse fully under Apache 2.0 license may look too good to be true compared to where many successful VC funded projects took licenses of their projects (think Elastc, MongoDB, Redis)
In any case, I expect a lot of growth in the Clickhouse community now, and investment in both engineering and, most importantly, marketing - I think Clickhouse technology has a lot more adoption potential than it currently has",0,28595419,0,"[28598865]","",0,"","[]",0 28598656,0,"comment","ignoramous","2021-09-20 20:35:56.000000000","To be clear, I meant 1p as in Confluent -> Kafka; not AWS -> Managed Kafka.
Despite Yandex (who originally built Clickhouse) offering a managed solution, a substantial investment outlay from the VCs does come off as a huge vote of confidence in the founders.",0,28596921,0,"[]","",0,"","[]",0 28598668,0,"comment","zX41ZdbW","2021-09-20 20:36:49.000000000","We don't require Yandex CLA:
> As an alternative, you can provide DCO instead of CLA. You can find the text of DCO here: https://developercertificate.org/ It is enough to read and copy it verbatim to your pull request.
> If you don't agree with the CLA and don't want to provide DCO, you still can open a pull request to provide your contributions.
https://github.com/ClickHouse/ClickHouse/blob/master/CONTRIB...
Anyway, the Yandex CLA will be removed in the coming days (it should already be removed).",0,28598390,0,"[]","",0,"","[]",0 28599216,0,"comment","fiddlerwoaroof","2021-09-20 21:26:40.000000000","ClickHouse deployed to EKS with Clickhouse operator",0,28598173,0,"[]","",0,"","[]",0 28599455,0,"comment","monstrado","2021-09-20 21:50:28.000000000","I'm excited for the future of ClickHouse! I'm hopeful that this move will help smooth out the rough edges of ClickHouse, mainly around clustering.",0,28595419,0,"[]","",0,"","[]",0 28599535,0,"comment","nhoughto","2021-09-20 21:57:40.000000000","We just did months of testing on a bunch of dbs for a time-series workload, and whilst we really liked the story and devs behind clickhouse, the ops burden of not separating storage and compute ended up being a turn-off. Good to see it is progressing and will hopefully get more investment, although I wonder what this means for companies like Altinity.
We compared Timescale, Clickhouse, Snowflake and Firebolt. Ended up really liking Firebolt: some amazing tech with a few rough edges (it's pretty new); basically Clickhouse speed meets Snowflake simplicity. Definitely one to watch.
https://www.firebolt.io/",0,28595419,0,"[28599848,28600435,28615244,28600234]","",0,"","[]",0 28599569,0,"comment","ctvo","2021-09-20 22:00:23.000000000","> One engineer can do a lot. Even in Texas we run slimmer. I'm the sole frontend dev. I created and maintain our iOS, Android, and Web app for a largish tech company.
What do you mean by "do a lot". Can you deliver as quickly as a team? If so, do you work more hours or are you just better? If you're just better, why do you decide to stay with your largish tech company when we're acknowledging SV pays more? A remote role would increase your salary, no?
Breaking this down:
Russia has a great mathematics and engineering education system. Many graduates, unable to leave, take jobs with Russian tech companies. Russian tech companies pay less than US tech companies.
That's why the situation may be as is with ClickHouse. You're not in Russia.
Texas isn't particularly known for running lean. Every Big Tech has a presence in Austin. Dallas is filled with legacy financial companies burning money on IT.",0,28598663,0,"[28599583,28601542,28714291]","",0,"","[]",0 28599834,0,"comment","anglinb","2021-09-20 22:25:29.000000000","We're using Clickhouse to power our in-product analytics. It's awesome but would love a managed service b/c it definitely requires a bit of management overhead. Super excited about this announcement!",0,28595419,0,"[28600081,28600242]","",0,"","[]",0 28599935,0,"comment","stavros","2021-09-20 22:36:43.000000000","I have a related question, in case anyone knows: We want to store typical analytics data somewhere (currently in BigQuery) to analyze with Looker. Things like "CI run started", "CI run finished" and then calculate analytics over average CI runtimes.
Which database would be a good fit for this? There isn't too much data, maybe tens of thousands of rows eventually. Would Timescale be a good fit? I'd prefer that, due to existing familiarity with Postgres, but if ClickHouse is better, that's good too.",0,28597313,0,"[28600917,28601122]","",0,"","[]",0 28600074,0,"comment","hodgesrm","2021-09-20 22:53:51.000000000","ClickHouse uses skip indexes. They basically answer the question "is the value I'm seeking not in the block."
For example, there are a couple of varieties of Bloom filters, which allow you to test for the presence of string sequences in blocks. This allows ClickHouse to skip reading and decompressing blocks (actually called granules) unnecessarily.",0,28596827,0,"[]","",0,"","[]",0 28600242,0,"comment","mast22","2021-09-20 23:15:13.000000000","Yandex provides a managed service as well https://cloud.yandex.com/en/services/managed-clickhouse",1,28599834,0,"[]","",0,"","[]",0 28600300,0,"comment","ddbennett","2021-09-20 23:22:53.000000000","Sentry.io settled on Clickhouse for error and transaction data after reviewing several options including Citus and Elastic. We've been happy with both the performance and how well it scales from Open Source installs to our SaaS clusters.",0,28596480,0,"[]","",0,"","[]",0 28600794,0,"comment","merb","2021-09-21 00:40:02.000000000","I looked into Clickhouse for OLAP. Our main database would be PostgreSQL; unfortunately their MaterializedPostgreSQL does not support TOAST, which is a major downside, considering we are TEXT/JSONB heavy users.
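The skip-index idea hodgesrm describes above (a small per-granule filter that answers "is the value definitely not here?", so whole granules can be skipped) can be sketched in Python. This is a toy illustration, not ClickHouse's implementation; all names are invented:

```python
import hashlib

# Toy Bloom-filter skip index: one tiny filter per granule. A "no" answer is
# definitive (the granule is skipped); a "yes" only means "must read it".

class BloomFilter:
    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, value):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        # False = definitely absent (no false negatives are possible).
        return all(self.bits >> p & 1 for p in self._positions(value))

def build_index(rows, granule_size=4):
    # Split rows into granules and build one filter per granule.
    granules = []
    for i in range(0, len(rows), granule_size):
        chunk = rows[i:i + granule_size]
        bf = BloomFilter()
        for v in chunk:
            bf.add(v)
        granules.append((chunk, bf))
    return granules

def lookup(granules, value):
    # Only granules whose filter says "maybe" are actually scanned.
    scanned = [chunk for chunk, bf in granules if bf.might_contain(value)]
    hits = [v for chunk in scanned for v in chunk if v == value]
    return hits, len(scanned)
```

The payoff is in `lookup`: granules whose filter answers "definitely absent" are never read or decompressed, which is the behavior the comment describes.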
Edit: I tested it, but for some reason either the docs are strange or wrong: TOAST tables are actually replicated?! Or at least I see the data?",0,28595419,0,"[28601115]","",0,"","[]",0 28601086,0,"comment","nemo44x","2021-09-21 01:25:05.000000000","Of course it does - it’s purpose built for a narrow use case. However it’s an extremely popular use case.
Clickhouse optimizes for the 2 most important things for OLAP - minimal disk space due to the compression benefits of columnar storage, and minimal compute for the same reason - and is therefore fast.
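The compression benefit of columnar storage mentioned above is easy to demonstrate: the same records compress much better column-by-column than row-by-row, because similar values end up adjacent. A minimal Python sketch (illustrative only, not ClickHouse code; the record shape is invented):

```python
import json
import random
import zlib

# Generate identical records, then serialize them row-major (one JSON object
# per row) and column-major (all values of a column together), and compare
# compressed sizes.

random.seed(0)
rows = [
    {"ts": 1632150000 + i, "status": random.choice([200, 404]), "path": "/api/v1/items"}
    for i in range(5000)
]

row_major = "\n".join(json.dumps(r) for r in rows).encode()
col_major = b"\n".join(
    ",".join(str(r[col]) for r in rows).encode() for col in ("ts", "status", "path")
)

row_size = len(zlib.compress(row_major))
col_size = len(zlib.compress(col_major))
```

With monotonic timestamps, a low-cardinality status column, and a constant path column, `col_size` comes out well below `row_size`, which is the effect the comment is pointing at.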
However it isn’t flexible when you want to expand the use case. You can’t do any sort of text search or complex joins (there are no foreign keys), and you need to order your tables the way you want to sort them.
For certain things it’s perfect. It was built to solve a problem Yandex had, and that’s notable. But it doesn’t have anywhere near the flexibility of Elasticsearch, for example.
But yes it’s purpose built to be extremely fast and minimize storage for the types of use cases it is built for.",0,28600691,0,"[28601710,28601164]","",0,"","[]",0 28601115,0,"comment","jordanthoms","2021-09-21 01:29:07.000000000","TEXT/JSONB are stored inline IIRC unless they hit a certain size limit at which point they'll be put into TOAST - so you'll see data in clickhouse but some big values might be missing.",0,28600794,0,"[28603538]","",0,"","[]",0 28601164,0,"comment","hodgesrm","2021-09-21 01:37:09.000000000","I don't think it's accurate to say the ClickHouse use case is narrow. ClickHouse is extremely good at loading and querying events arriving in near-real time from event streams like Kafka. It can also load very efficiently from data lakes. Like Druid it can offer low latency response even as the data set size scales into trillions of rows.
ClickHouse is used for everything from log management to managing CDN delivery to real-time marketing and many other applications. It's gone far beyond the web analytics use case for which it was originally developed at Yandex.
Edit: clarification",0,28601086,0,"[28601366]","",0,"","[]",0 28601268,0,"story","hodgesrm","2021-09-21 01:58:31.000000000","",0,0,0,"[]","https://altinity.com/blog/big-news-in-the-clickhouse-community",3,"Big News in the ClickHouse Community","[]",0 28601492,0,"comment","eastdakota","2021-09-21 02:36:33.000000000","Cloudflare’s analytics have been powered by Clickhouse for a long time. And I was an early investor in Timescale. They’re both excellent products.",0,28597744,0,"[28601850]","",0,"","[]",0 28601681,0,"comment","DevKoala","2021-09-21 03:04:39.000000000","ClickHouse does allow delete and update operations. They are just asynchronous functions.
I use them every now and then, but I prefer working with partition strategies when I have to do these programmatically.
As we did not want to go into the HA/backup/restore details at that time we created a solution that can be quickly recreated from data in other databases.
Interesting presentation from Alexey about Features and Roadmap from May 2021:
https://www.youtube.com/watch?v=t7mA1aOx3tM",0,28595419,0,"[28614029,28602549]","",0,"","[]",0 28602911,0,"comment","pixel_tracing","2021-09-21 07:54:20.000000000","So forgive my very basic question here, I’m coming from mobile dev world. But can I:
1.) Use Clickhouse as infrastructure to build a product similar to MixPanel / Amplitude
2.) If I wanted a basic MVP of the above, can anyone point me to the steps (like 1., 2., 3., etc.) for what I would need to do to have a basic MVP ready? (Note: I am already very familiar with Docker, Kubernetes and writing REST APIs.) Would greatly appreciate this since it would clear up a lot of questions I have",0,28595419,0,"[28603411]","",0,"","[]",0 28602960,0,"story","arunmu","2021-09-21 08:06:09.000000000","",1,0,0,"[]","https://www.reddit.com/r/bigdata/comments/pse4gb/clickhouse_and_apache_pinot/",1,"ClickHouse and Apache Pinot","[]",0 28602970,0,"comment","dagi3d","2021-09-21 08:07:19.000000000","ClickHouse can delete rows, but deletes work as batch/async operations: https://clickhouse.com/docs/en/faq/operations/delete-old-dat...",0,28596480,0,"[28603455]","",0,"","[]",0 28603411,0,"comment","timgl","2021-09-21 09:41:43.000000000","1) Essentially yes! You'll have to write the SQL queries yourself. 2) You'd want some way of sending events (a simple API) to a Kafka cluster, which would be read by Clickhouse; then you'd be able to query the data using Metabase or DataGrip.
(Or you can use PostHog, which has essentially done all this for you and has all the functionality that Mixpanel/Amplitude has, but you're able to self host it!)",0,28602911,0,"[28610377]","",0,"","[]",0 28603455,0,"comment","zepearl","2021-09-21 09:53:03.000000000","Correct + wanted to mention that "lightweight/point-deletes" might come as a new feature.
Initial discussion: https://github.com/ClickHouse/ClickHouse/issues/19627
Being implemented: https://github.com/ClickHouse/ClickHouse/pull/24755",0,28602970,0,"[]","",0,"","[]",0 28603538,0,"comment","merb","2021-09-21 10:09:48.000000000","do you know if it will in the future? or is this a clickhouse limit of their string data type?",0,28601115,0,"[28613981]","",0,"","[]",0 28603554,0,"comment","pachico","2021-09-21 10:13:02.000000000","I use Grafana for that. At the moment, we have developed entire internal products based on ClickHouse + Grafana.",0,28602549,0,"[]","",0,"","[]",0 28603568,0,"comment","qoega","2021-09-21 10:15:59.000000000","From docs it seems they use forked ClickHouse code.",0,28600428,0,"[28606162]","",0,"","[]",0 28603670,0,"comment","hodgesrm","2021-09-21 10:36:08.000000000","For example, you can use ClickHouse queries to dynamically change the shape of pages based on user behavior across multiple sites (aka retargeting). You can can also use ClickHouse to manage CDN downloads in real time. Here are a couple of talks that illustrate both use cases.
We still call this OLAP but it's quite different from traditional uses. In particular, the core data come from event streams.
https://altinity.com/presentations/2020/06/16/big-data-in-re...
https://altinity.com/webinarspage/2020/6/23/big-data-and-bea...",0,28601366,0,"[]","",0,"","[]",0 28603751,0,"comment","bayesian_horse","2021-09-21 10:54:01.000000000","I learned of ClickHouse in an unpleasant way. It is a dependency of Sentry. I was tasked with trying to install a self-hosted Sentry on an OpenShift cluster, which failed on account of Clickhouse not running in unprivileged containers. No, I was not permitted to change the privileges or use a plain VM.
Altinity is fixing this. The project is called Lightweight Delete and it's for exactly the GDPR reason cited. The idea is that there will be a SQL DELETE command that causes rows to disappear instantly. What actually will happen is that they will be marked as deleted, then garbage collected on the next merge.
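The mark-then-garbage-collect mechanics described above can be sketched in Python. This is a hedged illustration of the idea, not the actual Lightweight Delete implementation; the `Part` class and its methods are invented for the example:

```python
# Sketch of "lightweight delete": DELETE only flips bits in a deleted-mask
# (instant, no part rewrite), reads filter on the mask, and the next merge
# physically drops the flagged rows.

class Part:
    def __init__(self, rows):
        self.rows = rows
        self.deleted = [False] * len(rows)

    def delete_where(self, pred):
        # Instant: mark matching rows without rewriting the part.
        for i, row in enumerate(self.rows):
            if pred(row):
                self.deleted[i] = True

    def select(self):
        # Reads skip rows whose mask bit is set, so they "disappear" at once.
        return [r for r, d in zip(self.rows, self.deleted) if not d]

    def merge(self):
        # Garbage collection: the merge rewrites the part without masked rows.
        self.rows = self.select()
        self.deleted = [False] * len(self.rows)
```

The key property the comment describes survives in the sketch: between `delete_where` and `merge`, queries no longer see the rows even though they still occupy storage.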
Disclaimer: I work for Altinity.",0,28596480,0,"[]","",0,"","[]",0 25488437,0,"story","eatonphil","2020-12-20 18:37:14.000000000","",0,0,0,"[]","https://altinity.com/blog/clickhouse-and-s3-compatible-object-storage",3,"Clickhouse and S3 Compatible Object Storage","[]",0 28610377,0,"comment","pixel_tracing","2021-09-21 21:42:50.000000000","You are a godsend, thank you.
Can you explain this in a little more detail:
“Which would be read by Clickhouse” - are you talking about something like a Kafka connector? Or some KSQL-type query?
Originally a reddit post. Wanted to see/have more discussion on this topic.",0,0,0,"[28612804]","",2,"ClickHouse vs. Pinot","[]",3 28612804,0,"comment","hodgesrm","2021-09-22 03:24:49.000000000","We're planning to have a panel discussion with Pinot, ClickHouse, and Druid lead committers at OSA Con on Nov 2. It's a new conference devoted to open source analytics. [0] It will be fun to see the contrasts. Please have a look at our conference page for more info. (Session is not announced yet, but will be up in a week or so.)
[0] https://altinity.com/osa-con-2021/",0,28612540,0,"[28612953]","",0,"","[]",0 28613981,0,"comment","qoega","2021-09-22 07:46:51.000000000","ClickHouse has no limit on string data length. The MaterializedPostgreSQL engine is just a very recent feature without a large adoption rate yet. I believe if the community uses it frequently it will become more bulletproof and more edge cases will be supported. TOAST in the replication protocol is just not trivial to implement.",0,28603538,0,"[]","",0,"","[]",0 28614029,0,"comment","parth_patil","2021-09-22 07:55:44.000000000","I have similar first-hand experience with ClickHouse. In the past I moved a custom analytics solution I had built on HBase to a solution running on a single-node ClickHouse and had no issues whatsoever. In my current startup I am again using ClickHouse with great success. It's mind-bogglingly fast. Thanks ClickHouse team for building such an amazing system and for making it open source.",0,28602495,0,"[]","",0,"","[]",0 28615004,0,"comment","mlazowik","2021-09-22 11:02:41.000000000","There's a community connector for metabase https://github.com/enqueue/metabase-clickhouse-driver",0,28602549,0,"[]","",0,"","[]",0 18194723,0,"comment","ajawee","2018-10-11 16:12:39.000000000","We are using clickhouse. https://clickhouse.yandex/tutorial.html",0,18194181,0,"[]","",0,"","[]",0 28628515,0,"story","jurajmasar","2021-09-23 13:05:07.000000000","",0,0,0,"[28636091,28628896,28629094,28628579]","https://logtail.com/",43,"Show HN: Logtail – ClickHouse-Based Log Management","[]",6 28628579,0,"comment","Redsquare","2021-09-23 13:11:23.000000000","Looks v.good; using Clickhouse for persistence is interesting, and pricing certainly is very competitive when compared to the likes of logentries et al.",0,28628515,0,"[28628682]","",0,"","[]",0 25512817,0,"comment","hodgesrm","2020-12-22 23:45:56.000000000","That behavior is similar to a number of analytic databases. It's expensive to maintain constraints in large distributed datasets.
Referential integrity checks are also not meaningful in denormalized fact tables. Redshift [1] and ClickHouse [2] work this way as well. If things like duplicates are an issue, you can remove them by choosing query sort orders carefully, for example.
[1] https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_co...
[2] https://clickhouse.tech/docs/en/engines/table-engines/merget...",0,25512380,0,"[25513920]","",0,"","[]",0 28636091,0,"comment","rslabbert","2021-09-23 21:56:27.000000000","The pain of not being able to do complex queries in an Elastic world means this is a pretty logical conclusion. I'd love to see the ability to also collect metrics and traces in Clickhouse as well, which would let me easily join across dynamic service boundaries to collate information I need.
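The point above about removing duplicates "by choosing query sort orders carefully" can be sketched in Python: with no enforced uniqueness constraint, duplicates are collapsed at read (or merge) time by keeping one row per sort key, roughly in the spirit of ClickHouse's ReplacingMergeTree. The function and field names are invented for the example:

```python
# Hedged sketch: collapse duplicate rows by sort key, keeping the row with
# the highest version, instead of relying on a uniqueness constraint.

def dedupe_by_key(rows, key_cols, version_col):
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        # Keep only the newest row seen for each key.
        if key not in latest or row[version_col] > latest[key][version_col]:
            latest[key] = row
    # Return in sort-key order, mirroring a merge pass over sorted data.
    return sorted(latest.values(), key=lambda r: tuple(r[c] for c in key_cols))
```

Because the data are kept sorted by the key, the real engine can do this collapse in a single streaming pass during merges, which is why the choice of sort order matters so much.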
For example, being able to correlate a customer ID (stored in a log) to a trace ID (stored in the request trace) to Snowflake warehouse usage (stored as metrics) to a subset of the pipeline (mixed between logs and traces) to get a full understanding of how much each customer cost us in terms of Snowflake usage would be immensely valuable.",0,28628515,0,"[]","",0,"","[]",0 25534405,0,"comment","joshxyz","2020-12-25 05:22:16.000000000","Learning clickhouse, clickhouse, clickhouse. It's a cool OLAP db",0,25533770,0,"[]","",0,"","[]",0 28651987,0,"comment","eamag","2021-09-25 10:28:40.000000000","There is a video (in russian) [1] with an idea "if someone claims something works faster than clickhouse - it means I haven't optimised this specific query yet" [1] https://www.youtube.com/watch?v=MJJfWoWJq0o",0,28595419,0,"[]","",0,"","[]",0 21834139,0,"comment","missosoup","2019-12-19 12:05:10.000000000","I don't understand how people struggle with this concept. There's a reason pilots all over the world must learn English. Fragmenting the development community across language boundaries benefits no one.
A great example is Clickhouse. Russian product, built by an all-Russian team. Yet their docs are in English and they track issues on GitHub in English. I could read them if they were in Russian, but then I would never advocate for the use of Clickhouse in any commercial projects.
If Tencent is unable to cross that bar, then the project should absolutely be rejected from the Linux foundation.",0,21833163,0,"[21839162,21839598]","",0,"","[]",0 21835091,0,"comment","eb0la","2019-12-19 14:21:19.000000000","Right now I am evaluating Clickhouse as a warm/cold storage for logs.
I looked first at Elastic, but since most (>80%) of log searches will be on the time dimension and not free-form queries, it makes sense to use a columnar SQL DB.
Also, my company has a lot of Oracle Exadata DBAs on payroll, and we can just train them on Clickhouse, which shares a lot of concepts with Oracle.
LF translating documentation for them is a one-shot effort that's ultimately going to be useless as time goes on without someone on their side dedicated to keeping it updated and keeping github issues in English.
It's not too much of an ask. Again, see Clickhouse for an example of how to do international open source correctly.",0,21839598,0,"[21845818]","",0,"","[]",0 14657010,0,"comment","ceyhunkazel","2017-06-28 18:39:57.000000000","You can check https://clickhouse.yandex/",0,14655779,0,"[]","",0,"","[]",0 28673307,0,"comment","doliveira","2021-09-27 17:07:49.000000000","They moved to Clickhouse recently, though: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...",0,28673127,0,"[]","",0,"","[]",0 28673397,0,"comment","dreyfan","2021-09-27 17:14:12.000000000","Eh, there are numerous options today (vitess, citus, clickhouse, aurora, redshift, bigquery, snowflake). Clickhouse outperforms Scylla by a wide margin at a fraction of the cost and complexity.
Any of the above makes your data accessible via standard SQL. You can hire and utilize proficient data analysts, not people writing esoteric queries in whatever flavor of NoSQL, while actually solving business needs.",0,28673336,0,"[28674511,28685795,28685315]","",0,"","[]",0 28677629,0,"comment","dreyfan","2021-09-27 23:55:15.000000000","Exactly where did I suggest using a single mysql instance or bandaiding a fleet of them together?
Clickhouse vastly outperforms and outscales Scylla with a fraction of the hardware and cost.",0,28675900,0,"[28685154,28685302]","",0,"","[]",0 28679519,0,"comment","milkywaybrain","2021-09-28 04:58:25.000000000","Hey All,
I am the original author of the app. This is an update to the project.
Cryptogalaxy [0] is an app which gets cryptocurrency ticker and trade data in real time from multiple exchanges and then saves it to multiple storage systems.
Currently supported exchanges : FTX, Coinbase Pro, Binance, Bitfinex, BHEX, Huobi, Gateio, Kucoin, Bitstamp, Bybit, Probit, Gemini, Bitmart, Digifinex, AscendEX, Kraken, Binance US, OKEx, FTX US, HitBTC. Total 20.
Currently supported storages : Terminal Output, MySQL, Elasticsearch, InfluxDB, NATS, ClickHouse, S3. Total 7.
All feedback is welcome.
P.S Created fun twitter bot [1] using the app.
[0] : https://github.com/milkywaybrain/cryptogalaxy [1] : https://twitter.com/moon_or_earth
Thanks,
Pavan",0,28679504,0,"[]","",0,"","[]",0 28680403,0,"comment","chmike","2021-09-28 08:13:06.000000000","ClickHouse is written in C++ and is open source: https://github.com/ClickHouse/ClickHouse",0,28679490,0,"[28693683]","",0,"","[]",0 28680733,0,"comment","541","2021-09-28 09:13:43.000000000","Using ClickHouse for log storage and analysis is discussed here - https://news.ycombinator.com/item?id=26316401",0,28679490,0,"[]","",0,"","[]",0 28680825,0,"comment","reacharavindh","2021-09-28 09:29:48.000000000","I have looked into ElasticSearch + Kibana as a solution to aggregate logs. There may be plenty of choices to replace ElasticSearch (ClickHouse, even Postgres, heck even journald), but a nice UI where you can simply search for that random piece of text while sifting through the logs is the real sticking point.
Until now, I have not seen a web interface to log as powerful as Kibana that can work with anything other than ElasticSearch.
This is why I chose to stop my search and pay for Datadog to do this correctly, and simply allow me to search for that keyword on logs when I need it the most (and not worry about whether I indexed stuff correctly, or balanced some whatever in ElasticSearch, or remembered to set up something far too technical for a log system). Datadog allows you to keep a short period's worth of data in the index and "expire" old content into archives while retaining the ability to add them back to the index if needed for any investigation.",0,28680484,0,"[28681202,28681066]","",0,"","[]",0 28681504,0,"comment","reacharavindh","2021-09-28 11:24:35.000000000","Haven’t played with the logs part of Grafana recently, but would it work on top of say Clickhouse? I thought it was more tuned for the Loki use case… is it not?",0,28681066,0,"[28683320]","",0,"","[]",0 28682177,0,"comment","jurajmasar","2021-09-28 12:55:07.000000000","Agreed!
> Ubers Clickhouse as a Log Storage thing
We built hosted ClickHouse-based logging as a service https://logtail.com, just launched with Show HN last week.
Disclaimer: I'm the founder, happy to answer any questions",0,28679490,0,"[]","",0,"","[]",0 28682858,0,"comment","pachico","2021-09-28 13:59:01.000000000","Elasticsearch is good because it just ingests whatever you send to it, which allows you to deliver solutions rather quickly.
Having said this, I agree there are better solutions. (Also, Elasticsearch shines because of its full text search capability, which is not often exploited in case of logs.)
Loki is fine (or better said, it will be fine once they finally release a version without the write-out-of-order constraint) but I find its lack of a high-availability solution a bit frustrating.
ClickHouse, on the other hand, is just magnificent. I use it in combination with Vector as a message pipeline solution (it's an alternative to Fluentd, let's say).
So, yes, Elasticsearch is just not great and not only for logs, but for everything else that doesn't require full text search, in my opinion.",0,28679490,0,"[28685030]","",0,"","[]",0 28684304,0,"comment","hodgesrm","2021-09-28 15:58:37.000000000","There's also cLoki. It's a new project that puts a Loki gateway over a ClickHouse backend store. We're looking at it and plan a presentation from the author(s) at the next ClickHouse SF Bay Area Meetup.
https://github.com/lmangani/cLoki",0,28681737,0,"[]","",0,"","[]",0 28684373,0,"comment","justinsaccount","2021-09-28 16:04:00.000000000","> Uber has not open sourced this work so we are unable to benchmark it and see how it performs
I implemented their design here, specifically for importing zeek logs:
https://github.com/JustinAzoff/zeek-clickhouse
I don't have the Elastic-compatible query API though, or the smarts that auto-materialize popular columns.
It works though, does a good job at soaking up any sort of log type and handling fields being added or removed.",0,28679490,0,"[]","",0,"","[]",0 28685302,0,"comment","VirusNewbie","2021-09-28 17:32:46.000000000","Clickhouse is in no way an RDBMS.",0,28677629,0,"[28686505]","",0,"","[]",0 28685315,0,"comment","VirusNewbie","2021-09-28 17:34:00.000000000","clickhouse, bigquery, snowflake are not even close to RDBMS so what are you going on about? Have you actually used any of those?",0,28673397,0,"[]","",0,"","[]",0 28685795,0,"comment","thekozmo","2021-09-28 18:18:11.000000000","Clickhouse and other solutions you mentioned are for analytics while Scylla and Cassandra are for real-time. You can't compare the two types of tech",0,28673397,0,"[28686184]","",0,"","[]",0 28686184,0,"comment","hodgesrm","2021-09-28 19:01:04.000000000","I don't see how Scylla / Cassandra are real-time but ClickHouse is not. You can load event data into ClickHouse just as fast as Cassandra. The data are instantly queryable including pre-computed aggregates in materialized views. You can get answers in milliseconds on data that is only a few second old. Real-time marketing is now an important use case for ClickHouse, just to give one example. If the problem is analyzing and reacting to external events, ClickHouse is tough to beat.
How are Scylla / Cassandra better?",0,28685795,0,"[]","",0,"","[]",0 28686505,0,"comment","dreyfan","2021-09-28 19:32:33.000000000","R doesn't stand for row-store. The access patterns of Clickhouse are comparable with any RDBMS. The fact that it happens to be a column-store doesn't make it not an RDBMS. It doesn't mean "row-oriented databases designed for OLTP workloads" as it appears you think is the case.
Should I have simply said "SQL databases where data is represented in tables with a defined schema" to simplify the discussion and prevent your ignorant diatribe?",0,28685302,0,"[28687990]","",0,"","[]",0 28687990,0,"comment","VirusNewbie","2021-09-28 21:55:45.000000000","It stands for "relational" - you know that ClickHouse is not a relational database, right?
Have you used any of these technologies? Cassandra/Scylla have a defined schema. Did you really not know that?",0,28686505,0,"[]","",0,"","[]",0 28693683,0,"comment","dariusj18","2021-09-29 13:53:09.000000000","Where can I find resources on tools that integrate with Clickhouse, ex. are there any tools for gathering server metrics and sending them to Clickhouse?",0,28680403,0,"[]","",0,"","[]",0 21874238,0,"comment","monstrado","2019-12-24 19:12:30.000000000","Related, but ClickHouse utilizes this for their regex parsing.",0,21873557,0,"[]","",0,"","[]",0 14703044,0,"story","3manuek","2017-07-05 15:16:05.000000000","",0,0,0,"[]","http://www.3manuek.com/redshiftclickhouse",1,"Importing data from Redshift into Clickhouse","[]",0 14704954,0,"story","3manuek","2017-07-05 18:27:26.000000000","",0,0,0,"[]","http://www.3manuek.com/clickhousesample",1,"Sampling considerations on Clickhouse distributed tables","[]",0 25612186,0,"comment","st1ck","2021-01-02 11:54:44.000000000","> HTML/React inspired UI library that works on all platforms, so we can do electron without wasting 99% of my CPU cycles.
There are React Native forks for Windows, macOS and Linux. I have no idea whether any of them is a "good implementation" though.
> SQL + realtime computed views (eg materialize)
ClickHouse (OLAP DB) has materialized views (but only for inserts). Also Oracle and (I guess!) Materialize DB should have it too.
> Desktop apps that can be run without needing to be installed. (Like websites, but with native code.)
AppImage (and maybe Snap and Flatpak) is like this. Also technically, with Nix you can just run something like
nix-shell -p chromium --command chromium
(without root), but it feels like cheating.
> Git but for data
https://github.com/dolthub/dolt (again, never tried it yet, but would like in future)",0,25611076,0,"[]","",0,"","[]",0 28735747,0,"comment","ryohkyo","2021-10-03 09:28:11.000000000","In concept each 16-digit chromosome would work as the nearest neighbors, and finding “a range” of the nearest neighbors would be very fast.
From a data scientist’s perspective, this may seem to be a hack. I would very much appreciate learning more from you about this. I can tell you that the weakness of this method is the multiple writes to the database. I assume a vector database can implement this with fewer writes.
In the past, I have used this for supervised training and it yielded very good results. However, I think this would be inefficient in large-scale networks. I am planning to use Go + Clickhouse to improve the performance in the next project.",0,28735613,0,"[28736168]","",0,"","[]",0 18318289,0,"comment","buremba","2018-10-27 23:33:30.000000000","AFAIK Heap uses Citus but also has an internal partitioning scheduler for their customer event data, so I don't think that they're a good example. Timescale doesn't support scaling out yet but it's on their roadmap, so let's wait for them to implement it before drawing a fair conclusion.
If you're going to create roll-up tables and power your dashboard using those tables, you're fine with both options IMO. Cloudflare was also using Citus exactly for this use-case before they switched to Clickhouse.
If you have ad-hoc use-cases for time-series data, Timescale might be a better option because it's built exactly for this use-case and it knows the semantics of the data so it can partition the data in an optimized way and perform some optimizations such as parallelized operations and re-sizing chunks. In that sense, it's comparable to Influxdb, not Citus.",0,18300542,0,"[]","",0,"","[]",0