As we all know Facebook is data driven company which has billion plus users .They have one of the largest warehouse of the world which holds more than 300 Petabytes of data. Crunching these data and get meaning out of it requires lot of queries to be executed. Being able to run more queries and get faster results save a lot of money and add up eventually in the revenue.
Facebook’s warehouse data is managed inside giant Hadoop clusters on which Map reduce and Hive are being used for large-scale computation. It is always good to have faster throughput. But as warehouse grows big and more petabytes scale of data getting added and at the sametime the demand of near realtime analytics is getting increased, there comes a need for analytical engine which can give low query latency in this crunch situation.
So Facebook decides to build Presto :
•A new interactive query system that could operate fast at petabyte scale.
•It is SQL-on-Hadoop engine claimed to be up to 10 times faster than Hive for large-scale queries, is now open source.
•It does not rely on Mapreduce.
•Facebook claims it’s 10 times faster than hive. That is good enough to attract the eyeballs.
Architecture of Presto:
Steps of Execution:
1.Discovery service finds Presto in the cluster.
2.Client sends a query using HTTP to Presto Coordinator.
3.The coordinator parses, analyses, and builds a query plans the query plan.
4.Connector Plugin Provide Metadata (Table Schema).
5.Coordinator assigns work to the worker nodes closest to the data
6.Workers read data through connector plugin.
7.Workers run these task on memory.
8.Client gets the result from the worker.
Difference Hive Vs Presto:
Hive translates queries into multiple stages of Map Reduce tasks that execute one after another.
Each task reads inputs from disk and writes intermediate output back to disk.
That does not allow it to avoid the unnecessary I/O and associated latency overhead.
Presto engine does not use MapReduce . It employs a custom query and execution engine with operators designed to support SQL semantics.
The pipelined execution model runs multiple stages at once, and streams data from one stage to the next as it becomes available.
All processing is in memory and pipelined across the network between stages. It avoid the I/O and latency overhead.
Pluggable Connector and Supported File Format:Presto supports pluggable connectors that provide data for queries that get executed in Presto.
•Apache Hadoop 1.x
•Apache Hadoop 2.x
•Custom S3 Integrator for Hadoop
•Cloudera CDH 4
•Cloudera CDH 5
Supported File Format:
These three serves has to be on.