Monday, September 21, 2015

Presto


Introduction:

As we all know, Facebook is a data-driven company with more than a billion users. It runs one of the largest data warehouses in the world, holding more than 300 petabytes of data. Crunching that data and getting meaning out of it requires a lot of queries to be executed. Being able to run more queries and get results faster saves a lot of money and eventually adds up in revenue.


Facebook's warehouse data is managed inside giant Hadoop clusters, where MapReduce and Hive are used for large-scale computation. Faster throughput is always welcome, but as the warehouse grows, with petabytes of data being added while the demand for near-real-time analytics keeps increasing, there is a need for an analytical engine that can deliver low query latency under this pressure.

So Facebook decided to build Presto:

•A new interactive query system that can operate fast at petabyte scale.
•A SQL-on-Hadoop engine, claimed to be up to 10 times faster than Hive for large-scale queries, and now open source.
•It does not rely on MapReduce.
•Facebook claims it is 10 times faster than Hive, which is good enough to attract eyeballs.

Architecture of Presto:

•Presto CLI
•Presto Coordinator
•Presto Worker
•Discovery Service
•Connector Plugin

Steps of Execution:

1. The coordinator and workers register with the discovery service, which is how Presto finds the nodes in the cluster.
2. The client sends a query over HTTP to the Presto coordinator (see the client-side sketch just after this list).
3. The coordinator parses, analyses, and plans the query.
4. The connector plugin provides metadata (the table schema) to the coordinator.
5. The coordinator assigns tasks to the worker nodes closest to the data.
6. Workers read data through the connector plugin.
7. Workers run these tasks in memory.
8. The client gets the result back from the workers.
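
To make step 2 concrete, here is a minimal client-side sketch in Python: it POSTs the SQL text to the coordinator over HTTP and then polls for the result. The coordinator address (localhost:8080), the user name, and the table name are placeholders I am assuming for illustration, not something taken from a real setup.

    # A minimal sketch of the client side of steps 2 and 8: POST the SQL
    # text to the coordinator, then follow nextUri until the query is done.
    # Coordinator URL, user name, and table name are assumptions.
    import requests

    COORDINATOR = "http://localhost:8080"

    def run_query(sql, user="blog-demo"):
        resp = requests.post(
            COORDINATOR + "/v1/statement",
            data=sql,
            headers={"X-Presto-User": user},
        ).json()

        rows = []
        # The response carries a nextUri while the query is still running;
        # each poll may also carry a chunk of result rows in "data".
        while "nextUri" in resp:
            resp = requests.get(resp["nextUri"]).json()
            rows.extend(resp.get("data", []))
        return rows

    if __name__ == "__main__":
        print(run_query("SELECT COUNT(*) FROM hive.default.example"))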



Difference: Hive vs Presto

Hive:
•Hive translates queries into multiple stages of MapReduce tasks that execute one after another.
•Each task reads its input from disk and writes intermediate output back to disk.
•That makes it impossible to avoid the unnecessary I/O and the latency overhead that comes with it.

Presto:
•The Presto engine does not use MapReduce. It employs a custom query and execution engine with operators designed to support SQL semantics.
•The pipelined execution model runs multiple stages at once and streams data from one stage to the next as it becomes available.
•All processing is in memory and pipelined across the network between stages, which avoids the disk I/O and latency overhead.
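
To make the contrast concrete, here is a toy Python illustration (not actual Hive or Presto code): the staged version materialises each stage's full output before the next stage starts, the way chained MapReduce jobs spill to disk, while the pipelined version chains generators so rows stream through all stages as they become available.

    # Toy illustration only: staged (Hive/MapReduce-style) vs pipelined
    # (Presto-style) execution over a stream of fake (user_id, bytes) rows.
    def make_rows(n=1_000_000):
        return ((i % 100, i) for i in range(n))

    def staged(rows):
        # Each stage materialises its full output (on disk in real
        # MapReduce, as an in-memory list here) before the next stage starts.
        stage1 = [(u, b) for u, b in rows if b % 2 == 0]   # filter stage
        stage2 = [b for _, b in stage1]                    # project stage
        return sum(stage2)                                 # aggregate stage

    def pipelined(rows):
        # Stages are chained generators, so each row flows through
        # filter -> project -> aggregate with no intermediate materialisation.
        filtered = ((u, b) for u, b in rows if b % 2 == 0)
        projected = (b for _, b in filtered)
        return sum(projected)

    if __name__ == "__main__":
        print(staged(make_rows()), pipelined(make_rows()))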

Connectors:

Pluggable Connectors and Supported File Formats: Presto supports pluggable connectors that provide the data for queries executed in Presto. The supported platforms include:

•Apache Hadoop 1.x
•Apache Hadoop 2.x
•Custom S3 Integrator for Hadoop
•Cloudera CDH 4
•Cloudera CDH 5
•Cassandra
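
Each configured connector shows up as a catalog, and a query addresses tables as catalog.schema.table, so the same SQL can reach Hive and Cassandra data alike. Below is a small sketch using the same HTTP protocol as the earlier example; the catalog names hive and cassandra and both table names are assumptions made purely for illustration.

    # Assumes a coordinator at localhost:8080 with catalogs named "hive" and
    # "cassandra" configured; the table names below are made up.
    import requests

    def run_query(sql, coordinator="http://localhost:8080", user="blog-demo"):
        # Same HTTP protocol as the sketch in the execution-steps section.
        resp = requests.post(coordinator + "/v1/statement", data=sql,
                             headers={"X-Presto-User": user}).json()
        rows = []
        while "nextUri" in resp:
            resp = requests.get(resp["nextUri"]).json()
            rows.extend(resp.get("data", []))
        return rows

    # The catalog prefix picks the connector; the SQL stays the same.
    print(run_query("SELECT COUNT(*) FROM hive.web.page_views"))
    print(run_query("SELECT COUNT(*) FROM cassandra.analytics.events"))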

Supported File Formats:

•Text
•SequenceFile
•RCFile
•ORC


Start Presto:

These three services have to be running:

•Coordinator
•Worker
•Discovery Service

Saturday, January 10, 2015

Big Data Sources and Origins

Data is being generated from various sources. The world is now confronted with a situation where it cannot ignore this data. It holds information and insight that can improve the way we live, the way we do business, and the way we visualise our planet. But it is not as if that information is just sitting there waiting for us to go and get it.

First, we need viable mediums where we can store this galaxy of data, and we also need enough computing power to dare to process it.


This data gets generated from many different sources.

But for any of this data to qualify as big data, it has to exhibit the three traits below to a certain level.

  • Velocity
  • Variety 
  • Volume

I have given a few examples for each of these data sources.




I have just given brief ideas on various data sources. I will keep adding more information on them in the near future.

Wednesday, January 7, 2015

Storage: DAS, NAS and SAN


Data is the soul of any organisation. If the soul gets bruised or tarnished, the whole body is affected; likewise, if data gets compromised, the organisation is at risk. So it matters a lot how you store your data. Data is the most vulnerable asset in this 21st century, and a company's reputation and revenue now depend, like never before, on the way it stores its data.

DAS (Direct Attached Storage): As the name says, the storage is directly attached to the server.

So let's say I have four servers; then I have four sets of storage, each attached individually
to one of those servers. If one server goes down, the data residing in its attached storage
cannot be accessed.

But as it has a lower initial setup cost, a business running in a localised environment
can go for this form of storage.


NAS (Network Attached Storage): Here the storage is attached directly to the network and is fully
capable of serving files over the network on its own, whereas in the case of DAS the server has to
play the dual role of sharing files and providing its own services.


SAN (Storage Area Network): A high-performance storage network that transfers data between servers
and storage devices. Here the storage devices sit on a network separate from the local area network.
As the degree of sophistication and the cost are higher, it is used for mission-critical applications.

Here the pool of networked storage is connected to the servers over Fibre Channel, whose inherent
property is speed.

I will discuss Fibre Channel and SCSI in a future post.

Tuesday, January 6, 2015

HDFS-enabled Storage, Data Lake, Data Hub

HDFS-enabled storage such as EMC Isilon will surely cut short the process, as we no longer need to migrate the data into a separate big data architecture.

By layering this data hub (Cloudera Enterprise) over this data lake (Isilon),
Cloudera and EMC believe they can remove the cycle of moving data to a separate big data infrastructure.


Data hub: Cloudera defines it as an engineered system designed to analyse and process big data in place, so it needn't be moved to be used.

Data lake: an already existing huge repository where a huge amount of data gets stored and managed, but traditionally the data has to be moved out of this lake into a big data infrastructure to be analysed.

Understanding Data Locality

In 1990, a normal disk drive used to have a capacity of 1.3 GB and a read speed of about 4.4 MB/sec, so reading the whole drive took around 5 minutes.

Now a normal disk drive holds 1 TB, roughly a thousand-fold increase in capacity,
while the read speed is close to 100 MB/sec. So reading the whole drive takes around

1 TB / (100 MB/sec) = 10^4 sec, which is approximately 2 hours 45 minutes.
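
A quick Python back-of-the-envelope check of these scan times, using the capacities and read speeds quoted above:

    # Back-of-the-envelope check of the full-drive scan times quoted above
    # (1 GB taken as 1000 MB for simplicity).
    def scan_time_seconds(capacity_gb, read_mb_per_sec):
        return capacity_gb * 1000 / read_mb_per_sec

    old = scan_time_seconds(1.3, 4.4)      # 1990-era drive
    new = scan_time_seconds(1000, 100.0)   # ~1 TB drive at ~100 MB/sec

    print(f"1990 drive: ~{old / 60:.0f} minutes")   # ~5 minutes
    print(f"1 TB drive: ~{new / 3600:.1f} hours")   # ~2.8 hours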

It makes sense to use a distributed system. But in a traditional distributed system, data moves from the storage to the computing node, so even if processor speed has improved a lot, the processor still has to wait for the data to arrive. That is where the Hadoop distributed model comes in: the data does not move to the node where the computation happens; instead, the code moves to the data. This is what is called data locality.