Saturday, January 10, 2015

Big Data Sources and Origins

Data are being generated from many sources, and the world can no longer ignore them. They hold information and insight that can improve the way we live, the way we do business, and the way we visualise our planet. But that information is not simply sitting there waiting for us to go and collect it.

First, we need viable media on which to store this galaxy of data, and we need enough computing power to process it.


These data come from various sources, but to qualify as big data they must exhibit the following three traits to a certain degree:

  • Velocity
  • Variety 
  • Volume

I have given a few examples for each of these data sources.




I have given only a brief overview of the various data sources here. I will keep adding more information on them in the near future.

Wednesday, January 7, 2015

Storage: DAS, NAS and SAN


Data is the soul of any organisation. If the soul is bruised or tarnished, the whole body is affected; likewise, if data is compromised, the organisation is at risk. So it matters a great deal how you store your data. Data is among the most vulnerable assets of the 21st century, and a company's reputation and revenue now depend more than ever on the way it stores its data.

DAS (Direct-Attached Storage): As the name says, the storage is attached directly to the server.

So, if I have four servers, I have four sets of storage, each attached to its own server.
If one server goes down, the data residing on its attached storage cannot be accessed
or transferred.

But because it has a low initial setup cost, a business running in a localised environment
can opt for this form of storage.


NAS (Network-Attached Storage): Here the storage is attached directly to the network and is fully
capable of serving files over it, whereas with DAS the server has to play the dual role of
sharing files and providing its other services.


SAN (Storage Area Network): A high-performance storage network that transfers data between servers
and storage devices. Here the storage devices are separate from the local area network. Because its
sophistication and cost are higher, it is used in mission-critical applications.

Here the pool of networked storage is connected to the servers over Fibre Channel, which is
inherently fast.

I will discuss Fibre Channel and SCSI in a future post.

Tuesday, January 6, 2015

HDFS-Enabled Storage, Data Lake, Data Hub

HDFS-enabled storage such as EMC Isilon shortens the process, because we no longer need to migrate the data into a separate big data architecture.

By layering the data hub (Cloudera Enterprise) over the data lake (Isilon),
Cloudera and EMC believe they can remove the cycle of moving data to a separate big data infrastructure.


Data hub: Cloudera defines it as an engineered system designed to analyze and process big data in place, so it needn’t be moved to be used.

Data lake: an existing, very large repository where huge amounts of data are stored and managed, but from which data traditionally has to be moved out to a big data infrastructure to be analysed.

Understanding Data Locality

In 1990, a typical disk drive had a capacity of about 1.3 GB and a read speed of 4.4 MB/s, so reading the whole drive took about 5 minutes.

Now a typical disk drive holds 1 TB, roughly a thousand times (10^3) more.
The read speed is close to 100 MB/s, so reading the whole drive takes around

1 TB / (100 MB/s) = 10^4 s, which is approximately 2 hours 45 minutes.
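
The arithmetic can be checked with a small Python sketch. It assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes, 1 TB = 10^12 bytes); the function names are made up for this illustration.

```python
# Decimal storage units (an assumption; drive vendors usually quote these).
MB = 10**6
GB = 10**9
TB = 10**12


def read_time_seconds(capacity_bytes, throughput_bytes_per_sec):
    """Time to read a whole drive sequentially at a constant rate."""
    return capacity_bytes / throughput_bytes_per_sec


def format_duration(seconds):
    """Render a number of seconds as hours, minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


t_1990 = read_time_seconds(1.3 * GB, 4.4 * MB)  # the 1990 drive
t_now = read_time_seconds(1 * TB, 100 * MB)     # the 1 TB drive

print("1990 drive:", format_duration(t_1990))
print("1 TB drive:", format_duration(t_now))
print("capacity grew about", round(TB / (1.3 * GB)), "times")
print("read speed grew about", round((100 * MB) / (4.4 * MB)), "times")
```

Capacity has grown far faster than read speed, which is why reading a full drive now takes hours instead of minutes.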

So it makes sense to use a distributed system. But in a traditional distributed system the data moves from storage to the computing node, so even though processor speed has improved a lot, the processor still has to wait for the data to arrive. This is where Hadoop comes in: the data does not move to the node where the computation happens; instead, the code moves to the node where the data resides. This is called data locality.
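
To make the difference concrete, here is a toy Python comparison of network traffic under the two approaches. It is only a sketch: the block size, the code size, the node names and the block layout are all hypothetical, and it ignores replication and any traffic for collecting results.

```python
# Hypothetical figures, chosen only for illustration.
BLOCK_SIZE_MB = 128  # size of each data block
CODE_SIZE_MB = 1     # size of the job's code

# Hypothetical layout: which node stores which block.
block_locations = {
    "block-0": "node-a",
    "block-1": "node-b",
    "block-2": "node-c",
    "block-3": "node-a",
}


def network_mb_move_data(locations, compute_node):
    """Traditional approach: ship every remote block to one compute node."""
    return sum(BLOCK_SIZE_MB for node in locations.values() if node != compute_node)


def network_mb_move_code(locations):
    """Data-local approach: ship the code once to each node that holds a block."""
    return CODE_SIZE_MB * len(set(locations.values()))


print("move data to node-a:", network_mb_move_data(block_locations, "node-a"), "MB")
print("move code to data:  ", network_mb_move_code(block_locations), "MB")
```

With this layout, moving the data sends two 128 MB blocks over the network, while moving the code sends three 1 MB copies. The gap widens as the data grows, because the code stays the same size.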