Writing Data to HDFS

  • User tells the client to write data to HDFS (the chunk size & replication factor need to be specified)
  • Client divides the file into chunks
  • Client tells the Namenode that a chunk needs to be written with the given size and replication factor
  • Namenode finds the necessary Datanodes and sends their addresses to the Client
  • Client writes data to the 1st Datanode. If replication is required, the 1st Datanode directly sends the data to the next Datanode in the chain (Replication Pipeline)
  • Once the block is written to all the Datanodes, they send a Done message to the Namenode
  • If there are more blocks, the above steps are repeated for each of them
  • Once all blocks are written, the Client tells the Namenode the operation is complete (a client-side code sketch follows this list)
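On the client side, all of this is hidden behind Hadoop's FileSystem API. A minimal Java sketch, assuming a reachable HDFS configured via core-site.xml/hdfs-site.xml; the path, buffer size, block size and replication values below are just example numbers:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.nio.charset.StandardCharsets;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);          // client handle; talks to the Namenode

            Path path = new Path("/user/demo/sample.txt"); // example path
            short replication = 3;                         // replication factor for this file
            long blockSize = 128L * 1024 * 1024;           // 128 MB block ("chunk") size

            // create() asks the Namenode for target Datanodes; the returned stream
            // writes to the 1st Datanode, which forwards data down the replication pipeline
            try (FSDataOutputStream out = fs.create(path, true, 4096, replication, blockSize)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }                                              // close() completes the write with the Namenode
        }
    }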

NOTE

  • When data is split into blocks and the max block size is reached in the middle of a row, reading continues until the end of that line before a new block is created
  • So a block can end up slightly bigger than the configured max block size

Replication Placement Strategy

  • Hierarchy: Cluster → Racks → Nodes
  • For the 1st replica, if the client is running on a node in the cluster, write to that node; otherwise use a random Datanode
  • For the next 2 replicas, pick a rack different from the 1st one and write to two different nodes on that rack
  • For subsequent replicas the following rules need to be satisfied: at most 1 replica per node, at most 2 replicas per rack

NOTE

Sometimes these conditions cannot be met, in which case they are relaxed. We can plug in our own placement algorithm as well (a configuration sketch follows below)
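A minimal configuration sketch in Java; dfs.replication is the standard property for the default replication factor, while the exact key for swapping in a custom placement policy (dfs.block.replicator.classname) and the policy class com.example.MyPlacementPolicy are assumptions here:

    import org.apache.hadoop.conf.Configuration;

    public class PlacementConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Default replication factor for new files (standard HDFS property)
            conf.set("dfs.replication", "3");

            // Point the Namenode at a custom placement algorithm.
            // Key name is assumed; com.example.MyPlacementPolicy is a hypothetical
            // class that would extend Hadoop's BlockPlacementPolicy.
            conf.set("dfs.block.replicator.classname",
                     "com.example.MyPlacementPolicy");

            System.out.println("replication = " + conf.get("dfs.replication"));
        }
    }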

Reading Data from HDFS

  • Client sends a request to the Namenode asking for information about the required file
  • The locations of the data blocks are sent back to the client (the nodes are arranged in increasing order of distance from the client)
  • The client now knows how much data needs to be downloaded and where it is located
  • It accesses the nodes one by one and downloads the blocks (a client-side code sketch follows this list)
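Again, the FileSystem API hides the block-by-block mechanics. A minimal Java sketch, assuming the same example file as above; getFileBlockLocations() surfaces the per-block Datanode hosts returned by the Namenode, and open() streams the blocks back:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/sample.txt");  // example path

            // Ask the Namenode where the blocks of this file live
            FileStatus status = fs.getFileStatus(path);
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }

            // open() reads the blocks from the (closest) Datanodes one after another
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }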

Diagrams (Google Drive)

  • Writing data to HDFS.png
  • Replica placement strategy.png
  • Reading Data from HDFS.png