Access Azure Blob Storage from Spark

Over the weekend I was working on a big data platform, and during the POC I found that library versions matter a great deal on such platforms. If you use mismatched versions, things will not work as expected.

This blog will help you integrate Apache Spark with Azure Blob Storage as a data lake.

For this activity we need the following:
- Azure account (Blob Storage)
- Linux instance
- Spark 3.1.2

Let’s start with Azure Blob Storage. You need an Azure account, in which you create a storage account like the one below

Azure Storage Account

Now we will create a container. Spark reads and writes its data inside a container of the storage account

Azure storage container: store spark data
Access Keys

Note down one of the storage account's access keys. Spark will use it, through core-site.xml, to access Blob Storage.

That completes the first half of our POC, so let’s move on to the next task: Apache Spark. To set up Spark we need Java installed on the server; I am assuming that everyone is aware of how to install Java on Linux. Here I am using CentOS 7.

java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)

Download Apache Spark from the official website using the link below

wget https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xvf spark-3.1.2-bin-hadoop3.2.tgz
mv spark-3.1.2-bin-hadoop3.2 /opt/spark

We downloaded the tar file, extracted it, and moved it to the /opt directory. Now define the environment variables below in the ~/.bashrc file

vim ~/.bashrc
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk"
export SPARK_HOME="/opt/spark"
export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$PATH
export PATH=$JAVA_HOME/bin:$PATH
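After reloading the shell (for example with source ~/.bashrc), it can help to confirm that both variables point at real installations before going further. The snippet below is only a minimal sketch of such a check; the function name check_spark_env is made up for this post, and it assumes the usual layout in which java lives under $JAVA_HOME/bin and spark-submit under $SPARK_HOME/bin.

```python
import os
from pathlib import Path


def check_spark_env(env=None):
    """Return a list of problems found with JAVA_HOME and SPARK_HOME.

    `env` defaults to the current process environment; passing a dict
    makes the function easy to try out without touching the real shell.
    """
    env = os.environ if env is None else env
    problems = []
    # Each variable is paired with a file we expect to find below it.
    expected = (("JAVA_HOME", "bin/java"), ("SPARK_HOME", "bin/spark-submit"))
    for var, marker in expected:
        value = env.get(var)
        if not value:
            problems.append(f"{var} is not set")
        elif not (Path(value) / marker).exists():
            problems.append(f"{var}={value} does not contain {marker}")
    return problems


if __name__ == "__main__":
    for line in check_spark_env() or ["environment looks fine"]:
        print(line)
```

An empty list means both directories look as expected; anything else points at the variable that needs fixing.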

Now we need the Azure jar libraries so that Spark can access Azure Blob Storage. For that, we need the jar files below

- azure-storage-2.0.0.jar
- azure-storage-blob-12.0.0.jar
- hadoop-azure-2.7.7.jar

wget https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/2.0.0/azure-storage-2.0.0.jar -O /opt/spark/jars/azure-storage-2.0.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-blob/12.0.0/azure-storage-blob-12.0.0.jar -O /opt/spark/jars/azure-storage-blob-12.0.0.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.7/hadoop-azure-2.7.7.jar -O /opt/spark/jars/hadoop-azure-2.7.7.jar
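The three URLs above all follow the standard Maven Central layout (group path, artifact, version, then artifact-version.jar). Since picking the right versions is the fiddly part, it can be convenient to keep the coordinates in one place and derive the URLs from them. The following is a sketch under that assumption; maven_jar_url, AZURE_JARS and download_jars are names invented for this example, not part of Spark or any Azure library.

```python
import os
import urllib.request

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

# (group id, artifact id, version) for the jars used in this post.
AZURE_JARS = [
    ("com.microsoft.azure", "azure-storage", "2.0.0"),
    ("com.azure", "azure-storage-blob", "12.0.0"),
    ("org.apache.hadoop", "hadoop-azure", "2.7.7"),
]


def maven_jar_url(group_id, artifact_id, version, repo=MAVEN_CENTRAL):
    """Build the download URL of a jar from its Maven coordinates."""
    group_path = group_id.replace(".", "/")
    return f"{repo}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


def download_jars(jars_dir="/opt/spark/jars", jars=AZURE_JARS):
    """Download every jar in `jars` into `jars_dir` (needs network access)."""
    for group_id, artifact_id, version in jars:
        target = os.path.join(jars_dir, f"{artifact_id}-{version}.jar")
        urllib.request.urlretrieve(maven_jar_url(group_id, artifact_id, version), target)
        print("downloaded", target)
```

Calling download_jars() does the same work as the three wget commands; changing a version then means editing a single tuple.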

Now we need to set up core-site.xml with the Azure access key, as below

vim /opt/spark/conf/core-site.xml
-------
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.AbstractFileSystem.wasb.impl</name>
    <value>org.apache.hadoop.fs.azure.Wasb</value>
  </property>
  <property>
    <!-- Replace StorageAccountName with the name of your storage account -->
    <name>fs.azure.account.key.StorageAccountName.blob.core.windows.net</name>
    <value>xxxxx....</value> <!-- AccessKey -->
  </property>
  <property>
    <name>fs.azure.block.blob.with.compaction.dir</name>
    <value>/hbase/WALs,/data/myblobfiles</value>
  </property>
  <property>
    <name>fs.azure</name>
    <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
  </property>
  <property>
    <name>fs.azure.enable.append.support</name>
    <value>true</value>
  </property>
</configuration>
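If you would rather not keep the access key in a file on disk, Spark also copies any option prefixed with spark.hadoop. into the Hadoop configuration, so the key can be supplied when the session is built. This is only a sketch of that alternative, not what the POC used; azure_spark_conf and build_session are hypothetical helper names, and it covers just the account key and the append setting from the file above.

```python
def azure_spark_conf(account_name, access_key):
    """Translate the core-site.xml entries into spark.hadoop.* options."""
    return {
        f"spark.hadoop.fs.azure.account.key.{account_name}.blob.core.windows.net": access_key,
        "spark.hadoop.fs.azure.enable.append.support": "true",
    }


def build_session(account_name, access_key, app_name="azure-blob-poc"):
    """Create a SparkSession that carries the Azure settings."""
    # Imported lazily so azure_spark_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in azure_spark_conf(account_name, access_key).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For example, build_session("sparkpoc", os.environ["AZURE_STORAGE_KEY"]) would read the key from an environment variable; the variable name here is an assumption, so use whatever secret store you already have.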

Let’s try to access Azure Blob Storage from Spark.

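from pyspark.sql import SparkSession

# Create the session (or reuse the one the pyspark shell already provides).
spark = SparkSession.builder.appName("mysql-to-azure-blob").getOrCreate()

# Read the source table over JDBC; the MySQL Connector/J jar must be on Spark's classpath.
df = spark.read.format("jdbc").option("url", "jdbc:mysql://xxxxxxxxxxxxxx:3306/database_name") \
    .option("driver", "com.mysql.cj.jdbc.Driver").option("dbtable", "table") \
    .option("user", "anverma").option("password", "xxxxxxxxxxx").load()

# Write the table as Parquet to wasbs://<container>@<storage account>.blob.core.windows.net/<path>.
df.write.mode("overwrite").parquet("wasbs://sparkpoc@sparkpoc.blob.core.windows.net/test")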
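The target of the write has the form wasbs://<container>@<storage account>.blob.core.windows.net/<path>; in this POC both the container and the storage account happen to be called sparkpoc. As a rough sketch, the small helpers below build such a URL and read the data back to confirm the write landed. Both wasbs_url and read_back are illustrative names made up for this post.

```python
def wasbs_url(container, account, path=""):
    """Build a wasbs:// URL for a path inside a Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def read_back(spark, url):
    """Read the Parquet data at `url` and return its row count.

    `spark` is an existing SparkSession that already has the Azure
    access key configured.
    """
    return spark.read.parquet(url).count()
```

With the session from the script above, read_back(spark, wasbs_url("sparkpoc", "sparkpoc", "test")) should return the same number of rows as df.count().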

TL;DR

Create an Azure Storage Account. Download Apache Spark and the required Azure library jar files. Add the access key to core-site.xml.
Please make sure the jar versions you use are compatible with your Spark build.
