Tuesday, September 26, 2017

What is the wasb(s) protocol for accessing Azure blob storage?

WASB Protocol

I didn't get a chance to study the wasb(s) protocol before I had to use it for HDInsight related tasks, so the initial period was trial and error. That led me to decide to write a detailed post on the wasb protocol, since it helps us especially when we are doing Hadoop in the Azure world.

But before writing, I thought of googling for similar work. Why should there be one more post on the internet with the same content? The results were interesting. Below are the links explaining what the wasb(s) protocol for accessing Azure blob storage is.

https://blogs.msdn.microsoft.com/cindygross/2015/02/03/why-wasb-makes-hadoop-on-azure-so-very-cool/
https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/

But these did not answer all the questions I had, so I decided to add those answers here.

Can .NET access the wasbs:// URL?

.NET doesn't really need to access Azure blob storage via the wasbs protocol. It can access it via the SDK using the corresponding object model, or it can access it using the https protocol.
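For a quick comparison, below is a minimal sketch of how the same blob could be addressed from the two worlds. The storage account name mystorageaccount, container mycontainer and blob path are placeholders made up for illustration; the wasbs form follows the container@account convention used by the Hadoop WASB driver.

// Hypothetical storage account, container and blob names, for illustration only.
// .NET / SDK / REST world: the blob is reachable over https.
val httpsUrl = "https://mystorageaccount.blob.core.windows.net/mycontainer/data/customerdata.parquet"

// Hadoop / Spark / HDInsight world: the same blob is addressed with the wasbs scheme.
val wasbsUrl = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data/customerdata.parquet"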


Tuesday, September 19, 2017

SQL 2016 - Create external polybase table on parquet file stored in Azure blob storage

This is a walkthrough on creating an external PolyBase table in SQL 2016 which stores its data in Azure blob storage using the parquet file format.

Prerequisite

The prerequisite is basic knowledge of SQL Server and Microsoft Azure.

Use cases

One of the best use cases is to move data from SQL Server transactional systems to Azure blob storage. In the data analytics world we rarely see transactional relational databases, so we have to move data between the two. For example, if we have an internal or normal transactional table, we can create an external PolyBase table with the same schema and do an insert..select to move data from SQL Server to unstructured stores such as Azure blob or Hadoop.

Azure storage account and blob setup

This is as straightforward as creating an Azure storage account and blob container for any other purpose. Note down the storage account key and the container name; they will be used in SQL to set up the connection.

Installing SQL 2016 with Polybase support

PolyBase support came with SQL 2016. This Microsoft SQL Server 2016 feature requires a Java runtime!!! Yes. PolyBase seems to be a wrapper over Java libraries, and those need a JVM to run. The minimum required version for running PolyBase at the time of writing this post is JRE 7 Update 51.

Once we have the JRE 1.7 version installed, we can install SQL 2016 with polybase support. Polybase has to be selected manually during the installation as it is not selected by default.

SQL Server level Settings

There are some settings to be configured at the SQL Server level.

Enable Polybase

The below query will enable polybase export.

-- Enable INSERT into external table  
sp_configure 'allow polybase export', 1;  
reconfigure

Create password

This creates the database master key, which is needed to encrypt the credential secret in the next step.

-- Create a master key.
-- Only necessary if one does not already exist.
-- Required to encrypt the credential secret in the next step.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Password>';

Database level settings

This is the step where the credential towards the Azure blob storage is setup in SQL Server database.

-- Create a database scoped credential
-- IDENTITY: Provide any string, it is not used for authentication to Azure storage.
-- SECRET: Provide your Azure storage account key.
go
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = '<storage account key>'
; 
This doesn't specify the location where the data resides. It just opens up the entire storage account to SQL Server.

Create data source

Now we can use the stored credential to map to the Azure blob container path. This just tells where in the storage account the data goes. The wasbs:// protocol is used to point to the blob storage path.

-- Create an external data source
-- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure blob storage.
-- LOCATION: Provide Azure storage account name and blob container name.
-- CREDENTIAL: Provide the credential created in the previous step.

-- Example with a sample storage account; container name: polytest, storage account name: parquetstore
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://polytest@parquetstore.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
); 

Create file format

One more piece of setup required before creating the table is the creation of the file format. This tells SQL Server which format to use to store the table data.

-- Create an external file format
-- FORMAT_TYPE: Type of file format in Azure storage (supported: DELIMITEDTEXT, RCFILE, ORC, PARQUET).
-- FORMAT_OPTIONS: Specify field terminator, string delimiter, date format etc. for delimited text files.
-- Specify DATA_COMPRESSION method if data is compressed.

CREATE EXTERNAL FILE FORMAT [ParquetFileFormatSnappy]  
WITH (  
    FORMAT_TYPE = PARQUET,  
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'  
);

The data compression codec used here is Snappy. There are others as well, such as Gzip (org.apache.hadoop.io.compress.GzipCodec).

Create external table

This is the last step. It creates the external table with the file format, storage location and data source defined above.

-- CREATE AN EXTERNAL TABLE WITH PARQUET FILE.   
--A: Create the external table
-- Specify column names and data types. This needs to match the data in the sample file.
-- LOCATION: Specify path to file or directory that contains the data (relative to the blob container).
-- To point to all files under the blob container, use LOCATION='.'
CREATE EXTERNAL TABLE [dbo].[SampleTable_Parquet_External] (  
        [customerName] [nvarchar](100) NULL,
        [customerNumber] [varchar](70) NULL
)  
WITH (  
        LOCATION='/customerdata.parquet',  
        DATA_SOURCE = AzureStorage,  
        FILE_FORMAT = [ParquetFileFormatSnappy],  
        REJECT_TYPE = VALUE,  
        REJECT_VALUE = 0  
);

CRUD operations on external table

As stated in the PolyBase documentation, it doesn't support DML operations such as update and delete statements; we can only insert.

--Insert Data to the external table
INSERT INTO [dbo].[SampleTable_Parquet_External]
select * from dbo.[SampleTable_Internal]

Things to remember about polybase

The API seems granular. We can create the file format, data source and so on separately and use those when creating tables.

As stated earlier, it is mainly useful in big data analytics. But when we tried moving more than 300,000 records from SQL Server to blob storage, we could see it just breaks. We are still troubleshooting with Microsoft; hopefully it is something we did wrong.

Another issue we faced was about data types. The PolyBase engine doesn't know all the data types and cannot move them. There are settings (the REJECT_TYPE and REJECT_VALUE options used above) to say how many mismatches it should handle/forgive before it fails.

More limitations can be found here.


Tuesday, September 5, 2017

Starting Scala Spark - Read write to parquet file

Introduction

This is a post to index information related to the parquet file format and how Spark can use it. Since there are already many tutorials for performing various operations in this context, this post mainly consolidates the links.

What is parquet file?

It is a columnar storage format that can contain the schema along with the data, and it supports various encodings, compressions etc.
The file format specification is from Apache. There is good support in the Java world, and .Net seems to be catching up with Parquet. Since it is mainly for the data analysis world, it is not recommended for use in transactional systems.

Read write to local

From Spark we can read and write parquet files using the methods given in the link below; a minimal sketch follows the link.
https://community.hortonworks.com/articles/21303/write-read-parquet-file-in-spark.html
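For quick reference, a minimal sketch of a local parquet write and read, assuming Spark 2.x. The path and column names below are made up for illustration.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParquetLocalDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Build a tiny DataFrame and write it out as parquet.
val customers = Seq(("C001", "Alice"), ("C002", "Bob")).toDF("customerNumber", "customerName")
customers.write.mode("overwrite").parquet("/tmp/customers.parquet")

// Read the same parquet data back; the schema travels with the file.
val readBack = spark.read.parquet("/tmp/customers.parquet")
readBack.show()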

Read write to parquet present in Azure blob storage

Below is a tutorial which explains how a local Spark cluster can be used to access Azure blob storage.

https://blogs.msdn.microsoft.com/arsen/2016/07/13/accessing-azure-storage-blobs-from-spark-1-6-that-is-running-locally/

It explains writing text files. If we just use the parquet method instead, Spark will write the data in parquet format to Azure blob storage.
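A minimal sketch of that, assuming the storage account key is handed to the Hadoop configuration the same way the linked post does it. The account name parquetstore, container polytest and the key are placeholders, and the hadoop-azure / azure-storage jars discussed in the next section need to be on the classpath.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParquetToBlobDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hand the storage account key to the WASB driver (placeholder account name and key).
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.parquetstore.blob.core.windows.net",
  "<storage account key>")

// Writing parquet to a wasbs:// path works the same as writing to a local path.
val df = Seq(("C001", "Alice")).toDF("customerNumber", "customerName")
val blobPath = "wasbs://polytest@parquetstore.blob.core.windows.net/customerdata.parquet"
df.write.mode("overwrite").parquet(blobPath)

// Reading it back uses the same wasbs path.
spark.read.parquet(blobPath).show()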

Version compatibility

To get connectivity to Azure from Spark, Spark has to know about the Azure libraries. They are hadoop-azure-v#.v#.v#.jar and azure-storage-v#.v#.v#.jar. The above link explains using the Spark 1.6 version with hadoop-azure-2.7.0.jar and azure-storage-2.0.0.jar respectively. Both of these libraries can be downloaded from http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.0/. How to link these libraries to Spark is clearly explained in the link.

At the time of writing this post the latest Spark version is 2.2.0. So what should the Hadoop version be? Should it be the latest Hadoop version, 2.8.1? There is a tendency to use Hadoop 2.8.1, but if we use it we may get a NoSuchMethodError exception. The stack trace is below, with the root callers stripped.

java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;
    at org.apache.hadoop.fs.azure.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:45)
    at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.getAccountKeyFromConfiguration(AzureNativeFileSystemStore.java:852)
    at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:932)
    at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
    at ...

As a true developer, the way to troubleshoot this is to look at the source code of hadoop-azure and the related packages to see what changed between versions and use the proper one. Otherwise, do trial and error until one combination works.

The below combination seems to work:
Spark 2.2.0 with the Hadoop 2.7.0 libraries from http://apache.claz.org/hadoop/common/hadoop-2.7.0/

Dealing with parquet files of different encoding

Writing parquet files with specified codec

The DataFrame write API accepts the compression codec as an option, for example:

df.write.format("parquet").option("compression", "gzip").save("<path>")

Reading parquet with specified codec

While working with parquet through sqlContext, we can also use setConf() to specify the compression codec:

sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
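Putting the two together, a minimal sketch, assuming a SparkSession named spark and a DataFrame named df as in the earlier sketches; the paths are placeholders.

// Write with an explicit codec through the DataFrameWriter option.
df.write.format("parquet").option("compression", "gzip").mode("overwrite").save("/tmp/customers_gzip.parquet")

// Or set the codec globally; spark.conf.set is the Spark 2.x counterpart of sqlContext.setConf.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
df.write.mode("overwrite").parquet("/tmp/customers_gzip_global.parquet")

// Reads do not need the codec spelled out; parquet files record it in their metadata.
val gz = spark.read.parquet("/tmp/customers_gzip.parquet")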

More details can be found in the official Spark documentation.

More links

https://gist.github.com/aseigneurin/59844ac82da93eb9c7623931d3412783
https://www.youtube.com/watch?v=_0Wpwj_gvzg