Tuesday, September 19, 2017

SQL 2016 - Create external polybase table on parquet file stored in Azure blob storage

This is a walk through on creating an external polybase table in SQL 2016 which stores data in Azure blob storage using parquet file format.


The prerequisite is the basic knowledge about SQL Server and Microsoft Azure.

Use cases

One of the best use case is to move data between SQL Server transaction systems to Azure blob storage. Mainly if we are in data analytics world, there we can rarely see transactional relation databases. So we has to move data in between. eg: if we have internal or normal transaction table we can create external polybase table with same schema and do insert..select to move data from SQL Server to unstructured stores such as Azure blob or hadoop.

Azure storage account and blob setup

This is as straight forward as creating Azure storage account and blob container for any other purpose. Note down the storage account key and the container name to be used in SQL to setup connection.

Installing SQL 2016 with Polybase support

The polybase support came with SQL 2016. This Microsoft SQL Server 2016 feature requires Java runtime!!! Yes. Polybase seems a wrapper over Java libraries and those need JVM to run. Min inum required version for running polybase at the time of writing this post is JRE 7 Update 51

Once we have the JRE 1.7 version installed, we can install SQL 2016 with polybase support. Polybase has to be selected manually during the installation as it is not selected by default.

SQL Server level Settings

There are some settings to be done at the SQL Server level.

Enable Polybase

The below query will enable polybase export.

-- Enable INSERT into external table  
sp_configure 'allow polybase export', 1;  

Create password

This is the master key

-- Create a master key.
-- Only necessary if one does not already exist.
-- Required to encrypt the credential secret in the next step.

Database level settings

This is the step where the credential towards the Azure blob storage is setup in SQL Server database.

-- Create a database scoped credential
-- IDENTITY: Provide any string, it is not used for authentication to Azure storage.
-- SECRET: Provide your Azure storage account key.
    IDENTITY = 'user',
    SECRET = '<storage account key>'
This doesn't tell what is the location the data resides. It just opens entire storage account to SQL Server.

Create data source

Now we can use the stored credentials to map to the Azure blob container path. This just tells where in the Storage account the data goes. The wasbs:// protocol is used to point blob storage path.

-- Create an external data source
-- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure blob storage.
-- LOCATION: Provide Azure storage account name and blob container name.
-- CREDENTIAL: Provide the credential created in the previous step.

--  Example with sample Storage account,container name:polytest,storage account name: parquetstore
    LOCATION = 'wasbs://polytest@parquetstore.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential

Create file format

One more setup required before creating the table is creation of the file format. This is required to tell what format to be used to store the table data.

-- Create an external file format
-- FORMAT_TYPE: Type of file format in Azure storage (supported: DELIMITEDTEXT, RCFILE, ORC, PARQUET).
-- FORMAT_OPTIONS: Specify field terminator, string delimiter, date format etc. for delimited text files.
-- Specify DATA_COMPRESSION method if data is compressed.

WITH (  
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'  

The data compression uses is Snappy. There are others as well.

Create external table

This is the last step. It creates external table with the file format, storage location and data source. 

--A: Create the external table
-- Specify column names and data types. This needs to match the data in the sample file.
-- LOCATION: Specify path to file or directory that contains the data (relative to the blob container).
-- To point to all files under the blob container, use LOCATION='.'
CREATE EXTERNAL TABLE [dbo].[SampleTable_Parquet_External] (  
        [customerName] [nvarchar](100) NULL,
 [customerNumber] [varchar](70) NULL
WITH (  
        DATA_SOURCE = AzureStorage,  
        FILE_FORMAT = [ParquetFileFormatSnappy],  
        REJECT_TYPE = VALUE,  
        REJECT_VALUE = 0  

CRUD operations on external table

As it is stated by polybase technology, it don't support DML operations such as update, delete statements. We can just insert.

--Insert Data to the external table
INSERT INTO [dbo].[SampleTable_Parquet_External]
select * from dbo.[SampleTable_Internal]

Things to remember about polybase

The API seems granular. We can create file format, data source and all separately and use those when creating table.

As stated earlier it is mainly useful in big data analytics. But when we tried moving more than 300,000 records from SQL Server to blob we could see it just breaks. Still troubleshooting with Microsoft. Hopefully it might be something we did wrong.

Another issue faced was about data types. The polybase engine don't know all the data types and can move them. There are settings to say how may mismatches it should handle/forgive before it fails.

More limitations can be found here.

More reading

Tuesday, September 5, 2017

Starting Scala Spark - Read write to parquet file


This is a post to index information related to parquet file format and how Spark can use it. Since there are already many tutorials to perform various operations in the context, this post mainly consolidate the links.

What is parquet file?

It is a columnar storage format, can contain schema along with data, supporting various encoding, compressions etc...
The file format specifications are from Apache. There are good support in the Java world. .Net seems catching up with Parquet. Since it is mainly for data analysis world it is not recommended to use in transnational systems.

Read write to local

From Spark we can read and write to parquet files using the methods given in below link.

Read write to parquet present in Azure blob storage

Below goes a tutorial which explains how local Spark cluster can be used to access Azure blob.


It explains writing text files. If we just use the parquet function Spark will write data to parquet format to Azure blob storage.

Version compatibility

To get Azure connectivity to Azure from Spark it has to know the Azure libraries. They are hadoop-azure-v#.v#.v#.jar,azure-storage-v#.v#.v#.jar. The above link explains using Spark 1.6 version libraries hadoop-azure-2.7.0.jar,azure-storage-2.0.0.jar respectively. Both these libraries can be downloaded from http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.0/. How to link these libraries to Spark is clearly explained in the link.

At the time of writing this post the latest Spark version is 2.2.0. So what should be the Hadoop version? Should it be the latest Hadoop version 2.8.1? There will be a tendency to use Hadoop 2.8.1. But if we use it we may get NoSuchMethodError Exception. Stack trace below stripping root callers.

java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration; at org.apache.hadoop.fs.azure.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:45) at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.getAccountKeyFromConfiguration(AzureNativeFileSystemStore.java:852) at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:932) at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450) at

As a true developer the way to troubleshoot this is to look at the source code of Hadoop.Azure and related packages to see what changes between versions and use the proper one. Else do trial and error and get one combination working.

The below combinations seems working
Spark 2.2.0

Dealing with parquet files of different encoding

Writing parquet files with specified codec

Below link explains how to use different API to write which accepts the encoding.

eg:df.write.format("parquet").option("compression", "gzip")..save(<path>)

Reading parquet with specified codex

While reading parquet using sqlContext, we can use setConf() to mention the encoding.

eg: sqlContext.setConf(“spark.sql.parquet.compression.codec”,”gzip”)

More details can be found in official Spark documentation.

More links


Tuesday, August 29, 2017

Azure Application Insights - End to end correlation from Angular to internal WCF services

Azure AppInsights is one of the PaaS offering which can be easily configured to our application and leverage capabilities. And it really works without any changes in our application. But does it helps us more than any other tools? May be yes with default setup. We will really love it if we can get end to end correlation to understand what is happening in our system. Production debugging will be a breeze with correlation. 

This post is talking about how can we correlate activities happening in different tiers of our system and see it in one place. Lets take a scenario where an Angular application is interfacing with users, it calls ReST endpoints to talk to front end services and front end services communicate with other internal services. We can correlate operations originating from client side to related operations happening in the internal services.


  • Basic knowledge about AngularJS 1.x
  • Knowledge about WCF services.
  • Knowledge about setting up AppInsights service

Sample setup

It is advised to download the sample from below location before reading further. Below goes the structure of sample.

AngularJS application just does one thing. Finding the area of circle if we give radius. For finding area, it calls a ReST web service.
It sends one event during startup. Another one during calculating the area ie when calling the FronEndWCFService. These are custom AppInsights events

FronEndWCFService is a simple WCF service hosted using webHttpBinding to get ReST end points. It calculates the area using the parameter radius. It don't know the value of pi. It dependent on InternalService to get the value of pi. During the calculation it sends some custom events/operations to AppInsights.

InternalService is hosted using net.pipe binding. It has one method which returns the value of Pi. It also sends some custom events to AppInsights.

The above 3 components are hosted inside IIS and enabled with AppInsights for default system events.

Running the sample

  • Make sure the 3 web applications are hosted in IIS.
    • The web application has to net.pipe binding to host InternalService.
    • Try to browse the service Urls.
  • Give the Url of InternalService in the web config of FrontEndWCFService so that it can call.
  • Obtain AppInsights instrumentation key by either creating a new instance in Azure or copy from existing AppInsights instance. Replace this key into web applications. Better use visual studio search and replace and the word to replace with key is {Your Key} .There are 3 files to get the instrumentation key. 
    • /AngularClient/Views/Shared/_Layout.cshtml
    • /FrontEndWCFService/ApplicationInsights.config
    • /InternalService/ApplicationInsights.config
  • Run the AngularClient application in browser.
    • During the application load, it will log "Custom init event from JS"
    • When we give the radius in the text box and click on button it will log "Custom event from JS before service call"

Viewing the AppInsights

It may take some time to get the entries reflected in the AppInsight in the Azure portal. It is better to go to the analytics portal of ApplicationInsight instance and view the results. In the analytics portal, the details can be seen as follows.

The entries highlighted in the name column are the custom events generated by different tiers in the system. Along with custom events the system will generate event if AppInsights collectors are attached to the processes. That can be done by connecting interceptors in the WCF or HttpModules in Web request pipelines.

Understanding correlation results

The operation_id column is the id which really correlate the operations. The id has to be unique and the parent_id can be any id used previously. It can be a tree of related events.

Don't get confused on the term 'events' used here. AppInsight can accept events, operations, traces etc...For easy reference term event is used.


How does a web request from Angular gets correlated with servers

When we send web request from Angular or any client if we send http header "Request-Id" it will be treated as operation_id by server side, if we have the AppInsight request interceptors

Why the sample has 2 methods in Angular to write custom AppInsights

When the application is getting initialized, the appInsights object may not have the context ready.If so we have to put into queue. Once the context is initialized, we can use it to set operation_id. At that point the queue will be null.

Should I write any code for intercepting incoming web/WCF requests to read correlation data?

No we just need to use already available nuget packages and setup our web.config appropriately.

Should I write any code for intercepting the out going web/WCF requests to write correlation data?

No we just need to use already available nuget packages and setup our web.config appropriately.



Tuesday, August 22, 2017

Starting Scala Spark - Setting up local development environment

When someone comes to me and says 'this can be or cannot be done using .Net, SQL or Browser', I know whether it is really possible or not. Recently someone came to me and said 'we cannot connect to SQL Server database from Scala and load data frames, if Scala is running inside Azure HDInsight Spark cluster'. To be frank I did some google immediately but there are no direct answer to that scenario. In the internet, every one talking about loading data from Azure blob and all into Spark data frame and doing parallel data processing. But, if Spark is an in-memory distributed execution technology, why can't it read from SQL Server database and load data frame and do the processing? Scala is just another JVM language. Spark is written in Scala so it too runs on JVM and JVM has connectivity to SQL Server. So theoretically yes, but I don't have confidence to say practically yes.

It leads to the dilemma of  'should Software Architects code' or depend on what the development team says? Since I am in the 'yes should code' side, I decided to start exploring Spark so that next time I should be able to handle Spark same like how I am handling .Net and browser side technologies. 

The topics I should cover are Scala, Spark, Azure HDInsight.

This post just aims to some one who wants to learn Spark data processing technology at high level so that it can be integrated to other applications. Not to go in depth to become Spark data processing expert programmer.


First step is to setup the development environment. Learning Spark or Scala in Azure HDInsight environment is not a cost effective decision. There is already good tutorials explaining the development environment setup. One such is given below which talks about setting up Spark environment in Windows machines.


This is little old article. It worked for most of the steps. Below are some we may need to revisit for latest versions

Spark version

The instructions are little old regarding the version numbers though the concept remains same. One is regarding the Scala version which is suitable for Spark. The idea is to first select the Spark version and then select the Scala version. At the time of writing this article, the Spark version is 2_2_0 and corresponding Scala version is 2.12.3 (Spark works from Scala 2.11 on wards). The Scala version support is mentioned in the Spark downloads page.


Another dependency is the winutils.exe which has to be in the \hadoop folder. It can be downloaded by googling. One such link is given below.

The location of the winutils can be C:\hadoop\bin\winutils.exe where the PATH environment variable points to the c:\hadoop folder.

After following the steps mentioned in the tutorial, we may get into multiple different issues. One such issue is given below.


This may happen once we run the Spark shell after installation. The problem is that the Spark context variable 'sc' is not initialized. If this is not, we cannot do any Spark activities though we can issue Scala code. The error message is as follows. The font size is reduced to view more in less space.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/22 17:33:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/22 17:33:29 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark_2_2_0/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark_2_2_0/bin/../jars/datanucleus-rdbms-3.2.9.jar."
17/08/22 17:33:29 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark_2_2_0/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark_2_2_0/bin/../jars/datanucleus-api-jdo-3.2.6.jar."
17/08/22 17:33:29 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark_2_2_0/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark_2_2_0/bin/../jars/datanucleus-core-3.2.10.jar."
17/08/22 17:33:35 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/08/22 17:33:35 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1053)
  at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
  at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:129)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:126)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:938)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:938)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:938)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)
  ... 47 elided
Caused by: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
  at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050)
  ... 61 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
  at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:191)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
  at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
  at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  ... 70 more
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
  at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
  at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
  ... 84 more
<console>:14: error: not found: value spark
       import spark.implicits._
<console>:14: error: not found: value spark
       import spark.sql
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val sf = sc.textFile("C:\\Temp\\joygazure.publishsettings")
<console>:17: error: not found: value sc
       val sf = sc.textFile("C:\\Temp\\joygazure.publishsettings")


If we analyze the log carefully we can see there is a permission issue to tmp folder related to hive. Did we install hive? Not explicitly. What is the relation with initializing Spark context and hive folder permissions. At first it might not seems related but ultimately they are related.

c:\hadoop\bin\winutils.exe chmod 777 c:\tmp\hive

Tuesday, August 15, 2017

Difference between singleton object and static class members

Long back during one of interview this question came. Why should we use singleton instead of static class with members? Are they same? Some thoughts on the same below.

Singleton objects and static classes have their own purposes. The difference between those may be too little but there are differences. Lets see what is meant by these 2 varieties.

class Context
        private static Context internalContext = new Context();
        private Context() { }
        public static Context Singleton { get { return internalContext; } }
        public string GetEnvironmentSettings() { return "settings"; }
public void DoBusinessOperation()
        string settings= Context.Singleton.GetEnvironmentSettings();

static class StaticContext
        public static string GetEnvironmentSettings() { return "settings"; }
public void DoBusinessOperation()
        string settings = StaticContext.GetEnvironmentSettings();

Technically the above 2 methods achieve same thing. But we have to understand when should we use the singleton class based approach and static class members approach.

Singleton object can be inherited static class cannot.

The above code was running perfect in an web application which was hosted in onpremise data center. Then the cloud migration came and the strategy is to support onpremise and Azure cloud for a period of 1 year and after that Azure only. Suppose getting environment settings is different in Azure than onpremise, how can we make this code support the scenario without having different codebases?

Don't take the above scenario as is. Just assume we came across a scenario where the behavior of Singleton has to change based on running environment and how to support it without having to maintain many code bases?

Yes its a solved problem. We should use inheritance and change behavior based on the environment. If we go with Singleton model we can have the class derived from another base and expose the base to outside. Below goes the code snippet.

class Context
        private static Lazy<Context> internalContext = new Lazy<Context>(() =>
            //Logic to create the correct object based on criteria. Environment in this case.
            return new AzureContext();

        public static Context Singleton { get { return internalContext.Value; } }
        public virtual string GetEnvironmentSettings() { return "settings"; }
        private class OnPremiseContext : Context
            public override string GetEnvironmentSettings()
                return "settings";
        private class AzureContext : Context
            public override string GetEnvironmentSettings()
                return "Azure settings";

The above code is just for demonstrating the advantage of using Singleton object over static class members.

Singleton can be disposed,static class cannot

In reality we may not meet a situation where a Singleton object needs to be disposed. For argument sake we can say that if we need Singleton object at a thread level (thread static), this is useful. Once that thread completes execution we can dispose the object.

Passing singleton object as param, static class cannot

This is similar to the inheritance scenario but another advantage of using Singleton object. Ideally Singleton is a mechanism where we can access it from anywhere in the code by using its class name. It doesn't need to be passed into a function. We we want to make our functions testable, we can think of accepting the Singleton objects. Also we can use dependency injection mechanism to unit test the code without any overhead of actual singleton object.

There are many other differences between these models but mainly depends on these 3. Since the Singleton is an object it can be cloned but static class cannot. Storage of data in process memory, performance between these 2 models etc...are some other differences.



Tuesday, August 8, 2017

Caution JavaScript Ahead - Hoisting

What is variable hoisting

In the 'var' world of JavaScript, the variable is not bound to the boundary of block level curly braces {}. Even if we declare the variable inside an if loop, it gets moved to the function where if belongs. This is not the case in other languages such as C,C++, Java, C# etc...There the variable is limited to the boundary of block level braces but never in the case of JavaScript. That seems a developer friendly feature in JavaScript at first. But it can cause issues, if we write code without knowing what is variable hoisting.

Below is a good example


function foo() {
  if (new Date().getMinutes() % 1 === 0) {
    var time = new Date();

If we just look at the code we may conclude that the alert will always shows undefined. But if we know hoisting, we can easily identify the difference here. If the condition is true, the time variable is going to have value.
Even if the statements inside the condition didn't execute, we don't get any runtime exception in strict mode. It just shows undefined. Lets see below code.


function foo() {
  function innerFoo() {
    if (new Date().getMinutes() % 1 === 0) {
      var time = new Date();

Here always we get exception if the code is in strict mode. This is because the time variable never gets hoisted to the outer function.

Variable hoisting and global variables

Below is another interesting scenario if global variable or variable in parent scope has same name as inner variable

time = new Date();

function foo() {
  var time = new Date();

When we run the above snippet. We could see that first alert shows the time. Second one shows undefined and third one shows time again. Why the second one shows undefined? Lets see how the variable hoisting worked here. Below code snippet shows the code after hoisting.

time = new Date();

function foo() {
  var time;
  time = new Date();

Of course we cannot see the hoisted code since JavaScript don't compile to output file. But this is what happens.

Does this mean global variable is not safe

The above may cause confusion that global variable is altered by the variable inside function. But it is not. Lets see the below code snippet.

time = new Date(1947,7,15);

function foo() {
  var time = new Date();


When we run we could see that the 4th alert is showing same date of first alert. That means the global variable is not altered by the function since it is different variable with same name. This is similar to the other languages. But this is true only to the variables out side of function. What if the outer variable is inside the function and inner is inside an if statement?

function foo() {
var time = new Date(1947, 7, 15);
if (time.getYear() < 2000) {
var time = new Date();

We could see that the 3rd alert shows current date not the old one initialized at the beginning of the function. This shows the hoisting is to the immediate function level and it is same variable through out the function.

Hoisting functions

In the above code snippets we can see that the foo() is called before it is defined. JavaScript has no problem with that. But lets see the below scenario. Guess what will happen

var foo =function () {
var time = new Date();

It complain that foo is not a function. Many of us might have spent good amount of time troubleshooting why this is not a function at first. The reason is hosting. When it does, it moves the variable foo declaration to the top of the function ie above the foo(); statement and when we call foo(); it is not yet a function as the function definition is assigned to that variable later.

If we move the calling foo(); below the assignment it starts working.

Enter let for block level scoping

The keyword 'let' is introduced into JavaScript ES6 / ES2015 to have block scope. ie the variable can be scoped to an if block than to function level. Lets have a look at below snippet.

function foo() {
var time = new Date(1947, 7, 15);
if (time.getYear() < 2000) {
let time = new Date();

We could see the above code shows old date. The only difference is the usage of let for declaring variable inside if block.

Does let hoist

If we use a variable before it is declared with let, it throws exception. But if we use a variable before it is declared using var it has undefined. Means hoisting is not happening as is in let.

function foo() {
try {
var oldTime = new Date(1947, 7, 15);
} catch (ex) { alert(ex) }
try {
let time = new Date();
} catch (ex) { alert(ex) }
try {
} catch (ex) { alert(ex) }

We could see in the above code, the oldTime gets hoisted with undefined value. The let throws exception similar to an real non declared variable noTime.

There are more in to 'let' such as temporal dead zone and restriction of redefinition etc... The new constructs such as class in JavaScript works the same way of let. Classes in JavaScript don't get hoisted.

If we think about different scenarios, things gets very much interesting. Eg: What if we declare same variable using var and let in same scope, different scopes, how they overlap etc...See the reference section for more such interesting scenarios.


Tuesday, August 1, 2017

Architecting Azure limits - Azure Search

When we look at Azure evangelism, we could see that it is projected as a platform where we can scale to infinity. Some even go to a level where scale is magic. Just put the code in Azure and it scale automatically without we worrying about anything. But not all are true. When we design in Azure platform we are limited to scaling characteristics of each service. Often it demands us to design application or support tools to align with the nature of Azure service(s) we selected. To do it efficiently we should know the limits of Azure services. In other words our Azure architecture revolves around Azure limits.

Lets take Azure Search in this post. Please note that these limits might be changed in future by Azure platform. So it might not be accurate for long. 

Limits of Azure Search

Total services per subscription - 43 (Among this there are limits in different tiers eg: only 6 services in s3 tier)

Data storage

Maximum number of indexes in search service - 200 (S2 & S3 tiers)
Partitions per service - 12
Max partition size - 200GB (S3 tier)
The above 2 define the max index size  = 200GB*12
Maximum documents - 120 million/partition or 1.4 Billion / service


Replicas per service - 12( S1,S2 & S3)
Estimated queries per replica - 60/second 
The above limits total of 12*60 = 720 queries per second.
The latest limits can be found in the below link.

This clearly indicate that using one subscription we cannot build another google. May be Azure will increase the limits later. The advantage here is that the index is doing some kind of compression so the source data size doesn't directly translate to index size.

Azure search in SaaS / Multi-tenancy scenarios

If we really want isolation of tenant data, we should put one tenant data into one services. But there is limit of 6 in the higher levels. Next level of isolation is putting tenant data by search index. The search index gives us different URL to get better isolation. But in this approach, we are limited to 200 indexes / service. When we gets 201th tenant, we have to add one more service. That is either a change in the application or support tool specific to the limits of Azure search service. 

The last resort is to tag the documents by tenant. But that is less secure. More design guidelines related to multi-tenancy can be found below.

If we distribute the data in multiple indexes the query has to change since the out of box querying is towards an index. If we want to search one search term in multiple indexes, it translate to one http call to one index.

Azure search service has got really nice features which otherwise would be difficult to code ourselves. But as any other abstraction, it comes with limits. We should be careful when designing the application by not falling into the marketing materials and magic claims.

Is Azure search really pay per use?

Since it is offered as PaaS, we may assume that it is pay per use. But that is not true for Azure search. When we provision a search service it starts charging regardless of the usage(data stored and the search queries issued). More details on pricing can be found in the below link.

The advantage here is that we gets one more level of scaling inside the service which has pricing tier defined. The scaling inside a search service is called Search units.

Is Azure search auto scaling

Not fully automatic. When we create a search service there will be a default search unit. When we want to scale we have to increase it explicitly. This is another knowledge our application or support tool has to have in order to scale. Else human intervention is required when a scaling need arises such as more data getting added or more users are expected in busy season. More details below.

In simple sense when the data volume increased increase the partitions and when queries increase increase the replicas as the query limit is per replica.