Sankey Diagram

Sankey diagrams feature directed arrows that have a width proportional to the flow quantity visualized: if a flow is twice as wide it represents double the quantity. Flows in the diagram can show e.g. energy, materials, water or costs. Within a Sankey diagram the directed flow is always drawn between at least two nodes (processes). Thus it shows not only flow values but also information about the structure and distribution of the defined system. So they are a great alternative to common flow or bar & pie charts.

They are gaining popularity in energy management, facility management, process engineering and process control and many other kinds of data visualization.

Sankey diagrams are really fascinating to see. I have been mesmerized by these for a while now. They often show up on Tableau Public as you traverse through the gallery. I have tried replicating Sankey Diagram on Tableau, however, the data prep process is extremely arduous which led me to find out if there’s an easier alternative out there.

I found the solution with Python – Pandas. When it comes to data prep, pandas will have your back.

Lets dive right into the solution –

The data I have used is the very popular Superstore Dataset. You can download it here

https://drive.google.com/open?id=1yeIM2iKBHiR6tiW3L_HhV5N9A4HVICLe

If you look at the sankey diagram you’ll notice that it is a collect of source and destination labels with a parameter represented by the flow from the source to destination. Here, the thickness of the flow is determined by the value of the parameter (numeric). We can extend the same principle to multilevel sankey diagram where the destination of the first flow is the source for the next one and so on.

I have used Jupyter Notebook for writing the python code. Here we import pandas and read the data from you local drive.

import pandas as pd
df = pd.read_csv('/Users/mastermind/Desktop/Super.csv')
df.head()

I stored the column headers I want to drop in a variable ‘column’. I retained only those categorical variables which I wish to see in my Sankey Diagram.

column = ['Order ID', 'State','Order Date', 'Ship Date', 'Customer ID', 'Customer Name', 'Country', 'City','Postal Code','Product ID','Product Name','Discount']
df.drop(column,inplace=True, axis=1)
df.head()

From within the created dataframe – I further subset it to include only one numerical variable (here – Profit) which makes up the thickness of the flow.

df1 = df.filter(['Ship Mode','Segment','Region','Category','Sub-Category','Profit'], axis=1)
df1.head()

Now we do a simple maneuver, similar to group by in SQL where group the numerical values pair-wise with the categorical variables.

df2=df1.groupby(['Region', 'Segment'], as_index=False)['Profit'].count()
df2.head()
df3=df1.groupby(['Segment', 'Category'], as_index=False)['Profit'].count()
df3.head()
df4=df1.groupby(['Category','Sub-Category'], as_index=False)['Profit'].count()
df4.head()
df5=df1.groupby(['Sub-Category','Ship Mode'], as_index=False)['Profit'].count()
df5.head()

In essence, the above code creates the source – destination pairs we need in our multilevel Sankey Diagram. Each of the above data-frames have different column headers. Since we eventually want to append all of these in one data-frame, it would help us if we have common column header for each of these. The code below does that for us in addition to appending all the individual data-frames into one data-frame.

df2.columns = ['a', 'b', 'Quantity']
df3.columns = ['a', 'b', 'Quantity']
df4.columns = ['a', 'b', 'Quantity']
df5.columns = ['a', 'b', 'Quantity']
df6 = df2.append(df3)
df7 = df6.append(df4)
df8 = df7.append(df5)
df8

We will now create the Sankey Function built into Plotly. Credits to Ken Lok for helpful resources on this function.

import plotly
import plotly.plotly as py

def genSankey(df,cat_cols=[],value_cols='',title='Sankey Diagram'):
    # maximum of 6 value cols -> 6 colors
    colorPalette = ['#FFD43B','#646464','#4B8BBE','#306998']
    labelList = []
    colorNumList = []
    for catCol in cat_cols:
        labelListTemp =  list(set(df[catCol].values))
        colorNumList.append(len(labelListTemp))
        labelList = labelList + labelListTemp
        
    # remove duplicates from labelList
    labelList = list(dict.fromkeys(labelList))
    
    # define colors based on number of levels
    colorList = []
    for idx, colorNum in enumerate(colorNumList):
        colorList = colorList + [colorPalette[idx]]*colorNum
        
    # transform df into a source-target pair
    for i in range(len(cat_cols)-1):
        if i==0:
            sourceTargetDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            sourceTargetDf.columns = ['source','target','count']
        else:
            tempDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            tempDf.columns = ['source','target','count']
            sourceTargetDf = pd.concat([sourceTargetDf,tempDf])
        sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()
        
    # add index for source-target pair
    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))
    
    # creating the sankey diagram
    data = dict(
        type='sankey',
        node = dict(
          pad = 15,
          thickness = 20,
          line = dict(
            color = "black",
            width = 0.5
          ),
          label = labelList,
          color = colorList
        ),
        link = dict(
          source = sourceTargetDf['sourceID'],
          target = sourceTargetDf['targetID'],
          value = sourceTargetDf['count']
        )
      )
    
    layout =  dict(
        title = title,
        font = dict(
          size = 10
        )
    )
       
    fig = dict(data=[data], layout=layout)
    return fig

We will now call the Sankey function and we will pass the column headers from our new data-frame as arguments to the function.

fig = genSankey(df8,cat_cols=['a','b'],value_cols='Quantity',title='Sankey Diagram x Profit')
plotly.offline.plot(fig, validate=False)

The end product should look something like this –

https://plot.ly/~viraj1596/94/

Since this function is built into Plotly it gives us all the functionalities that plotly has built within when you click “export to plotly” on the lower right corner of your plot. You can host this plot on plotly for other to view, or to embed within your own webpage.

You can further condense the graph to show only two elements at a time or to show other parameters like Quantity or Sales for the flow thickness. You can download my Jupyter Notebook for reference.

The diagrams look as follows –

Happy Vizzing!

1 comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s