Data Analysis Notes

SQL Server Data Analysis using Python and JupyterLab

JupyterLab

Looking for a free tool for SQL Server data analysis? If you are familiar with Python I will show you how to use JupyterLab as a great data analysis tool.

Why Python and JupyterLab?

Python is a multi-purpose language but the abundance of data-processing libraries makes it a really great instrument for data manipulation and data analysis. No wonder that this is one of the leading languages used in data science and machine learning.

JupyterLab is a web-based tool by Project Jupyter that is a successor of Jupyter Notebook, a computational environment for creating rich interactive documents. JupyterLab allows one to simultaneously work with several files, including Jupyter Notebooks, in different tabs. Actually it’s just a logical continuation of Jupyter Notebook and should eventually replace it. While JupyterLab can work with different languages (like R and Julia) but in this case, I will focus on Python.

One of the greatest advantages of Project Jupyter is that you can mix code, markup, table, and charts in one file with .ipynb extension. Later you can share this file with your colleagues online or as a simple HTML file if necessary. If your notebook file is available online everybody can view it using NBViewer.

Installing Python and JupyterLab

1. Install Anaconda

If you don’t have python and JupyterLab installed on your computer then the simplest way to do it is by installing Anaconda, a famous data science platform. It comprises many packages for data analysis including Python itself, Jupyter Notebooks and necessary libraries. It supports several languages and works on Linux, Windows and Mac OS X.

You can download and install Anaconda here.

2. Install JupyterLab

Since JupyterLab is not installed automatically by Anaconda, you need to installing it manually by typing the following command (use Anaconda Prompt app if you are working on Windows):

conda install -c conda-forge jupyterlab

When the installation is completed you can start JupyterLab by typing

jupyter lab

It should be opened in a new browser tab. At this point, all installation is accomplished and we can proceed with SQL Server data analysis.

Connecting SQL Server from a Jupyter Notebook

When you start JupyterLab you get to the Launcher where you need to click the Python icon under the “Notebook” section. This will start a new notebook which we will use for connecting to SQL Server and analyzing its data.

First of all, in order to connect SQL Server, you need to install Microsoft ODBC Driver for SQL Server. Here are instructions for Linux and Mac and Windows.

I will show you two options of SQL Server connection: using pyodbc library and using SQL magic extension.

Raw way: Using pyodbc library

In this way, we will use python’s pyodbc library that allows connecting to different database sources using ODBC drivers. If this library is not installed on your computer you can do this using conda:

conda install -c anaconda pyodbc

Now you can use the following sample to connect your database and select some data from a table:

In this sample, I used the AdventureWorks database and my local SQL Server with SQL Server Authentication mode. You need to change the connection string accordingly.

Also, you can see that I used the pandas library which is the best library for working with heterogeneous data formed into arrays with named columns. It should be already included in your Anaconda installation.

More ipythonic way: Using IPython SQL Magic extension

Jupyter Notebook allows using magic commands, set of convenient functions helping to solve common problems in data analysis. IPython SQL magic extension allows you to execute SQL queries right in your notebook that makes the whole process more natural without adding any additional code.

First, you need to install this extension using conda:

conda install -c conda-forge ipython-sql

Then you need to load this extension to your notebook using the following directive:

%load_ext sql

Then you need to use %sql magic to connect your SQL Server and run queries:

Note, that here I had to URl-encode the connection string in order to pass it to the magic. There are other options for connection to SQL Server via ODBC, but this one worked for me better and looks more concise (though a bit hard-to-read).

Analyzing SQL Server Data using python

Uuuhhh, finally we have come to the main point of this post: data analysis. This topic is huge but I’m going to show you some main principles and methods so that you can proceed further using the given links to the documentation.

Using Pandas’ methods

Pandas library provides a bunch of useful methods allowing you to instantly get some statistical information on your data.

Let’s see what we can know about data in FactProductInventory table of our AdventureWorks database. I will use the first method of querying data from SQL Server (the one that uses pyodbc directly) as I’m more get used to it. If you prefer the second method (that uses %sql magic) then you need to use DataFrame() method of the result set received after magic execution to get a DataFrame object.

Let’s put all 776286 records of the table into df variable (which is a DataFrame object) and get some statistical information on each column using describe() method:

DataFrame.describe

You can see that this method gives such statistics such as total value count, mean value, standard deviation, minimum, maximum and quartile values. This obviously gives garbage for DateKey column (that contains encoded date value) but for UnitCost column we can get a nice overview telling, for example, that while the maximum cost is 1131.25 the half of the units costs 24.65 or less.

Note also in the above example, that I used head() method to see several (five by default) first rows of the data frame to see whether all data imported correctly.

The info() method of a DataFrame object returns a summary of DataFrame such as data types, memory consumption, number of values and so on:

DataFrame.info

Another useful method is value_counts() that shows how many times each value appears in the set:

DataFrame.value_counts

It especially useful for discrete values like the ProductKey column but in this data set it appeared to be not representable so I used the UnitCost column instead.

But if you prefer a visual representation of how often each different value in a set of data occurs (so-called frequency distribution) you need to use hist() method:

DataFrame.hist

DataFrame has huge amount of other useful methods that you can find in the documentation.