Visual Data Analysis with Python in Excel: Using Scatter Plots
Emilie Lewis
Dave Langer
This is the third in a series of blog posts that teach you to analyze data using Python code in Microsoft Excel visually.
If you are new to Python in Excel, you should start with my Python for Excel Analysts blog series, which covers many concepts that will be assumed in this blog series.
This series will use the Microsoft Excel Labs Python Editor to write code. However, the Python Editor is not required. All code can be entered using the Formula Bar and the new PY() function.
Each post in the series has an accompanying Microsoft Excel workbook to download and use to build your skills. This post’s workbook is available for download here.
For convenience, here are links to all the blog posts in this series:
- Part 1 – Using Histograms
- Part 2 – Using Box Plots
- Part 3 – Using Scatter Plots (this post)
- Part 4 – Using Bar Charts
- Part 5 – Using Line Charts
Note: To reproduce the examples in this post, install the Python in Excel trial. If you like this blog series, check out my self-paced certification program, Anaconda Certified: Data Analysis with Python in Excel.
Introducing Scatter Plots
A common scenario in business analytics is investigating if there exists some relationship between two measures—for example, a relationship between advertising spend and sales.
In other words, a relationship between two columns of numbers.
As discussed in previous posts, it is possible to visually inspect columns of numbers if the columns are small. However, visual inspection of numeric data quickly breaks down as the columns become larger.
Visualizing two columns of numbers using scatter plots achieves the following goals:
- It allows your analysis to scale to large columns of numbers.
- Patterns and relationships in the data are much easier to detect and analyze.
A Contrived Example
Take the following collection of data as a contrived example:
Fig 01 – A Contrived Example of Numeric Data
The data shown in Fig 01 are structured as follows:
- Each row represents some period (e.g., months).
- The ice cream sales and shark attack counts are recorded for each period.
While Fig 01 depicts a scenario where visual inspection of data could be performed and some insights can be derived, later examples in this post will demonstrate the power of scatter plots.
Your First Scatter Plot
The Python seaborn library provides several scatter plot-based visualizations. The following code loads the Excel table shown in Fig 01 as a pandas DataFrame and then visualizes the data as a scatter plot:
Fig 02 – Python Code for a Scatter Plot
Note: While the above Python code is written using the Excel Labs Python Editor, this is not required.
The code shown in Fig 02 leverages the seaborn scatterplot() function, with the visualization configured as follows:
- The shark_attacks DataFrame is the data source.
- The Ice Cream Sales column is mapped to the x-axis.
- The Shark Attacks column is mapped to the y-axis.
Lastly, the set_title() method adds a title to the scatter plot.
The code cell is configured with the Convert to Excel values option by clicking the down arrow for the cell in the Python Code Editor. Selecting this option renders the scatter plot within the worksheet cell:
Fig 03 – Rendering the Scatter Plot Inside the Worksheet Cell
Clicking the disk icon tells the Python Code Editor to execute the Python code:
Fig 04 – Executing the Python Code
The code shown in Fig 02 produces the following scatter plot:
Fig 05 – A Scatter Plot of Shark Attacks by Ice Cream Sales
Each point in Fig 05 represents a single row from the data table shown in Fig 01.
Introducing Correlation
When using scatter plots to analyze numeric data visually, the primary goal is to identify any correlation between the numeric columns.
Correlation is an important concept in statistics, with several mathematical definitions. While knowing the underlying mathematics of correlation is useful, this blog post will take an intuitive approach to understanding correlation.
First, let us understand the three basic forms of correlation.
Positive Correlation
The first form of correlation is a relationship in the data where as the values of one column get larger, the values of the second column also get larger.
When this relationship is present in your data, it is called positive correlation.
Often, a positive correlation is obvious with visual inspection of scatter plots. Other times, it isn’t so obvious.
The seaborn library offers the regplot() function for creating scatter plots with trend lines. Adding a trend line to a scatter plot is often useful:
Fig 06 – Python Code for a Scatter Plot with a Trend Line
The code shown in Fig 06 is very similar to that in Fig 02. The notable difference is the use of the line_kws parameter to set the style of the trend line to be dashed.
The Python Code Editor cell shown in Fig 06 is configured with the Convert to Excel values option. Executing the code produces the following visualization:
Fig 07 – Scatter Plot with a Trend Line
As displayed in Fig 07, the data displays a positive correlation. As the value of Ice Cream Sales gets larger, so does the value of Shark Attacks.
Fig 07 also provides an additional insight into the data – the trend line intersects every point in the scatter plot. When this happens, the data is known as being perfectly correlated.
When looking at Fig 07, the intuition is that, based on the limited dataset provided, you can 100% accurately predict Shark Attacks based on the value of Ice Cream Sales.
In other words, as the data points in a scatter plot get closer and closer to the trend line, the stronger the correlation.
Negative Correlation
The second form of correlation is a relationship in the data where as the values of one column get larger, the values of the second column get smaller.
When this relationship is present in your data, it is called negative correlation.
Consider the following scatter plot:
Fig 08 – An Example of Negative Correlation
Examining the nature of the data depicted in Fig 08, the relationship is intuitive – as the Daily Temperature increases, the Monthly Energy Bill decreases (i.e., less energy is used).
Compare the data points shown in Fig 08 to those in Fig 07. Notice how the trend line in Fig 08 is close to the data points but only intersects one of the points?
Unlike Fig 07, the data in Fig 08 are highly correlated, but not perfectly correlated. More precisely, the data in Fig 08 are highly negatively correlated.
Once again, the intuition is based on being able to predict one value for another.
For example, using the limited dataset provided, knowing the Daily Temperature allows you to create a reasonably accurate prediction for the Monthly Energy Bill value.
Note: Positive vs. negative correlation only refers to the nature of the relationship between the columns of numbers, not the quality of the relationship (i.e., negative does not mean “bad”).
No Correlation
The last form of correlation is where there is no discernible relationship between the values of the columns of numbers.
Consider the following scatter plot:
Fig 09 – An Example of No Correlation
The data depicted in Fig 09 is contrived, to be sure, but it demonstrates an important concept – there is no predictive relationship between X Values and Y Values.
For example, when X Values are equal to 1.00, Y Values could be 1.00, 2.00, 3.00, 4.00, or 5.00. This lack of predictive relationship is embodied in the trend line – it is horizontal.
Weak Correlation
It is worth noting that correlation is a spectrum. Generally speaking:
- The steeper the trend line, whether positive or negative, the stronger the correlation.
- The flatter the trend line, the weaker the correlation.
Consider the following scatter plot. The trend line has been made a solid black line to enhance visibility:
Fig 10 – An Example of Very Weak Correlation
When examining Fig 10, think about the slope of the trend line and the accuracy of predictions possible with this dataset.
The slope of the trend line is not horizontal. Its slope is minimal. This means there is only the tiniest bit of relationship between the two columns of numbers.
In other words, the predictive relationship is terrible.
Correlation Analysis
When using correlation to analyze data, the following is arguably the most important concept to remember: correlation does not imply causation.
The basic idea is that because two numeric columns are highly correlated does not mean one column causes the other.
For example, Fig 07 shows perfect positive correlation between Ice Cream Sales and Shark Attacks. However, it is silly to assume that Ice Cream Sales cause Shark Attacks!
This is where thinking about correlation in terms of prediction accuracy is important in business analytics. While correlation does not imply causation, predictive relationships are super useful.
Analyzing Reseller Orders
Imagine you are analyzing the performance of promotion strategies and have the following dataset:
Fig 11 – The Reseller Orders Dataset
A reasonable place to begin your analysis is to see the relationship (i.e., the correlation) between DiscountAmount and profit.
The dataset currently doesn’t contain profits. The following code loads the Excel table as a pandas DataFrame and adds a Profit column.
Fig 12 – Python Code to Engineer the Profit Feature
Executing the code shown in Fig 12 produces the updated reseller_orders DataFrame:
Fig 13 – The Updated reseller_orders DataFrame
The updated DataFrame allows for a correlation analysis of Profit by DiscountAmount.
Correlation of Profit by DiscountAmount
The following Python code creates a scatter plot with a trend line for the reseller_orders DataFrame:
Fig 14 – Python Code for a Scatter Plot with a Trend Line
The code shown in Fig 14 differs from previous examples by using the ci parameter. The ci parameter controls the size of confidence intervals displayed around the trend line.
Behind the scenes, the trend line is a linear regression predictive model created from the dataset. When the data does not exhibit perfect correlation, the trend line model’s predictions will not be perfect.
The confidence intervals provide a visual indication regarding the level of error for the trend line model. The code in Fig 14 configures the scatter plot not to display the confidence intervals.
The Python Code Editor cell shown in Fig 14 is configured with the Convert to Excel values option. Executing the code produces the following visualization:
Fig 15 – Scatter Plot of Profit by DiscountAmount
Examining Fig 15 reveals the following:
- Overall, there is a strong negative correlation between DiscountAmount and Profit.
- Based on the cloud of data points in the top left of the visualization, the predictive relationship isn’t very strong for very low DiscountAmount values.
- However, all orders appear to have negative Profit amounts when DiscountAmount exceeds approximately $2,500.
As discussed earlier, this correlation analysis shouldn’t assume that DiscountAmounts cause Profit. However, this highly predictive relationship is certainly worthy of further analysis.
Correlation of Profit by DiscountAmount and SalesTerritoryGroup
An example of drilling more into the correlation of Profit by DiscountAmount is investigating if certain geographies may exhibit higher/lower propensity for profitable/unprofitable orders.
The reseller_orders DataFrame has a SalesTerritoryGroup column containing the following categorical values:
- Europe
- North America
- Pacific
The following code creates a scatter plot of Profit by DiscountAmount and SalesTerritoryGroup:
Fig 16 – Python Code for a Scatter Plot with Categorical Data
The code in Fig 16 uses the hue parameter to color code the scatter plot data points based on the values of SalesTerritoryGroup.
The Python Code Editor cell shown in Fig 16 is configured with the Convert to Excel values option. Executing the code produces the following visualization:
Fig 17 – Scatter Plot of Profit by DiscountAmount and SalesTerritoryGroup
Examining the Fig 17 scatter plot reveals the following:
- There are far more orders placed in North America than in the other geographies.
- Of the least profitable orders, the bulk are placed in North America.
The combination of Fig 15 and Fig 17 generates questions requiring additional analyses. Two examples:
- Are there other factors beyond DiscountAmount (e.g., getting rid of excess inventory) that impact Profit?
- Are there real differences between the geographies that impact Profit? For example, are US sales far less profitable in North America than Canadian sales?
What’s Next?
This post has demonstrated how correlation analysis using scatter plots is a powerful tool for crafting insights between columns of numbers.
Not surprisingly, correlation analysis is a go-to tool in a data analyst’s toolbox.
The next post in this series will switch gears and cover how to perform visual analysis of one of the most common types of data in business analytics – categories like product lines and geographies.
Until next time, stay healthy and happy data sleuthing!
Talk to an Expert
Talk to one of our experts to find solutions for your AI journey.