Visualization with SEABORN STEP by STEP
This is a quick walkthrough of the Seaborn library, showing how to make basic plots and draw insights from your data.
#collapse-hide
# ignore library warnings
import warnings
warnings.filterwarnings('ignore')
# data manipulation
import numpy as np
import pandas as pd
# visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
tips_df = sns.load_dataset('tips')
tips_df.sample(5)
tips_df.info()
sns.relplot(x='total_bill', y='tip', data=tips_df)
plt.show()
We can see a roughly linear relationship between the total bill and the tip, i.e. when the total bill increases, the tip is likely to be higher.
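To put a number on that relationship, one option is the Pearson correlation coefficient. The sketch below uses a small made-up sample (sample_df and its values are hypothetical, not rows from the tips data) so that it runs without downloading anything; with the real data the same method can be called on the tips_df columns.

```python
import pandas as pd

# Hypothetical bills and tips, for illustration only
sample_df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "tip": [1.5, 3.0, 4.0, 6.5, 7.0],
})

# Pearson correlation: a value close to +1 means a strong positive linear relationship
r = sample_df["total_bill"].corr(sample_df["tip"])
print(f"Pearson r = {r:.2f}")
```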
Let's add hue='smoker' to see whether smokers or non-smokers tend to tip better.
sns.relplot(data=tips_df, x='total_bill', y='tip', hue='smoker')
plt.show()
Note: There are more non-smokers than smokers, and only a few tips are above $6; those occur when the total bill exceeds $30.
So how many smokers are there?
tips_df.smoker.value_counts()
We can change hue to sex, size, or time, or add style, to drill down into how tips vary by gender (Male or Female) and time (Lunch or Dinner).
sns.relplot(data=tips_df, x='total_bill', y='tip', hue='smoker', style='time')
plt.show()
sns.relplot(data=tips_df, x='total_bill', y='tip', hue='size')
plt.show()
Setting palettes
Pass a palette string to change the colors. The ch: prefix selects a cubehelix palette:
palette='ch:r=0.7, l=0.85'
#collapse-show
sns.relplot(data=tips_df, x='total_bill', y='tip', hue='size', palette='ch:r=0.7, l=0.85')
plt.show()
Now let's vary the markers based on the size column.
- Try style='size' and each party size gets its own marker shape on the chart.
- Alternatively, to scale the points by the discrete values, set
size='size'
sns.relplot(data=tips_df, x='total_bill', y='tip', size='size')
plt.show()
Let's control the range of marker sizes by passing sizes=(min, max):
# min=15
# max=200
sns.relplot(data=tips_df, x='total_bill', y='tip', size='size', sizes=(15,200))
plt.show()
Subplots for Scatter Plots / relplots
It won't always be easy to convey the information in one plot, so we introduce subplots to help clear things up.
sns.relplot(data=tips_df, x='total_bill', y='tip', hue='smoker', col='time')
We can now see the plots categorized in two ways:
- outer category: time
- inner category: smoker
This is beautiful and easy to digest! Now, let's try to interpret the plots.
- There are more points at dinner time than at lunch time, right?
- So we can say more people go out for dinner than for lunch.
- When can one get a better tip?
- Better tips are more likely during dinner time.
- Who are those people?
- Both smokers and non-smokers.
- What's their gender?
- Let's find out!
Tip: row and col can help you achieve subplots for different categories. Let's see it in action by adding row='sex'.
sns.relplot(data=tips_df, x='total_bill', y='tip', hue='smoker', col='time', row='sex',height=3.5)
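As a rough sketch of what row and col do: relplot returns a FacetGrid whose axes array has one row per row category and one column per col category. The demo_df below is synthetic (all values are made up) and only stands in for tips_df so the example runs offline.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (values are made up)
rng = np.random.default_rng(0)
n = 80
demo_df = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "sex": np.tile(["Male", "Female"], n // 2),
    "time": np.repeat(["Lunch", "Dinner"], n // 2),
})
demo_df["tip"] = 0.15 * demo_df["total_bill"] + rng.normal(0, 0.5, n)

g = sns.relplot(data=demo_df, x="total_bill", y="tip",
                row="sex", col="time", height=3.5)

# One subplot per (sex, time) combination
print(g.axes.shape)
```

The same axes array can be used afterwards to tweak individual panels, for example to set a title on a single subplot.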
Ooh! Men like to impress during dinner dates. I wonder what the size of the meal is, LOL.
Check the total bill for males during dinner time compared to lunch time, LOL.
sns.relplot(data=tips_df, x='total_bill', y='tip', hue='smoker', col='size' , col_wrap=3, height=3.5)
sns.scatterplot(data=tips_df, x='total_bill', y='tip', hue='sex')
# generate toy dataset
np.random.seed(1996)
df = pd.DataFrame(dict(time= np.arange(500), value= np.random.randn(500).cumsum()))
df.head()
sns.relplot(kind='line',data=df, x='time', y='value')
plt.show()
With a line plot we focus more on the trend of the data.
Let's load a new dataset from seaborn called fmri.
fmri_df = sns.load_dataset('fmri')
fmri_df.sample(5)
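Let's draw a line plot of signal against timepoint:
sns.relplot(data=fmri_df, x='timepoint', y='signal', kind='line')
The shaded band around the line is the confidence interval.
We can turn it off with the setting below (in seaborn 0.12 and later, ci is deprecated in favor of errorbar, e.g. errorbar=None):
ci=False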
sns.lineplot(data=fmri_df, x='timepoint', y='signal')
sns.lineplot(data=fmri_df, x='timepoint', y='signal', ci=False)
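To see where such a band comes from: by default lineplot aggregates the repeated observations at each x value with the mean and draws a bootstrapped 95% confidence interval around it. The sketch below reproduces that idea with plain NumPy on made-up measurements. The variable names and the number of resamples are illustrative assumptions, not seaborn's internals.

```python
import numpy as np

rng = np.random.default_rng(1996)

# Hypothetical repeated measurements of a signal at a single timepoint
signal = rng.normal(loc=0.2, scale=0.1, size=50)

# Resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(signal, size=signal.size, replace=True).mean()
    for _ in range(1000)
])

# The 2.5th and 97.5th percentiles bound a 95% confidence interval
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={signal.mean():.3f}, 95% CI=({low:.3f}, {high:.3f})")
```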
Let's draw one with hue='event'
and another one with hue='region'
sns.lineplot(data=fmri_df, x='timepoint', y='signal', hue='event', ci=False)
sns.lineplot(data=fmri_df, x='timepoint', y='signal', hue='region', ci=False)
Let's add markers to our line plots to make the individual points easier to see:
style='region'
dashes=False
markers=True
sns.lineplot( data=fmri_df, x='timepoint', y='signal', style='region', hue='region' , markers= True, ci=False, dashes=False)
# set palettes
palettes = sns.cubehelix_palette(light=0.5, n_colors=2)
sns.lineplot( data=fmri_df, x='timepoint', y='signal', style='region', hue='region', ci=False, dashes=False, palette=palettes)
sns.catplot(data=tips_df, x='day', y='total_bill', hue='sex', col='time', col_wrap=3, height=3.5, jitter=False)
sns.catplot(data=tips_df, x='smoker', y='tip', order=['No','Yes'])
sns.catplot(data=tips_df, x='day', y='total_bill', hue='size', row='time', col='sex', height=3.5)
sns.catplot(data=tips_df, x='day', y='total_bill', hue='sex', jitter=False )
sns.swarmplot(data=tips_df, x='day', y='total_bill', hue='sex')
BOX PLOT Understanding the Statistics
tips_df.sample(5)
sns.boxplot(data=tips_df, x='day', y='total_bill', )
sns.boxplot(data=tips_df, x='day', y='total_bill', hue='time', dodge=True)
ax = sns.boxenplot(data=tips_df, x='day', y='total_bill')
ax
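The statistics behind a box plot can be computed by hand. With default settings the box spans the first to third quartile, the line inside it is the median, and points more than 1.5 times the interquartile range away from the box are drawn as outliers. A minimal sketch on made-up numbers (the bills array is hypothetical):

```python
import numpy as np

# Hypothetical bills; the last value is deliberately extreme
bills = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Points beyond 1.5 * IQR from the box edges are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```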
BARPLOT Categorical vs categorical
sns.barplot(data=tips_df, x='sex', y='total_bill', ci=None )
sns.countplot(data=tips_df, x='smoker', hue='sex' )
sns.pointplot(data=tips_df, x='sex', y='size', hue='smoker', ci=False)
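barplot and pointplot do not draw raw rows: by default each bar or point is the mean of the y variable within a category. A quick way to check the numbers behind such a plot is a pandas groupby, sketched here on a tiny made-up frame (demo_df is hypothetical, not the tips data):

```python
import pandas as pd

# Hypothetical rows, for illustration only
demo_df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "total_bill": [20.0, 30.0, 15.0, 25.0, 40.0],
})

# Mean total_bill per category: the bar heights a default barplot would show
mean_bill = demo_df.groupby("sex")["total_bill"].mean()
print(mean_bill)
```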
Univariate Distribution
# generate data
np.random.seed(1996)
x = np.random.randn(100)
# plot x
sns.distplot(x, bins=20, kde=False)
sns.kdeplot(x, shade=True, bw=1)
Bi-Variate Plot
tips_df.head()
sns.jointplot(data=tips_df ,x='total_bill', y='tip')
sns.jointplot(x='total_bill', y='tip', data=tips_df, kind='kde', color='r')
g = sns.jointplot(x='total_bill', y='tip', data=tips_df, kind='kde', color='m')
g.plot_joint(plt.scatter, c='w', s=30, linewidth=1, marker='+')
g.ax_joint.collections[0].set_alpha(0)
Pair-Plot
df.dtypes
iris = sns.load_dataset('iris')
iris.sample(5)
We are now going to plot all the features against each other using a pair plot!
sns.pairplot(iris)
Let's modify our pair plot a bit.
g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels = 10)
tips_df.head()
What is the linear relationship between total bill and tips?
sns.regplot(data=tips_df, x='total_bill', y='tip')
sns.lmplot(x='size', y='tip', data=tips_df, x_jitter=0.05)
Let's tidy up the plot by using x_estimator=np.mean, which shows the mean tip for each size instead of every individual point.
sns.lmplot(x='size', y='tip', data=tips_df, x_estimator=np.mean)
Let's load a new dataset to show what to do when the data does not have a simple linear relationship.
data = sns.load_dataset('anscombe')
data.sample(5)
sns.lmplot( x='x', y='y', data=data.query("dataset == 'I'"), ci=None, scatter_kws={'s':80})
sns.lmplot( x='x', y='y', data=data.query("dataset == 'II'"), ci=None, scatter_kws={'s':80}, order=2)
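What order=2 buys us can be sketched with NumPy: fit a degree-2 polynomial instead of a straight line and compare the residuals. The values below are Anscombe's dataset II typed in by hand so the example runs offline; the sse helper is a hypothetical name, and lmplot's own fitting code may differ in detail.

```python
import numpy as np

# Anscombe's quartet, dataset II
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def sse(degree):
    """Sum of squared residuals of a polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

linear_sse = sse(1)
quadratic_sse = sse(2)
print(f"linear SSE={linear_sse:.3f}, quadratic SSE={quadratic_sse:.6f}")
```

The quadratic fit leaves almost no residual, which is why the second-order curve follows the points so closely.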
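Dataset III contains an outlier that pulls the regression line away from the other points. Fitting a robust regression down-weights it (this option requires statsmodels):
robust=True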
sns.lmplot( x='x', y='y', data=data.query("dataset == 'III'"), ci=None, scatter_kws={'s':80}, robust=True)
sns.lmplot(x='total_bill', y='tip', data=tips_df, hue='sex', markers = ['o', 'x'], col='smoker', row='time')
fig, ax = plt.subplots(figsize = (8,4))
sns.regplot(x='total_bill', y='tip', data=tips_df, ax=ax)
Controlling Plotted Figure Aesthetics
- figure style
- axes styling
- color palettes
- etc.
def sinplot(flip=1):
    # draw six offset sine waves with decreasing amplitude
    x = np.linspace(0, 14, 100)
    for i in range(1, 7):
        plt.plot(x, np.sin(x + i * 0.05) * (7 - i) * flip)
sinplot(1)
sns.set_style('whitegrid', {'axes.grid': True, 'xtick.direction': 'out'})
sinplot()
sns.despine(left=False, bottom=False)
sns.set_context('paper')
sns.set_style('dark', {'axes.grid': False, 'xtick.direction': 'out'}, )
sinplot()
sns.axes_style()
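axes_style can be used in a few ways, sketched below: with no arguments it returns the parameters currently in effect, with a style name it returns that style's parameters without applying them, and inside a with statement it applies a style temporarily. The plotted sine curve is only placeholder data.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# With no arguments, axes_style() returns the parameters currently in effect
current = sns.axes_style()

# With a style name, it returns that style's parameters without applying them
whitegrid = sns.axes_style("whitegrid")
print(whitegrid["axes.grid"])

# Used as a context manager, the style applies only inside the block
x = np.linspace(0, 14, 100)
with sns.axes_style("dark"):
    fig, ax = plt.subplots()
    ax.plot(x, np.sin(x))
```

The context-manager form is handy when only one figure in a notebook needs a different look, since the global style set with set_style is left untouched.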
How to find the color palette currently in use
current_palettes = sns.color_palette()
sns.palplot(current_palettes)
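Besides drawing the palette, it can be inspected as data. A short sketch (the variable names are mine): the object returned by color_palette behaves like a list of RGB tuples and can be converted to hex codes, and a named palette can be requested with an explicit number of colors.

```python
import seaborn as sns

# color_palette() with no arguments returns the current default palette
current_palette = sns.color_palette()

# Each entry is an (r, g, b) tuple with values between 0 and 1
first_color = current_palette[0]

# as_hex() converts the palette to hex strings
hex_codes = current_palette.as_hex()

# A named palette with an explicit number of colors
husl_palette = sns.color_palette("husl", 8)
print(len(husl_palette))
```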