Statistical plotting

This section documents a few very basic additions to matplotlib’s plotting commands that can be useful for statistical analysis. The 1D plotting section should be read before this section. Some of these tools will be expanded in the future, but for a more comprehensive suite of statistical plotting utilities, you may be interested in seaborn (we try to ensure that seaborn plotting commands are compatible with proplot figures and axes).

Error bars and shading

Error bars and error shading can be quickly added on-the-fly to line, linex (equivalently, plot, plotx), scatter, scatterx, bar, and barh plots using any of several keyword arguments.

If you pass 2D arrays to these commands with mean=True, means=True, median=True, or medians=True, the means or medians of each column are drawn as lines, points, or bars, while error bars or error shading indicate the spread of the distribution in each column. Invalid data is ignored. You can also specify the error bounds manually with the bardata, boxdata, shadedata, and fadedata keywords. These commands can draw and style thin error bars (the bar keywords), thick “boxes” overlaid on top of these bars (the box keywords; think of them as miniature boxplots), a transparent primary shading region (the shade keywords), and a more transparent secondary shading region (the fade keywords). See the documentation on the plotting commands for details.
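
To make the column-wise reduction concrete, the sketch below computes the same kinds of quantities by hand with NumPy. It is not proplot's internal implementation. The array name sample, its shape, and the use of the population standard deviation (NumPy's default) are assumptions made for illustration. The percentile pairs follow the "equivalent" comments in the example that follows, where a 90% range corresponds to (5, 95) and the box range to (25, 75).

```python
import numpy as np

# Hypothetical sample: 20 draws (rows) for each of 4 distributions (columns)
rng = np.random.RandomState(51423)
sample = rng.normal(loc=10, scale=2, size=(20, 4))
sample[3, 1] = np.nan  # an invalid value, skipped by the nan-aware reductions

# Central values, one per column
col_medians = np.nanmedian(sample, axis=0)
col_means = np.nanmean(sample, axis=0)

# Percentile bounds: a 90% range spans the 5th to 95th percentiles,
# and the interquartile range spans the 25th to 75th percentiles
bar_bounds = np.nanpercentile(sample, (5, 95), axis=0)   # shape (2, 4)
box_bounds = np.nanpercentile(sample, (25, 75), axis=0)  # shape (2, 4)

# Standard deviation bounds: mean plus or minus one standard deviation
# (population standard deviation here, which is an assumption)
col_std = np.nanstd(sample, axis=0)
shade_bounds = np.stack((col_means - col_std, col_means + col_std))
```

Each bounds array has two rows (lower, then upper) and one column per distribution, which is the same layout as the shadedata and fadedata arrays built with np.percentile in the example.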

[1]:
import numpy as np
import pandas as pd

# Sample data
# Each column represents a distribution
state = np.random.RandomState(51423)
data = state.rand(20, 8).cumsum(axis=0).cumsum(axis=1)[:, ::-1]
data = data + 20 * state.normal(size=(20, 8)) + 30
data = pd.DataFrame(data, columns=np.arange(0, 16, 2))
data.columns.name = 'column number'
data.name = 'variable'

# Calculate error data
# Passed to 'shadedata' and 'fadedata' in the 3rd subplot example
means = data.mean(axis=0)
means.name = data.name  # copy name for formatting
fadedata = np.percentile(data, (5, 95), axis=0)  # light shading
shadedata = np.percentile(data, (25, 75), axis=0)  # dark shading
[2]:
import proplot as pplt
import numpy as np

# Loop through "vertical" and "horizontal" versions
varray = [[1], [2], [3]]
harray = [[1, 1], [2, 3], [2, 3]]
for orientation, array in zip(('vertical', 'horizontal'), (varray, harray)):
    # Figure
    fig = pplt.figure(refwidth=4, refaspect=1.5, share=False)
    axs = fig.subplots(array, hratios=(2, 1, 1))
    axs.format(abc='A.', suptitle=f'Indicating {orientation} error bounds')

    # Medians and percentile ranges
    ax = axs[0]
    kw = dict(
        color='light red', edgecolor='k', legend=True,
        median=True, barpctile=90, boxpctile=True,
        # median=True, barpctile=(5, 95), boxpctile=(25, 75)  # equivalent
    )
    if orientation == 'horizontal':
        ax.barh(data, **kw)
    else:
        ax.bar(data, **kw)
    ax.format(title='Bar plot')

    # Means and standard deviation range
    ax = axs[1]
    kw = dict(
        color='denim', marker='x', markersize=8**2, linewidth=0.8,
        label='mean', shadelabel=True,
        mean=True, shadestd=1,
        # mean=True, shadestd=(-1, 1)  # equivalent
    )
    if orientation == 'horizontal':
        ax.scatterx(data, legend='b', legend_kw={'ncol': 1}, **kw)
    else:
        ax.scatter(data, legend='ll', **kw)
    ax.format(title='Marker plot')

    # User-defined error shading
    ax = axs[2]
    kw = dict(
        shadedata=shadedata, fadedata=fadedata,
        label='mean', shadelabel='50% CI', fadelabel='90% CI',
        color='ocean blue', barzorder=0, boxmarker=False,
    )
    if orientation == 'horizontal':
        ax.linex(means, legend='b', legend_kw={'ncol': 1}, **kw)
    else:
        ax.line(means, legend='ll', **kw)
    ax.format(title='Line plot')
_images/stats_3_0.svg
_images/stats_3_1.svg

Box plots and violin plots

Vertical and horizontal box and violin plots can be drawn using boxplot, violinplot, boxploth, and violinploth (or their new shorthands, box, violin, boxh, and violinh). The proplot versions employ aesthetically pleasing defaults and permit flexible configuration using keywords like color, barcolor, and fillcolor. They also automatically apply axis labels based on the DataFrame or DataArray column labels. Violin plot error bars are controlled with the same keywords used for on-the-fly error bars.
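
For readers who want to see what a box summarizes, here is a minimal sketch of the statistics behind a conventional box plot, computed with NumPy. The helper box_stats and its dictionary keys are hypothetical names, not part of proplot. The whisker rule of 1.5 times the interquartile range is matplotlib's documented default for boxplot; whether proplot overrides it is not stated in this section, so treat it as an assumption.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Summarize a 1D sample the way a conventional box plot does."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop invalid data
    q1, med, q3 = np.percentile(values, (25, 50, 75))
    iqr = q3 - q1
    # Whiskers reach the most extreme points within whis * IQR of the box
    lo = values[values >= q1 - whis * iqr].min()
    hi = values[values <= q3 + whis * iqr].max()
    # Points beyond the whiskers are drawn individually as outliers
    fliers = values[(values < lo) | (values > hi)]
    return {
        "q1": q1, "median": med, "q3": q3, "mean": values.mean(),
        "whislo": lo, "whishi": hi, "fliers": fliers,
    }

# One summary per column, mirroring one box per DataFrame column
rng = np.random.RandomState(51423)
columns = rng.normal(size=(500, 5))
stats = [box_stats(columns[:, i]) for i in range(columns.shape[1])]
```

The "mean" entry corresponds to the optional mean marker that means=True adds in the example below; the remaining entries describe the box, the median line, and the whiskers.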

[3]:
import proplot as pplt
import numpy as np
import pandas as pd

# Sample data
N = 500
state = np.random.RandomState(51423)
data1 = state.normal(size=(N, 5)) + 2 * (state.rand(N, 5) - 0.5) * np.arange(5)
data1 = pd.DataFrame(data1, columns=pd.Index(list('abcde'), name='label'))
data2 = state.rand(100, 7)
data2 = pd.DataFrame(data2, columns=pd.Index(list('abcdefg'), name='label'))

# Figure
fig, axs = pplt.subplots([[1, 1, 2, 2], [0, 3, 3, 0]], span=False)
axs.format(
    abc='A.', titleloc='l', grid=False,
    suptitle='Boxes and violins demo'
)

# Box plots
ax = axs[0]
obj1 = ax.box(data1, means=True, marker='x', meancolor='r', fillcolor='gray4')
ax.format(title='Box plots')

# Violin plots
ax = axs[1]
obj2 = ax.violin(data1, fillcolor='gray6', means=True, points=100)
ax.format(title='Violin plots')

# Boxes with different colors
ax = axs[2]
ax.boxh(data2, cycle='pastel2')
ax.format(title='Multiple colors', ymargin=0.15)
_images/stats_5_0.svg

Histograms and kernel density

Vertical and horizontal histograms can be drawn with hist and histh. As with the other plotting commands, multiple histograms can be drawn by passing 2D arrays instead of 1D arrays, and the color cycle used to color histograms can be changed on-the-fly using the cycle and cycle_kw keywords. Likewise, 2D histograms can be drawn with the hist2d and hexbin commands, and their colormaps can be changed on-the-fly with the cmap and cmap_kw keywords (see the 2D plotting section). Marginal distributions for the 2D histograms can be added using panel axes.
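
The binning behind these plots can be sketched with NumPy alone. The example below is an illustration under stated assumptions, not proplot's implementation: the variable names are made up, and np.arange is given a stop of 3.25 so that the edge at 3 is included. It counts each column of a 2D array against shared bin edges, then builds a 2D histogram and recovers its marginal distributions by summing over one axis.

```python
import numpy as np

rng = np.random.RandomState(51423)
edges = np.arange(-3, 3.25, 0.25)  # 25 shared bin edges from -3 to 3

# Multiple 1D histograms: one set of counts per column of a 2D array
columns = rng.normal(size=(300, 3))
counts = np.stack(
    [np.histogram(columns[:, i], bins=edges)[0] for i in range(3)],
    axis=1,
)  # shape (24, 3): bins by columns

# 2D histogram and its marginal distributions
x = rng.normal(size=500)
y = rng.normal(size=500)
hist2d, xedges, yedges = np.histogram2d(x, y, bins=(edges, edges))
marginal_x = hist2d.sum(axis=1)  # counts per x bin, summed over y
marginal_y = hist2d.sum(axis=0)  # counts per y bin, summed over x
```

Note that marginals obtained by summing the 2D counts only include points that fall inside the bin range along both axes. Separate 1D histograms of x and y, as drawn in the panel example below, also count points whose other coordinate lies outside that range.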

In the future, proplot will include options for adding “smooth” kernel density estimations to histogram plots using a kde keyword. It will also include separate proplot.axes.PlotAxes.kde and proplot.axes.PlotAxes.kde2d commands. The violin and violinh commands will use the same algorithm for kernel density estimation as the kde commands.
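
Until those commands exist, a kernel density estimate can be computed separately and drawn as an ordinary line. The following is a minimal sketch of a 1D Gaussian kernel density estimate in NumPy. The function gaussian_kde_1d is a hypothetical helper, and the default bandwidth (Scott's rule of thumb) is an assumption; neither reflects the algorithm proplot will eventually use.

```python
import numpy as np

def gaussian_kde_1d(sample, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `sample` on `grid`."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if bandwidth is None:
        # Scott's rule of thumb for 1D data (an assumed default)
        bandwidth = sample.std(ddof=1) * sample.size ** (-1 / 5)
    # Standardized distance from every grid point to every sample point
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale so the estimate integrates to one
    return kernels.mean(axis=1) / bandwidth

rng = np.random.RandomState(51423)
sample = rng.normal(size=300)
grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(sample, grid)

# Riemann-sum estimate of the integral of the density over the grid
area = density.sum() * (grid[1] - grid[0])
```

To overlay the curve on a histogram of counts rather than densities, multiply density by the number of samples and the bin width.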

[4]:
import proplot as pplt
import numpy as np

# Sample data
M, N = 300, 3
state = np.random.RandomState(51423)
x = state.normal(size=(M, N)) + state.rand(M)[:, None] * np.arange(N) + 2 * np.arange(N)

# Sample overlaid histograms
fig, ax = pplt.subplots(refwidth=4, refaspect=(3, 2))
ax.format(suptitle='Overlaid histograms', xlabel='distribution', ylabel='count')
res = ax.hist(
    x, pplt.arange(-3, 8, 0.2), filled=True, alpha=0.7, edgecolor='k',
    cycle=('indigo9', 'gray3', 'red9'), labels=list('abc'), legend='ul',
)
_images/stats_7_0.svg
[5]:
import proplot as pplt
import numpy as np

# Sample data
N = 500
state = np.random.RandomState(51423)
x = state.normal(size=(N,))
y = state.normal(size=(N,))
bins = pplt.arange(-3, 3, 0.25)

# Histogram with marginal distributions
fig, axs = pplt.subplots(ncols=2, refwidth=2.3)
axs.format(
    abc='A.', abcloc='l', titleabove=True,
    ylabel='y axis', suptitle='Histograms with marginal distributions'
)
colors = ('indigo9', 'red9')
titles = ('Group 1', 'Group 2')
for ax, which, color, title in zip(axs, 'lr', colors, titles):
    ax.hist2d(
        x, y, bins, vmin=0, vmax=10, levels=50,
        cmap=color, colorbar='b', colorbar_kw={'label': 'count'}
    )
    color = pplt.scale_luminance(color, 1.5)  # histogram colors
    px = ax.panel(which, space=0)
    px.histh(y, bins, color=color, fill=True, ec='k')
    px.format(grid=False, xlocator=[], xreverse=(which == 'l'))
    px = ax.panel('t', space=0)
    px.hist(x, bins, color=color, fill=True, ec='k')
    px.format(grid=False, ylocator=[], title=title, titleloc='l')
_images/stats_8_0.svg