PyBroom Example - Multiple Datasets¶
This notebook is part of pybroom.
This notebook demonstrates using pybroom when fitting a set of curves (curve fitting) with lmfit.Model. We will show that pybroom greatly simplifies comparing, filtering and plotting fit results from multiple datasets.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina' # for hi-dpi displays
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from lmfit import Model
import lmfit
print('lmfit: %s' % lmfit.__version__)
lmfit: 0.9.7
In [2]:
sns.set_style('whitegrid')
In [3]:
import pybroom as br
Create Noisy Data¶
We start by simulating N datasets which are identical except for the additive noise.
In [4]:
N = 20
In [5]:
x = np.linspace(-10, 10, 101)
In [6]:
peak1 = lmfit.models.GaussianModel(prefix='p1_')
peak2 = lmfit.models.GaussianModel(prefix='p2_')
model = peak1 + peak2
In [7]:
#params = model.make_params(p1_amplitude=1.5, p2_amplitude=1,
# p1_sigma=1, p2_sigma=1)
In [8]:
Y_data = np.zeros((N, x.size))
Y_data.shape
Out[8]:
(20, 101)
In [9]:
for i in range(Y_data.shape[0]):
    Y_data[i] = model.eval(x=x, p1_center=-1, p2_center=2,
                           p1_sigma=0.5, p2_sigma=1.5,
                           p1_height=1, p2_height=0.5)
Y_data += np.random.randn(*Y_data.shape)/10
In [10]:
plt.plot(x, Y_data.T, '-k', alpha=0.1);

Model Fitting¶
Single-peak model¶
Define and fit a single Gaussian model to the \(N\) datasets:
In [11]:
model1 = lmfit.models.GaussianModel()
In [12]:
Results1 = [model1.fit(y, x=x) for y in Y_data]
Two-peaks model¶
Here, instead, we use a more appropriate Gaussian mixture model.
To fit the noisy data, the residuals (the difference between model and data) are minimized in the least-squares sense.
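For illustration only, the quantity that lmfit minimizes for one dataset can be written explicitly (this is not how the fit is invoked; model.fit below handles it internally, and the trial parameter values here are arbitrary initial guesses):

# Sketch of the least-squares objective (lmfit evaluates this internally):
resid = Y_data[0] - model.eval(x=x, p1_center=0, p2_center=3,
                               p1_sigma=0.5, p2_sigma=1,
                               p1_amplitude=1, p2_amplitude=2)
chi_square = (resid**2).sum()   # the fit varies the parameters to minimize this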
In [13]:
params = model.make_params(p1_center=0, p2_center=3,
                           p1_sigma=0.5, p2_sigma=1,
                           p1_amplitude=1, p2_amplitude=2)
In [14]:
Results = [model.fit(y, x=x, params=params) for y in Y_data]
Fit results from an lmfit Model can be inspected with fit_report or params.pretty_print():
In [15]:
#print(Results[0].fit_report())
#Results[0].params.pretty_print()
This is good for peeking at the results. However, extracting these data from lmfit objects is quite a chore and requires good knowledge of the lmfit object structure.
pybroom helps in this task: it extracts data from fit results and returns familiar pandas DataFrames (in tidy format). Thanks to the tidy format, these data can be much more easily manipulated, filtered and plotted.
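As a rough sketch of what this automates, a manual "tidy" extraction would mean looping over every ModelResult and every Parameter by hand (attribute names as defined by lmfit; this is only an illustration, not pybroom's actual implementation):

# Manual tidy extraction that pybroom replaces (illustration only):
rows = []
for i, res in enumerate(Results):
    for name, par in res.params.items():
        rows.append({'dataset': i, 'name': name, 'value': par.value,
                     'stderr': par.stderr, 'vary': par.vary})
manual_tidy = pd.DataFrame(rows)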
Glance¶
A summary of the two-peaks model fit:
In [16]:
dg = br.glance(Results, var_names='dataset')
dg.drop(['model', 'message'], axis=1).head()
Out[16]:
| | method | num_params | num_data_points | chisqr | redchi | AIC | BIC | num_func_eval | success | dataset |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | leastsq | 6 | 101 | 1.145367 | 0.012056 | -440.418975 | -424.728252 | 80 | True | 0 |
| 1 | leastsq | 6 | 101 | 0.880633 | 0.009270 | -466.965754 | -451.275031 | 59 | True | 1 |
| 2 | leastsq | 6 | 101 | 0.752634 | 0.007922 | -482.828918 | -467.138195 | 95 | True | 2 |
| 3 | leastsq | 6 | 101 | 0.746746 | 0.007860 | -483.622251 | -467.931527 | 80 | True | 3 |
| 4 | leastsq | 6 | 101 | 1.022689 | 0.010765 | -451.861188 | -436.170465 | 101 | True | 4 |
A summary of the one-peak model fit:
In [17]:
dg1 = br.glance(Results1, var_names='dataset')
dg1.drop(['model', 'message'], axis=1).head()
Out[17]:
| | method | num_params | num_data_points | chisqr | redchi | AIC | BIC | num_func_eval | success | dataset |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | leastsq | 3 | 101 | 1.850787 | 0.018886 | -397.950482 | -390.105121 | 115 | True | 0 |
| 1 | leastsq | 3 | 101 | 1.303274 | 0.013299 | -433.374332 | -425.528970 | 63 | True | 1 |
| 2 | leastsq | 3 | 101 | 1.724410 | 0.017596 | -405.093812 | -397.248451 | 43 | True | 2 |
| 3 | leastsq | 3 | 101 | 2.043674 | 0.020854 | -387.937525 | -380.092163 | 105 | True | 3 |
| 4 | leastsq | 3 | 101 | 2.376938 | 0.024254 | -372.680057 | -364.834695 | 107 | True | 4 |
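Since the two glance tables share the same columns, a quick quantitative comparison of the models takes one line of pandas (a sketch; 'AIC' is the column shown above, and fit_type is just a label added here to tell the two tables apart):

# Label each glance table and compare the mean AIC (lower is better):
dg_all = pd.concat([dg.assign(fit_type='twopeaks'),
                    dg1.assign(fit_type='onepeak')], ignore_index=True)
dg_all.groupby('fit_type')['AIC'].mean()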
Tidy¶
Tidy fit results for all the parameters:
In [18]:
dt = br.tidy(Results, var_names='dataset')
Let’s see the results for a single dataset:
In [19]:
dt.query('dataset == 0')
Out[19]:
| | name | value | min | max | vary | expr | stderr | init_value | dataset |
|---|---|---|---|---|---|---|---|---|---|
| 0 | p1_amplitude | 0.926646 | -inf | inf | True | None | 0.190639 | 1.0 | 0 |
| 1 | p1_center | -1.087737 | -inf | inf | True | None | 0.058759 | 0.0 | 0 |
| 2 | p1_fwhm | 1.285132 | -inf | inf | False | 2.3548200*p1_sigma | 0.189530 | NaN | 0 |
| 3 | p1_height | 0.677383 | -inf | inf | False | 0.3989423*p1_amplitude/max(1.e-15, p1_sigma) | 0.082226 | NaN | 0 |
| 4 | p1_sigma | 0.545745 | 0.000000 | inf | True | None | 0.080486 | 0.5 | 0 |
| 5 | p2_amplitude | 1.174310 | -inf | inf | True | None | 0.281646 | 2.0 | 0 |
| 6 | p2_center | 1.603433 | -inf | inf | True | None | 0.589793 | 3.0 | 0 |
| 7 | p2_fwhm | 4.895003 | -inf | inf | False | 2.3548200*p2_sigma | 1.197038 | NaN | 0 |
| 8 | p2_height | 0.225371 | -inf | inf | False | 0.3989423*p2_amplitude/max(1.e-15, p2_sigma) | 0.032352 | NaN | 0 |
| 9 | p2_sigma | 2.078716 | 0.000000 | inf | True | None | 0.508335 | 1.0 | 0 |
or for a single parameter across datasets:
In [20]:
dt.query('name == "p1_center"').head()
Out[20]:
| | name | value | min | max | vary | expr | stderr | init_value | dataset |
|---|---|---|---|---|---|---|---|---|---|
| 1 | p1_center | -1.087737 | -inf | inf | True | None | 0.058759 | 0.0 | 0 |
| 11 | p1_center | -0.985144 | -inf | inf | True | None | 0.050625 | 0.0 | 1 |
| 21 | p1_center | -0.987962 | -inf | inf | True | None | 0.037483 | 0.0 | 2 |
| 31 | p1_center | -1.032991 | -inf | inf | True | None | 0.040458 | 0.0 | 3 |
| 41 | p1_center | -0.990100 | -inf | inf | True | None | 0.044825 | 0.0 | 4 |
In [21]:
dt.query('name == "p1_center"')['value'].std()
Out[21]:
0.035626705561751418
In [22]:
dt.query('name == "p2_center"')['value'].std()
Out[22]:
0.4123156933193024
Note that there is a much larger error in fitting p2_center than p1_center.
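The same comparison can be extended to every free parameter with a single groupby on the tidy DataFrame (a sketch reusing dt from above):

# Spread of each free parameter across the 20 datasets:
dt.query('vary').groupby('name')['value'].std()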
In [23]:
dt.query('name == "p1_center"')['value'].hist()
dt.query('name == "p2_center"')['value'].hist(ax=plt.gca());

Augment¶
The augment function returns a tidy DataFrame of quantities that are functions of the independent variable ('x'). Columns include the data being fitted, the best fit, the best-fit components and the residuals.
In [24]:
da = br.augment(Results, var_names='dataset')
In [25]:
da1 = br.augment(Results1, var_names='dataset')
In [26]:
r = Results[0]
Let’s see the results for a single dataset:
In [27]:
da.query('dataset == 0').head()
Out[27]:
| | x | data | best_fit | residual | Model(gaussian, prefix='p1_') | Model(gaussian, prefix='p2_') | dataset |
|---|---|---|---|---|---|---|---|
| 0 | -10.0 | -0.121893 | 3.862094e-08 | 0.121893 | 8.342024e-59 | 3.862094e-08 | 0 |
| 1 | -9.8 | 0.035362 | 6.577435e-08 | -0.035362 | 3.098875e-56 | 6.577435e-08 | 0 |
| 2 | -9.6 | 0.134672 | 1.109865e-07 | -0.134672 | 1.006492e-53 | 1.109865e-07 | 0 |
| 3 | -9.4 | -0.101508 | 1.855510e-07 | 0.101508 | 2.858187e-51 | 1.855510e-07 | 0 |
| 4 | -9.2 | -0.086389 | 3.073522e-07 | 0.086390 | 7.096505e-49 | 3.073522e-07 | 0 |
Plotting a single dataset is much simpler than assembling the plot manually from the lmfit objects:
In [28]:
da0 = da.query('dataset == 0')
In [29]:
plt.plot('x', 'data', data=da0, marker='o', ls='None')
plt.plot('x', "Model(gaussian, prefix='p1_')", data=da0, lw=2, ls='--')
plt.plot('x', "Model(gaussian, prefix='p2_')", data=da0, lw=2, ls='--')
plt.plot('x', 'best_fit', data=da0, lw=2);
plt.legend()
Out[29]:
<matplotlib.legend.Legend at 0x7fad6f261128>

But keep in mind that, for a single dataset, we could use lmfit's plot_fit method as well (which is even simpler):
In [30]:
Results[0].plot_fit();

However, things become much more interesting when we want to plot multiple datasets or models as in the next section.
Comparison of different datasets¶
In [31]:
grid = sns.FacetGrid(da.query('dataset < 6'), col="dataset", hue="dataset", col_wrap=3)
grid.map(plt.plot, 'x', 'data', marker='o', ls='None', ms=3, color='k')
grid.map(plt.plot, 'x', "Model(gaussian, prefix='p1_')", ls='--')
grid.map(plt.plot, 'x', "Model(gaussian, prefix='p2_')", ls='--')
grid.map(plt.plot, "x", "best_fit");

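The augmented frame also carries the residual column, so a residual facet plot across datasets is equally terse (a sketch reusing da from above):

# Residuals per dataset: a flat, structureless band indicates a good fit.
grid = sns.FacetGrid(da.query('dataset < 6'), col="dataset", col_wrap=3)
grid.map(plt.plot, 'x', 'residual', marker='o', ls='None', ms=3);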
Comparison of one-peak and two-peak models¶
Here we plot a comparison of the two fitted models (one or two peaks) for different datasets.
First we create a single tidy DataFrame with data from the two models:
In [32]:
da['model'] = 'twopeaks'
da1['model'] = 'onepeak'
da_tot = pd.concat([da, da1], ignore_index=True)
Then we perform a facet plot with seaborn:
In [33]:
grid = sns.FacetGrid(da_tot.query('dataset < 6'), col="dataset", hue="model", col_wrap=3)
grid.map(plt.plot, 'x', 'data', marker='o', ls='None', ms=3, color='k')
grid.map(plt.plot, "x", "best_fit")
grid.add_legend();

Note that the “tidy” organization of the data allows plotting libraries such as seaborn to automatically infer most of the information needed to create complex plots with simple commands. Without tidy data, building such plots manually becomes a daunting task.