Demo Notebook 2022-07-18#
Demos from the live sessions on 2022-07-18.
Setup#
import grama as gr
import pandas as pd
import numpy as np
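## NOTE: DF is a pipeline placeholder ("intention"); expressions like DF.t
## refer to columns of the dataframe flowing through a >> pipeline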
DF = gr.Intention()
%matplotlib inline
Load the challenge data#
df_data = pd.read_csv("../challenges/data/c05-data.csv")
df_data
|     | E            | mu       | t     | id_machine | id_specimen | id_measurement |
|-----|--------------|----------|-------|------------|-------------|----------------|
| 0   | 10113.457096 | 0.319442 | 0.250 | A          | 0           | g              |
| 1   | 9432.148720  | 0.308230 | 0.250 | A          | 1           | g              |
| 2   | 10468.172283 | 0.328422 | 0.250 | A          | 2           | g              |
| 3   | 9474.697920  | 0.333919 | 0.250 | A          | 3           | g              |
| 4   | 10010.402837 | 0.310751 | 0.250 | A          | 4           | g              |
| ... | ...          | ...      | ...   | ...        | ...         | ...            |
| 715 | 11935.167126 | 0.319328 | 0.125 | F          | 115         | l              |
| 716 | 11552.946124 | 0.317546 | 0.125 | F          | 116         | l              |
| 717 | 11985.579932 | 0.332173 | 0.125 | F          | 117         | l              |
| 718 | 12092.949690 | 0.317655 | 0.125 | F          | 118         | l              |
| 719 | 12170.319166 | 0.335891 | 0.125 | F          | 119         | l              |

720 rows × 6 columns
Compare across thicknesses#
I chose to make this comparison using boxplots. These are a simple (and underrated!) way to make comparisons across groups.

A particularly helpful tip is to use notch=True with boxplots. This visualizes a confidence interval on the median (the middle line), which is helpful for assessing the statistical significance of differences. As we can see from the boxplots below, the two notches are well separated, suggesting that the difference is statistically significant (not likely to be due to chance alone).
## TODO: Compare elasticity across thicknesses
## NOTE: There are many ways to do this!
(
df_data
# solution-begin
>> gr.ggplot(gr.aes("t", "E"))
+ gr.geom_boxplot(gr.aes(group="t"), notch=True)
# solution-end
)
(Figure: notched boxplots of elasticity E at thicknesses t = 0.125 and t = 0.250)
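To make the notches more concrete, here is a minimal sketch (not part of the original challenge; the helper median_ci is hypothetical) that bootstraps a 95% confidence interval on the median E within each thickness, which is roughly the interval the notches visualize.

## Sketch: bootstrap a 95% CI on the median E within each thickness
def median_ci(x, n_boot=1000, seed=101):
    rng = np.random.default_rng(seed)
    meds = [
        np.median(rng.choice(x, size=len(x), replace=True))
        for _ in range(n_boot)
    ]
    return np.quantile(meds, [0.025, 0.975])

for t, df_g in df_data.groupby("t"):
    lo, hi = median_ci(df_g["E"].values)
    print("t = {}: median 95% CI ~ [{:.0f}, {:.0f}]".format(t, lo, hi))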
Explore the experimental design#
(The following is the solution code from the q2 task.)
Answer the following questions to better understand the experimental design. Note that the same questions are posed within each cell and under observations below.
Hint: The verbs tf_count() and tf_distinct() will be very useful for answering some of these questions!
## Q: What thicknesses were tested?
# solution-begin
(
df_data
>> gr.tf_count(DF.t)
)
# solution-end
|   | t     | n   |
|---|-------|-----|
| 0 | 0.125 | 360 |
| 1 | 0.250 | 360 |
## Q: How many unique specimens were manufactured?
# solution-begin
(
df_data
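# NOTE: this assumes specimen ids are consecutive integers starting at 0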
>> gr.tf_summarize(spec_max=gr.max(DF.id_specimen) + 1)
)
# solution-end
|   | spec_max |
|---|----------|
| 0 | 120      |
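An alternative that does not rely on zero-based, consecutive ids is to count the distinct values directly; a one-line sketch with plain pandas (which should likewise yield 120):

## Sketch: count distinct specimen ids directly
df_data["id_specimen"].nunique()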
## Q: How many specimens were made on each machine?
# solution-begin
(
df_data
>> gr.tf_distinct(DF.id_specimen, DF.id_machine)
>> gr.tf_count(DF.id_machine)
)
# solution-end
|   | id_machine | n  |
|---|------------|----|
| 0 | A          | 20 |
| 1 | B          | 20 |
| 2 | C          | 20 |
| 3 | D          | 20 |
| 4 | E          | 20 |
| 5 | F          | 20 |
## Q: How many times did each operator measure each specimen?
# solution-begin
(
df_data
>> gr.tf_count(DF.id_specimen, DF.id_measurement)
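# n counts rows per (specimen, measurement) pair; its max is the repeat count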
>> gr.tf_summarize(n_max=gr.max(DF.n))
)
# solution-end
|   | n_max |
|---|-------|
| 0 | 1     |
Observations
What thicknesses were tested?
0.125 and 0.250 inches
How many unique specimens were manufactured?
120
How many specimens were made on each machine?
20 specimens each
How many times did each operator measure each specimen?
Just once
Fitting a lognorm#
First, let’s set up the data as it was in the challenge. We seek to model the variability of E as observed in df_sub.
## NOTE: No need to edit
df_sub = (
df_data
>> gr.tf_filter(
DF.t == 0.250,
DF.id_machine != "C",
DF.id_measurement != "h",
)
)
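As a quick sanity check (a sketch, not part of the original notebook), we can confirm that the filter kept only t == 0.250 and dropped machine C:

## Sketch: verify the filtered subset
(
    df_sub
    >> gr.tf_count(DF.id_machine, DF.t)
)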
The following code fits a lognormal distribution to the observations on E.

Note that the "lognorm" distribution in grama uses the scipy implementation. As noted in that documentation, the loc and scale parameters are used to shift and scale the input variable via (x - loc) / scale. However, we can freeze the location parameter of the distribution by setting floc=0. This is useful because fitting a lognormal distribution with a free location parameter can become unstable. The following code demonstrates this phenomenon:
## Fit with loc == 0
mg_fixed = gr.marg_fit(
"lognorm",
df_sub.E,
floc=0,
)
## Fit with loc able to move
mg_free = gr.marg_fit(
"lognorm",
df_sub.E,
# floc=0, # Don't set the location parameter to zero
)
## Use QQ plot to assess the two fits
(
df_sub
>> gr.tf_mutate(
fixed=gr.qqvals(DF.E, marg=mg_fixed),
free=gr.qqvals(DF.E, marg=mg_free),
)
>> gr.tf_pivot_longer(
columns=["fixed", "free"],
names_to="location",
values_to="q_E",
)
>> gr.ggplot(gr.aes("q_E", "E"))
+ gr.geom_abline(intercept=0, slope=1, linetype="dashed")
+ gr.geom_point()
+ gr.facet_grid("~location", scales="free_x")
)
(Figure: QQ plots of E against the fixed-location and free-location lognormal fits)
Clearly the free option has resulted in a terrible fit! We can also see that the fitted parameter values for the free version have ended up being enormous (and implausible). (A known contributor to this behavior: the likelihood of the three-parameter lognormal is unbounded as the location approaches the smallest observation, which can drive maximum-likelihood estimates to extreme values.)
print(mg_fixed) # floc == 0 (fixed)
print(mg_free) # floc free
(+0) lognorm, {'mean': '9.965e+03', 's.d.': '4.977e+02', 'COV': 0.05, 'skew.': 0.15, 'kurt.': 3.04}
(+0) lognorm, {'mean': '2.409e+13', 's.d.': '3.298e+26', 'COV': 13691416095072.72, 'skew.': 2.5665226910475562e+39, 'kurt.': 1.2347725193581465e+105}
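For reference, since grama's "lognorm" marginal uses the scipy implementation, the two fits above correspond (as a sketch, assuming scipy is installed) to direct calls like:

## Sketch: the underlying scipy fits
from scipy.stats import lognorm
s_fixed, loc_fixed, scale_fixed = lognorm.fit(df_sub.E, floc=0)  # loc frozen at 0
s_free, loc_free, scale_free = lognorm.fit(df_sub.E)             # all parameters free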
Real variability?#
Note that the tails of the fitted distribution seem to exhibit some mismatch:
(
df_sub
>> gr.tf_mutate(
q_E=gr.qqvals(DF.E, marg=mg_fixed),
)
>> gr.ggplot(gr.aes("q_E", "E"))
+ gr.geom_abline(intercept=0, slope=1, linetype="dashed")
+ gr.geom_point()
)
(Figure: QQ plot of E against the fixed-location lognormal fit; the tail points deviate from the reference line)
This is a problem for us, as we’re aiming to assess probabilities that lie in the tails (around the 1% level).

Part of the issue here is that the real and erroneous variability mingle to create a complicated, non-standard distribution. We can improve the situation by applying the mean heuristic to the data; that is, by averaging the repeated measurements within each specimen, which suppresses the erroneous (measurement) variability and leaves primarily the real specimen-to-specimen variability:
## NOTE: No need to edit
df_real = (
df_sub
>> gr.tf_group_by(DF.id_specimen)
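# Average the repeated measurements within each specimen (the mean heuristic)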
>> gr.tf_summarize(
mu=gr.mean(DF.mu),
E=gr.mean(DF.E),
)
>> gr.tf_ungroup()
)
Now re-fit the marginal distribution on the reduced dataset:
# solution-begin
mg_E = gr.marg_fit(
"lognorm",
df_real.E,
floc=0,
)
(
df_real
>> gr.tf_mutate(
q_E=gr.qqvals(DF.E, marg=mg_E),
)
>> gr.ggplot(gr.aes("q_E", "E"))
+ gr.geom_abline(intercept=0, slope=1, linetype="dashed")
+ gr.geom_point()
)
# solution-end
(Figure: QQ plot of the specimen-averaged E against the re-fitted lognormal)
As discussed in the morning session, this fit is still not perfect. However, we can see that the points generally lie closer to the line than before.
Since the real variability in this dataset was in fact generated from a lognormal distribution, the remaining mismatch here is due to erroneous (sampling) variability. Dealing with this remaining uncertainty is outside the scope of this course, but you can take a look at the Further Reading below for more information!
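(As a rough numerical complement to the QQ plot, and purely as a sketch outside the original challenge, one could quantify the remaining mismatch with a Kolmogorov–Smirnov statistic against the re-fitted distribution, assuming scipy is available.)

## Sketch: KS statistic of the specimen-averaged data vs. the re-fit
from scipy.stats import kstest, lognorm
s, loc, scale = lognorm.fit(df_real.E, floc=0)
print(kstest(df_real.E, lambda x: lognorm.cdf(x, s, loc=loc, scale=scale)))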
Further reading#
When are allowables conservative?
Discusses basis values, which are used in aerospace engineering to account for both real and erroneous (sampling) sources of variability simultaneously.
Also discusses the limitations of allowable (single) values for engineering analysis purposes.