Stats: Fundamentals of Statistical Process Monitoring#

Purpose: In future exercises we’re going to learn how to model variable quantities as random, but before we do that, we need tools to help decide whether a given quantity should be modeled as random. The tools of statistical process monitoring (sometimes called statistical process control) help us find non-random patterns in data, which is useful evidence for deciding whether a quantity can reasonably be modeled as random.

Setup#

import grama as gr
import numpy as np
DF = gr.Intention()
%matplotlib inline

Theory#

It’s difficult to tell by inspection if data are random. For example, let’s look at a completely random dataset:

## NOTE: No need to edit; run and inspect
np.random.seed(102)

mg_normal = gr.marg_mom("norm", mean=0, sd=1)
df_random = (
    gr.df_make(X=mg_normal.r(1000))
    >> gr.tf_mutate(idx=DF.index)
)

(
    df_random
    >> gr.ggplot(gr.aes("idx", "X"))
    + gr.geom_point()
)

Some observations:

  • Without some sense of reasonable limits, we can’t distinguish outliers from the normal highs and lows of randomness.

  • Without some knowledge of the underlying random process, we cannot define reasonable limits.

Rather than stare intently at random patterns, we can make some intelligent choices about grouping observations, and use batch averages to set meaningful control limits. These ideas combine in a control chart; the following plot is our first example.

%%capture --no-display
## NOTE: No need to edit

(
    df_random 
    ## Group observations into consecutive batches of 10
    >> gr.tf_mutate(id_batch=DF.index // 10)
    ## Construct an Xbar and S (xbs) chart
    >> gr.pt_xbs(group="id_batch", var="X")
)

From this chart, we see no points outside the lower control limit (LCL) or upper control limit (UCL). We also don’t see any special patterns in the data. Therefore, based on this control chart, we can declare that the process is likely under statistical control; that is, it can reasonably be modeled as random.

X-bar and S Chart#

The helper function gr.pt_xbs() constructs a particular kind of control chart: an X-bar and S chart. This shows both the batch sample standard deviation (S) and batch sample mean (X-bar), along with control limits for these quantities. This statistical tool helps us find suspicious patterns in a dataset.
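
Before unpacking the theory, it can help to compute the chart’s raw ingredients directly. Here is a minimal sketch of the per-batch statistics that gr.pt_xbs() visualizes, assuming grama provides the dplyr-style verbs gr.tf_group_by() and gr.tf_summarize() along with the gr.mean() and gr.sd() summary helpers:

## NOTE: A sketch of the batch statistics behind an X-bar and S chart
(
    df_random
    >> gr.tf_mutate(id_batch=DF.index // 10)  # consecutive batches of 10
    >> gr.tf_group_by(DF.id_batch)
    >> gr.tf_summarize(
        X_mean=gr.mean(DF.X),  # X-bar: the batch sample mean
        X_sd=gr.sd(DF.X),      # S: the batch sample standard deviation
    )
)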

In order to understand X-bar and S charts, we need to understand why we take batch averages, and how to construct meaningful control limits for X-bar and S.

Batch Averages#

We saw before that the sample mean follows the central limit theorem, which tells us about the distribution of the sample mean if we were to observe multiple, independent datasets. Thus taking the average of multiple observations from the same process gives us a way to make the underlying distribution more predictable.

However, taking the average of all \(n_{\text{total}}\) observations prevents us from determining where an outlier occurred: If the mean of all the data is high, was it the first observation that was large? The last one? Something in the middle?

Therefore, we’ll split the data into batches of a much smaller size \(n_{\text{batch}} \ll n_{\text{total}}\). Taking batch averages gives us better resolution (allowing us to determine which batch is worth further investigation), while still giving the advantages of the central limit theorem.
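
We can check this scaling numerically. The following sketch, using plain numpy on a hypothetical standalone sample, confirms that the spread of batch means is close to \(\sigma/\sqrt{n_{\text{batch}}}\):

## NOTE: A quick numerical check of the 1/sqrt(n) scaling
rng = np.random.default_rng(101)

n_total, n_batch = 1000, 10
x = rng.normal(size=n_total)
batch_means = x.reshape(-1, n_batch).mean(axis=1)

print(x.std())                     # spread of raw observations, about 1
print(batch_means.std())           # spread of batch means, much smaller
print(x.std() / np.sqrt(n_batch))  # the central limit theorem prediction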

Control limits: Confidence Intervals#

Remember (from e-stat05-CI) that the central limit theorem-based confidence interval for the sample mean is given by

\[[\overline{X}_n - z_C \sigma/\sqrt{n}, \overline{X}_n + z_C \sigma/\sqrt{n}]\]

where \(z_C\) is the relevant quantile of a standard normal distribution, and \(n\) is the number of observations in our sample. For control charts, it is standard to choose a very strict confidence level of \(C = 1 - 1/1000\).
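
If you want to see this quantile for yourself, a one-line check (assuming scipy is installed) gives \(z_C \approx 3.29\) for the two-sided level above:

## NOTE: The normal quantile for C = 1 - 1/1000 (a sketch)
from scipy.stats import norm

z_C = norm.ppf(1 - (1 / 1000) / 2)  # two-sided confidence level
print(z_C)                          # approximately 3.29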

To understand how we will use the idea of confidence intervals to look for non-random patterns, let’s remind ourselves of the golden rule for interpreting confidence intervals:

The “Golden Rule” for interpreting Confidence Intervals

When interpreting a confidence interval, we should assume the true value could be anywhere inside the interval.

With a control chart, we want to be able to tell when a pattern in the data looks non-random. To that end, we’ll study the sample means of small batches of the data, and compare all of those batch means against a single confidence interval. Since the confidence interval is a reasonable range for the true value, we will flag batch means that fall outside the confidence interval as potentially non-random, and worthy of further investigation.

When constructing a control chart, we use the entire dataset to estimate the mean \(\overline{X}_{n_{\text{total}}}\) and standard deviation \(\sigma\), but construct the confidence interval based on the batch size \(n_{\text{batch}}\). That looks like:

\[[\overline{X}_{n_{\text{total}}} - z_C \sigma/\sqrt{n_{\text{batch}}}, \overline{X}_{n_{\text{total}}} + z_C \sigma/\sqrt{n_{\text{batch}}}]\]

The helper function gr.pt_xbs() automates the approximation of these control limits. In the rest of this exercise, we’ll focus on grouping the observations into reasonable batches.

See the references for the gory statistical details

When it comes to approximating the standard deviation \(\sigma\) for a control chart, there is actually a bit more work we have to do to de-bias the standard deviation. The control limits for the standard deviation also follow a different formula than the sample mean. For more details, see the references listed at the bottom of this exercise.
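
As a small taste of those details: for normal data the sample standard deviation is biased low, with \(E[S] = c_4(n)\,\sigma\), so a de-biased estimate divides by the constant \(c_4\). The following sketch implements the standard textbook formula for \(c_4\); it is illustrative, not grama’s internal code:

## NOTE: The c4 de-biasing constant (a sketch)
from scipy.special import gammaln

def c4(n):
    """Unbiasing constant: E[S] = c4(n) * sigma for normal data."""
    return np.sqrt(2 / (n - 1)) * np.exp(gammaln(n / 2) - gammaln((n - 1) / 2))

print(c4(10))  # about 0.97; the bias shrinks as n grows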

Grouping observations#

Ideally, the selection of batches should be made during the planning of data collection. If we are studying archival data, we can sometimes make some reasonable assumptions about data collection in order to define batches.

Natural batches#

In cases when we can plan data collection, there are likely candidates for “natural” batches of data. For instance, if we have multiple machines that are used (separately) to make the same part, we could group observations based on which machine led to their manufacture.

The next code chunk simulates data where multiple machines are used to manufacture the same part—one of the machines is malfunctioning. You’ll use a control chart to help determine which machine is causing problems.

## NOTE: No need to edit
np.random.seed(101)

df_base = gr.df_make(
    sd=[0.01, 0.04, 0.01, 0.01, 0.01, 0.01],
    ma=[ "A",  "B",  "C",  "D",  "E",  "F"],
)
mg = gr.marg_mom("norm", mean=0, sd=1)

df_machines = gr.df_grid()
for i in range(df_base.shape[0]):
    df_machines = (
        df_machines
        >> gr.tf_bind_rows(
            gr.df_make(
                X=mg.r(15) * df_base.sd[i],
                id_machine=df_base.ma[i],
            )
        )
    )
# Randomly sort
df_machines = (
    df_machines
    >> gr.tf_sample(frac=1)
)

q0 Conceptual questions#

Before we start constructing control charts, let’s check your understanding with some conceptual questions. Run the code below, and answer the questions under Observations.

## NOTE: No need to edit this code; run and inspect
(
    df_machines
    >> gr.tf_count(DF.id_machine)
    >> gr.tf_bind_rows(
        df_machines
        >> gr.tf_summarize(n=gr.n())
        >> gr.tf_mutate(id_machine="(Total)")
    )
)
  id_machine   n
0          A  15
1          B  15
2          C  15
3          D  15
4          E  15
5          F  15
6    (Total)  90

Observations

  • If we set the grouping according to the column id_machine, what value of n should we use to compute the control limits? (That is, the factor of \(1/\sqrt{n}\).)

    • Since there are n=15 values in each group, we should set n_batch == 15.

  • To construct a control chart, we need to compute control limits based on the variance of the data. What number of observations is used to compute the variance?

    • The variance should be estimated based on the entire dataset, which has n == 90.
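
To make these answers concrete, here is a back-of-the-envelope sketch of the X-bar control limits for this grouping. It uses the naive pooled standard deviation; gr.pt_xbs() applies an additional de-biasing correction, so its limits will differ slightly.

## NOTE: Rough X-bar control limits for the machine data (a sketch)
from scipy.stats import norm

z_C = norm.ppf(1 - (1 / 1000) / 2)  # two-sided C = 1 - 1/1000, about 3.29
n_batch = 15                        # observations per machine

center = df_machines["X"].mean()    # center line, from all 90 observations
sigma = df_machines["X"].std()      # naive spread estimate, from all 90 observations

print(center - z_C * sigma / np.sqrt(n_batch))  # rough LCL
print(center + z_C * sigma / np.sqrt(n_batch))  # rough UCL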

q1 Batch by machine#

Construct a control chart using the machine identifier as the group variable. Look for patterns in the measured quantity X. Answer the questions under observations below.

Hint: Remember to read the documentation for an unfamiliar function to learn about its arguments.

%%capture --no-display
## TODO: Create a control chart grouped by machine
(
    df_machines
# solution-begin    
    >> gr.pt_xbs(group="id_machine", var="X")
# solution-end    
)

Observations

  • Based on the batch size you chose, did you find any out-of-control patterns?

    • Yes, I see out-of-control variability for Machine B.

  • If you answered “yes” to the previous question, does your group variable give you a meaningful way to determine what may have caused the out-of-control pattern?

    • Yes! Batching by machine gives us an obvious clue about where the out-of-control behavior occurred.

Batches in time#

Natural batches, such as the per-machine batches above, are a fine way to choose the grouping for a control chart. However, there are other ways to batch the data.

Another way to batch the data is by consecutive periods of time; this can be helpful for monitoring a single data-generating process to see if something happened at a particular point in time.

When batching consecutive observations, we should choose a reasonable batch size \(n_{\text{batch}}\). The following rule of thumb is helpful for picking one.

A rule of thumb is to use three time periods with 20 to 25 samples of size \(n_{\text{batch}} = 50\) to \(n_{\text{batch}} = 100\) units each.

The following code simulates data from a process that varies over time. The data collection plan calls for 45 samples, each containing \(n_{\text{batch}} = 50\) observations, which results in 45 * 50 = 2250 observations in total.

## NOTE: No need to edit
np.random.seed(101)

df_base = gr.df_make(
    mu=[0]*30 + list(gr.linspace(0, 0.8, 15)),
)
mg = gr.marg_mom("norm", mean=0, sd=1)

df_time = gr.df_grid()
for i in range(df_base.shape[0]):
    df_time = (
        df_time
        >> gr.tf_bind_rows(
            gr.df_make(X=mg.r(50) + df_base.mu[i])
        )
    )
# Randomly sort
df_time = (
    df_time
    >> gr.tf_mutate(t=DF.index/gr.n(DF.index))
    >> gr.tf_sample(frac=1)
)

q2 Create consecutive groups#

Use a mutate to create a column id_seq to group sequential observations.

Hint 1: Remember that floor division x // y is one way to create sequential values.

Hint 2: Remember that DF.index allows you to access the index of each row in a DataFrame.

## TODO: Create a column `id_seq` to group sequential observations
df_seq = (
    df_time
# solution-begin
    >> gr.tf_mutate(id_seq=DF.index // 50)
# solution-end
)

## NOTE: Use this to check your work
assert \
    "id_seq" in df_seq.columns, \
    "df_seq does not have a column `id_seq`"

print(
    "You grouped the observations into {0:} samples of size n_batch = {1:}".format(
        len(set(df_seq.id_seq)),
        len(df_seq[df_seq.id_seq == df_seq.id_seq.values[0]]),
    )
)

print(df_seq >> gr.tf_head(20))
You grouped the observations into 45 samples of size n_batch = 50
           X         t  id_seq
0   2.605316  0.801333       0
1  -0.234766  0.080444       0
2  -0.950839  0.888889       0
3  -1.340159  0.527556       0
4   0.563586  0.999556       0
5   0.372806  0.253778       0
6   1.378403  0.080000       0
7   1.031736  0.288000       0
8   0.639372  0.916889       0
9   1.186054  0.336000       0
10  0.083397  0.545333       0
11 -0.683701  0.935556       0
12 -0.388190  0.505333       0
13  0.139908  0.761333       0
14  0.978535  0.296444       0
15 -0.187721  0.272889       0
16 -0.063713  0.812444       0
17  0.759567  0.749778       0
18  0.087531  0.587111       0
19  0.150913  0.484889       0

q3 Batch by slices of time#

Sort the data by time t, then batch the observations into consecutive groups and construct a control chart.

Make sure to match the batch size \(n_{\text{batch}}\) used in data collection.

Answer the questions under observations below.

%%capture --no-display
## TODO: Arrange the dataset by time `t` and batch appropriately
df_timebatch = (
    df_time
# solution-begin    
    >> gr.tf_arrange(DF.t)
    >> gr.tf_mutate(id_seq=DF.index // 50)
# solution-end    
)

## NOTE: No need to edit; use this to check your work
assert \
    "id_seq" in df_timebatch.columns, \
    "df_timebatch does not have a column `id_seq`"

assert \
    len(df_timebatch[df_timebatch.id_seq == df_timebatch.id_seq.values[0]]) == 50, \
    "You did not choose a batch size n_batch that matches the data collection"
    
# Visualize
(
    df_timebatch
    >> gr.pt_xbs(group="id_seq", var="X")
)

Observations

Hint: You will not see any patterns if you do not properly arrange the data.

  • Based on the batch size you chose, did you find any out-of-control patterns?

    • Arranged by time and with a batch size of n=50, I see one low mean and a long string of high-mean batches.

  • If you answered “yes” to the previous question, does your group variable give you a meaningful way to determine what may have caused the out-of-control pattern?

    • Yes; since the batches are arranged in time, we can see that the process was trending upward toward the end of the period.

Naive consecutive rows#

What happens if we do not correctly arrange the rows? Then we lose all information about the time history of the data. The following plot shows what happens if we construct a control chart without respecting the time ordering of the data:

%%capture --no-display
## NOTE: No need to edit; run and inspect
(
    df_time
    >> gr.tf_mutate(id_batch=DF.index // 50)
    >> gr.pt_xbs(group="id_batch", var="X")
)

Note that there are no patterns in this view of the data!

Patterns to seek#

Kenett and Zacks (2014) highlight several patterns that can signal an out-of-control manufacturing process:

  1. A single point outside the control limits.

  2. A run of nine or more points in a row above (or below) the centerline.

  3. Six consecutive points increasing (trend up) or decreasing (trend down).

The following cells simulate data to illustrate each of these patterns.

Outside control limits#

%%capture --no-display
## NOTE: No need to edit
np.random.seed(101)

n = 20
mu = [0]*15 + [1]*1 + [0]*15 
mg_standard = gr.marg_mom("norm", mean=0, sd=0.5)

df_outlier = gr.df_grid()
for i, m in enumerate(mu):
    df_outlier = (
        df_outlier
        >> gr.tf_bind_rows(gr.df_make(
            X=mg_standard.r(n) + m
        ))
    )
    
(
    df_outlier
    >> gr.tf_mutate(idx=DF.index // n)
    >> gr.pt_xbs(group="idx", var="X")
)

Variability out-of-control#

%%capture --no-display
## NOTE: No need to edit
np.random.seed(101)

n = 20
mu = [0]*30
mg_standard = gr.marg_mom("norm", mean=0, sd=0.5)

df_outlier = gr.df_grid()
for i, m in enumerate(mu):
    df_outlier = (
        df_outlier
        >> gr.tf_bind_rows(gr.df_make(
            X=mg_standard.r(n) * (1 + (i == 20)) + m
        ))
    )
    
(
    df_outlier
    >> gr.tf_mutate(idx=DF.index // n)
    >> gr.pt_xbs(group="idx", var="X")
)

Runs above center#

%%capture --no-display
## NOTE: No need to edit
np.random.seed(101)

n = 20
mu = [0]*15 + [1]*10 + [0]*15 
mg_standard = gr.marg_mom("norm", mean=0, sd=2.0)

df_above = gr.df_grid()
for i, m in enumerate(mu):
    df_above = (
        df_above
        >> gr.tf_bind_rows(gr.df_make(
            X=mg_standard.r(n) + m
        ))
    )
    
(
    df_above
    >> gr.tf_mutate(idx=DF.index // n)
    >> gr.pt_xbs(group="idx", var="X")
)

Note that none of the batch means exceed the upper control limit (UCL), but a “run” of many (9 or more) points above the centerline is highly unlikely to be due to chance alone: if an in-control process is equally likely to land above or below the centerline, a run of nine on one side has probability \((1/2)^9 = 1/512 \approx 0.2\%\).
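
The run rule is easy to check by hand. Here is a minimal sketch, scanning a hypothetical list of batch means for a long run on either side of the centerline:

## NOTE: A sketch of scanning batch means for runs
def longest_run(flags):
    """Length of the longest run of identical consecutive values."""
    best = run = 1
    for prev, cur in zip(flags, flags[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

xbar = [0.1, -0.2, 0.3, 0.5, 0.4, 0.2, 0.6, 0.1, 0.3, 0.2, -0.1]  # hypothetical batch means
center = sum(xbar) / len(xbar)
above = [x > center for x in xbar]
print(longest_run(above) >= 9)  # True would flag an out-of-control run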

Runs below center#

%%capture --no-display
## NOTE: No need to edit
np.random.seed(101)

n = 20
mu = [0]*15 + [-1]*10 + [0]*15 
mg_standard = gr.marg_mom("norm", mean=0, sd=2.0)

df_below = gr.df_grid()
for i, m in enumerate(mu):
    df_below = (
        df_below
        >> gr.tf_bind_rows(gr.df_make(
            X=mg_standard.r(n) + m
        ))
    )
    
(
    df_below
    >> gr.tf_mutate(idx=DF.index // n)
    >> gr.pt_xbs(group="idx", var="X")
)

Runs increasing#

%%capture --no-display
## NOTE: No need to edit
np.random.seed(101)

n = 20
mu = [0.5]*10 + list(gr.linspace(0, 1, 6)) + [0.5]*10
mg_standard = gr.marg_mom("norm", mean=0, sd=0.8)

df_increase = gr.df_grid()
for i, m in enumerate(mu):
    df_increase = (
        df_increase
        >> gr.tf_bind_rows(gr.df_make(
            X=mg_standard.r(n) + m
        ))
    )
    
(
    df_increase
    >> gr.tf_mutate(idx=DF.index // n)
    >> gr.pt_xbs(group="idx", var="X")
)

Runs decreasing#

%%capture --no-display
## NOTE: No need to edit
np.random.seed(101)

n = 20
mu = [0.5]*10 + list(1 - gr.linspace(0, 1, 6)) + [0.5]*10
mg_standard = gr.marg_mom("norm", mean=0, sd=0.8)

df_decrease = gr.df_grid()
for i, m in enumerate(mu):
    df_decrease = (
        df_decrease
        >> gr.tf_bind_rows(gr.df_make(
            X=mg_standard.r(n) + m
        ))
    )
    
(
    df_decrease
    >> gr.tf_mutate(idx=DF.index // n)
    >> gr.pt_xbs(group="idx", var="X")
)

Warnings#

While we should be on the lookout for patterns in a control chart, we should balance this out with a healthy dose of skepticism. The following two examples demonstrate cases where we would see patterns in a control chart that have no real meaning.

Beware spurious patterns!#

Note that false patterns can still occur based on random events. These false patterns become more likely with smaller batch sizes. For instance, let’s take a look at the completely random dataset from above, batched into tiny subgroups:

%%capture --no-display
## NOTE: No need to edit

(
    df_random 
    >> gr.tf_mutate(id_batch=DF.index // 2)
    >> gr.pt_xbs(group="id_batch", var="X")
)

Note that a variety of “patterns” are detected, but these cannot possibly be due to any assignable causes: the data are completely random by construction. Using a larger batch size helps protect us against being misled by random patterns, but keep in mind that false detections are always possible!
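
To put a number on this, the following sketch, using plain numpy and independent of grama, estimates the expected number of false alarms per chart when pure noise (known \(\sigma = 1\)) is split into 500 batches of size 2, as above. With a two-sided confidence level of \(C = 1 - 1/1000\), the limits alone yield about 0.5 false alarms per chart, and pattern rules add further chances to be fooled:

## NOTE: Expected false alarms from pure noise (a sketch)
rng = np.random.default_rng(101)

n_batch, n_batches, n_charts = 2, 500, 1000
z_C = 3.29  # quantile for C = 1 - 1/1000

# Batch means from an in-control process with known sigma = 1
means = rng.normal(size=(n_charts, n_batches, n_batch)).mean(axis=2)
alarms = (np.abs(means) > z_C / np.sqrt(n_batch)).sum(axis=1)

print(alarms.mean())  # average false alarms per chart, about 0.5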

Beware meaningless order#

Patterns based on order (such as high/low or increasing/decreasing runs) are meaningless when the order of the group variable has no meaning. For instance, we can reorder the machine identifiers in df_machines to artificially create an increasing run:

%%capture --no-display
## NOTE: No need to edit
(
    df_machines
    >> gr.tf_mutate(id_machine=gr.fct_reorder(DF.id_machine, DF.X, fun=gr.mean))
    >> gr.pt_xbs(group="id_machine", var="X")
)

Here the batch means are increasing, but that is only because we ordered the machines by their batch mean!

Example: Cast Aluminum Dataset#

Let’s take a closer look at the cast aluminum dataset using our new lens of control charts.

from grama.data import df_shewhart
df_shewhart.head()
   specimen  tensile_strength  hardness  density
0         1             29314      53.0    2.666
1         2             34860      70.2    2.708
2         3             36818      84.3    2.865
3         4             30120      55.3    2.627
4         5             34020      78.5    2.581

We might want to use this data to model these material properties as random, but before that we should ask: Is the manufacturing process generating these properties under statistical control?

q4 Assess the density#

Assess the state of statistical control for the variable density using a control chart.

Answer the questions under observations below.

%%capture --no-display
## TODO: Construct batches and make a control chart
(
    df_shewhart
    >> gr.tf_mutate(idx=DF.index // 3)
    >> gr.pt_xbs(group="idx", var="density")
)

Observations

  • What assumptions did you have to make in order to construct rational batches?

    • To group observations into consecutive batches, we need to assume that the data are reported in the order they were collected.

  • Based on the batch size you chose, did you find any out-of-control patterns? Did it occur in the Variability or batch Mean?

    • At a batch size of \(n=5\), I find no out-of-control patterns.

    • At a batch size of \(n=3\), I do find out-of-control patterns (one case outside the UCL).

  • If you answered “yes” to the previous question: Based on the batch size you chose, are those patterns likely to be spurious?

    • For the \(n=3\) case, this is a very small batch size; the out-of-control case could be spurious.

  • Is the density likely to be under statistical control?

    • Yes; I find no evidence of out-of-control behavior.

q5 Assess the tensile_strength#

Assess the state of statistical control for the variable tensile_strength using a control chart.

Answer the questions under observations below.

%%capture --no-display
## TODO: Construct batches and make a control chart
(
    df_shewhart
    >> gr.tf_mutate(idx=DF.index // 5)
    >> gr.pt_xbs(group="idx", var="tensile_strength")
)

Observations

  • What assumptions did you have to make in order to construct rational batches?

    • To group observations into consecutive batches, we need to assume that the data are reported in the order they were collected.

  • Based on the batch size you chose, did you find any out-of-control patterns? Did it occur in the Variability or batch Mean?

    • Yes; at a batch size of n=5, I found a batch mean that violates the LCL.

  • If you answered “yes” to the previous question: Based on the batch size you chose, are those patterns likely to be spurious?

    • Possibly; n=5 is a very small batch size, so this could certainly be due to chance.

  • Is the tensile_strength likely to be under statistical control?

    • Possibly; I find that the below-LCL batch mean persists for batch sizes of n=6, 7, 8, 9, which suggests the process may have been out of control for some of the earlier runs. However, the below-LCL case disappears for n=10; it is possible that this pattern is spurious.

References#

  • Kenett and Zacks, Modern Industrial Statistics, 2nd Edition (2014)