Friday, February 27, 2015

What color is the dress??!??!?!?

Enough people have asked me to adjudicate this question that I really have to do an emergency post.

Unless you are living in a dark cave you have probably heard the furor. What color is the dress????


It seems that 75% see it as white and gold and 25% see it as blue and black.


I have learned to avoid this type of question. I have learned in my marriage that if I say “Wow! Look at that really gorgeous woman in the turquoise dress! She’s really cute, and I would like to ask her out – maybe take her to Bermuda for the weekend.” this will start an argument with my wife about the color of the dress!

But clearly I need to weigh in on this. It comes down to how we define color.

What is "color"?

Definition #1: Color is something computed from the spectrum of light that comes off an object.

If this were true, a piece of copy paper would be brilliant white in the sunlight and a very, very dark brown under incandescent light. It isn't; it looks white under both. This is explained by the fact that the cones are autoranging.

Definition #2: Color is defined by the signals collected by the cones after this auto-ranging.

This optical illusion proves that to be wrong: http://en.wikipedia.org/wiki/Checker_shadow_illusion. The cones are collecting the same signals for A and B in that illusion, but the neurons in the eye are comparing the pixels with their neighbors: "This pixel is bluer than the one next to it, so I am going to call it 'a little bit blue'."

Definition #3: Color is defined by the signals leaving the eye.

This still doesn't quite explain the checker illusion. The lower part of the brain pulls a lot of tricks in an attempt to interpret a visual scene. In the checker example, the lower brain interprets each of the squares of the checkerboard as an object, and simplifies things by saying that the variation in signal intensity is due to shading, and not due to any property of the squares themselves.

Definition #4: Color is defined by the interpretation provided by the lower brain, after segmenting the scene into distinct objects.

Still not there yet. The thing that is (likely) the point of confusion in the dress image is that the brain actively seeks a white point from which to judge color. Look at a newspaper. What color is it? White. Now lay a piece of white ultra bright copy paper next to it. What color is the newspaper now? It turned dingy. Your brain first used the newspaper as a white point, and made its color assessment  based on that. When the copy paper was introduced, your brain picked up a different definition of white to compare things to.

In the dress picture, you will note that the upper right portion is saturated. This is confusing to the brain, since the autorange in the eyeball doesn’t normally allow things to saturate. Our eye would scale this so that we could see the bright area, and we would not be able to tell the color of the dress.

How does the brain interpret the saturation that happened in the camera? Does it see that as the white point and assess from there? Or does it come up with another brilliant (pun intended) explanation, and set its white point to something beyond 255, 255, 255? This is a guess, but I think that different brains might set different white points.
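
To make the white point idea a little more concrete, here is a minimal sketch (in Python) of a crude von Kries-style scaling. The pixel value and the two assumed white points below are invented for illustration (they were not measured from the photo), but they show how the very same pixel can come out nearly white under one assumed white point and decidedly blue under another.

```python
# A minimal sketch of the white-point idea: a crude von Kries-style scaling,
# not a model of what the brain actually does. The pixel and the two assumed
# white points are invented for illustration, not measured from the photo.

def discount_illuminant(rgb, assumed_white):
    """Rescale each channel so the assumed white point maps to (255, 255, 255)."""
    return tuple(min(255, round(255 * c / w)) for c, w in zip(rgb, assumed_white))

dress_pixel = (130, 140, 175)   # a hypothetical bluish-gray pixel from the dress

# Brain A assumes the dress sits in bluish shadow: the pixel comes out near white.
print(discount_illuminant(dress_pixel, (135, 145, 180)))   # -> (246, 246, 248)

# Brain B assumes bright, slightly yellowish light: the pixel stays blue.
print(discount_illuminant(dress_pixel, (255, 245, 200)))   # -> (130, 146, 223)
```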


Objects don’t have an inherent color. Color is a subtle interplay between the light hitting an object, the light reflected from the object, the spectral response and autoranging of the cones, the low level segmentation into distinct objects, and the interpretation of white point.

So what's my answer? The question is a silly question. Dresses do not have any inherent color. The better question is "What color do you see when you look at the dress?" The answer to that question apparently depends on the viewer. Color is in the eye... make that the brain of the beholder.


Wednesday, February 18, 2015

How many samples do I need? (with lots of assumptions)

In an earlier post, I looked at the question "How many samples of a production run do I need to assure that 68% are within tolerance?" I concluded with "at least 300, and preferably 1,200." In the first pass I made only one assumption - that the sampling of parts to test was done at random. I answered the question with no information about the metric or about the process.

For my answer, I reduced the question down to a simple one. Suppose that a green M&M is put into a bucket/barrel/swimming pool whenever a good part is produced, and that a red M&M is put in whenever a bad part is produced. For the testing of this production run, M&Ms are pulled out at random, and noted as being either good or bad. After tallying the color of each M&M, they are replaced into the bucket/barrel/swimming pool. 


Note the assumptions. I assume that the production run is sampled at random, with replacement. And that's about it.

Statistics of color difference values

Today I answer the question with a whole lot of additional assumptions. Today I assume that the metric being measured and graded is the color difference value, in units of ΔE. And I make some assumptions about the statistical nature of color differences.

I assembled real-world data to create an archetypal cumulative probability density function (CPDF) of color difference data from a collection of 262 color difference data sets each with 300 to 1,600 data points. In total, my result is a distillation of  317,667 color differences from 201 different print devices, including web offset, coldset, flexo, and ink jet printers. So, a lot of data was reduced to a set of 101 percentile points shown in the image below. Note that this curve has been normalized to have a median value of 1.0 ΔE, on the assumption that all the curves have the same shape, but differ in scale.

Archetypal cumulative probability density function for color difference data (ΔEab)

For my analysis, it is assumed that all color difference data has this same shape. Note that if one has a data set of color difference data, it can be transformed to relate to this archetype by dividing all the color difference values by the median of the data set. In my analysis of the 262 data sets, this may not have been an excellent assumption, but then again, it was not a bad assumption.
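
For the record, that normalization is only a couple of lines of code. Here is a minimal sketch; the data below is simulated, standing in for one of the 262 real data sets.

```python
import numpy as np

def archetype_from_dataset(delta_e_values):
    """Normalize a set of ΔE values to a median of 1.0 and tabulate the
    curve at the 0th through 100th percentiles (101 points)."""
    values = np.asarray(delta_e_values, dtype=float)
    normalized = values / np.median(values)
    return np.percentile(normalized, np.arange(101))

# Made-up data standing in for one of the real color difference data sets
fake_data = np.random.gamma(shape=2.0, scale=1.0, size=500)
curve = archetype_from_dataset(fake_data)
print(curve[50])   # the median of the normalized data: 1.0 by construction
print(curve[68])   # the normalized 68th percentile
```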

The archetypal curve is based on data from printing test targets, each with hundreds of CMYK values, and not from production runs of 10,000 copies of a single CMYK patch. For this analysis, I make the assumption that run-time color differences behave kinda the same. I've seen data from a couple three press runs. I dunno, might not be such a good assumption.

Let's see... are there any other assumptions that I am making today? Oh yeah... I have based the archetypal CPDF on color differences computed with the original 1976 ΔE formula, and not the 2000 version. Today, I don't know how much of a difference this makes. Some day, I might know.

Monte Carlo simulation of press runs

I did some Monte Carlo simulations with all the aforementioned assumptions. I was asking a variation on the question asked in the previous blog. Instead of asking how many samples were needed to make a reliable pass/fail call, I asked how many samples were needed to get a reliable estimate of the 68th percentile. Subtle difference, but that's the nature of statistics.

As in the previous blog, I will start with the example of the printer who pulls only three samples and from these three, determines the 68th percentile. I'm not sure just how you get a 68th percentile from only three samples, but somehow when I use the PERCENTILE function in Excel or the Quantile function in Mathematica, they give me a number. I assume that the number means something reasonable.

Now for a couple more assumptions. I will assume that the tolerance threshold is 4 ΔE (in other words, 68% must be less than 4 ΔE), and that the printer is doing a pretty decent job of holding this - 68% of the samples are below 3.5 ΔE. One would hope that the printer gets the thumbs up on the job almost all the time, right?

Gosh, that would be nice, but my Monte Carlo simulation says that this just ain't gonna happen. I ran the test 10,000 times. Each time, I drew three random samples from the archetypal CPDF shown above. From those, I calculated a 68th percentile. The histogram below shows the distribution of the 68th percentiles determined this way. Nearly 55% of the press runs were declared out of tolerance.

Distribution of estimates for the 68th percentile, determined from 3 random samples
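
For the tinkerers, here is a sketch of the sort of code behind that experiment. One big caveat: I am substituting a gamma distribution for the archetypal CPDF, purely as a stand-in, so the exact percentage it spits out will not match the 55% above. The lesson is the same, though.

```python
import numpy as np

# Monte Carlo sketch: how often does a run whose *true* 68th percentile is
# 3.5 ΔE get flagged as violating a 4 ΔE tolerance, based on only 3 samples?
# A gamma distribution stands in for the archetypal CPDF (an assumption).

rng = np.random.default_rng(0)
SHAPE = 2.0          # assumed shape of the stand-in distribution
TRUE_68TH = 3.5      # the press is really holding a 68th percentile of 3.5 ΔE
TOLERANCE = 4.0      # the contract says the 68th percentile must be under 4 ΔE

# Scale factor so the underlying population truly has a 68th percentile of 3.5 ΔE
scale = TRUE_68TH / np.percentile(rng.gamma(SHAPE, size=1_000_000), 68)

def declared_bad(n_samples):
    """Pull n_samples ΔE values, estimate the 68th percentile, compare to tolerance."""
    samples = rng.gamma(SHAPE, size=n_samples) * scale
    return np.percentile(samples, 68) > TOLERANCE

trials = 10_000
flagged = sum(declared_bad(3) for _ in range(trials))
print(f"good runs declared bad: {flagged / trials:.1%}")
```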

There is something just a tad confusing here. The assumption was that the press runs had a 68th percentile of 3.5 ΔE. Wouldn't you expect that at least 50% of the runs were in tolerance? Yes, I think you might, but note two things: First, the distribution above is not symmetrical. Second, as I said before, determining the 68th percentile of a set of three data points is a bit of a slippery animal.

When this printer saw how many were failing, he asked for my advice. I pointed him to my previous blog, and he said "1200?!?!?  Are you kidding me!?!?  I can't even afford to measure 300 samples!" He ignored me, and never paid me my $10,000 consulting fee, but I heard through the grapevine that he did start pulling 30 samples. That's why I get paid the big bucks. So people can ignore my advice. 

Distribution of estimates for the 68th percentile, determined from 30 random samples

The image above shows what happened when he started measuring the color error on 30 samples per press run. Much better. Now only about 13% of the press runs are erroneously labelled "bad product". What happened after that depended on how sharp the teeth were in the contract between the printer and print buyer. Maybe the print buyer just shrugged it off when one out of every 8 print runs was declared out of tolerance? Maybe there's a lawsuit pending? I don't know. That particular printer never called me up with a status report.

What if the printer had heeded my advice and started pulling 300 samples to determine the 68th percentile? The results from one last Monte Carlo experiment are shown below. Here the printer pulled all 300 samples that I asked for. At the end of 10,000 press runs, the printer had only three examples where a good press run was called "bad". 

Distribution of estimates for the 68th percentile, determined from 300 random samples

Print buyer's perspective

The previous examples were from the printer's perspective, where the printer responds with self-righteous indignation when sadistical process control has the gall to say that a good run is bad. We now turn this around and look at the print buyer's perspective.

Let's say that a printer is doing work that is not up to snuff... I dunno... let's say that the 68th percentile is at 4.5 ΔE. If the print buyer is a forgiving sort, then maybe this is OK by him. But then again, maybe his wife might tell him to stop being such a door mat?  (I am married to a woman who tells her spouse that all the time, especially when it comes to clients not paying.) We can't simulate what this print buyer's wife will tell him, but we can simulate how often statistical process control will erroneously tell him that a 4.5 ΔE run was good.

The results are similar, as I guess we would expect. If your vision of "statistical process control" means three samples, then 21.1% of the bad jobs will be given the rubber stamp of approval. The printer may like that, but I don't think the print buyer's spouse will stand for it.

If you up the sampling to 10 samples, quite paradoxically, the rate of mis-attribution goes up to 35.7%. That darn skewed distribution.

Pulling thirty samples doesn't help a great deal either. With 30 samples, the erroneous use of the "approved" stamp goes down only to 15.7%. If the count is increased to 100, then about 4.7% of the bad runs are called "good". But when 300 samples are pulled, the number drops way down to 0.06%.

Conclusions

I ran the simulation with a number of different sample sizes and a number of different underlying levels of "quality of production run".  The results are below. The percentages are the probability of making a wrong decision. In the first three lines of the table (3.0 ΔE to 3.75 ΔE), this is the chance that a good job will be called bad. In the next three lines of the table, this is the chance that a bad job will be called good.

Actual 68th    N = 3     N = 10    N = 30    N = 100   N = 300
3.0 ΔE         37.0%      4.0%      0.6%      0.0%      0.0%
3.5 ΔE         54.6%     18.1%     12.9%      1.5%      0.0%
3.75 ΔE        61.1%     29.0%     30.2%     13.2%      4.8%
4.25 ΔE        25.9%     47.5%     29.7%     20.9%      5.1%
4.5 ΔE         21.1%     35.7%     15.7%      4.7%      0.1%
5.0 ΔE         13.1%     19.6%      2.9%      0.0%      0.0%

Calculation of this table is a job for an applied math guy. Interpreting the table is a job for a statistician, which is at the edge of my competence. Deciding how to use this table is beyond my pay grade. It depends on how comfortable you are with the various outcomes. If, as a printer, you are confident that your process has a 68th percentile of 3.0 ΔE or less, then 30 samples should prove that point. And if your process slips a bit to the 3.5 ΔE level, and you are cool with having one out of eight of these jobs recalled, then don't let no one talk you into more than 30 samples. If you don't want those jobs recalled though...

If, as a print buyer, you really have no intention of cracking down on a printer until they hit the 5 ΔE mark, then you may be content with 30 samples. But if you want to have some teeth in the contract when a printer goes over 4.5 ΔE, then you need to demand at least 100 samples.
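
For anyone who wants to tinker, the whole table can be regenerated with one nested loop. Same caveat as before: the sketch below uses a stand-in distribution rather than the archetypal CPDF, so its numbers will wander from the table above, but the overall pattern should hold.

```python
import numpy as np

# Regenerate a table like the one above. A gamma distribution stands in for
# the archetypal CPDF (an assumption), 4 ΔE is the tolerance, and an "error"
# means a good run called bad (rows under 4 ΔE) or a bad run called good.

rng = np.random.default_rng(0)
SHAPE, TOLERANCE, TRIALS = 2.0, 4.0, 10_000
unit_scale = 1.0 / np.percentile(rng.gamma(SHAPE, size=1_000_000), 68)

def error_rate(true_68th, n_samples):
    """Fraction of simulated press runs that get the wrong verdict."""
    samples = rng.gamma(SHAPE, size=(TRIALS, n_samples)) * true_68th * unit_scale
    estimates = np.percentile(samples, 68, axis=1)   # one estimate per simulated run
    actually_bad = true_68th > TOLERANCE
    return np.mean((estimates > TOLERANCE) != actually_bad)

for true_68th in (3.0, 3.5, 3.75, 4.25, 4.5, 5.0):
    rates = [error_rate(true_68th, n) for n in (3, 10, 30, 100, 300)]
    print(true_68th, "ΔE:", ["{:.1%}".format(r) for r in rates])
```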

First addendum

You will note that my answer was a little different than the previous blog post, where I made minimal assumptions. If I make all the assumptions that are in this analysis, then the number of samples required (to demonstrate that 68% of the colors are within a threshold color difference) is smaller than the previous blog might have suggested. Then again, that one word ("assume", and its derivatives) in bold print appears on this page 22 times...

Second addendum

In the first section, I mentioned "sampling with replacement", which means that you might sample a given product twice. Kind of a waste of time, really. Especially for small production runs, where the likelihood of duplicated effort is larger. Taken to the extreme, my conclusion was clearly absurd. Do I really need to pull 300 samples for my run of 50 units?!!?!?!

Well, no. Clearly one would sample a production run without replacement. But, in my world, a production run of 10,000 units is on the small side, so I admit to the myopic vision. For the purposes of this discussion, if the production run is over 10,000, it doesn't matter a whole lot whether a few of the 1,200 samples are measured twice. 
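
If you want to put a number on "doesn't matter a whole lot", the comparison between sampling with replacement (binomial) and without replacement (hypergeometric) is only a few lines. The run size, compliance rate, and sample count below are assumptions picked to match this discussion.

```python
from scipy.stats import binom, hypergeom

# Assumed scenario: a run of 10,000 units, 68% of them good, 300 samples pulled,
# and the run "passes" if at least 68% of the samples come up good.
run_size, good_in_run, n_samples = 10_000, 6_800, 300
threshold = int(0.68 * n_samples)       # need at least this many good samples

# P(at least `threshold` good samples) under each sampling scheme
with_replacement = binom.sf(threshold - 1, n_samples, good_in_run / run_size)
without_replacement = hypergeom.sf(threshold - 1, run_size, good_in_run, n_samples)

print(f"P(pass), sampling with replacement:    {with_replacement:.4f}")
print(f"P(pass), sampling without replacement: {without_replacement:.4f}")  # nearly identical
```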

Wednesday, February 4, 2015

How many samples do I need?

Simple question:  If I am sampling a production run to verify tolerances, how many production pieces do I need to sample?

It's an easily stated question, and also an important one. Millions of dollars may be at stake if a production run has to be scrapped or if the customer has to be reimbursed for a run that was out of tolerance (so-called "makegoods"). On the other side, the manufacturer may need to spend tens or hundreds of thousands of dollars on equipment to perform inspection.

For certain manufactured goods, 100% compliance is required. The cost of delivering a bad Mercedes, pharmaceutical, or lottery ticket is very high, so pretty much every finished good has to be inspected. But in most cases, the cost of a few bad goods is not that great. If a few cereal boxes burst because of a bad glue seal, or if a page in the Color Scientist Monthly is smeared, how bad can that be? It's a calculated risk of product waste versus inspection cost.

Now if the foldout featuring the PlayColor of the Month is smeared, that's another story.

I volunteer to do 100% inspection of the taste of these products!

In the world I live in - printing - contracts often stipulate a percentage of product that must be within a given tolerance. This is reflected in the ISO standards. I have pointed out previously that ISO 12647-2 requires 68% of the color control patches within a run be within a specified tolerance. The thought is, if we get 68% of the samples within a "pretty good" tolerance, then 95% will be within a "kinda good" tolerance. All that bell curve kinda stuff.

A press run may have tens of thousands or even millions of impressions. Clearly you don't need to sample all of the control patches in the press run in order to establish the 68%, but how many samples are needed to get a good guess?

Maybe three samples?

Keeping things simple, let's assume that I pull three samples from the run, and measure those. There are four possible outcomes: all three might be in compliance, two of the three might be in compliance, only one may be in compliance, or none of the samples might be in compliance. I'm going to cheat just a tiny bit, and pretend that if two or more of the three pass, then I am in compliance. That's 66.7% versus 68%. It's an example. Ok?

I am also going to assume that random sampling is done, or more accurately, that the sampling is done in such a way that the variations in the samples are independent. Note that pulling three samples in a row almost certainly violates this. Sampling at the end of each batch, roll, or work shift probably also violates this. And at the very least, the samples must be staggered through the run. 

Under those assumptions, we can start looking at the likelihood of different outcomes. The table below shows the eight possible outcomes, and the ultimate diagnosis of the production run. 

Sample 1      Sample 2      Sample 3      Run diagnosis   Probability
Not so good   Not so good   Not so good   Fail            (1-p)³
Not so good   Not so good   Good          Fail            p(1-p)²
Not so good   Good          Not so good   Fail            p(1-p)²
Not so good   Good          Good          Pass            p²(1-p)
Good          Not so good   Not so good   Fail            p(1-p)²
Good          Not so good   Good          Pass            p²(1-p)
Good          Good          Not so good   Pass            p²(1-p)
Good          Good          Good          Pass            p³

Four of the possibilities show that the run was passed, and four show it failing, but this is not to say that there is a 50% chance of passing. The possible outcomes are not equally likely. It depends on the probability that any particular sample is good. If, for example, the production run were to be overwhelmingly in compliance (as one would hope), the probability that all three samples would come up good is very high.

The right-most column helps us quantify this. If the probability of pulling a good sample is p, then the probability of pulling three good samples is p³. From this, we can quantify the likelihood that we will get at least the requisite two good samples out of three to qualify the production run as good.

     Probability of ok-ing the run based on three samples = p²(1-p) + p²(1-p) + p²(1-p) + p³ = 3p²(1-p) + p³
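
That formula generalizes to any number of samples: the run passes if at least some required number of the samples come up good, and the probability of that is just a sum of binomial terms. Here is a minimal sketch (the function name is mine, nothing official):

```python
from math import comb

def prob_pass(p, n=3, k=2):
    """Probability that at least k of n independent samples are good, when each
    sample is good with probability p. With n=3, k=2 this is 3p²(1-p) + p³."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

print(prob_pass(0.40))   # about 0.352: the 40%-in-tolerance example coming up shortly
print(prob_pass(0.80))   # about 0.896: a good manufacturer still fails roughly 10% of the time
```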

Things start going bad

What could possibly go wrong?  We have proper random sampling, and we have a very official looking formula.

Actually, two different things could go wrong. First off, the production run might be perfectly good, but, by luck of the draw, two or three bad samples were drawn. I'm pretty sure the manufacturer wouldn't like that. 

The other thing that could go wrong is that the production run was actually out of tolerance (more than one-third of the pieces were bad), but this time Lady Tyche (the Goddess of Chance) favored the manufacturer. The buyer probably wouldn't like that.

From the formula above, we can plot the outcomes as a function of the true percentage that were in tolerance. The plot conveniently shows the four possibilities: correctly rejected, incorrectly rejected, correctly accepted, and incorrectly accepted.

Outcomes when 3 samples are used to test for conformance

Looking at the plot, we can see that if 40% of the widgets in the whole run were in tolerance, then there is a 35.2% chance that the job will be given the thumbs up, and consequently a 64.8% chance of being given the thumbs down as it should. The manufacturers who are substandard will be happy that they still have a fighting chance if the right samples are pulled for testing. This of course is liable to be a bit disconcerting for the folks that buy these products.

But, the good manufacturers will bemoan the fact that even when they do a stellar job of getting 80% of the widgets widgetting properly, there is still a chance of more than 10% that the job will be kicked out.

Just in case you were wondering, the area of the red (representing incorrect decisions) is 22.84%. That seems like a pretty good way to quantify the efficacy of deciding about the run based on three samples.
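
In case you want to check that area for yourself, the calculation is roughly this: average the probability of a wrong call over every possible true compliance rate, treating them all as equally likely. Drawing the pass/fail boundary at the 2-of-3 rule (two thirds) reproduces the 22.84%; drawing it at exactly 68% moves the number slightly.

```python
import numpy as np
from math import comb

def prob_pass(p, n=3, k=2):
    """Probability that at least k of n samples are good, each good with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def mistake_area(n=3, k=2, cutoff=2/3, steps=100_001):
    """Average probability of a wrong verdict over all true compliance rates from 0 to 1."""
    p = np.linspace(0.0, 1.0, steps)
    pass_prob = np.array([prob_pass(x, n, k) for x in p])
    # A wrong call is passing a run that is really below the cutoff,
    # or failing a run that is really at or above it.
    wrong = np.where(p < cutoff, pass_prob, 1.0 - pass_prob)
    return wrong.mean()

print(f"{mistake_area():.2%}")   # about 22.8%
```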

How about 30 samples?

Three samples does sound a little skimpy -- even for a lazy guy like me. How about 30? The Seymourgraph for 30 samples is shown below. It does look quite a bit better... not quite so much of the bad decision making, especially when it comes to wrongly accepting lousy jobs. Remember the manufacturer who got away with shipping lots that were only 40% in tolerance one in three times? If he is required to sample 30 products to test for compliance, all of a sudden his chance of getting away with this drops way down to 0.3%. Justice has been served!

Outcomes when 30 samples are used to test for conformance

And at the other end, the stellar manufacturer who is producing 80% of the products in tolerance now has only a 2.6% chance of being unfairly accused of shoddy merchandise. That's better, but if I were a stellar manufacturer, I would prefer not to get called out on the carpet once out of 40 jobs. I would look into doing more sampling so I could demonstrate my prowess.

The area of the red curve is now 6.95%, by the way. I'm not real sure what that means. It kinda means that the mistake rate is about 7%, but you gotta be careful. The mistake rate for a particular factory depends on the percentage that are produced to within a tolerance. This 7% mistake rate has to do with the mistake rate for pulling 30 samples over all possible factories. 

I am having a hard time getting my head around that, but it still strikes me that this is a decent way to measure the efficacy of pulling 30 samples.

How about 300 samples?

So... thirty samples feels like a lot of samples, especially for a lazy guy like me. I guess if it was part of my job, I could deal with it. But as we saw in the last section, it's probably not quite enough. Misdiagnosing the run 7% of the time sounds a bit harsh. 

Let's take it up a notch to 300 samples. The graph, shown below, looks pretty decent. The mis-attributions occur only between about 59% and 72%. One could make the case that, if the condition of the production facility is cutting it that close, then it might not be so bad for them to be called out on the carpet once in a while.

Outcomes when 300 samples are used to test for conformance

Remember looking at the area of the red part of the graph... the rate of mis-attributions?  The area was 22.84% when we took 3 samples. It went down to 6.95% with 30 samples. With 300 samples, the mis-attribution rate goes down to 2.17%. 

The astute reader may have noticed that each factor of ten increase in the number of samples decreased the mis-attribution error by a factor of about three. In general, one would expect the mis-attribution rate to drop as the square root of the number of samples. Multiplying the sampling rate by ten will decrease the mis-attribution rate by the square root of ten, which is about 3.16.

If our goal is to bring the mis-attribution rate down to 1%, we would need to pull about 1,200 samples. While 300 samples is beyond my attention span, 1,200 samples is way beyond my attention span. Someplace in there, the factory needs to consider investing in some automated inspection equipment.

The Squishy Answer

So, how many samples do we need?

That's kind of a personal question... personal in that it requires a bit more knowledge. If the production plant is pretty darn lousy -- let's say only 20% of the product within tolerance -- then you don't need many samples to establish the foregone conclusion. Probably more than 3 samples, but the writing is on the wall before 30 samples have been tested. Similarly, if the plant is stellar, and produces product that is in tolerance 99% of the time, then you won't need a whole lot of samples to statistically prove that at least 68% are within tolerance.

Then again, if you actually knew that the plant was producing 20% or 99% of the product in tolerance, then you wouldn't need to do any sampling, anyway. The only reason we are doing sampling is because we don't know.

The question gets a little squishy as you get close to the requested percentage. If your plant is consistently producing 68.1% of the product in tolerance, you would need to do a great deal of sampling to prove to a statistician that the plant was actually meeting the 68% in tolerance quota.

So... you kinda have to consider all possibilities. Go in without any expectations about the goodness of the production plant. Assume that the actual compliance rate could be anything.

The Moral of the Story 

If I start with the assumption that the production run could produce anywhere between 0% and 100% of the product in tolerance, and that each of these is equally likely, then if I take around 1,200 samples, I have about a 99% chance of correctly determining if 68% of the run is in tolerance. 

If you find yourself balking at that amount of hand checking, then it's high time you looked into some automated testing.