Editor's note: Our "Words from Wiechers" series considers the lessons our industry can learn from the late Johann Wiechers, Ph.D. He was an advisor, colleague and insightful leader in the industry until his unexpected passing. Presenting Wiechers's insights is Tony O'Lenick.
This month, we review Chapter 12 of Wiechers's book, Memories of a Cosmetically Disturbed Mind. Here, he adapts a quote attributed to Benjamin Disraeli, the 19th-century British Prime Minister, who stated, "There are three kinds of lies: lies, damned lies, and statistics," to relate it to our industry. Wiechers writes, “There is quite a bit of nonsense going on in this field (cosmetics), lending some truth to that well-known statement, 'Lies, damn lies and statistics.'"
Wiechers says: "Most scales in (cosmetic) science are numerical scales, which helps if you want to do statistics. We diligently calculate means and standard deviations and do t-tests to see whether our comparisons are statistically significantly different. But let’s be careful. Problems can arise when we think we are dealing with a numerical scale, whilst in reality, it is an ordinal scale.
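To make the numerical-versus-ordinal distinction concrete, here is a minimal Python sketch. The two score lists and the alternative spacing of the grades are invented purely for illustration; they are not data from Wiechers's book. Under those assumptions, it shows that a comparison of means can reverse when the grades of an ordinal scale are respaced, while a rank-based summary such as the median keeps the same ordering.

```python
from statistics import mean, median

# Hypothetical visual grades on a 0-to-3 scale for two products (invented data).
product_a = [2, 2, 2, 2, 2]
product_b = [0, 0, 3, 3, 3]

# Hypothetical respacing: same rank order of grades, different distances.
# On an ordinal scale nothing tells us which spacing is the "true" one.
respaced = {0: 0, 1: 1, 2: 4, 3: 9}


def summarize(scores):
    """Return (mean, median) for a list of scores."""
    return mean(scores), median(scores)


raw_a = summarize(product_a)
raw_b = summarize(product_b)
new_a = summarize([respaced[s] for s in product_a])
new_b = summarize([respaced[s] for s in product_b])

# Does product A score higher than product B under each summary?
mean_favors_a_raw = raw_a[0] > raw_b[0]
mean_favors_a_respaced = new_a[0] > new_b[0]
median_favors_a_raw = raw_a[1] > raw_b[1]
median_favors_a_respaced = new_a[1] > new_b[1]

print("mean favors A (raw, respaced):  ", mean_favors_a_raw, mean_favors_a_respaced)
print("median favors A (raw, respaced):", median_favors_a_raw, median_favors_a_respaced)
```

With these made-up numbers, the mean favors product A on the raw grades and product B after respacing, whereas the median favors product B either way. This is one reason rank-based methods are often suggested when the distances between grades are unknown.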
"There is quite a bit of nonsense going on in this field, lending some truth to that well-known statement, 'Lies, damn lies and statistics.' Not because statistics can give you any answer that you would like to have, but because there are so many different statistical techniques that it is often difficult to identify which one you should use. One of the most critical things to consider is which type of scale you are dealing with. Let’s investigate a few things that are often forgotten.
"We often automatically assume the distance between two points on different portions of the scale to be similar; i.e., the distance between 1 and 2 is the same as that between 3 and 4. For instance, visual assessment of skin lightening is done by comparing the skin color of two arms. If they have the same color, the score is 0, but if the left arm is somewhat lighter than the right arm, then the score is 1.
"If the difference is clearly visible, the score is 2, and when there is an extreme difference, the score is 3. But is the difference between 0 and 1 the same as between 1 and 2? I bet you that it is not, and that is exactly the reason why it is more difficult to measure significant differences from visual than from instrumental assessments, where there is no judgmental element. Other similar examples include scales for irritancy, axillary malodor, acne (via comparison with pictures) or dandruff."
As a case in point, he highlights the dandruff scale. "The dandruff scale is a very beautiful example of a weird and illogical scale. In quantifying the degree of dandruff, the scalp is subdivided into four quadrants, two per half head, each half receiving another product. Each of these quadrants is separately judged for the incidence and severity of the dandruff. The incidence can vary from 0 (< 10% affected) to 4 (> 70% affected), whereas the severity, judging both the size of the scales and the degree to which the scales are attached to the scalp, can vary from 1 till 5.
"Subsequently, the two scores obtained for the same quadrant are multiplied and can thus range between 0 and 20 for a single quadrant. The scores for the two quadrants that make up the half-head treated with the same product are subsequently summed, and therefore range from 0 to 40.
"This all sounds very logical, so why do I think this is probably the weirdest scale ever invented? First of all, how do you get to the number 39? This can only be 20 + 19, or 19 + 20. But how do you make 19, a prime number, out of the multiplication of two small numbers? Indeed, 39 is impossible on this scale. Similarly, 38 is only 19 + 19; or 20 + 18; or 18 + 20.
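The scoring rule described above (incidence 0 to 4 multiplied by severity 1 to 5 per quadrant, then two quadrants summed per half-head) can be enumerated directly. The following is a minimal Python sketch, not Wiechers's own calculation. The function names are hypothetical, and the "number of ways" counts rest on the added assumption that every incidence and severity combination is equally likely, which a real panel would not guarantee.

```python
from collections import Counter
from itertools import product

INCIDENCE = range(0, 5)  # 0..4, as described in the text
SEVERITY = range(1, 6)   # 1..5, as described in the text


def quadrant_scores():
    """Count how many (incidence, severity) pairs give each quadrant score."""
    return Counter(i * s for i, s in product(INCIDENCE, SEVERITY))


def half_head_scores():
    """Count how many ordered quadrant pairs give each half-head total."""
    per_quadrant = quadrant_scores()
    totals = Counter()
    for (a, ways_a), (b, ways_b) in product(per_quadrant.items(), repeat=2):
        totals[a + b] += ways_a * ways_b
    return totals


quadrant = quadrant_scores()
half_head = half_head_scores()

# Scores inside the nominal range that can never occur.
missing_quadrant = [n for n in range(21) if n not in quadrant]
missing_half_head = [n for n in range(41) if n not in half_head]

print("Unattainable quadrant scores: ", missing_quadrant)
print("Unattainable half-head scores:", missing_half_head)
print("Ways to reach each half-head score:", sorted(half_head.items()))
```

Under these assumptions, the enumeration agrees with the text: 18 and 19 cannot occur for a single quadrant, and 38 and 39 cannot occur for a half-head. The counts also show that some totals can be reached in far more ways than others.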
"We just concluded that 19 is impossible, but the same also applies for 18. It can only be formed from 2 x 9 and 3 x 6; but 6 and 9 are off the scale. It will not amaze you that, whilst fighting off a jetlag, I have worked out all numbers that could and could not be made on the dandruff scale. Try it and, if you take into account the fact that certain scores can be made in many more ways than other scores (and therefore have a higher probability of being the score of the dandruff investigation), you will probably see why I think that the dandruff scale is a weird scale. This scale is not only non-equidistant; certain distances do not even exist.
"So, how do you take this into account when doing the statistics on dandruff data? The only way to overcome such irregularities is the law of the big numbers; [i.e.], if you have enough subjects in your trials, then the influence of scale irregularities diminishes but the costs of your study rockets. Would anybody (including you) notice that you potentially invalidate your anti-dandruff study by using a smaller panel to reduce costs? How do you balance your desire for scientific integrity and using the smallest possible panel size to substantiate your cosmetic claim?
'Pretty Bloody Obvious (PBO)' Test
"Lies, damn lies and statistics. You’ve heard it all before, so what’s new? Only that it is not true but often simply based on an unawareness that many cosmetic scientists would rather like to keep that way. By far, the most powerful statistical test in cosmetics that you will never see quoted is the PBO test, the Pretty Bloody Obvious test.
"If you see that two products have two sets of scores, one all around 1,000 and the other all around 1, then you do not need any statistics to identify whether there is any statistical difference between the two products. You will never have to have any discussion on which statistical test should be used or whether the scale was equidistant. The PBO test is enough.
"We do need statistics when the difference is far less obvious, even to the extent that we need to rely on large population samples to be able to show the difference. Unfortunately, the effective differences between our products or the improvements in our products have become that small that the question of the appropriate statistics becomes very important."
Wiechers, as always, clearly points out how absurd some of our assumptions become when the scales we use are not appropriate to what we are trying to measure. I had a statistics professor who used to say, “Your noise is larger than what you are trying to measure.” Perhaps Johann would agree that it should be, “Your noise is larger than the claim you are attempting to support.”
We are scientists, yet we accept claims made without data or based on unverified methods. We seem willing to accept new “buzzwords” that claim properties that are as yet undefined. We must be more skeptical of data and methods that rely on dubious scales and, even worse, of claims that are simply stated and never supported by data from a proven test methodology. Unfortunately, what Johann observed years ago is still a problem in our industry.
Dubious claims support the contention of Mark Twain, who was thought to have said, “Figures don’t lie, but liars sure figure,” and I can point to the PBO test to support it.