Portfolio Segmentation

Beware of Simpson's Paradox, and Be Careful

Al R. Vilcius

Risk management practitioners generally claim that segmentation is necessary to sharpen the results obtained from analytic tools. However, segmentation needs to be done very carefully to avoid erroneous results that can easily lead to mistaken and inappropriate actions. This will be shown here, first with a couple of simple examples, then with a little theory.

Consider the following (hypothetical) statistical study:

A 1980 study of the difference in smoking habits between men and women showed a higher proportion of women smokers.

The study was repeated in 1990.  It was thought that women were quitting faster, so a larger sample of women was taken to prove this point.

Here are the findings: 

study of smokers

                    1980                          1990
          sample   smokers   ratio      sample   smokers   ratio
Men        8,000       800     10%       5,000       450      9%
Women      3,000       450     15%      10,000     1,400     14%
People    11,000     1,250   11.4%      15,000     1,850   12.3%

On the basis of these results, the study revealed that men and women were quitting at a similar rate, with the ratio of smokers reduced for both men and women, but the ratio of people smoking overall had increased.

Conclusion: both men and women are smoking less but people are smoking more!!
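
The arithmetic behind this conclusion can be checked directly. The following is a minimal sketch in Python, using only the counts from the table above; the names `smokers`, `ratio` and `overall` are illustrative choices, not part of the original study.

```python
# Counts copied from the "study of smokers" table: (sample size, smokers).
smokers = {
    1980: {"Men": (8_000, 800), "Women": (3_000, 450)},
    1990: {"Men": (5_000, 450), "Women": (10_000, 1_400)},
}


def ratio(sample, count):
    """Proportion of one sample that smokes."""
    return count / sample


def overall(groups):
    """Pool the groups: total smokers divided by total sample size."""
    total_sample = sum(sample for sample, _ in groups.values())
    total_count = sum(count for _, count in groups.values())
    return total_count / total_sample


# Segment view: each group's ratio in 1980 and in 1990.
for group in ("Men", "Women"):
    r80 = ratio(*smokers[1980][group])
    r90 = ratio(*smokers[1990][group])
    print(f"{group:6s} {r80:.1%} -> {r90:.1%}")

# Aggregate view: the pooled ratio for all people.
o80, o90 = overall(smokers[1980]), overall(smokers[1990])
print(f"People {o80:.1%} -> {o90:.1%}")
```

Both segment ratios fall while the pooled ratio rises, which is exactly the reversal described above.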

The same study also asked a question about drinking habits.

These results were as follows: 

study of drinkers

                    1980                          1990
          sample  drinkers   ratio      sample  drinkers   ratio
Men        8,000     2,560     32%       5,000     1,650     33%
Women      3,000       540     18%      10,000     2,100     21%
People    11,000     3,100     28%      15,000     3,750     25%

The bonus question showed that the ratio of drinkers had increased for both men and women, with a surprisingly sharp increase for women, but that in total, the ratio of people drinking had actually decreased.

Conclusion: both men and women are drinking proportionately more, but people are generally drinking less!

So there you have it - them's the facts!

This curious reversal is a hazard in all demographic studies.  It is called Simpson's paradox.

The problem also occurs in medical studies, where conflicting "results" are often obtained from clinical tests that are not properly controlled.  But that's not all.  There are also many occurrences in finance.  Here is another example to consider:

Wagner [1] used income tax numbers in his example to illustrate Simpson’s paradox. Here I have restated it (using his numbers) in terms of securities, simply as a matter of personal taste and for extra clarity and dramatic effect.

Balanced Securities Portfolio

                                          Last Year                               This Year
securities                 Average M-t-M         Return   Yield    Average M-t-M         Return   Yield
Sovereign Bonds              $41,651,643     $2,244,467    5.4%      $19,879,622       $689,318    3.5%
AAA Corp Bonds              $146,400,740    $13,646,348    9.3%     $122,853,351     $8,819,461    7.2%
Retail Securitizations      $192,688,922    $21,449,597   11.1%     $171,858,024    $17,155,758   10.0%
Large Cap Listed Equities   $470,010,790    $75,038,230   16.0%     $865,037,814   $137,860,951   15.9%
Derivative enhanced OTC      $29,427,152    $11,311,672   38.4%      $62,806,159    $24,051,698   38.3%
TOTAL                       $880,179,247   $123,690,314   14.1%   $1,242,434,970   $188,577,186   15.2%

This example shows a lower yield in EACH category this year, but the overall yield is higher - check it out for yourself!
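
One way to check it is to recompute every yield from the table's own figures. The sketch below is only an illustration in Python; the dictionaries and the helper names `totals`, `report` and `total_yield` are assumptions made for this example, and the figures are copied from the table above.

```python
# (average mark-to-market value, return) per category, copied from the table.
last_year = {
    "Sovereign Bonds": (41_651_643, 2_244_467),
    "AAA Corp Bonds": (146_400_740, 13_646_348),
    "Retail Securitizations": (192_688_922, 21_449_597),
    "Large Cap Listed Equities": (470_010_790, 75_038_230),
    "Derivative enhanced OTC": (29_427_152, 11_311_672),
}
this_year = {
    "Sovereign Bonds": (19_879_622, 689_318),
    "AAA Corp Bonds": (122_853_351, 8_819_461),
    "Retail Securitizations": (171_858_024, 17_155_758),
    "Large Cap Listed Equities": (865_037_814, 137_860_951),
    "Derivative enhanced OTC": (62_806_159, 24_051_698),
}


def totals(book):
    """Total value and total return across all categories."""
    value = sum(v for v, _ in book.values())
    ret = sum(r for _, r in book.values())
    return value, ret


def report(book):
    """Map each category to (share of portfolio value, yield)."""
    total_value, _ = totals(book)
    return {name: (v / total_value, r / v) for name, (v, r) in book.items()}


def total_yield(book):
    """Portfolio yield: total return divided by total value."""
    value, ret = totals(book)
    return ret / value


before, after = report(last_year), report(this_year)
for name in last_year:
    (w0, y0), (w1, y1) = before[name], after[name]
    print(f"{name:26s} weight {w0:6.1%} -> {w1:6.1%}   yield {y0:6.2%} -> {y1:6.2%}")
print(f"{'TOTAL':26s} yield {total_yield(last_year):.2%} -> {total_yield(this_year):.2%}")
```

Printing the weights next to the yields shows how much of the portfolio moved between categories from one year to the next, which is worth looking at before trusting the TOTAL line.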

Here is the caveat to risk management portfolio analysis:

Portfolio segmentation into parallel descriptive categories can show improvement or deterioration in each segment while showing exactly the opposite overall.

As Wagner [1] points out: "This is not a contrived pedagogical example" - it does occur frequently in practice.  This example can be used in many contexts: in 1991 I used it in the context of credit quality analysis for corporate loan portfolio scoring.  The relevance to electronic commerce behavioural analysis should be obvious; the danger is that Simpson's Paradox can lead observers to "see" patterns in the data that do not exist.

Now here is a little bit of theory, as promised.

Simpson's Paradox (also known as the Yule-Simpson effect) is not really a paradox: it occurs when an apparently significant conclusion based upon a sample of data is reversed in each subsample once the sample is split according to a "lurking variable".

In order to avoid an overly abstract discussion, my explanation here uses the first table (study of smokers) above.

The answer to the apparent paradox is that the proportions of the table are weighted averages, not simple averages.

Pr(smoker)=  Pr(smoker and man)  +  Pr(smoker and woman)
                 =  Pr(smoker | man)Pr(man) + Pr(smoker | woman) Pr(woman)

which is obtained simply by substituting the definition of conditional probability (given below).  Now it is clear that the distribution (i.e. ratio) of man:woman is the "lurking variable" mentioned above.

The apparent "cheat" is that the frequency ratio man:woman, i.e. Pr(man) and Pr(woman), changed between the 1980 sample and the 1990 sample, so that essentially different weighted averages were calculated for the population of people.

             1980   1990
Pr(man)       73%    33%
Pr(woman)     27%    67%
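
The role of the weights can be made concrete with a short calculation. The sketch below, in Python with exact fractions, rebuilds the overall smoking ratio as a weighted average and then, as a purely hypothetical counterfactual that is not part of the original study, applies the 1990 smoking rates to the 1980 mix of men and women. The names `counts`, `mix`, `rates` and `weighted_average` are illustrative.

```python
from fractions import Fraction

# (sample size, smokers) per group, copied from the "study of smokers" table.
counts = {
    1980: {"man": (8_000, 800), "woman": (3_000, 450)},
    1990: {"man": (5_000, 450), "woman": (10_000, 1_400)},
}


def mix(groups):
    """Pr(group): each group's share of the total sample."""
    total = sum(n for n, _ in groups.values())
    return {g: Fraction(n, total) for g, (n, _) in groups.items()}


def rates(groups):
    """Pr(smoker | group): smokers divided by sample size within each group."""
    return {g: Fraction(k, n) for g, (n, k) in groups.items()}


def weighted_average(rate, weight):
    """Sum over groups of Pr(smoker | group) * Pr(group)."""
    return sum(rate[g] * weight[g] for g in rate)


# Overall ratio per year, rebuilt from the conditional rates and the mix.
actual = {year: weighted_average(rates(c), mix(c)) for year, c in counts.items()}

# Hypothetical counterfactual: 1990 smoking rates applied to the 1980 mix.
fixed_mix_1990 = weighted_average(rates(counts[1990]), mix(counts[1980]))

for year, value in actual.items():
    print(f"{year} overall: {float(value):.1%}")
print(f"1990 rates at the 1980 mix: {float(fixed_mix_1990):.1%}")
```

With the mix held fixed the overall ratio falls along with both segment ratios, so the reversal in the table comes entirely from the change in Pr(man) and Pr(woman).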

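In general, suppose a of b men and c of d women have the property in question, so that the two segment ratios are a/b and c/d. The pooled ratio is

(a + c) / (b + d)  =  (a/b) [b / (b + d)]  +  (c/d) [d / (b + d)]

Then the expression for these fractions is identical to the one for conditional probabilities above, with b/(b + d) and d/(b + d) playing the roles of Pr(man) and Pr(woman).  Notice that if b = d (which would mean that the number of men is the same as the number of women in the example), then both weights equal 1/2 and we just have a simple average, which is what most people seem to have in mind when looking at the sample tables in the above examples.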

Proportions within descriptive subclasses of non-constant populations are a matter of conditional probability, which is defined as a ratio:

Pr(A | B)=Pr(A and B) / Pr(B)

When this is applied to 1980 observations on smokers, we see:

Pr(smoker | woman)=Pr(smoker and woman) / Pr(woman)

Assuming that people are either men or women, not both:

Pr(man) + Pr(woman)=Pr(man or woman)=1

Pr(smoker) + Pr(non-smoker)=1

Pr(smoker)=Pr(smoker | man)Pr(man) + Pr(smoker | woman) Pr(woman)

Then Bayes' theorem gives us the ratio:

Pr(woman | smoker)=Pr(smoker and woman) / Pr(smoker)
 =Pr(smoker | woman) Pr(woman) / [Pr(smoker | man)Pr(man) + Pr(smoker | woman) Pr(woman) ]

from which further contradictions using the little table above can be found.
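
As one illustration of such a calculation, the sketch below evaluates Pr(woman | smoker) from the formula above for both years, using the counts in the study of smokers. It is a minimal Python example; the function name `share_of_smokers_who_are_women` is an assumption made here for readability.

```python
from fractions import Fraction


def share_of_smokers_who_are_women(men, women):
    """Pr(woman | smoker) via Bayes' theorem; each argument is (sample, smokers)."""
    (n_m, k_m), (n_w, k_w) = men, women
    total = n_m + n_w
    pr_man, pr_woman = Fraction(n_m, total), Fraction(n_w, total)
    pr_s_man, pr_s_woman = Fraction(k_m, n_m), Fraction(k_w, n_w)
    # Denominator: Pr(smoker) as the weighted average of the two conditional rates.
    pr_smoker = pr_s_man * pr_man + pr_s_woman * pr_woman
    return pr_s_woman * pr_woman / pr_smoker


p1980 = share_of_smokers_who_are_women((8_000, 800), (3_000, 450))
p1990 = share_of_smokers_who_are_women((5_000, 450), (10_000, 1_400))
print(f"Pr(woman | smoker): 1980 {float(p1980):.1%}, 1990 {float(p1990):.1%}")
```

The share of smokers who are women rises sharply between the two samples even though Pr(smoker | woman) falls, again because Pr(woman) changed.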

This situation also comes up in "regression" applications.   A straightforward regression equation fitted to pooled data can be just plain wrong, with the fitted relationship reversed within each segment, leading to incorrect actions. Caution should be exercised in all finance, economics, medical, and social applications.
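
A small made-up data set shows how this can happen. In the Python sketch below, the two segments and their (x, y) points are entirely hypothetical and chosen only for illustration, and `ols_slope` is a hand-written least-squares slope rather than any particular library's routine.

```python
def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Hypothetical data: within each segment, y falls as x rises.
segment_a = [(1, 5), (2, 4), (3, 3), (4, 2)]
segment_b = [(6, 10), (7, 9), (8, 8), (9, 7)]
pooled = segment_a + segment_b

print("segment A slope:", ols_slope(segment_a))
print("segment B slope:", ols_slope(segment_b))
print("pooled slope:   ", ols_slope(pooled))
```

Each segment has a negative slope, yet the pooled fit slopes upward, because segment B sits at higher values of both x and y; here the segment label plays the part of the lurking variable.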

The bottom line:

This "paradox" of conditional probability deserves caution in practice.


Recommended Further Reading

John L. Casti
"Alternate Realities: Mathematical Models of Nature & Man"
John Wiley & Sons Canada, Limited, 1989 [JC'89]
Hardcover | 493 Pages | ISBN 047161842X


Email: AL.R@VILCIUS.com

Tel / Fax: (905) 854-3342 /-3371
