
Give Your Statistician Colleague Iris Bulbs for Their House Warming!

Di Cook
Monash University

Joint Statistics Meetings
July 30, 2019






Image credit: Iris Virginica, Wikimedia Commons

1 / 39

Where did this data come from?

The iris data was introduced to the data science community in Fisher (1936), to illustrate his new method, "Fisher's linear discriminant".

An era when the entire data table could be written on a single page.

3 / 39

Where did this data come from?

"Table I shows measurements of the flowers of fifty plants each of the two species Iris setosa and I. versicolor, found growing together in the same colony and measured by Dr E. Anderson, to whom I am indebted for the use of the data." Fisher, 1936

What about the third species?

4 / 39

The original source

Edgar Anderson (1935) "The irises of the Gaspe Peninsula" Bulletin of the American Iris Society, 59, 2-5.





Original cannot be found. Bulletin of the American Iris Society


...

5 / 39

Anderson had been intrigued by I. versicolor and I. virginica for almost a decade.

Anderson (1928) The Problem of Species in the Northern Blue Flags. Iris versicolor and Iris virginica.

Anderson (1931) Internal Factors Affecting Discontinuity Between Species.

"Although well provided with distinguishing characteristics, Iris versicolor and Iris virginica seem to be under a special curse so far as their recognition in the herbarium is concerned." Anderson, 1936 The Species Problem in Iris


More than 50 pages long!

6 / 39

The motivation

"Although well provided with distinguishing characteristics, Iris versicolor and Iris virginica seem to be under a special curse so far as their recognition in the herbarium is concerned." Anderson, 1936 The Species Problem in Iris

"The sample of the third species given in Table I, Iris virginica, differs from the two other samples in not being taken from the same natural colony as they were." Fisher, 1936

7 / 39

Outline

  • Where did this data come from? (we've just covered this)
  • Data:
    • Description - what is a sepal?
    • What was the original task/analysis? Is this the same as how it has been used since?
    • The new iris data using genomics
  • The very amazing visuals from Anderson's original paper
  • What does it mean to use the iris data, and how to get yourself into detox treatment
8 / 39

Would you like framed iris photos?

Stay tuned to take the quiz at the end of the presentation.

9 / 39

Data description

Four variables: Sepal Length and Width, Petal Length and Width

Source: Suruchi Fialoke, October 13, 2016, Classification of Iris Varieties

10 / 39



Irises are weird!



The sepals have grown out of control, and are more spectacular than the petals.

11 / 39

Where to find the three species

Setosa Versicolor Virginica
12 / 39

Fisher's method

  1. Fisher's linear discriminant between setosa and versicolor: $(\bar{x}_{\text{setosa}} - \bar{x}_{\text{versicolor}})^T S_{\text{pooled}}^{-1}$
  2. Project virginica on this vector

On this projection, I. versicolor is 2/3 of the way from I. setosa to I. virginica, which gives evidence for Anderson's claim that I. versicolor is a hybrid of the two.
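
The two steps of Fisher's method can be sketched in code. This is a minimal Python illustration, not the R code behind these slides and not Fisher's own calculation. The function names are hypothetical, and the toy groups are made-up numbers (not Anderson's measurements) placed so that the middle group sits 2/3 of the way along by construction.

```python
def mean_vector(rows):
    """Column means of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pooled_covariance(group_a, group_b):
    """Pooled within-group covariance matrix of two groups."""
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    p = len(mean_a)
    s = [[0.0] * p for _ in range(p)]
    for rows, centre in ((group_a, mean_a), (group_b, mean_b)):
        for row in rows:
            d = [x - m for x, m in zip(row, centre)]
            for i in range(p):
                for j in range(p):
                    s[i][j] += d[i] * d[j]
    dof = len(group_a) + len(group_b) - 2
    return [[v / dof for v in r] for r in s]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fisher_direction(group_a, group_b):
    """Step 1: the discriminant direction S_pooled^{-1} (mean_a - mean_b)."""
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    return solve(pooled_covariance(group_a, group_b), diff)


def mean_score(rows, direction):
    """Mean of a group after projecting each row onto the direction."""
    scores = [sum(x * w for x, w in zip(row, direction)) for row in rows]
    return sum(scores) / len(scores)


def relative_position(group_a, group_b, group_c):
    """Step 2: where group_b's mean sits between group_a (0) and group_c (1)."""
    w = fisher_direction(group_a, group_b)
    a, b, c = (mean_score(g, w) for g in (group_a, group_b, group_c))
    return (b - a) / (c - a)


# Made-up toy groups: the same small cross shape around three centres.
shape = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def shifted(cx, cy):
    return [[cx + dx, cy + dy] for dx, dy in shape]

toy_setosa = shifted(0.0, 0.0)
toy_versicolor = shifted(2.0, 1.0)
toy_virginica = shifted(3.0, 1.5)

position = relative_position(toy_setosa, toy_versicolor, toy_virginica)
print(round(position, 3))  # about 0.667, by construction of the toy data
```

With real measurements the same functions would take the setosa, versicolor and virginica rows as the three groups.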

14 / 39

Do it: Fisher linear discriminant

Coefficients of the linear discriminant (setosa and versicolor only) on standardised¹ measurements.

##                     LD1
## Sepal.Length -0.1928034
## Sepal.Width  -0.8492086
## Petal.Length  3.1053092
## Petal.Width   1.7156500

Petal length and width do most of the work in separating the (two) species.

¹ Standardising the variables allows direct interpretation of importance from the coefficients of the linear combination.
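
To make the footnote concrete, here is a minimal Python sketch of standardising: each column is rescaled to mean 0 and standard deviation 1, so a one-unit change means "one standard deviation" for every variable and the coefficients can be compared directly. The function name `standardise` and the toy rows are assumptions for illustration only, not Anderson's measurements.

```python
from statistics import mean, stdev


def standardise(rows):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    centres = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, centres, spreads)]
        for row in rows
    ]


# Made-up toy rows: two variables on very different scales.
toy = [[10.0, 0.1], [20.0, 0.2], [30.0, 0.3]]
scaled = standardise(toy)
```

After this step both toy columns carry the same values even though the raw numbers differ by a factor of 100.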

15 / 39

16 / 39

🌎 Most contemporary use of the data focuses on building a classification model for all three species.



The task was never to distinguish between all three species!

17 / 39

Anderson's original paper was filled with visual evidence to support his thinking that I. versicolor is a hybrid.

Come down the rabbit hole with me

19 / 39

Heatmap

Very popular plots for multivariate data.

Show the data matrix, with rows and columns sorted by some criterion. Colour each cell according to its numerical value.
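
The recipe can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code that drew the slide's heatmap: rows and columns are ordered by their means (only one possible criterion; hierarchical clustering is a common alternative), and a hypothetical ten-character shade scale stands in for colour. The toy matrix is made up.

```python
SHADES = " .:-=+*#%@"  # light to dark; a hypothetical 10-level "colour" scale


def order_by_mean(rows):
    """Sort rows, then columns, by their mean value."""
    rows = sorted(rows, key=lambda r: sum(r) / len(r))
    columns = sorted(zip(*rows), key=lambda c: sum(c) / len(c))
    return [list(r) for r in zip(*columns)]


def text_heatmap(rows):
    """Render a matrix as text, one shade character per cell."""
    ordered = order_by_mean(rows)
    lo = min(min(r) for r in ordered)
    hi = max(max(r) for r in ordered)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant matrix
    lines = []
    for row in ordered:
        levels = [round((x - lo) / span * (len(SHADES) - 1)) for x in row]
        lines.append("".join(SHADES[k] for k in levels))
    return lines


# Made-up toy matrix: three observations, three variables.
toy = [[9.0, 1.0, 5.0], [3.0, 0.0, 2.0], [6.0, 1.0, 4.0]]
heatmap_lines = text_heatmap(toy)
print("\n".join(heatmap_lines))
```

This sketch shades on one global scale. With variables on different scales, standardising each column first would be the more usual choice.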



Do you see three groups?

20 / 39

Scatterplot matrix

Plot all pairs of variables, coloured by species.

There is a strong linear association between petal length and width, and the three species cluster separately along the association.

Is it possible that only petal size matters to distinguish the species?

Maybe the Fisher linear discriminant is purely a linear combination of these two measurements.

21 / 39

Parallel coordinate plot (*)

22 / 39

Icon (glyph) plot (*)

One observation (one row) = one icon

Variable values mapped to different features.
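
As a sketch of the mapping, the Python below turns each observation into the vertices of a star glyph: every variable is rescaled to [0, 1] and becomes the length of one ray, with rays at equal angles. The function names and the toy rows are assumptions for illustration, and joining the vertices into an outline is left to whatever plotting tool is used.

```python
import math


def rescale_columns(rows):
    """Rescale each column to the range [0, 1] so rays are comparable."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [
        [(x - lo) / s for x, lo, s in zip(row, lows, spans)]
        for row in rows
    ]


def star_glyph(unit_row):
    """Vertices (x, y) of one glyph: one ray per variable, at equal angles."""
    p = len(unit_row)
    return [
        (v * math.cos(2 * math.pi * k / p), v * math.sin(2 * math.pi * k / p))
        for k, v in enumerate(unit_row)
    ]


# Made-up toy rows: two observations, four variables.
toy = [[1.0, 4.0, 2.0, 0.5], [3.0, 2.0, 6.0, 1.5]]
glyphs = [star_glyph(row) for row in rescale_columns(toy)]
```

Each entry of `glyphs` is one icon, so the list has one element per row of the data.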

23 / 39

Star glyphs

Chernoff faces

1-height of face, 2-width of face, 3-shape of face, 4-height of mouth, 5-width of mouth, 6-curve of smile, 7-height of eyes, 8-width of eyes, 9-height of hair, 10-width of hair, 11-styling of hair, 12-height of nose, 13-width of nose, 14-width of ears, 15-height of ears

24 / 39

Tour

A movie of low-dimensional projections constructed in such a way that it comes arbitrarily close to showing all possible low-dimensional projections.
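
One ingredient of a tour can be sketched in Python: drawing a random orthonormal pair of directions and projecting the data onto it to get one 2D frame. This is only a rough illustration with hypothetical function names. A real tour interpolates smoothly between such frames rather than jumping from one independent draw to the next, and that interpolation is not shown here.

```python
import math
import random


def random_frame(p, rng):
    """A random orthonormal pair of p-vectors, built by Gram-Schmidt."""
    u = [rng.gauss(0, 1) for _ in range(p)]
    v = [rng.gauss(0, 1) for _ in range(p)]
    norm_u = math.sqrt(sum(x * x for x in u))
    u = [x / norm_u for x in u]
    # Remove the part of v that points along u, then normalise.
    dot = sum(a * b for a, b in zip(u, v))
    v = [b - dot * a for a, b in zip(u, v)]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    return u, v


def project(rows, frame):
    """Project each p-dimensional row onto the 2D frame."""
    u, v = frame
    return [
        (sum(x * a for x, a in zip(row, u)), sum(x * b for x, b in zip(row, v)))
        for row in rows
    ]


rng = random.Random(2019)  # fixed seed so the sketch is repeatable
frames = [random_frame(4, rng) for _ in range(3)]

# Made-up toy rows with four variables, projected onto the first frame.
toy = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
first_view = project(toy, frames[0])
```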

25 / 39
26 / 39

How did Anderson plot the data to support his thinking?

27 / 39
28 / 39

It's not a parallel coordinate plot.



It's an icon (glyph) plot, with lines connecting variable values.

Anderson called them ideographs.



29 / 39

30 / 39

31 / 39

People have been using this data to demonstrate supervised classification (and clustering and dimension reduction) for more than 80 years!

Why has it stuck so long?

  • pretty, and clean
  • got a challenge (that's not actually a challenge!)
  • as close to simulated data as you can get

Small but not trivial. Simple but challenging. Real data. Fisher's reputation, although it's not his data. Tradition. Inertia. Continuity. You can find flower pictures to spell it out. – Nick Cox Nov 6 '13 at 19:07

32 / 39

Other similar (abused) old data sets for classification/clustering

  • Italian olive oils: Sicily overlaps with other regions, overlapping classes
  • Australian crabs: strong positive correlation, heteroskedastic variance, overlapping classes
  • Pima Indian diabetes: overlapping classes
  • Hand-written digits: lots of classes, lots of data, overlapping classes
33 / 39

New data science

Fresh and local data, where you can. For me, this is:

  • Pedestrian sensors
  • Election and census
  • Local sea levels
  • Local temperatures
  • Tennis Australia

34 / 39

Use the iris data in your statistics and data science classes if you want to look like you are not qualified to teach the class

35 / 39

Use the iris data in your publications if you want to look like your method isn't useful

36 / 39

Thanks for listening!

Trivia quiz on the Iris data is here: http://bit.ly/JSMiris2019

37 / 39

Resources

38 / 39

Thanks

Slides created via the R package xaringan, with iris theme created from xaringanthemer.

The chakra comes from remark.js, knitr, and R Markdown.

Slides are available at https://dicook.org/files/JSM19/slides.html and supporting files at https://github.com/dicook/JSM19.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

39 / 39