Image credit: Iris Virginica, Wikimedia Commons
| | |
| Setosa Lady Bird Johnson Wildflower Center | Versicolor Wikimedia Commons | Virginica Wikimedia Commons |
The iris data was introduced to the data science community in Fisher (1936), to illustrate his new method, "Fisher's linear discriminant".
This was an era when the entire data table could be written on a single page.

"Table I shows measurements of the flowers of fifty plants each of the two species Iris setosa and I. versicolor, found growing together in the same colony and measured by Dr E. Anderson, to whom I am indebted for the use of the data." Fisher
What about the third species?

Edgar Anderson (1935) "The irises of the Gaspe Peninsula" Bulletin of the American Iris Society, 59, 2-5.
Original cannot be found.
Bulletin of the American Iris Society

...

Anderson had been intrigued by I. versicolor and I. virginica for almost a decade.
Anderson (1928) The Problem of Species in the Northern Blue Flags. Iris versicolor and Iris virginica.
Anderson (1931) Internal Factors Affecting Discontinuity Between Species.

"Although well provided with distinguishing characteristics, Iris versicolor and Iris virginica seem to be under a special curse so far as their recognition in the herbarium is concerned." Anderson, 1936 The Species Problem in Iris
More than 50 pages long!
"The sample of the third species given in Table I, Iris virginica, differs from the two other samples in not being taken from the same natural colony as they were." Fisher, 1936
Would you like framed iris photos?
Stay tuned to take the quiz at the end of the presentation.
Four variables: Sepal Length and Width, Petal Length and Width

Source: Suruchi Fialoke, October 13, 2016, Classification of Iris Varieties
Irises are weird!
The sepals have grown out of control, and are more spectacular than the petals.
| | |
| Setosa | Versicolor | Virginica |



I. versicolor lies 2/3 of the way from I. setosa to I. virginica, which gives evidence for Anderson's claim that I. versicolor is a hybrid of the two.
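The 2/3 figure can be checked with a little arithmetic. The minimal Python sketch below projects the I. versicolor mean onto the line from the I. setosa mean to the I. virginica mean and reports how far along it falls. It assumes the species means from the commonly distributed copy of the data and plain Euclidean geometry on the raw measurements; the figure quoted above may instead rest on discriminant scores. `fraction_along` is a name invented for this illustration.

```python
# Species means (cm) of sepal length, sepal width, petal length, petal width,
# from the commonly distributed iris data (50 plants per species).
MEANS = {
    "setosa":     [5.006, 3.428, 1.462, 0.246],
    "versicolor": [5.936, 2.770, 4.260, 1.326],
    "virginica":  [6.588, 2.974, 5.552, 2.026],
}

def fraction_along(start, end, point):
    """Position of `point` projected onto the line start -> end (0 = start, 1 = end)."""
    direction = [e - s for s, e in zip(start, end)]
    offset = [p - s for s, p in zip(start, point)]
    dot = sum(d * o for d, o in zip(direction, offset))
    length_sq = sum(d * d for d in direction)
    return dot / length_sq

t = fraction_along(MEANS["setosa"], MEANS["virginica"], MEANS["versicolor"])
print(f"versicolor sits at {t:.2f} of the way from setosa to virginica")
```

Under these assumptions the value comes out close to two thirds.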
Coefficients of the linear discriminant (setosa and versicolor only) on standardised¹ measurements.
| Variable | LD1 |
| Sepal.Length | -0.1928034 |
| Sepal.Width | -0.8492086 |
| Petal.Length | 3.1053092 |
| Petal.Width | 1.7156500 |
Petal length and width do most of the work in separating the (two) species.
¹ Standardising the variables allows direct interpretation of importance from the coefficients of the linear combination.
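To make the recipe concrete, here is a minimal Python sketch of a two-class Fisher discriminant on standardised variables: standardise each column, pool the within-species covariance, and solve for the coefficient vector. It is not the code behind the slide. It uses only the first five plants of each species from the commonly distributed data, so its coefficients will differ from those shown, and the scaling convention (within-group variance of the scores equal to 1) is an assumption. All function names are invented here.

```python
from math import sqrt

def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def standardise(rows):
    """Centre every column at 0 and scale it to sample standard deviation 1."""
    means = column_means(rows)
    sds = [sqrt(sum((x - m) ** 2 for x in col) / (len(rows) - 1))
           for col, m in zip(zip(*rows), means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def pooled_covariance(groups):
    """Within-group covariance matrix pooled over all groups."""
    p = len(groups[0][0])
    total = [[0.0] * p for _ in range(p)]
    for rows in groups:
        means = column_means(rows)
        for row in rows:
            dev = [x - m for x, m in zip(row, means)]
            for i in range(p):
                for j in range(p):
                    total[i][j] += dev[i] * dev[j]
    dof = sum(len(rows) - 1 for rows in groups)
    return [[v / dof for v in line] for line in total]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fisher_direction(group_a, group_b):
    """Coefficients w solving S_w w = mean_b - mean_a, scaled so w' S_w w = 1."""
    s_w = pooled_covariance([group_a, group_b])
    diff = [mb - ma for ma, mb in zip(column_means(group_a), column_means(group_b))]
    w = solve(s_w, diff)
    scale = sqrt(sum(wi * di for wi, di in zip(w, diff)))
    return [wi / scale for wi in w]

# First five plants of each species (sepal length, sepal width, petal length,
# petal width, in cm); the slide uses all 50 plants per species.
setosa = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
          [4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2]]
versicolor = [[7.0, 3.2, 4.7, 1.4], [6.4, 3.2, 4.5, 1.5], [6.9, 3.1, 4.9, 1.5],
              [5.5, 2.3, 4.0, 1.3], [6.5, 2.8, 4.6, 1.5]]

scaled = standardise(setosa + versicolor)
coefficients = fisher_direction(scaled[:5], scaled[5:])
names = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]
for name, value in zip(names, coefficients):
    print(f"{name:13s}{value: .3f}")
```

Because every column has been put on the same scale first, the relative sizes of the printed coefficients can be compared directly, which is the point of the footnote.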

🌎 Most contemporary use of the data focuses on building a classification model for all three species.
The task was never to distinguish between all three species!

Genomic analysis confirms Anderson's suspicions!

Come down the rabbit hole with me
Very popular plots for multivariate data.
Show the data matrix, with rows and columns sorted by some criterion. Colour each cell according to its numerical value.
Do you see three groups?
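The mechanics of such a display can be sketched in a few lines of Python. This is a simplified illustration, not the plotting code behind the slide: it rescales each column to [0, 1], sorts only the rows (by the sum of their rescaled values, an arbitrary choice made here), and maps each cell to a grey level. The six plants are the first two of each species in the commonly distributed data, and the function names are invented.

```python
def rescale_columns(rows):
    """Min-max rescale each column to [0, 1] so all variables share one colour scale."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    return [[(x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs)]
            for row in rows]

def grey(value):
    """Hex colour for a value in [0, 1]: 0 is white, 1 is black."""
    level = round(255 * (1 - value))
    return f"#{level:02x}{level:02x}{level:02x}"

def heatmap_cells(rows, row_key=sum):
    """Rescale columns, sort rows by `row_key`, and colour every cell."""
    ordered = sorted(rescale_columns(rows), key=row_key)
    return [[grey(v) for v in row] for row in ordered]

# Two plants per species (cm), deliberately listed out of species order.
plants = [[6.3, 3.3, 6.0, 2.5], [5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4],
          [4.9, 3.0, 1.4, 0.2], [5.8, 2.7, 5.1, 1.9], [6.4, 3.2, 4.5, 1.5]]
for cells in heatmap_cells(plants):
    print(" ".join(cells))
```

Whether groups show up depends heavily on the sorting criterion, which is why the question above is worth asking of any heatmap.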

Plot all pairs of variables, coloured by species.
There is a strong linear association between petal length and width, and the three species cluster separately along the association.
Is it possible that only petal size matters to distinguish the species?
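A small Python sketch of what the scatterplot matrix enumerates, and of the petal association it reveals. It lists one panel per pair of variables and computes the Pearson correlation between petal length and width, using only the first five plants of each species from the commonly distributed data, so the number is indicative rather than the full-data value.

```python
from itertools import combinations
from math import sqrt

VARIABLES = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]

# One panel per unordered pair of variables: 4 variables give 6 panels.
panels = list(combinations(VARIABLES, 2))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Petal length and width (cm): five setosa, five versicolor, five virginica.
petal_length = [1.4, 1.4, 1.3, 1.5, 1.4, 4.7, 4.5, 4.9, 4.0, 4.6,
                6.0, 5.1, 5.9, 5.6, 5.8]
petal_width = [0.2, 0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 1.5,
               2.5, 1.9, 2.1, 1.8, 2.2]

print(panels)
print(f"petal length vs width: r = {pearson(petal_length, petal_width):.2f}")
```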

Maybe the Fisher linear discriminant is purely a linear combination of these two measurements.

One observation (one row) = one icon
Variable values mapped to different features.
Star glyphs

Star glyphs
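How a star glyph maps values to features can be sketched as follows: each variable gets a ray at an equally spaced angle, and the ray length is the value rescaled to [0, 1]. This Python sketch computes only the vertex coordinates and leaves the drawing out. The ranges are those of the four measurements in the commonly distributed data, and `star_vertices` is a name invented here.

```python
from math import cos, sin, pi

def star_vertices(values, lows, highs):
    """Vertices of a star glyph: one ray per variable at equally spaced angles,
    with ray length equal to the variable's value rescaled to [0, 1]."""
    k = len(values)
    points = []
    for i, (v, lo, hi) in enumerate(zip(values, lows, highs)):
        length = (v - lo) / (hi - lo)
        angle = 2 * pi * i / k
        points.append((length * cos(angle), length * sin(angle)))
    return points

# Ranges (cm) of sepal length, sepal width, petal length, petal width.
LOWS = [4.3, 2.0, 1.0, 0.1]
HIGHS = [7.9, 4.4, 6.9, 2.5]

# One plant from each species: setosa, versicolor, virginica.
for plant in ([5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4], [6.3, 3.3, 6.0, 2.5]):
    print([(round(x, 2), round(y, 2)) for x, y in star_vertices(plant, LOWS, HIGHS)])
```

Joining the vertices in order gives the polygon for that plant; similar plants produce similar shapes.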

Chernoff faces

1-height of face, 2-width of face, 3-shape of face, 4-height of mouth, 5-width of mouth, 6-curve of smile, 7-height of eyes, 8-width of eyes, 9-height of hair, 10-width of hair, 11-styling of hair, 12-height of nose, 13-width of nose, 14-width of ears, 15-height of ears
A movie of low-dimensional projections constructed in such a way that it comes arbitrarily close to showing all possible low-dimensional projections.
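A single frame of such a movie is just a 2-D projection of the data. The Python sketch below draws a random orthonormal pair of axes in four dimensions (Gaussian vectors followed by a Gram-Schmidt step) and projects a few plants onto it. It is a simplification: it produces independent random frames and omits the smooth interpolation between frames that makes a tour watchable. The function names and the seed are choices made for this illustration.

```python
import random
from math import sqrt

def random_frame(p, rng):
    """Two orthonormal p-vectors spanning a random 2-D projection plane."""
    a = [rng.gauss(0, 1) for _ in range(p)]
    b = [rng.gauss(0, 1) for _ in range(p)]
    norm_a = sqrt(sum(x * x for x in a))
    a = [x / norm_a for x in a]
    overlap = sum(x * y for x, y in zip(a, b))
    b = [y - overlap * x for x, y in zip(a, b)]  # Gram-Schmidt: remove the part along a
    norm_b = sqrt(sum(y * y for y in b))
    b = [y / norm_b for y in b]
    return a, b

def project(rows, frame):
    """2-D coordinates of each row under the given frame."""
    return [tuple(sum(x * f for x, f in zip(row, axis)) for axis in frame)
            for row in rows]

rng = random.Random(2019)
# One plant from each species (cm): setosa, versicolor, virginica.
plants = [[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4], [6.3, 3.3, 6.0, 2.5]]
for _ in range(3):  # three frames of the "movie"
    print(project(plants, random_frame(4, rng)))
```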




People have been using this data to demonstrate supervised classification (and clustering and dimension reduction) for more than 80 years!
Why has it stuck so long?
Small but not trivial. Simple but challenging. Real data. Fisher's reputation, although it's not his data. Tradition. Inertia. Continuity. You can find flower pictures to spell it out. – Nick Cox Nov 6 '13 at 19:07
Fresh and local data, where you can. For me, this is:

Thanks for listening!
Trivia quiz on the Iris data is here: http://bit.ly/JSMiris2019
Slides created via the R package xaringan, with iris theme created from xaringanthemer.
The chakra comes from remark.js, knitr, and R Markdown.
Slides are available at https://dicook.org/files/JSM19/slides.html and supporting files at https://github.com/dicook/JSM19.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.