Myth busting and apophenia in data visualisation: is what you see really there?

Professor Dianne Cook
Monash University

Philosophy

Being able to construct effective data plots goes a long way to helping you understand any problem.

The biggest complication is that data plots lack the inferential machinery of formal statistics …

… but not any more!

What I have learned from plots (1/5)

Gender gap in reading is universal, but the math gap is not. (Open data, available from OECD PISA.)

What I have learned from plots (2/5)

Pollsters have some bias in their reported results. (Data generally available at Real Clear Politics.)

What I have learned from plots (3/5)



Flying into and out of Dallas-Fort Worth (DFW) was a good option, at least before the pandemic. (Open data, available from BTS.)

What I have learned from plots (4/5)



CO\(_2\) is seasonal in the northern hemisphere, and increasing everywhere. (Open data is available from Scripps.)

What I have learned from plots (5/5)



82% of 2019-2020 Victorian bushfires were caused by lightning, and only 4% by arson. (Data and code here.)

Without making plots we might never have learned these things, and many more. Plots allow the data to share its secrets.

What about this one?

Apophenia


Apophenia is the human experience of seeing meaningful patterns or connections in random or meaningless data.


Chelsea Veals, 2012



Which plot exhibits the most separation between the groups?

In support of plotting data


Data plots are important for making discoveries - failing to make a discovery is a tragedy.



Plots need to be accompanied by methods (modern computational techniques) to guard against false discovery.

Inferential machinery

Why is a plot a statistic?

Many of you (hopefully) use ggplot2 to make your plots with a grammar of graphics from tidy data.

DATA %>% 
  ggplot() + 
    GEOM_something(
      mapping=aes(MAPPINGS))
  # + extra nice styling


A statistic is a function of one or more random variables. The mapping can be interpreted this way: the variables are the arguments, e.g. x=V1, y=V2, colour=cl, and the GEOM is the function.

Making comparisons with a null

ggplot(data=lineup(null_generator(VARS),
                   DATA)) + 
  GEOM_something(
    mapping=aes(MAPPINGS)) +
  facet_wrap(~ .sample)
  # + extra nice styling


Null hypothesis: There is NO pattern

Alternative: There is some sort of pattern



Test by comparing with null plots.
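
To make the idea of a lineup concrete, here is a minimal base R sketch of how one could be assembled by hand, assuming permutation is a suitable null generator. The helper `make_lineup` and the `pos` attribute are hypothetical names used for illustration; this is not how nullabor is implemented.

```r
# Hypothetical helper (not the nullabor implementation): stack m copies of
# the data, shuffling one variable in every copy except the one at `pos`.
make_lineup <- function(data, var, m = 12, pos = sample(m, 1)) {
  force(pos)  # fix the position of the data plot before any shuffling
  panels <- lapply(seq_len(m), function(i) {
    d <- data
    if (i != pos) {
      # Null plot: permuting `var` breaks any association with other variables
      d[[var]] <- sample(d[[var]])
    }
    d$.sample <- i  # panel label, suitable for facet_wrap(~ .sample)
    d
  })
  out <- do.call(rbind, panels)
  attr(out, "pos") <- pos  # remember where the data plot is hidden
  out
}

set.seed(2022)
lu <- make_lineup(mtcars, "cyl", m = 12)
table(lu$.sample)  # 12 panels, each with the same number of rows as mtcars
```

Only the panel at `attr(lu, "pos")` holds the observed pairing of values; every other panel is one draw from the null of "no association".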

Example


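library(ggplot2)
library(nullabor)
library(ochRe)

# Permute cyl to make the null plots; the data plot is hidden among the 12 panels
p <- ggplot(lineup(
        null_permute('cyl'), 
              mtcars, n=12),
        aes(x=mpg, 
            y=hp, 
            colour = factor(cyl))) +
       facet_wrap(~ .sample) +
       geom_point(size=2, 
                  alpha=0.8) +
  xlab("") + ylab("") +
  scale_colour_ochre("",
       palette="healthy_reef") +
  theme(axis.text = 
          element_blank())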
#> decrypt("wvLp U5E5 Ha es9HEHsa Yi")


Can you identify the odd one out?

REVIEW: Tidy data, plot, inference

1

Define your plot, based on tidy data


ggplot() + 
  GEOM_something(
    mapping=aes(MAPPINGS))

2

Add data



DATA %>% 
  ggplot() + 
    GEOM_something(
    mapping=aes(MAPPINGS))

3

Compare your data plot with a sample of null plots

ggplot(data=LINEUP(
  NULL_GENERATOR(VARS), 
     DATA)) + 
  GEOM_something(
    mapping=aes(MAPPINGS)) +
  facet_wrap(~ .sample)

Calculating statistical significance and power

The lineup is viewed by \(K\) uninvolved, independent observers. Under the null hypothesis, the chance that any one observer chooses the data plot is \(1/m\), where \(m\) is the number of plots in the lineup. With \(K\) observers, the \(p\)-value is the probability that \(x\) or more of them select the data plot.

Signal strength (visual power) is computed as \(x/K\), adjusted for multiple selections.

pvisual(x=2, K=23, m=12)
#>      x simulated binom
#> [1,] 2      0.52 0.582


pvisual(x=5, K=23, m=12)
#>      x simulated binom
#> [1,] 5    0.0894 0.038


pvisual(x=8, K=23, m=12)
#>      x simulated    binom
#> [1,] 8    0.0039 0.000363
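
The binom column above appears to be the upper tail of a Binomial(\(K\), \(1/m\)) distribution. The following is a minimal base R sketch under that assumption, with each observer picking exactly one plot independently; `binom_pvalue` is a hypothetical helper, not a nullabor function.

```r
# Hypothetical helper (not part of nullabor):
# P(X >= x) for X ~ Binomial(K, 1/m), i.e. x or more of K observers
# pick the data plot purely by chance.
binom_pvalue <- function(x, K, m) {
  # lower.tail = FALSE gives P(X > x - 1), which equals P(X >= x)
  pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)
}

p2 <- binom_pvalue(x = 2, K = 23, m = 12)
p5 <- binom_pvalue(x = 5, K = 23, m = 12)
round(c(p2, p5), 3)
```

For x=2 and x=5 this agrees with the binom values printed above to three decimal places. The simulated column differs because it comes from a simulation-based calculation in pvisual rather than from this closed form.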

The nullabor package

  • lineup: Generates a lineup using one of the given null generating mechanisms
    • null_permute
    • null_dist
    • null_lm
    • null_ts
  • pvisual: Compute \(p\)-values, after showing to observers
  • visual_power: Compute the power, after showing to observers
  • distmet: empirical distribution of distance between data plot and null plots

http://dicook.github.io/nullabor/

Example: tuberculosis data (1/6)

#> Rows: 8,277
#> Columns: 165
#> $ country               <chr> "Afghanistan", "Afghanistan"…
#> $ iso2                  <chr> "AF", "AF", "AF", "AF", "AF"…
#> $ iso3                  <chr> "AFG", "AFG", "AFG", "AFG", …
#> $ iso_numeric           <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ g_whoregion           <chr> "EMR", "EMR", "EMR", "EMR", …
#> $ year                  <dbl> 1980, 1981, 1982, 1983, 1984…
#> $ new_sp                <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn                <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_su                <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep                <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_oth               <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_rel               <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_taf               <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_tad               <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_oth               <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newret_oth            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_labconf           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_clindx            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_rel_labconf       <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_rel_clindx        <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_rel_ep            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_nrel              <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ notif_foreign         <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ c_newinc              <dbl> 71685, 71554, 41752, 52502, …
#> $ new_sp_m04            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m514           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m014           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m1524          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m2534          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m3544          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m4554          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m5564          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m65            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_mu             <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f04            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f514           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f014           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f1524          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f2534          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f3544          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f4554          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f5564          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f65            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_fu             <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m04            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m514           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m014           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m1524          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m2534          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m3544          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m4554          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m5564          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m65            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m15plus        <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_mu             <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f04            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f514           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f014           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f1524          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f2534          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f3544          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f4554          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f5564          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f65            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f15plus        <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_fu             <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_sexunk04       <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_sexunk514      <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_sexunk014      <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_sexunk15plus   <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m04            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m514           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m014           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m1524          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m2534          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m3544          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m4554          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m5564          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m65            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m15plus        <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_mu             <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f04            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f514           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f014           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f1524          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f2534          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f3544          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f4554          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f5564          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f65            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f15plus        <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_fu             <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_sexunk04       <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_sexunk514      <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_sexunk014      <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_sexunk15plus   <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_sexunkageunk   <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rel_in_agesex_flg     <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m04            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m514           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m014           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m1524          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m2534          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m3544          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m4554          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m5564          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m65            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m15plus        <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_mu             <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f04            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f514           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f014           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f1524          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f2534          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f3544          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f4554          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f5564          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f65            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f15plus        <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_fu             <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_sexunk04       <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_sexunk514      <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_sexunk014      <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_sexunk15plus   <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_sexunkageunk   <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdx_data_available    <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newinc_rdx            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdxsurvey_newinc      <lgl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdxsurvey_newinc_rdx  <lgl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdst_new              <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdst_ret              <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdst_unk              <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ conf_rrmdr            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ conf_mdr              <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rr_sldst              <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ all_conf_xdr          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ unconf_rrmdr_tx       <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ conf_rrmdr_tx         <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ unconf_mdr_tx         <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ conf_mdr_tx           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ conf_xdr_tx           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdrxdr_bdq_used       <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdrxdr_bdq_tx         <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdrxdr_dlm_used       <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdrxdr_dlm_tx         <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdr_shortreg_used     <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdr_shortreg_tx       <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdr_tx_adverse_events <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdr_tx_adsm           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_tbhiv_flg      <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_hivtest        <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_hivpos         <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_art            <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hivtest               <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hivtest_pos           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_cpt               <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_art               <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_tbscr             <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_reg               <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_ipt               <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_reg_new           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_ipt_reg_all       <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_reg_all           <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_tbdetect          <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_reg_new2          <dbl> NA, NA, NA, NA, NA, NA, NA, …

Is the data in tidy form?

Example: tuberculosis data (2/6)

#> Rows: 165,540
#> Columns: 6
#> $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan…
#> $ iso3    <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
#> $ year    <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1980, …
#> $ count   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ sex     <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m…
#> $ age     <chr> "04", "514", "014", "1524", "2534", "3544"…

The data is now clearly in tidy form.

Variables are country, iso3, year, count, sex, age.
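
One way to get from the wide table to this tidy form is to pivot the columns whose names encode sex and age. The sketch below uses a toy stand-in with made-up counts, and it assumes only the new_sp_ columns are pivoted; the actual wrangling code for the talk is not shown on the slides, so treat the column selection and the names `tb_wide` and `tb_tidy` as illustrative.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the wide WHO table; the counts are made up for illustration
tb_wide <- tibble(
  country     = c("Australia", "Australia"),
  iso3        = c("AUS", "AUS"),
  year        = c(2000, 2001),
  new_sp_m04  = c(1, 2),
  new_sp_m514 = c(3, 4),
  new_sp_f04  = c(5, 6),
  new_sp_f514 = c(7, 8)
)

# Split each column name into sex (m/f) and age group; values become count
tb_tidy <- tb_wide %>%
  pivot_longer(
    cols = starts_with("new_sp_"),
    names_to = c("sex", "age"),
    names_pattern = "new_sp_([mf])(.*)",
    values_to = "count"
  )

glimpse(tb_tidy)
```

Each row of the result is one country, year, sex and age combination, which is what lets sex and age be mapped to plot aesthetics.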

Example: tuberculosis data (3/6)

Focusing only on Australia, what would we like to know?

  1. Is there an increasing or decreasing trend?
  2. Is there a difference by age?
  3. Is there a difference by sex?
  4. Is there a difference by age and sex?
  5. Is the trend different by age?
  6. Is the trend different by sex?
  7. Is the trend different by age and sex?

What type of plot would we make to investigate Q1?

What type of plot would we make to investigate Q6?

Example: tuberculosis data (4/6)



p <- tb_oz %>% 
  group_by(year) %>%
  summarise(count = sum(count)) %>%
  ggplot(aes(x=year, y=count)) +
    geom_col()



For the question: Is there an increasing or decreasing trend?

What would the null hypothesis be?

What would be a possible null generator?

Example: tuberculosis data (5/6)

Go to www.menti.com and use the code 8049 8450




Check results

Example: tuberculosis data (6/6)


Compute the \(p\)-value.



pvisual(x=??, K=??, m=12)

Summary

  • We have equipped data plots with statistical inference machinery. With the grammar of graphics, data plots can be treated as statistics, and thus tested in a manner equivalent to formal hypothesis testing, using modern randomisation techniques.
  • The inference machinery can be used to objectively test plot design. The design with the highest visual power is the winner.
  • The framework should be suitable for training computer vision models to read data plots. The hope is to off-load visual model diagnostics to a robot.

Further reading

Buja et al (2009) Statistical Inference for Exploratory Data Analysis
Wickham et al (2010) Graphical Inference for Infovis
Hofmann et al (2012) Graphical Tests for Power Comparison
Majumder et al (2013) Validation of Visual Statistical Inference
Yin et al (2013) Visual Mining Methods for RNA-Seq data
Zhao et al (2014) Mind Reading: Using An Eye-tracker to See How People Are Looking At Lineups
Lin et al (2015) Does Host-Plant Diversity Explain Species Richness in Insects?
Roy Chowdhury et al (2015) Using Visual Statistical Inference to Better Understand Random Class Separations in High Dimension, Low Sample Size Data
Loy et al (2017) Model Choice and Diagnostics
Roy Chowdhury et al (2018) Measuring Lineup Difficulty By Matching Distance Metrics with Subject Choices

Acknowledgements

Slides produced using Quarto.

Colour palettes using the ochRe R package.

Slides available from https://github.com/dicook/Macquarie_2022.

Viewable at https://www.dicook.org/files/macquarie_2022/slides#/title-slide.

Contact: dicook@monash.edu