Professor Dianne Cook
Monash University
Being able to construct effective data plots goes a long way to helping you understand any problem.
The biggest complication is that data plots lack the inferential machinery of formal statistics …
… but not any more!
Gender gap in reading is universal, but the math gap is not. (Open data, available from OECD PISA.)
Pollsters have some bias in their reported results. (Data generally available at Real Clear Politics.)
Flying into and out of Dallas-Fort Worth (DFW) is a good option, at least before the pandemic. (Open data, available from BTS.)
CO\(_2\) is seasonal in the northern hemisphere, and increasing everywhere. (Open data is available from Scripps.)
82% of 2019-2020 Victorian bushfires were caused by lightning, and only 4% by arson. (Data and code here.)
Without making plots we might never have learned these, and many more things. Plots allow the data to share its secrets.
Apophenia is the human experience of seeing meaningful patterns or connections in random or meaningless data.
Which plot exhibits the most separation between the groups?
Data plots are important for making discoveries - failing to make a discovery is a tragedy.
Plots need to be accompanied by methods (modern computational techniques) that guard against false discovery.
Why is a plot a statistic?
Many of you (hopefully) use ggplot2 to make your plots with a grammar of graphics from tidy data.
A statistic is a function of a random variable(s). This is how the mapping can be interpreted, e.g. x=V1, y=V2, colour=cl, and the GEOM is the function.
Null hypothesis: There is NO pattern
Alternative: There is some sort of pattern
Test by comparing with null plots.
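A null plot is a plot of data generated so that the null hypothesis holds. As a minimal sketch, assuming the permutation mechanism used for the mtcars lineup in this talk, shuffling the grouping variable breaks any association with the other variables while leaving every marginal distribution unchanged. The names `null_data` and the seed are illustrative choices, not part of the original slides.

```r
# Sketch of a permutation null using only base R and the built-in mtcars data.
set.seed(2022)  # arbitrary seed, only so the shuffle is reproducible

null_data <- mtcars
# Shuffle the group labels; mpg and hp stay attached to their original rows
null_data$cyl <- sample(mtcars$cyl)

# The set of labels is unchanged, only their assignment to rows differs
same_labels <- identical(sort(null_data$cyl), sort(mtcars$cyl))
same_labels
```

Any group separation visible in a plot of `null_data` arises by chance alone, which is what the data plot is compared against.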
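library(ggplot2)
library(nullabor)
library(ochRe)

# Lineup of 12 plots: the data plot is hidden among 11 null plots,
# each null generated by permuting cyl
p <- ggplot(lineup(null_permute("cyl"), mtcars, n = 12),
            aes(x = mpg, y = hp, colour = factor(cyl))) +
  facet_wrap(~ .sample) +
  geom_point(size = 2, alpha = 0.8) +
  xlab("") + ylab("") +
  scale_colour_ochre("", palette = "healthy_reef") +
  theme(axis.text = element_blank())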
#> decrypt("wvLp U5E5 Ha es9HEHsa Yi")
Can you identify the odd one out?
The lineup is viewed by \(K\) uninvolved, independent observers. Under the null hypothesis, the chance that any observer chooses the data plot is \(1/m\), where \(m\) is the number of plots in the lineup. With \(K\) observers, the \(p\)-value is the probability that \(x\) or more select the data plot.
Signal strength (visual power) is computed as \(x/K\), adjusting for multiple selections.
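The calculation described above can be sketched in base R. This is a simplified version that assumes each observer makes exactly one independent pick, so the number choosing the data plot is Binomial(\(K\), \(1/m\)) under the null; it does not include the adjustment for multiple selections. The function name `lineup_pvalue` and the numbers used are hypothetical, for illustration only.

```r
# P(X >= x) for X ~ Binomial(K, 1/m): the chance that x or more of K
# observers pick the data plot if it is no different from the nulls.
lineup_pvalue <- function(x, K, m) {
  pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)
}

# Hypothetical outcome: 4 of 10 observers pick the data plot in a lineup of 12
x <- 4
K <- 10
m <- 12

p_value <- lineup_pvalue(x, K, m)
signal_strength <- x / K  # proportion of observers who picked the data plot
```

With a single observer and a lineup of 20, picking the data plot gives a \(p\)-value of \(1/20 = 0.05\), which is why 20 is a common lineup size. The nullabor package provides `pvisual` for this computation on real observer data.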
nullabor package
lineup: Generates a lineup using one of the given null generating mechanisms
null_permute, null_dist, null_lm, null_ts: the null generating mechanisms
pvisual: Compute \(p\)-values, after showing to observers
visual_power: Compute the power, after showing to observers
distmet: Empirical distribution of distance between data plot and null plots
#> Rows: 8,277
#> Columns: 165
#> $ country <chr> "Afghanistan", "Afghanistan"…
#> $ iso2 <chr> "AF", "AF", "AF", "AF", "AF"…
#> $ iso3 <chr> "AFG", "AFG", "AFG", "AFG", …
#> $ iso_numeric <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ g_whoregion <chr> "EMR", "EMR", "EMR", "EMR", …
#> $ year <dbl> 1980, 1981, 1982, 1983, 1984…
#> $ new_sp <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_su <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_oth <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_rel <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_taf <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_tad <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_oth <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newret_oth <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_labconf <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_clindx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_rel_labconf <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_rel_clindx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_rel_ep <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ ret_nrel <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ notif_foreign <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ c_newinc <dbl> 71685, 71554, 41752, 52502, …
#> $ new_sp_m04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_m65 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_mu <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_f65 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sp_fu <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m65 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_m15plus <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_mu <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f65 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_f15plus <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_fu <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_sexunk04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_sexunk514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_sexunk014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_sn_sexunk15plus <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m65 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_m15plus <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_mu <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f65 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_f15plus <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_fu <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_sexunk04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_sexunk514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_sexunk014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_sexunk15plus <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ new_ep_sexunkageunk <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rel_in_agesex_flg <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m65 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_m15plus <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_mu <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f65 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_f15plus <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_fu <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_sexunk04 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_sexunk514 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_sexunk014 <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_sexunk15plus <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_sexunkageunk <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdx_data_available <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newinc_rdx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdxsurvey_newinc <lgl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdxsurvey_newinc_rdx <lgl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdst_new <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdst_ret <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rdst_unk <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ conf_rrmdr <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ conf_mdr <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ rr_sldst <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ all_conf_xdr <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ unconf_rrmdr_tx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ conf_rrmdr_tx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ unconf_mdr_tx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ conf_mdr_tx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ conf_xdr_tx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdrxdr_bdq_used <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdrxdr_bdq_tx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdrxdr_dlm_used <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdrxdr_dlm_tx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdr_shortreg_used <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdr_shortreg_tx <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdr_tx_adverse_events <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ mdr_tx_adsm <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_tbhiv_flg <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_hivtest <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_hivpos <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ newrel_art <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hivtest <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hivtest_pos <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_cpt <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_art <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_tbscr <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_reg <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_ipt <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_reg_new <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_ipt_reg_all <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_reg_all <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_tbdetect <dbl> NA, NA, NA, NA, NA, NA, NA, …
#> $ hiv_reg_new2 <dbl> NA, NA, NA, NA, NA, NA, NA, …
Is the data in tidy form?
#> Rows: 165,540
#> Columns: 6
#> $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan…
#> $ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
#> $ year <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1980, …
#> $ count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ sex <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m…
#> $ age <chr> "04", "514", "014", "1524", "2534", "3544"…
Data is now clearly in tidy form.
Variables are country, iso3, year, count, sex, age.
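One way to get from the wide table to this tidy form is to pivot the sex-and-age count columns into rows. The sketch below uses a small made-up table as a stand-in for the TB data (the values and the name `tb_wide` are hypothetical), and assumes the column names follow the `new_sp_<sex><age>` pattern seen in the wide output above.

```r
library(tidyr)
library(dplyr)

# Toy stand-in for the wide TB table; counts are invented for illustration
tb_wide <- tibble(
  country = c("Australia", "Australia"),
  iso3 = c("AUS", "AUS"),
  year = c(2000, 2001),
  new_sp_m04 = c(1, 2),
  new_sp_m514 = c(3, 4),
  new_sp_f04 = c(5, 6),
  new_sp_f514 = c(7, 8)
)

# Split each column name into sex (m or f) and age group, one row per count
tb_tidy <- tb_wide %>%
  pivot_longer(
    cols = starts_with("new_sp_"),
    names_to = c("sex", "age"),
    names_pattern = "new_sp_([mf])(.*)",
    values_to = "count"
  )
```

Each row of `tb_tidy` is now one country, year, sex and age group, which is the shape ggplot2 mappings expect.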
Focusing only on Australia, what would we like to know?
What type of plot would we make to investigate Q1?
What type of plot would we make to investigate Q6?
Go to www.menti.com and use the code 8049 8450
Compute the \(p\)-value.
Buja et al (2009) Statistical Inference for Exploratory Data Analysis
Wickham et al (2010) Graphical Inference for Infovis
Hofmann et al (2012) Graphical Tests for Power Comparison
Majumder et al (2013) Validation of Visual Statistical Inference
Yin et al (2013) Visual Mining Methods for RNA-Seq data
Zhao et al (2014) Mind Reading: Using An Eye-tracker to See How People Are Looking At Lineups
Lin et al (2015) Does Host-Plant Diversity Explain Species Richness in Insects?
Roy Chowdhury et al (2015) Using Visual Statistical Inference to Better Understand Random Class Separations in High Dimension, Low Sample Size Data
Loy et al (2017) Model Choice and Diagnostics
Roy Chowdhury et al (2018) Measuring Lineup Difficulty By Matching Distance Metrics with Subject Choices
Slides produced using quarto.
Colour palettes using the ochRe R package.
Slides available from https://github.com/dicook/Macquarie_2022.
Viewable at https://www.dicook.org/files/macquarie_2022/slides#/title-slide.
Contact: dicook@monash.edu
Macquarie University – August 29, 2022 – https://github.com/dicook/Macquarie_2022