Rookie mistakes and how to fix them when making plots of data

In this assignment, the focus was to practice data cleaning. Students suggested questions to build a class survey, to get to know the interests of other class members, and then completed the composed survey. After cleaning the data, a few summary plots of interesting aspects of the data were made. There are some common mistakes that rookies often make when constructing data plots: packing too much into a single graphic, leaving categorical variables unordered, reversing the usual roles of response and explanatory variables, conditioning in the wrong order, plotting counts when proportions should be the focus, not normalizing by counts, and using a boxplot for a small sample.
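As a minimal sketch of the "counts versus proportions" mistake listed above, the Python below compares raw counts with within-group proportions. The survey rows, the group names, and the helper `proportions_within` are all hypothetical, made up for illustration rather than taken from the class survey.

```python
from collections import Counter

# Hypothetical survey rows (made-up values): (student level, owns a pet?)
rows = (
    [("undergrad", "yes")] * 6 + [("undergrad", "no")] * 6
    + [("graduate", "yes")] * 1 + [("graduate", "no")] * 3
)


def proportions_within(rows):
    """Return {group: {answer: share}}, where shares sum to 1 within each group."""
    cell_counts = Counter(rows)                          # count per (group, answer)
    group_totals = Counter(group for group, _ in rows)   # count per group
    shares = {}
    for (group, answer), n in cell_counts.items():
        shares.setdefault(group, {})[answer] = n / group_totals[group]
    return shares


shares = proportions_within(rows)
for group, by_answer in shares.items():
    print(group, {answer: round(p, 2) for answer, p in by_answer.items()})
```

In this made-up data the raw "yes" counts (6 against 1) mostly reflect that there are three times as many undergraduates, while the within-group proportions (0.5 against 0.25) are what answer the question of which group is more likely to say yes.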

Statistical Sciences, Cornell University

This week I have been visiting the Department of Statistical Sciences at Cornell University. This is the home of many venerable statisticians. At first sight it appears that statisticians are spread all over the university, and technically they are because funding comes from many directions, but almost all are actually located in a suite in Comstock Hall. Professor Paul Velleman is one of the pioneers of data-centrist thinking about statistics. He produced the software called DataDesk in the early 90s that some saw as rivaling LispStat and particularly JMP for introductory statistics classes.

Graduates in Statistical Graphics Research at ISU 2015

It's exciting to report on the graduations from the working group this year. Niladri Roy Chowdhury defended his PhD thesis in Aug 2014, titled “Explorations of the lineup protocol for visual inference: application to high dimension, low sample size problems and metrics to assess the quality”, under my direction. He is a scientist at Novartis, Boston, MA. Susan Vanderplas defended her PhD in May, titled “Perception in Statistical Graphics”, under the direction of Professor Heike Hofmann.

EDA at the UN

On Nov 10 I was part of a celebration of John W. Tukey at the United Nations. This event kicked off a new UN initiative called Unite Ideas. Details of the event and the initiative can be found here. There were five talks relayed live to an audience of several thousand, using Google Hangouts and a YouTube channel, and listeners could post questions using the Q/A tool. My talk was titled “An Exploratory Data Analysis of OECD’s 2012 PISA Survey”, and I delivered it by computer from my office in Iowa.

New version of nullabor package released

The new version of nullabor contains numerical measures that quantify how close the plot of the data is to the null plots in a lineup. It is very difficult to quantify all patterns that might be read from plots, so these should be taken in the spirit of a Herculean task. The goal is to get some sense of what people are reacting to in a plot, which could then be associated with the text descriptions from people, or with data from an eyetracker.

Statistical computing research

During the week, I received final confirmation notice that the special issue of Statistical Science that Vince Carey and I put together is finally published. There are four papers from leaders in the field of statistical computing research: John Chambers, Duncan Temple Lang, Michael Lawrence and Michael Morgan (newly minted members of R Core) and Yihui Xie, Heike Hofmann and Xiaoyue Cheng. The links to the overview and the four papers are below.

How good is Nick Kyrgios?

Nick Kyrgios caught the world’s attention in July at Wimbledon 2014 when he beat world number 1 Rafael Nadal. After the match McEnroe commented: “We’ve been waiting for this for a while. We keep saying, ‘Who’s the next guy?’, and I think we found that guy right now.” But Nadal seemed to beg to differ: “He has things, positive things, to be a good player. But everything is a little bit easier when you are arriving.

Faceted barcharts and fluctuation diagrams are good alternatives to stacked barcharts

When there are two categorical variables it is common to make a stacked barchart. The stacked barchart primarily allows the reader to see the overall count, but it is harder to compare the counts of categories, the colored segments. Using data from the vcd package in R, here is an example. The data describes the responses of couples on questions about their sex life. This is a bar chart showing the husband's views, with his wife's views forming the stacking.
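The faceting idea can be sketched in a few lines of plain Python using text bars. The counts below are made up for illustration and are not the vcd data, and the names `LEVELS`, `counts` and `facet_bars` are hypothetical. Each facet holds one wife response, and within it every husband response gets its own bar starting from a common baseline, which is what makes the segments easy to compare.

```python
# Hypothetical response levels and counts (made up, NOT the vcd data).
# counts[husband][wife] is the number of couples with that pair of responses.
LEVELS = ["never", "fairly often", "very often", "always"]
counts = {
    "never":        {"never": 5, "fairly often": 4, "very often": 2, "always": 1},
    "fairly often": {"never": 3, "fairly often": 6, "very often": 4, "always": 2},
    "very often":   {"never": 1, "fairly often": 4, "very often": 7, "always": 5},
    "always":       {"never": 2, "fairly often": 3, "very often": 6, "always": 9},
}


def facet_bars(counts, levels):
    """Build one text facet per wife response, with one bar per husband response."""
    lines = []
    for wife in levels:
        lines.append(f"Wife: {wife}")
        for husband in levels:
            n = counts[husband][wife]
            # All bars start at the same column, so lengths compare directly.
            lines.append(f"  {husband:<12} {'#' * n} {n}")
    return lines


print("\n".join(facet_bars(counts, LEVELS)))
```

In a stacked barchart only the bottom segment sits on a common baseline. Here every bar does, at the cost of no longer showing the overall count per husband response at a glance.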

A Graphical Expedition into a Statistics Gradebook

It is always with a sense of unease that I reduce a whole semester’s work into a single letter grade, and to alleviate this feeling, I often pack the gradebook into an interactive graphics system like ggobi or cranvas, and perambulate over it. In the second issue of Chance 2014, I wrote about doing this on grades for a large introductory statistics class using interactive graphics. The class had on the order of 100 students, and grades from exams, homework, labs, worksheets, online quizzes and a data analysis project.