Data Visualization Workshop
DataFEWSion Graduate Traineeship
Anabelle Laurent
February 20, 2023

My background

2012: Received my Master's degree in Agronomy/Agroecology (France)
2013-2016: Conducted research projects on energy crops at INRAE (France)
2017-2020: PhD at ISU in Crop Production & Physiology - Department of Agronomy
2021-2022: Postdoc at ISU - Department of Agronomy
2023 to current: Research Scientist at Corteva (Johnston, IA) in the Biostatistics Team

Let's talk: why is Data Visualization important? 🤔

Why is Data Visualization important?

Universal way to communicate information
Provides a clear and effective message
Reveals patterns, trends, and extreme values
Makes data memorable
Maintains the audience's interest

Let's talk: Who is your audience? Which medium are you using? 🤔

Who is your audience?
- scientists 🥼
- students 👨‍🎓
- industry
- general audience
Which medium?
- peer-reviewed papers 🗞
- oral presentations 💬
- websites, blogs, etc.

What makes a good visualization? 😄

What makes a good visualization?

Reveals a trend or relationship between variables
Always has, at minimum, a caption, axes, scales, and symbols
Uses distinct and legible symbols (i.e., use contrast)
Has a caption that conveys as much information as possible
No noise: keeps information to a minimum
Uses the correct graph type for the kind of data presented

Disclaimer

This workshop does not provide the code behind the plots, but all of them were made in R with RStudio (see the last slides for more details)

Artwork by @allison_horst

Visualizing quantity

Visualizing quantity: bar plot

What's wrong with this plot?

Avoid abbreviations
Use a precise axis title that includes the unit
Make it more attractive
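
A minimal ggplot2 sketch of these fixes, not the code behind the slide. The `mpg` dataset (shipped with ggplot2), the fill colour, and the label wording are assumptions chosen only for illustration.

```r
library(ggplot2)

# Mean highway fuel economy per vehicle class (mpg ships with ggplot2)
class_means <- aggregate(hwy ~ class, data = mpg, FUN = mean)

p_labels <- ggplot(class_means, aes(x = class, y = hwy)) +
  geom_col(fill = "steelblue") +
  labs(
    x = "Vehicle class",                                  # full words, no abbreviations
    y = "Mean highway fuel economy (miles per gallon)",   # precise title + unit
    caption = "Data: ggplot2::mpg"
  ) +
  theme_minimal(base_size = 14)                           # cleaner look, larger text

print(p_labels)
```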

For long x-axis labels, flip the axes

Visualizing quantity: bar plot

Order the categories by ascending or descending value
Keep naturally ordered categories, such as age groups, in their natural order
For long labels, flip the axes
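
One way to apply the ordering and flipping advice in ggplot2 is sketched below. It is an illustration under assumptions (the `mpg` dataset and the label text are not from the slides), not the code used for the plot shown.

```r
library(ggplot2)

class_means <- aggregate(hwy ~ class, data = mpg, FUN = mean)

# reorder() sorts the factor levels by the plotted value,
# so the bars appear in ascending order instead of alphabetical order
class_means$class <- reorder(class_means$class, class_means$hwy)

p_ordered <- ggplot(class_means, aes(x = class, y = hwy)) +
  geom_col() +
  coord_flip() +  # horizontal bars leave room for long category labels
  labs(x = NULL, y = "Mean highway fuel economy (miles per gallon)")

print(p_ordered)
```

For a naturally ordered variable such as age group, skipping the `reorder()` step and setting the factor levels by hand keeps the natural order.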

Visualizing quantity: grouped bar plot

Useful for drawing bars within each group according to a second categorical variable
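
A grouped bar plot can be sketched in ggplot2 with `position = "dodge"`. The example assumes the built-in `mtcars` data, which is not the dataset used on the slide.

```r
library(ggplot2)

# Count cars for every combination of cylinders and gears (both become factors)
counts <- as.data.frame(table(cyl = mtcars$cyl, gear = mtcars$gear))

p_grouped <- ggplot(counts, aes(x = cyl, y = Freq, fill = gear)) +
  geom_col(position = "dodge") +  # bars side by side within each group
  labs(
    x = "Number of cylinders",
    y = "Number of cars",
    fill = "Number of forward gears"
  )

print(p_grouped)
```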

What's wrong with this plot?

Bars are too long
Bar plots can sometimes be impractical

Don't do that! 🙅

Bar charts must start at zero because the bar length is proportional to the amount displayed.
A dot plot is a better option
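
A dot plot encodes the value by position rather than by bar length, so the axis does not need to start at zero. The sketch below assumes the base R `state.x77` data (life expectancy, where all values sit far from zero); it is not the slide's dataset.

```r
library(ggplot2)

# Life expectancy (years, 1969-71) for the ten longest-lived US states
life <- data.frame(
  state    = rownames(state.x77),
  life_exp = state.x77[, "Life Exp"]
)
life <- life[order(-life$life_exp), ][1:10, ]
life$state <- reorder(life$state, life$life_exp)  # order the dots by value

p_dot <- ggplot(life, aes(x = life_exp, y = state)) +
  geom_point(size = 3) +  # no bars, so no zero baseline is required
  labs(x = "Life expectancy (years)", y = NULL)

print(p_dot)
```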

Visualizing quantity: dot plot

Bar charts or dot plots: the order matters
Here, the plot does not deliver a clear message

Visualizing quantity: lollipop plot

Dataset: on-time data for all flights that departed NYC
Lollipop plots are an alternative to simple bar charts
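
A lollipop plot can be built in ggplot2 from a thin segment plus a point. The sketch uses the built-in `mtcars` data instead of the flights data mentioned above, purely to stay self-contained.

```r
library(ggplot2)

# Ten most fuel-efficient cars in mtcars, ordered by value
top_cars <- data.frame(model = rownames(mtcars), mpg = mtcars$mpg)
top_cars <- top_cars[order(-top_cars$mpg), ][1:10, ]
top_cars$model <- reorder(top_cars$model, top_cars$mpg)

p_lollipop <- ggplot(top_cars, aes(x = mpg, y = model)) +
  # the stem starts at zero, like a bar would
  geom_segment(aes(x = 0, xend = mpg, yend = model), colour = "grey60") +
  geom_point(size = 3, colour = "steelblue") +
  labs(x = "Fuel economy (miles per gallon)", y = NULL)

print(p_lollipop)
```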

Visualizing distribution

Artwork by @allison_horst

Visualizing distribution: histograms

Histograms are useful for plotting the distribution of a single quantitative variable

Visualizing distribution: histograms

Try different bin widths for best visual appearance.

Small bin width -> peaky and busy histogram
Large bin width -> features might disappear
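
The effect of the bin width can be explored with a small helper, sketched here on the built-in `faithful` data (an assumption, not the slide's data). `make_hist()` is a hypothetical helper name.

```r
library(ggplot2)

# Hypothetical helper: same histogram, different bin width (in minutes)
make_hist <- function(binwidth) {
  ggplot(faithful, aes(x = waiting)) +
    geom_histogram(binwidth = binwidth, colour = "white") +
    labs(
      x = "Waiting time between eruptions (minutes)",
      y = "Count",
      title = paste("Bin width:", binwidth, "minutes")
    )
}

p_narrow <- make_hist(1)   # many thin bins: peaky and busy
p_medium <- make_hist(5)   # the two modes are easy to see
p_wide   <- make_hist(20)  # few wide bins: the two modes can blur together

print(p_medium)
```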

Visualizing distribution: density plot

Try different bandwidths for best visual appearance

Small bandwidth -> peaky and busy density
Large bandwidth -> smooth density that might look like a Gaussian
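
In ggplot2, the `adjust` argument of `geom_density()` multiplies the default bandwidth, which gives a quick way to try several values. The sketch below assumes the built-in `faithful` data and three arbitrary multipliers.

```r
library(ggplot2)

adjustments <- c(0.25, 1, 4)  # multipliers of the default bandwidth (arbitrary choices)

density_plots <- lapply(adjustments, function(a) {
  ggplot(faithful, aes(x = waiting)) +
    geom_density(adjust = a, fill = "grey80") +
    labs(
      x = "Waiting time between eruptions (minutes)",
      y = "Density",
      title = paste("Bandwidth multiplier:", a)
    )
})

# The bandwidths actually used, computed with base R's density()
bandwidths <- sapply(adjustments, function(a) density(faithful$waiting, adjust = a)$bw)

print(density_plots[[2]])
```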

Visualizing multiple distributions

The peaks of a density plot show where the points are most concentrated
For several distributions, density plots work better than histograms.
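
Overlaid densities are usually drawn with a semi-transparent fill so that every group stays visible. A minimal sketch, assuming the built-in `iris` data rather than the slide's data:

```r
library(ggplot2)

p_multi <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.4) +  # transparency keeps overlapping curves readable
  labs(x = "Sepal length (cm)", y = "Density", fill = "Species")

print(p_multi)
```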

Visualizing multiple distributions: ridgeline plot

A ridgeline plot shows the distribution of a numeric variable for several groups. It works best with many groups (at least 5-6) or when the distributions overlap each other.

Visualizing distributions: boxplot

A boxplot can summarize the distribution of a numeric variable for several groups

A boxplot does not show the number of observations.

Visualizing distributions: boxplot with jitter

Boxplots with jitter show:

the distribution of the data
whether the groups are balanced or unbalanced in number of observations.
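
A boxplot with jittered points can be sketched as follows. The `mpg` dataset and the jitter width are assumptions for illustration; the drive-train groups in `mpg` happen to be unbalanced, which the points make visible.

```r
library(ggplot2)

set.seed(1)  # jitter is random; fix the seed for a reproducible picture

p_box_jitter <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot(outlier.shape = NA) +                   # hide outliers: every point is drawn below
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +  # spread points sideways only
  labs(x = "Drive train", y = "Highway fuel economy (miles per gallon)")

# Number of observations per group, which the boxes alone do not show
group_sizes <- table(mpg$drv)

print(p_box_jitter)
```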

Avoiding overlapping points makes the plot easier to read

Visualizing distributions: violin plot

Violin plots are equivalent to density estimates
They are useful to represent bimodal data.

Your turn 👨‍💻

Create one visual using one of these types of graphics:

bar chart
histograms
density plot
boxplot
violin plot

Your turn 👩‍💻

Choose the dataset of your choice:

Nutritional and marketing information on US Cereals
Diamonds

	mfr	calories	protein	fat	sugars	shelf
100% Bran	N	212.1	12.1	3	18.2	3
All-Bran	K	212.1	12.1	3	15.2	3
All-Bran with Extra Fiber	K	100.0	8.0	0	0.0	3

## 'data.frame':    65 obs. of  11 variables:
##  $ mfr      : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
##  $ calories : num  212 212 100 147 110 ...
##  $ protein  : num  12.12 12.12 8 2.67 2 ...
##  $ fat      : num  3.03 3.03 0 2.67 0 ...
##  $ sodium   : num  394 788 280 240 125 ...
##  $ fibre    : num  30.3 27.3 28 2 1 ...
##  $ carbo    : num  15.2 21.2 16 14 11 ...
##  $ sugars   : num  18.2 15.2 0 13.3 14 ...
##  $ shelf    : int  3 3 3 1 2 3 1 3 2 1 ...
##  $ potassium: num  848.5 969.7 660 93.3 30 ...
##  $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...

Your turn 👨‍💻

Choose the dataset of your choice

carat	cut	color	clarity	depth	table	price	x	y	z
0.2	Ideal	E	SI2	61.5	55	326	4.0	4.0	2.4
0.2	Premium	E	SI1	59.8	61	326	3.9	3.8	2.3
0.2	Good	E	VS1	56.9	65	327	4.0	4.1	2.3

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Your turn 👩‍💻

install.packages("esquisse")  # one-time installation
library(esquisse)
library(MASS)     # provides the UScereal dataset
library(ggplot2)  # provides the diamonds dataset
?UScereal  # more details about the dataset
esquisse::esquisser(UScereal, viewer = "browser")
?diamonds  # more details about the dataset
esquisse::esquisser(diamonds, viewer = "browser")

Visualizing associations among quantitative variables

Relationship between 2 numeric variables: scatterplot

Relationship between 2 numeric variables: scatterplot + linear fit

Relationship between 2 numeric variables: scatterplot + quadratic fit

⚠️ A linear fit is widely used, but it is not always the best fit; try a quadratic fit too.
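
Both fits can be overlaid with `geom_smooth(method = "lm")` by changing the formula. The sketch assumes the built-in `mtcars` data, not the data on the slide, and also fits the two models with `lm()` so they can be compared numerically.

```r
library(ggplot2)

p_fits <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              aes(colour = "Linear")) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE,
              aes(colour = "Quadratic")) +
  labs(x = "Gross horsepower", y = "Fuel economy (miles per gallon)", colour = "Fit")

# Same two models outside the plot, to compare them with a criterion such as AIC
linear_fit    <- lm(mpg ~ hp, data = mtcars)
quadratic_fit <- lm(mpg ~ poly(hp, 2), data = mtcars)
fit_comparison <- AIC(linear_fit, quadratic_fit)

print(p_fits)
print(fit_comparison)
```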

Relationship between 2 numeric variables: scatterplot

Multi-panel plots

Split a single plot using one variable with many levels

Split a single plot using the combinations of two discrete variables.

⚠️ Different scales across panels can lead to misinterpretation
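
The three multi-panel cases can be sketched in ggplot2 with `facet_wrap()` and `facet_grid()`. The `mpg` dataset and the chosen faceting variables are assumptions for illustration.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (litres)",
       y = "Highway fuel economy (miles per gallon)")

# One variable with many levels
p_wrap <- base_plot + facet_wrap(~ class)

# Combinations of two discrete variables (rows ~ columns)
p_grid <- base_plot + facet_grid(drv ~ cyl)

# Each panel gets its own y-axis: differences between panels are
# harder to compare, so read the axis labels carefully
p_free <- base_plot + facet_wrap(~ class, scales = "free_y")

print(p_wrap)
```

The default (`scales = "fixed"`) keeps the same axes in every panel, which is usually the safer choice when panels are meant to be compared.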

Bubble plot

A bubble plot is a scatterplot that shows 3 numeric variables: the third one is mapped to the size of the points
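
In ggplot2, a bubble plot is a scatterplot with a `size` aesthetic. A minimal sketch on the built-in `mtcars` data (an assumption, not the slide's data):

```r
library(ggplot2)

p_bubble <- ggplot(mtcars, aes(x = wt, y = mpg, size = hp)) +
  geom_point(alpha = 0.6) +          # transparency helps where bubbles overlap
  scale_size_area(max_size = 10) +   # bubble area, not radius, is proportional to the value
  labs(
    x = "Weight (1000 lbs)",
    y = "Fuel economy (miles per gallon)",
    size = "Gross horsepower"
  )

print(p_bubble)
```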

Hexagonal heatmap

It counts the number of cases in each hexagon. Useful for large datasets or to avoid overplotting.

Computationally more efficient than plotting individual data points for very large datasets.
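
A hexagonal heatmap can be sketched with `geom_hex()`. Note the assumptions: the `diamonds` data and `bins = 40` are illustrative choices, and `geom_hex()` needs the separate hexbin package at drawing time, so the plot is only printed when that package is available.

```r
library(ggplot2)

p_hex <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex(bins = 40) +  # colour encodes the number of diamonds per hexagon
  labs(
    x = "Weight (carats)",
    y = "Price (US dollars)",
    fill = "Number of diamonds"
  )

# geom_hex() relies on the hexbin package when the plot is drawn
if (requireNamespace("hexbin", quietly = TRUE)) {
  print(p_hex)
}
```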

Your turn 👨‍💻

Create one visual using a scatter plot or a bubble plot
Use a data set from TidyTuesday

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm
Adelie	Torgersen	39.1	18.7	181
Adelie	Torgersen	39.5	17.4	186
Adelie	Torgersen	40.3	18.0	195
Adelie	Torgersen	NA	NA	NA
Adelie	Torgersen	36.7	19.3	193

species	island	body_mass_g	sex	year
Adelie	Torgersen	3750	male	2007
Adelie	Torgersen	3800	female	2007
Adelie	Torgersen	3250	female	2007
Adelie	Torgersen	NA	NA	2007
Adelie	Torgersen	3450	female	2007

penguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')
esquisse::esquisser(penguins, viewer = "browser")

🐧See this link for more details about the penguins dataset

Visualizing time series

Visualizing time series

Example with a NASA dataset: atmospheric measurements across a grid of locations in Central America (Murrell, 2010)

Overview of the data

##   time y x   lat      long       date cloudhigh cloudlow cloudmid ozone
## 1    1 1 1 -21.2 -113.8000 1995-01-01       0.5     31.0      2.0   260
## 2    1 1 2 -21.2 -111.2957 1995-01-01       1.5     31.5      2.5   260
## 3    1 1 3 -21.2 -108.7913 1995-01-01       1.5     32.5      3.5   260
## 4    1 1 4 -21.2 -106.2870 1995-01-01       1.0     39.0      4.0   258
## 5    1 1 5 -21.2 -103.7826 1995-01-01       0.5     48.0      4.5   258
## 6    1 1 6 -21.2 -101.2783 1995-01-01       0.0     50.0      2.5   258
##   pressure surftemp temperature  id day month year
## 1     1000    297.4       296.9 1-1   0     1 1995
## 2     1000    297.4       296.5 2-1   0     1 1995
## 3     1000    297.4       296.0 3-1   0     1 1995
## 4     1000    296.9       296.5 4-1   0     1 1995
## 5     1000    296.5       295.5 5-1   0     1 1995
## 6     1000    296.5       295.0 6-1   0     1 1995

Let's pick one location (x=1 & y=1) and focus on surface temperature (Kelvin)

##   time x y surftemp day month year
## 1    1 1 1    297.4   0     1 1995
## 2    2 1 1    298.7  31     2 1995
## 3    3 1 1    298.3  59     3 1995
## 4    4 1 1    298.7  90     4 1995
## 5    5 1 1    298.3 120     5 1995
## 6    6 1 1    295.0 151     6 1995

Without the dots, you emphasize the general trend rather than the individual observations
A plot with lines + dots is called a line graph
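
The two variants differ by a single layer in ggplot2. The sketch uses the `economics` data shipped with ggplot2 instead of the NASA data above, only to stay self-contained.

```r
library(ggplot2)

# Last 24 monthly observations of US unemployment (in thousands)
recent <- tail(economics, 24)

# Line only: emphasizes the general trend
p_line <- ggplot(recent, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(x = "Date", y = "Unemployed persons (thousands)")

# Line + dots: each individual observation is visible
p_line_points <- p_line + geom_point()

print(p_line_points)
```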

Display change between two time periods: dumbbell chart

Source: Rob Kabacoff (2020)
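
A dumbbell chart can be assembled from one segment and two point layers. This is a sketch with plain ggplot2 on the base R `USPersonalExpenditure` data (1940 versus 1960); it is an assumption for illustration and not the code from the cited source.

```r
library(ggplot2)

# Personal expenditure by category in 1940 and 1960 (billions of US dollars)
spend <- data.frame(
  category = rownames(USPersonalExpenditure),
  start    = USPersonalExpenditure[, "1940"],
  end      = USPersonalExpenditure[, "1960"]
)
spend$category <- reorder(spend$category, spend$end)

p_dumbbell <- ggplot(spend, aes(y = category)) +
  # the bar of the dumbbell links the two time periods
  geom_segment(aes(x = start, xend = end, yend = category), colour = "grey60") +
  geom_point(aes(x = start, colour = "1940"), size = 3) +
  geom_point(aes(x = end, colour = "1960"), size = 3) +
  labs(x = "Personal expenditure (billions of US dollars)", y = NULL, colour = "Year")

print(p_dumbbell)
```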

Aesthetics: Color, Shape, Opacity

Color to distinguish

🐧

Color to highlight

Something wrong?🤷

Color & shape to highlight

Alternative 1

Color & shape to highlight

Alternative 2

Opacity

Spatial distribution of drug-related crimes in Charlottesville

Your audience can misinterpret your graphic if the opacity is not set correctly

Control the opacity to reduce overplotting and provide shading
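
Opacity is set with the `alpha` argument in ggplot2. The sketch below uses a random sample of the `diamonds` data rather than the crime data on the slide, and the value `alpha = 0.1` is an arbitrary starting point to tune.

```r
library(ggplot2)

set.seed(42)
diamonds_sample <- diamonds[sample(nrow(diamonds), 5000), ]

# Fully opaque points: dense regions become a solid blob
p_opaque <- ggplot(diamonds_sample, aes(x = carat, y = price)) +
  geom_point() +
  labs(x = "Weight (carats)", y = "Price (US dollars)")

# Semi-transparent points: darker shading shows where points pile up
p_transparent <- ggplot(diamonds_sample, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  labs(x = "Weight (carats)", y = "Price (US dollars)")

print(p_transparent)
```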

Tell a story with your data 📖

Tell a story with your data

Before visualizing data, you must:

Know your audience
Know the level of data detail expected
Give enough context
Ask yourself: What do I want my audience to know/remember from the data I am presenting?

Don't be repetitive but be consistent (theme, color scheme, font size etc.)

Guide your audience by pointing out specific values

Customize your plot using highlighting

Customize your plot using highlighting + text
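
Highlighting plus text can be sketched with ggplot2 alone by drawing all points in grey and then re-drawing the point of interest in colour with a label. The `mtcars` data and the choice of the most fuel-efficient car are assumptions for illustration; the gghighlight and ggrepel packages listed at the end of the deck offer more convenient tools for the same idea.

```r
library(ggplot2)

cars_df <- data.frame(model = rownames(mtcars), mtcars)

# The observation the audience should notice: the most fuel-efficient car
highlighted <- cars_df[which.max(cars_df$mpg), ]

p_highlight <- ggplot(cars_df, aes(x = wt, y = mpg)) +
  geom_point(colour = "grey70") +                                    # context, de-emphasized
  geom_point(data = highlighted, colour = "firebrick", size = 3) +   # highlight
  geom_text(data = highlighted, aes(label = model),
            hjust = -0.1, colour = "firebrick") +                    # text next to the point
  labs(x = "Weight (1000 lbs)", y = "Fuel economy (miles per gallon)")

print(p_highlight)
```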

Interactive graphics with ggplotly
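
`ggplotly()` from the plotly package converts an existing ggplot object into an interactive widget. A minimal sketch, assuming the `mtcars` data and that plotly is installed (the conversion is skipped otherwise):

```r
library(ggplot2)

p_static <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Fuel economy (miles per gallon)",
       colour = "Cylinders")

# plotly is an extra package; convert only if it is available
if (requireNamespace("plotly", quietly = TRUE)) {
  # Typing p_interactive in an interactive session opens a widget
  # with hover tooltips, zoom and pan
  p_interactive <- plotly::ggplotly(p_static)
}
```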

Interactive time-series with dygraphs

Lung deaths in the UK

Your turn 👩‍💻

library(dygraphs)
# Combine the three monthly UK lung-disease death series (both sexes, male, female)
lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)
# Ex 1: highlighting persists even after the mouse leaves the graph area
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyHighlight(highlightCircleSize = 5,
              highlightSeriesBackgroundAlpha = 0.2,
              hideOnMouseOut = FALSE)
# Ex 2: stroke width of the highlighted series
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyHighlight(highlightSeriesOpts = list(strokeWidth = 3))

Your turn 👨‍💻

library(dygraphs)
lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)
# Ex 3: fill in the area underneath the series
dygraph(ldeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(fillGraph = TRUE, fillAlpha = 0.4) 
# Ex 4: display of the individual points in a series
dygraph(ldeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(drawPoints = TRUE, pointSize = 2)

Data visualization using interactive web apps

First example: ISOFAST web-app

Second example: ONFANT web-app

Your turn! 👩‍💻

Create a prototype user interface for a Shiny app
Goal: get familiar with R Shiny, then level up with the RStudio tutorials

install.packages("designer")
library(designer)
designer::designApp()

Check this website for more information.

R libraries used for this presentation

library(dygraphs)
library(gapminder)
library(gghighlight)
library(ggplot2)
library(ggrepel)
library(dplyr)
library(plotly)
library(tidyr)

Resources to go deeper into Data Viz

Website for R colors and palettes
Claus Wilke's book
Rob Kabacoff's book
Marie Döbler & Tim Großmann's book (available online through the ISU Library)
Cédric Scherer's blog
From Data to Viz's website
dygraphs R package
Plotly R package
Shiny tutorial
Check the hashtag #tidytuesday on Twitter if you are looking for inspiration & R code.
Shiny app about Tidy Tuesday tweets

Accurate

Artwork by @allison_horst

Thank you for your attention

✉️ my email: alaurent@iastate.edu

Slides created via the R package xaringan.
