class: center, middle, inverse, title-slide

# Data Visualization Workshop
## DataFEWSion Graduate Traineeship
### Anabelle Laurent
### March 02, 2021

---

### But first...

<iframe src="https://app.sli.do/event/vid6lkpl/embed/polls/fcc19b45-618f-4550-8732-b5929884cc0d" width="300" height="400"></iframe>

---

### Why is Data Visualization important? 🤔

<iframe src="https://app.sli.do/event/vid6lkpl/embed/polls/f67942d8-85e6-442b-9c1c-e17269635c10" width="300" height="400"></iframe>

---

### Why is Data Visualization important?

- A universal way to communicate information
- Delivers a clear and effective message
- Helps find patterns and trends, and spot extreme values
- Makes data memorable
- Maintains the audience's interest

---

Who is your audience?

<iframe src="https://app.sli.do/event/vid6lkpl/embed/polls/c3c918fa-e88e-418e-9f57-d982285a6bd2" width="300" height="400"></iframe>

---

### What about you?

- Who is your audience?
  + scientists 👩‍🔬
  + students 👨‍🎓
  + industry
  + general audience
- Which medium?
  + peer-reviewed paper
  + oral presentations 💬
  + website, blog, etc.

---

### What makes a good visualization? 🤔

No online poll, let's talk!

---

### What makes a good visualization?

- Reveals a **trend** or **relationship** between variables
- Always has, at minimum, a **caption**, **axes**, **scales** and **symbols**
- Uses distinct and legible symbols (i.e., use contrast)
- Has a caption that conveys as much information as possible
- No noise: keep information to a minimum
- Uses the **correct graph type** for the kind of data to be presented

---

### Disclaimer

This workshop does not provide code, but all the plots were made using RStudio (see the last slides for more details).

<center><img src="images/ggplot2_masterpiece.png" style="width: 70%" /> </center>

[Artwork by @allison_horst](https://github.com/allisonhorst/stats-illustrations)

---

# Visualizing quantity

---

### Visualizing quantity: bar plot

.right-column[
<!-- -->
]

--

.left-column[
What's wrong with this plot?
]

---

### Visualizing quantity: bar plot

.right-column[
<!-- -->
]

.left-column[
- Avoid abbreviations
- Give a precise axis title with its unit
- Make it more attractive
]

---

### Visualizing quantity: bar plot

.right-column[
<!-- -->
]

--

.left-column[
For long x-axis labels, flip the axes
]

---

### Visualizing quantity: bar plot

.right-column[
<!-- -->
]

.left-column[
- Order the categories by ascending or descending values
- Keep naturally ordered categories, such as age groups, in their natural order
- For long labels: flip the axes
]

---

### Visualizing quantity: grouped bar plot

.right-column[
<!-- -->
]

.left-column[
Useful for drawing bars within each group according to another categorical variable
]

---

### What's wrong with this plot?

.right-column[
<!-- -->
]

--

.left-column[
- Bars are too long
- Can be impractical sometimes
]

---

### Don't do that! ☹

.right-column[
<!-- -->
]

--

.left-column[
- Bar charts should start at zero, because the bar length is proportional to the amount displayed.
- A **dot plot** is a better option
]

---

### Visualizing quantity: dot plot

<!-- -->

---

### Visualizing quantity: dot plot

.right-column[
<!-- -->
]

--

.left-column[
- Bar charts or dot plots: **the order matters**
- Here, you don't deliver a clear message
]

---

### Visualizing quantity: lollipop plot

.right-column[
<!-- -->
]

.left-column[
- Dataset: on-time data for all flights that departed NYC
- Lollipop plots are an alternative to a simple bar chart
]

---

# Visualizing distribution

<center><img src="images/histogram.png" style="width: 70%" /> </center>

[Artwork by @allison_horst](https://github.com/allisonhorst/stats-illustrations)

---

### Visualizing distribution: histograms

.left-right[
<!-- -->
]

.left-column[
Histograms are useful for plotting the distribution of a single quantitative variable
]

---

### Visualizing distribution: histograms

.right-column[
<!-- -->
]

.left-column[
Try different bin widths for best visual appearance.
- Small bin width -> peaky and busy histogram
- Large bin width -> features might disappear
]

---

### Visualizing distribution: density plot

<!-- -->

---

### Visualizing distribution: density plot

.right-column[
<!-- -->
]

.left-column[
Try different bandwidths for best visual appearance
- Small bandwidth -> peaky and busy density
- Large bandwidth -> smooth density that might look like a Gaussian
]

---

### Visualizing multiple distributions

<!-- -->

---

### Visualizing multiple distributions

.right-column[
<!-- -->
]

.left-column[
- The peaks of a density plot show where the concentration of points is highest
- For several distributions, density plots work better than histograms.
]

---

### Visualizing multiple distributions

<!-- -->

---

### Visualizing multiple distributions: ridgeline plot

<!-- -->

---

### Visualizing multiple distributions: ridgeline plot

.right-column[
<!-- -->
]

.left-column[
A ridgeline plot shows the distribution of a numeric variable for several groups. It is useful when there are many groups (at least 5-6) or when the distributions overlap each other.
]

---

### Visualizing distributions: boxplot

<center><img src="images/read_boxplot.jpeg" style="width: 100%" /> </center>

A boxplot can summarize the distribution of a numeric variable for several groups

---

### Visualizing distributions: boxplot

.right-column[
<!-- -->
]

.left-column[
A boxplot does not show the number of observations.
]

---

### Visualizing distributions: boxplot with jitter

.right-column[
<!-- -->
]

.left-column[
Boxplots with jitter show:
- the distribution of the data
- whether the groups are balanced or unbalanced in terms of number of observations.
]

---

### Visualizing distributions: boxplot with jitter

.right-column[
<!-- -->
]

.left-column[
Avoiding overlapping points improves the visual appearance of the plot
]

---

### Visualizing distributions: violin plot

.right-column[
<!-- -->
]

.left-column[
- Violins are equivalent to density estimates
- They are useful to represent bimodal data.
]

---

### Your turn 👨‍💻

Create one visual using one of these types of graphics:
- bar chart
- histogram
- density plot
- boxplot
- violin plot

---

### Your turn 👩‍💻

Choose the dataset of your choice:
- Titanic
- Nutritional and marketing information on US Cereals

--

|Class |Sex    |Age   |Survived | Freq|
|:-----|:------|:-----|:--------|----:|
|1st   |Male   |Child |No       |    0|
|2nd   |Male   |Child |No       |    0|
|3rd   |Male   |Child |No       |   35|
|Crew  |Male   |Child |No       |    0|
|1st   |Female |Child |No       |    0|

```
## 'data.frame': 32 obs. of 5 variables:
##  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq    : num 0 0 35 0 0 0 17 0 118 154 ...
```

Freq = the number of observations

---

### Your turn 👨‍💻

Choose the dataset of your choice:
- Titanic
- Nutritional and marketing information on US Cereals

--

|                          |mfr | calories| protein| fat| sugars| shelf|
|:-------------------------|:---|--------:|-------:|---:|------:|-----:|
|100% Bran                 |N   |    212.1|    12.1|   3|   18.2|     3|
|All-Bran                  |K   |    212.1|    12.1|   3|   15.2|     3|
|All-Bran with Extra Fiber |K   |    100.0|     8.0|   0|    0.0|     3|

```
## 'data.frame': 65 obs. of 11 variables:
##  $ mfr      : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
##  $ calories : num 212 212 100 147 110 ...
##  $ protein  : num 12.12 12.12 8 2.67 2 ...
##  $ fat      : num 3.03 3.03 0 2.67 0 ...
##  $ sodium   : num 394 788 280 240 125 ...
##  $ fibre    : num 30.3 27.3 28 2 1 ...
##  $ carbo    : num 15.2 21.2 16 14 11 ...
##  $ sugars   : num 18.2 15.2 0 13.3 14 ...
##  $ shelf    : int 3 3 3 1 2 3 1 3 2 1 ...
##  $ potassium: num 848.5 969.7 660 93.3 30 ...
##  $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
```

---

### Your turn 👨‍💻

```r
# Install esquisse from GitHub (requires the remotes package)
remotes::install_github("dreamRs/esquisse")
library(MASS)      # provides the UScereal dataset
library(esquisse)
titanic <- as.data.frame(Titanic)
?Titanic   # more details about the dataset
?UScereal  # more details about the dataset
esquisse::esquisser(titanic, viewer = "browser")
esquisse::esquisser(UScereal, viewer = "browser")
```

---

# Visualizing associations among quantitative variables

---

### Relationship between 2 numeric variables: scatterplot

<!-- -->

---

### Relationship between 2 numeric variables: scatterplot + linear fit

<!-- -->

---

##### Relationship between 2 numeric variables: scatterplot + quadratic fit

<!-- -->

⚠️ A linear fit is widely used, but it is not always the best fit; try a quadratic fit too.

---

### Relationship between 2 numeric variables: scatterplot

<!-- -->

---

### Multi-panel plots

<!-- -->

Split a single plot using one variable with many levels

---

### Multi-panel plots

<!-- -->

Split a single plot using the combinations of two discrete variables.

---

### Multi-panel plots

<!-- -->

⚠️ Different scales can lead to misinterpretation

---

### Bubble plot

A bubble plot is a scatterplot that displays 3 numerical variables

<!-- -->

---

### Your turn 👨‍💻

- Create one visual using a scatterplot or a bubble plot
- Use a dataset from TidyTuesday

---

### Your turn 👩‍💻

|species |island    | bill_length_mm| bill_depth_mm| flipper_length_mm|
|:-------|:---------|--------------:|-------------:|-----------------:|
|Adelie  |Torgersen |           39.1|          18.7|               181|
|Adelie  |Torgersen |           39.5|          17.4|               186|
|Adelie  |Torgersen |           40.3|          18.0|               195|
|Adelie  |Torgersen |             NA|            NA|                NA|
|Adelie  |Torgersen |           36.7|          19.3|               193|

<center><img src="images/bill_penguin.png" style="width: 50%" /> </center>

---

### Your turn 👨‍💻

|species |island    | body_mass_g|sex    | year|
|:-------|:---------|-----------:|:------|----:|
|Adelie  |Torgersen |        3750|male   | 2007|
|Adelie  |Torgersen |        3800|female | 2007|
|Adelie  |Torgersen |        3250|female | 2007|
|Adelie  |Torgersen |          NA|NA     | 2007|
|Adelie  |Torgersen |        3450|female | 2007|

<center><img src="images/penguins_drawing.png" style="width: 50%" /> </center>

---

### Your turn 👩‍💻

```r
# Read the penguins data from the TidyTuesday repository, then explore it with esquisse
penguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')
esquisse::esquisser(penguins, viewer = "browser")
```

🐧 See this [link for more details about the penguins dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-28/readme.md)

---

# Visualizing time series

---

### Visualizing time series

Example with a NASA dataset: atmospheric measurements across a grid of locations in Central America (Murrell, 2010)

<center><img src="images/grid_nasa.png" style="width: 50%" /> </center>

---

### Visualizing time series

Overview of the data

```
##   time y x   lat      long       date cloudhigh cloudlow cloudmid ozone
## 1    1 1 1 -21.2 -113.8000 1995-01-01       0.5     31.0      2.0   260
## 2    1 1 2 -21.2 -111.2957 1995-01-01       1.5     31.5      2.5   260
## 3    1 1 3 -21.2 -108.7913 1995-01-01       1.5     32.5      3.5   260
## 4    1 1 4 -21.2 -106.2870 1995-01-01       1.0     39.0      4.0   258
## 5    1 1 5 -21.2 -103.7826 1995-01-01       0.5     48.0      4.5   258
## 6    1 1 6 -21.2 -101.2783 1995-01-01       0.0     50.0      2.5   258
##   pressure surftemp temperature  id day month year
## 1     1000    297.4       296.9 1-1   0     1 1995
## 2     1000    297.4       296.5 2-1   0     1 1995
## 3     1000    297.4       296.0 3-1   0     1 1995
## 4     1000    296.9       296.5 4-1   0     1 1995
## 5     1000    296.5       295.5 5-1   0     1 1995
## 6     1000    296.5       295.0 6-1   0     1 1995
```

---

### Visualizing time series

Let's pick one location (x=1 & y=1) and focus on surface temperature (Kelvin)

```
##   time x y surftemp day month year
## 1    1 1 1    297.4   0     1 1995
## 2    2 1 1    298.7  31     2 1995
## 3    3 1 1    298.3  59     3 1995
## 4    4 1 1    298.7  90     4 1995
## 5    5 1 1    298.3 120     5 1995
## 6    6 1 1    295.0 151     6 1995
```

---

### Visualizing time series

<!-- -->

- Without the dots, you emphasize the general trend rather than the individual observations
- A plot with a line + dots is called a line graph

---

### Visualizing time series

<!-- -->

---

### Display change between two time periods: dumbbell chart

<center><img src="images/dummbbell_chart.png" style="width: 80%" /> </center>

Source: Rob Kabacoff (2020)

---

# Color & Symbols

---

### Color to distinguish

<!-- -->

🐧

---

### Color to highlight

<!-- -->

Something wrong? 🤷

---

### Color & shape to highlight

<!-- -->

Alternative 1

---

### Color & shape to highlight

<!-- -->

Alternative 2

---

# Tell a story with your data

---

### Tell a story with your data

Before creating a data visualization, you must:
- Know your audience
- Know the level of data detail expected
- Give enough context
- Ask yourself: What do I want my audience to know/remember from the data I am presenting?

---

### Tell a story with your data

.right-column[
<center><img src="images/2020_08_14_penguins.png" style="width: 80%" /> </center>
]

.left-column[
Don't be repetitive, but be consistent (theme, color scheme, font size, etc.)
]

---

### Tell a story with your data

Guide your audience by pointing out specific values

<center><img src="images/2019_10_08_powerlifting.png" style="width: 80%" /> </center>

---

### Tell a story with your data

Guide your audience by pointing out specific values

<center><img src="images/2020_foodconsumption.png" style="width: 70%" /> </center>

---

### Tell a story with your data

Customize your plot using highlighting

<!-- -->

---

### Tell a story with your data

Customize your plot using highlighting + text

<!-- -->

---

### Interactive graphics with ggplotly
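
The interactive plot originally shown on this slide is not reproduced here. As a minimal sketch, assuming the built-in `mtcars` dataset as a stand-in (it is not the data used in the workshop, and the axis labels are illustrative), a static ggplot2 scatterplot can be made interactive by passing it to `ggplotly()` from the plotly package. Hovering over a point then shows its values, and the legend entries can be clicked to hide or show a group.

```r
library(ggplot2)
library(plotly)

# Static scatterplot built with ggplot2.
# mtcars ships with R and is only a stand-in dataset for this sketch.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(
    x = "Weight (1000 lbs)",
    y = "Fuel economy (miles per gallon)",
    colour = "Number of cylinders"
  )

# ggplotly() converts the ggplot object into an interactive plotly widget
p_interactive <- ggplotly(p)
p_interactive
```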
---

### Interactive time series with dygraphs
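
The widget originally shown on this slide is not reproduced here. As a minimal sketch, assuming the built-in `ldeaths` monthly time series as example data (the data behind the original widget is not known), a dygraph with a labelled y-axis and a range selector for zooming on a time window could look like this:

```r
library(dygraphs)

# ldeaths is a monthly time series (ts object) from R's built-in datasets package;
# it is used here only as example data.
dg <- dygraph(ldeaths, main = "Monthly deaths from lung disease (UK)") %>%
  dyAxis("y", label = "Number of deaths") %>%  # label the y-axis
  dyRangeSelector()                            # add a draggable range selector below the plot
dg
```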
---

### Your turn 👨‍💻

```r
library(dygraphs)
lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)

# Ex 1: highlighting persists even after the mouse leaves the graph area
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyHighlight(highlightCircleSize = 5,
              highlightSeriesBackgroundAlpha = 0.2,
              hideOnMouseOut = FALSE)

# Ex 2: stroke width of the highlighted series
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyHighlight(highlightSeriesOpts = list(strokeWidth = 3))

# Ex 3: fill in the area underneath the series
dygraph(ldeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(fillGraph = TRUE, fillAlpha = 0.4)

# Ex 4: display the individual points in a series
dygraph(ldeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(drawPoints = TRUE, pointSize = 2)
```

---

### Data visualization using an interactive web app

One example: [ISOFAST web-app](https://analytics.iasoybeans.com/cool-apps/ISOFAST/)

---

### R libraries used for this presentation

```r
library(ggplot2)
library(dplyr)
library(tidyr)
library(gapminder)
library(gghighlight)
library(ggrepel)
library(dygraphs)
library(plotly)
```

---

### Resources to go deeper into Data Viz

- Website for [R colors and palettes](https://www.color-hex.com/)
- [Claus Wilke's book](https://clauswilke.com/dataviz/index.html)
- [Rob Kabacoff's book](https://rkabacoff.github.io/datavis/)
- [Mario Döbler & Tim Großmann's book](https://www.barnesandnoble.com/w/the-data-visualization-workshop-second-edition-mario-d-bler/1136609407) Available online with ISU Library
- [Cédric Scherer's blog](https://www.cedricscherer.com/top/dataviz/)
- [From Data to Viz's website](https://www.data-to-viz.com/)
- [dygraphs R package](https://rstudio.github.io/dygraphs/)
- [Plotly R package](https://plotly.com/r/)
- [Shiny tutorial](https://shiny.rstudio.com/tutorial/)
- Check the hashtag **#tidytuesday** on Twitter if you are looking for inspiration & R code.
- [Shiny app about Tidy Tuesday tweets](https://nsgrantham.shinyapps.io/tidytuesdayrocks/)

---

### Accurate

<center><img src="images/r_rollercoaster.png" style="width: 85%" /> </center>

[Artwork by @allison_horst](https://github.com/allisonhorst/stats-illustrations)

---

### Thank you for your attention

<center><img src="images/lastslide.jpg" style="width: 60%" /> </center>

✉️ my email: **alaurent@iastate.edu**

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).