Visualizing French wine reviews with rbokeh
Oct 21, 2018
391 minutes read

Purpose of this post

Although I love French wine, I do not intend in this post to conduct a detailed statistical investigation into wine, per se; rather, I want to explore a new (for me) visualization package in R: rbokeh. I typically use ggplot2 for most of my visualizations. However, rbokeh produces interactive graphics. I particularly like the idea of programmers being able not only to convey in a static manner a single insight, but also to embed enough information within a figure that the consumer can explore the figure in further detail.

The data I use are freely available on Kaggle. The file I use in this post contains reviews from Wine Enthusiast of roughly 130,000 wines from around the world. The data require a moderate amount of cleaning in order to extract necessary features, such as the vintage/year. Because I use this post to explore rbokeh, I will not show the code I used to clean the data, though I am happy to answer questions or provide code if you would like to contact me.

On to rbokeh

rbokeh is the R interface to Bokeh, a library that is not unique to R. Here is what the website for the package has to say:

Bokeh is a visualization library that provides a flexible and powerful declarative framework for creating web-based plots. Bokeh renders plots using HTML canvas and provides many mechanisms for interactivity. Bokeh has interfaces in Python, Scala, Julia, and now R.

While searching for answers to various questions, I noticed that the non-R interfaces, such as the one for Python, have more extensive documentation and more discussion on sites like Stack Overflow. However, enough information exists for rbokeh to answer most inquiries.

Now, let's get to exploring!

Before getting to the visualizations, I will first provide a brief overview of the data.

# get dimensions
dim(reviews_fr)
## [1] 20508    15
# get variable names
names(reviews_fr)
##  [1] "points"                "title"                
##  [3] "description"           "taster_name"          
##  [5] "taster_twitter_handle" "price"                
##  [7] "designation"           "variety"              
##  [9] "region_1"              "region_2"             
## [11] "province"              "country"              
## [13] "winery"                "region_type"          
## [15] "vintage"

The dim() function indicates that there are over 20,000 reviews of French¹ wines, and the names() function shows the 15 variables. The first 13 were included in the original file from Kaggle. I added the region_type and vintage variables. The former indicates whether a wine uses an AOC (appellation d’origine contrôlée) or an IGP (indication géographique protégée) label, as the two are competing classification systems in the French wine market; the latter is simply the year in which the wine was produced. I focus in this post on the points, taster_name, price, province, winery, and vintage variables.
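For readers wondering how a variable like vintage can be derived, here is a minimal base R sketch. It is not the cleaning code used for this post; it assumes the year appears as a four-digit number in the title, and the titles below are made up for illustration.

```r
# Hypothetical titles in the style of the Kaggle data (not actual rows)
titles <- c("Domaine Exemple 2014 Rouge (Bordeaux)",
            "Chateau Exemple 2011 Brut (Champagne)",
            "Maison Exemple NV Blanc (Alsace)")

# Locate the first standalone four-digit number starting with 19 or 20
m <- regexpr("\\b(19|20)[0-9]{2}\\b", titles, perl = TRUE)

# Titles without a match (for example, non-vintage wines) stay NA
vintage <- rep(NA_integer_, length(titles))
vintage[m > 0] <- as.integer(regmatches(titles, m))

vintage
```

A rule this simple would misfire on titles where a four-digit number is part of the wine's name rather than its year, so real data would need additional checks.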

Single Plots

To begin, I will create a basic scatterplot concerning the relationship between the points a wine receives and the year in which it was produced. The syntax structure is similar to that of ggplot2: one initially creates an empty figure and then adds layers. You can use the pipe operator between layers (rather than the plus sign of ggplot2).

# load package
library(rbokeh)

# points versus vintage
figure() %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr)

This very simple scatter plot is not very informative. It is difficult to see much of a trend, likely for two reasons: (1) the actual trend—or lack thereof—in the data and (2) the size, density, and transparency of the points.

Fixing the second set of issues is relatively straightforward. I will set the point size to 7 and the alpha to 0.3.

# adjust size and alpha
figure() %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr,
            size = 7,
            alpha = 0.3)

The darker points mark places where many observations overlap, and the smaller, semi-transparent dots make it easier to see where these dense regions lie. Unfortunately, such regions are rather frequent. This issue lies in the data, so I will ignore it for now because the goal is to showcase rbokeh.
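To see why the overlap comes from the data rather than from the plotting options, it helps to count how many reviews share the same position. The sketch below uses simulated data as a stand-in for reviews_fr (the column names match, but the values and ranges are invented) and assumes, as in the scatterplot, that both variables take whole-number values.

```r
# Simulated stand-in for reviews_fr; values are random, not real reviews
set.seed(1)
sim <- data.frame(
  vintage = sample(2000:2016, 1000, replace = TRUE),
  points  = sample(80:100, 1000, replace = TRUE)
)

# Count how many reviews share each vintage/points combination
sim$n <- 1L
counts <- aggregate(n ~ vintage + points, data = sim, FUN = sum)

# Distinct plotted positions versus total number of reviews
nrow(counts)
nrow(sim)
```

With only 17 possible vintages and 21 possible scores in this simulation, at most 357 distinct positions exist for 1,000 reviews, so heavy overplotting is unavoidable no matter how size and alpha are set.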

The scatterplot currently has no title, and the axis labels are not capitalized, as they are simply the variable names. In order to fix this, we specify the title, xlab, and ylab options.

figure(title = "Rating versus vintage of French wines") %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr,
            size = 7,
            alpha = 0.3,
            xlab = "Vintage",
            ylab = "Points")

This version looks better, but the axis labels are larger than the plot title. We can fix this issue with the theme_title() function.

figure(title = "Rating versus vintage of French wines") %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr,
            size = 7,
            alpha = 0.3,
            xlab = "Vintage",
            ylab = "Points") %>% 
  theme_title(text_font_size = "13pt")

Et voilà!

My favorite capability of rbokeh visualizations is that information on each observation appears when you hover over its point. To make this happen, we use the hover option.

figure(title = "Rating versus vintage of French wines") %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr,
            size = 7,
            alpha = 0.3,
            xlab = "Vintage",
            ylab = "Points",
            hover = c(vintage, points)) %>% 
  theme_title(text_font_size = "13pt")