Visualizing French wine reviews with rbokeh

Oct 21, 2018


Purpose of this post

Although I love French wine, I do not intend in this post to conduct a detailed statistical investigation into wine, per se; rather, I want to explore a new (for me) visualization package in R: rbokeh. I typically use ggplot2 for most of my visualizations. However, rbokeh produces interactive graphics. I particularly like the idea of programmers being able not only to convey in a static manner a single insight, but also to embed enough information within a figure that the consumer can explore the figure in further detail.

The data I use are freely available on Kaggle. The file I use in this post contains reviews from Wine Enthusiast on roughly 130,000 wines from around the world. The data require a moderate amount of cleaning in order to extract necessary features, such as the vintage/year. Because I use this post to explore rbokeh, I will not show the code used to clean the data, though I am happy to answer questions or provide code if you would like to contact me.
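As a rough illustration of the kind of feature extraction involved, the sketch below pulls a four-digit year out of a review title with base R. The example titles, the extract_vintage() helper name, and the assumption that the first plausible year (1900 to 2099) in a title is the vintage are all hypothetical; this is not the code used to build reviews_fr.

```r
# Hypothetical helper: return the first four-digit year (19xx or 20xx)
# found in each title, or NA when the title contains no such year.
extract_vintage <- function(title) {
  pos <- regexpr("\\b(19|20)[0-9]{2}\\b", title, perl = TRUE)
  hit <- !is.na(pos) & pos > 0

  out <- rep(NA_integer_, length(title))
  # regmatches() returns only the matched elements, in order
  out[hit] <- as.integer(regmatches(title, pos))
  out
}

# Invented titles, for illustration only
titles <- c("Domaine Exemple 2012 Reserve (Alsace)",
            "Chateau Exemple NV Brut (Champagne)")

extract_vintage(titles)
```

Non-vintage wines (such as the second title) come back as NA, which is one reason the cleaning step needs some care.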

On to rbokeh

Bokeh, the library behind rbokeh, is not unique to R. Here is what the website for the package has to say:

Bokeh is a visualization library that provides a flexible and powerful declarative framework for creating web-based plots. Bokeh renders plots using HTML canvas and provides many mechanisms for interactivity. Bokeh has interfaces in Python, Scala, Julia, and now R.

While searching for answers to various questions, I noticed that other, non-R interfaces, such as the one for Python, have more extensive documentation and discussion on sites like Stack Overflow. However, enough information exists to answer most inquiries.

Now, let's get to exploring!

Before getting to the visualizations, I will first provide a brief overview of the data.

# get dimensions
dim(reviews_fr)

## [1] 20508    15

# get variable names
names(reviews_fr)

##  [1] "points"                "title"                
##  [3] "description"           "taster_name"          
##  [5] "taster_twitter_handle" "price"                
##  [7] "designation"           "variety"              
##  [9] "region_1"              "region_2"             
## [11] "province"              "country"              
## [13] "winery"                "region_type"          
## [15] "vintage"

The dim() function indicates that there are over 20,000 reviews of French¹ wines, and the names() function shows the 15 variables. The first 13 were included in the original file from Kaggle. I added the region_type and vintage variables. The former indicates whether a wine uses an AOC (appellation d’origine contrôlée) or an IGP (indication géographique protégée) label, as the two are competing classification systems in the French wine market; the latter refers simply to the year in which the wine was produced. I focus in this post on the points, taster_name, price, province, winery, and vintage variables.

Single Plots

To begin, I will create a basic scatterplot concerning the relationship between the points a wine receives and the year in which it was produced. The syntax structure is similar to that of ggplot2: one initially creates an empty figure and then adds layers. You can use the pipe operator between layers (rather than the plus sign of ggplot2).

# load package
library(rbokeh)

# points versus vintage
figure() %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr)

This very simple scatter plot is not very informative. It is difficult to see much of a trend, likely for two reasons: (1) the actual trend—or lack thereof—in the data and (2) the size, density, and transparency of the points.

Fixing the second set of issues is relatively straightforward. I will set the point size to 7 and the alpha to 0.3.

# adjust size and alpha
figure() %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr,
            size = 7,
            alpha = 0.3)

The darker areas indicate considerable overlap of observations, and the smaller dots allow us to see more precisely where these denser regions are located. Unfortunately, such regions are rather frequent. This issue lies in the data, so I will ignore it for now because the goal is to showcase rbokeh.

The scatterplot currently has no title, and the axis labels are not capitalized, as they are simply the variable names. In order to fix this, we specify the title, xlab, and ylab options.

figure(title = "Rating versus vintage of French wines") %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr,
            size = 7,
            alpha = 0.3,
            xlab = "Vintage",
            ylab = "Points")

This version looks better, but the axis labels are larger than the plot title. We can fix this issue with the theme_title() function.

figure(title = "Rating versus vintage of French wines") %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr,
            size = 7,
            alpha = 0.3,
            xlab = "Vintage",
            ylab = "Points") %>% 
  theme_title(text_font_size = "13pt")

Et voilà!

My favorite capability of rbokeh visualizations is that information on each observation is presented to you when you hover over the point. To make this happen, we use the hover option.

figure(title = "Rating versus vintage of French wines") %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr,
            size = 7,
            alpha = 0.3,
            xlab = "Vintage",
            ylab = "Points",
            hover = c(vintage, points)) %>% 
  theme_title(text_font_size = "13pt")

One issue with the information provided when one hovers over a point is that the label for each value is simply the variable name. There are two different ways to fix this problem. One may reference within a string the variable preceded by an @ symbol (e.g., “@vintage”), which makes it possible for hovers to include something like “This wine was produced in @vintage”, but, in my experience so far, you cannot include multiple pieces of information this way. The other way to customize the hover is to use HTML code. While this might seem a bit clunky at first, it lets you include multiple pieces of information, and the presentation is more flexible than the large table that accompanies the previous example. I will now illustrate the use of HTML to alter the hover information.

# customize hover
figure(title = "Rating versus vintage of French wines") %>% 
  ly_points(x = vintage,
            y = points,
            data = reviews_fr,
            size = 7,
            alpha = 0.3,
            xlab = "Vintage",
            ylab = "Points",
            hover = "<b>Vintage</b>: @vintage <br>
            <b>Points</b>: @points") %>% 
  theme_title(text_font_size = "13pt")

Multiple Plots

Unlike ggplot2, rbokeh does not have a succinct way to create multiple plots (e.g., a facet option). Instead, you have to create the plots you want to be grouped together and then use the grid_plot() function to join them together. Once you do this, customizing the group is relatively straightforward. I switch in this section to visualizing the relationship between price and rating for a few wine regions within France.

The first task consists of splitting data by region, which the split() function makes easy.

# split by region/province
split_province <- split(reviews_fr, reviews_fr$province)

Here are the regions found in the data.

unique(reviews_fr$province)

##  [1] "alsace"               "beaujolais"           "bordeaux"            
##  [4] "burgundy"             "france other"         "southwest france"    
##  [7] "rhône valley"         "languedoc roussillon" "provence"            
## [10] "loire valley"         "champagne"

Next, I write a function to create a plot for a single region, which I will then apply to all of the subsets of region data. Within each plot I include the price and points values in the hover.

# create plotting function
plot_scatter <- function(x){
  figure() %>% 
    ly_points(x = price,
              y = points,
              data = x,
              hover = "<b>Price</b>: @price<br><b>Points</b>: @points",
              size = 6,
              alpha = 0.3)
}

# create list of figures
scatter_list_province <- lapply(split_province, plot_scatter)

I do not want plots for all 11 regions found within the data, so I will choose four regions: Bordeaux, Burgundy (Bourgogne), Champagne, and Alsace.

# now plot together
grid_plot(list(Bordeaux = scatter_list_province[["bordeaux"]],
               Burgundy = scatter_list_province[["burgundy"]],
               Champagne = scatter_list_province[["champagne"]],
               Alsace = scatter_list_province[["alsace"]]),
          nrow = 2,
          same_axes = TRUE,
          xlim = c(0, 1000))

It is important in the initial list to type, on the left-hand side of the equal sign, the plot name as you would like it to appear; I have not (yet, hopefully) found a way to easily alter the titles in another manner, for example, if one wants multiple words. I specified two rows to ensure that I received a 2 by 2 grid of figures. rbokeh also allows you to specify the number of columns. The same_axes argument ensures that all plots have the same ranges, and I set the upper limit of the x axis to 1000. I chose to do so because there are extreme values in some of the plots, which made trends even more difficult to see. Note, however, that these points are still on the plot: because the scatter plots are interactive, you may click and drag the plot past the 1000 mark on the x axis.
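On the question of multi-word titles: base R does allow list names that contain spaces, so a named list like the one passed to grid_plot() can carry such names. Whether grid_plot() then renders them as panel titles is an assumption that is not verified here. The sketch below demonstrates only the base R part, with placeholder strings standing in for the figures; scatter_list_demo, keep, and plots_to_show are hypothetical names.

```r
# Placeholder objects standing in for the rbokeh figures in
# scatter_list_province (invented contents, for illustration only)
scatter_list_demo <- list("loire valley" = "figure A",
                          "southwest france" = "figure B",
                          "alsace" = "figure C")

# Map the display name (left) to the name used in the data (right)
keep <- c("Loire Valley" = "loire valley",
          "Southwest France" = "southwest france")

# Select the wanted figures and rename them in one step
plots_to_show <- setNames(scatter_list_demo[keep], names(keep))

names(plots_to_show)
```

A list built this way could then be handed to grid_plot() in place of the hand-typed list, which also avoids repeating scatter_list_province[[...]] for every region.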

Using Color within a Single Plot

In this final section, I return to creating a single visualization and will examine the wines from Alsace. I begin by subsetting the reviews_fr data to include only wines from this region.

# filter alsace (filter() comes from dplyr)
library(dplyr)

alsace <- reviews_fr %>% 
  filter(province == "alsace")

I will include in this figure all of the customization types I have covered so far. I provide a title in 14-point font, a hover that includes information on the winery, vintage, price, and points, resized and largely transparent dots, and x and y axis labels. The key change is that I add a color option, using taster_name as a factor variable (it is currently a character string). I also include height and width specifications, but these simply enlarge the resulting figure.

figure(width = 600,
       height = 600,
       title = "Alsace Wine Reviews, 1996-2016") %>% 
  ly_points(x = price,
            y = points,
            data = alsace,
            hover = "<b>Winery</b>: @winery<br>
            <b>Vintage</b>: @vintage<br>
            <b>Price</b>: $@price<br>
            <b>Points</b>: @points<br>",
            size = 6,
            alpha = 0.3,
            color = as.factor(taster_name),
            xlab = "Price",
            ylab = "Points") %>% 
  theme_title(text_font_size = "14pt")

In addition to the bivariate distribution of scores (points) against prices, we also see a general trend in the reviewing for Alsace. Anne Krebiehl (blue) tends to review wines with higher scores than Roger Voss (red). Joe Czerwinski reviewed only three wines, and Lauren Buzzeo reviewed two. Hovering over each observation provides information on the winery, vintage, price, and points a wine received.

It is important to note three issues with the figure. First, the dollar sign is not always placed consistently in the hover. If the price is three digits, then there is no space between the sign and the first digit, but, if the price is two digits, then there is a space. I have not yet found a way to remedy this problem. Second, the legend is in the way. In the next iteration, I will move it to the bottom right of the figure using the legend_location option in the figure() function. Finally, the variable name, which appears in the legend, is exactly as I coded it when creating the figure; this means that it includes as.factor() and has an underscore rather than a space between the two words. The as.factor() problem may be addressed by changing the variable’s class before plotting the data, but I have not found a way to customize the presentation of the variable name.
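One possible workaround for the dollar sign, which has not been confirmed against rbokeh itself, would be to build the formatted price as a character column beforehand and reference that column in the hover. The toy data frame alsace_demo and the column name price_label below are hypothetical, and the sketch assumes that the hover renders a character column exactly as stored.

```r
# Toy stand-in for the alsace data (values invented for illustration)
alsace_demo <- data.frame(price = c(15, 120, NA),
                          points = c(88, 93, 90))

# Build the label once; keep missing prices as NA rather than "$NA"
alsace_demo$price_label <- ifelse(is.na(alsace_demo$price),
                                  NA_character_,
                                  paste0("$", alsace_demo$price))

alsace_demo$price_label
```

The hover string would then use @price_label in place of $@price, so that the spacing is fixed in the data rather than left to the hover template.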

# recode variable
alsace <- alsace %>% 
  mutate(Taster_Name = as.factor(taster_name))

# plot
figure(legend_location = "bottom_right",
       width = 600,
       height = 600,
       title = "Alsace Wine Reviews, 1996-2016") %>% 
  ly_points(x = price,
            y = points,
            data = alsace,
            hover = "<b>Winery</b>: @winery<br>
            <b>Vintage</b>: @vintage<br>
            <b>Price</b>: $@price<br>
            <b>Points</b>: @points<br>",
            size = 6,
            alpha = 0.3,
            color = Taster_Name,
            xlab = "Price",
            ylab = "Points") %>% 
  theme_title(text_font_size = "14pt")

Note the Toolbar

The toolbar is an important feature of rbokeh plots. You can find it in the above figure on the upper right side, and it is composed of several semi-transparent icons: pan, box zoom, wheel zoom, reset, a selector to enable or disable the hover tool, and a link to more information about rbokeh. There are many ways to customize the toolbar, including removing it altogether. You change its location by passing “above”, “below”, “left”, or “right” to the toolbar_location option in figure(); if you pass NULL rather than one of the strings, the toolbar is removed.

I particularly enjoy the zoom features of rbokeh. Having over 20,000 observations, as the reviews_fr data do, means that certain portions of the scatter plot can get quite dense, and zooming lets you inspect a smaller portion of the data while retaining the ability to examine other spaces on the plot. The box zoom enables you to enlarge a set of points that you select (i.e., that fall within a box that appears when you click and drag the cursor across the figure). The wheel zoom allows you to use your mouse’s scroll wheel (or the equivalent gesture if you have a touch mouse) to zoom in and out on portions of the plot. If you decide you no longer want to zoom, reset will make the plot appear as it was originally coded.

You can of course change which tools the toolbar offers by default. I provide below an example in which I include only pan, box zoom, and reset as options; this is accomplished with the tools option in figure(). You will still have the option to disable the hover tool.

# change tools
figure(legend_location = "bottom_right",
       width = 600,
       height = 600,
       title = "Alsace Wine Reviews, 1996-2016",
       tools = c("pan", "box_zoom", "reset")) %>% 
  ly_points(x = price,
            y = points,
            data = alsace,
            hover = "<b>Winery</b>: @winery<br>
            <b>Vintage</b>: @vintage<br>
            <b>Price</b>: $@price<br>
            <b>Points</b>: @points<br>",
            size = 6,
            alpha = 0.3,
            color = Taster_Name,
            xlab = "Price",
            ylab = "Points") %>% 
  theme_title(text_font_size = "14pt")

Concluding Thoughts

Do I like rbokeh? Yes! The figures it produces look very nice and can be customized with a considerable degree of refinement, despite a few flaws (which might be due to my lack of experience with the package or to not browsing far enough back into the pages of a Google search). The interactivity and hover options let users easily draw insights beyond the default presentation that the coder chooses.

Do I prefer rbokeh to ggplot2? It depends, but largely no. ggplot2 surpasses rbokeh when it comes to customizations, extensions, and community support. If you need to craft static visualizations for a presentation, then I think ggplot2 is still the way to go (though rbokeh can suffice just fine). If, as the rbokeh CRAN documentation states, you want to create “web-based graphics” that are interactive, then rbokeh provides a nice option, without having to move to Shiny.

¹ I filtered the data, which has reviews of wines from many countries, to include only those reviews of French wines.
