Tutorial

Let's work through a simple example and analyze the Schrute dataset a little bit.

In [13]:
# using Cairo
using DataFrames
using DataFramesMeta
using Gadfly
using Schrute
using Statistics
using Query

We grab the dataset with the one function the package exports

In [14]:
df = Schrute.theOffice();
first(df, 7)
Out[14]:

7 rows × 12 columns (omitted printing of 6 columns)

indexseasonepisodeepisode_namedirectorwriter
Int64Int64Int64StringStringString
1111PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
2211PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
3311PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
4411PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
5511PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
6611PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
7711PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels

Doesn't tell us much, let's look at the structure a bit

In [15]:
describe(df)
Out[15]:

12 rows × 8 columns (omitted printing of 5 columns)

variablemeanmin
SymbolUnion…Any
1index27565.51
2season5.538691
3episode12.5481
4episode_nameA Benihana Christmas (Parts 1&2)
5directorAlex Hardcastle
6writerAaron Shure
7character"Angela"
8text"- and the man in the moon. When you coming home Dad? I don't know when-"
9text_w_direction"- and the man in the moon. When you coming home Dad? I don't know when-"
10imdb_rating8.301936.7
11total_votes2174.161393
12air_date2005-03-24

So we have all the text from the entire series. Each record is a line, with various attributes such as character name, etc. There are attributes that are at the episode-level such as writer, episode, etc. which we can access by grouping functions, which we'll cover in a bit.

Now to ask some basic questions of the data.

What is the mean imdb score per season?

We all know the last few seasons were the worst, let's see if the collective wisdom of imdb supports that hunch.

In [16]:
one = @linq df |>
    by(:season, mean_score = mean(:imdb_rating)) |>
    select(:season, :mean_score) |>
    plot(x=:season, y=:mean_score, Geom.line, Guide.xticks(ticks=collect(1:9)), Guide.title("Mean IMDB Score by Season"))

one
# draw(PNG("images/1.png", 6inch, 4inch), one)
Out[16]:
season 1 2 3 4 5 6 7 8 9 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 6.00 6.05 6.10 6.15 6.20 6.25 6.30 6.35 6.40 6.45 6.50 6.55 6.60 6.65 6.70 6.75 6.80 6.85 6.90 6.95 7.00 7.05 7.10 7.15 7.20 7.25 7.30 7.35 7.40 7.45 7.50 7.55 7.60 7.65 7.70 7.75 7.80 7.85 7.90 7.95 8.00 8.05 8.10 8.15 8.20 8.25 8.30 8.35 8.40 8.45 8.50 8.55 8.60 8.65 8.70 8.75 8.80 8.85 8.90 8.95 9.00 9.05 9.10 9.15 9.20 9.25 9.30 9.35 9.40 9.45 9.50 9.55 9.60 9.65 9.70 9.75 9.80 9.85 9.90 9.95 10.00 10.05 10.10 10.15 10.20 10.25 10.30 10.35 10.40 10.45 10.50 6 8 10 12 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0 10.1 10.2 10.3 10.4 10.5 mean_score Mean IMDB Score by Season
In [17]:
two = plot(
    df,
    x=:index,
    y=:imdb_rating,
    Geom.histogram2d(xbincount=50, ybincount=20),
    Guide.xlabel("Episode")
)

two

# draw(PNG("images/2.png", 6inch, 4inch), two)
Out[17]:
Episode -7.0×10⁴ -6.0×10⁴ -5.0×10⁴ -4.0×10⁴ -3.0×10⁴ -2.0×10⁴ -1.0×10⁴ 0 1.0×10⁴ 2.0×10⁴ 3.0×10⁴ 4.0×10⁴ 5.0×10⁴ 6.0×10⁴ 7.0×10⁴ 8.0×10⁴ 9.0×10⁴ 1.0×10⁵ 1.1×10⁵ 1.2×10⁵ 1.3×10⁵ -6.00×10⁴ -5.80×10⁴ -5.60×10⁴ -5.40×10⁴ -5.20×10⁴ -5.00×10⁴ -4.80×10⁴ -4.60×10⁴ -4.40×10⁴ -4.20×10⁴ -4.00×10⁴ -3.80×10⁴ -3.60×10⁴ -3.40×10⁴ -3.20×10⁴ -3.00×10⁴ -2.80×10⁴ -2.60×10⁴ -2.40×10⁴ -2.20×10⁴ -2.00×10⁴ -1.80×10⁴ -1.60×10⁴ -1.40×10⁴ -1.20×10⁴ -1.00×10⁴ -8.00×10³ -6.00×10³ -4.00×10³ -2.00×10³ 0 2.00×10³ 4.00×10³ 6.00×10³ 8.00×10³ 1.00×10⁴ 1.20×10⁴ 1.40×10⁴ 1.60×10⁴ 1.80×10⁴ 2.00×10⁴ 2.20×10⁴ 2.40×10⁴ 2.60×10⁴ 2.80×10⁴ 3.00×10⁴ 3.20×10⁴ 3.40×10⁴ 3.60×10⁴ 3.80×10⁴ 4.00×10⁴ 4.20×10⁴ 4.40×10⁴ 4.60×10⁴ 4.80×10⁴ 5.00×10⁴ 5.20×10⁴ 5.40×10⁴ 5.60×10⁴ 5.80×10⁴ 6.00×10⁴ 6.20×10⁴ 6.40×10⁴ 6.60×10⁴ 6.80×10⁴ 7.00×10⁴ 7.20×10⁴ 7.40×10⁴ 7.60×10⁴ 7.80×10⁴ 8.00×10⁴ 8.20×10⁴ 8.40×10⁴ 8.60×10⁴ 8.80×10⁴ 9.00×10⁴ 9.20×10⁴ 9.40×10⁴ 9.60×10⁴ 9.80×10⁴ 1.00×10⁵ 1.02×10⁵ 1.04×10⁵ 1.06×10⁵ 1.08×10⁵ 1.10×10⁵ 1.12×10⁵ 1.14×10⁵ 1.16×10⁵ 1.18×10⁵ 1.20×10⁵ -1×10⁵ 0 1×10⁵ 2×10⁵ -6.00×10⁴ -5.50×10⁴ -5.00×10⁴ -4.50×10⁴ -4.00×10⁴ -3.50×10⁴ -3.00×10⁴ -2.50×10⁴ -2.00×10⁴ -1.50×10⁴ -1.00×10⁴ -5.00×10³ 0 5.00×10³ 1.00×10⁴ 1.50×10⁴ 2.00×10⁴ 2.50×10⁴ 3.00×10⁴ 3.50×10⁴ 4.00×10⁴ 4.50×10⁴ 5.00×10⁴ 5.50×10⁴ 6.00×10⁴ 6.50×10⁴ 7.00×10⁴ 7.50×10⁴ 8.00×10⁴ 8.50×10⁴ 9.00×10⁴ 9.50×10⁴ 1.00×10⁵ 1.05×10⁵ 1.10×10⁵ 1.15×10⁵ 1.20×10⁵ 1000 200 400 600 1 800 Count h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 11.2 11.4 11.6 11.8 12.0 12.2 12.4 12.6 12.8 13.0 13.2 13.4 13.6 13.8 14.0 0 5 10 15 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 imdb_rating

Yep, that checks out!

Who has the most lines in the series?

In [18]:
foo = by(df, :character,  nrow)[1:8,:]
sort!(foo, :x1, rev = true)
rename!(foo, Dict(:x1 => "Number of Lines"))

mytheme = Theme(bar_spacing=25pt);

three = plot(foo,
     x=:character,
     y=:"Number of Lines",
     Geom.bar,
     Guide.yticks(ticks=[2000,4000,6000,8000,10000,12000]),
     Guide.title("Lines by Character"),
     mytheme
)

three
 
# draw(PNG("images/3.png", 6inch, 4inch), three)
Out[18]:
character Michael Dwight Jim Pam Phyllis Jan Todd Packer Michel h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 2000 4000 6000 8000 10000 12000 Number of Lines Lines by Character

That's what she said

Who said this the most? It's obvious Michael Scarn but let's see for sure

In [19]:
twss = []
for line in df[:,:text]
    # println(line)
    try
        foo = occursin("what she said", line)
        # println(foo)
        push!(twss, string(foo))
    catch
        push!(twss, "false")
        continue
    end
end

df[:,:twss] = twss

twss_df = @linq df |>
    where(:twss .== "true") 

twss_toplot =  by(twss_df, :character, nrow)
rename!(twss_toplot, Dict(:x1 => "Occurrences"))

four = plot(twss_toplot,
     x=:character,
     y=:Occurrences,
     Geom.bar,
     Guide.yticks(ticks=[5,10,15,20,25]),
     Guide.title("That's What She Said by Character"),
     mytheme
)

four
# draw(PNG("images/4.png", 6inch, 4inch), four)
Out[19]:
character Michael Dwight Jan Jim Pam David Holly Creed h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 5 10 15 20 25 Occurrences That's What She Said by Character

How are the imdb ratings distributed across seasons?

In [20]:
plot(df,
    x=:season,
    y=:imdb_rating,
    Geom.boxplot,
    Guide.xticks(ticks=collect(1:9)),
    Guide.title("IMDB Rating Distribution by Season")
)
Out[20]:
season 1 2 3 4 5 6 7 8 9 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 11.2 11.4 11.6 11.8 12.0 12.2 12.4 12.6 12.8 13.0 13.2 13.4 13.6 13.8 14.0 0 5 10 15 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 imdb_rating IMDB Rating Distribution by Season

There's a few low-scoring outliers in seasons 6 and 8, I'm curious which episodes those were...

In [21]:
df_worst = @linq df |>
    # groupby([:episode_name, :season]) |>
    by([:episode_name, :season], score = mean(:imdb_rating))


df_worst = @linq df_worst |>
    groupby(:season) |>
    transform(min_sc = minimum(:score)) |>
    where(:min_sc .== :score) |>
    select(:season, :episode_name, :score)
Out[21]:

12 rows × 3 columns

seasonepisode_namescore
Int64StringFloat64
11Pilot7.6
22The Carpet7.9
33Grief Counseling8.0
43Diwali8.0
54Job Fair7.9
65Employee Transfer8.1
75Blood Drive8.1
86The Banker6.8
97Christening7.5
107Todd Packer7.5
118Get The Girl6.7
129Here Comes Treble7.1

Ahh, looks like "Get the girl" in season 8 wasn't great, as well as "The Banker" in season 6. If we had a way of quantifying "cringe", no doubt Scott's Tots would be at the top!

Schrute in python and R

The Schrute package is also available in R and Python