Tutorial

Let's work through a simple example and analyze the Schrute dataset a little bit.

In [13]:
# using Cairo
using DataFrames
using DataFramesMeta
using Gadfly
using Schrute
using Statistics
using Query

We grab the dataset with the one function the package exports

In [14]:
df = Schrute.theOffice();
first(df, 7)
Out[14]:

7 rows × 12 columns (omitted printing of 6 columns)

indexseasonepisodeepisode_namedirectorwriter
Int64Int64Int64StringStringString
1111PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
2211PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
3311PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
4411PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
5511PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
6611PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels
7711PilotKen KwapisRicky Gervais;Stephen Merchant;Greg Daniels

Doesn't tell us much, let's look at the structure a bit

In [15]:
describe(df)
Out[15]:

12 rows × 8 columns (omitted printing of 5 columns)

variablemeanmin
SymbolUnion…Any
1index27565.51
2season5.538691
3episode12.5481
4episode_nameA Benihana Christmas (Parts 1&2)
5directorAlex Hardcastle
6writerAaron Shure
7character"Angela"
8text"- and the man in the moon. When you coming home Dad? I don't know when-"
9text_w_direction"- and the man in the moon. When you coming home Dad? I don't know when-"
10imdb_rating8.301936.7
11total_votes2174.161393
12air_date2005-03-24

So we have all the text from the entire series. Each record is a line, with various attributes such as character name, etc. There are attributes that are at the episode-level such as writer, episode, etc. which we can access by grouping functions, which we'll cover in a bit.

Now to ask some basic questions of the data.

What is the mean imdb score per season?

We all know the last few seasons were the worst, let's see if the collective wisdom of imdb supports that hunch.

In [16]:
one = @linq df |>
    by(:season, mean_score = mean(:imdb_rating)) |>
    select(:season, :mean_score) |>
    plot(x=:season, y=:mean_score, Geom.line, Guide.xticks(ticks=collect(1:9)), Guide.title("Mean IMDB Score by Season"))

one
# draw(PNG("images/1.png", 6inch, 4inch), one)
Out[16]:
season 1 2 3 4 5 6 7 8 9 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 6.00 6.05 6.10 6.15 6.20 6.25 6.30 6.35 6.40 6.45 6.50 6.55 6.60 6.65 6.70 6.75 6.80 6.85 6.90 6.95 7.00 7.05 7.10 7.15 7.20 7.25 7.30 7.35 7.40 7.45 7.50 7.55 7.60 7.65 7.70 7.75 7.80 7.85 7.90 7.95 8.00 8.05 8.10 8.15 8.20 8.25 8.30 8.35 8.40 8.45 8.50 8.55 8.60 8.65 8.70 8.75 8.80 8.85 8.90 8.95 9.00 9.05 9.10 9.15 9.20 9.25 9.30 9.35 9.40 9.45 9.50 9.55 9.60 9.65 9.70 9.75 9.80 9.85 9.90 9.95 10.00 10.05 10.10 10.15 10.20 10.25 10.30 10.35 10.40 10.45 10.50 6 8 10 12 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0 10.1 10.2 10.3 10.4 10.5 mean_score Mean IMDB Score by Season
In [17]:
two = plot(
    df,
    x=:index,
    y=:imdb_rating,
    Geom.histogram2d(xbincount=50, ybincount=20),
    Guide.xlabel("Episode")
)

two

# draw(PNG("images/2.png", 6inch, 4inch), two)
Out[17]:
Episode -7.0×10⁴ -6.0×10⁴ -5.0×10⁴ -4.0×10⁴ -3.0×10⁴ -2.0×10⁴ -1.0×10⁴ 0 1.0×10⁴ 2.0×10⁴ 3.0×10⁴ 4.0×10⁴ 5.0×10⁴ 6.0×10⁴ 7.0×10⁴ 8.0×10⁴ 9.0×10⁴ 1.0×10⁵ 1.1×10⁵ 1.2×10⁵ 1.3×10⁵ -6.00×10⁴ -5.80×10⁴ -5.60×10⁴ -5.40×10⁴ -5.20×10⁴ -5.00×10⁴ -4.80×10⁴ -4.60×10⁴ -4.40×10⁴ -4.20×10⁴ -4.00×10⁴ -3.80×10⁴ -3.60×10⁴ -3.40×10⁴ -3.20×10⁴ -3.00×10⁴ -2.80×10⁴ -2.60×10⁴ -2.40×10⁴ -2.20×10⁴ -2.00×10⁴ -1.80×10⁴ -1.60×10⁴ -1.40×10⁴ -1.20×10⁴ -1.00×10⁴ -8.00×10³ -6.00×10³ -4.00×10³ -2.00×10³ 0 2.00×10³ 4.00×10³ 6.00×10³ 8.00×10³ 1.00×10⁴ 1.20×10⁴ 1.40×10⁴ 1.60×10⁴ 1.80×10⁴ 2.00×10⁴ 2.20×10⁴ 2.40×10⁴ 2.60×10⁴ 2.80×10⁴ 3.00×10⁴ 3.20×10⁴ 3.40×10⁴ 3.60×10⁴ 3.80×10⁴ 4.00×10⁴ 4.20×10⁴ 4.40×10⁴ 4.60×10⁴ 4.80×10⁴ 5.00×10⁴ 5.20×10⁴ 5.40×10⁴ 5.60×10⁴ 5.80×10⁴ 6.00×10⁴ 6.20×10⁴ 6.40×10⁴ 6.60×10⁴ 6.80×10⁴ 7.00×10⁴ 7.20×10⁴ 7.40×10⁴ 7.60×10⁴ 7.80×10⁴ 8.00×10⁴ 8.20×10⁴ 8.40×10⁴ 8.60×10⁴ 8.80×10⁴ 9.00×10⁴ 9.20×10⁴ 9.40×10⁴ 9.60×10⁴ 9.80×10⁴ 1.00×10⁵ 1.02×10⁵ 1.04×10⁵ 1.06×10⁵ 1.08×10⁵ 1.10×10⁵ 1.12×10⁵ 1.14×10⁵ 1.16×10⁵ 1.18×10⁵ 1.20×10⁵ -1×10⁵ 0 1×10⁵ 2×10⁵ -6.00×10⁴ -5.50×10⁴ -5.00×10⁴ -4.50×10⁴ -4.00×10⁴ -3.50×10⁴ -3.00×10⁴ -2.50×10⁴ -2.00×10⁴ -1.50×10⁴ -1.00×10⁴ -5.00×10³ 0 5.00×10³ 1.00×10⁴ 1.50×10⁴ 2.00×10⁴ 2.50×10⁴ 3.00×10⁴ 3.50×10⁴ 4.00×10⁴ 4.50×10⁴ 5.00×10⁴ 5.50×10⁴ 6.00×10⁴ 6.50×10⁴ 7.00×10⁴ 7.50×10⁴ 8.00×10⁴ 8.50×10⁴ 9.00×10⁴ 9.50×10⁴ 1.00×10⁵ 1.05×10⁵ 1.10×10⁵ 1.15×10⁵ 1.20×10⁵ 1000 200 400 600 1 800 Count h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 11.2 11.4 11.6 11.8 12.0 12.2 12.4 12.6 12.8 13.0 13.2 13.4 13.6 13.8 14.0 0 5 10 15 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 imdb_rating

Yep, that checks out!

Who has the most lines in the series?

In [18]:
foo = by(df, :character,  nrow)[1:8,:]
sort!(foo, :x1, rev = true)
rename!(foo, Dict(:x1 => "Number of Lines"))

mytheme = Theme(bar_spacing=25pt);

three = plot(foo,
     x=:character,
     y=:"Number of Lines",
     Geom.bar,
     Guide.yticks(ticks=[2000,4000,6000,8000,10000,12000]),
     Guide.title("Lines by Character"),
     mytheme
)

three
 
# draw(PNG("images/3.png", 6inch, 4inch), three)
Out[18]:
character Michael Dwight Jim Pam Phyllis Jan Todd Packer Michel h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 2000 4000 6000 8000 10000 12000 Number of Lines Lines by Character

That's what she said

Who said this the most? It's obvious Michael Scarn but let's see for sure

In [19]:
twss = []
for line in df[:,:text]
    # println(line)
    try
        foo = occursin("what she said", line)
        # println(foo)
        push!(twss, string(foo))
    catch
        push!(twss, "false")
        continue
    end
end

df[:,:twss] = twss

twss_df = @linq df |>
    where(:twss .== "true") 

twss_toplot =  by(twss_df, :character, nrow)
rename!(twss_toplot, Dict(:x1 => "Occurrences"))

four = plot(twss_toplot,
     x=:character,
     y=:Occurrences,
     Geom.bar,
     Guide.yticks(ticks=[5,10,15,20,25]),
     Guide.title("That's What She Said by Character"),
     mytheme
)

four
# draw(PNG("images/4.png", 6inch, 4inch), four)
Out[19]:
character Michael Dwight Jan Jim Pam David Holly Creed h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 5 10 15 20 25 Occurrences That's What She Said by Character

How are the imdb ratings distributed across seasons?

In [20]:
plot(df,
    x=:season,
    y=:imdb_rating,
    Geom.boxplot,
    Guide.xticks(ticks=collect(1:9)),
    Guide.title("IMDB Rating Distribution by Season")
)
Out[20]:
season 1 2 3 4 5 6 7 8 9