Categories
#rstats

Advent of Code 2017 in #rstats: Day 5

This went smoothly.  There were no tricky parts, and Part 2 was a simple extension of Part 1.  No packages were needed to solve it (I used testthat to check against the test input).

This was a textbook case for a while loop, which I rarely use in R.

steps_to_leave <- function(vec, allow_decreasing_offsets = FALSE){
  index <- 1
  steps <- 0
  
  while(!is.na(vec[index])){
    steps <- steps + 1  
    starting_position <- index
    index <- starting_position + vec[index]

    # increase or decrease original offset, updated for part 2
    if(!allow_decreasing_offsets || vec[starting_position] < 3){
      vec[starting_position] <- vec[starting_position] + 1
    } else {
      vec[starting_position] <- vec[starting_position] - 1
    }
  }
  steps
}

# Part 1
test_vec <- c(0, 3, 0, 1, -3)
testthat::expect_equal(steps_to_leave(test_vec, FALSE), 5)
dat <- scan("05_1_dat.txt", numeric(), quote = "")
steps_to_leave(dat) # 364539

# Part 2
testthat::expect_equal(steps_to_leave(test_vec, TRUE), 10)
steps_to_leave(dat, TRUE) # 27477714

 

Categories
#rstats

Advent of Code 2017 in #rstats: Day 4

Whew, much easier than Day 3.  This would probably be a great day to learn the tidytext package, but I’ll be faster just muddling through with my current tools.

I wrote a function that compares the number of input tokens to the number of unique() input tokens.  Then for Part 2, I added an argument controlling whether anagrams are permitted; if not, each token is first sorted alphabetically into an alphagram (a term borrowed from competitive Scrabble).

Parts 1 & 2 together

I borrowed the string sorting algorithm from StackOverflow – I cited it below and upvoted as well 🙂

library(pacman)
p_load(dplyr, stringr, testthat)

check_valid <- function(string, anagrams_okay){
  subs <- unlist(str_split(string, " "))
  if(!anagrams_okay){
    subs <- unlist(lapply(subs, make_alphagram)) # for part 2
  }
  length(unique(subs)) == length(subs)
}

# thanks StackOverflow! https://stackoverflow.com/questions/5904797/how-to-sort-letters-in-a-string

make_alphagram <- function(x){
  paste(sort(unlist(strsplit(x, ""))), collapse = "")
}

Now it’s just a matter of testing & running:

# Tests
expect_equal(check_valid("aa bb cc dd ee aa", anagrams_okay = TRUE), FALSE)
expect_equal(check_valid("ba bb cc dd ee ab", anagrams_okay = TRUE), TRUE)
expect_equal(check_valid("ba bb cc dd ee ab", anagrams_okay = FALSE), FALSE)

# Part 1
raw <- read.delim("04_1_dat.txt", header = FALSE, stringsAsFactors = FALSE)[[1]] # maybe an inelegant way to read in as a vector...
lapply(raw, check_valid, anagrams_okay = TRUE) %>% # anagrams are allowed in Part 1
  unlist %>%
  sum

# Part 2
lapply(raw, check_valid, anagrams_okay = FALSE) %>% unlist %>% sum

Categories
#rstats

Advent of Code 2017 in #rstats: Day 2

After reading this puzzle, I was excited and concerned to see “spreadsheet”:

  1. Excited: R is good at rectangular numeric data, and I work with a lot of spreadsheets
  2. Concerned: Will I be starting with a single long string?  How do I get that into rectangular form?

After considering trying to chop up one long string, I copy-pasted my input to a text file and read.delim() worked on the first try.

Part 1 was simple; Part 2 was tricky.  The idea of row-wise iteration got into my head, and I spent a lot of time with apply and lapply… I still hold a grudge against them from battles as a new R user, but having (mostly) bent them to my will, I turned to them here.  Maybe I should spend more time with the purrr package (I’ve sketched a purrr version after the Part 1 code below).

Part 1

Straightforward:

library(dplyr)
library(testthat)

ex <- data.frame(
  a = c(5, 7, 2),
  b = c(1,5,4),
  c = c(9,3,6),
  d = c(5, NA, 8)
)

checksum <- function(x){
  row_max <- apply(x, 1, max, na.rm = TRUE)
  row_min <- apply(x, 1, min, na.rm = TRUE)
  sum(row_max - row_min)
}

expect_equal(checksum(ex), 18) # works!
my_input <- read.delim("02_1_dat.txt", header = FALSE)
checksum(my_input)
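
For comparison, a purrr-flavored version of the same row-wise logic might look like this (just a sketch, assuming purrr is installed; pmap_dbl treats each row of the data frame as one set of arguments):

library(purrr)

checksum_purrr <- function(x){
  # each row's values arrive via ..., so collect them with c() and take max - min
  pmap_dbl(x, ~ max(c(...), na.rm = TRUE) - min(c(...), na.rm = TRUE)) %>%
    sum()
}

expect_equal(checksum_purrr(ex), 18)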

Part 2

Got the idea quickly, but executing took a bit longer.  I decided right away that I would:

  1. Write a function that tackles one row (as a vector) at a time
  2. Have that function divide every value in the vector by every other value
  3. Pick out the quotient that is a whole number other than 1, along with its reciprocal, and divide the original values at those positions
  4. Run this function row-wise on the input

Actually building up that function took ~20 minutes.  I felt confident the approach would work so didn’t stop to consider anything else… I’m excited to see other solutions after I publish this, as maybe there’s something elegant I missed.

Along the way I learned: is.integer doesn’t do what I thought it would.  Come on, base R.
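
For reference, is.integer() reports how a value is stored, not whether it is a whole number:

is.integer(2)  # FALSE: 2 is stored as a double
is.integer(2L) # TRUE: the L suffix creates an integer
2 %% 1 == 0    # TRUE: the modulo check used below works regardless of storage type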

I don’t work with matrices much (not a mathematician, don’t have big data) and was somewhat vexed by the matrix -> data.frame conversion.

# Will apply something row-wise through the data.frame
# Need to get the dividend and divisor

# example data for part 2:
ex2 <- data.frame(
  a = c(5, 9, 3),
  b = c(9,4,8),
  c = c(2,7,6),
  d = c(8, 3, 5)
)

# start with the top row to play with
x <- ex2[1, ]

# I'll create a square matrix with every number in the vector divided by every number (including itself)
# I want the positions of the entries holding the sole whole-number quotient (other than 1) and its reciprocal.
# Annoyingly, is.integer() doesn't do what I thought it would, so I need to make my own function: https://stackoverflow.com/q/3476782/4470365
is_integer <- function(x) {
  (x %% 1) == 0
}

is_recip_integer <- function(x){
  ((1/x) %% 1) == 0
}

get_quotient <- function(x){
  # create a square matrix dividing each entry by every entry (including itself)
  divided <- sapply(x, function(a) a/x) %>%
    unlist %>%
    matrix(nrow = sqrt(length(.))) %>%
    as.data.frame() 
  
  divided[divided == 1] <- NA # don't want the diagonal values
  
  # Get the position where the vector contains an integer
  bigger_index <- which(
    lapply(divided, is_integer) %>%
      lapply(sum, na.rm = TRUE) %>%
      unlist(.) == 1) # I drafted this function as nested parentheses,
                      # converted to pipes for blogging... it looks weird.
  
  # Get the position where the vector contains the reciprocal of an integer
  smaller_index <- which(
    lapply(divided, is_recip_integer) %>%
      lapply(sum, na.rm = TRUE) %>%
      unlist(.) == 1)
  
  # Compute quotient of values at those indices
  x[bigger_index] / x[smaller_index]
  
}

# Test
expect_equal(
  # apply the function rowwise and sum
  apply(ex2, 1, get_quotient) %>% sum, 
  9
)

# Get part 2 answer:
apply(my_input, 1, get_quotient) %>% sum

 

Categories
#rstats

Advent of Code 2017 in #rstats: Day 1

Today’s puzzle is a good example of thinking “R-ishly”.   In R, I find it easier to compare vectors than to compare strings, and easier to work with vectorized functions than to loop.  So my code starts by splitting the input into a vector.

Part 1:

Split the input into a vector, add the first value to the end, and use dplyr’s lead() function to compare each value to the following value:

library(stringr)
library(dplyr)

target <- "112345567891" # should yield 1 + 5 + 1 = 7
target <- str_split(target, "") %>% unlist %>% as.numeric() # split the input string into a vector
target <- c(target, target[1]) # add the first value to the end, to account for wrap-around
sum(target[target == lead(target)], na.rm = TRUE) # sum every number that matches the number following it

Part 2:

This challenge is simpler, in my opinion.  You can halve the vector, then compare the first half to the second half, summing the matches (don’t forget to double that result).

target <- "123425" # should match both 2s for a total of 4
target <- str_split(target, "") %>% unlist %>% as.numeric()
first_half <- target[1:(length(target)/2)]
second_half <- target[-c(1:(length(target)/2))]
sum(first_half[first_half == second_half], na.rm = TRUE) * 2

Categories
#rstats Data analysis

How to Teach Yourself R

(Or, “how to teach professionals to teach themselves R”).

Background: I taught myself R in 2014 from public web resources, and since then have steered several cohorts of data analysts at my organization through various R curricula, adapting based on their feedback.

This is geared toward people teaching themselves R outside of graduate school (I perceive graduate students to have more built-in applications and more time for learning, though I don’t speak from experience).  I say “students” below but I am referring to professionals.  This advice assumes little or no programming experience in other languages, e.g., people making the shift from Excel to R (I maintain that Excel is one of R’s chief competitors).  If you already work in, say, Stata, you may face fewer frustrations (and might consider DataCamp’s modules geared specifically to folks in your situation).

I’ve tried combinations of Coursera’s Data Science Specialization, DataCamp’s R courses, and the “R for Data Science” textbook.  Here’s what I’ve learned about learning and teaching R and what I recommend.

I see three big things that will help you learn R:

  1. A problem you really want to solve
  2. A self-study resource
  3. A coach/community to help you

Categories
#rstats Data analysis

Can a Twitter bot increase voter turnout?

Summary: in 2015 I created a Twitter bot, @AnnArborVotes (code on GitHub).  (2018 Sam says: after this project ceased, I gave the Twitter handle to local civics hero Mary Morgan at A2CivCity.)  I searched Twitter for 52,000 unique voter names, matching names from the Ann Arbor, MI voter rolls to Twitter accounts based nearby.  The bot then tweeted messages to a randomly selected half of those 2,091 matched individuals, encouraging them to vote in a local primary election that ordinarily has very low turnout.

I then examined who actually voted (a matter of public record).  There was no overall difference between the treatment and control groups.  I did observe a promising difference in voting rates when looking only at active Twitter users, i.e., those who had tweeted in the month before I visited their profile.  However, these active users comprised only 7% of my matched voters, and the difference in this small subgroup was not statistically significant (n = 150, voting rates of 23% vs. 15%, p = 0.28).
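
(For the curious: assuming the 150 active users split roughly evenly between treatment and control, and rounding the reported rates to whole voters, say 17 of 75 vs. 11 of 75, the comparison could be run as a simple two-proportion test in R.  These counts are my reconstruction for illustration, not the original data.)

prop.test(x = c(17, 11), n = c(75, 75)) # hypothetical counts reconstructed from the reported rates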

I gave a talk at Nerd Nite Ann Arbor summarizing the experiment for a lay audience (it was at a bar and meant to be entertainment):

This video is hosted by the amazing Ann Arbor District Library – here is their page with multiple formats of this video and a summary of the talk.  Here are the slides from the talk (PDF), but they’ll make more sense with the video’s voiceover.

The full write-up:

I love the R programming language (#rstats) and wanted a side project.  I’d been curious about Twitter bots.  And I’m vexed by how low voter turnout is in local elections.  Thus, this experiment.

Categories
#rstats Data analysis survivor pool

Calculating likelihood of X% of entrants advancing in an NFL Survivor Pool

or: Yes, Week 2 of the 2015 NFL season probably was the toughest week for a survivor pool, ever.

Week 2 of the 2015 NFL season was rife with upsets, with 9 of 16 underdogs winning their games.  This wreaked havoc on survivor pools (aka eliminator pools), where the object is to pick a single team to win each week.  The six most popular teams (according to Yahoo! sports) all lost:

[Image: Yahoo! survivor pool pick distribution for week 2 of the 2015 NFL season]

(image from Yahoo! Sports, backed up here as it looks like the URL will not be stable into next year)

About 4.6% of Yahoo! participants survived the week (I looked only at the top 11 picks due to data availability, see the GitHub file below for more details).  This week left me wondering: was this the greatest % of survivor pool entrants to lose in a single week, ever?  And what were the odds of this happening going into this week?

I wrote some quick code to run a million simulations of the 2nd week of the 2015 NFL season (available here on GitHub).
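
For readers who don’t want to click through, the shape of the simulation is roughly this (a sketch only; the win probabilities and pick shares below are made-up placeholders, not the actual Vegas odds or Yahoo! pick data used in the GitHub code):

# Illustrative placeholders only; not the real week 2 numbers
win_prob   <- c(0.80, 0.75, 0.72, 0.70, 0.68, 0.65)  # hypothetical favorites' win probabilities
pick_share <- c(0.50, 0.15, 0.12, 0.10, 0.08, 0.05)  # hypothetical share of entrants on each favorite

simulate_week <- function(){
  favorite_won <- runif(length(win_prob)) < win_prob  # simulate each game once
  sum(pick_share[favorite_won])                       # fraction of entrants who survive
}

survival_rates <- replicate(1e5, simulate_week())
mean(survival_rates <= 0.046)  # how often is the week at least as brutal as the real one?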

Results

Given the projected win probabilities (based on Vegas odds) and the pick distributions, only 684 of the 1,000,000 simulations yielded a win rate below the 4.6% actual figure.  Thus the likelihood that only 4.6% of entrants would make it through the week was 0.0684%, less than a tenth of one percent.  Or to put it another way, this event had a 1-in-1,462 chance of occurring.
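
The arithmetic behind those two figures:

684 / 1e6        # 0.000684, i.e., 0.0684%
1 / (684 / 1e6)  # roughly 1462, i.e., a 1-in-1,462 chance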

Here are the results of the simulation:

[Image: simulation results, showing the distribution of surviving-entrant percentages]

  1. Blue line: median expected result, 80.6% winners
  2. Yellow line: 1st percentile result, 13.8% winners (to give a sense of how rare this week’s result was)
  3. Red line: actual result, 4.6% winners

So was it the greatest week for survivor pool carnage ever?  Probably.  You might never see a week like it again in your lifetime.

P.S. This distribution is pretty cool, with the sudden drop-off and gradual climb starting at x = 0.50.  This is caused by 50% of the pool picking the Saints, the most likely team to win.  I wouldn’t say this is a bimodal distribution, exactly – is there a term for this?