
Advent of Code 2017 in #rstats: Day 6

The solution was conceptually simple but harder to implement than I expected.  I learned some lessons today:

  • Read the instructions carefully.  I built the balancing function balance_banks to return how many distributions it made of the maximum value, then removed that code when I realized that fully distributing a bank of N values counts as one iteration, not N iterations.
  • Related: Up until now, Git has felt unnecessary for these challenges, and I’ve been too lazy to use it (though it would do wonders for my GitHub contribution graph).  Today I got nervous deleting and reworking code for Part 1 … would I need it for the not-yet-known Part 2?
  • I built my balance_banks function inside of my go_til_repeat function (because I thought it would help with passing counter information, which I did not need).  When editing and then testing, I kept failing to reload both into memory: I’d update one, reload only the other, and then call the stale one.  I don’t work much with nested functions of any complexity; TIL to reload both while programming and testing them.

Today was the first day that run-time performance became an issue.  My outputs for each part were correct on the first try, but each took a while (3 minutes?) to run.  Suspicious that my code was stuck in an infinite loop, I never let the first call finish: I broke the loop manually and added a counter that prints every 1,000 rebalances.

But it was just slow.  To check for completion (a repeated bank configuration), I switched to the double duplicated() approach you see below instead of janitor::get_dupes(), which is easier on the brain but relatively expensive.
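(A quick aside that isn’t part of the solution code: here’s the double duplicated() idiom on a toy data.frame.  duplicated() alone flags only the later copies, while OR-ing it with the fromLast = TRUE version flags every row involved in a repeat.)

toy <- data.frame(a = c(1, 2, 1), b = c(3, 4, 3))
duplicated(toy)                                            # FALSE FALSE TRUE - only the later copy
duplicated(toy) | duplicated(toy, fromLast = TRUE)         # TRUE FALSE TRUE - both copies flagged
which(duplicated(toy) | duplicated(toy, fromLast = TRUE))  # 1 3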

Lastly, when I thought I had to return multiple outputs from balance_banks, I had it return them in a list.  I removed the increment-counter code and the function now returns just one result (the balanced vector), but I left the list syntax in place since it worked.  That’s why you see the $output in last_result <- balance_banks(last_result$output), for instance.

Adding Part 2 onto Part 1 was trivial.  I’d seen Jenny Bryan use diff(range()) in a solution to an earlier puzzle and was stoked to reuse it in Part 2.
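(For anyone who hasn’t seen it, diff(range(x)) is just a terse max-minus-min; a one-liner, not part of the solution code:)

diff(range(c(4, 9)))  # 5, i.e., max(c(4, 9)) - min(c(4, 9))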

I resorted to rbind instead of my beloved dplyr::bind_rows because my data.frames don’t have column names.  I remain tickled by situations where I work with data.frames in R while ignoring column names.
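A toy illustration of the difference (not from the solution code; and, if I recall dplyr’s behavior correctly, bind_rows() refuses unnamed vectors):

no_names <- data.frame(matrix(ncol = 3, nrow = 0))  # same pattern as the results data.frame below
rbind(no_names, c(5, 1, 10))                        # works: the vector is appended as a row
# dplyr::bind_rows(no_names, c(5, 1, 10))           # errors: bind_rows() wants data frames or
                                                    # *named* atomic vectors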

Parts 1 and 2

library(testthat)

# Get the index of the largest bank
bank_to_split_loc <- function(x) {
  which(x == max(x))[1] # subset to break ties
}
expect_equal(bank_to_split_loc(c(0, 2, 7, 0)), 3)

# Takes a vector and balances the banks by distributing the largest one over the rest
balance_banks <- function(dat){
  loop_amount <- dat[bank_to_split_loc(dat)]
  current_bank <- bank_to_split_loc(dat) + 1
  dat[bank_to_split_loc(dat)] <- 0
  for(i in seq_len(loop_amount)){
    if(current_bank > length(dat)) { current_bank <- 1 } # wrap around the boundary
    dat[current_bank] <- dat[current_bank] + 1
    current_bank <- current_bank + 1
  }
  list(output = dat) # this is unnecessary, but when I misunderstood the problem I was also passing a counter out of this loop
}
expect_equal(balance_banks(c(0, 2, 7, 0))$output, c(2, 4, 1, 2))

# Takes vector of banks, keeps rebalancing them and checking the resulting vectors until a match is found
# I later added the part argument so the same function returns the answer for part 1 or part 2
go_til_repeat <- function(input, part = 1){
 
  # Initialize results data.frame
  results <- data.frame(matrix(ncol = length(input), nrow = 0), stringsAsFactors = FALSE)
  results <- rbind(results, input)
  
  last_result <- balance_banks(input)
  results <- rbind(results, last_result$output)
  
  # Print every 1000 rebalances, to confirm making progress
  # I thought the max bank value would decrease over time; it doesn't really, but still works as a progress indicator
  print_counter <- 0
  while(sum(duplicated(results)) == 0){
    if(print_counter %% 1000 == 0 & print_counter > 0){ print(max(last_result$output)) }
    print_counter <- print_counter + 1
    last_result <- balance_banks(last_result$output)
    results <- rbind(results, last_result$output)
  }

  if(part == 1){ # how many total cycles
    nrow(results)-1 
  } else if (part == 2){ # how many cycles between duplicated rows
    # courtesy of StackOverflow: https://stackoverflow.com/questions/12495345/find-indices-of-duplicated-rows
    diff(range(which(duplicated(results) | duplicated(results, fromLast = TRUE))))
  }

}

# These commands will take several minutes to run
day_06_input <- c(5, 1, 10, 0, 1, 7, 13, 14, 3, 12, 8, 10, 7, 12, 0, 6)
go_til_repeat(day_06_input, 1)
go_til_repeat(day_06_input, 2)

 


Advent of Code 2017 in #rstats: Day 5

This went smoothly.  No tricky parts, and Part 2 was a simple extension of Part 1.  No packages were needed to solve it (I used testthat for checking against the test input).

This was a textbook case for a while loop, which I rarely use in R.

steps_to_leave <- function(vec, allow_decreasing_offsets = FALSE){
  index <- 1
  steps <- 0
  
  while(!is.na(vec[index])){
    steps <- steps + 1  
    starting_position <- index
    index <- starting_position + vec[index]

    # increase or decrease original offset, updated for part 2
    if(!allow_decreasing_offsets | vec[starting_position] < 3){
      vec[starting_position] <- vec[starting_position] + 1
    } else {
      vec[starting_position] <- vec[starting_position] - 1
    }
  }
  steps
}

# Part 1
test_vec <- c(0, 3, 0, 1, -3)
testthat::expect_equal(steps_to_leave(test_vec, FALSE), 5)
dat <- scan("05_1_dat.txt", numeric(), quote = "")
steps_to_leave(dat) # 364539

# Part 2
testthat::expect_equal(steps_to_leave(test_vec, TRUE), 10)
steps_to_leave(dat, TRUE) # 27477714

 


Advent of Code 2017 in #rstats: Day 4

Whew, much easier than day 3.  Probably would be a great day to learn the tidytext package but I’ll be faster just muddling through with my current tools.

I wrote a function that compares the length of the input tokens to the length of the unique() input tokens.  Then for Part 2, I added an argument for whether anagrams should be permitted; if not, each token is first sorted alphabetically into an alphagram (a word borrowed from competitive Scrabble).

Parts 1 & 2 together

I borrowed the string sorting algorithm from StackOverflow – I cited it below and upvoted as well 🙂

library(pacman)
p_load(dplyr, stringr, testthat)

check_valid <- function(string, anagrams_okay){
  subs <- unlist(str_split(string, " "))
  if(!anagrams_okay){
    subs <- unlist(lapply(subs, make_alphagram)) # for part 2
  }
  length(unique(subs)) == length(subs)
}

# thanks StackOverflow! https://stackoverflow.com/questions/5904797/how-to-sort-letters-in-a-string

make_alphagram <- function(x){
  paste(sort(unlist(strsplit(x, ""))), collapse = "")
}

Now it’s just a matter of testing & running:

# Tests
expect_equal(check_valid("aa bb cc dd ee aa", anagrams_okay = TRUE), FALSE)
expect_equal(check_valid("ba bb cc dd ee ab", anagrams_okay = TRUE), TRUE)
expect_equal(check_valid("ba bb cc dd ee ab", anagrams_okay = FALSE), FALSE)

# Part 1
raw <- read.delim("04_1_dat.txt", header = FALSE, stringsAsFactors = FALSE)[[1]] # maybe an inelegant way to read in as a vector...
lapply(raw, check_valid) %>%
  unlist %>%
  sum

# Part 2
lapply(raw, check_valid, anagrams_okay = FALSE) %>% unlist %>% sum

Advent of Code 2017 in #rstats: Day 2

After reading this puzzle, I was excited and concerned to see “spreadsheet”:

  1. Excited: R is good at rectangular numeric data, and I work with a lot of spreadsheets
  2. Concerned: Will I be starting with a single long string?  How do I get that into rectangular form?

After considering trying to chop up one long string, I copy-pasted my input to a text file and read.delim() worked on the first try.

Part 1 was simple, Part 2 was tricky.  The idea of row-wise iteration got in my head and I spent a lot of time with apply and lapply… I still hold a grudge against them from battles as a new R user, but having (mostly) bent them to my will, I turned to them here.  Maybe I should spend more time with the purrr package (a quick sketch of that idea follows Part 1 below).

Part 1

Straightforward:

library(dplyr)
library(testthat)

ex <- data.frame(
  a = c(5, 7, 2),
  b = c(1,5,4),
  c = c(9,3,6),
  d = c(5, NA, 8)
)

checksum <- function(x){
  row_max <- apply(x, 1, max, na.rm = TRUE)
  row_min <- apply(x, 1, min, na.rm = TRUE)
  sum(row_max - row_min)
}

expect_equal(checksum(ex), 18) # works!
my_input <- read.delim("02_1_dat.txt", header = FALSE)
checksum(my_input)
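Since I grumbled about apply above: a rough sketch (not what I actually ran) of the same row-wise checksum using purrr’s pmap_dbl(), which walks a data.frame row by row:

library(purrr)

checksum_purrr <- function(x){
  # each row's values arrive via ..., so c(...) rebuilds the row as a vector
  pmap_dbl(x, function(...) diff(range(c(...), na.rm = TRUE))) %>% sum()
}

expect_equal(checksum_purrr(ex), 18)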

Part 2

Got the idea quickly, but executing took a bit longer.  I decided right away that I would:

  1. Work on a function that tackled a row (as a vector) at a time
  2. The function would divide every value in the vector by every other value in the vector
  3. It would then pick out the integer that wasn’t 1, as well as its reciprocal, and divide them
  4. Then run this function row-wise on the input

Actually building up that function took ~20 minutes.  I felt confident the approach would work so didn’t stop to consider anything else… I’m excited to see other solutions after I publish this, as maybe there’s something elegant I missed.

Along the way I learned: is.integer doesn’t do what I thought it would.  Come on, base R.

I don’t work with matrices much (not a mathematician, don’t have big data) and was somewhat vexed by the matrix -> data.frame conversion.

# Will apply something row-wise through the data.frame
# Need to get the dividend and divisor

# example data for part 2:
ex2 <- data.frame(
  a = c(5, 9, 3),
  b = c(9,4,8),
  c = c(2,7,6),
  d = c(8, 3, 5)
)

# start with the top row to play with
x <- ex2[1, ]

# I'll create an x by x matrix with every number in a vector divided by itself
# I want the numbers corresponding to matrix entries containing the sole non-1 integer and its reciprocal.
# Annoying, is.integer() doesn't do what I thought it would, so need to make my own function: https://stackoverflow.com/q/3476782/4470365
is_integer <- function(x) {
  (x %% 1) == 0
}

is_recip_integer <- function(x){
  ((1/x) %% 1) == 0
}

get_quotient <- function(x){
  # create x by x matrix dividing each entry by itself
  divided <- sapply(x, function(a) a/x) %>%
    unlist %>%
    matrix(nrow = sqrt(length(.))) %>%
    as.data.frame() 
  
  divided[divided == 1] <- NA # don't want the diagonal values
  
  # Get the position where the vector contains an integer
  bigger_index <- which(
    lapply(divided, is_integer) %>%
      lapply(sum, na.rm = TRUE) %>%
      unlist(.) == 1) # I drafted this function as nested parentheses,
                      # converted to pipes for blogging... it looks weird.
  
  # Get the position where the vector contains the reciprocal of an integer
  smaller_index <- which(
    lapply(divided, is_recip_integer) %>%
      lapply(sum, na.rm = TRUE) %>%
      unlist(.) == 1)
  
  # Compute quotient of values at those indices
  x[bigger_index] / x[smaller_index]
  
}

# Test
expect_equal(
  # apply the function rowwise and sum
  apply(ex2, 1, get_quotient) %>% sum, 
  9
)

# Get part 2 answer:
apply(my_input, 1, get_quotient) %>% sum

 


Advent of Code 2017 in #rstats: Day 1

Today’s puzzle is a good example of thinking “R-ishly”.   In R, I find it easier to compare vectors than to compare strings, and easier to work with vectorized functions than to loop.  So my code starts by splitting the input into a vector.

Part 1:

Split the input into a vector, add the first value to the end, and use dplyr’s lead() function to compare each value to the following value:

library(stringr)
library(dplyr)

target <- "112345567891" # should yield 1 + 5 + 1 = 7
target <- str_split(target, "") %>% unlist %>% as.numeric() # split input string into a vector
target <- c(target, target[1]) # add the first value to the end, to account for wrap-around
sum(target[target == lead(target)], na.rm = TRUE) # sum every number that matches the number following it

Part 2:

This challenge is simpler, in my opinion.  You can halve the vector, then compare the first half to the second half, summing the matches (don’t forget to double that result).

target <- "123425" # should match both 2s for a total of 4
target <- str_split(target, "") %>% unlist %>% as.numeric()
first_half <- target[1:(length(target)/2)]
second_half <- target[-c(1:(length(target)/2))]
sum(first_half[first_half == second_half], na.rm = TRUE) * 2

How to Teach Yourself R

(Or, “how to teach professionals to teach themselves R”).

Background: I taught myself R in 2014 from public web resources, and since then have steered several cohorts of data analysts at my organization through various R curricula, adapting based on their feedback.

This is geared toward people teaching themselves R outside of graduate school (I perceive graduate students to have more built-in applications and more time for learning, though I don’t speak from experience).  I say “students” below but I am referring to professionals.  This advice assumes little or no programming experience in other languages, e.g., people making the shift from Excel to R (I maintain that Excel is one of R’s chief competitors).  If you already work in, say, Stata, you may face fewer frustrations (and might consider DataCamp’s modules geared specifically to folks in your situation).

I’ve tried combinations of Coursera’s Data Science Specialization, DataCamp’s R courses, and the “R for Data Science” textbook.  Here’s what I’ve learned about learning and teaching R and what I recommend.

I see three big things that will help you learn R:

  1. A problem you really want to solve
  2. A self-study resource
  3. A coach/community to help you

Can a Twitter bot increase voter turnout?

Summary: in 2015 I created a Twitter bot, @AnnArborVotes (code on GitHub).  (2018 Sam says: after this project ceased I gave the Twitter handle to local civics hero Mary Morgan at A2CivCity).  I searched Twitter for 52,000 unique voter names, matching names from the Ann Arbor, MI voter rolls to Twitter accounts based nearby.  The bot then tweeted messages to a randomly-selected half of those 2,091 matched individuals, encouraging them to vote in a local primary election that is ordinarily very low-turnout.

I then examined who actually voted (a matter of public record).  There was no overall difference between the treatment and control groups. I observed a promising difference in the voting rate when looking only at active Twitter users, i.e., those who had tweeted in the month before I visited their profile. These active users only comprised 7% of my matched voters, however, and the difference in this small subgroup was not statistically significant (n = 150, voting rates of 23% vs 15%, p = 0.28).
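I don’t spell out the test here, but as a rough illustration (with made-up counts chosen only to roughly match the 23% vs. 15% rates, not the real data), a two-sample proportion test shows why a gap this size in a group this small isn’t significant:

prop.test(x = c(17, 11), n = c(75, 75))  # ~23% vs. ~15% voting in two arms of ~75 each (illustrative counts)
# the p-value lands around 0.3 with these made-up numbers - in the same ballpark as the 0.28 reported above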

I gave a talk summarizing the experiment at Nerd Nite Ann Arbor that is accessible to laypeople (it was at a bar and meant to be entertainment):

This video is hosted by the amazing Ann Arbor District Library – here is their page with multiple formats of this video and a summary of the talk.  Here are the slides from the talk (PDF), but they’ll make more sense with the video’s voiceover.

The full write-up:

I love the R programming language (#rstats) and wanted a side project.  I’d been curious about Twitter bots.  And I’m vexed by how low voter turnout is in local elections.  Thus, this experiment.


Calculating likelihood of X% of entrants advancing in an NFL Survivor Pool

or: Yes, Week 2 of the 2015 NFL season probably was the toughest week for a survivor pool, ever.

Week 2 of the 2015 NFL season was rife with upsets, with 9 of 16 underdogs winning their games.  This wreaked havoc on survivor pools (aka eliminator pools), where the object is to pick a single team to win each week.  The six most popular teams (according to Yahoo! Sports) all lost:

[Image: Yahoo! Sports week 2 picks, 2015]

(image from Yahoo! Sports, backed up here as it looks like the URL will not be stable into next year)

About 4.6% of Yahoo! participants survived the week (I looked only at the top 11 picks due to data availability, see the GitHub file below for more details).  This week left me wondering: was this the greatest % of survivor pool entrants to lose in a single week, ever?  And what were the odds of this happening going into this week?

I wrote some quick code to run a million simulations of the 2nd week of the 2015 NFL season (available here on GitHub).
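The real code is in that GitHub repo; as a rough sketch of the idea (the win probabilities and pick shares below are made up, not the actual inputs), each simulation flips a weighted coin for the popular teams and sums the pick share of the ones that won:

set.seed(1)
win_prob   <- c(0.78, 0.74, 0.70, 0.68, 0.66, 0.64)  # hypothetical Vegas-implied win probabilities
pick_share <- c(0.50, 0.15, 0.12, 0.10, 0.08, 0.05)  # hypothetical share of entrants on each team

n_sims <- 100000
surviving_share <- replicate(n_sims, {
  winners <- runif(length(win_prob)) < win_prob  # which popular picks actually win this simulation
  sum(pick_share[winners])                       # entrants whose team won survive the week
})

mean(surviving_share <= 0.046)  # estimated chance of a week at least as brutal as 4.6% survivors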

Results

Given the projected win probabilities (based on Vegas odds) and the pick distributions, only 684 of the 1,000,000 simulations yielded a win rate below the 4.6% actual figure.  Thus the likelihood that only 4.6% of entrants would make it through the week was 0.0684%, less than a tenth of one percent.  Or to put it another way, this event had a 1-in-1,462 chance of occurring.

Here are the results of the simulation:

[Image: simulation results]

  1. Blue line: median expected result, 80.6% winners
  2. Yellow line: 1st percentile result, 13.8% winners (to give you a sense of how rare a result this week was)
  3. Red line: actual result, 4.6% winners

So was it the greatest week for survivor pool carnage ever?  Probably.  You might never see a week like it again in your lifetime.

P.S. This distribution is pretty cool, with the sudden drop off and gradual climb starting at x = 0.50.  This is caused by 50% of the pool picking the Saints, the most likely team to win.  I wouldn’t say this is a bimodal distribution, exactly – is there a term for this?