This is for posterity's sake but I hope it'll be fixed.
For eight days in late 2012 and at the beginning of 2013, cranlogs::cran_downloads() returns counts that are double or even triple what they should be. I'm fairly confident of this conclusion because my numbers come from directly downloading the logs from RStudio and counting the number of log entries.
The code for my analysis:
library(cranlogs)
library(packageRank)
start.date <- "2012-10-01"
end.date <- "2013-01-05"
# The expression below uses 'cranlogs' to compute the total number of
# downloads for all of CRAN on the dates above:
cranlogs.data <- cranlogs::cran_downloads(from = start.date, to = end.date)
# The code below uses 'packageRank' and the "raw" RStudio logs to compute
# the total number of downloads for all CRAN packages on the dates above.
# There are two functions to note: fixDate_2012(), which is part of
# 'packageRank' but is not exported (not in namespace) and
# packageRank::fetchCranLog().
# fixDate_2012() fixes mislabelled filenames/URLs and duplicate logs
fixDate_2012 <- function(date = "2012-12-31") {
  # inherits() is safer than comparing class(), which can have length > 1
  if (!inherits(date, "Date")) ymd <- as.Date(date)
  else ymd <- date
  if (format(ymd, "%Y") == "2012") {
    if (ymd %in% as.Date(c("2012-12-29", "2012-12-30", "2012-12-31"))) {
      stop("Log for ", ymd, " is missing/unavailable.", call. = FALSE)
    } else if (ymd >= as.Date("2012-10-13") && ymd <= as.Date("2012-12-28")) {
      ymd <- ymd + 3
    } else if (ymd %in% as.Date(c("2012-10-11", "2012-10-12"))) {
      if (identical(ymd, as.Date("2012-10-11"))) {
        ymd <- as.Date("2012-10-12")
      } else if (identical(ymd, as.Date("2012-10-12"))) {
        ymd <- as.Date("2012-10-14")
      }
    }
  }
  ymd
}
# packageRank::fetchCranLog(date, memoization = FALSE)
# retrieves logs by their "literal" or exact filename/URL
d <- seq(from = as.Date(start.date), to = as.Date(end.date), by = "day")
packageRank.data <- vapply(d, function(x) {
  tmp <- try(packageRank::fetchCranLog(fixDate_2012(x), TRUE), silent = TRUE)
  if (inherits(tmp, "try-error")) 0L
  else nrow(tmp[!is.na(tmp$package), ])
}, integer(1L))
packageRank.data <- data.frame(date = d, count = packageRank.data)
# Merge the two data frames by calendar date:
cran.data <- merge(cranlogs.data, packageRank.data, by = "date")
names(cran.data)[-1] <- c("cranlogs", "packageRank")
# Compute the ratio of counts of 'cranlogs' to 'packageRank'
cran.data$ratio <- cran.data$cranlogs / cran.data$packageRank
# If you take a look at `cran.data`, you'll see that the two methods
# generally give exactly the same results, except for
# 8 discrepancies or errors:
errors <- cran.data[cran.data$cranlogs != cran.data$packageRank, ]
# > errors
#          date cranlogs packageRank    ratio
# 6  2012-10-06    13630        6815 2.000000
# 7  2012-10-07       50          25 2.000000
# 8  2012-10-08      170          85 2.000000
# 11 2012-10-11      388         194 2.000000
# 87 2012-12-26    80738       26910 3.000297
# 88 2012-12-27    49007       24501 2.000204
# 89 2012-12-28    21959       10979 2.000091
# 93 2013-01-01    21822       10911 2.000000
The ratios for these discrepancies are generally whole numbers. This leads me to believe that there may be computational errors in 'cranlogs'.
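As a rough check on how close these ratios are to whole numbers, the sketch below re-enters the eight rows of the `errors` output by hand (so it runs without network access) and computes what is left of each 'cranlogs' count after subtracting the nearest whole multiple of the 'packageRank' count. The names `errors_tbl`, `multiple` and `remainder` are mine, for illustration only.

```r
# Rows copied by hand from the `errors` output above.
errors_tbl <- data.frame(
  date = as.Date(c("2012-10-06", "2012-10-07", "2012-10-08", "2012-10-11",
                   "2012-12-26", "2012-12-27", "2012-12-28", "2013-01-01")),
  cranlogs = c(13630L, 50L, 170L, 388L, 80738L, 49007L, 21959L, 21822L),
  packageRank = c(6815L, 25L, 85L, 194L, 26910L, 24501L, 10979L, 10911L)
)

# Nearest whole-number multiple of the 'packageRank' count.
errors_tbl$multiple <- round(errors_tbl$cranlogs / errors_tbl$packageRank)

# Downloads left unexplained by that multiple.
errors_tbl$remainder <- errors_tbl$cranlogs -
  errors_tbl$multiple * errors_tbl$packageRank

errors_tbl[, c("date", "multiple", "remainder")]
```

With these numbers the multiple is 2 on every date except 2012-12-26, where it is 3, and the remainder is zero everywhere except the three December dates, where it is 8, 5 and 1 downloads respectively.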
- I'm not sure what's going on with "2012-10-06".
- I believe the problems with "2012-10-07", "2012-10-08" and "2012-10-11" stem from the fact that the logs for those dates are actually duplicated in the RStudio logs.
Nominal Actual log in file/URL
2012-10-07 ----- 2012-10-07
2012-10-11 ----- 2012-10-07
2012-10-08 ----- 2012-10-08
2012-10-13 ----- 2012-10-08
2012-10-12 ----- 2012-10-11
2012-10-15 ----- 2012-10-11
This overcounting makes sense because, as you wrote in issue #54, you rely on the data in the files and not the filenames/URLs. By doing so, you may have ended up double counting.
- I haven't sorted out what's going on with the 4 remaining dates ("2012-12-26", "2012-12-27", "2012-12-28", "2013-01-01") but I'm guessing it has something to do with the fact that they surround the 3 missing/lost RStudio logs ("2012-12-29", "2012-12-30", "2012-12-31").
Note that the ratios for the three December dates are not whole numbers. However, I did a sanity check using the top six packages for each of the three days; they all returned whole number multiples. If useful, I can provide more details.
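Returning to the duplicated October logs, the suspected double counting can be made concrete with a toy sketch. This is not 'cranlogs' code. It assumes two hypothetical log files, named 2012-10-07 and 2012-10-11, that both contain the same 25 entries dated 2012-10-07 (25 being the 'packageRank' count for that day in the output above). The data frame `logs` and its columns `file` and `date` are made up for illustration.

```r
# Toy stand-in for the RStudio logs: one row per download entry.
# `file` is the nominal date in the filename/URL; `date` is the date
# recorded inside the log. Both columns are hypothetical.
logs <- data.frame(
  file = rep(c("2012-10-07", "2012-10-11"), each = 25L),
  date = rep("2012-10-07", times = 50L),
  stringsAsFactors = FALSE
)

# Counting by the date recorded inside the logs pools both copies.
count_by_content <- sum(logs$date == "2012-10-07")  # 50

# Counting a single file by its filename sees one copy only.
count_by_file <- sum(logs$file == "2012-10-07")     # 25

count_by_content / count_by_file                    # 2
```

The 2:1 ratio matches the 50 vs. 25 reported for 2012-10-07. That only shows the explanation is consistent with the numbers, not that this is what 'cranlogs' actually does.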