To compute the number of package downloads for a given day, cranlogs::cran_downloads() counts the number of entries (number of rows) for that package in CRAN's download logs. Would it be possible to add an optional argument so that observations with sizes less than 1000 bytes do not count toward the number of package downloads?
Two reasons. First, it's hard to say that such observations really represent a package download. Second, while much of this may just be unsuccessful/aborted downloads, I think that some of this is more than random noise.
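To make the request concrete, here is a minimal sketch of what such an optional threshold could look like. It is not cranlogs code: count_downloads(), its min_size argument, and toy_log are hypothetical names, and the toy data are made up purely for illustration.

```r
# Hypothetical helper (not part of cranlogs): count log entries for one
# package, optionally ignoring entries smaller than min_size bytes.
count_downloads <- function(cran_log, pkg, min_size = 0) {
  keep <- cran_log$package == pkg & cran_log$size >= min_size
  sum(keep, na.rm = TRUE)
}

# Made-up toy log with only the two columns the helper needs.
toy_log <- data.frame(
  package = c("rstan", "rstan", "rstan", "bim"),
  size    = c(531, 537, 4500000, 539),
  stringsAsFactors = FALSE
)

n_all      <- count_downloads(toy_log, "rstan")                   # every row counts
n_filtered <- count_downloads(toy_log, "rstan", min_size = 1000)  # small rows skipped
```

With the default min_size = 0 the count matches the current row-counting behavior, so the argument would be opt-in.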
Using 2019-10-23 as an example, here's what I found.
My back-of-envelope estimate is that around 5% (233,722 / 5,097,912) of all downloads on 2019-10-23 were smaller than 1000 bytes, typically around 500 bytes.
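The arithmetic behind that estimate, using the two counts quoted above (the variable names are only for illustration):

```r
small_entries <- 233722   # entries under 1000 bytes on 2019-10-23
all_entries   <- 5097912  # all entries on 2019-10-23

# Share of small entries, as a percentage rounded to one decimal place.
pct_small <- round(100 * small_entries / all_entries, 1)
pct_small  # 4.6
```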
Here's an example. On 2019-10-23 'rstan' was downloaded 2,574 times.
> cranlogs::cran_downloads("rstan", from = "2019-10-23", to = "2019-10-23")
date count package
1 2019-10-23 2574 rstan
But if you look at the logs (RStudio's CRAN mirror at http://cran-logs.rstudio.com), you'll see that there are 40 entries smaller than 1000 bytes:
date time size package version country ip_id
1438000 2019-10-23 19:49:15 531 rstan 2.11.1 US 7
1438001 2019-10-23 19:49:15 537 rstan 2.12.1 US 7
1438002 2019-10-23 19:49:15 537 rstan 2.13.2 US 7
1438003 2019-10-23 19:49:15 531 rstan 2.14.1 US 7
1438004 2019-10-23 19:49:15 537 rstan 2.14.2 US 7
1438005 2019-10-23 19:49:15 537 rstan 2.15.1 US 7
1438006 2019-10-23 19:49:15 531 rstan 2.16.2 US 7
1438007 2019-10-23 19:49:15 537 rstan 2.17.2 US 7
1438008 2019-10-23 19:49:15 531 rstan 2.17.3 US 7
1438009 2019-10-23 19:49:15 537 rstan 2.17.4 US 7
1438010 2019-10-23 19:49:15 537 rstan 2.18.1 US 7
1438011 2019-10-23 19:49:15 537 rstan 2.18.2 US 7
1438012 2019-10-23 19:49:15 533 rstan 2.7.0-1 US 7
1438013 2019-10-23 19:49:15 533 rstan 2.8.0 US 7
1438014 2019-10-23 19:49:15 533 rstan 2.8.1 US 7
1438015 2019-10-23 19:49:15 539 rstan 2.8.2 US 7
1438016 2019-10-23 19:49:15 539 rstan 2.9.0-3 US 7
1438017 2019-10-23 19:49:15 539 rstan 2.9.0 US 7
1438121 2019-10-23 19:49:14 537 rstan 2.10.1 US 7
3607030 2019-10-23 10:59:51 534 rstan 2.19.2 <NA> 5
3702500 2019-10-23 20:16:37 537 rstan 2.10.1 US 7
3702501 2019-10-23 20:16:38 537 rstan 2.11.1 US 7
3702502 2019-10-23 20:16:38 537 rstan 2.12.1 US 7
3702503 2019-10-23 20:16:38 531 rstan 2.13.2 US 7
3702504 2019-10-23 20:16:38 531 rstan 2.14.1 US 7
3702505 2019-10-23 20:16:38 537 rstan 2.14.2 US 7
3702506 2019-10-23 20:16:38 537 rstan 2.15.1 US 7
3702507 2019-10-23 20:16:38 531 rstan 2.16.2 US 7
3702508 2019-10-23 20:16:38 537 rstan 2.17.2 US 7
3702509 2019-10-23 20:16:38 537 rstan 2.17.3 US 7
3702510 2019-10-23 20:16:38 531 rstan 2.17.4 US 7
3702511 2019-10-23 20:16:38 531 rstan 2.18.1 US 7
3702512 2019-10-23 20:16:38 537 rstan 2.18.2 US 7
3702513 2019-10-23 20:16:38 539 rstan 2.7.0-1 US 7
3702514 2019-10-23 20:16:39 539 rstan 2.8.0 US 7
3702515 2019-10-23 20:16:39 539 rstan 2.8.1 US 7
3702516 2019-10-23 20:16:39 539 rstan 2.8.2 US 7
3702517 2019-10-23 20:16:39 539 rstan 2.9.0-3 US 7
3702518 2019-10-23 20:16:39 533 rstan 2.9.0 US 7
4186380 2019-10-23 18:29:29 530 rstan 2.19.2 US 7
For what it's worth, here's the code for the above log data using packageLog() from the development version of 'packageRank' (v0.3.0.9000) on https://github.com/lindbrook/packageRank
rstan.log <- packageRank::packageLog("rstan", "2019-10-23")
vars <- c("date", "time", "size", "package", "version", "country", "ip_id")
rstan.log[rstan.log$size < 1000, vars]
While 40 of 2,574 downloads is small percentage-wise (1.6%), you'll see that all but one of these observations (39 of 40) come from a single IP address (ip_id 7) that is "downloading" different (possibly all) versions of 'rstan'.
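One way to surface this pattern is to tabulate the small entries by ip_id and count the distinct versions each identifier requested. The sketch below uses five rows copied from the listing above rather than the full log; small_log and the two result names are illustrative.

```r
# Five rows copied from the 'rstan' listing above (only the needed columns).
small_log <- data.frame(
  size    = c(531, 537, 537, 534, 530),
  version = c("2.11.1", "2.12.1", "2.13.2", "2.19.2", "2.19.2"),
  ip_id   = c(7, 7, 7, 5, 7),
  stringsAsFactors = FALSE
)

# Number of small entries per IP identifier.
entries_per_ip <- table(small_log$ip_id)

# Number of distinct versions requested by each IP identifier.
versions_per_ip <- tapply(small_log$version, small_log$ip_id,
                          function(v) length(unique(v)))

entries_per_ip
versions_per_ip
```

Applied to the full set of sub-1000-byte rows, the same two calls should show one identifier accounting for nearly all entries and for many distinct versions.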
While people may, of course, be interested in previous versions of a package and while many people are using network address translation (NAT), this kind of activity is not an isolated event. You'll find it across many packages intermittently throughout the month.
It even extends to "archived" packages (those that are not included on CRAN's main listing). For example, we see that 'bim' was downloaded 12 times:
> cranlogs::cran_downloads("bim", from = "2019-10-23", to = "2019-10-23")
date count package
1 2019-10-23 12 bim
But all 12 of those downloads look like this:
date time size package version country ip_id
1746102 2019-10-23 19:39:06 539 bim 0.92-3 US 7
1746103 2019-10-23 19:39:06 533 bim 0.93-1 US 7
1746105 2019-10-23 19:39:06 537 bim 1.01-1 US 7
1746107 2019-10-23 19:39:06 537 bim 1.01-3 US 7
1746108 2019-10-23 19:39:06 537 bim 1.01-4 US 7
1746109 2019-10-23 19:39:06 537 bim 1.01-5 US 7
1937591 2019-10-23 19:12:29 533 bim 0.92-3 US 7
1937592 2019-10-23 19:12:29 539 bim 0.93-1 US 7
1937593 2019-10-23 19:12:29 537 bim 1.01-1 US 7
1937594 2019-10-23 19:12:29 537 bim 1.01-3 US 7
1937595 2019-10-23 19:12:29 537 bim 1.01-4 US 7
1937596 2019-10-23 19:12:29 537 bim 1.01-5 US 7
bim.log <- packageRank::packageLog("bim", "2019-10-23")
vars <- c("date", "time", "size", "package", "version", "country", "ip_id")
bim.log[, vars]
In this case, it's hard to say that this package was really downloaded.
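A rough way to find other packages like this is to flag those whose every log entry falls below the threshold. The sketch below is only an illustration: log_sample mixes three rows copied from the 'bim' listing with one copied 'rstan' row and one made-up large 'rstan' row for contrast.

```r
# Three 'bim' rows and one 'rstan' row copied from the listings above,
# plus one made-up large 'rstan' entry (4500000 bytes) for contrast.
log_sample <- data.frame(
  package = c("bim", "bim", "bim", "rstan", "rstan"),
  size    = c(539, 533, 537, 531, 4500000),
  stringsAsFactors = FALSE
)

# TRUE for packages whose every entry is smaller than 1000 bytes.
only_small <- tapply(log_sample$size, log_sample$package,
                     function(s) all(s < 1000))

flagged <- names(only_small)[only_small]
flagged
```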