Our database offers a refined dataset of unparalleled scope, accuracy and granularity, capturing internet activity across 1,647 cities in 122 countries at 15-minute intervals over a seven-year period.
To construct it, we obtained 1.5 trillion observations of global IP activity from the University of Southern California’s PREDICT internet-security database. These observations were collected between 2006 and 2012 by periodically probing every IP address around the world to see whether or not it was connected to a device that was online.
The observations were matched to a highly accurate, commercially available IP-geolocation library using the Australian Synchrotron’s MASSIVE cluster computing facility, producing 24 billion geo-located IP-activity observations. These were then aggregated into cities (using urban boundaries obtained from satellite observational data) and 15-minute intervals, to yield a final dataset of 75 million rows of online/offline observations.
For more information, download our paper.
Constructing the geo-located IP-activity dataset
Whilst utilisation of the internet’s more than 4 billion unique IP addresses is, in the abstract, of considerable interest to computer scientists, network institutions and security experts, not knowing where a given IP address is active renders these data of little use to quantitative social scientists. For social scientists, the atomic observation is always referenced to a particular family, village, town, city, state or nation. Hence, to join any other observation on these locales to internet activity, a location must be sought. In broad terms, the construction of a geo-located IP-activity database requires an IP-activity database, a historical IP/geo-location database, and a method that uniquely joins each row of the IP-activity database to the correct row in the historical IP/geo-location database (Fig. 1).
The problem is immediately non-trivial, since the typical join algorithm applied in such a scenario scales as the cross-product of the two database sizes. One scan of the internet generates observations on the order of a billion rows, and the human population is relatively sparsely geo-located, leading to millions of unique locations. Geo-locating a single full scan would therefore create an intermediate object of easily over 1 × 10¹⁵ rows. The matter is further complicated by the fact that the IPv4 space is not statically allocated: IPs are typically allocated for fixed periods to a given internet service provider (ISP), and within the ISP, allocations to consumers need not be static.
As a result, the IP geo-location database must be re-created, and stored, approximately every two weeks to provide a correct historical reference point for where an IP was at any given time, which further expands the cross-product processing object. In summary, traditional database join methods cannot be used on the problem.
Our solution was to develop a novel cluster-join algorithm, which can also be applied beyond the data treated in this report. In outline, the algorithm first partitions the historical geo-location database into strictly non-overlapping IP-space segments with given date-range tags to create an indexed database. Then, for each row of the activity database, the algorithm first finds the correct sub-section of the location database via the index, then conducts the row-wise match. Significantly, this algorithm can be efficiently applied in a distributed, parallel, file-system environment, dramatically reducing the processing time needed to conduct the full join.
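The outline above can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the `Segment` fields, the grouping by validity window, the binary search within each window, and the example addresses (taken from the reserved documentation range 192.0.2.0/24) are all assumptions made for illustration.

```python
import ipaddress
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    """One non-overlapping IP range with a date-range tag (hypothetical schema)."""
    ip_start: int     # first IPv4 address of the segment, as an integer
    ip_end: int       # last IPv4 address of the segment (inclusive)
    valid_from: date  # first day the mapping is valid
    valid_to: date    # last day the mapping is valid (inclusive)
    city: str         # hypothetical location label


Window = Tuple[date, date]
Index = Dict[Window, Tuple[List[int], List[Segment]]]


def ip_to_int(address: str) -> int:
    """Convert dotted-quad IPv4 text to its integer value."""
    return int(ipaddress.IPv4Address(address))


def build_index(segments: List[Segment]) -> Index:
    """Group segments by validity window and sort each group by ip_start."""
    grouped: Dict[Window, List[Segment]] = {}
    for seg in segments:
        grouped.setdefault((seg.valid_from, seg.valid_to), []).append(seg)
    index: Index = {}
    for window, segs in grouped.items():
        segs.sort(key=lambda s: s.ip_start)
        index[window] = ([s.ip_start for s in segs], segs)
    return index


def locate(index: Index, ip: int, when: date) -> Optional[str]:
    """Find the window covering `when`, then binary-search the IP range."""
    for (start, end), (starts, segs) in index.items():
        if not (start <= when <= end):
            continue
        pos = bisect_right(starts, ip) - 1  # last segment starting at or before ip
        if pos >= 0 and ip <= segs[pos].ip_end:
            return segs[pos].city
    return None


# Illustrative data only: two fortnightly windows over a documentation range.
segments = [
    Segment(ip_to_int("192.0.2.0"), ip_to_int("192.0.2.127"),
            date(2010, 1, 1), date(2010, 1, 14), "City A"),
    Segment(ip_to_int("192.0.2.128"), ip_to_int("192.0.2.255"),
            date(2010, 1, 1), date(2010, 1, 14), "City B"),
    Segment(ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"),
            date(2010, 1, 15), date(2010, 1, 28), "City B"),
]
index = build_index(segments)
print(locate(index, ip_to_int("192.0.2.10"), date(2010, 1, 3)))
```

Because each activity row is looked up independently against a read-only index, the lookups could in principle be split across workers, which is consistent with the distributed, parallel setting described above.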
After processing, 24 billion accurately geo-located IP-activity observations were retrieved, which were then spatially and temporally aggregated within 1,647 urban boundaries and 15-minute intervals over 2006-2012. City-day blocks of IP activity were retained where at least 100 online users were recorded in each of the 96 temporal segments of each day, and city-year blocks of data were retained where at least 30 city-days existed.
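The two retention rules can be sketched as follows. The input layout (a mapping from `(city, day, slot)` to a count of online IPs, with `slot` running from 0 to 95) and the function names are hypothetical; only the thresholds of 100, 96 and 30 come from the text, and "online users" is read here as online IP counts.

```python
from collections import defaultdict
from datetime import date, timedelta

SLOTS_PER_DAY = 96  # 15-minute segments in a day
MIN_ONLINE = 100    # minimum online count required in every segment
MIN_DAYS = 30       # minimum retained city-days for a city-year


def retained_city_days(counts):
    """Return (city, day) pairs where all 96 slots have at least MIN_ONLINE.

    `counts` maps (city, day, slot) -> online count; slots are assumed to be
    integers in 0..95, so 96 distinct passing slots means a complete day.
    """
    passing_slots = defaultdict(set)
    for (city, day, slot), online in counts.items():
        if online >= MIN_ONLINE:
            passing_slots[(city, day)].add(slot)
    return {key for key, slots in passing_slots.items()
            if len(slots) == SLOTS_PER_DAY}


def retained_city_years(city_days):
    """Return (city, year) pairs with at least MIN_DAYS retained city-days."""
    days_per_year = defaultdict(int)
    for city, day in city_days:
        days_per_year[(city, day.year)] += 1
    return {key for key, n in days_per_year.items() if n >= MIN_DAYS}


# Illustrative data: city "X" has 30 complete days, city "Y" only 29.
counts = {}
for offset in range(30):
    day = date(2010, 1, 1) + timedelta(days=offset)
    for slot in range(SLOTS_PER_DAY):
        counts[("X", day, slot)] = 100
        if offset < 29:
            counts[("Y", day, slot)] = 250
# One extra day for "X" that fails because a single slot is below threshold.
for slot in range(SLOTS_PER_DAY):
    counts[("X", date(2010, 1, 31), slot)] = 99 if slot == 0 else 100

city_days = retained_city_days(counts)
print(sorted(retained_city_years(city_days)))
```

Note that a city-day missing any of its 96 slots is dropped here just like one with a low slot; whether the original pipeline treats missing and low segments identically is not stated in the text.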
Characterising the diffusion of the IPv4-space
The diffusion of technology, including previous general-purpose technologies (GPTs)[11, 2], is of ongoing interest in the economic literature[17, 18, 19]. Previous related studies have used a variety of internet penetration proxies at either snapshot or annualised detail, each proxy having one or more compromises such as data-quality problems (in the case of ITU surveys[16, 20]), or complications in identifying actual internet use (in the case of block-based or router-based assignment[21, 22]). In contrast, since we observe actual end-user IP connections, in well-defined urban boundaries (cities), at 15-minute intervals, identified by a hitherto unused, highly accurate geolocation database, we are able to provide the first accurate estimate of the evolution of the internet’s expansion at monthly intervals. Significantly, given the temporal granularity and global scope of our series, we are able to confirm that the diffusion of the internet does indeed follow an S-shaped, or logistic, process (Fig. 2), mirroring studies of the diffusion of other technologies in the literature, from hybrid corn to steam engines, electrification and personal computers.
Accordingly, we estimate the temporal dynamics of IP per capita, $IP_c$, at 1,647 cities globally as a logistic process, $IP_{ct} = \frac{K}{1 + e^{-\alpha(t-\beta)}}$, where $K$, $\alpha$ and $\beta$ are the asymptotic limit, the gradient and the midpoint parameters respectively. We estimate this process as a non-linear mixed-effects model with a stochastic expectation maximisation algorithm (see S1).
By doing so, the algorithm is able to learn from the experiences of all countries by treating each country as a deviation (in time and gradient) from a generalised, or average, diffusion process. We find that the internet’s general diffusion process has an asymptotic limit of 0.32 IPs per person, equating to an internet ‘saturation’ level of approximately one IP address for a three-person household, on average. Further, we estimate that the diffusion process’s average time to saturation within a country is just 16.1 years (1%-99%), eclipsing the estimated 100- and 60-year saturation times for the comparable GPTs of steam power and electrification respectively. Our method also enables the elaboration of individual country experiences of the internet’s penetration (see Table A, S1). Our estimates reveal that whilst several nations already experience saturated internet penetration, others will not reach this point for decades.
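As a rough illustration of how a 1%-99% saturation time relates to the logistic parameters, the sketch below evaluates a logistic curve with asymptote K, gradient α and midpoint β, and computes the time taken to rise between two shares of K. The function names are hypothetical, and the gradient value used is back-calculated from the 16.1-year figure purely for illustration; it is not a parameter reported in the text.

```python
import math


def logistic_ip_per_capita(t, K, alpha, beta):
    """Logistic curve: K / (1 + exp(-alpha * (t - beta)))."""
    return K / (1.0 + math.exp(-alpha * (t - beta)))


def time_between_shares(alpha, low=0.01, high=0.99):
    """Time for the curve to rise from low*K to high*K.

    Solving K / (1 + exp(-alpha * (t - beta))) = p * K for t gives
    t = beta + ln(p / (1 - p)) / alpha, so the difference between two
    shares does not depend on K or beta.
    """
    def logit(p):
        return math.log(p / (1.0 - p))
    return (logit(high) - logit(low)) / alpha


K = 0.32  # asymptotic limit from the text (IPs per person)
# Assumption: gradient chosen so that the 1%-99% span equals 16.1 years.
alpha = 2.0 * math.log(99.0) / 16.1
beta = 0.0  # arbitrary midpoint for illustration

print(round(alpha, 3))
print(round(time_between_shares(alpha), 1))
print(logistic_ip_per_capita(beta, K, alpha, beta))  # half of K at the midpoint
```

Under this reading, the 1%-99% span is 2 ln(99) / α, so a shorter saturation time corresponds directly to a steeper gradient, while the asymptote of 0.32 IPs per person works out to one IP per roughly 3.1 people.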