Publicly-Available Data on Internet Access and Equity

Modern society is mediated through the Internet. Equity in society requires equity in usable broadband Internet access: for communication, education, health, and quality of life or entertainment. This was true before the coronavirus pandemic and it will be true when the pandemic subsides. But the need today is more acute than it has ever been.

This is a multifaceted problem: the connections available to people and their bandwidth, the devices that they use on those connections, and their digital “literacy” in leveraging those resources. Understanding and responding to the problem requires good data from a variety of angles. In a series of three posts, I want to lay out

existing public data sources on internet access and equity (this post),
simple technical methods for measuring performance at home, and
some of the more-elaborate strategies required for measuring consumption and application performance.

But aside from the fact that it’s kind of fun to fuss with your router, why should we pursue those direct measurements at all? After all, hardware is expensive, deploying it is difficult, and data is free: at first blush, there are a lot of data sources out there. Those data are the subject of this post.

To make a long story short: the available data use very rough metrics of “broadband” (presence of a 25/3 Mbps connection), or they have significant sampling limitations. Notwithstanding, they are very valuable, so I’ll review some of them here. The map below draws on several of these data sources, to present neighborhood-level Internet access in the twenty largest American cities.

Base map data © OpenStreetMap contributors, CC-BY-SA; layers by CARTO. Map data from US Census American Community Survey, FCC Form 477, and Speedtest by Ookla. Map by James Saxon, Center for Data and Computing, University of Chicago.

This project on broadband Internet continues my study of resource equity in urban neighborhoods, but aligns it with the “classical strengths” of my current research group. I moved over the summer from the Harris School of Public Policy and the Center for Spatial Data Science to the Center for Data and Computing and the NOISE Lab, directed by Nick Feamster.

FCC Form 477.

Perhaps most prominent is the venerable FCC 477 data: ISPs’ self-reports of offered services, as shown in this flashy map. From these data¹, it is possible to calculate broadband availability and indeed the number of ISPs providing service at any threshold. The problems with the data are well-documented. The reports are usually dated and they tend to over-represent real options. A Census block counts as having a service if it is available to just one address in that block. Even within this generous definition, Major, Teixeira, and Mayer recently showed that the 477 reports often overstate what the ISPs’ own subscription sites will actually allow you to order. It is also notable that the 477 reports capture availability rather than subscriptions, meaning that they don’t represent what people actually experience or use on the ground. Within the literature on health access, it has long been established that access is more meaningfully measured via consumption or use than pro forma availability; the same holds for Internet. Finally, there is often a difference between what’s available and what’s physically installed. Comcast lists huge swaths of Chicago as having 985/35 Mbps connections available, but bringing the fiber to the home may still involve a fleet of trucks (and adeptly-applied pressure).
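
As a rough illustration of the availability calculation described above, the sketch below counts, for each Census block, the distinct providers whose advertised speeds meet a threshold. The record layout (block, provider, advertised down and up speeds) and the sample rows are invented for the example; they are not the actual Form 477 column names or values.

```python
# Hypothetical rows: (census_block, provider, advertised_down_mbps, advertised_up_mbps).
# Field order and values are illustrative only, not the real Form 477 schema.
SAMPLE_ROWS = [
    ("170318391001000", "ISP A", 985.0, 35.0),
    ("170318391001000", "ISP B", 18.0, 1.0),
    ("170318391001001", "ISP B", 100.0, 10.0),
    ("170318391001001", "ISP B", 25.0, 3.0),   # same provider, second technology
    ("170318391001002", "ISP C", 10.0, 1.0),
]


def providers_at_threshold(rows, min_down=25.0, min_up=3.0):
    """Map each block to the number of distinct providers meeting the threshold."""
    qualifying = {}
    for block, provider, down, up in rows:
        # Register the block even if no provider ends up qualifying.
        names = qualifying.setdefault(block, set())
        if down >= min_down and up >= min_up:
            names.add(provider)
    return {block: len(names) for block, names in qualifying.items()}


def share_of_blocks_served(counts):
    """Unweighted fraction of blocks with at least one qualifying provider."""
    return sum(1 for n in counts.values() if n > 0) / len(counts)


counts = providers_at_threshold(SAMPLE_ROWS)
print(counts)
print(share_of_blocks_served(counts))
```

Any number computed this way inherits the caveat that one served address marks the whole block as served. Weighting blocks by households or population, rather than counting them equally, would be a natural next step.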

Survey data.

You can also just ask people if they have Internet. The US Census does that in two surveys: the American Community Survey (ACS, the successor to the Census long form), and the National Telecommunications and Information Administration’s (NTIA) supplement to the Current Population Survey (CPS). The ACS captures device ownership and broadband subscriptions in households. The sample size is about 1% of the US population annually. “Broadband” is understood dichotomously as the presence or absence of a 25/3 Mbps connection. The ACS API provides estimates down to the Census Tract level, and using the IPUMS microdata it is possible to construct measures of the “digital divide” – by gender, income, race, geography, etc.²

The NTIA data (docs, code) include both the subscriptions and device ownership of the ACS, along with additional data on which services people use and where they do so. A state-level map and time trends by Rafi Goldberg at the NTIA are here. On the other hand, the NTIA supplement is less frequent and has a much smaller sample (as well as a slightly different sample frame). The CPS is stratified at the state level, and I have found that finer-grained measurements are simply not credible within uncertainties. For example, I find it hard to believe that among the twenty largest US cities, San Francisco is the worst-provisioned in Internet.
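
To get a feel for why sub-state estimates from a small sample are shaky, here is a back-of-the-envelope sketch of the 95% margin of error on a proportion as the sample shrinks. It assumes simple random sampling and a normal approximation, which will tend to understate the uncertainty of a stratified, weighted survey like the CPS. The sample sizes are illustrative, not actual CPS counts.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion.

    Assumes simple random sampling; real survey designs usually carry a
    design effect that widens the interval further.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# Illustrative sample sizes only (not taken from the CPS documentation).
for label, n in [("state-sized sample", 2500), ("city-sized sample", 150)]:
    moe = margin_of_error(0.80, n)
    print(f"{label}: n={n}, 80% +/- {100 * moe:.1f} points")
```

With these made-up numbers the margin grows from about 1.6 to about 6.4 percentage points. At that width, a ranking of cities whose true subscription rates differ by only a few points carries little information.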

Distributed, web-based measurements.

Ookla (speedtest & data) and M-Lab (speed test) both perform distributed measurements of network latency, jitter, and throughput via the browser and dedicated apps. The M-Lab data are readily available on BigQuery and have been influential for policy. The M-Lab test is the hard-coded first link in a Google search for “speed test.” But the data have somewhat important limitations, as recently described by Nick Feamster (my adviser at CDAC) and Jason Livingood (Comcast) in the CACM.³ The server infrastructure is not widely distributed, and it has not proven entirely reliable.

There are also some low-level technical issues. In some instances, the 10-second test may not suffice for a TCP link to saturate. More to the point, the ndt protocol has not included multiple TCP threads until quite recently. Now, ndt7 uses BBR congestion control when it’s available, and this purports to do better. But most of the data measure the performance of a single TCP thread, which saturates far below the bandwidth of most modern access links. The other parameters of the M-Lab tests (latency, jitter, etc.) might better approximate the intended concepts, were the servers not so sparse.
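
Two textbook bounds help make the single-stream concern concrete. A window-limited flow cannot exceed window / RTT, and the Mathis et al. approximation bounds a loss-based (Reno-style) flow at roughly (MSS / RTT) × 1.22 / √p, where p is the packet loss rate. The sketch below evaluates both. The window, MSS, RTT, and loss values are illustrative assumptions rather than M-Lab parameters, and the loss-based bound does not describe BBR.

```python
import math


def window_limited_mbps(window_bytes, rtt_s):
    """Upper bound on throughput when at most one window is in flight per RTT."""
    return window_bytes * 8 / rtt_s / 1e6


def mathis_mbps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Mathis et al. approximation for a single loss-based TCP flow."""
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate) / 1e6


# Illustrative values: 64 KiB window (no window scaling), 1460-byte MSS, 0.01% loss.
for rtt_ms in (10, 50, 100):
    rtt = rtt_ms / 1000
    print(
        f"RTT {rtt_ms:>3} ms: "
        f"window-limited {window_limited_mbps(64 * 1024, rtt):7.1f} Mbps, "
        f"loss-limited {mathis_mbps(1460, rtt, 1e-4):7.1f} Mbps"
    )
```

Both bounds shrink in proportion to the round-trip time, which is one reason a sparse server footprint and a single stream can combine to understate a fast access link. Modern stacks scale their windows well beyond 64 KiB, so the first bound is a worst case rather than a typical one.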

By contrast, the Ookla/speedtest measurements are technically on surer ground, and the infrastructure is rather more robust. Still, that sample isn’t entirely unbiased either. This can be seen by contrasting Ookla device counts with Census households with Internet subscriptions. People who choose to run speedtests are not representative of the broader population, and they generally don’t run them at unbiased (random or regular) times. Just as important, the “last-mile” bandwidth may not be a sufficient indicator of network performance (or indeed even of the ISP’s link, if a test is performed from a device connected over a bad Wi-Fi connection). Speedtest chooses the test server with the lowest round-trip latency (roughly, the closest), but while major services have widely distributed servers (cf. Google’s and Netflix’s peering “offers”) and others rely on CDNs, resources are not always close at hand.

SamKnows Whiteboxes.

Data collected by SamKnows for the FCC are in some ways a gold standard, although their measurements rely on the M-Lab infrastructure. This program has been running since 2011 and includes a bevy of continuous, open-sourced tests on dedicated hardware that they call “whiteboxes”: TCP (both single and multithreaded) and UDP bandwidth, latencies (ping, under load, and DNS), web browsing and loads, video quality, even consumption, and so on. The data are used for the annual Measuring Broadband America reports. But they too have some issues of who is captured. The sampling strategy aims to ensure that ISPs, rather than demographic groups, are adequately sampled. So like Ookla devices, SamKnows households tend to skew wealthier than the general population. In addition, there are just a few thousand households in the sample nationwide, which is not enough to say how certain groups fare within Chicago, for instance. (There are no demographics and limited information about units’ geography.⁴)

RIPE NCC, the regional Internet registry for Europe, also operates a network of probes and anchors worldwide, measuring latency and network topology: ping, DNS, traceroutes, etc. The RIPE probes used to be hardware devices attached via Ethernet to a router, but RIPE has extended this to a software package that “anyone” can run.

Citizen science and server logs.

There are of course many, many more strategies and datasets. Telecom companies spend a fortune “drivetesting” their networks, rolling around to map access and gaps. There’s plenty of room for citizen science along the same lines. As you go about your neighborhood and routines, your phone registers available access points. The “game” of recording those systematically and competitively is a popular activity for hackers,⁵ known as “wardriving” (from wardialing, from WarGames). The WiGLE wardriving platform makes its data available, and there are other institutionalized efforts such as Mozilla Location Service (they call it “stumbling,” but it’s functionally equivalent).

I’ve dabbled with these methods – both with apps and by jury-rigging Raspberry Pis. As you can see here, it was pretty quick to cover my own neighborhood in Chicago. But the issues of coverage are not trivial. Which networks can be seen from the street depends on building materials, heights, interference, etc. There are also sampling problems (who runs these tests), but one can imagine a measurement of access point density “conditional on an observation.” At the University of Chicago, Monisha Ghosh’s group has developed apps for finer-grained maps of mobile broadband spectrum and deployment. I see tremendous potential in deploying these apps through a fleet (city vehicles or taxis⁶) or a public mapping program like MAPSCorps.
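
One way to read “conditional on an observation” is to grid the survey area, count distinct access points per cell, and report a density only for cells that were actually visited. The following is a minimal sketch of that idea, not the method used for the map above. It assumes hypothetical scan records of (latitude, longitude, BSSID), with None standing in for a scan that heard nothing, and an arbitrary cell size.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, cell_deg=0.001):
    """Snap a coordinate to a grid cell (0.001 degrees of latitude is ~111 m)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def access_points_per_visited_cell(scans, cell_deg=0.001):
    """Count distinct BSSIDs in each cell that has at least one scan.

    Cells never visited are absent from the result (unknown), whereas
    visited cells where no network was heard are reported as 0.
    """
    seen = defaultdict(set)
    for lat, lon, bssid in scans:
        bssids = seen[grid_cell(lat, lon, cell_deg)]  # marks the cell as visited
        if bssid is not None:
            bssids.add(bssid)
    return {cell: len(bssids) for cell, bssids in seen.items()}


# Hypothetical scan log; coordinates and BSSIDs are made up for illustration.
SCANS = [
    (41.79425, -87.59075, "aa:aa:aa:00:00:01"),
    (41.79428, -87.59071, "aa:aa:aa:00:00:02"),
    (41.79426, -87.59073, "aa:aa:aa:00:00:01"),  # repeat sighting of the first AP
    (41.79655, -87.59075, None),                 # visited, nothing heard
]

density = access_points_per_visited_cell(SCANS)
print(density)
```

Keeping “never visited” distinct from “visited and empty” is the point of the exercise. Note that a single access point heard from several cells is counted in each of them, so this measures what is visible from the street, not where the routers physically sit.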

And there’s a ton more to be uncovered, waiting in server logs. To my mind, Wikipedia logs would be the very best data for measuring the “homework gap”: it is a high-traffic site hosting lightweight pages (no fancy connection is required), of plausibly educationally-oriented materials.⁷ As you might expect (or hope!), Wikipedia guards user privacy in their logs quite strictly, and there remains a challenge of IP geolocation, to attribute use and access to demography. (This is a problem I’m currently working on.) Still, Microsoft has used similar data – page load times – to derive estimates of who really has broadband access, at the county level (data). Their work indicates that real access is about half of what the FCC reports would suggest. This is fantastic work, but it would be nice to measure the variation at finer spatial granularity within cities.

That is what we’re setting out to do.

Notes.

1. The FCC data include, among other fields, advertised and contractual up and downstream bandwidths, per supplier (DBA, holding company, etc.), and whether service is provided to commercial or consumer properties. In practice, I found that the contractual numbers – naively more apposite – are often basically missing, and so the advertised levels are to be preferred.
2. Albeit at coarser geographies, called Public Use Microdata Areas (PUMAs). These are regions defined to preserve the anonymity of microdata, and contain 100-200k people.
3. Ookla lays out similar criticisms in a recent piece.
4. Census block IDs are often ill-formed in the data; SamKnows is currently working to correct this.
5. More likely, niche, though it depends on your peer group.
6. A project by the Senseable City Lab at MIT estimates the coverage of taxis as an opportunistic sensor deployment.
7. It is perhaps interesting to note that for domain-level sites (like Wikipedia), either the sites’ own logs or any high-level DNS resolver would do. Moura et al. showed at IMC ’20 that in the Netherlands, the concentration of DNS resolution has grown quite extreme. We might expect it to be even more so in the US / .com TLD. What this means is that Google and Amazon know if you’re using Wikipedia, even if you don’t search, because they are likely the ones who are telling you how to get to Wikipedia.