The term ‘big data’ has been getting a lot of attention recently, some of it very complimentary (see ‘The End of Theory’), and some of it not so much (see Mark Birkin’s report on a recent AAG session). On one level this is very exciting for me since much of my work with travel and communications data falls loosely under this rubric. But when big data sets are promoted as ‘the answer’ to everything from the next Census to deriving universal laws of human behaviour, it is also time for us to look a little more closely at what big data can actually deliver. In this post I want to draw attention to some important issues that should govern our interpretation of such data. In part 2 I will dig more deeply into some of the subtleties of a real-world big data set, and in part 3 I will wrap up with some thoughts on where we go from here.
How Big is Big?
One basic problem is that big data is fairly loosely defined: Wikipedia helpfully specifies only that big data sets “grow so large that they become awkward to work with using on-hand database management tools.” So what someone calls ‘big data’ rather depends on where they’re coming from — some researchers find a few tens of millions of records ‘big’, I typically work with data sets ranging from 1 to 8 billion records, and Google just laugh at all of that and get on with trying to “organise the world’s knowledge”.
The kinds of numbers bandied about can seem quite impressive, and it’s certainly the case that boiling all of those records down to something meaningful is a lot of work. You can spot how proud researchers are of their work — and their data set — by how close to the start of the article terms like ‘unprecedented’ or ‘largest known’ crop up in connection with the analysis. And yes, I’m just as guilty of this as the next person.
These numbers also lend a certain weightiness to the conclusions that people draw from big data research, leading us to naturally speculate about ‘universal laws’ that might govern everything from mobility to communications usage. But as I suspect even the most ardent of big data researchers knows, it’s nowhere near as simple as that singular term would suggest, and here’s why.
Description is Not Explanation
The first issue — that behavioural data sets rarely explain what motivates the behaviours observed — is fairly easy to set out, but wretchedly hard to resolve. In other words, as all social scientists know: just because we have described a phenomenon does not mean that we have explained it. There are some fields where description and understanding are very nearly coterminous, but that is most definitely not the case when it comes to society. Because the field is so new and the skills so specialised, the people working with these new data sets may well have very little background in the issues they are tackling, and their analyses may lack even rudimentary context.
Without a proper understanding of the system we’re describing, there is a very real risk that feedback effects invalidate many of our findings. Often, they may do so shortly after we reach them. Nick Bilton reports that Google’s Flu Trends algorithm, which performed extremely well for several years, overestimated the incidence of flu this year by 100%. The problem seems to have been that this year was expected to be a particularly bad one for the virus, so people were more aware of it and searched far more often for flu advice. This is the basic problem of all complex, adaptive systems: feedback between different elements of the system causes huge changes in its behaviour. And that’s for a relatively straightforward system!
Even simple complex systems (if you’ll pardon the seeming oxymoron) such as a public transit network can show extraordinary changes when real people are introduced into the system: from my own work with Olympic Oyster Card data it seems that barely 5% of regular commuters (who constitute nearly two-thirds of travellers on London’s network) and roughly 17% of irregular commuters changed their travel behaviour in a measurable way[1] this summer. These numbers don’t (and won’t) square with TfL’s own figure of 30% because we’re not working with the same definitions (which is itself a problem with big data), but either way the headlines (from papers not normally known for praising the transit operator) spoke for themselves. The result during the Games was planned and hoped for, but it was nonetheless not exactly expected.
Sampling Bias
There are many good reasons why big data research isn’t always based on the full data set: cost (pay-to-play) and the sheer volume of data (storage and bandwidth limitations) are just two of them. The problem is that when we’re working with a sample from a big data feed it gets much, much harder to figure out whether the picture we’re building from the data is remotely representative of the phenomenon we’re trying to understand. This is because we didn’t select the population to begin with, and often have only the behaviour itself as a basis for grouping the population into sub-groups (because it’s very, very hard to link across big data sets).
Let’s take Twitter as an example, and imagine for a moment that you want to know what people think about some topic based on a sample of their tweets. In 2010, 22.5% of users posted 90% of all tweets. And unless you are paying for Twitter’s firehose API, you are most likely getting your data from the garden hose, which contains roughly 10% of all public status events. If you then want to geolocate those tweets with any precision, using only those that include GPS coordinates, the numbers drop further still.
Moreover, because the garden hose is not (to my knowledge) a stratified sample — meaning one where you structure the sample so that it is representative of the whole population — you’re wildly more likely to get data from the 22.5% than the 77.5%. In fact, as the plot below shows, something like 3% of users post 60% of all content to Twitter. You can, of course, limit the impact of individual users by only counting their input once, but you’re still going to have a lot of problems getting at the views of the ‘silent majority’ because they crop up so little in your feed.
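To see how quickly this skews a sample, here is a minimal sketch in Python (with entirely invented numbers that only loosely echo the figures above) showing why a uniform sample of tweets over-represents heavy posters: sampling tweets is simply not the same thing as sampling users.

```python
import random

# Entirely invented numbers, loosely echoing the figures above: a toy
# population in which a small core of heavy posters produces most of
# the tweets, from which we draw a uniform ~10% sample of *tweets*.
random.seed(42)

N_USERS = 100_000
HEAVY_SHARE = 0.03  # assume ~3% of users are prolific posters

users = []
for uid in range(N_USERS):
    if uid < int(HEAVY_SHARE * N_USERS):
        n_tweets = random.randint(200, 2_000)   # prolific posters
    else:
        n_tweets = random.randint(0, 20)        # everyone else
    users.append((uid, n_tweets))

# The 'stream' has one entry per tweet; sample 10% of it at random.
stream = [uid for uid, n in users for _ in range(n)]
sample = random.sample(stream, len(stream) // 10)

heavy = set(range(int(HEAVY_SHARE * N_USERS)))
share_of_users = len(heavy) / N_USERS
share_of_sample = sum(1 for uid in sample if uid in heavy) / len(sample)

print(f"Heavy posters: {share_of_users:.0%} of users, "
      f"but {share_of_sample:.0%} of the sampled tweets.")
```

With these made-up parameters, the 3% of prolific users end up supplying well over half of the sampled tweets despite being a tiny fraction of the user base.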
But there’s a second problem: crossing over into speculation, it’s rather likely that the 22.5% (or 10.6% or whatever you like) behave differently from the general population because they are different from the general population. You are necessarily talking about a very small, technologically sophisticated, relatively affluent, and self-selecting population, so the extent to which they can be relied upon to give you a flavour of the general population is… dubious at best. There is excellent research that can be done with Twitter provided you recognise that it is mainly a large-volume test-bed for ideas and not a platform for research; however, I am rather concerned by the way in which even ‘toy’ visualisations are taken up outside of academia as replacements for the long, hard slog of collecting good data.
Sample Frequency
Let’s turn to another data source that has researchers excited these days: mobile telecommunications. What’s great about mobile phones is that they are uniquely personal devices: we carry them everywhere, a much greater proportion of the population than uses Twitter (the elderly excepted) has and uses one regularly, and they are almost always tied to a single person (how many people do you know who share a mobile phone?). This makes mobiles seem like an ideal resource for large-scale analysis of things like human mobility patterns. Certainly, it’s one heck of a lot better than Twitter. But it’s also a heck of a lot more sensitive.
When researchers talk about working with mobile data they are usually talking about working with CDRs (Call Data Records). These records are generated every time there is a potentially billable event on the user’s phone: placing or receiving a call, sending or receiving a text message, and (sometimes, though not always) surfing the Web. If the user isn’t roaming, then those events are associated with a particular cell in the operator’s network, which allows them to be geolocated with an accuracy that varies between roughly 100m and 5km.
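To make that structure concrete, here is a minimal sketch of what a single CDR might look like once it has been joined to a cell-location lookup; the field names, values and coordinates are entirely hypothetical and do not reflect any operator’s actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical field names only; real operator schemas differ.
@dataclass
class CallDataRecord:
    subscriber_id: str   # anonymised/hashed identifier, never the raw number
    event_type: str      # e.g. 'call_out', 'call_in', 'sms_out', 'data'
    timestamp: datetime
    cell_id: str         # the cell that handled the billable event

# Assumed lookup from cell_id to an approximate (lat, lon) centroid; in
# practice positional accuracy varies from ~100m in dense urban areas
# to several km for rural macrocells.
cell_locations = {
    "CELL_0421": (51.5074, -0.1278),
    "CELL_0978": (51.5155, -0.0922),
}

def locate(record: CallDataRecord):
    """Return a rough position for the event, or None if the cell is
    unknown (e.g. the subscriber was roaming off-network)."""
    return cell_locations.get(record.cell_id)

cdr = CallDataRecord("user_ab12", "sms_out",
                     datetime(2012, 8, 3, 17, 42), "CELL_0421")
print(locate(cdr))   # -> (51.5074, -0.1278), accurate only to the cell
```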
The frequency issue creeps in when we realise that people certainly don’t place calls, write texts or browse the Web in a random fashion. It’s the old “HELLO! I’m on the TRAIN! The TRAIN!” problem, but scaled up to several million users. I would guess that there’s more randomness in our receipt of calls and texts, but either way the basic issue is that we’re only getting locational information about people in particular socio-spatial contexts and so, unless you’re very careful about how you draw your conclusions, it’s very hard to know what the bias in the sampling might mean for the results.
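A toy simulation makes the point. The commuter below (with invented hours and invented per-hour calling probabilities) spends less than a tenth of their day on the train, yet under these assumptions the train accounts for roughly half of the location fixes their CDRs generate.

```python
import random
from collections import Counter

random.seed(1)

# Invented schedule for one commuter: (place, hours per day, probability
# of generating a billable event in any given hour spent there). The
# probabilities are made up purely to illustrate context-dependent rates.
schedule = [
    ("home",   13, 0.05),
    ("office",  9, 0.05),
    ("train",   2, 0.60),   # "HELLO! I'm on the TRAIN!"
]

true_hours, fixes = Counter(), Counter()
for _ in range(250):                          # roughly a year of weekdays
    for place, hours, p_event in schedule:
        true_hours[place] += hours
        for _ in range(hours):
            if random.random() < p_event:
                fixes[place] += 1             # a CDR geolocated to this place

total_hours, total_fixes = sum(true_hours.values()), sum(fixes.values())
for place, _, _ in schedule:
    print(f"{place:>6}: {true_hours[place] / total_hours:5.1%} of time, "
          f"{fixes[place] / total_fixes:5.1%} of location fixes")
```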
Spatial Extent
A second limitation on mobile analysis is the fact that once the user leaves the operator’s network they basically become invisible. In a place like Europe this can happen fairly easily — cheap flights, cheap trains, and small countries mean that you’re going to miss out on a lot of the user’s activity. In fact, stand at the foot of the white cliffs of Dover and you can place a call from France without even getting your feet wet! I wonder whether anyone has used that feature strategically (“…ah, no, I’m stuck in France…”)?
So the underlying issue is that the geography of the network generating the data matters. Depending on the scale at which you look, you’ll undoubtedly find one predominant axis of motion in, say, Portugal and another in Switzerland. At the very least, you’ll get one answer to “how far do people move?” if you ask your question in a large country, and another if you ask it in a small one. But we can’t know any of this for sure because none of the data is in the open, and no one has managed to negotiate access to multiple providers across multiple countries, which is the only way we’d know what was really going on.
I also know from in-depth conversations with a network operator that the system rarely behaves in the predictable way that most big data scientists seem to assume when they report their findings in a major journal. Handovers between cells do not happen in a neat way: not only do you need substantial overlap between cells in order to provide continuous coverage, but cells ‘breathe’ under load (meaning that their boundaries shift continuously), and individual cells are lumped together into larger areas under the control of an MSC (Mobile Switching Centre), which also affects the ways in which handovers are actually managed and registered. The point is that these factors introduce additional uncertainty into the physical location of any one user and that, depending on the configuration, aggregate counts could shift substantially across space.
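As a crude illustration of why the breathing matters, here is a sketch with invented numbers: the same ten thousand users spread along a 1km corridor are counted against two cells, first with the nominal handover boundary and then with the boundary pulled in as cell A sheds coverage under load. Nobody has moved, but the per-cell aggregates change substantially.

```python
import random

random.seed(7)

# Invented layout: users spread along a 1km corridor, cell A centred at
# 0m and cell B at 1,000m. Nominally the handover boundary sits halfway;
# under load, cell A 'breathes' in and the boundary shifts towards it.
users = [random.uniform(0, 1_000) for _ in range(10_000)]

def counts(boundary_m):
    """Number of users attributed to each cell for a given boundary."""
    in_a = sum(1 for x in users if x < boundary_m)
    return in_a, len(users) - in_a

print("nominal boundary at 500m:       ", counts(500))
print("cell A loaded, boundary at 350m:", counts(350))
```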
Competing Networks
I don’t think that the multiple-provider issue is catastrophic for the value of mobile phone research — or for any of the work being done with locational services like Foursquare — but depending on what you want to do it could be. Making public policy decisions for anyone except elite twenty- to forty-year-olds using tweets or check-ins would be insane. Fortunately, I don’t believe that anyone has yet proposed it… though I’ve not read an issue of Wired recently so I could be wrong.
Conversely, I have been speaking to some people involved in the outdoor advertising industry, and it would appear that much of the research there is based on paying people to go about their business carrying a GPS device. The sample size for this? Fewer than 150 people per week for the entire UK. Compared to that, the mobile data discussed above is a godsend, and it comes from the most valuable demographic, but it’s not the answer to all questions. So, again, it all depends on what you are claiming based on what you’re measuring.
[ End of Part 1 ]
1. It is impossible to observe changes in route directly and I’ve not had time to analyse changes in travel time from which it might be possible to infer something in this area.