Researching Personal Data
Category Archives: personal data
I recently delivered a seminar for the Southampton University Cyber Security seminar series. My talk introduced some of the research I’ve been doing into the UK’s Data Protection Register, and was entitled ‘Data Controller Registers: Waste of Time or Untapped Transparency Goldmine?’.
The idea of a register of data controllers came from the EU Data Protection Directive, which set out a blueprint for member state’s data protection laws. Data controllers – any entity responsible for collection and use of personal data – must provide details about the purposes of collection, categories of data subjects, categories of personal data, any recipients, and any international data transfers, to the supervisory authority (in the UK, this is the Information Commissioner’s Office). This represents a rich data source on the use of personal data by over 350,000 UK entities.
My talk explored some initial results from my research into 3 years worth of data from this register. A number of broad trends have been identified, including;
The amount of personal data collection reported is increasing. This is measured in terms of the number of distinct register entries for individual instances of data collection, which have increased by around 3% each year.
There are over 60 different stated reasons for collection of data, with ‘Staff Administration’, ‘Accounts & Records’ and ‘Advertising, Marketing & Public Relations’ being the most popular (outnumbering all other purposes combined).
The categories of personal data collected exhibit a similar ‘long tail’, with ten very common categories (including ‘Personal Details’, ‘Financial Details’ and ‘Goods or Services Provided’) accounting for the majority of instances.
In terms of transfers of data outside the EU, the vast majority of international data transfers are described as ‘Worldwide’. Of those who do specify, the most popular countries are the U.S., Canada, Australia, New Zealand and India.
Beyond these general trends, I explored one particular category of personal data collection which has been raised as a concern in studies of EU public attitudes, namely, trading and sharing of personal data. The kinds of data likely to be collected for this purpose are broadly reflective of the general trends, with the exception of ‘membership details’, which are far more likely to be collected for the purpose of trading.
Digging further into this category, I selected one particularly sensitive kind of data – ‘Sexual Life’ – to see how this was being used. This uncovered 349 data controllers who hold data about individual’s sexual lives, for the purpose of trading and sharing with other entities (from the summer 2012 dataset). I visualised this activity as a network graph, looking at the relationship between individual data controllers and the kinds of entities they share this information with. By clicking on blue nodes you can see individual data controllers, while categories of recipients are in yellow (note: wordpress won’t allow me to embed this in an iframe) Trading / Sharing Data about Sexual Life
I also explored how this dataset can be used to create personalised transparency tools, or to ‘visualise your digital footprint’. By identifying the organisations, employers, retailers and suppliers who have my personal details, I can pull in their entries from the register in order to see who knows what about me, what kinds of recipients they’re sharing it with and why. A similar interactive network graph shows a sample of this digital footprint.
Open data is often seen as in tension with privacy. However, through this research I hope to demonstrate some of the ways that open data can address privacy concerns. These concerns often stem from a lack of transparency about the collection and use of personal data by data controllers. By providing knowledge about data controllers, open data can be a basis for accountability and transparency about the use (or abuse) of personal data.
As a volunteer ‘data donor’ at the Midata Innovation Lab, I’ve recently been attempting to get my data back from a range of suppliers. As our lives become more data-driven, an increasing number of people want access to a copy of the data gathered about them by service providers, personal devices and online platforms. Whether it’s financial transactions data, activity records from a Fitbit or Nike Fuelband, or gas and electricity usage, access to our own data has the potential to drive new services that help us manage our lives and gain self-insight. But anyone who has attempted to get their own data back from service providers will know the process is not always simple. I encountered a variety of complicated access procedures, data formats, and degrees of detail.
For instance, BT gave me access to my latest bill as a CSV file, but previous months were only available as PDF documents. And my broadband usage was displayed as a web page in a seperate part of the site. Wouldn’t it be useful to have everything – broadband usage, landline, and billing – in one file, covering, say, the last year of service? Or, even better, a secure API which would allow trusted applications to access the latest data directly from my BT account, so I don’t have to?
Another problem was that in order to get my data, I sometimes had to sign up for unwanted services. My mobile network provider, GiffGaff, require me to opt-in to their marketing messages in order to receive my monthly usage report. FitBit users need to pay for a premium account to get access to the raw data from their own device.
Wouldn’t it be nice to rate these services according to a set of best practices? In 2006, when the open data movement was in its infancy, Tim Berners-Lee defined ‘Five Stars of Open Data‘ to describe how ‘open’ a data source is. If it’s on the web under an open license, it gets one star. Five stars means that it is in a machine-readable, non-proprietary format, and uses URI’s and links to other data for context. While we don’t necessarily want our private, personal data to be ‘open’ in Berners-Lee’s sense, we do want standard ways to get access to our personal data from a service. So, here are my suggested ‘Five Stars of Personal Data Access’ (to be read as complementary, not necessarily hierarchical):
1. My data is made available to me for free in a digital form. For instance, through a web dashboard, or email, rather than as a paper statement. There are no strings attached; I do not need to pay for premium services or sign up to marketing alerts to read it.
2. My data is machine-readable (such as CSV rather than PDF).
3. My data is in a non-proprietary format (such as CSV, XML or JSON, rather than Excel).
4. My data is complete; all the relevant fields are included in the same place. For instance, usage history and billing are included in the same file or feed.
5. My data is up-to-date; available as a regularly-updated feed, rather than a static file I have to look up and download. This could be via a secure API that I can connect trusted third-party services to.
The Midata programme has considered these issues from the outset, calling for suppliers to adopt common procedures and formats. Simplifying this process is an important step towards a world where individuals are empowered by their own data. My initial attempts to get my data back from suppliers point to a number of areas for improvement, which I’ve tried to reflect in these star ratings. Of course, there’s lots of room for debate over the definitions I’ve given here. And I’m sure there are other important aspects I’ve missed out. What would you add?
If Facebook were a state, it would be the third most populated in the world, just ahead of the USA and behind India. Like the former Soviet Union, which occupied the same third place slot at its peak, the state of Facebook rules over a geographically and culturally diverse citizenry. And like the USSR in 1990, this disparate social network may be at the beginning of its decline.
I’ll resist the urge to draw further fatuous parrallels – between, say, Stalin’s centralised planning and Zuckerburg’s centralised business model, or Gorbachev’s collapsing economy and the social network’s dismal performance on the stock market – fun as they might be. There are early signs of Facebook’s eventual dissolution, cracks which have appeared over the last six months. Facebook lost 10 million US visitors in the last year. Monthly visits in Europe are down. Its incredible international growth rate is beginning to plateau. And ‘Home’, the Facebook-smeared Android smartphone interface, appears to have flopped.
I’m just one data-point in all this, but I’ve been quietly engineering my own secession from Facebook over the last few weeks. I won’t go over some of the good reasons to leave Facebook (Paul Bernal has eloquently outlined ten of them already). I’ve always been a reluctant user, but equally reluctant to leave. Enough of my personal (and worryingly, professional) communication seems to come through Facebook that leaving altogether doesn’t seem to be an option, yet. Instead, I’ve taken a less drastic approach in the interim, which means I should never have to log in to Facebook again (except, perhaps, to delete my account).
- Exported (almost) all my data
- Removed (almost) all the information from my account.
- Deleted the Facebook and Facebook Messenger apps from my smartphone and tablet.
- Set up RSS feeds for pages.
- Set email notifications for group posts and events.
- Exported all my friend’s birthdays into a calendar, and set up a weekly update of upcoming birthdays.
- Finally, exported all my friend’s email addresses, so I can communicate via email instead. This was the hardest one. I had to sign up to Yahoo Mail (the only service Facebook will allow email imports into), and then run a scraping script on a html page to get them into a CSV format, before finally importing that into my email contacts. Thanks to @joincamp for the guide.
This way, I still get to hear about the important stuff, without exposing my eyeballs, or much of my data, to Facebook. It’s also given me the chance to experiment with other means of personal communication. Email feels very personal again. I’m working on my telephone manner. Postcards are also fun.
Last weekend I attended the Open Internet of Things Assembly here in London. You can read more comprehensive accounts of the weekend here. The purpose was to collaboratively draft a set of recommendations/standards/criteria to establish what it takes to be ‘open’ in the emerging ‘Internet of Things’. This vague term describes an emerging reality where our bodies, homes, cities and environment bristle with devices and sensors interacting with each other over the internet.
A huge amount of data is currently collected through traditional internet use – searches, clicks, purchases. The proliferation of internet-connected objects envisaged by Internet-of-Things enthusiasts would make the current ‘data deluge’ seem insignificant by comparison.
At this stage, asking what an Internet of Things is for would be a bit like travelling back to 1990 to ask Tim Berners-Lee what the World Wide Web was ‘for’. It’s just not clear yet. Like the web, it probably has some great uses, and some not so great ones. And, like the web, much of its positive potential probably depends on it being ‘open’. This means that anyone can participate, both at the level of infrastructure – connecting ‘things’ to the internet, and at the level of data – utilising the flows of data that emerge from that infrastructure.
The final document we came up with which attempts to define what it takes to be ‘open’ in the internet of things is available here. A number of salient points arose for me over the course of the weekend.
When it comes to questions of rights, privacy and control, we can all agree that there is an important distinction to be made between personal and non-personal data. What also emerged over the weekend for me were the shades of grey between this apparently clear-cut distinction. Saturday morning’s discussions were divided into four categories – the body, the home, the city, and the environment – which I think are spread relatively evenly across the spectrum between personal and non-personal.
Some language emerged to describe these differences – notably, the idea of a ‘data subject’ as someone who the data is ‘about’. Whilst helpful, this term also points to further complexities. Data about one person at one time can later be mined or combined with other data sets to yield data about somebody else. I used to work at a start-up which analysed an individual’s phone call data to reveal insights into their productivity. We quickly realised that when it comes to interpersonal connections, data about you is inextricably linked to data about other people – and this gets worse the more data you have. This renders any straightforward analysis of personal vs. non-personal data inadequate.
During a session on privacy and control, we considered whether the right to individual anonymity in public data sets is technologically realistic. Cambridge computer scientist Ross Anderson‘s work concludes that absolute anonymity is impossible – datasets can always be mined and ‘triangulated’ with others to reveal individual identities. It is only possible to increase or decrease the costs of de-anonymisation. Perhaps the best that can be said is that it is incumbent on those who publicly publish data to make efforts to limit personal identification.
Unlike its current geographically-untethered incarnation, the internet of things will be bound to the physical spaces in which its ‘things’ are embedded. This means we need to reconsider the meaning of and distinction between public and private space. Adam Greenfield spoke of the need for a ‘jurisprudence of open public objects’. Who has stewardship over ‘things’ embedded in public spaces? Do owners of private property have exclusive jurisdiction over the operation of the ‘things’ embedded on it, or do the owners of the thing have some say? And do the ‘data subjects’, who may be distinct from the first two parties, have a say? Mark Lizar pointed out that under existing U.S. law, you can mount a CCTV camera on your roof, pointed at your neighbours back garden (but any footage you capture is not admissible in court). Situations like this are pretty rare right now but will be part and parcel of the internet of things.
I came away thinking that the internet of things will be both wonderful and terrible, but I’m hopeful that the good people involved in this event can tip the balance towards the former and away from the latter.