Researching Personal Data
I recently delivered a seminar for the Southampton University Cyber Security seminar series. My talk introduced some of the research I’ve been doing into the UK’s Data Protection Register, and was entitled ‘Data Controller Registers: Waste of Time or Untapped Transparency Goldmine?’.
The idea of a register of data controllers came from the EU Data Protection Directive, which set out a blueprint for member states’ data protection laws. Data controllers – any entity responsible for collection and use of personal data – must provide details about the purposes of collection, categories of data subjects, categories of personal data, any recipients, and any international data transfers, to the supervisory authority (in the UK, this is the Information Commissioner’s Office). This represents a rich data source on the use of personal data by over 350,000 UK entities.
My talk explored some initial results from my research into three years’ worth of data from this register. A number of broad trends have been identified, including:
The amount of personal data collection reported is increasing. This is measured in terms of the number of distinct register entries for individual instances of data collection, which have increased by around 3% each year.
There are over 60 different stated reasons for collection of data, with ‘Staff Administration’, ‘Accounts & Records’ and ‘Advertising, Marketing & Public Relations’ being the most popular (outnumbering all other purposes combined).
The categories of personal data collected exhibit a similar ‘long tail’, with ten very common categories (including ‘Personal Details’, ‘Financial Details’ and ‘Goods or Services Provided’) accounting for the majority of instances.
In terms of transfers of data outside the EU, the vast majority of international data transfers are described as ‘Worldwide’. Of those who do specify, the most popular countries are the U.S., Canada, Australia, New Zealand and India.
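As a rough illustration of how trends like these can be tallied, here is a minimal Python sketch. It is not the code used in the research: the `entries` rows, the controller names and the helper functions `purpose_counts` and `yearly_growth` are all invented for the example, and the real register is far larger and has a different structure.

```python
from collections import Counter

# Hypothetical register entries: (data controller, stated purpose, year).
# These rows are invented for illustration only.
entries = [
    ("Acme Ltd", "Staff Administration", 2011),
    ("Acme Ltd", "Accounts & Records", 2011),
    ("Acme Ltd", "Staff Administration", 2012),
    ("Bolt plc", "Staff Administration", 2012),
    ("Bolt plc", "Advertising, Marketing & Public Relations", 2012),
]


def purpose_counts(rows):
    """Count how many register entries state each purpose."""
    return Counter(purpose for _, purpose, _ in rows)


def yearly_growth(rows):
    """Fractional change in the number of entries from each year to the next."""
    per_year = Counter(year for _, _, year in rows)
    years = sorted(per_year)
    return {
        later: (per_year[later] - per_year[earlier]) / per_year[earlier]
        for earlier, later in zip(years, years[1:])
    }


if __name__ == "__main__":
    print(purpose_counts(entries).most_common(3))
    print(yearly_growth(entries))
```

Sorting the counts from most to least common is what makes a ‘long tail’ visible: a handful of purposes at the top, then many rare ones.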
Beyond these general trends, I explored one particular category of personal data collection which has been raised as a concern in studies of EU public attitudes, namely, trading and sharing of personal data. The kinds of data likely to be collected for this purpose are broadly reflective of the general trends, with the exception of ‘membership details’, which are far more likely to be collected for the purpose of trading.
Digging further into this category, I selected one particularly sensitive kind of data – ‘Sexual Life’ – to see how this was being used. This uncovered 349 data controllers (in the summer 2012 dataset) who hold data about individuals’ sexual lives for the purpose of trading and sharing with other entities. I visualised this activity as a network graph, looking at the relationship between individual data controllers and the kinds of entities they share this information with. By clicking on blue nodes you can see individual data controllers, while categories of recipients are in yellow (note: WordPress won’t allow me to embed this in an iframe): Trading / Sharing Data about Sexual Life.
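For readers curious about the shape of the data behind a graph like this, the following is a minimal sketch in Python, not the code behind the actual visualisation. The `shares` pairs, the controller names and the recipient category labels are invented; the only things taken from the description above are the two kinds of node and their colours.

```python
from collections import defaultdict

# Hypothetical (data controller, recipient category) pairs, invented for
# illustration. A real version would read these from the register entries.
shares = [
    ("Controller A", "Traders in personal data"),
    ("Controller A", "Business associates"),
    ("Controller B", "Traders in personal data"),
]


def build_graph(pairs):
    """Build a two-mode graph: controllers on one side, recipient categories on the other."""
    nodes = {}                 # node name -> display colour
    edges = defaultdict(set)   # node name -> set of neighbouring node names
    for controller, recipient in pairs:
        nodes[controller] = "blue"    # individual data controllers
        nodes[recipient] = "yellow"   # categories of recipients
        edges[controller].add(recipient)
        edges[recipient].add(controller)
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = build_graph(shares)
    for name, colour in nodes.items():
        print(f"{name} ({colour}) -> {sorted(edges[name])}")
```

A structure like this can be handed to any graph-drawing tool; recipient categories with many neighbours show up as hubs.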
I also explored how this dataset can be used to create personalised transparency tools, or to ‘visualise your digital footprint’. By identifying the organisations, employers, retailers and suppliers who have my personal details, I can pull in their entries from the register in order to see who knows what about me, what kinds of recipients they’re sharing it with and why. A similar interactive network graph shows a sample of this digital footprint.
Open data is often seen as in tension with privacy. However, through this research I hope to demonstrate some of the ways that open data can address privacy concerns. These concerns often stem from a lack of transparency about the collection and use of personal data by data controllers. By providing knowledge about data controllers, open data can be a basis for accountability and transparency about the use (or abuse) of personal data.
Should Government agencies tasked with protecting our privacy make their investigations more transparent and open?
I spotted this story on (eminent IT law professor) Michael Geist’s blog, discussing a recent study by the Canadian Privacy Commissioner Jennifer Stoddart into how well popular e-commerce and media websites in Canada protect their users’ personal information and seek informed consent. This is important work; the kind of pro-active investigation into privacy practices that sets a good example to other authorities tasked with protecting citizens’ personal data.
However, while the results of the study have been published, the Commissioner declined to name the websites it investigated. Geist rightly points out that this secrecy denies individuals the opportunity to reassess their use of the offending websites. Amid calls from the Commissioner for greater transparency in data protection generally – such as better security breach notification – this decision goes against the trend, and seems, to me, a missed opportunity.
This isn’t just about naming and shaming the bad guys. It is as much about encouraging good practice where it appears. But this evaluation should take place in the open. Privacy and Data Protection commissioners should leverage the power of public pressure to improve company privacy practices, rather than relying solely on their own enforcement powers.
Identifying the subjects of such investigations is not a radical suggestion. It has already happened in a number of high-profile investigations undertaken by the Canadian Privacy Commissioner (into Google and Facebook), as well as by its counterparts in other countries. The Irish Data Protection Commissioner has made the results of its investigation into Facebook openly available. The UK Information Commissioner’s Office regularly identifies the targets of its investigations. While the privacy of individual data controllers should be respected, the privacy of individual data subjects should come before the ‘privacy’ of organisations and businesses.
As I wrote in my last blog post, openness and transparency from those government agencies tasked with enforcing data protection has the potential to alleviate modern privacy concerns. The data and knowledge they hold should be considered basic public infrastructure for sound privacy decisions. Opening up data protection registers could help reveal who is doing what with our personal data. Investigations undertaken by the authorities into websites’ privacy practices are another important source of information to empower individual users. The more information we have about who is collecting our data and how well they are protecting it, the better we can assess their trustworthiness.
Last weekend I attended the Open Internet of Things Assembly here in London. You can read more comprehensive accounts of the weekend here. The purpose was to collaboratively draft a set of recommendations/standards/criteria to establish what it takes to be ‘open’ in the emerging ‘Internet of Things’. This vague term describes an emerging reality where our bodies, homes, cities and environment bristle with devices and sensors interacting with each other over the internet.
A huge amount of data is currently collected through traditional internet use – searches, clicks, purchases. The proliferation of internet-connected objects envisaged by Internet-of-Things enthusiasts would make the current ‘data deluge’ seem insignificant by comparison.
At this stage, asking what an Internet of Things is for would be a bit like travelling back to 1990 to ask Tim Berners-Lee what the World Wide Web was ‘for’. It’s just not clear yet. Like the web, it probably has some great uses, and some not so great ones. And, like the web, much of its positive potential probably depends on it being ‘open’. This means that anyone can participate, both at the level of infrastructure – connecting ‘things’ to the internet, and at the level of data – utilising the flows of data that emerge from that infrastructure.
The final document we came up with, which attempts to define what it takes to be ‘open’ in the internet of things, is available here. A number of salient points arose for me over the course of the weekend.
When it comes to questions of rights, privacy and control, we can all agree that there is an important distinction to be made between personal and non-personal data. What also emerged over the weekend for me were the shades of grey between this apparently clear-cut distinction. Saturday morning’s discussions were divided into four categories – the body, the home, the city, and the environment – which I think are spread relatively evenly across the spectrum between personal and non-personal.
Some language emerged to describe these differences – notably, the idea of a ‘data subject’ as someone who the data is ‘about’. Whilst helpful, this term also points to further complexities. Data about one person at one time can later be mined or combined with other data sets to yield data about somebody else. I used to work at a start-up which analysed an individual’s phone call data to reveal insights into their productivity. We quickly realised that when it comes to interpersonal connections, data about you is inextricably linked to data about other people – and this gets worse the more data you have. This renders any straightforward analysis of personal vs. non-personal data inadequate.
During a session on privacy and control, we considered whether the right to individual anonymity in public data sets is technologically realistic. Cambridge computer scientist Ross Anderson’s work concludes that absolute anonymity is impossible – datasets can always be mined and ‘triangulated’ with others to reveal individual identities. It is only possible to increase or decrease the costs of de-anonymisation. Perhaps the best that can be said is that it is incumbent on those who publicly publish data to make efforts to limit personal identification.
Unlike its current geographically-untethered incarnation, the internet of things will be bound to the physical spaces in which its ‘things’ are embedded. This means we need to reconsider the meaning of and distinction between public and private space. Adam Greenfield spoke of the need for a ‘jurisprudence of open public objects’. Who has stewardship over ‘things’ embedded in public spaces? Do owners of private property have exclusive jurisdiction over the operation of the ‘things’ embedded on it, or do the owners of the thing have some say? And do the ‘data subjects’, who may be distinct from the first two parties, have a say? Mark Lizar pointed out that under existing U.S. law, you can mount a CCTV camera on your roof, pointed at your neighbour’s back garden (but any footage you capture is not admissible in court). Situations like this are pretty rare right now but will be part and parcel of the internet of things.
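To make ‘triangulation’ concrete, here is a small Python sketch of a linkage attack. It is my own illustration rather than anything taken from Anderson’s work: both datasets, the names and the choice of quasi-identifiers (`postcode`, `birth_year`, `sex`) are invented. A release with names stripped is joined against a second, public dataset on the fields the two have in common.

```python
from collections import defaultdict

# Hypothetical "anonymised" release: names removed, quasi-identifiers kept.
anonymised = [
    {"postcode": "SO17", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "SO17", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "SO15", "birth_year": 1980, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public dataset that includes names.
public = [
    {"name": "Alice Example", "postcode": "SO17", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "postcode": "SO15", "birth_year": 1990, "sex": "M"},
]

# Fields shared by both datasets (an assumption made for this example).
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def link(anon_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return {name: anonymised row} wherever the shared fields match exactly one row."""
    index = defaultdict(list)
    for row in anon_rows:
        index[tuple(row[k] for k in keys)].append(row)

    matches = {}
    for person in public_rows:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches[person["name"]] = candidates[0]
    return matches


if __name__ == "__main__":
    for name, row in link(anonymised, public).items():
        print(name, "->", row["diagnosis"])
```

Coarsening the shared fields (a wider area instead of a postcode, an age band instead of a birth year) produces more ambiguous matches. That raises the cost of de-anonymisation without making it impossible.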
I came away thinking that the internet of things will be both wonderful and terrible, but I’m hopeful that the good people involved in this event can tip the balance towards the former and away from the latter.
I’ve been a fan of the Open Rights Group – the UK’s foremost digital rights organisation – for a few years now, but yesterday was my first time attending ORGcon, their annual gathering. The turnout was impressive; upon arrival I was pleasantly surprised to see a huge queue stretching out of Westminster University and down Regent Street.
The day kicked off with a rousing keynote from Cory Doctorow on ‘The Coming War On General-Purpose Computing’ (a version of the talk he gave at the last Chaos Communication Camp, [video]). In his typical sardonic style, Doctorow argued that in an age when computers are everywhere – in household objects, medical implants, cars – we must defend our right to break into them and examine exactly what they are doing. Manufacturers don’t want their gadgets to be general-purpose computers, because this enables users to do things that scare them. They will disable computers that could be programmed to do anything, lock them down and turn them into appliances which operate outside of our control and obscured from our oversight.
Doctorow mocked the naive attempts of the copyright industries to achieve this using digital locks – but warned of the coming legal and technological measures which are likely to be campaigned for by industries with much greater lobbying power. In the post-talk Q&A session, an audience member linked the topic to the teaching of IT in schools; the need for children to understand from an early age how to look inside gadgets, understand how they work and that they may be operating against the user’s best interests.
As is always the way with parallel sessions, throughout the day I found myself wanting to be in multiple places at once. I opted to hear Wendy Seltzer give a nice summary of the current state of digital rights activism. She likened the grassroots response to SOPA and PIPA to an immune system fighting a virus. She warned that, like an overactive immune system, we run the risk of attacking the innocuous. If we cry wolf too often, legislators may cease to listen. She went on to imply that the current anti-ACTA movement is guilty of this. Personally, I think that as long as such protest is well informed, it cannot do any harm and hopefully will do some good. Legislators are only just beginning to recognise how serious these issues are to the ‘net generation’, and the more we can do to make that clear, the better.
The next hour was spent in a crowded and stuffy room, watching my Southampton colleague Tim Davies grill Chris Taggart (OpenCorporates), Rufus Pollock (OKFN), and Heather Brooke (journalist and author) about ‘Raw, Big, Linked, Open: is all this data doing us any good?’ The discussion was interesting, and it was good to see this topic, which has until recently been confined to a relatively niche community, brought to an ORG audience.
After discussing university campus-based ORG actions over lunch, I went along to a discussion of the future of copyright reform in the UK in the wake of the Hargreaves report. Peter Bradwell went through ORG’s submission to the government’s consultation on the Hargreaves measures. Saskia Walzel from Consumer Focus gave a comprehensive talk and had some interesting things to say about the role of consumers and artists themselves in copyright reform. Emily Goodhand (more commonly known as @copyrightgirl on Twitter) spoke about the University of Reading’s submission, and her perspective as Copyright and Compliance Officer there. Finally Professor Charlotte Waelde, head of Exeter Law School, took the common call for more evidence-based copyright policy and urged us to ask ‘What would evidence-based copyright policy actually look like?’. Particularly interesting for me, as both an interdisciplinary researcher and believer in evidence-based policy, was her question about what mixture of disciplines is needed to create conclusions to inform policy. It was also encouraging to see an almost entirely female panel and chair in what is too often a male-dominated community.
I spent the next session attending an open space discussion proposed by Steve Lawson, a musician, about the future of music in the digital age. It was great to hear the range of opinions – from data miners, web developers and a representative from the UK Pirate Party – and hear about some of the innovations in this space. I hope to talk to Steve in more detail soon in connection with a book I’m working on about consumer ethics/activism for the pirate generation.
Finally, we were sent off with a talk from Larry Lessig, on ‘recognising the fight we’re in’. His speech took in a bunch of different issues: open access to scholarly literature; the economics of the radio spectrum (featuring a hypothetical three-way battle between economist Ronald Coase, dictator Joseph Stalin and actress Hedy Lamarr [whom I’d never heard of, but who apparently co-invented ‘frequency hopping’, which paved the way for modern-day wireless communication]); and corruption in the US political system, the topic of his latest book.
In the Q+A I asked his opinion on academic piracy (the time-honoured practice of swapping PDFs to get around lack of institutional access, which has now evolved into the Twitter hashtag phenomenon #icanhazPDF), and whether he prefers the ‘green’ or ‘gold’ routes to open access. He seemed to generally endorse PDF-swapping. He came down on the side of ‘gold’ open access (where publishers become open-access), rather than ‘green’ (where academic departments self-archive), citing the importance of being able to do data-mining. I’m not convinced that data-mining isn’t possible under green OA, so long as self-archiving repositories are set up right (for example, Southampton’s eprints software is designed to enable this kind of thing).
After Lessig’s talk, about a hundred sweaty, thirsty digital rights activists descended on a nearby pub, then went for pizza, before saying our goodbyes until next time. All round it was a great conference; roll on ORGcon2013.