Researching Personal Data
I recently delivered a seminar for the Southampton University Cyber Security seminar series. My talk introduced some of the research I’ve been doing into the UK’s Data Protection Register, and was entitled ‘Data Controller Registers: Waste of Time or Untapped Transparency Goldmine?’.
The idea of a register of data controllers came from the EU Data Protection Directive, which set out a blueprint for member states’ data protection laws. Data controllers – any entities responsible for the collection and use of personal data – must provide details about the purposes of collection, categories of data subjects, categories of personal data, any recipients, and any international data transfers, to the supervisory authority (in the UK, this is the Information Commissioner’s Office). This represents a rich data source on the use of personal data by over 350,000 UK entities.
My talk explored some initial results from my research into three years’ worth of data from this register. A number of broad trends have been identified, including:
The amount of personal data collection reported is increasing. This is measured in terms of the number of distinct register entries for individual instances of data collection, which have increased by around 3% each year.
There are over 60 different stated reasons for collection of data, with ‘Staff Administration’, ‘Accounts & Records’ and ‘Advertising, Marketing & Public Relations’ being the most popular (outnumbering all other purposes combined).
The categories of personal data collected exhibit a similar ‘long tail’, with ten very common categories (including ‘Personal Details’, ‘Financial Details’ and ‘Goods or Services Provided’) accounting for the majority of instances.
In terms of transfers of data outside the EU, the vast majority of international data transfers are described as ‘Worldwide’. Of those who do specify, the most popular countries are the U.S., Canada, Australia, New Zealand and India.
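As a rough illustration of how trends like these can be tallied, here is a minimal Python sketch. The record layout (dictionaries with `controller` and `purpose` keys) and the sample entries are invented for the example; they are not the register's actual format or my real figures.

```python
from collections import Counter

# Hypothetical register entries: one row per stated purpose per data controller.
entries = [
    {"controller": "A Ltd", "purpose": "Staff Administration"},
    {"controller": "A Ltd", "purpose": "Accounts & Records"},
    {"controller": "B plc", "purpose": "Staff Administration"},
    {"controller": "C LLP", "purpose": "Advertising, Marketing & Public Relations"},
    {"controller": "C LLP", "purpose": "Research"},
]


def purpose_counts(rows):
    """Count how many register entries state each purpose."""
    return Counter(row["purpose"] for row in rows)


def top_share(counts, n):
    """Fraction of all entries covered by the n most common purposes."""
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(n))
    return top / total if total else 0.0


counts = purpose_counts(entries)
print(counts.most_common(3))
print(f"Top 3 purposes cover {top_share(counts, 3):.0%} of entries")
```

The same two functions could be pointed at categories of personal data or transfer destinations by changing the key that is counted.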
Beyond these general trends, I explored one particular category of personal data collection which has been raised as a concern in studies of EU public attitudes, namely, trading and sharing of personal data. The kinds of data likely to be collected for this purpose are broadly reflective of the general trends, with the exception of ‘membership details’, which are far more likely to be collected for the purpose of trading.
Digging further into this category, I selected one particularly sensitive kind of data – ‘Sexual Life’ – to see how this was being used. This uncovered 349 data controllers (in the summer 2012 dataset) who hold data about individuals’ sexual lives for the purpose of trading and sharing with other entities. I visualised this activity as a network graph, looking at the relationship between individual data controllers and the kinds of entities they share this information with. Clicking on the blue nodes shows individual data controllers, while categories of recipients are in yellow (note: WordPress won’t allow me to embed this in an iframe): Trading / Sharing Data about Sexual Life.
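For readers curious about the structure behind such a graph, here is a minimal Python sketch of a two-sided network linking data controllers to recipient categories. The controller names and recipient categories are placeholders, and this is not the code behind the actual visualisation.

```python
from collections import defaultdict

# Hypothetical (data controller, recipient category) pairs taken from register entries.
shares = [
    ("Controller A", "Traders in personal data"),
    ("Controller A", "Business associates"),
    ("Controller B", "Traders in personal data"),
]


def build_graph(pairs):
    """Return two adjacency maps: controller -> recipients and recipient -> controllers."""
    by_controller = defaultdict(set)
    by_recipient = defaultdict(set)
    for controller, recipient in pairs:
        by_controller[controller].add(recipient)
        by_recipient[recipient].add(controller)
    return by_controller, by_recipient


by_controller, by_recipient = build_graph(shares)
for recipient, controllers in sorted(by_recipient.items()):
    print(f"{recipient}: {len(controllers)} controller(s)")
```

Keeping both directions of the mapping makes it cheap to answer either question: who a given controller shares with, or which controllers feed a given kind of recipient.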
I also explored how this dataset can be used to create personalised transparency tools, or to ‘visualise your digital footprint’. By identifying the organisations, employers, retailers and suppliers who have my personal details, I can pull in their entries from the register in order to see who knows what about me, what kinds of recipients they’re sharing it with and why. A similar interactive network graph shows a sample of this digital footprint.
Open data is often seen as in tension with privacy. However, through this research I hope to demonstrate some of the ways that open data can address privacy concerns. These concerns often stem from a lack of transparency about the collection and use of personal data by data controllers. By providing knowledge about data controllers, open data can be a basis for accountability and transparency about the use (or abuse) of personal data.
As a volunteer ‘data donor’ at the Midata Innovation Lab, I’ve recently been attempting to get my data back from a range of suppliers. As our lives become more data-driven, an increasing number of people want access to a copy of the data gathered about them by service providers, personal devices and online platforms. Whether it’s financial transactions data, activity records from a Fitbit or Nike Fuelband, or gas and electricity usage, access to our own data has the potential to drive new services that help us manage our lives and gain self-insight. But anyone who has attempted to get their own data back from service providers will know the process is not always simple. I encountered a variety of complicated access procedures, data formats, and degrees of detail.
For instance, BT gave me access to my latest bill as a CSV file, but previous months were only available as PDF documents. And my broadband usage was displayed as a web page in a separate part of the site. Wouldn’t it be useful to have everything – broadband usage, landline, and billing – in one file, covering, say, the last year of service? Or, even better, a secure API which would allow trusted applications to access the latest data directly from my BT account, so I don’t have to?
Another problem was that in order to get my data, I sometimes had to sign up for unwanted services. My mobile network provider, GiffGaff, require me to opt in to their marketing messages in order to receive my monthly usage report. Fitbit users need to pay for a premium account to get access to the raw data from their own device.
Wouldn’t it be nice to rate these services according to a set of best practices? In 2006, when the open data movement was in its infancy, Tim Berners-Lee defined ‘Five Stars of Open Data’ to describe how ‘open’ a data source is. If it’s on the web under an open license, it gets one star. Five stars means that it is in a machine-readable, non-proprietary format, and uses URIs and links to other data for context. While we don’t necessarily want our private, personal data to be ‘open’ in Berners-Lee’s sense, we do want standard ways to get access to our personal data from a service. So, here are my suggested ‘Five Stars of Personal Data Access’ (to be read as complementary, not necessarily hierarchical):
1. My data is made available to me for free in a digital form. For instance, through a web dashboard, or email, rather than as a paper statement. There are no strings attached; I do not need to pay for premium services or sign up to marketing alerts to read it.
2. My data is machine-readable (such as CSV rather than PDF).
3. My data is in a non-proprietary format (such as CSV, XML or JSON, rather than Excel).
4. My data is complete; all the relevant fields are included in the same place. For instance, usage history and billing are included in the same file or feed.
5. My data is up-to-date; available as a regularly-updated feed, rather than a static file I have to look up and download. This could be via a secure API that I can connect trusted third-party services to.
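To make the star ratings a little more concrete, here is a minimal Python sketch that scores a service against the five criteria. The class name, field names and the example supplier are all hypothetical, and because the stars are complementary rather than hierarchical, each one is simply counted on its own.

```python
from dataclasses import dataclass, fields


@dataclass
class AccessRating:
    """One flag per star; each star is judged independently of the others."""
    free_and_digital: bool   # 1. free, digital, no strings attached
    machine_readable: bool   # 2. e.g. CSV rather than PDF
    non_proprietary: bool    # 3. e.g. CSV, XML or JSON rather than Excel
    complete: bool           # 4. all relevant fields in one place
    up_to_date: bool         # 5. regularly updated feed or secure API

    def stars(self):
        """Number of criteria the service meets."""
        return sum(getattr(self, f.name) for f in fields(self))

    def missing(self):
        """Names of the criteria the service does not meet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# A made-up supplier: free CSV downloads, but split across files and updated by hand.
example_supplier = AccessRating(True, True, True, False, False)
print(example_supplier.stars(), "stars; missing:", example_supplier.missing())
```

Listing the missing criteria alongside the count keeps the rating useful as feedback to a supplier, rather than just a score.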
The Midata programme has considered these issues from the outset, calling for suppliers to adopt common procedures and formats. Simplifying this process is an important step towards a world where individuals are empowered by their own data. My initial attempts to get my data back from suppliers point to a number of areas for improvement, which I’ve tried to reflect in these star ratings. Of course, there’s lots of room for debate over the definitions I’ve given here. And I’m sure there are other important aspects I’ve missed out. What would you add?
I recently attended OrgCon2013, the Open Rights Group’s annual conference. As in previous years, this was an excellent opportunity to catch up on the latest developments in a range of UK and international digital rights issues. It was perfectly timed to coincide with the news about the NSA surveillance leak, a story which found its way into virtually every talk I attended throughout the day, including my own short presentation on ‘Open Data for Privacy’. I’d particularly recommend watching Caspar Bowden’s excellent talk on wiretapping the cloud – very timely given the aforementioned NSA story.
I’ve posted up my slides, and a related network graph visualisation here. I’m hosting them on a new website I’ve set up to host some outputs from my research into open data and privacy – MyDataTransparency.org. Suggestions / collaborations welcome.
And thanks to ORG for putting on another great event and having me talk!
It’s just over five years since the publication of Nudge, the seminal pop behavioural economics book by Richard Thaler and Cass Sunstein. Drawing from research in psychology and behavioural economics, it revealed the many common cognitive biases, fallacies, and heuristics we all suffer from. We often fail to act in our own self-interest, because our everyday decisions are affected by ‘choice architectures’: the particular way a set of options is presented. ‘Choice architects’ (as the authors call them) cannot help but influence the decisions people make.
Thaler and Sunstein encourage policy-makers to adopt a ‘libertarian paternalist’ approach; acknowledge that the systems they design and regulate inevitably affect people’s decisions, and design them so as to induce people to make decisions which are good for them. Their recommendations were enthusiastically picked up by governments (in the UK, the cabinet office even set up a dedicated behavioural insights team). The dust has now settled on the debate, and the approach has been explored in a variety of settings, from pension plans to hygiene in public toilets.
But libertarian paternalism has been criticised as an oxymoron; how is interference with an individual’s decisions, even when in their genuine best interests, compatible with respecting their autonomy? The authors responded that non-interference was not an option. In many cases, there is no neutral choice architecture. A list of pension plans must be presented in some order, and if you know that people tend to pick the first one regardless of its features, you ought to make it the one that seems best for them.
Whilst I’m sympathetic to Thaler and Sunstein’s response to the oxymoron charge, the ethical debate shouldn’t end there. Perhaps the question of autonomy and paternalism can be tackled head-on by asking how individuals might design their own choice architectures. If I know that I am liable to make poor decisions in certain contexts, I want to be able to nudge myself to correct that. I don’t want to rely solely on a benevolent system designer / policy-maker to do it for me. I want systems to ensure that my everyday, unconsidered behaviours, made in the heat-of-the-moment, are consistent with my life goals, which I define in more carefully considered, reflective states of mind.
In our digital lives, choice architectures are everywhere, highly optimised and A/B tested, designed to make you click exactly the way the platform wants you to. But there is also the possibility that they can be reconfigured by the individual to suit their will. An individual can tailor their web experience by configuring their browser to exclude unwanted aspects and superimpose additional functions onto the sites they visit.
This general capacity – for content, functionality and presentation to be altered by the individual – is a pre-requisite for refashioning choice architectures in our own favour. Services like RescueTime, which blocks certain websites for certain periods, represent a very basic kind of user-defined choice architecture which simply removes certain choices altogether. But more sophisticated systems would take an individual’s own carefully considered life goals – say, to eat healthily, be prudent, or get a broader perspective on the world – and construct their digital experiences to nudge behaviour which furthers those goals.
Take, for instance, online privacy. Research by behavioural economist Alessandro Acquisti and colleagues at CMU has shown how effective nudging privacy can be. The potential for user-defined privacy nudges is strong. In a reflective, rational state, I may set myself a goal to keep my personal life private from my professional life. An intelligent privacy management system could take that goal and insert nudges into the choice architectures which might otherwise induce me to mess up. For instance, by alerting me when I’m about to accept a work colleague as a friend on a personal social network.
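As a toy illustration of what a user-defined privacy nudge might look like, here is a minimal Python sketch. The function name, its parameters and the contact list are hypothetical; no real social network exposes exactly this hook.

```python
def nudge_for_friend_request(keep_work_separate, requester, work_contacts):
    """Return a warning if accepting the request conflicts with the user's goal, else None."""
    if keep_work_separate and requester in work_contacts:
        return (
            f"{requester} is a work contact. Accepting would mix your "
            "personal and professional lives. Accept anyway?"
        )
    return None


# The goal is set once, in a reflective moment; the check runs at decision time.
work_contacts = {"Sam from Accounts", "My line manager"}
print(nudge_for_friend_request(True, "Sam from Accounts", work_contacts))
print(nudge_for_friend_request(True, "An old school friend", work_contacts))
```

The important design choice is that the nudge only warns and asks; the final decision stays with the individual, which is what separates this from simply blocking the option.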
Next generation nudge systems should enable a user-defined choice architecture layer, which can be superimposed over the existing choice architectures. This would allow individuals to A/B test their decision-making and habits, and optimise them for their own ends. Ignoring the power of nudges is no longer a realistic or desirable option. We need intentionally designed choice architectures to help us navigate the complex world we live in. But the aims embedded in these architectures need to be driven by our own values, priorities and life goals.
If Facebook were a state, it would be the third most populated in the world, just ahead of the USA and behind India. Like the former Soviet Union, which occupied the same third place slot at its peak, the state of Facebook rules over a geographically and culturally diverse citizenry. And like the USSR in 1990, this disparate social network may be at the beginning of its decline.
I’ll resist the urge to draw further fatuous parallels – between, say, Stalin’s centralised planning and Zuckerberg’s centralised business model, or Gorbachev’s collapsing economy and the social network’s dismal performance on the stock market – fun as they might be. There are early signs of Facebook’s eventual dissolution, cracks which have appeared over the last six months. Facebook lost 10 million US visitors in the last year. Monthly visits in Europe are down. Its incredible international growth rate is beginning to plateau. And ‘Home’, the Facebook-smeared Android smartphone interface, appears to have flopped.
I’m just one data-point in all this, but I’ve been quietly engineering my own secession from Facebook over the last few weeks. I won’t go over some of the good reasons to leave Facebook (Paul Bernal has eloquently outlined ten of them already). I’ve always been a reluctant user, but equally reluctant to leave. Enough of my personal (and worryingly, professional) communication seems to come through Facebook that leaving altogether doesn’t seem to be an option, yet. Instead, I’ve taken a less drastic approach in the interim, which means I should never have to log in to Facebook again (except, perhaps, to delete my account).
- Exported (almost) all my data.
- Removed (almost) all the information from my account.
- Deleted the Facebook and Facebook Messenger apps from my smartphone and tablet.
- Set up RSS feeds for pages.
- Set email notifications for group posts and events.
- Exported all my friends’ birthdays into a calendar, and set up a weekly update of upcoming birthdays.
- Finally, exported all my friends’ email addresses, so I can communicate via email instead. This was the hardest one. I had to sign up to Yahoo Mail (the only service Facebook will allow email imports into), and then run a scraping script on an HTML page to get them into a CSV format, before finally importing that into my email contacts. Thanks to @joincamp for the guide.
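For anyone attempting the last step, here is a minimal Python sketch of the general idea: pull email addresses out of a saved HTML page and write them out as CSV. This is not the script from the guide I followed; the sample page, the simple email pattern and the single-column layout are all assumptions for illustration.

```python
import csv
import io
import re
from html.parser import HTMLParser

# A deliberately simple pattern; it will not match every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class TextCollector(HTMLParser):
    """Collect the text chunks of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_emails(html):
    """Return the unique email addresses found in the page text, in order of appearance."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    found = EMAIL_RE.findall(" ".join(parser.chunks))
    return list(dict.fromkeys(found))


def to_csv(emails):
    """Render the addresses as a one-column CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["email"])
    writer.writerows([email] for email in emails)
    return buffer.getvalue()


# Hypothetical saved contacts page.
sample_page = """
<html><body><ul>
  <li>Alice Example: alice@example.com</li>
  <li>Bob Example: bob@example.org</li>
  <li>alice@example.com</li>
</ul></body></html>
"""

emails = extract_emails(sample_page)
print(to_csv(emails))
```

Most email clients expect extra columns such as a name, so the header row would need adjusting to whatever your contacts import requires.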
This way, I still get to hear about the important stuff, without exposing my eyeballs, or much of my data, to Facebook. It’s also given me the chance to experiment with other means of personal communication. Email feels very personal again. I’m working on my telephone manner. Postcards are also fun.
A recently unemployed graduate walks into a job centre to attend a work skills session, a condition of receiving unemployment benefit. As part of a new drive to integrate social media into the job search process, he is asked to create an online profile on the popular micro-blogging platform Twitter. The supervisor tells him that by interacting with the accounts of potential employers, he may land himself a new job.
Five years ago, this experience (recently relayed to me by a friend) would have been farcical. Twitter was considered just a fad amongst Silicon Valley early-adopters. It’s now used by every brand, institution or service, from Her Majesty The Queen to the shipping forecast, as well as the rest of us ordinary people.
Using Twitter to get a job is not necessarily a bad idea. It might work well for some people, in some sectors. Good luck to them. But when there is an expectation that we adopt commercial media platforms as a precondition of entering the job market, something has gone wrong. Amidst confusion over what constitutes our digital identity, we’re being encouraged to construct public digital selves in order to please potential employers.
We can do better. We need better tools to match jobseekers to appropriate vacancies, that protect individual privacy, and provide authentication of qualifications and work history. Twitter is an informal, ephemeral, public medium. It is no substitute for trusted, public, digital infrastructure fit for the 21st-century job market.
What is the point of web standards? Ask someone who remembers the early days of web development and you will get a lecture on the mess that came from the early proliferation of incompatible platforms, languages and formats. Then (so the lecture goes), the World Wide Web Consortium came along and tidied everything up. They made open standards that anyone could implement and use regardless of browser, operating system, disability or device. Businesses who tried to capture their users with proprietary standards eventually lost out to openness. End of lecture. It is a history lesson worth repeating, but the recent debate over DRM in HTML5 illustrates how the moral of the story can actually be used to different ends by competing interests.
In one sense, standards and the bodies who set them are neutral; they do not make a value judgement on the activity covered by the standard. Whether you’re publishing metadata, embedding videos in your website, or displaying text in Comic Sans font, the W3C isn’t there to comment on whether that’s a good or bad thing. The W3C is there to help stakeholders come to a consensus on one common way of implementing that particular thing, so that there isn’t a proliferation of different methods that put up barriers to use. Imagine the W3C decided that because the Comic Sans font is ugly, they will no longer support it in the next HTML specification. While some of us might be happy with this decision, it would be a clear abdication of their responsibility to maintain neutrality.
There are echoes of this line of thought in the arguments put forward by proponents of the W3C’s Encrypted Media Extensions standard. The EME proposal is to create a standard for applying restrictions to content in the HTML5 specification. The proposal refers to ‘Content Decryption Modules’ (CDMs) rather than ‘Digital Rights Management’ (DRM), and DRM is really a subset of CDM – but everyone knows that DRM is the primary use case here. The standard would cover how websites can require clients to be running approved CDMs. Effectively, it provides a standard way to embed DRM software into web video. I’m not going to rehearse the arguments against this proposal (which are, in my opinion, persuasive). Rather, I’m interested in the ways ‘neutrality’ and ‘openness’ are appealed to in the debate.
In one sense, the W3C is being ‘neutral’; key stakeholders have been applying DRM to web content for years, but there are still no standards for doing so. And the lack of standards can create problems for those delivering DRM content via the web, and for those attempting to consume it. As with the use of the Comic Sans font, so for DRM content; the W3C should remain neutral as to the value of the activity (applying DRM), but be ready to create standards for it.
The problem with this view is that sometimes, blanket application of a principle actually undermines that very principle. When the grandfather of Liberalism John Stuart Mill advocated individual liberty, he had the good sense to see that liberty for one occasionally needs to be curtailed in order to promote liberty for all. The same applies to open standards for inherently non-open technologies.
If the W3C were to create a standard for implementing DRM, they would promote interoperability and openness in the application of DRM to web content. But in doing so, they would undermine interoperability and openness for the web as a whole. This is because creating standards for DRM both facilitates DRM on a technical level, and implicitly endorses it on a policy level; and DRM is inherently in conflict with openness and interoperability. The difference between taking a value stance on Comic Sans and taking a value stance on DRM is that the former takes the W3C into the realm of aesthetic judgement, which is beyond its remit. The latter, on the other hand, is objectionable on grounds of openness and interoperability, the very principles the W3C seeks to promote. In such cases, it would be perfectly legitimate for the W3C to make a value judgement. Indeed, failure to do so undermines all the great work that the organisation has already done to create an open web.
Should Government agencies tasked with protecting our privacy make their investigations more transparent and open?
I spotted this story on (eminent IT law professor) Michael Geist’s blog, discussing a recent study by the Canadian Privacy Commissioner Jennifer Stoddart into how well popular e-commerce and media websites in Canada protect their users’ personal information and seek informed consent. This is important work; the kind of pro-active investigation into privacy practices that sets a good example to other authorities tasked with protecting citizens’ personal data.
However, while the results of the study have been published, the Commissioner declined to name names of those websites it investigated. Geist rightly points out that this secrecy denies individuals the opportunity to reassess their use of the offending websites. Amid calls from the Commissioner for greater transparency in data protection generally – such as better security breach notification – this decision goes against the trend, and seems, to me, a missed opportunity.
This isn’t just about naming and shaming the bad guys. It is as much about encouraging good practice where it appears. But this evaluation should take place in the open. Privacy and Data Protection commissioners should leverage the power of public pressure to improve company privacy practices, rather than relying solely on their own enforcement powers.
Identifying the subjects of such investigations is not a radical suggestion. It has already happened in a number of high-profile investigations undertaken by the Canadian Privacy Commissioner (into Google and Facebook), as well as by its counterparts in other countries. The Irish Data Protection Commissioner has made the results of its investigation into Facebook openly available. The UK Information Commissioner’s Office regularly identifies the targets of its investigations. While the privacy of individual data controllers should be respected, the privacy of individual data subjects should come before the ‘privacy’ of organisations and businesses.
As I wrote in my last blog post, openness and transparency from those government agencies tasked with enforcing data protection has the potential to alleviate modern privacy concerns. The data and knowledge they hold should be considered basic public infrastructure for sound privacy decisions. Opening up data protection registers could help reveal who is doing what with our personal data. Investigations undertaken by the authorities into websites’ privacy practices are another important source of information to empower individual users. The more information we have about who is collecting our data and how well they are protecting it, the better we can assess their trustworthiness.
Summary: ‘Openness’ may create privacy problems, but it also has a role to play in solving them.
Context: Having been at the Open Knowledge Festival last week, I came away wondering how open data and openness might intersect with my PhD research on personal data and privacy. What follows is a rather abstract account of how two vaguely defined terms – ‘openness’ and ‘privacy’ – might conflict or complement one another.
The movement for ‘openness’ – open knowledge, open data, open government, open culture – has been picking up pace in recent years. Universal access to information and data, for re-use and remix, is heralded as a guiding principle for the digital age.
But if opening up information becomes the norm, what will be the effect on individual privacy? Open data – increasingly released by governments, businesses and organisations – might contain personally identifiable information (PII), or information which would enrich pre-existing PII. The failure of attempts to anonymize such data has been well documented. On the face of it, the drive to open up information resources will always have potentially damaging effects on the privacy of individuals. For those who are both drawn to the emancipatory power of openness, and yet concerned for the gradual erosion of individual privacy, a conflict arises between these two apparently irreconcilable principles.
But reconciliation may be possible; rather than only creating and exacerbating problems of privacy, openness can go some way to solving them too. We now live in a time where the flow of personal data is inevitable; it is the oil of the digital economy. But when personal data is collected covertly, without informed consent, and traded with unknown third parties, the problem is not openness. When governments sell ‘anonymised’ personal data to private companies who de-anonymise it, the problem is not openness. When someone doesn’t get hired on the basis of personal data which the prospective employer obtained secretly from an unknown source, the problem is not openness. None of these privacy-violating nightmare scenarios are caused by openness. Rather, they are a consequence of the secrecy, opacity and information asymmetries which characterize the current approach to personal data.
Far from exacerbating privacy concerns, openness could be a powerful counterweight to them. Universal access to information and data about the collection, exchange and use of personal data is a building block for informed privacy choices. Open data and information on privacy policies, data protection registers, audits and compliance with data protection law should be seen as an essential piece of public infrastructure on which informed privacy decisions and privacy-enhancing services can be built.
The solution to privacy concerns cannot be to turn back the clock on openness. Instead, the power of openness should be applied to the infrastructure on which sound privacy decisions depend, bringing real transparency and accountability to the collection and use of personal data.