
It has long been the case that human activities produce large amounts of records of what has happened and of what people have said and done. Now that so much of what people do and say is done through connected electronic devices, and so much of what happens is monitored electronically, more and more data is produced, and it is easier than ever to gather very large amounts of that data. This is further facilitated by cheap and widely available storage capacity. Combined with cheap and readily available computing power, this makes it easier than ever to store and process huge volumes of data.

This has given rise to the phenomenon of “big data”. The data in question might be produced by social media, could be live data from sensors, be gathered by apps running on many users’ smartphones, or come from other sources. What matters is what is done with the data once it is stored. The idea of “big data” is that such data can be analysed.

Instead of relying on sampling or modelling of the data, analysis can now look for patterns in the large volumes of raw data. Instead of searching data to answer existing questions or confirm existing hypotheses, new algorithms are used to sift data for patterns and correlations. Once found, these patterns can be applied to new data to allow predictions to be made. This leads to improved decision making and targeting of resources, boosting efficiency.

The expression sometimes used is “the three Vs”: volume, variety and velocity. Volume refers to the very large datasets involved. Variety refers to the range of sources from which data can be combined. Velocity refers to the fact that data can be analysed in real time or near real time rather than after the fact.

Of course, it is not enough to have data. It must be in a useable format. Sometimes data has been collected or recorded in an unstructured way. Sometimes it may have been recorded in a legacy format. Pre-processing is therefore often needed in order to ensure that data is in a useable form.


The classic story, which may be apocryphal but which illustrates the principle, involves Target, a retailer in the US. A father was outraged that his teenage daughter received coupons for baby products. He complained to Target that his daughter obviously wasn’t pregnant and that such products were not appropriate for her. When customer services phoned to apologise to him, he admitted that he had now discovered that she was pregnant. Target’s big data analysis of patterns of purchasing had allowed them to determine from the daughter’s purchases that they were associated with pregnancy.

A better documented example is that of Google Translate, which uses publicly available translated texts, for example on multilingual websites, to find patterns it can match between the different language versions without the need to understand grammatical rules or the meanings of ambiguous words.

In another example from the US, Vree Health wanted to reduce readmission rates to hospital in cases of heart disease and pneumonia. They collected data during the hospital stay from admissions forms, and also from follow-up calls after discharge. To this they added data from patients interacting with their online system, clinical data and government historical medical data. Analysis of patterns in the data allowed Vree to predict who was likely to be readmitted by finding correlations with higher rates of readmission. Such patients can then be targeted to receive additional help.

Analysis can also be carried out in real time, in time-critical situations. For example, Visa uses big data, in real time, to monitor transactions. The aim is to reduce fraud while avoiding unnecessary rejection of genuine transactions. Hundreds of variables relating to each transaction, with tens of thousands of transactions per second, are analysed to look for patterns which might indicate a fraudulent transaction.

In the context of logistics and maintenance, big data can find patterns which allow better planning and targeting of maintenance schedules and supply shipments.

For example, UPS (United Parcel Service) gathers data from all of its delivery vehicles on performance, numbers of stops, routes taken and many other variables. By analysing all of this data they are able to calculate optimum routes for delivering all their packages. They also use data from the vehicles’ sensors to monitor patterns of parts failure and thereby predict when parts will need replacing. Reducing route distances saves fuel and therefore money. Accurately predicting maintenance requirements minimises the downtime of vehicles and other equipment, which optimises their use and thus saves money.

Governments can also make use of big data techniques. HMRC’s Connect system, developed by BAE Systems, takes data from tax records, Companies House, bank records, property ownership registers and electoral registers, amongst others. It analyses all of this data to find discrepancies and thereby identify individuals who have evaded tax. The system identifies relationships and networks of relationships between companies and between individuals which were not readily apparent, and has caught billions in evaded tax liabilities.

Big data methods can also fail when they are not used properly. In 2008 Google began a project which attempted to map and predict the spread of flu. The project relied on tracking Google searches for flu-related information, or information relating to flu symptoms. The starting point was the Centers for Disease Control and Prevention’s (CDC) data on flu prevalence. Google then took search terms which correlated with the prevalence of flu according to the CDC. These correlations were then used to predict the spread of flu. However, in so much data, much of it is irrelevant, and some correlations are bound to occur by chance. These chance correlations with the CDC data were therefore presumed to be predictive of flu, but were not. Consequently, Google Flu Trends was wildly inaccurate in its predictions, massively overestimating the prevalence of flu in the US.
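The point about chance correlations can be illustrated with a small simulation. The sketch below is a toy example only and is not Google's method: the function names, the series lengths and the number of candidate series are all assumptions made for illustration. It generates a random "target" series and many unrelated random series, picks the candidate that correlates best with the target over an initial period, and then checks how that same candidate performs on later data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_chance_correlation(n_points=20, n_candidates=2000, seed=0):
    """Pick the random series that best matches a random target in-sample.

    Returns (in_sample_r, holdout_r). Every series is pure noise, so any
    correlation found in the first half is a coincidence by construction.
    """
    rng = random.Random(seed)
    total = 2 * n_points  # first half for selection, second half for checking
    target = [rng.gauss(0, 1) for _ in range(total)]

    best_r, best_series = 0.0, None
    for _ in range(n_candidates):
        series = [rng.gauss(0, 1) for _ in range(total)]
        r = pearson(series[:n_points], target[:n_points])
        if abs(r) > abs(best_r):
            best_r, best_series = r, series

    # Re-test the "winning" series on data it was not selected on.
    holdout_r = pearson(best_series[n_points:], target[n_points:])
    return best_r, holdout_r


in_sample, holdout = best_chance_correlation()
print(f"best in-sample correlation: {in_sample:+.2f}")
print(f"same series on later data:  {holdout:+.2f}")
```

With enough candidates, some series will correlate strongly with the target over the selection period even though none has any real connection to it. Checking a selected pattern against data it was not chosen on is one common safeguard against mistaking such coincidences for predictive signals.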

Big Data and Data Protection

In some instances big data will involve processing of personal data. It will therefore need to comply with data protection legislation. Whether data is personal is an assessment that needs to be made in each case. Data which is clearly about an individual, such as medical data or employment data, is personal data. It is not necessary that data be associated with a named person to be considered personal; it is enough that it can identify a specific individual. For example, CCTV footage could identify a particular person as being somewhere or having a particular routine, without the operator knowing their name. The purpose of the data processing and who is processing it can also be relevant. A photo of a crowd by a journalist to show that there is a crowd is not personal data. However, the same photo used by the police or others to identify faces in the crowd is personal data.

Not all data collected in large quantities has a personal element to it. Examples include data relating to the weather, or the vast amounts of data collected by the LHC at CERN. To analyse large amounts of personal data using the methods of big data, the data must first be anonymised. Anonymisation involves the removal of any data by which an individual or individuals could be identified. Again, this need not necessarily mean identification by name. For example, in the case of clinical trials, the European Medicines Agency (EMA) suggests three methods: masking, randomisation and generalisation. Masking means simply removing variables from the data which would allow the identification of individuals. Randomisation means introducing false data in a way which prevents identification of the data subjects without eliminating the utility of the data. For example, dates could be shifted or names changed. Generalisation consists of replacing personal data with more general information, such as replacing a name and date of birth with an age range.
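The three methods described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the EMA's procedure: the record layout, the field names (`name`, `date_of_birth`, `visit_date`, `diagnosis`) and the function names are all hypothetical, and applying these steps does not by itself guarantee that the resulting data is anonymous in the legal sense.

```python
import random
from datetime import date, timedelta

# Hypothetical toy record; every field name here is an illustrative assumption.
records = [
    {
        "name": "Alice Smith",
        "date_of_birth": date(1980, 5, 17),
        "visit_date": date(2016, 3, 1),
        "diagnosis": "pneumonia",
    },
]


def mask(record, fields=("name",)):
    """Masking: remove variables that directly identify the individual."""
    return {k: v for k, v in record.items() if k not in fields}


def randomise(record, rng, max_shift_days=30):
    """Randomisation: shift the visit date by a random, non-zero number of days."""
    offsets = [d for d in range(-max_shift_days, max_shift_days + 1) if d != 0]
    out = dict(record)
    out["visit_date"] = record["visit_date"] + timedelta(days=rng.choice(offsets))
    return out


def generalise(record, reference_date, band=10):
    """Generalisation: replace the date of birth with an age range."""
    dob = record["date_of_birth"]
    had_birthday = (reference_date.month, reference_date.day) >= (dob.month, dob.day)
    age = reference_date.year - dob.year - (0 if had_birthday else 1)
    lower = (age // band) * band
    out = {k: v for k, v in record.items() if k != "date_of_birth"}
    out["age_range"] = f"{lower}-{lower + band - 1}"
    return out


def anonymise(record, rng, reference_date):
    """Apply masking, then randomisation, then generalisation."""
    return generalise(randomise(mask(record), rng), reference_date)


rng = random.Random(0)
for rec in records:
    print(anonymise(rec, rng, reference_date=date(2017, 1, 1)))
```

In practice the choice of which fields to mask, how far to shift dates and how wide to make the bands depends on how easily the remaining data could be combined to single out an individual.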

The EU’s new General Data Protection Regulation (GDPR), which is due to apply from May 2018, and which the UK Government has indicated it will implement despite Brexit, introduces the concept of “pseudonymisation” of data. Pseudonymous data is data from which individual data subjects cannot be identified without the use of additional data which is not part of that data set. For example, data subjects could be identified by a unique reference number, provided that the data showing which reference number corresponded with which individual was kept secure in a separate database. Whereas anonymisation requires that it not be possible to re-identify a data subject, for example by combining different pieces of data which remain in order to narrow down to an individual, pseudonymisation only requires that re-identification not be “reasonably likely”. This constitutes a relaxation of earlier requirements in order to facilitate big data.
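The reference-number approach described above might look something like the following sketch. It is an illustration only: the function names, the `name` and `ref` fields and the use of random hexadecimal tokens as reference numbers are assumptions, and the code says nothing about how the separate lookup table should actually be secured.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Replace the identifying field with a random reference number.

    Returns (pseudonymous_records, key_table). The key table maps each
    identity to its reference and is meant to be stored separately from,
    and more securely than, the pseudonymous records.
    """
    key_table = {}
    pseudonymous = []
    for record in records:
        identity = record[id_field]
        if identity not in key_table:
            # A random token reveals nothing about the identity it stands for.
            key_table[identity] = secrets.token_hex(8)
        out = {k: v for k, v in record.items() if k != id_field}
        out["ref"] = key_table[identity]
        pseudonymous.append(out)
    return pseudonymous, key_table


def re_identify(ref, key_table):
    """Look up an identity from a reference; only possible with the key table."""
    for identity, known_ref in key_table.items():
        if known_ref == ref:
            return identity
    return None


# Hypothetical example data.
visits = [
    {"name": "Alice Smith", "diagnosis": "pneumonia"},
    {"name": "Bob Jones", "diagnosis": "heart disease"},
    {"name": "Alice Smith", "diagnosis": "follow-up"},
]
data, keys = pseudonymise(visits)
print(data)  # safe to analyse; contains references but no names
```

Because the same individual always receives the same reference, records about one person can still be linked for analysis, while re-identification requires access to the separately held key table.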


Big data is certainly on the rise and there is huge untapped potential. Research for e-Skills UK (now the Tech Partnership) determined that of businesses with more than 100 employees, only 14% made use of big data, but they expected that to rise to 29% by 2017. The House of Commons Science and Technology Committee in 2016 said that only about 12% of business data was being analysed and that this was holding back productivity growth. However, they also thought that if the potential was unlocked, big data could lead to the creation of 58,000 jobs and add £216bn to the economy. Clearly, unlocking this potential must not compromise respect for the privacy of individuals, but privacy concerns, if handled properly, need not derail this data revolution.

Graham Chandler, Trainee Solicitor


