Identification, user detection or simply web tracking all mean computing and setting a special identifier for each browser that visits a certain site. By and large, it was not originally designed as a ‘global evil’. Like everything else it has another side of the coin: it was created to provide a benefit, for example to let website owners distinguish real users from bots, or to save a user’s preferences and apply them on later visits. At the same time, however, this capability caught the fancy of advertisers. As you know, cookies are the most popular way to identify users, and they have been used in advertising since the 1990s.
Much has changed since then. Technology has taken a huge step forward, and today we can track users not only with cookies but in many other ways. The most obvious method is to set an identifier similar to a cookie. Another is to use information about the user’s computer that we can obtain from the HTTP headers of the requests it sends: address, OS type, time and so on. And, finally, we can distinguish a user by his habits and behaviour (the way he moves the cursor, his favourite site sections and so on).
This approach is quite obvious: all we need to do is place some long-lived identifier on the user’s side that we can request on later visits. Modern browsers allow this to be done in a way that is transparent to the user. First and foremost, we can use good old cookies. Then there are the specific features of certain plug-ins that offer cookie-like functionality, such as Local Shared Objects in Flash or Isolated Storage in Silverlight. HTML5 also includes several storage mechanisms, including localStorage, File and IndexedDB API. Besides, we can save unique markers in cached resources on the local machine or in cache metadata (Last-Modified, ETag). Furthermore, we can identify a user by fingerprints obtained from Origin Bound certificates, which the browser generates for SSL connections, or by the data contained in SDCH dictionaries. In short, there are plenty of possibilities.
Unlike the other mechanisms that we are going to discuss later, the use of cookies is transparent to the end user. Even so, an identifier does not have to be stored in a separate cookie to ‘mark’ a user: it can be assembled from several other cookies or stored in metadata such as Expiration Time. That is why it is quite difficult to tell whether a given cookie is being used for tracking or not.
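As a rough illustration of an identifier that is not kept in one dedicated cookie, the Python sketch below spreads a 32-bit value over several ordinary-looking cookies and rebuilds it from the Cookie request header. The cookie names (`theme`, `layout`, `tz`, `font`) and the one-byte-per-cookie split are hypothetical choices made for this example only; real trackers may do something quite different.

```python
from http.cookies import SimpleCookie
from typing import Dict, Optional

# Hypothetical, innocuous-looking cookie names; each one carries 8 bits of the ID.
CARRIER_COOKIES = ["theme", "layout", "tz", "font"]


def split_identifier(user_id: int) -> Dict[str, str]:
    """Spread a 32-bit identifier over the carrier cookies, one byte each."""
    if not 0 <= user_id < 2 ** 32:
        raise ValueError("identifier must fit in 32 bits")
    raw = user_id.to_bytes(4, "big")
    return {name: str(byte) for name, byte in zip(CARRIER_COOKIES, raw)}


def recover_identifier(cookie_header: str) -> Optional[int]:
    """Rebuild the identifier from a Cookie header, or return None if incomplete."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    try:
        raw = bytes(int(jar[name].value) for name in CARRIER_COOKIES)
    except (KeyError, ValueError):
        return None  # a part is missing or malformed: treat as a new visitor
    return int.from_bytes(raw, "big")


parts = split_identifier(0xDEADBEEF)
header = "; ".join(f"{name}={value}" for name, value in parts.items())
print(header)
print(recover_identifier(header) == 0xDEADBEEF)
```

No single cookie here looks like an identifier, which is the point the paragraph makes: inspecting cookies one at a time does not reveal whether they are used for tracking.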
Local Shared Objects
To store data on the client side, Adobe Flash uses the LSO mechanism. It is the Flash analogue of HTTP cookies; unlike them, however, it can store more than short fragments of text, which complicates the analysis and inspection of such objects. Until version 10.3, the behaviour of Flash cookies was configured separately from the browser settings: you had to go to the Flash Settings Manager located on the macromedia.com site (by the way, it is still available at bit.ly/1nieRVb). Today you can do it right from the control panel. Furthermore, most modern browsers integrate quite well with Flash Player, so when cookies and other site data are deleted, the LSOs are deleted as well. On the other hand, the integration is not complete: a browser’s third-party cookie policy will not always apply to Flash cookies (instructions for turning them off manually are at adobe.ly/1svWIot).
Silverlight Isolated Storage
The Silverlight software platform is quite similar to Adobe Flash. For example, its Isolated Storage mechanism is the analogue of Local Shared Objects in Flash. Unlike Flash, however, its privacy settings are not tied to the browser, so even after all caches and cookies have been deleted from the browser, the data saved in Isolated Storage will still be there. What is more interesting, the storage is shared by all browser tabs (except those in ‘incognito’ mode) as well as by all profiles installed on the same machine. Just as with LSO, from a technical point of view nothing prevents it from storing session identifiers. Nonetheless, because the mechanism cannot be controlled through browser settings, it has not become widely used for storing unique identifiers.
HTML5 and client-side data storage
HTML5 provides a set of mechanisms for storing structured information on the client side, among them localStorage, File API and IndexedDB. Despite their differences, they all serve to store an arbitrary amount of binary data tied to a certain resource. Moreover, unlike HTTP and Flash cookies, there are no particular restrictions on the size of the stored data. In modern browsers HTML5 storage is grouped with other site data. Nevertheless, it is quite difficult to figure out how to manage this storage through the browser. For example, to delete data from localStorage in Firefox, the user has to select “offline website data” or “site preferences” and set the time range to “everything”. Another unusual feature, found in IE, is that the data exists only as long as the tabs that were open when it was saved remain alive. On top of all this, the restrictions that apply to HTTP cookies do not really apply to these mechanisms. For example, you can write to and read from localStorage through cross-domain frames even when third-party cookies are turned off.
Cached objects
ETag and Last-Modified
For caching to work properly, the server has to tell the browser somehow that a newer version of a document is available. HTTP/1.1 offers two ways to deal with this problem: the first is based on the date of the last modification, the other on an abstract identifier known as the ETag.
With ETag, the server first returns a so-called version tag in a response header along with the document itself. On subsequent requests to the same URL, the client sends this value, associated with its local copy, back to the server in the If-None-Match header. If the version named in the header is up to date, the server responds with HTTP code 304 (‘Not Modified’) and the client continues to use the cached version. Otherwise the server sends a new version of the document with a new ETag. This approach is quite similar to HTTP cookies: the server stores an arbitrary value on the client in order to read it back later. The other way is to use the Last-Modified header, which makes it possible to store at least 32 bits of information in a date string that the client will later send back to the server in the If-Modified-Since header. Interestingly, most browsers do not even require the string to be a correctly formatted date. As with identification through cached objects in general, deleting cookies and site data does not affect ETag and Last-Modified; they can only be removed by clearing the cache.
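The exchange described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not a real web server: `handle_request` is a hypothetical handler for a single cacheable URL that works on plain header dictionaries, the `Cache-Control` value is only one plausible way of making the browser revalidate, and the second pair of functions shows one possible way of packing a 32-bit identifier into the seconds of a Last-Modified date.

```python
import secrets
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def handle_request(request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], bytes]:
    """Hypothetical handler for one cacheable URL that abuses the ETag as a visitor ID."""
    tag = request_headers.get("If-None-Match")
    if tag:
        # Returning visitor: the browser echoed the tag stored with its cached copy.
        # Answering 304 keeps both the cached copy and the tag alive.
        return 304, {"ETag": tag}, b""
    # First visit: hand out a fresh random tag together with the document.
    new_tag = '"' + secrets.token_hex(8) + '"'
    headers = {"ETag": new_tag, "Cache-Control": "private, no-cache"}  # assumed policy
    return 200, headers, b"<html>...</html>"


def id_to_last_modified(user_id: int) -> str:
    """Encode a 32-bit identifier as the number of seconds in an HTTP date."""
    if not 0 <= user_id < 2 ** 32:
        raise ValueError("identifier must fit in 32 bits")
    return format_datetime(EPOCH + timedelta(seconds=user_id), usegmt=True)


def last_modified_to_id(if_modified_since: str) -> int:
    """Decode the identifier from the date echoed back in If-Modified-Since."""
    moment = parsedate_to_datetime(if_modified_since)
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    return int((moment - EPOCH).total_seconds())


first_status, first_headers, _ = handle_request({})
second_status, second_headers, _ = handle_request({"If-None-Match": first_headers["ETag"]})
print(first_status, second_status, first_headers["ETag"] == second_headers["ETag"])
print(id_to_last_modified(123456789), last_modified_to_id(id_to_last_modified(123456789)))
```

In both cases the server never needs a cookie: the value it wants to read back simply travels inside the browser's ordinary cache revalidation request.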
Application Cache lets a site specify which parts of it should be stored and remain available even when the user is offline. The mechanism is controlled by manifests, which define the rules for storing and retrieving cache elements. Just like the traditional caching mechanism, AppCache makes it possible to store unique, user-specific information both inside the manifest itself and inside resources that are kept for an indefinite amount of time (in contrast to the ordinary cache, whose resources are deleted after some time). AppCache occupies an intermediate position between the HTML5 data storage mechanisms and the ordinary browser cache. In some browsers it is cleared when cookies and site data are deleted, while in others only when the browsing history and all cached documents are deleted.
SDCH dictionaries
Other storage mechanisms
Besides the mechanisms related to caching, JavaScript and various plug-ins, modern browsers have several other features that make it possible to store and retrieve unique identifiers.
- Origin Bound Certificates (aka ChannelID) are persistent self-signed certificates that identify a client to a server. A separate certificate is created for each new domain and is used for connections initiated later. Sites can also use these certificates to track users without performing any actions that the client could notice. The cryptographic hash of the certificate, which the client provides as part of a legitimate SSL ‘handshake’, can itself serve as a unique identifier.
- TLS likewise has two mechanisms, session identifiers and session tickets, that allow clients to resume interrupted connections without a full ‘handshake’. This is made possible by cached data. Both mechanisms allow servers to identify requests sent by the same client within a fairly short period of time.
- Almost all modern browsers keep their own internal DNS cache to speed up name resolution (in some cases it also reduces the risk of DNS rebinding attacks). Such a cache can easily be used to store a small amount of information. For example, with 16 available IP addresses, about 8-9 cached names would be enough to identify any computer on the web. However, this approach is limited by the size of the browser’s internal DNS cache and could potentially cause name resolution conflicts with the DNS provider.
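The arithmetic behind the DNS cache example can be checked with a short Python sketch. With 16 addresses to choose from, each cached name carries 4 bits, so 8 names give 32 bits and 9 names give 36 bits. The `encode` and `decode` helpers below are hypothetical: they only model which address index each host name would be made to resolve to, and do not talk to any real DNS server.

```python
import math
from typing import List

ADDRESS_POOL_SIZE = 16  # from the text: 16 IP addresses to choose from
BITS_PER_NAME = int(math.log2(ADDRESS_POOL_SIZE))  # each cached name stores 4 bits


def names_needed(id_bits: int) -> int:
    """Number of cached host names required to hold an identifier of id_bits bits."""
    return math.ceil(id_bits / BITS_PER_NAME)


def encode(user_id: int, name_count: int) -> List[int]:
    """For each host name, pick the index of the address it should resolve to."""
    if not 0 <= user_id < ADDRESS_POOL_SIZE ** name_count:
        raise ValueError("identifier does not fit in the given number of names")
    choices = []
    for _ in range(name_count):
        choices.append(user_id % ADDRESS_POOL_SIZE)  # lowest 4 bits first
        user_id //= ADDRESS_POOL_SIZE
    return choices


def decode(choices: List[int]) -> int:
    """Rebuild the identifier from the address index observed for each name."""
    user_id = 0
    for index in reversed(choices):
        user_id = user_id * ADDRESS_POOL_SIZE + index
    return user_id


print(names_needed(32), names_needed(36))
print(decode(encode(0xCAFEBABE, 8)) == 0xCAFEBABE)
```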
All the methods considered so far rely on setting a unique identifier that is sent to the server with subsequent requests. However, there is another way to track users, based on querying or measuring characteristics of the client machine. Taken separately, each characteristic yields just a few bits of information, but by combining several of them we can identify any computer on the web. Such tracking is far more difficult to detect and to prevent, and it also makes it possible to identify a user who switches between browsers or uses private mode.
The simplest approach to tracking is to build an identifier by combining various parameters available in the browser environment. None of them has much value on its own, but together they form a distinctive value for each machine:
- User-Agent. It gives away the browser version, the OS version and some installed add-ons. When the User-Agent is missing, or when we want to verify that it is truthful, we can determine the browser version by checking for features that were implemented or changed between releases.
- The display resolution and the browser window size (including the parameters of a second display on multi-monitor systems).
- The list of installed fonts, obtained, for example, with the getComputedStyle API.
- The list of all installed plug-ins, ActiveX controls and Browser Helper Objects, including their versions. We can get them using navigator.plugins (certain plug-ins also reveal themselves in HTTP headers).
- Information about installed extensions and other software. Extensions such as ad blockers make certain changes to the pages being viewed, and these changes reveal the extensions and their settings.
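Combining the parameters listed above into a single value can be sketched as follows. This Python example assumes the attributes have already been collected (for instance by a script running in the page) and reported to the server; the attribute names and the sample values are hypothetical, and hashing a canonical JSON form with SHA-256 is just one reasonable way to do the combining.

```python
import hashlib
import json
from typing import Any, Dict


def fingerprint(attributes: Dict[str, Any]) -> str:
    """Hash a set of reported browser attributes into one stable identifier."""
    # Sorting the keys makes the result independent of the order of collection.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical sample of what a page script might report for one visitor.
visitor = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "screen": [1920, 1080],
    "window": [1280, 720],
    "fonts": sorted(["DejaVu Sans", "Liberation Serif", "Noto Sans"]),
    "plugins": sorted(["ExamplePDF 2.1", "ExamplePlayer 11.2"]),
    "adblock_detected": True,
}

print(fingerprint(visitor))
```

An exact hash like this changes as soon as any single attribute changes (a resized window, a newly installed font), so a practical tracker would probably compare attributes with some tolerance rather than rely on the hash alone.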
Network fingerprints
A further set of characteristics lies in the architecture of the local network and the configuration of network protocols. These characteristics are common to all browsers installed on the client machine and cannot be hidden even by privacy settings or certain security utilities. Here is the list:
- The external IP address. This vector is especially interesting for IPv6, because in certain cases the last octets are derived from the device’s MAC address and therefore stay the same even when the device connects to different networks.
- Port numbers for outgoing TCP/IP connections (on most operating systems they are usually chosen sequentially).
- The local IP address for users behind NAT or an HTTP proxy. Together with the external IP, it makes it possible to identify most clients.
- Information about the proxy servers the client is using, which can be found in HTTP headers (X-Forwarded-For). Combined with the client’s real address, which can be obtained in several ways that bypass the proxy, it also makes it possible to identify a user.
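The IPv6 case from the first item can be made concrete. When an address is built with the modified EUI-64 scheme, its low 64 bits contain the MAC address with the bytes `ff:fe` inserted in the middle and one bit of the first octet flipped. The Python sketch below reverses that; the function name `mac_from_ipv6` and the sample addresses (taken from the documentation prefix `2001:db8::/32`) are chosen for this example only.

```python
import ipaddress
from typing import Optional


def mac_from_ipv6(address: str) -> Optional[str]:
    """Recover the MAC address embedded in a modified EUI-64 IPv6 address, if any."""
    interface_id = ipaddress.IPv6Address(address).packed[8:]  # low 64 bits
    if interface_id[3:5] != b"\xff\xfe":
        return None  # not EUI-64 derived, e.g. a randomised or manual address
    mac = bytearray(interface_id[:3] + interface_id[5:])
    mac[0] ^= 0x02  # undo the flipped universal/local bit
    return ":".join(f"{octet:02x}" for octet in mac)


# The same device seen on two different networks keeps the same low 64 bits.
print(mac_from_ipv6("2001:db8:aaaa::225:96ff:fe12:3456"))
print(mac_from_ipv6("2001:db8:bbbb::225:96ff:fe12:3456"))
print(mac_from_ipv6("2001:db8::1"))
```

Addresses generated with privacy extensions do not follow this pattern, so the function returns `None` for them.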
Behavior analysis and habits
Another option is to look at characteristics tied not to the PC but rather to the end user, such as local settings and behaviour. This method also makes it possible to identify clients across different browser sessions, profiles and private mode. We can draw conclusions from the following parameters, which are always available for examination:
- The client’s cache data and browsing history. Cache elements can be detected with timing attacks: a tracker can find long-lived cache elements belonging to popular resources simply by measuring load time (and cancelling the navigation if the time exceeds the expected time of loading from the local cache). URLs stored in the browsing history can also be extracted, although in modern browsers this attack requires interaction with the user.
- Mouse gestures, the frequency and duration of keystrokes, accelerometer data: all these parameters are unique to each user.
- Any changes to the standard site fonts and their sizes, the zoom level, or the use of accessibility features such as text colour or size.
- The state of certain browser features configured by the client, such as third-party cookie blocking, DNS prefetching, pop-up blocking, Flash security settings and so on (ironically, users who change the default settings actually make their browser far more recognisable).
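The cache timing idea from the first item can be reduced to a very small decision rule. The Python sketch below assumes that load times for a few probe resources have already been measured in the browser and reported back; the 20 ms threshold, the URLs and the sample timings are all hypothetical and would need calibrating in practice.

```python
from statistics import median
from typing import Dict, List

# Assumed cut-off: loads faster than this are treated as local cache hits.
CACHE_THRESHOLD_MS = 20.0


def probably_cached(samples_ms: List[float], threshold_ms: float = CACHE_THRESHOLD_MS) -> bool:
    """Guess whether a resource came from the local cache, using the median load time."""
    return median(samples_ms) < threshold_ms


def cache_profile(measurements: Dict[str, List[float]]) -> Dict[str, bool]:
    """Map each probed resource to a cached / not cached guess."""
    return {url: probably_cached(samples) for url, samples in measurements.items()}


# Hypothetical timings (in milliseconds) reported for two probe resources.
measurements = {
    "https://popular-site.example/logo.png": [3.1, 2.8, 4.0],
    "https://other-site.example/app.js": [180.5, 210.2, 95.7],
}
print(cache_profile(measurements))
```

The resulting set of cached and non-cached resources hints at which sites the user has visited recently, which is what makes it usable as a behavioural signal.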
By and large, these are only the obvious options that are not hard to work out. If we ‘dig’ a little deeper, we can find more.
As you can see, in practice there are a great many ways to track users. Some of them result from implementation defects or oversights and can, in theory, be fixed. Others are practically impossible to prevent without completely changing the way computer networks, web applications and browsers work. In general, we can counter some techniques, for example by clearing caches, cookies and other places where identifiers can be stored. Others, however, work completely imperceptibly to the user, and it is impossible to protect yourself from them. That is why the most important thing to remember is that when you are ‘travelling’ the web, your every move can be tracked.