Let us try to consider an implementation of any cryptographic algorithm from top downwards. At the first stage a cryptographic algorithm is written in the form of mathematic operators. Here the algorithm is in the environment where only mathematical laws are valid; therefore, researchers verify only the mathematical resistance of the algorithm, or its cryptoresistance. We have a low interest in this step because mathematical operations should be converted into a code. At the code operation stage the critical information about the cipher operation can ooze through holes in the implementation. Buffer overflow, incorrect memory operations, non-documented capabilities and other features of the program environment enable an intruder to find a secret encryption key without using complicated mathematical manipulations. Many researchers stop at this step forgetting that there is at least one more step. Data reflect the real physical state of logical elements rather than an abstract notion whereas computations are physical processes which convert logical elements from one state to another. Consequently, the program execution is a conversion of physical signals, and from this point of view the result of operation of an algorithm is determined by physical laws. Hence the implementation of a cryptographic algorithm can be considered in the mathematical, program and physical environments.

Ultimately, any algorithm is executed by means of “hardware facilities”, which shall be any computational mechanisms being capable to execute the AND, OR logical operations or the logical negation operation. They include standard semiconductor devices such as processor and PLD (programmable logic device), brain neurons, DNA molecules, etc.. All computational facilities have at least two general features. The first feature is designed for making a computation, and one has to spend energy for it. In case of semiconductor devices, it is a matter of electrical power; in case of brain neurons, it is probably a matter of calories spent (I wonder if anybody has ever seen fat chess players); in case of DNA it is, for example, a matter of chemical reactions with heat release. The second general feature relates to the fact that all computational mechanisms require normal external conditions in order to execute operations correctly. Semiconductor devices need continuous voltage and current supply; brain neurons need quietness (Did you try to drive a car when your girlfriend attempts to clear the air?); DNA needs temperature. These two features compose a basis for hardware attacks, which we will talk about.

In the first glance, it seems that physical processes are absolutely non-relevant from the safety point of view but it is not the case. The energy that has been spent at the given particular moment of the logarithm operation can be measured and related to binary data, which will allow to find the encryption key. The variation of physical effects taking place during computations is the basis for all attacks called Side Channel Attacks. In Russia this term has not established yet; therefore one can encounter such word combinations as “incidental channel attacks”, “outside channel attacks”, etc.

Normal external conditions are not insignificant too. Let us consider, for example, the voltage that is to be applied to the processor input. What will happen if this voltage drops to zero for a period of several nanoseconds? As it may seem in the first glance, the processor will not reboot but most probably the continuum of physical processes will be violated and the algorithm result will be incorrect. Having created an error at the required moment, the intruder will be able to compute the key by comparing the correct and incorrect encryption texts. The variation of external conditions is used for computing keys in fault attacks. Again, in Russia this term has not been established completely: such attacks are also called attacks by means of errors, attacks by the method of induced failures and otherwise.

# Side channel attacks

We will start the introduction into side channel attacks from DES algorithm implemented on C++ (the algorithm flow chart is given in Fig. 1 and its detailed description can be found in Internet). Apart from what you will see as to in what unexpected places the vulnerabilities can be concealed, you will also learn about main techniques to be used in side channel attacks. These techniques should be felt deeply as they serve as a basis for more complicated attacks, which will be discussed in subsequent papers.

# Timing attacks

So it is a timing attack on a DES algorithm “implementation.” You can find the full code on `dvd.xakep.ru`

; and at the present moment we are interested in bit-wise exchange `Р`

(look for a circle with letter `Р`

in Fig. 2) to be performed at the last step of the Feistel box. In our case the code of this function has been made as follows:

#define GETBIT(x,i) ((x>>(i)) & 0x1) uint8_t p_tab[32] = {16,7,20,21,29,12,28,17,1,15,23,26,5,18,31,10,2,8,24,14,32,27,3,9,19,13,30,6,22,11,4,25}; uint32_t DES_P(const uint32_t var){ int iBit = 0; uint32_t res = 0x0, one = 0x1; for (iBit=0; iBit<32; iBit++) if (GETBIT(var,32 - p_tab[iBit]) == 1) res |= one<<(31-iBit); return res; }

From the point of view of hardware attacks, this code contains a gigantic vulnerability: the program will execute the operation `res |= one<<(31-iBit)`

, i.e. spend additional time (accordingly, energy) only when the `var`

variable bit is equal to `1`

. `var`

variable, in turn, depends upon the plaintext and the key; therefore, having related the program operation time to the key value we will obtain a tool for attack. In order to understand how to use the relation between time and data, I will consider two theoretical examples. Then in the third example it will be shown how to find directly the key of algorithm that uses this implementation.

# Comparison of keys under ideal measurements

The first theoretical example consists of the fact that we have five plaintexts, ideally measured time spent on encryption of each text and two keys: `K1=0x3030456789ABCDEF`

, `K2=0xFEDCBA9876540303`

, out of which it is necessary to select the correct key. We believe that the code was not interrupted during execution, the data were placed in the first level cache and the operating time of all functions of the cipher except `DES_P`

function was constant. I’d like to note that there are no ciphertexts; therefore, we will not manage to encrypt one plaintext by means of two keys and compare the ciphertexts obtained with the original one. What can be done in such case?

You already know that `var`

variable influences the operating time, which has two components:

- variable time to be spent on the execution of all DES_P operations, which depends upon the number of bits of the
`var`

variable for each round:`a*(∑HW(var))`

, where`а`

is a time constant; actually it is the number of processor cycles spent on the execution of one operation`res |= one<<(31-iBit)`

;`HW(var)`

is the Hamming distance, i.e. the number of bits of the`var`

variable set into 1. The summation symbol`∑`

means that we count the Hamming distance for all 16 rounds;

- the constant time to be spent on the execution of remaining operations will be denoted by letter
`Т`

.

Consequently, the operating time of the whole algorithm is equal to `t = a*(∑HW(var)) + T`

. We do not know the parameters `a`

and `T`

, and I will gladden you at once that we will not look for them. We have ideally measured the time of encryption of each plaintext, t`. In addition, we have the values of keys; therefore, we can calculate the`

var` variable for each round.

I believe that you have already guessed which of the two keys is correct: it is necessary to compute the sum of Hamming distances `∑HW(var)`

for each plaintext and one key value and compare the values obtained with the time value measured. It is evident that with growth of the value of `∑HW(var)`

the time value will also have to increase. Consequently, if the key is correct, then such dependence will be traced, and no such dependence will exist for the incorrect key.

All the foregoing can be written in the form of one table (Fig. 3).

The first column contains plaintexts, which are to be encrypted by means of one of the keys, `K1`

or `K2`

(it is necessary to determine of what particular key). The second column contains the values of time indicated in processor cycles which was spent on encryption of one plaintext. The third column contains the values of sum of Hamming distances of the `var`

variable for all rounds obtained for each plaintext and `K1`

key. The fourth column contains the values of the similar sum of Hamming distances but computed already for `K2`

key. As it is not difficult to note that the cipher operating time increases with the growth of values of the Hamming distances for `K1`

key. Accordingly, `K1`

key is the correct key.

Of course, it is an idealized example, which does not take into account a variety of factors arising in reality. I only mean to show an exemplary principle of side channel attacks, and the next example will already be explained on actually measured time values, but primarily it is necessary to remember something from statistics.

# Guess the number

I should like to show what happens with random variables when they are being averaged for a long period of time. If you have a good knowledge of statistics, then you may transfer immediately to the next part, otherwise let us consider a small game where computer will randomly choose an integer `А`

and suggests that you should guess it. Each time computer chooses an additional pair of numbers `(b, c)`

out of the range from `0`

to `М`

and returns only values of `(А + b, c)`

. Numbers `b`

and `с`

are chosen randomly and can be significantly larger than `А`

. You do not know the value of `М`

(but you can approximately guess as to its order due to spread of values of `c`

). How to guest the `А`

number?

The program which simulates this game is given below:

void Game(int *Ab, int *с){ static int A = 0; int M = 1000; srand((unsigned int)rdtsc()); if (A==0) A = 1+rand()%100; *Ab = A+(M*M*rand())%M; *с = (M*M*rand())%M; } void Guess(){ int Ab, с, i, nTries = 100000; double avg1 = 0.0, avg2 = 0.0; for (i=0; i<nTries; i++){ Game(&Ab, &c); avg1 += Ab; avg2 += c; } printf("%f\n", roundf((avg1-avg2)/numTries)); }

According to the code, the value of `А`

is taken out of the range from 1 to 100, and the values of variables `b`

and `с`

— out of the range from 0 to 999, which is approximately ten times as much as the maximum value of `А`

, i.e. the noise level is considerably higher than the level of the value itself! But if a pair of values `(А + b, с)`

is returned 100 thousand times (it is a large number but the noise level is also far from low), then the value of variable `А`

can be guessed approximately in roughly half the instances. To this end, it is necessary to average all returned values of `А + b`

and all values of `с`

, and then compute the difference of average values. This difference will just be the value of variable `А`

. Of course, if we reduce the value of `М`

, then the number of pairs `(A + b, с)`

being necessary to computation of the key will be less.

Now it is necessary to understand why this happens. There exists a remarkable law, which is a key law for side channel attacks, Chebyshev’s law of large numbers. According to this law, **with a large number of independent experiments the mean arithmetic of the observed values μ(x) of random variable x converges in probability to their mathematical expectation**. If this law is considered within the framework of our game, then the sum of values of

`A + b`

and `c`

will converge respectively to `А + μ(b)`

and `μ(c)`

, and as the values of `b`

and `c`

are selected randomly from the same range, then `μ(b)`

and `μ(c)`

will converge to their mathematical expectation; consequently, the difference `А + μ(b) – μ(c)`

will converge to the value of variable `А`

.# Comparison of keys under real measurements

Interruptions, data randomizing and other factors do not allow to measure the operating time of the algorithm at the accuracy up to one processor cycle. In my measurements, the operating time of the algorithm varies within 5% of the average value. It is enough for the method from the first example to stop working.

What can be invented in such case? I’d like to note at once that you can’t solve the problem by averaging the encryption time of each plaintext only (though the time will be measured more accurately). In the first place, it is by no means always possible as the input values of the cipher can be controlled by somebody but us. In the second place, even having averaged one million of encryptions, you will not get rid of the data randomizing problem, as it is difficult to place all the tables in the first level cache (at least it is necessary to take care of it beforehand) — I will tell about such peculiarities somehow the next time.

Now let us consider how the Chebyshev’s law of large numbers influences the side channel attacks. Once again, we will consider the DES implementation but now the operating time of the algorithm can be recorded as follows:

- variable time to be spent on the execution of
`DES_P`

operation, which depends upon the number of bits of the`var`

variable:`a*(∑HW(var))`

.`а`

is still the time constant, and`HW(var)`

is the Hamming distance; - constant time to be spent on the execution of all remaining
`Т`

; - measurement noise
`Δ(t)`

, which is normally distributed within the range of`[0:D]`

. The value of`D`

is unknown and, as always, we will not look for it.

Hence the operating time of the algorithm can be written in the following form: `t = a*(∑HW(var)) + T + Δ(t)`

. The table given in Fig. 4 contains the values of plaintexts and the time actually measured for them. Please note that `∑HW(var)`

for the correct key and each plaintext is equal to 254 but at that the difference between the least and largest time is equal to 327 cycles!

With such variations in measurements a simple pair-wise comparison of Hamming distances for one plaintext and the time of its encryption will not yield the results. Here we are to use the Chebychev’s law of large numbers. What will happen if we have averaged the algorithm time for different ciphertexts which yield the same Hamming distance `∑HW(var)`

:

μ(t) = μ(a*(∑HW(var)) + T + Δ(t)) = μ(a*(∑HW(var))) + μ(T) + μ(Δ(t)) = a*(∑HW(var)) + T + μ(Δ(t))

Actually, when the arithmetic mean of time is determined for different plaintexts, the noise averaging takes place, and as we know from statistics, this noise will converge to one and the same value with the growth of the number of measurements, i.e. `μ(Δ(t))`

will converge to a constant.

Let us show it by example. I have measured the operation of 100 thousand DES algorithm encryption operations, i.e. I have 100 thousand plaintexts and corresponding operating time values. This time we will compare the following four keys: `K1=0x3030456789ABCDEF`

, `K2=0xFEDCBA9876540303`

, `K3=0x2030456789ABCDEF`

, `K4=0x3030456789ABCDED`

. The same as in the first example, we will compute the Hamming distance value for all plaintexts and `K1,K2,K3,K4`

keys — it has been done in the table given in Fig. 5. None of the keys has an evident time dependence upon Hamming distances.

Let us average the operating time of encryption (for each key separately) which have the same Hamming distance (we will take only those plaintexts, for which `∑HW(var)`

lies within the interval of [234,276]). This time we will plot a curve (Fig. 6) where the X axis corresponds to the Hamming distance and the Y axis — to the average value of time for this distance.

# Pearson’s correlation coefficient

What do we see in Fig. 6? For the three keys (`K2`

, `K3`

, `K4`

) the cipher operating time is weakly dependent upon the hamming distance, and for the `K1`

key we see an ascending trend. Please pay attention to the sawtooth form of plots — it is due to the fact that we don’t have many measurements and they are not accurate to such a degree that `μ(Δ(t))`

should be averaged to one value. Anyway, we can see that the average cipher operation time increases with the growth of the Hamming distance values computed for the `K1`

key and does not increase for the other three keys. Therefore, we suppose that the `K1`

key is true (it is really the case). With the growth of the number of data, the ascending trend for the correct key will perhaps increase and for incorrect keys all values will converge to the average value. The “comb-like” plot character will disappear with the growth of the number of data.

You must admit that is rather inconvenient to construct plots and permanently verify them through the eyes; to this end there are several standard texts for verification of the dependence between the model and the real data: correlation coefficient, Т-test and mutual information. It is possible to remember and invent another pair of other coefficients but we will mainly use the Pearson’s correlation coefficient or, which is the same, the linear correlation coefficient `pcc(x,y)`

(the description of the coefficient is given on Wiki). This coefficient characterizes the degree of **linear** dependence upon two variables. In our case the dependence is just linear, as `μ(t) = a*(∑HW(var)) + T + μ(Δ(t))`

can be written as `y = a*x + b`

, where `x`

is the Hamming distance that we compute and `y`

is the actually measured time. The value of Pearson’s correlation coefficient for average time and Hamming distance values is shown in the line “with averaging” in Fig. 7.

The Pearson coefficient value for the `K1`

key is three times as much as the value for any of the three other keys. It is the evidence of the linear conformity between the data being simulated and real data, which once again confirms the use of the `K1`

key.

The Pearson’s correlation coefficient can be used even without preliminary averaging the values. In this case, its value will be significantly less than for the averaged values but at any rate the correct key will give the largest correlation coefficient (line “without averaging”).

Hence, at first visually and then by means of the correlation coefficient, we have been able to make sure that our time model for the `K1`

key matches the real data best of all. It is evident that it does not seem possible to verify all potential values of the main key; therefore, we need another method, which will allow to look for a key by parts. Such method is given in the next example, which can already be considered as an attack to be used in real life.

# Attack against an unknown key

Now we have come to a moment when the key will be looked for by parts of 6 bits each. We will seek 6 bits absolutely similar to how we verified the correctness of 64 bits before (when we dealt with four keys). The values of 6 bits of the key which will give the best linear relation between the model and the real data will most probably be correct. How does it work?

Let us consider how one can present the cipher operating time when we are looking for 6 bits of the key only:

- variable time to be spent on the execution of
`DES_P`

operation, which is dependent upon the following elements:- 4 bits of the
`var`

variable of the first round`a*HW(var[1,1:4])`

(6 bits of the key of the first round take part in forming 4 bits of the`var`

variable); - all remaining bits of the
`var`

variables, except 4 bits of the first round`a*(∑ HW(var[:,1:32]))`

;

- 4 bits of the
- constant time
`Т`

; - measurement noise
`Δ(t)`

.

Hence the operating time of the algorithm can be written as follows:

t = a*HW(var[1,1:4]) + a*(∑ HW(var[:,1:32])) + Т + Δ(t).

Once again, let us discuss 6 bits and 4 bits of the `var`

variables. The Feistel box takes 32 bits of the `R`

register and by means of the special exchange `E()`

receives 48 bits, which then are to be summed up with 48 bits of the key. In the first round, we know the value of `R`

and do not know the key. Then the result of summing up will be split (attention!) into eight blocks with 6 bits each and every set of 6 bits each is given to its own Sbox table. Each of eight tables replaces six input bits with four output bits; therefore a 32-bit `var`

variable is obtained at the output, which already influences the operating time of the cipher.

If we group the operating time of all encryption operations for which the Hamming distance `HW(var[1,1:4])`

is the same, then the average operating time will converge to the following value:

μ(t) = μ(a*HW(var[1,1:4])) + μ(a*(∑ HW(var[:,1:32]))) + μ(Т) + μ(Δ(t)) = a*HW(var[1,1:4]) + μ(a*(∑ HW(var[:,1:32]))) + Т + μ(Δ(t)).

As we take the identical values of `HW(var[1,1:4])`

and different values of `∑ HW(var[:,1:32])`

(we take the plaintexts where `HW(var[1,1:4])`

will obligatorily be identical, and we are not interested in the remaining parts; therefore, the sums `∑ HW(var[:,1:32])`

will be different), then the arithmetic mean `μ(a*(∑ HW(var[:,1:32])))`

will converge to a constant (to be more accurately, `μ(∑ HW(var[:,1:32]))`

, without taking the first four bits into account, should converge to 254), the same as the convergence of `μ(Δ(t))`

in the previous example!

The first four bits of the `var`

variable can be written as `var[1,1:4] = Sbox{E(R)[1,1:6] xor K[1,1:6]}`

, where `E(R)[1,1:6]`

are the first 6 bits of the `R`

register after the `E()`

operation; `K[1,1:6]`

are the first 6 bits of the key; `Sbox{}`

is the Sbox exchange table. Now let us substitute the expression for `var[1,1:4]`

:

μ(t) = a*HW(Sbox{E(R)[1,1:6] xor K[1,1:6]}) + μ(a*(∑ HW(var[:,1:32]))) + Т + μ(Δ(t)).

The values of `μ(a*(∑ HW(var[:,1:32])))`

, `μ(Δ(t))`

converge to their average values when they are computed for different plaintexts. Consequently, with a very large number of averaging operations the value of `μ(a*(∑ HW(var[:,1:32]))) + Т + μ(Δ(t))`

can simply be replaced with `const`

:

μ(t) = a*HW(Sbox{E(R)[1,1:6] xor K[1,1:6]}) + const.

In order to find a key in this expression, it is necessary to select plaintexts for each value of 6 bits of the key with the same value of `HW(Sbox{E(R)[1,1:6] xor K[1,1:6]})`

, average their operating time and compare to the model. The final key-searching algorithm can be written in the form of a pseudo-code:

For each key = 0:63 For each i = 1:N P = plaintext(i) // Plaintext [L, R] = IP(P) // Left and right parts after the initial exchange hw_var[i] = HammingWeight( Sbox1(E(R)[0:5] XOR key) ) // Hamming distance for the first four bits of the var variable EndFor // Correlation coefficient between N measured time values and N constant values of the Hamming distance pcc(key) = ComputePearsonCorrelation(t, hw_var) EndFor

This algorithm has been implemented in С++ (the full source code can be found on dvd.xakep.ru), and the correlation coefficients computed are shown in Fig. 8. One million measurements have been used for computation of the correlation coefficients. The combination of bits of the key `000010`

=2 gives the correlation coefficient being four times as much as for any other value; therefore, most probably, this combination of bits is true. Please note that we are looking for a first round key, which is not equal to the original key.

Once the first 6 bits of the keys are found, it is possible to look for the next bits until the key has been restored in full. The collection of time values can take from several hours to several weeks depending upon the system. The data analysis normally occurs considerably quicker, e.g. within several hours though it also depends upon the situation. There is no commercial software for timing attacks on open access but you can always use my source codes.

# To be continued

The time has come to fold up. The first paper is always bound to be a flop as it can seem easy/non-practical/non-interesting but you can’t get very far without an introductory part. Well, in the subsequent issues of the magazine we will continue studying hardware attacks, discuss how to use the device power supply consumed in order to crack the cipher and how to restore the key be means of errors in computations. Hence, stay tuned, as the story goes.