# Fastest shot. Optimizing Blind SQL injection

Working at BI.ZONE, I have to exploit Blind SQL injection vulnerabilities on a regular basis. In fact, I encounter Blind-based cases even more frequently than Union- or Error-based ones. But how can such attacks be made more efficient? This article provides an overview of approaches used to exploit Blind SQL injection and techniques that expedite the exploitation.

## What the heck is ‘Blind’ SQL injection?

This section describes situations when Blind SQL Injection occurs and explains its basic exploitation technique. If you are already familiar with this stuff and understand what a binary search is, feel free to proceed to the next section.

A Blind SQL Injection occurs when you cannot retrieve data from an app directly, but can distinguish between two different states of the web app depending on the condition you have defined in an SQL query.

#### How Blind SQL Injection works

Imagine the following injection (for simplicity purposes, I use the PHP language, and MySQL serves as the DBMS):

$query = "SELECT id,name FROM products ORDER BY ".$_GET['order'];

In the order parameter, you can specify not only the column name, but some condition as well. For instance:

order=(if( (select 1=1),id,name))
order=(if( (select 1=0),id,name))

Depending on the truth of the logical expression (1=1) or (1=0), the DBMS will sort the result either by the id column or by the name column. The response of the web app will show you which column the products are sorted by, and you will be able to distinguish a true condition from a false one. This enables you to ‘ask questions’ to the DBMS and receive ‘yes’ or ‘no’ answers to them.

For instance, let’s ask the DBMS whether information_schema.tables has more than 50 rows or not.

order=(if( (select count(*)>50 from information_schema.tables),id,name))

Speaking more formally, one query retrieves 1 bit of information from the database. To get text information, you can perform a full enumeration of all possible characters as follows:

order=(if( (select substring(password,1,1)='a' from users where username='admin'),id,name))
order=(if( (select substring(password,1,1)='b' from users where username='admin'),id,name))
order=(if( (select substring(password,1,1)='c' from users where username='admin'),id,name))
...

But this will take a while: in the worst case, the number of queries per character equals the size of the alphabet you enumerate. This is why binary search is used to expedite the process.

#### Binary search

Let’s convert a character contained in the target string into its ASCII code and make an assumption about the possible value of this character. For instance, you can assume that it’s located in the lower half of the ASCII table (i.e. its code is in the range from 0x00 to 0x7f). So, you divide this range into two halves and ask the database: which half of the range is the symbol in?

order=(if( (select ascii(substring(password,1,1))>0x3f from users where username='admin'),id,name))

If the character is greater than 0x3f, then the target range is narrowed to 0x40-0x7f; if less, to 0x00-0x3f. Next, you find the middle of the new range and ask the database again: is the target character greater or less than this median value? Then you narrow the range down again and continue this operation until exactly one value remains in the range. This value is the answer you need.

In this particular case, to find out the exact character value, you have to make $log_2(128) = 7$ queries.
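The search described above can be sketched as a tiny Python routine, where `oracle(threshold)` stands in for one injected query that answers whether the target code is greater than `threshold` (the name and the callback are illustrative, not part of any tool):

```python
def find_char_code(oracle, lo=0x00, hi=0x7f):
    """Binary-search the ASCII code of one character.

    `oracle(threshold)` must return True when the target code is
    greater than `threshold`; one call corresponds to one HTTP
    request with an injected comparison.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle(mid):
            lo = mid + 1        # code is in the upper half
        else:
            hi = mid            # code is in the lower half
    return lo
```

For the 0x00-0x7f range this always costs exactly 7 oracle calls, matching the $log_2(128) = 7$ figure above.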

## Factors affecting the exploitation speed

Now let’s figure out what limits the injection exploitation speed:

• The attacker exploiting Blind SQL injection sends queries to the same endpoint. Each such query is processed by the web server for some time. Let’s denote the average query execution time as $t_0$. The spread in query execution times is usually small.

• The web server can process a number of queries in parallel. This number depends on the number of physical web servers behind the load balancer, the number of web application threads, the number of cores on the DBMS server, etc. It can be found out experimentally. Let’s denote the number of queries the web server can process in parallel by $n_0$.

• Therefore, the app won’t process more than $F_0 = \frac{n_0}{t_0}$ queries per second. You can send more queries per second, but they will be put into the queue and wait for processing; so, the total query processing speed won’t exceed $F_0$. This parameter may change depending on the current load on the app and may not be constant, but this is of no importance.

Conclusion: the data retrieval speed is at most approximately $F_0$ bits per second. This is the main limitation on the Blind SQL injection exploitation speed.

## General optimization ideas

### Manipulations with sessions

The web app might be unable to execute parallel queries within the same session (e.g. this is how PHP works). If you see that the search for a character is performed in one thread, you have to create one session for each search thread.

If a web app runs on multiple servers and uses Sticky Sessions load balancing, all queries made within a session are sent to the same server. You can create several sessions on different servers and distribute queries evenly between them. As a result, a larger number of servers will be involved in the processing of your queries, and $n_0$ will increase, as well as the overall attack speed.

There is no way you can get more than $F_0$ bits per second. However, you can wisely manage the available speed, thus, increasing the overall attack speed.

Usually, your goal is to retrieve several columns from one table (possibly using the WHERE filter). The implementation of such a ‘head-on’ attack involves the following tasks:

1. Determine the number of rows to be retrieved;
2. For each string and column to be retrieved, determine the string length; and
3. Search for each character in the target string. If you think that you only need standard characters from the lower half of the ASCII table (0x00-0x7f), then you have to make 7 queries per character.

#### Trivial implementation

In a trivial ‘head-on’ implementation, parallelization is implemented only at the third stage. At the first and second stages, the search is performed in 1 thread, which means that the speed is $\frac{1}{t_0}$ queries per second instead of the available $\frac{n_0}{t_0}$.

At the third stage, parallelization is implemented as follows. Let’s say your web app is processing $n_0$ simultaneous queries, and the string length is $X$. This means that you can perform $n_0$ parallel tasks, and each task can be placed into a separate thread.

First, you determine the first $n_0$ characters in the string in parallel. Important: you cannot use all available threads to identify a single character since the binary search forms a new query only after receiving a response to the previous one.

When the search for the current character is complete, you launch new threads for subsequent characters contained in the same string. But since the query execution speed is approximately the same, you will launch new $n_0$ threads approximately when the first $n_0$ threads are completed, i.e. after $7 * t_0$ seconds (if you are searching in the 0x00-0x7f range).

If the string length is a multiple of $n_0$, then everything is fine and all threads will be executed at the maximum speed. If not, then the last group will occupy only $X\ mod\ n_0$ threads, and the rest of the threads will be idle. Finally, if string lengths are arbitrary, the last group will on average run at only 50% of the maximum speed.

#### More high-performance variant

To increase the speed, it’s better to use parallelism: start the first thread to solve task 1, and immediately start the remaining $n_0 - 1$ available threads to determine string lengths for the first $\frac{n_0 - 1}{c}$ strings ($c$ is the number of columns). If it turns out that the number of strings is less than you have defined, then some of the threads were executed meaninglessly. But otherwise, they would simply be idle; so, the attack speed won’t decrease anyway, but may increase.

You can go even further by assuming that the target strings are, for instance, at least 3 characters long. So, before completing either task 1 or task 2, you can start task 3 (i.e. start identifying the first 3 characters of the first target strings). After such a start, provided that you manage the threads correctly (e.g. you have already determined the lengths of the next several strings), the overall speed may reach approximately $F_0$.
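As a toy illustration of keeping all workers busy, here is a sketch using concurrent.futures. Everything here is hypothetical scaffolding: `find_char(row, pos)` stands in for the ~7-query binary search of one character, and the string lengths are assumed to be already known (tasks 1 and 2 done):

```python
from concurrent.futures import ThreadPoolExecutor

def recover_strings(lengths, find_char, n0=8):
    """Pipeline every (row, position) lookup through a pool of n0
    workers so that no thread idles between rows.

    `lengths` holds the already-determined length of each row;
    `find_char(row, pos)` performs the per-character binary search.
    """
    results = [[''] * n for n in lengths]
    with ThreadPoolExecutor(max_workers=n0) as pool:
        # Submit one task per character, across all rows at once,
        # instead of processing rows one by one.
        futures = {(r, p): pool.submit(find_char, r, p)
                   for r, n in enumerate(lengths) for p in range(n)}
        for (r, p), fut in futures.items():
            results[r][p] = fut.result()
    return [''.join(row) for row in results]
```

The point of the sketch is only the scheduling: characters from different rows share the same pool, so threads never wait for a row boundary.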

## Main exploitation objectives

This section addresses two typical objectives that arise when you exploit Blind SQL injection and effective ways to fulfil them: (1) determine the number of rows and the string length; and (2) search for integers.

### Determining the number of rows and the string length

Determining the number of rows to be returned and the string length are similar tasks, but they aren’t as trivial as you might think.

Binary search can be used to find a value within a range that has the minimum and maximum values. For instance, when you identify characters, you know that they most likely are in the 0x00-0x7f range and exactly fall into 0x00-0xff (let’s put Unicode aside for now, although you can also set the maximum range limit for it).

But when you determine the number of rows and the string length, you don’t have such information: in theory, the target value can be anything.

Also note that the number of rows is determined once for each query. But the string length is determined multiple times; so, it’s a higher priority to optimize the string length determination.

#### Determining string length (a trivial way)

First, I am going to present a trivial algorithm that determines the string length. After reviewing it and understanding the related problems, you will be able to assess the effectiveness of such algorithms.

The first idea that comes to mind is to determine a value that is greater than the string length (further explanations pertain only to the string length; the logic used to determine the row number is similar). After finding this value, you can run a binary search.

Since binary search is most efficient when you deal with ranges whose size is equal to an exact power of two (see below), you can start from a number representing an exact power of two (e.g. 256).

The resultant algorithm is as follows:

1. Compare the string length with 256;
2. If the length is less, then you run a binary search; it will require $log_2(256) = 8$ queries; and
3. If the length is longer, then you have to increase the maximum range. For instance, you can multiply the old upper bound by $2^4$ and compare with it. If it’s less, then you run a binary search; if it’s more, then you multiply by $2^4$ again and so on.
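The widening-then-bisecting algorithm above can be sketched as follows; `length_gt(n)` stands in for one injected query answering whether the string is longer than `n` (the function and parameter names are illustrative):

```python
def find_length(length_gt, start=256, factor=16):
    """Determine a string's length via a 'greater than' oracle.

    Starts from `start` (an exact power of two), multiplies the
    upper bound by `factor` (2**4 by default) until the length fits,
    then binary-searches inside the final range.
    """
    lo, hi = 0, start
    while length_gt(hi):            # length > hi: widen the range
        lo, hi = hi, hi * factor
    while lo < hi:                  # binary search in (lo, hi]
        mid = (lo + hi) // 2
        if length_gt(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo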

To estimate the efficiency of such an algorithm, you have to make an assumption about the probabilistic distribution of lengths of target strings. Such an assumption isn’t easy to make since the string content strongly depends on its meaning. For instance, email addresses have one distribution, text posts have a different distribution, while password hashes always have the same length.

It’s also necessary to take into account standard real-life tasks. Blind SQL injection exploitation isn’t a quick process. Most likely, if you find a string 10,000 characters long, you won’t be interested in retrieving it in full – you just need to see its first 100 characters to understand what it contains.

In other words, if the string length is more than, let’s say, 128 characters, then you aren’t interested in its exact length: it’s possible to simply indicate in the output that the length is more than 128 characters and determine only these first 128 characters. Therefore, you have to make 7 queries to determine the string length – similar to determining 1 character. In my opinion, it’s quite acceptable.

#### Determining string length (an elegant way)

There is also a technique that doesn’t require determining the string length at all.

The following construct is normally used to identify characters:

ASCII(substring(target_col,n,1))

If n is greater than the string length, then the substring function will return an empty string, and the ASCII function of an empty string will return 0x00 (this is true for MySQL and PostgreSQL). Accordingly, as soon as you find a null character, you conclude that you have found the last character, and there is no need to search further.

Using this approach, you search for one character past the last one; this operation requires an extra 7 queries. The cost is the same as the cost to determine the string length. Also note that in MSSQL and SQLite, the substring function returns NULL rather than an empty string, and so does the ASCII/Unicode function. You can create special constructs that convert NULL into 0, but this increases the query length. In addition, you must carefully build parallelism: simultaneous identification of several characters can result in useless searches beyond the string end and unnecessary queries.
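A sketch of extraction based on this trick; `find_code(pos)` is a placeholder standing in for the per-character binary search, and it returns 0 past the end of the string (the MySQL/PostgreSQL behaviour of ASCII(SUBSTRING(...)) described above):

```python
def extract_string(find_code):
    """Read characters until the 'virtual' NUL past the string end.

    `find_code(pos)` returns the ASCII code at 1-based position
    `pos`, or 0 once `pos` runs past the string.
    """
    out = []
    pos = 1
    while True:
        code = find_code(pos)
        if code == 0:               # empty substring => end of string
            return ''.join(out)
        out.append(chr(code))
        pos += 1
```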

If you still need to determine the exact string length (as well as the exact number of rows to be returned), applicable techniques are provided below.

### Searching for integers

If the required column is an integer, then the trivial way is as follows: convert the number into a string and retrieve it as a string. You can also use binary search, but you still have to solve the above-mentioned problem: defining the upper bound for the binary search:

1. Assume that the number size is limited to 64 bits, and the number is not negative (unsigned). For signed numbers, you can apply the same logic, and the result will be the same.

2. The maximum 64-bit number is represented by a string consisting of 20 characters. Therefore, it takes 5 queries to determine the string length because $2^5 = 32$ is the closest exact power of two from above (actually a little less, as shown below, but for now let’s round it upwards).

3. Then you have to spend on average 3.4 queries for each digit (see the section “Narrowing search range” below). The string length is $ceil(log_{10}(N))$ where $N$ is the required number.

4. Therefore, the total number of queries is $5 + 3.4 * (ceil(log_{10}(N)))$.

The maximum number of queries for the maximum length is 73, while a trivial binary search for all 64 bits requires 64 queries regardless of the number size.

To improve the algorithm you can:

1. Determine how many bits are in the number ($log_2(N)$). This value ranges from 1 to 64, i.e. you need 6 queries ($2^6 = 64$).

2. Then, depending on the bit count, you will need as many queries as there are bits in the number. The number of bits in the number is $ceil(log_2(N))$; so, in total, it’s $6 + ceil(log_2(N))$.

3. To compare the two variants, let’s remove rounding and convert the formula used to estimate the total number of queries to the binary logarithm:

$5 + 3.4 * log_{10}(N) = 5 + 3.4*\frac{log_2(N)}{log_2(10)} = 5 + \frac{3.4}{3.33}*log_2(N)$.

As you can see, the difference between the formulas is mainly reflected by the fraction $\frac{3.4}{3.33}$ that does not differ much from $1$. In other words, the two presented algorithms have approximately the same efficiency. It’s convenient to use the first algorithm when you don’t know the column type: you can convert all columns into a text type and deliver an effective attack (albeit at the cost of slightly longer queries).
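The second (bit-length) variant can be sketched as follows; `gt(x)` stands in for one injected "greater than x" comparison, and the names are illustrative:

```python
def find_integer(gt, max_bits=64):
    """Recover an unsigned integer via a 'greater than' oracle.

    Phase 1 binary-searches the bit length (6 queries for up to 64
    bits); phase 2 binary-searches the value inside the bit range.
    """
    # Phase 1: smallest b such that N < 2**b.
    lo, hi = 0, max_bits
    while lo < hi:
        mid = (lo + hi) // 2
        if gt(2 ** mid - 1):        # N >= 2**mid ?
            lo = mid + 1
        else:
            hi = mid
    bits = lo
    if bits == 0:
        return 0
    # Phase 2: N lies in [2**(bits-1), 2**bits - 1].
    lo, hi = 2 ** (bits - 1), 2 ** bits - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if gt(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo
```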

## Search ranges

This section discusses different approaches to Blind SQL injection exploitation that involve different search ranges.

### Narrowing search range

As said above, 7 queries are required to determine one character. These 7 queries make it possible to determine a value from the range 0x00-0x7f (i.e. an alphabet whose volume is $2^7$) that corresponds to the lower (English) half of the ASCII table. To accelerate the procedure, you can search not the entire lower half of the ASCII table for target characters, but some of its subsets.

Let’s assume for instance that the target string consists solely of numbers and estimate the search speed.

The alphabet’s cardinality is 10. For exact powers of 2, you could simply take the binary logarithm of the alphabet’s cardinality. However, 10 is not a power of 2; therefore, it’s a little more difficult to determine the number of queries required to identify a symbol:

1. The first query will show whether the character is in the subset [0,1,2,3,4] or in [5,6,7,8,9].

2. The second query will split the resultant group of 5 elements into subgroups consisting of 2 and 3 elements:

• [0,1] and [2,3,4] for the first case; and
• [5,6] and [7,8,9] for the second case.
3. The third query will find the value if the target character is in the ranges [0,1] and [5,6].

4. For ranges [2,3,4] and [7,8,9], you can use one query to split them into subranges consisting of one and two characters:

• [2,3,4] is split into [2] and [3,4]; and
• [7,8,9] is split into [7] and [8,9].
5. Therefore, the third query will find characters [2] and [7].

6. If the target character is in [3,4] or [8,9], then you’ll need one more (fourth) query.

In other words, you make 3 queries for 6 possible values and 4 queries for the remaining 4 values. On average, you make 3.4 queries per character. This is more than twice as fast as a full binary search over a range of 128 possible values.
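The uneven splits above can be checked with a short simulation; `count_queries` (a hypothetical helper, not from any tool) returns, for each of the `n` equally likely candidate values, how many halving queries it takes to isolate that value:

```python
def count_queries(n):
    """Queries needed for each of n candidate values under
    balanced binary halving of the candidate list."""
    counts = []
    for target in range(n):
        lo, hi, q = 0, n - 1, 0
        while lo < hi:
            mid = (lo + hi) // 2
            q += 1                   # one injected comparison
            if target > mid:
                lo = mid + 1
            else:
                hi = mid
        counts.append(q)
    return counts
```

For n = 10 it yields six values costing 3 queries and four costing 4, i.e. 3.4 on average, exactly as derived above.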

#### Mathematical estimate

You can determine the average number of queries required to identify a character from the alphabet whose cardinality is N using the following formula:

$q = \frac{(N - N_2) \cdot 2 \cdot (\log_2(N_2) + 1) + (N_2 \cdot 2 - N) \cdot \log_2(N_2)}{N}$

where $N_2$ is the exact power of 2 closest to $N$ from below: $N_2 = 2^{floor(\log_2(N))}$.

```python
from math import log, floor

def ln2(x):
    return log(x) / log(2)

def questions(N):
    pow2 = 2 ** floor(ln2(N))
    return ((N - pow2) * 2 * (ln2(pow2) + 1) + (pow2 * 2 - N) * ln2(pow2)) / N
```

The $q$ function can be approximated as $log_2(N)$, but the real $q$ for N that is not an exact power of two will always be slightly larger than $log_2(N)$.

### Error detection

Assume that the target character is a lowercase English letter in the a-z range:

1. Let’s search for the value of the respective ASCII code in the 97-122 range of numbers. The middle of this range is at 109.5.

2. Determine whether the ASCII code of the target character is greater or smaller than 109.5:

(select ascii(substr(target_col,5,1)) from target_table limit 2,1)>109

If the ASCII code is smaller, then the target range decreases to 97-109; if it’s greater, to 110-122.

3. The new range is split into two parts and so on until the character is identified.

Therefore, the average number of queries is $q(26)=4.77$.

#### Canary characters

If your initial assumption that “the target character is a lowercase English letter” is incorrect, then the binary search will bring you one of the boundary values: a or z. You won’t be able to distinguish such a result from the real “a” or “z” letters without additional queries. To solve this problem, you can add a canary character at both ends of the range; as a result, you will search for the target value not in the 97-122 range (as in the above example), but in the 96-123 range. If the search result is a canary value of 96 or 123, then you understand that your original assumption was incorrect.

This technique makes it possible to use a range search when you can assume with a good degree of probability what alphabet the target character belongs to – but aren’t 100% sure. Note that if you expand the range with canary characters, the average number of queries per character increases from $q(26) = 4.77$ to $q(28) = 4.86$.

It’s also necessary to note that if the target character wasn’t in the search range, then you have to add to the already spent queries:

• $q(97)$ if the value is less than 97; and
• $q(5)$ if the value is greater than 122.

Assuming that characters are evenly distributed, the total number of queries is $q(97) * \frac{97}{102} + q(5) * \frac{5}{102} + 4.86 = 11.33$. This is much more than the original 7 queries. In other words, your assumption about the range must have a really high probability; otherwise, an attempt to accelerate the search may result in wasted performance.

### Ranges with gaps

Now imagine that you have to search for the target character in a set of possible values that don’t form a continuous range in the ASCII table. For instance, your target is a hexadecimal string, i.e. the [0-9a-f] set of values. Two techniques can be used to fulfil this task:

• an injection using the IN and NOT IN operators; or
• a search that ignores gaps.

#### Injection using the IN and NOT IN operators

The list of valid ASCII codes in this case is as follows: [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102]. You can split this list in half and ask the DBMS if the required character is in the first sublist. The condition will look something like:

(select ascii(substr(target_col,5,1)) from target_table limit 2,1) in (48, 49, 50, 51, 52, 53, 54, 55)

If the condition is met, then the target value will be in the set [48, 49, 50, 51, 52, 53, 54, 55]. If not, in the set [56, 57, 97, 98, 99, 100, 101, 102]. Then you divide the resulting ranges in half step by step until you get the target value. $q(16) = 4$ queries have to be spent to identify one character.
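A sketch of this bisection logic; `member(subset)` is a placeholder standing in for one injected query of the form `... IN (subset)`:

```python
def find_via_in(member, values):
    """Bisect a candidate list with the IN operator.

    `member(subset)` returns True when the target value is in
    `subset`; one call corresponds to one injected IN (...) query.
    """
    candidates = list(values)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        if member(half):
            candidates = half
        else:
            candidates = candidates[len(candidates) // 2:]
    return candidates[0]
```

For the 16-value hexadecimal alphabet this costs exactly $q(16) = 4$ queries.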

It must be noted that the same problem (a potentially wrong assumption about the range) is present in this case as well: if you aren’t sure that your initial assumption about the search alphabet is correct, you won’t discover an error. To solve it, you can:

• add to the alphabet one pseudovalue, which will correspond to all characters not included in the alphabet at once; or
• use only the NOT IN construct in your search.

Example

1. Assume that the search alphabet consists of three values: [1, 3, 5].

2. After adding the N pseudovalue to the [1, 3, 5] values, you have [1, 3, 5, N].

3. Make the following query: (select ascii(substr(target_col,5,1)) from target_table limit 2,1) NOT IN (1,3). If you get False, then the target character is either 1 or 3, and the next query will bring the exact answer. If you get True, then either the answer is 5, or you made a mistake with the range.

4. Make an exact comparison with 5: (select ascii(substr(target_col,5,1)) from target_table limit 2,1) NOT IN (5). If you get True, then your initial assumption about the alphabet was incorrect.

This technique is better than the above-mentioned canary characters at the range bounds: using the NOT IN operator, you can perform a binary search in ranges with gaps, and you only have to increase the alphabet size by 1 to discover an error. Disadvantages of this technique include longer queries, because half of the alphabet has to be listed in the parameter of the IN and NOT IN operators.

#### Search that ignores gaps

There is another way to search in the [0-9a-f] set with a gap: run a search that ignores the gaps:

1. Divide the range [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102] into two halves. The range border will be between the values 55 and 56.

2. Make a query using the comparison operator: (select ascii(substr(target_col,5,1)) from target_table limit 2,1) > 55. The response makes it possible to select one of the halves of the original range.

This technique should be used until there is exactly 1 character left in the range. As a result, you will need $q(16)=4$ queries.
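The gap-ignoring search is an ordinary binary search over the sorted list of candidate codes; in this sketch, `gt(x)` stands in for one injected `> x` comparison:

```python
def find_in_alphabet(gt, codes):
    """Binary search over a sorted list of candidate ASCII codes,
    skipping the gaps between them.

    `gt(x)` returns True when the target code is greater than x;
    one call corresponds to one injected comparison.
    """
    lo, hi = 0, len(codes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if gt(codes[mid]):
            lo = mid + 1
        else:
            hi = mid
    return codes[lo]
```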

This technique can be supplemented with canary characters to search for errors.

1. Add canary values to the list containing values you are looking for:

• one value less than the smallest one: 47;
• one value more than the largest one: 103; and
• one nonexistent value, e.g. 58.
2. Run a search in the same way as shown above. The list of possible values is [47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 97, 98, 99, 100, 101, 102, 103].

3. Find the middle of the range (56) and perform the following comparison: (select ascii(substr(target_col,5,1)) from target_table limit 2,1) > 56.

4. The execution of this query will bring you information that the target value is in one of the two sets:

• [47, 48, 49, 50, 51, 52, 53, 54, 55, 56]; or
• [57, 58, 97, 98, 99, 100, 101, 102, 103].
5. Select the middle of the range, send a query, split the set into two parts, and so on.

6. Ultimately, you will either get a value from the original search range (i.e. the answer) or one of the canary values: 47, 103, or 58 (any of them will indicate that the target character wasn’t in the range). You will also know whether the target character is greater than 102, less than 48, or falls within the 59-96 range.

If the supposed range has more than one gap (let’s say, their number is $r$), then the number of canary characters increases: you have to use $2 + r$ bounds.

The use of ranges with many gaps reduces the potential performance due to larger numbers of canary characters. Queries in such cases are shorter than those used with the IN and NOT IN operators. In case of an error, you receive additional information about the range containing the wrong symbol: to the left, to the right, or in one of the gaps.

In addition to preset ranges, you can modify the search range during the attack (i.e. on-the-fly).

#### Optimization ideas

If you have already decoded several strings in a given column, and all these strings have a certain set of characters, then it’s logical to assume that subsequent strings in this column have the same set of characters. Therefore, you can dynamically assemble the range of used characters and then use this range to identify values in this column. Any potential errors in this assumption (e.g. if you had never encountered a symbol that, in fact, can be encountered) can be negated by using canary characters.

Some columns may contain characters rigidly set at specific positions. For instance, a GUID (example: 6F9619FF-8B86-D011-B42D-00CF4FC964FF) has a - (minus) character at positions [8, 13, 18, 23]. You can check whether all characters at the respective positions in already decoded rows have the same value and, instead of a binary search, send just one query to check the value of this character.
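A toy helper illustrating this idea: given the already-decoded rows, it finds the positions whose character never varies (the function is hypothetical, not from any tool):

```python
def fixed_positions(decoded_rows):
    """Return {position: char} for positions that hold the same
    character in every already-decoded row (e.g. '-' in GUIDs).

    Such positions can be confirmed with a single equality query
    instead of a full binary search.
    """
    if not decoded_rows:
        return {}
    n = min(len(r) for r in decoded_rows)
    return {i: decoded_rows[0][i] for i in range(n)
            if all(r[i] == decoded_rows[0][i] for r in decoded_rows)}
```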

You can go even further and assemble ranges based on the available statistics on the characters encountered – so that the identification of more common characters requires fewer queries. Formally speaking, you have to create an algorithm that minimizes the number of queries required to detect characters based on the available probabilistic distribution of characters. The optimal solution of this problem involves the H-tree used in the Huffman coding.

Indeed, the compression task is similar to your task: you want to find frequently occurring characters using fewer queries (i.e. spend fewer bits on them); while the purpose of the compression algorithm is to represent such characters with the minimum number of bits. Building such a tree requires a single review of all the collected information to obtain statistics. The construct itself has the $O(n * log(n))$ complexity (where $n$ is the alphabet size), which isn’t too much for the alphabets in your task.

Let’s say you have decoded several strings and the character distribution statistics in them is as follows:

| Character | Number of occurrences | Probability |
|-----------|-----------------------|-------------|
| a | 34 | 47.2% |
| b | 7 | 9.7% |
| c | 5 | 7% |
| d | 12 | 16.7% |
| e | 4 | 5.5% |
| f | 10 | 13.9% |

The following optimal H-tree can be constructed for these statistics:

Different characters will require different numbers of queries:

• 1 query for a;
• 3 queries for b, d, and f; and
• 4 queries for c and e.

Mathematical expectation of the number of queries per 1 character:

$1 * (0.472) + 3 * (0.097 + 0.167 + 0.139) + 4 * (0.07 + 0.055) = 2.18$

For a standard range search, it would be: $q(6) = 2.67$.

The H-tree can be rebuilt on a regular basis as new statistics become available. In this case, the search will often require gaps in the ranges; therefore, it’s preferable to use the technique involving the IN and NOT IN operators.
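A minimal sketch of building such a tree with Python’s heapq; `huffman_depths` (an illustrative helper) returns, for each character, its depth in the tree, i.e. the number of yes/no queries needed to identify it:

```python
import heapq

def huffman_depths(counts):
    """Build a Huffman tree over {char: occurrences} and return
    {char: depth}; the depth equals the number of binary queries
    required to identify that character."""
    heap = [(c, i, [ch]) for i, (ch, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    depth = {ch: 0 for ch in counts}
    tie = len(heap)                  # tie-breaker for equal counts
    while len(heap) > 1:
        c1, _, g1 = heapq.heappop(heap)
        c2, _, g2 = heapq.heappop(heap)
        for ch in g1 + g2:           # every leaf under the merged
            depth[ch] += 1           # node sinks one level deeper
        heapq.heappush(heap, (c1 + c2, tie, g1 + g2))
        tie += 1
    return depth
```

For the statistics above it yields 1 query for a, 3 for b, d and f, and 4 for c and e, i.e. about 2.18 queries per character on average.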

Conclusion: it’s possible to create an algorithm that adjusts the search range automatically based on the available statistics; such an algorithm will significantly accelerate the search.

## Other optimization ideas

### String concatenation

Another optimization algorithm is based on the idea of combining the entire response into one string with dividers between the parts. This allows you to determine the string length once and save queries, but each divider increases the number of characters you have to identify.

Algorithm idea

• To concatenate columns together, you can use the concatenation operation of the respective DBMS, inserting a rarely used divider (e.g. 0x01) between them. You can also concatenate rows using special functions (group_concat for MySQL and string_agg for PostgreSQL). The same 0x01 divider will be used to separate the rows.

• Analysis of the response makes it possible to distinguish the beginning of a new row by counting the number of columns. You have to spend slightly more queries to determine the (now much longer) string length. However, that number grows roughly as the logarithm of the string length, i.e. slower than the total number of queries you would spend determining the length of each string separately.

• In this case, you can also abstain from determining the string length: you simply detect characters until they run out (as described in the section “Determining the number of rows and the string length”). However, then you won’t be able to estimate how long it will take to obtain the required data, which is most likely unacceptable in real-life situations.

• The number of queries to be spent to define dividers depends on the selection range you use. The divider itself is added as one character to the range. The total number of dividers is equal to the number of returned rows minus 1. In other words, you replace the cost required to determine string lengths with the cost required to define dividers.

• Assuming that 7 queries on average are spent to determine the string length (as indicated in the section “Determining the number of rows and the string length”), this allows you to save $7 - q(len(D))$ queries per string (where $D$ is the search range). You also spend extra $q(len(D)) - q(len(D) - 1)$ queries per character since the divider is added to the range.

• You can use different ranges for searches in different columns only to a limited extent – because you use multithreading and cannot know in advance whether you would pass through a divider when you select the next character (and accordingly, get into another column) or not. Therefore, you either have to use wide ranges (which increases $q(len(D))$ and reduces the gain from not determining the string lengths), or you can use ranges corresponding to the current column and get additional errors (i.e. search results are canary characters that require you to run another search).
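The client side of this scheme then has to split the flattened blob back into rows and columns; a toy sketch assuming the 0x01 divider:

```python
def split_rows(blob, ncols, sep='\x01'):
    """Split a flattened result (all columns and rows joined with
    the same divider) back into rows of `ncols` columns each."""
    fields = blob.split(sep)
    return [fields[i:i + ncols] for i in range(0, len(fields), ncols)]
```

This is where the "counting the number of columns" step happens: every `ncols` fields start a new row.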

Conclusion: this technique is applicable only in specific cases (e.g. when all target columns have a narrow range).

### UNICODE

If the target text contains characters not included in the standard English ASCII table, then the DBMS stores them as Unicode. Some DBMSs (e.g. MySQL) use UTF-8 by default, while others (e.g. MSSQL) use UTF-16. In both cases, converting such a character with the ASCII/UNICODE function results in a value greater than 128. Therefore, a value greater than 128 indicates that the string contains Unicode characters.

Unicode characters are language-specific. For instance, the first byte of a Russian letter in UTF-8 is 0xD0 or 0xD1. If you deal with UTF-8, you can assume that one UTF-8 character holding a Russian letter is likely to be followed by another one. Then you can efficiently detect the first byte of such a character using an alphabet consisting of two values [0xD0, 0xD1], and for the second byte, you can limit the search to values corresponding to Russian letters.

However, this approach is poorly compatible with multithreaded searches: the probability that a character located $N$ characters after the current one is also a Russian UTF-8 character decreases as $N$ goes up.

In UTF-8, every second byte in a Russian word is 0xD0 or 0xD1. But the end of the word is most likely followed by a single-byte character (a space or a punctuation mark). This reduces the likelihood that 0xD0 or 0xD1 appears after an even number of bytes following an already encountered 0xD0.
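This byte layout is easy to verify with a standalone sketch (not tied to any DBMS):

```python
# First bytes of Russian letters in UTF-8: always 0xD0 or 0xD1, so a
# two-value "alphabet" is enough to find the first byte of each letter
first_bytes = {ch.encode("utf-8")[0] for ch in "приветПРИВЕТ"}
assert first_bytes == {0xD0, 0xD1}

# A space or punctuation mark after a word is a single-byte character,
# which is exactly what breaks the every-second-byte pattern
assert len(" ".encode("utf-8")) == 1
assert len(".".encode("utf-8")) == 1
```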

Conclusion: if you deal with Unicode, a possible solution is to use one thread for each string.

### Data compaction

Some DBMSs support built-in compression functions (e.g. COMPRESS in MySQL and UTL_COMPRESS.LZ_COMPRESS in Oracle). These functions make it possible to reduce the number of characters you have to determine. In addition, you don’t need to make assumptions about the alphabet used: the compression result is a BLOB. Let’s use the full range consisting of 256 values.

Note that compression functions add extra data at the beginning of the compressed BLOB: the length of the original string and the inverse transformation table. Therefore, a positive effect can be gained only for long strings, while for very short strings the effect can even be negative:

SELECT LENGTH(COMPRESS('123'))
>> 15
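This overhead is easy to reproduce outside the DBMS: MySQL's COMPRESS() output is a zlib stream prefixed with the 4-byte length of the original string, so its size can be modeled with the standard zlib module (a sketch; `compress_len` is my name):

```python
import zlib

def compress_len(data: bytes) -> int:
    # Model of LENGTH(COMPRESS(data)) in MySQL: a 4-byte length
    # of the original string followed by a zlib stream
    if not data:
        return 0  # COMPRESS('') returns an empty string
    return 4 + len(zlib.compress(data))

short = compress_len(b"123")  # overhead makes this longer than 3 bytes
long_ = compress_len(b"secret_table,secret_column;" * 40)
print(short, long_)
```

For the 3-byte input the "compressed" value is longer than the original, while the long repetitive string shrinks dramatically.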

Conclusion: this technique can be used in combination with string concatenation. The resultant concatenated string will be long, and compression algorithms will be able to compress it quite efficiently. However, you have to force the DBMS to compress a long string for each query, which can increase the execution time for each query.

## Sqlmap

### Multithreading

Sqlmap doesn't use multithreading to determine the number of rows or the string length. It searches in parallel only for characters within a single string at a time.

### Determining the string length

Sqlmap performs a search for the string length only if multiple threads are used. With a single thread, it simply assumes that the string ends as soon as the first 0x00 character is encountered.

### Search for characters

Sqlmap searches for characters using the following technique:

1. It searches for the first character with a standard binary search over the full 128-character range (7 queries).

2. Then a pivot is selected based on the previously identified character, and the next search starts from it:

• if it's a digit, sqlmap selects 48 (the ASCII code of the minimum character, digit 0);
• if it's a lowercase letter, it chooses 96 (the code of a minus 1); and
• if it's a capital letter, 64 (the code of A minus 1).

3. Then a comparison is made with the pivot, and a standard binary search is performed in the remaining range. In other words, if the previous character is a digit, the program first compares the next character with 48, and if the target value is greater, a search is performed in the 48-127 range.

The only exception is as follows: if there are no expected characters to the left of the current pivot, sqlmap immediately proceeds to a comparison with 1. If this comparison shows that the target character is smaller, sqlmap decides that it has found the end of the string and finishes the search.

Advantages of this approach include effective detection of the string end, especially if the last character of the string is a digit: in that case, the string end is determined in just three queries. But the technique also has a disadvantage: low search efficiency in some cases.

Assume that the previous character was a digit. Then the first comparison is made with 48, and if the next character exists and is greater than 48 (the most likely situation), a range of $128 – 48 = 80$ characters is used to determine it, which translates into $q(80) = 6.4$ queries. Add the query already spent on the comparison with 48, and sqlmap sends 7.4 queries in total (i.e. it may well spend 8 queries). Had sqlmap performed a plain full binary search instead, it would spend exactly 7 queries and never more.

### Selecting character range

Sqlmap has a parameter, --charset, used to set the range of characters for binary search.

### Unicode

Sqlmap can automatically detect multibyte characters, but works with them extremely slowly: in my experience, it takes 34 queries on average to identify one Russian letter.

Other optimization variants described above are not implemented in sqlmap (at least, I didn’t find them).

## Conclusions

This article provides a theoretical overview of techniques used to optimize the exploitation of Blind SQL injection. Some of them are more effective, some are less, but I tried to make it as comprehensive as possible. If you have other optimization ideas and techniques, write to my Telegram @sorokinpf, and I’ll be happy to discuss them.

Over time, I intend to implement some of the described techniques in my framework for Blind SQL injections: sqli_blinder. Currently, the framework implements a trivial binary search and supports SQLite, MSSQL, Oracle, MySQL, and PostgreSQL. To use it, you have to write one function in Python instructing the framework where to send queries and how to analyze responses.