# SIMD Compression and the Intersection of Sorted Integers

Daniel Lemire LICEF Research Center TELUQ, Université du Québec 5800 Saint-Denis Montreal, QC H2S 3L5 Can. lemire@gmail.com Leonid Boytsov Language Technologies Institute Carnegie Mellon University 5000 Forbes Ave Pittsburgh, PA 15213, USA srchvrs@cmu.edu

Nathan Kurz Verse Communications 84 La Espiral Orinda, CA 94563 USA nate@verse.com

# ABSTRACT

Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. By making use of superscalar execution together with vectorization, our S4-BP128-D4 scheme uses as little 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression.

However, if the subsequent processing of the integers is slow, the effort spent on optimizing decoding speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD GALLOPING algorithm. We exploit the fact that one SIMD instruction can compare 4 pairs of integers at once.

We experiment with two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from AOL and the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.

**Categories and Subject Descriptors:** E.4 [Coding and Information Theory]: Data Compaction and Compression

**Keywords:** performance; measurement; index compression; vector processing

# 1. INTRODUCTION

An inverted index maps terms to lists of document identifiers. A column index in a database might, similarly, map attribute values to row identifiers. Storing all these lists on disk can limit the performance since the latency of the fastest drives is several orders of magnitude higher than memory access. Thus, engineers commonly compress these lists of integers so that they fit in memory.

We assume that identifiers can be represented using 32-bit integers and that they are stored in sorted order. In this context, we can use the single instruction, multiple data (SIMD) instructions available on practically all server and desktop processors produced in the last decade (Streaming SIMD Extensions 2 or SSE2). A SIMD instruction performs the same operation on multiple pieces of data: this is also known as vector processing. Previous research [12] showed that we can decode compressed integers using less than 1.5 CPU cycles per integer in a realistic inverted index scenario by using SIMD instructions. We expect that this is at least twice as fast as any non-SIMD scheme. One of our contributions is to nearly double this speed, sometimes down to 0.7 cycles per integer decompressed.

To achieve better compression and higher decoding speed, we exploit the fact that the lists are sorted and use *differential coding*: instead of storing the integers themselves, we store the differences between successive integers, sometimes called the deltas. One downside of differential coding is that decoding requires the computation of a prefix sum to recover the original integer. When using earlier non-SIMD compression techniques, the computational cost of the prefix sum might be relatively small, but when using faster SIMD compression, the prefix sum can account for up to half of the running time. Thankfully we can accelerate the computation of the prefix sum using SIMD instructions.

Our first contribution is to revisit the computation of the prefix sum. On 128-bit SIMD vectors, we introduce 4 variations exhibiting various speed/compression trade-offs, from the most common one, to the 4-wise approach proposed by Lemire and Boytsov [12]. In particular, we show that at the same compression ratios, vector (SIMD) decoding is faster than scalar (non-SIMD) decoding. Maybe more importantly, we review how to *integrate* the prefix sum with the compression to benefit from superscalar execution. Indeed, most desktop CPUs are *superscalar* which means that they can execute more than one instruction per cycle. While outof-order-execution (OOE) can sometimes provide speedups behind the scenes, we demonstrate that implementations that consciously make better use of this superscalar potential can achieve almost double the decompression speed of naive sequential code. Some of our improved schemes can decompress more than one integer per CPU cycle. We are not aware of any similar high speed previously reported using single-threaded code, even accounting for hardware differences.

To illustrate that SIMD instructions can significantly benefit other aspects of an inverted index (or a database system), we consider the problem of computing conjunctive queries: e.g., find all documents that contain a given set of terms. In some search engines, such conjunctive queries are the first step, and sometimes the most expensive step, in query processing [6, 17]. Some categories of users, e.g., patent lawyers [10], prefer to use complex Boolean queries where conjunction is an important part of query processing.

Culpepper and Moffat showed that a competitive approach (henceforth HYB+M2) was to represent the longest lists as bitmaps while continuing to compress the short lists using differential coding [6]. To compute an intersection, the short lists are first intersected and then the corresponding bits in the bitmaps are read to finish the computation. As is commonly done, the short lists are intersected two-by-two starting from the two smallest lists: this is often called a Set-vs-Set or SvS processing. These intersections are typically computed using scalar algorithms. In fact, we are not aware of any SIMD intersection algorithm proposed for such problems on CPUs. Thus as our second contribution, we introduce new SIMD-based intersection algorithms that are up to twice as fast as competitive scalar intersection algorithms. By combining both the fast SIMD decoding and fast SIMD intersection, we can double the speed of the intersections on compressed lists.

To ease reproducibility, we make the software and data sets publicly available, including the software to process the text collections and generate query mappings.

## 2. RELATED WORK

For an exhaustive review of fast integer compression techniques, we refer the reader to Lemire and Boytsov [12]. Their main finding is that schemes compressing integers in large ( $\approx 128$ ) blocks of integers with minimal branching are faster, especially when using SIMD instructions. They reported using fewer than 1.5 CPU cycles per integer on a 2011-era Intel Sandy Bridge processor. In comparison, Stepanov et al. [16] also proposed compression schemes optimized for SIMD instructions on CPUs, but they reported using at least 2.2 CPU cycles per integer on a 2010 Intel Westmere processor.

Regarding the intersection of sorted integer lists, Ding and König [8] compared a wide range of algorithms in the context of a search engine and found that SvS with *galloping* was competitive: we describe galloping in § 5. Their own technique (RanGroup) fared better ( $\approx 10\%-30\%$ ) but it does not operate over sorted lists but rather over a specialized data structure that divides up the data randomly into small chunks. In some instances, they found that a merge (akin to the merge step in the merge sort algorithm) fared well. They also got good results with Lookup [18]: a technique that relies on an auxiliary data structure for "skipping" ahead [13]. Ding and König got poorer results with alternatives such as Baeza-Yates' algorithm [3] or adaptive algorithms [7].

Barbay et al. [4] also carried out an extensive experimental evaluation. On synthetic data using a uniform distribution, they found that Baeza-Yates' algorithm [3] fared better than SvS with galloping (by about 30%). However, on real data (e.g., TREC GOV2), SvS with galloping was superior to most alternatives by a wide margin (e.g.,  $2 \times$  faster).

Culpepper and Moffat similarly found that SvS with galloping was the fastest [6] though their own max algorithm fared nearly as well. They found that in some specific instances (for queries containing 9 or more terms) a technique similar to galloping (interpolative search) was slightly better (by less than 10%).

We could not find any work on the computation of intersections over sorted lists using SIMD instructions, except for Schlegel et al. [15]. However, they used a specialized data structure that packs the data using 16 bits per integer. They also worked on the intersection of arrays that have identical lengths. In contrast, our SIMD intersection algorithms use 32-bit integers and are designed for intersection between arrays having having differing lengths (see § 5). Experimental comparisons between our algorithms and theirs is left as future work.

Our work is focused on commodity desktop processors. Compression and intersection of integer lists using a graphics processing unit (GPU) has also received attention. Ding et al. [9] improved the intersection speed using a *parallel merge find*: essentially, they divide up one list into small blocks and intersect these blocks in parallel with the other array. On conjunctive queries, Ding at al. [9] found their GPU implementation to be only marginally superior to a CPU implementation ( $\approx 15\%$  faster) despite the assumption that the data was already loaded in GPU's global memory. They do, however, get impressive speed gains (7×) on disjunctive queries.

Ao et al. [2] proposed a parallelized compression technique (Parallel PFor) and replaced conventional differential coding with an approach based on linear regression. In our work, we rely critically on differential coding, but alternative models merit consideration [11, 19].

#### 3. INTEGER COMPRESSION

We consider the case where we have lists of integers stored using 32 bits, but where the magnitude of most integers requires fewer than 32-bits to express. We want to compress them as much as possible while spending as few CPU cycles as possible per integer. We consider 4 different fast compression schemes: VARINT, S4-BP128, FASTPFOR and S4-FASTPFOR. Both S4-BP128 and S4-FASTPFOR fared best in an exhaustive experimental comparison [12]. We review them briefly for completeness.

Many authors such as Culpepper and Moffat [6] use variable byte codes (henceforth VARINT) also known as escaping [18] for compressing integers. For example, we might code integers in  $[0, 2^7)$  using a single byte, integers in  $[2^7, 2^{14})$  using two bytes and so on. VARINT does not always compress well: it always uses at least one byte per integer. However, if most integers can be represented with a single byte, then it offers competitive decoding speed [12]. Even though Stepanov et al. [16] showed how to optimize VARINT for SIMD instructions but, we only consider the commonly used scalar VARINT.

Our fastest family of compression schemes is S4-BP128: the "S4" stands for 4-integer SIMD, "BP" stands for "Binary Packing", and "128" indicates the number of integers encoded in each block. The number 128 was chosen to match the bit width of the 4-integer vector based on the following insight. Suppose you encode 128 integers using exactly *b* bits per integer. The result will use  $128 \times b$  bits of memory ( $b \times 16$  bytes of memory). In this way, our algorithm can always make full use of 128-bit vector processing on each 16-byte chunk, with larger bit widths simply requiring more chunks. We call this process *bit packing*, and the reverse process *bit unpacking*.

Only 4 basic operations are required for bit unpacking: bitwise or, bitwise and, logical shift right, and logical shift left. The corresponding 128-bit SSE2 instructions for packed 32-bit integers in Intel and AMD processors are por, pand, psrld, and pslld. For a given bit width b, no branching is required for bit packing or bit unpacking. Thus, we can create one bit unpacking function for each bit width b and select the desired one using an array of function pointers or a switch/case statement. (A generic bit unpacking procedure for a block of 128 integers is given by Algorithm 1 and discussed again in  $\S$  4.) In the S4-BP128 format, we decompose arrays into meta-blocks of 2048 integers, each containing 16 blocks of 128 integers. Before bit packing the 16 blocks, we write 16 bit widths (b) using one byte each. For each block, the bit width is the smallest value b such that all corresponding integers are smaller than  $2^b$ . The value b can range from 0 to 32. In practice, lists of integers might not be divisible by 2048: we compress the remaining integers using VARINT.

The downside of the S4-BP128 approach is that the largest integers in a block of 128 integers determine the compression ratio of all these integers. We could use smaller blocks (e.g., 32) to improve compression. However, a better approach for performance might be *patching* [22] wherein the block is first decompressed using a smaller bit width, and then a limited number of entries requiring greater bit widths are overwritten ("patched") using additional information. In our version of patched coding (FASTPFOR), we proceed as in S4-BP128, that is we bit pack blocks of 128 integers.

However, instead of picking the bit width b such that all integers are smaller than  $2^b$ , we pick a different bit width b'that might be smaller than b. That is, we only bit pack the least significant b'-bits from each integer. We must still encode the missing information for all integers larger or equal to  $2^{b'}$ : we call each such integer an exception. For this purpose, for each block, we store both bit widths b and b', as well as the number of exceptions and their locations.

Each location where an exception should be applied is stored using one byte in a *metadata* byte array. It remains to store the most significant b - b' bits of the large integers. We collect these exceptions over many blocks (up to 512 blocks) and then bit pack them together, into up to 32 bit packed arrays (one for each possible value of b - b'excluding 0). For speed, these arrays are padded up to a multiple of 32 integers. The final data format is an aggregation of the bit packed blocks (using b' bits per integer), the *metadata* byte array and the bit packed arrays corresponding to the exceptions.

When the number of integers is not divisible by 2048, we compress the remainder of the integers using VARINT. Decoding using FASTPFOR is similar to decoding with S4-BP128, except that after bit unpacking the least significant b' bits for the integers of a block, we must add the most significant bits of the exceptions. For this end, the packed arrays of exception values are unpacked as needed and we loop over the exceptions. For each block, we pick b' so as to minimize  $128 \times b' + c(b')(b - b' + 8)$  where c(b') is the number exceptions generated for a given value b' (e.g., c(b) = 0). This heuristic for picking b' can be computed quickly if we first tabulate how many integers in the block are less than  $2^x$  for  $x = 0, \ldots, b$ . For this purpose, we use the assembly instruction **bsr** (Bit Scan Reverse) to calculate the log<sub>2</sub> of each integer.

We can compare FASTPFOR with other patched schemes as follows [12]: the original schemes from Zukowski et al. [22] is faster than FASTPFOR but has worse compression ratios (even worse than the bit packing method S4-BP128 that does not use patching), OptPFD [21] by Yan el al. sometimes compresses better than FASTPFOR, but can be up to two times slower, and their NewPDF [21] is typically slower and offers worse compression ratios than FASTPFOR.

Because FASTPFOR relies essentially on bit packing for compression, it is easy to vectorize it using the same SIMDbased bit packing and unpacking functions used by S4-BP128. We call the resulting scheme S4-FASTPFOR. In the original scheme by Lemire and Boytsov [12] (called SIMD-FASTPFOR), the bit-packed exceptions were padded up to multiples of 128 integers instead of multiples of 32 integers as in FASTP-FOR. While this insured that all memory pointers where aligned on 16 bytes boundaries, it also adversely affected compression. Our S4-FASTPFOR has essentially the same data format as FASTPFOR and, thus, the same compression. These changes require that we use (1) scalar bit packing/unpacking for up to 32 integers per packed array and (2) unaligned load and store SIMD instructions. Neither of these differences significantly impacts performance, and the compression ratios improve by  $\approx 5\%$ .

#### 4. DIFFERENTIAL CODING

Differential coding takes a sorted integer list  $x_1, x_2, \ldots$  and replaces it with successive differences (or deltas)  $\delta_2, \delta_3, \ldots = x_2 - x_1, x_3 - x_2, \ldots$  If we keep the first value intact  $(x_1)$ , this transformation is invertible. We can then apply various integer compression schemes on the deltas. Since the deltas are smaller than the original integers in the list, we get better compression ratios.

If the sorted lists contain no repetition, we can further subtract one from all deltas as we know that they are greater than zero. However, this only offers a significant benefit in compression if we expect many of the deltas to be close to zero. And, in such cases, other schemes such as bit vectors might be more appropriate. Thus, we do not consider this option any further.

Computing deltas during compression is an inexpensive operation that can be easily accelerated with superscalar execution or even SIMD instructions. However, recovering the original list from the deltas when decompressing can be more time consuming because of the inherent data-dependencies: the value of each recovered integer can only be calculated after the value of the integer immediately preceding it is known. Indeed, it involves the computation of a *prefix sum*:  $x_i = x_{i-1} + \delta_i$ . A naive implementation could end up using one or more CPU cycles per integer just to calculate the prefix sum. For a moderately fast scheme such as VARINT, this is not a concern, but for faster schemes, computation of the prefix sum can become a performance bottleneck. To alleviate this problem, we can sacrifice some compressibility, by computing the deltas on four-by-four basis (henceforth D4):  $\delta_i = x_i - x_{i-4}$  [12]. While fast, it also generates larger deltas.

As further refinements, we propose to consider 4 different forms of differential coding that offer different compression/speed trade-offs. As before, we assume 128-bit vectors and process 32-bit integers.

• The fastest is D4 which computes the deltas four-by-four: e.g.,  $(\delta_5, \delta_6, \delta_7, \delta_8) = (x_5, x_6, x_7, x_8) - (x_1, x_2, x_3, x_4)$ . We expect the deltas to be  $4 \times$  larger, which degrades the compression by approximately 2 bits per integer. However, a single 1-cycle-latency SIMD instruction (paddd in SSE2) can correctly calculate the prefix sum of four consecutive integers.

• The second fastest is DM. It is similar to D4 except that instead of subtracting the previous vector of integers, we subtract only the largest of these integers:  $\delta_{4i+j} = x_{4i+j} - x_{4i-1}$ . E.g.,

$$(\delta_5, \delta_6, \delta_7, \delta_8) = (x_5, x_6, x_7, x_8) - (x_4, x_4, x_4, x_4).$$

We expect the deltas to be  $2.5 \times$  larger on average. Compared to the computation of the prefix sum with D4, DM requires one extra instruction (**pshufd** in SSE2) to copy the last component to all components:  $(x_1, \ldots, x_4) \rightarrow$  $(x_4, \ldots, x_4)$ . On Intel processors, the **pshufd** instruction is fast.

• The third fastest is D2:  $\delta_i = x_i - x_{i-2}$ . E.g.,

$$(\delta_5, \delta_6, \delta_7, \delta_8) = (x_5, x_6, x_7, x_8) - (x_3, x_4, x_5, x_6).$$

The deltas should be only  $2 \times$  larger on average. The prefix sum for D2 can be implemented using 4 SIMD instructions.

- 1. Shift the delta vector by 2 integers (in SSE2 using **psrldq**): e.g.,  $(\delta_5, \delta_6, \delta_7, \delta_8) \rightarrow (0, 0, \delta_5, \delta_6)$ .
- 2. Add the original delta vector with the shifted version: e.g.,  $(\delta_5, \delta_6, \delta_5 + \delta_7, \delta_6 + \delta_8)$ .
- 3. Select from the previous vector the last two integers and copy them twice (in SSE2 using pshufd), e.g.,  $(x_1, x_2, x_3, x_4) \rightarrow (x_3, x_4, x_3, x_4)$ .
- 4. Add the results of the last two operations.
- The slowest approach is D1 which is just regular differential coding ( $\delta_i = x_i - x_{i-1}$ ). It generates the smallest deltas. We compute it with a well-known approach using 6 SIMD instructions.
  - 1. The first two steps are as with D2 to generate, for example,  $(\delta_5, \delta_6, \delta_5 + \delta_7, \delta_6 + \delta_8)$ . Then we take this result, shift it by one integer and add it to itself. We get, for example,

$$(\delta_5, \delta_5 + \delta_6, \delta_6 + \delta_5 + \delta_7, \delta_5 + \delta_6 + \delta_7 + \delta_8).$$

- 2. We copy the last integer of the previous vector to all components of a new vector. E.g., we generate  $(x_4, x_4, x_4, x_4)$ .
- 3. We add the last two results to get the final answer.

We summarize the different techniques (D1, D2, DM, D4) in Table 1. We stress that the number of instructions is not a measure of running-time performance, if only because of superscalar execution. Our analysis is for 4-integer SIMD instructions: for wider SIMD instructions, the number of instructions per integer quickly diminishes.

Combining a compression scheme like VARINT and differential coding is not a problem. Since we decode integers one at a time, we can easily integrate computation of the prefix sum into the decoding function: as soon as a new integer is decoded, it is added to the previous integer.

Table 1: Comparison between the 4 vectorized differential coding techniques with 4-integer SIMD instructions

|    | size of deltas | instructions/int |
|----|----------------|------------------|
| D1 | $1.0 \times$   | 1.5              |
| D2 | $2.0 \times$   | 1                |
| DM | $2.5 \times$   | 0.5              |
| D4 | $4.0 \times$   | 0.25             |

For S4-BP128 and S4-FASTPFOR, we could integrate differential coding at the block level. That is, we could decode 128 deltas and then calculate the prefix sum to convert these deltas back to the original integers. Though the decoding requires two passes over the same small block of integers, it is unlikely to cause many expensive cache misses.

Maybe surprisingly, we can do substantially better, at least for schemes such as S4-BP128. Instead of using two passes, we can use a single pass where we do both the bit unpacking and the computation of the prefix sum. In some cases, the one-pass approach is almost twice as fast as the two-pass approach because the CPU is able to use superscalar execution to compute the prefix "in parallel" with the bit unpacking. Thus we benefit from both superscalar execution and the SIMD instructions. Algorithm 1 illustrates the bit unpacking routine for a block of 128 integers. It takes as a parameter a SIMD prefix-sum function P used at lines 10 and 15: for D4, we have P(t, v) = t + v. (Omitting lines 10 and 15 disables differential coding.) In practice, we generate one such function for each bit width b and for each prefix-sum function P. The prefix sum always starts from an initial vector (v). By convention, we set v = (0, 0, 0, 0)initially and then, after decoding each block of 128 integers, v is set to the last 4 integers decoded.

Beside the integration of differential coding with the bit unpacking, we have also improved over Lemire and Boytsov's bit unpacking [12, Fig. 7] in another way: whereas each of their procedures may require several masks, our implementation uses a single mask per procedure (see line 4 in Algorithm 1). Given that we only have 16 SIMD registers on our Intel processors, attempting to keep several of them occupied with constant masks is wasteful.

Unfortunately, for S4-FASTPFOR, it is not clear how to integrate bit unpacking and computation of the prefix sum to exploit superscalar execution. Indeed, S4-FASTPFOR requires three separate operation in sequence: bit unpacking, patching and prefix sum. Patching must happen after bit unpacking, but before the prefix sum. This makes tight integration between patching and the prefix sum difficult.

## 5. FAST INTERSECTIONS

To compute the intersection between several sorted lists quickly, a competitive approach is Set-vs-Set (SvS): we sort the lists in order of non-decreasing cardinality and intersect them two-by-two, starting with the smallest. A textbook intersection algorithm between two lists (akin to the merge sort algorithm) runs in time O(m + n) where the lists have length m and n (henceforth we call it SCALAR). Though it is competitive when m and n are similar, there are better alternatives when  $n \gg m$ . Such alternative algorithms assume that we intersect a small list r with a large list f. They iterate over the small list: for each element  $r_i$ , they seek a match  $f_j$  in the second list using some search procedure. Whether

Algorithm 1 Unpacking procedure using 128-bit vectors with integrated differential coding. We write  $\gg$  for the bitwise right shift,  $\ll$  for the bitwise left shift, & for the bitwise AND, and | for the bitwise OR. The binary function P depends on the type of differential coding (e.g., to disable differential coding set P(t, v) = t).

1: input: a bit width b, a list of 32-bit integers  $y_1, y_2, \ldots, y_b$ , prefix-sum seed vector v2: **output**: list of 128-bit integers in  $[0, 2^b)$ 3:  $w \leftarrow \text{empty list}$ 4:  $M \leftarrow (2^{b} - 1, 2^{b} - 1, 2^{b} - 1, 2^{b} - 1)$  {Reusable mask} 5:  $i \leftarrow 0$ 6: for  $k = 0, 1, \dots, b - 1$  do while  $i + b \leq 32 - b$  do 7:  $t \leftarrow (y_{1+4k} \gg i, y_{2+4k} \gg i, y_{3+4k} \gg i, y_{2+4k} \gg i)$ 8:  $t \leftarrow t \& M$  {Bitwise AND with mask} 9: 10: $t \leftarrow P(t, v)$  and  $v \leftarrow t$  {Prefix sum} 11:append integers  $t_1, t_2, t_3, t_4$  to list w  $i \leftarrow i + b$ 12:if i < 32 then 13: $t \leftarrow (y_{1+4k} \gg i, y_{2+4k} \gg i,$  $y_{3+4k} \gg i, y_{2+4k} \gg i$ 14: $(y_{5+4k} \ll 32 - i, y_{6+4k} \ll 32 - i,$  $y_{7+4k} \ll 32 - i, y_{8+4k} \ll 32 - i)$  $t \leftarrow P(t, v)$  and  $v \leftarrow t$  {Prefix sum} 15:16:append integers  $t_1, t_2, t_3, t_4$  to list w 17: $i \leftarrow i - 32$ 18:else 19: $i \leftarrow 0$ 20: return w

there is a match or not, they advance in the second list up to the first point where the value is at least as large as the one in the small list  $(f_j \ge r_i)$ . The search procedure assumes that the lists are sorted so it can skip values. A popular algorithm uses galloping search [5] (also called exponential search): we pick the next available integer  $r_i$  from the small list and seek an integer at least as large in the other list, looking first at the next available value, then looking twice as far, and so on doubling the distance each time. When we find the first integer  $f_j \ge r_i$  that is not smaller than  $r_i$ , we use binary search to determine whether the second list contains  $r_i$  (among integers not larger than  $f_j$ ). Computing galloping intersections requires only  $O(m \log n)$  time which is better than O(m + n) when  $n \gg m$ .

When implementing the SvS algorithm, it is convenient to be able to write out the result of the intersection directly in one of the two lists. This removes the need to allocate additional memory to save the result of the intersection. Writing over the input data in this manner is not automatically safe. Nonetheless, all our intersection algorithms are designed to have this *output-to-input* property: you can always write the result in the shorter of the two lists.

Our simplest SIMD intersection algorithm is V1 (see Algorithm 2). It is essentially equivalent to a simple textbook intersection algorithm (SCALAR) except that we access the integers of the long lists in blocks of  $\tau$  integers. We advance in the short list one integer at a time. We then advance in the long list until we find a block of  $\tau$  integers such that the last one is at least as large as the integer in the short list. We then compare the block of  $\tau$  different integers in the long list with the integer in the short list. If there is a match, that is, if one of the  $\tau$  integers is equal to the integer from the short list, we write that integer to the intersection list.

Algorithm 2 The V1 intersection algorithm

**Require:** SIMD architecture with  $\tau$ -integer vectors 1: **input**: two sorted non-empty arrays of integers r, f2: assume: length(f) is divisible by  $\tau$ 3: x initially empty dynamic array (our answer) 4:  $j \leftarrow 1$ 5: for  $i \in \{1, 2, ..., \text{length}(r)\}$  do 6:  $R \leftarrow (r_i, r_i, \ldots, r_i)$ 7: while  $f_{j-1+\tau} < r_i$  do 8:  $j \leftarrow j + \tau$ if j > length(f) then 9: 10: return x11: $F \leftarrow (f_j, f_{j+1}, \dots, f_{j-1+\tau})$ if  $R_i = F_i$  for some  $i \in \{1, 2, \dots, \tau\}$  then 12:13:append  $r_i$  to x 14: return x

However, the computational complexity of the V1 algorithm is still  $O(m + n/\tau)$  which means that when  $n \gg m$ , a simple galloping intersection can become faster as it has complexity  $(O(m \log n))$ . We can optimize further and add two layers on branching (henceforth V3, see Algorithm 3). Thus, some comparisons can be skipped. However, when nis large, galloping is superior to both V1 and V3. Thus, we also created a SIMD-based galloping (henceforth SIMD GALLOPING, see Algorithm 4). It uses the same ideas as galloping search, except that we exploit the fact that SIMD instructions can compare 4 pairs of integers at once. SIMD GALLOPING has the same complexity in terms of m and n as scalar galloping  $(O(\frac{m}{\tau} \log n))$  so we expect good scalability.

In practice, we find that our SIMD GALLOPING is always faster than a non-SIMD galloping implementation. Nevertheless, to fully exploit the speed of SIMD instructions, we find it desirable to still use V1 and V3 when they are faster. Thus, we use a combination of V1, V3, and SIMD galloping where a choice of the intersection algorithm is defined by the following heuristic:

- When length(r) ≤ length(f) < 50 × length(r), we use the V1 algorithm with 8-integer vectors (τ = 8, see Algorithm 2).
- When  $50 \times \text{length}(r) \leq \text{length}(f)$ , we use the V3 algorithm with 32-integer vectors ( $\tau = 16$ , see Algorithm 3)
- When  $1000 \times \text{length}(r) \leq \text{length}(f)$ , we use the SIMD GALLOPING algorithm with 32-integer vectors ( $\tau = 16$ , see Algorithm 4).

Our SIMD Intersection algorithms assume that the longer array has length divisible by either  $\tau$  or  $4\tau$ . In practice, when this is not the case, we complete the computation of the intersection with the SCALAR algorithm. We expect that the contribution of this step to the total running time is negligible.

#### Algorithm 3 The V3 intersection algorithm

**Require:** SIMD architecture with  $\tau$ -integer vectors 1: **input**: two sorted non-empty arrays of integers r, f2: **assume:** length(f) is divisible by  $4\tau$ 3: x initially empty dynamic array (our answer) 4:  $j \leftarrow 1$ 5: for  $i \in \{1, 2, ..., \text{length}(r)\}$  do 6:  $R \leftarrow (r_i, r_i, \ldots, r_i)$ while  $f_{j-1+4\tau} < r_i$  do 7:  $j \leftarrow j + 4\tau$ 8: if j > length(f) then 9: return x10:if  $f_{j-1+2\tau} \geq r_i$  then 11: if  $f_{j-1+\tau} \ge r_i$  then 12:  $F \leftarrow (f_j, f_{j+1}, \dots, f_{j-1+\tau})$ 13: else 14:  $F \leftarrow (f_{j+\tau}, f_{j+2+\tau}, \dots, f_{j-1+2\tau})$ 15:16:else if  $f_{j-1+3\tau} \ge r_i$  then  $F \leftarrow (f_{j+2\tau}, f_{j+2+2\tau}, \dots, f_{j-1+3\tau})$ 17:18: 19:else  $F \leftarrow (f_{j+3\tau}, f_{j+2+3\tau}, \dots, f_{j-1+4\tau})$ 20:if  $R_i = F_i$  for some  $i \in \{1, 2, \ldots, \tau\}$  then 21:append  $r_i$  to x22:23: return x

#### Algorithm 4 The SIMD GALLOPING algorithm

**Require:** SIMD architecture with  $\tau$ -integer vectors 1: **input**: two non-empty sorted arrays of integers r, f2: assume: length(f) is divisible by  $\tau$ 3: x initially empty dynamic array (our answer) 4:  $i \leftarrow 1$ 5: for  $i \in \{1, 2, ..., \text{length}(r)\}$  do 6:  $R \leftarrow (r_i, r_i, \ldots, r_i)$ find by sequential search the smallest  $\delta$  such that 7: $f_{j+\delta-1+\tau} \geq r_i$  for  $\delta = 0, 1\tau, 2\tau, 4\tau, \dots$ , length(f) –  $1 + \tau$ , if none return x find by binary search the smallest  $\delta_{\min}$  in  $[|\delta/2|, \delta]$ 8: divisible by  $\tau$  such that  $f_{j+\delta_{\min}-1+\tau} \geq r_i$ 9:  $j \leftarrow j + \delta_{\min}$  $F \leftarrow (f_j, f_{j+1}, \ldots, f_{j-1+\tau})$ 10:

- 11: **if**  $R_i = F_i$  for some  $i \in \{1, 2, \dots, \tau\}$  **then**
- 12: append  $r_i$  to x
- 13: **return** *x*

#### 6. EXPERIMENTAL RESULTS

We assess experimentally the unpacking speed with *integrated* differential coding (§ 6.4), the decoding speed of the corresponding compression schemes (§ 6.5), the benefits of our SIMD-based intersection schemes (§ 6.6), and finally, SIMD-accelerated bitmap-list hybrids (§ 6.7) with realistic query logs and inverted indexes.

## 6.1 Software

All our software is freely available online<sup>1</sup> under the Apache Software License 2.0. The code is written in C++ using the C++11 standard. Our code builds using several compilers such as clang++ 3.2 and Intel icpc 13. However, we use GNU GCC 4.7 on a Linux PC for our tests. All code was compiled using the -O3 flag. By design, all our software is single-threaded. For clarity, we implemented scalar schemes (FASTPFOR and VARINT) without using SIMD instructions and with the scalar equivalent of D1 differential coding.

#### 6.2 Hardware

We ran all our experiments on an Intel Xeon CPU (E5-1620, Sandy Bridge) running at 3.60 GHz. This CPU has 4 cores, but we expect that a single core is used during our benchmarks. It also has 10 MB of L3 cache as well as 32 kB and 256 kB of L1 and L2 data cache per core. We have 32 GB of RAM (DDR3-1600) running quad-channel. We estimate that we can read from RAM at a speed of 4 billion integers per second and from L3 cache at 8 billion integers per second. All data is stored in RAM so that disk performance is irrelevant.

#### 6.3 Real data

To fully assess our results, we need realistic data sets. For this purpose, we use posting lists extracted from the on TREC collections ClueWeb09 (Category B) and GOV2. Collections GOV2 and ClueWeb09 (Category B) contain roughly 25 and 50 million HTML documents, respectively. These documents were crawled from the web. In the case of GOV2 almost all pages were collected from websites in either the .gov or .us domains, but the ClueWeb09 crawl was not limited to any specific domain.

ClueWeb09 is a partially sorted collection: on average it has runs of 3.7 thousand documents sorted by URLs. In GOV2, the average length of the sorted run is almost one. Thus, we use two variants of GOV2: the original and a sorted one, where documents are sorted by their URLs.

We first indexed collections using Lucene (version 4.6.0): the words were stopped using the default Lucene settings, but not stemmed. Then, we extracted postings corresponding to one million most frequent terms.<sup>2</sup> Uncompressed, extracted posting lists from GOV2 and ClueWeb09 use 23 GB and 59 GB, respectively. They include only document identifiers.

Our corpora represent realistic sets of documents obtained by splitting a large collection so that each part fits into memory of a single server. In comparison, other researchers use collections of similar or smaller sizes. Culpepper and Moffat [6], Ao et al. [2], Barbay et al. [4], Ding et al. [9] and Vigna [19] used TREC GOV2 (25M documents), Ding and König [8] indexed Wikipedia (14M documents) while Transier and Sanders [18] used WT2g (250k documents).

In addition we started with two realistic query logs: the AOL query log (20 million web queries from 650 thousand users) and TREC million-query track (1MQ) logs (60 thousand queries from years 2007–2009). Results for the AOL

<sup>&</sup>lt;sup>1</sup>https://github.com/lemire/SIMDCompressionAndIntersection. <sup>2</sup>The extracting software is available online: https://github. com/searchivarius/IndexTextCollect.

query logs are similar to that of the TREC 1MQ logs, thus we only present results for the TREC 1MQ set. We randomly picked 20 thousand queries containing at least two indexed terms. These queries were converted into sequences of posting identifiers for GOV2 and ClueWeb09. Table 2 gives intersection statistics for the GOV2 and Clueweb09 collections. We give the percentage of all queries having a given number of terms, the average size of the intersection, along with the average size of the smallest posting list (term 1), the average size of the second smallest (term 2) and so on.

Table 2: Statistics about the TREC Million-Query log on two corpora. For each term we give the average number of matching documents (in thousands). There are 50M documents in total for Clueweb09 and half as many for Gov2.

(a) Clueweb09 (matching documents in thousands)

| # | %    | inter. | term 1, term 2, $\dots$                  |
|---|------|--------|------------------------------------------|
| 2 | 19.8 | 93     | 380, 2600                                |
| 3 | 32.5 | 29     | 400, 1500, 5100                          |
| 4 | 26.3 | 17     | 480, 1400, 3200, 8100                    |
| 5 | 13.2 | 12     | 420, 1200, 2600, 4800, 10000             |
| 6 | 4.9  | 4      | 350, 1000, 2100, 3700, 6500, 13000       |
| 7 | 1.7  | 5      | 390, 1100, 2100, 3400, 5200, 7300, 13000 |
|   |      |        |                                          |

(b) Gov2 (matching documents in thousands)

| # | %    | inter. | term 1, term 2, $\dots$               |
|---|------|--------|---------------------------------------|
| 2 | 19.6 | 49     | 160, 1100                             |
| 3 | 32.3 | 22     | 180, 710, 2400                        |
| 4 | 26.4 | 11     | 210, 620, 1500, 3700                  |
| 5 | 13.4 | 7      | 170, 520, 1100, 2200, 4400            |
| 6 | 5.0  | 3      | 140, 420, 850, 1500, 2600, 5100       |
| 7 | 1.8  | 9      | 190, 440, 790, 1300, 2200, 3200, 5400 |
|   | 1    | 1      |                                       |

#### 6.4 Bit unpacking

To test the speeds of our various bit unpacking routines, we generated random arrays of 4096 integers for each bit width  $b = 1, 2, \ldots, 31$ . We ensured that the randomly generated integers have gaps that fit in b bits:  $0 \le x_{i+4} - x_i < 2^b$ . We use the C rand function as a pseudo-random number generator. For each bit width, we ran  $2^{14}$  sequences of packing and unpacking and report only the average.

In Fig. 1a, we plot the speed ratios of the integrated bit unpacking, vs. the regular one where differential coding is applied separately on small blocks of integers. To differentiate the integrated and regular differential coding we use an i prefix (iD4 vs. D4). For D4, integration improves the speed by anywhere from over 30 % to 90 %. This suggests that the CPU is able to make good use of superscalar execution. For D1, the results are less impressive as the gains range from 20 % to 40 %. For D2 and DM, the result lies in-between.

In Fig. 1b, we compare all of the integrated bit unpacking procedures (iD4, iDM, iD2, iD1). As our analysis predicted (see Table 1), D4 is the fastest followed by DM, D2 and D1. For comparison, we also include the scalar bit unpacking speed. The integrated bit unpacking (iscalar curve) is sometimes twice as fast as regular bit unpacking (scalar curve). Even so, the fastest scalar unpacking routine has half the speed as our slowest SIMD unpacking routine (iD1).



(a) One-pass (integrated) vs. two-pass (block-level) prefix sum



(b) Several differential bit unpacking speeds

Figure 1: Unpacking speed for all bit widths (b = 1, ..., 31). Speed is reported in billions of integers per second. We use arrays of length  $2^{12}$  for these tests.

## 6.5 Decoding speed

To test decoding speed, we generated arrays using the ClusterData distribution from Anh and Moffat [1]. This distribution primarily leaves small gaps between successive integers, punctuated by occasional larger gaps. We generated arrays of 65536 integers in either  $[0, 2^{19})$  or  $[0, 2^{30})$ . In each case, we give the decoding speed of our compression schemes, the entropy of the deltas as well as the speed (in billions of integers per second) of a simple copy (implemented as a call to memcpy) in Table 3. All compression schemes use differential coding. We append a suffix (-D4, -DM, -D2, and -D1) to indicate the type of differential coding used. We append the -NI suffix (for non-integrated) to S4-BP128-\* schemes where the prefix sum requires a second pass. We get our best speeds with S4-BP128-D4 and S4-BP128-DM. Given our 3.60 GHz clock speed, they decode integers using between 0.7 and 0.8 CPU cycles per integer. In contrast, the best results reported by Lemire and Boytsov [12] was 1.2 CPU cycles per integer, using similar synthetic data with a CPU of the same family (Sandy Bridge).

SIMD-BP128-D1-NI is 38% faster than SIMD-FastPFOR but SIMD-FastPFOR compensates with a superior compression rate (5%-15\%). However, with integration, SIMD-BP128-D1 increases the speed gap to 50%-75\%.

Table 3: Results on ClusterData for dense  $(2^{16} \text{ integers in } [0, 2^{19}))$  and sparse  $(2^{16} \text{ integers in } [0, 2^{30}))$ . We report decoding speed in billions of integers per second. Shannon entropy of the deltas is given. Our Intel CPU runs at 3.6 GHz.

|                | dei      | nse     | sparse   |         |  |
|----------------|----------|---------|----------|---------|--|
|                | bits/int | Bints/s | bits/int | Bints/s |  |
| entropy        | 3.9      | -       | 14.7     | -       |  |
| copy           | 32.0     | 5.4     | 32.0     | 5.4     |  |
| S4-BP128-D4    | 6.0      | 5.4     | 16.5     | 4.4     |  |
| S4-BP128-D4-NI | 6.0      | 3.9     | 16.5     | 3.3     |  |
| S4-BP128-DM    | 5.9      | 5.5     | 16.3     | 4.1     |  |
| S4-BP128-DM-NI | 5.9      | 3.9     | 16.3     | 3.3     |  |
| S4-BP128-D2    | 5.5      | 4.8     | 16.0     | 3.5     |  |
| S4-BP128-D2-NI | 5.5      | 3.7     | 16.0     | 3.2     |  |
| S4-BP128-D1    | 5.0      | 3.9     | 15.5     | 3.0     |  |
| S4-BP128-D1-NI | 5.0      | 3.0     | 15.5     | 2.7     |  |
| S4-FastPFOR-D4 | 5.8      | 3.1     | 16.1     | 2.6     |  |
| S4-FastPFOR-DM | 5.5      | 2.8     | 15.8     | 2.4     |  |
| S4-FastPFOR-D2 | 5.1      | 2.7     | 15.4     | 2.4     |  |
| S4-FASTPFOR-D1 | 4.4      | 2.2     | 14.8     | 2.0     |  |
| FastPFOR       | 4.4      | 1.1     | 14.8     | 1.1     |  |
| VARINT         | 8.0      | 1.2     | 17.2     | 0.3     |  |

## 6.6 Intersections

To test intersections, we again generate lists using the ClusterData distribution [1]: all lists contain integers in  $[0, 2^{26})$ . Lists are generated in pairs. First, we set the target cardinality n of the largest lists to  $2^{22}$ . Then we vary the target cardinality m of the smallest list from n down to n/10000. To ensure that intersections are not trivial, we first generate an "intersection" list of size m/3 (rounded to the nearest integer). The smallest list is built from the union of the intersection list with another list made of 2m/3 integers. Thus, the maximum length of this union is m. The longest list is made of the union of the intersection list with another list of cardinality n - m/3, for a total cardinality of up to n. The net result is that we have two sets with cardinalities  $\approx m$  and  $\approx n$  having an intersection is at least as large as m/3. We compute the average of 5 intersections (using 5 pairs of lists).

We report all speeds relative to the basic SCALAR intersection. In Fig. 2a, we compare the speeds of our 3 SIMD intersection functions: V1, V3, and SIMD GALLOPING. We see that V1 is the fastest for ratios of up to 16:1, whereas SIMD GALLOPING is the fastest for ratios larger than 1024:1. Inbetween, V3 is sometimes best. This justifies our heuristic which uses V1 for ratios up to 50:1, V3 for ratios up to 1000:1, and SIMD GALLOPING for larger ratios. In Fig. 2b, we compare the two non-SIMD intersections (SCALAR and galloping) with our SIMD intersection procedure. We see that for a wide range of cases (up to a ratio of 64:1), our SIMD intersection procedure is clearly superior, being up to twice as fast. As the ratio increases, reflecting a greater difference in list sizes, non-SIMD galloping eventually becomes nearly as fast as our SIMD intersection.

The performance of the intersection procedures is sensitive to the data distribution. When we replaced ClusterData with a uniform distribution, the intersection speed diminished by up to a factor of 2.5: value clustering improves branch predictions and skipping [21].

We also evaluated our algorithms using postings lists for our three collections using our TREC 1M Query log (see Table 4). For this test, posting lists are uncompressed. We see that our SIMD intersection routine is nearly twice as fast

Table 4: Time required to answer various queries in ms/query along with the storage requirement in bits/int, using the TREC Million-Query log.

| scheme        | bits/int | time | bits/int   | time | bits/int  | time |
|---------------|----------|------|------------|------|-----------|------|
|               | GOV2     |      | GOV2       |      | ClueWeb09 |      |
|               | (sorted) |      | (unsorted) |      |           |      |
| Skipper [14]  | 10.2     | 2.6  | 10.5       | 4.3  | 10.2      | 7.7  |
| SIMD SvS      | 32.0     | 0.5  | 32.0       | 0.7  | 32.0      | 1.5  |
| Galloping SvS | 32.0     | 0.7  | 32.0       | 1.4  | 32.0      | 2.8  |
| SCALAR SvS    | 32.0     | 2.8  | 32.0       | 3.3  | 32.0      | 6.6  |

as (non-SIMD) galloping. In the sorted version of Gov2, we expect most gaps between document identifiers to be small, with a few large gaps. In contrast, the unsorted version of Gov2 has a more uniform distribution of gaps. Just as like we find that intersections are faster on the ClusterData distributions, we find that intersections are  $1.4-2\times$  faster on the sorted Gov2 vs. its unsorted counterpart. As a reference point, we also implemented intersections using the Skipper [14] approach: in this last case, posting lists are compressed using VARINT. In the next section (§ 6.7), we show that much better speed is possible if we use bitmaps to represent some posting lists.



Figure 2: Intersection between two lists of different cardinality as described in  $\S~6.6$ 

# 6.7 Bitmap-list hybrids (hyb+m2)

We can also use bitmaps to accelerate intersections. Culpepper and Moffat's HYB+M2 framework is simple but effective.

tive [6]: all lists where the average gap size is smaller or equal to B (where B = 8, 16 or 32) are stored as bitmaps whereas other lists are stored as compressed deltas (using VARINT or other compression algorithm). To compute any intersection, the compressed lists are first intersected (e.g., using galloping) and the result is then checked against the bitmaps. We modify their framework by replacing the compression and intersection functions with SIMD-based ones. To simplify our analysis, we use galloping intersections with scalar compression schemes such as VARINT and FASTPFOR, and SIMD intersections with SIMD compression schemes.

To improve data cache utilization, we split our corpora (32 parts for GOV2 and 64 for ClueWeb09) into parts processed separately. Thus, all intermediate results for a single part can fit into L3 cache. Given a query, we first obtain the result for the first part, then for the second, and so on. Finally, all partial results are collected in one array. In each part, we work directly on compressed lists and bitmaps, without other auxiliary data structure.

As shown in Table 5, the schemas using SIMD instructions are 2–3 times faster than FASTPFOR and VARINT while having comparable compression ratios. The retrieval times can be improved (up to  $\approx 4\times$ ) with the addition of bitmaps, but the index is sometimes larger (up to  $\approx 2\times$ ).

The partitioning strategy always improved performance of non-hybrid methods by 10–40 % without affecting compression rates. Thus, in this case, we include only results for partitioned indices. However, for hybrid schemes, partitioning can improve performance, compression ratios, or both characteristics. Thus, we present two set of results: for partitioned and non-partitioned indices. Performance of partitioned and non-partitioned bitmaps is different, because each part of the posting list may be encoded differently. For example, even though the average gap size for the complete posting is smaller or equal than B, after partitioning, there may be parts with the average gap larger than B. Unlike the unpartitioned posting, which is stored as a bitmap, such parts are stored as compressed deltas, thus using less space.

For B = 8, B = 16, and indices without bitmaps, the various S4-BP128-\* schemes provide different space vs. performance trade offs. For example, for ClueWeb09 and B = 8(partitioned), S4-BP128-D4 is 11% faster than S4-BP128-D1, but S4-BP128-D1 uses 7% less space.

We also compare scalar vs. SIMD-based algorithms. For ClueWeb09, S4-FASTPFOR-D1 can be nearly twice as fast as the non-SIMD FASTPFOR scheme. Even with B = 16where we rely more on the bitmaps for speed, S4-FASTPFOR is over 50% faster than FASTPFOR (for ClueWeb09). As expected, for B = 32, there is virtually no difference in speed and compression ratios among various S4-BP128-\* compression schemes. However, S4-BP128-D4 is still about twice as fast as VARINT so that, even in this extreme case, our SIMD algorithms are worthwhile.

We also recorded the median and the  $90^{\rm th}$  percentile retrieval times. The gains in average query times are reflected in these other metrics.

# 7. DISCUSSION AND CONCLUSIONS

After introducing faster SIMD decoding techniques and fast SIMD intersection algorithms, we used them to accelerate Culpepper and Moffat's HYB+M2; sometimes doubling the

Table 5: Time required to answer various queries in ms/query along with the storage requirement in bits/int. Entropies of the deltas are provided. We write S4-FASTPFOR as a shorthand for S4-FASTPFOR-D1.

| scheme                     | hits/int | time              | bits/int                                  | time              | bits/int   | time              |
|----------------------------|----------|-------------------|-------------------------------------------|-------------------|------------|-------------------|
|                            | GO       |                   | GO                                        |                   | CLUEV      |                   |
|                            | (sor     |                   | (unso                                     |                   | OLUEV      | VED05             |
| entropy                    | 1.       |                   | 4.                                        | ,                 | 3.         | 1                 |
| ener opj                   |          |                   | partition                                 |                   |            | -                 |
| VARINT                     | 8.2      | 6.3               | 8.5                                       | 8.2               | 8.2        | 17.5              |
| FASTPFOR                   | 3.8      | 6.3               | 6.2                                       | 8.0               | 5.0        | 17.3              |
| S4-BP128-D4                |          | 2.1               | 7.9                                       | 2.5               | 7.5        | 5.6               |
| S4-BP128-DN                |          | 2.1               | 7.7                                       | 2.6               | 7.4        | 5.7               |
| S4-BP128-D2                | 5.6      | 2.2               | 7.4                                       | 2.6               | 7.1        | 6.0               |
| S4-BP128-D1                |          | 2.4               | 6.9                                       | 2.8               | 6.6        | 6.5               |
| S4-FastPFOI                | R 3.8    | 3.5               | 6.2                                       | 4.0               | 5.0        | 9.7               |
|                            |          |                   |                                           |                   |            |                   |
|                            |          |                   | rtitioned                                 |                   |            |                   |
| VARINT                     | 5.8      | 3.0               | 7.4                                       | 4.7               | 6.2        | 9.9               |
| FASTPFOR                   | 4.3      | 3.0               | 6.3                                       | 4.5               | 5.2        | 9.3               |
| S4-BP128-D4                |          | 1.2               | 7.4                                       | 1.6               | 6.5        | 3.4               |
| S4-BP128-DN                |          | 1.2               | 7.3                                       | 1.6               | 6.4        | 3.4               |
| S4-BP128-D2                |          | 1.3               | 7.1                                       | 1.6               | 6.3        | 3.6               |
| S4-BP128-D1                |          | 1.4<br>1.9        | $\begin{array}{c} 6.8 \\ 6.3 \end{array}$ | 1.7               | 6.1<br>5.2 | 3.8<br>5 4        |
| S4-FASTPFOI                |          |                   | 0.5<br>rtitioned                          | 2.3               | 5.2        | 5.4               |
| VARINT                     | <u> </u> | 2.1               | 8.2                                       | 3.2               | 6.9        | 6.7               |
| FASTPFOR                   | 5.5      | $\frac{2.1}{2.1}$ | 7.6                                       | $\frac{3.2}{2.9}$ | 6.4        | 5.9               |
| S4-BP128-D4                |          | $0.9^{2.1}$       | 8.4                                       | 1.2               | 7.2        | 2.5               |
| S4-BP128-D2                |          | 1.0               | 8.1                                       | $1.2 \\ 1.2$      | 7.2<br>7.1 | $\frac{2.5}{2.6}$ |
| S4-BP128-D1                |          | 1.0               | 7.9                                       | 1.2<br>1.3        | 7.0        | $\frac{2.0}{2.7}$ |
| S4-FASTPFOI                |          | 1.4               | 7.6                                       | 1.6               | 6.4        | 3.7               |
|                            |          |                   | rtitioneo                                 |                   |            |                   |
| VARINT                     | 8.2      | 1.4               | 11.1                                      | 2.0               | 8.9        | 4.2               |
| FastPFOR                   | 7.8      | 1.5               | 10.9                                      | 1.8               | 8.7        | 3.7               |
| S4-BP128-D4                |          | 0.8               | 11.2                                      | 0.9               | 9.1        | 1.9               |
| S4-BP128-D2                | 8.4      | 0.8               | 11.1                                      | 1.0               | 9.1        | 2.0               |
| S4-BP128-D1                | 8.3      | 0.8               | 11.0                                      | 1.0               | 9.0        | 2.0               |
| S4-FastPFOI                | R 7.8    | 1.1               | 10.9                                      | 1.2               | 8.7        | 2.6               |
|                            |          |                   |                                           |                   |            |                   |
| VADINE                     | 7.3      | B = 3.6           | = 8<br>7.5                                | 5.1               | 6.3        | 10.3              |
| varint<br>FastPFOR         | 4.4      | 3.0               | 6.3                                       | $\frac{5.1}{4.6}$ | 5.3        | 9.0               |
| S4-BP128-D4                |          | 1.5               | $\frac{0.3}{7.4}$                         | $\frac{4.0}{1.7}$ | 6.6        | $\frac{9.0}{3.9}$ |
| S4-BP128-D4<br>S4-BP128-DN |          | $1.5 \\ 1.5$      | $7.4 \\ 7.3$                              | $1.7 \\ 1.7$      | 6.5        | $3.9 \\ 3.9$      |
| S4-BP128-D2                |          | $1.5 \\ 1.5$      | 7.3<br>7.1                                | $1.7 \\ 1.7$      | 6.4        | 3.9               |
| S4-BP128-D1                |          | $1.0 \\ 1.6$      | 6.8                                       | 1.8               | 6.2        | 4.0               |
| S4-FASTPFOI                |          | 1.9               | 6.3                                       | 2.2               | 5.3        | 5.0               |
|                            |          | B =               |                                           |                   |            |                   |
| VARINT                     | 8.1      | 2.3               | 8.4                                       | 3.4               | 7.0        | 6.8               |
| FASTPFOR                   | 6.3      | 2.1               | 7.7                                       | 2.9               | 6.5        | 5.6               |
| S4-BP128-D4                |          | 1.0               | 8.4                                       | 1.2               | 7.3        | 2.5               |
| S4-BP128-D2                |          | 1.0               | 8.2                                       | 1.2               | 7.2        | 2.6               |
| S4-BP128-D1                |          | 1.1               | 8.0                                       | 1.2               | 7.1        | 2.6               |
| S4-FastPFOI                |          | 1.2               | 7.7                                       | 1.4               | 6.5        | 3.2               |
| B = 32                     |          |                   |                                           |                   |            |                   |
| VARINT                     | 11.0     | 1.4               | 11.3                                      | 2.1               | 9.1        | 4.2               |
| FastPFOR                   | 10.1     | 1.2               | 10.9                                      | 1.6               | 8.8        | 3.2               |
| S4-BP128-D4                |          | 0.7               | 11.4                                      | 0.8               | 9.2        | 1.8               |
| S4-BP128-D2                |          | 0.7               | 11.2                                      | 0.8               | 9.2        | 1.8               |
| S4-BP128-D1                |          | 0.7               | 11.1                                      | 0.9               | 9.1        | 1.8               |
| S4-FastPFOI                | R 10.2   | 0.8               | 10.9                                      | 1.0               | 8.8        | 2.1               |

speed without sacrificing compression. We chose HYB+M2 because we believe it might be the fastest published algorithm for conjunctive queries. However, HYB+M2 has limitations: e.g, it is unclear how it could extended to support scoring functions (e.g., BM25). In future work, we will apply our fast SIMD compression and intersection techniques together with multicore processing and scoring functions. We are also interested in how our techniques scale to billions of documents: in such cases, we would need 64-bit document identifiers.

Our work was focused on 128-bit vectors. Intel and AMD recently released processors that support integer operations on 256-bit vectors using the new AVX2 instruction set. On such a processor, Willhalm et al. [20] were able to double their bit unpacking speed. Moreover, Intel plans to support 512-bit vectors in 2015 on its commodity processors. Thus optimizing algorithms for SIMD instructions will become even more important in the near future.

## 8. ACKNOWLEDGMENTS

Daniel Lemire acknowledges support from the Natural Sciences and Engineering Research Council of Canada (NSERC) with grant number 26143. We thank J. Callan for his help.

#### References

- V. N. Anh and A. Moffat. Index compression using 64bit words. Softw. Pract. Exper., 40(2):131–147, 2010.
- [2] N. Ao, F. Zhang, D. Wu, D. S. Stones, G. Wang, X. Liu, J. Liu, and S. Lin. Efficient parallel lists intersection and index compression algorithms using graphics processing units. *Proc. VLDB Endow.*, 4(8):470–481, May 2011.
- [3] R. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In S. Sahinalp, S. Muthukrishnan, and U. Dogrusoz, editors, *Combinatorial Pattern Matching*, volume 3109 of *LNCS*, pages 400–408. Springer Berlin / Heidelberg, 2004.
- [4] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. J. Exp. Algorithmics, 14:7:3.7-7:3.24, Jan. 2010.
- [5] J. L. Bentley and A. C.-C. Yao. An almost optimal algorithm for unbounded searching. *Inf. Process. Lett.*, 5(3):82–87, 1976.
- [6] J. S. Culpepper and A. Moffat. Efficient set intersection for inverted indexing. ACM Trans. Inf. Syst., 29(1):1:1– 1:25, Dec. 2010.
- [7] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '00, pages 743– 752, Philadelphia, PA, USA, 2000. Society for Industrial and Applied Mathematics.
- [8] B. Ding and A. C. König. Fast set intersection in memory. Proc. VLDB Endow., 4(4):255–266, 2011.
- [9] S. Ding, J. He, H. Yan, and T. Suel. Using graphics processors for high performance IR query processing. In *Proceedings of the 18th international conference on World wide web*, WWW '09, pages 421–430, New York, NY, USA, 2009. ACM.

- [10] Y. Kim, J. Seo, and W. B. Croft. Automatic boolean query suggestion for professional search. In *Proceedings* of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pages 825–834, New York, NY, USA, 2011. ACM.
- [11] R. Konow, G. Navarro, C. L. Clarke, and A. López-Ortíz. Faster and smaller inverted indices with treaps. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 193–202, New York, NY, USA, 2013. ACM.
- [12] D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. *Softw. Pract. Exper.*, 2013. online: http://dx.doi.org/10.1002/spe.2203.
- [13] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4):349– 379, 1996.
- [14] P. Sanders and F. Transier. Intersection in integer inverted indices. In *Proceedings of the 9th Workshop on Algorithm Engineering and Experiments*, ALENEX '07, pages 71–83. SIAM, 2007.
- [15] B. Schlegel, T. Willhalm, and W. Lehner. Fast sortedset intersection using SIMD instructions. In Proceedings of the 2nd International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures, ADMS '11, 2011.
- [16] A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and P. S. Oberoi. SIMD-based decoding of posting lists. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 317–326, New York, NY, USA, 2011. ACM.
- [17] S. Tatikonda, B. B. Cambazoglu, and F. P. Junqueira. Posting list intersection on multicore architectures. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, SIGIR '11, pages 963–972, New York, NY, USA, 2011. ACM.
- [18] F. Transier and P. Sanders. Engineering basic algorithms of an in-memory text search engine. ACM Trans. Inf. Syst., 29(1):2:1–2:37, Dec. 2010.
- [19] S. Vigna. Quasi-succinct indices. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 83–92, New York, NY, USA, 2013. ACM.
- [20] T. Willhalm, I. Oukid, I. Müller, and F. Faerber. Vectorizing database column scans with complex predicates. In Proceedings of the 4th International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures, ADMS '13, pages 1–12, 2013.
- [21] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In *Proceedings of the 18th International Conference on World Wide Web*, WWW '09, pages 401–410, New York, NY, USA, 2009. ACM.
- [22] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Superscalar RAM-CPU cache compression. In *Proceedings of* the 22nd International Conference on Data Engineering, ICDE '06, pages 59–71, Washington, DC, USA, 2006. IEEE Computer Society.