Sorting a Billion Numbers

on Puzzles,

The question usually goes like this: given a large list of numbers, how do you actually sort it?

Said file is also larger than the amount of memory available to the computer, so it’s not possible to load it up and do some sort and be done with it.

A standard answer is with external sorting. First, partition the file into smaller chunks, which can be sorted separately. Then, using a K-way merge, build and output the final sorted list from the smaller chunks.

Generating a large random list

Before we begin, we need to generate a large list of simple random numbers.

34/* Open output file */
35char* outputFilename = argv[4];
36std::ofstream out;
37out.open(outputFilename, std::ofstream::out);
38
39/* Seed random number generator with current time */
40std::srand(std::time(0));
41
42/* Print n random numbers */
43for (int i = 0; i < n; i++) {
44    out << std::rand() << "\n";
45}
46
47/* Flush and close output file */
48out.close();

Running the generate command to create a billion random numbers takes about 7 minutes on my MacBook 12-inch.

$ time ./generate -n 1000000000 -o input.txt
362.32s user 30.76s system 92% cpu 7:03.97 total

The parallel version of the script takes about the same time, though it should clearly be faster on better systems.

./generate -n 250000000 -o input-1.txt &
./generate -n 250000000 -o input-2.txt &
./generate -n 250000000 -o input-3.txt &
./generate -n 250000000 -o input-4.txt &
wait
cat input-*.txt >> input.txt
rm input-*.txt

One billion numbers gets us a file size of around 9.8GB (this is less than 10GB). This makes sense because positive, signed, 32-bit integers are in the range [0, 2,147,483,647]. This means each number takes up around 1 - 10 digits, occupying a byte for each digit and one more for the newline character.

Vs GNU Sort

We know that the GNU sort uses a similar method, so 40 minutes on my computer is the benchmark time to match.

$ time gsort -no output-sorted.txt input.txt
4626.90s user 418.01s system 205% cpu 40:48.99 total

The gsort command is the GNU version of sort which I obtained by installing the Homebrew package coreutils on my Mac. The gsplit command from the coreutils library also appear later on.

Splitting and Sorting

To split our large list into smaller lists. We split into 12 parts so that we can sort with 4 processes in 3 rounds.

$ time gsplit -n l/12 -d input.txt output-
0.17s user 17.94s system 41% cpu 43.146 total
$ ls -alh
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-00
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-01
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-02
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-03
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-04
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-05
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-06
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-07
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-08
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-09
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-10
-rw-r--r--   1 nicluo  staff   833M Dec  7 15:47 output-11
-rw-r--r--   1 nicluo  staff   9.8G Dec  7 12:51 input.txt

Two major components to this are:

  1. Reading from file into a vector and writing the vector to file.
  2. Sorting the vector itself.

A quick way to read and write:

12std::vector<int> readFromFile(char* filename) {
13    std::vector<int> values = std::vector<int>();
14
15    /* Open input file */
16    std::ifstream in(filename);
17
18    /* Read all integers from file */
19    if(in) {
20        int v;
21        while(in >> v) {
22            values.push_back(v);
23        }
24    }
25
26    in.close();
27
28    return values;
29}
30
31void writeToFile(char* filename, std::vector<int> &values) {
32    /* Open output file */
33    std::ofstream out;
34    out.open(filename, std::ofstream::out);
35    
36    /* Write all integers to file */
37    for(int v : values) {
38        out << v << "\n";
39    }
40    
41    out.close();
42}

An in-place quicksort implementation:

44void swap(std::vector<int> &values, int firstIndex, int secondIndex) {
45    int tmp = values[firstIndex];
46    values[firstIndex] = values[secondIndex];
47    values[secondIndex] = tmp;
48}
49
50void sortHelper(std::vector<int> &values, int start, int end) {
51    /* Check if already sorted */
52    if(end - start <= 1) return;
53
54    /* Pick the first element as the pivot,
55       we already know the elements are random
56       so this is as good as a random pivot */
57    int pivot = values[start];
58
59    int swapIndex = start + 1;
60    for(int index = start + 1; index < end; index++) {
61        /* Move elements smaller than pivot value to the front
62           of the list */
63        if(values[index] < pivot) {
64            swap(values, index, swapIndex);
65            swapIndex++;
66        }
67    }
68
69    /* Swap pivot to appropriate point in list,
70       since elements to the left of the pivot are smaller,
71       and the pivot is at the zeroth position,
72       it should swap with a smaller value, i.e. swapIndex - 1 */
73    swap(values, start, swapIndex - 1);
74
75    /* Sort sublists */
76    sortHelper(values, start, swapIndex);
77    sortHelper(values, swapIndex, end);
78}
79
80void sort(std::vector<int> &values) {
81    sortHelper(values, 0, values.size());
82}

By replacing the sort with a NOOP (a function which does not do anything), we can estimate the impact to running time for the two steps.

# Without sorting (sort() replaced with NOOP)
$ time ./sort -f output-00
71.99s user 4.17s system 96% cpu 1:18.59 total

# With sorting
$ time ./sort -f output-00
82.78s user 4.43s system 96% cpu 1:30.68 total

It seems that IO and allocation takes about 1:18 for an ~800MB file, while the in-place sorting takes only 12 seconds.

Merging

With our bunch of sorted sublists, we use a K-way merge to produce the single output list. The merge operation on sublists is suitable because we don’t have to read everything into memory to build our output. Since the head of each file contains the minimum of its chunk, then the minimum of all chunks must be among them. For each sorted number in the output list, we only need to inspect 12 numbers.

Although the code appears to only read and write a line at a time, ifstream and ofstream calls are actually buffered.

29void merge(std::vector<Seq*>& sequences, std::ofstream& output) {
30    /* Greater<Seq*> comparator for Seq* to create min heap */
31    auto comp = []( const Seq* lhs, const Seq* rhs ) { return lhs->value > rhs->value; };
32    std::priority_queue<Seq*, std::vector<Seq*>, decltype(comp)> pq(sequences.begin(), sequences.end(), comp);
33    
34    while(!pq.empty()) {
35        /* Get and pop minimum element */
36        Seq* top = pq.top();
37        pq.pop();
38        
39        output << top->value << '\n';
40        
41        /* Since the minimum element is a sequence, if it has a next value,
42           we can re-insert into the heap for the next iteration */
43        if(next(top)) {
44            pq.push(top);
45        }
46    }
47}

$ time ./merge -o output.txt -f output-*
980.08s user 79.38s system 91% cpu 19:19.35 total

Altogether

Combining all the steps into a shell script lets us complete the sort routine.

 1#!/usr/bin/env bash
 2
 3# Split file into 12 chunks, keeping individual lines together
 4gsplit -n l/12 -d input.txt output-
 5
 6# Sort 12 individual chunks in parallel
 7{
 8    ./sort -f output-00
 9    ./sort -f output-01
10    ./sort -f output-02
11} &
12{
13    ./sort -f output-03
14    ./sort -f output-04
15    ./sort -f output-05
16} &
17{
18    ./sort -f output-06
19    ./sort -f output-07
20    ./sort -f output-08
21} &
22{
23    ./sort -f output-09
24    ./sort -f output-10
25    ./sort -f output-11
26} &
27wait
28
29# 12-way merge of all the individual chunks
30./merge -o output.txt -f output-*
31
32# Remove chunks
33rm output-*

The script takes 28 minutes to run.

$ time ./sort.sh
2809.45s user 126.41s system 172% cpu 28:17.11 total

By comparison, here’s the summary of individual parts when I ran them one at a time.

  1. Splitting the input file takes 43 seconds
  2. Sorting each chunk in parallel takes 11 minutes
  3. Merging all chunks takes 19 minutes

All in all, my shell and C++ implementation took 30 minutes to read, sort and write a billion numbers. Which is close to the GNU sort program of 40 minutes. Of course, this is limited to only integers, but I’m quite happy it was that close. If you’re wondering why any of these took so long, it’s because my MacBook is constantly over 90 degrees when running them – and getting throttled.

Thanks to the GNU sort program, we also have a correctness checker, since we can do a diff to find out if the outputs are identical.

$ diff -sq output-gsort.txt output-my-sort.txt
Files output-gsort and output-my-sort.txt are identical

The C++ code for these is available at github.com/nicluo/sorting-numbers.