Sophisticated processing of huge, complex datasets requires us to rethink the relationship between disk-based storage and main-memory processing. Several features of modern systems—namely, 64-bit architectures, memory-mapped sparse files, virtual memory, and copy-on-write support—let us process our data with readable and efficient RAM-based algorithms, using slow disks and file systems only for their large capacity and for data persistence. The author demonstrates this approach through a short C++ program that finds shortest paths on very large graphs, such as Wikipedia's link graph. Although RAM-based processing reopens many problems that database systems already solve, the author believes it is the right move because it gives us a unified programming and performance model for all our data operations, irrespective of where the data resides.
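To make the approach concrete, here is a minimal sketch (not the author's actual program) of its core building block: a memory-mapped sparse file whose pages the operating system materializes and persists on demand, so the algorithm sees an ordinary in-memory array. The file name, sizing, and record layout are assumptions chosen for illustration.

```cpp
// Sketch: back a large array with a memory-mapped sparse file, so the
// OS pages data in and out on demand while the algorithm sees plain RAM.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main()
{
    // Hypothetical sizing: room for 100 million 64-bit node records.
    const size_t NODES = 100'000'000;
    const size_t BYTES = NODES * sizeof(uint64_t);

    int fd = open("graph.dat", O_RDWR | O_CREAT, 0644);  // illustrative file name
    if (fd == -1) { perror("open"); return EXIT_FAILURE; }

    // Extending the file with ftruncate yields a sparse file: unwritten
    // regions occupy no disk blocks until a page is first touched.
    if (ftruncate(fd, BYTES) == -1) { perror("ftruncate"); return EXIT_FAILURE; }

    // Map the file into the 64-bit address space; MAP_SHARED makes the
    // kernel write dirtied pages back to disk, securing persistence.
    uint64_t *nodes = static_cast<uint64_t *>(
        mmap(nullptr, BYTES, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (nodes == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    // A graph algorithm now uses ordinary array indexing; virtual memory
    // fetches only the pages it actually visits.
    nodes[42] = 7;  // touching a page allocates and dirties it
    printf("node 42 -> %llu\n", (unsigned long long)nodes[42]);

    munmap(nodes, BYTES);
    close(fd);
    return EXIT_SUCCESS;
}
```

With this foundation, a shortest-path routine such as breadth-first search can treat the mapped region exactly like heap memory, which is what lets disk-resident data be processed with a RAM-based algorithm's readability and performance model.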