Sunday, March 20, 2016

class MemoryFile: memory mapped file I/O for mere mortals

In UNIX there is the mmap() system call for mapping a file into virtual memory. It allows you to access the file data, as stored on disk, via a pointer—a memory address. The operating system is working behind the scenes to make this possible; the virtual memory management system performs demand paging, loading file data as page faults occur. The very same operating system code that handles ‘swap’ memory is made available to user programs through the mmap() system call.
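For comparison, Python exposes the same facility through its standard mmap module. Here is a minimal sketch, written for Python 3; the temporary file and its contents are made up so the example is self-contained.

```python
import mmap
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'hello, mmap')
    path = tmp.name

with open(path, 'rb') as f:
    # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        first_byte = mm[0]    # indexing yields an int in Python 3
        chunk = mm[7:11]      # slicing yields bytes
    finally:
        mm.close()

os.remove(path)
print(first_byte, chunk)
```

The mapped object is indexed and sliced just like a bytes object, while the operating system decides which parts of the file are actually resident in memory.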

I am making a file viewer and mmap() seemed like a good idea. However, when I mapped a 4 GB file my computer started having a hard time freeing up memory, allocating swap, and whatnot. Apparently it was trying to rack up a contiguous space of 4 gigs. After about 20 seconds it was done and good to go, but I did not expect this behavior. And what if we were to try to load a ten gig file? Twenty, forty, a hundred? We are going to need something better, and this is where class MemoryFile comes in.

Basically, the class MemoryFile is an array class that is backed by a file. When you access the array, it seeks in the file and loads a small portion into a buffer. So in a way, it’s doing demand paging, but since it’s reusing the same small buffer all the time, it has a really small memory footprint, even if you decide to page through the entire file.
Just for safety, the implementation shown is read-only. You can adapt the code below to implement read-write, swap file-like behavior.

import os

class MemoryFile(object):
    '''array-like object, backed by on-disk file'''

    def __init__(self, filename=None, pagesize=64*1024):
        '''initialize'''

        self.filename = filename
        self.pagesize = pagesize
        self.filesize = 0
        self.fd = None
        self.data = None
        self.page_addr = -pagesize

        if filename is not None:
            self.open_file(filename)

We will be reading pages of 64 KiB. We keep the file size so that we know where the virtual memory block ends. We store a loaded page in self.data, which holds the raw bytes exactly as returned by read(). The page_addr is a virtual address; it really reflects the file position of the loaded page. We initialize page_addr to -pagesize, so that the loaded range, from page_addr up to page_addr + pagesize, contains no valid index, which is to say “we haven’t loaded any data.”
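The effect of the initial page_addr can be checked with the same window test that __getitem__() uses further down. This is only an illustration written for Python 3; the helper name in_loaded_page is made up here and is not part of the class.

```python
PAGESIZE = 64 * 1024

def in_loaded_page(idx, page_addr, pagesize=PAGESIZE):
    '''True if idx lies inside the currently loaded page window.'''
    return page_addr <= idx < page_addr + pagesize

# Before any load, page_addr is -pagesize: no non-negative index can hit.
initial = -PAGESIZE
misses = [in_loaded_page(i, initial) for i in (0, 1, PAGESIZE - 1, PAGESIZE)]
print(misses)

# Once the first page is loaded (page_addr == 0), indexes 0..PAGESIZE-1 hit.
print(in_loaded_page(0, 0), in_loaded_page(PAGESIZE, 0))
```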

    def open_file(self, filename=None):
        '''open file'''

        if filename is None:
            filename = self.filename
        else:
            self.filename = filename
        self.filesize = os.path.getsize(filename)
        self.fd = open(filename, 'rb')

This only opens the file. No data is being loaded just yet.

Note: The file is opened in binary mode ('rb'). On the Windows platform this is required to read the raw bytes unaltered. Under Python 2, UNIX makes no such distinction, and the flag is harmless there.

Now things get interesting.

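    def __getitem__(self, idx):
        '''return the byte at index'''

        if not isinstance(idx, int):
            # without this check a non-int index would silently return None
            raise TypeError('index must be an integer')

        if idx < 0 or idx >= self.filesize:
            raise IndexError('index out of bounds')

        # page fault if idx lies outside the currently loaded page
        if idx < self.page_addr or idx >= self.page_addr + self.pagesize:
            self.pagefault(idx)

        return self.data[idx - self.page_addr]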

We define __getitem__() so that a MemoryFile object may be indexed like an array. It is Python’s way of defining operator[](int).
If the index is out of bounds, raise (or throw) an exception. If the index is valid, but we haven’t loaded that particular page, issue a page fault. Now, of course, in an operating system a page fault is a hardware generated interrupt; here, we simulate it in software and just call a subroutine.

    def pagefault(self, idx):
        '''load the page for index'''

        self.page_addr = idx - (idx % self.pagesize)
        self.fd.seek(self.page_addr, os.SEEK_SET)
        self.data = self.fd.read(self.pagesize)

First, let page_addr be the address that is the start of the page that holds the index that we were trying to access. We get that address by rounding down to the nearest multiple of pagesize.
Next, load that data. The data may be shorter than pagesize if we have reached EOF. This is not a problem because __getitem__() has already rejected any index at or beyond filesize. Still, you may want to pad it with zeroes. For swap files, this is usually not an issue because they tend to be clean multiples of the page size.
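As a worked example of the rounding, and of one possible way to do the optional zero padding, here is a small sketch written for Python 3. The helpers page_start and read_page are hypothetical names for this illustration, and an in-memory io.BytesIO stands in for a real file.

```python
import io

PAGESIZE = 64 * 1024

def page_start(idx, pagesize=PAGESIZE):
    '''Round idx down to the nearest multiple of pagesize.'''
    return idx - (idx % pagesize)

def read_page(f, idx, pagesize=PAGESIZE):
    '''Read the page containing idx, zero-padded to a full page.'''
    addr = page_start(idx, pagesize)
    f.seek(addr)
    data = f.read(pagesize)
    # ljust() pads a short read (the last page of the file) with NUL bytes.
    return addr, data.ljust(pagesize, b'\0')

print(page_start(200000))    # 3 * 65536 = 196608

# A 70000-byte in-memory "file": its second page holds only 4464 real bytes.
f = io.BytesIO(b'\xff' * 70000)
addr, page = read_page(f, 69999)
print(addr, len(page), page[4463], page[4464])
```

With padding, every cached page has the same length, at the cost of making bytes beyond EOF look like zeroes, so the bounds check against filesize is still needed.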

That’s it! We can now access the file as if it were an array.

memfile = MemoryFile('testfile.dat')
for i in xrange(0, 16):
    print 'byte:', memfile[i]

Although this example looks incredibly underwhelming, we can do some cool things with this. And the best part is, we’re not using much memory, not even if the file is very, very large.

Points for improvement:

Add a close() method that closes the file
Implement method __len__() to return the file size. This allows you to use len() on the MemoryFile
Adapt __getitem__() so that you can do slicing
Adapt pagefault() so that it always caches the area ‘around’ idx. This seems convenient for programs like a file viewer in which the user may scroll backwards
Add Python context manager methods for using the with statement
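The points above could be approached roughly as follows. This is a sketch under assumptions, not a finished implementation: it is written for Python 3, the class name PagedFile is made up to keep it apart from MemoryFile, the slicing is the simplest thing that works (byte by byte), and the cached window is merely centered on the requested index.

```python
import os
import tempfile


class PagedFile:
    '''Read-only, array-like view of a file with one cached window.'''

    def __init__(self, filename, pagesize=64 * 1024):
        self.pagesize = pagesize
        self.filesize = os.path.getsize(filename)
        self.fd = open(filename, 'rb')
        self.data = b''          # empty window: the first access always faults
        self.page_addr = 0

    def close(self):
        '''Close the underlying file; safe to call more than once.'''
        if self.fd is not None:
            self.fd.close()
            self.fd = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

    def __len__(self):
        return self.filesize

    def pagefault(self, idx):
        '''Cache a window centered on idx, clamped at the start of the file.'''
        self.page_addr = max(0, idx - self.pagesize // 2)
        self.fd.seek(self.page_addr)
        self.data = self.fd.read(self.pagesize)

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            # Simple but slow: resolve the slice and fetch byte by byte.
            return bytes(self[i] for i in range(*idx.indices(self.filesize)))
        if not isinstance(idx, int):
            raise TypeError('index must be an int or a slice')
        if idx < 0 or idx >= self.filesize:
            raise IndexError('index out of bounds')
        if not (self.page_addr <= idx < self.page_addr + len(self.data)):
            self.pagefault(idx)
        return self.data[idx - self.page_addr]


# Demonstration on a made-up 1024-byte file with a tiny 64-byte window.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4)
    path = tmp.name

with PagedFile(path, pagesize=64) as pf:
    size = len(pf)
    head = pf[0:4]
    one = pf[300]
    back = pf[290]    # inside the window cached around 300, so no new read
os.remove(path)
print(size, head, one, back)
```

Comparing against len(self.data) instead of pagesize removes the need for a sentinel page_addr and copes with a short final read. A faster slice implementation would copy whole chunks out of the cached window rather than going through one index at a time.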

Bonus points:

do the same thing in C++