class MemoryFile: memory mapped file I/O for mere mortals
In UNIX there is the
mmap() system call for mapping a file into virtual
memory. It allows you to access the file data, as stored on disk, via a
pointer—a memory address. The operating system is working behind the
scenes to make this possible; the virtual memory management system performs
demand paging, loading file data as page faults occur. The very same
operating system code that handles ‘swap’ memory is made available to
user programs through the
mmap() system call.
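For comparison, here is a minimal sketch of what mapping a file looks like from Python, assuming the standard library's mmap module (the post itself does not say which binding was used). The helper name first_bytes_via_mmap and the throwaway demo file are made up for illustration.

```python
import mmap
import os
import tempfile


def first_bytes_via_mmap(path, count):
    """Return the first `count` bytes of `path`, read through a memory map."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # Slicing the map copies the requested bytes out of it.
            return mm[:count]
        finally:
            mm.close()


# Demonstration on a small temporary file (hypothetical content).
fd, demo_path = tempfile.mkstemp()
os.write(fd, b'hello, mmap')
os.close(fd)
demo_bytes = first_bytes_via_mmap(demo_path, 5)
os.remove(demo_path)
```

Note that mapping an empty file raises an error, so a real program would want to check the file size first.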
I am making a file viewer and
mmap() seemed like a good idea. However,
when I mapped a 4 GB file my computer started having a hard time freeing
up memory, allocating swap, and what not. Apparently it was trying to rack
up a contiguous space of 4 gigs. After about 20 seconds it was done and
good to go, but this behavior I did not expect. And what if we were to
try to load a ten gig file? Twenty, forty, a hundred? We are going to
need something better, and this is where class
MemoryFile comes in.
Basically, the class
MemoryFile is an array class that is backed by a file.
When you access the array, it seeks in the file and loads in a small portion
into a buffer. So in a way, it’s doing demand paging, but since it’s reusing
the same small buffer all the time, it has a really small memory footprint,
even if you decide to page through the entire file.
Just for safety, the implementation shown is read-only. You can adapt the code below to implement read-write, swap file-like behavior.
    import os


    class MemoryFile(object):
        '''array-like object, backed by on-disk file'''

        def __init__(self, filename=None, pagesize=64*1024):
            '''initialize'''
            self.filename = filename
            self.pagesize = pagesize
            self.filesize = 0
            self.fd = None
            self.data = None
            self.page_addr = -pagesize
            if filename is not None:
                self.open_file(filename)
We will be reading pages of 64 KiB. We will keep the file size so that
we know what the end of the virtual memory block will be.
We will store a loaded page into self.data, which will be a
bytearray object holding the data. The page_addr is a virtual address;
it really reflects the file position for the loaded page. We initialize
page_addr to the negative value -pagesize, so that no valid index falls
inside the loaded range, which is to say “we haven’t loaded any data.”
        def open_file(self, filename=None):
            '''open file'''
            if filename is None:
                filename = self.filename
            else:
                self.filename = filename
            self.filesize = os.path.getsize(filename)
            # binary mode: we want raw bytes, not decoded text
            self.fd = open(filename, 'rb')

This only opens the file. No data is being loaded just yet.
Note: The file is opened in binary mode with open(filename, 'rb'). On the Windows platform this is always required. On UNIX, Python 2 makes no such distinction, but Python 3 does on every platform: text mode decodes the data and returns str instead of raw bytes.
Now things get interesting.
        def __getitem__(self, idx):
            '''Returns the byte at index'''
            if not isinstance(idx, int):
                raise TypeError('index must be an integer')
            if idx < 0 or idx >= self.filesize:
                raise IndexError('index out of bounds')
            # the page holding idx is not loaded: simulate a page fault
            if idx < self.page_addr or idx >= self.page_addr + self.pagesize:
                self.pagefault(idx)
            return self.data[idx - self.page_addr]
We implement __getitem__() so that a MemoryFile object may be indexed like
an array. It is Python’s way of defining the subscript operator [].
If the index is out of bounds, raise (or throw) an exception. If the index is valid, but we haven’t loaded that particular page, issue a page fault. Now, of course, in an operating system a page fault is a hardware generated interrupt; here, we simulate it in software and just call a subroutine.
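        def pagefault(self, idx):
            '''load the page for index'''
            # round down to the start of the page that holds idx
            self.page_addr = idx - (idx % self.pagesize)
            self.fd.seek(self.page_addr, os.SEEK_SET)
            # the last page may be shorter than pagesize
            self.data = bytearray(self.fd.read(self.pagesize))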
Let page_addr be the address that is the start of the page that
holds the index that we were trying to access. We get that address by
rounding down to the nearest multiple of pagesize.
Next, load that data. That data may be shorter than pagesize if
we have reached EOF. This is not a problem because we already checked
for overflowing filesize earlier. Still, you may want to pad it with zeroes.
For swap files, this is usually not an issue because they tend to be clean
multiples of the page size.
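To make the rounding and the optional zero padding concrete, here is a small standalone sketch. The helper names page_start and pad_page are hypothetical and not part of the class; they only illustrate the arithmetic for the default 64 KiB page size.

```python
def page_start(idx, pagesize):
    """Round idx down to the nearest multiple of pagesize."""
    return idx - (idx % pagesize)


def pad_page(data, pagesize):
    """Return data as a bytearray, zero-padded up to pagesize bytes."""
    page = bytearray(data)
    # A short read at EOF leaves the page incomplete; fill the rest with zeroes.
    page.extend(b'\x00' * (pagesize - len(page)))
    return page


PAGESIZE = 64 * 1024

# Page start addresses for a few sample indices.
examples = [
    page_start(0, PAGESIZE),        # first byte of the file
    page_start(65535, PAGESIZE),    # last byte of the first page
    page_start(65536, PAGESIZE),    # first byte of the second page
    page_start(200000, PAGESIZE),   # somewhere inside the fourth page
]
print(examples)                     # [0, 0, 65536, 196608]
```

With padding in place, every loaded page has the same length, which keeps any later bounds arithmetic on self.data simple.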
That’s it! We can now access the file as if it were an array.
    memfile = MemoryFile('testfile.dat')
    for i in range(0, 16):
        print('byte:', memfile[i])
Although this example looks incredibly underwhelming, we can do some cool things with this. And the best part is, we’re not using much memory, not even if the file is very, very large.
Points for improvement:
- Add a close() method that closes the file
- Implement method __len__() to return the file size. This allows you to use len() on the object
- Extend __getitem__() so that you can do slicing
- Change pagefault() so that it always caches the area ‘around’ idx. This seems convenient for programs like a file viewer in which the user may scroll backwards
- Add Python context manager methods for using the with statement
- Do the same thing in C++
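Several of these points could be sketched as follows. This is one possible take, not the definitive implementation: it restates the class compactly so the block runs on its own, assumes Python 3 style binary I/O, and introduces a private helper _byte_at that is not in the original code. The ‘around idx’ caching and the C++ port are left out.

```python
import os


class MemoryFile(object):
    """Read-only, array-like view of a file, with a few of the extras."""

    def __init__(self, filename, pagesize=64 * 1024):
        self.pagesize = pagesize
        self.filesize = os.path.getsize(filename)
        self.fd = open(filename, 'rb')
        self.data = None
        self.page_addr = -pagesize      # no page loaded yet

    def close(self):
        """Close the underlying file; safe to call more than once."""
        if self.fd is not None:
            self.fd.close()
            self.fd = None

    # Context manager methods, so the object works in a with statement.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False                    # do not swallow exceptions

    def __len__(self):
        return self.filesize

    def pagefault(self, idx):
        """Load the page that holds idx."""
        self.page_addr = idx - (idx % self.pagesize)
        self.fd.seek(self.page_addr, os.SEEK_SET)
        self.data = bytearray(self.fd.read(self.pagesize))

    def _byte_at(self, idx):
        # Hypothetical helper: idx must already be within bounds.
        if idx < self.page_addr or idx >= self.page_addr + self.pagesize:
            self.pagefault(idx)
        return self.data[idx - self.page_addr]

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            # slice.indices() clamps start, stop and step to the file size.
            return bytearray(self._byte_at(i)
                             for i in range(*idx.indices(self.filesize)))
        if not isinstance(idx, int):
            raise TypeError('index must be an int or a slice')
        if idx < 0 or idx >= self.filesize:
            raise IndexError('index out of bounds')
        return self._byte_at(idx)
```

Usage would then look like `with MemoryFile('testfile.dat') as memfile: header = memfile[:16]`. The slice is assembled byte by byte, which keeps the sketch short but is slow for large ranges; copying whole page segments at a time would be the obvious next step.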