class MemoryFile: memory mapped file I/O for mere mortals
In UNIX there is the mmap() system call for mapping a file into virtual
memory. It allows you to access the file data, as stored on disk, via a
pointer, that is, a memory address. The operating system is working behind the
scenes to make this possible; the virtual memory management system performs
demand paging, loading file data as page faults occur. The very same
operating system code that handles ‘swap’ memory is made available to
user programs through the mmap() system call.
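For reference, Python exposes the same system call through its standard mmap module. Below is a minimal sketch, assuming Python 3 and a throwaway scratch file created only for the demonstration; it maps the file read-only and indexes it like a byte array.

```python
import mmap
import os
import tempfile

# Create a small scratch file to map (mmap cannot map an empty file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'hello, mmap')
    path = tmp.name

with open(path, 'rb') as f:
    # Length 0 means "map the whole file"; ACCESS_READ makes it read-only.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        first_byte = mapped[0]   # integer value of the first byte
        head = mapped[0:5]       # slicing returns a bytes object
        size = len(mapped)
    finally:
        mapped.close()

os.remove(path)
print(first_byte, head, size)
```

Nothing is read up front here; the operating system pages the data in as the mapping is touched.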
I am making a file viewer and mmap() seemed like a good idea. However,
when I mapped a 4 GB file my computer started having a hard time freeing
up memory, allocating swap, and whatnot. Apparently it was trying to rack
up a contiguous space of 4 gigs. After about 20 seconds it was done and
good to go, but I did not expect this behavior. And what if we were to
try to load a ten gig file? Twenty, forty, a hundred? We are going to
need something better, and this is where class MemoryFile comes in.
Basically, the class MemoryFile is an array class that is backed by a file.
When you access the array, it seeks in the file and loads a small portion
into a buffer. So in a way, it’s doing demand paging, but since it’s reusing
the same small buffer all the time, it has a really small memory footprint,
even if you decide to page through the entire file.
Just for safety, the implementation shown is read-only. You can adapt the
code below to implement read-write, swap file-like behavior.
import os

class MemoryFile(object):
    '''array-like object, backed by on-disk file'''

    def __init__(self, filename=None, pagesize=64*1024):
        '''initialize'''
        self.filename = filename
        self.pagesize = pagesize
        self.filesize = 0
        self.fd = None
        self.data = None
        # negative address: no page has been loaded yet
        self.page_addr = -pagesize
        if filename is not None:
            self.open_file(filename)
We will be reading pages of 64 KiB. We will keep the file size so that
we know where the end of the virtual memory block is.
We will store the loaded page in self.data, which will hold the bytes
returned by read(). The page_addr is a virtual address;
it really reflects the file position of the loaded page. We initialize
page_addr to -pagesize. The "loaded" range then runs from -pagesize up to,
but not including, 0, so no valid index can fall inside it, which is to say
“we haven’t loaded any data.”
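To see why -pagesize works as a "nothing loaded" marker, the page-hit test that the class uses later can be tried on its own. This is only a small sketch; the helper name page_is_loaded is made up for illustration and is not part of the class.

```python
PAGESIZE = 64 * 1024

def page_is_loaded(idx, page_addr, pagesize=PAGESIZE):
    """Return True if idx falls inside the currently loaded page."""
    return page_addr <= idx < page_addr + pagesize

# With the initial sentinel the "loaded" range is [-PAGESIZE, 0),
# which contains no valid (non-negative) index.
initial = -PAGESIZE
hits = [page_is_loaded(i, initial) for i in (0, 1, PAGESIZE - 1, PAGESIZE)]
print(hits)
```

Every entry comes out False, so the very first access always triggers a page load.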
    def open_file(self, filename=None):
        '''open file'''
        if filename is None:
            filename = self.filename
        else:
            self.filename = filename
        self.filesize = os.path.getsize(filename)
        # binary mode, so that read() returns raw bytes on every platform
        self.fd = open(filename, 'rb')
This only opens the file. No data is being loaded just yet.
Note: the file is opened in binary mode with open(filename, 'rb').
Python 2 on UNIX makes no distinction between text and binary mode, but on
the Windows platform, and on every platform under Python 3, binary mode is
needed to get the raw bytes.
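A quick way to see what binary mode gives you, assuming Python 3 (the scratch file is created only for this demonstration): read() returns a bytes object, and indexing or iterating it yields integers from 0 to 255.

```python
import os
import tempfile

# Write four known bytes to a scratch file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\x00\xffAB')
    path = tmp.name

size = os.path.getsize(path)      # file size in bytes, as used by open_file()

with open(path, 'rb') as f:
    chunk = f.read(4)

os.remove(path)

kind = type(chunk).__name__       # 'bytes' under Python 3
values = list(chunk)              # each element is an int
print(size, kind, values)
```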
Now things get interesting.
    def __getitem__(self, idx):
        '''Returns byte at index'''
        if not isinstance(idx, int):
            # slices and other index types are not supported here
            raise TypeError('index must be an integer')
        if idx < 0 or idx >= self.filesize:
            raise IndexError('index out of bounds')
        if idx < self.page_addr or idx >= self.page_addr + self.pagesize:
            self.pagefault(idx)
        return self.data[idx - self.page_addr]
We define __getitem__() so that a MemoryFile object may be indexed like
an array. It is Python’s way of defining operator[](int).
If the index is out of bounds, raise (or throw) an exception. If the index
is valid, but we haven’t loaded that particular page, issue a page fault.
Now, of course, in an operating system a page fault is a hardware generated
interrupt; here, we simulate it in software and just call a subroutine.
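To make the cost of a simulated page fault visible, here is a small sketch that counts them. Everything in it is an assumption made for illustration: the class name CountingPager, the tiny page size of 4, and the use of io.BytesIO instead of a real file. It also inlines the page load rather than calling a separate subroutine.

```python
import io

class CountingPager:
    """Hypothetical stand-in for MemoryFile that counts page faults."""

    def __init__(self, stream, size, pagesize=4):
        self.stream = stream
        self.size = size
        self.pagesize = pagesize
        self.page_addr = -pagesize    # nothing loaded yet
        self.data = b''
        self.faults = 0

    def __getitem__(self, idx):
        if not isinstance(idx, int):
            raise TypeError('index must be an integer')
        if idx < 0 or idx >= self.size:
            raise IndexError('index out of bounds')
        if idx < self.page_addr or idx >= self.page_addr + self.pagesize:
            # simulated page fault: load the page that contains idx
            self.faults += 1
            self.page_addr = idx - (idx % self.pagesize)
            self.stream.seek(self.page_addr)
            self.data = self.stream.read(self.pagesize)
        return self.data[idx - self.page_addr]

payload = bytes(range(10))
pager = CountingPager(io.BytesIO(payload), len(payload))

values = [pager[i] for i in range(10)]    # sequential scan
sequential_faults = pager.faults

for idx in (0, 9, 0):                     # jump back and forth
    pager[idx]
pingpong_faults = pager.faults - sequential_faults

print(sequential_faults, pingpong_faults)
```

A sequential scan faults once per page, but jumping between distant offsets faults on every access, because there is only one buffer.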
    def pagefault(self, idx):
        '''load the page for index'''
        self.page_addr = idx - (idx % self.pagesize)
        self.fd.seek(self.page_addr, os.SEEK_SET)
        self.data = self.fd.read(self.pagesize)
First, let page_addr be the address of the start of the page that
holds the index we were trying to access. We get that address by
rounding down to the nearest multiple of pagesize.
Next, load that data. The data may be shorter than pagesize if
we have reached EOF. This is not a problem because we already checked
the index against filesize earlier. Still, you may want to pad it with zeroes.
For swap files, this is usually not an issue because they tend to be clean
multiples of the page size.
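The rounding and the optional zero padding can be checked in isolation. This is a sketch only; the helper names page_start and pad_page are made up here and do not appear in the class.

```python
def page_start(idx, pagesize):
    """Round idx down to the nearest multiple of pagesize."""
    return idx - (idx % pagesize)

def pad_page(data, pagesize):
    """Zero-pad a short (end-of-file) page to the full page size."""
    return data.ljust(pagesize, b'\x00')

pagesize = 64 * 1024
starts = [page_start(i, pagesize) for i in (0, 65535, 65536, 200000)]
padded = pad_page(b'tail', 8)

print(starts)
print(padded)
```

Indexes 0 and 65535 share the first page, 65536 starts the second one, and 200000 lands in the page that begins at 196608 (3 × 65536).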
That’s it! We can now access the file as if it were an array.
# testfile.dat must hold at least 16 bytes
memfile = MemoryFile('testfile.dat')
for i in range(16):
    print('byte:', memfile[i])
Although this example looks incredibly underwhelming, we can do some cool things with this. And the best part is, we’re not using much memory, not even if the file is very, very large.
Points for improvement:
- Add a close() method that closes the file
- Implement method __len__() to return the file size. This allows you to use len() on the MemoryFile
- Adapt __getitem__() so that you can do slicing
- Adapt pagefault() so that it always caches the area ‘around’ idx. This seems convenient for programs like a file viewer in which the user may scroll backwards
- Add Python context manager methods for using the with statement
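One possible way to tackle these points, as a rough sketch rather than a finished implementation. It assumes Python 3. The class name PagedFile is made up to keep it apart from the original. It also takes two liberties: negative indexes count from the end, as with lists, and slices are resolved one byte at a time, which is simple but slow.

```python
import os
import tempfile

class PagedFile:
    """Read-only, array-like view of a file (illustrative variant)."""

    def __init__(self, filename, pagesize=64 * 1024):
        self.pagesize = pagesize
        self.filesize = os.path.getsize(filename)
        self.fd = open(filename, 'rb')
        self.data = b''
        self.page_addr = -pagesize      # nothing loaded yet

    def close(self):
        if self.fd is not None:
            self.fd.close()
            self.fd = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False                    # do not swallow exceptions

    def __len__(self):
        return self.filesize

    def pagefault(self, idx):
        # Centre the window on idx so bytes on both sides are cached.
        self.page_addr = max(0, idx - self.pagesize // 2)
        self.fd.seek(self.page_addr, os.SEEK_SET)
        self.data = self.fd.read(self.pagesize)

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            indexes = range(*idx.indices(self.filesize))
            return bytes(self[i] for i in indexes)
        if not isinstance(idx, int):
            raise TypeError('index must be an int or a slice')
        if idx < 0:
            idx += self.filesize        # negative index counts from the end
        if idx < 0 or idx >= self.filesize:
            raise IndexError('index out of bounds')
        if idx < self.page_addr or idx >= self.page_addr + self.pagesize:
            self.pagefault(idx)
        return self.data[idx - self.page_addr]

# Demonstration on a scratch file holding the bytes 0..99.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(100)))
    path = tmp.name

with PagedFile(path, pagesize=16) as pf:
    length = len(pf)
    chunk = pf[10:20]
    last = pf[-1]
    stepped = pf[0:10:3]

os.remove(path)
print(length, list(chunk), last, list(stepped))
```

Centring the window is only one way to cache the area around idx; keeping two or more pages in memory would be another.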
Bonus points:
- do the same thing in C++