Reverse-engineering MAT-files
Part 1: Introduction and toy examples
MATLAB is a programming language best known for treating matrices and arrays as its primary data structure. Owing to this, as well as to its renowned ease of use for non-programmers, MATLAB is widely used in the engineering domain.
At my company, there has been a recent push away from MATLAB in favour of open-source alternatives. A lot of work has been put into migrating the existing MATLAB infrastructure, and many hurdles have been encountered in the process. One such hurdle is the way that MATLAB stores data, and it’s the hurdle that I’d like to talk about today.
Since MATLAB is the lingua franca among engineers, much of the data that we work with gets stored and exchanged in so-called MAT-files. A MAT-file is MATLAB’s way of storing data. When working with MATLAB you can, at any point, save the variables that you currently have in memory into a MAT-file. You or your friend can then, at a later point in time, open this MAT-file to reconstruct your past workspace. Owing to MATLAB’s excellent dedication to backward compatibility, reading MAT-files is extremely robust, and most likely the MAT-files you created twenty years ago can still be read today.
The problem with MAT-files is that they rely on a proprietary format. As such, it is very difficult to parse a MAT-file without a MATLAB license. As my company aims to move away from MATLAB, it is vital that we are able to convert the contents of our MAT-files into a format that other languages can handle.
To the best of my knowledge, open-source options for handling MAT-files are limited. There are definitely some tools here and there, but they are usually limited in scope. Notably, although simple arrays and structs can typically be parsed, most of these tools fall short of parsing objects. This is because MAT-files encode objects in a rather peculiar manner which is difficult to reverse-engineer.
I felt that it would be a fun challenge to try and reverse-engineer MATLAB’s data format. In the next couple of posts I’ll share with you what I have learned so far.
Note: I will be focussing entirely on the v7.3 version of MAT-files, which is MATLAB’s most modern and most proprietary format. That said, some of the key concepts will carry over to older versions.
What is HDF5?
MAT v7.3 builds on the HDF5 format, so it’d serve us well to dwell on this first. I’ll keep this as brief as I can.
HDF5 is a file format designed for storing large amounts of data. An HDF5 file is basically a filesystem in a file, but with a few crucial differences:
- Directories are called groups and files are called datasets.
- Datasets can’t contain arbitrary data; rather, they are (possibly multidimensional) rectangular arrays containing homogeneous data.
- Groups and datasets can have metadata in the form of attributes.
- Unlike an ordinary filesystem, there can be multiple paths leading to the same dataset. This can be helpful for bookkeeping. Suppose you want to do multiple versions of an experiment over multiple days. A traditional filesystem would force you to either sort your data ‘by day’ or sort it ‘by experiment version’. Non-unique paths in HDF5 allow you to do both at once. We will not use this feature.
Some toy MAT-files
We start with a couple of toy examples to get our feet wet.
>> x = 9001;
>> save('double.mat', 'x', '-v7.3');
After creating a variable x, we use the save function to save this variable in a MAT-file. We explicitly specify that we want to store it in the v7.3 format.
We are going to use the Julia wrapper HDF5.jl to analyse our MAT-files.
julia> using HDF5
julia> h = h5open("double.mat")
Let's see the output.
🗂️ HDF5.File: (read-only) double.mat
└─ 🔢 x
   └─ 🏷️ MATLAB_class
Our HDF5 🗂️ file has a single 🔢 dataset /x, which in turn has an 🏷️ attribute MATLAB_class. We read out the contents and the attribute as follows.
julia> read_attribute(h["x"], "MATLAB_class")
"double"
julia> read(h["x"])
1×1 Matrix{Float64}:
 9001.0
So there you have it, the value we were looking for. In typical MATLAB fashion, it is comfortably wrapped in a 1-by-1 matrix.
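If a plain scalar is more convenient on the Julia side, the 1-by-1 wrapper can be stripped after reading. A minimal sketch, assuming we only want to unwrap arrays with exactly one element; `unwrap_scalar` is a hypothetical helper name and not part of HDF5.jl:

```julia
# Hypothetical helper (not part of HDF5.jl): turn a one-element array,
# such as the 1×1 matrix read from /x, into a plain scalar.
# Arrays with any other number of elements are returned unchanged.
unwrap_scalar(a::AbstractArray) = length(a) == 1 ? only(a) : a

x = fill(9001.0, 1, 1)   # stand-in for read(h["x"])
println(unwrap_scalar(x))
```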
Here’s another toy example.
>> x = 'Foobar';
>> save('char.mat', 'x', '-v7.3');
We open it with HDF5.jl and obtain the following:
julia> h = h5open("char.mat")
🗂️ HDF5.File: (read-only) char.mat
└─ 🔢 x
   ├─ 🏷️ MATLAB_class
   └─ 🏷️ MATLAB_int_decode
We read out /x in the same way as before:
julia> read(h["x"])
1×6 Matrix{UInt16}:
0x0046 0x006f 0x006f 0x0062 0x0061 0x0072
These numbers are simply the ASCII codes of 'Foobar', each stored as a 16-bit integer. As for the 🏷️ attributes, the MATLAB_class attribute reads out as "char", but now we also find a mysterious MATLAB_int_decode attribute, which returns 2.
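To get an actual string back in Julia, the 16-bit values can be transcoded. A minimal sketch, assuming the values are UTF-16 code units (which holds for the plain ASCII text here, but is an assumption about the format in general) and that the array is a single row; `decode_chars` is a hypothetical helper name:

```julia
# Stand-in for read(h["x"]): the 1×6 matrix shown above.
codes = UInt16[0x0046 0x006f 0x006f 0x0062 0x0061 0x0072]

# Hypothetical helper: flatten the matrix and interpret the values as
# UTF-16 code units (an assumption; plain ASCII text satisfies it).
decode_chars(codes::AbstractArray{UInt16}) = transcode(String, vec(codes))

println(decode_chars(codes))   # prints Foobar
```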
Now for one more toy example. We ramp up the complexity ever so slightly.
>> x = struct();
>> x.Foo = 9001;
>> x.Bar = 9002;
>> save('struct.mat', 'x', '-v7.3');
julia> h = h5open("struct.mat")
🗂️ HDF5.File: (read-only) struct.mat
├─ 📂 #refs#
│  └─ 🔢 a
│     ├─ 🏷️ MATLAB_class
│     └─ 🏷️ MATLAB_empty
└─ 📂 x
   ├─ 🏷️ MATLAB_class
   ├─ 🏷️ MATLAB_fields
   ├─ 🔢 Bar
   │  ├─ 🏷️ H5PATH
   │  └─ 🏷️ MATLAB_class
   └─ 🔢 Foo
      ├─ 🏷️ H5PATH
      └─ 🏷️ MATLAB_class
That already looks a lot scarier, but we easily recognise our fields Foo and Bar. The 📂 indicates that /x is now a group rather than a dataset. In turn, /x contains the 🔢 datasets /x/Foo and /x/Bar which inevitably have the numbers 9001 and 9002 hidden within them.
julia> read(h["x"]["Foo"])
1×1 Matrix{Float64}:
 9001.0
julia> read(h["x"]["Bar"])
1×1 Matrix{Float64}:
 9002.0
Let’s also read out some of the attributes to get a better feel for the metadata.
julia> read_attribute(h["x"], "MATLAB_class")
"struct"
julia> read_attribute(h["x"], "MATLAB_fields")
2-element Vector{Vector{String}}:
 ["F", "o", "o"]
 ["B", "a", "r"]
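The field names come back as vectors of single-character strings, so they have to be joined before they are of much use. A minimal sketch, using a literal copy of the value shown above in place of the attribute read:

```julia
# Stand-in for read_attribute(h["x"], "MATLAB_fields").
raw_fields = [["F", "o", "o"], ["B", "a", "r"]]

# Join the single-character strings of each entry into one field name.
field_names = join.(raw_fields)

println(field_names)   # prints ["Foo", "Bar"]
```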
What about those H5PATH attributes?
julia> read_attribute(h["x"]["Foo"], "H5PATH")
"/x"
julia> read_attribute(h["x"]["Bar"], "H5PATH")
"/x"
Some rather redundant metadata. Not to be concerned about.
For completeness, I also want to briefly check out the other group, 📂 /#refs#. It cannot possibly contain anything of value since we already have all the information needed to reconstruct our object, but as it happens, 📂 /#refs# will wind up becoming very important later on.
julia> read(h["#refs#"]["a"])
2-element Vector{UInt64}:
 0x0000000000000000
 0x0000000000000000
julia> read_attribute(h["#refs#"]["a"], "MATLAB_class")
"canonical empty"
julia> read_attribute(h["#refs#"]["a"], "MATLAB_empty")
0x01
A bunch of numbers of class "canonical empty". There is nothing that we can make out from this, frankly, but it’s good to know it’s there.
The last toy example I want to consider is a simple cell array.
>> x = cell(2, 1);
>> x{1} = 9001;
>> x{2} = 'Foobar';
>> save('cell.mat', 'x', '-v7.3');
julia> h = h5open("cell.mat")
🗂️ HDF5.File: (read-only) cell.mat
├─ 📂 #refs#
│  ├─ 🔢 a
│  │  ├─ 🏷️ MATLAB_class
│  │  └─ 🏷️ MATLAB_empty
│  ├─ 🔢 b
│  │  ├─ 🏷️ H5PATH
│  │  └─ 🏷️ MATLAB_class
│  └─ 🔢 c
│     ├─ 🏷️ H5PATH
│     ├─ 🏷️ MATLAB_class
│     └─ 🏷️ MATLAB_int_decode
└─ 🔢 x
   └─ 🏷️ MATLAB_class
We see a bit more stuff in 📂 #refs#, but let’s first look at 🔢 x.
julia> read_attribute(h["x"], "MATLAB_class")
"cell"
julia> read(h["x"])
2×1 Matrix{HDF5.Reference}:
 HDF5.Reference(HDF5.API.hobj_ref_t(0x0000000000000ab0))
 HDF5.Reference(HDF5.API.hobj_ref_t(0x0000000000000bc8))
We see that /x is indeed a "cell". Rather than containing the contents directly, /x contains two references to other locations within the file. This is not too surprising, really: a cell array can hold values of arbitrary type, whereas an HDF5 dataset must contain homogeneous data.
Julia’s HDF5.jl provides convenient indexing syntax for following an HDF5 reference:
julia> r = read(h["x"]);
julia> h[r[1]]
🔢 HDF5.Dataset: /#refs#/b (file: cell.mat xfer_mode: 0)
├─ 🏷️ H5PATH
└─ 🏷️ MATLAB_class
julia> h[r[2]]
🔢 HDF5.Dataset: /#refs#/c (file: cell.mat xfer_mode: 0)
├─ 🏷️ H5PATH
├─ 🏷️ MATLAB_class
└─ 🏷️ MATLAB_int_decode
We see that the references in /x point to /#refs#/b and /#refs#/c. We read out those datasets to find the values we stored:
julia> read(h[r[1]])
1×1 Matrix{Float64}:
 9001.0
julia> read(h[r[2]])
1×6 Matrix{UInt16}:
 0x0046 0x006f 0x006f 0x0062 0x0061 0x0072
For what it’s worth, just as with struct.mat, /#refs#/a has MATLAB_class "canonical empty" and contains a bunch of zeros.
Our first classes
We’ve seen a couple of very simple objects, and so far, life has been good. The objects are stored in the most straightforward possible manner using datasets with appropriate labels. Granted, we’ve found some junk here and there, but we’ve been able to safely ignore it.
Now I want to talk about classes. MATLAB supports object-oriented programming with the so-called MATLAB Class Object System or MCOS. Let’s create a very simple class by putting the following contents in a file called MyClass.m.
classdef MyClass
    properties
        Foo
        Bar
    end
    methods
        function obj = MyClass(foo, bar)
            obj.Foo = foo;
            obj.Bar = bar;
        end
    end
end
What we’ve done is create a class MyClass with two fields, Foo and Bar. I have not specified what values these fields must contain, meaning that they can take on whatever value we want for now. Thus, our class effectively behaves like a struct. For convenience, I also created a constructor under methods.
>> x = MyClass(9001, 'Foobar');
>> save('myclass.mat', 'x', '-v7.3');
We open our MAT-file as per usual:
julia> h = h5open("myclass.mat")
🗂️ HDF5.File: (read-only) myclass.mat
├─ 📂 #refs#
│  ├─ 🔢 a
│  │  ├─ 🏷️ MATLAB_class
│  │  └─ 🏷️ MATLAB_empty
│  ├─ 🔢 b
│  │  ├─ 🏷️ H5PATH
│  │  └─ 🏷️ MATLAB_class
│  ├─ 🔢 c
│  │  ├─ 🏷️ H5PATH
│  │  └─ 🏷️ MATLAB_class
│  ├─ 🔢 d
│  │  ├─ 🏷️ H5PATH
│  │  ├─ 🏷️ MATLAB_class
│  │  └─ 🏷️ MATLAB_int_decode
│  ├─ 🔢 e
│  │  ├─ 🏷️ H5PATH
│  │  └─ 🏷️ MATLAB_class
│  ├─ 🔢 f
│  │  ├─ 🏷️ H5PATH
│  │  └─ 🏷️ MATLAB_class
│  ├─ 🔢 g
│  │  ├─ 🏷️ H5PATH
│  │  ├─ 🏷️ MATLAB_class
│  │  └─ 🏷️ MATLAB_empty
│  └─ 🔢 h
│     ├─ 🏷️ H5PATH
│     ├─ 🏷️ MATLAB_class
│     └─ 🏷️ MATLAB_empty
├─ 📂 #subsystem#
│  └─ 🔢 MCOS
│     ├─ 🏷️ MATLAB_class
│     └─ 🏷️ MATLAB_object_decode
└─ 🔢 x
   ├─ 🏷️ MATLAB_class
   └─ 🏷️ MATLAB_object_decode
Wow, that’s a whole lot of junk! But the important part is that we have our /x, right? Perhaps, but notice that, unlike in our struct, /x is not a 📂 group, but a 🔢 dataset. In particular, /x cannot, and does not, have subfolders /x/Foo and /x/Bar. This could imply that /x might not have the information that we’re looking for.
Indeed, when we read out /x we’re greeted with the following:
julia> read(h["x"])
6×1 Matrix{UInt32}:
 0xdd000000
 0x00000002
 0x00000001
 0x00000001
 0x00000001
 0x00000001
julia> read_attribute(h["x"], "MATLAB_class")
"MyClass"
julia> read_attribute(h["x"], "MATLAB_object_decode")
3
The MATLAB_class attribute confirms that we’re definitely looking at the right object. But the actual content of /x is seemingly meaningless.
As it turns out, the actual values of x aren’t stored here; rather, they’re hidden within the #refs# group.
Dataset | MATLAB_class | Value |
---|---|---|
/#refs#/a | "canonical empty" | 2-element Vector{UInt64}: 0x0000000000000000 0x0000000000000000 |
/#refs#/b | "uint8" | 192×1 Matrix{UInt8}: 0x03 0x00 ⋮ 0x00 0x00 |
/#refs#/c | "double" | 1×1 Matrix{Float64}: 9001.0 |
/#refs#/d | "char" | 1×6 Matrix{UInt16}: 0x0046 0x006f 0x006f 0x0062 0x0061 0x0072 |
/#refs#/e | "int32" | 2×1 Matrix{Int32}: 0 0 |
/#refs#/f | "cell" | 2×1 Matrix{HDF5.Reference}: HDF5.Reference(HDF5.API.hobj_ref_t(0x00000000000016b0)) HDF5.Reference(HDF5.API.hobj_ref_t(0x00000000000017e8)) |
/#refs#/g | "struct" | 2-element Vector{UInt64}: 0x0000000000000001 0x0000000000000000 |
/#refs#/h | "struct" | 2-element Vector{UInt64}: 0x0000000000000001 0x0000000000000000 |
Most of it is unreadable, but under /#refs#/c and /#refs#/d we unambiguously recognise the values 9001 and 'Foobar'.
So we know where to find the actual values stored in the fields of our class. But how do we couple these values to the associated class? How do we find the correct structure? For instance, suppose we had done the following:
classdef MyClass
    properties
        Foo
    end
    methods
        function obj = MyClass(foo)
            obj.Foo = foo;
        end
    end
end
>> s = struct();
>> s.Field1 = 9001;
>> s.Field2 = 'Foobar';
>> x = MyClass(s);
>> save('myclass.mat', 'x', '-v7.3');
We would still be able to find the values 9001 and 'Foobar' somewhere in our MAT-file, but we’re clearly dealing with a different object, and we ought to be able to differentiate between them and reconstruct the entire structure.
In principle, one could reasonably fear that the MAT-file simply lacks the necessary information. After all, when MATLAB loads our object, MATLAB could simply use the contents of MyClass.m to infer the actual structure of the class, and in this way it could match the values in myclass.mat with the appropriate fields.
Luckily for us, this isn’t the case. MAT-files are essentially self-contained in that the entire structure of the class can be reverse-engineered based on the available metadata. The crucial component is /#refs#/b, which is stored as an unstructured stream of bytes, but which in fact contains all the information needed for our reconstruction.
Below is a hexdump of /#refs#/b:
00000000: 0300 0000 0300 0000 3800 0000 5800 0000  ........8...X...
00000010: 5800 0000 8800 0000 b000 0000 c000 0000  X...............
00000020: 0000 0000 0000 0000 466f 6f00 4261 7200  ........Foo.Bar.
00000030: 4d79 436c 6173 7300 0000 0000 0000 0000  MyClass.........
00000040: 0000 0000 0000 0000 0000 0000 0300 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0100 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0100 0000 0100 0000 0000 0000 0000 0000  ................
00000090: 0200 0000 0100 0000 0100 0000 0000 0000  ................
000000a0: 0200 0000 0100 0000 0100 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
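A dump in this layout can be produced from the raw bytes with a few lines of Julia. This is only a sketch of an xxd-style formatter; `hexdump` is a hypothetical helper, and it is not necessarily how the listing above was made:

```julia
# Hypothetical helper: print bytes in an xxd-like layout
# (hex offset, eight groups of two bytes, printable-ASCII column).
function hexdump(io::IO, bytes::AbstractVector{UInt8})
    for offset in 0:16:length(bytes)-1
        row = bytes[offset+1:min(offset + 16, length(bytes))]
        # Two bytes per group, each byte as two lowercase hex digits.
        groups = (join(string.(g; base=16, pad=2)) for g in Iterators.partition(row, 2))
        # Non-printable bytes are shown as dots.
        text = join((0x20 <= b <= 0x7e ? Char(b) : '.') for b in row)
        println(io, string(offset; base=16, pad=8), ": ", rpad(join(groups, " "), 39), "  ", text)
    end
end

hexdump(stdout, UInt8[0x46, 0x6f, 0x6f, 0x00, 0x42, 0x61, 0x72, 0x00])
```

Since /#refs#/b is read as a 192×1 matrix, it would need to be flattened with vec before being passed to a helper like this.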
Let’s look at the results.
- /#refs#/b starts off with a couple of numbers separated by zeros. Actually, these are little-endian 32-bit integers. The numbers seem rather arbitrary. It may be observed that the eighth number (0xc0) is exactly equal to the number of bytes in /#refs#/b.
- After exactly 40 (0x28) bytes, we see the names of both our fields and of our class stored in ASCII, separated by a single null byte.
- Beyond this point, we find a load of zeros interspersed with a few other numbers.
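The first two observations can be checked programmatically. The sketch below works on the first 64 bytes of the hexdump, typed in by hand. `read_names` is a hypothetical helper, and stopping at the first empty string is a heuristic assumption rather than a known rule of the format:

```julia
# First 64 bytes of /#refs#/b, copied from the hexdump.
bytes = hex2bytes(
    "03000000030000003800000058000000" *
    "5800000088000000b0000000c0000000" *
    "0000000000000000466f6f0042617200" *
    "4d79436c617373000000000000000000",
)

# The first 40 bytes, read as ten little-endian 32-bit integers.
header = ltoh.(reinterpret(UInt32, bytes[1:40]))

# Hypothetical helper: collect null-terminated strings starting at the
# 1-based index `start`, stopping at the first empty string (heuristic).
function read_names(bytes::Vector{UInt8}, start::Int)
    out = String[]
    pos = start
    while pos <= length(bytes)
        stop = findnext(==(0x00), bytes, pos)
        (stop === nothing || stop == pos) && break
        push!(out, String(bytes[pos:stop-1]))
        pos = stop + 1
    end
    return out
end

name_list = read_names(bytes, 41)   # byte offset 40 is Julia index 41

println(Int.(header))   # [3, 3, 56, 88, 88, 136, 176, 192, 0, 0]
println(name_list)      # ["Foo", "Bar", "MyClass"]
```

The eighth header value, 192, is the 0xc0 noted in the first bullet.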
The third part of /#refs#/b, the part with all the zeros, is what actually encodes the structure of our object. Decoding it is rather challenging, and will be the main topic of part 2.
Appendix: The parts of myclass.mat that we skipped
We have ignored a few components of myclass.mat that may seem to contain interesting information. As it turns out, not much happens in these components, but for completeness I’ll briefly summarise what’s inside of them.
- /#refs#/f was seen to contain two references. These references in fact point to /#refs#/g and /#refs#/h, respectively, which in turn have no real content.
- The 📂 #subsystem# group has a single dataset 🔢 MCOS (‘MATLAB Class Object System’). The MATLAB_class attribute is FileWrapper__ and the MATLAB_object_decode attribute again returns 3. The dataset contains six references, and they point to /#refs#/b, /#refs#/a, /#refs#/c, /#refs#/d, /#refs#/e and /#refs#/f, in that order.