Reverse-engineering MAT-files
Part 2: Decoding class structures

In part 1 of this series we looked at some simple MAT-files (v7.3). We found that primitive data, such as character arrays and doubles, is stored in a straightforward manner, and that structs and cell arrays can easily be reconstructed.

When we turned to objects, we found that the encoding quickly gets more complex. We were able to pinpoint the location of the data that we stored, but it wasn’t obvious how the structure of our object is encoded in the MAT-file. Figuring this out is what we’ll turn to in this section.

To keep things simple, we’ll first restrict our attention to (potentially nested) objects that have primitive data at their leaves. For example, objects containing strings, cell arrays of objects, and objects with fields of restricted type are all deferred to a future part of this series.

Outline

We will consider five different MAT-files, each storing one or more objects, possibly nested. Comparing the data in these MAT-files, and in particular the differences between them, will allow us to infer a lot about the encoding.

The MAT-files we will consider are the following.

mycls1.mat: We start with a class MyClass1 with one field Foo1. Our MAT-file consists of a single object x1 = MyClass1(9001).
mycls1_mycls1_mycls1.mat: We save three objects x1 = MyClass1(9001), x2 = MyClass1(9002), and x3 = MyClass1(9003), all of the same type MyClass1.
mycls1_mycls2_mycls3.mat: We create two more classes, MyClass2 and MyClass3, which each have a single field Foo2 and Foo3, respectively. We save one variable of each type: x1 = MyClass1(9001), x2 = MyClass2(9002), and x3 = MyClass3(9003).
mycls123.mat: We create a class MyClass123 with three fields Foo1, Foo2 and Foo3, and we save a single object x1 = MyClass123(9001, 9002, 9003).
mycls1(mycls2(mycls3)).mat: We consider a single nested object x1 = MyClass1(MyClass2(MyClass3(9001))).

We split up the analysis of the MAT-files into several sections.

In the first section we read out the datasets contained in each of the MAT-files. We make a global comparison to draw some general conclusions about how the objects are saved.
In the second section we zoom in on a particularly important dataset in /#refs# which happens to be /#refs#/b for all our MAT-files. We give a hexdump of the object and, with the help of a color scheme, we will be able to identify and understand some common components in this dataset.
At the end of the second section, we will have pinpointed which components of /#refs#/b have the vital information about the structure of the classes. In the third section we zoom in even further and identify the meaning of every relevant byte in these components.

Dataset comparison

As in part 1, we will analyse the MAT-files with HDF5.jl. Below is the file structure that HDF5.jl reports for one of our MAT-files.

julia> using HDF5

julia> h = h5open("mycls1.mat")
🗂️ HDF5.File: (read-only) mycls1.mat
├─ 📂 #refs#
│  ├─ 🔢 a
│  │  ├─ 🏷️ MATLAB_class
│  │  └─ 🏷️ MATLAB_empty
│  ├─ 🔢 b
│  │  ├─ 🏷️ H5PATH
│  │  └─ 🏷️ MATLAB_class
⋮  ⋮
│  ├─ 🔢 f
│  │  ├─ 🏷️ H5PATH
│  │  ├─ 🏷️ MATLAB_class
│  │  └─ 🏷️ MATLAB_empty
│  └─ 🔢 g
│     ├─ 🏷️ H5PATH
│     ├─ 🏷️ MATLAB_class
│     └─ 🏷️ MATLAB_empty
├─ 📂 #subsystem#
│  └─ 🔢 MCOS
│     ├─ 🏷️ MATLAB_class
│     └─ 🏷️ MATLAB_object_decode
└─ 🔢 x1
   ├─ 🏷️ MATLAB_class
   └─ 🏷️ MATLAB_object_decode

Our MAT-file consists of a 🔢 dataset x1 corresponding to the stored variable. In addition, there are a bunch of datasets under /#refs#, and there is a dataset MCOS under the 📂 group /#subsystem# containing internal references.

Let’s first read out the content of the actual variables. As we observed in part 1, the dataset of a variable that holds an object contains only metadata, in the form of a 6×1 Matrix{UInt32}. In the table below, I have listed this metadata for each variable in each MAT-file. Be sure to scroll to the right to see the entire table.

Variable	`mycls1`	`mycls1_mycls1_mycls1`	`mycls1_mycls2_mycls3`	`mycls123`	`mycls1(mycls2(mycls3))`
`/x1`	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000001 0x00000001	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000001 0x00000001	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000001 0x00000001	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000001 0x00000001	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000001 0x00000003
`/x2`	NA	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000002 0x00000001	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000002 0x00000002	NA	NA
`/x3`	NA	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000003 0x00000001	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000003 0x00000003	NA	NA

‘NA’ simply means that the variable doesn’t exist, because we put only one variable in those MAT-files.

Here’s what we may observe.

The first four integers in the metadata never change. It is tempting to assume that they all serve as headers, but that turns out to be wrong: only the first integer is a header, while the next three tell us something about the size of our object. We’ll return to this in the next post.
The two numbers that do change are the fifth and sixth integers. Based in particular on the second and third files (mycls1_mycls1_mycls1.mat and mycls1_mycls2_mycls3.mat) we may tentatively conclude that they encode an object ID and a class ID, respectively.
The fifth file (mycls1(mycls2(mycls3)).mat) illustrates that objects and classes aren’t enumerated in the same way: x1 has object ID 1 but class ID 3. The datasets in /#refs# will confirm that the objects have been enumerated ‘top to bottom’ and the classes have been enumerated ‘bottom to top’.
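
As a minimal sketch of this tentative reading, the following Julia snippet unpacks the six integers into named parts. The function name `decode_object_metadata` and the field labels are made up for illustration, and the meaning assigned to the fifth and sixth integers is only the hypothesis from the list above, not a documented format. With HDF5.jl the input would presumably come from something like `vec(read(h["x1"]))`; here the values are simply copied from the table.

```julia
# Hypothetical helper: splits the 6-integer metadata into labelled parts.
# The labels (size_info, object_id, class_id) are our own guesses.
function decode_object_metadata(meta::AbstractVector{UInt32})
    length(meta) == 6 || error("expected 6 integers, got $(length(meta))")
    meta[1] == 0xdd000000 || error("unexpected header value")
    return (header = meta[1],
            size_info = Tuple(meta[2:4]),  # meaning deferred to the next post
            object_id = Int(meta[5]),
            class_id = Int(meta[6]))
end

# /x2 in mycls1_mycls2_mycls3.mat (values copied from the table above)
x2 = decode_object_metadata(
    UInt32[0xdd000000, 0x00000002, 0x00000001, 0x00000001, 0x00000002, 0x00000002])

# /x1 in mycls1(mycls2(mycls3)).mat
nested = decode_object_metadata(
    UInt32[0xdd000000, 0x00000002, 0x00000001, 0x00000001, 0x00000001, 0x00000003])

println("x2:     object ", x2.object_id, ", class ", x2.class_id)
println("nested: object ", nested.object_id, ", class ", nested.class_id)
```

For the nested file this gives object ID 1 and class ID 3, which is the mismatch between the two enumerations mentioned in the last point.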

Next, we will read out the contents of the 🔢 datasets under /#refs#. I’ve shortened the output for the sake of readability. Again, be sure to scroll to the right.

Variable	`mycls1`	`mycls1_mycls1_mycls1`	`mycls1_mycls2_mycls3`	`mycls123`	`mycls1(mycls2(mycls3))`
`/#refs#/a`	0x0000000000000000 0x0000000000000000	0x0000000000000000 0x0000000000000000	0x0000000000000000 0x0000000000000000	0x0000000000000000 0x0000000000000000	0x0000000000000000 0x0000000000000000
`/#refs#/b`	`176×1 Matrix{UInt8}` (Content omitted)	`272×1 Matrix{UInt8}` (Content omitted)	`336×1 Matrix{UInt8}` (Content omitted)	`216×1 Matrix{UInt8}` (Content omitted)	`336×1 Matrix{UInt8}` (Content omitted)
`/#refs#/c`	9001.0	9001.0	9001.0	9001.0	9001.0
`/#refs#/d`	0 0	9002.0	9002.0	9002.0	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000003 0x00000001
`/#refs#/e`	Ref to `/#refs#/f` Ref to `/#refs#/g`	9003.0	9003.0	9003.0	0xdd000000 0x00000002 0x00000001 0x00000001 0x00000002 0x00000002
`/#refs#/f`	0x0000000000000001 0x0000000000000000	0 0	0 0 0 0	0 0	0 0 0 0
`/#refs#/g`	0x0000000000000001 0x0000000000000000	Ref to `/#refs#/h` Ref to `/#refs#/i`	Ref to `/#refs#/h` Ref to `/#refs#/i` Ref to `/#refs#/j` Ref to `/#refs#/k`	Ref to `/#refs#/h` Ref to `/#refs#/i`	Ref to `/#refs#/h` Ref to `/#refs#/i` Ref to `/#refs#/j` Ref to `/#refs#/k`
`/#refs#/h`	NA	0x0000000000000001 0x0000000000000000	0x0000000000000001 0x0000000000000000	0x0000000000000001 0x0000000000000000	0x0000000000000001 0x0000000000000000
`/#refs#/i`	NA	0x0000000000000001 0x0000000000000000	0x0000000000000001 0x0000000000000000	0x0000000000000001 0x0000000000000000	0x0000000000000001 0x0000000000000000
`/#refs#/j`	NA	NA	0x0000000000000001 0x0000000000000000	NA	0x0000000000000001 0x0000000000000000
`/#refs#/k`	NA	NA	0x0000000000000001 0x0000000000000000	NA	0x0000000000000001 0x0000000000000000

What do we see?

/#refs#/a contains no information whatsoever.
We omitted the output of /#refs#/b, which is by far the largest dataset. We’ll study it in detail in the next section.
After /#refs#/b, we see one dataset for every field of every object in the MAT-file. If the field contains a double, then the dataset simply contains the value; if it’s an object, it contains the same six-number metadata that we encountered earlier.
Next, we encounter a dataset containing just zeros. The number of zeros appears to track the number of distinct classes: in every file above it is one more than that number.
The dataset after that contains references to the subsequent datasets. These, in turn, always contain the numbers 1 and 0, in that order.

We now know where to find the actual (integer or floating-point) values contained in the MAT-file. What remains is to recover the structure of the objects. This is the role of /#refs#/b, which is what we’ll turn to next.

Hexdumps of `/#refs#/b`

So far we have ignored /#refs#/b, which is by far the largest 🔢 dataset, and which essentially contains all the information about the structure of the saved objects. Below, we show a hexdump of each dataset.

We start off with mycls1.mat:

00000000: 0300 0000 0200 0000 3800 0000 5800 0000  ........8...X...
00000010: 5800 0000 8800 0000 a000 0000 b000 0000  X...............
00000020: 0000 0000 0000 0000 466f 6f31 004d 7943  ........Foo1.MyC
00000030: 6c61 7373 3100 0000 0000 0000 0000 0000  lass1...........
00000040: 0000 0000 0000 0000 0000 0000 0200 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0100 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0100 0000 0100 0000 0000 0000 0000 0000  ................
00000090: 0100 0000 0100 0000 0100 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

Let’s look at the results.

The red component (the first 32 bytes, 0x00 to 0x1f) consists of eight little-endian 32-bit integers that contain size-related information. Note already that the eighth number (0xb0) is exactly equal to the number of bytes in /#refs#/b.
After exactly 40 (0x28) bytes, the green component lists the names of our field and of our class, stored in ASCII and separated by a single null byte.
After the field and class names, we see a blue, cyan, and yellow component (starting at offsets 0x38, 0x58, and 0x88, respectively, in this file). They consist of a load of zeros interspersed with a few other numbers. These numbers are again little-endian 32-bit integers.
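
To make the first point concrete, here is a small Julia sketch that decodes the first 32 bytes of the mycls1.mat hexdump as eight little-endian 32-bit integers. It only uses functions from Julia's Base; the variable names are my own, and the bytes are copied from the hexdump rather than read from the file.

```julia
# First two rows of the mycls1.mat hexdump (32 bytes).
header_hex = """
0300 0000 0200 0000 3800 0000 5800 0000
5800 0000 8800 0000 a000 0000 b000 0000
"""

# Strip whitespace and turn the hex digits into raw bytes.
bytes = hex2bytes(replace(header_hex, r"\s+" => ""))

# Reinterpret as UInt32 and convert from little-endian to host byte order.
header = Int.(ltoh.(reinterpret(UInt32, bytes)))

println(header)  # [3, 2, 56, 88, 88, 136, 160, 176]
```

The last value, 176, is 0xb0, the total size of the dataset in bytes.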

We turn to the second object, mycls1_mycls1_mycls1.mat.

00000000: 0300 0000 0200 0000 3800 0000 5800 0000  ........8...X...
00000010: 5800 0000 b800 0000 f000 0000 1001 0000  X...............
00000020: 0000 0000 0000 0000 466f 6f31 004d 7943  ........Foo1.MyC
00000030: 6c61 7373 3100 0000 0000 0000 0000 0000  lass1...........
00000040: 0000 0000 0000 0000 0000 0000 0200 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0100 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0100 0000 0100 0000 0100 0000 0000 0000  ................
00000090: 0000 0000 0000 0000 0200 0000 0200 0000  ................
000000a0: 0100 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0300 0000 0300 0000 0000 0000 0000 0000  ................
000000c0: 0100 0000 0100 0000 0100 0000 0000 0000  ................
000000d0: 0100 0000 0100 0000 0100 0000 0100 0000  ................
000000e0: 0100 0000 0100 0000 0100 0000 0200 0000  ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000100: 0000 0000 0000 0000 0000 0000 0000 0000  ................

A few things may immediately be observed:

Some numbers in the red component have stayed the same while others have increased. Later down this section, we’ll make a nice table with all the values in this component, which will make it easier to infer their meanings.
We had suspected that the eighth number in the red component refers to the length. The eighth number reads 1001 0000 while the length is 0x110. This reaffirms my claim that the numbers are little-endian.
The green component has stayed the same. This means that class and field names are not repeated in the list.
The cyan and yellow components of the structural part are now significantly longer.

mycls1_mycls2_mycls3.mat:

00000000: 0300 0000 0600 0000 5800 0000 9800 0000  ........X.......
00000010: 9800 0000 f800 0000 3001 0000 5001 0000  ........0...P...
00000020: 0000 0000 0000 0000 466f 6f31 004d 7943  ........Foo1.MyC
00000030: 6c61 7373 3100 466f 6f32 004d 7943 6c61  lass1.Foo2.MyCla
00000040: 7373 3200 466f 6f33 004d 7943 6c61 7373  ss2.Foo3.MyClass
00000050: 3300 0000 0000 0000 0000 0000 0000 0000  3...............
00000060: 0000 0000 0000 0000 0000 0000 0200 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0400 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0600 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0100 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0100 0000 0100 0000 0200 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0200 0000 0200 0000  ................
000000e0: 0300 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0300 0000 0300 0000 0000 0000 0000 0000  ................
00000100: 0100 0000 0100 0000 0100 0000 0000 0000  ................
00000110: 0100 0000 0300 0000 0100 0000 0100 0000  ................
00000120: 0100 0000 0500 0000 0100 0000 0200 0000  ................
00000130: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000140: 0000 0000 0000 0000 0000 0000 0000 0000  ................

The blue component has suddenly increased in length. Looking at the contents, we see that it lists the distinct classes that our MAT-file contains. The numbers in fact indicate which of the names listed in the green component are class names.^{Note 1}
The cyan and yellow components have changed, but it’s not immediately clear in what way. We defer the discussion of these components until the next section.
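
The following Julia sketch illustrates how the names and the class-name indices could be pulled out of the first 160 bytes of the mycls1_mycls2_mycls3.mat hexdump. It assumes that the names start at byte offset 40, that the third and fourth integers of the red component mark the end of the green and blue components, and that every nonzero integer in the blue component is a (one-based) index into the list of names. These are inferences from the dumps, not documented facts, and the variable names are hypothetical.

```julia
# Rows 0x00 through 0x90 of the mycls1_mycls2_mycls3.mat hexdump.
dump_hex = """
0300 0000 0600 0000 5800 0000 9800 0000
9800 0000 f800 0000 3001 0000 5001 0000
0000 0000 0000 0000 466f 6f31 004d 7943
6c61 7373 3100 466f 6f32 004d 7943 6c61
7373 3200 466f 6f33 004d 7943 6c61 7373
3300 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0200 0000
0000 0000 0000 0000 0000 0000 0400 0000
0000 0000 0000 0000 0000 0000 0600 0000
0000 0000 0000 0000 0000 0000 0000 0000
"""
bytes = hex2bytes(replace(dump_hex, r"\s+" => ""))

# Decode a byte vector as little-endian 32-bit integers.
u32(b) = Int.(ltoh.(reinterpret(UInt32, b)))

header = u32(bytes[1:32])

# Green component: null-separated ASCII names, from offset 40 up to header[3].
# (Julia arrays are one-based, hence the 41.)
names = String.(split(String(bytes[41:header[3]]), '\0'; keepempty = false))

# Blue component: from header[3] up to header[4]; keep the nonzero entries.
blue = u32(bytes[header[3]+1:header[4]])
class_names = names[filter(!iszero, blue)]

println(names)
println(class_names)
```

Under these assumptions the indices 2, 4 and 6 pick out MyClass1, MyClass2 and MyClass3 from the six names.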

For completeness, we end with the hexdumps of the last two objects. Here is mycls123.mat:

00000000: 0300 0000 0400 0000 4800 0000 6800 0000  ........H...h...
00000010: 6800 0000 9800 0000 c800 0000 d800 0000  h...............
00000020: 0000 0000 0000 0000 466f 6f31 0046 6f6f  ........Foo1.Foo
00000030: 3200 466f 6f33 004d 7943 6c61 7373 3132  2.Foo3.MyClass12
00000040: 3300 0000 0000 0000 0000 0000 0000 0000  3...............
00000050: 0000 0000 0000 0000 0000 0000 0400 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0100 0000 0000 0000 0000 0000 0000 0000  ................
00000090: 0100 0000 0100 0000 0000 0000 0000 0000  ................
000000a0: 0300 0000 0100 0000 0100 0000 0000 0000  ................
000000b0: 0200 0000 0100 0000 0100 0000 0300 0000  ................
000000c0: 0100 0000 0200 0000 0000 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000                      ........

And finally, mycls1(mycls2(mycls3)).mat:

00000000: 0300 0000 0600 0000 5800 0000 9800 0000  ........X.......
00000010: 9800 0000 f800 0000 3001 0000 5001 0000  ........0...P...
00000020: 0000 0000 0000 0000 466f 6f31 0046 6f6f  ........Foo1.Foo
00000030: 3200 466f 6f33 004d 7943 6c61 7373 3300  2.Foo3.MyClass3.
00000040: 4d79 436c 6173 7332 004d 7943 6c61 7373  MyClass2.MyClass
00000050: 3100 0000 0000 0000 0000 0000 0000 0000  1...............
00000060: 0000 0000 0000 0000 0000 0000 0400 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0500 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0600 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0300 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0100 0000 0300 0000 0200 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0200 0000 0200 0000  ................
000000e0: 0100 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0300 0000 0100 0000 0000 0000 0000 0000  ................
00000100: 0100 0000 0100 0000 0100 0000 0200 0000  ................
00000110: 0100 0000 0200 0000 0100 0000 0100 0000  ................
00000120: 0100 0000 0300 0000 0100 0000 0000 0000  ................
00000130: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000140: 0000 0000 0000 0000 0000 0000 0000 0000  ................

We know the meaning of the green and blue components, and we defer our discussion on the cyan and yellow components to the next section, so let us now take a closer look at the red component. Below I’ve listed the numbers in the red component for each MAT-file, in decimal notation.

mycls1.mat:                   3   2  56  88  88 136 160 176
mycls1_mycls1_mycls1.mat:     3   2  56  88  88 184 240 272
mycls1_mycls2_mycls3.mat:     3   6  88 152 152 248 304 336
mycls123.mat:                 3   4  72 104 104 152 200 216
mycls1(mycls2(mycls3)).mat:   3   6  88 152 152 248 304 336

Here’s what we may conclude.

The first number is always 3, indicating that it may serve as a header.
The second number is equal to the number of names in the green component, or equivalently, the number of unique field and class names.^{Note 1}
mycls1_mycls2_mycls3.mat and mycls1(mycls2(mycls3)).mat have the exact same numbers, and their hexdumps are also the exact same size. We may thus infer that most numbers refer to a length within /#refs#/b itself.

Based on the last point, upon closer inspection we see that the last five numbers are all length measures of /#refs#/b. Specifically:

The third number is the length of /#refs#/b up to and including the green component.
The fourth and fifth numbers are the length up to and including the blue component.
The sixth number is the length up to and including the cyan component.
The seventh number is the length up to and including the yellow component.
The eighth number is the entire length of /#refs#/b.

The fact that the fourth and fifth number are the same implies that there is another hidden component which happens to be empty for all objects we have thus far considered. This is indeed the case as we’ll see in a future part.
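
As a rough sketch of how these boundaries could be used, the Julia function below turns the eight numbers into zero-based byte ranges, one per component. The function name and the labels `hidden` (the empty component between the fourth and fifth numbers) and `rest` (whatever follows the yellow component) are my own, and the start offset of 40 for the names is taken from the hexdumps above.

```julia
# Hypothetical helper: zero-based byte ranges of the components of /#refs#/b,
# derived from the eight integers of the red component.
function component_ranges(header)
    names_start = 40  # the names begin after 40 (0x28) bytes
    bounds = [names_start, header[3], header[4], header[5],
              header[6], header[7], header[8]]
    labels = (:names, :blue, :hidden, :cyan, :yellow, :rest)
    return NamedTuple{labels}(Tuple(bounds[i]:(bounds[i + 1] - 1) for i in 1:6))
end

r = component_ranges([3, 2, 56, 88, 88, 136, 160, 176])  # mycls1.mat
for (label, range) in pairs(r)
    println(rpad(label, 7), " offsets ", range, " (", length(range), " bytes)")
end
```

For mycls1.mat the cyan component covers offsets 88 to 135 and the hidden component comes out empty, in line with the list above.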

The remaining part of `/#refs#/b`

We have yet to discuss the behaviour of the cyan and yellow components. These components are the most complex, and they basically encode how the classes and fields relate to each other.

Since our objects are very small, all nonzero numbers in the cyan and yellow components can be described by a single digit. Thanks to this, we can give a very compact representation of their contents, allowing us to juxtapose the components of the five MAT-files. Here are the results:

mycls1.mat:                 100011               00 1-110            
mycls1_mycls1_mycls1.mat:   100011 100022 100033 00 1-110 1-111 1-112
mycls1_mycls2_mycls3.mat:   100011 200022 300033 00 1-110 1-311 1-512
mycls123.mat:               100011               00 3-110-211-312    
mycls1(mycls2(mycls3)).mat: 300013 200022 100031 00 1-112 1-211 1-310

To make your life easier, I have added suggestive spacings and dashes to reveal how the cyan and yellow components are organised.

Here’s what we can tell by looking at this data. First, two general points.

The cyan component and the yellow component consist of an equal number of parts. This suggests that they are in fact referring to the same things.
The number of parts in both the cyan and yellow component is precisely equal to the number of objects in our MAT-file. This includes objects that sit within another object.

Now focus specifically on the cyan component.

Each part in the cyan component is a sextet.
In mycls1_mycls1_mycls1.mat, the first number in every sextet is 1. This suggests that this number refers to the class ID.
The second, third, and fourth numbers in the sextet are always 0.^{Note 2}
When we studied mycls1(mycls2(mycls3)).mat, we found that the object IDs and class IDs ran in opposite orders. If the first number is the class ID, it is tempting to conclude that the fifth number is the object ID. That turns out to be wrong, as we’ll find out in the next post. Instead, the sextets are ordered by object ID: the k-th sextet describes object k.
Taken over all sextets in a file, the sixth numbers seem to always form a permutation of the numbers 1 to (number of objects).
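
The sextets can be read off mechanically. The Julia sketch below does this for mycls1(mycls2(mycls3)).mat, using rows 0x90 through 0xf0 of its hexdump and the cyan boundaries 152 and 248 from the table of red-component numbers. Note that the hexdump shows an all-zero sextet at the start of the cyan component, which the compact representation above leaves out; the code skips it. The labels `fifth` and `sixth` are placeholders, since their meaning is not settled in this post.

```julia
# Rows 0x90 through 0xf0 of the mycls1(mycls2(mycls3)).mat hexdump.
chunk_hex = """
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0300 0000 0000 0000 0000 0000 0000 0000
0100 0000 0300 0000 0200 0000 0000 0000
0000 0000 0000 0000 0200 0000 0200 0000
0100 0000 0000 0000 0000 0000 0000 0000
0300 0000 0100 0000 0000 0000 0000 0000
"""
chunk = hex2bytes(replace(chunk_hex, r"\s+" => ""))
chunk_offset = 144                 # 0x90: offset of the first byte in `chunk`
cyan_start, cyan_stop = 152, 248   # fifth and sixth numbers of the red component

u32(b) = Int.(ltoh.(reinterpret(UInt32, b)))

# Cut out the cyan component and arrange it as one sextet per column.
cyan = u32(chunk[(cyan_start - chunk_offset + 1):(cyan_stop - chunk_offset)])
sextets = reshape(cyan, 6, :)

# Column 1 is the all-zero sextet; column j (j >= 2) describes object j - 1.
objects = [(object_id = j - 1,
            class_id = sextets[1, j],
            fifth = sextets[5, j],
            sixth = sextets[6, j]) for j in 2:size(sextets, 2)]

foreach(println, objects)
```

The class IDs come out as 3, 2, 1 for objects 1, 2, 3, matching the metadata of x1 and of the nested objects in /#refs#/e and /#refs#/d.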

With this figured out, we turn to the yellow component.

The length of the parts in the yellow component is variable. Upon closer inspection, we find that each part consists of N triplets, preceded by one more 32-bit integer holding the value N.
In fact, since we know that each part refers to an object in the MAT-file, we may infer that N is the number of fields of that given object. Thus, each triplet corresponds to a specific field of the object.
The first number in the triplet is seen to be as high as 5 in one case. This means that it cannot be something as simple as a class or object ID. Instead, it is the name of the field, encoded as the index of the name as listed in the green component.
The second number in the triplet is always 1.^{Note 2}
Some digits in the triplet are 0.^{Note 3}
So far, nothing in the MAT-file connects the fields to their values. Frankly, the third number of the triplet is pretty much the only thing that could reasonably fulfill this role. With this in mind we find, on closer inspection, that this third number is in fact a (zero-indexed!) reference to subsequent 🔢 datasets in /#refs#. That is, if the value of the field sits in /#refs#/c, we write down a 0; if it’s in /#refs#/d we write down a 1; and so on.
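
Putting the observations about the yellow component together, the Julia sketch below decodes the yellow component of mycls123.mat and maps each field name to the dataset that should hold its value. It is a sketch under several assumptions: the component starts with two zeros (as in the compact representation above), the number of parts is known in advance (one per object), and the zero-based reference 0 corresponds to /#refs#/c, 1 to /#refs#/d, and so on, which can only hold for files as small as ours. The function name `parse_yellow` is hypothetical.

```julia
# Rows 0x90 through 0xc0 of the mycls123.mat hexdump.
chunk_hex = """
0100 0000 0100 0000 0000 0000 0000 0000
0300 0000 0100 0000 0100 0000 0000 0000
0200 0000 0100 0000 0100 0000 0300 0000
0100 0000 0200 0000 0000 0000 0000 0000
"""
chunk = hex2bytes(replace(chunk_hex, r"\s+" => ""))
chunk_offset = 144                     # 0x90: offset of the first byte in `chunk`
yellow_start, yellow_stop = 152, 200   # sixth and seventh numbers of the red component

u32(b) = Int.(ltoh.(reinterpret(UInt32, b)))
yellow = u32(chunk[(yellow_start - chunk_offset + 1):(yellow_stop - chunk_offset)])

# Hypothetical parser: returns, per object, a list of
# (name index, constant 1, zero-based dataset reference) triplets.
function parse_yellow(ints, nparts)
    parts = Vector{Vector{NTuple{3,Int}}}()
    i = 3                                # skip the two leading zeros
    for _ in 1:nparts
        n = ints[i]                      # number of fields of this object
        i += 1
        push!(parts, [(ints[i + 3k], ints[i + 3k + 1], ints[i + 3k + 2]) for k in 0:n-1])
        i += 3n
    end
    return parts
end

names = ["Foo1", "Foo2", "Foo3", "MyClass123"]  # from the green component
parts = parse_yellow(yellow, 1)                 # mycls123.mat holds one object

# Map each field name to the dataset assumed to hold its value.
fields = [names[t[1]] => string("/#refs#/", 'c' + t[3]) for t in parts[1]]
foreach(println, fields)
```

According to the earlier table, /#refs#/c, /#refs#/d and /#refs#/e in this file contain 9001.0, 9002.0 and 9003.0, which are exactly the values we stored in Foo1, Foo2 and Foo3.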

This brings us to the end of our analysis. Although we haven’t yet figured out the meaning of each and every byte, the information presented in this post is enough to reconstruct objects that are similar to the ones analysed here.

In the next post, we will take a look at classes with attributes. This will uncover some new phenomena that we haven’t yet observed in our simple classes.