Reverse-engineering MAT-files
Part 2: Decoding class structures

In part 1 of this series we looked at some simple MAT-files (v7.3). We found that primitive data, such as character arrays and doubles, is stored in a straightforward manner. Structs and cell arrays can easily be reconstructed.

When we turned to objects, we found that the encoding quickly gets more complex. We were able to pinpoint the location of the data that we stored, but it wasn’t obvious how the structure of our object is encoded in the MAT-file. Figuring this out is what we’ll turn to in this section.

To keep things simple, we’ll first restrict our attention to (potentially nested) objects that have primitive data at their leaves. Objects containing strings, cell arrays of objects, and objects with fields of restricted type are deferred to a future part of this series.

Outline

We will consider five different MAT-files, each storing one or more possibly nested classes. We will compare the data in each of the MAT-files. The differences between them will allow us to infer a lot about the encoding.

The MAT-files we will consider are the following.

  1. mycls1.mat: We start with a class MyClass1 with one field Foo1. Our MAT-file consists of a single object x1 = MyClass1(9001).
  2. mycls1_mycls1_mycls1.mat: We save three objects x1 = MyClass1(9001), x2 = MyClass1(9002), and x3 = MyClass1(9003), all of the same type MyClass1.
  3. mycls1_mycls2_mycls3.mat: We create two more classes, MyClass2 and MyClass3, which each have a single field Foo2 and Foo3, respectively. We save one variable of each type: x1 = MyClass1(9001), x2 = MyClass2(9002), and x3 = MyClass3(9003).
  4. mycls123.mat: We create a class MyClass123 with three fields Foo1, Foo2 and Foo3, and we save a single object x1 = MyClass123(9001, 9002, 9003).
  5. mycls1(mycls2(mycls3)).mat: We consider a single nested object x1 = MyClass1(MyClass2(MyClass3(9001))).

We split up the analysis of the MAT-files into several sections.

  1. In the first section we read out the datasets contained in each of the MAT-files. We make a global comparison to draw some general conclusions about how the objects are saved.
  2. In the second section we zoom in on a particularly important dataset in /#refs# which happens to be /#refs#/b for all our MAT-files. We give a hexdump of the object and, with the help of a color scheme, we will be able to identify and understand some common components in this dataset.
  3. At the end of the second section, we will have pinpointed which components of /#refs#/b have the vital information about the structure of the classes. In the third section we zoom in even further and identify the meaning of every relevant byte in these components.

Dataset comparison

As in part 1, we will analyse the MAT-files with HDF5.jl. Below is the structure that HDF5.jl reports for one of our MAT-files.

julia> h = h5open("mycls1.mat")
🗂️ HDF5.File: (read-only) mycls1.mat
├─ 📂 #refs#
│  ├─ 🔢 a
│  │  ├─ 🏷️ MATLAB_class
│  │  └─ 🏷️ MATLAB_empty
│  ├─ 🔢 b
│  │  ├─ 🏷️ H5PATH
│  │  └─ 🏷️ MATLAB_class
⋮  ⋮
│  ├─ 🔢 f
│  │  ├─ 🏷️ H5PATH
│  │  ├─ 🏷️ MATLAB_class
│  │  └─ 🏷️ MATLAB_empty
│  └─ 🔢 g
│     ├─ 🏷️ H5PATH
│     ├─ 🏷️ MATLAB_class
│     └─ 🏷️ MATLAB_empty
├─ 📂 #subsystem#
│  └─ 🔢 MCOS
│     ├─ 🏷️ MATLAB_class
│     └─ 🏷️ MATLAB_object_decode
└─ 🔢 x1
   ├─ 🏷️ MATLAB_class
   └─ 🏷️ MATLAB_object_decode

Our MAT-file consists of a 🔢 dataset x1 corresponding to the stored variable. In addition, there are a bunch of datasets under /#refs#, and there is a dataset MCOS under the 📂 group /#subsystem# containing internal references.

Let’s first read out the content of the actual variables. As we observed in part 1, the content of a variable corresponding to an object only contains a bunch of metadata in the form of a 6×1 Matrix{UInt32}. In the table below, I have listed the metadata for each variable in each dataset. Be sure to scroll to the right to see the entire table.

Variable   mycls1                mycls1_mycls1_mycls1  mycls1_mycls2_mycls3  mycls123              mycls1(mycls2(mycls3))

/x1        0xdd000000            0xdd000000            0xdd000000            0xdd000000            0xdd000000
           0x00000002            0x00000002            0x00000002            0x00000002            0x00000002
           0x00000001            0x00000001            0x00000001            0x00000001            0x00000001
           0x00000001            0x00000001            0x00000001            0x00000001            0x00000001
           0x00000001            0x00000001            0x00000001            0x00000001            0x00000001
           0x00000001            0x00000001            0x00000001            0x00000001            0x00000003

/x2        NA                    0xdd000000            0xdd000000            NA                    NA
                                 0x00000002            0x00000002
                                 0x00000001            0x00000001
                                 0x00000001            0x00000001
                                 0x00000002            0x00000002
                                 0x00000001            0x00000002

/x3        NA                    0xdd000000            0xdd000000            NA                    NA
                                 0x00000002            0x00000002
                                 0x00000001            0x00000001
                                 0x00000001            0x00000001
                                 0x00000003            0x00000003
                                 0x00000001            0x00000003

‘NA’ simply means that the variable doesn’t exist, because those MAT-files contain only a single variable.

Here’s what we may observe.
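As a rough illustration of the pattern in the table, the sketch below splits one 6×1 metadata vector into named parts. The helper `decode_object_metadata` and its field names (`marker`, `middle`, `object_id`, `class_id`) are hypothetical labels, not MATLAB terminology. Reading the fifth entry as an object counter and the sixth as a class index is an inference from the five files above, and it may not hold in general.

```julia
# Hypothetical helper; the field names are guesses based on the table above,
# not an official description of the layout.
function decode_object_metadata(meta::AbstractVector{<:Integer})
    length(meta) == 6 || throw(ArgumentError("expected 6 entries, got $(length(meta))"))
    return (
        marker    = UInt32(meta[1]),         # 0xdd000000 in every variable seen so far
        middle    = Tuple(Int.(meta[2:4])),  # (2, 1, 1) in every variable seen so far
        object_id = Int(meta[5]),            # assumed: running index of the object
        class_id  = Int(meta[6]),            # assumed: index of the object's class
    )
end

# /x3 from mycls1_mycls1_mycls1.mat and /x1 from mycls1(mycls2(mycls3)).mat
x3_same   = UInt32[0xdd000000, 2, 1, 1, 3, 1]
x1_nested = UInt32[0xdd000000, 2, 1, 1, 1, 3]

for meta in (x3_same, x1_nested)
    m = decode_object_metadata(meta)
    println("object ", m.object_id, " of class ", m.class_id)
end
```

Under this reading, the three variables in mycls1_mycls1_mycls1.mat are objects 1, 2 and 3 of the same class 1, while the three variables in mycls1_mycls2_mycls3.mat belong to classes 1, 2 and 3.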

Next, we will read out the contents of the 🔢 datasets under /#refs#. I’ve shortened the output for the sake of readability. Again, be sure to scroll to the right.

Variable   mycls1                mycls1_mycls1_mycls1  mycls1_mycls2_mycls3  mycls123              mycls1(mycls2(mycls3))

/#refs#/a  0x0000000000000000    0x0000000000000000    0x0000000000000000    0x0000000000000000    0x0000000000000000
           0x0000000000000000    0x0000000000000000    0x0000000000000000    0x0000000000000000    0x0000000000000000

/#refs#/b  176×1 Matrix{UInt8}   272×1 Matrix{UInt8}   336×1 Matrix{UInt8}   216×1 Matrix{UInt8}   336×1 Matrix{UInt8}
           (Content omitted)     (Content omitted)     (Content omitted)     (Content omitted)     (Content omitted)

/#refs#/c  9001.0                9001.0                9001.0                9001.0                9001.0

/#refs#/d  0                     9002.0                9002.0                9002.0                0xdd000000
           0                                                                                       0x00000002
                                                                                                   0x00000001
                                                                                                   0x00000001
                                                                                                   0x00000003
                                                                                                   0x00000001

/#refs#/e  Ref to /#refs#/f      9003.0                9003.0                9003.0                0xdd000000
           Ref to /#refs#/g                                                                        0x00000002
                                                                                                   0x00000001
                                                                                                   0x00000001
                                                                                                   0x00000002
                                                                                                   0x00000002

/#refs#/f  0x0000000000000001    0                     0                     0                     0
           0x0000000000000000    0                     0                     0                     0
                                                       0                                           0
                                                       0                                           0

/#refs#/g  0x0000000000000001    Ref to /#refs#/h      Ref to /#refs#/h      Ref to /#refs#/h      Ref to /#refs#/h
           0x0000000000000000    Ref to /#refs#/i      Ref to /#refs#/i      Ref to /#refs#/i      Ref to /#refs#/i
                                                       Ref to /#refs#/j                            Ref to /#refs#/j
                                                       Ref to /#refs#/k                            Ref to /#refs#/k

/#refs#/h  NA                    0x0000000000000001    0x0000000000000001    0x0000000000000001    0x0000000000000001
                                 0x0000000000000000    0x0000000000000000    0x0000000000000000    0x0000000000000000

/#refs#/i  NA                    0x0000000000000001    0x0000000000000001    0x0000000000000001    0x0000000000000001
                                 0x0000000000000000    0x0000000000000000    0x0000000000000000    0x0000000000000000

/#refs#/j  NA                    NA                    0x0000000000000001    NA                    0x0000000000000001
                                                       0x0000000000000000                          0x0000000000000000

/#refs#/k  NA                    NA                    0x0000000000000001    NA                    0x0000000000000001
                                                       0x0000000000000000                          0x0000000000000000

What do we see?

We now know where to find the actual (integer or floating-point) values contained in the MAT-file. What remains is to recover the structure of the objects. This is the role of /#refs#/b, which is what we’ll turn to next.

Hexdumps of /#refs#/b

So far we have ignored /#refs#/b, which is by far the largest 🔢 dataset, and which essentially contains all the information about the structure of the saved objects. Below, we show a hexdump of each dataset.

We start off with mycls1.mat:

00000000: 0300 0000 0200 0000 3800 0000 5800 0000  ........8...X...
00000010: 5800 0000 8800 0000 a000 0000 b000 0000  X...............
00000020: 0000 0000 0000 0000 466f 6f31 004d 7943  ........Foo1.MyC
00000030: 6c61 7373 3100 0000 0000 0000 0000 0000  lass1...........
00000040: 0000 0000 0000 0000 0000 0000 0200 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0100 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0100 0000 0100 0000 0000 0000 0000 0000  ................
00000090: 0100 0000 0100 0000 0100 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

Let’s look at the results.
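To make the dump easier to read, here is a minimal sketch that decodes the start of it: the first 32 bytes as eight little-endian UInt32 values, followed by the null-terminated names. Only the first 56 bytes of the hexdump are embedded. The starting offset of the names (40) is an assumption read off the dump, and `name_table` is just an illustrative variable name.

```julia
# First 56 bytes of /#refs#/b in mycls1.mat, copied from the hexdump above.
rows = [
    "03000000020000003800000058000000",
    "5800000088000000a0000000b0000000",
    "0000000000000000466f6f31004d7943",
    "6c61737331000000",
]
blob = hex2bytes(join(rows))

# Eight little-endian UInt32 values at the very start of the dataset.
header = Int.(ltoh.(reinterpret(UInt32, blob[1:32])))

# Assumption: the names start at byte offset 40 (after the header and eight
# zero bytes) and run up to the offset stored in the third header entry.
names_start = 40
names_end = header[3]
name_table = String.(split(String(blob[names_start+1:names_end]), '\0'; keepempty = false))

println(header)      # the eight header numbers
println(name_table)  # the names found between offsets 40 and 56
```

For this file the second header entry (2) matches the number of names found, and the third entry (56) is the offset at which the names, including their zero padding, end.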

We turn to the second file, mycls1_mycls1_mycls1.mat.

00000000: 0300 0000 0200 0000 3800 0000 5800 0000  ........8...X...
00000010: 5800 0000 b800 0000 f000 0000 1001 0000  X...............
00000020: 0000 0000 0000 0000 466f 6f31 004d 7943  ........Foo1.MyC
00000030: 6c61 7373 3100 0000 0000 0000 0000 0000  lass1...........
00000040: 0000 0000 0000 0000 0000 0000 0200 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0100 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0100 0000 0100 0000 0100 0000 0000 0000  ................
00000090: 0000 0000 0000 0000 0200 0000 0200 0000  ................
000000a0: 0100 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0300 0000 0300 0000 0000 0000 0000 0000  ................
000000c0: 0100 0000 0100 0000 0100 0000 0000 0000  ................
000000d0: 0100 0000 0100 0000 0100 0000 0100 0000  ................
000000e0: 0100 0000 0100 0000 0100 0000 0200 0000  ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000100: 0000 0000 0000 0000 0000 0000 0000 0000  ................

A few things may immediately be observed:

mycls1_mycls2_mycls3.mat:

00000000: 0300 0000 0600 0000 5800 0000 9800 0000  ........X.......
00000010: 9800 0000 f800 0000 3001 0000 5001 0000  ........0...P...
00000020: 0000 0000 0000 0000 466f 6f31 004d 7943  ........Foo1.MyC
00000030: 6c61 7373 3100 466f 6f32 004d 7943 6c61  lass1.Foo2.MyCla
00000040: 7373 3200 466f 6f33 004d 7943 6c61 7373  ss2.Foo3.MyClass
00000050: 3300 0000 0000 0000 0000 0000 0000 0000  3...............
00000060: 0000 0000 0000 0000 0000 0000 0200 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0400 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0600 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0100 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0100 0000 0100 0000 0200 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0200 0000 0200 0000  ................
000000e0: 0300 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0300 0000 0300 0000 0000 0000 0000 0000  ................
00000100: 0100 0000 0100 0000 0100 0000 0000 0000  ................
00000110: 0100 0000 0300 0000 0100 0000 0100 0000  ................
00000120: 0100 0000 0500 0000 0100 0000 0200 0000  ................
00000130: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000140: 0000 0000 0000 0000 0000 0000 0000 0000  ................

For completeness, we end with the hexdumps of the last two files. Here is mycls123.mat:

00000000: 0300 0000 0400 0000 4800 0000 6800 0000  ........H...h...
00000010: 6800 0000 9800 0000 c800 0000 d800 0000  h...............
00000020: 0000 0000 0000 0000 466f 6f31 0046 6f6f  ........Foo1.Foo
00000030: 3200 466f 6f33 004d 7943 6c61 7373 3132  2.Foo3.MyClass12
00000040: 3300 0000 0000 0000 0000 0000 0000 0000  3...............
00000050: 0000 0000 0000 0000 0000 0000 0400 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0100 0000 0000 0000 0000 0000 0000 0000  ................
00000090: 0100 0000 0100 0000 0000 0000 0000 0000  ................
000000a0: 0300 0000 0100 0000 0100 0000 0000 0000  ................
000000b0: 0200 0000 0100 0000 0100 0000 0300 0000  ................
000000c0: 0100 0000 0200 0000 0000 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000                      ........

And finally, mycls1(mycls2(mycls3)).mat:

00000000: 0300 0000 0600 0000 5800 0000 9800 0000  ........X.......
00000010: 9800 0000 f800 0000 3001 0000 5001 0000  ........0...P...
00000020: 0000 0000 0000 0000 466f 6f31 0046 6f6f  ........Foo1.Foo
00000030: 3200 466f 6f33 004d 7943 6c61 7373 3300  2.Foo3.MyClass3.
00000040: 4d79 436c 6173 7332 004d 7943 6c61 7373  MyClass2.MyClass
00000050: 3100 0000 0000 0000 0000 0000 0000 0000  1...............
00000060: 0000 0000 0000 0000 0000 0000 0400 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0500 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0600 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0300 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0100 0000 0300 0000 0200 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0200 0000 0200 0000  ................
000000e0: 0100 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0300 0000 0100 0000 0000 0000 0000 0000  ................
00000100: 0100 0000 0100 0000 0100 0000 0200 0000  ................
00000110: 0100 0000 0200 0000 0100 0000 0100 0000  ................
00000120: 0100 0000 0300 0000 0100 0000 0000 0000  ................
00000130: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000140: 0000 0000 0000 0000 0000 0000 0000 0000  ................

We know the meaning of the green and blue components, and we defer our discussion on the cyan and yellow components to the next section, so let us now take a closer look at the red component. Below I’ve listed the numbers in the red component for each MAT-file, in decimal notation.

mycls1.mat:                   3   2  56  88  88 136 160 176
mycls1_mycls1_mycls1.mat:     3   2  56  88  88 184 240 272
mycls1_mycls2_mycls3.mat:     3   6  88 152 152 248 304 336
mycls123.mat:                 3   4  72 104 104 152 200 216
mycls1(mycls2(mycls3)).mat:   3   6  88 152 152 248 304 336

Here’s what we may conclude.

Based on the last point, upon closer inspection we see that the last five numbers are all length measures of /#refs#/b. Specifically:

The fact that the fourth and fifth numbers are the same implies that there is another hidden component which happens to be empty for all objects we have thus far considered. This is indeed the case, as we’ll see in a future part.
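The header numbers can be cross-checked against the dataset sizes from the earlier table. The sketch below treats the third to eighth numbers as byte offsets and takes their differences. The function name `segment_lengths` is made up for this illustration, and the interpretation as offsets is the working hypothesis being tested rather than an established fact.

```julia
# Header numbers as listed above, one entry per MAT-file.
headers = [
    "mycls1"                 => [3, 2, 56, 88, 88, 136, 160, 176],
    "mycls1_mycls1_mycls1"   => [3, 2, 56, 88, 88, 184, 240, 272],
    "mycls1_mycls2_mycls3"   => [3, 6, 88, 152, 152, 248, 304, 336],
    "mycls123"               => [3, 4, 72, 104, 104, 152, 200, 216],
    "mycls1(mycls2(mycls3))" => [3, 6, 88, 152, 152, 248, 304, 336],
]

# Sizes of /#refs#/b in bytes, taken from the dataset table (same order).
blob_sizes = [176, 272, 336, 216, 336]

# Differences between consecutive offsets, i.e. the length of each segment.
segment_lengths(header) = diff(header[3:8])

for ((file, header), size_in_table) in zip(headers, blob_sizes)
    println(rpad(file, 24), segment_lengths(header), "  ends at ", header[end],
            " (dataset has ", size_in_table, " bytes)")
end
```

In all five files the last number equals the size of /#refs#/b, and the segment between the fourth and fifth offsets has length zero, which is consistent with an empty component at that position.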

The remaining part of /#refs#/b

We have yet to discuss the behaviour of the cyan and yellow components. These components are the most complex, and they basically encode how the classes and fields relate to each other.

Since our objects are very small, all nonzero numbers in the cyan and yellow components can be described by a single digit. Thanks to this, we can give a very compact representation of their contents, allowing us to juxtapose the components of the five MAT-files. Here are the results:

mycls1.mat:                 100011               00 1-110            
mycls1_mycls1_mycls1.mat:   100011 100022 100033 00 1-110 1-111 1-112
mycls1_mycls2_mycls3.mat:   100011 200022 300033 00 1-110 1-311 1-512
mycls123.mat:               100011               00 3-110-211-312    
mycls1(mycls2(mycls3)).mat: 300013 200022 100031 00 1-112 1-211 1-310

To make your life easier, I have added suggestive spacings and dashes to reveal how the cyan and yellow components are organised.

Here’s what we can tell by looking at this data. First, two general points.

Now focus specifically on the cyan component.
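One way to read the cyan component is as a list of six-number entries, one per saved object. The sketch below parses the compact strings from the table into such entries. The field names `class_id`, `unused`, `block_id` and `sixth` are hypothetical. That the first slot is a class index and the fifth selects a block in the yellow component are inferences from these five files, and the sixth slot is left uninterpreted.

```julia
# Compact cyan strings copied from the table above (one digit per UInt32).
cyan = [
    "mycls1"                 => "100011",
    "mycls1_mycls1_mycls1"   => "100011 100022 100033",
    "mycls1_mycls2_mycls3"   => "100011 200022 300033",
    "mycls123"               => "100011",
    "mycls1(mycls2(mycls3))" => "300013 200022 100031",
]

# Hypothetical layout of one entry: class index, three zeros, then two numbers.
function parse_object_entry(group::AbstractString)
    d = parse.(Int, collect(group))
    length(d) == 6 || throw(ArgumentError("expected 6 digits, got $(length(d))"))
    return (class_id = d[1], unused = (d[2], d[3], d[4]), block_id = d[5], sixth = d[6])
end

for (file, groups) in cyan
    entries = parse_object_entry.(split(groups))
    println(rpad(file, 24), [(e.class_id, e.block_id, e.sixth) for e in entries])
end
```

The number of entries equals the number of saved objects in each file, and the class index of the first entry agrees with the sixth metadata entry of /x1 (1 in the first four files, 3 in the nested one).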

With this figured out, we turn to the yellow component.
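The dashes in the compact notation suggest that the yellow component is a sequence of blocks, each consisting of a count followed by that many triples. Here is a small sketch that parses the raw UInt32 values under that assumption. The helper `parse_property_blocks` and its `leading` parameter (the two zeros at the start) are hypothetical, and the sketch makes no attempt to handle any extra zeros that other files might contain.

```julia
# Parse a yellow component given as a flat list of UInt32 values.
# Assumed layout: `leading` zeros, then repeated [count, (a, b, c) * count].
function parse_property_blocks(words::AbstractVector{<:Integer}; leading::Int = 2)
    blocks = Vector{Vector{NTuple{3,Int}}}()
    i = leading + 1
    while i <= length(words)
        n = Int(words[i])
        triples = NTuple{3,Int}[
            (Int(words[i + 3k - 2]), Int(words[i + 3k - 1]), Int(words[i + 3k]))
            for k in 1:n
        ]
        push!(blocks, triples)
        i += 1 + 3n
    end
    return blocks
end

# Yellow components read from the hexdumps (offsets 152..200 and 248..304).
yellow_mycls123 = [0, 0, 3, 1, 1, 0, 2, 1, 1, 3, 1, 2]
yellow_nested   = [0, 0, 1, 1, 1, 2, 1, 2, 1, 1, 1, 3, 1, 0]

# Names of mycls123.mat in the order in which they appear in the dump.
name_table = ["Foo1", "Foo2", "Foo3", "MyClass123"]

for (name_index, flag, value_index) in parse_property_blocks(yellow_mycls123)[1]
    println(name_table[name_index], ": flag ", flag, ", value index ", value_index)
end
println(parse_property_blocks(yellow_nested))
```

For mycls123.mat the first number of each triple, used as a one-based index into the names, yields Foo1, Foo2 and Foo3 in turn, while the last number counts 0, 1, 2. That matches the order of the values 9001.0, 9002.0 and 9003.0 in /#refs#/c, /#refs#/d and /#refs#/e.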

This brings us to the end of our analysis. Although we haven’t yet figured out the meaning of each and every byte, the information presented in this post is enough to reconstruct objects that are similar to the ones analysed here.
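As a sanity check on that claim, the following sketch rebuilds the nested object from mycls1(mycls2(mycls3)).mat using small tables transcribed by hand from the dumps. All type and function names (`ObjectRef`, `RebuiltObject`, `rebuild`, `render`) are invented for this illustration. The sketch assumes that the value index in a property triple counts the value datasets from /#refs#/c onwards, starting at zero, and that the fifth number of an object entry selects its property block.

```julia
# Tables transcribed by hand from the dumps of mycls1(mycls2(mycls3)).mat.
name_table   = ["Foo1", "Foo2", "Foo3", "MyClass3", "MyClass2", "MyClass1"]
class_names  = [4, 5, 6]                 # class id -> index into name_table
object_table = [(3, 1), (2, 2), (1, 3)]  # object id -> (class id, property block id)
# property block id -> list of (name index, flag, zero-based value index)
property_blocks = [[(1, 1, 2)], [(2, 1, 1)], [(3, 1, 0)]]

struct ObjectRef          # stands in for a 6×1 metadata vector
    object_id::Int
end

struct RebuiltObject
    class::String
    fields::Vector{Pair{String,Any}}
end

# Assumed order of the value datasets: /#refs#/c, /#refs#/d, /#refs#/e.
values_in_file = Any[9001.0, ObjectRef(3), ObjectRef(2)]

function rebuild(object_id::Int)
    class_id, block_id = object_table[object_id]
    fields = Pair{String,Any}[]
    for (name_index, _flag, value_index) in property_blocks[block_id]
        value = values_in_file[value_index + 1]
        if value isa ObjectRef
            value = rebuild(value.object_id)   # follow the reference recursively
        end
        push!(fields, name_table[name_index] => value)
    end
    return RebuiltObject(name_table[class_names[class_id]], fields)
end

render(value) = string(value)
render(obj::RebuiltObject) =
    obj.class * "(" * join(("$(k) = $(render(v))" for (k, v) in obj.fields), ", ") * ")"

println(render(rebuild(1)))   # /x1 refers to object 1
```

With these tables the sketch prints MyClass1(Foo1 = MyClass2(Foo2 = MyClass3(Foo3 = 9001.0))), which is the object that was saved. The flag in each triple is ignored here, since it equals 1 throughout the five files.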

In the next post, we will take a look at classes with attributes. This will uncover some new phenomena that we haven’t yet observed in our simple classes.

Footnotes

  1. For what it’s worth, if a name is used both as a field name and a class name, it shows up only once in the green component, and it is still listed in the blue component as a class name.
  2. We’ll find exceptions to this observation in a future post.
  3. Initially, I interpreted these stray zeros in the yellow component as separators. This led me astray for a pretty long time.