The last section was tough, it only get tougher.
As I said, the mp4 file is streamed and then data to decode the file is written after. So like most file types mp4 is constructed in parts. Frequently you hear the term file header which is a section that explain the files contents. With mp4 its full of boxes. These boxes might be at the beginning or they might be at the end. We don’t know and we have to find out. Below is a software that allows you to open up the contents of an mp4 file.
Some key parts…
fytp -> decsribes basic contents
mdat -> the actual video data
avCC -> the stuff we need to decode the data
Parsing a video file
We are examining the h264 codec. h264 is a software that takes images and encodes them to reduce file size. These images are then wrapped up in the above boxes and into an mp4 container.
Lets think about this, you camera takes a 2mb picture. A video plays 30 frames per second or 30fps. A cd can hold 700mb. So if a video was simply a series of pictures a DVD would only contain 60mb per second or 12 seconds of video total.
Instead the h264 codec compresses a single image then records a series of “changes” that happen to the image. So 30 frames of a single second of video might be one actual complete image and 29 “changes” to that image. This is the pattern below repeated numerous times as video plays.
Image->change->->change->change->change->Image->change->->change->change->change
Of course there is a lot more to know but each of the above is called a slice. These slices are saved in that mdat box one after another. A slice is not the same as a frame but sometimes it can be.
NALU or network abstraction layer unit is what these slices are saved as. These nalus are saved on after another and are separated by headers. There are two main types of headers we will be dealing with. Below the are written in HEX
AnnexB -> 0x00 0x00 0x00 0x01 0x65 The last tells you what type. The first four is just a startcode with no data
These headers are simply a string of zeros and a one plus the nalu type. The video codec makes sure there are no other instances where this format can be found in the data output.
Avcc -> 0x00 0x02 0x4A 0x8F 0x65 The first four are the length the last described what type it is.Obviously the first four change with each data it represents.
If you make the accidental mistake of padding some data by copying a half filled buffer you will destroy your data’s readability by any decoder because you emulate the annex-b style start code. This goes for either type. Working with this data is unforgiving. (Sound like the voice of experience here!)
Here are the different types.
0 Unspecified non-VCL
1 Coded slice of a non-IDR picture VCL
2 Coded slice data partition A VCL
3 Coded slice data partition B VCL
4 Coded slice data partition C VCL
5 Coded slice of an IDR picture VCL
6 Supplemental enhancement information (SEI) non-VCL
7 Sequence parameter set non-VCL
8 Picture parameter set non-VCL
9 Access unit delimiter non-VCL
10 End of sequence non-VCL
11 End of stream non-VCL
12 Filler data non-VCL
13 Sequence parameter set extension non-VCL
14 Prefix NAL unit non-VCL
15 Subset sequence parameter set non-VCL
16 Depth parameter set non-VCL
17..18 Reserved non-VCL
19 Coded slice of an auxiliary coded picture without partitioning non-VCL
20 Coded slice extension non-VCL
21 Coded slice extension for depth view components non-VCL
22..23 Reserved non-VCL
24..31 Unspecified non-VCL
Based on this information expect to see files like this.
[size or start code][type][data payload] repeated x infinity…might as well be
Parsing SPS & PPS
Data in each box can also be found if you know where to look. Check out our avcc box here. I have it labeled for you and you can see it in hex and ascii.
Here you can find the data necessary to parse you video file. According to this chart… source is stackoverflow
bits
8 version ( always 0x01 )
8 avc profile ( sps[0][1] )
8 avc compatibility ( sps[0][2] )
8 avc level ( sps[0][3] )
6 reserved ( all bits on )
2 NALULengthSizeMinusOne
3 reserved ( all bits on )
5 number of SPS NALUs (usually 1)
repeated once per SPS:
16 SPS size
variable SPS NALU data
8 number of PPS NALUs (usually 1)
repeated once per PPS
16 PPS size
variable PPS NALU data
Remember the avcc 4 header bytes that gave you the length? Those are described in NAlulengthsizeminusone. They could also be two bytes for example. Its minus one because you can only count to three with the two bits of space allowed so 11 = 4 and 01 = 2….a bit quirky.
Now we have an understanding of the basic makeup of a mp4 file lets parse it in next section. Where we go deeper.