Stream Video From Android Part 3 – Understanding h264 in mp4

The last section was tough, and it only gets tougher.

As I said, the mp4 file is streamed first, and the data needed to decode it is written afterwards. Like most file types, mp4 is built in parts. You often hear the term file header, meaning a section that explains the file’s contents. An mp4 is instead full of boxes. The boxes that describe the video might be at the beginning of the file or they might be at the end. We don’t know in advance, so we have to find out. Below is a piece of software that lets you open up the contents of an mp4 file.

Some key parts…

ftyp -> describes the basic contents

mdat -> the actual video data

avcC -> the stuff we need to decode the data
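
To make the box idea concrete, here is a minimal Python sketch of walking the top-level boxes of a file. It assumes the usual layout of a 4-byte big-endian size followed by a 4-byte type. The names iter_boxes and make_box, and the sample bytes, are made up for illustration and do not come from any library.

```python
import struct


def iter_boxes(data, offset=0, end=None):
    """Yield (type, payload_start, payload_end) for each box in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            if offset + 16 > end:
                break
            size = struct.unpack_from(">Q", data, offset + 8)[0]
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of its container.
            size = end - offset
        if size < header or offset + size > end:
            break  # truncated or corrupt box, stop walking
        yield box_type.decode("latin-1"), offset + header, offset + size
        offset += size


def make_box(box_type, payload=b""):
    """Build a tiny box for the demo below (hypothetical helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Fake file: real files are far bigger, but the layout is the same.
sample = make_box(b"ftyp", b"isom") + make_box(b"mdat", b"\x00" * 5) + make_box(b"moov")
top_level = [(name, stop - start) for name, start, stop in iter_boxes(sample)]
print(top_level)
```

Note that avcC is not a top-level box. It sits nested inside the metadata box (moov), so reaching it means calling the same walker again on the payload of each parent box.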


Parsing a video file

We are examining the h264 codec. h264 is a compression format: an h264 encoder takes images and encodes them to reduce file size. The encoded images are then wrapped up in the boxes above, inside an mp4 container.

Let’s think about this. Your camera takes a 2 MB picture. A video plays 30 frames per second, or 30 fps. A CD can hold 700 MB. So if a video were simply a series of pictures, it would need 60 MB per second, and a CD would hold only about 12 seconds of video in total.
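
The arithmetic, spelled out as a tiny Python sketch using the round numbers from the paragraph above (the variable names are mine):

```python
frame_mb = 2   # one uncompressed picture, in MB
fps = 30       # frames per second
cd_mb = 700    # capacity of a CD, in MB

mb_per_second = frame_mb * fps          # 60 MB for every second of video
seconds_on_cd = cd_mb / mb_per_second   # a little under 12 seconds

print(mb_per_second, round(seconds_on_cd, 1))
```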

Instead, the h264 codec compresses a single image and then records a series of “changes” that happen to that image. So the 30 frames in a single second of video might be one actual complete image and 29 “changes” to that image. This is the pattern below, repeated numerous times as the video plays.


Of course there is a lot more to know, but each of the above is called a slice. These slices are saved in the mdat box one after another. A slice is not the same as a frame, but sometimes it can be.

A NALU, or network abstraction layer unit, is what these slices are saved as. The NALUs are saved one after another and are separated by headers. There are two main types of headers we will be dealing with. Below they are written in hex.

Annex B -> 0x00 0x00 0x00 0x01 0x65. The first four bytes are just a start code with no data. The last byte tells you the NALU type.

These headers are simply a string of zeros and a one, plus the NALU type. The encoder makes sure this pattern never appears anywhere else in its output, by inserting an escape byte (0x03) wherever the payload would otherwise imitate a start code.
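
Finding the NALUs in an Annex B stream then comes down to searching for start codes. Here is a minimal Python sketch, assuming the whole stream is already in memory. The function split_annexb and the sample bytes are my own illustration. It searches for the three-byte form 00 00 01, which also covers the four-byte form shown above because the extra leading zero is stripped off the end of the previous NALU.

```python
START_CODE = b"\x00\x00\x01"


def split_annexb(stream):
    """Return the NALUs of an Annex B byte stream with the start codes removed."""
    nalus = []
    pos = stream.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        nxt = stream.find(START_CODE, start)
        end = len(stream) if nxt == -1 else nxt
        # Zeros right before the next start code are padding, not payload.
        nalus.append(stream[start:end].rstrip(b"\x00"))
        pos = nxt
    return nalus


# Three made-up NALUs: the payload bytes after the type byte are placeholders.
stream = (
    b"\x00\x00\x00\x01\x67\xaa\xbb"
    + b"\x00\x00\x00\x01\x68\xcc"
    + b"\x00\x00\x01\x65\x11\x22"
)
nalus = split_annexb(stream)
print([n.hex() for n in nalus])  # ['67aabb', '68cc', '651122']
```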

AVCC -> 0x00 0x02 0x4A 0x8F 0x65. The first four bytes are the length of the NALU and the last byte describes what type it is. Obviously the first four bytes change with each NALU they describe.
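
Reading the AVCC style is a matter of reading a length, then that many bytes, and repeating. A minimal Python sketch, assuming a four-byte big-endian length prefix as in the example above. The function split_avcc and the sample data are hypothetical.

```python
def split_avcc(data, length_size=4):
    """Split length-prefixed (AVCC style) data into a list of NALUs."""
    nalus = []
    pos = 0
    while pos + length_size <= len(data):
        nal_len = int.from_bytes(data[pos:pos + length_size], "big")
        pos += length_size
        if pos + nal_len > len(data):
            raise ValueError("length prefix runs past the end of the buffer")
        nalus.append(data[pos:pos + nal_len])
        pos += nal_len
    return nalus


# The four length bytes from the example above, read as one big-endian number.
example_length = int.from_bytes(bytes([0x00, 0x02, 0x4A, 0x8F]), "big")
print(example_length)  # 150159

# Two made-up NALUs with 4-byte length prefixes.
data = (3).to_bytes(4, "big") + b"\x65\x01\x02" + (2).to_bytes(4, "big") + b"\x41\x09"
parsed = split_avcc(data)
print([n.hex() for n in parsed])  # ['650102', '4109']
```

The ValueError is a design choice for the sketch: a length that points past the buffer usually means the data is truncated or the length size is wrong, and failing loudly beats decoding garbage.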

If you accidentally pad the data, for example by copying a half-filled buffer in full, you will destroy its readability by any decoder, because the padding zeros emulate the Annex B style start code. This goes for either type. Working with this data is unforgiving. (Sounds like the voice of experience here!)

Here are the different types.

Type   Description                                                     Class
0      Unspecified                                                     non-VCL
1      Coded slice of a non-IDR picture                                VCL
2      Coded slice data partition A                                    VCL
3      Coded slice data partition B                                    VCL
4      Coded slice data partition C                                    VCL
5      Coded slice of an IDR picture                                   VCL
6      Supplemental enhancement information (SEI)                      non-VCL
7      Sequence parameter set                                          non-VCL
8      Picture parameter set                                           non-VCL
9      Access unit delimiter                                           non-VCL
10     End of sequence                                                 non-VCL
11     End of stream                                                   non-VCL
12     Filler data                                                     non-VCL
13     Sequence parameter set extension                                non-VCL
14     Prefix NAL unit                                                 non-VCL
15     Subset sequence parameter set                                   non-VCL
16     Depth parameter set                                             non-VCL
17..18 Reserved                                                        non-VCL
19     Coded slice of an auxiliary coded picture without partitioning  non-VCL
20     Coded slice extension                                           non-VCL
21     Coded slice extension for depth view components                 non-VCL
22..23 Reserved                                                        non-VCL
24..31 Unspecified                                                     non-VCL
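
The type number lives in the low five bits of the byte that follows the start code or length. A minimal Python sketch of pulling that byte apart (parse_nal_header is a name I made up; the field names follow the h264 spec):

```python
def parse_nal_header(byte):
    """Split the one-byte NALU header into its three fields."""
    return {
        "forbidden_zero_bit": byte >> 7,   # top bit, must be 0
        "nal_ref_idc": (byte >> 5) & 0x3,  # next 2 bits, 0 means not used as a reference
        "nal_unit_type": byte & 0x1F,      # low 5 bits, the type from the table
    }


idr = parse_nal_header(0x65)
print(idr)  # {'forbidden_zero_bit': 0, 'nal_ref_idc': 3, 'nal_unit_type': 5}

# 0x67 and 0x68 are what you typically see for SPS and PPS.
print(parse_nal_header(0x67)["nal_unit_type"])  # 7
print(parse_nal_header(0x68)["nal_unit_type"])  # 8
```

So the 0x65 from the header examples is type 5, a coded slice of an IDR picture.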

Based on this information, expect the video data to look like this.

[size or start code][type][data payload], repeated x infinity… or it might as well be

Parsing SPS & PPS

The data in each box can also be found if you know where to look. Check out our avcC box here. I have it labeled for you, and you can see it in hex and ASCII.

Here you can find the data necessary to parse your video file, according to this chart (source: Stack Overflow). The number at the start of each row is the width of the field in bits.

8   version ( always 0x01 )
8   avc profile ( sps[0][1] )
8   avc compatibility ( sps[0][2] )
8   avc level ( sps[0][3] )
6   reserved ( all bits on )
2   NALULengthSizeMinusOne
3   reserved ( all bits on )
5   number of SPS NALUs (usually 1)
repeated once per SPS:
  16     SPS size
  variable   SPS NALU data
8   number of PPS NALUs (usually 1)
repeated once per PPS
  16    PPS size
  variable PPS NALU data

Remember the four AVCC header bytes that gave you the length? How many of them there are is described by NALULengthSizeMinusOne. The length could also be two bytes, for example. It’s “minus one” because you can only count from 0 to 3 with the two bits of space allowed, so 11 means 4 bytes and 01 means 2 bytes… a bit quirky.
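
Following the chart field by field, a parser for the avcC payload can be sketched in Python as below. This is a minimal sketch, not a hardened implementation: parse_avcc is my own name, the sample record is fabricated (its SPS and PPS bytes are placeholders, not real parameter sets), and it assumes you have already located the avcC box and cut off its size and type header.

```python
def parse_avcc(record):
    """Parse the payload of an avcC box, following the chart above."""
    if len(record) < 7:
        raise ValueError("avcC record too short")
    version, profile, compatibility, level = record[0:4]
    length_size = (record[4] & 0x03) + 1   # low 2 bits hold NALULengthSizeMinusOne
    num_sps = record[5] & 0x1F             # low 5 bits hold the SPS count
    pos = 6
    sps_list = []
    for _ in range(num_sps):
        size = int.from_bytes(record[pos:pos + 2], "big")
        pos += 2
        sps_list.append(record[pos:pos + size])
        pos += size
    num_pps = record[pos]
    pos += 1
    pps_list = []
    for _ in range(num_pps):
        size = int.from_bytes(record[pos:pos + 2], "big")
        pos += 2
        pps_list.append(record[pos:pos + size])
        pos += size
    return {
        "version": version,
        "profile": profile,
        "compatibility": compatibility,
        "level": level,
        "length_size": length_size,
        "sps": sps_list,
        "pps": pps_list,
    }


fake_sps = b"\x67\x42\x00\x1e"   # placeholder bytes, not a decodable SPS
fake_pps = b"\x68\xce"           # placeholder bytes, not a decodable PPS
record = (
    bytes([0x01, 0x42, 0x00, 0x1E, 0xFF, 0xE1])
    + len(fake_sps).to_bytes(2, "big") + fake_sps
    + bytes([0x01])
    + len(fake_pps).to_bytes(2, "big") + fake_pps
)
info = parse_avcc(record)
print(info["length_size"], info["sps"], info["pps"])
```

The length_size that comes out of this is the number of length bytes to expect in front of every NALU in the mdat box.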

Now that we understand the basic makeup of an mp4 file, let’s parse it in the next section, where we go deeper.
