To train models on video data in neon, aeon requires the videos to be MJPEG encoded. A full example of this initial preprocessing is given in the C3D example in the neon repository, which trains the C3D model architecture on the UCF101 dataset. The preprocessing in that example is performed with the following ffmpeg command:

ffmpeg -v quiet -i $VIDPATH \
       -an -vf scale=171:128 -framerate 25 \
       -c:v mjpeg -q:v 3 \
       -f segment -segment_time 0.64 -reset_timestamps 1 \
       -segment_list ${VIDPATH%.avi}.csv \
       -segment_list_entry_prefix `dirname $VIDPATH`/ \
       -y ${VIDPATH%.avi}_%02d.avi

Breaking this command down:

  • -an disables the audio stream
  • -vf scale=171:128 scales the video frames to 171 by 128 pixels
  • -framerate 25 sets the output framerate to 25 frames per second
  • -c:v mjpeg sets the output video codec to MJPEG
  • -q:v 3 sets the output codec compression quality
  • -f segment ... splits video into equal length segments. See the ffmpeg documentation for details
  • -y overwrites the output file without prompting

Splitting the videos into equal length segments as we did here is not necessary in general for the aeon DataLoader, but is helpful for training this particular model in neon.
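As a quick sanity check, the 0.64 second segment length lines up with the model's clip length: at 25 frames per second, each segment contains exactly 16 frames, matching the max_frame_count of 16 used in the DataLoader configuration below. The arithmetic, as an illustrative sketch:

```python
# Each segment produced by -segment_time 0.64 at -framerate 25
# contains 0.64 s * 25 fps = 16 frames.
framerate = 25        # output frames per second (-framerate 25)
segment_time = 0.64   # seconds per segment (-segment_time 0.64)

frames_per_segment = int(round(framerate * segment_time))
print(frames_per_segment)  # 16
```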

Once preprocessing is complete, a manifest CSV file must be created with the absolute paths of the videos and the classification labels. For example (the paths below are illustrative):

/data/videos/clip_00.avi,/data/labels/clip_00.txt
/data/videos/clip_01.avi,/data/labels/clip_01.txt

Here the first column contains absolute paths to the preprocessed MJPEG videos and the second column contains absolute paths to label files. The label files in this case each contain a single ASCII number indicating the correct class label of that training example.
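A manifest of this form can be generated with a short script. Below is a minimal sketch; the directory layout and the pairing of each `.avi` clip with a same-named `.txt` label file are assumptions for illustration, not part of the example:

```python
import glob
import os


def write_manifest(video_dir, label_dir, manifest_path):
    """Write a two-column CSV manifest: video path, label file path.

    Assumes each preprocessed clip foo_00.avi has a corresponding label
    file foo_00.txt in label_dir; adapt the pairing to your own layout.
    """
    with open(manifest_path, 'w') as manifest:
        for video in sorted(glob.glob(os.path.join(video_dir, '*.avi'))):
            stem = os.path.splitext(os.path.basename(video))[0]
            label = os.path.join(label_dir, stem + '.txt')
            # aeon expects absolute paths in the manifest
            manifest.write('{},{}\n'.format(os.path.abspath(video),
                                            os.path.abspath(label)))
```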

Next, in our model training Python script, we create a DataLoader config dictionary as described in the user guide, but with an appropriate entry for the video options:

config = dict(type="video,label",
              video={'max_frame_count': 16,
                     'frame': {'height': 112,
                               'width': 112,
                               'scale': [0.875, 0.875]}},
              label={'binary': False})

The two currently supported options for the video configuration are:

Name                     Default     Description
max_frame_count (uint)   Required    Maximum number of frames to extract from the video. Shorter samples will be zero padded.
frame (object)           Required    An image configuration applied to each frame extracted from the video.
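For intuition, the zero padding of clips shorter than max_frame_count can be pictured as follows. This is an illustrative numpy sketch of the described behavior, not aeon's actual implementation:

```python
import numpy as np


def pad_clip(frames, max_frame_count):
    """Pad a (num_frames, height, width, channels) clip with zero-valued
    frames up to max_frame_count, truncating longer clips."""
    num_frames = frames.shape[0]
    if num_frames >= max_frame_count:
        return frames[:max_frame_count]
    pad = np.zeros((max_frame_count - num_frames,) + frames.shape[1:],
                   dtype=frames.dtype)
    return np.concatenate([frames, pad], axis=0)


clip = np.ones((10, 112, 112, 3), dtype=np.uint8)  # a 10-frame clip
padded = pad_clip(clip, 16)
print(padded.shape)  # (16, 112, 112, 3)
```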

The last step is to create the Python DataLoader object, specifying a set of transforms to apply to the input data.

import numpy as np

from neon.data.dataloader_transformers import OneHot, TypeCast
from aeon import DataLoader

# config is defined in the code above
model = ...  # neon.models.Model object
dl = DataLoader(config, model.be)  # model.be is the neon backend
dl = OneHot(dl, index=1, nclasses=101)
dl = TypeCast(dl, index=0, dtype=np.float32)
model.fit(dl, optimizer=opt, num_epochs=args.epochs, cost=cost, callbacks=callbacks)

Again, for the full example consult the complete neon C3D example in the neon repository.


Learning Spatiotemporal Features with 3D Convolutional Networks