The Internet contains an enormous amount of publicly available videos that we can learn from. You can watch a person make a beautiful presentation, a digital artist draw a beautiful sunset, or a Minecraft player build an intricate house. However, these videos only provide a record of what happened, not precisely how it was achieved; that is, you will not know the exact sequence of mouse movements and key presses. If we want to build large-scale foundation models in these domains as we have done in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where “action labels” are simply the next words in a sentence.
To make use of the vast amount of unlabeled video data available on the Internet, we introduce a novel yet simple semi-supervised imitation learning method: Video PreTraining (VPT). We start by collecting a small dataset from contractors in which we record not only their video but also the actions they take, which in our case are keystrokes and mouse movements. With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step of the video. Importantly, the IDM can use both past and future information to infer the action at each step. This task is much easier, and therefore requires far less data, than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to achieve it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act through behavioral cloning.
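To make the pipeline concrete, below is a minimal PyTorch sketch of the three stages: train a non-causal IDM on a small labeled dataset, use it to pseudo-label unlabeled video, and clone those pseudo-labels with a causal policy. All names, shapes, and hyperparameters here (`FrameEncoder`, `SequenceModel`, `NUM_ACTIONS`, `WINDOW`, the tiny convolutional backbone, the single discrete action head) are illustrative assumptions, not the actual VPT architecture or action space.

```python
# Hypothetical sketch of the VPT training loop (not the published implementation):
# 1) fit a non-causal inverse dynamics model (IDM) on a small labeled dataset,
# 2) pseudo-label a large corpus of unlabeled videos with the IDM,
# 3) train a causal behavioral-cloning policy on the pseudo-labeled data.

import torch
import torch.nn as nn

NUM_ACTIONS = 128   # assumed size of a discretized keyboard/mouse action space
EMBED_DIM = 256     # assumed frame-embedding size
WINDOW = 16         # assumed number of frames per training clip


class FrameEncoder(nn.Module):
    """Embeds each video frame independently (stand-in for a larger CNN backbone)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, EMBED_DIM),
        )

    def forward(self, frames):                # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.net(frames.flatten(0, 1))    # (B*T, EMBED_DIM)
        return x.view(b, t, EMBED_DIM)


class SequenceModel(nn.Module):
    """Transformer over frame embeddings. causal=False lets the IDM attend to
    past and future frames; causal=True restricts the BC policy to the past."""
    def __init__(self, causal):
        super().__init__()
        self.causal = causal
        self.encoder = FrameEncoder()
        layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(EMBED_DIM, NUM_ACTIONS)

    def forward(self, frames):
        x = self.encoder(frames)
        mask = None
        if self.causal:
            t = x.shape[1]
            mask = nn.Transformer.generate_square_subsequent_mask(t).to(x.device)
        return self.head(self.transformer(x, mask=mask))   # (B, T, NUM_ACTIONS)


def train_step(model, optimizer, frames, actions):
    """One cross-entropy step: predict the action taken at every frame."""
    logits = model(frames)
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), actions.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# --- usage sketch with random stand-in data --------------------------------
idm = SequenceModel(causal=False)       # stage 1: IDM sees past and future
policy = SequenceModel(causal=True)     # stage 3: policy sees the past only
idm_opt = torch.optim.Adam(idm.parameters(), lr=1e-4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

# small contractor dataset: frames plus recorded keystroke/mouse actions
frames = torch.randn(2, WINDOW, 3, 64, 64)
actions = torch.randint(0, NUM_ACTIONS, (2, WINDOW))
train_step(idm, idm_opt, frames, actions)

# large unlabeled web videos: the IDM produces pseudo-labels, the policy clones them
web_frames = torch.randn(2, WINDOW, 3, 64, 64)
with torch.no_grad():
    pseudo_actions = idm(web_frames).argmax(dim=-1)
train_step(policy, policy_opt, web_frames, pseudo_actions)
```

The key design point the sketch illustrates is the asymmetry between the two models: the IDM is trained with an unmasked transformer because its only job is to infer which action was taken between frames it can already see, while the policy is trained with a causal mask because at test time it must act from past observations alone.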