Salesforce AI Research Introduces “ALPRO”: A New Video-and-Language Representation Learning (Pre-Training) Framework

This article is written as a summary by Marktechpost Staff based on the research paper 'Align and Prompt: Video-and-Language Pre-training with Entity Prompts'. All credit for this research goes to the researchers of this project. Check out the paper and GitHub.


Consider how dynamic and diverse human interaction is in the real world. In a busy world, people communicate constantly, and video and language play vital, interconnected roles. Examples include the football commentary enjoyed with friends over a beer, Jeopardy questions about The Matrix, and never-before-seen recipes featured on the TV show Hell’s Kitchen.

In other words, video and language content has become ubiquitous in the digital age; it surrounds us around the clock. And people seem to have little difficulty absorbing this torrent of video and textual content.

Given the ubiquity of video and language in the real world, a fundamental scientific question arises: how can artificial intelligence systems be designed to jointly understand video content and human language?

Many practical applications require an AI model to understand both modalities simultaneously, so it is crucial to develop such a model. An example is content-based video search, which allows many Internet videos to be searched even in the absence of textual information. Another use is video categorization and recommendation, where the model can categorize videos by analyzing both video content and written descriptions. This will make it easier to find and recommend personalized videos.

Vision-language pre-training (VLP) techniques

Vision-language pre-training, or video-and-language pre-training (VLP), techniques have recently emerged as an effective way to address this AI challenge.

Using VLP approaches, neural networks are first pre-trained on a large number of web-based video-text pairs. Although some of this web data can be noisy, the networks can still acquire effective representations for downstream applications.

After pre-training, the parameters of the neural networks are used as an initialization for fine-tuning on downstream tasks.

Limits and opportunity

Despite encouraging improvements, existing VLP models are limited in several ways, including:

Cross-modal misalignment: First, the video and text embeddings are not properly aligned. Existing work models cross-modal alignment in different ways. Some works maximize the similarity between the unimodal embeddings of the same video-text pair, computed as their dot product. Other works pass the unimodal embeddings directly to the cross-modal encoder, hoping that it will capture the alignment automatically. However, because the unimodal video and text embeddings are produced by separate encoder networks, they lie in distinct feature spaces. As a result, neither approach models cross-modal alignment effectively.

Lack of fine-grained video information: Second, most visual pre-training tasks do not explicitly model fine-grained regional visual information, even though this information is essential for understanding video content. Some earlier efforts (such as ActBERT) use object detectors to create pseudo-labels as supervision: they apply Faster R-CNN to video frames to generate object labels and then supervise the pre-training model with these labels. However, the datasets used to train such detectors cover few categories; the MSCOCO object detection dataset, for example, contains fewer than a hundred distinct object classes. This severely limits the VLP model's ability to learn the wide range of object and entity concepts. In short, such VLP models suffer from inaccurate detections and a limited number of object classes.

ALPRO (ALign & PROmpt)

ALign and PROmpt (ALPRO), a new approach to video-and-language representation learning (pre-training), has been proposed to address the limitations of previous work.

ALPRO follows the “pre-train then fine-tune” paradigm used in the VLP techniques described above but overcomes their drawbacks. The approach operates on sparsely sampled video frames and achieves more effective cross-modal alignment without explicit object detectors.

The ultimate goal of the new strategy is to improve performance on downstream tasks, such as video-text retrieval and video question answering (video QA). The improved pre-training technique proposed in ALPRO yields better video-language representations, which in turn contribute to better downstream performance.

The resulting pre-trained model achieves state-of-the-art performance on four public datasets for two common tasks: video-text retrieval and video question answering. It outperforms previous work by a significant margin and is significantly more label-efficient than competing methods.


The ALPRO approach consists of two main modules: a video-language pre-training model and a prompter (see the figure above). The prompter generates soft entity labels that supervise the pre-training of the video-language model. Each module has its own video encoder (TimeSformer) and text encoder (the first six layers of BERT) to extract features from the video and text inputs. The pre-training model adds a multimodal encoder (the last six layers of BERT) to capture the interaction between the two modalities.
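To make the module layout concrete, here is a minimal sketch of how the two modules divide the encoders. It assumes a 12-layer BERT (implied by the "first six" and "last six" split); the `ModuleSpec` class and its field names are hypothetical and are not taken from the ALPRO codebase.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: a 12-layer BERT, so "first six" and "last six" cover all layers.
BERT_LAYERS = list(range(12))


@dataclass
class ModuleSpec:
    """Hypothetical description of one ALPRO module (illustration only)."""
    name: str
    video_encoder: str
    text_encoder_layers: List[int]
    # Empty for the prompter, which has no multimodal encoder.
    multimodal_encoder_layers: List[int] = field(default_factory=list)


# The prompter only needs unimodal encoders to compare crops with prompts.
prompter = ModuleSpec(
    name="prompter",
    video_encoder="TimeSformer",
    text_encoder_layers=BERT_LAYERS[:6],
)

# The pre-training model adds a multimodal encoder on top of the same layout.
pretraining_model = ModuleSpec(
    name="video-language pre-training model",
    video_encoder="TimeSformer",
    text_encoder_layers=BERT_LAYERS[:6],
    multimodal_encoder_layers=BERT_LAYERS[6:],
)

for spec in (prompter, pretraining_model):
    print(spec.name, spec.text_encoder_layers, spec.multimodal_encoder_layers)
```

The sketch only records which layers play which role; it does not load or run any model.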


Pre-training task 1: Cross-modal alignment with a video-text contrastive (VTC) loss

  • Before the features from the unimodal encoders are passed to the multimodal encoder, a video-text contrastive (VTC) loss is applied to align them. The loss encourages the video and text embeddings of positive pairs to be more similar to each other than those of negative pairs. This ensures that the multimodal encoder receives better-aligned unimodal embeddings before it models their interactions.
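The idea behind the VTC loss can be illustrated with a small, dependency-free sketch. It assumes a symmetric softmax-based contrastive loss over a batch in which the i-th video and the i-th text form the positive pair and all other combinations are negatives. The temperature value, the function names, and the toy embeddings are assumptions for illustration and are not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(row):
    # Numerically stable log-softmax over a list of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def vtc_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (video_embs[i], text_embs[i])."""
    v = [l2_normalize(x) for x in video_embs]
    t = [l2_normalize(x) for x in text_embs]
    n = len(v)
    # sim[i][j]: temperature-scaled similarity between video i and text j.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Video-to-text: each video should pick out its own text among all texts.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video: each text should pick out its own video among all videos.
    t2v = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy 2-D embeddings: in the aligned batch, pair i points in the same direction.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
misaligned_texts = [[0.0, 1.0], [1.0, 0.0]]

print("aligned loss:   ", vtc_loss(videos, aligned_texts))
print("misaligned loss:", vtc_loss(videos, misaligned_texts))
```

Minimizing this loss pulls matching video and text embeddings together and pushes mismatched ones apart, so the aligned batch scores a much lower loss than the misaligned one.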

Pre-training task 2: Prompting entity modeling (PEM) to capture fine-grained video information

  • PEM is a new visual pre-training task that improves the model's ability to capture local, regional information. PEM relies on a prompter module that generates soft pseudo-labels over up to a thousand distinct entity categories for random video crops. Given the pseudo-label as the target, the pre-training model is then asked to predict the entity categories.
  • To construct the pseudo-labels, the prompter compares the selected video crops with a list of so-called “entity prompts”. “A video of ENTITY” is an example of an entity prompt, where ENTITY is a noun that appears frequently in the pre-training corpus. More entity categories can therefore be covered simply by adding more entity prompts.
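The two bullets above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the soft label is assumed to be a softmax over crop-to-prompt similarities, the training objective is assumed to be cross-entropy against that soft label, and the embeddings are hand-written toy vectors standing in for the outputs of the prompter's video and text encoders.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(scores):
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def soft_entity_labels(crop_emb, prompt_embs, temperature=0.07):
    """Prompter side: similarity of one crop to every prompt, as a distribution."""
    c = l2_normalize(crop_emb)
    sims = [dot(c, l2_normalize(p)) / temperature for p in prompt_embs]
    return [math.exp(lp) for lp in log_softmax(sims)]


def pem_loss(predicted_logits, soft_labels):
    """Model side: cross-entropy between predicted entities and the soft label."""
    log_probs = log_softmax(predicted_logits)
    return -sum(q * lp for q, lp in zip(soft_labels, log_probs))


# Entity prompts follow the "A video of ENTITY" template from the article.
entities = ["dog", "car", "guitar"]  # hypothetical entity list
prompts = [f"A video of {e}" for e in entities]

# Toy stand-ins for encoder outputs (assumption: one vector per prompt or crop).
prompt_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
crop_emb = [0.9, 0.1, 0.0]  # a crop that mostly resembles the first prompt

labels = soft_entity_labels(crop_emb, prompt_embs)
best = max(range(len(labels)), key=lambda i: labels[i])
print("most likely prompt:", prompts[best])

# A model predicting the first entity is penalized less than one predicting the second.
print("loss (good guess):", pem_loss([5.0, 0.0, 0.0], labels))
print("loss (bad guess): ", pem_loss([0.0, 5.0, 0.0], labels))
```

Because the label space is defined by the prompt list rather than by a detector's fixed classes, adding an entity only requires appending one more prompt.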


As shown in the tables below, ALPRO achieves state-of-the-art performance on four standard video-language downstream datasets for the video-text retrieval and video question answering tasks.

ALPRO outperforms the previous best retrieval model, FiT, on the widely used MSRVTT video-text retrieval dataset.

In video QA, ALPRO achieves results comparable to VQA-T, which uses in-domain pre-training pairs tailored to QA.

ALPRO is also much more label-efficient than competing methods: it achieves its superior performance with only 5-10% of the pre-training data required by earlier methods.


Ethical considerations

  • The pre-training video-text corpus is compiled from the web, where content is usually generated by users without sufficient moderation. ALPRO may therefore be exposed to inappropriate video content or harmful text. To mitigate the problem, it is desirable to pre-train and fine-tune ALPRO on production-specific multimodal data.
  • Related to the first concern, further analysis and training should be undertaken before deploying the technology. Since the pre-training video-text corpus was acquired from the Internet, it is also susceptible to data bias. This bias can exist in object detection, text encoders or video encoders.
  • Thanks to careful optimization of the model architecture and data processing pipeline, training ALPRO requires only a moderate amount of computational resources. The total training cost is on the order of several hundred A100 GPU hours. Pre-trained models are released so that end users do not have to repeat the pre-training effort, which promotes environmentally friendly AI systems.
  • Privacy issues: The pre-training video-language data may contain identity-sensitive data, such as fingerprints. Alternative sources of pre-training data without human identities can be explored to address this issue (see, for example, the work on self-supervised pre-training without humans [2]). Additionally, anonymization measures can be applied when pre-processing the pre-training corpus to avoid identity leakage.


ALPRO (ALign and PROmpt) is a new video-and-language pre-training framework that provides a generic yet effective way to learn video-text representations. ALPRO adheres to the “pre-train then fine-tune” paradigm used by other VLP systems but overcomes their limitations.

ALPRO achieves state-of-the-art performance on two classic tasks, video-text retrieval and video question answering, across four public datasets, while being significantly more label-efficient than competing approaches.

Developing an AI model that can reason about video and language simultaneously is critical, as many practical applications, such as content-based video search and video categorization and recommendation, require the model to understand both modalities.