Abstract: Leveraging powerful semantic understanding and generation capabilities, Vision-Language Pre-trained (VLP) large models have demonstrated remarkable potential in cross-modal retrieval.
Abstract: The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained alignment between visual and textual information. However, retrieving the correct video according to ...
conda create -n EDA python=3.7 conda activate EDA conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia pip install numpy ipython psutil traitlets transformers ...