[Reference CS-LVLM-EN-20231116] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection