CrossA11y: Identifying Video Accessibility Issues
via Cross-modal Grounding

Adobe Research
UT Austin
Our system (A) identifies accessibility issues by locating modality asymmetries between audio segments and video segments using cross-modal grounding. 
            (B) lets authors address accessibility issues using the CrossA11y interface to write captions and video descriptions,
             and (C) creates a more accessible video from the authored descriptions.


Authors make their videos visually accessible by adding audio descriptions (AD), and auditorily accessible by adding closed captions (CC). However, creating AD and CC is challenging and tedious, especially for non-professional describers and captioners, due to the difficulty of identifying accessibility problems in videos. A video author will have to watch the video through and manually check for inaccessible information frame-by-frame, for both visual and auditory modalities. In this paper, we present CrossA11y, a system that helps authors efficiently detect and address visual and auditory accessibility issues in videos. Using cross-modal grounding analysis, CrossA11y automatically measures accessibility of visual and audio segments in a video by checking for modality asymmetries. CrossA11y then displays these segments and surfaces visual and audio accessibility issues in a unified interface, making it intuitive to locate, review, script AD/CC in-place, and preview the described and captioned video immediately. We demonstrate the effectiveness of CrossA11y through a lab study with 11 participants, comparing to existing baseline.


DOI | PDF | Video | Talk | Presentation Slides

Online Demo



author = {Liu, Xingyu "Bruce" and Wang, Ruolin and Li, Dingzeyu and Chen, Xiang Anthony and Pavel, Amy},
title = {CrossA11y: Identifying Video Accessibility Issues via Cross-Modal Grounding},
year = {2022},
isbn = {9781450393201},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {},
doi = {10.1145/3526113.3545703},
booktitle = {Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology},
articleno = {43},
numpages = {14},
keywords = {video, audio description, closed caption, accessibility},
location = {Bend, OR, USA},
series = {UIST '22}}