Presentation
The original talk was given by Yi-Hao Peng at the ACM ASSETS 2021 Virtual Conference.
The contents below include: 1) each representative slide frame with its corresponding alt-text summary descriptions, and 2) transcripts for each slide.
Slide 1
Hi, I’m Yi-Hao, and today I will share our project called Slidecho, a system that supports flexible non-visual exploration of presentation videos.
Slide 2
People use videos to convey information in courses, conference talks, job training, and more. Unfortunately, people often upload these videos without the slides.
Slide 3
The videos alone are not accessible to blind and visually impaired viewers, because many speakers do not describe their slides.
Slide 4
Consider watching this TED talk without seeing the slides.
(audio played from a TED talk: "When I gave up sugar for 30 days, day 31 looked like this.")
Slide 5
Here, we missed out when the speaker referenced the large pile of chocolate on his slide.
So, what can we do to make these videos accessible?
Slide 6
The two options are to create audio descriptions or provide the accessible slides. An audio description provides narration of important visual content.
Slide 7
And here is an example. (Same TED video played: "When I gave up sugar for 30 days, day 31 looked like this." Followed by the audio description: "A large pile of chocolates shown on the floor.")
Slide 8
But, the recorded description might leave out visual content.
Slide 9
Accessible slides on the other hand provide information for all of the visual content.
But they’re not synchronized to the talk, so when you hear “day 31 looked like this”, you’d need to read through all the slides until you find the right one.
Slide 10
So our goal is to make online videos accessible through flexible and synchronous access to all slide content.
Slide 11
To achieve this goal, we created Slidecho.
Slidecho takes a presentation video as input, and automatically extracts the slides to make the text and image elements screen-reader accessible.
Then, Slidecho aligns the speaker’s narration to the slide elements, making it easy to access undescribed information on demand.
Slide 12
Now, watching the TED talk with Slidecho, users can easily pause the video to obtain more information about the slide.
(Same TED talk video played in the Slidecho interface. The speaker said: "When I gave up sugar for 30 days, day 31 looked like this."
The user then navigated the interface with a screen reader and received the following audio feedback: "Heading level 5: Undescribed slide regions. Image, A stack of chocolate scattered on the floor.")
Slide 13
Slidecho is powered by several algorithmic methods.
Slide 14
The first step is to extract slide frames.
Slide 15
To do this, we detect shot boundaries, then remove shots that do not contain slides. Some slides include animation, so we select only the last frame of each slide.
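As a rough illustration of how such a step could be implemented, the sketch below detects shot boundaries with frame-histogram differencing and keeps only the last frame of each shot. The diff_threshold value and the looks_like_slide filter are placeholder assumptions, not the method used in the paper.

```python
# Illustrative sketch only: detect shot boundaries by comparing frame
# histograms, then keep the last frame of each shot so animated builds
# resolve to the final slide state. Non-slide shots are filtered by a
# placeholder predicate.
import cv2

def looks_like_slide(frame):
    # Hypothetical filter; a real system might use a classifier or text density.
    return True

def extract_slide_frames(video_path, diff_threshold=0.4):
    cap = cv2.VideoCapture(video_path)
    slide_frames, prev_hist, last_frame = [], None, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None and cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > diff_threshold:
            # Shot boundary: the previous frame is the last frame of that shot.
            if looks_like_slide(last_frame):
                slide_frames.append(last_frame)
        prev_hist, last_frame = hist, frame
    if last_frame is not None and looks_like_slide(last_frame):
        slide_frames.append(last_frame)  # last frame of the final shot
    cap.release()
    return slide_frames
```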
Slide 16
Slidecho displays the detected slide boundaries on the video timeline.
As the user plays the presentation video, Slidecho updates the slide pane to show the current slide number and plays an audio notification to let users know the slide has changed.
(Video played. Speaker said: "When writing an outcome, be clear about what the students will need to learn and how you expect them to demonstrate this." System notification said: "Slide 6." Speaker then said: "Effective learning outcomes are what's called SMART.")
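To illustrate how the slide pane can stay in sync with playback, here is a hypothetical lookup from the current playback time to the detected slide boundaries; the function and data layout are assumptions for illustration, not Slidecho's actual code.

```python
# Illustrative sketch: given sorted slide start times (in seconds), return the
# index of the slide being shown at the current playback time.
import bisect

def current_slide_index(slide_start_times, playback_time):
    return max(bisect.bisect_right(slide_start_times, playback_time) - 1, 0)

# Example: boundaries at 0s, 42s, and 95s; at t=50s we are on the second slide.
print(current_slide_index([0.0, 42.0, 95.0], 50.0))  # -> 1
```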
Slide 17
Slidecho then extracts the text and image elements for each slide.
Slide 18
To do this, we extract text elements with optical character recognition and identify image bounding boxes.
We obtain a description for each image using Microsoft’s scene detection service, and speakers can edit the results with edit mode.
We group text elements into lists as necessary.
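As a rough sketch of this extraction step, assuming pytesseract for OCR and a placeholder in place of the image-captioning service (both are assumptions for illustration, not the exact pipeline from the talk):

```python
# Illustrative sketch: OCR the slide image into word-level text elements with
# bounding boxes; image captioning is stubbed out with a placeholder.
import pytesseract
from pytesseract import Output

def extract_text_elements(slide_image):
    data = pytesseract.image_to_data(slide_image, output_type=Output.DICT)
    elements = []
    for text, left, top, width, height, conf in zip(
            data["text"], data["left"], data["top"],
            data["width"], data["height"], data["conf"]):
        if text.strip() and float(conf) > 0:  # drop empty or low-confidence boxes
            elements.append({"text": text, "box": (left, top, width, height)})
    return elements

def describe_image_region(region_image):
    # Placeholder for an automatic image-captioning call; in Slidecho,
    # speakers can correct these descriptions in edit mode.
    return "automatically generated description"
```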
Slide 19
Here, a user navigates through Slidecho’s extracted elements in the slide pane.
(Video played with audio from a screen reader: "Heading level 5: slide 6 (current slide). Title, effective learning outcomes are SMART.
List, 5 items. Bullet, specific. Bullet, measurable. Bullet, attainable. Bullet, relevant. Bullet, time-bound. End of list. Image: arrow. End of current slide, group.")
Slide 20
So far, Slidecho extracts all of the slide elements, but many of these elements are fully redundant with the presenter’s speech.
To help users identify important moments when there is visual content missing from the speech,
Slidecho automatically detects the described and undescribed slide elements.
Slide 21
To do this, we first tokenize the speech text into sentences.
For each sentence, we then compute the cosine similarity between the sentence embedding and each slide element embedding.
If the element’s maximum similarity score passes a threshold of 0.3, it is counted as described and linked to the sentence with the highest similarity score.
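A minimal sketch of this alignment step, assuming a sentence-transformers model for the embeddings (the model name is an assumption; the actual system may use different embeddings) and speech text that has already been split into sentences:

```python
# Illustrative sketch: mark each slide element as described or undescribed by
# its maximum cosine similarity to the speech sentences.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def align_elements(slide_elements, speech_sentences, threshold=0.3):
    elem_emb = model.encode(slide_elements, convert_to_tensor=True)
    sent_emb = model.encode(speech_sentences, convert_to_tensor=True)
    sims = util.cos_sim(elem_emb, sent_emb)  # shape: (elements, sentences)
    results = []
    for i, element in enumerate(slide_elements):
        best = int(sims[i].argmax())
        score = float(sims[i][best])
        if score >= threshold:
            results.append((element, "described", speech_sentences[best], score))
        else:
            results.append((element, "undescribed", None, score))
    return results
```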
Slide 22
For instance, the slide element “effective learning outcomes are smart” has a similarity of 0.9 with the speech sentence “Effective learning outcomes are what’s called SMART”, so this element is marked as described.
Slide 23
If the maximum score is below 0.3, then the element is counted as undescribed.
For instance, the word “Attainable” was not described in the speech and has a maximum similarity score of only 0.2.
Slide 24
Our interface shows the undescribed elements in the elements pane and plays an audio notification at the end of each slide that contains undescribed elements. I’ll play an example showing the end of a slide, the undescribed element notification, and a user navigating to the undescribed content.
Slide 25
(Video played. The speaker said: "Students should be able to accomplish this learning and demonstrating it in the way laid out in the outcome in the time frame available."
System notification: "Extra content on slide." Screen reader feedback: "Heading level 5, slide 6 (undescribed regions). List, 1 item. Bullet, attainable. End of list. Image: arrow. End of undescribed regions of current slide, group.")
Slide 26
To see how Slidecho would perform in the wild, we evaluated each component against ground-truth labels for 158 slides from 20 presentation videos.
Slide 27
Our slide boundary detection achieved a precision and recall of around 97% as it relied on existing shot detection.
Slide 28
The image segmentation and text grouping both had precision and recall around 90%. Errors most commonly occurred when Slidecho recognized extra image segments.
Slide 29
For instance, our method broke a set of icons designed to represent “a class” into individual images.
Slide 30
Finally, our slide element to sentence alignment performed with a precision of 87.4% and a recall of 82.1%. Errors happened most often when presenters’ descriptions of their images were not similar to the automated description.
Slide 31
We evaluated Slidecho through a user study with 10 blind and visually impaired participants, comparing two versions of Slidecho:
(1) Slidecho with just the extracted slides but no synchronization, versus
(2) the full Slidecho interface with audio notifications and synchronized slides.
After participants viewed each interface with a different video, they answered two questions about the video content and participated in an interview.
Slide 32
Participants answered all questions accurately with both interfaces.
But using Slidecho sync mode, they spent significantly less time to achieve the same accuracy.
Participants using Slidecho sync mode also viewed significantly fewer slide elements that were fully redundant with the speech, saving them time and effort.
Slide 33
Overall, 8 of the 10 participants ultimately preferred the full Slidecho interface because it provided granular access to information at the time it was relevant.
On the other hand, the two participants who preferred the interface without synchronization liked that they could move forward and backward through the slides as the video played. We thus included this capability in the final interface.
Slide 34
In the future, we will provide flexible non-visual access to information in more types of videos.
We’ll also explore how we can generate audio descriptions for presentation videos.
Finally, we encourage authors to release their slides along with their talks. In the future, we could align these slides to the video rather than extracting them from scratch.
Slide 35
With that, I would like to end my talk. Thank you!