Presentation
The original talk was given by Yi-Hao Peng at the ACM ASSETS 2021 Virtual Conference.
The contents below include: 1) each representative slide frame with its corresponding alt-text summary descriptions, and 2) transcripts for each slide.
Slide 1
Hi, I’m Yi-Hao, and today I will share our project called Slidecho, a system that supports flexible non-visual exploration of presentation videos.
Slide 2
People use videos to convey information in courses, conference talks, job training, and more. Unfortunately, people often upload these videos without the slides.
Slide 3
The videos alone are not accessible to blind and visually impaired viewers, because many speakers do not describe their slides.
Slide 4
Consider watching this TED talk without seeing the slides.
(audio played from a TED talk: "When I gave up sugar for 30 days, day 31 looked like this.")
Slide 5
Here, we missed out when the speaker referenced the large pile of chocolate on his slide.
So, what can we do to make these videos accessible?
Slide 6
The two options are to create audio descriptions or provide the accessible slides. An audio description provides narration of important visual content.
Slide 7
And here is an example. (Same TED video played: "When I gave up sugar for 30 days, day 31 looked like this." Followed by the audio description: "A large pile of chocolates shown on the floor.")
Slide 8
But, the recorded description might leave out visual content.
Slide 9
Accessible slides on the other hand provide information for all of the visual content.
But they’re not synchronized to the talk, so when you hear “day 31 looked like this”, you’d need to read through all the slides until you find the right one.
Slide 10
So our goal is to make online videos accessible through flexible and synchronous access to all slide content.
Slide 11
To achieve this goal, we created Slidecho.
Slidecho takes a presentation video as input, and automatically extracts the slides to make the text and image elements screen-reader accessible.
Then, Slidecho aligns the speaker’s narration to the slide elements, making it easy to access undescribed information on demand.
Slide 12
Now, watching the TED talk with Slidecho, users can easily pause the video to obtain more information about the slide.
(Same TED talk video played in the Slidecho interface. The speaker said: "When I gave up sugar for 30 days, day 31 looked like this."
The user then navigated the interface with a screen reader and received the following audio feedback: "Heading level 5: Undescribed slide regions. Image, A stack of chocolate scattered on the floor.")
Slide 13
Slidecho is powered by several algorithmic methods.
Slide 14
The first step is to extract slide frames.
Slide 15
To do this, we detect shot boundaries, then remove shots that do not contain slides. Some slides include animation, so we select only the last frame of each slide.
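As a rough illustration of how such a step could be implemented, the sketch below detects shot boundaries with frame-histogram differencing and keeps only the last frame of each shot. The diff_threshold value and the looks_like_slide filter are placeholder assumptions, not the method used in the paper.

```python
# Illustrative sketch only: detect shot boundaries by comparing frame
# histograms, then keep the last frame of each shot so animated builds
# resolve to the final slide state. Non-slide shots are filtered by a
# placeholder predicate.
import cv2

def looks_like_slide(frame):
    # Hypothetical filter; a real system might use a classifier or text density.
    return True

def extract_slide_frames(video_path, diff_threshold=0.4):
    cap = cv2.VideoCapture(video_path)
    slide_frames, prev_hist, last_frame = [], None, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None and cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > diff_threshold:
            # Shot boundary: the previous frame is the last frame of that shot.
            if looks_like_slide(last_frame):
                slide_frames.append(last_frame)
        prev_hist, last_frame = hist, frame
    if last_frame is not None and looks_like_slide(last_frame):
        slide_frames.append(last_frame)  # last frame of the final shot
    cap.release()
    return slide_frames
```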
Slide 16
Slidecho displays the detected slide boundaries on the video timeline.
As the user plays the presentation video, Slidecho updates the slide pane to show the current slide number and plays an audio notification to let users know the slide has changed.
(Video played. Speaker said: "When writing an outcome, be clear about what the students will need to learn and how you expect them to demonstrate this." System notification said: "Slide 6." Speaker then said: "Effective learning outcomes are what's called SMART.")
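To illustrate how the slide pane can stay in sync with playback, here is a hypothetical lookup from the current playback time to the detected slide boundaries; the function and data layout are assumptions for illustration, not Slidecho's actual code.

```python
# Illustrative sketch: given sorted slide start times (in seconds), return the
# index of the slide being shown at the current playback time.
import bisect

def current_slide_index(slide_start_times, playback_time):
    return max(bisect.bisect_right(slide_start_times, playback_time) - 1, 0)

# Example: boundaries at 0s, 42s, and 95s; at t=50s we are on the second slide.
print(current_slide_index([0.0, 42.0, 95.0], 50.0))  # -> 1
```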
Slide 17
Slidecho then extracts the text and image elements for each slide.
Slide 18
To do this, we extract text elements with optical character recognition and identify image bounding boxes.
We obtain a description for each image using Microsoft’s scene detection service, and speakers can edit the results with edit mode.
We group text elements into lists as necessary.
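As a rough sketch of this extraction step, assuming pytesseract for OCR and a placeholder in place of the image-captioning service (both are assumptions for illustration, not the exact pipeline from the talk):

```python
# Illustrative sketch: OCR the slide image into word-level text elements with
# bounding boxes; image captioning is stubbed out with a placeholder.
import pytesseract
from pytesseract import Output

def extract_text_elements(slide_image):
    data = pytesseract.image_to_data(slide_image, output_type=Output.DICT)
    elements = []
    for text, left, top, width, height, conf in zip(
            data["text"], data["left"], data["top"],
            data["width"], data["height"], data["conf"]):
        if text.strip() and float(conf) > 0:  # drop empty or low-confidence boxes
            elements.append({"text": text, "box": (left, top, width, height)})
    return elements

def describe_image_region(region_image):
    # Placeholder for an automatic image-captioning call; in Slidecho,
    # speakers can correct these descriptions in edit mode.
    return "automatically generated description"
```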
Slide 19
Here, a user navigates through Slidecho’s extracted elements in the slide pane.
(Video played with audio from a screen reader: "Heading level 5: slide 6 (current slide). Title, effective learning outcomes are SMART.
List, 5 items. Bullet, specific. Bullet, measurable. Bullet, attainable. Bullet, relevant. Bullet, time-bound. End of list. Image: arrow. End of current slide, group.")
Slide 20
So far, Slidecho extracts all of the slide elements, but many of these elements are fully redundant with the presenter’s speech.
To help users identify important moments when there is visual content missing from the speech,
Slidecho automatically detects the described and undescribed slide elements.
Slide 21
To do this, we first tokenize the speech text into sentences.
For each sentence, we then compute the cosine similarity between the sentence embedding and each slide element embedding.
If the element’s maximum similarity score passes a threshold of 0.3, it is counted as described and linked to the sentence with the highest similarity score.
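A minimal sketch of this alignment step, assuming a sentence-transformers model for the embeddings (the model name is an assumption; the actual system may use different embeddings) and speech text that has already been split into sentences:

```python
# Illustrative sketch: mark each slide element as described or undescribed by
# its maximum cosine similarity to the speech sentences.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def align_elements(slide_elements, speech_sentences, threshold=0.3):
    elem_emb = model.encode(slide_elements, convert_to_tensor=True)
    sent_emb = model.encode(speech_sentences, convert_to_tensor=True)
    sims = util.cos_sim(elem_emb, sent_emb)  # shape: (elements, sentences)
    results = []
    for i, element in enumerate(slide_elements):
        best = int(sims[i].argmax())
        score = float(sims[i][best])
        if score >= threshold:
            results.append((element, "described", speech_sentences[best], score))
        else:
            results.append((element, "undescribed", None, score))
    return results
```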
Slide 22
For instance, the slide element “effective learning outcomes are smart” has a similarity of 0.9 with the speech sentence “Effective learning outcomes are what’s called SMART”, so this element is marked as described.
Slide 23
If the maximum score is below 0.3, then the element is counted as undescribed.
For instance, the word “Attainable” was not described in the speech and has a maximum similarity score of only 0.2.
Slide 24
Our interface shows the undescribed elements in the elements pane and plays an audio notification at the end of each slide that contains undescribed elements. I’ll play an example showing the end of a slide, the undescribed element notification, and a user navigating to the undescribed content.
Slide 25
(Video played. The speaker said: "Students should be able to accomplish this learning and demonstrating it in the way laid out in the outcome in the time frame available."
System notification: "Extra content on slide." Screen reader feedback: "Heading level 5, slide 6 (undescribed regions). List, 1 item. Bullet, attainable. End of list. Image: arrow. End of undescribed regions of current slide, group.")
Slide 26
To see how Slidecho would perform in the wild, we evaluated each component against ground-truth labels for 158 slides from 20 presentation videos.
Slide 27
Our slide boundary detection achieved a precision and recall of around 97% as it relied on existing shot detection.
Slide 28
The image segmentation and text grouping both had precision and recall around 90%. Errors most commonly occurred when Slidecho recognized extra image segments.
Slide 29
For instance, our method broke a set of icons designed to represent “a class” into individual images.
Slide 30
Finally, our slide element to sentence alignment performed with a precision of 87.4% and a recall of 82.1%. Errors happened most often when presenters’ descriptions of their images were not similar to the automated description.
Slide 31
We evaluated Slidecho through a user study with 10 blind and visually impaired participants, comparing two versions of Slidecho:
(1) Slidecho with just the extracted slides but no synchronization, versus
(2) the full Slidecho interface with audio notifications and synchronized slides.
After participants viewed each interface with a different video, they answered two questions about the video content and participated in an interview.
Slide 32
Participants answered all questions accurately with both interfaces.
But using Slidecho sync mode, they spent significantly less time to achieve the same accuracy.
Participants using Slidecho sync mode also viewed significantly fewer slide elements that were fully redundant with the speech, saving them time and effort.
Slide 33
Overall, 8 of the 10 participants ultimately preferred the full Slidecho interface because it provided granular access to information at the time it was relevant.
On the other hand, the two participants who preferred the interface without synchronization liked that they could move forward and backward through the slides as the video played. We thus included this capability in the final interface.
Slide 34
In the future, we will provide flexible non-visual access to information in more types of videos.
We’ll also explore how we can generate audio descriptions for presentation videos.
Finally, we encourage authors to release their slides along with their talks. In the future, we could align these slides to the video rather than extracting them from scratch.
Slide 35
With that, I would like to end my talk. Thank you!