Overview

What does it mean to teach machines to listen? And how does our understanding of “listening” inform how we “tune” machine ears to listen to the world around us?

In this course, students will learn how to teach machines to listen from the ground up. We will see how design decisions in building these systems shape just what these machines are able to listen for. Beginning with fundamental audio signal processing techniques, students will learn the building blocks to go from machines that respond to simple tones to ones that recognize speech and eventually understand complex sounds in our environment. Complementing these technical exercises are readings and case studies that contextualize this technology within a larger history of teaching machines to understand the world through sound. These examples highlight our own biases and presumptions in building these systems, forcing us to ask: what is the machine listening for, and for whom?

This class will primarily be guided by academic readings and in-class/take-home programming exercises. Experience with programming is a prerequisite. This is not simply a technical programming course, however; it can also be thought of as a History of Technology or Science and Technology Studies course, using machine listening, speech recognition, voice interfaces, environmental sound classification, and audio understanding to explore a techno-history that extends from pre-electronic practices of the late 19th century to our contemporary moment, with Alexa, Google Home, and Siri ever present. We will examine this technology alongside papers, articles, and scholarly writings to situate the pursuit of teaching machines to listen within a particular history and context, as though we were archaeologists examining a technological artifact through the lens of the humanities, social science, and anthropology. The intention is to become better-informed technologists, equipped with technical skill, historical context, and critical design approaches to create listening machines responsibly and ethically, mitigating the risks and harms to those they listen to.

Tentative Schedule

  1. Intro / Audio Signal Processing
    1. In-Class
      1. Logistics
      2. Class Overview
      3. Demos, Examples, Theory
    2. Audio Signal Processing
      1. Concepts
        1. Waveforms
        2. FFTs
        3. STFT
        4. Spectrograms (see the STFT sketch under Code Sketches below)
        5. Analyzing a stream of audio
    3. Homework
      1. Setup
      2. Reading groups
      3. Theory question
      4. Programming assignment (still TBD, but maybe one of these)
        1. Pitch detection with autocorrelation to activate something (see the autocorrelation sketch under Code Sketches)
        2. Build your own Radio Rex
        3. Naive approach (recognize “yes” vs. “no” via high-pitched sound intensity)
  2. Audrey - The First Speech Recognition System
    1. In-Class
      1. Reading: Vocal Features: From Voice Identification to Speech Recognition by Machine
      2. Audrey example
    2. Homework
      1. Make a template database and try to get Audrey to work for you (see the template-matching sketch under Code Sketches)
  3. Speech Emotion Recognition
    1. In-Class
      1. Classical ML
      2. Speech Features (see the speech emotion recognition sketch under Code Sketches)
      3. Reading: On the Praxes and Politics of AI Speech Emotion Recognition
        1. https://dl.acm.org/doi/pdf/10.1145/3593013.3594011
    2. Homework
  4. SOTA Speech-to-Text: Automatic Speech Recognition (ASR)
    1. In-Class
      1. Common Voice
      2. Split a spoken sentence into words (timestamp generation)
      3. Modern/contemporary ML/DL for audio
    2. Homework
      1. Fine-tune a pretrained model (see the ASR sketch under Code Sketches)
    3. Readings
      1. The Bitter Lesson
        1. https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf
  5. Non-Speech Listening
    1. In-Class
      1. Environmental sound classification (see the sound-tagging sketch under Code Sketches)
    2. Homework
      1. TBD
  6. Audio Understanding
    1. TBD, more state-of-the-art research on audio understanding
    2. Homework
      1. Work on Final Project
  7. Final Project Presentation
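
Code Sketches

The short sketches below illustrate techniques named in the schedule. They are minimal starting points under stated assumptions, not assigned solutions, and the library choices (NumPy, SciPy, librosa, scikit-learn, Hugging Face transformers) are assumptions about the course toolchain rather than requirements.

First, a minimal STFT/spectrogram sketch for week 1: slide a window across a signal, take an FFT of each frame, and read the frame-by-frame magnitudes as a spectrogram. The input is synthetic, and parameters such as nperseg are illustrative defaults.

```python
import numpy as np
from scipy.signal import stft

sample_rate = 16000  # Hz, assumed mono input
duration = 2.0       # seconds

# Synthetic test signal: a 440 Hz tone that jumps to 880 Hz halfway through.
t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
signal = np.where(t < duration / 2,
                  np.sin(2 * np.pi * 440 * t),
                  np.sin(2 * np.pi * 880 * t))

# Short-time Fourier transform: window the signal, take an FFT per frame.
# `freqs` are bin centers (Hz), `times` are frame centers (s), `Z` is complex.
freqs, times, Z = stft(signal, fs=sample_rate, nperseg=1024, noverlap=512)
spectrogram = np.abs(Z)  # magnitude spectrogram, shape (freq_bins, frames)

# The loudest bin per frame should track the 440 -> 880 Hz jump
# (up to the ~15.6 Hz bin resolution of a 1024-point FFT at 16 kHz).
peak_freqs = freqs[spectrogram.argmax(axis=0)]
print(peak_freqs[:3], peak_freqs[-3:])
```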
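
A sketch of the autocorrelation idea behind the first assignment option (the assignment itself is still TBD): estimate the fundamental frequency as the lag that maximizes a frame's autocorrelation, then gate an action on the estimate.

```python
import numpy as np

def detect_pitch(frame, sample_rate, fmin=80.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one audio frame."""
    frame = frame - frame.mean()
    # Autocorrelation; keep non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search only lags corresponding to plausible voice pitches.
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag = lag_min + ac[lag_min:lag_max].argmax()
    return sample_rate / best_lag

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)      # 220 Hz test tone

pitch = detect_pitch(tone[:2048], sr)   # ~219-220 Hz (lag quantization)
if 200 < pitch < 240:
    print("activate!")                  # stand-in for "activate something"
```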
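
A loose, modern analogy to Audrey's template matching, not Bell Labs' original circuitry: Audrey compared formant-frequency patterns of spoken digits against stored references, while this sketch stands in a spectral-centroid trajectory and nearest-neighbor distance. load_recording is a hypothetical helper for reading a mono waveform.

```python
import numpy as np
from scipy.signal import stft

def centroid_trajectory(signal, sample_rate, n_frames=20):
    """Spectral centroid per STFT frame, resampled to a fixed length."""
    freqs, _, Z = stft(signal, fs=sample_rate, nperseg=512)
    mag = np.abs(Z) + 1e-12
    centroids = (freqs[:, None] * mag).sum(axis=0) / mag.sum(axis=0)
    # Resample to n_frames so utterances of different lengths are comparable.
    idx = np.linspace(0, len(centroids) - 1, n_frames)
    return np.interp(idx, np.arange(len(centroids)), centroids)

def classify(signal, sample_rate, templates):
    """Label of the stored template trajectory nearest to the query."""
    query = centroid_trajectory(signal, sample_rate)
    return min(templates,
               key=lambda label: np.linalg.norm(templates[label] - query))

# Hypothetical usage, with `load_recording` as a stand-in for your own I/O:
# templates = {w: centroid_trajectory(load_recording(f"{w}.wav"), 16000)
#              for w in ["one", "two", "three"]}
# print(classify(load_recording("mystery.wav"), 16000, templates))
```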
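
For the classical-ML speech emotion recognition week, one common recipe (assumed here, not prescribed by the syllabus) is summary statistics of MFCCs fed to an SVM. The file names and labels below are placeholders for a labeled corpus.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_features(path):
    """Summarize an utterance as the mean and std of its MFCCs."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Placeholder corpus: (wav_path, emotion_label) pairs from a labeled dataset.
train = [("angry_01.wav", "angry"), ("happy_01.wav", "happy")]

X = np.stack([utterance_features(path) for path, _ in train])
y = [label for _, label in train]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict([utterance_features("test.wav")]))
```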
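
For the ASR week, one way to split a spoken sentence into timestamped words is a pretrained Whisper checkpoint through the Hugging Face transformers pipeline; the checkpoint is an assumed example, and a model fine-tuned on Common Voice data would expose the same interface.

```python
from transformers import pipeline

# Checkpoint choice is an assumption; any Whisper-style ASR model works here.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# `return_timestamps="word"` asks for per-word start/end times in seconds.
result = asr("sentence.wav", return_timestamps="word")

print(result["text"])
for chunk in result["chunks"]:          # one entry per word
    print(chunk["timestamp"], chunk["text"])
```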
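
For environmental sound classification, a pretrained tagger makes a quick baseline before training anything; the checkpoint below, an Audio Spectrogram Transformer fine-tuned on AudioSet, is one assumed option among many.

```python
from transformers import pipeline

# Checkpoint is an assumed example; any audio-classification model would do.
tagger = pipeline("audio-classification",
                  model="MIT/ast-finetuned-audioset-10-10-0.4593")

# Top label/score pairs for a clip, e.g. dog bark, siren, rain.
for pred in tagger("street_recording.wav", top_k=5):
    print(f'{pred["score"]:.2f}  {pred["label"]}')
```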

Resources