Overview

What does it mean to teach machines to listen? And how does our understanding of “listening” inform how we “tune” machine ears to listen to the world around us?

In this course, students will learn how to teach machines to listen from the ground up. We will see how design decisions in building these systems inform just what these machines are able to listen for. Beginning with fundamental audio signal processing techniques, students will learn the building blocks to go from machines that respond to simple tones to ones that recognize speech and eventually understand complex sounds in our environment. Complementing these technical exercises are readings and case studies that help contextualize this technology within a larger history of teaching machines to understand the world through sound and audio. These examples highlight our own biases and presumptions in building these systems, forcing us to ask: what is the machine listening for, and for whom?

This class will be guided primarily by academic readings and in-class/take-home programming exercises. Experience with programming is a prerequisite. This is not simply a technical programming course, however; it can also be thought of as a History of Technology or Science and Technology Studies course, using machine listening, speech recognition, voice interfaces, environmental sound classification, and audio understanding as topics to explore a techno-history that extends from pre-electronic practices of the late 19th century to our contemporary moment, with Alexa, Google Home, and Siri ever present. We will examine this technology alongside papers, articles, and scholarly writings to frame this pursuit of teaching machines to listen within a particular history and context, as though we were archeologists examining a technological artifact through the lens of the humanities, social science, and anthropology. The intention is to become better-informed technologists, equipped with technical skill, historical context, and critical design approaches to create listening machines responsibly and ethically, mitigating the risks and harms for those they listen to.

Class 1 - Introduction to Listening Machines

Class 1 Presentation: https://docs.google.com/presentation/d/1PK628ZIwQW9GWWvM42FS5txZqehg57U_QFTP6gDYDyI/edit?usp=sharing

  1. Introduction to Listening Machines
    1. In-Class
      1. First Half
        1. Introduction
        2. What this class is about
        3. What this class is not about
        4. Introduction to Sound
        5. What this class is really about
        6. Critical Technology Studies
        7. Class Goals and Objectives
        8. Make, make, make
        9. On being a guide
        10. Async Communication
        11. Office hours
        12. 1-on-1s
        13. Logistics
        14. Class Overview
        15. Demos, Examples, Theory
        16. AI Policy
      2. Second Half
        1. Audio Signal Processing
          1. Concepts
            1. Waveforms
            2. FFTs
            3. STFT
            4. Spectrograms
            5. Analyzing streams of audio
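To make these concepts concrete, here is a minimal sketch of a magnitude spectrogram built from an STFT, using only NumPy. The frame size, hop, window choice, and 440 Hz test tone are illustrative assumptions, not values prescribed by the class.

```python
import numpy as np

def stft(signal, frame_size=256, hop=128):
    """Short-time Fourier transform: window the signal, then FFT each frame."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_size] * window
                       for i in range(n_frames)])
    # One row per frame, one column per frequency bin.
    return np.fft.rfft(frames, axis=1)

# A 440 Hz test tone, one second at an 8 kHz sample rate.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

spec = np.abs(stft(tone))             # magnitude spectrogram
peak_bin = spec.mean(axis=0).argmax() # strongest frequency bin on average
peak_hz = peak_bin * sr / 256         # bin index -> frequency in Hz
```

With these parameters the frequency resolution is sr / frame_size = 31.25 Hz, so `peak_hz` lands on the bin nearest 440 Hz rather than exactly on it.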
    2. Homework
      1. Reading

        1. Sterne, Jonathan. “Is Machine Listening Listening?” Preprint, University of Massachusetts Amherst, 2022. https://doi.org/10.7275/ZEQH-EG38.

          Is_Machine_Listening_Listening?-Jonathan_Sterne.pdf

        2. Napolitano, Domenico, and Renato Grieco. “The Folded Space of Machine Listening.” SoundEffects - An Interdisciplinary Journal of Sound and Sound Experience 10, no. 1 (2021): 173–89. https://doi.org/10.7146/se.v10i1.124205.

          The_folded_space_of_machine listenig-Napolitano_Grieco.pdf

      2. Homework Assignment

        1. Programming Environment Setup
        2. Pitch detection with autocorrelation to activate something
        3. Build your own Radio Rex
        4. Naive approach: recognize “yes” vs. “no” via high-pitched sound intensity

Programming assignment: still TBD, but likely one of the options above.
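As a starting point for the pitch-detection option, here is a minimal autocorrelation sketch in NumPy. The sample rate, the 80–1000 Hz search band, and the "activate when pitch falls in a target range" rule are illustrative assumptions in the spirit of Radio Rex, not the required solution.

```python
import numpy as np

def detect_pitch(signal, sr, fmin=80, fmax=1000):
    """Estimate pitch as the lag of the strongest autocorrelation peak."""
    signal = signal - signal.mean()
    # Keep only non-negative lags of the full autocorrelation.
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min, lag_max = sr // fmax, sr // fmin
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

sr = 16000
t = np.arange(sr // 4) / sr                 # 250 ms of audio
tone = np.sin(2 * np.pi * 220 * t)          # a 220 Hz "spoken" trigger

pitch = detect_pitch(tone, sr)
# Radio Rex style: only act when the detected pitch is in a target band.
triggered = 200 < pitch < 240
```

Because the lag is an integer number of samples, the estimate is quantized; for a trigger like Radio Rex that coarse resolution is enough.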

Class 2 - Audrey - The First Speech Recognition System

Class 2 Presentation: TBD

  1. Audrey - The First Speech Recognition System
    1. In-Class
      1. First Half
        1. Homework Review
        2. Reading Discussion
      2. Second Half
        1. Introduction to Audrey
    2. Homework
      1. Reading

        1. Li, Xiaochang, and Mara Mills. “Vocal Features: From Voice Identification to Speech Recognition by Machine.” Technology and Culture 60, no. 2S (2019): S129–60. https://doi.org/10.1353/tech.2019.0066.

          Li and Mills - 2019 - Vocal Features From Voice Identification to Speec.pdf

      2. Homework Assignment

        1. Build a version of Audrey and use it as a library to make your own sound/speech recognition project
        2. Make a template database and try to get Audrey to work for you
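A heavily simplified sketch of Audrey-style template matching, assuming NumPy and using synthetic tones as stand-ins for recorded words: each "word" is reduced to an averaged, normalized magnitude spectrum, and a query is labeled by its nearest stored template. The real Audrey tracked formant-like features in analog circuitry; this is only a toy digital analogue.

```python
import numpy as np

def spectral_template(signal, frame=256, hop=128):
    """Average magnitude spectrum over frames: a crude per-word fingerprint."""
    window = np.hanning(frame)
    n = 1 + (len(signal) - frame) // hop
    frames = np.stack([signal[i * hop : i * hop + frame] * window
                       for i in range(n)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    return spec / (np.linalg.norm(spec) + 1e-9)   # unit-normalize

def recognize(signal, templates):
    """Return the label whose stored template is closest to the input."""
    feat = spectral_template(signal)
    return min(templates, key=lambda k: np.linalg.norm(templates[k] - feat))

# Stand-in "words": tones at different frequencies instead of recordings.
sr = 8000
t = np.arange(sr // 2) / sr
templates = {
    "yes": spectral_template(np.sin(2 * np.pi * 300 * t)),
    "no":  spectral_template(np.sin(2 * np.pi * 600 * t)),
}

query = np.sin(2 * np.pi * 310 * t)   # slightly detuned "yes"
result = recognize(query, templates)
```

Replacing the tones with your own recordings gives the template database the assignment asks for; real words would also need some time alignment that this sketch ignores.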

Class 3 - Speech Emotion Recognition

Class 3 Presentation: TBD

  1. Speech Emotion Recognition
    1. In-Class
      1. First Half
        1. Homework Review
        2. Reading Discussion
      2. Second Half
        1. Classical ML
        2. Speech Features
    2. Homework
      1. Reading

        1. Kang, Edward B. “On the Praxes and Politics of AI Speech Emotion Recognition.” In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’23), 455–66. New York: Association for Computing Machinery, 2023. https://doi.org/10.1145/3593013.3594011.

          On_the_Praxes_and_Politics_of_AI_Speech_Emotion_Recognition-Edward_B_Kang.pdf

      2. Homework Assignment

        1. Explore some speech emotion datasets and classical ML algorithms, and test your own recordings to determine the emotion in speech
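One possible shape for this exercise, sketched with NumPy only: hand-crafted features (energy, zero-crossing rate, spectral centroid) plus a nearest-centroid classifier standing in for the classical ML algorithms covered in class. The "excited"/"calm" labels and synthetic signals are placeholder assumptions; the actual assignment would use a speech emotion dataset and a proper ML library.

```python
import numpy as np

def features(signal, sr):
    """Tiny feature vector: energy, zero-crossing rate, spectral centroid."""
    energy = float(np.mean(signal ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal)))) / 2)
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1 / sr)
    centroid = float((freqs * spec).sum() / (spec.sum() + 1e-9))
    return np.array([energy, zcr, centroid])

def train_centroids(examples):
    """Nearest-centroid 'classical ML': average the features per label."""
    return {label: np.mean([features(s, sr) for s, sr in sigs], axis=0)
            for label, sigs in examples.items()}

def classify(signal, sr, centroids):
    f = features(signal, sr)
    return min(centroids, key=lambda k: np.linalg.norm(centroids[k] - f))

# Synthetic stand-ins: "excited" = loud high tone, "calm" = quiet low tone.
rng = np.random.default_rng(0)
sr = 8000
t = np.arange(sr) / sr
make = lambda f, a: a * np.sin(2 * np.pi * f * t) + 0.01 * rng.standard_normal(sr)

examples = {
    "excited": [(make(500, 1.0), sr), (make(520, 0.9), sr)],
    "calm":    [(make(120, 0.2), sr), (make(130, 0.25), sr)],
}
centroids = train_centroids(examples)
label = classify(make(510, 0.95), sr, centroids)
```

The point of the sketch is the pipeline, not the features: swap in MFCCs and a trained classifier and the structure stays the same.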

Class 4 - SOTA Speech-to-Text (STT) Automatic Speech Recognition Systems (ASR)

Class 4 Presentation: TBD

  1. SOTA Speech-To-Text (STT) Automatic Speech Recognition Systems (ASR)
    1. In-Class
      1. First Half
        1. Common Voice
        2. Split a spoken sentence into words - timestamp generation
        3. Modern/Contemporary ML / DL for audio
      2. Second Half
    2. Homework
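A minimal sketch of energy-based word segmentation for the timestamp-generation topic above, assuming NumPy; the frame size, hop, and threshold are arbitrary illustrative values, and real ASR systems derive timestamps from model alignments rather than raw energy.

```python
import numpy as np

def word_timestamps(signal, sr, frame=400, hop=200, threshold=0.01):
    """Return (start, end) times where short-time energy exceeds a threshold."""
    n = 1 + (len(signal) - frame) // hop
    energy = np.array([np.mean(signal[i * hop : i * hop + frame] ** 2)
                       for i in range(n)])
    active = energy > threshold
    spans, start = [], None
    for i, a in enumerate(active):
        if a and start is None:          # rising edge: a word begins
            start = i
        elif not a and start is not None:  # falling edge: a word ends
            spans.append((start * hop / sr, i * hop / sr))
            start = None
    if start is not None:                 # signal ends mid-word
        spans.append((start * hop / sr, len(signal) / sr))
    return spans

# Two synthetic "words" (0.5 s tones) separated by 0.5 s of silence.
sr = 16000
t = np.arange(sr // 2) / sr
word = np.sin(2 * np.pi * 200 * t)
silence = np.zeros(sr // 2)
utterance = np.concatenate([word, silence, word])

spans = word_timestamps(utterance, sr)
```

On real speech this naive detector splinters on pauses inside words and merges words spoken without a gap, which is part of why alignment-based timestamps won out.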

Class 5 - Non-Speech Listening

Class 5 Presentation: TBD

  1. Non-Speech Listening