Audio Processing and Indexing

App Demo

Scalable Tagging and Indexing

This project presents a scalable and efficient solution for automating audio annotation and feature extraction. Designed as the final project for my Audio Processing and Indexing course, the system leverages state-of-the-art deep learning techniques to streamline music information retrieval tasks. It was developed together with my team members Shreyansh Sharma, H.P. Pranaav, and Trent Eriksen.

The primary goal was to address the bottleneck in manual audio annotation by:

  • Automating Tagging: Using an adapted ConvNeXt model to generate embeddings from 10-second audio clips.
  • Efficient Indexing: Storing these high-level features in a vector database, making them readily accessible for similarity searches.
  • Scalability: Integrating with OpenSearch for rapid retrieval of labels and embeddings, even with large-scale datasets.

This approach not only speeds up the annotation process but also keeps tagging accurate while reducing the need for extensive human expert feedback.

Technologies & Methodologies
  • Deep Learning & Audio Models:
    ConvNeXt, originally designed for computer vision, was adapted to process audio clips and extract meaningful features. The model leverages the extensive AudioSet dataset to ensure diverse coverage of music genres and sounds. A sketch of this extraction step follows this list.

  • Indexing & Retrieval:
    A dual-indexing strategy is implemented using OpenSearch, which stores both the embeddings and their corresponding labels. This facilitates effective similarity search and tag validation (an index-mapping sketch also follows this list).

  • User Interface:
    A clean, interactive UI is built with Streamlit, providing users an intuitive platform to view annotations, perform searches, and interact with the dataset.

  • Containerization:
    The project is fully containerized with Docker and Docker Compose, ensuring ease of setup and consistent deployment across different systems.
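
The feature-extraction step can be pictured with a short sketch: a 10-second clip is turned into a log-mel spectrogram and passed through a single-channel ConvNeXt backbone to obtain an embedding. The backbone name (convnext_tiny via timm), the sample rate, and the mel parameters below are illustrative assumptions, not the project's exact configuration.

```python
# Rough sketch: 10-second clip -> log-mel spectrogram -> ConvNeXt embedding.
# Model choice and signal parameters are assumptions for illustration.
import torch
import torchaudio
import timm

SAMPLE_RATE = 16_000
CLIP_SECONDS = 10

def load_clip(path: str) -> torch.Tensor:
    """Load audio, mix down to mono, resample, and trim/pad to 10 seconds."""
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)                     # mono
    if sr != SAMPLE_RATE:
        waveform = torchaudio.transforms.Resample(sr, SAMPLE_RATE)(waveform)
    target = SAMPLE_RATE * CLIP_SECONDS
    waveform = torch.nn.functional.pad(waveform, (0, max(0, target - waveform.shape[-1])))
    return waveform[:, :target]

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=320, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

# num_classes=0 makes timm return the pooled feature vector instead of logits.
backbone = timm.create_model("convnext_tiny", pretrained=False,
                             in_chans=1, num_classes=0)
backbone.eval()

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    spec = to_db(mel(load_clip(path)))              # [1, n_mels, frames]
    return backbone(spec.unsqueeze(0)).squeeze(0)   # 768-dim embedding for convnext_tiny

# embedding = embed("clips/demo_track_01.wav")
```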
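
The dual-indexing side can likewise be sketched against the OpenSearch k-NN plugin: each document stores the clip's embedding as a knn_vector field next to its tags, and similarity search becomes a knn query. The index name, field names, 768-dimensional vectors, and HNSW settings here are assumptions, not the project's actual schema.

```python
# Illustrative dual index: embedding (knn_vector) and tags (keyword) in one document.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
INDEX = "audio-clips"

client.indices.create(index=INDEX, body={
    "settings": {"index": {"knn": True}},
    "mappings": {"properties": {
        "embedding": {"type": "knn_vector", "dimension": 768,
                      "method": {"name": "hnsw", "space_type": "l2", "engine": "lucene"}},
        "tags": {"type": "keyword"},
        "track": {"type": "keyword"},
    }},
})

# Index one annotated 10-second clip (vector produced by the extractor sketched above).
embedding = embed("clips/demo_track_01.wav")
client.index(index=INDEX, refresh=True, body={
    "embedding": embedding.tolist(),
    "tags": ["drum kit", "electric guitar"],
    "track": "demo_track_01",
})

# Approximate k-NN similarity search: the five clips closest to a query embedding.
hits = client.search(index=INDEX, body={
    "size": 5,
    "query": {"knn": {"embedding": {"vector": embedding.tolist(), "k": 5}}},
})["hits"]["hits"]
for hit in hits:
    print(hit["_source"]["track"], hit["_source"]["tags"], hit["_score"])
```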

How It Works
  1. Tagging & Feature Extraction:
    The system processes 10-second audio segments using the ConvNeXt model to generate embeddings. These embeddings are paired with tags derived from the AudioSet dataset.

  2. Indexing:
    The generated features and tags are indexed in a vector database via OpenSearch. This dual-indexing enables not only fast retrieval but also robust semantic similarity searches.

  3. User Interaction:
    The Streamlit-based interface (a sketch of it follows this list) allows users to:

    • Upload and annotate audio files.

    • Visualize spectrograms and tag overlays.

    • Execute similarity searches to find related audio segments quickly.

  4. Validation & Experimentation:
    By comparing automated tags with human-verified annotations, the system demonstrates high accuracy, particularly on tracks with clear rhythmic and instrumental cues. A small scoring sketch of this comparison also follows the list.
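
A minimal Streamlit sketch of this interaction flow is shown below. It assumes the embed, load_clip, mel, and to_db helpers and the OpenSearch client from the earlier sketches are importable; the layout and field names are illustrative rather than the app's actual code.

```python
# Minimal Streamlit flow: upload a clip, show its mel spectrogram, search for neighbours.
import tempfile
import matplotlib.pyplot as plt
import streamlit as st

st.title("Audio Tagging & Indexing")

uploaded = st.file_uploader("Upload a 10-second audio clip", type=["wav", "mp3"])
if uploaded is not None:
    st.audio(uploaded)

    # Persist the upload so the file-based extractor can read it.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(uploaded.getvalue())
        clip_path = tmp.name

    # Spectrogram view (reusing the mel front end from the extraction sketch).
    spec = to_db(mel(load_clip(clip_path))).squeeze(0).numpy()
    fig, ax = plt.subplots()
    ax.imshow(spec, origin="lower", aspect="auto")
    ax.set_xlabel("frames")
    ax.set_ylabel("mel bins")
    st.pyplot(fig)

    # Similarity search against the OpenSearch index from the indexing sketch.
    if st.button("Find similar clips"):
        vector = embed(clip_path)
        hits = client.search(index="audio-clips", body={
            "size": 5,
            "query": {"knn": {"embedding": {"vector": vector.tolist(), "k": 5}}},
        })["hits"]["hits"]
        for hit in hits:
            src = hit["_source"]
            st.write(f'{src["track"]} | {", ".join(src["tags"])} | score {hit["_score"]:.3f}')
```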
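
The comparison itself can be reduced to a small per-clip scoring helper; the tag sets below are invented purely to show the arithmetic.

```python
# Compare automated tags with human-verified annotations for one clip.
def tag_scores(predicted: set[str], reference: set[str]) -> dict[str, float]:
    overlap = predicted & reference
    union = predicted | reference
    return {
        "precision": len(overlap) / len(predicted) if predicted else 0.0,
        "recall": len(overlap) / len(reference) if reference else 0.0,
        "jaccard": len(overlap) / len(union) if union else 0.0,
    }

predicted = {"drum kit", "electric guitar", "reverb"}        # automated tags
reference = {"drum kit", "electric guitar", "bass guitar"}   # human-verified tags
print(tag_scores(predicted, reference))   # precision ~0.67, recall ~0.67, jaccard 0.5
```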

What We Achieved
  • Accuracy:
    The system successfully identified consistent musical elements such as steady drum beats and guitar patterns, and even distinguished nuanced effects like reverb.

  • Robust Performance:
    Demonstrated effectiveness with various audio files—from pure instrumentals to tracks that blend sound design with music.

  • Scalable Architecture:
    Designed with future expansion in mind, the architecture supports zero-shot extraction for custom tags and rapid adaptation to larger datasets.

This project exemplifies the fusion of modern deep learning with practical music annotation challenges, offering a powerful tool for efficient audio processing and indexing.

Detailed instructions, setup files, project documentation, and additional demo screenshots are available on the project’s GitHub repository.