The Scientific Artist

Graduation Project

Cross-Modal Translation Pipeline

Cross-Modal Translation with β-VAE

In an increasingly multimodal world, AI systems must understand and translate between sensory inputs like video and audio. For my Master’s thesis in Media Technology, I tackled the challenge of symbol grounding (linking abstract symbols to real-world meaning) and compositional generalization (applying learned concepts to unseen scenarios) by leveraging structured latent representations. Using a β-Variational Autoencoder (β-VAE) and a Sequence-to-Sequence (Seq2Seq) model, I developed a framework to map video features to their corresponding audio descriptions, enabling systems to generalize beyond training data.

Key Contributions:

  • Demonstrated that β-VAE-learned latent representations significantly improve cross-modal alignment and generalization.
  • Identified gradual β-factor ramp-up as the optimal strategy to balance reconstruction accuracy and feature disentanglement.
  • Achieved a 97% reduction in cosine distance compared to baseline models, enabling robust translation of unseen object-action pairs.

Methodology
  1. Structured Representation Learning

    • Trained a β-VAE to encode video and audio features into a shared 16-dimensional latent space.

    • Explored 9 β-scheduling strategies, with gradual ramp-up yielding the best trade-off between disentanglement and reconstruction (a minimal sketch of this ramp-up appears after this list).

  2. Cross-Modal Translation

    • Fed disentangled latent vectors into a modified Seq2Seq model to map video sequences to audio descriptions (see the Seq2Seq sketch after this list).

    • Evaluated performance using cosine distance, MSE, and generalization tests on unseen object-action combinations.

  3. Dataset

    • Used a custom dataset of 36,000 video-audio pairs (objects: pen, phone, spoon, knife, fork; actions: left, right, up, down, rotate).
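
The β-scheduling result from step 1 is easiest to see in code. The sketch below is a minimal, illustrative PyTorch version rather than the thesis implementation: the layer sizes, the input feature dimensionality, the β maximum, and the ramp length are assumptions; only the 16-dimensional latent space and the gradual linear ramp-up of β come from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    """Minimal beta-VAE: feature vector -> 16-d latent -> reconstruction."""
    def __init__(self, feat_dim=512, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def beta_schedule(epoch, beta_max=4.0, ramp_epochs=50):
    """Gradual ramp-up: beta grows linearly from 0 to beta_max, then stays flat."""
    return beta_max * min(1.0, epoch / ramp_epochs)

def beta_vae_loss(x, recon, mu, logvar, beta):
    recon_loss = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl  # beta trades reconstruction vs. disentanglement

# Example training step (shapes are illustrative):
model = BetaVAE()
x = torch.randn(32, 512)                      # a batch of pooled features
recon, mu, logvar = model(x)
loss = beta_vae_loss(x, recon, mu, logvar, beta=beta_schedule(epoch=10))
loss.backward()
```

Under such a schedule, early epochs behave almost like a plain autoencoder (β ≈ 0, so reconstruction dominates), and the KL pressure that encourages disentanglement is introduced only gradually, which reflects the reconstruction/disentanglement trade-off noted above.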

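Step 2 then feeds per-frame latent vectors into a sequence-to-sequence model. The following sketch shows one plausible shape of that mapping, assuming a GRU encoder-decoder trained with teacher forcing; the hidden size, the choice of GRUs, and the audio feature dimensionality are illustrative assumptions, not the modified Seq2Seq architecture used in the thesis.

```python
import torch
import torch.nn as nn

class LatentSeq2Seq(nn.Module):
    """Map a sequence of 16-d video latents to a sequence of audio feature vectors."""
    def __init__(self, latent_dim=16, audio_dim=128, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(latent_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, audio_dim)

    def forward(self, video_latents, audio_targets):
        # video_latents: (batch, T_video, latent_dim)
        # audio_targets: (batch, T_audio, audio_dim)
        _, state = self.encoder(video_latents)          # summarize the video sequence
        # Teacher forcing: shift targets right, start decoding from a zero frame.
        start = torch.zeros_like(audio_targets[:, :1])
        dec_in = torch.cat([start, audio_targets[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, state)
        return self.out(dec_out)                        # predicted audio features
```
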
Results
  • Symbol Grounding: Achieved a cosine distance of 0.17 (vs. baseline’s 50.12), indicating near-perfect alignment between predicted and actual audio.

  • Compositional Generalization: Tested on unseen object-action pairs (e.g., “rotate the knife”), the model maintained stable performance:

    • Cosine distance ≤ 8.88 across all held-out objects (vs. baseline’s 59.59–66.28).

    • Test loss reduced by 99% compared to the baseline.
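
For reference, the cosine-distance figures above compare predicted and ground-truth audio feature sequences. The snippet below shows one way such a score could be computed; treating it as per-frame cosine distance (1 − cosine similarity) summed over the sequence is an assumption made here to illustrate the metric, not a statement of how the thesis aggregates it.

```python
import torch
import torch.nn.functional as F

def cosine_distance(pred, target):
    """Per-frame cosine distance (1 - cosine similarity), summed over the sequence.
    pred, target: tensors of shape (T, feat_dim). Summing over frames is an assumption."""
    sim = F.cosine_similarity(pred, target, dim=-1)   # (T,)
    return (1.0 - sim).sum().item()
```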

This work bridges the gap between multimodal learning and real-world adaptability. Applications include:

  • Robotics: Enabling robots to interpret dynamic sensory inputs (e.g., translating visual actions to verbal commands).
  • Human-Computer Interaction: Enabling systems such as virtual assistants to handle novel user requests.

Thesis Link: Read the Full Thesis
Code Repository: GitHub