Graduation Project

Cross-Modal Translation with β-VAE
In an increasingly multimodal world, AI systems must understand sensory inputs such as video and audio and translate between them. For my Master’s thesis in Media Technology, I tackled the challenges of symbol grounding (linking abstract symbols to real-world meaning) and compositional generalization (applying learned concepts to unseen scenarios) by leveraging structured latent representations. Using a β-Variational Autoencoder (β-VAE) and a Sequence-to-Sequence (Seq2Seq) model, I developed a framework that maps video features to their corresponding audio descriptions, enabling the system to generalize beyond its training data.
Key Contributions:
- Demonstrated that β-VAE-learned latent representations significantly improve cross-modal alignment and generalization.
- Identified gradual β-factor ramp-up as the optimal strategy to balance reconstruction accuracy and feature disentanglement.
- Achieved a 97% reduction in cosine distance compared to baseline models, enabling robust translation of unseen object-action pairs.
Methodology
Structured Representation Learning
- Trained a β-VAE to encode video and audio features into a shared 16-dimensional latent space.
- Explored 9 β-scheduling strategies, with a gradual ramp-up yielding the best trade-off between disentanglement and reconstruction (see the sketch after this list).
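To make the first stage concrete, here is a minimal PyTorch-style sketch of a β-VAE with a 16-dimensional latent space and a linearly ramped β weight. PyTorch itself, the layer sizes, the 512-dimensional input features, the maximum β of 4.0, and the 50-epoch ramp are illustrative assumptions; the thesis’s exact architecture and the remaining eight β schedules are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 16     # shared latent size used in the thesis
FEATURE_DIM = 512   # per-frame feature size (hypothetical)

class BetaVAE(nn.Module):
    """Minimal beta-VAE: encodes a feature vector into a 16-d latent and reconstructs it."""
    def __init__(self, feature_dim=FEATURE_DIM, latent_dim=LATENT_DIM):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, feature_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def beta_vae_loss(x, x_hat, mu, logvar, beta):
    """Reconstruction error plus beta-weighted KL divergence."""
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

def beta_schedule(epoch, beta_max=4.0, ramp_epochs=50):
    """Gradual ramp-up: increase beta linearly, then hold it constant."""
    return beta_max * min(1.0, epoch / ramp_epochs)
```

Training starts with a small β (favouring reconstruction) and gradually shifts weight onto the KL term, which is exactly the reconstruction-versus-disentanglement trade-off the ramp-up schedule is meant to balance.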
Cross-Modal Translation
- Fed the disentangled latent vectors into a modified Seq2Seq model to map video sequences to their audio descriptions (sketched below).
- Evaluated performance using cosine distance, MSE, and generalization tests on unseen object-action combinations.
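A minimal sketch of the translation stage, assuming a GRU-based encoder-decoder that operates on sequences of 16-dimensional latent vectors; the specific Seq2Seq modifications, layer sizes, and decoding scheme used in the thesis are not shown, and the choices below are assumptions.

```python
import torch
import torch.nn as nn

class LatentSeq2Seq(nn.Module):
    """Maps a sequence of video latents to a sequence of audio latents."""
    def __init__(self, latent_dim=16, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, latent_dim)

    def forward(self, video_latents, audio_len):
        # video_latents: (batch, T_video, latent_dim)
        _, hidden = self.encoder(video_latents)
        # Decode autoregressively, starting from a zero latent vector.
        step = video_latents.new_zeros(video_latents.size(0), 1, video_latents.size(-1))
        outputs = []
        for _ in range(audio_len):
            dec_out, hidden = self.decoder(step, hidden)
            step = self.out(dec_out)      # predicted audio latent for this time step
            outputs.append(step)
        return torch.cat(outputs, dim=1)  # (batch, T_audio, latent_dim)
```

The predicted audio latents can then be passed through the β-VAE’s audio decoder to recover audio features, or compared directly against ground-truth audio latents during evaluation.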
Dataset
Used a custom dataset of 36,000 video-audio pairs (objects: pen, phone, spoon, knife, fork; actions: left, right, up, down, rotate).
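The compositional-generalization setup can be illustrated with a small split sketch: every object and every action appears during training, but certain object-action combinations are held out for testing. Apart from the “rotate the knife” example mentioned in the results, the held-out pairs below are hypothetical.

```python
from itertools import product

OBJECTS = ["pen", "phone", "spoon", "knife", "fork"]
ACTIONS = ["left", "right", "up", "down", "rotate"]

# All 25 object-action combinations covered by the 36,000 video-audio pairs.
all_pairs = set(product(OBJECTS, ACTIONS))

# Held-out combinations (illustrative, except knife/rotate): each object and
# each action is still seen in training, just never in these pairings.
held_out = {("knife", "rotate"), ("spoon", "up"), ("phone", "left")}

train_pairs = all_pairs - held_out
test_pairs = held_out

# Sanity check: no object or action is entirely missing from training.
assert {o for o, _ in train_pairs} == set(OBJECTS)
assert {a for _, a in train_pairs} == set(ACTIONS)
```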
Results
- Symbol Grounding: Achieved a cosine distance of 0.17 (vs. the baseline’s 50.12), indicating near-perfect alignment between predicted and ground-truth audio (the metric computation is sketched after this list).
- Compositional Generalization: When tested on unseen object-action pairs (e.g., “rotate the knife”), the model maintained stable performance:
  - Cosine distance ≤ 8.88 across all held-out objects (vs. the baseline’s 59.59–66.28).
  - Test loss reduced by 99% compared to the baseline.
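For reference, this is a minimal sketch of how the two evaluation metrics could be computed on predicted versus ground-truth audio latent sequences. The exact aggregation used in the thesis (summing versus averaging over time steps and latent dimensions) is an assumption here and determines the absolute scale of the reported numbers.

```python
import torch
import torch.nn.functional as F

def cosine_distance(pred, target):
    """1 - cosine similarity per time step, averaged over the batch and sequence.

    pred, target: (batch, T, latent_dim) sequences of audio latents.
    """
    sim = F.cosine_similarity(pred, target, dim=-1)  # (batch, T)
    return (1.0 - sim).mean()

def mse(pred, target):
    """Mean squared error between predicted and ground-truth latents."""
    return F.mse_loss(pred, target)
```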
This work bridges the gap between multimodal learning and real-world adaptability. Applications include:
- Robotics: Enabling robots to interpret dynamic sensory inputs (e.g., translating visual actions to verbal commands).
- Human-Computer Interaction: Improving systems like virtual assistants to handle novel user requests.
Thesis Link: Read the Full Thesis
Code Repository: GitHub