Graduation Project

Cross-Modal Translation with β-VAE
In an increasingly multimodal world, AI systems must understand sensory inputs such as video and audio and translate between them. For my Master’s thesis in Media Technology, I tackled the challenges of symbol grounding (linking abstract symbols to real-world meaning) and compositional generalization (applying learned concepts to unseen scenarios) by leveraging structured latent representations. Using a β-Variational Autoencoder (β-VAE) and a Sequence-to-Sequence (Seq2Seq) model, I developed a framework that maps video features to their corresponding audio descriptions, enabling the system to generalize beyond its training data.
Key Contributions:
- Demonstrated that β-VAE-learned latent representations significantly improve cross-modal alignment and generalization.
- Identified gradual β-factor ramp-up as the optimal strategy to balance reconstruction accuracy and feature disentanglement.
- Achieved a 97% reduction in cosine distance compared to baseline models, enabling robust translation of unseen object-action pairs.
Methodology
Structured Representation Learning
- Trained a β-VAE to encode video and audio features into a shared 16-dimensional latent space.
- Explored 9 β-scheduling strategies, with a gradual ramp-up yielding the best trade-off between disentanglement and reconstruction (see the sketch after this list).
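To make the first stage concrete, here is a minimal PyTorch-style sketch of a β-VAE with a 16-dimensional latent space and a linearly ramped β weight. PyTorch itself, the layer sizes, the 512-dimensional input features, the maximum β of 4.0, and the 50-epoch ramp are illustrative assumptions; the thesis’s exact architecture and the remaining eight β schedules are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 16     # shared latent size used in the thesis
FEATURE_DIM = 512   # per-frame feature size (hypothetical)

class BetaVAE(nn.Module):
    """Minimal beta-VAE: encodes a feature vector into a 16-d latent and reconstructs it."""
    def __init__(self, feature_dim=FEATURE_DIM, latent_dim=LATENT_DIM):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, feature_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def beta_vae_loss(x, x_hat, mu, logvar, beta):
    """Reconstruction error plus beta-weighted KL divergence."""
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

def beta_schedule(epoch, beta_max=4.0, ramp_epochs=50):
    """Gradual ramp-up: increase beta linearly, then hold it constant."""
    return beta_max * min(1.0, epoch / ramp_epochs)
```

Training starts with a small β (favouring reconstruction) and gradually shifts weight onto the KL term, which is exactly the reconstruction-versus-disentanglement trade-off the ramp-up schedule is meant to balance.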
Cross-Modal Translation
- Fed the disentangled latent vectors into a modified Seq2Seq model to map video sequences to their audio descriptions (sketched below).
- Evaluated performance using cosine distance, MSE, and generalization tests on unseen object-action combinations.
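A minimal sketch of the translation stage, assuming a GRU-based encoder-decoder that operates on sequences of 16-dimensional latent vectors; the specific Seq2Seq modifications, layer sizes, and decoding scheme used in the thesis are not shown, and the choices below are assumptions.

```python
import torch
import torch.nn as nn

class LatentSeq2Seq(nn.Module):
    """Maps a sequence of video latents to a sequence of audio latents."""
    def __init__(self, latent_dim=16, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, latent_dim)

    def forward(self, video_latents, audio_len):
        # video_latents: (batch, T_video, latent_dim)
        _, hidden = self.encoder(video_latents)
        # Decode autoregressively, starting from a zero latent vector.
        step = video_latents.new_zeros(video_latents.size(0), 1, video_latents.size(-1))
        outputs = []
        for _ in range(audio_len):
            dec_out, hidden = self.decoder(step, hidden)
            step = self.out(dec_out)      # predicted audio latent for this time step
            outputs.append(step)
        return torch.cat(outputs, dim=1)  # (batch, T_audio, latent_dim)
```

The predicted audio latents can then be passed through the β-VAE’s audio decoder to recover audio features, or compared directly against ground-truth audio latents during evaluation.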
Dataset
Used a custom dataset of 36,000 video-audio pairs (objects: pen, phone, spoon, knife, fork; actions: left, right, up, down, rotate).
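The compositional-generalization setup can be illustrated with a small split sketch: every object and every action appears during training, but certain object-action combinations are held out for testing. Apart from the “rotate the knife” example mentioned in the results, the held-out pairs below are hypothetical.

```python
from itertools import product

OBJECTS = ["pen", "phone", "spoon", "knife", "fork"]
ACTIONS = ["left", "right", "up", "down", "rotate"]

# All 25 object-action combinations covered by the 36,000 video-audio pairs.
all_pairs = set(product(OBJECTS, ACTIONS))

# Held-out combinations (illustrative, except knife/rotate): each object and
# each action is still seen in training, just never in these pairings.
held_out = {("knife", "rotate"), ("spoon", "up"), ("phone", "left")}

train_pairs = all_pairs - held_out
test_pairs = held_out

# Sanity check: no object or action is entirely missing from training.
assert {o for o, _ in train_pairs} == set(OBJECTS)
assert {a for _, a in train_pairs} == set(ACTIONS)
```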
Results
- Symbol Grounding: Achieved a cosine distance of 0.17 (vs. the baseline’s 50.12), indicating near-perfect alignment between predicted and ground-truth audio (the metric computation is sketched after this list).
- Compositional Generalization: When tested on unseen object-action pairs (e.g., “rotate the knife”), the model maintained stable performance:
  - Cosine distance ≤ 8.88 across all held-out objects (vs. the baseline’s 59.59–66.28).
  - Test loss reduced by 99% compared to the baseline.
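For reference, this is a minimal sketch of how the two evaluation metrics could be computed on predicted versus ground-truth audio latent sequences. The exact aggregation used in the thesis (summing versus averaging over time steps and latent dimensions) is an assumption here and determines the absolute scale of the reported numbers.

```python
import torch
import torch.nn.functional as F

def cosine_distance(pred, target):
    """1 - cosine similarity per time step, averaged over the batch and sequence.

    pred, target: (batch, T, latent_dim) sequences of audio latents.
    """
    sim = F.cosine_similarity(pred, target, dim=-1)  # (batch, T)
    return (1.0 - sim).mean()

def mse(pred, target):
    """Mean squared error between predicted and ground-truth latents."""
    return F.mse_loss(pred, target)
```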
This work bridges the gap between multimodal learning and real-world adaptability. Applications include:
- Robotics: Enabling robots to interpret dynamic sensory inputs (e.g., translating visual actions to verbal commands).
- Human-Computer Interaction: Improving systems like virtual assistants to handle novel user requests.
Thesis Link: Read the Full Thesis
Code Repository: GitHub