BEng (Hons) Software Engineering | Final Year Project
W.H.Sachini Wewalwala

Seamless communication and coordination across data modalities such as text, image, audio, and video remain difficult for current multi-agent systems. Existing frameworks lack an integrated representation space that supports agent-to-agent communication, which leads to disjoint knowledge processing, communication breakdowns, and weaker cooperative performance. In multimodal environments, these issues hinder higher-order collective intelligence, especially when agents must process and coordinate information from multiple sources at the same time. To address this challenge, we propose FUSION-X, a unified representation framework built on deep models. Its scope is limited to a bimodal (image-text) fusion strategy to fit within the constraints of the Final Year Project timeline. The approach uses pre-trained encoders, an OpenAI CLIP image encoder and a HuggingFace Transformer text encoder, to produce modality-specific embeddings, which a lightweight neural network fusion layer maps into a single shared representation space. The study follows a mixed-methods design that combines quantitative performance indicators with qualitative feedback from stakeholders. The framework was validated through controlled experiments on the MS-COCO benchmark. FUSION-X models are evaluated using standard multimodal indicators: Recall@K, F1-score, and modality gap reduction. The high retrieval accuracies and low modality gap values demonstrate that FUSION-X efficiently creates unified bimodal representations for cross-modal agent communication.
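As an illustration of two of the indicators named above, the following is a minimal sketch of how Recall@K and the modality gap could be computed over paired image and text embeddings. It is not the project's evaluation code. It assumes that the i-th text embedding is the ground-truth match for the i-th image embedding, that retrieval ranks candidates by cosine similarity, and that the modality gap is measured as the Euclidean distance between the centroids of the L2-normalized image and text embeddings. The function names and the toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def recall_at_k(image_embs, text_embs, k):
    """Image-to-text Recall@K.

    Assumption: text_embs[i] is the correct match for image_embs[i].
    """
    hits = 0
    for i, img in enumerate(image_embs):
        # Rank all text candidates from most to least similar to this image.
        ranked = sorted(
            range(len(text_embs)),
            key=lambda j: cosine(img, text_embs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)


def modality_gap(image_embs, text_embs):
    """Distance between the centroids of the normalized image and text sets.

    Assumption: this centroid-distance definition is used for illustration;
    the source does not state which definition the project adopts.
    """
    def centroid(embs):
        normed = [l2_normalize(e) for e in embs]
        dim = len(normed[0])
        return [sum(e[d] for e in normed) / len(normed) for d in range(dim)]

    c_img, c_txt = centroid(image_embs), centroid(text_embs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))


# Toy 2-D embeddings (hypothetical values, not experimental data).
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]

print("Recall@1:", recall_at_k(images, texts, 1))
print("Recall@3:", recall_at_k(images, texts, 3))
print("Modality gap:", modality_gap(images, texts))
```

In practice the embeddings would come from the shared representation space produced by the fusion layer rather than from hand-written vectors. A smaller gap value under this definition means the image and text embeddings occupy closer regions of that space.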