Multimodal conversational agents allow the user to communicate through natural language and visual information. In e-commerce, such agents have the potential to enable realistic and dynamic shopping experiences, where the customer finds the desired products more efficiently with the help of an agent. A common approach in this scenario is to build a representation space where the textual and visual information of a product lie close together, so that products can be searched and retrieved with queries from either modality. This work proposes to generate this joint representation space while also taking into account prior knowledge about the fashion domain, to ensure that the retrieved products comply with the target product type. By combining label relaxation with a taxonomy-based regularization, the proposed approach softens the penalization of the contrastive loss, assigning a smaller loss to other acceptable matches. Our results show that the proposed approach significantly reduces gross errors, such as retrieving pants when the customer is looking for t-shirts, while simultaneously achieving strong retrieval performance. Additionally, the approach supports multimodal queries, in which specific attributes of a visual query can be modified with text.
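The label-relaxation idea can be illustrated with a minimal sketch: instead of a one-hot target over the batch, the contrastive cross-entropy target spreads some probability mass onto taxonomy-acceptable matches. This is an illustrative reconstruction, not the paper's exact formulation; the function name `relaxed_contrastive_loss`, the mixing weight `alpha`, and the `taxonomy_sim` prior are hypothetical names introduced here.

```python
import math

def relaxed_contrastive_loss(sim, taxonomy_sim, alpha=0.2):
    """Contrastive cross-entropy with taxonomy-relaxed targets (sketch).

    sim[i][j]: similarity score between query i and product j in the batch.
    taxonomy_sim[i][j]: assumed prior similarity of products i and j in the
        fashion taxonomy (1.0 for the true match, a smaller positive value
        for acceptable same-category matches, 0.0 otherwise).
    alpha: hypothetical mixing weight moving loss mass from the exact
        match onto taxonomy-acceptable matches.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        # Relaxed target: mostly the true pair, some mass on acceptable matches.
        weights = [(1 - alpha) * (1.0 if j == i else 0.0)
                   + alpha * taxonomy_sim[i][j] for j in range(n)]
        z = sum(weights)
        target = [w / z for w in weights]
        # Numerically stable log-softmax over the similarity row.
        m = max(sim[i])
        lse = m + math.log(sum(math.exp(s - m) for s in sim[i]))
        # Cross-entropy between the relaxed target and the softmax scores.
        total += -sum(t * (s - lse) for t, s in zip(target, sim[i]))
    return total / n
```

Under this formulation, a model that confuses a t-shirt query with another t-shirt (an acceptable match under the taxonomy) incurs a smaller loss than one that confuses it with pants, whereas a plain one-hot contrastive loss would penalize both errors equally.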