There are many domains where the temporal dimension is critical to unveil how different modalities, such as images and texts, are correlated. Notably, in the social media domain, information is constantly evolving over time according to the events that take place in the real world. In this work, we seek highly expressive loss functions that allow the encoding of data's temporal traits into cross-modal embedding spaces. To achieve this goal, we propose to steer the learning procedure of such embeddings through a set of adaptively enforced temporal constraints. In particular, we propose a new formulation of the triplet loss function, where the traditional static margin is superseded by a novel temporally adaptive maximum margin function. This redesign of the static margin formulation allows the embedding to effectively capture not only the semantic correlations across data modalities, but also the data's fine-grained temporal correlations. Our experiments confirm the effectiveness of our model in structuring different modalities, while organizing data according to temporal correlations. Moreover, we experimentally highlight how these embeddings can be used for multimedia understanding.
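To make the core idea concrete, the following is a minimal sketch of a triplet loss whose margin adapts to temporal distance. The specific margin schedule (a clipped linear function of the anchor-negative time gap) and all names (`adaptive_margin`, `base_margin`, `alpha`, `max_margin`) are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def adaptive_margin(t_anchor, t_negative, base_margin=0.2, alpha=0.1, max_margin=1.0):
    """Margin that grows with the temporal distance between anchor and
    negative (assumed schedule: linear in |dt|, clipped at max_margin)."""
    dt = np.abs(t_anchor - t_negative)
    return np.minimum(base_margin + alpha * dt, max_margin)

def temporal_triplet_loss(anchor, positive, negative, t_anchor, t_negative):
    """Hinge triplet loss where the static margin is replaced by the
    temporally adaptive margin above; embeddings are 1-D vectors."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    m = adaptive_margin(t_anchor, t_negative)
    return max(0.0, d_pos - d_neg + m)

# Usage: a temporally distant negative incurs a larger margin, so it must
# be pushed farther from the anchor before the loss reaches zero.
rng = np.random.default_rng(0)
a, p, n = rng.normal(size=(3, 64))
print(temporal_triplet_loss(a, p, n, t_anchor=0.0, t_negative=5.0))
```

Under this reading, semantically similar but temporally distant pairs are separated more strongly than temporally close ones, which is one way an embedding can encode fine-grained temporal correlations alongside cross-modal semantics.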