Situated natural language interaction between humans and robots is essential in complex applications: communication must make reference to the environment shared by the user and the robot. This paper proposes a transformer-based architecture that integrates input utterances with spatial information, expressed as a logical representation of a semantic map of the environment. The generated interpretation is a logical form of the command that refers to the state of the world, produced through a single end-to-end process conditioned at each interaction on an explicit linguistic description of the environment. In this work, the end-to-end capability of the proposed transformer is studied in a multilingual setting, where the robot can be queried in different natural languages. The experimental results confirm the applicability of transformers to grounded human-robot interaction, with benefits in both portability across domains and achievable accuracy. Moreover, language-specific processing chains are shown to be preferable to large-scale multilingual models, offering a better trade-off between accuracy and complexity. Overall, the proposed architecture outperforms previous approaches and paves the way for sustainable multilingual architectures.