well,
my 2 cents:
this is probably done with a backend server that generates the tour.
When you add some text, it gets stored in the backend / DB,
and when a user opens the pano, the backend generates the XML for that pano / scene, including the hotspots / text layers.
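
To make the idea concrete, here is a rough sketch of what such a backend could do. It is only a guess at the approach, not how any particular site actually does it. The `notes` table, its columns, the `build_scene_xml` function and the `note_<id>` naming are all made up for illustration, and the hotspot attributes (`type`, `ath`, `atv`, `html`) are assumptions that should be checked against the krpano documentation for your version.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_scene_xml(conn, scene_name):
    """Build the XML for one scene from the text notes stored in the DB.

    Hypothetical schema: notes(id, scene, ath, atv, body).
    """
    root = ET.Element("krpano")
    scene = ET.SubElement(root, "scene", {"name": scene_name})
    rows = conn.execute(
        "SELECT id, ath, atv, body FROM notes WHERE scene = ? ORDER BY id",
        (scene_name,),
    )
    for note_id, ath, atv, body in rows:
        # Attribute names are assumed; ElementTree escapes the text for us.
        ET.SubElement(
            scene,
            "hotspot",
            {
                "name": f"note_{note_id}",
                "type": "text",
                "ath": str(ath),
                "atv": str(atv),
                "html": body,
            },
        )
    return ET.tostring(root, encoding="unicode")


# Demo: an in-memory database stands in for the real backend DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes ("
    "id INTEGER PRIMARY KEY, scene TEXT, ath REAL, atv REAL, body TEXT)"
)
# "When you add some text": the backend stores it with its position.
conn.execute(
    "INSERT INTO notes (scene, ath, atv, body) VALUES (?, ?, ?, ?)",
    ("lobby", 12.5, -3.0, "Exit <this way> & up"),
)
conn.commit()

# "When a user opens the pano": the XML is generated on request.
xml_text = build_scene_xml(conn, "lobby")
print(xml_text)
```

Generating the XML per request like this means newly added text shows up the next time the scene is loaded, without editing any static tour file.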