A VLM-based framework for evaluating garment consistency in AI-generated images based on DeLong’s theory