Human-robot interactions increasingly require adaptive instruction delivery, yet robots struggle to calibrate the level of instruction detail without explicit user input. We present a system that automatically modulates instruction granularity using real-time affect detection based on multimodal fusion of thermal imaging, facial expressions, and contextual information. A transformer-based architecture integrates these signals and selects, from the detected user state, how much detail each instruction should carry. In a between-subjects study (N=40), participants completed assembly tasks under either a manual-adjustment or an automatic-adaptation condition. Participants in the adaptive condition made significantly fewer manual adjustments (0.7 vs. 2.0 per session), with comparable user satisfaction across conditions. This work demonstrates the effectiveness of affect-driven adaptive instruction in human-robot interaction, contributing to more responsive robotic interfaces and offering guidelines for balancing automation with user control.
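
To make the fusion step concrete, the sketch below shows one plausible way such a transformer-based multimodal fusion could be structured: thermal, facial-expression, and context features are projected into a shared space, encoded jointly, and mapped to an instruction-granularity decision. This is an illustrative sketch, not the authors' implementation; all module names, feature dimensions, and the three granularity levels are assumptions.

```python
# Minimal sketch (assumed architecture): fuse thermal, facial-expression, and
# context features with a transformer encoder, then pick a granularity level.
import torch
import torch.nn as nn

class AffectFusionModel(nn.Module):
    def __init__(self, thermal_dim=64, face_dim=128, context_dim=32,
                 d_model=128, n_heads=4, n_layers=2, n_granularity_levels=3):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.thermal_proj = nn.Linear(thermal_dim, d_model)
        self.face_proj = nn.Linear(face_dim, d_model)
        self.context_proj = nn.Linear(context_dim, d_model)
        # Learnable modality-type embeddings distinguish the three token streams.
        self.modality_embed = nn.Parameter(torch.randn(3, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Head mapping the fused representation to a granularity decision.
        self.head = nn.Linear(d_model, n_granularity_levels)

    def forward(self, thermal, face, context):
        # Each input: (batch, feature_dim). Stack as a 3-token sequence.
        tokens = torch.stack([
            self.thermal_proj(thermal),
            self.face_proj(face),
            self.context_proj(context),
        ], dim=1) + self.modality_embed
        fused = self.encoder(tokens).mean(dim=1)   # pool over modalities
        return self.head(fused)                    # logits over granularity levels

# Usage: choose an instruction detail level from current sensor features.
model = AffectFusionModel()
logits = model(torch.randn(1, 64), torch.randn(1, 128), torch.randn(1, 32))
granularity = logits.argmax(dim=-1)  # hypothetical levels: 0=terse, 1=standard, 2=step-by-step
```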