Recognizing and diagnosing a learner’s cognitive and emotional state in order to intervene appropriately is an important part of improving learning processes. Social robots can support this task in educational contexts. This paper presents a cognitive architecture for managing the social behavior of a robot with object-handling capabilities. The human-robot scaffolding architecture is composed of three systems: multimodal fusion, beliefs, and scaffolding. Respectively, these systems recognize verbal and nonverbal data from the user and from the mechanical assembly task, identify the user’s cognitive and emotional state with respect to the learning task, and configure the robot’s actions based on Flow Theory, which relates challenges to skills during the learning process. The paper also presents the theoretical analysis and the exploratory activities with children that were used to build each subsystem of the architecture. This research contributes to the field of human-robot interaction by proposing an architecture aimed at proactive robot behavior that responds to the learner’s needs.
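As an illustration only, the flow of information from the beliefs system to the scaffolding system might be sketched as follows. This is not the implementation described in the paper: the `Observation` fields, the `margin` threshold, the three state labels, and the action names are all assumptions chosen to show the common three-channel reading of Flow Theory, in which challenge well above skill suggests anxiety, skill well above challenge suggests boredom, and a rough balance suggests flow.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Stand-in for the output of multimodal fusion (hypothetical fields)."""
    challenge: float  # estimated task difficulty, scaled to 0..1
    skill: float      # estimated learner skill, scaled to 0..1


def estimate_state(obs: Observation, margin: float = 0.2) -> str:
    """Beliefs stage (assumed): compare challenge and skill to pick a coarse state."""
    gap = obs.challenge - obs.skill
    if gap > margin:
        return "anxiety"   # challenge clearly exceeds skill
    if gap < -margin:
        return "boredom"   # skill clearly exceeds challenge
    return "flow"          # challenge and skill roughly balanced


def choose_action(state: str) -> str:
    """Scaffolding stage (assumed): map the state to a hypothetical robot action."""
    actions = {
        "anxiety": "offer_hint",
        "boredom": "raise_challenge",
        "flow": "observe",
    }
    return actions[state]


if __name__ == "__main__":
    obs = Observation(challenge=0.9, skill=0.3)
    state = estimate_state(obs)
    print(state, choose_action(state))
```

In a sketch like this, the threshold and the action table are the points where an actual system would substitute its own learned or theory-driven models.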