“Lack of supervision” is a particularly challenging problem in e-learning environments such as Massive Open Online Courses (MOOCs). A wide range of research efforts and technologies have been explored to alleviate its impact by monitoring indicators of student engagement, such as emotion or learning behaviors. However, current research still lacks multi-dimensional computational measures for analyzing learners’ engagement from the interactions that occur in digital learning environments. In this paper, we propose an integrated framework that identifies learning engagement from three facets: affective, behavioral, and cognitive state, which are conveyed by the learner’s facial expressions, eye movement behaviors, and overall performance during a short video learning session. To recognize these three states, three channels of data are recorded: 1) a video/image sequence captured by a camera, 2) eye movement information from a non-intrusive and cost-effective eye tracker, and 3) click-stream data from the mouse. Based on these modalities, we design a multi-channel data fusion strategy to predict course learning performance. We also present a new method that makes self-reported annotations more reliable without requiring verification by external observers. To validate the approach and methods, 46 participants were invited to take a representative online course consisting of short videos in our designed learning environment. The results demonstrate the effectiveness of the proposed framework and methods in monitoring learning engagement. More importantly, a prototype system was developed that detects learners’ emotional and eye-behavioral engagement in real time and predicts their learning performance after they complete each short video course.