Policy gradient reinforcement learning-based vehicle thermal comfort control

  • Gaobo Chen

    Student thesis: Doctoral Thesis, Doctor of Philosophy


    Reinforcement learning (RL) methods have been applied to numerous real-world tasks, including climate control for indoor environments such as offices, classrooms, houses and car cabins. Recent research on car cabin climate control uses State-Action-Reward-State-Action (SARSA) algorithms to train an artificial agent that automatically maintains thermal conditions satisfying occupant comfort. However, SARSA-based RL approaches usually require 2.9 to 6.3 years of simulated learning experience to train a near-optimal control policy. This cost is not negligible in comparison with the lifetime of a vehicle. Alternatively, the family of policy gradient reinforcement learning (PGRL) algorithms has the potential to accelerate training and require less learning experience.

    Hence, the main aim of this thesis is to apply PGRL approaches to learning vehicle climate control and to assess whether the resulting controller can maximize occupant comfort while the thermal conditioning system consumes a reasonable amount of energy. To achieve this aim, a multilayer perceptron (MLP) neural network with a softmax output layer is used as the thermal control policy; the PGRL schemes compute gradients that maximize the received rewards and use them to update the weights of this policy. Two primitive PGRL methods are applied and compared: Monte Carlo policy gradient (MCPG) and mean actor critic (MAC). However, the main difficulty with primitive PGRL methods is that the learning step size computed by direct gradient-descent rules does not always improve the policy. This issue can be addressed by employing two typical PGRL approaches: trust region policy optimization (TRPO) and proximal policy optimization (PPO).
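
The ingredients named above (a softmax MLP policy, Monte Carlo returns, and the PPO remedy for unreliable step sizes) can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the layer sizes, the tanh hidden activation, the discount factor, the clipping range `eps`, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum so the exponentials stay numerically stable.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_policy(n_state, n_hidden, n_action, seed=0):
    # Hypothetical one-hidden-layer MLP; the sizes are illustrative assumptions.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (n_state, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_action)),
        "b2": np.zeros(n_action),
    }

def action_probs(params, states):
    # Forward pass: the softmax output layer gives a distribution over
    # discrete control actions for each state in the batch.
    h = np.tanh(states @ params["W1"] + params["b1"])
    return softmax(h @ params["W2"] + params["b2"])

def discounted_returns(rewards, gamma=0.99):
    # Monte Carlo return G_t = r_t + gamma * G_{t+1}, computed backwards.
    g = 0.0
    out = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

def ppo_clip_objective(new_probs, old_probs, actions, advantages, eps=0.2):
    # Clipped surrogate: the probability ratio is limited to [1-eps, 1+eps],
    # so a single update cannot move the policy arbitrarily far.
    idx = np.arange(len(actions))
    ratio = new_probs[idx, actions] / old_probs[idx, actions]
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

A Monte Carlo policy gradient update would ascend the gradient of the log-probability of each taken action weighted by its return, whereas a PPO update would ascend the clipped objective instead; computing those gradients (for example with an automatic differentiation library) is left out of this sketch.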

    The experiments show that the TRPO and PPO approaches improve sample efficiency, reducing the simulated learning time to 0.63 years. The PPO-based training scheme statistically yields a higher episodic reward per learning trial than the alternative PGRL methods. Additionally, the PPO-based controller achieves occupant comfort in 3.8 minutes on average and maintains comfort for 77.94% of the time. Compared with the SARSA-based controller on pre-selected testing scenarios, the PPO-based controller achieves 92.3% occupant comfort, which is higher than the 67% achieved by the SARSA-based controller. Moreover, the state representation is non-Markovian because it depends on the time step: validation shows that increasing the episode duration from 1000 to 5000 s significantly improves comfort maintenance and the averaged episodic reward. A Markovian state representation is then introduced to mitigate the dependence of the state on the time step. With an episode duration of 4 × 10³ s, the Markovian state representation improves the comfort percentage from 53.58% to 64.32%, but this remains lower than the 77.94% achieved by the non-Markovian training case with an episode duration of 5 × 10³ s. The trade-off is that the non-Markovian learning case consumes 20% more simulated time to estimate a comfort-oriented controller that spends 13% more time in comfort.

    Therefore, the PPO-based PGRL climate controller can significantly improve the occupant comfort percentage to above 77% while using less simulated learning time (0.63 years) with the non-Markovian state representation. This simulated time is much shorter than the vehicle’s lifetime. Furthermore, other innovative policy gradient techniques, such as Actor Critic with experience replay and Trust Region-Guided PPO, have the potential to further reduce the 0.63 years of learning experience required by the PPO method. A more realistic human thermal comfort model is also needed.

    Date of Award: May 2021
    Original language: English
    Supervisor: James Brusey (Supervisor) & Elena Gaura (Supervisor)
