
Browsing by Author "Smit, Andries"

Now showing 1 - 1 of 1
  • Open Access
    Meta-learning adaptive intrinsic reward weighting for curiosity-driven reinforcement learning
    (2026) Ziki, Batsirayi Mupamhi; Shock, Jonathan; Smit, Andries
    For both organisms and artificial agents, exploration is essential to continue learning and avoid becoming trapped in suboptimal behaviours. Reinforcement learning (RL) agents can also face exploration challenges in environments with sparse feedback. Curiosity-driven exploration algorithms can help address these challenges by providing intrinsic rewards based on the novelty of situations an agent encounters. These intrinsic rewards are typically combined with extrinsic rewards using a weighted sum with the parameter λ. However, fine-tuning λ for each task across multiple environments can become computationally expensive. We propose a meta-learning approach for automatic tuning of λ using a recurrent neural network (RNN) that dynamically outputs λ values. We call this RNN the reward combiner. The reward combiner was trained using evolutionary strategies on XLand-MiniGrid environments, where feedback is sparse. The fitness function was the total extrinsic reward obtained during the training phase of an agent. We used BYOL-Explore, a curiosity-driven exploration algorithm, for intrinsic reward generation. The reward combiner takes normalised extrinsic and intrinsic rewards as input, along with actions that provide task-specific context for λ selection. Trained on Unlock and Empty-16x16 environments, the reward combiner generalises across different grid sizes of the same task, outperforming baselines when tested on DoorKey environments. It also generalises across different tasks when tested on UnlockPickUp, where the objective differs from the training environments. Our approach achieves higher extrinsic returns at the end of training than curiosity-driven baselines across all test environments. Despite being tested only within XLand-MiniGrid environments, our results indicate this approach has potential to eliminate costly hyperparameter sweeps when switching to new tasks with similar mechanics.
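The weighted-sum reward combination described in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the combined reward takes the form r_ext + λ · r_int with one λ per time step (the abstract states only that a weighted sum with parameter λ is used), and the function name `combine_rewards` is hypothetical. In the described approach the λ values would come from the reward combiner RNN; here they are supplied by hand.

```python
from typing import List, Sequence


def combine_rewards(
    extrinsic: Sequence[float],
    intrinsic: Sequence[float],
    lam: Sequence[float],
) -> List[float]:
    """Combine per-step rewards as r_ext + lam * r_int.

    Assumption: this additive form is one common way to write the
    weighted sum; the source does not specify the exact formula.
    """
    if not (len(extrinsic) == len(intrinsic) == len(lam)):
        raise ValueError("all inputs must have the same length")
    # One lambda per step, so the weighting can change over time.
    return [r_e + l * r_i for r_e, r_i, l in zip(extrinsic, intrinsic, lam)]


# Hand-picked lambda values stand in for the reward combiner's outputs.
combined = combine_rewards([0.0, 1.0], [0.5, 0.5], [1.0, 0.0])
print(combined)
```

A per-step λ of 0 reduces the combined reward to the extrinsic reward alone, while larger values give the curiosity signal more influence on that step.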