
Discussion: Requesting assistance and guidance with implementation of RL algorithms and models in the context of Tetris #265

@lunathanael

Description

Hello!
I'm trying to apply the models and algorithms in this library to Tetris: specifically, multiplayer Tetris, where players compete to clear lines efficiently and send as many lines as possible to the opponent. Right now I am just trying to build a simple bot that does better than placing tetrominoes randomly.

Here's what I have:
An environment modeled after atari and game_2048 that models can interact with and train on successfully.
A modified reward system that incentivizes placing more blocks, with extra emphasis on any lines cleared (a rough sketch follows this list).
A config file set up for EfficientZero, with over 48 single-GPU hours of training so far.
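
To make the reward item concrete, here is a rough sketch of the shaping I mean; the function name, arguments, and weights are illustrative placeholders, not the literal values in my environment:

def shaped_reward(piece_placed: bool, lines_cleared: int) -> float:
    # Rough sketch of the reward shaping described above; the weights are
    # placeholders, not the values actually used in my environment.
    reward = 0.0
    if piece_placed:
        reward += 0.1               # small bonus for every successful placement
    reward += 1.0 * lines_cleared   # much larger bonus per line cleared
    return reward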

Here's some context on the environment being trained on:
The observation is a 10-column x 8-row one-hot encoded board, stacked with some additional one-hot encoded information such as the current piece, the pieces in the queue, and the held piece.
Each move is encoded by the coordinate of the placement, the type of the piece placed, and its rotation, giving a one-hot encoded action space of size 2560. The observation input size is 144. The model currently uses an MLP (the observation dict is sketched below).
It is worth noting that, even after many training iterations, the console still warns that a lot of illegal moves are being attempted, despite the action mask being provided for the varied action space. It seems the model may not be learning the legal actions correctly.
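
To clarify the encoding, each environment step returns an observation dict roughly along these lines (LightZero-style keys as in the atari/game_2048 zoo envs; the shapes are the ones described above, and the values shown are placeholders):

import numpy as np

# Sketch of the per-step observation dict (LightZero-style keys); shapes match
# the description above, values are placeholders.
obs = {
    # flattened 10x8 one-hot board plus one-hot current/held/queue pieces -> 144 inputs
    'observation': np.zeros(144, dtype=np.float32),
    # one entry per (placement coordinate, piece type, rotation) -> 2560 entries,
    # 1 where that placement is currently legal, 0 otherwise
    'action_mask': np.zeros(2560, dtype=np.int8),
    # single-agent setting, so to_play stays -1 (the convention for non-board games)
    'to_play': -1,
}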

I've also done a small amount of testing with an action space of 10 instead, and some runs with reanalyze_ratio set to 0.25, among other tweaks (roughly the overrides sketched after this paragraph).
I'm open to trying anything; I just want to get something working :).
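
Concretely, those side experiments amounted to overrides of the config below, roughly along these lines (the keys are the ones that appear in the total_config dump; the environment's move encoding changes accordingly when the action space shrinks):

# Rough sketch of the overrides used in the side experiments (dict layout
# mirrors the total_config below).
exp_config['policy']['model']['action_space_size'] = 10  # reduced action space, env re-encoded to match
exp_config['policy']['reanalyze_ratio'] = 0.25           # instead of the 0.0 used in the main run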

Let me know if there is any additional information or context I can provide, or resources I should look at, to facilitate my learning.

Here are some graphs from the training:
[training graphs]

Here is the total_config.py file for the run:

exp_config = {
    'env': {
        'manager': {
            'episode_num': float("inf"),
            'max_retry': 1,
            'step_timeout': None,
            'auto_reset': True,
            'reset_timeout': None,
            'retry_type': 'reset',
            'retry_waiting_time': 0.1,
            'shared_memory': False,
            'copy_on_get': True,
            'context': 'fork',
            'wait_num': float("inf"),
            'step_wait_timeout': None,
            'connect_timeout': 60,
            'reset_inplace': False,
            'cfg_type': 'SyncSubprocessEnvManagerDict',
            'type': 'subprocess'
        },
        'stop_value': 10000000000,
        'n_evaluator_episode': 4,
        'env_id': 'botris',
        'render_mode': None,
        'replay_format': 'gif',
        'replay_name_suffix': 'eval',
        'replay_path': None,
        'act_scale': True,
        'obs_type': 'dict_encoded_board',
        'reward_normalize': False,
        'reward_norm_scale': 100,
        'reward_type': 'raw',
        'max_score': None,
        'delay_reward_step': 0,
        'prob_random_agent': 0.0,
        'max_episode_steps': 50000000,
        'is_collect': True,
        'ignore_legal_actions': False,
        'cfg_type': 'BotrisEnvDict',
        'type': 'botris',
        'import_names': ['zoo.botris.envs.botris_lightzero_env'],
        'collector_env_num': 8,
        'evaluator_env_num': 4
    },
    'policy': {
        'model': {
            'model_type': 'mlp',
            'continuous_action_space': False,
            'observation_shape': 144,
            'self_supervised_learning_loss': True,
            'categorical_distribution': True,
            'image_channel': 1,
            'frame_stack_num': 1,
            'num_res_blocks': 1,
            'num_channels': 64,
            'support_scale': 300,
            'bias': True,
            'discrete_action_encoding_type': 'one_hot',
            'res_connection_in_dynamics': True,
            'norm_type': 'BN',
            'analysis_sim_norm': False,
            'analysis_dormant_ratio': False,
            'harmony_balance': False,
            'lstm_hidden_size': 256,
            'action_space_size': 2560,
            'latent_state_dim': 256
        },
        'learn': {
            'learner': {
                'train_iterations': 1000000000,
                'dataloader': {
                    'num_workers': 0
                },
                'log_policy': True,
                'hook': {
                    'load_ckpt_before_run': '',
                    'log_show_after_iter': 100,
                    'save_ckpt_after_iter': 10000,
                    'save_ckpt_after_run': True
                },
                'cfg_type': 'BaseLearnerDict'
            }
        },
        'collect': {
            'collector': {
                'deepcopy_obs': False,
                'transform_obs': False,
                'collect_print_freq': 100,
                'cfg_type': 'SampleSerialCollectorDict',
                'type': 'sample'
            }
        },
        'eval': {
            'evaluator': {
                'eval_freq': 1000,
                'render': {
                    'render_freq': -1,
                    'mode': 'train_iter'
                },
                'figure_path': None,
                'cfg_type': 'InteractionSerialEvaluatorDict',
                'stop_value': 10000000000,
                'n_episode': 4
            }
        },
        'other': {
            'replay_buffer': {
                'type': 'advanced',
                'replay_buffer_size': 4096,
                'max_use': float("inf"),
                'max_staleness': float("inf"),
                'alpha': 0.6,
                'beta': 0.4,
                'anneal_step': 100000,
                'enable_track_used_data': False,
                'deepcopy': False,
                'thruput_controller': {
                    'push_sample_rate_limit': {
                        'max': float("inf"),
                        'min': 0
                    },
                    'window_seconds': 30,
                    'sample_min_limit_ratio': 1
                },
                'monitor': {
                    'sampled_data_attr': {
                        'average_range': 5,
                        'print_freq': 200
                    },
                    'periodic_thruput': {
                        'seconds': 60
                    }
                },
                'cfg_type': 'AdvancedReplayBufferDict'
            },
            'commander': {
                'cfg_type': 'BaseSerialCommanderDict'
            }
        },
        'on_policy': False,
        'cuda': True,
        'multi_gpu': False,
        'bp_update_sync': True,
        'traj_len_inf': False,
        'use_rnd_model': False,
        'sampled_algo': False,
        'gumbel_algo': False,
        'mcts_ctree': True,
        'collector_env_num': 8,
        'evaluator_env_num': 4,
        'env_type': 'not_board_games',
        'action_type': 'varied_action_space',
        'battle_mode': 'play_with_bot_mode',
        'monitor_extra_statistics': True,
        'game_segment_length': 50,
        'eval_offline': False,
        'cal_dormant_ratio': False,
        'analysis_sim_norm': False,
        'analysis_dormant_ratio': False,
        'transform2string': False,
        'gray_scale': False,
        'use_augmentation': False,
        'augmentation': ['shift', 'intensity'],
        'ignore_done': False,
        'update_per_collect': None,
        'replay_ratio': 0.25,
        'batch_size': 256,
        'optim_type': 'Adam',
        'learning_rate': 0.003,
        'target_update_freq': 100,
        'target_update_freq_for_intrinsic_reward': 1000,
        'weight_decay': 0.0001,
        'momentum': 0.9,
        'grad_clip_value': 10,
        'n_episode': 8,
        'num_simulations': 50,
        'discount_factor': 0.997,
        'td_steps': 5,
        'num_unroll_steps': 5,
        'reward_loss_weight': 1,
        'value_loss_weight': 0.25,
        'policy_loss_weight': 1,
        'policy_entropy_loss_weight': 0,
        'ssl_loss_weight': 2,
        'lr_piecewise_constant_decay': True,
        'threshold_training_steps_for_final_lr': 50000,
        'manual_temperature_decay': False,
        'threshold_training_steps_for_final_temperature': 100000,
        'fixed_temperature_value': 0.25,
        'use_ture_chance_label_in_chance_encoder': False,
        'reanalyze_noise': True,
        'reuse_search': False,
        'collect_with_pure_policy': False,
        'use_priority': False,
        'priority_prob_alpha': 0.6,
        'priority_prob_beta': 0.4,
        'root_dirichlet_alpha': 0.3,
        'root_noise_weight': 0.25,
        'random_collect_episode_num': 0,
        'eps': {
            'eps_greedy_exploration_in_collect': False,
            'type': 'linear',
            'start': 1.0,
            'end': 0.05,
            'decay': 100000
        },
        'cfg_type': 'EfficientZeroPolicyDict',
        'lstm_horizon_len': 5,
        'type': 'efficientzero',
        'import_names': ['lzero.policy.efficientzero'],
        'model_path': None,
        'device': 'cuda',
        'reanalyze_ratio': 0.0,
        'eval_freq': 200,
        'replay_buffer_size': 1000000
    },
    'exp_name': 'data_ez/botris_efficientzero_ns50_upcNone_rer0.0_seed0',
    'seed': 0
}
