
Discussion: Requesting assistance and guidance with implementation of RL algorithms and models in the context of Tetris #265

@lunathanael

Description

Hello!
I'm trying to apply the models and algorithms in this library to Tetris: specifically, multiplayer Tetris, where players compete to clear lines efficiently and send as many lines as possible to the opponent. Right now I am just trying to build a simple bot that does better than placing tetrominoes randomly.

Here's what I have:
An environment modeled after atari and game_2048 that models can interact with and train on successfully.
A modified reward system that incentivizes placing more blocks, with extra emphasis on any lines cleared (a rough sketch follows this list).
A config file set up for EfficientZero, with over 48 single-GPU hours of training so far.
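
To make the reward item concrete, here is a rough sketch of the shaping I mean; the function name, arguments, and weights are illustrative placeholders, not the literal values in my environment:

def shaped_reward(piece_placed: bool, lines_cleared: int) -> float:
    # Rough sketch of the reward shaping described above; the weights are
    # placeholders, not the values actually used in my environment.
    reward = 0.0
    if piece_placed:
        reward += 0.1               # small bonus for every successful placement
    reward += 1.0 * lines_cleared   # much larger bonus per line cleared
    return reward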

Here's some context on the environment being trained on:
The observation is a 10-column x 8-row one-hot encoded board, stacked with some additional one-hot encoded information such as the current piece, the pieces in the queue, and the held piece.
Each move is encoded by the coordinate of the placement, the type of the piece placed, and its rotation, giving a one-hot encoded action space of size 2560. The observation input size is 144. The model currently uses an MLP (the observation dict is sketched below).
It is worth noting that, even after many training iterations, the console still warns that a lot of illegal moves are being attempted, despite the action mask being provided for the varied action space. It seems the model may not be learning the legal actions correctly.
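
To clarify the encoding, each environment step returns an observation dict roughly along these lines (LightZero-style keys as in the atari/game_2048 zoo envs; the shapes are the ones described above, and the values shown are placeholders):

import numpy as np

# Sketch of the per-step observation dict (LightZero-style keys); shapes match
# the description above, values are placeholders.
obs = {
    # flattened 10x8 one-hot board plus one-hot current/held/queue pieces -> 144 inputs
    'observation': np.zeros(144, dtype=np.float32),
    # one entry per (placement coordinate, piece type, rotation) -> 2560 entries,
    # 1 where that placement is currently legal, 0 otherwise
    'action_mask': np.zeros(2560, dtype=np.int8),
    # single-agent setting, so to_play stays -1 (the convention for non-board games)
    'to_play': -1,
}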

I've also done a small amount of testing with an action space of 10 instead, and some runs with reanalyze_ratio set to 0.25, among other tweaks (roughly the overrides sketched after this paragraph).
I'm open to trying anything; I just want to get something working :).
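
Concretely, those side experiments amounted to overrides of the config below, roughly along these lines (the keys are the ones that appear in the total_config dump; the environment's move encoding changes accordingly when the action space shrinks):

# Rough sketch of the overrides used in the side experiments (dict layout
# mirrors the total_config below).
exp_config['policy']['model']['action_space_size'] = 10  # reduced action space, env re-encoded to match
exp_config['policy']['reanalyze_ratio'] = 0.25           # instead of the 0.0 used in the main run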

Let me know if there is any additional information or context I can provide, or resources I should look at, to facilitate my learning.

Here are some graphs from the training:
[training graphs]

Here is the total_config.py file for the run:

exp_config = {
    'env': {
        'manager': {
            'episode_num': float("inf"),
            'max_retry': 1,
            'step_timeout': None,
            'auto_reset': True,
            'reset_timeout': None,
            'retry_type': 'reset',
            'retry_waiting_time': 0.1,
            'shared_memory': False,
            'copy_on_get': True,
            'context': 'fork',
            'wait_num': float("inf"),
            'step_wait_timeout': None,
            'connect_timeout': 60,
            'reset_inplace': False,
            'cfg_type': 'SyncSubprocessEnvManagerDict',
            'type': 'subprocess'
        },
        'stop_value': 10000000000,
        'n_evaluator_episode': 4,
        'env_id': 'botris',
        'render_mode': None,
        'replay_format': 'gif',
        'replay_name_suffix': 'eval',
        'replay_path': None,
        'act_scale': True,
        'obs_type': 'dict_encoded_board',
        'reward_normalize': False,
        'reward_norm_scale': 100,
        'reward_type': 'raw',
        'max_score': None,
        'delay_reward_step': 0,
        'prob_random_agent': 0.0,
        'max_episode_steps': 50000000,
        'is_collect': True,
        'ignore_legal_actions': False,
        'cfg_type': 'BotrisEnvDict',
        'type': 'botris',
        'import_names': ['zoo.botris.envs.botris_lightzero_env'],
        'collector_env_num': 8,
        'evaluator_env_num': 4
    },
    'policy': {
        'model': {
            'model_type': 'mlp',
            'continuous_action_space': False,
            'observation_shape': 144,
            'self_supervised_learning_loss': True,
            'categorical_distribution': True,
            'image_channel': 1,
            'frame_stack_num': 1,
            'num_res_blocks': 1,
            'num_channels': 64,
            'support_scale': 300,
            'bias': True,
            'discrete_action_encoding_type': 'one_hot',
            'res_connection_in_dynamics': True,
            'norm_type': 'BN',
            'analysis_sim_norm': False,
            'analysis_dormant_ratio': False,
            'harmony_balance': False,
            'lstm_hidden_size': 256,
            'action_space_size': 2560,
            'latent_state_dim': 256
        },
        'learn': {
            'learner': {
                'train_iterations': 1000000000,
                'dataloader': {
                    'num_workers': 0
                },
                'log_policy': True,
                'hook': {
                    'load_ckpt_before_run': '',
                    'log_show_after_iter': 100,
                    'save_ckpt_after_iter': 10000,
                    'save_ckpt_after_run': True
                },
                'cfg_type': 'BaseLearnerDict'
            }
        },
        'collect': {
            'collector': {
                'deepcopy_obs': False,
                'transform_obs': False,
                'collect_print_freq': 100,
                'cfg_type': 'SampleSerialCollectorDict',
                'type': 'sample'
            }
        },
        'eval': {
            'evaluator': {
                'eval_freq': 1000,
                'render': {
                    'render_freq': -1,
                    'mode': 'train_iter'
                },
                'figure_path': None,
                'cfg_type': 'InteractionSerialEvaluatorDict',
                'stop_value': 10000000000,
                'n_episode': 4
            }
        },
        'other': {
            'replay_buffer': {
                'type': 'advanced',
                'replay_buffer_size': 4096,
                'max_use': float("inf"),
                'max_staleness': float("inf"),
                'alpha': 0.6,
                'beta': 0.4,
                'anneal_step': 100000,
                'enable_track_used_data': False,
                'deepcopy': False,
                'thruput_controller': {
                    'push_sample_rate_limit': {
                        'max': float("inf"),
                        'min': 0
                    },
                    'window_seconds': 30,
                    'sample_min_limit_ratio': 1
                },
                'monitor': {
                    'sampled_data_attr': {
                        'average_range': 5,
                        'print_freq': 200
                    },
                    'periodic_thruput': {
                        'seconds': 60
                    }
                },
                'cfg_type': 'AdvancedReplayBufferDict'
            },
            'commander': {
                'cfg_type': 'BaseSerialCommanderDict'
            }
        },
        'on_policy': False,
        'cuda': True,
        'multi_gpu': False,
        'bp_update_sync': True,
        'traj_len_inf': False,
        'use_rnd_model': False,
        'sampled_algo': False,
        'gumbel_algo': False,
        'mcts_ctree': True,
        'collector_env_num': 8,
        'evaluator_env_num': 4,
        'env_type': 'not_board_games',
        'action_type': 'varied_action_space',
        'battle_mode': 'play_with_bot_mode',
        'monitor_extra_statistics': True,
        'game_segment_length': 50,
        'eval_offline': False,
        'cal_dormant_ratio': False,
        'analysis_sim_norm': False,
        'analysis_dormant_ratio': False,
        'transform2string': False,
        'gray_scale': False,
        'use_augmentation': False,
        'augmentation': ['shift', 'intensity'],
        'ignore_done': False,
        'update_per_collect': None,
        'replay_ratio': 0.25,
        'batch_size': 256,
        'optim_type': 'Adam',
        'learning_rate': 0.003,
        'target_update_freq': 100,
        'target_update_freq_for_intrinsic_reward': 1000,
        'weight_decay': 0.0001,
        'momentum': 0.9,
        'grad_clip_value': 10,
        'n_episode': 8,
        'num_simulations': 50,
        'discount_factor': 0.997,
        'td_steps': 5,
        'num_unroll_steps': 5,
        'reward_loss_weight': 1,
        'value_loss_weight': 0.25,
        'policy_loss_weight': 1,
        'policy_entropy_loss_weight': 0,
        'ssl_loss_weight': 2,
        'lr_piecewise_constant_decay': True,
        'threshold_training_steps_for_final_lr': 50000,
        'manual_temperature_decay': False,
        'threshold_training_steps_for_final_temperature': 100000,
        'fixed_temperature_value': 0.25,
        'use_ture_chance_label_in_chance_encoder': False,
        'reanalyze_noise': True,
        'reuse_search': False,
        'collect_with_pure_policy': False,
        'use_priority': False,
        'priority_prob_alpha': 0.6,
        'priority_prob_beta': 0.4,
        'root_dirichlet_alpha': 0.3,
        'root_noise_weight': 0.25,
        'random_collect_episode_num': 0,
        'eps': {
            'eps_greedy_exploration_in_collect': False,
            'type': 'linear',
            'start': 1.0,
            'end': 0.05,
            'decay': 100000
        },
        'cfg_type': 'EfficientZeroPolicyDict',
        'lstm_horizon_len': 5,
        'type': 'efficientzero',
        'import_names': ['lzero.policy.efficientzero'],
        'model_path': None,
        'device': 'cuda',
        'reanalyze_ratio': 0.0,
        'eval_freq': 200,
        'replay_buffer_size': 1000000
    },
    'exp_name': 'data_ez/botris_efficientzero_ns50_upcNone_rer0.0_seed0',
    'seed': 0
}
