$ python3 simple_vllm_model.py
INFO 06-02 13:24:22 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 06-02 13:24:23 [__init__.py:239] Automatically detected platform rocm.
INFO 06-02 13:24:24 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 06-02 13:24:40 [config.py:716] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 06-02 13:24:48 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 06-02 13:24:48 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 06-02 13:24:48 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='/srv/app/models/Phi-4-mini-instruct', speculative_config=None, tokenizer='/srv/app/models/Phi-4-mini-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/srv/app/models/Phi-4-mini-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 06-02 13:24:48 [rocm.py:186] None is not supported in AMD GPUs.
INFO 06-02 13:24:48 [rocm.py:187] Using ROCmFlashAttention backend.
[W602 13:24:48.833832012 ProcessGroupNCCL.cpp:1028] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 06-02 13:24:48 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-02 13:24:48 [model_runner.py:1120] Starting to load model /srv/app/models/Phi-4-mini-instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00
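The source of simple_vllm_model.py is not shown in this section, but the engine config in the log (model path, dtype=torch.bfloat16, max_seq_len=4096, tensor_parallel_size=1) implies a script along these lines. The sketch below is a reconstruction under those assumptions, not the author's actual code; the prompt and sampling parameters are illustrative.

```python
# Minimal sketch consistent with the log above (hypothetical, not the
# original simple_vllm_model.py). Values mirror the logged engine config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/srv/app/models/Phi-4-mini-instruct",  # model path from the log
    dtype="bfloat16",        # dtype=torch.bfloat16 in the log
    max_model_len=4096,      # max_seq_len=4096 in the log
    tensor_parallel_size=1,  # single GPU: TP rank 0, world size 1
)

# Illustrative prompt and sampling settings (not from the source).
sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what vLLM does in one sentence."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```

Note the line "rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.": on this ROCm build, vLLM drops back to the V0 engine regardless of the VLLM_USE_V1 setting, which is why the log shows a V0 LLM engine being initialized.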