https://end0tknr.hateblo.jp/entry/20230821/1692569142
In the previous entry linked above, perhaps because the PC had only 16 GB of memory, the GPU sat nearly idle while memory and disk were maxed out. So I upgraded to a PC with 64 GB of memory and re-ran the test.
Specs of the PC used (ThinkPad), excerpt
PS> systeminfo
OS 名:                  Microsoft Windows 11 Pro
OS バージョン:          10.0.22621 N/A ビルド 22621
OS 製造元:              Microsoft Corporation
システム製造元:         LENOVO
プロセッサ:             [01]: Intel64 Family 6 Model 186 Stepping 2 GenuineIntel ~1900 Mhz
物理メモリの合計:       65,193 MB

PS> Get-WmiObject -Class Win32_VideoController
Caption : NVIDIA GeForce RTX 4090
build llama.cpp
With llama.cpp cloned as of Sept 2023, main.exe could not load the llama2 model (presumably because llama.cpp had just switched from the GGML model format to GGUF), so I am using the slightly older tag "master-86aeb27".
PS> git clone -b master-86aeb27 https://github.com/ggerganov/llama.cpp
PS> cd llama.cpp
PS> mkdir build
PS> cd build
PS> cmake .. -DLLAMA_CUBLAS=ON
PS> cmake --build . --config Release
PS> cd ..
PS> dir build\bin\Release
<snip>
2023/08/20  20:15      4,423,168 main.exe
<snip>
download llama2 7B & 70B from huggingface
Specifically, the following files:
- llama-2-7b-chat.ggmlv3.q4_K_M.bin (file size:3.8GB)
- llama-2-70b-chat.ggmlv3.q4_K_M.bin (file size:38.5GB)
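For reference, a minimal sketch of how these files could be fetched from PowerShell, assuming they come from TheBloke's GGML conversions on Hugging Face (the repository names and URLs are my assumption; the original entry does not say where the files were downloaded from):

PS> # Assumed source repos: TheBloke/Llama-2-7B-Chat-GGML and TheBloke/Llama-2-70B-Chat-GGML
PS> mkdir ..\llama_model
PS> Invoke-WebRequest -OutFile ..\llama_model\llama-2-7b-chat.ggmlv3.q4_K_M.bin `
      -Uri "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_M.bin"
PS> Invoke-WebRequest -OutFile ..\llama_model\llama-2-70b-chat.ggmlv3.q4_K_M.bin `
      -Uri "https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/resolve/main/llama-2-70b-chat.ggmlv3.q4_K_M.bin"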
test llama2 7B via llama.cpp
PS> build\bin\Release\main -m ../llama_model/llama-2-7b-chat.ggmlv3.q4_K_M.bin --temp 0.1 -p "### Instruction: What is the height of Mount Fuji? ### Response:" -ngl 32 -b 512
Running the above inference, the following output starts to appear after about 3 seconds.
main: build = 937 (86aeb27)
main: seed  = 1693578173
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama.cpp: loading model from ../llama_model/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 474.96 MB (+ 256.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/35 layers to GPU
llama_model_load_internal: total VRAM used: 4007 MB
llama_new_context_with_model: kv self size = 256.00 MB

system_info: n_threads = 10 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.100000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 ### Instruction: What is the height of Mount Fuji? ### Response: The height of Mount Fuji is 3,776 meters (12,424 feet) above sea level. It is located on the main island of Honshu in Japan and is considered one of the country's most iconic landmarks. Mount Fuji is a stratovolcano, meaning it is a composite volcano made up of layers of lava, ash, and other pyroclastic material. It is currently dormant, but has erupted many times in the past, with the last eruption occurring in 1707. The mountain is surrounded by several smaller peaks, including Mount Subashiri, Mount Myojin, and Mount Kita-Fuji, and is visible from Tokyo on a clear day. It is considered one of Japan's "Three Holy Mountains" along with Mount Tate and Mount Haku. [end of text]

llama_print_timings:        load time =  3374.43 ms
llama_print_timings:      sample time =   135.12 ms /   187 runs   (    0.72 ms per token,  1383.97 tokens per second)
llama_print_timings: prompt eval time =   170.85 ms /    18 tokens (    9.49 ms per token,   105.35 tokens per second)
llama_print_timings:        eval time =  6782.83 ms /   186 runs   (   36.47 ms per token,    27.42 tokens per second)
llama_print_timings:       total time =  7157.56 ms
Looking at the PC's load in Task Manager, there is plenty of headroom.
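Besides Task Manager, GPU utilization and VRAM usage can also be watched from a console with nvidia-smi, which ships with the NVIDIA driver (a sketch; the query fields and the 1-second polling interval are my choice, not from the original entry):

PS> # Print GPU utilization and VRAM usage once per second while main.exe is running
PS> nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1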
test llama2 70B via llama.cpp
PS> build\bin\Release\main -m ../llama_model/llama-2-70b-chat.ggmlv3.q4_K_M.bin --temp 0.1 -p "### Instruction: What is the height of Mount Fuji? ### Response:" -ngl 32 -b 512 -gqa 8
Running the above inference, it took about 17 seconds before output started to appear.
main: build = 937 (86aeb27)
main: seed  = 1693578354
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama.cpp: loading model from ../llama_model/llama-2-70b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 24261.83 MB (+ 160.00 MB per state)
llama_model_load_internal: allocating batch_size x (1536 kB + n_ctx x 416 B) = 872 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/83 layers to GPU
llama_model_load_internal: total VRAM used: 16639 MB
llama_new_context_with_model: kv self size = 160.00 MB

system_info: n_threads = 10 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.100000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 ### Instruction: What is the height of Mount Fuji? ### Response: The height of Mount Fuji is 3,776 meters (12,421 feet) above sea level. [end of text]

llama_print_timings:        load time = 17787.48 ms
llama_print_timings:      sample time =    34.50 ms /    28 runs   (    1.23 ms per token,   811.62 tokens per second)
llama_print_timings: prompt eval time = 12731.89 ms /    18 tokens (  707.33 ms per token,     1.41 tokens per second)
llama_print_timings:        eval time = 34322.90 ms /    27 runs   ( 1271.22 ms per token,     0.79 tokens per second)
llama_print_timings:       total time = 47103.52 ms
As far as Task Manager shows, the load is nothing to speak of.