https://end0tknr.hateblo.jp/entry/20230821/1692569142
In the previous entry linked above, perhaps because the PC had only 16 GB of memory, the GPU sat nearly idle while memory and disk were maxed out. So I upgraded to a PC with 64 GB of memory and re-ran the test.
Specs of the PC used (ThinkPad), excerpt
PS> systeminfo
OS 名:                  Microsoft Windows 11 Pro
OS バージョン:          10.0.22621 N/A ビルド 22621
OS 製造元:              Microsoft Corporation
システム製造元:         LENOVO
プロセッサ:             [01]: Intel64 Family 6 Model 186 Stepping 2 GenuineIntel ~1900 Mhz
物理メモリの合計:       65,193 MB

PS> Get-WmiObject -Class Win32_VideoController
Caption : NVIDIA GeForce RTX 4090
build llama.cpp
With llama.cpp cloned as of Sept 2023, main.exe could not load the llama2 model (presumably because llama.cpp had just switched from the GGML model format to GGUF), so I am using the slightly older tag "master-86aeb27".
PS> git clone -b master-86aeb27 https://github.com/ggerganov/llama.cpp
PS> cd llama.cpp
PS> mkdir build
PS> cd build
PS> cmake .. -DLLAMA_CUBLAS=ON
PS> cmake --build . --config Release
PS> cd ..
PS> dir build\bin\Release
<snip>
2023/08/20  20:15      4,423,168 main.exe
<snip>
download llama2 7B & 70B from huggingface
Specifically, the following files:
- llama-2-7b-chat.ggmlv3.q4_K_M.bin (file size:3.8GB)
- llama-2-70b-chat.ggmlv3.q4_K_M.bin (file size:38.5GB)
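For reference, a minimal sketch of how these files could be fetched from PowerShell, assuming they come from TheBloke's GGML conversions on Hugging Face (the repository names and URLs are my assumption; the original entry does not say where the files were downloaded from):

PS> # Assumed source repos: TheBloke/Llama-2-7B-Chat-GGML and TheBloke/Llama-2-70B-Chat-GGML
PS> mkdir ..\llama_model
PS> Invoke-WebRequest -OutFile ..\llama_model\llama-2-7b-chat.ggmlv3.q4_K_M.bin `
      -Uri "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_M.bin"
PS> Invoke-WebRequest -OutFile ..\llama_model\llama-2-70b-chat.ggmlv3.q4_K_M.bin `
      -Uri "https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/resolve/main/llama-2-70b-chat.ggmlv3.q4_K_M.bin"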
test llama2 7B via llama.cpp
PS> build\bin\Release\main -m ../llama_model/llama-2-7b-chat.ggmlv3.q4_K_M.bin --temp 0.1 -p "### Instruction: What is the height of Mount Fuji? ### Response:" -ngl 32 -b 512
Running the above inference, the following output starts to appear after about 3 seconds.
main: build = 937 (86aeb27)
main: seed  = 1693578173
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama.cpp: loading model from ../llama_model/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 474.96 MB (+ 256.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/35 layers to GPU
llama_model_load_internal: total VRAM used: 4007 MB
llama_new_context_with_model: kv self size = 256.00 MB

system_info: n_threads = 10 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.100000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 ### Instruction: What is the height of Mount Fuji? ### Response: The height of Mount Fuji is 3,776 meters (12,424 feet) above sea level. It is located on the main island of Honshu in Japan and is considered one of the country's most iconic landmarks. Mount Fuji is a stratovolcano, meaning it is a composite volcano made up of layers of lava, ash, and other pyroclastic material. It is currently dormant, but has erupted many times in the past, with the last eruption occurring in 1707. The mountain is surrounded by several smaller peaks, including Mount Subashiri, Mount Myojin, and Mount Kita-Fuji, and is visible from Tokyo on a clear day. It is considered one of Japan's "Three Holy Mountains" along with Mount Tate and Mount Haku. [end of text]

llama_print_timings:        load time =  3374.43 ms
llama_print_timings:      sample time =   135.12 ms /   187 runs   (    0.72 ms per token,  1383.97 tokens per second)
llama_print_timings: prompt eval time =   170.85 ms /    18 tokens (    9.49 ms per token,   105.35 tokens per second)
llama_print_timings:        eval time =  6782.83 ms /   186 runs   (   36.47 ms per token,    27.42 tokens per second)
llama_print_timings:       total time =  7157.56 ms
Looking at the PC's load in Task Manager, there is plenty of headroom.
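Besides Task Manager, GPU utilization and VRAM usage can also be watched from a console with nvidia-smi, which ships with the NVIDIA driver (a sketch; the query fields and the 1-second polling interval are my choice, not from the original entry):

PS> # Print GPU utilization and VRAM usage once per second while main.exe is running
PS> nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1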
test llama2 70B via llama.cpp
PS> build\bin\Release\main -m ../llama_model/llama-2-70b-chat.ggmlv3.q4_K_M.bin --temp 0.1 -p "### Instruction: What is the height of Mount Fuji? ### Response:" -ngl 32 -b 512 -gqa 8
Running the above inference, it took about 17 seconds before output started to appear.
main: build = 937 (86aeb27)
main: seed  = 1693578354
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama.cpp: loading model from ../llama_model/llama-2-70b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 24261.83 MB (+ 160.00 MB per state)
llama_model_load_internal: allocating batch_size x (1536 kB + n_ctx x 416 B) = 872 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/83 layers to GPU
llama_model_load_internal: total VRAM used: 16639 MB
llama_new_context_with_model: kv self size = 160.00 MB

system_info: n_threads = 10 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.100000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 ### Instruction: What is the height of Mount Fuji? ### Response: The height of Mount Fuji is 3,776 meters (12,421 feet) above sea level. [end of text]

llama_print_timings:        load time = 17787.48 ms
llama_print_timings:      sample time =    34.50 ms /    28 runs   (    1.23 ms per token,   811.62 tokens per second)
llama_print_timings: prompt eval time = 12731.89 ms /    18 tokens (  707.33 ms per token,     1.41 tokens per second)
llama_print_timings:        eval time = 34322.90 ms /    27 runs   ( 1271.22 ms per token,     0.79 tokens per second)
llama_print_timings:       total time = 47103.52 ms
As far as Task Manager shows, the load is nothing to speak of.