end0tknr's kipple - web写経開発


test llama2 7B & 70B via llama.cpp


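In the earlier entry linked above, the PC had only 16GB of memory, which is probably why the GPU sat mostly idle while memory and disk were maxed out. So I upgraded to a PC with 64GB of memory and reran the test.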

Specs of the PC used (ThinkPad), excerpt

PS> systeminfo

OS Name:                   Microsoft Windows 11 Pro
OS Version:                10.0.22621 N/A Build 22621
OS Manufacturer:           Microsoft Corporation
System Manufacturer:       LENOVO
Processor(s):              [01]: Intel64 Family 6 Model 186 Stepping 2 GenuineIntel ~1900 Mhz
Total Physical Memory:     65,193 MB

PS> Get-WmiObject -Class Win32_VideoController

Caption               : NVIDIA GeForce RTX 4090

build llama.cpp

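When I git cloned llama.cpp as of 2023/9, main.exe could not load the llama2 models at run time, probably because llama.cpp had moved its model format to GGUF while the files used here are ggmlv3. So I use the slightly older "-b master-86aeb27".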

PS> git clone -b master-86aeb27 https://github.com/ggerganov/llama.cpp
PS> cd llama.cpp
PS> mkdir build
PS> cd build
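PS> cmake .. -DLLAMA_CUBLAS=ON   # configure with cuBLAS (assumed from the ggml_init_cublas lines in the logs)
PS> cmake --build . --config Release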
PS> cd ..
PS> dir build\bin\Release
2023/08/20  20:15         4,423,168 main.exe

download llama2 7B & 70B from huggingface


  • llama-2-7b-chat.ggmlv3.q4_K_M.bin (file size:3.8GB)
  • llama-2-70b-chat.ggmlv3.q4_K_M.bin (file size:38.5GB)
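A rough way to see why the earlier 16GB PC struggled is to compare each model file size with the available RAM. The sketch below is only a back-of-the-envelope check: it treats the file sizes listed above as GiB, ignores whatever part of the model is offloaded to the GPU, and `reserved_gb` is a hypothetical allowance for Windows and other processes.

```python
# File sizes as listed in this entry (treated as GiB; an assumption).
MODELS_GB = {
    "llama-2-7b-chat.ggmlv3.q4_K_M.bin": 3.8,
    "llama-2-70b-chat.ggmlv3.q4_K_M.bin": 38.5,
}


def fits_in_ram(model_gb, ram_gb, reserved_gb=4.0):
    """Return True if the model file fits in RAM after the reserved share.

    reserved_gb is a hypothetical allowance for the OS and other processes.
    """
    return model_gb <= ram_gb - reserved_gb


def report(ram_gb):
    """Map each model file name to whether it fits in ram_gb of memory."""
    return {name: fits_in_ram(size, ram_gb) for name, size in MODELS_GB.items()}


old_pc = report(16.0)            # the earlier 16GB machine
new_pc = report(65193 / 1024)    # 65,193 MB reported by systeminfo

for label, result in (("16GB PC", old_pc), ("64GB PC", new_pc)):
    for name, ok in result.items():
        print(f"{label}: {name} -> {'fits' if ok else 'does not fit'}")
```

With these numbers the 70B file does not fit on the 16GB machine, so the loader has to keep paging it in from disk, which is consistent with the "memory and disk maxed out" symptom.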

test llama2 7B via llama.cpp

PS> build\bin\Release\main -m ../llama_model/llama-2-7b-chat.ggmlv3.q4_K_M.bin --temp 0.1 -p "### Instruction: What is the height of Mount Fuji?  ### Response:" -ngl 32 -b 512


main: build = 937 (86aeb27)
main: seed  = 1693578173
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama.cpp: loading model from ../llama_model/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  =  474.96 MB (+  256.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/35 layers to GPU
llama_model_load_internal: total VRAM used: 4007 MB
llama_new_context_with_model: kv self size  =  256.00 MB

system_info: n_threads = 10 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.100000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 ### Instruction: What is the height of Mount Fuji?  ### Response: The height of Mount Fuji is 3,776 meters (12,424 feet) above sea level. It is located on the main island of Honshu in Japan and is considered one of the country's most iconic landmarks. Mount Fuji is a stratovolcano, meaning it is a composite volcano made up of layers of lava, ash, and other pyroclastic material. It is currently dormant, but has erupted many times in the past, with the last eruption occurring in 1707. The mountain is surrounded by several smaller peaks, including Mount Subashiri, Mount Myojin, and Mount Kita-Fuji, and is visible from Tokyo on a clear day. It is considered one of Japan's "Three Holy Mountains" along with Mount Tate and Mount Haku. [end of text]

llama_print_timings:        load time =  3374.43 ms
llama_print_timings:      sample time =   135.12 ms /   187 runs   (    0.72 ms per token,  1383.97 tokens per second)
llama_print_timings: prompt eval time =   170.85 ms /    18 tokens (    9.49 ms per token,   105.35 tokens per second)
llama_print_timings:        eval time =  6782.83 ms /   186 runs   (   36.47 ms per token,    27.42 tokens per second)
llama_print_timings:       total time =  7157.56 ms
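The "kv self size = 256.00 MB" line in the log above can be reproduced from the hyperparameters printed just before it. The sketch below assumes the KV cache stores one key and one value vector per layer and per context position in 16-bit floats (2 bytes per element), which I believe is the default for this build; the function name `kv_cache_mb` is my own.

```python
def kv_cache_mb(n_layer, n_ctx, n_embd, n_gqa=1, bytes_per_elem=2):
    """Estimate the KV cache size in MB (1 MB = 1024 * 1024 bytes).

    With grouped-query attention, the K/V width shrinks to n_embd / n_gqa.
    bytes_per_elem=2 assumes an f16 cache (an assumption, not read from the log).
    """
    n_embd_kv = n_embd // n_gqa
    # Factor 2: one tensor for keys and one for values.
    total_bytes = 2 * n_layer * n_ctx * n_embd_kv * bytes_per_elem
    return total_bytes / (1024 * 1024)


# 7B values from the log above: n_layer=32, n_ctx=512, n_embd=4096, n_gqa=1
size_7b = kv_cache_mb(n_layer=32, n_ctx=512, n_embd=4096, n_gqa=1)

# 70B values from the later run in this entry: n_layer=80, n_embd=8192, n_gqa=8
size_70b = kv_cache_mb(n_layer=80, n_ctx=512, n_embd=8192, n_gqa=8)

print(f"7B:  {size_7b:.2f} MB")
print(f"70B: {size_70b:.2f} MB")
```

Both results match the values llama.cpp reports (256.00 MB and 160.00 MB). The 70B cache is smaller than the 7B one despite having more layers, because n_gqa = 8 divides the key/value width by eight.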


test llama2 70B via llama.cpp

PS> build\bin\Release\main -m ../llama_model/llama-2-70b-chat.ggmlv3.q4_K_M.bin --temp 0.1 -p "### Instruction: What is the height of Mount Fuji?  ### Response:" -ngl 32 -b 512 -gqa 8


main: build = 937 (86aeb27)
main: seed  = 1693578354
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama.cpp: loading model from ../llama_model/llama-2-70b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 24261.83 MB (+  160.00 MB per state)
llama_model_load_internal: allocating batch_size x (1536 kB + n_ctx x 416 B) = 872 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/83 layers to GPU
llama_model_load_internal: total VRAM used: 16639 MB
llama_new_context_with_model: kv self size  =  160.00 MB

system_info: n_threads = 10 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.100000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 ### Instruction: What is the height of Mount Fuji?  ### Response: The height of Mount Fuji is 3,776 meters (12,421 feet) above sea level. [end of text]

llama_print_timings:        load time = 17787.48 ms
llama_print_timings:      sample time =    34.50 ms /    28 runs   (    1.23 ms per token,   811.62 tokens per second)
llama_print_timings: prompt eval time = 12731.89 ms /    18 tokens (  707.33 ms per token,     1.41 tokens per second)
llama_print_timings:        eval time = 34322.90 ms /    27 runs   ( 1271.22 ms per token,     0.79 tokens per second)
llama_print_timings:       total time = 47103.52 ms
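The 70B run generates 0.79 tokens per second against 27.42 for 7B, and the log shows that only 32 of 83 layers were offloaded with "-ngl 32". As a rough guide for picking a larger -ngl, the sketch below derives an average VRAM cost per layer from the log figures and estimates how many layers would fit in a given budget. The VRAM budget is an assumption (24 GiB with 1 GiB kept free), not something measured in this entry, and the function names are my own.

```python
import math


def per_layer_vram_mb(total_vram_mb, scratch_mb, offloaded_layers):
    """Average VRAM per offloaded layer, after removing the scratch buffer."""
    return (total_vram_mb - scratch_mb) / offloaded_layers


def max_offload_layers(vram_budget_mb, scratch_mb, layer_mb, n_layer):
    """Largest -ngl value whose estimated footprint stays within the budget."""
    usable = vram_budget_mb - scratch_mb
    if usable <= 0:
        return 0
    return min(n_layer, math.floor(usable / layer_mb))


# Figures from the 70B log above: total VRAM used, scratch buffer, offloaded layers.
layer_mb = per_layer_vram_mb(total_vram_mb=16639, scratch_mb=872, offloaded_layers=32)

# ASSUMPTION: 24 GiB of VRAM with 1 GiB kept free; check the real value with nvidia-smi.
budget_mb = 24 * 1024 - 1024

ngl = max_offload_layers(budget_mb, scratch_mb=872, layer_mb=layer_mb, n_layer=80)
print(f"about {layer_mb:.1f} MB per layer, estimated -ngl {ngl} of 80")
```

Under these assumptions the estimate comes out at 46 of the 80 repeating layers, so the whole 70B model would still not fit on the GPU, but a larger -ngl than 32 would likely leave less work for the CPU. This is only an estimate from one log; the actual limit should be confirmed by trying the value.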
