Running LLMs with Ollama on Seeweb GPU Servers: A Hands-on Guide


Seeweb offers GPU-enabled servers that can run large language models (LLMs) quickly
and smoothly. You can even run multiple instances of LLMs on a single GPU. In this
tutorial, you will learn how to:

  • Provision a Seeweb GPU server
  • Install Ollama
  • Run inference
  • Monitor GPU and model performance

Step 1: Provisioning a GPU Server

Before diving into Ollama, we need to ensure we have the right infrastructure in place. This involves selecting a GPU-enabled server and verifying our hardware capabilities.

1. Provision a GPU server with pre-installed NVIDIA drivers. In this case, we have chosen an RTX6000 with 32GB of VRAM, running Ubuntu 22.04 LTS with NVIDIA drivers pre-installed.

2. Once the server is up, grab its IP address and connect to it using the command below.

$ ssh -i ~/.ssh/my_key.pem root@<ip_address_of_instance>

3. Then, verify that the GPU is available using the following command.

$ nvidia-smi

This will show you the GPU’s model, its memory capacity, and the installed driver version.
It also reports current usage metrics; we will see later how these change when we
run a model.

Step 2: Install Ollama

Installing Ollama is straightforward: you can simply run the official installation script.
Let’s walk through it and verify that everything works correctly.

1. Install using Ollama’s official script. 

$ curl -fsSL https://ollama.com/install.sh | sh

2. Then, verify the installation using the following command.

$ ollama

Running this command will list the commands available in Ollama.

3. You can also check the installed version of Ollama using the following command. Note that no models are installed at this point; we will pull one in the next step.

$ ollama --version
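
On Linux, the official install script typically registers Ollama as a systemd service and starts a local API server on port 11434. Assuming a systemd-based distribution such as Ubuntu 22.04, you can confirm the server is up with the checks below.

$ systemctl status ollama        # the service should be active (running)
$ curl http://localhost:11434    # the API root replies with a short status message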

Step 3: Explore Models 

Ollama provides access to various open-source models. Let’s start with a smaller model to understand the basics of model management and interaction.

1. Use Ollama’s library to browse the available models. For this demonstration, we will use the llama3.1 model.

Pro tip: Run the $ ollama list command before pulling any models to check whether the model is already installed. In our case, there are none, so the command returns an empty list.

2. Pull the llama3.1 model for initial testing using the following command.

$ ollama pull llama3.1:8b 

3. Verify that the model was installed by re-running the $ ollama list command; llama3.1:8b should now appear in the list.
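
Once the model is listed, you can also inspect its metadata, such as parameter count, quantization, and context length, with the ollama show command.

$ ollama show llama3.1:8b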

Step 4: Run Your First Inferences

With the model installed, we can now interact with it in two ways: through the command line and through the HTTP API. Examples of both methods are shown below.

Basic CLI Interaction 

Run the following command on the CLI. 

$ ollama run llama3.1:8b "You are on a deserted island. List 5 creative ways to signal for help using only items from nature."
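
You can also start an interactive chat session by omitting the prompt; Ollama then drops you into a prompt where you can type messages and exit with /bye.

$ ollama run llama3.1:8b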

API-based Interaction

Run the following command on the CLI.

$ curl http://localhost:11434/api/generate -d '{ 
"model": "llama3.1:8b",
"prompt": "Write a story about a coffee mug that grants wishes",
"stream": false
}'
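
With "stream": false, the response comes back as a single JSON object whose generated text is in the response field. If jq happens to be installed on the server, you can extract just the text, as in the sketch below.

$ curl -s http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Write a story about a coffee mug that grants wishes",
"stream": false
}' | jq -r '.response'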

Step 5: Monitor the Performance of the Model

Understanding resource utilization is crucial when running LLMs. You can easily monitor how the model uses the GPU and system resources.

To watch the GPU’s utilization, run the following command.

$ watch -n 1 nvidia-smi

And to check which models Ollama is currently running, use the following command.

$ ollama ps
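
If Ollama was installed through the official script on a systemd-based system, its server logs are also available via journalctl, which is handy for spotting out-of-memory errors or slow model loads.

$ journalctl -u ollama -f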

Step 6: [Optional] Proceed to Testing Larger Models

After you get comfortable with a smaller model, you can begin exploring more capable models. These are useful for complex reasoning tasks and can be run on the same GPU server.

Use Ollama’s library to find a model that fits your purpose and pull it. In the following example, we will pull DeepSeek’s R1 model.

$ ollama pull deepseek-r1:32b

Then, run a complex reasoning test using the following command.

$ ollama run deepseek-r1:32b "If humans could suddenly photosynthesize like plants, how would it change our society? Consider aspects like food production, urban planning, and daily routines."
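
To compare how the larger model performs against llama3.1:8b, you can add the --verbose flag to ollama run, which prints timing statistics (load time, prompt and generation token rates) after the response.

$ ollama run deepseek-r1:32b --verbose "Summarize the trade-offs of photosynthesizing humans in three bullet points."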

Next Steps

Moving beyond this initial setup, you can enhance your Ollama deployment for production use. The first consideration should be setting up proper monitoring and logging to track model performance, GPU utilization, and API usage patterns. For larger workloads, Ollama supports multi-GPU configurations and can be integrated into a distributed system architecture.
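
For example, if other services need to reach the Ollama API over the network rather than only from localhost, one documented option on systemd-based installs is to set the OLLAMA_HOST environment variable in a service override. The sketch below assumes the default port 11434 and that you have reviewed your firewall rules first.

$ sudo systemctl edit ollama
# in the override file, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
$ sudo systemctl restart ollama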

Regardless of your specific use case, remember that running LLMs requires careful attention to resource management and performance optimization. Regular monitoring of GPU utilization, memory usage, and response times will help ensure optimal performance as your deployment scales.


