Running OpenCode Agentic AI workflows with local LLM models on DCC
This is a guide on setting up OpenCode to run agentic AI workflows with local LLM models on DCC. The model is served through an Ollama instance hosted on a GPU node, and OpenCode can run on any node as a client connecting to that server.
The files needed for this example can be found at: https://github.com/DukeRC/code/tree/main/Running-OpenCode-Agentic-AI-workflows-with-local-LLM-models-on-DCC
In certain scenarios you may want to move from cloud-based LLMs to locally hosted ones for your agentic AI workflows, as they offer benefits such as data privacy, offline functionality, and freedom from subscription fees, API rate limits, and censorship.
In this guide, we will launch an Ollama server on a GPU node and set up OpenCode to communicate with the models hosted there for agentic AI workflows. As a prerequisite, please set up OpenCode on DCC as described in the tutorial Setting up OpenCode on DCC.
The next steps involve setting up Ollama and starting the Ollama server on a GPU node (through an interactive Slurm job or a batch job script), as described in the Initial setup and Starting Ollama server on the GPU node sections of the tutorial Running an LLM server on DCC with Ollama. Essentially, OpenCode just replaces the Running inference sessions part of that tutorial. For completeness, we will follow these steps from the start.
Initial setup
Usually, on HPC clusters, you won't have root privileges to install applications, so we will first build a simple Apptainer container with Ollama in it. If you already have ollama, you may skip this setup. Create the ollama.def file shown below in a working directory on your cluster. For this guide, the working directory where I keep all the files and models mentioned is /work/${USER}/ollama.
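The exact ollama.def used for this guide is in the repository linked above; a minimal sketch that simply wraps the official ollama/ollama Docker image (the latest tag here is an assumption, pin a specific version if you prefer) would be:
Bootstrap: docker
From: ollama/ollama:latest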
Then build the container with,
export APPTAINER_CACHEDIR=/cwork/${USER}/tmp
export APPTAINER_TMPDIR=/cwork/${USER}/tmp
apptainer build ollama.sif ollama.def
If successful, the Ollama container, ollama.sif, will be built in the same directory. We changed the Apptainer tmp and cache directories because the default /tmp sometimes fills up, leading to build failures.
You may also download an ollama binary through conda (assuming you have created a conda environment for this workflow) with,
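# assumes the ollama package is published on the conda-forge channel; adjust the channel to match your setup
conda install -c conda-forge ollama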
however, I noticed that, unlike the original ollama application, this build does not utilize the GPUs when run as the server, although it works fine as the client for inference. So install it anyway.
Starting Ollama server on the GPU node
We will host the Ollama server on an H200 GPU node and run OpenCode from a different node.
1. Create a bash script, ollama_server_apptainer.sh and modify it to suit your environment,
#!/bin/bash
# Configuration
CONTAINER_IMAGE="/work/${USER}/ollama/ollama.sif"
INSTANCE_NAME="ollama-$USER"
MODEL_PATH="/work/${USER}/ollama/models"
PORT=11434
# Unset variables to avoid conflicts
unset ROCR_VISIBLE_DEVICES
# Start Apptainer instance with GPU support and a writable tmpfs
apptainer instance start \
--nv \
--writable-tmpfs \
--bind "$MODEL_PATH" \
"$CONTAINER_IMAGE" "$INSTANCE_NAME"
# Start Ollama serve inside the container in the background
apptainer exec \
--env OLLAMA_MODELS="$MODEL_PATH" \
--env OLLAMA_HOST="0.0.0.0:$PORT" \
instance://$INSTANCE_NAME \
ollama serve &
echo "🦙 Ollama is now serving at http://$(hostname -f):$PORT"
echo "Run the following command on the client shell to connect to the Ollama server:"
echo "export OLLAMA_HOST=http://$(hostname -f):$PORT"
This will use the Apptainer container, ollama.sif, we built earlier to host a server in the background that looks for models stored in /work/${USER}/ollama/models. It is important to host on 0.0.0.0 so that the server listens on all available network interfaces and not just localhost. Request an interactive session or submit a Slurm job script to get a GPU node. I will request an interactive session on a node with one H200 GPU and 300 GB of RAM for 2 hours with,
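# the partition (-p) and the h200 GRES type below are assumptions; adjust them to match your cluster
srun -p gpu-common --gres=gpu:h200:1 --mem=300G --time=2:00:00 --pty bash -i
This is only a sketch: substitute the partition and GPU resources your account actually has access to (check with sinfo).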
Once you are on the GPU node, start the Ollama server by running the bash script ollama_server_apptainer.sh,
You will see something like the following,
(ai) ukh at dcc-h200-gpu-06 in /work/ukh/ollama
$ ./ollama_server_apptainer.sh
INFO: Instance stats will not be available - requires cgroups v2 with systemd as manager.
INFO: instance started successfully
🦙 Ollama is now serving at http://dcc-h200-gpu-06.rc.duke.edu:11434
Run the following command on the client shell to connect to the Ollama server:
export OLLAMA_HOST=http://dcc-h200-gpu-06.rc.duke.edu:11434
It is important to note the host (http://dcc-h200-gpu-06.rc.duke.edu) and the port (11434) the server is broadcasting on, as we will need this information for inference. If you receive an Error: listen tcp 0.0.0.0:11434: bind: address already in use message, change the PORT number and try again. If you already have an ollama setup in your environment that doesn't require Apptainer, you could use the following script, ollama_server.sh, in place of ollama_server_apptainer.sh.
#!/bin/bash
# Configuration
MODEL_PATH="/work/${USER}/ollama/models"
PORT=11434
# Unset variables to avoid conflicts
unset ROCR_VISIBLE_DEVICES
# Model path
export OLLAMA_MODELS="$MODEL_PATH"
# Bind to all interfaces on that node, on port 11434
export OLLAMA_HOST="0.0.0.0:$PORT"
# Start Ollama server in the background
ollama serve &
echo "🦙 Ollama is now serving at http://$(hostname -f):$PORT"
echo "Run the following command on the client shell to connect to the Ollama server"
echo "export OLLAMA_HOST=http://$(hostname -f):$PORT"
Now that we have the Ollama server hosted, let's see how we can use it to run agentic AI workflows with OpenCode. Note that the server can also be launched by submitting a Slurm batch job script instead of the interactive session; a minimal sketch is shown below.
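The partition name and resource requests in this sketch are assumptions mirroring the interactive example above, so adjust them for your allocation:
#!/bin/bash
#SBATCH --job-name=ollama-server
# The partition name below is an assumption; replace it with a GPU partition you have access to
#SBATCH --partition=gpu-common
#SBATCH --gres=gpu:1
#SBATCH --mem=300G
#SBATCH --time=02:00:00
#SBATCH --output=ollama_server_%j.out

# Start the Ollama server; the script backgrounds ollama serve and prints the host/port to the .out file
./ollama_server_apptainer.sh

# Keep the job (and therefore the server) alive for the requested walltime
sleep infinity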
Connecting OpenCode to the Ollama server
Next, we need to get OpenCode to talk to the Ollama server to use local LLM models.
Copy the following opencode.json configuration file to ~/.config/opencode/opencode.json.
{
"$schema": "https://opencode.ai/config.json",
"default_agent": "plan",
"permission": {
"edit": "ask"
},
"provider": {
"openai": {
"name": "OpenAI via LiteLLM",
"options": {
"baseURL": "https://litellm.oit.duke.edu/v1",
"litellmProxy": true,
"apiKey":"{env:LITELLM_TOKEN}"
}
},
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"name": "Ollama (local)",
"options": {
"baseURL": "{env:OLLAMA_HOST}/v1"
},
"models": {
"qwen3.6": {},
"glm-4.7-flash": {}
}
}
}
}
You may be familiar with this configuration file from the Setting up OpenCode on DCC guide; we have simply added the ollama provider section to enable OpenCode to retrieve models from Ollama. Here we list the models qwen3.6 and glm-4.7-flash, but you may download other models from the Ollama library or Hugging Face. That can be done from any node, after launching the Ollama server on the GPU node as above, by running the following commands (replace OLLAMA_HOST with the value for your setup),
export OLLAMA_HOST="http://dcc-h200-gpu-06.rc.duke.edu:11434"
export OLLAMA_MODELS=/work/${USER}/ollama/models
ollama pull <model_name>
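To confirm that the models were downloaded and are visible to the server, you can list them from the same client shell:
ollama list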
Important!
If you will be using OpenCode to perform computationally intensive tasks, first request an interactive session on DCC and run it there.
Next, start OpenCode with the command,
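# launch OpenCode from your project directory; OLLAMA_HOST must be set in this shell, as exported above
opencode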
and run the /models command there to select a model. You will see the local LLM models served through Ollama under the category, "Ollama (local)".
Now you can run your agentic AI workflows without having to worry about data privacy or exhausting your cloud AI tokens. An example agentic workflow to perform an Equation of State (EOS) analysis can be found in the tutorial, Performing Equation of State Analysis using Agentic-AI-Skills.md.
Once you are done with your session, stop the server we started on the GPU node with,
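# if you used the Apptainer-based script, stop the container instance (this also stops ollama serve inside it)
apptainer instance stop ollama-$USER
# if you used the plain ollama_server.sh script instead, stop the background server process
pkill -u $USER -f "ollama serve"
The instance name here matches the INSTANCE_NAME variable set in ollama_server_apptainer.sh; adjust it if you changed that value.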