12 May 2026·15 min read

Building your own AI phone agent: FreeSWITCH, a SIP trunk and a custom voice pipeline

When someone calls my business number, I don't pick up; Hank does. Hank is my self-hosted AI phone agent. He holds a natural real-time conversation, has access to my contacts, and can transfer calls or hang up. This post walks through how that setup is built on a single VPS, with no Twilio, no Vapi and no third-party voice-agent vendor.

The stack at a glance

Four layers stack on top of each other. At the bottom a SIP trunk from an Austrian telco (peoplefone) with a nationwide 0720 number. Above it FreeSWITCH as the open-source PBX that terminates SIP and forwards audio frames to the backend. In the backend the voice pipeline: speech-to-text (Google Cloud), an LLM provider of your choice (Anthropic, Groq, OpenAI, …) and text-to-speech (Google Cloud). On top of that, the agent logic with tool-use, contact lookup and pre-rendered greetings.

Everything runs on a single Hetzner VPS. No cloud functions, no per-minute Twilio fees, no third-party TLS termination for audio. The effort only pays off if you have privacy requirements no SaaS can satisfy, or if you want to avoid provider lock-in (latency, LLM choice, languages).

Step 1 — The SIP trunk: peoplefone

A SIP trunk replaces the old phone line with a VoIP connection over the internet. I use peoplefone (AT) — Austrian provider, well-documented CIDR ranges for whitelisting, IP authentication available, clean SIP standards conformance. Trunk host is sips.peoplefone.at (95.128.80.3), failover range in Germany (185.190.125.0/27). DID: a 0720 number (nationwide, non-geographic); the hardware lives entirely on my side.

What matters here isn't the trunk vendor, it's which outbound IP ranges they publish. peoplefone lists every outbound IP in the knowledge base — that's the most important piece of information for the firewall config later. Pick a trunk vendor without documented CIDRs and you cannot cleanly whitelist them.

Step 2 — FreeSWITCH as the SIP bridge

FreeSWITCH has been the open-source PBX standard for 15+ years and does pretty much everything: terminating SIP trunks, codec negotiation, dialplan execution, audio stream forking. I run it as a Docker container in host network mode so it listens directly on the public SIP ports (5060/5061). The image comes from a standard build, configuration lives in a host bind-mount.

Three config files are central. First the external SIP profile (sip_profiles/external.xml): binds codecs (G.711a/μ, OPUS) and sets the inbound ACL. Second the gateway file with trunk credentials. Third a public dialplan (dialplan/public/...) for inbound calls and a default dialplan for outbound.

Audio is streamed via the mod_audio_fork module over a WebSocket to the backend. The backend receives PCM-8-kHz frames in real time, sends synthesized responses back, and FreeSWITCH plays them into the ongoing call. Best-case latency: 60–100 ms from end-of-speech to audio frames arriving at the backend.

Gotchas from real-world use: config edits have to land in the right path — the standard entrypoint.sh writes to /etc/freeswitch/..., but FreeSWITCH itself reads from /usr/local/freeswitch/conf/.... And docker rm kills any changes that live only in the container writable layer — so every config tweak has to also go into the host bind-mount.

Step 3 — Building your own pipeline instead of using Gemini Live

Google offers Gemini Live, a complete end-to-end speech-in/speech-out pipeline. It's impressive but comes with three drawbacks: provider lock-in (Google only), latency tied to Google's servers, and more restricted tool-use than Anthropic Messages or OpenAI-compatible APIs (Groq, Mistral, DeepSeek, …).

That's why Hank runs on a custom pipeline with three separate modules. STT and TTS come from Google Cloud (speech-to-text via gRPC streaming with the latest_long model, text-to-speech REST with voice de-DE-Chirp-HD-D). The LLM in the middle is swappable: Anthropic with native tool-use and SSE streaming, or any OpenAI-compatible API with tool_calls. The telephony settings pick exactly which provider answers.
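One way to keep the LLM swappable is a small provider interface that the rest of the pipeline depends on. The sketch below is only an illustration: `LLMProvider`, `EchoProvider`, `PROVIDERS` and the `llm_provider` settings key are hypothetical names, and the stand-in provider yields canned text instead of calling a real API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, Protocol


class LLMProvider(Protocol):
    """Anything that can stream reply text for a prompt."""

    def stream_reply(self, prompt: str) -> Iterator[str]: ...


@dataclass
class EchoProvider:
    """Stand-in for a real client (Anthropic, Groq, ...); yields canned chunks."""

    name: str

    def stream_reply(self, prompt: str) -> Iterator[str]:
        for word in f"[{self.name}] you said: {prompt}".split(" "):
            yield word + " "


# Hypothetical registry: maps a settings value to a provider factory.
PROVIDERS: Dict[str, Callable[[], LLMProvider]] = {
    "anthropic": lambda: EchoProvider("anthropic"),
    "groq": lambda: EchoProvider("groq"),
}


def provider_from_settings(settings: dict) -> LLMProvider:
    """Pick the provider named in the settings; fail loudly on typos."""
    key = settings.get("llm_provider", "groq")
    try:
        return PROVIDERS[key]()
    except KeyError:
        raise ValueError(f"unknown LLM provider: {key!r}") from None


provider = provider_from_settings({"llm_provider": "anthropic"})
reply = "".join(provider.stream_reply("hello"))
```

Because STT, TTS and the telephony layer only ever see `stream_reply`, switching vendors is a settings change rather than a code change.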

The most important optimisation is sentence-by-sentence streaming. As soon as the LLM has produced one full sentence (detected via .!?… boundaries), that sentence is shipped to TTS while the LLM keeps generating the next. Perceived response latency drops by half: the caller hears the beginning of the answer before the LLM has even finished phrasing the end.
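The sentence detection can be sketched as a generator that buffers streamed chunks and emits a sentence as soon as a terminator shows up. This is a deliberately naive version: it splits on every `.`, `!`, `?` or `…`, so abbreviations and decimals ("3.5") would need extra handling in a real pipeline. The function name is illustrative.

```python
from typing import Iterable, Iterator

SENTENCE_END = ".!?…"


def sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            # Positions of the first occurrence of each terminator, if any.
            hits = [i for i in (buffer.find(c) for c in SENTENCE_END) if i != -1]
            if not hits:
                break
            end = min(hits) + 1
            # Keep runs like "?!" together instead of emitting a lone "!".
            while end < len(buffer) and buffer[end] in SENTENCE_END:
                end += 1
            sentence, buffer = buffer[:end].strip(), buffer[end:]
            if sentence:
                yield sentence
    if buffer.strip():
        yield buffer.strip()  # flush the remainder when the stream ends


out = list(sentences(["Hel", "lo the", "re. How ", "are you? I", " am fine"]))
```

Each yielded sentence can be handed to TTS immediately while the loop keeps consuming LLM output.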

Equally important: barge-in. If the caller interrupts Hank, STT already emits interim transcripts. The pipeline bumps a turn-id and cancels all in-flight LLM and TTS jobs — Hank goes silent immediately, listens, and only replies after. Without barge-in the bot feels robotic because it ploughs on through interruptions.
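A minimal sketch of the turn-id idea with `asyncio`, assuming LLM and TTS jobs run as tasks. `TurnManager` and `speak` are hypothetical names, and the TTS job is simulated with a sleep.

```python
import asyncio
from typing import List, Set


class TurnManager:
    """Tracks the current turn and cancels work that belongs to an older one."""

    def __init__(self) -> None:
        self.turn_id = 0
        self._tasks: Set[asyncio.Task] = set()

    def start(self, coro) -> asyncio.Task:
        """Run an LLM or TTS job as part of the current turn."""
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    def barge_in(self) -> int:
        """Caller spoke: bump the turn id and cancel everything in flight."""
        self.turn_id += 1
        for task in list(self._tasks):
            task.cancel()
        return self.turn_id


async def speak(text: str, spoken: List[str]) -> None:
    """Stand-in for a TTS job; the sleep simulates synthesis and playback."""
    await asyncio.sleep(0.2)
    spoken.append(text)


async def demo():
    turns, spoken = TurnManager(), []
    job = turns.start(speak("Our opening hours are", spoken))
    await asyncio.sleep(0.01)  # an interim transcript arrives mid-sentence
    new_turn = turns.barge_in()
    await asyncio.gather(job, return_exceptions=True)
    return new_turn, job.cancelled(), spoken


result = asyncio.run(demo())
```

In the demo the interrupted sentence never reaches `spoken`. Results that arrive late can additionally be dropped by comparing their turn id against `turns.turn_id`.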

LLM choice matters for voice. Anthropic Sonnet has the best quality but takes ~300 ms to first token. Groq with Llama-3.3-70B manages under 50 ms, which is the difference between "bot" and "person". I use Anthropic for complex requests, Groq for pure voice sessions.

Step 4 — Pre-rendered greetings: 0 ms instead of 1500 ms

A live TTS request takes 800–1500 ms from trigger to first audio frame. Synthesizing the first sentence live on every call is therefore the worst possible first impression — the caller hears silence. The fix: pre-render greetings and cache them.

Whenever the greeting text, the voice, or a contact changes, a background warmGreetingsAsync() writes the TTS output as a PCM-8-kHz file into the greetings cache. Cache key is contact + voice + hash(text). On each call the inbound handler resolves the caller via caller-ID, picks the matching file and ships it as the first audio frame — 0 ms latency. Unknown callers get a generic greeting that asks for their name.
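The cache key described above (contact + voice + hash(text)) could look like this. The file naming, the `.pcm` extension, the `generic` fallback id and both function names are assumptions made for the sketch; the post only specifies the three key components.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def greeting_cache_path(cache_dir: Path, contact_id: str, voice: str, text: str) -> Path:
    """Build the cache file path from contact + voice + hash(text)."""
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{contact_id}_{voice}_{text_hash}.pcm"


def pick_greeting(cache_dir: Path, caller_contact: Optional[str], voice: str,
                  text_for: Dict[str, str]) -> Path:
    """Resolve the file for a known caller, else fall back to the generic one."""
    contact = caller_contact if caller_contact in text_for else "generic"
    return greeting_cache_path(cache_dir, contact, voice, text_for[contact])


texts = {"generic": "Hello, who is calling?", "anna": "Hi Anna!"}
cache = Path("/tmp/greetings")
known = pick_greeting(cache, "anna", "de-DE-Chirp-HD-D", texts)
unknown = pick_greeting(cache, None, "de-DE-Chirp-HD-D", texts)
```

Hashing the text means an edited greeting produces a new file name, so a stale rendering is never served by accident.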

Step 5 — Toll fraud: what happens if SIP is open

Here's the uncomfortable part. If your FreeSWITCH is publicly reachable, you will be attacked. Not "maybe", but guaranteed. The first SIP probes hit within hours of the first public DNS lookup. I underestimated this myself and had to clean up 13,283 CDR entries afterwards.

What the bots try: mass REGISTER packets with spoofed caller-IDs (test, trunk1, 1001, sipvicious), INVITE packets to exotic destinations (Iraq, Inmarsat, Switzerland), and most importantly probes to IPRN numbers (International Premium Rate Numbers). Example: 8818899199 is an Inmarsat premium number. If a call to this destination went through, every minute would cost real money, and a slice flows to the attacker. Classic toll fraud.

On my peak day I measured 3,787 probes in 24 hours. Top attacker IPs came from Contabo (132k probes), Telia (122k), OVH and AWS Tokyo. These are not targeted attacks — it's the background noise of the internet. But that background noise is enough to drain a misconfigured system.

Step 6 — Hardening Phase A

Four defensive layers, in this exact order. Each one reduces the attack surface by an order of magnitude.

1. UFW whitelist on SIP ports. Ports 5060/5061/5080/5081 only see packets from the trunk provider. For me that's two CIDRs: 95.128.80.0/29 (peoplefone AT) and 185.190.125.0/27 (peoplefone DE). Eight rules total, and that's it. That alone blocks ~99% of bot traffic at the network layer, before FreeSWITCH ever reads a packet header.
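The eight rules are simply the cross product of two CIDRs and four ports. The sketch below generates them as `ufw` command strings and checks an address against the ranges with Python's `ipaddress` module. It assumes ufw's default incoming policy is deny and leaves the protocol unspecified; your setup may want explicit `proto udp` or `proto tcp` rules instead.

```python
import ipaddress
from itertools import product

SIP_PORTS = [5060, 5061, 5080, 5081]
TRUNK_CIDRS = ["95.128.80.0/29", "185.190.125.0/27"]  # peoplefone AT and DE


def ufw_rules(cidrs, ports):
    """One allow rule per (CIDR, port) pair."""
    return [f"ufw allow from {cidr} to any port {port}"
            for cidr, port in product(cidrs, ports)]


def is_trunk_ip(ip: str) -> bool:
    """True if the address falls inside one of the whitelisted trunk ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in TRUNK_CIDRS)


rules = ufw_rules(TRUNK_CIDRS, SIP_PORTS)
for rule in rules:
    print(rule)
```

`is_trunk_ip("95.128.80.3")` is true because the trunk host sits inside the /29, while any address outside both ranges is rejected.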

2. FreeSWITCH ACL. Defense in depth: the same IP range goes into autoload_configs/acl.conf.xml, and the external SIP profile references it via apply-inbound-acl="trusted-providers". If UFW is ever bypassed (post-migration misconfig, IPv6 leak), FreeSWITCH itself still filters.

3. Inbound dialplan with caller-ID whitelist and DID match. Anonymous calls (no From header, "anonymous") are dropped. Allowed caller-ID prefixes: AT/DE/CH/IT. Destination must match your DID exactly (^(\+?43720271025)$). Anything else → reject. That way the trunk cannot be abused to relay a call through your real number to a third party.
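The inbound decision boils down to three checks in sequence. The real rules live in the FreeSWITCH XML dialplan; this Python sketch only mirrors the logic and assumes caller IDs arrive normalised to E.164 with a leading `+` (the function name is illustrative).

```python
import re
from typing import Optional

DID_PATTERN = re.compile(r"^(\+?43720271025)$")         # DID regex from the post
ALLOWED_CALLER_PREFIXES = ("+43", "+49", "+41", "+39")  # AT, DE, CH, IT


def accept_inbound(caller_id: Optional[str], destination: str) -> bool:
    """Drop anonymous callers, unknown countries and any foreign destination."""
    if not caller_id or caller_id.strip().lower() == "anonymous":
        return False
    if not caller_id.startswith(ALLOWED_CALLER_PREFIXES):
        return False
    return DID_PATTERN.match(destination) is not None
```

A call from `+436641234567` to `43720271025` passes, while the same caller trying to reach any other destination is rejected, which is what prevents relaying through the trunk.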

4. Outbound dialplan with premium-rate block. This is where most hobby setups die: an open catch-all ^(\+\d+)$ means anyone with trunk access can dial anywhere. Replace it with an explicit AT/DE/CH/IT allowlist, preceded by six premium blocks: AT 0900/0820/0939/010/0118, DE 0900/0137/0180/018x, CH 0900/0901/0906, IT 89x/144/155/199, plus the non-geographic international ranges +882 and +883/+979. Anything matching those falls out before the call is ever set up.
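The same ordering (premium blocks first, country allowlist second, no catch-all) can be sketched in Python. The prefixes are copied from the list above. Converting an E.164 number to national format by prepending a trunk `0` (none for Italy) is an assumption of this sketch, as are the names; the real filter is a set of dialplan regexes.

```python
import re

# country code: (national trunk prefix, premium prefixes as listed in the post)
COUNTRIES = {
    "+43": ("0", ("0900", "0820", "0939", "010", "0118")),  # AT
    "+49": ("0", ("0900", "0137", "0180", "018")),          # DE, "018" covers 018x
    "+41": ("0", ("0900", "0901", "0906")),                 # CH
    "+39": ("", ("89", "144", "155", "199")),               # IT, "89" covers 89x
}
BLOCKED_GLOBAL = ("+882", "+883", "+979")


def allow_outbound(number: str) -> bool:
    """Premium blocks first, then the country allowlist; no catch-all."""
    if not re.fullmatch(r"\+\d+", number):
        return False
    if number.startswith(BLOCKED_GLOBAL):
        return False
    for code, (trunk, premium) in COUNTRIES.items():
        if number.startswith(code):
            national = trunk + number[len(code):]
            return not national.startswith(premium)
    return False  # anything outside AT/DE/CH/IT is rejected
```

The final `return False` is the important line: a number that matches nothing is refused rather than dialled.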

Verification via fs_cli: a simple acl 95.128.80.3 trusted-providers must return true, the same query against an attacker IP false. sofia status gateway shows the trunk registration state, ufw status verbose the allowed sources. When those three checks are green, phase A is done.

Step 7 — What remains (Phase B+)

Phase A takes the system out of the line of fire. For long-term production there are more steps I work through in order: fail2ban for FreeSWITCH (NO_ROUTE entries and auth fails trigger 24-hour bans), trunk password rotation with the provider's IP lock enabled, an audit log for every outbound initiation from the backend (even local processes must present a token), and a server guardian digest emailing daily SIP-probe counts and outbound stats. Only after that does the setup feel operationally stable.
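The ban logic that fail2ban applies can be illustrated with a small in-memory tracker. This is not fail2ban configuration, only a sketch of the idea: the 24-hour ban comes from the text, while the threshold of three failures within ten minutes and the class name are assumptions.

```python
from collections import defaultdict, deque

BAN_SECONDS = 24 * 60 * 60  # 24-hour ban, as described in the post
MAX_FAILURES = 3            # assumption: failures needed to trigger a ban
WINDOW_SECONDS = 600        # assumption: sliding window for counting failures


class BanTracker:
    """Counts NO_ROUTE and auth failures per IP inside a sliding window."""

    def __init__(self) -> None:
        self._failures = defaultdict(deque)
        self._banned_until = {}

    def record_failure(self, ip: str, now: float) -> bool:
        """Register one failure; return True if the IP is banned afterwards."""
        events = self._failures[ip]
        events.append(now)
        while events and now - events[0] > WINDOW_SECONDS:
            events.popleft()  # forget failures older than the window
        if len(events) >= MAX_FAILURES:
            self._banned_until[ip] = now + BAN_SECONDS
        return self.is_banned(ip, now)

    def is_banned(self, ip: str, now: float) -> bool:
        return self._banned_until.get(ip, 0.0) > now
```

Timestamps are passed in explicitly so the logic can be replayed against log lines or tested without waiting.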

What does a minute cost?

Costs split into a fixed block and variable per-minute fees. Fixed: the DID number (0720) costs €20.00 net per year at peoplefone (€24.00 gross) — roughly €1.67/month. Four variable items add up per conversation minute:

  • peoplefone SIP trunk: ~€0.009/min inbound (someone calls you), ~€0.076/min outbound to Austrian mobile, ~€0.022/min to Austrian landlines.
  • Google STT (Chirp 3 HD): $0.016/min of audio — since the caller speaks only ~35% of the time, effective cost is ~€0.005/min.
  • Groq LLM (Llama-3.3-70B): $0.59/1M input + $0.79/1M output tokens, at ~2 turns/min that works out to ~€0.002/min.
  • Google TTS (de-DE-Chirp-HD-D, Studio tier): $160/1M characters — the dominant cost driver at ~€0.027/min. A cheaper alternative (Neural2, $16/1M) costs ten times less (~€0.003/min) at slightly lower voice quality.

Total per conversation minute (inbound, Chirp-HD TTS): ~€0.04/min. With Neural2 TTS that drops to ~€0.02/min.

Extrapolated to 500 inbound minutes/month:

| Item | Chirp-HD | Neural2 |
|---|---|---|
| DID number (share) | €1.67 | €1.67 |
| peoplefone SIP | €4.50 | €4.50 |
| Google STT | €2.58 | €2.58 |
| Groq LLM | €1.14 | €1.14 |
| Google TTS | €13.54 | €1.35 |
| Total (net) | €23.43 | €11.24 |

All figures net. USD/EUR rate ~1:0.92. peoplefone per-minute rates are estimates — check the exact tariffs in the peoplefone portal.
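The monthly figures can be reproduced with a few lines of arithmetic. Two inputs are not stated in the post and are assumptions here: the TTS volume of roughly 184 characters per conversation minute is back-derived from the €13.54 row, and the LLM per-minute cost is taken from the €1.14 row rather than computed from token counts.

```python
USD_TO_EUR = 0.92
MINUTES = 500                 # inbound minutes per month

did_month = 20.00 / 12        # €20 net per year for the 0720 number
sip_min = 0.009               # € per inbound minute (estimate from the post)
stt_min = 0.016 * 0.35 * USD_TO_EUR  # $0.016/min, caller speaks ~35% of the time
llm_min = 1.14 / MINUTES      # assumption: derived from the table row
CHARS_PER_MIN = 184           # assumption: back-derived from the TTS row


def tts_min(usd_per_million_chars: float) -> float:
    """TTS cost in € per conversation minute."""
    return CHARS_PER_MIN * usd_per_million_chars / 1_000_000 * USD_TO_EUR


def monthly_total(tts_usd_per_million_chars: float) -> float:
    """Fixed DID share plus all per-minute items for one month."""
    per_min = sip_min + stt_min + llm_min + tts_min(tts_usd_per_million_chars)
    return round(did_month + MINUTES * per_min, 2)


chirp_hd = monthly_total(160)  # $160 per 1M characters
neural2 = monthly_total(16)    # $16 per 1M characters
print(chirp_hd, neural2)
```

Changing `MINUTES` or the character rate shows quickly how strongly the total depends on TTS pricing.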

Takeaways

1. A self-hosted AI phone agent is doable today with open-source building blocks (FreeSWITCH + STT/TTS + LLM of your choice). Infrastructure cost is low (an €8 VPS is enough), but recurring API fees for STT, TTS and the LLM stack on top: with Google Cloud + Anthropic/Groq, that's anywhere from a few euros to the low three figures per month, depending on call volume. Truly zero cloud cost is only possible with a fully self-hosted stack (Whisper for STT, Coqui/Piper for TTS, a local Llama model), at the price of quality and CPU/GPU requirements.

2. The single biggest knob for latency isn't the model, it's architecture: pre-rendered greetings (zero latency on answer), sentence-by-sentence TTS (response time halved) and barge-in (no talking over the caller).

3. Provider-agnosticism isn't a luxury, it's insurance. Language, speed, cost and compliance change, so the pipeline must be able to swap LLM providers without touching FreeSWITCH or TTS.

4. If your SIP port is open, you will be attacked. A provider CIDR whitelist on the firewall is the single most important defence. Without it, everything else (ACL, dialplan, premium block) is just damage control.

5. Outbound dialplans without a catch-all — this is the rule that prevents premium-rate fraud. Better to start too strict and whitelist numbers manually than to start too loose and wake up to a four-figure phone bill.

The whole setup lives behind a single 0720 number. If you're interested in the code (voice pipeline, greeting cache, hardening snippets) — the matching AI coworker HeyHank will be published under MIT with the public launch.

Questions or feedback? office@markusstoeger.com