Image by Author | Ideogram.ai
# Introduction
When building large language model applications, tokens are money. If you've ever worked with an LLM like GPT-4, you've probably had that moment where you check the bill and think, "How did it get this high?!" Every API call you make consumes tokens, which directly impacts both latency and cost. But without tracking them, you have no idea where they're being spent or how to optimize.
That's where LangSmith comes in. It not only traces your LLM calls but also lets you log, monitor, and visualize token usage for every step in your workflow. In this guide, we'll cover:
- Why token tracking matters
- How to set up logging
- How to visualize token consumption in the LangSmith dashboard
# Why Does Token Tracking Matter?
Token tracking matters because every interaction with a large language model has a direct cost tied to the number of tokens processed, both in your inputs and the model's outputs. Without tracking, small inefficiencies in prompts, unnecessary context, or redundant requests can silently inflate your bill and slow down performance.
By tracking tokens, you gain visibility into exactly where they are being consumed, so you can optimize prompts, streamline workflows, and keep costs under control. For example, if your chatbot is using 1,500 tokens per request, cutting that down to 800 tokens reduces the cost by almost half.
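To make that arithmetic concrete, here is a minimal sketch that counts tokens with a Hugging Face tokenizer and estimates cost. The per-token prices below are illustrative assumptions, not real rates; substitute your provider's actual pricing.
from transformers import AutoTokenizer

# Hypothetical prices for illustration only -- check your provider's real rates
PRICE_PER_1K_INPUT = 0.0005   # USD per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.0015  # USD per 1,000 output tokens (assumed)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

def estimate_cost(prompt: str, completion: str) -> float:
    # Rough cost estimate from the token counts of the prompt and the completion
    input_tokens = len(tokenizer.encode(prompt))
    output_tokens = len(tokenizer.encode(completion))
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

print(estimate_cost("Explain gravity to a 10-year-old.", "Gravity pulls things toward each other."))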


# Setting Up LangSmith for Token Logging
// Step 1: Install Required Packages
pip3 install langchain langsmith transformers accelerate langchain_community
// Step 2: Make all necessary imports
import os
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langsmith import traceable
// Step 3: Configure LangSmith
Set your API key and project name:
# Replace with your API key
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"
os.environ["LANGCHAIN_PROJECT"] = "HF_FLAN_T5_Base_Demo"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Optional: disable tokenizer parallelism warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"
// Step 4: Load a Hugging Face Model
Use a CPU-friendly model like google/flan-t5-base and enable sampling for more natural outputs:
model_name = "google/flan-t5-base"
pipe = pipeline(
    "text2text-generation",
    model=model_name,
    tokenizer=model_name,
    device=-1,  # CPU
    max_new_tokens=60,
    do_sample=True,  # enable sampling
    temperature=0.7
)
llm = HuggingFacePipeline(pipeline=pipe)
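As an optional quick check (not required for the tracing setup), you can call the pipeline directly before wiring it into a chain; a text2text-generation pipeline returns a list of dicts with a "generated_text" field:
# Direct pipeline call to confirm the model loads and generates
print(pipe("Explain gravity in one sentence.")[0]["generated_text"])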
// Step 5: Create a Prompt and Chain
Define a prompt template and connect it to your Hugging Face pipeline using LLMChain:
prompt_template = PromptTemplate.from_template(
    "Explain gravity to a 10-year-old in about 20 words using a fun analogy."
)
chain = LLMChain(llm=llm, prompt=prompt_template)
// Step 6: Make the Function Traceable with LangSmith
Use the @traceable decorator to automatically log inputs, outputs, token usage, and runtime:
@traceable(name="HF Explain Gravity")
def explain_gravity():
    return chain.run({})
// Step 7: Run the Function and Print Results
answer = explain_gravity()
print("\n=== Hugging Face Model Answer ===")
print(answer)
Output:
=== Hugging Face Model Answer ===
Gravity is a measure of mass of an object.
// Step 8: Check the LangSmith Dashboard
Go to smith.langchain.com → Tracing Projects. You'll see your projects listed there.
You can also see the cost associated with each project, which helps you analyze your billing. To see token usage and other insights, click on your project.


The project view lists all the runs you have made in your project. Click on any run to open it.


Here you can see details such as total tokens, latency, and more. Next, click on the dashboard.
Now you can view graphs over time to track token usage trends, check average latency per request, compare input vs. output tokens, and identify peak usage periods. These insights help you optimize prompts, manage costs, and improve model performance.


Scroll down to view all the graphs associated with your project.
// Step 9: Explore the LangSmith Dashboard
You can dig into plenty of insights, such as:
- View Example Traces: Click on a trace to see detailed execution, including the raw input, generated output, and performance metrics
- Inspect Individual Traces: For each trace, you can explore every step of execution, seeing prompts, outputs, token usage, and latency
- Check Token Usage & Latency: Detailed token counts and processing times help identify bottlenecks and optimize performance
- Evaluate Chains: Use LangSmith's evaluation tools to test scenarios, track model performance, and compare outputs
- Experiment in the Playground: Adjust parameters such as temperature, prompt templates, or sampling settings to fine-tune your model's behavior
With this setup, you now have full visibility into your Hugging Face model runs, token usage, and overall performance in the LangSmith dashboard.
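If you'd rather pull these numbers programmatically, the langsmith SDK exposes a Client for querying runs. The sketch below makes a few assumptions: the project name matches the one set earlier, and field names such as total_tokens may differ across SDK versions, so verify against your installed release.
from langsmith import Client

client = Client()
# List recent runs for the project and report name, token usage, and duration
for run in client.list_runs(project_name="HF_FLAN_T5_Base_Demo", limit=10):
    duration = (run.end_time - run.start_time).total_seconds() if run.end_time else None
    print(run.name, run.total_tokens, duration)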
# How to Spot and Fix Token Hogs
Once you have logging in place, you can:
- See if prompts are too long
- Identify calls where the model is over-generating
- Swap to smaller models for cheaper tasks
- Cache responses to avoid duplicate requests (see the sketch after this section)
This is gold for debugging long chains or agents. Find the step consuming the most tokens and fix it.
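On the caching point, even a tiny in-memory cache helps. This is an illustrative pattern, not a LangSmith feature; it assumes the llm object from the earlier steps and skips the model call when the same prompt has already been answered:
# Simple in-memory cache keyed by prompt text
_response_cache = {}

def cached_run(prompt_text: str) -> str:
    # Only spend tokens on prompts we have not seen before
    if prompt_text not in _response_cache:
        _response_cache[prompt_text] = llm.invoke(prompt_text)
    return _response_cache[prompt_text]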
# Wrapping Up
That's how you set up and use LangSmith. Logging token usage isn't just about saving money; it's about building smarter, more efficient LLM apps. This guide gives you a foundation; you can learn more by exploring, experimenting, and analyzing your own workflows.
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the book "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
