Fix: LlamaIndex Not Working — Import Errors, Vector Store Issues, and Query Engine Failures
Quick Answer
How to fix LlamaIndex errors — ImportError llama_index.core module not found, ServiceContext deprecated use Settings instead, vector store index not persisting, query engine returns irrelevant results, and LlamaIndex 0.10 migration.
The Error
You install LlamaIndex and the imports fail:
ImportError: cannot import name 'GPTSimpleVectorIndex' from 'llama_index'
ModuleNotFoundError: No module named 'llama_index.core'
Or ServiceContext raises a deprecation warning on every call:
DeprecationWarning: ServiceContext is deprecated, please use Settings instead.
Or you build an index, restart your script, and it’s gone:
index = VectorStoreIndex.from_documents(documents)
# Restart script
index = ??? # How do I load it back?
Or query results come back irrelevant because the retriever returns the wrong chunks:
response = query_engine.query("What is the CEO's name?")
print(response) # "This document discusses financial projections..."
# Nothing about the CEO
LlamaIndex went through a major restructuring in version 0.10 (February 2024): the package was split into llama-index-core plus dozens of integration packages. Code written for 0.9 or earlier breaks immediately on import. The ServiceContext object was replaced with a Settings singleton. This guide covers the migration plus common RAG-specific errors.
Why This Happens
LlamaIndex 0.10 split the monolithic llama-index package into a core library and many small integration packages (llama-index-vector-stores-chroma, llama-index-embeddings-openai, etc.). This keeps the core install small but means every feature requires its own pip install. Code that worked with 0.9’s single-package import model fails immediately.
RAG quality problems come from three places: the document chunking (too large, too small, or poorly aligned with sentence boundaries), the embedding model (wrong domain, wrong language), and the retrieval strategy (top-k alone often misses relevant content buried in a single long document).
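To make the chunking point concrete, here is a minimal, library-free sketch of how overlap can keep a fact from being split across chunk boundaries. The `chunk_words` helper and the sample sentence are hypothetical, and it splits on words rather than tokens, so treat it as an illustration of the idea and not as how LlamaIndex chunks text.

```python
def chunk_words(words, chunk_size, chunk_overlap):
    """Split a list of words into fixed-size windows that share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


text = "The board met in March . The CEO is Jane Doe . Revenue grew last year ."
words = text.split()

no_overlap = chunk_words(words, chunk_size=8, chunk_overlap=0)
overlapped = chunk_words(words, chunk_size=8, chunk_overlap=4)

for chunk in overlapped:
    print(" ".join(chunk))
```

Without overlap, the sentence naming the CEO is cut in half between two chunks, so no single chunk can answer the question. With an overlap of four words, one chunk contains the whole sentence.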
Diagnostic Timeline
When your RAG pipeline returns wrong answers, the reflex is “check the index.” That blames the wrong layer roughly 80% of the time. Follow this timeline before re-indexing anything.
Minute 0 — Wrong first instinct. You assume the index is stale or corrupt, blow it away, and re-index from scratch. Twenty minutes later you ask the same question and get the same wrong answer. Re-indexing fixes corruption, but corruption is rare; what is common is that the index is built correctly and the query path uses different settings than the indexing path. Re-indexing cannot fix that.
Minute 1 — Discriminating evidence. Print the retrieved nodes before judging the answer:
response = query_engine.query("What is the CEO's name?")
for node in response.source_nodes:
    print(node.score, node.text[:200])
If the source nodes contain the answer but the LLM ignored it, the bug is in the prompt or the LLM (prompt template, context window truncation). If the source nodes do not contain the answer, the bug is in retrieval (embeddings, chunking, or top_k). That single check redirects the next 30 minutes of debugging.
Minute 2 — Next check. Compare the embedding model used at indexing time with the one used at query time. Since 0.10, Settings is a process-wide singleton, and it is not saved with the index. If you set Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5") to build the index, and a later script loads it with Settings.embed_model left at its default (OpenAI's text-embedding-ada-002 at the time of writing), you are querying a 384-dim index with 1536-dim vectors. Some vector stores raise; many silently return nonsense. Print Settings.embed_model.__class__.__name__ and the model name in both processes.
Minute 3 — Actual root cause. The two failure modes that account for almost every “RAG returns wrong answer” case:
- Embedding model mismatch between indexing and querying. Already covered above. The fix is to persist embedding metadata alongside the index and assert it on load. A one-line check at startup saves hours of debugging.
- Storage context not persisted. You called VectorStoreIndex.from_documents(docs), queried it in the same process, and got correct answers. You restarted the script and the index is gone, because you never called index.storage_context.persist(persist_dir=...). The in-memory SimpleVectorStore has no disk backing. The symptom is “the same query returns different answers after restart”; the cause is no persistence layer. Fix 3 covers the persistence call you missed.
If both check out, then look at the index itself. By that point you actually have a hypothesis.
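The "persist embedding metadata and assert it on load" check could look roughly like the sketch below. The file name `embed_meta.json` and both helper functions are hypothetical (LlamaIndex does not write this file for you); the sketch only uses the standard library so it can sit next to whatever persistence call you already have.

```python
import json
import tempfile
from pathlib import Path

META_FILE = "embed_meta.json"  # hypothetical sidecar file, not created by LlamaIndex


def save_embed_meta(persist_dir, model_name, dim):
    """Record which embedding model built the index, next to the persisted index."""
    path = Path(persist_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / META_FILE).write_text(json.dumps({"model_name": model_name, "dim": dim}))


def assert_embed_meta(persist_dir, model_name, dim):
    """Fail loudly if the query-time model differs from the one recorded at build time."""
    saved = json.loads((Path(persist_dir) / META_FILE).read_text())
    current = {"model_name": model_name, "dim": dim}
    if saved != current:
        raise RuntimeError(
            f"Embedding mismatch: index built with {saved}, querying with {current}"
        )


# Demo in a temporary directory
demo_dir = tempfile.mkdtemp()
save_embed_meta(demo_dir, "BAAI/bge-small-en-v1.5", 384)
assert_embed_meta(demo_dir, "BAAI/bge-small-en-v1.5", 384)  # same model: passes silently
```

Call the save helper right after persisting the index and the assert helper right before loading it, passing whatever model name and dimension your query process is configured with.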
Fix 1: LlamaIndex 0.10+ Migration
The 0.10 release reorganized everything. Here’s the migration:
Old (0.9 and earlier):
from llama_index import (
    GPTSimpleVectorIndex,
    SimpleDirectoryReader,
    ServiceContext,
    LLMPredictor,
)
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding
New (0.10+):
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
    StorageContext,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
Install the packages you need:
# Core library (required)
pip install llama-index-core
# LLM providers — install the one(s) you use
pip install llama-index-llms-openai
pip install llama-index-llms-anthropic
pip install llama-index-llms-huggingface
pip install llama-index-llms-ollama
# Embedding providers
pip install llama-index-embeddings-openai
pip install llama-index-embeddings-huggingface
pip install llama-index-embeddings-cohere
# Vector stores
pip install llama-index-vector-stores-chroma
pip install llama-index-vector-stores-qdrant
pip install llama-index-vector-stores-pinecone
# Readers (document loaders)
pip install llama-index-readers-file # PDF, DOCX, etc.
pip install llama-index-readers-web # URL-based
pip install llama-index-readers-database # SQL
Or install the starter bundle, which pulls in the core plus a set of common integrations:
pip install llama-index # Installs llama-index-core + common integrations
Common Mistake: Copying code from old tutorials that use from llama_index import .... The top-level llama_index namespace no longer exposes most symbols; everything moved to llama_index.core or provider subpackages. If you see ImportError on what looks like a standard LlamaIndex import, check the 0.10 migration guide and replace it with the new path.
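If you have a larger codebase to migrate, a quick scan for old-style import lines can help you find what needs changing. The sketch below is a hypothetical helper, not part of LlamaIndex, and its patterns only cover the import forms shown in this section, so treat it as a starting point.

```python
import re

# Patterns for pre-0.10 import paths (illustrative, not exhaustive)
LEGACY_PATTERNS = [
    re.compile(r"^\s*from\s+llama_index\s+import\b"),
    re.compile(r"^\s*from\s+llama_index\.(llms|embeddings)\s+import\b"),
]


def find_legacy_imports(source):
    """Return (line_number, line) pairs for lines that look like pre-0.10 imports."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in LEGACY_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits


sample = """\
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
"""

hits = find_legacy_imports(sample)
for lineno, line in hits:
    print(f"line {lineno}: {line}")
```

In the sample, only the first two lines are flagged; the `llama_index.core` and `llama_index.llms.openai` imports are already in the 0.10+ form.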
Fix 2: ServiceContext Deprecated — Use Settings
DeprecationWarning: ServiceContext is deprecated, please use Settings instead.
The 0.10 release replaced the per-call ServiceContext with a global Settings singleton.
Old pattern:
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4"),
    embed_model=OpenAIEmbedding(),
    chunk_size=512,
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
New pattern:
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Set globally — applies to all subsequent operations
Settings.llm = OpenAI(model="gpt-4")
Settings.embed_model = OpenAIEmbedding()
Settings.chunk_size = 512
Settings.chunk_overlap = 50
# No need to pass service_context anymore
index = VectorStoreIndex.from_documents(documents)
Pro Tip: Set Settings once at the top of your script or in a config module. If you need different LLMs or embeddings in different parts of the app, pass them explicitly to the specific component rather than mutating Settings mid-flight. Mutating global state makes debugging much harder.
Per-component override:
from llama_index.llms.openai import OpenAI
# Use a specific LLM for this query engine only
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-4-turbo"))
Fix 3: Persisting and Loading Indexes
index = VectorStoreIndex.from_documents(documents)
# ... script ends, all work is lost
Save the index to disk:
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
index = VectorStoreIndex.from_documents(documents)
# Persist
index.storage_context.persist(persist_dir="./index_storage")
Load later:
from llama_index.core import StorageContext, load_index_from_storage
# Load back
storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)
# Use as before
query_engine = index.as_query_engine()
response = query_engine.query("What's the main topic?")
For production, use a dedicated vector store:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
# Chroma client (persistent)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Build index (stores vectors in Chroma automatically)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
# Load later: just connect to Chroma
index = VectorStoreIndex.from_vector_store(vector_store)
Qdrant (for production scale):
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
Fix 4: Poor Query Results — Chunking and Retrieval
Vague or wrong answers usually mean the retriever isn’t finding the right chunks.
Step 1: Tune chunk size to your documents.
from llama_index.core import Settings
# Pick ONE of the following pairs; each assignment overwrites the previous one.
# Default is 1024 tokens, good for mixed content
Settings.chunk_size = 1024
Settings.chunk_overlap = 100
# Smaller for specific facts (FAQ, Q&A)
Settings.chunk_size = 512
Settings.chunk_overlap = 50
# Larger for long-form documents (narratives, technical docs)
Settings.chunk_size = 2048
Settings.chunk_overlap = 200
Step 2: Use sentence-aware splitting.
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=100,
    paragraph_separator="\n\n\n",
)
nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
Step 3: Increase top-k.
# Default top_k=2 — often too few
query_engine = index.as_query_engine(similarity_top_k=5)
Step 4: Add a re-ranker for higher precision:
from llama_index.core.postprocessor import SentenceTransformerRerank
# Get more candidates, re-rank, use top results
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3,
)
query_engine = index.as_query_engine(
    similarity_top_k=10, # Retrieve 10 candidates
    node_postprocessors=[rerank], # Re-rank and keep top 3
)
Inspect what was retrieved to diagnose:
response = query_engine.query("What is the CEO's name?")
print(f"Answer: {response}")
print("\nSource chunks used:")
for node in response.source_nodes:
    print(f"Score: {node.score:.3f}")
    print(f"Text: {node.text[:200]}...")
    print(f"Metadata: {node.metadata}")
    print("---")
If the retrieved chunks don’t contain the answer, it’s a retrieval problem (chunking, embedding). If they contain the answer but the LLM ignores it, it’s an LLM problem (prompting, context window).
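When you run this check across many test questions, it can help to automate the retrieval-versus-LLM decision. The sketch below is a hypothetical helper that works on plain strings; it assumes you know a phrase the correct answer must contain, and it uses simple substring matching, which is a rough proxy at best.

```python
def classify_failure(source_texts, expected_phrase):
    """Say which layer to debug, based on whether any retrieved text contains the phrase."""
    needle = expected_phrase.lower()
    if any(needle in text.lower() for text in source_texts):
        # The answer reached the LLM: look at the prompt or context window
        return "llm"
    # The answer never reached the LLM: look at chunking, embeddings, or top_k
    return "retrieval"


retrieved = [
    "This document discusses financial projections.",
    "Revenue grew 12% year over year.",
]
verdict = classify_failure(retrieved, "chief executive")
print(verdict)
```

With a real response object you would pass `[node.text for node in response.source_nodes]` as the first argument.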
Fix 5: Custom Embedding Models
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Use a local HuggingFace model (no API costs)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Multilingual embeddings
Settings.embed_model = HuggingFaceEmbedding(
    model_name="intfloat/multilingual-e5-large",
)
# Japanese-specific
Settings.embed_model = HuggingFaceEmbedding(
    model_name="intfloat/multilingual-e5-large",
    # Or "cl-tohoku/bert-base-japanese-v3"
)
Embedding batch size for faster indexing:
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    embed_batch_size=32, # Default 10; raise if GPU has memory
)
API-based embeddings:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.cohere import CohereEmbedding
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
# Or
Settings.embed_model = CohereEmbedding(model_name="embed-english-v3.0")
Common Mistake: Re-embedding all documents every time you start the app. Embedding is expensive (money for API, time for local). Always persist the index after first build, and only re-index when source documents change.
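One way to avoid re-embedding on every start is a small "load if present, otherwise build once" wrapper. The sketch below is library-free and hypothetical: `load_or_build` takes the build and load steps as callables, and the fake functions in the demo stand in for the real LlamaIndex calls.

```python
import tempfile
from pathlib import Path


def load_or_build(persist_dir, build_fn, load_fn):
    """Load a persisted index if the directory has content, otherwise build it once."""
    path = Path(persist_dir)
    if path.exists() and any(path.iterdir()):
        return load_fn(path)
    path.mkdir(parents=True, exist_ok=True)
    return build_fn(path)  # build_fn is expected to persist into `path`


# Demo with stand-in functions that count how often each step runs
calls = {"build": 0, "load": 0}


def fake_build(path):
    calls["build"] += 1
    (path / "index.json").write_text("{}")  # stands in for persisting the index
    return "index"


def fake_load(path):
    calls["load"] += 1
    return "index"


demo_dir = Path(tempfile.mkdtemp()) / "index_storage"
load_or_build(demo_dir, fake_build, fake_load)  # first run: builds and persists
load_or_build(demo_dir, fake_build, fake_load)  # second run: loads from disk
print(calls)
```

In a real app, `build_fn` would call `VectorStoreIndex.from_documents(...)` followed by `index.storage_context.persist(persist_dir=...)`, and `load_fn` would call `load_index_from_storage(StorageContext.from_defaults(persist_dir=...))`, as in Fix 3. Detecting changed source documents is left out of this sketch.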
When loading HuggingFace embedding models, the first call downloads weights to ~/.cache/huggingface. Set HF_HOME to a shared volume in container deployments so every pod does not re-download the same 500 MB at startup.
Fix 6: Document Loaders — PDFs, URLs, Databases
from llama_index.core import SimpleDirectoryReader
# Load all files from a directory
documents = SimpleDirectoryReader("./data").load_data()
# Filter by file type
documents = SimpleDirectoryReader(
    "./data",
    required_exts=[".pdf", ".txt", ".md"],
).load_data()
# Recursive
documents = SimpleDirectoryReader(
    "./data",
    recursive=True,
    exclude_hidden=True,
).load_data()
PDF-specific readers for better extraction:
pip install llama-index-readers-file pypdf
from pathlib import Path
from llama_index.readers.file import PDFReader
reader = PDFReader()
documents = reader.load_data(file=Path("report.pdf"))
Web page loader:
pip install llama-index-readers-web
from llama_index.readers.web import SimpleWebPageReader
# html_to_text=True requires the html2text package
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=[
    "https://example.com/page1",
    "https://example.com/page2",
])
Database loader:
pip install llama-index-readers-database sqlalchemy
from llama_index.readers.database import DatabaseReader
reader = DatabaseReader(
    scheme="postgresql",
    host="localhost",
    port="5432",
    user="postgres",
    password="password",
    dbname="mydb",
)
documents = reader.load_data(
    query="SELECT id, title, content FROM articles WHERE published = true"
)
Fix 7: Streaming and Async
from llama_index.core import VectorStoreIndex
# Streaming response
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Explain the main findings.")
# Print tokens as they arrive
for token in response.response_gen:
    print(token, end="", flush=True)
Async queries:
import asyncio
async def query_async():
    query_engine = index.as_query_engine()
    response = await query_engine.aquery("What are the conclusions?")
    return response
response = asyncio.run(query_async())
Batch queries concurrently:
import asyncio
async def batch_queries(queries):
    tasks = [query_engine.aquery(q) for q in queries]
    return await asyncio.gather(*tasks)
queries = [
    "What is the main topic?",
    "Who are the authors?",
    "What are the key findings?",
]
results = asyncio.run(batch_queries(queries))
Fix 8: Agents and Tool Use
LlamaIndex agents wrap LLMs with tools for multi-step reasoning.
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
def multiply(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b
multiply_tool = FunctionTool.from_defaults(fn=multiply)
add_tool = FunctionTool.from_defaults(fn=add)
llm = OpenAI(model="gpt-4")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)
response = agent.chat("What is (123 * 456) + 789?")
print(response)
Tool built from a query engine (RAG as a tool):
from llama_index.core.tools import QueryEngineTool, ToolMetadata
query_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="company_docs",
        description="Search company internal documents for policies and procedures.",
    ),
)
agent = ReActAgent.from_tools([query_tool], llm=llm, verbose=True)
response = agent.chat("What's our policy on remote work?")
LlamaIndex agents and LangChain agents look similar, but their tool-call schemas differ. Do not try to share FunctionTool definitions across libraries; reimplement at the boundary.
Still Not Working?
LlamaIndex vs LangChain
- LlamaIndex — Specialized for RAG and indexing. Better abstractions for document processing, querying, and retrieval. Simpler RAG setup.
- LangChain — Broader toolkit covering agents, chains, LCEL, and many integrations. More flexible but also more complex.
Both libraries work well together — you can use LlamaIndex indexes as retrievers in LangChain chains. For LangChain-specific patterns, see LangChain Python not working.
OpenAI API Key and Rate Limits
export OPENAI_API_KEY=sk-...
For OpenAI-specific rate limits and retry patterns when LlamaIndex hits them, see OpenAI API not working.
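As a rough illustration of a retry pattern, the sketch below wraps any callable in exponential backoff with jitter. The `call_with_backoff` helper, its defaults, and the broad `except Exception` are assumptions made for the example; in real code you would catch only your client's rate-limit error.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponentially growing delays when it raises."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to your client's rate-limit error
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Demo: a function that fails twice, then succeeds
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
print(result, len(waits))
```

A query could then be wrapped as `call_with_backoff(lambda: query_engine.query("..."))`.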
Hybrid Search — Combining Vector and Keyword Retrieval
Pure semantic search misses exact keyword matches (product codes, names, specific terminology). Hybrid search combines both:
# Requires: pip install llama-index-retrievers-bm25
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=5,
)
# Fuse both retrievers by combining their rankings
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1, # 1 = use only the original query, no LLM-generated variants
    mode="reciprocal_rerank",
)
query_engine = RetrieverQueryEngine.from_args(hybrid_retriever)
This catches cases where semantic search alone fails. For example, searching for “SKU-12345” with pure embeddings rarely matches, but BM25 keyword matching finds it immediately.
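To show the idea behind rank fusion, here is a minimal, library-free sketch of reciprocal rank fusion over two ranked lists of document IDs. It illustrates the general technique, not LlamaIndex's internal implementation; the function name, the document IDs, and the constant `k=60` (a commonly used value) are assumptions for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each document scores 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_ranking = ["doc_a", "doc_b", "doc_c"]  # hypothetical semantic-search order
bm25_ranking = ["doc_sku", "doc_b", "doc_a"]  # hypothetical keyword-search order

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, while a document found only by keyword search (`doc_sku` here) still makes it into the fused list instead of being dropped.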
Using Ollama for Local LLMs
pip install llama-index-llms-ollama
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings
Settings.llm = Ollama(model="llama3", request_timeout=60.0)
For Ollama setup and model management, see Ollama not working.
Debugging and Observability
import logging
import sys
import llama_index.core
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
llama_index.core.set_global_handler("simple") # Prints all LLM calls
# Or use wandb for production tracking
llama_index.core.set_global_handler("wandb", run_args={"project": "rag-experiments"})
The wandb handler logs every LLM call, every retrieval, and every embedding request, so it can balloon storage quickly. Sample heavily in production or you will be paying for trace data nobody reads.
Metadata Filters for Scoped Search
Attach metadata to documents and filter queries to specific subsets:
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator
# Attach metadata when creating documents
documents = [
    Document(text="Q1 earnings report...", metadata={"year": 2024, "department": "finance"}),
    Document(text="Product roadmap 2024...", metadata={"year": 2024, "department": "product"}),
    Document(text="Q1 2023 earnings...", metadata={"year": 2023, "department": "finance"}),
]
index = VectorStoreIndex.from_documents(documents)
# Query only 2024 finance documents
filters = MetadataFilters(filters=[
    MetadataFilter(key="year", value=2024),
    MetadataFilter(key="department", value="finance"),
])
query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("What was revenue last quarter?")
In vector stores that support it, metadata filters are applied as part of the vector search rather than afterwards, which is usually much faster than retrieving everything and filtering the results.
Evaluating RAG Quality
from llama_index.core import Settings
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)
faithfulness = FaithfulnessEvaluator(llm=Settings.llm)
relevancy = RelevancyEvaluator(llm=Settings.llm)
response = query_engine.query("What's the CEO's name?")
faith_result = faithfulness.evaluate_response(response=response)
print(f"Faithfulness: {faith_result.passing}") # Does the response match sources?
relevancy_result = relevancy.evaluate_response(
    query="What's the CEO's name?",
    response=response,
)
print(f"Relevancy: {relevancy_result.passing}") # Is the response relevant?
Evaluators are LLM-judged: they use another LLM call to score whether the response is faithful to sources or relevant to the query. Useful for regression testing RAG pipelines.
Settings Mutation Across Modules
Assigning Settings.llm or Settings.embed_model in more than one module makes those assignments compete. Import order then becomes load-bearing: whichever module loads last wins. Symptoms are flaky tests and prod behavior that differs from local. Configure Settings once at the entrypoint of your app, never inside library modules. If you genuinely need different models in different places (e.g., a cheap LLM for one query engine and a strong one for another), pass them per component, for example the LLM via index.as_query_engine(llm=...) and the embedding model when constructing the index via VectorStoreIndex.from_documents(documents, embed_model=...), instead of mutating the global.
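One way to enforce "configure once at the entrypoint" is a small guard that refuses a second configuration. The sketch below uses a hypothetical `AppSettings` class as a stand-in for a global settings singleton, and the string values are placeholders for real model objects; the same guard could wrap the real assignments in your own entrypoint.

```python
class AppSettings:
    """Hypothetical stand-in for a global settings singleton."""
    llm = None
    embed_model = None


_configured = False


def configure_once(llm, embed_model):
    """Set global defaults exactly once; a second call raises instead of silently winning."""
    global _configured
    if _configured:
        raise RuntimeError(
            "Settings already configured; pass overrides per component instead"
        )
    AppSettings.llm = llm
    AppSettings.embed_model = embed_model
    _configured = True


# Entrypoint: the only place that should call this
configure_once(llm="strong-llm", embed_model="small-embedder")
```

A library module that tries to call `configure_once` again now fails at import time, which turns an import-order bug into an immediate, visible error.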
Hugging Face Rate Limits During Bulk Indexing
Building a large index against a hosted inference API quickly hits rate limits; symptoms can include silent dropouts where some chunks end up embedded as zero vectors. Use a local embedding model (HuggingFaceEmbedding with BAAI/bge-small-en-v1.5) for batch indexing. Query-time embeddings must come from the same model that built the index (see the embedding mismatch check in the Diagnostic Timeline), so only use a hosted API at query time if it serves that same model. For HuggingFace credential setup, see HuggingFace Transformers not working.
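If you suspect silent dropouts, a quick sanity check on the embeddings can be run before they are written to the index. The sketch below is a hypothetical helper that works on plain lists of floats and flags all-zero vectors; the tolerance value is an assumption.

```python
def find_zero_vectors(embeddings, tol=1e-12):
    """Return the positions of embeddings whose components are all (near) zero."""
    return [
        i for i, vec in enumerate(embeddings)
        if all(abs(x) <= tol for x in vec)
    ]


batch = [
    [0.1, -0.2, 0.3],
    [0.0, 0.0, 0.0],  # looks like a dropped request
    [0.0, 0.5, 0.0],
]
bad = find_zero_vectors(batch)
print(bad)
```

Any index returned here points at a chunk that should be re-embedded before you trust retrieval results.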
Source Documents Look Wrong After Persist + Load
You persist an index, load it from disk, and node.text is empty or truncated. The cause is almost always that your custom node parser used a different chunk size on the load side, or that the persisted format does not match the LlamaIndex version you are loading with. Pin the LlamaIndex version in requirements.txt and store the version alongside the persisted index so a mismatch raises loudly instead of silently corrupting retrieval.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.