Fix: LlamaIndex Not Working — Import Errors, Vector Store Issues, and Query Engine Failures
Quick Answer
How to fix the most common LlamaIndex failures: ImportError: No module named 'llama_index.core', the ServiceContext deprecation warning (replaced by Settings), vector store indexes that vanish on restart, query engines returning irrelevant results, and the 0.10 migration.
The Error
You install LlamaIndex and the imports fail:
ImportError: cannot import name 'GPTSimpleVectorIndex' from 'llama_index'
ModuleNotFoundError: No module named 'llama_index.core'
Or ServiceContext raises a deprecation warning on every call:
DeprecationWarning: ServiceContext is deprecated, please use Settings instead.
Or you build an index, restart your script, and it’s gone:
index = VectorStoreIndex.from_documents(documents)
# Restart script
index = ??? # How do I load it back?
Or query results come back irrelevant — the retriever returns the wrong chunks:
response = query_engine.query("What is the CEO's name?")
print(response) # "This document discusses financial projections..."
# Nothing about the CEO
LlamaIndex went through a major restructuring in version 0.10 (February 2024) — the package was split into llama-index-core plus dozens of integration packages. Code written for 0.9 or earlier breaks immediately on import. The ServiceContext object was replaced with a Settings singleton. This guide covers the migration plus common RAG-specific errors.
Why This Happens
LlamaIndex 0.10 split the monolithic llama-index package into a core library and many small integration packages (llama-index-vector-stores-chroma, llama-index-embeddings-openai, etc.). This keeps the core install small but means every feature requires its own pip install. Code that worked with 0.9’s single-package import model fails immediately.
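When an import fails, first confirm which of the split packages are actually installed. A small stdlib helper can check each one independently (the package names in the loop are just examples):

```python
from importlib import metadata

def installed_version(package: str):
    """Return the installed version of a pip package, or None if it's absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Each integration package installs (and breaks) independently of the core
for pkg in ("llama-index-core", "llama-index-llms-openai"):
    print(pkg, "->", installed_version(pkg) or "not installed")
```

If llama-index-core shows a version but an integration shows "not installed", the fix is a pip install, not a code change.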
RAG quality problems come from three places: the document chunking (too large, too small, or poorly aligned with sentence boundaries), the embedding model (wrong domain, wrong language), and the retrieval strategy (top-k alone often misses relevant content buried in a single long document).
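To see the chunking failure mode concretely, here is a toy fixed-size character chunker (not LlamaIndex's actual splitter): a fact that straddles a chunk boundary gets cut in half, and overlap is what recovers it.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Naive fixed-size chunking with overlap (characters stand in for tokens)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "The CEO is Jane Smith. " * 10
no_overlap = chunk_text(doc, chunk_size=30, overlap=0)
with_overlap = chunk_text(doc, chunk_size=30, overlap=10)

# Count chunks that contain the full fact: boundary cuts split "Jane Smith"
# across neighbouring chunks, and overlap repairs some of those cuts
print(sum("Jane Smith" in c for c in no_overlap), "/", len(no_overlap))
print(sum("Jane Smith" in c for c in with_overlap), "/", len(with_overlap))
```

Real splitters count tokens and respect sentence boundaries — which is exactly what the SentenceSplitter shown in Fix 4 does.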
Fix 1: LlamaIndex 0.10+ Migration
The 0.10 release reorganized everything. Here’s the migration:
Old (0.9 and earlier):
from llama_index import (
GPTSimpleVectorIndex,
SimpleDirectoryReader,
ServiceContext,
LLMPredictor,
)
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding
New (0.10+):
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
Settings,
StorageContext,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
Install the packages you need:
# Core library (required)
pip install llama-index-core
# LLM providers — install the one(s) you use
pip install llama-index-llms-openai
pip install llama-index-llms-anthropic
pip install llama-index-llms-huggingface
pip install llama-index-llms-ollama
# Embedding providers
pip install llama-index-embeddings-openai
pip install llama-index-embeddings-huggingface
pip install llama-index-embeddings-cohere
# Vector stores
pip install llama-index-vector-stores-chroma
pip install llama-index-vector-stores-qdrant
pip install llama-index-vector-stores-pinecone
# Readers (document loaders)
pip install llama-index-readers-file # PDF, DOCX, etc.
pip install llama-index-readers-web # URL-based
pip install llama-index-readers-database # SQL
Or install everything at once with the starter bundle:
pip install llama-index # Installs llama-index-core + common integrations
Common Mistake: Copying code from old tutorials that use from llama_index import .... The top-level llama_index namespace no longer has most symbols — everything moved to llama_index.core or provider subpackages. If you see ImportError on what looks like a standard LlamaIndex import, check the 0.10 migration guide and replace the import with the new path.
Fix 2: ServiceContext Deprecated — Use Settings
DeprecationWarning: ServiceContext is deprecated, please use Settings instead.
The 0.10 release replaced the per-call ServiceContext with a global Settings singleton.
Old pattern:
from llama_index import ServiceContext, LLMPredictor
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding
service_context = ServiceContext.from_defaults(
llm=OpenAI(model="gpt-4"),
embed_model=OpenAIEmbedding(),
chunk_size=512,
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
New pattern:
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Set globally — applies to all subsequent operations
Settings.llm = OpenAI(model="gpt-4")
Settings.embed_model = OpenAIEmbedding()
Settings.chunk_size = 512
Settings.chunk_overlap = 50
# No need to pass service_context anymore
index = VectorStoreIndex.from_documents(documents)
Pro Tip: Set Settings once at the top of your script or in a config module. If you need different LLMs or embeddings in different parts of the app, pass them explicitly to the specific component rather than mutating Settings mid-flight. Mutating global state makes debugging much harder.
Per-component override:
from llama_index.llms.openai import OpenAI
# Use a specific LLM for this query engine only
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-4-turbo"))
Fix 3: Persisting and Loading Indexes
index = VectorStoreIndex.from_documents(documents)
# ... script ends, all work is lost
Save the index to disk:
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
index = VectorStoreIndex.from_documents(documents)
# Persist
index.storage_context.persist(persist_dir="./index_storage")
Load later:
from llama_index.core import StorageContext, load_index_from_storage
# Load back
storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)
# Use as before
query_engine = index.as_query_engine()
response = query_engine.query("What's the main topic?")
For production — use a dedicated vector store:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
# Chroma client (persistent)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Build index (stores vectors in Chroma automatically)
index = VectorStoreIndex.from_documents(
documents,
storage_context=storage_context,
)
# Load later — just connect to Chroma
index = VectorStoreIndex.from_vector_store(vector_store)
Qdrant (for production scale):
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
Fix 4: Poor Query Results — Chunking and Retrieval
Vague or wrong answers usually mean the retriever isn’t finding the right chunks.
Step 1: Tune chunk size to your documents.
from llama_index.core import Settings
# Default is 1024 tokens — good for mixed content
Settings.chunk_size = 1024
Settings.chunk_overlap = 100
# Smaller for specific facts (FAQ, Q&A)
Settings.chunk_size = 512
Settings.chunk_overlap = 50
# Larger for long-form documents (narratives, technical docs)
Settings.chunk_size = 2048
Settings.chunk_overlap = 200
Step 2: Use sentence-aware splitting.
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(
chunk_size=1024,
chunk_overlap=100,
paragraph_separator="\n\n\n",
)
nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
Step 3: Increase top-k.
# Default top_k=2 — often too few
query_engine = index.as_query_engine(similarity_top_k=5)
Step 4: Add a re-ranker for higher precision:
from llama_index.core.postprocessor import SentenceTransformerRerank
# Get more candidates, re-rank, use top results
rerank = SentenceTransformerRerank(
model="cross-encoder/ms-marco-MiniLM-L-6-v2",
top_n=3,
)
query_engine = index.as_query_engine(
similarity_top_k=10, # Retrieve 10 candidates
node_postprocessors=[rerank], # Re-rank and keep top 3
)
Inspect what was retrieved to diagnose:
response = query_engine.query("What is the CEO's name?")
print(f"Answer: {response}")
print(f"\nSource chunks used:")
for node in response.source_nodes:
print(f"Score: {node.score:.3f}")
print(f"Text: {node.text[:200]}...")
print(f"Metadata: {node.metadata}")
print("---")
If the retrieved chunks don’t contain the answer, it’s a retrieval problem (chunking, embedding). If they contain the answer but the LLM ignores it, it’s an LLM problem (prompting, context window).
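With a few known-answer questions, that same diagnosis can run automatically. A toy helper with no LlamaIndex APIs involved (the function name and thresholding logic are illustrative):

```python
def diagnose(expected_fact: str, retrieved_chunks: list[str], answer: str) -> str:
    """Classify a RAG miss as a retrieval problem or an LLM problem."""
    in_context = any(expected_fact.lower() in c.lower() for c in retrieved_chunks)
    in_answer = expected_fact.lower() in answer.lower()
    if in_answer:
        return "ok"
    if not in_context:
        return "retrieval problem"  # the right chunk was never fetched
    return "llm problem"  # the fact was in context but the model ignored it

chunks = ["This document discusses financial projections..."]
print(diagnose("Jane Smith", chunks, "The report covers projections."))
# → retrieval problem
```

Run it over response.source_nodes texts for a fixed query set after every chunking or embedding change.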
Fix 5: Custom Embedding Models
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Use a local HuggingFace model (no API costs)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Multilingual embeddings
Settings.embed_model = HuggingFaceEmbedding(
model_name="intfloat/multilingual-e5-large",
)
# Japanese-specific
Settings.embed_model = HuggingFaceEmbedding(
model_name="intfloat/multilingual-e5-large",
# Or "cl-tohoku/bert-base-japanese-v3"
)
Embedding batch size for faster indexing:
Settings.embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-small-en-v1.5",
embed_batch_size=32, # Default 10 — raise if GPU has memory
)
API-based embeddings:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.cohere import CohereEmbedding
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
# Or
Settings.embed_model = CohereEmbedding(model_name="embed-english-v3.0")
Common Mistake: Re-embedding all documents every time you start the app. Embedding is expensive (money for API, time for local). Always persist the index after the first build, and only re-index when source documents change.
For HuggingFace model loading patterns and authentication, see HuggingFace Transformers not working.
Fix 6: Document Loaders — PDFs, URLs, Databases
from llama_index.core import SimpleDirectoryReader
# Load all files from a directory
documents = SimpleDirectoryReader("./data").load_data()
# Filter by file type
documents = SimpleDirectoryReader(
"./data",
required_exts=[".pdf", ".txt", ".md"],
).load_data()
# Recursive
documents = SimpleDirectoryReader(
"./data",
recursive=True,
exclude_hidden=True,
).load_data()
PDF-specific readers for better extraction:
pip install llama-index-readers-file pypdf
from llama_index.readers.file import PDFReader
reader = PDFReader()
documents = reader.load_data(file="report.pdf")
Web page loader:
pip install llama-index-readers-web
from llama_index.readers.web import SimpleWebPageReader
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=[
"https://example.com/page1",
"https://example.com/page2",
])
Database loader:
pip install llama-index-readers-database sqlalchemy
from llama_index.readers.database import DatabaseReader
reader = DatabaseReader(
scheme="postgresql",
host="localhost",
port="5432",
user="postgres",
password="password",
dbname="mydb",
)
documents = reader.load_data(
query="SELECT id, title, content FROM articles WHERE published = true"
)
Fix 7: Streaming and Async
from llama_index.core import VectorStoreIndex
# Streaming response
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Explain the main findings.")
# Print tokens as they arrive
for token in response.response_gen:
print(token, end="", flush=True)
Async queries:
import asyncio
async def query_async():
query_engine = index.as_query_engine()
response = await query_engine.aquery("What are the conclusions?")
return response
response = asyncio.run(query_async())
Batch queries concurrently:
import asyncio
async def batch_queries(queries):
tasks = [query_engine.aquery(q) for q in queries]
return await asyncio.gather(*tasks)
queries = [
"What is the main topic?",
"Who are the authors?",
"What are the key findings?",
]
results = asyncio.run(batch_queries(queries))
Fix 8: Agents and Tool Use
LlamaIndex agents wrap LLMs with tools for multi-step reasoning.
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
def multiply(a: int, b: int) -> int:
"""Multiply two integers."""
return a * b
def add(a: int, b: int) -> int:
"""Add two integers."""
return a + b
multiply_tool = FunctionTool.from_defaults(fn=multiply)
add_tool = FunctionTool.from_defaults(fn=add)
llm = OpenAI(model="gpt-4")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)
response = agent.chat("What is (123 * 456) + 789?")
print(response)
Tool built from a query engine (RAG as a tool):
from llama_index.core.tools import QueryEngineTool, ToolMetadata
query_tool = QueryEngineTool(
query_engine=query_engine,
metadata=ToolMetadata(
name="company_docs",
description="Search company internal documents for policies and procedures.",
),
)
agent = ReActAgent.from_tools([query_tool], llm=llm, verbose=True)
response = agent.chat("What's our policy on remote work?")
For general LangChain-style agent patterns that overlap with LlamaIndex agents, see LangChain Python not working.
Still Not Working?
LlamaIndex vs LangChain
- LlamaIndex — Specialized for RAG and indexing. Better abstractions for document processing, querying, and retrieval. Simpler RAG setup.
- LangChain — Broader toolkit covering agents, chains, LCEL, and many integrations. More flexible but also more complex.
Both libraries work well together — you can use LlamaIndex indexes as retrievers in LangChain chains. For LangChain-specific patterns, see LangChain Python not working.
OpenAI API Key and Rate Limits
export OPENAI_API_KEY=sk-...
For OpenAI-specific rate limits and retry patterns when LlamaIndex hits them, see OpenAI API not working.
Hybrid Search — Combining Vector and Keyword Retrieval
Pure semantic search misses exact keyword matches (product codes, names, specific terminology). Hybrid search combines vector and keyword retrieval (the BM25 retriever ships separately: pip install llama-index-retrievers-bm25):
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(
docstore=index.docstore,
similarity_top_k=5,
)
# Fuse both retrievers — takes best from both
hybrid_retriever = QueryFusionRetriever(
[vector_retriever, bm25_retriever],
similarity_top_k=5,
num_queries=1,
mode="reciprocal_rerank",
)
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(hybrid_retriever)
This catches cases where semantic search alone fails — e.g., searching for “SKU-12345” with pure embeddings rarely matches, but BM25 keyword matching finds it immediately.
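The reciprocal_rerank fusion mode is reciprocal rank fusion (RRF): each document's fused score is the sum of 1/(k + rank) over the result lists it appears in. A framework-free sketch of the idea (doc IDs are made up):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Fuse ranked lists: documents ranked highly by any retriever rise to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic neighbours
bm25_hits = ["doc_x", "doc_a", "doc_b"]    # exact keyword matches like "SKU-12345"
print(reciprocal_rank_fusion([vector_hits, bm25_hits], top_n=3))
# → ['doc_a', 'doc_b', 'doc_x']
```

Documents appearing in both lists (doc_a, doc_b) outrank single-list hits, while the keyword-only match (doc_x) still survives — the behavior hybrid search is after.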
Using Ollama for Local LLMs
pip install llama-index-llms-ollama
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings
Settings.llm = Ollama(model="llama3", request_timeout=60.0)
For Ollama setup and model management, see Ollama not working.
Debugging and Observability
import sys
import logging
import llama_index.core
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
llama_index.core.set_global_handler("simple") # Prints all LLM calls
# Or use wandb for production tracking
llama_index.core.set_global_handler("wandb", run_args={"project": "rag-experiments"})
For wandb setup and logging patterns, see Weights & Biases not working.
Metadata Filters for Scoped Search
Attach metadata to documents and filter queries to specific subsets:
from llama_index.core import Document
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator
# Attach metadata when creating documents
documents = [
Document(text="Q1 earnings report...", metadata={"year": 2024, "department": "finance"}),
Document(text="Product roadmap 2024...", metadata={"year": 2024, "department": "product"}),
Document(text="Q1 2023 earnings...", metadata={"year": 2023, "department": "finance"}),
]
index = VectorStoreIndex.from_documents(documents)
# Query only 2024 finance documents
filters = MetadataFilters(filters=[
MetadataFilter(key="year", value=2024),
MetadataFilter(key="department", value="finance"),
])
query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("What was revenue last quarter?")
Metadata filters are applied by the vector store during retrieval, so only matching documents are scored, which is far faster than retrieving everything and filtering afterwards.
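The speedup comes from shrinking the candidate set before any similarity scoring happens. A toy illustration in plain Python (term overlap stands in for embedding similarity; this is not the LlamaIndex internals):

```python
def filtered_search(docs: list[dict], query_terms: set[str], filters: dict, top_k: int = 3) -> list[dict]:
    """Apply metadata filters first, then score only the surviving documents."""
    candidates = [
        d for d in docs
        if all(d["metadata"].get(key) == value for key, value in filters.items())
    ]
    # Score survivors only -- the filtered-out documents cost nothing
    return sorted(
        candidates,
        key=lambda d: len(query_terms & set(d["text"].lower().split())),
        reverse=True,
    )[:top_k]

docs = [
    {"text": "q1 earnings report revenue up", "metadata": {"year": 2024, "department": "finance"}},
    {"text": "product roadmap 2024", "metadata": {"year": 2024, "department": "product"}},
    {"text": "q1 2023 earnings revenue", "metadata": {"year": 2023, "department": "finance"}},
]
hits = filtered_search(docs, {"revenue"}, {"year": 2024, "department": "finance"})
print([d["text"] for d in hits])
# → ['q1 earnings report revenue up']
```

The 2023 document mentions revenue too, but it never reaches the scoring step because the metadata filter rejects it first.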
Evaluating RAG Quality
from llama_index.core.evaluation import (
FaithfulnessEvaluator,
RelevancyEvaluator,
CorrectnessEvaluator,
)
faithfulness = FaithfulnessEvaluator(llm=Settings.llm)
relevancy = RelevancyEvaluator(llm=Settings.llm)
response = query_engine.query("What's the CEO's name?")
faith_result = faithfulness.evaluate_response(response=response)
print(f"Faithfulness: {faith_result.passing}") # Does the response match sources?
relevancy_result = relevancy.evaluate_response(
query="What's the CEO's name?",
response=response,
)
print(f"Relevancy: {relevancy_result.passing}") # Is the response relevant?
Evaluators are LLM-judged — they use another LLM call to score whether the response is faithful to sources or relevant to the query. Useful for regression testing RAG pipelines.
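If an extra LLM call per check is too slow or costly for CI, a crude token-overlap score can serve as a first-pass signal (this heuristic is my own illustration, not a LlamaIndex evaluator):

```python
def support_score(answer: str, source_texts: list[str]) -> float:
    """Fraction of answer tokens that appear somewhere in the retrieved sources."""
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    source_tokens = set(" ".join(source_texts).lower().split())
    return len(answer_tokens & source_tokens) / len(answer_tokens)

sources = ["Jane Smith was appointed CEO in 2021."]
print(support_score("The CEO is Jane Smith", sources))
# → 0.6
```

A sudden drop in the average support_score over a fixed query set is a cheap regression alarm; keep the LLM-judged evaluators for the cases it flags.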
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.