Fix: LlamaIndex Not Working — Import Errors, Vector Store Issues, and Query Engine Failures


This guide covers the most common LlamaIndex failures: ImportError and ModuleNotFoundError after the 0.10 package split, the ServiceContext deprecation in favor of Settings, vector store indexes that do not persist across restarts, and query engines that return irrelevant results.

The Error

You install LlamaIndex and the imports fail:

ImportError: cannot import name 'GPTSimpleVectorIndex' from 'llama_index'
ModuleNotFoundError: No module named 'llama_index.core'

Or ServiceContext raises a deprecation warning on every call:

DeprecationWarning: ServiceContext is deprecated, please use Settings instead.

Or you build an index, restart your script, and it’s gone:

index = VectorStoreIndex.from_documents(documents)
# Restart script
index = ???   # How do I load it back?

Or query results come back irrelevant — the retriever returns the wrong chunks:

response = query_engine.query("What is the CEO's name?")
print(response)   # "This document discusses financial projections..."
# Nothing about the CEO

LlamaIndex went through a major restructuring in version 0.10 (February 2024) — the package was split into llama-index-core plus dozens of integration packages. Code written for 0.9 or earlier breaks immediately on import. The ServiceContext object was replaced with a Settings singleton. This guide covers the migration plus common RAG-specific errors.

Why This Happens

LlamaIndex 0.10 split the monolithic llama-index package into a core library and many small integration packages (llama-index-vector-stores-chroma, llama-index-embeddings-openai, etc.). This keeps the core install small but means every feature requires its own pip install. Code that worked with 0.9’s single-package import model fails immediately.

RAG quality problems come from three places: the document chunking (too large, too small, or poorly aligned with sentence boundaries), the embedding model (wrong domain, wrong language), and the retrieval strategy (top-k alone often misses relevant content buried in a single long document).

Diagnostic Timeline

When your RAG pipeline returns wrong answers, the reflex is “check the index.” That usually blames the wrong layer. Follow this timeline before re-indexing anything.

Minute 0 — Wrong first instinct. You assume the index is stale or corrupt, blow it away, and re-index from scratch. Twenty minutes later you ask the same question and get the same wrong answer. Re-indexing fixes corruption, but corruption is rare; what is common is that the index is built correctly and the query path uses different settings than the indexing path. Re-indexing cannot fix that.

Minute 1 — Discriminating evidence. Print the retrieved nodes before judging the answer:

response = query_engine.query("What is the CEO's name?")
for node in response.source_nodes:
    print(node.score, node.text[:200])

If the source nodes contain the answer but the LLM ignored it, the bug is in the prompt or the LLM (prompt template, context window truncation). If the source nodes do not contain the answer, the bug is in retrieval (embeddings, chunking, or top_k). That single check redirects the next 30 minutes of debugging.
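
The check above can be wrapped in a small helper. The sketch below is library-agnostic: it takes the retrieved chunk texts as plain strings (for example, [n.text for n in response.source_nodes]) plus a phrase the correct answer should contain. The function name classify_failure, the sample chunks, and the plain substring match are illustrative assumptions; a real check may need fuzzier matching.

```python
def classify_failure(chunk_texts, expected_phrase):
    """Return which layer to debug first, based on whether any retrieved
    chunk contains the phrase the correct answer should include."""
    needle = expected_phrase.casefold()
    found = any(needle in text.casefold() for text in chunk_texts)
    # Phrase present in the context: retrieval worked, so look at the prompt/LLM.
    # Phrase absent: retrieval never surfaced it, so look at embeddings,
    # chunking, or top_k.
    return "llm_or_prompt" if found else "retrieval"


# Made-up sample chunks, standing in for response.source_nodes texts
chunks = [
    "This document discusses financial projections for 2024.",
    "Jane Doe was appointed CEO in March.",
]
print(classify_failure(chunks, "Jane Doe"))      # answer is in the context
print(classify_failure(chunks[:1], "Jane Doe"))  # answer was never retrieved
```
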

Minute 2 — Next check. Compare the embedding model used at indexing time and at query time. After 0.10 the Settings singleton is global but lives only inside one process, so nothing carries it from the indexing script to the querying script. If you set Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5") to build the index, and a later script loads it with Settings.embed_model left at its default (OpenAI's text-embedding-ada-002), you are querying a 384-dim index with 1536-dim vectors. Some vector stores raise; many silently return nonsense. Print Settings.embed_model.__class__.__name__ in both processes.

Minute 3 — Actual root cause. The two failure modes that account for almost every “RAG returns wrong answer” case:

Embedding model mismatch between indexing and querying. Already covered above. The fix is to persist embedding metadata alongside the index and assert it on load. A one-line check at startup saves hours of debugging.
Storage context not persisted. You called VectorStoreIndex.from_documents(docs), queried it in the same process, and got correct answers. You restarted the script and the index is gone — because you never called index.storage_context.persist(persist_dir=...). The in-memory SimpleVectorStore has no disk backing. The symptom is “the same query returns different answers after restart”; the cause is no persistence layer. Fix 3 covers the persistence call you missed.
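
One way to implement the "persist embedding metadata and assert it on load" check is a small JSON sidecar file next to the persisted index. This is a minimal sketch using only the standard library; the file name embed_meta.json and both function names are hypothetical, and LlamaIndex does not write or read this file for you.

```python
import json
import tempfile
from pathlib import Path

META_FILE = "embed_meta.json"  # hypothetical sidecar file name


def save_embed_meta(persist_dir, model_name, dim):
    """Record which embedding model built the index stored in persist_dir."""
    path = Path(persist_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / META_FILE).write_text(json.dumps({"model_name": model_name, "dim": dim}))


def assert_embed_meta(persist_dir, model_name, dim):
    """Raise if the query-side embedding model differs from the indexing one."""
    saved = json.loads((Path(persist_dir) / META_FILE).read_text())
    current = {"model_name": model_name, "dim": dim}
    if saved != current:
        raise RuntimeError(
            f"Embedding mismatch: index built with {saved}, now using {current}"
        )


with tempfile.TemporaryDirectory() as tmp:
    save_embed_meta(tmp, "BAAI/bge-small-en-v1.5", 384)     # at indexing time
    assert_embed_meta(tmp, "BAAI/bge-small-en-v1.5", 384)   # same model: passes
    try:
        assert_embed_meta(tmp, "text-embedding-ada-002", 1536)  # different model
    except RuntimeError as exc:
        print(exc)
```

Call the save function right after index.storage_context.persist(...) and the assert function right before load_index_from_storage(...), so a mismatch fails at startup instead of at query time.
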

If both check out, then look at the index itself. By that point you actually have a hypothesis.

Fix 1: LlamaIndex 0.10+ Migration

The 0.10 release reorganized everything. Here’s the migration:

Old (0.9 and earlier):

from llama_index import (
    GPTSimpleVectorIndex,
    SimpleDirectoryReader,
    ServiceContext,
    LLMPredictor,
)
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding

New (0.10+):

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
    StorageContext,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Install the packages you need:

# Core library (required)
pip install llama-index-core

# LLM providers — install the one(s) you use
pip install llama-index-llms-openai
pip install llama-index-llms-anthropic
pip install llama-index-llms-huggingface
pip install llama-index-llms-ollama

# Embedding providers
pip install llama-index-embeddings-openai
pip install llama-index-embeddings-huggingface
pip install llama-index-embeddings-cohere

# Vector stores
pip install llama-index-vector-stores-chroma
pip install llama-index-vector-stores-qdrant
pip install llama-index-vector-stores-pinecone

# Readers (document loaders)
pip install llama-index-readers-file     # PDF, DOCX, etc.
pip install llama-index-readers-web      # URL-based
pip install llama-index-readers-database # SQL

Or install the llama-index starter bundle, which pulls in the core library plus a set of common integrations:

pip install llama-index   # Installs llama-index-core + common integrations

Common Mistake: Copying code from old tutorials that use from llama_index import .... The top-level llama_index namespace no longer has most symbols — everything moved to llama_index.core or provider subpackages. If you see ImportError on what looks like a standard LlamaIndex import, check the 0.10 migration guide and replace with the new path.
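
To find leftover pre-0.10 imports in a codebase, a short script based on Python's ast module can flag them. This is a heuristic sketch: the LEGACY_MODULES set and the function name are assumptions for illustration, the list is not exhaustive, and a hit only means "check this line against the migration guide".

```python
import ast

# Module paths that old tutorials import symbols from directly. After 0.10
# those symbols live in llama_index.core or in a provider package.
# Illustrative list, not exhaustive.
LEGACY_MODULES = {"llama_index", "llama_index.llms", "llama_index.embeddings"}


def find_legacy_imports(source):
    """Return (line_number, module) for each `from <legacy module> import ...`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.ImportFrom)
            and node.level == 0
            and node.module in LEGACY_MODULES
        ):
            hits.append((node.lineno, node.module))
    return sorted(hits)


old_code = '''from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.core import Settings
'''
print(find_legacy_imports(old_code))
```

Only the first two lines of the sample are flagged; the llama_index.core import is already in the new layout.
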

Fix 2: `ServiceContext` Deprecated — Use `Settings`

DeprecationWarning: ServiceContext is deprecated, please use Settings instead.

The 0.10 release replaced the per-call ServiceContext with a global Settings singleton.

Old pattern:

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4"),
    embed_model=OpenAIEmbedding(),
    chunk_size=512,
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

New pattern:

from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Set globally — applies to all subsequent operations
Settings.llm = OpenAI(model="gpt-4")
Settings.embed_model = OpenAIEmbedding()
Settings.chunk_size = 512
Settings.chunk_overlap = 50

# No need to pass service_context anymore
index = VectorStoreIndex.from_documents(documents)

Pro Tip: Set Settings once at the top of your script or in a config module. If you need different LLMs or embeddings in different parts of the app, pass them explicitly to the specific component rather than mutating Settings mid-flight. Mutating global state makes debugging much harder.

Per-component override:

from llama_index.llms.openai import OpenAI

# Use a specific LLM for this query engine only
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-4-turbo"))

Fix 3: Persisting and Loading Indexes

index = VectorStoreIndex.from_documents(documents)
# ... script ends, all work is lost

Save the index to disk:

from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage

index = VectorStoreIndex.from_documents(documents)

# Persist
index.storage_context.persist(persist_dir="./index_storage")

Load later:

from llama_index.core import StorageContext, load_index_from_storage

# Load back
storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)

# Use as before
query_engine = index.as_query_engine()
response = query_engine.query("What's the main topic?")

For production — use a dedicated vector store:

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Chroma client (persistent)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_collection")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build index (stores vectors in Chroma automatically)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

# Load later — just connect to Chroma
index = VectorStoreIndex.from_vector_store(vector_store)

Qdrant (for production scale):

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

# Connect to a running Qdrant server
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

Fix 4: Poor Query Results — Chunking and Retrieval

Vague or wrong answers usually mean the retriever isn’t finding the right chunks.

Step 1: Tune chunk size to your documents.

from llama_index.core import Settings

# Pick ONE of the pairs below; each assignment overwrites the previous one.

# Default chunk size is 1024 tokens: good for mixed content
Settings.chunk_size = 1024
Settings.chunk_overlap = 100

# Smaller for specific facts (FAQ, Q&A)
Settings.chunk_size = 512
Settings.chunk_overlap = 50

# Larger for long-form documents (narratives, technical docs)
Settings.chunk_size = 2048
Settings.chunk_overlap = 200

Step 2: Use sentence-aware splitting.

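from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=100,
    paragraph_separator="\n\n\n",
)

# Split documents into nodes first, then build the index from those nodes
nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)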
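
To see why sentence-aware splitting with overlap helps, here is a toy version of the idea in plain Python. It is not SentenceSplitter's implementation: it counts words rather than tokens, uses a naive regex for sentence boundaries, and the function name pack_sentences is made up for this sketch.

```python
import re


def pack_sentences(text, max_words, overlap_sentences=1):
    """Group whole sentences into chunks that aim for at most max_words words.

    The last overlap_sentences sentences of a chunk are repeated at the start
    of the next one. A single sentence longer than max_words is kept whole.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and sum(len(s.split()) for s in current) + n_words > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
            # Drop the overlap if it would push the new chunk over budget
            if sum(len(s.split()) for s in current) + n_words > max_words:
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Acme was founded in 1999. Jane Doe is the CEO. "
    "Revenue grew last year. The outlook is stable."
)
chunks = pack_sentences(text, max_words=10)
for chunk in chunks:
    print(repr(chunk))
```

No sentence is cut in half, and the sentence naming the CEO appears in two neighboring chunks, so a query about the CEO can match either one.
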

Step 3: Increase top-k.

# Default top_k=2 — often too few
query_engine = index.as_query_engine(similarity_top_k=5)

Step 4: Add a re-ranker for higher precision:

from llama_index.core.postprocessor import SentenceTransformerRerank

# Get more candidates, re-rank, use top results
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3,
)

query_engine = index.as_query_engine(
    similarity_top_k=10,        # Retrieve 10 candidates
    node_postprocessors=[rerank],  # Re-rank and keep top 3
)

Inspect what was retrieved to diagnose:

response = query_engine.query("What is the CEO's name?")
print(f"Answer: {response}")
print(f"\nSource chunks used:")
for node in response.source_nodes:
    print(f"Score: {node.score:.3f}")
    print(f"Text: {node.text[:200]}...")
    print(f"Metadata: {node.metadata}")
    print("---")

If the retrieved chunks don’t contain the answer, it’s a retrieval problem (chunking, embedding). If they contain the answer but the LLM ignores it, it’s an LLM problem (prompting, context window).

Fix 5: Custom Embedding Models

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use a local HuggingFace model (no API costs)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Multilingual embeddings
Settings.embed_model = HuggingFaceEmbedding(
    model_name="intfloat/multilingual-e5-large",
)

# Japanese-specific
Settings.embed_model = HuggingFaceEmbedding(
    model_name="intfloat/multilingual-e5-large",
    # Or "cl-tohoku/bert-base-japanese-v3"
)

Embedding batch size for faster indexing:

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    embed_batch_size=32,   # Default 10 — raise if GPU has memory
)

API-based embeddings:

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.cohere import CohereEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
# Or
Settings.embed_model = CohereEmbedding(model_name="embed-english-v3.0")

Common Mistake: Re-embedding all documents every time you start the app. Embedding is expensive (money for API, time for local). Always persist the index after first build, and only re-index when source documents change.
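
One simple way to decide whether source documents changed is to keep a manifest of content hashes and compare it at startup. The sketch below uses only the standard library; the function names and the manifest.json file are hypothetical, and the manifest must live outside the data directory so it is not hashed itself.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digests(data_dir):
    """Map each file's relative path to the SHA-256 of its contents."""
    root = Path(data_dir)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def has_changed(data_dir, manifest_path):
    """True if there is no manifest yet or any file was added, removed, or edited."""
    manifest = Path(manifest_path)
    if not manifest.exists():
        return True
    return json.loads(manifest.read_text()) != file_digests(data_dir)


def save_manifest(data_dir, manifest_path):
    """Write the current digests; call this only after a successful index build."""
    Path(manifest_path).write_text(json.dumps(file_digests(data_dir), indent=2))


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "a.txt").write_text("hello")
    manifest = Path(tmp) / "manifest.json"

    first = has_changed(data, manifest)     # no manifest yet
    save_manifest(data, manifest)           # pretend the index was just built
    second = has_changed(data, manifest)    # nothing changed since then
    (data / "a.txt").write_text("hello, world")
    third = has_changed(data, manifest)     # file content changed
    print(first, second, third)
```

In an application, rebuild and persist the index only when has_changed returns True, and load the persisted index otherwise.
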

When loading HuggingFace embedding models, the first call downloads weights to ~/.cache/huggingface. Set HF_HOME to a shared volume in container deployments so every pod does not re-download the same 500 MB at startup.

Fix 6: Document Loaders — PDFs, URLs, Databases

from llama_index.core import SimpleDirectoryReader

# Load all files from a directory
documents = SimpleDirectoryReader("./data").load_data()

# Filter by file type
documents = SimpleDirectoryReader(
    "./data",
    required_exts=[".pdf", ".txt", ".md"],
).load_data()

# Recursive
documents = SimpleDirectoryReader(
    "./data",
    recursive=True,
    exclude_hidden=True,
).load_data()

PDF-specific readers for better extraction:

pip install llama-index-readers-file pypdf

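from pathlib import Path

from llama_index.readers.file import PDFReader

reader = PDFReader()
# Pass a Path object; some versions expect one rather than a plain string
documents = reader.load_data(file=Path("report.pdf"))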

Web page loader:

pip install llama-index-readers-web

from llama_index.readers.web import SimpleWebPageReader

reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=[
    "https://example.com/page1",
    "https://example.com/page2",
])

Database loader:

pip install llama-index-readers-database sqlalchemy

from llama_index.readers.database import DatabaseReader

reader = DatabaseReader(
    scheme="postgresql",
    host="localhost",
    port="5432",
    user="postgres",
    password="password",
    dbname="mydb",
)

documents = reader.load_data(
    query="SELECT id, title, content FROM articles WHERE published = true"
)

Fix 7: Streaming and Async

from llama_index.core import VectorStoreIndex

# Streaming response
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Explain the main findings.")

# Print tokens as they arrive
for token in response.response_gen:
    print(token, end="", flush=True)

Async queries:

import asyncio

async def query_async():
    query_engine = index.as_query_engine()
    response = await query_engine.aquery("What are the conclusions?")
    return response

response = asyncio.run(query_async())

Batch queries concurrently:

import asyncio

async def batch_queries(queries):
    tasks = [query_engine.aquery(q) for q in queries]
    return await asyncio.gather(*tasks)

queries = [
    "What is the main topic?",
    "Who are the authors?",
    "What are the key findings?",
]
results = asyncio.run(batch_queries(queries))
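
An unbounded asyncio.gather fires every query at once, which can trip provider rate limits on large batches. A common remedy is to cap concurrency with a semaphore. The sketch below is self-contained: fake_aquery is a stand-in for query_engine.aquery, and bounded_gather and the limit value are illustrative choices.

```python
import asyncio


async def bounded_gather(coro_fn, items, limit=3):
    """Run coro_fn(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await coro_fn(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_aquery(question):
    """Stand-in for query_engine.aquery; sleeps instead of calling an LLM."""
    await asyncio.sleep(0.01)
    return f"answer to: {question}"


questions = [
    "What is the main topic?",
    "Who are the authors?",
    "What are the key findings?",
]
answers = asyncio.run(bounded_gather(fake_aquery, questions, limit=2))
print(answers)
```

With a real query engine, pass query_engine.aquery in place of fake_aquery.
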

Fix 8: Agents and Tool Use

LlamaIndex agents wrap LLMs with tools for multi-step reasoning.

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def multiply(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b

multiply_tool = FunctionTool.from_defaults(fn=multiply)
add_tool = FunctionTool.from_defaults(fn=add)

llm = OpenAI(model="gpt-4")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)

response = agent.chat("What is (123 * 456) + 789?")
print(response)

Tool built from a query engine (RAG as a tool):

from llama_index.core.tools import QueryEngineTool, ToolMetadata

query_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="company_docs",
        description="Search company internal documents for policies and procedures.",
    ),
)

agent = ReActAgent.from_tools([query_tool], llm=llm, verbose=True)
response = agent.chat("What's our policy on remote work?")

LlamaIndex agents and LangChain agents look similar but their tool-call schemas differ — do not try to share FunctionTool definitions across libraries; reimplement at the boundary.

Still Not Working?

LlamaIndex vs LangChain

LlamaIndex — Specialized for RAG and indexing. Better abstractions for document processing, querying, and retrieval. Simpler RAG setup.
LangChain — Broader toolkit covering agents, chains, LCEL, and many integrations. More flexible but also more complex.

Both libraries work well together — you can use LlamaIndex indexes as retrievers in LangChain chains. For LangChain-specific patterns, see LangChain Python not working.

OpenAI API Key and Rate Limits

export OPENAI_API_KEY=sk-...

For OpenAI-specific rate limits and retry patterns when LlamaIndex hits them, see OpenAI API not working.

Hybrid Search — Combining Vector and Keyword Retrieval

Pure semantic search misses exact keyword matches (product codes, names, specific terminology). Hybrid search combines both:

# Requires: pip install llama-index-retrievers-bm25
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=5,
)

# Fuse both result lists with reciprocal rank fusion
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,              # 1 = use the original query only, no generated variants
    mode="reciprocal_rerank",
)

query_engine = RetrieverQueryEngine.from_args(hybrid_retriever)

This catches cases where semantic search alone fails — e.g., searching for “SKU-12345” with pure embeddings rarely matches, but BM25 keyword matching finds it immediately.
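
The idea behind reciprocal rank fusion is easy to show in isolation. The sketch below is a generic version of the technique, not LlamaIndex's internal code; the constant k=60 is a commonly used default assumed here, and the document IDs are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
bm25_hits = ["doc_sku", "doc_a", "doc_d"]   # keyword ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

A document that both retrievers return (doc_a) rises to the top, while a keyword-only hit such as doc_sku still outranks results that appear lower in a single list.
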

Using Ollama for Local LLMs

pip install llama-index-llms-ollama

from llama_index.llms.ollama import Ollama
from llama_index.core import Settings

Settings.llm = Ollama(model="llama3", request_timeout=60.0)

For Ollama setup and model management, see Ollama not working.

Debugging and Observability

import logging
import sys

import llama_index.core

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
llama_index.core.set_global_handler("simple")   # Prints all LLM calls

# Or use wandb for production tracking
llama_index.core.set_global_handler("wandb", run_args={"project": "rag-experiments"})

The wandb handler logs every LLM call, every retrieval, and every embedding request — it can balloon storage quickly. Sample heavily in production or you will be paying for trace data nobody reads.

Metadata Filters for Scoped Search

Attach metadata to documents and filter queries to specific subsets:

from llama_index.core import Document, VectorStoreIndex
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

# Attach metadata when creating documents
documents = [
    Document(text="Q1 earnings report...", metadata={"year": 2024, "department": "finance"}),
    Document(text="Product roadmap 2024...", metadata={"year": 2024, "department": "product"}),
    Document(text="Q1 2023 earnings...", metadata={"year": 2023, "department": "finance"}),
]

index = VectorStoreIndex.from_documents(documents)

# Query only 2024 finance documents
filters = MetadataFilters(filters=[
    MetadataFilter(key="year", value=2024),
    MetadataFilter(key="department", value="finance"),
])

query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("What was revenue last quarter?")

With vector stores that support native filtering, metadata filters are applied as part of the vector search rather than after it, which is typically much faster than retrieving everything and filtering afterwards.

Evaluating RAG Quality

from llama_index.core import Settings
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)

faithfulness = FaithfulnessEvaluator(llm=Settings.llm)
relevancy = RelevancyEvaluator(llm=Settings.llm)

response = query_engine.query("What's the CEO's name?")

faith_result = faithfulness.evaluate_response(response=response)
print(f"Faithfulness: {faith_result.passing}")   # Does the response match sources?

relevancy_result = relevancy.evaluate_response(
    query="What's the CEO's name?",
    response=response,
)
print(f"Relevancy: {relevancy_result.passing}")   # Is the response relevant?

Evaluators are LLM-judged — they use another LLM call to score whether the response is faithful to sources or relevant to the query. Useful for regression testing RAG pipelines.

`Settings` Mutation Across Modules

Settings.llm = ... in module A and Settings.llm = ... in module B compete. Import order then becomes load-bearing: whichever module loads last wins. Symptoms are flaky tests and prod behavior that differs from local. Configure Settings once at the entrypoint of your app, never inside library modules. If you genuinely need different models in different places (e.g., a cheap LLM for one query engine and a strong one for another, or a separate embedding model for one index), pass them per component, for example index.as_query_engine(llm=...) or VectorStoreIndex.from_documents(documents, embed_model=...), instead of mutating the global.
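
The "configure once at the entrypoint" rule can be enforced rather than merely documented. The toy class below illustrates the pattern with a stand-in config object; AppConfig is hypothetical and is not LlamaIndex's Settings, which does not refuse later assignments.

```python
class AppConfig:
    """Toy stand-in for a global settings object that can be set only once."""

    def __init__(self):
        self._values = {}
        self._frozen = False

    def configure(self, **values):
        """Set all values at the entrypoint; any later call fails loudly."""
        if self._frozen:
            raise RuntimeError(
                "config already set at the entrypoint; pass overrides per component"
            )
        self._values.update(values)
        self._frozen = True

    def get(self, key):
        return self._values[key]


config = AppConfig()
config.configure(llm="strong-model", embed_model="local-embedder")  # entrypoint

try:
    config.configure(llm="cheap-model")  # a library module trying to override
except RuntimeError as exc:
    print(exc)

print(config.get("llm"))
```

In a real app the same effect can come from a single configure() function in your entrypoint module that is the only place assigning to Settings.
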

Hugging Face Rate Limits During Bulk Indexing

Building a large index against a hosted Inference API quickly hits rate limits. Depending on how failures are handled, this can show up as errors partway through the build or as chunks that end up with missing or all-zero embeddings. For bulk indexing, prefer a local embedding model (HuggingFaceEmbedding with BAAI/bge-small-en-v1.5), and use that same model for query-time embeddings: query vectors and index vectors must come from the same model, as described in the Diagnostic Timeline. For HuggingFace credential setup, see HuggingFace Transformers not working.
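
If you do have to embed through a rate-limited API, wrapping the call in a retry with exponential backoff, plus a sanity check on the returned vectors, makes failures visible instead of silent. This is a minimal sketch: embed_with_retry and flaky_embed are hypothetical names, the delays are arbitrary, and catching every Exception is a simplification you would narrow to your client's rate-limit error.

```python
import time


def embed_with_retry(embed_fn, texts, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call embed_fn(texts) with exponential backoff and reject all-zero vectors."""
    for attempt in range(1, max_attempts + 1):
        try:
            vectors = embed_fn(texts)
            if any(not any(vector) for vector in vectors):
                raise ValueError("embedding call returned an empty or all-zero vector")
            return vectors
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...


calls = {"count": 0}


def flaky_embed(texts):
    """Hypothetical embedder that is rate-limited twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("429 Too Many Requests")
    return [[0.1, 0.2, 0.3] for _ in texts]


# sleep is replaced with a no-op so the demo finishes instantly
vectors = embed_with_retry(flaky_embed, ["chunk one", "chunk two"], sleep=lambda s: None)
print(len(vectors), calls["count"])
```
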

Source Documents Look Wrong After Persist + Load

You persist an index, load it from disk, and node.text is empty or truncated. The cause is almost always that your custom node parser used a different chunk size on the load side, or that the persisted format does not match the LlamaIndex version you are loading with. Pin the LlamaIndex version in requirements.txt and store the version alongside the persisted index so a mismatch raises loudly instead of silently corrupting retrieval.

Related Articles

Fix: DSPy Not Working — LM Configuration, Signatures, Modules, Optimizers, and Cache Surprises

Fix: Milvus Not Working — Connection Errors, Schema Setup, and Index Build Failures

Fix: Weaviate Not Working — Client v4 Migration, Schema Setup, and Vectorizer Errors

Fix: ChromaDB Not Working — Persistent Client, Collection Errors, and Embedding Function Issues
