Build Text-to-SQL Using an LLM
Text-to-SQL is a powerful application of large language models (LLMs) like GPT that lets users interact with a database using natural language. In this article, we walk through how to build a Text-to-SQL application using an LLM.
1. What Are OpenAI and LLMs?
OpenAI is an AI research and deployment company focused on ensuring that artificial general intelligence (AGI) benefits all of humanity. It is known for developing advanced language models like GPT-3, GPT-4, and the newer GPT-4o, which power applications such as ChatGPT. These models are capable of understanding and generating human-like language, performing tasks like summarization, question answering, translation, and even code generation.

LLM stands for Large Language Model, a type of AI model trained on massive amounts of textual data to understand context, syntax, semantics, and intent in human language. LLMs leverage deep learning, particularly transformer architectures, to process and generate language across a wide range of domains and tasks. When integrated with external systems like databases or APIs, LLMs can act as powerful natural language interfaces that convert user inputs into structured queries or commands.
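For instance, a Text-to-SQL system can translate a plain-English request directly into a structured query. The table and question below are purely illustrative:

-- Question: "Which departments have an average salary above 80,000?"
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 80000;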
2. PostgreSQL Setup on Docker and Sample Data
Begin by setting up a PostgreSQL container using Docker. This eliminates the need for a manual installation and provides an isolated, reproducible environment for development and testing.
docker run --name text2sql-postgres \
  -e POSTGRES_USER=admin \
  -e POSTGRES_PASSWORD=admin123 \
  -e POSTGRES_DB=employees_db \
  -p 5432:5432 \
  -d postgres
This command will create a new container named `text2sql-postgres`, set up a database user `admin` with the password `admin123`, initialize a database named `employees_db`, expose port `5432` to allow connections from your host machine, and run the official `postgres` image in detached mode. After the container starts, you can interact with the database using the `psql` CLI or GUI-based tools like DBeaver or pgAdmin. To connect with `psql`, use:
psql -h localhost -p 5432 -U admin -d employees_db
Once connected, execute the SQL statements below to create a sample table and seed it with mock data:
CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    department VARCHAR(100),
    salary INTEGER
);

INSERT INTO employees (name, department, salary) VALUES
    ('Alice', 'Engineering', 95000),
    ('Bob', 'HR', 60000),
    ('Charlie', 'Engineering', 105000),
    ('Diana', 'Sales', 70000);
This will create a table named `employees` with basic fields for ID, name, department, and salary, and insert sample records to simulate a simple company structure. This sample data serves as a foundational dataset for testing Text-to-SQL conversions, building query interfaces, or training AI-based SQL generation models.
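Before wiring anything to an LLM, it is worth sanity-checking the seed data. The aggregate query below is a simple verification step (not part of the original setup); given the four rows above, Engineering should report a headcount of 2 with an average salary of 100000:

-- Verify the seed data: headcount and average salary per department
SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY department;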
3. Sample Python Example
We’ll now use Python to connect to both Azure OpenAI’s API and the PostgreSQL database, using SQLAlchemy as the database access layer. This integration lets us generate SQL queries dynamically from natural language inputs and execute them against the database.
3.1 Dependencies
The following Python packages are required to build this integration:
pip install openai sqlalchemy psycopg2-binary python-dotenv
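The script below reads its Azure OpenAI credentials from environment variables via `python-dotenv`, so create a `.env` file next to your script. The variable names below match the ones the code reads; the values are placeholders you must replace with your own (the API version and deployment name shown here simply mirror the sample output in section 3.3):

OPENAI_API_KEY=<your-azure-openai-api-key>
OPENAI_API_AZURE_API_VERSION=2024-08-01-preview
OPENAI_API_AZURE_API_ENDPOINT=https://<your-resource>.openai.azure.com/
OPENAI_API_AZURE_DEPLOYMENT=gpt-4o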
3.2 Python Code
Create a `main.py` file and add the following code to implement the complete flow that connects your Python application to Azure OpenAI and PostgreSQL using SQLAlchemy. The script allows users to input natural language questions, convert them to SQL using OpenAI, and execute those queries on the database.
# Author: Yatin B.
import logging
import re
from os import getenv

import sqlalchemy
from dotenv import load_dotenv
from openai import AzureOpenAI
from sqlalchemy import text
from sqlalchemy.exc import SQLAlchemyError

# ========== Setup Logging ==========
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# ========== Load Environment ==========
load_dotenv()

# ========== Constants ==========
TABLE_NAME = "employees"
# Connection string matching the Docker setup from section 2
DB_URL = 'postgresql://admin:admin123@localhost:5432/employees_db'

# ========== Setup Clients ==========
def get_openai_client() -> tuple[AzureOpenAI, str]:
    """Initialize and return an Azure OpenAI client."""
    api_key = getenv('OPENAI_API_KEY')
    api_version = getenv('OPENAI_API_AZURE_API_VERSION')
    api_endpoint = getenv('OPENAI_API_AZURE_API_ENDPOINT')
    deployment = getenv('OPENAI_API_AZURE_DEPLOYMENT')

    if not all([api_key, api_version, api_endpoint, deployment]):
        logger.error("Missing OpenAI Azure environment configuration.")
        raise EnvironmentError("Missing OpenAI Azure credentials.")

    client = AzureOpenAI(
        api_key=api_key,
        api_version=api_version,
        azure_endpoint=api_endpoint
    )
    return client, deployment

def get_db_engine() -> sqlalchemy.Engine:
    """Initialize and return a SQLAlchemy engine."""
    return sqlalchemy.create_engine(DB_URL)

# ========== Database Utilities ==========
def get_table_schema(engine: sqlalchemy.Engine, table_name: str) -> str:
    """Fetch column names and data types from the specified table."""
    logger.info(f"Fetching schema for table: {table_name}")
    query = text("""
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_name = :table
    """)
    try:
        with engine.connect() as connection:
            result = connection.execute(query, {"table": table_name})
            schema = "\n".join(f"{row.column_name}: {row.data_type}" for row in result)
            if not schema:
                raise ValueError(f"No schema found for table '{table_name}'")
            return schema
    except SQLAlchemyError as e:
        logger.exception("Error fetching schema")
        raise RuntimeError(f"Schema retrieval failed: {e}")

def run_sql_query(engine: sqlalchemy.Engine, sql: str) -> list[dict]:
    """Execute a SQL query and return results."""
    logger.info("Executing SQL query...")
    try:
        with engine.connect() as conn:
            result = conn.execute(text(sql))
            return [dict(row._mapping) for row in result]
    except SQLAlchemyError as e:
        logger.exception("SQL execution failed")
        raise RuntimeError(f"Database query failed: {e}")

# ========== OpenAI Utilities ==========
def create_prompt(question: str, schema: str, table_name: str) -> str:
    """Generate prompt to send to OpenAI."""
    return f"""
You are an assistant that converts natural language into SQL queries.

Table: {table_name}
Schema:
{schema}

Question: {question}
SQL:
"""

def extract_sql_from_response(response_content: str) -> str:
    """Extract SQL query from OpenAI response by removing markdown formatting."""
    return re.sub(r"```(?:sql)?\s*|\s*```", "", response_content).strip()

def text_to_sql(client: AzureOpenAI, deployment: str, question: str,
                schema: str, table_name: str) -> str:
    """Generate SQL from a natural language question using OpenAI."""
    logger.info(f"Generating SQL using OpenAI for question: {question}")
    prompt = create_prompt(question, schema, table_name)
    try:
        response = client.chat.completions.create(
            model=deployment,
            messages=[
                {"role": "system", "content": "You are a helpful assistant that converts natural language questions into SQL queries."},
                {"role": "user", "content": prompt}
            ]
        )
        raw_sql = response.choices[0].message.content
        sql_query = extract_sql_from_response(raw_sql)
        logger.info("Generated SQL: %s", sql_query)
        return sql_query
    except Exception as e:
        logger.exception("OpenAI API call failed")
        raise RuntimeError(f"Failed to generate SQL: {e}")

# ========== Main Entry Point ==========
def main():
    try:
        question = input("Enter your question: ").strip()
        if not question:
            raise ValueError("Empty question provided.")

        # Setup
        engine = get_db_engine()
        client, deployment = get_openai_client()

        # Generate SQL
        schema = get_table_schema(engine, TABLE_NAME)
        sql_query = text_to_sql(client, deployment, question, schema, TABLE_NAME)

        # Execute and display results
        results = run_sql_query(engine, sql_query)
        logger.info(f"\nGenerated SQL Query:\n{sql_query}\n")
        logger.info("Query Results:")
        for row in results:
            print(row)

    except Exception as e:
        logger.error(f"[ERROR] {e}")

if __name__ == "__main__":
    main()
3.2.1 Code Explanation
This Python script connects Azure OpenAI with a PostgreSQL database using SQLAlchemy to convert natural language questions into executable SQL queries. It starts by importing required libraries for logging, environment management, database access, and OpenAI integration. Logging is configured to output timestamped messages, and environment variables are loaded using `python-dotenv`. Constants like `TABLE_NAME` and `DB_URL` are defined to point to the employees table and the PostgreSQL connection string, respectively.

The `get_openai_client()` function initializes an `AzureOpenAI` client using credentials from environment variables, while `get_db_engine()` sets up a SQLAlchemy engine to interact with the database. The `get_table_schema()` function queries the database to retrieve column names and types, which are used to construct a schema prompt. The `run_sql_query()` function takes a SQL query, executes it, and returns the result as a list of dictionaries. The `create_prompt()` function builds a prompt that includes the table schema and user question, which is sent to OpenAI via `text_to_sql()`, where a chat completion request is made to generate the SQL query. The `extract_sql_from_response()` function cleans the SQL output by removing markdown formatting.

Finally, the `main()` function orchestrates the entire workflow: it takes user input, prepares the schema, gets the generated SQL from OpenAI, runs it against the database, and prints the result. The script is executed when run directly and is designed to handle common errors with informative logging throughout.
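As a quick illustration of the markdown-stripping step, here is a minimal, standalone check of the regex used in `extract_sql_from_response()` (the sample reply string is hypothetical):

import re

def extract_sql_from_response(response_content: str) -> str:
    """Strip markdown code fences (``` or ```sql) from the model's reply."""
    return re.sub(r"```(?:sql)?\s*|\s*```", "", response_content).strip()

# LLMs often wrap generated SQL in a fenced code block.
raw = "```sql\nSELECT * FROM employees WHERE department = 'Engineering';\n```"
print(extract_sql_from_response(raw))
# Output: SELECT * FROM employees WHERE department = 'Engineering';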
3.3 Sample Output
The following is an example of how the application behaves when a user inputs a natural language question; in this case, the question is “List all employees from Engineering”. The application fetches the schema of the target table, constructs a prompt, sends the prompt to Azure OpenAI, receives a SQL query as a response, executes the query against the PostgreSQL database, and logs the results. Each step is timestamped using the logging module for easy debugging and monitoring. The OpenAI response is successfully parsed, resulting in a valid SQL query that retrieves all employee records from the Engineering department.
Enter your question: List all employees from Engineering
2025-07-04 20:57:01,226 - INFO - Fetching schema for table: employees
2025-07-04 20:57:01,260 - INFO - Generating SQL using OpenAI for question: List all employees from Engineering
2025-07-04 20:57:02,739 - INFO - HTTP Request: POST https://jcg.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview "HTTP/1.1 200 OK"
2025-07-04 20:57:02,746 - INFO - Generated SQL: SELECT * FROM employees WHERE department = 'Engineering';
2025-07-04 20:57:02,746 - INFO - Executing SQL query...
2025-07-04 20:57:02,753 - INFO - Generated SQL Query:
SELECT * FROM employees WHERE department = 'Engineering';
2025-07-04 20:57:02,753 - INFO - Query Results:
{'id': 1, 'name': 'Alice', 'department': 'Engineering', 'salary': 95000}
{'id': 3, 'name': 'Charlie', 'department': 'Engineering', 'salary': 105000}
4. Conclusion
We’ve built a simple but functional Text-to-SQL application using OpenAI’s LLMs. With just natural language, users can query structured databases. For production scenarios, you can improve security by adding query validation (a minimal sketch follows below), and gain more control by using function calling on the chat/completions endpoint or frameworks like LangChain or LlamaIndex. This approach can be extended to multi-table joins, complex aggregations, and dynamic dashboarding with visualization tools like Streamlit or Dash.
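As one concrete take on the query-validation idea above, here is a minimal, hypothetical guard that rejects anything other than a single SELECT statement before execution. The `is_safe_select()` helper and its keyword list are assumptions for illustration, not part of the original script, and are a sketch rather than a complete defense:

import re

# Statements that modify data or schema; a generated query containing any of
# these as a keyword is rejected outright. (Hypothetical allow-list sketch.)
FORBIDDEN_KEYWORDS = {"insert", "update", "delete", "drop", "alter",
                      "truncate", "create", "grant", "revoke"}

def is_safe_select(sql: str) -> bool:
    """Return True only for a single SELECT statement with no forbidden keywords."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # more than one statement
        return False
    if not stripped.lower().startswith("select"):
        return False
    tokens = set(re.findall(r"[a-z_]+", stripped.lower()))
    return tokens.isdisjoint(FORBIDDEN_KEYWORDS)

# Usage: validate before executing the generated query.
assert is_safe_select("SELECT * FROM employees WHERE department = 'Engineering';")
assert not is_safe_select("DROP TABLE employees;")

In `main()`, you would call `is_safe_select(sql_query)` before `run_sql_query()` and abort if it returns `False`.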