We decompose the natural language question into logical clauses based on semantic units and incorporate this information into the prompt, allowing the LLM to generate the Pre-SQL. At this stage, the model fully utilizes the information from the Question and hint, as well as the DB schema information without value details.
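For illustration, here is a minimal sketch of how such a decomposition-augmented prompt could be assembled. The function name, clause list, and template wording are assumptions for illustration, not the exact prompt used in this repo:

```python
# Hypothetical sketch: assemble a Pre-SQL prompt from decomposed clauses.
# The template wording and function name are illustrative assumptions.
def build_pre_sql_prompt(question: str, hint: str, schema: str, clauses: list[str]) -> str:
    clause_text = "\n".join(f"- {c}" for c in clauses)
    return (
        "You are a text-to-SQL assistant.\n"
        f"Database schema (no value details):\n{schema}\n\n"
        f"Question: {question}\n"
        f"Hint: {hint}\n"
        "The question decomposes into these logical clauses:\n"
        f"{clause_text}\n\n"
        "Write a single SQL query (the Pre-SQL) that answers the question."
    )
```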
We instruct the model to extract the tables and columns involved in the Pre-SQL and then construct the following information:
1. In the DB schema where the tables and columns involved in the Pre-SQL are masked, the model is tasked to explore potential table and column information based on the Question.
2. For the tables and columns involved in the Pre-SQL, a value condition checker is used to further filter out the columns related to value condition judgments (see the sketch after this list):
   - 2.1 For columns involved in value condition judgments, similarity search methods are used to provide value examples with high similarity to the keywords in the natural language question.
   - 2.2 For columns not involved in value condition judgments, SQL queries are directly constructed to fetch value examples.
3. For the tables and columns involved in the Pre-SQL, any incorrect relationships or errors in table or column names are captured by executing SQL against the database (see the sketch after this list).
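A minimal sketch of how steps 2 and 3 could be implemented over a SQLite database (BIRD databases are SQLite). The function names, the keyword extraction, and the `difflib`-based similarity measure are illustrative assumptions, not this repo's implementation:

```python
import re
import sqlite3
from difflib import SequenceMatcher

def value_examples(db_path: str, table: str, column: str,
                   question: str, k: int = 3) -> list[str]:
    """Fetch distinct values; rank them by string similarity to the
    question's keywords (used for value-condition columns)."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            f'SELECT DISTINCT "{column}" FROM "{table}" LIMIT 100'
        ).fetchall()
    values = [str(r[0]) for r in rows if r[0] is not None]
    keywords = re.findall(r"\w+", question.lower())
    if not keywords:
        return values[:k]
    def score(v: str) -> float:
        return max(SequenceMatcher(None, v.lower(), w).ratio() for w in keywords)
    return sorted(values, key=score, reverse=True)[:k]

def capture_sql_errors(db_path: str, sql: str) -> str | None:
    """Execute the Pre-SQL and return the error message, if any;
    wrong table/column names surface as sqlite3 errors."""
    try:
        with sqlite3.connect(db_path) as conn:
            conn.execute(sql).fetchmany(5)
        return None
    except sqlite3.Error as e:
        return str(e)
```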
In summary, three pieces of information will be obtained:
- Simplified DB schema information with the Pre-SQL tables and columns masked.
- Value example information based on the Pre-SQL.
- Potential error information in the Pre-SQL.
The model then corrects the Pre-SQL based on the information obtained above, producing the Second-SQL.
We execute the Second-SQL on the database and integrate the execution results into the prompt as input for the model. The model is instructed to analyze whether the execution results of the Second-SQL are reasonable and to refine the Second-SQL accordingly, producing the Final-SQL.
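A minimal sketch of this execute-and-refine step, assuming a SQLite database and a generic `call_llm` function; both the function and the prompt wording are illustrative assumptions:

```python
import sqlite3

# Hypothetical sketch of the Second-SQL -> Final-SQL refinement step.
def refine_to_final_sql(db_path: str, question: str, second_sql: str,
                        call_llm) -> str:
    """Run the Second-SQL, show the model its results (or error),
    and ask for a refined Final-SQL."""
    try:
        with sqlite3.connect(db_path) as conn:
            result = conn.execute(second_sql).fetchmany(10)
    except sqlite3.Error as e:
        result = f"EXECUTION ERROR: {e}"
    prompt = (
        f"Question: {question}\n"
        f"SQL: {second_sql}\n"
        f"Execution result (truncated): {result}\n"
        "If the result looks unreasonable for the question, output a "
        "corrected SQL query; otherwise output the SQL unchanged."
    )
    return call_llm(prompt)
```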
```
GSR/
├── README.md
├── requirements.txt
│
├── data/
│   └── databases/
│       └── dev_20240627/
│
├── data_process/
│   └── sql_data_process_BIRD.py
│
├── run/
│   └── GSR.py
│
└── tools/
```
```bash
conda create -n GSR python=3.10
conda activate GSR
pip install -r requirements.txt
```

Please place the test set files in the directory data/databases/, then set the path parameters.
In `data_process_config.py`, set `SQL_DATA_INFO` and `DATABASE_PATH`. The parameters to set in `SQL_DATA_INFO` include `data_source`, `file`, `tables_file`, and `database_name`.
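A hypothetical example of these settings for the BIRD dev set; the list-of-dicts structure, paths, and values below are illustrative assumptions, not verified against this repo:

```python
# data_process_config.py -- illustrative values only
DATABASE_PATH = "../data/databases/dev_20240627/dev_databases"

SQL_DATA_INFO = [
    {
        "data_source": "bird",             # dataset name (assumed)
        "file": "dev.json",                # questions file (assumed)
        "tables_file": "dev_tables.json",  # schema description file (assumed)
        "database_name": "dev_databases",  # database folder name (assumed)
    }
]
```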
```bash
cd data_process/
python sql_data_process_BIRD.py
```

Four files are generated after execution:
- all_mappings.json
- raw_format_data.json
- Pre_input.json
- Second_input.json
Next, set the parameters of ICL-SQL. In `ICL-SQL.py`, mainly set `database_file_path`, `start_idx`, and `end_idx`.
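For example (the path and index values below are illustrative assumptions):

```python
# In ICL-SQL.py -- illustrative values only
database_file_path = "../data/databases/dev_20240627/dev_databases"
start_idx = 0    # index of the first question to process (assumed)
end_idx = 100    # index at which to stop (assumed)
```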
```bash
cd run/
python GSR.py
```

After execution, you will find the generated SQL file in the output directory.


