Commit 372aebc

Commit message: changes for release 0.26.0, bunch of fixes & improvements
Parent: 07f8b1f

File tree

16 files changed, +556 -43 lines changed

CHANGELOG.txt

Lines changed: 19 additions & 0 deletions

@@ -1,3 +1,22 @@
+ **v0.26.0**
+ Improvements:
+
+ 1. Added a more explicit debug message on Statement errors - https://github.com/xnuinside/simple-ddl-parser/issues/116
+ 2. Added support for the "USING INDEX TABLESPACE" statement in ALTER - https://github.com/xnuinside/simple-ddl-parser/issues/119
+ 3. Added support for IN statements in CHECKS - https://github.com/xnuinside/simple-ddl-parser/issues/121
+
+ New features:
+ 1. Support SparkSQL USING - https://github.com/xnuinside/simple-ddl-parser/issues/117
+ Updates initiated by ticket https://github.com/xnuinside/simple-ddl-parser/issues/120:
+ 2. In Parser you can use the argument json_dump=True in the .run() method if you want to get the result in JSON format.
+ - README updated
+
+ Fixes:
+ 1. Added support for PARTITION BY one column without a type
+ 2. ALTER TABLE ADD CONSTRAINT PRIMARY KEY - https://github.com/xnuinside/simple-ddl-parser/issues/119
+ 3. Fix for parsing the SET statement - https://github.com/xnuinside/simple-ddl-parser/pull/122
+ 4. Fix for columns without properties disappearing from the output - https://github.com/xnuinside/simple-ddl-parser/issues/123
+
  **v0.25.0**
  ## Fixes:
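
To make the new changelog entries concrete, here is a minimal, hypothetical sketch (table, column, and tablespace names are invented for illustration) of the kind of DDL that the IN-in-CHECK, ALTER ... ADD CONSTRAINT PRIMARY KEY, and USING INDEX TABLESPACE changes are meant to handle:

```python
from simple_ddl_parser import DDLParser

# Hypothetical DDL touching three of the 0.26.0 items:
# an IN list inside a CHECK, ALTER ... ADD CONSTRAINT PRIMARY KEY,
# and ALTER ... USING INDEX TABLESPACE.
ddl = """
create table sales.orders (
    order_id bigint not null,
    status varchar(10) check (status in ('NEW', 'PAID', 'SHIPPED'))
);

alter table sales.orders
    add constraint orders_pk primary key (order_id)
    using index tablespace fast_ts;
"""

print(DDLParser(ddl).run())
```
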
README.md

Lines changed: 42 additions & 2 deletions

@@ -14,7 +14,21 @@ However, in process of adding support for new statements & features I see that o


  ### How does it work?
- Parser tested on different DDLs mostly for PostgreSQL & Hive. But idea to support as much as possible DDL dialects (AWS Redshift, Oracle, Hive, MsSQL, BigQuery etc.). You can check dialects sections after `Supported Statements` section to get more information that statements from dialects already supported by parser.
+
+ Parser supports:
+
+ - SQL
+ - HQL (Hive)
+ - MSSQL dialect
+ - Oracle dialect
+ - MySQL dialect
+ - PostgreSQL dialect
+ - BigQuery
+ - Redshift
+ - Snowflake
+ - SparkSQL
+
+ You can check the dialect sections after the `Supported Statements` section for more information about which statements from those dialects are already supported by the parser. If you need more statements or new dialects - feel free to open an issue.

  ### Feel free to open Issue with DDL sample
  **If you need some statement, that not supported by parser yet**: please provide DDL example & information about that is it SQL dialect or DB.

@@ -170,6 +184,26 @@ You can provide target path where you want to dump result with argument **-t**,
  sdp tests/sql/test_two_tables.sql -t dump_results/

  ```
+ ### Get Output in JSON
+
+ If you want to get the output in JSON on stdout, you can use the argument **json_dump=True** in the **.run()** method:
+ ```python
+ from simple_ddl_parser import DDLParser
+
+
+ parse_results = DDLParser("""create table dev.data_sync_history(
+     data_sync_id bigint not null,
+     sync_count bigint not null,
+ ); """).run(json_dump=True)
+
+ print(parse_results)
+
+ ```
+ Output will be:
+
+ ```json
+ [{"columns": [{"name": "data_sync_id", "type": "bigint", "size": null, "references": null, "unique": false, "nullable": false, "default": null, "check": null}, {"name": "sync_count", "type": "bigint", "size": null, "references": null, "unique": false, "nullable": false, "default": null, "check": null}], "primary_key": [], "alter": {}, "checks": [], "index": [], "partitioned_by": [], "tablespace": null, "schema": "dev", "table_name": "data_sync_history"}]
+ ```

  ### More details

@@ -297,7 +331,7 @@ In output you will have names like 'dbo' and 'TO_Requests', not '[dbo]' and '[TO

  - STATEMENTS: PRIMARY KEY, CHECK, FOREIGN KEY in table definitions (in create table();)

- - ALTER TABLE STATEMENTS: ADD CHECK (with CONSTRAINT), ADD FOREIGN KEY (with CONSTRAINT), ADD UNIQUE, ADD DEFAULT FOR, ALTER TABLE ONLY, ALTER TABLE IF EXISTS
+ - ALTER TABLE STATEMENTS: ADD CHECK (with CONSTRAINT), ADD FOREIGN KEY (with CONSTRAINT), ADD UNIQUE, ADD DEFAULT FOR, ALTER TABLE ONLY, ALTER TABLE IF EXISTS; ALTER .. PRIMARY KEY; ALTER .. USING INDEX TABLESPACE

  - PARTITION BY statement

@@ -319,6 +353,11 @@ In output you will have names like 'dbo' and 'TO_Requests', not '[dbo]' and '[TO

  - CREATE DATABASE + Properties parsing

+ ### SparkSQL Dialect statements
+
+ - USING
+
+
  ### HQL Dialect statements

  - PARTITIONED BY statement

@@ -385,6 +424,7 @@ In output you will have names like 'dbo' and 'TO_Requests', not '[dbo]' and '[TO

  ### TODO in next Releases (if you don't see feature that you need - open the issue)

+ -1. Update command line to parse all arguments that are supported by Parser
  0. Add support for ALTER TABLE ... ADD COLUMN
  1. Add more support for CREATE type IS TABLE (example: CREATE OR REPLACE TYPE budget_tbl_typ IS TABLE OF NUMBER(8,2);
  2. Add support (ignore correctly) ALTER TABLE ... DROP CONSTRAINT ..., ALTER TABLE ... DROP INDEX ...

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
  [tool.poetry]
  name = "simple-ddl-parser"
- version = "0.25.0"
+ version = "0.26.0"
  description = "Simple DDL Parser to parse SQL & dialects like HQL, TSQL (MSSQL), Oracle, AWS Redshift, Snowflake, MySQL, PostgreSQL, etc ddl files to json/python dict with full information about columns: types, defaults, primary keys, etc.; sequences, alters, custom types & other entities from ddl."
  authors = ["Iuliia Volkova <[email protected]>"]
  license = "MIT"

simple_ddl_parser/ddl_parser.py

Lines changed: 45 additions & 31 deletions

@@ -1,5 +1,7 @@
  from typing import Dict, List

+ from ply.lex import LexToken
+
  from simple_ddl_parser import tokens as tok
  from simple_ddl_parser.dialects.bigquery import BigQuery
  from simple_ddl_parser.dialects.hql import HQL

@@ -8,6 +10,7 @@
  from simple_ddl_parser.dialects.oracle import Oracle
  from simple_ddl_parser.dialects.redshift import Redshift
  from simple_ddl_parser.dialects.snowflake import Snowflake
+ from simple_ddl_parser.dialects.spark_sql import SparkSQL
  from simple_ddl_parser.dialects.sql import BaseSQL
  from simple_ddl_parser.parser import Parser

@@ -17,13 +20,13 @@ class DDLParserError(Exception):


  class DDLParser(
-     Parser, Snowflake, BaseSQL, HQL, MySQL, MSSQL, Oracle, Redshift, BigQuery
+     Parser, SparkSQL, Snowflake, BaseSQL, HQL, MySQL, MSSQL, Oracle, Redshift, BigQuery
  ):

      tokens = tok.tokens
      t_ignore = "\t \r"

-     def get_tag_symbol_value_and_increment(self, t):
+     def get_tag_symbol_value_and_increment(self, t: LexToken):
          # todo: need to find less hacky way to parse HQL structure types
          if "<" in t.value:
              t.type = "LT"

@@ -33,15 +36,15 @@ def get_tag_symbol_value_and_increment(self, t):
              self.lexer.lt_open -= t.value.count(">")
          return t

-     def after_columns_tokens(self, t):
+     def after_columns_tokens(self, t: LexToken):
          t.type = tok.after_columns_tokens.get(t.value.upper(), t.type)
          if t.type != "ID":
              self.lexer.after_columns = True
          elif self.lexer.columns_def:
              t.type = tok.columns_defenition.get(t.value.upper(), t.type)
          return t

-     def process_body_tokens(self, t):
+     def process_body_tokens(self, t: LexToken):
          if (
              self.lexer.last_par == "RP" and not self.lexer.lp_open
          ) or self.lexer.after_columns:

@@ -52,7 +55,7 @@ def process_body_tokens(self, t):
              t.type = tok.sequence_reserved.get(t.value.upper(), "ID")
          return t

-     def tokens_not_columns_names(self, t):
+     def tokens_not_columns_names(self, t: LexToken):
          if not self.lexer.check:
              for key in tok.symbol_tokens_no_check:
                  if key in t.value:

@@ -78,28 +81,28 @@ def tokens_not_columns_names(self, t):

          return t

-     def set_lexer_tags(self, t):
+     def set_lexer_tags(self, t: LexToken):
          if t.type == "SEQUENCE":
              self.lexer.sequence = True
          elif t.type == "CHECK":
              self.lexer.check = True

-     def t_DOT(self, t):
+     def t_DOT(self, t: LexToken):
          r"\."
          t.type = "DOT"
          return self.set_last_token(t)

-     def t_STRING(self, t):
+     def t_STRING(self, t: LexToken):
          r"((\')([a-zA-Z_,`0-9:><\=\-\+.\~\%$\!() {}\[\]\/\\\"\#\*&^|?;±§@~]*)(\')){1}"
          t.type = "STRING"
          return self.set_last_token(t)

-     def t_DQ_STRING(self, t):
+     def t_DQ_STRING(self, t: LexToken):
          r"((\")([a-zA-Z_,`0-9:><\=\-\+.\~\%$\!() {}'\[\]\/\\\\#\*&^|?;±§@~]*)(\")){1}"
          t.type = "DQ_STRING"
          return self.set_last_token(t)

-     def is_token_column_name(self, t):
+     def is_token_column_name(self, t: LexToken):
          """many of reserved words can be used as column name,
          to decide is it a column name or not we need do some checks"""
          skip_id_tokens = ["(", ")", ","]

@@ -111,28 +114,34 @@ def is_token_column_name(self, t):
              and t.value.upper() not in tok.first_liners
          )

-     def is_creation_name(self, t):
+     def is_creation_name(self, t: LexToken):
          """many of reserved words can be used as column name,
          to decide is it a column name or not we need do some checks"""
          skip_id_tokens = ["(", ")", ","]
+         exceptional_keys = [
+             "SCHEMA",
+             "TABLE",
+             "DATABASE",
+             "TYPE",
+             "DOMAIN",
+             "TABLESPACE",
+             "INDEX",
+             "CONSTRAINT",
+             "EXISTS",
+         ]
          return (
              t.value not in skip_id_tokens
              and t.value.upper() not in ["IF"]
-             and self.lexer.last_token
-             in [
-                 "SCHEMA",
-                 "TABLE",
-                 "DATABASE",
-                 "TYPE",
-                 "DOMAIN",
-                 "TABLESPACE",
-                 "INDEX",
-                 "CONSTRAINT",
-                 "EXISTS",
-             ]
+             and self.lexer.last_token in exceptional_keys
+             and not self.exceptional_cases(t.value.upper())
          )

-     def t_ID(self, t):
+     def exceptional_cases(self, value: str) -> bool:
+         if value == "TABLESPACE" and self.lexer.last_token == "INDEX":
+             return True
+         return False
+
+     def t_ID(self, t: LexToken):
          r"([0-9]\.[0-9])\w|([a-zA-Z_,0-9:><\/\=\-\+\~\%$\*\()!{}\[\]\`\[\]]+)"
          t.type = tok.symbol_tokens.get(t.value, "ID")

@@ -141,7 +150,6 @@ def t_ID(self, t):
              self.lexer.columns_def = True
              self.lexer.last_token = "LP"
              return t
-
          elif self.is_token_column_name(t) or self.lexer.last_token == "DOT":
              t.type = "ID"
          elif t.type != "DQ_STRING" and self.is_creation_name(t):

@@ -156,25 +164,31 @@ def t_ID(self, t):

          return self.set_last_token(t)

-     def commat_type(self, t):
+     def commat_type(self, t: LexToken):
          if t.type == "COMMA" and self.lexer.lt_open:
              t.type = "COMMAT"

-     def capitalize_tokens(self, t):
+     def capitalize_tokens(self, t: LexToken):
          if t.type != "ID" and t.type not in ["LT", "RT"]:
              t.value = t.value.upper()

-     def set_lexx_tags(self, t):
+     def set_parathesis_tokens(self, t: LexToken):
          if t.type in ["RP", "LP"]:
              if t.type == "RP" and self.lexer.lp_open:
                  self.lexer.lp_open -= 1
              self.lexer.last_par = t.type
+
+     def set_lexx_tags(self, t: LexToken):
+         self.set_parathesis_tokens(t)
+
+         if t.type == "ALTER":
+             self.lexer.is_alter = True
          elif t.type in ["TYPE", "DOMAIN", "TABLESPACE"]:
              self.lexer.is_table = False
-         elif t.type in ["TABLE", "INDEX"]:
+         elif t.type in ["TABLE", "INDEX"] and not self.lexer.is_alter:
              self.lexer.is_table = True

-     def set_last_token(self, t):
+     def set_last_token(self, t: LexToken):
          self.lexer.last_token = t.type
          return t

@@ -190,7 +204,7 @@ def p_id(self, p):
          if p[0].startswith(symbol) and p[0].endswith(delimeters_to_end[num]):
              p[0] = p[0][1:-1]

-     def t_error(self, t):
+     def t_error(self, t: LexToken):
          raise DDLParserError("Unknown symbol %r" % (t.value[0],))

      def p_error(self, p):
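
The most substantive change above is the new `exceptional_cases` guard in `is_creation_name`: a token that follows a keyword such as INDEX is normally taken as the created object's name, but in `ALTER ... USING INDEX TABLESPACE ...` the word TABLESPACE follows INDEX and must not be read as a name. A standalone re-statement of that condition (an illustration under those assumptions, not the parser's actual code path) could look like this:

```python
# Sketch of the guard added to is_creation_name(): a token after one of the
# "creation" keywords is normally the object's name, except for known keyword
# pairs such as INDEX -> TABLESPACE ("USING INDEX TABLESPACE ...").
CREATION_KEYWORDS = {"SCHEMA", "TABLE", "DATABASE", "TYPE", "DOMAIN",
                     "TABLESPACE", "INDEX", "CONSTRAINT", "EXISTS"}

def is_exceptional_case(value: str, last_token: str) -> bool:
    # Mirrors DDLParser.exceptional_cases(): TABLESPACE right after INDEX
    # belongs to "USING INDEX TABLESPACE", so it is not a creation name.
    return value == "TABLESPACE" and last_token == "INDEX"

def looks_like_creation_name(value: str, last_token: str) -> bool:
    return (
        value not in ("(", ")", ",")
        and value.upper() != "IF"
        and last_token in CREATION_KEYWORDS
        and not is_exceptional_case(value.upper(), last_token)
    )

print(looks_like_creation_name("TABLESPACE", "INDEX"))   # False: keyword pair
print(looks_like_creation_name("fast_ts", "TABLESPACE"))  # True: a real name
```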

simple_ddl_parser/dialects/__init__.py

Whitespace-only changes.

simple_ddl_parser/dialects/hql.py

Lines changed: 3 additions & 2 deletions

@@ -129,9 +129,10 @@ def p_expression_stored_as(self, p):
          p[0]["stored_as"] = p_list[-1]

      def p_expression_partitioned_by_hql(self, p):
-         """expr : expr PARTITIONED BY pid_with_type"""
+         """expr : expr PARTITIONED BY pid_with_type
+         | expr PARTITIONED BY LP pid RP"""
          p[0] = p[1]
-         p_list = list(p)
+         p_list = remove_par(list(p))
          p[0]["partitioned_by"] = p_list[-1]

      def p_pid_with_type(self, p):
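
The extra `LP pid RP` alternative lets Hive-style `PARTITIONED BY` accept a plain column list without types. A minimal, hypothetical example (table and column names invented) of DDL the widened rule is intended to accept:

```python
from simple_ddl_parser import DDLParser

# Hypothetical Hive-style DDL: PARTITIONED BY names a column without a type,
# which is what the new "expr PARTITIONED BY LP pid RP" alternative targets.
ddl = """
CREATE TABLE logs.events (
    event_id string,
    payload string
)
PARTITIONED BY (event_date)
STORED AS PARQUET;
"""

print(DDLParser(ddl).run())
```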

simple_ddl_parser/dialects/snowflake.py

Lines changed: 1 addition & 0 deletions

@@ -10,6 +10,7 @@ def p_clone(self, p):
      def p_table_properties(self, p):
          """table_properties : id id id"""
          p_list = list(p)
+         print(p_list, "table_properties")
          p[0] = {p_list[-3]: p_list[-1]}

      def p_expression_cluster_by(self, p):
simple_ddl_parser/dialects/spark_sql.py

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+ class SparkSQL:
+     def p_expression_using(self, p):
+         """expr : expr using"""
+         p[0] = p[1]
+         p[1].update(p[2])
+
+     def p_using(self, p):
+         """using : USING id"""
+         p_list = list(p)
+         p[0] = {"using": p_list[-1]}
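
The new `SparkSQL` mixin recognizes a bare `USING <id>` clause and merges it into the table expression. A hypothetical usage sketch (table name and data source invented); the parsed result is expected to carry a `using` entry alongside the usual column data:

```python
from simple_ddl_parser import DDLParser

# Hypothetical SparkSQL DDL with a USING clause, the statement that the
# new p_using rule ("using : USING id") is meant to recognize.
ddl = """
CREATE TABLE default.events (
    event_id bigint,
    name string
) USING DELTA;
"""

print(DDLParser(ddl).run())
```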
