BIRD 벤치마크 snowflake 시도

범위 : BIRD BENCH 의 FIANANCIAL 데이터셋에 한정하여 수동 벤치마크테스트

1.데이터 업로드 : Financial 데이터셋만 SQlite → Parquet 변환후 8개 테이블 업로드

- 파일1위치 : https://bird-bench.github.io/ - DEV 데이터셋 다운로드 - Dev_20240627 - dev_databases - financial - financial_sqlite
- 파일2위치 : https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird

#sqlite -> parquet 변환 python
import sqlite3
import pandas as pd
from pathlib import Path

SQLITE_PATH = "./dev_20240627/dev_databases/financial/financial.sqlite"
OUT_DIR = Path("./parquet/financial")
OUT_DIR.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(SQLITE_PATH)

# 테이블 목록
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table'",
    conn
)["name"]

print("테이블 목록:", list(tables))

for table in tables:
    query = f'SELECT * FROM "{table}"'
    df = pd.read_sql(query, conn)
    df.columns = [c.lower() for c in df.columns]
    out_path = OUT_DIR / f"{table}.parquet"
    df.to_parquet(out_path, index=False)
    print(f"V {table} → {out_path}")
conn.close()
print("SQLite → Parquet 변환 완료")

 

2.Sementic View 설정 : Cortex Anaylst - [Create New Sementic View] 방식으로 설정

dev_20240627/dev_databases/financial/database_description , dev_20240627/dev_tables.json 참조

  1. 8개 테이블별로 PK/Unique key 설정
  2. Bird Bench에서 제공하는 컬럼별 description/별칭 등입력
  3. relationship 설정

YAML/SEMENTIC VIEW 생성코드는 아래에 [더보기]

더보기
name: E260128
tables:
  - name: CARD
    base_table:
      database: SAMPLE_DATA
      schema: PUBLIC
      table: CARD
    dimensions:
      - name: CARD_ID
        synonyms:
          - credit card id
        description: id number of credit card
        expr: CARD_ID
        data_type: NUMBER(38,0)
      - name: DISP_ID
        synonyms:
          - disposition id
        description: disposition id
        expr: DISP_ID
        data_type: NUMBER(38,0)
      - name: ISSUED
        description: |-
          the date when the credit card issued 
          in the form YYMMDD
        expr: ISSUED
        data_type: VARCHAR(16777216)
      - name: TYPE
        description: |-
          type of credit card
          ""junior": junior class of credit card; 
          "classic": standard class of credit card; 
          "gold": high-level credit card"
        expr: '"TYPE"'
        data_type: VARCHAR(16777216)
    primary_key:
      columns:
        - CARD_ID
  - name: CLIENT
    base_table:
      database: SAMPLE_DATA
      schema: PUBLIC
      table: CLIENT
    dimensions:
      - name: BIRTH_DATE
        description: birth date
        expr: BIRTH_DATE
        data_type: VARCHAR(16777216)
      - name: CLIENT_ID
        description: the unique number
        expr: CLIENT_ID
        data_type: NUMBER(38,0)
      - name: DISTRICT_ID
        synonyms:
          - location of branch
        description: location of branch
        expr: DISTRICT_ID
        data_type: NUMBER(38,0)
      - name: GENDER
        description: |-
          "F:female 
          M:male "
        expr: GENDER
        data_type: VARCHAR(16777216)
    primary_key:
      columns:
        - CLIENT_ID
  - name: DISP
    base_table:
      database: SAMPLE_DATA
      schema: PUBLIC
      table: DISP
    dimensions:
      - name: ACCOUNT_ID
        description: id number of account
        expr: ACCOUNT_ID
        data_type: NUMBER(38,0)
      - name: CLIENT_ID
        description: id number of client
        expr: CLIENT_ID
        data_type: NUMBER(38,0)
      - name: DISP_ID
        synonyms:
          - disposition id
        description: unique number of identifying this row of record
        expr: DISP_ID
        data_type: NUMBER(38,0)
      - name: TYPE
        description: |-
          type of disposition
          "OWNER" : "USER" : "DISPONENT"
          commonsense evidence:
          the account can only have the right to issue permanent orders or apply for loans
        expr: '"TYPE"'
        data_type: VARCHAR(16777216)
    primary_key:
      columns:
        - DISP_ID
  - name: DISTRICT
    base_table:
      database: SAMPLE_DATA
      schema: PUBLIC
      table: DISTRICT
    dimensions:
      - name: A10
        synonyms:
          - ratio of urban inhabitants
        description: ratio of urban inhabitants
        expr: A10
        data_type: FLOAT
      - name: A11
        synonyms:
          - average salary
        description: average salary
        expr: A11
        data_type: NUMBER(38,0)
      - name: A12
        synonyms:
          - unemployment rate 1995
        description: unemployment rate 1995
        expr: A12
        data_type: FLOAT
      - name: A13
        synonyms:
          - unemployment rate 1996
        description: unemployment rate 1996
        expr: A13
        data_type: FLOAT
      - name: A14
        synonyms:
          - no. of entrepreneurs per 1000 inhabitants
        description: no. of entrepreneurs per 1000 inhabitants
        expr: A14
        data_type: NUMBER(38,0)
      - name: A15
        synonyms:
          - no. of committed crimes 1995
        description: no. of committed crimes 1995
        expr: A15
        data_type: FLOAT
      - name: A16
        synonyms:
          - no. of entrepreneurs per 1000 inhabitants
        description: no. of entrepreneurs per 1000 inhabitants
        expr: A16
        data_type: NUMBER(38,0)
      - name: A2
        synonyms:
          - district_name
        description: district_name
        expr: A2
        data_type: VARCHAR(16777216)
      - name: A3
        synonyms:
          - region
        description: region
        expr: A3
        data_type: VARCHAR(16777216)
      - name: A4
        synonyms:
          - number of inhabitants
        expr: A4
        data_type: VARCHAR(16777216)
      - name: A5
        synonyms:
          - no. of municipalities with inhabitants < 499
        description: municipality < district < region
        expr: A5
        data_type: VARCHAR(16777216)
      - name: A6
        synonyms:
          - no. of municipalities with inhabitants 500-1999
        description: municipality < district < region
        expr: A6
        data_type: VARCHAR(16777216)
      - name: A7
        synonyms:
          - no. of municipalities with inhabitants 2000-9999
        description: municipality < district < region
        expr: A7
        data_type: VARCHAR(16777216)
      - name: A8
        synonyms:
          - no. of municipalities with inhabitants > 10000
        description: municipality < district < region
        expr: A8
        data_type: NUMBER(38,0)
      - name: A9
        description: not useful
        expr: A9
        data_type: NUMBER(38,0)
      - name: DISTRICT_ID
        synonyms:
          - location of branch
        description: location of branch
        expr: DISTRICT_ID
        data_type: NUMBER(38,0)
    facts: []
    primary_key:
      columns:
        - DISTRICT_ID
  - name: LOAN
    base_table:
      database: SAMPLE_DATA
      schema: PUBLIC
      table: LOAN
    dimensions:
      - name: ACCOUNT_ID
        description: the id number identifying the account
        expr: ACCOUNT_ID
        data_type: NUMBER(38,0)
      - name: AMOUNT
        description: |-
          approved amount
          unit:US dollar
        expr: AMOUNT
        data_type: NUMBER(38,0)
      - name: DATE
        description: the date when the loan is approved
        expr: DATE
        data_type: VARCHAR(16777216)
      - name: DURATION
        description: |-
          loan duration
          unit:month
        expr: DURATION
        data_type: NUMBER(38,0)
      - name: LOAN_ID
        description: the id number identifying the loan data
        expr: LOAN_ID
        data_type: NUMBER(38,0)
      - name: STATUS
        description: |-
          repayment status
          "'A' stands for contract finished, no problems;
          'B' stands for contract finished, loan not paid;
          'C' stands for running contract, OK so far;
          'D' stands for running contract, client in debt"
        expr: STATUS
        data_type: VARCHAR(16777216)
    facts:
      - name: PAYMENTS
        synonyms:
          - monthly payments
        description: |-
          monthly payments
          unit:month
        expr: PAYMENTS
        data_type: FLOAT
        access_modifier: public_access
    primary_key:
      columns:
        - LOAN_ID
  - name: TRANS
    base_table:
      database: SAMPLE_DATA
      schema: PUBLIC
      table: TRANS
    dimensions:
      - name: _ACCOUNT
        synonyms:
          - account of the partner
        expr: '"ACCOUNT"'
        data_type: FLOAT
      - name: ACCOUNT_ID
        expr: ACCOUNT_ID
        data_type: NUMBER(38,0)
      - name: BALANCE
        synonyms:
          - balance after transaction
        description: |-
          balance after transaction
          Unit:USD
        expr: BALANCE
        data_type: NUMBER(38,0)
      - name: BANK
        synonyms:
          - Unit:USD
        description: each bank has unique two-letter code
        expr: BANK
        data_type: VARCHAR(16777216)
      - name: DATE
        synonyms:
          - date of transaction
        description: date of transaction
        expr: DATE
        data_type: VARCHAR(16777216)
      - name: K_SYMBOL
        synonyms:
          - characterization of the transaction
        description: |-
          "POJISTNE": stands for insurrance payment
          "SLUZBY": stands for payment for statement
          "UROK": stands for interest credited
          "SANKC. UROK": sanction interest if negative balance
          "SIPO": stands for household
          "DUCHOD": stands for old-age pension
          "UVER": stands for loan payment
        expr: K_SYMBOL
        data_type: VARCHAR(16777216)
      - name: OPERATION
        synonyms:
          - mode of transaction
        description: |-
          mode of transaction
          ""VYBER KARTOU": credit card withdrawal
          "VKLAD": credit in cash
          "PREVOD Z UCTU" :collection from another bank
          "VYBER": withdrawal in cash
          "PREVOD NA UCET": remittance to another bank"
        expr: OPERATION
        data_type: VARCHAR(16777216)
      - name: TRANS_ID
        synonyms:
          - transaction id
        description: transaction id
        expr: TRANS_ID
        data_type: NUMBER(38,0)
      - name: TYPE
        synonyms:
          - +/- transaction
        description: |-
          +/- transaction
          ""PRIJEM" stands for credit
          "VYDAJ" stands for withdrawal"
        expr: '"TYPE"'
        data_type: VARCHAR(16777216)
    facts:
      - name: AMOUNT
        synonyms:
          - amount of money
        description: |-
          amount of money
          Unit:USD
        expr: AMOUNT
        data_type: NUMBER(38,0)
        access_modifier: public_access
    primary_key:
      columns:
        - TRANS_ID
  - name: _ACCOUNT
    base_table:
      database: SAMPLE_DATA
      schema: PUBLIC
      table: ACCOUNT
    dimensions:
      - name: ACCOUNT_ID
        synonyms:
          - account id
        description: the id of the account
        expr: ACCOUNT_ID
        data_type: NUMBER(38,0)
      - name: DATE
        synonyms:
          - date
        description: |-
          the creation date of the account
          in the form YYMMDD
        expr: DATE
        data_type: VARCHAR(16777216)
      - name: DISTRICT_ID
        synonyms:
          - location of branch
        description: location of branch
        expr: DISTRICT_ID
        data_type: NUMBER(38,0)
      - name: FREQUENCY
        synonyms:
          - frequency
        description: |-
          frequency of the acount
          ""POPLATEK MESICNE" stands for monthly issuance
          "POPLATEK TYDNE" stands for weekly issuance
          "POPLATEK PO OBRATU" stands for issuance after transaction"
        expr: FREQUENCY
        data_type: VARCHAR(16777216)
    primary_key:
      columns:
        - ACCOUNT_ID
  - name: _ORDER
    base_table:
      database: SAMPLE_DATA
      schema: PUBLIC
      table: '"ORDER"'
    dimensions:
      - name: ACCOUNT_ID
        description: id number of account
        expr: ACCOUNT_ID
        data_type: NUMBER(38,0)
      - name: ACCOUNT_TO
        synonyms:
          - account of the recipient
        description: |-
          account of the recipient
          each bank has unique two-letter code
        expr: ACCOUNT_TO
        data_type: NUMBER(38,0)
      - name: BANK_TO
        synonyms:
          - bank of the recipient
        description: bank of the recipient
        expr: BANK_TO
        data_type: VARCHAR(16777216)
      - name: K_SYMBOL
        synonyms:
          - characterization of the payment
        description: |-
          purpose of the payment
          ""POJISTNE" stands for insurance payment
          "SIPO" stands for household payment
          "LEASING" stands for leasing
          "UVER" stands for loan payment"
        expr: K_SYMBOL
        data_type: VARCHAR(16777216)
      - name: ORDER_ID
        description: identifying the unique order
        expr: ORDER_ID
        data_type: NUMBER(38,0)
    facts:
      - name: AMOUNT
        synonyms:
          - debited amount
        description: debited amount
        expr: AMOUNT
        data_type: FLOAT
        access_modifier: public_access
    primary_key:
      columns:
        - ORDER_ID
relationships:
  - name: R2
    left_table: CARD
    right_table: DISP
    relationship_columns:
      - left_column: DISP_ID
        right_column: DISP_ID
  - name: R3
    left_table: CLIENT
    right_table: DISTRICT
    relationship_columns:
      - left_column: DISTRICT_ID
        right_column: DISTRICT_ID
  - name: R4
    left_table: DISP
    right_table: CLIENT
    relationship_columns:
      - left_column: CLIENT_ID
        right_column: CLIENT_ID
  - name: R5
    left_table: DISP
    right_table: _ACCOUNT
    relationship_columns:
      - left_column: ACCOUNT_ID
        right_column: ACCOUNT_ID
  - name: R6
    left_table: LOAN
    right_table: _ACCOUNT
    relationship_columns:
      - left_column: ACCOUNT_ID
        right_column: ACCOUNT_ID
  - name: R8
    left_table: TRANS
    right_table: _ACCOUNT
    relationship_columns:
      - left_column: ACCOUNT_ID
        right_column: ACCOUNT_ID
  - name: R1
    left_table: _ACCOUNT
    right_table: DISTRICT
    relationship_columns:
      - left_column: DISTRICT_ID
        right_column: DISTRICT_ID
  - name: R7
    left_table: _ORDER
    right_table: _ACCOUNT
    relationship_columns:
      - left_column: ACCOUNT_ID
        right_column: ACCOUNT_ID​
create or replace semantic view SAMPLE_DATA.PUBLIC.E260128
	tables (
		SAMPLE_DATA.PUBLIC.CARD primary key (CARD_ID),
		SAMPLE_DATA.PUBLIC.CLIENT primary key (CLIENT_ID),
		SAMPLE_DATA.PUBLIC.DISP primary key (DISP_ID),
		SAMPLE_DATA.PUBLIC.DISTRICT primary key (DISTRICT_ID),
		SAMPLE_DATA.PUBLIC.LOAN primary key (LOAN_ID),
		SAMPLE_DATA.PUBLIC.TRANS primary key (TRANS_ID),
		_ACCOUNT as SAMPLE_DATA.PUBLIC.ACCOUNT primary key (ACCOUNT_ID),
		_ORDER as SAMPLE_DATA.PUBLIC."ORDER" primary key (ORDER_ID)
	)
	relationships (
		R2 as CARD(DISP_ID) references DISP(DISP_ID),
		R3 as CLIENT(DISTRICT_ID) references DISTRICT(DISTRICT_ID),
		R4 as DISP(CLIENT_ID) references CLIENT(CLIENT_ID),
		R5 as DISP(ACCOUNT_ID) references _ACCOUNT(ACCOUNT_ID),
		R6 as LOAN(ACCOUNT_ID) references _ACCOUNT(ACCOUNT_ID),
		R8 as TRANS(ACCOUNT_ID) references _ACCOUNT(ACCOUNT_ID),
		R1 as _ACCOUNT(DISTRICT_ID) references DISTRICT(DISTRICT_ID),
		R7 as _ORDER(ACCOUNT_ID) references _ACCOUNT(ACCOUNT_ID)
	)
	facts (
		DISTRICT.A10 as A10 with synonyms=('ratio of urban inhabitants') comment='ratio of urban inhabitants',
		DISTRICT.A12 as A12 with synonyms=('unemployment rate 1995') comment='unemployment rate 1995',
		DISTRICT.A13 as A13 with synonyms=('unemployment rate 1996') comment='unemployment rate 1996',
		DISTRICT.A15 as A15 with synonyms=('no. of committed crimes 1995') comment='no. of committed crimes 1995',
		LOAN.PAYMENTS as PAYMENTS with synonyms=('monthly payments') comment='monthly payments
unit:month',
		TRANS.AMOUNT as AMOUNT with synonyms=('amount of money') comment='amount of money
Unit:USD',
		_ORDER.AMOUNT as AMOUNT with synonyms=('debited amount') comment='debited amount'
	)
	dimensions (
		CARD.CARD_ID as CARD_ID with synonyms=('credit card id') comment='id number of credit card',
		CARD.DISP_ID as DISP_ID with synonyms=('disposition id') comment='disposition id',
		CARD.ISSUED as ISSUED comment='the date when the credit card issued 
in the form YYMMDD',
		CARD.TYPE as "TYPE" comment='type of credit card
""junior": junior class of credit card; 
"classic": standard class of credit card; 
"gold": high-level credit card"',
		CLIENT.BIRTH_DATE as BIRTH_DATE comment='birth date',
		CLIENT.CLIENT_ID as CLIENT_ID comment='the unique number',
		CLIENT.DISTRICT_ID as DISTRICT_ID with synonyms=('location of branch') comment='location of branch',
		CLIENT.GENDER as GENDER comment='"F:female 
M:male "',
		DISP.ACCOUNT_ID as ACCOUNT_ID comment='id number of account',
		DISP.CLIENT_ID as CLIENT_ID comment='id number of client',
		DISP.DISP_ID as DISP_ID with synonyms=('disposition id') comment='unique number of identifying this row of record',
		DISP.TYPE as "TYPE" comment='type of disposition
"OWNER" : "USER" : "DISPONENT"
commonsense evidence:
the account can only have the right to issue permanent orders or apply for loans',
		DISTRICT.A11 as A11 with synonyms=('average salary') comment='average salary',
		DISTRICT.A14 as A14 with synonyms=('no. of entrepreneurs per 1000 inhabitants') comment='no. of entrepreneurs per 1000 inhabitants',
		DISTRICT.A16 as A16 with synonyms=('no. of entrepreneurs per 1000 inhabitants') comment='no. of entrepreneurs per 1000 inhabitants',
		DISTRICT.A2 as A2 with synonyms=('district_name') comment='district_name',
		DISTRICT.A3 as A3 with synonyms=('region') comment='region',
		DISTRICT.A4 as A4 with synonyms=('number of inhabitants'),
		DISTRICT.A5 as A5 with synonyms=('no. of municipalities with inhabitants < 499') comment='municipality < district < region',
		DISTRICT.A6 as A6 with synonyms=('no. of municipalities with inhabitants 500-1999') comment='municipality < district < region',
		DISTRICT.A7 as A7 with synonyms=('no. of municipalities with inhabitants 2000-9999') comment='municipality < district < region',
		DISTRICT.A8 as A8 with synonyms=('no. of municipalities with inhabitants > 10000') comment='municipality < district < region',
		DISTRICT.A9 as A9 comment='not useful',
		DISTRICT.DISTRICT_ID as DISTRICT_ID with synonyms=('location of branch') comment='location of branch',
		LOAN.ACCOUNT_ID as ACCOUNT_ID comment='the id number identifying the account',
		LOAN.AMOUNT as AMOUNT comment='approved amount
unit:US dollar',
		LOAN.DATE as DATE comment='the date when the loan is approved',
		LOAN.DURATION as DURATION comment='loan duration
unit:month',
		LOAN.LOAN_ID as LOAN_ID comment='the id number identifying the loan data',
		LOAN.STATUS as STATUS comment='repayment status
"''A'' stands for contract finished, no problems;
''B'' stands for contract finished, loan not paid;
''C'' stands for running contract, OK so far;
''D'' stands for running contract, client in debt"',
		TRANS._ACCOUNT as "ACCOUNT" with synonyms=('account of the partner'),
		TRANS.ACCOUNT_ID as ACCOUNT_ID,
		TRANS.BALANCE as BALANCE with synonyms=('balance after transaction') comment='balance after transaction
Unit:USD',
		TRANS.BANK as BANK with synonyms=('Unit:USD') comment='each bank has unique two-letter code',
		TRANS.DATE as DATE with synonyms=('date of transaction') comment='date of transaction',
		TRANS.K_SYMBOL as K_SYMBOL with synonyms=('characterization of the transaction') comment='"POJISTNE": stands for insurrance payment
"SLUZBY": stands for payment for statement
"UROK": stands for interest credited
"SANKC. UROK": sanction interest if negative balance
"SIPO": stands for household
"DUCHOD": stands for old-age pension
"UVER": stands for loan payment',
		TRANS.OPERATION as OPERATION with synonyms=('mode of transaction') comment='mode of transaction
""VYBER KARTOU": credit card withdrawal
"VKLAD": credit in cash
"PREVOD Z UCTU" :collection from another bank
"VYBER": withdrawal in cash
"PREVOD NA UCET": remittance to another bank"',
		TRANS.TRANS_ID as TRANS_ID with synonyms=('transaction id') comment='transaction id',
		TRANS.TYPE as "TYPE" with synonyms=('+/- transaction') comment='+/- transaction
""PRIJEM" stands for credit
"VYDAJ" stands for withdrawal"',
		_ACCOUNT.ACCOUNT_ID as ACCOUNT_ID with synonyms=('account id') comment='the id of the account',
		_ACCOUNT.DATE as DATE with synonyms=('date') comment='the creation date of the account
in the form YYMMDD',
		_ACCOUNT.DISTRICT_ID as DISTRICT_ID with synonyms=('location of branch') comment='location of branch',
		_ACCOUNT.FREQUENCY as FREQUENCY with synonyms=('frequency') comment='frequency of the acount
""POPLATEK MESICNE" stands for monthly issuance
"POPLATEK TYDNE" stands for weekly issuance
"POPLATEK PO OBRATU" stands for issuance after transaction"',
		_ORDER.ACCOUNT_ID as ACCOUNT_ID comment='id number of account',
		_ORDER.ACCOUNT_TO as ACCOUNT_TO with synonyms=('account of the recipient') comment='account of the recipient
each bank has unique two-letter code',
		_ORDER.BANK_TO as BANK_TO with synonyms=('bank of the recipient') comment='bank of the recipient',
		_ORDER.K_SYMBOL as K_SYMBOL with synonyms=('characterization of the payment') comment='purpose of the payment
""POJISTNE" stands for insurance payment
"SIPO" stands for household payment
"LEASING" stands for leasing
"UVER" stands for loan payment"',
		_ORDER.ORDER_ID as ORDER_ID comment='identifying the unique order'
	)
	with extension (CA='{"tables":[{"name":"CARD","dimensions":[{"name":"CARD_ID"},{"name":"DISP_ID"},{"name":"ISSUED"},{"name":"TYPE"}]},{"name":"CLIENT","dimensions":[{"name":"BIRTH_DATE"},{"name":"CLIENT_ID"},{"name":"DISTRICT_ID"},{"name":"GENDER"}]},{"name":"DISP","dimensions":[{"name":"ACCOUNT_ID"},{"name":"CLIENT_ID"},{"name":"DISP_ID"},{"name":"TYPE"}]},{"name":"DISTRICT","dimensions":[{"name":"A11"},{"name":"A14"},{"name":"A16"},{"name":"A2"},{"name":"A3"},{"name":"A4"},{"name":"A5"},{"name":"A6"},{"name":"A7"},{"name":"A8"},{"name":"A9"},{"name":"DISTRICT_ID"}],"facts":[{"name":"A10"},{"name":"A12"},{"name":"A13"},{"name":"A15"}]},{"name":"LOAN","dimensions":[{"name":"ACCOUNT_ID"},{"name":"AMOUNT"},{"name":"DATE"},{"name":"DURATION"},{"name":"LOAN_ID"},{"name":"STATUS"}],"facts":[{"name":"PAYMENTS"}]},{"name":"TRANS","dimensions":[{"name":"_ACCOUNT"},{"name":"ACCOUNT_ID"},{"name":"BALANCE"},{"name":"BANK"},{"name":"DATE"},{"name":"K_SYMBOL"},{"name":"OPERATION"},{"name":"TRANS_ID"},{"name":"TYPE"}],"facts":[{"name":"AMOUNT"}]},{"name":"_ACCOUNT","dimensions":[{"name":"ACCOUNT_ID"},{"name":"DATE"},{"name":"DISTRICT_ID"},{"name":"FREQUENCY"}]},{"name":"_ORDER","dimensions":[{"name":"ACCOUNT_ID"},{"name":"ACCOUNT_TO"},{"name":"BANK_TO"},{"name":"K_SYMBOL"},{"name":"ORDER_ID"}],"facts":[{"name":"AMOUNT"}]}],"relationships":[{"name":"R2"},{"name":"R3"},{"name":"R4"},{"name":"R5"},{"name":"R6"},{"name":"R8"},{"name":"R1"},{"name":"R7"}]}');​

 

2-1. (option) REST API 발급받기 전에 NETWORK POLICY 임시 설정해줘야 401 에러 안남

  • US region 을 사용하면 디폴트로 NETWORK POLICY에 의해 허용된 IP 제외하고 모든 접근이 막혀있음. DEFAULT 설정 삭제가 불가능해서 임시로 사용하는 IP 설정해준다음 다 사용하면 해지해주어야함
    (해지 깜박하면 설정해준 IP 를 제외한 다른 아이피로 로그인도 안됨)
 
-- 현재 사용하는 IP 허용
CREATE NETWORK POLICY TEMP_POLICY
  ALLOWED_IP_LIST = ('147.6.93.99');
ALTER USER YOONSEO SET NETWORK_POLICY = TEMP_POLICY;
-- 위에서 해준 셋팅 해제
SHOW PARAMETERS LIKE 'NETWORK_POLICY' IN ACCOUNT;
ALTER USER YOONSEO UNSET NETWORK_POLICY;
DROP NETWORK POLICY TEMP_POLICY;

3.Cortex Analyst Rest API 가져와서 SQL 결과 나오는지 확인

  • 토큰 발급받는법 : 프로필 클릭 - [setting] - [Authentication] - [Programmatic access tokens] - [Generate New Token] 후 복사
  • 내account 확인 : 프로필 클릭 - [account] - [view account detail] - Account identifier 복사

Cortex Analyst Rest API 확인용

더보기
import requests

URL = "https://[내account].snowflakecomputing.com/api/v2/cortex/analyst/message"
TOKEN = "[발급받은내토큰]"

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "-- External Knowledge:\n"
                        "-- A3 contains the data of region\n"
                        "-- 'POPLATEK PO OBRATU' represents 'issuance after transaction'\n\n"
                        "-- Using valid SQLite and understading External Knowledge, answer the following questions for the tables provided above."
                        "-- Question:\n"
                        "-- How many accounts that choose issuance after transaction are staying in the East Bohemia region?"
                    )
                }
            ]
        }
    ],
    "semantic_view": "SAMPLE_DATA.PUBLIC.E260128"
}

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}

res = requests.post(URL, headers=headers, json=payload)

print("status:", res.status_code)
print("text:", res.text)
 
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "which company had the most revenue?"
        }
      ]
    }
  ],
  "semantic_view": "MY_DB.MY_SCH.MY_SEMANTIC_VIEW"
}

4. Cortex Anaylst로 답변(predict_dev.json) 생성 :

4.1 파일 구성

  • 필요한 데이터는 아래 3개로 dev.json / dev_gold.sql / predict_dev.json 이다.
bird/llm
├── src/
│   └── evaluation.py
├── data/
│   ├── financila_only.json (문제 JSON 파일,json 파일이기만 하면 됨)
│   ├── dev_gold.sql        (정답만 모아져있는 SQL 파일)
├── exp_result/turbo_output/
    └── predict_dev.json    (Cortex Analyst로 얻은답변 SQL 파일)
 
  • dev.json 파일에서 Financial 데이터만 필터링 (106개남음)
import json
with open("./dev_20240627/dev.json", "r", encoding="utf-8") as f:
    data = json.load(f)   # ← 여기서 이제 list of dict
filtered = [item for item in data if item.get("db_id") == "financial"]
with open("financial_only.json", "w", encoding="utf-8") as f:
    json.dump(filtered, f, ensure_ascii=False, indent=2)
 
  • snowflake 에 날려서 받은 답변 SQL은 predict_dev.json로 저장

Financial 만 sql 결과 json으로 저장

더보기
import requests
import json
import os

# =========================
# CONFIG
# =========================
URL = "https://LXMVPPL-KB09931.snowflakecomputing.com/api/v2/cortex/analyst/message"
TOKEN = "eyJraWQiOiIyOTc3ODcyNjI2MjM4MzgiLCJhbGciOiJFUzI1NiJ9.eyJwIjoiMTc3NDk1MDg6MTc3NDk2MzYiLCJpc3MiOiJTRjoyMDEyIiwiZXhwIjoxODAxMTUzNTAzfQ.4youa0W2rDUOF8ZrGp-qnrHV2481MO2llsTNwYj4WtiUTlEnF9s9wO_chSCuuEU85YRyOHti38G7FLeiSfYuAg"

SEMANTIC_VIEW = "SAMPLE_DATA.PUBLIC.E260128"
INPUT_FILE = "financial_only.json"
OUTPUT_FILE = "/Users/yoonseo/Downloads/output.json"

DB_NAME = "financial"
DELIMITER = "\t----- bird -----\t"

output_json = {}

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}

with open(INPUT_FILE, "r", encoding="utf-8") as f:
    input_data = json.load(f)

for item in input_data:
    qid = str(item["question_id"])
    question = item["question"]
    evidence = item["evidence"]

    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "-- External Knowledge:\n"
                            f"-- {evidence}\n\n"
                            "-- Using valid Table and understanding External Knowledge,The SQL query will be executed locally using SQLite (not Snowflake); use only SQLite-compatible syntax and reference tables directly (e.g., FROM A).\n"
                            f"-- {question}"
                        )
                    }
                ]
            }
        ],
        "semantic_view": SEMANTIC_VIEW
    }

    # API CALL
    res = requests.post(URL, headers=headers, json=payload)

    sql_stmt = ""

    if res.headers.get("Content-Type", "").startswith("application/json"):
        data = res.json()
        for c in data.get("message", {}).get("content", []):
            if c.get("type") == "sql":
                sql_stmt = c.get("statement", "").strip()
                break

    sql_stmt = f"{sql_stmt}{DELIMITER}{DB_NAME}"
    output_json[qid] = sql_stmt

with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    json.dump(output_json, f, indent=2, ensure_ascii=False)

print("✅ output.json 저장 완료")
print("📄 경로:", os.path.abspath(OUTPUT_FILE))​
  • financial 에 해당하는 답변만 필터링해서 dev_gold.sql 필터링해서 저장하기 (4.2 참조)

4.2 프롬프트 구성

  1. 원래 Bird Bench 에서 물어보는 프롬프트구성
    • bird/llm/src/gpt_request.py 참조
def generate_comment_prompt(question, knowledge=None):
    pattern_prompt_no_kg = "-- Using valid SQLite, answer the following questions for the tables provided above."
    pattern_prompt_kg = "-- Using valid SQLite and understading External Knowledge, answer the following questions for the tables provided above."
    # question_prompt = "-- {}".format(question) + '\n SELECT '
    question_prompt = "-- {}".format(question)
    knowledge_prompt = "-- External Knowledge: {}".format(knowledge)
    if not knowledge_prompt:
        result_prompt = pattern_prompt_no_kg + '\n' + question_prompt
    else:
        result_prompt = knowledge_prompt + '\n' + pattern_prompt_kg + '\n' + question_prompt

    return result_prompt
-- External Knowledge: {evidence}
-- Using valid SQLite and understading External Knowledge, answer the following questions for the tables provided above.
-- {question}

5.Evaluation

 

 

import subprocess

cmd = [
    "python3", "-u", "./DAMO-ConvAI-main/bird/llm/src/evaluation.py",
    "--db_root_path", "./dev_20240627/dev_databases/financial",         # 데이터 파일경로
    "--predicted_sql_path", "./DAMO-ConvAI-main/bird/llm/exp_result/turbo_output/",
    "--data_mode", "dev",
    "--ground_truth_path", "./DAMO-ConvAI-main/bird/llm/data/",
    "--num_cpus", "16",
    "--mode_gt", "gt",
    "--mode_predict", "gpt",
    "--diff_json_path", "./financial_only.json",    # financial만 뽑아낸 문제경로
    "--meta_time_out", "30.0"
]
subprocess.run(cmd)
start calculate
                     simple               moderate             challenging          total               
count                62                   37                   7                    106                 
======================================    ACCURACY    =====================================
accuracy             0.00                 0.00                 0.00                 0.00                
===========================================================================================
Finished evaluation

 

5.1 프롬프트를 여러번 수정해도 SNOWFLAKE SEMENTIC VIEW 이름을 포함해서 결과를 뱉어내기 때문에, 벤치결과0.0으로 나옴 → 수기비교

5.1.1 프롬프트 수정 방식

  • SQLite 아니고 Parquet 변환되어 테이블형태로 올라가있기 때문에 질문 수정 SQLite → Table
  • snowflake 기준으로 sql문을 생성하면 bird bench 실행시 못읽어와서 로컬 테이블 기준으로 읽어내라라는 프롬프트 추가해서 실행
-- IMPORTANT RULES:
-- The SQL query must be written for local execution, NOT for Snowflake.
-- Use local table references only (e.g., FROM DISTRICT, NOT FROM SAMPLE_DATA.PUBLIC.DISTRICT).
-- Table names must NOT contain underscores (_).
-- External Knowledge: {evidence}
-- Using valid Table and understading External Knowledge, answer the following questions for the tables provided above.
-- The SQL query will be executed locally using SQLite (not Snowflake); use only SQLite-compatible syntax and reference tables directly (e.g., FROM A).
-- When generating the SQL query, table names after FROM must not contain underscores (_).
-- {question}

5.1.1 원인파악 및 수기확인

  • 90번 문제 정답지
SELECT COUNT(T1.account_id) 
FROM account AS T1 
INNER JOIN loan AS T2 ON T1.account_id = T2.account_id 
INNER JOIN district AS T3 ON T1.district_id = T3.district_id 
WHERE T3.A3 = 'Prague'",
  • 90번 문제 실제 SQL 생성결과
WITH unemployment_rates AS (
  SELECT 
    AVG(a12) AS avg_unemployment_1995,
    AVG(a13) AS avg_unemployment_1996
  FROM SEMANTIC_VIEW(    -- 이부분이 SNOWFLAKE 기준으로 작성됨
    SAMPLE_DATA.PUBLIC.E260128
    FACTS district.a12, district.a13
  )
)
SELECT 
  avg_unemployment_1995,
  avg_unemployment_1996,
  CASE 
    WHEN avg_unemployment_1995 > avg_unemployment_1996 THEN '1995 has higher unemployment rate'
    WHEN avg_unemployment_1996 > avg_unemployment_1995 THEN '1996 has higher unemployment rate'
    ELSE 'Both years have equal unemployment rates'
  END AS comparison_result
FROM unemployment_rates
  • 벤치마크 비교를 위해서는 이렇게 결과가 나와야함

FROM account AS T1

  • snowflake 에서 아래처럼 시멘틱뷰 이름이 꼭 들어가기 때문에 0.0점이 나오는것으로 파악

FROM SEMANTIC_VIEW(
SAMPLE_DATA.PUBLIC.E260128 – – 시멘틱뷰 이름
FACTS district.a12, district.a13)

5.2 수기로 결과비교 확인 정답률 (Finance에 해당하는 106개 질의)

 
benchmark용 답지 json 파일 생성
  • 답변여부
    • 답변X : 38/106 (SQL생성X)
    • 답변0 : 68/106
      • 정답 : 37/68
      • 오답 : 31/68
  • 정답률 37/106
  • simple/moderate/challenging 난이도별 정답률
    • simple 29/62
    • moderate 8/37
    • challenging 0/7

5.3 사용한 모델

  • 모델은 랜덤배정
  • 결과에서는 claude-4-sonnet 모델을 사용함
 

 

반응형

'<개념> > 기타' 카테고리의 다른 글

SNOWFLAKE 기본사용법  (0) 2026.04.07
BIRD 벤치마크 Databricks 시도  (0) 2026.04.07
벡터 유사도 (vector similiarity)  (0) 2022.06.05
FPGA, NPU 초간단 개념  (0) 2022.02.28
푸리에 변환 초간단 정리  (0) 2022.01.30