Sunday, January 30, 2022

How to insert the DataFrame from a Text file into a SQL Server table using the sqlalchemy and pandas packages in Python

Scenario:
Let's suppose we have a tab-delimited text file from which we need to create a DataFrame, and then we need to insert the records from this DataFrame into a SQL Server table.

In Python, we can create DataFrames from files using the 'pandas' package, and we can insert the data from a DataFrame into SQL Server using the 'sqlalchemy' package.

Input Data:
The sample input file (a tab-delimited text file) has 20 records, with the columns "Order_Id", "Order_Date", "Prod_Id", "Prod_Name", and "Units_Sold".

Please note that the Order_Date column is a date column containing some nulls.
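For reference, the first few lines of such a tab-delimited file would look like the following (reconstructed from the DataFrame head() output shown later; the actual file contents may differ slightly):

Order_Id	Order_Date	Prod_Id	Prod_Name	Units_Sold
123456812	2014-01-05	1235	DeskTop	32
123456822	2014-01-14	1238	Switch Board	36
123456806	2014-01-17	1235	DeskTop	11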

Now we will discuss how to create a DataFrame, analyze and clean up the data, and finally load the data into a SQL Server table.
# ------------------------------------------------------------------------------------------------------------------------ #
# Packages used in the following:
import os
import sys
import urllib.parse
import pandas as pd
from sqlalchemy import create_engine, text, inspect
# Note: the 'pyodbc' package must also be installed, since the
# 'mssql+pyodbc' dialect used below relies on it as the DBAPI driver.

print("Python Version: ", sys.version)

# ------------------------------------------------------------------------------------------------------------------------ #
'''
# Quick Notes about related Packages:
# import os:
The os module provides functions for interacting with the operating system.
It is part of Python's standard library and offers a portable way of using
operating-system-dependent functionality.

# import sys:
The sys module is a built-in Python module that contains parameters and functions
specific to the interpreter and the system it runs on. For example, sys.path is a
list of directories that the interpreter searches when importing a module.

# import pandas:
The pandas package is one of the most powerful open-source libraries in the Python
ecosystem. It is used for data analysis and data manipulation.

# import sqlalchemy:
SQLAlchemy is a Python SQL toolkit and Object Relational Mapper used to access
databases via SQL.

# import pyodbc:
pyodbc is an open-source Python module that makes accessing ODBC databases simple.

'''
# ------------------------------------------------------------------------------------------------------------------------ #
# Defining input Folder and File path:
folder_path = 'C:\\Users\\write\\PycharmProjects\\pythonProject\\PyLab1\\File_Operations\\Input\\'
txtFile_name = "tdl_Products_Sample.txt"
txtFile = folder_path + txtFile_name

# Current working Text File name
print('The Text File name:\n', os.path.basename(txtFile))
# Current working Text File directory
print('The Text File Directory name:\n', os.path.dirname(txtFile))
# Current working Text File full path
print('The full path of the Text File:\n', os.path.abspath(txtFile))
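As a side note, os.path.join can build the same path without hard-coding the trailing separator; an equivalent, slightly more portable sketch (input_dir is just an illustrative name):

# Equivalent path construction using os.path.join (no trailing separator needed):
input_dir = 'C:\\Users\\write\\PycharmProjects\\pythonProject\\PyLab1\\File_Operations\\Input'
txtFile = os.path.join(input_dir, txtFile_name)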
# ------------------------------------------------------------------------------------------------------------------------ #

# Creating the DataFrame from a Text file (tab-delimited);
# pd.read_csv already returns a DataFrame, so no extra wrapping is needed:
df1 = pd.read_csv(txtFile, sep='\t')

# Analyze the input data from the Dataframe:
# Print the First 5 rows from a Dataframe
print('The First 5 rows from the Dataframe:\n', df1.head(5))

# Result:
The First 5 rows from the Dataframe:
     Order_Id  Order_Date  Prod_Id     Prod_Name  Units_Sold
0  123456812  2014-01-05     1235       DeskTop          32
1  123456822  2014-01-14     1238  Switch Board          36
2  123456806  2014-01-17     1235       DeskTop          11
3  123456816  2014-01-24     1234        Router          27
4  123456813  2014-02-03     1235       DeskTop          20 

# Print the Bottom 5 rows from a Dataframe
print('The Bottom 5 rows from the Dataframe:\n', df1.tail(5))

# Result:
The Bottom 5 rows from the Dataframe:
      Order_Id  Order_Date  Prod_Id      Prod_Name  Units_Sold
15  123456823  2014-04-26     1235        DeskTop          32
16  123456804  2014-05-04     1237  Laser Printer          20
17  123456810  2014-05-12     1237  Laser Printer          15
18  123456826         NaN     1234         Router          37
19  123456820  2014-05-26     1235        DeskTop          42

# Print the First 5 rows from a Column of Dataframe
print('The First 5 rows from Order_Date Column:\n', df1['Order_Date'].iloc[:5])

# Result:
The First 5 rows from Order_Date Column:
 0    2014-01-05
1    2014-01-14
2    2014-01-17
3    2014-01-24
4    2014-02-03
Name: Order_Date, dtype: object
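The 'object' dtype above means pandas has read Order_Date as plain strings. If you instead want pandas to parse it as a real date column at read time, read_csv accepts a parse_dates argument; a minimal, hedged sketch (df_dates is an illustrative name, and missing dates become NaT):

# Hedged alternative: parse Order_Date while reading the file
df_dates = pd.read_csv(txtFile, sep='\t', parse_dates=['Order_Date'])
print(df_dates['Order_Date'].dtype)  # datetime64[ns]; the missing dates show as NaT

In this post, however, we deliberately keep Order_Date as a string to sidestep date-formatting issues during the SQL insert, as discussed below.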

# Creating a Subset of the DataFrame with specific columns of the existing DataFrame
req_cols = ['Order_Id', 'Prod_Name']
df2 = df1[req_cols]
print('The Subset of the Dataframe:\n', df2.head())

# Result:
The Subset of the Dataframe:
     Order_Id     Prod_Name
0  123456812       DeskTop
1  123456822  Switch Board
2  123456806       DeskTop
3  123456816        Router
4  123456813       DeskTop

# Print the original Data type of each Column of Dataframe
print('The Datatype of each column of Dataframe before update:\n', df1.dtypes)

# Result:
The Datatype of each column of Dataframe before update:
 Order_Id       int64
Order_Date    object
Prod_Id        int64
Prod_Name     object
Units_Sold     int64
dtype: object

# Cleansing the input data in the Dataframe:
From the above analysis, we can see that the Order_Date and Prod_Name columns have
the datatype 'object', which does not map cleanly to a SQL Server type during the insert.

Now we need to update (cleanse) the datatypes of these columns so that they make sense to SQL.
In the following, I am converting 'Order_Date' to a string to avoid date-formatting issues,
and then converting 'Prod_Name' to a string as well.

# Replacing null (NaN) values in Order_Date with the placeholder '9999-01-01'
df1['Order_Date'] = df1['Order_Date'].fillna("9999-01-01")

# Converting Column datatypes 
# df1['Order_Date'] = pd.to_datetime(df1['Order_Date'],format='%Y-%m-%d')
df1['Order_Date'] = df1['Order_Date'].convert_dtypes(convert_string=True)
df1['Prod_Name'] = df1['Prod_Name'].convert_dtypes(convert_string=True)

# Print the Data type of each Column of Dataframe after cleanup
print('The datatype of each column of Dataframe after update:\n', df1.dtypes)

# Result:
The datatype of each column of Dataframe after update:
 Order_Id       int64
Order_Date    string
Prod_Id        int64
Prod_Name     string
Units_Sold     int64
dtype: object
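If you would rather load Order_Date into the table as a real date instead of a string, the commented-out pd.to_datetime line above works too, but note a caveat with the placeholder: pandas Timestamps cannot represent the year 9999 (the maximum is 2262-04-11), so that route should keep the nulls as NaT rather than '9999-01-01'. A minimal, hedged sketch:

# Hedged alternative: convert Order_Date to datetime64; errors='coerce'
# turns the out-of-range '9999-01-01' placeholder (and any bad value) into NaT,
# which to_sql later inserts as SQL NULL:
df1['Order_Date'] = pd.to_datetime(df1['Order_Date'], format='%Y-%m-%d', errors='coerce')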

# Print the details of the Dataframe
print('The information of the Dataframe:')
df1.info()

# Result:
The information of the Dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Order_Id    20 non-null     int64 
 1   Order_Date  20 non-null     string
 2   Prod_Id     20 non-null     int64 
 3   Prod_Name   20 non-null     string
 4   Units_Sold  20 non-null     int64 
dtypes: int64(3), string(2)
memory usage: 928.0 bytes
# ------------------------------------------------------------------------------------------------------------------------ #

# Define Connection to SQL Server:
# quote_plus URL-encodes the ODBC attributes (semicolons, braces, etc.)
# so they pass safely through the SQLAlchemy URL:
con_params = urllib.parse.quote_plus(
    r'DRIVER={ODBC Driver 17 for SQL Server};SERVER=TPREDDY-PC;DATABASE=db_Analytics;Trusted_Connection=yes')

conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(con_params)

# fast_executemany=True lets pyodbc send the inserts in batches:
sql_engine = create_engine(conn_str, fast_executemany=True)
sql_con = sql_engine.connect()

# Defining the table structure:
tbl_create = text("""CREATE TABLE [dbo].[tbl_Sample_Products](
                            [Order_Id] [float] NOT NULL PRIMARY KEY,
                            [Order_Date] [datetime],
                            [Prod_Id] [int] NULL,
                            [Prod_Name] [varchar](255) NULL,
                            [Units_Sold] [int] NULL
                            ) ON [PRIMARY]
                    """)

# Checking the Table existence and creating it only if missing:
if not inspect(sql_con).has_table("tbl_Sample_Products", schema="dbo"):
    sql_con.execute(tbl_create)
    # Note: under SQLAlchemy 2.x, connections no longer autocommit,
    # so a sql_con.commit() would be needed here after the DDL.
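Equivalently, SQLAlchemy's Table metadata can handle the existence check for you via create_all(checkfirst=True); a hedged sketch declaring the same five columns (the SQLAlchemy type choices here are assumptions mirroring the DDL above):

# Hedged alternative: let SQLAlchemy create the table only when it is missing
from sqlalchemy import MetaData, Table, Column, Integer, Float, String, DateTime
metadata = MetaData(schema='dbo')
Table('tbl_Sample_Products', metadata,
      Column('Order_Id', Float, primary_key=True),
      Column('Order_Date', DateTime),
      Column('Prod_Id', Integer),
      Column('Prod_Name', String(255)),
      Column('Units_Sold', Integer))
metadata.create_all(sql_engine, checkfirst=True)  # no-op if the table already exists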

# Insert DataFrame to Table:
df1.to_sql(name='tbl_Sample_Products', con=sql_con, schema="dbo", if_exists='append', index=False)
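to_sql also accepts an explicit dtype mapping and a chunksize for tighter control over the generated column types and batch sizes; a hedged variant of the same insert (the type choices and chunk size are illustrative):

# Hedged variant: explicit SQL types for the string columns, inserted in batches
from sqlalchemy.types import String
df1.to_sql(name='tbl_Sample_Products', con=sql_con, schema='dbo',
           if_exists='append', index=False, chunksize=1000,
           dtype={'Order_Date': String(10), 'Prod_Name': String(255)})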

# Closing the SQL Connections:
sql_con.close()
sql_engine.dispose()

# Result:
(The 20 records from the DataFrame are appended to the dbo.tbl_Sample_Products table.)

Notes:
On the first run, the code will create the table if it does not already exist in the database.
Please note that to_sql inserts the data via parameterized INSERT statements; with
fast_executemany=True set on the engine, pyodbc sends them in batches rather than
one round trip per record.
If you re-run the same code, it will append the data to the existing table, provided
there is no primary key violation.
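If the load needs to be safely re-runnable, one hedged option is to filter out Order_Ids that already exist in the table before appending (run this while the connection is still open; 'existing' and 'new_rows' are illustrative names):

# Hedged sketch: append only the rows whose Order_Id is not already in the table
existing = pd.read_sql(text('SELECT Order_Id FROM dbo.tbl_Sample_Products'), sql_con)
new_rows = df1[~df1['Order_Id'].isin(existing['Order_Id'])]
new_rows.to_sql(name='tbl_Sample_Products', con=sql_con, schema='dbo',
                if_exists='append', index=False)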

#--------------------------------------------------------------Thanks--------------------------------------------------------------#
