Scenario:
Let's suppose we have a tab-delimited text file from which we need to create a Dataframe, and then insert the records from this Dataframe into a SQL Server table.
In Python, we can create Dataframes from files using the 'pandas' package, and we can insert the data from a Dataframe into SQL Server using the 'pyodbc' package.
Input Data :
The sample input file (a tab-delimited text file) has 20 records, with the columns "Order_Id", "Order_Date", "Prod_Id", "Prod_Name" and "Units_Sold".
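For reference, the first few lines of the file look like this (reconstructed from the head() output shown later; the fields are separated by tab characters):
Order_Id	Order_Date	Prod_Id	Prod_Name	Units_Sold
123456812	2014-01-05	1235	DeskTop	32
123456822	2014-01-14	1238	Switch Board	36
123456806	2014-01-17	1235	DeskTop	11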
# ------------------------------------------------------------------------------------------------------------------------ #
# Packages used in the following:
import os
import sys
import csv
import pandas as pd
import pyodbc
print("Python Version: ", sys.version)
'''
# Quick Notes about related Packages:
# import os:
The os module in Python provides functions for interacting with the operating system.
It comes under Python's standard utility modules and provides a portable way of using operating-system-dependent functionality.
# import sys:
sys is a built-in Python module that contains parameters and functions specific to the system and the interpreter.
For example, sys.path is a list of directories that the interpreter searches when looking for a required module.
# import pandas:
pandas is one of the most powerful open-source libraries in the Python programming language; it is used for data analysis and data manipulation.
# import sqlalchemy:
SQLAlchemy is a Python SQL toolkit and Object Relational Mapper that is used to access databases via SQL.
# import pyodbc:
pyodbc is an open-source Python module that makes accessing ODBC databases simple.
'''
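Side note: although this post uses pyodbc directly, the SQLAlchemy entry above is relevant because pandas' DataFrame.to_sql() needs a SQLAlchemy engine when writing to SQL Server. A minimal sketch, assuming the df1 Dataframe, server, and database names that appear later in this post (the exact connection-URL parameters may need adjusting for your environment):
# from sqlalchemy import create_engine
# engine = create_engine('mssql+pyodbc://TPREDDY-PC/db_Analytics'
#                        '?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes')
# df1.to_sql('tbl_Sample_Products', engine, schema='dbo', if_exists='append', index=False)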
# ------------------------------------------------------------------------------------------------------------------------ #
# Defining input Folder and File path:
folder_path = 'C:\\Users\\write\\PycharmProjects\\pythonProject\\PyLab1\\File_Operations\\Input\\'
txtFile_name = "tdl_Products_Sample.txt"
txtFile = folder_path + txtFile_name
# Current working Text File name
print('The Text File name:\n', os.path.basename(txtFile))
# Current working Text File directory
print('The Text File Directory name:\n', os.path.dirname(txtFile))
# Current working Text File full path
print('The full path of the Text File:\n', os.path.abspath(txtFile))
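A side note: string concatenation works here, but os.path.join() is the more portable way to build the path. A minimal sketch using the same variables as above:
# txtFile = os.path.join(folder_path, txtFile_name)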
# ------------------------------------------------------------------------------------------------------------------------ #
# Creating the Dataframe from a Text file (tab delimited):
# pd.read_csv() already returns a DataFrame, so there is no need to wrap the result in pd.DataFrame()
df1 = pd.read_csv(txtFile, sep='\t')
# Analyze the input data from the Dataframe:
# Print the First 5 rows from a Dataframe
print('The First 5 rows from the Dataframe:\n', df1.head(5))
# Result:
The First 5 rows from the Dataframe:
Order_Id Order_Date Prod_Id Prod_Name Units_Sold
0 123456812 2014-01-05 1235 DeskTop 32
1 123456822 2014-01-14 1238 Switch Board 36
2 123456806 2014-01-17 1235 DeskTop 11
3 123456816 2014-01-24 1234 Router 27
4 123456813 2014-02-03 1235 DeskTop 20
# Print the Bottom 5 rows from a Dataframe
print('The Bottom 5 rows from the Dataframe:\n', df1.tail(5))
# Result:
The Bottom 5 rows from the Dataframe:
Order_Id Order_Date Prod_Id Prod_Name Units_Sold
15 123456823 2014-04-26 1235 DeskTop 32
16 123456804 2014-05-04 1237 Laser Printer 20
17 123456810 2014-05-12 1237 Laser Printer 15
18 123456826 NaN 1234 Router 37
19 123456820 2014-05-26 1235 DeskTop 42
# Print the First 5 rows from a Column of Dataframe
print('The First 5 rows from Order_Date Column:\n', df1['Order_Date'].iloc[:5])
# Result:
The First 5 rows from Order_Date Column:
0 2014-01-05
1 2014-01-14
2 2014-01-17
3 2014-01-24
4 2014-02-03
Name: Order_Date, dtype: object
# Creating Subset of a Dataframe with specific columns of existing Dataframe
req_cols = ['Order_Id', 'Prod_Name']
df2 = df1[req_cols]
print('The Subset of the Dataframe:\n', df2.head())
# Result:
The Subset of the Dataframe:
Order_Id Prod_Name
0 123456812 DeskTop
1 123456822 Switch Board
2 123456806 DeskTop
3 123456816 Router
4 123456813 DeskTop
# Print the original Data type of each Column of Dataframe
print('The Datatype of each column of Dataframe before update:\n', df1.dtypes)
# Result:
The Datatype of each column of Dataframe before update:
Order_Id int64
Order_Date object
Prod_Id int64
Prod_Name object
Units_Sold int64
dtype: object
# Cleansing the input data in the Dataframe:
From the above data analysis, we can see that the columns Order_Date and Prod_Name have the datatype 'object', which SQL Server does not understand while inserting.
Now we need to update (cleanse) the datatypes of these columns so that they make sense to SQL Server.
In the following, I am converting 'Order_Date' to a string to avoid date-formatting issues, and then converting 'Prod_Name' to a string as well.
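Before replacing anything, a quick check confirms where the nulls are (the NaN visible at row 18 of the tail() output above should show up under Order_Date):
# Count null values per column
print('Null count per column:\n', df1.isna().sum())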
# Replacing null (NaN) values in Order_Date with a '9999-01-01' placeholder
# (assigning the result back, since column-level inplace=True is deprecated in recent pandas versions)
df1['Order_Date'] = df1['Order_Date'].fillna("9999-01-01")
# Converting Column datatypes
# df1['Order_Date'] = pd.to_datetime(df1['Order_Date'],format='%Y-%m-%d')
df1['Order_Date'] = df1['Order_Date'].convert_dtypes(convert_string=True)
df1['Prod_Name'] = df1['Prod_Name'].convert_dtypes(convert_string=True)
# Print the Data type of each Column of Dataframe after cleanup
print('The datatype of each column of Dataframe after update:\n', df1.dtypes)
# Result:
The datatype of each column of Dataframe after update:
Order_Id int64
Order_Date string
Prod_Id int64
Prod_Name string
Units_Sold int64
dtype: object
# Print the details of the Dataframe
print('The information of the Dataframe:')
df1.info()
# Result:
The information of the Dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order_Id 20 non-null int64
1 Order_Date 20 non-null string
2 Prod_Id 20 non-null int64
3 Prod_Name 20 non-null string
4 Units_Sold 20 non-null int64
dtypes: int64(3), string(2)
memory usage: 928.0 bytes
# ------------------------------------------------------------------------------------------------------------------------ #
# Define Connection to SQL Server:
sql_con = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};'
'Server=TPREDDY-PC;'
'Database=db_Analytics;'
'Trusted_Connection=yes;')
sql_cursor = sql_con.cursor()
# Defining create Table statement:
tbl_create = '''IF OBJECT_ID(N'[dbo].[tbl_Sample_Products]', N'U') IS NOT NULL
BEGIN
Print('Table already exists in the Database')
END
ELSE
BEGIN
CREATE TABLE [dbo].[tbl_Sample_Products](
[Order_Id] [float] NOT NULL PRIMARY KEY,
[Order_Date] [datetime],
[Prod_Id] [int] NULL,
[Prod_Name] [varchar](255) NULL,
[Units_Sold] [int] NULL
) ON [PRIMARY]
END
'''
# Executing create table statement:
sql_cursor.execute(tbl_create)
# ------------------------------------------------------------------------------------------------------------------------ #
# Defining the insert statement:
ins_rec = '''INSERT
INTO [dbo].[tbl_Sample_Products]
([Order_Id],[Order_Date], [Prod_Id], [Prod_Name],[Units_Sold])
VALUES(?, ?, ?, ?, ?)
'''
# Inserting the records from DataFrame to Table:
for row in df1.itertuples():
    sql_cursor.execute(ins_rec,
                       row.Order_Id,
                       row.Order_Date,
                       row.Prod_Id,
                       row.Prod_Name,
                       row.Units_Sold
                       )
# Committing the SQL Statements to Database:
sql_con.commit()
# Closing the SQL Cursor and the Connection
sql_cursor.close()
sql_con.close()
Notes:
On the first run, the code will create the table if it does not already exist in the database.
Please note that the above loop inserts the data record by record.
If you re-run the same code, it will append the data to the existing table, provided there is no primary key violation.
Inserting row by row may not be the best choice for loading the data in one go; a faster alternative is sketched below.
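As a faster alternative (a sketch, not part of the run above; it assumes the connection is still open and reuses the ins_rec statement and df1 Dataframe from earlier), pyodbc's executemany() with fast_executemany enabled sends the rows to SQL Server in batches instead of issuing one round trip per record:
# Bulk insert sketch: batch all Dataframe rows into a single executemany() call
sql_cursor = sql_con.cursor()
sql_cursor.fast_executemany = True
# itertuples(index=False, name=None) yields plain tuples in column order,
# matching the five placeholders of the ins_rec INSERT statement
rows = list(df1.itertuples(index=False, name=None))
sql_cursor.executemany(ins_rec, rows)
sql_con.commit()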
#--------------------------------------------------------------Thanks--------------------------------------------------------------#