Scenario:
Let's suppose we have a tab-delimited text file from which we need to create a Dataframe, and then insert the records from this Dataframe into a SQL Server table.
In Python, we can create Dataframes from files using the 'pandas' package, and we can insert the data from a Dataframe into SQL Server using the 'pyodbc' package.
Input Data :
The sample input file (a tab-delimited text file) has 20 records, with the columns "Order_Id", "Order_Date", "Prod_Id", "Prod_Name" and "Units_Sold".
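For reference, the first few lines of the file look like this (reconstructed from the head() output shown later; the fields are separated by tab characters):
Order_Id	Order_Date	Prod_Id	Prod_Name	Units_Sold
123456812	2014-01-05	1235	DeskTop	32
123456822	2014-01-14	1238	Switch Board	36
123456806	2014-01-17	1235	DeskTop	11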
# ------------------------------------------------------------------------------------------------------------------------ #
# Packages used in the following:
import os
import sys
import csv
import pandas as pd
import pyodbc
print("Python Version: ", sys.version)
'''
# Quick Notes about related Packages:
# import os:
The os module in Python provides functions for interacting with the operating system.
It comes under Python's standard utility modules and provides a portable way of using operating-system-dependent functionality.
# import sys:
sys is a built-in Python module that contains parameters and functions specific to the system and the interpreter.
For example, sys.path is a list of directories that the interpreter searches when looking for a required module.
# import pandas:
pandas is one of the most powerful open-source libraries in the Python programming language; it is used for data analysis and data manipulation.
# import sqlalchemy:
SQLAlchemy is a Python SQL toolkit and Object Relational Mapper that is used to access databases via SQL.
# import pyodbc:
pyodbc is an open-source Python module that makes accessing ODBC databases simple.
'''
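Side note: although this post uses pyodbc directly, the SQLAlchemy entry above is relevant because pandas' DataFrame.to_sql() needs a SQLAlchemy engine when writing to SQL Server. A minimal sketch, assuming the df1 Dataframe, server, and database names that appear later in this post (the exact connection-URL parameters may need adjusting for your environment):
# from sqlalchemy import create_engine
# engine = create_engine('mssql+pyodbc://TPREDDY-PC/db_Analytics'
#                        '?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes')
# df1.to_sql('tbl_Sample_Products', engine, schema='dbo', if_exists='append', index=False)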
# ------------------------------------------------------------------------------------------------------------------------ #
# Defining input Folder and File path:
folder_path = 'C:\\Users\\write\\PycharmProjects\\pythonProject\\PyLab1\\File_Operations\\Input\\'
txtFile_name = "tdl_Products_Sample.txt"
txtFile = folder_path + txtFile_name
# Current working Text File name
print('The Text File name:\n', os.path.basename(txtFile))
# Current working Text File directory
print('The Text File Directory name:\n', os.path.dirname(txtFile))
# Current working Text File full path
print('The full path of the Text File:\n', os.path.abspath(txtFile))
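A side note: string concatenation works here, but os.path.join() is the more portable way to build the path. A minimal sketch using the same variables as above:
# txtFile = os.path.join(folder_path, txtFile_name)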
# ------------------------------------------------------------------------------------------------------------------------ #
# Creating the Dataframe from a Text file (tab delimited):
# pd.read_csv() already returns a DataFrame, so there is no need to wrap the result in pd.DataFrame()
df1 = pd.read_csv(txtFile, sep='\t')
# Analyze the input data from the Dataframe:
# Print the First 5 rows from a Dataframe
print('The First 5 rows from the Dataframe:\n', df1.head(5))
# Result:
The First 5 rows from the Dataframe:
Order_Id Order_Date Prod_Id Prod_Name Units_Sold
0 123456812 2014-01-05 1235 DeskTop 32
1 123456822 2014-01-14 1238 Switch Board 36
2 123456806 2014-01-17 1235 DeskTop 11
3 123456816 2014-01-24 1234 Router 27
4 123456813 2014-02-03 1235 DeskTop 20
# Print the Bottom 5 rows from a Dataframe
print('The Bottom 5 rows from the Dataframe:\n', df1.tail(5))
# Result:
The Bottom 5 rows from the Dataframe:
Order_Id Order_Date Prod_Id Prod_Name Units_Sold
15 123456823 2014-04-26 1235 DeskTop 32
16 123456804 2014-05-04 1237 Laser Printer 20
17 123456810 2014-05-12 1237 Laser Printer 15
18 123456826 NaN 1234 Router 37
19 123456820 2014-05-26 1235 DeskTop 42
# Print the First 5 rows from a Column of Dataframe
print('The First 5 rows from Order_Date Column:\n', df1['Order_Date'].iloc[:5])
# Result:
The First 5 rows from Order_Date Column:
0 2014-01-05
1 2014-01-14
2 2014-01-17
3 2014-01-24
4 2014-02-03
Name: Order_Date, dtype: object
# Creating Subset of a Dataframe with specific columns of existing Dataframe
req_cols = ['Order_Id', 'Prod_Name']
df2 = df1[req_cols]
print('The Subset of the Dataframe:\n', df2.head())
# Result:
The Subset of the Dataframe:
Order_Id Prod_Name
0 123456812 DeskTop
1 123456822 Switch Board
2 123456806 DeskTop
3 123456816 Router
4 123456813 DeskTop
# Print the original Data type of each Column of Dataframe
print('The Datatype of each column of Dataframe before update:\n', df1.dtypes)
# Result:
The Datatype of each column of Dataframe before update:
Order_Id int64
Order_Date object
Prod_Id int64
Prod_Name object
Units_Sold int64
dtype: object
# Cleansing the input data in the Dataframe:
From the above data analysis, we can see that the columns Order_Date and Prod_Name have the datatype 'object', which SQL Server does not understand while inserting.
Now we need to update (cleanse) the datatypes of these columns so that they make sense to SQL Server.
In the following, I am converting 'Order_Date' to a string to avoid date-formatting issues, and then converting 'Prod_Name' to a string as well.
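Before replacing anything, a quick check confirms where the nulls are (the NaN visible at row 18 of the tail() output above should show up under Order_Date):
# Count null values per column
print('Null count per column:\n', df1.isna().sum())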
# Replacing null (NaN) values in Order_Date with a '9999-01-01' placeholder
# (assigning the result back, since column-level inplace=True is deprecated in recent pandas versions)
df1['Order_Date'] = df1['Order_Date'].fillna("9999-01-01")
# Converting Column datatypes
# df1['Order_Date'] = pd.to_datetime(df1['Order_Date'],format='%Y-%m-%d')
df1['Order_Date'] = df1['Order_Date'].convert_dtypes(convert_string=True)
df1['Prod_Name'] = df1['Prod_Name'].convert_dtypes(convert_string=True)
# Print the Data type of each Column of Dataframe after cleanup
print('The datatype of each column of Dataframe after update:\n', df1.dtypes)
# Result:
The datatype of each column of Dataframe after update:
Order_Id int64
Order_Date string
Prod_Id int64
Prod_Name string
Units_Sold int64
dtype: object
# Print the details of the Dataframe
print('The information of the Dataframe:')
df1.info()
# Result:
The information of the Dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order_Id 20 non-null int64
1 Order_Date 20 non-null string
2 Prod_Id 20 non-null int64
3 Prod_Name 20 non-null string
4 Units_Sold 20 non-null int64
dtypes: int64(3), string(2)
memory usage: 928.0 bytes
# ------------------------------------------------------------------------------------------------------------------------ #
# Define Connection to SQL Server:
sql_con = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};'
'Server=TPREDDY-PC;'
'Database=db_Analytics;'
'Trusted_Connection=yes;')
sql_cursor = sql_con.cursor()
# Defining create Table statement:
tbl_create = '''IF OBJECT_ID(N'[dbo].[tbl_Sample_Products]', N'U') IS NOT NULL
BEGIN
Print('Table already exists in the Database')
END
ELSE
BEGIN
CREATE TABLE [dbo].[tbl_Sample_Products](
[Order_Id] [float] NOT NULL PRIMARY KEY,
[Order_Date] [datetime],
[Prod_Id] [int] NULL,
[Prod_Name] [varchar](255) NULL,
[Units_Sold] [int] NULL
) ON [PRIMARY]
END
'''
# Executing create table statement:
sql_cursor.execute(tbl_create)
# ------------------------------------------------------------------------------------------------------------------------ #
# Defining the insert statement:
ins_rec = '''INSERT
INTO [dbo].[tbl_Sample_Products]
([Order_Id],[Order_Date], [Prod_Id], [Prod_Name],[Units_Sold])
VALUES(?, ?, ?, ?, ?)
'''
# Inserting the records from DataFrame to Table:
for row in df1.itertuples():
    sql_cursor.execute(ins_rec,
                       row.Order_Id,
                       row.Order_Date,
                       row.Prod_Id,
                       row.Prod_Name,
                       row.Units_Sold
                       )
# Committing the SQL Statements to Database:
sql_con.commit()
# Closing the SQL Cursor and the Connection
sql_cursor.close()
sql_con.close()
Notes:
On the first run, the code will create the table if it does not already exist in the database.
Please note that the above loop inserts the data record by record.
If you re-run the same code, it will append the data to the existing table, provided there is no primary key violation.
Inserting row by row may not be the best choice for loading the data in one go; a faster alternative is sketched below.
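As a faster alternative (a sketch, not part of the run above; it assumes the connection is still open and reuses the ins_rec statement and df1 Dataframe from earlier), pyodbc's executemany() with fast_executemany enabled sends the rows to SQL Server in batches instead of issuing one round trip per record:
# Bulk insert sketch: batch all Dataframe rows into a single executemany() call
sql_cursor = sql_con.cursor()
sql_cursor.fast_executemany = True
# itertuples(index=False, name=None) yields plain tuples in column order,
# matching the five placeholders of the ins_rec INSERT statement
rows = list(df1.itertuples(index=False, name=None))
sql_cursor.executemany(ins_rec, rows)
sql_con.commit()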
#--------------------------------------------------------------Thanks--------------------------------------------------------------#