In this notebook, we'll take a dataset I built from a set of data objects using T-SQL in SQL Server and perform exploratory data analysis (EDA) on it with Python.
This dataset mimics a sales cycle set that you might build out of the objects in a sales CRM such as Salesforce. Rich data with real-world complexity is fairly difficult to come by, so this set is much simpler than what you would find in a company with years of CRM history, and there are some limits to what can be accomplished with it. In a real-world scenario we would have an array of data objects across multiple databases, allowing for extensive engineering and analysis, but we can still simulate some of those scenarios with our limited set.
Our first bit of business is to connect to our server to pull data from the database.
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine
import urllib.parse  # importing the submodule explicitly; 'import urllib' alone doesn't guarantee urllib.parse is available

# load the server and database names from our .env file
load_dotenv(dotenv_path=r"C:\Users\nate\OneDrive\Desktop\crm_sales\sqlserver_connector.env")
server = os.getenv('DB_SERVER')
database = os.getenv('DB_DATABASE')

# build the ODBC connection string, using Windows authentication
odbc_str = (
    f"DRIVER={{ODBC Driver 17 for SQL Server}};"
    f"SERVER={server};"
    f"DATABASE={database};"
    f"Trusted_Connection=yes;"
)
params = urllib.parse.quote_plus(odbc_str)  # URL-encode the connection string for the SQLAlchemy URL
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")
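With the engine in hand, pd.read_sql() will pull a query result straight into a DataFrame. The table name sales_cycle below is an assumption about how the T-SQL build named its output; to keep the sketch runnable without a SQL Server instance, it uses an in-memory SQLite engine in place of the pyodbc one above.

```python
import pandas as pd
from sqlalchemy import create_engine

# stand-in engine so the sketch runs anywhere; swap in the pyodbc engine above
engine = create_engine("sqlite:///:memory:")

# seed a tiny stand-in table (the real 'sales_cycle' table name is an assumption)
pd.DataFrame({
    "opportunity_id": ["019I751P", "04AOWMXV"],
    "sales_price": [1096.0, 550.0],
}).to_sql("sales_cycle", engine, index=False)

# read_sql() executes the query and returns the result set as a DataFrame
df = pd.read_sql("SELECT * FROM sales_cycle", engine)
print(df.shape)  # (2, 2)
```

Against the real engine, only the SELECT statement and engine variable would change.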
If, for any reason, you'd like to forgo connecting to a database and work with a CSV instead, below is the basic setup for that.
import pandas as pd # import the pandas module for loading our dataframe
file_path = 'C:/Users/nate/OneDrive/Desktop/crm_sales/sales_cycle.csv' # create a variable that is the path to our data
df = pd.read_csv(file_path) # create a dataframe and use the pandas read_csv() function to load the dataframe looking at the file_path variable
print(df.head()) # head() method for printing column names and first 5 rows
  opportunity_id    sales_agent         product series  sales_price  \
0       019I751P  Anna Snelling  GTX Plus Basic    GTX       1096.0
1       04AOWMXV  Anna Snelling       GTX Basic    GTX        550.0
2       05HLE3JL  Anna Snelling       GTX Basic    GTX        550.0
3       066V2ZGP  Anna Snelling     MG Advanced     MG       3393.0
4       07DPEL0B  Anna Snelling       GTX Basic    GTX        550.0

        account deal_stage engage_date  close_date  DaysToClose  ...  \
0  Statholdings        Won  2017-07-02  2017-07-09            7  ...
1       Codehow        Won  2017-12-11  2017-12-28           17  ...
2    Stanredtax        Won  2017-09-01  2017-12-03           93  ...
3      Rangreen        Won  2017-04-28  2017-08-14          108  ...
4       Cancity        Won  2017-09-13  2017-11-25           73  ...

   AgentWonOpps  AgentLostOpps  LostProductRevenue  TotalAgentWonOpps  \
0             1              0                 0.0                208
1             1              0                 0.0                208
2             1              0                 0.0                208
3             1              0                 0.0                208
4             1              0                 0.0                208

   TotalAgentLostOpps  AgentTotalClosedWonValue  AgentTotalLostRevenue  \
0                 128                  275056.0               229040.0
1                 128                  275056.0               229040.0
2                 128                  275056.0               229040.0
3                 128                  275056.0               229040.0
4                 128                  275056.0               229040.0

   AgentClosedWonRate  AgentClosedLostRate ClosedDealsStatus
0                0.62                 0.38      Majority Won
1                0.62                 0.38      Majority Won
2                0.62                 0.38      Majority Won
3                0.62                 0.38      Majority Won
4                0.62                 0.38      Majority Won

[5 rows x 22 columns]
Now that we have a dataframe loaded and have a snapshot of the data in front of us, we can start with some basic EDA.
print(df.info()) # info() method for printing field names and data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6711 entries, 0 to 6710
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   opportunity_id            6711 non-null   object
 1   sales_agent               6711 non-null   object
 2   product                   6711 non-null   object
 3   series                    5564 non-null   object
 4   sales_price               5564 non-null   float64
 5   account                   6711 non-null   object
 6   deal_stage                6711 non-null   object
 7   engage_date               6711 non-null   object
 8   close_date                6711 non-null   object
 9   DaysToClose               6711 non-null   int64
 10  close_value               6711 non-null   int64
 11  TotalAgentOpps            6711 non-null   int64
 12  AgentWonOpps              6711 non-null   int64
 13  AgentLostOpps             6711 non-null   int64
 14  LostProductRevenue        6293 non-null   float64
 15  TotalAgentWonOpps         6711 non-null   int64
 16  TotalAgentLostOpps        6711 non-null   int64
 17  AgentTotalClosedWonValue  6711 non-null   float64
 18  AgentTotalLostRevenue     6711 non-null   float64
 19  AgentClosedWonRate        6711 non-null   float64
 20  AgentClosedLostRate       6711 non-null   float64
 21  ClosedDealsStatus         6711 non-null   object
dtypes: float64(6), int64(7), object(9)
memory usage: 1.1+ MB
None
print(df.shape) # shape attribute printed, outputs a tuple, (6711, 22) for 6711 rows and 22 columns
(6711, 22)
print(df.describe()) # describe() method to do a statistical summary of the data in the dataframe
        sales_price  DaysToClose   close_value  TotalAgentOpps  AgentWonOpps  \
count   5564.000000  6711.000000   6711.000000     6711.000000   6711.000000
mean    1876.395219    47.985397   1490.915512      260.216808      0.631501
std     2504.431797    41.057665   2320.670773      107.795003      0.482434
min       55.000000     1.000000      0.000000       79.000000      0.000000
25%      550.000000     8.000000      0.000000      194.000000      0.000000
50%     1096.000000    45.000000    472.000000      237.000000      1.000000
75%     3393.000000    85.000000   3225.000000      275.000000      1.000000
max    26768.000000   138.000000  30288.000000      553.000000      1.000000

       AgentLostOpps  LostProductRevenue  TotalAgentWonOpps  \
count    6711.000000         6293.000000        6711.000000
mean        0.368499          624.708406         163.797199
std         0.482434         1704.981613          67.727821
min         0.000000            0.000000          55.000000
25%         0.000000            0.000000         122.000000
50%         0.000000            0.000000         155.000000
75%         1.000000          550.000000         176.000000
max         1.000000        26768.000000         349.000000

       TotalAgentLostOpps  AgentTotalClosedWonValue  AgentTotalLostRevenue  \
count         6711.000000              6.711000e+03            6711.000000
mean            96.419610              3.928543e+05          143461.835196
std             41.143883              2.484273e+05           49891.452651
min             24.000000              1.234310e+05           56424.000000
25%             71.000000              2.616310e+05           93569.000000
50%             93.000000              3.287920e+05          143648.000000
75%            117.000000              4.458600e+05          175064.000000
max            204.000000              1.153214e+06          229040.000000

       AgentClosedWonRate  AgentClosedLostRate
count         6711.000000          6711.000000
mean             0.631128             0.368872
std              0.034819             0.034819
min              0.550000             0.300000
25%              0.620000             0.350000
50%              0.630000             0.370000
75%              0.650000             0.380000
max              0.700000             0.450000
print(df.isnull().sum()) # isnull() method with the sum() function to check for null values in each column
opportunity_id                 0
sales_agent                    0
product                        0
series                      1147
sales_price                 1147
account                        0
deal_stage                     0
engage_date                    0
close_date                     0
DaysToClose                    0
close_value                    0
TotalAgentOpps                 0
AgentWonOpps                   0
AgentLostOpps                  0
LostProductRevenue           418
TotalAgentWonOpps              0
TotalAgentLostOpps             0
AgentTotalClosedWonValue       0
AgentTotalLostRevenue          0
AgentClosedWonRate             0
AgentClosedLostRate            0
ClosedDealsStatus              0
dtype: int64
print((df.isnull().sum() / len(df))*100) # the number of null rows divided by the number of rows in the dataframe
#len() returns the number of rows in the dataframe
opportunity_id               0.000000
sales_agent                  0.000000
product                      0.000000
series                      17.091343
sales_price                 17.091343
account                      0.000000
deal_stage                   0.000000
engage_date                  0.000000
close_date                   0.000000
DaysToClose                  0.000000
close_value                  0.000000
TotalAgentOpps               0.000000
AgentWonOpps                 0.000000
AgentLostOpps                0.000000
LostProductRevenue           6.228580
TotalAgentWonOpps            0.000000
TotalAgentLostOpps           0.000000
AgentTotalClosedWonValue     0.000000
AgentTotalLostRevenue        0.000000
AgentClosedWonRate           0.000000
AgentClosedLostRate          0.000000
ClosedDealsStatus            0.000000
dtype: float64
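A shorthand for the same percentage calculation is isnull().mean(), since the mean of a boolean mask is the fraction of True values. A quick sketch on a toy frame (the values here are made up for illustration):

```python
import pandas as pd

# toy frame: 'series' has 2 nulls out of 4 rows, 'account' has none
toy = pd.DataFrame({
    "series": ["GTX", None, "MG", None],
    "account": ["A", "B", "C", "D"],
})

# mean of the null-mask is the fraction of nulls; multiply by 100 for a percentage
pct_null = toy.isnull().mean() * 100
print(pct_null)  # series 50.0, account 0.0
```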
print(df['sales_agent'].unique()) # unique() method will print just the distinct values of the selected column
['Anna Snelling' 'Boris Faz' 'Cassey Cress' 'Cecily Lampkin'
 'Corliss Cosme' 'Daniell Hammack' 'Darcel Schlecht' 'Donn Cantrell'
 'Elease Gluck' 'Garret Kinder' 'Gladys Colclough' 'Hayden Neloms'
 'James Ascencio' 'Jonathan Berthelot' 'Kami Bicknell' 'Kary Hendrixson'
 'Lajuana Vencill' 'Markita Hansen' 'Marty Freudenburg' 'Maureen Marcano'
 'Moses Frase' 'Niesha Huffines' 'Reed Clapper' 'Rosalina Dieter'
 'Rosie Papadopoulos' 'Versie Hillebrand' 'Vicki Laflamme'
 'Violet Mclelland' 'Wilburn Farren' 'Zane Levy']
print(df['sales_agent'].value_counts()) # value_counts() method gives us the total count of the unique values of the selected column
sales_agent
Darcel Schlecht       553
Vicki Laflamme        347
Anna Snelling         336
Kary Hendrixson       335
Donn Cantrell         275
Kami Bicknell         272
Jonathan Berthelot    264
Versie Hillebrand     264
Zane Levy             261
Cassey Cress          261
Reed Clapper          237
Gladys Colclough      232
Lajuana Vencill       231
Corliss Cosme         229
Markita Hansen        227
Maureen Marcano       213
James Ascencio        206
Moses Frase           195
Marty Freudenburg     194
Violet Mclelland      193
Daniell Hammack       187
Niesha Huffines       175
Cecily Lampkin        160
Boris Faz             153
Hayden Neloms         152
Elease Gluck          126
Garret Kinder         123
Rosie Papadopoulos    121
Rosalina Dieter       110
Wilburn Farren         79
Name: count, dtype: int64
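value_counts() also accepts normalize=True, which returns each value's share of the total rather than a raw count — handy for seeing what fraction of all opportunities each agent owns. A toy sketch with made-up rows:

```python
import pandas as pd

# toy series: one agent appears twice out of four rows
agents = pd.Series(["Darcel Schlecht", "Darcel Schlecht",
                    "Vicki Laflamme", "Anna Snelling"])

# normalize=True divides each count by the total, giving proportions
print(agents.value_counts(normalize=True))  # Darcel Schlecht 0.50, others 0.25
```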
Now that we've done some basic EDA, we can make some alterations to the field names or data types if necessary.
print(df.columns) # columns attribute to list out the column names for quick reference
Index(['opportunity_id', 'sales_agent', 'product', 'series', 'sales_price',
       'account', 'deal_stage', 'engage_date', 'close_date', 'DaysToClose',
       'close_value', 'TotalAgentOpps', 'AgentWonOpps', 'AgentLostOpps',
       'LostProductRevenue', 'TotalAgentWonOpps', 'TotalAgentLostOpps',
       'AgentTotalClosedWonValue', 'AgentTotalLostRevenue',
       'AgentClosedWonRate', 'AgentClosedLostRate', 'ClosedDealsStatus'],
      dtype='object')
My preference for the column names would be to have them all follow the same naming convention. The derived columns are all PascalCase, and I think the other field names should match.
df.rename(columns={'opportunity_id':'OpportunityID',
'sales_agent':'SalesAgent',
'product':'Product',
'series':'Series',
'sales_price':'SalesPrice',
'account':'Account',
'deal_stage':'DealStage',
'engage_date':'EngageDate',
'close_date':'CloseDate',
'close_value':'CloseValue'}, inplace=True) # rename() method for changing column names
# rename() uses the python dictionary format {'string':'string'}
# inplace=True will alter the current dataframe. If this is False, it will create a new dataframe with the column name change
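To see the difference inplace makes, here's a toy sketch: with the default inplace=False, rename() returns a renamed copy and leaves the original frame untouched, while inplace=True mutates the frame directly and returns None.

```python
import pandas as pd

toy = pd.DataFrame({"sales_agent": ["Anna Snelling"]})

# default inplace=False: returns a new frame; the original keeps its column name
renamed = toy.rename(columns={"sales_agent": "SalesAgent"})
print(list(toy.columns))      # ['sales_agent']
print(list(renamed.columns))  # ['SalesAgent']

# inplace=True: alters toy itself and returns None
toy.rename(columns={"sales_agent": "SalesAgent"}, inplace=True)
print(list(toy.columns))      # ['SalesAgent']
```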
Now let's check that our changes worked by listing out the columns again.
print(df.columns) # columns attribute to list out our columns for reference
Index(['OpportunityID', 'SalesAgent', 'Product', 'Series', 'SalesPrice',
       'Account', 'DealStage', 'EngageDate', 'CloseDate', 'DaysToClose',
       'CloseValue', 'TotalAgentOpps', 'AgentWonOpps', 'AgentLostOpps',
       'LostProductRevenue', 'TotalAgentWonOpps', 'TotalAgentLostOpps',
       'AgentTotalClosedWonValue', 'AgentTotalLostRevenue',
       'AgentClosedWonRate', 'AgentClosedLostRate', 'ClosedDealsStatus'],
      dtype='object')
If we felt the need to change the data type of, say, our CloseDate field from a plain object/string to datetime, we could do the following. It isn't necessary for this analysis, but I'll give an example regardless.
df['CloseDate'] = pd.to_datetime(df['CloseDate'], format='%Y-%m-%d') # to_datetime() converts our date strings to the datetime64 data type; the format string matches how the dates appear (e.g. 2017-07-02)
Here's a list of the to_datetime() function formatting options.
%Y  Year with century                                2023
%m  Month as a zero-padded decimal number            01
%d  Day of the month as a zero-padded decimal        15
%H  Hour (24-hour clock) as a zero-padded decimal    14
%I  Hour (12-hour clock) as a zero-padded decimal    02 (for 2 PM)
%M  Minute as a zero-padded decimal                  30
%S  Second as a zero-padded decimal                  00
%f  Microsecond as a decimal number                  123456
%p  Locale's equivalent of AM/PM                     PM
%b  Month as locale's abbreviated name               Jan
%B  Month as locale's full name                      January
%a  Weekday as locale's abbreviated name             Mon
%A  Weekday as locale's full name                    Monday
%j  Day of the year as a zero-padded decimal         015 (for Jan 15th)
%Z  Time zone name (empty if no time zone)           EST
%z  UTC offset in the form +HHMM or -HHMM            +0500
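Putting a few of these directives together, here's a quick sketch parsing two different timestamp styles (the sample strings are made up for illustration):

```python
import pandas as pd

# %Y-%m-%d %H:%M:%S matches a full timestamp like '2017-07-02 14:30:00'
ts = pd.to_datetime("2017-07-02 14:30:00", format="%Y-%m-%d %H:%M:%S")
print(ts.year, ts.hour)  # 2017 14

# %d-%b-%Y uses the abbreviated month name, matching '02-Jul-2017'
ts2 = pd.to_datetime("02-Jul-2017", format="%d-%b-%Y")
print(ts2.month)  # 7
```

If the string doesn't match the format exactly, to_datetime() raises an error, which makes an explicit format a useful sanity check on the data.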