In this notebook, we'll take a dataset I built from a set of data objects using T-SQL in SQL Server and perform exploratory data analysis (EDA) on it with Python.
This dataset mimics a sales cycle set that you might build out of the objects in a sales CRM such as Salesforce. Rich data with real-world complexity is fairly difficult to come by, so this set is much simpler than what you would find in a company with years of CRM history, and there are some limits to what can be accomplished with it. In a real-world scenario we would have an array of data objects across multiple databases, allowing for extensive engineering and analysis, but we can still simulate some of those scenarios with our limited set.
Our first bit of business is to connect to our server to pull data from the database.
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine
import urllib.parse  # importing the submodule explicitly; 'import urllib' alone doesn't guarantee urllib.parse is available

# load the server and database names from our .env file
load_dotenv(dotenv_path=r"C:\Users\nate\OneDrive\Desktop\crm_sales\sqlserver_connector.env")
server = os.getenv('DB_SERVER')
database = os.getenv('DB_DATABASE')

# build the ODBC connection string, using Windows authentication
odbc_str = (
    f"DRIVER={{ODBC Driver 17 for SQL Server}};"
    f"SERVER={server};"
    f"DATABASE={database};"
    f"Trusted_Connection=yes;"
)
params = urllib.parse.quote_plus(odbc_str)  # URL-encode the connection string for the SQLAlchemy URL
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")
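With the engine in hand, pd.read_sql() will pull a query result straight into a DataFrame. The table name sales_cycle below is an assumption about how the T-SQL build named its output; to keep the sketch runnable without a SQL Server instance, it uses an in-memory SQLite engine in place of the pyodbc one above.

```python
import pandas as pd
from sqlalchemy import create_engine

# stand-in engine so the sketch runs anywhere; swap in the pyodbc engine above
engine = create_engine("sqlite:///:memory:")

# seed a tiny stand-in table (the real 'sales_cycle' table name is an assumption)
pd.DataFrame({
    "opportunity_id": ["019I751P", "04AOWMXV"],
    "sales_price": [1096.0, 550.0],
}).to_sql("sales_cycle", engine, index=False)

# read_sql() executes the query and returns the result set as a DataFrame
df = pd.read_sql("SELECT * FROM sales_cycle", engine)
print(df.shape)  # (2, 2)
```

Against the real engine, only the SELECT statement and engine variable would change.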
If, for any reason, you'd like to forgo connecting to a database and work with a CSV instead, below is the basic setup for that.
import pandas as pd # import the pandas module for loading our dataframe
file_path = 'C:/Users/nate/OneDrive/Desktop/crm_sales/sales_cycle.csv' # create a variable that is the path to our data
df = pd.read_csv(file_path) # create a dataframe and use the pandas read_csv() function to load the dataframe looking at the file_path variable
print(df.head()) # head() method for printing column names and first 5 rows
  opportunity_id    sales_agent         product series  sales_price  \
0       019I751P  Anna Snelling  GTX Plus Basic    GTX       1096.0
1       04AOWMXV  Anna Snelling       GTX Basic    GTX        550.0
2       05HLE3JL  Anna Snelling       GTX Basic    GTX        550.0
3       066V2ZGP  Anna Snelling     MG Advanced     MG       3393.0
4       07DPEL0B  Anna Snelling       GTX Basic    GTX        550.0

        account deal_stage engage_date  close_date  DaysToClose  ...  \
0  Statholdings        Won  2017-07-02  2017-07-09            7  ...
1       Codehow        Won  2017-12-11  2017-12-28           17  ...
2    Stanredtax        Won  2017-09-01  2017-12-03           93  ...
3      Rangreen        Won  2017-04-28  2017-08-14          108  ...
4       Cancity        Won  2017-09-13  2017-11-25           73  ...

   AgentWonOpps  AgentLostOpps  LostProductRevenue  TotalAgentWonOpps  \
0             1              0                 0.0                208
1             1              0                 0.0                208
2             1              0                 0.0                208
3             1              0                 0.0                208
4             1              0                 0.0                208

   TotalAgentLostOpps  AgentTotalClosedWonValue  AgentTotalLostRevenue  \
0                 128                  275056.0               229040.0
1                 128                  275056.0               229040.0
2                 128                  275056.0               229040.0
3                 128                  275056.0               229040.0
4                 128                  275056.0               229040.0

   AgentClosedWonRate  AgentClosedLostRate ClosedDealsStatus
0                0.62                 0.38      Majority Won
1                0.62                 0.38      Majority Won
2                0.62                 0.38      Majority Won
3                0.62                 0.38      Majority Won
4                0.62                 0.38      Majority Won

[5 rows x 22 columns]
Now that we have a dataframe loaded and have a snapshot of the data in front of us, we can start with some basic EDA.
print(df.info()) # info() method for printing field names and data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6711 entries, 0 to 6710
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   opportunity_id            6711 non-null   object
 1   sales_agent               6711 non-null   object
 2   product                   6711 non-null   object
 3   series                    5564 non-null   object
 4   sales_price               5564 non-null   float64
 5   account                   6711 non-null   object
 6   deal_stage                6711 non-null   object
 7   engage_date               6711 non-null   object
 8   close_date                6711 non-null   object
 9   DaysToClose               6711 non-null   int64
 10  close_value               6711 non-null   int64
 11  TotalAgentOpps            6711 non-null   int64
 12  AgentWonOpps              6711 non-null   int64
 13  AgentLostOpps             6711 non-null   int64
 14  LostProductRevenue        6293 non-null   float64
 15  TotalAgentWonOpps         6711 non-null   int64
 16  TotalAgentLostOpps        6711 non-null   int64
 17  AgentTotalClosedWonValue  6711 non-null   float64
 18  AgentTotalLostRevenue     6711 non-null   float64
 19  AgentClosedWonRate        6711 non-null   float64
 20  AgentClosedLostRate       6711 non-null   float64
 21  ClosedDealsStatus         6711 non-null   object
dtypes: float64(6), int64(7), object(9)
memory usage: 1.1+ MB
None
print(df.shape) # shape attribute printed, outputs a tuple, (6711, 22) for 6711 rows and 22 columns
(6711, 22)
print(df.describe()) # describe() method to do a statistical summary of the data in the dataframe
        sales_price  DaysToClose   close_value  TotalAgentOpps  AgentWonOpps  \
count   5564.000000  6711.000000   6711.000000     6711.000000   6711.000000
mean    1876.395219    47.985397   1490.915512      260.216808      0.631501
std     2504.431797    41.057665   2320.670773      107.795003      0.482434
min       55.000000     1.000000      0.000000       79.000000      0.000000
25%      550.000000     8.000000      0.000000      194.000000      0.000000
50%     1096.000000    45.000000    472.000000      237.000000      1.000000
75%     3393.000000    85.000000   3225.000000      275.000000      1.000000
max    26768.000000   138.000000  30288.000000      553.000000      1.000000

       AgentLostOpps  LostProductRevenue  TotalAgentWonOpps  \
count    6711.000000         6293.000000        6711.000000
mean        0.368499          624.708406         163.797199
std         0.482434         1704.981613          67.727821
min         0.000000            0.000000          55.000000
25%         0.000000            0.000000         122.000000
50%         0.000000            0.000000         155.000000
75%         1.000000          550.000000         176.000000
max         1.000000        26768.000000         349.000000

       TotalAgentLostOpps  AgentTotalClosedWonValue  AgentTotalLostRevenue  \
count         6711.000000              6.711000e+03            6711.000000
mean            96.419610              3.928543e+05          143461.835196
std             41.143883              2.484273e+05           49891.452651
min             24.000000              1.234310e+05           56424.000000
25%             71.000000              2.616310e+05           93569.000000
50%             93.000000              3.287920e+05          143648.000000
75%            117.000000              4.458600e+05          175064.000000
max            204.000000              1.153214e+06          229040.000000

       AgentClosedWonRate  AgentClosedLostRate
count         6711.000000          6711.000000
mean             0.631128             0.368872
std              0.034819             0.034819
min              0.550000             0.300000
25%              0.620000             0.350000
50%              0.630000             0.370000
75%              0.650000             0.380000
max              0.700000             0.450000
print(df.isnull().sum()) # isnull() method with the sum() function to check for null values in each column
opportunity_id                 0
sales_agent                    0
product                        0
series                      1147
sales_price                 1147
account                        0
deal_stage                     0
engage_date                    0
close_date                     0
DaysToClose                    0
close_value                    0
TotalAgentOpps                 0
AgentWonOpps                   0
AgentLostOpps                  0
LostProductRevenue           418
TotalAgentWonOpps              0
TotalAgentLostOpps             0
AgentTotalClosedWonValue       0
AgentTotalLostRevenue          0
AgentClosedWonRate             0
AgentClosedLostRate            0
ClosedDealsStatus              0
dtype: int64
print((df.isnull().sum() / len(df))*100) # the number of null rows divided by the number of rows in the dataframe
#len() returns the number of rows in the dataframe
opportunity_id               0.000000
sales_agent                  0.000000
product                      0.000000
series                      17.091343
sales_price                 17.091343
account                      0.000000
deal_stage                   0.000000
engage_date                  0.000000
close_date                   0.000000
DaysToClose                  0.000000
close_value                  0.000000
TotalAgentOpps               0.000000
AgentWonOpps                 0.000000
AgentLostOpps                0.000000
LostProductRevenue           6.228580
TotalAgentWonOpps            0.000000
TotalAgentLostOpps           0.000000
AgentTotalClosedWonValue     0.000000
AgentTotalLostRevenue        0.000000
AgentClosedWonRate           0.000000
AgentClosedLostRate          0.000000
ClosedDealsStatus            0.000000
dtype: float64
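A shorthand for the same percentage calculation is isnull().mean(), since the mean of a boolean mask is the fraction of True values. A quick sketch on a toy frame (the values here are made up for illustration):

```python
import pandas as pd

# toy frame: 'series' has 2 nulls out of 4 rows, 'account' has none
toy = pd.DataFrame({
    "series": ["GTX", None, "MG", None],
    "account": ["A", "B", "C", "D"],
})

# mean of the null-mask is the fraction of nulls; multiply by 100 for a percentage
pct_null = toy.isnull().mean() * 100
print(pct_null)  # series 50.0, account 0.0
```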
print(df['sales_agent'].unique()) # unique() method will print just the distinct values of the selected column
['Anna Snelling' 'Boris Faz' 'Cassey Cress' 'Cecily Lampkin'
 'Corliss Cosme' 'Daniell Hammack' 'Darcel Schlecht' 'Donn Cantrell'
 'Elease Gluck' 'Garret Kinder' 'Gladys Colclough' 'Hayden Neloms'
 'James Ascencio' 'Jonathan Berthelot' 'Kami Bicknell' 'Kary Hendrixson'
 'Lajuana Vencill' 'Markita Hansen' 'Marty Freudenburg' 'Maureen Marcano'
 'Moses Frase' 'Niesha Huffines' 'Reed Clapper' 'Rosalina Dieter'
 'Rosie Papadopoulos' 'Versie Hillebrand' 'Vicki Laflamme'
 'Violet Mclelland' 'Wilburn Farren' 'Zane Levy']
print(df['sales_agent'].value_counts()) # value_counts() method gives us the total count of the unique values of the selected column
sales_agent
Darcel Schlecht       553
Vicki Laflamme        347
Anna Snelling         336
Kary Hendrixson       335
Donn Cantrell         275
Kami Bicknell         272
Jonathan Berthelot    264
Versie Hillebrand     264
Zane Levy             261
Cassey Cress          261
Reed Clapper          237
Gladys Colclough      232
Lajuana Vencill       231
Corliss Cosme         229
Markita Hansen        227
Maureen Marcano       213
James Ascencio        206
Moses Frase           195
Marty Freudenburg     194
Violet Mclelland      193
Daniell Hammack       187
Niesha Huffines       175
Cecily Lampkin        160
Boris Faz             153
Hayden Neloms         152
Elease Gluck          126
Garret Kinder         123
Rosie Papadopoulos    121
Rosalina Dieter       110
Wilburn Farren         79
Name: count, dtype: int64
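value_counts() also accepts normalize=True, which returns each value's share of the total rather than a raw count — handy for seeing what fraction of all opportunities each agent owns. A toy sketch with made-up rows:

```python
import pandas as pd

# toy series: one agent appears twice out of four rows
agents = pd.Series(["Darcel Schlecht", "Darcel Schlecht",
                    "Vicki Laflamme", "Anna Snelling"])

# normalize=True divides each count by the total, giving proportions
print(agents.value_counts(normalize=True))  # Darcel Schlecht 0.50, others 0.25
```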
Now that we've done some basic EDA, we can make some alterations to the field names or data types if necessary.
print(df.columns) # columns attribute to list out the column names for quick reference
Index(['opportunity_id', 'sales_agent', 'product', 'series', 'sales_price',
       'account', 'deal_stage', 'engage_date', 'close_date', 'DaysToClose',
       'close_value', 'TotalAgentOpps', 'AgentWonOpps', 'AgentLostOpps',
       'LostProductRevenue', 'TotalAgentWonOpps', 'TotalAgentLostOpps',
       'AgentTotalClosedWonValue', 'AgentTotalLostRevenue',
       'AgentClosedWonRate', 'AgentClosedLostRate', 'ClosedDealsStatus'],
      dtype='object')
My preference for the column names would be to have them all follow the same naming convention. The derived columns are all PascalCase, and I think the other field names should match.
df.rename(columns={'opportunity_id':'OpportunityID',
'sales_agent':'SalesAgent',
'product':'Product',
'series':'Series',
'sales_price':'SalesPrice',
'account':'Account',
'deal_stage':'DealStage',
'engage_date':'EngageDate',
'close_date':'CloseDate',
'close_value':'CloseValue'}, inplace=True) # rename() method for changing column names
# rename() uses the python dictionary format {'string':'string'}
# inplace=True will alter the current dataframe. If this is False, it will create a new dataframe with the column name change
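To see the difference inplace makes, here's a toy sketch: with the default inplace=False, rename() returns a renamed copy and leaves the original frame untouched, while inplace=True mutates the frame directly and returns None.

```python
import pandas as pd

toy = pd.DataFrame({"sales_agent": ["Anna Snelling"]})

# default inplace=False: returns a new frame; the original keeps its column name
renamed = toy.rename(columns={"sales_agent": "SalesAgent"})
print(list(toy.columns))      # ['sales_agent']
print(list(renamed.columns))  # ['SalesAgent']

# inplace=True: alters toy itself and returns None
toy.rename(columns={"sales_agent": "SalesAgent"}, inplace=True)
print(list(toy.columns))      # ['SalesAgent']
```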
Now let's check that our changes worked by listing out the columns again.
print(df.columns) # columns attribute to list out our columns for reference
Index(['OpportunityID', 'SalesAgent', 'Product', 'Series', 'SalesPrice',
       'Account', 'DealStage', 'EngageDate', 'CloseDate', 'DaysToClose',
       'CloseValue', 'TotalAgentOpps', 'AgentWonOpps', 'AgentLostOpps',
       'LostProductRevenue', 'TotalAgentWonOpps', 'TotalAgentLostOpps',
       'AgentTotalClosedWonValue', 'AgentTotalLostRevenue',
       'AgentClosedWonRate', 'AgentClosedLostRate', 'ClosedDealsStatus'],
      dtype='object')
If we felt the need to change the data type of, say, our CloseDate field from a plain object/string to datetime, we could do the following. It isn't necessary for this analysis, but I'll give an example regardless.
df['CloseDate'] = pd.to_datetime(df['CloseDate'], format='%Y-%m-%d') # to_datetime() converts our date strings to the datetime64 data type; the format string matches how the dates appear (e.g. 2017-07-02)
Here's a list of the to_datetime() function formatting options.
%Y  Year with century                                2023
%m  Month as a zero-padded decimal number            01
%d  Day of the month as a zero-padded decimal        15
%H  Hour (24-hour clock) as a zero-padded decimal    14
%I  Hour (12-hour clock) as a zero-padded decimal    02 (for 2 PM)
%M  Minute as a zero-padded decimal                  30
%S  Second as a zero-padded decimal                  00
%f  Microsecond as a decimal number                  123456
%p  Locale's equivalent of AM/PM                     PM
%b  Month as locale's abbreviated name               Jan
%B  Month as locale's full name                      January
%a  Weekday as locale's abbreviated name             Mon
%A  Weekday as locale's full name                    Monday
%j  Day of the year as a zero-padded decimal         015 (for Jan 15th)
%Z  Time zone name (empty if no time zone)           EST
%z  UTC offset in the form +HHMM or -HHMM            +0500
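Putting a few of these directives together, here's a quick sketch parsing two different timestamp styles (the sample strings are made up for illustration):

```python
import pandas as pd

# %Y-%m-%d %H:%M:%S matches a full timestamp like '2017-07-02 14:30:00'
ts = pd.to_datetime("2017-07-02 14:30:00", format="%Y-%m-%d %H:%M:%S")
print(ts.year, ts.hour)  # 2017 14

# %d-%b-%Y uses the abbreviated month name, matching '02-Jul-2017'
ts2 = pd.to_datetime("02-Jul-2017", format="%d-%b-%Y")
print(ts2.month)  # 7
```

If the string doesn't match the format exactly, to_datetime() raises an error, which makes an explicit format a useful sanity check on the data.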