Learn different methods to summarize data in Python.
Data is power. The more data we have, the better and more robust products we can create. However, working with large amounts of data has its challenges. We need tools and software packages to extract information from it, for example by creating a data summary in Python.
A considerable number of data-driven solutions and products use tabular data, that is, data stored in a table format with labeled rows and columns. Each row represents an observation (i.e., a data point), and the columns represent features or attributes of that observation.
As the number of rows and columns increases, it becomes more difficult to manually inspect the data. Since we almost always work with large data sets, the use of a software tool to summarize the data is a key requirement.
Data summaries are useful for a variety of tasks:
- Learning the underlying structure of a data set.
- Understanding the distribution of features (i.e., columns).
- Performing exploratory data analysis.
As the leading programming language in the data science ecosystem, Python has libraries for creating data summaries. The most popular and commonly used library for this purpose is pandas. LearnPython.com has an Introduction to Python for Data Science course that covers the pandas library in great detail.
pandas is a data manipulation and analysis library for Python. In this article, we look at several examples to demonstrate how to use pandas to create and display data summaries.
Getting started with pandas
Let's start by importing pandas.
import pandas as pd
Consider a sales dataset in CSV format that contains sales and inventory amounts for some products and their product groups. We create a pandas DataFrame for the data in this file and display the first 5 rows as shown below:
df = pd.read_csv("sales.csv")
df.head()
Output:
| | product_group | product_code | sales_qty | stock_qty |
|---|---|---|---|---|
| 0 | A | 1000 | 337 | 791 |
| 1 | C | 1001 | 502 | 757 |
| 2 | A | 1002 | 402 | 827 |
| 3 | A | 1003 | 411 | 1227 |
| 4 | C | 1004 | 186 | 361 |
A data summary in pandas starts with checking the size of the data. The shape attribute returns a tuple with the row and column counts of a DataFrame.
>>> df.shape
(300, 4)
The DataFrame contains 300 rows and 4 columns. This is a clean dataset that is ready to be analyzed. However, most real-life data sets require cleanup. Here is an article that explains the most useful Python data cleaning modules.
We continue to summarize the data by focusing on each column separately. pandas has two main data structures: DataFrame and Series. A DataFrame is a two-dimensional data structure, while a Series is one-dimensional. Each column in a DataFrame can be considered a Series.
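For example, selecting a single column of our DataFrame with single square brackets returns a Series, while selecting it inside a list returns a one-column DataFrame. Here is a quick sketch using the sales_qty column introduced above:
>>> type(df["sales_qty"])
<class 'pandas.core.series.Series'>
>>> type(df[["sales_qty"]])
<class 'pandas.core.frame.DataFrame'>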
Since the characteristics of categorical and numeric data are very different, it is best to address them separately.
categorical columns
If a column contains categorical data, such as the product_group column in our DataFrame, we can check the number of distinct values in it. We do this with the unique() and nunique() functions.
>>> df["product_group"].unique()array(['A', 'C', 'B', 'G', 'D', 'F', 'E'], dtype=objeto)> >> df["grupo_producto"].nunique()7
The nunique() function returns the count of distinct values, while the unique() function displays the distinct values themselves. Another commonly used summary function on categorical columns is value_counts(). It displays the distinct values in a column along with the counts of their occurrences, so we get an overview of the data distribution.
>>> df["product_group"].value_counts()A 102B 75C 63D 37G 9F 8E 6Name: product_group, dtype: int64
Group A has the most products (102), followed by group B with 75 products. The output of the value_counts() function is sorted in descending order by occurrence count.
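If proportions are more informative than raw counts, value_counts() also accepts a normalize parameter. The sketch below builds on the counts shown above; for example, group A's share is 102 out of 300 rows, i.e., 0.34:
# Share of each product group instead of raw counts
df["product_group"].value_counts(normalize=True)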
numeric columns
When we work with numeric columns, we need different methods to summarize the data. For example, it doesn't make sense to check the number of distinct values for the sales quantity column. Instead, we calculate statistical measures such as the mean, median, minimum, and maximum.
Let's first calculate the average value of the sales quantity column.
>>> df["amount_of_sales"].mean()473,557
We simply select the column of interest and apply the mean() function. We can also perform this operation on multiple columns.
>>> df[["sale_quantity","stock_quantity"]].mean() sales_quantity 473,557stock_quantity 1160,837dtype: float64
When selecting multiple columns from a DataFrame, be sure to specify them as a list. Otherwise, pandas raises a KeyError.
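As a quick illustration of the difference, compare passing a list of names with passing the names directly:
df[["sales_qty", "stock_qty"]].mean()   # works: the columns are specified as a list
# df["sales_qty", "stock_qty"].mean()   # KeyError: pandas looks for a single column named ('sales_qty', 'stock_qty')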
Just as easily as we can compute a single statistic across multiple columns in a single operation, we can compute multiple statistics at once. One option is to use the apply() function as follows:
>>> df[["sales_quantity","stock_quantity"]].apply(["average","median"])
Output:
| | sales_qty | stock_qty |
|---|---|---|
| mean | 473.556667 | 1160.836667 |
| median | 446.000000 | 1174.000000 |
The functions are written in a list and then passed to apply(). The median is the value in the middle when the values are ordered. Comparing the mean and the median gives us an idea of the skewness of the distribution; here, the mean of sales_qty (473.56) is greater than its median (446.00), which suggests the distribution is slightly right-skewed.
We have many options for creating a data summary in pandas. For example, we can use a dictionary to calculate separate statistics for different columns. Here's an example:
df[["sales_quantity","stock_quantity"]].apply( { "sales_quantity":["average","median","max"], "stock_quantity":["average","median","min" ] })
Output:
| | sales_qty | stock_qty |
|---|---|---|
| mean | 473.556667 | 1160.836667 |
| median | 446.000000 | 1174.000000 |
| max | 999.000000 | NaN |
| min | NaN | 302.000000 |
The dictionary keys indicate the names of the columns and the values show the statistics to calculate for that column.
We can do the same operations with the aggregate() function instead of apply(). The syntax is the same, so don't be surprised if you find tutorials that use the aggregate() function instead.
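For instance, the apply() example above could be written with aggregate() (or its shorter alias, agg()) and produces the same table of statistics:
df[["sales_qty", "stock_qty"]].aggregate(["mean", "median"])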
pandas is a very useful and practical library in many ways. For example, we can calculate multiple statistics on all numeric columns with a single function: describe():
>>> df.describe()
Output:
| | sales_qty | stock_qty |
|---|---|---|
| count | 300.000000 | 300.000000 |
| mean | 473.556667 | 1160.836667 |
| std | 295.877223 | 480.614653 |
| min | 4.000000 | 302.000000 |
| 25% | 203.000000 | 750.500000 |
| 50% | 446.000000 | 1174.000000 |
| 75% | 721.750000 | 1590.500000 |
| max | 999.000000 | 1988.000000 |
The statistics in this DataFrame give us a broad overview of the distribution of values. The count is the number of values (i.e., rows). The "25%", "50%", and "75%" rows indicate the first, second, and third quartiles, respectively. The second quartile (i.e., the 50% value) is also known as the median. Finally, "std" is the standard deviation of the column.
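By default, describe() only reports on numeric columns. If we also want a summary of the categorical product_group column, we can pass the include parameter; for object columns it reports the count, the number of unique values, the most frequent value (top), and its frequency (freq):
# Summary of the categorical column(s) only
df.describe(include="object")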
A Python data summary can be created for a specific part of the DataFrame. We only need to filter the relevant rows before applying the functions.
For example, we describe the data for Product Group A only as follows:
df[df["product_group"]=="A"].describe()
We first select the rows whose product group value is A, and then use the describe() function. The output has the same format as the previous example, but the values are calculated for product group A only.
We can also apply filters on numeric columns. For example, the following line of code calculates the average sales quantity for products with inventory greater than 500.
df[df["stock_qty"]>500]["vendas_qty"].mean()
Output:
476.951
pandas allows you to create more complex filters quite efficiently. Here is an article that explains in detail how to filter based on rows and columns with pandas.
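As a small example of a compound filter, we can combine two conditions with the & operator. This is only a sketch; the resulting value depends on the data:
# Average sales quantity of product group A items with more than 500 in stock
df[(df["product_group"] == "A") & (df["stock_qty"] > 500)]["sales_qty"].mean()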
Summary of data groups
We can create a separate data summary for different groups in the data. It's pretty similar to what we did in the previous example. The only addition is grouping the data.
We group the rows by the distinct values in a column with the groupby() function. The following code groups the rows by product group.
df.groupby("product_group")
Once the groups are formed, we can calculate any statistics and describe or summarize the data. Let's calculate the average sales quantity for each product group.
df.groupby("product_group")["sales_quantity"].mean()
Output:
product_group
A    492.676471
B    490.253333
C    449.285714
D    462.864865
E    378.666667
F    508.875000
G    363.444444
Name: sales_qty, dtype: float64
We can also perform multiple aggregations in a single operation. In addition to the average sales quantities, we will also count the number of products in each group. We use the agg() function, which also allows us to assign names to the aggregated columns.
df.groupby("product_group").agg( avg_sales_qty = ("sales_qty", "media"), number_of_products = ("product_code","count"))
Output:
| product_group | avg_sales_qty | number_of_products |
|---|---|---|
| A | 492.676471 | 102 |
| B | 490.253333 | 75 |
| C | 449.285714 | 63 |
| D | 462.864865 | 37 |
| E | 378.666667 | 6 |
| F | 508.875000 | 8 |
| G | 363.444444 | 9 |
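If we want the full set of describe() statistics for each group rather than hand-picked aggregations, groupby() can also be chained with describe(). This sketch returns one row of statistics per product group:
df.groupby("product_group")["sales_qty"].describe()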
Data distribution with a Matplotlib histogram
Data visualization is another highly efficient technique for summarizing data. Matplotlib is a popular Python library for visually exploring and summarizing data.
There are many different types of data visualizations. A histogram is used to check the distribution of a numeric column. It splits the entire range of values into discrete bins and counts the number of values in each bin. As a result, we get an overview of the data distribution.
Let's create a histogram of the sales quantity column.
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.hist(df["sales_qty"], bins=10)
In the first line, we import the pyplot interface of Matplotlib. The second line creates an empty figure object of the specified size. The third line plots the histogram of the sales quantity column on the figure object. The bins parameter determines the number of bins.
Here is the graph generated by this code:
The values on the x-axis show the bin edges. The values on the y-axis show the number of values in each bin. For example, there are more than 40 products whose sales quantity is between 100 and 200.
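To make the chart easier to read, we could also label the axes and add a title before displaying the figure. The label texts below are only suggestions:
plt.xlabel("Sales quantity")
plt.ylabel("Number of products")
plt.title("Distribution of the sales quantity column")
plt.show()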
Data Summary in Python
It is vitally important to understand the available data before proceeding to create data-driven products. A good way to start is with a data summary in Python. In this article, we went through several examples of using the pandas and Matplotlib libraries to summarize data.
Python has a rich selection of libraries that make data science tasks fast and simple. The Python for Data Science track is a great start to your data science journey.