How to generate a data summary in Python (2023)

Learn different methods to summarize data in Python.

Data is power. The more data we have, the better and more robust the products we can create. However, working with large amounts of data has its challenges. We need tools and software packages to extract information from it, for example by creating a data summary in Python.

A considerable number of data-driven solutions and products use tabular data, that is, data stored in a table format with labeled rows and columns. Each row represents an observation (i.e., a data point) and the columns represent features or attributes of that observation.

As the number of rows and columns increases, it becomes more difficult to manually inspect the data. Since we almost always work with large data sets, the use of a software tool to summarize the data is a key requirement.

Data summaries are useful for a variety of tasks:

  • Learn the underlying structure of a data set.
  • Understand the distribution of features (i.e., columns).
  • Perform exploratory data analysis.

As the leading programming language in the data science ecosystem, Python has libraries for creating data summaries. The most popular and commonly used library for this purpose is pandas. Learn Python has an Introduction to Python for Data Science course that covers the pandas library in great detail.

pandas is a data manipulation and analysis library for Python. In this article, we look at several examples to demonstrate how to use pandas to create and display data summaries.

Getting started with pandas

Let's start by importing pandas.

import pandas as pd

Consider a sales dataset in CSV format that contains sales and stock quantities for some products and their product groups. We create a pandas DataFrame for the data in this file and display the first 5 rows as shown below:

df = pd.read_csv("sales.csv")
df.head()

Output:

   product_group  product_code  sales_qty  stock_qty
0              A          1000        337        791
1              C          1001        502        757
2              A          1002        402        827
3              A          1003        411       1227
4              C          1004        186        361
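
The CSV file itself is not included here, so the following is only a minimal sketch that rebuilds the five rows shown above from an in-memory string. The names csv_text and sample are illustrative and not part of the original dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: only the five rows displayed above.
csv_text = """product_group,product_code,sales_qty,stock_qty
A,1000,337,791
C,1001,502,757
A,1002,402,827
A,1003,411,1227
C,1004,186,361
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(3))  # first three rows
print(sample.dtypes)   # inferred type of each column
```

Passing a number to head() changes how many rows are displayed; without an argument it shows five.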

A data summary in pandas starts with checking the size of the data. The shape attribute returns a tuple with the row and column counts of a DataFrame.

>>> df.shape
(300, 4)

The DataFrame contains 300 rows and 4 columns. This is a clean dataset that is ready to be analyzed. However, most real-life data sets require cleanup. Here is an article that explains the most useful Python data cleaning modules.
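
As a small sketch of the size check, assuming a made-up five-row DataFrame named sample rather than the full 300-row dataset, the tuple returned by shape can be unpacked directly:

```python
import pandas as pd

# Made-up stand-in data; the real dataset has 300 rows.
sample = pd.DataFrame({
    "product_group": ["A", "C", "A", "A", "C"],
    "product_code": [1000, 1001, 1002, 1003, 1004],
    "sales_qty": [337, 502, 402, 411, 186],
    "stock_qty": [791, 757, 827, 1227, 361],
})

n_rows, n_cols = sample.shape   # shape is an attribute, so no parentheses
print(n_rows, n_cols)
print(len(sample))              # number of rows only
print(sample.columns.tolist())  # column labels
```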

We continue to summarize the data by focusing on each column separately. pandas has two main data structures: DataFrame and Series. A DataFrame is a two-dimensional data structure, while a Series is one-dimensional. Each column in a DataFrame can be considered a Series.

Since the characteristics of categorical and numeric data are very different, it is best to address them separately.

Categorical columns

If a column contains categorical data, such as the product group column in our DataFrame, we can check the distinct values in it and how many there are. We do this with the unique() and nunique() functions.

>>> df["product_group"].unique()
array(['A', 'C', 'B', 'G', 'D', 'F', 'E'], dtype=object)
>>> df["product_group"].nunique()
7

The nunique() function returns the count of distinct values, while the unique() function displays the distinct values themselves. Another commonly used summary function on categorical columns is value_counts(). It displays the distinct values in a column along with the counts of their occurrences. Thus, we get an overview of the data distribution.

>>> df["product_group"].value_counts()
A    102
B     75
C     63
D     37
G      9
F      8
E      6
Name: product_group, dtype: int64

Group A has the most products (102), followed by group B with 75 products. The output of the value_counts() function is sorted in descending order by occurrence count.
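
The three functions can be compared side by side on a tiny example. This is a sketch on a made-up Series named groups, not on the article's data:

```python
import pandas as pd

# Made-up categorical values for illustration.
groups = pd.Series(["A", "C", "A", "A", "C", "B"], name="product_group")

distinct = groups.unique()      # distinct values, in order of first appearance
n_distinct = groups.nunique()   # how many distinct values there are
counts = groups.value_counts()  # occurrences per value, largest count first

print(distinct)
print(n_distinct)
print(counts)
```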

Numeric columns

When we work with numeric columns, we need different methods to summarize the data. For example, it doesn't make sense to check the number of distinct values for the sales quantity column. Instead, we calculate statistical measures such as the mean, median, minimum, and maximum.

Let's first calculate the average value of the sales quantity column.

>>> df["sales_qty"].mean()
473.557

We simply select the column of interest and apply the mean() function. We can also perform this operation on multiple columns.

>>> df[["sales_qty","stock_qty"]].mean()
sales_qty     473.557
stock_qty    1160.837
dtype: float64

When selecting multiple columns from a DataFrame, be sure to pass them as a list. Otherwise, pandas raises a KeyError.
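
The difference between the two ways of indexing can be seen in a short sketch. The three-row DataFrame named sample is made up for illustration:

```python
import pandas as pd

# Made-up stand-in data.
sample = pd.DataFrame({
    "sales_qty": [337, 502, 402],
    "stock_qty": [791, 757, 827],
})

# A list of labels selects several columns and returns a DataFrame.
means = sample[["sales_qty", "stock_qty"]].mean()
print(means)

# Without the inner list, pandas looks for a single column labelled
# by the tuple ("sales_qty", "stock_qty"), which does not exist.
try:
    sample["sales_qty", "stock_qty"]
    raised = False
except KeyError:
    raised = True
print(raised)
```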

Just as we can compute a single statistic across multiple columns in one operation, we can compute multiple statistics at once. One option is to use the apply() function as follows:

>>> df[["sales_qty","stock_qty"]].apply(["mean","median"])

Output:

         sales_qty    stock_qty
mean    473.556667  1160.836667
median  446.000000  1174.000000

The function names are written in a list and then passed to apply(). The median is the value in the middle when the values are ordered. Comparing the values of the mean and median gives us an idea about the skewness of the distribution.
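
As a rough illustration of that comparison, the sketch below uses a made-up Series with a single large value. A mean well above the median is a common rule-of-thumb sign of a right-skewed distribution, though it is not a formal test:

```python
import pandas as pd

# Made-up values with one large outlier to produce a long right tail.
values = pd.Series([10, 12, 11, 13, 12, 90])

mean_value = values.mean()
median_value = values.median()
skewness = values.skew()  # sample skewness; positive suggests a longer right tail

print(mean_value, median_value, skewness)
```

Here the outlier pulls the mean far above the median, and skew() returns a positive number, which is consistent with that reading.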

We have many options for creating a data summary in pandas. For example, we can use a dictionary to calculate separate statistics for different columns. Here's an example:

df[["sales_qty","stock_qty"]].apply(
    {
        "sales_qty": ["mean", "median", "max"],
        "stock_qty": ["mean", "median", "min"]
    }
)

Output:

         sales_qty    stock_qty
mean    473.556667  1160.836667
median  446.000000  1174.000000
max     999.000000          NaN
min            NaN   302.000000

The dictionary keys indicate the names of the columns, and the values list the statistics to calculate for each column. A statistic that was not requested for a column appears as NaN.

We can do the same operations with the aggregate() function instead of apply(). The syntax is the same, so don't be surprised if you find tutorials that use the aggregate() function instead.
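
A minimal sketch of the dictionary form, assuming a made-up five-row DataFrame named sample. It passes the same specification to agg() and aggregate(), which are two names for the same pandas method:

```python
import pandas as pd

# Made-up stand-in data.
sample = pd.DataFrame({
    "sales_qty": [337, 502, 402, 411, 186],
    "stock_qty": [791, 757, 827, 1227, 361],
})

# Column name -> list of statistics to compute for that column.
spec = {
    "sales_qty": ["mean", "median", "max"],
    "stock_qty": ["mean", "median", "min"],
}

via_agg = sample.agg(spec)
via_aggregate = sample.aggregate(spec)  # agg is an alias of aggregate

print(via_agg)
```

Statistics requested for only one of the columns show up as NaN in the other column.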

pandas is a very useful and practical library in many ways. For example, we can calculate multiple statistics on all numeric columns with a single function, describe():

>>> df.describe()

Output:

        sales_qty    stock_qty
count  300.000000   300.000000
mean   473.556667  1160.836667
std    295.877223   480.614653
min      4.000000   302.000000
25%    203.000000   750.500000
50%    446.000000  1174.000000
75%    721.750000  1590.500000
max    999.000000  1988.000000

The statistics in this DataFrame give us a broad overview of the distribution of values. The "count" row is the number of non-missing values (here, the number of rows). The "25%", "50%", and "75%" rows indicate the first, second, and third quartiles, respectively. The second quartile (i.e., 50%) is also known as the median. Finally, "std" is the standard deviation of the column.
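
To make the link between describe() and the quartiles concrete, the sketch below compares its output with quantile() on a made-up five-value Series named sales:

```python
import pandas as pd

# Made-up values, not the article's data.
sales = pd.Series([337, 502, 402, 411, 186], name="sales_qty")

summary = sales.describe()
quartiles = sales.quantile([0.25, 0.5, 0.75])

print(summary)
print(quartiles)
```

The "25%", "50%", and "75%" entries of summary match the values returned by quantile(), and the "50%" entry equals sales.median().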

A data summary in Python can also be created for a specific part of the DataFrame. We only need to filter the relevant part before applying the functions.

For example, we describe the data for Product Group A only as follows:

df[df["product_group"]=="A"].describe()

We first select the rows whose product group value is A, and then use the describe() function. The output has the same format as the previous example, but the values are calculated for product group A only.

We can also apply filters on numeric columns. For example, the following line of code calculates the average sales quantity for products with inventory greater than 500.

df[df["stock_qty"]>500]["sales_qty"].mean()

Output:

476.951

pandas allows you to create more complex filters quite efficiently. Here is an article that explains in detail how to filter based on rows and columns with pandas.
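
One way to build such a filter is to combine boolean conditions. This is a minimal sketch on a made-up DataFrame named sample, with the threshold of 800 chosen only for illustration:

```python
import pandas as pd

# Made-up stand-in data.
sample = pd.DataFrame({
    "product_group": ["A", "C", "A", "A", "C"],
    "sales_qty": [337, 502, 402, 411, 186],
    "stock_qty": [791, 757, 827, 1227, 361],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mask = (sample["product_group"] == "A") & (sample["stock_qty"] > 800)

# .loc selects the matching rows and the column of interest in one step.
avg_sales = sample.loc[mask, "sales_qty"].mean()
print(avg_sales)
```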

Summary of data groups

We can create a separate data summary for different groups in the data. It's pretty similar to what we did in the previous example. The only addition is grouping the data.

We group the rows by the distinct values in a column with the groupby() function. The following code groups the rows by product group.

df.groupby("product_group")

Once the groups are formed, we can calculate any statistic and describe or summarize the data. Let's calculate the average sales quantity for each product group.

df.groupby("product_group")["sales_qty"].mean()

Output:

product_group
A    492.676471
B    490.253333
C    449.285714
D    462.864865
E    378.666667
F    508.875000
G    363.444444
Name: sales_qty, dtype: float64

We can also perform multiple aggregations in a single operation. In addition to the average sales quantities, we will also count the number of products in each group. We use the agg() function, which also allows us to assign names to the aggregated columns.

df.groupby("product_group").agg(
    avg_sales_qty = ("sales_qty", "mean"),
    number_of_products = ("product_code", "count")
)

Output:

product_group  avg_sales_qty  number_of_products
A                 492.676471                 102
B                 490.253333                  75
C                 449.285714                  63
D                 462.864865                  37
E                 378.666667                   6
F                 508.875000                   8
G                 363.444444                   9
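
The named-aggregation pattern can be tried on a small example. The sketch below uses a made-up five-row DataFrame named sample and adds a hypothetical third output column, max_sales_qty, to show that several statistics of the same input column can be requested:

```python
import pandas as pd

# Made-up stand-in data.
sample = pd.DataFrame({
    "product_group": ["A", "C", "A", "A", "C"],
    "product_code": [1000, 1001, 1002, 1003, 1004],
    "sales_qty": [337, 502, 402, 411, 186],
})

# Each keyword is an output column name; each value is (input column, statistic).
summary = sample.groupby("product_group").agg(
    avg_sales_qty=("sales_qty", "mean"),
    max_sales_qty=("sales_qty", "max"),
    number_of_products=("product_code", "count"),
)
print(summary)
```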

Data distribution with a Matplotlib histogram

Data visualization is another highly efficient technique for summarizing data. Matplotlib is a popular Python library for visually exploring and summarizing data.

There are many different types of data visualizations. A histogram is used to check the data distribution of numeric columns. It splits the entire range of values into discrete bins and counts the number of values in each bin. As a result, we get an overview of the data distribution.

Let's create a histogram of the sales quantity column.

import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.hist(df["sales_qty"], bins=10)

In the first line, we import the pyplot interface of Matplotlib. The second line creates an empty figure object of the specified size. The third line plots the histogram of the sales quantity column on the figure object. The bins parameter determines the number of bins.

Here is the graph generated by this code:

[Figure: histogram of the sales_qty column with 10 bins]

The values on the x-axis show the bin edges. The values on the y-axis show the number of values in each bin. For example, there are more than 40 products whose sales quantity is between 100 and 200.
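
The binning behind a histogram can also be reproduced numerically. The sketch below uses numpy.histogram on made-up sales quantities. Unlike the plotting call above, it fixes the range to 0 to 1000 so that every bin is exactly 100 wide, which is an assumption made here for readability:

```python
import numpy as np

# Made-up sales quantities, not the article's data.
sales = np.array([4, 120, 150, 180, 260, 310, 480, 520, 760, 999])

# Ten equal-width bins between 0 and 1000.
counts, edges = np.histogram(sales, bins=10, range=(0, 1000))

# Print each bin as "left-right: count".
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {count}")
```

There is always one more edge than there are bins, and the counts add up to the number of values.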

Data Summary in Python

It is vitally important to understand the available data before proceeding with the creation of data-driven products. You can start with a summary of data in Python. In this article, we go through several examples using the pandas and Matplotlib libraries to summarize the data.

Python has a rich selection of libraries that make data science tasks fast and simple. The Python for Data Science track is a great start to your data science journey.

FAQs

How do you create a summary of data in Python?

The describe() function computes summary statistics for the DataFrame columns. It reports the count, mean, standard deviation, minimum, maximum, and quartiles. By default, it excludes character columns and summarizes only the numeric columns.

Is there a summary function in Python?

Older versions of pandas provide an Index.summary() function that returns a summarized representation of the Index. It is similar to the summary functions available for DataFrames.

How do you print the summary of a DataFrame in Python?

The describe() function returns the statistical summary of the DataFrame or Series. This includes the count, mean, median (the 50th percentile), standard deviation, minimum, maximum, and percentile values of the columns. To use it, chain .describe() to the DataFrame or Series.

How do you write a data summary?

The three common ways of looking at the center are average (also called mean), mode and median. All three summarize a distribution of the data by describing the typical value of a variable (average), the most frequently repeated number (mode), or the number in the middle of all the other numbers in a data set (median).

What is model.summary() in Python?

In Keras, call model.summary() to print a useful summary of the model, which includes the name and type of all layers in the model, the output shape of each layer, the number of weight parameters of each layer and, if the model has general topology, the inputs each layer receives.

How do I summarize a CSV file in Python?

Summarising, Aggregating, and Grouping data in Python Pandas
  1. df = pd.read_csv('College.csv')
  2. df.head(2) Out[3]: Unnamed: 0. ...
  3. df.rename(columns={'Unnamed: 0':'univ_name'}, inplace=True)
  4. df.head(1) Out[5]: ...
  5. df.describe() Out[6]: ...
  6. %matplotlib inline df.describe().plot() ...
  7. df.describe().plot(). ...
  8. df['Apps'].sum() 2332273.

What does the summary() function do?

The summarize() function is used in R to reduce a data frame to a single value or vector. The observations are usually first grouped by categorical values with the group_by() function. Both functions come from the dplyr package.

How do you summarize a list in Python?

Python provides a built-in function, sum(), which adds up the numbers in a list. Syntax: sum(iterable, start). The iterable can be a list, a tuple, or a dictionary, but its items must be numbers. The optional start value is added to the total.
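
A short sketch of sum() on a made-up list, including the optional start argument:

```python
# Made-up numbers for illustration.
numbers = [4, 8, 15, 16, 23, 42]

total = sum(numbers)                # adds up all items
total_from_100 = sum(numbers, 100)  # the start value is added to the total
average = total / len(numbers)      # a simple mean built on top of sum()

print(total, total_from_100, average)
```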

Which command is used to summarize data?

The sumtable command provides a simple method of producing summary tables of data from two or more groups.

How do you write a summary of text in Python?

Input document → understand context → semantics → create own summary. 2. Extractive Summarization: Extractive methods attempt to summarize articles by selecting a subset of words that retain the most important points. This approach weights the important part of sentences and uses the same to form the summary.

How do you summarize categorical data in Python?

Proportions are often used to summarize categorical data and can be calculated by dividing individual frequencies by the total number of responses. In Python/pandas, df['column_name'].value_counts(normalize=True) ignores missing data and divides the frequency of each category by the total across all categories.
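
The effect of missing data on the proportions can be checked with a small sketch. The Series named groups is made up, and the dropna=False variant is shown only for comparison:

```python
import pandas as pd

# Made-up categorical values with two missing entries.
groups = pd.Series(["A", "A", "B", None, "A", "B", "C", None])

# Default behaviour: missing values are excluded before dividing.
shares = groups.value_counts(normalize=True)

# With dropna=False, missing values get their own share.
shares_with_na = groups.value_counts(normalize=True, dropna=False)

print(shares)
print(shares_with_na)
```

In the first result the shares are based on the six non-missing values; in the second they are based on all eight entries.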

Article information

Author: Ray Christiansen

Last Updated: 12/01/2022