Data summarization is a vital first step in any data analysis workflow. While Pandas' describe() function has been a go-to tool for many, it covers only numeric data by default and provides only basic statistics. Enter Skimpy, a Python library designed to provide detailed, visually appealing, and comprehensive data summaries for all column types.
In this article, we'll explore why Skimpy is a worthy alternative to Pandas describe(). You'll learn how to install and use Skimpy, explore its features, and compare its output with describe() through examples. By the end, you'll have a complete understanding of how Skimpy enhances exploratory data analysis (EDA).
Learning Outcomes
- Understand the limitations of Pandas' describe() function.
- Learn how to install and implement Skimpy in Python.
- Explore Skimpy's detailed outputs and insights with examples.
- Compare outputs from Skimpy and Pandas describe().
- Understand how to integrate Skimpy into your data analysis workflow.
Why Is Pandas describe() Not Enough?
The describe() function in Pandas is widely used to summarize data quickly. While it is a powerful tool for exploratory data analysis (EDA), its utility is limited in several respects. Here is a detailed breakdown of its shortcomings and why users often seek alternatives like Skimpy:
Focus on Numeric Data by Default
By default, describe() summarizes only the numeric columns of a mixed-type DataFrame unless explicitly configured otherwise.
Example:
import pandas as pd

# Small mixed-type dataset: two text columns and two numeric columns
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Salary": [70000, 80000, 120000, 90000],
}
df = pd.DataFrame(data)

# With no arguments, describe() reports on the numeric columns only
print(df.describe())
Output:
             Age         Salary
count   4.000000       4.000000
mean   32.500000   90000.000000
std     6.454972   21602.468995
min    25.000000   70000.000000
25%    28.750000   77500.000000
50%    32.500000   85000.000000
75%    36.250000   97500.000000
max    40.000000  120000.000000
Key Issue:
Non-numeric columns (Name and City) are ignored unless you explicitly call describe(include="all"). Even then, the output remains limited in scope for non-numeric columns.
Limited Summary for Non-Numeric Data
When non-numeric columns are included using include="all", the summary is minimal. It shows only:
- Count: Number of non-missing values.
- Unique: Count of unique values.
- Top: The most frequently occurring value.
- Freq: Frequency of the top value.
Example:
print(df.describe(include="all"))
Output:
         Name   Age      City         Salary
count       4   4.0         4       4.000000
unique      4   NaN         4            NaN
top     Alice   NaN  New York            NaN
freq        1   NaN         1            NaN
mean      NaN  32.5       NaN   90000.000000
std       NaN   6.5       NaN   21602.468995
min       NaN  25.0       NaN   70000.000000
25%       NaN  28.8       NaN   77500.000000
50%       NaN  32.5       NaN   85000.000000
75%       NaN  36.2       NaN   97500.000000
max       NaN  40.0       NaN  120000.000000
Key Issues:
- String columns (Name and City) are summarized using overly basic metrics (e.g., top, freq).
- No insights into string lengths, patterns, or missing data proportions.
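String lengths, for instance, have to be computed by hand. The following is a minimal sketch using only Pandas; it reuses the Name and City values from the example above, and the variable names (lengths, length_summary) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# Character count of every value in each text column (spaces are counted)
lengths = df.apply(lambda col: col.str.len())

# Shortest, average, and longest string per column
length_summary = lengths.agg(["min", "mean", "max"])
print(length_summary)
```

This is the kind of extra step that describe() leaves to the user for every text column.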
No Information on Missing Data
Pandas' describe() does not explicitly show the percentage of missing data for each column. Identifying missing data requires separate commands:
print(df.isnull().sum())
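That command returns counts only. To see percentages as well, the check can be extended slightly. Here is a minimal sketch, assuming a small illustrative DataFrame whose Rating column has one missing value; the column labels missing_count and missing_pct are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "Rating": [4.5, None, 4.7, 4.8],  # one missing value
})

# isnull() gives a boolean mask; sum() counts the True values per column,
# and mean() gives their share, which is scaled to a percentage here
missing = pd.DataFrame({
    "missing_count": df.isnull().sum(),
    "missing_pct": df.isnull().mean() * 100,
})
print(missing)
```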
Lack of Advanced Metrics
The default metrics provided by describe() are basic. For numeric data, it shows:
- Count, mean, and standard deviation.
- Minimum, maximum, and quartiles (25%, 50%, and 75%).
However, it lacks advanced statistical details such as:
- Kurtosis and skewness: Indicators of the shape of the data distribution.
- Outlier detection: No indication of extreme values beyond typical ranges.
- Custom aggregations: Limited flexibility for applying user-defined functions.
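With plain Pandas, these metrics need separate calls. The sketch below is one possible way to do it, not the only one: it uses DataFrame.skew() and DataFrame.kurt() for the shape metrics and DataFrame.apply() for a user-defined aggregation. The helper value_range is a hypothetical function written for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "Salary": [70000, 80000, 120000, 90000],
})

# Sample skewness and excess kurtosis for each numeric column
shape = pd.DataFrame({"skew": df.skew(), "kurtosis": df.kurt()})
print(shape)


def value_range(col):
    """Spread between the largest and smallest value in a column."""
    return col.max() - col.min()


# A user-defined aggregation next to a built-in one
custom = pd.DataFrame({"median": df.median(), "range": df.apply(value_range)})
print(custom)
```

A positive skew for Salary reflects the single high value of 120000 pulling the mean above the median.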
Poor Visualization of Data
describe() outputs a plain text summary, which, while functional, is not visually engaging and can be hard to interpret in some cases. Visualizing trends or distributions requires additional libraries like Matplotlib or Seaborn.
Example: A histogram or boxplot would better represent distributions, but describe() does not provide such visual capabilities.
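To illustrate the gap without pulling in a plotting library, here is a rough text histogram built with NumPy, which Pandas already depends on. It is only a sketch under the assumption that five equal-width bins are adequate; for real plots, Matplotlib or Seaborn remain the usual choice.

```python
import numpy as np
import pandas as pd

salaries = pd.Series([70000, 80000, 120000, 90000], name="Salary")

# Split the value range into five equal-width bins and count values per bin
counts, edges = np.histogram(salaries, bins=5)

# Print one row per bin, drawing each count as a bar of '#' characters
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:>9,.0f} to {right:>9,.0f} | {'#' * int(count)}")
```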
Getting Began with Skimpy
Skimpy is a Python library designed to simplify and enhance exploratory data analysis (EDA). It provides detailed and concise summaries of your data, handling both numeric and non-numeric columns effectively. Unlike Pandas' describe(), Skimpy includes additional metrics, missing data insights, and a cleaner, more intuitive output. This makes it an excellent tool for quickly understanding datasets, identifying data quality issues, and preparing for deeper analysis.
Install Skimpy Using pip:
Run the following command in your terminal or command prompt:
pip install skimpy
Verify the Installation:
After installation, you can verify that Skimpy is installed correctly by importing it in a Python script or Jupyter Notebook:
from skimpy import skim
print("Skimpy installed successfully!")
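If you would rather check without triggering an ImportError, the standard library can report the installed version. This is a minimal sketch using importlib.metadata (Python 3.8+); the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


skimpy_version = installed_version("skimpy")
if skimpy_version is None:
    print("Skimpy is not installed; run: pip install skimpy")
else:
    print(f"Skimpy {skimpy_version} is installed.")
```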
Why Is Skimpy Better?
Let us now look in detail at the reasons why Skimpy is the better choice:
Unified Summary for All Data Types
Skimpy treats all data types with equal importance, providing rich summaries for both numeric and non-numeric columns in a single, unified table.
Example:
from skimpy import skim
import pandas as pd
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Salary": [70000, 80000, 120000, 90000],
}
df = pd.DataFrame(data)
skim(df)
Output:
Skimpy generates a concise, well-structured table with information such as:
- Numeric Data: Count, mean, median, standard deviation, minimum, maximum, and quartiles.
- Non-Numeric Data: Unique values, most frequent value (mode), missing values, and character count distributions.

Built-In Handling of Missing Data
Skimpy automatically highlights missing data in its summary, showing the percentage and count of missing values for each column. This eliminates the need for additional commands like df.isnull().sum().
Why This Matters:
- Helps users identify data quality issues upfront.
- Encourages quick decisions about imputation or removal of missing data.
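Once a gap is spotted, the two usual follow-ups are removal and imputation, both done in Pandas rather than in Skimpy. The sketch below shows each on an illustrative Rating column with one missing value; filling with the median is just one possible imputation strategy, chosen here as an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Rating": [4.5, None, 4.7, 4.8],
})

# Option 1: remove rows where Rating is missing
dropped = df.dropna(subset=["Rating"])

# Option 2: fill the gap with the median of the non-missing ratings
imputed = df.assign(Rating=df["Rating"].fillna(df["Rating"].median()))

print(dropped)
print(imputed)
```

Neither option modifies df itself, so both results can be compared before committing to one.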
Advanced Statistical Insights
Skimpy goes beyond basic descriptive statistics by including additional metrics that provide deeper insights:
- Kurtosis: Indicates the "tailedness" of a distribution.
- Skewness: Measures asymmetry in the data distribution.
- Outlier Flags: Highlights columns with potential outliers.
Rich Summary for Text Columns
For non-numeric data like strings, Skimpy delivers detailed summaries that Pandas describe() cannot match:
- String Length Distribution: Provides insights into minimum, maximum, and average string lengths.
- Patterns and Variations: Identifies common patterns in text data.
- Unique Values and Modes: Provides a clearer picture of text diversity.
Example Output for Text Columns:
Column | Unique Values | Most Frequent Value | Mode Count | Avg Length |
---|---|---|---|---|
Name | 4 | Alice | 1 | 5.00 |
City | 4 | New York | 1 | 8.25 |
Compact and Intuitive Visuals
Skimpy uses color-coded, tabular outputs that are easier to interpret, especially for large datasets. These visuals highlight:
- Missing values.
- Distributions.
- Summary statistics, all at a single glance.
This visual appeal makes Skimpy's summaries presentation-ready, which is particularly useful for reporting findings to stakeholders.
Built-In Support for Categorical Variables
Skimpy provides specific metrics for categorical data that Pandas' describe() does not, such as:
- Distribution of categories.
- Frequency and proportions for each category.
This makes Skimpy particularly valuable for datasets involving demographic, geographic, or other categorical variables.
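For comparison, getting category frequencies and proportions out of plain Pandas takes an explicit value_counts() call per column. A minimal sketch follows; the city values are hypothetical and deliberately repeated so the proportions differ.

```python
import pandas as pd

# Hypothetical categorical data with repeated values
cities = pd.Series(
    ["Chicago", "Houston", "Chicago", "New York", "Chicago", "Houston"],
    name="City",
)

# Absolute counts and relative shares for each category
freq = pd.DataFrame({
    "count": cities.value_counts(),
    "proportion": cities.value_counts(normalize=True),
})
print(freq)
```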
Using Skimpy for Data Summarization
Below, we explore how to use Skimpy effectively for data summarization.
Step 1: Import Skimpy and Prepare Your Dataset
To use Skimpy, you first need to import it alongside your dataset. Skimpy integrates seamlessly with Pandas DataFrames.
Example Dataset:
Let's work with a simple dataset containing numeric, categorical, and text data.
import pandas as pd
from skimpy import skim
# Sample dataset; "Rating" contains one missing value (None)
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Salary": [70000, 80000, 120000, 90000],
    "Rating": [4.5, None, 4.7, 4.8],
}
df = pd.DataFrame(data)
Step 2: Apply the skim() Function
The core function of Skimpy is skim(). When applied to a DataFrame, it provides a detailed summary of all columns.
Usage:
skim(df)

Step 3: Interpret Skimpy's Summary
Let's break down what Skimpy's output means:
Column | Data Type | Missing (%) | Mean | Median | Min | Max | Unique | Most Frequent Value | Mode Count |
---|---|---|---|---|---|---|---|---|---|
Name | Text | 0.0% | - | - | - | - | 4 | Alice | 1 |
Age | Numeric | 0.0% | 32.5 | 32.5 | 25 | 40 | - | - | - |
City | Text | 0.0% | - | - | - | - | 4 | New York | 1 |
Salary | Numeric | 0.0% | 90000 | 85000 | 70000 | 120000 | - | - | - |
Rating | Numeric | 25.0% | 4.67 | 4.7 | 4.5 | 4.8 | - | - | - |
- Missing Values: The "Rating" column has 25% missing values, indicating potential data quality issues.
- Numeric Columns: The mean for "Salary" (90000) sits above the median (85000), hinting at a mild right skew caused by the 120000 value, while "Age" is evenly distributed within its range (its mean and median are both 32.5).
- Text Columns: The "City" column has 4 unique values, each appearing once, so "New York" is listed as the most frequent value only because it is the first of the tied values.
Step 4: Focus on Key Insights
Skimpy is particularly useful for identifying:
- Data Quality Issues:
  - Missing values in columns like "Rating."
  - Outliers via metrics like min, max, and quartiles.
- Patterns in Categorical Data:
  - Most frequent categories in columns like "City."
- String Length Insights:
  - For text-heavy datasets, Skimpy provides average string lengths, helping in preprocessing tasks like tokenization.
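Reading outliers off the min, max, and quartiles usually means applying a rule of thumb. One common choice is Tukey's 1.5 x IQR rule, sketched below with plain Pandas. The helper iqr_outliers and the extra 250000 value are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


salary = pd.Series([70000, 80000, 120000, 90000], name="Salary")
print(iqr_outliers(salary))  # no value falls outside the bounds

# Hypothetical extra value to show a flagged outlier
with_extreme = pd.Series([70000, 80000, 120000, 90000, 250000], name="Salary")
print(iqr_outliers(with_extreme))
```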
Step 5: Customizing Skimpy Output
Skimpy allows some flexibility to adjust its output depending on your needs:
- Subset Columns: Analyze only specific columns by passing them as a subset of the DataFrame:
skim(df[["Age", "Salary"]])
- Focus on Missing Data: skim() prints its report rather than returning a DataFrame, so its result cannot be sliced with .loc. To isolate missing data percentages, compute them with Pandas directly:
print(df.isnull().mean() * 100)
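Column subsets can also be chosen by data type instead of by name. The following is a minimal sketch, assuming you want a report on numeric columns only; it uses Pandas' select_dtypes() and falls back to describe() when Skimpy is not installed (the fallback is an assumption of this example, not part of Skimpy).

```python
import pandas as pd

try:
    from skimpy import skim
except ImportError:  # Skimpy is optional for this sketch
    skim = None

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Salary": [70000, 80000, 120000, 90000],
    "Rating": [4.5, None, 4.7, 4.8],
})

# Keep only the numeric columns, whatever they are called
numeric_df = df.select_dtypes(include="number")

if skim is not None:
    skim(numeric_df)  # prints the Skimpy report for the subset
else:
    print(numeric_df.describe())
```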
Advantages of Using Skimpy
- All-in-One Summary: Skimpy consolidates numeric and non-numeric insights into a single table.
- Time-Saving: Eliminates the need to write multiple lines of code for exploring different data types.
- Improved Readability: Clean, visually appealing summaries make it easier to identify trends and outliers.
- Efficient for Large Datasets: Skimpy is optimized to handle datasets with numerous columns without overwhelming the user.
Conclusion
Skimpy simplifies data summarization by offering detailed, human-readable insights into datasets of all kinds. Unlike Pandas describe(), it does not restrict its focus to numeric data and provides a richer summary experience. Whether you're cleaning data, exploring trends, or preparing reports, Skimpy's features make it an indispensable tool for data professionals.
Key Takeaways
- Skimpy handles both numeric and non-numeric columns seamlessly.
- It provides additional insights, such as missing values and unique counts.
- The output format is more intuitive and visually appealing than Pandas describe().
Frequently Asked Questions
Q1. What is Skimpy?
A. It is a Python library designed for comprehensive data summarization, offering insights beyond Pandas describe().
Q2. Can Skimpy replace Pandas describe()?
A. Yes, it offers enhanced functionality and can effectively replace describe().
Q3. Does Skimpy support large datasets?
A. Yes, it is optimized for handling large datasets efficiently.
Q4. How do I install Skimpy?
A. Install it using pip: pip install skimpy.
Q5. What makes Skimpy better than describe()?
A. It summarizes all data types, includes missing value insights, and presents outputs in a more user-friendly format.