After a hiatus from the blog, I’m back with a new post. While I’ve still been using Python and pandas, I wanted to explore some new technologies. In this post, I’ll delve into polars. This article will cover some basic polars concepts, pointing out both its strengths and its differences from pandas. While I’m not ditching pandas completely, I have found potential in polars for improving performance and capabilities in specific scenarios. Join me on this exploration of alternative tools and frameworks; perhaps polars will find a place in your toolkit too.
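To give a quick flavor of the difference in style, here is a minimal sketch (with made-up data) of a simple aggregation using polars’ expression API; note that `group_by` is spelled `groupby` in older polars releases:

```python
import polars as pl

# Made-up data for illustration
df = pl.DataFrame({"state": ["TX", "CA", "TX"], "sales": [100, 200, 300]})

# polars builds queries from column expressions rather than
# pandas-style index operations
result = df.group_by("state").agg(pl.col("sales").sum().alias("total_sales"))
print(result)
```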
One of the reasons I like using pandas instead of Excel for data analysis is that it is
easier to avoid certain types of copy-paste Excel errors. As great as pandas is, there
is still plenty of opportunity to make errors with pandas code. This article discusses a subtle issue with pandas groupby code that can lead to big errors if you’re not careful. I’m writing this because I have run into this issue in the past, but it still bit me just recently. I hope this article can help a few of you avoid this mistake.
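The specific issue from the article isn’t reproduced here, but as one hedged illustration of how a groupby can silently go wrong (made-up data): by default, pandas drops any row whose group key is missing, so an aggregate can quietly exclude data.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "region": ["East", "West", np.nan, "East"],
    "sales": [100, 200, 300, 400],
})

# Rows with a NaN group key are silently dropped: East=500, West=200,
# and the 300 in sales disappears from the totals
print(df.groupby("region")["sales"].sum())

# dropna=False (pandas >= 1.1) keeps the NaN group visible
print(df.groupby("region", dropna=False)["sales"].sum())
```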
When doing analysis with Jupyter Notebooks, you will frequently find yourself
generating ad-hoc Excel reports to distribute to your end-users. Over time, you might end up with dozens (or hundreds) of notebooks, and it can be challenging to
remember which notebook generated which Excel report. I have started using Excel
document properties to track which notebooks generate specific Excel files. Now,
when a user asks for a refresh of a 6 month old report, I can easily find the notebook
file and re-run the analysis. This simple process can save a lot of frustration for your
future self. This brief article will walk through how to set these properties and give some shortcuts for using VS Code to simplify the process.
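As a minimal sketch of the idea (the file and notebook names are hypothetical), openpyxl exposes the workbook’s document properties directly:

```python
from openpyxl import load_workbook

# Hypothetical report generated earlier by a notebook
wb = load_workbook("sales_report.xlsx")

# Record the generating notebook in the standard document properties
wb.properties.title = "Monthly Sales Report"
wb.properties.description = "Generated by notebooks/sales_analysis.ipynb"

wb.save("sales_report.xlsx")
```

In Excel, these values appear under File -> Info, so anyone holding the report can trace it back to the source notebook.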
Visual Studio Code is one of the most popular text editors with a track record of
continual improvements. One area where VS Code has recently been innovating is its Jupyter Notebook support. Early releases sought to replicate existing Jupyter Notebook features inside VS Code. Recent releases have continued to develop
notebook features that provide an experience that in many cases is better than the
traditional Jupyter Notebook experience.
I am a big fan of using Jupyter Notebooks for python analysis - even though there are limitations.
For the type of ad hoc analysis I do, the notebook combination of code and visualizations is superior to working with ad hoc Excel files. That being said, there are times when I wish
I had a more full-featured editor for my notebook code.
In this article I will cover 16 reasons why you should consider using VS Code as your editor
of choice when working with python in Jupyter Notebooks. The reasons are in no particular order, but I think number 11 is one of my favorites.
It’s no secret that data cleaning is a large portion of the data analysis process. When
using pandas, there are multiple techniques for cleaning text fields to prepare for
further analysis. As data sets grow large, it is important to find efficient methods that
perform in a reasonable time and are maintainable since text cleaning is a process that
evolves over time.
This article will show examples of cleaning text fields in a large data file and illustrate tips for efficiently cleaning unstructured text fields.
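As a small preview of the style involved (column name and data are made up), pandas’ vectorized string methods can be chained so each cleaning step stays readable and easy to modify later:

```python
import pandas as pd

# Made-up messy company names for illustration
df = pd.DataFrame({"company": ["  Acme, Inc. ", "ACME INC", "acme incorporated  "]})

# Chain vectorized .str methods into one maintainable pipeline
df["company_clean"] = (
    df["company"]
    .str.strip()
    .str.lower()
    .str.replace(r"[.,]", "", regex=True)
    .str.replace(r"\bincorporated\b", "inc", regex=True)
)

print(df["company_clean"].unique())  # all three collapse to 'acme inc'
```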
I enjoy hearing from readers who have used concepts from this blog to solve their own problems.
It always amazes me when I see examples where only a few lines of python code can solve
a real business problem and save organizations a lot of time and money. I am also impressed
when people figure out how to do this with no formal training - just with some hard work and
willingness to persevere through the learning curve.
I have talked quite a bit about how pandas is a great alternative to Excel for many tasks.
One of Excel’s benefits is that it offers an intuitive and powerful graphical interface for
viewing your data. In contrast, pandas + a Jupyter notebook offers a lot of programmatic power but limited ability to graphically display and manipulate a DataFrame view.
There are several tools in the Python ecosystem that are designed to fill this gap. They range
in complexity from simple JavaScript libraries to complex, full-featured data analysis engines.
The one common denominator is that they all provide a way to view and selectively filter
your data in a graphical format. From this point of commonality they diverge quite a bit in
design and functionality.
This article will review several of these options in order to give you an idea of the landscape
and evaluate which ones might be useful for your analysis process.
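To give a sense of how lightweight these tools can be (D-Tale is used here as an assumption; the article surveys several options), most follow the same pattern of handing over a DataFrame and getting an interactive grid back:

```python
import pandas as pd
import dtale  # one of several DataFrame viewers; an assumption here

df = pd.DataFrame({"state": ["TX", "CA", "NY"], "sales": [100, 200, 300]})

# Opens a browser-based grid for sorting, filtering, and charting the data
dtale.show(df)
```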
One of the most basic analysis functions is grouping and aggregating data. In some cases,
this level of analysis may be sufficient to answer business questions. In other instances,
this activity might be the first step in a more complex data science analysis. In pandas,
the groupby function can be combined with one or more aggregation
functions to quickly and easily summarize data. The concept is deceptively simple, and most new pandas users will understand it. However, they might be surprised at how useful complex aggregation functions can be for supporting sophisticated analysis.
This article will quickly summarize the basic pandas aggregation functions and show examples
of more complex custom aggregations. Whether you are a new or more experienced pandas user,
I think you will learn a few things from this article.
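As a brief sketch with made-up data, pandas’ named aggregation syntax shows how built-in and custom functions can be mixed in a single groupby:

```python
import pandas as pd

# Made-up sales data for illustration
df = pd.DataFrame({
    "state": ["TX", "TX", "CA", "CA"],
    "price": [100, 250, 300, 150],
})

# Named aggregation (pandas >= 0.25) mixes built-ins with a custom lambda
summary = df.groupby("state").agg(
    total=("price", "sum"),
    average=("price", "mean"),
    spread=("price", lambda s: s.max() - s.min()),  # custom aggregation
)
print(summary)
```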
With pandas it is easy to read Excel files and convert the data into a DataFrame.
Unfortunately, Excel files in the real world are often poorly constructed. In those
cases where the data is scattered across the worksheet, you may need to customize the way
you read the data. This article will discuss how to use pandas and openpyxl to read these types
of Excel files and cleanly convert the data to a DataFrame suitable for further analysis.
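As a minimal sketch (the file name and layout are hypothetical), pandas can already compensate for many of these quirks at read time, and openpyxl can walk arbitrary cell ranges when the layout is too irregular for read_excel:

```python
import pandas as pd

# Hypothetical workbook where the table does not start at cell A1
df = pd.read_excel(
    "messy_report.xlsx",
    sheet_name="Data",
    skiprows=3,      # skip title and logo rows above the real header
    usecols="B:F",   # read only the columns that hold the table
)
```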
The main purpose of this blog is to show people how to use Python to solve real world problems.
Over the years, I have been fortunate enough to hear from readers about how they have used tips
and tricks from this site to solve their own problems. In this post, I am extremely delighted to present
a real world case study. I hope it will give you some ideas about how you can apply these
concepts to your own problems.
This example comes from Michael Biermann from Germany. He had the challenging task of gathering detailed historical weather data in order to analyze the relationship between air temperature and power consumption. This article will show how he used a pipeline of Python
programs to automate the process of collecting, cleaning and processing gigabytes of weather
data in order to perform his analysis.