From the course: Excel with Copilot: AI-Driven Data Analysis

Data profiling

- [Instructor] Data profiling is similar to a chef examining ingredients to ensure quality, a crucial step for preparing delicious dishes. This process ensures that your data set is primed for analysis, with Copilot acting as a skilled assistant, helping with each step. To try data profiling and Copilot for yourself, head to 02_02 data profiling, which is a slightly adapted version of the famous Palmer Penguins data set. Let's launch our app skills here, and we'll begin with the fundamentals. We might query Copilot here about the number of rows found in this data set. (keyboard keys tapping) Now we even have a formula that we could use, and that's pretty neat. This will dynamically change if the number of rows changes in our data set. So I'm good with that. And we'll continue here. Let's check for the possibility of any outliers in the body mass G column. (keyboard keys tapping) So there are quite a few ways to look for outliers. We'll leave it to Copilot to see how it wants to navigate these. Now here's a place where Copilot is using Python code to work on these results. You may or may not be comfortable with that. I'm finding that Copilot in Excel likes to use Python more and more, especially for things like this that get a little more statistical. If you're not comfortable with Python, just say that, right? "Don't use Python," and we'll see what we get here. But in the meantime, we'll notice that one of these rows, and it looks like it's row 30 possibly, does have a body mass of 10,000 grams. That seems to be unrealistically high, and I did plant this row in there intentionally. So we'll see how Copilot without Python will do this. It is going to give us some formulas and functions. We could ask for a visualization. So this is a place where you could continue querying Copilot for alternatives and getting help with how to do this. There are lots of ways, and you need to be the judge of what is the most effective way to spot the outlier.
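As one illustration of the kind of statistical check involved, here is a minimal Python sketch of the common interquartile range (IQR) rule for flagging outliers. It is not the code Copilot generated in the video. The sample values, the function name iqr_outliers, and the 1.5 multiplier are assumptions chosen for the example; only the planted 10,000 gram value comes from the walkthrough.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (position, value) pairs that fall outside the IQR fences.

    k=1.5 is the conventional multiplier; it is a choice, not a rule
    taken from the course.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Hypothetical body mass values in grams, with one planted extreme value.
body_mass_g = [3750, 3800, 3250, 3450, 3650, 3625,
               4675, 3475, 4250, 10000, 3300, 3700]

flagged = iqr_outliers(body_mass_g)
print(flagged)
```

The rule only flags candidates. Whether a flagged value is an error or a genuine extreme is still a judgment call for the analyst.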
And then, once you spot any outliers, you need to decide what to do with them. And we'll look a little more at sorting and filtering in a bit here. But for now, let's move on to make sense of our categorical variables. So for example, there's an island column here. I just want to double check how many distinct islands are actually represented in this data set. So let's find that out. (keyboard keys tapping) And I'm going to preemptively say, "Don't use Python" here. 'Cause now that we're moving more into data visualization, a lot of these results tend to use Python code. And we actually get a formula back that we can run ourselves here. Okay, and there are the three distinct islands. If we wanted to ask for frequency counts, for example, of this column, we could ask for that, right? What are the frequency counts? So not only what are the distinct islands here, but how many rows do we have for each island? We could run that. And again, this is Python, so we might want to specify, "Don't use Python." And we will get some ideas for how to do this. This is saying to put COUNTIF next to this. I'm not crazy about that idea. Maybe a pivot table or something would be better here. Python's an alternative as well. Let's try one more; we'll get into data visualization here. I'm going to say, "Visualize the distribution of the body mass G column. Don't use Python." So we're going to get more into the quantitative variables, and looking at their distribution, versus something more categorical. So again, our goal here is understanding what's in the data: whether there are any anomalies, any inconsistencies, things that we should know about this data before we really move into a more complete analysis. And we do have a nice distribution visualization here. I can add this into our worksheet, and that is ready for us to do what we will with it, analyze, present, and so forth.
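For readers who do want the Python route, the two island checks described above (how many distinct values, and how many rows per value) can be sketched in a few lines. This is a minimal illustration with made-up sample rows, not the output Copilot produced in the video; the variable names are assumptions.

```python
from collections import Counter

# Hypothetical sample of the island column; the real workbook has many more rows.
island = ["Torgersen", "Biscoe", "Dream", "Biscoe",
          "Dream", "Biscoe", "Torgersen", "Dream"]

# Frequency counts: how many rows carry each island value.
counts = Counter(island)

# Distinct values: the keys of the counter, sorted for stable display.
distinct = sorted(counts)

print(len(distinct), distinct)
for name, n in counts.most_common():
    print(name, n)
```

A single pass gives both answers, which is roughly what a pivot table on the column would show.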
So in the hands of those versed in these features, Copilot transforms Excel into a more potent and intuitive tool for data profiling, allowing you to unveil the full narrative of your data.
