From the course: Microsoft Azure AI Fundamentals (AI-900) Cert Prep by Microsoft Press

Identify features and labels in a dataset for machine learning - Azure AI Services Tutorial

Identify features and labels in a dataset for machine learning

- [Instructor] Well, hopefully you're coming into this lesson from the previous one, where we introduced machine learning and the Azure Machine Learning service. I have some more important vocab for you. In the source data that you're using to train your ML models, features are your input properties. For example, in the demo in the previous lesson, remember our bank marketing data? The features would be first name, last name, address, all that kind of stuff. The label is the column that you're trying to predict. In that bank marketing data, it was a column that just had Y for yes or N for no. It was the customer's answer to the question, "Do you want to opt in?" That's called the label.

Features are really your independent variables. If it were real estate data, they'd be square footage, number of bedrooms. You're going to want to strip out missing data, and you don't want to skew or bias the data in any number of ways. What I'm trying to say in a compact way, friend, is that data engineering is a job role unto itself. It really is. And if you're into that kind of thing, I would encourage you to pursue it.

The target variable, as I said, is the label. This is the value that the AI predicts or classifies. Now, would a house sale price be a regression or a classification task? It'd be regression, wouldn't it? Because the house sale price would be a number in your local currency, presumably.

Part of data engineering is wrangling all the different data formats. CSV, or comma-separated values, is about as simple as you can get. But Apache has Parquet, and there are binary formats that are optimized for storing data, whether structured, semi-structured, or unstructured. The quality of that data is going to affect the behavior of your ML. Adventure Works, for example, uses features like workout frequency and diet to predict fitness progress, right? So think about data engineering, choosing the best features and cleansing the data, in order to provide the ML algorithms with the cleanest data. And again, it's an iterative, ongoing process.
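To make the vocabulary concrete, here is a minimal sketch of the ideas from this lesson, using only the Python standard library. It is not taken from the course demo: the housing data, the column names (`square_feet`, `bedrooms`, `sale_price`), and the helper functions are all hypothetical, made up for illustration. It reads a small CSV, strips out rows with missing data, separates the features from the label, and makes a rough guess at whether the task looks like regression or classification.

```python
import csv
import io

# Hypothetical housing data; column names and values are invented for this sketch.
RAW_CSV = """square_feet,bedrooms,sale_price
1400,3,250000
2000,4,340000
,2,180000
1750,3,
"""

LABEL_COLUMN = "sale_price"  # the target variable a model would predict


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_incomplete(rows):
    """Keep only rows where every column has a non-empty value."""
    return [
        row for row in rows
        if all((value or "").strip() for value in row.values())
    ]


def split_features_and_label(rows, label_column):
    """Separate the input columns (features) from the target column (label)."""
    features = [
        {name: float(value) for name, value in row.items() if name != label_column}
        for row in rows
    ]
    labels = [row[label_column] for row in rows]
    return features, labels


def guess_task(labels):
    """Crude heuristic: all-numeric labels suggest regression, otherwise classification."""
    try:
        for label in labels:
            float(label)
        return "regression"
    except ValueError:
        return "classification"


rows = drop_incomplete(load_rows(RAW_CSV))
features, labels = split_features_and_label(rows, LABEL_COLUMN)

print(features)            # the independent variables for each remaining row
print(labels)              # the value to predict for each remaining row
print(guess_task(labels))  # task type suggested by the label values
```

The task-type check is deliberately simplistic. A label stored as numeric codes (for example, 0 and 1 for no and yes) would still be a classification problem, so in practice the decision comes from what the column means, not just how it is stored.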
