Import and Export Data, Modules, Experiments
We’ll explore how to import and export data in our new machine learning tool, Azure ML. The import data module in Azure ML can read in data from a variety of sources: HTTP, Azure SQL database, Hadoop Hive query, or Azure Storage Blobs. It can also convert the data to a variety of formats, and if the data is fairly large, you can export it to other parts of the Azure ecosystem using the export data module.
What You'll Learn
> How to create an experiment
> How to explore your experiment workspace
> The idea of modules and how they work within Azure ML
> How to import and export data
Hey, and welcome back to our data mining with Azure Machine Learning Studio series, brought to you by Data Science Dojo. OK.
So today we’re going to go ahead and create an experiment, our first experiment. We’re going to go ahead and explore the experiment workspace. We’re going to look at the idea of modules and how they work within Azure ML. And then we’re going to see how we can import and export data within Azure ML.
OK. So this is where you should be. You should be inside of your Azure ML Studio workspace. To create basically a new experiment, or a new anything, you can go to this new button on the bottom left hand corner of your screen. So click this new button.
So notice we can create a new and then blank experiment. But this is where we could create anything else that’s new. So new notebook, new project, new module, new dataset. So this is where we would import a local file, for example. All right. So let’s go ahead and create a blank experiment here.
Also, did you see that there were a bunch of these other templates that we could have copied as well? So if you needed any help on learning how to do something, like, I don’t know, preventative maintenance, you can go ahead and clone one of those things. I’m going to go ahead and close this box right up here. So you see this x button? I’m going to close it. So this is our experiment workspace.
And notice it gives us some hints as to how this works. And notice that it operates by some kind of module connecting other modules in some kind of workflow type fashion. And if you’ve ever used Visio before, it looks like a Visio that only flows from top to bottom.
So on the left will be your control panel. If you don’t see that, you can expand it left and right over here.
And let’s take a look at our saved data sets real quick. So if you bring in data, this is where it would be, under my data sets. I have a few data sets already. I think it’s some test data right here. But it also provides you with some sample data sets that come initialized with your workspace. So go ahead.
And I’m going to drag in the adult census income dataset. So notice that all I did was just dragged it in. So clicked, held onto it, and moved it into the workspace. And notice that this object is representing the entirety of the dataset. And notice that it has a node on the bottom here. So if I hover over it, it says data set. And the idea is the data that is inside of this is coming out of this node right here, so this bottom node. So nodes on the bottom of the modules always represent the thing that’s coming out of the module.
So if I right click on this– for those of you who are using Mac, I think you would just use double finger light tap on this node. I do recommend getting a mouse, though, to use Azure ML. It will help a lot. Azure ML is a lot of right clicking. So I can right click on this. And I can say, I can download a dataset. Or I can visualize the dataset. For the most part, you want to visualize the dataset. So notice I can right click and visualize. So this will give me a window that will show me the first 100 rows of this data. So this is just a real simple snapshot of the data. And I think the first 100 columns of the dataset, even though there’s not a hundred columns.
So this is a data set about the census and demographic information about various people. So each row represents a person. So notice I can just click on any column. I can get very light descriptive stats of the data in this column. There are seven unique values in this column. And then I can see that married is 46% of the data with 14,976 elements. And likewise with the rest of them. I can see what data type a column is right here by just looking at it. If it’s numeric, it will give me some other measurements of spread, as well. So for example, mean, median, min, max, standard deviation, and things like that. So I can close this window.
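As a rough illustration of the light descriptive stats described above, here is a minimal Python sketch using only the standard library. The function names and the tiny sample columns are made up for illustration, and the use of the sample standard deviation is an assumption; the transcript does not say how Azure ML computes these figures.

```python
import statistics
from collections import Counter


def summarize_categorical(values):
    """Count unique values and compute each value's share of the column."""
    counts = Counter(values)
    total = len(values)
    return {
        "unique": len(counts),
        "shares": {value: count / total for value, count in counts.most_common()},
    }


def summarize_numeric(values):
    """Basic measures of center and spread for a numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (assumption)
    }


# Tiny made-up columns, not the real adult census data.
marital_status = ["Married", "Never-married", "Married", "Divorced"]
age = [39, 50, 38, 53]

categorical_summary = summarize_categorical(marital_status)
numeric_summary = summarize_numeric(age)
print(categorical_summary)
print(numeric_summary)
```

The categorical summary mirrors what the visualize window shows for a text column (unique values and each value's share), while the numeric summary mirrors the extra measurements shown for numeric columns.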
So that is a data set that’s already inside of this workspace. All I did was dragged it in. So notice I can drag in other data sets as well.
So it also supports selecting. So notice I can just select multiple data sets. And then I can right click and hit this Delete button. Or I can select all, and then on my keyboard, I can hit Delete as well. And it will remove all of that at the same time. So I can go and drag that back in. Now, if I want to save this dataset, notice that there is a bunch of really cool things on the left hand side.
So these are all the tools, the transformation, the manipulation tools that you will apply to your data set.
So notice that if I go to Data Format Conversions here, I can convert this data object to a CSV file, a comma separated values file, which I can then feed into a SQL database, or open in Excel or Notepad or something like that. All I did was took this, held it down from the output node, and dragged it into the input node of this guy. So notice that the top node is always the input node, so data going in. And notice that there’s data going out of this as well. So it goes in.
Some kind of function or operation happens in here. So this operation just happens to be converting to a CSV. But notice that if I right click on this node, I can’t do anything yet. And that’s because it hasn’t run yet. So go down here. There’s this run button. So I will click on Run. And it’s going to go ahead and execute this module. So everything in this workspace, it’s going to execute it.
Notice that once I have this green check mark, it means the operation is complete for this module. And whatever is in this module, the results have been cached. And then I can right click and then consume the output. So I can then right click and say download on this dataset. So then this is how I export data from inside Azure ML if I want to download it to a local file.
Now, keep in mind, we’re using our cloud based tool. And sometimes we’re using cloud based tool because we have access to huge amounts of data. It might not be good to download those huge amounts of data to your local computer. So I can just open this up in my notepad. Or I can open up in Excel, whatever you wish. So that’s how I export data.
Notice that there’s also a TSV here. And I can rearrange my workspace. And then I can also connect the adult census income by clicking and holding on its output node and dragging the connector to the Convert to TSV. So the Convert to TSV input node, it turns green. It means that it’s a supported data type that it will accept. So if I hover over the input node of Convert to TSV, it tells you what data type it’s expecting to be passed into it. So if I hover over it, it says it wants a dataset data type. And if I hover over the output of the adult census income object, notice I get a dataset object. So that’s why it turned green, because it’s willing to accept that as an input parameter.
So now, notice that this has a checkmark and this does not, which means this has no results in it. It has not been executed. It is a blank set of instructions right now. It’s like a blueprint. Nothing has been done yet. But this side has been done. So I can hit this Run button again. And then it’s going to go ahead and refresh my workspace.
And now I have one side converted to TSV and one side converted to CSV. They are now both done. So I can also export my data.
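To make the difference between the two conversions concrete, here is a small Python sketch that writes the same rows once with commas and once with tabs. This is not how Azure ML implements Convert to CSV or Convert to TSV; the function name, the column names, and the two sample rows are assumptions for illustration only.

```python
import csv
import io


def rows_to_delimited(header, rows, delimiter):
    """Serialize a header and rows into delimited text (CSV or TSV)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up column names and rows in the style of the adult census data.
header = ["age", "education"]
rows = [[39, "Bachelors"], [50, "HS-grad"]]

csv_text = rows_to_delimited(header, rows, ",")   # comma separated values
tsv_text = rows_to_delimited(header, rows, "\t")  # tab separated values
print(csv_text)
print(tsv_text)
```

The only difference between the two outputs is the delimiter, which is why either file opens fine in Excel or a text editor.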
In the top left hand corner, there are lots and lots of tools inside of Azure ML. I can manipulate data by adding columns, rows, all that good stuff. I can do feature selection. All the machine learning models are in here. So really, everything you could ever need for machine learning is on the left hand toolbar. So there are too many things to actually look through yourself. So you can search for them instead. So if I want to export data, that is the export data module.
So notice how I typed in export up here. I select Export Data and I drag it right into the workspace. And I can connect the export data. And this is just good style, by the way: I like to make sure that things that are executed in the same step sit on the same level. They’re just diverging now. So I don’t want it to look like this. That’s a little bit of bad style. Because if you’re sharing this workspace, or if you come back to this workspace after not having worked on it for some time, it might seem that it’s executing these in order. But it’s actually not. It’s actually executing all of these at the same time.
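The point about levels can be sketched in Python: execution order follows the connections between modules, not where they sit on the canvas. The function below groups modules into levels where nothing in a level depends on anything else in that level. It is only an illustration of the idea; the transcript does not describe Azure ML's actual scheduler, and the function name and edge list are assumptions.

```python
from collections import defaultdict


def execution_levels(edges):
    """Group modules into levels; modules in one level do not depend on each other."""
    downstream = defaultdict(list)
    pending = defaultdict(int)  # number of upstream modules not yet finished
    modules = set()
    for source, target in edges:
        downstream[source].append(target)
        pending[target] += 1
        modules.update((source, target))

    # Start with modules that have no inputs, such as a saved dataset.
    level = sorted(m for m in modules if pending[m] == 0)
    levels = []
    while level:
        levels.append(level)
        next_level = []
        for module in level:
            for target in downstream[module]:
                pending[target] -= 1
                if pending[target] == 0:
                    next_level.append(target)
        level = sorted(next_level)
    return levels


# Connections as drawn in the experiment: one dataset feeding three modules.
edges = [
    ("Adult Census Income", "Convert to CSV"),
    ("Adult Census Income", "Convert to TSV"),
    ("Adult Census Income", "Export Data"),
]

levels = execution_levels(edges)
print(levels)
```

All three modules land in the same level because they only depend on the dataset, which matches the advice to draw them side by side.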
So if I click on this export data module, notice that it says value required. That always means that there are some kind of levers and knobs that I can go ahead and tweak. So once you click on it, notice that it’s highlighted in blue, and I can see on the right hand side a window pops up. This is the Properties window.
These are the parameters I can tweak for this export data module. And that’s true of every module that has some kind of parameter. So basically, what it’s asking me is where do I want to export my data. And I can export my data to quite a few places. I can export my data to a Hive query. Basically, I can insert it into a Hadoop table. I could do an Azure SQL database. I can do an Azure table. Or I can do an Azure blob storage. I can also delete this, as well. I can also delete the connectors. So if I click on this connection, I can right click. And I just delete it. Same thing with this.
So let’s talk about how do I also import data. So in this dataset over here, the adult census income data, it’s actually from the UCI repository. So we’re going to go ahead and open up Google and type in UCI adult census data. And that’s going to bring us to the original data set.
So if you click on this link right here, it should be coming from archive.ics.uci.edu. I think it stands for University of California, Irvine. And then if you scroll to the very top, there is a button that says data folder. So this is the page that’s hosting the information about the data. But the actual data itself is inside of this folder. And we’re going to see if we can read in this file right here. So data, adult.data. So this is the raw file itself. Notice it’s very messy right here. So the idea is if I right click and copy URL, I can bring in a module called the import data module and read it directly in from the URL.
So the import data module, if I click on it, notice that it says value required. I need to specify where I want this data to come in from. So I can launch what’s called this wizard right here, or I can select it manually. So the wizard basically pops up a window and asks you, what do you want to choose? But I prefer this method over here, which is I just select it myself. I want to read in the data from an HTTP URL. I want to paste it in. And now it’s going to ask me what format it is. Is it a CSV or a TSV? So basically, I notice that there are commas here. So that means it is comma separated values. So it’s a CSV. And does it have headers? No, it does not. So notice this checkbox: if I did have a header, I would check this right here. And then this, use cached results. What this means is, if you leave it unchecked, every time you hit the Run button, it’s going to try to access this URL and try to reread the file in.
But you and I know that this is a static and not changing data file. So if we know that, we will just say use cached results. So read it in once, and never read it again, which is really nice. But if this is dynamic data, like sales data, or a stock portfolio data, data that’s changing every day, every second, you don’t want to use this.
The idea is you want the workspace to be dynamic and fetch new data every time and refit it into your current and existing data mining or data science pipelines, which is really, really useful, especially if you’re doing time series data. So we’re going to go ahead and paste our URL here, and with the following specifications. And we’re going to hit the Run button. This is going to go ahead and contact this URL right here and try to read in this data set for us. Now this might take a little bit of time. It has to also parse the data for us. It’s done.
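The trade-off behind use cached results can be sketched in a few lines of Python. The class below is a hypothetical stand-in, not Azure ML's implementation: `CachedImport`, `fake_fetch`, and the sample line are all made up, and a counter replaces the real network call so the behavior is easy to see.

```python
class CachedImport:
    """Hypothetical import step with a 'use cached results' switch."""

    def __init__(self, fetch, use_cached_results):
        self._fetch = fetch  # callable that reads the data from its source
        self._use_cached_results = use_cached_results
        self._cache = None
        self._has_cache = False

    def run(self):
        # Reuse the earlier result only when caching is on and a result exists.
        if self._use_cached_results and self._has_cache:
            return self._cache
        data = self._fetch()
        self._cache = data
        self._has_cache = True
        return data


calls = []


def fake_fetch():
    """Stand-in for reading a URL; records each time the source is contacted."""
    calls.append(1)
    return "39, State-gov, <=50K"


# Static file: read it once, then reuse the cached copy on later runs.
static_source = CachedImport(fake_fetch, use_cached_results=True)
static_source.run()
static_source.run()
static_fetches = len(calls)

# Changing data (sales, stock prices): contact the source on every run.
calls.clear()
dynamic_source = CachedImport(fake_fetch, use_cached_results=False)
dynamic_source.run()
dynamic_source.run()
dynamic_fetches = len(calls)

print(static_fetches, dynamic_fetches)
```

With caching on, the second run never touches the source, which is what you want for a file that never changes; with caching off, each run picks up whatever the source holds at that moment.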
So I can go ahead and right click now on the output of this. And I can visualize it. And it went ahead and read in all the data for us all nice and dandy like. And notice that since we didn’t have headers, it automatically named these col 1 through col 15. I’ll show you how to rename columns in a separate video. So don’t worry about that.
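For comparison, reading a headerless, comma separated file and giving it placeholder column names can be sketched in Python as follows. The function name is hypothetical, the two sample lines are written in the style of adult.data rather than fetched from the repository, and the exact spelling `Col1` is an assumption. To read from a real URL you could open it with `urllib.request.urlopen` and decode the bytes into text before passing it in.

```python
import csv
import io


def read_headerless_csv(text_stream):
    """Read comma separated rows with no header and generate Col1..ColN names."""
    # skipinitialspace handles the space that follows each comma in the file.
    reader = csv.reader(text_stream, skipinitialspace=True)
    rows = [row for row in reader if row]  # drop blank lines
    if not rows:
        return [], []
    header = [f"Col{i}" for i in range(1, len(rows[0]) + 1)]
    return header, rows


# Two sample lines in the style of adult.data (15 fields, no header row).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

header, rows = read_headerless_csv(sample)
print(header)
print(len(rows))
```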
So that is how you read in data from the internet. The import data module also supports Hive queries, which means you can read from a Hadoop cluster, specifically I think HDInsight Hadoop clusters, if they have a Hive table that’s already existing. You can connect to an Azure SQL database. You can connect to a DocumentDB, which is a NoSQL database, or Blob storage.
Now, if you want to read in from a local file, so let’s simulate a local file now. So I want to go into this data set, for example. Right click and say just save as. Just save it to your local file. So I’m going to save it to my desktop. Go ahead save it wherever you want. So notice it says dot txt. I don’t want that. I actually want it to be read as a CSV later. I’m going to save it as a CSV. I’m saving it as a CSV because there’s commas here. If it’s separated by tabs, then it’s a TSV. So it’s going to save it as a flat file for me. It’s going to save it to my desktop.
So if I want to read in the local file, I want to go back. And see this new button on the bottom left hand corner of my screen? I can click this new button. And when I hit this new button, I can go ahead and say New data set from local file. And then I can go ahead and choose from my desktop that file that I just saved. So adult.data.csv. This is where I get to name the data set. So I can call this whatever I want. So I’ll call this true adult census data.
And if I want to overwrite an existing file, I can check this box right here. But this is a brand new data file, so I don’t want to do that. And this is where I select the type of file. So if it is an Excel spreadsheet, I’ll go ahead and save it as an Excel spreadsheet, so .xls. But, when I read it in, I can read it in as a CSV. And it’ll parse the Excel spreadsheet just fine.
And notice I can read in R data files, which is also really cool. So if you know R, that’s also nice. So I’m going to save this as a CSV. And I can provide a description here. So if I go back to the UCI repository, I can basically just copy this abstract. So predict whether income exceeds 50K based upon census data. So I can copy that and bring that in here. I’ll just also cite the source, from UCI repository. I’ll go ahead and save it. And now it’s going to go ahead and load that data in for me.
Once the data is loaded in, it’s going to appear under the saved data sets, specifically under my data sets. It looks like it finished uploading. So right now, the data will appear under My Datasets. So you don’t have to refresh anything. So you just click on My Datasets under Saved Datasets. And you should see the data file that I just brought in, which I named True Adult Census Data. So if I drag that in now, I can right click, and I can visualize it. And notice that it read in the dataset. But I think I also read it in wrong, because I didn’t say this file doesn’t have headers. So I should have said that. But it’s fine. You guys won’t make that same mistake that I did.
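The header mistake mentioned above is easy to reproduce outside the tool. In this Python sketch, the same headerless text is parsed twice: once as if its first line were a header, and once correctly. The function name and the shortened three-column sample are assumptions for illustration.

```python
import csv
import io

# Shortened, made-up sample with no header row.
SAMPLE = "39, State-gov, <=50K\n50, Self-emp-not-inc, <=50K\n"


def read_rows(text, has_header):
    """Parse CSV text, either trusting the first line as a header or not."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    if not rows:
        return [], []
    if has_header:
        # The first line becomes the column names and is lost as data.
        return rows[0], rows[1:]
    names = [f"Col{i}" for i in range(1, len(rows[0]) + 1)]
    return names, rows


wrong_names, wrong_rows = read_rows(SAMPLE, has_header=True)
right_names, right_rows = read_rows(SAMPLE, has_header=False)

print(wrong_names, len(wrong_rows))
print(right_names, len(right_rows))
```

Treating the file as having a header silently turns the first person's record into column names and drops one row from the data, so it is worth checking the column names in the visualize window after every upload.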
And then I think that’s all the time we have today. And that’s how you read datasets into Azure ML. That’s how modules work. And that’s how you export data out of Azure ML.
Join us next time, when we’ll start our real data mining project from start to finish.
If you liked that video and you want to see more videos like this in the future, go ahead and like and subscribe. And I will look forward to seeing you at our boot camp.
Phuc H Duong - Phuc holds a Bachelors degree in Business with a focus on Information Systems and Accounting from the University of Washington.
© Copyright – Data Science Dojo