
Windows Azure Internals

[TechEd 2013 - Windows Azure Internals - Mark Russinovich, Technical Fellow, Windows Azure] [Mark R.] Good afternoon, everybody. >>[audience] Good afternoon. [Mark R.] That's pretty good. I know it's been a long day and a long week. But I think we can do a little better than that. Good afternoon, everybody. [audience - louder] Good afternoon. [Mark R.] Much better. Welcome to Windows Azure Internals. My name is Mark Russinovich. I'm an architect in Windows Azure. I've been in Windows Azure for about the last 3 years, and for the next hour and 15 minutes I'm going to take you on a tour underneath the hood of Windows Azure to look at how the physical infrastructure that we create and run your VMs and storage on top of is organized as well as the logical infrastructure, the compute platform that we've got underneath that runs the datacenter and deploys virtual machines. Just so I get an idea for the level of familiarity people have with Windows Azure, let's see a raise of hands for how many people have deployed an application onto Windows Azure, either virtual machine or cloud service or Platform as a Service app. So most of you have. How many people have never done that? How many people didn't even know that Windows Azure existed until TechEd? [laughter] Actually, that's not a— A few years ago that was a real question. Today it seems like most people know about Windows Azure, which is a testament to how far we've come in the last few years. But let's go ahead and get started. I wanted to start by giving you the agenda for what I'm going to cover. I've broken it up into a few different areas. And I'm going to start, like I said, with the physical architecture of the datacenter and the logical architecture; then move on to how we deploy services, what happens underneath the hood when you push a package up to the portal or from Visual Studio or deploy a VM in the cloud; then how we update a service—so this applies to Platform as a Service— what steps does the platform go through; what happens when we roll out a new release of the hypervisor through the datacenter, how we orchestrate that. And then finally I'll spend the concluding part of the talk about disks, local disks on the servers for Platform as a Service as well as the Infrastructure as a Service persistent disks and the different performance characteristics, size characteristics, and how they work underneath the hood. This session is a 400 level session, so actually, it is assumed that you have some knowledge of Windows Azure. I'm not going to be spending a lot of time on basics. I'll talk about fault domains and update domains, very briefly give you an idea of what those things are, but I'm kind of assuming that you have some familiarity with the platform, that you know what a hypervisor is. How many people know what a hypervisor is? Okay. That's a good check. And you know what a virtual machine is, and that's basically 400 level, I think, there. [laughter] So let's start with the datacenter architecture. [Windows Azure Datacenter Architecture] I'll start again with just the build-outs that we've got going on. Right now we've got 8 online regions across the world. You can see 4 in the US, 2 in Europe, 2 in Southeast Asia. And this already constitutes on the orders of hundreds of thousands of servers. We're currently building out—and this is what we've announced, so there's build-outs that we haven't announced yet too, ones that we've got spec'd—land purchases, buildings that we're working on. 
But you can see we've got 6 more that we have just announced, and those are Australia, China, which Steve Ballmer went over there and opened the China Azure facility a few weeks ago, Japan. And you can see that for every one of these we've got 2 paired datacenters in each geo-political region. That's a deliberate principle of the platform— that for every datacenter there is a pairwise datacenter that is matched with it for asynchronous data replication in case of a datacenter disaster. So you can see the North-South US, East-West US, and then everywhere else. It's pretty obvious. As far as what the datacenters look like, there's a great video that Global Foundation Services, the branch of Microsoft that is responsible for organizing, overseeing the physical building of the buildings, the security for the buildings, going and putting racks in the buildings, they've got this 10-minute video. I'm not going to— Well, actually, why don't we stop and watch it. No, I'm just kidding. You should go watch it on your own time. I've pulled some screenshots from it just to show you. That's the outside of one of our datacenters in Quincy, Washington. There's a bunch of racks. This is a new colo going in there into Quincy, Washington. Azure right now, you didn't see a datacenter bubble in Quincy, Washington, but that's where Azure originally started, and Washington tax law forced us to not sell Azure there. That's why we went elsewhere. We've got stage deployments there, test deployments in Quincy, and I'll be actually showing you some information about that. Then we've got power redundancy in every datacenter. We've of course got APCs on all of the servers. We've got batteries backing up those servers and we also have generators. And so the generators are like the final straw when power hits the fan. We've got generators that can literally run— we've got enough fuel there to literally run them for about a week without getting power back on. And then we've got datacenter security, of course. If you go visit a Microsoft datacenter, there's a guardhouse, there's concrete barriers, there's security checks, there's actually an airlock as you go into the datacenter, which is pretty common, I guess. They weigh you actually going into the datacenter and weigh you coming back out to make sure that you're not lifting something out. [laughter] Actually, this is from the— If you came to my session yesterday, I actually was wondering about this datacenter right here that's pictured on the front page of WindowsAzure.com. I'm like, "That datacenter looks really familiar," and that is that datacenter right there. It's the Death Star Datacenter. Actually, I don't think that we've got a cylindrical datacenter. It would be kind of cool, though. You'd put a rotating restaurant in there. [laughter] Let's talk about the physical network now. I'm going to start with a historical walk through the way that we've developed our physical networking infrastructure because there are some artifacts that are visible in the software infrastructure that are based on the original constraints of the physical infrastructure. We've grown literally orders of magnitude—obviously many, many— since Azure started, and we've had to adapt the network to the ever-increasing scale. One of the problems with the original datacenters that we had, they weren't designed for east-west traffic. 
They were inherited from Microsoft's other properties, like Bing and Hotmail, that are very north-south oriented— traffic coming in from the Web, answers being computed locally on the servers, and then the answers going back out across the Web. When we have Azure applications, we've got disaggregated storage and compute, and we also have compute that spreads across the datacenter so there's a lot of east-west traffic that never leaves the datacenter, and we needed to have our datacenters adapt to those. This is the original datacenter design. It's called DLA. I can't remember what DLA stands for. But you can see that it's a traditional 3-tier networking architecture. The datacenter routers are at the top, then we've got these access routers and then aggregation switches with load balancers hanging off of them. You can see there's 2 of everything for redundancy. And underneath these access routers there's 3 of these aggregation switches. And underneath the aggregation switches there's 20 racks. And so, like I said, we've got disaggregated storage and compute, so you can see that what we did as far as laying out the software inside the datacenter originally in Azure is that we put compute underneath 2 of those aggregation switches and storage under the third. And that will become relevant later when I talk to you about it. Now, the problem with this architecture, as you can see at the top, it's 120/1. If you're going from a server over there on the far right, talking to a server on the far left of the datacenter, the traffic is going all the way up to the DCR and then back down. And that DCR then becomes a bottleneck, 120/1 oversubscription, meaning the bandwidth that can come up to the DCR is 120 times what can go out of the DCR. And we also have a limited amount of bandwidth going east to west, 120Gbps basically, bandwidth from one side to the other. So really limited for east-west. This is the Generation 2 network that we've got in place in most of our datacenters today. If you look at this, it's called a Clos network after the guy that made a fully connected network. There are multiple paths between any server, any rack, to any other server and any other rack in the datacenter. So what this means is that with enough switches there, there's no oversubscription up at the top in that spine layer or that border leaf layer there at the top. And there is redundancy just built into this thing. There's a little bit of oversubscription right at the top of rack routers, about 3/1. It's about 2.5/1 to 3/1 right now, meaning that if everybody started blasting 10Gb, which is what we've got on the servers' cards today, if everybody blasted out their servers at the same time, there would be about a 3/1 congestion at the top of rack routers. Contrast that with 120/1 from the previous design up at the central bottleneck of the network. So a big change there. And then if you look at the scale that we can achieve on this one, the other one was a roughly 10,000 server scale. This one is a 30,000 server scale. So these are how many servers we can put on there with no oversubscription at the top. And if you compare this number, I think it's pretty dramatic. We had 120Gbps, and on this one we've got 30,000Gbps east-west bandwidth. So really highly optimized. This even has become constraining to us. As we get bigger and bigger and as we get bigger and bigger datacenters, 30,000 servers has become a problem.
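To make the oversubscription arithmetic concrete, here is a small sketch; the port counts and link speeds are illustrative assumptions for the example, not the actual switch configurations in these datacenters.

```python
# Illustrative oversubscription arithmetic. The port counts and link speeds
# below are assumptions for the example, not Azure's actual switch configs.

def oversubscription(down_gbps: float, up_gbps: float) -> float:
    """Ratio of bandwidth that can arrive from below to what can leave above."""
    return down_gbps / up_gbps

# A hypothetical top-of-rack switch: 40 servers at 10 Gbps down, 4 x 40 Gbps up.
print(f"ToR: {oversubscription(40 * 10, 4 * 40):.1f}:1")      # 2.5:1

# The 120:1 figure quoted for the DLA design implies roughly 120x more access
# bandwidth below the DCR than the ~120 Gbps it could carry east-west.
print(f"DLA core: {oversubscription(120 * 120, 120):.0f}:1")  # 120:1
```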
So we've designed a third generation datacenter that we're rolling out and retrofitting existing datacenters with this third generation, which is called the Quantum 10v2, we call it internally. The difference between this one and the previous one is that we've inserted another layer called a cluster spine. And the cluster spine, with a combination of the ToR oversubscription there's a little bit of oversubscription there, and it's about 4.5/1 roughly. So what we've found is that one of the aspects about the cloud is that when you're running with the kind of scale that we've got underneath there, it's extremely improbable that everybody is going to be using bandwidth at the same time. So this is what we've found and we believe is plenty of buffer for basically not being oversubscribed anywhere in the datacenter. And this scales, you'll see from the previous design of 30,000, to 100,000 servers. So I think this is going to buy us maybe another 6 months or a year. No, I'm just kidding. [laughter] It will buy us a while. What we've done with the logical design— So that's the physical architecture of the datacenters— servers, racks, network switches. Now, to manage this thing logically, we've divided the datacenter into what we call clusters. And a cluster maps to the original DLA aggregation switch. If you look back here on the DLA architecture, here is the switch. This was the original definition for a cluster, the racks that could fit under there, which was about 20 racks. So that's what we call a cluster. Today we still organize our datacenters into logical groups of about 20 racks, which is about 1000 servers. This design provides a level of isolation not for hardware purposes, which it was in the original design. If we had 2 aggregator switches fail, we'd lose access to the cluster. Here it's a logical software unit of isolation. If we are rolling out a new Fabric Controller or new piece of software to manage this cluster and it fails, we've only lost or have a problem with that group of 1000 and not the entire datacenter. So this is a design principle that we're following everywhere in Azure, which is fault isolation and containment, especially when it comes to software rollouts— to be very graduated about our software rollouts. And you'll see another example of that later in the talk. We also call clusters stamps, so that's another synonym. So you'll hear people talk about storage stamps. If you hear people talking about Azure storage internals, it's the same thing. A storage stamp is also a cluster of about 1000 servers. And each cluster is managed by a piece of software called the Fabric Controller. The Fabric Controller is the team, the software group, inside of Azure that I'm most closely aligned with. So I'm an architect for the Fabric Controller. And the Fabric Controller, it's natural for me to land there because when I was working on Windows Internals, I was most closely affiliated with the kernel team. And if you look at the responsibilities of the Windows kernel, it manages hardware and it virtualizes it for the processes that are running. It also defines what is a process or an application. It's exactly the parallel here with the Fabric Controller. It manages the datacenter hardware. It manages the racks, it manages the servers, it manages the network, and it is also responsible for defining what is an Azure application. So when you deploy cloud service, the compute part and network part of that is defined by the Fabric Controller. 
And its job is to go and map those applications and the resource requirements they have onto the physical hardware with the virtualized layer in between so that the applications don't see the physical hardware for security purposes and for scalability purposes and reliability. There's 2 inputs that the Fabric Controller gets then. One is from the bottom, which is the datacenter build-out team goes and figures out, "Okay, how are we going to divide up IP addresses?" "What are the IP addresses that these routers are going to have?" "Where are the certificates for these routers so that the Fabric Controller can talk to them?" That's one input the Fabric Controller gets, and the other one is from above. Of course, people deploying applications is the other input. To give you a look at what the Fabric Controller is designed as, it is actually essentially designed to use its own application model. So when you write a PaaS application in Azure, you have this concept of a role and it can scale out. And it scales out for availability as well as for scalability. The Fabric Controller is a special type of app. It's a stateful app. We don't have support really in public API surface or application model for stateful Azure applications. So stateless model, meaning you push all your persistent data out to something like Windows Azure storage or Windows Azure database. But the Fabric Controller maintains the state of the datacenter, and so it is a stateful replicated application. Five instances, basically 5 servers, out of every cluster are dedicated to running the Fabric Controller. And it has a primary. That is the one that's responsible for updates to the state of the datacenter, like changes to the state of the hardware, like I know the server has gone bad, I know this ToR has failed, as well as keeping track of what VMs are there and what applications they correspond to. And when it makes a change, it replicates it out to the other replicas. The reason that we've got 5 is we can tolerate a failure of one instance in the middle of updating the Fabric Controller. So the way we update it is the same way that you update your PaaS applications. You do a rolling update where one slice of it gets updated. That might go down, come back up with new bits or a new config. Move to the next update and so on. We do the same thing here. And so if we're in the middle of updating, we're going to have 4 replicas active. One is being updated. We take another failure. Now we're down to 3. We still have quorum, which means that the changes can still be made to the state of the datacenter, and so we can still continue to operate. While that failed instance gets healed—and this is a process we call healing— when a server goes bad, when a VM goes bad or it goes bad on a server, we heal that application by reincarnating the VM or the application— in this case the Fabric Controller instance—to a healthy server. Take a quick look at what we feed the Fabric Controller. [Demo: Viewing a Cluster Definition] We've got for our datacenters massive XML files, believe it or not, and this is an area where I'm showing you the deep-under-the-covers gore of the way Azure has evolved. And like every software project, things start out, it's like, "Hey, let's do it this way. It's quick and dirty." "We don't have time to think about what's going to happen when we're at a million servers in a datacenter." And so nobody thought, "Oh, XML files describing a million servers." "What are they going to look like?" XML files describing anything are pretty scary.
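Before looking at that XML, a quick aside on the replica arithmetic above. Here is a minimal sketch of why five Fabric Controller replicas can ride out one replica being updated plus one unexpected failure; simple majority quorum is the assumption here, since the talk doesn't spell out the actual replication protocol.

```python
# Minimal majority-quorum sketch: with 5 replicas, one down for a rolling
# update plus one unexpected failure still leaves 3, which is a majority,
# so updates to the cluster state can continue. This illustrates the
# reasoning in the talk, not the Fabric Controller's actual replication code.

def has_quorum(total_replicas: int, unavailable: int) -> bool:
    healthy = total_replicas - unavailable
    return healthy >= total_replicas // 2 + 1

for down in range(4):
    print(f"{down} of 5 replicas down -> quorum: {has_quorum(5, down)}")
# 0, 1, or 2 down -> True; 3 down -> False
```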
But what we've got here is one of the DataCenter.xml files. This is for the Columbia Datacenter, the Azure deployments there, and there's a few things that I'll just highlight in here. There's one cluster in here, co2stageapp02. So we've got this naming convention here. This means it's in Columbia, this means it's not production, and this means it's a compute cluster and it's compute cluster 2 in that datacenter. So I'm going to do a search here and we'll find the description that is fed to the Fabric Controller that manages that cluster. And you can see that it's a bunch of VLANs here with the routes to talk to different things in the datacenter. And then the load balancer configuration. This is back when we had a specific type of load balancer. It's different than what we've got now in production. And then we've got the machine pool, and a little bit further down we've got the actual description of all those racks here. The blade locations. We assign them locations, we assign them IP addresses, and we give them blade IDs, asset tags on them, and NIC MAC addresses basically to the physical MAC addresses that are stored there. And so you can see also described there the network switches associated with the rack. This is the ToR. This is the serial concentrator. So we can talk directly to the servers and do debugging on them. And then power strips. We've got 2 power strips for redundancy in all of these racks. So one of the key aspects of the Fabric Controller's management of the servers on the rack is controlling their power—shutting them off, rebooting them, turning them on. When the Fabric Controller starts up and gets deployed to these 5 instances and then it gets fed this information it says, "Oh, I've got a bunch of servers." "Time to get those servers ready for serving applications." What it does at that point is uses the PDU, the power distribution unit, to power on the node, and those nodes are programmed to PXE boot, and the PXE boot server is on the fabric VLAN. It will deploy a maintenance operating system, what we call the MOS, to this thing. The maintenance operating system is Linux. No, I'm just kidding. [laughter] It's Windows PE. So because we are Windows, of course, it is Windows PE. And what that does is formats the disk and downloads the host operating system, which is also a VHD. Everything is boot from VHD here. The host OS VHD is put down on the NTFS volume that's been formatted, and at that point the MOS reboots and it reboots into that host operating system, which has an agent that's been injected into it. This is the fabric's host agent, and that establishes a secure channel by generating a self-signed certificate and establishing an SSL mutual-auth channel with the Fabric Controller now. And at that point the node is ready to go. As far as what operating system version we're running, this is a question I actually had before the session. People say, "Hey, are you guys running the same version of Windows that we're buying and putting in our datacenters?" The answer is up until about 6 months ago we weren't. We started in Windows Azure with a version of Windows Server that was forked off of Server 2008 because Windows Azure had requirements that weren't being satisfied by Hyper-V at the time. One of them was we had 8 core physical servers where Hyper-V only went up to 4. Sorry. We had the need for 8 core VMs when Hyper-V only supported 4 at the time. We also had the need for boot from VHD, which Windows didn't have at the time.
So all that was put into the Azure version of Windows first, and then you know it's all been migrated back into Hyper-V. And about 6 months ago, after Windows Server 2012 RTMed, we rolled out RTM 2012 with Hyper-V onto our host. So we are running stock Server 2012. We're working really closely now with the Windows team to get any innovations that we need into the next version of Windows. We do some development outside with their cooperation and collaboration and then we get code merged back in. So we're already working on stuff. There's stuff that they delivered in Blue that I can't talk about, but hopefully next TechEd I'll be able to give you an Internals talk about the cool stuff that we've got coming in Windows for managing our datacenter hardware. Virtual IP addresses. A key part of the way that the datacenter operates is networking. Networking is like the air that these servers breathe. It's the air in the cloud. And a key aspect of this is the virtual IP addresses or VIPs. Those of you that have deployed applications or VMs know that you get a VIP for your cloud service, one virtual IP address. What this does is these VIPs map onto VLANs. These VLANs provide a unit of security isolation. The VLANs extend to all of the VMs that are part of that same cloud service, and there are multiple ways to communicate between isolated VLANs in the datacenter. One of them is just by going through the VIPs, and the other one is by connecting those VLANs with an overlay VLAN called the virtual network, which I talked about yesterday morning. Underneath, non-publicly routable IP addresses are assigned to everything in the datacenter. You saw them in that DataCenter.xml, these 10 dot addresses. We've got a bunch of private IP address ranges that we're using all across our datacenters. And as far as what a VIP maps to, a port on a VIP can map to a dynamic IP address or DIP. And of course you can port forward those DIPs or load balance them. A key question that I hear a lot is, "When do I get to keep my VIP and when do I lose my VIP?" So you deploy an application and you get a VIP and you say, "You know what?" You fall in love with that VIP, you start to get to know it, it's really a lovely VIP, and actually you start to build a relationship with that VIP because you start to ACL things on your side, assuming that you're going to have that VIP, and then you become possessive of the VIP and you say, "I don't want to lose this VIP. What do I have to do to keep this VIP?" Well, we are going to have reserved VIPs or ones that you can basically lease and not have connected with the cloud service. That's inevitably coming. We don't have a timeline to share with you. But in the meantime, the golden rule is as long as you have at least one VM deployed behind that VIP, you get that VIP. That is your VIP. It's not going to ever change. That VM crashes and gets restarted, you still get the same VIP. The only reason you lose the VIP is if you delete that deployment. So that VIP is yours as long as you have that deployment. What we used to do, what we started out with in the datacenter were hardware load balancers. And these hardware load balancers proved to be very problematic. I don't know how many of you like dealing with hardware load balancers, but they've got all sorts of problems. They've got limitations on the number of routes that you can put in them, they've got limitations on the number of ACLs you can put on them, they fail, they're wildly expensive.
If you want to be highly available, you need to buy not just 1 but 2 of them every place you put them. And they don't scale very well as far as the traffic that can flow through them. So about 5 or 6 years ago working with MSR, we started on a project called the Software Load Balancer. The software load balancer architecture is shown here, and I thought I'd share it with you so you can understand when your traffic comes into the datacenter or comes out of your VMs, what's managing the routes for that traffic? It is the software load balancer. You see the slide is divided into 3 sections, and the sections are here on the far left is the Fabric Controller, in the middle is the software load balancer management role or manager role, and on the far right—you can see that orange in the middle— that's the SLB MUX role. And then the bottom is an actual node, and you can see some SLB agents there on that node. The Fabric Controller is what's responsible for deploying an application, so once it's decided it's going to create a virtual machine, it needs to go talk to the network manager plugin in the Fabric Controller, which is going to talk to the SLB manager and figure out which VIP we want to give to the particular VMs. The SLB manager is going to— concurrently, the Fabric Controller is launching those virtual machines. You can see them show up there on the node. And the SLB host plugin at that point is going to ask the SLB manager, "What route do I give this guy? What VIP is mapped to this guy?" And he's going to tell the SLB host driver, which is plugged in to the networking stack, about the route for this, the DIP to the VIP mapping for these virtual machines. The SLB manager also then talks to the MUX role. So the manager is what's controlling the mappings of VIPs to DIPs. The MUX role is actually what the traffic is flowing through. And so the SLB manager, you can see there's a DIP health monitor there. The DIP health monitor is pinging your endpoints. So when you've got a load balanced endpoint, it's going to be pinging that's coming from there to see if it's alive or not. And if it's not, it's going to realize that that VIP to DIP mapping should be removed, and it's going to tell the MUX agent to stop forwarding traffic there. And the MUX agent is going to tell the physical network devices. Through BGP protocol it's going to update the routes and say, "Don't route traffic to this guy anymore." And now once we've got that set up, traffic starts flowing through, goes to the MUX on the way in to the datacenter. When you respond, it just goes straight back out over the Internet. How much overhead does the SLB add? Anybody have an idea? [Demo: SLB Latency] Want to take a guess? >>[audience member] Five milliseconds. [Mark R.] Five milliseconds. Ye of little faith. [laughter] Anybody else have a guess? Two milliseconds? Twenty microseconds. That's— >>[inaudible audience member response] Physics do still apply. Well, let's go see the difference between traffic between DIPs and VIPs. So I've got a little tool I wrote called PsPing. Anybody familiar with PsPing? Yeah? It's a tool for measuring bandwidth and latency. And I've got a cloud service with 2 VMs, Test 1 and Test 2. Here we've got PsPing running in server mode on Test 1. And it's waiting for a TCP connection on its local DIP. We can go ahead and minimize that because we're going to focus on Test 1 over here. And let's see. Let's do a DIP test first, actually. 
So it's listening here, so what we're going to do is that's 168 22 29 5000, and I've opened up the firewall ports here. So this right here is a PsPing. It's a packet size of 16384. Let's do 10000 iterations to that. And L is for latency test, and here we can see it's running and it's going to tell us that that took—oh, time out. What happened here? Oh. It's because of that. Command prompt had it frozen, so let's try that again. Okay. Drum roll. [audience member makes drum rolling sound] >>[Mark R.] Thank you. 0.56 milliseconds. So that was the latency, and that's round-trip latency between one VM and the other VM. Now the next test is to go through the VIP. So the VIP for Test 1. Let's go find test in my list of VMs. Here it is. Test. Then we can go find the VIP. So there's the VIP, 168, and do the same PsPing to the VIP at port 5000. All right. So we had only one person willing to guess 5 milliseconds was the overhead. So let's see what it is. And drum roll. >>[audience member makes drum rolling sound] >>[Mark R.] Thank you. 1.19 milliseconds. The difference, yeah, is about 0.5 milliseconds. Half a millisecond. And what was my prediction? Tip. It's a tip for you. When you can, just avoid going through the load balancer. It saves you about a half a millisecond, you can see. So if you're concerned about latency, there's ways to avoid the load balancer. Just by going DIP to DIP like I did, for example, is one way within the cloud service. Even if you're going across cloud services, you can avoid going through the load balancer by going through VNET. So that's another way to do it. [Deploying Services] Next let's talk about what happens when you deploy a service. When you deploy a service to the cloud, you can do it in a few different ways. You can push it up through Visual Studio as a package, you can go right to the portal and say Upload, you can stick it in a storage account and go tell the portal, "Go pull it from here," or you can use the service management APIs directly to go push your package up into the portal. So all of those, what they have in common is that they're all going through this component called RDFE. RDFE, what it does when it gets your package is sticks it in its own storage account. RDFE right now is a standard Windows Azure application. It's scaled out; it's running on 160 servers right now. That's how much traffic. It literally receives millions of requests every day. It's a hot standby application, so it's got instances in multiple datacenters, ready for failover. And what RDFE does is picks a Fabric Controller to deploy this package to, a Fabric Controller cluster. This is the concept of cluster coming back. It's going to go pick a cluster. Actually what it does is pick, if it can, if it's got a choice, 5 Fabric Controller clusters and give it to them one by one. So if one fails, it can give it to the next guy. With 5 we shouldn't ever return an error to you that we couldn't deploy the application. And the FC, the Fabric Controller, its last step when it gets it, stores it in its own local image repository. It keeps all the artifacts that you give to it. It's packaged. So here's a tip: keep your package small. If you look at the flow that I've just described, when you start with a core package and you push it up to the portal, the portal is just simply going to pass it through. The portal is a stateless application. It is also a Windows Azure application.
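Going back to the latency demo for a moment, a rough way to reproduce the DIP-versus-VIP comparison is a simple TCP round-trip timer like the one below. The addresses and port are placeholders, and it assumes you run a small echo listener on the target VM; PsPing's server mode is what the demo actually uses, so treat this only as an approximation.

```python
# Rough TCP round-trip latency probe in the spirit of the PsPing demo.
# The endpoints below are placeholders; substitute your own DIP and VIP,
# and run an echo server on the target VM that sends received bytes back.
import socket
import statistics
import time

def rtt_ms(host: str, port: int, iterations: int = 1000, payload: bytes = b"x" * 1024) -> float:
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        samples = []
        for _ in range(iterations):
            t0 = time.perf_counter()
            s.sendall(payload)
            s.recv(len(payload))  # good enough for a sketch; assumes the echo arrives in one read
            samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)

# print("DIP:", rtt_ms("10.0.0.5", 5000), "ms")      # internal address (placeholder)
# print("VIP:", rtt_ms("203.0.113.10", 5000), "ms")  # load-balanced address (placeholder)
```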
It just passes your package on, which can be up to 600 MB in size, passes it to RDFE, RDFE hands it to the Fabric Controller. And by the way, RDFE has copied it to its own storage account in parallel as a backup. It sticks it in the Fabric Controller. The Fabric Controller stores it on disk and then it goes and pushes it to the servers that the individual roles are on. So you can see we've had a copy to storage, 2 copies to disk, 1 in the Fabric Controller, 1 on the server, so lots of copying of this data through the network and to local disks. If you've got a big package, this is going to actually impact the performance of your deployments. So the recommendation is instead just to push your code this way. Any other supporting artifacts and data files is to put them in storage and then have your application, your code, go pull it from storage. Not only that but cache the files locally so that when you have, for example, a code update, that you're not having to go re-fetch the data. And I'll tell you about how you can cache in a little bit. RDFE. I mentioned that term. What does RDFE stand for? Do you think it stands for some cool technical thing? How many people have seen my Windows Azure Internals talk before? A few of you. What's that? >>[audience member] It's the Pink Poodle. [Mark R.] It's the Pink Poodle. That's right. It's a little inside Azure piece of trivia. The Red Dog Front End. The original code name for the Windows Azure project was Red Dog. And it got its name Red Dog from a legend, this guy that I've admired tremendously and I kind of followed through Microsoft, Dave Cutler. Dave Cutler, chief architect of VMS, chief architect of Windows NT, and then was on the original core team of 15 or 20 people that went off and started the Azure project under Ray Ozzie. And they were traveling around looking at the datacenters, the way that we operate them. They were down in the Valley, San Francisco, northern San Francisco. And they had this lingering question. The burning question for anybody starting a new project is, "What do we call it?" "We've got to call it something so we can start making T-shirts and shoes and things like that, coffee cups." So they hadn't come up with a name. I guess they had a few contenders. And they passed a place called the Pink Poodle. They say they passed it. I don't know what "pass" means, whether it means stop in the parking lot or just you're driving past it on the road. And they liked it. I don't know what they liked about it. Maybe it was just the sign. Maybe it was the logo. Maybe it was just the name Pink Poodle they liked. But they decided to call the project Pink Poodle [laughter] until LCA found out and they said, "No, you're not going to call that the Pink Poodle." "You're going to call it something else." So they said, "Okay. We're going to call it Red Dog then." So that's the way it got the name Red Dog. And if you've got a mobile device, feel free to look up what Pink Poodle is. I have never been there. I've run into a few people that will admit that they've been there. They say that, yeah, they can see why somebody would name a project after that. [laughter] Let's talk about affinity groups now. If you've deployed a cloud service, you might have come across the concept of an affinity group. How many people have deployed a cloud service with an affinity group? Okay. So a few of you have. What is an affinity group? Let's talk about where that concept came from. 
If you look at the original datacenter networking architecture that I've got over there on the far right, you can see that this access router maps to 3 clusters. And those 3 clusters, the traffic between those clusters doesn't go all the way to the datacenter router. So the original team said, "It would be really nice if we gave people the ability to co-locate their compute and storage under one of these access routers so that they basically get better performance and we're not congesting the datacenter routers by having traffic between storage and compute that's under one of these things not go all the way up." So that's the original concept of affinity group. If you specified your compute, it would go underneath the same access router as another cloud service with the same affinity group or a storage account with the same affinity group. The networking architecture has obviously changed. So what does affinity group mean these days? Affinity group no longer has this benefit of east-west traffic. It is a convenience now for saying, "Put these things in the same datacenter or the same region." So when you deploy a bunch of cloud services and you say, "Put them in the same affinity group," really what we're trying to do is to say, "Those are going to go into the same region." So you don't have to worry about it. Like, "This guy is already in North Central US." "Deploy this guy in the same affinity group." It will go to the North Central US. But underneath the hood we still have some of these artifacts of the original Azure design, at least for now. We're working on changing them. For now, the Fabric Controller cluster is the scope for homogeneous hardware as well as a single cloud service deployment. When you deploy a cloud service, you go into a particular cluster, like I told you. And the implications of that aren't so much bandwidth between cloud services and different clusters in the same region because we've got this flat network; the implications are the hardware. And people have started running into this because we've rolled out new clusters with servers that support A6 and A7 VMs. Those servers have 64GB of RAM, whereas the traditional servers have 32. And so we can't fit the A7 VMs on those 32GB clusters. So the problem that you might run into is if you have a VNET or affinity group with a cloud service deployed into it and it's on one of these non-A6/A7 supporting clusters you can see there and you go deploy an A6 or A7 and you say, "Put that in the same affinity group," you're going to get a failure because it's got to go to the A6/A7 cluster, and VNETs plus affinity groups are bound to a particular compute cluster. So you will get a failure. So the tip here is if you need to use something that's constrained, like a VNET, with particular hardware, like an A6 or A7, deploy that first and use that as the anchor point. And then you can deploy other stuff. What people have run into is they've had something like this, and the only way to get A6s and A7s now to be in the same affinity group or VNET is to go and re-deploy that guy that's in that cluster and pull it onto an A6/A7 cluster by deploying an A6 or A7 first. So that's kind of an undocumented tip. You want to be in an A6 or A7 cluster but don't want an A6 or A7, deploy an A6 or A7 and then delete it. I probably shouldn't tell you that, but— [laughter] Now, the other implication is for storage, the other requirement we've got is that a storage account for an IaaS VM has to be in the same region.
It can't be in a different region. So you will be forced to basically put it in the same datacenter; otherwise we'll fail the deployment. The deployment steps for an application. The Fabric Controller takes your service model files, and even the IaaS roles, the VMs, they have a service model file underneath them too that is the same kind of service model. We just don't expose it publicly yet. It determines the resource requirements. How many VMs do you have? What size are those VMs? Where is your code? How do they map to those VMs? And it creates what are called role images, then it goes and does resource allocations— what servers should those VMs go on, ideally— and then prepares the servers by pushing those role images down to those servers so that they are there, creates the virtual machines and starts the virtual machines, configures the networking. When it comes to allocating resources, this is, for me, a fascinating problem. How do you efficiently allocate resources in the datacenter to be optimal, to get optimal utilization and optimal performance for things like updates? And I've been working with MSR on some algorithms. The basic problem here is you've got a number of hard constraints and you've got some soft constraints. The hard constraints are if you ask for an A6 VM, we've got to give you an A6 VM. We can't say, "Oh, there's a nice server over here with enough for extra large." "You like that?" You'll say, "No. I want my A6 or my A7." So that's a hard constraint. Another hard constraint is fault domains. Fault domains—again, it's kind of assumed knowledge coming into this— but fault domains, you get 2 fault domains for any of your roles or an availability set. What that means, a fault domain is a rack because you saw that there are PDUs and ToRs and servers, obviously, that are single points of failure in the datacenter. The network is not a single point of failure. So single point of failure is a rack. We will spread you across at least 2 racks, probably more, but our guarantee to you is 2. So if we have something like a ToR failure, your whole service doesn't go down. Part of your service goes down while we heal your service and move it on to a healthy rack. So that's a hard constraint, at least 2. Soft constraints are prefer allocations that minimize the host OS update walks, which I'll talk about later. We actually try to pack nodes generally. This is very soft, so you don't always see this. Here's an example of a fault domain and availability set just showing you that we spread you across at least 2. Here in this example we've got 2 roles, 2 instances in the front end, 3 in the back end. You lose a rack, you've lost half of your front end, a third of your back end, and then we'll heal you, but you didn't lose the whole thing. And that's where our 99.95% availability comes in. If you've got at least 2, then you're spread across at least 2 fault domains and we guarantee that 99.95% of the time over a year for both planned and unplanned outages at least one of them will be up. Here's an example of an allocation. These are 8 core servers. The white squares represent empty cores. At the top we've got Role A, which is 3 instances, 3 update domains. It's a large VM which is a 4 core VM. And then we've got a medium worker role, 2 core VMs, 2 of them, 2 update domains, and let's go deploy that. This is just an example allocation where the front ends and back ends of the same update domains will be placed on the same racks in the same servers.
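A toy version of that placement logic is sketched below: spread a role's instances across at least two fault domains and assign update domains round-robin. The real allocator also weighs VM size, free cores on each server, and soft constraints like minimizing host OS update walks; this sketch ignores all of that.

```python
# Toy allocator illustrating the hard constraint described above: a role's
# instances are spread across at least 2 fault domains (racks), with update
# domains assigned round-robin. Real placement also considers VM size, free
# cores, and soft constraints such as update-walk length.

def place(role: str, instances: int, fault_domains: int = 2, update_domains: int = 5):
    assert fault_domains >= 2, "the platform guarantees at least 2 fault domains"
    return [
        {
            "instance": f"{role}_IN_{i}",
            "fault_domain": i % fault_domains,    # round-robin across racks
            "update_domain": i % update_domains,  # round-robin across UDs
        }
        for i in range(instances)
    ]

for p in place("FrontEnd", 2) + place("BackEnd", 3, fault_domains=3):
    print(p)
```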
And at that point, the load balancer is wired up, like I explained before. Let's take a look at a real deployment. [Demo: Viewing Service Allocation] I've got a cool tool. Unfortunately, I can't share it with you. It wouldn't be that useful to you anyway. But this is called the Fabric Viewer. This is a tool that we use internally. It's one of our many diagnostic tools. But this is one way we look at what's allocated in a cluster. I'm looking at allocations in one of our production clusters. In fact, it's the— There's Scott. Say hi to Scott. Here's one of the production clusters, and I've got a cloud service here called mr-email, Azure Email Service. Let's go take a look at its topology. You can see that I've got right here, one, the front ends—2 front ends— I've got 3 middle tiers, and I've got here 4 back ends. So it's a total of 9 VMs, and you can see how they are spread across 4 fault domains, 0 to 4. So here's the constraining set of fault domains right here, 0, 1, 2, 3. Then the middle tier is on 0, 1, and 2. There is 0 and 1. And then you can see they are alternating fault domains. Let's go take a look at that deployment in this cluster right here. And what I can do to see where it is is pull out its deployment ID. So when you call support and they say, "Hey, you've got a problem." "What's your deployment ID? What's your subscription ID?" we use that to go find out information about you. And what's just highlighted in green is that deployment. And you can see these are our new A6/A7 clusters, so they've got 12 cores on them, and you can see that this guy then is an extra large. So this is one of the back ends. If we count 1, 2, 3, 4, 5, 6, 7, there's one of the front ends, 8, and the other front end is over here. All on different racks, even all on different servers. You can see this cluster is kind of busy. By the way, the load balancer is also here. Every cluster has a load balancer. Here is the load balancer. And you can see it's got a bunch of servers allocated to it. Interesting thing about this is the networking guys always think they're so cool. If you hover your mouse over here, the reason that that is the whole server and we don't have 12 core machines right here is look at what these guys do. They're like, "Ha, ha. We're so cool. I own the machine." [laughter] This is how we figure out the allocation. Let's talk about the steps now to actually get your code and data down onto that server. The Fabric Controller pushes a role file and configuration information to the target host, then it creates the VHDs, then we've got a guest agent sitting inside of your PaaS role that starts your code. It does a bunch of other things too. It activates the plugins, like the RDP plugin. It runs your startup tasks, calls your role entry point, and then starts a health heartbeat with your role entry point, which is the 15-second health heartbeat. The load balancer only routes to the ones that are actually sitting there and responding to the heartbeat. If the heartbeat is missed, the guest agent tells the host agent, "This guy is out of commission," and then you stop getting the probe. The IaaS provisioning flow is a little bit different than the PaaS provisioning flow because PaaS we've got these role VHDs that we deploy down to the servers, these role packages. For IaaS we're deploying a raw virtual machine. So let's take a look at that flow. First you create a VM. You tell RDFE to do that. 
We've got this storage account that RDFE manages, which has a section of it called the platform image repository, or PIR. These are where the gallery images are that you pick one and you say, "I want one of those." Or you can point at your own storage account and say, "I want one of those," the Sysprep generalized images that you created, and, "I want a new VM from that." If you pick a gallery image, what RDFE does is makes a copy-on-write copy to your storage account. Your storage account has to be in the same region as what the VM is going to. And we've got platform image repositories in every storage stamp, so no matter which storage stamp your storage account is in, we can do a copy-on-write copy, which internally in the platform we can do between storage accounts. So we aren't actually copying the full 8GB or 12GB of image; we are just doing a copy-on-write zero cost copy at that point. Then we generate an ISO and add it to your storage account. Then the RDFE calls the fabric, and the fabric gets an infinite lease, gets the storage shared access keys to talk to your storage account. By the way, infinite lease means this is why when you go try to delete a blob that is mounted as a disk you get an error is because there is a lease being held by the Fabric Controller, by the server that this thing is deployed onto, that won't let it go away. It's actually the RDFE grabs the lease and then hands off the lease to the Fabric Controller. And we create the tenants, what we call tenants or what we call VMs. We add the images to the Fabric Controller, add tenant secrets, update tenant, send the container configuration to the host agent from the Fabric Controller, which downloads the ISO, creates a resource VHD, which I'll talk about later, creates a cache VHD, prefetches the cache VHD, creates a VM, starts the VM, launches the IaaS disk driver, which then starts talking to your blob. The interesting step, I think, in here is this caching which we put in place to make it really as fast as possible. There's specialization that happens then. Oh. It's paging file and so on, which is the Unattend .iso stuff. So just like you deploy Windows and you have Unattend.xml files, we've got the same thing. In fact, I will show you one right here. [Demo: Inside an IaaS Provisioning ISO] I have mounted one of our ISOs that was generated from one of our deployments, a test deployment, and it's right here. And here is Unattend.xml. There's just some interesting things that I'll show in here. Here's one. PersistAllDeviceInstalls. I'll talk more about that in a second. It's a Sysprep option. Here then we launch Unattend.wsf, which is where the magic happens. Inside this directory, OEM, basically this is the heart of the provisioning agent, which you can see there is a WaGuest, which is the provisioning agent, and then there's a bunch of Sysprep stuff in here as well. So a lot of it happens in this OperatingSystem file. And there's a few things like here that we've got, like the admin account specification. So for some of you that do deployment, this will be interesting. For the rest of you, you're probably like, "Okay, whatever." But one of them is RdpKeepAlive. So what we do is turn on KeepAlive for RDP so that RDP keeps the channel open when you RDP in the VM, and that means the connection doesn't get torn down by software load balancer, which it will do after a few minutes. Software load balancer keeps connections open for about 10 minutes or more. 
But to prevent it from getting torn down at all so you can leave an RDP session window open and it will still be responsive when you go back to it, that's what we do. There is paging file stuff. It's in a separate file. SetSanPolicy. There's another one here. So we OnlineAll disks, which is not the default for SCSI disks in Windows. When it sees a new SCSI disk, it doesn't online the volumes. We force it to online the volumes when it sees a new disk show up. So it's some of the things that we've got in our Unattend.iso. Some of the deployment optimizations. We've got 2 that reduce the time that it takes for a VM to start from a base image and actually be functional for you to RDP into it. The Windows specialization can take up to 10 minutes if you've ever done Windows deployment. The 2 optimizations we've got in place, one of them is that PersistAllDeviceInstalls. What we do is we boot these VMs in our lab off of the hardware, and so the drivers get installed for our hardware. When you do Sysprep/generalize, you say PersistAllDeviceInstalls, that will keep the drivers inside the image rather than uninstalling them, which is the default behavior. This is what you want to do when you go and create an image in Azure and you do a Sysprep inside of it is do this PersistAllDeviceInstalls. That will save you time for your own images getting Sysprep specialized when you create VMs. The other is one that we take advantage of for our platform images— we don't make it available to you at this point; maybe in the future we will as part of Windows— is that we create prefetch files. Let me talk about our prefetching optimization, which is really cool. What we do in the lab is boot the VMs. So we get like a new version of Server for the month of May. In the lab we boot the VM with an instrumented disk driver that watches what sectors in that VHD get pulled in, and it creates a prefetch map, which is just Sector 5, Sector 8, Sector 3000. It's just a list of sectors that get read in up to the point where the thing is ready to RDP into. These are all the sectors required for Windows to get up and running to that point. And at deployment time we have the prefetch driver pull those sectors out of the VHD. While the VM is getting set up on the host, we go and fetch those sectors in big chunks. With heavily pipelined I/Os we get up to about 100Mbps, pull it from blobs, and pull it down to this cache, which then becomes the disk cache for the OS. When the VM starts up, all the data it wants is right there. Let's go take a quick look at how we generate those prefetch files. [Demo: Prefetching IaaS Disks] I'm RDPed into a node. I meant to do smart card. And I'll show you the size of the prefetch file. Then we're going to boot 2 VMs, one with the prefetch in place and one without the prefetch in place, that will just basically pull down from blob storage all the sectors that have to be brought in. And I'm not even going to wait for the second VM, the non-optimized one, to launch because it's dramatic. You'll see how dramatic it is in a second here after I log in. All right. When I can log in. All right. I got— Oh, here we go. Idle Timer Expired. Great. Oh, okay. So here I'm going to do— So you can take a look at the base images here. VM with prefetch here is a 1.7GB image. And here's the prefetch file. It's 300K of map, of sectors that have to be read to prefetch that file. So that's basically the pre-populated host operating system. And then if I do without prefetch, you'll see that this other VM starts out with its 3MB empty VHD.
And I'm going to connect to those VMs now. On the left let's do with prefetch, and on the right we're going to do without prefetch, and then I'm going to start both of them. And we'll just wait long enough so we can see the dramatic difference in how these things start to spin up. So you'll see this has already switched to video graphics mode. Getting devices ready. And now let's just wait until this one says Getting devices ready. And that's 100%. Again, that's the persisted drivers. We still haven't even gotten this guy to the Getting devices ready. And what we'd see is that VHD that's underneath this non-optimized one is expanding as the stuff comes into it, the stuff from blob storage, which is an overhead in itself, by the way—expanding a file. Look. Restarting your PC. We haven't even seen this guy show up with Getting devices ready yet. So it's literally a difference of minutes. As far as our performance goals, we measure this weekly. In fact, every Thursday we have a performance meeting to look at what's going on, both from storage and compute deployment. Our performance targets are Windows deployments under 6 minutes, Linux deployments under 3 minutes. Yeah, I know. That's sad. [laughter] We're working on it. The tail up here in this graph is typically caused by hardware. But if you can see, this is the latest, this is Linux right here. You can see about the 70th percentile we are under 3 minutes. So we've got more work to do there. You can see this red line is Windows Server 2012. It's the fastest performing. You can see all the way to the 80th we're under 5 minutes. Here we cross the 6 minute goal at about 90. So we've got some work to do right here. And then you can see if we've got a SQL image it's going to take longer because SQL adds overhead. Server 2008 is this gray bar. Our goal is obviously the latest OS. So if you want the best performance, Server 2012 will get you there. [Demo: Deployment Performance] I'm going to skip this demo. [Updating Services] Let's talk about updating services now. When you update a service, we march through update domains. Now, the update domain walking is done by slices through your application. Like you saw, I had things in Update Domain 0, things in Update Domain 1. Here is the front end and middle tier getting updated. You can assign up to 20 update domains today. The default is 5 and actually 4 in an availability set. For IaaS you get 5. We don't allow you to change it at this point. In the future we might. But up to 20 for PaaS roles we'll march through. Our SLA is based on you having at least 2 because these update domains aren't just for you pushing out updates to your application, but we honor the update domains when we roll out operating system updates underneath and have to reboot the servers. We do it such that you don't have VMs from more than one update domain out at the same time. We might not do it in this exact order that you get when you push your own updates. We might update a server there from the middle tier that's hosting that middle tier VM before we do Update Domain 1, before we do Update Domain 0, but we make sure that Update Domain 0 VMs are never down at the same time Update Domain 1 and 2 VMs are down when we update the host operating system.
Remember I had 9 VMs. I had 2 in the front end, 3 in the middle tier, and 4 in the back end. And what we're going to see is the marching of these update domains here. This is already a feed just querying, waiting for the fabric to walk through these update domains. Here's the fabric. And the cool thing is when I start to expand this, you're going to see the updates. Here's Update Domain, obviously, 0, here's Update Domain 1, 2, and 3 that correspond to exactly what I saw. And you can see that there's 3 VMs, 3 VMs, then 2, and this is the straggler from that back end that was in that last update domain by itself. And for each of these we break it down into the things that happen to your role. The first thing is we stop the role. You can see Stopping role. Started. Here it got stopped. And then we update the role, destroy the role, update the VMs, and then we start up. System startup tasks running, then we call the OnStart. So this is exactly what happened to my VM. That's what we see underneath the hood. One of the things that we see customers do—it's a big mistake— is to respond—I can't remember if it's true or false—to your RoleEnvironmentChanging flag to say, "Recycle me." And we've seen customers that say, "Wait a minute." "When I do an update, even if it's just scaling out, it takes fricking forever." "What the hell is going on?" And so we've got our consultants pulled in, and they go look at it and say, "Let's see your code." They look at this code, and it always returns false or true—whatever it is— to say, "Recycle me." And what that will cause us to do is if you get any change to your topology, to your code, your configuration, we will go march your update domains for you when in most cases you don't need to recycle your code. You should adapt dynamically to as many configuration changes as possible. And so the goal here is if you want a fast update, deploy settings as configuration instead of as code and respond to the configuration updates by saying you don't need to be recycled. [Updating the Host OS] I started talking about updating the host operating system, which we do about once a month. [Demo: Delta VHDs] And we've got these allocation constraints where we have to honor update domains. We can't take down, again, multiple VMs from different update domains at the same time. This allocation down here is suboptimal because we can't update both of these servers at the same time since we'd be taking down Role A-1, UD 2 and Role A-1, UD 1. Up here we've got Role A-1, UD 1 and Role B-1, UD 1. So we're not violating update domain constraints here. If we did that here, we'd violate them. That's part of our layout. When we update the host operating system, it happens about once a month. So the VMs get shut down, your VMs get gracefully shut down on the servers we're about to update. We have to honor host OS update constraints. This causes us to basically go through a cluster of 1000 servers. It takes us about 20 batches to go do an update, 20 to 25 to 30 depending on the topology of applications and how fragmented the UDs have gotten on those servers. And the longer it takes, each slice takes about 20 to 30 minutes. And so it can take us 10 to 20 hours to go and roll an update out through a cluster. I've got an example here of a march, [Demo: RootHE Update] and you can see again the principle that we have for all of our updates, which is to do a little bit, take a look if everything looks good, and then move on.
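That walk can be sketched as a simple loop: take down one update domain's instances, update them, wait for them to come back healthy, then move to the next. The health check below is a stand-in for the guest-agent heartbeat described earlier; this is an illustration of the ordering, not the platform's actual rollout engine.

```python
# Sketch of an update-domain walk: only one UD's instances are down at a time,
# and the walk halts if a batch doesn't come back healthy. The health check is
# a stand-in for the real guest-agent heartbeat.
import time
from collections import defaultdict

def walk_update_domains(instances, update_fn, healthy_fn, settle_seconds=1.0):
    by_ud = defaultdict(list)
    for inst in instances:
        by_ud[inst["update_domain"]].append(inst)
    for ud in sorted(by_ud):
        for inst in by_ud[ud]:
            update_fn(inst)                 # stop role, swap bits, restart
        time.sleep(settle_seconds)          # let the batch settle
        if not all(healthy_fn(inst) for inst in by_ud[ud]):
            raise RuntimeError(f"UD {ud} did not come back healthy; halting walk")

instances = [{"name": f"IN_{i}", "update_domain": i % 4} for i in range(9)]
walk_update_domains(instances,
                    update_fn=lambda i: print("updating", i["name"]),
                    healthy_fn=lambda i: True)
```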
So this is an update of one of those co2stage clusters, and you can see right away that there's 2 batches: this batch right here, which is a percentage of the servers, and then that looked good, we did health checks, and then did the rest of them. When you look at what goes on underneath, for each of these nodes we have to shut down the VMs. We shut down the VM, StoppingContainer, then we install new certificates, machine configuration, monitoring configuration, create the new host plugins, and then start up the containers. And then we wait up to 15 minutes for these guys to say that they're ready to go, and then move on to the next batch, the next set that doesn't violate update constraints. And so you can see here in the bulk of it, here at the very end we ended up having to just do one server all by itself. We couldn't do it with any of the others or we would have violated update domain constraints.

So one of the things we do to optimize for deploying images to the cluster is we use delta VHDs instead of deploying the full images. If you take a look at a full stock Server 2012 Enterprise edition, it's about 9GB of VHD. Compressed it's about 4GB. What we do is deploy the latest version of the OS to every single server in the cluster. That means we're copying about 20TB of data to those 1000 servers, which takes several hours. So what we've deployed just recently is delta VHDs. What we do is have a base, look at the update, like April to May, just look at what sectors have changed in the VHD, and then create a delta file. We compress the delta file too, and it compresses down to about 75MB. And now we deploy 75MB instead of 4GB to each server, and then we build the new release right on that server. So it's all done in parallel, and it has reduced the time that it takes us to prestage these images from about 10 hours to about 1 hour.

[Disks] We've got about 10 minutes left, and let's talk about disks. Lots of people have burning questions about disks for some reason. Disks. Who has burning questions about disks? All right. I guess we can skip it then. [laughter] No. There's lots of different kinds of disks we've got in Azure, and what I've shown here is the disk architecture for one of our PaaS roles, a VM. You can see that there's 3 volumes that you get when you RDP into a server. You see the C volume, the D volume, and the E or F volume. Underneath them are VHDs. There's a resource disk, which is C. That's where your paging file is stored, that's where crash dumps go, and that's where you can put cache data. That is a dynamic VHD. It dynamically expands. It's sitting on a striped volume across 5 disks on those blades. Then there's a Windows VHD with a differencing VHD on top of it, so a base VHD with a differencing VHD, which allows multiple VMs to share the same base. And then we've got your role, which is also a VHD, and that's going to be drive E or F depending on your updates. All of these VHDs are sitting on that stripe, along with the VHDs of other VMs that are on that same server. This is what you would see with disk management. Here's D, C, and E for a role VM.

For IaaS it's a little bit different. The drive letter mappings are different, so the OS is on C instead of D. It is sitting on top of a RAM cache, which is sitting on top of a local disk cache, which is sitting on top of the stripe. Then D is the temporary disk. It's what we call temporary. In the PaaS world it's called the resource disk.
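Coming back to the delta-VHD optimization for a moment, here is a minimal sketch of the idea: record only the blocks that changed between last month's base image and this month's, ship the compressed delta, and patch the base on each server. The block size, file format, and names are assumptions for illustration, not the actual tooling.

```python
# Illustrative sketch of a block-level VHD delta: record only the blocks that changed
# between the old and new base image, compress them, and apply them on each server.
import gzip, pickle

BLOCK = 512 * 1024   # assumed block granularity for this sketch

def make_delta(old_path, new_path, delta_path):
    deltas = []
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        offset = 0
        while True:
            a, b = old.read(BLOCK), new.read(BLOCK)
            if not b:
                break
            if a != b:
                deltas.append((offset, b))       # changed (or newly appended) block
            offset += BLOCK
    with gzip.open(delta_path, "wb") as f:
        pickle.dump(deltas, f)                   # small delta instead of a full image

def apply_delta(base_path, delta_path):
    with gzip.open(delta_path, "rb") as f:
        deltas = pickle.load(f)
    with open(base_path, "r+b") as base:         # patch the local copy of last month's base
        for offset, block in deltas:
            base.seek(offset)
            base.write(block)
```

A real tool would also handle an image that shrinks; this sketch only rewrites changed blocks and appends new ones.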
That temporary disk is where the paging file and crash dumps go for IaaS VMs. It is a dynamic VHD also sitting on the stripe. And then the data disks that have no cache, those are not using any local disk; they are talking directly to blob storage. Here's a summary table that you can take a look at on the slides for the different sizes of these things, the VHD types underneath, and where their storage is backed. You can see that the persistent IaaS non-cached disk is backed by Windows Azure storage alone. You can see that for the IaaS persistent disk that is cached, there's a local cache, and that one is backed locally plus by Windows Azure storage.

The burning question people have about the resource disk or temporary disk is, "When can I count on the data being there?" Basically, never count on it being there. That doesn't mean that it's just going to go away. The bottom line is that that is sitting on a physical server, and if that server dies, you will lose it when that VM gets reincarnated. We don't mess with it for any other purpose, though. When we update the host OS, we leave those alone. When you update the guest OS, we leave that alone. When you deploy a role update, we leave it alone. So you will have it there, which is why earlier I said cache your artifacts there. Cache them there on that thing. That disk is what we leave alone. The other ones get messed with in the PaaS world. Your role VHD, the OS VHD: every time you do an update the role VHD is going to change, so you'll lose anything that you put there. And if you do a repave, you will lose things there.

[Demo: Disk Performance] Let's take a quick look at perf. Our performance goal is that we will try to give you 500 IOPS off of each IaaS disk that's non-cached. And I've got Mr. Test. Where's Mr. Test? Here we go. And I'm going to start focusing on disks. What you can see here is the C drive; here is F or D—there's the temporary storage, which I've actually split into 2 volumes as part of my testing that I was doing earlier—then you can see I've got a data disk here, which is drive F, and then 2 other data disks which I've striped to create a 2TB striped disk across 2 Azure IaaS persistent disks. So the bottom line is I've got this OS disk, which is backed by Windows Azure storage but with a local RAM cache and disk cache on the stripe. I've got the resource disk, which is fully on the local disk stripe. I've got one data disk, non-cached, backed by Azure storage, and then I've got 2 disks in a stripe backed by Azure storage.

Let's take a look at the performance of each of those, and I'll show you first of all by taking a look at the performance of this F drive, which is a—let me zoom out here—F drive, which is the single disk. Let's go fire this up, Iometer, which is doing 8K I/Os, queue depth of 32, 8K aligned, which is what you'd see represented with SQL types of workloads. And you can see that we're getting about 1000 IOPS from Windows Azure storage. I mentioned that our goal is 500. We are going to institute a cap of 500 shortly. In about a month or 2 we will start to cap at 500 IOPS per disk. That will give you more consistency, and it will give us more control over our resource utilization on the back ends. Right now we basically have it throttled wide open, so you're seeing wide open performance. Tip on performance if you're going to do a test like this: let it run for about 45 minutes, because the harder you hit your disk, the more storage adapts to you.
If you're hitting the disk hard, what it does is split partitions and basically tries to isolate your disks from other ones that are also busy. And so what you will then get is your IOPS will increase. It does the split 4 or 5 times to get you to peak performance. And so after about 45 minutes you'll start to get performance up in that range.

Let's go take a look at my stripe. You'd expect for a stripe to get about twice the performance, so let's see if we get about twice the performance. And actually, unfortunately, we're not getting twice the performance right now, and I can explain why. I was testing the stripe with Iometer running continuously up until about 4 hours ago. Once you let the drive go cold, what it starts to do is pair you with other hot drives. So you will start to then have contention on the back end with other hot drives. What you will see as this runs over the next 10 or 20 minutes is it'll go up and start hitting 2000 IOPS, about twice the one disk. Again, though, when we put in the caps you're going to be capped at 1000 IOPS for this guy because that's 2 x 500.

Then for local storage, or for the C drive, which is the cached one, interesting behavior here. I've got Iometer working on a 30GB file. 30GB is pretty big. The local cache is not that big in most cases. It depends on the size of your VM. And what you start to see are the effects of the local stripe. So this is now being buffered by the cache on the local stripe, and so now we start to see more like single spinning spindle (actually striped spindles) kind of performance, down in the low 100s of IOPS. The stripe is shared, again, with other VMs, so your performance is going to vary, probably quite a bit. There's also another aspect, which is that this cache is expanding as you write to it, and so you will have expansion performance too. As you expand it, you'll get better performance off it. But while it's expanding, it will be slower. And I've already expanded this guy out.

And then I want to show you the difference between fitting in the cache and not fitting in the cache. I've got a SQL I/O test here, which I've only given a 10GB file. Let me stop this. Where did Iometer go? Let me stop this. I've got a SQL I/O configuration here, and I'm going to run a random 8K read off— Is that an 8K write or a read? I'll do an 8K read to show you the performance of the cache. And this is a 10GB file, which all fits right in the cache. So what you're probably going to see is some pretty dramatic IOPS come off this thing because it happens to fit in the cache. So give that about a minute. And drum roll. >>[audience member makes drum rolling sound] [Mark R.] There we go. There's our drum. Let's see what we've got. [humming "Jeopardy" theme] It's a long minute, isn't it? That's the way the SQL guys get high TPC numbers. [laughter] Their minutes are very long. All right. And what do we get? Whoa! That's some serious IOPS right there. [chuckling] That's actually more than I expected. Close to 60,000. Not bad. But again, your mileage is going to vary because you're sharing that with other VMs. But that's the kind of performance you can get off there.

Oh, and then the final thing is that if we did the same thing with this temporary storage disk, we'd see the local disk shine completely through again on this. Oh. What did I do wrong? I somehow clicked on something and screwed this up.
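For readers who want to run a test along these lines themselves, here is a simplified stand-in for an Iometer/SQLIO-style run: random, 8K-aligned reads against a large pre-created file. It is only a sketch; it is single-threaded, uses a queue depth of 1, and goes through the OS file cache, so the numbers aren't directly comparable to the demo's, and the file path is a placeholder.

```python
# Simplified random 8K read test: not Iometer or SQLIO (single-threaded, queue depth 1,
# and it goes through the OS file cache), but enough to compare relative IOPS.
import os, random, time

def random_read_iops(path, seconds=60, io_size=8 * 1024):
    size = os.path.getsize(path)
    blocks = size // io_size
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))  # O_BINARY only on Windows
    ops, deadline = 0, time.time() + seconds
    try:
        while time.time() < deadline:
            os.lseek(fd, random.randrange(blocks) * io_size, os.SEEK_SET)  # 8K-aligned offset
            os.read(fd, io_size)
            ops += 1
    finally:
        os.close(fd)
    return ops / seconds

# Example: point it at a large pre-created test file on the disk you want to measure.
# print(random_read_iops(r"F:\testfile.dat", seconds=300))
```

As with the demo, let it run long against a file larger than any cache if you want to see back-end storage rather than cache performance; the roughly 60,000 IOPS result above is the in-cache case.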
Coming back to that temporary storage disk: what you would see is the local disk performance shine through there, and you'll see results that are similar to what we saw for the OS disk.

So, tips here for optimizing disk performance. Each IaaS disk has a different purpose and different performance characteristics. The temporary disk you can lose, so put cache stuff on it, stuff that you don't care about losing. Basically treat it as you would a physical on-premises server: if the server crashed, you'd lose that data. If it's okay to do that, then you can put it there. That's why you put the paging file there. You typically don't need the paging file after you reboot; it's only used for one boot session. So the paging file is going to be on the temporary disk. The data disk is great for random writes and large working sets. You saw that 500 IOPS, pushing up to 100MBps kind of throughput, is what you're going to get off those. With the striped disk you can get awesome performance. And for your OS disk, here's a tip about where you want to put a small SQL database: if it's small and it's going to be hit a lot with reads, it's awesome on the OS disk. If it's big and it's going to be doing a lot of writes, it's better to put it on a data disk with no caching. And always prep your caches by scanning the disks, the relevant sectors. So even if you're on the temporary disk, prep your caches if you want good performance. Same thing goes for the OS disk. And then finally, the tip about hitting Azure storage hard to tell it you want good performance, and it will give it to you.

And that brings me to the end. So I've given you a whirlwind tour. I actually have a ton more really cool stuff to share with you. I could have made it a 2-parter; in fact, I've got so much material that when I put the deck together I was like, "Oh! I have to cut this stuff out. It's really cool." You saw me actually skip over a demo that I realized I didn't have time to fit in. But whether you're new to Azure and have never touched it before, or you're somebody that's been using Azure for a while, I hope that you got a better understanding of what's going on when you're actually deploying VMs onto us, what software infrastructure and what datacenter hardware is underneath you, and I hope you got some tips for how to better take advantage of the platform at the same time. This talk is going to be evolving, so you'll see me back here at TechEd next year talking about the latest datacenter enhancements as well as the latest software enhancements, and there's a bunch of them that we've got under way that I can't talk about yet. I hope that you were inspired by this. I hope that if you haven't touched Azure you go deploy some VMs, and I hope you have a great evening. I hope to see you at my sessions tomorrow, and have a great end of TechEd. Thanks very much. [applause]
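On that last pair of tips (prep your caches, and hit storage hard so it adapts to you), here is a minimal sketch of a cache prewarm: sequentially scan the data you are about to hammer so the caches in front of the OS disk, or the local stripe behind the temporary disk, are already warm. The file path is a placeholder.

```python
# Warm the caches in front of an Azure disk by sequentially reading the data you're
# about to use heavily before starting the real workload.
def prewarm(path, chunk=4 * 1024 * 1024):
    touched = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            touched += len(data)
    return touched   # bytes scanned; later random reads over this range should hit warm cache

# e.g. prewarm(r"C:\data\small_readheavy.mdf") before pointing SQL Server at it.
```

Writing the working set first should also pre-expand the dynamic VHDs, which, as noted above, are slower while they are still growing.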

Mark Russinovich goes under the hood of the Microsoft datacenter operating system. Intended for developers who have already gotten their hands dirty with Windows Azure and understand its basic concepts, this session gives an inside look at the architectural design of the Windows Azure compute platform. Learn about Microsoft’s data center architecture, what goes on behind the scenes when you deploy and update a Windows Azure app and how it monitors and responds to the health of machines, its own components, and the apps it hosts.
