Designing Cloud Architectures

[TechEd2013] It seems the doors have been closed, so that is our signal to get started. First of all, welcome to this session. I hope everyone is having a good time at TechEd, because I haven't. It was quite a stressful situation backstage. I will tell you why in a moment. But before we get started—a brief introduction of myself. My name is Haishi Bai. By the way, Bai is a last name in Chinese. It doesn't mean I'm saying goodbye. It's just a last name. Actually, it means white. This is my second year with Microsoft, and before I joined Microsoft, I'd been working mostly in Silicon Valley for different companies on different projects for about 15 years. So I'd say I've seen a fair number of projects, and I've been involved in quite sizable cloud-based applications, but talking about architecture is still—how do I put it—it's still risky, because whenever it comes to architecture, people tend to have very different and very strong opinions. I don't know how many of you have been in the situation where you're having a heated discussion with your colleagues about architecture. Has anyone been in that situation? Yeah—I see hands going up. Anyone actually punched anybody in the face? Okay—I didn't either, but in my mind, I did it—in my mind. So as I say, talking about architecture is always risky, but I think this kind of session is useful. I hope through this conversation we can give you some ideas, guidance, and information so you can make an informed decision when you're designing your own cloud application. But just to clarify, I'm talking about architecture in general here. It's not about your specific projects, so I'm not discussing those with you, and if you have questions, please hold them until the end of the session because we do have a lot of content to go through. And I think there's an Ask the Experts session happening tomorrow night or Thursday night.
I will be there as well, so you're welcome to stop by to have more discussions if you want. So let's get started. Today we actually have a very simple schedule. First, I will talk about why cloud is different—why we want to design a different architecture when we design cloud applications. Then we'll dig into three important aspects—failsafe design, scale, and integration. These are not all the topics we should cover for cloud design, but these are three important ones. If we have sufficient time, we may talk a little more, but these are the three major topics we'll focus on today. Then at the end, we'll save about 10 or 15 minutes to do a couple of demos. I'm going to show you some of the new features provided by Windows Azure that you can use in your own projects. So that's the plan. Let's get started. First, let's talk about why cloud is different. Here I'm going to show you a couple of quotes from anonymous sources. Basically, these are common misconceptions. Actually, I should stay behind the table, because this morning I just realized my pants are not growing as fast as my legs. So I'd rather stay right here. Okay—so a common misconception—the first one. Actually, this one I've said many times myself. I don't know if you've said it. "That's a hardware failure, not my problem." I've been an engineer for quite a few years, and I've said that on many occasions with relief, because it means my software is perfect. The system failed because there was some hardware failure, such as a failed disk. That's really nobody's fault, right? The disk failed—what can you do? You have to restore the server. It's nobody's problem. However, on the cloud, it's different. You cannot say that any more, because on the cloud, the virtual machines you get from cloud platforms are running on commodity hardware. Those are not top-class, expensive hardware, and they will fail.
They will fail, and even if they don't fail by themselves, the virtual machines might be brought down by the cloud platform for maintenance, such as applying important security patches to take that load off of you. So you need to prepare for failure. This is very important to remember: on the cloud, hardware failure is not an exception any more—it's the norm. It will happen, and your application has to be prepared for that, because your customer doesn't care whether it's a software problem or a hardware problem—they want the service to be available. If the service is unavailable, that's too bad. They will complain to you. You will lose business, so you have to prepare for failures. The second one—"Let's throw more memory into our database server." I've done that many times myself as well. I'm not mocking this method. This is a very valid way to solve some bottlenecks, because it's cheap, it's fast, and it's low risk, but on the cloud, you cannot scale up all the time. Even for physical machines—even on an on-premise system—you cannot just infinitely increase the memory or replace the CPU on the machine. You have a hard limit. The virtual machines you get from cloud vendors are even more limited. Basically, you have a list of options, but you run out of options pretty quickly. Windows Azure just added two more sizes—A6 and A7, in case you haven't heard about them. With A7 instances, you can go up as high as 8 cores and 56 gigs of memory. That's pretty large, but you cannot go beyond that. So when you hit the limit, you have to scale out your application. What's more interesting is that when you subscribe to a service, all you get is an endpoint to that service. You're not getting machines back. You cannot throw memory at it. You cannot throw anything at it. If you get throttled, you have to figure out other ways to work around that limitation.
The next one—"Our service is 99.9% available because we are using Azure." That should be the truth, shouldn't it? Doesn't Azure promise those 9's? Actually, it's higher than 99.9%—99.95%, 99.99%—and our application is only using Windows Azure services, so we should be able to promise our customers that our service is 99.9% available. However, that's not the case, and in a moment, I'll explain exactly why you cannot make that kind of promise. And this is the scariest one—actually, I think I just heard this again last week. "Running out of capacity—that's a good problem to have!" When you think about it, running out of capacity is a problem, but it's never a good problem. When you run out of capacity, that means you could have served more customers, but you didn't. You could have earned more money, but you didn't. Your customers could have been more satisfied, but they weren't. So running out of capacity is a problem, but it's not a good problem. The scariest part of this statement is that when somebody says something like this, it means they are not thinking about scaling at all. Maybe they've heard that cloud is a scalable platform. That's a true statement. Cloud is a scalable platform, but that doesn't mean your application on the cloud is scalable. Scalability is not automatic. Not only do you need to design for it, you also need to plan for it, and we'll explain how in a moment. The next one—"We'll move everything to the cloud!" Isn't this what Microsoft wants? Isn't this what Google wants, what Amazon wants? Is it recorded? Okay, I'm going to say it anyway. I believe this is exactly what Amazon and Google want. Why? Because they have nothing in your on-premise data centers. For Microsoft, it is totally different. We acknowledge that an enterprise is made up of hundreds of applications. You have a lot of legacy systems. It's just impractical to move everything to the cloud.
Even if you set a course to move everything to the cloud, that process will take a long time. Months, years, decades—who knows? So when you design applications, you have to be ready for your application to work with on-premise systems, and Microsoft provides a very unique and comprehensive solution for you to use what we call the hybrid cloud—you can not only make your cloud service (inaudible) and your on-premise system work together, you can make the transition. You can deploy your application on the cloud or on-premise, and because your application is a logical unit, it doesn't matter where you deploy it. You should get the same level of service. We are not totally there yet, but I think we are way ahead of other competitors. Designing for the cloud—as we are talking about those differences, let me just give a brief summary of what we are going to talk about today. First, we will talk about failsafe—again, things will fail. It's not a hardware problem any more. It's a problem for your application, and you need to design for that. Second, scale—scalability is not automatic. If you only take one thing away from this session, it's this: scalability is not automatic. You have to design for that; you have to plan for that. Integration—the application will need to work with on-premise systems. You have to design for integration. So let's just jump to the first one—failsafe. Why do you want failsafe? It's simple—we want availability. We want our service to be available, because if there's no service, there's no money. Let's just review some of the basics of availability. Availability equals uptime divided by uptime plus downtime. So for any given period of time, the more uptime you have, the higher availability your system has. It's straightforward. To achieve high availability, there's no magic. We have to use redundancy. With redundancy, we can achieve higher availability by introducing redundant resources.
This is an example. In this system, I have three servers. Each server provides 90% availability, but when I put them behind a load balancer, the whole system fails only if all three servers fail at the same time. So in this case, by simple math, my system can actually achieve 99.9% availability, and with the same machine spec, if I introduce a fourth machine, I could reach four 9's. But the point here is that availability is not free. When you introduce redundancy, you incur more cost. You need to get more resources allocated. So before you promise those 9's to a customer, you want to ask yourself, "Do I really need to promise those 9's?" Here I'm going to give you an example of the difference between three 9's and four 9's. Let's say you have a system that is available 99.9% of the time. That translates into about ten minutes of downtime each week. But if the system is 99.99% available, that translates into less than two minutes of downtime per week. So do your customers really care about those eight minutes of difference? Can you go lower—because if you can go lower, you can save money on resources. So before you promise those availabilities, you need to ask yourself, "Do I really need to go that high?" Whenever you promise more, you take on more liability. Is that really worth it? Now here we are answering the question: if you have subsystems that each have high availability, will the whole system be highly available? Most systems are made of multiple subsystems, and by the rule of composition, your availability actually goes down. Here's an example. I have three subsystems. Each subsystem provides 99.9% availability. What's the availability of my whole system? Would it be the average of the three? Would it be the minimum of the three? Actually, the whole system works only when all three subsystems are working at the same time, so my availability is actually 99.7%. It's not 99.9%.
Actually, your system availability will be lower than that of its least available component—yes, your system availability will be lower than your least available component. Let's say you have other components with 99.9% availability, but you have one component with 20% availability. Then your whole system cannot go beyond 20% availability. So before you promise those 9's, you have to ask yourself, "Can I really reach those 9's?" And Windows Azure happens to be a highly available system itself, so we can learn from Windows Azure how it is designed—how it uses redundancy to provide high availability. There are quite a few points we can review. First one—storage. Windows Azure storage saves three copies of your data, and if you turn on geo-replication, we save six copies of the data, so you have redundancy in your data layer. And SQL Database—that was called SQL Azure. With SQL Database, when you provision a new instance, we actually have three servers backing that instance for you, so if the primary server goes down, we still have backup servers. Windows Azure Caching—when you use Windows Azure Caching, you can turn on the high availability option, which means your data will be saved twice on different nodes of the cache cluster, so that if one of the nodes fails, you can still get your data from the other node. Of course, I think everybody knows this: when you deploy a website or cloud service on Windows Azure, you can have multiple instances for scaling as well as for redundancy. And you can do the same thing with virtual machines—not everybody knows this, but you can actually join virtual machines behind the same load balancer to provide redundancy.
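The availability arithmetic the talk walks through—redundancy in parallel, composition in series, and the downtime implied by each figure—can be checked with a short script (a sketch; the percentages are the ones from the slides):

```python
# Availability basics from the talk: redundancy in parallel vs. composition in series.

def redundant(per_server: float, n: int) -> float:
    """n servers behind a load balancer: the system fails only if ALL n fail."""
    return 1 - (1 - per_server) ** n

def composed(*subsystems: float) -> float:
    """Subsystems in series: the system works only when ALL of them work."""
    total = 1.0
    for s in subsystems:
        total *= s
    return total

def weekly_downtime_minutes(availability: float) -> float:
    """Downtime per week implied by an availability figure."""
    return (1 - availability) * 7 * 24 * 60

print(round(redundant(0.90, 3), 4))               # 0.999  (three 9's)
print(round(redundant(0.90, 4), 4))               # 0.9999 (four 9's)
print(round(composed(0.999, 0.999, 0.999), 4))    # 0.997  (composition pulls availability down)
print(round(weekly_downtime_minutes(0.999), 1))   # 10.1 minutes per week
print(round(weekly_downtime_minutes(0.9999), 1))  # 1.0 minutes per week
```

Note how the two rules pull in opposite directions: parallel redundancy multiplies failure probabilities, while serial composition multiplies availabilities.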
And there's built-in redundancy in Windows Azure virtual network gateways, so when you provision a new Windows Azure virtual network, the gateways we provide have built-in redundancy, so we can provide a highly available connection between your virtual network and your on-premise network. At an even higher level, if you have multiple deployments across different data centers, you can use Windows Azure Traffic Manager to set up either a load balancing rule or a failover rule, which means if the primary data center crashes because of some disaster, you can re-route user traffic to a backup data center. So you can see, there are many things Windows Azure provides to help you achieve high availability, but that's only part of the story. The other part of the failsafe design story is reliability. Here again, we are going to do a really quick review—basically a review from a computer science course. My system is available at the beginning, and then it fails; it takes some time to be repaired, and statistically, this is the mean time to repair, or mean time to recover. Then the service is running again before it fails again—that is the mean time to failure. There's another measurement we could use for reliability, but we'll ignore that for now. The reason I'm showing this is that on an on-premise system, the way we increase the reliability of the system is to focus on making the mean time to failure longer. Why do we do that? Because on an on-premise system, the mean time to repair is really long. Imagine your hardware fails—you have to replace it or procure new hardware and then bring it online—that time is very considerable. Let's say your primary system fails, and you need this much time to fix it; your backup system has to stay healthy long enough for you to fix the primary system.
This is kind of a chicken-and-egg problem: because it takes longer to repair, you need to keep the backup system running for a longer time. So this is very expensive and complex to do. On the cloud, we take a totally different approach. Instead of focusing on making the mean time to failure longer, we try to make the mean time to restore as short as possible—on the order of minutes—so that when your service fails, it can be quickly restored, in many cases without any action from you. So in this case, we are increasing system reliability by shortening the mean time to restore, and we don't require very high-end hardware to support a longer mean time to failure. You can see this principle used across Windows Azure, and a lot of those features are available to you as well. For instance, auto recovery: when you deploy a service instance to Windows Azure, if it fails or doesn't respond to the health probe, we will recycle your role, or if the machine failed, we will allocate a new virtual machine, deploy the service, and restart it automatically. You don't even need to do anything about it. And fault domains—basically, in the data center, we have those huge racks of servers, and each rack has its own power supply, cooling supply, and network interfaces, so basically, each rack is a fault domain. For instance, if the power fails on a rack, all the servers on that rack will go down, but Windows Azure will deploy your services across different racks, so that even if one rack goes down, you still have instances on other racks. And you can do the same thing with virtual machines: if you directly provision virtual machines, you can use an availability set to allocate virtual machines across different fault domains as well.
And when we upgrade your service, instead of taking down all your instances at the same time, we actually do an upgrade domain walk, which means we'll take the instances down group by group, so that as we are upgrading one group of instances, you still have other instances serving your customers. And we support multiple deployment environments. We have staging environments and production environments, so you can deploy a new version to the staging environment and test it, and once you are satisfied with the version, you can do a VIP swap to promote staging to production, and your production becomes the staging environment for the next version, and it just goes on like that. Also, to reduce the mean time to restore, we have provided very comprehensive diagnostic and debugging support. Actually, at the end, I'm going to show a demo of one of the new features we're providing in Visual Studio. First, we have a simulator, so we can simulate the basic cloud environment on your local machine. That's quite unique, and that's quite powerful, because it means you can locally debug your cloud services before you deploy to the cloud. Everything is local—you can trace, you can step through, you can do everything just like with a local program. We also have IntelliTrace, which allows you to capture and play back server states on your local machine. Basically, if an error happens, you can grab the IntelliTrace file and play it back in your Visual Studio, and you can actually trace where the errors are. You can see the details of exceptions, and you can even see the context when a function is called—you can see the parameters of the function. It's really powerful. And we keep improving our diagnostic capabilities. Like I said, I will show you a brief demo after this. And of course, we have first-party and third-party support for telemetry.
We have the Management API and Management Portal, performance counters, and diagnostic data, and you can use third-party tools, which you can buy from the Windows Azure store, such as New Relic and AppDynamics. You can collect all that telemetry data and monitor your service's health. And—I will talk about this in a later slide—we also provide the Transient Fault Handling Application Block for you to handle transient errors. So that's what Windows Azure provides out of the box. What does that mean to you? What do you need to do to design for failsafe? It depends, of course. Every project is unique in its own ways, but we can abstract some general practices. The first one—take advantage of those features. We just talked about plenty of Windows Azure features; make sure you take advantage of them. Avoid single points of failure. I don't think this one needs too much explanation. You need to build redundancy across all your layers. It's not sufficient to build redundancy just in your application layer or just in your database layer. You need redundancy across all your layers. Otherwise, the remaining component becomes a single point of failure—if it goes down, the whole system goes down. Failure mode analysis—reliability engineering itself is a very large topic, but this is one of the general practices you want to follow: you analyze your code to see where it may fail and decide how you will deal with it. Transient errors—for those of you who don't know about them, transient errors are temporary errors that may happen once; if you retry, it will be okay. There are many different reasons for transient errors. It may be because of a network problem, or it may be just because the service is throttling your call. Transient errors can occur at any time.
You cannot predict them, and the way to handle those errors is to retry—no matter how simple the method call is. Let's say you're using a Service Bus queue, and you're calling the method to check whether the queue exists. That's a single method call returning a Boolean value. It can still fail because of transient errors. So to make your code reliable, you really need to put retry logic around those calls. But you can imagine, if you put all those retries around each line of your code, your code will be really hard to read. That's why Microsoft patterns & practices provides the Transient Fault Handling Application Block, which allows you to define retry policies and to write really clean code with the retry logic in it. For instance, you can define a policy that says, "I want to retry this method at a one-second interval," or "After the first failure, I will retry after one second, then after two seconds, then after four seconds," so you can progressively back off. So make sure to check out that application block. And graceful degradation—this means when some of your subsystems are down, instead of stopping to provide service, you try to keep your service up, maybe with limited functionality, because you don't want to lose business. I'm going to give you an example. Let's say we have an online ordering system with two subsystems: one takes orders, and the other processes the orders. Let's say the second subsystem goes down. What do you do? Instead of stopping serving your customers, you should still take in orders. Right—we still want that business. But as you're taking in the orders, what you can do is inform your customers that they can still place orders, but the orders will be processed at a later time. This is what we mean by graceful degradation.
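The retry-with-back-off policy just described can be sketched generically (this is an illustration of the idea, not the actual Transient Fault Handling Application Block API; `queue_exists` is a made-up stand-in for any call that can fail transiently):

```python
import time

class TransientError(Exception):
    """A temporary failure that is expected to succeed on retry."""

def with_retries(call, max_attempts=4, base_delay=1.0):
    """Retry `call` with exponential back-off: wait 1s, then 2s, then 4s between attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                          # out of attempts; surface the error
            time.sleep(base_delay * 2 ** attempt)

# A hypothetical flaky call: fails twice with transient errors, then succeeds.
attempts = {"n": 0}
def queue_exists():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("service is throttling the call")
    return True

print(with_retries(queue_exists, base_delay=0.01))  # True, after two retries
```

The point of centralizing the policy in one helper is exactly what the application block gives you: the retry rules live in one place, and the call sites stay readable.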
Actually, for most online ordering systems, orders usually take days or even weeks to be fulfilled, so maybe those several hours of difference don't matter to the customer at all. So with graceful degradation, you are keeping your service available. That matters for the 9's you promise to a customer: your service didn't go down, even though a subsystem may have failed. Eliminate human factors—humans are very smart and creative creatures, but humans are extremely unreliable. We may have the knowledge, but maybe we are just in a bad mood, or maybe we are just getting lazy, or we just space out for no reason. Yeah, that was me spacing out. So I hope I made the point. A human is very unreliable, so you should eliminate human factors as much as possible. Whenever you see a chance to automate something—some process—you should do that. Now we come to the next topic—scale. To make the topic a little easier, let's first review some of the basics, just to warm up. Scale—I often hear about scaling up and scaling out. So-called scaling up means increasing the utilization of your existing resource. Let's say I have one instance here. I can use parallelization and multi-threading to make higher usage of my resource, or I can buy a bigger machine—subscribe to a bigger virtual machine—to gain more throughput, but there's always a hard limit. When I reach that instance's capacity, I cannot go higher any more. In that case, I have to scale out by adding more instances. But here's the tricky question—can I just go up like this infinitely? Maybe, okay? Actually, that's a good answer—maybe.
But you cannot do this automatically, because when you scale out, you're bringing up more and more servers, but your servers always have some kind of dependency on some kind of service, like a database or (inaudible) services, and those services will throttle you. You cannot just go on forever, right? So in this case, if you keep adding more instances, you're not gaining more throughput, because you have a bottleneck in your system. The way to deal with this is to segment your workload. Basically, you have to plan to segment your workload across multiple service entities—for instance, multiple queues, multiple topics, multiple storage accounts, multiple databases, or even multiple Azure subscriptions—because no matter what you do, there is an ultimate limit. So if you want to go higher than this red line, you have to segment your workload, and then you can actually keep going. As you can imagine, not only do you need to design for scaling, you need to plan for scaling as well. What we used to do on an on-premise system is that when we prepared for possible workload changes, we prepared for the worst case. Why? Because provisioning new resources in an on-premise data center is really a long process. It takes weeks, or even months, to buy a new server, so we had to prepare for the peak—to have sufficient throughput to handle those workloads. But of course, in this case, a lot of resources are wasted. They are just sitting idle most of the time. On the cloud, it is totally different. Instead of preparing for the worst, we can identify what we call a scale unit. Instead of preparing for the highest possible peak, I'm going to identify a scale unit, which is defined as: to serve this number of customers, I need this number of servers, this number of queues, this number of storage accounts. And as my workload changes, I can just add and remove scale units to adjust my throughput.
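The scale-unit idea can be made concrete with some back-of-the-envelope arithmetic (the unit composition and capacity below are assumed numbers for illustration, not figures from the talk):

```python
import math

# Hypothetical scale unit: the resources it takes to serve 1,000 customers.
UNIT_CAPACITY = 1_000
UNIT_RESOURCES = {"servers": 2, "queues": 1, "storage_accounts": 1}

def units_needed(projected_customers: int) -> int:
    """Provision whole scale units for the *projected* load, not the current one."""
    return math.ceil(projected_customers / UNIT_CAPACITY)

def resources_for(projected_customers: int) -> dict:
    """Total resources to provision: each scale unit is deployed in full."""
    units = units_needed(projected_customers)
    return {name: count * units for name, count in UNIT_RESOURCES.items()}

print(units_needed(4_200))   # 5 scale units
print(resources_for(4_200))  # {'servers': 10, 'queues': 5, 'storage_accounts': 5}
```

Because you add and remove capacity in whole units, sizing the unit larger (per 1,000 or 10,000 customers instead of per 100) directly reduces how often you have to perform scaling operations, which is the trade-off the talk describes next.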
However, you cannot do this reactively. Let me try to move the pointer here. You see the yellow line above the white line here? That means your system is already running out of capacity. So when you detect that and try to scale out, there will be lag, because although scaling out is fast, it's not instantaneous. To deal with this situation, you have to do some kind of prediction. In other words, you have to be able to shift this curve to the left, so that when you see your system is about to run out of capacity, you scale out beforehand. This is not always easy to do. It requires quite accurate measurements. It requires quite accurate predictions of workload change, and there will be a lot of scaling up and scaling down operations. So what we usually suggest is to design a larger scale unit: instead of sizing your scale unit per 100 customers, you design a scale unit for 1,000 or even 10,000 customers. In this case, you don't need to do the scaling operations that often, and you can still satisfy your workload changes. Scaling in Windows Azure—in case you don't know, Windows Azure happens to be a scalable system as well, so let's see what Windows Azure does to achieve scalability. Of course, you can scale up by using different VM sizes. You can scale out by adding more instances. You can use the Autoscaling Application Block, and you can use data sharding to shard your customers' data into multiple Azure SQL databases. This is what I just talked about—the way to segment your work: you can use multiple service entities like queues or storage accounts to segment your workload, even across multiple subscriptions. And we have CDN, the content delivery network. Basically, you can cache your content, especially static content, on the CDN nodes, so that customers can be served directly from a CDN node instead of going back to your server.
So this will not only give you a performance benefit, it also helps you scatter your user traffic across those CDN nodes, so that with one server instance, you can actually achieve higher throughput. You can use caching to offset some workloads. For instance, if you find your database is becoming a bottleneck, you can use caching in your application layer so that you don't go back to the database to run repetitive queries that often; that way, you can remove that bottleneck and handle higher workloads. That's what Windows Azure provides out of the box. What do you need to do to design for scaling? Again, it depends, because each project is unique—I'm just going to say it one more time today—but some general practices apply. First, as we said, you need to design for your capacity. You need to design for scaling and plan for scaling, so cloud is not killing the IT pro's job; the IT pro's job is reoriented to provide even more value. You need to design your system so that it will scale well to serve not only 100 customers but a million customers. Proper decomposition of the system—this is too big a topic to discuss in detail. Basically, in my opinion, that's the number one job of an architect: to properly decompose the system, to identify what the components are and what the connections among them are. Only when you achieve this can you scale different parts of your system independently. Let's say you identify your data layer as a bottleneck. You can scale out the data layer separately. But if you have a strong dependency between the application layer and the data layer, scaling the database layer may have a side effect on the application layer. For instance, maybe your application is not designed to take in the additional workload—things like that. Stateless design—the most scalable components are stateless, which means each request can be handled independently. It's not affected by any request before or after it.
In this case, your workload can easily be distributed to all your instances, and we can scale out to add more capacity without affecting existing instances. But we do realize that not all systems are designed this way. Designing a totally stateless system is not necessarily easy, so I'm going to talk about how to transition a very stateful system into a stateless system in the next slide. And scale at all layers—I don't think I need to explain this. You need to scale all layers; otherwise, the layers you don't scale will become your bottleneck. Throttling—scaling is about how many customers you can effectively serve during a period of time, but you always have limited resources. To make those customers effectively share your resources, you have to throttle. You cannot afford for a single customer to generate a really heavy workload and kill your whole system. So you have to play fair and make sure all the customers stay within their quotas. And here, let's talk about designing a stateless system. Like I said, we can roughly categorize systems into three types: stateless, shared states, and global states. A stateless system is really easy to scale out. Basically, because each request is handled independently, we can easily have more instances to handle more requests—no problem. Shared-state systems have the notion of a session—for instance, a shopping cart. When the user has a shopping cart, the request has to go with the same shopping cart; otherwise, he's shopping for somebody else. For this kind of system, it's actually not that hard to change it to stateless. All you need to do is separate your state and your requests, and there are easy ways, especially for web requests. You can use (inaudible) state storages, and I will talk about one in the next slide.
But on the other hand, even if you don't do this—if you don't separate the state and your actions—you can still scale out by sessions. Basically, you use sticky sessions or server affinity so that when a user opens a session, all the subsequent requests are handled by the same instance. You can do that. You can still scale your system, basically by user sessions. The hardest one is global states. For global states, although here I can draw a diagram showing that there is some component that knows about all the requests, in reality it's really hard to draw a diagram for this kind of system. Usually, this kind of system has a singleton instance that assumes it has the global knowledge. I've been trying to think of a good example without referring to any political concepts, so here I'll just make one up. Let's say you have a dispatching system—a bus dispatching system. You have a dispatcher to dispatch buses to different routes, and now suddenly you have more buses to handle, so you add a second dispatcher. These two dispatchers each assume they know about all the buses. Have you ever had the situation where two dispatchers send the same bus, and the bus is lost somewhere? This kind of system is really hard to scale out—it's still possible, but you need to do some more work. The way to scale it without any code change is to add some kind of routing mechanism in the middle. Basically, you are scaling by customer, by tenant. Let's say you're still allocating a dedicated server for customer A, another server for customer B. You can still do this, but you can see the flexibility and the real scalability are really low in this case. So what we suggest is to make these transitions: first, eliminate the global states. Global states are bad regardless of whether you're in the cloud or on-premises. Then, going from shared states to stateless is just a matter of separating the states and the actions.
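The "routing mechanism in the middle" for a global-state system can be sketched as a thin router that pins every tenant to one dedicated backend, so each backend still has a consistent "global" view of the tenants it owns. This is an illustrative stand-in (all names hypothetical), not a real load-balancer API.

```python
class TenantRouter:
    """Scale a global-state system 'by tenant': each tenant is pinned to
    a single backend instance, and all of that tenant's requests go there."""
    def __init__(self, backends):
        self._backends = list(backends)
        self._pinned = {}              # tenant id -> backend

    def route(self, tenant_id):
        if tenant_id not in self._pinned:
            # Pin on first sight, round-robin over available backends.
            backend = self._backends[len(self._pinned) % len(self._backends)]
            self._pinned[tenant_id] = backend
        return self._pinned[tenant_id]
```

As the talk notes, this works without code changes to the backends, but the scalability is coarse: a single tenant can never grow beyond one backend.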
There are two basic options you can use. First, you can externalize the state. For instance, we have a cache session state provider. That's a really long class name, but it works. Basically, it's a session state provider. If you've used sessions, you can just configure this provider in your application and operate on sessions as you always have. The session state is saved outside your process, so you can have stable, shared sessions. The other way—and this requires you to change the architecture of the system—is to separate actions and states. I'm not saying this is the only way, but a feasible way is to use a job queue. Basically, you capture the state in the jobs, and the job creator and the job handler are essentially just action executors. They don't have state themselves; the jobs carry their own state. Integration with on-premise systems: that's the third topic. Let me do a check on time. I think I'm doing just right, okay. Integration with on-premise systems—and here, I'm going to tell the story of why I was so stressed this morning. Some of you may have noticed that on the title slide I have a second speaker. Mr.—suddenly the name escapes me, sorry. See, humans are not reliable. He is from Argentina, and he brought a very nice demo—you know the gas pump controllers you see at the gas station. What their company does is integrate those old gas pump controllers with Windows Azure. I mean, those devices are really old—like, 10 years old. They are not designed for the cloud. They are not even designed for network connectivity. But they did an amazing job integrating those systems with Windows Azure. What happened is that because the system was shipped from Argentina, when it passed customs there were certain inspections going on. Basically, the machine was torn apart. All the parts were pulled out, dangling everywhere, so the machine was certainly broken.
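Returning to the job-queue option from the beginning of this section: a minimal sketch of separating actions from state, where each job carries all of its own state and the handlers are stateless functions any worker instance could run (all names are illustrative).

```python
from collections import deque

# Stateless action executors keyed by job kind; all state rides in the job.
HANDLERS = {
    "resize": lambda payload: ("resized", payload["image"]),
    "email":  lambda payload: ("sent", payload["to"]),
}

def make_job(kind, payload):
    """A job is pure data: the kind selects a handler, and the payload
    carries every piece of state that handler needs."""
    return {"kind": kind, "payload": payload}

def run_worker(queue):
    """Drain the queue. Because handlers hold no state of their own,
    any number of identical worker instances could run this loop."""
    results = []
    while queue:
        job = queue.popleft()
        results.append(HANDLERS[job["kind"]](job["payload"]))
    return results
```

Because the workers are interchangeable, scaling out is just starting more of them against the same queue.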
So for this part, my original plan was to do a demo and just talk about the integration, but this is my plan B, so I'm just going to go through the slides. This is a failover scenario again, in real life. Integration—I don't think we need to go through this diagram, but the key point is that not all systems or all workloads should be migrated to the cloud. Before you embark on a project to migrate something onto the cloud, you need to ask yourself, "Is it worth it? Can I actually do it?" First, you have to be clear why you want to move to the cloud. There have to be some goals. If you just want to have something nice on a resume, that's a different story—but that's also a goal, right? You have to have some goals, and then you have to estimate how much effort it will take to achieve that goal, and whether it's worth it. And because on-premise systems are interlinked with each other, you have to analyze what kind of impact there will be—not only at the technical level, but also at the business level. This is something that's very easy for a lot of technical people to ignore: when you migrate something from on-premises to the cloud, you are potentially changing your business workflow. This is a big deal. This doesn't only affect your IT department; it will impact your whole business. So before you do that, you have to answer those questions. The reason I am showing this diagram—actually, I want to show the red dot—is that not all workloads and systems should be migrated to the cloud. That's why we need integration, and general integration strategies. I think I will just skip this one because it should be really straightforward. Usually, how do we integrate different systems?
We use data—we use shared data, where we read from the same database, or we do data import/export. We integrate with direct invocations, where you expose an API from one system and call it from the other. Or we use some middleware like BizTalk or whatever to facilitate the integration. As for the topology of the integration, we can have direct connections, or we can have the hub-and-spoke architecture. Of course, hub-and-spoke reduces the number of connections, but on the other hand, it introduces a single point of failure. That's why a lot of people—this is not me saying it; there are a lot of people in the industry—prefer to use middleware like an enterprise service bus to integrate systems. Although in the diagram the service bus—this guy in the middle—is still one piece, in reality it can be implemented as a distributed system with high reliability and availability. So although in a diagram they may look the same, underneath the principle is rather different. But that's not the main point I'm going to discuss today. So we'll see how Windows Azure can help you achieve those integration goals. We have message-based integration. We have Windows Azure Service Bus. We have (inaudible) topics—all this wonderful notification support. You can achieve message-based integration easily. We also provide you with connectivity support: the Windows Azure virtual network. I'm going to show a demo of the Windows Azure virtual network in a moment. Basically, it provides you very strong connectivity between Windows Azure machines and your on-premise machines. When we talk about integration, we have to handle the problem of authentication and authorization. In this area, Windows Azure is quite unique, because we have Windows Azure Active Directory, which allows you to integrate your on-premise Active Directory credentials with your cloud credentials.
We also have the Azure Access Control Service to provide single sign-on, not only with Active Directory users but with users from other social networks—Facebook, Google, Yahoo, or what have you. Also, the Windows Azure ecosystem is more open than ever before. We have a very inclusive ecosystem. Not only can you use .NET and Windows; you can use Linux, you can use Node, Java, PHP—and now I think we are adding Ruby as well. Windows Azure is embracing all the standards. We support (inaudible) APIs over HTTP. We support OAuth, OData—actually, that's a standard Microsoft started—and WS-*, such as WS-Federation and WS-Security. And what does that mean for your application? That's the last time I'm going to say it: it depends, because every project is unique. But there are some general practices. Loose coupling—when your systems are interlinked together, it's really hard to change or migrate anything, so if your components are loosely coupled, you can easily take them apart and host them in different environments. Minimum exposure—that means when you expose an interface, you should minimize the information or the complexity you expose, because the more you expose, the more people will take a dependency on it. Then when you make a change, it will be really hard to change those components without affecting dependent components. Integration isolation—actually, this is a term I cooked up. I don't think it's a good one, but what I mean is that when you are integrating with other systems, you should try to isolate the integration code in a central place. Let's say you are integrating with this machine. You should have a repository, maybe a proxy, that handles all the interaction with that system, so that when you switch to a different system, you can easily swap it out without affecting the rest of your system. Secure by default—and I've seen this many times.
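The "integration isolation" idea—funnel every interaction with an external system through one adapter—can be sketched like this. The gateway interface below is entirely made up for illustration; the point is that only the proxy knows the vendor's API shape.

```python
class PaymentGatewayProxy:
    """All knowledge of the external system lives here. Swapping vendors
    means changing only this class, not the rest of the application."""
    def __init__(self, gateway):
        self._gateway = gateway        # the only code that sees the vendor API

    def charge(self, customer_id, amount_cents):
        # Translate our domain vocabulary into the vendor's request shape.
        response = self._gateway.submit({"cust": customer_id, "amt": amount_cents})
        return response.get("ok", False)

class FakeGateway:
    """Stand-in for a real vendor SDK; also handy for testing."""
    def __init__(self):
        self.requests = []
    def submit(self, request):
        self.requests.append(request)
        return {"ok": True}
```

A side benefit of this isolation, as the example shows, is that the external system can be replaced with a fake during testing.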
Because the two companies have a very strong partnership, they say, "Hey, just send me the text, and I will deal with it." But then the requirements change—"Hey, how about we do this over the Internet instead of the intranet? How about we do this with other partners?" So when you're integrating with other systems, the general principle is to be secure by default. You should employ either transport security like HTTPS or message-level security such as encryption or signatures, so that you can securely talk to other systems. And use standards. More and more people are using standards these days, and so should you—if you implement standard interfaces, other customers may be able to integrate with you without you doing anything. Let me check on time, okay. And let me say a few more words about message-based integration. Here I'm basically assuming you're using Windows Azure Service Bus, and you have five systems to be integrated. Instead of having them directly invoke each other, what you can do is create a job queue, so that one system can generate jobs, and these jobs will be consumed by the other systems. You can use topics to broadcast the jobs to other systems. The key difference here is that with the queue, system B and system C are actually competing for the job, so only one of the systems will get the job, but in the case of topics, both system D and system E will get their own separate copies of the job. So that's the difference: one gets it, or both of them get it. When you use message-based integration, because it's a very flexible architecture, you can easily extend the architecture without affecting the running system. For instance, if I now add a system F, it can just add a subscription to my queue, and my existing system doesn't even need to know about it. It just appears online, takes jobs, and provides me more throughput.
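The queue-versus-topic difference just described can be demonstrated with two toy classes. These are in-memory sketches of the two delivery semantics, not the Service Bus API.

```python
from collections import deque

class Queue:
    """Competing consumers: each message is delivered to exactly one receiver."""
    def __init__(self):
        self._messages = deque()
    def send(self, message):
        self._messages.append(message)
    def receive(self):
        return self._messages.popleft() if self._messages else None

class Topic:
    """Publish/subscribe: every subscription receives its own copy."""
    def __init__(self):
        self._subscriptions = {}
    def subscribe(self, name):
        self._subscriptions[name] = deque()
    def publish(self, message):
        for sub in self._subscriptions.values():
            sub.append(message)
    def receive(self, name):
        sub = self._subscriptions[name]
        return sub.popleft() if sub else None
```

With the queue, whichever of system B or C receives first gets the job and the other gets nothing; with the topic, systems D and E each drain their own copy.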
Let's say I change my system A to a system G. Because it generates the same jobs, the other systems really don't care whether it's system A, system G, or even some human-generated jobs. Or I can even have a system H pumping jobs into the same pipeline. The other systems don't care; they will still just work as before. So message-based integration—this is not me recommending it, or Microsoft recommending it. Actually, this is a quite commonly acknowledged integration pattern adopted by the industry, so whenever you integrate two systems, especially if the systems are on different technology stacks, you should consider message-based integration. A moment ago, we talked about how Windows Azure provides connectivity between your on-premise network and your Windows Azure network, and here's a little diagram to explain the configuration. On the right, I have my local network with some machines, and on Windows Azure I have a virtual network to which I've connected some virtual machines. What you can do is set up a site-to-site connection, which means you can actually have those machines connected as if they were on the same local network. They share the same local network address space. You can ping each other; you can send files to each other. It's just like they are on the same network. Now with the virtual network, we are providing point-to-site connections as well, which means from any laptop, after you install a VPN client, you can actually connect to your virtual network from anywhere in the world. Actually, I'm going to show you that in a moment. So those are the two major integration offerings: message-based integration and the virtual network. How many of you are familiar with Windows Azure Active Directory? Heard about it—okay, about one-third. So here I'm just going to go through this quickly.
Windows Azure Active Directory is a directory in the cloud that helps you provide authentication for your cloud applications, and what's unique about it, I should mention, is that you can integrate it with your on-premise Active Directory so that it projects your on-premise user credentials to the cloud, and your enterprise users can log in to their cloud services using their domain accounts. Here I'm going to show you a little animation. Basically, you configure your application as a relying party that trusts Windows Azure Active Directory, which means: I'm relying on Windows Azure Active Directory to do the authentication for me. When a user tries to access the application, the WIF module that got inserted into your application intercepts the request, sees that the user has not been authenticated, and sends the user to the Windows Azure Active Directory tenant's log-in page. You log in with the Windows Azure Active Directory tenant, and the tenant gives you a token back. Then you attach this token to the next request, and this time WIF intercepts the request again and sees that now you are authenticated. So you can see that in your application, you don't need to handle authentication at all. Basically, you are offloading all the authentication work to Windows Azure Active Directory. The benefit of doing this may not be apparent in a software context, but with a real-life example, it should be easier to understand. This is like the lock on your front door. Do you make a lock yourself, or do you go to a store and buy a lock? I think most people would choose to buy a lock, because you know those guys are professionals. They know how to properly make a lock.
But in reality, what we see in a lot of systems is that people try to implement authentication themselves. Unless you are a good lock maker, you can't make a decent lock; you should rely on people who have collectively spent hundreds of years on authentication as their profession to do things properly for you, and you get that for free. Windows Azure Active Directory is free; there is no extra charge with a Windows Azure subscription. You should definitely consider using it. Here is the single sign-on case. Basically, you can have multiple applications with the same tenant. Because they trust tokens from the same tenant, you can provide this single sign-on experience, so the user only needs to sign in once and can access all the trusted applications. And there's a projection from your on-premises AD to your cloud directory. [Cloud Architecture in Practice] Gee, I went through that really quickly. Let me take another sip of water. So here comes the fun part. We've talked about a lot of practices and principles in general; now I'd like to show you some things in practice. I'm going to show you some specific techniques, new features, or patterns you can use in your own applications. For this, I'm going to jump into, I think, three demos. The first one is streamlined diagnostics. Like I said, the general strategy for improving reliability is to reduce the mean time to repair, so we provide a lot of diagnostics support for you, and this is one of the examples. Let me log in to this machine. This is a very simple site I created, called Blob Album. It's using blob storage. You can create albums and upload pictures. For instance, if I open an album, you can see the pictures inside. I think I lost my network connection. Okay, it's back. So anyway, it's a very simple application based on Windows Azure blob storage. Now I'm going to create a new album. I click on the link.
I enter a name and a description, then click Create. And too bad—nothing happened. There's no error message; it just didn't happen. So apparently there's a bug in my code. How do I fix it? I can open up Visual Studio, and in Server Explorer I can actually connect to all my Windows Azure websites. I expand my subscription and find my website. Just to show you I'm not cheating, you can see the URL is a real website; it's not running on my local machine. So I go to this website, I right-click, and I see View Streaming Logs in a new output window. I click on that, and you can see the output window here. Now I'm actually connected to the log streams of my running website. What I can do now is go back to my website and re-create the problem. I click Create—yeah, nothing happened—and I go back, and you can see the exception right there. I don't need to stop the application or take any additional steps. I can see that the check for whether the blob container exists failed. It turns out—because I'm a bad programmer, I guess—I'm using the album name as my blob container name, so if the user enters a space in the name, the system is screwed up. Of course I can fix this, but the key point is that with this kind of streaming log, you can quickly identify and fix problems on a website. That's the first demo—these will be quick demos. The second one is the point-to-site connection. This is how Windows Azure provides connectivity between your on-premise network and your Windows Azure network, and you can also connect any machine to your Windows Azure networks. Let me switch back to my presentation machine. In this case, I'll show you—I've already defined a virtual network, and on that virtual network I've provisioned a virtual machine. What I can do here are two things. First, I can upload a client certificate.
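The bug in the demo—using the raw album name as a blob container name—can be avoided by mapping user input onto the documented container-name rules (lowercase letters, digits, and hyphens; 3 to 63 characters; starting and ending with a letter or digit). A sketch, with a hypothetical helper name:

```python
import re

def album_to_container_name(album_name):
    """Derive a valid blob container name from a free-form album name,
    instead of using the raw user input directly as the demo's code did."""
    name = album_name.strip().lower()
    name = re.sub(r"[^a-z0-9]+", "-", name)   # spaces, punctuation -> hyphen
    name = re.sub(r"-{2,}", "-", name).strip("-")
    name = name[:63]                          # container names max out at 63
    if len(name) < 3:                         # pad names that are too short
        name = (name + "-album")[:63]
    return name
```

In a real application you would also store the mapping from album name to container name, since the transformation is lossy.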
The certificate is used to authenticate my client, and I can download a VPN client package, which is tailored for this specific virtual network. I already did that, so I will just show you: after you install the VPN client, you will see an additional entry in your networks. See, this is my VPN connection. I can just click and connect to it. Now I'm connected to the same virtual network, and on the virtual machine—this is my virtual machine—I have created a shared folder. I know its local IP address, so I can open the file browser and access that share, and I can actually create a new folder called "log files," and I can even—oops—(inaudible). Okay—and when I go back to my virtual machine and open the shared folder, you can see my folder there too. It's right there on the machine. So you can see it's really easy to connect your laptop to your virtual machines on a Windows Azure network. That's the second quick demo. For the third one, I'm going to switch back. The third one is what we call the competing consumer pattern. How much time do I have? I think I have time to go through this. This is a very important pattern—personally it's important to me, because I've used it to solve a lot of difficult problems. The competing consumer pattern is really useful for solving the global-state problem I just mentioned—for instance, if your system has a global dispatcher. When we approach this problem, the issue is that you have a centralized component that controls all the connected components. The way to solve it is to turn it around: instead of having a central dispatcher, we have competing consumers that compete for the tasks. This is how you do it using Windows Azure Service Bus. Ignore the text; just focus on the animation.
In this case, I have a producer and two consumers. The producer generates some jobs, and the jobs are put on a queue. Let's say consumer 1 wants to handle job 1. What it does is place a lock on the first job and handle it, and then consumer 2 comes in at the same time, places another lock on the second job, and handles the second job. Now consumer 2 finishes its job, and the job is removed from the queue—everything is okay—but consumer 1 actually crashed. Its lock will expire after a certain time window, and the job will reappear on the queue for other consumers to pick up. In this way, you get a failover scenario that ensures a job is handled at least once. So in this case, the job reappears, and consumer 2 picks it up and handles it. [Case study: ATIOnet] I skipped that slide. The competing consumer pattern is important because, first, it is a scaling pattern: you can scale out to as many consumers as you need. Second, it's a failover pattern: you can achieve failover using competing consumers—if one consumer fails, you can fail over to another one. Third, it gives you the ability to dynamically scale your system without affecting existing components. If you need more throughput, you just add more consumers, or you can remove consumers, and your existing system keeps running. For that, I'm going to show you what I think is the last demo for today. If you've watched Microsoft events before, you may have seen this. It was a big-compute scenario, but I've moved the whole scenario into a worker role. Basically, in this case, I have a web front end that takes in frame-rendering jobs, but in the back end I have seven worker role instances. Each of them is only capable of rendering one frame at a time, so if I want to render more frames in parallel, I have to scale the system out to multiple instances.
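The lock-then-expire behavior just described (often called peek-lock) can be sketched as a small leased queue. This is an in-memory illustration of the semantics, not the Service Bus API; `receive` hides a message behind a lease instead of deleting it, `complete` removes it, and an expired lease makes the message visible again, which is what gives you at-least-once delivery.

```python
import time

class LeasedQueue:
    """Peek-lock sketch: messages are leased, not deleted, on receive.
    If the consumer crashes before calling complete(), the lease expires
    and the message reappears for another competing consumer."""
    def __init__(self, lease_seconds=30.0):
        self._lease = lease_seconds
        self._visible = []            # messages available to any consumer
        self._locked = {}             # lock id -> (message, lease expiry)
        self._next_id = 0

    def send(self, message):
        self._visible.append(message)

    def receive(self, now=None):
        now = time.time() if now is None else now
        # Return messages whose lease expired (crashed consumer) to the queue.
        for lock_id, (message, expiry) in list(self._locked.items()):
            if expiry <= now:
                del self._locked[lock_id]
                self._visible.append(message)
        if not self._visible:
            return None
        message = self._visible.pop(0)
        lock_id = self._next_id
        self._next_id += 1
        self._locked[lock_id] = (message, now + self._lease)
        return lock_id, message

    def complete(self, lock_id):
        """Acknowledge successful handling; the message is gone for good."""
        self._locked.pop(lock_id, None)
```

The explicit `now` parameter is just there to make the lease-expiry behavior easy to demonstrate without real waiting.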
In this setup, I have a single web role taking the requests. It sends jobs to the queue, and then I have seven competing consumers competing for the jobs. So I will just create seven jobs—let's say render frames 10 to 16—and I click render frames. Basically, I'm generating seven jobs here; those jobs are sent to a job queue, and they are waiting for the worker roles to pick them up. And apparently the worker role is not working right now—I don't know why. It's not rendering. Okay, let me come back to it later; we'll just go ahead. Okay, that's the competing consumer. I think that was the last demo. We had to drop the case study because of the hardware failure, but if any of you are interested in this kind of scenario, you can talk to Sebastian after the session. He's over there—can you raise your hand to show? Okay, I will skip that demo. Thank you very much, and if there's any—thank you, thank you. There are still a couple of housekeeping items. That's my contact information; if you have more questions, you're welcome to drop by. There are some other things—I don't really know. But the key point is to send us your feedback. This is really important to us, especially if you want to hear more about architecture—reflect that in your feedback so that we can do more of these sessions. Thank you very much, thank you.

Video Details

Duration: 1 hour, 12 minutes and 22 seconds
Country: United States
Language: English
Genre: None
Views: 5
Posted by: asoboleva99 on Jul 9, 2013
