
Large-Scale Data Warehousing and Big Data with Microsoft SQL Server Parallel Data Warehouse V2

[TechEd 2013—DBI-B304] [Large-Scale Data Warehousing and Big Data with SQL Server 2012 Parallel Data Warehouse] [Brian Mitchell, Lead Senior Consultant, Microsoft BI & Big Data COE] [Brian Mitchell] Was that a yes? Okay, cool. All right. Welcome everybody to a title only a Microsoft DEV team would come up with, which is Large-Scale Data Warehousing and Big Data with SQL Server 2012 PDW. I had a different title for this session, but they took it away and gave us this one. So my name is Brian Mitchell, and I'm a lead senior consultant with the Microsoft BI and Big Data center of expertise within Microsoft Consulting Services. What we're going to talk about today is SQL Server 2012 PDW. What I would like to know from you guys real quick—and kind of do a level set—is who here has been to a PDW session in the past? I see like 2 guys raised their hands, and that's partially what I was afraid of. This is going to be mostly a what's-new-in-PDW session. I'm going to try to level set it a little bit by talking about some basics around PDW. If I'm going too fast, please let me know. I'll be happy to slow down and make sure we cover some of the fundamentals. Also, after this session, if we've whetted your appetite about PDW, you can go to my blog and check it out; there's a PDW page there with links to sessions from previous TechEds where you can find more information about PDW V1, where we talked about it in detail. This is a little bit about me. I've spent a little over half my career working on SQL Server at Microsoft. I have a couple of advanced certifications. I've worked with PDW since before we had our first customer. I was onsite when we delivered our very first PDW to our first customer. So I've had some pretty good experience working with customers on the pros and cons of PDW and our travels from V1 to V2, and I'll be happy to talk about the details of that as we go on.
Here's my Facebook Big Data research. Sixty-eight percent of all Brians in the world are with an I. So my question to you is, why when I go to Starbucks do they always assume it's with a Y? I don't understand, 'cause they're always—that's with a Y? No, it's not. So hopefully I'm going to try to change the world by making everybody assume it's with an I, starting today. There's my blog information and my Twitter information. Just pimping out my Twitter to get more followers, I guess. So a little level set about our data warehouse momentum. Who here was at the big data foundational session this morning? So about half the crowd or so. That was a really good session on big data, PDW, and some of the direction we're going with trying to integrate all of the forces together. It's really an exciting time to be working in data warehousing and BI right now with Microsoft. The commitment the product groups are making to BI and to data warehousing is unprecedented at Microsoft, and here's just an example of some of the customers that are PDW customers just within the last year. It's just a subset of the customers and the kind of momentum we have with PDW and SQL Server data warehousing. Gartner has recognized Microsoft as a leader in data warehousing and business intelligence, and so that's always a good thing to know—that we're on the right track with both of those things. Here's an example customer—one of those many customers that you saw just a minute ago—and the kind of performance improvements that they saw when they started moving even to PDW V1, which is 100-times improvement in queries: queries that took 20 minutes now only taking 20 seconds, queries where before they were working off aggregations and market baskets, they are now doing item-level queries.
PDW does give you the ability to really drastically change how you look at your data warehouse and how and what kind of queries you can actually expose to your end users in their BI environments, and we'll talk in detail about that today. So you guys have probably seen some slide like this in the past, which is: data is growing. We talked about that at the foundational session this morning. And that's just a simple fact—that data is growing exponentially, not linearly. So we are collecting more sources of data, the data that we do have is growing, and customers are also wanting to keep longer histories of data in their data warehouse while at the same time expecting you to provide more real-time-ish access to that data. So that's just a fact. And then consumerization of IT—we talk about it a lot here at Microsoft. And we talk about it a lot with our customers. From a big data or PDW standpoint, the way I see it is consumerization of IT means that our users are less likely to be patient for data to come back, because Google and Bing have made it seem to them that you can write a query against the whole corpus of the Internet and get an instantaneous response, right? So they expect that from our applications, and so an appliance like PDW that can scale to your needs can provide that kind of response time for your end users and speed up the kind of queries that you're looking for, and we'll talk in detail about some of the changes we made in 2012 to make that occur. So capitalizing on all these changes of big data means that we need to understand what it is. For years we've mostly worked in this space as data guys, right? We've mostly worked in the ERP and CRM space dealing with things like inventory and sales and point of sale and those kinds of things. Over the last 5 to 10 years, we've been dealing more with the Web 2.0 world of adding web logs and recommendations and eCommerce into our environments.
And then as we start grappling with this big data stuff, how do we handle bringing log files into our environment? How do we deal with social sentiment, clickstream data, or raw text and images coming into our environment? So dealing with big data is going to be a big deal, whether you're doing it today or it's something coming in your near future. How do we handle that, and how do we use something like PDW as an appliance to make sure that it is built so that it can handle this kind of data? And we'll talk about that today. So generally speaking, to handle that, a lot of our options are fairly limited. We can keep our legacy investments. We can keep our SMP environments and continue to scale those up. Microsoft is making a big investment in making sure that you can scale as much as possible. Some of the things that you heard this morning about Hekaton or columnstore indexes make it so that we can get more out of the SMP environment that we have today, but you're still dealing with constantly trying to change your applications to use these new tools as they come along. You can buy a new Tier 1 hardware appliance from some of our competitors. Generally speaking, I can tell you that those are going to be significantly more expensive than PDW is, but we'll talk about the details of that later. You can acquire big data solutions, so you can install Hadoop on a Linux environment and kind of scale out that way. That's certainly always an option to deal with some of those big data problems. It's going to require some training on your users' part. It's also going to require some new infrastructure. And then, of course, you can acquire business intelligence and deal with the solution complexity. Is all that stuff integrated back into the big data solution you just created or your Tier 1 appliance?
So having an integrated business intelligence with your data warehouse is always a useful thing to have, and we'll talk in detail about how our BI stack at Microsoft integrates seamlessly with PDW. So what we have is PDW, and we have the ability to build for big data. And we'll talk in detail about PolyBase and how you can store your structured and unstructured data inside or outside of PDW but query it using some simple syntax. We'll talk about next-generation performance at scale, which is xVelocity and clustered columnstore indexes within PDW—updatable clustered columnstore indexes in SQL Server 2012 PDW. And then finally, engineered for optimal value: we were able to use the power of Windows Server 2012 and direct-attached storage to get rid of some very expensive third-party SANs and other things that we had in previous versions of PDW, to make it a much more optimal value for our customers. That's the introduction. What is PDW? So first of all, just to make it clear—and I put this slide in here because sometimes when I have crowds like this that haven't been to a PDW session before—we need to understand that PDW is an appliance. You can't download it off the Internet and install it on your own machines. It's not available in Azure. It is an appliance that you buy from our hardware partners, and we'll talk about who those partners are in a little bit and how you go about deciding on configurations. So it's built in the factory. What that means is that when you work with Microsoft and our partners, they actually build PDW in the factory, right in the rack. They install the software, and then they box it up on a pallet and send it to you, so when it arrives at the doorstep of your data center, they literally can wheel it in, plug it in, and get it going, and you can be loading data within a day or 2.
It is that simple. So being built in the factory is a significant, important thing when it comes to installing your data warehouse environment, because if you think about how long it takes for you to build your own data warehouse environment today—it usually takes months to sit there and think about what you're going to purchase, how you're going to configure the software, arguing over SANs and LUNs and all that kind of information, and getting it all installed in your environment, and by the time you do that you're months into your project and you really have no value. With something that is built in the factory as an appliance and delivered to you, you can literally be using it—you know—have your data loaded pretty quickly. It's optimized for data warehouse workloads only. So that's pretty important. This is not an OLTP appliance. This is an appliance that is built to drive your data warehousing environment and usually some sort of BI environment on top of the data warehouse. And it is built with Windows Server 2012 and SQL Server 2012. Those are the foundations for PDW. So here's kind of the agenda and the details of what we're going to talk about. We're going to talk a little bit about the architecture—just software architecture really quickly—and then we'll talk about our appliance configurations and then the new features of SQL Server 2012 PDW: our memory-optimized columnstore index, management simplicity, PolyBase, and then we'll talk a little bit about Power View integration and finally some T-SQL parity, and I'll take questions. So as I'm going through this, if you guys have any questions, I'm fine with interactivity. I don't mind at all—in fact, I have a SQL Server 2012 USB stick for the best question of the day, which I'll give away at the end of the session. If there are too many questions, we'll just slow it down and I'll ask that we just hold them off until the end.
But I'm fine with taking questions. So let's go ahead and get started. [Architecture at a Glance] Basically, if you wanted to know one thing to kind of level set what you need to know about PDW when it comes to how it's different than SMP SQL Server, I would say that you need to understand the difference between replicated tables and distributed tables. And if you understand that, then that really helps. So remember, PDW is an appliance. It is multiple SQL Servers within an appliance, and one of the things we do is we spread—we shard your data across nodes essentially. Within the PDW environment, we use the term distribute your data. So one of the things we have is choices. You can either replicate your data or you can distribute your data. Replicating your data has nothing to do with SQL Server replication. All it means is that when you create a table that is replicated—which is the default structure—a full copy of that table exists within each PDW node, so we make copies of it, and we put it on each one of the SQL Servers. The reason we do that is simply for performance as we join tables together; so generally speaking, your replicated tables are going to be your smaller tables, and they're, generally speaking, going to be your dimensional tables if you're doing like a Kimball data warehouse kind of environment. So think tables 5 gigs and less, which is a majority of your tables, and you just replicate them out, and that really helps with the performance. Distributed tables, on the other hand, are tables that we hash based on a column that you give PDW, and what we do then is basically every row that gets inserted into PDW is hashed on that column and then it is stored on 1 node and 1 node only based on that hash, and that's really an important thing. So what we do is we spread your data across.
If you have a one-data-rack appliance with 8 nodes in it, then that row is only going to live on 1 of those 8 nodes, and so that way when you query data, that's how we're able to get the massively parallel performance out of PDW—by only having to deal with subsets of data on each one of the SQL Servers as we bring it back together and then give it back to the end user. So pictorially, what that looks like is: for your dimensional tables, we'll take a dimensional table and we'll replicate it out, and it gets a copy of that dimensional table on each node within the appliance, and this is just a little half-rack appliance within PDW of 4 nodes—or I guess that's 5 nodes. On the other hand, your larger tables—generally speaking, your fact tables, or your really large dimensions if you're a large company—say a customer dimension where you might have millions of customers—what we'll do is we'll take that data and we'll actually shard it out so that each one of those rows actually resides on 1 and only 1 compute node within the appliance; and in fact, a little detail underneath the hood there is that on each one of those compute nodes we have 8 different distributions, so in a typical one-data-rack appliance with 8 compute nodes we actually have 64 distributions across it. Each row goes into one of those 64 distributions, and so when you do a query against that table—select star against sales fact—what it's going to do is send 64 queries to those back-end distributions, and each one of those queries is going to run, and then we're going to aggregate that data back together before we send it back to the end user. So at a high level, that's kind of what those do. A replicated table—it's created just the same way that you do it today. You're just creating a table just as you do today on SQL Server.
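The two table types described here can be sketched in PDW's CREATE TABLE dialect. The table and column names below are hypothetical, chosen only for illustration; treat this as a sketch of the general shape of the syntax, not output from a real appliance:

```sql
-- Replicated table: a full copy lands on every compute node.
-- Aside from the WITH clause, this is the same CREATE TABLE
-- you write today on SMP SQL Server.
CREATE TABLE DimCustomer
(
    CustomerKey  INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL,
    Region       NVARCHAR(50)
)
WITH (DISTRIBUTION = REPLICATE);

-- Distributed table: every row is hashed on CustomerKey and stored
-- in exactly one of the 64 distributions in a one-data-rack appliance.
CREATE TABLE FactSales
(
    SalesKey    BIGINT NOT NULL,
    CustomerKey INT    NOT NULL,
    SalesAmount MONEY
)
WITH (DISTRIBUTION = HASH(CustomerKey));
```

Picking the hash column matters: joins and aggregations on CustomerKey can then run locally within each distribution, while a heavily skewed column would pile rows onto a few distributions.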
A distributed table, on the other hand—all you do is you have the WITH statement at the end where you say DISTRIBUTION = HASH, and you pick the column that you want to hash on. And we have lots of presentations from past TechEds where we talk in more detail about this and best practices and all that. We really don't have time for that today because we really want to talk about what's new with SQL Server PDW and what's really exciting about 2012. [Appliance Configurations (HP & Dell)] So let's start with appliance configurations. Our 2 partners are HP and Dell, and their configurations are slightly different, and we'll kind of walk through what those will look like. So first of all, let's talk about the difference between V1 and V2. In V1, basically, a 1-data-rack appliance was really 2 racks. It had a control rack and a compute rack, and there was lots of hardware in it. There were lots of SANs involved, so if you had a 1-data-rack appliance you had 11 physical servers in your compute rack. You had 10 SANs. There were lots of interconnects between the two. There were SANs on the control rack for both the control node and the backup nodes. So there was a lot of hardware involved there, and while it was a dramatic improvement over previous versions—well, we didn't have anything like this in the past. This is really exciting and really good. PDW V2 takes us all to a new level. So one of the things we've done is we were able to shrink down our footprint from 2 racks down to 1 rack. And then after doing that, we were able to give each node more memory, more CPU; and actually with the 1 rack as opposed to 2 racks, we actually have more space for data. So what that does is, by getting it down to 1 rack, our pure hardware costs are much less expensive than they were before, our price per terabyte is much lower than it was before, and our performance goes up significantly.
Mostly what we'll talk about over the next hour or so is where we're getting those performance gains. But it's pretty exciting to see the differences and faster improvements in performance and much less cost to the end user. All right, so how'd we do that? So some of the key design changes we made in PDW are, first of all, we went to one standard node type, and that node type has 256 gigs of memory, so when you buy a one-data-rack appliance with 8 nodes in it, each one of those nodes has 256 gigs of memory, and we're going to drive pretty hard through that memory. Also we updated to the newest InfiniBand, so we went from DDR to FDR, which gives us much faster interconnect performance between the nodes in the appliance as we push data throughout the appliance and back to the end user. And then, really importantly, we moved from SANs to JBODs, so using Windows Server 2012 technology to just manage the disks and get rid of those expensive third-party SANs was a significant deal for us. It really reduces costs and allows us to use Windows Server 2012 to manage stuff in the same way that we do in other environments. And then finally, we also removed what we called the backup node and the landing zone, which were previously integrated into PDW, and now they're part of reference architectures. The reasons for that were that customers had wildly varying needs for both backup and landing zone. While we provided a landing zone that had basically 8 terabytes of space in it, there were customers that needed more space, and we provided a landing zone that had—I think it was 8 CPUs in it. That wasn't enough for some customers depending on what they were doing.
Some customers had multiple loads that they wanted to do at any given time, and they would prefer to have multiple landing zones, so what it really came down to is we found out that it was actually better to just pull those out of the appliance and provide reference architectures and that gives the customers more agility— possibly even to just leverage SSIS environments that you already have, for example, without having to pay for a landing zone built into the appliance. And then we have the scale unit concept which is basically the ability to add either 2 or 3 compute nodes and the related storage at any given time, so we start off with a quarter rack, which essentially looks like this, where you just have 4 hosts and 2 of them are compute nodes down there at the bottom—hosts 2 and 3— attached to a JBOD. That's essentially where you can start, and then you can scale up from there, and we'll go into detail about what that looks like for both of our vendors. So an HP configuration basically can scale from 2 to 56 compute nodes, and this is kind of that scale for everybody kind of mentality that we have. When we start talking about 2 compute nodes, we're not talking about a large appliance. We're actually talking about something that most everyone that has a decent-sized data warehouse that's in the terabyte range at all could probably leverage at some point. So what that looks like is this. At the top what we have is we have our interconnects for InfiniBand and Ethernet, and we have a control node and a failover node, and then we add a scale unit down at the bottom. We start off with our base unit, and in HP that's going to be 2 servers. They have 16 cores each for a total of 32, and it has a total of 15 terabytes of raw space, which approximates out to about 75 terabytes of user data capacity when you start talking about compression. 
So at a minimum, that is a really exciting entry point into a massively parallel world, and what you can do from there is, as your needs grow over time, you can add additional scale units to the point where we can have, in an HP environment, up to 4 scale units, which is 8 compute nodes within the appliance. That gives you 60 terabytes of raw space within the appliance. And that's for one data rack, so we can go up from 1 to 7 data racks for that appliance. Now what's different about PDW compared to SMP is when you start at this scale unit down here of a scale unit of 1 and you need to grow, you add that second scale unit and work with Microsoft, and we can redistribute your data across that entire appliance. So that's different than an SMP environment where you have a box, you run out of space, and you need to decide how to grow: you buy a new box, and then you have to go do this expensive and very time-consuming migration process. The difference that's really exciting for MPP and for PDW is you just scale as needed. So let's say you start at a 1/2 rack, like we see up there, and you need to go to a full rack: you come in, you add 2 more scale units, and you redistribute your data, and you're now leveraging the entire appliance without having to do any lift-and-shift like you have to with an SMP environment. There may be some small downtime, but that's much different than a big lift-and-shift in a normal SMP world of moving from one box to another. And then finally, we have a little extra space for ETL servers and backup servers for customers to add to the appliance and hook up to InfiniBand so that they can load the appliance as necessary. So that's one data rack. You can then slowly add quarters again to get to 1-1/2 racks, and then you can add another half and you can get to 2 racks. Add a third rack and on and on, all the way up to 7 racks. So that's the HP configuration.
Dell is slightly different; Dell comes with 3 compute nodes and 2 JBODs in a base unit. And that gives you basically about 22.6 terabytes of raw space, 79 terabytes of user data in that environment. So basically, we scale in thirds in a Dell environment, so that you can have up to 9 compute nodes. You get a little bit more space in that particular appliance configuration, and then, of course, they still have space for the customer's use. And then, of course, you can go up to 54 compute nodes, up to 6 racks, and on and on, so they're fairly comparable really performance-wise between the 2, whether it's the 9 or the 8. It's really fairly negligible; it's really just up to you what your particular vendor of choice is. So the supported growth topologies kind of look like this. We'll be able to grow from a quarter rack up to 7 racks for HP, or from what we call a quarter rack or a two-thirds rack all the way up to 6 racks in Dell. And what you can see there is you can basically start off with 15 terabytes of raw space and grow it up to 422 terabytes of raw space, which gives you, for that particular environment, about 1.4 petabytes of space. If you use the 3-terabyte disks, you could probably get up to about 6 petabytes of space as needed. So you can grow to basically any size, and I'm sure as time goes on, we'll continue to grow the amount of space that we can handle as PDW continues to mature. So how do we do this? So one of the things that we do—one of the—yes, sir? [inaudible question from audience] So the question is, if I have 5 racks and I want to go down to 4, can I take out 1 of those racks? The answer is, yes, you could. You would have to work with Microsoft to do that. We would basically have to reconfigure it to be at 4 racks. You would basically have to take a backup of your data and then restore it.
It wouldn't be a trivial thing. Growing is the thing that we have planned for. To shrink like that—and then what would you do with your 5th rack? What are you going to do with those servers? So yeah, I mean, there are a lot of questions there. It can be done, but it's not a trivial thing like growing is, is what I would say. So let's talk about how we do some of this agility—through virtualization. And this is pretty exciting. So PDW V1 did not have virtualization in it. We just did everything right on the bare metal, as you'd probably normally do in your environments. But what we found over time is that as you move to Windows Server 2012 and SQL Server 2012, the performance penalty you get from virtualization is negligible, if any, and so one of the things that we can do from an agility standpoint is to use virtualization. So instead of using failover clustering like we did in V1, we're now using virtualization to manage our high availability in PDW, and that allows us several pieces of agility that are actually pretty nice. So what you see there is you have your storage spaces within the JBODs, and those compute nodes are connected to the JBODs like that. But what happens if we have a failure of host 2 is that basically that compute node is moved over to our spare node, which is called host 1 in the appliance, and it can still have access to the JBOD as long as a host attached to that JBOD is up and running and has access to the disk; so you're able to use InfiniBand to connect back through to the disk. So no matter where that virtual machine lies within the appliance—and it can fail over from 1 rack to another rack—it still has access to its disk as long as the other server is up. In the case of a failover like that, you can work with Microsoft support, and they'll work with you to get host 2 back up and running and then fail the compute node back over.
So essentially there's just one cluster across the whole appliance, and it's pretty exciting that virtualization is used that way. So: LUNs, filegroups, and disk layouts. What we have basically within one of those JBODs is 70 disks. And what we're able to do is create 33 mirrored drives so that we can lay out our data. So one of the things that I told you earlier is that in any particular compute node you have 8 distributions of data. And each one of those distributions actually, physically, on the back end is just a table. So when you create a distributed table for fact sales on an appliance that has 8 compute nodes in it, you're actually creating 64 physical tables on the back end, 8 per compute node within your rack. And so in PDW V1, that essentially leveled out to: each distribution went to a filegroup, which went to a file that was on a mirrored RAID 1 array. And one of the things that we found out was that was great as long as we didn't have a lot of data movement and we didn't have a lot of activity. So one of the things we did with V2 is we essentially doubled the number of disks we have available to each one of those distributions; and so now when you create a distributed table, it creates 8 tables on each compute node. Each one of those 8 tables is associated with a filegroup, and each one of those filegroups has 2 files behind it, each on a separate RAID 1 array. So what we've done is we've significantly improved the I/O behind the system so that we can handle pretty much every workload you can imagine when it comes to pushing data through the appliance. So those distributions are now split into 2 files. Also, what we've done to really improve I/O is to move tempdb off the local disks it used in V1 and out to the JBOD, and spread it across basically what is 32 RAID 1 arrays. So tempdb is no longer constrained to a certain size. It can grow through your JBODs, and it also has a significant amount of I/O behind it.
This is really important because PDW uses tempdb extensively as we do movements of data and aggregation of data internally before we present it to the end user. All right, so let's talk about what is probably the most exciting introduction in PDW V2, which is the memory-optimized columnstore index with xVelocity. All right. So we've had xVelocity around for a while. It started off in Excel, and it started off in PowerPivot. As it grew from PowerPivot to Analysis Services tabular mode in SQL Server 2012, one of the key limitations of tabular mode and non-clustered columnstore indexes is that you cannot update them. Right? And this has become kind of—well, I would love to use it, but it becomes kind of a pain to use. So when we started talking about wanting to do columnstore indexing within PDW, one of the things that was just an absolute must was that it had to be an updatable columnstore—something that is updatable that we can use. And the product group came back and did us one better. They gave us an updatable clustered columnstore within PDW, and it actually made it into PDW before it made it into the SQL Server box product. This morning they talked a little bit about how SQL Server 2014 will have clustered columnstores in it. So what that means for us is that for PDW now—it's not the default, but from a services perspective, when we think about how we implement tables within PDW for 2012, the thing that we're going to go to first is a clustered columnstore index for our data. What that means is we're no longer storing it in a row store at all. We're storing our data—there isn't a copy of the data split between the row store and your non-clustered index.
You're just storing everything in columnstore indexing, and what that does is give you a significant reduction in the amount of space used; and if you significantly reduce the amount of space on disk, then that is a lot less information you have to pull through I/O, a lot less information you have to pull through a CPU, and a lot less information you have to store in memory. So it significantly improves performance all over the appliance, such that you can deal with much larger data sets and get much faster query returns. And when we talk about 10 to 100 times performance improvement, that's over PDW V1, which already gave you significant performance improvements over SMP; and when we talk about 7 times data compression, that's what we're expecting to see out of data compression when you go to a clustered columnstore index. So what does that look like? So how do we deal with the updatable part? That's kind of important, and we'll talk about that. So basically a clustered columnstore index is comprised of 2 parts. We have a columnstore index, and then basically we have a row store, and that row store is basically what you're used to today, which is just the typical row store table. And to the end user, you don't see any of that complexity. If you create a table with the clustered columnstore index within PDW, this is how we go in and implement it—by providing a row store such that, as you do inserts into the table, those inserts go into the delta row store first. As we compress the data in the columnstore index, basically we compress it into approximately million-row segments, just like we do with non-clustered columnstore indexes. And the idea behind it is that the minimum I/O between disk and memory is those segments; and finally, the execution of batch mode is pretty important, too.
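In DDL terms, the clustered columnstore choice is declared right alongside the distribution when you create the table. A hedged sketch with hypothetical table and column names (in PDW the columnstore option sits in the same WITH clause as the distribution option):

```sql
-- Distributed fact table stored entirely as an updatable clustered
-- columnstore index -- there is no separate row-store copy of the data,
-- only the small delta row store that absorbs new inserts.
CREATE TABLE FactInternetSales
(
    SalesKey    BIGINT NOT NULL,
    DateKey     INT    NOT NULL,
    CustomerKey INT    NOT NULL,
    SalesAmount MONEY
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```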
So batch mode has been enhanced from SQL Server 2012 to PDW, and we'll talk about what those enhancements are, so that you're more likely to get batch mode if you're using your columnstore indexes. So typically batches are about 1000 rows at a time, and what that means is, if you are able to leverage batch mode, then instead of going a row at a time through the CPUs, we're doing 1000 rows at a time through the CPUs, and that essentially reduces the amount of work the CPUs have to do, which can significantly improve performance. So you can see down there an example of a couple of different runs: FactInternetSales_Row ran in about 6.7 seconds. You do the exact same data with a columnstore, and if you get batch mode like that, you'll see it ran in less than half a second. Same data, same structures—just the difference between row store and batch mode. You can also see there on the estimated operator costs where it goes from 41.8 down to 0.6. This is a significant game changer. And really there's very little you have to do other than create your columnstore indexes or your clustered columnstore indexes and just leverage them, and that's what's nice—whenever you're able to get significant improvements in performance without having to do a lot of juggling on the back end. A key point before I go on, though: batch mode processing is only available for some operators, so hash joins and hash aggregates are supported within PDW. Right now merge joins, nested loop joins, and stream aggregates are not supported. But this really isn't a big deal, because whenever you're doing PDW queries and you're joining a fact table that has millions or billions of rows to multiple dimension tables that have hundreds of thousands or millions of rows, you're almost always wanting to do hash joins anyway. That's essentially what we're looking for in PDW. We want to scan the data, put it into memory, hash join it together, and then give you an output of a result set.
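As a concrete sketch, a star-join aggregate like the one below—table and column names are hypothetical, loosely modeled on AdventureWorks—uses only hash joins and hash aggregates, so it's the kind of query shape that qualifies for batch mode:

```sql
-- Hypothetical star-join aggregate over a clustered columnstore fact table.
-- Hash joins and hash aggregates are batch-mode eligible in PDW V2;
-- merge joins, nested loop joins, and stream aggregates are not.
SELECT d.CalendarYear,
       p.ProductCategory,
       SUM(f.SalesAmount) AS TotalSales
FROM   dbo.FactInternetSales AS f
JOIN   dbo.DimDate           AS d ON f.OrderDateKey = d.DateKey
JOIN   dbo.DimProduct        AS p ON f.ProductKey   = p.ProductKey
GROUP BY d.CalendarYear, p.ProductCategory;
```

Scanning millions of rows, hashing them in memory, and aggregating is exactly the shape the appliance is optimized for; a query that forces a nested loop join through the fact table would fall back to row mode.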
Generally speaking, whenever you see large data warehouses—you're doing a Kimball data warehouse—and you're doing nested loop joins through large fact tables, you're doing it wrong. So the supported modes are actually the modes that we're, generally speaking, looking for whenever people are using PDW— or any data warehousing tool, for that matter. Yes, Jeff? [inaudible question from audience member] I would be happy to. I'll go to the next slide. So, how DML is supported—this is how we go about supporting updates. Inserts go into the delta row store first, and it's just a page-compressed heap. Once you get enough inserts into the delta row store that it can be compressed, we go ahead and do that through what we call the "tuple mover." So basically, once you get about a million rows in each one of those distributions' delta row stores, the tuple mover will actually move them into the columnstore. Deletes, on the other hand—if you want to delete data from the columnstore, it's essentially just a logical delete. We mark the row for delete, and then we'll actually get rid of that row whenever you do your next reorganize of your data. Updates, Jeff, are basically an old journaling technique of doing inserts and deletes. So we'll do an insert into the delta row store, and we'll mark the old row for delete as necessary. And what that allows us to do is have this columnstore index—a clustered columnstore index—insert rows into it, and then move those rows into the columnstore once we have enough of them in the delta row store.
Essentially, the delta row store—because it's just a million rows or less at any given time—even though it's not part of the columnstore index yet, there's going to be very little performance impact from having some rows in there at any given time, and we would expect you to generally have rows in there, especially if you're trying to do updates throughout the day. On the other hand, if you're doing large batch inserts at night, ideally you would do the large batch insert, then do a reorganize of your clustered columnstore index—that way you compress everything, and everything's in your columnstore throughout the day. But if you need to do trickle-in updates through the day, you can certainly do that in PDW now with the clustered columnstore index without any problems. So we have some new DMVs available to us that allow us to see into the back end. One of the reasons you'll see a lot of screenshots of things like this is that it's really hard to get a PDW appliance to do demos on. There are only so many PDWs within Microsoft, and all of those are usually being used to sell to customers—we're usually doing some sort of POC or something like that on those appliances—so for little guys like me out in the field, it's really hard, so I take a lot of screenshots and try to show you that way, because that's the best I can do. So here's an example of some of the additional DMVs available. At first, you just have a table that has a clustered columnstore index on it. It's been compressed. Then we insert a row into it. And once we insert that row, you can see that first row up there—it shows us that we now have an open state description. What that means is I now have a delta row store, and I've added a row into the delta row store for that particular table.
And then after the rebuild, you can see that my total rows for one of those indexes goes up by 1, and the delta row store disappears. So that's essentially how we handle that in PDW as we add data. In a little more detail: compressed shows whether or not we have a row group that is in columnstore format; open shows a delta store that's accepting new rows; and closed represents a delta store ready for reorganizing. You can walk through this example in the slide deck that you can get online for more details about how we go about working that. But essentially, at the end of the day, what you'll see is: if you're inserting more than 100,000 rows, we'll compress it automatically. If it's less than that, it'll be put into the delta store, and then you'll have to manage that with a reorganize or wait for more rows to get added. It's also partition friendly—I'm not going to spend too much time on this, but the idea is that we can handle partitioning just fine with clustered columnstore indexes also, and you can read about this in more detail in the deck later on. So the syntax is similar to PDW V1. You create a table, you give it your distribution column, and then you also tell it you want a clustered columnstore index. That's it. What that means is your entire data set is stored in a clustered columnstore. All of it. There is no more row store. There are no additional indexes you need to add. It's all going to be stored in columnstore format and accessed that way. If you want to know more about what columnstore indexing does for you from a performance standpoint, I have a session on Thursday on columnstore indexing, period, for SQL Server 2012/2014. We'll talk about clustered columnstores in there for 2014, and we'll go into detail about how columnstores are actually managed. So you can also do a create table with a create index statement, or CTAS.
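The create-table syntax just described might look like the following sketch (the table name, columns, and distribution column are hypothetical):

```sql
-- Hypothetical PDW V2 table: hash-distributed on CustomerKey and stored
-- entirely as a clustered columnstore -- no separate row store, no extra
-- indexes to maintain.
CREATE TABLE dbo.FactSales
(
    SalesKey     bigint NOT NULL,
    CustomerKey  int    NOT NULL,
    OrderDateKey int    NOT NULL,
    SalesAmount  money  NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```

The WITH clause is the only PDW-specific part: the distribution choice spreads rows across the appliance, and the index clause makes columnstore the storage format for the whole table.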
CTAS is a special statement within PDW—create table as select—so you can create a brand-new table as a select from another table or another set of tables. And you can create your clustered columnstore index at that time also. And then CTAS with the columnstore—basically what that looks like, and you can go into the details there. And additional DMVs with additional metadata so we can look at the columnstores as necessary—lots of detail there. Basically, at the end of the day, what you want to know is that you'll now have only a single index on your table. There's no need for a clustered index plus a non-clustered columnstore. There's no reason for a clustered columnstore plus a non-clustered row store. In fact, non-clustered columnstore indexes are not supported in PDW V2. You just have the clustered columnstore indexes. And then also some more details down there. Basically, every PDW data type can be included in the columnstore indexes, so it's a big deal that we actually added data types from SQL Server 2012. And batch mode processing—you're much more likely to get it than you would have in 2012 also, so this is really exciting for us—all the changes here—to make sure that you get that performance as needed. And finally, some details there around that. So what we're going to talk about next is management simplicity. The idea here is that we've added some built-in resource classes, essentially leveraging the resource governor on the back end, to provide DBAs the ability to manage their environment a little bit better. What we found with PDW V1 was that oftentimes there were certain groups of users that needed more resources to complete their queries, depending on the size of their queries or the importance of the user.
Also, ETL loads sometimes needed more resources allocated to them so that they could perform as the customer wanted, and what I mean by that is, in PDW V1 a customer would have an ETL load that would run in an hour one day and run in 2 hours the next day, and it was mostly because of competition for resources. So what we were able to do was create new resource management classes, which basically allow DBAs to hand out concurrency slots so that users can get more resources across the appliance. The default behavior is what we had in PDW V1, which is that you got one concurrency slot, and that's what you got, no matter who you were and no matter what process you were running. What we provide now within PDW is a total of 32 concurrency slots. If you want to, you can give somebody—either at the user level or at a load level, like if you're doing ETL—a medium, large, or extra-large resource class. And in fact, if you give somebody an extra-large resource class, you're going to use 21 of those 32 slots. But there are perfectly good reasons why you may do that, and one of those is the ETL processes I was talking about. If you have an ETL process that runs at night at two o'clock in the morning—and nothing else is going on in the system—and you want to give it the most resources possible so that it will get done as quickly and as efficiently as possible, you can give it an extra-large resource class, which means every single connection is going to get 8.4 gigabytes of memory for each distribution across the appliance. That's a big deal—when you start talking about 64 distributions times 8.4, you start seeing the kind of memory that just that one ETL process might be able to get.
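Assigning those classes is a matter of role membership on the appliance; a sketch (the login name is hypothetical, and the exact role names are hedged against your appliance version) might look like:

```sql
-- Hypothetical: grant the nightly ETL login the extra-large resource class
-- (21 of the 32 concurrency slots, ~8.4 GB of memory per distribution),
-- then drop it back to the default small class after the load window.
EXEC sp_addrolemember  'xlargerc', 'etl_load_user';   -- before the 2 a.m. load
-- ... run the load ...
EXEC sp_droprolemember 'xlargerc', 'etl_load_user';   -- back to one slot
```

The same pattern with 'mediumrc' or 'largerc' covers the interactive users mentioned above who need more than the default single slot.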
You may have other users that have very specific queries—maybe coming from Analysis Services, maybe in tabular mode, for example—that you want to give medium or large resource classes to, so that they can get results back in a good, expected time. This is essentially just the ability for DBAs to manage the system a little bit better. We have some DMVs that will allow you to see into the back end a little, so you can see who's coming into the system, what queries they have, what their resource class is—so the DBAs can certainly manage that pretty well. And one of the things that's kind of important to know about is resource classes and columnstore indexing. One of the things we have found through testing is that the larger the resource class, the larger the segments within your columnstore indexes are going to be. This is an example: in the top one, you can see they just had a default resource class, and it was compressing rows pretty quickly on a system that was fairly busy and had lots of things going on. As it runs out of memory, it will go ahead and compress the rows. You can see down below that when you have somebody with an extra-large resource class, the exact same load had just one row group and one delta store, and it was able to load all the data into that and compress it. The more compression you have, the fewer segments you have, and the better your performance is going to be. Now, for the top one—what you can always do is a rebuild of the index, and that's going to compress your segments down to fewer segments. But it was just an interesting facet to understand: how your resource classes—especially when you're talking about ELT loads at night—can affect how your data gets stored on disk.
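To check segment health yourself, the row-group DMV shown in the earlier screenshots can be queried directly; a sketch (the DMV and column names are as they appear in PDW V2, but the table name is hypothetical):

```sql
-- Hypothetical check of one table's row groups: how many are COMPRESSED
-- segments vs. still-open or closed delta stores, and how full they are.
-- Many small compressed row groups suggest the load ran memory-starved.
SELECT state_description,
       COUNT(*)        AS row_group_count,
       SUM(total_rows) AS total_rows
FROM   sys.pdw_nodes_column_store_row_groups
WHERE  object_id = OBJECT_ID('dbo.FactSales')
GROUP BY state_description;
```

If the counts show lots of undersized segments after a load, an index rebuild (or a bigger resource class for the loading login) consolidates them.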
So this morning during the keynote they talked a little bit about PolyBase and how important it was, and in fact, I heard Shawn Bice say that if you were going to take one thing away from SQL Server 2012 PDW, it was to go read a little bit about external tables and PolyBase. And we will talk about that. I actually think the columnstore indexing is just as important, if not more important, for how great PDW is going to work for the average end user who just wants to put their data warehouse workload on PDW. But for people who want agility and to be ready for big data, this is a big deal. So when we talk about PolyBase, what we're essentially saying is that you can use Hadoop or HDInsight in an environment but not necessarily have to retrain your users to do Hive jobs and Pig jobs. You can use Hadoop and HDInsight to store a lot of your data as it's coming into your environment but still leverage all the skills that you have today by just creating external tables over it from something like PDW. So what you get is the ability, within PDW, to query your relational data and your non-relational data sitting out in Hadoop in one familiar interface, and that's by creating the external tables over them. And that's absolutely true. It also provides some agility when we start talking about housing your data long term, and also from a backup-and-recovery standpoint—we'll talk a little bit about that. So basically the idea is external tables are the ability to do SQL queries across your data stored in HDFS. And the idea is that we have this new HDFS bridge within PDW that allows us to communicate with Hadoop from PDW so that we can parallel import into PDW, and that's kind of the key difference. There are lots of tools that make it possible to move data from something like Hadoop to something like SQL Server.
Or from something like Hadoop to PDW. You can use Integration Services to move data from Hadoop to SQL Server. You can use Sqoop to move data from Hadoop to SQL Server. You can use Integration Services or Sqoop to move data from Hadoop to PDW. But it's single threaded, so you're taking data from maybe dozens or hundreds of nodes in Hadoop, single threading it out, moving it to SQL Server, and then loading it into SQL Server. The difference when you start talking about PolyBase is that we're talking about multi-node to multi-node communication, and that's what's really exciting about PolyBase, and that's how we're able to get the performance of moving data from Hadoop to PDW, or back from PDW to Hadoop as necessary. So the parallelism is oftentimes, in my opinion, not emphasized enough when we talk about how cool PolyBase is. All right. So let's talk a little bit about what we're doing there. Basically, you have a query come into PDW. You've created an external table over some sort of HDFS data, and as a query comes in to PDW, we're going to go out and query that data in HDFS and only bring back the columns necessary to satisfy the query. So you may have a bunch of, let's say, Twitter data sitting out in your Hadoop environment, and it may have 100 columns in it. But if you're only querying 4 of those columns, we're only bringing those 4 columns back into PDW—through those HDFS blocks, through that parallel transfer—then we'll mash it up with the data in the joined query that you may have in PDW and send back a result set. So that's really what's happening when we talk about what Project PolyBase looks like within PDW and how we're joining data. If you want to go the other direction, we can run a query within PDW and export that out to HDFS. Can anybody think of any good reason why you might want to do that?
You have your data already in your relational environment. Why would you want to move it out into a Hadoop environment? [inaudible audience response] What's that? For others to consume. That's absolutely a great example. So maybe you've done the first step—SQL in, results out. Maybe you had some of that Twitter data in Hadoop. You had your organizational data in PDW. You wanted to mash that up to get some new insights, so you bring it into PDW, you do some manipulations and some queries within PDW, you get an intermediate table, and now you want to push that back out into your Hadoop environment for others to consume. That is a perfect example of how you can provide that data so that others can consume it from Hadoop without having to come in and hit your PDW environment and compete for resources, for example. So if it's fairly static data that's going to be around for a while, and lots of people are going to want access to it—and they don't need it very quickly; they actually want to import it into their own environments or their own applications—that's a perfect example. Another example that wasn't really obvious to me at first, but made sense the more I thought about it, is disaster recovery. Think about it from this aspect. You have a PDW appliance. You may have 2 PDW appliances. But you want to create some sort of disaster recovery scenario that's different than just doing a backup/restore. A backup/restore is fine, but what happens if your data center goes away and you don't have a PDW appliance to restore to? What do you do? So exporting your data out through PolyBase into a Hadoop environment is a great way to export all your data for long-term storage somewhere else. You can then use some sort of SAN technology—if you're using SANs or what have you—to replicate that data out to another environment, and that can work out really well, too. This also works with Azure and HDInsight.
So you can push this out into the cloud if you want to, for example, so that you have your backup of your data there. When you start thinking about it in those kinds of terms, there are all sorts of examples of how you could deal with your data like that. So I talked a little bit about the parallel data transfers. Essentially what happens is—if you think about your HDFS data sitting out there in a bunch of text files, those text files within Hadoop are broken down into blocks. So when you see this kind of query here—you know, with just 2 compute nodes—think of each one of those data nodes not only as your HDInsight data nodes but also think of the blocks within each data node. Each compute node is going to talk to a subset of those blocks out there in Hadoop and only bring the data back for that, and then we'll process it within PDW, move the data around, and join it up to the tables that you may have in PDW, and it's very efficient at that. And this is really just the first step. PolyBase V1, which is what's available now, is just the first step. We have great plans for PolyBase with PDW to make it the place where you want to integrate your data between Hadoop and your relational data warehouse. The next step—for the next version of PDW—is what we call predicate pushdown, which is basically pushing down MapReduce jobs from PDW into Hadoop so that we can get back smaller data sets and really improve performance. So the idea is you can bring a SQL query in, we'll send a map job to Hadoop so that it can run a MapReduce job within HDFS and bring back a subset of the data through the HDFS blocks and join it to our PDW data. That will significantly improve performance. It will provide you much more agility and make this whole join of PDW and Hadoop an even closer process.
And when we talk about PDW V-nexts—when we talk about the next what we call appliance update—it's kind of like a service pack, but since we apply everything to the appliance at once, we call it an appliance update. Appliance updates include all sorts of new features, too, so it's like a service pack on drugs, because we'll include all sorts of new features with each appliance update. Those come out fairly regularly. We're doing an appliance update on average about every 6 months. PDW V1 came out in 2009 or so, and it had 4 appliance updates—3 appliance updates and a 3.5, which had some new features in it. V2 has been out for just a couple of months now, and we're already looking at appliance update 1 sometime before the end of the year. So when we start talking about these kinds of features, it's really exciting how quickly they come about, as the dev team is very aggressive in adding new features to PDW. All right, so the Hadoop integration looks like this. You might have seen this this morning. Creating an external table—basically, you're just creating a table like you would in SQL Server, and then you're pointing it at your HDFS data in Hadoop with your location statement down there at the bottom. It's really straightforward, and it pretty much works against multiple Cloudera, Hortonworks, and HDInsight versions of Hadoop. Looking at it from a slightly different angle—you basically create your external table, you give it the location of your Hadoop cluster and then your file format, and you're done. That creates a structure over your data in Hadoop, and you're off and running—from that point forward, that table looks like any other table to the end users, and they can join it to any table that you have within PDW and query the data as necessary. And PDW handles all of that under the hood from then on out.
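The external table DDL just described might look like this sketch (the name node address, file path, and columns are all hypothetical):

```sql
-- Hypothetical PDW V2 external table over pipe-delimited text in HDFS.
-- This is metadata only: queries against dbo.ClickStream read the HDFS
-- blocks in parallel through the HDFS bridge at query time.
CREATE EXTERNAL TABLE dbo.ClickStream
(
    EventTime datetime2 NOT NULL,
    UserName  varchar(50),
    Url       varchar(200)
)
WITH
(
    LOCATION = 'hdfs://10.10.10.100:8020/data/clickstream/',
    FORMAT_OPTIONS ( FIELD_TERMINATOR = '|' )
);
```

Once created, `SELECT ... FROM dbo.ClickStream JOIN dbo.SomePdwTable ...` works like any other join; the end user never sees that half the data lives in Hadoop.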
So additional examples of how to go about creating the linked tables—we have a new statement called CETAS, which means create external table as select. That means we can create an external table in your Hadoop environment—your HDFS environment—pulling data from PDW, or even joining data from Hadoop and PDW, and creating a new table within Hadoop with a CETAS, for example. So lots of options there, and it's done through the HDFS bridge, which resides on PDW. It's part of what we call the Data Movement Service, which runs on each node within the appliance and moves the data back and forth across the appliance. I'm not going to spend too much time on that. Basically, at the end of the day, you have a query against an external table executed in PDW. The HDFS bridge reads the data blocks using Hadoop's RecordReader interface. That data comes into PDW; we then hash it on the distribution column—if it's a distributed table—and we'll move it about as necessary to store the data for queries for the end users. So, considerations that are kind of important: the data can be changed or removed at any time on the Hadoop side. You really don't have any control over that as an end user. So if it's stored in Hadoop and your Hadoop administrators want to delete that data, they can do that. When you say create external table, that's just metadata over the data. The same thing is true in Hive. If you create a table in Hive, somebody can come along and delete the text files underneath it, and you can't stop them from doing that. PDW does not guarantee any form of concurrency control or isolation level—just like anything else in Hadoop—because we just don't have any of those guarantees. So the same query can get different results if the data has been changed, and all that kind of stuff—just caveats you have to understand when dealing with the Hadoop environment.
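The CETAS statement mentioned above—pushing a PDW result set back out to Hadoop for downstream consumers, or for the disaster-recovery export discussed earlier—might look like this sketch (names and location hypothetical):

```sql
-- Hypothetical CETAS: materialize an aggregated PDW result as delimited
-- files in HDFS, so Hadoop-side consumers can read it without ever
-- connecting to the appliance or competing for its concurrency slots.
CREATE EXTERNAL TABLE dbo.SalesSummaryExport
WITH
(
    LOCATION = 'hdfs://10.10.10.100:8020/exports/sales_summary/',
    FORMAT_OPTIONS ( FIELD_TERMINATOR = '|' )
)
AS
SELECT CustomerKey,
       SUM(SalesAmount) AS TotalSales
FROM   dbo.FactSales
GROUP BY CustomerKey;
```

The select can just as easily join PDW tables to other external tables, so the statement works for the "mash up, then publish back to Hadoop" pattern described above.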
It's the eventual-consistency environment that we hear about from the NoSQL crowd. In dealing with Hadoop, that's generally what you have to deal with. And here are some things on setting up and monitoring it—not a lot of detail we really need to talk about today. [Empowering Users—Power View Integration] So, Power View integration. This is actually pretty exciting. How many people have users using Power View in their environments today? Not that many yet, right? But we'll probably see a lot more of that over the rest of 2013, as people get Office 2013 and SharePoint and continue to get the upgrades as necessary. But when you start talking about terabytes and terabytes of data, and you start talking about Power View, you start thinking about tabular mode and in-memory, and it goes—how do you deal with all that in-memory technology and get Power View when you're over terabytes of data? Well, the way to do it is, within PDW we support DirectQuery now. So you can set up a Power View environment through an Analysis Services tabular-mode model leveraging DirectQuery against PDW to get some pretty good performance for Power View, so you can allow PDW to really drive that Power View environment over terabytes, or even petabytes, of data if necessary. I try not to talk about the petabytes all that much, because most people really are just getting into the terabytes range, and that's what we want to talk about. So Power View, along with our other tools—this is just kind of an eye chart of the BI functionality and connectivity that we add to PDW and what drivers we have available—just to let you know that the Microsoft BI stack integrates with PDW very well now.
Just using the native drivers that we have available, you can connect to PDW and leverage it, whether it's Analysis Services in MOLAP or tabular mode, or Reporting Services, or PowerPivot, or what have you—we have the drivers to leverage it as necessary. Also, with SQL Server 2012 PDW we've added a lot of T-SQL parity. This has been a big driver for us from V1, with the RTM release through here: to continue adding parity. If you'd been to V1 talks and you know the kinds of data types that we supported, the lack of stored procedure support, and the other things we didn't have in the original V1, and you see where we are today in just 2½ years, it's just amazing the improvements that the product group has made. And some of these are not trivial things to do—basically being able to add compatibility for things like SQL Server Data Tools. So to manage your PDW environment now you can use SQL Server Data Tools, and so we have Visual Studio and Data Tools to manage our PDW environments. All our BI tools can connect to PDW, and then we have lots of 3rd-party tools that can connect to V2 just fine. Here's just an example—some additional options that we have available that weren't available before—things like SET ROWCOUNT, db_id, and db_name. These are really important things that we need in our product so that 3rd-party products that work with SQL Server are more likely to be usable against PDW, and that's one of the things the product group has really been pushing for over time: to get that kind of parity. We're not quite there yet, but we're getting really close to having full parity between SQL Server and PDW. Lots of stored procedure improvements. And you can see here some additional SET options that we have for stored procedures, the improvements that we've made, and additional system stored procedures that are now available within PDW.
Some of these are not trivial in an MPP environment, so the fact that they've gotten these done over just the last couple of years has been pretty impressive. Additional metadata functions and security functions that we have available, like user_name and suser_sname, are pretty big. And then some additional catalog views and general T-SQL—so lots of T-SQL compatibility there. So, getting towards the end, in summary we talked about any data, any size, anywhere. You can scale from a quarter rack—a quarter rack is just a few terabytes in size—all the way up to 6 or 7 racks, where you start talking about petabytes in size. Price-wise, this is very comparable to what you would build if you were building a fairly large SMP box. If you start talking about a decent-sized DL980, or Dell's comparable box, along with the SAN technology—building out your SAN and all your disks to get the right, balanced I/O—you're probably right in that same kind of range as what you might be able to get with a quarter rack of PDW, and then you know you can scale out from there. Being able to do the visualizations through Power View is important. And then finally, really important are virtualization, the management simplicity, and the clustered columnstore—all really important additions to PDW. I have a few more slides that are screenshots, and I'll show them if there aren't any questions, but I'll take a few questions if you don't mind. [inaudible question from the audience] So the question is, what do you need to do to your existing database to put it into PDW? The big one is what I started off with before, which is deciding which tables you're going to distribute and which ones you're going to replicate.
So, taking a look at your data warehouse and determining, based on data sizes, generally speaking, what you're going to distribute and what you're going to replicate; and then once you've decided that, the next thing is choosing a distribution column. That's really important. When you start talking about multiple fact tables, you want to look at which columns you can distribute on so that you evenly distribute data across your appliance. Take an example—let's say we have a fact sales table, and it's Microsoft's fact sales table. We have lots of dimensions in there, and one of them is product ID. That probably wouldn't be a really good choice for our hashing, because we sell lots of Windows, and we sell lots of Office, but we sell comparatively less of other items. Take SQL Server—well, we sell a lot of SQL Server out in the environment, but it's not in the millions and millions of installations; Windows 8, by comparison, sold over a hundred million licenses in just 8 months. And then we also have things like Dynamics and other products where you're really talking hundreds or thousands of sales. So if you distributed on product ID, you'd have one distribution with all your Microsoft Windows 8 Professional sales, and you'd have another distribution with all your Mouse 3000 sales. And if you added enough products, it might level out. But most likely, in that particular example, it wouldn't, because you have so many Windows and so many Office sales. What you're going to have is some distributions with more data than others, and whenever you run a query against that table that includes that data, you can only go as fast as your slowest distribution. So ideally, we want to have level distributions and level data across the appliance. So you want to pick a different column. For somebody like Microsoft and our fact sales table, customer would probably be fine.
Customer ID is one of the columns that's going to be in there, and it's probably a good choice because we'll probably have other fact tables with customer ID in them too. When we join 2 fact tables that are both hashed on the same ID, we have no data movement, because the data for both resides within the same distribution, and thus we're able to join them on the same compute node without having to move data. Some customers are going to buy more than others, sure, but it's not going to have that wild skew we had with product ID, plus we have a lot more customers than we have products, so we're going to be able to spread that data out. Looking at those things ahead of time, before you move to PDW, is really important. Then look at how you're going to load data—whether you're going to use SSIS or the built-in tool, which we didn't talk about today, called DW Loader, which is essentially a bulk loader for PDW. You drop files onto your landing zone, you kick off DW Loader, and it loads data into PDW fairly efficiently. If you already have an SSIS environment running your data warehouse, we have PDW connectors for SSIS, so dropping those into your current SSIS packages and changing your destination adapters is actually a fairly reasonable way to go. And SSIS is actually pretty exciting here: the way it loads data into PDW is, again, in parallel. You have this one landing zone with SSIS running on it; when it connects to PDW, it understands that it has 10 nodes or 20 nodes or however many compute nodes you have, and it starts loading data across the nodes in parallel. There isn't a bottleneck at the control node as it loads data—it actually loads to all the nodes at the same time, so it gets really great performance that way.
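To make the replicate-versus-distribute discussion concrete, here is a minimal sketch in PDW's CREATE TABLE dialect. The table and column names are invented for illustration; the `WITH (DISTRIBUTION = ...)` clause is the PDW-specific part.

```sql
-- Small dimension table: replicate a full copy onto every compute node,
-- so joins against it never require data movement.
CREATE TABLE dbo.DimProduct
(
    ProductID    INT          NOT NULL,
    ProductName  VARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- Large fact table: hash-distribute on CustomerID. A high-cardinality,
-- evenly spread column avoids the skew you would get from ProductID,
-- where a few hot products would overload a handful of distributions
-- and slow every query to the pace of the largest one.
CREATE TABLE dbo.FactSales
(
    SaleID      BIGINT NOT NULL,
    CustomerID  INT    NOT NULL,
    ProductID   INT    NOT NULL,
    SaleAmount  MONEY  NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerID));
```

A second fact table hashed on the same CustomerID column would be distribution-compatible with FactSales, so joining the two stays local to each compute node, as described above.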
[inaudible question from audience] Is there any way to test this technology? The answer is yes. We have PDWs at all the Microsoft Technology Centers across the world, and our partners, like HP and Dell, also have appliances available. Basically, there are POCs—proofs of concept—that the sales team will do with you, in which we'll take a set of your data into a Microsoft Technology Center, load it up, and show you the difference in performance based on a defined scope—hey, if we do this, these are the things we're going to do—trying to scope it down to a 2- or 3-week POC. I would just say talk to your Microsoft sales representative to set that up. Yes, sir? [inaudible question from audience] Do we support multiple-site failover? No, we do not yet. PDW is an appliance—we don't use AlwaysOn, and there is no replication. We do have backup, so you can do backups to a backup server, and there's a reference architecture on how to build a backup server. Like I said before, if you had a Hadoop environment, doing remote table copies out to Hadoop and doing something with that can be pretty exciting. What a lot of customers are doing today in V1, if they have multiple appliances, is doing dual loads into both—sometimes they'll have maybe 3 racks in one data center and only 1 rack in another, and they're just keeping a subset of the data in the second place. But we don't have anything like that built in yet, no. Yes, sir? [inaudible question from audience] So the question is, in case of a single-disk failure, what happens to the data on the disk? The best way to see that is to go back to that picture I had at the beginning.
And what happens is each disk within the JBOD is part of a RAID 1 mirrored pair. Let me slow down: on each node we have 16 files—basically 2 files per filegroup for each distribution—and each one of those files sits on a RAID 1 array. So what happens in a failure is one of those disks fails. Let's say disk 2 up there fails; disk 1 has a copy of that data. When disk 2 fails, you'll see down at the bottom that we have some extra disks that are not highlighted. One of those disks will kick in and become the second disk of that RAID 1 array, the data will be copied from disk 1 to, say, disk 67, and then you'll have a RAID 1 array again. You'll also have a light on the appliance at disk 2, so if you're doing a walk-through of the data center, the light will be on to let you know to replace it. Also within PDW—if we go down here, to my extra set of slides—we have things like this, which gives you the overall health of the appliance. What would happen there is, if it was SQL CMP01, that overall checkbox would turn red, and you can click on it, and it will dig down and show you why that particular node is red. It'll actually tell you—hey, disk number 2 in this JBOD has failed and needs to be replaced. [inaudible question from audience] So first of all, the appliance is still up and running. You still have the data on disk 1, and it will do the copy over. Depending on how much data's on there, it takes a few minutes to an hour or 2 to get the copy in place, but I can tell you, I've been working with dozens of customers over the last 3 years, and we have lost no data within PDW to multiple-disk failures. The reality is that you're always going to have disk failures in an appliance like this.
When you start talking about hundreds and hundreds of disks, and you figure each disk averages a lifespan of about 5 years, and you do the math on 5 years against how many disks are in the appliance—you're always going to have disks failing, and it happens in your SANs today. The idea is that as soon as that happens, we have escalated response times for PDW within support, and escalated response times from our hardware vendors, to get that disk replaced the same day so we can keep going. And you have multiple spares within the appliance before that even matters. I'll take you first and then you, sir. [inaudible question from audience] All right. So the question is, if I create a table as a replicated table and decide, no, I actually want to change it to a distributed table, can I just do an ALTER TABLE? You cannot do an ALTER TABLE, but we have a statement called CREATE TABLE AS SELECT—CTAS—and essentially what that allows you to do is define the metadata and the distribution of the new table at the same time you create it. So basically you make a copy of the table with your new definition, and you're done: you delete the old table, rename the new one, and you're off and running. It's the simplest thing to do. Within services, we really don't worry at first about whether we picked the wrong distribution, because it's so easy to fix. Let's say you have 10 terabytes of data in a 1-data-rack appliance. That's really 64 distributions, so you take that 10 terabytes and divide it by 64—that's actually how many physical tables you have back there, and roughly the size of each. When you do a CTAS with a new distribution, that's actually 64 queries running in the back end to redistribute that data into the different format. It happens very quickly.
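The CTAS pattern just described might look like the following sketch. The table and column names are hypothetical; RENAME OBJECT is the PDW statement for the final name swap.

```sql
-- Rebuild the table with a new distribution column in one parallel
-- operation; each distribution runs its share of the SELECT.
CREATE TABLE dbo.FactSales_ByCustomer
WITH (DISTRIBUTION = HASH(CustomerID))
AS
SELECT * FROM dbo.FactSales;

-- Swap the new copy in: drop the old table and take over its name.
DROP TABLE dbo.FactSales;
RENAME OBJECT dbo.FactSales_ByCustomer TO FactSales;
```

Note that both copies exist until the DROP, so the appliance temporarily needs free space slightly larger than the table being redistributed.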
We do terabytes at a time, and it takes just a few minutes to redistribute a couple of terabytes of data. Yeah, one more. [inaudible question from audience] The next steps? Okay. Was it this guy? Oh, the MapReduce one. Okay. Go ahead and ask your question. [inaudible question from audience] So in PDW V2 release 1, we're not pushing down MapReduce jobs. We're essentially bringing in all the data from the HDFS files you're pointing at. And when I say all the data, that's not quite true—only the columns you specify—but we're bringing all the rows in. So if you have a predicate that says, bring me these 3 columns from this dataset where such-and-such happens, we don't apply that WHERE clause until the data is in PDW. In the next step, we'll run a MapReduce job to apply that predicate within Hadoop before we bring the data into PDW, to improve performance. [inaudible question from audience] Yeah, that will be in appliance update 1. Yes, sir? [inaudible question from audience] Does the new columnstore add any restrictions or complexity with re-indexing jobs? No. You can do a reorganize or a rebuild. [inaudible question from audience] The question is, is it online? Yes—an ALTER INDEX rebuild of a clustered columnstore index is online. Also, whenever the tuple mover runs in the background—automatically compressing row groups once they have enough data—that is also online, so it's really important for us to provide those online operations so that you can continually add data as necessary. [inaudible question from audience] So the question is, does PDW support the use case of consolidating multiple databases onto one appliance? Absolutely. In fact, we've had several customers I've worked with where one group wanted PDW for a particular project.
Maybe they had a couple of terabytes of data, but it wasn't really enough to justify the purchase of an appliance. So they went out and evangelized within the company to find other groups with data warehouses they wanted to consolidate. From an organizational standpoint, that's just a no-brainer; they were able to do that, and it worked really well. I'll take one more question. [inaudible question from audience] You can always redistribute data by using CTAS—CREATE TABLE AS SELECT. You're going to create a copy of your data, so you're going to need enough space to do that, but as long as you have enough space—slightly more than the table you're redistributing—you can always redistribute on a different column. All right. So I'm out of time. I really appreciate everybody staying until the end of the day. Thank you very much. [applause] Thank you. Congratulations. Great questions. [SQL Server 2012 Parallel Data Warehouse Appliance]

Video Details

Duration: 1 hour, 15 minutes and 48 seconds
Country: United States
Language: English
Genre: None
Views: 7
Posted by: asoboleva99 on Jul 9, 2013
