Richard Seroter talks about building highly available cloud integrations and the best practices, particularly when multiple cloud services are involved.
Integrate 2018, June 4-6, etc.venues, London
Yeah. Today we have Richard, who doesn’t really need an introduction. He’s been in the community since before I joined and started working with BizTalk in 2004. And today Richard is gonna talk about the high availability of your cloud integrations. So, yeah, please welcome Richard.
Richard: Thank you. All right. A few minutes before this started, there were like nine people here. So you all pranked me well by showing up at the end. Good job. My ego was about to take a hit.
A quick show of hands, who’s built a model plane before? A few people. Okay. If Clemens was here, I think he always has like three in his backpack. So that’s fine. I actually took this from his Twitter feed. But, you know, if you build one of these, you’re all familiar that the glory parts are your cockpit, your wings, your tail, you know. The same with your systems, right? When you build your systems, everything that gets all the praise is at the frontend: the mobile app, the REST API, things like that. But I would contend the most underrated part, the one that gets absolutely no praise, is the glue. It’s the thing that ties it all together. A model isn’t really useful if the first time your kid touches it the whole thing collapses and it’s really just a showpiece. And your system has to survive first contact with users as well.
So as we think about building systems, are we really paying enough attention to the glue? So, yeah, I’m thinking what we really are all after is those connections between our things that actually can survive, right, that are durable, that are reliable, that can handle failures. And so what I’m gonna talk about today is a few different patterns: if you’re an integration architect, what are those core patterns you should care about for high availability? I’m gonna go through a lot of the core glue services in Azure and talk about what is baked in for availability and what things you actually have to do that aren’t included for free. And then finally I’ll talk about a few of the things we wanna bring all together, to see how it makes sense when you’re working with all of it. Hope that sounds good. If not, the doors are already locked from the outside, so you’re mine for the next 45 minutes.
So, core patterns. Let’s talk about a few of the ones you should care about upfront. The first one is thinking about retrying transient failures. When you’re dealing with a cloud system, of course, you’re gonna come across those occasional hiccups, right? You can have a network blip, a network partition, the database will be overloaded temporarily. Some kind of fault will happen, just occasionally, right? So we have to remember that’s not a catastrophic failure; that doesn’t mean shut the whole app down. It means, when we build integration systems, your middleware has to be smarter than that. You have to be thinking about what’s the right retry policy. Should I retry immediately? Do I do a backoff strategy? Because if I just keep retrying endlessly, maybe I will actually take the system down. So I’m thinking about a smart retry strategy.
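A minimal sketch of that kind of smart retry policy, with exponential backoff and jitter. This is illustrative Python, not the retry logic from any particular Azure SDK; the function names are made up for the example:

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.5):
    """Retry a flaky operation with exponential backoff plus jitter.

    `operation` is any callable that raises on a transient failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and let the caller handle a real outage
            # Back off: 0.5s, 1s, 2s, 4s... plus jitter so a crowd of
            # clients doesn't retry in lockstep and swamp the system
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The jitter matters: without it, every client that saw the same blip retries at the same instant, which is exactly the "take the system down by retrying" problem described above.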
Now, ideally when you’re working with cloud things, even in the SDKs that come with .NET and other languages, somebody has actually baked some retry logic in for you. Which is really great; it’s behind the scenes, you don’t even have to think about it. But as you factor in retries, obviously you shouldn’t just blindly retry your failed request all the time. What if it’s not idempotent and you’re actually decrementing a bank account over and over again and doing negative things? Or you might do a circuit breaker pattern, if you’ve seen that. Just like, I’m in the U.K. with my weird U.S. plug, and the hotel was smart enough to flip a circuit if I start consuming too much electricity.
Same within a software system, right? If your system’s overloaded, the circuit breaker pattern will actually stop flow to the downstream system until things are healthy again. And if you’re a .NET developer, then you can use Polly or Steeltoe or all these libraries that do that for you magically, which is great. So you have to think about: what is your retry strategy? What is the impact of it? Even if you’re dealing with transactions, right, just blindly retrying things may not make sense. So when you’re an integrator, one of those core patterns for availability is thinking about how you do retries.
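For illustration, here is a bare-bones version of what those circuit-breaker libraries do under the hood. This is a sketch, not the Polly or Steeltoe API; the class and thresholds are invented for the example:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures, instead of hammering a sick
    downstream system. After `threshold` consecutive failures the circuit
    opens; calls are rejected until `reset_after` seconds have passed,
    then one trial call is let through (the "half-open" state)."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow a trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

The point of the pattern is the fast rejection: while the circuit is open, callers get an immediate error instead of piling more load onto the struggling system.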
The next one, which is blindingly obvious in resilient systems, is thinking about load balancing. How do I scale out some more things? I think I just saw Azure added this giant 12 or 24 terabyte VM instance size, which is amazing. That might actually be enough to run SharePoint. Too soon? That’s great, but that’s not resilience, right? That’s still a single point of failure, even if it’s a monster box. If that thing tips over, it really doesn’t matter how big it was. So I’m always thinking about scaling out.
Now, the load balancer, of course, its job is to not just blindly send traffic everywhere; it’s supposed to be smart. So it should be thinking about, “Is this thing actually healthy or not?” If it’s not healthy, it takes it out of rotation. But what’s important, especially as an integration person, is that load balancing applies to literally every part of your application: your network layer, database tier, storage, compute, integration layer. Every single part of that. If you have anything that’s a single instance, you immediately have a failure point. So you should always be thinking about load balancing things.
And then finally, if you’re dealing in the cloud, again, nicely this is often always built-in. You don’t have to think about this. If I never have to configure a real load balancer again in my life that would be awesome. So, ideally our services are going to be taking care of this.
And then finally, if you’re dealing with something that does auto-scale, then this really feels magic because what I don’t want to do is have to load balance something, have to add things to the pool and then change things and up. Nope, just auto-scale, add instances automatically, take them away from the load balancer when the thing scales back. More and more of these cloud services, especially at Azure, take care of this for you.
So the next thing we should all get familiar with, if you’re not already, especially as an integration person, is replication. Everything gets replicated. And most importantly, it’s not just your data; it’s the metadata, it’s the reference data. We’re gonna talk about some of the Azure services that do this, but just copying the data is no good if you don’t copy the schema, the validation rules, all these other sorts of things. All of that should be getting replicated elsewhere.
So, how you read and write data might control this. You might create read replicas for databases. You might have sharding strategies for your event hubs, things like that. All of this will depend on how you actually wanna replicate. Now, I love the cross-region replication; we’ll talk about this with Azure SQL and things like that. But none of this is free. The speed of light has not yet been solved in Redmond. So I still have to think about the fact that there is a lag in the data. If I’m replicating from the U.K. to the U.S. to Asia, clearly, all of this is asynchronous. And that means, if something goes down, you’ll lose some data. You have to be tolerant of that.
At the same time, when you’re doing this sort of thing, it might not always be straight-up high availability. Sometimes it just has to do with the fact that things will eventually tip over, and is it easy to take that replica and spin things back up again? So all of this is about keeping your system online and helping you actually survive a legitimate failure.
So the next one, and again I think we’re familiar with this as integration people, is actually purposely throttling your users. I mean, cloud systems are awesome, but how can I make sure one user isn’t actually swamping the entire thing? Which does happen from time to time. It makes sense. But you, as a system architect, or you as an engineer or developer, have to consider that specifically, so you may do a couple of things. You may actually outright reject them, right? “Forget it. You’ve used your quota.” Or you do kind of the Netflix strategy of, “Hey, all of a sudden the quality of my video seems a little sketchy, but I’m still online.” Right? You’re purposely returning lower quality results, but you’re keeping them online. It’s a good strategy. You may send back a smaller payload from a request-reply REST service. You may be doing other things that actually purposely degrade the experience but don’t actually take them offline.
But the most important thing here is, you should be very transparent about this. Don’t secretly throttle people. That’s not a particularly good thing to do. Instead, you should be returning something back in the header that says, “Yes, you’re being throttled.” Or, like Netflix, I get a little message in the corner saying, “Yep, we’ve lowered your video quality.” You should be telling people you’re doing these sorts of things; otherwise they may continue to beat you up, not knowing you’re actually gonna take them offline or quiet them down.
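A sketch of that transparent-throttling idea as a request handler. The status code (429) and the `Retry-After` header are standard HTTP convention; the `X-RateLimit-Remaining` header is a common but unofficial convention, and the quota numbers here are made up for illustration:

```python
def throttle_response(requests_used, quota, retry_after_seconds=60):
    """Return (status, headers) for a rate-limited API call.

    Over-quota callers get an explicit 429 plus headers telling them
    they are throttled and when to come back, instead of being
    silently slowed down or dropped.
    """
    if requests_used < quota:
        # Still within quota: tell the caller how much headroom is left
        return 200, {"X-RateLimit-Remaining": str(quota - requests_used)}
    # Over quota: reject loudly and transparently
    return 429, {
        "Retry-After": str(retry_after_seconds),
        "X-RateLimit-Remaining": "0",
    }
```

A well-behaved client can then combine this with the retry pattern from earlier: see the 429, read `Retry-After`, and back off for exactly that long.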
Then there’s load level. And one of the reasons… I know, I like using BizTalk in the beginning is, how do you kind of protect some of these downstream systems by putting something up front, right? How do I put something that actually says, “I might be swamping it upstream, I’ve got all this water piling up, but my downstream system is gonna take a steady stream. It doesn’t matter how much water is behind me, it’s gonna take it as fast as it can. So I don’t take down that old system that can’t handle 100 transactions per second. I’ll just quietly throttle it down.” So your middleware gets speed up. We saw Steven talk about this yesterday with things like Service Bus, right? There’s a decent pattern that says, “For any high intake throughput, you should send it to Service Bus first, let it get completely destroyed. And then you just have an adapter that pulls it on premise quick as you can handle it.” And that way you’re taking advantage of something that can handle it.
So one of the nice things is this keeps you from auto-scaling to death. Because in a cloud system, if you have crazy bursty traffic and auto-scaling is trying to be super helpful, you’re never gonna stop: you’re gonna scale up, scale down, scale up, scale down. Everybody gets exhausted. That’s a lot of churn. So maybe I actually want to load level and not require my compute nodes to constantly be going in and out. I’m just gonna have a nice consistent flow of traffic. It might save me money, it might make life easier.
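The queue-based load leveling pattern described above can be sketched in a few lines. This is a toy in-memory stand-in for what Service Bus gives you durably; the class name and drain rate are invented for the example:

```python
from collections import deque

class LoadLeveler:
    """Queue-based load leveling: bursty producers enqueue as fast as
    they like; the consumer drains at a fixed rate that the downstream
    system can survive."""

    def __init__(self, drain_per_tick=100):
        self.queue = deque()
        self.drain_per_tick = drain_per_tick

    def enqueue(self, message):
        # Absorb the burst; the producer never waits on the downstream
        self.queue.append(message)

    def drain(self, handler):
        # Hand at most drain_per_tick messages downstream per tick,
        # no matter how much water is piled up behind the queue
        for _ in range(min(self.drain_per_tick, len(self.queue))):
            handler(self.queue.popleft())
```

The real version adds durability and delivery guarantees, but the shape is the same: the queue absorbs the spike so the downstream system sees a flat, survivable rate.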
Sometimes we don’t think about security as a high availability thing, but as we’re all designing and integrating systems creating that glue, having a proper security system makes a ton of sense. How do I not have one account that everyone uses? How do I make sure I don’t have one service plus topic in queue that literally everybody hangs off of? One of the nicest things about cloud systems is I can actually spin up kind of like microservices-like instances of my integration services with least privilege where I’m not giving everyone God rights.
Encryption at rest. It’s so easy. If you’re not doing it at this point with some of these services, you’re actually being somewhat negligent, because it is that simple. So turn those things on, make sure your data stays safe. And then denial of service. I mean, nobody is getting beat up more now than cloud providers when it comes to denial-of-service attacks. But what’s great is they’ve built in some really sophisticated things that you all get for free. So, when you think about this, a security event is actually a high availability problem, because they’re actually trying to take your systems offline.
Then we get to automation. So, specifically here, thinking about, “How do I make sure that when there is a disaster, and there will always be a disaster…” The best part of cloud computing is it has taught all of us that things will fail, all the time and constantly. Because sometimes we got into this mindset that we can actually architect out failure, which is impossible. You can only reduce your mean time to recovery. That’s the only thing we can do. So when I think about something going wrong, how quickly can I actually duplicate my one production environment somewhere else? And with things like ARM templates, with all the scripting, I can literally recreate my environment somewhere else in a few minutes. That’s amazing, right? That’s great. You might not wanna pay the tax of actually having duplicate environments in nine regions. That’s really expensive to pay for.
So what you might wanna have is a template that says, “Look, when this thing collapses, I can spin this thing up instantly and it looks the same.” Sheesh, I can have the same network space in Azure in multiple virtual networks, literally the same IP space. So I can duplicate all of that programmatically. If you’re not scripting that, you’re actually doing yourself a disservice, because what you’re not gonna wanna do is go into the portal and start right-clicking and adding things. That is not the way you can build this stuff at scale. So build more robots so that you don’t get paged at 2:00 a.m.
So let’s talk about some of the Azure services and how they support availability. These are some of the glue services. So when you’re connecting all your stuff. These are the services I think matter the most. So let’s about Azure storage. This is one that’s, I don’t know, a little underrated to me until I started screwing around with this a little more. So what does it do for you? What do you have to do? Which I think is really an important designation, because sometimes we can treat Cloud services as magic that just, “Hey, my systems stay online.” And that’s a remarkably bad way to think, right? We take a lot of personal responsibility for what happens in the cloud. So as your storage, you get file system, SMB protocol compatible file share. That’s really cool. You get this storage for your VM. That seems kind of important. And then blob storage or object storage, if you’re familiar with that. I have binary objects, videos, backups, whatever I want.
All of those are included in every region, really nice. There are four replication options, if you’re not familiar with these. So by default, you get the locally redundant option. It gives you 11 nines of durability over a given year; it’s almost impossible for that to lose data. So this is nice. It creates a few replicas within the data center, and it actually spreads them across what Azure calls fault domains and upgrade domains. So even if a whole rack falls over, your storage would never all be on one rack of hardware. And if they’re upgrading the software, all your storage wouldn’t only be on one node; it’s always spread out. So this is free. You get this by default, you don’t have to think about it, which is really nice.
Now, you get to some of the newer ones like Zone-Redundant; this actually spreads it across availability zones. This gives you 12 nines of durability. So this spreads it across AZs in the region. That’s pretty cool that that’s built in. But Geo-Redundant is where it gets interesting. So this is 16 nines of durability, which starts to get ridiculous. I can’t even… I mean, it loses like a kilobyte a year or something like that. So in this case, it actually spreads it across other regions hundreds of miles apart. So I can literally survive a disaster in one region and my storage is replicated, in this case, asynchronously. The other two do synchronous replication for all the copies, which is pretty cool. This one, as you’d expect, is async. So it’s actually replicating into another region. Now, you can’t access the secondary, right? There’s no URL for it. You actually don’t get to say there is a disaster; Microsoft has to declare a disaster for it to actually flip over. So you’re a little bit at the mercy of the Big Blue Machine there.
So you might wanna actually use Read-Access Geo-Redundant storage. This is the same as the other one, but the secondary is actually accessible. You can actually have a URL, you can access it, you can use it. It’s just a read-only replica of everything in your primary, which is really neat. So this gives me all sorts of different regions and I have this sort of read replica, which is pretty powerful stuff.
And then, for you, this provides encryption at rest, handles the keys, all that kind of stuff, just default checkbox functionality. So all that is just given to you; you have to do virtually nothing. So what do you have to do? You’re not absolved of everything. You have to set your replication options. You have to choose one of these, right? There’s no magic AI bot yet that’s reading your mind and predicting your storage needs. You have to choose the secondary storage access strategy. This is not one URL that just hits both locations. You have to say, “Look, this storage is offline. Therefore, I will use this other URL.” So your code has to be smart. You have to be smart enough to detect the failure and then start using the replica.
And then finally, you have to think about your encryption, because on one end you can do server-side encryption. This is the one I mentioned earlier, where the server side magically encrypts it, you’re all set. If you wanna do client side, like using the .NET SDK, you can actually encrypt it client side, send it over the wire encrypted, Azure Storage does nothing, and then when you get it back, your client code decrypts it. It kind of depends if you want more control over the encryption process. But that’s on you. You have to choose how you wanna actually secure and encrypt data. So Azure Storage does some really good stuff for you. Again, this is a huge glue service; as you’re integrating different systems, you’re probably dropping log files, you’re dropping different file pieces, objects, components. So this will probably be a part of almost any Azure architecture.
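The “your code has to be smart” point about RA-GRS can be sketched like this. The `read_from` callable stands in for a real HTTP or SDK call, and the URLs below are hypothetical; Azure does name the read-only secondary endpoint by adding a `-secondary` suffix to the storage account name, but the failover logic itself is entirely your responsibility:

```python
def read_blob_with_failover(read_from, primary_url, secondary_url):
    """Try the primary storage endpoint; on a connection-level failure,
    fall back to the read-only geo-replicated secondary.

    `read_from` is any callable that fetches a URL and raises OSError
    (the base of Python's connection errors) when the host is down.
    """
    try:
        return read_from(primary_url)
    except OSError:
        # Primary region unreachable: serve the (possibly slightly
        # stale) read-only replica instead of going down with it
        return read_from(secondary_url)
```

Remember the replication is asynchronous, so a read from the secondary may lag the primary; the pattern trades a little staleness for staying online.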
Let’s talk about then the next one, SQL database. So what does this bad boy do for you? So, by default, underneath the covers, this uses highly available storage, Azure storage, for the actual pieces under it. So that’s great. You can actually scale up to bigger and bigger machines. And you can actually scale out. We’ll talk about that, which is pretty nice, you get to read replicates. So what’s great is that you can actually pick and create multiple read replicas. I could say, “Here’s my primary in Germany. I’m gonna go, we’ll create a read replica in the UK and then create one in the U.S.” And you can actually chain them which is pretty wild. So I can actually create a replica that then reads from another replica. And you have this weird giant chain of replicas. You can say, “Why would I do that?” I might do it for availability and purposes and others. But the point is, it’s up to you to use them but you just get to create replicas. It’s a click option. And then everything gets synchronized asynchronously. And then your code, could be in region reading from the replica instead of the primary.
And then there’s Built in Backup and Restore automatically. You get multiple days of actually full backups, transactional backups, all that sort of stuff. Just comes for you. DBAs have to find something else to work on.
And then threat detection is actually somewhat underrated, but this is part of Azure SQL Database, where it looks for SQL injection attacks, it looks for weird anomalous behavior. So all that stuff’s happening automatically, versus someone looking at a screen and trying to read transaction queries. This thing is actually monitoring threats. So what do you have to do? Well, you have to create replicas. Azure SQL is not magically creating replicas and auto-scaling. That’s a choice for you, to scale out your workloads and improve your availability; that’s on you. And you have to decide your scaling strategy. Do you wanna go for bigger and bigger honking machines, or do you want to scale out? Those are maybe dual strategies; you may do both. But those are both on you. You have to choose your scale strategy.
Of course, you have to restore your own backup. Microsoft will not detect that your database needs to be fixed and then magically back it up and restore it. It will just do backups. You have to actually restore it to something else. And then you have to actually turn on threat detection. This is not on by default. So if you are actually worried about threats… and I think this is free. If it’s not, just bill it to Sandro or something, but not me. But turn this on. It’s brain-dead simple to make sure your database is actually being protected. This is where being in a cloud platform does help, because you’re just kind of getting these cool, effective things running in the system.
So, Azure Cosmos DB. I mean, I would contend that this is the most interesting service in Azure, potentially, just because it’s such a useful thing that’s so interestingly engineered. So what does this do for you? Again, this is that multi-model database where, if you want graph or key-value or document store, it’s all in one crazy thing. So what do you get for free? Five 9s of availability for reads when you have replicas. And this is the only service where I think I’ve seen… if you look at the SLA for Cosmos DB, because reading SLAs is awesome, this one has crazy latency SLAs and transaction SLAs; they have a really tight performance SLA. It’s not just, “Hey, we’ll try to keep it online.” It’s, “You’re gonna be able to hit this and get really quick responses back.” So they’ve made a legit commitment that this will be probably the fastest database option in the cloud.
So it automatically does data partitioning, automatically replicates data. So you don’t have to say upfront, “Well, I think I’ve got this much data. It should be this many data partitions.” No, just throw data into it and it will get re-partitioned and balanced and whatever behind the scenes. You just throw data at it, right? That’s really nice, so you don’t have to do any crazy architecture there. It’s also pretty wild: it has multiple consistency types. You can do strong consistency, where everything has to get replicated and synced before it tells you it’s successful. You can do eventual consistency, so all the replicas eventually get up to date. And there are several in between. So you have a lot of different choices for how consistent your data has to be. It’s gonna be faster to be eventually consistent, but for some of your workloads, you might actually want a real transactional system.
And then this one is cool. Unlike storage, where you’re kind of at the mercy of Microsoft for the most part to declare a disaster, with this one you can wait for Microsoft or you can just click a button and say, “I’m failing over to one of my replicas.” And when you fail over to a replica, what’s awesome is it actually makes that a read-and-write replica. So it becomes like the master node. So I just click a button, and all of a sudden it makes that one the new primary. And I’m good to go. And the whole time you actually have no service disruption. So you can keep calling that database even while it’s failing over, which is pretty wild. So that’s really nice, all baked in for free.
So, what in the world do you have to do? You technically have to define a partition key, if you care. If you don’t, then Cosmos just round-robins your stuff across all these partitions, but you can define a partition key. That’s fine. The only real thing you have to do is define throughput units. That’s the only unit that you actually configure in Cosmos: how much throughput do you need to process all your data? And you can scale that yourself as well. And you choose things like the replication policies: “Where do I want to send stuff?” All that. And then you choose, “Do I want read replicas, write replicas, read-write replicas?” You’re choosing where to stick stuff. And so, it’s almost dumb simple in the portal: you look at a map, which most of us can read successfully, and then you just click things: put one here, put one here, put one here. Now, surprisingly, it doesn’t tell you how much that’s gonna cost you. They just make it super easy, a kid can do it. Your bill is then tripled. But simplicity is awesome. So it’s a really nice way to easily replicate all your data.
And then what’s really wild, is I told you about the consistency option. You can actually override the consistency on a per request basis. So you can see my whole database prefers to be strong consistency. But on these requests for this sort of data, adds eventually consistent. I just need a lot fast performance. That’s pretty crazy. So you actually have control, and this is up to you, to actually set consistency levels on a per request basis, which is wild.
And then finally, it would be up to you to trigger an actual manual failover. Again, for automatic failover, you can wait for Microsoft to say things are completely messed up, or you can just say, “Look, we’re gonna fail over.” Now, if you’re being really good, you do it yourself anyway, because you should be failing over all the time to test your infrastructure, right? You shouldn’t be dusting off your DR routines once every 18 months just to check a box and say, “We can do disaster recovery,” like I always did. You should instead be constantly failing over your systems to prove you’ve built a proper architecture.
Next one up, our good friend Azure Service Bus. What in the world does this thing do for you? Again, a really, really, really important glue service. No one probably tells you, “Azure Service Bus is working great.” They just complain when it fails, because this is just kind of a transparent part of a lot of systems. But you automatically get resilience in the region. It can handle failures within individual parts of the region and your Service Bus node stays online. And this is one that has throttling built in. So if you start pummeling the Service Bus, it will actually start rejecting requests and put something in the response saying, “Chill out, slow down a bit.”
And it does automatic partitioning as well. If you don’t turn on partitioning, you’re just on one message store, kind of one message bus node, and you could technically have some failure. By default, partitioning is turned on, which means you use multiple message stores and things in Azure, so your data is replicated, your config is replicated, which is nice. And it does offer this Geo-Disaster Recovery. Pretty much, it’s this idea of paired namespaces. I set up a secondary namespace in another region, I map it to my primary, and then all the metadata, not the messages, actually gets replicated. And if I do that, I actually have one connection string. So in your source code, even if you fail over, if there’s a disaster, your code doesn’t have to change. It’ll keep using the paired namespace in the other location, which is pretty nice. Now, again, your data doesn’t transfer. I still could potentially lose data, but I’d stay online. I would just start processing in my secondary region.
All right, so that all sounds good. So what do you have to do? Well, you choose things like message retention time, time-to-live: how long should it be able to stay in the Service Bus. You choose things like, “Do I want to use partitioning?” I don’t know what scenario you would not use partitioning in. I don’t think it costs you any more, unless you just like to live dangerously, which is cool too, I respect that. And then you can use things like premium messaging, if you actually want an isolated environment with consistent throughput and performance. This is something you clearly do pay for. But in this case you actually, bless you, you get your own environment, which is really nice.
And then finally, you have to configure Geo-Disaster Recovery. While that’s built in and the paired namespaces are super cool, this is totally up to you. If you don’t do this, and Azure Service Bus hits a regional outage, you’re out of luck, right? So you’ve got to actually choose to make sure you stay available there.
Good. So let’s talk about Azure Event Hubs. What about this character? Pretty similar actually, in terms of what it does for you and what you have to do. So, first off, it handles, of course, by default, just tons of load. Like, if you are sending in tens of thousands, hundreds of thousands of messages per second, per hour, this thing is gonna handle the event workloads. And Clemens talked about events versus kind of business messages and things like that. But if I just need to take in a crap ton of stuff, event hubs is your thing. That’s a terrible brochure for that, but you shouldn’t take that in.
It does built-in partitioning. So as I think about sharding, again, partitions in these sorts of models mean data is stored in multiple places. Each partition is actually replicated automatically, so that just happens for you. You don’t have to handle setting up these systems yourself. And then you’ve got auto-inflate. I gave Dan grief last year that there’s no auto-deflate; auto-scaling can’t work if you only scale up, but they still haven’t added it. So you can only get bigger in Event Hubs, which seems financially great for Microsoft. But in this case, you can do that, which is great. So, load keeps coming in and this thing will keep inflating, maybe infinitely, to charge you more and take in more data.
And this also supports that Geo-Disaster Recovery thing by pairing namespaces. So I can make sure I can continue processing crazy event throughput even if my primary location actually goes down. So again, I have to provision throughput units: what’s this thing doing? I have to pick a partition count. You can’t change this after the fact, which is kind of interesting. So I have to know up front how many partitions I need, and a low number means I might actually hit throughput limits, because I think one throughput unit equates to one partition. So I can’t just flood one partition; I have to spread out my workload. So you have to choose those things. There is a thoughtful decision there. And then you have to actually configure Geo-Disaster Recovery, again. This will take in tons of data; it’s still on you to make sure that you’ve built this in a redundant way.
All right. Next up: Logic Apps. So as we keep moving up the stack, you’d almost think there’s less and less you have to do, right, compared to storage or compute, where you take on more responsibility. As we move up to Logic Apps and Functions, you’re hopefully handing off a lot more responsibility. But for Logic Apps, in-region high availability is built in. Logic Apps can handle failures of racks and different things that are happening in the data center. That’s cool.
I got a meeting coming up. I’m sure you’ll be on time.
It imposes limits automatically on things to make sure Logic Apps stays online. So it actually does impose some limits on how big a message can be, so you don’t accidentally take Logic Apps down, and how long your request can run, so it doesn’t run too long. Systems do this to protect themselves, but you should know these limits just to make sure your systems are also highly available, by not sending in things that break all the different constraints.
And then, Jon, I think, talked about this on stage: you can synchronize some of those B2B resources, the integration account sort of stuff, the trading partner things. That can actually synchronize across regions automatically, which is handy. So the message data itself is not going across, but all the different transactional stuff, all the different kinds of metadata, actually does go across, which is handy.
So what do you have to do? Well, you actually have to configure that synchronization; it’s not magic. So you’ve got to set that up. And I’ll talk about this more in a couple minutes, but you also have to integrate with highly available endpoints. If you’ve chosen to connect Logic Apps to an individual server’s IP address, then you have made a bad choice, but you can make that choice, right? That’s up to you. So it’s up to you as an architect to only connect to highly available things.
And then finally, it’s really on you to duplicate the Logic App in another region. I think that’s the recommended pattern, because it’s not gonna magically work somewhere else. You should actually, if you care about it, duplicate that in another region and put something like Traffic Manager on top of it to send data to both of them, or fail over, things like that. It’s up to you to architect this specifically for redundancy across regions, because it will not happen for you automatically.
So, yeah, let’s talk about Azure Functions and the serverless space. Steef-Jan talked about this a little bit yesterday: everyone likes serverless for the whole pay-as-you-go and event-driven stuff, which is awesome. I personally don’t think it’s like the next-generation PaaS, because with Functions there’s a whole new architecture. You can’t, I don’t think, take your existing web application and just run it in Functions. No, you have to re-architect it for a series of things. So as you start using this, you also have to think about your resilience architecture a little differently. Take scaling. This comes for free: if you use the consumption plan, you don’t pick servers, you don’t pick anything, it just scales based on your request volume. And I think we saw that cool demo where you’re just seeing servers get added magically behind the scenes. That’s really neat. You don’t have to provision anything, just beat the daylights out of it and Functions will scale. That’s really cool functionality.
Now, if you use the App Service plan, that means you’ve probably got some underused App Service capacity and can just cram your functions on there. In those cases it won’t scale for you automatically; you have to scale your App Service. So scaling is built in with both strategies, but it behaves differently. And if you look at the SLA for Functions, it’s a little quieter than the Cosmos one. It pretty much says it will be online most of the time. Good. I mean, that’s what I look for in a service. So you’re not gonna get some crazy resilience SLA, it’s just gonna be online for your four nines, which I guess is okay.
So what do you have to do? Well, you have to choose your plan type, and this is the most important choice you’ll make in Functions. Is this gonna be consumption, meaning it just magically scales based on usage and I pay for that, or, if I have a consistent load or underutilized capacity, maybe I’ll use the App Service plan? That’s completely on you. And then you’re choosing things like the scaling policies when you use an App Service plan, because I don’t wanna be running Functions on something underneath that’s actually not gonna scale. So I’d be setting up all those scale policies for my App Service.
And then again, I think you’re replicating those functions in other regions, right? This isn’t gonna magically work somewhere else. You actually have to copy your functions and probably, again, put something on top like a global load balancer, Traffic Manager and things, to route traffic to both locations, or, again, do an active-passive sort of thing. If you’re thinking Functions just works in every region, no, you have to actually architect for this.
All right. The last one I want to talk about is VPN Gateway. If I never have to configure VPNs again in my life, I’d be thrilled, but for now we do. And this is an important part of your integration architecture, especially if you start connecting these hybrid cases of things in the cloud to things on premises. If you do this incorrectly, your hybrid architecture is kind of lame. So what does it do for you? Well, automatically, when you provision a gateway, you get two nodes, an active-standby pair, which is great. If the active node collapses or is getting upgraded, it will automatically use the standby. So there are some smarts in there, which is nice, and that’s part of it being run as a service. You actually have no access to the underlying VMs that power this thing, and when there is a failover, it automatically goes between the active and the standby. That’s really nice. Very handy. You don’t have to deal with that.
So, what do you have to do? Well, you have to pick the sizing, and resize if you want to, which you can do. And why would you do that? Well, I want more throughput, I want more network paths, I want more of those things. You can scale up. It’s also up to you to configure your on-premises VPN device to be highly available, right? Probably a pair of devices on-prem. You can have this resilient cloud thing, but if you have one node just sputtering along on premises, your connection is not gonna be very good.
At the same time, it’s up to you to configure active-active in Azure. That’s not the default; it’s active-standby. So if you actually want your VPNs to be fully available in the cloud, you’re gonna potentially do active-active. And then you can do the whole enchilada, which is active-active on both ends. That’s completely up to you; you have to set that sort of thing up, it’s not gonna happen on its own. Now, there are good reference architectures for this. This is not rocket science, you will not be the first person to do it, but it does involve work for you. This doesn’t happen automatically.
All right. So that was the whirlwind tour through Azure glue services, which, again, is really powerful stuff. Hopefully your takeaway there is, “Good, but I still have to do something.” So hopefully we can cut down your hours of work and you can still look busy and fill the 40 hours, but you still have work to do with these Azure services. It’s not magically replacing you as an architect.
So when you’re putting this all together, there are really three things I’d point out that you should consider. First of all, I can’t think of a scenario where you should be integrating with a single instance of anything. I wouldn’t be pinging a container’s IP address, I wouldn’t be pinging a server. The same goes for integrating things in the cloud as on premises. On premises, I think we’ve all learned you should not use the file adapter and talk to C:. Like, no, stop that. Use a file share. Use something that can survive the failure of an individual machine and has nodes at scale.
Same thing in the public cloud. You should never be connecting to anything that refers to a single instance; always hit the load-balanced URL. Even if there’s a single instance behind it, you should be hitting the load-balanced URL, because you’re gonna swap that thing out, you’re gonna replace things, and IPs are usually ephemeral in the public cloud. If I bounce a box, I can lose the IP. So I would always want to be using something that’s highly available there.
The second one is arguably the most important, maybe I should have listed it first: understanding which services actually fail over together. What I showed you earlier is a lot of services that have different failover strategies, right? Cosmos I might manually fail over so I don’t have to wait; storage, to declare a disaster, requires Microsoft; Service Bus is different again. All these things fail over differently. So if you just expect, “Hey, my service is behaving weirdly, it will all just magically work in another region,” no, there’s probably a bit of sophisticated orchestration you’re gonna have to perform to figure out, “If this thing falls down, then I should also move this,” even if that second thing isn’t failing outright, right? Because I don’t want to have a function that talks to storage 3,000 miles away.
I would want to fail over my function when I fail over my storage, for example. So you should be modeling all of your services and which of these things are coupled: if one of these has a problem, these other three have to come along with it. Don’t just assume individual pieces should fail over on their own. All of a sudden your app is gonna start performing awfully, right? You’re not gonna like it. So you should really be looking at all of your interconnected pieces.
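That coupled-failover modeling can be captured in a tiny sketch. The service and region names are illustrative, not real resources; the point is that one failed member moves the whole group.

```python
# Toy dependency model: services that are coupled and must move together.
GROUPS = {
    "orders": ["orders-function", "orders-storage", "orders-servicebus"],
}


def plan_failover(region_of, groups, failed_service, target_region):
    """Given one failed service, compute new region assignments that move
    every service in its group, so a function never ends up talking to
    storage in a region 3,000 miles away."""
    new_regions = dict(region_of)
    for members in groups.values():
        if failed_service in members:
            for service in members:
                new_regions[service] = target_region
    return new_regions


regions = {service: "westeurope" for service in GROUPS["orders"]}
after = plan_failover(regions, GROUPS, "orders-storage", "northeurope")
```

Even if your real failover runbook is manual, writing the groups down like this forces the conversation about which pieces must travel together.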
And then, finally, you should be regularly performing chaos testing and chaos engineering. If you’ve done this stuff before, this is the idea of purposely introducing chaos into your architecture: turn off services, unplug VMs. Netflix kind of pioneered this with their Chaos Monkey and Chaos Gorilla and things like that. Now, if I asked all of you: if I turned on a service within your company that would randomly shut off VMs, how long until your business collapsed? Would I get like an hour? For most people, yes, because our architectures are not typically set up to just randomly lose things. That’s what Netflix does in production with their service, because that helps them build a more hardened, resilient architecture.
So when you’re using these cloud integration services, you almost have no excuse not to purposely try to break things. And not break the service in Azure, break your architecture, right? You should be purposely turning off Logic Apps and seeing how your app handles it. Shut down storage. Disable your function. Start flooding Service Bus. You should be purposely doing this, because it’s easy to replicate prod. I can click a button and recreate all these services, so no excuses. “Well, I couldn’t get an environment to actually test this.” That’s not good enough anymore, right? I should be able to test everything and purposely break things.
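The chaos idea can be exercised even at the smallest scale: wrap a dependency call and make it fail, then verify your code degrades gracefully. This is a toy, Chaos Monkey kills infrastructure rather than in-process calls, but the testing discipline is the same. All names here are hypothetical.

```python
import random


class ChaosProxy:
    """Wrap a dependency call and randomly fail it, Chaos Monkey style."""

    def __init__(self, call, failure_rate=0.3, rng=random.random):
        self.call = call
        self.failure_rate = failure_rate
        self.rng = rng  # injectable for deterministic tests

    def __call__(self, *args, **kwargs):
        if self.rng() < self.failure_rate:
            raise ConnectionError("chaos: dependency unplugged")
        return self.call(*args, **kwargs)


def resilient_lookup(lookup, key, fallback="cached-default", attempts=3):
    """The system under test: retry a few times, then degrade gracefully."""
    for _ in range(attempts):
        try:
            return lookup(key)
        except ConnectionError:
            continue
    return fallback
```

Running your integration tests with a proxy like this in front of every external call is a cheap way to find the places where a single dependency outage takes the whole system down.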
Heck, at Pivotal, we’ve had customers turn off racks as part of their executive board meetings to prove that things work. You should be introducing chaos engineering to your customers: A, because it’s impressive and proves that these things will survive if you’ve architected correctly, and B, it’s just good hygiene now. I’ll add one note onto this. If your customer ever experiences an outage as a result of something in Azure, and you send a note out that says, “Sorry, the system was offline all day, Azure was offline,” no. Every customer-impacting incident is your fault, my fault. It is never Azure’s fault, it is never BizTalk’s fault, it’s never… Because we’ve architected it, right? I mean, a good carpenter doesn’t blame their tools. It’s got to be on us. So there is no outage that is ever Microsoft’s fault.
Now, things will fail in Microsoft all the time, and in Amazon and Google. But we’ve all architected our systems on top. So if our system fails and our customers or users are impacted, that’s 100% on us. So never send that note out, I challenge you. Never send a note out blaming the service or the tool for your system being offline. Blame yourself and use it as a really cool chaos engineering learning situation after you’ve gotten your job back.
So we covered three things. We looked at some core patterns; hopefully those helped us think about resilience, about retries, about redundancy and replication. We talked about those core Azure services. It’s really interesting to study some of these, because there are some really cool things built in and a lot of things you still have to configure. And as you piece this together, especially as an integration architect, think about how these things fail together. Think about how one thing failing will impact another, because there will always be outages. I promise you that; it’s the easiest promise I can make. There will always be Azure service outages, there will always be things in Salesforce or Google or Amazon, but how we build our systems is the only thing that really matters.
So a few resources. I just finished a Pluralsight course on how you do highly available systems in Microsoft Azure, which is… I’m not saying I did all this just for this talk, but I’ve spent the last six months screwing around with Azure services. So it’s a new four-hour course on all the different components in depth, so you can go do it too. And then I liberally stole from some really cool Microsoft guides on resilience. You should absolutely read those; they talk about a number of great patterns and how to configure some of these services. Microsoft has done an awesome job of documenting the HA considerations there.
And then, of course, find me and harass me on Twitter anytime you’d like. I appreciate the attention at 8:15, you all are troopers. So thank you for coming early this morning.