Tuesday 3.1.11 @ 2pm
Leveraging Cloud Computing
In the second webinar, Online Tech's CEO, Yan Ness, will explain how companies can leverage cloud computing for resource flexibility and to increase the pace of product development.
April: Let me welcome you to our second cloud computing webinar. Our CEO, Yan Ness, is going to continue the discussion on how we can leverage cloud computing for resource flexibility and to increase the pace of product development. We have a lot of real case data that we are excited to share with you. Without further ado, let me turn this over to Yan.
Yan: Hey everybody, thanks for joining us today. As I thought about how to describe leveraging cloud computing, I figured the best way for me to do this with real credibility was to describe how we use cloud computing in our own infrastructure, for both operations and administration. We have many clients that use this exact same stuff themselves, but obviously what they do is their confidential information. So rather than share details about what some of our clients are doing, I’m going to describe a little about our world: what was important to us, what we do, and the very specific ways we leverage cloud computing to do what we do even better.
I apologize in advance if it sounds like I am selling or bragging, that is not my intention. My intent is to share both from a business and technical standpoint, but I am passionate about what cloud computing can do for others, because I have experienced it myself as CEO. I have seen the return on investment analysis, signed off on the investments and I find it incredibly compelling.
Let me get started here. For those of you who are not completely familiar with who we are and what we do, I have a slide here that describes our world. We have lots of stuff to manage. We are a data center and cloud hosting provider. We currently have:
All of this has to be certified and audited by our auditors against the SAS 70, SSAE 16, SOC 2 and SOC 3 audit standards. We have employees in 5 locations in 2 countries. Along with our 2 data centers, we have a corporate headquarters here in Ann Arbor, where I am right now. We have our sales and marketing team here, sales people on the road, and a development team in Canada. We have people in a lot of different places, and obviously everything is 24x7 mission critical, like a lot of IT systems are these days. You used to be able to do a lot of maintenance work at night, and nowadays that is not the way it is. Everyone expects everything to be up and available all the time. So we have a lot of stuff to manage, and all of it is mission critical.
We have a whole variety of needs, and these are some of the critical systems we have to deploy. On the administrative side, we have a lot of components like Microsoft Exchange, SharePoint, file servers and domain controllers at different locations that are synchronizing and so forth. Our marketing department has a very robust, high-use, high-volume marketing site, with both production and development versions. We also load balance those websites, because nothing is more embarrassing than our marketing site going down. We cannot afford that any more than anyone else can. We have a tool that we call OTPortal that is our client and intranet portal. It is a Microsoft .NET application that our development team designs and builds. It has about a half-GB database running on a Microsoft SQL Server 2008 server, and there is probably 5 to 10 GB of additional content that gets uploaded into that infrastructure as well. We also have a tool called OTMobile, which gives our operations team mobile access to a lot of the information in OTPortal. That way, if they are out on the raised floor or out on the road, they are never out of reach of the mission critical data they need.
Our operations team, which manages the data centers and all of the infrastructure in them, has a whole host of servers to deploy and manage. We have a bandwidth management and billing system: a MySQL database that reads data from our routers, about 5 million records a month, quite robust and real time. We deliver patch management for thousands of servers. We have dozens of network security appliances, and as everybody knows, each manufacturer requires its own management console, so we have several management consoles for those devices as well. We deliver antivirus services to a whole host of servers; there is obviously a management console for that to update the definitions and so forth. We do lots of backups of dedicated servers, virtual servers and cloud hosts. We use R1Soft backup technology and Veeam backup technology, and each has its own management consoles and management servers. We also have SAN and NAS devices for our shared and dedicated storage, and a collection of management servers required to manage and maintain those. We use a whole host of monitoring, as you can imagine: everything from the uptime of these devices to their performance, disk space and environmental management. There is no shortage of services and servers that our operations team constantly has to use.
Anytime we roll out any kind of new managed service, there is a collection of servers we have to deploy. Again, there is no shortage of servers that they need. As I mentioned earlier, we are an always-up company. Our Exchange, SharePoint and file servers have to be up at all times, or we cannot service our clients at all hours, which is what we do. In fact, we have a lab in one of our data centers. In the back corner, we have a rack with a reproduction of our network: a couple of our routers, switches and firewall devices. We put everything in that lab and test it before it goes into production. There is nothing more embarrassing than rolling something out and having it not work. That is completely unacceptable. So like a lot of companies, we have a ton of needs, and as I mentioned earlier, every single one is extremely mission critical.
About a year ago, we went all in and decided to build our own private cloud. We moved our entire infrastructure to a private cloud, and we wanted to run it for quite some time before we were going to sell it. We really believe that you have to eat your own dog food. At that time we had 23 physical servers (18 Windows and 5 CentOS) and 4 database servers, with utilization around 10%. We migrated to our private cloud, which we built out of two Dell servers, each with a couple of quad-core CPUs and 48 GB of RAM. We also built an 8 TB SAN with room to grow, all in an HA configuration, basically N+1. We could run our entire cloud on either one of those two hosts. We used DRS and the other HA technology that comes with VMware, so that a failure in one host would cause all of the virtual servers on that host to keep running on the other host. We also needed to back that up. Putting that much power and emphasis on three boxes makes you a little nervous. So we built a second private cloud at our other data center, and we use continuous offsite backup to that data center, so we effectively have a warm-site DR in Ann Arbor at our second data center. We tested the DR and are getting about a 4 hour recovery time. I believe we now have 26 virtual servers in that cloud, and it is still growing.
As I mentioned, we needed to back up that cloud. We have the full private cloud up at our Mid-Michigan data center, and we replicate the data to a smaller private cloud at our other data center. That way, any virtual server we put on our production cloud in Mid-Michigan is automatically available, in a DR scenario, on the smaller private cloud sitting in our second data center. The diagram shown depicts some of that for you. At the end of the day, these are the five values we gained from moving to our cloud, and I will get into each of them more specifically.
Pace. We basically increased the pace of product development. That is probably the most important thing, and I will explain why I believe that in a minute. We drove down our total cost of ownership, improved our uptime, increased our ability to deal with performance issues (in really dramatic fashion), and for those who really care, it is a lot more green than it ever has been before. I will show you some calculations about how that works. And by green I mean both for the planet and for the checking account, because money is green.
At the end of the day, this all ends up saving time, and the reason why, as a CEO, you are really concerned about that more than anything else is because time really is money. In fact, I have a saying that time is truly the only unrenewable resource in the universe. The earth will regrow over millions of years from anything we might think we can do to it, but no matter what you do, you cannot go back in time. We use what I call a copy and paste capability in a cloud computing environment to save time, and there are two examples I am going to give of that. One thing we do before releases of tools like OTPortal or OTMobile is use this copy and paste power to duplicate our entire production environment. A handful of servers make up our production environment for OTPortal. We copy and paste that environment, just like you would a file, and we roll the new features out into the copy. Now we can test those new features against the full production environment as it sits right now, and we can do that as frequently and as often as we want. It used to take us a couple of weeks, because we had to deploy some servers and go through some restore processes. Now it is a matter of minutes to an hour for the operations team; it is a matter of an hour for us to have a meeting, describe what we want and get it done. So now we can do OTPortal releases about every two weeks, which means we get new time-saving features out there sooner. Again, that saves time, because the faster you are to market, the more you are setting the pace.
Another place we use this copy and paste power is design feasibility. Because we have a pretty tight release cycle, we cannot afford to spend a lot of time researching whether a certain design idea is feasible, whether or not it is going to work, or whether it will scale. We hit a lot of APIs for a lot of systems, such as backup systems and firewall systems. We need to understand, before we go into production, what is going to happen if we make 1,000 API calls in an hour to our firewall management system. Well, what we can do is copy and paste our firewall server to another virtual server, turn it over to our development team, and they can run a whole bunch of tests and say: “Look, when you deploy this, here is what is going to happen if you do 5,000 API calls in an hour against our current firewall production system.” That is what allows us to get better and newer features released faster with lower technology risk, which was basically unheard of before. There was always a technology feasibility risk involved in deploying new features for us, and we basically removed that.
Total Cost of Ownership. Total cost of ownership is something everybody talks about and brags about with each new release of technology. Total cost of ownership goes down, and yet IT budgets keep going up and the world needs more IT people; I have always wondered how that works. Well, with the cloud we really did experience it, and since many of you have a technical background, you are going to see it immediately. Consider our old total cost of ownership: we had 26 physical servers deployed over a two or three year period, which means we had a variety of CPUs, memory and disks. Some were SATA, some were SAS, and there were different memory packages. We had 52 power supplies, 26 backups, 26 anti-virus installations, 26 machines to network and patch, 4 Cisco network switches, 2 racks in the data center, 100+ network cables and half a dozen power strips. Believe me, power strips can fail. We have seen it happen. They are a simple device, but they can fail. We took hours and hours to upgrade disks (which I will talk about in a little while) and had lots of downtime to upgrade memory. We had to keep a whole collection of different memory and disks on site in case something happened. You know, if you deploy that many physical servers over a couple of years, it is very difficult for each one to have the exact same specifications and take the exact same spare parts.
Well, our cloud total cost of ownership is two servers, a SAN, 2 network switches, 2 power strips and a quarter of a rack. You do not have to do too much analysis to see pretty quickly that the old total cost of ownership is a lot more expensive than the cloud total cost of ownership. If you manage IT equipment at all, which many of you do, you can see the savings just by looking at this list. My estimate is at least 50% savings on hardware, and I am estimating 90% on management. The total cost of ownership of all of this has gone down dramatically.
Uptime Improved. We also improved uptime in a couple of different ways. First of all, we protected against server failure. Remember, with 26 physical servers you have 26 points of failure. If one server fails, anything that was running on it is down. In our cloud we have N+1 hosts: if I have a host failure, things automatically keep running on my other host. That is one of the beauties of our private cloud offering and of VMware’s technology. It means that every single virtual server we put in our cloud is automatically protected against server failure. To have that same effect with physical servers, I would have had to buy and deploy 26 additional servers, which, as you know, is another 52 power supplies and a couple hundred more network cables, not to mention the load balancing and replication necessary between those 52 servers. It was not even practical. We did not do it; most people do not.
Another place we improved uptime was with upgrades. Now, I know everybody says they do not count maintenance windows in their uptime, but you know something: if you are doing maintenance on a system and people cannot use it, their pace stops. So we really do everything we can to drive down maintenance. We want our systems to be maintainable without downtime. With our new private cloud we can upgrade the memory one host at a time, because our entire cloud can run in an N+1 environment. You bring down one host, upgrade the hardware, bring it up, move production to it, upgrade the other host, and you are up and running. When I add disks, it is a couple of mouse clicks and a reboot.
So instead of hours and hours of downtime, you can actually change your configuration and your specifications with, for all practical purposes, no downtime. One thing that hit us in the face, though, and I am sure those of you with private clouds can relate, is the whole concept of a SAN failure. You still have that one critical component, the SAN, which has to keep working or the whole cloud comes down. So to manage that, we bought an extremely redundant and robust SAN, put it in a high availability power configuration, and then protected it with DR, which I will explain further in a minute. We use redundant controllers, redundant network switches and RAID arrays, and the SANs these days have a lot of clever technology that knows when a drive is failing or has failed. They automatically order a new one and give us enough time to do something about it, so we can replace a failed RAID drive without any downtime. With the new DR options, as you are going to see in a minute, a SAN failure is not as catastrophic as it would have been a number of years ago.
One thing that became an interesting discussion internally is: what do we do about our databases? I was the one pushing for a hybrid cloud. I wanted our database on a separate physical server from our private cloud for performance reasons; I just could not imagine that it could run in the cloud reliably. Well, we looked at it: with a separate physical box you have the cost of deploying and backing it up, and then you have the single point of failure of that hardware. If the database server goes down, a lot of applications stop working. So the answer was to have a cluster of database servers, and now we are protected. But you look at the cost and complexity of clustering the database, you compare that to, for example, upgrading our entire cloud infrastructure and keeping our database there, and you end up concluding that the best solution is to keep that database in the cloud and make the cloud big enough to run it. That is what we did. At the end of the day, it allows us to use that copy and paste feature for our database, and it gives us protection against host failure: if one of the hosts fails, the database moves over to the other host and we are still up and running. So without having to deploy clustering, buy two servers and take on all the complexity that goes along with that, we decided to keep the database in our cloud and not go with the hybrid cloud environment. It turned out to be a fantastic decision, because we ran into performance issues and had to allocate more RAM to that database server, and we did it in a few mouse clicks, which you are going to see in a second, and it really kept up the pace.
Performance. I call this the “I need more power, Scotty” world. Scotty always seemed able to find it, and I see the same paradigm with our operations team. It is, “Hey, we need more juice. Our bandwidth database is getting a little slow. We need another GB of RAM. Can you do it, Scotty?” Well, in the old days, we had 23 servers in a variety of configurations, and the first question was: what kind of RAM does that server take? We had to schedule downtime, shut down the server, remove it from the rack, open it up, put in the resources, rerack it, boot it, and then make sure there was enough “umph” in it. Sometimes there was, and sometimes there was not and you had to go through all of those steps again. You had a two hour downtime on a good day. Well, on the “cloudy days” like today, we still schedule downtime, but now it is a five minute window: click the mouse a few times, add some disks, CPUs, etc., reboot the virtual server and look at the performance. That is it, done. Upgrading the entire cloud is sweet too, because if you find out you have to add another 12 GB to the whole cloud, you have two hosts. Move production over to host two, take host one down, add the hardware, bring it up, move production back to host one, take host two down, add the new hardware, and bring it up, with practically zero downtime. Dealing with performance issues stopped being a show stopper, which allowed us to keep the pace up. We did not have to spend five days of performance testing to make sure that a new database tool we were going to deploy, or a new feature we were going to add to OTPortal or to our firewall manager, could be delivered by the infrastructure we had. We always knew we were going to be ok.
Green. People are now beginning to wake up to the fact that power is money, and IT systems consume more power than ever before. So whether or not you are interested in saving the planet, everybody wants to save money. I did a back-of-the-napkin analysis, because I was curious to see how much power we saved. Power, by the way, is a huge cost driver for us, so all of our clients win if we can manage it better. This is basically the math of how it works. Each of our servers consumes about 300 watts of power, and they are up 100% of the time: 365 days, 24 hours. But that 300 watts? We actually need about 1.8 times that, because the server takes those 300 watts and throws off a ton of heat, so we have to spend about another 80% in energy to remove that heat and cool the room. You can look up online how many pounds of CO2 are released into the air for every kilowatt-hour of power you draw; it is usually listed by state, depending on your environmental laws and power mix. Here in Michigan it is 1.58 lbs CO2/kWh. Vermont is just a little under 1 lb CO2/kWh, and in some places it is higher, at 2 or 3 lbs CO2/kWh.
So if I take those 26 physical servers and the network, it comes out to about 200,000 lbs of CO2 per year. If I really wanted the redundancy and N+1 uptime I get with the cloud, it comes to about 400,000 lbs of CO2 per year. Contrast that with our cloud, which has a couple of physical servers, the network and the SAN, and call it a 35 server capacity (although I think we could get to 50 or 70). That is about 31,000 lbs of CO2 per year that we are putting out, versus the 400,000 lbs. The average 21 mpg car puts out about 6 to 6 1/2 metric tons of CO2 per year, so that difference is the equivalent of 300,000 miles being driven every single year by a car that gets 21 mpg. That is pretty compelling. So whether you want to save the planet or save money, you are doing both when you move to cloud computing.
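For those who want to check the napkin, the math above can be sketched in a few lines of Python. This is just my restatement of the figures quoted in the talk (300 W per server, a 1.8x multiplier for cooling, Michigan's 1.58 lbs CO2/kWh), not Online Tech's actual model; the function name is mine.

```python
# Back-of-the-napkin CO2 estimate using the figures from the talk.
WATTS_PER_SERVER = 300   # typical draw per physical server
COOLING_FACTOR = 1.8     # ~80% extra energy to remove the heat
LB_CO2_PER_KWH = 1.58    # Michigan's published emissions rate
HOURS_PER_YEAR = 24 * 365

def annual_lb_co2(n_servers):
    """Yearly pounds of CO2 for n always-on servers, cooling included."""
    kwh = n_servers * WATTS_PER_SERVER * COOLING_FACTOR * HOURS_PER_YEAR / 1000
    return kwh * LB_CO2_PER_KWH

print(round(annual_lb_co2(26)))  # the old 26-server footprint: ~194,000 lb/yr,
                                 # i.e. the "about 200,000 lbs" figure
print(round(annual_lb_co2(52)))  # doubled for N+1 redundancy: ~389,000 lb/yr,
                                 # i.e. the "about 400,000 lbs" figure
```

Running the same formula backwards against the quoted 31,000 lbs/yr for the cloud gives roughly a 1.2 kW IT load, which is plausible for two hosts, a SAN and a couple of switches.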
Disaster recovery is my favorite feature of all. I love money, and I love to save time; as I mentioned earlier, it is the only non-renewable resource. But at the end of the day, we sell sleep. We allow people to stop worrying about whether something is going to be up. And if there is anything that keeps me up at night, it is not our financials, not our strategy, it is our disaster recovery. It is, “Are we up?” That is what keeps me up all night long many nights. I want to know we are up, and the only way I can sleep is if I know that if something happens, we are going to be all right. What is beautiful about our cloud is that anytime anybody in our operations, development, administrative or sales group needs a new server, if they create it in our cloud, I know we are safe. With our cloud, both hosts are automatically backed up, and it does not take that much for us to manage, because we have two hosts and not 26 physical servers.
There is a bit of complexity in all of this, but that is the business we are in. We basically do continuous data protection of those hosts and the SAN to our other data center, where we have a warm cloud running. We can fail over to that other cloud in less than 4 hours. We test this a few times a year, because it is not that expensive or difficult for us to test. In fact, what happens is that Exchange takes about 3 hours and 15 minutes to get back up, because of the way Exchange works, and the other 25 servers take less than 45 minutes. Think about a medium-size company like ours having, for 20-plus servers, a warm DR site 50 miles away with a less-than-4-hour RTO, and being able to practice that twice a year. That is a five figure monthly contract with SunGard, and you do not get to practice it twice a year. This was just unheard of years ago. At the end of the day, this is the best sleeping pill I have ever had.
So, I’m going to close with a question we get from a lot of people: “What is your recommendation on cloud computing?” It sounds trite, but I really mean this: I tell them to start right now. Just go do it. Do not even think of buying another dedicated server. If you have 4, 5, or 10 physical servers and they are a year or two old, just migrate to the cloud. You will sleep better, move faster, save money and save the planet. It is so compelling that it is just something you should go do. It is like trying to live without a cell phone versus having a cell phone. One of the things we really wanted to make sure of as we did this is that we did not use anything we could not sell to our clients. We only use what we sell, because we think it is really, really important: if it is not good for us, how can it be good for anyone else? And if it is really good for everyone else, I want it before my competitors do. A few years ago, this level of pace, flexibility, protection and total cost of ownership would have been impossible. If you had come to me 3 years ago and said, “I have 26 servers, I want a 4 hour RTO 50 miles away, N+1 on all 26 servers, and the ability to do maintenance and upgrades with no downtime,” I would have said you need to wait 3 or 4 years, because there is no amount of money that is going to get you that today. So at the end of the day, we move faster and we do it for less.