Future of Cloud Computing

Many people have made predictions about the future of cloud computing in 2012 and beyond. Here I attempt to summarize and analyze some of these predictions. Many of them were made during the 9th Cloud Expo. Let us look at what some of the speakers had to say.

Lauren C. States (VP & CTO, Cloud Computing & Growth Initiatives, IBM) expected a rise in the number of service providers because infrastructure costs are no longer an issue. It is probably true that we will see a lot of new cloud providers, but with so many providers available and the cost of cloud instances dropping (it has already started to drop), I see quality of service becoming the key differentiator, as it is for mobile service providers today. She also mentioned that people will use any device, anywhere, to access data and services. We already see this happening as the number of mobile and tablet devices increases tremendously. This also increases reliance on cloud-based services and storage, which require ease and speed of access. That in turn will require upgrades to existing network infrastructure so that it does not become a bottleneck. She also raised concerns about the security of cloud environments, and I agree that with so many people using cloud services, security will need to be strengthened. The industry will make changes and cope with the increased demand. I don’t see any major threats, but I do see an increase in hybrid cloud environments to improve redundancy in enterprises. Lastly, she mentioned the convergence of social, mobile and cloud environments, which I do see happening because of the interdependencies between these areas. She also mentioned that providers will use this convergence model to do real-time analytics on big data. I do agree there is a gap in effective real-time analysis of data in cloud computing environments, but real-time analysis is difficult to do in traditional environments unless effective algorithms are written that handle all the variability of the data in these circumstances. So although I see progress being made in real-time data analytics in hybrid cloud models, I don’t see it happening in a big way in 2012.

Peter Coffee (VP and Head of Platform Research at Salesforce.com) expects a surge in data coming from social networks, with more and more companies providing social media analytics. I do see this happening a lot in the future, as social networking tools (like Twitter and Yammer) become part of our lives inside and outside organizations. He also points to accelerating tablet use, which will definitely lead to more frequent use of social and cloud environments. Peter also mentioned file-based models of collaboration, which will make file location irrelevant. We already see this happening in almost all cloud storage environments, and it will go further toward a more secure cloud-based file system that can be used consistently across all providers, like traditional file systems. Because of data privacy and sovereignty issues, I do see hybrid cloud models becoming more economical and being promoted further.

Krishnan Subramanian (Industry Analyst covering Cloud Computing & Open Source) sees Platform as a Service (PaaS) becoming the future of cloud computing in 2012 and beyond. We are already seeing that happen, and Gartner recently discussed the idea of iPaaS and provided a reference model for it (I will discuss it in a separate blog post). He also sees federated clouds taking off and more cloud brokerage services being developed. I do see both of these happening, to help enterprise cloud computing customers navigate the landscape better. Like others, he also sees a convergence of cloud, mobile and social apps. Lastly, he is hoping for greater cloud computing adoption in Asia-Pacific, especially in India and China. This is an obvious one and will happen given the large volume of IT services being provided from this region these days.

Brian Gracely (Director of Solutions at EMC) also sees PaaS being promoted quite a lot in the future. He predicts that any PaaS provider that can combine Java and .NET into an integrated PaaS platform, with options for modern web languages, will take a significant lead with developers. Microsoft has already started doing that with its Windows Azure Platform by allowing developers to use multiple programming languages (like Java and PHP) on it, and I expect it to strengthen its position further in 2012. He also mentions that security will replace availability as the main issue for cloud environments, which is quite true. I also agree that both good and bad press about cloud computing will only increase people’s awareness of it.

Last but not least, I will discuss comments about the future of cloud computing from two of my friends on social networks: William Toll and Scott Stewart. William Toll (VP, Marketing at Yottaa, Inc.) predicts that infrastructure providers will offer pre-packaged private cloud deployment platforms. While I do see this happening in the future, I would hope that it extends to PaaS providers like Microsoft. Microsoft should be able to offer a customized, pre-packaged Windows Azure Appliance for a variety of environments in organizations running private clouds. Like the others mentioned before, he also predicts that data analysis and mining of big data will increase in the future. He made an interesting point here: data integration platforms need to expand their capability to pull data from across environments, providing the next generation of information infrastructure in which companies can leverage unstructured text and social media data seamlessly and run analytics on the integrated data. I have worked in the data integration space for the last decade, and I agree that work still needs to be done on using data semantics to achieve these objectives. He agrees that 2012 will see PaaS driving cloud computing. He predicts that Independent Software Vendors will transition apps directly to PaaS, bypassing infrastructure. I agree this will happen, and it will happen most aggressively in conjunction with pre-packaged PaaS offerings that meet specific customer requirements. Lastly, he mentions that mobile environments will drive data portability and synchronization projects, and that all the data will be managed in cloud storage. I agree with this observation completely, as it will be a result of the convergence of social media with the mobile environments provided by tablets, as others have observed.

Scott Stewart (Research Director at Longhaus Pty Ltd) suggests that as cloud service management, cloud service assurance and cloud security become key in the future, the whole of IT service management will need to be provided as a service. As cloudification settles and cloud services become common in enterprises, people will concentrate on refining internal and external business processes to take advantage of the cloud efficiently. I agree with his next observation that cloud-based smartphone wallets will be used more often and that information personalization will start to be key for commerce. This will increase the demand for interactive mobile PaaS environments that use cameras, touch and gestures. It will be another application that makes PaaS drive cloud computing in the future. Lastly, one of the interesting points he mentioned is that Desktop as a Service will become more popular. This will be true for enterprises that run all their apps from the cloud, as it gives them an opportunity to run the complete standard operating environment from the cloud. It will also require regular maintenance, backups and redundancy integrated into the cloud platform that provides the solution.

These are some of the key discussion points in the article; there are many more that I have not covered. So it will be a great read for anyone involved in cloud computing. In summary, the key areas driving the future of cloud computing are: Platform as a Service, Mobile Environments, Social Media, Data Integration, Data Analytics, Real-Time Data, Hybrid Cloud, IT as a Service, Desktop as a Service.

——
Dr. Amandeep Sidhu

The comments in this blog are my own and do not represent the companies I have worked for or the companies mentioned in the blog.

Comments on Which Clouds Play Nice?

On October 19, 2011, IT News released a guide to the top 20 Software as a Service (SaaS) providers available in the Australian market (http://www.itnews.com.au/News/277170,revealed-which-clouds-play-nice.aspx). In this blog I will briefly discuss this guide and what it means for companies in Australia. First of all, thanks to the IT News team: this is a very nice and comprehensive report on various SaaS providers in Australia. It is a must-read if your company, irrespective of its size, is considering providing cloud services in Australia.

The guide analyzed these SaaS providers according to four criteria. Firstly, it looked at how these services handle data. I agree this is one of the most important criteria: how quickly data can be uploaded to and downloaded from a cloud service is very important to a company. Secondly, it looked at how these SaaS providers integrate with the various other cloud services a company has. It also looked at whether these providers could offer single sign-on and authentication for all your cloud services. One thing it didn’t discuss is how the cloud services can be integrated with the on-premises resources and authentication that a company has already invested in. This is a very important aspect that all companies should look at before choosing a cloud provider. It not only provides seamless integration of resources but also eases data sharing and avoids data duplication. Microsoft integrates its platforms and services on and off the cloud and provides single authentication. How does Microsoft do that? To avoid duplication of effort, I refer you to the MSDN blog by Planky on “Single-sign-on between on-premise apps, Windows Azure apps and Office 365 services”. Thirdly, the report looked at whether any third-party support is available to integrate two different cloud services from two different providers. Not many providers like to do that, both because they want to promote their own product and because adding support for other providers makes the service technically more complicated to maintain. The report does discuss some third-party vendors that provide integration for some SaaS providers. The lack of this support also leads to the fourth and final criterion considered by the report: are the required APIs available to write code and integrate it with the SaaS offering? Microsoft provides a lot of API support, and programming can be done easily with Visual Studio LightSwitch to deploy applications to Azure. Azure applications can easily be integrated with other Microsoft cloud services and Microsoft on-premises applications, as discussed in Planky’s blog. The two other resources you will need for third-party support and APIs are “Windows Azure Prescriptive Guidance” and “Deploy a LightSwitch Application to Azure”. Also, if your company needs a CRM solution that runs from the cloud and integrates seamlessly with everything else, Microsoft Dynamics CRM 2011 has recently been integrated with the Windows Azure platform.

The report grouped the applications and services into five categories (CRM, Office Communications, Collaboration and Project Management, Finance, and Human Resources) and then compared them using an extensive set of criteria, beyond the four discussed above. The report then showed how various applications can be used for a complete workflow in an organization.

So here are my recommendations:

  1. First read this report if you are planning to develop or deploy cloud services in Australia.
  2. If you already have on-premises infrastructure using Microsoft software, or you expect your clients to use Microsoft software, stick with Microsoft as a cloud services provider, as they provide seamless integration with everything on-premises or in the cloud.
  3. Do talk to Microsoft staff in Sydney when planning your cloud implementation in Australia. They are very helpful and provide a lot of resources to support your planning, testing and deployment.

——
Dr. Amandeep Sidhu

Azure versus EC2

I have often been asked why I prefer Microsoft Windows Azure to Amazon EC2 as a cloud solution. In this blog I will discuss the reasons. When I started looking for a cloud solution for computing on research data sets, I looked at all the cloud solutions available and quickly shortlisted two vendors: Amazon and Microsoft. Then I went on to decide what to run on these environments. I decided that sequence assembly of a genome would be a great HPC job for comparing the performance of these environments. I have previous expertise in bioinformatics, so it was a straightforward choice. Also, sequence assembly had rarely been done completely in the cloud before, so it had a bit of novelty.

I used the open source Ray sequence assembler for testing. The assembler is written for Linux but has also been ported to Windows by Applied Maths. I used both versions in this test. The assembler needed two input files of 4 GB each. I ran the program with default parameters because the intention here was to compare the run time of the program. I used extra large instances in both environments.

Initially I ran the Windows version of the program on Azure and the Linux version on EC2. Setting up the application takes a lot of time on an EC2 Linux instance, as you need to install all the compilers, libraries, etc. before building the program from source. In comparison, the Windows version of the program only requires installation of the Visual C++ libraries.

Running the program on Azure took 6 hours to complete. When I ran the Linux version on Amazon, I expected similar run times, as I hadn’t changed anything and the extra large instance had similar resources. Instead, to my surprise, it took 21 hours to complete. I double-checked everything and ran it again but got exactly the same run time. I then ran the Windows version of the program on an EC2 extra large instance, but the run time did not change.

One thing is clear: the default instances of Azure are tuned for HPC, whereas those of EC2 are not tuned at all. So when comparing apples to apples, Azure performs much better than EC2. You can request HPC-tuned extra large instances on EC2, but then you have to pay more, so you are comparing apples to oranges. The Beijing Genomics Institute also did some bioinformatics analysis and found EC2 to be slow.

I also wanted to compare Azure’s performance against the National Computational Infrastructure (NCI) in Canberra using a similar CPU and RAM allocation. I was expecting it to outperform Azure because of how this supercomputing infrastructure is configured. As expected, the program finished faster there, in 4.5 hours. This is still great news for Azure once you factor in the cost and effort that go into managing and maintaining these supercomputing clusters. Azure worker nodes are easier to provision and maintain.
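
The run times reported in this post can be put side by side with a few lines of Python. This is only a sketch of the arithmetic: the figures (6 hours on Azure, 21 hours on EC2, 4.5 hours on the NCI cluster) come from the text above, while the dictionary and function names are my own illustration.

```python
# Wall-clock run times in hours, as reported in this post.
RUN_TIME_HOURS = {
    "Azure (extra large)": 6.0,
    "EC2 (extra large)": 21.0,
    "NCI cluster": 4.5,
}


def slowdown(platform: str, baseline: str) -> float:
    """Return how many times longer `platform` took than `baseline`."""
    return RUN_TIME_HOURS[platform] / RUN_TIME_HOURS[baseline]


if __name__ == "__main__":
    # Compare every platform against the fastest run (the NCI cluster).
    for name, hours in RUN_TIME_HOURS.items():
        ratio = slowdown(name, "NCI cluster")
        print(f"{name}: {hours} h, {ratio:.2f}x the NCI run time")
```

By this measure EC2 took 3.5 times as long as Azure, and Azure took about a third longer than the NCI cluster.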

The last thing I looked at for comparison was the kind of support provided by the vendors. This is very important, as research applications are normally novel and there is a good possibility that they have never been run in a cloud environment before. Amazon, being IaaS, provides infrastructure, and the user is responsible for installing dependencies and configuring the application and the data. Amazon provides no active support, although there is community support. Azure is PaaS, and the user is only responsible for the application and data, as it should be. The Virtual Machine role in Azure also provides a pseudo-IaaS where the user can create a custom VM according to requirements and run it every time. Microsoft also provided a lot of support to make sure novel applications run on Azure.

So I prefer Azure as a Cloud Computing environment over EC2 as it performs better, its default instances are HPC tuned, and it has better support mechanisms provided by Microsoft.

——
Dr. Amandeep Sidhu

I have over 200 favorite tweets. Next I will comment on and discuss these tweets in my blogs.

HPC+Cloud for Research Data

Currently, for data-intensive science, research groups use HPC clusters that are individually set up and maintained. Researchers also use their local and national eResearch resources such as iVEC, ARCS GridCompute, NCI infrastructure, etc. Typically, research data flows arrive in predicted (and sometimes unpredicted) bursts and don’t provide the critical mass required for optimal utilization of these large HPC clusters. Therefore a pattern of peaks and troughs is seen in the utilization of compute and storage for data-intensive projects. A combination of on-premises HPC and cloud (HPC+Cloud) is best suited to scaling compute and storage resources based on demand.

Also, research software is sometimes not available on HPC clusters (it is either specialized or not widely used by the community). In these circumstances, researchers either run their computations on their desktop computer or purchase a separate standalone workstation for the purpose. This presents a significant barrier: researchers cannot scale their computational and storage capabilities, which leads to a whole range of inefficiencies and lost opportunities. An HPC+Cloud solution can bridge this wide gap by using virtual machines (VMs) in the cloud, configured according to research requirements. This extends the capabilities of the researcher’s workstation to HPC and cloud and provides the much-needed scalability. Further investigation still needs to be done on licensing options for cloud computing environments.

Some of the high-performance and data-intensive services required for research projects can be moved to the cloud. Some applications are not suited to the cloud, as the data transfers required would incur a huge cost. Some applications have also not yet been developed to scale in a cloud environment. Equally, it is not ideal to have a large HPC solution on premises for these applications, as it is very difficult to utilize optimally and to maintain. As a result, an HPC+Cloud solution is ideal: a small HPC cluster runs locally, and larger jobs can either be sent to the cloud or split between local and cloud nodes.
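
The local-first idea in the paragraph above can be sketched as a tiny placement rule. This is a hypothetical illustration, not a real scheduler API: `Placement`, `place_job`, and the notion of counting free cores are assumptions of mine, and a real setup would also have to weigh data transfer cost, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """How many cores of a job run locally and how many in the cloud."""
    local_cores: int
    cloud_cores: int


def place_job(cores_needed: int, local_free_cores: int) -> Placement:
    """Place a job under a simple local-first policy (hypothetical).

    Use free local HPC cores first and burst only the remainder to
    the cloud. A job that fits locally stays local; with no free local
    cores the whole job goes to the cloud; otherwise it is split.
    """
    if cores_needed <= 0:
        raise ValueError("cores_needed must be positive")
    if local_free_cores < 0:
        raise ValueError("local_free_cores cannot be negative")
    local = min(cores_needed, local_free_cores)
    return Placement(local_cores=local, cloud_cores=cores_needed - local)


if __name__ == "__main__":
    print(place_job(16, 64))   # fits on the local cluster
    print(place_job(128, 64))  # split between local and cloud nodes
    print(place_job(32, 0))    # sent entirely to the cloud
```

The point of the sketch is only that the local cluster can stay small: it absorbs the troughs, and the cloud absorbs whatever part of a peak does not fit.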

Both IaaS and PaaS models are useful for handling research data in a public cloud computing environment. IaaS gives more flexibility in customizing the operating environment and software to the researcher’s needs, but it also adds maintenance and support tasks for the researchers. PaaS is better, as researchers only need to install the application and manage the data, and the vendor manages everything else.

——
Dr. Amandeep Sidhu

Next in the series will be a comparison of Amazon and Azure for genome sequencing.

Cloud Computing

Cloud computing is not a new concept. It goes back to the 1960s, when John McCarthy said, “computation may someday be organized as a public utility”. John McCarthy, who passed away recently on October 24, will always be known for his contributions to Artificial Intelligence. Cloud computing is a mechanism for delivering computing (more broadly, information technology) as a service.

The scale of data generated recently by various instruments (as mentioned in my last post) makes the case for delivering computing as a service. Traditional methods of providing computing for processing data are no longer efficient, and scaling them to cope with demand is not economical. Progress in hardware virtualization, power systems, cooling mechanisms and network technologies has made the modern cloud computing paradigm possible. I am fascinated by recent developments in “containerized data centers” that contain thousands of nodes per container, and by how they are revolutionizing computing as a whole. These containers are so efficiently designed that they need only three inputs (power, cooling and networking) and can be managed completely remotely. Because the global data centers of cloud providers now hold many of these containers, the cost of a virtual PC relative to a physical PC of the same configuration is extremely low. Therefore, for high-throughput and data-intensive applications, cloud computing actually makes sense.

Various cloud computing models are currently being adopted in enterprises: public, private, and hybrid clouds. Of these, hybrid clouds, in which sensitive data is kept private and the rest can be exposed, are the future. These models also integrate better with existing High-Performance Computing (HPC) infrastructures. Using hybrid clouds as extensions of on-premises HPC allows compute to be scaled quickly to handle planned or unplanned processing bursts in data-intensive applications.

Various cloud computing vendors provide Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) in the public cloud. For data-intensive applications PaaS is the most suitable, as the vendor manages everything except the data and the application.

Different vendors provide different cloud computing services. The list here covers major vendors only and is in no way complete. Amazon provides IaaS, so users have a choice of OS, databases, applications, etc., but have to worry about system integration, maintenance, etc. Also, its computing instances are tailored towards hosting web services, and there are extra costs for HPC instances. Windows Azure provides PaaS, so the user is only responsible for the applications that need to be deployed and the data processed by these applications. Its instances are tailored to manage both web services and HPC, and there is no extra cost for HPC. Google, Apple, and Salesforce provide SaaS, which is managed by the vendor as an end-to-end solution. As a result, they are generally more suitable for small and medium-sized enterprises.

Therefore, a small footprint of on-premises HPC integrated with a hybrid cloud (PaaS in the public cloud) is currently the ideal solution for managing data-intensive applications. Among current vendor offerings, Windows HPC+Azure performs exceptionally well as a solution.

——
Dr. Amandeep Sidhu

Next in the series will be a discussion on HPC and Research Clouds.

It’s all about Data

I see people talking about new supercomputers and new high performance computing (HPC) environments, and I am even involved in managing an HPC environment myself. We talk about all this without realizing that it is all about effectively managing the data that machines generate these days. If we can manage data effectively, online or offline, we won’t need these huge server and storage farms, and we will reduce our carbon footprint as a result.

How this can be done is what I am trying to talk about in this blog. I have mostly worked with biomedical datasets over the last decade, so I will take examples from those, but they can easily be applied to astronomy, geosciences, etc., which generate a lot of data on a daily basis.

Let’s talk about what I use on a daily basis, and then compound the problem by adding thousands of users like me. I have a high-end Windows desktop for mundane tasks, and I use a Windows HPC server and an Ubuntu server for high-end tasks. I usually process at least a whole or part of a genome, as required by my fellow researchers or just to try some new bioinformatics workflow and tools. I have huge amounts of storage attached to all the machines I use daily, but if I still fall short of storage I have access to a national data fabric and to the state’s petascale data store. BTW, the petascale data store is currently being upgraded to handle the huge amounts of data the SKA is going to generate, because of Australia’s bid.

At home I have a complete Mac environment with one lone PC. I have a Mac OS X Lion Server controlling everything and two MacBook Pros. I have about 3 TB of storage between these machines, and I back up everything to a 3 TB NAS on a daily basis. Given that I normally take a bit (stretching the semantics of “bit” here) of my work home, and that I use my NAS as a mini Linux server, I am running out of space. Since storage is so cheap these days, I am in the process of upgrading my NAS to 9 TB.

Now let’s talk about what happens when I am mobile. I normally access my data through the national data fabric and iCloud. These cloud storage solutions normally give me around 200 GB of storage to take some of my data with me. I am always on the lookout for bigger storage solutions in the cloud to give me mobility.

Let’s talk about cloud computing and the options it offers for storing and processing large amounts of data. Unfortunately for me, Apple is not big on high-throughput data applications in the cloud, but I really like the advances Microsoft has made with Windows Azure. The slices of compute and storage they offer are also competitive with Amazon’s. And finally, it’s Microsoft, so it will integrate seamlessly with everything, although we will have a few bugs and quirks.

More about our use of Azure at Curtin for processing sequencing data:

http://www.itnews.com.au/News/271076,curtin-trials-dna-sequencing-in-azure.aspx

Somebody asked me a while back, while preparing a grant budget: what is the best hardware for running high-throughput whole genome sequencing? I said a budget laptop, and to put at least 50K aside for it. He looked puzzled, but what I meant was that we need the laptop only to connect to our high-end server, on which the compute will run, with the data stored on the corporate SAN. We needed the 50K to add storage to the university’s corporate SAN.

I was also recently asked whether bioinformatics in the cloud was a good idea. I said yes and no, and that I would have to get back to them on that. Basically, we can use tablet devices running Windows, iOS or Android to connect to the cloud. The data can be stored in raw or processed form in the cloud. There are numerous apps for seamlessly accessing data from the cloud. Even the compute can be done remotely from thin clients like iPads, using or building apps that run jobs on big HPC servers and clusters. The major issue with the cloud is the network bottleneck, so the data and compute need to be in one place or as close together as possible. Whether the cloud in its current state can handle storing and processing tera- or petascale data is another question.
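
To get a feel for the network bottleneck, here is a rough back-of-the-envelope estimate in Python. It is a sketch under stated assumptions: `transfer_hours` is a name I made up, the units are decimal (1 GB = 8000 megabits), the link is assumed to sustain its full rated speed (real links rarely do), and the example sizes and speeds are illustrative rather than measured.

```python
def transfer_hours(size_gb: float, bandwidth_mbps: float) -> float:
    """Estimate the hours needed to move `size_gb` gigabytes over a link.

    Assumes decimal units (1 GB = 8000 megabits) and a link that
    sustains `bandwidth_mbps` megabits per second for the whole transfer.
    """
    if size_gb < 0:
        raise ValueError("size_gb cannot be negative")
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    seconds = size_gb * 8000 / bandwidth_mbps
    return seconds / 3600


if __name__ == "__main__":
    # Illustrative figures only: 1 TB of data over two link speeds.
    for mbps in (100, 1000):
        hours = transfer_hours(1000, mbps)
        print(f"1 TB at {mbps} Mbps: about {hours:.1f} hours")
```

Under these assumptions, moving 1 TB over a 100 Mbps link takes a little over 22 hours, which is why the data and the compute need to sit close together.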

So I am working towards an HPC+Cloud hybrid model in which all the apps you need are in the cloud, including the apps for storing data and computing on it with HPC. Researchers like me use thin clients like iPhones and iPads to use these apps.

Next we will see whether such a framework is possible in the current environment and which pieces of the puzzle are missing. Microsoft has done a lot of work in developing Microsoft Biology Tools (MBT) for HPC Server 2008, and they have NCBI BLAST on Azure. Running BLAST is no easy task: it is very resource intensive and can consume a huge amount of resources for big sequences. One of the tools I like best in MBT is the Excel add-in for bioinformatics, which appears as a ribbon. It allows us to do sequence alignment and BLAST on sequences. With Office 365 and BLAST already on Azure, this app is the easiest for Microsoft to move to the cloud. When we can put one of the most computationally intensive tasks, sequence assembly, on the cloud, what stops us from putting the rest of the tasks there and running everything from the cloud? Nothing, right? Hold on to that thought and let me explain why we have been very slow in putting all the highly data-intensive tasks on the cloud.

I quote Fran Berman here, VP of Research at Rensselaer Polytechnic Institute and co-chair of the US Blue Ribbon Task Force on Sustainable Digital Preservation and Access. The task force released its report, Sustainable Economics for a Digital Planet: Ensuring Long Term Access to Digital Information. In the report Fran Berman said, “The data deluge is here. Ensuring that our most valuable information is available both today and tomorrow is not just a matter of finding funds, it’s about creating a ‘data economy’ in which those who care, those who will pay, and those who preserve are working in coordination”. The report is available at: Data Deluge Report

Is it true? Yes, it is very much a reality. I couldn’t think of a better place than SDSC (San Diego Supercomputer Center) to make the final report available. They crunch more data in a day than all of us can collectively generate in a week. They manage many massive data sets, but the one worth mentioning here from a biological perspective is the Protein Data Bank (PDB). PDB contains structural information about all the known proteins in the world and links to more information about these proteins, such as their structure, classifications, and functions, from other massive databases hosted elsewhere: sequence information from UniProt at the European Bioinformatics Institute (EBI), protein classification information from SCOP hosted at Cambridge, and function information from publications in the most massive data set of all, MEDLINE.

I read somewhere in a recent Nature article Janet Thornton, Director of EBI, saying: “we cannot manage to store all the [data] we generate due to lack of funding”. Lack of funding? If you haven’t realized it by now, they are the biggest bioinformatics institute in Europe, and one thing they don’t lack is funding. She meant that they lack storage for the massive amounts of data EBI generates, and they literally need to buy more every time a new project starts. They need cooperation from their collaborators worldwide, with each sharing the others’ storage burden. The Worldwide Protein Data Bank (WWPDB) was recently created through global collaboration as a massive data integration task, and that can be done for every database. UniProt, EBI’s own database, was recently created in the same fashion. The same can be applied to every other major data source: collaborate and share the burden.

I was the one who proposed the idea of integrating protein data sources in 2004 to bring some standardization. It went from a semantic map for better interoperation of heterogeneous data sources to the world’s first full-blown protein ontology (more at: Protein Ontology).

Another example is GenBank, which recently announced that it will no longer accept submissions of sequence reads generated by next generation sequencing (NGS) methods but will accept submissions of genomes assembled from these reads. This is essentially to save space: storing the eventual output of assembly but not the intermediates. NGS these days generates gigabytes and terabytes of data that universities and institutions are not at all prepared to manage.

I was also told that when the SKA bid is decided in 2012, the first one-third of the project will generate more astronomy data than we have ever generated.

So do we go for massive state and national server and storage facilities that everyone uses, or do we put everything in the cloud and let the big IT companies do what they do best: manage infrastructure and storage? I prefer a hybrid HPC+Cloud model in which all these stakeholders are involved in a collaborative network and we collectively share the burden of big data. Either way, we need to centralize IT infrastructure to manage data and information better and to save on electricity and maintenance costs, which put together are more than the cost of the equipment.

———–
Dr. Amandeep S. Sidhu

This blog is the first in the series; the next will be on the various cloud services provided by companies. I hope to cover Microsoft, Amazon, VMware, EMC, NetApp, Dell, HP, IBM, Google and Apple in the discussion.
