Vishal Shah - #V's Blog

Life is really simple, but we insist on making it complicated.
Confucius
    • #quotes
  • 2 weeks ago

SPDY - HTTP evolution, the internet & more

Introduction

As you may know, HTTP & TCP are the protocols of the web. Everything builds upon them. I wanted to write something on SPDY - a Google-proposed next-generation application protocol that tries to fix some of HTTP’s limitations while being extremely clever and staying backward-compatible. It tries to achieve that by building upon SSL to support “multiple concurrent, interleaved streams over a single TCP connection”. Thanks to this, you get the added security benefits of SSL. You do pay a latency penalty for SSL, but in the grand scheme of things it doesn’t seem like a lot, because most of the time only a single connection per host is required to retrieve multiple resources, and that connection can even support features like “server push” along the way.

SPDY Stack

Although SPDY sets out to replace HTTP, it is really a sort of HTTP 2.0. It only changes the HTTP components or features that are in need of a makeover. Starting from scratch would be way too much work.

How does SPDY enhance HTTP?

  • Persistent connections & concurrent HTTP requests - SPDY supports unlimited bidirectional, concurrent streams (as opposed to FIFO in HTTP pipelining) over a single TCP connection. Current browsers open a limited number of connections per domain which is limiting for many “chatty” applications or those serving lots of resources. Also, all connections with SPDY are persistent connections!
  • Server push - can push data back to clients without clients explicitly requesting it
  • Compression - Request and response headers are compressed. Also content is always compressed as opposed to optional compression in HTTP. This reduces the latency of transferring data back & forth between client and server. Applications that use lots of cookies and custom headers benefit from this especially.
  • Request priorities - the client can request as many items as it wants from the server, and assign a priority to each request
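
To make the multiplexing and priority ideas above a bit more concrete, here is a toy sketch in Python. It is not the SPDY wire format: the `Frame` tuple, the `interleave` and `reassemble` functions and the chunk size are all made up for illustration. It only shows how several responses can be split into frames and interleaved over one connection, with more urgent streams served first.

```python
import heapq
from collections import namedtuple

# Hypothetical frame: which stream it belongs to and a chunk of payload.
Frame = namedtuple("Frame", ["stream_id", "data"])

def interleave(streams, chunk_size=4):
    """Yield frames from several streams over one logical connection.

    `streams` maps stream_id -> (priority, payload); a lower priority
    number means more urgent (an assumption of this sketch). Streams
    that share a priority take turns, so none of them blocks the others
    the way a FIFO pipeline would.
    """
    heap = []
    for order, (stream_id, (priority, payload)) in enumerate(streams.items()):
        heapq.heappush(heap, (priority, order, stream_id, payload))
    turn = len(heap)
    while heap:
        priority, _, stream_id, payload = heapq.heappop(heap)
        yield Frame(stream_id, payload[:chunk_size])
        rest = payload[chunk_size:]
        if rest:
            # Re-queue behind same-priority streams so they alternate.
            heapq.heappush(heap, (priority, turn, stream_id, rest))
            turn += 1

def reassemble(frames):
    """Receiver side: rebuild each stream's payload from its frames."""
    out = {}
    for frame in frames:
        out[frame.stream_id] = out.get(frame.stream_id, "") + frame.data
    return out
```

With three streams where one has a more urgent priority, the urgent one goes out first and the other two alternate frame by frame; the receiver still ends up with each payload intact.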

How does SPDY benefit the user?

  • Reduces bandwidth needs
  • Relying on SSL provides better security
  • Richer client experiences thanks to features like bi-directional streams, server push, …

What’s not awesome about SPDY from an end-user standpoint?

  • SPDY’s immediate benefits are more visible for users with slow network connections - modem, slow DSL connections.
  • As more and more people have faster network connections, the benefits unfortunately diminish. Here’s a test - compare a large page load side-by-side in a SPDY enabled browser like Chrome on a SPDY enabled site like google.com vs a browser without SPDY support. Unfortunately, you won’t notice a difference. Most servers also don’t take advantage of server push, so the benefits for an end user with more bandwidth are not significant.
  • SPDY does do good to the larger ecosystem - proxies, routers, switches etc. - as ideally less data is sent through them, but the overall % gain is unclear. My guess is not a lot.
  • Hence the question arises - is SPDY going to change how we browse, and how web pages evolve and get developed? Maybe. Server push is by far its “sexiest” feature.

SPDY cons for developers?

  • SPDY web servers are more complicated than traditional HTTP servers. Troubleshooting issues also gets trickier. With power comes responsibility.

General Thoughts

I love SPDY’s motivation & goals. Back in school, I loved networking. I used to take graduate networking classes while still an undergraduate, to dig deeper and to have access to the professors and, more importantly, their research :) I was quite fascinated by the magic of packet switching & all the multiplexing, and by the switches, proxies and routers that collectively work together to form the internet we know.

HTTP was designed more than a decade ago, and it has practically gotten one big update in 1.1 - range requests, pipelining, chunked transfer encoding, etc. There have been a ton of proposed extensions and very few have gotten love. And for good reasons - you see, the success of the world wide web (WWW) has caused its own evolution to slow down. Really? Yes.

The world wide web is a great place to be. Anybody with a web server can join the network and start distributing content. It started as a research project at CERN, which itself built upon the larger “internet”, which too started as a research project (ARPANET) on packet switched networks in the late 1960s, so there were very few authorities and policies in place.

This distributed nature of the WWW meant that, drop by drop, servers started pouring in, especially as commercial uses took off, but don’t forget the average Joes who started realizing the potential and started putting servers & content up there. This is one of the biggest cons of being distributed - if you are not in control of all the nodes in a distributed environment, you are in some sort of trouble. For example, you can not ask all the nodes to update their servers or support a new protocol version.

Also, different corporations own different parts of the network from physical to software to applications. This is one of the internet’s greatest strengths, in how various parties contribute and work together to build a seemingly cohesive whole. Without this, the internet, or something even remotely close to the internet’s scale, simply would not be possible. A single owner would only get you so far. At some point in time, you have to open up the platform to incorporate others.

Because of this nature and sheer scale, protocol level changes are very hard to support. You really have to evangelize, promote and hope folks will listen and update. In some cases, you can just apply sheer pressure; in some cases you simply can not. For example, people use different browsers and are on different versions of the browsers. Being compatible with all of them is a big, big challenge. Everything has to be backward-compatible to the extent possible. Hence, HTTP 1.0 stuck around for a while and the supposedly minor version update 1.1 took years to be adopted. That’s amazing and, well, saddening. The most powerful system in the world is one of the hardest to upgrade. Even upgrading a spaceship while it’s floating in outer space should be simpler, I suppose. At least you are in control and can make the decisions to support the upgrade.

If you thought the internet was perfect, think again.

HTTP 1.1 adds some great features on top of HTTP 1.0 while still being backward-compatible, but what’s next? Google clearly has an interest in a faster internet. It provides a better experience, and Google supports forward looking initiatives to move the web forward. They also indirectly benefit revenue wise, since if the speed of browsing increases, so will searches & ad revenue. SPDY is Google’s attempt at a better HTTP and I am really glad that there is someone at Google actively pushing for these standards. Very few companies are in a position to make something happen here. Very few. Google, Microsoft, maybe Facebook. Maybe Yahoo. That’s it.

Google has some interesting research going on at the transport & internet protocol level, specifically TCP - See SCTP. But it’s hard to be backward compatible when you propose enhancements at these levels. And because of the distributed nature of the internet, changes like these take a very, very long time. Nothing is perfect. Sometimes you just have to live with what you have at the moment. That’s certainly the case with a lot of the internet’s protocols.

Vishal

    • #internet
    • #network
  • 3 weeks ago

Redis + Lua for processing JSON values

I love Redis. Its simplicity & attention to minimalism is striking and I find myself right at home when working with Redis. It’s no surprise that Redis is one of my favorite open-source projects.

Well, you might have heard that Redis 2.6 RC was just released and it has native Lua support! Now you might think, what the hell? But trust me, this is awesome. Think of it like the PL/SQL (well, kinda…) of Redis.

For things where you traditionally had to pull objects back from Redis, for example processing Redis objects and summarizing the results, generating stats or calculating means/averages, you can now do it all right in Redis. That’s right, in Lua. Lua scripting is powerful. And thankfully Lua scripts are run in an atomic fashion - no other script or command is executed while the script is executing.

Plus, Redis’ Lua support includes some of the most popular Lua libraries - base lib, table lib, string lib, math lib, debug lib, cjson lib & cmsgpack lib.

The one I am excited about is cjson! Yes, that means you can now “near-natively” process JSON in Redis. I say near-natively, since it still has to go through the Lua runtime. But that’s OK. It’s an extension of Redis, the way I look at it. And this is better than native JSON support, because now the sky is the limit. You want native YAML support? Well, if a yaml lib is included (it isn’t right now, but fingers crossed), you can get that via the same Lua based scripting system. It’s quite an ingenious way to cheat, to build more with less.

A use case where we have a need for such processing is aggregate value calculation. We store stringified JSON as values in Redis. An example is poll stats. If I were to calculate demographic level stats or just a total count of values, I would have to batch fetch the entire JSON values and calculate the stats in my app layer (Node.js for example). With Lua, I can parse JSON (decode), loop through the matching keys/objects, calculate the stats and return them all from Redis, without ever getting back to the app layer. This is fantastic. (If this interests you, be sure to see some of the PS’s below).

Here are some code examples - (Note: You use the eval command for executing Lua scripts. The Lua runtime is also sandboxed and not all Lua objects are available, for example io is not available, as it simply does not make sense.)

eval "return cjson.decode(cjson.encode('{v_rocks:true,whos_v:\"http://www.vishalshah.org\"}'))" 0

"{v_rocks:true,whos_v:\"http://www.vishalshah.org\"}"

Here’s another example where we store the string {v_rocks:true,whos_v:"http://www.vishalshah.org"} after encoding it via cjson, and then retrieve it and decode it via the same.

redis 127.0.0.1:6379> eval "redis.call('set', 'vmeta', cjson.encode('{v_rocks:true,whos_v:\"http://www.vishalshah.org\"}'))" 0

(nil)

redis 127.0.0.1:6379> get vmeta

"\"{v_rocks:true,whos_v:\\\"http:\\/\\/www.vishalshah.org\\\"}\""

redis 127.0.0.1:6379> eval "return cjson.decode(redis.call('get', 'vmeta'))" 0

"{v_rocks:true,whos_v:\"http://www.vishalshah.org\"}"

Voila!
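
The poll stats use case mentioned earlier boils down to: decode each stored JSON value, loop over the matching keys and add things up. Here is a rough sketch of that logic in plain Python, with a dict standing in for Redis. The key names (`poll:1:...`) and the `votes` / `age_group` fields are invented for illustration; inside Redis the same loop would be written in Lua using `cjson.decode`.

```python
import json

# Stand-in for Redis: key -> stringified JSON value (hypothetical data).
store = {
    "poll:1:resp:a": json.dumps({"votes": 3, "age_group": "18-24"}),
    "poll:1:resp:b": json.dumps({"votes": 5, "age_group": "25-34"}),
    "poll:1:resp:c": json.dumps({"votes": 2, "age_group": "18-24"}),
    "poll:2:resp:a": json.dumps({"votes": 9, "age_group": "18-24"}),
}

def poll_stats(store, prefix):
    """Total votes and votes per age group for keys starting with prefix."""
    total = 0
    by_group = {}
    for key, raw in store.items():
        if not key.startswith(prefix):
            continue
        doc = json.loads(raw)  # the step cjson.decode performs in Lua
        total += doc["votes"]
        by_group[doc["age_group"]] = by_group.get(doc["age_group"], 0) + doc["votes"]
    return {"total": total, "by_group": by_group}

stats = poll_stats(store, "poll:1:")
```

The point of doing this inside Redis rather than in the app layer is that only the small `stats` result crosses the network, not every stored value.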

PS0: You still have to JSON stringify/parse from the app layer when storing JSON values in Redis. However this is best done at the Redis driver level. If you know of a Redis driver that does not have JSON stringify/parse wrappers, make sure you add them by extending/forking the driver yourself.

PS1. More than JSON support specifically, I am more excited about Lua in general. Stats calculations, log like processing, filters, and other checks can now happen in Redis.

PS2. Be careful - Lua scripts are run atomically. If you have a long running Lua script, other Redis commands will wait and hence your backend’s processing will slow down. With a lot of power comes great responsibility. One way to cheat is to run Lua scripts on slaves that have less load or on staging servers running from the most recent Redis dumps, so you can safely play around!

Vishal

    • #redis
    • #nosql
    • #architecture
  • 3 weeks ago

Some nice stats on AOL, Facebook, Draw Something

  • $1.5 billion: The cost of cutting London-Tokyo latency by 60ms;
  • 9 days: It took AOL 9 years to hit 1 million users. Facebook 9 months. Draw Something 9 days;
  • ~362 sq ft solar array: powers 1 sq ft of data center.

Source

  • 2 months ago

Tao of YouTube: choose the simplest solution possible with the loosest guarantees that are practical. The reason you want all these things is you need flexibility to solve problems. The minute you over specify something you paint yourself into a corner. You aren’t going to make those guarantees. Your problem becomes automatically more complex when you try and make all those guarantees. You leave yourself no way out.

Awesome. —V

7 Years Of YouTube Scalability Lessons In 30 Minutes
  • 2 months ago
Instead of sitting at your computer, and looking at books, go to a drycleaner, and sit there. The way to get an interesting idea is to go to the source. Stay there until you have thought of something interesting about drycleaning. Then, listen to that idea and it will design itself.
Bob Gill on how to have a good idea
  • 2 months ago

Web Scale, Online-Offline Architecture Pattern/Template for Tiny to Large Scale Products

Having read about and practiced many, many architectural styles & patterns myself, I think I have something rather interesting to share.

I have identified a very standard architecture pattern that most online-offline systems can use (online-offline system is a term I made up :) for a system that supports both online/realtime & offline processing).

It’s not new or anything revolutionary. It just works. Digg follows most of that standard/canned architecture - app server - caching layers - storage - messaging/queues - async/offline Hadoop processing.

Such an architecture supports everything from small to large use cases and products. It supports processing large amounts of data via Hadoop, while quick data queries use the cache and datastore. The app server is where the “online” business logic is. Offline business logic is in Hadoop.

It’s so generic & balanced that you can scale individual layers/components without affecting the rest of the system. Also it fits in our SOA, where apps on the top are nothing but service end-points with no UI, or they can be java/node/php… apps with a UI.

I am not proposing that this is the one stack that everybody should use, but it’s sort of an architecture pattern that works for many problems. You can skip components as you wish and add them later. For example, for a mobile app/service we are currently building, we have no need for offline processing & a dedicated messaging layer - hence the Hadoop & messaging layers go away! We can add them in the future if we want to - to add lots of stats, processing, etc.

So, you can scale it back to just the app server, caching and data store if you need, which is not that exciting or revealing, but you know you can scale it to a ridiculous extent by adding the async messaging & offline MapReduce/Hadoop processing layer. And bingo!!

Here’s the Digg diagram. I chose not to build my own diagram because I am lazy. Digg’s diagram has most of it. Don’t get too carried away with their arrows. The important things to note are the components and layers and the purpose they serve.
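
For readers who prefer code to diagrams, here is a very small Python sketch of how the layers fit together: the app server reads through a cache, falls back to the data store, and drops events on a queue for offline processing. Everything here is in-memory and hypothetical (the `AppServer` class, the dict-based cache and store, the `offline_job` function); in a real deployment these would be separate systems such as a cache, a datastore, a message queue and Hadoop.

```python
from collections import Counter, deque

class AppServer:
    """Online path: cache first, then the data store; events go to a queue."""

    def __init__(self, cache, store, queue):
        self.cache, self.store, self.queue = cache, store, queue

    def get_item(self, key):
        self.queue.append(("view", key))   # async work for the offline layer
        if key in self.cache:              # fast path
            return self.cache[key]
        value = self.store.get(key)        # slow path: hit the data store
        if value is not None:
            self.cache[key] = value        # populate cache for next time
        return value

def offline_job(queue):
    """Offline path: batch-process queued events (stand-in for MapReduce)."""
    counts = Counter()
    while queue:
        event, key = queue.popleft()
        counts[(event, key)] += 1
    return counts

cache, store, queue = {}, {"story:1": "Hello"}, deque()
app = AppServer(cache, store, queue)
first = app.get_item("story:1")    # served from the store, then cached
second = app.get_item("story:1")   # served from the cache
view_counts = offline_job(queue)
```

Dropping the queue and `offline_job` leaves just the app server, cache and data store, which is the scaled-back version described above.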

— Vishal

    • #architecture
  • 3 months ago
It’s not the consumers’ job to know what they want
Steve Jobs when asked what market research went into the company’s elegant product designs
    • #quotes
  • 4 months ago
If you think too much, you will make mistakes.
Vishal Shah (1/27/2012) while planning with team on a software project
    • #quotes
  • 4 months ago

Will Amazon’s DynamoDB be a game changer?

My answer to this question on Quora -

I am very excited about DynamoDB and I definitely think it’s a game changer. Its killer features are that it is easy to get started with, auto-sharding, dynamic scalability per table & proportional AWS costs/billing, in-place atomic updates, and support for both eventual and strong consistency. But Dynamo has some work ahead of it before it can really catch fire - mostly around queries:

  1. Much better integration with elastic map-reduce where I can seamlessly run MR jobs on the fly for queries, etc.
  2. Queries using key-filters other than range (the only key filter supported at this time)
  3. Scan operation has a limit of 1 MB - the client has to do multiple trips if you have more than 1 MB of data in the table
  4. No index support other than the key itself. Riak, for example, supports secondary indexes that are auto-built. You can build them yourself, hence not a deal breaker, but still..
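
Point 3 is worth a quick illustration. Because each Scan response is capped, the client keeps calling Scan, passing the `LastEvaluatedKey` from the previous response as `ExclusiveStartKey`, until no key comes back. The Python sketch below fakes the table with a list so it runs on its own: `fake_scan` and its `page_limit` (items per page rather than bytes) are stand-ins I made up, not the AWS SDK, and the key is simplified to a plain id.

```python
def fake_scan(table, exclusive_start_key=None, page_limit=2):
    """Stand-in for a DynamoDB Scan call over a list of items.

    Returns at most `page_limit` items (the real limit is 1 MB of data,
    not an item count) plus LastEvaluatedKey when more items remain.
    """
    start = 0
    if exclusive_start_key is not None:
        ids = [item["id"] for item in table]
        start = ids.index(exclusive_start_key) + 1
    page = table[start:start + page_limit]
    response = {"Items": page}
    if start + page_limit < len(table):
        response["LastEvaluatedKey"] = page[-1]["id"]
    return response

def scan_all(table):
    """Client-side loop: keep scanning until no LastEvaluatedKey is returned."""
    items, start_key, round_trips = [], None, 0
    while True:
        response = fake_scan(table, exclusive_start_key=start_key)
        round_trips += 1
        items.extend(response["Items"])
        start_key = response.get("LastEvaluatedKey")
        if start_key is None:
            return items, round_trips

table = [{"id": i} for i in range(5)]
items, round_trips = scan_all(table)
```

The `round_trips` counter is the cost being complained about: every extra page is another network call the client has to make.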

PS. using solid state drives for storage is awesome!

Vishal

    • #cloud
  • 4 months ago

Will Amazon make the Silk cloud backend available to other browsers?

My answer to this question on Quora -

I for one would think that even if they do open Silk, browser vendors will be less inclined to use it.

Google (Chrome), for example has its own cloud infrastructure, potentially bigger than Amazon’s.

Safari (my current browser of choice) traditionally is very conservative on using any 3rd party services due to Apple’s focus on simplicity - not a bad thing at all in my opinion.

Firefox can benefit potentially but fundamentally it might not be a big win and here’s why.

Silk/computing in the cloud primarily aids mobile (including tablet) devices - with limited network & cpu/memory resources. That’s when this really shines! - the approach of offloading some of the processing and caching on the cloud.

For desktop browsers, some of the above constraints sort of vanish - more cpu/memory/caching capacity is available. And hence the benefits are smaller, much smaller. In fact, it might even negatively impact overall performance, if Amazon’s cloud is experiencing a lot of traffic for example.

I do believe however that the hybrid processing and caching model is a terrific idea when well executed for smaller devices. But there is just not enough motivation as of yet for more powerful devices.

Vishal

    • #amazon
    • #cloud
    • #mobile
  • 4 months ago

Managing User Presence, Software Caches, Counters, Sessions among other things using Redis

I have seen so many developers & architects choose the wrong strategy for managing user presence, caches, stats and session information.

My simple advice that works - use redis + expire. It’s blazing fast, stupidly simple and very scalable using file syncing and replication and/or sharding.
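
As a sketch of what “redis + expire” buys you for user presence, here is a tiny in-memory imitation in Python. `TTLStore` is a made-up class that mimics only the expiry behaviour of a key set with a time-to-live; the key naming (`presence:<user>`) and the 30 second TTL are assumptions for the example. With real Redis the heartbeat would be a `SET` followed by `EXPIRE` (or a single `SETEX`) and the check a plain key lookup.

```python
import time

class TTLStore:
    """Toy key-value store where every key carries an expiry time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so expiry can be simulated
        self._data = {}              # key -> (value, expires_at)

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]      # lazily drop expired keys
            return None
        return value

def heartbeat(store, user_id, ttl=30):
    """Called whenever the user is active; refreshes their presence key."""
    store.set(f"presence:{user_id}", "online", ttl)

def is_online(store, user_id):
    return store.get(f"presence:{user_id}") is not None

# Simulated clock so the example does not need to sleep.
now = [0.0]
store = TTLStore(clock=lambda: now[0])
heartbeat(store, "vishal")
online_right_away = is_online(store, "vishal")
now[0] = 31.0                        # no heartbeat for 31 seconds
online_after_silence = is_online(store, "vishal")
```

There is no cleanup job and no "went offline" event to handle: a user who stops sending heartbeats simply disappears when the key expires.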

Don’t try to kludge it in the old school way using sql, custom in-memory hash tables or memcached used inappropriately.

As a software architect, the hardest thing to do is pick the right tool for the job while balancing complexity, cost, performance and learning. And if there is one tool I never forget and keep coming back to, it is redis, which is intentionally kept simple and is a superb artifact of the KISS principle. On top of that, its genius lies in its beautiful command language, which is easy to operate and learn and, more importantly, to build upon for very-simple-to-very-complex systems based on data models that use KV alone. It takes a while to get used to, but KV > complex data systems for most situations, especially for session management.

You will find yourself using Redis for counters, stats and caches, all without the cold start problem, since Redis fsyncs (file sync) its in-memory representation to disk, so you can always lose a Redis instance and bring another back up without breaking a sweat.

And the icing on the cake? It complements rather than competes with other NoSQL systems, especially persistent ones. I get amused when people talk about or search for redis vs cassandra vs riak vs …

The closest thing to redis that you probably know is memcached.

Vishal

    • #architecture
  • 4 months ago

API Design Best Practices. How to attain API Awesomeness.

APIs can be fundamentally important for an organization. However, contrary to popular belief, they are not just for external clients & developers; they can play an important role in building system wide applications & in moving towards an API driven architectural style (more on that in a separate blog post).

So what are the good qualities to think about while designing APIs? In other words, I want to discuss how to design APIs that are well designed and scalable (in their use), along with some thoughts on related matters.

Qualities of good APIs

  • Less is more
    • The smaller the API set, the better.
      • There are some obvious benefits here. Less means less to build, support, maintain. It’s also supposedly easier for API consumers and developers to understand. It’s easy to build off them as well.
      • Sometimes there is a need for more powerful, custom APIs. In that case, my genius idea is to build another “set” of custom APIs on “top”.
        • Ok, so what does that mean? Well, let’s take an example. I have 3 simple APIs to offer to the rest of the world and to developers within my company. Now, I will offer another set of APIs that are built on “top” of these APIs with additional hooks and customizations. These APIs are designed for the power users, if you will. They are documented as such in a separate section. Now, what are the benefits of such an approach?
          • Simple - isolation. Which allows us to “evolve” one set of APIs separately from the others. This might not seem obvious, but believe me when I say, it’s tremendously powerful. I can add or deprecate a set of APIs, or add/remove params, without impacting my entire user base. This is awesome.
            • I don’t know of any organization, large or small, formally following this. They might have such a pattern, but it often is an artifact of iterations, vs planned. Simply following this rule, will make you a better API designer, architect and developer.
  • Loosely coupled to clients, possibly RESTful, platform agnostic
    • Architects have learned the hard way the cost of building APIs tightly coupled to clients. And I understand why they did that. APIs initially were needed to solve a problem, to offer services to a set of clients, and hence it’s natural for API designers to follow those clients’ needs. Wrong 8/10 times. Clients come & go. That’s a fact of life. Both internal & external clients change, because they are often product focused and no product is constant, at least not the successful ones. DON’T design your APIs for a client or a small set of clients. You will be amazed at the possibilities of beautifully designed APIs. Systems are better built with loose dependencies on APIs as opposed to tight binary dependencies.
    • REST is a very powerful pattern, that the web/http builds upon and you can not go wrong building RESTful APIs, but there are some gotchas. Don’t follow 100% REST terminology as chances are you will not have web scale routers and caching infrastructure. There is often something more needed. Also some of REST philosophy is hard to absorb. It’s OK. Start small, simple. Iterate from that point on. Learn from giants - Twitter, Facebook, LinkedIn, Google.
  • Performant
    • APIs are no good if they are not fast, simply put. Clients, often external, are dependent on them. Their performance and their user experience rest on your shoulders. And if there are 1000’s of clients, that’s a lot of responsibility. Don’t sweat it. In the beginning, only expose APIs that you know follow good algorithms, for example ones that do not involve an entire sweep of the database. APIs should mostly use keys to look up data, cache hard-to-calculate data and avoid serving user specific complex data without designing & planning for it. Caches are an API’s best friends. Take advantage of them. Assume clients will try to abuse your APIs. You have to be smart, accountable & resilient.
    • Keep responses small. Large responses are one of the biggest reasons some APIs are slower. Support pagination if there is more data. Simple. (See more tips in the response format tips below!)
    • Consider supporting binary response formats for extra performance. msgpack and protocol buffers are excellent for compacting your API response and, at the same time, supporting fast data parsing and loading, a great win!
  • Don’t trust your clients
    • This is a mistake novice designers & architects always make: trusting the clients to do the right thing. Wrong again. Never trust your clients, even if they are internal clients. Because even if the intentions are good, a small client application bug can suffocate your APIs, which in turn has downstream impacts.
  • API servers are API servers
    • API servers are often aggregators. They don’t have a lot of business logic, but instead depend on other systems to pull and present the data to the clients, often from a variety of back-ends. If you are not doing this - in other words, if you are designing or building APIs on your app server - well, good luck. I will say, it’s a terrible idea. Even if your entire stack is hosted off one machine, try to isolate the API server as a separate module, with its own dependencies and code base. There should be only a binary/library/API dependency on other modules/system components. You will thank me for this.
  • Param Design
    • Keep your param list small, very small for that matter. Give good names, so that 9/10 times I get it without reading a lot of documentation. These are for obvious reasons. As a rule of thumb, always choose simplicity when possible. As API designers, this is the hardest aspect. As computer scientists, we are trained to solve complicated problems, but we are not trained enough on simplicity, the power of less, and all that. Just look at Apple products & user experiences.
    • Clearly mark off optional parameters and document them separately, because many developers might not even be interested in that much power. Also, on the flip side, clearly note your default values for optional params.
    • If you are designing HTTP APIs, follow the proper verbs. GET for data out. PUT/POST for data in or updates. DELETE for deleting data (use caution here, especially around authentication).
  • Response design & formats
    • Support only one response format to the extent possible. Don’t try to be cool and support multiple formats unless absolutely required. JSON these days offers a good balance of simplicity and client compatibility across the gamut of clients out there. I don’t prefer XML, but it works.
    • If you need to support more than one response format, try to isolate the view (the response template) from the data. This is in line with the MVC design pattern. This will make it very easy to support & maintain multiple formats. And it’s easy to debug issues as well. Please note this!
    • Keep responses small. Don’t include everything in the response.
    • I have a great tip that many don’t know. Use params to drive what to include in the response. This is a great way to give control to the client and let them decide the tradeoff between performance and quantity. You should make this clear in the API documentation. Also, this can be designed as a privileged service. Meaning, only privileged (trusted, etc) clients can choose to include certain items in the response, as they add to the response and/or cost more to calculate/generate.
  • Real-Time Processing for Statistics & Monitoring
    • Avoid doing real-time processing for monitoring or calculating stats to the extent possible. That’s because often it’s not very simple to do that in a fashion that does not compromise API performance, the integrity of the statistical data, and the ability to take other necessary actions.
    • I have a much better idea to share for that problem. Do the processing offline! What I mean by that is that you should log all the requests you get for the systems. Periodically process that data and calculate aggregate stats that the APIs can use to monitor and accept/deny/etc activities from clients. This is amazingly powerful because it is very scalable. Offline systems can use hadoop/map-reduce on streamed scribe data and there you go, you can calculate the most sophisticated or the simplest of statistics, like API counts from a client or partner. Imagine doing that in real-time, especially when you have 20 API servers and any one of them can serve an API request. You have no choice but to use a distributed storage system and/or cache locally and periodically sync with other peers.
    • Instead, doing it offline and updating the live API servers with that stat, or making it available via constant time lookup (key lookups from caching systems or KV data stores, for example Redis), does wonders. Sure, there is a downside. You will lose a window of opportunity when the stat is not updated and you would still be serving clients. But you can run the offline batch processing as often as you like. For example, every 15 minutes. That way, the worst you lose is a 30 minute window. And besides, you should be investigating DoS attacks among other things at the site level, not just APIs, and those systems can help/aid under critical attack or abuse situations.
  • Authentication
    • Coming soon!
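
A few of the tips above (pagination, small responses, letting params decide what goes into the response, not trusting client input, and constant-time lookups of stats computed offline) can be sketched together in Python. All names here are hypothetical: `list_items`, the `fields`, `page` and `per_page` params, the `request_counts` dict standing in for a KV store refreshed by an offline job, and the limit of 1000 requests.

```python
ITEMS = [{"id": i, "title": f"Item {i}", "body": "long text " * 3, "score": i * 10}
         for i in range(1, 8)]
DEFAULT_FIELDS = ("id", "title")       # keep the default response small
MAX_PER_PAGE = 3

# Stand-in for a KV store updated periodically by an offline job.
request_counts = {"client-a": 12, "client-b": 5000}
REQUEST_LIMIT = 1000

def list_items(client_id, page=1, per_page=MAX_PER_PAGE, fields=None):
    """Return one page of items, including only the requested fields."""
    # Constant-time key lookup instead of counting requests in real time.
    if request_counts.get(client_id, 0) > REQUEST_LIMIT:
        return {"error": "rate limit exceeded"}
    # Don't trust the client: clamp paging params to sane bounds.
    page = max(1, int(page))
    per_page = min(max(1, int(per_page)), MAX_PER_PAGE)
    wanted = tuple(fields) if fields else DEFAULT_FIELDS
    start = (page - 1) * per_page
    rows = ITEMS[start:start + per_page]
    return {
        "items": [{k: row[k] for k in wanted if k in row} for row in rows],
        "page": page,
        "has_more": start + per_page < len(ITEMS),
    }

small = list_items("client-a")
custom = list_items("client-a", page=3, per_page=50, fields=["id", "score"])
denied = list_items("client-b")
```

The default call returns only `id` and `title`; a client that wants `score` has to ask for it, and a client over the offline-computed limit is turned away with a single dictionary lookup.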

Vishal

    • #api
    • #architecture
    • #design
  • 5 months ago
Don’t be too much of a generalist where you don’t have confidence in yourself or what you build.
Vishal Shah, 2011
    • #quotes
  • 5 months ago
Be the change you want to see in this world
Mahatma Gandhi
    • #quotes
  • 6 months ago

About

I like designing & architecting things that help better and simplify life in some way or other.

I have degrees in Computer Science & Mechanical Engineering and have studied Industrial Design in San Francisco, CA.
