
The persistence layer is responsible for insulating various components that have to preserve data across multiple web requests from the specifics of data storage. This includes data associated with specific clients as well as data that has to be globally accessible when servicing all clients. Not all components requiring persistence necessarily use this layer if they have more specialized requirements that are not easy to abstract behind a common interface.

Draft Proposal

The following technical requirements for the abstract API are suggested based on experience with the Service Provider's overlapping requirements (a rough interface sketch follows the list):

  • String-Based API
    • Handle storing string and text data (blobs can be encoded as text), keeping serialization of objects separate.
    • One of the consequences of this is that aliasing has to be implemented by hand by managing alternate indexes to information. For example, a secondary key B to an object keyed by A would be stored as a mapping of the strings (B, A) so that B can be used to find A. If the mapping of B to A is not unique, then the value becomes a list requiring upkeep, and this can cause performance problems if the set of A is unbounded or large. If this is a common case, building in explicit (and thus more efficient) secondary indexing may be worth considering.
  • Two-Part Keys
    • Supporting "partitions" or "contexts" makes it practical to share one instance of a storage back-end across different client components. Not such a big deal with database tables or in-memory storage, but very useful for options like memcache. Ultimately many back-ends will have to combine the keys, but that can be left to implementations to deal with.
  • Exposing Capabilities
    • Exposing back-end implementation capabilities such as maximum key size enables clients to intelligently query for them and adapt behavior. For example, some components might be able to truncate or hash keys while others might not. This might be something to enhance by adding pluggable strategy objects to shorten keys. Another aspect of variable behavior might be support for versioning, which a client-side storage option wouldn't handle (you can't conditionally set a cookie).
  • Internal Synchronization
    • All operations should be atomic to simplify callers.
  • Versioning
    • Attaching a simple incrementing version to records makes detecting collisions and resolving contention relatively simple without necessarily losing data. Callers can determine whether to ignore or reconcile conflicts. As noted, this may need to be an optionally supported feature.
  • TTLs
    • All records normally would get a TTL value to support cleanup. This wouldn't work for some use cases, so we probably need a permanent option (which again, might be negotiable).
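
As a purely illustrative sketch of the kind of interface these requirements suggest (none of the names or signatures below are settled; they are assumptions for discussion, not a proposed final API):

```java
/** Illustrative sketch only; nothing here is final. */
public interface StorageService {

    /** Back-end limits (e.g. maximum context/key/value sizes) and optional features. */
    interface Capabilities {
        int getContextSize();
        int getKeySize();
        long getValueSize();
        boolean isVersioningSupported();
    }

    /** A stored value together with its version and optional expiration. */
    interface Record {
        String getValue();
        long getVersion();
        Long getExpiration();
    }

    Capabilities getCapabilities();

    /** Atomically create a record under a two-part (context, key); false if it already exists. */
    boolean create(String context, String key, String value, Long expiration);

    /** Read a record, or null if absent or expired. */
    Record read(String context, String key);

    /** Update only if the caller's version still matches the stored one (optimistic concurrency). */
    boolean updateWithVersion(long version, String context, String key, String value, Long expiration);

    /** Delete a record; false if it did not exist. */
    boolean delete(String context, String key);
}
```

In a sketch like this, the two-part key maps onto a table plus primary key for a database or a concatenated key for something like memcache, and a null expiration could stand in for the "permanent" option mentioned above.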

At least in the SP, eventing or pub/sub has not been a requirement to date and I'd like to avoid it if we can, since it greatly limits the possible implementations.

Use Cases

Replay Cache

Most identity protocols assume the use of nonces (usually via message IDs) to prevent replay attacks, though these checks are usually of low importance within the IdP. The more valuable capability is in detecting stale requests to prevent the browser from being trapped in a back-button / login loop. Because of the low security importance, an unreplicated in-memory storage service is usually sufficient. A passively replicated data store would also work well. Client-side storage is not an option, obviously.

Use of the storage API is straightforward; a context is used to isolate the namespace of possible values being checked and the value to check is the key. The value is irrelevant. The key size here can potentially exceed a desirable key size, though not in general, and hashing is sufficient to address that.
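
For illustration, a replay check against the hypothetical StorageService sketched above might look like the following (the context name, hashing choice, and placeholder value are all assumptions):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Illustrative replay check against the hypothetical StorageService sketched earlier. */
public final class ReplayCheck {

    /**
     * Returns true if the message ID has not been seen before (and records it),
     * false if this is a replay or stale request. Hashing keeps arbitrarily long
     * IDs under the back-end's maximum key size; the stored value is irrelevant.
     */
    public static boolean check(StorageService storage, String messageId, long expiration)
            throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(messageId.getBytes(StandardCharsets.UTF_8));
        String key = Base64.getEncoder().encodeToString(digest);
        // create() is atomic and fails if the key already exists, which signals a replay.
        return storage.create("replay", key, "x", expiration);
    }
}
```

Because create() is atomic, callers need no additional locking of their own.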

Artifact Store

The SAML artifact mechanism requires associating artifact message handles with assertions or messages. For SAML 1 artifacts to function, all servers responding to artifact lookup requests need access to the data store, making in-memory implementations suitable only for single-node systems. Replication would need to be rapid and reliable. For SAML 2 artifacts, it's possible to associate an artifact with a server URL. With additional work to deploy dedicated TLS-protected virtual hosts with unique names, it's possible to avoid a replicated artifact store. Normally every server in a cluster would be load-balanced behind one name and certificate, so this is much more complex to support, probably requiring additional addresses or ports. In either case, client-side storage is not an option.

The two-part key mechanism is irrelevant here because all artifacts are unique by themselves. The message handle is the key, and the serialized message is the value. The key size here can potentially exceed a desirable key size, though not in general, and hashing is sufficient to address that. The value is a potentially non-trivial message on the order of 10k in size.
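
Again purely as a sketch against the hypothetical API above (the one-time-dereference behavior shown is an assumption about how a resolver would use the store, and the context name is illustrative):

```java
/** Illustrative artifact store usage against the hypothetical StorageService. */
public final class ArtifactStoreExample {

    /** Store a serialized message under its artifact message handle. */
    public static boolean store(StorageService storage, String messageHandle,
            String serializedMessage, long expiration) {
        return storage.create("artifact", messageHandle, serializedMessage, expiration);
    }

    /** Resolve an artifact, deleting it so it can only be dereferenced once. */
    public static String resolve(StorageService storage, String messageHandle) {
        StorageService.Record record = storage.read("artifact", messageHandle);
        if (record == null) {
            return null;
        }
        storage.delete("artifact", messageHandle);
        return record.getValue();
    }
}
```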

Terms of Use

No experience with this use case, but I would speculate that this is associating some kind of local user identity with an identifier representing some kind of ToU. I would imagine a ToU could contain parameterized sections or require user input that would need to be preserved, and that would be a simple matter of storing a more complex object produced by a particular ToU module. I could imagine needing a TTL for this data for ToU that have to be renewed periodically, but permanence might also be needed.

Server-side storage here seems awkward without replication, since a user wouldn't understand why he/she was being prompted again. Client-side storage is possible but also quite awkward due to multiple devices. Also seems like a bad thing to eat into our exceedingly limited cookie space. This could be a use case for Web Storage.

Attribute Release Consent

Need to investigate existing uApprove code to see what's being stored.

Technology considerations seem similar to the Terms of Use case, only moreso. No way anything more than a global yes/no fits into a cookie, but Web Storage is a possibility if the extra prompting from multiple devices isn't a concern.

Session Store

We need some form of persistence for user sessions to support SSO, and features like logout depend on what we store and how we store it. This is a primary use case for client-side storage, but also a difficult one because of size limitations, particularly if logout is involved. This is a likely candidate for storing some kind of structured data as a blob but unlike the SP, sessions shouldn't need to be arbitrarily extensible.

As a first cut, the data involved is:

  • a unique ID, highly random (16 bytes)
  • representation of the user (ideally a canonical name) (256 bytes)
    • currently this is defined per service and allows us to attach things like the client address so that the resolver can use it
  • expiration based on time of last use (8 bytes)
  • n-ary authentication state (time, duration, method) (8 + 8 + 2 bytes)
  • n-ary service login records (entityID, method, NameID, SessionIndex) (256 + 2 + ? + 32 bytes)
    • method mainly serves here to drive attribute filters based on authentication method, can we toss this?
    • do we need time of login to a service?

Lookup of sessions is primarily by the unique ID, except when logout is involved. Then we need lookup by (entityID, NameID, SessionIndex?).

A simple layering on top of the API might be to pickle the entire structure against the session ID as a single-part key, and then create reverse mappings for the entityID + NameID (or a hash) of each service login to the session ID. The reverse mappings ought to expire on the same basis as the primary record, but that might not be efficient to manage, not clear at this point.
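
A rough sketch of that layering, using the hypothetical StorageService from the proposal above (the context names, delimiter, hashing, and the shape of the pickled session are all placeholders):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Illustrative session persistence layered on the hypothetical StorageService. */
public final class SessionStoreExample {

    private final StorageService storage;

    public SessionStoreExample(StorageService storage) {
        this.storage = storage;
    }

    /** Store the pickled session plus a reverse mapping per service login for logout lookup. */
    public void save(String sessionId, String pickledSession,
            Iterable<String[]> serviceLogins, long expiration) throws NoSuchAlgorithmException {
        storage.create("session", sessionId, pickledSession, expiration);
        for (String[] login : serviceLogins) {
            // login[0] = entityID, login[1] = NameID; hashing keeps the secondary key short.
            String secondaryKey = hash(login[0] + "!" + login[1]);
            // Ideally this expires on the same basis as the primary record; the value is the session ID.
            storage.create("session-index", secondaryKey, sessionId, expiration);
        }
    }

    /** Logout-style lookup: find the session ID by (entityID, NameID). */
    public String lookupByLogin(String entityID, String nameID) throws NoSuchAlgorithmException {
        StorageService.Record record = storage.read("session-index", hash(entityID + "!" + nameID));
        return record == null ? null : record.getValue();
    }

    private static String hash(String input) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest);
    }
}
```

As noted in the proposal, if the same (entityID, NameID) pair can map to more than one active session, the reverse-mapping value would have to become a list, with the upkeep problems that implies.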

With a server-side approach, data needs to be replicated at least by the time load-balancer stickiness wears off, or SSO won't happen (nor, of course, will logout). A client-side approach is the holy grail here, but see below.

Estimated size data is shown above. We could use 2-byte shorts to represent authentication methods, and expand those into URIs only when needed. This saves substantial space in capturing authentication state. An entityID can be up to 1024 bytes per the specification, but in practice they are much shorter and well under 256 bytes. The outlier is the NameID, which is nearly unbounded in theory and can't be hashed down, because the whole point is to be able to propagate it in a logout request.

Even a simple case study is already well in excess of some browser limits on total cookie size for a domain, even without including overhead for padding, encryption, a MAC, and encoding. Compression would help a little but probably not significantly. Web Storage does not seem like a good fit for this use case either. Session information needs to be accessible to the server without a lot of hoop-jumping, and Web Storage does not allow for this. We would have to generate interstitial pages that use JavaScript to read and post back the session data to the server in the middle of the conversation.

I think a likely direction here is to split off the data associated with service logins because that's only required for logout, and is the entire reason this becomes unmanageable to store in cookies. Thus, the session cache component could incorporate multiple storage service instances injected for the different subsets of data.

Another problem here is that the current server-side design allows us to make data about the user or the client available to the resolver extensibly via Java subject/principal objects. Moving that to the client creates problems with attribute queries, and it's been bad in the past to support functionality that only works with push; it breaks the symmetry and consistency of the resolver's behavior across different flows. This may be another opportunity to push advanced needs to server-side storage.

Possible Implementations

In-Memory

Not much to say, this is obviously straightforward.

Memcache

There's an existing implementation of the V2 session cache, and a version of the SP interface, which leads me to assume this should be possible. What isn't clear to me is the point of it. I know memcache's value as a cache, but this is a storage layer, not a cache. Unless the service were deployed separately from any IdP node, there would be no simple way to take down the server hosting the memcache daemon. With a single point of failure like that, a database seems like a much better choice. Probably this is another case where non-persistent state and true persistence lead to different back-ends.

JDBC

Clearly possible, and the SP has an ODBC implementation. JDBC should be a straightforward port even without optimizing it.

Cookies

Supporting cookies is principally a size problem. Full portability means limiting total cookie usage to 4k for the whole IdP, and we probably lose 25% to securing the data. Chunking is probably a waste of time unless we want to target browsers without the tiny domain-wide limit. Opera's probably practical to treat exceptionally, but I don't think Safari is.

The storage API would obviously need direct or indirect access to the HttpServletRequest/Response pair, and there could be timing issues if an attempt to update the data were made after generating a response to the client.

The V2 session cache uses a different HMAC key for every session because it's storing the session key on the server. A client-only model would mean using a fixed key or keys. This isn't a major problem except that we also need to encrypt, which is not something the V2 code does. Keeping the encryption key safe means handling key versioning and ideally automating the generation of new keys, perhaps on a schedule.
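
One way this could work, sketched only to make the idea concrete (nothing here is a settled design): keep a small keyring of numbered keys, always encrypt with the newest one, and record the key version alongside each cookie value so older cookies stay readable for a grace period after rotation.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

/** Illustrative keyring with versioned keys for cookie encryption; not a real design. */
public final class CookieKeyring {

    private final TreeMap<Integer, SecretKey> keys = new TreeMap<>();

    /** Generate and install a new key version, e.g. on a schedule. */
    public synchronized int rotate() throws NoSuchAlgorithmException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        int version = keys.isEmpty() ? 1 : keys.lastKey() + 1;
        keys.put(version, generator.generateKey());
        return version;
    }

    /** Always encrypt with the newest key; its version number goes into the cookie. */
    public synchronized Map.Entry<Integer, SecretKey> current() {
        return keys.lastEntry();
    }

    /** Decrypt with whatever version is named in the cookie, if still retained. */
    public synchronized SecretKey forVersion(int version) {
        return keys.get(version);
    }
}
```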

Applying the notion of a cookie name/value pair to the proposed technical design above, one might represent every individual record as a separate cookie, but this seems impractical because of the overhead of securing them. If we imagine that use of cookie storage would be relatively minimal because of the size limitations, it seems possible to serialize an entire set of mappings into a cookie named by a storage service instance. That is, the cookie name acts like a database connection name and the storage plugin "connects" to the cookie when asked to read the data, and writes back changes. Clearly this involves some overhead, but it maps well to the design and seems to work well if the number of mappings is low or one.
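
A sketch of that "connect to the cookie" model, leaving out the encryption, MAC, and size checks discussed above (the cookie name, wire format, and class shape are all placeholders):

```java
import java.util.HashMap;
import java.util.Map;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Illustrative cookie-backed mapping store; encryption, MAC, and size checks are omitted. */
public final class CookieMappingStore {

    private final String cookieName;                        // acts like a "connection" name
    private final Map<String, String> data = new HashMap<>();

    public CookieMappingStore(String cookieName, HttpServletRequest request) {
        this.cookieName = cookieName;
        // "Connect": read the whole serialized mapping set from the named cookie, if present.
        if (request.getCookies() != null) {
            for (Cookie cookie : request.getCookies()) {
                if (cookieName.equals(cookie.getName())) {
                    parseInto(cookie.getValue(), data);
                }
            }
        }
    }

    public String get(String key) {
        return data.get(key);
    }

    public void put(String key, String value) {
        data.put(key, value);
    }

    /** Write the whole mapping set back; must happen before the response is committed. */
    public void commit(HttpServletResponse response) {
        response.addCookie(new Cookie(cookieName, serialize(data)));
    }

    // Placeholder wire format: key=value pairs joined by '&'. Real code would need escaping,
    // plus encryption, a MAC, and base64 encoding of the result.
    private static String serialize(Map<String, String> map) {
        StringBuilder builder = new StringBuilder();
        for (Map.Entry<String, String> entry : map.entrySet()) {
            if (builder.length() > 0) {
                builder.append('&');
            }
            builder.append(entry.getKey()).append('=').append(entry.getValue());
        }
        return builder.toString();
    }

    private static void parseInto(String raw, Map<String, String> map) {
        for (String pair : raw.split("&")) {
            int split = pair.indexOf('=');
            if (split > 0) {
                map.put(pair.substring(0, split), pair.substring(split + 1));
            }
        }
    }
}
```

The commit step is where the timing issue mentioned earlier bites: the write-back has to happen before the response to the client is generated.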

Versioning wouldn't be easy here since different nodes could both update and write back the same cookie, but I suppose one could have some kind of server-side synchronization of updates to the information such that it's in a consistent state before a cookie gets written back. Seems like a lot of work and hard to manage, and I would guess that the use cases for using the cookie for storage could live without versioning.

Web Storage

Web storage has much better capacity than cookies, but when you dig into it, it is a very poor solution for storing data generated and manipulated by the server. It's targeted entirely at client-side application logic. The only way the data gets to the server is via a JavaScript-triggered post operation, which is an awkward thing to do. The overall implementation looks clumsy because even writing data back involves being able to inject JavaScript into a page at an appropriate time.

Another deficiency is that there's no support for expiration of data other than brute force, or unless data is scoped to a browser window (I often open and close tabs and windows, which would break this).

On the other hand, for truly persistent data in which the user interface is deeply involved (think ToU and consent), this seems like a less than crazy idea. These are also areas where a dependence on JavaScript is particularly likely as a matter of course. The IdP proper is something we want to keep free of such a dependency.


Comments

  1. RDW Comments

    Just a brain dump of thoughts to act as discussion points next week. No response required or expected.

    Service Capabilities:

    I think that an important capability to be exposed is the most obvious one of “scope”. So cookie and web storage are scoped by the browser and session/private browsing, memory is scoped by the IdP, and so on and so forth.

     

    Atomicity:

    I’d like to poke at the other facets of ACID (Consistency, Isolation and Durability). In particular, consistency is important in the case of the same subject simultaneously logging in via two agents. There are probably others, I am not sure, but ACID is important and if we ignore it we need to know why. (This deteriorates into the question of what is “good enough”.)

     

    USE CASES

    I sense that the matrix of capabilities vs requirements is going to be important and will need clear documentation so people can understand why, when they say “I want this clustered, and I want to use ECP and I don’t use JDBC”, they will get a “won’t work” response (to contrive a bad example).

    Equally, many of these may be dynamic (“use web storage if you can, otherwise a local database, otherwise in-memory”), so we may need tools to help with that.

    A related issue is that some of these questions are not yes/no but have an element of softness, so memcache (see below) is not a 100% availability solution but it might be “close enough”.

     

    Session Store

    I suggest that we discuss session storage here, rather than in the storage slot (then we can use the storage slot to do the “make the same Subject the same, in time and in space” discussion).

    “a unique ID, highly random (16 bytes)”: Random query: is 16 bytes enough? The spec speaks of 128-160 bits, and we just need a clever crypto guy to come along, we lose 3 bits from our PRNG, and then we all have our summer spoiled.

    Session Store/cookies

    What are we going to do if the agent doesn’t support cookies (RESTian profiles or broken browsers)? What about ECP? (I need to review that profile before next week.)

    IMPLEMENTATION

    Memcache:

    I looked at this and to a degree memcache is self-healing – if the memcache node crashes, then the individual nodes will continue to work, and as soon as they make a change to whatever is being stored they repopulate the memcache, so it has a use. The precise semantics are vague and flaky enough that we might not want to put our branding on it, but we should surely encourage an implementation.

    Cookies:

    “Keeping the encryption key safe means handling key versioning and ideally automating the generation of new keys, perhaps on a schedule.” I’d like to understand how this might be done; there is an element of “inventing a new protocol”, which always has an air of danger associated. I’m not saying that it’s not required, I’d just like to understand this and make sure we are happy that we are not exposing ourselves to attacks…

    I know that during the lifetime of V2 we changed the underlying implementation of cookies (but I do not remember the details) and I know that that cost us because not all web servers did the right thing.  This is just the pain of working with browsers, but I think we might need a mitigation strategy and I’ll say more below under “Web Storage”.

    Is there an issue with Web storage in that we are immediately implying that an agent represents only one subject at one time? This is not new, but people do whine about it from time to time. Can we invent something which allows people to do this at the cost of (say) deploying a JDBC session store?

    WebStorage:

    This confused me until I realized that Web Storage has nothing to do with the Web (it’s the 21st-century equivalent of buying a 1 MB Winchester disk for your TRS-80).

    This is highly seductive, but I am worried about the almost certain non-uniformity of implementation across browsers, the security advisories which will crop up as soon as people learn how to abuse it, and the extra cost of maintaining JavaScript. Nonetheless I think that we should probably do this – probably mostly for use by terms of use and attribute query. We might want to discourage its use for sessions initially (until we are sure it works as we want).

    It feels likely that we will need a fallback for cases where this doesn’t exist (not just IE7; again I come back to ECP and my misunderstanding thereof).

    Finally, I would strongly suggest that if we go down this route (and probably even if we don’t) we have (which means that we start by writing them, and not leave them to the end) some good diagnostic tools for the end user (and the sysadmin) to allow them to work out the precise capability of that particular browser/network/VPN/SSL offloader/web server/IdP set-up.

    I’m thinking of some JavaScript to test the API, then set a cookie and then go back to the browser (handwave handwave) – we should then make that code common with whatever we use for pushing the data back to the server.

    I think we need to eschew any common javascript packages until we understand the security implications around their distribution.

    Cloud Storage

    This is what I thought was meant by web storage. It is interesting because it’s a goodish fit for the API; it’s also web 2.0 and hence sexy, but as a result probably easier to deploy than a robust database. Downsides are that it’s another mission-critical service to go wrong, and that it may not be apposite for the high update churn that session management might offer. Also, from what I remember, at least with AWS, they manage availability by not guaranteeing that time always goes forwards. So they are Atomic, Consistent and Isolated, but not (always) Durable. Nonetheless probably worth spending the week doing…

     

     

  2. Tom invited me to review this document from the perspective of my experience with the CAS storage tier. I can appreciate the interest in a simple string-based storage API, but I would like to caution that you should consider serialization as a first-order concern, since it applies to many of the use cases. I would imagine the JDBC storage implementation would be popular (it is arguably our most popular storage backend). While serialization is baked into Java and there are mature APIs for persisting objects to a database, in practice there are some platform-specific concerns that make it problematic. The 10k estimate for artifact values is definitely in the range where you might anticipate complications. Considering serialization explicitly has value for key-value stores like memcached: a Kryo-based marshaller for the memcached backend produced a dramatic improvement in both speed and memory footprint. A domain-specific serialization mechanism optimized for space might make cookie storage more feasible.

    What isn't clear to me is the point of it. I know memcache's value as a cache, but this is a storage layer, not a cache.

    The cache-like behavior that memcached provides can be beneficial for some use cases. As an SSO session store, a node outage produces a very gentle failure mode: users simply have to reauthenticate if their session ID hashes to a node that is not available. Additionally it's trivially easy to drop all SSO sessions, which is a feature for some environments. In practice memcached has proved rock solid as a session store; we have not even restarted the memcached daemons since we switched backends over 6 months ago.

  3. Just responding to comments:

    I see the value in custom serialization strategies, but I don't see a win in doing that at the layer of storage vs. at the layer where knowledge of the objects being stored is. As a middle ground, I think we could add methods with a generic (<T>, SerializationStrategy<T>) signature, so that objects could be serialized using code that is still external to a storage service. Or it just happens within components that store objects that would benefit from enhanced serialization. Lots of ways to optimize the code organization without having to worry about it all in the lowest layer.
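
    For instance, something along these lines (a sketch only; the names and shapes are illustrative, not a proposal for the actual API):

    ```java
    /** Illustrative only: a pluggable serializer that keeps the storage layer string-based. */
    public interface SerializationStrategy<T> {
        String serialize(T instance);
        T deserialize(String data);
    }

    /** Possible object-aware convenience methods layered on top of the string-based service. */
    public interface ObjectStorage {
        <T> boolean create(String context, String key, T value,
                SerializationStrategy<T> strategy, Long expiration);
        <T> T read(String context, String key, SerializationStrategy<T> strategy);
    }
    ```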

    Regarding cookies, I'm afraid nothing helps. There's nothing we could do short of magic that would fit what we would need for logout into 8192 bytes. Until Safari and Opera browsers stop limiting cookies so aggressively, we're stuck.

  4. One point I don't see addressed explicitly here is the behavior v2 got by depending on Terracotta. With Terracotta instrumenting the classes at runtime, it was always able to know when an object changed and replicate it to other nodes. When I implemented a Storage Service, it was necessary to add a filter at the web container level to "touch" each session object as it changed. So anything done in v3 should have an explicit mechanism to tell the Storage Service that an object has changed.

    • Related to session persistence, it would be useful to have a mechanism to administratively 'poison' or invalidate a session(s) by user identifier, e.g. in the event of a compromise, optionally performing single sign out.
    • Second, I've been working recently with CAS and Ehcache. A couple of benefits of Ehcache over memcached I see so far are: (1) Ehcache does not require a separate daemon to configure/run; (2) the cache(s) can spill over to disk for memory-limited systems, or for that matter be fully persisted to disk, optionally across restarts; (3) Ehcache has an optional bootstrapping mechanism to pull e.g. session state from its peers, i.e. when a node goes down no session state needs to be lost.
    • Been testing recently up to 14 TPS on a pair of CAS nodes and have not seen issues with Ehcache replication being able to keep up (Java RMI). That said, replication is done "in the clear" so any cross data center replication would need to be protected (VPN, SSL tunnel, ...).
  5. Anonymous

    From an outsiders point-of-view, I'm confused why you don't appear to have considered using an established NoSQL or distributed key-value store solution.

    But then again, something a bit more service-like (an external process rather than inside the JVM) with a robustly designed API might be useful:

    • more robust expiration of data on restart/crash of the IdP
    • more tunable behaviour (if desired) for applications spanning global distances (bring the IdP closer to users at different sites, global redundancy)
    • if solution N turns stale/unsupported, then it is easier to move to solution N+1 (such as what happened with Terracotta)
    • reduced dependency/lock-in to JVM provider or version (so you don't have to wait to move to Java 7, or migrate from Oracle Java 6 to OpenJDK, for example)
    • reduced risk when tuning/changing JVM parameters.

    If I had a vote, I would suggest that something like Redis (try http://try.redis.io/), or MongoDB (try http://try.mongodb.org/) – I'm not positioned to recommend one over the other – would perform admirably well, but never having deployed such a thing, I can't say what the overheads/negatives would be.

    With a suitably well-defined API for the Persistence Layer, it should be fairly easy (for someone with no knowledge of IdP internal architecture) to implement new plugins for different persistence layers as requirements / support-options change.

  6. Bother, I got logged out before I posted my previous comment.

    I might also mention that something like an external solution would also provide:

    • a place to write logs (making it easier to centralize logs) – or at least statistics.
    • a place to write data for other custom development / monitoring... such as for uApprove, shared tokens
    • if persistence is important, then an external solution would also provide better tools for backup

    Also, memcacheDB (as opposed to memcached), might also be a useful option. It is explicitly not a cache:

    MemcacheDB is a distributed key-value storage system designed for persistence. It is NOT a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval. It conforms to the memcache protocol (not completely, see below), so any memcached client can have connectivity with it. MemcacheDB uses Berkeley DB as a storing backend, so lots of features including transaction and replication are supported.

  7. Responding to the comments above, we're not going to pick any solution precisely because there is never, ever, going to be anything that most people will deploy. Anything we pick will become "the reason people can't use Shibboleth". So the only defaults have to be in-memory and perhaps client-side where feasible.

    Our goal is to avoid storage semantics that will make using good solutions impossible, but also avoid imposing horrible burdens on our code because of a desire to allow for some particular solution that's popular but not necessarily fit for purpose. We may well not be exactly at the right midpoint on this; I know that the multi-part key approach definitely has issues with some options.

    Regarding Paul's point, 100% certainty: there will be no "implicit" behavior wrt object changes. If something needs to be updated in the storage layer, there will be an update operation against the API, period. That goes for Sessions, and anything else.