A Summary of Applying Access Control to Search
One of the issues with providing search over corporate knowledge is that access control is in full effect: you can’t simply index everything available, because each user can see a different set of data. It also seems like everyone writes their own access control system (including yours truly), which complicates matters. I’ll give an overview of some of the interesting work going on and then list some ideas for implementation in open source.
My Summary:
The critical issue is whether to integrate your access control system into your indexing process, or to modularize it into its own component and provide an interface to the search application. If you modularize, you end up making a round-trip check to the security module for every result as you iterate over the result set, which can take forever. If you integrate, then you have to inflate your indexes considerably, which probably doesn’t scale too well. Not to mention that you now probably have a mini (or full) access control system running in parallel to your main access control system, which you must maintain and keep in sync.
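To make the "integrate" option a bit more concrete, here is a minimal sketch in Python. It assumes a toy in-memory inverted index where every document carries the set of groups allowed to read it. The class and method names (`AclIndex`, `add`, `search`) are made up for illustration and don't come from any real search library.

```python
from collections import defaultdict


class AclIndex:
    """Toy inverted index that stores an ACL alongside every document."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> doc ids containing it
        self.allowed = {}                  # doc id -> groups that may read it

    def add(self, doc_id, text, groups):
        # The ACL is copied into the index at indexing time. This copy is
        # the "parallel access control system" that has to be kept in sync.
        self.allowed[doc_id] = set(groups)
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term, user_groups):
        # Filter against the stored ACLs; no round trip to a security module.
        user_groups = set(user_groups)
        hits = self.postings.get(term.lower(), set())
        return sorted(d for d in hits if self.allowed[d] & user_groups)


index = AclIndex()
index.add("memo-1", "quarterly budget forecast", ["finance"])
index.add("memo-2", "budget for the team offsite", ["finance", "staff"])
index.add("memo-3", "cafeteria menu", ["staff"])

print(index.search("budget", ["staff"]))    # ['memo-2']
print(index.search("budget", ["finance"]))  # ['memo-1', 'memo-2']
```

The query is fast because the check never leaves the index, but every permission change now means touching the index too, which is exactly the maintenance cost described above.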
Either way, it seems like a tricky problem. Here are my thoughts about practical implementation:
- If you really want to keep the modularization, it might be possible to create some sort of batch access control check. Instead of iterating one check at a time, bundle up a chunk of your result set, send it over the wire to the access control system, and get back a matrix of results. Might work a little better and would probably incur less network overhead, even though it’s still the primitive solution.
- If your set of credentials and content is manageable (and if you don’t mind being a jackass), you can try an unscalable solution of performing an exhaustive indexing operation at scheduled intervals for each credential over the entire set of content, cached at the search application level. This is also a primitive solution but probably would result in fast queries.
- If your access control system does caching, that will help, but the first time is still going to be quite a nasty hit, and why would you search twice on the same terms in the same session?
- This is kind of a stupid idea, but what if you could decouple the search application from your standard web idea of a search application, and treat it more like a P2P network search? In P2P applications, you enter terms, hit search, and alt-tab or go away and come back when it’s had more time to look around. This probably isn’t acceptable in web application UX unless the user understands that secure web applications with private content require special handling. Good luck on that one. If they’re in their browser, they probably expect Google.
- For a more metadata-oriented access control solution, it might be possible to run and maintain multiple indexes, partitioned by metadata property, that basically consider themselves static content sets. Then, when you search over a user credential, you can leverage parallel checks to multiple indexes for each metadata property the user has access to. Then, you’ve probably got some set mathematics to perform on the parallel result sets that are returned, based upon the relationships between the metadata. This is some limited integration with the access control system, and is a pretty heinous idea, but it might work better in heavy or complex data sets if the processing power is available.
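The partitioned-index idea in the last bullet might look roughly like the sketch below. Everything in it is an assumption for illustration: the property names, the tiny hard-coded partitions, and the `search_partition` and `search` helpers. It also assumes the simplest possible relationship between properties, where each one is an independent grant, so the set math is just a union.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: one small index per metadata property,
# each mapping a search term to the doc ids filed under that property.
PARTITIONS = {
    "dept:finance": {"budget": {"memo-1", "memo-2"}},
    "dept:hr": {"budget": {"memo-4"}, "policy": {"memo-5"}},
    "level:public": {"budget": {"memo-2"}, "menu": {"memo-3"}},
}


def search_partition(prop, term):
    """Look up one term in one partition; missing terms yield no hits."""
    return PARTITIONS[prop].get(term, set())


def search(term, user_props):
    """Query every partition the user may see in parallel, then merge."""
    props = [p for p in user_props if p in PARTITIONS]
    if not props:
        return []
    with ThreadPoolExecutor(max_workers=len(props)) as pool:
        results = list(pool.map(lambda p: search_partition(p, term), props))
    # Assumed relationship: properties are independent grants, so the
    # merge is a plain union. Other relationships would need intersection
    # or difference here instead.
    return sorted(set().union(*results))


print(search("budget", ["dept:finance", "level:public"]))  # ['memo-1', 'memo-2']
print(search("budget", ["level:public"]))                  # ['memo-2']
```

The search never touches a partition the user has no property for, so the access check happens by choosing which indexes to query rather than by filtering afterwards.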
Research References:
I’ve found very few providers or researchers doing work in this area. It even seems like nobody has coined a proper term for this kind of functionality in search, so I list some terms you can google for at the bottom.
Netegrity / Inktomi - SiteMinder
Netegrity, in collaboration with Inktomi, has apparently abstracted RBAC out into Netegrity’s SiteMinder software. It connects to LDAP on the user management side, and integrates with Inktomi’s Enterprise Search Security Module to basically do a last-step check on each search result returned. It’s the primitive solution, and it has a host of performance issues that come with abstracting the permissions system out of the search component. This is apparently the only commercial solution to the problem that I could find. They even say they’re the only vendor inside their PDF! If it’s in the PDF, it must be true.
- If your search wants 100 results, do you just use that as a parameter for the initial grab of results? Or do you use that as a goal, and continue checking results until it all adds up to 100?
http://www.netegrity.com/partners/related/InktomiDatasheet.pdf
XenIntranet
There’s a reference in the changelogs for XenIntranet to adding access control to search. It looks like they use a custom ACL solution, and probably integrated it directly. See comments far below.
http://www.xenintranet.com/changelogs.php
Stanford Peers
This paper from Stanford people Mayank Bawa, Roberto Bayardo Jr., and Rakesh Agrawal describes a Privacy-Preserving Index. It also complains about the lack of private information search technology, but the solution it posits seems to be more about preventing reverse engineering of data availability through special algorithms for building distributed indexes. The PowerPoint below has animations describing the techniques.
*** Update - Mayank Bawa was kind enough to write me and point me to the original PowerPoint slides for the presentation, so I changed the link below and removed my snarky comment about Stanford (full disclosure: I went to USC). Thanks, Mayank!
http://www-db.stanford.edu/~bawa/Pub/ppi.ppt
They do list a couple of interesting links.
The Stanford Peers P2P homepage. It’s interesting that resource discovery over P2P networks may have a lot to do with access-controlled search. This page lists a lot of resources for reading on P2P network topics, but it’s a little stuffy in there.
IBM’s YouServ, a distributed personal webserver in use within IBM for web hosting / file sharing.
Chris Weider
This early paper (’96!) from Chris Weider seems to touch briefly upon some of these issues in the second-to-last paragraph. It seems to be more concerned with exposing for-pay content to public users via normal search tools. One solution it describes is an indexing proxy, which would index for-pay content and expose it only through search tools. Think of A9 searching through book content for keywords.
http://www.isoc.org/isoc/whatis/conferences/inet/96/proceedings/a2/a2_1.htm
MIT Computation Structure Group
This bunch of people from MIT seems to be barking up the right tree. However, they like to use a bunch of big words. I think that when you’re dealing with a subject as complex as access control mapped onto search, you need to give your reader a bit of a break when it comes to academic huffing and puffing. Anywho, to summarize, they also complain about the performance implications of the approach Netegrity and Inktomi adopted, which completely modularizes access control, and they are working on integrating ACLs into the Intentional Naming System.
http://www.csg.lcs.mit.edu/pubs/reports/search3.pdf, referenced in the MIT Computation Structure Group’s Search Project
If you want to do more research
Here are some of the terms I searched with that turned up goodies:
“permission-based search” “access-controlled search”
Possible outlets for implementation
- Integrating a Lucene port with a standardized access control system - I might end up doing this with a customized access control system.
Anyway, if you’re researching or doing development on anything like these things, I would love to hear from you. gluk AT padtie dot com.