Inverted tables: an alternative to relational structures

articles: 

The inverted table format can deliver fast and flexible query capabilities, but is not widely used. ADABAS is probably the most successful implementation, but how often do you see that nowadays? Following is a description of how to implement inverted structures within a relational database. All code run on Oracle Database 12c, release 12.1.0.1.

Consider this table and a few rows, that describe the contents of my larder:

create table food(id number,capacity varchar2(10),container varchar2(10),item varchar2(10));
insert into food values(1,'large','bag','potatoes');
insert into food values(2,'small','box','carrots');
insert into food values(3,'medium','tin','peas');
insert into food values(4,'large','box','potatoes');
insert into food values(5,'small','tin','carrots');
insert into food values(6,'medium','bag','peas');
insert into food values(7,'large','tin','potatoes');
insert into food values(8,'small','bag','carrots');
insert into food values(9,'medium','box','peas');

The queries I run against the table might be "how many large boxes have I?" or "give me all the potatoes, I don't care about how they are packed". The idea is that I do not know in advance what columns I will be using in my predicate: it could be any combination. This is a common issue in a data warehouse.
So how do I index the table to satisfy any possible query? Two obvious possibilities:
First, build an index on each column, and the optimizer can perform an index_combine operation on whatever columns happen to be listed in the predicate. But that means indexing every column - and the table might have hundreds of columns. No way can I do that.
Second, build a concatenated index across all the columns: in effect, use an IOT. That will give me range scan access if any of the predicated columns are in the leading edge of the index key followed by filtering on the rest of the predicate. Or if the predicate does not include the leading column(s), I can get skip scan access and filter. But this is pretty useless, too: there will be wildly divergent performance depending on the predicate.
The answer is to invert the table:
create table inverted(colname varchar2(10),colvalue varchar2(10),id number);
insert into inverted select 'capacity',capacity,id from food;
insert into inverted select 'container',container,id from food;
insert into inverted select 'item',item,id from food;

Now just one index on each table can satisfy all queries:
create index food_i on food(id);
create index inverted_i on inverted(colname,colvalue);

To retrieve all the large boxes:
orclz> set autotrace on explain
orclz> select * from food where id in
  2  (select id from inverted where colname='capacity' and colvalue='large'
  3  intersect
  4  select id from inverted where colname='container' and colvalue='box');

        ID CAPACITY   CONTAINER  ITEM
---------- ---------- ---------- ----------
         4 large      box        potatoes


Execution Plan
----------------------------------------------------------
Plan hash value: 1945359172

---------------------------------------------------------------------------------
| Id  | Operation                                | Name       | Rows  | Bytes | C
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |            |     3 |   141 |
|   1 |  MERGE JOIN                              |            |     3 |   141 |
|   2 |   TABLE ACCESS BY INDEX ROWID            | FOOD       |     9 |   306 |
|   3 |    INDEX FULL SCAN                       | FOOD_I     |     9 |       |
|*  4 |   SORT JOIN                              |            |     3 |    39 |
|   5 |    VIEW                                  | VW_NSO_1   |     3 |    39 |
|   6 |     INTERSECTION                         |            |       |       |
|   7 |      SORT UNIQUE                         |            |     3 |    81 |
|   8 |       TABLE ACCESS BY INDEX ROWID BATCHED| INVERTED   |     3 |    81 |
|*  9 |        INDEX RANGE SCAN                  | INVERTED_I |     3 |       |
|  10 |      SORT UNIQUE                         |            |     3 |    81 |
|  11 |       TABLE ACCESS BY INDEX ROWID BATCHED| INVERTED   |     3 |    81 |
|* 12 |        INDEX RANGE SCAN                  | INVERTED_I |     3 |       |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("ID"="ID")
       filter("ID"="ID")
   9 - access("COLNAME"='capacity' AND "COLVALUE"='large')
  12 - access("COLNAME"='container' AND "COLVALUE"='box')

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

orclz>

Or all the potatoes:
orclz> select * from food where id in
  2  (select id from inverted where colname='item' and colvalue='potatoes');

        ID CAPACITY   CONTAINER  ITEM
---------- ---------- ---------- ----------
         1 large      bag        potatoes
         4 large      box        potatoes
         7 large      tin        potatoes


Execution Plan
----------------------------------------------------------
Plan hash value: 762525239

---------------------------------------------------------------------------------
| Id  | Operation                              | Name       | Rows  | Bytes | Cos
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |            |     3 |   183 |
|   1 |  NESTED LOOPS                          |            |       |       |
|   2 |   NESTED LOOPS                         |            |     3 |   183 |
|   3 |    SORT UNIQUE                         |            |     3 |    81 |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| INVERTED   |     3 |    81 |
|*  5 |      INDEX RANGE SCAN                  | INVERTED_I |     3 |       |
|*  6 |    INDEX RANGE SCAN                    | FOOD_I     |     1 |       |
|   7 |   TABLE ACCESS BY INDEX ROWID          | FOOD       |     1 |    34 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("COLNAME"='item' AND "COLVALUE"='potatoes')
   6 - access("ID"="ID")

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - this is an adaptive plan

orclz>

Of course, consideration needs to be given to handling more complex boolean expressions; maintaining the inversion is going to take resources; and a query generator has to construct the inversion code and re-write the queries. But In principle, this structure can deliver indexed access for unpredictable predicates of any number of any columns, with no separate filtering operation. Can you do that with a normalized star schema? I don't think so.
I hope this little thought experiment has stimulated the little grey cells, and made the point that relational structures are not always optimal for all problems.
--
John Watson
Oracle Certified Master DBA
http://skillbuilders.com

Comments

It would be great if Barbara from the OraFAQ forum took a look at this. She is an expert on Oracle's IndexTypes, and I suspect that you could create an Index Type to do this for you seamlessly. Ie. When you modify data in FOOD, the index structure (which is actually another table) is updated atomatically. Also, you wouldn't have to write your queries against the INVERTED table, Oracle could translate queries against FOOD into equivalent queries against the IndexType, but you may have to use one of those funny predicates (CONTAINS()?)

I have never seen Barbara comment on articles on this part of the site. John, it might be worthwhile sending her a PM to look at this page - she might have an elegant hybrid solution.

My thoughts on:
>>But that means indexing every column - and the table might have hundreds of columns. No way can I do that.

I would keep an open mind on this. Maintaining the INVERTED structure will probably generate just as much (or more?) IO. If the FOOD table is suitable for Bitmap indexing (no concurrent DML, no row-by-row DML), then I reckon a Bitmap index on every column would make for more efficient SELECTs and more efficient DML - that's just my gut feel though - I have no proof.

That said, there are plenty of times when Bitmap indexes are not an option. I'm filing the idea away for the future.

John, thanks for sending the PM and Ross, thanks for suggesting it. This can be accomplished using Oracle Text features: multi_column_datastore, section group, context index, and queries using contains and within. Please see the demonstration below. I have put the insert statement just before the query, in order to demonstrate that the index is updated upon commit. The multi_column_datastore does the same thing for one table as the user_datastore and procedure can do for multiple tables. Behind the scenes, they create an xml type document with tags that correspond to the columns or anything else that you wish to use. When you create section groups, you can then search within those sections. The auto_section_group does this using the existing tag names. This is just the simplest counterpart to the inverted table method that you posted. There are many more Oracle Text options that can be combined with this.

SCOTT@orcl12c> -- table provided by John Watson:
SCOTT@orcl12c> create table food(id number,capacity varchar2(10),container varchar2(10),item varchar2(10))
  2  /

Table created.

SCOTT@orcl12c> -- Oracle Text multi_column_datastore:
SCOTT@orcl12c> BEGIN
  2    CTX_DDL.CREATE_PREFERENCE ('test_mcds', 'MULTI_COLUMN_DATASTORE');
  3    CTX_DDL.SET_ATTRIBUTE	 ('test_mcds', 'COLUMNS', 'capacity, container, item');
  4  END;
  5  /

PL/SQL procedure successfully completed.

SCOTT@orcl12c> -- optional dummy column for indexing:
SCOTT@orcl12c> ALTER TABLE food ADD (any_column VARCHAR2(1))
  2  /

Table altered.

SCOTT@orcl12c> -- Oracle Text Context index that uses multi_column_datastore and section group:
SCOTT@orcl12c> CREATE INDEX test_idx ON food (any_column)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  PARAMETERS
  4    ('DATASTORE	test_mcds
  5  	 SECTION GROUP	CTXSYS.AUTO_SECTION_GROUP
  6  	 SYNC		(ON COMMIT)')
  7  /

Index created.

SCOTT@orcl12c> -- data provided by John Watson:
SCOTT@orcl12c> insert all
  2  into food (id, capacity, container, item) values(1,'large','bag','potatoes')
  3  into food (id, capacity, container, item) values(2,'small','box','carrots')
  4  into food (id, capacity, container, item) values(3,'medium','tin','peas')
  5  into food (id, capacity, container, item) values(4,'large','box','potatoes')
  6  into food (id, capacity, container, item) values(5,'small','tin','carrots')
  7  into food (id, capacity, container, item) values(6,'medium','bag','peas')
  8  into food (id, capacity, container, item) values(7,'large','tin','potatoes')
  9  into food (id, capacity, container, item) values(8,'small','bag','carrots')
 10  into food (id, capacity, container, item) values(9,'medium','box','peas')
 11  select * from dual
 12  /

9 rows created.

SCOTT@orcl12c> COMMIT
  2  /

Commit complete.

SCOTT@orcl12c> -- queries:
SCOTT@orcl12c> SET AUTOTRACE ON EXPLAIN
SCOTT@orcl12c> -- to retrieve all large boxes:
SCOTT@orcl12c> SELECT * FROM food
  2  WHERE  CONTAINS
  3  	      (any_column,
  4  	       'large WITHIN capacity AND
  5  		box WITHIN container') > 0
  6  /

        ID CAPACITY   CONTAINER  ITEM       A
---------- ---------- ---------- ---------- -
         4 large      box        potatoes

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3055600298

----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |     1 |    48 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| FOOD     |     1 |    48 |     4   (0)| 00:00:01 |
|*  2 |   DOMAIN INDEX              | TEST_IDX |       |       |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("CTXSYS"."CONTAINS"("ANY_COLUMN",'large WITHIN capacity AND
                   box WITHIN container')>0)

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

SCOTT@orcl12c> -- all potatoes:
SCOTT@orcl12c> SELECT * FROM food
  2  WHERE  CONTAINS
  3  	      (any_column,
  4  	       'potatoes WITHIN item') > 0
  5  /

        ID CAPACITY   CONTAINER  ITEM       A
---------- ---------- ---------- ---------- -
         1 large      bag        potatoes
         4 large      box        potatoes
         7 large      tin        potatoes

3 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3055600298

----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |     1 |    48 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| FOOD     |     1 |    48 |     4   (0)| 00:00:01 |
|*  2 |   DOMAIN INDEX              | TEST_IDX |       |       |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("CTXSYS"."CONTAINS"("ANY_COLUMN",'potatoes WITHIN item')>0)

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

SCOTT@orcl12c>

I've been looking at Barbara's solution to my how-to-index-every-column problem, and trying to do little reverse engineering. I think that Oracle Text is in fact doing an inversion. Looking at the tables created by Text, it is storing every literal value in a manner that makes multi-column predicate searches indexable. And of course it all works with minimal manual intervention. I suppose it was unlikely that I would have invented something that the guys at Redwood Shores hadn't thought of. Thank you for the example, definitely worth remembering.

I do agree that an inverted table can bring a solution to some situations where an application allow many or all columns of a table to be used as a key to look up data. At first sight, the more columns the more tempting the solution might look. In some specific situations the solution might work, but I'd like to warn that in many situations, in particular those with many columns in the "un-inverted" table, it won't bring the expected performance gains. I'll try to explain why.

The fact that after reverting your table all your columns are indexed doesn't mean that the Oracle optimizer will produce an efficient access path. It has been written several times in other articles: the Oracle optimizer does a pretty damm good job IF it can predict the selectivity/cardinality of the query predicates accurately. Unfortunately in almost all cases the accuracy of the histogram on a unified "colvalue" is problematic. For example if the original table has 40 columns and 10 of those are skew, how much of that skewness will be visible in the histogram of a single unified "colvalue"? In addition to the original skewness a new skewnees of another magnitude is added, because of the differences of the NDV of the original columns. Also the statistics like the NDV and min/mix values of the original columns induvidually mostly is lost. 12c might have some enhancements regarding histograms but there is no such thing like magic.

What about where predicates involving multiple columns of the "un-inverted" table? We all know that the data in individual columns of many tables corrolates in one way or another. As a result the cardinality of 2 columns combined can be (a lot) higher than the theoretical value assumed by the optimizer. As John has explained in one of his many great articles, extended statistics come to our rescue here, but only if the table has not been inverted. It is very diffucult to solve this kind of problems for inverted tables, so far I could not find a solutions that is transparant for the applications. Maybe John or someone else has found a sollution for this problem?

Conclusion: an inverted table might solve the overhead problems of maintaining too many indexes on a table, but the query performance might be (considerably) worse than if you would create a limited set of carefull chosen indexes on an "un-inverted" table. Watch out with inverted tables if you have skew data and/or many columns with rather different NDVs.

Note: you can (and always should consider) create extended statistics for the "colname"/"colvalue" column combination on an inverted table. It gives the optimizer at least a clue of the lost NDV statistics of the individual columns of the orignal table.

Thank you for your comment, Marc. I hadn't actually considered performance.

Definitely, extended stats on inverted(colname,colvalue), in case the index stats aren't good enough. Perhaps also 1024 hash partitions on colname (to match the maximum number of possible columns) and a local (bitmap?) index would give the CBO some nice possibilities if the data is skewed, and some good cardinality estimates. Raising the number of histogram buckets to the max of 2048 and of course the new histogram types would help, too.

I do know that tuning Oracle Text implementations requires an expert. You have probably highlighted why.
--
John Watson
Oracle Certified Master DBA
http://skillbuilders.com