dbf files and indexes [Python]

Prev: matplotlib: show xticks only for discrete times
Next: Tryton 1.6 series is out

From: Ethan Furman on 27 May 2010 15:45

Let's say I have two tables:

CatLovers DogLovers
------------------- -------------------
| name | age | | name | age |
|-----------------| |-----------------|
| Allen | 42 | | Alexis | 7 |
| Jerod | 29 | | Michael | 21 |
| Samuel | 17 | | Samuel | 17 |
| Nickalaus | 55 | | Lawrence | 63 |
| Frederick | 34 | | Frederick | 34 |
------------------- -------------------

NumberOfPets
---------------------------
| name | cats | dogs |
---------------------------
| Allen | 2 | 0 |
| Alexis | 0 | 3 |
| Michael | 0 | 1 |
| Samuel | 1 | 2 |
| Jerod | 3 | 0 |
| Nickalaus | 5 | 0 |
| Lawrence | 0 | 1 |
| Frederick | 3 | 2 |
---------------------------

(I know, I know -- coming up with examples has never been my strong
point. ;)

catlovers = dbf.Table('CatLovers')
doglovers = dbf.Table('DogLovers')
petcount = dbf.Table('NumberOfPets')

For the sake of this highly contrived example, let's say I'm printing a
report that I would like in alphabetical order of those who love both
cats and dogs...

def names(record):
return record.name

c_idx = catlovers.create_index(key=names)
d_idx = doglovers.create_index(key=names)
p_idx = petcount.create_index(key=names)

# method 1
for record in c_idx:
if record in d_idx:
print record.name, record.age, \
p_idx[record].cats, p_idx[record].dogs

*or*

# method 2
for record in c_idx:
if d_idx.key(record) in d_idx: # or if names(record) in d_idx:
print record.name, record.age \
p_idx[record].cats, p_idx[record].dogs

Which is better (referring to the _in_ statement)? Part of the issue
revolves around the question of is _any_ record in the CatLovers table
really in the DogLovers index? Obviously no -- so if you are asking the
question in code you are really asking if a record from CatLovers has a
matching key value in DogLovers, which means either the __contains__
code can apply the key function to the record (implicit, as in method 1
above) or the calling code can do it (explicit, as in method 2 above).

I'm leaning towards method 1, even though the key function is then
called behind the scenes, because I think it makes the calling code cleaner.

Opinions?

~Ethan~

From: D'Arcy J.M. Cain on 27 May 2010 16:04

On Thu, 27 May 2010 12:45:58 -0700
Ethan Furman <ethan(a)stoneleaf.us> wrote:
> Let's say I have two tables:
>
> CatLovers DogLovers
> ------------------- -------------------
> | name | age | | name | age |
> |-----------------| |-----------------|
[...]
>
> NumberOfPets
> ---------------------------
> | name | cats | dogs |
> ---------------------------
[...]

First problem is learning to count. :-)

Second (first real) problem is that you database is not normalized. If
all of the cat lovers and dog lovers are in NumberOfPets then move the
age into that. Probably should rename it as well.

Finally, are these SQL databases? The best way of getting information
is with SQL.

SELECT * FROM NumberOfPets
WHERE name IN (SELECT name FROM CatLovers) OR
name IN (SELECT name FROM DogLovers)
ORDER BY name;

> catlovers = dbf.Table('CatLovers')
> doglovers = dbf.Table('DogLovers')
> petcount = dbf.Table('NumberOfPets')

I guess you should tell us what dbf is. It doesn't seem to be a
standard module and it doesn't look like DB-API. It's hard to answer
your question without knowing what these functions do.

--
D'Arcy J.M. Cain <darcy(a)druid.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.

From: Ethan Furman on 27 May 2010 16:06

Ethan Furman wrote:
> Let's say I have two tables:

Okay, let's say I have three tables instead. ;p

From: Ethan Furman on 27 May 2010 16:33

D'Arcy J.M. Cain wrote:
> On Thu, 27 May 2010 12:45:58 -0700
> Ethan Furman <ethan(a)stoneleaf.us> wrote:
>> Let's say I have two tables:
>>
>> CatLovers DogLovers
>> ------------------- -------------------
>> | name | age | | name | age |
>> |-----------------| |-----------------|
> [...]
>> NumberOfPets
>> ---------------------------
>> | name | cats | dogs |
>> ---------------------------
> [...]
>
> First problem is learning to count. :-)

Heh -- like I said, I'm terrible at making examples. Originally it was
two tables, then I threw in a third so I could show using an index as a
dictionary.

> Second (first real) problem is that you database is not normalized.
> If all of the cat lovers and dog lovers are in NumberOfPets then move
> the age into that. Probably should rename it as well.

The actual tables that I would be using are different versions of
mailing tables, with one table not having as many records as the other,
and wanting to do something with the records that are in common between
the two.

> Finally, are these SQL databases? The best way of getting information
> is with SQL.

The format of these tables is either dBase III or Visual FoxPro 6, and
dbf is a module I have coded (still working on SQL support for it...
getting closer!). A slightly out-of-date version of it is available on
PyPI.

My question centers around the __contains__ method of the Index object
that I am writing, and whether it should call the key function of the
index when passed a record, or if I should make the calling code do that.

Hope this helps.

~Ethan~

From: Christian Heimes on 27 May 2010 16:27

> Finally, are these SQL databases? The best way of getting information
> is with SQL.
>
> SELECT * FROM NumberOfPets
> WHERE name IN (SELECT name FROM CatLovers) OR
> name IN (SELECT name FROM DogLovers)
> ORDER BY name;

A good way is to use SQL with JOINs instead of horrible nested selects.
Although SQL is declarative, you shouldn't make the work of the query
optimizer so hard.

Christian

| Next | Last
Pages: 1 2
Prev: matplotlib: show xticks only for discrete times
Next: Tryton 1.6 series is out