Showing posts with label schema. Show all posts
Showing posts with label schema. Show all posts

Friday, March 30, 2012

How to simplify XML Query? (SQL2000)

Hi, I am trying to write a query in SQL 2000 to produce a slightly more than
just basic xml schema. I can do the query using FOR XML EXPLICIT, but as the
schema that I want to produce is quite large (with a reasonable amount of
repitition) this seems unnecessarily complicated. A simplified version of the
schema I want to produce is:
<ArrayOfEntity>
<Entity Key=123>
<Name>...</Name>
<Prop1>...</Prop1>
<Prop2>...</Prop2>
<Prop3>...</Prop3>
<Node1 Key=32>text</Node1>
<Node2 Key=43>text</Node2>
<Node3 Key=54>text</Node3>
<Node4 Key=65>text</Node4>
</Entity>
<Entity Key=234>
...
</Entity>
</ArrayOfEntity>
The thing is, there are quite a few of these child "Node"s, which will only
contain a key and text value. There are also more complicated child nodes
(not shown here).
I was quite excited to find the [Element!Num!Attr!xml] explicit column name,
because I could then write a function to generate the child node. But this
does not seem to allow me to specify the Node Key's!! :(
Is there any way to write a reasonably simple query to get the above schema'
Thanks
GregHi Greg
You will need to use the FOR XML EXPLICIT option if you wish to retrieve the
data in this format. A different approach you be to apply a transform after
the data has been extracted. You could use XSLT to do this, possibly use the
document()
function to include and export the data in fragments.
http://groups.google.com/group/microsoft.public.xml/msg/24b2c92baf8fbde6?hl=en&
John
"Greg Bacchus" wrote:
> Hi, I am trying to write a query in SQL 2000 to produce a slightly more than
> just basic xml schema. I can do the query using FOR XML EXPLICIT, but as the
> schema that I want to produce is quite large (with a reasonable amount of
> repitition) this seems unnecessarily complicated. A simplified version of the
> schema I want to produce is:
> <ArrayOfEntity>
> <Entity Key=123>
> <Name>...</Name>
> <Prop1>...</Prop1>
> <Prop2>...</Prop2>
> <Prop3>...</Prop3>
> <Node1 Key=32>text</Node1>
> <Node2 Key=43>text</Node2>
> <Node3 Key=54>text</Node3>
> <Node4 Key=65>text</Node4>
> </Entity>
> <Entity Key=234>
> ...
> </Entity>
> </ArrayOfEntity>
> The thing is, there are quite a few of these child "Node"s, which will only
> contain a key and text value. There are also more complicated child nodes
> (not shown here).
> I was quite excited to find the [Element!Num!Attr!xml] explicit column name,
> because I could then write a function to generate the child node. But this
> does not seem to allow me to specify the Node Key's!! :(
> Is there any way to write a reasonably simple query to get the above schema'
> Thanks
> Greg|||Thanks for your reply John.
Can you do XSL Transforms in SQL 2000?
Greg
"John Bell" wrote:
> Hi Greg
> You will need to use the FOR XML EXPLICIT option if you wish to retrieve the
> data in this format. A different approach you be to apply a transform after
> the data has been extracted. You could use XSLT to do this, possibly use the
> document()
> function to include and export the data in fragments.
> http://groups.google.com/group/microsoft.public.xml/msg/24b2c92baf8fbde6?hl=en&
> John
> "Greg Bacchus" wrote:
> > Hi, I am trying to write a query in SQL 2000 to produce a slightly more than
> > just basic xml schema. I can do the query using FOR XML EXPLICIT, but as the
> > schema that I want to produce is quite large (with a reasonable amount of
> > repitition) this seems unnecessarily complicated. A simplified version of the
> > schema I want to produce is:
> >
> > <ArrayOfEntity>
> > <Entity Key=123>
> > <Name>...</Name>
> > <Prop1>...</Prop1>
> > <Prop2>...</Prop2>
> > <Prop3>...</Prop3>
> > <Node1 Key=32>text</Node1>
> > <Node2 Key=43>text</Node2>
> > <Node3 Key=54>text</Node3>
> > <Node4 Key=65>text</Node4>
> > </Entity>
> > <Entity Key=234>
> > ...
> > </Entity>
> > </ArrayOfEntity>
> >
> > The thing is, there are quite a few of these child "Node"s, which will only
> > contain a key and text value. There are also more complicated child nodes
> > (not shown here).
> >
> > I was quite excited to find the [Element!Num!Attr!xml] explicit column name,
> > because I could then write a function to generate the child node. But this
> > does not seem to allow me to specify the Node Key's!! :(
> >
> > Is there any way to write a reasonably simple query to get the above schema'
> > Thanks
> > Greg|||Hi Greg
It should be possible to do this using the SP_OA... procedures but really I
would expect it to be done externally.
You may want to look at SQL2005 as this could all be done a lot easier using
.NET
John
"Greg Bacchus" wrote:
> Thanks for your reply John.
> Can you do XSL Transforms in SQL 2000?
> Greg
> "John Bell" wrote:
> > Hi Greg
> >
> > You will need to use the FOR XML EXPLICIT option if you wish to retrieve the
> > data in this format. A different approach you be to apply a transform after
> > the data has been extracted. You could use XSLT to do this, possibly use the
> > document()
> > function to include and export the data in fragments.
> >
> > http://groups.google.com/group/microsoft.public.xml/msg/24b2c92baf8fbde6?hl=en&
> >
> > John
> >
> > "Greg Bacchus" wrote:
> >
> > > Hi, I am trying to write a query in SQL 2000 to produce a slightly more than
> > > just basic xml schema. I can do the query using FOR XML EXPLICIT, but as the
> > > schema that I want to produce is quite large (with a reasonable amount of
> > > repitition) this seems unnecessarily complicated. A simplified version of the
> > > schema I want to produce is:
> > >
> > > <ArrayOfEntity>
> > > <Entity Key=123>
> > > <Name>...</Name>
> > > <Prop1>...</Prop1>
> > > <Prop2>...</Prop2>
> > > <Prop3>...</Prop3>
> > > <Node1 Key=32>text</Node1>
> > > <Node2 Key=43>text</Node2>
> > > <Node3 Key=54>text</Node3>
> > > <Node4 Key=65>text</Node4>
> > > </Entity>
> > > <Entity Key=234>
> > > ...
> > > </Entity>
> > > </ArrayOfEntity>
> > >
> > > The thing is, there are quite a few of these child "Node"s, which will only
> > > contain a key and text value. There are also more complicated child nodes
> > > (not shown here).
> > >
> > > I was quite excited to find the [Element!Num!Attr!xml] explicit column name,
> > > because I could then write a function to generate the child node. But this
> > > does not seem to allow me to specify the Node Key's!! :(
> > >
> > > Is there any way to write a reasonably simple query to get the above schema'
> > > Thanks
> > > Greg

How to simplify XML Query? (SQL2000)

Hi, I am trying to write a query in SQL 2000 to produce a slightly more than
just basic xml schema. I can do the query using FOR XML EXPLICIT, but as the
schema that I want to produce is quite large (with a reasonable amount of
repitition) this seems unnecessarily complicated. A simplified version of the
schema I want to produce is:
<ArrayOfEntity>
<Entity Key=123>
<Name>...</Name>
<Prop1>...</Prop1>
<Prop2>...</Prop2>
<Prop3>...</Prop3>
<Node1 Key=32>text</Node1>
<Node2 Key=43>text</Node2>
<Node3 Key=54>text</Node3>
<Node4 Key=65>text</Node4>
</Entity>
<Entity Key=234>
...
</Entity>
</ArrayOfEntity>
The thing is, there are quite a few of these child "Node"s, which will only
contain a key and text value. There are also more complicated child nodes
(not shown here).
I was quite excited to find the [Element!Num!Attr!xml] explicit column name,
because I could then write a function to generate the child node. But this
does not seem to allow me to specify the Node Key's!!
Is there any way to write a reasonably simple query to get the above schema?
Thanks
Greg
Hi Greg
You will need to use the FOR XML EXPLICIT option if you wish to retrieve the
data in this format. A different approach you be to apply a transform after
the data has been extracted. You could use XSLT to do this, possibly use the
document()
function to include and export the data in fragments.
http://groups.google.com/group/micro...f8fbde6?hl=en&
John
"Greg Bacchus" wrote:

> Hi, I am trying to write a query in SQL 2000 to produce a slightly more than
> just basic xml schema. I can do the query using FOR XML EXPLICIT, but as the
> schema that I want to produce is quite large (with a reasonable amount of
> repitition) this seems unnecessarily complicated. A simplified version of the
> schema I want to produce is:
> <ArrayOfEntity>
> <Entity Key=123>
> <Name>...</Name>
> <Prop1>...</Prop1>
> <Prop2>...</Prop2>
> <Prop3>...</Prop3>
> <Node1 Key=32>text</Node1>
> <Node2 Key=43>text</Node2>
> <Node3 Key=54>text</Node3>
> <Node4 Key=65>text</Node4>
> </Entity>
> <Entity Key=234>
> ...
> </Entity>
> </ArrayOfEntity>
> The thing is, there are quite a few of these child "Node"s, which will only
> contain a key and text value. There are also more complicated child nodes
> (not shown here).
> I was quite excited to find the [Element!Num!Attr!xml] explicit column name,
> because I could then write a function to generate the child node. But this
> does not seem to allow me to specify the Node Key's!!
> Is there any way to write a reasonably simple query to get the above schema?
> Thanks
> Greg
|||Thanks for your reply John.
Can you do XSL Transforms in SQL 2000?
Greg
"John Bell" wrote:
[vbcol=seagreen]
> Hi Greg
> You will need to use the FOR XML EXPLICIT option if you wish to retrieve the
> data in this format. A different approach you be to apply a transform after
> the data has been extracted. You could use XSLT to do this, possibly use the
> document()
> function to include and export the data in fragments.
> http://groups.google.com/group/micro...f8fbde6?hl=en&
> John
> "Greg Bacchus" wrote:
|||Hi Greg
It should be possible to do this using the SP_OA... procedures but really I
would expect it to be done externally.
You may want to look at SQL2005 as this could all be done a lot easier using
..NET
John
"Greg Bacchus" wrote:
[vbcol=seagreen]
> Thanks for your reply John.
> Can you do XSL Transforms in SQL 2000?
> Greg
> "John Bell" wrote:

How to simplify XML Query? (SQL2000)

Hi, I am trying to write a query in SQL 2000 to produce a slightly more than
just basic xml schema. I can do the query using FOR XML EXPLICIT, but as the
schema that I want to produce is quite large (with a reasonable amount of
repitition) this seems unnecessarily complicated. A simplified version of th
e
schema I want to produce is:
<ArrayOfEntity>
<Entity Key=123>
<Name>...</Name>
<Prop1>...</Prop1>
<Prop2>...</Prop2>
<Prop3>...</Prop3>
<Node1 Key=32>text</Node1>
<Node2 Key=43>text</Node2>
<Node3 Key=54>text</Node3>
<Node4 Key=65>text</Node4>
</Entity>
<Entity Key=234>
..
</Entity>
</ArrayOfEntity>
The thing is, there are quite a few of these child "Node"s, which will only
contain a key and text value. There are also more complicated child nodes
(not shown here).
I was quite excited to find the [Element!Num!Attr!xml] explicit column n
ame,
because I could then write a function to generate the child node. But this
does not seem to allow me to specify the Node Key's!!
Is there any way to write a reasonably simple query to get the above schema?
?
Thanks
GregHi Greg
You will need to use the FOR XML EXPLICIT option if you wish to retrieve the
data in this format. A different approach you be to apply a transform after
the data has been extracted. You could use XSLT to do this, possibly use the
document()
function to include and export the data in fragments.
[url]http://groups.google.com/group/microsoft.public.xml/msg/24b2c92baf8fbde6?hl=en&[/u
rl]
John
"Greg Bacchus" wrote:

> Hi, I am trying to write a query in SQL 2000 to produce a slightly more th
an
> just basic xml schema. I can do the query using FOR XML EXPLICIT, but as t
he
> schema that I want to produce is quite large (with a reasonable amount of
> repitition) this seems unnecessarily complicated. A simplified version of
the
> schema I want to produce is:
> <ArrayOfEntity>
> <Entity Key=123>
> <Name>...</Name>
> <Prop1>...</Prop1>
> <Prop2>...</Prop2>
> <Prop3>...</Prop3>
> <Node1 Key=32>text</Node1>
> <Node2 Key=43>text</Node2>
> <Node3 Key=54>text</Node3>
> <Node4 Key=65>text</Node4>
> </Entity>
> <Entity Key=234>
> ...
> </Entity>
> </ArrayOfEntity>
> The thing is, there are quite a few of these child "Node"s, which will onl
y
> contain a key and text value. There are also more complicated child nodes
> (not shown here).
> I was quite excited to find the [Element!Num!Attr!xml] explicit column
name,
> because I could then write a function to generate the child node. But this
> does not seem to allow me to specify the Node Key's!!
> Is there any way to write a reasonably simple query to get the above schem
a'
> Thanks
> Greg|||Thanks for your reply John.
Can you do XSL Transforms in SQL 2000?
Greg
"John Bell" wrote:
[vbcol=seagreen]
> Hi Greg
> You will need to use the FOR XML EXPLICIT option if you wish to retrieve t
he
> data in this format. A different approach you be to apply a transform afte
r
> the data has been extracted. You could use XSLT to do this, possibly use t
he
> document()
> function to include and export the data in fragments.
> [url]http://groups.google.com/group/microsoft.public.xml/msg/24b2c92baf8fbde6?hl=en&[
/url]
> John
> "Greg Bacchus" wrote:
>|||Hi Greg
It should be possible to do this using the SP_OA... procedures but really I
would expect it to be done externally.
You may want to look at SQL2005 as this could all be done a lot easier using
.NET
John
"Greg Bacchus" wrote:
[vbcol=seagreen]
> Thanks for your reply John.
> Can you do XSL Transforms in SQL 2000?
> Greg
> "John Bell" wrote:
>sql

Wednesday, March 21, 2012

How to setup a many-to-many schema to favor retreival speed.

I'm am novice in the RDMS world. I would like to set up a database that will scale with a large dataset of over 1,000,000 unique records. In a nutshell, I have about a million+ documents that are stored on a file server as TIFF files. Each document has 2 or more persons (names) associated with that document like "John Smith" or "Mr. John Doe" etc.. A person can be related to many documents (depending on how many transactions they conducted over the years) and any document can be related to many people as most of the transactions were from multiple sellers and multiple buyers. The end users almost always searches for documents by using the first and last name of the person related to the document along with other document fields such as DATE and TYPE to narrow the search. I would like to design and index the database for fast searching as updates are not as time sensitive. I think I need at least 3 tables:

1) NAME_TABLE normalized into NameID(PK) NameFirst, NameMiddle and NameLast

2) DOC_TABLE normalized into DocID(PK), DocTransNum, DocFilePath, DocDate, DocType, (and other attributes)

3) NAME_DOC table to relate the above two tables in a many to many relationship.

My questions are: Where should I put the field that identifies the person as a buyer or seller as it relates to the document? John Smith can be both a buyer and seller on different documents so I don't think it belongs in the NAME table. If I have a another table that splits the names into SELLER_NAMES and BUYER_NAMES it would create some duplicate names and increase the search time right? I want to search for a document using a name and the option to specify if the person name search is limited to buyers, sellers or both. Do I need to do anything special in setting up the tables for fast searching other than indexing all the name fields and the DocID field? What indexing needs to be done on the many-to-many table if any? If fast searching on a mostly static database is a top requirement, should I denormalize the document table to include the names even though there will be duplicate names throughout? How large of a test set of records do I need to load to see any real deltas in performance as I modify and tweak?

Sorry for all the questions but I believe they are related to the overall problem to solve.

Regards,

DeBug

1) Where to indicate buyer versus seller? It seems the most appropriate place is in the Name_Doc table. It is unlikely that a person is both a buyer and a seller on a given document. If they could be, then you could have two bit fields isBuyer and isSeller, both on the Name_Doc table. Queries would be straightforward with this arrangement. Do not split the Name table into buyers and sellers. You are correct that it will lead to duplication and does not model the real problem.

2) How should you index? Use the performance wizard to aid in indexing. More indexes are not necessarily better. It will have good recommendations on clustered versus non-clustered indexes.

3) Should you denormalize? No. First, it will be hard to denormalize a many-to-many relationship. Second, 1 million records seems like a large number, but in reality with today's memory and optimization engines it is not that large.

4) Dataset sizes? You will probably want 50,000 records to allow the wizard to make good suggestions. Others will have better insights than I do.

|||

Hi,

I don′t if I got you right, so I will repeat the story first to see if we are in sync. You have two entities which can be compared to books and authors. One book can written by many authors and one authors can write many books (thats the many to many RS). What you now want to do is to sell a book / documents and want to reuse the information of the first persons table, right ? So the question, why are the perons related to the documents / book ? Did they write or bought them ? Is it another realtionship than you already defined in the n-m RS ?

HTH, Jens K. Suessmeyer.


http://www.sqlserver2005.de

|||

Thanks for the response. You are correct that a buyer and seller are two different people and that a single person would not sell to himself. I can not think of an execption to this unless a parent sells to his child so that "John Smith" was the seller and "John Smith" was a buyer. This could also be true in the event that two people with the same name are conducting business. Rare, but it could happen. So I guess the sofware should not care if a "Doug DeBug" sold to a "Doug DeBug" as long as the user could enter "Doug DeBug" and the business logic would return the correct document, then all is well. All index information is keyed from the actual image so this should not be a problem in that a string of characters is simply associated with a document image. So if I understand you, I could add a field to the Name_Doc table like "IsBuyer" as a TRUE or FALSE and use that logic to search for John Smith the buyer and or John Smith the seller?

I'm using SQL Express so I'm not sure it contains the performance wizard. However, I have a subscription to MS's action pack which includes a version of SQL that I'm sure will contain the wizard. I would guess that I could tune the database using that as a tool and then model the results to the Express version for production.

I will try the 50K records level and see what Mr. Wizard suggests.

Thanks again for your reply.

DeBug

|||

Jens,

Yes, the books and authors is a good example of what I'm doing. The documents would be like the books and the buyers and sellers would be like the authors. The exception is that I need to also search on a name based on if the person is a seller or a buyer. So there is some additional work there. Also, your post mentions the reuse of the information of the first person. I really don't need to do that but simply want to be able to search for the correct document using a person's name and type (was this person a buyer or seller) to correctly locate and display the right TIFF file to the user. Kind of like an electronic filing cabinet type of application. I've written many of these before using Access and Visual Basic 6 but this project is much larger. When SQL Express was introduced, I was just waiting for the right project to come along to take advantage of this wonderful (and free) tool. To answer your second question, the named persons are simple the owners and buyers and the document is a scanned copy of the sales contract. When a disputed document is need to resolve a problem between the buyer and seller, a user will ask the person making the request if they were the buyer or seller and use this to narrow down the search. Most requestors have nothing to start with other than the persons name. In fact, sometimes they don't know if the person was the buyer or seller and thus the need to be able to search for all documents related to a specific name. Hope this helps clarify the problem.

Thanks for responding to my questions.

DeBug

|||

Hi,

ok I *think* I got you :-) There can be multiple buyers and sellers based on one document. That should be something like:

CREATE TABLE Document
(
DocumentId
<additional columns here>
)

CREATE TABLE InvolvedPersons
(
PersonId
<additional columns here>
)


CREATE TABLE AssociatedPersonsToDocuments
(
DocumentId,
PersonId,
AssociationType
<additional columns here>
)

CREATE TABLE AssociationType
(
AssociationType --This can be buyer "B" or seller "S" or additional flag that you want to set in the future, e.g. author of the contract, etc.
<additional columns here>
)

The names of the columns are sure replaceable, this was just for my own understanding and to show you how this fits all together

HTH, Jens K. Suessmeyer.

http://www.sqlserver2005.de

|||

I don't follow your logic with the additional tables. Here is where I'm am on the problem:

CREATE TABLE [dbo].[Names](
[NameID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[MiddleName] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[LastName] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[Suffix] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_Names] PRIMARY KEY CLUSTERED
(
[NameID] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[Documents](
[DocID] [int] IDENTITY(1,1) NOT NULL,
[DocPath] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[DocFileDate] [smalldatetime] NOT NULL,
CONSTRAINT [PK_Documents] PRIMARY KEY CLUSTERED
(
[DocID] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[NamesDocuments](
[NameDocID] [int] IDENTITY(1,1) NOT NULL,
[NameID] [int] NOT NULL,
[DocID] [int] NOT NULL,
[Type] [bit] NOT NULL,
CONSTRAINT [PK_NamesDocuments] PRIMARY KEY CLUSTERED
(
[NameDocID] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

GO
ALTER TABLE [dbo].[NamesDocuments] WITH CHECK ADD CONSTRAINT [FK_NamesDocuments_Documents] FOREIGN KEY([DocID])
REFERENCES [dbo].[Documents] ([DocID])
GO
ALTER TABLE [dbo].[NamesDocuments] CHECK CONSTRAINT [FK_NamesDocuments_Documents]
GO
ALTER TABLE [dbo].[NamesDocuments] WITH CHECK ADD CONSTRAINT [FK_NamesDocuments_Names] FOREIGN KEY([NameID])
REFERENCES [dbo].[Names] ([NameID])
GO
ALTER TABLE [dbo].[NamesDocuments] CHECK CONSTRAINT [FK_NamesDocuments_Names]

The above script creates the two tables for the names and documents. The many-to-many table contains the linking info with the additional column to identify if the nameIDtoDocumentID for that record was a buyer of seller for that transaction. What advantages do I have in using your example that contained an additional table to identify if the named person(s) were buyers or sellers....I'm confused :)

DeBug

|||

See my response to Jens for how I set up the tables and if that is what you would expect. Also, the NamesDocuments table will be about 3 to 3.5 times the size of the documents table so table size will be more like 3,250,000 rows and not the 1 million as I first suggested.

Again, thanks for your help on this!

DeBug

|||

OK, lets abstract my samples. As I saw from your tables you are only prepared for storing two types of people, buyers and sellers (which is stored in the Type column, right ? ) . If you only want this, then you are ok. If you want to store more information about the document like other persons which are involved in the document (like lawyers etc, mentioned in my sample you would have to extend your database) If you would store the Type as a integer (instead of a bit) you would be able to map the type to the additional table (mentioned in my sample) and extened your solution to include various type of perople involved in the document. Lets make a sample of your NamesDocuments table

[NameDocID] [NameID] [DocID] [Type]
1 1 1 1
2 2 1 2
3 3 1 3

Being Type like the following (1=buyer, 2=seller,3=lawyer), you will be able to extend this making something like 4=referals,5=reviewers etc.

That would be a great advantage to the solution with the bit option.

Hope I didn′t confused you too much :-)

HTH, Jens K. Suessmeyer.


http://www.sqlserver2005.de

|||

I like your idea of extending the named persons beyond just buyer and seller. Adding another type like "reviewer" might be something I would want to add if the business rules were to change, which could happen at any time. Currently, a log is kept as the documents move through the workflow that audit the creation and review process but is not part of the long term document history. Is there a performance penality if I add another table "Types" with TypeID (PK int), Description (vchar(20))? Would this would require an additional JOIN to extract the TYPE of person and slow down performance? Or do I simply change the data type to "int", and use business logic to say that a 1 is "buyer" and that a "2" is "seller", etc. and not add an additional table? Also, I was able to set the one-to-many relationships (at least I think I set them up correctly) from the NamesDocuments table to the respective Names and Documents tables. What do I need to do to set up referential rules and or constraints to prevent orphaned records? I think I can change or correct a row from the Names table without changing any of the related records. Also true of the Documents table. However, even as a novice, I know you should not delete rows from the Names or Documents table that are part of the many-to-many table. In our business, it is common to find a document that was not correctly indexed and the administrator would need to ability to update a document to add/remove/change buyers and or sellers to correct any errors. My next task is to figure out a way to load about 25,000 random names to the Names table. I will most likely use VB.NET and some name data from the 1990 Census to simulate that table. The documents table can simply be a TIFF file used over and over again with a new file name so no problem there. Thank you for your help and suggestions!

Regards,

DeBug

How to setup a many-to-many schema to favor retreival speed.

I'm am novice in the RDMS world. I would like to set up a database that will scale with a large dataset of over 1,000,000 unique records. In a nutshell, I have about a million+ documents that are stored on a file server as TIFF files. Each document has 2 or more persons (names) associated with that document like "John Smith" or "Mr. John Doe" etc.. A person can be related to many documents (depending on how many transactions they conducted over the years) and any document can be related to many people as most of the transactions were from multiple sellers and multiple buyers. The end users almost always searches for documents by using the first and last name of the person related to the document along with other document fields such as DATE and TYPE to narrow the search. I would like to design and index the database for fast searching as updates are not as time sensitive. I think I need at least 3 tables:

1) NAME_TABLE normalized into NameID(PK) NameFirst, NameMiddle and NameLast

2) DOC_TABLE normalized into DocID(PK), DocTransNum, DocFilePath, DocDate, DocType, (and other attributes)

3) NAME_DOC table to relate the above two tables in a many to many relationship.

My questions are: Where should I put the field that identifies the person as a buyer or seller as it relates to the document? John Smith can be both a buyer and seller on different documents so I don't think it belongs in the NAME table. If I have a another table that splits the names into SELLER_NAMES and BUYER_NAMES it would create some duplicate names and increase the search time right? I want to search for a document using a name and the option to specify if the person name search is limited to buyers, sellers or both. Do I need to do anything special in setting up the tables for fast searching other than indexing all the name fields and the DocID field? What indexing needs to be done on the many-to-many table if any? If fast searching on a mostly static database is a top requirement, should I denormalize the document table to include the names even though there will be duplicate names throughout? How large of a test set of records do I need to load to see any real deltas in performance as I modify and tweak?

Sorry for all the questions but I believe they are related to the overall problem to solve.

Regards,

DeBug

1) Where to indicate buyer versus seller? It seems the most appropriate place is in the Name_Doc table. It is unlikely that a person is both a buyer and a seller on a given document. If they could be, then you could have two bit fields isBuyer and isSeller, both on the Name_Doc table. Queries would be straightforward with this arrangement. Do not split the Name table into buyers and sellers. You are correct that it will lead to duplication and does not model the real problem.

2) How should you index? Use the performance wizard to aid in indexing. More indexes are not necessarily better. It will have good recommendations on clustered versus non-clustered indexes.

3) Should you denormalize? No. First, it will be hard to denormalize a many-to-many relationship. Second, 1 million records seems like a large number, but in reality with today's memory and optimization engines it is not that large.

4) Dataset sizes? You will probably want 50,000 records to allow the wizard to make good suggestions. Others will have better insights than I do.

|||

Hi,

I don′t if I got you right, so I will repeat the story first to see if we are in sync. You have two entities which can be compared to books and authors. One book can written by many authors and one authors can write many books (thats the many to many RS). What you now want to do is to sell a book / documents and want to reuse the information of the first persons table, right ? So the question, why are the perons related to the documents / book ? Did they write or bought them ? Is it another realtionship than you already defined in the n-m RS ?

HTH, Jens K. Suessmeyer.


http://www.sqlserver2005.de

|||

Thanks for the response. You are correct that a buyer and seller are two different people and that a single person would not sell to himself. I can not think of an execption to this unless a parent sells to his child so that "John Smith" was the seller and "John Smith" was a buyer. This could also be true in the event that two people with the same name are conducting business. Rare, but it could happen. So I guess the sofware should not care if a "Doug DeBug" sold to a "Doug DeBug" as long as the user could enter "Doug DeBug" and the business logic would return the correct document, then all is well. All index information is keyed from the actual image so this should not be a problem in that a string of characters is simply associated with a document image. So if I understand you, I could add a field to the Name_Doc table like "IsBuyer" as a TRUE or FALSE and use that logic to search for John Smith the buyer and or John Smith the seller?

I'm using SQL Express so I'm not sure it contains the performance wizard. However, I have a subscription to MS's action pack which includes a version of SQL that I'm sure will contain the wizard. I would guess that I could tune the database using that as a tool and then model the results to the Express version for production.

I will try the 50K records level and see what Mr. Wizard suggests.

Thanks again for your reply.

DeBug

|||

Jens,

Yes, the books and authors is a good example of what I'm doing. The documents would be like the books and the buyers and sellers would be like the authors. The exception is that I need to also search on a name based on if the person is a seller or a buyer. So there is some additional work there. Also, your post mentions the reuse of the information of the first person. I really don't need to do that but simply want to be able to search for the correct document using a person's name and type (was this person a buyer or seller) to correctly locate and display the right TIFF file to the user. Kind of like an electronic filing cabinet type of application. I've written many of these before using Access and Visual Basic 6 but this project is much larger. When SQL Express was introduced, I was just waiting for the right project to come along to take advantage of this wonderful (and free) tool. To answer your second question, the named persons are simple the owners and buyers and the document is a scanned copy of the sales contract. When a disputed document is need to resolve a problem between the buyer and seller, a user will ask the person making the request if they were the buyer or seller and use this to narrow down the search. Most requestors have nothing to start with other than the persons name. In fact, sometimes they don't know if the person was the buyer or seller and thus the need to be able to search for all documents related to a specific name. Hope this helps clarify the problem.

Thanks for responding to my questions.

DeBug

|||

Hi,

ok I *think* I got you :-) There can be multiple buyers and sellers based on one document. That should be something like:

CREATE TABLE Document
(
DocumentId
<additional columns here>
)

CREATE TABLE InvolvedPersons
(
PersonId
<additional columns here>
)


CREATE TABLE AssociatedPersonsToDocuments
(
DocumentId,
PersonId,
AssociationType
<additional columns here>
)

CREATE TABLE AssociationType
(
AssociationType --This can be buyer "B" or seller "S" or additional flag that you want to set in the future, e.g. author of the contract, etc.
<additional columns here>
)

The names of the columns are sure replaceable, this was just for my own understanding and to show you how this fits all together

HTH, Jens K. Suessmeyer.

http://www.sqlserver2005.de

|||

I don't follow your logic with the additional tables. Here is where I'm am on the problem:

CREATE TABLE [dbo].[Names](
[NameID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[MiddleName] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[LastName] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[Suffix] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_Names] PRIMARY KEY CLUSTERED
(
[NameID] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[Documents](
[DocID] [int] IDENTITY(1,1) NOT NULL,
[DocPath] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[DocFileDate] [smalldatetime] NOT NULL,
CONSTRAINT [PK_Documents] PRIMARY KEY CLUSTERED
(
[DocID] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[NamesDocuments](
[NameDocID] [int] IDENTITY(1,1) NOT NULL,
[NameID] [int] NOT NULL,
[DocID] [int] NOT NULL,
[Type] [bit] NOT NULL,
CONSTRAINT [PK_NamesDocuments] PRIMARY KEY CLUSTERED
(
[NameDocID] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

GO
ALTER TABLE [dbo].[NamesDocuments] WITH CHECK ADD CONSTRAINT [FK_NamesDocuments_Documents] FOREIGN KEY([DocID])
REFERENCES [dbo].[Documents] ([DocID])
GO
ALTER TABLE [dbo].[NamesDocuments] CHECK CONSTRAINT [FK_NamesDocuments_Documents]
GO
ALTER TABLE [dbo].[NamesDocuments] WITH CHECK ADD CONSTRAINT [FK_NamesDocuments_Names] FOREIGN KEY([NameID])
REFERENCES [dbo].[Names] ([NameID])
GO
ALTER TABLE [dbo].[NamesDocuments] CHECK CONSTRAINT [FK_NamesDocuments_Names]

The above script creates the two tables for the names and documents. The many-to-many table contains the linking info with the additional column to identify if the nameIDtoDocumentID for that record was a buyer of seller for that transaction. What advantages do I have in using your example that contained an additional table to identify if the named person(s) were buyers or sellers....I'm confused :)

DeBug

|||

See my response to Jens for how I set up the tables and if that is what you would expect. Also, the NamesDocuments table will be about 3 to 3.5 times the size of the documents table so table size will be more like 3,250,000 rows and not the 1 million as I first suggested.

Again, thanks for your help on this!

DeBug

|||

OK, lets abstract my samples. As I saw from your tables you are only prepared for storing two types of people, buyers and sellers (which is stored in the Type column, right ? ) . If you only want this, then you are ok. If you want to store more information about the document like other persons which are involved in the document (like lawyers etc, mentioned in my sample you would have to extend your database) If you would store the Type as a integer (instead of a bit) you would be able to map the type to the additional table (mentioned in my sample) and extened your solution to include various type of perople involved in the document. Lets make a sample of your NamesDocuments table

[NameDocID] [NameID] [DocID] [Type]
1 1 1 1
2 2 1 2
3 3 1 3

Being Type like the following (1=buyer, 2=seller,3=lawyer), you will be able to extend this making something like 4=referals,5=reviewers etc.

That would be a great advantage to the solution with the bit option.

Hope I didn′t confused you too much :-)

HTH, Jens K. Suessmeyer.


http://www.sqlserver2005.de

|||

I like your idea of extending the named persons beyond just buyer and seller. Adding another type like "reviewer" might be something I would want to add if the business rules were to change, which could happen at any time. Currently, a log is kept as the documents move through the workflow that audit the creation and review process but is not part of the long term document history. Is there a performance penality if I add another table "Types" with TypeID (PK int), Description (vchar(20))? Would this would require an additional JOIN to extract the TYPE of person and slow down performance? Or do I simply change the data type to "int", and use business logic to say that a 1 is "buyer" and that a "2" is "seller", etc. and not add an additional table? Also, I was able to set the one-to-many relationships (at least I think I set them up correctly) from the NamesDocuments table to the respective Names and Documents tables. What do I need to do to set up referential rules and or constraints to prevent orphaned records? I think I can change or correct a row from the Names table without changing any of the related records. Also true of the Documents table. However, even as a novice, I know you should not delete rows from the Names or Documents table that are part of the many-to-many table. In our business, it is common to find a document that was not correctly indexed and the administrator would need to ability to update a document to add/remove/change buyers and or sellers to correct any errors. My next task is to figure out a way to load about 25,000 random names to the Names table. I will most likely use VB.NET and some name data from the 1990 Census to simulate that table. The documents table can simply be a TIFF file used over and over again with a new file name so no problem there. Thank you for your help and suggestions!

Regards,

DeBug