|
Chapter 18Databases for Perl
CONTENTS
There are two ways to look at working with databases in Perl. You can either implement the database entirely in Perl or use Perl as an interface to existing database engines on your machine. You can use stand-alone database packages, such as the DBM package that comes with Perl, or you can get other packages, such as the RDBM package, off the Internet. To use Perl as an interface to a commercial package, you can use driver front-ends to the database engine for the package. The DBM PackagesPerl comes with a set of database management (DBM) library files. Here's the list of packages that come with the latest version of Perl (5.002):
The AnyDBM_File package encapsulates the rest of the packages. If you use the AnyDBM file, you'll automatically load one of the standard packages from the previous list. To use the AnyDBM_File package, insert the following statement at the start of your Perl script: use AnyDBM_File; There is nothing preventing you from explicitly naming a particular DBM package to override the defaults used by the AnyDBM_file package. To use the GDBM_File explicitly, insert this statement instead: use GDBM_File; A DBM, when used with these packages, is a mapping of an associative array to a file on disk. To map the associative array to disk, you use dbmopen() to create or open an existing file, make all the modifications to the associative array, and then close the file with a call to dmbclose(). I covered how to bind variables to associative arrays and how to use the tie() function in Chapter 6, "Binding Variables to Objects." In the future, most dbmopen() function calls will be replaced by, or at least internally work as, tie() function calls. Basically, a DBM file will be tied to an associative array. Subsequent updates to the associative array will be reflected in the file on disk. However, you will come across legacy code that does use the dbmopen() and dbmclose() functions, so it's important to see how these work together. Also, you may find it easier to use these DBM functions to program for quick prototyping than to have to code the callback functions required for the tie() function. After you've read this chapter, you should be able to decide whether to use DBM functions or write the functions for database access using the tie() function. Here's an example to illustrate how the DBM files work. First, I'll create a small database of some stock symbols and the names of the companies. The file containing this data sample is called sym.txt and is partially shown here: $ head sym.txt This file contains several lines of records where each record has two items of data: the first column is the stock ticker symbol followed by the company name. Note that a company name can contain one or more space characters. Now I'll construct a mini-database of all company names indexed by their stock symbol. To use the DBM utilities for this endeavor, I would use the script shown in Listing 18.1. Listing 18.1. Creating a DBM file. 1 #!/usr/bin/perl In Listing 18.1, the AnyDBM_File package is used at line 10 to get the best available package for the system. You can override the type by replacing the AnyDBM_File package with a DBM package of your choice. In line 11, the input file is opened. Line 16 creates a DBM called ticker and maps it onto an associative array called %stocks{}. Note that you are actually creating two files called ticker.pag. The file permissions given for this open are 0666 so that it's possible for all other users to use this DBM. In lines 17 through 23, the symbol names are read in, and the values are assigned to the newly created hash, called stocks. Lines 18 and 19 simply take the text input from the sym.txt file and separate it into a symbol and a company name. At line 21 a value is assigned to the associative array using the symbol as the index into the array. Now that this array has been mapped to the disk, the data can be kept in the array even after the script that created this file is long gone. Note also that the join() function is used in line 21 to piece together the items of names of the companies into one string. Had the join() function not been used, $stocks{$symbol} would contain the number of items of the @names array and not the contents. In line 31 the dbmclose() function is used to commit the changes to disk. If you do not use dbmclose(), your DBM file modifications are not be saved to disk. The dbmclose() function disconnects (or un-ties) the hash from the file on disk so that any changes made to the hash after the dbmclose() function are made as if to an uninitialized associative array. When the contents of the hash will be flushed to disk depends on the underlying system. As far as the program is concerned, the hash connected to the underlying database is not initialized anymore. Two files are created with the use of the AnyDBM_File package. These are the two files that were created by the script in Listing 18.1 when I ran it: -rw-r--r-- 2 khusain users 6608 Mar 26 08:08 scripts/ticker.dir Of course, because you do not have access to the original files that I used in this example, the sizes shown here will be different for the data file that you use. Just note that the size of both the files is the same. If the AnyDBM_File package had been substituted with the GDBM_File page, we would only get one file: -rw-r--r-- 1 khusain users 6608 Mar 26 08:06 scripts/ticker Again, the size of the ticker file will be different depending on the input data file that you use. However, you should note that the size of the ticker file is the same as that of the two files created by the AnyDBM-File package. Viewing Your DataThe database has now been created on disk. Whether it consists of one file or two is not important in the application that will access the database because all the internals are hidden with the DBM function calls. In order for us to look at the data in the newly created DBM file, another script needs to be created. The viewing script is shown in Listing 18.2. Listing 18.2. Viewing the contents of a DBM file. 1 #!/usr/bin/perl Line 13 opens the hash as before. Because the file already exists, there is no need to create it. Lines 14 and 15 are where the contents of the hash $stocks is printed out. Line 20 closes the file. Adding and Removing ItemsIt's relatively simple to add a new symbol to the newly created database. To see how to add to the database, refer to Listing 18.3. Listing 18.3. Adding to the DBM array. 1 #!/usr/bin/perl To delete an item from the %stocks{} associative array, use the delete command. For example, the code in Listing 18.3 can be modified to become a deletion command by adding the delete operator to line 20. The code in line 20 would look like this: delete $stocks{$symbol}; Operations on a DBM FileThe contents of a DBM array can be easily searched using Perl's features. To look for a symbol in an existing DBM, use the Perl search operation with the =~ // syntax. Listing 18.5 illustrates this. Listing 18.5. Searching the DBM array. 1 #!/usr/bin/perl Lines 19 through 24 contain a while loop that uses the each command. The each command is more efficient than using a for loop command because both the key and the indexed value are retrieved with the each command. In the for loop on the keys of an associative array, two separate steps must be taken to retrieve the same information-first to get the key for the loop iteration with a call to the keys(hash) function, and second to get the value based on the retrieved key. Running the program to look for NAL in the company names produces the following output: Enter String to Look for: NAL The program found all company names with the string NAL in their name. As you can see, within Perl you have the power to create a new database, add or delete items from the database, and list the contents of the database. Using the hash as the map, you can perform many other operations on the hash and reflect the results back to disk. The DBM files are still a flat database; therefore, in order to create relationships between data items, you still have to do some shuffling around of indexes, filenames, and files. Fairly sophisticated data algorithms are not out of reach, however, if you are careful. For example, the sample $symbol index can also be used as a filename containing historical data. Let's say the %stocks{} DBM array is used as a storage area for the stocks in a portfolio. It's easy to get historical data from the Internet (for example, via CompuServe) for a given stock symbol. Listing 18.5 collects and displays some statistics for the symbols in the sample portfolio. So that you don't have to type everything via STDIN, the script is designed to use command-line options with the Getopts::Long module. See Chapter 16, "Command-Line Interface with Perl," for more information on how to use the Getopts::Long.pm module. In this example, the -s option is used to specify the stock symbol and the -v option is used to get a complete listing of all the readings. Listing 18.4. Using the DBM array. 1 #!/usr/bin/perl You print the stocks you are tracking by checking to see if the name exists in the %stocks{} hash. This is done in line 23 with the following statement: if ($stocks{$search}) { Once the file is found to exist with the -e flag in line 24, it is opened and read from in lines 27 to 35. The summary is printed in lines 36 through 40. In case the symbol is not listed in the database, an error message is printed in the else clause in line 43. The format for the output via the write statement in line 33 is defined in lines 50 to 53. In this example, the historical data for a stock is stored in a file called AMAT. Therefore, to invoke the script, use this command: $ test.pl -s AMAT As you can see, the output from this script is made a bit more presentable with the use of the format statement. Using DBM in Modules
Because it's possible to use the DBM files from within modules,
let's see if the Invest.pm
module (covered in Chapter 5, "Object-Oriented
Programming in Perl") can be updated to include saving and
restoring portfolio information on disk. You will be appending
some functions to the original Invest.pm.
All the new functions will go at the end of the file unless specified
Some changes need to be made to the Invest.pm module to get it to work with the DBM files. The first change is to include the following statement before the first executable line in the Invest.pm file: require AnyDBM_File; The @portfolio array is changed to %portfolio because this array can be mapped directly to disk via a DBM file. The @portfolio array contains references to hashes and not the content of the item in the referenced hash. Therefore, a new scheme has to be incorporated in order to parse the values and then store them to disk. After the values are stored to disk, the reverse operation has to be applied to read them back. Because this sample portfolio is not a large database, the important values can be stored using a colon delimited array. If this example were a very large array, the items could be stored with the pack() function and thus store only binary values. A new function has to be created to save the contents of the portfolio to disk. Add the following function to the end of the Invest.pm file on or about line 116: savePortfolio { The %dummy hash is used to map to the disk file. Each item in the portfolio hash is parsed for storage in the %dummy hash. The value of $key goes from 0, 1, 2, and up. One string is saved to disk per hash item. Here's the format of the string: $key:$type:$symbol:$shares $type can be Stock or Fund. $symbol is the stock or fund symbol and $shares is the number of shares of the stock. Keep in mind that this is only an example-the real data stored would probably have to include date of purchase, purchase price, and so on. In this event, the fields could be appended to the string with colons separating each field. If you want to add purchase price, your field would look like this: $key:$type:$symbol:$shares:$purchased To restore the file back from disk, the Invest.pm file will have to read back the same portfolio file and restore all the values in the portfolio array. The function will also have to recognize the difference between the types of items it has to re-create. The restoration is done by this function which is added to the end of the Invest.pm file at about line 132: sub restorePortfolio { To create the sample portfolio, use the shell script shown in Listing 18.6. You might want to edit the code shown in Listing 18.6 to print the reported information in a different format than the one shown here. In this sample script, two stocks are added and then the contents of the portfolio are printed. Listing 18.6. Creating a sample portfolio. 1 #!/usr/bin/perl To view the contents of the portfolio, you'll have to write another simple script. This script is shown is Listing 18.7. The file to recover data from is called myStocks. You should be able to see this file in your directory. Listing 18.7. Listing the portfolio. 1 #!/usr/bin/perl Multiple DBM FilesThere are occasions when you'll want to have more than one file open for DBM access. You can use a unique stock ticker symbol as the index into several hashes, each of which is then mapped to its own DBM file. For example, the following database code could be used to track stock price information in one DBM file and earnings information in another DBM file. Both DBM files will be indexed via the stock symbol name. Results of analyzing data from both DBM files could be printed using code similar to the following snippet: foreach $symbol (@listOfsymbols) { The Catch with DBM UtilitiesSo far I have only covered the standard DBM utilities that come with Perl distribution. For most casual users, these DBM files will be sufficient for their database needs. Unfortunately, when things get complicated, as in the case of relational databases, you might want to reconsider your options with other database solutions. The price tag for DBM utilities is attractive because they're free. However, you just might want to pay someone to acquire a commercial Relational Database Management System (RDBMS). Second, there is an inherent danger in using DBM utilities that I must warn you about. If you make a mistake in working with your mapped hash and somehow write it to disk with a dbmclose(), guess what? You just wiped out your entire database. This type of faux pas is not hard to do, especially if you are modifying data. Obliteration of your DBM database is generally only recoverable from backup. Commercial databases have a "safer" feel because they provide you with a safety net by keeping alternate backups. You are still your own worst enemy, but it's a little bit harder to destroy all data. In any event, always back up your data. If you do decide to use a database management system (DBMS) other than the DBM utilities, all is not lost. You can use RDB (a freeware relational database) or other Perl front-ends to popular databases. All of the front packages allow your Perl programs to talk to different databases via the protocol used by a DBI package. Listing 18.8 presents the final version of the Invest.pm file after all the modifications discussed up to now in this chapter have been made to it. Listing 18.8. The final version of Invest.pm with DBM support. 1 Line 8 is where the require statement is used to implement the AnyDBM_FILE support. Lines 17 through 19 use other packages as well. What Is the DBI Package?The database interface (DBI) package for Perl is the implementation of the DBI Application Program Interface (API) specification written by Tim Bunce (Tim.Bunce@ig.co.uk). The DBI package API is designed specifically for use with Perl. The set of functions and variables in the DBI package provide a consistent interface to the application using it. The strong point of the DBI package API, in addition to its broad set of available functions, is that it completely isolates the using application from the internal implementation of the underlying database. The DBI specification exists at various sites in the CPAN archives, but the latest version (v0.6) is not up to date. The best source of information is to look in the source files for a DBI package itself. The entire specification is good for getting an idea of how everything is intended to work together. However, the interface has changed considerably since the specification was released. Check out the file dbispec.v06 in compressed form at ftp.demon.co.uk in the /pub/perl/db directory. The DBI specification started out as DBperl back in 1992 as a team effort from several Perl enthusiasts. Here are the initial contributors to the specification for each type of database:
The original DBI specification was edited by Kurt Anderson. In 1994, Tim Bunce took over the editing and maintenance of the specification in addition to the DBI and DBD::Oracle package development. The specification and related files are copyrighted by Tim Bunce. The original specification was edited by Kurt Anderson from the discussions on the mailing list. In 1993, Tim Bunce took over the editing and maintenance of the specification and in 1994 started the development of the DBI and DBD::Oracle modules. The DBI specification and modules are copyrighted by Tim Bunce but are freely available to all with the same terms as Perl. (Tim is the technical director of the software systems house, Paul Ingram Group in Surrey, England. Tim can be reached at Tim.Bunce@ig.co.uk, but DBI related mail should be sent to the dbi-users@fugue.com mailing list.) The DBI is not related to any one specific database because it serves as an intermediary between a program and one or more DBD::* driver modules. DBD:: modules are drivers written to support a specific database back-end. The DBI:: module manages all installed DBD:: drivers in your system. You can load and use more than one DBD:: module at the same time. DBD:: modules are written in such a way that they may be copied and customized to suit your specific needs. For example, the DBD::Oracle module served as the starting point for Alligator Descartes, another well-known pioneer in developing database interfaces for Perl, to develop DBD:: modules for other databases. He has written two copyrighted documents on how to develop your own driver from DBD:: modules. These documents are located on the Web site www.hermetica.com in the technologia/DBI directory. Available PackagesSome of DBI:: packages available on the Internet are listed here; you can get the latest versions of these files from the Internet CPAN sites:
The interface packages are simply front-ends to the database engine that you must have installed on your machine. For example, in order to use the Oracle DBI package, you'll need the Oracle database engine installed on your system. The installation instructions are located in the README files in the packages themselves. You'll need to have Perl 5.002 installed on your system to get some of the packages to work, especially the DBI module. The DBI interface is very different than the old, database-specific interfaces provided by oraperl, ingperl, interperl, and so on. To simplify the transition to Perl 5, some DBD:: drivers, such as DBD::Oracle, come with an extra module that emulates the old-style interface. The DBI interface has never been fully defined because it has been constantly evolving. This evolution will take a large step forward with the adoption of the standard ODBC interface as the core of the DBI. Because this redefinition of the DBI interface standard is bound to change the DBI interface, Tim Bunce recommends using stable emulation interfaces, such as oraperl, instead. RDBThe RDB database utilities for Perl deserve an honorable mention. The RDB package is complete, simple to use, and very easy to set up. The author of this package is Walt Hobbs; he can be reached at hobbs@rand.org. The source and documentation is found in the file RDB-2.5k.tar.Z in the CPAN modules directories. There is a file named Info.RDB in the package that provides a short overview of how the RDB package organizes its data and what commands are available for you to use. The Info.RDB file also lists the origins of the RDB package, describes how to use it, and provides a sample data file. The operators in the RDB packages are Perl scripts that use standard I/O for UNIX to operate on ASCII files. The format for databases in the RDBM package is to store data in rows and columns in text files. Each row contains items separated by tabs, and each row is terminated by a newline character. (The field separator is a tab, and the record separator, therefore, is the newline character.) Each column in the text file has the items' names and format defined as the first two rows. Because the data is stored in such a format, it's easier to access the data using programs other than those supplied with the RDB package. You can use Perl or awk scripts to get what you want if the RDM programs do not give you what you need. The RDB operators are only Perl scripts that you can use as a basis for writing your own extensions. The operators on the package include listing by row, listing by column, merging tables, and printing reports. All operators in the RDB package read from standard input and write to standard output. Consider the following sample data file. There are four columns in the data file. Each column has a heading and type of data associated with it. The comments with # in the front of the line are ignored. The first uncommented row contains the name of fields per column. The row immediately after that stores the type of data. 4N means that LINE and WORD are four-digit wide numbers. (S specifies a string and M stands for month.) A digit by itself is a string; therefore, NAME fields are eight characters wide. # The commands to operate on are relatively simple. To sort the table by LINE numbers, you use this command on the test.data file: sorttbl < test.data LINE > out The resulting output in the out file is as follows. The format for the output file is an RDB file itself. You can run other RDB operators on it, too! # You can get summary information about columns with the summ command. For example, to get summary information for the test.data file for averages, use this command: summ -m < test.data Here's the resulting output: Rows: 13 You can join two tables together on a per-column basis to get a merged table. The command to do this is jointbl. The -c option does the merge for you on a per-column basis. Consider the two files p1.dat and p2.dat with a common column of NAME in each file. The merged output is shown with this command: jointbl -c < p1.dat NAME p2.dat The use of the jointbl command is shown in the following input/output example: $ cat p1.dat Other more powerful features of this RDB package are listed in the RDB.ps PostScript file. It prints out to a 72-page manual with examples and details on all of the commands available for you. If you do not feel like killing a tree, you can use ghostview to view the file. SummaryPerl supplies a flat database package in the DBM utilities. The modules allow Perl scripts to map hashes to disk files for storage. For most users, storing data with DBM utilities is sufficient. Some DBD:: and DBI:: modules are available as front-ends to commercial databases such as Oracle, Sybase, Informix, and Quickbase. You need the specific database and database engine installed on your machine in order for the DBD:: code to work. The RDB package provides a text file-based relational database management system. Utilities in the RDB package are a set of Perl programs that operate on rows and columns of data files.
|
|||||||||||||||||||||||||||
With any suggestions or questions please feel free to contact us |