HW01: Simple Page Scraper (due Tue 23 Jan 13:20)

\begin{purpose}
Completing this assignment will provide experience in:
\begin{itemize}
...
\item leveraging the power of server-side web coding
\end{itemize}
\end{purpose}

Allowed and Disallowed Resources

In completing this assignment you MAY use/access the following resources:

You may NOT use/access:

Failure to abide by these guidelines will result in a zero for the assignment, and the incident will be reported to the university provost as a violation of the university academic integrity policy. A second incident of academic dishonesty (whether from this course or another computer science course) will result in an F in the course.

Overview

We will extend the work from the prelab assignment so that when a user enters a valid ISBN, we retrieve information about the book from Amazon's website. We can look up book data on Amazon by ISBN using a URL of the following form: https://www.amazon.com/exec/obidos/ISBN=0123456789 (the digits after ISBN= are the book's ISBN). Instead of using fopen to open a local file, you can use it to open a URL and read the contents of the page one line at a time.
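As a rough illustration of the line-at-a-time reading described above, here is a minimal sketch. The function name read_lines is hypothetical and not part of the prelab code. It assumes the server's PHP configuration has allow_url_fopen enabled, which fopen needs before it will accept a URL.

```php
<?php
// Hypothetical helper: collect every line from a local path or a URL.
// Opening a URL with fopen() only works when allow_url_fopen is enabled.
function read_lines(string $source): array
{
    $handle = @fopen($source, 'r');   // @ suppresses the warning on failure
    if ($handle === false) {
        return [];                    // could not open the file or URL
    }

    $lines = [];
    // fgets() returns one line per call and false at end of input.
    while (($line = fgets($handle)) !== false) {
        $lines[] = rtrim($line, "\r\n");   // drop the trailing newline
    }

    fclose($handle);
    return $lines;
}
```

Because the same call handles both cases, read_lines("page2.html") and read_lines with a full URL differ only in the string passed in.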

Steps

IMPORTANT: Amazon doesn't take kindly to lots of automated requests to its system. So, while you are debugging your code, use a static local file rather than making a new request each time you run it.

  1. If you haven't done so already, complete the lab day assignment because this builds on it.
  2. Here are a couple of ISBNs that may be useful for test purposes: 0-8120-4152-6 and 0-07-050606-X. Start by visiting: https://www.amazon.com/exec/obidos/ISBN=0812041526 in your browser and then save the source code to a file named page1.html. Do the same for the second ISBN and save it as page2.html. Save the contents of these files in your hw01 directory on the CSCI server. You can do this by transferring with an sftp client or by copy/pasting.
  3. Modify scraper.php to read page2.html instead of the poem file.
  4. Rather than displaying the page, we want to extract the data of interest to us. In particular, you should parse the page to obtain: author names, publisher, and book title. NOTE: Not all information is available for all books. Also, there is some variation in how the information is provided for different books. Your extraction code should work for most books, but don't spend forever trying to make it perfect for all cases.
  5. If no matching ISBN is found then provide a simple message indicating that fact. Otherwise, display the extracted information with one item per line. Here is my output for ISBN 0-07-050606-X:
    Title: A Tutorial Introduction to Occam Programming
    Author: Dick Pountain
    Author: David May
    Publisher: McGraw-Hill (date)
    
  6. Once your code is properly scraping the desired data and is working for both page1.html and page2.html, you can modify your fopen statement so it reads directly from the Amazon URL (with the ISBN entered by the user embedded in the URL).
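One way to organize the switch between the saved pages and the live URL in the steps above is sketched below. Both function names (normalize_isbn and page_source) are hypothetical. The sketch assumes that the hyphens in an ISBN such as 0-07-050606-X must be removed before it goes into the URL, as in the step 2 example.

```php
<?php
// Hypothetical helper: strip hyphens and spaces from a user-entered ISBN,
// so "0-07-050606-X" becomes "007050606X".
function normalize_isbn(string $isbn): string
{
    return strtoupper(preg_replace('/[\s-]+/', '', $isbn));
}

// Hypothetical helper: decide what fopen() should read.
// While debugging, pass a saved page such as "page2.html" so that
// no request is sent to Amazon.
function page_source(string $isbn, ?string $localFile = null): string
{
    if ($localFile !== null) {
        return $localFile;
    }
    return 'https://www.amazon.com/exec/obidos/ISBN=' . normalize_isbn($isbn);
}
```

During development you might call page_source($isbn, "page2.html"), then drop the second argument once the extraction works for both saved pages.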

Hints

Here is an example of the code I used to extract a title:
if (preg_match("/<span id=\"productTitle\".+>(.+)<\/span>  <\/h1>/",$line,$result)) {
   echo "Title: $result[1]<br>\n";
}

NOTE: $line is the current line of HTML code we are looking at. If the line contains an element with an id of “productTitle”, then I know it contains the title, which the parenthesized .+ captures into $result[1]. I determined the string to look for by inspecting sample pages at Amazon.

Take time to read the documentation for the preg_match function rather than blindly trying to use it. Then come up with additional patterns to extract the other required elements.
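To make the behavior of preg_match concrete, here is a small sketch using made-up HTML lines rather than anything taken from an Amazon page. It shows what the function returns, what ends up in the matches array, and how a greedy .+ differs from a lazy .+? when a line holds more than one candidate.

```php
<?php
// Made-up line of HTML, not taken from a real Amazon page.
$line = '<li class="note">Pages: 128</li>';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
// $matches[0] holds the whole matched text; $matches[1] holds the text
// captured by the first parenthesized group.
$found = preg_match('/<li class="note">Pages: (\d+)<\/li>/', $line, $matches);

if ($found === 1) {
    echo "Pages: $matches[1]<br>\n";
} else {
    echo "No page count on this line<br>\n";
}

// Greedy versus lazy: .+ grabs as much as it can, .+? as little as it can.
$two = '<b>one</b> <b>two</b>';
preg_match('/<b>(.+)<\/b>/', $two, $greedy);   // $greedy[1] is 'one</b> <b>two'
preg_match('/<b>(.+?)<\/b>/', $two, $lazy);    // $lazy[1] is 'one'
```

The greedy case matters when a single line of HTML contains several closing tags, which is common on machine-generated pages.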

Suggested, But Not Required

Next week we will start the first stage of a series of assignments that will build for most of the semester. One step of next week's assignment is to build HTML and CSS pages for that site as a starting point. That task alone will likely take a couple of hours. After that, you will add a fair amount of database and form handling code, which will also take considerable time. Since this week's lab day and homework assignments are short, my advice is to build your HTML and CSS starting point this week.

If you choose to do this here are some thoughts to consider:

It would make sense for you to use the basic styling and page arrangement you used for your final web page from last semester, but you are welcome to branch out and do something different.