Sergeant's Website Web II

HW01: Simple Page Scraper due Tue 23 Jan 13:20

Purpose

Completing this assignment will provide experience in:

...
leveraging the power of server-side web coding

Allowed and Disallowed Resources

In completing this assignment you MAY use/access the following resources:

The JavaScript Command Sheets handed out in Web 1 (also available here: https://josephus.hsutx.edu/classes/all/javascriptcommandsheets/).
Examples and sample code found here: https://josephus.hsutx.edu/classes/w2/source/.
The VSCode editor with appropriate syntax highlighting and formatting plugins. You may not use any plugins that generate code, however.
Video instructions provided in Canvas as part of this course. You MAY NOT USE any other video resources.
The course notes that accompany the video instructions which can be found here: https://docs.google.com/presentation/d/113px2eo6b2rltW0cKty7782GpEFIYA4WyFlhE6ye5G4/edit?pli=1#slide=id.p
The official JavaScript Documentation found here: https://developer.mozilla.org/en-US/docs/Web/JavaScript.
The official PHP Documentation found here: https://www.php.net/docs.php
The official jQuery Documentation found here: https://api.jquery.com/
Any handouts provided by the instructor as part of this course.
Your own course notes
Your instructor
Discussions about the assignment with other students as long as you never look at the code produced by another student and you never receive instructions about solving the homework. That is, discussions need to be about concepts and understanding the technologies and not about how to solve the particular problem posed in this assignment.

You may NOT use/access:

Resources not expressly listed above, including, but not limited to, the following ...
Source code not provided as part of this assignment. (Obviously, this includes, but is not limited to, source code written by other students whether current or in the past).
Code-generating tools (of which ChatGPT is one example).
Any web sites not directly linked to from the homework assignment.

Failure to abide by these guidelines will result in a zero for the assignment and the incident will be reported to the university provost as a violation of the university academic integrity policy. A second incident of academic dishonesty (whether from this course or another computer science course) will result in an F in the course.

Overview

We will extend the prelab assignment so that when a user enters a valid ISBN, we retrieve information about the book from Amazon's website. We can look up book data on Amazon by ISBN using the following URL: https://www.amazon.com/exec/obidos/ISBN=0123456789. Instead of using fopen to open a local file, you can use it to open a URL and read the contents of the page one line at a time.
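The line-at-a-time reading described above can be sketched as follows. This is a minimal illustration, not the course's reference code: the helper name readLines() and the temporary demo file are assumptions. The same fopen call accepts a URL only when PHP's allow_url_fopen setting is enabled on the server.

```php
<?php
// readLines() is a hypothetical helper name, not part of the course code.
// It reads every line from a local path (or a URL, if allow_url_fopen is on).
function readLines(string $source): array
{
    $handle = @fopen($source, "r");   // @ suppresses the warning on failure
    if ($handle === false) {
        return [];                     // the source could not be opened
    }
    $lines = [];
    while (($line = fgets($handle)) !== false) {
        $lines[] = rtrim($line, "\r\n");   // drop the trailing newline
    }
    fclose($handle);
    return $lines;
}

// Demo on a temporary local file so that no web request is made.
$tmp = tempnam(sys_get_temp_dir(), "hw01");
file_put_contents($tmp, "<html>\n<title>Sample</title>\n</html>\n");
$lines = readLines($tmp);
unlink($tmp);
echo count($lines), " lines read\n";
```

Returning an empty array on failure keeps the caller simple; a real script might instead print the "no matching ISBN" message at that point.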

Steps

IMPORTANT: Amazon doesn't take kindly to lots of automated requests to its system. So, while you are debugging your code you should use a static local file, and not make a new request during the development process.

If you haven't done so already, complete the lab day assignment because this builds on it.
Here are a couple of ISBNs that may be useful for test purposes: 0-8120-4152-6 and 0-07-050606-X. Start by visiting: https://www.amazon.com/exec/obidos/ISBN=0812041526 in your browser and then save the source code to a file named page1.html. Do the same for the second ISBN and save it as page2.html. Save the contents of these files in your hw01 directory on the CSCI server. You can do this by transferring with an sftp client or by copy/pasting.
Modify scraper.php to read page2.html instead of the poem file.
Rather than displaying the page, we want to extract the data of interest to us. In particular, you should parse the page to obtain author names, publisher, and book title. NOTE: Not all information is available for all books. Also, there is some variation in how the information is presented for different books. Your extraction code should work for most books, but don't spend forever trying to make it perfect for all cases.
If no matching ISBN is found then provide a simple message indicating that fact. Otherwise, display the extracted information with one item per line. Here is my output for ISBN 0-07-050606-X:
```
Title: A Tutorial Introduction to Occam Programming
Author: Dick Pountain
Author: David May
Publisher: McGraw-Hill (date)
```
Once your code is properly scraping the desired data and is working for both page1.html and page2.html, you can modify your fopen statement so it reads directly from the Amazon URL (with the ISBN entered by the user embedded in the URL).
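One way to keep the switch between the saved pages and the live URL in a single place is sketched below. All function names and the $useLocal flag are hypothetical. The ISBN-10 checksum is included only for illustration and may duplicate the validity check already written in the prelab.

```php
<?php
// Hypothetical helpers; the names and the $useLocal switch are assumptions.

// Strip hyphens and spaces: "0-07-050606-X" becomes "007050606X".
function normalizeIsbn(string $isbn): string
{
    return strtoupper(str_replace(["-", " "], "", trim($isbn)));
}

// ISBN-10 check: the weighted sum (weights 10 down to 1) must be divisible by 11.
function isValidIsbn10(string $isbn): bool
{
    if (!preg_match('/^\d{9}[\dX]$/', $isbn)) {
        return false;
    }
    $sum = 0;
    for ($i = 0; $i < 10; $i++) {
        $value = ($isbn[$i] === "X") ? 10 : (int) $isbn[$i];
        $sum += $value * (10 - $i);
    }
    return $sum % 11 === 0;
}

// Use the saved page while debugging and the live URL otherwise.
function sourceFor(string $isbn, bool $useLocal, string $localFile = "page2.html"): string
{
    if ($useLocal) {
        return $localFile;
    }
    return "https://www.amazon.com/exec/obidos/ISBN=" . $isbn;
}

$isbn = normalizeIsbn("0-07-050606-X");
echo sourceFor($isbn, true), "\n";    // the saved local page
echo sourceFor($isbn, false), "\n";   // the live Amazon URL
```

Leaving $useLocal set to true until the extraction works on both saved pages avoids sending repeated automated requests to Amazon.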

Hints

Here is an example of the code I used to extract a title:

if (preg_match("/<span id=\"productTitle\".+>(.+)<\/span>  <\/h1>/",$line,$result)) {
   echo "Title: $result[1]<br>\n";
}

NOTE: $line is the current line of HTML code we are looking at. If the line has an id of “productTitle” then I know it contains the title which I capture into $result using the parenthesized .+. I determined the string to look for by inspecting sample pages at Amazon.

Take time to read the documentation for the preg_match command rather than blindly trying to use it. Then come up with additional commands to extract the other required elements.
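As a companion to the documentation, the sketch below shows what preg_match returns and how the capture array is laid out. The sample line is invented for illustration and real Amazon markup may differ. The pattern is a variation on the one above, not a replacement for it: it uses [^>]* and a non-greedy (.+?) so the match stops at the first closing span tag.

```php
<?php
// Hypothetical sample line; real Amazon markup may differ.
$line = '<h1><span id="productTitle" class="a-size-large">  Example Book Title  </span></h1>';

// [^>]* stays inside the opening tag; (.+?) is a non-greedy capture group.
$pattern = '/<span id="productTitle"[^>]*>(.+?)<\/span>/';

$found = preg_match($pattern, $line, $result);   // 1 on a match, 0 on no match
if ($found === 1) {
    // $result[0] holds the full matched text, $result[1] the first capture group.
    $title = trim($result[1]);
    echo "Title: $title<br>\n";
}

// A line without the expected id does not match, so preg_match returns 0.
$missFound = preg_match($pattern, '<span id="other">x</span>', $miss);
```

Trimming the captured text removes the padding whitespace that often surrounds values in generated HTML.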

Suggested, But Not Required

Next week we will start the first stage of a series of assignments that will build on one another for most of the semester. One step of next week's assignment is to build HTML and CSS pages for that site as a starting point. That task alone will likely take a couple of hours. After that you will add a fair amount of database and form-handling code, which will also take considerable time. Since this week's lab day and homework assignment were short, my advice is that you build your HTML and CSS starting point this week.

If you choose to do this here are some thoughts to consider:

The site we will be building is a simple community book selling website in which users can list books for sale (title, condition, and price) and interested shoppers can contact the seller to arrange for the sale.
The application will need a menu with three items: Home, Add Book, and Login.
For the next homework you'll need a home page that will list the books currently for sale (show title and price only). You can just put in a couple of hard-coded books to get styling worked out.
The “Add Book” page should use the same basic format as the home page but should present a form that allows a user to enter a book title, the condition of the book, and the price of the book.
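If you want to generate the hard-coded listing from PHP rather than typing the markup by hand, a minimal sketch might look like the following. The book titles and prices are invented placeholders for working out styling, and renderBookList() and the book-list class name are hypothetical.

```php
<?php
// Placeholder data; titles and prices are invented for styling purposes only.
$books = [
    ["title" => "Sample Book One", "price" => 12.50],
    ["title" => "Sample Book Two", "price" => 8.00],
];

// renderBookList() is a hypothetical helper name.
// It builds an unordered list showing only the title and price of each book.
function renderBookList(array $books): string
{
    $html = "<ul class=\"book-list\">\n";
    foreach ($books as $book) {
        $title = htmlspecialchars($book["title"]);   // escape HTML special characters
        $price = number_format($book["price"], 2);   // always show two decimals
        $html .= "  <li>$title: \$$price</li>\n";
    }
    return $html . "</ul>\n";
}

$out = renderBookList($books);
echo $out;
```

Keeping the data in an array makes it straightforward to swap in database rows later without changing the markup.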

It would make sense for you to use the basic styling and page arrangement you used for your final web page from last semester, but you are welcome to branch out and do something different.
