

Question

Python is the language.

Comments if possible.

1. Your program will be parsing and analyzing log files from an Apache web server. The first thing your program must do is retrieve the log file across the network. It is available here: https://s3.amazonaws.com/tcmg412-fall2016/http_access_log
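As a minimal sketch of the retrieval step, the standard library can stream the file to disk. The function name download_log and the local file name are illustrative assumptions; only the URL comes from the assignment text.

```python
import shutil
import urllib.request

# URL taken from the assignment text.
LOG_URL = "https://s3.amazonaws.com/tcmg412-fall2016/http_access_log"


def download_log(url, dest_path):
    """Stream the resource at `url` into `dest_path` and return the path.

    `download_log` is a hypothetical helper name, not part of any library.
    """
    # urlopen returns a file-like response; copyfileobj copies it in chunks,
    # so the whole log never has to fit in memory at once.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

A call such as download_log(LOG_URL, "http_access_log") would then leave a local copy next to the program file.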

2. Once you download the file, you will be parsing the file in order to answer several questions:

How many total requests were made in the time period represented in the log?

How many requests were made on each day? per week? per month?

What percentage of the requests were not successful (any 4xx status code)?

What percentage of the requests were redirected elsewhere (any 3xx codes)?

What was the most-requested file?

What was the least-requested file?
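One possible way to answer the questions above is a single pass over the lines with a few counters. This is only a sketch: it assumes the log uses the Apache Common Log Format (a bracketed timestamp, a quoted request line, then a three-digit status code) and English month abbreviations, neither of which the assignment states. The names LINE_RE and summarize are illustrative.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout (Common Log Format):
#   host - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150
LINE_RE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]*\]\s+'
    r'"\S+\s+(?P<path>\S+)[^"]*"\s+(?P<status>\d{3})'
)


def summarize(lines):
    """Count requests per day/week/month, status classes, and files."""
    per_day, per_week, per_month = Counter(), Counter(), Counter()
    files, status_class = Counter(), Counter()
    total = 0
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        total += 1
        # %b relies on English month names (true in the default C locale).
        day = datetime.strptime(match.group("day"), "%d/%b/%Y").date()
        iso = day.isocalendar()
        per_day[day.isoformat()] += 1
        per_week[(iso[0], iso[1])] += 1  # (ISO year, ISO week number)
        per_month[day.strftime("%Y-%m")] += 1
        files[match.group("path")] += 1
        status_class[match.group("status")[0]] += 1
    ranked = files.most_common()  # ties keep first-seen order
    return {
        "total": total,
        "per_day": dict(per_day),
        "per_week": dict(per_week),
        "per_month": dict(per_month),
        "pct_4xx": 100.0 * status_class["4"] / total if total else 0.0,
        "pct_3xx": 100.0 * status_class["3"] / total if total else 0.0,
        "most_requested": ranked[0][0] if ranked else None,
        "least_requested": ranked[-1][0] if ranked else None,
    }
```

If several files tie for the lowest count, this sketch reports only one of them; listing every file with the minimum count may be the more honest answer for the least-requested question.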

3. You will need to output this data to the screen. The format you choose for this is up to you (human readable, machine readable, plain text, JSON, etc), but your decisions and the implementation should be logical and consistent.

4. Finally, it was decided that the logs should be broken into separate files by month. Your program should split the log file into 12 smaller files, where the data stored in each file are the log events for a single month. These should be written to disk in the same directory as your program file, in a logical and consistent manner.
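The month split could be sketched as follows. The output naming scheme (YYYY-MM.log) is an assumption chosen so that the same month in different years does not collide; the assignment only asks for a logical and consistent layout. The timestamp pattern again assumes the Common Log Format, and split_by_month is an illustrative name.

```python
import os
import re

# Month abbreviations as they appear in Common Log Format timestamps.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Captures the month and year from a timestamp such as [24/Oct/1994:13:41:41 -0600].
STAMP_RE = re.compile(r"\[\d{2}/(\w{3})/(\d{4}):")


def split_by_month(lines, out_dir="."):
    """Write each log line to a per-month file and return the file names."""
    handles = {}
    try:
        for line in lines:
            match = STAMP_RE.search(line)
            if match is None or match.group(1) not in MONTHS:
                continue  # skip lines without a recognizable timestamp
            name = "{}-{:02d}.log".format(match.group(2), MONTHS[match.group(1)])
            if name not in handles:
                handles[name] = open(os.path.join(out_dir, name), "w")
            # Keep exactly one newline per event, whether or not the input had one.
            handles[name].write(line.rstrip("\n") + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Passing an open file object as lines works directly, since iterating a file yields one line at a time.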

Explanation / Answer

Reports in the Analytics user interface are generally organized into these categories:

Each report, regardless of the section to which it belongs, consists of two primary fields—metrics, and dimensions. Analytics reports use a combination of metrics and dimensions to describe key types of user activity to your website, such as which search engine users used to reach your site in the Search Engines report, or which pages on your site received the most traffic in the Top Content report. Similarly, the Core Reporting API groups both dimensions and metrics into several categories of report data. By choosing your own combinations of dimensions and metrics, you can create a customized report tailored to your specifications.

Keep in mind that not all categories of data can be combined in a single request. When you request a combination of dimensions and metrics that is not allowed, you will receive an error response instead of an actual feed. This causes no harm, so feel free to experiment with combinations of metrics and dimensions that seem most useful. For a detailed list of the metrics and dimensions you can query, see the Dimensions & Metrics Reference.

To understand how Analytics data is applied to the view (profile) you are requesting data for, see the background document on Accounts and Views (Profiles).

Data Feed Request

This section describes all the elements and parameters that make up a data feed request. In general, you provide the table ID corresponding to the view (profile) you want to retrieve data from, choose the combination of dimensions and metrics, and provide a date range along with other parameters in a query string.

https://www.googleapis.com/analytics/v2.4/data

?ids=ga:12345

&dimensions=ga:source,ga:medium

&metrics=ga:sessions,ga:bounces

&sort=-ga:sessions

&filters=ga:medium%3D%3Dreferral

&segment=gaid::-10 OR segment=sessions::condition::ga:medium%3D%3Dreferral

&start-date=2008-10-01

&end-date=2008-10-31

&start-index=10

&max-results=100

&prettyprint=true
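A request like the one above can be assembled programmatically. The following is a sketch using only the Python standard library; build_feed_url and REQUIRED are illustrative names, and the required-parameter check simply mirrors the parameters this document marks as Required.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/analytics/v2.4/data"

# Parameters marked "Required" in this document.
REQUIRED = ("ids", "metrics", "start-date", "end-date")


def build_feed_url(params):
    """Return the data feed URL for a dict of query parameters."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    # safe=":," leaves the ga: prefix and list commas readable; characters
    # such as "=" inside a filter value are still percent-encoded.
    return BASE_URL + "?" + urlencode(params, safe=":,")


url = build_feed_url({
    "ids": "ga:12345",
    "dimensions": "ga:source,ga:medium",
    "metrics": "ga:sessions,ga:bounces",
    "sort": "-ga:sessions",
    "filters": "ga:medium==referral",
    "start-date": "2008-10-01",
    "end-date": "2008-10-31",
    "start-index": 10,
    "max-results": 100,
})
```

The resulting string carries the same parameters as the sample request, with the filter value encoded as ga:medium%3D%3Dreferral.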

Base URL

https://www.googleapis.com/analytics/v2.4/data

Required.

The base URL for a data feed request.

ids

ids=ga:12345

Required.

The unique table ID used to retrieve the Analytics Report data. This ID is provided by the <ga:tableId> element for each entry in the account feed. This value is composed of the ga: namespace and the view (profile) ID of the web property.

dimensions

dimensions=ga:source,ga:medium

Optional.

The dimensions parameter defines the primary data keys for your Analytics report, such as ga:browser or ga:city. Use dimensions to segment your metrics. For example, while you can ask for the total number of pageviews to your site, it might be more interesting to ask for the number of pageviews segmented by browser. In this case, you'll see the number of pageviews from Firefox, Internet Explorer, Chrome, and so forth.

When the value of the dimension cannot be determined, Analytics uses the special string (not set). There are a number of situations where the dimension value will not be set. For example, suppose you want to query your reports for country, city, and pageviews, and suppose the following is true for your view (profile) data:

The results for this request would return data as illustrated by the following example table.

When using dimensions in a feed request, be aware of the following constraints:

For more information and the list of all dimensions, see the Dimensions section in the Dimensions and Metrics Reference.

metrics

metrics=ga:sessions,ga:bounces

Required.

The aggregated statistics for user activity in a view (profile), such as clicks or pageviews. When queried alone, metrics provide the total values for the requested date range, such as overall pageviews or total bounces. However, when requested with dimensions, values are segmented by the dimension. For example, ga:pageviews requested with ga:country returns the total pageviews per country. When requesting metrics, keep in mind:

For more information and the list of all metrics, see the Metrics section in the Dimensions and Metrics Reference.

sort

sort=-ga:sessions

Optional.

Indicates the sorting order and direction for the returned data. For example, the following parameter would first sort by ga:browser and then by ga:pageviews in ascending order.

sort=ga:browser,ga:pageviews

If you do not indicate a sorting order in your query, the data is sorted by dimension from left to right in the order listed. For example, if the query looks like this:

dimensions=ga:browser,ga:country

Sorting occurs first by ga:browser, then by ga:country. However, if the query uses a different order:

dimensions=ga:country,ga:browser

Sorting occurs first by ga:country, then by ga:browser.

When using the sort parameter, keep in mind the following:

The sort direction can be changed from ascending to descending by using a minus sign (-) prefix on the requested field. For example:
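As a small illustration of the minus-sign rule, a helper could build the sort parameter from (field, descending) pairs. The name sort_param is hypothetical; the field names are the ones used in this document.

```python
def sort_param(*fields):
    """Build a sort parameter from (field_name, descending) pairs."""
    # A leading "-" flips that field from ascending to descending.
    return "sort=" + ",".join(
        ("-" if descending else "") + name for name, descending in fields
    )


descending_sessions = sort_param(("ga:sessions", True))
browser_then_views = sort_param(("ga:browser", False), ("ga:pageviews", False))
```

Here descending_sessions is "sort=-ga:sessions" and browser_then_views is "sort=ga:browser,ga:pageviews".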

filters

filters=ga:medium%3D%3Dreferral

Optional.

The filters query string parameter restricts the data returned from your request to the Analytics servers. When you use the filters parameter, you supply a dimension or metric you want to filter, followed by the filter expression. For example, the following feed query requests ga:pageviews and ga:browser from view (profile) 12134, where the ga:browser dimension starts with the string Firefox:

https://www.googleapis.com/analytics/v2.4/data
?ids=ga:12134
&dimensions=ga:browser
&metrics=ga:pageviews
&filters=ga:browser%3D~%5EFirefox
&start-date=2007-01-01
&end-date=2007-12-31

Filtered queries restrict the rows that do (or do not) get included in the result. Each row in the result is tested against the filter: if the filter matches, the row is retained and if it doesn't match, the row is dropped.

Filter Syntax

A single filter uses the form:

ga:name operator expression

In this syntax:

Filter Operators

There are six filter operators for dimensions and six operators for metrics. The operators must be URL encoded in order to be included in URL query strings.

Tip: Use the Data Feed Query Explorer to design filters that need URL encoding, since the explorer will automatically URL encode necessary strings and spaces for you.
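If you prefer to encode filters in code, Python's urllib.parse.quote can do it. This is a sketch; encode_filter is an illustrative name, and keeping ":" unescaped is a readability choice that matches the examples in this document.

```python
from urllib.parse import quote


def encode_filter(expression):
    """Percent-encode a filter expression for use in a query string."""
    # safe=":" keeps the ga: prefix readable; operators such as ==, ^, and >
    # are percent-encoded, while "~" is left as-is (it is URL-safe).
    return quote(expression, safe=":")


exact_match = encode_filter("ga:medium==referral")
regex_match = encode_filter("ga:browser=~^Firefox")
```

These produce ga:medium%3D%3Dreferral and ga:browser%3D~%5EFirefox, the same encoded forms used in the sample requests.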

Filter Expressions

There are a couple of important rules for filter expressions:

For more information on common regular expression matches supported by Google Analytics, see What are regular expressions in the Help Center.

Combining Filters

Filters can be combined using OR and AND boolean logic. This allows you to effectively extend the 128-character limit of a filter expression.

OR

The OR operator is defined using a comma (,). It takes precedence over the AND operator and may NOT be used to combine dimensions and metrics in the same expression.

Examples: (each must be URL encoded)

Country is either (United States OR Canada):
ga:country==United%20States,ga:country==Canada

Firefox users on (Windows OR Macintosh) operating systems:
ga:browser==Firefox;ga:operatingSystem==Windows,ga:operatingSystem==Macintosh

AND

The AND operator is defined using a semi-colon (;). It is evaluated after the OR operator (OR takes precedence) and CAN be used to combine dimensions and metrics in the same expression.

Examples: (each must be URL encoded)

Country is United States AND the browser is Firefox:
ga:country==United%20States;ga:browser==Firefox

Country is United States AND language does not start with 'en':
ga:country==United%20States;ga:language!~^en.*

Operating system is (Windows OR Macintosh) AND browser is (Firefox OR Chrome):
ga:operatingSystem==Windows,ga:operatingSystem==Macintosh;ga:browser==Firefox,ga:browser==Chrome

Country is United States AND sessions are greater than 5:
ga:country==United%20States;ga:sessions>5
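The combination rules can be sketched with two tiny helpers. The names any_of and all_of are hypothetical; they only join strings with the comma and semi-colon separators described above, and the result still needs URL encoding before it goes into a request.

```python
def any_of(*filters):
    """Combine filters with OR (comma)."""
    return ",".join(filters)


def all_of(*filters):
    """Combine filters or OR-groups with AND (semi-colon)."""
    return ";".join(filters)


# (Windows OR Macintosh) AND (Firefox OR Chrome)
combined = all_of(
    any_of("ga:operatingSystem==Windows", "ga:operatingSystem==Macintosh"),
    any_of("ga:browser==Firefox", "ga:browser==Chrome"),
)
```

Because OR binds more tightly than AND, nesting works only in this direction: AND over OR-groups. An OR over AND-groups cannot be written with these two separators alone.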

segment

segment=gaid::-10
segment=sessions::condition::ga:medium%3D%3Dreferral
segment=users::condition::ga:browser%3D%3DChrome

Optional.

For complete details on how to request a segment in the Core Reporting API see the Segments Dev Guide.

For a conceptual overview of segments, see the Segments Feature Reference and Segments in the Help Center.

Dimensions and Metrics allowed in segments.
Not all dimensions and metrics can be used in segments. To review which dimensions and metrics are allowed in segments visit the Dimensions and Metrics Explorer.

Note: The dynamic:: prefix has been deprecated as of March 27, 2014. It is recommended that you migrate to the new syntax as soon as possible.

start-date

start-date=2009-04-20

Required.

All Analytics feed requests must specify a beginning and ending date range. If you do not indicate start- and end-date values for the request, the server returns a request error. Date values are in the form YYYY-MM-DD.

The earliest valid start-date is 2005-01-01. There is no upper limit restriction for a start-date. However, setting a start-date that is too far in the future will most likely return empty results.

end-date

end-date=2009-05-20

Required.

All Analytics feed requests must specify a beginning and ending date range. If you do not indicate start- and end-date values for the request, the server returns a request error. Date values are in the form YYYY-MM-DD.

The earliest valid end-date is 2005-01-01. There is no upper limit restriction for an end-date. However, setting an end-date that is too far in the future might return empty results.
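A client could check both dates before sending a request. The sketch below enforces the YYYY-MM-DD form and the 2005-01-01 lower bound stated above; rejecting an end-date earlier than the start-date is an extra assumption, not something this document specifies. The function names are illustrative.

```python
from datetime import date, datetime

EARLIEST = date(2005, 1, 1)  # earliest valid date per this document


def parse_feed_date(text):
    """Parse a YYYY-MM-DD string, rejecting dates before 2005-01-01."""
    parsed = datetime.strptime(text, "%Y-%m-%d").date()  # ValueError if malformed
    if parsed < EARLIEST:
        raise ValueError("date precedes 2005-01-01: " + text)
    return parsed


def validate_range(start_text, end_text):
    """Return (start, end) as date objects after basic sanity checks."""
    start, end = parse_feed_date(start_text), parse_feed_date(end_text)
    if end < start:
        # Assumption: a reversed range is treated as a caller error.
        raise ValueError("end-date precedes start-date")
    return start, end
```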

start-index

start-index=10

Optional.

If not supplied, the starting index is 1. (Feed indexes are 1-based. That is, the first entry is entry 1, not entry 0.) Use this parameter as a pagination mechanism along with the max-results parameter for situations when totalResults exceeds 10,000 and you want to retrieve entries indexed at 10,001 and beyond.

max-results

max-results=100

Optional.

Maximum number of entries to include in this feed. You can use this in combination with start-index to retrieve a subset of elements, or use it alone to restrict the number of returned elements, starting with the first. If you do not use the max-results parameter in your query, your feed returns the default maximum of 1000 entries.

The Analytics Core Reporting API returns a maximum of 10,000 entries per request, no matter how many you ask for. It can also return fewer entries than requested, if there aren't as many dimension segments as you expect. For instance, there are fewer than 300 possible values for ga:country, so when segmenting only by country, you can't get more than 300 entries, even if you set max-results to a higher value.
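Paging with start-index and max-results can be sketched as a generator of parameter pairs. The name page_params is hypothetical, and total_results stands for the totalResults value reported by a previous response.

```python
def page_params(total_results, max_results=10000):
    """Yield start-index/max-results pairs that cover every entry."""
    if not 1 <= max_results <= 10000:
        raise ValueError("max-results must be between 1 and 10,000")
    # Feed indexes are 1-based, so the first page starts at 1.
    for start in range(1, total_results + 1, max_results):
        yield {"start-index": start, "max-results": max_results}


pages = list(page_params(25000))
```

For 25,000 total results this yields three pages starting at 1, 10,001, and 20,001; the last page simply returns fewer entries than requested.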

prettyprint

prettyprint=true

Optional.

Adds extra whitespace to the feed XML to make it more readable. This can be set to true or false, where the default is false. Use this parameter for debugging if you're looking at the feed responses directly.

Data Feed Response

The data feed returns data that is entirely dependent on the fields you specify in your request using the dimensions and metrics parameters. For a list of the available dimensions and metrics that you can query in the data feed, see the Dimensions & Metrics Reference. This section describes the general structure of the data feed as returned in XML, with a description for the key elements of interest for the data feed.

Data Feed Error Codes

The Core Reporting API returns a 200 HTTP status code if your request is successful. If an error or problem occurs with your request, the Data Feed returns HTTP status codes based on the type of error, along with a reason describing the nature of the error.

Note: The descriptive reason returned by the API may change at any time. For that reason, your application should not use string matching on the reason, but rather rely only on the error code.

The following list shows the possible error codes and corresponding reasons.

Sampling

Google Analytics calculates certain combinations of dimensions and metrics on the fly. To return the data in a reasonable time, Google Analytics only processes a sample of the data.

If the data you see from the Core Reporting API doesn't match the web interface, use the containsSampledData top-level response element to determine if the data has been sampled.

Use the containsSampledData top-level response element to determine if any metric values in the response entries contain sampled data.

See Sampling for a general description of sampling and how it is used with Google Analytics.

Handling Large Data Results

If you expect your query to return large result sets, the guidelines below will help you optimize your API query, avoid errors, and minimize quota overruns. Keep in mind that we establish a baseline level of optimization for any given API request by allowing a maximum number of dimensions (7) and metrics (10). While some queries that specify large numbers of metrics and dimensions can take longer to process than others, limiting the number of requested metrics does not generally improve query performance. Instead, you can use the following techniques for the best performance results.

Paging

Paging through results can be a useful way to break large results sets into manageable chunks. The data feed tells you how many matching rows exist, along with giving you the requested subset of rows. If there is a high ratio of total matching rows to number of rows actually returned, then the individual queries might be taking longer than necessary. If you need only a limited number of rows, such as for display purposes, setting an explicit limit is fine. However, if the purpose of your application is to process a large set of results in its entirety, then it is most efficient to request the maximum allowed rows.

Splitting the Query by Date Range

Instead of paging through the date-keyed results of one long date range, consider forming separate queries for one week—or even one day—at a time. For a very large data set, it may still be necessary to page through results, such as when a request for one day still contains more than the maximum number of result rows per query. In any case, if the number of matching rows for your query is higher than the max results rows, breaking apart the date range may improve the total time to retrieve the answer. This is true whether the queries are being sent in a single thread or in parallel.
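Splitting a long range into weekly or daily queries might look like the sketch below. The helper name split_date_range is illustrative, and it assumes both start-date and end-date are inclusive, so consecutive chunks do not overlap.

```python
from datetime import date, timedelta


def split_date_range(start, end, days=7):
    """Yield (start-date, end-date) strings covering start..end in chunks."""
    chunk_start = start
    while chunk_start <= end:
        # Each chunk spans `days` days; the last one is clipped to `end`.
        chunk_end = min(chunk_start + timedelta(days=days - 1), end)
        yield chunk_start.isoformat(), chunk_end.isoformat()
        chunk_start = chunk_end + timedelta(days=1)


weeks = list(split_date_range(date(2008, 10, 1), date(2008, 10, 31)))
```

For October 2008 this gives five chunks, the last covering 2008-10-29 through 2008-10-31. Passing days=1 produces one query per day.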

Use Filters Intelligently

Consider whether additional filters might reduce the data while still providing the information you need. Can a dimension filter, such as a regular expression match on a page path, return the subset of the data you care about? Can value thresholds (such as ignoring matches with fewer than 5 sessions) filter out less interesting results? This approach can be used as a complement to any of the other suggestions mentioned earlier. With this technique, the actual time to get each result set is likely to be about the same, but fewer result pages would be retrieved, thus reducing the overall interaction time and minimizing impact on your quota allowance.

Country     City        Pageviews
(not set)   (not set)   23
Country A   (not set)   13
Country A   City A1     10
Country A   City A2     5
Country B   (not set)   10
Country B   City B1     5
Country B   City B2     13