Dataset for "Statistics and Social Network of YouTube Videos"

Xu Cheng and Cameron Dale and Jiangchuan Liu
{xuc, camerond, jcliu} [at] cs [dot] sfu [dot] ca
School of Computing Science
Simon Fraser University
British Columbia, Canada

NOTE: The datasets presented below are for academic use only.

1. Datasets of Normal Crawl

We consider all YouTube videos to form a directed graph, where each video is a node. If a video b is in the related video list (first 20 only) of a video a, then there is a directed edge from a to b. Our crawler uses a breadth-first search to find videos in the graph. We define an initial set of depth-0 video IDs, which the crawler reads into a queue at the beginning of the crawl. When processing each video, it checks the list of related videos and adds any new ones to the queue.
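The breadth-first procedure described above can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the function name bfs_crawl, the callback get_related (standing in for whatever fetches a video's related list) and the parameter names are all assumptions made for this example.

```python
from collections import deque


def bfs_crawl(seed_ids, get_related, max_depth=3, max_related=20):
    """Breadth-first traversal of the related-video graph.

    seed_ids    -- iterable of depth-0 video IDs
    get_related -- hypothetical callback: video ID -> list of related IDs
    Returns (depth_of, edges): a dict mapping each discovered video ID to
    its depth, and a list of directed edges (a, b) meaning "b is in the
    related list of a".
    """
    depth_of = {}
    queue = deque()
    for vid in seed_ids:
        if vid not in depth_of:
            depth_of[vid] = 0
            queue.append(vid)

    edges = []
    while queue:
        vid = queue.popleft()
        # Only the first max_related related videos define edges.
        for rel in get_related(vid)[:max_related]:
            edges.append((vid, rel))
            # Enqueue unseen videos unless that would exceed max_depth.
            if rel not in depth_of and depth_of[vid] < max_depth:
                depth_of[rel] = depth_of[vid] + 1
                queue.append(rel)
    return depth_of, edges
```

With a toy graph held in a dictionary, get_related can simply be `lambda v: graph.get(v, [])`. Videos at the maximum depth are still processed, so their outgoing edges are recorded even though their unseen neighbours are not enqueued.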

Given a video ID, the crawler first extracts information from the YouTube API, which contains all the meta-data except age, category and related videos. The crawler then scrapes the video's webpage to obtain the remaining information. We record the following information for each YouTube video, in this order; the fields are separated by '\t' in the data file.

video ID      an 11-character string, which is unique
uploader      a string of the video uploader's username
age           an integer number of days between the date when the video was uploaded and Feb. 15, 2005 (YouTube's establishment)
category      a string of the video category chosen by the uploader
length        an integer, the video length
views         an integer, the number of views
rate          a float, the video's rating
ratings       an integer, the number of ratings
comments      an integer, the number of comments
related IDs   up to 20 strings of the related video IDs
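As an illustration of the record layout listed above, the following sketch parses one tab-separated line into a typed record. The names VideoRecord and parse_record are hypothetical, and returning None for lines with fewer than nine fields is an assumption about how incomplete records might be handled, not documented behaviour of the data files.

```python
from typing import List, NamedTuple, Optional


class VideoRecord(NamedTuple):
    video_id: str
    uploader: str
    age: int
    category: str
    length: int
    views: int
    rate: float
    ratings: int
    comments: int
    related_ids: List[str]


def parse_record(line: str) -> Optional[VideoRecord]:
    """Parse one '\t'-separated line; return None if fields are missing."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 9:
        return None  # assumption: treat short lines as incomplete records
    # Everything after the ninth field is a related video ID (at most 20).
    related = [f.strip() for f in fields[9:] if f.strip()]
    return VideoRecord(
        video_id=fields[0],
        uploader=fields[1],
        age=int(fields[2]),
        category=fields[3],
        length=int(fields[4]),
        views=int(fields[5]),
        rate=float(fields[6]),
        ratings=int(fields[7]),
        comments=int(fields[8]),
        related_ids=related[:20],
    )
```

The directed edges of the video graph then follow directly from a parsed record: `[(rec.video_id, r) for r in rec.related_ids]`.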

Our first crawl was on February 22nd, 2007, and started with an initial set of videos from the lists of "Recently Featured", "Most Viewed", "Top Rated" and "Most Discussed", for "Today", "This Week", "This Month" and "All Time", which totalled 189 unique videos on that day. The crawl went to more than four depths, finding approximately 750 thousand videos in about five days. In the following weeks we ran the crawler every two to three days, each time defining the initial set of videos from the lists of "Most Viewed", "Top Rated" and "Most Discussed", for "Today" and "This Week", which is about 200 to 300 videos. On average, each crawl finds 73 thousand distinct videos in less than 9 hours.

All 35 datasets can be downloaded from here. In each package, there are: (1) "0.txt", "1.txt", "2.txt" and "3.txt" (and "4.txt" in "0222.zip"), storing the video data of depth 0, 1, 2 and 3 respectively; (2) "log.txt", a log file indicating the start and finish time, the number of videos crawled and the duration of each depth.

No. File (click to download) Date Total Crawled Videos Duration (seconds) Note
1 0222.zip Feb. 22nd, 2007 749361 432589 crawled to fifth depth, but did not finish
2 0301.zip Mar. 1st, 2007 155513 93352  
3 0302.zip Mar. 2nd, 2007 10324 4692 only crawled three depths
4 0303.zip Mar. 3rd, 2007 68343 30898  
5 0305.zip Mar. 5th, 2007 65754 21578  
6 0306.zip Mar. 6th, 2007 42775 14203  
7 0309.zip Mar. 9th, 2007 64239 29681  
8 0313.zip Mar. 13th, 2007 100382 42383  
9 0314.zip Mar. 14th, 2007 71347 31362  
10 0315.zip Mar. 15th, 2007 45338 19068  
11 0316.zip Mar. 16th, 2007 42628 17925  
12 0318.zip Mar. 18th, 2007 63725 32160  
13 0320.zip Mar. 20th, 2007 45239 18128  
14 0322.zip Mar. 22nd, 2007 53156 21199  
15 0325.zip Mar. 25th, 2007 62327 24960  
16 0327.zip Mar. 27th, 2007 50940 19849  
17 0329.zip Mar. 29th, 2007 103476 47433  
18 0403.zip Apr. 3rd, 2007 101693 39617  
19 0410.zip Apr. 10th, 2007 206590 120455 fourth depth was not completed
20 0413.zip Apr. 13th, 2007 83712 36535  
21 0418.zip Apr. 18th, 2007 91822 35555  
22 0420.zip Apr. 20th, 2007 112259 90939  
23 0422.zip Apr. 22nd, 2007 89017 29036  
24 0424.zip Apr. 24th, 2007 76936 24571  
25 0426.zip Apr. 26th, 2007 62897 20433  
26 0428.zip Apr. 28th, 2007 71794 21455  
27 0430.zip Apr. 30th, 2007 84675 26874  
28 0502.zip May 2nd, 2007 39238 13639 fourth depth was not completed due to a networking problem
29 0505.zip May 5th, 2007 62203 26814  
30 0507.zip May 7th, 2007 62440 19911  
31 0509.zip May 9th, 2007 67939 21681  
32 0511.zip May 11th, 2007 66595 20531  
33 0513.zip May 13th, 2007 67370 21312  
34 0515.zip May 15th, 2007 45071 21312  
35 0518.zip May 18th, 2007 51873 15880  
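Each package listed above holds one text file per crawl depth. The sketch below reads those files straight from a downloaded archive without unpacking it. It is only an illustration: the function name iter_depth_lines is hypothetical, and the code assumes the depth files may sit either at the top level of the archive or inside a folder, and that they can be decoded as UTF-8 (undecodable bytes are replaced).

```python
import io
import re
import zipfile

# Matches "0.txt", "1.txt", ... optionally inside a folder; skips "log.txt".
DEPTH_FILE = re.compile(r"(?:^|/)(\d+)\.txt$")


def iter_depth_lines(zip_source):
    """Yield (depth, line) for every non-empty line of the depth files.

    zip_source -- a path or a binary file-like object holding the archive
    """
    with zipfile.ZipFile(zip_source) as zf:
        members = []
        for name in zf.namelist():
            match = DEPTH_FILE.search(name)
            if match:
                members.append((int(match.group(1)), name))
        # Visit the files in order of increasing depth.
        for depth, name in sorted(members):
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
                for line in text:
                    line = line.rstrip("\r\n")
                    if line:
                        yield depth, line
```

For example, `collections.Counter(depth for depth, _ in iter_depth_lines("0301.zip"))` would count the records per depth, which can be compared with the figures in "log.txt".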

The new data crawled in 2008 are as follows:

No. File (click to download) Date Total Crawled Videos Duration (seconds) Note
1 080327.zip Mar. 27th, 2008 310005 72984  
2 080329.zip Mar. 29th, 2008 186671 44014  
3 080331.zip Mar. 31st, 2008 150382 38292  
4 080402.zip Apr. 2nd, 2008 154372 31394  
5 080404.zip Apr. 4th, 2008 155213 31903  
6 080406.zip Apr. 6th, 2008 128499 30830  
7 080408.zip Apr. 8th, 2008 133279 32044  
8 080412.zip Apr. 12th, 2008 123843 32691  
9 080414.zip Apr. 14th, 2008 86737 17580  
10 080416.zip Apr. 16th, 2008 101249 20779  
11 080418.zip Apr. 18th, 2008 114269 27237  
12 080422.zip Apr. 22nd, 2008 114269 28355  
13 080424.zip Apr. 24th, 2008 110232 30862  
14 080426.zip Apr. 26th, 2008 68205 16975  
15 080428.zip Apr. 28th, 2008 74816 17760  
16 080430.zip Apr. 30th, 2008 102969 21505  
17 080502.zip May 2nd, 2008 96684 25334  
18 080504.zip May 4th, 2008 69169 20761  
19 080506.zip May 6th, 2008 98766 19694  
20 080508.zip May 8th, 2008 84197 17148  
21 080510.zip May 10th, 2008 85094 23972  
22 080512.zip May 12th, 2008 91306 25485  
23 080514.zip May 14th, 2008 89824 39771  
24 080516.zip May 16th, 2008 88333 21908  
25 080518.zip May 18th, 2008 90320 25645  
26 080520.zip May 20th, 2008 78123 18765  
27 080522.zip May 22nd, 2008 73417 15274  
28 080524.zip May 24th, 2008 81537 18678  
29 080526.zip May 26th, 2008 73165 19818  
30 080528.zip May 28th, 2008 69112 23274  
31 080530.zip May 30th, 2008 72138 17795  
32 080601.zip Jun. 1st, 2008 69987 17197  
33 080603.zip Jun. 3rd, 2008 84000 18900  
34 080605.zip Jun. 5th, 2008 75995 - the crawl paused due to a networking problem
35 080609.zip Jun. 9th, 2008 77603 15333  
36 080611.zip Jun. 11th, 2008 72138 14528  
37 080613.zip Jun. 13th, 2008 72042 21381  
38 080615.zip Jun. 15th, 2008 68736 20136  
39 080617.zip Jun. 17th, 2008 74689 21144  
40 080619.zip Jun. 19th, 2008 72542 21808  
41 080621.zip Jun. 21st, 2008 55535 17371  
42 080623.zip Jun. 23rd, 2008 59165 15304  
43 080625.zip Jun. 25th, 2008 53232 18606  
44 080627.zip Jun. 27th, 2008 65906 17314  
45 080629.zip Jun. 29th, 2008 52219 13496  
46 080701.zip Jul. 1st, 2008 62065 15565  
47 080703.zip Jul. 3rd, 2008 62065 14607  
48 080705.zip Jul. 5th, 2008 51555 15052  
49 080707.zip Jul. 7th, 2008 63053 18723  
50 08079.zip Jul. 9th, 2008 54981 16367  
51 080711.zip Jul. 11th, 2008 65412 19833  
52 080713.zip Jul. 13th, 2008 48748 18690  
53 080715.zip Jul. 15th, 2008 54075 20403  
54 080717.zip Jul. 17th, 2008 48884 18205  
55 080719.zip Jul. 19th, 2008 72773 27226  
56 080721.zip Jul. 21st, 2008 46868 19656  
57 080723.zip Jul. 23rd, 2008 50290 22334  
58 080725.zip Jul. 25th, 2008 78680 31925  
59 080727.zip Jul. 27th, 2008 60205 32802  

2. Datasets of Updating Crawl

To study the growth trend of video popularity, we also updated the statistics of some previously found videos. For this crawl we only retrieved the number of views for relatively new videos (uploaded after February 15th, 2007). This crawl was performed once a week from March 5th to April 16th, 2007, which resulted in seven datasets.

The 7 datasets can be downloaded from here. In each package, there are: (1) "update.txt", storing the updated video data; it contains video ID, views, rate, ratings and comments; (2) "id.txt", containing all the video IDs to be updated; (3) "log.txt", a log file indicating the start and finish time, number of videos updated and the duration.

No. File (click to download) Date Total Updated Videos Duration (seconds)
1 0305u.zip Mar. 5th, 2007 53818 4858
2 0312u.zip Mar. 12th, 2007 79091 12931
3 0319u.zip Mar. 19th, 2007 140702 12489
4 0326u.zip Mar. 26th, 2007 185784 16473
5 0402u.zip Apr. 2nd, 2007 217170 17541
6 0409u.zip Apr. 9th, 2007 247526 20656
7 0416u.zip Apr. 16th, 2007 316313 27145
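Two weekly "update.txt" snapshots can be compared to measure how much each video's view count grew. The sketch below assumes that "update.txt" is tab-separated like the normal crawl files, with the video ID first and the view count second; the page does not state the separator, so this is a guess. The function names load_views and view_growth are hypothetical.

```python
def load_views(lines):
    """Map video ID -> view count from the lines of an "update.txt" file.

    Assumes tab-separated fields: video ID, views, rate, ratings, comments.
    Lines that are too short or have a non-integer view count are skipped.
    """
    views = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue
        try:
            views[fields[0]] = int(fields[1])
        except ValueError:
            continue
    return views


def view_growth(earlier, later):
    """Views gained between two snapshots, for videos present in both."""
    common = earlier.keys() & later.keys()
    return {vid: later[vid] - earlier[vid] for vid in common}
```

A typical use would be `view_growth(load_views(open(path_week1)), load_views(open(path_week2)))`, where the two paths point to "update.txt" files extracted from consecutive packages.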

In 2008, we updated the statistics of 161085 videos once a week for 21 weeks. The new data are also presented here:

No. File (click to download) Date Duration (seconds)
1 080416u.zip Apr. 16th, 2008 -
2 080423u.zip Apr. 23rd, 2008 9694
3 080430u.zip Apr. 30th, 2008 9895
4 080508u.zip May 8th, 2008 6227
5 080514u.zip May 14th, 2008 10474
6 080521u.zip May 21st, 2008 9968
7 080528u.zip May 28th, 2008 10846
8 080604u.zip Jun. 4th, 2008 12064
9 080611u.zip Jun. 11th, 2008 12476
10 080618u.zip Jun. 18th, 2008 10244
11 080625u.zip Jun. 25th, 2008 8651
12 080702u.zip Jul. 2nd, 2008 8659
13 080709u.zip Jul. 9th, 2008 8816
14 080716u.zip Jul. 16th, 2008 13399
15 080723u.zip Jul. 23rd, 2008 14716
16 080730u.zip Jul. 30th, 2008 13362
17 080806u.zip Aug. 6th, 2008 19496
18 080813u.zip Aug. 13th, 2008 14492
19 080820u.zip Aug. 20th, 2008 14784
20 080827u.zip Aug. 27th, 2008 14476
21 080903u.zip Sep. 3rd, 2008 14683

3. Datasets of Video File Size and Bitrate

We have also separately crawled the video file size and video bitrate information. To get the file size, the crawler retrieves the response information from the server when requesting to download the video file and extracts the information on the size of the download. Some videos have the bitrate embedded in the FLV video meta-data, which the crawler extracts after downloading the meta-data of the video file.

In "0523.zip", there are: "idlength.txt", containing the video IDs and video lengths; "size.txt", adding the video file size (an integer number of bytes) to the previous information; "log.txt", a log file.

In "0628.zip", there are: "idlength.txt", containing the video IDs and video lengths; "rate.txt", adding the video bitrate (a double, in kbps) to the previous information; VBR videos yield an invalid bitrate value, so for them we use the video lengths and sizes to calculate the average bitrate; "log.txt", a log file.

In "080908sizerate.zip", there are: "idlength.txt", containing the video IDs and video lengths; "sizerate.txt", adding the video file size (an integer) and bitrate (a double, in kbps) to the previous information; VBR videos yield an invalid bitrate value, for which we record "VBR"; "log.txt", a log file.
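For VBR videos, the average bitrate mentioned for "0628.zip" can be derived from the file size and the video length. The sketch below shows one way to do this, assuming the size is in bytes, the length is in seconds, and 1 kbps means 1000 bits per second; the page does not state these conventions, and the function name average_bitrate_kbps is hypothetical.

```python
def average_bitrate_kbps(size_bytes, length_seconds):
    """Average bitrate in kbps from file size and duration.

    Assumes size in bytes, length in seconds, and 1 kbps = 1000 bit/s.
    """
    if length_seconds <= 0:
        raise ValueError("video length must be positive")
    bits = size_bytes * 8
    return bits / length_seconds / 1000.0


# A 4,000,000-byte file lasting 100 seconds averages 320 kbps.
print(average_bitrate_kbps(4_000_000, 100))  # 320.0
```

The result includes any audio and container overhead in the file, so it is an upper bound on the video stream's own average bitrate.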

No. File (click to download) Date Total Crawled Duration (seconds) Note
1 0523.zip May 23rd, 2007 199691 88470 video file size
2 0628.zip Jun. 28th, 2007 199691 81082 video bitrate
3 080908sizerate.zip Sep. 8th, 2008 153710 104804 video file size and bitrate

4. Datasets of User Information

We have collected information about YouTube users. The crawler retrieves the number of uploaded videos and friends of each user from the YouTube API, for a total of more than 1 million users. There are: "userid.txt", containing all the usernames collected in previous datasets; "user.txt", containing the number of uploads, watches and friends, in order.

In 2008, we also re-collected the information about YouTube users, for a total of more than 2 million users. There are: "userid.txt", containing all the usernames collected in all normal crawl datasets; "user.txt", containing the number of uploads and friends, in order.

No. File (click to download) Date Total Updated Users Duration (seconds) Note
1 0528.zip May 28th, 2007 1062324 68397
2 080903user.zip Sep. 3rd, 2008 2139109 254709

Copyright 2008 © NETSG, School of Computing Science, Simon Fraser University