Please refer to the original MediaWiki documentation[1]
A Typical Import Process
- Extract text content from some source, ideally Wikipedia or another reputable data source with an existing data model.
- In the case of Wikipedia, every page can be treated as a single text file, and each page has a unique ID provided in either the XML or the MediaWiki data model.
- Dump all the textual content into a directory, one page per file, and keep the page names in a comma-delimited file (CSV) in which each line associates a unique ID with a page name.
- Move the files to a place that MediaWiki's maintenance scripts can easily access, such as /var/www/html/images. For Docker deployments, this directory is likely to be mapped onto the host machine's hard drive, so it is easy to put the text files there, for example in INPUTDATA_DIR under /var/www/html/images.
- Go to a terminal on the host machine that serves MediaWiki, then run the importTextFiles.php script as follows:
root@hostname:/var/www/html/# php ./maintenance/importTextFiles.php -s "Loading Textual Content from external sources" --overwrite --use-timestamp ./images/INPUTDATA_DIR/*
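Note that importTextFiles.php derives each page title from the file name (minus any extension), so the imported pages will initially appear under their ID-based names; the sections below rename them. A quick sanity check before importing — the file names shown here are hypothetical:
ls ./images/INPUTDATA_DIR/ | head -n 3   # expect one file per page, named by its ID, e.g. 169284
ls ./images/INPUTDATA_DIR/ | wc -l       # total number of pages to be imported
For Docker deployments, the same import can also be run from the host through docker exec; this is a sketch assuming the container is named mediawiki and MediaWiki lives at /var/www/html inside it:
docker exec -it mediawiki sh -c 'php /var/www/html/maintenance/importTextFiles.php -s "Loading Textual Content from external sources" --overwrite --use-timestamp /var/www/html/images/INPUTDATA_DIR/*'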
Preprocess file names
Given a list of input data files (for example, files from Wikipedia's Module collection) and a CSV that maps unique IDs to page names, the following command produces a smaller file containing only the replacement entries for the files that are actually present:
ls ./images/TEMPLATE/ | xargs -I{} grep {} wp_articles_template.csv > template_replacement.txt
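For illustration, with hypothetical IDs and page names, the input CSV and the resulting file could look like this:
# wp_articles_template.csv — one "ID,page name" pair per line:
169284,Template:Infobox_person
169285,Template:Citation_needed
# template_replacement.txt — only the lines whose IDs match a file in ./images/TEMPLATE/:
169284,Template:Infobox_person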
When performing data extraction, one usually uses a comma (,) to delimit fields. To be compatible with the moveBatch.php syntax, we need to replace it with a pipe (|):
sed -i "s/,/|/g" template_replacement.txt
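After the substitution, each line is in the old name|new name form that moveBatch.php expects, mapping an imported ID page to its proper title (hypothetical example):
169284|Template:Infobox_person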
Move pages to their proper names
After the replacement file is ready, one can simply run the moveBatch.php script to change the page names.
php ./maintenance/moveBatch.php --r="load template data" --noredirects template_replacement.txt
Note: The original instructions in the MediaWiki manual show a --u=user option. The string user should be the name of a user that exists in the database of the wiki you are loading the data into. For simplicity, the --u option can be omitted most of the time. If you insist on using it, make sure the user name already exists in the database; otherwise the script will throw an exception.
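If you do want to attribute the moves to a specific account, the call looks like the following sketch (the user name DataImportBot is hypothetical and must already exist on the wiki):
php ./maintenance/moveBatch.php --u="DataImportBot" --r="load template data" --noredirects template_replacement.txt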