Cloud Software vs. Human Crowd in Plagiarism Detection
Copy Hunters
Sensational plagiarism cases have recently stirred up the academic community. Couldn't such work be checked for originality with software? At a Berlin university, a team of researchers has been studying plagiarism detection systems for several years, and has now run the five best against Guttenberg's thesis.

Not only universities, but also publishers and other businesses that make publications available to the public would welcome a kind of silver bullet that can tell in no time whether a work available in digital form is plagiarized. At universities, some submissions would then not even have to be read by the teachers; they could be reported straight to the examination board.

Guttenberg's doctoral supervisor has claimed that such programs were not that advanced back then. The makers of the plagiarism detection system Turnitin disagreed: "Mr. Guttenberg's work could have been run through our program. That would have spared him much trouble." Vain self-praise?

Since 2004, so-called plagiarism detection systems have been tested at the Berlin university. Over several years, many test cases in different languages have been developed, covering different types of plagiarism as well as original works, to put these systems through their paces. The manufacturers have to be willing to make their system available for testing free of charge, which has sometimes not been the case because of poor ratings in the past.

The most recent test appeared in January 2011. The 26 solutions examined were divided into three classes according to their performance: "partially useful", "of limited use" and "useless". Even the best systems find no more than 60 to 70 percent of the plagiarized portions. A number of other systems are hardly usable. In the useless category we even identified some frauds that only want to collect documents.

Naturally, the systems can only compare the text under review with content that is available in digital form. They work over the Internet: you upload your document to the system through a web form or send it by e-mail, and after a more or less long delay you get a result that points to similar or identical text on the web. Some detection systems try to check whether these passages are marked as quotations in the document being analyzed, but that rarely works in practice.

Swarm intelligence in the "crowd"

A frequently voiced criticism of the test method was the short length of the documents examined, 389 to 1055 words. In February 2011 we therefore decided to test the thesis of Karl-Theodor zu Guttenberg. Here we had the opportunity to examine, with the five "partially useful" systems, a 475-page book whose plagiarized parts had already been dissected in detail through "crowdsourcing", the collaboration of many people over the Internet. The aim was to see whether the "crowd" beats the software operating in the cloud, or vice versa.

After the Bremen law professor Andreas Fischer-Lescano voiced a suspicion of plagiarism regarding Karl-Theodor zu Guttenberg's thesis in February 2011, and the latter called the accusation "absurd", a small group of interested people, including doctoral students and postdocs, came together in a wiki on Wikia. Wikia is an advertising-funded, open collaboration platform that offers free wikis with a very simple WYSIWYG editor. Because plagiarists rarely plagiarize in just one place, the GuttenPlag Wiki was founded, and we began to dissect the work.

Over time the core group, consisting of only about 20 people, developed a useful way of processing and presenting the findings. Each finding is named with the word "Fragment" and the page number of the plagiarism, followed by the corresponding lines in from-to notation.
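The naming scheme described above (the word "Fragment", a page number and a from-to line range) lends itself to automatic processing. The following is a minimal sketch of how such names could be parsed and grouped by page. The exact spelling `Fragment 039 01-15`, the function names and the sample names are assumptions made for illustration; the text does not give the wiki's actual format.

```python
import re
from collections import defaultdict

# Assumed naming pattern: "Fragment <page> <first line>-<last line>".
# The exact spelling used in the wiki is not given in the text.
FRAGMENT_RE = re.compile(r"^Fragment[ _](\d+)[ _](\d+)-(\d+)$")


def parse_fragment(name):
    """Return (page, first_line, last_line) for one fragment name."""
    match = FRAGMENT_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a fragment name: {name!r}")
    page, first, last = (int(part) for part in match.groups())
    if first > last:
        raise ValueError(f"line range runs backwards in {name!r}")
    return page, first, last


def lines_by_page(names):
    """Collect the set of flagged line numbers for every page."""
    flagged = defaultdict(set)
    for name in names:
        page, first, last = parse_fragment(name)
        flagged[page].update(range(first, last + 1))
    return dict(flagged)


# Made-up fragment names, for illustration only.
example = ["Fragment 039 01-15", "Fragment 039 12-20", "Fragment 041 03-07"]
pages = lines_by_page(example)
```

Storing line numbers as sets means that overlapping fragments on the same page are not counted twice, which matters as soon as findings are aggregated per page.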
What looks like a barcode shows the plagiarism findings page by page. The symbol of the plagiarism hunters, the "barcode", indicates which pages contain plagiarism. Using a visualization language, the barcode was generated semi-automatically from the list of fragments. The core group had decided against fully automatic generation because it feared that false plagiarism reports in the wiki would discredit the project. Before a barcode appeared, there was always a plausibility check. The sharpness of the conflict was obvious even to bystanders: the cases of text vandalism, verbal abuse and anonymous threats have become well known.

The "cloud" as an alternative

That Guttenberg's work consists largely of plagiarism is now undisputed; the details of the GuttenPlag evaluation follow later. Against the background of these events, the idea of having all academic work submitted digitally and checked by software to prevent plagiarism seems obvious.

So we submitted the Guttenberg dissertation to the five best plagiarism detection systems. All the participating vendors (PlagAware, Turnitin/iThenticate, Ephorus, PlagScan and Urkund) work "in the cloud": you do not know on which server in which country the highly charged work will end up, nor who else has access to it. For some of the systems we tested in 2010, the terms of use even state that you grant the company all rights to the texts, as at Turnitin: "With regard to papers submitted to the Site, You hereby grant iParadigms a non-exclusive, royalty-free, perpetual, world-wide, irrevocable license to reproduce, transmit, display, disclose, archive and otherwise use in connection with the Services any paper You submit to the site."

But back to KTzG's work. In its digital version it is a professionally typeset PDF that uses ligatures and achieves justified text with narrow (fractional em) spaces. It is 7.3 MB in size and comprises, including appendices and so on, 475 pages with about 190,000 words; the exact number depends on how strings such as "§ 6, paragraph 1" are counted.

Because of the size of the file, there were some problems when uploading. PlagAware broke off on page 159 because too many matches had already been found. For iThenticate the work was split into 13 parts of 15,000 words, and each part was tested individually. With Urkund, the whole system at first seemed to have crashed. Correspondence with technical support revealed that, mainly because of the matches with the GuttenPlag Wiki, the internal limit of 1000 hits had been reached and the report was therefore not displayed. The technicians were able to fix this in a weekend shift and made a report possible after all. PlagScan had no problem with the file size, but we had to wait overnight for a result.
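The split for iThenticate is simple arithmetic: about 190,000 words at 15,000 words per part gives 13 parts. The sketch below illustrates this under the assumption that the text is cut into consecutive word blocks; how the split was actually done is not stated, and the function name and placeholder tokens are purely illustrative.

```python
import math


def split_into_parts(words, part_size):
    """Cut a list of words into consecutive parts of at most part_size words."""
    return [words[i:i + part_size] for i in range(0, len(words), part_size)]


# Figures from the text: roughly 190,000 words, 15,000 words per part.
total_words = 190_000
part_size = 15_000
expected_parts = math.ceil(total_words / part_size)

# Placeholder tokens stand in for the real text (assumption for illustration).
parts = split_into_parts(["w"] * total_words, part_size)
```

With exactly 190,000 words the last part would be shorter than the others; since the word count itself depends on the counting method, the real part sizes may have differed.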
The crowd result: the 20 sources used most frequently according to GuttenPlag (Fig. 2).

The GuttenPlag Wiki results were used to assess the results of the systems. This raised two questions that are typical of information retrieval:

1. How many of the most heavily plagiarized sources (see Fig. 2) were found (recall)?
2. How many of the reported findings are actually sources (precision)?

In the beginning GuttenPlag listed as many as 151 sources, but during the tests some texts were removed from the list because they had been found in other works. Some of these works were "plagiarism cousins", that is, they had drawn on the same sources as Guttenberg without attribution; others had quoted correctly. At least we found one source that had not been recorded in the GuttenPlag Wiki (Pernice) and added it there.

Before going on, two terms commonly used in the plagiarism discussion need to be explained.

A pawn sacrifice is, according to Lahusen, the correct citation and referencing of a small portion of a passage, while the literal takeover continues afterwards without quotation marks. This creates the impression that the rest of the text is the author's own. An aggravated pawn sacrifice is listing the source in the bibliography without marking the exact passage.

Structural plagiarism is the adoption of a line of thought, a chain of argument or a sequence of footnotes from another work. Popular opinion does not regard this as plagiarism ("But the source was named" or "But it is worded quite differently"), but academia certainly does.

Coverage of the top sources

Surprisingly, not all of the frequently used originals are searchable by the systems. Seven of the top 20 sources come from the Research Service of the German Bundestag and are unpublished; four are books that can be found on Google Books. One source, Vile 1991, had been translated, so software cannot find it. Altogether, eight of the top 20 sources could in principle be found online; seven of them were found by at least three of the five systems (Table 1).
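The two measures named earlier, recall (yield) and precision (accuracy), and the "found by at least three systems" count can be made concrete with a few set operations. This is a minimal sketch with invented placeholder labels; none of the data reproduces Table 1 or any real finding.

```python
def recall(found, relevant):
    """Share of the relevant sources that a system actually reported."""
    return len(found & relevant) / len(relevant) if relevant else 0.0


def precision(found, relevant):
    """Share of a system's reported sources that are truly relevant."""
    return len(found & relevant) / len(found) if found else 0.0


def found_by_at_least(reports, k):
    """Sources reported by at least k of the systems in `reports`."""
    counts = {}
    for found in reports.values():
        for source in found:
            counts[source] = counts.get(source, 0) + 1
    return {source for source, n in counts.items() if n >= k}


# Hypothetical data: the labels are placeholders, not real findings.
relevant = {"S1", "S2", "S3", "S4", "S5"}           # sources documented by the crowd
reported = {"S1", "S3", "X1", "X2", "X3", "X4"}     # sources reported by one system

r = recall(reported, relevant)       # 2 of the 5 relevant sources were found
p = precision(reported, relevant)    # 2 of the 6 reported sources were relevant

# Five hypothetical systems and what each one reported.
reports = {
    "A": {"S1", "S2"},
    "B": {"S1"},
    "C": {"S1", "S2"},
    "D": {"S2", "S3"},
    "E": set(),
}
common = found_by_at_least(reports, 3)
```

A system that reports very many sources can reach a high recall while its precision stays low, because the relevant sources are buried among irrelevant ones.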
The cloud result: 13 of the top 20 sources were not found, but all sources available online were found by at least three systems (Table 1).

All systems except Ephorus named the main source, Volkmann-Schluck, and ranked it first. Only Schmitz 2001 (fourth among the searchable sources), the FAZ article by Zehnpfennig (no. 9) and Nettesheim 2002 (no. 10) were found by all five systems. However, Zehnpfennig often appeared far down the list; PlagScan, for example, reported this source only in 38th place. In some cases the systems reported only a small overlap, but we scored only yes or no: reported or not. These figures should therefore be regarded as upper bounds on what such a system would find.

The systems in detail

The GuttenPlag group found that around 94% of the pages and 63% of the lines contained plagiarism. The systems reported the percentages shown in the table "Percentage plagiarism rating", but nowhere is it explained what these percentages mean: are they a percentage of a sample or a percentage of the whole work? In addition, the percentages and the sources reported varied considerably from day to day, although we had not requested any new reports in the meantime.

Besides the evaluation itself, the presentation of the results and the usability of the software are important. There is much room for improvement here. In none of the reports, for example, could you click on a reported passage to see its page number. That would be necessary if a plagiarism case were to be heard by a doctoral committee. What is needed is a highly readable, clearly understandable, color-coded comparison of the plagiarized text and the original, like the hand-made example in Figure 3. To document a case of plagiarism for an examination board, the source would have to be recorded with full details.
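The question of what a reported percentage refers to is not academic: the same findings yield very different numbers depending on whether pages or lines are counted, as the GuttenPlag figures of 94% of pages versus 63% of lines show. The following sketch uses an invented four-page document to illustrate the difference; the function name and all numbers are assumptions made for illustration.

```python
def plagiarism_shares(flagged_lines, lines_per_page):
    """Return (share of pages with any flagged line, share of flagged lines)."""
    total_pages = len(lines_per_page)
    total_lines = sum(lines_per_page.values())
    pages_hit = sum(1 for page in lines_per_page if flagged_lines.get(page))
    lines_hit = sum(len(flagged_lines.get(page, ())) for page in lines_per_page)
    return pages_hit / total_pages, lines_hit / total_lines


# Toy document: 4 pages of 10 lines each (invented numbers).
lines_per_page = {1: 10, 2: 10, 3: 10, 4: 10}
# Flagged line numbers per page; page 4 has no findings.
flagged = {1: {1, 2}, 2: {5}, 3: set(range(1, 11))}

page_share, line_share = plagiarism_shares(flagged, lines_per_page)
```

In this toy case three of four pages are affected, but fewer than a third of the lines, so a report that states only one percentage without naming its basis is hard to interpret.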
Now to the results in detail.

PlagAware: The side-by-side view, which we had liked so far with short documents, proved extremely problematic with this long work. Often only three or four words were marked, followed by "[...]" and another three or four words. First you had to scroll a long way to get past the preface and the table of contents. Faced with such a result, a law professor would probably have judged the system to be broken and aborted the test. In addition, it is not clear which part of the work the passages were taken from, so we had to search the PDF manually. Several reported sources belonged to pastebin, a website for easy text publication, or to the GuttenPlag Wiki. Such sources can, however, be excluded from the comparison. At one point the report disappeared from the database after a software update, which hopefully does not happen to paying customers. Like all candidates, we had a free test account.

iThenticate irritated us because over 40% of the links to the sources returned an HTTP 404 error ("file not found"), although the sources were available, as a manual check showed. That check was laborious because the text of the report could not simply be copied: we had to retype a phrase of three to five words and search for it again with Google. The same was necessary for many large PDF sources. In other words, iThenticate reported plagiarism, and we had to investigate further to find the copied passages. Now and then these passages had in fact been quoted correctly, which had escaped the system. It is to be feared that, given the flood of 404 errors, not only a law professor with little IT experience would have aborted the test. In the end, in six of the 13 parts into which the work had been split for iThenticate, the top plagiarism source was just such a 404.

Ephorus offers a pleasant side-by-side comparison, but includes a complete copy of the entire dissertation for every source. We assume that the test broke off after about 10 sources; the report had already reached 54 MB. It is simply unnecessary to repeat the parts that are not plagiarized. The copied part and the page numbers are what is interesting.

PlagScan did not report much, but what it reported was actually plagiarism, provided you could find the passages. For each passage there is a fold-out list of sources. You have to open each source individually and then search in it to find the passage. The overall presentation uses floating windows, but there is no good side-by-side comparison. PlagAware points to one possible source and overlooks a large one.

Urkund proved extremely slow. According to the company, this is because the system cannot cope with the number of hits and alternative hits (mostly current media reporting on the case and quoting a plagiarized passage). There were many navigation problems, and the reported figures are not comprehensible. In the meantime, this report can no longer be loaded.

Of the 131 sources verified by GuttenPlag at the time of testing, PlagAware found 7 (5%), iThenticate 30 (23%), Ephorus 6 (5%), PlagScan 19 (15%) and Urkund 16 (12%); all together they found 38.

Based on GuttenPlag's top 20 sources that are available online, the recall looks like this: iThenticate found 16 of them (80%), Urkund 13 (65%), PlagScan 12 (60%), and PlagAware and Ephorus 6 each (30%).

Even though iThenticate found 80% of the most important sources, they were lost in the sheer volume of 1156 reported sources, a major precision problem. Of those, only slightly more than 400 were longer than 20 words at all. We examined only the 117 sources with 100 or more words. Just over 40% of these, however, could no longer be found on the Internet at the given URL, including six of the 13 "top sources" that were reported for each part.

Among these top sources there was one absolutely correct quotation (which also ranked first) and a match between the bibliography of another source and Guttenberg's bibliography. Still, three of them were Volkmann-Schluck and two were the second most frequently used source.

Conclusion

Could the law professors at the University of Bayreuth have discovered the plagiarism in Guttenberg's dissertation using software? At least the first and second readers would have received an indication of some possible unattributed sources and could have investigated further matches.

But all the systems examined suffer from deficiencies in the presentation of results and from usability problems. Moreover, the reported values for the "amount" of plagiarism are misleading. Even though Ephorus reports only 5% (and many systems report "green, no plagiarism" for values below 10%), the passages found are serious plagiarism.

On the other hand, iThenticate reports 56% plagiarism for part 5 of the dissertation, but eight of the 13 sources with at least 100 words cannot be found on the Internet at the URL listed. In defense of the software it must be said that plagiarism is difficult to quantify. The GuttenPlag Wiki, for instance, reports 94% of pages with plagiarism, which has occasionally been reported in the press as a plagiarism share of 94%. That is not correct; 63% of the lines would be a more accurate indication of the amount of plagiarism.

Our recommendation: If teachers have a suspicion and cannot find anything with Google, they should use one of the software systems, preferably two or three different ones. But they need assistance in interpreting the results, for example from trained staff at the university libraries. Sending all work through a plagiarism detection system as a matter of course and then alerting the teachers above a simple "threshold" results in far too many false alarms. The proposal of the service provider Turnitin that students use the systems themselves should be treated with caution. It could in fact give some of them the idea of using synonymization systems, which systematically replace the flagged words in a text with words of similar or identical meaning.

Current plagiarism detection systems work best with shorter texts such as term papers. For complex texts with many quotations and footnotes they are unsuitable. Here a sufficiently large crowd clearly wins at plagiarism detection, particularly because it is much more precise and can also use offline sources.